CI/CD Pipeline Standards
Overview
The IWM platform CI/CD pipeline ensures code quality, security, and reliable deployments for a financial system handling MLM commissions, payments, and sensitive user data.
Pipeline Goals
| Goal | Implementation |
|---|---|
| Correctness | Comprehensive test suites for financial calculations |
| Security | Multi-layer scanning (dependencies, secrets, containers) |
| Reliability | Zero-downtime deployments with rollback capability |
| Speed | Parallel jobs, Docker layer caching, smart change detection |
| Auditability | Version tracking, deployment logs, change documentation |
Pipeline Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ PR VALIDATION │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Build │ │ Lint │ │ Type │ │ Secret │ │
│ │ Check │ │ Format │ │ Check │ │ Scan │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │ │
│ └─────────────────┴────────┬────────┴─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ TEST SUITE │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Unit │ │Integration│ │ E2E │ │ Contract │ │ │
│ │ │ (80%+) │ │ (DB/API) │ │ (Critical)│ │ (API) │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ SECURITY GATES │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ npm audit │ │ gitleaks │ │ SAST │ │ Trivy │ │ │
│ │ │ (high+) │ │ (secrets) │ │ (CodeQL) │ │ (container│ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ DATABASE VALIDATION │ │
│ │ ┌─────────────────────┐ ┌─────────────────────────────────┐ │ │
│ │ │ Migration dry-run │ │ Schema drift detection │ │ │
│ │ └─────────────────────┘ └─────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
│ Merge to main
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ RELEASE PIPELINE │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Version │ │ Build │ │ Push │ │ Generate │ │
│ │ Bump │──▶│ Images │──▶│ Registry │──▶│ Changelog │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ DEPLOY TO STAGING │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Run │ │ Deploy │ │ Health │ │ Smoke │ │ │
│ │ │Migrations │──▶│ (B/G) │──▶│ Check │──▶│ Tests │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ Manual Approval │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ DEPLOY TO PRODUCTION │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Run │ │Blue-Green │ │ Health │ │ Smoke │ │ │
│ │ │Migrations │──▶│ Deploy │──▶│ Check │──▶│ + Notify │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
PR Validation Workflow
Every pull request must pass all gates before merge is allowed.
Stage 1: Build & Static Analysis
| Check | Tool | Failure Threshold |
|---|---|---|
| TypeScript compilation | tsc --noEmit | Any error |
| ESLint | eslint | Any error (warnings allowed) |
| Prettier | prettier --check | Any formatting issue |
| Stylelint (frontend) | stylelint | Any error |
Stage 2: Test Suite
| Test Type | Scope | Coverage Requirement |
|---|---|---|
| Unit Tests | Domain logic, utilities, pure functions | 80% minimum |
| Integration Tests | Database operations, API endpoints, Redis | 70% minimum |
| E2E Tests | Critical user flows (see below) | Must pass |
| Contract Tests | API schema validation (see below) | Must pass |
Contract Tests Specification
Contract tests validate that the API implementation matches its specification and that changes don't break consumers.
What we validate:
| Aspect | Tool | Description |
|---|---|---|
| OpenAPI compliance | @apidevtools/swagger-cli validate | Schema is valid OpenAPI 3.1 |
| Response shape | Custom Jest matchers | Responses match declared schemas |
| Breaking changes | openapi-diff | Detect removed endpoints, changed types |
| Request validation | Class-validator + Zod | Input DTOs match OpenAPI parameters |
Contract test implementation:
// test/contract/api-contract.spec.ts
// Note: validateRequest/validateResponse below stand in for the spec-validation
// helper the project wires up around its OpenAPI document.
import request from 'supertest';
import { app } from '../../src/app'; // application under test
import { OpenAPIValidator } from 'express-openapi-validator';
import spec from '../openapi.json';
describe('API Contract Tests', () => {
const validator = new OpenAPIValidator({ spec });
describe('POST /api/v1/auth/register', () => {
// Shared fixture so both tests exercise the same payload
const validRequest = {
email: 'test@example.com',
password: 'SecurePass123',
referralCode: 'ABC123'
};
it('should match request schema', async () => {
expect(() => validator.validateRequest({
path: '/api/v1/auth/register',
method: 'post',
body: validRequest
})).not.toThrow();
});
it('should match response schema', async () => {
const response = await request(app)
.post('/api/v1/auth/register')
.send(validRequest);
expect(() => validator.validateResponse({
path: '/api/v1/auth/register',
method: 'post',
statusCode: response.status,
body: response.body
})).not.toThrow();
});
});
});
Breaking change detection in CI:
- name: Check for Breaking API Changes
run: |
# Compare current spec with main branch
git show origin/main:openapi.json > openapi-main.json
npx openapi-diff openapi-main.json openapi.json --fail-on-incompatible
# Incompatible changes (will fail):
# - Removing endpoints
# - Removing required response fields
# - Adding required request fields
# - Changing field types
# Compatible changes (allowed):
# - Adding new endpoints
# - Adding optional request fields
# - Adding response fields
Critical E2E Flows (must always be tested):
1. Authentication Flow
- Registration with referral code
- Login with 2FA
- Password reset
- Session management
2. Order & Payment Flow
- Add to cart → Checkout → Payment → Confirmation
- Order status transitions (full state machine)
- Payment webhook processing
3. Commission Calculation Flow
- Order completion → Commission generation
- Multi-level distribution (up to 10 levels)
- Commission approval → Payout
4. MLM Tree Operations
- Partner registration under sponsor
- Tree traversal queries
- Rank qualification check
Stage 3: Security Gates
| Scan | Tool | Action on Failure |
|---|---|---|
| Dependency vulnerabilities | npm audit --audit-level=high | Block merge |
| Secret detection | gitleaks | Block merge |
| Static Application Security Testing | CodeQL / SonarQube | Block on high/critical |
| Container vulnerabilities | Trivy | Block on critical |
Stage 4: Database Validation
# Migration dry-run against test database
- name: Validate Migrations
run: |
# Create temporary database
createdb iwm_migration_test
# Run all migrations
npx prisma migrate deploy
# Validate schema matches Prisma schema
npx prisma db pull --force
npx prisma validate
# Cleanup
dropdb iwm_migration_test
Release Workflow
Triggered on merge to main branch.
Version Management
Semantic Versioning: MAJOR.MINOR.PATCH
Auto-bump rules:
- Merge to main → PATCH increment (1.2.3 → 1.2.4)
- Manual MINOR change → Skip auto-bump, use manual version
- Manual MAJOR change → Skip auto-bump, use manual version
Version stored in: VERSION file (root)
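A minimal sketch of the PATCH auto-bump step, assuming a plain-text VERSION file containing a MAJOR.MINOR.PATCH string (the script name and path are illustrative):
// scripts/bump-patch.ts (hypothetical path): bump PATCH on merge to main
import { readFileSync, writeFileSync } from 'node:fs';

const current = readFileSync('VERSION', 'utf8').trim();       // e.g. "1.2.3"
const [major, minor, patch] = current.split('.').map(Number);
const next = `${major}.${minor}.${patch + 1}`;                 // -> "1.2.4"

writeFileSync('VERSION', `${next}\n`);
console.log(`Version bumped: ${current} -> ${next}`);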
Build Stage
| Step | Description |
|---|---|
| Read VERSION | Get current or bumped version |
| Build Docker images | Backend + Frontend separately |
| Tag images | v{VERSION} + latest |
| Push to registry | GitHub Container Registry (ghcr.io) |
| Cache layers | type=gha,mode=max for faster rebuilds |
Docker Layer Caching
# Each Dockerfile instruction creates a layer with SHA256 hash
# Unchanged layers are reused from cache
FROM node:20-alpine # Layer sha256:a1b2... (cached if unchanged)
COPY package*.json ./ # Layer sha256:c3d4... (cached if package.json same)
RUN npm ci # Layer sha256:e5f6... (cached if dependencies same)
COPY . . # Layer sha256:f7a8... (changes on code change)
RUN npm run build # Layer sha256:b9c0... (rebuilds if code changed)
Cache configuration:
cache-from: type=gha # Pull from GitHub Actions cache
cache-to: type=gha,mode=max # Push ALL layers (not just final)
Security Gates Detail
Dependency Scanning
# Run on every PR and weekly scheduled scan
- name: Dependency Audit
run: npm audit --audit-level=high
- name: Snyk Scan
uses: snyk/actions/node@master
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
Secret Scanning
- name: Gitleaks Scan
uses: gitleaks/gitleaks-action@v2
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Protected patterns:
| Pattern | Example |
|---|---|
| API keys | sk_live_*, pk_live_* |
| Database URLs | postgresql://*:*@* |
| JWT secrets | JWT_SECRET=* |
| Encryption keys | ENCRYPTION_MASTER_KEY=* |
| Webhook secrets | *_WEBHOOK_SECRET=* |
Container Scanning
- name: Trivy Scan
uses: aquasecurity/trivy-action@master
with:
image-ref: ${{ env.IMAGE }}
severity: 'CRITICAL,HIGH'
exit-code: '1' # Fail on critical/high
Database Migration Strategy
Migration Validation (PR)
1. Dry-run against empty database
2. Dry-run against production clone (staging)
3. Check for destructive operations (DROP, TRUNCATE); see the sketch after this list
4. Estimate migration duration
5. Flag migrations requiring maintenance window
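A sketch of the destructive-operation check from step 3, assuming Prisma's one-directory-per-migration layout with a migration.sql file in each directory (the script path is illustrative):
// scripts/check-destructive-migrations.ts (hypothetical path)
import { existsSync, readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

const DESTRUCTIVE = /\b(DROP\s+TABLE|DROP\s+COLUMN|TRUNCATE)\b/i;
const MIGRATIONS_DIR = 'prisma/migrations';
const offenders: string[] = [];

for (const entry of readdirSync(MIGRATIONS_DIR, { withFileTypes: true })) {
  if (!entry.isDirectory()) continue;
  const file = join(MIGRATIONS_DIR, entry.name, 'migration.sql');
  if (!existsSync(file)) continue;
  if (DESTRUCTIVE.test(readFileSync(file, 'utf8'))) offenders.push(file);
}

if (offenders.length > 0) {
  console.error('Destructive operations detected, manual approval required:');
  offenders.forEach((f) => console.error(`  ${f}`));
  process.exit(1); // non-zero exit blocks the PR check
}
In practice the check would target only migrations added in the PR; scanning the whole directory is shown for brevity.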
Migration Execution (Deploy)
┌─────────────────────────────────────────────────────────────────┐
│ MIGRATION EXECUTION │
│ │
│ 1. Create backup snapshot │
│ └─▶ pg_dump iwm_production > backup_$(date).sql │
│ │
│ 2. Run migrations │
│ └─▶ npx prisma migrate deploy │
│ │
│ 3. Validate schema │
│ └─▶ npx prisma validate │
│ │
│ 4. Health check │
│ └─▶ curl /health/ready │
│ │
│ 5. On failure: Restore from backup │
│ └─▶ psql < backup_$(date).sql │
└─────────────────────────────────────────────────────────────────┘
Important: Prisma migrate deploy runs each migration file in its own transaction automatically. Do NOT wrap it in BEGIN/COMMIT manually.
DDL Transaction Limitations
Some PostgreSQL DDL operations cannot run inside a transaction. These require special handling:
| Operation | Transaction Support | Handling |
|---|---|---|
| CREATE INDEX CONCURRENTLY | No | Separate migration file, run manually |
| ALTER TYPE (enum add value) | No (PG < 12) | Separate migration, requires downtime on old PG |
| DROP INDEX CONCURRENTLY | No | Separate migration file |
| REINDEX CONCURRENTLY | No | Run during maintenance window |
For non-transactional migrations:
-- prisma/migrations/20240115_add_index_concurrently_non_transactional.sql
-- @non-transactional
CREATE INDEX CONCURRENTLY idx_orders_created
ON product.orders(created_at);
# Pipeline handles non-transactional migrations separately
- name: Run Non-Transactional Migrations
run: |
for file in prisma/migrations/*_non_transactional.sql; do
psql $DATABASE_URL -f "$file" || exit 1
done
Destructive Migration Policy
| Operation | Requirement |
|---|---|
| DROP TABLE | Requires manual approval + backup verification |
| DROP COLUMN | Must be preceded by code removal in previous release |
| TRUNCATE | Prohibited in production migrations |
| ALTER TYPE | Requires maintenance window |
Backup Verification Testing
Backups are worthless if they can't be restored. Regular verification ensures recovery is actually possible.
Verification schedule:
| Environment | Frequency | Method |
|---|---|---|
| Production | Weekly | Full restore to isolated instance |
| Staging | Monthly | Full restore test |
| After migration | Immediate | Spot check critical tables |
Automated backup verification job:
# .github/workflows/backup-verification.yml
name: Backup Verification
on:
schedule:
- cron: '0 4 * * 0' # Weekly Sunday 4AM
workflow_dispatch:
jobs:
verify-backup:
runs-on: ubuntu-latest
steps:
- name: Create Fresh Backup
run: |
pg_dump $PRODUCTION_URL \
--format=custom \
--file=backup-$(date +%Y%m%d).dump
- name: Spin Up Isolated Instance
run: |
docker run -d \
--name pg-verify \
-e POSTGRES_PASSWORD=verify_test \
-p 5433:5432 \
postgres:15
sleep 10 # Wait for startup
- name: Restore Backup
env:
PGPASSWORD: verify_test
run: |
pg_restore \
--host=localhost \
--port=5433 \
--username=postgres \
--dbname=postgres \
--clean \
--if-exists \
backup-$(date +%Y%m%d).dump
- name: Verify Data Integrity
run: |
PGPASSWORD=verify_test psql \
-h localhost -p 5433 -U postgres -d postgres \
-f scripts/verify-backup-integrity.sql
- name: Verify Row Counts
run: |
# Compare row counts with production
node scripts/compare-backup-counts.js
- name: Cleanup
if: always()
run: docker rm -f pg-verify
- name: Report Results
if: always()
run: |
if [ "${{ job.status }}" == "success" ]; then
curl -X POST $SLACK_WEBHOOK \
-d '{"text": ":white_check_mark: Backup verification passed"}'
else
curl -X POST $SLACK_WEBHOOK \
-d '{"text": ":x: Backup verification FAILED - investigate immediately"}'
fi
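The Verify Row Counts step above calls scripts/compare-backup-counts.js; a TypeScript sketch of what that comparison might do, assuming the pg client library and a 1% drift tolerance (both assumptions):
// Sketch of scripts/compare-backup-counts.js
import { Client } from 'pg';

const TABLES = ['core.users', 'mlm.partners', 'mlm.commission_transactions', 'product.orders'];
const TOLERANCE = 0.01; // rows written since the backup was taken; 1% drift allowed

async function rowCount(connectionString: string, table: string): Promise<number> {
  const client = new Client({ connectionString });
  await client.connect();
  const { rows } = await client.query(`SELECT COUNT(*)::bigint AS n FROM ${table}`);
  await client.end();
  return Number(rows[0].n);
}

(async () => {
  const restoredUrl = 'postgresql://postgres:verify_test@localhost:5433/postgres';
  for (const table of TABLES) {
    const prod = await rowCount(process.env.PRODUCTION_URL!, table);
    const restored = await rowCount(restoredUrl, table);
    const drift = prod === 0 ? 0 : Math.abs(prod - restored) / prod;
    console.log(`${table}: production=${prod} restored=${restored}`);
    if (drift > TOLERANCE) {
      console.error(`Row count drift ${(drift * 100).toFixed(2)}% exceeds tolerance for ${table}`);
      process.exit(1);
    }
  }
})();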
Integrity verification script:
-- scripts/verify-backup-integrity.sql
-- Check critical tables exist and have data
DO $$
DECLARE
v_count INT;
BEGIN
-- Users table
SELECT COUNT(*) INTO v_count FROM core.users;
IF v_count = 0 THEN
RAISE EXCEPTION 'CRITICAL: users table is empty';
END IF;
RAISE NOTICE 'users: % rows', v_count;
-- Partners table
SELECT COUNT(*) INTO v_count FROM mlm.partners;
RAISE NOTICE 'partners: % rows', v_count;
-- Commission transactions
SELECT COUNT(*) INTO v_count FROM mlm.commission_transactions;
RAISE NOTICE 'commission_transactions: % rows', v_count;
-- Orders table
SELECT COUNT(*) INTO v_count FROM product.orders;
RAISE NOTICE 'orders: % rows', v_count;
-- Verify foreign key relationships intact
SELECT COUNT(*) INTO v_count
FROM mlm.partners p
LEFT JOIN core.users u ON p.user_id = u.id
WHERE u.id IS NULL;
IF v_count > 0 THEN
RAISE EXCEPTION 'CRITICAL: % orphaned partner records', v_count;
END IF;
-- Verify tree integrity
SELECT COUNT(*) INTO v_count
FROM mlm.partner_tree_paths ptp
LEFT JOIN mlm.partners p ON ptp.ancestor_id = p.id
WHERE p.id IS NULL;
IF v_count > 0 THEN
RAISE EXCEPTION 'CRITICAL: % orphaned tree path records', v_count;
END IF;
RAISE NOTICE 'All integrity checks passed';
END $$;
Recovery time tracking:
| Metric | Target | Alert If |
|---|---|---|
| Backup creation time | < 30 min | > 1 hour |
| Restore time | < 1 hour | > 2 hours |
| Verification time | < 15 min | > 30 min |
| Total RTO | < 2 hours | > 4 hours |
Environment Strategy
Environment Matrix
| Environment | Purpose | Deploy Trigger | Data |
|---|---|---|---|
| Development | Local development | Manual | Seed data |
| CI | Pipeline testing | Every PR | Fresh per run |
| Staging | Pre-production validation | Merge to main | Production clone (anonymized) |
| Production | Live system | Manual approval | Real data |
Staging Data Anonymization
Production data cloned to staging must be anonymized to protect user privacy and comply with GDPR/data protection requirements.
Anonymization rules by data type:
| Data Category | Field | Anonymization Method |
|---|---|---|
| PII | email | faker.email() with original domain preserved |
| PII | phone | +7900${random7digits} |
| PII | first_name, last_name | faker.name() |
| PII | address fields | faker.address() |
| Financial | bank_account | ****${last4} (masked) |
| Financial | card_number | Completely removed |
| Financial | balance amounts | Preserved (not PII) |
| KYC | passport_number | XX${random8digits} |
| KYC | tax_id | ${random12digits} |
| KYC | document_urls | Replaced with placeholder images |
| Auth | password_hash | Set to known test password hash |
| Auth | session tokens | Deleted |
| Auth | 2fa_secrets | Deleted |
| Audit | ip_address | 192.168.x.x (private range) |
| Audit | user_agent | Preserved (not PII) |
Anonymization script:
-- scripts/anonymize-staging.sql
-- Run after cloning production to staging
BEGIN;
-- Users table
UPDATE core.users SET
email = 'user_' || id::text || '@staging.iwm.local',
phone = '+7900' || LPAD(FLOOR(RANDOM() * 10000000)::TEXT, 7, '0'),
password_hash = '$2b$10$staging.password.hash.for.testing'; -- Password: "staging123"
-- User profiles
UPDATE core.user_profiles SET
first_name = 'Test',
last_name = 'User_' || SUBSTRING(user_id::text, 1, 8),
middle_name = NULL;
-- Addresses
UPDATE product.addresses SET
first_name = 'Test',
last_name = 'User',
address_line1 = FLOOR(RANDOM() * 100)::TEXT || ' Test Street',
address_line2 = 'Apt ' || FLOOR(RANDOM() * 100)::TEXT,
city = 'Test City',
phone = '+7900' || LPAD(FLOOR(RANDOM() * 10000000)::TEXT, 7, '0');
-- KYC documents
UPDATE core.kyc_verifications SET
document_number = 'XX' || LPAD(FLOOR(RANDOM() * 100000000)::TEXT, 8, '0');
UPDATE core.kyc_documents SET
file_url = 'https://staging-assets.iwm.local/placeholder-document.pdf',
file_name = 'anonymized_document.pdf';
-- Payout details (sensitive bank info)
UPDATE mlm.payout_requests SET
payout_details = jsonb_set(
payout_details,
'{account_number}',
to_jsonb('****' || RIGHT(payout_details->>'account_number', 4))
)
WHERE payout_details ? 'account_number';
-- Delete sensitive auth data
DELETE FROM core.sessions;
DELETE FROM core.two_factor_auth;
-- Audit logs - anonymize IPs
UPDATE core.audit_log SET
ip_address = ('192.168.' || (RANDOM() * 255)::INT || '.' || (RANDOM() * 255)::INT)::INET;
-- Partner referral links - update domain
UPDATE mlm.referral_links SET
full_url = REPLACE(full_url, 'iwm.com', 'staging.iwm.local');
COMMIT;
-- Verify no production data leaked
DO $$
BEGIN
-- Check no real emails remain
IF EXISTS (SELECT 1 FROM core.users WHERE email NOT LIKE '%@staging.iwm.local') THEN
RAISE EXCEPTION 'Anonymization failed: real emails found';
END IF;
-- Check no real phone numbers remain
IF EXISTS (SELECT 1 FROM core.users WHERE phone NOT LIKE '+7900%') THEN
RAISE EXCEPTION 'Anonymization failed: real phone numbers found';
END IF;
END $$;
Automated staging refresh:
# .github/workflows/staging-refresh.yml
name: Refresh Staging Data
on:
schedule:
- cron: '0 3 * * 0' # Weekly Sunday 3AM
workflow_dispatch: # Manual trigger
jobs:
refresh-staging:
runs-on: ubuntu-latest
steps:
- name: Create Production Snapshot
run: |
pg_dump $PRODUCTION_URL \
--no-owner \
--no-privileges \
> production-snapshot.sql
- name: Restore to Staging
run: |
# Drop all application schemas (the production dump recreates core, mlm, product)
psql $STAGING_URL -c "DROP SCHEMA IF EXISTS public, core, mlm, product CASCADE; CREATE SCHEMA public;"
psql $STAGING_URL < production-snapshot.sql
- name: Run Anonymization
run: psql $STAGING_URL < scripts/anonymize-staging.sql
- name: Verify Anonymization
run: node scripts/verify-anonymization.js
- name: Notify Team
run: |
curl -X POST $SLACK_WEBHOOK \
-d '{"text": "Staging environment refreshed with anonymized production data"}'
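The Verify Anonymization step calls scripts/verify-anonymization.js; a sketch of the checks it might run, mirroring the SQL assertions above (the pg client usage is an assumption):
// Sketch of scripts/verify-anonymization.js
import { Client } from 'pg';

(async () => {
  const client = new Client({ connectionString: process.env.STAGING_URL });
  await client.connect();

  // Each entry: [failure description, query that must return 0 rows]
  const checks: Array<[string, string]> = [
    ['real emails remain', "SELECT COUNT(*) AS n FROM core.users WHERE email NOT LIKE '%@staging.iwm.local'"],
    ['real phone numbers remain', "SELECT COUNT(*) AS n FROM core.users WHERE phone NOT LIKE '+7900%'"],
    ['sessions were not deleted', 'SELECT COUNT(*) AS n FROM core.sessions'],
    ['2FA secrets were not deleted', 'SELECT COUNT(*) AS n FROM core.two_factor_auth'],
  ];

  let failed = false;
  for (const [label, sql] of checks) {
    const { rows } = await client.query(sql);
    if (Number(rows[0].n) > 0) {
      console.error(`Anonymization check failed: ${label} (${rows[0].n} rows)`);
      failed = true;
    }
  }

  await client.end();
  process.exit(failed ? 1 : 0);
})();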
Access control for staging:
| Role | Staging Access | Can See |
|---|---|---|
| Developer | Full access | Anonymized data only |
| QA | Full access | Anonymized data only |
| Support | No access | — |
| External contractor | No access | — |
Environment Parity
# All environments use identical:
- Docker images (same SHA)
- Database schema (same migrations)
- Environment variable structure (different values)
- Infrastructure configuration (scaled differently)
Environment Variables Validation
// Validated at application startup using Zod
// CI must validate all required variables are defined
import { z } from 'zod';

const envSchema = z.object({
NODE_ENV: z.enum(['development', 'staging', 'production']),
DATABASE_URL: z.string().url(),
REDIS_URL: z.string().url(),
JWT_SECRET: z.string().min(32),
ENCRYPTION_MASTER_KEY: z.string().min(32),
// ... all other required variables
});

// Fail fast at startup if anything is missing or malformed
export const env = envSchema.parse(process.env);
Deployment Strategy
Blue-Green Deployment
┌─────────────────────────────────────────────────────────────────┐
│ BLUE-GREEN DEPLOY │
│ │
│ Current State: │
│ ┌─────────────┐ │
│ │ BLUE │ ◀── Load Balancer ◀── Traffic │
│ │ (v1.2.3) │ │
│ └─────────────┘ │
│ ┌─────────────┐ │
│ │ GREEN │ (idle) │
│ │ (v1.2.3) │ │
│ └─────────────┘ │
│ │
│ Deploy v1.2.4: │
│ 1. Deploy to GREEN │
│ 2. Run health checks on GREEN │
│ 3. Run smoke tests on GREEN │
│ 4. Switch traffic to GREEN │
│ 5. Monitor for errors │
│ 6. If errors: Switch back to BLUE (rollback) │
│ 7. If stable: Update BLUE to v1.2.4 (sync) │
└─────────────────────────────────────────────────────────────────┘
Rollback Procedure
# Automatic rollback triggers:
- Health check failure (3 consecutive)
- Error rate > 5% (compared to baseline)
- Response time > 2x baseline
# Rollback steps:
1. Switch load balancer to previous version
2. Alert on-call engineer
3. Preserve logs for investigation
4. Do NOT run backward migrations automatically
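A sketch of how the automatic triggers above could be evaluated; the metric names and their source (Prometheus, APM, etc.) are assumptions:
// Rollback decision helper (illustrative)
interface DeployMetrics {
  consecutiveHealthCheckFailures: number;
  errorRate: number;           // e.g. 0.07 = 7%
  baselineErrorRate: number;   // same window, previous version
  p95ResponseTimeMs: number;
  baselineP95Ms: number;
}

// Returns the rollback reason, or null to keep the new version
function shouldRollback(m: DeployMetrics): string | null {
  if (m.consecutiveHealthCheckFailures >= 3) {
    return 'health check failed 3 times in a row';
  }
  if (m.errorRate > 0.05 && m.errorRate > m.baselineErrorRate) {
    return `error rate ${(m.errorRate * 100).toFixed(1)}% exceeds the 5% threshold`;
  }
  if (m.p95ResponseTimeMs > 2 * m.baselineP95Ms) {
    return 'p95 response time exceeds 2x baseline';
  }
  return null;
}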
Rollback Window
| Phase | Duration | BLUE Status | Action if Issues |
|---|---|---|---|
| Immediate | 0-15 min | Running, no traffic | Instant switch back |
| Short-term | 15 min - 2 hours | Running, warm standby | Quick rollback (< 1 min) |
| Medium-term | 2-24 hours | Stopped, image preserved | Restart BLUE, switch traffic |
| Long-term | > 24 hours | Terminated | Redeploy previous version |
Data divergence consideration:
- Rollback within 2 hours: Minimal data divergence, safe to rollback
- Rollback after 2 hours: Audit new data created, may need manual reconciliation
- Rollback after 24 hours: Requires data migration plan, not automatic
# Rollback window configuration
rollback:
instant_window: 15m # BLUE kept running
warm_standby: 2h # BLUE stopped but not terminated
image_retention: 24h # Previous image kept in registry
max_auto_rollback: 2h # After this, manual approval required
Database Connection Management During Deploy
During blue-green deployment, both versions may run simultaneously. This requires careful connection pool management.
┌─────────────────────────────────────────────────────────────────┐
│ CONNECTION POOL DURING DEPLOY │
│ │
│ Database Pool Limit: 100 connections │
│ │
│ Normal Operation: │
│ ┌─────────────┐ │
│ │ BLUE │ ─── 50 connections ───▶ ┌──────────────┐ │
│ │ (active) │ │ │ │
│ └─────────────┘ │ PostgreSQL │ │
│ │ │ │
│ During Deploy (both running): │ Pool: 100 │ │
│ ┌─────────────┐ │ │ │
│ │ BLUE │ ─── 40 connections ───▶ │ │ │
│ │ (draining) │ │ │ │
│ └─────────────┘ │ │ │
│ ┌─────────────┐ │ │ │
│ │ GREEN │ ─── 40 connections ───▶ │ │ │
│ │ (starting) │ └──────────────┘ │
│ └─────────────┘ │
│ │
│ Reserve: 20 connections for migrations & admin │
└─────────────────────────────────────────────────────────────────┘
Connection pool configuration:
// config/database.ts
const poolConfig = {
// Normal operation
default: {
min: 5,
max: 50,
},
// During deployment (detected via DEPLOY_MODE env)
deployment: {
min: 2,
max: 40, // Reduced to allow overlap
},
};
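Continuing the config above, a small sketch of how the pool limits could be selected at startup; the DEPLOY_MODE value ('true') is an assumption:
// config/database.ts (continued): pick limits based on the deployment flag
const isDeploying = process.env.DEPLOY_MODE === 'true';
export const activePool = isDeploying ? poolConfig.deployment : poolConfig.default;
// activePool.min / activePool.max are then passed to the connection pool
// (pg Pool options, or Prisma's connection_limit), leaving headroom for the
// overlapping BLUE/GREEN instances and the reserved migration connections.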
Deploy sequence to prevent connection exhaustion:
deploy_steps:
1. Set GREEN pool to deployment mode (max: 40)
2. Start GREEN instances
3. Wait for GREEN health check
4. Run migrations (uses reserved connections)
5. Gradually shift traffic (10% → 50% → 100%)
6. Set BLUE to drain mode (stop accepting new connections)
7. Wait for BLUE connections to close (max 30s)
8. Stop BLUE instances
9. Set GREEN pool to normal mode (max: 50)
PgBouncer recommended for production:
# pgbouncer.ini
[pgbouncer]
pool_mode = transaction
max_client_conn = 200
default_pool_size = 50
reserve_pool_size = 10
reserve_pool_timeout = 3
Health Checks
// /health/live - Is the process running?
// Returns 200 if process is alive
// /health/ready - Can it handle requests?
// Checks:
// - Database connection
// - Redis connection
// - Required services available
// /health/startup - Has it finished initialization?
// Used by Kubernetes to know when to send traffic
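A minimal sketch of the three probes using Express; the route paths match the contract above, while the prisma/redis dependency checks and module paths are assumptions:
// src/health.ts (illustrative)
import { Router } from 'express';
import { prisma } from './db';    // assumed Prisma client instance
import { redis } from './redis';  // assumed ioredis client instance

export function healthRouter(isStarted: () => boolean): Router {
  const router = Router();

  // Liveness: the process is up and the event loop responds
  router.get('/health/live', (_req, res) => res.status(200).json({ status: 'ok' }));

  // Readiness: dependencies are reachable, safe to receive traffic
  router.get('/health/ready', async (_req, res) => {
    try {
      await prisma.$queryRaw`SELECT 1`; // database reachable
      await redis.ping();               // Redis reachable
      res.status(200).json({ status: 'ready' });
    } catch {
      res.status(503).json({ status: 'not ready' });
    }
  });

  // Startup: initialization (migration check, cache warmup) has completed
  router.get('/health/startup', (_req, res) =>
    isStarted() ? res.status(200).json({ status: 'started' }) : res.status(503).json({ status: 'starting' }),
  );

  return router;
}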
Test Requirements
Coverage Thresholds
{
"coverageThreshold": {
"global": {
"branches": 75,
"functions": 80,
"lines": 80,
"statements": 80
},
"src/modules/mlm/domain/**": {
"branches": 90,
"functions": 95,
"lines": 95
},
"src/modules/payment/domain/**": {
"branches": 90,
"functions": 95,
"lines": 95
}
}
}
Platform-Specific Test Suites
| Suite | Focus | Trigger |
|---|---|---|
| Commission Tests | Multi-level calculations, edge cases, rounding | Every PR |
| Tree Operation Tests | INSERT, MOVE, depth limits, cycle prevention | Every PR |
| Payment Integration | Webhook handling, idempotency, refunds | Every PR |
| State Machine Tests | Order transitions, invalid state prevention | Every PR |
| Encryption Tests | Encrypt/decrypt cycle, key rotation | Every PR |
| Load Tests | Commission calculation under load | Weekly / Pre-release |
Load Test Specifications
Load tests validate system performance under realistic and stress conditions.
Target Metrics:
| Scenario | Target | Threshold (Fail if) |
|---|---|---|
| Commission calculation | 100 orders/second | < 50 orders/second |
| Concurrent payout requests | 50 requests/second | < 25 requests/second |
| Tree traversal (10 levels) | < 50ms p95 | > 200ms p95 |
| API response time (p95) | < 200ms | > 500ms |
| Database connections | Stable at pool max | Pool exhaustion |
| Memory usage | < 512MB per instance | > 1GB |
Load test scenarios:
// load-tests/k6/commission-load.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Counter } from 'k6/metrics';

// Backs the 'commission_calculated' threshold below; for a Counter, 'rate' means events per second
const commissionCalculated = new Counter('commission_calculated');
export const options = {
scenarios: {
// Sustained load: Normal operation
sustained: {
executor: 'constant-arrival-rate',
rate: 100, // 100 orders per second
timeUnit: '1s',
duration: '5m',
preAllocatedVUs: 50,
},
// Spike: Flash sale simulation
spike: {
executor: 'ramping-arrival-rate',
startRate: 100,
timeUnit: '1s',
stages: [
{ duration: '1m', target: 100 }, // Normal
{ duration: '30s', target: 500 }, // Spike to 5x
{ duration: '2m', target: 500 }, // Sustain spike
{ duration: '30s', target: 100 }, // Back to normal
],
preAllocatedVUs: 200,
},
// Stress: Find breaking point
stress: {
executor: 'ramping-arrival-rate',
startRate: 50,
timeUnit: '1s',
stages: [
{ duration: '2m', target: 200 },
{ duration: '2m', target: 400 },
{ duration: '2m', target: 600 },
{ duration: '2m', target: 800 }, // Find where it breaks
],
preAllocatedVUs: 300,
},
},
thresholds: {
http_req_duration: ['p(95)<500'], // 95% of requests < 500ms
http_req_failed: ['rate<0.01'], // Error rate < 1%
'commission_calculated': ['rate>50'], // At least 50/s processed
},
};
export default function () {
const orderPayload = {
userId: `user_${__VU}_${__ITER}`,
amount: Math.floor(Math.random() * 10000) + 1000,
referringPartnerId: 'partner_test_001',
};
const res = http.post(
`${__ENV.API_URL}/api/v1/orders`,
JSON.stringify(orderPayload),
{ headers: { 'Content-Type': 'application/json' } }
);
check(res, {
'order created': (r) => r.status === 201,
'commission triggered': (r) => r.json('commissionJobId') !== null,
});
if (res.status === 201) {
commissionCalculated.add(1); // feeds the commission_calculated rate threshold
}
sleep(0.1);
}
Database connection behavior under load:
# Monitor during load tests
metrics:
- pg_stat_activity.active_connections
- pg_stat_activity.waiting_connections
- pg_stat_user_tables.seq_scan (should not spike)
- pg_stat_user_tables.idx_scan (should handle load)
- pg_locks.blocked_queries (should be zero)
CI integration:
- name: Run Load Tests (Pre-release)
if: github.event_name == 'release'
run: |
k6 run load-tests/k6/commission-load.js \
--env API_URL=${{ secrets.STAGING_URL }} \
--out json=load-test-results.json
# Fail release if thresholds not met
node scripts/validate-load-test.js load-test-results.json
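scripts/validate-load-test.js re-checks the thresholds on the exported results; a sketch assuming the raw --out json stream has first been reduced to per-metric aggregates (for example with k6's --summary-export or a handleSummary hook), so the field names below are assumptions:
// Sketch of scripts/validate-load-test.js
import { readFileSync } from 'node:fs';

const summary = JSON.parse(readFileSync(process.argv[2], 'utf8'));

// Mirror the k6 thresholds defined above
const p95 = summary.metrics?.http_req_duration?.['p(95)'];
const errorRate = summary.metrics?.http_req_failed?.value;

const failures: string[] = [];
if (!(p95 < 500)) failures.push(`http_req_duration p(95)=${p95}ms (limit: 500ms)`);
if (!(errorRate < 0.01)) failures.push(`http_req_failed rate=${errorRate} (limit: 1%)`);

if (failures.length > 0) {
  console.error('Load test thresholds not met:');
  failures.forEach((f) => console.error(`  ${f}`));
  process.exit(1); // fail the release
}
console.log('Load test thresholds met');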
Monitoring & Notifications
Deploy Notifications
# Notify on:
- Deploy started (staging/production)
- Deploy succeeded
- Deploy failed
- Rollback triggered
- Manual approval required
# Channels:
- Slack / Telegram
- Email (for failures)
- PagerDuty (for production failures)
Post-Deploy Verification
# Smoke tests run immediately after deploy:
1. GET /health/ready → 200
2. POST /api/v1/auth/login (test user) → 200 + JWT
3. GET /api/v1/products → 200 + valid response
4. GET /api/v1/mlm/ranks → 200 + valid response
# If any fail → trigger rollback
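A sketch of the smoke-test script for the checks above; the base URL, the test-user credentials, and the accessToken response field are assumptions:
// Post-deploy smoke tests (illustrative), runs on Node 18+ with global fetch
const BASE_URL = process.env.SMOKE_BASE_URL ?? 'https://staging.iwm.local';

async function smoke(): Promise<void> {
  // 1. Readiness
  const health = await fetch(`${BASE_URL}/health/ready`);
  if (health.status !== 200) throw new Error('readiness check failed');

  // 2. Login with the dedicated smoke-test user
  const login = await fetch(`${BASE_URL}/api/v1/auth/login`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ email: process.env.SMOKE_USER, password: process.env.SMOKE_PASSWORD }),
  });
  const { accessToken } = await login.json();
  if (login.status !== 200 || !accessToken) throw new Error('login smoke test failed');

  // 3 & 4. Read-only endpoints
  for (const path of ['/api/v1/products', '/api/v1/mlm/ranks']) {
    const res = await fetch(`${BASE_URL}${path}`, { headers: { Authorization: `Bearer ${accessToken}` } });
    if (res.status !== 200) throw new Error(`${path} smoke test failed`);
  }
}

smoke().catch((err) => {
  console.error(err.message);
  process.exit(1); // non-zero exit triggers the rollback step in the pipeline
});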
Metrics to Monitor Post-Deploy
| Metric | Baseline Comparison | Alert Threshold |
|---|---|---|
| Error rate | vs. previous hour | > 2x baseline |
| Response time (p95) | vs. previous hour | > 1.5x baseline |
| Database query time | vs. previous hour | > 2x baseline |
| Memory usage | vs. previous deploy | > 120% |
| CPU usage | vs. previous deploy | > 150% |
Hotfix Procedure
When production is broken and a rapid fix is needed, the hotfix procedure allows bypassing normal flow while maintaining safety.
When to Use Hotfix
| Situation | Use Hotfix? | Normal Deploy OK? |
|---|---|---|
| Production down / 500 errors | Yes | No |
| Critical security vulnerability | Yes | No |
| Payment processing broken | Yes | No |
| Commission calculation wrong | Yes | No |
| Minor bug, users unaffected | No | Yes |
| Performance degradation (< 2x) | No | Yes |
| Feature not working as expected | No | Yes |
Hotfix Flow
┌─────────────────────────────────────────────────────────────────┐
│ HOTFIX PROCEDURE │
│ │
│ 1. ASSESS (5 min max) │
│ └─▶ Confirm severity, identify root cause │
│ └─▶ Decision: Hotfix vs Rollback vs Wait │
│ │
│ 2. BRANCH │
│ └─▶ git checkout -b hotfix/ISSUE-ID main │
│ └─▶ NOT from feature branch │
│ │
│ 3. FIX │
│ └─▶ Minimal change only │
│ └─▶ No refactoring │
│ └─▶ No "while we're here" additions │
│ │
│ 4. VALIDATE (Abbreviated) │
│ └─▶ Unit tests for changed code │
│ └─▶ Type check │
│ └─▶ Manual smoke test │
│ └─▶ Skip: Full E2E, Load tests, SAST │
│ │
│ 5. APPROVE │
│ └─▶ Single reviewer (senior engineer) │
│ └─▶ No PR required (direct push with approval) │
│ │
│ 6. DEPLOY │
│ └─▶ Direct to production (skip staging) │
│ └─▶ Watch metrics for 15 minutes │
│ │
│ 7. FOLLOW-UP (within 24 hours) │
│ └─▶ Create proper PR backport to main │
│ └─▶ Add regression test │
│ └─▶ Write incident report │
└─────────────────────────────────────────────────────────────────┘
Hotfix Commands
# 1. Create hotfix branch from production tag
git fetch --tags
git checkout -b hotfix/IWM-123-fix-commission-calc v1.2.3
# 2. Make fix
# ... code changes ...
# 3. Run abbreviated tests
npm run test:unit -- --testPathPattern="commission"
npm run type-check
# 4. Deploy directly (requires HOTFIX_APPROVED=true)
HOTFIX_APPROVED=true npm run deploy:production
# 5. Tag the hotfix
git tag -a v1.2.3-hotfix.1 -m "Hotfix: Commission calculation overflow"
git push origin v1.2.3-hotfix.1
# 6. Backport to main
git checkout main
git cherry-pick <hotfix-commit-sha>
git push origin main
Hotfix Approval Matrix
| Fix Type | Approver Required | Can Self-Approve |
|---|---|---|
| Logic fix (no DB) | 1 senior engineer | No |
| Config change | 1 engineer | Yes (if on-call) |
| Database fix | 2 engineers + DBA | No |
| Security fix | 1 security + 1 engineer | No |
| Revert to previous | 1 engineer | Yes (if on-call) |
On-Call & Escalation
Escalation Path
┌─────────────────────────────────────────────────────────────────┐
│ ESCALATION LADDER │
│ │
│ Level 0: Automated │
│ └─▶ Health check fails → Auto-rollback │
│ └─▶ Error rate > 5% → Auto-rollback │
│ └─▶ Alert sent to #alerts channel │
│ │
│ Level 1: On-Call Engineer (0-15 min) │
│ └─▶ Receive PagerDuty alert │
│ └─▶ Acknowledge within 5 minutes │
│ └─▶ Assess: Can fix alone? Needs escalation? │
│ │
│ Level 2: Senior Engineer (15-30 min) │
│ └─▶ Auto-escalate if L1 doesn't acknowledge │
│ └─▶ Join incident call │
│ └─▶ Decision: Hotfix vs Rollback vs External help │
│ │
│ Level 3: Engineering Lead + Team (30+ min) │
│ └─▶ Multiple engineers on call │
│ └─▶ Coordinate with stakeholders │
│ └─▶ Customer communication if needed │
│ │
│ Level 4: Executive (Major Incident) │
│ └─▶ Extended outage (> 1 hour) │
│ └─▶ Data breach / Security incident │
│ └─▶ Financial impact (payments affected) │
└─────────────────────────────────────────────────────────────────┘
PagerDuty Configuration
# pagerduty-config.yml
services:
- name: IWM Production
escalation_policy: iwm-production
alert_creation: create_alerts_and_incidents
escalation_policies:
- name: iwm-production
rules:
- escalation_delay_minutes: 5
targets:
- type: schedule_reference
id: primary-oncall
- escalation_delay_minutes: 15
targets:
- type: schedule_reference
id: senior-oncall
- escalation_delay_minutes: 30
targets:
- type: user_reference
id: engineering-lead
- type: user_reference
id: cto
schedules:
- name: primary-oncall
rotation: weekly
users: [engineer_1, engineer_2, engineer_3, engineer_4]
- name: senior-oncall
rotation: weekly
users: [senior_1, senior_2]
Alert Severity Levels
| Severity | Response Time | Examples |
|---|---|---|
| P1 - Critical | 5 min | Production down, payments failing, data breach |
| P2 - High | 15 min | Major feature broken, error rate > 5% |
| P3 - Medium | 1 hour | Performance degradation, non-critical errors |
| P4 - Low | Next business day | Minor bugs, monitoring alerts |
Incident Communication
# During incident:
channels:
- "#incident-active" # Real-time updates (engineers only)
- "#engineering" # Status updates (every 30 min)
- "#general" # Customer-facing status (if needed)
templates:
initial_alert: |
:rotating_light: **INCIDENT DETECTED**
Severity: {severity}
Service: {service}
Description: {description}
On-call: @{oncall_user}
Incident channel: #incident-{id}
status_update: |
**Incident Update** ({time} since start)
Status: {investigating|identified|fixing|monitoring|resolved}
Impact: {impact_description}
Next update: {eta}
resolution: |
:white_check_mark: **INCIDENT RESOLVED**
Duration: {duration}
Root cause: {root_cause}
Fix applied: {fix_description}
Follow-up: {follow_up_ticket}
Secrets Rotation
Rotation Schedule
| Secret | Rotation Frequency | Auto-Rotate | Downtime Required |
|---|---|---|---|
| JWT_SECRET | 90 days | No | No (dual-key period) |
| ENCRYPTION_MASTER_KEY | 180 days | No | Yes (re-encryption) |
| Database password | 90 days | Yes (via cloud) | No |
| API keys (external) | 365 days | Varies | No |
| Webhook secrets | 180 days | No | Coordination needed |
JWT Secret Rotation (Zero Downtime)
// Support dual JWT secrets during rotation
import jwt, { JwtPayload } from 'jsonwebtoken';

const jwtConfig = {
// Current secret (for signing new tokens)
current: process.env.JWT_SECRET,
// Previous secret (for validating old tokens during rotation)
previous: process.env.JWT_SECRET_PREVIOUS || null,
// Rotation window (how long to accept old tokens)
rotationWindowDays: 7,
};
// Validation accepts either secret
function validateToken(token: string): JwtPayload {
try {
return jwt.verify(token, jwtConfig.current);
} catch (e) {
if (jwtConfig.previous) {
return jwt.verify(token, jwtConfig.previous);
}
throw e;
}
}
Rotation procedure:
jwt_rotation_steps:
1. Generate new JWT_SECRET
2. Set JWT_SECRET_PREVIOUS = current JWT_SECRET
3. Set JWT_SECRET = new secret
4. Deploy (both secrets now valid)
5. Wait 7 days (tokens expire, refresh uses new secret)
6. Remove JWT_SECRET_PREVIOUS
7. Deploy final config
Encryption Key Rotation
Encryption key rotation requires re-encrypting existing data.
// Key versioning in encrypted data
interface EncryptedField {
version: string; // 'v1', 'v2', etc.
ciphertext: string;
iv: string;
}
// Decryption with version support
async function decrypt(field: EncryptedField): Promise<string> {
const key = getKeyByVersion(field.version);
return decryptWithKey(field.ciphertext, field.iv, key);
}
// Background re-encryption job
async function reEncryptAllData(fromVersion: string, toVersion: string) {
const batchSize = 1000;
let processed = 0; // running total for logging
while (true) {
const records = await db.taxIdentification.findMany({
where: { encryptedData: { path: ['version'], equals: fromVersion } },
take: batchSize,
// No skip: updated rows no longer match fromVersion, so each pass fetches the next unprocessed batch
});
if (records.length === 0) break;
for (const record of records) {
const decrypted = await decrypt(record.encryptedData);
const reEncrypted = await encrypt(decrypted, toVersion);
await db.taxIdentification.update({
where: { id: record.id },
data: { encryptedData: reEncrypted },
});
}
processed += records.length;
logger.info(`Re-encrypted ${processed} records`);
}
}
Rotation procedure:
encryption_key_rotation:
1. Generate new key (ENCRYPTION_MASTER_KEY_V2)
2. Deploy with both keys available
3. New encryptions use v2
4. Run background re-encryption job
5. Monitor job completion (may take hours)
6. Verify all records are v2
7. Remove old key from config
8. Securely delete old key from secrets manager
Webhook Secret Rotation
External webhooks (payment providers) require coordination.
webhook_rotation:
stripe:
1. Generate new webhook in Stripe dashboard
2. Add new STRIPE_WEBHOOK_SECRET_V2 to config
3. Deploy (accept both secrets)
4. Disable old webhook in Stripe dashboard
5. Remove old secret from config
coordination: Self-service, no downtime
payment_provider:
1. Contact provider support
2. Schedule rotation window
3. Provider sends test webhook with new secret
4. Verify receipt
5. Provider switches to new secret
6. Update config
coordination: Provider-dependent, may require window
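Accepting both secrets during rotation (step 3 of the flows above) can be sketched as below; the HMAC-SHA256-over-raw-body scheme and the environment variable names are assumptions, and in practice the provider's own signature verification helper should be used:
// Dual-secret webhook verification during the rotation window (illustrative)
import { createHmac, timingSafeEqual } from 'node:crypto';

function matches(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  return a.length === b.length && timingSafeEqual(a, b);
}

export function verifyWebhook(rawBody: string, signature: string): boolean {
  const secrets = [process.env.WEBHOOK_SECRET, process.env.WEBHOOK_SECRET_PREVIOUS].filter(Boolean) as string[];
  // Try the current secret first, then fall back to the previous one until rotation completes
  return secrets.some((secret) => matches(rawBody, signature, secret));
}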
Implementation Priority
Required (Phase 1)
| Component | Reason |
|---|---|
| Build + Type check | Basic correctness |
| Unit tests (80%) | Financial calculation accuracy |
| Integration tests | Database operations correctness |
| npm audit | OWASP A06 compliance |
| Secret scanning | Prevent credential leaks |
| Migration validation | Database integrity |
| Health checks | Deployment verification |
| Staging environment | Pre-production validation |
Should Have (Phase 2)
| Component | Reason |
|---|---|
| E2E tests (critical flows) | User journey validation |
| Container scanning (Trivy) | Infrastructure security |
| SAST (CodeQL) | Code-level vulnerabilities |
| Blue-green deployment | Zero-downtime releases |
| Automatic rollback | Fast recovery |
| Deploy notifications | Team awareness |
| Coverage gates | Prevent regression |
Nice to Have (Phase 3)
| Component | Reason |
|---|---|
| Preview environments | PR testing |
| Load testing | Performance validation |
| Visual regression | UI consistency |
| Auto-changelog | Release documentation |
| Dependency auto-update | Maintenance automation |
| Feature flags | Gradual rollouts |
| Chaos testing | Resilience verification |