CI/CD Pipeline Standards

Overview

The IWM platform CI/CD pipeline ensures code quality, security, and reliable deployments for a financial system handling MLM commissions, payments, and sensitive user data.

Pipeline Goals

| Goal         | Implementation                                              |
|--------------|-------------------------------------------------------------|
| Correctness  | Comprehensive test suites for financial calculations        |
| Security     | Multi-layer scanning (dependencies, secrets, containers)    |
| Reliability  | Zero-downtime deployments with rollback capability          |
| Speed        | Parallel jobs, Docker layer caching, smart change detection |
| Auditability | Version tracking, deployment logs, change documentation     |

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           PR VALIDATION                                  │
│                                                                          │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐ │
│   │    Build    │   │    Lint     │   │    Type     │   │   Secret    │ │
│   │    Check    │   │   Format    │   │    Check    │   │    Scan     │ │
│   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘ │
│          │                 │                 │                 │        │
│          └─────────────────┴────────┬────────┴─────────────────┘        │
│                                     │                                    │
│                                     ▼                                    │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                        TEST SUITE                                │   │
│   │   ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐ │   │
│   │   │   Unit    │   │Integration│   │    E2E    │   │  Contract │ │   │
│   │   │  (80%+)   │   │  (DB/API) │   │ (Critical)│   │   (API)   │ │   │
│   │   └───────────┘   └───────────┘   └───────────┘   └───────────┘ │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                     │                                    │
│                                     ▼                                    │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                      SECURITY GATES                              │   │
│   │   ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐ │   │
│   │   │ npm audit │   │  gitleaks │   │   SAST    │   │   Trivy   │ │   │
│   │   │  (high+)  │   │ (secrets) │   │ (CodeQL)  │   │ (container│ │   │
│   │   └───────────┘   └───────────┘   └───────────┘   └───────────┘ │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                     │                                    │
│                                     ▼                                    │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                   DATABASE VALIDATION                            │   │
│   │   ┌─────────────────────┐   ┌─────────────────────────────────┐ │   │
│   │   │  Migration dry-run  │   │  Schema drift detection         │ │   │
│   │   └─────────────────────┘   └─────────────────────────────────┘ │   │
│   └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

                                      │ Merge to main

┌─────────────────────────────────────────────────────────────────────────┐
│                            RELEASE PIPELINE                              │
│                                                                          │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐ │
│   │  Version    │   │   Build     │   │    Push     │   │  Generate   │ │
│   │   Bump      │──▶│   Images    │──▶│  Registry   │──▶│  Changelog  │ │
│   └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘ │
│                                                                │         │
│                                                                ▼         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                     DEPLOY TO STAGING                            │   │
│   │   ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐ │   │
│   │   │   Run     │   │  Deploy   │   │  Health   │   │   Smoke   │ │   │
│   │   │Migrations │──▶│  (B/G)    │──▶│   Check   │──▶│   Tests   │ │   │
│   │   └───────────┘   └───────────┘   └───────────┘   └───────────┘ │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                     │                                    │
│                              Manual Approval                             │
│                                     │                                    │
│                                     ▼                                    │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                    DEPLOY TO PRODUCTION                          │   │
│   │   ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐ │   │
│   │   │   Run     │   │Blue-Green │   │  Health   │   │   Smoke   │ │   │
│   │   │Migrations │──▶│  Deploy   │──▶│   Check   │──▶│  + Notify │ │   │
│   │   └───────────┘   └───────────┘   └───────────┘   └───────────┘ │   │
│   └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

PR Validation Workflow

Every pull request must pass all gates before merge is allowed.

Stage 1: Build & Static Analysis

| Check                  | Tool             | Failure Threshold            |
|------------------------|------------------|------------------------------|
| TypeScript compilation | tsc --noEmit     | Any error                    |
| ESLint                 | eslint           | Any error (warnings allowed) |
| Prettier               | prettier --check | Any formatting issue         |
| Stylelint (frontend)   | stylelint        | Any error                    |

Stage 2: Test Suite

| Test Type         | Scope                                      | Coverage Requirement |
|-------------------|--------------------------------------------|----------------------|
| Unit Tests        | Domain logic, utilities, pure functions    | 80% minimum          |
| Integration Tests | Database operations, API endpoints, Redis  | 70% minimum          |
| E2E Tests         | Critical user flows (see below)            | Must pass            |
| Contract Tests    | API schema validation (see below)          | Must pass            |

Contract Tests Specification

Contract tests validate that the API implementation matches its specification and that changes don't break consumers.

What we validate:

| Aspect             | Tool                               | Description                              |
|--------------------|------------------------------------|------------------------------------------|
| OpenAPI compliance | @apidevtools/swagger-cli validate  | Schema is valid OpenAPI 3.1              |
| Response shape     | Custom Jest matchers               | Responses match declared schemas         |
| Breaking changes   | openapi-diff                       | Detect removed endpoints, changed types  |
| Request validation | class-validator + Zod              | Input DTOs match OpenAPI parameters      |

Contract test implementation:

typescript
// test/contract/api-contract.spec.ts
import request from 'supertest';
import { OpenAPIValidator } from 'express-openapi-validator';
import spec from '../openapi.json';
import { app } from '../../src/app'; // app entry point; path is illustrative

describe('API Contract Tests', () => {
  const validator = new OpenAPIValidator({ spec });

  describe('POST /api/v1/auth/register', () => {
    // Shared by both tests (previously scoped to a single test)
    const validRequest = {
      email: 'test@example.com',
      password: 'SecurePass123',
      referralCode: 'ABC123'
    };

    it('should match request schema', () => {
      expect(() => validator.validateRequest({
        path: '/api/v1/auth/register',
        method: 'post',
        body: validRequest
      })).not.toThrow();
    });

    it('should match response schema', async () => {
      const response = await request(app)
        .post('/api/v1/auth/register')
        .send(validRequest);

      expect(() => validator.validateResponse({
        path: '/api/v1/auth/register',
        method: 'post',
        statusCode: response.status,
        body: response.body
      })).not.toThrow();
    });
  });
});

Breaking change detection in CI:

yaml
- name: Check for Breaking API Changes
  run: |
    # Compare current spec with main branch
    git show origin/main:openapi.json > openapi-main.json

    # openapi-diff exits non-zero when incompatible (breaking) differences are found
    npx openapi-diff openapi-main.json openapi.json

    # Incompatible changes (will fail):
    # - Removing endpoints
    # - Removing required response fields
    # - Adding required request fields
    # - Changing field types

    # Compatible changes (allowed):
    # - Adding new endpoints
    # - Adding optional request fields
    # - Adding response fields

Critical E2E Flows (must always be tested):

1. Authentication Flow
   - Registration with referral code
   - Login with 2FA
   - Password reset
   - Session management

2. Order & Payment Flow
   - Add to cart → Checkout → Payment → Confirmation
   - Order status transitions (full state machine)
   - Payment webhook processing

3. Commission Calculation Flow
   - Order completion → Commission generation
   - Multi-level distribution (up to 10 levels)
   - Commission approval → Payout

4. MLM Tree Operations
   - Partner registration under sponsor
   - Tree traversal queries
   - Rank qualification check

Stage 3: Security Gates

| Scan                                        | Tool                         | Action on Failure      |
|---------------------------------------------|------------------------------|------------------------|
| Dependency vulnerabilities                  | npm audit --audit-level=high | Block merge            |
| Secret detection                            | gitleaks                     | Block merge            |
| Static Application Security Testing (SAST)  | CodeQL / SonarQube           | Block on high/critical |
| Container vulnerabilities                   | Trivy                        | Block on high/critical |

Stage 4: Database Validation

yaml
# Migration dry-run against a disposable database
- name: Validate Migrations
  run: |
    # Create temporary database
    createdb iwm_migration_test

    # Run all migrations against it
    DATABASE_URL="postgresql://postgres:postgres@localhost:5432/iwm_migration_test" \
      npx prisma migrate deploy

    # Pull the resulting schema and confirm it is valid
    DATABASE_URL="postgresql://postgres:postgres@localhost:5432/iwm_migration_test" \
      npx prisma db pull --force
    npx prisma validate

    # Cleanup
    dropdb iwm_migration_test

Release Workflow

Triggered on merge to main branch.

Version Management

Semantic Versioning: MAJOR.MINOR.PATCH

Auto-bump rules:
- Merge to main → PATCH increment (1.2.3 → 1.2.4)
- Manual MINOR change → Skip auto-bump, use manual version
- Manual MAJOR change → Skip auto-bump, use manual version

Version stored in: VERSION file (root)
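
A minimal sketch of the auto-bump step, assuming the VERSION file described above and that the script handles only the PATCH case (manual MINOR/MAJOR bumps skip it; the script name is illustrative):

typescript
// scripts/bump-version.ts -- illustrative helper, not the canonical pipeline script
import { readFileSync, writeFileSync } from 'fs';

// Read the current version, e.g. "1.2.3"
const current = readFileSync('VERSION', 'utf8').trim();
const [major, minor, patch] = current.split('.').map(Number);

// Auto-bump only touches PATCH; MINOR/MAJOR bumps are committed manually
const next = `${major}.${minor}.${patch + 1}`;
writeFileSync('VERSION', next + '\n');
console.log(`Version bumped: ${current} -> ${next}`);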

Build Stage

| Step                | Description                              |
|---------------------|------------------------------------------|
| Read VERSION        | Get current or bumped version            |
| Build Docker images | Backend + frontend built separately      |
| Tag images          | v{VERSION} + latest                      |
| Push to registry    | GitHub Container Registry (ghcr.io)      |
| Cache layers        | type=gha,mode=max for faster rebuilds    |

Docker Layer Caching

yaml
# Each Dockerfile instruction creates a layer with SHA256 hash
# Unchanged layers are reused from cache

FROM node:20-alpine          # Layer sha256:a1b2... (cached if unchanged)
COPY package*.json ./        # Layer sha256:c3d4... (cached if package.json same)
RUN npm ci                   # Layer sha256:e5f6... (cached if dependencies same)
COPY . .                     # Layer sha256:97a8... (changes on code change)
RUN npm run build            # Layer sha256:b9c0... (rebuilds if code changed)

Cache configuration:

yaml
cache-from: type=gha         # Pull from GitHub Actions cache
cache-to: type=gha,mode=max  # Push ALL layers (not just final)

Security Gates Detail

Dependency Scanning

yaml
# Run on every PR and weekly scheduled scan
- name: Dependency Audit
  run: npm audit --audit-level=high

- name: Snyk Scan
  uses: snyk/actions/node@master
  env:
    SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

Secret Scanning

yaml
- name: Gitleaks Scan
  uses: gitleaks/gitleaks-action@v2
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Protected patterns:

| Pattern         | Example                  |
|-----------------|--------------------------|
| API keys        | sk_live_*, pk_live_*     |
| Database URLs   | postgresql://*:*@*       |
| JWT secrets     | JWT_SECRET=*             |
| Encryption keys | ENCRYPTION_MASTER_KEY=*  |
| Webhook secrets | *_WEBHOOK_SECRET=*       |

Container Scanning

yaml
- name: Trivy Scan
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: ${{ env.IMAGE }}
    severity: 'CRITICAL,HIGH'
    exit-code: '1'  # Fail on critical/high

Database Migration Strategy

Migration Validation (PR)

1. Dry-run against empty database
2. Dry-run against production clone (staging)
3. Check for destructive operations (DROP, TRUNCATE; see the sketch below)
4. Estimate migration duration
5. Flag migrations requiring maintenance window
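
A sketch of the destructive-operation check from step 3, assuming migrations live under prisma/migrations (the script name and hard-fail behavior are illustrative; flagged files would go through the approval policy below):

typescript
// scripts/check-destructive-migrations.ts -- illustrative; scans migration SQL
import { readFileSync } from 'fs';
import { globSync } from 'glob';

// Operations covered by the destructive migration policy below
const DESTRUCTIVE = /\b(DROP\s+(TABLE|COLUMN)|TRUNCATE|ALTER\s+TYPE)\b/i;

const flagged = globSync('prisma/migrations/**/*.sql').filter((file) =>
  DESTRUCTIVE.test(readFileSync(file, 'utf8'))
);

if (flagged.length > 0) {
  console.error('Destructive operations found; manual approval required:');
  flagged.forEach((f) => console.error(`  ${f}`));
  process.exit(1); // block the PR until approved
}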

Migration Execution (Deploy)

┌─────────────────────────────────────────────────────────────────┐
│                    MIGRATION EXECUTION                           │
│                                                                  │
│   1. Create backup snapshot                                      │
│      └─▶ pg_dump iwm_production > backup_$(date +%F).sql        │
│                                                                  │
│   2. Run migrations                                              │
│      └─▶ npx prisma migrate deploy                              │
│                                                                  │
│   3. Validate schema                                             │
│      └─▶ npx prisma validate                                    │
│                                                                  │
│   4. Health check                                                │
│      └─▶ curl /health/ready                                     │
│                                                                  │
│   5. On failure: Restore from backup                             │
│      └─▶ psql < backup_$(date +%F).sql                          │
└─────────────────────────────────────────────────────────────────┘

Important: Prisma migrate deploy runs each migration file in its own transaction automatically. Do NOT wrap it in BEGIN/COMMIT manually.

DDL Transaction Limitations

Some PostgreSQL DDL operations cannot run inside a transaction. These require special handling:

| Operation                        | Transaction Support | Handling                                        |
|----------------------------------|---------------------|-------------------------------------------------|
| CREATE INDEX CONCURRENTLY        | No                  | Separate migration file, run manually           |
| ALTER TYPE ... ADD VALUE (enum)  | No (PG < 12)        | Separate migration; requires downtime on old PG |
| DROP INDEX CONCURRENTLY          | No                  | Separate migration file                         |
| REINDEX CONCURRENTLY             | No                  | Run during maintenance window                   |

For non-transactional migrations:

sql
-- migrations/20240115_add_index_concurrently.sql
-- @non-transactional

CREATE INDEX CONCURRENTLY idx_orders_created
ON product.orders(created_at);

yaml
# The pipeline runs non-transactional migrations separately,
# selecting files by their @non-transactional annotation
- name: Run Non-Transactional Migrations
  run: |
    for file in $(grep -rl '@non-transactional' prisma/migrations); do
      psql $DATABASE_URL -f "$file" || exit 1
    done

Destructive Migration Policy

| Operation   | Requirement                                             |
|-------------|---------------------------------------------------------|
| DROP TABLE  | Requires manual approval + backup verification          |
| DROP COLUMN | Must be preceded by code removal in a previous release  |
| TRUNCATE    | Prohibited in production migrations                     |
| ALTER TYPE  | Requires maintenance window                             |

Backup Verification Testing

Backups are worthless if they can't be restored. Regular verification ensures recovery is actually possible.

Verification schedule:

| Environment     | Frequency | Method                             |
|-----------------|-----------|------------------------------------|
| Production      | Weekly    | Full restore to isolated instance  |
| Staging         | Monthly   | Full restore test                  |
| After migration | Immediate | Spot check critical tables         |

Automated backup verification job:

yaml
# .github/workflows/backup-verification.yml
name: Backup Verification

on:
  schedule:
    - cron: '0 4 * * 0'  # Weekly Sunday 4AM
  workflow_dispatch:

jobs:
  verify-backup:
    runs-on: ubuntu-latest
    steps:
      - name: Create Fresh Backup
        run: |
          pg_dump $PRODUCTION_URL \
            --format=custom \
            --file=backup-$(date +%Y%m%d).dump

      - name: Spin Up Isolated Instance
        run: |
          docker run -d \
            --name pg-verify \
            -e POSTGRES_PASSWORD=verify_test \
            -p 5433:5432 \
            postgres:15

          sleep 10  # Wait for startup

      - name: Restore Backup
        run: |
          PGPASSWORD=verify_test pg_restore \
            --host=localhost \
            --port=5433 \
            --username=postgres \
            --dbname=postgres \
            --clean \
            --if-exists \
            backup-$(date +%Y%m%d).dump

      - name: Verify Data Integrity
        run: |
          PGPASSWORD=verify_test psql \
            -h localhost -p 5433 -U postgres -d postgres \
            -f scripts/verify-backup-integrity.sql

      - name: Verify Row Counts
        run: |
          # Compare row counts with production
          node scripts/compare-backup-counts.js

      - name: Cleanup
        if: always()
        run: docker rm -f pg-verify

      - name: Report Results
        if: always()  # run on failure too, so the failure alert can actually fire
        run: |
          if [ "${{ job.status }}" == "success" ]; then
            curl -X POST $SLACK_WEBHOOK \
              -d '{"text": ":white_check_mark: Backup verification passed"}'
          else
            curl -X POST $SLACK_WEBHOOK \
              -d '{"text": ":x: Backup verification FAILED - investigate immediately"}'
          fi
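
A sketch of what scripts/compare-backup-counts.js could contain, assuming the pg client and connection strings supplied via environment variables (the table list, the VERIFY_URL name, and the 1% tolerance are illustrative):

typescript
// scripts/compare-backup-counts.ts -- illustrative row-count comparison
import { Client } from 'pg';

const TABLES = ['core.users', 'mlm.partners', 'mlm.commission_transactions', 'product.orders'];
const TOLERANCE = 0.01; // allow 1% drift for rows written since the backup

async function countRows(url: string): Promise<Map<string, number>> {
  const client = new Client({ connectionString: url });
  await client.connect();
  const counts = new Map<string, number>();
  for (const table of TABLES) {
    const res = await client.query(`SELECT COUNT(*)::int AS n FROM ${table}`);
    counts.set(table, res.rows[0].n);
  }
  await client.end();
  return counts;
}

async function main() {
  const prod = await countRows(process.env.PRODUCTION_URL!);
  const restored = await countRows(process.env.VERIFY_URL!); // the pg-verify instance
  for (const table of TABLES) {
    const a = prod.get(table)!;
    const b = restored.get(table)!;
    if (a > 0 && Math.abs(a - b) / a > TOLERANCE) {
      console.error(`Row count mismatch in ${table}: prod=${a}, restored=${b}`);
      process.exit(1);
    }
  }
  console.log('Row counts match within tolerance');
}

main();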

Integrity verification script:

sql
-- scripts/verify-backup-integrity.sql

-- Check critical tables exist and have data
DO $$
DECLARE
    v_count INT;
BEGIN
    -- Users table
    SELECT COUNT(*) INTO v_count FROM core.users;
    IF v_count = 0 THEN
        RAISE EXCEPTION 'CRITICAL: users table is empty';
    END IF;
    RAISE NOTICE 'users: % rows', v_count;

    -- Partners table
    SELECT COUNT(*) INTO v_count FROM mlm.partners;
    RAISE NOTICE 'partners: % rows', v_count;

    -- Commission transactions
    SELECT COUNT(*) INTO v_count FROM mlm.commission_transactions;
    RAISE NOTICE 'commission_transactions: % rows', v_count;

    -- Orders table
    SELECT COUNT(*) INTO v_count FROM product.orders;
    RAISE NOTICE 'orders: % rows', v_count;

    -- Verify foreign key relationships intact
    SELECT COUNT(*) INTO v_count
    FROM mlm.partners p
    LEFT JOIN core.users u ON p.user_id = u.id
    WHERE u.id IS NULL;

    IF v_count > 0 THEN
        RAISE EXCEPTION 'CRITICAL: % orphaned partner records', v_count;
    END IF;

    -- Verify tree integrity
    SELECT COUNT(*) INTO v_count
    FROM mlm.partner_tree_paths ptp
    LEFT JOIN mlm.partners p ON ptp.ancestor_id = p.id
    WHERE p.id IS NULL;

    IF v_count > 0 THEN
        RAISE EXCEPTION 'CRITICAL: % orphaned tree path records', v_count;
    END IF;

    RAISE NOTICE 'All integrity checks passed';
END $$;

Recovery time tracking:

| Metric               | Target    | Alert If  |
|----------------------|-----------|-----------|
| Backup creation time | < 30 min  | > 1 hour  |
| Restore time         | < 1 hour  | > 2 hours |
| Verification time    | < 15 min  | > 30 min  |
| Total RTO            | < 2 hours | > 4 hours |

Environment Strategy

Environment Matrix

| Environment | Purpose                   | Deploy Trigger  | Data                          |
|-------------|---------------------------|-----------------|-------------------------------|
| Development | Local development         | Manual          | Seed data                     |
| CI          | Pipeline testing          | Every PR        | Fresh per run                 |
| Staging     | Pre-production validation | Merge to main   | Production clone (anonymized) |
| Production  | Live system               | Manual approval | Real data                     |

Staging Data Anonymization

Production data cloned to staging must be anonymized to protect user privacy and comply with GDPR/data protection requirements.

Anonymization rules by data type:

| Data Category | Field                  | Anonymization Method                                       |
|---------------|------------------------|------------------------------------------------------------|
| PII           | email                  | Rewritten to user_${id}@staging.iwm.local (matches the script below) |
| PII           | phone                  | +7900${random7digits}                                      |
| PII           | first_name, last_name  | Replaced with fixed test names                             |
| PII           | address fields         | Replaced with generated test addresses                     |
| Financial     | bank_account           | ****${last4} (masked)                                      |
| Financial     | card_number            | Completely removed                                         |
| Financial     | balance amounts        | Preserved (not PII)                                        |
| KYC           | passport_number        | XX${random8digits}                                         |
| KYC           | tax_id                 | ${random12digits}                                          |
| KYC           | document_urls          | Replaced with placeholder images                           |
| Auth          | password_hash          | Set to a known test password hash                          |
| Auth          | session tokens         | Deleted                                                    |
| Auth          | 2fa_secrets            | Deleted                                                    |
| Audit         | ip_address             | 192.168.x.x (private range)                                |
| Audit         | user_agent             | Preserved (not PII)                                        |

Anonymization script:

sql
-- scripts/anonymize-staging.sql
-- Run after cloning production to staging

BEGIN;

-- Users table
UPDATE core.users SET
    email = 'user_' || id::text || '@staging.iwm.local',
    phone = '+7900' || LPAD(FLOOR(RANDOM() * 10000000)::TEXT, 7, '0'),
    password_hash = '$2b$10$staging.password.hash.for.testing';  -- Password: "staging123"

-- User profiles
UPDATE core.user_profiles SET
    first_name = 'Test',
    last_name = 'User_' || SUBSTRING(user_id::text, 1, 8),
    middle_name = NULL;

-- Addresses
UPDATE product.addresses SET
    first_name = 'Test',
    last_name = 'User',
    address_line1 = FLOOR(RANDOM() * 100)::TEXT || ' Test Street',
    address_line2 = 'Apt ' || FLOOR(RANDOM() * 100)::TEXT,
    city = 'Test City',
    phone = '+7900' || LPAD(FLOOR(RANDOM() * 10000000)::TEXT, 7, '0');

-- KYC documents
UPDATE core.kyc_verifications SET
    document_number = 'XX' || LPAD(FLOOR(RANDOM() * 100000000)::TEXT, 8, '0');

UPDATE core.kyc_documents SET
    file_url = 'https://staging-assets.iwm.local/placeholder-document.pdf',
    file_name = 'anonymized_document.pdf';

-- Payout details (sensitive bank info)
UPDATE mlm.payout_requests SET
    payout_details = jsonb_set(
        payout_details,
        '{account_number}',
        '"****' || RIGHT(payout_details->>'account_number', 4) || '"'
    )
WHERE payout_details ? 'account_number';

-- Delete sensitive auth data
DELETE FROM core.sessions;
DELETE FROM core.two_factor_auth;

-- Audit logs - anonymize IPs
UPDATE core.audit_log SET
    ip_address = ('192.168.' || (RANDOM() * 255)::INT || '.' || (RANDOM() * 255)::INT)::INET;

-- Partner referral links - update domain
UPDATE mlm.referral_links SET
    full_url = REPLACE(full_url, 'iwm.com', 'staging.iwm.local');

COMMIT;

-- Verify no production data leaked
DO $$
BEGIN
    -- Check no real emails remain
    IF EXISTS (SELECT 1 FROM core.users WHERE email NOT LIKE '%@staging.iwm.local') THEN
        RAISE EXCEPTION 'Anonymization failed: real emails found';
    END IF;

    -- Check no real phone numbers remain
    IF EXISTS (SELECT 1 FROM core.users WHERE phone NOT LIKE '+7900%') THEN
        RAISE EXCEPTION 'Anonymization failed: real phone numbers found';
    END IF;
END $$;

Automated staging refresh:

yaml
# .github/workflows/staging-refresh.yml
name: Refresh Staging Data

on:
  schedule:
    - cron: '0 3 * * 0'  # Weekly Sunday 3AM
  workflow_dispatch:      # Manual trigger

jobs:
  refresh-staging:
    runs-on: ubuntu-latest
    steps:
      - name: Create Production Snapshot
        run: |
          pg_dump $PRODUCTION_URL \
            --no-owner \
            --no-privileges \
            > production-snapshot.sql

      - name: Restore to Staging
        run: |
          # Drop all application schemas, not just public; the snapshot recreates them
          psql $STAGING_URL -c "DROP SCHEMA IF EXISTS core, mlm, product, public CASCADE; CREATE SCHEMA public;"
          psql $STAGING_URL < production-snapshot.sql

      - name: Run Anonymization
        run: psql $STAGING_URL < scripts/anonymize-staging.sql

      - name: Verify Anonymization
        run: node scripts/verify-anonymization.js

      - name: Notify Team
        run: |
          curl -X POST $SLACK_WEBHOOK \
            -d '{"text": "Staging environment refreshed with anonymized production data"}'

Access control for staging:

| Role                | Staging Access | Can See              |
|---------------------|----------------|----------------------|
| Developer           | Full access    | Anonymized data only |
| QA                  | Full access    | Anonymized data only |
| Support             | No access      | -                    |
| External contractor | No access      | -                    |

Environment Parity

yaml
# All environments use identical:
- Docker images (same SHA)
- Database schema (same migrations)
- Environment variable structure (different values)
- Infrastructure configuration (scaled differently)

Environment Variables Validation

typescript
// Validated at application startup using Zod
// CI must validate all required variables are defined

const envSchema = z.object({
  NODE_ENV: z.enum(['development', 'staging', 'production']),
  DATABASE_URL: z.string().url(),
  REDIS_URL: z.string().url(),
  JWT_SECRET: z.string().min(32),
  ENCRYPTION_MASTER_KEY: z.string().min(32),
  // ... all other required variables
});
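
At startup a single parse call fails fast on any missing or malformed variable:

typescript
// Throws (and aborts boot) if any required variable is missing or invalid
export const env = envSchema.parse(process.env);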

Deployment Strategy

Blue-Green Deployment

┌─────────────────────────────────────────────────────────────────┐
│                      BLUE-GREEN DEPLOY                           │
│                                                                  │
│   Current State:                                                 │
│   ┌─────────────┐                                               │
│   │    BLUE     │ ◀── Load Balancer ◀── Traffic                │
│   │   (v1.2.3)  │                                               │
│   └─────────────┘                                               │
│   ┌─────────────┐                                               │
│   │   GREEN     │     (idle)                                    │
│   │   (v1.2.3)  │                                               │
│   └─────────────┘                                               │
│                                                                  │
│   Deploy v1.2.4:                                                 │
│   1. Deploy to GREEN                                             │
│   2. Run health checks on GREEN                                  │
│   3. Run smoke tests on GREEN                                    │
│   4. Switch traffic to GREEN                                     │
│   5. Monitor for errors                                          │
│   6. If errors: Switch back to BLUE (rollback)                  │
│   7. If stable: Update BLUE to v1.2.4 (sync)                    │
└─────────────────────────────────────────────────────────────────┘

Rollback Procedure

yaml
# Automatic rollback triggers:
- Health check failure (3 consecutive)
- Error rate > 5% (compared to baseline)
- Response time > 2x baseline

# Rollback steps:
1. Switch load balancer to previous version
2. Alert on-call engineer
3. Preserve logs for investigation
4. Do NOT run backward migrations automatically
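
A sketch of the trigger logic, assuming error-rate and latency metrics are already collected somewhere queryable (the getMetrics and switchTrafficToBlue hooks are placeholders for the real monitoring and load-balancer integrations):

typescript
// deploy/rollback-monitor.ts -- illustrative; metric source and traffic switch are placeholders
interface DeployMetrics {
  consecutiveHealthFailures: number;
  errorRate: number;         // e.g. 0.07 = 7%
  p95LatencyMs: number;
  baselineP95Ms: number;
}

function rollbackReason(m: DeployMetrics): string | null {
  if (m.consecutiveHealthFailures >= 3) return 'health check failed 3 times';
  if (m.errorRate > 0.05) return `error rate ${(m.errorRate * 100).toFixed(1)}% > 5%`;
  if (m.p95LatencyMs > 2 * m.baselineP95Ms) return `p95 ${m.p95LatencyMs}ms > 2x baseline`;
  return null;
}

// Poll after the traffic switch; on any trigger, flip back and page on-call
export async function watchDeploy(
  getMetrics: () => Promise<DeployMetrics>,
  switchTrafficToBlue: () => Promise<void>,
) {
  for (let i = 0; i < 90; i++) {                     // ~15 min at 10s intervals
    const reason = rollbackReason(await getMetrics());
    if (reason) {
      console.error(`Automatic rollback: ${reason}`);
      await switchTrafficToBlue();
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, 10_000));
  }
}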

Rollback Window

| Phase       | Duration         | BLUE Status              | Action if Issues             |
|-------------|------------------|--------------------------|------------------------------|
| Immediate   | 0-15 min         | Running, no traffic      | Instant switch back          |
| Short-term  | 15 min - 2 hours | Running, warm standby    | Quick rollback (< 1 min)     |
| Medium-term | 2-24 hours       | Stopped, image preserved | Restart BLUE, switch traffic |
| Long-term   | > 24 hours       | Terminated               | Redeploy previous version    |

Data divergence consideration:

  • Rollback within 2 hours: Minimal data divergence, safe to rollback
  • Rollback after 2 hours: Audit new data created, may need manual reconciliation
  • Rollback after 24 hours: Requires data migration plan, not automatic

yaml
# Rollback window configuration
rollback:
  instant_window: 15m      # BLUE kept running
  warm_standby: 2h         # BLUE stopped but not terminated
  image_retention: 24h     # Previous image kept in registry
  max_auto_rollback: 2h    # After this, manual approval required

Database Connection Management During Deploy

During blue-green deployment, both versions may run simultaneously. This requires careful connection pool management.

┌─────────────────────────────────────────────────────────────────┐
│                 CONNECTION POOL DURING DEPLOY                    │
│                                                                  │
│   Database Pool Limit: 100 connections                           │
│                                                                  │
│   Normal Operation:                                              │
│   ┌─────────────┐                                               │
│   │    BLUE     │ ─── 50 connections ───▶ ┌──────────────┐     │
│   │  (active)   │                         │              │     │
│   └─────────────┘                         │  PostgreSQL  │     │
│                                           │              │     │
│   During Deploy (both running):           │  Pool: 100   │     │
│   ┌─────────────┐                         │              │     │
│   │    BLUE     │ ─── 40 connections ───▶ │              │     │
│   │  (draining) │                         │              │     │
│   └─────────────┘                         │              │     │
│   ┌─────────────┐                         │              │     │
│   │   GREEN     │ ─── 40 connections ───▶ │              │     │
│   │  (starting) │                         └──────────────┘     │
│   └─────────────┘                                               │
│                                                                  │
│   Reserve: 20 connections for migrations & admin                 │
└─────────────────────────────────────────────────────────────────┘

Connection pool configuration:

typescript
// config/database.ts
const poolConfig = {
  // Normal operation
  default: {
    min: 5,
    max: 50,
  },
  // During deployment (detected via DEPLOY_MODE env)
  deployment: {
    min: 2,
    max: 40,  // Reduced to allow overlap
  },
};

Deploy sequence to prevent connection exhaustion:

yaml
deploy_steps:
  1. Set GREEN pool to deployment mode (max: 40)
  2. Start GREEN instances
  3. Wait for GREEN health check
  4. Run migrations (uses reserved connections)
  5. Gradually shift traffic (10% → 50% → 100%)
  6. Set BLUE to drain mode (stop accepting new connections)
  7. Wait for BLUE connections to close (max 30s)
  8. Stop BLUE instances
  9. Set GREEN pool to normal mode (max: 50)

PgBouncer recommended for production:

ini
# pgbouncer.ini
[pgbouncer]
pool_mode = transaction
max_client_conn = 200
default_pool_size = 50
reserve_pool_size = 10
reserve_pool_timeout = 3

Health Checks

typescript
// /health/live - Is the process running?
// Returns 200 if process is alive

// /health/ready - Can it handle requests?
// Checks:
// - Database connection
// - Redis connection
// - Required services available

// /health/startup - Has it finished initialization?
// Used by Kubernetes to know when to send traffic
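
A minimal readiness-handler sketch, assuming an Express-style router with Prisma and Redis clients injected (the dependency shapes are assumptions):

typescript
// health/router.ts -- illustrative Express-style handlers
import { Router } from 'express';

export function healthRouter(deps: {
  db: { $queryRaw: (q: TemplateStringsArray) => Promise<unknown> };
  redis: { ping(): Promise<string> };
}) {
  const router = Router();

  // Liveness: the process is up and the event loop responds
  router.get('/health/live', (_req, res) => res.sendStatus(200));

  // Readiness: every hard dependency must answer before traffic is sent
  router.get('/health/ready', async (_req, res) => {
    try {
      await deps.db.$queryRaw`SELECT 1`; // database reachable
      await deps.redis.ping();           // redis reachable
      res.sendStatus(200);
    } catch {
      res.sendStatus(503);
    }
  });

  return router;
}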

Test Requirements

Coverage Thresholds

json
{
  "coverageThreshold": {
    "global": {
      "branches": 75,
      "functions": 80,
      "lines": 80,
      "statements": 80
    },
    "src/modules/mlm/domain/**": {
      "branches": 90,
      "functions": 95,
      "lines": 95
    },
    "src/modules/payment/domain/**": {
      "branches": 90,
      "functions": 95,
      "lines": 95
    }
  }
}

Platform-Specific Test Suites

| Suite                | Focus                                          | Trigger               |
|----------------------|------------------------------------------------|-----------------------|
| Commission Tests     | Multi-level calculations, edge cases, rounding | Every PR              |
| Tree Operation Tests | INSERT, MOVE, depth limits, cycle prevention   | Every PR              |
| Payment Integration  | Webhook handling, idempotency, refunds         | Every PR              |
| State Machine Tests  | Order transitions, invalid state prevention    | Every PR              |
| Encryption Tests     | Encrypt/decrypt cycle, key rotation            | Every PR              |
| Load Tests           | Commission calculation under load              | Weekly / Pre-release  |

Load Test Specifications

Load tests validate system performance under realistic and stress conditions.

Target Metrics:

| Scenario                   | Target              | Threshold (Fail if)  |
|----------------------------|---------------------|----------------------|
| Commission calculation     | 100 orders/second   | < 50 orders/second   |
| Concurrent payout requests | 50 requests/second  | < 25 requests/second |
| Tree traversal (10 levels) | < 50ms p95          | > 200ms p95          |
| API response time (p95)    | < 200ms             | > 500ms              |
| Database connections       | Stable at pool max  | Pool exhaustion      |
| Memory usage               | < 512MB per instance| > 1GB                |

Load test scenarios:

typescript
// load-tests/k6/commission-load.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Counter } from 'k6/metrics';

// Custom metric referenced by the thresholds below
const commissionCalculated = new Counter('commission_calculated');

export const options = {
  scenarios: {
    // Sustained load: Normal operation
    sustained: {
      executor: 'constant-arrival-rate',
      rate: 100,              // 100 orders per second
      timeUnit: '1s',
      duration: '5m',
      preAllocatedVUs: 50,
    },
    // Spike: Flash sale simulation
    spike: {
      executor: 'ramping-arrival-rate',
      startRate: 100,
      timeUnit: '1s',
      stages: [
        { duration: '1m', target: 100 },   // Normal
        { duration: '30s', target: 500 },  // Spike to 5x
        { duration: '2m', target: 500 },   // Sustain spike
        { duration: '30s', target: 100 },  // Back to normal
      ],
      preAllocatedVUs: 200,
    },
    // Stress: Find breaking point
    stress: {
      executor: 'ramping-arrival-rate',
      startRate: 50,
      timeUnit: '1s',
      stages: [
        { duration: '2m', target: 200 },
        { duration: '2m', target: 400 },
        { duration: '2m', target: 600 },
        { duration: '2m', target: 800 },   // Find where it breaks
      ],
      preAllocatedVUs: 300,
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<500'],      // 95% of requests < 500ms
    http_req_failed: ['rate<0.01'],        // Error rate < 1%
    'commission_calculated': ['rate>50'],  // At least 50/s processed
  },
};

export default function () {
  const orderPayload = {
    userId: `user_${__VU}_${__ITER}`,
    amount: Math.floor(Math.random() * 10000) + 1000,
    referringPartnerId: 'partner_test_001',
  };

  const res = http.post(
    `${__ENV.API_URL}/api/v1/orders`,
    JSON.stringify(orderPayload),
    { headers: { 'Content-Type': 'application/json' } }
  );

  const ok = check(res, {
    'order created': (r) => r.status === 201,
    'commission triggered': (r) => r.json('commissionJobId') != null,
  });
  if (ok) commissionCalculated.add(1);

  sleep(0.1);
}

Database connection behavior under load:

yaml
# Monitor during load tests
metrics:
  - pg_stat_activity: active connection count
  - pg_stat_activity: queries waiting on locks (wait_event_type = 'Lock')
  - pg_stat_user_tables.seq_scan (should not spike)
  - pg_stat_user_tables.idx_scan (should absorb the load)
  - pg_locks: ungranted locks, i.e. blocked queries (should be zero)

CI integration:

yaml
- name: Run Load Tests (Pre-release)
  if: github.event_name == 'release'
  run: |
    k6 run load-tests/k6/commission-load.js \
      --env API_URL=${{ secrets.STAGING_URL }} \
      --out json=load-test-results.json

    # Fail release if thresholds not met
    node scripts/validate-load-test.js load-test-results.json
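
A sketch of scripts/validate-load-test.js. With --out json=..., k6 writes one JSON object per line, so the script replays the metric points and re-checks the headline thresholds (limits mirror the k6 thresholds above):

typescript
// scripts/validate-load-test.ts -- illustrative parser for k6's NDJSON output
import { readFileSync } from 'fs';

const file = process.argv[2] ?? 'load-test-results.json';
const durations: number[] = [];
let requests = 0;
let failures = 0;

for (const line of readFileSync(file, 'utf8').split('\n')) {
  if (!line.trim()) continue;
  const entry = JSON.parse(line);
  if (entry.type !== 'Point') continue;
  if (entry.metric === 'http_req_duration') durations.push(entry.data.value);
  if (entry.metric === 'http_req_failed') {
    requests += 1;
    failures += entry.data.value; // 1 for a failed request, 0 otherwise
  }
}

durations.sort((a, b) => a - b);
const p95 = durations[Math.floor(durations.length * 0.95)];
const failRate = requests > 0 ? failures / requests : 1;

// Mirror the k6 thresholds: p95 < 500ms, error rate < 1%
if (p95 >= 500 || failRate >= 0.01) {
  console.error(`Load test failed: p95=${p95}ms, error rate=${(failRate * 100).toFixed(2)}%`);
  process.exit(1);
}
console.log(`Load test passed: p95=${p95}ms, error rate=${(failRate * 100).toFixed(2)}%`);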

Monitoring & Notifications

Deploy Notifications

yaml
# Notify on:
- Deploy started (staging/production)
- Deploy succeeded
- Deploy failed
- Rollback triggered
- Manual approval required

# Channels:
- Slack / Telegram
- Email (for failures)
- PagerDuty (for production failures)

Post-Deploy Verification

yaml
# Smoke tests run immediately after deploy:
1. GET /health/ready → 200
2. POST /api/v1/auth/login (test user) → 200 + JWT
3. GET /api/v1/products → 200 + valid response
4. GET /api/v1/mlm/ranks → 200 + valid response

# If any fail → trigger rollback
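
A sketch of the smoke-test runner, assuming Node 18+ (global fetch) and test-user credentials injected via environment variables (the accessToken field name is an assumption):

typescript
// scripts/smoke-tests.ts -- illustrative; endpoints mirror the list above
const BASE = process.env.API_URL!;

async function expectOk(res: Response, label: string) {
  if (!res.ok) {
    console.error(`Smoke test failed: ${label} returned ${res.status}`);
    process.exit(1); // non-zero exit lets the pipeline trigger rollback
  }
}

async function main() {
  await expectOk(await fetch(`${BASE}/health/ready`), 'health/ready');

  const login = await fetch(`${BASE}/api/v1/auth/login`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      email: process.env.SMOKE_TEST_EMAIL,
      password: process.env.SMOKE_TEST_PASSWORD,
    }),
  });
  await expectOk(login, 'auth/login');
  const { accessToken } = await login.json();
  if (!accessToken) {
    console.error('Smoke test failed: login returned no JWT');
    process.exit(1);
  }

  await expectOk(await fetch(`${BASE}/api/v1/products`), 'products');
  await expectOk(await fetch(`${BASE}/api/v1/mlm/ranks`), 'mlm/ranks');
  console.log('All smoke tests passed');
}

main();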

Metrics to Monitor Post-Deploy

| Metric              | Baseline Comparison | Alert Threshold |
|---------------------|---------------------|-----------------|
| Error rate          | vs. previous hour   | > 2x baseline   |
| Response time (p95) | vs. previous hour   | > 1.5x baseline |
| Database query time | vs. previous hour   | > 2x baseline   |
| Memory usage        | vs. previous deploy | > 120%          |
| CPU usage           | vs. previous deploy | > 150%          |

Hotfix Procedure

When production is broken and a rapid fix is needed, the hotfix procedure allows bypassing the normal flow while maintaining safety.

When to Use Hotfix

| Situation                        | Use Hotfix? | Normal Deploy OK? |
|----------------------------------|-------------|-------------------|
| Production down / 500 errors     | Yes         | No                |
| Critical security vulnerability  | Yes         | No                |
| Payment processing broken        | Yes         | No                |
| Commission calculation wrong     | Yes         | No                |
| Minor bug, users unaffected      | No          | Yes               |
| Performance degradation (< 2x)   | No          | Yes               |
| Feature not working as expected  | No          | Yes               |

Hotfix Flow

┌─────────────────────────────────────────────────────────────────┐
│                       HOTFIX PROCEDURE                           │
│                                                                  │
│   1. ASSESS (5 min max)                                          │
│      └─▶ Confirm severity, identify root cause                  │
│      └─▶ Decision: Hotfix vs Rollback vs Wait                   │
│                                                                  │
│   2. BRANCH                                                       │
│      └─▶ git checkout -b hotfix/ISSUE-ID <prod-tag>             │
│      └─▶ NOT from a feature branch or unreleased main           │
│                                                                  │
│   3. FIX                                                          │
│      └─▶ Minimal change only                                    │
│      └─▶ No refactoring                                         │
│      └─▶ No "while we're here" additions                        │
│                                                                  │
│   4. VALIDATE (Abbreviated)                                       │
│      └─▶ Unit tests for changed code                            │
│      └─▶ Type check                                             │
│      └─▶ Manual smoke test                                      │
│      └─▶ Skip: Full E2E, Load tests, SAST                       │
│                                                                  │
│   5. APPROVE                                                      │
│      └─▶ Single reviewer (senior engineer)                      │
│      └─▶ No PR required (direct push with approval)            │
│                                                                  │
│   6. DEPLOY                                                       │
│      └─▶ Direct to production (skip staging)                    │
│      └─▶ Watch metrics for 15 minutes                           │
│                                                                  │
│   7. FOLLOW-UP (within 24 hours)                                 │
│      └─▶ Create proper PR backport to main                      │
│      └─▶ Add regression test                                    │
│      └─▶ Write incident report                                  │
└─────────────────────────────────────────────────────────────────┘

Hotfix Commands

bash
# 1. Create hotfix branch from production tag
git fetch --tags
git checkout -b hotfix/IWM-123-fix-commission-calc v1.2.3

# 2. Make fix
# ... code changes ...

# 3. Run abbreviated tests
npm run test:unit -- --testPathPattern="commission"
npm run type-check

# 4. Deploy directly (requires HOTFIX_APPROVED=true)
HOTFIX_APPROVED=true npm run deploy:production

# 5. Tag the hotfix
git tag -a v1.2.3-hotfix.1 -m "Hotfix: Commission calculation overflow"
git push origin v1.2.3-hotfix.1

# 6. Backport to main via a proper PR (per the follow-up step)
git checkout -b backport/IWM-123-fix-commission-calc main
git cherry-pick <hotfix-commit-sha>
git push origin backport/IWM-123-fix-commission-calc   # then open a PR into main

Hotfix Approval Matrix

| Fix Type           | Approver Required        | Can Self-Approve  |
|--------------------|--------------------------|-------------------|
| Logic fix (no DB)  | 1 senior engineer        | No                |
| Config change      | 1 engineer               | Yes (if on-call)  |
| Database fix       | 2 engineers + DBA        | No                |
| Security fix       | 1 security + 1 engineer  | No                |
| Revert to previous | 1 engineer               | Yes (if on-call)  |

On-Call & Escalation

Escalation Path

┌─────────────────────────────────────────────────────────────────┐
│                      ESCALATION LADDER                           │
│                                                                  │
│   Level 0: Automated                                             │
│   └─▶ Health check fails → Auto-rollback                        │
│   └─▶ Error rate > 5% → Auto-rollback                           │
│   └─▶ Alert sent to #alerts channel                             │
│                                                                  │
│   Level 1: On-Call Engineer (0-15 min)                          │
│   └─▶ Receive PagerDuty alert                                   │
│   └─▶ Acknowledge within 5 minutes                              │
│   └─▶ Assess: Can fix alone? Needs escalation?                  │
│                                                                  │
│   Level 2: Senior Engineer (15-30 min)                          │
│   └─▶ Auto-escalate if L1 doesn't acknowledge                   │
│   └─▶ Join incident call                                        │
│   └─▶ Decision: Hotfix vs Rollback vs External help             │
│                                                                  │
│   Level 3: Engineering Lead + Team (30+ min)                    │
│   └─▶ Multiple engineers on call                                │
│   └─▶ Coordinate with stakeholders                              │
│   └─▶ Customer communication if needed                          │
│                                                                  │
│   Level 4: Executive (Major Incident)                            │
│   └─▶ Extended outage (> 1 hour)                                │
│   └─▶ Data breach / Security incident                           │
│   └─▶ Financial impact (payments affected)                      │
└─────────────────────────────────────────────────────────────────┘

PagerDuty Configuration

yaml
# pagerduty-config.yml
services:
  - name: IWM Production
    escalation_policy: iwm-production
    alert_creation: create_alerts_and_incidents

escalation_policies:
  - name: iwm-production
    rules:
      - escalation_delay_minutes: 5
        targets:
          - type: schedule_reference
            id: primary-oncall

      - escalation_delay_minutes: 15
        targets:
          - type: schedule_reference
            id: senior-oncall

      - escalation_delay_minutes: 30
        targets:
          - type: user_reference
            id: engineering-lead
          - type: user_reference
            id: cto

schedules:
  - name: primary-oncall
    rotation: weekly
    users: [engineer_1, engineer_2, engineer_3, engineer_4]

  - name: senior-oncall
    rotation: weekly
    users: [senior_1, senior_2]

Alert Severity Levels

| Severity      | Response Time     | Examples                                        |
|---------------|-------------------|-------------------------------------------------|
| P1 - Critical | 5 min             | Production down, payments failing, data breach  |
| P2 - High     | 15 min            | Major feature broken, error rate > 5%           |
| P3 - Medium   | 1 hour            | Performance degradation, non-critical errors    |
| P4 - Low      | Next business day | Minor bugs, monitoring alerts                   |

Incident Communication

yaml
# During incident:
channels:
  - "#incident-active"      # Real-time updates (engineers only)
  - "#engineering"          # Status updates (every 30 min)
  - "#general"              # Customer-facing status (if needed)

templates:
  initial_alert: |
    :rotating_light: **INCIDENT DETECTED**
    Severity: {severity}
    Service: {service}
    Description: {description}
    On-call: @{oncall_user}
    Incident channel: #incident-{id}

  status_update: |
    **Incident Update** ({time} since start)
    Status: {investigating|identified|fixing|monitoring|resolved}
    Impact: {impact_description}
    Next update: {eta}

  resolution: |
    :white_check_mark: **INCIDENT RESOLVED**
    Duration: {duration}
    Root cause: {root_cause}
    Fix applied: {fix_description}
    Follow-up: {follow_up_ticket}

Secrets Rotation

Rotation Schedule

| Secret                | Rotation Frequency | Auto-Rotate     | Downtime Required    |
|-----------------------|--------------------|-----------------|----------------------|
| JWT_SECRET            | 90 days            | No              | No (dual-key period) |
| ENCRYPTION_MASTER_KEY | 180 days           | No              | Yes (re-encryption)  |
| Database password     | 90 days            | Yes (via cloud) | No                   |
| API keys (external)   | 365 days           | Varies          | No                   |
| Webhook secrets       | 180 days           | No              | Coordination needed  |

JWT Secret Rotation (Zero Downtime)

typescript
// Support dual JWT secrets during rotation
import jwt, { JwtPayload } from 'jsonwebtoken';

const jwtConfig = {
  // Current secret (for signing new tokens)
  current: process.env.JWT_SECRET,

  // Previous secret (for validating old tokens during rotation)
  previous: process.env.JWT_SECRET_PREVIOUS || null,

  // Rotation window (how long to accept old tokens)
  rotationWindowDays: 7,
};

// Validation accepts either secret
function validateToken(token: string): JwtPayload {
  try {
    return jwt.verify(token, jwtConfig.current);
  } catch (e) {
    if (jwtConfig.previous) {
      return jwt.verify(token, jwtConfig.previous);
    }
    throw e;
  }
}

Rotation procedure:

yaml
jwt_rotation_steps:
  1. Generate new JWT_SECRET
  2. Set JWT_SECRET_PREVIOUS = current JWT_SECRET
  3. Set JWT_SECRET = new secret
  4. Deploy (both secrets now valid)
  5. Wait 7 days (tokens expire, refresh uses new secret)
  6. Remove JWT_SECRET_PREVIOUS
  7. Deploy final config

Encryption Key Rotation

Encryption key rotation requires re-encrypting existing data.

typescript
// Key versioning in encrypted data
interface EncryptedField {
  version: string;      // 'v1', 'v2', etc.
  ciphertext: string;
  iv: string;
}

// Decryption with version support
async function decrypt(field: EncryptedField): Promise<string> {
  const key = getKeyByVersion(field.version);
  return decryptWithKey(field.ciphertext, field.iv, key);
}

// Background re-encryption job
async function reEncryptAllData(fromVersion: string, toVersion: string) {
  const batchSize = 1000;
  let processed = 0;

  while (true) {
    // Always take the first matching batch: re-encrypted rows no longer match
    // the version filter, so paginating with an increasing offset would skip records.
    const records = await db.taxIdentification.findMany({
      where: { encryptedData: { path: ['version'], equals: fromVersion } },
      take: batchSize,
    });

    if (records.length === 0) break;

    for (const record of records) {
      const decrypted = await decrypt(record.encryptedData);
      const reEncrypted = await encrypt(decrypted, toVersion);

      await db.taxIdentification.update({
        where: { id: record.id },
        data: { encryptedData: reEncrypted },
      });
    }

    processed += records.length;
    logger.info(`Re-encrypted ${processed} records`);
  }
}

Rotation procedure:

yaml
encryption_key_rotation:
  1. Generate new key (ENCRYPTION_MASTER_KEY_V2)
  2. Deploy with both keys available
  3. New encryptions use v2
  4. Run background re-encryption job
  5. Monitor job completion (may take hours)
  6. Verify all records are v2
  7. Remove old key from config
  8. Securely delete old key from secrets manager

Webhook Secret Rotation

External webhooks (payment providers) require coordination.

yaml
webhook_rotation:
  stripe:
    1. Generate new webhook in Stripe dashboard
    2. Add new STRIPE_WEBHOOK_SECRET_V2 to config
    3. Deploy (accept both secrets)
    4. Disable old webhook in Stripe dashboard
    5. Remove old secret from config
    coordination: Self-service, no downtime

  payment_provider:
    1. Contact provider support
    2. Schedule rotation window
    3. Provider sends test webhook with new secret
    4. Verify receipt
    5. Provider switches to new secret
    6. Update config
    coordination: Provider-dependent, may require window
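
A sketch of the "accept both secrets" step, using a generic HMAC-SHA256 signature check (the header format, env var names, and hex encoding are assumptions; for Stripe specifically the SDK's own verification would be used instead):

typescript
// webhooks/verify-signature.ts -- illustrative dual-secret check during rotation
import { createHmac, timingSafeEqual } from 'crypto';

function matches(payload: string, signature: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(payload).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  return a.length === b.length && timingSafeEqual(a, b);
}

export function verifyWebhook(payload: string, signature: string): boolean {
  const secrets = [
    process.env.WEBHOOK_SECRET,          // new secret (v2)
    process.env.WEBHOOK_SECRET_PREVIOUS, // old secret; removed once rotation completes
  ].filter((s): s is string => Boolean(s));

  // Accept a signature produced with either secret while both are configured
  return secrets.some((secret) => matches(payload, signature, secret));
}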

Implementation Priority

Required (Phase 1)

| Component            | Reason                           |
|----------------------|----------------------------------|
| Build + Type check   | Basic correctness                |
| Unit tests (80%)     | Financial calculation accuracy   |
| Integration tests    | Database operations correctness  |
| npm audit            | OWASP A06 compliance             |
| Secret scanning      | Prevent credential leaks         |
| Migration validation | Database integrity               |
| Health checks        | Deployment verification          |
| Staging environment  | Pre-production validation        |

Should Have (Phase 2)

| Component                  | Reason                       |
|----------------------------|------------------------------|
| E2E tests (critical flows) | User journey validation      |
| Container scanning (Trivy) | Infrastructure security      |
| SAST (CodeQL)              | Code-level vulnerabilities   |
| Blue-green deployment      | Zero-downtime releases       |
| Automatic rollback         | Fast recovery                |
| Deploy notifications       | Team awareness               |
| Coverage gates             | Prevent regression           |

Nice to Have (Phase 3)

| Component              | Reason                    |
|------------------------|---------------------------|
| Preview environments   | PR testing                |
| Load testing           | Performance validation    |
| Visual regression      | UI consistency            |
| Auto-changelog         | Release documentation     |
| Dependency auto-update | Maintenance automation    |
| Feature flags          | Gradual rollouts          |
| Chaos testing          | Resilience verification   |
Chaos testingResilience verification