CI/CD Pipeline Standards
Overview
The IWM platform CI/CD pipeline ensures code quality, security, and reliable deployments for a financial system handling MLM commissions, payments, and sensitive user data.
Pipeline Goals
| Goal | Implementation |
|---|---|
| Correctness | Comprehensive test suites for financial calculations |
| Security | Multi-layer scanning (dependencies, secrets, containers) |
| Reliability | Zero-downtime deployments with rollback capability |
| Speed | Parallel jobs, Docker layer caching, smart change detection |
| Auditability | Version tracking, deployment logs, change documentation |
Pipeline Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ PR VALIDATION │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Build │ │ Lint │ │ Type │ │ Secret │ │
│ │ Check │ │ Format │ │ Check │ │ Scan │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │ │
│ └─────────────────┴────────┬────────┴─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ TEST SUITE │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Unit │ │Integration│ │ E2E │ │ Contract │ │ │
│ │ │ (80%+) │ │ (DB/API) │ │ (Critical)│ │ (API) │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ SECURITY GATES │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ npm audit │ │ gitleaks │ │ SAST │ │ Trivy │ │ │
│ │ │ (high+) │ │ (secrets) │ │ (CodeQL) │ │ (container│ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ DATABASE VALIDATION │ │
│ │ ┌─────────────────────┐ ┌─────────────────────────────────┐ │ │
│ │ │ Migration dry-run │ │ Schema drift detection │ │ │
│ │ └─────────────────────┘ └─────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
│ Merge to main
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ RELEASE PIPELINE │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Version │ │ Build │ │ Push │ │ Generate │ │
│ │ Bump │──▶│ Images │──▶│ Registry │──▶│ Changelog │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ DEPLOY TO STAGING │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Run │ │ Deploy │ │ Health │ │ Smoke │ │ │
│ │ │Migrations │──▶│ (B/G) │──▶│ Check │──▶│ Tests │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ Manual Approval │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ DEPLOY TO PRODUCTION │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Run │ │Blue-Green │ │ Health │ │ Smoke │ │ │
│ │ │Migrations │──▶│ Deploy │──▶│ Check │──▶│ + Notify │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
PR Validation Workflow
Every pull request must pass all gates before merge is allowed.
Stage 1: Build & Static Analysis
| Check | Tool | Failure Threshold |
|---|---|---|
| TypeScript compilation | tsc --noEmit | Any error |
| ESLint | eslint | Any error (warnings allowed) |
| Prettier | prettier --check | Any formatting issue |
| Stylelint (frontend) | stylelint | Any error |
Stage 2: Test Suite
| Test Type | Scope | Coverage Requirement |
|---|---|---|
| Unit Tests | Domain logic, utilities, pure functions | 80% minimum |
| Integration Tests | Database operations, API endpoints, Redis | 70% minimum |
| E2E Tests | Critical user flows (see below) | Must pass |
| Contract Tests | API schema validation (see below) | Must pass |
Contract Tests Specification
Contract tests validate that the API implementation matches its specification and that changes don't break consumers.
What we validate:
| Aspect | Tool | Description |
|---|---|---|
| OpenAPI compliance | @apidevtools/swagger-cli validate | Schema is valid OpenAPI 3.1 |
| Response shape | Custom Jest matchers | Responses match declared schemas |
| Breaking changes | openapi-diff | Detect removed endpoints, changed types |
| Request validation | Class-validator + Zod | Input DTOs match OpenAPI parameters |
Contract test implementation:
// test/contract/api-contract.spec.ts
// Note: validateRequest/validateResponse below stand in for the spec-validation
// helper the project wires up around its OpenAPI document.
import request from 'supertest';
import { app } from '../../src/app'; // application under test
import { OpenAPIValidator } from 'express-openapi-validator';
import spec from '../openapi.json';
describe('API Contract Tests', () => {
const validator = new OpenAPIValidator({ spec });
describe('POST /api/v1/auth/register', () => {
// Shared fixture so both tests exercise the same payload
const validRequest = {
email: 'test@example.com',
password: 'SecurePass123',
referralCode: 'ABC123'
};
it('should match request schema', async () => {
expect(() => validator.validateRequest({
path: '/api/v1/auth/register',
method: 'post',
body: validRequest
})).not.toThrow();
});
it('should match response schema', async () => {
const response = await request(app)
.post('/api/v1/auth/register')
.send(validRequest);
expect(() => validator.validateResponse({
path: '/api/v1/auth/register',
method: 'post',
statusCode: response.status,
body: response.body
})).not.toThrow();
});
});
});
Breaking change detection in CI:
- name: Check for Breaking API Changes
run: |
# Compare current spec with main branch
git show origin/main:openapi.json > openapi-main.json
npx openapi-diff openapi-main.json openapi.json --fail-on-incompatible
# Incompatible changes (will fail):
# - Removing endpoints
# - Removing required response fields
# - Adding required request fields
# - Changing field types
# Compatible changes (allowed):
# - Adding new endpoints
# - Adding optional request fields
# - Adding response fields
Critical E2E Flows (must always be tested):
1. Authentication Flow
- Registration with referral code
- Login with 2FA
- Password reset
- Session management
2. Order & Payment Flow
- Add to cart → Checkout → Payment → Confirmation
- Order status transitions (full state machine)
- Payment webhook processing
3. Commission Calculation Flow
- Order completion → Commission generation
- Multi-level distribution (up to 10 levels)
- Commission approval → Payout
4. MLM Tree Operations
- Partner registration under sponsor
- Tree traversal queries
- Rank qualification check
Stage 3: Security Gates
| Scan | Tool | Action on Failure |
|---|---|---|
| Dependency vulnerabilities | npm audit --audit-level=high | Block merge |
| Secret detection | gitleaks | Block merge |
| Static Application Security Testing | CodeQL / SonarQube | Block on high/critical |
| Container vulnerabilities | Trivy | Block on critical |
Stage 4: Database Validation
# Migration dry-run against test database
- name: Validate Migrations
run: |
# Create temporary database
createdb iwm_migration_test
# Run all migrations
npx prisma migrate deploy
# Validate schema matches Prisma schema
npx prisma db pull --force
npx prisma validate
# Cleanup
dropdb iwm_migration_test
Release Workflow
Triggered on merge to main branch.
Version Management
Semantic Versioning: MAJOR.MINOR.PATCH
Auto-bump rules:
- Merge to main → PATCH increment (1.2.3 → 1.2.4)
- Manual MINOR change → Skip auto-bump, use manual version
- Manual MAJOR change → Skip auto-bump, use manual version
Version stored in: VERSION file (root)
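A minimal sketch of the PATCH auto-bump step, assuming a plain-text VERSION file containing a MAJOR.MINOR.PATCH string (the script name and path are illustrative):
// scripts/bump-patch.ts (hypothetical path): bump PATCH on merge to main
import { readFileSync, writeFileSync } from 'node:fs';

const current = readFileSync('VERSION', 'utf8').trim();       // e.g. "1.2.3"
const [major, minor, patch] = current.split('.').map(Number);
const next = `${major}.${minor}.${patch + 1}`;                 // -> "1.2.4"

writeFileSync('VERSION', `${next}\n`);
console.log(`Version bumped: ${current} -> ${next}`);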
Build Stage
| Step | Description |
|---|---|
| Read VERSION | Get current or bumped version |
| Build Docker images | Backend + Frontend separately |
| Tag images | v{VERSION} + latest |
| Push to registry | GitHub Container Registry (ghcr.io) |
| Cache layers | type=gha,mode=max for faster rebuilds |
Docker Layer Caching
# Each Dockerfile instruction creates a layer with SHA256 hash
# Unchanged layers are reused from cache
FROM node:20-alpine # Layer sha256:a1b2... (cached if unchanged)
COPY package*.json ./ # Layer sha256:c3d4... (cached if package.json same)
RUN npm ci # Layer sha256:e5f6... (cached if dependencies same)
COPY . . # Layer sha256:f7a8... (changes on code change)
RUN npm run build # Layer sha256:b9c0... (rebuilds if code changed)
Cache configuration:
cache-from: type=gha # Pull from GitHub Actions cache
cache-to: type=gha,mode=max # Push ALL layers (not just final)
Security Gates Detail
Dependency Scanning
# Run on every PR and weekly scheduled scan
- name: Dependency Audit
run: npm audit --audit-level=high
- name: Snyk Scan
uses: snyk/actions/node@master
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
Secret Scanning
- name: Gitleaks Scan
uses: gitleaks/gitleaks-action@v2
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Protected patterns:
| Pattern | Example |
|---|---|
| API keys | sk_live_*, pk_live_* |
| Database URLs | postgresql://*:*@* |
| JWT secrets | JWT_SECRET=* |
| Encryption keys | ENCRYPTION_MASTER_KEY=* |
| Webhook secrets | *_WEBHOOK_SECRET=* |
Container Scanning
- name: Trivy Scan
uses: aquasecurity/trivy-action@master
with:
image-ref: ${{ env.IMAGE }}
severity: 'CRITICAL,HIGH'
exit-code: '1' # Fail on critical/high
Database Migration Strategy
Migration Validation (PR)
1. Dry-run against empty database
2. Dry-run against production clone (staging)
3. Check for destructive operations (DROP, TRUNCATE); see the sketch after this list
4. Estimate migration duration
5. Flag migrations requiring maintenance window
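A sketch of the destructive-operation check from step 3, assuming Prisma's one-directory-per-migration layout with a migration.sql file in each directory (the script path is illustrative):
// scripts/check-destructive-migrations.ts (hypothetical path)
import { existsSync, readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

const DESTRUCTIVE = /\b(DROP\s+TABLE|DROP\s+COLUMN|TRUNCATE)\b/i;
const MIGRATIONS_DIR = 'prisma/migrations';
const offenders: string[] = [];

for (const entry of readdirSync(MIGRATIONS_DIR, { withFileTypes: true })) {
  if (!entry.isDirectory()) continue;
  const file = join(MIGRATIONS_DIR, entry.name, 'migration.sql');
  if (!existsSync(file)) continue;
  if (DESTRUCTIVE.test(readFileSync(file, 'utf8'))) offenders.push(file);
}

if (offenders.length > 0) {
  console.error('Destructive operations detected, manual approval required:');
  offenders.forEach((f) => console.error(`  ${f}`));
  process.exit(1); // non-zero exit blocks the PR check
}
In practice the check would target only migrations added in the PR; scanning the whole directory is shown for brevity.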
Migration Execution (Deploy)
┌─────────────────────────────────────────────────────────────────┐
│ MIGRATION EXECUTION │
│ │
│ 1. Create backup snapshot │
│ └─▶ pg_dump iwm_production > backup_$(date).sql │
│ │
│ 2. Run migrations │
│ └─▶ npx prisma migrate deploy │
│ │
│ 3. Validate schema │
│ └─▶ npx prisma validate │
│ │
│ 4. Health check │
│ └─▶ curl /health/ready │
│ │
│ 5. On failure: Restore from backup │
│ └─▶ psql < backup_$(date).sql │
└─────────────────────────────────────────────────────────────────┘
Important: Prisma migrate deploy runs each migration file in its own transaction automatically. Do NOT wrap it in BEGIN/COMMIT manually.
DDL Transaction Limitations
Some PostgreSQL DDL operations cannot run inside a transaction. These require special handling:
| Operation | Transaction Support | Handling |
|---|---|---|
| CREATE INDEX CONCURRENTLY | No | Separate migration file, run manually |
| ALTER TYPE (enum add value) | No (PG < 12) | Separate migration, requires downtime on old PG |
| DROP INDEX CONCURRENTLY | No | Separate migration file |
| REINDEX CONCURRENTLY | No | Run during maintenance window |
For non-transactional migrations:
-- prisma/migrations/20240115_add_index_concurrently_non_transactional.sql
-- @non-transactional
CREATE INDEX CONCURRENTLY idx_orders_created
ON product.orders(created_at);
# Pipeline handles non-transactional migrations separately
- name: Run Non-Transactional Migrations
run: |
for file in prisma/migrations/*_non_transactional.sql; do
psql $DATABASE_URL -f "$file" || exit 1
done
Destructive Migration Policy
| Operation | Requirement |
|---|---|
| DROP TABLE | Requires manual approval + backup verification |
| DROP COLUMN | Must be preceded by code removal in previous release |
| TRUNCATE | Prohibited in production migrations |
| ALTER TYPE | Requires maintenance window |
Backup Verification Testing
Backups are worthless if they can't be restored. Regular verification ensures recovery is actually possible.
Verification schedule:
| Environment | Frequency | Method |
|---|---|---|
| Production | Weekly | Full restore to isolated instance |
| Staging | Monthly | Full restore test |
| After migration | Immediate | Spot check critical tables |
Automated backup verification job:
# .github/workflows/backup-verification.yml
name: Backup Verification
on:
schedule:
- cron: '0 4 * * 0' # Weekly Sunday 4AM
workflow_dispatch:
jobs:
verify-backup:
runs-on: ubuntu-latest
steps:
- name: Create Fresh Backup
run: |
pg_dump $PRODUCTION_URL \
--format=custom \
--file=backup-$(date +%Y%m%d).dump
- name: Spin Up Isolated Instance
run: |
docker run -d \
--name pg-verify \
-e POSTGRES_PASSWORD=verify_test \
-p 5433:5432 \
postgres:15
sleep 10 # Wait for startup
- name: Restore Backup
env:
PGPASSWORD: verify_test
run: |
pg_restore \
--host=localhost \
--port=5433 \
--username=postgres \
--dbname=postgres \
--clean \
--if-exists \
backup-$(date +%Y%m%d).dump
- name: Verify Data Integrity
run: |
PGPASSWORD=verify_test psql \
-h localhost -p 5433 -U postgres -d postgres \
-f scripts/verify-backup-integrity.sql
- name: Verify Row Counts
run: |
# Compare row counts with production
node scripts/compare-backup-counts.js
- name: Cleanup
if: always()
run: docker rm -f pg-verify
- name: Report Results
if: always()
run: |
if [ "${{ job.status }}" == "success" ]; then
curl -X POST $SLACK_WEBHOOK \
-d '{"text": ":white_check_mark: Backup verification passed"}'
else
curl -X POST $SLACK_WEBHOOK \
-d '{"text": ":x: Backup verification FAILED - investigate immediately"}'
fi
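The Verify Row Counts step above calls scripts/compare-backup-counts.js; a TypeScript sketch of what that comparison might do, assuming the pg client library and a 1% drift tolerance (both assumptions):
// Sketch of scripts/compare-backup-counts.js
import { Client } from 'pg';

const TABLES = ['core.users', 'mlm.partners', 'mlm.commission_transactions', 'product.orders'];
const TOLERANCE = 0.01; // rows written since the backup was taken; 1% drift allowed

async function rowCount(connectionString: string, table: string): Promise<number> {
  const client = new Client({ connectionString });
  await client.connect();
  const { rows } = await client.query(`SELECT COUNT(*)::bigint AS n FROM ${table}`);
  await client.end();
  return Number(rows[0].n);
}

(async () => {
  const restoredUrl = 'postgresql://postgres:verify_test@localhost:5433/postgres';
  for (const table of TABLES) {
    const prod = await rowCount(process.env.PRODUCTION_URL!, table);
    const restored = await rowCount(restoredUrl, table);
    const drift = prod === 0 ? 0 : Math.abs(prod - restored) / prod;
    console.log(`${table}: production=${prod} restored=${restored}`);
    if (drift > TOLERANCE) {
      console.error(`Row count drift ${(drift * 100).toFixed(2)}% exceeds tolerance for ${table}`);
      process.exit(1);
    }
  }
})();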
Integrity verification script:
-- scripts/verify-backup-integrity.sql
-- Check critical tables exist and have data
DO $$
DECLARE
v_count INT;
BEGIN
-- Users table
SELECT COUNT(*) INTO v_count FROM core.users;
IF v_count = 0 THEN
RAISE EXCEPTION 'CRITICAL: users table is empty';
END IF;
RAISE NOTICE 'users: % rows', v_count;
-- Partners table
SELECT COUNT(*) INTO v_count FROM mlm.partners;
RAISE NOTICE 'partners: % rows', v_count;
-- Commission transactions
SELECT COUNT(*) INTO v_count FROM mlm.commission_transactions;
RAISE NOTICE 'commission_transactions: % rows', v_count;
-- Orders table
SELECT COUNT(*) INTO v_count FROM product.orders;
RAISE NOTICE 'orders: % rows', v_count;
-- Verify foreign key relationships intact
SELECT COUNT(*) INTO v_count
FROM mlm.partners p
LEFT JOIN core.users u ON p.user_id = u.id
WHERE u.id IS NULL;
IF v_count > 0 THEN
RAISE EXCEPTION 'CRITICAL: % orphaned partner records', v_count;
END IF;
-- Verify tree integrity
SELECT COUNT(*) INTO v_count
FROM mlm.partner_tree_paths ptp
LEFT JOIN mlm.partners p ON ptp.ancestor_id = p.id
WHERE p.id IS NULL;
IF v_count > 0 THEN
RAISE EXCEPTION 'CRITICAL: % orphaned tree path records', v_count;
END IF;
RAISE NOTICE 'All integrity checks passed';
END $$;
Recovery time tracking:
| Metric | Target | Alert If |
|---|---|---|
| Backup creation time | < 30 min | > 1 hour |
| Restore time | < 1 hour | > 2 hours |
| Verification time | < 15 min | > 30 min |
| Total RTO | < 2 hours | > 4 hours |
Environment Strategy
Environment Matrix
| Environment | Purpose | Deploy Trigger | Data |
|---|---|---|---|
| Development | Local development | Manual | Seed data |
| CI | Pipeline testing | Every PR | Fresh per run |
| Staging | Pre-production validation | Merge to main | Production clone (anonymized) |
| Production | Live system | Manual approval | Real data |
Staging Data Anonymization
Production data cloned to staging must be anonymized to protect user privacy and comply with GDPR/data protection requirements.
Anonymization rules by data type:
| Data Category | Field | Anonymization Method |
|---|---|---|
| PII | email | faker.email() with original domain preserved |
| PII | phone | +7900${random7digits} |
| PII | first_name, last_name | faker.name() |
| PII | address fields | faker.address() |
| Financial | bank_account | ****${last4} (masked) |
| Financial | card_number | Completely removed |
| Financial | balance amounts | Preserved (not PII) |
| KYC | passport_number | XX${random8digits} |
| KYC | tax_id | ${random12digits} |
| KYC | document_urls | Replaced with placeholder images |
| Auth | password_hash | Set to known test password hash |
| Auth | session tokens | Deleted |
| Auth | 2fa_secrets | Deleted |
| Audit | ip_address | 192.168.x.x (private range) |
| Audit | user_agent | Preserved (not PII) |
Anonymization script:
-- scripts/anonymize-staging.sql
-- Run after cloning production to staging
BEGIN;
-- Users table
UPDATE core.users SET
email = 'user_' || id::text || '@staging.iwm.local',
phone = '+7900' || LPAD(FLOOR(RANDOM() * 10000000)::TEXT, 7, '0'),
password_hash = '$2b$10$staging.password.hash.for.testing'; -- Password: "staging123"
-- User profiles
UPDATE core.user_profiles SET
first_name = 'Test',
last_name = 'User_' || SUBSTRING(user_id::text, 1, 8),
middle_name = NULL;
-- Addresses
UPDATE product.addresses SET
first_name = 'Test',
last_name = 'User',
address_line1 = FLOOR(RANDOM() * 100)::TEXT || ' Test Street',
address_line2 = 'Apt ' || FLOOR(RANDOM() * 100)::TEXT,
city = 'Test City',
phone = '+7900' || LPAD(FLOOR(RANDOM() * 10000000)::TEXT, 7, '0');
-- KYC documents
UPDATE core.kyc_verifications SET
document_number = 'XX' || LPAD(FLOOR(RANDOM() * 100000000)::TEXT, 8, '0');
UPDATE core.kyc_documents SET
file_url = 'https://staging-assets.iwm.local/placeholder-document.pdf',
file_name = 'anonymized_document.pdf';
-- Payout details (sensitive bank info)
UPDATE mlm.payout_requests SET
payout_details = jsonb_set(
payout_details,
'{account_number}',
to_jsonb('****' || RIGHT(payout_details->>'account_number', 4))
)
WHERE payout_details ? 'account_number';
-- Delete sensitive auth data
DELETE FROM core.sessions;
DELETE FROM core.two_factor_auth;
-- Audit logs - anonymize IPs
UPDATE core.audit_log SET
ip_address = ('192.168.' || (RANDOM() * 255)::INT || '.' || (RANDOM() * 255)::INT)::INET;
-- Partner referral links - update domain
UPDATE mlm.referral_links SET
full_url = REPLACE(full_url, 'iwm.com', 'staging.iwm.local');
COMMIT;
-- Verify no production data leaked
DO $$
BEGIN
-- Check no real emails remain
IF EXISTS (SELECT 1 FROM core.users WHERE email NOT LIKE '%@staging.iwm.local') THEN
RAISE EXCEPTION 'Anonymization failed: real emails found';
END IF;
-- Check no real phone numbers remain
IF EXISTS (SELECT 1 FROM core.users WHERE phone NOT LIKE '+7900%') THEN
RAISE EXCEPTION 'Anonymization failed: real phone numbers found';
END IF;
END $$;
Automated staging refresh:
# .github/workflows/staging-refresh.yml
name: Refresh Staging Data
on:
schedule:
- cron: '0 3 * * 0' # Weekly Sunday 3AM
workflow_dispatch: # Manual trigger
jobs:
refresh-staging:
runs-on: ubuntu-latest
steps:
- name: Create Production Snapshot
run: |
pg_dump $PRODUCTION_URL \
--no-owner \
--no-privileges \
> production-snapshot.sql
- name: Restore to Staging
run: |
# Drop all application schemas (the production dump recreates core, mlm, product)
psql $STAGING_URL -c "DROP SCHEMA IF EXISTS public, core, mlm, product CASCADE; CREATE SCHEMA public;"
psql $STAGING_URL < production-snapshot.sql
- name: Run Anonymization
run: psql $STAGING_URL < scripts/anonymize-staging.sql
- name: Verify Anonymization
run: node scripts/verify-anonymization.js
- name: Notify Team
run: |
curl -X POST $SLACK_WEBHOOK \
-d '{"text": "Staging environment refreshed with anonymized production data"}'
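The Verify Anonymization step calls scripts/verify-anonymization.js; a sketch of the checks it might run, mirroring the SQL assertions above (the pg client usage is an assumption):
// Sketch of scripts/verify-anonymization.js
import { Client } from 'pg';

(async () => {
  const client = new Client({ connectionString: process.env.STAGING_URL });
  await client.connect();

  // Each entry: [failure description, query that must return 0 rows]
  const checks: Array<[string, string]> = [
    ['real emails remain', "SELECT COUNT(*) AS n FROM core.users WHERE email NOT LIKE '%@staging.iwm.local'"],
    ['real phone numbers remain', "SELECT COUNT(*) AS n FROM core.users WHERE phone NOT LIKE '+7900%'"],
    ['sessions were not deleted', 'SELECT COUNT(*) AS n FROM core.sessions'],
    ['2FA secrets were not deleted', 'SELECT COUNT(*) AS n FROM core.two_factor_auth'],
  ];

  let failed = false;
  for (const [label, sql] of checks) {
    const { rows } = await client.query(sql);
    if (Number(rows[0].n) > 0) {
      console.error(`Anonymization check failed: ${label} (${rows[0].n} rows)`);
      failed = true;
    }
  }

  await client.end();
  process.exit(failed ? 1 : 0);
})();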
Access control for staging:
| Role | Staging Access | Can See |
|---|---|---|
| Developer | Full access | Anonymized data only |
| QA | Full access | Anonymized data only |
| Support | No access | — |
| External contractor | No access | — |
Environment Parity
# All environments use identical:
- Docker images (same SHA)
- Database schema (same migrations)
- Environment variable structure (different values)
- Infrastructure configuration (scaled differently)
Environment Variables Validation
// Validated at application startup using Zod
// CI must validate all required variables are defined
import { z } from 'zod';

const envSchema = z.object({
NODE_ENV: z.enum(['development', 'staging', 'production']),
DATABASE_URL: z.string().url(),
REDIS_URL: z.string().url(),
JWT_SECRET: z.string().min(32),
ENCRYPTION_MASTER_KEY: z.string().min(32),
// ... all other required variables
});

// Fail fast at startup if anything is missing or malformed
export const env = envSchema.parse(process.env);
Deployment Strategy
Blue-Green Deployment
┌─────────────────────────────────────────────────────────────────┐
│ BLUE-GREEN DEPLOY │
│ │
│ Current State: │
│ ┌─────────────┐ │
│ │ BLUE │ ◀── Load Balancer ◀── Traffic │
│ │ (v1.2.3) │ │
│ └─────────────┘ │
│ ┌─────────────┐ │
│ │ GREEN │ (idle) │
│ │ (v1.2.3) │ │
│ └─────────────┘ │
│ │
│ Deploy v1.2.4: │
│ 1. Deploy to GREEN │
│ 2. Run health checks on GREEN │
│ 3. Run smoke tests on GREEN │
│ 4. Switch traffic to GREEN │
│ 5. Monitor for errors │
│ 6. If errors: Switch back to BLUE (rollback) │
│ 7. If stable: Update BLUE to v1.2.4 (sync) │
└─────────────────────────────────────────────────────────────────┘
Rollback Procedure
# Automatic rollback triggers:
- Health check failure (3 consecutive)
- Error rate > 5% (compared to baseline)
- Response time > 2x baseline
# Rollback steps:
1. Switch load balancer to previous version
2. Alert on-call engineer
3. Preserve logs for investigation
4. Do NOT run backward migrations automatically
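A sketch of how the automatic triggers above could be evaluated; the metric names and their source (Prometheus, APM, etc.) are assumptions:
// Rollback decision helper (illustrative)
interface DeployMetrics {
  consecutiveHealthCheckFailures: number;
  errorRate: number;           // e.g. 0.07 = 7%
  baselineErrorRate: number;   // same window, previous version
  p95ResponseTimeMs: number;
  baselineP95Ms: number;
}

// Returns the rollback reason, or null to keep the new version
function shouldRollback(m: DeployMetrics): string | null {
  if (m.consecutiveHealthCheckFailures >= 3) {
    return 'health check failed 3 times in a row';
  }
  if (m.errorRate > 0.05 && m.errorRate > m.baselineErrorRate) {
    return `error rate ${(m.errorRate * 100).toFixed(1)}% exceeds the 5% threshold`;
  }
  if (m.p95ResponseTimeMs > 2 * m.baselineP95Ms) {
    return 'p95 response time exceeds 2x baseline';
  }
  return null;
}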
Rollback Window
| Phase | Duration | BLUE Status | Action if Issues |
|---|---|---|---|
| Immediate | 0-15 min | Running, no traffic | Instant switch back |
| Short-term | 15 min - 2 hours | Running, warm standby | Quick rollback (< 1 min) |
| Medium-term | 2-24 hours | Stopped, image preserved | Restart BLUE, switch traffic |
| Long-term | > 24 hours | Terminated | Redeploy previous version |
Data divergence consideration:
- Rollback within 2 hours: Minimal data divergence, safe to rollback
- Rollback after 2 hours: Audit new data created, may need manual reconciliation
- Rollback after 24 hours: Requires data migration plan, not automatic
# Rollback window configuration
rollback:
instant_window: 15m # BLUE kept running
warm_standby: 2h # BLUE stopped but not terminated
image_retention: 24h # Previous image kept in registry
max_auto_rollback: 2h # After this, manual approval required
Database Connection Management During Deploy
During blue-green deployment, both versions may run simultaneously. This requires careful connection pool management.
┌─────────────────────────────────────────────────────────────────┐
│ CONNECTION POOL DURING DEPLOY │
│ │
│ Database Pool Limit: 100 connections │
│ │
│ Normal Operation: │
│ ┌─────────────┐ │
│ │ BLUE │ ─── 50 connections ───▶ ┌──────────────┐ │
│ │ (active) │ │ │ │
│ └─────────────┘ │ PostgreSQL │ │
│ │ │ │
│ During Deploy (both running): │ Pool: 100 │ │
│ ┌─────────────┐ │ │ │
│ │ BLUE │ ─── 40 connections ───▶ │ │ │
│ │ (draining) │ │ │ │
│ └─────────────┘ │ │ │
│ ┌─────────────┐ │ │ │
│ │ GREEN │ ─── 40 connections ───▶ │ │ │
│ │ (starting) │ └──────────────┘ │
│ └─────────────┘ │
│ │
│ Reserve: 20 connections for migrations & admin │
└─────────────────────────────────────────────────────────────────┘
Connection pool configuration:
// config/database.ts
const poolConfig = {
// Normal operation
default: {
min: 5,
max: 50,
},
// During deployment (detected via DEPLOY_MODE env)
deployment: {
min: 2,
max: 40, // Reduced to allow overlap
},
};
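Continuing the config above, a small sketch of how the pool limits could be selected at startup; the DEPLOY_MODE value ('true') is an assumption:
// config/database.ts (continued): pick limits based on the deployment flag
const isDeploying = process.env.DEPLOY_MODE === 'true';
export const activePool = isDeploying ? poolConfig.deployment : poolConfig.default;
// activePool.min / activePool.max are then passed to the connection pool
// (pg Pool options, or Prisma's connection_limit), leaving headroom for the
// overlapping BLUE/GREEN instances and the reserved migration connections.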
Deploy sequence to prevent connection exhaustion:
deploy_steps:
1. Set GREEN pool to deployment mode (max: 40)
2. Start GREEN instances
3. Wait for GREEN health check
4. Run migrations (uses reserved connections)
5. Gradually shift traffic (10% → 50% → 100%)
6. Set BLUE to drain mode (stop accepting new connections)
7. Wait for BLUE connections to close (max 30s)
8. Stop BLUE instances
9. Set GREEN pool to normal mode (max: 50)
PgBouncer recommended for production:
# pgbouncer.ini
[pgbouncer]
pool_mode = transaction
max_client_conn = 200
default_pool_size = 50
reserve_pool_size = 10
reserve_pool_timeout = 3
Health Checks
// /health/live - Is the process running?
// Returns 200 if process is alive
// /health/ready - Can it handle requests?
// Checks:
// - Database connection
// - Redis connection
// - Required services available
// /health/startup - Has it finished initialization?
// Used by Kubernetes to know when to send traffic
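A minimal sketch of the three probes using Express; the route paths match the contract above, while the prisma/redis dependency checks and module paths are assumptions:
// src/health.ts (illustrative)
import { Router } from 'express';
import { prisma } from './db';    // assumed Prisma client instance
import { redis } from './redis';  // assumed ioredis client instance

export function healthRouter(isStarted: () => boolean): Router {
  const router = Router();

  // Liveness: the process is up and the event loop responds
  router.get('/health/live', (_req, res) => res.status(200).json({ status: 'ok' }));

  // Readiness: dependencies are reachable, safe to receive traffic
  router.get('/health/ready', async (_req, res) => {
    try {
      await prisma.$queryRaw`SELECT 1`; // database reachable
      await redis.ping();               // Redis reachable
      res.status(200).json({ status: 'ready' });
    } catch {
      res.status(503).json({ status: 'not ready' });
    }
  });

  // Startup: initialization (migration check, cache warmup) has completed
  router.get('/health/startup', (_req, res) =>
    isStarted() ? res.status(200).json({ status: 'started' }) : res.status(503).json({ status: 'starting' }),
  );

  return router;
}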
Test Requirements
Coverage Thresholds
{
"coverageThreshold": {
"global": {
"branches": 75,
"functions": 80,
"lines": 80,
"statements": 80
},
"src/modules/mlm/domain/**": {
"branches": 90,
"functions": 95,
"lines": 95
},
"src/modules/payment/domain/**": {
"branches": 90,
"functions": 95,
"lines": 95
}
}
}
Platform-Specific Test Suites
| Suite | Focus | Trigger |
|---|---|---|
| Commission Tests | Multi-level calculations, edge cases, rounding | Every PR |
| Tree Operation Tests | INSERT, MOVE, depth limits, cycle prevention | Every PR |
| Payment Integration | Webhook handling, idempotency, refunds | Every PR |
| State Machine Tests | Order transitions, invalid state prevention | Every PR |
| Encryption Tests | Encrypt/decrypt cycle, key rotation | Every PR |
| Load Tests | Commission calculation under load | Weekly / Pre-release |
Load Test Specifications
Load tests validate system performance under realistic and stress conditions.
Target Metrics:
| Scenario | Target | Threshold (Fail if) |
|---|---|---|
| Commission calculation | 100 orders/second | < 50 orders/second |
| Concurrent payout requests | 50 requests/second | < 25 requests/second |
| Tree traversal (10 levels) | < 50ms p95 | > 200ms p95 |
| API response time (p95) | < 200ms | > 500ms |
| Database connections | Stable at pool max | Pool exhaustion |
| Memory usage | < 512MB per instance | > 1GB |
Load test scenarios:
// load-tests/k6/commission-load.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Counter } from 'k6/metrics';

// Backs the 'commission_calculated' threshold below; for a Counter, 'rate' means events per second
const commissionCalculated = new Counter('commission_calculated');
export const options = {
scenarios: {
// Sustained load: Normal operation
sustained: {
executor: 'constant-arrival-rate',
rate: 100, // 100 orders per second
timeUnit: '1s',
duration: '5m',
preAllocatedVUs: 50,
},
// Spike: Flash sale simulation
spike: {
executor: 'ramping-arrival-rate',
startRate: 100,
timeUnit: '1s',
stages: [
{ duration: '1m', target: 100 }, // Normal
{ duration: '30s', target: 500 }, // Spike to 5x
{ duration: '2m', target: 500 }, // Sustain spike
{ duration: '30s', target: 100 }, // Back to normal
],
preAllocatedVUs: 200,
},
// Stress: Find breaking point
stress: {
executor: 'ramping-arrival-rate',
startRate: 50,
timeUnit: '1s',
stages: [
{ duration: '2m', target: 200 },
{ duration: '2m', target: 400 },
{ duration: '2m', target: 600 },
{ duration: '2m', target: 800 }, // Find where it breaks
],
preAllocatedVUs: 300,
},
},
thresholds: {
http_req_duration: ['p(95)<500'], // 95% of requests < 500ms
http_req_failed: ['rate<0.01'], // Error rate < 1%
'commission_calculated': ['rate>50'], // At least 50/s processed
},
};
export default function () {
const orderPayload = {
userId: `user_${__VU}_${__ITER}`,
amount: Math.floor(Math.random() * 10000) + 1000,
referringPartnerId: 'partner_test_001',
};
const res = http.post(
`${__ENV.API_URL}/api/v1/orders`,
JSON.stringify(orderPayload),
{ headers: { 'Content-Type': 'application/json' } }
);
check(res, {
'order created': (r) => r.status === 201,
'commission triggered': (r) => r.json('commissionJobId') !== null,
});
if (res.status === 201) {
commissionCalculated.add(1); // feeds the commission_calculated rate threshold
}
sleep(0.1);
}
Database connection behavior under load:
# Monitor during load tests
metrics:
- pg_stat_activity.active_connections
- pg_stat_activity.waiting_connections
- pg_stat_user_tables.seq_scan (should not spike)
- pg_stat_user_tables.idx_scan (should handle load)
- pg_locks.blocked_queries (should be zero)
CI integration:
- name: Run Load Tests (Pre-release)
if: github.event_name == 'release'
run: |
k6 run load-tests/k6/commission-load.js \
--env API_URL=${{ secrets.STAGING_URL }} \
--out json=load-test-results.json
# Fail release if thresholds not met
node scripts/validate-load-test.js load-test-results.json
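scripts/validate-load-test.js re-checks the thresholds on the exported results; a sketch assuming the raw --out json stream has first been reduced to per-metric aggregates (for example with k6's --summary-export or a handleSummary hook), so the field names below are assumptions:
// Sketch of scripts/validate-load-test.js
import { readFileSync } from 'node:fs';

const summary = JSON.parse(readFileSync(process.argv[2], 'utf8'));

// Mirror the k6 thresholds defined above
const p95 = summary.metrics?.http_req_duration?.['p(95)'];
const errorRate = summary.metrics?.http_req_failed?.value;

const failures: string[] = [];
if (!(p95 < 500)) failures.push(`http_req_duration p(95)=${p95}ms (limit: 500ms)`);
if (!(errorRate < 0.01)) failures.push(`http_req_failed rate=${errorRate} (limit: 1%)`);

if (failures.length > 0) {
  console.error('Load test thresholds not met:');
  failures.forEach((f) => console.error(`  ${f}`));
  process.exit(1); // fail the release
}
console.log('Load test thresholds met');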
Monitoring & Notifications
Deploy Notifications
# Notify on:
- Deploy started (staging/production)
- Deploy succeeded
- Deploy failed
- Rollback triggered
- Manual approval required
# Channels:
- Slack / Telegram
- Email (for failures)
- PagerDuty (for production failures)
Post-Deploy Verification
# Smoke tests run immediately after deploy:
1. GET /health/ready → 200
2. POST /api/v1/auth/login (test user) → 200 + JWT
3. GET /api/v1/products → 200 + valid response
4. GET /api/v1/mlm/ranks → 200 + valid response
# If any fail → trigger rollback
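A sketch of the smoke-test script for the checks above; the base URL, the test-user credentials, and the accessToken response field are assumptions:
// Post-deploy smoke tests (illustrative), runs on Node 18+ with global fetch
const BASE_URL = process.env.SMOKE_BASE_URL ?? 'https://staging.iwm.local';

async function smoke(): Promise<void> {
  // 1. Readiness
  const health = await fetch(`${BASE_URL}/health/ready`);
  if (health.status !== 200) throw new Error('readiness check failed');

  // 2. Login with the dedicated smoke-test user
  const login = await fetch(`${BASE_URL}/api/v1/auth/login`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ email: process.env.SMOKE_USER, password: process.env.SMOKE_PASSWORD }),
  });
  const { accessToken } = await login.json();
  if (login.status !== 200 || !accessToken) throw new Error('login smoke test failed');

  // 3 & 4. Read-only endpoints
  for (const path of ['/api/v1/products', '/api/v1/mlm/ranks']) {
    const res = await fetch(`${BASE_URL}${path}`, { headers: { Authorization: `Bearer ${accessToken}` } });
    if (res.status !== 200) throw new Error(`${path} smoke test failed`);
  }
}

smoke().catch((err) => {
  console.error(err.message);
  process.exit(1); // non-zero exit triggers the rollback step in the pipeline
});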
Metrics to Monitor Post-Deploy
| Metric | Baseline Comparison | Alert Threshold |
|---|---|---|
| Error rate | vs. previous hour | > 2x baseline |
| Response time (p95) | vs. previous hour | > 1.5x baseline |
| Database query time | vs. previous hour | > 2x baseline |
| Memory usage | vs. previous deploy | > 120% |
| CPU usage | vs. previous deploy | > 150% |
Hotfix Procedure
When production is broken and a rapid fix is needed, the hotfix procedure allows bypassing normal flow while maintaining safety.
When to Use Hotfix
| Situation | Use Hotfix? | Normal Deploy OK? |
|---|---|---|
| Production down / 500 errors | Yes | No |
| Critical security vulnerability | Yes | No |
| Payment processing broken | Yes | No |
| Commission calculation wrong | Yes | No |
| Minor bug, users unaffected | No | Yes |
| Performance degradation (< 2x) | No | Yes |
| Feature not working as expected | No | Yes |
Hotfix Flow
┌─────────────────────────────────────────────────────────────────┐
│ HOTFIX PROCEDURE │
│ │
│ 1. ASSESS (5 min max) │
│ └─▶ Confirm severity, identify root cause │
│ └─▶ Decision: Hotfix vs Rollback vs Wait │
│ │
│ 2. BRANCH │
│ └─▶ git checkout -b hotfix/ISSUE-ID main │
│ └─▶ NOT from feature branch │
│ │
│ 3. FIX │
│ └─▶ Minimal change only │
│ └─▶ No refactoring │
│ └─▶ No "while we're here" additions │
│ │
│ 4. VALIDATE (Abbreviated) │
│ └─▶ Unit tests for changed code │
│ └─▶ Type check │
│ └─▶ Manual smoke test │
│ └─▶ Skip: Full E2E, Load tests, SAST │
│ │
│ 5. APPROVE │
│ └─▶ Single reviewer (senior engineer) │
│ └─▶ No PR required (direct push with approval) │
│ │
│ 6. DEPLOY │
│ └─▶ Direct to production (skip staging) │
│ └─▶ Watch metrics for 15 minutes │
│ │
│ 7. FOLLOW-UP (within 24 hours) │
│ └─▶ Create proper PR backport to main │
│ └─▶ Add regression test │
│ └─▶ Write incident report │
└─────────────────────────────────────────────────────────────────┘
Hotfix Commands
# 1. Create hotfix branch from production tag
git fetch --tags
git checkout -b hotfix/IWM-123-fix-commission-calc v1.2.3
# 2. Make fix
# ... code changes ...
# 3. Run abbreviated tests
npm run test:unit -- --testPathPattern="commission"
npm run type-check
# 4. Deploy directly (requires HOTFIX_APPROVED=true)
HOTFIX_APPROVED=true npm run deploy:production
# 5. Tag the hotfix
git tag -a v1.2.3-hotfix.1 -m "Hotfix: Commission calculation overflow"
git push origin v1.2.3-hotfix.1
# 6. Backport to main
git checkout main
git cherry-pick <hotfix-commit-sha>
git push origin main
Hotfix Approval Matrix
| Fix Type | Approver Required | Can Self-Approve |
|---|---|---|
| Logic fix (no DB) | 1 senior engineer | No |
| Config change | 1 engineer | Yes (if on-call) |
| Database fix | 2 engineers + DBA | No |
| Security fix | 1 security + 1 engineer | No |
| Revert to previous | 1 engineer | Yes (if on-call) |
On-Call & Escalation
Escalation Path
┌─────────────────────────────────────────────────────────────────┐
│ ESCALATION LADDER │
│ │
│ Level 0: Automated │
│ └─▶ Health check fails → Auto-rollback │
│ └─▶ Error rate > 5% → Auto-rollback │
│ └─▶ Alert sent to #alerts channel │
│ │
│ Level 1: On-Call Engineer (0-15 min) │
│ └─▶ Receive PagerDuty alert │
│ └─▶ Acknowledge within 5 minutes │
│ └─▶ Assess: Can fix alone? Needs escalation? │
│ │
│ Level 2: Senior Engineer (15-30 min) │
│ └─▶ Auto-escalate if L1 doesn't acknowledge │
│ └─▶ Join incident call │
│ └─▶ Decision: Hotfix vs Rollback vs External help │
│ │
│ Level 3: Engineering Lead + Team (30+ min) │
│ └─▶ Multiple engineers on call │
│ └─▶ Coordinate with stakeholders │
│ └─▶ Customer communication if needed │
│ │
│ Level 4: Executive (Major Incident) │
│ └─▶ Extended outage (> 1 hour) │
│ └─▶ Data breach / Security incident │
│ └─▶ Financial impact (payments affected) │
└─────────────────────────────────────────────────────────────────┘
PagerDuty Configuration
# pagerduty-config.yml
services:
- name: IWM Production
escalation_policy: iwm-production
alert_creation: create_alerts_and_incidents
escalation_policies:
- name: iwm-production
rules:
- escalation_delay_minutes: 5
targets:
- type: schedule_reference
id: primary-oncall
- escalation_delay_minutes: 15
targets:
- type: schedule_reference
id: senior-oncall
- escalation_delay_minutes: 30
targets:
- type: user_reference
id: engineering-lead
- type: user_reference
id: cto
schedules:
- name: primary-oncall
rotation: weekly
users: [engineer_1, engineer_2, engineer_3, engineer_4]
- name: senior-oncall
rotation: weekly
users: [senior_1, senior_2]
Alert Severity Levels
| Severity | Response Time | Examples |
|---|---|---|
| P1 - Critical | 5 min | Production down, payments failing, data breach |
| P2 - High | 15 min | Major feature broken, error rate > 5% |
| P3 - Medium | 1 hour | Performance degradation, non-critical errors |
| P4 - Low | Next business day | Minor bugs, monitoring alerts |
Incident Communication
# During incident:
channels:
- "#incident-active" # Real-time updates (engineers only)
- "#engineering" # Status updates (every 30 min)
- "#general" # Customer-facing status (if needed)
templates:
initial_alert: |
:rotating_light: **INCIDENT DETECTED**
Severity: {severity}
Service: {service}
Description: {description}
On-call: @{oncall_user}
Incident channel: #incident-{id}
status_update: |
**Incident Update** ({time} since start)
Status: {investigating|identified|fixing|monitoring|resolved}
Impact: {impact_description}
Next update: {eta}
resolution: |
:white_check_mark: **INCIDENT RESOLVED**
Duration: {duration}
Root cause: {root_cause}
Fix applied: {fix_description}
Follow-up: {follow_up_ticket}
Secrets Rotation
Rotation Schedule
| Secret | Rotation Frequency | Auto-Rotate | Downtime Required |
|---|---|---|---|
| JWT_SECRET | 90 days | No | No (dual-key period) |
| ENCRYPTION_MASTER_KEY | 180 days | No | Yes (re-encryption) |
| Database password | 90 days | Yes (via cloud) | No |
| API keys (external) | 365 days | Varies | No |
| Webhook secrets | 180 days | No | Coordination needed |
JWT Secret Rotation (Zero Downtime)
// Support dual JWT secrets during rotation
import jwt, { JwtPayload } from 'jsonwebtoken';

const jwtConfig = {
// Current secret (for signing new tokens)
current: process.env.JWT_SECRET,
// Previous secret (for validating old tokens during rotation)
previous: process.env.JWT_SECRET_PREVIOUS || null,
// Rotation window (how long to accept old tokens)
rotationWindowDays: 7,
};
// Validation accepts either secret
function validateToken(token: string): JwtPayload {
try {
return jwt.verify(token, jwtConfig.current);
} catch (e) {
if (jwtConfig.previous) {
return jwt.verify(token, jwtConfig.previous);
}
throw e;
}
}
Rotation procedure:
jwt_rotation_steps:
1. Generate new JWT_SECRET
2. Set JWT_SECRET_PREVIOUS = current JWT_SECRET
3. Set JWT_SECRET = new secret
4. Deploy (both secrets now valid)
5. Wait 7 days (tokens expire, refresh uses new secret)
6. Remove JWT_SECRET_PREVIOUS
7. Deploy final config
Encryption Key Rotation
Encryption key rotation requires re-encrypting existing data.
// Key versioning in encrypted data
interface EncryptedField {
version: string; // 'v1', 'v2', etc.
ciphertext: string;
iv: string;
}
// Decryption with version support
async function decrypt(field: EncryptedField): Promise<string> {
const key = getKeyByVersion(field.version);
return decryptWithKey(field.ciphertext, field.iv, key);
}
// Background re-encryption job
async function reEncryptAllData(fromVersion: string, toVersion: string) {
const batchSize = 1000;
let processed = 0; // running total for logging
while (true) {
const records = await db.taxIdentification.findMany({
where: { encryptedData: { path: ['version'], equals: fromVersion } },
take: batchSize,
// No skip: updated rows no longer match fromVersion, so each pass fetches the next unprocessed batch
});
if (records.length === 0) break;
for (const record of records) {
const decrypted = await decrypt(record.encryptedData);
const reEncrypted = await encrypt(decrypted, toVersion);
await db.taxIdentification.update({
where: { id: record.id },
data: { encryptedData: reEncrypted },
});
}
processed += records.length;
logger.info(`Re-encrypted ${processed} records`);
}
}
Rotation procedure:
encryption_key_rotation:
1. Generate new key (ENCRYPTION_MASTER_KEY_V2)
2. Deploy with both keys available
3. New encryptions use v2
4. Run background re-encryption job
5. Monitor job completion (may take hours)
6. Verify all records are v2
7. Remove old key from config
8. Securely delete old key from secrets manager
Webhook Secret Rotation
External webhooks (payment providers) require coordination.
webhook_rotation:
stripe:
1. Generate new webhook in Stripe dashboard
2. Add new STRIPE_WEBHOOK_SECRET_V2 to config
3. Deploy (accept both secrets)
4. Disable old webhook in Stripe dashboard
5. Remove old secret from config
coordination: Self-service, no downtime
payment_provider:
1. Contact provider support
2. Schedule rotation window
3. Provider sends test webhook with new secret
4. Verify receipt
5. Provider switches to new secret
6. Update config
coordination: Provider-dependent, may require window
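Accepting both secrets during rotation (step 3 of the flows above) can be sketched as below; the HMAC-SHA256-over-raw-body scheme and the environment variable names are assumptions, and in practice the provider's own signature verification helper should be used:
// Dual-secret webhook verification during the rotation window (illustrative)
import { createHmac, timingSafeEqual } from 'node:crypto';

function matches(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  return a.length === b.length && timingSafeEqual(a, b);
}

export function verifyWebhook(rawBody: string, signature: string): boolean {
  const secrets = [process.env.WEBHOOK_SECRET, process.env.WEBHOOK_SECRET_PREVIOUS].filter(Boolean) as string[];
  // Try the current secret first, then fall back to the previous one until rotation completes
  return secrets.some((secret) => matches(rawBody, signature, secret));
}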
Implementation Priority
Required (Phase 1)
| Component | Reason |
|---|---|
| Build + Type check | Basic correctness |
| Unit tests (80%) | Financial calculation accuracy |
| Integration tests | Database operations correctness |
| npm audit | OWASP A06 compliance |
| Secret scanning | Prevent credential leaks |
| Migration validation | Database integrity |
| Health checks | Deployment verification |
| Staging environment | Pre-production validation |
Should Have (Phase 2)
| Component | Reason |
|---|---|
| E2E tests (critical flows) | User journey validation |
| Container scanning (Trivy) | Infrastructure security |
| SAST (CodeQL) | Code-level vulnerabilities |
| Blue-green deployment | Zero-downtime releases |
| Automatic rollback | Fast recovery |
| Deploy notifications | Team awareness |
| Coverage gates | Prevent regression |
Nice to Have (Phase 3)
| Component | Reason |
|---|---|
| Preview environments | PR testing |
| Load testing | Performance validation |
| Visual regression | UI consistency |
| Auto-changelog | Release documentation |
| Dependency auto-update | Maintenance automation |
| Feature flags | Gradual rollouts |
| Chaos testing | Resilience verification |