DevSecOps & Infrastructure Operations

Platform Overview

Awarded AI operates a multi-tenant SaaS platform built on DevSecOps practices. The infrastructure spans the commercial Azure and Azure Government clouds and serves both US commercial and government customers.

Infrastructure Architecture

Cloud Environments

Technology Stack

Component      | Technology                           | Purpose
Orchestration  | Kubernetes (AKS)                     | Container orchestration
GitOps         | Flux v2                              | Declarative deployments
IaC            | Bicep / Terraform                    | Infrastructure provisioning
CI/CD          | GitHub Actions                       | Build and deployment pipelines
Monitoring     | Sentry (recently removed), DataDog   | Error tracking & APM
Secrets        | Azure Key Vault, SOPS                | Secret management
Policy         | Kyverno                              | Kubernetes policies

Container Services

Core Services

  1. web - Next.js frontend application
  2. ai - Python AI/ML services
  3. nest - Shared services API (NestJS)
  4. doc-pipeline - Document processing service
  5. workflows - LangGraph workflow engine
  6. agenting - Agent services
  7. deep-contractiq - Specialized AI agent

Supporting Services

  • Supabase - Database, auth, storage
  • Kong - API gateway
  • Redis - Caching layer
  • Minio - Object storage
  • LiteLLM - LLM proxy service

CI/CD Pipeline

Build Process

# GitHub Actions Workflow Structure
# Outline only: the uses/run details of each step are omitted.
on:
  push:
    branches:
      - 'release/v*.*.*'

jobs:
  build-and-push:
    strategy:
      matrix:
        environment: [qa-commercial, qa-government]
        image: [web, ai, nest, workflows, agenting]
    steps:
      - name: Build with Docker Buildx
      - name: Push to Azure Container Registry
      - name: Tag with version and SHA
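
The "Tag with version and SHA" step could be sketched in shell roughly as follows. This is an illustration, not the actual workflow step: the function name image_tags, the placeholder registry host example.azurecr.io, and the tag format (<version> and <version>-<short SHA>) are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper: derive image references from a release branch name
# and a commit SHA. Assumed tag format: <version> and <version>-<7-char SHA>.
image_tags() {
  branch=$1                           # e.g. release/v1.39.0
  sha=$2                              # full commit SHA
  image=$3                            # e.g. web
  registry=${4:-example.azurecr.io}   # placeholder registry host

  # Loose check that the branch looks like release/vX.Y.Z.
  case $branch in
    release/v[0-9]*.[0-9]*.[0-9]*) version=${branch#release/} ;;
    *) echo "not a release branch: $branch" >&2; return 1 ;;
  esac

  short_sha=$(printf '%s' "$sha" | cut -c1-7)
  printf '%s/%s:%s\n' "$registry" "$image" "$version"
  printf '%s/%s:%s-%s\n' "$registry" "$image" "$version" "$short_sha"
}
```

Each printed reference could then be passed to the build step as a tag for the image being pushed.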

Deployment Flow

  1. Developer creates feature branch from release branch
  2. PR Process:
    • Automated linting & type checking
    • E2E tests via Playwright
    • Production build validation
    • Code review (1-2 approvals)
  3. Merge to release branch triggers:
    • Multi-environment builds
    • Container registry push
    • Flux GitOps reconciliation
  4. Deployment via Flux:
    • Kubernetes manifests updated
    • Rolling deployments
    • Health checks
    • Database migrations
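
The last stage of the flow above (Flux reconciliation followed by a rolling deployment and health checks) can be checked by hand with standard flux and kubectl commands. The sketch below assumes the Kustomization lives in the flux-system namespace and uses a 300-second timeout; both values, and the helper names run and verify_rollout, are assumptions rather than the platform's actual configuration.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: reconcile one Flux Kustomization, then wait until
# the rollout of one Deployment in a tenant namespace has completed.
verify_rollout() {
  kustomization=$1
  namespace=$2
  deployment=$3
  # Assumption: Kustomizations are stored in the flux-system namespace.
  run flux reconcile kustomization "$kustomization" -n flux-system --with-source &&
  run kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=300s
}
```

Setting DRY_RUN=1 before calling verify_rollout prints the two commands instead of running them, which is a cheap way to review what would be executed.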

Security Architecture

Multi-Tenancy Model

# Tenant Isolation via Flux
- Cross-namespace references blocked
- Service account impersonation
- RBAC per tenant namespace
- Network policies enforced
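
One way to spot-check the per-tenant RBAC listed above is to impersonate a tenant's service account and confirm it cannot read another tenant's Secrets. The sketch below is illustrative only: the function names and the tenant and service account names passed to them are hypothetical, and the check covers RBAC only, not network policies.

```shell
#!/bin/sh
# Build the Kubernetes subject name for a service account.
tenant_sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Hypothetical check: a tenant's service account should NOT be able to read
# Secrets in another tenant's namespace. Returns 0 when isolation holds.
assert_no_cross_tenant_access() {
  tenant_ns=$1
  tenant_sa=$2
  other_ns=$3
  subject=$(tenant_sa_subject "$tenant_ns" "$tenant_sa")
  # `kubectl auth can-i` exits non-zero when the answer is "no". It also
  # fails when the cluster is unreachable, so confirm connectivity first.
  if kubectl auth can-i get secrets -n "$other_ns" --as="$subject" >/dev/null 2>&1; then
    echo "isolation violated: $subject can read secrets in $other_ns" >&2
    return 1
  fi
}
```

For example, assert_no_cross_tenant_access tenant-a tenant-a tenant-b would be expected to succeed if tenant-a's service account has no access to the tenant-b namespace.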

Security Controls

  1. Image Security

    • Cosign signature verification
    • Vulnerability scanning
    • Private container registry
  2. Secret Management

    • Azure Key Vault integration
    • SOPS encryption for GitOps
    • Kubernetes secrets rotation
  3. Network Security

    • Private endpoints
    • WAF protection
    • mTLS between services
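
As an illustration of the image security controls, the sketch below verifies a Cosign signature with a public key before an image is used. Requiring the image to be pinned by digest is an added assumption, not something the list above states, and the key file name cosign.pub and both function names are hypothetical.

```shell
#!/bin/sh
# Return 0 if the image reference is pinned by digest,
# i.e. ends in @sha256: followed by exactly 64 lowercase hex characters.
is_digest_pinned() {
  case $1 in
    *@sha256:*) digest=${1##*@sha256:} ;;
    *) return 1 ;;
  esac
  [ ${#digest} -eq 64 ] || return 1
  case $digest in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}

# Hypothetical wrapper: refuse unpinned images, then verify the signature.
verify_image() {
  image=$1
  pubkey=${2:-cosign.pub}   # assumed key file name
  is_digest_pinned "$image" || {
    echo "refusing unpinned image: $image" >&2
    return 1
  }
  cosign verify --key "$pubkey" "$image"
}
```

Pinning by digest means the signature check and the later pull refer to the same image content, which a mutable tag alone does not guarantee.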

Release Management

Version Strategy

  • Semantic Versioning: v[MAJOR].[MINOR].[PATCH]
  • Release Branches: release/v1.38.0, release/v1.39.0
  • Hotfix Process: From current release branch (NOT main)

Important Note: The current branching strategy is non-standard. Each release branch is cut from the previous one, and the develop and main branches are effectively unused and out of date. A migration to trunk-based development or a GitFlow-like workflow is under active discussion.

Current Release Status

  • Production: v1.38.0
  • QA: v1.39.0
  • Release Cadence: ~2 week sprints

Release Process

  1. Create release branch from previous release
  2. Update version in package.json
  3. Deploy to QA environments
  4. QA validation (manual + automated)
  5. Production deployment approval
  6. Tag and release
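
Step 1 of the release process depends on computing the next version from the previous release branch. A minimal sketch follows, assuming branch names always have the form release/vMAJOR.MINOR.PATCH; the function names bump_version and next_release_branch are hypothetical helpers, not existing tooling.

```shell
#!/bin/sh
# Bump a version of the form vMAJOR.MINOR.PATCH.
# Usage: bump_version v1.38.0 minor   -> prints v1.39.0
bump_version() {
  version=${1#v}
  part=$2
  # Reject anything other than digits and dots.
  case $version in
    ''|*[!0-9.]*) echo "invalid version: $1" >&2; return 1 ;;
  esac
  # Split on dots into positional parameters.
  old_ifs=$IFS
  IFS=.
  set -- $version
  IFS=$old_ifs
  [ $# -eq 3 ] || { echo "expected MAJOR.MINOR.PATCH" >&2; return 1; }
  major=$1
  minor=$2
  patch=$3
  case $part in
    major) major=$((major + 1)); minor=0; patch=0 ;;
    minor) minor=$((minor + 1)); patch=0 ;;
    patch) patch=$((patch + 1)) ;;
    *) echo "unknown part: $part" >&2; return 1 ;;
  esac
  printf 'v%s.%s.%s\n' "$major" "$minor" "$patch"
}

# Print the name of the next release branch, given the previous one.
next_release_branch() {
  next=$(bump_version "${1#release/}" "$2") || return 1
  printf 'release/%s\n' "$next"
}
```

The new branch could then be created from the previous release with git switch -c "$(next_release_branch release/v1.38.0 minor)" release/v1.38.0. Whether a hotfix bumps the patch number is not stated here, so the patch option is included only as a possibility.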

Infrastructure as Code

Bicep Templates (v2 Architecture)

bicep/
├── main.bicep                  # Main orchestrator
├── cluster.bicep               # AKS cluster definition
├── network.bicep               # Network configuration
├── tenant.bicep                # Per-tenant resources
└── modules/
    ├── cert-manager.bicep      # SSL certificates
    ├── kv.bicep                # Key Vault
    └── dns-zone.bicep          # DNS configuration

Helm Charts

charts/
├── web/ # Frontend application
├── supabase/ # Database stack
├── eda/ # Event-driven architecture
├── shared-services/ # Common services
└── tenant-config/ # Tenant-specific configs

Operational Metrics

Key Performance Indicators

Metric                   | Target    | Current
Deployment Success Rate  | >95%      | 92%
MTTR (Recovery)          | <30 min   | 45 min
Release Frequency        | 2 weeks   | 2-3 weeks
Change Failure Rate      | <10%      | 15%
Uptime                   | 99.9%     | 99.7%
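
To put the uptime gap in concrete terms, an availability percentage can be converted into allowed downtime. The sketch below assumes a 30-day measurement window, which the table does not specify; the function name downtime_minutes is hypothetical.

```shell
#!/bin/sh
# Allowed downtime in minutes for an availability percentage over N days.
# Formula: (100 - pct) / 100 * days * 24 * 60
downtime_minutes() {
  LC_ALL=C awk -v pct="$1" -v days="${2:-30}" \
    'BEGIN { printf "%.1f\n", (100 - pct) / 100 * days * 24 * 60 }'
}

downtime_minutes 99.9 30   # target:  prints 43.2
downtime_minutes 99.7 30   # current: prints 129.6
```

Under that assumption, the 99.9% target allows about 43 minutes of downtime per 30 days, while the current 99.7% corresponds to about 130 minutes, roughly three times the budget.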

Monitoring Stack

  • Sentry: Application errors (Recently removed, pending replacement)
  • DataDog: Infrastructure metrics
  • Azure Monitor: Platform metrics
  • Synthetic Tests: User journey validation (planned)

Critical Gap: APM (Application Performance Monitoring) was removed with Sentry and needs urgent replacement. Plans include:

  • Compliant APM solution deployment
  • Synthetic monitoring implementation
  • Canary deployment monitoring

FinOps Priority: Financial operations and SRE/APM modernization are important priorities that have been delayed. This includes cost visibility, resource optimization, and automated scaling based on metrics.

Current Challenges

Technical Debt

  1. Manual Processes

    • Tenant onboarding (~2 hours manual work)
    • Certificate management
    • Database migrations coordination
  2. Scaling Issues

    • Single-region deployment
    • Manual horizontal scaling
    • Limited disaster recovery
  3. Monitoring Gaps

    • Limited distributed tracing
    • No centralized logging
    • Manual incident response

DSOPW Ticket Analysis

From JIRA DSOPW project:

  • Active Initiatives: Terraform migration, automated deployments
  • Compliance Issues: 12 open compliance tickets
  • Automation Focus: Build improvements, GitOps enhancements
  • Security Hardening: Intune deployment, secret rotation

Strategic Roadmap

Q2 2025 (Current)

  • Complete Terraform migration
  • Implement automated tenant provisioning
  • Deploy centralized logging (ELK)
  • Enhance monitoring coverage

Q3 2025

  • Multi-region deployment
  • Automated disaster recovery
  • Zero-downtime deployments
  • Service mesh implementation

Q4 2025

  • Full GitOps automation
  • Self-service tenant management
  • Advanced observability
  • Cost optimization automation

Operational Procedures

Incident Response

  1. Detection: Automated alerts via Grafana IRM
  2. Triage: On-call engineer assessment
  3. Resolution: Runbook execution
  4. Post-mortem: Within 48 hours

Change Management

  • CAB Meetings: Weekly on Wednesdays
  • Change Window: Thursday 10 PM - 2 AM EST
  • Emergency Changes: CTO approval required
  • Rollback Plan: Mandatory for all changes
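
A deployment script could guard against running outside the change window with a check like the one below. This is a sketch under assumptions: the window is read as Thursday 22:00 up to, but not including, Friday 02:00, and the function name in_change_window is hypothetical.

```shell
#!/bin/sh
# Return 0 if the given time falls inside the change window
# (Thursday 22:00 up to, but not including, Friday 02:00).
#   dow:  ISO day of week as printed by `date +%u` (1=Monday ... 7=Sunday)
#   hour: hour of day as printed by `date +%H` (00-23)
in_change_window() {
  dow=$1
  hour=${2#0}       # strip one leading zero so "08" is not read as octal
  hour=${hour:-0}   # "0" becomes empty after stripping; restore it
  if [ "$dow" -eq 4 ] && [ "$hour" -ge 22 ]; then
    return 0
  fi
  if [ "$dow" -eq 5 ] && [ "$hour" -lt 2 ]; then
    return 0
  fi
  return 1
}

# Example call (the time zone name is an assumption; America/New_York
# follows daylight saving time, so it only approximates "EST"):
#   in_change_window "$(TZ=America/New_York date +%u)" \
#                    "$(TZ=America/New_York date +%H)" || exit 1
```

Passing the day and hour as arguments keeps the check easy to test without depending on the current clock.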

Backup & Recovery

  • Database: Daily snapshots, 30-day retention
  • File Storage: Geo-redundant storage
  • Configuration: GitOps repository
  • Recovery Testing: Monthly drills

DevOps Tools & Access

Required Access

  1. Azure Portal: Commercial & Government
  2. GitHub: procurement-sciences org
  3. 1Password: DevOps vault
  4. Grafana IRM: On-call rotation
  5. DataDog: Monitoring dashboards

Key Repositories

  • chatbot-ui: Main application
  • awardedai-gitops: Infrastructure code
  • Internal docs: Confluence (DSOP space)

Useful Commands

# Connect to cluster
az aks get-credentials --resource-group <rg> --name <cluster>

# View deployments
kubectl get deployments -n <tenant>

# Check Flux status
flux get kustomizations --all-namespaces

# View logs
kubectl logs -n <namespace> deployment/<name> -f

# Scale deployment
kubectl scale deployment/<name> -n <namespace> --replicas=3

Training & Documentation

Essential Reading

  1. Flux v2 Multi-tenancy Guide (See Multi-Tenancy Model section above)
  2. Kubernetes Best Practices
  3. Azure AKS Documentation
  4. Helm Chart Development

Certifications

  • Azure Administrator (AZ-104)
  • Certified Kubernetes Administrator (CKA)
  • Azure DevOps Engineer (AZ-400)

Need Help?

  • Slack: #devsecops
  • On-Call: Grafana IRM
  • Wiki: Confluence DSOP space
  • Escalation: CTO / VP Engineering