DevSecOps & Infrastructure Operations
Platform Overview
Awarded AI operates a multi-cloud, multi-tenant SaaS platform built on DevSecOps practices. The infrastructure spans the commercial Azure and Azure Government clouds and serves both US commercial and government customers.
Infrastructure Architecture
Cloud Environments
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Orchestration | Kubernetes (AKS) | Container orchestration |
| GitOps | Flux v2 | Declarative deployments |
| IaC | Bicep / Terraform | Infrastructure provisioning |
| CI/CD | GitHub Actions | Build and deployment pipelines |
| Monitoring | Sentry (recently removed, pending replacement), DataDog | Error tracking & APM |
| Secrets | Azure Key Vault, SOPS | Secret management |
| Policy | Kyverno | Kubernetes policies |
Container Services
Core Services
- web - Next.js frontend application
- ai - Python AI/ML services
- nest - Shared services API (NestJS)
- doc-pipeline - Document processing service
- workflows - LangGraph workflow engine
- agenting - Agent services
- deep-contractiq - Specialized AI agent
Supporting Services
- Supabase - Database, auth, storage
- Kong - API gateway
- Redis - Caching layer
- Minio - Object storage
- LiteLLM - LLM proxy service
CI/CD Pipeline
Build Process
# GitHub Actions workflow structure (outline, not a complete workflow file)
on:
  push:
    branches:
      - 'release/v*.*.*'
jobs:
  build-and-push:
    strategy:
      matrix:
        environment: [qa-commercial, qa-government]
        image: [web, ai, nest, workflows, agenting]
    steps:
      - Build with Docker Buildx
      - Push to Azure Container Registry
      - Tag with version and SHA
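The matrix above fans out into one job per environment and image pair. The following is a minimal sketch of that expansion and of the "tag with version and SHA" step. The registry host `example.azurecr.io`, the 7-character short SHA, and the exact tag format are illustrative assumptions, not values taken from the real pipeline.

```python
from itertools import product

ENVIRONMENTS = ["qa-commercial", "qa-government"]
IMAGES = ["web", "ai", "nest", "workflows", "agenting"]


def expand_matrix(environments, images):
    """Return one (environment, image) pair per matrix job, in matrix order."""
    return list(product(environments, images))


def image_tags(registry, image, version, sha):
    """Build two tags for one image: a version tag and a short-SHA tag.

    The registry host and the 7-character short SHA are hypothetical.
    """
    repo = f"{registry}/{image}"
    return [f"{repo}:{version}", f"{repo}:{sha[:7]}"]


jobs = expand_matrix(ENVIRONMENTS, IMAGES)
print(len(jobs))  # 2 environments x 5 images
print(image_tags("example.azurecr.io", "web", "v1.39.0", "0123456789abcdef"))
```

Two environments and five images yield ten build jobs per push. The matrix lists five of the seven core services; doc-pipeline and deep-contractiq do not appear in it.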
Deployment Flow
- Developer creates a feature branch from the release branch
- PR process:
  - Automated linting & type checking
  - E2E tests via Playwright
  - Production build validation
  - Code review (1-2 approvals)
- Merge to release branch triggers:
  - Multi-environment builds
  - Container registry push
  - Flux GitOps reconciliation
- Deployment via Flux:
  - Kubernetes manifests updated
  - Rolling deployments
  - Health checks
  - Database migrations
Security Architecture
Multi-Tenancy Model
# Tenant Isolation via Flux
- Cross-namespace references blocked
- Service account impersonation
- RBAC per tenant namespace
- Network policies enforced
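As an illustration of the service account impersonation and same-namespace source rules listed above, the sketch below builds a Flux `Kustomization` manifest for one tenant as a Python dict. The tenant name, repository name, path, interval, and the convention that the namespace and service account share the tenant's name are all assumptions made for the example. Cross-namespace blocking itself is normally enabled on the Flux controllers (the `--no-cross-namespace-refs` flag) and is not shown here.

```python
import json


def tenant_kustomization(tenant, repo_name, path):
    """Build a Flux Kustomization manifest (as a dict) scoped to one tenant.

    All argument values and the naming convention are hypothetical.
    """
    return {
        "apiVersion": "kustomize.toolkit.fluxcd.io/v1",
        "kind": "Kustomization",
        "metadata": {"name": tenant, "namespace": tenant},
        "spec": {
            "interval": "5m",
            "path": path,
            "prune": True,
            # Reconcile as the tenant's own service account so the tenant's
            # RBAC limits what this Kustomization can create or change.
            "serviceAccountName": tenant,
            # No namespace is set on the source reference, so it resolves
            # inside the Kustomization's own namespace.
            "sourceRef": {"kind": "GitRepository", "name": repo_name},
        },
    }


manifest = tenant_kustomization("tenant-a", "tenant-a-repo", "./tenants/tenant-a")
print(json.dumps(manifest, indent=2))
```

The JSON output can be applied as-is, since Kubernetes tooling accepts JSON manifests as well as YAML.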
Security Controls
- Image Security
  - Cosign signature verification
  - Vulnerability scanning
  - Private container registry
- Secret Management
  - Azure Key Vault integration
  - SOPS encryption for GitOps
  - Kubernetes secrets rotation
- Network Security
  - Private endpoints
  - WAF protection
  - mTLS between services
Release Management
Version Strategy
- Semantic Versioning: v[MAJOR].[MINOR].[PATCH]
- Release Branches: release/v1.38.0, release/v1.39.0
- Hotfix Process: From current release branch (NOT main)
Important Note: The current branching strategy is non-standard. The develop and main branches are effectively unused and outdated. Discussion is underway about migrating to a more standard trunk-based or GitFlow-like workflow.
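Because release branches carry the version in their name, tooling has to compare them numerically rather than as strings. A minimal sketch, assuming branch names follow the release/v[MAJOR].[MINOR].[PATCH] pattern exactly; the branch `release/v1.9.0` is a hypothetical entry added to show why string sorting is not enough. Which branch counts as "current" for a hotfix is a team decision and is not encoded here.

```python
import re

RELEASE_BRANCH = re.compile(r"^release/v(\d+)\.(\d+)\.(\d+)$")


def parse_release_branch(branch):
    """Return (major, minor, patch) for a release branch, or None otherwise."""
    match = RELEASE_BRANCH.match(branch)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def latest_release_branch(branches):
    """Pick the highest-versioned release branch, ignoring other branches."""
    releases = [b for b in branches if parse_release_branch(b) is not None]
    if not releases:
        raise ValueError("no release branches found")
    return max(releases, key=parse_release_branch)


branches = ["main", "develop", "release/v1.38.0", "release/v1.39.0", "release/v1.9.0"]
print(latest_release_branch(branches))
```

Comparing the parsed tuples ranks v1.39.0 above v1.9.0, which a plain string sort would get wrong.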
Current Release Status
- Production: v1.38.0
- QA: v1.39.0
- Release Cadence: ~2 week sprints
Release Process
- Create release branch from previous release
- Update version in package.json
- Deploy to QA environments
- QA validation (manual + automated)
- Production deployment approval
- Tag and release
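The first two steps of the process above can be sketched as follows. This assumes each release bumps the minor version (as in v1.38.0 to v1.39.0) and that package.json stores the version without the leading "v"; the package name `example-app` is a placeholder.

```python
import json


def next_minor(version):
    """Bump the minor component of 'vMAJOR.MINOR.PATCH' and reset the patch."""
    major, minor, _patch = (int(p) for p in version.lstrip("v").split("."))
    return f"v{major}.{minor + 1}.0"


def bump_package_json(text, version):
    """Return package.json text with its "version" field set (no leading 'v')."""
    data = json.loads(text)
    data["version"] = version.lstrip("v")
    return json.dumps(data, indent=2) + "\n"


new_version = next_minor("v1.39.0")
print(f"release/{new_version}")
print(bump_package_json('{"name": "example-app", "version": "1.39.0"}', new_version))
```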
Infrastructure as Code
Bicep Templates (v2 Architecture)
bicep/
├── main.bicep # Main orchestrator
├── cluster.bicep # AKS cluster definition
├── network.bicep # Network configuration
├── tenant.bicep # Per-tenant resources
└── modules/
├── cert-manager.bicep # SSL certificates
├── kv.bicep # Key Vault
└── dns-zone.bicep # DNS configuration
Helm Charts
charts/
├── web/ # Frontend application
├── supabase/ # Database stack
├── eda/ # Event-driven architecture
├── shared-services/ # Common services
└── tenant-config/ # Tenant-specific configs
Operational Metrics
Key Performance Indicators
| Metric | Target | Current |
|---|---|---|
| Deployment Success Rate | >95% | 92% |
| MTTR (Recovery) | <30min | 45min |
| Release Frequency | 2 weeks | 2-3 weeks |
| Change Failure Rate | <10% | 15% |
| Uptime | 99.9% | 99.7% |
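To make the uptime row concrete, the percentages can be converted into minutes of downtime. A worked conversion, assuming a 30-day measurement period (the table does not state the period):

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_30_DAYS):
    """Minutes of downtime implied by an uptime percentage over a period."""
    return (100.0 - uptime_percent) / 100.0 * period_minutes


target = downtime_minutes(99.9)
current = downtime_minutes(99.7)
print(f"target budget:  {target:.1f} min per 30 days")
print(f"current actual: {current:.1f} min per 30 days")
print(f"over budget by: {current - target:.1f} min")
```

Under that assumption, the 99.9% target allows about 43.2 minutes of downtime per 30 days, while 99.7% corresponds to about 129.6 minutes, roughly three times the budget.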
Monitoring Stack
- Sentry: Application errors (recently removed, pending replacement)
- DataDog: Infrastructure metrics
- Azure Monitor: Platform metrics
- Synthetic Tests: User journey validation (planned)
Critical Gap: APM (Application Performance Monitoring) was removed with Sentry and needs urgent replacement. Plans include:
- Compliant APM solution deployment
- Synthetic monitoring implementation
- Canary deployment monitoring
FinOps Priority: Financial operations and SRE/APM modernization are important priorities that have been delayed. This includes cost visibility, resource optimization, and automated scaling based on metrics.
Current Challenges
Technical Debt
- Manual Processes
  - Tenant onboarding (~2 hours manual work)
  - Certificate management
  - Database migrations coordination
- Scaling Issues
  - Single-region deployment
  - Manual horizontal scaling
  - Limited disaster recovery
- Monitoring Gaps
  - Limited distributed tracing
  - No centralized logging
  - Manual incident response
DSOPW Ticket Analysis
From JIRA DSOPW project:
- Active Initiatives: Terraform migration, automated deployments
- Compliance Issues: 12 open compliance tickets
- Automation Focus: Build improvements, GitOps enhancements
- Security Hardening: Intune deployment, secret rotation
Strategic Roadmap
Q2 2025 (Current)
- Complete Terraform migration
- Implement automated tenant provisioning
- Deploy centralized logging (ELK)
- Enhance monitoring coverage
Q3 2025
- Multi-region deployment
- Automated disaster recovery
- Zero-downtime deployments
- Service mesh implementation
Q4 2025
- Full GitOps automation
- Self-service tenant management
- Advanced observability
- Cost optimization automation
Operational Procedures
Incident Response
- Detection: Automated alerts via Grafana IRM
- Triage: On-call engineer assessment
- Resolution: Runbook execution
- Post-mortem: Within 48 hours
Change Management
- CAB Meetings: Weekly on Wednesdays
- Change Window: Thursday 10 PM - Friday 2 AM EST
- Emergency Changes: CTO approval required
- Rollback Plan: Mandatory for all changes
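A deployment script could guard against out-of-window changes with a check like the one below. This is a sketch that assumes the window runs from Thursday 10 PM into Friday 2 AM and that the caller passes a naive datetime already expressed in Eastern time; time zone conversion is deliberately left out.

```python
from datetime import datetime

THURSDAY, FRIDAY = 3, 4  # datetime.weekday(): Monday is 0


def in_change_window(local_time):
    """True if a naive Eastern-time datetime is inside Thu 22:00 - Fri 02:00."""
    if local_time.weekday() == THURSDAY:
        return local_time.hour >= 22
    if local_time.weekday() == FRIDAY:
        return local_time.hour < 2
    return False


print(in_change_window(datetime(2025, 5, 15, 23, 0)))  # Thursday 11:00 PM
print(in_change_window(datetime(2025, 5, 16, 1, 30)))  # Friday 1:30 AM
print(in_change_window(datetime(2025, 5, 16, 2, 0)))   # Friday 2:00 AM, window closed
```

The end of the window is treated as exclusive, so a change starting at exactly 2:00 AM is rejected. Emergency changes would bypass this check through the CTO approval path.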
Backup & Recovery
- Database: Daily snapshots, 30-day retention
- File Storage: Geo-redundant storage
- Configuration: GitOps repository
- Recovery Testing: Monthly drills
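The 30-day retention rule for daily database snapshots can be expressed as a simple cutoff. A minimal sketch, assuming snapshots are identified by calendar date and that a snapshot exactly 30 days old is still kept; the actual pruning mechanism is not described in this document.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30


def expired_snapshots(snapshot_dates, today, retention_days=RETENTION_DAYS):
    """Return snapshot dates older than the retention window, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in snapshot_dates if d < cutoff)


today = date(2025, 6, 30)
snapshots = [today - timedelta(days=n) for n in range(35)]  # 35 daily snapshots
print(expired_snapshots(snapshots, today))
```

With 35 consecutive daily snapshots, the four oldest fall outside the window and would be eligible for deletion.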
DevOps Tools & Access
Required Access
- Azure Portal: Commercial & Government
- GitHub: procurement-sciences org
- 1Password: DevOps vault
- Grafana IRM: On-call rotation
- DataDog: Monitoring dashboards
Key Repositories
- chatbot-ui: Main application
- awardedai-gitops: Infrastructure code
- Internal docs: Confluence (DSOP space)
Useful Commands
# Connect to cluster
az aks get-credentials --resource-group <rg> --name <cluster>
# View deployments
kubectl get deployments -n <tenant>
# Check Flux status
flux get kustomizations --all-namespaces
# View logs
kubectl logs -n <namespace> deployment/<name> -f
# Scale deployment
kubectl scale deployment/<name> -n <namespace> --replicas=3
Training & Documentation
Essential Reading
- Flux v2 Multi-tenancy Guide (See Multi-Tenancy Model section above)
- Kubernetes Best Practices
- Azure AKS Documentation
- Helm Chart Development
Certifications Recommended
- Azure Administrator (AZ-104)
- Certified Kubernetes Administrator (CKA)
- Azure DevOps Engineer (AZ-400)
Need Help?
- Slack: #devsecops
- On-Call: Grafana IRM
- Wiki: Confluence DSOP space
- Escalation: CTO / VP Engineering