Getting Ready for Production
Moving from a solid foundation to production readiness requires careful attention to monitoring, testing, security, and operational procedures. This phase ensures that our system can handle real-world traffic, failures, and security threats whilst providing the observability needed to maintain and improve it.
For this phase, we’ll continue with our Java-based stack using Spring Boot and Gradle, deployed to Kubernetes, whilst adding the production-grade tooling and practices necessary for enterprise applications.
Monitoring and Alerting Infrastructure
Comprehensive monitoring and alerting form the nervous system of our production environment. Without proper monitoring, we’re flying blind when issues occur.
DataDog Integration
For this section I’ll use DataDog, as I have used it extensively over the past few years. DataDog provides comprehensive application performance monitoring, infrastructure monitoring, and log management in one platform.
In this example, the DataDog integration is embedded in the application itself. Alternatively, you can run the DataDog agent as a sidecar container alongside your application container.
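For reference, the sidecar approach would look roughly like the pod snippet below. Treat it as a sketch: the image tag, secret name, and environment variable values are illustrative and should be checked against DataDog’s documentation.
apiVersion: v1
kind: Pod
metadata:
  name: user-service
spec:
  containers:
    - name: user-service
      image: registry.example.com/user-service:latest
      env:
        - name: DD_AGENT_HOST   # point the tracer at the sidecar agent
          value: "localhost"
    - name: datadog-agent
      image: gcr.io/datadoghq/agent:7
      env:
        - name: DD_API_KEY
          valueFrom:
            secretKeyRef:
              name: datadog-secret
              key: api-key
        - name: DD_APM_ENABLED
          value: "true"
        - name: DD_SITE
          value: "datadoghq.eu"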
Setup Dependencies
dependencies {
    implementation 'io.micrometer:micrometer-registry-datadog'
    implementation 'com.datadoghq:dd-trace-api' // tracing API; the dd-java-agent itself is attached via -javaagent
}
Application Configuration
management:
  metrics:
    export:
      datadog:
        enabled: true
        api-key: ${DATADOG_API_KEY}
        application-key: ${DATADOG_APP_KEY}
        step: 10s
        uri: https://api.datadoghq.eu/api/ # EU instance
    tags:
      service: ${spring.application.name}
      environment: ${ENVIRONMENT:local}
      version: ${BUILD_VERSION:unknown}

logging:
  pattern:
    console: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level [dd.service=%X{dd.service:-},dd.trace_id=%X{dd.trace_id:-},dd.span_id=%X{dd.span_id:-}] %logger{36} - %msg%n"
Custom Metrics
Custom metrics are application-specific measurements that provide insights into our business logic and domain-specific behaviour. Unlike infrastructure metrics (CPU, memory, network), custom metrics track meaningful business events like user registrations, order processing times, or conversion rates.
Why use custom metrics:
- Business Intelligence: Track key performance indicators directly from our application
- Operational Insight: Understand how our business logic performs in production
- Proactive Monitoring: Detect business-level issues before they impact users
- Data-Driven Decisions: Make informed choices about system optimisation and feature development
@Component
public class BusinessMetrics {

    private final MeterRegistry meterRegistry;
    private final Counter userRegistrations;
    private final Timer orderProcessingTime;

    public BusinessMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.userRegistrations = Counter.builder("user.registrations")
            .description("Total user registrations")
            .register(meterRegistry);
        this.orderProcessingTime = Timer.builder("order.processing.time")
            .description("Time taken to process orders")
            .register(meterRegistry);
    }

    public void recordUserRegistration() {
        userRegistrations.increment();
    }

    public void recordOrderProcessing(Duration processingTime) {
        orderProcessingTime.record(processingTime);
    }
}
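To see how this component is used, here’s a minimal sketch of a service recording the order-processing timer; the OrderService and Order types are hypothetical placeholders for your own domain classes, and recordUserRegistration() would be called in the same way from the registration flow.
@Service
public class OrderService {

    private final BusinessMetrics businessMetrics;

    public OrderService(BusinessMetrics businessMetrics) {
        this.businessMetrics = businessMetrics;
    }

    public void processOrder(Order order) {
        Instant start = Instant.now();
        // ... actual order processing ...
        businessMetrics.recordOrderProcessing(Duration.between(start, Instant.now()));
    }
}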
NewRelic Alternative
NewRelic provides similar capabilities with different strengths in APM and error tracking.
Setup
dependencies {
    implementation 'com.newrelic.agent.java:newrelic-api:8.7.0'
}
JVM Arguments
-javaagent:/path/to/newrelic.jar
-Dnewrelic.config.file=/path/to/newrelic.yml
Comprehensive Testing Strategy
Production readiness requires testing at multiple levels to ensure our system behaves correctly under various conditions.
Unit Testing
Unit testing involves testing individual components or methods in isolation from the rest of the system. These tests focus on verifying that a single “unit” of code (typically a method or class) behaves correctly when given specific inputs.
Key characteristics of unit tests:
- Fast: Execute in milliseconds, enabling rapid feedback
- Isolated: Don’t depend on external systems like databases or networks
- Deterministic: Always produce the same result for the same input
- Focused: Test one specific behaviour or scenario at a time
Unit tests form the foundation of our testing strategy, providing quick feedback during development and catching regressions early in the development cycle.
Enhanced JUnit 5 Setup
dependencies {
    testImplementation 'org.junit.jupiter:junit-jupiter:5.10.1'
    testImplementation 'org.mockito:mockito-core:5.8.0'
    testImplementation 'org.mockito:mockito-junit-jupiter:5.8.0'
    testImplementation 'org.assertj:assertj-core:3.24.2'
    testImplementation 'org.awaitility:awaitility:4.2.0'
    testImplementation 'net.jqwik:jqwik:1.8.2' // Property-based testing
}
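With these dependencies in place, a conventional unit test combining Mockito and AssertJ might look like the sketch below; UserService, UserRepository, and DuplicateUserException are hypothetical stand-ins for your own classes.
@ExtendWith(MockitoExtension.class)
class UserServiceTest {

    @Mock
    private UserRepository userRepository;

    @InjectMocks
    private UserService userService;

    @Test
    void shouldRejectDuplicateEmail() {
        // Given: a user with this email already exists
        when(userRepository.existsByEmail("john@example.com")).thenReturn(true);

        // When / Then: registration fails and nothing is persisted
        assertThatThrownBy(() -> userService.register("john@example.com"))
            .isInstanceOf(DuplicateUserException.class);

        verify(userRepository, never()).save(any());
    }
}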
Property-Based Testing Example
class UserValidationTest {

    @Property
    void validEmailsShouldPassValidation(@ForAll @Email String email) {
        assertTrue(EmailValidator.isValid(email));
    }

    @Property
    void invalidEmailsShouldFailValidation(
            @ForAll @StringLength(min = 1, max = 50)
            @Chars({'a', 'b', 'c', '@'}) String invalidEmail) {
        Assume.that(!invalidEmail.contains("@") ||
            invalidEmail.indexOf("@") != invalidEmail.lastIndexOf("@"));
        assertFalse(EmailValidator.isValid(invalidEmail));
    }
}
Component Integration Testing
Component integration testing verifies that multiple parts of our system work correctly together whilst still maintaining control over external dependencies. These tests validate the interactions between our application components (like services, repositories, and message handlers) using real implementations but with lightweight, controlled versions of external systems.
Key benefits:
- Realistic Testing: Uses actual database connections, message brokers, and other infrastructure
- Controlled Environment: Lightweight containers provide consistent, isolated test environments
- Fast Feedback: Faster than full end-to-end tests but more comprehensive than unit tests
- Confidence: Validates that your application components integrate properly
TestContainers is particularly powerful for these tests because it spins up real database instances, message brokers, or other services in Docker containers specifically for your tests.
TestContainers with Multiple Services
@SpringBootTest
@Testcontainers
class UserServiceIntegrationTest {

    // Shared network so the containers can reach each other by alias
    static Network network = Network.newNetwork();

    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:15")
        .withNetwork(network)
        .withNetworkAliases("postgres")
        .withDatabaseName("testdb")
        .withUsername("test")
        .withPassword("test");

    @Container
    static GenericContainer<?> redis = new GenericContainer<>("redis:7-alpine")
        .withNetwork(network)
        .withNetworkAliases("redis")
        .withExposedPorts(6379);

    @DynamicPropertySource
    static void configureProperties(DynamicPropertyRegistry registry) {
        registry.add("spring.datasource.url", postgres::getJdbcUrl);
        registry.add("spring.datasource.username", postgres::getUsername);
        registry.add("spring.datasource.password", postgres::getPassword);
        registry.add("spring.data.redis.host", redis::getHost);
        registry.add("spring.data.redis.port", () -> redis.getMappedPort(6379));
    }

    @Test
    void shouldCreateUserWithCaching() {
        // Test implementation that verifies both database persistence
        // and Redis caching behaviour
    }
}
End-to-End Integration Testing
End-to-end integration testing validates complete user workflows across our entire system, including all external dependencies and integrations. These tests simulate real user scenarios from start to finish, ensuring that the entire system works as expected from the user’s perspective.
Characteristics:
- Complete Workflows: Tests entire user journeys (e.g., registration → login → purchase → confirmation)
- All Dependencies: Includes databases, external APIs, message queues, and other services
- Realistic Environment: Runs against production-like environments
- Slower Execution: Takes longer to run but provides highest confidence in system functionality
Whilst valuable for critical workflows, end-to-end tests should be used sparingly due to their complexity and maintenance overhead. I recommend having a dedicated suite of end-to-end tests that is independent of any single service repository.
Contract Testing with Pact
Pact is a contract testing framework that enables independent testing of service interactions. Instead of testing against real provider services, consumers define “contracts” (expectations) about how they’ll interact with providers. These contracts are then used to test both the consumer and provider independently.
Benefits of Pact testing:
- Independent Development: Teams can develop and test services independently
- Fast Feedback: No need to coordinate deployments across services for testing
- Evolutionary Design: Contracts evolve as service interfaces change
- Confidence: Ensures service compatibility without end-to-end test complexity
dependencies {
    testImplementation 'au.com.dius.pact.consumer:junit5:4.6.4'
    testImplementation 'au.com.dius.pact.provider:junit5:4.6.4'
}
Consumer Test
@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "user-service", port = "8080")
class UserServiceConsumerTest {

    @Pact(consumer = "notification-service")
    public RequestResponsePact createUserPact(PactDslWithProvider builder) {
        return builder
            .given("user exists")
            .uponReceiving("a request for user details")
            .path("/api/v1/users/123")
            .method("GET")
            .willRespondWith()
            .status(200)
            .headers(Map.of("Content-Type", "application/json"))
            .body(LambdaDsl.newJsonBody(object -> object
                .stringType("id", "123")
                .stringType("name", "John Doe")
                .stringType("email", "john@example.com")
            ).build())
            .toPact();
    }

    @Test
    @PactTestFor(pactMethod = "createUserPact")
    void testUserRetrieval(MockServer mockServer) {
        // Test implementation using the mock server
    }
}
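On the provider side, a verification test replays the recorded interactions against the running service and maps each provider state to test setup. Here is a minimal sketch, assuming the contract file is available in a local pacts directory (in practice you would typically fetch it from a Pact Broker); the annotations come from the Pact provider JUnit 5 support packages.
@Provider("user-service")
@PactFolder("pacts")
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.DEFINED_PORT)
class UserServiceProviderTest {

    @BeforeEach
    void setTarget(PactVerificationContext context) {
        // Point verification at the locally running service
        context.setTarget(new HttpTestTarget("localhost", 8080));
    }

    @State("user exists")
    void userExists() {
        // Seed or stub the user with id 123 required by this provider state
    }

    @TestTemplate
    @ExtendWith(PactVerificationInvocationContextProvider.class)
    void verifyPacts(PactVerificationContext context) {
        context.verifyInteraction();
    }
}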
Configuration Management
Proper configuration management ensures our application can be deployed across different environments without code changes.
Helm Charts
Helm is the “package manager for Kubernetes” that simplifies deploying and managing applications on Kubernetes clusters. A Helm chart is a collection of files that describe a related set of Kubernetes resources.
Key concepts:
- Templates: Kubernetes YAML files with placeholders for dynamic values
- Values: Configuration parameters that customize deployments for different environments
- Charts: Packaged applications that can be versioned, shared, and reused
- Releases: Deployed instances of charts with specific configurations
Why use Helm charts:
- Environment Management: Deploy the same application with different configurations across dev, staging, and production
- Complexity Management: Simplify complex Kubernetes deployments into manageable packages
- Reusability: Share and reuse common deployment patterns
- Rollbacks: Easy rollback to previous versions when deployments fail
Chart Structure
helm-chart/
├── Chart.yaml
├── values.yaml
├── values-dev.yaml
├── values-staging.yaml
├── values-production.yaml
└── templates/
├── deployment.yaml
├── service.yaml
├── configmap.yaml
├── secret.yaml
└── ingress.yaml
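As an illustration, a trimmed-down values-production.yaml might override only what differs per environment; the keys below are examples rather than a fixed schema.
# values-production.yaml (illustrative keys)
replicaCount: 4

image:
  repository: registry.example.com/user-service
  tag: "1.42.0"

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

ingress:
  enabled: true
  host: api.example.com
Deploying an environment then becomes a matter of selecting the right values file, e.g. helm upgrade --install user-service ./helm-chart -f helm-chart/values-production.yaml.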
Secrets Management
Secure handling of sensitive configuration data is crucial for production systems. No plaintext secrets should be stored in application code or configuration files; secrets are kept encrypted and are decrypted only at runtime, when they are made available to the running application.
HashiCorp Vault Integration
HashiCorp Vault provides centralised secrets management with encryption, access control, and audit logging capabilities.
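One common way to wire Vault into a Spring Boot service is Spring Cloud Vault, which imports secrets into the Environment at startup. A minimal sketch, assuming the spring-cloud-starter-vault-config dependency, Kubernetes authentication, and a KV secrets engine; the Vault address, role, and secret path are assumptions for your own setup:
# application.yaml — requires org.springframework.cloud:spring-cloud-starter-vault-config
spring:
  config:
    import: vault://secret/user-service
  cloud:
    vault:
      uri: https://vault.example.com:8200
      authentication: KUBERNETES
      kubernetes:
        role: user-service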
Mozilla SOPS Alternative
Mozilla SOPS (Secrets OPerationS) is a command-line tool for encrypting and decrypting structured data files (YAML, JSON, ENV, INI) using various key management systems like AWS KMS, GCP KMS, Azure Key Vault, or PGP.
Key features:
- Partial Encryption: Only encrypts values, leaving keys readable for easy management
- Version Control Safe: Encrypted files can be safely committed to Git repositories
- Multiple Key Sources: Supports various key management systems and PGP
- Team Collaboration: Multiple team members can decrypt secrets using their own keys
Why use SOPS:
- GitOps Workflows: Store encrypted secrets alongside code in version control
- Audit Trail: Track changes to secrets through Git history
- Access Control: Fine-grained control over who can decrypt specific secrets
- Simplicity: Easier to manage than complex secret management systems for smaller teams
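A typical SOPS workflow pairs a .sops.yaml policy, committed at the repository root, with encrypted secret files alongside it; the KMS key ARN and file paths below are placeholders.
# .sops.yaml — encrypt only Secret values, leaving keys readable (KMS ARN is a placeholder)
creation_rules:
  - path_regex: .*secrets.*\.yaml$
    encrypted_regex: ^(data|stringData)$
    kms: arn:aws:kms:eu-west-1:111111111111:key/example-key-id
Day-to-day usage is then a pair of commands: sops --encrypt --in-place k8s/secrets.yaml before committing, and sops --decrypt k8s/secrets.yaml | kubectl apply -f - at deploy time.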
Security Scanning
Automated security scanning helps identify vulnerabilities before they reach production.
SAST (Static Application Security Testing)
Static Application Security Testing (SAST) analyses source code, bytecode, or binary code for security vulnerabilities without executing the application. SAST tools scan your codebase to identify potential security flaws like SQL injection vulnerabilities, cross-site scripting (XSS), buffer overflows, and insecure coding practices.
Key characteristics:
- Early Detection: Finds vulnerabilities during development, before deployment
- Comprehensive Coverage: Analyses all code paths, including rarely executed ones
- No Runtime Required: Works on source code without needing a running application
- Developer Integration: Can run in IDEs and CI/CD pipelines for immediate feedback
When to use SAST:
- Development Phase: Integrate into IDE and commit hooks for immediate feedback
- CI/CD Pipeline: Gate deployments based on security scan results
- Regular Audits: Scheduled scans to catch newly discovered vulnerability patterns
- Compliance Requirements: Meet regulatory requirements for security testing
SonarQube Security Rules
sonar {
    properties {
        property "sonar.security.hotspots.ignored", "false"
        property "sonar.coverage.exclusions", "**/config/**,**/dto/**"
        property "sonar.cpd.exclusions", "**/generated/**"
    }
}
SpotBugs Security Extensions
spotbugs {
    effort = 'max'
    reportLevel = 'low'
    visitors = ['FindSecBugs']
}

dependencies {
    spotbugsPlugins 'com.h3xstream.findsecbugs:findsecbugs-plugin:1.12.0'
}
DAST (Dynamic Application Security Testing)
Dynamic Application Security Testing (DAST) tests running applications for security vulnerabilities by simulating attacks against the application from the outside. Unlike SAST, DAST doesn’t require access to source code—it tests the application as a “black box” the way an attacker would.
Key characteristics:
- Runtime Testing: Tests the actual running application in a realistic environment
- External Perspective: Simulates how an external attacker would interact with your system
- Configuration Testing: Identifies security issues in deployment configuration and infrastructure
- Realistic Scenarios: Tests how security measures work under real conditions
When to use DAST:
- Pre-Production Testing: Test staging environments before production deployment
- Penetration Testing: Automated security testing as part of regular security assessments
- Compliance Validation: Verify that security controls work as intended in deployed environments
- Regression Testing: Ensure that new deployments don’t introduce security vulnerabilities
DAST complements SAST by finding vulnerabilities that only appear when the application is running, such as authentication bypasses, session management issues, and configuration problems.
OWASP ZAP Integration
# GitHub Actions workflow
- name: OWASP ZAP Scan
  uses: zaproxy/action-full-scan@v0.8.0
  with:
    target: 'https://staging.example.com'
    rules_file_name: '.zap/rules.tsv'
    cmd_options: '-a'
API Contracts and Schema Registry
Production systems require robust API governance and schema management.
Schema Registry with Contract Validation
Confluent Schema Registry Setup
dependencies {
    implementation 'io.confluent:kafka-avro-serializer:7.5.1'
    implementation 'io.confluent:kafka-schema-registry-client:7.5.1'
}
Schema Evolution Policy
Schema evolution allows you to modify data structures over time whilst maintaining backward compatibility. This is crucial for distributed systems where different services may be updated at different times.
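In practice this means agreeing on a compatibility mode per subject and only making changes that mode allows. With BACKWARD compatibility (the Schema Registry default), for example, new fields must carry defaults so that consumers on the new schema can still read records written with the old one. A sketch of such a change for a hypothetical user-events value schema:
{
  "type": "record",
  "name": "UserRegistered",
  "namespace": "com.example.events",
  "fields": [
    { "name": "userId", "type": "string" },
    { "name": "email", "type": "string" },
    { "name": "referralCode", "type": ["null", "string"], "default": null }
  ]
}
The compatibility level can be enforced per subject through the registry’s REST API, e.g. a PUT to /config/user-events-value with the body {"compatibility": "BACKWARD"}.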
Client Generation
OpenAPI Client Generation
plugins {
    id 'org.openapi.generator' version '7.2.0'
}

openApiGenerate {
    generatorName = 'java'
    inputSpec = "$rootDir/api-specs/user-service-api.yaml"
    outputDir = "$buildDir/generated-client"
    apiPackage = 'com.example.userservice.client.api'
    modelPackage = 'com.example.userservice.client.model'
    configOptions = [
        dateLibrary: "java8-localdatetime",
        java8: "true",
        interfaceOnly: "false",
        useTags: "true"
    ]
}
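To compile against the generated client, the generated sources also need to be wired into the build. One common approach, using the outputDir configured above, looks like this:
// Include the generated sources and make compilation depend on generation
sourceSets {
    main {
        java {
            srcDir "$buildDir/generated-client/src/main/java"
        }
    }
}

compileJava.dependsOn tasks.named('openApiGenerate')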
Run Books and Play Books
Operational procedures must be documented and accessible to your team.
Incident Response Runbook
Runbooks are documented procedures that provide step-by-step instructions for handling specific operational scenarios, incidents, or maintenance tasks. They serve as the “playbook” for your operations team, ensuring consistent and effective responses to common situations.
Key components of runbooks:
- Symptoms: How to identify when the runbook applies
- Investigation Steps: Systematic approach to diagnosing the issue
- Resolution Actions: Step-by-step instructions to resolve the problem
- Escalation Procedures: When and how to escalate if initial steps don’t work
- Prevention Measures: Long-term actions to prevent recurrence
Why runbooks are essential:
- Consistency: Ensure all team members respond to incidents the same way
- Speed: Reduce time to resolution with pre-planned response procedures
- Knowledge Sharing: Capture institutional knowledge and make it accessible
- Training: Help new team members learn operational procedures
- Stress Reduction: Provide clear guidance during high-pressure incident situations
Structure
# User Service Incident Response
## High CPU Usage
### Symptoms
- CPU utilisation > 80% for 5+ minutes
- Response times > 2 seconds
- Error rate > 1%
### Investigation Steps
1. Check DataDog dashboard: https://app.datadoghq.com/dashboard/abc-123
2. Review recent deployments in GitHub
3. Check database connection pool metrics
4. Examine heap memory usage
### Immediate Actions
1. Scale horizontally: `kubectl scale deployment user-service --replicas=10`
2. Restart unhealthy pods: `kubectl delete pods -l app=user-service --field-selector=status.phase!=Running`
### Root Cause Analysis
- Review application logs for memory leaks
- Check for inefficient database queries
- Analyse traffic patterns for unusual spikes
Automated Database Migrations
Database migrations are version-controlled scripts that systematically evolve our database schema over time. Each migration represents a specific change to the database structure (like adding tables, modifying columns, or creating indexes) and can be applied or rolled back as needed.
Key principles:
- Version Control: Each migration has a unique version number and is stored in source control
- Incremental Changes: Small, focused changes that can be applied sequentially
- Idempotent: Safe to run multiple times without causing issues
- Rollback Support: Ability to undo changes if problems occur
Why automated database migrations are crucial:
- Environment Consistency: Ensure all environments (dev, staging, production) have identical database schemas
- Deployment Safety: Reduce risk of manual schema changes causing production issues
- Team Collaboration: Multiple developers can work on schema changes without conflicts
- Audit Trail: Track exactly when and why schema changes were made
- Automated Deployment: Include database changes as part of our CI/CD pipeline
Flyway is a popular migration tool that tracks applied migrations and ensures they’re applied in the correct order.
Flyway Configuration
plugins {
    id 'org.flywaydb.flyway' version '9.22.3'
}

flyway {
    url = 'jdbc:postgresql://localhost:5432/myapp'
    user = 'dbuser'
    password = 'dbpass'
    schemas = ['public']
    locations = ['classpath:db/migration']
    baselineOnMigrate = true
    validateOnMigrate = true
}
Migration Scripts
-- V1__Create_users_table.sql
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    username VARCHAR(50) NOT NULL UNIQUE,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_created_at ON users(created_at);
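Subsequent schema changes go into new, versioned scripts rather than edits to already-applied ones. For example, a later migration adding a status column might look like this (the column itself is illustrative):
-- V2__Add_user_status.sql
ALTER TABLE users
    ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'ACTIVE';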
This production readiness phase ensures our distributed system can handle real-world challenges whilst providing the operational visibility and maintainability needed for long-term success. The next phase will focus on operational maturity and advanced reliability patterns.