Microservices at Scale: The 5-Repository Pattern
How we structure microservices in production with 5 independent repositories, hexagonal architecture, reactive programming with Kafka, and a full observability stack on AWS.
Microservices at Scale: The 5-Repository Pattern
Processing thousands of payments per second involves more than writing good Java code. It involves deciding how that code is organised, deployed, monitored, and recovered when something goes wrong — without a Grafana dashboard change triggering a service build.
This post describes the architecture we use at BBVA to structure payment microservices: the 5-repository pattern, along with the code, infrastructure, and observability patterns that make it work in production.
To ground the concepts, everything revolves around a fictional microservice called payment-bridge: an adapter between the internal payment platform and an external provider called VerifyPay that validates transactions. It receives encrypted Kafka events, decrypts them, transforms them, and sends them over HTTPS. The architecture it describes is identical to real services.
The Problem: A Microservice Is Not One Repository
The most important decision — and the one that surprises people coming from other projects — is that each microservice doesn’t live in a single repository. It lives in five:
| Repository | Responsibility | Language |
|---|---|---|
payment-bridge | Business logic | Java (Spring Boot) |
payment-bridge-infrastructure-composer | AWS infrastructure provisioning | Python |
payment-bridge-alarms-composer | CloudWatch alarms | Python |
payment-bridge-monitoring-composer | Grafana dashboards | Python |
payment-bridge-gocd-pipelines | CI/CD pipelines | JSON (GoCD) |
Why? The Single Responsibility Principle applied at the repository level.
Each repository has its own lifecycle, its own pipelines, and its own cadence of change. You can update an alarm threshold without redeploying the service. You can refactor the Java service without touching the infrastructure. You can create a new dashboard without the service build knowing anything about it.
This also has practical implications during incidents: if there’s a Kafka lag alert at 3am, the on-call team can adjust the threshold in alarms-composer and deploy it in minutes, with no release gate from the main service.
The Service: Hexagonal Architecture
The main repository (payment-bridge) implements the service with hexagonal architecture (ports and adapters). The idea is that business logic doesn’t know where the data comes from or how it’s transported.
The Ports
- Inbound ports: A Kafka consumer (
PaymentRequestsConsumer) and a REST controller (PaymentVerificationResource). Both feed into the same domain. - Outbound ports:
VerifyPayClient(HTTPS to VerifyPay) andCipherClient(to the decryption sidecar).
The Domain
The *Flow classes are the heart. They implement Function<KafkaRecord<byte[]>, Mono<Void>> — a simple functional interface that makes them testable, composable, and completely decoupled from the transport:
@Component
public class PaymentRequestsFlow implements Function<KafkaRecord<byte[]>, Mono<Void>> {
@Override
public Mono<Void> apply(KafkaRecord<byte[]> kafkaRecord) {
return Mono.just(kafkaRecord)
.flatMap(this::decryptKafkaRecord) // 1. Decrypt
.flatMap(this::translateToVerificationRequest) // 2. Transform
.flatMap(this::notifyToVerifyPay) // 3. Send
.then();
}
}
Each step in the pipeline transforms the payload type:
byte[] (encrypted) → Map<String,Object> (decrypted) → Map<String,Object> (VerifyPay format)
This is made possible by the immutable record as context carrier pattern, which deserves its own section.
KafkaRecord: The Immutable Carrier
In a reactive pipeline where the payload type changes at each step, you need to carry the original Kafka metadata (headers, acknowledgment, transaction-id) throughout the entire chain. KafkaRecord solves this elegantly:
public record KafkaRecord<T>(
ConsumerRecord<String, byte[]> consumerRecord,
T payload,
Acknowledgment acknowledgment
) {
public <T> KafkaRecord<T> withPayload(T newPayload) {
return new KafkaRecord<>(consumerRecord, newPayload, acknowledgment);
}
}
withPayload() creates a new record with the transformed payload, while preserving the original consumerRecord (which contains the tracing headers) and the acknowledgment. The result is that transaction-id travels effortlessly from the moment the Kafka message arrives to the final ACK, with no need to pass it explicitly through every method.
This is a perfect use of Java Records: immutability, zero boilerplate, and clear semantics.
The Reactive Bridge: Why block()?
The Spring Boot code is reactive (Project Reactor), but Spring Kafka expects a synchronous listener. CommonKafkaConsumer resolves this tension with a bridge pattern:
public void listen(KafkaRecord<byte[]> kafkaRecord, Acknowledgment acknowledgment) {
var firstAttemptTimestamp = firstTimestampKafkaMessageStore
.getFirstTimestamp(getPartition(kafkaRecord), getOffset(kafkaRecord), topicName.get());
Mono.just(kafkaRecord)
.filter(this::shouldConsumeFrom) // Feature toggle check
.doOnNext(record -> kafkaRecordMetrics
.emitMessageTripMetric(getTopic(record), getPartition(record), getProducerTimeHeader(record)))
.flatMap(kafkaFlow) // Business flow (injected)
.doOnSuccess(unused -> acknowledgment.acknowledge())
.doOnError(not(shouldDiscard(firstAttemptTimestamp)), this::logRetryError)
.doOnError(shouldDiscard(firstAttemptTimestamp), this::logDiscardError)
.doOnError(shouldDiscard(firstAttemptTimestamp), unused -> acknowledgment.acknowledge())
.onErrorComplete(shouldDiscard(firstAttemptTimestamp))
.block(); // Bridge: Reactor → Spring Kafka
}
The block() at the end is the only point where reactivity is “broken”. It’s not an antipattern here — it’s a deliberate decision to bridge two models. Everything inside the Mono is non-blocking; only the Spring Kafka boundary is synchronous.
This also implements sophisticated discard logic: if a message has been retrying longer than the configured maximum, it gets discarded (ACKed and logged as a discard). If it’s a recent message, the error is logged so Spring Kafka will retry it.
Kafka: Configuration and Naming
The Kafka configuration is designed for manual-ACK, exactly-once semantics:
| Configuration | Value | Why |
|---|---|---|
| ACK mode | MANUAL_IMMEDIATE | ACK only when processing completes successfully |
| Deserializer | ByteArrayDeserializer | Messages arrive encrypted; bytes first, then decrypt |
| Auto-commit | false | Complements manual ACK |
| Auto offset reset | earliest | Process pending messages when a new consumer joins |
| Concurrency | NUMBER_OF_PARTITIONS | One thread per partition for maximum parallelism |
Topic names embed the service version, enabling zero-downtime migrations:
public class TopicName implements Supplier<String> {
@Override
public String get() {
// "payment-requests" + "0.51.99+BUILD" → "payment-requests-0-51-99-BUILD"
return getKafkaCompliantName(getCompleteName());
}
}
This means you can deploy a new service version consuming a different topic while the old version keeps processing, then migrate traffic gradually.
Encryption: The Sidecar Pattern
Kafka messages travel encrypted with KMS (AWS Key Management Service). Decryption isn’t handled by the service directly — it delegates to a sidecar (localhost:8082) that has access to KMS keys:
public class CipherService {
public Mono<Map<String, Object>> decrypt(byte[] message) {
return Mono.just(message)
.map(Hex::encodeHexString) // bytes → hex
.map(hex -> Map.<String, Object>of("input-data", hex))
.flatMap(request -> cipherClient.decrypt(keyLabel, request)) // HTTP to sidecar
.map(response -> response.get("output-data").toString())
.map(this::convertToMap); // hex → JSON → Map
}
}
This decision has important implications: the main service needs no KMS credentials. If the sidecar is unavailable, decryption fails and the message is retried. Secret management is completely outside the application code.
HTTP Client: Layered Resilience
Communication with VerifyPay has three layers of resilience:
Layer 1 — WebClient: Aggressive connection pooling and strict timeouts for high load:
var connectionProvider = ConnectionProvider.builder("verifyPayConnectionPool")
.maxConnections(1000)
.pendingAcquireMaxCount(2000)
.build();
HttpClient.create(connectionProvider)
.option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 500)
.responseTimeout(Duration.ofMillis(400));
Layer 2 — Error mapping: Typed network errors for precise discrimination:
.onErrorMap(this::isConnectionRefusedError, this::createConnectionRefusedException)
.onErrorMap(this::isRequestTimeoutError, this::createRequestTimeoutException)
Layer 3 — Discriminating retry:
.retryWhen(retryServerError) // 3 retries for 5xx (300ms delay)
.retryWhen(retryUnrecognizableError) // 3 retries for unknown errors
| Error type | Action |
|---|---|
| 4xx | Discard the message (MessageDiscardedException) |
| 5xx | Retry 3 times (300ms) |
| Connection Refused / Timeout | Map to 503, retryable |
| Unknown error | Retry 3 times → final exception |
For 4xx errors there’s a configurable fallback pattern: rather than blocking the operation, return a default response. This is a business decision — we prefer processing the transaction with a safe default over rejecting it because of a provider technical failure.
Feature Toggles: Operational Kill Switches
FF4j (in-memory) provides feature toggles that let you change service behaviour without redeploying:
@Configuration
public class FF4jConfig {
@Bean
public FF4j getFF4j() {
var ff4j = new FF4j();
ff4j.setFeatureStore(new InMemoryFeatureStore());
ff4j.autoCreate(false);
return ff4j;
}
}
Typical production toggles:
| Toggle | Purpose |
|---|---|
consume-from-payment-requests | Kill switch: stop consuming the payment topic |
consume-from-settlements | Kill switch: stop consuming settlements |
use-release-candidate-api | Test VerifyPay RC API version |
enable-multi-currency | Gradually activate multi-currency support |
Managed at runtime via REST:
GET /admin/features → List all toggles
PUT /admin/features/{feature} → Enable/disable without redeploying
In practice, the most valuable toggles are the kill switches: when a topic gets a spike of corrupted messages, you can disable consumption in seconds — no hotfix, no emergency deployment required.
Infrastructure: Composers Instead of Terraform
The infrastructure repository provisions all AWS resources using custom composers — Python libraries wrapping Boto3. Instead of Terraform or CloudFormation, explicit Python code controls the creation order and dependencies:
cad_payment_bridge_infrastructure_composer create \
--environment play \
--cde-version 0.51.99 \
--region eu-west-1
The order of operations matters:
- Parameter Store with Kafka brokers connection string
- ECS Fargate + API Gateway + networking
- Multi-region KMS keys (Ireland + Frankfurt)
- IAM permissions so the task role can access KMS
- Private CA and Secrets Manager access
Multi-region from day one: each environment is deployed in two AWS regions (Ireland eu-west-1 and Frankfurt eu-central-1). KMS keys are multi-region, allowing consumers in either region to decrypt messages produced in the other.
Observability: Three Layers
CloudWatch Alarms
The alarms-composer defines five alarm types covering the complete health of the service:
| Alarm | What it monitors | Threshold |
|---|---|---|
| Availability | Whether the process is running | process_uptime = 0 |
| Errors | Log-level ERROR events | > 4 errors in the period |
| Discards | Messages discarded after retries | ≥ 350 discards / 15 min |
| Kafka Lags | Consumer group lag | ≥ 1000 messages |
| Certificate Expiry | Days until certificate expiration | ≤ 15 days |
The Kafka Lags alarm deserves special attention: native Kafka metrics in CloudWatch are scattered across the AWS/Kafka namespace. An aggregation Lambda (part of the alarms-composer, running every 2 minutes via EventBridge) aggregates and publishes them to the service’s custom namespace:
AWS/Kafka (SumOffsetLag per consumer group)
↓ Lambda (EventBridge, every 2 min)
AWS/ECS/{env}/payment-bridge (total_kafka_lags)
↓
CloudWatch Alarm (kafka-lags ≥ 1000)
Grafana Dashboards
The monitoring-composer creates three environment-parameterised dashboards:
- Status: business metrics (requests/s, latency, errors, messages processed)
- JVM: heap memory, GC pauses, active threads
- Envoy: proxy sidecar metrics
Template variables allow reusing dashboards across environments:
${selectedEnvironment} = live / play
${selectedDatasource} = CloudWatch (LIVE) / CloudWatch (WORK)
The Shared Namespace
All components converge on a unified CloudWatch namespace:
AWS/ECS/{environment}/payment-bridge
The service publishes metrics here (via Micrometer → StatsD). Alarms query here. The Kafka Lambda aggregates here. Dashboards visualise from here. It’s a clear contract: who produces, who consumes, where.
CI/CD: The Environment Chain
The gocd-pipelines defines all orchestration. The main service flow follows a progressive chain with manual approval gates:
Build (auto) → Acceptance Tests → Security Checks → Release Gate
→ Play → Work → Performance → Live
Key details:
Build runs on every commit: compilation, unit tests, JaCoCo (coverage), PMD (static analysis).
Security runs daily at 8:00 and on every change: Semgrep (code bugs/vulnerabilities), OWASP Dependency-Check (CVEs in dependencies), Trivy (Docker image vulnerabilities).
The Release Gate requires manual approval and requires both acceptance tests and the trivy stage to have passed. Only then are Docker images published to the registry.
Each deployment requires manual approval. For live, the release-manager role is additionally required.
Each environment deploys to two regions (Frankfurt + Ireland) with separate pipelines. In total: ~31 pipelines in dev/test, 6 in production — all complexity encapsulated in the fifth repository.
Lessons Learned
1. SRP at the repository level has real value. The independence of lifecycles isn’t theoretical — in practice, alarm and dashboard changes happen 10x more frequently than service changes. Mixing them would generate constant noise in commit history and pipelines.
2. Reactivity doesn’t have to be all-or-nothing. The block() in CommonKafkaConsumer is a legitimate pattern. Forcing pure reactivity at synchronous system boundaries generates unnecessary complexity. Pragmatism matters.
3. Feature toggles are infrastructure, not a feature. Having kill switches before deploying to production — not after — is the difference between a 5-minute incident and a 2-hour one.
4. The shared metrics namespace is a team contract. When alarms, the Lambda, dashboards, and the service all converge on AWS/ECS/{env}/payment-bridge, any team member knows exactly where to look. No more “which namespace are the Kafka metrics in?”.
5. Multi-region KMS keys from day one. Adding multi-region support retroactively to an encrypted payment service is painful. Designing it from day one costs nothing.
Summary
- 5 repos = SRP at the repository level. Independent lifecycles, isolated pipelines, granular rollbacks.
- Hexagonal architecture:
*Flowclasses are pure business logic; consumers and controllers are inbound ports; clients are outbound ports. KafkaRecord<T>: immutable record as context carrier through the transformation pipeline.block()in the consumer: legitimate bridge pattern between Reactor and Spring Kafka. Not an antipattern — pragmatism.- Sidecar encryption: the main service manages no KMS credentials; delegation to a local sidecar.
- Feature toggles = kill switches: ability to disable topic consumption without redeploying.
- Observability as code: alarms, the aggregation Lambda, and Grafana dashboards live in versioned repos and are deployed with the same discipline as the service.
- Environment chain: play → work → performance → live, with manual approval gates and
release-managerfor production.