Microservices at Scale: The 5-Repository Pattern

Processing thousands of payments per second involves more than writing good Java code. It involves deciding how that code is organised, deployed, monitored, and recovered when something goes wrong — without a Grafana dashboard change triggering a service build.

This post describes the architecture we use at BBVA to structure payment microservices: the 5-repository pattern, along with the code, infrastructure, and observability patterns that make it work in production.

To ground the concepts, everything revolves around a fictional microservice called payment-bridge: an adapter between the internal payment platform and an external provider called VerifyPay that validates transactions. It receives encrypted Kafka events, decrypts them, transforms them, and sends them over HTTPS. The architecture it describes is identical to real services.

The Problem: A Microservice Is Not One Repository

The most important decision — and the one that surprises people coming from other projects — is that each microservice doesn’t live in a single repository. It lives in five:

Repository	Responsibility	Language
`payment-bridge`	Business logic	Java (Spring Boot)
`payment-bridge-infrastructure-composer`	AWS infrastructure provisioning	Python
`payment-bridge-alarms-composer`	CloudWatch alarms	Python
`payment-bridge-monitoring-composer`	Grafana dashboards	Python
`payment-bridge-gocd-pipelines`	CI/CD pipelines	JSON (GoCD)

Why? The Single Responsibility Principle applied at the repository level.

Each repository has its own lifecycle, its own pipelines, and its own cadence of change. You can update an alarm threshold without redeploying the service. You can refactor the Java service without touching the infrastructure. You can create a new dashboard without the service build knowing anything about it.

This also has practical implications during incidents: if there’s a Kafka lag alert at 3am, the on-call team can adjust the threshold in alarms-composer and deploy it in minutes, with no release gate from the main service.

The Service: Hexagonal Architecture

The main repository (payment-bridge) implements the service with hexagonal architecture (ports and adapters). The idea is that business logic doesn’t know where the data comes from or how it’s transported.

The Ports

Inbound ports: A Kafka consumer (PaymentRequestsConsumer) and a REST controller (PaymentVerificationResource). Both feed into the same domain.
Outbound ports: VerifyPayClient (HTTPS to VerifyPay) and CipherClient (to the decryption sidecar).

The Domain

The *Flow classes are the heart. They implement Function<KafkaRecord<byte[]>, Mono<Void>> — a simple functional interface that makes them testable, composable, and completely decoupled from the transport:

@Component
public class PaymentRequestsFlow implements Function<KafkaRecord<byte[]>, Mono<Void>> {

    @Override
    public Mono<Void> apply(KafkaRecord<byte[]> kafkaRecord) {
        return Mono.just(kafkaRecord)
            .flatMap(this::decryptKafkaRecord)               // 1. Decrypt
            .flatMap(this::translateToVerificationRequest)    // 2. Transform
            .flatMap(this::notifyToVerifyPay)                 // 3. Send
            .then();
    }
}

Each step in the pipeline transforms the payload type:

byte[] (encrypted) → Map<String,Object> (decrypted) → Map<String,Object> (VerifyPay format)

This is made possible by the immutable record as context carrier pattern, which deserves its own section.

KafkaRecord: The Immutable Carrier

In a reactive pipeline where the payload type changes at each step, you need to carry the original Kafka metadata (headers, acknowledgment, transaction-id) throughout the entire chain. KafkaRecord solves this elegantly:

public record KafkaRecord<T>(
    ConsumerRecord<String, byte[]> consumerRecord,
    T payload,
    Acknowledgment acknowledgment
) {
    public <T> KafkaRecord<T> withPayload(T newPayload) {
        return new KafkaRecord<>(consumerRecord, newPayload, acknowledgment);
    }
}

withPayload() creates a new record with the transformed payload, while preserving the original consumerRecord (which contains the tracing headers) and the acknowledgment. The result is that transaction-id travels effortlessly from the moment the Kafka message arrives to the final ACK, with no need to pass it explicitly through every method.

This is a perfect use of Java Records: immutability, zero boilerplate, and clear semantics.

The Reactive Bridge: Why `block()`?

The Spring Boot code is reactive (Project Reactor), but Spring Kafka expects a synchronous listener. CommonKafkaConsumer resolves this tension with a bridge pattern:

public void listen(KafkaRecord<byte[]> kafkaRecord, Acknowledgment acknowledgment) {
    var firstAttemptTimestamp = firstTimestampKafkaMessageStore
        .getFirstTimestamp(getPartition(kafkaRecord), getOffset(kafkaRecord), topicName.get());

    Mono.just(kafkaRecord)
        .filter(this::shouldConsumeFrom)              // Feature toggle check
        .doOnNext(record -> kafkaRecordMetrics
            .emitMessageTripMetric(getTopic(record), getPartition(record), getProducerTimeHeader(record)))
        .flatMap(kafkaFlow)                           // Business flow (injected)
        .doOnSuccess(unused -> acknowledgment.acknowledge())
        .doOnError(not(shouldDiscard(firstAttemptTimestamp)), this::logRetryError)
        .doOnError(shouldDiscard(firstAttemptTimestamp), this::logDiscardError)
        .doOnError(shouldDiscard(firstAttemptTimestamp), unused -> acknowledgment.acknowledge())
        .onErrorComplete(shouldDiscard(firstAttemptTimestamp))
        .block();                                     // Bridge: Reactor → Spring Kafka
}

The block() at the end is the only point where reactivity is “broken”. It’s not an antipattern here — it’s a deliberate decision to bridge two models. Everything inside the Mono is non-blocking; only the Spring Kafka boundary is synchronous.

This also implements sophisticated discard logic: if a message has been retrying longer than the configured maximum, it gets discarded (ACKed and logged as a discard). If it’s a recent message, the error is logged so Spring Kafka will retry it.

Kafka: Configuration and Naming

The Kafka configuration is designed for manual-ACK, exactly-once semantics:

Configuration	Value	Why
ACK mode	`MANUAL_IMMEDIATE`	ACK only when processing completes successfully
Deserializer	`ByteArrayDeserializer`	Messages arrive encrypted; bytes first, then decrypt
Auto-commit	`false`	Complements manual ACK
Auto offset reset	`earliest`	Process pending messages when a new consumer joins
Concurrency	`NUMBER_OF_PARTITIONS`	One thread per partition for maximum parallelism

Topic names embed the service version, enabling zero-downtime migrations:

public class TopicName implements Supplier<String> {
    @Override
    public String get() {
        // "payment-requests" + "0.51.99+BUILD" → "payment-requests-0-51-99-BUILD"
        return getKafkaCompliantName(getCompleteName());
    }
}

This means you can deploy a new service version consuming a different topic while the old version keeps processing, then migrate traffic gradually.

Encryption: The Sidecar Pattern

Kafka messages travel encrypted with KMS (AWS Key Management Service). Decryption isn’t handled by the service directly — it delegates to a sidecar (localhost:8082) that has access to KMS keys:

public class CipherService {

    public Mono<Map<String, Object>> decrypt(byte[] message) {
        return Mono.just(message)
            .map(Hex::encodeHexString)                           // bytes → hex
            .map(hex -> Map.<String, Object>of("input-data", hex))
            .flatMap(request -> cipherClient.decrypt(keyLabel, request))  // HTTP to sidecar
            .map(response -> response.get("output-data").toString())
            .map(this::convertToMap);                            // hex → JSON → Map
    }
}

This decision has important implications: the main service needs no KMS credentials. If the sidecar is unavailable, decryption fails and the message is retried. Secret management is completely outside the application code.

HTTP Client: Layered Resilience

Communication with VerifyPay has three layers of resilience:

Layer 1 — WebClient: Aggressive connection pooling and strict timeouts for high load:

var connectionProvider = ConnectionProvider.builder("verifyPayConnectionPool")
    .maxConnections(1000)
    .pendingAcquireMaxCount(2000)
    .build();

HttpClient.create(connectionProvider)
    .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 500)
    .responseTimeout(Duration.ofMillis(400));

Layer 2 — Error mapping: Typed network errors for precise discrimination:

.onErrorMap(this::isConnectionRefusedError, this::createConnectionRefusedException)
.onErrorMap(this::isRequestTimeoutError, this::createRequestTimeoutException)

Layer 3 — Discriminating retry:

.retryWhen(retryServerError)          // 3 retries for 5xx (300ms delay)
.retryWhen(retryUnrecognizableError)  // 3 retries for unknown errors

Error type	Action
4xx	Discard the message (`MessageDiscardedException`)
5xx	Retry 3 times (300ms)
Connection Refused / Timeout	Map to 503, retryable
Unknown error	Retry 3 times → final exception

For 4xx errors there’s a configurable fallback pattern: rather than blocking the operation, return a default response. This is a business decision — we prefer processing the transaction with a safe default over rejecting it because of a provider technical failure.

Feature Toggles: Operational Kill Switches

FF4j (in-memory) provides feature toggles that let you change service behaviour without redeploying:

@Configuration
public class FF4jConfig {
    @Bean
    public FF4j getFF4j() {
        var ff4j = new FF4j();
        ff4j.setFeatureStore(new InMemoryFeatureStore());
        ff4j.autoCreate(false);
        return ff4j;
    }
}

Typical production toggles:

Toggle	Purpose
`consume-from-payment-requests`	Kill switch: stop consuming the payment topic
`consume-from-settlements`	Kill switch: stop consuming settlements
`use-release-candidate-api`	Test VerifyPay RC API version
`enable-multi-currency`	Gradually activate multi-currency support

Managed at runtime via REST:

GET  /admin/features            → List all toggles
PUT  /admin/features/{feature}  → Enable/disable without redeploying

In practice, the most valuable toggles are the kill switches: when a topic gets a spike of corrupted messages, you can disable consumption in seconds — no hotfix, no emergency deployment required.

Infrastructure: Composers Instead of Terraform

The infrastructure repository provisions all AWS resources using custom composers — Python libraries wrapping Boto3. Instead of Terraform or CloudFormation, explicit Python code controls the creation order and dependencies:

cad_payment_bridge_infrastructure_composer create \
  --environment play \
  --cde-version 0.51.99 \
  --region eu-west-1

The order of operations matters:

Parameter Store with Kafka brokers connection string
ECS Fargate + API Gateway + networking
Multi-region KMS keys (Ireland + Frankfurt)
IAM permissions so the task role can access KMS
Private CA and Secrets Manager access

Multi-region from day one: each environment is deployed in two AWS regions (Ireland eu-west-1 and Frankfurt eu-central-1). KMS keys are multi-region, allowing consumers in either region to decrypt messages produced in the other.

Observability: Three Layers

CloudWatch Alarms

The alarms-composer defines five alarm types covering the complete health of the service:

Alarm	What it monitors	Threshold
Availability	Whether the process is running	`process_uptime = 0`
Errors	Log-level ERROR events	> 4 errors in the period
Discards	Messages discarded after retries	≥ 350 discards / 15 min
Kafka Lags	Consumer group lag	≥ 1000 messages
Certificate Expiry	Days until certificate expiration	≤ 15 days

The Kafka Lags alarm deserves special attention: native Kafka metrics in CloudWatch are scattered across the AWS/Kafka namespace. An aggregation Lambda (part of the alarms-composer, running every 2 minutes via EventBridge) aggregates and publishes them to the service’s custom namespace:

AWS/Kafka (SumOffsetLag per consumer group)
    ↓ Lambda (EventBridge, every 2 min)
AWS/ECS/{env}/payment-bridge (total_kafka_lags)
    ↓
CloudWatch Alarm (kafka-lags ≥ 1000)

Grafana Dashboards

The monitoring-composer creates three environment-parameterised dashboards:

Status: business metrics (requests/s, latency, errors, messages processed)
JVM: heap memory, GC pauses, active threads
Envoy: proxy sidecar metrics

Template variables allow reusing dashboards across environments:

${selectedEnvironment} = live / play
${selectedDatasource} = CloudWatch (LIVE) / CloudWatch (WORK)

The Shared Namespace

All components converge on a unified CloudWatch namespace:

AWS/ECS/{environment}/payment-bridge

The service publishes metrics here (via Micrometer → StatsD). Alarms query here. The Kafka Lambda aggregates here. Dashboards visualise from here. It’s a clear contract: who produces, who consumes, where.

CI/CD: The Environment Chain

The gocd-pipelines defines all orchestration. The main service flow follows a progressive chain with manual approval gates:

Build (auto) → Acceptance Tests → Security Checks → Release Gate
    → Play → Work → Performance → Live

Key details:

Build runs on every commit: compilation, unit tests, JaCoCo (coverage), PMD (static analysis).

Security runs daily at 8:00 and on every change: Semgrep (code bugs/vulnerabilities), OWASP Dependency-Check (CVEs in dependencies), Trivy (Docker image vulnerabilities).

The Release Gate requires manual approval and requires both acceptance tests and the trivy stage to have passed. Only then are Docker images published to the registry.

Each deployment requires manual approval. For live, the release-manager role is additionally required.

Each environment deploys to two regions (Frankfurt + Ireland) with separate pipelines. In total: ~31 pipelines in dev/test, 6 in production — all complexity encapsulated in the fifth repository.

Lessons Learned

1. SRP at the repository level has real value. The independence of lifecycles isn’t theoretical — in practice, alarm and dashboard changes happen 10x more frequently than service changes. Mixing them would generate constant noise in commit history and pipelines.

2. Reactivity doesn’t have to be all-or-nothing. The block() in CommonKafkaConsumer is a legitimate pattern. Forcing pure reactivity at synchronous system boundaries generates unnecessary complexity. Pragmatism matters.

3. Feature toggles are infrastructure, not a feature. Having kill switches before deploying to production — not after — is the difference between a 5-minute incident and a 2-hour one.

4. The shared metrics namespace is a team contract. When alarms, the Lambda, dashboards, and the service all converge on AWS/ECS/{env}/payment-bridge, any team member knows exactly where to look. No more “which namespace are the Kafka metrics in?”.

5. Multi-region KMS keys from day one. Adding multi-region support retroactively to an encrypted payment service is painful. Designing it from day one costs nothing.

Summary

5 repos = SRP at the repository level. Independent lifecycles, isolated pipelines, granular rollbacks.
Hexagonal architecture: *Flow classes are pure business logic; consumers and controllers are inbound ports; clients are outbound ports.
KafkaRecord<T>: immutable record as context carrier through the transformation pipeline.
block() in the consumer: legitimate bridge pattern between Reactor and Spring Kafka. Not an antipattern — pragmatism.
Sidecar encryption: the main service manages no KMS credentials; delegation to a local sidecar.
Feature toggles = kill switches: ability to disable topic consumption without redeploying.
Observability as code: alarms, the aggregation Lambda, and Grafana dashboards live in versioned repos and are deployed with the same discipline as the service.
Environment chain: play → work → performance → live, with manual approval gates and release-manager for production.