Building Microservices at Scale: Lessons from Production

After building microservices at Agoda and CP Axtra that handle millions of transactions daily, I’ve learned valuable lessons about what works and what doesn’t when scaling distributed systems.

The Challenge

When you’re processing millions of payment transactions or managing inventory across thousands of retail locations, every millisecond counts. Here are the key challenges we faced:

  • Service Communication Overhead: Network latency between services
  • Data Consistency: Maintaining consistency across distributed databases
  • Service Discovery: Dynamic service registration and discovery
  • Fault Tolerance: Graceful degradation when services fail

Key Architectural Patterns

1. Event-Driven Architecture with Apache Kafka

import java.time.Instant;

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class PaymentEventPublisher {

    private final KafkaTemplate<String, PaymentEvent> kafkaTemplate;

    public PaymentEventPublisher(KafkaTemplate<String, PaymentEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publishPaymentCompleted(Payment payment) {
        PaymentEvent event = PaymentEvent.builder()
            .paymentId(payment.getId())
            .amount(payment.getAmount())
            .status(PaymentStatus.COMPLETED)
            .timestamp(Instant.now())
            .build();

        // Key by payment ID so all events for the same payment land on the same
        // partition and are consumed in order
        kafkaTemplate.send("payment-events", String.valueOf(payment.getId()), event);
    }
}
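
Publishing is only half of the pattern: downstream services subscribe to the topic and react asynchronously. Here’s a minimal consumer sketch using Spring Kafka’s @KafkaListener; the InventoryService, its reserveStock method, and the group id are placeholders, not code from our system:

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

@Service
public class PaymentEventListener {

    private final InventoryService inventoryService; // placeholder downstream dependency

    public PaymentEventListener(InventoryService inventoryService) {
        this.inventoryService = inventoryService;
    }

    // Consumes from the same topic the publisher writes to
    @KafkaListener(topics = "payment-events", groupId = "inventory-service")
    public void onPaymentEvent(PaymentEvent event) {
        if (event.getStatus() == PaymentStatus.COMPLETED) {
            inventoryService.reserveStock(event.getPaymentId());
        }
    }
}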

2. Circuit Breaker Pattern

import java.util.concurrent.CompletableFuture;

import org.springframework.stereotype.Component;

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;

@Component
public class PaymentServiceClient {

    // Downstream client that actually calls the payment provider
    // (e.g. a Feign or WebClient wrapper returning a CompletableFuture)
    private final PaymentGateway paymentGateway;

    public PaymentServiceClient(PaymentGateway paymentGateway) {
        this.paymentGateway = paymentGateway;
    }

    @CircuitBreaker(name = "payment-service", fallbackMethod = "fallbackPayment")
    @Retry(name = "payment-service")
    @TimeLimiter(name = "payment-service")
    public CompletableFuture<PaymentResponse> processPayment(PaymentRequest request) {
        return paymentGateway.process(request);
    }

    // The fallback must mirror the guarded method's parameters, with the exception last
    public CompletableFuture<PaymentResponse> fallbackPayment(PaymentRequest request, Exception ex) {
        return CompletableFuture.completedFuture(
            PaymentResponse.builder()
                .status(PaymentStatus.PENDING)
                .message("Payment queued for retry")
                .build()
        );
    }
}
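
The annotations above only reference the "payment-service" instance by name; the actual thresholds come from configuration. Our production values aren’t included here, but the sketch below shows the main knobs via Resilience4j’s programmatic CircuitBreakerConfig, with purely illustrative numbers (with the Spring Boot starter, the equivalent usually lives in application.yml under resilience4j.circuitbreaker.instances.payment-service):

import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

public class PaymentResilienceConfig {

    public static CircuitBreaker paymentCircuitBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .slidingWindowSize(100)                          // judge health over the last 100 calls
            .failureRateThreshold(50)                        // open the breaker at a 50% failure rate
            .waitDurationInOpenState(Duration.ofSeconds(30)) // stay open before probing again
            .permittedNumberOfCallsInHalfOpenState(10)       // trial calls allowed while half-open
            .build();

        return CircuitBreakerRegistry.of(config).circuitBreaker("payment-service");
    }
}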

Performance Optimizations

Database Sharding Strategy

We implemented a sharding strategy based on customer segments, with a simplified routing sketch below:

  • High-volume customers: Dedicated database shards
  • Regular customers: Shared shards with load balancing
  • Geographical sharding: Asia-Pacific, Europe, Americas
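
The real routing logic is more involved, but the core of it is a small function that maps a customer to a shard key. This is a simplified sketch; the Customer record, segment check, and shard naming are all illustrative:

record Customer(String id, String region, boolean highVolume) {}

public class ShardRouter {

    private static final int SHARED_SHARD_COUNT = 16; // illustrative; real count depends on capacity

    public String resolveShard(Customer customer) {
        // High-volume customers get a dedicated shard keyed by their ID
        if (customer.highVolume()) {
            return "dedicated-" + customer.id();
        }
        // Everyone else is spread across shared shards, partitioned by region first
        int bucket = Math.floorMod(customer.id().hashCode(), SHARED_SHARD_COUNT);
        return customer.region() + "-shared-" + bucket; // e.g. "apac-shared-7"
    }
}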

Caching Strategy

// Reads go through the cache; Spring Data's findById returns an Optional, so unwrap it
@Cacheable(value = "customer-profiles", key = "#customerId")
public CustomerProfile getCustomerProfile(String customerId) {
    return customerRepository.findById(customerId).orElse(null);
}

// Writes evict the stale entry so the next read repopulates it from the database
@CacheEvict(value = "customer-profiles", key = "#customerId")
public void updateCustomerProfile(String customerId, CustomerProfile profile) {
    customerRepository.save(profile);
}
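
These annotations only take effect when caching is enabled and a CacheManager bean exists. The provider isn’t the point here; as one plausible setup, the sketch below wires an in-process Caffeine cache with illustrative sizing and TTL (a distributed cache such as Redis plugs into the same Spring cache abstraction):

import java.util.concurrent.TimeUnit;

import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.caffeine.CaffeineCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import com.github.benmanes.caffeine.cache.Caffeine;

@Configuration
@EnableCaching
public class CacheConfig {

    @Bean
    public CacheManager cacheManager() {
        CaffeineCacheManager cacheManager = new CaffeineCacheManager("customer-profiles");
        // Bound the cache and expire entries so stale profiles eventually age out
        // even if an evict is missed on another instance
        cacheManager.setCaffeine(Caffeine.newBuilder()
            .maximumSize(100_000)
            .expireAfterWrite(10, TimeUnit.MINUTES));
        return cacheManager;
    }
}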

Results

After implementing these patterns:

  • 30% reduction in response times
  • 99.9% uptime across all services
  • 50% reduction in database load
  • Zero data loss during peak traffic periods

Key Takeaways

  1. Start with a monolith, then extract services based on business domains
  2. Invest in observability from day one - logging, metrics, tracing
  3. Design for failure - assume services will fail and plan accordingly
  4. Automate everything - deployment, monitoring, scaling, recovery
  5. Team ownership - each team owns their services end-to-end

What’s Next?

In my next post, I’ll dive deep into our Apache Kafka implementation and how we handle millions of events per day with zero message loss.


Have questions about microservices architecture? Feel free to reach out on LinkedIn or email me.