A Retrospective on High-Traffic Systems, Garbage Collection Battles, and the Rise of G1

Prologue: The Era of WebLogic and Apache Mod_WebLogic #

It was 2012. Turkcell, Turkey’s largest mobile operator, had 15 million subscribers, and the pressure was on. As part of the operations team for www.turkcell.com.tr, I managed a labyrinth of infrastructure: ten beefy Linux servers running WebLogic behind Apache mod_weblogic proxies, with WebLogic serving dynamic requests while Apache handled static content. The e-commerce platform, Turkcell Shop, was my responsibility, and a single GC pause during peak traffic could mean thousands of failed transactions and angry customers.

Back then, Java ruled enterprise systems, but garbage collection was a double-edged sword. We tuned JVMs like orchestra conductors, desperate to balance throughput, latency, and stability.


The GC Wars of 2012 #

The Problem: CMS GC and the 8-Second Pauses #

Our initial setup used the Concurrent Mark-Sweep (CMS) collector, the go-to for low-latency systems pre-G1. But as traffic spiked during holiday sales or new iPhone launches, CMS struggled. Major GC pauses hit 8-10 seconds, causing timeouts in Apache’s mod_weblogic connections. Customers saw spinning wheels at checkout. Our Splunk dashboards flashed red.
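
For context, the CMS setup looked roughly like this; the exact production values are lost to time, so treat these as representative rather than a copy of our config:

-XX:+UseConcMarkSweepGC                  # Concurrent Mark-Sweep for the old generation
-XX:+UseParNewGC                         # Parallel young-gen collector paired with CMS
-XX:CMSInitiatingOccupancyFraction=70    # Start the concurrent cycle at 70% old-gen occupancy
-XX:+UseCMSInitiatingOccupancyOnly       # Don't let the JVM second-guess that threshold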

The Load Test Revelation #

Armed with JMeter, we simulated 10,000 concurrent users on the e-commerce app. The results were brutal:

  • CMS: 92% throughput, but 99th percentile response times of 12 seconds during GC.
  • Parallel GC: Better throughput, but even longer pauses—15 seconds during full GC.
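
Those numbers came from non-GUI JMeter runs along these lines; the test plan name and the threads property are illustrative, since the real scripts lived on Turkcell’s internal rig:

jmeter -n -t turkcell-shop-checkout.jmx -l results.jtl -Jthreads=10000

Here -n runs headless, -t picks the test plan, and -l writes the sample log we later graphed against the GC logs.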

The team debated: “Do we accept pauses for throughput, or chase lower latency?”


Discovering G1—The “Garbage-First” Gamble #

Why G1? #

In 2012, G1 had only just shed its experimental status (it became officially supported in Java 7u4), but its promise of predictable pauses and region-based collection intrigued us. Unlike CMS, G1 avoided fragmentation by incrementally compacting the heap. For a system with 12GB heaps and mixed workloads (HTTP sessions, order processing), this felt like a fit.
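
On builds older than 7u4, G1 even had to be unlocked explicitly, which tells you how “proven” it was considered at the time:

-XX:+UnlockExperimentalVMOptions   # Required to enable G1 before Java 7u4
-XX:+UseG1GC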

The Pitch to Leadership #

Convincing management to adopt an “unproven” GC was tough. I built a case:

  • Predictability: G1’s MaxGCPauseMillis let us target 200ms pauses.
  • Scalability: Regions allowed better heap utilization for Turkcell Shop’s volatile traffic.
  • Future-Proofing: Oracle’s roadmap hinted G1 would replace CMS.

After weeks of Splunk-fueled debates, we got a green light for a staged rollout.


Tuning G1 for 15 Million Subscribers #

The Configuration Wars #

We started with defaults, but G1’s early days were rocky. Full GCs still occurred when the heap filled too fast. Our tuning arsenal:

-XX:+UseG1GC                             # Switch from CMS to G1
-XX:MaxGCPauseMillis=200                 # Pause-time target (a goal, not a guarantee)
-XX:InitiatingHeapOccupancyPercent=35    # Start concurrent marking earlier (default: 45)
-XX:G1ReservePercent=15                  # Bigger buffer against promotion failures (default: 10)

Splunk Dashboards: Our GC Crystal Ball #

We piped GC logs into Splunk, tracking:

  • Heap occupancy trends before/after sales.
  • Promotion rates (Young → Old generation).
  • Pause times correlated with customer complaints.
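
Getting those logs into Splunk in the first place relied on the standard HotSpot logging flags of the Java 7 era (the log path here is illustrative):

-Xloggc:/var/log/weblogic/gc.log       # Illustrative path; this file was forwarded to Splunk
-XX:+PrintGCDetails                    # Per-collection heap and generation breakdown
-XX:+PrintGCDateStamps                 # Wall-clock timestamps for correlating with complaints
-XX:+PrintGCApplicationStoppedTime     # Total stop-the-world time per pause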

One midnight, a dashboard alert caught a humongous allocation: a 50MB XML payload that G1 had to spread across contiguous regions, clogging the heap. We fixed it by splitting the payload into smaller chunks and adding:

-XX:G1HeapRegionSize=16M  # Larger regions raise the humongous threshold to 8MB (half a region)
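
The split itself was the real fix. A minimal sketch of the idea (class and method names here are hypothetical, not our production code) is to stream the payload in buffers that stay under the humongous threshold instead of materializing one 50MB array:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedPayloadCopy {

    // 4MB buffer: comfortably below the 8MB humongous threshold
    // implied by -XX:G1HeapRegionSize=16M (threshold = half a region).
    private static final int CHUNK_SIZE = 4 * 1024 * 1024;

    // Stream the payload instead of allocating one 50MB byte array,
    // so no single allocation ever qualifies as humongous.
    public static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[CHUNK_SIZE];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
    }
}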

Legacy and Lessons #

Why G1 Won #

  • Adaptability: Handled Turkcell’s mix of short-lived HTTP requests and long-lived sessions.
  • Tunability: Parameters like MaxGCPauseMillis aligned with SLAs.
  • Splunk + JMeter: Data-driven decisions beat gut feelings.

The Human Factor #

As a developer-turned-SRE, I learned:

  • Collaborate: Bridged dev/ops teams by sharing Splunk dashboards.
  • Obsess Over Logs: A GC log anomaly often hid a code smell.
  • Test Relentlessly: JMeter scripts mirrored real user rage.

Epilogue: Beyond Turkcell #

When I left Turkcell in 2015, G1 was becoming mainstream. Today, ZGC and Shenandoah handle terabyte heaps, but G1’s principles—predictability, incremental compaction—live on.

To engineers battling GC pauses: Your logs tell a story. Listen to them.