Prologue: The Era of WebLogic and Apache Mod_WebLogic
It was 2012. Turkcell, Turkey's largest mobile operator, had 15 million subscribers, and the pressure was on. As part of the operations team for www.turkcell.com.tr, I managed a labyrinth of infrastructure: ten beefy Linux servers running WebLogic behind Apache mod_weblogic proxies, with WebLogic serving dynamic requests while Apache handled static content. The e-commerce platform, Turkcell Shop, was my responsibility; a single GC pause during peak traffic could mean thousands of failed transactions and angry customers.
Back then, Java ruled enterprise systems, but garbage collection was a double-edged sword. We tuned JVMs like orchestra conductors, desperately trying to balance throughput, latency, and stability.
The GC Wars of 2012
The Problem: CMS GC and the 8-Second Pauses
Our initial setup used the Concurrent Mark-Sweep (CMS) collector, the go-to for low-latency systems pre-G1. But as traffic spiked during holiday sales or new iPhone launches, CMS struggled. Major GC pauses hit 8-10 seconds, causing timeouts in Apache’s mod_weblogic connections. Customers saw spinning wheels at checkout. Our Splunk dashboards flashed red.
The Load Test Revelation
Armed with JMeter, we simulated 10,000 concurrent users on the e-commerce app. The results were brutal:
- CMS: 92% throughput, but 99th percentile response times of 12 seconds during GC.
- Parallel GC: Better throughput, but even longer pauses—15 seconds during full GC.
The team debated: “Do we accept pauses for throughput, or chase lower latency?”
Discovering G1—The “Garbage-First” Gamble
Why G1?
In 2012, G1 was still experimental (Java 7u4), but its promise of predictable pauses and region-based collection intrigued us. Unlike CMS, G1 avoided fragmentation by incrementally compacting the heap. For a system with 12GB heaps and mixed workloads (HTTP sessions, order processing), this felt like a fit.
The Pitch to Leadership
Convincing management to adopt an “unproven” GC was tough. I built a case:
- Predictability: G1's MaxGCPauseMillis let us target 200ms pauses.
- Scalability: Regions allowed better heap utilization for Turkcell Shop's volatile traffic.
- Future-Proofing: Oracle’s roadmap hinted G1 would replace CMS.
After weeks of Splunk-fueled debates, we got a green light for a staged rollout.
Tuning G1 for 15 Million Subscribers
The Configuration Wars
We started with defaults, but G1’s early days were rocky. Full GCs still occurred when the heap filled too fast. Our tuning arsenal:
-XX:+UseG1GC                            # Enable the G1 collector
-XX:MaxGCPauseMillis=200                # Pause-time goal (a target, not a guarantee)
-XX:InitiatingHeapOccupancyPercent=35   # Start concurrent marking earlier
-XX:G1ReservePercent=15                 # Buffer for promotion failures
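Put together, a managed server's launch line looked roughly like this. The heap size matches the 12GB mentioned earlier, but the log path is illustrative; the GC-logging flags are the Java 7-era ones we fed into Splunk:

```shell
java -Xms12g -Xmx12g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=200 \
     -XX:InitiatingHeapOccupancyPercent=35 \
     -XX:G1ReservePercent=15 \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/var/log/weblogic/gc.log \
     weblogic.Server
```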
Splunk Dashboards: Our GC Crystal Ball
We piped GC logs into Splunk, tracking:
- Heap occupancy trends before/after sales.
- Promotion rates (Young → Old generation).
- Pause times correlated with customer complaints.
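Under the hood, the Splunk extraction was just pattern matching over GC log lines. A minimal sketch of the kind of pause-time parsing involved, assuming the Java 7-era G1 log format (the exact shape varies by JVM version and flags, and the class name here is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseParser {
    // Matches Java 7-style G1 lines such as:
    //   12.345: [GC pause (young), 0.0213 secs]
    static final Pattern PAUSE =
            Pattern.compile("\\[GC pause.*?, (\\d+\\.\\d+) secs\\]");

    // Extracts every pause duration (in seconds) found in the given log lines.
    static List<Double> parse(List<String> lines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                pauses.add(Double.parseDouble(m.group(1)));
            }
        }
        return pauses;
    }
}
```

From a list like this, computing percentiles or correlating spikes with complaint timestamps is straightforward.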
One midnight, a dashboard alert caught a humongous allocation: a 50MB XML payload that G1 had to place as a humongous object spanning contiguous regions. We fixed it by splitting the payload and adding:
-XX:G1HeapRegionSize=16M
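The region-size bump raised the humongous threshold (G1 treats any allocation of at least half a region as humongous, so 16M regions mean an 8MB cutoff), but the real fix was the chunking. A sketch of the splitting idea, with a hypothetical class name and a 4MB chunk size chosen to stay safely under that cutoff:

```java
import java.util.ArrayList;
import java.util.List;

public class PayloadSplitter {
    // With -XX:G1HeapRegionSize=16M, allocations >= 8MB (half a region)
    // are humongous; 4MB chunks stay well below that threshold.
    static final int CHUNK = 4 * 1024 * 1024;

    // Copies the payload into a list of byte[] chunks of at most CHUNK bytes.
    static List<byte[]> split(byte[] payload) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < payload.length; off += CHUNK) {
            int len = Math.min(CHUNK, payload.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(payload, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }
}
```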
Legacy and Lessons
Why G1 Won
- Adaptability: Handled Turkcell’s mix of short-lived HTTP requests and long-lived sessions.
- Tunability: Parameters like MaxGCPauseMillis aligned with SLAs.
- Splunk + JMeter: Data-driven decisions beat gut feelings.
The Human Factor
As a developer-turned-SRE, I learned:
- Collaborate: Bridged dev/ops teams by sharing Splunk dashboards.
- Obsess Over Logs: A GC log anomaly often hid a code smell.
- Test Relentlessly: JMeter scripts mirrored real user rage.
Epilogue: Beyond Turkcell
When I left Turkcell in 2015, G1 was becoming mainstream. Today, ZGC and Shenandoah handle terabyte heaps, but G1’s principles—predictability, incremental compaction—live on.
To engineers battling GC pauses: Your logs tell a story. Listen to them.