A Retrospective on High-Traffic Systems, Garbage Collection Battles, and the Rise of G1
Prologue: The Era of WebLogic and Apache Mod_WebLogic #
It was 2012. Turkcell, Turkey's largest mobile operator, had 15 million subscribers, and the pressure was on. As part of the operations team for www.turkcell.com.tr, I managed a labyrinth of infrastructure: ten beefy Linux servers running WebLogic behind Apache mod_weblogic proxies, with WebLogic serving the dynamic requests while Apache handled static content. The e-commerce platform, Turkcell Shop, was my responsibility, and a single GC pause during peak traffic could mean thousands of failed transactions and angry customers.
Back then, Java ruled enterprise systems, but garbage collection was a double-edged sword. We tuned JVMs like orchestra conductors, desperate to balance throughput, latency, and stability.
The GC Wars of 2012 #
The Problem: CMS GC and the 8-Second Pauses #
Our initial setup used the Concurrent Mark-Sweep (CMS) collector, the go-to for low-latency systems pre-G1. But as traffic spiked during holiday sales or new iPhone launches, CMS struggled. Major GC pauses hit 8-10 seconds, causing timeouts in Apache’s mod_weblogic connections. Customers saw spinning wheels at checkout. Our Splunk dashboards flashed red.
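For context, our CMS setup looked roughly like this (flags typical of that era; illustrative values, not our exact production config):

```
-XX:+UseConcMarkSweepGC                 # concurrent mark-sweep for the old generation
-XX:+UseParNewGC                        # parallel young-generation collection alongside CMS
-XX:CMSInitiatingOccupancyFraction=70   # start a concurrent cycle at 70% old-gen occupancy
-XX:+UseCMSInitiatingOccupancyOnly      # don't let the JVM second-guess that threshold
```

The worst pauses came when the concurrent cycle lost the race against allocation and CMS fell back to a stop-the-world full collection.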
The Load Test Revelation #
Armed with JMeter, we simulated 10,000 concurrent users on the e-commerce app. The results were brutal:
- CMS: 92% throughput, but 99th percentile response times of 12 seconds during GC.
- Parallel GC: Better throughput, but even longer pauses—15 seconds during full GC.
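We drove those runs from JMeter's non-GUI mode so the load generator itself wouldn't skew the latencies. A sketch of an invocation (the test plan name and property are hypothetical):

```
jmeter -n -t shop_load.jmx -l results.jtl -Jusers=10000
```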
The team debated: “Do we accept pauses for throughput, or chase lower latency?”
Discovering G1—The “Garbage-First” Gamble #
Why G1? #
In 2012, G1 was brand new: it had only just graduated from experimental status in Java 7u4, and its promise of predictable pauses and region-based collection intrigued us. Unlike CMS, G1 avoided fragmentation by incrementally compacting the heap as it evacuated live objects: it carves the heap into equal-sized regions (targeting roughly 2,048 of them, so a 12GB heap works out to regions of a few megabytes each) and collects the garbage-richest regions first. For a system with 12GB heaps and mixed workloads (HTTP sessions, order processing), this felt like a fit.
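Enabling it on 7u4 was a one-flag change. In WebLogic terms, that meant appending to JAVA_OPTIONS in the domain's environment script (a minimal sketch; the 12GB sizing is ours, the rest is illustrative):

```
# setDomainEnv.sh -- WebLogic domain environment script
JAVA_OPTIONS="${JAVA_OPTIONS} -Xms12g -Xmx12g -XX:+UseG1GC"
export JAVA_OPTIONS
```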
The Pitch to Leadership #
Convincing management to adopt an “unproven” GC was tough. I built a case:
- Predictability: G1's MaxGCPauseMillis let us target 200ms pauses.
- Scalability: Regions allowed better heap utilization for Turkcell Shop's volatile traffic.
- Future-Proofing: Oracle’s roadmap hinted G1 would replace CMS.
After weeks of Splunk-fueled debates, we got a green light for a staged rollout.
Tuning G1 for 15 Million Subscribers #
The Configuration Wars #
We started with defaults, but G1’s early days were rocky. Full GCs still occurred when the heap filled too fast. Our tuning arsenal:
```
-XX:+UseG1GC                             # enable the G1 collector
-XX:MaxGCPauseMillis=200                 # target pause time, aligned with our SLA
-XX:InitiatingHeapOccupancyPercent=35    # trigger concurrent marking earlier
-XX:G1ReservePercent=15                  # buffer for promotion failures
```
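None of this tuning was blind: the same JVMs wrote detailed GC logs for analysis, using the Java 7-era logging flags (the log path is illustrative):

```
-verbose:gc
-XX:+PrintGCDetails          # per-collection breakdown of pause phases
-XX:+PrintGCDateStamps       # wall-clock timestamps, needed to correlate with business events
-Xloggc:/opt/logs/gc.log     # illustrative path; one log per managed server
```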
Splunk Dashboards: Our GC Crystal Ball #
We piped GC logs into Splunk, tracking:
- Heap occupancy trends before/after sales.
- Promotion rates (Young → Old generation).
- Pause times correlated with customer complaints.
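A typical search looked something like this (simplified SPL; the source path and regex are hypothetical, shaped to Java 7's GC log format):

```
source="/opt/logs/gc.log"
| rex "(?<pause_secs>\d+\.\d+) secs\]"
| timechart span=5m max(pause_secs) AS worst_pause
```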
One midnight, a dashboard alert caught a humongous allocation: a 50MB XML payload clogging the heap. (In G1, any object larger than half a region is "humongous" and gets allocated in a run of contiguous regions, bypassing the young generation and fragmenting the old.) We fixed it by splitting the payload and by enlarging the regions:

```
-XX:G1HeapRegionSize=16M   # raises the humongous threshold to 8MB per object
```
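The "splitting" itself amounted to never holding the whole document in memory at once. A hypothetical sketch of the pattern, using the standard StAX API to consume one element at a time (the element name and handler are made up):

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public final class OrderFeedParser {

    /** Streams the payload so no single allocation approaches the humongous threshold. */
    public static void process(InputStream xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                // Handle each small <order> element as it arrives instead of
                // materializing the full 50MB document as one object.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "order".equals(reader.getLocalName())) {
                    handleOrder(reader);
                }
            }
        } finally {
            reader.close();
        }
    }

    private static void handleOrder(XMLStreamReader reader) {
        // application-specific processing of a single order (hypothetical)
    }
}
```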
Legacy and Lessons #
Why G1 Won #
- Adaptability: Handled Turkcell’s mix of short-lived HTTP requests and long-lived sessions.
- Tunability: Parameters like MaxGCPauseMillis aligned with SLAs.
- Splunk + JMeter: Data-driven decisions beat gut feelings.
The Human Factor #
As a developer-turned-SRE, I learned:
- Collaborate: Bridged dev/ops teams by sharing Splunk dashboards.
- Obsess Over Logs: A GC log anomaly often hid a code smell.
- Test Relentlessly: JMeter scripts mirrored real user rage.
Epilogue: Beyond Turkcell #
When I left Turkcell in 2015, G1 was becoming mainstream. Today, ZGC and Shenandoah handle terabyte heaps, but G1’s principles—predictability, incremental compaction—live on.
To engineers battling GC pauses: Your logs tell a story. Listen to them.