Prologue: The Era of WebLogic and Apache Mod_WebLogic
It was 2012. Turkcell, Turkey's largest mobile operator, had 15 million subscribers, and the pressure was on. As part of the operations team for www.turkcell.com.tr, I managed a labyrinth of infrastructure: ten beefy Linux servers running WebLogic behind Apache mod_weblogic proxies, with WebLogic serving dynamic requests while Apache handled static content. The e-commerce platform, Turkcell Shop, was my responsibility; a single GC pause during peak traffic could mean thousands of failed transactions and angry customers.
Back then, Java ruled enterprise systems, but garbage collection was a double-edged sword. We tuned JVMs like orchestra conductors, desperately trying to balance throughput, latency, and stability.
The GC Wars of 2012
The Problem: CMS GC and the 8-Second Pauses
Our initial setup used the Concurrent Mark-Sweep (CMS) collector, the go-to for low-latency systems pre-G1. But as traffic spiked during holiday sales or new iPhone launches, CMS struggled. Major GC pauses hit 8-10 seconds, causing timeouts in Apache’s mod_weblogic connections. Customers saw spinning wheels at checkout. Our Splunk dashboards flashed red.
The Load Test Revelation
Armed with JMeter, we simulated 10,000 concurrent users on the e-commerce app. The results were brutal:
- CMS: 92% throughput, but 99th percentile response times of 12 seconds during GC.
- Parallel GC: Better throughput, but even longer pauses—15 seconds during full GC.
The team debated: “Do we accept pauses for throughput, or chase lower latency?”
Discovering G1—The “Garbage-First” Gamble
Why G1?
In 2012, G1 was still experimental (Java 7u4), but its promise of predictable pauses and region-based collection intrigued us. Unlike CMS, G1 avoided fragmentation by incrementally compacting the heap. For a system with 12GB heaps and mixed workloads (HTTP sessions, order processing), this felt like a fit.
The Pitch to Leadership
Convincing management to adopt an “unproven” GC was tough. I built a case:
- Predictability: G1's MaxGCPauseMillis let us target 200ms pauses.
- Scalability: Regions allowed better heap utilization for Turkcell Shop's volatile traffic.
- Future-Proofing: Oracle’s roadmap hinted G1 would replace CMS.
After weeks of Splunk-fueled debates, we got a green light for a staged rollout.
Tuning G1 for 15 Million Subscribers
The Configuration Wars
We started with defaults, but G1’s early days were rocky. Full GCs still occurred when the heap filled too fast. Our tuning arsenal:
-XX:+UseG1GC                            # Enable the G1 collector
-XX:MaxGCPauseMillis=200                # Pause-time goal (a target, not a guarantee)
-XX:InitiatingHeapOccupancyPercent=35   # Start concurrent marking earlier
-XX:G1ReservePercent=15                 # Buffer for promotion failures
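Put together, a managed server's launch line looked roughly like this. The heap size matches the 12GB mentioned earlier, but the log path is illustrative; the GC-logging flags are the Java 7-era ones we fed into Splunk:

```shell
java -Xms12g -Xmx12g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=200 \
     -XX:InitiatingHeapOccupancyPercent=35 \
     -XX:G1ReservePercent=15 \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/var/log/weblogic/gc.log \
     weblogic.Server
```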
Splunk Dashboards: Our GC Crystal Ball
We piped GC logs into Splunk, tracking:
- Heap occupancy trends before/after sales.
- Promotion rates (Young → Old generation).
- Pause times correlated with customer complaints.
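Under the hood, the Splunk extraction was just pattern matching over GC log lines. A minimal sketch of the kind of pause-time parsing involved, assuming the Java 7-era G1 log format (the exact shape varies by JVM version and flags, and the class name here is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseParser {
    // Matches Java 7-style G1 lines such as:
    //   12.345: [GC pause (young), 0.0213 secs]
    static final Pattern PAUSE =
            Pattern.compile("\\[GC pause.*?, (\\d+\\.\\d+) secs\\]");

    // Extracts every pause duration (in seconds) found in the given log lines.
    static List<Double> parse(List<String> lines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                pauses.add(Double.parseDouble(m.group(1)));
            }
        }
        return pauses;
    }
}
```

From a list like this, computing percentiles or correlating spikes with complaint timestamps is straightforward.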
One midnight, a dashboard alert caught a humongous allocation: a 50MB XML payload that G1 had to place as a humongous object spanning contiguous regions. We fixed it by splitting the payload and adding:
-XX:G1HeapRegionSize=16M
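The region-size bump raised the humongous threshold (G1 treats any allocation of at least half a region as humongous, so 16M regions mean an 8MB cutoff), but the real fix was the chunking. A sketch of the splitting idea, with a hypothetical class name and a 4MB chunk size chosen to stay safely under that cutoff:

```java
import java.util.ArrayList;
import java.util.List;

public class PayloadSplitter {
    // With -XX:G1HeapRegionSize=16M, allocations >= 8MB (half a region)
    // are humongous; 4MB chunks stay well below that threshold.
    static final int CHUNK = 4 * 1024 * 1024;

    // Copies the payload into a list of byte[] chunks of at most CHUNK bytes.
    static List<byte[]> split(byte[] payload) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < payload.length; off += CHUNK) {
            int len = Math.min(CHUNK, payload.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(payload, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }
}
```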
Legacy and Lessons
Why G1 Won
- Adaptability: Handled Turkcell’s mix of short-lived HTTP requests and long-lived sessions.
- Tunability: Parameters like MaxGCPauseMillis aligned with SLAs.
- Splunk + JMeter: Data-driven decisions beat gut feelings.
The Human Factor
As a developer-turned-SRE, I learned:
- Collaborate: Bridged dev/ops teams by sharing Splunk dashboards.
- Obsess Over Logs: A GC log anomaly often hid a code smell.
- Test Relentlessly: JMeter scripts mirrored real user rage.
Epilogue: Beyond Turkcell
When I left Turkcell in 2015, G1 was becoming mainstream. Today, ZGC and Shenandoah handle terabyte heaps, but G1’s principles—predictability, incremental compaction—live on.
To engineers battling GC pauses: Your logs tell a story. Listen to them.