The article examines how Pinterest Engineering reduced Apache Spark out-of-memory (OOM) failures by combining observability, precise configuration tuning, and automated memory retries.
By targeting high-impact memory pressure instead of simply adding more memory, the team cut persistent OOMs by 96%.
This stabilized long-running pipelines that support analytics for recommendations and large-scale data processing.
Observability as the Foundation for Stabilization
Persistent OOMs triggered late-stage job failures after hours of computation, disrupting critical analytics workflows.
Engineers focused on visibility into memory pressure at the executor level and across shuffle operations to identify hotspots and skewed partitions before failures occurred.
This data-driven approach enabled focused interventions.
Metrics that mapped memory pressure
Detailed instrumentation revealed where memory was consumed and where bottlenecks emerged.
The following metrics were central to the diagnosis:
- Executor memory usage by task, stage, and shuffle operations
- Shuffle spill counts, data transfer patterns, and task execution times
- Memory pressure indicators and garbage-collection behavior
- Data skew, skewed partitions, and hotspots in the DAG
- Resource utilization across the cluster, including custom profiles
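The article does not show Pinterest's instrumentation code, but metrics like these are exposed by Spark's monitoring REST API (e.g. `GET /api/v1/applications/{app-id}/executors`). As a minimal sketch, assuming the executor-summary fields `memoryUsed` and `maxMemory` and an illustrative 85% threshold, executors under memory pressure could be flagged like this:

```python
# Sketch: flag executors under memory pressure from a Spark REST API payload.
# Field names follow Spark's executor-summary schema; the 0.85 threshold
# is an illustrative assumption, not a value from the article.

def flag_memory_pressure(executors, threshold=0.85):
    """Return IDs of executors whose memory usage exceeds `threshold`."""
    hot = []
    for e in executors:
        max_mem = e.get("maxMemory", 0)
        if max_mem and e.get("memoryUsed", 0) / max_mem > threshold:
            hot.append(e["id"])
    return hot

# Example payload shaped like the REST API response (values invented):
sample = [
    {"id": "1", "memoryUsed": 9_000_000_000, "maxMemory": 10_000_000_000},
    {"id": "2", "memoryUsed": 2_000_000_000, "maxMemory": 10_000_000_000},
]
print(flag_memory_pressure(sample))  # ['1']
```

Correlating such per-executor flags with stage and shuffle metrics is what lets memory pressure be traced back to specific hotspots in the DAG.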
With this visibility, engineers could pinpoint hotspots and correlate memory pressure with specific stages.
Configuration Tuning and Adaptive Execution
The team implemented configuration changes that reduced pressure on Spark’s memory subsystem.
This strategy used adaptive execution techniques and preprocessing to address skew and improve stability for memory-intensive workloads.
Targeted tuning actions
- Tuning Spark memory allocation parameters to balance executor, driver, and shuffle memory
- Adjusting shuffle partitions to reduce data skew and shuffle memory pressure
- Modifying broadcast-join behavior to prevent large broadcast tables from overwhelming executor memory
- Enabling Adaptive Query Execution (AQE) and preprocessing to minimize data skew
- Respecting custom resource profiles and environmental constraints (e.g., Apache Gluten)
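The article does not publish Pinterest's exact settings, but the knobs above map onto standard Spark configuration keys. The fragment below is a hedged illustration in `spark-defaults.conf` form; the keys are real Spark properties, while every value is an assumption for demonstration only:

```properties
# Executor/driver/shuffle memory balance (values illustrative)
spark.executor.memory                          8g
spark.executor.memoryOverhead                  2g
spark.memory.fraction                          0.6

# Shuffle partitioning to spread skewed data
spark.sql.shuffle.partitions                   2000

# Keep oversized broadcasts from overwhelming executors
spark.sql.autoBroadcastJoinThreshold           10m

# Adaptive Query Execution, including skew-join handling
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.skewJoin.enabled            true
spark.sql.adaptive.coalescePartitions.enabled  true
```

AQE is particularly relevant here because it can split skewed shuffle partitions and coalesce small ones at runtime, reducing the memory spikes that fixed partition counts cause.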
These adjustments were deployed gradually, with validation checks to ensure improvements without destabilizing job logic.
Validation, Rollout, and Stability
Validation checks flagged anomalous or unusually large datasets before they could trigger memory pressure.
High-risk jobs underwent human review to maintain stability, while routine workloads progressed automatically.
The rollout was staged, moving from ad hoc to scheduled jobs and from lower-priority to critical analytics pipelines.
Guardrails and governance
- Automated checks to identify anomalous data volumes or unexpected partitions
- Early warning signals and dashboards tracking recovered jobs, cost savings, and post-retry failures
- Human oversight for high-risk tasks to prevent cascading failures during rollout
A dedicated dashboard tracked metrics such as the number of recovered jobs, resource savings, and any remaining post-retry failures.
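The article describes these guardrails at a high level only. A minimal sketch of such a pre-run check, assuming a rolling baseline of recent input sizes and an illustrative 2x anomaly factor (both assumptions, not Pinterest's actual thresholds), might look like:

```python
# Sketch of a pre-run guardrail: compare a job's input volume against a
# rolling baseline and flag anomalies for human review. The 2x factor and
# the baseline window are illustrative assumptions.

from statistics import mean

def is_anomalous(history_gb, todays_gb, factor=2.0):
    """Flag a run whose input size exceeds `factor` times the recent average."""
    if not history_gb:
        return False  # no baseline yet; let the run proceed
    return todays_gb > factor * mean(history_gb)

baseline = [100, 110, 95, 105]  # recent daily input sizes (GB), invented
print(is_anomalous(baseline, 300))  # True  -> route to human review
print(is_anomalous(baseline, 120))  # False -> proceed automatically
```

Jobs flagged this way would go to human review, while unflagged routine workloads progress automatically, matching the staged rollout described above.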
Auto Memory Retry: Restarting with Smarter Memory
Auto Memory Retry (AMR) provided a safety net for jobs that would have failed due to memory pressure.
Instead of failing, Spark tasks could automatically restart with updated memory settings, reducing manual tuning.
Mechanism, benefits, and limits
- Automatically restarts failed tasks or stages with adjusted memory configurations
- Cuts manual tuning time and accelerates pipeline completion
- Maintains data processing integrity by avoiding code changes for retry scenarios
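The article does not detail AMR's escalation policy, but its shape can be sketched as a retry loop that resubmits with more memory each attempt. The 1.5x growth factor, 64g ceiling, and 3-attempt limit below are all illustrative assumptions, not Pinterest's actual parameters:

```python
# Illustrative sketch of an auto-memory-retry policy (not Pinterest's
# actual implementation): on an OOM failure, resubmit with escalated
# executor memory, up to a cap and an attempt limit. All numbers are
# assumptions for demonstration.

def next_executor_memory_gb(current_gb, attempt,
                            growth=1.5, cap_gb=64, max_attempts=3):
    """Return the memory (GB) for the next retry, or None to stop retrying."""
    if attempt >= max_attempts:
        return None  # give up; surface the failure for manual tuning
    return min(int(current_gb * growth), cap_gb)

mem = 8
for attempt in range(4):
    mem = next_executor_memory_gb(mem, attempt)
    if mem is None:
        break
    print(f"retry {attempt + 1}: spark.executor.memory={mem}g")
# Escalates 8g -> 12g -> 18g -> 27g, then stops after three attempts.
```

Because the retry only changes memory configuration, not job logic, the pipeline's data-processing semantics are preserved across attempts.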
AMR was rolled out progressively, with monitoring dashboards to quantify the impact on throughput, reliability, and cost.
Operational Learnings and Future Plans
Several operational lessons emerged: improving scheduler performance for large TaskSets, handling custom resource profiles such as Apache Gluten, and refining host-exclusion behavior so that OOMs would not block retries.
Looking ahead, Pinterest plans proactive memory increases for high-risk task stages and further optimizations to reduce retries and overall cluster overhead.
The goal is more stable analytics at scale, supporting reliable data processing as workloads grow.
Here is the source article for this story: Pinterest Reduces Spark OOM Failures by 96% Through Auto Memory Retries