The article examines how Pinterest Engineering reduced Apache Spark out-of-memory (OOM) failures by combining observability, precise configuration tuning, and automated memory retries.
By targeting high-impact memory pressure instead of simply adding more memory, the team cut persistent OOMs by 96%.
This stabilized long-running pipelines that support analytics for recommendations and large-scale data processing.
Observability as the Foundation for Stabilization
Persistent OOMs triggered late-stage job failures after hours of computation, disrupting critical analytics workflows.
Engineers focused on visibility into memory pressure at the executor level and across shuffle operations to identify hotspots and skewed partitions before failures occurred.
This data-driven approach enabled focused interventions.
Metrics that mapped memory pressure
Detailed instrumentation revealed where memory was consumed and where bottlenecks emerged.
The following metrics were central to the diagnosis:
- Executor memory usage by task, stage, and shuffle operations
- Shuffle spill counts, data transfer patterns, and task execution times
- Memory pressure indicators and garbage-collection behavior
- Data skew, skewed partitions, and hotspots in the DAG
- Resource utilization across the cluster, including custom profiles
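The article does not show Pinterest's instrumentation code, but metrics like these are exposed by Spark's monitoring REST API (e.g. `GET /api/v1/applications/{app-id}/executors`). As a minimal sketch, assuming the executor-summary fields `memoryUsed` and `maxMemory` and an illustrative 85% threshold, executors under memory pressure could be flagged like this:

```python
# Sketch: flag executors under memory pressure from a Spark REST API payload.
# Field names follow Spark's executor-summary schema; the 0.85 threshold
# is an illustrative assumption, not a value from the article.

def flag_memory_pressure(executors, threshold=0.85):
    """Return IDs of executors whose memory usage exceeds `threshold`."""
    hot = []
    for e in executors:
        max_mem = e.get("maxMemory", 0)
        if max_mem and e.get("memoryUsed", 0) / max_mem > threshold:
            hot.append(e["id"])
    return hot

# Example payload shaped like the REST API response (values invented):
sample = [
    {"id": "1", "memoryUsed": 9_000_000_000, "maxMemory": 10_000_000_000},
    {"id": "2", "memoryUsed": 2_000_000_000, "maxMemory": 10_000_000_000},
]
print(flag_memory_pressure(sample))  # ['1']
```

Correlating such per-executor flags with stage and shuffle metrics is what lets memory pressure be traced back to specific hotspots in the DAG.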
With this visibility, engineers could pinpoint hotspots and correlate memory pressure with specific stages.
Configuration Tuning and Adaptive Execution
The team implemented configuration changes that reduced pressure on Spark’s memory subsystem.
This strategy used adaptive execution techniques and preprocessing to address skew and improve stability for memory-intensive workloads.
Targeted tuning actions
- Tuning Spark memory allocation parameters to balance executor, driver, and shuffle memory
- Adjusting shuffle partitions to reduce data skew and shuffle memory pressure
- Modifying broadcast-join behavior to prevent large broadcast tables from overwhelming executor memory
- Enabling Adaptive Query Execution (AQE) and preprocessing to minimize data skew
- Respecting custom resource profiles and environmental constraints (e.g., Apache Gluten)
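The article does not publish Pinterest's exact settings, but the knobs above map onto standard Spark configuration keys. The fragment below is a hedged illustration in `spark-defaults.conf` form; the keys are real Spark properties, while every value is an assumption for demonstration only:

```properties
# Executor/driver/shuffle memory balance (values illustrative)
spark.executor.memory                          8g
spark.executor.memoryOverhead                  2g
spark.memory.fraction                          0.6

# Shuffle partitioning to spread skewed data
spark.sql.shuffle.partitions                   2000

# Keep oversized broadcasts from overwhelming executors
spark.sql.autoBroadcastJoinThreshold           10m

# Adaptive Query Execution, including skew-join handling
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.skewJoin.enabled            true
spark.sql.adaptive.coalescePartitions.enabled  true
```

AQE is particularly relevant here because it can split skewed shuffle partitions and coalesce small ones at runtime, reducing the memory spikes that fixed partition counts cause.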
These adjustments were deployed gradually, with validation checks to ensure improvements without destabilizing job logic.
Validation, Rollout, and Stability
Validation checks flagged anomalous or unusually large datasets before they could trigger memory pressure.
High-risk jobs underwent human review to maintain stability, while routine workloads progressed automatically.
The rollout was staged, moving from ad hoc to scheduled jobs and from lower-priority to critical analytics pipelines.
Guardrails and governance
- Automated checks to identify anomalous data volumes or unexpected partitions
- Early warning signals and dashboards tracking recovered jobs, cost savings, and post-retry failures
- Human oversight for high-risk tasks to prevent cascading failures during rollout
A dedicated dashboard tracked metrics such as the number of recovered jobs, resource savings, and any remaining post-retry failures.
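The article describes these guardrails at a high level only. A minimal sketch of such a pre-run check, assuming a rolling baseline of recent input sizes and an illustrative 2x anomaly factor (both assumptions, not Pinterest's actual thresholds), might look like:

```python
# Sketch of a pre-run guardrail: compare a job's input volume against a
# rolling baseline and flag anomalies for human review. The 2x factor and
# the baseline window are illustrative assumptions.

from statistics import mean

def is_anomalous(history_gb, todays_gb, factor=2.0):
    """Flag a run whose input size exceeds `factor` times the recent average."""
    if not history_gb:
        return False  # no baseline yet; let the run proceed
    return todays_gb > factor * mean(history_gb)

baseline = [100, 110, 95, 105]  # recent daily input sizes (GB), invented
print(is_anomalous(baseline, 300))  # True  -> route to human review
print(is_anomalous(baseline, 120))  # False -> proceed automatically
```

Jobs flagged this way would go to human review, while unflagged routine workloads progress automatically, matching the staged rollout described above.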
Auto Memory Retry: Restarting with Smarter Memory
Auto Memory Retry (AMR) provided a safety net for jobs that would have failed due to memory pressure.
Instead of failing, Spark tasks could automatically restart with updated memory settings, reducing manual tuning.
Mechanism, benefits, and limits
- Automatically restarts failed tasks or stages with adjusted memory configurations
- Cuts manual tuning time and accelerates pipeline completion
- Maintains data processing integrity by avoiding code changes for retry scenarios
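The article does not detail AMR's escalation policy, but its shape can be sketched as a retry loop that resubmits with more memory each attempt. The 1.5x growth factor, 64g ceiling, and 3-attempt limit below are all illustrative assumptions, not Pinterest's actual parameters:

```python
# Illustrative sketch of an auto-memory-retry policy (not Pinterest's
# actual implementation): on an OOM failure, resubmit with escalated
# executor memory, up to a cap and an attempt limit. All numbers are
# assumptions for demonstration.

def next_executor_memory_gb(current_gb, attempt,
                            growth=1.5, cap_gb=64, max_attempts=3):
    """Return the memory (GB) for the next retry, or None to stop retrying."""
    if attempt >= max_attempts:
        return None  # give up; surface the failure for manual tuning
    return min(int(current_gb * growth), cap_gb)

mem = 8
for attempt in range(4):
    mem = next_executor_memory_gb(mem, attempt)
    if mem is None:
        break
    print(f"retry {attempt + 1}: spark.executor.memory={mem}g")
# Escalates 8g -> 12g -> 18g -> 27g, then stops after three attempts.
```

Because the retry only changes memory configuration, not job logic, the pipeline's data-processing semantics are preserved across attempts.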
AMR was rolled out progressively, with monitoring dashboards to quantify the impact on throughput, reliability, and cost.
Operational Learnings and Future Plans
Several operational lessons emerged: improving scheduler performance for large TaskSets, handling custom resource profiles such as Apache Gluten, and refining host-exclusion behavior so that OOMs would not block retries.
Looking ahead, Pinterest plans proactive memory increases for high-risk task stages and further optimizations to reduce retries and overall cluster overhead.
The goal is more stable analytics at scale, supporting reliable data processing as workloads grow.
Here is the source article for this story: Pinterest Reduces Spark OOM Failures by 96% Through Auto Memory Retries