Weights & Biases: Pipeline / Training Run Failed
Current Status: Operational
What We're Seeing Right Now
No recent issues reported. If you're experiencing problems with Weights & Biases, report below to help the community.
What is this error?
Your ML pipeline or training run on Weights & Biases has failed unexpectedly. This can be caused by platform-side issues (compute unavailability, orchestration bugs) or configuration problems on your end. Distinguishing between the two is key to a fast resolution.
Error Signatures
- Run failed with exit code 1
- OOMKilled
- Out of memory
- Pipeline execution error
- Step timed out
- Compute not available
- Job preempted
- Container failed to start
Common Causes
- Compute resources (GPU/CPU) temporarily unavailable on the platform
- Out-of-memory errors due to model size or batch size misconfiguration
- Dependency or environment conflicts in the runtime container
- Network timeout between pipeline steps or data sources
- Platform-side orchestration bugs causing premature job termination
How to Fix It
- Check the run logs for the exact error message and step that failed
- Verify that the required compute tier is available on the Weights & Biases status page
- Reduce batch size or model size if you're hitting memory limits
- Check your data pipeline for connectivity issues to storage (S3, GCS, etc.)
- Try re-running with a smaller test dataset to isolate the issue
- Open a support ticket with the run ID if the error appears platform-side
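The "reduce batch size" step above can be automated as a reduce-and-retry loop. This is a minimal, framework-agnostic sketch; `run_with_backoff` and its `train` callable are illustrative placeholders, not part of the W&B SDK, and real framework OOM exceptions (e.g. CUDA out-of-memory errors) would need to be caught in place of `MemoryError`:

```python
def run_with_backoff(train, batch_size=64, min_batch_size=8):
    """Retry a training callable with a halved batch size on OOM.

    `train` is any callable that raises MemoryError when the batch
    does not fit in memory. Both names here are hypothetical.
    """
    while batch_size >= min_batch_size:
        try:
            return train(batch_size=batch_size)
        except MemoryError:
            # Halve the batch and try again before giving up.
            batch_size //= 2
    raise RuntimeError("Could not fit the model at the minimum batch size")
```

Halving on failure finds a workable batch size in a logarithmic number of retries, which keeps wasted compute small compared with restarting a long run by hand.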
Service Components
W&B Web: Operational
Recent Incidents
No incidents in the past 30 days
Frequently Asked Questions
How do I tell if the pipeline failure is my fault or Weights & Biases's?
Check the error logs carefully. OOM errors and code exceptions are usually your issue. 'Compute not available', 'job preempted', or infrastructure-level errors point to the platform. Check if other users are reporting similar failures on this page.
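That first triage pass can be sketched as a keyword check over the failure message. The signature lists below are illustrative and far from exhaustive; they are not an official W&B classification:

```python
# Signatures that usually point at the platform (infrastructure) side.
PLATFORM_SIGNATURES = ("compute not available", "job preempted",
                       "container failed to start", "step timed out")
# Signatures that usually point at user code or configuration.
USER_SIGNATURES = ("oomkilled", "out of memory", "traceback",
                   "modulenotfounderror")

def triage(log_line: str) -> str:
    """Classify a failure message as likely platform-side or user-side."""
    text = log_line.lower()
    if any(sig in text for sig in PLATFORM_SIGNATURES):
        return "platform"
    if any(sig in text for sig in USER_SIGNATURES):
        return "user"
    return "unknown"
```

Anything classified as "unknown" still needs a human read of the full log, but routing the obvious cases saves a support round-trip.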
Will I be charged for a failed pipeline run?
Most MLOps platforms bill for compute time used, even on failed runs. If the failure is clearly platform-side, contact Weights & Biases support to request a credit.
How do I prevent pipeline failures from losing hours of training progress?
Enable checkpointing in your training code (most frameworks support this natively). Configure your pipeline to checkpoint every N steps so a failure only loses a fraction of progress.