Weights & Biases: Pipeline / Training Run Failed
Current Status: Operational
What We're Seeing Right Now
No recent issues reported. If you're experiencing problems with Weights & Biases, report below to help the community.
What is this error?
Your ML pipeline or training run on Weights & Biases has failed unexpectedly. This can be caused by platform-side issues (compute unavailability, orchestration bugs) or configuration problems on your end. Distinguishing between the two is key to a fast resolution.
Error Signatures
- Run failed with exit code 1
- OOMKilled
- Out of memory
- Pipeline execution error
- Step timed out
- Compute not available
- Job preempted
- Container failed to start
Common Causes
- Compute resources (GPU/CPU) temporarily unavailable on the platform
- Out-of-memory errors due to model size or batch size misconfiguration
- Dependency or environment conflicts in the runtime container
- Network timeout between pipeline steps or data sources
- Platform-side orchestration bugs causing premature job termination
How to Fix It
- Check the run logs for the exact error message and step that failed
- Verify that the required compute tier is available on the Weights & Biases status page
- Reduce batch size or model size if you're hitting memory limits
- Check your data pipeline for connectivity issues to storage (S3, GCS, etc.)
- Try re-running with a smaller test dataset to isolate the issue
- Open a support ticket with the run ID if the error appears platform-side
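The "reduce batch size" step above can be automated as a reduce-and-retry loop. This is a minimal, framework-agnostic sketch; `run_with_backoff` and its `train` callable are illustrative placeholders, not part of the W&B SDK, and real framework OOM exceptions (e.g. CUDA out-of-memory errors) would need to be caught in place of `MemoryError`:

```python
def run_with_backoff(train, batch_size=64, min_batch_size=8):
    """Retry a training callable with a halved batch size on OOM.

    `train` is any callable that raises MemoryError when the batch
    does not fit in memory. Both names here are hypothetical.
    """
    while batch_size >= min_batch_size:
        try:
            return train(batch_size=batch_size)
        except MemoryError:
            # Halve the batch and try again before giving up.
            batch_size //= 2
    raise RuntimeError("Could not fit the model at the minimum batch size")
```

Halving on failure finds a workable batch size in a logarithmic number of retries, which keeps wasted compute small compared with restarting a long run by hand.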
Service Components
W&B Web: Operational
Recent Incidents
No incidents in the past 30 days
Frequently Asked Questions
How do I tell if the pipeline failure is my fault or Weights & Biases's?
Check the error logs carefully. OOM errors and code exceptions are usually your issue. 'Compute not available', 'job preempted', or infrastructure-level errors point to the platform. Check if other users are reporting similar failures on this page.
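That first triage pass can be sketched as a keyword check over the failure message. The signature lists below are illustrative and far from exhaustive; they are not an official W&B classification:

```python
# Signatures that usually point at the platform (infrastructure) side.
PLATFORM_SIGNATURES = ("compute not available", "job preempted",
                       "container failed to start", "step timed out")
# Signatures that usually point at user code or configuration.
USER_SIGNATURES = ("oomkilled", "out of memory", "traceback",
                   "modulenotfounderror")

def triage(log_line: str) -> str:
    """Classify a failure message as likely platform-side or user-side."""
    text = log_line.lower()
    if any(sig in text for sig in PLATFORM_SIGNATURES):
        return "platform"
    if any(sig in text for sig in USER_SIGNATURES):
        return "user"
    return "unknown"
```

Anything classified as "unknown" still needs a human read of the full log, but routing the obvious cases saves a support round-trip.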
Will I be charged for a failed pipeline run?
Most MLOps platforms bill for compute time used, even on failed runs. If the failure is clearly platform-side, contact Weights & Biases support to request a credit.
How do I prevent pipeline failures from losing hours of training progress?
Enable checkpointing in your training code (most frameworks support this natively). Configure your pipeline to checkpoint every N steps so a failure only loses a fraction of progress.