🚀 Faulty Deployment Detection

Interactive demonstration of Datadog's journey from unsupervised to supervised learning

The demo is organized into five sections: Overview, Unsupervised Approach, Supervised Learning, Weak Supervision, and Results & Comparison.

🎯 The Challenge

According to Google SRE, deployments account for approximately 70% of incidents. This demo explores how Datadog developed an automated system to detect faulty deployments using machine learning.

📊 Key Challenges

No Labels
Lack of ground truth data
Data Imbalance
Faulty deployments are rare
Diversity
Different application profiles
Time Pressure
Need for quick detection

🔬 Definition of Faulty Deployment

A deployment is considered faulty if it exhibits:

  • Impact: Significant increase in error rate relative to baseline
  • Temporal Correlation: Error increase aligns with deployment timing
  • Persistence: Increased error rate is sustained over time
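A minimal sketch of what these three criteria could look like as checks over an error-rate time series; the function names, thresholds, and heuristics are illustrative assumptions, not Datadog's production logic:

```python
import numpy as np

def impact_check(baseline_errors, post_deploy_errors, min_ratio=2.0):
    """Impact: the post-deploy error rate is significantly above baseline."""
    baseline = np.mean(baseline_errors)
    observed = np.mean(post_deploy_errors)
    return observed > min_ratio * max(baseline, 1e-6)

def temporal_check(error_rate, deploy_index, window=5):
    """Temporal correlation: the increase begins around the deployment time."""
    before = np.mean(error_rate[max(0, deploy_index - window):deploy_index])
    after = np.mean(error_rate[deploy_index:deploy_index + window])
    return after > before

def persistence_check(post_deploy_errors, baseline_rate, min_fraction=0.8):
    """Persistence: the elevated error rate is sustained, not a transient blip."""
    elevated = np.asarray(post_deploy_errors) > baseline_rate
    return elevated.mean() >= min_fraction
```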

🛤️ Evolution Path

The project evolved through three main phases:

  1. Unsupervised Approach: Rule-based statistical checks with iterative refinement
  2. Supervised Learning: Sequential models trained on ensemble outputs
  3. Weak Supervision: Improved label quality through multiple weak signals

📈 Simulate Deployment Monitoring

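The demo simulates a deployment's error-rate behavior around a release. A minimal sketch of such a simulation (the parameter names, defaults, and binomial error model are illustrative assumptions):

```python
import numpy as np

def simulate_deployment(minutes=60, deploy_at=30, baseline_rate=0.02,
                        spike_multiplier=5.0, requests_per_minute=1000, seed=0):
    """Simulate per-minute error rates before and after a deployment."""
    rng = np.random.default_rng(seed)
    rates = np.full(minutes, baseline_rate)
    rates[deploy_at:] *= spike_multiplier        # a faulty deploy raises the error rate
    errors = rng.binomial(requests_per_minute, rates)
    return errors / requests_per_minute          # observed error-rate series

error_rate = simulate_deployment()
```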

🔍 Statistical Checks

For each simulated deployment, the demo reports four results:

  • Impact Check
  • Temporal Check
  • Persistence Check
  • Final Decision

⚖️ Iterative Framework

The unsupervised approach uses unanimous voting: all checks must pass for a deployment to be flagged as faulty.
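Continuing the sketches above (and assuming `simulate_deployment` and the three check functions are in scope), the unanimous vote is just a conjunction of the checks:

```python
deploy_at = 30
error_rate = simulate_deployment(deploy_at=deploy_at)

checks = {
    "impact": impact_check(error_rate[:deploy_at], error_rate[deploy_at:]),
    "temporal": temporal_check(error_rate, deploy_at),
    "persistence": persistence_check(error_rate[deploy_at:], baseline_rate=0.02),
}
faulty = all(checks.values())   # flag as faulty only if every check passes
```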

Simulate deployments to see history

⏱️ Sequential Model Approach

Train models to predict the 60-minute ensemble result using only early data (the first 10 or 20 minutes after a deployment).
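One way to realize this sequentially at inference time is as a cascade: consult the earliest model first and fall back to later horizons only when it is not confident. A minimal sketch under that assumption (scikit-learn-style models and the confidence threshold are illustrative):

```python
def sequential_decision(features_by_horizon, models, threshold=0.9):
    """Return (is_faulty, horizon_minutes) from the earliest confident model.

    features_by_horizon: {10: feats_10, 20: feats_20, 60: feats_60}
    models:              {10: model_10, 20: model_20, 60: model_60}
    In practice each model runs as soon as its horizon's data is available.
    """
    for horizon in (10, 20, 60):
        proba = models[horizon].predict_proba([features_by_horizon[horizon]])[0, 1]
        if proba >= threshold or proba <= 1 - threshold:
            return proba >= threshold, horizon
    return proba >= 0.5, 60   # the 60-minute model decides even when unsure
```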

🎯 Model Performance

  • 10-Minute Model: coverage 21.5%; trade-off: high precision, low recall
  • 20-Minute Model: coverage 25.9%; trade-off: balanced precision/recall
  • 60-Minute Model: coverage 62.9%; trade-off: high recall, slower detection

🔄 Training Process

The supervised models use features from statistical checks computed at early timestamps to predict the final ensemble decision. This approach allows for faster detection while maintaining accuracy.
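A minimal sketch of that training setup, assuming scikit-learn and a feature matrix of early statistical-check outputs labeled by the 60-minute ensemble decision (the names and synthetic data are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# X_10: check features computed 10 minutes after each deployment (synthetic here).
# y_60: the 60-minute ensemble decision, used as the training label.
rng = np.random.default_rng(0)
X_10 = rng.normal(size=(5000, 8))
y_60 = ((X_10[:, 0] + rng.normal(size=5000)) > 2).astype(int)   # rare positives

X_train, X_test, y_train, y_test = train_test_split(
    X_10, y_60, test_size=0.2, stratify=y_60, random_state=0)

model_10 = GradientBoostingClassifier().fit(X_train, y_train)
y_pred = model_10.predict(X_test)
print("precision:", precision_score(y_test, y_pred),
      "recall:", recall_score(y_test, y_pred))
```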

🏷️ Weak Supervision Framework

Instead of manual labeling, we use multiple weak signals to generate high-quality labels automatically.

📊 Weak Label Sources

  • Version Rollback: 0.85
  • Short-lived Version: 0.72
  • New Error Signatures: 0.68
  • Statistical Rules: 0.79
  • Incident Correlation: 0.91
  • Performance Degradation: 0.64
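A minimal sketch of these sources expressed as labeling functions, each voting faulty, healthy, or abstaining; the `Deployment` fields, the heuristics inside each function, and the reading of the scores above as per-source weights are all assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Deployment:
    was_rolled_back: bool
    lifetime_minutes: float
    new_error_signatures: int
    failed_statistical_checks: bool
    linked_incident: bool
    latency_regression_pct: float

FAULTY, HEALTHY, ABSTAIN = 1, 0, None

def lf_version_rollback(d: Deployment) -> Optional[int]:
    return FAULTY if d.was_rolled_back else ABSTAIN

def lf_short_lived_version(d: Deployment) -> Optional[int]:
    return FAULTY if d.lifetime_minutes < 30 else ABSTAIN

def lf_new_error_signatures(d: Deployment) -> Optional[int]:
    return FAULTY if d.new_error_signatures > 0 else ABSTAIN

def lf_statistical_rules(d: Deployment) -> Optional[int]:
    return FAULTY if d.failed_statistical_checks else HEALTHY

def lf_incident_correlation(d: Deployment) -> Optional[int]:
    return FAULTY if d.linked_incident else ABSTAIN

def lf_performance_degradation(d: Deployment) -> Optional[int]:
    return FAULTY if d.latency_regression_pct > 20 else ABSTAIN

SOURCE_WEIGHTS = [0.85, 0.72, 0.68, 0.79, 0.91, 0.64]   # scores from the list above
LABELING_FUNCTIONS = [lf_version_rollback, lf_short_lived_version,
                      lf_new_error_signatures, lf_statistical_rules,
                      lf_incident_correlation, lf_performance_degradation]
```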

🧮 Label Generation Process

The label generation step reports four quantities:

  • Label Confidence
  • Predicted Label
  • Coverage
  • Estimated Accuracy
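One simple way to produce these quantities is a weight-adjusted vote over the non-abstaining sources. A minimal sketch (the real label model may be more sophisticated, e.g. a generative model over the labeling functions):

```python
def combine_weak_labels(votes, weights):
    """votes: one entry per source, 1 = faulty, 0 = healthy, None = abstain.
    weights: per-source scores used to weight the votes."""
    weighted = {1: 0.0, 0: 0.0}
    for vote, weight in zip(votes, weights):
        if vote is not None:
            weighted[vote] += weight
    total = weighted[1] + weighted[0]
    if total == 0:               # every source abstained: this example gets no label
        return None, 0.0
    predicted = 1 if weighted[1] >= weighted[0] else 0
    confidence = weighted[predicted] / total
    return predicted, confidence

# One deployment's votes from the six sources above, in the same order.
votes   = [1, 1, None, 1, None, 0]
weights = [0.85, 0.72, 0.68, 0.79, 0.91, 0.64]
label, confidence = combine_weak_labels(votes, weights)
# Coverage is the fraction of deployments that receive a non-None label.
```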

📈 Performance Improvements

Precision on Disagreements

Baseline: 67% → Weak Supervision: 78% (an improvement of 11 percentage points)

Recall on Disagreements

Baseline: 54% → Weak Supervision: 85% (an improvement of 31 percentage points)

⚡ Time to Detection Improvements

Relative time to detection (lower is faster):

  • Baseline (60 min): 1.0x
  • 20-min + 60-min models: 0.72x
  • 10-min + 20-min + 60-min models: 0.65x
  • With Weak Supervision: 0.55x

📊 Coverage Comparison

(The demo shows a chart comparing coverage across the approaches described above.)

🎯 Key Takeaways

  • Iterative Approach: Starting simple and adding complexity based on real-world feedback
  • Sequential Models: Trading off precision, recall, and time to detection
  • Weak Supervision: Using domain knowledge to generate high-quality labels
  • 45% Faster Detection: Time to detection drops to 0.55x of the original 60-minute approach