🚀 Faulty Deployment Detection

Interactive demonstration of Datadog's journey from unsupervised to supervised learning

The demo is organized into five sections: Overview, Unsupervised Approach, Supervised Learning, Weak Supervision, and Results & Comparison.

🎯 The Challenge

According to Google SRE, deployments account for approximately 70% of incidents. This demo explores how Datadog developed an automated system to detect faulty deployments using machine learning.

📊 Key Challenges

No Labels
Lack of ground truth data
Data Imbalance
Faulty deployments are rare
Diversity
Different application profiles
Time Pressure
Need for quick detection

🔬 Definition of Faulty Deployment

A deployment is considered faulty if it exhibits:

  • Impact: Significant increase in error rate relative to baseline
  • Temporal Correlation: Error increase aligns with deployment timing
  • Persistence: Increased error rate is sustained over time
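A minimal sketch of what these three criteria could look like as checks over an error-rate time series; the function names, thresholds, and heuristics are illustrative assumptions, not Datadog's production logic:

```python
import numpy as np

def impact_check(baseline_errors, post_deploy_errors, min_ratio=2.0):
    """Impact: the post-deploy error rate is significantly above baseline."""
    baseline = np.mean(baseline_errors)
    observed = np.mean(post_deploy_errors)
    return observed > min_ratio * max(baseline, 1e-6)

def temporal_check(error_rate, deploy_index, window=5):
    """Temporal correlation: the increase begins around the deployment time."""
    before = np.mean(error_rate[max(0, deploy_index - window):deploy_index])
    after = np.mean(error_rate[deploy_index:deploy_index + window])
    return after > before

def persistence_check(post_deploy_errors, baseline_rate, min_fraction=0.8):
    """Persistence: the elevated error rate is sustained, not a transient blip."""
    elevated = np.asarray(post_deploy_errors) > baseline_rate
    return elevated.mean() >= min_fraction
```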

🛤️ Evolution Path

The project evolved through three main phases:

  1. Unsupervised Approach: Rule-based statistical checks with iterative refinement
  2. Supervised Learning: Sequential models trained on ensemble outputs
  3. Weak Supervision: Improved label quality through multiple weak signals

📈 Simulate Deployment Monitoring

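The demo simulates a deployment's error-rate behavior around a release. A minimal sketch of such a simulation (the parameter names, defaults, and binomial error model are illustrative assumptions):

```python
import numpy as np

def simulate_deployment(minutes=60, deploy_at=30, baseline_rate=0.02,
                        spike_multiplier=5.0, requests_per_minute=1000, seed=0):
    """Simulate per-minute error rates before and after a deployment."""
    rng = np.random.default_rng(seed)
    rates = np.full(minutes, baseline_rate)
    rates[deploy_at:] *= spike_multiplier        # a faulty deploy raises the error rate
    errors = rng.binomial(requests_per_minute, rates)
    return errors / requests_per_minute          # observed error-rate series

error_rate = simulate_deployment()
```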

🔍 Statistical Checks

For each simulated deployment, the demo reports four results:

  • Impact Check
  • Temporal Check
  • Persistence Check
  • Final Decision

⚖️ Iterative Framework

The unsupervised approach uses unanimous voting: all checks must pass for a deployment to be flagged as faulty.
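Continuing the sketches above (and assuming `simulate_deployment` and the three check functions are in scope), the unanimous vote is just a conjunction of the checks:

```python
deploy_at = 30
error_rate = simulate_deployment(deploy_at=deploy_at)

checks = {
    "impact": impact_check(error_rate[:deploy_at], error_rate[deploy_at:]),
    "temporal": temporal_check(error_rate, deploy_at),
    "persistence": persistence_check(error_rate[deploy_at:], baseline_rate=0.02),
}
faulty = all(checks.values())   # flag as faulty only if every check passes
```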

Simulate deployments to see history

⏱️ Sequential Model Approach

Train models to predict the 60-minute ensemble result using only early data (the first 10 or 20 minutes after a deployment).
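One way to realize this sequentially at inference time is as a cascade: consult the earliest model first and fall back to later horizons only when it is not confident. A minimal sketch under that assumption (scikit-learn-style models and the confidence threshold are illustrative):

```python
def sequential_decision(features_by_horizon, models, threshold=0.9):
    """Return (is_faulty, horizon_minutes) from the earliest confident model.

    features_by_horizon: {10: feats_10, 20: feats_20, 60: feats_60}
    models:              {10: model_10, 20: model_20, 60: model_60}
    In practice each model runs as soon as its horizon's data is available.
    """
    for horizon in (10, 20, 60):
        proba = models[horizon].predict_proba([features_by_horizon[horizon]])[0, 1]
        if proba >= threshold or proba <= 1 - threshold:
            return proba >= threshold, horizon
    return proba >= 0.5, 60   # the 60-minute model decides even when unsure
```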

🎯 Model Performance

  • 10-Minute Model: coverage 21.5%; trade-off: high precision, low recall
  • 20-Minute Model: coverage 25.9%; trade-off: balanced precision/recall
  • 60-Minute Model: coverage 62.9%; trade-off: high recall, slower detection

🔄 Training Process

The supervised models use features from statistical checks computed at early timestamps to predict the final ensemble decision. This approach allows for faster detection while maintaining accuracy.
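A minimal sketch of that training setup, assuming scikit-learn and a feature matrix of early statistical-check outputs labeled by the 60-minute ensemble decision (the names and synthetic data are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# X_10: check features computed 10 minutes after each deployment (synthetic here).
# y_60: the 60-minute ensemble decision, used as the training label.
rng = np.random.default_rng(0)
X_10 = rng.normal(size=(5000, 8))
y_60 = ((X_10[:, 0] + rng.normal(size=5000)) > 2).astype(int)   # rare positives

X_train, X_test, y_train, y_test = train_test_split(
    X_10, y_60, test_size=0.2, stratify=y_60, random_state=0)

model_10 = GradientBoostingClassifier().fit(X_train, y_train)
y_pred = model_10.predict(X_test)
print("precision:", precision_score(y_test, y_pred),
      "recall:", recall_score(y_test, y_pred))
```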

🏷️ Weak Supervision Framework

Instead of manual labeling, we use multiple weak signals to generate high-quality labels automatically.

📊 Weak Label Sources

  • Version Rollback: 0.85
  • Short-lived Version: 0.72
  • New Error Signatures: 0.68
  • Statistical Rules: 0.79
  • Incident Correlation: 0.91
  • Performance Degradation: 0.64
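A minimal sketch of these sources expressed as labeling functions, each voting faulty, healthy, or abstaining; the `Deployment` fields, the heuristics inside each function, and the reading of the scores above as per-source weights are all assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Deployment:
    was_rolled_back: bool
    lifetime_minutes: float
    new_error_signatures: int
    failed_statistical_checks: bool
    linked_incident: bool
    latency_regression_pct: float

FAULTY, HEALTHY, ABSTAIN = 1, 0, None

def lf_version_rollback(d: Deployment) -> Optional[int]:
    return FAULTY if d.was_rolled_back else ABSTAIN

def lf_short_lived_version(d: Deployment) -> Optional[int]:
    return FAULTY if d.lifetime_minutes < 30 else ABSTAIN

def lf_new_error_signatures(d: Deployment) -> Optional[int]:
    return FAULTY if d.new_error_signatures > 0 else ABSTAIN

def lf_statistical_rules(d: Deployment) -> Optional[int]:
    return FAULTY if d.failed_statistical_checks else HEALTHY

def lf_incident_correlation(d: Deployment) -> Optional[int]:
    return FAULTY if d.linked_incident else ABSTAIN

def lf_performance_degradation(d: Deployment) -> Optional[int]:
    return FAULTY if d.latency_regression_pct > 20 else ABSTAIN

SOURCE_WEIGHTS = [0.85, 0.72, 0.68, 0.79, 0.91, 0.64]   # scores from the list above
LABELING_FUNCTIONS = [lf_version_rollback, lf_short_lived_version,
                      lf_new_error_signatures, lf_statistical_rules,
                      lf_incident_correlation, lf_performance_degradation]
```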

🧮 Label Generation Process

The label generation step reports four quantities:

  • Label Confidence
  • Predicted Label
  • Coverage
  • Estimated Accuracy
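One simple way to produce these quantities is a weight-adjusted vote over the non-abstaining sources. A minimal sketch (the real label model may be more sophisticated, e.g. a generative model over the labeling functions):

```python
def combine_weak_labels(votes, weights):
    """votes: one entry per source, 1 = faulty, 0 = healthy, None = abstain.
    weights: per-source scores used to weight the votes."""
    weighted = {1: 0.0, 0: 0.0}
    for vote, weight in zip(votes, weights):
        if vote is not None:
            weighted[vote] += weight
    total = weighted[1] + weighted[0]
    if total == 0:               # every source abstained: this example gets no label
        return None, 0.0
    predicted = 1 if weighted[1] >= weighted[0] else 0
    confidence = weighted[predicted] / total
    return predicted, confidence

# One deployment's votes from the six sources above, in the same order.
votes   = [1, 1, None, 1, None, 0]
weights = [0.85, 0.72, 0.68, 0.79, 0.91, 0.64]
label, confidence = combine_weak_labels(votes, weights)
# Coverage is the fraction of deployments that receive a non-None label.
```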

📈 Performance Improvements

Precision on Disagreements

Baseline: 67% → Weak Supervision: 78% (an improvement of 11 percentage points)

Recall on Disagreements

Baseline: 54% → Weak Supervision: 85% (an improvement of 31 percentage points)

⚡ Time to Detection Improvements

Relative time to detection (lower is faster):

  • Baseline (60 min): 1.0x
  • 20-min + 60-min models: 0.72x
  • 10-min + 20-min + 60-min models: 0.65x
  • With Weak Supervision: 0.55x

📊 Coverage Comparison

(The demo shows a chart comparing coverage across the approaches described above.)

🎯 Key Takeaways

  • Iterative Approach: Starting simple and adding complexity based on real-world feedback
  • Sequential Models: Trading off precision, recall, and time to detection
  • Weak Supervision: Using domain knowledge to generate high-quality labels
  • 45% Faster Detection: Time to detection drops to 0.55x of the original 60-minute approach