From distributed tracing to proactive SLO management: a mini-review of trace-driven performance prediction for cloud-native microservices
Abstract
Cloud-native microservices improve development velocity and elasticity, but they also create complex and dynamic service dependencies. Resource contention, queue buildup, and downstream slowdowns can propagate through call chains, amplifying end-to-end tail latency (e.g., p95/p99) and increasing the risk of Service Level Objective (SLO) violations. While many studies focus on post-hoc anomaly detection and root-cause analysis, industrial operations increasingly demand proactive capabilities, such as predicting performance risks before a request finishes, issuing early warnings from partial trace prefixes, and producing actionable signals for mitigation. This mini-review synthesizes recent progress on trace-driven proactive SLO management. We summarize problem formulations and evaluation protocols for SLO violation and tail-quantile prediction, prefix early warning under precision constraints, and actionable intermediate outputs such as bottleneck candidate ranking and what-if estimation. We then survey modeling approaches spanning feature-based baselines, sequence models, graph neural networks, sequence-graph fusion, and multimodal/causal extensions, highlighting practical issues such as class imbalance, sampling-induced missing spans, and topology drift. Finally, we review commonly used public benchmarks and traces, and discuss open challenges toward deployable, trustworthy proactive SLO management.
Topics & Keywords
Authors (13)
Miaopeng Yu
Haonan Liu
Jinran Du
Kequan Lin
Tao Dai
Yanzhe Fu
Chunyan Yang
Quick Access
- Publication Year: 2026
- Source Database: DOAJ
- DOI: 10.3389/fcomp.2026.1783945
- Access: Open Access