arXiv Open Access 2025

ANSC: Probabilistic Capacity Health Scoring for Datacenter-Scale Reliability

Madhava Gaikwad, Abhishek Gandhi

Abstract

We present ANSC, a probabilistic capacity health scoring framework for hyperscale datacenter fabrics. While existing alerting systems detect individual device or link failures, they do not capture the aggregate risk of cascading capacity shortfalls. ANSC provides a color-coded scoring system that indicates the urgency of issues not solely by current impact, but by the probability of imminent capacity violations. Our system accounts for both current residual capacity and the probability of additional failures, normalized at the datacenter and regional levels. We demonstrate that ANSC enables operators to prioritize remediation across more than 400 datacenters and 60 regions, reducing noise and aligning SRE focus on the most critical risks.
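The abstract does not give ANSC's actual formula, so the following is only a minimal sketch of the general idea it describes: combining current residual capacity with the probability of additional failures into a color-coded score. It assumes identical links that fail independently with the same probability, and the function names, color names, and thresholds are all hypothetical, not taken from the paper. Normalization across datacenters and regions is not modeled.

```python
from math import ceil, comb


def violation_probability(links_up: int, link_capacity: float,
                          demand: float, p_fail: float) -> float:
    """Probability that surviving capacity drops below demand.

    Assumption (not from the paper): every currently-up link has the same
    capacity and fails independently with probability p_fail.
    """
    if links_up * link_capacity < demand:
        return 1.0  # capacity is already below demand

    # Minimum number of surviving links needed to carry the demand.
    needed = ceil(demand / link_capacity)
    # Number of additional failures that can be absorbed without violation.
    tolerable = links_up - needed

    # Binomial probability of seeing at most `tolerable` failures.
    p_ok = sum(
        comb(links_up, k) * p_fail**k * (1 - p_fail) ** (links_up - k)
        for k in range(tolerable + 1)
    )
    return 1.0 - p_ok


def health_color(p_violation: float, amber: float = 0.001,
                 red: float = 0.05) -> str:
    """Map a violation probability to a color; thresholds are hypothetical."""
    if p_violation >= red:
        return "red"
    if p_violation >= amber:
        return "amber"
    return "green"


if __name__ == "__main__":
    # Four 100-unit links up, 300 units of demand, 10% per-link failure risk.
    p = violation_probability(4, 100.0, 300.0, 0.1)
    print(round(p, 4), health_color(p))
```

In this sketch a fabric with no current impact can still score "red" when it has little headroom left, which is the distinction the abstract draws between current impact and imminent risk.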

Citation Format

Gaikwad, M., Gandhi, A. (2025). ANSC: Probabilistic Capacity Health Scoring for Datacenter-Scale Reliability. https://arxiv.org/abs/2508.16119

Journal Information
Publication year: 2025
Language: en
Source database: arXiv
Access: Open Access