NVSentinel: Nvidia's open-source GPU resilience system for Kubernetes (github.com/nvidia)
2 points by mchmarny 4 days ago | 1 comment



Keeping a GPU cluster healthy at scale isn't just a "nice to have"—it’s the difference between seamless training and a nightmare of idle nodes. That’s why we built NVSentinel, our open-source system designed to detect, classify, and auto-remediate hardware and software faults across Kubernetes nodes and NVSwitches.


