Tag: debug

Kernel Debugging at scale

This paper will discuss the difficulties and methods involved in debugging the Linux kernel on huge clusters. Intermittent errors that occur once every few years are hard to debug and become a real problem when running across 1000s of machines simultaneously. The more we scale clusters, the more reliability becomes critical. Many of the normal debugging luxuries like a serial console or physical access are unavailable. Instead, we need a new strategy for addressing thorny intermittent race conditions. This paper presents the case for a new set of tools that are critical to solve these problems and also very useful in a broader context. It then presents the design for one such tool created from a hybrid of a Google internal tool and the open source LTTng project. Real world case studies are included.

how to deal with rare error conditions that are hard to reproduce

A New Approach To Debugging

I have built a system, which I’m calling Amber, to record the complete execution history of arbitrary Linux processes. The history is recorded using binary instrumentation based on Valgrind. The history is indexed to support efficient queries that debuggers need, and then compressed and written to disk in a format optimized for later query and retrieval.

roc’s magic new debugging framework in more detail