Reliability at scale

I enjoyed reading a story from Google's Site Reliability Engineering team about how they trace problems.

At Google scale, million-to-one chances happen all the time.
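
A quick back-of-the-envelope sketch of why that holds. The request volume below is a made-up round number for illustration, not a real Google traffic figure, and the function names are my own. It assumes each request hits the rare event independently with the same probability.

```python
import math


def expected_rare_events(requests: int, probability: float) -> float:
    """Expected number of occurrences of an event with the given per-request probability."""
    return requests * probability


def chance_of_at_least_one(requests: int, probability: float) -> float:
    """Probability the event occurs at least once, assuming independent requests."""
    # 1 - (1 - p)^n, written with log1p/expm1 so tiny probabilities stay accurate.
    return -math.expm1(requests * math.log1p(-probability))


# Hypothetical volume, chosen only to make the arithmetic easy to follow.
REQUESTS_PER_DAY = 1_000_000_000
ONE_IN_A_MILLION = 1e-6

# Roughly a thousand "million-to-one" events per day at this volume.
print(expected_rare_events(REQUESTS_PER_DAY, ONE_IN_A_MILLION))

# Even after only a million requests, the event has more likely than not already happened.
print(chance_of_at_least_one(1_000_000, ONE_IN_A_MILLION))
```

Under these assumptions the expected count grows linearly with traffic, so a failure that would be a once-in-a-career fluke on a small service becomes routine on a big one.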

All incidents should be novel.

