One of the issues of yesterday’s GitLab.com “database incident” is that most of their database backups weren’t being tested, and when they needed a restore, they discovered that most of the backup methods hadn’t been working.
Untested backup methods that turn out to be missing or broken are extremely common. I can’t fault them much because it’s a very easy mistake to make: most backups, by nature, never need to be restored from, so you never realize if something changes and they stop working… until it’s too late.
The solution is to frequently and automatically test backups by:
- Regularly downloading the latest backup from S3 (or wherever) and performing a full restore onto a clean server.
- Testing its validity in a way that a human is sure to notice if it stops working properly.
The first part sounds hard, but isn’t. For Overcast, I run an inexpensive Linode server devoted to automatically fetching, installing, and testing the latest backup every day and emailing me a report.1
The emailed report contains query results from multiple database tables that change regularly and are easy for me to mentally verify as I read it every day, such as the number of users and Premium subscribers, how long ago the latest user signed up, and the most recent episode titles of my own podcasts and other popular shows I listen to.
Automated backup testing isn’t difficult — it’s one simple shell script, called by
cron every night, piping its results to
The second part is the trick, though: it’d be too easy to start paying less attention to those daily emails over time, and if they stopped arriving, I may not notice for a while.
My solution is to tie backup tests to a task I do every week: stats collection.
I keep a running spreadsheet with pretty graphs to monitor the health and growth of my business, I update it once a week, and — critically — I pull almost all of the stats I need from the backup emails.
So if the backup ever stops working in a way that the script doesn’t detect or I fail to notice from the daily reports, I’ll still find out pretty quickly, because it’ll impact this other thing I always do that’s a high priority for me and involves important business and money things.
If the script detects any failures itself, it emails me with a very alarming subject line that a Mail.app rule highlights in red and shows an alert for. ↩︎