Stochastic Selection in a Backup Script?

I maintain a number of web servers, and various machines run scripts from their crontabs that take care of backing up our data. For example, the output of running mysqldump for each of our databases is saved in dump files and copied to several backup machines. Once data has been copied to one of our backup machines, it is copied to a different directory and named according to the time that the data was copied. This way, we have lots of copies of our data from different times.

So that the hard drives don’t get filled with old copies of data, we also run scripts to delete old files:

This isn’t very satisfactory, as the script simply deletes files that are more than a set number of days old. Also, different databases are different sizes. Some are only a few MB, so I might as well keep hundreds of copies of their dump files going back months. Some are hundreds of megabytes, so we can’t afford to keep everything forever.

I’ve been thinking about updating this script so that I can achieve the following:

  • I want as many recent copies as possible.
  • I want a few really old copies.
  • I want the size of the backup dir to never go above a fixed level.

The current script doesn’t do this; it simply deletes files that are older than a set number of days.

The algorithm that I’ve come up with for achieving this is as follows:

  1. If the size of the backup dir is less than the set maximum, exit.
  2. For each file in the backup dir older than a given age, assign a score equal to the age in seconds times the size of the dump file in bytes.
  3. Pick a file non-deterministically and delete it. The probability that a file will be chosen is proportional to the score in step 2. Go to step 1.
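
The three steps above could be sketched in Python roughly as follows. This is only a minimal illustration, not a tested backup tool: the function name `prune` and the parameters `max_bytes` and `min_age_seconds` are made up for the example, file age is assumed to be measured from the modification time, and the sketch stops early if no file is old enough to be a candidate (a case the steps above don't cover).

```python
import os
import random
import time


def total_size(paths):
    """Return the combined size of the given files in bytes."""
    return sum(os.path.getsize(p) for p in paths)


def prune(backup_dir, max_bytes, min_age_seconds, now=None, rng=random):
    """Delete randomly chosen old files until backup_dir fits in max_bytes.

    All names here are hypothetical. Returns the list of deleted paths.
    """
    now = time.time() if now is None else now
    deleted = []
    while True:
        files = [os.path.join(backup_dir, name) for name in os.listdir(backup_dir)]
        files = [p for p in files if os.path.isfile(p)]

        # Step 1: stop once the directory is within the size limit.
        if total_size(files) <= max_bytes:
            return deleted

        # Step 2: score each file older than the minimum age as age * size.
        candidates, weights = [], []
        for path in files:
            info = os.stat(path)
            age = now - info.st_mtime  # assumption: age is based on mtime
            if age > min_age_seconds:
                candidates.append(path)
                weights.append(age * info.st_size)

        # Assumption: give up rather than loop forever if nothing is eligible.
        if not candidates or sum(weights) <= 0:
            return deleted

        # Step 3: pick one file with probability proportional to its score,
        # delete it, and go back to step 1.
        victim = rng.choices(candidates, weights=weights, k=1)[0]
        os.remove(victim)
        deleted.append(victim)
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is handy when trying the script out on a scratch copy of the backup dir before pointing it at real data.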

I’ll probably want to play around with the score function a bit. For example, if the age is A and the size is S, f(A,S) could be:

A + S
A * S
A^k + S^l

and so on.
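
To compare candidate score functions, it may help to turn the scores into selection probabilities for a handful of files. The Python sketch below does that; the function names and the `(age, size)` tuple layout are assumptions for illustration only.

```python
def score_sum(age, size):
    """f(A, S) = A + S"""
    return age + size


def score_product(age, size):
    """f(A, S) = A * S"""
    return age * size


def make_power_score(k, l):
    """Build f(A, S) = A^k + S^l for chosen exponents k and l."""
    def score(age, size):
        return age ** k + size ** l
    return score


def selection_probabilities(files, score):
    """Return each file's chance of being picked, proportional to its score.

    `files` is a list of (age_in_seconds, size_in_bytes) tuples.
    """
    scores = [score(age, size) for age, size in files]
    total = sum(scores)
    return [s / total for s in scores]


if __name__ == "__main__":
    # Made-up example: two 1 MB dumps, one a day old and one ten days old.
    files = [(86400, 1_000_000), (864000, 1_000_000)]
    for name, fn in [("A + S", score_sum),
                     ("A * S", score_product),
                     ("A^2 + S", make_power_score(2, 1))]:
        print(name, selection_probabilities(files, fn))
```

One thing to watch with the additive forms: A is in seconds and S is in bytes, so whichever quantity is numerically larger will dominate the score unless the terms are rescaled or the exponents are chosen with that in mind.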

Luckily, I’ve got more than one Debian box backing up our server data so I can play around with the script without putting the data at risk.
