Stochastic Selection in a Backup Script?

I maintain several web servers and have scripts running from the crontabs of various machines to back up our data. For example, the output of running mysqldump for each of our databases is saved to dump files and copied to several backup machines. Once the data has arrived on a backup machine, it is copied to a different directory and named according to the time that the copy was made. This way, we have lots of copies of our data from different times.
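Concretely, the nightly routine boils down to a couple of crontab entries along these lines (the paths, host name and times here are invented for illustration; note that % has to be escaped in crontab entries):

30 2 * * * mysqldump --all-databases > /var/backups/mysql/all.sql
45 2 * * * scp /var/backups/mysql/all.sql backup1:/srv/backups/incoming/all.sql

and, on the backup machine, a timestamped copy:

0 3 * * * cp /srv/backups/incoming/all.sql /srv/backups/archive/all.$(date +\%Y\%m\%d-\%H\%M).sql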

So that the hard drives don’t get filled with old copies of data, we also run scripts to delete old files:

http://haddock-cms.googlecode.com/svn/tools/server-admin/trunk/bin/delete-old-dump-files.pl

This isn’t very satisfactory, as it simply deletes files that are more than a set number of days old. Also, different databases are different sizes. Some are only a few megabytes, so I might as well keep hundreds of copies of their dump files going back months. Some are hundreds of megabytes, so we can’t afford to keep everything forever.

I’ve been thinking about updating this script so that I can achieve the following:

  • I want as many recent copies as possible.
  • I want a few really old copies.
  • I want the size of the backup directory never to go above a fixed level.

The current script doesn’t do this; it simply deletes files that are older than a set number of days.

The algorithm that I’ve come up with for achieving this is as follows (a rough sketch of it in code follows the list):

  1. If the size of the backup directory is less than the set maximum, exit.
  2. For each file in the backup directory older than a given age, assign a score equal to the age in seconds times the size of the dump file in bytes.
  3. Pick a file non-deterministically, with the probability that a file is chosen proportional to its score from step 2, and delete it. Go to step 1.
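Sketched in Perl (the same language as the current script), it might look something like the following. This is a rough, untested sketch: the directory, size cap and minimum age are invented for illustration, and score() is the age-times-size function from step 2.

#!/usr/bin/perl
use strict;
use warnings;
use File::stat;

my $backup_dir = '/var/backups/mysql';     # invented path
my $max_bytes  = 10 * 1024 * 1024 * 1024;  # 10 GB cap on the backup dir
my $min_age    = 7 * 24 * 60 * 60;         # only files over a week old are candidates

# The score function: age in seconds times size in bytes, i.e. f(A, S) = A * S.
sub score {
    my ($age, $size) = @_;
    return $age * $size;
}

while (1) {
    my @files = glob("$backup_dir/*");
    my $total = 0;
    $total += -s $_ for @files;
    last if $total < $max_bytes;           # step 1: under the cap, so exit

    # Step 2: score every file older than the minimum age.
    my (%score, $sum);
    for my $file (@files) {
        my $age = time() - stat($file)->mtime;
        next if $age < $min_age;
        $score{$file} = score($age, -s $file);
        $sum += $score{$file};
    }
    last unless $sum;                      # nothing old enough to delete

    # Step 3: pick a file with probability proportional to its score by
    # drawing a number in [0, $sum) and walking through the scores.
    my $pick = rand($sum);
    for my $file (keys %score) {
        $pick -= $score{$file};
        if ($pick <= 0) {
            unlink $file or warn "Couldn't delete $file: $!";
            last;
        }
    }
}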

I’ll probably want to play around with the score function a bit. For example, if the age is A and the size is S, f(A, S) could be

A + S
A * S
A^k + S^l

and so on.

Luckily, I’ve got more than one Debian box backing up our server data so I can play around with the script without putting the data at risk.

Packaging Software

Ian Murdock, the founder of Debian, has written about how package management changed everything in the software world:

http://ianmurdock.com/2007/07/21/how-package-management-changed-everything/

I’m certainly a big fan of Debian and apt-get; it makes life much easier when I’ve got my sysadmin’s hat on. Having to hunt for individual packages using Google would take much longer. Compare installing Apache, MySQL and PHP on Windows (an EXE, an MSI and a ZIP) to installing the same on Debian. With Debian, I have some assurance that all my packages will work together, as I know that each package in the Debian repositories has been extensively battle-tested before it reaches the stable section.
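To put that in concrete terms, the whole stack on Debian is one line (these package names are from memory and vary between releases):

# apt-get install apache2 mysql-server php5 libapache2-mod-php5

versus three separate downloads, three installers and three different update channels on Windows.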

This centralised system is also a weakness. I’ve written before about the need for systems to be distributed here. One of the fun things about running an XP desktop is that there are hundreds of thousands of applications for it, just waiting to be downloaded and installed. Of course, something similar is possible with Debian or Ubuntu and .DEB files on web sites: a package doesn’t have to be blessed by the powers that be before you can use it on Debian or Ubuntu, and requiring that would defeat the whole point of the open source community. Attempts are underway with Autopackage and Luau to make the system more decentralised.

A system-wide framework like apt is only useful up to a point. Many applications share libraries, so updating one package to meet the dependency requirements of a new application might clobber another application that uses a library from that package. Web apps that use plug-ins tend to get around this problem by giving each instance of the program its own copies of the plug-in libraries and code. The space inefficiency is a price worth paying if it means you don’t have to re-test a site that is working and has been signed off just because you installed a new library for a different client.

A really small packaging system for web apps that allowed rapid application development and deployment would continue the revolution (evolution?) that Ian Murdock wrote about. It would need to have lots of useful plug-ins, of course. And it should be possible to have lots of copies of the same plug-in installed on the same server without one interfering in any way with a different version of the same plug-in used by a different site. Ideally, these plug-ins could come from anywhere, and rival teams would compete to write the best plug-in for a purpose. But most important of all, the framework shouldn’t try to do anything other than manage plug-ins. This seems to be lacking (or not lacking enough) from WordPress, Drupal, Joomla et al.

Linux from Scratch and Virtualisation

I’ve been looking at the Linux from Scratch site recently and thinking about whether making my own custom Linux installation would be worth the time and effort.

The only programs of primary importance that I run on my servers are Apache with PHP, MySQL, Postfix and Courier. I’ve been toying with the idea of doing my own DNS serving but keep putting it off. I’d like to separate these services onto their own virtual machines for security and reliability. Having a complete installation of Debian on each virtual machine seems like overkill; besides, the configuration of each machine should be tailored to the one program it runs. A custom installation of Linux à la “Linux from Scratch” therefore seems like a good idea.

I imagine that others must have had a similar idea at some point. Does anyone know of any projects that are trying to build custom distributions honed for a single service to be run as a guest OS?

Apache Config on Debian for phpMyAdmin

I’ve just been installing phpMyAdmin on a Debian server. This is very easy; simply:

# apt-get install phpmyadmin

However, if you are working on a machine with many vhosts, you need to set up a vhost for phpMyAdmin. Again, this is not difficult; the vhost is mostly standard. The following allows access to the phpMyAdmin vhost over HTTPS on port 50002 with basic authentication. It assumes that the public (/etc/apache2/ssl/phpmyadmin.example.com.public.pem) and private (/etc/apache2/ssl/phpmyadmin.example.com.private.pem) key files, the password file (/etc/apache2/passwords/passwords) and the group file (/etc/apache2/passwords/groups) exist, and that port 50002 is not blocked by the firewall.

<VirtualHost 1.2.3.4:80>
    ServerName phpmyadmin.example.com

    Redirect permanent / https://phpmyadmin.example.com:50002/
</VirtualHost>

Listen 50002
NameVirtualHost 1.2.3.4:50002

<VirtualHost 1.2.3.4:50002>
    SSLEngine on
    SSLCertificateFile /etc/apache2/ssl/phpmyadmin.example.com.public.pem
    SSLCertificateKeyFile /etc/apache2/ssl/phpmyadmin.example.com.private.pem

    ServerName phpmyadmin.example.com

    DocumentRoot /var/www/phpmyadmin

    <Location />
        AuthType Basic
        AuthName "phpmyadmin on example.com"
        AuthUserFile /etc/apache2/passwords/passwords
        AuthGroupFile /etc/apache2/passwords/groups
        Require group developers
    </Location>
</VirtualHost>
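If the password and group files don’t exist yet, they can be created along these lines (htpasswd ships with Apache; “alice” is just an example user):

# mkdir -p /etc/apache2/passwords
# htpasswd -c /etc/apache2/passwords/passwords alice
# echo 'developers: alice' > /etc/apache2/passwords/groups

The -c flag creates the password file, so drop it when adding further users, and list each extra user after the group name in the groups file.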

You run into difficulties, though, if you have the default AllowOverride settings. Normally it’s good security practice to keep your Apache configuration as locked down as possible and only allow directives to be overridden when it’s necessary; the equivalent is true wherever security is concerned in computing. phpMyAdmin’s .htaccess file (as supplied via apt) uses a number of directives that the default config doesn’t allow, so it’s necessary to allow them in the vhost conf file.

I came up with:

<Directory /var/www/phpmyadmin>
    AllowOverride Options Indexes FileInfo Limit AuthConfig
</Directory>

Details of the AllowOverride directive can be found at

http://httpd.apache.org/docs/2.0/mod/core.html#allowoverride

I have to admit that I’m a little confused by the way they’ve grouped the directives that you can allow to be overridden. Why are AuthConfig and Limit separate groups? There seems to be a lot of semantic overlap there. What about allowing Options to be overridden? What if a sysadmin wants to limit which individual options can be overridden?

Altogether, that’s:

<VirtualHost 1.2.3.4:80>
    ServerName phpmyadmin.example.com

    Redirect permanent / https://phpmyadmin.example.com:50002/
</VirtualHost>

Listen 50002
NameVirtualHost 1.2.3.4:50002

<VirtualHost 1.2.3.4:50002>
    SSLEngine on
    SSLCertificateFile /etc/apache2/ssl/phpmyadmin.example.com.public.pem
    SSLCertificateKeyFile /etc/apache2/ssl/phpmyadmin.example.com.private.pem

    ServerName phpmyadmin.example.com

    DocumentRoot /var/www/phpmyadmin

    <Directory /var/www/phpmyadmin>
        AllowOverride Options Indexes FileInfo Limit AuthConfig
    </Directory>

    <Location />
        AuthType Basic
        AuthName "phpmyadmin on example.com"
        AuthUserFile /etc/apache2/passwords/passwords
        AuthGroupFile /etc/apache2/passwords/groups
        Require group developers
    </Location>
</VirtualHost>