How to cache openSUSE repositories with Squid
Summary: how to make your local squid web cache work with openSUSE repositories and the openSUSE network installation process. In effect, how to run a fully autonomous local mirror.
Background
I do quite a lot of testing of openSUSE, and more and more often I install over the network. Previously I kept the SuSE Linux and openSUSE DVDs available over NFS on a local server, but over the last couple of years I've been working a lot more with Factory and the regular snapshots that lead up to a final/gold release. With those it is much easier to just point the installation process at the right URL.
When testing installation or new hardware, we often have to repeat the installation process many times on different machines: sometimes virtual machines, sometimes desktops, sometimes servers in our downstairs datacentre. We have a local squid web cache, but I have noticed several times that it doesn't seem to be very effective for caching the openSUSE repository. An alternative would be to run a local copy of the openSUSE repository, but that requires a process for keeping the local mirror up-to-date, plus a bit of manual interaction (adding "install=<localurl>") when installing. This is all entirely possible, but I thought using squid would be a more elegant and (hopefully) fully autonomous solution, so I decided to figure out why squid wasn't working.
Well, squid and the openSUSE network installation process just don't work together very well. Not out-of-the-box, anyway. The repository at download.opensuse.org is served by a load-distribution system combining MirrorBrain and metalinks. I won't go into further detail; suffice it to say that packages are downloaded using segmented downloading spread over multiple mirrors, which together makes it all but impossible for squid to do much caching.
The problem
Well, two problems really:
- the openSUSE repository is mirrored around the world. MirrorBrain does a good job of picking the most suitable mirrors depending on your location, which also means a good distribution of load so individual mirrors aren't overloaded. However, squid does not know that multiple mirror sites serve the same file, so caching is rendered largely ineffective.
- the segmented download means a package is downloaded in bits from multiple mirrors. This is good for speeding up the download and making good use of the available downstream bandwidth. The problem is that squid can only cache whole files, not parts of files, so now caching is completely useless.
I have solved both of these problems:
- using a squid url rewriter, I map all the mirror locations on to a single one.
- using a squid logfile and a custom written daemon, I do complete downloads of all the files that are being fetched with segmented downloading.
Summary
For anyone, an individual or a group of people, doing repeated ad-hoc installations of openSUSE (typically Factory), using this squid setup means
- significantly faster installation due to downloads at local network speed
- significant bandwidth savings due to a working cache
- less load on openSUSE mirrors due to a working cache
- zero local mirror management (assuming a working squid setup).
- no need to worry about where to install from
The URL rewriter
First of all, we need a list of the openSUSE mirrors. This is available here - parsing the generated HTML is not exactly an optimal solution, but I've checked with the admin, and for the time being no text file is available. It's probably also fairly safe to assume that the HTML format at mirrors.opensuse.org will not change very often.
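As an illustration of the scraping step (this is not the author's actual script, and the exact markup of mirrors.opensuse.org is an assumption here), Python's standard html.parser is enough to pull mirror base URLs out of the page:

```python
from html.parser import HTMLParser


class MirrorLinkParser(HTMLParser):
    """Collect href attributes that look like mirror base URLs."""

    def __init__(self):
        super().__init__()
        self.mirrors = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Keep only absolute http:// links; relative links are page chrome.
            if name == "href" and value and value.startswith("http://"):
                self.mirrors.append(value.rstrip("/"))


def extract_mirrors(html):
    parser = MirrorLinkParser()
    parser.feed(html)
    return parser.mirrors


# Hypothetical snippet resembling one row of the mirror list page:
sample = '<tr><td><a href="http://mirrors.sohu.com/opensuse/">sohu</a></td></tr>'
print(extract_mirrors(sample))  # → ['http://mirrors.sohu.com/opensuse']
```

A real run would fetch the page first and probably filter by a known table section; the filter above is deliberately crude.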
The URL rewriter is a fairly mature piece of software named "jesred". I had to make a couple of changes to it to make it fully compatible with the latest squid (2.7); I've made my version available here.
Once you've installed jesred, you need these two lines in squid.conf:
storeurl_rewrite_program /usr/bin/jesred
storeurl_rewrite_children 5
In my experience, the number of url_rewriter processes is not very critical, but 5 doesn't seem unreasonable. As far as I can tell, squid will log a warning if it runs out of url_rewriters.
The config file for jesred is /etc/squid/jesred.conf:
allow = /etc/squid/redirector.acl
rules = /etc/squid/opensuse-redirect.rules
redirect_log = /var/log/squid/redirect.log
rewrite_log = /var/log/squid/rewrite.log
Using /etc/squid/redirector.acl you can control which clients' requests the rewriter should process:
# rewrite all URLs from 192.168.0.0/21
192.168.0.0/21
The rewriter rules file /etc/squid/opensuse-redirect.rules is the key component here. I create this automagically whenever a new mirror list is available. This is just an excerpt from a recently generated file:
regexi ^http://download.opensuse.org/(.*)$ http://download.opensuse.org/\1
regexi ^http://opensuse.mirror.ac.za/opensuse/(.*)$ http://download.opensuse.org/\1
regexi ^http://ftp.up.ac.za/mirrors/opensuse/opensuse/(.*)$ http://download.opensuse.org/\1
regexi ^http://mirror.bjtu.edu.cn/opensuse(.*)$ http://download.opensuse.org/\1
regexi ^http://fundawang.lcuc.org.cn/opensuse/(.*)$ http://download.opensuse.org/\1
regexi ^http://mirror.lupaworld.com/opensuse/(.*)$ http://download.opensuse.org/\1
regexi ^http://mirrors.sohu.com/opensuse/(.*)$ http://download.opensuse.org/\1
In my setup, I have a daily cron job that fetches a copy of the mirror list and generates a new set of redirector rules if it has changed.
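Generating the rules file from a mirror list is mechanical. A minimal sketch, assuming a plain list of mirror base URLs as input (unlike the excerpt above, it escapes the literal dots in hostnames, which is a slight tightening of the pattern):

```python
def make_rules(mirror_urls, target="http://download.opensuse.org"):
    """Emit one jesred 'regexi' rule per mirror base URL,
    mapping every mirror path onto the canonical download host."""
    rules = []
    for url in mirror_urls:
        base = url.rstrip("/")
        # Escape literal dots so e.g. 'mirrors.sohu.com' can't match 'mirrorsXsohu.com'.
        pattern = "^" + base.replace(".", r"\.") + "/(.*)$"
        rules.append("regexi %s %s/\\1" % (pattern, target))
    return "\n".join(rules)


print(make_rules(["http://mirrors.sohu.com/opensuse/"]))
# → regexi ^http://mirrors\.sohu\.com/opensuse/(.*)$ http://download.opensuse.org/\1
```

The cron job would feed this the scraped mirror list and only rewrite /etc/squid/opensuse-redirect.rules (and reload squid) when the output differs from the current file.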
fetcher206
This is the daemon that works with squid to retrieve complete copies of files that were otherwise retrieved with segmented download. The explanation is that although squid is unable to assemble individual segments into a complete file, it is able to satisfy partial requests once a complete copy of a file has been retrieved.
In other words, the solution is to make sure squid gets a complete copy of every file that is retrieved with segmented download. fetcher206 does this by reading a squid logfile and using wget to fetch complete copies of files.
fetcher206?? Well, the daemon had to have a name, and as it's looking for completed partial HTTP requests, and these are indicated by an HTTP status code 206, I ended up with fetcher206.
Configuring the logfile for fetcher206 in /etc/squid/squid.conf:
logformat f206 %{%Y-%m-%dT%H:%M:%S}tl %Ss/%03Hs %rm %ru %mt
access_log /var/log/squid/fetch206.log f206
These two lines define a new logformat called 'f206' and make squid write to the specified logfile. It would have been better to use a named pipe here, but as far as I can tell, squid doesn't support that. I use logrotate to stop this file growing too big.
For now the daemon is written in PHP - at some point I want to rewrite it in C, but I find PHP is very useful for fast prototyping. There is room for improvement, but it does a pretty decent job as it is.
Daemon pseudo-code:
read config
while true
    check jobqueue, joblist
    if logfile has data
        look for TCP/206; if host is an openSUSE mirror, update joblist
    done
done
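The log-scanning step above can be sketched in Python (the author's daemon is PHP; the field layout follows the f206 logformat defined earlier, and the mirror check here is a simplified stand-in for the real mirror list):

```python
def parse_f206_line(line, mirror_hosts):
    """Return the URL that should be fetched in full, or None.

    The f206 format is: timestamp squidstatus/httpstatus method url mimetype.
    Only completed partial GETs (HTTP 206) against known mirrors qualify.
    """
    fields = line.split()
    if len(fields) < 4:
        return None
    status, method, url = fields[1], fields[2], fields[3]
    if method != "GET" or not status.endswith("/206"):
        return None
    # Crude host extraction; good enough for plain http:// repository URLs.
    host = url.split("/")[2] if url.startswith("http://") else ""
    return url if host in mirror_hosts else None


mirrors = {"mirrors.sohu.com"}
line = ("2010-05-30T12:00:00 TCP_MISS/206 GET "
        "http://mirrors.sohu.com/opensuse/foo.rpm application/x-rpm")
print(parse_f206_line(line, mirrors))
# → http://mirrors.sohu.com/opensuse/foo.rpm
```

The daemon proper would run this over each new logfile line, de-duplicate the URLs into a job queue, and hand jobs to wget workers.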
Restricting bandwidth abuse
When files are not yet cached, running fetcher206 will produce a little more network load. I have not looked at exactly how much more, but as fetcher206 is intended to help squid speed up the next installation, I use a squid delay_pool to restrict the bandwidth used, such that we a) don't slow down the current installation and b) don't abuse the internet connection:
delay_pools 1
delay_class 1 1
delay_access 1 allow localhost
delay_parameters 1 1000000/1000000
This defines one delay_pool, only accessible from localhost (which is where fetcher206 will be running wgets) with a maximum bandwidth of 1MByte/sec.
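For completeness, a hedged sketch of how fetcher206 might invoke wget: the proxy address (localhost:3128, squid's default port) and the exact flags are my assumptions, not details taken from the author's daemon.

```python
def build_wget_command(url, proxy="http://localhost:3128"):
    """Build an argv for a full-file fetch routed through the local squid.

    Routing through the proxy is the whole point: the complete response
    lands in squid's cache, where it can later satisfy range requests.
    """
    return [
        "wget",
        "--quiet",
        "--output-document=/dev/null",       # only the cache side effect matters
        "--execute", "http_proxy=" + proxy,  # force the request through squid
        url,
    ]


print(" ".join(build_wget_command("http://download.opensuse.org/foo.rpm")))
```

The list can be handed straight to subprocess for execution, and because delay_access only matches localhost, these fetches are exactly the traffic the delay pool throttles.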