How to cache openSUSE repositories with Squid

Summary: How to set up a local squid web cache that works with openSUSE repositories and the openSUSE network installation process. In effect, a fully autonomous local mirror.

Background

I do quite a lot of testing of openSUSE, and more and more often, I install over the network. Previously, I kept the SuSE Linux and openSUSE DVDs available over NFS on a local server, but over the last couple of years, I've been working a lot more with Factory and the regular snapshots that lead up to a final/gold release. With those it is much easier to just point the installation process at the right URL.

Especially when testing installation or new hardware, I (and/or my colleagues) end up repeating the process many times on different machines. Sometimes virtual machines, sometimes desktops, sometimes servers in our downstairs datacentre. A local copy of the repository would seem to be the right thing, but it does require a tiny bit of manual interaction (specifying the URL when installing). Instead I thought of using squid: managing a local cache of downloaded objects is exactly what squid is good at. With this, the whole process would just be

a) download the NET ISO
b) copy it onto a USB stick
c) go and boot the machine.

Alas, squid and the openSUSE network installation process don't work together very well. Not out-of-the-box, anyway. The repository at download.opensuse.org is served by a load-distribution system combining MirrorBrain and metalinks. For the moment I won't go into any further detail; suffice it to say that this means packages are downloaded using segmented downloading spread over multiple mirrors, which makes it impossible for squid to do much caching.

The problem

Well, two problems really:

  • the openSUSE repository is mirrored around the world. MirrorBrain does a good job of picking the most suitable mirrors depending on your location, which also means the load is well distributed and individual mirrors aren't overloaded. However, squid does not know that multiple mirror sites serve the same file, so caching is mostly ineffective.
  • segmented downloading means a package is downloaded in pieces from multiple mirrors (see the illustrative log excerpt below). This is good for speeding up the download and making good use of the available downstream bandwidth. The problem is that squid can only cache whole files, not pieces of files, so again caching is rendered ineffective.
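
To make the second problem concrete, here is roughly what a segmented download looks like in squid's native access.log. These entries are illustrative, not a real capture; the mirror hosts are taken from the rules excerpt further down, and the package path is made up. The same file is fetched as a series of partial requests (status 206) from different mirrors:

1321095300.123    842 192.168.0.10 TCP_MISS/206 1048576 GET http://mirrors.sohu.com/opensuse/distribution/12.1/repo/oss/suse/x86_64/kernel-default-3.1.0-1.2.x86_64.rpm - DIRECT/1.2.3.4 application/x-rpm
1321095301.456    790 192.168.0.10 TCP_MISS/206 1048576 GET http://mirror.lupaworld.com/opensuse/distribution/12.1/repo/oss/suse/x86_64/kernel-default-3.1.0-1.2.x86_64.rpm - DIRECT/5.6.7.8 application/x-rpm
1321095302.789    805 192.168.0.10 TCP_MISS/206 1048576 GET http://fundawang.lcuc.org.cn/opensuse/distribution/12.1/repo/oss/suse/x86_64/kernel-default-3.1.0-1.2.x86_64.rpm - DIRECT/9.10.11.12 application/x-rpm

Every request is for the same package, but squid sees three different URLs and only partial content, so nothing reusable ends up in the cache.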

I have solved both of these problems:

  • using a squid URL rewriter, I map all the mirror locations onto a single one.
  • using a squid logfile and a custom-written daemon, I do complete downloads of all the files that are being fetched with segmented downloading.

Summary

For anyone doing repeated ad-hoc installations of openSUSE, using this squid setup means

  • downloads at local network speed
  • significant bandwidth savings
  • less load on openSUSE mirrors
  • zero local mirror management
  • no need to worry about where to install from

The URL rewriter

First of all, we need a list of the openSUSE mirrors. This is available here. Parsing the generated HTML is not exactly an optimal solution, but I've checked with the admin, and for the time being there is no text file available. It's probably also fairly safe to assume that the HTML format at mirrors.opensuse.org will not change very often.
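
For what it's worth, the scraping step doesn't need to be fancy. Here is a minimal sketch in Python, under the assumption (mine, not guaranteed by the page) that every mirror's base URL shows up as an href="http://..." link on the mirrors.opensuse.org page; the class name MirrorLinkParser is just for this example:

#!/usr/bin/env python3
# Sketch: pull candidate mirror base URLs out of the mirrors.opensuse.org
# page. Assumes each mirror base URL appears as an href="http://..." link;
# the real page layout may differ, so treat this as a starting point.
from html.parser import HTMLParser
from urllib.request import urlopen

class MirrorLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http://"):
                    self.urls.append(value)

page = urlopen("http://mirrors.opensuse.org/").read().decode("utf-8", "replace")
parser = MirrorLinkParser()
parser.feed(page)
for url in sorted(set(parser.urls)):
    print(url)

The output will need some filtering (not every link on the page is a mirror), but it gives the raw material for the rules file described below.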

The URL rewriter is a fairly mature piece of software named 'jesred'. I had to make a couple of changes to it to make it fully compatible with the latest squid (2.7); I've made my version available here.

Once you've (built and) installed jesred, you need these two lines in squid.conf:

storeurl_rewrite_program /usr/bin/jesred
storeurl_rewrite_children 5
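
As an aside, squid 2.7 also has a storeurl_access directive for controlling which requests are handed to the rewriter at all. Since the set of mirror hostnames is open-ended, I don't see a useful way to narrow it down, so something permissive like the following is the practical choice, leaving per-client filtering to jesred's own acl file (shown below):

storeurl_access allow all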

In my experience, the number of url_rewriter processes is not very critical, but 5 doesn't seem unreasonable. I believe squid will complain in its log if it's running short of url_rewriters.

The config file for jesred is /etc/squid/jesred.conf:

allow = /etc/squid/redirector.acl
rules = /etc/squid/opensuse-redirect.rules
redirect_log = /var/log/squid/redirect.log
rewrite_log = /var/log/squid/rewrite.log

Using /etc/squid/redirector.acl you can control which clients' requests the rewriter should process:

# rewrite all URLs from
192.168.0.0/21

The rewriter rules file /etc/squid/opensuse-redirect.rules is the key component here. I create this automagically whenever a new mirror list is available. This is just an excerpt from a recently generated file:

regexi ^http://download.opensuse.org/(.*)$                  http://download.opensuse.org/\1
regexi ^http://opensuse.mirror.ac.za/opensuse/(.*)$         http://download.opensuse.org/\1
regexi ^http://ftp.up.ac.za/mirrors/opensuse/opensuse/(.*)$ http://download.opensuse.org/\1
regexi ^http://mirror.bjtu.edu.cn/opensuse(.*)$             http://download.opensuse.org/\1
regexi ^http://fundawang.lcuc.org.cn/opensuse/(.*)$         http://download.opensuse.org/\1
regexi ^http://mirror.lupaworld.com/opensuse/(.*)$          http://download.opensuse.org/\1
regexi ^http://mirrors.sohu.com/opensuse/(.*)$              http://download.opensuse.org/\1

In my setup, I have a daily cron-job that fetches a copy of the mirror list and generates a new set of redirector rules if it has changed.
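
A minimal sketch of that generation step, to show the shape of it. The mirror list here is hard-coded for the example (in reality it would come from the scraper above), the file path matches the jesred config, and escaping the regex metacharacters in the hostnames is a small improvement over the hand-shown excerpt:

#!/usr/bin/env python3
# Sketch: turn a list of mirror base URLs into jesred 'regexi' rules that
# map every mirror onto download.opensuse.org. The rules file is only
# replaced when its content actually changes.
import os
import re

# Example values; in practice the list comes from the daily mirror-list fetch.
MIRRORS = [
    "http://opensuse.mirror.ac.za/opensuse/",
    "http://ftp.up.ac.za/mirrors/opensuse/opensuse/",
    "http://mirrors.sohu.com/opensuse/",
]
RULES_FILE = "/etc/squid/opensuse-redirect.rules"
TARGET = "http://download.opensuse.org/\\1"

rules = ["regexi ^http://download.opensuse.org/(.*)$ " + TARGET]
for mirror in MIRRORS:
    # Escape regex metacharacters (mostly the dots) in the mirror prefix.
    prefix = re.escape(mirror[len("http://"):])
    rules.append("regexi ^http://" + prefix + "(.*)$ " + TARGET)
new_content = "\n".join(rules) + "\n"

old_content = ""
if os.path.exists(RULES_FILE):
    with open(RULES_FILE) as f:
        old_content = f.read()

if new_content != old_content:
    with open(RULES_FILE, "w") as f:
        f.write(new_content)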

fetcher206

This is the daemon that works with squid to retrieve complete copies of files that were otherwise retrieved with segmented downloads. The key observation is that although squid is unable to assemble individual segments into a complete file, it is perfectly able to satisfy partial requests once it has retrieved the complete file.

In other words, the solution is to make sure squid gets a complete copy of every file that is retrieved with segmented downloading. fetcher206 does this by reading a squid logfile and using wget, pointed at squid itself, to fetch complete copies of those files, so that they end up in the cache.

Well, it has to have a name, and as it's looking for completed partial HTTP requests, and these are indicated by HTTP status code 206, I ended up with fetcher206.
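
I won't reproduce the whole daemon here, but the core idea fits in a few lines. The sketch below assumes squid's native access.log format and a proxy listening on localhost:3128, both of which are assumptions about your setup; it follows the log, and whenever a partial (206) GET shows up, it asks wget to fetch the full file through squid so the complete object lands in the cache:

#!/usr/bin/env python3
# Sketch of the fetcher206 idea: follow squid's access.log, and for every
# partial (206) response, fetch the complete file through squid itself so
# that the whole object ends up in the cache.
import subprocess
import time

LOGFILE = "/var/log/squid/access.log"
PROXY = "http://localhost:3128"     # assumption: squid on the local machine
TTL = 3600                          # don't refetch the same URL within an hour
recently_fetched = {}               # url -> time of last full fetch

def follow(path):
    """Yield lines as they are appended to path, like 'tail -f'.
    (Log rotation is not handled in this sketch.)"""
    with open(path) as f:
        f.seek(0, 2)                # start at the current end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(1)

for line in follow(LOGFILE):
    fields = line.split()
    if len(fields) < 7:
        continue
    # Native squid log: time elapsed client action/code size method URL ...
    action_code, method, url = fields[3], fields[5], fields[6]
    if method != "GET" or not action_code.endswith("/206"):
        continue
    now = time.time()
    if now - recently_fetched.get(url, 0) < TTL:
        continue                    # already fetched this file recently
    recently_fetched[url] = now
    # Fetch the complete file through squid; the body is thrown away,
    # but squid keeps a cacheable copy of the whole object.
    subprocess.Popen(["wget", "-q", "-O", "/dev/null",
                      "-e", "use_proxy=yes", "-e", "http_proxy=" + PROXY, url])

A real implementation also has to cope with log rotation and should probably throttle itself, but that's the whole trick.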