Difference between revisions of "How to cache openSUSE repositories with Squid"
Per Jessen (Talk | contribs) m (→60% faster at 6Mbit/s downstream) |
Per Jessen (Talk | contribs) m (→The problem) |
Revision as of 07:51, 22 May 2012
Summary
How to make your local Squid web cache work with openSUSE repositories and the openSUSE network installation process. In effect, how to run a fully autonomous, local on-demand repository mirror. Even with a high-speed ADSL internet connection, savings of up to 60% are easily achieved.
Background
In my company, we do quite a lot of testing of openSUSE, and over the last three to four years we have increasingly switched to installing over the network. Prior to that, we would install from DVD images over NFS served by a local server. However, over the last couple of years, we've been working a lot more with Factory and the regular snapshots that lead up to a final/gold release. With those it is much easier to just point the installation process to the right URL and have everything downloaded there and then.
When we're testing installation or new hardware, we often have to repeat the installation process many times on different machines. Not because it doesn't work as such, but because we might be testing or debugging our own add-ons, or collecting diagnostics. Sometimes we install on virtual machines, sometimes on desktops, more often on server hardware in our downstairs datacentre. We have a local Squid web cache, but after having switched to doing network installs more frequently, I have often been annoyed by how ineffective it is at caching the openSUSE repository. When I've already done one installation, the downloads for a subsequent one should obviously happen a lot faster, in fact at wire speed. Well, they don't, and that's annoying when you know they should have been cached.
The immediate alternative would be to run a local mirror of the openSUSE repositories, but that requires a process for keeping the local mirror up-to-date, plus a bit of manual interaction (adding the right URL when installing). This is all entirely feasible, but I thought using Squid would be a more elegant and (hopefully) fully autonomous solution, so I decided to figure out why our Squid wasn't coping.
Well, Squid and the openSUSE network installation process just don't work together very well. Not out-of-the-box anyway. The repository at download.opensuse.org is served by a load-distribution system combining MirrorBrain and metalinks. I won't go into any further detail; suffice it to say that packages are downloaded using segmented downloading spread over multiple mirrors, and together these make it impossible for Squid to do much caching.
The problem
Well, two problems really:
- the openSUSE repositories are mirrored around the world, and clients are served by MirrorBrain. MirrorBrain does a good job of picking the most suitable mirrors for your location, which also spreads the load so that individual mirrors aren't overloaded. However, Squid does not know that multiple mirror sites serve the same file, making caching at best ineffective.
- the segmented download means a package is downloaded in bits from multiple mirrors. This is good for speeding up the download and making good use of the available downstream bandwidth. The problem is that Squid is only able to cache whole files, not parts of files, rendering caching completely useless.
I have solved both of these problems:
- using a Squid URL rewriter, I map all the mirror locations onto a single one.
- using a Squid logfile and a custom-written daemon, I do complete downloads of all the files that are being fetched with segmented downloading.
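The first fix boils down to normalising the cache key: whichever mirror a file comes from, Squid should file it under one canonical URL. Here is a minimal Python sketch of that idea. It is not jesred's code, and the mirror prefixes and the canonical prefix are made-up placeholders rather than entries from the real mirror list.

```python
# Hypothetical mirror prefixes; a real list would come from the mirror database.
MIRROR_PREFIXES = [
    "http://mirror1.example.org/pub/opensuse/",
    "http://ftp.example.net/opensuse/",
]

# Assumed canonical prefix under which every copy of a file is stored.
CANONICAL_PREFIX = "http://download.opensuse.org/"


def canonical_store_url(url: str) -> str:
    """Return the URL to use as the cache key for `url`."""
    for prefix in MIRROR_PREFIXES:
        if url.startswith(prefix):
            # Keep the path below the mirror's base, swap the base itself.
            return CANONICAL_PREFIX + url[len(prefix):]
    # Not a known mirror: leave the URL untouched.
    return url
```

With a mapping like this, a package fetched from either placeholder mirror ends up under the same key, so a later request routed to a different mirror can still be answered from the cache.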
Summary
For anyone, an individual or a group of people, doing repeated ad-hoc installations of openSUSE (typically Factory), using this squid setup means
- significantly faster installation due to downloads at wire speed
- significant bandwidth savings due to a working cache
- less load on openSUSE mirrors due to a working cache
- zero local mirror management (assuming a working Squid setup)
- no need to worry about where to install from
Others doing e.g. repeated updates or adding software, should enjoy similar benefits (once the packages have been cached).
60% faster at 6Mbit/s downstream
I run this setup primarily to save time on installations. In the office, we have a 6000/600Kbit ADSL connection. It's sufficient for most activities, but when installing openSUSE over the network, it's really a bit slow. For openSUSE 12.1, it takes about an hour to complete phase 1 of the install process - 6-7 minutes for the initial 6 system installation images, then 50 minutes for a vanilla KDE installation.
However, installing at wire speed (our LAN is 100Mbit) from the Squid cache is a lot faster, taking only 22 minutes (15 seconds for the initial 6 installation images, 21 minutes for phase 1 to complete). That is a reduction of a little more than 60%. With a slower internet connection, and perhaps slower mirrors too, even more time would be saved.
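For anyone who wants to check the arithmetic, here is a small Python sketch using the figures quoted above. The only assumption is that "6-7 minutes" is taken as its midpoint, 6.5 minutes.

```python
# Figures quoted above, in minutes; 6.5 is the assumed midpoint of "6-7 minutes".
direct = 6.5 + 50      # installation images + KDE packages straight from the mirrors
cached = 15 / 60 + 21  # the same two steps served from the Squid cache

# Fraction of the installation time saved by the cache.
reduction = 1 - cached / direct
print(f"direct {direct:.1f} min, cached {cached:.2f} min, saved {reduction:.0%}")
```

This comes out at roughly 62%, in line with the "little more than 60%" above.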
Download
For the impatient, I've tar'ed everything into a single download. This contains the daemon code, sample config files and the scripts for keeping up with the list of openSUSE mirrors. It's not as easy as just plonking another package into your openSUSE system with YaST or zypper, but the following step by step guide will hopefully help.
Current version is 1.0.
Step by step
Squid
The Squid web proxy is the key element in this setup, so a working Squid installation is a prerequisite. Setting up Squid is not as complicated as it may appear, but for that you'll have to consult the Squid documentation; it's outside the scope of this article. Whether you direct access using the environment variable http_proxy et al., or run a transparent proxy (like I do), is not really important.
Note: the setup here works for Squid 2.7. I don't think the storeurl_rewrite feature has been implemented in Squid 3.x yet.
jesred
jesred is the URL rewriter (original webpage). It's fairly mature and fully functional, although I had to make a couple of changes to make it fully compatible with Squid 2.7.
For the moment, it does not come packaged, so you'll have to build it from source:
tar xzvf <tarball>
cd jesred-1.3
make
Installation: when you're done, copy the binary jesred into /usr/bin.
The config file for jesred: /etc/squid/jesred.conf
allow = /etc/squid/redirector.acl
rules = /etc/squid/opensuse-redirect.rules
redirect_log = /var/log/squid/redirect.log
rewrite_log = /var/log/squid/rewrite.log
Using /etc/squid/redirector.acl you can control which clients' requests the rewriter should process, but I find this is actually easier to control with Squid's ACL and storeurl_access directive, so I enable for all clients:
# rewrite all URLs from
0.0.0.0/0
/etc/squid/squid.conf
Configuration: add the following lines to /etc/squid/squid.conf
# ACLs have to be defined before they are referenced
acl localhost src 127.0.0.0/8
acl localnet src 192.168.0.0/16
acl metalink req_mime_type application/metalink4+xml
storeurl_rewrite_program /usr/bin/jesred
storeurl_rewrite_children 5
storeurl_access deny metalink
storeurl_access allow localnet
storeurl_access allow localhost
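To give an idea of what Squid expects from the program named in storeurl_rewrite_program, here is a minimal Python sketch of a helper loop. It is not how jesred is implemented. It assumes the helper receives one request per line on stdin with the URL as the first field, and answers each with one line on stdout (the store URL, or an empty line for "no change"). The `serve` function and its `rewrite` argument are names invented for this illustration.

```python
import sys


def serve(rewrite, infile=sys.stdin, outfile=sys.stdout):
    """Answer one line per request until the input is closed."""
    for line in infile:
        fields = line.split()
        # Assumed layout: the URL comes first; client, user and method follow
        # and are ignored in this sketch.
        reply = rewrite(fields[0]) if fields else ""
        outfile.write(reply + "\n")
        # Squid waits for each answer, so the reply must not sit in a buffer.
        outfile.flush()
```

A real helper would call something like `serve(my_rewrite_function)` as its main program. The explicit flush after every reply is the important detail: a helper that buffers its output leaves Squid waiting.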
fetcher206 logfile
Amend /etc/squid/squid.conf as follows:
logformat f206 %{%Y-%m-%dT%H:%M:%S}tl %Ss/%03Hs %rm %ru %mt
access_log /var/log/squid/fetch206.log f206
This log will be read by fetcher206.
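With the logformat above, each log line has five space-separated fields: local timestamp, Squid status/HTTP status, request method, URL and MIME type. The Python sketch below shows how such a line could be parsed. It is not the fetcher206 code (which is PHP), and the selection rule, a GET answered with 206 Partial Content, is my assumption about which entries matter.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    timestamp: str
    squid_status: str
    http_status: int
    method: str
    url: str
    mime_type: str


def parse_f206(line: str) -> Optional[Entry]:
    """Parse one line of the f206 log; return None if it does not fit."""
    parts = line.split()
    if len(parts) != 5:
        return None
    timestamp, status, method, url, mime_type = parts
    # The second field looks like "TCP_MISS/206".
    squid_status, _, code = status.partition("/")
    if not code.isdigit():
        return None
    return Entry(timestamp, squid_status, int(code), method, url, mime_type)


def wants_full_fetch(entry: Entry) -> bool:
    """Assumed rule: a partial (segmented) GET should trigger a full download."""
    return entry.method == "GET" and entry.http_status == 206
```

A line such as `2012-05-22T07:51:00 TCP_MISS/206 GET http://download.opensuse.org/a.rpm application/x-rpm` (an invented example) would be picked up, while a plain 200 response would be ignored.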
To prevent it growing too big, add the following to /etc/logrotate.d/ :
/var/log/squid/fetch206.log {
    compress
    dateext
    maxage 365
    rotate 5
    size=+4M
    notifempty
    missingok
    create 640 squid root
    sharedscripts
    postrotate
        /etc/init.d/squid reload
    endscript
}
squid delay pool
This is an optional step - depending on your available downstream bandwidth, you may want to restrict what is used by fetcher206 for retrieving the repository files. This prevents
- slowing down the current installation and
- abuse of the internet connection
delay_pools 1
delay_class 1 1
delay_access 1 allow localhost
delay_parameters 1 1000000/1000000
Add the above to /etc/squid/squid.conf - it defines one delay_pool, only accessible from localhost (which is where fetcher206 will be running wget) with a maximum bandwidth of 1MByte/sec.
If you have other http/proxy traffic originating from localhost, you could just add another 127.0.0.x address, and use that specifically for fetcher206.
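The values in delay_parameters are bytes per second, whereas line speeds are usually quoted in bits, so picking a value takes a small conversion. The Python sketch below is one way to do it. The helper name and the idea of reserving a fixed share of the line are my own illustration, and it assumes 1 Kbit = 1000 bits.

```python
def delay_parameters(link_kbit_per_s: int, share: float, pool: int = 1) -> str:
    """Build a delay_parameters line granting `share` of the downstream link."""
    # Kbit/s -> bytes/s (assuming 1 Kbit = 1000 bits), then take the share.
    bytes_per_s = int(link_kbit_per_s * 1000 / 8 * share)
    # Same value for restore rate and bucket size, as in the example above.
    return f"delay_parameters {pool} {bytes_per_s}/{bytes_per_s}"


print(delay_parameters(6000, 0.5))
```

A 6000Kbit line carries about 750000 bytes per second, so giving fetcher206 half of it would mean `delay_parameters 1 375000/375000`.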
mirror database
We need a current list of the available openSUSE mirrors. This can be retrieved from mirrors.opensuse.org.
mkdir -p /var/lib/fetcher206
cp tarball/Makefile.mirrors /var/lib/fetcher206/Makefile
make -C /var/lib/fetcher206
cp tarball/opensuse_mirrors.cron /etc/cron.d/opensuse_mirrors
reload squid
When you've come this far, it's time to reload squid with
squid -k reconfigure
fetcher206
fetcher206 is, for the time being, a PHP script. Install it by simply copying it into /usr/bin. It has a few hard-coded options, such as number of wgets to run concurrently, name of logfile etc.
fetcher206 does not yet have a systemd service unit, nor an LSB init-script. For the time being, you simply start it with:
startproc -s -q /usr/bin/fetcher206
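To illustrate what one of the concurrent wget runs might look like, here is a hedged Python sketch (fetcher206 itself is PHP and may well do this differently). The function name is invented, and the proxy address assumes Squid listening on its default port 3128 on localhost. The download goes through the proxy so that Squid stores the complete file, and the body itself is thrown away.

```python
import os


def wget_job(url, proxy="http://127.0.0.1:3128"):
    """Return (argv, env) for fetching `url` through the local proxy."""
    # The file only needs to pass through Squid; the local copy is discarded.
    argv = ["wget", "--quiet", "--output-document=/dev/null", url]
    # wget picks up the proxy from the http_proxy environment variable.
    env = dict(os.environ, http_proxy=proxy)
    return argv, env
```

The pair can then be handed to `subprocess.run(argv, env=env)`. Because the request originates from localhost, it falls under the delay pool described earlier.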