Mirror/Cache Websites on a Mobile Server

John Judd asked:

I have been asked to determine the feasibility of mirroring or caching websites on a mobile server on a train. Unfortunately, this has been dropped on me at the last minute and I have to come up with an answer in a day or two, and I don’t have much experience in this area.

The train is:

  1. a long distance passenger carrier
  2. not always connected to the Internet (it comes in and out of range of 3G phone towers)
  3. will be using a 3G modem to connect to the Internet when it is in range

When out of service range, the guests should be able to continue to access websites that have previously been accessed. These will be general websites that guests might access in their off-train lives that we won’t have control over. There will also be websites that are cached or “pulled in” automatically such as news and current affairs, that won’t necessarily be guest initiated.

I know that we can mirror or cache visited pages, but I’m concerned about the ‘environment’ that we’ll be operating in.

  1. Most of the mirror sites I see are permanent connections to the Internet, with updates propagating through from the main site using wget or similar. How would an intermittent connection affect this process?
  2. This should be seamless, with the visitor typing in the normal URL of the site. If 3G isn’t available or the cached copy hasn’t expired, the cached page should be shown; otherwise it should be loaded from the original website (and cached for later.) Is it feasible to mirror the URL as well, or do we need our own domain name?
  3. I’ll need to let guests know when they’re out of range. I figure a custom version of the error page that browsers present when they can’t reach a site, but served from our server, would be the way to go here. Reasonable?
  4. I’m thinking that we’ll also need to do something special for managing content that is served through CDNs. (I suspect we’ll need a decent amount of storage on this server.) Am I correct?
  5. I’m not sure of the terminology for this (which hinders searching) can anyone point me to the correct terms for what I want to do?

Any other resources you can point me to would be appreciated.


My answer:

I actually set up something very similar to this once a year for a week-long event that’s held in the middle of nowhere, so I have a little experience to share.

First, the TL;DR: You can do it, but it won’t work nearly as well as you (or your higher-ups) might hope. It might not be worth bothering, especially if the interruptions are brief. But you might want to do it anyway, in order to save bandwidth and provide a faster experience when you are connected to 3G.

The component you’re looking for is a transparent proxy: one which intercepts outgoing HTTP requests that the client never intended to be proxied, and diverts them to a proxy server. Squid is the most common software used for transparent proxying, and it’s what I use.

The way this works is: a switch or router intercepts packets destined for port 80 of a remote address and mangles them so that they end up connecting to the proxy instead. The proxy then checks its cache, and on a cache miss it goes to the network. Typical proxy stuff. I do this diversion with some simple Linux iptables rules, though many routers and switches can also be configured to do it.
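For reference, the diversion on Linux boils down to a rule like the following (a sketch only; the interface, subnet, and port 3129 are assumptions for your setup, and Squid must be listening on that port with `http_port 3129 intercept` in its configuration):

```
# Redirect passenger HTTP traffic arriving on the LAN interface (eth1,
# an assumption) to Squid's interception port on this box.
iptables -t nat -A PREROUTING -i eth1 -s 10.0.0.0/24 \
    -p tcp --dport 80 -j REDIRECT --to-port 3129
```

Anything not matching the rule (HTTPS on port 443, for instance) passes through untouched, which is exactly the limitation discussed below.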

For your purposes, you will also need to do some significant tweaking to squid’s configuration, to override its cache handling. In particular you will want to cause it to serve a stale cached item when it fails to revalidate it on the network. I don’t have the configuration for this offhand, since it isn’t necessary in my design, where I’m at a fixed point and have continuous wireless service. But some careful documentation reading ought to suggest a way to do it.
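That said, the directives a careful read of the documentation points to look roughly like this — an untested sketch, not something I run in my own design, so verify each directive against the Squid documentation for your version:

```
# squid.conf fragment -- sketch for serving stale content when the
# upstream link is down. Untested; values are illustrative.

# Serve objects from cache without attempting upstream validation.
# You'd want to toggle this from a script that watches the 3G link.
offline_mode on

# Alternatively, allow responses to be served up to an hour stale
# when revalidation fails.
max_stale 1 hour

# Cache more aggressively than origin-server headers would allow.
refresh_pattern . 60 50% 1440 override-expire ignore-reload
```

The `offline_mode` toggle is the bluntest instrument; `max_stale` plus aggressive `refresh_pattern` options is the more surgical route.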

You will also want to create some custom Squid error pages which refer to your company and explain the various out of service conditions to be expected.
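Squid makes this straightforward: copy the stock templates, edit the ones that matter, and point Squid at your directory. A minimal sketch (the path is an assumption):

```
# squid.conf fragment -- use custom branded error pages.
# Copy the stock templates (e.g. from /usr/share/squid/errors/en)
# into this directory, then edit the relevant ones -- for
# out-of-coverage conditions that's mainly ERR_CONNECT_FAIL
# and ERR_DNS_FAIL.
error_directory /etc/squid/errors/train
```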

And now for the down side.

You won’t be able to do this with HTTPS requests at all. While Squid does support a method of intercepting HTTPS requests similarly to HTTP requests, you won’t be able to use it as it would require creating a CA and installing a certificate in every client’s browser. Easy enough for an enterprise, but not something you can do for a public service. And even if you could, it is not at all user friendly, will set off alarms in any privacy-minded person’s mind, and it is illegal to do so in some countries.

In addition, WebSockets, used by many web sites these days, will almost always fail when a transparent proxy is involved, because the proxy — doing what it is supposed to do — mangles the upgrade request beyond recognition. There is little you can do about this, except advise users to explicitly use the proxy server. In this case the browser knows to format the request differently, using HTTP CONNECT, so that it will pass through the proxy unmolested.

Finally, after speaking to some people familiar with traveling on Australia’s trains, I learned that these outages can sometimes last 10 to 15 minutes. There’s very little you can do about this; someone browsing the web during that time is quite likely to click on a link to a site you haven’t yet cached, and then you are not much better off than you are now — though with the cache in place you can at least advise the passenger of the situation (on HTTP, anyway). While the Internet is out, passengers might be better served by looking out the windows and trying to spot the Nullarbor Nymph.

And some basic stats. Last year the service used 42 GB of data and served an additional 17 GB from cache. This year the service used 87 GB of data and served just 744 MB from cache. That’s not a miscalculation, nor, as far as I can tell, a configuration error. The majority of the difference in caching between last year and this year seems to be that more major websites are now forcing HTTPS. For instance, last year I was able to cache some YouTube videos. This year I could not, because they are now served over HTTPS.

With more and more web sites moving to HTTPS, this caching strategy becomes less and less viable every year, and running the cache at all seems to be more and more pointless.

My recommendation is that you not bother. But you could set one up and run a trial on one train, and then measure the results.

You might also experiment with instructing users to configure the proxy explicitly, so that you can handle HTTPS and WebSockets, though in my experience this is something that’s difficult for users to get right. You might be able to implement WPAD to configure some users automatically, but be aware that Android and iOS devices have poor or no support for it.

View the full question and any other answers on Server Fault.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.