In part 1 I described some of the things I was doing to improve how my site is built. I had planned to implement the translation of each XHTML file to HTML in Python, but that didn't quite work out. I went back to my primary goals:

  1. Versioning of references to JS and CSS resources, so that I can improve browser caching of my pages.
  2. Tracking dependencies so that if something a page depends on is modified, the page is rebuilt.
  3. Custom macros for each project.

I was hoping to implement all of this in November before I started my December project, but I didn't. So I decided to focus on the most important part, versioning.

To implement versioning, I need to go through all my files. If I'm going through them anyway, I might as well build my sitemap. Some search engines (Bing, Google, Yahoo, Ask) will read a sitemap file to discover what URLs are on your site. The sitemap looks like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.redblobgames.com/maps/noisy-edges/</loc>
    <lastmod>2018-07-16T17:52:51Z</lastmod>
  </url>
  <url>
    <loc>https://www.redblobgames.com/pathfinding/tower-defense/</loc>
    <lastmod>2018-12-01T05:31:29Z</lastmod>
  </url>

</urlset>

and the crawler can discover it by looking at robots.txt:

User-agent: *
Disallow: 

Sitemap: https://www.redblobgames.com/sitemap.xml

I had been using a shell script for this, but I rewrote it to use Python's os.fwalk to build a catalog of all the HTML, JS, and CSS files and their last-modified timestamps:

import collections
import os

# Each catalog entry records a file's site-relative name, its extension, and its stat info
File = collections.namedtuple('File', ['filename', 'filetype', 'stat'])

TOP = '/path/to/site/'   # placeholder: absolute path of the site root, with trailing slash

IGNORED_DIRECTORIES = [
    ".git",
    "node_modules",
    "third-party",
    …
]

catalog = {}
for root, dirs, files, dir_fd in os.fwalk(TOP, topdown=True):
    # prune ignored directories so os.fwalk doesn't descend into them
    for ignore_dir in IGNORED_DIRECTORIES:
        if ignore_dir in dirs:
            dirs.remove(ignore_dir)

    for name in files:
        stat = None
        filename = os.path.join(root, name).replace(TOP, '')
        filetype = os.path.splitext(name)[1]
        if filetype in ('.html', '.js', '.css'):
            try:
                stat = os.stat(name, dir_fd=dir_fd)
            except FileNotFoundError:
                pass  # broken symlink; leave stat as None

        catalog[filename] = File(filename, filetype, stat)

I then went through this list to output the sitemap. As with everything, it's not as simple as that! I have three different sitemaps, and have to look at the file contents to determine which sitemap to put each file into. But I'm going to leave out all those details.
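
Here's a minimal sketch of the simple single-sitemap case, using the catalog built above (write_sitemap is a hypothetical helper name, not the real script):

import datetime

def write_sitemap(catalog, output_filename='sitemap.xml'):
    "Write one sitemap entry per HTML file in the catalog (simplified)"
    with open(output_filename, 'w') as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for entry in catalog.values():
            if entry.filetype != '.html' or entry.stat is None:
                continue
            # an index.html stands for its directory's URL
            loc = 'https://www.redblobgames.com/' + entry.filename
            if loc.endswith('index.html'):
                loc = loc[:-len('index.html')]
            lastmod = datetime.datetime.fromtimestamp(
                entry.stat.st_mtime, tz=datetime.timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
            out.write('  <url>\n')
            out.write('    <loc>%s</loc>\n' % loc)
            out.write('    <lastmod>%s</lastmod>\n' % lastmod)
            out.write('  </url>\n')
        out.write('</urlset>\n')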

The real goal is to visit all the files to add versioning. Why?

Browsers have two kinds of caching:

  1. Validation caching has the browser asking the server each time, "has this URL changed since YYYY-MM-DD HH:MM:SS?" The advantage is that it never uses an out-of-date file. But it's slower; it won't use the cached file until it has waited for an answer from the server.
  2. Expiration caching has the browser keeping the file until a certain time. The advantage is that it doesn't have to ask the server each time. But it's error-prone; it might use a file that's inconsistent with the other parts of the page. (Example response headers for both are shown after this list.)
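
As a rough illustration (these aren't my exact server settings), validation caching corresponds to response headers like the first pair below, and expiration caching to the last line:

Last-Modified: Mon, 16 Jul 2018 17:52:51 GMT
Cache-Control: no-cache

Cache-Control: max-age=604800

Here no-cache means "store it, but check with the server before using it", and max-age=604800 means "reuse it without asking for up to a week".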

To get the advantages of both and the disadvantages of neither, I need to mark the CSS and JS URLs with a version number or fingerprint. The browser will keep the file for a long time, but when the file changes, I change the URL, so the browser requests a new copy. If I were writing a "web app", bundling tools like Webpack could do this for me. However, I'm not writing a web app. I have a web site with hundreds of articles going back decades. I decided to modify all the HTML files to include version strings. That way I don't have to separately implement versioning for each project.

For each HTML file, I look for references to local CSS and JS files, look up the timestamps of those files in the catalog variable, and then use Python's re.sub to replace them:

import datetime
import os
import re

RE_stylesheet = re.compile(r'<link rel="stylesheet" href="([^":]+?)"')
RE_script = re.compile(r'<script src="([^":]+?)"')

def inject_version_data(catalog, html_filename, html_contents):
    "Modify the HTML contents to have timestamps on script and style URLs"
    working_dir = os.path.dirname(html_filename)

    def replace_url(match):
        # keep the text before and after the captured URL unchanged
        prefix = match.string[match.start(0):match.start(1)]
        suffix = match.string[match.end(1):match.end(0)]
        dep_url = match.group(1)
        # drop any version string already on the URL
        dep_url = dep_url.partition('?')[0]
        # resolve the URL to a site-relative filename, matching the catalog keys
        if dep_url.startswith('/'):
            dep_filename = dep_url[1:]
        else:
            dep_filename = os.path.join(working_dir, dep_url)
        dep_filename = os.path.normpath(dep_filename)

        entry = catalog.get(dep_filename)
        if entry and entry.stat:  # skip files not in the catalog or that couldn't be stat'ed
            version = datetime.datetime.fromtimestamp(entry.stat.st_mtime).strftime('%Y-%m-%d-%H-%M-%S')
            dep_url = dep_url + '?' + version

        return prefix + dep_url + suffix

    html_contents = RE_stylesheet.sub(replace_url, html_contents)
    html_contents = RE_script.sub(replace_url, html_contents)
    return html_contents

Python's regular expression substitution lets me run a function to generate the replacement text. Very handy! When Python finds the string '<script src="foo.js"', it runs the replace_url function, which returns the string '<script src="foo.js?2015-07-11-09-00-00"'. Once I change all the file contents this way, I write them back out to the original files.
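
The outer loop that ties this together looks roughly like the sketch below; it reuses the catalog and the TOP site-root path from earlier, and the real script has more bookkeeping than this:

import os

for filename, entry in catalog.items():
    if entry.filetype != '.html' or entry.stat is None:
        continue
    full_path = os.path.join(TOP, filename)
    with open(full_path, 'r', encoding='utf-8') as f:
        original = f.read()
    updated = inject_version_data(catalog, filename, original)
    if updated != original:
        # only rewrite files whose script/style references actually changed
        with open(full_path, 'w', encoding='utf-8') as f:
            f.write(updated)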

With this change, I'm hoping for two things:

  1. You won't have to shift-reload to fix a page that's broken because of stale cached files.
  2. I can increase the cache times so that the pages load faster.

So far, it's working really well! My old shell script ran in 25 seconds and generated just the sitemap; the Python script runs in 1 second, and it both generates the sitemap and injects version numbers into the HTML. I'd also like to make other changes to the website build process, but this was the most important one, and I don't know how to implement the others right now, so I think I'll move on to the next project.
