There are many days when I don't feel like working on my project. I use this feeling to "productively procrastinate" on things that I've been wanting to do but haven't done yet. Earlier this week I decided to tackle two related problems:

  1. I want to know which pages are reachable from the home page. I can then review the ones that aren't reachable and consider adding them if they're finished.
  2. I want to make suggestions on the 404 page, but only to pages that are reachable from the home page. There are a whole bunch of random pages I have that aren't finished or useful, and I don't want to use those for suggestions.

To implement this, I parsed each page and found the links using a regular expression pattern and some quick-and-dirty code:

import re
import urllib.parse

# Capture the href value up to (but not including) any "#fragment"
RE_anchor = re.compile(r'<a[^>]* href="([^"#]+)[^"]*"')
def url_links_to(relativeurl, html_contents):
    "Return the list of unique site-relative links from this page"
    site = "https://www.redblobgames.com"
    urls = []
    for url in RE_anchor.findall(html_contents):
        if url.startswith("mailto:"): continue
        # Resolve relative links against this page's full URL
        url = urllib.parse.urljoin(site + relativeurl, url)
        if not url.startswith(site): continue
        url = url[len(site):]
        if not url.endswith((".html", "/")): continue
        # Treat "/foo/index.html" and "/foo/" as the same page
        if url.endswith("/index.html"): url = url[:-len("index.html")]
        if url == "/": url = "/index.html"
        if url in urls: continue
        urls.append(url)
    return urls
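
To see what the pattern does and does not pick up, here is a small sketch run against made-up HTML. The sample paths and the `https://example.com` base are placeholders, not pages from the real site. It shows that `#fragment` suffixes are dropped, fragment-only and single-quoted links are skipped, and `urljoin` resolves `../` against the current page:

```python
import re
import urllib.parse

# Same pattern as above: capture the href value up to any "#fragment"
RE_anchor = re.compile(r'<a[^>]* href="([^"#]+)[^"]*"')

# Hypothetical page contents, for illustration only
sample_html = '''
<a href="/articles/one.html#section">One</a>
<a class="nav" href="../grids/">Grids</a>
<a href="#top">Back to top</a>
<a href='/single-quoted.html'>Not matched</a>
'''

links = RE_anchor.findall(sample_html)
print(links)

# Resolve a relative link against the (placeholder) URL of the current page
joined = urllib.parse.urljoin("https://example.com/articles/one.html", "../grids/")
print(joined)
```

Only the first two links are captured. A regular expression like this is a shortcut that works when the site's own HTML is consistent. An HTML parser would be the sturdier choice for arbitrary pages.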

I then used depth-first search to find all the pages reachable from the home page:

# link_map[url] = url_links_to(url, contents of page)
def all_reachable_pages(link_map):
    "Return the set of all pages reachable from the home page"
    frontier = ["/index.html"]
    reached = set(frontier)
    while frontier:
        url = frontier.pop()
        if url not in link_map:
            print("WARNING: possible 404", url)
            continue
        for child in link_map[url]:
            if child not in reached:
                frontier.append(child)
                reached.add(child)

    return reached

For part 1, I made a list of the reachable pages and I plan to review it periodically.
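
The review list for part 1 is the complement: pages that exist but were never reached. Here is a minimal sketch on a toy site. The `link_map` contents and the `reachable_from` helper are hypothetical stand-ins, and the sketch assumes every existing page appears as a key in `link_map`:

```python
def reachable_from(start, link_map):
    """Return the set of pages reachable from start by following links."""
    frontier = [start]
    reached = {start}
    while frontier:
        url = frontier.pop()
        # Pages missing from link_map are treated as having no outgoing links
        for child in link_map.get(url, []):
            if child not in reached:
                reached.add(child)
                frontier.append(child)
    return reached

# Hypothetical site: each key is a page that exists, each value its outgoing links
link_map = {
    "/index.html": ["/articles/", "/about.html"],
    "/articles/": ["/articles/finished.html", "/index.html"],
    "/articles/finished.html": [],
    "/about.html": ["/index.html"],
    "/drafts/unfinished.html": ["/index.html"],
}

reached = reachable_from("/index.html", link_map)
unreachable = sorted(set(link_map) - reached)
print(unreachable)
```

The draft page links to the home page, but nothing links to it, so it is the only entry in `unreachable`.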

For part 2, I wanted to help readers who encounter a 404 on my site. I looked through the 404 server logs to see what I might be able to help with. I found lots of bogus requests such as wpAdmin and other admin URLs (people trying to break into my server), and also lots of what seemed to be buggy crawlers. But I also found many URLs that seem to come from real humans. These seem to be either from copy/paste or forums automatically linkifying URLs:

The last one looks like a Markdown typo. There are also some that look like escaping/quoting errors:

All of these seem to have an unwanted suffix. I decided to implement a suggestion on the 404 page. I looked for a prefix of the non-matching URL that matched a valid URL. I picked the longest match:

// urlsReachableFromHomePage is the list of reachable pages computed earlier
const request = window.location.pathname;
let bestUrl = "";
for (const url of urlsReachableFromHomePage) {
    if (url.length > bestUrl.length
        && request.slice(0, url.length) === url) {
        bestUrl = url;
    }
}
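
The same longest-prefix rule can be sketched in Python, which makes it easy to check against sample requests before shipping. The `suggest_url` name, the page list, and the broken URLs below are all made up for illustration:

```python
def suggest_url(request_path, valid_urls):
    """Return the longest valid URL that is a prefix of request_path, or "" if none."""
    best = ""
    for url in valid_urls:
        if len(url) > len(best) and request_path.startswith(url):
            best = url
    return best

# Hypothetical list of pages reachable from the home page
valid = ["/index.html", "/articles/", "/articles/finished.html"]

# Trailing punctuation picked up by copy/paste or auto-linkifying
print(suggest_url("/articles/finished.html).", valid))
# Leftover escaped quote characters
print(suggest_url("/articles/finished.html%22%3E", valid))
# No exact page, so fall back to the enclosing directory
print(suggest_url("/articles/missing.html", valid))
# Nothing matches, so no suggestion is made
print(suggest_url("/wp-admin/", valid))
```

Because the home page is stored as `/index.html` rather than `/`, a request that shares no directory with any valid page gets no suggestion at all. That seems preferable to suggesting the home page for every bogus URL.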

You can try it out by clicking on the broken links above.

This was a relatively low-priority project, but so satisfying.
