How We Built pythonjobs.github.io in a Week

Modern open source tools are amazing

A while ago, some colleagues (@salimfhadley and @lordmauve) mentioned an idea that they’d had:

How hard would it be to build a job site that used git pull requests for new job submissions?

At the time, the python.org jobs board had been down for maintenance for nearly a year, and there wasn’t any obvious free, moderated place to list or view job opportunities. Coincidentally, at that time, our employer was hiring, and we were all involved with that process.

Their idea was to have a static site, using GitHub pull requests to manage submissions.

This idea is awesome for a number of reasons:

Lots of tools and services exist to make managing change requests from strangers easy
The entire process is public, transparent, and well understood
Having a slightly more technical submissions process naturally attracts more technically aware submitters
It’s all free to run! (apart from some stock image costs, and our personal time)

Over about the next week, we built https://pythonjobs.github.io

Screenshot of the pythonjobs.github.io front page

And it was wildly successful, inasmuch as:

Despite no advertising or promotion (except right at the start), we receive several submissions a week, and have had a steady stream of traffic for several years.
It was a factor in the official python.org jobs board coming online a few weeks after we launched 😎.

The Static Site

The site is built using hyde, a Python alternative to Jekyll (hyde’s development progress has been slow, and its documentation is kinda sparse, so I would hesitate to use it for new projects, but it has suited our purposes very nicely).

Advertisers submit pull requests using the GitHub pull request UI: Each listing is a separate Markdown file (with a .html file extension):

With a YAML header, and Markdown contents:
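As a rough illustration of that layout, the sketch below splits a listing into its header and body using only the standard library. The field names and sample values are assumptions made up for the example, not the site’s actual schema, and the naive `key: value` parsing is a stand-in for a real YAML parser.

```python
# Hypothetical listing; the field names are illustrative assumptions.
SAMPLE_LISTING = """\
---
title: Senior Python Developer
company: Example Corp
location: London, UK
---
We are looking for a **Python developer** to join our team.
"""


def split_listing(text):
    """Split a listing into (header dict, Markdown body).

    Assumes the header is delimited by '---' lines and holds only flat
    'key: value' pairs; a real build would use a YAML parser instead.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("listing must start with a '---' header line")
    try:
        end = lines.index("---", 1)
    except ValueError:
        raise ValueError("header is not closed with a '---' line")
    header = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        header[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return header, body


header, body = split_listing(SAMPLE_LISTING)
```

Here `header` ends up as a plain dict of the metadata fields, and `body` is the Markdown text that gets rendered as the advert.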

Once a change is merged, the site is built using a Travis job, and automatically deployed.

Moderation

Reviewing large YAML/Markdown files manually is time-consuming and error-prone, so we built a set of tests that run on each pull request to ensure that the basics check out (we still check each submission manually, but that review is about the textual content rather than the syntax).

This is implemented as a hyde plugin that runs on each node in the jobs directory, and checks many things. For example, that a contact name has been provided:

Excerpt from our job testing script

If any of these don’t pass then, as with any CI test, the Pull Request is marked as failed, stopping us from blindly approving it.
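The real checks live in our hyde plugin; as a standalone sketch of the idea, a contact-name check could look something like this. The `ListingError` class, the function names, and the assumption that contact details sit under a `contact` key are all hypothetical, chosen only for illustration.

```python
class ListingError(Exception):
    """Raised when a listing fails a basic check (hypothetical name)."""


def check_contact_name(meta):
    """Fail if the listing metadata has no usable contact name.

    `meta` is assumed to be the parsed YAML header as a dict, with the
    contact details nested under a 'contact' key (an assumed layout).
    """
    contact = meta.get("contact") or {}
    name = contact.get("name") if isinstance(contact, dict) else None
    if not isinstance(name, str) or not name.strip():
        raise ListingError("Please add a contact name to the listing header")


def run_checks(meta, checks):
    """Run every check and collect the messages of those that fail."""
    errors = []
    for check in checks:
        try:
            check(meta)
        except ListingError as exc:
            errors.append(str(exc))
    return errors


good = {"title": "Python Developer", "contact": {"name": "Jo Bloggs"}}
bad = {"title": "Python Developer", "contact": {"name": "  "}}
good_errors = run_checks(good, [check_contact_name])
bad_errors = run_checks(bad, [check_contact_name])
```

Collecting messages rather than stopping at the first failure means a submitter can fix everything in one pass.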

One problem we had was that the actual error that caused the failure can be quite hidden. The reason for failure is shown in the Travis build logs, as a traceback, with a friendly message, but...

Most of the people sending us adverts are recruiters, or non-technical hiring managers. Often this is their first dealing with GitHub. To get to the stage of creating a PR, they’ve already had to get to grips with a lot of unfamiliar concepts (git/GitHub, Markdown, YAML, pull requests, etc.). Asking them to realise (with no clear indications) that they have to click through to the build failure logs, parse the output until they find the traceback, and then extract the relevant information is a step (or 10!) too far.

This led me to recently add a tool that comments directly on the PR (similar to https://www.travisbuddy.com/). The comment lays out why the tests failed in a friendly, easily understood manner:
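A minimal sketch of how such a comment could be assembled is shown below. The wording, the function names, and the `example-org/jobs` repository and PR number are assumptions for illustration. The only external detail relied on is GitHub’s REST endpoint for creating an issue comment, which is also how pull requests are commented on. The sketch builds the request but does not send it.

```python
import json

GITHUB_API = "https://api.github.com"


def format_failure_comment(errors):
    """Turn a list of check failures into a Markdown comment body."""
    lines = [
        "Thanks for your submission! The automated checks found some "
        "problems with the listing:",
        "",
    ]
    lines += [f"- {error}" for error in errors]
    lines += ["", "Once these are fixed, the checks will run again."]
    return "\n".join(lines)


def build_comment_request(repo, pr_number, body):
    """Return (url, JSON payload) for posting a comment on a pull request.

    Pull request comments go through the issues endpoint:
    POST /repos/{owner}/{repo}/issues/{number}/comments
    """
    url = f"{GITHUB_API}/repos/{repo}/issues/{pr_number}/comments"
    return url, json.dumps({"body": body})


# Hypothetical repository name, PR number, and error messages.
comment = format_failure_comment(
    ["Missing contact name", "Location is empty"]
)
url, payload = build_comment_request("example-org/jobs", 123, comment)
```

Sending the request needs an authenticated HTTP POST, which is where the secrets problem described under Infrastructure comes in.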

When the tests pass, there is another script that grabs a screenshot of the listing using PhantomJS and Selenium, and posts that as a comment, to make reviewing easier.

Infrastructure

We use three git repos to run pythonjobs. This seems more complicated than needed, but is designed to keep the submissions process simple:

jobs: Just holds the job listing files, and some Travis-related scripts. This is the only repo that submitters have to interact with to create a listing.
template: Has all the hyde-related code for generating the full static site, graphics resources, build scripts, etc.
pythonjobs.github.io: Contains the generated static site, directly hosted by GitHub Pages.

The build script pulls together and merges the repos as needed:

When the PR is merged to master, the same CI job commits the result to the output repo:

GitHub handles all the rest for us, and seconds later, the changes are visible online.

We push the changes to the live repo by using a Travis secure environment variable to hold a GitHub key. Unfortunately, for good reasons, you can’t use secure environment variables in pull requests from non-members (they could change the Travis script to make it print out the secrets).

To get around this, some simple AWS Lambda functions handle generating the comments when a PR is tested.

Search

Despite being a static site, with no server code, we have full-text searching across all job listings:

This is implemented very similarly to Sphinx client-side search (I’ve since learned). During build time, a script looks at each job listing, and builds a compact JSON-based prefix tree from each word in the advert.

Words are stemmed, using the Python stemming library, and scored by where they appear (words in the title score higher than words in the body).

This prefix tree is then output as a JSON file which is read by the browser, and used to provide real-time search capability. The JSON first contains an array of each job, followed by the prefix tree as a set of nested maps. Each map leaf gives an index into the job name array, and a corresponding score:

In this example, the word jack appears in Job2, with a score of 1, while java appears twice (score 2) in Job0, and once in Job1.
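To make the structure concrete, here is a small sketch that builds an index of this shape and queries it. It is not our build script: stemming is left out, and the title weight, the `$` key used to mark the end of a word, and the sample jobs are assumptions for illustration.

```python
import json
import re

TITLE_WEIGHT = 3  # assumed; the only requirement is that titles score higher
SCORES_KEY = "$"  # assumed marker key for "a word ends here"


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def add_word(tree, word, job_index, score):
    """Walk or create one nested map per letter, then record the score."""
    node = tree
    for letter in word:
        node = node.setdefault(letter, {})
    scores = node.setdefault(SCORES_KEY, {})
    key = str(job_index)  # JSON object keys must be strings
    scores[key] = scores.get(key, 0) + score


def build_index(jobs):
    """jobs is a list of (name, title, body); returns [names, tree]."""
    names, tree = [], {}
    for job_index, (name, title, body) in enumerate(jobs):
        names.append(name)
        for word in tokenize(title):
            add_word(tree, word, job_index, TITLE_WEIGHT)
        for word in tokenize(body):
            add_word(tree, word, job_index, 1)
    return [names, tree]


def lookup(index, prefix):
    """Return {job name: score} summed over all words starting with prefix."""
    names, node = index
    for letter in prefix.lower():
        node = node.get(letter)
        if node is None:
            return {}
    totals, stack = {}, [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == SCORES_KEY:
                for job_index, score in child.items():
                    name = names[int(job_index)]
                    totals[name] = totals.get(name, 0) + score
            else:
                stack.append(child)
    return totals


JOBS = [
    ("Job0", "Backend engineer", "java and more java"),
    ("Job1", "Data analyst", "some java"),
    ("Job2", "Game developer", "jack of all trades"),
]
index = build_index(JOBS)
index_json = json.dumps(index)  # what the browser would download
```

With these sample jobs, `lookup(index, "java")` gives `{"Job0": 2, "Job1": 1}`, and the shorter prefix `"ja"` also picks up Job2 through jack. Because matching walks the tree one letter at a time, results can update on every keystroke.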

Highlighting is done client-side with a simple regex-based search of the job listing.

Locations

We ask submitters to give us a text-based location (Washington, USA), and don’t care too much about how precise that location is (some companies don’t like publishing exact locations publicly). From that, we generate a Google Maps view of all active job listings:

So how do we go from ‘London Bridge, London’ to a marker on the map?

The Google Geocoding API can translate location names into the latitude-longitude coordinates required by Google Maps, so the build script queries this for each job advert to find its coordinates.

Because this is done in public, we can’t use a secret API key to query these locations, and so Google quickly throttles these requests. To avoid this, the jobcheck script uses an N+1 cache, where the results of the previous queries are stored in the site’s rendered output and reused, so that each build only queries new locations (almost always just 1 location).
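The caching idea can be sketched as follows. This is a simplified illustration rather than the jobcheck script itself: the function name is hypothetical, the geocoder is passed in as a plain callable so that no real API call is shown, and the coordinates are placeholders.

```python
def resolve_locations(locations, cache, geocode):
    """Return {location: (lat, lng)} for the current build.

    `cache` holds the results saved by the previous build, and `geocode`
    is any callable that maps a location string to (lat, lng) or None.
    Only locations missing from the cache are passed to `geocode`.
    """
    resolved = {}
    for location in locations:
        key = location.strip().lower()
        if key in resolved:
            continue  # already handled earlier in this build
        if key in cache:
            resolved[key] = cache[key]
            continue
        coords = geocode(location)
        if coords is not None:
            resolved[key] = coords
    return resolved


calls = []


def fake_geocode(location):
    """Stand-in for the real geocoding request; records each lookup."""
    calls.append(location)
    return (51.5, -0.09)  # placeholder coordinates, not a real result


previous = {"washington, usa": (38.9, -77.0)}  # placeholder cached entry
current = resolve_locations(
    ["Washington, USA", "London Bridge, London", "London Bridge, London"],
    previous,
    fake_geocode,
)
```

In this run only the one new location reaches the geocoder. The returned map is what would be written into the rendered output, ready to serve as the cache for the next build.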

This data is then turned into inline JS data in the maps.html page, so that we can populate the map on page load with markers.
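As a sketch of that last step, the marker data can be serialised to JSON and wrapped in a script tag. The variable name `JOB_MARKERS` and the marker fields are assumptions for illustration, and the values are placeholders.

```python
import json


def render_marker_script(markers, variable="JOB_MARKERS"):
    """Render marker data as an inline <script> block.

    '</' is escaped so that a job title containing '</script>' cannot
    close the tag early; '<\\/' still parses as '</' in JS and JSON.
    """
    data = json.dumps(markers, sort_keys=True).replace("</", "<\\/")
    return f"<script>var {variable} = {data};</script>"


# Placeholder marker; real entries would come from the geocoding step.
markers = [{"title": "Python Developer", "lat": 51.5, "lng": -0.09}]
snippet = render_marker_script(markers)
```

The page’s own JavaScript can then loop over the variable on load and add one marker per entry.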

GitHub

GitHub is, and should be, primarily a code hosting site. It does this very well. GitHub Pages is designed to allow some documentation and lightweight hosting to support code projects.

Pythonjobs isn’t either of these things, so I asked GitHub in advance if this would be OK for us to do. They were very supportive of the idea, and their friendly replies helped make this site a joy to set up.

As part of being open, one of Sal’s ideas was to enforce that all submissions name the company being hired for. While this causes problems for some recruiters, it has the benefit of keeping everything simple: there’s little in the way of vague language or concerns about locations.

Summary

The fact that the three of us could set a site like this up for free, in about a week, and run it for several years with minimal involvement says a lot about the quality and availability of free tooling for open source projects today.