Should I Watch This? is a web application that gives a recommendation about a movie or show. For this it gets the ratings of Rotten Tomatoes, IMDb and metacritic and generates a consolidated recommendation based on those third party scores. To achieve that it needs a database of movies in the first place.
The initial version of Should I Watch This? used the OMDb API as the source for movie data. The OMDb API is unfortunately proprietary, which is not a dependency I wanted to keep in the long run. The movie data is at the core of the application. Furthermore, OMDb API is an API. This means that there’s network latency and potential failure for every data request. For an autocomplete search for example that’s not really suitable, both in terms of speed and in terms of the API rate limits.
After combing the internet for an alternative I came across Wikidata:
Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.
The data is licensed under CC0 1.0 Universal, which means anyone can use the data freely.
There are different ways to access Wikidata’s data, including different APIs. To avoid rate limits and having full flexibility of how to transform the data to my desired data structured I went with the JSON dump. The downside of this is that the file is around 93GB in size, and I’d need a pipeline to transform the data.
The whole pipeline takes around 12 hours.
As already mentioned this is a 93GB download. This is the slowest part and can’t really be optimized. We just have to wait for the download to be complete.
Duration: 6.8h
The compressed dump is very large and contains everything on Wikidata. I’m only interested in “film” and “humans” entries. The “human” entries are needed for the actors and directors. Instead of decompressing the whole dump and use up unnecessary disk space the dump is prefiltered on decompression using ripgrep. Another reason for preprocessing the dump is that it’ll need two passes to extract the “film” and “human” entries. Going through the full dump would take considerably longer. Since the dump is made up of one json entry per line it’s easy to reduce the dump by line.
The following is the reduced command that creates the reduced dump. The pattern passed to rg is a lot longer:
lbzcat latest-all.json.bz2 | \
rg '(\"Q5\"|\"Q100707163\"|\"Q101071244\"|\"Q101716172\"|\"Q101973724\"|...)' \
> latest-all-reduced.json
The pattern contains the “human” identifier and all instances and sub instances of film and series. By querying the SPARQL query service we first obtain a list of all relevant instance identifiers.
SELECT DISTINCT ?film ?filmLabel WHERE {
{
?film wdt:P279* wd:Q11424 .
}
UNION
{
?film wdt:P279* wd:Q5398426 .
}
}
Since this is only a grep based filtering, it’s not exact. The next steps will do the proper filtering.
Duration: 4.2h
These two separate steps both use wikibase-dump-filter to extract only the relevant entries.
The following is the resulting command for the “film” entries:
cat latest-all-reduced.json | \
parallel --pipe --block 100M --line-buffer "npx wikibase-dump-filter --claim tmp/wikibase-dump-filter-movies-claim" \
> #{output_file}
wikibase-dump-filter-movies-claim contains a list of the film
instance and all its sub instances identifiers. This is the main mechanism to
filter the Wikidata entries.The human file is very large, and we need the humans as a base dataset for each movie’s directors and actors. With jq we extract all humans and create a minimal hash data structure that we can keep in memory while importing the movie records (see 4. Import the movie records).
jq -c '[.id, .labels.mul.value, .labels.en.value, .labels["en-us"].value]'
Duration: 92min
Finally we import each line of the generated movie json, extract the relevant attributes and persist it as records in the PostgreSQL database. We also generate the TSV columns that are needed for search.
Duration: 7min
All in all the whole process takes around 12.5 hours. The result is around 530'000 movies and shows in our database.
Tools used in this pipeline: