I love listening to old episodes of In Our Time.
There are enough old episodes that, when I want to learn about a topic, this show is where I turn first. But finding episodes from the archive is hard. Hence this very unofficial site.
Current status: let’s say v0.2? There’s barely any design yet. It’s likely there are still bugs.
How this site was built
The canonical source of everything here is the official In Our Time website by the BBC.
The pages are spidered, retrieved, and stored locally, using Requests-HTML, a standard Python web-scraping library.
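To give a flavour of the spidering step, here's a minimal sketch using Python's built-in html.parser rather than Requests-HTML, with an assumed link pattern for episode pages (the real crawler and URL rules aren't shown here):

```python
from html.parser import HTMLParser

class EpisodeLinkParser(HTMLParser):
    """Collect links to episode pages from an archive listing page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # Hypothetical pattern: episode pages live under /programmes/
            if "/programmes/" in href:
                self.links.append(href)

parser = EpisodeLinkParser()
parser.feed('<ul><li><a href="/programmes/b006qykl">Episode</a></li>'
            '<li><a href="/about">About</a></li></ul>')
# parser.links == ["/programmes/b006qykl"]
```

Each collected link is then fetched and saved to disk, so later stages can re-run without hitting the site again.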
(Yes I know about Wikipedia’s hyperlinked list of In Our Time episodes but I wanted to do something with more automation, and a greater focus on findability.)
In Our Time has been broadcast for 25 years so while there is some consistency to the way the show notes are structured, there is also a lot of variation.
So to extract the data, I use GPT-3 by OpenAI. The HTML is minimally simplified, then GPT-3 is prompted to extract:
- the episode description
- guests and their affiliations
- and the reading list.
GPT-3 is asked to respond with valid JSON (a machine-readable data format), which it mostly does, although the output often needs fixing.
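A common failure is a trailing comma, or extra prose wrapped around the JSON. Here's a minimal sketch of that kind of fix-up, with the specific repairs invented for illustration (the site's actual clean-up rules aren't listed here):

```python
import json
import re

def coerce_json(text):
    """Best-effort repair of almost-valid JSON returned by a language model."""
    # Keep only the outermost {...}, in case the model added commentary
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    snippet = text[start:end + 1]
    # Remove trailing commas before closing brackets and braces
    snippet = re.sub(r",\s*([}\]])", r"\1", snippet)
    return json.loads(snippet)

fixed = coerce_json('Sure! {"guests": ["Ada Lovelace",], "reading": [],}')
# fixed == {"guests": ["Ada Lovelace"], "reading": []}
```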
GPT-3 is also used to:
- give each episode a Dewey Decimal classification (a library code) – its guesses turn out to be pretty good
- calculate similar episodes by converting each show description to an embedding vector, and finding the nearest neighbours using cosine distance.
Converting the scraped pages into machine-readable JSON doesn’t strictly require an AI… but using one is considerably more straightforward than writing lots of fiddly code to do the equivalent job. Classifying episodes and finding related episodes, on the other hand, are ideal uses of a large language model; both are surprisingly reliable.
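The related-episodes step boils down to cosine distance between embedding vectors. Here's a minimal sketch with toy three-dimensional vectors (the real site uses OpenAI embeddings, which have far more dimensions):

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def nearest(query, catalogue):
    """Return episode ids sorted from most to least similar to the query."""
    return sorted(catalogue, key=lambda ep: cosine_distance(query, catalogue[ep]))

catalogue = {
    "ep1": [1.0, 0.0, 0.1],
    "ep2": [0.9, 0.1, 0.0],
    "ep3": [0.0, 1.0, 0.0],
}
ranked = nearest([1.0, 0.0, 0.0], catalogue)
# ranked == ["ep1", "ep2", "ep3"]
```

The top few results for each episode are stored and shown as "similar episodes".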
There is a reading list for many episodes. Books are checked against the Google Books API, and a link to the Google Books website is provided when a matching title can be found. Approximately 88% of books are matched this way. Thanks to Tom Critchlow for early analysis in finding the best approach to validating book data.
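The matching step queries the Google Books volumes endpoint. Here's a sketch of building the search URL and loosely comparing titles; the normalisation rule is an assumption for illustration, not necessarily what the site does:

```python
import re
from urllib.parse import urlencode

def books_search_url(title, author):
    """Build a Google Books API volumes query for a title/author pair."""
    q = f'intitle:"{title}" inauthor:"{author}"'
    return "https://www.googleapis.com/books/v1/volumes?" + urlencode({"q": q})

def titles_match(a, b):
    """Loose comparison: lowercase and strip punctuation before comparing."""
    norm = lambda s: re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()
    return norm(a) == norm(b)

url = books_search_url("The Odyssey", "Homer")
# titles_match("The Odyssey", "The Odyssey!") is True
```

A loose comparison like this is what lifts the match rate: scraped titles often differ from Google's records only in punctuation or capitalisation.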
Interim data is stored in SQLite (shout out to Datasette for making this easy to explore while I’ve been developing).
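The interim store needs nothing beyond Python's built-in sqlite3 module. A minimal sketch, with a schema invented for illustration (the site's real tables aren't shown here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real pipeline writes to a file on disk
conn.execute("""
    CREATE TABLE episodes (
        id TEXT PRIMARY KEY,
        title TEXT,
        dewey TEXT,
        description TEXT
    )
""")
conn.execute(
    "INSERT INTO episodes VALUES (?, ?, ?, ?)",
    ("b006qykl", "The Odyssey", "883", "Homer's epic poem of homecoming."),
)
row = conn.execute(
    "SELECT title FROM episodes WHERE id = ?", ("b006qykl",)
).fetchone()
# row == ("The Odyssey",)
```

A single SQLite file also drops straight into Datasette, which is what makes the data so easy to poke at during development.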
Finally static pages for a website are written out, and this public site is built using GitHub Pages.
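The write-out step is plain templating: fill a page template per episode and save the HTML for GitHub Pages to serve. A sketch using Python's string.Template, with a made-up template (the site's real templates aren't shown):

```python
from string import Template

# Hypothetical page template for illustration
PAGE = Template("<html><body><h1>$title</h1><p>$description</p></body></html>")

def render_episode(title, description):
    """Render one static episode page as an HTML string."""
    return PAGE.substitute(title=title, description=description)

html = render_episode("The Odyssey", "Homer's epic poem of homecoming.")
# html starts with "<html><body><h1>The Odyssey</h1>"
```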
There’s a deeper dive into what it feels like to code with AI on my blog here.
- It is likely that this site will lag behind the official site in showing new episodes. This is because I haven’t built the automation yet.
- Because of the use of GPT-3 and web scraping, there may be omissions – or even AI hallucinations! – in the data presented here.
Please let me know about any errors and I will endeavour to fix them.
Who made this?
I’m Matt Webb. Find out about me here. (Email address etc at that link too.) Some trivia: I was involved in setting up the In Our Time podcast, way back in 2004. It was the first podcast by the BBC, and the BBC was the first national broadcaster to do any podcasting at all. Still a fan.