My podcast workflow (HPR Show 2211)

Dave Morriss



Overview

I have been listening to podcasts for many years. I started in 2005, when I bought my first MP3 player.

Various podcast downloaders (or podcatchers) have existed over this time, some of which I have tried. Now I use a script based on Bashpodder, which I have rewritten to meet my needs. I also use a database to hold details of the feeds I subscribe to, which episodes have been downloaded, what is on a player waiting to be listened to, and what can be deleted. I have written many scripts (in Bash, Perl and Python) to manage all of this, and I will describe the overall workflow in this episode without going into too much detail.

I was prompted to put together this show by folky’s HPR episode 1992, “How I’m handling my podcast-subscriptions and -listening”, released on 2016-03-22. Thanks to him for a very interesting episode.

Note: I’m embarrassed to say that I started this episode in April 2016 and somehow forgot all about it until January 2017!

Podcast Feeds

A podcast feed is defined by an XML file, using one of two main formats: RSS and Atom. Both formats basically consist of a list of structured items, each of which can contain a link to a multimedia file or “enclosure”. It’s the enclosure that makes it a podcast as opposed to other sorts of feed - see the Wikipedia article on the subject.

The way in which the feed is intended to be used is that when new material is released on the site, the feed is updated to reflect the change. Then podcatchers can monitor the feed for changes and take action when an update is detected. The relevant action with a podcast feed is that the enclosures in the feed are downloaded, and the podcatcher maintains a local list of what has already been downloaded.

The structure of an RSS or Atom feed allows for a unique identifier to be associated with each enclosure. This is intended to act as a label for that enclosure, making it easier to avoid duplicates.
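
For illustration, a minimal RSS item carrying an enclosure and an identifier might look something like this (the title, identifier and URL are invented for the example):

<item>
  <title>Episode 42: An Example</title>
  <!-- the unique identifier for this item -->
  <guid isPermaLink="false">example-episode-42</guid>
  <!-- the enclosure is what makes this a podcast feed -->
  <enclosure url="http://podcast.example.com/episode42.mp3"
             length="12345678" type="audio/mpeg"/>
</item>

A podcatcher that remembers the <guid> values it has seen can tell a genuinely new episode from one it has already downloaded, even if the URL changes.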

Workflow

Bashpodder

I use a rewritten version of Bashpodder to download my podcasts. I have modified the original design in two main ways:

  1. I enhanced the XSLT file (parse_enclosure.xsl) used for parsing the feed (using xsltproc¹) so that it can handle feeds using Atom as well as RSS. The original only handled RSS.
  2. I made it keep a file of ID strings from the feeds to help determine which episodes have already been downloaded. The original only kept the episode URLs, which was fine at the time but is not enough in these days of idiosyncratic feeds. My XSLT file is called parse_id.xsl.
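
To give an idea of how these stylesheets fit in, here is a simplified sketch of a Bashpodder-style download loop. This is not my actual script, and the file names feeds.txt and downloaded.ids are invented for the example:

# Simplified sketch of a Bashpodder-style download loop
touch downloaded.ids              # make sure the ID list exists
while read -r feed; do
    wget -q -O feed.xml "$feed"
    # Extract enclosure URLs and episode IDs, paired line by line
    paste <(xsltproc parse_enclosure.xsl feed.xml) \
          <(xsltproc parse_id.xsl feed.xml) |
    while IFS=$'\t' read -r url id; do
        # Download only episodes whose ID we have not seen before
        if ! grep -qxF "$id" downloaded.ids; then
            wget -q "$url" && echo "$id" >> downloaded.ids
        fi
    done
done < feeds.txt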

My Bashpodder clone cannot deal with feeds where the enclosure URL does not show the actual download URL. I am working on a solution to this but haven’t got a good one yet. Charles in NJ mentions a fix for a similar (or maybe the same) problem in his show 1935, “Quick Bashpodder Fix”.

I run this script on one of my Raspberry Pis once a day during the night. This was originally done because I had a slow ADSL connection which was quite heavily used by my kids during the day. The Pi in question places the downloads in a directory which I export with NFS and mount on other machines.
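
The scheduling and sharing parts are just standard cron and NFS. Something along these lines would do it (the time, paths and network address are made up for the example):

# crontab entry on the Pi: run the podcatcher at 03:15 every night
15 3 * * * /home/dave/bin/bashpodder >> /home/dave/logs/bashpodder.log 2>&1

# /etc/exports entry making the download area available read-only over NFS
/data/podcasts 192.168.0.0/24(ro,sync,no_subtree_check)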

Database

As I have already said, I use a database to hold the details of my feeds and downloads. This came about for several reasons:

  • I’m interested in databases and wanted to learn how to use them.
  • I chose PostgreSQL because it is very feature-rich and flexible, and at the time I was using it at work.
  • I wanted to be able to generate all sorts of reports and perform all kinds of actions based on the contents of the database.

The database runs on my workstation rather than on the server.

As far as design is concerned, I “bolted on” the database to the existing Bashpodder model, where podcasts are downloaded and stored in a directory according to the date. Playlists were generated by the original Bashpodder for each day’s episodes, and I continued to do this until fairly recently.

Really, when using a database in this way it would be better to integrate the podcatcher with it. However, I didn’t do this because of the way the system evolved.

As a result I have scripts which I run each morning whose job it is to look at the night’s downloads and update the database with their details. The long-term plan is to write a whole new system from scratch which integrates everything, but I don’t see this happening for a while.

In my database I have the following main tables:

feeds

Contains the feed details, like its title and URL. It also classifies each feed into a group, like science or documentary.

episodes

Contains the items within the feeds with information like the title, the URL of the media, where the downloaded episode is and the feed the episode belongs to.

groups

This table contains the groups I have defined, like comedy and music. This is just my personal classification.

players

The database has a list of all the players I own. I did a show about this in 2014.

playlists

I make my own playlists for each player, and these are stored in the database (and on the player).
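
To give a flavour of the design, here is a much-simplified sketch of what the first two tables might look like in PostgreSQL. The column names are illustrative, not my actual schema:

-- Illustrative only: a much-simplified version of the schema
CREATE TABLE feeds (
    id         serial PRIMARY KEY,
    title      text NOT NULL,
    url        text NOT NULL UNIQUE,
    feed_group text                 -- e.g. 'science', 'documentary'
);

CREATE TABLE episodes (
    id        serial PRIMARY KEY,
    feed_id   integer NOT NULL REFERENCES feeds(id),
    title     text,
    media_url text,                 -- URL of the enclosure
    filepath  text,                 -- where the downloaded file lives
    status    text                  -- e.g. 'downloaded', 'on_player'
);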

Audio tags

Many podcasters generate excellent metadata for their episodes. All of the players I use on a regular basis run Rockbox, which can display the metadata tags; this helps me to work out what I’m listening to and what’s coming next. I also like to look at tags when I’m dealing with podcast episodes on my workstation, so I reckon having good-quality metadata is important.

Because a number of podcast episodes have poor or even non-existent tags I wanted to write tools to improve them. I originally wrote a tool called fix_tags, which has been used on the HPR server for several years, and is available on GitHub. I also wrote a tag management tool for daily use.

The daily tool is called tag_manager and it scans all of the podcast episodes I currently have on disk and applies tag rules to them. Rules are things like: “if there is no title tag, add one from the title field of the item in the feed”. I also do things like add a prefix to the title in some cases, such as adding ‘HPR’ to all HPR episodes so it’s easier to identify them individually in a list.

The rules are written in a format which is really ugly, but it works. I have plans to develop my own rule “language” at some point.

Here’s the rule for the BBC “Elements” podcast:

<rule "Elements">
    genre = $default_genre
    year = "\".(defined(\$ep_year) ? \$ep_year : \$fileyear).\""
    album = "Elements"
    comment = "\".clean_string(\$comment).\""
    # If no title, use the enclosure title
    <regex "^\s*$">
        match = title
        title = "\$ep_title"
    </regex>
    # If no comment, use the enclosure description
    <regex "^\s*$">
        match = comment
        comment = "\$ep_description"
    </regex>
    # Add 'Elements:' to the front of the title if it's not there
    <regex "^(?!Elements: )(\S.+)$">
        match = title
        title = "Elements: \$1"
    </regex>
</rule>

Writing episodes to a player

I use tools I have written to copy podcast episodes to whichever player I want to use. Normally I listen to everything on a given player, then refill it after recharging it. I usually write podcast episodes in groups, so I might load a particular player with groups like business, comedy, documentary, environment, and history.

As episodes are written their status is updated in the database and a playlist is created. The playlist is held in the database but is also written to a file on the player. Rockbox has the ability to work from pre-defined playlist files, and this is the way I organise my listening on a given player.
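
The playlist files themselves are just plain .m3u lists of file paths, which Rockbox can play directly. A minimal sketch of writing one in Bash (the player mount point and directory layout are invented for the example):

# Hypothetical sketch: build an .m3u playlist for the episodes
# just copied to the player, which is mounted at /media/SANSA
playlist=/media/SANSA/Playlists/comedy.m3u
for f in /media/SANSA/PODCASTS/comedy/*.mp3; do
    # Rockbox expects paths relative to the player's root
    echo "${f#/media/SANSA}"
done > "$playlist"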

Deleting what I’ve listened to

As I listen to an episode I run a script on my workstation to mark that particular episode as “being listened to”, and when I have finished a given episode I run another script to delete it. The deletion script simply looks for episodes in the “being listened to” state and asks which of these to delete.

This way I make sure that episodes are deleted as soon as possible after listening to them. I never explicitly delete episodes from the players; I simply overwrite them when I next load a particular player.
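
As a sketch of the idea (the database name, table and status values are invented here, and my real script does more checking), the deletion pass could be as simple as:

# Hypothetical sketch of the deletion pass: find episodes marked as
# 'listening' in the database, confirm, delete, and update their status
psql -At podcasts -c \
    "SELECT id, filepath FROM episodes WHERE status = 'listening'" |
while IFS='|' read -r id path; do
    read -rp "Delete $path? [y/N] " answer < /dev/tty
    if [[ $answer == [Yy] ]]; then
        rm -f "$path"
        psql -q podcasts -c \
            "UPDATE episodes SET status = 'deleted' WHERE id = $id"
    fi
done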

Other tools

A lot of other tools have been developed for viewing the status of the system, fixing problems and so forth. Some of the key tools are:

  • A feed viewer: it summarises the feed and any downloaded episodes. It can generate reports in a variety of formats. I used it to generate the notes for two HPR shows (1516, 1518) I did on the podcast feeds I’m subscribed to.
  • A tool for subscribing to a new feed; this is the point at which the feed is assigned to a group and where it is decided which episodes are to be initially downloaded.
  • A tool for cancelling a subscription: such feeds are held in an archive with notes about why they were cancelled - for the sake of posterity. Also, I have been known to re-subscribe to a feed I have cancelled, so the subscription script checks the archive and asks if I really want to do this, reminding me why I said I wanted to cancel last time!

Conclusions

I have been fiddling about with this way of doing things for a long time. I seem to have started in 2011 and since that time have kept a journal associated with the project. This currently contains over 8000 lines of notes about what I have been doing, problems, solutions, etc.

What’s good about this scheme?

  • It’s pretty much all mine! I was inspired originally by Bashpodder, but the current script is a complete rewrite.
  • It works, and does pretty much all I want it to do and now needs very little effort to run and maintain.
  • Along the way I have learned tons of stuff. For example:
    • I understand XML and XSLT better
    • I understand RSS and Atom feeds better
    • I know a lot more about Bash scripting, though I’m still learning!
    • I have learned a fair bit more about PostgreSQL and databases in general
    • I understand a fair bit more about audio tags and the TagLib library that I use to manipulate them (both in Perl and Python)
  • It does have what I think are a lot of good ideas about how to deal with podcast feeds and episodes, though these are often implemented badly in my scripts.

What’s bad?

  • It’s clunky and badly designed. It’s the result of hacks layered on hacks. It’s really an alpha version of what I want to implement and should be junked and completely rewritten.
  • It is not sufficiently resilient to feed issues and bad practices by feed owners. For example, the BBC have this strange habit of releasing an episode then re-releasing it a while later for reasons unknown. They make it difficult to recognise the re-release for what it is, so I sometimes get duplicates. Other podcatchers deal with this situation better than my system does.
  • It’s not easy to extend. Consider, for example, the current trend of “hiding” podcast episodes behind strange URLs which have to be interrogated through layers of redirection to find the actual name of the file containing the episode. Adding an algorithm to handle this is quite challenging, due to the design.
  • It’s completely incapable of being shared. I’d have liked to offer my efforts to the world, but in its current incarnation it’s absolutely not something anyone else would want.

  1. I had forgotten the name of the parsing tool xsltproc when recording the audio, so I added it in the notes.