How I prepared episode 2493: YouTube Subscriptions - update (HPR Show 2544)

Dave Morriss


Introduction

In show 2493 I listed a number of the YouTube channels I watch. Some of what I did to prepare the notes was to cut and paste information from YouTube pages, but the basic list itself was generated programmatically. I thought the process I used might be of interest to somebody so I am describing it here.

Components

I needed four components to achieve what I wanted:

  1. a list of my YouTube subscriptions in a form that can be parsed
  2. xmlstarlet, to extract the relevant details from that list
  3. Template Toolkit, to turn the extracted data into Markdown notes
  4. pandoc, to convert the Markdown into HTML

I will talk a little about the first three components in this episode in order to provide an overview.

YouTube subscription list

To find this go to the ‘Subscription Manager’ page of YouTube (https://www.youtube.com/subscription_manager) and select the ‘Manage Subscriptions’ tab. At the bottom of the page is an ‘Export’ option which generates OPML. By default this is written to a file called subscription_manager.

An OPML file is in XML format and is designed to be used by an application that processes RSS feeds, such as a podcatcher or a video manager. For me it is a convenient format to parse in order to extract the basic channel information. I could not find any other way of doing this apart from scraping the YouTube website. If you know better please let me know in a comment or by submitting a show of your own.

Using xmlstarlet

This is a tool designed to parse XML files from the command line. I run Debian Testing and was able to install it from the repository.
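
For anyone wanting to try it, installation from the distribution repositories should be straightforward; on a Debian-based system something like this would do it:

$ sudo apt install xmlstarlet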

There are other tools that could be used for parsing, but xmlstarlet is the Swiss Army knife for analysing and processing data of this sort. The tool deserves a show to itself, or even a short series. I know that Ken Fallon (who uses it a lot) has expressed a desire to go into detail about it at some point.

Here I am just going to describe how I decided to generate a simple CSV file from the OPML, and how I worked out an xmlstarlet command to do so.

Finding the structure of the XML

I copied the subscription_manager file to yt_subs.opml as a more meaningful name.

I ran the following command against this file to find out its structure:

$ xmlstarlet el -u yt_subs.opml
opml
opml/body
opml/body/outline
opml/body/outline/outline

It is possible to work this out by looking at the XML but it’s all squashed together and is difficult to read. It can be reformatted as follows:

$ xmllint --format yt_subs.opml | head -7
<?xml version="1.0"?>
<opml version="1.1">
  <body>
    <outline text="YouTube Subscriptions" title="YouTube Subscriptions">
      <outline text="John Heisz - I Build It" title="John Heisz - I Build It" type="rss" xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCjA8vRlL1c7BDixQRJ39-LQ"/>
      <outline text="MatterHackers" title="MatterHackers" type="rss" xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCDk3ScYL7OaeGbOPdDIqIlQ"/>
      <outline text="Alec Steele" title="Alec Steele" type="rss" xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCWizIdwZdmr43zfxlCktmNw"/>

The program xmllint is part of the libxml2-utils package on Debian, which also requires libxml2.
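
If xmllint is not already installed it can be added in the same way; on a Debian-based system something like this should work:

$ sudo apt install libxml2-utils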

I think the xmlstarlet output is easier to read and understand.

The XML contains attributes (such as the title) which you can ask xmlstarlet to report on:

$ xmlstarlet el -a yt_subs.opml | head -11
opml
opml/@version
opml/body
opml/body/outline
opml/body/outline/@text
opml/body/outline/@title
opml/body/outline/outline
opml/body/outline/outline/@text
opml/body/outline/outline/@title
opml/body/outline/outline/@type
opml/body/outline/outline/@xmlUrl
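
Once the structure and attribute names are known, xmlstarlet's sel sub-command can be used for quick sanity checks before building anything more elaborate. For example, a count of the channel entries (the second-level outline nodes) could be obtained like this:

$ xmlstarlet sel -t -v "count(/opml/body/outline/outline)" -n yt_subs.opml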

Extracting data from the XML

So, the xmlstarlet command I came up with (after some trial and error) was as follows. I have broken the long pipeline into lines by adding backslashes and newlines so it’s slightly more readable, and in this example I have just shown the first 5 lines it generated. In actuality I wrote the output to a file called yt_data.csv:

$ (echo 'title,feed,seen,skip'; \
> xmlstarlet sel -t -m "/opml/body/outline/outline" \
> -s A:T:- @title \
> -v "concat(@title,',',@xmlUrl,',0,0')" \
> -n yt_subs.opml) | head -5
title,feed,seen,skip
akiyuky,https://www.youtube.com/feeds/videos.xml?channel_id=UCCJJNQIhS15ypcHqDfEPNXg,0,0
Alain Vaillancourt,https://www.youtube.com/feeds/videos.xml?channel_id=UCCsdIja21VT7AKkbVI5y8bQ,0,0
Alec Steele,https://www.youtube.com/feeds/videos.xml?channel_id=UCWizIdwZdmr43zfxlCktmNw,0,0
Alex Eames,https://www.youtube.com/feeds/videos.xml?channel_id=UCEXoiRx_rwsMfcD0KjfiMHA,0,0

Here is a breakdown of what is being done here:

  1. The echo command and the xmlstarlet command are enclosed in parentheses, which causes Bash to run them both in a subshell. The echo command generates the column titles for the CSV, as we’ll see later. The output of the whole subshell is written as a single stream of lines, so the header and the data all go to the same place.

  2. The xmlstarlet command takes a sub-command, in this case sel, which causes it to “Select data or query XML document(s)” (quoted from the manual page):
    • -t defines a template
    • -m precedes the XPATH expression to match (as part of the template). The XPATH expression here is /opml/body/outline/outline which targets each XML node which contains the attributes we want.
    • -s A:T:- @title defines sorting, where A:T:- requests an ascending (A), textual (T) sort (the third field controls case ordering and is left at its default here), and @title is the XPATH expression to sort on
    • -v expression defines what is to be reported; in this case it is the @title and @xmlUrl attributes followed by two zeroes, all separated by commas, thereby making a line of CSV data
    • -n adds a newline after each match; the XML file to be read, yt_subs.opml, is given as the final argument
  3. The output of the entire subshell is piped into head -5, which returns the first 5 lines. In actual use the output is redirected to a file with > yt_data.csv

  4. The reason for making four columns will become clear later, but in summary it’s so that I can mark lines in particular ways. The ‘seen’ column is for marking the channels I spoke about in an earlier episode (2202) so I didn’t include them again in this one, and the ‘skip’ column is for channels I didn’t want to include for various reasons.
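
I set the ‘seen’ and ‘skip’ values by editing the CSV file before running the template (see below). Just as an illustration, the same effect could be achieved from the command line with sed; the channel name here is only an example:

$ sed -i '/^Alec Steele,/ s/,0,0$/,1,0/' yt_data.csv    # set 'seen' to 1 for an example channel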

Generating HTML with Template Toolkit

Template Toolkit is a template system. There are many of these for different programming languages and applications. I have been using this one for over 15 years and am very happy with its features and capabilities.

I currently use it when generating show notes for my HPR contributions, and it’s used in many of the scripts I use to perform tasks as an HPR Admin.

Installing Template Toolkit

The Template Toolkit (TT) is written in Perl, so Perl needs to be installed on the machine it is to be run on; this is the case as a matter of course on most Linux and Unix-like operating systems. A version of Perl later than 5.6.0 is required (I have 5.26.1 on Debian Testing).
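
If you are unsure which version of Perl is installed, it can be checked from the command line:

$ perl -e 'print "$^V\n"'
v5.26.1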

The Toolkit can be installed from the CPAN (Comprehensive Perl Archive Network), but if you do not have your system configured to do this the alternative is shown below (method copied from the Template Toolkit site):

$ wget http://cpan.org/modules/by-module/Template/Template-Toolkit-2.26.tar.gz
$ tar zxf Template-Toolkit-2.26.tar.gz
$ cd Template-Toolkit-2.26
$ perl Makefile.PL
$ make
$ make test
$ sudo make install

These instructions are relative to the current version of Template Toolkit at the time of writing, version 2.26. The site mentioned above will refer to the latest version.
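
Alternatively, if your system is set up for CPAN, or you would rather use a distribution package, the same result can be achieved more simply; one of the following should do it on a Debian-based system (libtemplate-perl is, as far as I am aware, the Debian package name for TT):

$ cpan Template                        # install from CPAN
$ sudo apt install libtemplate-perl    # or install the Debian package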

Making a template

Using the Template Toolkit is a big subject, and I will not go into great detail here. If there is any interest I will do an episode on it in the future.

The principle is that TT reads a template file containing directives in the TT syntax. Usually TT is called from a script written in Perl (or Python – a Python version has been released recently). The template can be passed data from the script, but it can also obtain data itself. I used this latter ability to process the CSV file.

TT directives are enclosed in [% and %] sequences. They provide features such as loops, variables, control statements and so forth.

To make TT access the CSV data file I used a plugin that comes with the Template Toolkit package. This plugin is called Template::Plugin::Datafile. It is linked to the required data file with the following directive:

[% USE name = datafile('file_path', delim = ',') %]

The plugin reads files with fields delimited by colons by default, but in this instance we redefine this to be a comma. The name variable is actually a list of hashes which gives access to the lines of the data.

The following example template shows TT being connected to the file we created earlier, with a loop which iterates through the list of hashes, generating output data.

[% USE ytlist = datafile('yt_data.csv', delim = ',') -%]
- YouTube channels:
[% FOREACH chan IN ytlist -%]
[% NEXT IF chan.seen || chan.skip -%]
    - [*[% chan.title %]*]([% chan.feed.replace('feeds/videos\.xml.channel_id=', 'channel/') %])
[% END -%]

Note that the TT directives are interleaved with the information we want to write. The line ‘- YouTube channels:’ is an example of a Markdown list element.

This is followed by a FOREACH loop which iterates through the ytlist list, placing the current line in the hash variable chan. The loop is terminated with an END directive.

The NEXT directive causes the loop to skip a line of data if either the seen or skip column holds the value true (1). These fields are referenced as chan.seen and chan.skip meaning the elements of the hash chan. Before running this template I edited the list and set these values to control what was reported.

The line after NEXT is simply outputting the contents of the hash. It is turning the data into a Markdown sub-list. Because the URL in the OPML file contained the address of a feed, whereas we need a channel address, the replace function (actually a virtual method) performs the necessary editing.

The expression chan.feed.replace() shows the replace virtual method being applied to the field feed of the chan hash.

Running the template

Running the template is simply a matter of calling the tpage command on it, where this command is part of the Template Toolkit package:

$ tpage yt_template.tpl | head -5
- YouTube channels:
    - [*Anne of All Trades*](https://www.youtube.com/channel/UCCkFJmUgzrZdkeHl_qPItsA)
    - [*bigclivedotcom*](https://www.youtube.com/channel/UCtM5z2gkrGRuWd0JQMx76qA)
    - [*Computerphile*](https://www.youtube.com/channel/UC9-y-6csu5WGm29I7JiwpnA)
    - [*David Waelder*](https://www.youtube.com/channel/UCcapFP3gxL1aJiC8RdwxqRA)

The output is Markdown, and the channel lines are links; only the first 5 lines generated are shown here. It is actually possible to pipe the output of tpage directly into pandoc to generate HTML as follows:

$ tpage hpr____/yt_template.tpl | pandoc -f markdown -t html5 | head -5
<ul>
<li>YouTube channels:
<ul>
<li><a href="https://www.youtube.com/channel/UCCkFJmUgzrZdkeHl_qPItsA"><em>Anne of All Trades</em></a></li>
<li><a href="https://www.youtube.com/channel/UCtM5z2gkrGRuWd0JQMx76qA"><em>bigclivedotcom</em></a></li>

You can see the result of running this to generate the notes for show 2493 by looking at the Links section of the long notes on that show.

Conclusion

I guess I could be accused of overkill here. When creating the notes for show 2493 I actually did more than what I have described here because it made the slightly tedious process of building a list a bit more interesting than it would have been otherwise.

Also, should I ever wish to record another show updating my YouTube subscriptions I can do something similar to what I have done here, so it is not necessarily wasted effort.

Along the way I learnt about getting data out of YouTube and I learnt more about using xmlstarlet. I also learnt some new things about Template Toolkit.

Of course, I also contributed another episode to Hacker Public Radio!

You may not agree, but I think this whole process is cool (even though it might be described as over-engineered).