In show 2493 I listed a number of the YouTube channels I watch. Some of what I did to prepare the notes was to cut and paste information from YouTube pages, but the basic list itself was generated programmatically. I thought the process I used might be of interest to somebody so I am describing it here.
I needed four components to achieve what I wanted:
- YouTube subscription list (only available in OPML format as far as I know)
- xmlstarlet tool to parse the OPML
- Template Toolkit which I used to generate Markdown
- pandoc document converter tool to generate HTML
I will talk a little about the first three components in this episode in order to provide an overview.
YouTube subscription list
To find this go to the ‘Subscription Manager’ page of YouTube (https://www.youtube.com/subscription_manager) and select the ‘Manage Subscriptions’ tab. At the bottom of the page is an ‘Export’ option which generates OPML. By default this is written to a file called subscription_manager.
An OPML file is in XML format and is designed to be used by an application that processes RSS feeds such as a Podcatcher or a Video manager. For me it is a convenient format to parse in order to extract the basic channel information. I could not find any other way of doing this apart from scraping the YouTube website. If you know better please let me know in a comment or by submitting a show of your own.
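Since OPML is ordinary XML, any stock XML library can extract the channel information from it. As a quick illustrative sketch of my own (not part of the process described in this episode), here is the idea in Python, assuming the two-level outline layout that YouTube's export uses; the fragment below is a made-up minimal example:

```python
# Illustrative sketch: OPML is ordinary XML, so a stock XML parser can
# pull out channel names and feed URLs. The fragment below is a made-up
# minimal example in the shape of YouTube's export.
import xml.etree.ElementTree as ET

OPML = """<?xml version="1.0"?>
<opml version="1.1">
  <body>
    <outline text="YouTube Subscriptions" title="YouTube Subscriptions">
      <outline text="Alec Steele" title="Alec Steele" type="rss"
        xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCWizIdwZdmr43zfxlCktmNw"/>
    </outline>
  </body>
</opml>"""

root = ET.fromstring(OPML)
# Channels live in the second-level <outline> elements
for channel in root.findall("./body/outline/outline"):
    print(channel.get("title"), channel.get("xmlUrl"))
```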
xmlstarlet

This is a tool designed to parse XML files from the command line. I run Debian Testing and was able to install it from the repository.
There are other tools that could be used for parsing, but xmlstarlet is the Swiss Army knife for analysing and parsing such data. The tool deserves a show to itself, or even a short series. I know that Ken Fallon (who uses it a lot) has expressed a desire to go into detail about it at some point.
I am just going to describe how I decided to generate a simple CSV file from the OPML and found out how to do so with xmlstarlet.
Finding the structure of the XML
I copied the subscription_manager file to yt_subs.opml as a more meaningful name.
I ran the following command against this file to find out its structure:
$ xmlstarlet el -u yt_subs.opml
opml
opml/body
opml/body/outline
opml/body/outline/outline
It is possible to work this out by looking at the XML but it’s all squashed together and is difficult to read. It can be reformatted as follows:
$ xmllint --format yt_subs.opml | head -7
<?xml version="1.0"?>
<opml version="1.1">
  <body>
    <outline text="YouTube Subscriptions" title="YouTube Subscriptions">
      <outline text="John Heisz - I Build It" title="John Heisz - I Build It" type="rss" xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCjA8vRlL1c7BDixQRJ39-LQ"/>
      <outline text="MatterHackers" title="MatterHackers" type="rss" xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCDk3ScYL7OaeGbOPdDIqIlQ"/>
      <outline text="Alec Steele" title="Alec Steele" type="rss" xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCWizIdwZdmr43zfxlCktmNw"/>
xmllint is part of the libxml2-utils package on Debian, which also requires libxml2.
I think the xmlstarlet output is easier to read and understand.
The XML contains attributes (such as the title) which you can ask xmlstarlet to report on:
$ xmlstarlet el -a yt_subs.opml | head -11
opml
opml/@version
opml/body
opml/body/outline
opml/body/outline/@text
opml/body/outline/@title
opml/body/outline/outline
opml/body/outline/outline/@text
opml/body/outline/outline/@title
opml/body/outline/outline/@type
opml/body/outline/outline/@xmlUrl
Extracting data from the XML
The xmlstarlet command I came up with (after some trial and error) was as follows. I have broken the long pipeline into lines by adding backslashes and newlines so it’s slightly more readable, and in this example I have just shown the first 5 lines it generated. In actuality I wrote the output to a file called yt_data.csv.
$ (echo 'title,feed,seen,skip'; \
> xmlstarlet sel -t -m "/opml/body/outline/outline" \
> -s A:T:- @title \
> -v "concat(@title,',',@xmlUrl,',0,0')" \
> -n yt_subs.opml) | head -5
title,feed,seen,skip
akiyuky,https://www.youtube.com/feeds/videos.xml?channel_id=UCCJJNQIhS15ypcHqDfEPNXg,0,0
Alain Vaillancourt,https://www.youtube.com/feeds/videos.xml?channel_id=UCCsdIja21VT7AKkbVI5y8bQ,0,0
Alec Steele,https://www.youtube.com/feeds/videos.xml?channel_id=UCWizIdwZdmr43zfxlCktmNw,0,0
Alex Eames,https://www.youtube.com/feeds/videos.xml?channel_id=UCEXoiRx_rwsMfcD0KjfiMHA,0,0
Here is a breakdown of what is being done here:
- There is an echo command and the xmlstarlet command enclosed in parentheses. This causes Bash to create a subshell (a sub-process) to run everything. In the process the echo command generates the column titles for the CSV, as we’ll see later. The output of the entire process is written as a stream of lines so the header and data all go to the same place.
- The xmlstarlet command takes a sub-command, which in this case is sel, which causes it to “Select data or query XML document(s)” (quoted from the manual page)
- -t defines a template
- -m precedes the XPATH expression to match (as part of the template). The XPATH expression here is /opml/body/outline/outline, which targets each XML node containing the attributes we want.
- -s A:T:- @title defines sorting, where A:T:- is the operation (A for ascending, T for a text sort) and @title is the XPATH expression to sort by
- -v expression defines what is to be reported; in this case it’s the @title and @xmlUrl attributes, then two zeroes, all separated by commas, thereby making a line of CSV data
- -n outputs a newline after each match; the XML file to be read, yt_subs.opml, is given as the final argument
- The entire sub-process is piped into head -5, which returns the first 5 lines. In the actual case the output is redirected to the file yt_data.csv instead.
The reason for making four columns will become clear later, but in summary it’s so that I can mark lines in particular ways. The ‘seen’ column is for marking the channels I spoke about in an earlier episode (2202) so I didn’t include them again in this one, and the ‘skip’ column is for channels I didn’t want to include for various reasons.
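For readers without xmlstarlet, the same four-column CSV can be produced with a short script. The following is my own alternative sketch in Python's standard library, not the method used for the show; the sample OPML fragment is made up, using channels that appear in the real list:

```python
# Alternative sketch (not the author's method): produce the same
# 'title,feed,seen,skip' CSV from an OPML document using only Python's
# standard library.
import csv
import io
import xml.etree.ElementTree as ET

def opml_to_csv(opml_text):
    """Convert a YouTube subscriptions OPML document to CSV text."""
    root = ET.fromstring(opml_text)
    # Channels are the second-level <outline> elements
    rows = [(n.get("title"), n.get("xmlUrl"))
            for n in root.findall("./body/outline/outline")]
    # Case-insensitive ascending sort, roughly like '-s A:T:- @title'
    rows.sort(key=lambda r: r[0].lower())
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "feed", "seen", "skip"])
    for title, feed in rows:
        writer.writerow([title, feed, 0, 0])
    return out.getvalue()

sample = """<?xml version="1.0"?>
<opml version="1.1">
  <body>
    <outline text="YouTube Subscriptions" title="YouTube Subscriptions">
      <outline text="Alec Steele" title="Alec Steele" type="rss"
        xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCWizIdwZdmr43zfxlCktmNw"/>
      <outline text="akiyuky" title="akiyuky" type="rss"
        xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCCJJNQIhS15ypcHqDfEPNXg"/>
    </outline>
  </body>
</opml>"""

print(opml_to_csv(sample))
```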
Generating HTML with Template Toolkit
Template Toolkit is a template system. There are many of these for different programming languages and applications. I have been using this one for over 15 years and am very happy with its features and capabilities.
I currently use it when generating show notes for my HPR contributions, and it’s used in many of the scripts I use to perform tasks as an HPR Admin.
Installing Template Toolkit
The Template Toolkit (TT) is written in Perl so it’s necessary to have Perl installed on the machine it’s to be run on. This happens as a matter of course on most Linux and Unix-like operating systems. It is necessary to have a version of Perl later than 5.6.0 (I have 5.26.1 on Debian Testing).
The Toolkit can be installed from the CPAN (Comprehensive Perl Archive Network), but if you do not have your system configured to do this the alternative is shown below (method copied from the Template Toolkit site):
$ wget http://cpan.org/modules/by-module/Template/Template-Toolkit-2.26.tar.gz
$ tar zxf Template-Toolkit-2.26.tar.gz
$ cd Template-Toolkit-2.26
$ perl Makefile.PL
$ make
$ make test
$ sudo make install
These instructions are relative to the current version of Template Toolkit at the time of writing, version 2.26. The site mentioned above will refer to the latest version.
Making a template
Using the Template Toolkit is a big subject, and I will not go into great detail here. If there is any interest I will do an episode on it in the future.
The principle is that TT reads a template file containing directives in the TT syntax. Usually TT is called out of a script written in Perl (or Python – a new Python version has been released recently). The template can be passed data from the script, but it can also obtain data itself. I used this latter ability to process the CSV file.
TT directives are enclosed in [% and %] sequences. They provide features such as loops, variables, control statements and so forth.
The CSV file is read with a TT plugin called Template::Plugin::Datafile. It is linked to the required data file with the following directive:
[% USE name = datafile('file_path', delim = ',') %]
The plugin reads files with fields delimited by colons by default, but in this instance we redefine this to be a comma. The name variable is actually a list of hashes which gives access to the lines of the data.
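To picture what the plugin hands back, Python's csv.DictReader produces an analogous structure: a list of dictionaries keyed by the header row. The analogy is mine, not part of Template Toolkit:

```python
# Analogy sketch: TT's datafile plugin yields a "list of hashes"; in
# Python terms that is a list of dicts keyed by the CSV header line.
import csv
import io

data = """title,feed,seen,skip
Alec Steele,https://www.youtube.com/feeds/videos.xml?channel_id=UCWizIdwZdmr43zfxlCktmNw,0,0
akiyuky,https://www.youtube.com/feeds/videos.xml?channel_id=UCCJJNQIhS15ypcHqDfEPNXg,1,0
"""

rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["title"])  # like chan.title in the template
print(rows[1]["seen"])   # a '1' here is what makes the template skip a row
```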
The following example template shows TT being connected to the file we created earlier, with a loop which iterates through the list of hashes, generating output data.
[% USE ytlist = datafile('yt_data.csv', delim = ',') -%]
- YouTube channels:
[% FOREACH chan IN ytlist -%]
[% NEXT IF chan.seen || chan.skip -%]
    - [*[% chan.title %]*]([% chan.feed.replace('feeds/videos\.xml.channel_id=', 'channel/') %])
[% END -%]
Note that the TT directives are interleaved with the information we want to write. The line ‘- YouTube channels:’ is an example of a Markdown list element.
This is followed by a FOREACH loop which iterates through the ytlist list, placing the current line in the hash variable chan. The loop is terminated with an END directive. The NEXT directive causes the loop to skip a line of data if either the seen or the skip column holds the value true (1). These fields are referenced as chan.seen and chan.skip, meaning the elements of the hash chan. Before running this template I edited the list and set these values to control what was reported.
The line after NEXT simply outputs the contents of the hash, turning the data into a Markdown sub-list. Because the URL in the OPML file contained the address of a feed, whereas we need a channel address, the replace function (actually a virtual method) performs the necessary editing. The expression chan.feed.replace() shows the replace virtual method being applied to the field feed of the hash chan.
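The feed-to-channel rewrite that replace performs can be demonstrated outside TT. Here is my own sketch of the same regular-expression substitution using Python's re module, applied to one of the feed URLs from the list:

```python
# Sketch: the same rewrite that chan.feed.replace(...) does in the
# template -- turning a feed URL into the corresponding channel URL.
import re

feed = "https://www.youtube.com/feeds/videos.xml?channel_id=UCWizIdwZdmr43zfxlCktmNw"
# The '.' in the pattern matches the literal '?', just as in the TT pattern
channel = re.sub(r"feeds/videos\.xml.channel_id=", "channel/", feed)
print(channel)  # https://www.youtube.com/channel/UCWizIdwZdmr43zfxlCktmNw
```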
Running the template
Running the template is simply a matter of calling the tpage command on it, where this command is part of the Template Toolkit package:
$ tpage yt_template.tpl | head -5
- YouTube channels:
    - [*Anne of All Trades*](https://www.youtube.com/channel/UCCkFJmUgzrZdkeHl_qPItsA)
    - [*bigclivedotcom*](https://www.youtube.com/channel/UCtM5z2gkrGRuWd0JQMx76qA)
    - [*Computerphile*](https://www.youtube.com/channel/UC9-y-6csu5WGm29I7JiwpnA)
    - [*David Waelder*](https://www.youtube.com/channel/UCcapFP3gxL1aJiC8RdwxqRA)
The output is Markdown and these lines are links. I only showed the first 5 lines generated. It is actually possible to pipe the output of tpage directly into pandoc to generate HTML as follows:
$ tpage hpr____/yt_template.tpl | pandoc -f markdown -t html5 | head -5
<ul>
<li>YouTube channels:
<ul>
<li><a href="https://www.youtube.com/channel/UCCkFJmUgzrZdkeHl_qPItsA"><em>Anne of All Trades</em></a></li>
<li><a href="https://www.youtube.com/channel/UCtM5z2gkrGRuWd0JQMx76qA"><em>bigclivedotcom</em></a></li>
You can see the result of running this to generate the notes for show 2493 by looking at the Links section of the long notes on that show.
I guess I could be accused of overkill here. When creating the notes for show 2493 I actually did more than what I have described here because it made the slightly tedious process of building a list a bit more interesting than it would have been otherwise.
Also, should I ever wish to record another show updating my YouTube subscriptions I can do something similar to what I have done here, so it is not necessarily wasted effort.
Along the way I learnt about getting data out of YouTube and I learnt more about using xmlstarlet. I also learnt some new things about Template Toolkit.
Of course, I also contributed another episode to Hacker Public Radio!
You may not agree, but I think this whole process is cool (even though it might be described as over-engineered).
Links
- YouTube Subscription Manager page
- xmlstarlet manual (HTML, single page)
- Template Toolkit
- Previous HPR shows referred to:
- Example template file: yt_template.tpl