Managing tags on HPR episodes - 2 (HPR Show 2260)

Dave Morriss


Table of Contents

Introduction

This is the second show looking at the subject of Managing Tags.

In the first show we looked at why we need tags, examined the present system and considered the advantages and disadvantages of doing things the current way.

To reiterate my disclaimer from the last episode: I am not a trained database designer. I have worked with databases a fair bit over the past 15 years or so but am largely self-taught. If I am talking nonsense, or if there are far better ways of doing what I’m suggesting please let me know!

Database Design

In the last episode we looked at method 1 (the current method) – using a comma-separated list of tag strings. Now we’ll look at an alternative method.

2. Using a tags table

Tagging has been in use on the Web for a long time, and looking around for suggestions on how best to set it up in a database I came across one solution which uses a single tags table (see the “Scuttle solution”).

The table, named ‘tags’ in my test database, holds three columns that I have called:

id    - a reference to a show number in the 'eps' table
tag   - a single tag in mixed case
lctag - a single tag in lower case form

The use of a mixed case and a lower case form is probably not necessary since MariaDB performs case-insensitive matches by default. I was not aware of this when I put this test table together.

Setting up and managing the table

For the purposes of testing I started by taking the tags from the comma separated list in the ‘eps’ table.

Method 1 using deep dark Database magic to manage the tags

My first approach to populating this table was to use database tools to do the work. I wrote a stored database function to parse the current ‘eps.tags’ field and extract the individual tags, and another to manage the process of populating the new ‘tags’ table. This is controlled by a piece of SQL in a file which first empties the ‘tags’ table before rebuilding it from the ‘eps’ table.


Note: This is a fairly advanced Database Administration topic. The use of stored procedures in a database is something I have done before, but I am by no means an expert on it. I have included this mainly for interest and it can be skipped with little detriment if you are not interested.

The SQL file defining the table, indexes (see below) and the code to process tags is called normalise_tags_1.sql and is available on the HPR website as part of this episode. It is based on an article found on Stack Overflow.

The SQL I run to refresh the ‘tags’ table is called refresh_tags.sql and is also available with this show. This method empties and rebuilds the ‘tags’ table.

Method 2 using a Perl script to manage the tags

An alternative method of managing this table is included. It consists of a Perl script (refresh_tags) which adds and removes individual entries rather than clearing the table and rebuilding it.

The script is a work in progress. It works by scanning the entire database and collecting all of the tags stored in the ‘eps.tags’ field in CSV format (discussed at length in the last episode). It uses a Perl module ‘Text::CSV_XS’ to parse the CSV data. The script ensures the parsed CSV tags are unique per episode and sorts them internally.

The script then collects all of the tags already in the ‘tags’ table and organises them so that they can be compared with the first collection. Differences between the two tag sources are then built and, unless the script is in ‘dry run’ mode these are applied to the ‘tags’ table. There may be deletions if for some reason a tag has been removed from the CSV list, and additions if new tags have been added.

The importance of indexes

One of the important parts of this tag solution is the use of indexes associated with the ‘tags’ table. There is an index called ‘tags_all’ that ensures every row of the table is unique – it makes no sense to have the same tag repeated for an episode, for example. There are indexes on the other fields as well which are intended to speed up access.

I am running this locally on a copy of the HPR database, and in my experiments it takes less than two seconds to rebuild the table starting from scratch and considerably less to update using the Perl script.

The table contains repeated instances of a tag, one for each matching episode number:

MariaDB [hpr_hpr]> SELECT id,tag FROM tags WHERE tag = 'grep' ORDER BY id;
+------+------+
| id   | tag  |
+------+------+
| 2040 | grep |
| 2072 | grep |
+------+------+
2 rows in set (0.00 sec)

(You would expect ‘grep’ to appear more often than this, but it doesn’t at the moment due to the relative scarcity of tags)

Advantages

  • This approach gives a much more reliable and efficient solution to the problem of storing and finding tags.
  • The separate table allows for indexes to be built to optimise access speeds, which is something that cannot be done for the comma-separated string approach.

Disadvantages

  • Using method 1 (stored procedures) the solution requires the clearing out and repopulating of the tags table. This means that during the time when the table is empty and being rebuilt the tag functionality is missing or reduced. When using method 2 and the Perl script mentioned above this disadvantage vanishes.
  • The approach of doing the parsing of the comma-separated list with SQL (method 1) is not ideal since it does not allow us to use properly formatted CSV data. For example, if a tag with a comma is enclosed in quotes to “protect” the comma this solution doesn’t recognise it. As before, when using the Perl script described above for method 2 this problem vanishes.
  • In database terms it is desirable to “normalise” the structure as much as possible. That means that storing duplicate values like the same tag over and over again is frowned upon. This solution is not “normalised”.

Searching

We can now perform much more sophisticated searches since the work of parsing and extracting tags has already been done – when the table was built. Unlike the solution discussed in the first show where each search requires the tag list per episode to be examined and parsed, this solution is much more efficient. The parsing is performed once and the results stored.

So if we want to be able to follow droops’ suggestion to examine the tags on a given show and find all the other shows that share the same tags we can, as demonstrated below.

Finding shows with a given tag

In the last episode we searched using the ‘tags’ field in the ‘eps’ table, but now we can use the ‘tags’ table to find all the shows associated with the tag ‘community’, and we then report those shows. It’s necessary to order the result since the rows will not be sorted:

MariaDB [hpr_hpr]> SELECT e.id,date,substr(title,1,30) AS title,tags FROM eps e, tags t WHERE e.id = t.id AND  tag = 'community' ORDER BY id;
+------+------------+--------------------------------+---------------------------------------------------------------+
| id   | date       | title                          | tags                                                          |
+------+------------+--------------------------------+---------------------------------------------------------------+
|    1 | 2007-12-31 | Introduction to HPR            | hpr, twat, community                                          |
|  947 | 2012-03-20 | Presentation by Jared Smith at | Fedora,community                                              |
| 1000 | 2012-05-31 | Episode 1000                   | HPR,community,congratulations                                 |
| 1024 | 2012-07-05 | Episode 1024                   | HPR,community,anniversary                                     |
| 1509 | 2014-05-15 | HPR Needs Shows                | HPR, shows, request, call to action, community, contribute    |
| 1913 | 2015-12-02 | The Linux Experiment           | linux, the linux experiment, community                        |
| 2008 | 2016-04-13 | HPR needs shows to survive.    | HPR,community,shows,call to action,contribute                 |
| 2035 | 2016-05-20 | Building Community             | community                                                     |
| 2077 | 2016-07-19 | libernil.net and self hosting  | gnu, linux, networking, community, servers, services, commons |
+------+------------+--------------------------------+---------------------------------------------------------------+
9 rows in set (0.00 sec)


Skip if not interested

This SQL query is more complex than those seen before. It is selecting from the ‘eps’ and ‘tags’ tables at once. This is called a ‘JOIN’ and the two tables are given aliases (‘e’ and ‘t’) to save typing.

The first clause after the ‘WHERE’ is matching the ‘id’ fields in the two tables so we only get episodes related to tags (and vice versa). The clause ‘tag = 'community'’ selects just those matching tags and therefore the related episodes.

In case you were wondering the part ‘substr(title,1,30) AS title’ trims the title to 30 characters to fit these notes, and the ‘AS title’ part just ensures there’s a sensible name over the output column.

Note that the database used has the raw, unprocessed ‘eps.tags’ field. The cleaned form is not needed when using the ‘tags’ table managed by the Perl script because the script processes the tags internally.

I used the SQLeo tool mentioned by Ken Fallon in HPR episode 1965 and have included an image of the query we have just examined:

SQLeo
Screenshot from SQLeo

Note that there are no lines connecting the tables and that is because we don’t have foreign keys in the HPR database. I think this is because of its origins, perhaps before MySQL had reliably implemented this capability? Also the types of tables used do not support foreign keys. It is my view that this is another area that needs attention in the HPR database.

Finding shows with combinations of tags

In the last show we looked at ways in which the combination of tags ‘community’ and/or ‘HPR’ could be searched for in the CSV tags. Using the ‘tags’ table this is simpler (in database terms).

MariaDB [hpr_hpr]> SELECT e.id,e.date,substr(e.title,1,30) AS title,tags FROM eps e, tags t WHERE e.id = t.id AND t.tag IN ('community','hpr') GROUP BY e.id;
+------+------------+--------------------------------+-------------------------------------------------------------------------------------------------------------------+
| id   | date       | title                          | tags                                                                                                              |
+------+------------+--------------------------------+-------------------------------------------------------------------------------------------------------------------+
|    1 | 2007-12-31 | Introduction to HPR            | hpr, twat, community                                                                                              |
|  947 | 2012-03-20 | Presentation by Jared Smith at | Fedora,community                                                                                                  |
| 1000 | 2012-05-31 | Episode 1000                   | HPR,community,congratulations                                                                                     |
| 1024 | 2012-07-05 | Episode 1024                   | HPR,community,anniversary                                                                                         |
| 1371 | 2013-11-04 | The Lost Banner of HPR         | hpr,banner                                                                                                        |
| 1509 | 2014-05-15 | HPR Needs Shows                | HPR, shows, request, call to action, community, contribute                                                        |
| 1726 | 2015-03-16 | 15 Excuses not to Record a sho | hpr,podcasting,tips,techniques,kw,knightwise,excuses                                                              |
| 1818 | 2015-07-22 | Review of HPR's Interview Reco | Zoom H1, microphone, recording, review, DVR, digital voice recorder, tutorial, getting started, guide, howto, HPR |
| 1877 | 2015-10-13 | Recording HPR on the fly on yo | android, hpr, audio, recording                                                                                    |
| 1913 | 2015-12-02 | The Linux Experiment           | linux, the linux experiment, community                                                                            |
| 2008 | 2016-04-13 | HPR needs shows to survive.    | HPR,community,shows,call to action,contribute                                                                     |
| 2035 | 2016-05-20 | Building Community             | community                                                                                                         |
| 2077 | 2016-07-19 | libernil.net and self hosting  | gnu, linux, networking, community, servers, services, commons                                                     |
| 2195 | 2016-12-30 | All you need to know when uplo | HPR                                                                                                               |
+------+------------+--------------------------------+-------------------------------------------------------------------------------------------------------------------+
14 rows in set (0.00 sec)

This is like the ‘OR’ example in part 1.

The ‘GROUP BY’ clause is a way of de-duplicating the result. Without it 19 rows are returned but 5 are duplicates.

MariaDB [hpr_hpr]> SELECT e.id,e.date,substr(e.title,1,30) AS title, tags FROM eps e, tags t WHERE e.id = t.id AND t.tag in ('community','hpr') GROUP BY e.id HAVING count(e.id) = 2;
+------+------------+-----------------------------+------------------------------------------------------------+
| id   | date       | title                       | tags                                                       |
+------+------------+-----------------------------+------------------------------------------------------------+
|    1 | 2007-12-31 | Introduction to HPR         | hpr, twat, community                                       |
| 1000 | 2012-05-31 | Episode 1000                | HPR,community,congratulations                              |
| 1024 | 2012-07-05 | Episode 1024                | HPR,community,anniversary                                  |
| 1509 | 2014-05-15 | HPR Needs Shows             | HPR, shows, request, call to action, community, contribute |
| 2008 | 2016-04-13 | HPR needs shows to survive. | HPR,community,shows,call to action,contribute              |
+------+------------+-----------------------------+------------------------------------------------------------+
5 rows in set (0.00 sec)

This one is like the ‘AND’ example in part 1.

The query is the same as the previous one except for the ‘HAVING’ clause. This causes the database engine to count the number of rows with the same id value and only show those where there are two.

Note that the 14 rows in the ‘OR’ example and the 5 rows in the ‘AND’ example add to 19, which is how many times these two tags occur in the database.

We probably want to make a more complex query that takes a show number, uses its tags and searches for other shows that also use them, in order to do what droops suggested. I wrote a query which I have included as file called find_shows_sharing_tags.sql in case anyone is interested.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
/*
 * Given a show number find all other shows that share each of its tags.
 * Report the tags first followed by some of the episode details
*/

-- SET @show = 383;

-- Report the tags on the show just for information
SELECT
    tag
FROM
    tags
WHERE
    id = @show;

-- Find all shows sharing the tags, omitting the target one
SELECT
    t.lctag, e.id, e.date, h.host, e.title
FROM
    tags t,
    eps e,
    hosts h
WHERE
    t.id = e.id
    AND e.hostid = h.hostid
    AND t.tag IN (
        SELECT
            tag
        FROM
            tags
        WHERE
            id = @show)
GROUP BY e.id
HAVING e.id <> @show
ORDER BY t.lctag, e.id;

There are two queries in the file. The first one just reports the tags associated with the target show, merely for demonstration purposes. The main query reports which tags have matched which other shows, since I thought that might be useful when generating a web page around it.

I will not attempt to explain this in this episode. Perhaps we need a specific Database series to cover such things.

To make this demonstration easier the query uses a variable called @show which has to be set beforehand. The following example shows it being used to find two sets of shows:

  • one related to show number 2071 (“Undocumented features of Baofeng UV-5R Radio” by MrX)
  • the other related to show 2072 (“That Awesome Time I Deleted My Home Directory” by sigflup)

MariaDB [hpr_hpr]> SET @show = 2071;
Query OK, 0 rows affected (0.00 sec)

MariaDB [hpr_hpr]> \. find_shows_sharing_tags.sql
+---------------+
| tag           |
+---------------+
| Amateur Radio |
| Electronics   |
| Open Source   |
+---------------+
3 rows in set (0.00 sec)

+---------------+------+------------+---------------------------+-------------------------------------------------------------+
| lctag         | id   | date       | host                      | title                                                       |
+---------------+------+------------+---------------------------+-------------------------------------------------------------+
| amateur radio |  911 | 2012-01-27 | MrX                       | Hobbies                                                     |
| amateur radio | 1036 | 2012-07-23 | Joel                      | Setting up Your First Ham Radio Station                     |
| amateur radio | 1092 | 2012-10-08 | MrGadgets                 | Ham Radio: The Original Tech Geek Passion                   |
| amateur radio | 1701 | 2015-02-09 | Ken Fallon                | FOSDEM 2015 Part 4 of 5                                     |
| amateur radio | 2062 | 2016-06-28 | MrX                       | Now The Chips Are Definitely Down                           |
| electronics   | 1817 | 2015-07-21 | NYbill                    | Gathering Parts                                             |
| electronics   | 1858 | 2015-09-16 | NYbill                    | Multimeter Mod's Part 2                                     |
| electronics   | 1971 | 2016-02-22 | Dave Morriss              | BlinkStick                                                  |
| electronics   | 2029 | 2016-05-12 | NYbill                    | The DSO138 Oscilloscope Kit                                 |
| electronics   | 2044 | 2016-06-02 | NYbill                    | Bring on the Power!                                         |
| electronics   | 2056 | 2016-06-20 | Tony Hughes AKA TonyH1212 | Interview with a young hacker                               |
| electronics   | 2089 | 2016-08-04 | MrX                       | Solving a blinkstick python problem                         |
| electronics   | 2148 | 2016-10-26 | NYbill                    | The DSO138 Oscilloscope Kit Part 2                          |
| open source   |  242 | 2008-12-03 | UTOSC                     | Open Source in Government Panel Discussion                  |
| open source   | 1402 | 2013-12-17 | Thaj Sara                 | How I Started Using Linux and Free and Open Source Software |
| open source   | 1529 | 2014-06-12 | Ahuka                     | TrueCrypt, Heartbleed, and Lessons Learned                  |
| open source   | 1641 | 2014-11-17 | johanv                    | The real reasons for using Linux                            |
| open source   | 1653 | 2014-12-03 | Ahuka                     | Ruth Suehle at Ohio Linux Fest 2014                         |
| open source   | 1682 | 2015-01-13 | daw                       | Introduction to the Netizen Empowerment Federation          |
| open source   | 1686 | 2015-01-19 | Steve Bickle              | Interview with Joel Gibbard of OpenHand                     |
| open source   | 1723 | 2015-03-11 | Kevie                     | Success With Students                                       |
| open source   | 1736 | 2015-03-30 | b-yeezi                   | How I run my small business using Linux                     |
| open source   | 1783 | 2015-06-03 | GNULinuxRTM               | Windows To Linux - Better Late Than Never.                  |
| open source   | 1788 | 2015-06-10 | Kevie                     | Podcrawl Glasgow 2015                                       |
| open source   | 1917 | 2015-12-08 | klaatu                    | OpenSource.com                                              |
| open source   | 1984 | 2016-03-10 | Clinton Roy               | A Love Letter to linux.conf.au                              |
| open source   | 2036 | 2016-05-23 | Dave Morriss              | Glasgow Podcrawl 2016                                       |
| open source   | 2155 | 2016-11-04 | Ahuka                     | Ohio LinuxFest 2016                                         |
| open source   | 2170 | 2016-11-25 | Ken Fallon                | soundtrap.io                                                |
| open source   | 2182 | 2016-12-13 | spaceman                  | why say GNU/Linux ?                                         |
+---------------+------+------------+---------------------------+-------------------------------------------------------------+
30 rows in set (0.00 sec)

MariaDB [hpr_hpr]> SET @show = 2072;
Query OK, 0 rows affected (0.00 sec)

MariaDB [hpr_hpr]> \. find_shows_sharing_tags.sql
+------------+
| tag        |
+------------+
| dd         |
| filesystem |
| grep       |
+------------+
3 rows in set (0.00 sec)

+-------+------+------------+---------+-----------------+
| lctag | id   | date       | host    | title           |
+-------+------+------------+---------+-----------------+
| grep  | 2040 | 2016-05-27 | matthew | Why I Use Linux |
+-------+------+------------+---------+-----------------+
1 row in set (0.00 sec)

It is interesting that we only get one other show back in the second case. I think this demonstrates the shortage of good tags in the database at the moment.

Need I say more?

Using regular expressions

This one is just for fun. I was experimenting with other query types and came up with this one that looks for a partial tag using a regular expression. The tag being searched for is anything containing a word ending in ‘working’. The expression ‘[[:>:]]’ is MySQL/MariaDB’s regexp word boundary operator.

MariaDB [hpr_hpr]> SELECT e.id,date,h.host,title,e.tags AS eps_tags,(SELECT group_concat(tag) FROM tags GROUP BY id HAVING id = e.id) AS taglist FROM eps e JOIN hosts h USING (hostid) JOIN tags t USING (id) WHERE t.tag REGEXP 'working[[:>:]]' GROUP BY e.id;
+------+------------+----------------------+-------------------------------------------------------------------+---------------------------------------------------------------+---------------------------------------------------------+
| id   | date       | host                 | title                                                             | eps_tags                                                      | taglist                                                 |
+------+------------+----------------------+-------------------------------------------------------------------+---------------------------------------------------------------+---------------------------------------------------------+
| 1121 | 2012-11-19 | klaatu               | Klaatu continues his Networking Basics series with a SAMBA howto. | networking,SMB,CIFS,SAMBA,file server,NFS,AFP                 | AFP,CIFS,file server,networking,NFS,SAMBA,SMB           |
| 1127 | 2012-11-27 | klaatu               | AFP file share on a Linux server                                  | networking,AFP,Apple Filing Protocol,Netatalk                 | AFP,Apple Filing Protocol,Netatalk,networking           |
| 1193 | 2013-02-27 | Ken Fallon           | Chris Conder Catchup on Broadband for Rural North                 | networking,broadband,fibre optic,Lancashire                   | broadband,fibre optic,Lancashire,networking             |
| 1774 | 2015-05-21 | Jon Kulp             | Router Hacking                                                    | Networking, Routers, Printer Setup, dd-wrt, tomato, openwrt   | dd-wrt,Networking,openwrt,Printer Setup,Routers,tomato  |
| 1954 | 2016-01-28 | Jon Kulp             | Grandpa Shows Us How to Turn Custom Pens                          | DIY, pens, woodworking, lathe, writing instruments            | DIY,lathe,pens,woodworking,writing instruments          |
| 2077 | 2016-07-19 | Christopher M. Hobbs | libernil.net and self hosting for friends and family              | gnu, linux, networking, community, servers, services, commons | commons,community,gnu,linux,networking,servers,services |
+------+------------+----------------------+-------------------------------------------------------------------+---------------------------------------------------------------+---------------------------------------------------------+
6 rows in set (0.03 sec)

The query computes one of the fields it returns by scanning the ‘tags’ table using group_concat which concatenates multiple rows into a list. It uses this method to display the tags held in the ‘eps’ table with their equivalents in the ‘tags’ table for comparison.

Conclusion

I probably do not need to say that I prefer this solution to the one discussed in the last episode.

The down side as far as my original solution using only SQL and stored procedures is concerned is that the ‘tags’ table regeneration is overkill. On the other hand, as mentioned above, the Perl script (refresh_tags) only operates on the differences and so seems a lot better.

Both of these approaches rely on the fact that the current ‘tags’ field in the ‘eps’ table provides the raw information, and tags in this form are easy to manage.

Note that it makes no difference to the Perl solution whether the CSV data in the ‘eps.tags’ field is properly formatted or not (within limits anyway). With the much more simplistic SQL-only solution then this matters a lot.

Of course, it would be possible to replace the ‘eps.tags’ field with a ‘tags’ table. This would require software to written to display and edit tags per episode. This would not be difficult.

Epilogue

As before a couple of requests regarding tags:

  1. Please include tags when uploading your shows. Just add a few keywords to the tags field reflecting what your show was about or topics you spoke about.
  2. Note: more contributions to the project to add missing tags will always be welcome! Visit the page on the HPR website listing missing summaries and tags to find out how you could help.