Managing tags on HPR episodes - 1 (HPR Show 2245)

Dave Morriss


Table of Contents

Introduction

We have been collecting and storing tags for new HPR shows for a while now with the intention of eventually offering a search interface. In addition, a number of contributors, including myself have been adding tags (and summaries), to shows that do not have them, since August 2015. There is still a way to go, but we’re making progress. At the time of writing (2017-01-31) 56.29% (1248) of all HPR shows (2217) have tags.

In recent times the way in which we should use these tags has been discussed. In show 2035 on 2016-05-20 droops suggested:

The website, which is a lot of work, needs to have related shows listed on each individual show’s page. This will take a tag system and someone to tag all of the almost uncountable previous episodes.

This episode begins a discussion about some of the ways that tags can be stored, managed and accessed efficiently in the HPR database.

I started planning a show about this subject in the summer of 2016, and the amount of information I have accumulated has grown since then. There is now quite a lot, so I am going to split what was originally going to be one show into three.

The subject becomes quite technical in the later shows, discussing database design techniques, and all three of the shows contain examples of database queries and scripts. If you are not interested in this subject than feel free to skip past. However, you might find this first episode more palatable, and any thoughts you might have on the subject would be appreciated.

Previous discussions

There have been discussions in the past about whether we should use a database at all to hold HPR shows, and whether a static site might be better for our needs. This has been motivated by security considerations amongst other things. Such a static site would probably be generated from a database, since there are a number of instances where the site needs to contain computed values such as the number of shows in a series, or the number recorded by a given host, and so forth, and databases are good at this.

If, as now, a database is being used then there are differing opinions on how it should be put together. Database administration skills are quite specialised and the concepts behind databases can be a little difficult to comprehend when first encountered.

One view has been that using any of the advanced features of the chosen database system should be avoided because doing so will be too complicated for future volunteer HPR administrators to maintain. The types of capabilities that would be desirable include:

  • Allowing there to be more than one host associated with a show.
  • Allowing a show to be associated with more than one series.
  • Implementing a tag system.

On the other hand, the view has certainly been expressed that using a relational database like a collection of spreadsheets (as now) is a woeful misuse of a powerful tool.

I will give examples of some of these issues in these episodes.

Tags and their potential uses

The question needs to be asked: “Why have tags at all?”

1. Relationships between shows

As suggested by droops, if a particular show has tags associated with it, then all of the shows which share any of the same tags should be related and should be linked (or linkable) from that show somehow. Displaying such relationships would be helpful in finding other shows worth listening to.

2. Listing shows by tags

It might also be useful to list all the shows that have a given tag, perhaps as a sort of master show index. I am envisaging something like the index of a book with a tag on the left and a show number link on the right of the page. Alternatively a table listing tags with show number and title to the right. Of course, with a well-populated tag system, there will often be several shows which are tagged with a given tag.

3. Tag queries

Many tagging systems allow complex queries such as:

interview and ( oggcamp or fscons )

This means all shows tagged ‘interview’ and either ‘oggcamp’ or ‘fscons’. Although implementing this at the database level is possible, I suspect that such a feature would require Javascript on the front end to make a useable interface on the website, so it may not be implemented for some time (if at all).

On the other hand, doing this through an URL-based interface may be simpler.

Database design

Now let us look at how a tag system could be implemented in the database.

The HPR database currently holds tags as a comma-separated list - the exact same way we request them to be entered into the form when an episode is being uploaded. We do not currently reformat these in any standard way, so although they are comma-separated entities they do not conform to the Comma Separated Variable standard (see RFC4180).

The reason why using such a standard would be desirable is, for example, there may be a need to add a tag which contains a comma. For instance, the tags for a show might be:

TED
Technology, Education and Design

Here the expectation was that “Technology, Education and Design” was a single tag, not the two tags it might be interpreted as.

The storage currently allocated for the tag list is 200 characters. This is potentially quite restrictive since some very long shows would benefit from more tags than might fit this field. The longest tag list currently in the database is 196 characters.

So, the question is what would be the best way to store tags? I will look at the current solution and two other different solutions and comment about them. This episode will concentrate on the current way of doing things and the next two will look at the other suggested alternatives.

As a disclaimer, I am not a trained database designer. I have worked with databases a fair bit over the past 15 years or so but am largely self-taught. Listeners to this episode may well have better ideas for doing what is needed, and will probably have had proper training and much more experience. If so, please add ideas and suggestions in the comments, send email to the HPR list, or email me directly.

1. Tags as a comma-separated list

This is the current solution. The ‘tags’ field is a column of the ‘eps’ table which holds episode details in the HPR database.

Advantages

  • This is a very simple to store such tags. We just place words or phrases in a comma-separated list and add them to the tags field for an episode.
  • It is very simple to maintain. We just add tags to the field or remove them – with relative ease.

Disadvantages

  • The current field is small and can easily be exceeded when dealing with long and complex shows that would benefit from a lot of tags (such as the New Year shows where people tend to list things such as favourite software or hardware which could be tagged).
  • The tags should be pre-processed to ensure they conform to CSV format before storage. For instance, we would want tags with commas in them like the earlier example “Technology, Education and Design” to be enclosed in quotes.
  • This format is wasteful since the same tags are duplicated throughout episodes (there are currently 84 instances of “Linux” in the tags for example).
  • This format is very difficult to search efficiently (see below). In database terms searching requires each ‘tags’ field to be examined with a string search which is expensive and scales badly.
  • Databases are usually designed to use indexes to optimise searches, and this cannot be done here (or is beyond my skill level).

Searching

As mentioned, there is a database table called ‘eps’ which holds episode details such as the show number, release date and episode title, and the tags are held in a column of this table.

Finding shows with a given tag

If we look at the show droops recorded called “Building Community”, episode 2035, this has a single tag, the word ‘community’. Say we want to find other shows with this tag – we might use the following query:

SELECT id,date,title FROM eps WHERE tags LIKE '%community%';

Note: This is SQL, Structured Query Language. ‘SELECT’ initiates a query, where information is retrieved from the database. The name of the table being queried follows ‘FROM’, and the fields to display follow the ‘SELECT’. The ‘WHERE’ part introduces the segment of the query which filters out a subset of the table. Here we are checking the ‘tags’ field, looking for “community”.

In short, the query finds all rows in the ‘eps’ table with “community” in the ‘tags’ field.

We cannot use the equality operator in the ‘WHERE’ clause, since we’re looking for the word ‘community’ in a list of other words, so we use the wildcard capabilities of SQL where ‘%’ is the wildcard used in conjunction with the ‘LIKE’ operator. We are looking for any tag field containing ‘community’. The comparisons are not case-sensitive.

Running this query currently returns 95 rows, since the Community News shows are all tagged ‘community news’ and this approach does not differentiate between ‘community’ as a tag or as part of a tag.

We could try this:

SELECT id,date,title FROM eps WHERE tags LIKE '%,community,%';

This time we get three hits but it does not include show 2035 because there are no commas in the tags field in this case, since there is just one tag.

We could try covering all possibilities:

SELECT id,date,title FROM eps WHERE tags LIKE '%community%'
    OR tags LIKE '%,community%' OR tags LIKE '%community,%'
    OR tags LIKE '%,community,%';

However, this produces 95 results again since it doesn’t exclude ‘community news’ and is getting quite ugly.

The database server, MariaDB, can handle (limited) regular expressions so we might try that as a solution:

SELECT id,date,title FROM eps WHERE tags REGEXP '(^|,)community(,|$)';

Here the regular expression matches ‘community’ at the start of the field, or after a comma, as well as at the end of the field or before a comma.

This now returns five matches. However, the tag strings often contain spaces since they have not yet been cleaned up, so it is necessary to enhance the expression to avoid such problems:

MariaDB [hpr_hpr]> SELECT id,date,title,tags FROM eps WHERE tags REGEXP '(^|,) *community *(,|$)';
+------+------------+--------------------------------------------------------------------+---------------------------------------------------------------+
| id   | date       | title                                                              | tags                                                          |
+------+------------+--------------------------------------------------------------------+---------------------------------------------------------------+
|    1 | 2007-12-31 | Introduction to HPR                                                | hpr, twat, community                                          |
|  947 | 2012-03-20 | Presentation by Jared Smith at the Columbia Area Linux Users Group | Fedora,community                                              |
| 1000 | 2012-05-31 | Episode 1000                                                       | HPR,community,congratulations                                 |
| 1024 | 2012-07-05 | Episode 1024                                                       | HPR,community,anniversary                                     |
| 1509 | 2014-05-15 | HPR Needs Shows                                                    | HPR, shows, request, call to action, community, contribute    |
| 1913 | 2015-12-02 | The Linux Experiment                                               | linux, the linux experiment, community                        |
| 2008 | 2016-04-13 | HPR needs shows to survive.                                        | HPR,community,shows,call to action,contribute                 |
| 2035 | 2016-05-20 | Building Community                                                 | community                                                     |
| 2077 | 2016-07-19 | libernil.net and self hosting for friends and family               | gnu, linux, networking, community, servers, services, commons |
+------+------------+--------------------------------------------------------------------+---------------------------------------------------------------+
9 rows in set (0.01 sec)

Here we get nine matches, which is correct.

It is also possible to use the regular expression to find multiple tags. In this example we look for the string ‘hpr’ or ‘community’ and get 14 matches:

MariaDB [hpr_hpr]> SELECT id,date,title,tags FROM eps WHERE tags REGEXP '(^|,) *(hpr|community) *(,|$)';
+------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
| id   | date       | title                                                              | tags                                                                                                              |
+------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|    1 | 2007-12-31 | Introduction to HPR                                                | hpr, twat, community                                                                                              |
|  947 | 2012-03-20 | Presentation by Jared Smith at the Columbia Area Linux Users Group | Fedora,community                                                                                                  |
| 2195 | 2016-12-30 | All you need to know when uploading a show                         | HPR                                                                                                               |
| 1000 | 2012-05-31 | Episode 1000                                                       | HPR,community,congratulations                                                                                     |
| 1024 | 2012-07-05 | Episode 1024                                                       | HPR,community,anniversary                                                                                         |
| 1371 | 2013-11-04 | The Lost Banner of HPR                                             | hpr,banner                                                                                                        |
| 1509 | 2014-05-15 | HPR Needs Shows                                                    | HPR, shows, request, call to action, community, contribute                                                        |
| 1726 | 2015-03-16 | 15 Excuses not to Record a show for HPR                            | hpr,podcasting,tips,techniques,kw,knightwise,excuses                                                              |
| 1818 | 2015-07-22 | Review of HPR's Interview Recorder: Zoom H1                        | Zoom H1, microphone, recording, review, DVR, digital voice recorder, tutorial, getting started, guide, howto, HPR |
| 1877 | 2015-10-13 | Recording HPR on the fly on your Android phone                     | android, hpr, audio, recording                                                                                    |
| 1913 | 2015-12-02 | The Linux Experiment                                               | linux, the linux experiment, community                                                                            |
| 2008 | 2016-04-13 | HPR needs shows to survive.                                        | HPR,community,shows,call to action,contribute                                                                     |
| 2035 | 2016-05-20 | Building Community                                                 | community                                                                                                         |
| 2077 | 2016-07-19 | libernil.net and self hosting for friends and family               | gnu, linux, networking, community, servers, services, commons                                                     |
+------+------------+--------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
14 rows in set (0.00 sec)

If however we want to find ‘hpr’ and ‘community’ then we have to use two regular expressions. This time there are only 5 matches, as might be expected:

MariaDB [hpr_hpr]> SELECT id,date,title,tags FROM eps WHERE tags REGEXP '(^|,) *(community) *(,|$)' AND tags REGEXP '(^|,) *(hpr) *(,|$)';
+------+------------+-----------------------------+------------------------------------------------------------+
| id   | date       | title                       | tags                                                       |
+------+------------+-----------------------------+------------------------------------------------------------+
|    1 | 2007-12-31 | Introduction to HPR         | hpr, twat, community                                       |
| 1000 | 2012-05-31 | Episode 1000                | HPR,community,congratulations                              |
| 1024 | 2012-07-05 | Episode 1024                | HPR,community,anniversary                                  |
| 1509 | 2014-05-15 | HPR Needs Shows             | HPR, shows, request, call to action, community, contribute |
| 2008 | 2016-04-13 | HPR needs shows to survive. | HPR,community,shows,call to action,contribute              |
+------+------------+-----------------------------+------------------------------------------------------------+
5 rows in set (0.00 sec)

MariaDB offers a function ‘find_in_set’ which parses comma-separated lists, so it is possible to do this:

SELECT id,date,title FROM eps WHERE find_in_set('community',tags) > 0;

However, this only returns five matches, because of the extra spaces before the tags, so the regular expression method seems best unless the tags have been significantly cleaned up.

How then could a query be constructed to do the thing that droops suggested: given a show number, extract its tags and use them to perform a search for other shows matching each tag?

I can think of ways this could be scripted, though it would take more than one query

  • one query to get the target show
  • the script would need to parse the CSV data in the tags field for the show
  • then queries would need to be issued for each of the tags so discovered

I cannot think of any way in which this could be performed as a single SQL query.

Conclusion

I think that using the comma separated tag field for anything but the very simplest types of queries is undesirable. Even if we continued to use this method, we would need to:

  1. Increase the size of the field (e.g. ‘alter table eps modify tags varchar(400) null;’)
  2. Ensure the contents conform to CSV standards (I have written a fairly simple Perl script called ‘clean_csv_tags’ to do this, which is available with this show)
  3. Ideally, perform checks on the tags for spelling, unnecessary plurals, spurious punctuation and so on (my script could probably be modified to do this)

Notwithstanding these points, my feeling is that we should look for a better solution.

Epilogue

A couple of requests regarding tags:

  1. Please include tags when uploading your shows. Just add a few keywords to the tags field reflecting what your show was about or topics you spoke about.
  2. Note: more contributions to the project to add missing tags will always be welcome! Visit the page on the HPR website listing missing summaries and tags to find out how you could help.