Managing tags on HPR episodes - 3 (HPR Show 2270)

Dave Morriss


Table of Contents

Introduction

This is the third (and last) show looking at the subject of Managing Tags relating to HPR shows.

In the first show we looked at why we need tags, and examined the advantages and disadvantages of the present system of storage. We considered the drawbacks of this design when searching the tags.

Then in the second show we looked at a simple way of making a tags table and how to query it in order to fulfil the requirements defined in the first show.

In this show we’ll look at a more rigorous, efficient, “normalised” solution.

Database Design

In this episode we have finally arrived at a design which a database designer would choose. It has been a fairly long journey, but the object was to examine alternatives and evaluate them.

3. Many-to-many relationship

In the last episode we looked at a design where we built and managed a tag table. One disadvantage with the method used is that the table contains the same tag multiple times. It also fails to conform to the accepted database design recommendations.

The solution described in show 2 works and is a considerable improvement on the first solution. However, this design does not really reflect the relationship between HPR episodes and tags. This relationship is what database designers call “many-to-many”.

What this means is that a given episode may have many tags and a given tag may be associated with many episodes. This way of doing things was explained very well by Mike Ray in episode 1569, “Many-to-many data relationship howto. I would recommend you listen to that show if you’d like a good understanding of how to set up such a relationship in a database.

In such a design one copy of each tag would be held in a tags table, and there would be a second linking (or cross-reference) table joining episodes and tags.

The following is a simplistic diagram which represents imaginary show number 1234 in an imaginary episodes table being associated with the tag “banana”. The linkage is is via the joining table which shows the association between episode 1234 and the tag which is stored with id 456.

Episode 1234 is associated with tag number 456 banana
Episode 1234 is associated with tag number 456 “banana”

Setting up and managing the tables

As with the previous method, I am using the comma separated list in the ‘eps’ table to populate the new tables. However, this time I am not trying to write stored procedures and functions to work on these tables, but have developed a Perl script instead, which we will look at later.

The SQL which defines the new tables is included with this show, and is called normalise_tags_2.sql. This is shown below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
/*
 * -----------------------------------------------------------------------------
 * Many-to-many tag tables
 *
 * The 'DROP ...' things are in this file in case we plan to regenerate
 * everything, perhaps after a table design change.
 * -----------------------------------------------------------------------------
 *
 * .............................................................................
 *
 * Create table 'tags2'
 *
 * This holds all tags with an associated index. It's called 'tags2' because
 * we already have a 'tags' table demonstrating an alternative tag solution.
 *
 * .............................................................................
 *
 */
-- DROP TABLE IF EXISTS tags2;
CREATE TABLE tags2 (
    id int(5) PRIMARY KEY NOT NULL AUTO_INCREMENT,
    tag varchar(200) NOT NULL,
    lctag varchar(200) NOT NULL
);

/*
 * An index to make it easier to find tags and to enforce uniqueness
 */

-- DROP INDEX tags2_tag ON tags2;
CREATE UNIQUE INDEX tags2_tag ON tags2 (tag);

/*
 * .............................................................................
 *
 * Create table 'eps_tags2_xref'
 *
 * This is the cross reference or 'joining' table
 *
 * .............................................................................
 *
 */
-- DROP TABLE IF EXISTS eps_tags2_xref;
CREATE TABLE eps_tags2_xref (
    eps_id int(5) NOT NULL,
    tags2_id int(5) NOT NULL
);

/*
 * Make a primary key from the two columns
 */
-- DROP INDEX all_eps_tags2 ON eps_tags2_xref;
CREATE UNIQUE INDEX all_eps_tags2 ON eps_tags2_xref (eps_id,tags2_id);

/*
 * Make a tag id index to speed deletion (special case)
 */
-- DROP INDEX all_tags ON eps_tags2_xref;
CREATE INDEX all_tags ON eps_tags2_xref (tags2_id);


-- vim: syntax=sql:ts=8:ai:tw=78:et:fo=tcrqn21:comments+=b\:--
  • A new table of tags called ‘tags2’ is defined which is equivalent to the ‘tags’ table we saw in the last episode. It will contain only single instances of each tag. As mentioned before, there is probably no need to hold both the mixed case and lower case versions of the tags.

  • The table ‘eps_tags2_xref’ is the joining table. It contains a column ‘eps_id’ which is the key of a row in the ‘eps’ table. The column ‘tags2_id’ is the key of a row in the ‘tags2’ table.

  • The joining table will contain multiple references to episodes and to tags, but there will never be a row containing both the same episode id and the same tags id combination. This is enforced by the index all_eps_tags2 which indexes the combined keys.

  • The index ‘tags2_tag’ on the ‘tags2’ table ensures that the ‘tag’ field is unique.

  • The index ‘all_tags’ indexes the ‘tags2_id’ column of the ‘eps_tags2_xref’ table where it helps speed up deletion of joining entries when a tag association with an episode is removed (which can happen when editing tags on a show).

Foreign keys

The way in which these new tables are set up does not use the full relational capabilities of MariaDB (“referential integrity”). By default MariaDB database tables do not support foreign keys. This feature can be enabled by defining the tables as having the type InnoDB as opposed to the default MyISAM. None of the tables in the HPR database are defined as InnoDB at the moment.

The concept of foreign keys is a way of making database tables dependent on one another - of defining relationships between them. The field ‘eps_id’ in the ‘eps_tags2_xref’ table is an episode id number in the ‘eps’ table. It should only contain episode id numbers which match those in the ‘eps’ table. Making it a foreign key linked to the ‘id’ field in the ‘eps’ table ensures that this is so.

A similar foreign key is the ‘tags2_id’ field which contains the id number of a tag in the ‘tags2’ table.

Another advantage of having foreign keys here would be that the database itself can ensure consistency. This is achieved with features like cascading deletion. What this means is that if all references to a tag in the ‘eps_tags2_xref’ are deleted, potentially leaving a tag “orphaned”, the database can configured to delete the tag.

Mike Ray’s show referenced earlier (episode 1569) covers these issues very well.

Perl script ‘refresh_tags_2

This script (refresh_tags_2) is quite complex, so we will not look at it in detail here.

In essence it scans the ‘eps’ table in the database, collecting all of the tags stored in CSV form with the episode number they belong to. It also collects the tags already stored in the ‘tags2’ table and stores them with the associated episode number.

Then it can compare the two sets of tags, noting differences. If a new tag has appeared it can add it. If a tag has disappeared it can delete it. It manages the joining table ‘eps_tags2_xref’ with the ‘tags2’ table.

The script also performs actions that the database itself could carry out if the tables allowed foreign keys. As mentioned elsewhere, one of the changes needed to operate the HPR database properly is to enable the foreign key features.

Advantages

  • This method of storing tags is the most efficient one.

  • This method is vastly preferable to the comma separated variable method examined in episode one of this mini-series.

  • It is also preferable to the method shown in the last episode because a given tag is stored only once. This means that making a spelling correction to a tag, for example, need only be done once.

  • With the use of full relational capabilities (foreign keys, cascading deletion) this design will remain self-consistent.

Disadvantages

  • Although this is the best solution in terms of database design the concepts can be a little daunting to people less experienced in the ways of relational databases.

Searching

Finding shows with a given tag

The query needed to find all shows with the tag ‘community’ is a little more complex now since we have to use three tables: ‘eps’, ‘eps_tags2_xref’ and ‘tags2’.

SELECT e.id,e.date,e.title,e.tags
FROM eps e, eps_tags2_xref et, tags2 t
WHERE e.id = et.eps_id
AND et.tags2_id = t.id
AND t.tag = 'community'

Note: Since writing the notes for the last show I have found out how to run database queries from the templating system I use to make show notes, and have used this to make an HTML table, which I hope is clearer.

Show Date Title Tags
1 2007-12-31 Introduction to HPR hpr, twat, community
947 2012-03-20 Presentation by Jared Smith at the Columbia Area Linux Users Group Fedora,community
1000 2012-05-31 Episode 1000 HPR,community,congratulations
1024 2012-07-05 Episode 1024 HPR,community,anniversary
1509 2014-05-15 HPR Needs Shows HPR, shows, request, call to action, community, contribute
1913 2015-12-02 The Linux Experiment linux, the linux experiment, community
2008 2016-04-13 HPR needs shows to survive. HPR,community,shows,call to action,contribute
2035 2016-05-20 Building Community community
2077 2016-07-19 libernil.net and self hosting for friends and family gnu, linux, networking, community, servers, services, commons
2255 2017-03-24 The Good Ship HPR HPR,community,contribution,podcast

Note: since the last episode one more show has been added to the database with the ‘community’ tag.

Finding shows with combinations of tags

The query here is very similar to the equivalent from the last show except that we have another table, the joining table, to refer to:

SELECT e.id,e.date,e.title,e.tags
FROM eps e, eps_tags2_xref et, tags2 t
WHERE e.id = et.eps_id
AND et.tags2_id = t.id
AND t.tag IN ('community','hpr')
GROUP BY e.id

This query looks for shows having either of the two tags, so it is an ‘OR’ operation.

Show Date Title Tags
1 2007-12-31 Introduction to HPR hpr, twat, community
947 2012-03-20 Presentation by Jared Smith at the Columbia Area Linux Users Group Fedora,community
1000 2012-05-31 Episode 1000 HPR,community,congratulations
1024 2012-07-05 Episode 1024 HPR,community,anniversary
1371 2013-11-04 The Lost Banner of HPR hpr,banner
1509 2014-05-15 HPR Needs Shows HPR, shows, request, call to action, community, contribute
1726 2015-03-16 15 Excuses not to Record a show for HPR hpr,podcasting,tips,techniques,kw,knightwise,excuses
1818 2015-07-22 Review of HPR’s Interview Recorder: Zoom H1 Zoom H1, microphone, recording, review, DVR, digital voice recorder, tutorial, getting started, guide, howto, HPR
1877 2015-10-13 Recording HPR on the fly on your Android phone android, hpr, audio, recording
1913 2015-12-02 The Linux Experiment linux, the linux experiment, community
2008 2016-04-13 HPR needs shows to survive. HPR,community,shows,call to action,contribute
2035 2016-05-20 Building Community community
2077 2016-07-19 libernil.net and self hosting for friends and family gnu, linux, networking, community, servers, services, commons
2195 2016-12-30 All you need to know when uploading a show HPR
2245 2017-03-10 Managing tags on HPR episodes - 1 HPR,database,schema,tag
2255 2017-03-24 The Good Ship HPR HPR,community,contribution,podcast
2260 2017-03-31 Managing tags on HPR episodes - 2 HPR,database,schema,tag,SQL,foreign key

Note: since the last episode three more shows have been added to the database with the ‘community’ and ‘hpr’ tags so we have 17 rows returned this time.

If we wanted to look for shows which have both tags, then the following query would be needed:

SELECT e.id,e.date,e.title,e.tags
FROM eps e, eps_tags2_xref et, tags2 t
WHERE e.id = et.eps_id
AND et.tags2_id = t.id
AND t.tag IN ('community','hpr')
GROUP BY e.id
HAVING count(e.id) = 2

This query looks for shows having both of the two tags, so it is an ‘AND’ operation.

Show Date Title Tags
1 2007-12-31 Introduction to HPR hpr, twat, community
1000 2012-05-31 Episode 1000 HPR,community,congratulations
1024 2012-07-05 Episode 1024 HPR,community,anniversary
1509 2014-05-15 HPR Needs Shows HPR, shows, request, call to action, community, contribute
2008 2016-04-13 HPR needs shows to survive. HPR,community,shows,call to action,contribute
2255 2017-03-24 The Good Ship HPR HPR,community,contribution,podcast

This is the same exercise we used in the last show, following droops’ suggestion. Rather than write a script to do this I have just shown the SQL queries here. There are two. The first simply shows the tags on a given episode (2071), and the second queries the database for shows (other than 2071) which have any of the same tags:

SELECT t.tag
FROM tags2 t, eps_tags2_xref et
WHERE et.tags2_id = t.id
AND et.eps_id = 2071
ORDER BY t.tag
SELECT t.lctag, e.id, e.date, e.title, h.host
FROM eps e, hosts h, eps_tags2_xref et, tags2 t
WHERE e.id = et.eps_id
AND e.hostid = h.hostid
AND et.tags2_id = t.id
AND t.tag IN (
    SELECT tag t
    FROM tags2 t, eps_tags2_xref et
    WHERE et.tags2_id = t.id
    AND et.eps_id = 2071
)
GROUP BY e.id
HAVING e.id <> 2071
ORDER BY t.lctag, e.id

Here are the results from the two queries:

Tag
amateur radio
electronics
open source

Tag Show Date Title Host
amateur radio 911 2012-01-27 Hobbies MrX
amateur radio 1036 2012-07-23 Setting up Your First Ham Radio Station Joel
amateur radio 1092 2012-10-08 Ham Radio: The Original Tech Geek Passion MrGadgets
amateur radio 1701 2015-02-09 FOSDEM 2015 Part 4 of 5 Ken Fallon
amateur radio 2240 2017-03-03 Amateur Radio Round Table Various Hosts
electronics 1817 2015-07-21 Gathering Parts NYbill
electronics 1858 2015-09-16 Multimeter Mod’s Part 2 NYbill
electronics 1971 2016-02-22 BlinkStick Dave Morriss
electronics 2029 2016-05-12 The DSO138 Oscilloscope Kit NYbill
electronics 2044 2016-06-02 Bring on the Power! NYbill
electronics 2056 2016-06-20 Interview with a young hacker Tony Hughes AKA TonyH1212
electronics 2148 2016-10-26 The DSO138 Oscilloscope Kit Part 2 NYbill
electronics 2220 2017-02-03 Taking apart a tablet laindir
open source 242 2008-12-03 Open Source in Government Panel Discussion UTOSC
open source 1402 2013-12-17 How I Started Using Linux and Free and Open Source Software Thaj Sara
open source 1529 2014-06-12 TrueCrypt, Heartbleed, and Lessons Learned Ahuka
open source 1641 2014-11-17 The real reasons for using Linux johanv
open source 1653 2014-12-03 Ruth Suehle at Ohio Linux Fest 2014 Ahuka
open source 1682 2015-01-13 Introduction to the Netizen Empowerment Federation daw
open source 1686 2015-01-19 Interview with Joel Gibbard of OpenHand Steve Bickle
open source 1723 2015-03-11 Success With Students Kevie
open source 1736 2015-03-30 How I run my small business using Linux b-yeezi
open source 1783 2015-06-03 Windows To Linux - Better Late Than Never. GNULinuxRTM
open source 1788 2015-06-10 Podcrawl Glasgow 2015 Kevie
open source 1917 2015-12-08 OpenSource.com klaatu
open source 1984 2016-03-10 A Love Letter to linux.conf.au Clinton Roy
open source 2036 2016-05-23 Glasgow Podcrawl 2016 Dave Morriss
open source 2062 2016-06-28 Now The Chips Are Definitely Down MrX
open source 2089 2016-08-04 Solving a blinkstick python problem MrX
open source 2155 2016-11-04 Ohio LinuxFest 2016 Ahuka
open source 2170 2016-11-25 soundtrap.io Ken Fallon
open source 2182 2016-12-13 why say GNU/Linux ? spaceman

Using regular expressions

In the last episode we used the regular expression capabilities of MariaDB to find partial tags. Here is a similar example using the new tables:

SELECT e.id, e.date, e.title, h.host, e.tags
FROM eps e, hosts h, eps_tags2_xref et, tags2 t
WHERE e.hostid = h.hostid
AND e.id = et.eps_id
AND et.tags2_id = t.id
AND t.tag REGEXP '[[:<:]]ham[[:>:]]'
GROUP BY e.id

This query scans four tables:

  • eps’ containing episode data
  • hosts’ containing host information
  • the joining table ‘eps_tags2_xref
  • the tags table ‘tags2

Most of the clauses in the ‘WHERE’ part of the query join together tables to make subsets. The part containing the regular expression looks for any instance of the word ‘ham’ as a distinct word. This means it can be part of a tag but not part of a word.

The ‘GROUP BY’ part ensures that if a tag matches twice only one episode will be returned.

This example is probably more complex than it needs to be.

Show Date Title Host Tags
6 2008-01-08 Part 15 Broadcasting dosman Part 15, HAM, soldering, fcc, radio
1036 2012-07-23 Setting up Your First Ham Radio Station Joel Amateur radio,Ham radio
1092 2012-10-08 Ham Radio: The Original Tech Geek Passion MrGadgets HAM radio,amateur radio,CB radio,Morse code
2041 2016-05-30 Router Antennas More = better ? Lyle Lastinger router,antenna,ham radio
2189 2016-12-22 Working Amateur Radio Satellites Christopher M. Hobbs hamradio, ham, radio, amateur, satellites, projects
2216 2017-01-30 Working AO-85 with my son Christopher M. Hobbs hamradio, ham, radio, amateur, satellites, projects
2226 2017-02-13 FOSDEM 2017 AW Building Ken Fallon FOSDEM 2017, coreboot, GNU GRUB, Olimex, Automotive Grade Linux, Ham radio, CorteXlab, OpenEmbedded
2240 2017-03-03 Amateur Radio Round Table Various Hosts amateur radio, ham

Simplifying things with a VIEW

As an experiment I have included the definition of a ‘VIEW’ which helps to hide some of the complexity of the queries needed to use the many-to-many tables.

The SQL which defines the experimental view is included with this show, and is called eps_hosts_tags_view.sql. This is shown below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
/*
 * Create a view to simplify eps, host and tag access using the many to many
 * tag tables. This view is a demonstration of what could be done in the live
 * database, where more views could be created, of various levels of
 * complexity, depending on need.
 */

CREATE OR REPLACE VIEW eht_view AS
    SELECT
        e.*,
        h.host, h.email,
        t.tag,
        (SELECT group_concat(tag) FROM tags2 t2, eps_tags2_xref et2 WHERE
            et2.tags2_id = t2.id GROUP BY et2.eps_id HAVING et2.eps_id = e.id)
        AS taglist
    FROM eps e, hosts h, eps_tags2_xref et, tags2 t
    WHERE e.hostid = h.hostid
    AND e.id = et.eps_id
    AND et.tags2_id = t.id;

-- vim: syntax=sql:ts=8:ai:tw=78:et:fo=tcrqn21:comments+=b\:--

The view is called ‘eht_view’ and is really a way of storing a ‘SELECT’ query for repeated use. The result is a sort of virtual table which can be used in further queries.

For example:

SELECT id,date,title,host,taglist
FROM eht_view
WHERE tag REGEXP '[[:<:]]solder[[:>:]]'
GROUP BY id;
Show Date Title Host Tag list
941 2012-03-12 Whats in my bag / Portable Apps Digital Maniac windows,screwdriver,flash drive,portable apps,solder
1037 2012-07-24 Soldering Part 1 MrX soldering,soldering iron,solder,dry joint,soldering sponge
1047 2012-08-07 Soldering Part 2: An audio demonstration of soldering MrX strip board,perf board,bread board,flux,multicore solder

Notice that the view contains a sub-SELECT which concatenates all the tags belonging to an episode. This demonstrates that storing the tags in a CSV list as seen in episode 1 is unnecessary.

Conclusion

The HPR database is very much in need of a tag mechanism. In this mini-series we have looked at the present tag storage system and have concluded that it is not a good way to store and access tags. We have looked at a somewhat better way of achieving what is required in show 2, but have concluded that this also has drawbacks. In this third episode we have examined a better way of using a relational database to represent the true relationship between episodes and tags - a many-to-many relationship.

Although it will require some work, it is strongly recommended that we implement a tag scheme this way. It is also recommended that:

  • We enable the foreign key capabilities of MariaDB which will give many advantages when managing these new tables (and others).

  • We look at performing a similar database upgrade to enable the many-to-many relationship of hosts and episodes to be properly represented.

  • Although not as critical as the hosts/episodes relationship we should also set up a many-to-many relationship between episodes and series.

Epilogue

A couple of requests regarding tags:

  1. Please include tags when uploading your shows. Just add a few keywords to the tags field reflecting what your show was about or topics you spoke about.
  2. Note: more contributions to the project to add missing tags will always be welcome! Visit the page on the HPR website listing missing summaries and tags to find out how you could help.