Table of Contents
Introduction
This is the third (and last) show looking at the subject of Managing Tags relating to HPR shows.
In the first show we looked at why we need tags, and examined the advantages and disadvantages of the present system of storage. We considered the drawbacks of this design when searching the tags.
Then in the second show we looked at a simple way of making a tags table and how to query it in order to fulfil the requirements defined in the first show.
In this show we’ll look at a more rigorous, efficient, “normalised” solution.
Database Design
In this episode we have finally arrived at a design which a database designer would choose. It has been a fairly long journey, but the object was to examine alternatives and evaluate them.
3. Many-to-many relationship
In the last episode we looked at a design where we built and managed a tag table. One disadvantage with the method used is that the table contains the same tag multiple times. It also fails to conform to the accepted database design recommendations.
The solution described in show 2 works and is a considerable improvement on the first solution. However, this design does not really reflect the relationship between HPR episodes and tags. This relationship is what database designers call “many-to-many”.
What this means is that a given episode may have many tags and a given tag may be associated with many episodes. This way of doing things was explained very well by Mike Ray in episode 1569, “Many-to-many data relationship howto”. I would recommend you listen to that show if you’d like a good understanding of how to set up such a relationship in a database.
In such a design one copy of each tag would be held in a tags table, and there would be a second linking (or cross-reference) table joining episodes and tags.
The following is a simplistic diagram which represents imaginary show number 1234 in an imaginary episodes
table being associated with the tag “banana
”. The linkage is is via the joining table which shows the association between episode 1234 and the tag which is stored with id 456.
Setting up and managing the tables
As with the previous method, I am using the comma separated list in the ‘eps
’ table to populate the new tables. However, this time I am not trying to write stored procedures and functions to work on these tables, but have developed a Perl script instead, which we will look at later.
The SQL which defines the new tables is included with this show, and is called normalise_tags_2.sql
. This is shown below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 |
|
A new table of tags called ‘
tags2
’ is defined which is equivalent to the ‘tags
’ table we saw in the last episode. It will contain only single instances of each tag. As mentioned before, there is probably no need to hold both the mixed case and lower case versions of the tags.The table ‘
eps_tags2_xref
’ is the joining table. It contains a column ‘eps_id
’ which is the key of a row in the ‘eps
’ table. The column ‘tags2_id
’ is the key of a row in the ‘tags2
’ table.The joining table will contain multiple references to episodes and to tags, but there will never be a row containing both the same episode id and the same tags id combination. This is enforced by the index
all_eps_tags2
which indexes the combined keys.The index ‘
tags2_tag
’ on the ‘tags2
’ table ensures that the ‘tag
’ field is unique.The index ‘
all_tags
’ indexes the ‘tags2_id
’ column of the ‘eps_tags2_xref
’ table where it helps speed up deletion of joining entries when a tag association with an episode is removed (which can happen when editing tags on a show).
Foreign keys
The way in which these new tables are set up does not use the full relational capabilities of MariaDB (“referential integrity”). By default MariaDB database tables do not support foreign keys. This feature can be enabled by defining the tables as having the type InnoDB as opposed to the default MyISAM. None of the tables in the HPR database are defined as InnoDB at the moment.
The concept of foreign keys is a way of making database tables dependent on one another - of defining relationships between them. The field ‘eps_id
’ in the ‘eps_tags2_xref
’ table is an episode id number in the ‘eps
’ table. It should only contain episode id numbers which match those in the ‘eps
’ table. Making it a foreign key linked to the ‘id
’ field in the ‘eps
’ table ensures that this is so.
A similar foreign key is the ‘tags2_id
’ field which contains the id number of a tag in the ‘tags2
’ table.
Another advantage of having foreign keys here would be that the database itself can ensure consistency. This is achieved with features like cascading deletion. What this means is that if all references to a tag in the ‘eps_tags2_xref
’ are deleted, potentially leaving a tag “orphaned”, the database can configured to delete the tag.
Mike Ray’s show referenced earlier (episode 1569) covers these issues very well.
Perl script ‘refresh_tags_2
’
This script (refresh_tags_2
) is quite complex, so we will not look at it in detail here.
In essence it scans the ‘eps
’ table in the database, collecting all of the tags stored in CSV form with the episode number they belong to. It also collects the tags already stored in the ‘tags2
’ table and stores them with the associated episode number.
Then it can compare the two sets of tags, noting differences. If a new tag has appeared it can add it. If a tag has disappeared it can delete it. It manages the joining table ‘eps_tags2_xref
’ with the ‘tags2
’ table.
The script also performs actions that the database itself could carry out if the tables allowed foreign keys. As mentioned elsewhere, one of the changes needed to operate the HPR database properly is to enable the foreign key features.
Advantages
This method of storing tags is the most efficient one.
This method is vastly preferable to the comma separated variable method examined in episode one of this mini-series.
It is also preferable to the method shown in the last episode because a given tag is stored only once. This means that making a spelling correction to a tag, for example, need only be done once.
With the use of full relational capabilities (foreign keys, cascading deletion) this design will remain self-consistent.
Disadvantages
- Although this is the best solution in terms of database design the concepts can be a little daunting to people less experienced in the ways of relational databases.
Searching
Finding shows with a given tag
The query needed to find all shows with the tag ‘community’ is a little more complex now since we have to use three tables: ‘eps
’, ‘eps_tags2_xref
’ and ‘tags2
’.
SELECT e.id,e.date,e.title,e.tags
FROM eps e, eps_tags2_xref et, tags2 t
WHERE e.id = et.eps_id
AND et.tags2_id = t.id
AND t.tag = 'community'
Note: Since writing the notes for the last show I have found out how to run database queries from the templating system I use to make show notes, and have used this to make an HTML table, which I hope is clearer.
Show | Date | Title | Tags |
---|---|---|---|
1 | 2007-12-31 | Introduction to HPR | hpr, twat, community |
947 | 2012-03-20 | Presentation by Jared Smith at the Columbia Area Linux Users Group | Fedora,community |
1000 | 2012-05-31 | Episode 1000 | HPR,community,congratulations |
1024 | 2012-07-05 | Episode 1024 | HPR,community,anniversary |
1509 | 2014-05-15 | HPR Needs Shows | HPR, shows, request, call to action, community, contribute |
1913 | 2015-12-02 | The Linux Experiment | linux, the linux experiment, community |
2008 | 2016-04-13 | HPR needs shows to survive. | HPR,community,shows,call to action,contribute |
2035 | 2016-05-20 | Building Community | community |
2077 | 2016-07-19 | libernil.net and self hosting for friends and family | gnu, linux, networking, community, servers, services, commons |
2255 | 2017-03-24 | The Good Ship HPR | HPR,community,contribution,podcast |
Note: since the last episode one more show has been added to the database with the ‘community’ tag.
Finding shows with combinations of tags
The query here is very similar to the equivalent from the last show except that we have another table, the joining table, to refer to:
SELECT e.id,e.date,e.title,e.tags
FROM eps e, eps_tags2_xref et, tags2 t
WHERE e.id = et.eps_id
AND et.tags2_id = t.id
AND t.tag IN ('community','hpr')
GROUP BY e.id
This query looks for shows having either of the two tags, so it is an ‘OR’ operation.
Show | Date | Title | Tags |
---|---|---|---|
1 | 2007-12-31 | Introduction to HPR | hpr, twat, community |
947 | 2012-03-20 | Presentation by Jared Smith at the Columbia Area Linux Users Group | Fedora,community |
1000 | 2012-05-31 | Episode 1000 | HPR,community,congratulations |
1024 | 2012-07-05 | Episode 1024 | HPR,community,anniversary |
1371 | 2013-11-04 | The Lost Banner of HPR | hpr,banner |
1509 | 2014-05-15 | HPR Needs Shows | HPR, shows, request, call to action, community, contribute |
1726 | 2015-03-16 | 15 Excuses not to Record a show for HPR | hpr,podcasting,tips,techniques,kw,knightwise,excuses |
1818 | 2015-07-22 | Review of HPR’s Interview Recorder: Zoom H1 | Zoom H1, microphone, recording, review, DVR, digital voice recorder, tutorial, getting started, guide, howto, HPR |
1877 | 2015-10-13 | Recording HPR on the fly on your Android phone | android, hpr, audio, recording |
1913 | 2015-12-02 | The Linux Experiment | linux, the linux experiment, community |
2008 | 2016-04-13 | HPR needs shows to survive. | HPR,community,shows,call to action,contribute |
2035 | 2016-05-20 | Building Community | community |
2077 | 2016-07-19 | libernil.net and self hosting for friends and family | gnu, linux, networking, community, servers, services, commons |
2195 | 2016-12-30 | All you need to know when uploading a show | HPR |
2245 | 2017-03-10 | Managing tags on HPR episodes - 1 | HPR,database,schema,tag |
2255 | 2017-03-24 | The Good Ship HPR | HPR,community,contribution,podcast |
2260 | 2017-03-31 | Managing tags on HPR episodes - 2 | HPR,database,schema,tag,SQL,foreign key |
Note: since the last episode three more shows have been added to the database with the ‘community’ and ‘hpr’ tags so we have 17 rows returned this time.
If we wanted to look for shows which have both tags, then the following query would be needed:
SELECT e.id,e.date,e.title,e.tags
FROM eps e, eps_tags2_xref et, tags2 t
WHERE e.id = et.eps_id
AND et.tags2_id = t.id
AND t.tag IN ('community','hpr')
GROUP BY e.id
HAVING count(e.id) = 2
This query looks for shows having both of the two tags, so it is an ‘AND’ operation.
Show | Date | Title | Tags |
---|---|---|---|
1 | 2007-12-31 | Introduction to HPR | hpr, twat, community |
1000 | 2012-05-31 | Episode 1000 | HPR,community,congratulations |
1024 | 2012-07-05 | Episode 1024 | HPR,community,anniversary |
1509 | 2014-05-15 | HPR Needs Shows | HPR, shows, request, call to action, community, contribute |
2008 | 2016-04-13 | HPR needs shows to survive. | HPR,community,shows,call to action,contribute |
2255 | 2017-03-24 | The Good Ship HPR | HPR,community,contribution,podcast |
Finding shows related to a given show using tags
This is the same exercise we used in the last show, following droops’ suggestion. Rather than write a script to do this I have just shown the SQL queries here. There are two. The first simply shows the tags on a given episode (2071), and the second queries the database for shows (other than 2071) which have any of the same tags:
SELECT t.tag
FROM tags2 t, eps_tags2_xref et
WHERE et.tags2_id = t.id
AND et.eps_id = 2071
ORDER BY t.tag
SELECT t.lctag, e.id, e.date, e.title, h.host
FROM eps e, hosts h, eps_tags2_xref et, tags2 t
WHERE e.id = et.eps_id
AND e.hostid = h.hostid
AND et.tags2_id = t.id
AND t.tag IN (
SELECT tag t
FROM tags2 t, eps_tags2_xref et
WHERE et.tags2_id = t.id
AND et.eps_id = 2071
)
GROUP BY e.id
HAVING e.id <> 2071
ORDER BY t.lctag, e.id
Here are the results from the two queries:
Tag |
---|
amateur radio |
electronics |
open source |
Tag | Show | Date | Title | Host |
---|---|---|---|---|
amateur radio | 911 | 2012-01-27 | Hobbies | MrX |
amateur radio | 1036 | 2012-07-23 | Setting up Your First Ham Radio Station | Joel |
amateur radio | 1092 | 2012-10-08 | Ham Radio: The Original Tech Geek Passion | MrGadgets |
amateur radio | 1701 | 2015-02-09 | FOSDEM 2015 Part 4 of 5 | Ken Fallon |
amateur radio | 2240 | 2017-03-03 | Amateur Radio Round Table | Various Hosts |
electronics | 1817 | 2015-07-21 | Gathering Parts | NYbill |
electronics | 1858 | 2015-09-16 | Multimeter Mod’s Part 2 | NYbill |
electronics | 1971 | 2016-02-22 | BlinkStick | Dave Morriss |
electronics | 2029 | 2016-05-12 | The DSO138 Oscilloscope Kit | NYbill |
electronics | 2044 | 2016-06-02 | Bring on the Power! | NYbill |
electronics | 2056 | 2016-06-20 | Interview with a young hacker | Tony Hughes AKA TonyH1212 |
electronics | 2148 | 2016-10-26 | The DSO138 Oscilloscope Kit Part 2 | NYbill |
electronics | 2220 | 2017-02-03 | Taking apart a tablet | laindir |
open source | 242 | 2008-12-03 | Open Source in Government Panel Discussion | UTOSC |
open source | 1402 | 2013-12-17 | How I Started Using Linux and Free and Open Source Software | Thaj Sara |
open source | 1529 | 2014-06-12 | TrueCrypt, Heartbleed, and Lessons Learned | Ahuka |
open source | 1641 | 2014-11-17 | The real reasons for using Linux | johanv |
open source | 1653 | 2014-12-03 | Ruth Suehle at Ohio Linux Fest 2014 | Ahuka |
open source | 1682 | 2015-01-13 | Introduction to the Netizen Empowerment Federation | daw |
open source | 1686 | 2015-01-19 | Interview with Joel Gibbard of OpenHand | Steve Bickle |
open source | 1723 | 2015-03-11 | Success With Students | Kevie |
open source | 1736 | 2015-03-30 | How I run my small business using Linux | b-yeezi |
open source | 1783 | 2015-06-03 | Windows To Linux - Better Late Than Never. | GNULinuxRTM |
open source | 1788 | 2015-06-10 | Podcrawl Glasgow 2015 | Kevie |
open source | 1917 | 2015-12-08 | OpenSource.com | klaatu |
open source | 1984 | 2016-03-10 | A Love Letter to linux.conf.au | Clinton Roy |
open source | 2036 | 2016-05-23 | Glasgow Podcrawl 2016 | Dave Morriss |
open source | 2062 | 2016-06-28 | Now The Chips Are Definitely Down | MrX |
open source | 2089 | 2016-08-04 | Solving a blinkstick python problem | MrX |
open source | 2155 | 2016-11-04 | Ohio LinuxFest 2016 | Ahuka |
open source | 2170 | 2016-11-25 | soundtrap.io | Ken Fallon |
open source | 2182 | 2016-12-13 | why say GNU/Linux ? | spaceman |
Using regular expressions
In the last episode we used the regular expression capabilities of MariaDB to find partial tags. Here is a similar example using the new tables:
SELECT e.id, e.date, e.title, h.host, e.tags
FROM eps e, hosts h, eps_tags2_xref et, tags2 t
WHERE e.hostid = h.hostid
AND e.id = et.eps_id
AND et.tags2_id = t.id
AND t.tag REGEXP '[[:<:]]ham[[:>:]]'
GROUP BY e.id
This query scans four tables:
- ‘
eps
’ containing episode data - ‘
hosts
’ containing host information - the joining table ‘
eps_tags2_xref
’ - the tags table ‘
tags2
’
Most of the clauses in the ‘WHERE
’ part of the query join together tables to make subsets. The part containing the regular expression looks for any instance of the word ‘ham’ as a distinct word. This means it can be part of a tag but not part of a word.
The ‘GROUP BY
’ part ensures that if a tag matches twice only one episode will be returned.
This example is probably more complex than it needs to be.
Show | Date | Title | Host | Tags |
---|---|---|---|---|
6 | 2008-01-08 | Part 15 Broadcasting | dosman | Part 15, HAM, soldering, fcc, radio |
1036 | 2012-07-23 | Setting up Your First Ham Radio Station | Joel | Amateur radio,Ham radio |
1092 | 2012-10-08 | Ham Radio: The Original Tech Geek Passion | MrGadgets | HAM radio,amateur radio,CB radio,Morse code |
2041 | 2016-05-30 | Router Antennas More = better ? | Lyle Lastinger | router,antenna,ham radio |
2189 | 2016-12-22 | Working Amateur Radio Satellites | Christopher M. Hobbs | hamradio, ham, radio, amateur, satellites, projects |
2216 | 2017-01-30 | Working AO-85 with my son | Christopher M. Hobbs | hamradio, ham, radio, amateur, satellites, projects |
2226 | 2017-02-13 | FOSDEM 2017 AW Building | Ken Fallon | FOSDEM 2017, coreboot, GNU GRUB, Olimex, Automotive Grade Linux, Ham radio, CorteXlab, OpenEmbedded |
2240 | 2017-03-03 | Amateur Radio Round Table | Various Hosts | amateur radio, ham |
Simplifying things with a VIEW
As an experiment I have included the definition of a ‘VIEW
’ which helps to hide some of the complexity of the queries needed to use the many-to-many tables.
The SQL which defines the experimental view is included with this show, and is called eps_hosts_tags_view.sql
. This is shown below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
|
The view is called ‘eht_view
’ and is really a way of storing a ‘SELECT
’ query for repeated use. The result is a sort of virtual table which can be used in further queries.
For example:
SELECT id,date,title,host,taglist
FROM eht_view
WHERE tag REGEXP '[[:<:]]solder[[:>:]]'
GROUP BY id;
Show | Date | Title | Host | Tag list |
---|---|---|---|---|
941 | 2012-03-12 | Whats in my bag / Portable Apps | Digital Maniac | windows,screwdriver,flash drive,portable apps,solder |
1037 | 2012-07-24 | Soldering Part 1 | MrX | soldering,soldering iron,solder,dry joint,soldering sponge |
1047 | 2012-08-07 | Soldering Part 2: An audio demonstration of soldering | MrX | strip board,perf board,bread board,flux,multicore solder |
Notice that the view contains a sub-SELECT
which concatenates all the tags belonging to an episode. This demonstrates that storing the tags in a CSV list as seen in episode 1 is unnecessary.
Conclusion
The HPR database is very much in need of a tag mechanism. In this mini-series we have looked at the present tag storage system and have concluded that it is not a good way to store and access tags. We have looked at a somewhat better way of achieving what is required in show 2, but have concluded that this also has drawbacks. In this third episode we have examined a better way of using a relational database to represent the true relationship between episodes and tags - a many-to-many relationship.
Although it will require some work, it is strongly recommended that we implement a tag scheme this way. It is also recommended that:
We enable the foreign key capabilities of MariaDB which will give many advantages when managing these new tables (and others).
We look at performing a similar database upgrade to enable the many-to-many relationship of hosts and episodes to be properly represented.
Although not as critical as the hosts/episodes relationship we should also set up a many-to-many relationship between episodes and series.
Epilogue
A couple of requests regarding tags:
- Please include tags when uploading your shows. Just add a few keywords to the tags field reflecting what your show was about or topics you spoke about.
- Note: more contributions to the project to add missing tags will always be welcome! Visit the page on the HPR website listing missing summaries and tags to find out how you could help.
Links
- Previous shows in this set:
- “Managing tags on HPR episodes - 1” - episode 2245
- “Managing tags on HPR episodes - 2” - episode 2260
- Mike Ray’s show on this subject: episode 1569, “Many-to-many data relationship howto”
- HPR page listing shows missing summaries and tags
- Included files:
- Image representing a many to many relationship:
table_relationships.svg
- SQL to create the tables:
normalise_tags_2.sql
- SQL to create a view:
eps_hosts_tags_view.sql
- Perl script to manage the tables:
refresh_tags_2
- Image representing a many to many relationship: