This MHPR episode 2,245 entitled Managing Attack on HP R Episode 1. It infested my name Morris and in about 26 minutes long and carry my next visit flag. The summary is, looking for the best way to store and Managing Attack in the HP R Nature Base, Part 1. This episode of HP R is brought to you by an honest host at com. Get 15% discount on all shared hosting with the offer code HP R15. It's HP R15. Bit your web hosting that's honest and fair at an honest host.com. Hello everybody this is Dave Morris welcome to Hack Public Radio. I'm starting a series of three shows today and it's on the subject of the tags that we ask you to add to shows when you submit them and which we store in the HP R database. There was a recent query on the HP R mailing list from Clinton Roy asking if we had a search interface for these tags and the answer had to be no. There's partly because we haven't yet completely resolved how we're going to store and manage them. Thinking about this for some number of months and I wanted to give you the benefit of what I've been doing and to stimulate some conversation about it. So to introduce it then just I should say there's some long notes that go with this and some short notes that you'll see on the website. We've been collecting and storing tags for your HP shows for a bit of time now. I think a couple of years you and I'm not sure the precise time but we've been doing it for for enough time to make a difference and we planned originally to offer some sort of search interface we still do and in that time I put together a mechanism for adding tags to existing shows and a number of people have been helping out with doing this including myself and that's been happening since August 2015. It's a little of way to go but we are making slow progress at the time of recording this which is the 31st of January 2017 for future lessons. We've processed 56% that's 1,249 shows of the 2,217 we have in the system. Now in recent times the way in which we could or should use these tags was discussed. For example in show 2035 in 2016 troops made a suggestion which was the website which is a lot of work needs to have related shows listed on each individual shows page. This will take a tag system and someone to tag all of the almost uncountable previous episodes. So yeah it's not uncountable but it's a lot and we do have a tag system but I'm not happy with it and I'm pre-empting the rest of the discussion there. The episode 2035 begins a discussion about some of the ways that tags can be stored managed and accessed efficiently in the database so it's prompted this show in other words and the two that follow. So I started putting this particular show together which I am at envisages being one in summer of 2016 but it's grown the whole thing has grown a lot since then and there's a lot so I've split it into three. I should warn you that the subject gets quite technical in the later shows. I'm discussing database design techniques and all three of the shows contain examples of database queries and scripts and if you're not interested in this subject we're not really interested in databases shall we say then feel free to skip. However you might find this first episode a bit more palatable than the next two and whatever it would be very helpful if you have any thoughts on the subject that you would contribute to them to us in some form or other. The whole subject of databases and so forth has been I guess hanging in the air for some time. There's there has been discussion as to whether we should use a database at all to hold our HPR shows and there's also been discussion about whether we should put together a static site rather than the current dynamic one. The static site idea has been motivated by security considerations amongst other things. I imagine that such a static site would probably be best generated from a database. There you could argue against that I suppose but databases are good at doing the stuff that we currently do so even with a static site it would be good to have a database behind it but even if we have a database as we do now there are quite a lot of opinions about how it should be put together. Part of this discussion is motivated by the fact that database administration skills are quite specialized and the concept behind databases can be a little difficult to comprehend when first encountered. One view has been that using any of the advanced features of the chosen database system which is mySQL or MariaDB is now and they should be avoided because doing so will be too complicated for future voluntary HPR administrators to maintain. The types of capabilities that would be desirable and this is my list I guess would be that it would be good if we could allow there to be more than one host associated with a show. We have had and will continue to have instances where more than one person is on a show and their co-hosts or whatever and we don't have any means of signifying this at the moment. We do have entries in the database where we have hosts in quotes who are written down as two people so you know a person like person B and they're a separate host from person A and person B which doesn't like any sense at all to my mind. We'd also quite we'd also like to have shows associated with more than one series though that would be good though not as important as the host thing and of course this how we would implement attack system. The way that the database which is a relational database is this reasonably advanced and at the moment it's been treated as if it's just a bunch of spreadsheets maybe that's a little bit harsh I don't know but it's sunny it's full capabilities are not really being used so I want to enlarge on this in the this show and the later ones so let's look at tags and their potential uses why have tags at all well first point would be using them as a way of showing a relationship between shows so as groups suggested if a particular show has tags associated with it and all of the shows which share any of the same tags should be related we'd expect them to be related and they should be linked or be linkable from that show in some way so displaying these sorts of relationships list of other shows perhaps would be helpful in finding other shows worth listening to because at the moment it can be hard to find it is hard to find shows with a particular topic other than by scanning through there the titles or using Google search on the site which effectively does that it does actually it wouldn't use the tags because the tags are displayed one of the views on the on the side but it's not ideal. I thought it might be useful to be able to list shows by tags I'm not sure if anybody would agree with this but I thought perhaps there might be some mileage in having a sort of master show index so I'm thinking of something like the index for book with the the name of a tag on the left hand side say and a show number link on the other side otherwise we could have a table listing tags with the show number entitled to the right of the of the tag it's we're we're we're talking about two thousand plus two thousand two hundred plus shows a moment so that's not a huge display I'd like to experiment with it personally just to see it would be usable you definitely can have multiple shows next to a given tag at least when the tag system is has been populated the third point of course is using queries against the tag you might want to do something like looking for the tags into view and either I'll camp or FS cons for example the I'd mention it's because they that combination exists so you might be looking for shows which have been tagged as interviews either I'll camp or at FS cons this is not particularly difficult to implement the database level but it would require something quite sophisticated on the front end to to make a usable interface on the website and that implies JavaScript which is never popular there is though possibility of using a URL based interface so you could invoke this as a as a query in your browser your browser's address line I guess I don't know how popular that would be however still think that having some means of querying is is desirable and I'm doing this show because that is a thing that has been requested in various forms so let's look at database design then how could you implement a tag system in the HBR database? Currently it holds tags as a comma separated list and this is exactly the same way that we request them to be entered into the submission form when an episode's being uploaded we don't currently reformat these in any standard way so although they're comma separated entities words or phrases they don't conform to the CSV standard which I've referenced here as RSC4108 though the reason why using such a standard is desirable is you might want to tag that's got a comment and I've given an example where you might have tags on a show which might be TED and technology comma education and design which is the full meaning of that of TED so the expectation would be the second full instance with the comma in it was a single tag so you would need to limit it in some way to make sure that the comma wasn't interpreted does a tag separator now the storage that's currently allocated for the tag list per episode is only 200 characters and this is quite restrictive and there are some longer shows that would benefit from more tags that might fit in this field and the longest tag list currently in the database I did a quick scan to see is 196 characters so it's almost we've almost got an overflow already think that tag list was was created by myself but I think the point stands so the question is then what would be the best way to store tags so I'm going to look at what we currently have today on this episode and later we'll look at the two other different solutions and I'll comment about each one of them just to give a disclaimer I'm not a trained database designer I worked a database as a fair bit over the past 15 years or so but I'm largely self-taught and listeners to this episode may well have better ideas for doing what's needed and hopefully we'll have had proper training and much more experience than I've had so if that's the case then please add your ideas and suggestions either to the comments or sending email to the HBR list or indeed email me directly if you prefer I'd like that to be a community discussion about this if possible so let's look at the tags as a comma separated list then there's a field called tags in the table that holds episodes which is called EPS and it this table holds all of the episode detail so what's good about doing this but it's simple to store such tags just place words and phrases in a comma separated list stick them into the tabs tags field it's easy to maintain we just add tags to the field or remove them with relative ease and I do this to when I add tags to to shows which I do if they're missing when show gets uploaded to the internet archive and I've actually written a little script that calls the vim editor to do it what's the what are the disadvantages well the current field is small and as I've said can be easily exceeded and there are shows which might need quite a lot of tags and particularly the one that's got 196 on it 100 no six characters on it is one of the new years shows and people like to list things like software or hardware and so on which as an argument for tagging I think the tags should be pre-processed or processed at some stage to ensure that they can form to the CSB format preferably before storage so that business of having a tag with the comma and it needs to be dealt with the format is quite wasteful since the same tags are duplicated throughout in the various episodes so for example there are 84 instances of Linux in the tags for example so it's it's all stored there whereas really you just want one and you want to be able to point to all the episodes that use it store Linux once and point to the show in particular this format is difficult to search efficiently and I'm going to go into detail about this in a minute and in the database it's necessary to search the tags field of each show with a string search to find stuff and this is expensive to run and scales badly now I know it's this is actually a tiny database I mean I've got this number of shows we store the show notes in the database too but it's still very small in database terms so efficiency arguments can be shot down on the basis that the cost is low on a on a system which is so small but I still think it's it's a it's a bad thing to do because there are more efficient ways of doing it let's put it that way and the more efficient ways of doing it give you benefits which I'll talk about database is usually use some sort of an index on tables and fields to help optimize searches and this is very very difficult to do when you've got a single field holding multiple tags I don't actually know how you would do this I'm sure there are ways in which it could be done but it would be there's a full text search capability for example in my SQL and really DB haven't looked into it but I I feel that that would be a bad solution anyway so that's why I've not looked at it much so let's look at the searching issue then so this apps table it's the thing that holds the show number the release date the episode title etc etc and the tags in the database so if we wanted to find all shows with a given tag we could take for example the the show groups recorded called building community episode 235 which has a single tag the word community so we wanted to find other shows with this tag I've put here a query that could possibly be used to do it now I'd want to give a lesson in structured query language or SQL which is what this is but essentially what we have here is a select command I guess which initiates a query it's just the verb that's used to query a database to find information retrieve information it contains also the verb from so you select some stuff from and then the name of a table between the select and the from are the fields you want out of the table and then following the from and the table name is a where verb which introduces the filtering part of the query and this is the thing where you would say show me only those rows in the table in the database which match these criteria so in this particular case we've got select ID comma date comma title from apps the table I mentioned before where and then we have tags like quote percent community percent close quote semi-colon they need semi-colons on the end what this query will do is find all rows in the apps table with community in the tags field now the word verb can be followed by an equal sign can contain an equal sign and it's expression but we can't use that here because there there won't it's rare that the tags field will only contain the word community it will in the case where we're looking at but we won't elsewhere we're looking for the word community and the list of other words so the like verb that's being used here it's an operator really uses wild cards and in sequel the wild card used with like is the percent percent is the sort of whatever so percent community percent means community followed um procedure by and followed by anything including nothing comparisons in my SQL and Maria to be a not case sensitive so we don't care about the case now when we run this we get back 95 rows and the reason we get so many that's the wrong number I should say is because community news shows all have the tags community new and what we're doing here doesn't differentiate between community as a tag all of its own and as part of another tag so I've shown this query again which is the same except that the the string after like is percent comma community comma percent and this works when we get three hits but it doesn't include show two or three five because in that case there weren't commas in the tags field there was just one word there my next example shows a whole bunch of ways in which you can say where tags like and then a string or tags like other string or blah blah blah there's four possible ways of doing it but that produces 95 results again because it doesn't although it's complicated it doesn't exclude the community news case now Maria db and my SQL can handle limited regular expressions I put limited because regular expressions are disappointingly lacking in features anyway that's personal prejudice so instead of where tags like you can do where tags and then the word rejects before or by in single quotes a regular expression my regular expression is open parenthesis circumflex vertical bar comma closed parenthesis community open parenthesis comma vertical bar dollar closed parenthesis so what that saying is community preceded by either the start of the line or a comma and followed by either a comma or the end of the line so that's now we're getting a much closer to sort of thing that we would want to do to get the tags that are just the word community somewhere other when you run this it returns five matches but there are more I know there are more show you in a minute but that's because some contain spaces after the comma or before the comma in some cases and because we're not cleaning them up it's not getting them all so finally in this in these examples here then I show an example where the same expression contains a space and an asterisk meaning 1 or 0 or more spaces then community then space asterisk after community so it's catering for the case there might be spaces before or after community and in that case we get back nine matches and I've run the queries and put the output into the notes so you can see the sort of thing you get back nine matches is correct I do believe anyway so you could use the regular expression thing to find multiple tags and I've got an example here where I won't read this one because they're now getting quite complex and just just be with me and believe what I say and you can look to to see whether it's turning the truth or not in this particular case I'm looking for the string HPR or the string community and I get 14 matches back and included the tags in the the result of the query so you can see what matched if we wanted to find HPR and community because that first that query before was anything that had HPR and the tags or anything that contained community in the tag some of these contain both but it it's select all of them so if you want to use and then we need two regular expressions so it's got even more complicated in this case we get back five hit there is a find in set function within MariaDB and HiSQL and that's capable of parsing comma separated list so it is possible to do a more complicated search at least a more and I guess yeah a little bit less complicated actually after the where we have the function find underscore in underscore set open breathuses and in quotes community comma tags that's the community is what you search and foreign tags is is the field you're looking in and you get back a number from that call so you're looking for anything greater than zero this works and but only gives us five matches on the community tag because of the the space problem I mentioned before so regular expression a thing might be better until the tags are cleaned up now how would you do the thing that groups suggested given a show number extract its tags and use them to perform a search for other shows matching each tag well this is quite difficult to do using this this thing I don't really know of a way of doing it just in plain SQL you need to make one query in script say one query to get the target show then script would need to take all of the tags from that show and in a loop would need to query to return the details of all of the shows with those tags and a single query to do that is beyond my capabilities though it would be possible to write a script to do it or something in PHP I think the point though the point I'm trying to make anyway is that using a comma separated tag feel is not a good way of doing things because it gives you a lot more problems than you really need and I say in my conclusion that doing this way is not not desirable and even if we did continue to use it this way we'd need to increase the size of the tags field I've actually put in an example of how you would do that in using structured query language not really very relevant but anyway it's there you'd need to make sure that all of the tags were properly CSV formatted I actually wrote a little poll script to do this which I call clean CSV tags I've included it with this show in case you're interested and that does the job I've run it against a copy of the HPR database that I hold and it's good it puts quotes around everything takes out all the spirit spaces and stuff ideally I'd like to be processing the tags for spelling in case there were any any misspelled tags and there's an issue with plurals I think and I think that a tag if you had two tags in two different shows one of which was a counting an example banana banana and one case and bananas in the other case then a lot of tag matches would regard those as different even though one's just the plural of the other so it would be better to process this stuff and remove unnecessary unnecessary plurals and another thing that sometimes happens I don't as many cases is that people put unnecessary postrophies in words and it would be good to spot and remove these I haven't done anything with that though I can imagine the little script that I wrote could be modified to to do that so we could live with it but I really don't think we should so I've included what I call an epilogue at the end of this which is just to say yeah could you please include tags when uploading your shows just a few key words that reflect what you show is about or the topics you spoke about would be great there are several sitting in the queue to be uploaded which don't have tags which means I I feel I need to add them because it adds a lot to them both eventually on HPR but also on the archive.org site and the other point is if you feel motivated to help with the missing tags problem then it would be most appreciated and I've linked to where you can go to find out about what's needed all right okay well we'll call it quits there I hope that that covered the subject and has given you food for thought and in the next episode I will be talking about my first solution to this problem using database techniques to achieve it okay thanks a lot then by now you've been listening to Hecker Public Radio at Hecker Public Radio.org we are a community podcast network that release the shows every weekday Monday through Friday today's show like all our shows was contributed by a HPR listener like yourself if you ever thought of recording a podcast and click on our contribute in to find out how easy it really is Hecker Public Radio was found by the digital.com and the informomicon computer club and it's part of the binary revolution at binref.com if you have comments on today's show please email the host directly leave a comment on the website or record a follow up episode yourself on their otherwise status today's show is released on the creative comments description share a like that