Bash snippet - using coproc with SQLite (HPR Show 3413)

Sending multiple queries to a running instance of sqlite3

Dave Morriss


Introduction

I am in the process of rewriting some scripts I use to manage Magnatune albums. I’m a lifetime Magnatune member and have access to the whole music collection. I wrote a script for downloading albums and placing them in my ~/Music directory which I talked about in 2013 (show 1204). The original scripts are still available on GitLab and I know of one other person who made use of them!

Since 2013 I have written a few other support scripts, for example one to manage a queue of albums I want to buy and download, and one which summarises the state of this queue.

It’s this 'show_queue' script I am currently updating (the original version, called show_queue_orig, is available in the resources for this show). The original script took Magnatune album URLs from a file (acting as a queue of stuff I wanted to buy), parsed out a piece of each URL and used it to grep a pre-prepared summary in another file. This file of summaries had been made from a master XML file provided by Magnatune (see update_albums on GitLab).

Magnatune has moved away from this master XML file to a SQLite database in recent years, so I want to perform a database lookup for each URL to list its details.

The first version of the new script wasn’t difficult to write: just extract the search data as before and run a query on the database using this data. I have included this script which I call show_queue_db_1 amongst the resources for this episode, so you can see what I’m talking about – and what I want to improve on. It felt bad to be performing multiple calls on the sqlite3 command in a loop, so I looked around for an alternative way.
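
In outline, the pattern that felt wrong was something like this. It is only a sketch of the idea, not the actual show_queue_db_1: the query is abbreviated, and $DB and $QUEUE stand for the database and queue files you will see in the final script:

#
# Sketch only: one sqlite3 invocation per queued URL
#
while read -r URL; do
    SKU="${URL##*/}"        # the SKU is the final component of the URL
    sqlite3 -line "$DB" "SELECT name, sku FROM albums WHERE sku = '$SKU';"
done < "$QUEUE"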

In April 2019 clacke did a show (number 2793) about the Bash coproc command.

This command creates a subshell running a command or group of commands which is connected to the calling (parent) process through two file descriptors (FDs). It’s possible for the calling shell to write to the input descriptor and read from the output one and thereby communicate with whatever is running in the subshell.

I was vaguely aware of coproc at the time of clacke’s show but hadn’t looked into it. I found the show fascinating but didn’t have a use for the feature at the time.

To solve my need to show my Magnatune queue of future purchases, it looked as if a sqlite3 instance running in a subshell could be given queries one after the other and return the answers I needed. My journey to a Bash script using coproc then followed.

Details

My ‘pending’ queue

As I said, I have a file containing Magnatune URLs of the form:

http://magnatune.com/artists/albums/antiqcool-acousticguitar

The final component is the 'SKU', which is the key to the album in the system. In the original XML-based system I see the following example information when I run my current script:

Artist: Antiqcool
Album:  Original Instrumental Acoustic Guitar Songs Vol 1
Genres: Alt Rock,Folk-Rock,Instrumental Rock
Code:   antiqcool-acousticguitar
----

I store the URLs because that’s the form in which Magnatune sends album details in their periodic email messages about new albums, so it’s easy to cut and paste them into the queue file.

The original show_queue script just reads this queue file in a loop and looks up the SKU in a file of reformatted XML information. As mentioned, I have included this script for reference as one of the resources accompanying this show (show_queue_orig).

More about coproc

In clacke’s HPR show 2793 he described the coproc command in some detail, with examples of how it behaves.

It is documented in a fairly terse fashion in the Bash Reference Manual (see link below).

The coproc command

In essence coproc runs a command as a subshell (or coprocess) with a two-way pipe connected to it. This allows the shell which generated it to write to and read from the coprocess.

The command format is:

coproc [NAME] command [redirections]

The syntax is a little strange to my way of thinking. The use of the name depends on the type of command. If it’s a simple command then no name can be provided, and the default of COPROC is used, so you can only run one such coprocess at a time. The alternative to a simple command is a command group enclosed in braces or parentheses; in that case the user-supplied name is used, or COPROC again if no name is given.

The relevance of the name is that it is used to create variables relating to the coprocess. There’s a variable ‘name_PID’ which holds the process ID number of the subshell (coprocess), and an indexed array, described next.

The two-way pipe for communicating with the coprocess is connected to two file descriptors in the executing shell. These are stored in an indexed array called name. Element zero contains the descriptor for the standard output of the coprocess, and element 1 the descriptor for standard input.
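
A minimal sketch may make these variables more concrete. The name 'echoer' and the choice of cat as the coprocess command are mine, purely for illustration:

coproc echoer { cat; }                # a command group, so the name 'echoer' is used
echo "Coprocess PID: $echoer_PID"     # the name_PID variable
echo "Read FD: ${echoer[0]}, write FD: ${echoer[1]}"

echo "hello" >&"${echoer[1]}"         # write to the coprocess's standard input
read -r reply <&"${echoer[0]}"        # read one line from its standard output
echo "$reply"                         # prints 'hello'

exec {echoer[1]}>&-                   # close its input so cat sees EOF and exits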

Note: I haven’t talked about Bash file descriptors in my Bash scripting shows but plan to do so before long.

Simple usage

Here’s an example of running the date command in a coprocess. Not a lot of point in doing this of course, but it might help explain things reasonably well:

$ coproc date; cat <&"${COPROC[0]}"
[1] 40206
Thu 29 Jul 10:32:15 BST 2021
[1]+  Done                    coproc COPROC date
  • The coproc command is followed by the command to run - 'date'. Then, after a semicolon the output from the coprocess is displayed with cat. We do this on the same line because the coprocess will finish very quickly and will delete the COPROC array, making it impossible to see the output.
  • The first line beginning '[1]' shows the PID of the coprocess (an examination of the process hierarchy should show this to be a subprocess of the calling process)
  • The next line shows the date returned from the coprocess via cat
  • The second line beginning '[1]' shows that the coprocess has finished.

If for any reason you have a coprocess that continues to run unexpectedly you can look for it with the Bash 'jobs' command – this is job 1 in the above case, as shown by the '[1]' followed by the PID. You can kill the job with the command 'kill %1'.

Gory details

I have tried to learn the intricacies of coproc since deciding to use it, but can’t say that I fully understand all the details yet!

Using coproc for single line stuff is not too difficult, but seems quite pointless. However, things get a lot more complex when dealing with coprocesses that receive and send multiple lines of data.

The issues which complicate matters are:

  1. It’s your coprocess, so you know whether or not it expects input. However, you can only assume it is receiving what you send by the way it responds, and buffering problems may complicate this.
  2. When you’ve finished sending stuff to the coprocess and you want to tell it it’s all done you must close the input file descriptor. You do this with the rather arcane command 'exec {NAME[1]}>&-' where 'NAME' is the name you assigned, or the default assigned for you ('COPROC'). This is an example of the operator described in the Bash Reference Manual under Duplicating File Descriptors, though the manual doesn’t explain how to use that operator with a FD held in an array! There is a bare-bones illustration of the close operator after this list, and the subject is on the list for the Bash Tips series in due course.
  3. You can’t really tell how much output to read from the coprocess, and you may be dealing with something that performs I/O buffering so it will hold on to its output unexpectedly. This can cause a deadlock if you get it wrong! A deadlock is where the parent and child processes are both waiting for the other process to do something.
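
Here is the promised bare-bones illustration of the close operator on its own, with a plain numeric file descriptor, a made-up file name and no coprocess involved:

exec 3>/tmp/fd_demo      # open file descriptor 3 for writing (the file name is arbitrary)
echo "hello" >&3         # write to FD 3
exec 3>&-                # close FD 3; further writes to it would fail

In the coprocess case the descriptor number isn’t known in advance, which is why the '{NAME[1]}' form is needed to supply it from the array.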

I have written a coproc example in the form of a Bash script. It’s called coproc_test.sh:

#!/bin/bash -

#
# Use bash in the coprocess but turn off buffering
#
process='stdbuf -i0 -o0 -e0 bash'

#
# Indexed array of bash commands
#
declare -a com=('date +%F' 'whoami' 'id' 'echo "$BASH_VERSION"'
    'printf "Hello\nWorld\n"')

#
# Count commands in the array
#
n="${#com[@]}"

#
# Start the coprocess
#
coproc child { $process; }

#
# Loop through the commands
#
i=0
while [[ $i -lt $n ]]; do
    # Show the command
    echo "\$ ${com[$i]}"

    # Send to coprocess
    echo "${com[$i]}" >&"${child[1]}"

    # Read a line from the coprocess
    read -u "${child[0]}" -r results

    # Show the line received
    echo "$results"

    ((i++))
done

#
# Send an EOF to the coprocess (if needed)
#
if [[ -v child_PID ]]; then
    echo "-- End --"
    exec {child[1]}>&-

    # Flush any remaining results
    cat <&"${child[0]}"
fi

The key points are:

  • The coprocess is running an instance of bash which expects commands and returns results
  • The command to be run is preceded by 'stdbuf -i0 -o0 -e0' which turns off all buffering
  • Commands to be sent to it are in an array and are fed to it one at a time
  • I use a counter and a while loop to do this, but could just as well have used for (( i = 0; i < n; i++ )).
  • Each time a command is sent one line is read back (using a read command on the FD in array child[0]) and displayed
  • An end of file is sent to the coprocess by closing the input channel, but only if it’s still running
  • I use a cat command to flush any remaining output after closing the FD.
  • Most of the commands generate one line of output, but the last one ('printf') creates two. Only one is read in the loop, but the cat returns the other after the input FD has been closed.
  • What would happen if a command was sent which produced no output? Try adding ‘:’ to the list of commands (this being a “null” command in Bash). The script will hang waiting for output that will never come. Adding a timeout to the read could be a way to avoid this problem, as sketched below.
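
For example, the read in the loop above could be replaced with something like the following. This is just a sketch of the idea, not part of the coproc_test.sh in the resources, and the two-second timeout is an arbitrary choice:

    # Give up after 2 seconds if the command produced no output
    if read -u "${child[0]}" -t 2 -r results; then
        echo "$results"
    else
        echo "(no output received before the timeout)"
    fi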

Running the unmodified coproc_test.sh gives the following output:

$ date +%F
2021-08-21
$ whoami
cendjm
$ id
uid=1000(cendjm) gid=1000(cendjm) groups=1000(cendjm),4(adm),24(cdrom),25(floppy),29(audio),30(dip),44(video),46(plugdev),108(netdev),110(lpadmin),114(bluetooth),115(scanner)
$ echo "$BASH_VERSION"
5.1.4(1)-release
$ printf "Hello\nWorld\n"
Hello
-- End --
World

Note that the two-line printf at the end gets its first line displayed in the loop, then the loop ends and the script detects that the coprocess is still running, writes '-- End --' and then flushes the remaining line.

Also note that the job control messages we saw in the simple example above are disabled by Bash when coproc is run from within a script.

I’m not sure that this example shows anything useful, however. It seems more of a novelty than anything else!

Coprocesses in gawk

In the spirit of enquiry I wrote a brief gawk script to do largely the same as the previous example. The coprocess features are not available in plain awk; they are a GNU extension. This script, called coproc_test.awk, has been included in the resources for this show, where it can be downloaded.

I will not cover it any further in this show.

My eventual script using coproc

I found out how to make coproc do what I wanted, but I spent a long time working it out. Looking back, I got an HPR show out of it (though I doubt whether anyone will thank me for it!), and answered my question, but it probably wasn’t worth it.

The eventual script is presented here and is a resource available for download (show_queue_db_2).

#!/bin/bash -
#===============================================================================
#
#         FILE: show_queue
#
#        USAGE: ./show_queue
#
#  DESCRIPTION: Show the pending queue, expanding each album's details from
#               the database.
#
#               / This version calls sqlite3 once and feeds it queries in a loop /
#
#      OPTIONS: ---
# REQUIREMENTS: ---
#         BUGS: ---
#        NOTES: ---
#       AUTHOR: Dave Morriss (djm), Dave.Morriss@gmail.com
#      VERSION: 0.2.1
#      CREATED: 2020-09-15 12:38:03
#     REVISION: 2021-08-05 16:39:05
#
#===============================================================================

set -o nounset                              # Treat unset variables as an error

SCRIPT=${0##*/}
#DIR=${0%/*}

VERSION='0.2.1'

#
# Files and directories
#
BASEDIR="$HOME/MusicDownloads"
DATADIR="$BASEDIR/Magnatune_Data"
SCRIPTDIR="$BASEDIR/magnatune-downloader"

QUEUE="$SCRIPTDIR/pending"
DB="$DATADIR/sqlite_normalized.db"

#
# Sanity checks
#
[ -e "$QUEUE" ] || { echo "$QUEUE not found"; exit 1; }

#
# Check the queue contains data
#
if [[ ! -s $QUEUE ]]; then
    echo "$SCRIPT($VERSION): there is nothing in the queue"
    exit
fi

RE='^http://magnatune.com/artists/albums/([A-Za-z0-9-]+)/?$'

#
# Template SQL for printf
#
SQLtemplate=$(cat <<ENDSQL1
SELECT
    ar.name AS Artist,
    al.name AS Album,
    group_concat(distinct ge.name) AS 'Genre',
    group_concat(distinct sg.name) AS 'Subgenre',
    al.sku as Code
FROM albums al
JOIN artists ar ON al.artist_id = ar.artists_id
JOIN genres_albums ga on al.album_id = ga.album_id
JOIN genres ge ON ge.genre_id = ga.genre_id
JOIN subgenres_albums sa on al.album_id = sa.album_id
JOIN subgenres sg ON sg.subgenre_id = sa.subgenre_id
GROUP BY al.album_id
HAVING sku = '%s';
.print '--------'
ENDSQL1
)

#
# Start the coprocess
#
coproc dbproc { stdbuf -i0 -o0 sqlite3 -line "$DB"; }

#
# Read and report the queue elements.
#
n=0
while read -r URL; do
    ((n++))

    if [[ $URL =~ $RE ]]; then
        SKU="${BASH_REMATCH[1]}"
    else
        echo "Problem parsing URL in queue (line $n): $URL"
        continue
    fi

    #
    # Build the query and write it to the coprocess
    #
    # shellcheck disable=SC2059
    printf "$SQLtemplate\n" "$SKU" >&"${dbproc[1]}"

done < "$QUEUE"

#
# Close the input pipe. The Bash Reference Manual documents '[n]>&word' under
# "Duplicating File Descriptors": if word evaluates to '-', file descriptor n
# is closed. Here n is supplied by the value held in dbproc[1] (the {varname}
# form of redirection), so the coprocess sees end-of-file on its input.
#
exec {dbproc[1]}>&-

#
# Collect everything from the coprocess
#
cat <&"${dbproc[0]}"

# vim: syntax=sh:ts=8:sw=4:ai:et:tw=78:fo=tcrqn21

Salient points are:

  1. The query is stored in the variable SQLtemplate in the form of a format string for printf. This lets me substitute a SKU value each time it’s used (there is a short illustration after this list). The string consists of a SQL query and a SQLite ‘dot’ command (.print) which I use to print a line of hyphens between each album summary.
  2. The coprocess consists of a sqlite3 command preceded by a stdbuf call which turns off all buffering.
  3. In the loop which reads the queue we use printf to generate a new query on each iteration and write it to the coprocess. We do not read back from the coprocess inside the loop.
  4. Once the loop is finished we close the input pipe then use cat to collect all that’s available from the coprocess and display it.
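
To illustrate point 1, for the example SKU seen earlier the loop effectively performs the following; the substituted query is identical to the template except that '%s' has been replaced:

# shellcheck disable=SC2059
printf "$SQLtemplate\n" 'antiqcool-acousticguitar' >&"${dbproc[1]}"
#
# ...so the text written to the coprocess ends with:
#     HAVING sku = 'antiqcool-acousticguitar';
#     .print '--------'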

Running this script gives the following output (truncated after the first 12 lines):

  Artist = Antiqcool
   Album = Original Instrumental Acoustic Guitar Songs Vol 1
   Genre = Alt Rock
Subgenre = Folk,Instrumental New Age
    Code = antiqcool-acousticguitar
--------
  Artist = Mokhov
   Album = Jupiter Melodies
   Genre = Electronica
Subgenre = Electro Rock,Electronica,Instrumental Classical
    Code = mokhov-jupitermelodies
--------

I suspect that the strategy of feeding data to the coprocess in a loop but not reading from it until the loop has ended might be dangerous. It relies on the pipe buffering the results, and that buffer is finite: typically 64 KiB on Linux. If the queue were long enough for the accumulated results to fill the output pipe, sqlite3 would block writing them and stop reading queries, the script would eventually block writing to the input pipe, and the two processes could end up deadlocked.

It might be possible to avoid this problem by reading data from the coprocess on each loop iteration using read with a timeout. Detecting the timeout could possibly be used to determine that the output pipe is empty and it’s time to write to the input pipe again.

I have not tried this idea though - it feels very clunky!
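
For completeness, here is a rough, untested sketch of that idea. It would replace the single printf inside the loop of show_queue_db_2, and the half-second timeout is an arbitrary choice:

    #
    # Untested sketch: write the query, then drain whatever output has
    # arrived, treating a read timeout as a sign the coprocess has gone quiet
    #
    # shellcheck disable=SC2059
    printf "$SQLtemplate\n" "$SKU" >&"${dbproc[1]}"

    while read -u "${dbproc[0]}" -t 0.5 -r line; do
        echo "$line"
    done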

Conclusion

An interesting voyage, but:

  • I still don’t fully understand what 'exec {NAME[1]}>&-' means. I know what it does, but don’t get the syntax!
  • The general conclusions I have drawn from various sources are:
    • named pipes are better
    • if you want to interact with a command use something like expect (I spent several years using expect and expectk in my job)
    • see the Stack Exchange reference below for more details
  • For this application I can write a much simpler Perl script that connects to the SQLite database, prepares a query with a substitution point and repeats the call with different values, without a coprocess (available as a resource for this episode under the name show_queue.pl.zip). Many other programming solutions would work too. I do not believe that this is a task for Bash.

I’m in general agreement with clacke that Bash coproc is a feature looking for a use!

Feel free to show me the error of my ways!