Some other Bash tips (HPR Show 2045)

Dave Morriss


Table of Contents

Expansion

As we saw in the last episode (1951), and in others in this sub-series, there are eight types of expansion applied to the command line, in the following order:

  • Brace expansion (we looked at this subject in episode 1884)
  • Tilde expansion (seen in episode 1903)
  • Parameter and variable expansion (this was covered in episode 1648)
  • Command substitution (seen in episode 1903)
  • Arithmetic expansion (seen in episode 1951)
  • Process substitution
  • Word splitting
  • Pathname expansion

We will look at process substitution and word splitting in this episode but since there is a lot to cover in these subjects, we'll save pathname expansion for the next episode.

Process substitution

The process substitution feature in Bash is a way in which data can be passed to or from a process. A process is most simply thought of as one or more commands running together.

Not all Unix systems can implement process substitution, since it requires either what are known as "named pipes" (or FIFOs) or special files called /dev/fd/<n> (where <n> represents a number). These are temporary data storage structures which separate processes are able to access for reading or writing.

The name of this pipe or /dev/fd/<n> "interconnecting file" is passed as an argument to the initial command. This can be a difficult concept to understand, and we'll look at it in more detail soon.

There are two forms of process substitution:

>(command list)
<(command list)

The first form receives input which has been sent via the interconnecting file and passes it to the command list. The second form generates output from the command list and passes it on via the interconnecting file which needs to be read to receive the result. In both cases there must be no spaces between the '<' or the '>' and the open parenthesis.

Some experiments with simplified commands might help to clarify this. First consider a simple pipeline:

$ echo Test | sed -e 's/^.*$/[&]/'
[Test]

This is a pipeline where the echo command generates data on its STDOUT which is passed to the sed command on its STDIN via the pipe. The sed command modifies what it receives by placing square brackets around it and passes the result to its STDOUT and the text is displayed by the shell (Bash).

Contrast this with process substitution. If we want to generate the same text and pass it to sed within a process we might write:

$ echo Test >(sed -e 's/^.*$/[&]/')
Test /dev/fd/63

This does not work as expected. What is happening here is that an interconnecting file name (/dev/fd/63) has been created and passed to the echo command, with the expectation that it will be used to send data to the process. However, echo sees this name as just another argument and simply displays it, so the process substitution containing the sed command receives nothing.

This example needs to be rewritten by adding a redirection symbol (>) after the echo:

$ echo Test > >(sed -e 's/^.*$/[&]/')
[Test]

Behind the scenes, Bash will have replaced the process substitution expression with the name of the interconnecting file, (invisibly) turning the command into something like:

$ echo Test > /dev/fd/63

This time the interconnection between the command on the left and the process substitution expression holding the sed command has been made.

Note that using a pipe instead also does not work:

$ echo Test | >(sed -e 's/^.*$/[&]/')
bash: /dev/fd/63: Permission denied

This is because the filename is being used on the right of the pipe symbol where a command, script or program name is expected.
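
If you do want to feed data from a pipeline into the first form, a common idiom (a small aside from me, not something covered in the audio) is to let tee write to the interconnecting file while discarding tee's own standard output. The bracketed text may appear just after the prompt returns, because the substituted process runs asynchronously:

$ echo Test | tee >(sed -e 's/^.*$/[&]/') > /dev/null
[Test]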

The corresponding version of this example using the other form of process substitution is:

$ sed -e 's/^.*$/[&]/' <(echo Test)
[Test]

Here the interconnecting file name is being provided to the sed command. To visualise this we can modify the sed script by using the F command, a GNU extension which reports the name of the input file (followed by a newline):

$ sed -e 'F;s/^.*$/[&]/' <(echo Test)
/dev/fd/63
[Test]

To wrap up, the Bash manual page states the following in the context of process substitution as part of the larger topic of Expansion:

When available, process substitution is performed simultaneously with parameter and variable expansion, command substitution, and arithmetic expansion.

We have now seen all of these expansion types in this series.

Process Substitution Examples

An example of the first form is:

$ echo "Hacker Public Radio" > >(sed -ne 's/\([A-Z]\)\([a-z]\+\)/\l\1\U\2/gp;')
hACKER pUBLIC rADIO

This just uses a sed script to reverse the case of the words fed to it. It is equivalent to the very simple tests we did earlier to demonstrate the concepts of process substitution.

An example of the second form:

$ sed -ne '1~5p' <(nl -w2 -ba -s': ' sed_demo1.txt)
 1: Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases
 6:
11: what topics have been covered so far just have a look at our Archive.

This is a modification of an example used in the "Introduction to sed" series. Here sed has been requested to print line 1 of its input stream, followed by every fifth line thereafter. We used the nl (number lines) command to number the incoming lines to make it clear what was happening.

Here the nl command is being used in a process substitution to feed data to sed on its STDIN channel (via the interconnecting file of course). Note that this is just a demonstration of the topic, it would not make sense to use these two commands together in this way.
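
Perhaps the most frequently quoted use of this second form (again an aside rather than something from the audio) is handing diff, which normally expects two file names, the output of two commands to compare:

$ diff <(seq 1 5) <(seq 1 6)
5a6
> 6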

Another example of the second form:

$ join <(shuf -n5 /usr/share/dict/words | sed -e "s/'.*$//" | nl) \
    <(shuf -n5 /usr/share/dict/words | sed -e "s/'.*$//" | nl)
1 brine rationed
2 resister desks
3 democrats gall
4 Margie Bligh
5 segregates screwdrivers

This example uses the join command to join two lists of random words. This command expects two files (or two data sources) containing lines with identical join fields. The default field is the first.

The two processes provide the words, both doing the same thing. The shuf command selects 5 words from the system dictionary. We use sed to strip any apostrophe and the characters that follow it from these words. We then number the words using nl. Each process will generate the same numbers, and these provide the join field.

Thus the first pipeline will generate five words and add the numbers 1-5 to them, and so will the second, and so join will join together word 1 from the first stream with word 1 from the second, and so forth.

It's a fairly useless example, but I find it amusing.
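
If you would like a version with predictable output to experiment with, the same structure works when printf supplies fixed, pre-numbered data (a minimal sketch of my own):

$ join <(printf '1 red\n2 green\n3 blue\n') <(printf '1 apple\n2 leaf\n3 sky\n')
1 red apple
2 green leaf
3 blue sky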

A final example. This one uses a Bash while loop to receive lines from a database query in a multi-command process substitution expression (think of this as an excerpt from a longer script):

count=0
while read id name
do
    printf "%02d \"%s\"\n" $id "$name"
    ((count++))
done < <(echo "select id, name from series order by lower(name);" | sqlite3 hpr_talks.db)
echo "Found $count series"

51 "10 buck review"
80 "5150 Shades of Beer"
38 "A Little Bit of Python"
79 "Accessibility"
22 "All Songs Considered"
83 "April Fools Shows"
42 "Bash Scripting"
.
.
Found 82 series

The query is requesting a list of series numbers and names from a SQLite database containing information about HPR shows. The query is made by using echo to pipe an SQL query expression to the sqlite3 command.

The output from the query is being collected by a read statement into two variables id and name. These are then printed with a printf command, though such a loop would normally be used to perform some action using these variables, not just print them.

Normally such a while loop would read from a file by placing a redirection symbol (<) and a file name after the done part. Here, instead of a file we are reading the output of a process which is querying a database.

A counter variable count is set up before the loop, and incremented¹ within it. The final total is reported after the loop.

A sample of the output is shown after the code snippet.
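
Since readers will not have the database to hand, here is a minimal self-contained sketch of the same done < <(...) pattern, using seq to supply the lines. The total is 55 because, as discussed below, this form of the loop runs in the current shell and so the variable survives:

total=0
while read -r n
do
    ((total += n))
done < <(seq 1 10)
echo "Total: $total"

Total: 55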

As an aside, it is possible to build a loop using a pipeline, avoiding process substitution altogether:

count=0
echo "select id, name from series order by lower(name);" |\
sqlite3 hpr_talks.db |\
while read id name
do
    printf "%02d \"%s\"\n" $id "$name"
    ((count++))
done
echo "Found $count series"

The problem is that the final echo returns a count of zero.

This can be puzzling to new users of Bash - it certainly puzzled me when I first encountered it. It is because the while loop in this case runs in a separate process. Bash does not share variables between processes, so the count variable inside the loop is different from the one initialised before the loop. Thus the count outside the loop remains at zero.
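
The effect is easy to reproduce without the database. Compare this with the process substitution sketch earlier, which printed a total of 55; here the additions happen in a subshell and are lost:

total=0
seq 1 10 | while read -r n
do
    ((total += n))
done
echo "Total: $total"

Total: 0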

If there is interest in looking further at this issue and other "Bash Gotchas" like it it could be included in a future episode of this (sub-)series.

Word splitting

The subject of word splitting is important to the understanding of how Bash works. Not fully appreciating this has been the downfall of many Bash script-writers, myself included.

Some examples of word splitting

By default, and looked at simplistically, Bash separates words using spaces. The following simple function can be used to report how many arguments it has received in order to demonstrate this:

function countargs () {
    echo $#
}

The function is called countargs, for "count arguments". When it is called with arguments these are available to the function in the same way that arguments are available to a script. So the special parameter # (referenced as $#) contains the argument count.

Calling countargs with no arguments gives the answer 0:

$ countargs
0

However, calling it with a quoted string argument returns 1, showing that such a string is treated as a single word in the context of a Bash script:

$ countargs "Mary had a little lamb"
1
$ countargs 'Mary had a little lamb'
1

This also works if the string is empty:

$ countargs ""
1

When variables are used things become a little more complex:

$ str="fish fingers and custard"
$ countargs $str
4

Here the variable str has been expanded, which has resulted in four words being passed to the function. This has happened because Bash has applied word splitting to the result of expanding str.

If you want to pass a string like this to a function such that it can be used in the same way as in the calling environment, then it needs to be enclosed in double quotes:

$ countargs "$str"
1

While we are examining arguments on the command line it might be useful to create another function. This one, printargs, simply prints its arguments, one per line with the argument number in front of each.

function printargs() {
    i=1
    for arg; do
        echo "$i $arg"
        ((i++))
    done
}

This function uses a Bash for loop. Normally this is written as:

for var in list; do
    # Do things
done
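
For example (a trivial illustration of my own):

$ for var in alpha beta gamma; do echo "$var"; done
alpha
beta
gamma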

However, if the in list part is omitted the loop cycles through the arguments passed to a script or function. That is what is being done here, using the variable arg to hold the values. The variable i is used to count through the argument numbers.

We will look at Bash loops in more detail in a later episode.

Using this function instead of countargs we get:

$ printargs $str
1 fish
2 fingers
3 and
4 custard

$ printargs "$str"
1 fish fingers and custard

We can see that word splitting has taken place in the first example but in the second enclosing the variable expansion in double quotes has suppressed this splitting.

The Internal Field Separator (IFS)

As we have seen, word splitting normally takes place using spaces as the word delimiter.

In fact, this delimiter is controlled by a special Bash variable "IFS" (which stands for "Internal Field Separator").

By default, the IFS variable contains three characters: a space, a tab and a newline. If the IFS variable is unset it is treated as if it holds these characters. However, if its value is null then no splitting occurs.

It is important to understand the difference between unset and null if you want to manipulate this variable. If a Bash variable is unset it is not defined at all. This can be achieved with the command:

unset IFS

If the variable is null then it is defined but has no value, which can be achieved with the command:

IFS=
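
The difference is easy to demonstrate with the countargs function from earlier: with a null IFS no splitting occurs, while unsetting IFS restores the default behaviour:

$ str="fish fingers and custard"
$ IFS=
$ countargs $str
1
$ unset IFS
$ countargs $str
4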

If you are changing the IFS variable and need to change it back to its default value then this can be achieved in several ways. First, it can be defined explicitly by typing a space, tab and a newline in a string. This is not as simple as it seems though, so an alternative is this:

printf -v IFS " \t\n"

This relies on the ability of printf to generate special characters using escape sequences, and the -v option which writes the result to a variable.
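
Another way, for those who prefer it, is Bash's ANSI-C quoting, where a $'...' string understands the same backslash escape sequences:

IFS=$' \t\n'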

The technique normally used in a Bash script is to save the value of IFS before changing it, then restore it later. For example:

oldIFS="$IFS"
IFS=":"

# Commands using changed IFS

IFS="$oldIFS"

One method of checking what the IFS variable contains is to use the cat command with the option -A, which is shorthand for the combination of -v, -T and -E. These options have the following effects:

  • Option -v displays non-printing characters (except tab)
  • Option -T displays tab characters as "^I"
  • Option -E displays a $ character at the end of each line

Simply echoing IFS into cat will do it:

$ echo "$IFS" | cat -A
 ^I$
$

The output shows the space, followed by the ^I representation of the tab, followed by a dollar sign marking the end of the line; that line ends where it does because of the newline character in IFS. The second line contains only a dollar sign because it is the empty line produced by the newline which echo itself appends.

An alternative is to use the od (octal dump) command. This is intended for dumping the contents of files to examine their binary formats. Here I have chosen the -a option, which generates character names, and the -c option, which shows printable characters as themselves and others as backslash escapes:

$ echo -n "$IFS" | od -ac
0000000  sp  ht  nl
             \t  \n
0000003

The leading numbers are offsets in the "file". Note that we used the -n option to echo to prevent it generating an extra newline.

So now we know that the IFS variable contains three characters by default, and that any of these will be used as a delimiter. It is therefore possible to prepare strings as follows:

$ str="  Wynken, Blynken, and Nod one night
> Sailed off in a wooden shoe  "

The > character on the second line is the Bash prompt indicating that the string is incomplete because the closing quote has not yet been entered. Note the existence of leading and trailing spaces.

$ printargs $str
1 Wynken,
2 Blynken,
3 and
4 Nod
5 one
6 night
7 Sailed
8 off
9 in
10 a
11 wooden
12 shoe

Note that the embedded newline is treated as a word delimiter and the leading and trailing spaces are ignored.

If we quote the string (and add square brackets around it to show leading and trailing spaces) we get the following:

$ printargs "[$str]"
1 [  Wynken, Blynken, and Nod one night
Sailed off in a wooden shoe  ]

This time the leading and trailing spaces and the embedded newline are retained and printed as part of the string.

Finally we will look at how the IFS variable can be used to perform word splitting on other delimiters.

In this example we'll define a string then save the old IFS value and set a new one. We will use an underscore as the delimiter:

$ str="all dressed up - and nowhere to go"
$ oldIFS="$IFS"
$ IFS="_"
$ printargs $str
1 all dressed up - and nowhere to go

Note that the string is no longer split up since it contains none of the delimiters in the IFS variable.

Here we use one of the Bash features we met in episode 1648 ("Bash parameter manipulation"): pattern substitution, with which we change all spaces to underscores:

$ printargs ${str// /_}
1 all
2 dressed
3 up
4 -
5 and
6 nowhere
7 to
8 go

Note that we get 8 words this way since the hyphen is treated as a word.

If however we change the IFS variable again to include the hyphen as a delimiter we get a different result:

$ IFS="_-"
$ printargs ${str// /_}
1 all
2 dressed
3 up
4
5
6 and
7 nowhere
8 to
9 go

Here we have 9 words since the hyphen is now a delimiter. It might be useful to show the result of the substitution before word splitting:

$ echo "${str// /_}"
all_dressed_up_-_and_nowhere_to_go

There are three delimiters in sequence here which are interpreted as two null words (words 4 and 5 above).

Don't forget to restore the IFS variable afterwards otherwise you will probably find Bash behaves in a rather unexpected way:
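
For instance, with IFS still set to "_-", an innocent unquoted expansion such as a file name (an invented one here) is now split in surprising places:

$ file="hpr_2045-notes.txt"
$ printargs $file
1 hpr
2 2045
3 notes.txt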

$ IFS="$oldIFS"

Manual Page Extracts

EXPANSION

Expansion is performed on the command line after it has been split into words. There are seven kinds of expansion performed: brace expansion, tilde expansion, parameter and variable expansion, command substitution, arithmetic expansion, word splitting, and pathname expansion.

The order of expansions is: brace expansion; tilde expansion, parameter and variable expansion, arithmetic expansion, and command substitution (done in a left-to-right fashion); word splitting; and pathname expansion.

On systems that can support it, there is an additional expansion available: process substitution. This is performed at the same time as tilde, parameter, variable, and arithmetic expansion and command substitution.

Only brace expansion, word splitting, and pathname expansion can change the number of words of the expansion; other expansions expand a single word to a single word. The only exceptions to this are the expansions of "$@" and "${name[@]}" as explained above (see PARAMETERS).

Brace Expansion

See the notes for HPR show 1884.

Tilde Expansion

See the notes for HPR show 1903.

Parameter Expansion

See the notes for HPR show 1648.

Command Substitution

See the notes for HPR show 1903.

Arithmetic Expansion

See the notes for HPR show 1951.

Process Substitution

Process substitution is supported on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files. It takes the form of <(list) or >(list). The process list is run with its input or output connected to a FIFO or some file in /dev/fd. The name of this file is passed as an argument to the current command as the result of the expansion. If the >(list) form is used, writing to the file will provide input for list. If the <(list) form is used, the file passed as an argument should be read to obtain the output of list.

When available, process substitution is performed simultaneously with parameter and variable expansion, command substitution, and arithmetic expansion.

Word Splitting

The shell scans the results of parameter expansion, command substitution, and arithmetic expansion that did not occur within double quotes for word splitting.

The shell treats each character of IFS as a delimiter, and splits the results of the other expansions into words using these characters as field terminators. If IFS is unset, or its value is exactly <space><tab><newline>, the default, then sequences of <space>, <tab>, and <newline> at the beginning and end of the results of the previous expansions are ignored, and any sequence of IFS characters not at the beginning or end serves to delimit words. If IFS has a value other than the default, then sequences of the whitespace characters space and tab are ignored at the beginning and end of the word, as long as the whitespace character is in the value of IFS (an IFS whitespace character). Any character in IFS that is not IFS whitespace, along with any adjacent IFS whitespace characters, delimits a field. A sequence of IFS whitespace characters is also treated as a delimiter. If the value of IFS is null, no word splitting occurs.

Explicit null arguments ("" or '') are retained. Unquoted implicit null arguments, resulting from the expansion of parameters that have no values, are removed. If a parameter with no value is expanded within double quotes, a null argument results and is retained.

Note that if no expansion occurs, no splitting is performed.


  1. For some reason I said "post decrement" in the audio, where this is obviously a "post increment". Oops!