Additional ancillary Bash tips - 12 (HPR Show 2669)

Dave Morriss


Making decisions in Bash

This is the twelfth episode in the Bash Tips sub-series. It is the fourth of a group of shows about making decisions in Bash.

In the last three episodes we saw the types of test Bash provides, and we looked briefly at some of the commands that use these tests. We looked at conditional expressions and all of the operators Bash provides to do this. We concentrated particularly on string comparisons which use glob and extended glob patterns.

Now we want to look at the other form of string comparison, using regular expressions.

Regular Expressions

Regular expressions appeared in Bash around 2004, in version 3. They can only be used in extended tests ([[...]]). It took a few sub-versions of Bash before the regular expression feature stabilised, so take care when researching the subject that what you find refers to versions greater than 3.2 (see footnote 1).

The operator '=~' is used to compare a string with a regular expression. The string or variable to be matched is written on the left of the =~ operator and the regular expression on the right (never the other way round).

Let’s begin by looking at a simple example of the use of a regular expression in Bash:

if [[ $server =~ ^(hacker|hobby)publicradio\.org$ ]]; then
    echo "This is HPR"
fi

Here the variable 'server' is being checked against a regular expression to determine whether it is either hackerpublicradio.org or hobbypublicradio.org (because of the anchors the whole string must match, not just part of it). If either matches then the message 'This is HPR' is displayed, otherwise nothing is displayed.

Things to note:

  • The regular expression is not enclosed in quotes (remember how this is also the case with glob and extended glob patterns in the last episode)
  • It starts with a caret ('^') which anchors it to the start of the text
  • Two alternative sub-expressions are enclosed in parentheses with a vertical bar ('|') between them; this means either 'hacker' or 'hobby' will match
  • The full-stop before 'org' is a regular expression metacharacter so needs to be escaped with a backslash ('\')
  • The regular expression ends with a '$' which anchors it to the end of the text

As usual, the return value of the regular expression is 0 (true) if the string matches the pattern, and 1 (false) otherwise. If the regular expression is syntactically incorrect, the return value is 2. The regular expression is affected by the shell option nocasematch (as previously mentioned for glob patterns).
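
The three return values can be seen with a small experiment. This is only a sketch: the helper name match_status and the sample strings are invented for illustration and are not part of the show's scripts.

```bash
#!/bin/bash

# Hypothetical helper: print the exit status of a regular expression match
match_status() {
    local status=0
    [[ $1 =~ $2 ]] || status=$?
    echo "$status"
}

match_status "abc" '^a.c$'      # prints 0: the string matches
match_status "xyz" '^a.c$'      # prints 1: no match
match_status "abc" 'a('         # prints 2: unbalanced parenthesis, invalid regex

# nocasematch makes the comparison case-insensitive
shopt -s nocasematch
match_status "ABC" '^a.c$'      # prints 0
shopt -u nocasematch
match_status "ABC" '^a.c$'      # prints 1
```

Remember to turn nocasematch off again (shopt -u) if the rest of a script relies on case-sensitive matching.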

If the regular expression is enclosed in quotes then it is treated as a string, not as a regular expression.

A common convention is to store the regular expression in a Bash variable and then use it as the right hand side of the expression. This allows the regular expression to be built without concern for the characters it contains being misinterpreted by Bash. However, if the variable is enclosed in quotes in the conditional expression this causes Bash to treat it as a string, not as a regular expression.
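
The difference between the quoted and unquoted variable can be sketched as follows. The pattern and test strings here are made up for the demonstration.

```bash
#!/bin/bash

re='^h.t$'

# Unquoted: the variable's contents are used as a regular expression
[[ hat =~ $re ]] && echo "unquoted: 'hat' matches"

# Quoted: the contents are treated as the literal text ^h.t$
[[ hat =~ "$re" ]] || echo "quoted: 'hat' does not match"

# The quoted form matches only where that literal text appears
[[ 'x^h.t$y' =~ "$re" ]] && echo "quoted: literal text found"
```

All three messages should be printed, which shows that quoting the variable turns the test into a plain substring search.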

If any part of the regular expression pattern is quoted then that part is treated as a string. This is how it is described in the GNU Bash Manual:

Any part of the pattern may be quoted to force the quoted portion to be matched as a string.

You would expect this to allow regular expression metacharacters to be used literally. I have not managed to get this to work, nor have I found any advice on using it in my researches.

The following downloadable script included with this show contains my failed test of this feature and is listed below.

$ cat bash12_ex1.sh
#!/bin/bash

# -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~
# Experimenting with the meaning of the statement in the GNU Bash Manual:
#       "Any part of the pattern may be quoted to force the quoted portion to
#       be matched as a string."
# -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~ -~

server="hackerpublicradio.org"

#
# Try some regular expressions in a loop. The first is a standard type, but
# the second and third use a quoted regular expression metacharacter trying
# different quotes.
#
for re in \
    '^(hacker|hobby)publicradio\.org$' \
    '^(hacker|hobby)publicradio"."org$' \
    "^(hacker|hobby)publicradio'.'org$"
do
    echo "Using regular expression: $re"
    if [[ $server =~ $re ]]; then
        echo "This is HPR"
    else
        echo "No match"
    fi
done
$ ./bash12_ex1.sh
Using regular expression: ^(hacker|hobby)publicradio\.org$
This is HPR
Using regular expression: ^(hacker|hobby)publicradio"."org$
No match
Using regular expression: ^(hacker|hobby)publicradio'.'org$
No match

The script may be found here: bash12_ex1.sh if you want to experiment with it.
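
A possible explanation for this result, offered as a suggestion rather than something established in the show: quote characters stored inside a variable are part of the pattern text, so the regular expression then looks for literal quote characters in the string. Quotes typed directly in the extended test are processed by the shell itself. The sketch below tries that form; the variable 'almost' is an invented test value.

```bash
#!/bin/bash

server="hackerpublicradio.org"
almost="hackerpublicradioXorg"      # hypothetical near-miss for comparison

# The quotes are written in the test itself, so the dot is matched literally
for s in "$server" "$almost"; do
    if [[ $s =~ ^(hacker|hobby)publicradio"."org$ ]]; then
        echo "$s: match"
    else
        echo "$s: no match"
    fi
done

# For comparison, an unquoted and unescaped dot matches any character
[[ $almost =~ ^(hacker|hobby)publicradio.org$ ]] && echo "$almost: match (dot as metacharacter)"
```

With this form the real server name matches and the near-miss does not, which is the behaviour the manual describes.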

Regular Expression Syntax

A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions by using various operators to combine smaller expressions.

The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash. Some regular expression operators contain backslashes, which may be a little confusing at first glance.

There are different types of regular expressions used by various tools and programming languages. Bash regular expressions use a form called extended regular expressions (ERE) and the metacharacters used within these expressions are described below:

Operator  Description
.         Represents any single character
*         Modifies the item to the left; the item matches zero or more times
?         Modifies the item to the left; the item matches zero or one time
+         Modifies the item to the left; the item matches one or more times
{n}       Modifier making the item to the left match exactly n times
{n,}      Modifier making the item to the left match n or more times
{n,m}     Modifier making the item to the left match between n and m times
{,m}      Modifier making the item to the left match between zero and m times (see footnote 2)
^         Matches the empty string at the start of a line
$         Matches the empty string at the end of a line
[...]     Matches a single character from the set in brackets
|         Separates two regular expressions allowing alternative matches
(...)     Parentheses can enclose multiple alternative regular expressions
\b        Matches the empty string at the edge of a word
\B        Matches the empty string provided it’s not at the edge of a word
\<        Matches the empty string at the beginning of a word
\>        Matches the empty string at the end of a word
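
A few of the operators in the table can be tried out with a short script. This is a sketch only; the helper name matches and the sample strings are invented for the purpose.

```bash
#!/bin/bash

# Hypothetical helper: print "yes" or "no" for a string and a regex
matches() {
    local re=$2
    if [[ $1 =~ $re ]]; then
        echo "yes"
    else
        echo "no"
    fi
}

matches "abbc"   '^ab{2,3}c$'    # yes: two b's is within the bounds
matches "abbbbc" '^ab{2,3}c$'    # no: four b's is too many
matches "ac"     '^ab*c$'        # yes: * allows zero b's
matches "ac"     '^ab+c$'        # no: + needs at least one b
matches "colour" '^colou?r$'     # yes: ? makes the u optional
matches "cat"    '^(cat|dog)$'   # yes: first alternative
matches "c4t"    '^c[0-9]t$'     # yes: 4 is in the bracketed set
```

Each regular expression is passed in single quotes and stored in a variable before the test, so the shell does not interpret any of its characters.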

Examples

Demonstrations of the use of the above regular expression operators.

Example 1

Match a blank line, or a line containing only whitespace with the following:

^$
^[[:blank:]]*$

A downloadable script demonstrating this concept:

$ cat bash12_ex2.sh
#!/bin/bash

#
# Demonstrate the use of a regular expression to detect blank lines in a file,
# and those containing only whitespace
#

re="^[[:digit:]]+[[:blank:]]*$"

while read -r line; do
    [[ $line =~ $re ]] && continue
    echo "$line"
done < <(cat -n "$0")

When run, the script prints itself with line numbers but omits the blank lines:

$ ./bash12_ex2.sh
1       #!/bin/bash
3       #
4       # Demonstrate the use of a regular expression to detect blank lines in a file,
5       # and those containing only whitespace
6       #
8       re="^[[:digit:]]+[[:blank:]]*$"
10      while read -r line; do
11          [[ $line =~ $re ]] && continue
12          echo "$line"
13      done < <(cat -n "$0")

The variable 're' holds the regular expression we want to match against every line of the input. In this case we start with one or more digits because we’re feeding the result of cat -n to the loop and we get back lines with a number at the start. Other than the line number, we are looking for lines which only contain spaces, so we can omit them.

The loop is a while loop which calls read as its test expression. This will return true when it reads a line and false when there are no more. The -r option stops read from treating backslashes in the input as escape characters; it isn’t strictly necessary here, though it is recommended.
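
The effect of -r is easiest to see with input that contains backslashes. The sample text in this sketch is invented; it is not taken from the show's scripts.

```bash
#!/bin/bash

# The same line is read twice, first without and then with the -r option
printf 'C:\\temp\\new\n' | { read line;    printf 'without -r: %s\n' "$line"; }
printf 'C:\\temp\\new\n' | { read -r line; printf 'with -r:    %s\n' "$line"; }

# Output:
# without -r: C:tempnew
# with -r:    C:\temp\new
```

Without -r the backslashes are consumed by read, so the line no longer matches what was in the input.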

The read gets its data from the redirection after the done part of the loop. Here we see a process substitution consisting of a cat -n of the current script using argument $0 which holds the script name.

Inside the loop the first line is a test (in a command list) which compares the latest line read from the file with the regular expression. If it matches, a continue command is called which skips to the end of the loop ready for the next iteration. If the line does not match, the echo command is invoked and the line is printed.

So, the overall effect is to print all lines which are not blank (after the line number).

This example is available as a downloadable file (bash12_ex2.sh).
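
The pattern ^[[:blank:]]*$ shown at the start of this example can also be used on its own, without the line numbers. The following sketch counts blank and whitespace-only lines on standard input; the function name count_blank is invented, and IFS= is set so that read does not strip leading or trailing whitespace.

```bash
#!/bin/bash

# Hypothetical function: count lines that are empty or contain only blanks
count_blank() {
    local re='^[[:blank:]]*$' line n=0
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Four lines: text, empty, space-TAB-space, text
printf 'one\n\n \t \ntwo\n' | count_blank      # prints 2
```

Since the * modifier allows zero repetitions, this one pattern covers the empty line as well, so a separate ^$ test is not needed.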

For more information about POSIX character classes such as [:digit:] see the appendix at the end of these notes.

Example 2

This example uses more of the regular expression operators listed above to match words:

\<.{4,}[tl]ing\>

Breaking this down:

Operator Meaning
\< this matches the start of a word
.{4,} matches 4 or more characters
[tl] matches either the letter 't' or the letter 'l'
ing matches the letters 'ing'
\> matches the end of a word

So we’re matching words ending in 'ing' preceded by a 't' or an 'l' with 4 or more letters (characters to be exact) before that. We will be using the dictionary in /usr/share/dict/words and extracting random words from it.

As before we have a downloadable script demonstrating this algorithm:

$ cat bash12_ex3.sh
#!/bin/bash

#
# Demonstrate a more complex regular expression to detect matching words in
# a file (one per line)
#

re='\<.{4,}[tl]ing\>'

while read -r line; do
    if [[ $line =~ $re ]]; then
        echo "$line"
    fi
done < <(shuf -n 100 /usr/share/dict/words)

When run, the script prints out a number of random words which match the regular expression:

$ ./bash12_ex3.sh
airmailing
squinting
intersecting

Things to note in this example are:

  • The regular expression is stored in a variable. This is always wise, and particularly so in this case because there are characters in it which would have been misinterpreted by Bash if the expression had been written in the extended test.
  • Again we’re using a process substitution to run shuf, a tool which selects random lines from the nominated file, 100 lines in this case.

This example is available as a downloadable file (bash12_ex3.sh).
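
Because shuf gives different words on every run, a fixed word list makes it easier to see what the expression accepts and rejects. This sketch uses an invented list, and assumes a system whose regular expression library supports \< and \> (the GNU C library does; they are extensions and may not work everywhere).

```bash
#!/bin/bash

re='\<.{4,}[tl]ing\>'

# Hypothetical word list chosen to show both outcomes
for word in squinting airmailing fitting sting painter; do
    if [[ $word =~ $re ]]; then
        echo "$word: match"
    else
        echo "$word: no match"
    fi
done
```

On a GNU/Linux system the first two words should match. 'fitting' and 'sting' have fewer than four characters before the 't', and 'painter' does not end in 'ing'.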

Example 3

This example takes a date as an argument and checks that it’s in the ISO8601 'YYYY-MM-DD' format.

$ cat bash12_ex4.sh
#!/bin/bash

#
# Building a regular expression to match a simple-format ISO8601 date
#

re='^[0-9]{4}(-[0-9]{2}){2}$'

#
# The date is expected as the only argument
#
if [[ $# -ne 1 ]]; then
    echo "Usage: $0 ISO8601_date"
    exit 1
fi

#
# Validate against the regex
#
if [[ $1 =~ $re ]]; then
    echo "$1 is a valid date"
else
    echo "$1 is not a valid date"
fi

Things to note are:

  • The regular expression looks for 4 digits, a hyphen, two digits, a hyphen and two digits. Since the “hyphen and two digits” part is repeated we enclose it in parentheses and add a repeat count modifier. The expression is anchored at the start and end otherwise a date like '2018-09-15-' would be valid.
  • The script makes a check on the number of arguments ('$#'), exiting with an error unless there’s one argument.
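
The effect of the anchors can be checked by comparing the expression with and without them. This is a sketch; the variable names are invented for the comparison.

```bash
#!/bin/bash

anchored='^[0-9]{4}(-[0-9]{2}){2}$'
unanchored='[0-9]{4}(-[0-9]{2}){2}'
candidate='2018-09-15-'

# With anchors the trailing hyphen causes the match to fail
[[ $candidate =~ $anchored ]] || echo "anchored: $candidate rejected"

# Without anchors the date part is found inside the string
[[ $candidate =~ $unanchored ]] && echo "unanchored: $candidate accepted"

# The expression checks the format only, not the values
[[ 2018-99-99 =~ $anchored ]] && echo "anchored: 2018-99-99 accepted"
```

The last test shows the limit of this approach: the month and day are not range-checked, so further validation would be needed to reject impossible dates.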

Examples of running the script:

$ ./bash12_ex4.sh
Usage: ./bash12_ex4.sh ISO8601_date

$ ./bash12_ex4.sh 2018-09-XX
2018-09-XX is not a valid date

$ ./bash12_ex4.sh 2018-09-15
2018-09-15 is a valid date

This example is available as a downloadable file (bash12_ex4.sh).

Example 4

This example is similar to the previous one. It takes an IP address (version 4) as an argument and checks that it’s in the correct format. It performs more sophisticated validation than example 3, but it’s not using regular expressions to do this.

$ cat bash12_ex5.sh
#!/bin/bash

#
# An IP address looks like this:
#       192.168.0.5
# Four groups of 1-3 digits, each in the range 0..255, separated by dots.
#
re='^([0-9]{1,3}\.){3}[0-9]{1,3}$'

#
# The address is expected as the only argument
#
if [[ $# -ne 1 ]]; then
    echo "Usage: $0 IP_address"
    exit 1
fi

#
# Validate against the regex
#
if [[ $1 =~ $re ]]; then
    #
    # Look at the components and check they are all in range
    #
    for d in ${1//./ }; do
        if [[ $d -lt 0 || $d -gt 255 ]]; then
            echo "$1 is not a valid IP address (contains $d)"
            exit 1
        fi
    done

    echo "$1 is a valid IP address"
else
    echo "$1 is not a valid IP address"
fi

As mentioned, there is an extra check in this example. After confirming that the address consists of four groups of digits, it is split into its components with a parameter substitution and the components are checked in a loop to ensure they are between 0 and 255. If a component fails this test, the script exits with an error message indicating which component failed to validate.
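
The two-stage check can also be packaged as a function, which makes it easier to test. This is a sketch, not the show's script: the name valid_ipv4 is invented, and the 10# prefix is an addition. It is there because Bash arithmetic treats a number with a leading zero as octal, so a component such as 08 would otherwise produce a "value too great for base" error.

```bash
#!/bin/bash

# Hypothetical function: succeed if the argument looks like an IPv4 address
valid_ipv4() {
    local re='^([0-9]{1,3}\.){3}[0-9]{1,3}$'
    local d

    # Stage 1: four groups of 1-3 digits separated by dots
    [[ $1 =~ $re ]] || return 1

    # Stage 2: replace dots with spaces and check each component,
    # forcing base 10 so that leading zeros are not read as octal
    for d in ${1//./ }; do
        (( 10#$d <= 255 )) || return 1
    done
    return 0
}

for addr in 192.168.0.5 192.168.0.256 192.168.0. 192.168.008.1; do
    if valid_ipv4 "$addr"; then
        echo "$addr: valid"
    else
        echo "$addr: not valid"
    fi
done
```

Only the first and last addresses should be reported as valid. There is no test for values below zero because the regular expression has already ensured that each component contains nothing but digits.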

Examples of running the script:

$ ./bash12_ex5.sh
Usage: ./bash12_ex5.sh IP_address

$ ./bash12_ex5.sh 192.168.0.
192.168.0. is not a valid IP address

$ ./bash12_ex5.sh 192.168.0.5
192.168.0.5 is a valid IP address

$ ./bash12_ex5.sh 192.168.0.256
192.168.0.256 is not a valid IP address (contains 256)

This example is available as a downloadable file (bash12_ex5.sh).

Capture groups

As well as providing a means of grouping regular expression operators – to define alternatives or to allow a modifier to apply to a sub-expression – parentheses also define capture groups as seen when looking at sed and awk.

We will look at this subject in the next (and last) episode of this sub-series.


Appendix 1 - POSIX Character Classes (see footnote 3)

POSIX class  Equivalent to                      Matches
[:alnum:]    [A-Za-z0-9]                        digits, uppercase and lowercase letters
[:alpha:]    [A-Za-z]                           upper- and lowercase letters
[:ascii:]    [\x00-\x7F]                        ASCII characters
[:blank:]    [ \t]                              space and TAB characters only
[:cntrl:]    [\x00-\x1F\x7F]                    control characters
[:digit:]    [0-9]                              digits
[:graph:]    [\x21-\x7E]                        graphic characters (all characters which have graphic representation)
[:lower:]    [a-z]                              lowercase letters
[:print:]    [[:graph:] ]                       graphic characters and space
[:punct:]    [-!"#$%&'()*+,./:;<=>?@[]^_`{|}~]  all punctuation characters (all graphic characters except letters and digits)
[:space:]    [ \t\n\r\f\v]                      all blank (whitespace) characters, including spaces, tabs, new lines, carriage returns, form feeds, and vertical tabs
[:upper:]    [A-Z]                              uppercase letters
[:word:]     [A-Za-z0-9_]                       word characters
[:xdigit:]   [0-9A-Fa-f]                        hexadecimal digits
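
A class name is only recognised inside a bracket expression, which is why it appears with double brackets, as in [[:digit:]]. The sketch below uses [:xdigit:] in this way; the function name is_hex is invented for the illustration.

```bash
#!/bin/bash

# Hypothetical function: succeed if the argument consists only of hex digits
is_hex() {
    local re='^[[:xdigit:]]+$'
    [[ $1 =~ $re ]]
}

is_hex "DEADbeef" && echo "DEADbeef: hexadecimal"
is_hex "xyz"      || echo "xyz: not hexadecimal"
```

The outer brackets form the bracket expression and the inner [:xdigit:] names the class, so other characters can be added alongside it, for example [[:xdigit:]_].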


  1. This is the version I am using:
    $ bash --version
    GNU bash, version 4.4.23(1)-release (x86_64-pc-linux-gnu)
    Copyright (C) 2016 Free Software Foundation, Inc.

  2. The bounds expression {,m} is not documented but seems to work as expected.

  3. Borrowed from https://www.npmjs.com/package/posix-character-classes