Extra ancillary Bash tips - 13 (HPR Show 2679)

Dave Morriss


Making decisions in Bash

This is the thirteenth episode in the Bash Tips sub-series. It is the fifth and final of a group of shows about making decisions in Bash.

In the last four episodes we saw the types of test Bash provides, and we looked briefly at some of the commands that use these tests. We looked at conditional expressions and all of the operators Bash provides for writing them. We concentrated particularly on string comparisons, which use glob and extended glob patterns, and then we devoted an episode to Bash regular expressions.

Now we want to look at the final topic within regular expressions, the use of capture groups.

Capture groups

If you have followed the series on sed or the one covering the awk language, the existence of capture groups will not be a surprise to you. A capture group is a way of grouping elements of a regular expression with parentheses, so that the part of the string matched by that group can be picked out afterwards.

For example you might want to look for three-word sentences:

re='^([a-zA-Z]+) +([a-zA-Z]+) +([a-zA-Z]+) *\.?'
  • There are three groups. They consist of ([a-zA-Z]+) meaning one or more alphabetic characters.
  • The characters of each word are followed by one or more spaces (' +') in the first and second cases. The third case is followed by zero or more spaces and an optional full-stop.
  • The entire regular expression is anchored to the start of the string.
  • Only the words themselves are being captured by being in groups, not the intervening spaces.

We will look at a script that uses this regular expression soon.

BASH_REMATCH

Bash uses an internal read-only array called BASH_REMATCH to hold what is matched by a regular expression. The zeroth element of the array holds what the entire regular expression has matched, and the rest hold what was matched by any capture groups in the regular expression.

Like other regular expression systems each capture group is numbered in order of occurrence, so element 1 of BASH_REMATCH contains the first, element 2 the second and so forth.
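
As a small illustration of this numbering, the following sketch splits a date into its parts. The date format, the sample value and the variable names are assumptions made for the example and are not part of the show's scripts. Each new =~ match replaces the contents of BASH_REMATCH, so the values are copied to ordinary variables straight away.

```bash
#!/bin/bash

# Hypothetical example: a date written as YYYY-MM-DD
re='^([0-9]{4})-([0-9]{2})-([0-9]{2})$'
date_str='2018-11-07'

if [[ $date_str =~ $re ]]; then
    # Copy the groups before another match overwrites BASH_REMATCH
    year="${BASH_REMATCH[1]}"
    month="${BASH_REMATCH[2]}"
    day="${BASH_REMATCH[3]}"

    # Element 0 is the whole match; elements 1-3 are the three groups
    echo "Whole match: ${BASH_REMATCH[0]}"
    echo "Elements:    ${#BASH_REMATCH[@]}"
    echo "Year $year, month $month, day $day"
fi
```

With three capture groups the array should hold four elements: the whole match plus one per group.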

In sed it is possible to refer to a capture group with a sequence such as '\1', allowing regular expressions themselves to repeat parts such as '\(cat\)\1'. This is shown by the following sed example:

$ echo "catcat" | sed -e 's/\(cat\)\1/match/'
match

Sadly this is apparently not available in Bash – or at least nothing is documented as far as I can find. (There are references to a partial implementation, but this doesn’t seem to be something to rely on).

See Example 2 below for some experiments with this.

The following downloadable example bash13_ex1.sh demonstrates the use of BASH_REMATCH:

$ cat bash13_ex1.sh
#!/bin/bash

#
# Three word regular expression
#
re='^([a-zA-Z]+) +([a-zA-Z]+) +([a-zA-Z]+) *\.?'

#
# A sentence is expected as the only argument
#
if [[ $# -ne 1 ]]; then
    echo "Usage: $0 sentence"
    exit 1
fi

echo "Sentence: $1"
if [[ $1 =~ $re ]]; then
    echo "Matched"
    for i in {0..3}; do
        printf '%2d %s\n' $i "${BASH_REMATCH[$i]}"
    done
fi

This uses the regular expression discussed above in an if command. If the regular expression matches then a message is output and in a for loop the elements of BASH_REMATCH are printed with the index.

$ ./bash13_ex1.sh 'Aardvarks eat ants.'
Sentence: Aardvarks eat ants.
Matched
 0 Aardvarks eat ants.
 1 Aardvarks
 2 eat
 3 ants

Note that you cannot rewrite the regular expression using repetition with the expectation that the capture groups will behave as they do in the explicit form:

re='^(([a-zA-Z]+) *){3}\.?'

There are two nested capture groups here, and the outer one is applied three times. The regular expression matches and BASH_REMATCH[0] contains the whole matched string, but elements 1 and 2 both contain only the last matching word, because each repetition replaces what the groups captured before:

 0 Aardvarks eat ants.
 1 ants
 2 ants
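
If every word is wanted and the number of words is not known in advance, one option is to apply a one-word regular expression repeatedly, removing each word from the front of the string as it is captured. This is only a sketch; the function name split_words and the array name words are hypothetical and do not appear in the show's scripts.

```bash
#!/bin/bash

# Hypothetical helper: collect every alphabetic word of $1 in the array 'words'
split_words() {
    local rest=$1
    # Group 1: the next word. Group 2: whatever follows it, which must
    # begin with a non-letter so that group 1 always holds a whole word.
    local re='^[^a-zA-Z]*([a-zA-Z]+)([^a-zA-Z].*)?$'

    words=()
    while [[ $rest =~ $re ]]; do
        words+=("${BASH_REMATCH[1]}")
        rest="${BASH_REMATCH[2]}"
    done
}

split_words 'Aardvarks eat ants.'
for i in "${!words[@]}"; do
    printf '%2d %s\n' "$i" "${words[$i]}"
done
```

The loop ends when the remaining text no longer contains a word, so the same function copes with sentences of any length.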

Examples

Example 1

In this example we enhance Example 4 from the last episode which checks an IP address for validity.

The example (bash13_ex2.sh) is downloadable from the HPR site.

$ cat bash13_ex2.sh
#!/bin/bash

# =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~
# IP Address parsing revisited
# =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~
#
# An IP address looks like this:
#       192.168.0.5
# Four groups of 1-3 numbers in the range 0..255 separated by dots.
#
re='^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$'

#
# The address is expected as the only argument
#
if [[ $# -ne 1 ]]; then
    echo "Usage: $0 IP_address"
    exit 1
fi

#
# Validate against the regex
#
if [[ $1 =~ $re ]]; then
    #
    # Look at the components and check they are all in range
    #
    errs=0
    problems=
    for i in {1..4}; do
        d="${BASH_REMATCH[$i]}"
        if [[ $d -lt 0 || $d -gt 255 ]]; then
            ((errs++))
            problems+="$d "
        fi
    done

    #
    # Report any problems found
    #
    if [[ $errs -gt 0 ]]; then
        problems="${problems:0:-1}"
        echo "$1 is not a valid IP address; contains ${problems// /, }"
        exit 1
    fi

    echo "$1 is a valid IP address"
else
    echo "$1 is not a valid IP address"
fi

The regular expression in this case is:

^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$

Note how each group of digits is in parentheses making it a capture group. The intervening dots ('.') are outside the groups.

The loop which checks each group steps a value from 1 to 4, saving each element of BASH_REMATCH in a variable 'd' for convenience. If a value is lower than 0 or greater than 255, the variable 'errs' is incremented and the failing number is appended to the variable 'problems'.

The error count is checked once the loop has completed and if greater than zero an error message is produced with the list of problem numbers and the script exits with a false value.

Note that 'problems="${problems:0:-1}"' removes the last character (a trailing space) from the variable. Also '${problems// /, }' replaces all spaces in the string with a comma and a space to make a readable list.
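
The two expansions can be tried on their own. In this sketch the sample value and the variable name 'list' are made up for the illustration. A negative length in a substring expansion such as '${problems:0:-1}' needs Bash 4.2 or later, so the form '${problems% }' is shown as an alternative for older versions.

```bash
#!/bin/bash

# Built up as in the loop: each number is followed by a space
problems='500 256 '

# Remove the trailing space; '%' deletes a matching suffix
# (the script's form is: problems="${problems:0:-1}")
problems="${problems% }"

# Replace every remaining space with a comma and a space
list="${problems// /, }"

echo "contains $list"
```

This should print 'contains 500, 256'.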

Examples of running the script:

$ ./bash13_ex2.sh 192.168.0.
192.168.0. is not a valid IP address

$ ./bash13_ex2.sh 192.168.0.5
192.168.0.5 is a valid IP address

$ ./bash13_ex2.sh 192.168.500.256
192.168.500.256 is not a valid IP address; contains 500, 256
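
One case worth testing is a component with a leading zero, such as 08. Inside '[[ ... ]]' the operands of '-lt' and '-gt' are evaluated as arithmetic, where a leading zero means octal, so 08 and 09 may be rejected with an error rather than compared. The sketch below shows one possible way round this by forcing base 10 with the '10#' prefix. The function name octet_ok is hypothetical and is not part of bash13_ex2.sh.

```bash
#!/bin/bash

# Hypothetical helper: succeed if $1 is 1-3 digits with a value of 0..255
octet_ok() {
    local digits='^[0-9]{1,3}$'

    # Only digits may reach the arithmetic test below
    [[ $1 =~ $digits ]] || return 1

    # 10# forces base 10, so '08' and '09' are not treated as bad octal
    (( 10#$1 <= 255 ))
}

for octet in 5 08 255 256; do
    if octet_ok "$octet"; then
        echo "$octet is in range"
    else
        echo "$octet is out of range"
    fi
done
```

Because the regular expression only admits digits, there is no need for a separate check that the value is below zero.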

Example 2

Although I could not find any official documentation about back references in Bash regular expressions there does seem to be something in the version I am using. This example demonstrates the use of this feature in a simple way.

A back reference consists of a backslash ('\') and a number. The number refers to the capture group, counting from the left of the regular expression.

It looks, after testing, as if only a single digit is catered for, so this means capture groups 1-9.

This example is downloadable as usual: bash13_ex3.sh

$ cat bash13_ex3.sh
#!/bin/bash

# =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~
# Experimenting with backreferences in Bash regular expressions
# =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~ =~

re='(\<.{1,10}\>) \1'

if [[ $1 =~ $re ]]; then
    echo "Matched: $1"
else
    echo "No match: $1"
fi

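The regular expression matches a sequence of 1 to 10 characters that starts and ends on a word boundary (typically a word of up to ten letters), followed by a space and a repeat of whatever was just captured. Note that '.' matches any character, not only letters.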

$ ./bash13_ex3.sh 'turnip turnip'
Matched: turnip turnip
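
Since back references are not documented for Bash, a script that has to run on other systems might prefer to avoid them. A minimal sketch of one alternative is to capture two neighbouring words and compare them in Bash itself. The function name doubled_word, and the assumption that words are separated only by spaces, are made for this sketch and are not part of the show's examples.

```bash
#!/bin/bash

# Hypothetical helper: print the first word that occurs twice in a row in $1
# and return 0; return 1 if no word is doubled.
doubled_word() {
    local rest=$1
    # Group 1: first word. Group 2: second word.
    # Group 3: the remainder, starting with a space (absent at end of string).
    local re='^ *([^ ]+) +([^ ]+)( .*)?$'

    while [[ $rest =~ $re ]]; do
        # Compare the captures directly; no back reference is needed
        if [[ ${BASH_REMATCH[1]} == "${BASH_REMATCH[2]}" ]]; then
            printf '%s\n' "${BASH_REMATCH[1]}"
            return 0
        fi
        # Carry on from the second word onwards
        rest="${BASH_REMATCH[2]}${BASH_REMATCH[3]}"
    done
    return 1
}

for s in 'turnip turnip' 'turnip parsnip' 'I saw the the cat'; do
    if w=$(doubled_word "$s"); then
        echo "Doubled word '$w' in: $s"
    else
        echo "No doubled word in: $s"
    fi
done
```

Unlike the back reference version this compares whole words only, which may or may not be what is wanted.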

Example 3

This is a moderately complex example which tries to parse a file of email addresses. The format of email addresses is quite complex, and this script does not try to be comprehensive in what it does. A Bash script is not the best way to perform this validation but it should be of interest nevertheless.

The formats catered for are:

  • local-part@domain – such as 'vim@vim.org'
  • name <local-part@domain> – such as 'HPR List <hpr@hackerpublicradio.org>'

There are others, but these are the ones most likely to be encountered.

This downloadable example (bash13_ex4.sh) reads data from a file (bash13_ex4.txt) which is also downloadable.

$ cat bash13_ex4.sh
#!/bin/bash

#
# Check that the data file exists
#
data="bash13_ex4.txt"
[ -e "$data" ] || { echo "File $data not found"; exit 1; }

#
# Email addresses can be:
#   1. local-part@domain
#   2. Name <local-part@domain>
#
part1='([a-zA-Z0-9_][a-zA-Z0-9_.]+@[a-zA-Z0-9.-]+)'
part2='([^<]+)<([a-zA-Z0-9_][a-zA-Z0-9_.]+@[a-zA-Z0-9.-]+)>'
re="^($part1|$part2)$"

#
# Read and check each line from the file
#
while read -r line; do
    #
    # Does it match the regular expression?
    #
    if [[ $line =~ $re ]]; then
        #declare -p BASH_REMATCH
        #
        # Decide which format it is depending on whether element 2 of
        # BASH_REMATCH is zero length
        #
        if [[ -z ${BASH_REMATCH[2]} ]]; then
            # Type 2
            name="${BASH_REMATCH[3]}"
            email="${BASH_REMATCH[4]}"
        else
            # Type 1
            name=
            email="${BASH_REMATCH[2]}"
        fi
        echo "Name: $name"
        echo "Email: $email"
    else
        echo "Not recognised: $line"
    fi
    echo
done < "$data"

This script uses a single regular expression to match either of the formats. For convenience, because it is so long, I have built the variable 're' from the two variables 'part1' and 'part2'. The two alternative regular expressions are enclosed in parentheses, and separated by a vertical bar '|'. The entire thing is anchored at the start and end of the string. The sub-expressions are:

([a-zA-Z0-9_][a-zA-Z0-9_.]+@[a-zA-Z0-9.-]+)
([^<]+)<([a-zA-Z0-9_][a-zA-Z0-9_.]+@[a-zA-Z0-9.-]+)>

The first one matches the local-part@domain format and the second matches name <local-part@domain>. Let’s examine them both in detail:

([a-zA-Z0-9_][a-zA-Z0-9_.]+@[a-zA-Z0-9.-]+)

  • The first square bracketed part matches any letter, any digit or an underscore. This is because the local-part cannot begin with a dot '.'.
  • The second square bracketed part matches the rest of the local-part. In real life many more characters are allowed, but we’re keeping it simpler here.
  • This is followed by a '@' symbol and the final square bracketed part that matches the domain.
  • The entire sub-expression is enclosed in parentheses as a capture group.

([^<]+)<([a-zA-Z0-9_][a-zA-Z0-9_.]+@[a-zA-Z0-9.-]+)>

  • Here there are two capture groups. The first contains a square bracketed expression which matches any character that is not a less than sign ('<'). The modifier is a plus sign, meaning one or more of these characters.
  • Between the groups is a less than symbol which we don’t want to capture.
  • The second group is the same as the first sub-expression, and is followed by a greater than sign ('>').

A while loop with a read command is used to read from the data file which was defined earlier in the script and its existence verified.

Inside the loop the regular expression is compared with the line just read from the file. If it doesn’t match then the line is reported as not recognised. If it matches then the script can collect the elements from the BASH_REMATCH array and report them.

Because the regular expression is complex the way in which the important capture groups are written to BASH_REMATCH differs according to which sub-expression matched. The script contains a declare -p command which is commented out. Removing the '#' from this activates it; it is a way of displaying the attributes and contents of an array in Bash (as a command which could be used to build the array).

Doing this and running the script shows output like the following when it encounters addresses of the two types:

declare -ar BASH_REMATCH=([0]="kawasaki@me.com" [1]="kawasaki@me.com" [2]="kawasaki@me.com" [3]="" [4]="")
Name: 
Email: kawasaki@me.com

declare -ar BASH_REMATCH=([0]="S Meir <smeier@yahoo.com>" [1]="S Meir <smeier@yahoo.com>" [2]="" [3]="S Meir " [4]="smeier@yahoo.com")
Name: S Meir 
Email: smeier@yahoo.com

The first address kawasaki@me.com matches the first sub-expression.

  • Remember that element zero of BASH_REMATCH contains everything matched by the regular expression, so we can ignore that.
  • Element one also matches everything because we have created an extra capture group by enclosing the two alternative sub-expressions in parentheses. This can also be ignored.
  • If the address matches the first sub-expression it will be written to element two of BASH_REMATCH because this is the second capture group.
  • The third and fourth capture groups in the second sub-expression are not matched in this case so these elements of BASH_REMATCH are empty.

The second address S Meir <smeier@yahoo.com> matches the second sub-expression in the regular expression.

  • We can ignore BASH_REMATCH elements zero and one for the same reason as before.
  • Element 2 is empty because the address does not match the second capture group.
  • Elements three and four match the third and fourth capture groups.

The script uses the fact that element two of BASH_REMATCH is zero length ('-z') to determine which type of address was matched and to report the name and email address details accordingly.

Here is an excerpt from what is displayed when the script is run (with the declare command commented out):

Name: A Feldspar 
Email: afeldspar@yahoo.ca

Name: 
Email: mcrawfor@live.com

Not recognised: .42@unknown.mars
...

Note: these are dummy addresses.
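
The need to work out which elements of BASH_REMATCH are filled could be avoided by testing the two formats one after the other, each with its own regular expression, so that the group numbers always start at 1. The following is only a sketch of that idea, using the same address pattern as bash13_ex4.sh. The function name parse_address and the variable names addr, re_plain and re_named are hypothetical. It also removes the space left at the end of the name.

```bash
#!/bin/bash

# The address pattern from the script, without a capture group of its own
addr='[a-zA-Z0-9_][a-zA-Z0-9_.]+@[a-zA-Z0-9.-]+'
re_plain="^($addr)$"
re_named="^([^<]+)<($addr)>$"

# Hypothetical helper: set the variables 'name' and 'email' from $1 and
# return 0, or return 1 if neither format matches
parse_address() {
    name=''
    email=''
    if [[ $1 =~ $re_plain ]]; then
        # Format 1: local-part@domain
        email="${BASH_REMATCH[1]}"
    elif [[ $1 =~ $re_named ]]; then
        # Format 2: Name <local-part@domain>
        name="${BASH_REMATCH[1]}"
        email="${BASH_REMATCH[2]}"
        # Remove any spaces left between the name and the '<'
        name="${name%"${name##*[! ]}"}"
    else
        return 1
    fi
    return 0
}

for a in 'vim@vim.org' 'HPR List <hpr@hackerpublicradio.org>' '.42@unknown.mars'; do
    if parse_address "$a"; then
        printf 'Name: %s\nEmail: %s\n\n' "$name" "$email"
    else
        printf 'Not recognised: %s\n\n' "$a"
    fi
done
```

The cost is a second match attempt for addresses in the second format, which is unlikely to matter for a file of this kind.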