More supplementary Bash tips (HPR Show 2293)

Pathname expansion; part 2 of 2

Dave Morriss


Expansion

As we saw in the last episode 2278 (and others in this sub-series) there are eight types of expansion applied to the command line in the following order:

  • Brace expansion (we looked at this subject in episode 1884)
  • Tilde expansion (seen in episode 1903)
  • Parameter and variable expansion (this was covered in episode 1648)
  • Command substitution (seen in episode 1903)
  • Arithmetic expansion (seen in episode 1951)
  • Process substitution (seen in episode 2045)
  • Word splitting (seen in episode 2045)
  • Pathname expansion (the previous episode 2278 and this one)

This is the last topic in the (sub-) series about expansion in Bash.

In this episode we will look at extended pattern matching as also defined in the “Manual Page Extracts” section at the end of the long notes.

Pathname expansion - continued

As we saw in the last episode (2278), if we enable the option ‘extglob’ using the ‘shopt’ command we enable a number of additional extended pattern matching features (see footnote 1).

In the following description, a pattern-list is a list of one or more patterns separated by a ‘|’. Composite patterns may be formed using one or more of the following sub-patterns:

?(pattern-list)

Matches zero or one occurrence of the given patterns

*(pattern-list)

Matches zero or more occurrences of the given patterns

+(pattern-list)

Matches one or more occurrences of the given patterns

@(pattern-list)

Matches one of the given patterns

!(pattern-list)

Matches anything except one of the given patterns

Notes

  1. This is a fairly new feature
  2. It does not seem to be very well documented
  3. There are some similarities to regular expressions

Warning!: It is not explained explicitly in the Bash manpage but these patterns are applied to each filename. So the pattern:

a?(b)c

matches a file which begins with ‘a’, is followed by zero or one instance of letter ‘b’ and ends with ‘c’. This means it can match only the filenames ‘abc’ and ‘ac’. This is explained more completely below.
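
The whole-name behaviour can be checked without touching any files. The following is a minimal sketch, not part of the original session: the list of names is made up for illustration, and it relies on the fact that the right-hand side of ‘==’ inside ‘[[ ]]’ is treated as a pattern that must match the entire string.

```shell
#!/usr/bin/env bash
# Sketch: test candidate names against the extended pattern a?(b)c.
# The names are illustrative strings, not files on disk.
shopt -s extglob   # enable extended patterns before they are used

matched=()
for name in ac abc abbc axc xabc abcd; do
    # The pattern has to account for the whole of $name, so 'xabc' and
    # 'abcd' fail even though they contain 'abc'.
    if [[ $name == a?(b)c ]]; then
        matched+=("$name")
    fi
done

echo "Matched: ${matched[*]}"
```

Only ‘ac’ and ‘abc’ are reported, which is the same result the pattern gives when applied to filenames.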

Some of the confusion this can cause can be seen in the Stack Exchange questions listed in the Links section below.

Examples

It turns out that the 33,800 files generated in the last episode are not particularly useful when demonstrating how this feature works. I had not investigated extended glob patterns when I created them unfortunately.

Although these files will be used for these examples we will create some more directories and files of a simpler structure, and will turn on ‘extglob’ (assuming it’s not on by default - see the footnote):

$ cd Pathname_expansion
$ mkdir test
$ touch test/{abbc,abc,ac,axc}
$ touch test/{x,xx,xxx}.dat
$ ls -1 test/
abbc
abc
ac
axc
x.dat
xx.dat
xxx.dat
$ shopt -s extglob

(Some examples here are derived from the Stack Exchange articles mentioned earlier and listed in the Links section.)
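
If you are not sure whether ‘extglob’ is already on in your shell, something like the following sketch can be used to check and enable it. The variable names are my own; ‘shopt -q’ simply returns a status without printing anything.

```shell
#!/usr/bin/env bash
# Sketch: report whether extglob is on, and turn it on if it is not.
if shopt -q extglob; then
    state_before=on
else
    state_before=off
    shopt -s extglob
fi

# Check again so we can confirm the option is now set
if shopt -q extglob; then
    state_after=on
else
    state_after=off
fi

echo "extglob was $state_before, is now $state_after"
```

In a script it is safest to put ‘shopt -s extglob’ on a line of its own near the top, because Bash needs the option set before it parses a line containing an extended pattern.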

Example 1 - “match zero or one occurrence”

?(pattern-list)

In the first demonstration we are asking for zero or one occurrence of ‘b’ between the ‘a’ and ‘c’. We get the files ‘abc’ and ‘ac’ because they match the one and zero cases respectively.

$ echo test/a?(b)c
test/abc test/ac

Next we have asked for zero or one letter ‘b’ or letter ‘x’ in the centre, so in this case we also see ‘axc’.

$ echo test/a?(b|x)c
test/abc test/ac test/axc

Note that the pattern list has become a little more complex, since we have an alternative character.

Now we will move to a more complex example using the large collection of test files.

Here we are searching through the directories that start with a vowel for all files that have ‘a’ or ‘b’ as the second letter and ‘01’, ‘10’ or ‘11’ as the next two digits, or files whose second letter is ‘a’ or ‘b’ followed by the digits ‘50’:

$ ls -w 50 -x [aeiou]/?(?[ab][01][01]*|?[ab]50*)
a/aa01.txt  a/aa10.txt  a/aa11.txt  a/aa50.txt
a/ab01.txt  a/ab10.txt  a/ab11.txt  a/ab50.txt
e/ea01.txt  e/ea10.txt  e/ea11.txt  e/ea50.txt
e/eb01.txt  e/eb10.txt  e/eb11.txt  e/eb50.txt
i/ia01.txt  i/ia10.txt  i/ia11.txt  i/ia50.txt
i/ib01.txt  i/ib10.txt  i/ib11.txt  i/ib50.txt
o/oa01.txt  o/oa10.txt  o/oa11.txt  o/oa50.txt
o/ob01.txt  o/ob10.txt  o/ob11.txt  o/ob50.txt
u/ua01.txt  u/ua10.txt  u/ua11.txt  u/ua50.txt
u/ub01.txt  u/ub10.txt  u/ub11.txt  u/ub50.txt

The ‘-w 50’ option to ‘ls’ limits the output width for better readability in these notes. We also use ‘-x’ which lists files in row order rather than the default column order so you can read left to right.

There are some important points to understand in this example:

  • Although we are using the “match zero or one occurrence” sub-pattern there are no cases where there are zero matches. The main benefit we are getting from this feature is that we can use alternation (vertical bar).

  • Use of the ‘*’ wildcard in the sub-pattern avoids the need to be explicit about the ‘.txt’ suffix on the files. The same effect would be achieved with the following:

    [aeiou]/?(?[ab][01][01]|?[ab]50).txt

  • Adding a ‘*’ wildcard after the closing parenthesis will result in the sub-expression having no effect, and all files in the directories will be returned. That is because the sub-pattern is allowed to match nothing at all, and the wildcard then matches everything! The difference is shown below:

    $ echo [aeiou]/?(?[ab][01][01]*|?[ab]50*) | wc -w
    40
    $ echo [aeiou]/?(?[ab][01][01]*|?[ab]50*)* | wc -w
    6500
    $ echo [aeiou]/* | wc -w
    6500
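
The effect of the trailing wildcard can be reproduced on a much smaller scale. This is a rough sketch using four made-up files in a temporary directory (not the 33,800 files from episode 2278); the array names are my own, and ‘nullglob’ is set only so that a failed match would give an empty array.

```shell
#!/usr/bin/env bash
# Sketch: a trailing '*' after ?(...) makes the sub-pattern irrelevant.
shopt -s extglob nullglob

dir=$(mktemp -d)
touch "$dir"/{aa01,ab50,ac02,ad33}.txt

# Here the sub-pattern has to account for the whole filename
selective=( "$dir"/?(?[ab]01*|?[ab]50*) )

# Here the sub-pattern may match nothing, leaving '*' to match any name
everything=( "$dir"/?(?[ab]01*|?[ab]50*)* )

echo "selective:  ${#selective[@]}"
echo "everything: ${#everything[@]}"

rm -rf "$dir"
```

The first pattern picks out two of the four files and the second returns all four.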

Example 2 - “match zero or more occurrences”

*(pattern-list)

In the next demonstration we are asking for zero or more occurrences of ‘b’ between the ‘a’ and ‘c’. We get the files ‘abbc’, ‘abc’ and ‘ac’ because they match the zero and more than zero cases.

$ echo test/a*(b)c
test/abbc test/abc test/ac

Not surprisingly, adding ‘x’ to the list in the sub-expression also returns ‘axc’.

$ echo test/a*(b|x)c
test/abbc test/abc test/ac test/axc

There are files in the ‘test’ directory with one to three ‘x’ characters at the start of their names. We can search for them as follows:

$ echo test/*(x).dat
test/x.dat test/xx.dat test/xxx.dat

There is no instance of zero ‘x’es followed by ‘.dat’ but a file ‘.dat’ would match, though it would only be shown if ‘dotglob’ was set.
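
To try the hidden-file case for yourself, a sketch along the following lines could be used. It works in a temporary directory with made-up files, and the array names are my own. I have not checked the ‘dotglob’-off case on every version of Bash, so the sketch just prints both results rather than assuming them.

```shell
#!/usr/bin/env bash
# Sketch: does *(x).dat find a file called '.dat'?
shopt -s extglob nullglob

dir=$(mktemp -d)
touch "$dir"/{x,xx,xxx}.dat "$dir"/.dat

shopt -u dotglob
without=( "$dir"/*(x).dat )   # hidden files normally need an explicit '.'

shopt -s dotglob
with=( "$dir"/*(x).dat )      # now a leading '.' can be matched by the glob

echo "dotglob off: ${without[*]##*/}"
echo "dotglob on:  ${with[*]##*/}"

shopt -u dotglob
rm -rf "$dir"
```

With ‘dotglob’ set the list contains all four files, including ‘.dat’, which is the zero ‘x’es case.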

Applying this sub-pattern to the large collection of test files from the last episode we might want to find all files in directory ‘a’ which begin with two ‘a’s followed by digits in the range 1-3:

$ ls -w 50 -x a/*(a)*([1-3]).txt
a/aa11.txt  a/aa12.txt  a/aa13.txt  a/aa21.txt
a/aa22.txt  a/aa23.txt  a/aa31.txt  a/aa32.txt
a/aa33.txt

You might expect to get back only ‘a/aa11.txt’, ‘a/aa22.txt’ and ‘a/aa33.txt’ but what is actually returned matches ‘aa’ followed by two numbers, each in the range 1-3. This is the same as:

$ ls -w 50 -x a/aa[1-3][1-3].txt
a/aa11.txt  a/aa12.txt  a/aa13.txt  a/aa21.txt
a/aa22.txt  a/aa23.txt  a/aa31.txt  a/aa32.txt
a/aa33.txt

Just to demonstrate how these sub-patterns work, the following example returns the three files in the first column above:

$ ls -1 a/?(*(a)*(1)|*(a)*(2)|*(a)*(3)).txt
a/aa11.txt
a/aa22.txt
a/aa33.txt

However, it does not seem very practical!

Example 3 - “match one or more occurrences”

+(pattern-list)

The next demonstration requests one or more instances of the letter ‘b’ between the other letters and returns the files ‘abbc’ (two ‘b’s) and ‘abc’ (one ‘b’):

$ echo test/a+(b)c
test/abbc test/abc

As before, adding ‘x’ as an alternative adds file ‘axc’ to the list:

$ echo test/a+(b|x)c
test/abbc test/abc test/axc

The following example looks in directories ‘a’ and ‘b’ for files that begin with an ‘a’ or a ‘b’ and end with ‘01.txt’:

$ ls -w 50 -x [ab]/+(a|b)*01.txt
a/aa01.txt  a/ab01.txt  a/ac01.txt  a/ad01.txt
a/ae01.txt  a/af01.txt  a/ag01.txt  a/ah01.txt
a/ai01.txt  a/aj01.txt  a/ak01.txt  a/al01.txt
a/am01.txt  a/an01.txt  a/ao01.txt  a/ap01.txt
a/aq01.txt  a/ar01.txt  a/as01.txt  a/at01.txt
a/au01.txt  a/av01.txt  a/aw01.txt  a/ax01.txt
a/ay01.txt  a/az01.txt  b/ba01.txt  b/bb01.txt
b/bc01.txt  b/bd01.txt  b/be01.txt  b/bf01.txt
b/bg01.txt  b/bh01.txt  b/bi01.txt  b/bj01.txt
b/bk01.txt  b/bl01.txt  b/bm01.txt  b/bn01.txt
b/bo01.txt  b/bp01.txt  b/bq01.txt  b/br01.txt
b/bs01.txt  b/bt01.txt  b/bu01.txt  b/bv01.txt
b/bw01.txt  b/bx01.txt  b/by01.txt  b/bz01.txt

This could just as well have been achieved with:

$ ls -w 50 -x [ab]/[ab]*01.txt

Example 4 - “match one of the given patterns”

@(pattern-list)

This demonstration requests one instance of the letter ‘b’ between the other letters and returns one file ‘abc’:

$ echo test/a@(b)c
test/abc

Again, adding ‘x’ as an alternative adds file ‘axc’ to the list:

$ echo test/a@(b|x)c
test/abc test/axc

To make some better search targets I ran the following commands:

$ mkdir words
$ while read -r word; do
> word=${word%[^a-zA-Z]*}
> word=${word,,}
> touch "words/$word"
> done < <(shuf -n100 /usr/share/dict/words)

  • A directory ‘words’ was created
  • A ‘while’ loop was started to read data into a variable called ‘word’ (this starts a multi-line command so the prompt changes to ‘>’ until the entire loop is typed in)
  • The ‘word’ variable has everything from its last non-alphabetic character onwards removed, which gets rid of trailing apostrophes or ‘'s’ sequences.
  • The ‘word’ variable is converted to lower case
  • The ‘touch’ command makes an empty file named whatever variable ‘word’ contains
  • The loop ends with ‘done’ and the loop is “fed” with data by a process substitution (see show 2045). This runs the ‘shuf’ command to return 100 random words from ‘/usr/share/dict/words’.

If you try this you will get different words.

In my case I used the following command to return words containing at least one of ‘ee’, ‘oo’, ‘th’ or ‘ss’:

$ ls -w 60 words/*@(ee|oo|th|ss)*
words/commandeering  words/katherine      words/woolly
words/eighteenths    words/slathering
words/ingress        words/thoughtlessly
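
Because ‘shuf’ gives different words each time, here is a repeatable sketch of the same search. It uses a short fixed list of words (borrowed from the listings in these notes) created in a temporary directory; the array name ‘hits’ is my own.

```shell
#!/usr/bin/env bash
# Sketch: find names containing any one of a set of letter pairs.
shopt -s extglob nullglob

dir=$(mktemp -d)
# A fixed word list so that the result is the same on every run
touch "$dir"/{ingress,woolly,katherine,quits,whales,yodel}

# '*' either side lets the letter pair appear anywhere in the name
hits=( "$dir"/*@(ee|oo|th|ss)* )

# Strip the directory prefix from each element for display
printf '%s\n' "${hits[@]##*/}"

rm -rf "$dir"
```

This lists ‘ingress’, ‘katherine’ and ‘woolly’, and skips the other three names.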

Example 5 - “match anything but”

!(pattern-list)

In the final demonstration we look for file names which do not have a single ‘b’ between the ‘a’ and ‘c’:

$ echo test/a!(b)c
test/abbc test/ac test/axc

Notice how this list includes ‘abbc’ because there are two ‘b’s between the other letters, whereas the pattern excludes only a single ‘b’.

If we replace the ‘b’ in the pattern with a further pattern which means “one or more” then we do not get ‘abbc’:

$ echo test/a!(+(b))c
test/ac test/axc

This again demonstrates that patterns can contain patterns!
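
The two negated patterns can be compared side by side with a small helper. This is only a sketch: the function ‘matching’ is a hypothetical name of my own, and it tests the four names as plain strings rather than as files.

```shell
#!/usr/bin/env bash
# Sketch: compare a!(b)c with a!(+(b))c on a fixed list of names.
shopt -s extglob

# Print the names from the list that match the pattern given in $1
matching() {
    local pattern=$1 name
    local out=()
    for name in abbc abc ac axc; do
        # Unquoted $pattern on the right of == is treated as a pattern
        if [[ $name == $pattern ]]; then
            out+=("$name")
        fi
    done
    echo "${out[*]}"
}

single=$(matching 'a!(b)c')       # excludes exactly one 'b'
run=$(matching 'a!(+(b))c')       # excludes any run of 'b's

echo "a!(b)c    -> $single"
echo "a!(+(b))c -> $run"
```

The first pattern still lets ‘abbc’ through, while the second rejects it along with ‘abc’.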

As a more complex example to show how this sub-pattern works we might try searching for files thus:

$ ls -w 50 -x a/a!([c-z]*).txt
a/aa01.txt  a/aa02.txt  a/aa03.txt  a/aa04.txt
a/aa05.txt  a/aa06.txt  a/aa07.txt  a/aa08.txt
...
a/aa49.txt  a/aa50.txt  a/ab01.txt  a/ab02.txt
a/ab03.txt  a/ab04.txt  a/ab05.txt  a/ab06.txt
...
a/ab47.txt  a/ab48.txt  a/ab49.txt  a/ab50.txt

Here we’re looking for files in the directory ‘a’ where the first letter is ‘a’ (they all are) and the second letter is not in the range ‘[c-z]’. The output here shows a subset of what was returned.

Let’s finish with an example searching the directory of words. This time we have a pattern within a pattern. The inner pattern is a @(pattern-list) which contains a list of pairs of letters, mostly identical. This pattern is surrounded by asterisk wildcards. The effect of this is to select all words that contain one of the letter pairs.

This is enclosed in a !(pattern-list) pattern which negates the inner selection making it match words which do not contain the pairs of letters.

$ ls -w 70 words/!(*@(bb|cc|dd|ee|gg|ll|oo|pp|tt|th|ss)*)
words/adela          words/falconers      words/protectively
words/adversest      words/frankie        words/quits
words/ails           words/gnomes         words/rashes
words/airline        words/haring         words/recites
words/alton          words/indianapolis   words/rescuers
...
words/dickson        words/pitchfork      words/weightlifting
words/elitist        words/pomade         words/whales
words/enactment      words/prepackaging   words/writings
words/épées          words/preview        words/yens
words/exit           words/profusion      words/yodel

The result is 81 of the 100 words in the directory.
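
One way to convince yourself that the negated pattern really is the complement of the inner one is to count both. The sketch below does this with a small fixed word list in a temporary directory (so the numbers differ from my 100 random words); the array names are my own.

```shell
#!/usr/bin/env bash
# Sketch: a pattern and its !( ) negation should split the directory in two.
shopt -s extglob nullglob

dir=$(mktemp -d)
touch "$dir"/{ingress,woolly,katherine,quits,whales,yodel}

all=( "$dir"/* )
with_pair=( "$dir"/*@(ee|oo|th|ss)* )        # contains one of the pairs
without_pair=( "$dir"/!(*@(ee|oo|th|ss)*) )  # everything else

echo "total=${#all[@]} with=${#with_pair[@]} without=${#without_pair[@]}"
echo "without: ${without_pair[*]##*/}"

rm -rf "$dir"
```

The ‘with’ and ‘without’ counts add up to the total, with ‘quits’, ‘whales’ and ‘yodel’ in the negated group.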

Example 6 - use of patterns elsewhere

We have seen at various times in this series that glob-style patterns can be used in other contexts. One instance was when manipulating Bash parameters (show 1648):

$ x="aaabbbccc"
$ echo ${x/a/-}
-aabbbccc

Here we created a variable ‘x’ and used pattern substitution to replace the first ‘a’ with a hyphen.

$ echo ${x/+(a)/-}
-bbbccc

This time we have used the ‘+(a)’ pattern to match one or more ‘a’s. Note that the matched group is replaced by one hyphen. If we want to replace each of the letters with a hyphen then we’d use the alternative form of pattern substitution (with a double slash) that works through the entire string:

$ echo ${x//a/-}
---bbbccc

This time we didn’t want to match a group of letters, so didn’t use extended pattern matching.
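
Extended patterns also work with the other pattern-based parameter expansions. As a further sketch (my own example, not from the earlier shows), the following trims leading and trailing whitespace with ‘##’ and ‘%%’, and combines ‘//’ with ‘+(b)’ to replace each run of letters with a single hyphen.

```shell
#!/usr/bin/env bash
# Sketch: extended patterns inside parameter expansions.
shopt -s extglob

raw='   hot dog   '
# Remove the longest run of whitespace from the front, then from the end
trimmed=${raw##+([[:space:]])}
trimmed=${trimmed%%+([[:space:]])}
echo "[$trimmed]"

x="aaabbbccc"
# Every run of one or more 'b's becomes a single hyphen
collapsed=${x//+(b)/-}
echo "$collapsed"
```

The first ‘echo’ shows ‘[hot dog]’ and the second shows ‘aaa-ccc’.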

Another place where extended pattern matching can be used is in ‘case’ statements. I will not go into further detail about this here. However, there is a Stack Exchange question about it listed in the Links section.
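
For anyone who wants a starting point, here is a minimal sketch of extended patterns in a ‘case’ statement. The function ‘classify’ and its categories are hypothetical, made up for this illustration. Note that ‘extglob’ has to be set before the function is defined, since the patterns are recognised when the ‘case’ statement is parsed.

```shell
#!/usr/bin/env bash
# Sketch: classify a string using extended patterns in 'case'.
shopt -s extglob   # must come before the function definition below

classify() {
    case $1 in
        # optional sign, then one or more digits
        ?([-+])+([0-9]))          echo integer ;;
        # optional sign, digits, a literal dot, more digits
        ?([-+])+([0-9]).+([0-9])) echo decimal ;;
        *)                        echo other ;;
    esac
}

for value in 42 -7 3.14 abc 1.2.3; do
    printf '%-6s %s\n' "$value" "$(classify "$value")"
done
```

Here ‘42’ and ‘-7’ are reported as integers, ‘3.14’ as a decimal, and ‘abc’ and ‘1.2.3’ fall through to the default.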

To summarise: anywhere that a filename-type pattern match is allowed, extended patterns can be used (assuming ‘extglob’ is set).

Conclusion

Until I started investigating these extended pattern matching features of Bash I did not think I would find them particularly useful. It also took me quite a while to understand how they worked.

Now I actually find them quite powerful and will use them in future in scripts I write.

Bash extended patterns are similar in concept to Regular Expressions, although they are written totally differently. For example, the Bash pattern: ‘hot*(dog)’ means the same as the RE: ‘hot(dog)*’. They both match the words “hot” and “hotdog”. The difference is that ‘*’ in a RE means that the preceding expression may match zero or more times, and can follow many sorts of expressions. The extended pattern is not quite so general.
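
The comparison can be tried directly, since ‘[[ ]]’ supports both kinds of matching. This is a sketch with a made-up word list. One assumption to be aware of: I have added ‘^’ and ‘$’ anchors to the RE, because ‘=~’ will otherwise match anywhere in the string, whereas the Bash pattern always has to match the whole string.

```shell
#!/usr/bin/env bash
# Sketch: compare the pattern hot*(dog) with the RE ^hot(dog)*$.
shopt -s extglob

re='^hot(dog)*$'   # kept in a variable, the usual way to use =~
agree=yes

for word in hot hotdog hotdogdog hotcat shot; do
    glob=no
    regex=no
    [[ $word == hot*(dog) ]] && glob=yes
    [[ $word =~ $re ]] && regex=yes
    echo "$word glob=$glob regex=$regex"
    # Record any word where the two methods disagree
    [[ $glob == "$regex" ]] || agree=no
done

echo "agree=$agree"
```

Both methods accept ‘hot’, ‘hotdog’ and ‘hotdogdog’ and reject the other two words.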

I hope this episode has helped you understand these Bash features and that you also find them useful.


Manual Page Extracts

EXPANSION

Expansion is performed on the command line after it has been split into words. There are seven kinds of expansion performed: brace expansion, tilde expansion, parameter and variable expansion, command substitution, arithmetic expansion, word splitting, and pathname expansion.

The order of expansions is: brace expansion; tilde expansion, parameter and variable expansion, arithmetic expansion, and command substitution (done in a left-to-right fashion); word splitting; and pathname expansion.

On systems that can support it, there is an additional expansion available: process substitution. This is performed at the same time as tilde, parameter, variable, and arithmetic expansion and command substitution.

Only brace expansion, word splitting, and pathname expansion can change the number of words of the expansion; other expansions expand a single word to a single word. The only exceptions to this are the expansions of “$@” and “${name[@]}” as explained above (see PARAMETERS).

Brace Expansion

See the notes for HPR show 1884.

Tilde Expansion

See the notes for HPR show 1903.

Parameter Expansion

See the notes for HPR show 1648.

Command Substitution

See the notes for HPR show 1903.

Arithmetic Expansion

See the notes for HPR show 1951.

Process Substitution

See the notes for HPR show 2045.

Word Splitting

See the notes for HPR show 2045.

Pathname Expansion

See the notes for HPR show 2278 for some of the material in this section.

After word splitting, unless the -f option has been set, bash scans each word for the characters *, ?, and [. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of filenames matching the pattern (see Pattern Matching below). If no matching filenames are found, and the shell option nullglob is not enabled, the word is left unchanged. If the nullglob option is set, and no matches are found, the word is removed. If the failglob shell option is set, and no matches are found, an error message is printed and the command is not executed. If the shell option nocaseglob is enabled, the match is performed without regard to the case of alphabetic characters. Note that when using range expressions like [a-z] (see below), letters of the other case may be included, depending on the setting of LC_COLLATE. When a pattern is used for pathname expansion, the character “.” at the start of a name or immediately following a slash must be matched explicitly, unless the shell option dotglob is set. When matching a pathname, the slash character must always be matched explicitly. In other cases, the “.” character is not treated specially. See the description of shopt below under SHELL BUILTIN COMMANDS for a description of the nocaseglob, nullglob, failglob, and dotglob shell options.
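
(The following is not part of the manual page.) The difference ‘nullglob’ makes when nothing matches can be seen with a sketch like this, which uses an empty temporary directory and array names of my own choosing.

```shell
#!/usr/bin/env bash
# Sketch: what happens to a pattern that matches no files.
dir=$(mktemp -d)

shopt -u nullglob failglob
default=( "$dir"/*.nomatch )   # the pattern is left as a literal word

shopt -s nullglob
nulled=( "$dir"/*.nomatch )    # the word is removed altogether

echo "default: ${#default[@]} word(s), nullglob: ${#nulled[@]} word(s)"

shopt -u nullglob
rm -rf "$dir"
```

This is why the earlier sketches that collect matches into arrays are safer with ‘nullglob’ set: an unmatched pattern gives an empty array instead of one bogus element.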

The GLOBIGNORE shell variable may be used to restrict the set of filenames matching a pattern. If GLOBIGNORE is set, each matching filename that also matches one of the patterns in GLOBIGNORE is removed from the list of matches. The filenames “.” and “..” are always ignored when GLOBIGNORE is set and not null. However, setting GLOBIGNORE to a non-null value has the effect of enabling the dotglob shell option, so all other filenames beginning with a “.” will match. To get the old behavior of ignoring filenames beginning with a “.”, make “.*” one of the patterns in GLOBIGNORE. The dotglob option is disabled when GLOBIGNORE is unset.

Pattern Matching

Any character that appears in a pattern, other than the special pattern characters described below, matches itself. The NUL character may not occur in a pattern. A backslash escapes the following character; the escaping backslash is discarded when matching. The special pattern characters must be quoted if they are to be matched literally.

The special pattern characters have the following meanings:

*

Matches any string, including the null string. When the globstar shell option is enabled, and * is used in a pathname expansion context, two adjacent *s used as a single pattern will match all files and zero or more directories and subdirectories. If followed by a /, two adjacent *s will match only directories and subdirectories.

?

Matches any single character.

[…]

Matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range expression; any character that falls between those two characters, inclusive, using the current locale’s collating sequence and character set, is matched. If the first character following the [ is a ! or a ^ then any character not enclosed is matched. The sorting order of characters in range expressions is determined by the current locale and the values of the LC_COLLATE or LC_ALL shell variables, if set. To obtain the traditional interpretation of range expressions, where [a-d] is equivalent to [abcd], set value of the LC_ALL shell variable to C, or enable the globasciiranges shell option. A - may be matched by including it as the first or last character in the set. A ] may be matched by including it as the first character in the set.

Within [ and ], character classes can be specified using the syntax [:class:], where class is one of the following classes defined in the POSIX standard:

alnum alpha ascii blank cntrl digit graph lower print punct space upper word xdigit

A character class matches any character belonging to that class. The word character class matches letters, digits, and the character _.

Within [ and ], an equivalence class can be specified using the syntax [=c=], which matches all characters with the same collation weight (as defined by the current locale) as the character c.

Within [ and ], the syntax [.symbol.] matches the collating symbol symbol.

If the extglob shell option is enabled using the shopt builtin, several extended pattern matching operators are recognized. In the following description, a pattern-list is a list of one or more patterns separated by a |. Composite patterns may be formed using one or more of the following sub-patterns:

?(pattern-list)

Matches zero or one occurrence of the given patterns

*(pattern-list)

Matches zero or more occurrences of the given patterns

+(pattern-list)

Matches one or more occurrences of the given patterns

@(pattern-list)

Matches one of the given patterns

!(pattern-list)

Matches anything except one of the given patterns


  1. Note that on the versions of GNU/Linux that I run (Debian, KDE Neon and Raspbian) ‘extglob’ is on by default. It is actually set in /usr/share/bash-completion/bash_completion which is invoked directly or from /etc/bash_completion which is invoked from the default ~/.bashrc. These are all Debian-derived distributions, so I can’t speak for others.