Some supplementary Bash tips (HPR Show 2278)

Pathname expansion; part 1 of 2

Dave Morriss


Expansion

As we saw in the last episode 2045 (and others in this sub-series) there are eight types of expansion applied to the command line in the following order:

  • Brace expansion (we looked at this subject in episode 1884)
  • Tilde expansion (seen in episode 1903)
  • Parameter and variable expansion (this was covered in episode 1648)
  • Command substitution (seen in episode 1903)
  • Arithmetic expansion (seen in episode 1951)
  • Process substitution (seen in episode 2045)
  • Word splitting (seen in episode 2045)
  • Pathname expansion (this episode and the next)

This is the last topic in the (sub-) series about expansion in Bash. However, when writing the notes for this episode it became apparent that there was too much to fit into a single HPR episode. Consequently I have made it into two.

In this episode we will look at simple pathname expansion and some of the ways in which its behaviour can be controlled. In the next episode we’ll finish by looking at extended pattern matching. Both are included in the “Manual Page Extracts” section at the end of the long notes.

Pathname expansion

This type of expansion is also known as Filename Expansion or Globbing. It is about the expansion of wildcard characters such as ‘*’ in filenames. You have almost certainly used it in commands like:

ls *.txt

The names glob and globbing have an historical origin. In the early days of Unix this type of wildcard expansion was performed by the separate program /etc/glob, an abbreviation of the phrase “global command”. Later a library function ‘glob()’ was provided to replace it and the name has stuck since then.

Operating systems other than Unix, and other environments and scripting languages also have a similar concept of glob patterns using wildcard characters. The actual characters often vary from those used in Bash, but the concepts are very similar. See the Wikipedia article on this subject for more details.

Note that this process does not use regular expressions in the sense you will have seen in other places (such as in the HPR series called Learning sed). These glob patterns are older and not as sophisticated.

Although this process of wildcard expansion is normally used in the context of file names or paths to files, such patterns are used in other contexts as well. When we looked at parameter and variable expansion in episode 1648 we saw expressions such as:

$ dir="/home/user/Downloads/hpr1648.ogg"
$ echo ${dir##*/}
hpr1648.ogg

Here ‘*/’ matches a part of the path in variable ‘dir’ and the operation strips it all away leaving just the terminal filename in the same way as the ‘basename’ command.
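As a small sketch of the same idea, the four pattern-removal operators can be combined to split a path into its parts. The variable names ‘path’, ‘file’, ‘parent’, ‘stem’ and ‘ext’ are only chosen for this illustration; no files are needed because the patterns are matched against the contents of a variable:

```bash
#!/usr/bin/env bash
# Glob patterns used inside parameter expansion; nothing on disk is touched.
path="/home/user/Downloads/hpr1648.ogg"   # the example path used above

file=${path##*/}    # remove the longest prefix matching '*/'
parent=${path%/*}   # remove the shortest suffix matching '/*'
stem=${file%.*}     # remove the shortest suffix matching '.*'
ext=${file##*.}     # remove the longest prefix matching '*.'

printf '%s\n' "$file" "$parent" "$stem" "$ext"
```

This should print ‘hpr1648.ogg’, ‘/home/user/Downloads’, ‘hpr1648’ and ‘ogg’ on separate lines.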

Making test files

To have some files to experiment with for this episode (and the next one) I created a series of directories and files within them:

$ mkdir Pathname_expansion
$ cd Pathname_expansion
$ mkdir {a..z}
$ for d in {a..z}
> do
> touch $d/${d}{a..z}{01..50}.txt
> done

These commands do the following:

  • Create a directory ‘Pathname_expansion’
  • Change directory into ‘Pathname_expansion’
  • Using brace expansion create the directories ‘a’ to ‘z’
  • The line beginning ‘for’ is the start of a multi-line command; the ‘>’ means Bash is prompting for the next line:
    • Loop through the directories just created
    • Using ‘touch’ create a series of files in each one. The files begin with the letter of the directory, followed by another letter, followed by a two-digit number in the range 01-50, followed by ‘.txt’.
    • Note that to use variable ‘d’ as part of the filename it needs to be enclosed in ‘{}’ braces to separate it from the following brace expansion.

Each directory will therefore contain 26*50=1,300 files making a total of 33,800 (empty) files.
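If you want to check the arithmetic, the following sketch rebuilds the same tree in a scratch directory and counts the files with find. The use of mktemp and the variable names ‘work’, ‘start’, ‘per_dir’ and ‘total’ are assumptions made for this example, and it expects Bash 4 or later for the zero-padded ‘{01..50}’ sequence:

```bash
#!/usr/bin/env bash
# Rebuild the test tree in a scratch directory and count what was created.
start=$PWD
work=$(mktemp -d)                 # scratch location (an assumption)
cd "$work" || exit 1

mkdir {a..z}
for d in {a..z}; do
    # The quotes protect the variable; the braces are still expanded.
    touch "$d"/"${d}"{a..z}{01..50}.txt
done

per_dir=$(find a -type f | wc -l)    # files in one directory
total=$(find . -type f | wc -l)      # files in all 26 directories
echo "per directory: $per_dir, total: $total"

cd "$start" && rm -rf "$work"        # tidy up
```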

Using the test files

So, now we need to look at how these various files could be referred to using pathname expansion.

According to the manual page:

Bash scans each word for the characters ‘*’, ‘?’, and ‘[’. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of filenames matching the pattern.

These pattern characters have the following meanings (see the “Manual Page Extracts” section below under the heading “Pattern Matching” for a more detailed description):

*

Matches any string, including the null string.

?

Matches any single character.

[…]

Matches any one of the enclosed characters, such as [abc].

A pair of characters separated by a hyphen (such as [a-z]) denotes a range expression; any character that falls between those two characters, inclusive, is matched (see note 1 at the end of these notes).

If the first character following the ‘[’ is a ‘!’ (exclamation mark) or a ‘^’ (circumflex) then any character not enclosed is matched. Note that ‘!’ is the POSIX standard, making ‘^’ non-standard.

A ‘-’ (hyphen) may be matched by including it as the first or last character in the set, such as [-a-z] or [a-z-] meaning any letter from the range expression as well as the hyphen.

A ‘]’ (close square bracket) may be matched by including it as the first character in the set, such as []a-z].

Other character classes may be used within these square brackets, where a class has the form [:class:], such as [[:alnum:]] which has the same meaning as [a-zA-Z0-9]. See the more detailed manual page extracts at the end of this document.
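The bracket forms described above can be tried out on a handful of files. This is only a sketch: the four filenames and the array names are invented for the illustration, and each array simply captures what the pattern expands to:

```bash
#!/usr/bin/env bash
# Try the bracket expression forms on a few files in a scratch directory.
start=$PWD
work=$(mktemp -d)
cd "$work" || exit 1
touch a1.txt b2.txt c3.txt d-4.txt     # hypothetical test files

range=( [a-c]*.txt )           # first character in the range a to c
negated=( [!a-c]*.txt )        # first character NOT in the range a to c
digits=( ?[[:digit:]].txt )    # any character, then a digit, then '.txt'
hyphen=( ?[-_]*.txt )          # second character a hyphen or an underscore

echo "range:   ${range[*]}"
echo "negated: ${negated[*]}"
echo "digits:  ${digits[*]}"
echo "hyphen:  ${hyphen[*]}"

cd "$start" && rm -rf "$work"
```

The first and third patterns should match ‘a1.txt’, ‘b2.txt’ and ‘c3.txt’, while the second and fourth should match only ‘d-4.txt’.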

To refer to all files in the directory ‘a’ which have ‘a’ as the second letter we could use the following pattern:

$ ls a/?a*

Here the question mark ‘?’ means any character (though we know that all the files begin with ‘a’ in this directory). This is followed by a letter ‘a’ meaning the second letter must be ‘a’. Finally there is an asterisk ‘*’ which means that the rest of the filename can be anything.

This command returns 50 filenames like this (only the first two lines are shown):

$ ls -w 60 a/?a*
a/aa01.txt  a/aa11.txt  a/aa21.txt  a/aa31.txt  a/aa41.txt
a/aa02.txt  a/aa12.txt  a/aa22.txt  a/aa32.txt  a/aa42.txt
...

Note that the use of ‘-w 60’ limits the width of the output from ls to 60 characters, which in turn restricts the number of columns it produces.

Some notes about pattern matching

As already mentioned, there is a certain resemblance between these patterns and regular expressions, which you may have encountered in other HPR episodes such as Learning sed and Learning Awk. The two should not be confused. Regular expressions are far more powerful, but are not available in Bash in the same context.

The expansion of these patterns takes place on the command line, resulting in an alphabetical list of pathnames, and these are presented to the command. For example, the echo command may be used:

$ echo a/?a0*
a/aa01.txt a/aa02.txt a/aa03.txt a/aa04.txt a/aa05.txt a/aa06.txt a/aa07.txt a/aa08.txt a/aa09.txt

Here the pattern ‘a/?a0*’ was used, meaning files in directory ‘a’ starting with any character, followed by an ‘a’, a zero and then any number of further characters. This was expanded by the Bash shell and the nine pathnames were passed to echo which printed them.

It might help to demonstrate this more clearly by using arrays (covered to some extent in the episode entitled Bash parameter manipulation):

$ vec=(a/?a0*)
$ echo ${#vec[@]}
9
$ echo ${vec[@]}
a/aa01.txt a/aa02.txt a/aa03.txt a/aa04.txt a/aa05.txt a/aa06.txt a/aa07.txt a/aa08.txt a/aa09.txt

Here the array called ‘vec’ is filled with the result of the pathname expansion using the same pattern as before. When we use the substitution syntax ‘${#vec[@]}’ we get the number of elements in the array, and ‘${vec[@]}’ returns all of the elements which are printed by echo.
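One thing worth knowing when working with such arrays is that the elements should be quoted when they are used if any filename might contain spaces. The following sketch shows the difference; the two filenames and the counter names are invented for the example:

```bash
#!/usr/bin/env bash
# Each matching pathname becomes one array element, even if it contains a space.
start=$PWD
work=$(mktemp -d)
cd "$work" || exit 1
touch "my notes.txt" plain.txt      # hypothetical files, one with a space

files=( *.txt )                     # two elements

unquoted=0
for f in ${files[@]}; do            # unquoted: the elements are split again
    unquoted=$((unquoted + 1))
done

quoted=0
for f in "${files[@]}"; do          # quoted: one word per element
    quoted=$((quoted + 1))
done

echo "elements: ${#files[@]}, unquoted words: $unquoted, quoted words: $quoted"

cd "$start" && rm -rf "$work"
```

Here the array should hold two elements, but the unquoted loop sees three words because ‘my notes.txt’ is split at the space.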

These pattern expansions do not occur when enclosed in single or double quotation marks. Such a pattern is treated simply as a verbatim string:

$ echo "a/?a0*"
a/?a0*
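It is possible to quote just part of a word, which is useful when a directory name held in a variable might contain spaces. In this sketch (the directory ‘my dir’ and the array names are made up for the purpose) only the variable is quoted in the first case, so the ‘*’ is still expanded:

```bash
#!/usr/bin/env bash
# Quote the variable part of a word but leave the wildcard unquoted.
work=$(mktemp -d)
dir="$work/my dir"                  # hypothetical directory with a space
mkdir "$dir"
touch "$dir/one.txt" "$dir/two.txt"

expanded=( "$dir"/*.txt )    # quotes protect the space; '*' still expands
literal=( "$dir/*.txt" )     # whole word quoted; the pattern stays verbatim

echo "expanded: ${#expanded[@]} element(s)"
echo "literal:  ${#literal[@]} element(s): ${literal[0]}"

rm -rf "$work"
```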

Note also that if the pattern does not end with a wildcard then it implies that the final part exactly matches the end of the file name:

$ echo a/aa1*tx
a/aa1*tx
$ echo a/aa1*txt
a/aa10.txt a/aa11.txt a/aa12.txt a/aa13.txt a/aa14.txt a/aa15.txt a/aa16.txt a/aa17.txt a/aa18.txt a/aa19.txt

All files end with ‘txt’, so ending the pattern with ‘tx’ matches nothing.

Later in this episode we will look in more detail at how the expansion process just returns the pattern if there are no matches.

Using shopt in relation to pathname expansion

There are a number of Bash options that affect the way that pathname expansion works. These are referred to in detail in the manual page extracts at the end of these notes.

The shopt command is built into the Bash shell. Typing it on its own results in a list of all of the options and their settings:

$ shopt
autocd          off
cdable_vars     off
cdspell         off
...

(The rest have been omitted.)

Typing shopt with the name of an option returns its current setting:

$ shopt dotglob
dotglob         off

To turn on an option use shopt -s followed by the option name (‘s’ stands for ‘set’):

$ shopt -s dotglob

Turning it off is achieved with shopt -u (‘u’ stands for ‘unset’):

$ shopt -u dotglob

The status of the settings can be reported in a form that can be saved and used as commands by specifying the -p option:

$ shopt -p dotglob failglob
shopt -u dotglob
shopt -u failglob

We will look at a subset of the settings controlled by shopt. This subset consists of the settings which are of relevance to pathname expansion.
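In a script it can be useful to change an option temporarily and then put it back as it was. One possible way, sketched here, is to save the output of shopt -p and feed it to eval later. The variable names are just for this example, and the -q option (described in the manual page extracts) is used to test the setting quietly:

```bash
#!/usr/bin/env bash
# Save the state of some options, change them, then restore them.

# shopt returns a non-zero status when a listed option is off,
# so '|| true' stops that being treated as a failure.
saved=$(shopt -p dotglob nullglob) || true

if shopt -q dotglob; then before=on; else before=off; fi

shopt -s dotglob nullglob           # temporary settings
if shopt -q dotglob; then during=on; else during=off; fi

eval "$saved"                       # run the saved 'shopt' commands
if shopt -q dotglob; then after=on; else after=off; fi

echo "dotglob before=$before during=$during after=$after"
```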

The dotglob option

This option controls whether files beginning with a dot (‘.’) are returned by pathname expansion. Normally, they are not.

To demonstrate this we first create a file with a name beginning with a dot:

$ touch a/.dotfile

Such files are called hidden because many parts of the operating system do not show them unless requested.

Normally trying to find such a file using a pathname with wildcards fails:

$ ls a/*dot*
ls: cannot access 'a/*dot*': No such file or directory
$ ls a/?dot*
ls: cannot access 'a/?dot*': No such file or directory
$ ls a/[.]dot*
ls: cannot access 'a/[.]dot*': No such file or directory

You might think that adding the -a option to ls (which shows hidden files) might solve the problem. It does not. The issue is that the target file is not returned by the expansion, so the ls command is simply given the pattern, which it treats as a filename and there is no file called “asterisk-d-o-t-asterisk” or any of the others with literal wildcards.

However, if dotglob is on then the file becomes visible:

$ shopt -s dotglob
$ ls a/*dot*
a/.dotfile

Of course, the file is visible to ls if the -a option is used (and dotglob is off), and no pathname expansion is used. However, in this case all 1300 other files in the directory would be listed.

We’ll list the filenames in one column (-1) and view just the last 3 to demonstrate this:

$ ls -1 -a a | tail -3
az49.txt
az50.txt
.dotfile

As a Unix newbie I struggled with this “dotfile” issue a lot. I hope this has helped to clarify things for you.
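The following sketch gathers these points together in a scratch directory. The filenames and array names are assumptions for the example. It also shows that a pattern which begins with an explicit dot matches hidden files whether dotglob is on or not, and that ‘.’ and ‘..’ are not returned by ‘*’ even with dotglob on:

```bash
#!/usr/bin/env bash
# Compare expansions with dotglob off and on.
start=$PWD
work=$(mktemp -d)
cd "$work" || exit 1
touch .dotfile visible.txt          # hypothetical files

shopt -u dotglob
without=( * )        # hidden file is skipped
explicit=( .d* )     # leading dot matched explicitly, so the file is found

shopt -s dotglob
with=( * )           # hidden file included, but not '.' or '..'
shopt -u dotglob

echo "dotglob off: ${without[*]}"
echo "explicit:    ${explicit[*]}"
echo "dotglob on:  ${with[*]}"

cd "$start" && rm -rf "$work"
```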

The extglob option

This option controls whether extended pattern matching features are enabled or not. We will look at these in the next episode.

The failglob option

This option controls whether an error is produced when a pattern fails to match filenames during pathname expansion.

The example shows that when failglob is on, the failure of the match is detected early and the command is aborted; otherwise the failed pattern is passed to the ls command.

$ shopt -s failglob
$ ls a/aa50*
a/aa50.txt
$ ls a/aa51*
-bash: no match: a/aa51*
$ shopt -u failglob
$ ls a/aa51*
ls: cannot access 'a/aa51*': No such file or directory

Note that turning on the failglob option has other effects that might not be very desirable, such as on Tab completion. Use with caution.
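One way to limit these side effects, sketched below, is to turn failglob on only inside a subshell so that the setting does not leak into the rest of the session or script. The filenames and variable names are invented for the example, and the error message is simply discarded:

```bash
#!/usr/bin/env bash
# Use failglob inside a subshell so the option does not persist.
start=$PWD
work=$(mktemp -d)
cd "$work" || exit 1
touch aa50.txt                      # hypothetical file

# Without failglob the unmatched pattern is passed on unchanged.
plain=$(echo aa51*)

# With failglob the command is not run at all and the status is non-zero.
out=$( ( shopt -s failglob; echo aa51* ) 2>/dev/null )
status=$?

echo "plain='$plain' out='$out' status=$status"

cd "$start" && rm -rf "$work"
```

With failglob off echo prints the pattern itself, whereas with it on nothing is printed and the failure can be detected from the exit status.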

The globasciiranges option

Setting this option on disables the use of the collating sequence of the current locale, and reverts to traditional ASCII. It is relevant to bracket expressions like [a-z].

$ mkdir test
$ cd test
$ touch á
$ touch b
$ shopt -s globasciiranges
$ ls [a-b]
b
$ shopt -u globasciiranges
$ ls [a-b]
á b
$ cd -

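Setting globasciiranges makes the file called ‘á’ disappear from the results because ‘á’ does not fall between ‘a’ and ‘b’ in ASCII order. With the option off, the collating sequence of the locale used here places ‘á’ within the range.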

The globstar option

When this option is on the pattern ‘**’ causes recursive scanning of directories when pattern matching.

To demonstrate this we will create some extra directories and files:

$ mkdir -p dir1/dir2/dir3
$ touch dir1/dir2/dir3/{test.tmp,README} dir1/dir2/{test2.tmp,test2.dat} dir1/test3.tmp
$ tree dir1/
dir1/
├── dir2
│   ├── dir3
│   │   ├── README
│   │   └── test.tmp
│   ├── test2.dat
│   └── test2.tmp
└── test3.tmp

2 directories, 5 files

Note the use of mkdir -p to create all directories at once, the multiple arguments to touch that use brace expansion and the tree command that draws a diagram of the directory structure.

Now we can list files in this tree structure by using the ‘**’ pattern. We will use echo again since ls will show all files in directories if their names are returned after expansion:

$ shopt -s globstar
$ echo **/*.tmp
dir1/dir2/dir3/test.tmp dir1/dir2/test2.tmp dir1/test3.tmp

Note that if ‘**’ is followed by a ‘/’ only directories are matched.

$ echo dir1/**/
dir1/ dir1/dir2/ dir1/dir2/dir3/

Here, the ls command receives the directory names then lists their contents:

$ ls dir1/**/
dir1/:
dir2  test3.tmp

dir1/dir2/:
dir3  test2.dat  test2.tmp

dir1/dir2/dir3/:
README  test.tmp

With globstar turned off we cannot recurse through the directory structure looking for files and ‘**’ has no special meaning:

$ shopt -u globstar
$ echo **/*.tmp
dir1/test3.tmp
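The difference can be seen more directly by capturing both expansions in arrays. This sketch recreates the same small tree in a scratch directory (the array names ‘deep’ and ‘shallow’ are just for illustration) and needs Bash 4 or later for globstar:

```bash
#!/usr/bin/env bash
# Compare '**' with globstar on and off.
start=$PWD
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p dir1/dir2/dir3
touch dir1/dir2/dir3/{test.tmp,README} dir1/dir2/{test2.tmp,test2.dat} dir1/test3.tmp

shopt -s globstar
deep=( **/*.tmp )        # recurses through all directory levels

shopt -u globstar
shallow=( **/*.tmp )     # '**' behaves like '*', so one level only

echo "globstar on:  ${#deep[@]} file(s)"
echo "globstar off: ${#shallow[@]} file(s): ${shallow[*]}"

cd "$start" && rm -rf "$work"
```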

Learning to use the find command can be a better solution to the problems of finding files in a directory hierarchy:

$ find dir1 -name "*.tmp"
dir1/dir2/dir3/test.tmp
dir1/dir2/test2.tmp
dir1/test3.tmp

The nocaseglob option

Normally pathname expansion is case sensitive. Setting the nocaseglob option turns off case-sensitivity.

With nocaseglob off the pattern matches nothing:

$ echo a/AA0*
a/AA0*

With the option on the same pattern matches files:

$ shopt -s nocaseglob
$ echo a/AA0*
a/aa01.txt a/aa02.txt a/aa03.txt a/aa04.txt a/aa05.txt a/aa06.txt a/aa07.txt a/aa08.txt a/aa09.txt

However, the directory does not match in a case-insensitive way because it is not part of a pattern:

$ echo A/AA0*
A/AA0*

When a pattern is used for the directory then a case-insensitive match works:

$ echo [A]/AA0*
a/aa01.txt a/aa02.txt a/aa03.txt a/aa04.txt a/aa05.txt a/aa06.txt a/aa07.txt a/aa08.txt a/aa09.txt
$ echo ?/AA0*
a/aa01.txt a/aa02.txt a/aa03.txt a/aa04.txt a/aa05.txt a/aa06.txt a/aa07.txt a/aa08.txt a/aa09.txt
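The first two cases can be reproduced on a smaller scale as follows. This is a sketch using a scratch directory with just two files; the array names are assumptions:

```bash
#!/usr/bin/env bash
# Compare case-sensitive and case-insensitive matching.
start=$PWD
work=$(mktemp -d)
cd "$work" || exit 1
mkdir a
touch a/aa01.txt a/aa02.txt         # cut-down version of the test files

shopt -u nocaseglob
sensitive=( a/AA0* )     # no match, so the pattern itself is kept

shopt -s nocaseglob
insensitive=( a/AA0* )   # matches the lower-case filenames
shopt -u nocaseglob

echo "nocaseglob off: ${sensitive[*]}"
echo "nocaseglob on:  ${insensitive[*]}"

cd "$start" && rm -rf "$work"
```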

The nullglob option

As we saw when discussing dotglob, a pattern that matches nothing is returned intact, which might result in a command treating it as a pathname.

The nullglob option, when on, results in a null string being returned in such cases; in effect the unmatched pattern is removed from the command line.

$ ls a/*dot*
ls: cannot access 'a/*dot*': No such file or directory
$ shopt -s nullglob
$ echo "[" a/*dot* "]"
[ ]
$ shopt -u nullglob
$ echo "[" a/*dot* "]"
[ a/*dot* ]

Here the ls command used before is demonstrated showing the pattern being returned when the match fails. Then nullglob is turned on and echo is used to demonstrate the null string being returned. We use (quoted) brackets to show this. When nullglob is off then the pattern is returned as before.
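Where nullglob is probably most useful is in loops over files in scripts. Without it a loop over a pattern that matches nothing runs once with the pattern itself as the value. This sketch shows the difference in an empty scratch directory; the ‘*.log’ pattern and the variable names are invented for the example:

```bash
#!/usr/bin/env bash
# Loop over a pattern that matches nothing, with nullglob off and on.
start=$PWD
work=$(mktemp -d)
cd "$work" || exit 1                # empty directory, so nothing matches

shopt -u nullglob
count_off=0
for f in *.log; do                  # runs once with f='*.log'
    count_off=$((count_off + 1))
done

shopt -s nullglob
count_on=0
for f in *.log; do                  # the word is removed; no iterations
    count_on=$((count_on + 1))
done
none=( *.log )                      # an empty array
shopt -u nullglob

echo "iterations: off=$count_off on=$count_on, array size=${#none[@]}"

cd "$start" && rm -rf "$work"
```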

Conclusion

Pathname expansion and a knowledge of the patterns Bash uses are very important for effective use of the Bash command line or for writing Bash scripts. The various options controlled by shopt are less critical, with the exception of dotglob perhaps.

In the next (and final) episode about expansion we will look at other factors controlling expansion and will examine the extended pattern matching operators.


Manual Page Extracts

EXPANSION

Expansion is performed on the command line after it has been split into words. There are seven kinds of expansion performed: brace expansion, tilde expansion, parameter and variable expansion, command substitution, arithmetic expansion, word splitting, and pathname expansion.

The order of expansions is: brace expansion; tilde expansion, parameter and variable expansion, arithmetic expansion, and command substitution (done in a left-to-right fashion); word splitting; and pathname expansion.

On systems that can support it, there is an additional expansion available: process substitution. This is performed at the same time as tilde, parameter, variable, and arithmetic expansion and command substitution.

Only brace expansion, word splitting, and pathname expansion can change the number of words of the expansion; other expansions expand a single word to a single word. The only exceptions to this are the expansions of “$@” and “${name[@]}” as explained above (see PARAMETERS).

Brace Expansion

See the notes for HPR show 1884.

Tilde Expansion

See the notes for HPR show 1903.

Parameter Expansion

See the notes for HPR show 1648.

Command Substitution

See the notes for HPR show 1903.

Arithmetic Expansion

See the notes for HPR show 1951.

Process Substitution

See the notes for HPR show 2045.

Word Splitting

See the notes for HPR show 2045.

Pathname Expansion

After word splitting, unless the -f option has been set, bash scans each word for the characters *, ?, and [. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of filenames matching the pattern (see Pattern Matching below). If no matching filenames are found, and the shell option nullglob is not enabled, the word is left unchanged. If the nullglob option is set, and no matches are found, the word is removed. If the failglob shell option is set, and no matches are found, an error message is printed and the command is not executed. If the shell option nocaseglob is enabled, the match is performed without regard to the case of alphabetic characters. Note that when using range expressions like [a-z] (see below), letters of the other case may be included, depending on the setting of LC_COLLATE. When a pattern is used for pathname expansion, the character “.” at the start of a name or immediately following a slash must be matched explicitly, unless the shell option dotglob is set. When matching a pathname, the slash character must always be matched explicitly. In other cases, the “.” character is not treated specially. See the description of shopt below under SHELL BUILTIN COMMANDS for a description of the nocaseglob, nullglob, failglob, and dotglob shell options.

The GLOBIGNORE shell variable may be used to restrict the set of filenames matching a pattern. If GLOBIGNORE is set, each matching filename that also matches one of the patterns in GLOBIGNORE is removed from the list of matches. The filenames “.” and “..” are always ignored when GLOBIGNORE is set and not null. However, setting GLOBIGNORE to a non-null value has the effect of enabling the dotglob shell option, so all other filenames beginning with a “.” will match. To get the old behavior of ignoring filenames beginning with a “.”, make “.*” one of the patterns in GLOBIGNORE. The dotglob option is disabled when GLOBIGNORE is unset.

Pattern Matching

Any character that appears in a pattern, other than the special pattern characters described below, matches itself. The NUL character may not occur in a pattern. A backslash escapes the following character; the escaping backslash is discarded when matching. The special pattern characters must be quoted if they are to be matched literally.

The special pattern characters have the following meanings:

*

Matches any string, including the null string. When the globstar shell option is enabled, and * is used in a pathname expansion context, two adjacent *s used as a single pattern will match all files and zero or more directories and subdirectories. If followed by a /, two adjacent *s will match only directories and subdirectories.

?

Matches any single character.

[…]

Matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range expression; any character that falls between those two characters, inclusive, using the current locale’s collating sequence and character set, is matched. If the first character following the [ is a ! or a ^ then any character not enclosed is matched. The sorting order of characters in range expressions is determined by the current locale and the values of the LC_COLLATE or LC_ALL shell variables, if set. To obtain the traditional interpretation of range expressions, where [a-d] is equivalent to [abcd], set value of the LC_ALL shell variable to C, or enable the globasciiranges shell option. A - may be matched by including it as the first or last character in the set. A ] may be matched by including it as the first character in the set.

Within [ and ], character classes can be specified using the syntax [:class:], where class is one of the following classes defined in the POSIX standard:

alnum alpha ascii blank cntrl digit graph lower print punct space upper word xdigit

A character class matches any character belonging to that class. The word character class matches letters, digits, and the character _.

Within [ and ], an equivalence class can be specified using the syntax [=c=], which matches all characters with the same collation weight (as defined by the current locale) as the character c.

Within [ and ], the syntax [.symbol.] matches the collating symbol symbol.

If the extglob shell option is enabled using the shopt builtin, several extended pattern matching operators are recognized. In the following description, a pattern-list is a list of one or more patterns separated by a |. Composite patterns may be formed using one or more of the following sub-patterns:

?(pattern-list)

Matches zero or one occurrence of the given patterns

*(pattern-list)

Matches zero or more occurrences of the given patterns

+(pattern-list)

Matches one or more occurrences of the given patterns

@(pattern-list)

Matches one of the given patterns

!(pattern-list)

Matches anything except one of the given patterns


SHELL BUILTIN COMMANDS

This is an extract relating to the shopt builtin. Only the options relating to pathname expansion are included. For the full list refer to the Bash manual page.

shopt [-pqsu] [-o] [optname …]

Toggle the values of settings controlling optional shell behavior. The settings can be either those listed below, or, if the -o option is used, those available with the -o option to the set builtin command. With no options, or with the -p option, a list of all settable options is displayed, with an indication of whether or not each is set. The -p option causes output to be displayed in a form that may be reused as input. Other options have the following meanings:

-s     Enable (set) each optname.
-u     Disable (unset) each optname.
-q     Suppresses normal output (quiet mode); the return status
       indicates whether the optname is set or unset. If multiple optname
       arguments are given with -q, the return status is zero if all optnames
       are enabled; non-zero otherwise.
-o     Restricts the values of optname to be those defined for the -o
       option to the set builtin.

If either -s or -u is used with no optname arguments, shopt shows only those options which are set or unset, respectively. Unless otherwise noted, the shopt options are disabled (unset) by default.

The return status when listing options is zero if all optnames are enabled, non-zero otherwise. When setting or unsetting options, the return status is zero unless an optname is not a valid shell option.

The list of shopt options is:

dotglob If set, bash includes filenames beginning with a `.' in the
        results of pathname expansion.
extglob If set, the extended pattern matching features described above
        under Pathname Expansion are enabled.
failglob
        If set, patterns which fail to match filenames during pathname
        expansion result in an expansion error.
globasciiranges
        If set, range expressions used in pattern matching bracket
        expressions (see Pattern Matching above) behave as if in the
        traditional C locale when performing comparisons. That is, the
        current locale's collating sequence is not taken into account,
        so b will not collate between A and B, and upper-case and
        lower-case ASCII characters will collate together.
globstar
        If set, the pattern ** used in a pathname expansion context
        will match all files and zero or more directories and
        subdirectories. If the pattern is followed by a /, only
        directories and subdirectories match.
nocaseglob
        If set, bash matches filenames in a case-insensitive fashion
        when performing pathname expansion (see Pathname Expansion
        above).
nullglob
        If set, bash allows patterns which match no files (see
        Pathname Expansion above) to expand to a null string, rather
        than themselves.

  1. The simple concept of a range expression is complicated considerably by the fact that since it was invented many more character sets than plain ASCII have been added. The way in which such ranges are interpreted depends on the current LOCALE. See the Manual Page Extracts section for details.