Introduction to sed - part 2 (HPR Show 1986)

Dave Morriss


Introduction

In the last episode we looked at sed at the simplest level. We looked at three command-line options and the 's' command. We introduced the idea of basic regular expressions.

In this episode we will cover all of these topics in more detail.

We are looking at GNU sed in this series. This version contains many extensions to POSIX sed. These extensions provide many more features, but sed scripts written this way are not portable.

This episode uses two new data files called sed_demo2.txt and sed_demo3.txt in the various demonstrations and examples.

Command line options

We looked at the -e and -f options in the last episode. We will look at several more of the available options this time, but will not cover everything. Refer to the GNU manual for the full list.

-n or --quiet or --silent

By default, sed prints out the pattern space at the end of each cycle through the script (see "How sed works" in the last episode). These options disable this automatic printing, and sed only produces output when explicitly told to via the 'p' flag or command (see "The p flag" below).
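
As a quick sketch of the difference (the two input lines are made up for illustration and are not from the example files), compare the same substitution with and without '-n':

```shell
# Two sample lines (invented for this sketch)
input=$(printf 'one fish\ntwo birds\n')

# Without -n every line is printed, whether it was changed or not
all=$(printf '%s\n' "$input" | sed -e 's/fish/cat/')

# With -n nothing is printed automatically; the p flag prints only
# the lines where the substitution was actually made
changed=$(printf '%s\n' "$input" | sed -n -e 's/fish/cat/p')

printf '%s\n' "$all"
printf -- '--\n'
printf '%s\n' "$changed"
```

The first command shows both lines, the second only the line that was changed.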

-i[SUFFIX] or --in-place[=SUFFIX]

This option allows sed to edit files in place. If a suffix is specified the original file is renamed by appending the suffix, and the edited file given the original name. This provides a way of creating a backup of the original. If no suffix is given the original file is replaced by the edited file.

By default sed treats the input files on the command line as a single stream of data. When the -i option is used the files are treated separately (see the -s option).

If the suffix contains a '*' symbol then this is replaced by the current file name. See Example 1 below for how to use this.

--follow-symlinks

This option is relevant to the -i option and is available only on systems that support symbolic links. If it is specified and the file being edited is a symbolic link, the link is followed and the actual file is edited. If it is omitted (the default), the link is broken: it is replaced by a regular file containing the edited text, and the file it pointed to is not changed.
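
The following sketch shows both behaviours side by side. It assumes GNU sed and a system with mktemp and ln; the directory and all the file names are invented for the demonstration:

```shell
# Throwaway directory; every file name here is invented for the sketch
dir=$(mktemp -d)
printf 'hello\n' > "$dir/real1.txt"
printf 'hello\n' > "$dir/real2.txt"
ln -s real1.txt "$dir/link1"
ln -s real2.txt "$dir/link2"

# Default: link1 is replaced by a regular file and real1.txt is not changed
sed -i -e 's/hello/goodbye/' "$dir/link1"

# With the option: real2.txt is edited and link2 is still a link
sed -i --follow-symlinks -e 's/hello/goodbye/' "$dir/link2"

ls -l "$dir"
cat "$dir/real1.txt" "$dir/real2.txt"
```

Afterwards real1.txt still contains 'hello' while real2.txt contains 'goodbye', and only link2 is still shown as a link in the listing.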

-s or --separate

By default sed treats the input files on the command line as a single stream of data. This GNU sed extension causes the command to consider them as separate files. The relevance of this will become apparent in later episodes.

-r or --regexp-extended

By default sed uses basic regular expressions, but this GNU extension allows the use of extended regular expressions (those allowed by egrep). Standard sed uses backslashes to denote many special characters. In extended mode these backslashes are not required. However, the result is not portable.

More about the s command

Regular expressions

Regular expressions in sed can be more complex than those we looked at in the last episode, allowing much greater flexibility. The new meta-characters we'll look at this time all start with a backslash. Many other Unix tools that use regular expressions do the same, but others do not. This can be confusing, so it's important to be aware of the differences.

Expression        Meaning
\+                Similar to * but matches a sequence of one or more instances of the preceding item
\?                Similar to * but matches a sequence of zero or one instance of the preceding item
\{i\}             Matches exactly i sequences (i is a decimal integer)
\{i,j\}           Matches between i and j sequences, inclusive
\{i,\}            Matches i or more sequences, inclusive
\(regexp\)        Groups the inner regexp. Allows it to be followed by a postfix operator, or can be used for back references (see below)
regexp1\|regexp2  Matches regexp1 or regexp2, \| is used to separate alternatives

One or more of the preceding

Using the '\+' modifier matches sequences of variable length starting with one instance. So, using an example from the last episode:

s/a\+bc/def/

Here the sequence being matched is 'abc', 'aabc', 'aaabc' and so forth. It does not match 'bc' since there has to be at least one 'a'.

This is a GNU extension.
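
To see this in action, here is a small sketch that feeds three test strings (chosen just for this illustration) through the substitution, assuming GNU sed:

```shell
# Feed three test strings through the same substitution;
# only those with at least one 'a' before 'bc' are changed
plus_out=$(for s in abc aaabc bc; do
    printf '%s\n' "$s" | sed -e 's/a\+bc/def/'
done)
printf '%s\n' "$plus_out"
```

The first two strings come out as 'def' and the third is left as 'bc'.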

Zero or one of the preceding

The '\?' modifier matches zero or one of the preceding expression. So, considering the following example:

s/a\?bc/def/

This matches 'bc' and 'abc' because zero or one 'a' is specified.

This is a GNU extension.
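
A short sketch with three test strings (made up for the purpose, and assuming GNU sed) shows what happens when there are more 'a' characters than the modifier allows:

```shell
# 'bc' and 'abc' are replaced completely; in 'aabc' only one 'a'
# can be part of the match, so the first 'a' is left behind
q_out=$(for s in bc abc aabc; do
    printf '%s\n' "$s" | sed -e 's/a\?bc/def/'
done)
printf '%s\n' "$q_out"
```

The results are 'def', 'def' and 'adef'. The regexp is not anchored, so in the last case sed simply finds 'abc' inside 'aabc'.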

A fixed number of the preceding

Using the '\{i\}' modifier we specify a fixed number of the preceding expression:

s/a\{3\}bc/def/

This only matches 'aaabc' since three 'a' characters are needed.

Between i and j of the preceding

Using the '\{i,j\}' modifier we specify a number of the preceding expression between lower and upper bounds:

s/a\{1,5\}bc/def/

This matches 'abc', 'aabc', 'aaabc', 'aaaabc' and 'aaaaabc'; that is, between 1 and 5 'a' characters followed by 'bc'.

From i or more of the preceding

Using the '\{i,\}' modifier we specify a number of the preceding expression from a lower value to an undefined upper limit:

s/a\{1,\}bc/def/

This matches 'abc', 'aabc' and so on, with no limit to the number of 'a' characters. This is the same as:

s/a\+bc/def/

However, the lower limit does not have to be 1.
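
The three forms of the '\{...\}' modifier can be compared with a sketch like the one below. The helper function 'try' and the test strings are inventions for this illustration only:

```shell
# Helper for this sketch: run one sed script against one string
try() { printf '%s\n' "$2" | sed -e "$1"; }

exact_hit=$(try 's/a\{3\}bc/def/' 'aaabc')        # exactly three 'a's
exact_miss=$(try 's/a\{3\}bc/def/' 'aabc')        # only two: no match
exact_extra=$(try 's/a\{3\}bc/def/' 'aaaabc')     # four: first 'a' is left over
range_over=$(try 's/a\{1,5\}bc/def/' 'aaaaaabc')  # six: first 'a' is left over
open_miss=$(try 's/a\{2,\}bc/def/' 'abc')         # lower limit is 2: no match
open_hit=$(try 's/a\{2,\}bc/def/' 'aaaaaaabc')    # no upper limit

printf '%s\n' "$exact_hit" "$exact_miss" "$exact_extra" \
    "$range_over" "$open_miss" "$open_hit"
```

This prints 'def', 'aabc', 'adef', 'adef', 'abc' and 'def'. Note that because the regexp is not anchored, a string with too many 'a' characters still matches from part-way along, leaving the surplus untouched.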

Grouping a regexp

So far the modifiers we have seen have been applied to single characters. However, with grouping we can apply them to a more complex expression. The group is enclosed in \( and \). For example:

s/\(abc\)*def/ghi/

Here the complete regex matches 'def', 'abcdef', 'abcabcdef' and so forth with multiple instances of 'abc'.

Each group is numbered by sed simply by counting \( occurrences. This allows references to be made to these sub-expressions as we will see shortly.
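
To show what the grouping contributes, this sketch (using a made-up test string) runs the same input through the regexp with and without the group:

```shell
# With the group, * repeats the whole of 'abc'
grouped=$(printf 'abcabcdef\n' | sed -e 's/\(abc\)*def/ghi/')

# Without it, * applies only to the 'c', so just the final 'abcdef' matches
ungrouped=$(printf 'abcabcdef\n' | sed -e 's/abc*def/ghi/')

printf '%s\n%s\n' "$grouped" "$ungrouped"
```

The grouped version replaces the whole string giving 'ghi', whereas the ungrouped one gives 'abcghi'.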

Alternative regexps

It is possible to build a regexp with alternative sub-expressions separated by the characters \|. For example, say the intention is to match either 'Hello World' or 'Goodbye World' without an exclamation mark at the end and add one, the following might be tried as a first attempt:

$ echo "Hello World" | sed -e 's/Hello\|Goodbye World/&!/'
Hello! World
$ echo "Goodbye World" | sed -e 's/Hello\|Goodbye World/&!/'
Goodbye World!

Those results might be unexpected. What has happened is that sed has just matched the 'Hello' in the first case, and so the replacement '&!' has resulted in an exclamation mark being placed after this word. However, it has matched 'Goodbye World' in the second case so the exclamation mark has been placed as we expected.

To match either 'Hello' or 'Goodbye' we need grouping:

$ echo "Hello World" | sed -e 's/\(Hello\|Goodbye\) World/&!/'
Hello World!
$ echo "Goodbye World" | sed -e 's/\(Hello\|Goodbye\) World/&!/'
Goodbye World!

The number of alternatives may be more than two:

$ echo "Farewell World" | sed -e 's/\(Hello\|Goodbye\|Farewell\) World/&!/'
Farewell World!

This meta-character is a GNU extension.

Greediness

The way that sed matches a regexp is sometimes a little unexpected. This is because of what is referred to as "greediness", where more is matched than might be predicted.

The following is taken from the GNU manual:

Note that the regular expression matcher is greedy, i.e., matches are attempted from left to right and, if two or more matches are possible starting at the same character, it selects the longest.

For example, say we are trying to process the example file for this episode sed_demo2.txt, looking for a word starting with capital 'H' at the start of a line. It would be tempting to use a regexp such as '^H.\+ ' meaning a line starting with capital 'H' up to a space. In the example below we enclose what was matched by square brackets, printing out only the lines that matched (see the sections entitled "Command line options" for the '-n' option and "The p flag" below):

$ sed -ne 's/^H.\+ /[&]/p' sed_demo2.txt
[Hacker Public Radio (HPR) is an Internet Radio show (podcast) that ]releases
[HPR" for more ]information.
[Hacker Public Radio is dedicated to sharing knowledge. We do ]not

The regexp matcher has matched everything from the leading 'H' to the last space on the line.

One technique for limiting this behaviour is shown below:

$ sed -ne 's/^H[^ ]\+ /[&]/p' sed_demo2.txt
[Hacker ]Public Radio (HPR) is an Internet Radio show (podcast) that releases
[HPR" ]for more information.
[Hacker ]Public Radio is dedicated to sharing knowledge. We do not

Here, rather than following the 'H' with a dot (any character) we use a list in square brackets. The list is negated by using a circumflex, so it means "not space". So, here we are looking for a capital 'H' at the start of a line followed by one or more "not spaces" then a space. Notice how this has constrained the greediness.
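
If you don't have the example file to hand, the same effect can be seen with a single line of text. This is only a sketch, using a short string in place of sed_demo2.txt:

```shell
line='Hacker Public Radio'

# '.' also matches spaces, so '.\+ ' runs on to the last space in the line
greedy=$(printf '%s\n' "$line" | sed -e 's/^H.\+ /[&]/')

# '[^ ]' cannot match a space, so the match stops at the first one
limited=$(printf '%s\n' "$line" | sed -e 's/^H[^ ]\+ /[&]/')

printf '%s\n%s\n' "$greedy" "$limited"
```

The first result is '[Hacker Public ]Radio' and the second is '[Hacker ]Public Radio'.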

Replacement

Last time we saw the use of & meaning the whole of the text which matched the REGEXP part of the command.

Back references

As we saw earlier, there is also a way of referring to a matching group. We use \n where n is a number between 1 and 9 which refers to the nth group between \( and \) delimiters (as discussed above under "Grouping a regexp").

For example:

$ echo "Hacker Public Radio" | sed -e 's/\(.\+\) \(.\+\) \(.\+\)/\3 \2 \1/'
Radio Public Hacker

Here we look for three groups of characters separated by a single space and we group each one. We then replace them in the order 3, 2, 1, resulting in the words being printed in reverse order.

Interestingly, these back references can be used inside the regexp itself:

$ echo "Run Lola Run" | sed -e 's/\(.\+\) \(.\+\) \1/\2 \1 \1/'
Lola Run Run

Here the first group matches the first "Run", and we use it as the last element of the regexp. We could have made it a group:

$ echo "Run Lola Run" | sed -e 's/\(.\+\) \(.\+\) \(\1\)/\2 \3 \1/'
Lola Run Run

There is no point in doing this since the result is the same yet it makes sed work harder.
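
Two slightly more practical uses are sketched below, both on made-up input. The first swaps the two sides of a 'key=value' pair, and the second uses a back reference inside the regexp to find a word that has been typed twice:

```shell
# Swap the two sides of a 'key=value' pair
swapped=$(printf 'name=Dave\n' | sed -e 's/\([^=]\+\)=\(.\+\)/\2=\1/')

# \1 inside the regexp matches a repeat of whatever the group matched,
# so a doubled word is found and replaced by a single copy
deduped=$(printf 'the the cat\n' | sed -e 's/\([a-z]\+\) \1/\1/')

printf '%s\n%s\n' "$swapped" "$deduped"
```

This gives 'Dave=name' and 'the cat'. The second regexp is deliberately simple; it has no notion of where words start and end, so it could also match a repeat inside longer words.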

Case manipulation

GNU sed provides a means of changing the case of the replacement text using the sequences \L, \l, \U, \u and \E.

\L
Turn the replacement to lowercase until a \U or \E is found.
\l
Turn the next character to lowercase.
\U
Turn the replacement to uppercase until a \L or \E is found.
\u
Turn the next character to uppercase.
\E
Stop case conversion started by \L or \U.

When used in conjunction with grouping the following results may be obtained (from Ken's script for the Community News perhaps):

$ echo "Hacker Public Radio" |\
    sed -e 's/\(.\+\) \(.\+\) \(.\+\)/\U\1 \L\1 \U\2 \L\2 \U\3 \L\3/'
HACKER hacker PUBLIC public RADIO radio
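
The example above doesn't use '\u' or '\E', so here is a further sketch (on a lower-case string chosen for the illustration, and assuming GNU sed) which capitalises the first word, upper-cases the second and leaves the third alone:

```shell
# \u affects only the next character; \U stays in force until \E
mixed=$(printf 'hacker public radio\n' |
    sed -e 's/\(.\+\) \(.\+\) \(.\+\)/\u\1 \U\2\E \3/')
printf '%s\n' "$mixed"
```

The result is 'Hacker PUBLIC radio'. Without the '\E' the upper-casing would carry on to the end of the replacement.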

Flags

We saw the 'g' flag in the last episode, which makes the substitution apply to every match on a line rather than just the first. We will look at some other flags in this episode, but some of the more advanced features will be omitted here.

The number flag

There is also a number flag, which applies the substitution only to the match with that number on the line (so 2 means the second match). For example:

$ echo "eeny, meeny, miny" | sed -e 's/ny/\U&/2'
eeny, meeNY, miny

Here the match is for 'ny', and the replacement is the matching text forced to upper case (see "Case manipulation" above). However, we restrict the substitution to just the second match, as you can see from the result.

The p flag

This causes the result of the substitution to be printed. More precisely, it causes the pattern space to be printed if the substitution was made.

Normally this happens anyway, but when the -n command line option has been selected (see "Command line options") nothing is printed unless the script explicitly requests it.

$ sed -n -e 's/Hacker /Hobby /p' sed_demo2.txt
Hobby Public Radio (HPR) is an Internet Radio show (podcast) that releases
Hobby Public Radio is dedicated to sharing knowledge. We do not

Only the lines where 'Hacker ' was replaced by 'Hobby ' are reported.

The I and i flags

These flags are a GNU sed extension. They cause the regexp to be case-insensitive. Both forms of this flag have the same meaning.

$ sed -n -e 's/hacker /Hobby /ip' sed_demo2.txt
Hobby Public Radio (HPR) is an Internet Radio show (podcast) that releases
Hobby Public Radio is dedicated to sharing knowledge. We do not

GNU Extensions for Escapes in Regular Expressions

GNU sed contains a way of referencing (or producing) special characters. These are documented in the GNU Manual (under the same title as this section). We will not look at all of these in this series, but will touch on some of the more generally useful ones.

\n
Produces or matches a newline (ASCII 10).
\t
Produces or matches a horizontal tab (ASCII 9).

There are also escapes which match a particular character class which are valid only in regular expressions. These are mentioned here because they can be very useful, as we will see in the examples:

\w
Matches any word character. A word character is any letter or digit or the underscore character.
\W
Matches any non-word character.
\b
Matches a word boundary; that is it matches if the character to the left is a word character and the character to the right is a non-word character, or vice-versa.
\< \>
(These are not very clear in the sed documentation but are available). These are alternative ways of denoting word boundaries, with \< being used for the left boundary and \> for the right.
\B
Matches everywhere but on a word boundary; that is it matches if the character to the left and the character to the right are either both word characters or both non-word characters.
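
As a sketch of how the word boundary escapes change what is matched, the following runs four variations of the same substitution over a made-up list of words (GNU sed is assumed):

```shell
words='cat catalog bobcat'

# No boundaries: every 'cat' is replaced wherever it occurs
plain=$(printf '%s\n' "$words" | sed -e 's/cat/dog/g')

# \b on both sides: only 'cat' as a complete word
whole=$(printf '%s\n' "$words" | sed -e 's/\bcat\b/dog/g')

# \< : only 'cat' at the start of a word
starts=$(printf '%s\n' "$words" | sed -e 's/\<cat/dog/g')

# \> : only 'cat' at the end of a word
ends=$(printf '%s\n' "$words" | sed -e 's/cat\>/dog/g')

printf '%s\n' "$plain" "$whole" "$starts" "$ends"
```

The four results are 'dog dogalog bobdog', 'dog catalog bobcat', 'dog dogalog bobcat' and 'dog catalog bobdog'.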

Examples

Example 1

This example shows the use of the -i option:

$ for f in {A..C}; do echo $RANDOM > $f; done
$ sed -i'saved_*.sav' -e 's/4/@/g' {A..C}
$ cat {A..C}
1@855
2@593
@217
$ cat saved_{A..C}.sav
14855
24593
4217

The first line generates three files called A, B and C using brace expansion in a for loop. Each file contains a random number. The second line runs sed against these files replacing any instance of the digit 4 by an '@' symbol. The third line shows the contents of these three files. Backups of their original contents are held in files called saved_A.sav, saved_B.sav and saved_C.sav. Their contents are shown by the final cat command.

Example 2

The second example file sed_demo3.txt contains statistics pulled from the HPR website. Imagine that we are writing a Bash script to parse this, and we want the number of days to the next free slot in a variable. The line in question looks like this:

Days to next free slot: 8

There are two lines beginning with the word 'Days' so we have to be careful:

$ DTNFS="$(sed -ne 's/^Days to[^:]\+:[\t ]\+\([0-9]\+\)/\1/p' sed_demo3.txt)"
$ echo "DTNFS=$DTNFS"
DTNFS=8

The regexp starts with '^Days to' which makes it match the target line. After this come some other words and a colon. We'll represent this with '[^:]\+:' meaning one or more "not colons" followed by a colon. Then there are what look like spaces or could be a tab character (Hint: it's actually a tab). For safety's sake we'll represent this as '[\t ]\+' meaning one or more of tab or space. Then we have a regexp group consisting of '[0-9]\+' meaning one or more digits.

If this matches then we'll have a back reference to the group which we can return -- 8 in this case. The overall sed command uses the '-n' option suppressing printing and the 's' command uses the 'p' flag to print just the matched line.

The output from the sed command is returned in a command substitution and is used to set the variable DTNFS. This is echoed in this fragment to show what was returned.

It is possible that the sed command could return nothing, in which case the variable would be set to an empty string. An actual Bash script doing this should check for this eventuality and take appropriate action.
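
One way such a check might look is sketched here. So that it can be run on its own it builds a small stand-in for sed_demo3.txt in a temporary file; the contents of that file and the error message are invented for the illustration:

```shell
# Stand-in for sed_demo3.txt: two 'Days' lines with a tab after the colon
# (the contents are invented for this sketch)
stats=$(mktemp)
printf 'Days of made-up data:\t99\nDays to next free slot:\t8\n' > "$stats"

DTNFS="$(sed -ne 's/^Days to[^:]\+:[\t ]\+\([0-9]\+\)/\1/p' "$stats")"

# An empty result means the line was missing or not in the expected format
if [ -z "$DTNFS" ]; then
    echo "Could not find the days to the next free slot" >&2
else
    echo "DTNFS=$DTNFS"
fi
rm -f "$stats"
```

What counts as "appropriate action" depends on the script; here we just report the problem on standard error.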

Example 3

In this example we use the '\n' escape we examined earlier (backslash 'n' meaning newline):

$ sed -e 's/\(Hacker\) \(Public\) \(Radio\) /\1\n\2\n\3\n/' sed_demo2.txt | head -4
Hacker
Public
Radio
(HPR) is an Internet Radio show (podcast) that releases

We simply looked for the words "Hacker Public Radio", grouping each of them so that they could be back referenced, and output them each followed by a newline. We used the head command to view just the first 4 lines produced by this sed command.

You might have expected that the following would join all the lines of the file together, but that doesn't happen:

$ sed -e 's/\n//' sed_demo2.txt

That is because sed places one line at a time into the pattern space, removing the trailing newline. Then it applies the script to it and (unless the '-n' option was used) prints it out with a trailing newline.

We will look at ways in which actions like line concatenation can be achieved in a later episode.

Example 4

We saw the '-r' (--regexp-extended) option earlier in this episode. If we were to use this in conjunction with Example 3 we would write the following:

$ sed -r -e 's/(Hacker) (Public) (Radio) /\1\n\2\n\3\n/' sed_demo2.txt | head -4
Hacker
Public
Radio
(HPR) is an Internet Radio show (podcast) that releases

This is a useful feature, but it needs to be used with caution because it is specific to GNU sed and not portable.

Example 5

One task often needed when processing text is to remove leading and trailing spaces. With sed you might expect the following would work:

$ echo "    Hello World!      " | sed -e 's/^ *\| *$//'
Hello World!

At first glance it seems to, until you test it by enclosing the result of the trimming in visible characters:

$ echo "    Hello World!      " | sed -e 's/^ *\| *$//;s/^/</;s/$/>/'
<Hello World!      >

In this case sed has stopped after the first match. This is an example where the 'g' flag is needed to make sed repeat the match and substitution:

$ echo "    Hello World!      " | sed -e 's/^ *\| *$//g;s/^/</;s/$/>/'
<Hello World!>
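
An alternative, sketched below, is to avoid the alternation and use two separate 's' commands, one for each end of the line. Since each regexp is anchored and can only match once, no 'g' flag is needed:

```shell
# Trim leading spaces, then trailing spaces, as two separate substitutions
trimmed=$(printf '    Hello World!      \n' | sed -e 's/^ *//' -e 's/ *$//')

# Show the result between visible markers
printf '<%s>\n' "$trimmed"
```

This also prints '<Hello World!>'. Which form to use is largely a matter of taste.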

Example 6

In the audio I said that I would be demonstrating the use of word boundaries in an example. I had forgotten to add it at the time of recording, so this one is not described in the podcast.

Really, this is a piece of extreme silliness, but it does demonstrate word boundaries. It is being run on the example file from the last episode.

$ sed -e 's/\<[A-Z]\w*\>/Chicken/g;s/\b[a-z]\w*\b/chicken/g' sed_demo1.txt

The example consists of two 's' commands separated by a semicolon. The first matches any word that begins with a capital letter, using the \< and \> word boundaries and the \w expression. It replaces each occurrence it finds with an alternative capitalised word, using the 'g' flag to ensure this happens.

The second 's' command does the same for lower-case words but uses the \b word boundary instead.

I will leave you to try it yourself.