Introduction to sed - part 1 (HPR Show 1976)

Dave Morriss


Introduction

sed is an editor which expects to read a stream of text, apply some action to the text and send it to another stream. It filters and transforms the text along the way according to instructions provided to it. These instructions are referred to as a sed script.

The name "sed" comes from Stream Editor, and sed was developed from 1973 to 1974 as a Unix utility by Lee E. McMahon of Bell Labs. GNU sed added several new features and fuller documentation, though most of that documentation is only available on the command line through the info command. The full manual is of course available on the web.

Using sed

The sed command is usually invoked with a sed script and an input file on the command line. You might see:

$ sed -e 's/old/new/' infile > outfile

In this example the -e introduces the sed script which is enclosed in single quotation marks. The file infile is read and edited. The result is written to standard output which in this case is being redirected to a file called outfile.

In this episode the sed examples are often being applied to a small file of text, containing the following lines copied from the "about" page on the HPR site:

Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases
shows every weekday Monday through Friday. HPR has a long lineage going back to
Radio FreeK America, Binary Revolution Radio & Infonomicon, and it is a direct
continuation of Twatech radio. Please listen to StankDawg's "Introduction to
HPR" for more information.

What differentiates HPR from other podcasts is that the shows are
produced by the community - fellow listeners like you. There is no
restrictions on how long the show can be, nor on the topic you can
cover as long as they "are of interest to Hackers". If you want to see
what topics have been covered so far just have a look at our Archive.
We also allow for a series of shows so that host(s) can go into more
detail on a topic.

The file sed_demo1.txt is available on the HPR site.

If no input file is named on the command line, sed reads its input from standard input, so you might see a pipeline such as:

$ wc -l sed_demo1.txt | sed -e 's/ .*$//'

Here the wc command counts the lines in sed_demo1.txt and normally reports the number and the filename:

$ wc -l sed_demo1.txt
13 sed_demo1.txt

We remove the filename using sed leaving just the number - 13. We'll be looking at how this sed example works later.

Note: using wc the way shown below is a simpler way of solving this problem:

$ wc -l < sed_demo1.txt
13

Options

Some of the most frequently used options to the sed command are:

-e SCRIPT or --expression=SCRIPT

Defines the sed commands to be executed (the sed "script"). There can be multiple such options.

-f SCRIPT-FILE or --file=SCRIPT-FILE

Defines a file of sed commands. There can be multiple files, and these can be combined with scripts on the command-line as well.

--help

Displays help information and exits.

If no -e, --expression, -f, or --file option is given, then the first non-option argument is taken as the sed script to interpret. All remaining arguments are names of input files; if no input files are specified, then the standard input is read.
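
As a quick check of that rule, the sketch below runs the same script with and without -e. The input text 'old hat' is made up for illustration and is not part of the demo file.

```shell
# Made-up input; both invocations should give the same result.
with_e=$(echo 'old hat' | sed -e 's/old/new/')
without_e=$(echo 'old hat' | sed 's/old/new/')
echo "$with_e"      # new hat
echo "$without_e"   # new hat
```

Spelling out -e becomes necessary once there is more than one script on the command line.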

How sed works

We will just look at the basics of how sed uses commands to process incoming data in this episode. We will look into this subject in more depth in later episodes.

As mentioned under Options, sed takes in commands or scripts from the command line or from files, and stores them.

It then processes the data it has been given through input files or piped to it on STDIN. It reads this input one line at a time, placing it in what is referred to as the pattern space.

Then sed runs the saved commands on the pattern space. The full range of available commands is such that they can be conditional, but we'll leave these details until a later episode. The commands may change the data in the pattern space.

Once all the commands have been executed the contents of the pattern space are printed, the pattern space cleared and the next line is read.

The printing of the pattern space is the default behaviour but can be overridden as we will see in a later episode.
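
To make the cycle concrete, here is a small sketch using three made-up lines of input (not the demo file). Each line is read into the pattern space, the single command is tried against it, and the result is printed whether or not anything changed.

```shell
# Three invented input lines; the last one contains no match.
out=$(printf 'one two two\nthree two\nfour\n' | sed -e 's/two/2/')
printf '%s\n' "$out"
# one 2 two
# three 2
# four
```

Three lines go in and three lines come out: the line with no match is printed unaltered, and each line is handled independently of the others.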

Simple sed scripts (the s command)

The commonest sed command is the s (substitute) command. It has the structure:

s/REGEXP/REPLACEMENT/FLAGS

Its purpose is to look for a pattern (REGEXP) and, if found, to replace it (with REPLACEMENT). The real power of sed (and other parts of Linux and Unix) is in the type of pattern called a regular expression (regexp for short).

We need to look at the fundamentals of regular expressions to appreciate the sophistication of what can be done.

The FLAGS part is used to modify the behaviour of the command. We'll look at one commonly-used flag in this episode but will reserve the full range for later episodes.

Simple Regular Expressions

Regular expressions are patterns which are used to match a string. We will begin by looking at some of the simplest forms.

A regular expression is a sort of language in which certain characters have special meanings. The following table shows some of the simpler meta characters used by sed. We will look into these in more detail in a later episode.

Expression      Meaning
any character   A single ordinary character matches itself
.               Matches any character
*               Matches a sequence of zero or more instances of the preceding item
[list]          Matches any single character in list: for example, [aeiou] matches all vowels
[^list]         A leading '^' reverses the meaning of list, so that it matches any single character not in list
^               Matches the beginning of the line (anchors the search at the start)
$               Matches the end of the line (anchors the search at the end)

Simple character matching

The simplest form of match is where a particular sequence of characters is being searched for. So, the regexp 'abc' matches any string which contains the characters 'abc' in that order.

s/abc/def/

This will find the first occurrence of 'abc' and will change it to 'def'.
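
A minimal sketch of this, run against an invented line of text:

```shell
# Invented input: 'abc' appears inside a longer word and again on its own.
out=$(echo 'xabcx abc' | sed -e 's/abc/def/')
echo "$out"   # xdefx abc
```

The match does not have to be a whole word, and only the first occurrence on the line is changed.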

Matching arbitrary characters

Using the '.' (dot) character, which matches any character, we could search and change 'abc' or 'aac' or any other three character string beginning with 'a' and ending with 'c' like so:

s/a.c/def/

If it is necessary to indicate an actual '.' character then it needs to be escaped by preceding it with a '\' (backslash) character. This indicates that its special regexp meaning is not to be used in this instance.

s/17\.30/17:30/
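
The difference between the two forms of dot can be seen in a short sketch. All of the input strings here are invented for the purpose.

```shell
# An unescaped dot stands for exactly one character of any kind.
any=$(printf 'abc\naac\na.c\nac\n' | sed -e 's/a.c/def/')
printf '%s\n' "$any"
# def
# def
# def
# ac

# Forgetting the backslash lets the dot match the '5' in '17530'.
loose=$(echo '17530 17.30' | sed -e 's/17.30/17:30/')
strict=$(echo '17530 17.30' | sed -e 's/17\.30/17:30/')
echo "$loose"    # 17:30 17.30
echo "$strict"   # 17530 17:30
```

The line 'ac' is left alone because the dot must match one character, not zero.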

Zero or more of the preceding

Using the '*' character we can match sequences of variable length. So, if it is necessary to match 'bc', 'abc', 'aabc' or 'aaabc', for example, then the following could be used:

s/a*bc/def/

What this indicates is that the 'a' can occur zero or more times, followed by the 'bc'. So, the '*' indicates that we are searching for zero or more instances of the preceding item.

If it is necessary to indicate an actual '*' character then it needs to be escaped by preceding it with a '\' (backslash) character. This indicates that its special regexp meaning is not to be used in this instance.
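
Here is a sketch of both uses of the asterisk, again with invented input:

```shell
# 'a*bc' matches 'bc' preceded by any number of 'a' characters, including none.
star=$(printf 'bc\nabc\naaabc\nxyz\n' | sed -e 's/a*bc/def/')
printf '%s\n' "$star"
# def
# def
# def
# xyz

# Escaped, the asterisk is an ordinary character.
literal=$(echo '2*3=6' | sed -e 's/2\*3/six/')
echo "$literal"     # six=6

# Unescaped, '2*3' means "zero or more 2s, then a 3", so only the '3' matches.
unescaped=$(echo '2*3=6' | sed -e 's/2*3/six/')
echo "$unescaped"   # 2*six=6
```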

Matching characters in or not in a set

Using the '[list]' expression we can match one of the characters in the given list. So, for example to match 'c' followed by any vowel, followed by 't' and replace it by 'dog' we could use:

s/c[aeiou]t/dog/

This will find the first instance on each line of 'cat', 'cet', 'cit', 'cot' or 'cut' and will replace it with 'dog'.
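
A short sketch with invented input shows which strings qualify. Note that, as with any s command without a flag, only the first match on a line is changed.

```shell
# One candidate per line: 'cyt' has no vowel between 'c' and 't'.
each=$(printf 'cat\ncot\ncyt\n' | sed -e 's/c[aeiou]t/dog/')
printf '%s\n' "$each"
# dog
# dog
# cyt

# Two candidates on one line: only the first is replaced.
first=$(echo 'cat cut' | sed -e 's/c[aeiou]t/dog/')
echo "$first"   # dog cut
```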

The other form of this expression '[^list]' matches any character not in the given list.

s/([^)]*)/(example 1)/

This is a common type of expression used in sed and elsewhere that you might find regular expressions. Here we are matching an open parenthesis, followed by any number of characters which are not a close parenthesis, followed by a close parenthesis. We replace what we find by the text '(example 1)'. This regexp will match any number of enclosed characters including zero. Note that the open and the close parentheses must be on the same line in this example, because sed is a line-orientated editor.
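
The sketch below tries this expression on a fragment of the demo text and on two invented strings. The last command, using '.*' instead of '[^)]*', is included only for contrast and is not from the episode.

```shell
# A fragment of the first line of the demo text.
paren=$(echo 'an Internet Radio show (podcast) that releases' | sed -e 's/([^)]*)/(example 1)/')
echo "$paren"     # an Internet Radio show (example 1) that releases

# Zero enclosed characters also match.
empty=$(echo 'f()' | sed -e 's/([^)]*)/(example 1)/')
echo "$empty"     # f(example 1)

# With two bracketed items, [^)]* stops at the first close parenthesis...
stops=$(echo '(a) and (b)' | sed -e 's/([^)]*)/(example 1)/')
echo "$stops"     # (example 1) and (b)

# ...whereas .* runs on to the last one.
greedy=$(echo '(a) and (b)' | sed -e 's/(.*)/(example 1)/')
echo "$greedy"    # (example 1)
```

This is why the "anything except the closing character" form is so widely used: it keeps the match from swallowing more than intended.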

The list can be simply a list of characters as we have seen, but it can also be a range such as 0-9 meaning all the digits from 0 to 9 inclusive. A range is a compact way of specifying any one of a run of characters, such as a digit from 4 to 6:

s/A[4-6]/An/

This will replace 'A4', 'A5' or 'A6' with 'An'.
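
A sketch with a handful of invented paper sizes, one per line:

```shell
# Only the digits 4, 5 and 6 fall inside the range.
out=$(printf 'A3\nA4\nA5\nA6\nA7\n' | sed -e 's/A[4-6]/An/')
printf '%s\n' "$out"
# A3
# An
# An
# An
# A7
```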

Anchoring at start or end of line

The character '^' (circumflex), when it occurs at the start of a regexp, indicates the start of a line. If it is used anywhere else it indicates the '^' character itself (though we just saw it being used for another purpose in a list).

The character '$' (dollar sign), when it occurs at the end of a regexp, indicates the end of a line. If it is used anywhere else it indicates the '$' character itself.

If the sequence 'abc' starts at the beginning of the line then use:

s/^abc/def/

If at the end of the line then this regexp would be needed:

s/abc$/def/
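
The effect of the two anchors can be sketched like this, using invented input:

```shell
# The same line, anchored at each end in turn.
start=$(echo 'abc abc' | sed -e 's/^abc/def/')
end=$(echo 'abc abc' | sed -e 's/abc$/def/')
echo "$start"   # def abc
echo "$end"     # abc def

# Neither anchored pattern matches when 'abc' is not at the very edge of the line.
none=$(echo 'xabc abcx' | sed -e 's/^abc/def/' -e 's/abc$/def/')
echo "$none"    # xabc abcx
```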

Replacement in the s command

The replacement used in the s command can be more complex than we have seen so far. We will go into more detail with what can be done here later in the series, but for now we'll look at the & character.

The & character denotes the whole matched portion of the REGEXP part of the command. If an actual '&' character is required, then it must be escaped.

So, to append 'def' to 'abc' the command would be:

s/abc/&def/

This can be seen as replacing 'abc' with 'abcdef'.

If a literal '&' is required then it needs to be escaped with a backslash:

s/fruit/apples \& pears/

Otherwise undesirable consequences will result:

$ echo "Eat your fruit!" | sed -e 's/fruit/apples & pears/'
Eat your apples fruit pears!
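
A few more sketches of & at work. The 'price 42' input and the square brackets around the number are invented for illustration.

```shell
# Appending to the matched text.
append=$(echo 'abc' | sed -e 's/abc/&def/')
echo "$append"   # abcdef

# Wrapping whatever was matched, here a run of digits.
wrap=$(echo 'price 42' | sed -e 's/[0-9][0-9]*/[&]/')
echo "$wrap"     # price [42]

# The escaped form gives a literal ampersand.
amp=$(echo 'Eat your fruit!' | sed -e 's/fruit/apples \& pears/')
echo "$amp"      # Eat your apples & pears!
```

The second case shows why & is useful: the replacement does not need to know in advance exactly what the regexp matched.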

Flags and the s command

The flag we will examine this time is g. This causes the replacement to be applied to all matches, not just the first. So, for example:

s/abc/def/g

This means that all instances of the sequence 'abc' will be replaced with 'def' in the current line. Without it, as we saw earlier, just the first instance will be replaced.
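
Side by side, on an invented line containing three matches:

```shell
# Without the flag only the first match changes; with it, all of them do.
first=$(echo 'abc abc abc' | sed -e 's/abc/def/')
all=$(echo 'abc abc abc' | sed -e 's/abc/def/g')
echo "$first"   # def abc abc
echo "$all"     # def def def
```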

Using sed commands in a file

As we saw in the Options section, sed can take its commands from a file (as well as from the command line). Commands can be formatted one per line, in which case the end of each line separates one command from another. There can be multiple commands per line, in which case they are separated by semicolons.

One way of using commands in a file might be the following:

$ sed -f - sed_demo1.txt <<END
s/\./!/g
s/community/Community/
END

This uses the Bash shell's heredoc feature. This is directly equivalent to using a quoted list of commands:

$ sed -e 's/\./!/g
s/community/Community/' sed_demo1.txt

In general it is better to create a sed command file in the way you would create any other text file, such as in an editor. Giving the file an extension of '.sed' will help to remind you what it is.

$ cat commands.sed
s/\./!/g
s/community/Community/
$ sed -f commands.sed sed_demo1.txt
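
The following sketch can be run without the demo file. It assumes a temporary file created by mktemp stands in for commands.sed, and it feeds sed one line copied from the demo text.

```shell
# Temporary stand-in for commands.sed (the name is chosen by mktemp).
script=$(mktemp)
printf '%s\n' 's/\./!/g' 's/community/Community/' > "$script"

line='produced by the community - fellow listeners like you. There is no'
from_file=$(echo "$line" | sed -f "$script")
from_args=$(echo "$line" | sed -e 's/\./!/g' -e 's/community/Community/')
rm -f "$script"

echo "$from_file"
# produced by the Community - fellow listeners like you! There is no
```

Here from_file and from_args hold the same text, since the commands are the same whichever way they are supplied.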

Examples

Example 1

$ wc -l sed_demo1.txt | sed -e 's/ .*$//'

This is a rather artificial example, as we have already seen, but we know that the wc command returns the number of lines followed by the filename when run in this way:

13 sed_demo1.txt

This is passed to sed which runs the script s/ .*$//. This replaces the first space and the zero or more characters that follow up to the end of the string by nothing, thereby deleting them. This leaves the number of lines as the final result.
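
The sketch below replays this step with the wc output typed in by hand, so it does not need the demo file. The second command is an assumption about other systems: some wc implementations (the BSD and macOS ones, for instance) pad the count with leading spaces, and the padded string used here is invented to show the consequence.

```shell
# The output shown above, supplied directly.
count=$(echo '13 sed_demo1.txt' | sed -e 's/ .*$//')
echo "$count"   # 13

# Hypothetical padded output: the first space is now at the very start of the
# line, so everything is deleted and nothing is left.
padded=$(echo '      13 sed_demo1.txt' | sed -e 's/ .*$//')
echo "[$padded]"   # []
```

This is another reason to prefer the redirection form of wc shown earlier.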

Example 2

$ sed -e 's/is no/are no/' sed_demo1.txt

This fixes the fragment "There is no restrictions" replacing it with "There are no restrictions" in sed_demo1.txt. You will see that the word restrictions is on the next line, so it cannot be included in the regexp.

Of course, we cannot just change 'is' to 'are' because there are many uses of this letter sequence throughout the file. That is why we make it more specific by using the regexp 'is no'.
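
A sketch of what goes wrong with the looser pattern, using the relevant line of the demo text as inline input so that it runs without the file:

```shell
line='produced by the community - fellow listeners like you. There is no'

# The first 'is' on the line is inside the word 'listeners'.
loose=$(echo "$line" | sed -e 's/is/are/')
echo "$loose"
# produced by the community - fellow lareteners like you. There is no

specific=$(echo "$line" | sed -e 's/is no/are no/')
echo "$specific"
# produced by the community - fellow listeners like you. There are no
```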

We are not permanently changing the file with this command, but you can isolate and display the changes by adding a call to grep in a pipeline as follows:

$ sed -e 's/is no/are no/' sed_demo1.txt | grep -A1 "are no"
produced by the community - fellow listeners like you. There are no
restrictions on how long the show can be, nor on the topic you can

The -A option to grep displays a number of lines after the target line, and the number chosen here is one line.

We will look at how sed can alter a file and save the results back to it in a later episode.

Example 3

$ sed -e 's/is no/are no/' -e 's/topic /topics /' sed_demo1.txt

This fixes the same fragment as Example 2, but also sorts out the phrase "the topic you can cover". The change is needed because of the use of the word "they" later in the sentence. We include the space in the target regexp because the word "topics" occurs later in the file.

We will look at this more in later shows in this series, but a sed script can consist of multiple commands, and these can be separated by semi-colons. So, the following way of writing the earlier command in this example is exactly equivalent:

$ sed -e 's/is no/are no/;s/topic /topics /' sed_demo1.txt
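
The equivalence can be checked with a sketch that supplies the two affected lines of the demo text inline rather than reading the file:

```shell
# The two lines of the demo text that these commands change.
text='produced by the community - fellow listeners like you. There is no
restrictions on how long the show can be, nor on the topic you can'

two_options=$(printf '%s\n' "$text" | sed -e 's/is no/are no/' -e 's/topic /topics /')
one_option=$(printf '%s\n' "$text" | sed -e 's/is no/are no/;s/topic /topics /')

printf '%s\n' "$one_option"
# produced by the community - fellow listeners like you. There are no
# restrictions on how long the show can be, nor on the topics you can
```

Both variables end up holding the same two lines.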

Example 4

$ sed -e 's/Hacker /Hobby /;s/Hackers/Hobbyists/' sed_demo1.txt

There is one instance of "Hacker" and one of "Hackers" in the text. We don't want "Hackers" to be turned into "Hobbys", so we differentiate the two instances as shown.
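
To see the problem being avoided, here is a sketch using two shortened fragments of the demo text (the fragments are trimmed for brevity, so this is not the file itself):

```shell
# A single undifferentiated substitution mangles the plural.
naive=$(printf 'Hacker Public Radio\nof interest to Hackers\n' | sed -e 's/Hacker/Hobby/')
printf '%s\n' "$naive"
# Hobby Public Radio
# of interest to Hobbys

# The script from this example treats the two forms separately.
fixed=$(printf 'Hacker Public Radio\nof interest to Hackers\n' | sed -e 's/Hacker /Hobby /;s/Hackers/Hobbyists/')
printf '%s\n' "$fixed"
# Hobby Public Radio
# of interest to Hobbyists
```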

Example 5

$ sed -e 's/is no/are no/;s/topic /topics /;s/\. /.  /;s/ /#/g' sed_demo1.txt

This final example applies the earlier grammatical corrections, replaces a single space after a full-stop with two spaces (only for the first such full-stop on each line, since that command has no g flag), and (perversely) turns all spaces into hash marks. This last stage uses the g flag to process all spaces.

This example shows that each of the commands is applied to each line in turn, and that it is possible to accumulate many commands to make a complex script. We have already seen how scripts can be more conveniently executed from a file, and we will examine this subject more deeply in a forthcoming episode in this series.
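
Because the commands run in sequence on the same pattern space, their order matters. The sketch below takes a shortened fragment of the demo text and applies the last two commands in both orders. The swapped order is not from the episode and is shown only for comparison.

```shell
line='like you. There is no'

# Order used in Example 5: double the space first, then convert spaces.
doc_order=$(echo "$line" | sed -e 's/\. /.  /;s/ /#/g')
echo "$doc_order"   # like#you.##There#is#no

# Swapped: by the time the second command runs there are no spaces left to match.
swapped=$(echo "$line" | sed -e 's/ /#/g;s/\. /.  /')
echo "$swapped"     # like#you.#There#is#no
```

Each command sees the line as the previous command left it, not as it was originally read.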