Introduction to sed - part 4 (HPR Show 2011)

Dave Morriss



Introduction

In the last episode we looked at some of the more frequently used sed commands, having spent previous episodes looking at the s command, and we also covered the concept of line addressing.

In this episode we will look at how sed really works in all the gory details, examine some of the remaining sed commands and begin to use what we know to build useful sed programs.

How sed REALLY works

In Episode 1 we looked briefly at the pattern space, where sed holds the incoming data while commands are executed on it. In this episode we will look at this data buffer and its counterpart, the hold space, in more detail.

When considering the pattern space in earlier episodes it was simpler to visualise it as a relatively small storage area, capable of holding one line from the input stream. In fact, it is a buffer which can hold an arbitrarily large amount of data, though it is normally used to hold just the latest input line.

As you know from previous discussions, the pattern space is processed in the following cycle:

  • A line is read from the input stream, the trailing newline is removed and the result is stored in the pattern space
  • The commands making up the sed script are executed (as appropriate regarding addressing, etc.)
  • When command execution has finished the pattern space is printed to the output stream, after adding the trailing newline (if it was removed). This auto printing does not happen if the -n command line option is in effect.
  • The cycle then begins again, with the pattern space being cleared before the next line is read. This part of the cycle can be altered by a few special commands, which we will look at later.
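The auto-printing step in the cycle above can be seen with a tiny two-line input. This is only a minimal sketch: the input text and the variable names are invented for the illustration.

```shell
# Without -n each line appears twice: once from the p command and
# once from the auto-print at the end of the cycle.
with_auto=$(printf 'one\ntwo\n' | sed -e 'p')

# With -n the auto-print is suppressed, so only the p command prints.
without_auto=$(printf 'one\ntwo\n' | sed -n -e 'p')

printf '%s\n' "$with_auto"
printf '%s\n' '---'
printf '%s\n' "$without_auto"
```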

The hold space, on the other hand, is a storage buffer like the pattern space, but it is not affected by the cycle described above. Data placed in the hold space remains there until sed exits or it is deleted explicitly. Commands exist which can move data to and from the hold space, as we will see.

Commands

This episode is following the GNU sed manual, particularly the section about less frequently-used commands. Some of the commands in this category in the manual have been omitted in this series though, and some will be held over to the next episode.

The y command

The y command transforms (transliterates) characters. The format is:

y/source-chars/dest-chars/

It operates on the pattern space, replacing each character that appears in source-chars with the corresponding character in dest-chars.

The delimiter used to separate the two parts is normally a slash (/), but another character can be used instead, as we saw with the s command. There is no need to precede the first instance of the new delimiter with a backslash.

If the delimiter is used in either of the two lists of characters it must be preceded by a backslash (\) to escape it.

The two lists must be the same length (not counting backslash escapes).

The y command has no flags.
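The delimiter rules can be tried out with a short test. This is just a sketch using an invented input string; both commands should turn every slash into a vertical bar.

```shell
# The delimiter '/' appears in the source list, so it is escaped.
escaped=$(printf '%s\n' 'usr/share/dict' | sed -e 'y/\//|/')

# The same transliteration with ',' as the delimiter needs no escape.
alt_delim=$(printf '%s\n' 'usr/share/dict' | sed -e 'y,/,|,')

printf '%s\n' "$escaped" "$alt_delim"
```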

In the following example the first two lines of the example file are processed with a y command. The lower-case vowels in these two lines are converted to the next vowel in the sequence, so 'a' becomes 'e', 'e' becomes 'i' and so forth:

$ sed -ne '1,2{y/aeiou/eioua/;p}' sed_demo1.txt
Heckir Pabloc Redou (HPR) os en Intirnit Redou shuw (pudcest) thet riliesis
shuws iviry wiikdey Mundey thruagh Frodey. HPR hes e lung loniegi guong beck tu

The next example uses nl as in earlier episodes to number the lines so you can see what has been done to them. The script contains two groups, both of which perform transliterations. The first group is controlled by an address expression which selects the odd lines, and the second group operates on the even lines. The y commands perform vowel transformations similar to those in the previous example, but they cater for upper-case vowels as well. The vowel sequences are "rotated" differently for the even and the odd lines. Only the first five lines of output are shown here.

$ nl -w3 -ba sed_demo1.txt | sed -ne '1~2{y/aeiouAEIOU/eiouaEIOUA/;p};2~2{y|aeiouAEIOU|iouaeIOUAE|;p}'
1     Heckir Pabloc Redou (HPR) os en Ontirnit Redou shuw (pudcest) thet riliesis
2     shaws ovory wookdiy Mandiy thraegh Frudiy. HPR his i lang lunoigo gaung bick ta
3     Redou FriiK Emiroce, Bonery Rivulatoun Redou & Onfunumocun, end ot os e dorict
4     cantuneituan af Twitoch ridua. Ploiso luston ta StinkDiwg's "Untradectuan ta
5     HPR" fur muri onfurmetoun.

The = command

This command causes sed to print out the current line number, followed by a newline. The number represents a count of the lines read on the input stream. As a GNU extension the command accepts two addresses (a range).

The command can be preceded by any of the address types we saw in episode 3.

The following example uses the = command to print out the number of the last line of the input file:

$ sed -ne '$=' sed_demo1.txt
13

The next example prints out the line number followed by the line. Note how the newline after the number means that it is not on the same line as the text:

$ sed -ne '${=;p}' sed_demo1.txt
13
detail on a topic.
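If you want the number on the same line as the text, one common approach is to pass the output through a second sed. The following is a sketch with invented input; it relies on the N command, which is described later in these notes, to join each number to the line that follows it.

```shell
# Count the lines of the input: print only the number of the last line.
count=$(printf 'alpha\nbeta\ngamma\n' | sed -n -e '$=')

# Number every line, then join each number to its text with a second sed.
numbered=$(printf 'alpha\nbeta\ngamma\n' | sed -e '=' | sed -e 'N;s/\n/ /')

printf '%s\n' "$count"
printf '%s\n' "$numbered"
```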

The usual issues about contiguous or separate files apply here and using the -s command line option has the following effect:

$ sed -sne '${=;p}' sed_demo1.txt sed_demo2.txt
13
detail on a topic.
26
contribute one show a year.

Commands that operate on the pattern space

The following four commands perform actions on the pattern space. Their usefulness can be difficult to appreciate without examples, but we need to know about them, and about the set of hold space commands that follow, before we can begin building such examples.

The D command

This command deletes from the pattern space in a related way to the d command. However, it only deletes up to and including the first newline. Then the cycle is restarted using the resulting pattern space and without reading any input.

If there is no newline the pattern space is deleted, and a new cycle is begun with a new input line being read. Under these circumstances, the D command behaves as the d command does.

The command can be preceded by any of the address types we saw in episode 3.

The N command

This command adds the next line of input to the pattern space, preceded by a newline. If there is no more input then sed exits without processing any more commands.

The command can be preceded by any of the address types we saw in episode 3.
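A small test shows the newline that N places between the two lines. This is a sketch with invented input; an even number of lines is used so that N always has a line to read.

```shell
# N joins each pair of lines in the pattern space; the s command then
# replaces the embedded newline with a plus sign.
pairs=$(printf '1\n2\n3\n4\n' | sed -e 'N;s/\n/+/')

printf '%s\n' "$pairs"
```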

The P command

This command prints out the contents of the pattern space up to the first newline.

The command can be preceded by any of the address types we saw in episode 3.
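The N, P and D commands are often used together to slide a two-line "window" along the input. The sketch below is a well-known idiom rather than anything specific to this series: it removes consecutive duplicate lines, much as uniq does. The input is invented for the illustration.

```shell
# $!N  append the next line (except on the last line)
# P    print the first line of the window unless both lines are identical
# D    delete the first line of the window and restart without reading input
unique=$(printf 'a\na\nb\nc\nc\n' |
    sed -e '$!N' -e '/^\(.*\)\n\1$/!P' -e 'D')

printf '%s\n' "$unique"
```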

The l command

Format: l n

This command can be a useful tool for debugging a sed script since it shows what is currently in the pattern space.

The pattern space is "dumped" in fixed-length lines, where the length is controlled by the numeric value of n. There is a command-line option -l N or --line-length=N which provides a value if n is not provided with the command. The default value is 70. A value of 0 prevents line wrapping.

The n option to the command is a GNU sed extension.

The l command shows non-printable characters as sequences such as '\n' and '\t'. Each wrapped line ends with a '\' and the end of each line is shown by a '$' character.

The command can be preceded by any of the address types we saw in episode 3.
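A quick way to see how l displays non-printable characters is to feed it a line containing a tab. This is a minimal sketch with invented input.

```shell
# The input holds a real tab character between 'tab' and 'here'.
# l shows it as the two characters \t and marks the end of line with $.
dump=$(printf 'tab\there\n' | sed -n -e 'l')

printf '%s\n' "$dump"
```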

Running the l command on lines 1 and 2 of sed_demo1.txt with a width of 80 we see:

$ sed -ne '1,2l80' sed_demo1.txt
Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases$
shows every weekday Monday through Friday. HPR has a long lineage going back to$

Using the N command to accumulate the two lines in the pattern space before dumping it (using the default width) we see:

$ sed -ne '1,2{N;l}' sed_demo1.txt
Hacker Public Radio (HPR) is an Internet Radio show (podcast) that re\
leases\nshows every weekday Monday through Friday. HPR has a long lin\
eage going back to$

Example using the pattern space

This example demonstrates the use of the N and D commands.

for i in {1..10}; do
    w=$(shuf -n1 /usr/share/dict/words)
    w=${w%\'s}
    echo "$i: $w"
done | tee /tmp/$$ | sed -e 'N;1,5D'

The loop iterates 10 times, using variable i. For each iteration variable w is set to a random word from the system dictionary /usr/share/dict/words. Since many of these words end in "'s" we remove such endings. The result is printed, with the iteration number in front.

The stream of 10 words is sent to the tee command [1] which saves a copy (this command writes to a file and also copies its input to STDOUT). The file chosen is a temporary file /tmp/$$ where Bash replaces the $$ symbol with the process id number of the current process. [2]

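The stream of numbered words is also sent to sed. On each pass through the script the N command appends the next input line to the pattern space. While the current input line number is in the range 1 to 5 the D command then deletes the oldest line from the pattern space and restarts the script without reading any input, so the early lines are discarded. Once the line number has passed 5 the D command no longer runs, and the remaining lines are auto-printed.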

During testing, this type of pipeline can be written to a file and run as a Bash script, or it can be written out on one line, as I normally do:

for i in {1..10}; do w=$(shuf -n1 /usr/share/dict/words); w=${w%\'s}; echo "$i: $w"; done | tee /tmp/$$ | sed -e 'N;1,5D'

The temporary file is useful to check the before and after states.

The Bash script discussed here is available as demo3.sh on the HPR website.
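Because the words are random it can be hard to see what was discarded. One option, sketched here, is to replace the word generator with predictable input. It assumes GNU sed and the seq command; the numbers 1 to 10 stand in for the numbered words.

```shell
# Predictable stand-in for the numbered random words.
remaining=$(seq 1 10 | sed -e 'N;1,5D')

# Compare this with the input to see which early lines D removed.
printf '%s\n' "$remaining"
```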

Commands to transfer to and from the hold space

The next five commands move lines to and from the hold space.

The h command

This command replaces the contents of the hold space with the contents of the pattern space. After executing the command the original contents of the hold space will be lost, and the contents of the pattern space will be in the hold space and the pattern space.

The command can be preceded by any of the address types we saw in episode 3.

The H command

This command appends the contents of the pattern space to the hold space preceded by a newline. The contents of the pattern space will not be affected by this process.

The command can be preceded by any of the address types we saw in episode 3.
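The newline that H adds can be made visible with the l command. In this sketch (invented input) every line is appended to the hold space, and on the last line the g command, described next, copies the hold space back so that it can be dumped.

```shell
# The hold space starts empty, so the first H leaves a newline at the front.
held=$(printf 'a\nb\n' | sed -n -e 'H;${g;l;}')

printf '%s\n' "$held"
```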

The g command

This command replaces the contents of the pattern space with the contents of the hold space. After executing the command the original contents of the pattern space will be lost, and the two buffers will have the same contents.

The command can be preceded by any of the address types we saw in episode 3.
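One use of h and g together is to remember the previous line. The sketch below, a common idiom shown with invented input, prints the line before the one containing TARGET.

```shell
# h (run last on every line) saves the pattern space for the next cycle.
# On a matching line g fetches the saved previous line, and 1!p prints it
# unless the match was on line 1, where there is no previous line.
before=$(printf 'one\ntwo\nTARGET\nfour\n' |
    sed -n -e '/TARGET/{g;1!p;}' -e 'h')

printf '%s\n' "$before"
```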

The G command

This command appends the contents of the hold space to the pattern space preceded by a newline. The contents of the hold space will not be affected by this process.

The command can be preceded by any of the address types we saw in episode 3.
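Two well-known one-liners show what G can do. They are sketched here with invented input: the first double-spaces a file, because the empty hold space is appended after a newline, and the second reverses the order of the lines by building the result up in the hold space.

```shell
# G with an empty hold space simply adds a newline to every line.
# (Command substitution strips the trailing blank line from the result.)
spaced=$(printf 'a\nb\n' | sed -e 'G')

# 1!G  append everything saved so far after the current line (not on line 1)
# h    save the result
# $p   print the accumulated, reversed text on the last line
reversed=$(printf 'a\nb\nc\n' | sed -n -e '1!G;h;$p')

printf '%s\n' "$spaced"
printf '%s\n' "$reversed"
```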

The x command

This command exchanges the contents of the hold and pattern spaces.

The command can be preceded by any of the address types we saw in episode 3.
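The exchange can be used to delay each line by one cycle. In this sketch (invented input) every line is swapped with the previously saved one, so what is printed is always the line before; the last line is left in the hold space and never appears.

```shell
# x swaps the new line with the one saved in the previous cycle.
# On line 1 the hold space is still empty, so nothing is printed there.
delayed=$(printf 'a\nb\nc\n' | sed -n -e 'x;1!p')

printf '%s\n' "$delayed"
```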

Flags and modifiers we omitted earlier

When we looked at the s command in episodes 1 and 2 we encountered a subset of the flags, and when we were looking at line addresses in episode 3 we missed out one of the modifiers.

One of the missing flags to s was 'M' (and 'm' which is a synonym, just as 'I' and 'i' are) and the missing modifier was 'M', and they all affect regular expression matching in the same way.

The 'M' modifier/flag stands for multi-line and is useful in the case where the pattern space contains more than one line. It is a GNU sed extension.

The modifier causes '^' to match the empty string after a newline and '$' to match the empty string before a newline, in addition to their usual meanings. There are also special metacharacters which match the beginning and end of the buffer. These are: '\`' for the beginning and "\'" for the end.

The following brief examples demonstrate the features of the 'M' modifier.

Here we have accumulated two lines in the hold space, which have then been transferred to the pattern space. We use s commands (with a 'g' flag, which is superfluous in this example, but useful later [3]) to add square brackets at the beginning and end:

$ sed -ne '1,2H;2{g;s/^/[/g;s/$/]/g;p}' sed_demo1.txt
[
Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases
shows every weekday Monday through Friday. HPR has a long lineage going back to]

Remember that there's an extra newline at the start of the pattern space due to the way the H command works. This example shows that there is only one "beginning" and "end" in this buffer.

If we then modify both of the s commands with the 'M' flag/modifier we get:

$ sed -ne '1,2H;2{g;s/^/[/gM;s/$/]/gM;p}' sed_demo1.txt
[]
[Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases]
[shows every weekday Monday through Friday. HPR has a long lineage going back to]

Now '^' and '$' relate to newlines and surround each of the lines.
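The same effect can be reproduced without the demo file. The sketch below assumes GNU sed, since the 'M' flag is a GNU extension, and uses invented input: two lines are joined with N and a marker is added wherever '^' matches.

```shell
# Without M, '^' matches only at the start of the pattern space.
plain=$(printf 'a\nb\n' | sed -e 'N;s/^/> /g')

# With M, '^' also matches after the embedded newline.
marked=$(printf 'a\nb\n' | sed -e 'N;s/^/> /gM')

printf '%s\n' "$plain"
printf '%s\n' '---'
printf '%s\n' "$marked"
```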

Now, to indicate the start and end of the buffer we need to use '\`' and "\'". However, we have a problem since these characters are significant to the Bash shell, so we move to placing these commands in a file called demo4.sed:

$ cat demo4.sed
1,2H
2{
    g
    s/\`/[/gM
    s/\'/]/gM
    p
}
$ sed -nf demo4.sed sed_demo1.txt
[
Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases
shows every weekday Monday through Friday. HPR has a long lineage going back to]

The demo4.sed file is available on the HPR website.

Examples

Example 1

This example mainly demonstrates the use of the P and y commands:

$ sed -ne '1,2{s/$/\n-/;P;y/aeiou/eioua/;p}' sed_demo1.txt
Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases
Heckir Pabloc Redou (HPR) os en Intirnit Redou shuw (pudcest) thet riliesis
-
shows every weekday Monday through Friday. HPR has a long lineage going back to
shuws iviry wiikdey Mundey thruagh Frodey. HPR hes e lung loniegi guong beck tu
-

Auto-printing is turned off, and the sed commands are all grouped together and controlled by an address range that covers the first two lines of the file.

First an s command adds a newline followed by a hyphen to the current line in the pattern space. The following P command prints out the line that has just been edited in the pattern space, up to the newline that we just added (so we don't see the hyphen).

Then a y command operates on the line, which is still in the pattern space. It changes all the vowels by shifting them to the next in the alphabetic order - 'a' becomes 'e' and so forth. A final p command prints the whole pattern space, which now generates two lines of output because of the newline and hyphen that the s command added.

Example 2

Here we use H and G to make use of the hold space and pattern space:

$ sed -e '1,/^$/{H;d};${G;s/\n$//}' sed_demo1.txt
What differentiates HPR from other podcasts is that the shows are
produced by the community - fellow listeners like you. There is no
restrictions on how long the show can be, nor on the topic you can
cover as long as they "are of interest to Hackers". If you want to see
what topics have been covered so far just have a look at our Archive.
We also allow for a series of shows so that host(s) can go into more
detail on a topic.

Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases
shows every weekday Monday through Friday. HPR has a long lineage going back to
Radio FreeK America, Binary Revolution Radio & Infonomicon, and it is a direct
continuation of Twatech radio. Please listen to StankDawg's "Introduction to
HPR" for more information.

On every line from number 1 to the first blank line the first group of commands is run. The H command appends the input line to the hold space with a newline on the front of it. The d command deletes the line from the pattern space, preventing auto-printing of it. All other lines outside this address range are printed automatically.

When the last line is encountered the second command group is run. The G command appends the hold space to the pattern space, but an extra blank line will have been generated on the front by the first H command. The s command removes the last newline from the pattern space, balancing this addition. The pattern space will then be auto-printed.

The effect will be to take the first paragraph from the text and move it to the end.
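To try this without the demo file, the same script can be run on a small invented input of two paragraphs separated by a blank line. This is only a sketch; the paragraph text is made up, and the script is split across two -e options purely for readability.

```shell
# First paragraph (and the blank line) goes to the hold space and is deleted;
# on the last line it is appended and the trailing newline is removed.
moved=$(printf 'p1 line1\np1 line2\n\np2 line1\np2 line2\n' |
    sed -e '1,/^$/{H;d;}' -e '${G;s/\n$//;}')

printf '%s\n' "$moved"
```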

Example 3

In Episode 2 I speculated on a solution to the problem of joining all of the lines in a text file to make one long line. The following example offers a solution to this:

$ x=$(sed -ne 'H;${g;s/^\n//;s/\n/ /g;p}' sed_demo1.txt)
$ echo ${#x}
768

The example runs the sed command in a command substitution expression such that the variable x contains the result (this subject was covered in my episode entitled "Some further Bash tips"). The length of the variable is then reported.

The sed command turns off auto-printing. Within the sed script itself the H command is run on every line, and this causes every input line to be appended to the hold space with a newline on the front.

When the last line of the file is encountered a group of commands is run. The g command replaces the pattern space with the hold space. The pattern space now contains the newline that was appended before the first input line was saved. The first s command removes this from the front of the pattern space. The second s command replaces all the newlines in the pattern space with a space, thereby making one continuous string. This is then printed with the p command.

As a point of interest, the resulting text is the same length as the original, as can be proved by the following:

$ y=$(cat sed_demo1.txt)
$ echo ${#y}
768
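The same check can be made on a small invented input, where the result is easy to verify by eye. This is a sketch of the technique rather than a replacement for the example above.

```shell
# Join three short lines into one, exactly as in the script above.
joined=$(printf 'a\nb\nc\n' | sed -n -e 'H;${g;s/^\n//;s/\n/ /g;p;}')

# The original text as captured by command substitution, for comparison.
original=$(printf 'a\nb\nc\n')

printf '%s\n' "$joined"
printf '%s %s\n' "${#joined}" "${#original}"
```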

Quiz

Pig Latin / Igpay Atinlay

Use the test data in sed_demo1.txt from Episode 1 and, using a single invocation of sed, convert the first line to Pig Latin. The rules of how to generate this are simple in essence, though there are some exceptions (see the Wikipedia entry for the full details). We will just go for the simplest solution in this quiz, though if you want to be more advanced in your submission please go ahead.

In brief the rules are:

  • Take the first letter of each word and place it at the end, followed by 'ay'. Thus 'pig' becomes 'igpay' and 'latin' becomes 'atinlay'.
  • Skip 1-, 2- and 3-letter words, since 'a' -> 'aay' is not wanted.
  • Do not bother about capitals. Ideally 'Latin' should become 'Atinlay', but sed may not be the best tool to use to do that!

I will include my solution to this problem in the next episode. I hope you will be able to come up with a much better answer than I do!

Note: If you submit a working solution you may be eligible for a prize of some HPR stickers. Send your submission to me. My email address is available here after removing the anti-spam measures. The competition will close after I have posted episode 5 in this series to the HPR site.


  1. My explanation of tee in the audio was less than clear. I should have said that everything sent through the command is written both to the file and to STDOUT. I think the text explains it though.

  2. If you run the script demo3.sh and try to look at the temporary file you will not see it. This is because the PID generated by $$ is local to the process running the script. I have modified the script to report the name of the file to allow you to examine it once it has run.

  3. I used the 'g' flags here just because I used them in the next example; they don't actually do anything. With hindsight, it might have been better if I had removed them in this one.