Introduction to sed - part 5 (HPR Show 2060)

Dave Morriss


Introduction

This episode is the last one in the "Introduction to sed" series.

In the last episode we looked at the full story of how sed works with the hold and pattern buffers. We looked at some of the commands that we had not yet seen and how they can be used to do more advanced processing using sed's buffers.

In this episode we will look at a selection of the remaining commands, which might be described as quite obscure (even very obscure). We will also look at some of the example sed scripts found in the GNU sed manual.

Commands

Finishing off less frequently used commands

We omitted a few commands in this group in the last episode. I will not cover everything in this category but there are some that might be useful, which we'll look at now.

The c command

This is one of the commands for generating text in sed; it replaces whole lines with new text. The command is written as:

c\
line1\
line2

The c command itself must be followed by a backslash, as must every line of text that follows except the last. Each backslash escapes the newline after it, so that the text continues on the next line.

The command can be preceded by any of the address types we saw in episode 2. The lines matching the address(es) are deleted and replaced by the line(s) associated with this command. If no addresses are given all lines are replaced.

Since the command deletes the pattern space a new cycle follows.

The c command can be used on the command line, but not very usefully. Nothing else can follow it in the same expression, because the rest of the line is taken as text, so any further sed commands need another -e option:

$ sed -e '1c\Line removed' -e '3q' sed_demo1.txt
Line removed
shows every weekday Monday through Friday. HPR has a long lineage going back to
Radio FreeK America, Binary Revolution Radio & Infonomicon, and it is a direct

Also, only one line can be generated this way; a backslash within the text does not start a new line and is simply dropped:

$ sed -e '1c\**Censored**\Do not read!' -e '3q' sed_demo1.txt
**Censored**Do not read!
shows every weekday Monday through Friday. HPR has a long lineage going back to
Radio FreeK America, Binary Revolution Radio & Infonomicon, and it is a direct

However, escape characters can be used so the following example generates two lines as intended:

$ sed -e '1c\**Censored**\nDo not read!' -e '3q' sed_demo1.txt
**Censored**
Do not read!
shows every weekday Monday through Friday. HPR has a long lineage going back to
Radio FreeK America, Binary Revolution Radio & Infonomicon, and it is a direct

The one-line form, with the text on the same line as the command, is a GNU extension.

The c command is best used in a file of sed commands. One has been prepared as demo5.sed, which is available on the HPR website. In the example below the file is first listed with the nl command to show line numbers, then used as a sed script:

$ nl -w2 -ba -s': ' demo5.sed
1: 1c\
2: ------\
3: This line has been censored\
4: By the Department of Not Seeing Stuff\
5: ------
6:
7: 3q
$ sed -f demo5.sed sed_demo1.txt
------
This line has been censored
By the Department of Not Seeing Stuff
------
shows every weekday Monday through Friday. HPR has a long lineage going back to
Radio FreeK America, Binary Revolution Radio & Infonomicon, and it is a direct

Of course, this could all be done on one line using '\n' sequences, as we saw above, but that is very specific to GNU sed.
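As a small self-contained check of the multi-line form, the sketch below feeds three made-up lines to sed from printf, so sed_demo1.txt is not needed. The input text and the '[redacted]' placeholder are assumptions for illustration only. It uses a regular expression address rather than a line number.

```shell
# Input is generated inline; the three lines are invented for this sketch.
# The /secret/ address selects the line that c will replace.
result=$(printf 'alpha\nsecret beta\ngamma\n' | sed '/secret/c\
[redacted]')
printf '%s\n' "$result"
```

Only the line containing 'secret' should be replaced; the other two pass through unchanged.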

The a command

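This command is part of standard (POSIX) sed; what GNU sed adds as extensions is the one-line form and the ability to use an address range. It has the same structure of lines as the c command.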

a\
line1\
line2

The command can be preceded by any of the address types we saw in episode 2. The lines matching the address(es) are processed as normal but are followed by the line(s) associated with this command at the end of the current cycle, or when the next input line is read. If no addresses are given, every line processed by sed is followed by the line(s) of the a command.

If using the one-line form (as discussed with the c command) escape sequences like '\n' are allowed.

$ sed -e '1a\Chickens' -e '1q' sed_demo1.txt
Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases
Chickens

Here the a command only applies to the first line, after which a line is added. The second -e expression stops processing after line 1, so we only see one original line and one added line.

The following example adds a line containing just a hyphen after each line of the file, but the second -e expression stops processing after line 3 so we only see three lines of the file:

$ sed -e 'a\-' -e '3q' sed_demo1.txt
Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases
-
shows every weekday Monday through Friday. HPR has a long lineage going back to
-
Radio FreeK America, Binary Revolution Radio & Infonomicon, and it is a direct
-
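A common use of the a command is to add a trailer after the last line. The following is a minimal sketch using the multi-line form with the $ address; the two input lines and the trailer text are invented for illustration.

```shell
# '$' addresses the last input line, so the text is appended once, at the end.
result=$(printf 'one\ntwo\n' | sed '$a\
-- end of file --')
printf '%s\n' "$result"
```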

The i command

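This command is part of standard (POSIX) sed; what GNU sed adds as extensions is the one-line form and the ability to use an address range. It has the same structure of lines as the c and a commands.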

i\
line1\
line2

The command can be preceded by any of the address types we saw in episode 2. The lines matching the address(es) are preceded by the line(s) associated with this command. If no addresses are given, every line processed by sed is preceded by the line(s) of the i command.

If using the one-line form (as discussed with the c command) escape sequences like '\n' are allowed.

The following example adds a line containing just a hyphen before each line of the file, but the second -e expression stops processing after line 3 so we only see three lines of the file:

$ sed -e 'i\-' -e '3q' sed_demo1.txt
-
Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases
-
shows every weekday Monday through Friday. HPR has a long lineage going back to
-
Radio FreeK America, Binary Revolution Radio & Infonomicon, and it is a direct

This example is similar to the preceding one but it adds an open square bracket before each line and a close square bracket after it. It uses the i and a commands to do this.

$ sed -e 'i\[' -e 'a\]' -e '3q' sed_demo1.txt
[
Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases
]
[
shows every weekday Monday through Friday. HPR has a long lineage going back to
]
[
Radio FreeK America, Binary Revolution Radio & Infonomicon, and it is a direct
]
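The multi-line form works with the i command too. This sketch, with invented input and heading text, inserts a two-line heading before line 1 only; note the backslash at the end of every text line except the last.

```shell
# Insert a two-line heading before the first input line.
result=$(printf 'one\ntwo\n' | sed '1i\
NAME\
----')
printf '%s\n' "$result"
```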

"Guru" level commands

The commands we have not seen yet are quite obscure. Even the section on "Commands for sed gurus" in the GNU Manual states:

In most cases, use of these commands indicates that you are probably better off programming in something like awk or Perl. But occasionally one is committed to sticking with sed, and these commands can enable one to write quite convoluted scripts.

I am including them in this episode because they will help with understanding some of the examples from the GNU Manual later on.

Defining a label

It is possible to create simple loops within sed but only by branching to a label conditionally or unconditionally. The label itself consists of a colon and a character sequence:

: label

The label cannot be associated with an address (it makes no sense), and it serves no other purpose than to act as a point for transfer of execution.

The b command

This command takes the form:

b label

It causes an unconditional branch to a label. The label may be omitted, in which case the b command branches to the end of the script, so the remaining commands are skipped and the next cycle starts.

See the third example below "Reverse characters of lines" for an example of this command's use.
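Two short sketches of the b command, using made-up input, may help in the meantime. The first uses b without a label to skip the rest of the script for some lines. The second uses a label to build a loop that joins all input lines; it relies on the N command, which appends the next input line to the pattern space. Neither is from the GNU manual examples discussed later.

```shell
# Bare b: lines starting with '#' skip the substitution that follows.
skipped=$(printf '# keep me\nchange me\n' | sed -e '/^#/b' -e 's/me/you/')
printf '%s\n' "$skipped"

# b with a label: N appends the next input line, and '$!ba' loops back to
# ':a' until the last line has been read; then the newlines become spaces.
joined=$(printf 'a\nb\nc\n' | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g')
printf '%s\n' "$joined"
```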

The t command

This command takes the form:

t label

It causes a conditional branch to the label. The branch is taken only if there has been a successful substitution (s command) since the last input line was read or the last conditional branch was taken. The label may be omitted, in which case a successful test branches to the end of the script and the next cycle starts.

See the third example below "Reverse characters of lines" for an example of this command's use.
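As a quick illustration of a t loop, here is a sketch that inserts thousands separators into a number. It is a well-known sed idiom rather than anything from this series, and the input value is invented. Each pass of the s command adds one comma, and t branches back to the label until the substitution stops matching.

```shell
# Repeat the substitution until it fails: each pass inserts one comma
# before the rightmost group of three digits not yet preceded by a comma.
grouped=$(printf '1234567\n' | sed -e ':a' -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/' -e 'ta')
printf '%s\n' "$grouped"
```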

Commands specific to GNU sed

The F command described below is one of the commands which are specific to GNU sed. For the full list refer to the GNU Manual.

The F command

This command prints out the file name of the current input file (with a trailing newline).

This example contains a command group that is obeyed on line 1 of the input. The commands are an F which prints the filename, and a q which stops processing. Because sed is run in the default "read and print" mode the first line is printed:

$ sed -e '1{F;q}' sed_demo1.txt
sed_demo1.txt
Hacker Public Radio (HPR) is an Internet Radio show (podcast) that releases
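If you want to see only the file name, the -n option can be combined with F. The sketch below assumes a GNU sed recent enough to support F, and uses a temporary file created with mktemp so that it does not depend on sed_demo1.txt; the variable names are just for illustration.

```shell
# Create a throwaway input file, ask sed for its name on line 1, then tidy up.
tmpfile=$(mktemp)
printf 'only line\n' > "$tmpfile"
shown=$(sed -n '1F' "$tmpfile")
printf '%s\n' "$shown"
rm -f "$tmpfile"
```

The name printed should be the one given on the command line, here the path returned by mktemp.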

Examples from the GNU manual

Centering lines

This example is from the GNU manual and centres all lines of a file in a width of 80 columns.

The script, called centre.sed, has been made available on the HPR site, and is reproduced below with line numbers for easy reference. Note that the path to sed has been changed from the original since many Linux distributions store it in /bin rather than /usr/bin.

Note that the -f option on the first line is needed to make sed read the rest of the file as its script.

 1: #!/bin/sed -f
 2:
 3: # Put 80 spaces in the buffer
 4: 1 {
 5: x
 6: s/^$/          /
 7: s/^.*$/&&&&&&&&/
 8: x
 9: }
10:
11: # del leading and trailing spaces
12: y/\t/ /
13: s/^ *//
14: s/ *$//
15:
16: # add a newline and 80 spaces to end of line
17: G
18:
19: # keep first 81 chars (80 + a newline)
20: s/^\(.\{81\}\).*$/\1/
21:
22: # \2 matches half of the spaces, which are moved to the beginning
23: s/^\(.*\)\n\(.*\)\2/\2\1/

  • Lines 4-9: This group of commands is executed on line 1 of the input stream.
    • Line 5: The x command exchanges the pattern space and the hold space. The first input line, which was in the pattern space, moves to the hold space. The hold space was empty originally, so the pattern space is now empty.
    • Line 6: Replaces the empty pattern space by 10 spaces.
    • Line 7: Replaces the 10 spaces in the pattern space by itself 8 times, thereby creating 80 spaces.
    • Line 8: Exchanges the buffers again so that the 80 spaces are stored in the hold space and the pattern space is back as it was.
  • Line 12: In the GNU manual the command is written as y/tab/ / where the word tab stands for a literal tab character, which would otherwise be invisible. The copy used here uses the '\t' escape sequence instead, though this is GNU-specific. The y command replaces all tabs by spaces.
  • Line 13: This s command removes leading spaces.
  • Line 14: This s command removes trailing spaces.
  • Line 17: The G command appends the contents of the hold space to the pattern space, preceded by a newline. The contents of the hold space are not changed. Remember that the hold space contains 80 spaces.
  • Line 20: This s command replaces the pattern space by the first 81 characters, so this should consist of the original line, the newline and some of the newly added spaces.
  • Line 23: This s command matches the line up to the newline (using grouping), followed by as many of the spaces after the newline as can be split into two equal parts. Then half of the spaces (\2) are placed at the beginning of the line, centring it. The newline and the other half of the spaces are discarded.

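This example is built for centring in 80 columns. To use a different width the s command on line 20 would need to change and, for widths greater than 80, so would lines 6 and 7, which generate the spaces. It will also truncate long lines: anything beyond the first 81 characters is discarded. However, it is a useful demonstration.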
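To make the width change concrete, here is a sketch of the same technique adapted to 20 columns so that the result is easy to inspect. It is my own adaptation, not a script from the GNU manual: the function name centre20 and the test word are invented, the tab handling is left out for brevity, and it relies on GNU sed's behaviour of letting '.' match a newline.

```shell
# Centre each input line in 20 columns (a cut-down variant of centre.sed).
centre20() {
    # Build 20 spaces in the hold space (10 spaces, doubled), strip leading
    # and trailing spaces, append newline + hold space, keep 21 characters,
    # then move half of the remaining spaces to the front.
    sed -e '1{x;s/^$/          /;s/^.*$/&&/;x;}' \
        -e 's/^ *//' -e 's/ *$//' \
        -e 'G' \
        -e 's/^\(.\{21\}\).*$/\1/' \
        -e 's/^\(.*\)\n\(.*\)\2/\2\1/'
}

result=$(printf 'tidy\n' | centre20)
# Brackets make the leading spaces visible.
printf '[%s]\n' "$result"
```

With a four-letter word there are 16 spaces to share out, so 8 of them should end up in front of the word.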

Reverse lines of files

This example is from the GNU manual. It emulates the Unix command tac which is a reverse version of cat. The example is quite well described in the manual, but it seemed desirable to look at it in even more detail.

The script, called tac.sed, has been made available on the HPR site, and is reproduced below with line numbers for easy reference. Note that the path to sed has been changed as before.

Note that in addition to option -f we also have -n to suppress auto-printing.

 1: #!/bin/sed -nf
 2:
 3: # reverse all lines of input, i.e. first line became last, ...
 4:
 5: # from the second line, the buffer (which contains all previous lines)
 6: # is *appended* to current line, so, the order will be reversed
 7: 1! G
 8:
 9: # on the last line we're done -- print everything
10: $ p
11:
12: # store everything on the buffer again
13: h

  • Line 1: This is the usual crunch-bang or hash-bang line that is found on executable sed scripts.
  • Line 7: This is an address and a single command. The address is a line number, 1, but is negated so that it refers to all other lines. The G command appends a newline to the contents of the pattern space, and then appends the contents of the hold space to that of the pattern space.
  • Line 10: This is another command controlled by an address, a $, as we saw in episode 3. The command is p which prints the pattern space. So, when the last input line has been reached the entire accumulated pattern space is printed.
  • Line 13: The h command replaces the contents of the hold space with the contents of the pattern space. This is done for every input line since it has no address.

So, the algorithm used here is:

  • The first line read by sed does not trigger anything other than the h command on line 13 of the script. This means that the line is stored in the hold space.
  • The second and subsequent input lines trigger the G command on line 7 of the script. For input line 2, for example, this command appends a newline to the pattern space, then appends input line 1 (previously stored in the hold space) to it. Then the h command on line 13 is invoked and the pattern space (in the order line 2/line 1) is stored in the hold space again. In this way, each line is appended to the already accumulated lines in reverse order.
  • When the last line is read the G command on line 7 will be triggered as before, appending the hold space contents again, with the result that the pattern space now holds the entire file in reverse order. Now, however, the p command on line 10 will trigger and the result of reversing everything will be printed.

It bothers me slightly that the h command on line 13 will be run again after printing everything, but its effects will not be seen. I would have wanted to make line 10 into:

$ {
    p
    q
}

This would stop sed after printing. However, this is probably just obsessive thinking on my part!
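Both forms can be tried without the script file. The sketch below runs the three commands of tac.sed as -e expressions on some made-up input, once as in the manual and once with the p and q group suggested above; the variable names are only for illustration.

```shell
# The tac.sed commands given inline; -n suppresses auto-printing.
original=$(printf 'first\nsecond\nthird\n' | sed -n -e '1!G' -e '$p' -e 'h')

# The variant that quits straight after printing on the last line.
variant=$(printf 'first\nsecond\nthird\n' | sed -n -e '1!G' -e '${p;q}' -e 'h')

printf '%s\n' "$original"
```

The two results should be identical, which supports the point that the final h command has no visible effect.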

Reverse characters of lines

This example is from the GNU manual where sed is used to emulate the rev command. The script, called reverse_characters.sed, has been made available on the HPR site, and is reproduced below with line numbers for easy reference. Note that the path to sed has been changed from the original as before. I have also changed line 6, replacing implicit newlines by '\n' sequences, which might mean the modified script will not run on non-GNU versions of sed.

 1: #!/bin/sed -f
 2:
 3: /../! b
 4:
 5: # Reverse a line. Begin embedding the line between two newlines
 6: s/^.*$/\n&\n/
 7:
 8: # Move first character at the end. The regexp matches until
 9: # there are zero or one characters between the markers
10: tx
11: :x
12: s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
13: tx
14:
15: # Remove the newline markers
16: s/\n//g

  • Line 1: This is the usual hash-bang line that is found on executable sed scripts. This one does not suppress auto-printing.
  • Line 3: Here the b command is invoked on any line that does not contain at least two characters. The b command normally invokes an unconditional branch to a label, but if the label is omitted it triggers a new cycle. The effect here is that any line with fewer than two characters is simply printed and the rest of the commands are ignored. There is no point in reversing such a line!
  • Line 6: This s command replaces the current line by itself (&) with a newline at the beginning and the end.
  • Line 10: This is documented in the GNU manual as: "This is often needed to reset the flag that is tested by the t command." I have tried removing it and the script still works, though other versions of sed may behave differently.
  • Line 11: This is a label 'x' for branch commands.
  • Line 12: This s command uses the newlines added on line 6 to determine which characters to swap. Three groups are captured: the first newline with the character after it, the zero or more characters in the middle, and the character before the second newline together with that newline. The replacement writes the first and third groups in the opposite order, which swaps the two characters and also moves each of them to the outer side of its newline. Only the characters between the two newlines are examined; whatever lies before the first newline or after the second is left alone.
  • Line 13: The t command is a conditional branch to label 'x'. It will only branch if the s command on line 12 performs a substitution. In this way lines 11-13 form a loop to repeat the action on line 12 until the regular expression stops matching.
  • Line 16: Having reversed the line the newlines can be removed, and this s command does this, and the reversed line can then be printed before the next cycle begins.
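The commands described above can also be tried without the script file. This sketch passes them as -e expressions, one per script line, and runs them on two invented strings, one of even and one of odd length. Like the modified script, it assumes GNU sed because of the '\n' in the replacement on line 6.

```shell
# The commands of reverse_characters.sed, wrapped in a shell function
# (the name rev_sed is made up for this sketch).
rev_sed() {
    sed -e '/../!b' \
        -e 's/^.*$/\n&\n/' \
        -e 'tx' -e ':x' \
        -e 's/\(\n.\)\(.*\)\(.\n\)/\3\2\1/' \
        -e 'tx' \
        -e 's/\n//g'
}

even=$(printf 'abcdef\n' | rev_sed)
odd=$(printf 'abc\n' | rev_sed)
printf '%s\n%s\n' "$even" "$odd"
```

In the odd-length case the middle character is left between the two newline markers and never needs to move.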

The processing of a line can be visualised by using the l command. I have provided another version of this script containing such commands to show what is happening:

#!/bin/sed -f

# reverse_characters_debug.sed
#
# A version which prints what it's doing to help understand the process

/../! b

# Reverse a line.  Begin embedding the line between two newlines
s/^.*$/\n&\n/

# List the line to see what the command above did to it
l

# Move first character at the end.  The regexp matches until
# there are zero or one characters between the markers
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
# List the result of each loop iteration
l
tx

# Remove the newline markers
s/\n//g

It is available on the HPR site as reverse_characters_debug.sed and you can examine it yourself. Running it on a simple string gives output as follows:

$ echo abcdefghijklmnopqrstuvwxyz | ./reverse_characters_debug.sed
\nabcdefghijklmnopqrstuvwxyz\n$
z\nbcdefghijklmnopqrstuvwxy\na$
zy\ncdefghijklmnopqrstuvwx\nba$
zyx\ndefghijklmnopqrstuvw\ncba$
zyxw\nefghijklmnopqrstuv\ndcba$
zyxwv\nfghijklmnopqrstu\nedcba$
zyxwvu\nghijklmnopqrst\nfedcba$
zyxwvut\nhijklmnopqrs\ngfedcba$
zyxwvuts\nijklmnopqr\nhgfedcba$
zyxwvutsr\njklmnopq\nihgfedcba$
zyxwvutsrq\nklmnop\njihgfedcba$
zyxwvutsrqp\nlmno\nkjihgfedcba$
zyxwvutsrqpo\nmn\nlkjihgfedcba$
zyxwvutsrqpon\n\nmlkjihgfedcba$
zyxwvutsrqpon\n\nmlkjihgfedcba$
zyxwvutsrqponmlkjihgfedcba

The first line of the output shows the original line being embedded between two newlines.

The second line shows the 'a' and 'z' being swapped as discussed in the explanation. Then successive lines show further swaps based on the positions of the two newlines.

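The second-to-last line repeats the one before it: the final attempt at the substitution fails because there is nothing left between the newlines, but the l command after it still runs. The auto-printed last line shows the final result after all swaps have been carried out and the newlines have been removed.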

My answer to the quiz in the last episode

As promised here is my answer to the quiz I set in episode 4. The request was to use sed_demo1.txt, taking the first line and converting it to Pig Latin. The brief rules were:

  • Take the first letter of each word and place it at the end, followed by 'ay'. Thus 'pig' becomes 'igpay' and 'latin' becomes 'atinlay'.
  • Skip 1- and 2-letter words, since 'a' -> 'aay' is not wanted.
  • Do not bother about capitals.

Here's what I did:

$ sed -ne '1s/\(\b\w\)\(\w\{2,\}\)/\2\1ay/gp' sed_demo1.txt
ackerHay ublicPay adioRay (PRHay) is an nternetIay adioRay howsay (odcastpay) hattay eleasesray
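The same expression can be checked against the examples given in the rules. This sketch uses an invented lower-case sentence rather than sed_demo1.txt; as far as I know, '\b' and '\w' are GNU extensions, so it assumes GNU sed.

```shell
# Words of three or more letters are converted; 'is' and 'a' are skipped
# because \w\{2,\} needs at least two letters after the first one.
piglatin=$(printf 'pig latin is a game\n' | sed -e 's/\(\b\w\)\(\w\{2,\}\)/\2\1ay/g')
printf '%s\n' "$piglatin"
```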

Sadly, there were no winners of this little competition because there were no entries. It's probably just as well that I am finishing this series here because I think I probably sent everyone to sleep several episodes back!!