Gnu Awk - Part 2 (HPR Show 2129)

Dave Morriss


Introduction

This is the second episode in a series where b-yeezi and I will be looking at the AWK language (more particularly its GNU variant gawk). It is a comprehensive interpreted scripting language designed to be used for manipulating text.

The name AWK comes from the names of the authors: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of AWK was written in 1977[1] at AT&T Bell Laboratories. See the GNU Awk User’s Guide for the full history of awk and gawk.

Strictly the name of the language is AWK in capitals, but the command that is typed to invoke it is awk or gawk, so I will use the lower-case version throughout these notes unless it is important to differentiate the two. Nowadays, on many Linux distributions, awk and gawk are synonyms referring to GNU Awk, though some (Debian and Ubuntu, for example) install mawk as the default awk.

I first encountered awk in the late 1980s when I was working on a Digital Equipment Corporation (DEC) VAXCluster running OpenVMS. This operating system had no very good way of manipulating text without writing a compiled program, and manipulating text was something I frequently needed to do. A version of gawk was ported to OpenVMS around this time, which I installed. For me gawk (and sed) totally changed the way I was able to work on OpenVMS at that time.

Simple Awk Usage Recap

Invoking Awk

As we saw in the last episode, awk is invoked on the command line as:

$ awk [options] 'program' inputfile1 inputfile2...
  • awk is the command
  • [options] are the options accepted by the command, one of which, -F was introduced in the last episode
  • program is the awk program enclosed in single quotes; with gawk this may be preceded by -e (like sed) to make it clear that the program follows (where it might otherwise be ambiguous)
  • inputfile1 is the first file to be processed; there may be many; if the character - is given instead of a filename, data is expected on standard input
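
As a small sketch of the invocation described above, the following feeds two lines to awk on standard input, using - in place of a filename, with -F setting a colon as the field separator. The sample data and the variable names are made up for illustration and are not part of this episode's files.

```shell
# Hypothetical sample data, not one of the episode's files
data='alice:30
bob:25'

# '-' tells awk to read standard input; -F: sets the field separator to a colon
out=$(printf '%s\n' "$data" | awk -F: '{ print $2 }' -)
echo "$out"
```

This should print the second field of each line, 30 and then 25.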

What Awk does

Awk views its input data as a series of “records” (usually newline-delimited lines), where each record contains a series of “fields”. A field is a component of a record delimited by a “field separator”.

In the last episode field separators were whitespace (spaces, TABs and newlines), which is the default, or a comma (-F "," or -F,).

One of the features of awk is that it treats multiple space separators as one, as we saw in the last episode. There were multiple spaces between many of the fields of the test file.

Other separators are not treated this way, so with the following example record, assuming that the field separator is a comma, three fields are found, with the second one being of zero length:

a,,b
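
The difference can be checked with a short sketch like the one below. The input records are typed in directly and the square brackets are only there to make empty fields visible; neither comes from the episode's test file.

```shell
# With a comma separator the empty field between the two commas is kept
commas=$(printf 'a,,b\n' | awk -F, '{ print "[" $1 "] [" $2 "] [" $3 "]" }')

# With the default separator a run of spaces counts as a single separator
spaces=$(printf 'a    b\n' | awk '{ print "[" $1 "] [" $2 "]" }')

echo "$commas"
echo "$spaces"
```

The first line of output should be [a] [] [b] and the second [a] [b].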

Awk program

As we saw in the last episode, an awk program consists of a series of rules where each rule consists of:

pattern { action }

Normally each rule begins on a new line in the program (though this is not mandatory). There are program components other than rules, but we’ll deal with these later on.

In a rule ‘pattern’ is used to identify a line in some way, and ‘{ action }’ defines what will be done to the line which has been matched by the pattern. Patterns can be simple comparisons, regular expressions, combinations of the two and quite a few other things that will be covered throughout the series.

A pattern may be omitted, in which case the action is applied to every record. Also, a rule can consist only of a pattern, in which case the entire record is written as if the action was { print } (which means print the record).
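
The two shortened forms of a rule might be tried out as follows. This is only a sketch: the three input lines are invented for the purpose and the variable names are arbitrary.

```shell
# Hypothetical sample data
data='apple red
banana yellow
grape purple'

# Pattern only: matching records are printed as if the action were { print }
pattern_only=$(printf '%s\n' "$data" | awk '/^b/')

# Action only: the action is applied to every record
action_only=$(printf '%s\n' "$data" | awk '{ print $1 }')

echo "$pattern_only"
echo "$action_only"
```

The pattern-only program should print just the banana line, while the action-only program should print the first field of all three lines.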

Awk programs are essentially data driven in that actions depend on the data, so they are quite a bit different from programs in many other programming languages.

More about fields and records

As was covered in episode 1, once Awk has separated an input record into fields they are stored as numbered entities. These are available by using a dollar sign followed by a number. So, $1 refers to field 1, $2 field 2, and so on. The variable $0 refers to the entire record in an un-split state.

The number after a dollar sign is actually an expression, so $2 and $(1+1) mean the same thing. This is an example of an arithmetic expression, and is a useful feature of awk.
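
A quick way to convince yourself of this is a sketch such as the following, which uses a made-up one-line input and compares a literal field number with two computed ones.

```shell
# A literal field number
literal=$(printf 'one two three\n' | awk '{ print $2 }')

# The same field selected by an arithmetic expression
computed=$(printf 'one two three\n' | awk '{ print $(1+1) }')

# Any expression that yields a field number will do
third=$(printf 'one two three\n' | awk '{ print $(4-1) }')

echo "$literal $computed $third"
```

The first two commands should both give two, and the last should give three.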

There is a special variable called NF in which awk stores the number of fields it has found in the current record. This can be printed or used in tests as shown in the following example (which uses file1.txt introduced in episode 1):

$ awk '{ print $0 " (" NF ")" }' file1.txt | head -3
name       color  amount (3)
apple      red    4 (3)
banana     yellow 6 (3)

(Note that we used ‘head -3’ to truncate the output here.)
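
The example above prints NF. As a sketch of the other use mentioned, a test on NF, the following selects only records with exactly two fields. The input lines are invented for illustration and are not from file1.txt.

```shell
# Hypothetical sample data with differing numbers of fields
data='one two three
four five
six seven eight'

# The pattern NF == 2 is true only for records that have exactly two fields
out=$(printf '%s\n' "$data" | awk 'NF == 2 { print $0 " (" NF ")" }')
echo "$out"
```

Only the middle line has two fields, so the output should be the single line four five (2).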

The way in which print works here is that the items written one after another, which may be variables or strings, are concatenated into a single string before being printed. Here we have $0, the record itself, followed by a string containing a space and an open parenthesis, the NF variable, and another string containing a close parenthesis.

As well as counting fields per record, awk also counts input records. The record number is held in the variable NR, and this can be used in the same way as we have seen with NF. For example, to print the record number before each line we could write:

$ awk '{ print NR ": " $0 }' file1.txt
1: name       color  amount
2: apple      red    4
3: banana     yellow 6
4: strawberry red    3
5: grape      purple 10
6: apple      green  8
7: plum       purple 2
8: kiwi       brown  4
9: potato     brown  9
10: pineapple  yellow 5

Note that writing the above with no spaces other than the one after print is completely acceptable (though potentially less clear):

$ awk '{print NR": "$0}' file1.txt

In the audio I wasn’t sure about this, but I have since checked.

More about printing

So far we have seen the print statement and have found that it is a little awkward to use to print a mixture of fixed text and variables. In particular, there is no interpolation of variables into strings as can be seen in other scripting languages (e.g. Bash).

There is also a printf statement in Awk. This is similar to printf in C and Bash. It takes a format argument followed by a comma-separated list of items. The argument list may be enclosed in parentheses.

printf format, item1, item2, ...

The format argument (or format string) defines how each of the other arguments is to be output. It uses format specifiers to do this, amongst which are ‘%s’ which means “output a string” and ‘%d’ for outputting a whole decimal number. For example, the following printf statement outputs the record followed by a parenthesised number of fields:

printf "%s (%d)\n",$0,NF

Note that, unlike print, no newline is generated unless requested explicitly. The escape sequence ‘\n’ does this.
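
The printf statement above can be tried in a complete command, as in this sketch. The input is typed in directly rather than read from file1.txt, and the shell's own printf is used only to generate that input; the second command leaves out ‘\n’ to show the effect on the output.

```shell
# The printf statement from above, applied to one sample record
out=$(printf 'apple      red    4\n' | awk '{ printf "%s (%d)\n", $0, NF }')
echo "$out"

# Without \n in the format the records are written with nothing between them
joined=$(printf 'a\nb\n' | awk '{ printf "%s", $0 }')
echo "$joined"
```

The first command should print the record followed by (3), and the second should print ab on a single line.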

There are more format specifiers and more features of printf to be described, and these will be covered later in the series.

More about Awk programs

So far we have seen examples of simple awk programs written on the command line. For more complex programs it is usually preferable to place them in files. The option -f FILE may be used to invoke such a file containing a program. File example1.awk, included with this episode, is an example of this and holds the following:

/^a/ { print "A: " $0 }
/^b/ { print "B: " $0 }

This would be run as follows:

$ awk -f example1.awk file1.txt
A: apple      red    4
B: banana     yellow 6
A: apple      green  8
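
If you do not have the episode's files to hand, the same idea can be sketched with a temporary program file. The file name here is whatever mktemp generates and the three input lines are made up; only the two rules are taken from example1.awk.

```shell
# Create a temporary program file holding the two rules shown above
prog=$(mktemp)
cat > "$prog" <<'EOF'
/^a/ { print "A: " $0 }
/^b/ { print "B: " $0 }
EOF

# Run the program file against three made-up input lines
out=$(printf 'apple\nbanana\ncherry\n' | awk -f "$prog")
echo "$out"

# Tidy up the temporary file
rm -f "$prog"
```

The line cherry matches neither pattern, so only A: apple and B: banana should be printed.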

It is the convention to give such files the extension .awk to make it clear that they hold an Awk program. This is not mandatory but it gives a useful clue to file managers and editors as to what the file is.

As you will have seen if you followed the sed series and other HPR episodes on scripting, an Awk program file can be made into a script by adding a #! line at the top and making it executable. The file example2.awk has been included with this episode to demonstrate this feature. It looks like this:

1  #!/usr/bin/awk -f
2  #
3  # Print all but line 1 with the line number on the front
4  #
5  NR > 1 { printf "%d: %s\n",NR,$0 }

Note that we added the path to where the awk program may be found, and ‘-f’, to the first line. Without the -f option, awk would not read its program from the file.

Note also that lines 2-4 are comments. Line 5 is the program which prints each line with a line number, but only if the number is greater than 1. Thus the header line is not printed.

The Awk file must be made executable for this to work:

$ chmod u+x example2.awk

Then it can be invoked as follows (assuming it is in the current directory):

$ ./example2.awk file1.txt
2: apple      red    4
3: banana     yellow 6
4: strawberry red    3
5: grape      purple 10
6: apple      green  8
7: plum       purple 2
8: kiwi       brown  4
9: potato     brown  9
10: pineapple  yellow 5

Summary

This episode covered:

  • Awk’s concept of records and fields
  • How spaces as field separators are different from any other separators
  • How an Awk program is made up of ‘pattern { action }’ rules
  • How fields are referred to by a dollar sign followed by a numeric expression
  • The variables NF and NR which hold the number of fields and the record number
  • The print and printf statements
  • Awk program files and the -f option
  • Executable Awk scripts

  1. I said 1997 in the audio, not 1977. Doh!