Gnu Awk - Part 15 (HPR Show 2824)

Redirection of input and output - part 2

Dave Morriss


Introduction

This is the fifteenth episode of the “Learning Awk” series which is being produced by b-yeezi and myself.

This is the second of a pair of episodes looking at redirection in Awk scripts.

In this episode I will spend some time looking at the getline command used for explicit input (as opposed to the usual implicit sort), often with redirection. The getline command is a complex subject which I will cover only relatively briefly. You are directed to the getline section of the GNU Awk User’s Guide for the full details.

Redirection of input

A reminder of how awk processes rules

We are going to look at how awk’s normal input processing is changed in this episode, so I thought it might be a good idea to revisit how things work in the normal course of events.

The awk script reads a line from a file or standard input and then scans the (non BEGIN/END) rules that make up the script in the sequence they are listed. If a rule matches then it is run, and the process of matching continues until all rules have been checked. It is entirely possible that multiple rules will match, and they will all be executed if so, in the sequence they are encountered.

I have prepared a data file awk15_testdata1 and a simple script awk15_ex1.awk to demonstrate this, both downloadable. The data is generated with the lorem command (see footnote 1) thus:

$ printf "%s\n" $(lorem -w 3) > awk15_testdata1

The two files are shown here:

$ cat awk15_ex1.awk
#!/usr/bin/awk -f

# Downloadable example 1 for GNU Awk Part 15

{ print "R1 ---" }
{ print "R2",$0 }
{ print "R3",$0 }

$ cat awk15_testdata1
voluptatibus
quaerat
sunt

Running the script gives the following result:

$ ./awk15_ex1.awk awk15_testdata1
R1 ---
R2 voluptatibus
R3 voluptatibus
R1 ---
R2 quaerat
R3 quaerat
R1 ---
R2 sunt
R3 sunt

You can see that each rule is run for each line read from the data file. Rule 1 just prints some hyphens and does nothing with the data, but rules 2 and 3 print the line that was read. There is nothing to stop any of these rules from running.
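To see the same mechanism with patterns attached, here is a small sketch. It is not one of the downloadable examples: the function name rules_demo, the input words and the rule labels are all made up for illustration. Each record is tested against every rule in order, and only the rules whose pattern matches are run:

```shell
# Hypothetical demonstration: three rules, each with its own pattern
rules_demo() {
    printf '%s\n' alpha beta gamma | awk '
    /a/    { print "R1 contains a:", $0 }
    /^b/   { print "R2 starts with b:", $0 }
    NR > 1 { print "R3 not first:", $0 }
    '
}
rules_demo
```

All three rules fire for "beta", while "alpha" only triggers rule 1 and "gamma" triggers rules 1 and 3.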

The getline command

So far we have encountered awk scripts which have read lines from a file or standard input and used them to match patterns which invoke various actions. That is different from the way many other programming languages handle input – and is one of the great strengths of awk.

The 'getline' command can be used to read lines explicitly outside the usual read→pattern-match→action cycle of awk.

Simple usage

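The 'getline' command used on its own (with no arguments) reads in the next line and splits it up into fields in the normal way. When it reads from the normal input like this it replaces '$0', updates 'NF' and 'NR', and the script carries on with the statement after the 'getline' call rather than starting again at the first rule.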

If 'getline' finds a record it returns 1, and if it encounters the end of the file it returns 0. If there’s an error while reading it returns -1 (and the variable 'ERRNO' will contain a description of the error).
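The three return values can be seen in a short sketch. To make an error easy to provoke it uses the 'getline var < file' form, which is described later in these notes. The function name, the temporary file and the path /nonexistent/no-such-file are assumptions made for the demonstration (the path is assumed not to exist on your system):

```shell
# Sketch: the three possible return values of getline
getline_status_demo() {
    tmp=$(mktemp)
    printf 'only line\n' > "$tmp"
    awk -v file="$tmp" 'BEGIN {
        r1 = (getline line < file)      # a record was read
        r2 = (getline line < file)      # end of file reached
        close(file)
        r3 = (getline line < "/nonexistent/no-such-file")   # cannot be opened
        print r1, r2, r3
    }'
    rm -f "$tmp"
}
getline_status_demo
```

This should print "1 0 -1".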

The following script (awk15_ex2.awk) is the same as the one just looked at except that it now calls 'getline' inside rule 2.

$ cat awk15_ex2.awk
#!/usr/bin/awk -f

# Downloadable example 2 for GNU Awk Part 15

{ print "R1 ---" }
{ print "R2",$0; getline }
{ print "R3",$0 }

Running the script gives the following result:

$ ./awk15_ex2.awk awk15_testdata1
R1 ---
R2 voluptatibus
R3 quaerat
R1 ---
R2 sunt
R3 sunt

Here it can be seen that rule 2 printed the first line read from the data file. The 'getline' call then read the second line, replacing the first one, and rule 3 then printed it. The third line was then read in the normal way. This time there was nothing left for the 'getline' to read, so it returned 0 and left '$0' unchanged, which is why rules 2 and 3 both printed that last line.
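Checking the value returned by 'getline' lets a script notice when there was no further line to read. The following is a minimal sketch with made-up input and a hypothetical function name, not one of the downloadable examples:

```shell
# Sketch: pair up lines, and say so when the last line has no partner
pairs_demo() {
    printf '%s\n' one two three | awk '
    {
        first = $0
        if ((getline) > 0)          # a partner line was read into $0
            print first, "+", $0
        else                        # end of input: $0 is unchanged
            print first, "(no partner)"
    }'
}
pairs_demo
```

With three input lines the first two are paired and the third is reported as having no partner.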

The following downloadable example deals with a file of text where some lines have continuations. This is shown by the line ending with a hyphen. The script detects these lines and concatenates them with the next line. The data file (edited output from the 'lorem' command again) is included with this show (awk15_testdata2) and is listed below.

$ cat awk15_ex3.awk
#!/usr/bin/awk -f

# Downloadable example 3 for GNU Awk Part 15

{
    if ($NF == "-") {
        $NF = ""
        line = $0
        getline
        print line $0
    }
    else {
        print $0
    }
}

$ cat awk15_testdata2
Dolore eum corporis excepturi. -
Dolorum nulla qui nemo at earum beatae. Laborum
quo hic rem aspernatur accusamus -
praesentium. Impedit eveniet ut reprehenderit
deleniti aut placeat. -
Laudantium sapiente eaque dolor.

Running the script (awk15_ex3.awk) gives the following result:

$ ./awk15_ex3.awk awk15_testdata2
Dolore eum corporis excepturi. Dolorum nulla qui nemo at earum beatae. Laborum
quo hic rem aspernatur accusamus praesentium. Impedit eveniet ut reprehenderit
deleniti aut placeat. Laudantium sapiente eaque dolor.

If the last field ('$NF') is a hyphen then it’s deleted and the line is saved. The 'getline' call then re-fills '$0' and it is printed preceded by the saved line. Using 'getline' makes this type of processing simpler.

Note that this script is too simple for real use since it doesn’t deal with cases like the final '-' not being separated from the preceding word, and would fail if there was a hyphen ending the last line – and so on.

See the more sophisticated example in the GNU Awk User’s Guide (4.10.1 Using getline with No Arguments).
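As a rough idea of how the weaknesses mentioned above might be tackled, here is a sketch of my own (it is not the example from the User's Guide). It assumes that any line ending in a hyphen is a continuation, whether or not the hyphen is separated from the last word, and it stops joining if the input runs out. The function name, the variable 'more' and the test data are invented for the illustration:

```shell
# Sketch: join continuation lines, tolerating attached hyphens and end of input
join_demo() {
    printf '%s\n' 'first part -' 'second part-' 'third part' 'standalone' 'dangling -' | awk '
    {
        line = $0
        # Keep joining while the text gathered so far ends with a hyphen
        while (line ~ /-$/) {
            sub(/ *-$/, "", line)           # drop the hyphen and any spaces before it
            if ((getline more) <= 0)        # no further input: stop joining
                break
            line = line " " more
        }
        print line
    }'
}
join_demo
```

Note that a genuinely hyphenated word at the end of a line would also be treated as a continuation marker, so this is still far from a general solution.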

Reading into a variable

If 'getline var' is used the next record is read from the main input stream into a variable (var in this example). The record is not split into fields, and variables like 'NF' are not changed. Since the main input stream is being read, variables like 'NR' (number of records) are changed.
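A short sketch shows the effect. The function name, the variable 'nextrec' and the two input lines are made up for the demonstration:

```shell
# Sketch: getline var leaves $0 and NF alone but still advances NR
getline_var_demo() {
    printf '%s\n' 'a b c' 'd e' | awk '
    NR == 1 {
        getline nextrec
        print "$0=" $0, "NF=" NF, "NR=" NR, "nextrec=" nextrec
    }'
}
getline_var_demo
```

The output should be "$0=a b c NF=3 NR=2 nextrec=d e": the first record and its field count are untouched, but 'NR' has moved on to 2.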

Reading from a file

This is another case of redirection:

getline < file

Here 'file' is a string expression that specifies the file name.

As mentioned earlier, the string expression used here can also be used to close the file with the 'close' command, but has to be specified exactly. Saving the expression in a variable helps with this:

input = path "/" filename
getline < input
close(input)

In this fragment it is assumed that 'path' contains a file path, which is concatenated with a slash and a file name to produce the input specification.
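Here is a runnable sketch of that fragment, using a temporary file whose name is split into 'path' and 'filename' purely to mirror the variables above (the function name and file contents are assumptions). Building the name in a variable has a second benefit: the GNU Awk User's Guide warns that 'getline < path "/" filename' is ambiguous unless the concatenation is put in parentheses.

```shell
# Sketch: read a file with getline, then close it and read it again from the top
reread_demo() {
    tmp=$(mktemp)
    printf '%s\n' first second > "$tmp"
    awk -v path="$(dirname "$tmp")" -v filename="$(basename "$tmp")" 'BEGIN {
        input = path "/" filename
        getline < input;  print "read:", $0
        getline < input;  print "read:", $0
        close(input)                 # the next read starts again at the top
        getline < input;  print "after close:", $0
    }'
    rm -f "$tmp"
}
reread_demo
```

After the 'close' the third read returns the first line of the file again.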

Reading from a file into a variable

This is a combination of the previous two forms:

getline var < file

As before 'file' is a string expression that specifies the file name.

The following simple example (downloadable as part of this episode) deals with the file we generated in episode 14 'fruit_names'.

$ cat awk15_ex4.awk
#!/usr/bin/awk -f

# Downloadable example 4 for GNU Awk Part 15

BEGIN {
    if (ARGC != 2 ) {
        print "Needs a file name argument" > "/dev/stderr"
        exit 1
    }

    data = ARGV[1]

    while ( (getline line < data) > 0 )
        print line
    close(data)
}

Note: I did not explain 'ARGC' and 'ARGV' very clearly in the audio. As in C and many other languages on Unix-like systems, 'ARGC' is a numeric variable containing the count of arguments given to the script when it is run from the command line. The arguments themselves are stored in the array 'ARGV'. Element zero is always the name of the command (normally 'awk' or 'gawk' rather than the script file), so 'ARGC' is one greater than the number of arguments you typed.
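If you want to check this for yourself, a sketch like the following lists the contents of 'ARGV'. The function name and the arguments "one" and "two" are arbitrary. Because the program consists only of a 'BEGIN' rule, awk never tries to open them as files:

```shell
# Sketch: show ARGC and every element of ARGV
argv_demo() {
    awk 'BEGIN {
        print "ARGC=" ARGC
        for (i = 0; i < ARGC; i++)
            print i, ARGV[i]
    }' one two
}
argv_demo
```

The first line reports "ARGC=3". What appears as element 0 depends on which awk you are running. Elements 1 and 2 are the two arguments.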

Running the script (awk15_ex4.awk) simply lists the file.

$ ./awk15_ex4.awk fruit_names
apple
banana
strawberry
grape
apple
plum
kiwi
potato
pineapple

This is (another) trivial script presented as an example of how this form of 'getline' can be used. Everything runs in the 'BEGIN' rule. First a check is made to ensure the script has been given an argument (the input file), and if so the name is stored in the variable 'data'. If not an error message is written and the script exits. If all is well a 'while' loop runs, reading lines from the file and printing them. Finally the file is closed.

As a seasoned awk user by now you will have realised that the above could have been achieved with the much simpler script:

$ awk '{print}' fruit_names
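This form of 'getline' is more useful when a second file has to be consulted while the main input is processed in the usual way. The sketch below is an illustration of that idea, not part of the downloadable examples: a small list of wanted names (a made-up temporary file) is loaded in the 'BEGIN' rule and then used to filter the main input.

```shell
# Sketch: load a lookup list with getline, then filter the main input against it
lookup_demo() {
    wanted=$(mktemp)
    printf '%s\n' apple plum > "$wanted"
    printf '%s\n' apple banana plum kiwi | awk -v wanted="$wanted" '
    BEGIN {
        # The "> 0" test matters: a bare "while (getline name < wanted)"
        # would loop forever if the file could not be read (-1 is true)
        while ((getline name < wanted) > 0)
            keep[name] = 1
        close(wanted)
    }
    $0 in keep { print "wanted:", $0 }
    '
    rm -f "$wanted"
}
lookup_demo
```

Only "apple" and "plum" are reported, since they are the only input lines present in the lookup list.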

Reading from a pipe

Using 'command | getline' or 'command | getline var' reads from a command. In the first case the record is split into fields in the usual way, and in the second case it is stored in a variable.
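The difference between the two forms can be tried out without a network connection. In this sketch the command is just an 'echo' (chosen for the demonstration, as is the function name), run once for each form:

```shell
# Sketch: the two pipe forms of getline, using a harmless local command
pipe_demo() {
    awk 'BEGIN {
        cmd = "echo alpha beta gamma"

        cmd | getline               # fills $0 and splits it into fields
        print NF, $2
        close(cmd)                  # closing lets the command be run again

        cmd | getline line          # fills only the variable
        print line
        close(cmd)
    }'
}
pipe_demo
```

The first form gives access to the fields (3 of them, with "beta" as the second), while the second leaves the whole line in 'line'.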

The following simple example (awk15_ex5.awk downloadable as part of this episode) runs 'wget' to read the HPR statistics page:

$ cat awk15_ex5.awk
#!/usr/bin/awk -f

# Downloadable example 5 for GNU Awk Part 15

BEGIN {
    cmd = "wget -q http://hackerpublicradio.org/stats.php -O -"
    while ((cmd | getline) > 0) {
        if ($0 ~ /^Shows in Queue:/)
            printf "Queued shows on HPR: %d\n", $4
    }
    close(cmd)
}

The statistics include a line 'Shows in Queue: x' which the script checks for. If it is found then the number at the end is extracted (as a normal awk field) and it is displayed with different text. Running the script gives the following result (at the time of generating these notes):

$ ./awk15_ex5.awk

Queued shows on HPR: 27

The following downloadable example (awk15_ex6.awk) is essentially the same as the previous one except that it uses 'command | getline var':

$ cat awk15_ex6.awk
#!/usr/bin/awk -f

# Downloadable example 6 for GNU Awk Part 15

BEGIN {
    cmd = "wget -q http://hackerpublicradio.org/stats.php -O -"
    while ((cmd | getline line) > 0);
    close(cmd)

    split(line,fields,",")
    printf "Queued shows on HPR: %d\n", fields[10]
}

It loops through the lines returned, placing each in the variable 'line' but doing nothing else. This means that the last line is left in the variable at the end. This contains comma-separated numbers which are separated into an array called 'fields' using the 'split' function. The 10th element contains the number of queued shows in this case.
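The same technique can be tried locally. In the sketch below the 'wget' command is replaced by two 'echo' commands that stand in for the statistics page (the heading line, the numbers and the function name are invented):

```shell
# Sketch: keep only the last line from a command, then split it on commas
last_line_demo() {
    awk 'BEGIN {
        cmd = "echo heading; echo 10,20,30"
        while ((cmd | getline line) > 0) ;   # empty body: just keep the latest line
        close(cmd)

        n = split(line, fields, ",")
        print n, fields[2]
    }'
}
last_line_demo
```

This should print "3 20": 'split' returns the number of elements it created, and the second element is 20.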

Using getline with a coprocess

This feature is provided by Gnu Awk and it allows a coprocess to be created which can be written to and read from. In the context of print and printf we send data to the coprocess with the '|&' operator, as we have seen briefly already. Not surprisingly, 'getline' can be used to read data back, either being split up into fields in the normal way, or being saved in a variable.

This subject is quite advanced and will not be discussed in much depth here. The GNU Awk User’s Guide can be used to find out more about getline and coprocesses and about the whole subject of Two-Way I/O.

The following downloadable example (awk15_ex7.awk) demonstrates a use for this feature. In this case we have a SQLite database. This is a copy of one that I use to keep track of HPR episodes on the Internet Archive and is called awktest.db in this incarnation. It is not included with the show.

The command to interact with the database is simply 'sqlite3 awktest.db' and this command can be fed an SQL query of the form:

select id,title from episodes where id = ?;

Here the '?' represents a show number that is inserted into the query (actually in the form of a 'printf' template using '%d', as you will see). On the command line you can do this type of thing in this way:

$ printf 'select id,title from episodes where id = %d;\n' {2796..2800} | sqlite3 awktest.db
2796    IRS,Credit Freezes and Junk Mail Ohh My!
2797    Writing Web Game in Haskell - Simulation at high level
2798    Should Podcasters be Pirates ?
2799    building an arduino programmer
2800    My YouTube Subscriptions #6

Here is the script:

$ cat awk15_ex7.awk
#!/usr/bin/awk -f

# Downloadable example 7 for GNU Awk Part 15

BEGIN {
    db = "awktest.db"
    cmd = "sqlite3 " db
    querytpl = "select id,title from episodes where id = %d;\n"
}

$0 ~ /^[0-9]+$/ {
    printf querytpl,$0 |& cmd
    cmd |& getline result
    print result
}

In the 'BEGIN' rule the variables 'db', 'cmd' and 'querytpl' are initialised with the database name, the command to interact with it and a template to be used to construct a query.

The main rule looks for numbers which are to be used in the query. If a number is detected a 'printf' command uses the format string in 'querytpl', and the number just received, to generate the query and pass it to the coprocess which is running the database command.

Then we use 'getline' to read the result from the database into a variable called 'result' which is printed. Be aware that this is a simple script which does not cater for errors of any kind.

There are various ways in which this script could be run. One number could be echoed into it, a string of multiple lines containing numbers could be passed in, as could a file of numbers. It could also read from the terminal and process numbers as they are typed in. We will demonstrate it running with a file of show numbers which is listed before the script is run (but not included in downloadable form):

$ cat awk15_ex5.data
2761
2789
2773

$ ./awk15_ex7.awk awk15_ex5.data
2761    HPR Community News for February 2019
2789    Pacing In Storytelling
2773    Lead/Acid Battery Maintenance and Calcium Charge Voltage
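If you do not have a suitable database to hand, a coprocess can be tried out with the 'sort' command instead. This sketch follows the general pattern described in the Two-Way I/O section of the GNU Awk User's Guide. It needs 'gawk' itself, since other versions of awk do not provide the '|&' operator, and the function name and the words being sorted are just examples. Because 'sort' produces nothing until it has seen all of its input, the sending side of the coprocess is closed with 'close(cmd, "to")' before anything is read back:

```shell
# Sketch: use sort as a coprocess (requires gawk)
coproc_demo() {
    gawk 'BEGIN {
        cmd = "sort"
        print "pear"   |& cmd
        print "apple"  |& cmd
        print "banana" |& cmd
        close(cmd, "to")            # send end-of-file so that sort can finish
        while ((cmd |& getline line) > 0)
            print "sorted:", line
        close(cmd)
    }'
}
coproc_demo
```

The three words come back in alphabetical order, each prefixed with "sorted:".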

If this subject is of interest you could refer to clacke’s HPR episode about coprocesses in Bash: hpr2793 :: bash coproc: the future (2009).

Finale

There is more that could be said about redirection of input and output, as well as about coprocesses. In fact there are many more subjects within Gnu Awk that could be examined. However, this series will soon be coming to an end.

My collaborator b-yeezi and I feel that the areas of Gnu Awk we have not covered in this series might be best left for you to investigate further if you have the need. We both feel that awk is a very useful tool in many respects, but does not stand comparison with more advanced scripting languages such as Python, Ruby and Perl. Perl in particular has borrowed many ideas from Awk but has extended them considerably. Ruby was designed with Perl in mind, and Python has innovated considerably too and is a very widely-used language. Even though Gnu Awk has advanced considerably since it was created it still shows its age and its usefulness is limited.

There are cases where quite complex scripts might be written in Awk, but the way most people seem to use it is as part of a pipeline or inside shell scripts of various sorts. Where you might write a complex script in Perl, Python or Ruby (for example), taking on a large project solely in Awk seems like a bad choice today.

Before we finish this series it is planned to produce one more episode – number 16. In it b-yeezi and I will record a show together. At the time of writing there is no timescale, but we will endeavour to do this as soon as our schedules allow.


  1. The Lorem Ipsum text here is generated by the 'lorem' command which is installed with the Perl module called Text::Lorem. You can generate words, sentences or paragraphs of pseudo-Latin with it. The module exists as a Debian package called 'libtext-lorem-perl' amongst others.