Gnu Awk - Part 12 (HPR Show 2610)

Dave Morriss


Introduction

This is the twelfth episode of the “Learning Awk” series which is being produced by b-yeezi and myself.

In this episode I want to continue with the subject I started in episode 10, an advanced-level look at arrays in Awk.

In case it might be of interest I have also included a section describing a recent use I made of awk to solve a problem, though this does not use arrays.

More about arrays in Awk

Using patsplit

We saw the split function in episode 10, but there is also a more powerful function for splitting strings into array elements called patsplit.

patsplit(string, array [, fieldpat [, seps ] ])

Divide string into pieces defined by fieldpat and store the pieces in array and the separator strings in the seps array.
This works in much the same way as split, described in episode 10; consult that episode for the details of this type of string splitting. The main difference from split is that the third argument, fieldpat, is a regular expression which defines the field rather than the separator. Like split, patsplit returns the number of pieces it found.

Examples

1. Using patsplit to split a comma-delimited string. This could just as well have been done by setting the FS variable and using awk’s standard splitting mechanism (or FPAT which has not been covered in this series so far):

$ cat awk12_ex1.awk
{
    patsplit($0,a,/[^,]*/)
    for (i in a)
        printf "%s ",a[i]
    print ""
}
$ x="An apple a day keeps the doctor away"
$ echo "${x// /,}"
An,apple,a,day,keeps,the,doctor,away
$ echo "${x// /,}" | awk -f awk12_ex1.awk
An apple a day keeps the doctor away

Note that the fieldpat argument is not the delimiter, but a definition of the field structure itself. Here the regexp specifies a sequence of zero or more characters which are not commas.

Note also that Bash variable 'x' is set to a string, then this is edited to replace spaces by commas and fed to the awk script - which removes them again!

2. Another example using a more complex regular expression:

$ cat awk12_ex2.awk
{
    patsplit($0,a,/([^,]*)|("[^"]+")/)
    for (i in a)
        printf "<%s> ",a[i]
    print ""
}
$ echo "A,\"red bird\",in,the,hand,is,worth,two,in,the,bush" | awk -f awk12_ex2.awk
<A> <"red bird"> <in> <the> <hand> <is> <worth> <two> <in> <the> <bush>

This regexp handles data which is more like the standard CSV format:

([^,]*)|("[^"]+")
  • The first sub-expression matches a sequence of zero or more characters which are not commas.
  • The second one matches a double-quoted string containing one or more characters which are not double quotes. CSV conventions (RFC 4180, for example) call for fields containing commas, double quotes or line breaks to be quoted.

3. Showing what happens to the separators:

$ cat awk12_ex3.awk
{
    flds = patsplit($0,a,/[A-Za-z]+/,s)
    for (i in a)
        printf "%s ",a[i]
    print ""
    for (i=1; i<=flds; i++)
        printf "%s ",s[i]
    print ""
}
$ echo "Grinning--------like----a-Cheshire--------cat---" | awk -f awk12_ex3.awk
Grinning like a Cheshire cat
-------- ---- - -------- ---

In this example the number of fields is stored in flds. The regexp used to define the fields is a sequence of one or more letters. These are printed in a loop as before.

The separators are printed in a second loop which counts from 1 to the number of fields; element s[i] holds the separator which follows field a[i]. There is also an element zero because patsplit saves the separator which precedes the first field, but this is empty here so we don’t print it.

Skip unless really interested

The data sent to this example was generated by an awk script which is shown below and is available in the downloadable file awk12_extra.awk. Note that this one has been made into a standalone script by the addition of the #! line at the start (and has been made executable):

$ cat awk12_extra.awk
#!/usr/bin/awk -f
#
# Awk script to take a sequence of words separated by spaces and turn them
# into a string where each word is followed by as many hyphens as there are
# letters in the word itself.
#
{
    for (i=1; i<=NF; i++){
        fill=$i
        gsub(/./,"-",fill)
        printf "%s%s",$i,fill
    }
    print ""
}
$ echo "Grinning like a Cheshire cat" | ./awk12_extra.awk
Grinning--------like----a-Cheshire--------cat---

Sorting arrays

Using PROCINFO

In standard awk, the order in which the elements of an array are returned is not defined and it’s necessary to go to some trouble to order them in a specific way.

GNU Awk (gawk) lets you control the order in which the array elements are returned by use of a special built-in array called PROCINFO.

Setting PROCINFO["sorted_in"] to one of a set of predefined values allows array sorting. The values are:

Value             Effect
"@unsorted"       Array elements are unsorted as in standard awk
"@ind_str_asc"    Order by indices in ascending order compared as strings
"@ind_str_desc"   Order by indices in descending order compared as strings
"@ind_num_asc"    Order by indices in ascending order forcing them to be treated as numbers
"@ind_num_desc"   Order by indices in descending order forcing them to be treated as numbers
"@val_type_asc"   Order by element values in ascending order. Ordering is by the type assigned to the element
"@val_type_desc"  Order by element values in descending order. Ordering is by the type assigned to the element
"@val_str_asc"    Order by element values in ascending order. Scalar values are compared as strings.
"@val_str_desc"   Order by element values in descending order. Scalar values are compared as strings.
"@val_num_asc"    Order by element values in ascending order. Scalar values are compared as numbers.
"@val_num_desc"   Order by element values in descending order. Scalar values are compared as numbers.

Caveats:

  • The sort order is determined before the loop begins and cannot be changed inside it.
  • The value of PROCINFO["sorted_in"] is effective throughout the script and affects all array-scanning loops; it is not localised.

This feature of GNU Awk is more complicated than has been described here. For example, arrays can be more complex than we have seen so far, and PROCINFO["sorted_in"] can also be used to call a user-defined function for sorting. The full details are available in the GNU Awk Manual, starting with section 8.1.6.

Examples

1. Sorting an array by its values:

$ cat awk12_ex4.awk
BEGIN{
    PROCINFO["sorted_in"]="@val_str_asc"
}
{
    split($0,a," ")
    for (i in a)
        printf "%d: %s\n",i,a[i]
}
$ echo "An Englishman's home is his castle" | awk -f awk12_ex4.awk
1: An
2: Englishman's
6: castle
5: his
3: home
4: is

Here the array is populated using split. The BEGIN rule sets PROCINFO["sorted_in"] to request ordering by element values, compared as strings, in ascending order. The array is printed showing the indices and values and you can see that the order is as requested. Note that the words with capitals sort before the lowercase ones.

Addendum: I have included another example of the use of PROCINFO later in the notes. Since the audio has already been recorded I have named the example awk12_ex10.awk to avoid changing other file names.

Using Awk’s Array Sorting Functions

As mentioned in episode 11, there are two functions for sorting arrays in GNU Awk: asort and asorti.

asort(source [, dest [, how ] ])

Returns the number of elements in the array source.
Sorts the values of source and replaces the indices of the sorted values of source with sequential integers starting with one.
If the optional array dest is specified, then source is duplicated into dest. dest is then sorted, leaving the array source unchanged.
The third argument how specifies how the array is to be sorted.

asorti(source [, dest [, how ] ])

Returns the number of elements in the array source.
Sorts the indices of source instead of the values.
If the optional array dest is specified, then source is duplicated into dest. dest is then sorted, leaving the array source unchanged.
The third argument how specifies how the array is to be sorted.

In both cases the optional how argument defines the type of sorting. This must be one of the strings already defined: "@ind_str_asc" to "@val_num_desc". It can also be, as mentioned above, the name of a user-defined function. We have not looked at user-defined functions yet, so we will leave this option for the moment.

Examples

1. Sorting an array with numeric indexes with asort reorders the indices:

$ cat awk12_ex5.awk
BEGIN{
    a[1]="Jones"
    a[2]="X"
    a[3]="Smith"
    asort(a)
    for (i in a)
        printf "%s %s\n",i,a[i]
}
$ awk -f awk12_ex5.awk
1 Jones
2 Smith
3 X

Note that the original indices have been discarded and replaced with 1, 2 and 3, so some values are now attached to different indices: "X" has moved from index 2 to index 3, and "Smith" from 3 to 2.

2. Sorting an array with character indices using asort, showing that providing a destination array is a way to avoid affecting the original:

$ cat awk12_ex6.awk
BEGIN{
    a["a"]="Jones"
    a["b"]="X"
    a["c"]="Smith"
    asort(a,b)
    for (i in b)
        printf "b[%s] = %s\n",i,b[i]
    print ""
    for (i in a)
        printf "a[%s] = %s\n",i,a[i]
}
$ awk -f awk12_ex6.awk
b[1] = Jones
b[2] = Smith
b[3] = X

a[a] = Jones
a[b] = X
a[c] = Smith

This again shows the sorted array 'b' has had its indices replaced by the numbers 1, 2 and 3, so if these were important it might be a problem.

3. Sorting an array with string indices using asorti rebuilds the array with just the indexes, which is usually not useful on its own:

$ cat awk12_ex7.awk
BEGIN{
    a["third"]="Jones"
    a["second"]="X"
    a["first"]="Smith"
    asorti(a)
    for (i in a)
        printf "%s %s\n",i,a[i]
}
$ awk -f awk12_ex7.awk
1 first
2 second
3 third

In this case the contents of the array 'a' have been destroyed, making the indices the contents and adding numeric indices.

4. Sorting an array with string indices using asorti but using the dest argument results in an array that can be used to access the original array in sorted order without changing it:

$ cat awk12_ex8.awk
BEGIN{
    a["third"]="Jones"
    a["second"]="X"
    a["first"]="Smith"
    asorti(a,b)

    print "What array a contains:"
    for (i in a)
        printf "a[%s] = %s\n",i,a[i]
    print ""

    print "What array b contains:"
    for (i in b)
        printf "b[%s] = %s\n",i,b[i]
    print ""

    print "Accessing original array a with sorted indices in b"
    for (i in b)
        printf "%6s: %s\n",b[i],a[b[i]]
}
$ awk -f awk12_ex8.awk
What array a contains:
a[first] = Smith
a[third] = Jones
a[second] = X

What array b contains:
b[1] = first
b[2] = second
b[3] = third

Accessing original array a with sorted indices in b
 first: Smith
second: X
 third: Jones

Note: Since the audio explanation of this example was a bit vague I have enhanced the example to (hopefully) make it more understandable.

5. Sorting an array with character indices using asort but requesting a sort type "@val_str_desc" - descending order of element values:

$ cat awk12_ex9.awk
BEGIN{
    a["a"]="Jones"
    a["b"]="X"
    a["c"]="Smith"
    asort(a,b,"@val_str_desc")
    for (i in b)
        printf "%s %s\n",i,b[i]
}
$ awk -f awk12_ex9.awk
1 X
2 Smith
3 Jones

Extra example

1. Another PROCINFO example which counts the initial letters of words in a dictionary:

$ cat awk12_ex10.awk
#!/usr/bin/awk -f

#
# Sort the indices as strings in ascending order
#
BEGIN{
    PROCINFO["sorted_in"]="@ind_str_asc"
}

#
# Make a frequency table of the first letter of each word
#
{
    freq[substr($1,1,1)]++
}

#
# Print the results in the frequency table
#
END{
    for (i in freq)
        printf "%s: %d\n",i,freq[i]
}
$ ./awk12_ex10.awk /usr/share/dict/words
A: 1412
B: 1462
C: 1592
D: 828
E: 641
F: 529
G: 834
H: 916
I: 350
J: 558
K: 659
...

In this example I have made the script executable and have added a hash bang line to define it as an Awk script. Don’t forget the '-f' at the end of that extra line.

In this example the dictionary file /usr/share/dict/words is scanned. Each line contains a single word and the script takes the first letter of this word and uses it as an index to the array freq. This element is incremented by 1 resulting in the accumulation of the frequencies of these initial letters. The frequency table is printed in the END rule but because a sort order has been defined in the BEGIN rule the elements appear in ascending order of the index.

Yet more about arrays

There is more to be said about arrays in GNU Awk. It is possible to have multi-dimensional arrays (of a sort) and to have arrays as array elements too (a GNU extension).

We probably will not be covering these further topics in this series, though there is plenty of information in the GNU Awk manual if you want to dig deeper.

Of course, if we receive a request to cover this area in more depth then we will reconsider!


Real-world Awk example

One of the things I do for HPR is to process the show notes sent in with episodes, many of which are plain text. Since we need HTML for loading into the HPR database I run these through an editor and a series of scripts to turn them into Markdown, and then generate HTML from them. I do this on my workstation after grabbing a copy of the notes from the HPR server.

In order to check that the generated HTML looks OK I make a local copy of it, which can be viewed with a browser, and I use a tool called pandoc to make this version. This tool turns Markdown into HTML (amongst other document conversion tasks), but lately some of its requirements have changed necessitating a change to my workflow.

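To make the HTML copy I want for local viewing pandoc needs some additional information. The information takes the form of two lines in YAML format, one holding the show title and the other the host name, delimited by a line of three hyphens before them and a line of three full stops after them.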

This metadata is used to generate headers in the final document.

To generate this I added the following awk script to the Bash script I wrote that runs pandoc:

The first line is the invocation of awk. Note that the argument to the -f option is '-', which means the standard input channel. This is catered for by the Bash heredoc which is everything from "<<'ENDAWK'" to the last line in the example. This is Bash’s way of embedding data in a script without having to put it in a string and risk all the issues that can ensue with string delimiters.

The character string (ENDAWK) used in the heredoc to enclose the information to be offered to awk on standard input is chosen by the user, but it must not occur on a line by itself within the enclosed text, since that would end the heredoc early. Enclosing the first instance in single quotes turns off Bash’s parameter expansion and command substitution within the enclosed document - so '$0' in this example would have been seen and interpreted by Bash as a shell variable if this had not been done.
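
The effect of quoting the delimiter can be seen in a small sketch which does not involve awk at all. The variable name and the EOF delimiter are arbitrary choices for the illustration.

```shell
name="World"

# Unquoted delimiter: the shell expands $name inside the document
plain=$(cat <<EOF
Hello $name
EOF
)

# Quoted delimiter: the text is passed through untouched
quoted=$(cat <<'EOF'
Hello $name
EOF
)

echo "$plain"
echo "$quoted"
```

The first echo prints "Hello World" and the second prints "Hello $name". The same delimiter is used for both heredocs without any problem.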

The data file being processed by awk is a file containing the output of the show submission form, the name of which is in the RAWFILE variable. The output from awk is written to a temporary file, the name of which is in the variable TMP1.

The awk script itself writes the necessary three hyphens in the BEGIN rule (line 2) and the final three full stops in the END rule (line 13).

There are two regular expression matching rules. One matches ^Title: which precedes the show title in the input file. The other matches ^Host_Name: which labels the line containing the name of the host.

In both cases these labels, together with the trailing white space (often a Tab), are deleted using the sub function (lines 4 and 9).

Because the resulting strings might contain quotes, gsub is used to ensure that any quotes are doubled (lines 5 and 10).

Finally the two strings are written out with the required labels for pandoc, using single quotes to enclose each of them (lines 6 and 11).

The resulting file of YAML-format metadata is read by pandoc before the file of notes for the show.

Note that the viewable HTML file created here uses the HPR CSS so that it looks just as it will when the show is released.

This is not a very complex Awk script, but I thought it might be of interest, especially given that a Bash heredoc is being used.