
Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes Monday through Friday.


In-Depth Series

Learning Awk

Episodes about using Awk, the text manipulation language. It comes in various forms called awk, nawk, mawk and gawk, but the standard version on Linux is GNU Awk (gawk). It's a programming language optimised for the manipulation of delimited text.

Awk Part 7 - b-yeezi | 2017-07-07

In this episode, I will (very) briefly go over loops in the Awk programming language. Loops are useful when you want to run the same command(s) on a collection of data or when you just want to repeat the same commands many times.

When using loops, a command or group of commands is repeated for as long as a condition (or several) holds.

While Loop

Here is a silly example of a while loop:

#!/bin/awk -f
BEGIN {

# Print the squares from 1 to 10 the first way

    i=1;
    while (i <= 10) {
        print "The square of ", i, " is ", i*i;
        i = i+1;
    }

exit;
}

Our condition is set in the parentheses after the while statement. We set a variable, i, before entering the loop, then increment i inside the loop. If you forget to provide a way for the condition to become false, the while loop will go on forever.
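The same pattern is handy for walking over the fields of a line. Here is a small sketch (the function name sum_fields is made up for this example) that adds up every field on each input line:

```shell
# Hypothetical helper: add up all the fields on each input line.
sum_fields() {
    awk '{
        total = 0
        i = 1
        while (i <= NF) {    # NF is the number of fields on this line
            total += $i      # $i is the i-th field
            i++              # without this the loop would never end
        }
        print total
    }'
}

printf '1 2 3\n10 20\n' | sum_fields
```

For the two input lines above this prints 6 and then 30.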

Do While Loop

Here is an equally silly example of a do while loop:

#!/bin/awk -f
BEGIN {

    i=2;
    do {
        print "The square of ", i, " is ", i*i;
        i = i + 1
    } while (i <= 10)

exit;
}

Here, the commands in the do code block are executed once before the condition is tested for the first time. After that, the loop repeats for as long as the condition holds.
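To see the difference from a plain while loop, here is a small sketch that starts both kinds of loop with a value that already fails the condition. The function name compare_loops is just for illustration:

```shell
# Hypothetical helper: same start value and condition, tested first
# (while) and tested last (do while).
compare_loops() {
    awk 'BEGIN {
        i = 11
        while (i <= 10) {          # false straight away: body never runs
            print "while:", i
            i++
        }

        i = 11
        do {
            print "do while:", i   # body runs once before the test
            i++
        } while (i <= 10)
    }'
}

compare_loops
```

Only the line "do while: 11" is printed, because the while loop tests its condition before running the body and the do while loop tests it afterwards.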

For Loop

Another silly example of a for loop:

#!/bin/awk -f
BEGIN {

    for (i=1; i <= 10; i++) {
        print "The square of ", i, " is ", i*i;
    }

exit;
}

As you can see, we set the variable, the condition and the increment all in the parentheses after the for statement.

For Loop Over Arrays

Here is a more useful example of a for loop. We store the different values of column 2 as the keys of an array/hash-table called a. After processing the file, we print those keys.

For file.txt:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

Using the awk file of:

NR != 1 {
    a[$2]++
}
END {
    for (b in a) {
        print b
    }
}

We get the results of:

brown
purple
red
yellow
green
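The order in which for (b in a) visits the keys is not defined by Awk, so the list may come out in a different order on your system. If you need a predictable order, one option is to pipe the result through sort. A minimal sketch, assuming the sort command is available (sample and distinct_colors are made-up names):

```shell
# Hypothetical helper that prints the example data used in this series.
sample() {
    printf '%s\n' 'name color amount' 'apple red 4' 'banana yellow 6' \
        'strawberry red 3' 'grape purple 10' 'apple green 8' \
        'plum purple 2' 'kiwi brown 4' 'potato brown 9' 'pineapple yellow 5'
}

# Distinct colors, sorted alphabetically by the external sort command.
distinct_colors() {
    awk 'NR != 1 { a[$2]++ }
         END { for (b in a) print b }' | sort
}

sample | distinct_colors
```

This prints brown, green, purple, red and yellow, one per line, in that order.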

In another example, we do a similar process. This time, not only do we store all the distinct values of the second column, we perform a sum operation on column 3 for each distinct value of column 2.

For file.csv:

name,color,amount
apple,red,4
banana,yellow,6
strawberry,red,3
grape,purple,10
apple,green,8
plum,purple,2
kiwi,brown,4
potato,brown,9
pineapple,yellow,5

Using the awk file of:

BEGIN {
    FS=",";
    OFS=",";
    print "color,sum";
}
NR != 1 {
    a[$2]+=$3;
}
END {
    for (b in a) {
        print b, a[b]
    }
}

We get the results of:

color,sum
brown,13
purple,12
red,7
yellow,11
green,8

As you can see, we are also printing a header line prior to processing the file using the BEGIN code block.
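The same idea extends to more than one array. The following sketch is not from the episode; it keeps a count and a sum per color and prints the mean as well. The names sample_csv, color_stats, n and sum are just illustrative choices:

```shell
# Hypothetical helper that prints the CSV example data.
sample_csv() {
    printf '%s\n' 'name,color,amount' 'apple,red,4' 'banana,yellow,6' \
        'strawberry,red,3' 'grape,purple,10' 'apple,green,8' \
        'plum,purple,2' 'kiwi,brown,4' 'potato,brown,9' 'pineapple,yellow,5'
}

# Per-color count, sum and mean of the amount column.
color_stats() {
    awk 'BEGIN {
             FS = ","; OFS = ","
             print "color,count,sum,mean"
         }
         NR != 1 {
             n[$2]++          # how many rows have this color
             sum[$2] += $3    # running total of the amount column
         }
         END {
             for (c in sum) {
                 print c, n[c], sum[c], sum[c] / n[c]
             }
         }'
}

sample_csv | color_stats
```

The header comes first, followed by one line per color such as brown,2,13,6.5. The order of the color lines is not guaranteed.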


Gnu Awk - Part 6 - Dave Morriss | 2017-03-01

Gnu Awk - Part 6

Introduction

This is the sixth episode of the “Learning Awk” series that b-yeezi and I are doing.

Recap of the last episode

Regular expressions

In the last episode we saw regular expressions in the ‘pattern’ part of a ‘pattern {action}’ sequence. Such a sequence is called a ‘RULE’ (as we have seen in earlier episodes).

$1 ~ /p[elu]/ {print $0}

Meaning: If field 1 contains a ‘p’ followed by one of ‘e’, ‘l’ or ‘u’ print the whole line.

$2 ~ /e{2}/ {print $0}

Meaning: If field 2 contains two instances of letter ‘e’ in sequence, print the whole line.

It is usual to enclose the regular expression in slashes, which make it a regexp constant.

We had a look at many of the operators used in regular expressions in episode 5. Unfortunately, some small errors crept into the list of operators mentioned in that episode. These are incorrect:

  • \A (beginning of a string)
  • \z (end of a string)
  • \b (on a word boundary)

The first two operators exist in languages like Perl and Ruby, but not in GNU Awk.

For the ‘\b’ sequence the GNU manual says:

In other GNU software, the word-boundary operator is ‘\b’. However, that conflicts with the awk language’s definition of ‘\b’ as backspace, so gawk uses a different letter. An alternative method would have been to require two backslashes in the GNU operators, but this was deemed too confusing. The current method of using ‘\y’ for the GNU ‘\b’ appears to be the lesser of two evils.

The corrected list of operators is discussed later in this episode.

Replacement

Last episode we saw the built-in functions that use regular expressions for manipulating strings. These are sub, gsub and gensub. Regular expressions are used in other functions but we will look at them later.

We will be looking at sub, gsub and gensub in more detail in this episode.

Long notes

I have written out a set of longer notes for this episode and these are available here.


Gnu Awk - Part 5 - b-yeezi | 2016-12-15

GNU AWK - Part 5

Regular Expressions in AWK

The syntax for using regular expressions to match lines in AWK is as follows:

word ~ /match/

Or for not matching, use the following:

word !~ /match/

Remember the following file from the previous episodes:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

We can run the following command:

$1 ~ /p[elu]/ {print $0}

We will get the following output:

apple      red    4
grape      purple 10
apple      green  8
plum       purple 2
pineapple  yellow 5

In another example:

$2 ~ /e{2}/ {print $0}

Will produce the output:

apple      green  8
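Both operators can be used side by side in one program. Here is a small sketch using the same data. The function names sample and match_demo are made up, and the header line is skipped with next:

```shell
# Hypothetical helper that prints the example data.
sample() {
    printf '%s\n' 'name color amount' 'apple red 4' 'banana yellow 6' \
        'strawberry red 3' 'grape purple 10' 'apple green 8' \
        'plum purple 2' 'kiwi brown 4' 'potato brown 9' 'pineapple yellow 5'
}

match_demo() {
    awk 'NR == 1   { next }    # skip the header line
         $2 ~ /^p/ { print $1, "has a color starting with p" }
         $2 !~ /e/ { print $1, "has a color without an e" }'
}

sample | match_demo
```

This reports grape and plum for the first pattern (purple) and kiwi and potato for the second (brown).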

Regular expression basics

Certain characters have special meaning when using regular expressions.

Anchors

  • ^ - beginning of the line
  • $ - end of the line
  • \` - beginning of a string (gawk)
  • \' - end of a string (gawk)
  • \y - on a word boundary (gawk; \b is a backspace in Awk)

Characters

  • [ad] - a or d
  • [a-d] - any character a through d
  • [^a-d] - not any character a through d
  • \w - any word-constituent character (letter, digit or underscore)
  • \s - any white-space character
  • [0-9] - any digit (gawk does not support \d)

The capital versions, \W and \S, are negations.

Or, you can reference characters the POSIX standard way. These classes are used inside a bracket expression, for example [[:digit:]]:

  • [:alnum:] - Alphanumeric characters
  • [:alpha:] - Alphabetic characters
  • [:blank:] - Space and TAB characters
  • [:cntrl:] - Control characters
  • [:digit:] - Numeric characters
  • [:graph:] - Characters that are both printable and visible (a space is printable but not visible, whereas an ‘a’ is both)
  • [:lower:] - Lowercase alphabetic characters
  • [:print:] - Printable characters (characters that are not control characters)
  • [:punct:] - Punctuation characters (characters that are not letters, digits, control characters, or space characters)
  • [:space:] - Space characters (such as space, TAB, and formfeed, to name a few)
  • [:upper:] - Uppercase alphabetic characters
  • [:xdigit:] - Characters that are hexadecimal digits

Quantifiers

  • . - match any character
  • + - match preceding one or more times
  • * - match preceding zero or more times
  • ? - match preceding zero or one time
  • {n} - match preceding exactly n times
  • {n,} - match preceding n or more times
  • {n,m} - match preceding between n and m times

Grouped Matches

  • (...) - Parentheses are used for grouping
  • | - Means or in the context of a grouped match

Replacement

  • The sub command substitutes the match with the replacement string. This only applies to the first match.
  • The gsub command substitutes all matching items.
  • The gensub command substitutes in a similar way to sub and gsub, but with extra functionality
  • The & character in the replacement field references the matched text. You have to use \& to replace the match with the literal & character.
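To make the difference between sub and gsub concrete, here is a small sketch. The function name replace_demo and the variables first and all are made up for the example. Both functions return the number of substitutions they made:

```shell
# Hypothetical helper: apply sub and gsub to copies of the same line.
replace_demo() {
    awk '{
        first = $0
        all = $0
        n1 = sub(/an/, "<&>", first)    # first match only
        n2 = gsub(/an/, "<&>", all)     # every match
        print n1, first
        print n2, all
    }'
}

echo banana | replace_demo
```

The output is "1 b<an>ana" followed by "2 b<an><an>a". The & in the replacement stands for whatever text was matched.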

Example:

{
    sub(/apple/, "nut", $1);
    print $1
}

The output is:

name
nut
banana
strawberry
grape
nut
plum
kiwi
potato
pinenut

Another example:

{
    sub(/.+(pp|rr)/, "test-&", $1);
    print $1
}

This produces the following output:

name
test-apple
banana
test-strawberry
grape
test-apple
plum
kiwi
potato
test-pineapple



Gnu Awk - Part 4 - Dave Morriss | 2016-11-16

Gnu Awk - Part 4

Introduction

This is the fourth episode of the series that b-yeezi and I are doing. These shows are now collected under the series title “Learning Awk”.

Recap of the last episode

Logical Operators

We have seen the operators ‘&&’ (and) and ‘||’ (or). These are also called Boolean Operators. There is also one more operator ‘!’ (not) which we haven’t yet encountered. These operators allow the construction of Boolean expressions which may be quite complex.

If you are used to programming you will expect these operators to have a precedence, just like operators in arithmetic do. We will deal with this subject in more detail later since it is relevant not only in patterns but also in other parts of an Awk program.
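As a small taste of both points, here is a sketch using the example file from the earlier episodes. The function names are made up. The first pattern uses ‘!’ to negate a parenthesised expression. The second relies on ‘&&’ binding more tightly than ‘||’:

```shell
# Hypothetical helper that prints the example data.
sample() {
    printf '%s\n' 'name color amount' 'apple red 4' 'banana yellow 6' \
        'strawberry red 3' 'grape purple 10' 'apple green 8' \
        'plum purple 2' 'kiwi brown 4' 'potato brown 9' 'pineapple yellow 5'
}

# Rows whose color is neither red nor brown (header skipped with NR > 1).
not_red_or_brown() {
    awk 'NR > 1 && !($2 == "red" || $2 == "brown") { print $1 }'
}

# && binds more tightly than ||, so this reads as
#   red  OR  (purple AND amount > 5)
red_or_big_purple() {
    awk '$2 == "red" || $2 == "purple" && $3 > 5 { print $1 }'
}

sample | not_red_or_brown
sample | red_or_big_purple
```

The first command prints banana, grape, apple, plum and pineapple. The second prints apple, strawberry and grape (plum is purple but its amount is only 2).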

The next statement

We saw this statement in the last episode and learned that it causes the processing of the current input record to stop. No more patterns are tested against this record and no more actions in the current rule are executed. Note that “next” is a statement like “print”, and can only occur in the action part of a rule. It is also not permitted in BEGIN or END rules (more of which anon).

The BEGIN and END rules

The BEGIN and END elements are special patterns, which in conjunction with actions enclosed in curly brackets make up rules in the same sense that the ‘pattern {action}’ sequences we have seen so far are rules. As we saw in the last episode, BEGIN rules are run before the main ‘pattern {action}’ rules are processed and the input file is (or files are) read, whereas END rules run after the input files have been processed.

It is permitted to write more than one BEGIN rule and more than one END rule. These are just concatenated together in the order they are encountered by Awk.
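A quick sketch to show this ordering (the function name multi_begin is made up). The BEGIN and END rules are deliberately interleaved in the program text:

```shell
# Hypothetical helper: two BEGIN rules and two END rules in one program.
multi_begin() {
    awk 'BEGIN { print "first BEGIN" }
         END   { print "first END, records read:", NR }
         BEGIN { print "second BEGIN" }
         END   { print "second END" }'
}

printf 'a\nb\n' | multi_begin
```

Both BEGIN rules run, in order, before any input is read, and both END rules run, in order, after the two input lines have been processed.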

Awk will complain if either BEGIN or END is not followed by an action since this is meaningless.

Variables, arrays, loops, etc

Learning a programming language is never a linear process, and sometimes reference is made to new features that have not yet been explained. A number of new features were mentioned in passing in the last episode, and we will look at these in more detail in this episode.

Long notes

I have written out a moderately long set of notes for this episode and these are available here http://hackerpublicradio.org/eps/hpr2163/full_shownotes.html.

With a view to making portable notes for this series I have included ePub and PDF versions with this episode. Feedback is welcome to help decide which version is preferable, as are any suggestions on the improvement of the layout.


Gnu Awk - Part 3 - b-yeezi | 2016-10-19

Awk Part 3

Remember our file:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

Replace Grep

As we saw in earlier episodes, we can use awk to filter for rows that match a pattern or text. If you know the grep command, you know that it does the same job, but has extended capabilities. For simple filters, you don't need to pipe grep output to awk. You can just filter in awk.

Logical Operators

You can use logical operators "and" and "or" represented as "&&" and "||", respectively. See example:

$2 == "purple" && $3 < 5 {print $1}

Here, we are selecting for color to equal "purple" AND amount less than 5.

Next command

Say we want to flag every record in our file where the amount is greater than or equal to 8 with a '**'. Every record with an amount between 5 (inclusive) and 8 (exclusive), we want to flag with a '*'. We can use consecutive filter commands, but their effects will be additive: a record that matches more than one pattern is printed once for each match. To remedy this, we can use the "next" command. This tells awk that after the action is taken, proceed to the next record. See the following example:

NR == 1 {
  print $0;
  next;
}

$3 >= 8 {
  printf "%s\t%s\n", $0, "**";
  next;
}

$3 >= 5 {
  printf "%s\t%s\n", $0, "*";
  next;
}

$3 < 5 {
  print $0;
}
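To see what "additive" means in practice, here is a cut-down sketch of the same rules with and without next. The function names are made up and only the name and the flag are printed:

```shell
# Hypothetical helper: no next, so a record can match both rules.
flag_without_next() {
    awk '$3 >= 8 { print $1, "**" }
         $3 >= 5 { print $1, "*" }'
}

# Hypothetical helper: next stops processing after the first match.
flag_with_next() {
    awk '$3 >= 8 { print $1, "**"; next }
         $3 >= 5 { print $1, "*" }'
}

printf '%s\n' 'grape purple 10' 'banana yellow 6' | flag_without_next
printf '%s\n' 'grape purple 10' 'banana yellow 6' | flag_with_next
```

Without next, grape is printed twice (once with ** and once with *). With next, each record is printed only once.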

End Command

The "BEGIN" and "END" commands allow you to run actions before awk reads any input and after it has processed all of it. For instance, sometimes we want to evaluate all records, then print the cumulative results. In this example, we pipe the output of the df command into awk. Our command is:

df -l | awk -f end.awk

Our awk file looks like this:

$1 != "tmpfs" {
    used += $3;
    available += $4;
}

END {
    printf "%d GiB used\n%d GiB available\n", used/2^20, available/2^20;
}

Here, we are setting two variables, "used" and "available". We add the records in the respective columns all together, then we print the totals.

In the next example, we create a distinct list of colors from our file:

NR != 1 {
    a[$2]++
}
END {
    for (b in a) {
        print b
    }
}

This is a more advanced script, the details of which we will get into in future episodes.

BEGIN command

As stated above, the BEGIN command lets us print and set variables before awk starts processing the input. For instance, we can set the input and output field separators inside our awk file as follows:

BEGIN {
    FS=",";
    OFS=",";
    print "color,count";
}
NR != 1 {
    a[$2]+=1;
}
END {
    for (b in a) {
        print b, a[b]
    }
}

In this example, we are finding the distinct count of colors in our csv file, and formatting the output as csv as well. We will get into the details of how this script works in future episodes.

For another example, instead of distinct count, we can get the sum of the amount column grouped by color:

BEGIN {
    FS=",";
    OFS=",";
    print "color,sum";
}
NR != 1 {
    a[$2]+=$3;
}
END {
    for (b in a) {
        print b, a[b]
    }
}

Gnu Awk - Part 2 - Dave Morriss | 2016-09-29

Gnu Awk - Part 2

This is the second episode in a series where b-yeezi and I will be looking at the AWK language (more particularly its GNU variant gawk). It is a comprehensive interpreted scripting language designed to be used for manipulating text.

I have written out a moderately long set of notes for this episode and these are available here http://hackerpublicradio.org/eps/hpr2129/full_shownotes.html.


Gnu Awk - Part 1 - b-yeezi | 2016-09-08

Introduction to Awk

Awk is a powerful text parsing tool for unix and unix-like systems.

The basic syntax is:

awk [options] 'pattern {action}' file

Here is a simple example file that we will be using, called file1.txt:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

First command:

awk '{print $2}' file1.txt

As you can see, the “print” command displays whatever follows it. In this case we are showing the second column using “$2”. To display all columns, use “$0”.

This example will output:

color
red
yellow
red
purple
green
purple
brown
brown
yellow

Second command:

awk '$2=="yellow"{print $1}' file1.txt

This will output:

banana
pineapple

As you can see, the command selects the rows where column 2 equals “yellow”, but prints column 1.

Field separator

By default, awk uses white space as the field separator. You can change this by using the -F option. For instance, file1.csv looks like this:

name,color,amount
apple,red,4
banana,yellow,6
strawberry,red,3
grape,purple,10
apple,green,8
plum,purple,2
kiwi,brown,4
potato,brown,9
pineapple,yellow,5

A similar command as before:

awk -F"," '$2=="yellow" {print $1}' file1.csv

will still output:

banana
pineapple
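The separator can also be set from inside the program, and the output separator can be changed in the same way. A minimal sketch follows. The function names are made up, and -v is the standard awk option for assigning a variable before the program runs:

```shell
# Hypothetical helper that prints the CSV example data.
sample_csv() {
    printf '%s\n' 'name,color,amount' 'apple,red,4' 'banana,yellow,6' \
        'strawberry,red,3' 'grape,purple,10' 'apple,green,8' \
        'plum,purple,2' 'kiwi,brown,4' 'potato,brown,9' 'pineapple,yellow,5'
}

# Input separator given with -F, output separator set with -v OFS.
yellow_with_amount() {
    awk -F, -v OFS=: '$2 == "yellow" { print $1, $3 }'
}

# The same input separator set inside the program instead.
yellow_names() {
    awk 'BEGIN { FS = "," } $2 == "yellow" { print $1 }'
}

sample_csv | yellow_with_amount
sample_csv | yellow_names
```

The first command prints banana:6 and pineapple:5. The second prints banana and pineapple, just like the -F version.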

Regular expressions work as well:

awk '$2 ~ /p.+p/ {print $0}' file1.txt

This returns:

grape      purple 10
plum       purple 2

Numbers are interpreted automatically:

awk '$3>5 {print $1, $2}' file1.txt

Will output:

name color
banana yellow
grape purple
apple green
potato brown

Using output redirection, you can write your results to file. For example:

awk -F, '$3>5 {print $1, $2}' file1.csv > output.txt

This will output a file with the contents of the query.

Here’s a cool trick! You can automatically split a file into multiple files grouped by column. For example, if I want to split file1.txt into multiple files by color, here is the command.

awk '{print > ($2 ".txt")}' file1.txt

This will produce files named yellow.txt, red.txt, etc., plus color.txt for the header line. In upcoming episodes, we will show how to improve the outputs.
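As one possible refinement (a sketch, not the method from later episodes), the header line can be skipped and the files written into a directory of your choice. The function name split_by_color, its two arguments and the variables dir and out are assumptions made for this example:

```shell
# Hypothetical helper: $1 is the input file, $2 an existing output directory.
split_by_color() {
    awk -v dir="$2" 'NR > 1 {
        out = dir "/" $2 ".txt"    # one file per value of column 2
        print > out                # the file stays open for later lines
    }' "$1"
}

# Try it out in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'name color amount' 'apple red 4' 'banana yellow 6' \
    'strawberry red 3' > "$workdir/file1.txt"
split_by_color "$workdir/file1.txt" "$workdir"
cat "$workdir/red.txt"
rm -rf "$workdir"
```

Here red.txt ends up holding the apple and strawberry lines, and no color.txt is created because the header is skipped.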

Resources

  1. http://www.theunixschool.com/p/awk-sed.html
  2. http://www.tecmint.com/category/awk-command/
  3. http://linux.die.net/man/1/awk

Coming up

  • More options
  • Built-in Variables
  • Arithmetic operations
  • Awk language and syntax