100 Useful Command-Line Utilities

by Oliver; 2014

41. awk

From An Introduction to the Command-Line (on Unix-like systems) - awk: awk and sed are command-line utilities which are themselves programming languages built for text processing. As such, they're vast subjects, as these huge manuals (GNU Awk Guide, Bruce Barnett's Awk Guide, GNU Sed Guide) attest. Both languages are almost antiques and have been largely superseded by Perl and Python. For anything serious, you probably don't want to use them. However, their terse syntax makes them useful for the simple parsing and text manipulation problems that crop up on the command line. Writing a single line of awk can be faster and less hassle than hauling out Perl or Python.

The key point about awk is that it works line by line. A typical awk construction is:

cat file.txt | awk '{ some code }'

Awk executes its code once for every line of input. Let's say we have a file, test.txt, such that:

$ cat test.txt 
1	c
3	c
2	t
1	c

In awk, the notation for the first field is $1, $2 is for second, and so on. The whole line is $0. For example:

$ cat test.txt | awk '{print}'     # print full line
1	c
3	c
2	t
1	c

$ cat test.txt | awk '{print $0}'  # print full line
1	c
3	c
2	t
1	c

$ cat test.txt | awk '{print $1}'  # print col 1
1
3
2
1

$ cat test.txt | awk '{print $2}'  # print col 2
c
c
t
c
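Since fields are just variables, you can also print them in any order you like. Here is a small sketch that swaps the two columns; it pipes in two sample rows with printf instead of reading test.txt, and the variable name result is only for illustration. With a comma between the fields, awk joins them with a single space by default.

```shell
# Swap the two columns of tab-separated input.
result=$(printf '1\tc\n3\tc\n' | awk '{ print $2, $1 }')
printf '%s\n' "$result"
# c 1
# c 3
```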

There are two exceptions to the execute code per line rule: anything in a BEGIN block gets executed before the file is read and anything in an END block gets executed after it's read. If you define variables in awk they're global and persist rather than being cleared every line. For example, we can concatenate the elements of the first column with an @ delimiter using the variable x:

$ cat test.txt | awk 'BEGIN{x=""}{x=x"@"$1; print x}'
@1
@1@3
@1@3@2
@1@3@2@1

$ cat test.txt | awk 'BEGIN{x=""}{x=x"@"$1}END{print x}'
@1@3@2@1
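To see the three phases side by side, here is a minimal sketch in which BEGIN prints a header, the main block counts rows, and END reports the count. The sample rows are fed in with printf, and the names n and result are made up for this example.

```shell
# BEGIN runs once before any input; END runs once after the last line.
result=$(printf '1\tc\n3\tc\n2\tt\n1\tc\n' |
  awk 'BEGIN { print "start" } { n++ } END { print "rows seen: " n }')
printf '%s\n' "$result"
# start
# rows seen: 4
```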

Or we can sum up all values in the first column:

$ cat test.txt | awk '{x+=$1}END{print x}'  # x+=$1 is the same as x=x+$1
7

Awk has a bunch of handy built-in variables: NR is the current row number; NF is the number of fields in the current row; and OFS is the output field separator. There are many more, which the awk manuals describe. Continuing with our very contrived examples, let's see how these can help us:

$ cat test.txt | awk '{print $1"\t"$2}'        # write tab explicitly
1	c
3	c
2	t
1	c

$ cat test.txt | awk '{OFS="\t"; print $1,$2}' # set output field separator to tab
1	c
3	c
2	t
1	c

Setting OFS spares us having to type a "\t" every time we want to print a tab. We can just use a comma instead. Look at the following three examples:

$ cat test.txt | awk '{OFS="\t"; print $1,$2}'        # print file as is
1	c
3	c
2	t
1	c

$ cat test.txt | awk '{OFS="\t"; print NR,$1,$2}'     # print row num
1	1	c
2	3	c
3	2	t
4	1	c

$ cat test.txt | awk '{OFS="\t"; print NR,NF,$1,$2}'  # print row & field num
1	2	1	c
2	2	3	c
3	2	2	t
4	2	1	c

So the first command prints the file as it is. The second prints the file with the row number added in front. And the third prints the row number in the first column and the number of fields in the second (in our case, always two). Although these are purely pedagogical examples, these variables can do a lot for you. For example, if you wanted to print the third row of your file, you could use:

$ cat test.txt | awk '{if (NR==3) {print $0}}' # print the 3rd row of your file
2       t

$ cat test.txt | awk '{if (NR==3) {print}}'    # same thing, more compact syntax
2       t

$ cat test.txt | awk 'NR==3'                   # same thing, most compact syntax
2       t
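The same bare-pattern trick extends to a range of rows. As a sketch, assuming the same four sample rows piped in with printf, this prints rows 2 through 3:

```shell
# A pattern with no action block prints every line for which it is true.
result=$(printf '1\tc\n3\tc\n2\tt\n1\tc\n' | awk 'NR >= 2 && NR <= 3')
printf '%s\n' "$result"
```

Here result holds the second and third rows, each still tab-separated.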

Sometimes you have a file and you want to check if every row has the same number of columns. Then use:

$ cat test.txt | awk '{print NF}' | sort -u
2
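If that check turns up more than one value, the next question is which rows are the odd ones out. One possible sketch: pass the expected column count in with -v and print the row number of anything that disagrees. The sample input and the variable name expected are assumptions for this example.

```shell
# The second row has three fields instead of two.
result=$(printf '1\tc\n3\tc\textra\n2\tt\n' |
  awk -v expected=2 'NF != expected { print NR ": " NF " fields" }')
printf '%s\n' "$result"
# 2: 3 fields
```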

In awk $NF refers to the contents of the last field:

$ cat test.txt | awk '{print $NF}' 
c
c
t
c

An important point is that by default awk splits fields on runs of whitespace, not on single tabs (unlike, say, cut). Whitespace here means any combination of spaces and tabs. You can tell awk to split on anything you like by using the -F flag. For instance, let's look at the following situation:

$ echo "a b" | awk '{print $1}'
a

$ echo "a b" | awk -F"\t" '{print $1}'
a b

When we feed a space b into awk, $1 refers to the first field, a. However, if we explicitly tell awk to delimit on tabs, then $1 refers to a b: the line contains no tab, so the whole line is a single field.
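The -F flag works the same way for any other delimiter. A quick sketch with a comma-separated line (the sample record is invented for illustration):

```shell
# Split on commas and pick out the second field.
result=$(echo "alice,30,london" | awk -F, '{ print $2 }')
printf '%s\n' "$result"
# 30
```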

You can also use shell variables inside your awk by importing them with the -v flag:

$ x=hello
$ cat test.txt | awk -v var="$x" '{ print var"\t"$0 }'
hello	1       c
hello	3       c
hello	2       t
hello	1       c
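It's worth quoting the shell variable when you hand it to -v, because a value containing spaces would otherwise be split into several arguments by the shell. A minimal sketch, with a made-up variable named greeting:

```shell
greeting="hello world"
# Quoting "$greeting" keeps the whole value as one -v assignment.
result=$(printf '1\tc\n' | awk -v var="$greeting" '{ print var "\t" $0 }')
printf '%s\n' "$result"
```

Here result is hello world, a tab, and then the original row.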

And you can write to multiple files from inside awk:

$ cat test.txt | awk '{if ($1==1) {print > "file1.txt"} else {print > "file2.txt"}}'

$ cat file1.txt 
1       c
1       c

$ cat file2.txt 
3       c
2       t
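The output file name doesn't have to be hard-coded; it can be built from a field. The following sketch splits the sample rows into one file per value of column 2, working in a scratch directory created with mktemp -d so nothing in your current directory gets overwritten. The parentheses around the file name expression are there for portability across awk implementations.

```shell
workdir=$(mktemp -d)
# Each row goes to a file named after its second field: c.txt or t.txt.
printf '1\tc\n3\tc\n2\tt\n1\tc\n' |
  (cd "$workdir" && awk '{ print > ($2 ".txt") }')
c_rows=$(cat "$workdir/c.txt")
t_rows=$(cat "$workdir/t.txt")
printf '%s\n' "$c_rows"
rm -rf "$workdir"
```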

For loops in awk:

$ echo joe | awk '{for (i = 1; i <= 5; i++) {print i}}'
1
2
3
4
5
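A loop becomes more useful when it runs over the fields of a line, with NF as the upper bound. As a small sketch (the input line is invented for illustration), this prints each field next to its index:

```shell
# $i is the i-th field; NF is the number of fields on the current line.
result=$(echo "a b c" | awk '{ for (i = 1; i <= NF; i++) print i ": " $i }')
printf '%s\n' "$result"
# 1: a
# 2: b
# 3: c
```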

Question: In the following case, how would you print the numbers of the rows in which the first field equals the second field?

$ echo -e "a\ta\na\tc\na\tz\na\ta"
a	a
a	c
a	z
a	a

Here's the answer:

$ echo -e "a\ta\na\tc\na\tz\na\ta" | awk '$1==$2{print NR}'
1
4

Question: How would you print the average of the first column in a text file?

Here's the answer:

$ cat file.txt | awk 'BEGIN{x=0}{x=x+$1;}END{print x/NR}'

NR is a special variable representing the row number. Inside the END block it holds the number of the last row read, which is the total row count.
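One caveat: on an empty file NR is 0 in the END block, so the division fails. A hedged variant that guards against this might look like the following; the sample rows, the variable names and the "no rows" message are all assumptions for the sake of the example.

```shell
# Average of column 1, with a guard for empty input.
avg=$(printf '1\tc\n3\tc\n2\tt\n1\tc\n' |
  awk '{ sum += $1 } END { if (NR > 0) print sum / NR; else print "no rows" }')
empty=$(printf '' |
  awk '{ sum += $1 } END { if (NR > 0) print sum / NR; else print "no rows" }')
printf '%s\n' "$avg"
# 1.75
printf '%s\n' "$empty"
# no rows
```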

The take-home lesson is, you can do tons with awk, but you don't want to do too much. Anything that you can do crisply on one, or a few, lines is awk-able. For more involved scripting examples, see An Introduction to the Command-Line (on Unix-like systems) - More awk examples.
