100 Useful Command-Line Utilities

by Oliver; 2014

100. datamash

Note: datamash is not a default shell program. You have to download and install it.
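
A quick way to see whether datamash is already on your machine is sketched below. The package names in the hint are the usual ones for apt and Homebrew, but treat them as assumptions and check your own package manager.

```shell
# Check whether datamash is on the PATH before relying on it.
if command -v datamash >/dev/null 2>&1; then
    status="installed: $(datamash --version | head -n 1)"
else
    # Typical package names; your system may differ.
    status="missing: try 'sudo apt-get install datamash' or 'brew install datamash'"
fi
echo "$status"
```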

GNU datamash is a great program for crunching through text files and collapsing rows on a common ID or computing basic statistics. Here are some simple examples of what it can do.

Collapse rows in one column based on a common ID in another column:
$ cat file.txt
3       d
2       w
3       c
4       x
1       a
$ cat file.txt | datamash -s -W -g 1 collapse 2
1       a
2       w
3       d,c
4       x
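The -g flag sets the ID (group-by) column; collapse 2 joins the second column's values for each ID into a comma-separated list; the -s flag sorts the input by the ID column first, which grouping requires; and the -W flag splits fields on whitespace instead of the default tab.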
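
To make the grouping concrete, here is a rough sort-and-awk sketch of the same collapse. It is only a cross-check for machines without datamash, not how datamash itself works, and the variable name collapsed is just for illustration.

```shell
# Recreate the two-column sample input (tab-separated).
printf '3\td\n2\tw\n3\tc\n4\tx\n1\ta\n' > file.txt

# Sort on the ID column (stable, so values keep their input order),
# then join the values of each ID with commas.
collapsed=$(sort -s -k1,1 file.txt | awk '
    {
        if ($1 in vals) {
            vals[$1] = vals[$1] "," $2      # ID seen before: append
        } else {
            vals[$1] = $2                   # first value for this ID
            order[++n] = $1                 # remember first-seen order
        }
    }
    END {
        for (i = 1; i <= n; i++)
            print order[i] "\t" vals[order[i]]
    }')
echo "$collapsed"
```

This prints the same four rows as the datamash command, with ID 3 collapsed to d,c.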

Average rows in one column on a common ID:
$ cat file.txt
A       1       3       SOME_OTHER_INFO
A       1       4       SOME_OTHER_INFO2
B       2       30      SOME_OTHER_INFO4
A       2       5       SOME_OTHER_INFO3
B       1       1       SOME_OTHER_INFO4
B       2       3       SOME_OTHER_INFO4
B       2       1       SOME_OTHER_INFO4
$ cat file.txt | datamash -s -f -g 1,2 mean 3
A       1       3       SOME_OTHER_INFO 3.5
A       2       5       SOME_OTHER_INFO3        5
B       1       1       SOME_OTHER_INFO4        1
B       2       30      SOME_OTHER_INFO4        11.333333333333
In this case, the ID is the combination of columns one and two. The -f flag prints the full first line of each group, and the mean of column three is appended as an additional column.
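
The grouped mean can be cross-checked the same way with sort and awk. This is a sketch under the assumption that the file is tab-separated. The variable name means is illustrative, and awk's default number formatting gives fewer digits than datamash (11.3333 rather than 11.333333333333).

```shell
# Recreate the four-column sample input (tab-separated).
printf '%s\t%s\t%s\t%s\n' \
    A 1 3  SOME_OTHER_INFO \
    A 1 4  SOME_OTHER_INFO2 \
    B 2 30 SOME_OTHER_INFO4 \
    A 2 5  SOME_OTHER_INFO3 \
    B 1 1  SOME_OTHER_INFO4 \
    B 2 3  SOME_OTHER_INFO4 \
    B 2 1  SOME_OTHER_INFO4 > file.txt

# Stable sort on columns one and two, then average column three per group,
# keeping the first full line of each group (what -f does in datamash).
means=$(sort -s -k1,1 -k2,2 file.txt | awk -F'\t' '
    {
        key = $1 FS $2
        if (!(key in cnt)) { first[key] = $0; order[++n] = key }
        sum[key] += $3
        cnt[key]++
    }
    END {
        for (i = 1; i <= n; i++) {
            k = order[i]
            print first[k] "\t" (sum[k] / cnt[k])
        }
    }')
echo "$means"
```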

Simply sum a file of numbers:
$ cat file.txt | datamash sum 1
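
For a concrete run of the sum, the sketch below builds a small input with seq. The file name numbers.txt and the awk fallback are illustrative additions so the snippet still runs where datamash is not installed.

```shell
# Ten numbers, one per line; their sum is 55.
seq 1 10 > numbers.txt

if command -v datamash >/dev/null 2>&1; then
    total=$(datamash sum 1 < numbers.txt)
else
    # Fallback so the sketch still runs without datamash.
    total=$(awk '{ s += $1 } END { print s }' numbers.txt)
fi
echo "$total"
```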
Hat tip: Albert
