100 Useful Command-Line Utilities

by Oliver; 2014

59. md5, md5sum

Imagine the following scenario. You've just downloaded a large file from the internet. How do you know no data was lost during the transfer and you've made an exact copy of the one that was online?

To solve this problem, let's review the concept of hashing. If you're familiar with a dict in Python or a hash in Perl, you know that a hash, as a data structure, is simply a way to map a set of unique keys to a set of values. In ordinary life, an English dictionary is a good representation of this data structure. If you know your key is "cat" you can find your value is "a small domesticated carnivorous mammal with soft fur, a short snout, and retractile claws", as Google defines it. In the English dictionary, the authors assigned values to keys, but suppose we only have keys and we want to assign values to them. A hash function is a rule for boiling down each key into a value. Without getting deep into the theory of hashing, it's remarkable that you can hash, say, text files of arbitrary length into a fixed range of numbers. For example, a very stupid hash would be to assign every letter to a number:
A -> 1
B -> 2
C -> 3
.
.
.
and then to go through the file and sum up all the numbers; and finally to take the sum modulo, say, 1000. With this, we could assign the novels Moby Dick, Great Expectations, and Middlemarch all to numbers between 0 and 999! This isn't a good hash function because two novels might well get the same number, but never mind; enough of a digression already.
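
To make the toy scheme above concrete, here is a minimal Python sketch of it. The function name toy_hash, the buckets parameter, and the choice to count anything that isn't a letter as 0 are assumptions made for illustration; the text doesn't specify them.

```python
import string

# Assumed letter values, following the scheme above: A -> 1, B -> 2, ..., Z -> 26.
LETTER_VALUES = {letter: i for i, letter in enumerate(string.ascii_uppercase, start=1)}

def toy_hash(text, buckets=1000):
    """Sum the letter values in text and reduce the total modulo buckets."""
    total = 0
    for ch in text.upper():
        # Assumption: characters outside A-Z (spaces, digits, punctuation) count as 0.
        total += LETTER_VALUES.get(ch, 0)
    return total % buckets

print(toy_hash("cat"))  # c=3, a=1, t=20, so this prints 24
print(toy_hash("act"))  # same letters in a different order, so also 24
```

The second call shows the weakness in miniature: because addition ignores order, any rearrangement of the same letters collides, which is one reason a real hash function like md5 is far more elaborate.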

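md5 is a hash function that hashes a whole file into a 128-bit value, usually printed as a string of 32 hexadecimal characters. The commands md5 and md5sum do about the same thing: md5 is the name on macOS and the BSDs, md5sum is the name in GNU coreutils on Linux, and the exact output layout can differ between them. For example, to compute the md5 hash of a file tmp.txt: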
$ md5 tmp.txt 
84fac4682b93268061e4adb49cee9788  tmp.txt
$ md5sum tmp.txt 
84fac4682b93268061e4adb49cee9788  tmp.txt
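This is a great way to check that you've made a faithful copy of a file. If you're downloading an important file, ask the file's owner to provide the md5 sum. After you've downloaded the file, compute the md5 on your end and check that it's the same as the provided one. If even one byte differs, the two sums will almost certainly not match. Note that md5 guards against accidental corruption only: it is no longer considered collision-resistant, so a matching sum does not prove the file wasn't deliberately tampered with.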
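
If you'd rather script the comparison than eyeball two long strings, the same check can be sketched in Python with the standard hashlib module. The helper names md5_of_file and matches_published_md5 and the 1 MiB chunk size are illustrative choices, not part of any standard tool.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the md5 hex digest of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so a large download never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_md5(path, published_md5):
    """Compare the local digest with the published one, ignoring case and whitespace."""
    return md5_of_file(path) == published_md5.strip().lower()
```

Assuming tmp.txt is the file hashed above, matches_published_md5("tmp.txt", "84fac4682b93268061e4adb49cee9788") should return True, and anything else means the copy differs from the original.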

md5 is one of many hash functions. Another one, for example, is sha1 (the unix utility is sha1sum), which will be familiar to users of git:
$ sha1sum tmp.txt
fbaaa780c23da55182f448e38b1a0677292dde01  tmp.txt
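
From Python, switching between the two hash functions is mostly a matter of changing a name. The sketch below uses hashlib.new from the standard library; the helper hex_digest and the sample bytes are made up for illustration and are not the contents of tmp.txt.

```python
import hashlib

def hex_digest(name, data):
    """Hash data with the named algorithm and return the hex string."""
    return hashlib.new(name, data).hexdigest()

data = b"hello world\n"  # hypothetical sample content, not the tmp.txt above

for name in ("md5", "sha1"):
    digest = hex_digest(name, data)
    # Each hex character encodes 4 bits: md5 gives 32 characters, sha1 gives 40.
    print(name, len(digest), digest)
```

The longer sha1 string reflects its 160-bit output, compared with 128 bits for md5.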
