Hashes explained

So what’s a file hash? The simplest way to think of it is as a (nearly) unique fingerprint of the binary data held within the file. The hash for a given file won’t change if you rename the file (or change any of its properties, such as its timestamps) but will be completely different if you change so much as one byte inside the file itself. Wikipedia has a good primer on hashes.
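To make the fingerprint idea concrete, here is a minimal Python sketch using the standard `hashlib` module. The helper name `file_sha256`, the file names and the file contents are all made up for illustration; the choice of SHA-256 is also just an assumption, and any other algorithm would show the same behaviour.

```python
import hashlib
import os
import tempfile


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's contents (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "report.txt")
    with open(original, "wb") as handle:
        handle.write(b"hello world")
    hash_before = file_sha256(original)

    # Renaming the file leaves its contents, and so its hash, untouched.
    renamed = os.path.join(folder, "renamed.txt")
    os.rename(original, renamed)
    hash_after_rename = file_sha256(renamed)

    # Changing a single byte ("d" becomes "e") gives a different hash.
    with open(renamed, "wb") as handle:
        handle.write(b"hello worle")
    hash_after_edit = file_sha256(renamed)

print(hash_before == hash_after_rename)  # True
print(hash_before == hash_after_edit)    # False
```

Note that the file name never goes into the hash; only the bytes read from the file do.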

I say ‘nearly’ unique because it is possible for two different files to have identical hashes. This is known as a ‘collision’. The probability of a collision occurring varies depending on the strength of the hash you generate.

As you might have gathered, there are different types of hash you can generate. Some common ones include MD5, CRC32 and SHA-256. Each uses a different algorithm and offers a different ‘strength’ (so each produces a different hash for the same file), but the beauty is that you can use any of a number of different pieces of software to calculate a file hash and it will always be the same for the same file and algorithm.
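As a rough illustration of the point about algorithms, the sketch below hashes the same bytes with the three algorithms named above, using Python's standard `hashlib` and `zlib` modules. The sample text is arbitrary and the variable names are just for this example.

```python
import hashlib
import zlib

data = b"The quick brown fox jumps over the lazy dog"

# Same input, three algorithms, three different results of different lengths.
md5_hex = hashlib.md5(data).hexdigest()        # 128 bits -> 32 hex characters
sha256_hex = hashlib.sha256(data).hexdigest()  # 256 bits -> 64 hex characters
crc32_hex = format(zlib.crc32(data), "08x")    # 32 bits  -> 8 hex characters

print("MD5    ", md5_hex)
print("CRC32  ", crc32_hex)
print("SHA-256", sha256_hex)

# Recalculating with the same algorithm always gives the same answer.
print(hashlib.sha256(data).hexdigest() == sha256_hex)  # True
```

Any other tool that implements the same algorithm should report the same value for the same bytes, which is what makes hashes comparable across different software.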

Where else you’ll find hashes

You’ve probably noticed that some download sites give you a file hash for the file you’re downloading. This is so that you can generate the hash yourself on the file you’ve downloaded (before opening it, of course) and check that your hash matches the hash they’ve given. If the two don’t match, you should not trust the file you’ve downloaded.
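The check described above could be scripted along the following lines. This is only a sketch: the function names, the file name and the ‘published’ hash are invented for the demonstration, and it assumes the site publishes a SHA-256 hash as a hex string.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hash):
    """Compare our own hash with the one published by the site.

    Sites differ in letter case and stray whitespace, so normalise first.
    """
    return sha256_of_file(path) == published_hash.strip().lower()


# Demonstration with a stand-in file; a real check would use the downloaded
# file and the hash copied from the download page.
with tempfile.TemporaryDirectory() as folder:
    downloaded = os.path.join(folder, "installer.bin")
    with open(downloaded, "wb") as handle:
        handle.write(b"pretend installer contents")

    published = hashlib.sha256(b"pretend installer contents").hexdigest().upper()
    genuine = matches_published_hash(downloaded, published)
    tampered = matches_published_hash(downloaded, "0" * 64)

print(genuine)   # True
print(tampered)  # False
```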

Also, hashes are how your passwords are stored by websites (or at least should be!). When you enter your password for the first time, the website generates a hash (ideally with a salt) and stores the resulting hash, rather than your actual password. When you next log in, the website hashes the password you’ve just typed and compares the result to the one stored. This way, they never store your actual password, so if their database is compromised your password should remain safe.
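Here is a minimal sketch of that store-then-compare flow, using PBKDF2 from Python's standard library. It is not how any particular website does it: the function names, the 16-byte salt and the iteration count are all assumptions chosen for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; real systems tune this value


def hash_password(password, salt=None):
    """Return (salt, derived_hash) for a password.

    A fresh random salt is generated when the password is first set, so two
    users with the same password end up with different stored hashes.
    """
    if salt is None:
        salt = os.urandom(16)
    derived = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, derived


def verify_password(password, salt, stored_hash):
    """Re-hash the login attempt with the stored salt and compare."""
    _, candidate = hash_password(password, salt)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(candidate, stored_hash)


# Sign-up: only the salt and the hash are stored, never the password itself.
salt, stored_hash = hash_password("correct horse battery staple")

# Later log-in attempts.
print(verify_password("correct horse battery staple", salt, stored_hash))  # True
print(verify_password("wrong guess", salt, stored_hash))                   # False
```

A deliberately slow function such as PBKDF2 is used here rather than a single pass of SHA-256 because it makes guessing passwords from a stolen database more expensive.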

Collisions

Returning to collisions (and to show that for all practical purposes you can ignore the possibility of one occurring), if you use the SHA-256 algorithm then the probability of a collision if you’re comparing a billion files is about 4.3 × 10⁻⁶⁰, so pretty low then. See here for more information on the SHA-2 family of hashes.
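For anyone who wants to check that figure, it is consistent with the usual ‘birthday’ approximation: the number of distinct pairs of files divided by the number of possible hash values. The sketch below assumes that approximation and an ideal hash whose outputs are spread evenly; the function name is made up for the example.

```python
from fractions import Fraction


def collision_probability(file_count, hash_bits):
    """Approximate chance of at least one collision among file_count files.

    Uses the birthday approximation: pairs / number of possible hashes.
    Only valid while the result is much smaller than 1.
    """
    pairs = file_count * (file_count - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


probability = collision_probability(10**9, 256)  # a billion files, SHA-256
print(f"{float(probability):.1e}")  # 4.3e-60
```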

Software

There is a great application called HashTab by Implbits that can generate file hashes. This simple tool (free for personal use) integrates nicely into Windows Explorer, adding a tab under ‘Properties’ for a file that lets you choose from a few dozen different hash algorithms (find more under Settings by right-clicking on the tab window). It also allows you to compare two files, which can come in handy.

There is plenty of other software out there that will calculate file hashes so find one that suits your needs – a good list is here.

Other modelling uses

So what else can you use hashes for? We’ve actually found a use inside some of our models to identify whether inputs have been changed from one version of a model to another. We’ve built a UDF that will generate a hash for the values in a given range of cells. We can then store that hash in the model and compare it against the currently generated hash for the same inputs to see if they’ve changed. That means that rather than storing all of the input data just to see if it’s changed, we only need to store one value. We’ve even used the UDF to check whether the formulae in a given range of cells have changed, which is useful if you’re distributing template files for others to fill in and want to check that no formulae have been altered.
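The idea can be sketched outside a spreadsheet too. The Python below is not our UDF, just an illustration of the principle under some assumptions: the range is represented as a list of rows, the function name `range_hash` is hypothetical, and SHA-256 is an arbitrary choice of algorithm.

```python
import hashlib


def range_hash(rows):
    """Return one hash summarising every value in a rectangular range.

    rows is a list of rows, each a list of cell values (or formula strings).
    """
    digest = hashlib.sha256()
    for row in rows:
        for cell in row:
            # Include the type so the number 1 and the text "1" differ.
            digest.update(f"{type(cell).__name__}:{cell!r}".encode("utf-8"))
            digest.update(b"\x1f")  # cell separator
        digest.update(b"\x1e")      # row separator
    return digest.hexdigest()


inputs_v1 = [[100, 0.05, "GBP"], [250, 0.07, "USD"]]
stored_hash = range_hash(inputs_v1)  # the single value kept in the model

inputs_v2 = [[100, 0.05, "GBP"], [250, 0.08, "USD"]]  # one input changed

print(range_hash(inputs_v1) == stored_hash)  # True
print(range_hash(inputs_v2) == stored_hash)  # False
```

The separators matter: without them, a range holding "ab" and "c" would hash the same as one holding "a" and "bc". Passing formula text instead of values gives the same kind of check for formulae.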

Numeritas

If you’d like to know more about hashes and how they could help you with your models then please feel free to drop us a line and we’d be happy to discuss them with you.