What is a hash? How do I decide which hashing algorithm to use? Where can I find 2500 words to read about password hashing?
These are all questions I’ve received while working as a security engineer, most often from software developers. There are tons of resources out there that resolve each of these questions, but I wanted to write up my own resource that I could refer back to or refer others to when this question inevitably comes up again.
Hashing is a one-way function that transforms an input value into an output value deterministically. That means that whenever your hashing function receives a particular input, it will always produce the same output. The output values are often fixed-length, meaning you could take the entire contents of Herman Melville’s Moby Dick, pass it through a hashing function, and end up with, say, 256 bits of data. But you could also take a short message, such as “Don’t forget to like and subscribe” and still end up with just 256 bits of data.
These three properties, one-way, deterministic, and fixed-length, make hashing a useful tool for security.
Perhaps one of the most prevalent places this happens is with passwords. Your password is used to log you into a website. Once upon a time both you and the website would know your password, and when you login the website would check against the password it knows, and if they matched, it would log you in. But this presented a big problem – when websites would get hacked, your password was stored in plain text next to your username or email. Now three parties know your password – you, the website, and the attacker. The attacker can login as you and do whatever you can do.
Then along came hashing, and developers were advised to hash passwords so that the passwords couldn’t be recovered in the event of a data breach. Just pass the user’s password into this magic function and we will always spit out a value that is the same length and that can’t be reversed to reveal the password. When the user logs in, hash their password and then compare with the hashed version from the database. This had the added bonus that now you would always know exactly how long your
password field should be in the database – whatever the output size of your hash function is.
;? These characters used to be responsible for tons of SQL injections through a combination of not hashing passwords at the application level as well as not using parameterized SQL statements. These days there are still some reasons to reject these characters, but they don’t necessarily mean that your passwords are stored in plaintext.
One additonal property of good hashes for security purposes is that a small change in the input can result in a dramatic difference in the output. For example, if we were hashing the word “password” with a really awful hashing function that I’m making up on the spot, the output might look like “pswr”. (Note: This function is one-way and deterministic, but not fixed-length – but it helps make the point) Using that same awful hashing function, we hash the word “passwords” and we get “pswrs”. Notice how similar this value is to the hash of “password”? Ignoring the fact that my awful hashing algorithm is just “remove every other letter”, we can see that similar inputs get similar outputs. By looking at just the outputs of the two hashes, and knowing how my hashing algorithm works, you can determine that the input values must be quite similar. Now see what happens when we use those same two inputs for the
$ echo "password" | sha256sum 6b3a55e0261b0304143f805a24924d0c1c44524821305f31d9277843b8a10f4e - $ echo "passwords" | sha256sum c1f1f36d9be6f64a85698d9db973fd07eb86641c67ef91260120a628084dd632 -
Notice how the whole string looks completely different? Now, even having both of these hashes, I don’t really know anything about how they might relate to one another.
There are all sorts of hashing functions, with varying properties that make them stronger or weaker than other hashing functions. Cryptographers are always analyzing hashing algorithms to find flaws in them that could weaken them. Sometimes it doesn’t even take any flaw to consider a hash weak. The rapid growth of computing power means that functions that used to be slow to perform, maybe only dozens or hundreds of executions per second, are now able to be performed millions of times per second. What does this mean for the strength of a hashing function? To answer that, we’ll need to look at how hashes are attacked.
Remember how I said one of the properties of hashing that made it appealing to security was that it was one-way? Well, that’s true, but by itself it doesn’t necessarily mean that a hash’s input can’t be figured out. Continuing to use passwords as our example, think about what else has historically needed to be true for passwords – the people creating them needed to be able to remember them and type them in. That means that passwords were often words, short phrases, or short strings of random characters. Whatever the chosen password is, a human probably created it, and there are only so many things a human is likely to create.
An attacker can lean into this and combine the deterministic nature of hashing functions in order to generate tons of hashes for known inputs. It used to be especially common to pre-generate all of these possible hashes based on all the inputs you wanted to guess and then store them for later in what is known as a Rainbow Table. Then, when you get a hash, you can just look it up in your table, rather than try to compute all the possible inputs that might produce that output on-demand.
Rainbow tables are a less common attack vector today for a variety of reasons: improved hashing algorithms made precomputing results more difficult, GPU-based cracking got so fast that they could search a very similar amount of hashes per second as performing a lookup in a rainbow table, as well as a trend towards optimizing hash cracking efforts towards human generated passwords. One early attempt at making rainbow tables non-viable was the introduction of salting.
Salting a Hash
Salting a hash refers to generating a unique random string of characters that gets concatenated to the input value, in our case a password, before being fed into the hash function. Now you can have a series of randomly computer generated characters that combine with the user’s provided password in order to create a harder to crack password. A good password salt will have a large keyspace and be randomly generated per user. A less optimal password salt might be something like using the user’s username or email as the salt. A good password salt is helpful in preventing rainbow table attacks because values aren’t likely to be precomputed in the rainbow table for easy retrieval.
The salt that is used with a password hash is typically stored in plaintext along side the hash. That way you can lookup the user, concatenate their salt to the provided password, and compare against the stored hash value. The salt by itself is not useful to an attacker, and is usually a completely randomly generated string, so storing it in plaintext along side the hash isn’t a big deal.
These salts also help protect against weak or commonly used passwords, since the same input password, e.g. “password”, will have a different salt for every user, and so every user’s hash would have to be cracked separately – cracking just one instance of a “password” password won’t immediately crack all instances of “password” passwords.
Peppering a Hash
Peppering a hash is a similar concept to salting a hash, in fact NIST calls it “secret salting”. It’s pretty much functionally equivalent to salting, in that its typically a random value that is concatenated to a password before being hashed. But the key difference is that a salt is usually stored alongside the hash, whereas the pepper must be stored separately – such as in a Hardware Security Module.
There are additional sub-types of peppering, including a shared-secret pepper and a per-user pepper. The shared-secret pepper is used by every user of a particular application, which means the developer only needs to keep this one value secret in the HSM (or other storage). The per-user pepper, however, requires a way to not only store the secret value for each user, but also a way to quickly identify which pepper belongs to which user so that hashing can commence. Both of these options can significantly increase the cost of attacking password hashes, as the pepper must also be guessed. But if a hashing algorithm is fast enough that it can be computed millions of times per second on a single desktop computer, then even relying on a secret pepper value might bring little comfort.
I personally haven’t seen a whole lot of use of peppering these days, as modern hashing algorithms have taken different approaches to making hashes harder to crack, such as compute-hard functions and memory-hard functions.
Compute-hard hashing is an evolution of the hashing function that includes a sliding computation cost. Perhaps the most popular example of this is Password Based Key Derivation Function 2 (PBKDF2). PBKDF2 introduces a psuedorandom function, most commonly Hash-based Message Authentiation Code (HMAC), that gets applied to the input repeatedly in order to derive the output that is used as the key. A PBKDF2 hash contains within the stored value a number of iterations to use, and the process must be repeated that many times on a given input in order to get the desired output.
PBKDF2 also maintains the concept of salting. This means that the inputs into a PBKDF2 hashing function will include:
- The psuedorandom function to use
- The user’s provided password
- The per-user randomly generated salt
- The number of iterations to run the function
A neat thing about typical PBKDF2 implementations is that the number of iterations can be scaled over time because the number of iterations is typically stored alongside the hash. So even if you’re using a fast algorithm, by using a high number of iterations you are able to effectively combat the increased compute speed available to attackers. OWASP recommends 310,000 iterations of PBKDF2-HMAC-SHA2561, for instance.
Another example of compute-hard hashing is bcrypt, which was first published in 1999. It uses an iteration count mechanism as well in order to make the hashing function slower, thereby increasing the computation power required to crack hashes. Technically speaking, bcrypt is actually cache-hard, it works in a way that requires a bunch of data to be moved in and out of L1 cache, which is what makes it computationally expensive. But for simplicity sake, we can just remember that bcrypt takes longer based on how many iterations it has.
But what if compute-hard hashing isn’t enough? I mean, compute power is cheap and compute-hard hashes can be highly parallelized across machines in order to significantly cut down on wall-time to crack hashes (the amount of actual time that passes, rather than cpu-time, which is the amount of time spent by a computer, summed across parallel executions).
Memory-hard hashing is similar to compute-hard hashing in that it’s goal is to create hashes that can be configured with a scalable level of resistance to attacks. These hashes introduce a minimum amount of memory required to compute the hash, which helps to reduce how easy it is to parallelize cracking the hashes. Example hashes in this category include Argon2id and scrypt, which both allow configuring tradeoffs between the amount of CPU and RAM usage required to compute a hash.
Identifying a Hash
Okay so now we’ve talked about all of these different properties of hashing, some attack types against hashes, and what qualities make a good hash. What do they look like, though? Some hashes identify themselves in the output value, such as md5crypt using
$1$ as a prefix. Other hashes, though, do not make any attempt to make themselves identifiable, such as the sha256 examples used above.
Whenever I have a hash and I want to know what kind of hash it is, I typically head directly to the Hashcat Example Hashes wiki page. It’s a page designed for helping debug hashcat3 commands against known example hashes, but that also makes it perfect for trying to understand a hash that you might be looking at.
For example, let’s look at an example Django password hash, which defaults to a PBKDF2 based hash, as discussed in the compute-hard hashing section. Django helpfully provides documentation on how it stores password hashes, the default list of password hashes, and how to use Argon2 in Django, all in it’s documentation on a page titled Password management in Django.
Those are the components used for storing a User’s password, separated by the dollar-sign character and consist of: the hashing algorithm, the number of algorithm iterations (work factor), the random salt, and the resulting password hash.
This tells us that the
pbkdf2_sha256 hasher was used for this hash, with a work factor of
20,000 iterations, a salt of
H0dPx8NeajVu, and the resulting password hash is
GiC4k5kqbbR9qWBlsRgDywNqC2vd9kqfk7zdorEnNas=. Note that this example hash, from the Hashcat wiki, uses only 20,000 iterations, which is 290,000 iterations less than OWASP recommends. Django currently, at time of writing, defaults to 480,000 iterations4 – above and beyond the recommended OWASP value.
Hashing vs Encrypting
Hold up there, Joey. It’s a common mixup to say that you’re going to encrypt passwords when what you really mean is you’re going to hash passwords. You see, there’s a very important difference between encryption and hashing. Besides the algorithms involved being different, the main thing you should know about the difference between encryption and hashing is that encryption is two-way. Remember from earlier, we defined hashing as being a one-way function. If you encrypt something, the expectation is that there exists a method to undo the encryption, a.k.a decrypt it, and get the original value back, typically through the possession of some sort of key. But if you hash something, there isn’t a method to de-hash the value – you must know (or be able to guess) the original value, hash that, and then compare in order to determine if you got the right value.
This one-way property of hashing is what makes hashing passwords much safer than encrypting them, and it’s important to not confuse the two concepts.
I hope this has been a helpful exploration on the properties of hashing, types of attacks against hashes, and which hashes you should use. A quick recap:
- Hashing is a one-way, deterministic function that produces a fixed-length output.
- Salting and/or peppering can help prevent rainbow table attacks.
- Using compute-hard or memory-hard hashing is the best way to protect the secrets hidden behind the hash.
- Use Argon2 if available, otherwise scrypt, bcrypt, or PBKDF2 are viable options.
- Encryption is two-way, which makes it distinctly different than hashing
Additional thanks goes to Jeremi Gosney for reviewing this post and providing additional insights into bcrypt, the fall of rainbow tables, and pointing out that banned characters from passwords doesn’t necessarily mean that passwords are being stored in plaintext. This post was written for a fairly beginner friendly understanding of hashing, but if you want to really understand it, Jeremi is the guy to talk to.
Hashcat is a popular open source hash cracking utility. ↩︎