Hashing and Cryptography
When starting with Cryptography we need to understand some of the terminology first.
Plaintext:
Even though it is called “plaintext”, it doesn’t have to be raw text. Any data that hasn’t been encrypted or hashed is called plaintext. This applies to text as well as to files or pictures.
Encoding:
Note: Encoding is not a form of encryption! It is just a form of data representation that anyone can reverse at any time, because the schemes for encoding and decoding are publicly known and need no key. Examples of encoding are hexadecimal representation, URL encoding and Base64 (I’ve done an article about Base64 here). By definition, encoding does not provide additional security. It is obfuscation at best.
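To make the "reversible at any point" part concrete, here is a small sketch in Python. The sample string is made up for illustration; the calls are from the standard library. It encodes the same bytes in the three ways mentioned above and decodes them again without any key.

```python
import base64
from urllib.parse import quote, unquote

# Made-up sample data, for illustration only.
data = "user=alice&pw=s3cret!".encode("utf-8")

hex_form = data.hex()                               # hexadecimal representation
b64_form = base64.b64encode(data).decode("ascii")   # Base64
url_form = quote(data.decode("utf-8"), safe="")     # URL encoding

print(hex_form)
print(b64_form)
print(url_form)

# Every encoding can be reverted by anyone, no secret required.
print(bytes.fromhex(hex_form) == data)
print(base64.b64decode(b64_form) == data)
print(unquote(url_form).encode("utf-8") == data)
```

Whoever sees the encoded value can get the original back, which is exactly why encoding offers no protection.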
Hash:
A hash is the output of a hash function. Hash functions map data of arbitrary size to values of a fixed size. A good hash function makes overlapping outputs extremely unlikely, and even a small deviation in the original data results in a large change in the output. Hashes can be reproduced (the same input always gives the same hash), but, again if it is a good function, they cannot be reverted. It is meant to be impossible (or at least very hard) to reconstruct the input if you only have the output available. The initial data stays hidden. That’s why hashes are commonly used to store passwords. (Attention: There are some pitfalls as well! For example: the hashes of the most common passwords are well known, and if a password is not strong enough, it can still be bruteforced!) The time the hashing function needs to generate a hash is a fundamental part of the security of password hashes. The faster the function is, the better brute-force attacks work.
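The output of a hash function normally consists of raw bytes, which are then encoded for display or storage. Common encodings for this use case are Base64 or hexadecimal. Decoding such an encoded hash doesn’t give us anything useful; it only returns the raw bytes of the hash, not the original input.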
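A short sketch of these properties, using sha256 from Python's hashlib as an example function (my choice for illustration, any modern hash function would do). It hashes two inputs that differ in a single character, counts how many output bits change, and prints the raw bytes in both common encodings.

```python
import base64
import hashlib

def sha256_bytes(text: str) -> bytes:
    """Return the raw 32-byte SHA-256 digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).digest()

a = sha256_bytes("evenspace")
b = sha256_bytes("evenspacf")  # only the last character differs

# Count the bits that differ between the two digests.
diff_bits = sum(bin(x ^ y).count("1") for x, y in zip(a, b))

print(len(a), "raw bytes")
print("hex:   ", a.hex())
print("base64:", base64.b64encode(a).decode("ascii"))
print(diff_bits, "of", len(a) * 8, "bits differ after a one-character change")

# Reproducible: hashing the same input again gives the same digest.
print(sha256_bytes("evenspace") == a)
```

On average roughly half of the output bits flip, even though the inputs are almost identical.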
Hash collisions occur when two different inputs give the same output. Hash functions are designed to avoid this as best as they can, but due to the pigeonhole principle, collisions are unavoidable. The pigeonhole principle: as hashes have a fixed size (e.g. MD5 hashes consist of 16 bytes / 128 bits), there is only a limited number of different hashes (2^128 for MD5). And as the input size is unlimited, some inputs have to share the same hash. Think of it as pigeons in a dovecote. We have 128 pigeons and only 100 pigeonholes. When all pigeons fly home, some of the holes have to be used by more than one pigeon. It’s the same with hashes.
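With a real 128-bit or 256-bit hash you will not stumble over a collision by accident, but the principle is easy to see once we shrink the output. The sketch below is a toy, not a real hash function: it keeps only the first byte of a sha256 digest, so there are just 256 possible "holes". Feeding in 257 different inputs must then produce at least one collision. The name tiny_hash and the input strings are made up for this illustration.

```python
import hashlib

def tiny_hash(text: str) -> bytes:
    """Toy hash: first byte of SHA-256, so only 256 possible outputs."""
    return hashlib.sha256(text.encode("utf-8")).digest()[:1]

def find_collision():
    """Try 257 inputs against 256 possible outputs; one must repeat."""
    seen = {}
    for i in range(257):
        candidate = f"input-{i}"
        h = tiny_hash(candidate)
        if h in seen:
            return seen[h], candidate, h
        seen[h] = candidate
    # Cannot happen: 257 pigeons do not fit into 256 holes one by one.
    raise AssertionError("pigeonhole principle violated")

first, second, shared = find_collision()
print(f"{first!r} and {second!r} share the toy hash {shared.hex()}")
```

In practice the first collision shows up long before input number 257, which hints at why very short hashes are unsafe.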
Brute Force:
Bruteforcing is the process of attacking a cryptographic value by simply trying/guessing different passwords until one fits. There are huge password lists out there that are used in those attacks. Brute-force attacks are often limited only by the computing power / bandwidth available for trying those passwords. (Similar approach: rainbow tables. These are precomputed tables that store the hashes together with the plaintext passwords. As we don’t need to apply the hash function to every candidate password at attack time (as in pure bruteforce), rainbow-table attacks are much faster, but they require more storage space.)
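Both approaches can be sketched in a few lines. Assumptions: the tiny wordlist is made up, the target is an unsalted sha256 hash, and the "table" is a plain precomputed dictionary. Real rainbow tables use a more compact chain structure, so treat the second function as a simplified stand-in for the idea of trading storage for time.

```python
import hashlib

def sha256_hex(password: str) -> str:
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

# Made-up mini wordlist; real lists contain millions of entries.
WORDLIST = ["123456", "password", "qwerty", "letmein", "evenspace"]

def dictionary_attack(target_hash, candidates):
    """Hash every candidate at attack time and compare with the target."""
    for candidate in candidates:
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None

def build_lookup_table(candidates):
    """Precompute hash -> password once; later lookups need no hashing."""
    return {sha256_hex(c): c for c in candidates}

target = sha256_hex("letmein")  # pretend this hash was captured

print(dictionary_attack(target, WORDLIST))

table = build_lookup_table(WORDLIST)
print(table.get(target))
```

The first function pays the hashing cost on every attack, the second pays it once up front and needs memory for the table instead.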
Cryptanalysis:
This is the process of attacking the cryptography by finding and targeting weaknesses in the underlying maths. This requires deep knowledge of the hashing algorithm…
WHY DO WE USE HASHES AND HASH-FUNCTIONS?
There are two main reasons:
1. Workaround for storing passwords safely
Nearly every user-based application requires password authentication for granting access to its services. To realize the authentication we heavily rely on the combination of usernames and passwords. While the username is often publicly visible, the password should be kept hidden to avoid the theft of sensitive data such as personal, banking or credit-card information. To realize such an authentication system, we need to be able to compare the user-provided password to the password we have stored in our database. While storing these passwords in an encrypted state or even in plain text is possible, this should be avoided at all costs. Storing the passwords in any recoverable form bears an incalculable risk of data theft or unauthorized data access. And this in turn poses a risk for all users of the application.
Yes, you can, in fact, encrypt those passwords, but whoever gets hold of the encryption method and the key can decrypt the stolen data as well. The second risk is data theft while the stored password is decrypted for comparison with the presented authentication credentials.
This is where hash functions come into play. Since hash functions are mathematical one-way functions, there is no practical way to infer the provided password from the end product, the hash. So we do not store the password as such, but only the hash value related to it. At login we hash the presented password and compare the result with the stored hash. This procedure is even more secure if we add random strings to the user-selected password at fixed positions (e.g. at the beginning or at the end of the password) before hashing. In this way, we also reduce the risk that someone who somehow manages to capture our hashes can identify the source password using so-called rainbow tables. (Rainbow tables are pre-calculated tables of password hashes that can be matched against other hashes in a very short time.) Since users tend to reuse passwords, and those passwords are most likely neither long nor random, adding such strings is an additional security gain. This practice is called salting or peppering a password. A salt is generated per user and can be stored in plain text next to the hash, since it does not give any hint to the password due to the one-way function; at least not if the hashing algorithm has been implemented correctly. A pepper, in contrast, is a secret value shared by all passwords of the application and has to be kept outside the password database. NOTE: It is necessary that every user has a unique, random salt. Otherwise identical passwords would still result in identical hashes, potentially weakening our password security.
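Here is a minimal sketch of salted password hashing. It is an illustration under assumptions, not a production recipe: I picked PBKDF2 from Python's hashlib as a deliberately slow function, and the iteration count, the salt length and the function names hash_password / verify_password are my own choices for this example.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 100_000  # assumed work factor; higher means slower guessing

def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest). A fresh random salt is created per user."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the presented password with the stored salt and compare."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)

# Two users pick the same password but get different salts and hashes.
salt_a, hash_a = hash_password("evenspace")
salt_b, hash_b = hash_password("evenspace")
print(hash_a != hash_b)

print(verify_password("evenspace", salt_a, hash_a))
print(verify_password("wrong-guess", salt_a, hash_a))
```

Only the salt and the digest are stored; the password itself never touches the database.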
In short: you can’t decrypt password hashes, as they are NOT encrypted. There is no way of simply reverting those functions if they’re implemented in the right way. Your only chance to crack the password is to hash a large number of potential passwords, adding the salt if there is one, and compare each result with the target hash. Once one matches, you know which password has been used.
It follows that any password can still be cracked if you have either infinite time or infinite computing power.
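To get a feeling for "infinite time", a bit of arithmetic helps. The numbers below are assumptions for illustration only: an alphabet of 62 characters (a-z, A-Z, 0-9) and a hypothetical attacker who can test one billion hashes per second. The worst case is simply the number of possible passwords divided by the guessing rate.

```python
ALPHABET_SIZE = 62            # a-z, A-Z, 0-9 (assumption)
ASSUMED_RATE = 1_000_000_000  # guesses per second (hypothetical attacker)

def worst_case_seconds(length, alphabet_size=ALPHABET_SIZE, rate=ASSUMED_RATE):
    """Time to try every password of exactly the given length."""
    return alphabet_size ** length / rate

for length in (6, 8, 10, 12):
    seconds = worst_case_seconds(length)
    print(f"{length} characters: {seconds / 86_400:,.2f} days")
```

Every additional character multiplies the effort by 62, and a slower hash function divides the attacker's rate, which is why both password length and hashing time matter.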
Examples of Hashes
Sometimes it is important to be able to recognize the most important hash types out there. And there is a lot of information about that topic on the web. For now, I’ll just recommend that you have a short look at this link: https://hashcat.net/wiki/doku.php?id=example_hashes
It’s a brief overview of the most important hashes out there. If you look carefully, you’ll notice that some of the hashes have a unique prefix. Those hashes are quite easy to recognize. Other hashes can only be identified by the length of the hash and the context in which you found them. For example, sha256 hashing functions produce 256-bit hashes. Those are usually represented as hexadecimal numbers of 64 digits.
sha512 hashing functions produce hashes with a fixed length of 512 bits instead, resulting in 128 hexadecimal digits.
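The length rule can be turned into a small helper. This is only a rough sketch: the table is deliberately short and not exhaustive, and several unrelated algorithms share the same output length, so the result is a list of candidates and never a definite answer. The names HEX_LENGTH_TO_CANDIDATES and candidates_by_length are made up for this example.

```python
import hashlib

# Hex digits = bits / 4. Not exhaustive; other algorithms share these lengths.
HEX_LENGTH_TO_CANDIDATES = {
    32: ["md5"],      # 128 bits
    40: ["sha1"],     # 160 bits
    64: ["sha256"],   # 256 bits
    128: ["sha512"],  # 512 bits
}

def candidates_by_length(hex_digest: str):
    """Guess possible hash types from the number of hex digits."""
    return HEX_LENGTH_TO_CANDIDATES.get(len(hex_digest.strip()), [])

sample_256 = hashlib.sha256(b"evenspace").hexdigest()
sample_512 = hashlib.sha512(b"evenspace").hexdigest()

print(len(sample_256), candidates_by_length(sample_256))
print(len(sample_512), candidates_by_length(sample_512))
```

The context in which you found the hash is still needed to narrow the candidates down.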
Unix-style password hashes are very easy to recognize, as they have a prefix. The prefix tells you the hashing algorithm used to generate the hash. The standard format (the rounds field is optional and often omitted) is
$format$rounds$salt$hash
For example, many Linux systems nowadays use sha512crypt for generating hashes. Those come with the prefix $6$. The following hash, for example, is the result of using the sha512crypt function on the password “evenspace”:
$6$rtm7MAMutUh4wIkk$PRI9jvL8/m5Dl4C8RhVbpsZNZfcfcEXbfJ3dlDhvcRklE/h1rOm1fnF2V8iF2FkgajxoKjyx6O0eReHYyvb8S1
If you don’t know what kind of hash you are looking at, you can try automated hash-identification tools. One example is: https://pypi.org/project/hashID/ Those tools work great with prefixed hashes but are far from perfect! For hashes without prefixes they can be quite unreliable and return wrong hash types.
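For prefixed Unix-style hashes, the identification can be done by hand. The sketch below splits such a string into its fields. It is not how hashID works internally; it only handles the three prefixes $1$, $5$ and $6$ (md5crypt, sha256crypt, sha512crypt), and the names CryptHash and parse_crypt_hash are made up for this example. The optional rounds field is written as rounds=N when present.

```python
from typing import NamedTuple, Optional

KNOWN_PREFIXES = {"1": "md5crypt", "5": "sha256crypt", "6": "sha512crypt"}

class CryptHash(NamedTuple):
    algorithm: str
    rounds: Optional[int]
    salt: str
    digest: str

def parse_crypt_hash(value: str) -> CryptHash:
    """Split a $format$[rounds=N$]salt$hash string into its fields."""
    parts = value.strip().split("$")
    # A leading "$" yields an empty first element.
    if len(parts) < 4 or parts[0] != "":
        raise ValueError("not a $-prefixed crypt hash")
    algorithm = KNOWN_PREFIXES.get(parts[1], "unknown")
    fields = parts[2:]
    rounds = None
    if fields[0].startswith("rounds="):
        rounds = int(fields[0][len("rounds="):])
        fields = fields[1:]
    if len(fields) != 2:
        raise ValueError("unexpected number of fields")
    salt, digest = fields
    return CryptHash(algorithm, rounds, salt, digest)

example = (
    "$6$rtm7MAMutUh4wIkk$PRI9jvL8/m5Dl4C8RhVbpsZNZfcfcEXbfJ3dlDhvcRklE/"
    "h1rOm1fnF2V8iF2FkgajxoKjyx6O0eReHYyvb8S1"
)
parsed = parse_crypt_hash(example)
print(parsed.algorithm, parsed.rounds, parsed.salt)
```

For the example hash from above this yields the algorithm name for the $6$ prefix, no explicit rounds value, and the salt that sits between the second and third dollar sign.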
2. Integrity Check of Data
As hashing functions produce hashes from arbitrary data sets, and those hashes are practically unique, they can also be used to find out whether data sets are identical or duplicated.
If two hashes are identical, you can assume with a very high probability that the underlying data is also identical. This is especially useful when downloading files from the Internet. If the hash published by the providing party matches the hash we generate ourselves after the download, the file has not been changed on its way to us, provided the published hash itself comes from a trustworthy source.
The integrity of the data is thus verified.
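A download check like this takes only a few lines. The sketch assumes the provider publishes a sha256 hash as a hex string; the file name, the payload and the function names are made up, and a temporary file stands in for the real download.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads need not fit into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_published_hash(path: str, published_hex: str) -> bool:
    """Compare our own hash of the file with the published one."""
    return hmac.compare_digest(sha256_of_file(path), published_hex.strip().lower())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "download.bin")  # stand-in for a downloaded file
    with open(path, "wb") as f:
        f.write(b"example payload")
    published = hashlib.sha256(b"example payload").hexdigest()

    print(matches_published_hash(path, published))  # unmodified file

    with open(path, "ab") as f:
        f.write(b"!")  # simulate tampering
    print(matches_published_hash(path, published))  # modified file
```

A single changed byte is enough to make the comparison fail.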
That’s everything for now. 🙂