Hash Functions

October 18, 2022

Hash functions

Given a set of data (e.g., an employee’s name, address, birthday etc.) we would like to assign a “unique” identifier to the data set in a fast, algorithmic way. This identifier is called a hash or a message digest of the given set of data.

Applications:

  • Organizing data
  • Checking for data integrity
  • Finding duplicate records
  • Password checking
  • Digital signatures
  • Pseudorandom bit string generation
  • Generating a key (e.g. AES) from a passphrase

Hash Functions

A hash function \(h:M \longrightarrow D\) is a function that inputs a message of arbitrary length and outputs a message digest of fixed length.

Example: Checking data integrity

library(openssl)
## Linking to: OpenSSL 3.0.2 15 Mar 2022
bg1 <- "Take me out to the ball game, Take me out with the crowd; Buy me some peanuts and 
        Cracker Jack, I don't care if I never get back. Let me root, root, root for the 
        home team, If they don't win, it's a shame. For it's one, two, three strikes, 
        you're out, At the old ball game."
bg2 <- "Take me out to the ball game, Take me out to the crowds;  Buy me some peanuts and 
        Cracker Jack, I don't care if I never get back. Let me root, root, root for the 
        home team, If they don't win, it's a shame. For it's one, two, three strikes, 
        you're out, At the old ball game."
md5(c(bg1,bg2))
## [1] "80ecebc0ae83f1efb1536a242aeebde9" "1265b26773cc1f4f430312356bd6e1a2"

Example: Password verification

A system stores hashed passwords:

passwords <- c("qwerty", "mypass", "admin", "Runner4567")
sha256(passwords) 
## [1] "65e84be33532fb784c48129675f9eff3a682b27168c0ea744b2cf58ee02337c5"
## [2] "ea71c25a7a602246b4c39824b855678894a96f43bb9b71319c39700a1e045222"
## [3] "8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918"
## [4] "54d712cf917f585cc5314e7d99b5e2fe7ac72341727e477da0c2b2d021ea5468"

When a user enters a password \(m\), the hash \(h(m)\) is compared to the stored values.

Properties of Cryptographic Hash Functions

Cryptographic hash functions should satisfy the following properties:

  • Efficiency. Computing \(h(m)\) is fast.
  • One-way. Given a digest \(d\), it is practically impossible to find a message \(m\) such that \(h(m)=d\).
  • Collision-free. It is practically impossible to find messages \(m_1\) and \(m_2\) such that \(h(m_1)=h(m_2)\).
  • Diffusion. Changing one bit of \(m\) completely changes \(h(m)\).

Examples

Discuss. For each possible hash function \(h\), which properties (efficiency, one-way, collision-free, diffusion) does \(h\) have?

  1. \(h(m) = m \bmod p\), where \(m > p\).
  2. \(h(m) = \alpha^m \bmod p\), where \(\alpha\) is a primitive root modulo \(p\).
  3. \(h(m) =\) the first block of the AES ciphertext of \(m\) using a public key.

Common Hash Functions

Function Year Digest size Secure?
MD5 1991 128 bit no (2004-2005)
SHA-1 1993 160 bit no (2005)
SHA-2 2001 224, 256, 384, or 512 bit yes
SHA-3 2015 224, 256, 384, or 512 bit yes


For example, openssl defaults to sha256, which is SHA-2. Check out the pseudocode.

Attacking Hash Functions

Dictionary attacks

Suppose a set \(P\) of passwords are stored as hashes: \(\{h(p) \mid p \in P\}\).

An attacker can create a list \(L = \{h(d) \mid d \in \mathcal{D}\}\), where \(\mathcal{D}\) is a dictionary list of common passwords.

A match between \(L\) and \(P\) will reveal a password.

passwords <- c("qwerty", "mypass", "admin", "Runner4567")
sha256(passwords) 
## [1] "65e84be33532fb784c48129675f9eff3a682b27168c0ea744b2cf58ee02337c5"
## [2] "ea71c25a7a602246b4c39824b855678894a96f43bb9b71319c39700a1e045222"
## [3] "8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918"
## [4] "54d712cf917f585cc5314e7d99b5e2fe7ac72341727e477da0c2b2d021ea5468"

Salt

Password systems sometimes incorporate a random string, called salt, into a password before it is hashed.

nyoung:$6$WQ0w1hcR$7wO.W0nyjzcakz42KgoPhfgdP8NWOZXSSTZzxEhPBTCiXAKcJ1kHv4VgTMKyx
U6Y7XNXaBmkPtlsLp1q7VPVC1:18212:0:99999:7:::
rfrink:$6$T3ortZMG$HOC3KWnrWHGvjujYnH42ZrfhdDVBbbfXLYFFG80rG9kvnJynEY8O/V37l.1T/
kcitXgkBFobngRxFYlhmXEL5.:18166:0:99999:7:::
roroku:$6$YTcauXgg$CFgWgR/oXj3NhFV56oqosnQAxrBto5Ho640uzTmcAR0OnRD.qkY4O6zbjVP
NqIoguUuLCK4Y8WW.PDmPWPK0:18227:0:99999:7:::
samundson:$6$CllynVdl$9N76VLgzfR1XsrqwSscksfgMOWs59oSW67ai4g3oHu29lnXe2pwk4g9Wvmp
A6XMHH8W0Sq4togY1I6UQKgeBMV0:18166:0:99999:7:::

Question: If \(B\) is the number of bits of salt, how does salting affect the time needed for a dictionary attack?

Pepper

  • Similar to salt
  • Pepper is random and secret, while salt is just random.
  • Store pepper somewhere else; salt gets stored with the hashed passwords.

Collisions

  • If Eve can produce a collision, she can do something nefarious, like forge a digital signature/certificate.
  • So an attack on a hash function involves finding a collision.
  • No collisions have ever been found for the best hash functions.
    • If a collision is found, the hash function is considered broken.

Random Oracle

An ideal hash function \(h: M \longrightarrow D\) is a random oracle.

  • The oracle answers the question \(m\in M\) with a randomly selected \(h(m) \in D\) (selected with replacement).
  • If the same question \(m\) is asked a second time, the same answer \(h(m)\) is given.

To calculate theoretical probabilities, we often assume we have a random oracle hash function.

The Birthday Paradox

What is the probability that two people in a group have the same birthday?

  • Write down your birthday (day/month) and the birthday of one family member.
  • Did we obtain a collision?

Group Exercise

For simplicity, consider a “random birthday” to be a number chosen at random from the set \(\{1,2,3,\ldots, 365\}\)

  1. What is the probability that two random birthdays are the same?

  2. What is the probability that two random birthdays are different?

  3. Suppose you have already drawn two random birthdays, and the birthdays are different. What is the probability that a third random birthday will be different from the previous two?

  4. What is the probability that three random birthdays will all be different?

  5. Give a formula for the probability that \(r\) random birthdays will all be different.

The Birthday Theorem

Birthday Theorem. Suppose \(h:M \longrightarrow D\) is a hash function, where there are \(N\) possible hashes. Assuming that \(h\) is a random oracle, the probability of a collision when hashing \(r\) messages is \[ 1-\left(\frac{N-1}{N}\right)\left(\frac{N-2}{N}\right)\cdots\left(\frac{N-r+1}{N}\right) \] This probability is approximately \(1-e^{-r^2/2N}\).

1-prod(1-(1:30)/365)
## [1] 0.7304546
1-exp(-30^2/(2*365))
## [1] 0.7085471

Birthday Attacks

Choose a value for \(r\), and look for collisions.

How big must \(r\) be to have a “high” probability \(\gamma\) of finding a collision? Solve (asymptotically):

\[ 1-e^{-r^2/(2N)} = \gamma \]

\(r \approx \sqrt{N}\), asymptotically. (Can solve exactly, given \(\gamma\).)

“Number of bits of security” \(\approx \frac{1}{2} \times \text{digest size}\)

Programming a birthday attack

Programming problem: Write a birthday attack on a semi-weak hash function miniSHA.

  • Try hashing a bunch of strings, and find a collision (i.e., a duplicated hash for two different strings).
  • Advice: Check out R’s built-in duplicated function. Read the help menu to see how it works.
  • Your hashes will be strings, so duplicated will work just fine.

How duplicated works

The function duplicated “determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.”

piDigits <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)
dups <- duplicated(piDigits)
dups
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
which(dups)
## [1]  4  9 10 11
piDigits[which(dups)]
## [1] 1 5 3 5

Other Applications

Application: Using hash functions to encrypt

Idea: create a random bitstring using hash functions, a key, and some sort of feedback. Encrypt by XORing (as with certain block cipher modes).

Application: Merkle Tree data structure

Problem: verify the integrity of \(n\) independent blocks \(B_1, B_2, \ldots, B_n\). (e.g., peer-to-peer data)

  • \(h(B_1), h(B_2), \ldots, h(B_n)\) form the leaves of the tree.
  • Each parent node is formed by concatenating the children and hashing.
  • Only the top hash needs to be checked to verify integrity of all blocks.

Application: Block chaining

Examples: Git, Bitcoin

  • Each new block contains a hash of the previous block.
  • Difficult to go back and change an old block, because the hashes won’t check.

Application: Bitcoin Proof of Work

Problem: A new block needs to be added to the Bitcoin blockchain. Who decides?

  • Miners race to find a nonce, such that, when appended to the new block, the hash begins with a certain number of zeroes.
  • Anybody can check that the nonce works, to verify the integrity of the new block.
  • The first one to find a nonce that works adds the new block to the blockchain and gets rewarded. Everybody is going to accept the new blockchain because it is the longest.

Reference: https://bitcoin.org/bitcoin.pdf