Given a set of data (e.g., an employee’s name, address, and birthday), we would like to assign a “unique” identifier to it in a fast, algorithmic way. This identifier is called a hash or message digest of the data.
Applications:
A hash function \(h:M \longrightarrow D\) is a function that inputs a message of arbitrary length and outputs a message digest of fixed length.
## Linking to: OpenSSL 3.0.2 15 Mar 2022
bg1 <- "Take me out to the ball game, Take me out with the crowd; Buy me some peanuts and
Cracker Jack, I don't care if I never get back. Let me root, root, root for the
home team, If they don't win, it's a shame. For it's one, two, three strikes,
you're out, At the old ball game."
bg2 <- "Take me out to the ball game, Take me out to the crowds; Buy me some peanuts and
Cracker Jack, I don't care if I never get back. Let me root, root, root for the
home team, If they don't win, it's a shame. For it's one, two, three strikes,
you're out, At the old ball game."
md5(c(bg1,bg2))
## [1] "80ecebc0ae83f1efb1536a242aeebde9" "1265b26773cc1f4f430312356bd6e1a2"
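The two digests above illustrate diffusion: the messages differ in only a few characters, yet the digests look unrelated. As a rough check, the sketch below counts the hex positions where the two digests differ. The helper name `hex_diff` is made up for illustration.

```r
# Digests copied from the md5() output above
d1 <- "80ecebc0ae83f1efb1536a242aeebde9"
d2 <- "1265b26773cc1f4f430312356bd6e1a2"

# Hypothetical helper: number of positions where two equal-length hex strings differ
hex_diff <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

n_diff <- hex_diff(d1, d2)
n_diff / nchar(d1)   # fraction of hex digits that changed
```

For two unrelated random digests, about 15 of every 16 hex digits would be expected to differ.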
A system stores hashed passwords:
## [1] "65e84be33532fb784c48129675f9eff3a682b27168c0ea744b2cf58ee02337c5"
## [2] "ea71c25a7a602246b4c39824b855678894a96f43bb9b71319c39700a1e045222"
## [3] "8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918"
## [4] "54d712cf917f585cc5314e7d99b5e2fe7ac72341727e477da0c2b2d021ea5468"
When a user enters a password \(m\), the hash \(h(m)\) is compared to the stored values.
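A minimal sketch of this check, assuming SHA-256 as \(h\) and a made-up password list; the name `check_password` is hypothetical. For simplicity the sketch ignores usernames and compares against every stored hash.

```r
library(openssl)

# Hypothetical password list, for illustration only
stored <- as.character(sha256(c("hunter2", "letmein", "correct horse")))

# TRUE if the hash of the entered password matches a stored hash
check_password <- function(m) {
  as.character(sha256(m)) %in% stored
}

check_password("letmein")   # TRUE
check_password("Letmein")   # FALSE
```

Only the hashes need to be kept by the system; the passwords themselves are never stored.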
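Cryptographic hash functions should satisfy the following properties: efficiency (\(h(m)\) is fast to compute), one-way (given a digest \(y\), it is infeasible to find a message \(m\) with \(h(m) = y\)), collision-free (it is infeasible to find messages \(m_1 \neq m_2\) with \(h(m_1) = h(m_2)\)), and diffusion (a small change in the message produces a large change in the digest).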
Discuss. For each possible hash function \(h\), which properties (efficiency, one-way, collision-free, diffusion) does \(h\) have?
Function | Year | Digest size | Secure? |
---|---|---|---|
MD5 | 1991 | 128 bit | no (2004-2005) |
SHA-1 | 1995 | 160 bit | no (2005) |
SHA-2 | 2001 | 224, 256, 384, or 512 bit | yes |
SHA-3 | 2015 | 224, 256, 384, or 512 bit | yes |
For example, openssl defaults to `sha256`, which is SHA-2.
Check out the pseudocode.
Suppose a set \(P\) of passwords is stored as hashes: \(\{h(p) \mid p \in P\}\).
An attacker can create a list \(L = \{h(d) \mid d \in \mathcal{D}\}\), where \(\mathcal{D}\) is a dictionary list of common passwords.
A match between \(L\) and the stored hashes will reveal a password.
## [1] "65e84be33532fb784c48129675f9eff3a682b27168c0ea744b2cf58ee02337c5"
## [2] "ea71c25a7a602246b4c39824b855678894a96f43bb9b71319c39700a1e045222"
## [3] "8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918"
## [4] "54d712cf917f585cc5314e7d99b5e2fe7ac72341727e477da0c2b2d021ea5468"
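A dictionary attack can be sketched in a few lines. The stored passwords and the dictionary here are made up for illustration, and SHA-256 is assumed as the hash.

```r
library(openssl)

# Hypothetical stored hashes (an attacker would see only these)
stored <- as.character(sha256(c("dragon", "Tr0ub4dor&3", "sunshine")))

# Hypothetical dictionary of common passwords
dictionary <- c("123456", "password", "dragon", "sunshine", "iloveyou")
L <- as.character(sha256(dictionary))

# Dictionary words whose hash appears among the stored hashes
cracked <- dictionary[L %in% stored]
cracked
```

The list `L` is computed once and can be reused against every stored hash, which is what makes the attack cheap.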
Password systems sometimes incorporate a random string, called salt, into a password before it is hashed.
nyoung:$6$WQ0w1hcR$7wO.W0nyjzcakz42KgoPhfgdP8NWOZXSSTZzxEhPBTCiXAKcJ1kHv4VgTMKyx
U6Y7XNXaBmkPtlsLp1q7VPVC1:18212:0:99999:7:::
rfrink:$6$T3ortZMG$HOC3KWnrWHGvjujYnH42ZrfhdDVBbbfXLYFFG80rG9kvnJynEY8O/V37l.1T/
kcitXgkBFobngRxFYlhmXEL5.:18166:0:99999:7:::
roroku:$6$YTcauXgg$CFgWgR/oXj3NhFV56oqosnQAxrBto5Ho640uzTmcAR0OnRD.qkY4O6zbjVP
NqIoguUuLCK4Y8WW.PDmPWPK0:18227:0:99999:7:::
samundson:$6$CllynVdl$9N76VLgzfR1XsrqwSscksfgMOWs59oSW67ai4g3oHu29lnXe2pwk4g9Wvmp
A6XMHH8W0Sq4togY1I6UQKgeBMV0:18166:0:99999:7:::
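In entries like those above, `$6$` marks a SHA-512-based scheme and the field that follows is the salt. That scheme is more elaborate than a single hash; the sketch below only illustrates the basic idea, assuming SHA-256 and simple concatenation. The names `salted_hash` and `new_salt` are hypothetical.

```r
library(openssl)

# Hypothetical helper: hash a password together with its salt
salted_hash <- function(password, salt) {
  as.character(sha256(paste0(salt, password)))
}

# Random 8-byte salt, written as 16 hex digits
new_salt <- function() paste(as.character(rand_bytes(8)), collapse = "")

salt1 <- new_salt()
salt2 <- new_salt()

# Same password, different salts: the stored hashes differ
h1 <- salted_hash("letmein", salt1)
h2 <- salted_hash("letmein", salt2)

# Verification recomputes the hash using the stored salt
identical(salted_hash("letmein", salt1), h1)
```

Because the salt is stored in the clear next to the hash, it is not a secret; its job is to make each user's hash depend on more than the password.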
Question: If \(B\) is the number of bits of salt, how does salting affect the time needed for a dictionary attack?
An ideal hash function \(h: M \longrightarrow D\) is a random oracle.
To calculate theoretical probabilities, we often assume we have a random oracle hash function.
What is the probability that two people in a group have the same birthday?
For simplicity, consider a “random birthday” to be a number chosen at random from the set \(\{1,2,3,\ldots, 365\}\).
What is the probability that two random birthdays are the same?
What is the probability that two random birthdays are different?
Suppose you have already drawn two random birthdays, and the birthdays are different. What is the probability that a third random birthday will be different from the previous two?
What is the probability that three random birthdays will all be different?
Give a formula for the probability that \(r\) random birthdays will all be different.
Birthday Theorem. Suppose \(h:M \longrightarrow D\) is a hash function, where there are \(N\) possible hashes. Assuming that \(h\) is a random oracle, the probability of a collision when hashing \(r\) messages is \[ 1-\left(\frac{N-1}{N}\right)\left(\frac{N-2}{N}\right)\cdots\left(\frac{N-r+1}{N}\right) \] This probability is approximately \(1-e^{-r^2/(2N)}\).
## [1] 0.7304546
## [1] 0.7085471
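The exact formula and the approximation in the Birthday Theorem can be compared directly. This is a sketch; the function names `p_collision` and `p_approx` are made up, and the choice \(r = 23\), \(N = 365\) is only an example.

```r
# Exact probability that r values drawn from N possibilities contain a repeat
p_collision <- function(r, N) {
  1 - prod((N - seq_len(r - 1)) / N)
}

# Approximation from the Birthday Theorem
p_approx <- function(r, N) {
  1 - exp(-r^2 / (2 * N))
}

p_collision(23, 365)   # about 0.507
p_approx(23, 365)      # about 0.516
```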
Choose a value for \(r\), and look for collisions.
How big must \(r\) be to have a “high” probability \(\gamma\) of finding a collision? Solve (asymptotically):
\[ 1-e^{-r^2/(2N)} = \gamma \]
Solving gives \(r = \sqrt{2N\ln\frac{1}{1-\gamma}}\), so \(r \approx \sqrt{N}\) asymptotically, up to a constant factor that depends on \(\gamma\).
“Number of bits of security” \(\approx \frac{1}{2} \times \text{digest size}\)
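Solving \(1-e^{-r^2/(2N)} = \gamma\) for \(r\) gives \(r = \sqrt{2N\ln(1/(1-\gamma))}\). A small sketch of this calculation follows; the function name `messages_needed` is hypothetical.

```r
# Number of messages needed for collision probability gamma,
# from solving 1 - exp(-r^2 / (2N)) = gamma for r
messages_needed <- function(N, gamma = 0.5) {
  sqrt(2 * N * log(1 / (1 - gamma)))
}

messages_needed(365)      # about 22.5 birthdays
messages_needed(2^128)    # about 2.2e19, close to 2^64, for a 128-bit digest
```

The second call shows where the factor of one half comes from: a 128-bit digest gives roughly 64 bits of collision resistance.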
Programming problem: Write a birthday attack on a semi-weak hash function `miniSHA`.
Use the `duplicated` function. Read the help menu to see how it works.
`duplicated` will work just fine.
How `duplicated` works
The function `duplicated` “determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.”
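The outputs that follow can be reproduced with a small vector. The vector `x` here is an assumption chosen to match them, not necessarily the one originally used.

```r
# Hypothetical vector, chosen so the calls below give the outputs shown
x <- c(1, 2, 3, 1, 5, 6, 7, 8, 5, 3, 5)

duplicated(x)          # TRUE wherever a value has appeared earlier
which(duplicated(x))   # positions of the repeated values
x[duplicated(x)]       # the repeated values themselves
```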
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## [1] 4 9 10 11
## [1] 1 5 3 5
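Putting `duplicated` to work, a birthday attack might look like the sketch below. Since `miniSHA` is not defined here, the code uses a hypothetical stand-in, `toy_hash`, that keeps the first 4 hex digits (16 bits) of SHA-256, so \(N = 65536\) and \(\sqrt{N} = 256\). With \(r = 2000\) messages a collision is virtually certain.

```r
library(openssl)

# Hypothetical stand-in for miniSHA: first 4 hex digits (16 bits) of SHA-256.
# The actual miniSHA may differ.
toy_hash <- function(m) substr(as.character(sha256(m)), 1, 4)

r <- 2000                                   # number of messages to try
messages <- paste0("message-", seq_len(r))
digests <- toy_hash(messages)

j <- which(duplicated(digests))[1]   # first digest that repeats an earlier one
i <- match(digests[j], digests)      # position of the earlier message with that digest

c(messages[i], messages[j])          # two different messages, same toy digest
```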
Idea: create a random bitstring using hash functions, a key, and some sort of feedback. Encrypt by XORing (as with certain block cipher modes).
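One way to realize this idea is sketched below, under stated assumptions: SHA-256 as the hash, each keystream block computed as the hash of the key followed by the previous block, and an empty initial state. The names `keystream` and `hash_xor` are hypothetical, and the construction is for illustration only, not a vetted cipher.

```r
library(openssl)

# Keystream with feedback: each block is SHA-256 of (key, previous block)
keystream <- function(key, n) {
  key <- charToRaw(key)
  state <- raw(0)                        # assumed empty initial state
  stream <- raw(0)
  while (length(stream) < n) {
    state <- c(sha256(c(key, state)))    # c() gives a plain raw vector of 32 bytes
    stream <- c(stream, state)
  }
  stream[seq_len(n)]
}

# XOR the bytes with the keystream; the same call encrypts and decrypts
hash_xor <- function(bytes, key) {
  xor(bytes, keystream(key, length(bytes)))
}

ciphertext <- hash_xor(charToRaw("Take me out to the ball game"), "secret key")
rawToChar(hash_xor(ciphertext, "secret key"))   # recovers the original message
```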
Problem: verify the integrity of \(n\) independent blocks \(B_1, B_2, \ldots, B_n\). (e.g., peer-to-peer data)
Examples: Git, Bitcoin
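One standard approach to this problem is a hash tree (Merkle tree): hash each block, then repeatedly hash pairs of hashes until a single root remains. The text above does not name a construction, so treat this as an assumed illustration. The function name `merkle_root` and the rule for an odd number of hashes (repeat the last one) are choices made for this sketch.

```r
library(openssl)

# Root hash of a hash tree over a character vector of blocks
merkle_root <- function(blocks) {
  level <- as.character(sha256(blocks))          # leaf hashes
  while (length(level) > 1) {
    if (length(level) %% 2 == 1) {
      level <- c(level, level[length(level)])    # odd count: repeat the last hash
    }
    left  <- level[seq(1, length(level), by = 2)]
    right <- level[seq(2, length(level), by = 2)]
    level <- as.character(sha256(paste0(left, right)))   # hash each pair
  }
  level
}

blocks <- paste("block", 1:5)
root <- merkle_root(blocks)

# Changing any single block changes the root
tampered <- replace(blocks, 3, "block three")
merkle_root(tampered) == root   # FALSE
```

Anyone holding only the root can detect a change to any block, and a single block can be checked against the root using just the hashes along its path.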
Problem: A new block needs to be added to the Bitcoin blockchain. Who decides?
Reference: https://bitcoin.org/bitcoin.pdf
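The referenced paper answers this with proof of work: whoever first finds a value that makes the block's hash begin with enough zero bits gets to add the block. A much simplified sketch follows, assuming a single SHA-256 over the block contents plus a counter and a target of three leading hex zeros. The function name `mine` and the block contents are hypothetical.

```r
library(openssl)

# Search for a nonce so that SHA-256(block data + nonce) starts with `zeros` hex zeros
mine <- function(block_data, zeros = 3, max_tries = 1e6) {
  target <- strrep("0", zeros)
  for (nonce in seq_len(max_tries)) {
    digest <- as.character(sha256(paste0(block_data, nonce)))
    if (startsWith(digest, target)) {
      return(list(nonce = nonce, digest = digest))
    }
  }
  NULL   # no nonce found within max_tries
}

result <- mine("hypothetical block contents")
result$nonce
result$digest
```

Finding the nonce takes about \(16^3 = 4096\) hashes on average, while checking it takes one; each additional required zero multiplies the search cost by 16.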