Hash Function Collisions in String Hashing Algorithm
Hi! Recently I've been fighting a weird bug in my application. It turns out that the hashing function for strings produces collisions, which I did not expect for my use case. It's easy to fix by introducing a custom hashing function for the data type, but I decided to share my experience to hear your thoughts. First of all, the documentation doesn't mention possible collisions when using the Hash trait. Perhaps we should reconsider the string hashing algorithm, because it doesn't produce unique hashes even for normal use cases. For example, I have a list of 1040 unique IATA codes, and after hashing I get only 895 unique hashes. The same thing happens with country codes. Of course, I'm not an expert in hashing algorithms, but I always thought the distribution should be strong enough for common use cases such as the ones I described. You are welcome to join this thread and see what an LLM thinks about the current implementation of the string hashing algorithm.
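
For anyone who wants to reproduce the measurement, here is a minimal sketch of how the distinct-hash count could be checked, together with one possible shape for the custom key mentioned above. It assumes the full 64-bit value returned by `Hasher::finish` is what gets compared. The input is a placeholder (every three-letter combination, not my real IATA list), and `hash_str`, `distinct_hashes`, `pack_code`, and `sample_codes` are hypothetical helper names invented for this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash one string with the standard library's default hasher,
/// keeping the full 64-bit output.
fn hash_str(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

/// Count how many distinct 64-bit hash values a list of strings produces.
fn distinct_hashes(items: &[String]) -> usize {
    items
        .iter()
        .map(|s| hash_str(s))
        .collect::<HashSet<u64>>()
        .len()
}

/// Hypothetical collision-free key: pack three ASCII uppercase letters
/// into a u32 as a base-26 number. Returns None for any other input.
fn pack_code(code: &str) -> Option<u32> {
    let bytes = code.as_bytes();
    if bytes.len() != 3 || !bytes.iter().all(|b| b.is_ascii_uppercase()) {
        return None;
    }
    Some(
        bytes
            .iter()
            .fold(0u32, |acc, &b| acc * 26 + u32::from(b - b'A')),
    )
}

/// Placeholder input: every combination AAA..=ZZZ (not real IATA data).
fn sample_codes() -> Vec<String> {
    let mut codes = Vec::with_capacity(26 * 26 * 26);
    for a in b'A'..=b'Z' {
        for b in b'A'..=b'Z' {
            for c in b'A'..=b'Z' {
                codes.push(String::from_utf8(vec![a, b, c]).unwrap());
            }
        }
    }
    codes
}

fn main() {
    let codes = sample_codes();
    println!("inputs: {}", codes.len());
    println!("distinct 64-bit hashes: {}", distinct_hashes(&codes));

    // The packed key is injective by construction, so it cannot collide.
    let packed: HashSet<u32> = codes.iter().filter_map(|c| pack_code(c)).collect();
    println!("distinct packed keys: {}", packed.len());
}
```

Running the same count on the real list, using the untruncated `u64` values, might help show whether the collisions come from the hasher itself or from a later step. The packed key only works for fixed-length uppercase codes, so it is a sketch for this particular data shape rather than a general replacement.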