How does the HyperLogLog algorithm work?

I’ve been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm – which estimates how many unique items are in a list.

This was particularly interesting to me because it brought me back to my MySQL days when I saw that “Cardinality” value (which I always assumed until recently that it was calculated not estimated).

So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:

function countUniqueAlgo1(arr) {
    var Table = {};
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}

But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).

I’ve been reading this paper about how to count duplicates in a list in O(n) time and using minimal memory.

It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.

I’ve read the paper, but I can’t seem to understand it. Can someone give a more layperson’s explanation? I know what hashes are, but I don’t understand how they are used in this HyperLogLog algorithm.

3 Answers
3

Leave a Comment