Archive for September, 2010
Posted on September 2, 2010, under general.
A common software engineering pattern is the need to test elements for set membership.
Practical questions like “is this item in my cache” arise surprisingly often. Bloom Filters are a great mechanism for performing this test, and if you haven’t encountered them before – they’re well worth knowing about. But over the last year, I’ve been experimenting with a different method for these kinds of membership tests on enumerated sets, and thought it was worth sharing.
It’s definitely not the first time the technique has been used, Cian found some papers on its use. But it’s not a common pattern, which is a pity – because it’s incredibly simple, and lends itself very well to modern hardware.
First things first, some huge disclaimers;
- The approach only works on enumerated sets. Each element in the problem space has to be assigned a unique counted number. Hashes won’t do.
- The method is constrained by the size of product of the total number of elements and the maximum size of any set.
So, it’s very simple; Take each element in the problem space, assign it a unique prime number, and then represent sets as the products of those primes. This works because of the fundamental theorem of arithmetic.
Here’s a summary of operations;
Computing a set:
Compute the product of all of the elemental primes in the set.
s = a * b * c
Testing for element membership:
Test the modulus of the set number.
(s % a) == 0
Adding to the set:
Multiply the existing set product by the new element’s prime number.
s *= d
Removing from the set:
Divide the existing set product by the element’s prime number.
s /= a
Now at this point, you’re probably skeptical about the usefulness of the technique, given the constraints. Obviously other operations like unions and intersections between sets are possible, but they require factorisation – and so are not particularly efficient (though you’d be surprised how quickly they do run). But look at the benefits;
- Unlike Bloom filters, there are zero false positives, and zero false negatives. The method is 100% precise.
- Due to their use in cryptography; libraries, CPUs and other hardware have arisen that are highly efficient at computing the products and modulus of very large numbers.
As a case-in-point, let’s use the internet as an example. Suppose you want to model the internet as a system of interconnected autonomous systems and paths. A really common test when using such models is to determine whether or not a particular system is on a particular path.
If we take the internet as 100,000 autonomous systems, and the longest practical path on the internet as containing 20 such systems (longer paths exist, but are oddities – in real terms these numbers are both over-estimates) the largest product we would ever need to compute is smaller than 2410. That’s much smaller than even modern cryptographic primes themselves, and much much smaller than their products. There is a lot more room.
Actually, that was my first real-world use-case for this technique – modeling the internet in a few hundred lines of python, on a laptop – as a prototype. Surprisingly, a Macbook Air could perform well over one million such membership tests per second without skipping a beat – after reordering to assign the most popular AS numbers the lowest prime-numbers. And incidentally, if that kind of problem interests you – Cloudfront are hiring.
Now a question, for the information theorists; How is it that this method can seemingly perfectly encode N bits of information in fewer than N bits? For example it takes 2 bits to represent the number “2″ and 2-bits to represent the number “3″, yet to represent both through their product – “6″ it takes only 3 bits. But this isn’t like addition or some other operation – there really only is one way of factorising 6 to get back to “3″ and “2″. The information is seemingly encoded perfectly.