## Prime and Proper

Posted on September 2, 2010, under general.

A common software engineering pattern is the need to test elements for set membership.

Practical questions like “is this item in my cache” arise surprisingly often. Bloom Filters are a great mechanism for performing this test, and if you haven’t encountered them before – they’re well worth knowing about. But over the last year, I’ve been experimenting with a different method for these kinds of membership tests on *enumerated* sets, and thought it was worth sharing.

It’s definitely not the first time the technique has been used – Cian found some papers on its use – but it’s not a common pattern, which is a pity, because it’s incredibly simple and lends itself very well to modern hardware.

First things first, some huge disclaimers;

- The approach only works on *enumerated* sets. Each element in the problem space has to be assigned a unique counted number. Hashes won’t do.
- The method is constrained by the size of the set products, which grow with both the total number of elements and the maximum size of any set.

So, it’s very simple; **Take each element in the problem space, assign it a unique prime number, and then represent sets as the products of those primes.** This works because of the fundamental theorem of arithmetic.

Here’s a summary of operations;

**Computing a set:**

Compute the product of all of the elemental primes in the set.

s = a * b * c

**Testing for element membership:**

Test the modulus of the set number.

(s % a) == 0

**Adding to the set:**

Multiply the existing set product by the new element’s prime number.

s *= d

**Removing from the set:**

Divide the existing set product by the element’s prime number.

s /= a
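The four operations above can be sketched in a few lines of Python (the post’s own prototype language); the helper and variable names here are my own, not from the post:

```python
# A sketch of the four set operations, using small illustrative primes.

def first_primes(n):
    """Return the first n primes by trial division (fine for small n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# Enumerate the problem space: element i is assigned the i-th prime.
a, b, c, d = first_primes(4)

s = a * b * c        # computing a set: the product of its elemental primes
assert s % a == 0    # membership test: s mod a is zero, so a is in the set
assert s % d != 0    # d is not in the set
s *= d               # adding d to the set
s //= a              # removing a (exact integer division)
assert s % a != 0
```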

Now at this point, you’re probably skeptical about the usefulness of the technique, given the constraints. Obviously other operations like unions and intersections between sets are possible, but they require factorisation – and so are not particularly efficient (though you’d be surprised how quickly they *do* run). But look at the benefits;

- Unlike Bloom filters, there are zero false positives, and zero false negatives. The method is 100% precise.
- Due to their use in cryptography, libraries, CPUs and other hardware have arisen that are highly efficient at computing the products and moduli of very large numbers.

As a case-in-point, let’s use the internet as an example. Suppose you want to model the internet as a system of interconnected autonomous systems and *paths*. A really common test when using such models is to determine whether or not a particular system is on a particular path.

If we take the internet as 100,000 autonomous systems, and the longest practical path on the internet as containing 20 such systems (longer paths exist, but are oddities – in real terms these numbers are both over-estimates) the largest product we would ever need to compute is smaller than 2^{410}. That’s much smaller than even modern cryptographic primes themselves, and much much smaller than *their* products. There is a **lot** more room.
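As a rough sanity check on that bound, a short sketch (the sieve helper is my own) can compute the product of the 20 largest of the first 100,000 primes – the worst case for a 20-hop path:

```python
# Sieve out the first 100,000 primes and multiply the 20 largest.
# The sieve limit 1,400,000 is enough because the 100,000th prime
# is 1,299,709.

def primes_up_to(limit):
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [i for i, is_p in enumerate(sieve) if is_p]

primes = primes_up_to(1_400_000)[:100_000]
worst = 1
for p in primes[-20:]:
    worst *= p
print(worst.bit_length())  # comfortably under 410 bits
```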

Actually, that was my first real-world use-case for this technique – modeling the internet in a few hundred lines of Python, on a laptop, as a prototype. Surprisingly, a MacBook Air could perform well over one million such membership tests per second without skipping a beat – after reordering to assign the most popular AS numbers the lowest prime numbers. And incidentally, if that kind of problem interests you – CloudFront are hiring.

Now a question for the information theorists; How is it that this method can seemingly perfectly encode N bits of information in fewer than N bits? For example, it takes 2 bits to represent the number “2” and 2 bits to represent the number “3”, yet representing both through their product, “6”, takes only 3 bits. But this isn’t like addition or some other operation – there really is only one way of factorising 6 to get back to “3” and “2”. The information is seemingly encoded perfectly.

## 13 Replies to "*Prime and Proper*"

**colmmacc** on September 2, 2010

I think the key difference is zero false-positives, sometimes that really really helps – when it’s expensive to find and iterate the set itself to rule out the FPs. It would take 2^{100000} bits to represent sets of up to 100,000 elements with a zero false positive rate, so it gets too unwieldy.

Thanks for the information theory, I think that does make sense!

**dwmalone** on September 2, 2010

I’m thinking of a bit field where you have one bit per member. You only need 100,000 bits for your example, not 2^100,000, unless I’ve misunderstood something. (The set of all subsets of 100,000 elements is of size 2^100,000, but that means you only need 100,000 bits to represent any one subset.)

I thought of another way to view the information theory question on the train. Suppose you say that your sets on average have 20 elements. That means that you want to encode a vector of bits that has each bit set with probability ~20/100000. The entropy of this is h(p) = -(20/100000 * lg(20/100000) + 99980/100000 * lg(99980/100000)) = 0.00274 bits per position. This means that you only need 100000 * 0.00274 ≈ 274 bits to encode this on average with an optimal encoder.
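For what it’s worth, the estimate checks out numerically (a quick sketch, using the same p = 20/100000 as above):

```python
import math

# Entropy of a 100,000-bit vector with each bit set independently
# with probability p = 20/100000, as in the comment above.
p = 20 / 100000
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
print(100000 * h)  # roughly 274-275 bits for an optimal encoder
```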

**colmmacc** on September 2, 2010

Sorry, you’re right – it’s only 100,000 bits. You’ve had me thinking though about comparative complexity of the prime approach vs the probabilistic bloom filters.

Hashes are typically at least 128 bits wide, and based on counting some disassembled code, it looks like it takes about twice as many operations to calculate the 128-bit MD5 hash of a 32-bit integer as it does to mod a 1028-bit number with a 32-bit number (md5sum vs Python’s native bignum). That might be down to implementation differences, but I’m guessing not all of it. I should work out the numbers fully – the mechanisms for computing a hash are pretty similar to the mechanisms a bignum library has to use, so the comparison between the two methods has a strange kind of symmetry.

274 bits is quite a bit lower than 410; I wonder how much more can be squeezed practically. I did originally wonder whether the technique works in a modular field – with a careful choice of modulus (potentially itself prime) – to further constrain the bits in a manner similar to hashing, and what the probabilities for false-positives might become. 274 suggests that there may be moduli for which the probability of false-positives is still “0”.

**Nick Johnson** on September 2, 2010

This is a sensible scheme if, rather than a set, you want to store counts for enumerated objects – but for set representation, it’s extremely wasteful, since there are many values that can only be generated by prime powers.

If you’re only operating on set membership, a numbering system similar to Combinadics is probably a better, more compact, choice.

**Ganesh Sittampalam** on September 2, 2010

Although 2 takes two bits of information to store, in your encoding it only represents one bit of information (the presence or absence of the thing it corresponds to). Similarly for 3. So the fact that you can store numbers up to 6 in three bits isn’t really interesting in itself, since the total information it’s encoding is only two bits (presence or absence of both 2 and 3).

**dwmalone** on September 2, 2010

Colm – your encoding actually lets you store a count for each object as the exponent of its prime number. You could think of the extra bits between 410 and 274 as allowing you to count some of the objects multiple times, provided they don’t make the product too big.

There might be some kind of hint in this about how to tweak the technique. Do you know about Unique Factorisation Domains? They might be useful…

**CraigHughes** on September 2, 2010

Re: information theory part. “2″ and “3″ don’t require a full 2 bits to store in your system. Because you can’t represent all the numbers. “2″ really corresponds to “1″ since it’s the first prime, “3″ is really “2″, etc. You can’t encode “4″. So the apparent compression is that you’re skipping all the non-prime numbers. Make sense?

**Fergal Daly** on September 3, 2010

On optimal encodings.

It’s not optimal, but simply assigning the possible elements the numbers 0 to 99999, then listing them in order, taking differences, and encoding those with a variable-length integer encoding gets you down below 320 bits for a 20-element set.

I think 340 bits is the worst case for this encoding. This happens when the elements are evenly spaced. E.g. [0, 5000, 10000, ..., 99999]. So you get [0, 5000, 5000, 5000, ..., 4999].

For the integer encoding, the numbers are < 99999, so you can encode them as LN, where L is 4 bits and N is L+2 bits long; that gets you 0-131071 (so 0 is "000000" (L=0), 5 is "0001101" (L=1, N="101"), etc.). These are between 6 and 21 bits, and in the case above L=11 because the numbers are all < 8192, so that's 17 bits * 20 items = 340.

I think this is the worst case because making any number bigger forces another to be smaller. Making a number smaller can decrease its L, and making it bigger can increase it, but on average you have to move further to increase the number of bits in a number than to decrease it. You can also look at it as maximising a sum of logs of numbers that add up to 100000, and I bet the max is when they're all equal.

Best case is [0,1,2,3, ..., 19] which takes only 5*20 = 100 bits to represent. I have no idea how close the average gets to 274.

As usual, without extreme cleverness, better encoding leads to worse computational behaviour, as element-of, union etc. are nasty with the above.
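One possible concrete version of this delta-plus-varint scheme, as a sketch – the exact field widths in the comment are ambiguous, so this assumes a 4-bit length field L followed by an (L+2)-bit value:

```python
# Delta-encode a sorted set of ints < 100000 as a bit string.
# Each delta is stored as a 4-bit length field L, then an (L+2)-bit
# value, so values up to 131071 fit. This is my own reconstruction
# of the scheme described in the comment above.

def encode_set(elements):
    bits = ""
    prev = 0
    for i, e in enumerate(sorted(elements)):
        delta = e if i == 0 else e - prev
        n = max(delta.bit_length(), 2)   # value field width, at least 2 bits
        bits += format(n - 2, "04b") + format(delta, "0{}b".format(n))
        prev = e
    return bits

evenly_spaced = list(range(0, 100000, 5000))   # 20 evenly spaced elements
print(len(encode_set(evenly_spaced)))          # close to the 340-bit worst case
```

The evenly spaced set lands just under 340 bits because the leading zero delta is slightly cheaper than the 17-bit worst-case deltas.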

**Fergal Daly** on September 3, 2010

I was also thinking about unique factorisation domains and things like a + b*sqrt(2). Like the Gaussian integers, these factorise uniquely, they multiply fairly inexpensively, and mod is OK – it involves some multiplication and then an integer mod, I think, and sometimes you have to do that twice.

So instead of 32 bits for one integer you could have 16 bits each for a and b. I’m not quite sure what it gets you. I think you get a lot more primes in N+N bits of these numbers than you do with 2*N bits of ordinary integers, so you end up being able to include higher powers of more primes without the number of bits blowing up as quickly.

So integers are good if you have some elements that occur in your bags with much higher multiplicities than others – you assign the small primes to them and the bigger ones to those that occur with low multiplicities. Rings like a + b*sqrt(2) are better if you have a flatter distribution, as they have more “small” primes.

I’m not sure if a + b*sqrt(2) + c*sqrt(3) also gets you a UFD, but if it does, I bet it has way more small primes again. I expect the more sqrts you add on, the closer you get to something very flat.

**Darius Bacon** on September 3, 2010

So it worked all right for this use; but did it work *better* than Python’s built-in set type? I guess it probably used less space — was that why you tried it?

**b0b0b0b** on September 3, 2010

I think this approach is useful for sparse sets – I tried something like this for a fast poker hand evaluator I wrote. Anyway, consider an enumerated set with 52 possible elements; using a bitmask we have a constant cost of 52 bits to represent any subset (2^52 ≈ 4e15 subsets). Using the primes method, a lower bound on the number representing “every element” is 52!, which Google tells me is 8e67. I have a feeling I’m rehashing what others have said, but I wanted to say it in these terms.
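A quick sketch of that comparison (the helper names are my own):

```python
# A 52-card "universe": a bitmask always fits in 52 bits, while the
# prime-product representation of the full set is the product of the
# first 52 primes, which is far larger than 52!.

def first_primes(n):
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

full_deck_mask_bits = 52   # constant cost for any subset

full_deck_product = 1
for p in first_primes(52):
    full_deck_product *= p

print(full_deck_product.bit_length())  # hundreds of bits for the full set
```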

**Marco** on September 9, 2010

Intriguing.

Of course, representing a set of 20 elements out of at most 100,000 through basic concatenation only requires 340 bits. Since the average prime for a set of at most 100,000 elements is 622607 [1], the probability that the prime method leads to a more compact representation for a uniformly distributed set is really small. On average, it will never overtake it [2]. Searching a short list is probably more efficient (and trivially parallel) than modulo arithmetic on very large numbers. Still, it seems worth seeing/doing some analysis on Bloom filter vs. prime set vs. concatenation. Do you have some references to those papers?

Also, it seems like you’re wasting bits. For example, 60 doesn’t factor into distinct primes. Instead it factors into 2*2*3*5. So 60 is equivalent to 30 in your system. Not sure how you’d fix that.

**dwmalone** on September 2, 2010

I think union and intersection correspond to lcm and gcd in this framework, so there are cheap ways to compute them.

I was trying to understand why this is better than a bitmask, where the test operations are even cheaper and you get union and intersection cheaply. I guess that you are depending on the fact that the sets that you actually want to store are small relative to the total number of elements, otherwise things will be very big and inefficient.

From an information theory point of view, this explains why you seem to need a relatively small number of bits. In your space of things that you want to encode, you’ve assigned a zero probability to many of the objects, so you’re only leaving space for (N choose 20) sets rather than 2^N.
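The lcm/gcd correspondence mentioned in the comment above is easy to demonstrate (a sketch, with three hypothetical elements a, b, c):

```python
# With sets encoded as prime products, intersection is gcd and
# union is lcm (computed here as product divided by gcd).
from math import gcd

a, b, c = 2, 3, 5                  # three enumerated elements
s1 = a * b                         # the set {a, b}
s2 = b * c                         # the set {b, c}

intersection = gcd(s1, s2)         # {b}
union = (s1 * s2) // gcd(s1, s2)   # lcm: {a, b, c}
assert intersection == b
assert union == a * b * c
```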