Packing Numbers of Arbitrary Length into a Byte Stream? (in C/C++)

Extrarius · Feb 19, 2003

I'm working on a project where I need to pack numbers into a character array as tight as possible. By this, I mean that if there is a number that can be from 0 to say 30, it will take up only ln(30)/ln(2) bits and no more. I know it is possible to do, as I've done it before(unfortunately I've lost the source for it =-/), but I can't remeber the method I used. I could round the number of bits up to the nearest whole, but that is undesirable as it adds a lot of wasted space to the output that really doesn't need to be there.

The way I'm thinking of it right now is to keep 2 integers, the max value that can be represented (product of the max value of every number input so far) and the current value held. The value is calculated as N1 + M1*N2 + M1*M2*N3 etc where Nx is the Xth number inserted and Mx is the max value the the Xth number could be. When the stored max value is >= 256, it means there are at least 8 bits to be output, so the lower 8 bits are extracted from the stored value and both the value and max are divided 256. That doesn't seem like it would work though because the resulting max could then be a fraction and all of this should be done using integer math (at least thats how it seems to me since the stored number always has to be an integer and the output byte is an integer).

Write(N, M) = Insert N into the stream, Where N is a number from 0 to M
N = Read(M) = Extract a number from the stream, Where the number can be from from 0 to M
SN = Stored Number
SM = Stored Max

As an example:
SM = 1
SN = 0
Write(1, 10) makes SN = 1, SM = SM*10 = 10
Write(7, 10) makes SN = 1+7*SM = 71, SM = SM*10 = 100
Write(5, 10) makes SN = 71+5*SM = 571, SM = SM*10 = 1000

Read(10) returns (SN%10) = 1, and makes SN = SN/10 = 57
Read(10) returns (SN%10) = 7, and makes SN = SN/10 = 5
Read(10) returns (SN%10) = 5, and makes SN = SN/10 = 0

The problem occurs when the SM becomes >= the maximum value an unsigned integer (2^32-1) can hold, which is the reason I need to stream it. As you can see, there is no reguard for the bit boundaries and there really doesnt seem to be any need to care about them, but maybe that is the only way to stream them effectively. I chose to use byte stream because eventually it will get turned into that anyways. A solution that stores it in in an array of integers would work just as fine since I can easily convert it.

Does anybody know how to do this, or where I might at least find some hints? I've looked through all my backups for the past several years and I've done several web searches but I can't find any information on doing this.

DaveSimmons · Feb 19, 2003

Interesting problem -- I've written plenty of code where I've cared about wasting whole bits, but never fractional ones!

You say it "eventually" will be turned into a stream, but is there any problem of buffering some or all of it all until that point? If not, you could use one of the available libraries for infinite-precision math and add a module to convert the internal representation to an efficient bit stream when you're done adding values.

I think C/C++ Users Journal has had at least one infinite precision library aritcle.

notfred · Feb 19, 2003

the PNG image format does this. For example, if you use one bit pixels, it will store 8 pixels per byte. However, the entire stream will have the same bpp, so you'll either have all 1 bit numbers, or 2,4,8, or 16 bit numbers, but all the numbers in the stream are the same size. The number of BPP is sent at the beginning of the stream. If each of your numbers is arbitraty in length, then you're going to have more problems, as it seems like you're going to have to use a fixed length number before each compressed number to tell it's length, and that would likely take up more space than just using standard 4 bit ints.

DaveSimmons · Feb 19, 2003

the PNG image format does this. For example, if you use one bit pixels, it will store 8 pixels per byte. However, the entire stream will have the same bpp

I was reading it that way at first, but he wants to use a fractional bpp based on the max -- for example if the max was 11, then only 3.5 bits per digit are needed rather than 4.

Extrarius · Feb 19, 2003

Exactly, I don't want to use more bits than needed - The reason being that its one stage of a compression algorithm I'm working on. Using less bits is good =-)
As you can see in the example where 'base' 10 is used, it is easily possible to store numbers with 'fractional bits' (each base 10 character takes ~3.322 bits). The real problem is when the number gets large. The problem with using an 'infinite precision' library is that they all (or at least the ones I've seen) use binary coded decimals (using 4 bits per digit{0-10} when really 4 bits could hold a number from 0-15) which is again a waste of space. I guess I'll start looking around though because if I could find one that didnt use BCD it would be exactly what I needed. I was having a hard time thinking of keywords to use in a search, and that should help a lot. If I find it, I'll post a link for anybody else that cares.
THANKS!

About having to prefix each number with its length: The reason I don't have to do that is because I have an algorithm to determining the length of the next number.

DaveSimmons · Feb 19, 2003

The problem with using an 'infinite precision' library is that they all (or at least the ones I've seen) use binary coded decimals (using 4 bits per digit{0-10} when really 4 bits could hold a number from 0-15)

I didn't mean to use the library as your main storage though, just as a temporary buffer for building "big numbers" before streaming them as true binary, something like:

stream in K small numbers -> 1 big number in BCD -> BCD to true binary -> stream out K numbers

If you can "chunk up" your N numbers into blocks of K then the max wasted bits in the scheme above is 7(N/K). Of course if K is set to N then the max wasted in the stream is just 7, but (a) you waste quite a bit of temporary space until you do stream out, and (b) you waste the same space for temporary storage when you go to extract numbers from the stream.

Extrarius · Feb 19, 2003

Well, the problem with converting from BCD to binary is the problem I'm trying to solve. Its easy to take each nibble and turn it to binary by doing the steps shownin my example. The problem is inserting the 10th digit will overflow an integer, and if I divide by 256 when I get enough bits to write, I'll probably end up with a max being a fraction (for example, after 3 digits, max is 1000, so it divides by 256 to get 3.90625 for max, which isn't right because {eventually} there will be loss of precision)

notfred · Feb 19, 2003

Originally posted by: DaveSimmons

the PNG image format does this. For example, if you use one bit pixels, it will store 8 pixels per byte. However, the entire stream will have the same bpp

Click to expand...

I was reading it that way at first, but he wants to use a fractional bpp based on the max -- for example if the max was 11, then only 3.5 bits per digit are needed rather than 4.

four bits are needed. 1011. You can't have half a bit.

If you actually figure out a way to use the upper quarter of that 4 bit vaue to represent something else, I'd love to see a sample bitstream with half a dozen or so integers decoded from it, and the method for decoding them.

Extrarius · Feb 19, 2003

Look at the example in my first post: Each individual digit would take 4 bits to store(0-9, 3 bits can only store 0-7), but 4 bits can store 0-15 so the values from 10-15 are being 'wasted'. However, when you run the algorithm for 3 digits and get a max of 1000, that can be stored in 10 bits (which can hold from 0-1023).

You can either use 4 bits for each number (resulting in 12 used bits for the 3 numbers) or you can use 10 bits to store the combined number, which comes out to 10/3 bits per number which is ~3.333 bits per number.

Of course, its not yet streamed yet, but it will be soon as I think I've found a simple (but imperfect) solution to my fraction problem that won't result in too much wasted space(a maximum of 1 bit wasted per 3 bytes).

DaveSimmons · Feb 19, 2003

Originally posted by: notfred

Originally posted by: DaveSimmons

the PNG image format does this. For example, if you use one bit pixels, it will store 8 pixels per byte. However, the entire stream will have the same bpp

Click to expand...

I was reading it that way at first, but he wants to use a fractional bpp based on the max -- for example if the max was 11, then only 3.5 bits per digit are needed rather than 4

Click to expand...

four bits are needed. 1011. You can't have half a bit.

If you actually figure out a way to use the upper quarter of that 4 bit vaue to represent something else, I'd love to see a sample bitstream with half a dozen or so integers decoded from it, and the method for decoding them.

It can be done, in groups. Using my 0-11 range example, any 4 base-12 numbers (or 4-pixel image) can be stored as a 15-bit unsigned binary number, using 3.75 bits per digit, saving one precious bit of space 😉

To see this, consider BBBB (the largest 4-digit base-12 number) and note that it's equal to 19008 + 1584 + 132 + 11 = 20,735 decimal, fitting comfortably in the 0-32767 range of 2^15. Out of boredom I also checked and 5 base-12 digits can be stored as 1 - 18-digit binary number, or 3.6 bits per digit.

Extrarius · Feb 20, 2003

=-/ Unfortunately the fix I thought I had is apparently harder to implement than I thought and I can't get it working (or it might not be a fix at all and I implemented it right but it didnt help, who knows?) =-(

notfred · Feb 20, 2003

I jsut thought of something - how with this scheme deal with signed numbers? How do you compress a number when the leftmost bit is a sign bit?

DaveSimmons · Feb 20, 2003

Originally posted by: notfred
I jsut thought of something - how with this scheme deal with signed numbers? How do you compress a number when the leftmost bit is a sign bit?

As stated the values are unsigned, but for signed in -max,max you can use 1 extra bit per number just like with binary numbers. For my example earlier 19 bits could store 4 numbers in the range -11,+11 still saving that one precious bit 🙂

If you know that the negative minium is some fixed value like -3 for ALL numbers you can use less than one bit for the sign by simply adding +3 to all values before storing, making your minimum 0 and your "max" values all (max+3).

Extrarius · Feb 20, 2003

I don't need to store signed integers, so it doesn't matter =-)

Extrarius · Feb 20, 2003

C'mon, somebody has to be an algorithm guru. Anandtech has a billion members, how could there not be a programming guru =-/
/me heads over to the newsgroups and prepares to wait years for a response

{aka BUMP}

Packing Numbers of Arbitrary Length into a Byte Stream? (in C/C++)

Extrarius

Senior member

DaveSimmons

Elite Member

notfred

Lifer

DaveSimmons

Elite Member

Extrarius

Senior member

DaveSimmons

Elite Member

Extrarius

Senior member

notfred

Lifer

Extrarius

Senior member

DaveSimmons

Elite Member

Extrarius

Senior member

notfred

Lifer

DaveSimmons

Elite Member

Extrarius

Senior member

Extrarius

Senior member

TRENDING THREADS