this page was originally written for a different part of my site. i moved it here with minimal edits. it's still valuable information, so i didn't want to discard it


compressing files with large homogeneous sections

storing files with large regions of consecutive equal bits in less space. the original context for this was in storing long 1-bit audio files more efficiently such that they might fit in the flash memory of an arduino uno

the basis of this idea is simple: just count bits up. count how many of the same bit appear in a row and add that value to an array, switch the bit you're looking for, and repeat. things get a bit more complex when you begin looking into how to store that information effectively

bit-depth...

...is a term that refers to how many bits are used to store a single sample of information. for most regular applications, it makes sense to use a bit depth that's a multiple of 8 bits. this is not a regular application. depending on how "homogeneous" the file you're compressing actually is, most homogeneous fields (sequences of identical bits) might not reach 255 bits, and it won't make sense to use a bit-depth of 8 or higher. for my project, it sometimes makes sense to use a bit-depth of 4. so how does that work?

bit depth: 4

this initially sounds like a problem, we can't store things in spaces smaller than one byte, right? that's not an incorrect assessment, but nothing's stopping us from cramming everything we want in one byte anyway.

we can cut our byte into halves. we'll store 4 bits in the upper bits, and 4 bits in the lower bits. to access these in isolation looks as follows:

upper bits

shift the byte over
AAAABBBB
>> 4
=
0000AAAA
some implementations of bitshift might cause the value to be malformed, fix it with bitwise and:
1111AAAA
&
00001111 or 0xf

lower bits

bitwise and to zero out the upper bits
AAAABBBB
&
00001111
=
0000BBBB

manners of representing deltas

each "delta" represents the number of same bits in a row, that much is pretty clear. how do we represent them in bits though? the first obvious thing is just to store the count in each section of size bit-depth. this comes with a couple issues. we waste as many bits as our bit_depth anytime we have to store a delta larger than the max integer, we waste bit_depth many bits if the file starts with a 0, and again if it ends in a 1.

an alternative way

i haven't actually used this method yet, but it's been on my mind a while so i'll describe it here:
in each byte, store the bit this delta represents in the highest order bit. now you don't waste bits in all those cases i described before :)

you could store other information in that bit too, like instead of what bit this delta refers to, you could mark whether this value specifies some multiple of bit_depth-1 and store enormously lengthy homogeneous fields in just two bytes. you'd still be wasting space in those scenarios i mentioned before, but it could save huge amounts of total space on particular kinds of files

conclusion

save space by counting. ez. huge W. who knew file compression for very particular kinds of files was possible just by using pre-school level logic and techniques. lol