Attached is my prototype Huffman text encoder. I'm posting it here in case this proves useful to anyone, or if anyone has any idea of how to improve upon this concept. This is a rough BETA of this tool.
This demo achieves compression ratios as high as %60 for the kind of character frequencies present in plain english, by way of a dynamic frequency-table based Huffman codec. This will allow game authors to save valuable space should they have the need to store lots of textual data.
Use
The tool takes an input file that is a series of name/string pairs, kind of like a c source file. Several comment styles are also supported, as are several string styles.
Code: Select all
/* this is a comment */
// this is too
# and so is this
example1 = "hello compressed world"
example2 = 'lorem ipsum dolor sit amet.'
example3 = """sed iaculis felis ac augue."""
Additionally, you may specify a variable called 'font' which defines what character values will be used in the codec, in place of ascii values. Of course you'll want this to match the tileset for the corresponding characters in your program; the first char in font will be mapped to zero, the second to one, and so forth. In case your font's tileset isn't at the start of your tileset in memory, there's a fix for that - read on.
Code: Select all
font = """ !"#$%&'()*+,-./0123456789;:<=>?@abcdefghijklmnopqrstuvwxyz{~}|"""
Code: Select all
_hprint(vram,&example1,0);
Notes and TODO
The encoder is written in the D Programming Language (see my signature), and may be tough for some folks to build as the language lacks a suite as nice as WINAVR. If this becomes popular enough I might be willing to be on the hook for binaries later on - the first one's a freebie and included in the .zip. That said, my code is far from optimal at this time.
String data is stored in PROGMEM to save on the rather scant SRAM space.
The codec gets fantastic results for large strings, but breaks down when you get to short bits of text. This is because the length is encoded as a whole byte at the start of the string itself. I may wind up packing all strings together into a big array, and code up the string byte and bit offsets as #defines instead.
The generated assembler code is a fair tradeoff between CPU consumption and PROGMEM storage. I could have stored the Huffman tree itself in memory, but I felt that traversal and constant changing of the Z address register would have cost too much run-time. I'm open to suggestions on how to optimize this, either by encoding the huffman table in SRAM or some other trick. That said, the code footprint is appalling and pretty much requires the developer to encode a lot of text before the benefits of compression become apparent.
Keep in mind, that no matter what solution is finally embraced for code generation, this will burn dozens of CPU cycles - far more than simple ASCII storage - when it comes to blitting text. Depending on how hungry your program is, a "stop the world and read" game design approach is recommended. As it happens, the rather limited screen resolution of the Uzebox pretty much demands this approach anyway.
My D parser generator, Enki, is also in use on this project. Because of this, the parser can be easily expanded to handle other kinds of input - so suggestions are welcome here. Offhand, I can't think of anything special that might come up for such a straightforward tool, but you never know.
Enjoy. Questions and Comments are welcome.