Huffman Text Encoder/Code Generator

Topics on software tools like TileStudio, comments on documentation and tutorials (or the lack of) should go here.
Post Reply
User avatar
pragma
Posts: 175
Joined: Sat Sep 20, 2008 2:16 am
Location: Washington, DC

Huffman Text Encoder/Code Generator

Post by pragma »

All,
Attached is my prototype Huffman text encoder. I'm posting it here in case this proves useful to anyone, or if anyone has any idea of how to improve upon this concept. This is a rough BETA of this tool.

This demo achieves compression ratios as high as %60 for the kind of character frequencies present in plain english, by way of a dynamic frequency-table based Huffman codec. This will allow game authors to save valuable space should they have the need to store lots of textual data.

Use

The tool takes an input file that is a series of name/string pairs, kind of like a c source file. Several comment styles are also supported, as are several string styles.

Code: Select all

/* this is a comment */
// this is too
# and so is this
example1 = "hello compressed world"
example2 = 'lorem ipsum dolor sit amet.'
example3 = """sed iaculis felis ac augue."""
The tool simply takes a single command line argument, which is the name of this file (e.g. huffman.exe text.txt). It then takes this input, and generates code which is substituted into a template.s and template.h file, which are in turn transformed into text.s and text.h files. These can then be used for development by compiling them directly into your program.

Additionally, you may specify a variable called 'font' which defines what character values will be used in the codec, in place of ascii values. Of course you'll want this to match the tileset for the corresponding characters in your program; the first char in font will be mapped to zero, the second to one, and so forth. In case your font's tileset isn't at the start of your tileset in memory, there's a fix for that - read on.

Code: Select all

font = """ !"#$%&'()*+,-./0123456789;:<=>?@abcdefghijklmnopqrstuvwxyz{~}|"""
Using the generated code is as simple as calling _hprint() from .c. The register mappings are also defined in the template, in case you wish to call it straight from assembler. The underscore in the name is intentional - as you can see, the call signature is something you may want to wrap with a friendlier hprint():

Code: Select all

_hprint(vram,&example1,0);
_hprint() expects the vram address of where you wish to begin printing the string, the address of the start of the string, and the offset in tile memory to the first char of your font. This allows you to blit a string anywhere in vram, using any Huffman encoded string, with an arbitrary font. Figuring out how all that works with respect to x/y screen coordinates is left as an exercise to the reader.

Notes and TODO
The encoder is written in the D Programming Language (see my signature), and may be tough for some folks to build as the language lacks a suite as nice as WINAVR. If this becomes popular enough I might be willing to be on the hook for binaries later on - the first one's a freebie and included in the .zip. That said, my code is far from optimal at this time.

String data is stored in PROGMEM to save on the rather scant SRAM space.

The codec gets fantastic results for large strings, but breaks down when you get to short bits of text. This is because the length is encoded as a whole byte at the start of the string itself. I may wind up packing all strings together into a big array, and code up the string byte and bit offsets as #defines instead.

The generated assembler code is a fair tradeoff between CPU consumption and PROGMEM storage. I could have stored the Huffman tree itself in memory, but I felt that traversal and constant changing of the Z address register would have cost too much run-time. I'm open to suggestions on how to optimize this, either by encoding the huffman table in SRAM or some other trick. That said, the code footprint is appalling and pretty much requires the developer to encode a lot of text before the benefits of compression become apparent.

Keep in mind, that no matter what solution is finally embraced for code generation, this will burn dozens of CPU cycles - far more than simple ASCII storage - when it comes to blitting text. Depending on how hungry your program is, a "stop the world and read" game design approach is recommended. As it happens, the rather limited screen resolution of the Uzebox pretty much demands this approach anyway.

My D parser generator, Enki, is also in use on this project. Because of this, the parser can be easily expanded to handle other kinds of input - so suggestions are welcome here. Offhand, I can't think of anything special that might come up for such a straightforward tool, but you never know. :)

Enjoy. Questions and Comments are welcome.
Attachments
huffman.zip
Includes huffman.exe in case you don't have a D development environment handy.
(237.65 KiB) Downloaded 682 times
User avatar
uze6666
Site Admin
Posts: 4801
Joined: Tue Aug 12, 2008 9:13 pm
Location: Montreal, Canada
Contact:

Re: Huffman Text Encoder/Code Generator

Post by uze6666 »

Thanks a lot for sharing. I'll have a detailed look when I can find some spare cycles!

Uze
Post Reply