Managing and modeling a lot of data in a usable way is a hard problem. In my current project, I have a lot of information from various sources about Japanese characters (kanji). Every character has a list of “readings”, or pronunciations. Every character has a list of basic meanings, and then some of the readings have additional (different) meanings.
Most characters are composite characters: they’re a combination of other characters with other meanings. Sometimes the meanings of the other characters are related; for example, “bird” + “tree” means to flock or gather together. Sometimes the connection isn’t immediately obvious, but it makes a good mnemonic, such as “woman” + “child”, which means to like or be fond of. Others seem completely unrelated, such as “new” + “axe”, which means place or location, but you can still form a reasonable mnemonic (e.g. “Using an axe, I make a new place for myself to live”).
To further complicate things, some of the components are not characters themselves, but “radicals”: combinations of strokes that, by themselves, don’t really mean anything but are present in many kanji. These radicals have names and meanings too, but often they’re more related to the shape or look of the radical than to any meaning they give the kanji of which they’re a part.
Then there’s all the stroke information about the characters. How do you draw this character? Which strokes go in which order? How many strokes are there? And what’s the best way to represent a “stroke” in a computer format, so I can display it later? And, finally, other miscellaneous tidbits of information, such as the approximate frequency with which this kanji appears, or the Unicode value of the character.
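To make the shape of the problem concrete, here is a rough sketch of how one character's record might look in Python. None of this is the format I've settled on: the class and field names (Kanji, Reading, Component) are made up for illustration, and representing a stroke as a list of (x, y) points is just one assumption about how strokes could be stored.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumption: a stroke is a polyline of (x, y) points; other encodings are possible.
Point = tuple[float, float]
Stroke = list[Point]


@dataclass
class Reading:
    """One pronunciation, plus any meanings specific to that reading."""
    text: str
    extra_meanings: list[str] = field(default_factory=list)


@dataclass
class Component:
    """A part of a composite character: another kanji or a radical."""
    symbol: str
    name: str
    is_radical: bool = False


@dataclass
class Kanji:
    """Hypothetical record gathering everything known about one character."""
    literal: str
    meanings: list[str]
    readings: list[Reading] = field(default_factory=list)
    components: list[Component] = field(default_factory=list)
    strokes: list[Stroke] = field(default_factory=list)  # in drawing order
    stroke_count: Optional[int] = None
    frequency: Optional[int] = None  # approximate rank, if known

    @property
    def codepoint(self) -> str:
        # Derived from the character itself, so it never has to be stored.
        return f"U+{ord(self.literal):04X}"


# The "woman" + "child" example; stroke shapes are left empty on purpose.
fond = Kanji(
    literal="好",
    meanings=["like", "be fond of"],
    readings=[Reading("コウ")],
    components=[Component("女", "woman"), Component("子", "child")],
    stroke_count=6,
)
```

Keeping components as their own small records means the same structure can describe both real characters and radicals, with a flag to tell them apart.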
Currently, my kanji quiz program has all this information, but it’s all scattered in dozens of different text files from numerous different sources. Every time a bit of information is needed, my program scans through the text files, slowly hunting for the thing it needs. This makes the program hideously slow in some cases (up to five or ten seconds to display all the information about a single character).
So, I’m working on a way to represent all this data in a consistent, easy-to-manipulate, easy-to-read format, preferably one that can eventually be stored in a database for fast retrieval. That should speed up my program by at least an order of magnitude (much less than a second to display the most complicated kanji).
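As a sketch of what the database side could look like, here is a minimal version using Python's built-in sqlite3 module. The schema, table names, and the add_kanji and lookup helpers are all assumptions for illustration, not a finished design; the point is only that an indexed lookup replaces scanning through text files.

```python
import sqlite3

# Hypothetical schema: one row per character, with lists split into child tables.
SCHEMA = """
CREATE TABLE kanji (
    literal      TEXT PRIMARY KEY,
    stroke_count INTEGER,
    frequency    INTEGER
);
CREATE TABLE meaning (
    literal TEXT NOT NULL REFERENCES kanji(literal),
    meaning TEXT NOT NULL
);
CREATE TABLE component (
    literal TEXT NOT NULL REFERENCES kanji(literal),
    part    TEXT NOT NULL,
    name    TEXT NOT NULL
);
CREATE INDEX meaning_by_literal ON meaning(literal);
CREATE INDEX component_by_literal ON component(literal);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and set up the tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_kanji(conn, literal, stroke_count, meanings, components):
    """Insert one character along with its meanings and (part, name) components."""
    conn.execute(
        "INSERT INTO kanji (literal, stroke_count) VALUES (?, ?)",
        (literal, stroke_count),
    )
    conn.executemany(
        "INSERT INTO meaning (literal, meaning) VALUES (?, ?)",
        [(literal, m) for m in meanings],
    )
    conn.executemany(
        "INSERT INTO component (literal, part, name) VALUES (?, ?, ?)",
        [(literal, part, name) for part, name in components],
    )
    conn.commit()


def lookup(conn, literal):
    """Collect everything stored about one character, or None if it is unknown."""
    row = conn.execute(
        "SELECT stroke_count, frequency FROM kanji WHERE literal = ?", (literal,)
    ).fetchone()
    if row is None:
        return None
    meanings = [m for (m,) in conn.execute(
        "SELECT meaning FROM meaning WHERE literal = ? ORDER BY rowid", (literal,)
    )]
    components = conn.execute(
        "SELECT part, name FROM component WHERE literal = ? ORDER BY rowid",
        (literal,),
    ).fetchall()
    return {
        "literal": literal,
        "stroke_count": row[0],
        "frequency": row[1],
        "meanings": meanings,
        "components": components,
    }


conn = open_db()
add_kanji(conn, "好", 6, ["like", "be fond of"], [("女", "woman"), ("子", "child")])
info = lookup(conn, "好")
```

Readings and stroke data would need tables of their own; they are left out here to keep the sketch short.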
But it’s a hard problem to figure out the best way to store all this information. I’ll probably be working on it for quite some time.