11/30/00

Adventures in Kanjiland

This Japanese character has caused me a lot of heartache over the past week. I have no idea what, if anything, it means, but it has a peculiar property. Japanese characters, especially those of the kanji variety like this one, are challenging to represent on a computer. To a Western program, the two bytes that encode this particular character look like an accented letter "o" (such as ò) followed by a backslash.
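To make the byte-level view concrete, here is a small Python sketch. It rests on three assumptions that are not stated in the post: that the character is 表 (which fits the "hyou" reading), that it is stored in Shift JIS (Python's "cp932" codec is the Windows flavor), and that the "Western program" reads bytes as DOS code page 437.

```python
# Assumption: the troublesome character is 表, stored as Shift JIS (cp932).
raw = "表".encode("cp932")
print(raw.hex())  # 955c

# Assumption: a Western program reads the same two bytes as code page 437.
western_view = raw.decode("cp437")
print([hex(ord(c)) for c in western_view])  # ['0xf2', '0x5c']
```

Code point 0xf2 is ò and 0x5c is the backslash, so one kanji turns into two unrelated Western characters.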

The backslash is what made my life difficult. Many programming languages, including JavaScript, which is used extensively on web pages, use the backslash character as an indication that the next character is special. For instance, a tab character is represented as "\t". The backslash causes the "t" to be interpreted as a tab, rather than as, well, a "t". The backslash itself disappears - it is only used as a signal that the next character is special.

So, if the backslash disappears, how does one print a backslash? The trick is to use the backslash to mark itself, or rather a second copy of itself, as special. So, "\\" is interpreted as "the next character is special, and it is a backslash, so output exactly one backslash".
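The same convention can be tried out in a few lines. This sketch uses Python rather than JavaScript, purely for illustration; both languages treat "\t" and "\\" the same way inside string literals.

```python
# Backslash + t in the source becomes a single tab character.
tab = "\t"
# Two backslashes in the source become one backslash in the string.
backslash = "\\"

print(len(tab), ord(tab))              # 1 9
print(len(backslash), ord(backslash))  # 1 92
```

In both cases the string holds exactly one character; the escaping backslash is gone by the time the program runs.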

Now we get to the crux of my problem. Since that Japanese character (which is pronounced "hyou" - one of the few things I know about it) incorporates what looks like a backslash in its encoding, my software was happily doubling it, thinking that it was just a standalone backslash instead of part of another larger character. This caused an extra backslash to appear in the title bar of the browser on output. This extra backslash is displayed as a yen symbol (¥) on a Japanese system. So Japanese users were getting random currency symbols in the middle of their words, whi$ch was definit$ely annoyin$g.
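A rough reconstruction of the bug, as a sketch only: the function name `double_backslashes` is hypothetical, the character is assumed to be 表, and the encoding is assumed to be Shift JIS (Python's "cp932" codec). The point is that a byte-by-byte search cannot tell a real backslash from the second byte of a kanji.

```python
def double_backslashes(data: bytes) -> bytes:
    """Escape every 0x5C byte without checking whether it is part of a kanji."""
    return data.replace(b"\\", b"\\\\")

# Assumption: the character is 表, whose Shift JIS bytes are 95 5c.
title = "表".encode("cp932")
escaped = double_backslashes(title)
print(escaped.hex())  # 955c5c

# Decoded as Shift JIS again, the result is the kanji plus a stray backslash,
# which a Japanese system draws as a yen sign.
decoded = escaped.decode("cp932")
print(len(decoded))  # 2
```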

OK, so I changed the software to look for these kanji characters, and when it found them, to output them as "Unicode escape sequences". Such a sequence is a way to tell the browser to display a specific character without considering it to be part of the program (that is, it is just treated as graphical output). When I did this, however, I got the character to the right.
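For readers who have not met them, a Unicode escape sequence is a backslash, the letter "u", and four hexadecimal digits giving the character's Unicode number. A minimal sketch of producing one follows; `to_unicode_escape` is a hypothetical helper, not the code from the actual fix, and the sample character 表 is an assumption.

```python
def to_unicode_escape(ch: str) -> str:
    """Return a \\uXXXX escape for one character in the Basic Multilingual Plane."""
    code_point = ord(ch)
    if code_point > 0xFFFF:
        # Characters beyond U+FFFF need a surrogate pair; not handled here.
        raise ValueError("character is outside the Basic Multilingual Plane")
    return "\\u{:04x}".format(code_point)

print(to_unicode_escape("表"))  # \u8868
```

The escape is only correct if the number really is the character's Unicode code point, which depends on knowing how the input was encoded in the first place.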

As you can see, they are not the same character. I didn't realize this right away, however, because my original example of the character was much, much smaller, and hey, "all those characters look alike". What was tripping me up was the standard used for encoding the character. There are at least four different and not very compatible ways to represent Japanese characters on a computer, and I had run afoul of this fact. While my data source was using "Shift JIS", I had assumed it was using Unicode (which is the default character set for Microsoft's Windows NT). I have a book an inch thick that describes these different encoding methods and how they interact - suffice it to say I know a lot more about them now than I ever wanted to know. (But I'm sure you know the feeling...)
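One plausible way such a mix-up produces a different character, sketched under assumptions: the character is taken to be 表, and the mistake is modeled as reading its two Shift JIS bytes as if they were a single 16-bit Unicode value. The post does not say exactly how the wrong character arose, so this is only an illustration of the general trap.

```python
# Assumption: the character is 表, stored as Shift JIS (Python's cp932 codec).
sjis_bytes = "表".encode("cp932")

# Wrong: treat the two Shift JIS bytes as one Unicode code point.
wrong_code_point = int.from_bytes(sjis_bytes, "big")
# Right: decode from Shift JIS first, then ask for the Unicode code point.
right_code_point = ord(sjis_bytes.decode("cp932"))

print(hex(wrong_code_point))  # 0x955c
print(hex(right_code_point))  # 0x8868
```

The two numbers name two different characters, so an escape built from the first one displays the wrong glyph.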

Once I realized my error, it was relatively straightforward to fix - in fact, I ended up removing almost all of my interim testing changes and making one or two relatively minor changes to continue doubling the backslash character, but only within a JavaScript string context where the extra backslash would be absorbed. This required a tweak to the way I output the title of the web page (the text that appears in the title bar of the web browser you are using now, for instance - it should say "Things We Found Dead In Our Pool (and some other stuff)").
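Why the context matters can be pictured with a toy model. Everything here is an assumption for illustration: `absorb_escapes` is a crude stand-in for whatever reads a script string (it only knows about doubled backslashes and is nothing like a real JavaScript parser), the character is assumed to be 表, and the encoding is assumed to be Shift JIS (Python's "cp932").

```python
def double_backslashes(data: bytes) -> bytes:
    """Escape every 0x5C byte, including ones inside a kanji."""
    return data.replace(b"\\", b"\\\\")

def absorb_escapes(data: bytes) -> bytes:
    """Toy model of a script string reader: each doubled backslash collapses to one."""
    return data.replace(b"\\\\", b"\\")

title = "表".encode("cp932")
doubled = double_backslashes(title)

# Plain page text: nothing absorbs the extra byte, so it stays visible.
as_page_text = doubled.decode("cp932")
# Script string: the reader absorbs the extra byte and the kanji survives.
as_script_string = absorb_escapes(doubled).decode("cp932")

print(len(as_page_text), len(as_script_string))  # 2 1
```

In this model the doubling is harmless wherever something later undoes it, and harmful everywhere else, which is why the escaping has to follow the output context.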

And there you have it - far too much detail about an interesting dilemma I've been dealing with this week. To celebrate my (probable) solution to this issue, my boss and I had sashimi for lunch. Yum! Now on to the next problem: "Mommy, why can't I save a file with a Japanese name?" You can bet backslashes will be involved again, although for a different reason this time...

You can respond to my ranting here.


A fool and his rant are soon parted.