Character Encodings

Last modified: 27 November 1997.

Page Contents: Character Encoding of Files, Displaying Characters, Hi Lab support, Supported Encodings

Version 2.2+ of the Hi HelpIndex Java applet supports different character encodings for the index and language files.

For the more technical of you, this facility is provided by the JavaSoft Java Development Kit (JDK) version 1.1. Not all browsers fully support JDK 1.1.

Character Encoding of Files

Files on your computer's disks or on a web server use a character encoding scheme to store text. Originally, most computer systems used just one byte (8 bits) to store a character, ie 256 possible values. However, there are far more than 256 characters in the alphabets of all the world's languages. The best solution nowadays is Unicode, which assigns each character a numeric code point. Code points are commonly encoded in two bytes (ie decimal 0-65535), with extensions to permit even more characters.
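As a rough illustration of code points and their two-byte storage, the sketch below assumes a modern Java runtime (Java 7 or later, not the JDK 1.1 used by the applet). The class name CodePointDemo and the helper utf16ByteLength are made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class CodePointDemo {
    /** Number of bytes the text occupies as big-endian UTF-16 without a byte-order mark. */
    static int utf16ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String alpha = "\u03B1";        // Greek small letter alpha, inside the 0-65535 range
        String beyond = "\uD83D\uDE00"; // code point U+1F600, outside the 0-65535 range

        System.out.println(alpha.codePointAt(0));    // 945
        System.out.println(utf16ByteLength(alpha));  // 2
        System.out.println(beyond.codePointAt(0));   // 128512
        System.out.println(utf16ByteLength(beyond)); // 4
    }
}
```

The second character shows the kind of extension the text mentions: a code point above 65535 is stored as a pair of two-byte units.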

To make matters worse, where characters are stored as one byte, not all computer systems agree what each character code represents. Virtually all agree that character code 65 represents a capital letter A. However, some DOS systems display character code 224 as the Greek letter alpha, while Windows 95 or 98 displays 224 as a lower case a with a grave accent.
Each of these different character sets (or character encodings) has a name. For example, Windows 95 or 98 uses ANSI. Windows NT/2000/XP uses Unicode internally but also accepts ANSI.
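The disagreement over character code 224 can be seen by decoding the same byte under two encodings. This is only a sketch: it assumes a modern Java runtime in which the DOS code page is available as "Cp437" and Windows ANSI as "Cp1252". The class Byte224Demo and the helper decodeByte are hypothetical names.

```java
import java.nio.charset.Charset;

final class Byte224Demo {
    /** Decodes a single byte value (0-255) using the named character encoding. */
    static String decodeByte(int value, String encoding) {
        byte[] stored = { (byte) value };
        return new String(stored, Charset.forName(encoding));
    }

    public static void main(String[] args) {
        String dos = decodeByte(224, "Cp437");      // DOS code page (assumed available)
        String windows = decodeByte(224, "Cp1252"); // Windows ANSI

        // Print Unicode values rather than glyphs, since the console may not show both.
        System.out.println("Cp437:  U+" + Integer.toHexString(dos.charAt(0)));     // U+3b1, Greek alpha
        System.out.println("Cp1252: U+" + Integer.toHexString(windows.charAt(0))); // U+e0, a with grave
    }
}
```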

Java uses Unicode to hold text characters. When a file with another character encoding is read, the characters have to be converted to Unicode. Before version 2.2 of Hi HelpIndex, characters were assumed to be in the Windows ANSI character encoding.

However, from version 2.2 onwards you can specify the character encoding of index files and language files. See the usage instructions. The character encodings supported by Java are listed below.

The default character set is "8859_1", which corresponds to the ISO 8859-1 character set and is nearly identical to Windows ANSI (the two differ only in character codes 128 to 159). Note carefully: even if a user's computer is a Macintosh, you will normally just use the default 8859_1 encoding, as this is the format the files will be in on your server. This is the case even if your server is a Unix machine.
You should only use a different character encoding if you really are storing your index file or language file in a different format.
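The conversion to Unicode on reading might look roughly like the sketch below. It is not the Hi HelpIndex source: the class EncodedFileReader and the method readText are hypothetical, and it assumes a modern Java runtime rather than JDK 1.1. An in-memory byte array stands in for an index or language file, and a missing encoding name falls back to "8859_1".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class EncodedFileReader {
    /** Encoding used when none is specified, as described in the text. */
    static final String DEFAULT_ENCODING = "8859_1";

    /** Reads the whole stream and converts its bytes to Unicode text. */
    static String readText(InputStream in, String encoding) {
        String name = (encoding == null || encoding.isEmpty()) ? DEFAULT_ENCODING : encoding;
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, Charset.forName(name))) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, as stored in ISO 8859-1 (one byte per character).
        byte[] stored = { 'c', 'a', 'f', (byte) 0xE9 };
        String text = readText(new ByteArrayInputStream(stored), null);
        System.out.println(text.length());                       // 4
        System.out.println(Integer.toHexString(text.charAt(3))); // e9
    }
}
```

Passing an explicit name such as "Cp1251" instead of null would decode the same bytes under that encoding.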

Note that Hi HelpIndex will always use "8859_1" for some of its internal data files.

Displaying Characters

Once Hi HelpIndex has read characters in your chosen character set, how will it display them?

Currently it just uses the default character sets for the user's computer. This may mean that a character cannot be displayed properly. Say your language file includes a Greek alpha character. Standard US English Windows 95 and 98 systems will not display this character, so it will appear as a square block instead.

If you know what you are doing, you can add other display fonts and character encodings to the JDK 1.1 runtime. From the JDK 1.1 "Internationalization" page, select the link to "Adding Fonts to the Java Runtime" and follow the instructions.
You need to set up a valid font.properties file so that Unicode characters can be converted to the correct display font. Various predefined Asian and other language font files are available as standard.

Hi Lab support

Note very carefully that Hi Lab currently only produces index and language files in the Windows ANSI character set.

So if you want to use a different character encoding, you will have to build these files from scratch yourself, or somehow translate and amend the Hi Lab output files.


Supported Encodings

Java says that these character encodings are supported. PHD has not checked that they all work.

8859_1 8859_2 8859_3 8859_4 8859_5 8859_6 8859_7 8859_8 8859_9

Big5 CNS11643 DBCS_ASCII DBCS_EBCDIC EUC EUCJIS GB2312 JIS JIS0208 JISAutoDetect KOI8_R KSC5601 MacArabic MacCentralEurope MacCroatian MacCyrillic MacDingbat MacGreek MacHebrew MacIceland MacRoman MacRomania MacSymbol MacThai MacTurkish MacUkraine MS874 SJIS UTF8 Unicode UnicodeBig UnicodeLittle

Cp037 Cp273 Cp277 Cp278 Cp280 Cp284 Cp285 Cp297 Cp420 Cp424 Cp437 Cp500 Cp737 Cp775 Cp838 Cp850 Cp852 Cp855 Cp856 Cp857 Cp860 Cp861 Cp862 Cp864 Cp865 Cp866 Cp868 Cp869 Cp870 Cp871 Cp874 Cp875 Cp918 Cp921 Cp922 Cp930 Cp933 Cp935 Cp937 Cp939 Cp942 Cp948 Cp949 Cp950 Cp964 Cp970 Cp1006 Cp1025 Cp1026 Cp1046 Cp1097 Cp1098 Cp1112 Cp1122 Cp1123 Cp1124 Cp1250 Cp1251 Cp1252 Cp1253 Cp1254 Cp1255 Cp1256 Cp1257 Cp1258 Cp1381 Cp1383 Cp33722

Listed but does not work

SingleByte
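Since the names above have not all been checked, one way to test a few of them on a particular machine is sketched below. It assumes a modern Java runtime (java.nio.charset.Charset does not exist in JDK 1.1), so the results may differ from what the applet sees in a browser. The class EncodingCheck and the method unsupported are hypothetical names.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class EncodingCheck {
    /** Returns the names from the argument list that this Java runtime does not support. */
    static List<String> unsupported(String... names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // A small sample of the names listed above; extend as needed.
        List<String> missing = unsupported("8859_1", "UTF8", "Cp1252", "KOI8_R", "SJIS", "SingleByte");
        System.out.println("Not supported here: " + missing);
    }
}
```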