FindinSite-CD: Search engine for CD/DVD
 

 

FindinSite-CD international character support


Browser setup
FindinSite-CD-Wizard usage
Testing in W2000/XP
Implementation details

Introduction

Specifying letters and characters in your text correctly is important, especially if your information is not in the English language. You have to ensure that your users will be able to read your information, and that FindinSite can search it.

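After this introduction, this page covers browser setup, FindinSite-CD-Wizard usage, testing in Windows 2000 and XP, and implementation details.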

Browsers and other viewers

Your first task is to ensure that users can read your files correctly. This is not a trivial task. Even though most web browsers can understand web pages written for different languages, they may not display the information correctly on screen if the user does not have the relevant fonts or language support installed on their computer.

One option here is to use a different file format such as Adobe PDF or Microsoft Word. Files in these formats may render better on a user's screen than a web page. Adobe PDF also lets you control the page layout much more closely.

File formats

The starting point for representing characters correctly is to use the correct file format.

For HTML web pages, the key is to specify and use the most appropriate character set for the characters you are using. Most web page editors should let you specify the character set and encode characters in the character set correctly.

For other formats, such as Adobe PDF or Microsoft Word, the authoring tools may give you some scope to specify which fonts etc are used when saving a file.


FindinSite character support

This page describes the character support of the various FindinSite products.
  • The FindinSite and Findex indexing programs support a wide range of characters.
  • FindinSite-CD-Wizard also lets you view and edit a search database in your computer's default character format.
  • The FindinSite runtimes display characters as best as possible for the user's browser and computer.
  • The FindinSite runtimes are supplied fully internationalised for many languages as described in the Language support page.
Also see the FindinSite-CD Japanese, Simplified Chinese and Traditional Chinese pages.


FindinSite and Findex character support

HTML Web pages
A character set (charset) indicates how the contents of a web page should be interpreted as characters. Different languages will need different character sets if they contain different characters. The Unicode character set is a superset of most other character sets, but is not usually used to encode characters in web pages because it makes the pages larger.

TXT text files
FindinSite and Findex find all characters in single byte ANSI and double byte Unicode plain text files.

PDF Adobe files
Please see the PDF Scanning page for details of the character encodings that are recognised by the FindinSite and Findex indexers.

DOC, XLS and PPT Office documents
FindinSite and Findex extract all character information from the supported Microsoft Office documents.

HTML Character set basics

For web pages, the FindinSite and Findex indexers recognise 32 different character sets (and many more name variants), including multi-byte character sets. FindinSite-CD can display all characters if your computer and browser are set up correctly with the right fonts available. FindinSite-JS and FindinSite-MS use the UTF-8 character set, giving the best chance of displaying characters correctly.

As an example of character sets, Central European languages might need to use a capital R character with an acute accent above it, Ŕ. In the "Windows 1250" character set, byte code 192 (0xC0 in hexadecimal) represents this character. If the default character set were used instead, then this byte would be displayed as a capital A with a grave accent, ie À.
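The effect of choosing the wrong character set can be reproduced in a few lines. This is only an illustrative sketch using Python's standard codecs (Python is not part of FindinSite); it decodes the same byte under both character sets mentioned above.

```python
# The single byte 0xC0 (decimal 192) from the example above.
raw = bytes([0xC0])

# Interpreted as Windows-1250 (Central European) it is R with acute accent.
as_cp1250 = raw.decode("cp1250")

# Interpreted as the default ISO 8859-1 it is A with grave accent.
as_latin1 = raw.decode("iso-8859-1")

print(f"Windows-1250: {as_cp1250} (U+{ord(as_cp1250):04X})")
print(f"ISO 8859-1:   {as_latin1} (U+{ord(as_latin1):04X})")
```

The byte is identical in both cases; only the declared character set decides which letter the reader sees.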

The default character set for web pages is called "ISO 8859-1". This is almost exactly the same as the standard Windows English character set called "Windows 1252".

Note that some character sets now include a Euro currency symbol (€) with byte code 128, ie 0x80 in hexadecimal (or 0x88 for the Windows-1251 charset). Older browsers may not be able to display this character.

Character normalisation

Internally, FindinSite and Findex use the Unicode character set. Each Unicode character uses 16 bits, ie a value from 0 to 65535, or U+0000 to U+FFFF to use the correct hexadecimal notation. The indexers translate characters from the page's character set into Unicode before processing them further. The search database uses a compact form of Unicode called UTF-8 that usually reduces the amount of storage space required.

As an example, the Euro currency symbol (€) is U+20AC, hexadecimal 0x20AC. The indexers translate byte code 128 in the "Windows 1252" character set into U+20AC. They also recognise other ways of specifying characters, using 'character references'. The Euro symbol may also be specified as a string &euro;, as a decimal number &#8364; or as a hexadecimal number &#x20AC;.
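As a rough illustration of this normalisation step, the following Python sketch maps the Windows-1252 byte and the three character reference forms onto the same Unicode character, and then shows the UTF-8 form. It uses Python's standard library only and is not the indexers' own code.

```python
import html

# Byte code 128 (0x80) in Windows-1252 is the Euro symbol.
from_byte = b"\x80".decode("cp1252")

# The three character reference forms: named, decimal and hexadecimal.
references = ["&euro;", "&#8364;", "&#x20AC;"]
from_references = [html.unescape(ref) for ref in references]

# Compact UTF-8 storage form of the same character (three bytes here).
utf8_form = from_byte.encode("utf-8")

print(f"U+{ord(from_byte):04X}", from_references, utf8_form.hex())
```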

Specifying character sets

You specify the character set for a web page using the following line, which must appear in the HEAD section of a web page. Replace ISO-8859-1 with an appropriate character set string. The list of character sets supported by the indexers is given below.
View the source of this page to see an example of this tag.
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
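If you want to check which character set a page declares, a small script can help. The sketch below is an assumption-laden example, not how the indexers work: the function name `declared_charset`, the regular expression and the 4096-byte search window are all made up for illustration, and the fallback mirrors the ISO-8859-1 default described above.

```python
import re

# Hypothetical pattern: finds charset=... inside a META tag, quoted or not.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:\-]+)""",
    re.IGNORECASE,
)


def declared_charset(html_bytes, default="ISO-8859-1"):
    """Return the charset string declared near the top of a page."""
    match = META_CHARSET.search(html_bytes[:4096])
    return match.group(1).decode("ascii") if match else default


page = (
    b'<HTML><HEAD><META HTTP-EQUIV="Content-Type" '
    b'CONTENT="text/html; charset=Windows-1250"></HEAD>'
    b"<BODY>\xc0</BODY></HTML>"
)
charset = declared_charset(page)
print(charset, page.decode(charset))
```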

Supported character sets

The FindinSite and Findex indexers recognise the following character sets. Any of the listed charset string names are recognised in a META Content-Type tag. The Windows code page is also listed, for your information.

Character set name | charset strings | Windows code page
Central European (Windows) | Windows-1250, x-cp1250 | 1250
Cyrillic (Windows) | Windows-1251, x-cp1251 | 1251
Western | Windows-1252, ANSI_X3.4-1968, ANSI_X3.4-1986, ascii, cp367, csASCII, IBM367, ibm819, ISO_646.irv:1991, ISO646-US, iso-ir-6, us, us-ascii, x-ansi | 1252
Greek (Windows) | Windows-1253 | 1253
Turkish (Windows) | Windows-1254 | 1254
Hebrew (ISO-logical) | Windows-1255 | 1255
Arabic (Windows) | Windows-1256, cp1256 | 1256
Baltic (Windows) | Windows-1257 | 1257
Vietnamese | Windows-1258 | 1258
ISO 8859-1 | ISO-8859-1, cp819, csISOLatin1, ibm819, iso_8859-1, iso_8859-1:1987, iso8859-1, iso-ir-100, l1, latin1 | 1252
ISO 8859-2 | ISO-8859-2, csISOLatin2, iso_8859-2, iso_8859-2:1987, iso8859-2, iso-ir-101, l2, latin2 | 28592
ISO 8859-3 | ISO-8859-3, csISOLatin3, ISO_8859-3, ISO_8859-3:1988, iso-ir-109, l3, latin3 | 28593
ISO 8859-4 | ISO-8859-4, csISOLatin4, ISO_8859-4, ISO_8859-4:1988, iso-ir-110, l4, latin4 | 28594
ISO 8859-5 | ISO-8859-5, csISOLatin5, csISOLatinCyrillic, cyrillic, ISO_8859-5, ISO_8859-5:1988, iso-ir-144, l5 | 28595
ISO 8859-6 | ISO-8859-6, arabic, csISOLatinArabic, ECMA-114, ISO_8859-6, ISO_8859-6:1987, iso-ir-127 | 28596
ISO 8859-7 | ISO-8859-7, csISOLatinGreek, ECMA-118, ELOT_928, greek, greek8, ISO_8859-7, ISO_8859-7:1987, iso-ir-126 | 28597
ISO 8859-8 | ISO-8859-8, iso-8859-8-i, logical, csISOLatinHebrew, hebrew, ISO_8859-8, ISO_8859-8:1988, iso-ir-138, visual | 28598
ISO 8859-9 | ISO-8859-9, csISOLatin5, ISO_8859-9, ISO_8859-9:1989, iso-ir-148 | 28599
ISO 8859-15 | ISO-8859-15, csISOLatin9, ISO_8859-15, l9, latin9 | 28605
Japanese (Shift-JIS) | shift_jis, csShiftJIS, csWindows31J, ms_Kanji, shift-jis, x-ms-cp932, x-sjis | 932
Japanese (JIS) | csISO2022JP, iso-2022-jp, _iso-2022-jp, _iso-2022-jp$sio | -
Japanese (EUC) | euc-jp, csEUCPkdFmtJapanese, Extended_UNIX_Code_Packed_Format_for_Japanese, x-euc, x-euc-jp | -
Simplified Chinese (GB2312) | gb2312, chinese, CN-GB, csGB2312, csGB231280, csISO58GB231280, GB_2312-80, GB231280, GB2312-80, GBK, iso-ir-58 | 936
Simplified Chinese (HZ) | HZ-GB-2312 | Shifted 936
Traditional Chinese (BIG5) | big5, cn-big5, csbig5, x-x-big5 | 950
Korean (KSC5601) | ks_c_5601-1987, csKSC56011987, iso-ir-149, korean, ks_c_5601, ks_c_5601_1987, ks_c_5601-1989, KSC_5601, KSC5601 | 949
Korean (EUC) | euc-kr, cseuckr | 51949
Hebrew (DOS) | dos-862 | 862
Thai (Windows) | Windows-874, DOS-874, iso-8859-11, TIS-620 | 874
UTF-8 | UTF-8, unicode-1-1-utf-8, unicode-2-0-utf-8, x-unicode-2-0-utf-8 | 65001
Unicode | Unicode, utf-16 | 1200
Unicode (Big Endian) | UnicodeFEFF | 1201

The indexers automatically detect pages that are encoded in the UTF-8, Unicode or UnicodeFEFF character sets when they start to read a page. However, the character set used for decoding may be overridden using the above META character set tag.
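This page does not say how the automatic detection works. One common technique, shown here purely as an assumption, is to look for a byte order mark (BOM) at the very start of the file. The function name `sniff_bom` is hypothetical; the returned names follow the charset strings in the table above.

```python
import codecs


def sniff_bom(data):
    """Guess the encoding from a leading byte order mark, if there is one.

    Assumption: BOM sniffing only. The real indexers may use other checks.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "Unicode"        # little endian, Windows code page 1200
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UnicodeFEFF"    # big endian, Windows code page 1201
    return None                 # no BOM: fall back to the META tag or default


print(sniff_bom("<html>".encode("utf-8-sig")))
print(sniff_bom(b"<html>"))
```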

Email support if you would like any other character sets supported.


Viewing characters in FindinSite-CD

In general, if someone is using FindinSite-CD in a browser to search for text in their "own" language then FindinSite-CD should work as expected.

To be more precise, your customers must have a computer set up as follows for satisfactory use of FindinSite-CD.

  1. Install the appropriate fonts for the language you want to display.
  2. If you run a version of Windows that can change locales, you need to change the default system locale to match the language.
  3. The browser must be set up to use a suitable font for pages in the desired language.
  4. The browser's Java fonts may need to be altered.
Windows 2000 and XP may be set up so that "western" developers can see non-Western characters, both in FindinSite-CD-Wizard and FindinSite-CD running in a browser. See below for instructions on setting up Windows 2000 and XP to cover points 1 and 2 above, so that FindinSite-CD-Wizard runs satisfactorily.

Internet Explorer

Windows Internet Explorer will display non-Western characters in its main browser window and the Microsoft Java VM will display and work with non-Western characters.

In Windows 2000 the default system locale must be set appropriately for this to work, eg set the default system locale to Japanese if you want to use FindinSite-CD to search for Japanese words. You must also install the Japanese Input Locale if you want to type in Japanese characters. In Windows XP, these steps do not seem to be necessary.

In Internet Explorer 5+, you can change the fonts that are used to display non-Western HTML. Select "Tools+Internet Options...". In the "General" tab press the "Fonts..." button. Select the character set type (eg Japanese) and then select a "Web page font" and a "Plain text font".

If you run FindinSite-CD in Internet Explorer in a computer that is not set up appropriately, FindinSite-CD will not display characters correctly.

Note: when running Internet Explorer in Windows 95, 98 and Me, you may well be able to view pages with non-Western characters by installing the appropriate language plug-in.

Sun Java VM

The Sun Java VM is used by most browsers, and can be used by Windows Internet Explorer. The Sun Java VM can display non-Western characters properly on Western computers.

VM 1.5 or later

To ensure that the right fonts are used for your language, simply open the Control Panel "Regional and Language Settings". In the "Regional Options" tab, select your language and country and press OK. (In some cases you may need to select "Supplemental language support" checkboxes in the "Language" tab, and appropriate Code pages in the "Advanced" tab.)

Restart the browser.

VM 1.4 and earlier

The Sun Java VM can display non-Western characters properly on Western computers, if the correct font properties are selected; you may be able to enter non-Western characters and search for them successfully.

If your computer is non-Western, then the Sun Java VM installation may have selected the correct font file already.  If not, then follow these instructions:

First, find where the Java VM plug-in is installed.  On Windows computers, this might be in C:\Program files\Java\j2re1.4.2_01\.  Move to the lib\ sub-directory.

The file font.properties determines which font the Java VM uses. Save a copy of the current file, eg to font.properties.save. The table below lists some alternative font properties files you could use. For example, to use Japanese fonts, copy font.properties.ja to font.properties.

Restart the browser.

Language | Font properties file
Western | font.properties
Japanese | font.properties.ja
Korean | font.properties.ko
Simplified Chinese | font.properties.zh
Traditional Chinese | font.properties.zh_TW
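The save-then-copy steps for VM 1.4 and earlier could be scripted along the following lines. This is a hedged sketch: the function name `select_font_properties` is made up, and you must pass in the lib directory of your own Java installation, since its location varies.

```python
import shutil
from pathlib import Path


def select_font_properties(lib_dir, variant):
    """Back up font.properties, then replace it with font.properties.<variant>.

    lib_dir is the Java VM's lib sub-directory; variant is eg "ja" or "zh_TW".
    """
    lib = Path(lib_dir)
    current = lib / "font.properties"
    backup = lib / "font.properties.save"
    source = lib / f"font.properties.{variant}"
    if not source.is_file():
        raise FileNotFoundError(source)
    # Keep the first backup so the original Western file is never lost.
    if current.exists() and not backup.exists():
        shutil.copyfile(current, backup)
    shutil.copyfile(source, current)
    return current
```

For example, `select_font_properties(lib_dir, "ja")` selects the Japanese font file. Restart the browser afterwards, as described above.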

Scanning and viewing characters in FindinSite-CD-Wizard

FindinSite-CD-Wizard will scan all character sets correctly when running in most recent Windows platforms, ie Windows 95, Windows 98, Windows Me, Windows NT 4, Windows 2000, Windows XP and later. It works in all language versions. All the messages are in English.

However you will only be able to view characters correctly in FindinSite-CD-Wizard if Windows is set up appropriately for the characters that you are trying to display. Nonetheless, you can still edit "undisplayable" characters. So you can edit Japanese characters on a Western PC - if you are careful.

If a character cannot be displayed using the default system code page (character set), FindinSite-CD-Wizard does one of two things:

  • In the Words list (and other places) FindinSite-CD-Wizard displays the character as a question mark (?). When you select a word in the words list, FindinSite-CD-Wizard also displays the word beneath the "Key" symbols. It uses the font that you have selected for printing to display the character. It is possible therefore that the character will be displayed correctly here, while not displayed correctly in the main words list.
    Windows 2000 and XP: Character not displayed correctly

    In this example, a Chinese character, U+5BFC 导, is listed as a word in the Words tab. FindinSite-CD-Wizard is being run in Windows 2000. Windows 2000 is not using Chinese as its system default locale so FindinSite-CD-Wizard cannot display this character properly in the words list. Therefore a ? is displayed instead in the main words list.

    However the chosen printer font, Simsun, can display this character. So when the character is selected in the words list, the character is displayed correctly under the "Key".

    In the western versions of Windows 95, Windows 98 and Windows Me you will never be able to view this character properly.

  • For the Description and various entries in the Pages tab, FindinSite-CD-Wizard sees if any characters in the string cannot be displayed in the default code page. If there are any problems, FindinSite-CD-Wizard displays the entire string in UTF-8. This replaces any non-ASCII characters with two or three other characters. "UTF-8" is shown in the FindinSite-CD-Wizard status bar.

    You can edit the UTF-8 string - carefully. Suppose that the first character in the description is a Japanese character that cannot be edited normally. FindinSite-CD-Wizard uses its UTF-8 equivalent, ie three characters. If you delete the first of these characters, FindinSite-CD-Wizard displays the following message in the status bar below: "Could not decode the edit box string". If you go on to delete the other two characters, FindinSite-CD-Wizard will be able to decode the UTF-8 again and so the status bar error message will go away.

    Windows 2000 and XP (Japanese): Editing Japanese characters

    This example shows FindinSite-CD-Wizard successfully showing and editing Japanese characters. The selected title in the Pages tab is shown correctly both in the list box above and the edit box below.

    For this example, FindinSite-CD-Wizard was run in Windows 2000 with Japanese as the default system locale.

    Windows 2000 and XP (Japanese): Editing Chinese characters

    In this example showing the Pages tab, FindinSite-CD-Wizard is again being run in Windows 2000 with Japanese as the default system locale.

    However, in this case, FindinSite-CD-Wizard is trying to display some Chinese characters. Some Chinese characters can be displayed in the Japanese default system locale, but not all.

    In the Pages list, the page title is shown with any unrecognised characters changed into question marks. The second, fourth and last characters are shown this way.

    However, in the Title box below, the page title is all shown in UTF-8 form. You can edit the title - in UTF-8 form - in this box.

    Windows 2000 and XP (Chinese): Editing Chinese characters

    This example shows the same Chinese characters when FindinSite-CD-Wizard is run in Windows 2000 with Chinese as the default system locale. All the characters are displayed properly and can be edited easily.

    Windows 98 (English): Editing Chinese characters

    This example shows the same Chinese characters when FindinSite-CD-Wizard is run in Western Windows 98. None of the characters can be displayed properly in the standard font, so all the non-space characters appear as question marks.

    The Title box again shows the page title in UTF-8 form. Even though exactly the same UTF-8 characters are output, Windows displays them differently in the default Western character set font.
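    The UTF-8 fallback and the "Could not decode" situation described above can be imitated with Python's standard codecs. This is only a sketch of the idea, assuming a Western (Windows-1252) default code page; it is not FindinSite-CD-Wizard's own code.

```python
title = "日本語"

# Each of these Japanese characters takes three bytes in UTF-8.
utf8_bytes = title.encode("utf-8")

# What a Western code page shows for those bytes: three unrelated
# characters for every original character.
western_view = utf8_bytes.decode("cp1252", errors="replace")

# Deleting only the first of the three bytes leaves invalid UTF-8.
damaged = utf8_bytes[1:]
try:
    damaged.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# Deleting all three bytes of the first character is valid again.
repaired = utf8_bytes[3:].decode("utf-8")

print(western_view, decodable, repaired)
```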

In conclusion, it is possible to use FindinSite-CD-Wizard to edit words and pages in any recent version of Windows. However, if you need to do a lot of editing, it is best if you use a version of Windows that can display the characters you are working with.


Testing in Windows 2000 and XP

"Western" FindinSite-CD developers who wish to work with non-Western character set pages will find Windows 2000 and XP very useful.

Windows 2000 and XP have features that let you test your non-Western FindinSite-CD implementation fully. The first step is to install the locales of interest. From the Control Panel, select the "Regional Settings" applet. In the "General" tab, check the "Language settings for the system". Make sure that all the locales of interest are installed. If necessary, press the "Advanced" button to see what Code Pages are installed.

In FindinSite-CD-Wizard, you should now be able to File+Print all the characters in your search database. You may well need to change the print font, in the "File+Print options..." dialog box. The following fonts may be useful: "Microsoft Sans Serif", "Lucida Sans Unicode" or "Arial Unicode MS" might display most characters, "MS Mincho" displays Japanese and "Simsun" displays Chinese. You should also now be able to see the selected Word displayed correctly under the "Key".

FindinSite-CD-Wizard will not yet be able to display and edit non-Western characters easily (although you can view and edit in UTF-8). To make this work you will have to set the default system locale appropriately. In the "Regional Settings" "General" tab, press the "Set default" button. Choose the desired locale. You will probably have to reboot to bring this change into effect.
Note that changing "your locale" in the "Regional Settings" "General" tab is usually not sufficient to make FindinSite-CD-Wizard work well.

As a final step, you will need to add the appropriate input locales. In the "Regional Settings" box, select the "Input Locales" tab. Add any locales that you need. Each input locale has different ways of entering characters, called Input Method Editors (IME).

To run FindinSite-CD in a browser in Windows 2000 and XP, see the instructions above.


Unicode implementation details

Unicode and UTF-8
All characters are initially translated from the file's character representation into Unicode. A Unicode character is stored in 2 bytes. However, the words are stored in the FindinSite search database in UTF-8 format. This usually saves space, as it stores most "western" characters in a single byte. However, other characters take up 2 or 3 bytes in UTF-8 format.
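The storage sizes mentioned above are easy to check. The following Python sketch (illustrative only; the helper name `storage_sizes` is made up) compares the 2-byte Unicode form with the UTF-8 form for three sample characters.

```python
def storage_sizes(ch):
    """Return (bytes as 16-bit Unicode, bytes as UTF-8) for one character."""
    return len(ch.encode("utf-16-le")), len(ch.encode("utf-8"))


for ch in ("A", "é", "日"):
    two_byte, utf8 = storage_sizes(ch)
    print(f"U+{ord(ch):04X}: {two_byte} bytes as Unicode, {utf8} as UTF-8")
```

Plain Western text therefore shrinks to about half its 2-byte size in UTF-8, while Japanese or Chinese text grows from 2 to 3 bytes per character.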
Canonicalisation
Canonicalisation means translating characters into a "standard form". For example, the Japanese character set has full width latin letters, eg a full width capital letter A. This character has a different code to the standard latin A. The indexers' canonicalisation process converts any full width characters into the standard latin equivalent. This means that a search for a word that contains A will correctly match a Japanese full width A.

The indexers convert all half-width Katakana and Hangul characters into their standard width character codes. Many other useful character code translations are also done.

The FindinSite runtimes also perform these same translations, so that if you enter a Japanese full width A, the search process correctly matches the latin A.
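The exact translation tables used by FindinSite are not listed here. As a stand-in, Unicode's NFKC compatibility normalisation performs similar width folding, so the following Python sketch uses it as an assumption to show the effect; the function name `canonicalise` is hypothetical.

```python
import unicodedata


def canonicalise(text):
    """Fold width variants to their standard forms using NFKC.

    Assumption: NFKC approximates, but is not identical to, the
    indexers' own canonicalisation.
    """
    return unicodedata.normalize("NFKC", text)


full_width_a = "\uFF21"        # full width capital letter A
half_width_ka = "\uFF76"       # half-width Katakana KA

print(canonicalise(full_width_a), canonicalise(half_width_ka))
```

Because both the indexed text and the search text pass through the same folding, a full width A in either place ends up matching a latin A.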

Possible improvement: Translate traditional Chinese characters into their simplified Chinese equivalents when doing searches.

Word Splitting
FindinSite and Findex split up text into words. Words contain only letters and numbers, so any other characters break the text into words.

Each non-latin character (ie non-Western and non-Arabic) forms a separate word, as described in the next section.

Non-western characters
FindinSite and Findex treat all non-Western and non-Arabic characters as single words. For example, the three characters in the word "Japanese" (日本語) are separate words, 日, 本 and 語. However, if you search for 日本語 then FindinSite will effectively put double quotes around these characters, so that only instances of these three characters together will be found. If you want to find all instances of 日, 本 and 語 on a page, then search for 日 本 語, ie with spaces in between.

This approach is used because most non-Western languages do not use white space characters to indicate breaks between words.

A character is defined as being Western if its Unicode code is less than or equal to U+02A8, or if it is in the Unicode range U+1E00 to U+1EF9 inclusive.
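The word splitting rules can be sketched as follows. This is a simplified Python illustration, not the indexers' code: the function names are made up, Python's `isalnum` stands in for "letters and numbers", and the Arabic exception is not modelled because its character range is not given on this page.

```python
def is_western(ch):
    """Western means U+0000..U+02A8 or U+1E00..U+1EF9, as defined above."""
    cp = ord(ch)
    return cp <= 0x02A8 or 0x1E00 <= cp <= 0x1EF9


def split_words(text):
    """Split text into words; each non-Western letter is its own word."""
    words, current = [], []
    for ch in text:
        if ch.isalnum() and is_western(ch):
            current.append(ch)          # extend the current Western word
            continue
        if current:                     # anything else ends that word
            words.append("".join(current))
            current = []
        if ch.isalnum():                # non-Western letter: single word
            words.append(ch)
    if current:
        words.append("".join(current))
    return words


print(split_words("FindinSite 日本語 v2"))
```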

HTML Tag and Tag Attribute names
Note that all HTML tag names and HTML tag attribute names must be in Western characters, in the Unicode range U+0000 to U+00FF inclusive. However tag attribute values can be in Unicode. For example, the following line is accepted by the indexers:
    <META NAME="description" CONTENT="日本語">
In this example, META is a tag name, and NAME and CONTENT are tag attribute names. Note that the tag attribute values, "description" and "日本語", can use non-Western characters.

Although web page names and target frame names can include any characters, it is recommended that they be in Western characters.

Lower Case
FindinSite and Findex convert words to lower case when matching words in the search database. Details of the lower case conversions are available on request.
Stop words
Currently there are no non-Western stop word files.
FindinSite-CD Word highlighting
Word highlighting only occurs for HTML pages in reasonably recent browsers.

For Microsoft Internet Explorer 4 or later, the browser decodes the character sets before word highlighting. This makes it possible to do word highlighting whatever the character set, ie word highlighting works in IE4+.

Word highlighting in Netscape Navigator 4 or similar only works if the character set is "iso-8859-1" or "Windows-1252". For other character sets, word highlighting is not attempted because FindinSite-CD does not decode the charset at runtime.

All site Copyright © 1996-2011 PHD Computer Consultants Ltd, PHDCC

Last modified: 8 February 2006.
