FindinSite-CD: Search engine for CD/DVD

 

FindinSite-CD international character support


Browser setup
FindinSite-CD-Wizard usage
Testing in W2000/XP
Implementation details

Introduction

Specifying letters and characters in your text correctly is important, especially if your information is not in the English language. You have to ensure that your users will be able to read your information, and that FindinSite can search it.

After this introduction, this page
  • describes the FindinSite-CD-Wizard, Findex, FindinSite-JS and FindinSite-MS character support,
  • gives hints on how to set up a browser to run FindinSite-CD,
  • gives advice on viewing characters in FindinSite-CD-Wizard,
  • suggests how to test non-Western characters, and
  • rounds off with some implementation details.

    Browsers and other viewers

    Your first task is to ensure that users can read your files correctly. This is not a trivial task. Even though most web browsers can understand web pages written for different languages, they may not display the information correctly on screen because the user does not have the relevant fonts or language support installed on their computer.

    One option here is to use a different file format such as Adobe PDF or Microsoft Word. Files in these formats may render better on a user's screen than a web page. Adobe PDF also lets you control the page layout much more closely.

    File formats

    The starting point for representing characters correctly is to use the correct file format.

    For HTML web pages, the key is to specify and use the most appropriate character set for the characters you are using. Most web page editors should let you specify the character set and encode characters in the character set correctly.

    For other formats, such as Adobe PDF or Microsoft Word, the authoring tools may give you some scope to specify which fonts etc are used when saving a file.


    FindinSite character support

    This page describes the character support of the various FindinSite products.
    • The FindinSite and Findex indexing programs support a wide range of characters.
    • FindinSite-CD-Wizard also lets you view and edit a search database in your computer's default character format.
    • The FindinSite runtimes display characters as best as possible for the user's browser and computer.
    • The FindinSite runtimes are supplied fully internationalised for many languages as described in the Language support page.
    Also see the FindinSite-CD Japanese, Simplified Chinese and Traditional Chinese pages.


    FindinSite and Findex character support

    HTML Web pages
    A character set (charset) indicates how the contents of a web page should be interpreted as characters. Different languages will need different character sets if they contain different characters. The Unicode character set is a superset of most other character sets, but is not usually used to encode characters in web pages because it makes the pages larger.

    TXT text files
    FindinSite and Findex find all characters in single byte ANSI and double byte Unicode plain text files.

    PDF Adobe files
    Please see the PDF Scanning page for details of the character encodings that are recognised by the FindinSite and Findex indexers.

    DOC, XLS and PPT Office documents
    FindinSite and Findex extract all character information from the supported Microsoft Office documents. Currently, only FindinSite-CD-Wizard indexes XLS files.

    HTML Character set basics

    For web pages, the FindinSite and Findex indexers recognise 32 different character sets (and many more name variants), including multi-byte character sets. FindinSite-CD can display all characters if your computer and browser are set up correctly with the right fonts available. FindinSite-JS and FindinSite-MS use the UTF-8 character set, giving the best chance of displaying characters correctly.

    As an example of character sets, Central European languages might need to use a capital R character with an acute accent above it, Ŕ. In the "Windows 1250" character set, byte code 192 (0xC0 in hexadecimal) represents this character. If the default character set were used instead, then this byte would be displayed as a capital A with a grave accent, ie À.
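
As a small illustration of the paragraph above, the following Python sketch decodes the same byte under both character sets. It uses Python's built-in "cp1250" and "iso-8859-1" codecs as stand-ins for the "Windows 1250" and default character sets; the helper name decode_byte is made up for this example and is not part of any FindinSite product.

```python
# Decode one byte under two different character sets.
BYTE = bytes([0xC0])  # byte code 192


def decode_byte(raw: bytes, charset: str) -> str:
    """Interpret raw bytes using the named character set (illustrative helper)."""
    return raw.decode(charset)


central_european = decode_byte(BYTE, "cp1250")     # "Windows 1250"
default_western = decode_byte(BYTE, "iso-8859-1")  # the default web page charset

print(central_european, default_western)
```

The byte is identical in both calls; only the declared character set changes which letter appears.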

    The default character set for web pages is called "ISO 8859-1". This is almost exactly the same as the standard Windows English character set called "Windows 1252".

    Note that some character sets now include a Euro currency symbol (€) with byte code 128, ie 0x80 in hexadecimal (or 0x88 for the Windows-1251 charset). Older browsers may not be able to display this character.

    Character normalisation

    Internally, FindinSite and Findex use the Unicode character set. Each Unicode character uses 16 bits, ie a value from 0 to 65535, or U+0000 to U+FFFF to use the correct hexadecimal notation. The indexers translate characters from the page's character set into Unicode before processing them further. The search database uses a compact form of Unicode called UTF-8 that usually reduces the amount of storage space required.

    As an example, the Euro currency symbol (€) is U+20AC, hexadecimal 0x20AC. The indexers translate byte code 128 in the "Windows 1252" character set into U+20AC. They also recognise other ways of specifying characters, using 'character references'. The Euro symbol may also be specified as a string &euro;, as a decimal number &#8364; or as a hexadecimal number &#x20AC;.
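
The normalisation described above can be sketched in Python. This is only an illustration of the idea, not the indexers' actual code: it assumes Python's "cp1252" and "cp1251" codecs for the Windows character sets and uses the standard html.unescape function to resolve character references. The name normalise_fragment is hypothetical.

```python
import html

EURO = "\u20ac"  # U+20AC


def normalise_fragment(raw: bytes, charset: str) -> str:
    """Decode bytes from the page's charset, then resolve character references."""
    return html.unescape(raw.decode(charset))


# Five different ways a page might carry the Euro symbol.
forms = [
    (bytes([0x80]), "cp1252"),   # byte code 128 in Windows 1252
    (bytes([0x88]), "cp1251"),   # byte code 0x88 in Windows 1251
    (b"&euro;", "ascii"),        # named character reference
    (b"&#8364;", "ascii"),       # decimal character reference
    (b"&#x20AC;", "ascii"),      # hexadecimal character reference
]
results = [normalise_fragment(raw, charset) for raw, charset in forms]

# Compact storage form: one Unicode character becomes three UTF-8 bytes here.
utf8_bytes = EURO.encode("utf-8")
```

All five inputs end up as the same Unicode character, which is what allows a single search term to match every form.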

    Specifying character sets

    You specify the character set for a web page using the following line, which must appear in the HEAD section of a web page. Replace ISO-8859-1 with an appropriate character set string. The list of character sets supported by the indexers is given below.
    View the source of this page to see an example of this tag.
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
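
For illustration, here is a minimal Python sketch of how a program might read the charset out of such a tag. It is an assumption about the general approach, not a description of how the FindinSite indexers are written. The class MetaCharsetFinder and function find_charset are invented names; the fallback to ISO-8859-1 follows the default described earlier on this page.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the charset from the first META Content-Type tag seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "content-type":
            return
        # CONTENT looks like "text/html; charset=ISO-8859-1".
        for part in attr.get("content", "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.strip().lower() == "charset" and value.strip():
                self.charset = value.strip()


def find_charset(page_source: str, default: str = "ISO-8859-1") -> str:
    """Return the declared charset string, or the default if none is declared."""
    finder = MetaCharsetFinder()
    finder.feed(page_source)
    return finder.charset or default
```

For example, find_charset applied to a page whose HEAD contains the tag above with charset=Windows-1250 returns "Windows-1250".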

    Supported character sets

    The FindinSite and Findex indexers recognise the following character sets. Any of the listed charset string names are recognised in a META Content-Type tag. The Windows code page is also listed, for your information.

    Character set name | charset strings | Windows code page
    Central European (Windows) | Windows-1250, x-cp1250 | 1250
    Cyrillic (Windows) | Windows-1251, x-cp1251 | 1251
    Western | Windows-1252, ANSI_X3.4-1968, ANSI_X3.4-1986, ascii, cp367, csASCII, IBM367, ibm819, ISO_646.irv:1991, ISO646-US, iso-ir-6, us, us-ascii, x-ansi | 1252
    Greek (Windows) | Windows-1253 | 1253
    Turkish (Windows) | Windows-1254 | 1254
    Hebrew (ISO-logical) | Windows-1255 | 1255
    Arabic (Windows) | Windows-1256, cp1256 | 1256
    Baltic (Windows) | Windows-1257 | 1257
    Vietnamese | Windows-1258 | 1258
    ISO 8859-1 | ISO-8859-1, cp819, csISOLatin1, ibm819, iso_8859-1, iso_8859-1:1987, iso8859-1, iso-ir-100, l1, latin1 | 1252
    ISO 8859-2 | ISO-8859-2, csISOLatin2, iso_8859-2, iso_8859-2:1987, iso8859-2, iso-ir-101, l2, latin2 | 28592
    ISO 8859-3 | ISO-8859-3, csISOLatin3, ISO_8859-3, ISO_8859-3:1988, iso-ir-109, l3, latin3 | 28593
    ISO 8859-4 | ISO-8859-4, csISOLatin4, ISO_8859-4, ISO_8859-4:1988, iso-ir-110, l4, latin4 | 28594
    ISO 8859-5 | ISO-8859-5, csISOLatin5, csISOLatinCyrillic, cyrillic, ISO_8859-5, ISO_8859-5:1988, iso-ir-144, l5 | 28595
    ISO 8859-6 | ISO-8859-6, arabic, csISOLatinArabic, ECMA-114, ISO_8859-6, ISO_8859-6:1987, iso-ir-127 | 28596
    ISO 8859-7 | ISO-8859-7, csISOLatinGreek, ECMA-118, ELOT_928, greek, greek8, ISO_8859-7, ISO_8859-7:1987, iso-ir-126 | 28597
    ISO 8859-8 | ISO-8859-8, iso-8859-8-i, logical, csISOLatinHebrew, hebrew, ISO_8859-8, ISO_8859-8:1988, iso-ir-138, visual | 28598
    ISO 8859-9 | ISO-8859-9, csISOLatin5, ISO_8859-9, ISO_8859-9:1989, iso-ir-148 | 28599
    ISO 8859-15 | ISO-8859-15, csISOLatin9, ISO_8859-15, l9, latin9 | 28605
    Japanese (Shift-JIS) | shift_jis, csShiftJIS, csWindows31J, ms_Kanji, shift-jis, x-ms-cp932, x-sjis | 932
    Japanese (JIS) | csISO2022JP, iso-2022-jp, _iso-2022-jp, _iso-2022-jp$sio | (none listed)
    Japanese (EUC) | euc-jp, csEUCPkdFmtJapanese, Extended_UNIX_Code_Packed_Format_for_Japanese, x-euc, x-euc-jp | (none listed)
    Simplified Chinese (GB2312) | gb2312, chinese, CN-GB, csGB2312, csGB231280, csISO58GB231280, GB_2312-80, GB231280, GB2312-80, GBK, iso-ir-58 | 936
    Simplified Chinese (HZ) | HZ-GB-2312 | Shifted 936
    Traditional Chinese (BIG5) | big5, cn-big5, csbig5, x-x-big5 | 950
    Korean (KSC5601) | ks_c_5601-1987, csKSC56011987, iso-ir-149, korean, ks_c_5601, ks_c_5601_1987, ks_c_5601-1989, KSC_5601, KSC5601 | 949
    Korean (EUC) | euc-kr, cseuckr | 51949
    Hebrew (DOS) | dos-862 | 862
    Thai (Windows) | Windows-874, DOS-874, iso-8859-11, TIS-620 | 874
    UTF-8 | UTF-8, unicode-1-1-utf-8, unicode-2-0-utf-8, x-unicode-2-0-utf-8 | 65001
    Unicode | Unicode, utf-16 | 1200
    Unicode (Big Endian) | UnicodeFEFF | 1201

    The indexers automatically detect pages that are encoded in the UTF-8, Unicode or UnicodeFEFF character sets when they start to read a page. However, the character set used for decoding may be overridden using the above META character set tag.
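
This page does not say how the automatic detection works. One common technique, shown here purely as an assumption, is to look for a byte order mark (BOM) at the start of the file. The Python sketch below uses the BOM constants from the standard codecs module and returns the charset names used on this page; sniff_unicode_charset is a hypothetical name.

```python
import codecs
from typing import Optional


def sniff_unicode_charset(data: bytes) -> Optional[str]:
    """Guess a Unicode encoding from a leading byte order mark, if there is one."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "Unicode"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "UnicodeFEFF"
    return None  # no BOM: fall back to the META tag or the default charset
```

A file with no BOM gives None, in which case a declared META charset (or the default) would be the only guide.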

    Email support if you would like any other character sets supported.


    Viewing characters in FindinSite-CD

    In general, if someone is using FindinSite-CD in a browser to search for text in their "own" language then FindinSite-CD should work as expected.

    To be more precise, your customers must have a computer set up as follows for satisfactory use of FindinSite-CD.

    1. Install the appropriate fonts for the language you want to display.
    2. If you run a version of Windows that can change locales, you need to change the default system locale to match the language.
    3. The browser must be set up to use a suitable font for pages in the desired language.
    4. The browser's Java fonts may need to be altered.
    Windows 2000 and XP may be set up so that "western" developers can see non-Western characters, both in FindinSite-CD-Wizard and FindinSite-CD running in a browser. See below for instructions on setting up Windows 2000 and XP to cover points 1 and 2 above, so that FindinSite-CD-Wizard runs satisfactorily.

    Internet Explorer

    Windows Internet Explorer will display non-Western characters in its main browser window and the Microsoft Java VM will display and work with non-Western characters.

    In Windows 2000 the default system locale must be set appropriately for this to work, eg set the default system locale to Japanese if you want to use FindinSite-CD to search for Japanese words. You must also install the Japanese Input Locale if you want to type in Japanese characters. In Windows XP, these steps do not seem to be necessary.

    In Internet Explorer 5+, you can change the fonts that are used to display non-Western HTML. Select "Tools+Internet Options...". In the "General" tab press the "Fonts..." button. Select the character set type (eg Japanese) and then select a "Web page font" and a "Plain text font".

    If you run FindinSite-CD in Internet Explorer in a computer that is not set up appropriately, FindinSite-CD will not display characters correctly.

    Note: when running Internet Explorer in Windows 95, 98 and Me, you may well be able to view pages with non-Western characters by installing the appropriate language plug-in.

    Sun Java VM

    The Sun Java VM is used by most browsers, and can be used by Windows Internet Explorer. The Sun Java VM can display non-Western characters properly on Western computers.

    VM 1.5 or later

    To ensure that the right fonts are used for your language, simply open the Control Panel "Regional and Language Settings". In the "Regional Options" tab, select your language and country and press OK. (In some cases you may need to select "Supplemental language support" checkboxes in the "Language" tab, and appropriate Code pages in the "Advanced" tab.)

    Restart the browser.

    VM 1.4 and earlier

    The Sun Java VM can display non-Western characters properly on Western computers, if the correct font properties are selected; you may be able to enter non-Western characters and search for them successfully.

    If your computer is non-Western, then the Sun Java VM installation may have selected the correct font file already.  If not, then follow these instructions:

    First, find where the Java VM plug-in is installed.  On Windows computers, this might be in C:\Program files\Java\j2re1.4.2_01\.  Move to the lib\ sub-directory.

    The file font.properties determines which font the Java VM uses. Save a copy of the current file, eg to font.properties.save. The table below lists some alternative font properties files you could use. For example, to use Japanese fonts, copy font.properties.ja to font.properties.

    Restart the browser.

    Western font.properties
    Japanese font.properties.ja
    Korean font.properties.ko
    Simplified Chinese font.properties.zh
    Traditional Chinese font.properties.zh_TW
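
The manual steps above (save a copy, then copy the language-specific file over font.properties) could be scripted. The following Python sketch is one possible way to do that; the function name select_font_properties is invented, and the lib directory must be supplied by the caller because its location varies between installations.

```python
import shutil
from pathlib import Path


def select_font_properties(lib_dir: Path, variant: str) -> Path:
    """Back up font.properties, then replace it with font.properties.<variant>.

    variant is a suffix from the table above, eg "ja", "ko", "zh" or "zh_TW".
    """
    current = lib_dir / "font.properties"
    backup = lib_dir / "font.properties.save"
    source = lib_dir / f"font.properties.{variant}"
    if not source.is_file():
        raise FileNotFoundError(source)
    # Keep the first backup so that repeated runs never overwrite the original.
    if current.is_file() and not backup.exists():
        shutil.copyfile(current, backup)
    shutil.copyfile(source, current)
    return current
```

For example, select_font_properties(Path(lib_directory), "ja") selects the Japanese fonts, where lib_directory is the lib\ sub-directory of your own Java VM installation. Restart the browser afterwards.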

    Scanning and viewing characters in FindinSite-CD-Wizard

    FindinSite-CD-Wizard will scan all character sets correctly when running in most recent Windows platforms, ie Windows 95, Windows 98, Windows Me, Windows NT 4, Windows 2000, Windows XP and later. It works in all language versions. All the messages are in English.

    However you will only be able to view characters correctly in FindinSite-CD-Wizard if Windows is set up appropriately for the characters that you are trying to display. Nonetheless, you can still edit "undisplayable" characters. So you can edit Japanese characters on a Western PC - if you are careful.

    If a character cannot be displayed using the default system code page (character set), FindinSite-CD-Wizard does one of two things:

    • In the Words list (and other places) FindinSite-CD-Wizard displays the character as a question mark (?). When you select a word in the words list, FindinSite-CD-Wizard also displays the word beneath the "Key" symbols. It uses the font that you have selected for printing to display the character. It is possible therefore that the character will be displayed correctly here, while not displayed correctly in the main words list.
      Windows 2000 and XP

      In this example, a Chinese character, U+5BFC 导, is listed as a word in the Words tab. FindinSite-CD-Wizard is being run in Windows 2000. Windows 2000 is not using Chinese as its system default locale so FindinSite-CD-Wizard cannot display this character properly in the words list. Therefore a ? is displayed instead in the main words list.

      However the chosen printer font, Simsun, can display this character. So when the character is selected in the words list, the character is displayed correctly under the "Key".

      In the western versions of Windows 95, Windows 98 and Windows Me you will never be able to view this character properly.

    • For the Description and various entries in the Pages tab, FindinSite-CD-Wizard sees if any characters in the string cannot be displayed in the default code page. If there are any problems, FindinSite-CD-Wizard displays the entire string in UTF-8. This replaces any non-ASCII characters with two or three other characters. "UTF-8" is shown in the FindinSite-CD-Wizard status bar.

      You can edit the UTF-8 string - carefully. Suppose that the first character in the description is a Japanese character that cannot be edited normally. FindinSite-CD-Wizard uses its UTF-8 equivalent, ie three characters. If you delete the first of these characters, FindinSite-CD-Wizard displays the following message in the status bar: "Could not decode the edit box string". If you go on to delete the other two characters, FindinSite-CD-Wizard will be able to decode the UTF-8 again and so the status bar error message will go away.

      Editing Japanese characters

      Windows 2000 and XP (Japanese): Editing Japanese characters

      This example shows FindinSite-CD-Wizard successfully showing and editing Japanese characters. The selected title in the Pages tab is shown correctly both in the list box above and the edit box below.

      For this example, FindinSite-CD-Wizard was run in Windows 2000 with Japanese as the default system locale.

      Editing Chinese characters

      Windows 2000 and XP (Japanese): Editing Chinese characters

      In this example showing the Pages tab, FindinSite-CD-Wizard is again being run in Windows 2000 with Japanese as the default system locale.

      However, in this case, FindinSite-CD-Wizard is trying to display some Chinese characters. Some Chinese characters can be displayed in the Japanese default system locale, but not all.

      In the Pages list, the page title is shown with any unrecognised characters changed into question marks. The second, fourth and last characters are shown this way.

      However, in the Title box below, the page title is all shown in UTF-8 form. You can edit the title - in UTF-8 form - in this box.

      Editing Chinese characters

      Windows 2000 and XP (Chinese): Editing Chinese characters

      This example shows the same Chinese characters when FindinSite-CD-Wizard is run in Windows 2000 with Chinese as the default system locale. All the characters are displayed properly and can be edited easily.

      Editing Chinese characters

      Windows 98 (English): Editing Chinese characters

      This example shows the same Chinese characters when FindinSite-CD-Wizard is run in Western Windows 98. None of the characters can be displayed properly in the standard font, so all the non-space characters appear as question marks.

      The Title box again shows the page title in UTF-8 form. Even though exactly the same UTF-8 characters are output, Windows displays them differently in the default Western character set font.
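
The UTF-8 editing behaviour described in this section can be imitated in a few lines of Python. This is a sketch of the general idea only, assuming a Western "cp1252" code page; show_as_utf8 and read_back are made-up names and do not reflect how FindinSite-CD-Wizard is written.

```python
def show_as_utf8(text: str, code_page: str = "cp1252") -> str:
    """Show text as its UTF-8 bytes, each byte drawn in the given code page."""
    return text.encode("utf-8").decode(code_page)


def read_back(edited: str, code_page: str = "cp1252") -> str:
    """Turn an edited display string back into text.

    Raises UnicodeDecodeError if the bytes are no longer valid UTF-8.
    """
    return edited.encode(code_page).decode("utf-8")


shown = show_as_utf8("日")  # one Japanese character becomes three characters

try:
    read_back(shown[1:])   # first of the three characters deleted
    decodable = True
except UnicodeDecodeError:
    decodable = False      # the "could not decode" situation
```

Deleting all three characters, or none of them, leaves a string that decodes cleanly; removing only part of a multi-byte sequence is what triggers the error.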

    In conclusion, it is possible to use FindinSite-CD-Wizard to edit words and pages in any recent version of Windows. However, if you need to do a lot of editing, it is best if you use a version of Windows that can display the characters you are working with.


    Testing in Windows 2000 and XP

    "Western" FindinSite-CD developers who wish to work with non-Western character set pages will find Windows 2000 and XP very useful.

    Windows 2000 and XP have features that let you test your non-Western FindinSite-CD implementation fully. The first step is to install the locales of interest. From the Control Panel, select the "Regional Settings" applet. In the "General" tab, check the "Language settings for the system". Make sure that all the locales of interest are installed. If necessary, press the "Advanced" button to see what Code Pages are installed.

    In FindinSite-CD-Wizard, you should now be able to File+Print all the characters in your search database. You may well need to change the print font, in the "File+Print options..." dialog box. The following fonts may be useful: "Microsoft Sans Serif", "Lucida Sans Unicode" or "Arial Unicode MS" might display most characters, "MS Mincho" displays Japanese and "Simsun" displays Chinese. You should also now be able to see the selected Word displayed correctly under the "Key".

    FindinSite-CD-Wizard will not yet be able to display and edit non-Western characters easily (although you can view and edit in UTF-8). To make this work you will have to set the default system locale appropriately. In the "Regional Settings" "General" tab, press the "Set default" button. Choose the desired locale. You will probably have to reboot to bring this change into effect.
    Note that changing "your locale" in the "Regional Settings" "General" tab is usually not sufficient to make FindinSite-CD-Wizard work well.

    As a final step, you will need to add the appropriate input locales. In the "Regional Settings" box, select the "Input Locales" tab. Add any locales that you need. Each input locale has different ways of entering characters, called Input Method Editors (IME).

    To run FindinSite-CD in a browser in Windows 2000 and XP, see the instructions above.


    Unicode implementation details

    Unicode and UTF-8
    All characters are initially translated from the file's character representation into Unicode. A Unicode character is stored in 2 bytes. However, the words are stored in the FindinSite search database in UTF-8 format. This usually saves space, as it stores most "western" characters in a single byte. However, other characters take up 2 or 3 bytes in UTF-8 format.
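
The space trade-off can be checked with a short Python sketch. It uses Python's "utf-16-le" codec as a stand-in for the 2-byte Unicode form; the helper name storage_sizes is invented for this example.

```python
def storage_sizes(ch: str) -> tuple:
    """Return (bytes as 16-bit Unicode, bytes as UTF-8) for one character."""
    return len(ch.encode("utf-16-le")), len(ch.encode("utf-8"))


# One ASCII letter, one accented letter, the Euro symbol and one Japanese character.
sizes = {ch: storage_sizes(ch) for ch in ["A", "é", "€", "日"]}

for ch, (unicode_bytes, utf8_bytes) in sizes.items():
    print(ch, unicode_bytes, utf8_bytes)
```

Plain ASCII text halves in size under UTF-8, while text that is mostly Japanese or Chinese grows from 2 to 3 bytes per character.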
    Canonicalisation
    Canonicalisation means translating characters into a "standard form". For example, the Japanese character set has full width latin letters, eg a full width capital letter A. This character has a different code to the standard latin A. The indexers' canonicalisation process converts any full width characters into the standard latin equivalent. This means that a search for a word that contains A will correctly match a Japanese full width A.

    The indexers convert all half-width Katakana and Hangul characters into their standard width character codes. Many other useful character code translations are also done.

    The FindinSite runtimes also perform these same translations, so that if you enter a Japanese full width A, the search process correctly matches the latin A.
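
A similar effect can be demonstrated with the standard Unicode "NFKC" compatibility normalisation available in Python's unicodedata module. Note that this is a substitute technique chosen for illustration: this page does not say that the indexers use NFKC, and their own translation tables may differ. The function name canonicalise is hypothetical.

```python
import unicodedata


def canonicalise(text: str) -> str:
    """Fold width variants into standard forms using NFKC (illustrative only)."""
    return unicodedata.normalize("NFKC", text)


full_width_a = "\uff21"          # full width capital letter A
half_width_katakana = "\uff76"   # half-width Katakana KA

print(canonicalise(full_width_a))         # the standard latin A
print(canonicalise(half_width_katakana))  # the standard width Katakana KA
```

Applying the same function to both the indexed text and the search term is what lets a full width A match a latin A.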

    Possible improvement: Translate traditional Chinese characters into their simplified Chinese equivalents when doing searches.

    Word Splitting
    FindinSite and Findex split up text into words. Words contain only letters and numbers, so any other characters break the text into words.

    Each non-latin character (ie non-Western and non-Arabic) forms a separate word, as described in the next section.

    Non-western characters
    FindinSite and Findex treat all non-Western and non-Arabic characters as single words. For example, the three characters in the word "Japanese" (日本語) are separate words, 日, 本 and 語. However, if you search for 日本語 then FindinSite will effectively put double quotes around these characters, so that only instances of these three characters together will be found. If you want to find all instances of 日, 本 and 語 on a page, then search for 日 本 語, ie with spaces in between.

    This approach is used because most non-Western languages do not use white space characters to indicate breaks between words.

    A character is defined as being Western if its Unicode code is less than or equal to U+02A8, or if it is in the Unicode range U+1E00 to U+1EF9 inclusive.
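
The word splitting rules can be sketched in Python as follows. This is a simplified illustration, not the indexers' actual algorithm: it applies the "Western" ranges given above, treats every other letter or digit as a one-character word, and leaves out the special handling of Arabic characters because this page does not give their range. The names is_western and split_words are invented.

```python
def is_western(ch: str) -> bool:
    """Apply the Western test: up to U+02A8, or U+1E00 to U+1EF9 inclusive."""
    code = ord(ch)
    return code <= 0x02A8 or 0x1E00 <= code <= 0x1EF9


def split_words(text: str) -> list:
    """Split text into words; each non-Western letter becomes its own word."""
    words, current = [], []
    for ch in text:
        if ch.isalnum() and is_western(ch):
            current.append(ch)  # extend the current Western word
            continue
        if current:             # anything else ends the current word
            words.append("".join(current))
            current = []
        if ch.isalnum():        # a non-Western letter is a word by itself
            words.append(ch)
    if current:
        words.append("".join(current))
    return words


print(split_words("Learn 日本語 now"))
```

With this splitting, a search for 日本語 has to be handled as a phrase of three adjacent one-character words, which matches the quoting behaviour described above.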

    HTML Tag and Tag Attribute names
    Note that all HTML tag names and HTML tag attribute names must be in Western characters, in the Unicode range U+0000 to U+00FF inclusive. However tag attribute values can be in Unicode. For example, the following line is accepted by the indexers:
        <META NAME="description" CONTENT="日本語">
    In this example, META is a tag name, and NAME and CONTENT are tag attribute names. Note that the tag attribute values, "description" and "日本語", can use non-Western characters.

    Although web page names and target frame names can include any characters, it is recommended that they be in Western characters.

    Lower Case
    FindinSite and Findex convert words to lower case when matching words in the search database. Details of the lower case conversions are available on request.
    Stop words
    Currently there are no non-Western stop word files.
    FindinSite-CD Word highlighting
    Word highlighting only occurs for HTML pages in reasonably recent browsers.

    For Microsoft Internet Explorer 4 or later, the browser decodes the character sets before word highlighting. This makes it possible to do word highlighting whatever the character set, ie word highlighting works in IE4+.

    Word highlighting in Netscape Navigator 4 or similar only works if the character set is "iso-8859-1" or "Windows-1252". For other character sets, word highlighting is not attempted because FindinSite-CD does not decode the charset at runtime.

      All site Copyright © 1996-2008 PHD Computer Consultants Ltd, PHDCC

    Last modified: 8 February 2006.