|
findinsite-cd international character support
Introduction
Specifying letters and characters in your text correctly is important,
especially if your information is not in the English language.
You have to ensure that your users will be able to read your information, and
that FindinSite can search it.
Browsers and other viewers
Your first task is to ensure that users can read your files correctly.
This is not a trivial task. Even though most web browsers can understand web pages written
for different languages, they may not display the information correctly on screen
because the user does not have the relevant fonts or language support installed on their computer.
One option here is to use a different file format such as Adobe PDF or Microsoft Word.
Files in these formats may render better on a user's screen than a web page.
Adobe PDF also lets you control the page layout much more closely.
File formats
The starting point for representing characters correctly is to use the correct file format.
For HTML web pages, the key is to specify and use the most appropriate character set
for the characters you are using. Most web page editors should let you specify the character set
and encode characters in the character set correctly.
For other formats, such as Adobe PDF or Microsoft Word, the authoring tools may give you some scope
to specify which fonts etc are used when saving a file.
FindinSite character support
This page describes the character support of the various FindinSite products.
- The FindinSite and Findex indexing programs support a wide range of characters.
- FindinSite-CD-Wizard also lets you view and edit a search database
in your computer's default character format.
- The FindinSite runtimes display characters as best as possible for the user's browser and computer.
- The FindinSite runtimes are supplied fully internationalised for many languages
as described in the Language support page.
Also see the FindinSite-CD
Japanese,
Simplified Chinese and
Traditional Chinese pages.
FindinSite and Findex character support
- HTML Web pages
- A character set (charset) indicates how the contents of a web page should be interpreted as characters.
Different languages will need different character sets if they contain different characters.
The Unicode character set is a superset of most other character sets, but is not usually
used to encode characters in web pages because it makes the pages larger.
- TXT text files
- FindinSite and Findex find all characters in single byte ANSI and double byte Unicode
plain text files.
- PDF Adobe files
- Please see the PDF Scanning page for details of
the character encodings that are recognised by the FindinSite and Findex indexers.
- DOC, XLS and PPT Office documents
- FindinSite and Findex extract all character information from the supported Microsoft Office documents.
HTML Character set basics
For web pages, the FindinSite and Findex indexers
recognise 32 different character sets (and many more name variants),
including multi-byte character sets.
FindinSite-CD can display all characters if your computer and browser
are set up correctly with the right fonts available.
FindinSite-JS and FindinSite-MS use the UTF-8 character set, giving the best chance of displaying characters correctly.
As an example of character sets, Central European languages might need to use a capital R character with
an acute accent above it, Ŕ. In the "Windows 1250" character set, byte code 192
(0xC0 in hexadecimal) represents this character. If the default
character set were used instead, then this byte would be displayed as a capital A with a
grave accent, ie À.
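The effect of choosing the wrong character set can be reproduced in a few lines of Python. This is only an illustrative sketch using Python's built-in codecs, where `cp1250` and `latin-1` are Python's names for Windows 1250 and ISO 8859-1. It does not show how the FindinSite indexers are implemented.

```python
# One byte, two interpretations, depending on the character set used to decode it.
raw = bytes([0xC0])  # byte code 192

as_central_european = raw.decode("cp1250")  # Windows 1250
as_default = raw.decode("latin-1")          # ISO 8859-1

print(f"Windows 1250: U+{ord(as_central_european):04X}")  # capital R with acute accent
print(f"ISO 8859-1:   U+{ord(as_default):04X}")           # capital A with grave accent
```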
The default character set for web pages is called "ISO 8859-1". This is almost
exactly the same as the standard Windows English character set called "Windows 1252".
Note that some character sets now include a Euro currency symbol (€) with byte code 128,
ie 0x80 in hexadecimal (or 0x88 for the Windows-1251 charset). Older browsers may not be able to display this character.
Character normalisation
Internally, FindinSite and Findex use the Unicode character set. Each Unicode character uses 16 bits,
ie a value from 0 to 65535, or U+0000 to U+FFFF to use the correct hexadecimal notation.
The indexers
translate characters from the page's character set into Unicode before processing them further.
The search database uses a compact form of Unicode called UTF-8 that usually reduces the amount of storage space required.
As an example, the Euro currency symbol (€) is U+20AC, hexadecimal 0x20AC.
The indexers translate byte code 128 in the
"Windows 1252" character set into U+20AC.
They also recognise other ways of specifying characters,
using 'character references'. The Euro symbol may also be specified as a string &euro;,
as a decimal number &#8364; or as a hexadecimal number &#x20AC;.
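As a rough illustration of this normalisation, the Python sketch below shows several spellings of the Euro symbol all arriving at the same Unicode character, U+20AC, which then takes three bytes in UTF-8. It uses Python's standard `html.unescape` and built-in codecs as stand-ins. The indexers' own translation tables are not described here, so treat this as an analogy rather than their implementation.

```python
import html

# Byte code 128 in Windows 1252 and byte code 0x88 in Windows 1251 are both the Euro symbol.
from_cp1252 = bytes([0x80]).decode("cp1252")
from_cp1251 = bytes([0x88]).decode("cp1251")

# Named, decimal and hexadecimal character references for the same character.
from_references = [html.unescape(ref) for ref in ("&euro;", "&#8364;", "&#x20AC;")]

# Compact UTF-8 form of the normalised character.
stored = from_cp1252.encode("utf-8")

print(f"U+{ord(from_cp1252):04X}", stored.hex())
```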
Specifying character sets
You specify the character set for a web page using the following line, which must appear
in the HEAD section of a web page. Replace ISO-8859-1 with an
appropriate character set string.
The list of character sets supported by the indexers is given below.
View the source of this page to see an example of this tag.
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
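To check which character set a page declares, a script along the following lines may help. It is a minimal sketch: the function name `charset_from_meta` is hypothetical, the regular expression is a simplification rather than a full HTML parser, and the fallback to ISO-8859-1 simply mirrors the default mentioned earlier on this page.

```python
import re

# Looks for charset=... inside a META HTTP-EQUIV="Content-Type" tag.
# Simplified: assumes HTTP-EQUIV appears before CONTENT, as in the tag above.
META_CHARSET = re.compile(
    rb"<meta[^>]+http-equiv=[\"']?content-type[\"']?[^>]*charset=([\w.:$-]+)",
    re.IGNORECASE,
)

def charset_from_meta(page: bytes, default: str = "ISO-8859-1") -> str:
    """Return the declared charset string, or the default if none is declared."""
    match = META_CHARSET.search(page)
    return match.group(1).decode("ascii") if match else default

page = b'<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Windows-1250"></HEAD></HTML>'
print(charset_from_meta(page))
```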
Supported character sets
The FindinSite and Findex indexers recognise the following character sets.
Any of the listed charset string names are recognised in a META Content-Type tag.
The Windows code page is also listed, for your information.
| Character set name | charset string | Windows code page |
| Central European (Windows) | Windows-1250 x-cp1250 | 1250 |
| Cyrillic (Windows) | Windows-1251 x-cp1251 | 1251 |
| Western | Windows-1252 ANSI_X3.4-1968 ANSI_X3.4-1986 ascii cp367 csASCII IBM367 ibm819 ISO_646.irv:1991 ISO646-US iso-ir-6 us us-ascii x-ansi | 1252 |
| Greek (Windows) | Windows-1253 | 1253 |
| Turkish (Windows) | Windows-1254 | 1254 |
| Hebrew (ISO-logical) | Windows-1255 | 1255 |
| Arabic (Windows) | Windows-1256 cp1256 | 1256 |
| Baltic (Windows) | Windows-1257 | 1257 |
| Vietnamese | Windows-1258 | 1258 |
| ISO 8859-1 | ISO-8859-1 cp819 csISOLatin1 ibm819 iso_8859-1 iso_8859-1:1987 iso8859-1 iso-ir-100 l1 latin1 | 1252 |
| ISO 8859-2 | ISO-8859-2 csISOLatin2 iso_8859-2 iso_8859-2:1987 iso8859-2 iso-ir-101 l2 latin2 | 28592 |
| ISO 8859-3 | ISO-8859-3 csISOLatin3 ISO_8859-3 ISO_8859-3:1988 iso-ir-109 l3 latin3 | 28593 |
| ISO 8859-4 | ISO-8859-4 csISOLatin4 ISO_8859-4 ISO_8859-4:1988 iso-ir-110 l4 latin4 | 28594 |
| ISO 8859-5 | ISO-8859-5 csISOLatin5 csISOLatinCyrillic cyrillic ISO_8859-5 ISO_8859-5:1988 iso-ir-144 l5 | 28595 |
| ISO 8859-6 | ISO-8859-6 arabic csISOLatinArabic ECMA-114 ISO_8859-6 ISO_8859-6:1987 iso-ir-127 | 28596 |
| ISO 8859-7 | ISO-8859-7 csISOLatinGreek ECMA-118 ELOT_928 greek greek8 ISO_8859-7 ISO_8859-7:1987 iso-ir-126 | 28597 |
| ISO 8859-8 | ISO-8859-8 iso-8859-8-i logical csISOLatinHebrew hebrew ISO_8859-8 ISO_8859-8:1988 iso-ir-138 visual | 28598 |
| ISO 8859-9 | ISO-8859-9 csISOLatin5 ISO_8859-9 ISO_8859-9:1989 iso-ir-148 | 28599 |
| ISO 8859-15 | ISO-8859-15 csISOLatin9 ISO_8859-15 l9 latin9 | 28605 |
| Japanese (Shift-JIS) | shift_jis csShiftJIS csWindows31J ms_Kanji shift-jis x-ms-cp932 x-sjis | 932 |
| Japanese (JIS) | csISO2022JP iso-2022-jp _iso-2022-jp _iso-2022-jp$sio | |
| Japanese (EUC) | euc-jp csEUCPkdFmtJapanese Extended_UNIX_Code_Packed_Format_for_Japanese x-euc x-euc-jp | |
| Simplified Chinese (GB2312) | gb2312 chinese CN-GB csGB2312 csGB231280 csISO58GB231280 GB_2312-80 GB231280 GB2312-80 GBK iso-ir-58 | 936 |
| Simplified Chinese (HZ) | HZ-GB-2312 | Shifted 936 |
| Traditional Chinese (BIG5) | big5 cn-big5 csbig5 x-x-big5 | 950 |
| Korean (KSC5601) | ks_c_5601-1987 csKSC56011987 iso-ir-149 korean ks_c_5601 ks_c_5601_1987 ks_c_5601-1989 KSC_5601 KSC5601 | 949 |
| Korean (EUC) | euc-kr cseuckr | 51949 |
| Hebrew (DOS) | dos-862 | 862 |
| Thai (Windows) | Windows-874 DOS-874 iso-8859-11 TIS-620 | 874 |
| UTF-8 | UTF-8 unicode-1-1-utf-8 unicode-2-0-utf-8 x-unicode-2-0-utf-8 | 65001 |
| Unicode | Unicode utf-16 | 1200 |
| Unicode (Big Endian) | UnicodeFEFF | 1201 |
The indexers automatically detect pages that are encoded in the UTF-8, Unicode or UnicodeFEFF character
sets when they start to read a page. However, the character set used for decoding may be overridden using the
above META character set tag.
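This page does not say how the indexers perform this detection. One common technique, shown here purely as an assumption, is to look for a byte order mark (BOM) at the start of the file. In the Python sketch below the function name `sniff_bom` is hypothetical, and the returned names are the charset strings from the table above (code page 1200 being little-endian UTF-16 and 1201 big-endian).

```python
import codecs

def sniff_bom(data: bytes):
    """Guess a Unicode encoding from a leading byte order mark, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "Unicode"       # little-endian UTF-16, code page 1200
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UnicodeFEFF"   # big-endian UTF-16, code page 1201
    return None                # no BOM: fall back to the META tag or the default

print(sniff_bom(codecs.BOM_UTF8 + b"<html>"))
print(sniff_bom(b"<html>"))
```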
Email support if you would like any other
character sets supported.
|
Viewing characters in FindinSite-CD
|
In general, if someone is using FindinSite-CD in a browser to search for text in their "own" language
then FindinSite-CD should work as expected.
To be more precise, your customers must have a computer set up as follows for satisfactory use of FindinSite-CD.
- Install the appropriate fonts for the language you want to display.
- If you run a version of Windows that can change locales, you need to change the default
system locale to match the language.
- The browser must be set up to use a suitable font for pages in the desired language.
- The browser's Java fonts may need to be altered.
Windows 2000 and XP may be set up so that "western" developers can see non-Western characters,
both in FindinSite-CD-Wizard and FindinSite-CD running in a browser.
See below for instructions on setting up Windows 2000 and XP to cover the first
two points above, so that FindinSite-CD-Wizard runs satisfactorily.
Internet Explorer
Windows Internet Explorer will display non-Western characters in its main browser window
and the Microsoft Java VM will display and work with non-Western characters.
In Windows 2000 the default system locale must be set appropriately for this to work,
eg set the default system locale to Japanese if you want to use FindinSite-CD to search for Japanese words.
You must also install the Japanese Input Locale if you want to type in Japanese characters.
In Windows XP, these steps do not seem to be necessary.
In Internet Explorer 5+, you can change the fonts that are used to display non-Western HTML.
Select "Tools+Internet Options...". In the "General" tab press the "Fonts..." button.
Select the character set type (eg Japanese) and then select a "Web page font" and a
"Plain text font".
If you run FindinSite-CD in Internet Explorer in a computer that is not set up appropriately,
FindinSite-CD will not display characters correctly.
Note: when running Internet Explorer in Windows 95, 98 and Me, you may well be able to
view pages with non-Western characters by installing the appropriate language plug-in.
Sun Java VM
The Sun Java VM is used by most browsers, and can be used by Windows Internet Explorer.
The Sun Java VM can display non-Western characters properly on Western computers.
VM 1.5 or later
To ensure that the right fonts are used for your language, simply open the Control Panel
"Regional and Language Settings". In the "Regional Options" tab, select your language and country
and press OK. (In some cases you may need to select "Supplemental language support" checkboxes in the
"Language" tab, and appropriate Code pages in the "Advanced" tab.)
Restart the browser.
VM 1.4 and earlier
The Sun Java VM can display non-Western characters properly on Western computers,
if the correct font properties are selected;
you may be able to enter non-Western characters and search for them successfully.
|
If your computer is non-Western, then the Sun Java VM installation may have selected the
correct font file already. If not, then follow these instructions:
First, find where the Java VM plug-in is installed. On Windows computers, this might be in
C:\Program files\Java\j2re1.4.2_01\. Move to the
lib\ sub-directory.
The file font.properties determines which font the Java VM uses.
Save a copy of the current file, eg to font.properties.save.
The table below lists some alternative font
properties files you could use. For example, to use Japanese fonts, copy
font.properties.ja to font.properties.
Restart the browser.
| Western | font.properties |
| Japanese | font.properties.ja |
| Korean | font.properties.ko |
| Simplified Chinese | font.properties.zh |
| Traditional Chinese | font.properties.zh_TW |
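If you switch font properties files often, the backup and copy steps can be scripted. The following Python sketch is one possible way to do it. The function name `select_font_properties` is hypothetical, and the directory you pass must be the lib\ sub-directory of your own Java VM installation, which will differ from machine to machine.

```python
import shutil
from pathlib import Path

def select_font_properties(lib_dir, suffix):
    """Back up font.properties, then replace it with font.properties.<suffix>."""
    lib = Path(lib_dir)
    current = lib / "font.properties"
    backup = lib / "font.properties.save"
    if not backup.exists():
        # Keep the first backup so the original file is never overwritten.
        shutil.copyfile(current, backup)
    shutil.copyfile(lib / f"font.properties.{suffix}", current)
    return current
```

For example, `select_font_properties(r"C:\Program files\Java\j2re1.4.2_01\lib", "ja")` would select the Japanese file on an installation at that path. Restart the browser afterwards.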
Scanning and viewing characters in FindinSite-CD-Wizard
|
FindinSite-CD-Wizard will scan all character sets correctly when running in most recent Windows platforms,
ie Windows 95, Windows 98, Windows Me, Windows NT 4, Windows 2000,
Windows XP and later.
It works in all language versions.
All the messages are in English.
However you will only be able to view characters correctly in FindinSite-CD-Wizard if Windows is set up
appropriately for the characters that you are trying to display. Nonetheless, you can still edit
"undisplayable" characters. So you can edit Japanese characters on a Western PC - if you are careful.
If a character cannot be displayed using the default system code page (character set),
FindinSite-CD-Wizard does one of two things:
- In the Words list (and other places) FindinSite-CD-Wizard displays the character as a question mark (?).
When you select a word in the words list, FindinSite-CD-Wizard also displays the word beneath the
"Key" symbols. It uses the font that you have selected for printing to display the character.
It is possible therefore that the character will be displayed correctly here, while
not displayed correctly in the main words list.
 |
Windows 2000 and XP
In this example, a Chinese character, U+5BFC 导,
is listed as a word in the Words tab. FindinSite-CD-Wizard is being run
in Windows 2000. Windows 2000 is not using Chinese as its system default
locale so FindinSite-CD-Wizard cannot display this character properly in the
words list. Therefore a ? is displayed instead in the main words list.
However the chosen printer font, Simsun, can display this character.
So when the character is selected in the words list, the character is displayed
correctly under the "Key".
In the western versions of Windows 95, Windows 98 and Windows Me
you will never be able to view this character properly.
|
- For the Description and various entries in the Pages tab, FindinSite-CD-Wizard sees if any characters
in the string cannot be displayed in the default code page. If there are any problems,
FindinSite-CD-Wizard displays the entire string in UTF-8.
This replaces any non-ASCII characters with two or three other characters.
"UTF-8" is shown in the FindinSite-CD-Wizard status bar.
You can edit the UTF-8 string - carefully. Suppose that the first character in the description
is a Japanese character that cannot be edited normally. FindinSite-CD-Wizard uses its UTF-8
equivalent, ie three characters. If you delete the first of these characters, FindinSite-CD-Wizard
displays the following message in the status bar: "Could not decode the edit box string".
If you go on to delete the other two characters, FindinSite-CD-Wizard will be able to decode the UTF-8
and so the status bar error message will go away.
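The behaviour of the UTF-8 edit box can be imitated in Python. This is only a sketch of the underlying encoding rules, assuming a Western (Windows 1252) display. The helper `can_decode` is hypothetical and is not part of FindinSite-CD-Wizard.

```python
def can_decode(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

utf8_bytes = "日".encode("utf-8")     # one Japanese character becomes three bytes
shown = utf8_bytes.decode("cp1252")   # what a Western code page shows: three characters

print(len(utf8_bytes), len(shown))
print(can_decode(utf8_bytes))      # complete sequence
print(can_decode(utf8_bytes[1:]))  # first byte deleted
print(can_decode(b""))             # all three bytes deleted
```

Deleting only part of a multi-byte sequence leaves bytes that cannot be decoded, which corresponds to the status bar message. Removing the whole sequence makes the string valid again.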
|
Windows 2000 and XP (Japanese): Editing Japanese characters
This example shows FindinSite-CD-Wizard successfully showing and editing Japanese characters.
The selected title in the Pages tab is shown correctly both in the list box above and
the edit box below.
For this example, FindinSite-CD-Wizard was run in
Windows 2000 with Japanese as the default system locale.
|
|
Windows 2000 and XP (Japanese): Editing Chinese characters
In this example showing the Pages tab, FindinSite-CD-Wizard is again being run in
Windows 2000 with Japanese as the default system locale.
However, in this case, FindinSite-CD-Wizard is trying to display some Chinese characters.
Some Chinese characters can be displayed in the Japanese default system locale, but not all.
In the Pages list, the page title is shown with any unrecognised characters changed
into question marks. The second, fourth and last characters are shown this way.
However, in the Title box below, the page title is all shown in UTF-8 form.
You can edit the title - in UTF-8 form - in this box.
|
|
Windows 2000 and XP (Chinese): Editing Chinese characters
This example shows the same Chinese characters when FindinSite-CD-Wizard
is run in Windows 2000 with Chinese as the default system locale.
All the characters are displayed properly and can be edited easily.
|
|
Windows 98 (English): Editing Chinese characters
This example shows the same Chinese characters when FindinSite-CD-Wizard
is run in Western Windows 98. None of the characters can be displayed properly
in the standard font, so all the non-space characters appear as question marks.
The Title box again shows the page title in UTF-8 form.
Even though exactly the same UTF-8 characters are output, Windows displays them
differently in the default Western character set font.
|
In conclusion, it is possible to use FindinSite-CD-Wizard to edit words and pages
in any recent version of Windows. However, if you need to do a lot of editing,
it is best if you use a version of Windows that can display the characters you are
working with.
|
Testing in Windows 2000 and XP
|
"Western" FindinSite-CD developers who wish to work with non-Western character set pages
will find Windows 2000 and XP very useful.
Windows 2000 and XP have features that let you test your non-Western FindinSite-CD implementation fully.
The first step is to install the locales of interest. From the Control Panel, select the
"Regional Settings" applet. In the "General" tab, check the "Language settings for the system".
Make sure that all the locales of interest are installed. If necessary, press the "Advanced"
button to see what Code Pages are installed.
In FindinSite-CD-Wizard, you should now be able to File+Print all the characters in your search database.
You may well need to change the print font, in the "File+Print options..." dialog box.
The following fonts may be useful: "Microsoft Sans Serif", "Lucida Sans Unicode" or
"Arial Unicode MS" might
display most characters, "MS Mincho" displays Japanese and "Simsun" displays Chinese.
You should also now be able to see the selected Word displayed correctly under the "Key".
FindinSite-CD-Wizard will not yet be able to display and edit non-Western characters easily (although
you can view and edit in UTF-8). To make this work you will have to set the default system locale
appropriately. In the "Regional Settings" "General" tab, press the "Set default" button. Choose the desired
locale. You will probably have to reboot to bring this change into effect.
Note that changing "your locale" in the "Regional Settings" "General" tab is usually not
sufficient to make FindinSite-CD-Wizard work well.
As a final step, you will need to add the appropriate input locales. In the "Regional Settings"
box, select the "Input Locales" tab. Add any locales that you need. Each input locale has
different ways of entering characters, called Input Method Editors (IME).
To run FindinSite-CD in a browser in Windows 2000 and XP, see the instructions above.
|
Unicode implementation details
|
Unicode and UTF-8
All characters are initially translated from the file's character representation into Unicode.
A Unicode character is stored in 2 bytes. However the words are stored in the FindinSite search database
in UTF-8 format. This usually saves space, as it stores most "western" characters in a single byte.
However other characters take up 2 or 3 bytes in UTF-8 format.
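The space saving is easy to check with Python's built-in UTF-8 codec. This small sketch simply measures the encoded length of a few sample characters chosen for illustration.

```python
# Number of bytes each sample character needs in UTF-8.
samples = ["A", "é", "€", "日"]
sizes = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, size in sizes.items():
    print(f"U+{ord(ch):04X}: {size} byte(s)")
```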
Canonicalisation
Canonicalisation means translating characters into a "standard form". For example, the Japanese
character set has full width latin letters, eg a full width capital letter A. This character has
a different code to the standard latin A.
The indexers' canonicalisation process converts any full width characters into the standard latin
equivalent. This means that a search for a word that contains A will correctly match
a Japanese full width A.
The indexers convert all half-width Katakana and Hangul characters into their
standard width character codes.
Many other useful character code translations are also done.
The FindinSite runtimes also perform these same translations,
so that if you enter a Japanese full width A,
the search process correctly matches the latin A.
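A similar effect can be seen with Unicode compatibility normalisation (NFKC) in Python's standard `unicodedata` module. This is an analogy only: the indexers' own translation tables are not published here and may differ from NFKC, which also changes characters other than width variants.

```python
import unicodedata

full_width_a = "\uff21"          # FULLWIDTH LATIN CAPITAL LETTER A
half_width_katakana = "\uff76"   # HALFWIDTH KATAKANA LETTER KA

folded_a = unicodedata.normalize("NFKC", full_width_a)
folded_ka = unicodedata.normalize("NFKC", half_width_katakana)

print(f"U+{ord(full_width_a):04X} -> U+{ord(folded_a):04X}")
print(f"U+{ord(half_width_katakana):04X} -> U+{ord(folded_ka):04X}")
```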
Possible improvement: Translate traditional Chinese characters into their simplified Chinese equivalent
when doing searches.
Word Splitting
FindinSite and Findex split up text into words. Words contain only letters and numbers,
so any other characters break the text into words.
Each non-latin character (ie non-Western and non-Arabic) forms a separate word,
as described in the next section.
Non-western characters
FindinSite and Findex treat all non-Western and non-Arabic characters as single words.
For example, the three characters
in the word "Japanese" (日本語) are separate words, 日, 本 and 語.
However, if you search for 日本語
then FindinSite will effectively put double quotes around these characters, so that only instances
of these three characters together will be found. If you want to find all instances of 日, 本 and 語
on a page, then search for 日 本 語, ie with spaces in between.
This approach is used because most non-Western languages do not use white space characters to
indicate breaks between words.
A character is defined as being Western if its Unicode code is less than or equal to U+02A8,
or if it is in the Unicode range U+1E00 to U+1EF9 inclusive.
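The word splitting rules above can be sketched in Python as follows. The function names `is_western` and `split_words` are hypothetical, and the sketch makes two simplifying assumptions: Python's `str.isalnum` stands in for "letters and numbers", and the special handling of Arabic characters is left out because its Unicode range is not given on this page.

```python
def is_western(ch: str) -> bool:
    """Western means code <= U+02A8, or within U+1E00 to U+1EF9 inclusive."""
    code = ord(ch)
    return code <= 0x02A8 or 0x1E00 <= code <= 0x1EF9

def split_words(text: str) -> list:
    """Split text into words; each non-Western letter becomes a word on its own."""
    words, current = [], []
    for ch in text:
        if ch.isalnum() and is_western(ch):
            current.append(ch)          # extend the current Western word
            continue
        if current:                     # anything else ends the current word
            words.append("".join(current))
            current = []
        if ch.isalnum():                # non-Western letter or digit: single-character word
            words.append(ch)
    if current:
        words.append("".join(current))
    return words

print(split_words("FindinSite 日本語 search"))
```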
HTML Tag and Tag Attribute names
Note that all HTML tag names and HTML tag attribute names must be in Western characters,
in the Unicode range U+0000 to U+00FF inclusive. However tag attribute values can be
in Unicode.
For example, the following line is accepted by the indexers:
<META NAME="description" CONTENT="日本語">
In this example, META is a tag name, and NAME and CONTENT are tag attribute names.
Note that the tag attribute values, "description" and "日本語", can use non-Western characters.
Although web page names and target frame names can include any characters, it is recommended
that they be in Western characters.
Lower Case
FindinSite and Findex convert words to lower case when matching words in the search database.
Details of the lower case conversions are available on request.
Stop words
Currently there are no non-Western stop word files.
FindinSite-CD Word highlighting
Word highlighting only occurs for HTML pages in reasonably recent browsers.
For Microsoft Internet Explorer 4 or later, the browser decodes the character sets before
word highlighting. This makes it possible to do word highlighting whatever the character
set, ie word highlighting works in IE4+.
Word highlighting in Netscape Navigator 4 or similar only works if the character set is
"iso-8859-1" or "Windows-1252". For other character sets, word highlighting is not
attempted because FindinSite-CD does not decode the charset at runtime.
|