FindinSite rules for word stemming and synonyms
A rules file tells FindinSite how to do word stemming, match synonyms and correct spellings.
Rules files are used by FindinSite-CD, FindinSite-JS and FindinSite-MS.
Word stemming means taking the stem of a word and generating common variants of the word.
As an example, if the search text is throws then the word stem is throw and
common variants of this stem include thrower, throwers and throwing.
FindinSite uses rules to check to see if the word stem or any of the variants exist in the search database.
If they do, then these words are added as alternatives to the original text.
A synonym is a different word with the same meaning. The FindinSite rules contain
a list of equivalent words. As well as coping with synonyms, this lets FindinSite match regional
variations, eg color matches colour and vice versa. A specialised rule form
lets FindinSite correct mis-spellings, so that teh matches the, but not vice versa.
With these rules in place, a search for teh color fade matches The colour fades.
Note that the rules can produce false matches: for example, a search for car will also match carer.
The default search page created by FindinSite-CD-Wizard tells FindinSite-CD to use
the English rules file rulesen.txt.
We also have prototype rules files for French (rulesfr.txt)
and German (rulesde.txt - thanks to Paul Croome, Software AG).
Let us know if these rules work for you.
If your information is in French, then change the rules parameter
to refer to the French rules file only:
<PARAM NAME=rules VALUE="rulesfr.txt">
We need to improve FindinSite-CD's handling of rules files for multiple languages.
Suppose a French user is looking at the FindinSite-CD English documentation.
If both English and French rules files are available, then FindinSite-CD will use
the French rules for the French user. In fact, even this French user should use
the English rules because that's the language of the information.
In most cases this is not a problem, because your information will only be in one language
and you should provide only one language file, as described earlier.
Using FindinSite-CD rules
The default search page generated by FindinSite-CD-Wizard tells FindinSite-CD to use the English rules
rulesen.txt, described below.
It is this line in the search page that tells FindinSite-CD to use this rules file:
<PARAM NAME=rules VALUE="rulesen.txt">
If you remove this line completely then FindinSite-CD will not use any rules, ie there will
be no word stemming, synonyms or spelling corrections.
You can specify more than one rules file, comma-separated in the
rules parameter. Each rules file has locale identifiers, ie
language and optional country codes. FindinSite-CD chooses the most appropriate rules file
for the user's locale at startup, or the first rules file if there is no match.
If you then switch language, then FindinSite-CD does the same again for the new locale.
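The selection described above can be sketched in Python. This is only an illustration, not FindinSite-CD's actual code: the function name, the tuple layout and the scoring (an exact language and country match beats a language-only match) are assumptions made for the example.

```python
def choose_rules_file(rules_files, user_locale):
    """Pick the rules file that best matches the user's locale.

    rules_files: list of (filename, locales) in search-page order, where
    locales is a list of (language, country_or_None) tuples.
    user_locale: a (language, country_or_None) tuple.
    """
    if not rules_files:
        return None
    language, country = user_locale
    # Fallback stated in the text: the first rules file if nothing matches.
    best_name, best_score = rules_files[0][0], 0
    for name, locales in rules_files:
        for file_language, file_country in locales:
            if file_language != language:
                continue
            # Assumed scoring: language + country beats language only.
            exact = country is not None and file_country == country
            score = 2 if exact else 1
            if score > best_score:
                best_name, best_score = name, score
    return best_name


available = [
    ("rulesen.txt", [("en", None)]),
    ("rulesfr.txt", [("fr", None)]),
]
print(choose_rules_file(available, ("fr", "FR")))  # rulesfr.txt
print(choose_rules_file(available, ("de", None)))  # rulesen.txt
```

Switching language would simply mean calling the function again with the new locale.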
If you write your own rules file and give it a different filename, then make sure that
you change the filename in the search page text. If you want a rules file put alongside
all new FindinSite-CD-Wizard generated search pages then put it in the FindinSite-CD
directory. See the template page if you want to alter the
search page that FindinSite-CD-Wizard generates.
The Rules File Format
A rules file is either a plain text file in ANSI characters or a text file using UTF-8 characters.
In both cases, the file is divided up into lines.
|Line 1
||1 if using ANSI, 2 if using UTF-8
|Line 2
||The language code and optional country codes, separated by a space
|Line 3
||A description of the rules file, displayed in FindinSite
|Line 4 onwards
||One rule per line
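As a rough illustration of this layout, the following Python sketch reads the three header lines and collects the rules. It is not FindinSite's own parser. It assumes that "ANSI" means the Windows-1252 code page, that a UTF-8 file may start with a byte order mark, and that blank lines can be skipped; the function name, dictionary keys and sample country codes are hypothetical.

```python
import codecs


def parse_rules_file(raw_bytes):
    """Split a rules file into its header fields and its list of rules."""
    # Tolerate a UTF-8 byte order mark (an assumption, not from the docs).
    if raw_bytes.startswith(codecs.BOM_UTF8):
        raw_bytes = raw_bytes[len(codecs.BOM_UTF8):]
    # Line 1 is a single ASCII digit, so it can be read before decoding.
    first_line = raw_bytes.split(b"\n", 1)[0].strip()
    if first_line == b"1":
        encoding = "cp1252"  # assumed meaning of "ANSI"
    elif first_line == b"2":
        encoding = "utf-8"
    else:
        raise ValueError("line 1 must be 1 (ANSI) or 2 (UTF-8)")
    lines = raw_bytes.decode(encoding).splitlines()
    if len(lines) < 3:
        raise ValueError("a rules file needs at least three header lines")
    codes = lines[1].split()
    if not codes:
        raise ValueError("line 2 must contain a language code")
    return {
        "encoding": encoding,
        "language": codes[0],
        "countries": codes[1:],
        "description": lines[2].strip(),
        # Line 4 onwards: one rule per line (blank lines skipped here).
        "rules": [line.strip() for line in lines[3:] if line.strip()],
    }


sample = b"1\nen GB US\nBasic English rules\n*,*s,*es\nteh>the\n"
parsed = parse_rules_file(sample)
print(parsed["language"], parsed["countries"], parsed["rules"])
```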
Each rule is a list of items separated by commas (or by a greater-than-sign).
There are three types of rules: word stemming rules, equivalent words rules and spelling correction rules.
How it works
FindinSite takes each word in the search text and applies all rules to this word.
Each generated word then has all rules applied again and again until no new words
are generated. (The rules are not applied again if a generated word is longer than
the previous word; this stops words becoming infinitely long.)
The end result is a list of possible alternatives to the original word.
FindinSite then goes through this list of words and removes a word if it
does not appear anywhere in the search database.
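The process above can be sketched in Python as a loop that runs until no new words appear. This is a simplified model rather than FindinSite's implementation: each rule is represented as a plain function from a word to a list of generated words, and the two sample rule functions are hypothetical stand-ins for rules such as *,*s,*es.

```python
def expand_word(word, rules, database_words):
    """Return the alternatives for word that exist in the search database."""
    word = word.lower()
    seen = {word}
    pending = [word]
    while pending:
        current = pending.pop()
        for rule in rules:
            for generated in rule(current):
                if generated in seen:
                    continue
                seen.add(generated)
                # A generated word that is longer than the word it came
                # from is kept, but the rules are not applied to it again.
                if len(generated) <= len(current):
                    pending.append(generated)
    # Drop the original word and anything missing from the database.
    return sorted(w for w in seen if w != word and w in database_words)


def add_plural_endings(word):   # hypothetical stand-in for *,*s,*es
    return [word + "s", word + "es"]


def strip_final_s(word):        # hypothetical stand-in for *s,*
    return [word[:-1]] if word.endswith("s") and len(word) > 2 else []


database = {"throw", "throws", "thrower"}
print(expand_word("throws", [add_plural_endings, strip_final_s], database))
# ['throw']
```

Because longer words are never fed back into the rules, the pending list can only receive words no longer than their source, so the loop always terminates.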
All rule tests are letter case insensitive.
Rules are not applied to words with non-latin characters.
Do not put punctuation characters in rules; a rule containing them
will not work.
Word stemming rules
A word stemming rule is a comma-separated list of items.
If a word matches the first item in a rule, then word variants specified
in the remaining items are added to the list of possible words.
For example, the rule *,*s,*es applied to the search word
throw produces the alternative words throws and throwes.
All items must start with an asterisk (*).
If there are any subsequent characters in the first item then the end of the test word must contain the same
characters. These characters are then removed to form the word stem.
The second and following items in the rule describe what alternative words should be
generated. The initial asterisk in these items is replaced by the word stem.
Any subsequent characters in these items are added to the word stem.
A word stem of just one character is not used. Rules are not applied to words of just one character.
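A minimal Python sketch of these steps follows. It only handles plain endings after the asterisk and is an illustration rather than FindinSite's own code; the function name is hypothetical.

```python
def apply_stemming_rule(rule, word):
    """Apply one word stemming rule such as "*ise,*ize" to word."""
    items = [item.strip().lower() for item in rule.split(",")]
    if any(not item.startswith("*") for item in items):
        raise ValueError("every item in a stemming rule must start with *")
    word = word.lower()
    if len(word) <= 1:
        return []  # rules are not applied to one-character words
    ending = items[0][1:]  # characters after the asterisk in the first item
    if not word.endswith(ending):
        return []
    stem = word[:len(word) - len(ending)]
    if len(stem) <= 1:
        return []  # a one-character stem is not used
    # Replace the asterisk in each remaining item with the stem.
    return [stem + item[1:] for item in items[1:]]


print(apply_stemming_rule("*,*s,*es", "throw"))       # ['throws', 'throwes']
print(apply_stemming_rule("*ise,*ize", "authorise"))  # ['authorize']
print(apply_stemming_rule("*ise,*ize", "authorize"))  # []
```

The last call shows why a rule only works in one direction: the word must match the first item before any alternatives are generated.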
As an example, the rule *ise,*ize applied to the search word
authorise produces the alternative word authorize. Note that you need
the following rule if you want authorize to produce the alternative word authorise:
*ize,*ise
Special rule forms can check for words that contain either consonants or vowels:
*# matches words that end in a consonant, and
*$ matches words that end in a vowel.
- For rule
the search word throw produces alternative word thrower.
- For rule
the search word care produces alternative word carer.
characters can be used, where each
$ must match the same character. For example:
- For rule
the search word begin produces alternative words beginner, beginned
- For rule
the search word beginning produces alternative word begin.
Equivalent words rules
Equivalent words are simply put in a comma-separated list. For example, for the rule
magazine,paper,journal
the search word magazine produces alternative words paper and journal.
Rules can be used to cope with different regional spellings, eg the rule
color,colour
copes with the US and UK spelling of this word.
A similar rule could be used to make common possessive pronouns interchangeable.
Spelling correction rules
To correct spelling mistakes in the search text, put the mis-spelling first, then a
greater-than-sign > and then the correct spelling. For example, the rule
teh>the
generates the as an alternative word for teh. However
teh is not generated as an alternative word for the.
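The difference between the two rule types can be shown with a short Python sketch. It assumes that a spelling correction rule has exactly one mis-spelling and one correction, and the rule strings are written out from the examples in the text; the function name is hypothetical and this is not FindinSite's own code.

```python
def apply_word_rule(rule, word):
    """Apply an equivalent words rule or a spelling correction rule."""
    word = word.lower()
    if ">" in rule:
        # Spelling correction: one direction only, mis-spelling > correction.
        wrong, right = [part.strip().lower() for part in rule.split(">", 1)]
        return [right] if word == wrong else []
    # Equivalent words: any item in the list produces all the others.
    items = [item.strip().lower() for item in rule.split(",")]
    if word in items:
        return [item for item in items if item != word]
    return []


print(apply_word_rule("teh>the", "teh"))                      # ['the']
print(apply_word_rule("teh>the", "the"))                      # []
print(apply_word_rule("color,colour", "colour"))              # ['color']
print(apply_word_rule("magazine,paper,journal", "magazine"))  # ['paper', 'journal']
```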
If you want to correct spelling mistakes in your web pages then just use an equivalent
words rule, so that each spelling matches the other.
Basic English Rules file
The supplied English rules file has several basic rules. Please suggest any improvements
to these rules, and send us a copy of any rules files you write for other languages.
Word stemming rules
The rules first remove any common word endings to get a word stem.
The rules then try different word endings:
The rules then add common word endings:
Equivalent words and Spelling corrections