Introduction
A rules file tells Spy-CD how to do word stemming, match synonyms and correct spellings.
Rules files are also used by Spy-Server.
Word stemming means taking the stem of a word and generating common variants of the word.
As an example, if the search text is throws then the word stem is throw and
common variants of this stem include thrower, throwers and throwing.
Spy-CD uses the rules to check whether the word stem or any of its variants exists in the search database.
Any that do are added as alternatives to the original text.
A synonym is a different word with the same meaning. The Spy-CD rules contain
a list of equivalent words. As well as coping with synonyms, this lets Spy-CD match regional
variations, eg color matches colour and vice versa. A specialised rule form
lets Spy-CD correct mis-spellings, so that teh matches the, but not vice versa.
With these rules in place, a search for teh color fade matches The colour fades.
Note that the rules can produce unwanted matches: for example, a search for car will also match carer.
The Rules File Format
A rules file is either a plain text file in ANSI characters or a text file using UTF-8 characters.
In both cases, the file is divided up into lines.
| Line 1 | 1 if using ANSI, 2 if using UTF-8 |
| Line 2 | The language code and optional country codes, separated by a space, eg en or en GB |
| Line 3 | A description of the rules file, displayed in Spy-CD |
| Line 4 onwards | One rule per line. Each rule is a comma (or greater-than-sign) separated list of items. There are three types of rules: word stemming rules, equivalent words rules and spelling correction rules. |
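The layout above can be read with a few lines of code. The following is a minimal sketch, not Spy-CD's own parser: the names RulesFile and parse_rules_file are hypothetical, "ANSI" is assumed to mean Windows code page 1252, and blank lines are assumed to be ignorable.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RulesFile:
    encoding: str         # Python codec name chosen from line 1
    language: str         # first code on line 2, eg "en"
    countries: List[str]  # optional country codes on line 2, eg ["GB"]
    description: str      # line 3
    rules: List[str]      # line 4 onwards, one rule per entry


def parse_rules_file(data: bytes) -> RulesFile:
    """Split raw file bytes into the header lines and the rule lines."""
    lines = data.splitlines()
    if len(lines) < 3:
        raise ValueError("a rules file needs at least three header lines")

    # Line 1 is a plain ASCII digit, so it can be read before decoding.
    flag = lines[0].strip()
    if flag == b"1":
        encoding = "cp1252"  # assumption: "ANSI" means Windows-1252
    elif flag == b"2":
        encoding = "utf-8"
    else:
        raise ValueError("line 1 must be 1 (ANSI) or 2 (UTF-8)")

    text = [line.decode(encoding).strip() for line in lines[1:]]
    codes = text[0].split()
    if not codes:
        raise ValueError("line 2 must contain a language code")

    # Assumption: blank lines after the header are skipped.
    rules = [line for line in text[2:] if line]
    return RulesFile(encoding, codes[0], codes[1:], text[1], rules)


sample = (
    "2\n"
    "en GB\n"
    "Basic English rules\n"
    "*ise,*ize\n"
    "color,colour\n"
    "teh>the\n"
).encode("utf-8")
parsed = parse_rules_file(sample)
print(parsed.language, parsed.countries, parsed.rules)
```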
How it works
Spy-CD takes each word in the search text and applies all rules to this word.
Each generated word then has all rules applied to it in turn, and this repeats until no new words
are generated. (The rules are not applied again if a generated word is longer than
the previous word. This stops words becoming infinitely long.)
The end result is a list of possible alternatives to the original word.
Spy-CD then goes through this list of words and removes a word if it
does not appear anywhere in the search database.
All rule tests are letter case insensitive.
Rules are not applied to words with non-latin characters.
Do not put punctuation characters in rules. For example, the rule
co-operation,cooperation
will not work.
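The repeated application described above can be sketched as a simple work-list loop. This is an illustration only, not Spy-CD's implementation: the function expand and the toy rule table are hypothetical, "previous word" is assumed to mean the word a candidate was generated from, and the database is assumed to hold lower-case words.

```python
from typing import Callable, Iterable, Set

# Toy stand-in for "apply all rules to one word" (hypothetical data).
TOY_RULES = {
    "throws": ["throw"],
    "throw": ["thrower", "throwing", "throwes"],
    "thrower": ["throwers"],
}


def toy_apply_rules(word: str) -> Iterable[str]:
    return TOY_RULES.get(word, [])


def expand(word: str,
           apply_rules: Callable[[str], Iterable[str]],
           database: Set[str]) -> Set[str]:
    """Return the alternatives for word that exist in the database."""
    word = word.lower()  # rule tests are letter case insensitive
    seen = {word}
    pending = [(word, True)]  # (candidate, may the rules be applied again?)
    while pending:
        current, may_expand = pending.pop()
        if not may_expand:
            continue
        for new_word in apply_rules(current):
            if new_word in seen:
                continue
            seen.add(new_word)
            # Assumption: a word longer than the one it came from is kept
            # as an alternative but does not have the rules applied again.
            pending.append((new_word, len(new_word) <= len(current)))
    # Drop the original word, then anything missing from the database.
    return {w for w in seen - {word} if w in database}


database = {"throw", "thrower", "throwing", "throwers"}
print(sorted(expand("throws", toy_apply_rules, database)))
# prints ['throw', 'thrower', 'throwing']
```

In this run throwes is generated but removed because it is not in the database, and throwers is never generated because thrower is longer than throw and so is not expanded further.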
Word stemming rules
A word stemming rule is a comma-separated list of items.
If a word matches the first item in a rule, then word variants specified
in the remaining items are added to the list of possible words.
For example, a rule *,*s,*es applied to the search word
throw produces alternative words throws and throwes.
All items must start with an asterisk *.
If there are any subsequent characters in the first item then the end of the test word must contain the same
characters. These characters are then removed to form the word stem.
The second and following items in the rule describe what alternative words should be
generated. The initial *
in these items is replaced by the word stem.
Any subsequent characters in these items are added to the word stem.
A word stem of just one character is not used. Rules are not applied to words of just one character.
As an example, a rule *ise,*ize applied to the search word
authorise produces an alternative word authorize. Note that you need
the following rule if you want authorize to produce an alternative word authorise:
*ize,*ise.
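As a rough illustration of the matching just described, the sketch below applies one stemming rule that uses only literal endings. The function name apply_suffix_rule is hypothetical, and treating an empty stem the same as a one-character stem is an assumption.

```python
from typing import List


def apply_suffix_rule(rule: str, word: str) -> List[str]:
    """Apply one stemming rule (literal endings only) to one word."""
    items = [item.strip().lower() for item in rule.split(",")]
    if any(not item.startswith("*") for item in items):
        raise ValueError("all items in a stemming rule must start with *")

    word = word.lower()
    if len(word) < 2:
        return []  # rules are not applied to one-character words

    ending = items[0][1:]  # characters after the * in the first item
    if not word.endswith(ending):
        return []

    stem = word[:len(word) - len(ending)]
    if len(stem) < 2:
        return []  # assumption: an empty stem is rejected like a 1-char stem

    # Replace the * in each remaining item with the stem.
    return [stem + item[1:] for item in items[1:]]


print(apply_suffix_rule("*,*s,*es", "throw"))        # ['throws', 'throwes']
print(apply_suffix_rule("*ise,*ize", "authorise"))   # ['authorize']
print(apply_suffix_rule("*ise,*ize", "authorize"))   # []
```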
Special rule forms can check for words that contain either consonants or vowels:
*# matches words that end in a consonant, and
*$ matches words that end in a vowel.
Examples:
For rule *#,*#er,
the search word throw produces alternative word thrower.
For rule *$,*er
the search word care produces alternative word carer.
Multiple # or $ characters can be used, where each # or $ must match the same character. For example:
For rule *#,*##er,*##ed,*##ing
the search word begin produces alternative words beginner, beginned
and beginning.
For rule *##ing,*#
the search word beginning produces alternative word begin.
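The # and $ forms can be sketched in the same way. The code below is one possible reading of the description, not Spy-CD's actual matcher: apply_stem_rule is a hypothetical name, vowels are assumed to be a, e, i, o and u (so y counts as a consonant), and an output item that uses a # or $ the first item did not match is assumed to be skipped.

```python
from typing import Dict, List, Optional

VOWELS = set("aeiou")  # assumption: y is treated as a consonant


def _match_ending(pattern: str, ending: str) -> Optional[Dict[str, str]]:
    """Match pattern against ending; return the letters bound to # and $."""
    bound: Dict[str, str] = {}
    for p, ch in zip(pattern, ending):
        if p in "#$":
            if not (ch.isascii() and ch.isalpha()):
                return None
            if (p == "$") != (ch in VOWELS):
                return None  # wrong class: # needs a consonant, $ a vowel
            if bound.setdefault(p, ch) != ch:
                return None  # every # (or $) must match the same letter
        elif p != ch:
            return None      # literal characters must match exactly
    return bound


def apply_stem_rule(rule: str, word: str) -> List[str]:
    """Apply one stemming rule, including # and $ forms, to one word."""
    items = [item.strip().lower() for item in rule.split(",")]
    word = word.lower()
    pattern = items[0][1:]
    split = len(word) - len(pattern)
    if len(word) < 2 or split < 2:
        return []  # one-character words and too-short stems are not used
    stem, ending = word[:split], word[split:]
    bound = _match_ending(pattern, ending)
    if bound is None:
        return []
    results = []
    for item in items[1:]:
        suffix = item[1:]
        if any(c in "#$" and c not in bound for c in suffix):
            continue  # assumption: skip items with an unmatched # or $
        results.append(stem + "".join(bound.get(c, c) for c in suffix))
    return results


print(apply_stem_rule("*#,*#er", "throw"))                # ['thrower']
print(apply_stem_rule("*$,*er", "care"))                  # ['carer']
print(apply_stem_rule("*#,*##er,*##ed,*##ing", "begin"))  # ['beginner', 'beginned', 'beginning']
print(apply_stem_rule("*##ing,*#", "beginning"))          # ['begin']
```

Note that the matched # or $ letter is removed along with the rest of the ending, so for care the stem is car and *er gives carer.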
Equivalent words rules
Equivalent words are simply put in a comma-separated list. For example, for the rule
paper,magazine,journal,
the search word magazine produces alternative words paper and journal.
Rules can be used to cope with different regional spellings,
eg color,colour
copes with the US and UK spelling of this word.
Similarly, his,her,their
could be used to make common possessive pronouns interchangeable.
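An equivalent words rule amounts to a membership test on the list. A minimal sketch, using the hypothetical function name apply_equivalents:

```python
from typing import List


def apply_equivalents(rule: str, word: str) -> List[str]:
    """Return the other words in the rule if word is one of its items."""
    items = [item.strip().lower() for item in rule.split(",") if item.strip()]
    word = word.lower()  # rule tests are letter case insensitive
    if word not in items:
        return []
    return [item for item in items if item != word]


print(apply_equivalents("paper,magazine,journal", "magazine"))  # ['paper', 'journal']
print(apply_equivalents("color,colour", "Colour"))              # ['color']
print(apply_equivalents("color,colour", "fade"))                # []
```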
Spelling correction rules
To correct spelling mistakes in the search text, put the mis-spelling first, then a
greater-than-sign > and then the correct spelling.
For example, teh>the
generates the as an alternative word for teh. However
teh is not generated as an alternative word for the.
If your web pages themselves contain a mis-spelling that you want searches to find, use an equivalent
words rule instead, eg teh,the.
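The one-way behaviour of a spelling correction rule can be sketched as follows. The function name apply_correction is hypothetical, and only the two-item form shown above (one mis-spelling, one correction) is handled.

```python
from typing import List


def apply_correction(rule: str, word: str) -> List[str]:
    """Apply a 'wrong>right' rule: only the mis-spelling yields an alternative."""
    if rule.count(">") != 1:
        raise ValueError("expected exactly one > in a spelling correction rule")
    wrong, right = (part.strip().lower() for part in rule.split(">"))
    return [right] if word.lower() == wrong else []


print(apply_correction("teh>the", "teh"))  # ['the']
print(apply_correction("teh>the", "the"))  # []
```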
Basic English Rules file
The supplied English rules file has several basic rules. Please suggest any improvements
to these rules, and please send us a copy of any rules files you write for other languages.
Word stemming rules
The rules first remove any common word endings to get a word stem.
*s,*
*er,*
*ers,*
*ed,*
*ing,*
*eer,*
*ier,*
*ly,*
*ion,*
*ise,*
*ize,*
*er,*e
*ed,*e
*ion,*e
*##ing,*#
*##er,*#
*##ed,*#
The rules then try different word endings:
*ise,*ize
*ize,*ise
*or,*er
*er,*or
*our,*or
*or,*our
*y,*ies
*able,*ible
*ible,*able
*ance,*ence
*ence,*ance
*g,*gue
*gue,*g
The rules then add common word endings:
*,*s,*es
*#,*#e,*#er,*#ers,*#ed,*#ing,*#eer,*#ier,*#ly,*#ise,*#ize,*#ion
*e,*er,*ers,*ed,*ing,*ion
*#,*##er,*##ers,*##ed,*##ing
Equivalent words and Spelling corrections
color,colour
licence,license
language,langauge
a,an
his,her,their
affect,effect
teh>the
neccesary>necessary
recieve>receive
francais,français