Rules for Word stemming and Synonyms

Last modified: 9 January 2004.

    Introduction

    A rules file tells Spy-CD how to do word stemming, match synonyms and correct spellings. Rules files are also used by Spy-Server.

    Word stemming means taking the stem of a word and generating common variants of the word. As an example, if the search text is throws then the word stem is throw and common variants of this stem include thrower, throwers and throwing. Spy-CD uses rules to check whether the word stem or any of its variants exist in the search database. Any that do are added as alternatives to the original text.

    A synonym is a different word with the same meaning. The Spy-CD rules contain a list of equivalent words. As well as coping with synonyms, this lets Spy-CD match regional variations, eg color matches colour and vice versa. A specialised rule form lets Spy-CD correct mis-spellings, so that teh matches the, but not vice versa.

    With these rules in place, a search for teh color fade matches The colour fades. Note that the rules can also produce unwanted matches: for example, a search for car will match carer.


    The Rules File Format

    A rules file is either a plain text file in ANSI characters or a text file using UTF-8 characters. In both cases, the file is divided up into lines.

    Line 1            1 if using ANSI, or 2 if using UTF-8
    Line 2            The language code and optional country codes, separated by a space, eg en or en GB
    Line 3            A description of the rules file, displayed in Spy-CD
    Line 4 onwards    One rule per line. Each rule is a comma (or greater-than-sign) separated list of items.
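
    The layout above can be read with a few lines of code. The following is a minimal Python sketch, not Spy-CD's own loader. The names RulesFile and parse_rules_file are hypothetical, and it assumes that "ANSI" means the Windows-1252 code page and that blank lines after the header are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class RulesFile:
    encoding: str            # Python codec name chosen from line 1
    language: str            # e.g. "en"
    countries: list[str]     # e.g. ["GB"]; may be empty
    description: str         # line 3, displayed in Spy-CD
    rules: list[str] = field(default_factory=list)


def parse_rules_file(data: bytes) -> RulesFile:
    # Line 1 is a single ASCII digit, so it can be read before decoding.
    first_line = data.split(b"\n", 1)[0].strip()
    if first_line == b"1":
        encoding = "cp1252"  # assumption: "ANSI" means Windows-1252
    elif first_line == b"2":
        encoding = "utf-8"
    else:
        raise ValueError("line 1 must be 1 (ANSI) or 2 (UTF-8)")

    lines = data.decode(encoding).splitlines()
    if len(lines) < 3 or not lines[1].split():
        raise ValueError("expected a language line and a description line")

    language, *countries = lines[1].split()
    # Assumption: blank lines after the header are ignored, not treated as rules.
    rules = [line.strip() for line in lines[3:] if line.strip()]
    return RulesFile(encoding, language, countries, lines[2].strip(), rules)
```

    For a file whose first three lines are 2, en GB and Basic English, this returns language "en", countries ["GB"] and every non-blank line from line 4 onwards as a rule string.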

    There are three types of rules:

  • Word stemming
  • Equivalent words
  • Spelling correction

    How it works

    Spy-CD takes each word in the search text and applies all rules to this word. Each generated word then has all rules applied again and again until no new words are generated. (The rules are not applied again if a generated word is longer than the previous word - this stops words becoming infinitely long.)

    The end result is a list of possible alternatives to the original word. Spy-CD then goes through this list of words and removes a word if it does not appear anywhere in the search database.

    All rule tests are case-insensitive. Rules are not applied to words with non-Latin characters. Do not put punctuation characters in rules, eg co-operation,cooperation will not work.
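
    The repeated application described above can be sketched as a simple work-list loop. This is an illustration in Python, not Spy-CD's actual code. The function expand_word and its apply_rules parameter are hypothetical, and it assumes that "previous word" means the word a candidate was generated from.

```python
from typing import Callable, Iterable


def expand_word(
    word: str,
    apply_rules: Callable[[str], Iterable[str]],
    vocabulary: Iterable[str],
) -> list[str]:
    """Return alternatives to `word` that exist in the search database."""
    start = word.lower()              # rule tests are case-insensitive
    found = {start}                   # every word generated so far
    queued = {start}                  # words that have had (or will have) rules applied
    pending = [start]
    while pending:
        current = pending.pop()
        for candidate in apply_rules(current):
            candidate = candidate.lower()
            found.add(candidate)
            # Assumption: a candidate longer than the word it came from is
            # kept as an alternative but does not have rules applied again.
            if len(candidate) <= len(current) and candidate not in queued:
                queued.add(candidate)
                pending.append(candidate)
    found.discard(start)
    known = {w.lower() for w in vocabulary}
    # Remove any word that does not appear in the search database.
    return sorted(w for w in found if w in known)
```

    Here apply_rules stands for "apply every rule in the rules file to one word", and vocabulary stands for the set of words in the search database. Because only words no longer than their parent are expanded again, the loop always terminates.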

    Word stemming rules

    A word stemming rule is a comma-separated list of items. If a word matches the first item in a rule, then word variants specified in the remaining items are added to the list of possible words.

    For example, a rule *,*s,*es applied to the search word throw produces alternative words throws and throwes.

    All items must start with an asterisk *. If there are any subsequent characters in the first item then the end of the test word must contain the same characters. These characters are then removed to form the word stem.

    The second and following items in the rule describe what alternative words should be generated. The initial * in these items is replaced by the word stem. Any subsequent characters in these items are added to the word stem. A word stem of just one character is not used. Rules are not applied to words of just one character.

    As an example, a rule *ise,*ize applied to the search word authorise produces an alternative word authorize. Note that you need the following rule if you want authorize to produce an alternative word authorise: *ize,*ise.
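
    The behaviour of a plain word stemming rule (one without the special # and $ forms) can be sketched in Python as follows. The function name apply_stemming_rule is hypothetical, and the sketch assumes that an empty word stem is rejected in the same way as a one-character stem.

```python
def apply_stemming_rule(rule: str, word: str) -> list[str]:
    """Apply one plain stemming rule such as '*ise,*ize' to a word."""
    items = [item.strip().lower() for item in rule.split(",")]
    if any(not item.startswith("*") for item in items):
        raise ValueError("every item in a stemming rule must start with *")

    word = word.lower()
    if len(word) < 2:
        return []                     # rules are not applied to one-character words

    ending = items[0][1:]             # characters after * in the first item
    if not word.endswith(ending):
        return []                     # the rule does not match this word

    stem = word[: len(word) - len(ending)]
    if len(stem) < 2:
        return []                     # a one-character (or empty) stem is not used

    # Replace the leading * of each remaining item with the word stem.
    return [stem + item[1:] for item in items[1:]]


print(apply_stemming_rule("*,*s,*es", "throw"))       # ['throws', 'throwes']
print(apply_stemming_rule("*ise,*ize", "authorise"))  # ['authorize']
print(apply_stemming_rule("*ise,*ize", "authorize"))  # []
```

    The last call shows why the reverse rule *ize,*ise is needed as well: a rule only fires when the word matches its first item.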

    Special rule forms can check for words that contain either consonants or vowels:

  • *# matches words that end in a consonant, and
  • *$ matches words that end in a vowel.

    Examples:

  • For rule *#,*#er, the search word throw produces alternative word thrower.
  • For rule *$,*er, the search word care produces alternative word carer.

    Multiple # or $ characters can be used, where each # or $ must match the same character. For example:

  • For rule *#,*##er,*##ed,*##ing, the search word begin produces alternative words beginner, beginned and beginning.
  • For rule *##ing,*#, the search word beginning produces alternative word begin.
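
    The # and $ forms in the examples above can be sketched in Python as follows. This is one possible reading of the rule syntax, not Spy-CD's implementation. The name apply_pattern_rule is hypothetical, and the sketch assumes that the vowels are a, e, i, o and u, and that every other unaccented letter (including y) counts as a consonant.

```python
VOWELS = set("aeiou")                       # assumption: y is a consonant
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")


def apply_pattern_rule(rule: str, word: str) -> list[str]:
    """Apply one stemming rule that may contain # (consonant) or $ (vowel)."""
    first, *others = [item.strip().lower() for item in rule.split(",")]
    word = word.lower()
    tail = first[1:]                        # pattern characters after the *
    if not first.startswith("*") or len(word) - len(tail) < 2:
        return []                           # stem would be too short to use

    stem = word[: len(word) - len(tail)]
    ending = word[len(word) - len(tail):]

    bound = {}                              # letter matched by # and by $
    for token, ch in zip(tail, ending):
        if token == "#":
            allowed = CONSONANTS
        elif token == "$":
            allowed = VOWELS
        else:
            if token != ch:
                return []                   # literal character does not match
            continue
        # Every # (or every $) must match the same letter.
        if ch not in allowed or bound.setdefault(token, ch) != ch:
            return []

    results = []
    for item in others:
        # Skip items that use a placeholder the first item never matched.
        if any(t in "#$" and t not in bound for t in item):
            continue
        results.append(stem + "".join(bound.get(t, t) for t in item[1:]))
    return results


print(apply_pattern_rule("*#,*#er", "throw"))       # ['thrower']
print(apply_pattern_rule("*$,*er", "care"))         # ['carer']
print(apply_pattern_rule("*##ing,*#", "beginning")) # ['begin']
```

    With the same assumptions, the rule *#,*#er applied to car yields carer, which is the unwanted match mentioned in the introduction.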

    Equivalent words rules

    Equivalent words are simply put in a comma-separated list. For example, for the rule paper,magazine,journal, the search word magazine produces alternative words paper and journal.

    Rules can be used to cope with different regional spellings, eg color,colour copes with the US and UK spelling of this word.

    Similarly, his,her,their could be used to make common possessive pronouns interchangeable.

    Spelling correction rules

    To correct spelling mistakes in the search text, put the mis-spelling first, then a greater-than-sign > and then the correct spelling.

    For example, teh>the generates the as an alternative word for teh. However teh is not generated as an alternative word for the.

    If you also want a correctly spelled search word to match mis-spellings in your web pages, use an equivalent words rule instead, eg teh,the.
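
    The difference between an equivalent words rule and a spelling correction rule is only the direction of the match. A minimal Python sketch of both forms is given here. The name apply_word_rule is hypothetical, and it assumes that a spelling correction rule holds exactly one mis-spelling and one correct spelling.

```python
def apply_word_rule(rule: str, word: str) -> list[str]:
    """Apply an equivalent words rule (a,b,c) or a spelling rule (wrong>right)."""
    word = word.lower()
    if ">" in rule:
        # One-way: only the mis-spelling produces an alternative.
        wrong, right = (part.strip().lower() for part in rule.split(">", 1))
        return [right] if word == wrong else []

    # Two-way: any word in the list produces all the others.
    items = [item.strip().lower() for item in rule.split(",")]
    if word not in items:
        return []
    return [item for item in items if item != word]


print(apply_word_rule("paper,magazine,journal", "magazine"))  # ['paper', 'journal']
print(apply_word_rule("teh>the", "teh"))                      # ['the']
print(apply_word_rule("teh>the", "the"))                      # []
print(apply_word_rule("teh,the", "the"))                      # ['teh']
```

    The last two calls show the choice between the two forms: teh>the never widens a search for the, while teh,the also finds pages that contain the mis-spelling.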


    Basic English Rules file

    The supplied English rules file has several basic rules. Please suggest any improvements to these rules. Let us have a copy of rules files for other languages.

    Word stemming rules

    The rules first remove any common word endings to get a word stem.
    *s,*
    *er,*
    *ers,*
    *ed,*
    *ing,*
    *eer,*
    *ier,*
    *ly,*
    *ion,*
    *ise,*
    *ize,*
    
    *er,*e
    *ed,*e
    *ion,*e
    
    *##ing,*#
    *##er,*#
    *##ed,*#

    The rules then try different word endings:

    *ise,*ize
    *ize,*ise
    *or,*er
    *er,*or
    *our,*or
    *or,*our
    *y,*ies
    *able,*ible
    *ible,*able
    *ance,*ence
    *ence,*ance
    *g,*gue
    *gue,*g

    The rules then add common word endings:

    *,*s,*es
    *#,*#e,*#er,*#ers,*#ed,*#ing,*#eer,*#ier,*#ly,*#ise,*#ize,*#ion
    *e,*er,*ers,*ed,*ing,*ion
    *#,*##er,*##ers,*##ed,*##ing

    Equivalent words and Spelling corrections

    color,colour
    licence,license
    language,langauge
    a,an
    his,her,their
    affect,effect
    teh>the
    neccesary>necessary
    recieve>receive
    francais,français

    Spy-Server online
    Spy-Server is part of the FindinSite software range - search engines for MS-servers, Java-servers and CDs/DVDs