FindinSite-CD: Search engine for CD/DVD

 

findex Java indexing tool


Introduction

Findex is a Java version of the FindinSite-CD-Wizard indexing tool for FindinSite-CD.  As a Java application, it should run on both Windows and non-Windows platforms.  (The Findex indexing code is also used by FindinSite-JS, the site search engine for web sites.)

Findex indexes HTML web pages, TXT text files, PDF files, Word 97+ DOC documents, PowerPoint PPT/PPS presentations and JPEG images for meta-data.  It does not currently index XLS files.  Findex does not have a search database editing facility.  Findex will not index Office documents if you run it using the Microsoft VM.

Findex supports Parser plug-ins to index new file types.  The supplied phdccRDF plug-in finds meta-data in RDF/XML files - this plug-in does run under the Microsoft VM.

Findex finds any meta-data information for field searches, including that contained in RDF/XML files.  A search database produced by Findex cannot be properly read by FindinSite-CD-Wizard because FindinSite-CD-Wizard does not support field searches.

Apart from fields, a search database produced by Findex will be functionally equivalent to one produced by FindinSite-CD-Wizard (with the same options selected).  However the search database physical files will not be identical.  Indeed, the search database files produced by different Java implementations (eg the Sun JVM and the Microsoft VM) will not be physically identical [because each implementation enumerates the contents of a hash table in a different order].

Findex is usually much slower than FindinSite-CD-Wizard on the same computer.  In addition, Findex will tend to use more memory than FindinSite-CD-Wizard.  The Microsoft VM seems to require less memory than the Sun JVM.  However the Sun JVM seems to run faster than the Microsoft VM for smaller sites.

Findex has been tested in Windows using the Microsoft and Sun JVMs, and in Sun Solaris 2.6 and Solaris 9 using the Sun JVM.  As described above, some aspects of Findex will not run under the Microsoft VM.

Executable

The Findex executable (Findex.jar) is only supplied in the FindinSite-CD development kit for Windows, and is placed in the installation directory, eg:
C:\Program Files\PHD\fisCD\Findex.jar

To use Findex on other platforms, you must first install the FindinSite-CD development kit on a Windows computer. Then copy Findex.jar from the installation directory to a suitable directory on the new platform.

You may need to add Findex.jar or its directory to the Java CLASSPATH environment variable definition if you are not using its full path when running it.

Usage

Findex is a Java application that can either:
  • Make a new search database
  • Reindex an existing search database

    Findex is supplied in a Java archive file called Findex.jar, with the main class com.phdcc.findex.Findex identified as the "Main-Class" in the JAR manifest.

    Run Findex in the Java Virtual Machine on your computer, passing parameters to tell it what to do.  There should be no need to make any further changes to your existing Java CLASSPATH definition provided it already refers to the standard Java Runtime Environment classes.  For the Sun JVM you may need to use the -Xmx parameter to the java command to increase the maximum size of the memory allocation pool from 64MB if you run into Out of Memory errors.

    Here are four different examples.  The first uses the Microsoft VM.  The last three use the Sun JVM in different ways.  In each case, the Findex parameters are the arguments that follow the class name or the JAR file name.

    jview /cp:a Findex.jar com.phdcc.findex.Findex @index.properties

    java -Xmx256M -jar Findex.jar -r MySite.his

    java -cp Findex.jar;phdccRDF.jar com.phdcc.findex.Findex -q @index.properties

    java -jar "C:\Program Files\PHD\fisCD\Findex.jar" -v

    Findex returns an exit code of zero if there are no errors or scan problems.  A positive number usually indicates a hard error.  A negative number usually indicates the number of scan problems.  See below for full details.
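If Findex is started from another Java program rather than from a shell, the command line can be assembled as a list of arguments. The following is only a sketch: the class name FindexLauncher, the helper buildCommand and the --run switch are hypothetical, and it assumes that a java command is on the PATH and that Findex.jar is in the current directory. Because each argument is passed separately, a JAR path containing spaces needs no extra quoting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FindexLauncher {
    // Builds a "java [-XmxNM] -jar <jar> <findex args...>" command line.
    // A maxHeapMb of 0 or less leaves the JVM default heap size in place.
    static List<String> buildCommand(String jarPath, int maxHeapMb, String... findexArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java"); // assumption: "java" is found on the PATH
        if (maxHeapMb > 0) {
            cmd.add("-Xmx" + maxHeapMb + "M");
        }
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.addAll(Arrays.asList(findexArgs));
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand("Findex.jar", 256, "-r", "MySite.his");
        System.out.println(cmd);
        // Only start the process when asked to, since Findex.jar may not be present.
        if (args.length > 0 && args[0].equals("--run")) {
            Process process = new ProcessBuilder(cmd).inheritIO().start();
            System.out.println("Findex exit code: " + process.waitFor());
        }
    }
}
```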

    See the parser plug-ins documentation for details of how to run Findex with plug-ins.

    Parameters

    The following parameters can be passed to Findex:
    -v | -vv
    or
    [-q | -qq] ( @index_instructions_file | -r search_database )
    If -v is specified then Findex lists several Java system properties to the "standard output" device.  For -vv, Findex lists all available system properties.

    If -q is specified then Findex runs in quiet mode, not writing any progress information to the "standard output" device, but still writing error information to the "standard error" device. If -qq is specified then Findex runs in silent mode, not writing any progress information to the "standard output" or "standard error" devices.

    You must then specify whether you want to create a new search database (using @index_instructions_file) or reindex an existing search database (using -r search_database).  In fact, you can create a new search database to overwrite an existing one.
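The parameter grammar above can be read as a small decision procedure. The sketch below is an illustration of that grammar only, not the Findex source: the class FindexArgs, its fields and the Mode names are hypothetical, and the exception messages simply echo the wording of the documented return values 1 and 2.

```java
public class FindexArgs {
    enum Mode { SHOW_PROPERTIES, SHOW_ALL_PROPERTIES, CREATE, REINDEX }

    final Mode mode;
    final int quietLevel;  // 0 = normal, 1 = -q, 2 = -qq
    final String target;   // instructions file or search database; null for -v and -vv

    FindexArgs(Mode mode, int quietLevel, String target) {
        this.mode = mode;
        this.quietLevel = quietLevel;
        this.target = target;
    }

    // Parses: -v | -vv   or   [-q | -qq] ( @index_instructions_file | -r search_database )
    static FindexArgs parse(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException("Insufficient arguments");
        }
        if (args[0].equals("-v")) {
            return new FindexArgs(Mode.SHOW_PROPERTIES, 0, null);
        }
        if (args[0].equals("-vv")) {
            return new FindexArgs(Mode.SHOW_ALL_PROPERTIES, 0, null);
        }
        int i = 0;
        int quiet = 0;
        if (args[i].equals("-q")) {
            quiet = 1;
            i++;
        } else if (args[i].equals("-qq")) {
            quiet = 2;
            i++;
        }
        if (i >= args.length) {
            throw new IllegalArgumentException("Insufficient arguments");
        }
        if (args[i].startsWith("@") && args[i].length() > 1) {
            return new FindexArgs(Mode.CREATE, quiet, args[i].substring(1));
        }
        if (args[i].equals("-r")) {
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Insufficient arguments");
            }
            return new FindexArgs(Mode.REINDEX, quiet, args[i + 1]);
        }
        throw new IllegalArgumentException("Invalid argument: " + args[i]);
    }

    public static void main(String[] args) {
        FindexArgs parsed = parse(new String[] {"-q", "@index.properties"});
        System.out.println(parsed.mode + " quiet=" + parsed.quietLevel + " target=" + parsed.target);
    }
}
```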

    Reindexing a search database

    To reindex a search database, simply pass Findex the name of the search database system file, eg MySite.his, after -r.

    Note that the Include and Exclude settings from an Index Instructions File are not saved in the search database system file, so any earlier inclusions or exclusions will not be used during a reindex. Similarly, plug-in options are not stored so these will be lost if you do a reindex.

    If the search database was created by FindinSite-CD-Wizard and included options to scan file types not supported by Findex, then Findex will stop immediately with return value 31.

    Index Instructions File

    To create a new search database, you must pass Findex the name of an "Index Instructions File".

    The Index Instructions file specifies the indexing options that Findex should use, such as the pathname of the search database to create.  The instructions file is in the format of a Java properties text file, where each line specifies a property and a value separated by an equals sign, with various escape sequences recognised - see below.  Property names are case sensitive.  See also the full example.

    You must give a SaveAsPathname.  Depending on the ScanType, you must specify other parameters as follows:

  • dir: ScanDirectory and ScanDirLevels
  • file: ScanPathname
  • url: ScanURL
    You must set at least one of these properties: ParseHTML, ParsePDF, ParseDOC, ParsePPT and ParseTXT.
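These rules can be checked before Findex is run. The sketch below is a hypothetical pre-flight check, not part of Findex: the class InstructionsCheck and its messages are invented for illustration, and it assumes the defaults listed in the properties table (ScanType defaults to dir, ParseHTML to true, the other Parse properties to false). It does not check ScanDirLevels because that property has a default, and Findex itself may apply its checks in a different order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class InstructionsCheck {
    private static boolean missing(Properties p, String name) {
        String value = p.getProperty(name);
        return value == null || value.trim().isEmpty();
    }

    private static boolean isTrue(Properties p, String name, String defaultValue) {
        return p.getProperty(name, defaultValue).trim().equals("true");
    }

    // Returns a list of problems; an empty list means the basic rules are met.
    static List<String> problems(Properties p) {
        List<String> found = new ArrayList<>();
        if (missing(p, "SaveAsPathname")) {
            found.add("SaveAsPathname not given");
        }
        String scanType = p.getProperty("ScanType", "dir").trim();
        switch (scanType) {
            case "dir":
                if (missing(p, "ScanDirectory")) found.add("ScanDirectory not specified");
                break;
            case "file":
                if (missing(p, "ScanPathname")) found.add("ScanPathname not specified");
                break;
            case "url":
                if (missing(p, "ScanURL")) found.add("ScanURL not specified");
                break;
            default:
                found.add("ScanType invalid: " + scanType);
        }
        boolean anyParser = isTrue(p, "ParseHTML", "true")
                || isTrue(p, "ParsePDF", "false")
                || isTrue(p, "ParseDOC", "false")
                || isTrue(p, "ParsePPT", "false")
                || isTrue(p, "ParseTXT", "false");
        if (!anyParser) {
            found.add("None of ParseHTML, ParsePDF, ParseDOC, ParsePPT, ParseTXT is true");
        }
        return found;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("SaveAsPathname", "srchdb");
        p.setProperty("ScanType", "url");
        System.out.println(problems(p)); // ScanURL is missing in this example
    }
}
```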

    The following table details all the available properties:

    Name | Optional | Description | Default
    Description | Yes | The search database description | Taken from the first page title found
    SaveAsPathname | No | The pathname used to save the search database, without any extension.  The directory element of the pathname must exist. |
    ScanType | Yes | Indicates how Findex finds files to index: dir = scan all files in ScanDirectory to a depth of ScanDirLevels; file = scan by following links from ScanPathname; url = scan by following links from ScanURL | dir
    ScanDirectory | Yes * | The directory used to find files if ScanType is dir |
    ScanDirLevels | Yes | The number of directory levels to scan if ScanType is dir.  Use a number in the range 0 to 255, or all. | all
    ScanPathname | Yes * | The initial file scanned if ScanType is file |
    ScanURL | Yes * | The initial URL scanned if ScanType is url |
    ParseHTML | Yes | Specify true if you want to scan HTML web pages, or false if not. | true
    HTML_Files | Yes | The file specification for HTML files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.htm,*.html
    ParsePDF | Yes | Specify true if you want to scan PDF files, or false if not. | false
    ParserN | Yes | Specify zero or more plug-in parsers.  See the ParserN description on the Plug-in page. |
    PDF_Files | Yes | The file specification for PDF files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.pdf
    PDF_Passwords | Yes | Specify a comma-separated list of passwords to open PDF files. |
    PDF_ReportCharacterDecodeProblems | Yes | Specify true if you want to have any PDF character decode problems listed, or false if not. | false
    ParseDOC | Yes | Specify true if you want to scan DOC Word 97+ documents, or false if not. | false
    DOC_Files | Yes | The file specification for DOC files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.doc
    ParsePPT | Yes | Specify true if you want to scan PPT PowerPoint presentations, or false if not. | false
    PPT_Files | Yes | The file specification for PPT files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.ppt,*.pps
    ParseTXT | Yes | Specify true if you want to scan TXT text files, or false if not. | false
    TXT_Files | Yes | The file specification for TXT files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.txt
    ParseImage | Yes | Specify true if you want to scan JPEG images for meta-data, or false if not. | false
    Image_Files | Yes | The file specification for JPEG files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.jpg,*.jpeg
    CaseSignificant | Yes | If finding files by following links, then the case of filenames is ignored if false.  If true then Findex views test.htm and Test.htm as separate files.  Windows always seems to ignore filename letter cases.  In Unix, filename case must be correct. | Windows: false; non-Windows: true
    StoreStopWords | Yes | If false, Findex does not include words specified in StopWordFile. | true
    StopWordFile | Yes | The pathname of the file containing stop words, with one word per line in UTF-8 format. |
    NoTitleIgnorePageLinks | Yes | If finding files by following links and this property is set to true, then links are not followed if a page has no title. | false
    ParseUpHierarchy | Yes | If finding files by following links and this property is set to true, then links are followed to directories above the initial file. | false
    StorePositions | Yes | If true then Findex stores word positions so that "adjacent word" searches will work. | true
    StoreLoneWords | Yes | If true then Findex stores a word's position even if the two surrounding words are stop words. | true
    UseNoBaseURLs | Yes | Determines whether to include a Base URL prefix for each page in the search database | false for URL scans; true for File/Dir scans
    UseMetaDescriptionAsAbstract | Yes | If true then the page abstract will be taken from the page META description tag. | true
    UseMetaAbstractAsAbstract | Yes | If true then the page abstract will be taken from the (new) page META abstract tag. | true
    AbstractWords | Yes | If building the abstract from the words in a file, this property indicates the number of words to use. | 30
    Include | Yes | A list of file specifications to include in the search database.  See below | All files will be included
    Exclude | Yes | A list of file specifications to exclude from the search database.  See below | No files will be excluded

    Include and Exclude files

    The Include and Exclude properties provide an optional list of file-specs to determine the files to include or exclude in the search database.

    The initial list of acceptable files is determined by the appropriate HTML_Files, PDF_Files, DOC_Files, PPT_Files or TXT_Files settings. Then:

  • If an Include file-spec set is given, then only files meeting one of the given file-specs will be indexed.
  • If an Exclude file-spec set is given, then any files meeting one of the given file-specs will not be indexed.
  • Note that the Includes are processed first and the Excludes afterwards, so an Exclude file-spec takes precedence.

    An individual file-spec can include zero or more * or ? wildcard characters, where ? matches exactly one character, and * matches zero or more characters. For example file???.ht* would match:
        file001.htm, file101.html and file111.ht
    but not
        file1001.htm
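The wildcard rules and the Include/Exclude precedence described above can be sketched in a few lines of Java. This is an illustration rather than the Findex implementation: the class FileSpecMatcher and its methods are hypothetical, the match is case sensitive, and it is applied to a bare file name; whether Findex matches against the file name or the full path is not stated here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class FileSpecMatcher {
    // Translates a file-spec to a regular expression:
    // '*' matches zero or more characters, '?' matches exactly one character.
    static boolean matches(String fileSpec, String fileName) {
        StringBuilder regex = new StringBuilder();
        for (char c : fileSpec.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(fileName).matches();
    }

    static boolean matchesAny(List<String> fileSpecs, String fileName) {
        for (String spec : fileSpecs) {
            if (matches(spec, fileName)) {
                return true;
            }
        }
        return false;
    }

    // Includes are applied first (an empty list means "include everything"),
    // then Excludes, so an Exclude file-spec takes precedence.
    static boolean accepts(List<String> includes, List<String> excludes, String fileName) {
        boolean included = includes.isEmpty() || matchesAny(includes, fileName);
        return included && !matchesAny(excludes, fileName);
    }

    public static void main(String[] args) {
        List<String> includes = Arrays.asList("iso*", "*12*");
        List<String> excludes = Arrays.asList("file???.ht*");
        for (String name : new String[] {"iso8859-1.htm", "Windows1252.htm", "Windows874.htm", "file123.htm"}) {
            System.out.println(name + " -> " + accepts(includes, excludes, name));
        }
    }
}
```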

    A list of file-specs can be given directly in the property, or indirectly in a file.

    Direct file-specs

    Direct file-specs are semi-colon separated, eg:
    Include=iso*;*12*
    Exclude=file???.ht*

    This specifies two Include file-specs and one Exclude file-spec.

    Indirect file-specs in a file

    An indirect value consists of @ followed by a file name, where file-specs are specified one per line in plain text. The above direct example may be expressed indirectly as follows:
    Include=@includes.txt
    Exclude=@excludes.txt

    where includes.txt contains:
    iso*
    *12*

    and excludes.txt contains:
    file???.ht*

    If an indirect file cannot be opened, an error message is printed to the standard output unless -q or -qq is specified.
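The difference between direct and indirect values can be sketched as follows. The class FileSpecList and its parse method are hypothetical, and several details are assumptions rather than documented Findex behaviour: the indirect file is read as UTF-8, surrounding whitespace is trimmed, and blank entries are skipped.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FileSpecList {
    // Accepts either "spec1;spec2" (direct) or "@filename" (indirect, one spec per line).
    static List<String> parse(String value) {
        List<String> raw;
        if (value.startsWith("@")) {
            try {
                // Assumption: the indirect file is plain text in UTF-8.
                raw = Files.readAllLines(Paths.get(value.substring(1)), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        } else {
            raw = Arrays.asList(value.split(";"));
        }
        List<String> specs = new ArrayList<>();
        for (String entry : raw) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                specs.add(trimmed);
            }
        }
        return specs;
    }

    public static void main(String[] args) {
        // Pass e.g. "@includes.txt" as the first argument to try the indirect form.
        String value = args.length > 0 ? args[0] : "iso*;*12*";
        System.out.println(parse(value));
    }
}
```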


    Example Index Instructions File

    This example specifies a search database description of  Seiten und Wörter.  The search database is saved in D:\inetpub\wwwroot\fiscd\srchdb.  (Note that a search database actually consists of between 14 and 17 different files, ie srchdb.his, srchdb.hi1, srchdb.hi2, etc.)

    The search database is built by following links from the URL http://www.phdcc.com/ and indexing HTML files that meet this file spec *.htm,*.html,*.asp and TXT files that meet this file spec *.txt.

    Description=\ Seiten und W\u00F6rter
    SaveAsPathname=D:\\inetpub\\wwwroot\\fiscd\\srchdb
    ScanType=url
    ScanURL=http://www.phdcc.com/
    HTML_Files=*.htm,*.html,*.asp
    ParseTXT=true
    Note how a space character is specified at the start of the Description using the backslash-space escape sequence.  German character ö is specified using its Unicode representation \u00F6.  In the SaveAsPathname, each PC backslash must be represented by two backslashes \\.  On PC systems, forward slashes can usually be used instead of backslashes if desired, so the SaveAsPathname could be specified like this:
    SaveAsPathname=D:/inetpub/wwwroot/fiscd/srchdb
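To see how these escape sequences are decoded, the example lines can be fed to the standard java.util.Properties class. This is a sketch only: the class InstructionsEscapes and its parse helper are hypothetical, and it loads from an in-memory string through a Reader, whereas a file read through an InputStream is treated as ISO 8859-1. Note that every backslash is doubled again inside the Java string literals.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class InstructionsEscapes {
    // Parses properties-format text held in memory.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The same lines as in the example instructions file above.
        String text =
              "Description=\\ Seiten und W\\u00F6rter\n"
            + "SaveAsPathname=D:\\\\inetpub\\\\wwwroot\\\\fiscd\\\\srchdb\n"
            + "ScanType=url\n";
        Properties props = parse(text);
        // Brackets make the leading space of the description visible.
        System.out.println("[" + props.getProperty("Description") + "]");
        System.out.println(props.getProperty("SaveAsPathname"));
    }
}
```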

    Example Output

    The following example full output was produced when doing a directory scan of a test directory.
    jview /cp:a Findex.jar com.phdcc.findex.Findex @index.properties
    Findex 4.0, 29 July 2005.  Copyright (c) 2002-2005 PHD Computer Consultants Ltd
    iso8859-3.htm
    iso8859-15.htm
    Windows874.htm
    Windows1254.htm
    iso8859-9.htm
    Windows1258.htm
    iso8859-1.htm
    Windows1252.htm
    Windows1256.htm
    iso8859-6.htm
    Windows1257.htm
    iso8859-4.htm
    Windows1250.htm
    iso8859-5.htm
    Windows1251.htm
    Windows1253.htm
    iso8859-7.htm
    Windows1255.htm
    iso8859-8.htm
    iso8859-8i.htm
    Post-processing...
    No problems while scanning.
    
    Pages scanned:			20  (208kB)
    Words found:			1413
    Total output database size:		98kB  (47%)
    Time taken:			0 minutes, 2 seconds.
    
    Error code 0

    Index Instructions File Format

    This definition of the Index Instructions File Format is taken from the Sun java.util.Properties class definition:
    The stream is assumed to be using the ISO 8859-1 character encoding.

    Every property occupies one line of the input stream. Each line is terminated by a line terminator (\n or \r or \r\n). Lines from the input stream are processed until end of file is reached on the input stream.

    A line that contains only whitespace or whose first non-whitespace character is an ASCII # or ! is ignored (thus, # or ! indicate comment lines).

    Every line other than a blank line or a comment line describes one property to be added to the table (except that if a line ends with \, then the following line, if it exists, is treated as a continuation line, as described below). The key consists of all the characters in the line starting with the first non-whitespace character and up to, but not including, the first ASCII =, :, or whitespace character. All of the key termination characters may be included in the key by preceding them with a \. Any whitespace after the key is skipped; if the first non-whitespace character after the key is = or :, then it is ignored and any whitespace characters after it are also skipped. All remaining characters on the line become part of the associated element string. Within the element string, the ASCII escape sequences \t, \n, \r, \\, \", \', \ (a backslash and a space), and \uxxxx are recognized and converted to single characters. Moreover, if the last character on the line is \, then the next line is treated as a continuation of the current line; the \ and line terminator are simply discarded, and any leading whitespace characters on the continuation line are also discarded and are not part of the element string.

    As an example, each of the following three lines specifies the key "Truth" and the associated element value "Beauty":

    Truth = Beauty
    	Truth:Beauty
    Truth			:Beauty
    
    As another example, the following three lines specify a single property:

    fruits	apple, banana, pear, \
    	cantaloupe, watermelon, \
    	kiwi, mango
    
    The key is "fruits" and the associated element is:

    "apple, banana, pear, cantaloupe, watermelon, kiwi, mango"
    Note that a space appears before each \ so that a space will appear after each comma in the final result; the \, line terminator, and leading whitespace on the continuation line are merely discarded and are not replaced by one or more other characters.

    As a third example, the line:

    cheeses
    
    specifies that the key is "cheeses" and the associated element is the empty string.

    Return values

    Findex returns the following values on exit.
    0    No errors or problems
    -ve  No errors, but negative number indicates number of scan problems
    1    Insufficient arguments
    2    Invalid argument
    4    Out of Memory
    5    Unexpected exception
    10   Error reading Index Instructions File
    11   ScanDirectory invalid
    12   SaveAsPathname not given in Index Instructions File
    13   ScanType invalid
    14   ScanDirectory property not specified
    15   ScanDirLevels invalid
    16   ScanPathname property not specified
    17   ScanURL property not specified
    18   Invalid true/false value for a property
    19   AbstractWords invalid
    20   Invalid parser plug-in
    30   Error reading reindex parameters from existing Search database
    31   Unable to reindex because an option is set that is not compatible with Findex, eg Parse PDF
    100  General scan error
    101  Internal error
    If Findex finds one problem while scanning (such as an invalid link) then it will return -1.  If your command shell does not cope with signed values then this will be seen as the unsigned value 4294967295 on systems where exit codes are 4 byte unsigned integers.  If there are two scan problems then -2 is returned (unsigned 4294967294).

    In Windows, the Sun JVM java command seems to return 0 if it runs out of memory, and 3 if Ctrl+C is used to halt the program execution.  In Windows, the Microsoft VM jview command seems to return 3221225786 / -1073741510 / 0xC000013a / STATUS_CONTROL_C_EXIT if Ctrl+C is used to halt the program execution.  Out of memory might be 0xC0000017 / STATUS_NO_MEMORY.
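A calling program that receives the exit code as an unsigned 32-bit number can convert it back to a signed value before interpreting it. The sketch below is illustrative only: the class FindexExitCode and its method names are hypothetical, and it singles out just the STATUS_CONTROL_C_EXIT value mentioned above, so other status values reported by the VM itself would be misread as scan problem counts.

```java
public class FindexExitCode {
    // Windows status value for Ctrl+C; as a Java int literal this is -1073741510.
    static final int STATUS_CONTROL_C_EXIT = 0xC000013A;

    // Narrowing to int keeps the low 32 bits, so 4294967295 becomes -1.
    static int toSigned(long raw) {
        return (int) raw;
    }

    static String describe(long raw) {
        int code = toSigned(raw);
        if (code == STATUS_CONTROL_C_EXIT) {
            return "Halted by Ctrl+C";
        }
        if (code == 0) {
            return "No errors or problems";
        }
        if (code < 0) {
            return (-code) + " scan problem(s)";
        }
        return "Error " + code;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 4294967295L, 4294967294L, 12L, 3221225786L};
        for (long raw : samples) {
            System.out.println(raw + " -> " + describe(raw));
        }
    }
}
```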

    All site Copyright © 1996-2008 PHD Computer Consultants Ltd, PHDCC

    Last modified: 5 July 2006.