FindinSite-CD: Search engine for CD/DVD
 

 

findex Java indexing tool


Introduction

Findex is a Java version of the FindinSite-CD-Wizard indexing tool for FindinSite-CD.  As a Java application, it should run on both Windows and non-Windows platforms.  (The Findex indexing code is also used by FindinSite-JS, the site search engine for web sites.)

Findex indexes HTML web pages, TXT text files, PDF files, Word 97+ DOC documents, PowerPoint PPT/PPS presentations and JPEG images for meta-data.  It does not currently index XLS files.  Findex does not have a search database editing facility.  Findex will not index Office documents if you run it using the Microsoft VM.

Findex supports Parser plug-ins to index new file types.  The supplied phdccRDF plug-in finds meta-data in RDF/XML files - this plug-in does run under the Microsoft VM.

Findex finds any meta-data information for field searches, including that contained in RDF/XML files.  A search database produced by Findex cannot be properly read by FindinSite-CD-Wizard because FindinSite-CD-Wizard does not support field searches.

Apart from fields, a search database produced by Findex will be functionally equivalent to one produced by FindinSite-CD-Wizard (with the same options selected).  However the search database physical files will not be identical.  Indeed, the search database files produced by different Java implementations (eg the Sun JVM and the Microsoft VM) will not be physically identical [because each implementation enumerates the contents of a hash table in a different order].

Findex is usually much slower than FindinSite-CD-Wizard on the same computer.  In addition, Findex tends to use more memory than FindinSite-CD-Wizard.  The Microsoft VM seems to require less memory than the Sun JVM.  However, the Sun JVM seems to run faster than the Microsoft VM for smaller sites.

Findex has been tested in Windows using the Microsoft and Sun JVMs, and in Sun Solaris 2.6 and Solaris 9 using the Sun JVM.  As described above, some aspects of Findex will not run under the Microsoft VM.

Executable

The Findex executable (Findex.jar) is only supplied in the FindinSite-CD development kit for Windows, and is placed in the installation directory, eg:
C:\Program Files\PHD\fisCD\Findex.jar

To use Findex on other platforms, you must first install the FindinSite-CD development kit on a Windows computer. Then copy Findex.jar from the installation directory to a suitable directory on the new platform.

You may need to add Findex.jar or its directory to the Java CLASSPATH environment variable definition if you are not using its full path when running it.

Usage

Findex is a Java application that can either:
  • Make a new search database
  • Reindex an existing search database

Findex is supplied in a Java archive file called Findex.jar, with the main class com.phdcc.findex.Findex identified as the "Main-Class" in the JAR manifest.

Run Findex in the Java Virtual Machine on your computer, passing parameters to tell it what to do.  There should be no need to make any further changes to your existing Java CLASSPATH definition provided it already refers to the standard Java Runtime Environment classes.  For the Sun JVM you may need to use the -Xmx parameter to the java command to increase the maximum size of the memory allocation pool from 64MB if you run into Out of Memory errors.

Here are four different examples.  The first uses the Microsoft VM.  The last three use the Sun JVM in different ways.  In each case, the Findex parameters follow the class name or the JAR file name.

jview /cp:a Findex.jar com.phdcc.findex.Findex @index.properties
 
java -Xmx256M -jar Findex.jar -r MySite.his
 
java -cp Findex.jar;phdccRDF.jar com.phdcc.findex.Findex -q @index.properties
 
java -jar "C:\Program Files\PHD\fisCD\Findex.jar" -v

Findex returns an exit code of zero if there are no errors or scan problems.  A positive number usually indicates a hard error.  A negative number usually indicates the number of scan problems.  See below for full details.

See the parser plug-ins documentation for details of how to run Findex with plug-ins.

Parameters

The following parameters can be passed to Findex:
-v | -vv
or
[-q | -qq] ( @index_instructions_file | -r search_database )

If -v is specified then Findex lists several Java system properties to the "standard output" device.  For -vv, Findex lists all available system properties.

If -q is specified then Findex runs in quiet mode, not writing any progress information to the "standard output" device, but still writing error information to the "standard error" device. If -qq is specified then Findex runs in silent mode, not writing any progress information to the "standard output" or "standard error" devices.

You must then specify whether you want to create a new search database (using @index_instructions_file) or reindex an existing search database (using -r search_database).  In fact, you can create a new search database to overwrite an existing one.
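As a minimal sketch of how these parameters fit together, the following Java helper assembles the argument list for the two modes described above.  The class and method names (FindexCommand, newDatabase, reindex) are hypothetical and are not part of Findex; only the java options and the Findex parameters are taken from this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds "java ... -jar Findex.jar ..." argument lists.
class FindexCommand {

    // java [-Xmx<N>M] -jar <jar> [-q] @<index instructions file>
    static List<String> newDatabase(String jarPath, int maxHeapMb,
                                    boolean quiet, String instructionsFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        if (maxHeapMb > 0) {
            cmd.add("-Xmx" + maxHeapMb + "M"); // raise the Sun JVM memory limit
        }
        cmd.add("-jar");
        cmd.add(jarPath);
        if (quiet) {
            cmd.add("-q"); // quiet mode: no progress output
        }
        cmd.add("@" + instructionsFile);
        return cmd;
    }

    // java -jar <jar> -r <search database system file>
    static List<String> reindex(String jarPath, String searchDatabase) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-r");
        cmd.add(searchDatabase);
        return cmd;
    }
}
```

A list built this way could be passed to java.lang.ProcessBuilder, which hands each element to the JVM as a separate argument, so a JAR path that contains spaces needs no extra quoting.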

Reindexing a search database

To reindex a search database, simply pass Findex the name of the search database system file, eg MySite.his, after -r.

Note that the Include and Exclude settings from an Index Instructions File are not saved in the search database system file, so any earlier inclusions or exclusions will not be used during a reindex. Similarly, plug-in options are not stored so these will be lost if you do a reindex.

If the search database was created by FindinSite-CD-Wizard and included options to scan file types not supported by Findex, then Findex will stop immediately with return value 31.

Index Instructions File

To create a new search database, you must pass Findex the name of an "Index Instructions File".

The Index Instructions file specifies the indexing options that Findex should use, such as the pathname of the search database to create.  The instructions file is in the format of a Java properties text file, where each line specifies a property and a value separated by an equals sign, with various escape sequences recognised - see below.  Property names are case sensitive.  See also the full example.

You must give a SaveAsPathname.  Depending on the ScanType, you must also specify the matching property: ScanDirectory if ScanType is dir, ScanPathname if ScanType is file, or ScanURL if ScanType is url.

You must set at least one of these properties: ParseHTML, ParsePDF, ParseDOC, ParsePPT and ParseTXT.

The following table details all the available properties:

| Name | Optional | Description | Default |
| --- | --- | --- | --- |
| Description | Yes | The search database description | Taken from the first page title found |
| SaveAsPathname | No | The pathname used to save the search database, without any extension.  The directory element of the pathname must exist. | |
| ScanType | Yes | Indicates how Findex finds files to index.  dir: scan all files in ScanDirectory to a depth of ScanDirLevels.  file: scan by following links from ScanPathname.  url: scan by following links from ScanURL. | dir |
| ScanDirectory | Yes * | The directory used to find files if ScanType is dir | |
| ScanDirLevels | Yes | The number of directory levels to scan if ScanType is dir.  Use a number in the range 0 to 255, or all. | all |
| ScanPathname | Yes * | The initial file scanned if ScanType is file | |
| ScanURL | Yes * | The initial URL scanned if ScanType is url | |
| ParseHTML | Yes | Specify true if you want to scan HTML web pages, or false if not. | true |
| HTML_Files | Yes | The file specification for HTML files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.htm,*.html |
| ParsePDF | Yes | Specify true if you want to scan PDF files, or false if not. | false |
| ParserN | Yes | Specify zero or more plug-in parsers.  See the ParserN description on the Plug-in page. | |
| PDF_Files | Yes | The file specification for PDF files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.pdf |
| PDF_Passwords | Yes | Specify a comma-separated list of passwords to open PDF files. | |
| PDF_ReportCharacterDecodeProblems | Yes | Specify true if you want to have any PDF character decode problems listed, or false if not. | false |
| ParseDOC | Yes | Specify true if you want to scan DOC Word 97+ documents, or false if not. | false |
| DOC_Files | Yes | The file specification for DOC files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.doc |
| ParsePPT | Yes | Specify true if you want to scan PPT PowerPoint presentations, or false if not. | false |
| PPT_Files | Yes | The file specification for PPT files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.ppt,*.pps |
| ParseTXT | Yes | Specify true if you want to scan TXT text files, or false if not. | false |
| TXT_Files | Yes | The file specification for TXT files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.txt |
| ParseImage | Yes | Specify true if you want to scan JPEG images for meta-data, or false if not. | false |
| Image_Files | Yes | The file specification for JPEG files, using * and ? wildcards as needed.  Separate individual specifiers with a comma. | *.jpg,*.jpeg |
| CaseSignificant | Yes | If finding files by following links, then the case of filenames is ignored if false.  If true then Findex views test.htm and Test.htm as separate files.  Windows always seems to ignore filename letter cases.  In Unix, filename case must be correct. | Windows: false; non-Windows: true |
| StoreStopWords | Yes | If false, Findex does not include words specified in StopWordFile. | true |
| StopWordFile | Yes | The pathname of the file containing stop words, with one word per line in UTF-8 format. | |
| NoTitleIgnorePageLinks | Yes | If finding files by following links and this property is set to true, then links are not followed if a page has no title. | false |
| ParseUpHierarchy | Yes | If finding files by following links and this property is set to true, then links are followed to directories above the initial file. | false |
| StorePositions | Yes | If true then Findex stores word positions so that "adjacent word" searches will work. | true |
| StoreLoneWords | Yes | If true then Findex stores a word's position even if the two surrounding words are stop words. | true |
| UseNoBaseURLs | Yes | Determines whether to include a Base URL prefix for each page in the search database | false for URL scans; true for File/Dir scans |
| UseMetaDescriptionAsAbstract | Yes | If true then the page abstract will be taken from the page META description tag. | true |
| UseMetaAbstractAsAbstract | Yes | If true then the page abstract will be taken from the (new) page META abstract tag. | true |
| AbstractWords | Yes | If building the abstract from the words in a file, this property indicates the number of words to use. | 30 |
| Include | Yes | A list of file specifications to include in the search database.  See below. | All files will be included |
| Exclude | Yes | A list of file specifications to exclude from the search database.  See below. | No files will be excluded |

* Required when the ScanType selects this property (ScanDirectory for dir, ScanPathname for file, ScanURL for url).

Include and Exclude files

The Include and Exclude properties provide an optional list of file-specs to determine the files to include or exclude in the search database.

The initial list of acceptable files is determined by the appropriate HTML_Files, PDF_Files, DOC_Files, PPT_Files, TXT_Files or Image_Files settings. Then:

  • If an Include file-spec set is given, then only files meeting one of the given file-specs will be indexed.
  • If an Exclude file-spec set is given, then any files meeting one of the given file-specs will not be indexed.
  • Note that the Includes are processed first and the Excludes afterwards, so an Exclude file-spec takes precedence.

An individual file-spec can include zero or more * or ? wildcard characters, where ? matches exactly one character, and * matches zero or more characters. For example file???.ht* would match:
    file001.htm, file101.html and file111.ht
but not
    file1001.htm
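The matching and precedence rules above can be sketched in Java as follows.  This is only an illustration, not the Findex implementation: the class FileSpecFilter and its methods are hypothetical, and the sketch assumes that matching is case sensitive and applies to the file name alone rather than to the full path.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of file-spec wildcards and Include/Exclude precedence.
class FileSpecFilter {

    // '?' matches exactly one character, '*' matches zero or more characters.
    static boolean matches(String spec, String fileName) {
        StringBuilder regex = new StringBuilder();
        for (char c : spec.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                // Every other character must match literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL)
                      .matcher(fileName).matches();
    }

    // Includes are checked first; an Exclude match then takes precedence.
    static boolean shouldIndex(String fileName, List<String> includes,
                               List<String> excludes) {
        boolean included = includes.isEmpty(); // no Include list: all files pass
        for (String spec : includes) {
            if (matches(spec, fileName)) {
                included = true;
            }
        }
        for (String spec : excludes) {
            if (matches(spec, fileName)) {
                return false;
            }
        }
        return included;
    }
}
```

With this sketch, matches("file???.ht*", "file111.ht") is true and matches("file???.ht*", "file1001.htm") is false, in line with the example above.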

A list of file-specs can be given directly in the property, or indirectly in a file.

Direct file-specs

Direct file-specs are semi-colon separated, eg:
Include=iso*;*12*
Exclude=file???.ht*

This specifies two Include file-specs and one Exclude file-spec.

Indirect file-specs in a file

An indirect value consists of @ followed by a file name, where file-specs are specified one per line in plain text. The above direct example may be expressed indirectly as follows:
Include=@includes.txt
Exclude=@excludes.txt

where includes.txt contains:
iso*
*12*

and excludes.txt contains:
file???.ht*

If an indirect file cannot be opened, an error message is printed to the standard output unless -q or -qq is specified.
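To make the two forms concrete, here is a minimal Java sketch that turns an Include or Exclude property value into a list of file-specs.  The class FileSpecList and its resolve method are hypothetical.  The sketch also assumes details this page does not state: that the indirect file is read as UTF-8, that surrounding whitespace is trimmed, and that blank entries are skipped.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: resolves a direct or indirect file-spec list.
class FileSpecList {

    static List<String> resolve(String value) {
        List<String> specs = new ArrayList<>();
        if (value.startsWith("@")) {
            // Indirect form: "@" followed by a file name, one file-spec per line.
            try {
                for (String line : Files.readAllLines(
                        Paths.get(value.substring(1)), StandardCharsets.UTF_8)) {
                    addIfNotBlank(specs, line);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        } else {
            // Direct form: file-specs separated by semi-colons.
            for (String spec : value.split(";")) {
                addIfNotBlank(specs, spec);
            }
        }
        return specs;
    }

    // Assumption: whitespace is trimmed and empty entries are ignored.
    private static void addIfNotBlank(List<String> specs, String candidate) {
        String trimmed = candidate.trim();
        if (!trimmed.isEmpty()) {
            specs.add(trimmed);
        }
    }
}
```

Under these assumptions, resolve("iso*;*12*") and resolve("@includes.txt") would return the same two file-specs for the example files shown above.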


Example Index Instructions File

This example specifies a search database description of  Seiten und Wörter.  The search database is saved in D:\inetpub\wwwroot\fiscd\srchdb.  (Note that a search database actually consists of between 14 and 17 different files, ie srchdb.his, srchdb.hi1, srchdb.hi2, etc.)

The search database is built by following links from the URL http://www.phdcc.com/ and indexing HTML files that meet this file spec *.htm,*.html,*.asp and TXT files that meet this file spec *.txt.

Description=\ Seiten und W\u00F6rter
SaveAsPathname=D:\\inetpub\\wwwroot\\fiscd\\srchdb
ScanType=url
ScanURL=http://www.phdcc.com/
HTML_Files=*.htm,*.html,*.asp
ParseTXT=true

Note how a space character is specified at the start of the Description using the escape sequence \ (a backslash followed by a space).  The German character ö is specified using its Unicode representation \u00F6.  In the SaveAsPathname, each PC backslash must be represented by two backslashes \\.  On PC systems, forward slashes can usually be used instead of backslashes if desired, so the SaveAsPathname could be specified like this:
SaveAsPathname=D:/inetpub/wwwroot/fiscd/srchdb
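The effect of these escape sequences can be checked with the standard java.util.Properties class, which defines the file format.  The sketch below is an illustration only: the class InstructionsExample is hypothetical, and it shows how the standard class decodes the example, not how Findex itself reads the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration: decodes part of the example instructions file.
class InstructionsExample {

    // The first three lines of the example, written as a Java string literal.
    // Each backslash in the file is doubled again for the Java compiler.
    static final String EXAMPLE = String.join("\n",
            "Description=\\ Seiten und W\\u00F6rter",
            "SaveAsPathname=D:\\\\inetpub\\\\wwwroot\\\\fiscd\\\\srchdb",
            "ScanType=url");

    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(EXAMPLE);
        // The Description starts with a space and contains the character ö.
        System.out.println("[" + props.getProperty("Description") + "]");
        // Each doubled backslash in the file becomes a single backslash.
        System.out.println(props.getProperty("SaveAsPathname"));
        // A property that is absent can fall back to its documented default.
        System.out.println(props.getProperty("ParseHTML", "true"));
    }
}
```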

Example Output

The following example full output was produced when doing a directory scan of a test directory.
jview /cp:a Findex.jar com.phdcc.findex.Findex @index.properties
Findex 4.0, 29 July 2005.  Copyright (c) 2002-2011 PHD Computer Consultants Ltd
iso8859-3.htm
iso8859-15.htm
Windows874.htm
Windows1254.htm
iso8859-9.htm
Windows1258.htm
iso8859-1.htm
Windows1252.htm
Windows1256.htm
iso8859-6.htm
Windows1257.htm
iso8859-4.htm
Windows1250.htm
iso8859-5.htm
Windows1251.htm
Windows1253.htm
iso8859-7.htm
Windows1255.htm
iso8859-8.htm
iso8859-8i.htm
Post-processing...
No problems while scanning.

Pages scanned:			20  (208kB)
Words found:			1413
Total output database size:		98kB  (47%)
Time taken:			0 minutes, 2 seconds.

Error code 0

Index Instructions File Format

This definition of the Index Instructions File Format is taken from the Sun java.util.Properties class definition:
The stream is assumed to be using the ISO 8859-1 character encoding.

Every property occupies one line of the input stream. Each line is terminated by a line terminator (\n or \r or \r\n). Lines from the input stream are processed until end of file is reached on the input stream.

A line that contains only whitespace or whose first non-whitespace character is an ASCII # or ! is ignored (thus, # or ! indicate comment lines).

Every line other than a blank line or a comment line describes one property to be added to the table (except that if a line ends with \, then the following line, if it exists, is treated as a continuation line, as described below). The key consists of all the characters in the line starting with the first non-whitespace character and up to, but not including, the first ASCII =, :, or whitespace character. All of the key termination characters may be included in the key by preceding them with a \. Any whitespace after the key is skipped; if the first non-whitespace character after the key is = or :, then it is ignored and any whitespace characters after it are also skipped. All remaining characters on the line become part of the associated element string. Within the element string, the ASCII escape sequences \t, \n, \r, \\, \", \', \ (a backslash and a space), and \uxxxx are recognized and converted to single characters. Moreover, if the last character on the line is \, then the next line is treated as a continuation of the current line; the \ and line terminator are simply discarded, and any leading whitespace characters on the continuation line are also discarded and are not part of the element string.

As an example, each of the following three lines specifies the key "Truth" and the associated element value "Beauty":

Truth = Beauty
	Truth:Beauty
Truth			:Beauty
As another example, the following three lines specify a single property:

fruits	apple, banana, pear, \
	cantaloupe, watermelon, \
	kiwi, mango
The key is "fruits" and the associated element is:

"apple, banana, pear, cantaloupe, watermelon, kiwi, mango"

Note that a space appears before each \ so that a space will appear after each comma in the final result; the \, line terminator, and leading whitespace on the continuation line are merely discarded and are not replaced by one or more other characters.

As a third example, the line:

cheeses
specifies that the key is "cheeses" and the associated element is the empty string.
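
The three examples above can be reproduced with a short Java sketch.  The class PropertiesFormatDemo and its lookup method are hypothetical names used only for this illustration; java.util.Properties is the standard class whose definition is quoted above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration: key separators, continuation lines, empty values.
class PropertiesFormatDemo {

    // Loads the given properties text and returns the value stored for key.
    static String lookup(String source, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(source));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Three spellings of the same property.
        System.out.println(lookup("Truth = Beauty", "Truth"));
        System.out.println(lookup("\tTruth:Beauty", "Truth"));
        System.out.println(lookup("Truth\t\t\t:Beauty", "Truth"));

        // One property spread over three lines with trailing backslashes.
        String fruits = "fruits\tapple, banana, pear, \\\n"
                + "\tcantaloupe, watermelon, \\\n"
                + "\tkiwi, mango";
        System.out.println(lookup(fruits, "fruits"));

        // A key on its own has the empty string as its value.
        System.out.println("[" + lookup("cheeses", "cheeses") + "]");
    }
}
```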

Return values

Findex returns the following values on exit.
| Code | Meaning |
| --- | --- |
| 0 | No errors or problems |
| Negative | No errors, but the negative number indicates the number of scan problems |
| 1 | Insufficient arguments |
| 2 | Invalid argument |
| 4 | Out of Memory |
| 5 | Unexpected exception |
| 10 | Error reading Index Instructions File |
| 11 | ScanDirectory invalid |
| 12 | SaveAsPathname not given in Index Instructions File |
| 13 | ScanType invalid |
| 14 | ScanDirectory property not specified |
| 15 | ScanDirLevels invalid |
| 16 | ScanPathname property not specified |
| 17 | ScanURL property not specified |
| 18 | Invalid true/false value for a property |
| 19 | AbstractWords invalid |
| 20 | Invalid parser plug-in |
| 30 | Error reading reindex parameters from existing Search database |
| 31 | Unable to reindex because an option is set that is not compatible with Findex, eg Parse PDF |
| 100 | General scan error |
| 101 | Internal error |

If Findex finds one problem while scanning (such as an invalid link) then it will return -1.  If your command shell does not cope with signed values then this will be seen as the unsigned value 4294967295 on systems where exit codes are 4 byte unsigned integers.  If there are two scan problems then -2 is returned (unsigned 4294967294).
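
As a rough sketch of how a calling program might interpret these values, the following Java helper converts an exit code that was reported as an unsigned 4 byte value back to its signed form and summarises it.  The class FindexExitCode and its methods are hypothetical and not part of Findex.

```java
// Hypothetical illustration: interprets a Findex exit code.
class FindexExitCode {

    // An exit code reported as an unsigned 32-bit number keeps the same bits
    // when narrowed to int, so 4294967295 becomes -1 and 4294967294 becomes -2.
    static int toSigned(long reported) {
        return (int) reported;
    }

    static String describe(int code) {
        if (code == 0) {
            return "no errors or problems";
        }
        if (code < 0) {
            // The magnitude of a negative code is the number of scan problems.
            return (-code) + " scan problem(s)";
        }
        return "error " + code;
    }
}
```

For example, describe(toSigned(4294967294L)) gives "2 scan problem(s)", and describe(31) gives "error 31", which the list of return values identifies as a reindex option that is not compatible with Findex.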

In Windows, the Sun JVM java command seems to return 0 if it runs out of memory, and 3 if Ctrl+C is used to halt the program execution.  In Windows, the Microsoft VM jview command seems to return 3221225786 / -1073741510 / 0xC000013a / STATUS_CONTROL_C_EXIT if Ctrl+C is used to halt the program execution.  Out of memory might be 0xC0000017 / STATUS_NO_MEMORY.

All site Copyright © 1996-2011 PHD Computer Consultants Ltd, PHDCC

Last modified: 5 July 2006.
