findex Java indexing tool
Introduction
Findex is a Java version of the FindinSite-CD-Wizard indexing tool for FindinSite-CD.
As a Java application, it should run on both Windows and non-Windows platforms.
(The Findex indexing code is also used by
FindinSite-JS, the site search engine for web sites.)
Findex indexes HTML web pages, TXT text files, PDF files,
Word 97+ DOC documents, PowerPoint PPT/PPS presentations and JPEG images for meta-data.
It does not currently index XLS files.
Findex does not have a search database editing facility.
Findex will not index Office documents if you run it using the Microsoft VM.
Findex supports Parser plug-ins to index
new file types. The supplied phdccRDF plug-in
finds meta-data in RDF/XML files - this plug-in does run under the Microsoft VM.
Findex finds any meta-data information for
field searches, including that contained in
RDF/XML files. A search database produced by Findex cannot be properly read
by FindinSite-CD-Wizard because FindinSite-CD-Wizard does not support field searches.
Apart from fields, a search database produced by Findex will be functionally
equivalent to one produced
by FindinSite-CD-Wizard (with the same options selected). However the search database physical files
will not be identical. Indeed, the search database files produced by different Java implementations
(eg the Sun JVM and the Microsoft VM) will not be physically identical [because each implementation
enumerates the contents of a hash table in a different order].
Findex is usually much slower than FindinSite-CD-Wizard on the same computer.
In addition, Findex will tend to use more memory than FindinSite-CD-Wizard.
The Microsoft VM seems to require less memory than the Sun JVM. However the Sun JVM
seems to run faster than the Microsoft VM for smaller sites.
Findex has been tested in Windows using the Microsoft and Sun JVMs,
and in Sun Solaris 2.6 and Solaris 9 using the Sun JVM.
As described above, some aspects of Findex will not run under the Microsoft VM.
Executable
The Findex executable (Findex.jar) is only supplied in the FindinSite-CD development kit
for Windows, and is placed in the installation directory, eg:
C:\Program Files\PHD\fisCD\Findex.jar
To use Findex on other platforms, you must first install the FindinSite-CD development kit
on a Windows computer. Then copy Findex.jar from the installation
directory to a suitable directory on the new platform.
You may need to add Findex.jar or its directory to the
Java CLASSPATH environment variable definition if you are not using its full path
when running it.
Usage
Findex is a Java application that can either:
- Make a new search database
- Reindex an existing search database
Findex is supplied in a Java archive file called Findex.jar,
with the main class com.phdcc.findex.Findex identified as the "Main-Class"
in the JAR manifest.
Run Findex in the Java Virtual Machine on your computer, passing parameters
to tell it what to do. There should be no need to make any further changes to your
existing Java CLASSPATH definition provided it already refers to the standard Java
Runtime Environment classes. For the Sun JVM you may need to use the -Xmx
parameter to the java command to increase the maximum size of the
memory allocation pool from 64MB if you run into Out of Memory errors.
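To check what heap limit a given -Xmx setting actually produces, a small standalone program can report it. This is only a sketch and is not part of Findex; the class name HeapLimit is hypothetical.

```java
// Hypothetical helper, not supplied with Findex.
final class HeapLimit {
    /** Returns the maximum heap size the JVM will attempt to use, in whole megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Maximum heap: " + maxHeapMegabytes() + "MB");
    }
}
```

Running it with the same -Xmx value that you intend to pass when running Findex shows whether the setting was accepted.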
Here are four different examples. The first uses the
Microsoft VM. The last three use the Sun JVM in different ways.
In each case, the Findex parameters come last on the command line.
jview /cp:a Findex.jar com.phdcc.findex.Findex @index.properties
java -Xmx256M -jar Findex.jar -r MySite.his
java -cp Findex.jar;phdccRDF.jar com.phdcc.findex.Findex -q @index.properties
java -jar "C:\Program Files\PHD\fisCD\Findex.jar" -v
Findex returns an exit code of zero if there are no errors or scan problems.
A positive number usually indicates a hard error. A negative number usually indicates
the number of scan problems. See below for full details.
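A calling program could interpret the exit code along the lines described above. The following is a minimal sketch, assuming the exit code has already been obtained as a signed int (for example from Process.waitFor()); the class and method names are hypothetical and not part of Findex.

```java
// Hypothetical helper, not supplied with Findex.
final class FindexExitCode {
    /** Summarises an exit code: zero is success, negative counts scan problems, positive is an error. */
    static String describe(int exitCode) {
        if (exitCode == 0) {
            return "OK";
        }
        if (exitCode < 0) {
            // Widen before negating so that Integer.MIN_VALUE cannot overflow.
            long problems = -(long) exitCode;
            return problems + " scan problem(s)";
        }
        return "hard error " + exitCode;
    }

    public static void main(String[] args) {
        System.out.println(describe(0));
        System.out.println(describe(-2));
        System.out.println(describe(12));
    }
}
```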
See the parser plug-ins documentation
for details of how to run Findex with plug-ins.
Parameters
The following parameters can be passed to Findex:
-v | -vv
or
[-q | -qq] ( @index_instructions_file | -r search_database )
If -v is specified then Findex lists several Java system properties
to the "standard output" device. For -vv, Findex lists all available
system properties.
If -q is specified then Findex runs in quiet mode,
not writing any progress information to the "standard output" device, but still writing
error information to the "standard error" device.
If -qq is specified then Findex runs in silent mode,
not writing any progress information to the "standard output" or "standard error" devices.
You must then specify whether you want to create a new search database
(using @index_instructions_file) or reindex an existing search database
(using -r search_database).
In fact, you can create a new search database to overwrite an existing one.
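The parameter synopsis above can be expressed as a small check, which may be useful in a wrapper script or build tool before Findex is launched. This is a sketch of one reading of the synopsis, not Findex's own argument parser; the class and method names are hypothetical.

```java
// Hypothetical pre-flight check based on the synopsis; not Findex's own parser.
final class FindexArgs {
    /** Returns true if args match: -v | -vv, or [-q | -qq] ( @file | -r search_database ). */
    static boolean isValid(String[] args) {
        if (args.length == 1 && (args[0].equals("-v") || args[0].equals("-vv"))) {
            return true;
        }
        int i = 0;
        // Optional quiet or silent switch.
        if (i < args.length && (args[i].equals("-q") || args[i].equals("-qq"))) {
            i++;
        }
        int remaining = args.length - i;
        if (remaining == 1) {
            // An Index Instructions File is given as @ followed by a file name.
            return args[i].startsWith("@") && args[i].length() > 1;
        }
        if (remaining == 2) {
            // Reindexing is requested with -r followed by the search database.
            return args[i].equals("-r");
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isValid(args) ? "Arguments look valid" : "Arguments do not match the synopsis");
    }
}
```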
Reindexing a search database
To reindex a search database, simply pass Findex the name of the search database system file,
eg MySite.his, after -r.
Note that the Include and
Exclude settings
from an Index Instructions File are not saved in the search database system file,
so any earlier inclusions or exclusions will not be used during a reindex.
Similarly, plug-in options are not stored so these will be lost if you do a reindex.
If the search database was created by FindinSite-CD-Wizard and included options to scan file types
not supported by Findex, then Findex will stop immediately with return value 31.
Index Instructions File
To create a new search database, you must pass Findex the name of an "Index Instructions File".
The Index Instructions file specifies the indexing options that Findex should use,
such as the pathname of the search database to create. The instructions file is in the format of
a Java properties text file, where each line specifies a property and a value
separated by an equals sign, with various escape sequences recognised -
see below. Property names are case sensitive.
See also the full example.
You must give a SaveAsPathname. Depending on the
ScanType, you must specify other parameters as follows:
- dir: ScanDirectory
- file: ScanPathname
- url: ScanURL
You must set at least one of these properties:
ParseHTML,
ParsePDF,
ParseDOC,
ParsePPT and
ParseTXT.
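The rules above can be checked before Findex is run. The sketch below loads the instructions with java.util.Properties and reports missing settings. It assumes the defaults listed in the properties table (ScanType defaults to dir, ParseHTML to true, the other Parse properties to false); the class name and the message wording are hypothetical, and Findex's own validation may differ in order and detail.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-flight check; not part of Findex.
final class InstructionsCheck {
    private static final String[] PARSE_FLAGS =
        {"ParseHTML", "ParsePDF", "ParseDOC", "ParsePPT", "ParseTXT"};

    /** Returns a list of problems found in the given Index Instructions File text. */
    static List<String> problems(String instructionsText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(instructionsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> found = new ArrayList<>();
        if (p.getProperty("SaveAsPathname") == null) {
            found.add("SaveAsPathname not given");
        }
        // Each ScanType needs its matching location property.
        String required = null;
        switch (p.getProperty("ScanType", "dir")) {
            case "dir":  required = "ScanDirectory"; break;
            case "file": required = "ScanPathname";  break;
            case "url":  required = "ScanURL";       break;
            default:     found.add("ScanType invalid");
        }
        if (required != null && p.getProperty(required) == null) {
            found.add(required + " property not specified");
        }
        // At least one parser must be switched on, taking the defaults into account.
        boolean anyParser = false;
        for (String flag : PARSE_FLAGS) {
            String fallback = flag.equals("ParseHTML") ? "true" : "false";
            if (Boolean.parseBoolean(p.getProperty(flag, fallback))) {
                anyParser = true;
            }
        }
        if (!anyParser) {
            found.add("no Parse property is true");
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(problems("ScanType=file\nParseHTML=false\n"));
    }
}
```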
The following table details all the available properties:
Name | Optional | Description | Default |
Description | Yes | The search database description | Taken from the first page title found |
SaveAsPathname | No | The pathname used to save the search database, without any extension. The directory element of the pathname must exist. | |
ScanType | Yes | Indicates how Findex finds files to index: dir, file or url | dir |
ScanDirectory | Yes * | The directory used to find files if ScanType is dir | |
ScanDirLevels | Yes | The number of directory levels to scan if ScanType is dir. Use a number in the range 0 to 255, or all. | all |
ScanPathname | Yes * | The initial file scanned if ScanType is file | |
ScanURL | Yes * | The initial URL scanned if ScanType is url | |
ParseHTML | Yes | Specify true if you want to scan HTML web pages, or false if not. | true |
HTML_Files | Yes | The file specification for HTML files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.htm,*.html |
ParsePDF | Yes | Specify true if you want to scan PDF files, or false if not. | false |
ParserN | Yes | Specify zero or more plug-in parsers. See the ParserN description on the Plug-in page. | |
PDF_Files | Yes | The file specification for PDF files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.pdf |
PDF_Passwords | Yes | Specify a comma-separated list of passwords to open PDF files. | |
PDF_ReportCharacterDecodeProblems | Yes | Specify true if you want to have any PDF character decode problems listed, or false if not. | false |
ParseDOC | Yes | Specify true if you want to scan DOC Word 97+ documents, or false if not. | false |
DOC_Files | Yes | The file specification for DOC files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.doc |
ParsePPT | Yes | Specify true if you want to scan PPT PowerPoint presentations, or false if not. | false |
PPT_Files | Yes | The file specification for PPT files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.ppt,*.pps |
ParseTXT | Yes | Specify true if you want to scan TXT text files, or false if not. | false |
TXT_Files | Yes | The file specification for TXT files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.txt |
ParseImage | Yes | Specify true if you want to scan JPEG images for meta-data, or false if not. | false |
Image_Files | Yes | The file specification for JPEG files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.jpg,*.jpeg |
CaseSignificant | Yes | If finding files by following links, then the case of filenames is ignored if false. If true then Findex views test.htm and Test.htm as separate files. Windows always seems to ignore filename letter cases. In Unix, filename case must be correct. | Windows: false; non-Windows: true |
StoreStopWords | Yes | If false, Findex does not include words specified in StopWordFile. | true |
StopWordFile | Yes | The pathname of the file containing stop words, with one word per line in UTF-8 format. | |
NoTitleIgnorePageLinks | Yes | If finding files by following links and this property is set to true, then links are not followed if a page has no title. | false |
ParseUpHierarchy | Yes | If finding files by following links and this property is set to true, then links are followed to directories above the initial file. | false |
StorePositions | Yes | If true then Findex stores word positions so that "adjacent word" searches will work. | true |
StoreLoneWords | Yes | If true then Findex stores a word's position even if the two surrounding words are stop words. | true |
UseNoBaseURLs | Yes | Determines whether to include a Base URL prefix for each page in the search database | false for URL scans; true for File/Dir scans |
UseMetaDescriptionAsAbstract | Yes | If true then the page abstract will be taken from the page META description tag. | true |
UseMetaAbstractAsAbstract | Yes | If true then the page abstract will be taken from the (new) page META abstract tag. | true |
AbstractWords | Yes | If building the abstract from the words in a file, this property indicates the number of words to use. | 30 |
Include | Yes | A list of file specifications to include in the search database. See below. | All files will be included |
Exclude | Yes | A list of file specifications to exclude from the search database. See below. | No files will be excluded |
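As an illustration of how one of these values might be validated, the sketch below parses a ScanDirLevels setting according to the rule in the table (a number from 0 to 255, or all, with all as the default). The class name and the ALL sentinel are hypothetical, and treating only the lowercase word all as valid is an assumption.

```java
// Hypothetical validator for the ScanDirLevels property; not part of Findex.
final class ScanDirLevels {
    /** Hypothetical sentinel meaning that every directory level is scanned. */
    static final int ALL = -1;

    /** Parses a ScanDirLevels value; null means the property was omitted. */
    static int parse(String value) {
        if (value == null || value.equals("all")) {
            return ALL;
        }
        // Integer.parseInt throws NumberFormatException, a kind of IllegalArgumentException.
        int levels = Integer.parseInt(value.trim());
        if (levels < 0 || levels > 255) {
            throw new IllegalArgumentException("ScanDirLevels out of range: " + value);
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(parse("3"));
        System.out.println(parse("all") == ALL);
    }
}
```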
Include and Exclude files
The Include and Exclude properties
provide an optional list of file-specs to determine the files to include or exclude
in the search database.
The initial list of acceptable files is determined by the appropriate
HTML_Files,
PDF_Files,
DOC_Files,
PPT_Files,
TXT_Files or
Image_Files
settings. Then:
- If an Include file-spec set is given,
then only files meeting one of the given file-specs will be indexed.
- If an Exclude file-spec set is given,
then any files meeting one of the given file-specs will not be indexed.
- Note that the Includes are processed first and the Excludes afterwards,
so an Exclude file-spec takes precedence.
An individual file-spec can include zero or more * or ? wildcard characters,
where ? matches exactly one character, and
* matches zero or more characters.
For example file???.ht* would match:
file001.htm ,
file101.html and
file111.ht
but not
file1001.htm
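The wildcard rules above can be sketched as a short matcher. This is an illustration of the stated rules only, not Findex's implementation; the class name is hypothetical, and the comparison here is case sensitive, which is an assumption.

```java
// Hypothetical matcher illustrating the * and ? rules; not Findex's own code.
final class FileSpecMatcher {
    /** Returns true if name matches spec, where ? is exactly one character and * is zero or more. */
    static boolean matches(String spec, String name) {
        int s = 0;          // position in spec
        int n = 0;          // position in name
        int star = -1;      // position of the most recent * in spec
        int mark = 0;       // position in name where that * started matching
        while (n < name.length()) {
            if (s < spec.length() && spec.charAt(s) == '*') {
                star = s++;
                mark = n;
            } else if (s < spec.length()
                    && (spec.charAt(s) == '?' || spec.charAt(s) == name.charAt(n))) {
                s++;
                n++;
            } else if (star >= 0) {
                // Backtrack: let the last * absorb one more character.
                s = star + 1;
                n = ++mark;
            } else {
                return false;
            }
        }
        // Any trailing * characters can match the empty string.
        while (s < spec.length() && spec.charAt(s) == '*') {
            s++;
        }
        return s == spec.length();
    }

    public static void main(String[] args) {
        String spec = "file???.ht*";
        for (String name : new String[] {"file001.htm", "file101.html", "file111.ht", "file1001.htm"}) {
            System.out.println(name + " -> " + matches(spec, name));
        }
    }
}
```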
A list of file-specs can be given directly in the property, or indirectly in a file.
Direct file-specs
Direct file-specs are semi-colon separated, eg:
Include=iso*;*12*
Exclude=file???.ht*
This specifies two Include file-specs and one Exclude file-spec.
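The Include and Exclude processing described above might be modelled as follows, using the direct file-spec lists from this example. This is a sketch under stated assumptions: the class and method names are hypothetical, file-specs are matched against the bare file name, and matching is case sensitive. Findex itself may differ on these points.

```java
import java.util.regex.Pattern;

// Hypothetical model of Include/Exclude filtering; not Findex's own code.
final class IncludeExcludeFilter {
    /** Converts one file-spec to a regular expression: * is any run of characters, ? is one. */
    static Pattern toPattern(String spec) {
        StringBuilder regex = new StringBuilder();
        for (char c : spec.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns true if name meets any file-spec in a semicolon-separated list. */
    static boolean anyMatch(String specs, String name) {
        for (String spec : specs.split(";")) {
            if (!spec.isEmpty() && toPattern(spec).matcher(name).matches()) {
                return true;
            }
        }
        return false;
    }

    /** Applies Includes first, then Excludes; pass null for a list that is not given. */
    static boolean accepted(String include, String exclude, String name) {
        if (include != null && !anyMatch(include, name)) {
            return false;
        }
        return exclude == null || !anyMatch(exclude, name);
    }

    public static void main(String[] args) {
        String include = "iso*;*12*";
        String exclude = "file???.ht*";
        for (String name : new String[] {"iso8859-1.htm", "Windows874.htm", "file123.htm"}) {
            System.out.println(name + " -> " + accepted(include, exclude, name));
        }
    }
}
```

Because the Exclude check runs last, a file such as file123.htm is rejected even though it meets the *12* Include file-spec.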
Indirect file-specs in a file
An indirect value consists of @ followed by a file name,
where file-specs are specified one per line in plain text.
The above direct example may be expressed indirectly as follows:
Include=@includes.txt
Exclude=@excludes.txt
where includes.txt contains:
iso*
*12*
and excludes.txt contains:
file???.ht*
If an indirect file cannot be opened, an error message is printed
to the standard output unless -q or
-qq is specified.
Example Index Instructions File
This example specifies a search database description of
Seiten und Wörter .
The search database is saved in
D:\inetpub\wwwroot\fiscd\srchdb .
(Note that a search database actually consists of between 14 and 17 different files,
ie srchdb.his, srchdb.hi1, srchdb.hi2, etc.)
The search database is built by following links from the URL
http://www.phdcc.com/
and indexing HTML files that meet this file spec
*.htm,*.html,*.asp
and TXT files that meet this file spec
*.txt .
Description=\ Seiten und W\u00F6rter
SaveAsPathname=D:\\inetpub\\wwwroot\\fiscd\\srchdb
ScanType=url
ScanURL=http://www.phdcc.com/
HTML_Files=*.htm,*.html,*.asp
ParseTXT=true
Note how a space character is specified at the start of the Description
using the \ (backslash followed by a space) escape sequence.
German character ö is specified using its Unicode representation
\u00F6 .
In the SaveAsPathname, each PC backslash must be represented by two backslashes
\\ . On PC systems, forward slashes can usually
be used instead of backslashes if desired, so the SaveAsPathname could be specified like this:
SaveAsPathname=D:/inetpub/wwwroot/fiscd/srchdb
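The effect of these escape sequences can be seen by loading the two example lines with the standard java.util.Properties class, which is the format the instructions file follows. This is a small sketch; the class name EscapeDemo is hypothetical, and the text is loaded from a string rather than from a file to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demonstration class; not part of Findex.
final class EscapeDemo {
    /** The Description and SaveAsPathname lines from the example, written as a Java string. */
    static final String EXAMPLE =
          "Description=\\ Seiten und W\\u00F6rter\n"
        + "SaveAsPathname=D:\\\\inetpub\\\\wwwroot\\\\fiscd\\\\srchdb\n";

    /** Loads properties text using the standard Java properties parser. */
    static Properties load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        Properties p = load(EXAMPLE);
        // Brackets make the leading space in the description visible.
        System.out.println("[" + p.getProperty("Description") + "]");
        System.out.println(p.getProperty("SaveAsPathname"));
    }
}
```

The loaded Description starts with a space and contains the ö character, and the loaded SaveAsPathname contains single backslashes.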
Example Output
The following example full output was produced when doing a directory scan
of a test directory.
jview /cp:a Findex.jar com.phdcc.findex.Findex @index.properties
Findex 4.0, 29 July 2005. Copyright (c) 2002-2011 PHD Computer Consultants Ltd
iso8859-3.htm
iso8859-15.htm
Windows874.htm
Windows1254.htm
iso8859-9.htm
Windows1258.htm
iso8859-1.htm
Windows1252.htm
Windows1256.htm
iso8859-6.htm
Windows1257.htm
iso8859-4.htm
Windows1250.htm
iso8859-5.htm
Windows1251.htm
Windows1253.htm
iso8859-7.htm
Windows1255.htm
iso8859-8.htm
iso8859-8i.htm
Post-processing...
No problems while scanning.
Pages scanned: 20 (208kB)
Words found: 1413
Total output database size: 98kB (47%)
Time taken: 0 minutes, 2 seconds.
Error code 0
Index Instructions File Format
This definition of the Index Instructions File Format is taken from the
Sun java.util.Properties class definition:
The stream is assumed to be using the ISO 8859-1 character encoding.
Every property occupies one line of the input stream. Each line
is terminated by a line terminator (\n or \r
or \r\n ). Lines from the input stream are processed until
end of file is reached on the input stream.
A line that contains only whitespace or whose first non-whitespace
character is an ASCII # or ! is ignored
(thus, # or ! indicate comment lines).
Every line other than a blank line or a comment line describes one
property to be added to the table (except that if a line ends with \,
then the following line, if it exists, is treated as a continuation
line, as described
below). The key consists of all the characters in the line starting
with the first non-whitespace character and up to, but not including,
the first ASCII = , : , or whitespace
character. All of the key termination characters may be included in
the key by preceding them with a \.
Any whitespace after the key is skipped; if the first non-whitespace
character after the key is = or : , then it
is ignored and any whitespace characters after it are also skipped.
All remaining characters on the line become part of the associated
element string. Within the element string, the ASCII
escape sequences \t , \n ,
\r , \\ , \" , \' ,
\ (a backslash and a space), and
\uxxxx are recognized and converted to single
characters. Moreover, if the last character on the line is
\ , then the next line is treated as a continuation of the
current line; the \ and line terminator are simply
discarded, and any leading whitespace characters on the continuation
line are also discarded and are not part of the element string.
As an example, each of the following three lines specifies the key
"Truth" and the associated element value
"Beauty" :
Truth = Beauty
Truth:Beauty
Truth :Beauty
As another example, the following three lines specify a single
property:
fruits apple, banana, pear, \
cantaloupe, watermelon, \
kiwi, mango
The key is "fruits" and the associated element is:
"apple, banana, pear, cantaloupe, watermelon, kiwi, mango"
Note that a space appears before each \ so that a space
will appear after each comma in the final result; the \ ,
line terminator, and leading whitespace on the continuation line are
merely discarded and are not replaced by one or more other
characters.
As a third example, the line:
cheeses
specifies that the key is "cheeses" and the associated
element is the empty string.
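The three examples above can be reproduced with java.util.Properties itself. The sketch below loads each one from a string; the class name PropertiesFormatDemo is hypothetical, and the indentation used on the continuation lines is arbitrary because leading whitespace there is discarded.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demonstration class; not part of Findex.
final class PropertiesFormatDemo {
    /** The three-line fruits example, with a backslash ending each continued line. */
    static final String FRUITS =
          "fruits        apple, banana, pear, \\\n"
        + "              cantaloupe, watermelon, \\\n"
        + "              kiwi, mango\n";

    /** Loads the text as a properties stream and returns the value stored for key. */
    static String value(String text, String key) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty(key);
    }

    public static void main(String[] args) {
        System.out.println(value("Truth = Beauty\n", "Truth"));
        System.out.println(value("Truth:Beauty\n", "Truth"));
        System.out.println(value("Truth :Beauty\n", "Truth"));
        System.out.println(value(FRUITS, "fruits"));
        System.out.println("[" + value("cheeses\n", "cheeses") + "]");
    }
}
```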
Return values
Findex returns the following values on exit.
0 | No errors or problems |
-ve | No errors, but negative number indicates number of scan problems |
1 | Insufficient arguments |
2 | Invalid argument |
4 | Out of Memory |
5 | Unexpected exception |
10 | Error reading Index Instructions File |
11 | ScanDirectory invalid |
12 | SaveAsPathname not given in Index Instructions File |
13 | ScanType invalid |
14 | ScanDirectory property not specified |
15 | ScanDirLevels invalid |
16 | ScanPathname property not specified |
17 | ScanURL property not specified |
18 | Invalid true/false value for a property |
19 | AbstractWords invalid |
20 | Invalid parser plug-in |
30 | Error reading reindex parameters from existing Search database |
31 | Unable to reindex because an option is set that is not compatible with Findex, eg Parse PDF |
100 | General scan error |
101 | Internal error |
If Findex finds one problem while scanning (such as an invalid link) then it will return
-1. If your command shell does not cope with signed values then this will be seen as the
unsigned value 4294967295 on systems where exit codes are 4 byte unsigned integers.
If there are two scan problems then -2 is returned (unsigned 4294967294).
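The relationship between the signed and unsigned forms of an exit code is a plain 32-bit reinterpretation, which can be sketched as follows. The class and method names are hypothetical; Integer.toUnsignedLong requires Java 8 or later.

```java
// Hypothetical conversion helper; not part of Findex.
final class SignedExitCode {
    /** Reinterprets a 4-byte unsigned exit code as the signed value that was returned. */
    static int toSigned(long unsignedCode) {
        // The narrowing cast keeps the low 32 bits and reads them as two's complement.
        return (int) unsignedCode;
    }

    /** Shows how a signed exit code appears to a shell that treats it as unsigned. */
    static long toUnsigned(int signedCode) {
        return Integer.toUnsignedLong(signedCode);
    }

    public static void main(String[] args) {
        System.out.println(toSigned(4294967295L));
        System.out.println(toSigned(4294967294L));
        System.out.println(toUnsigned(-1));
    }
}
```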
In Windows, the Sun JVM java command seems to return 0 if it runs out of memory,
and 3 if Ctrl+C is used to halt the program execution.
In Windows, the Microsoft VM jview command seems to return 3221225786 / -1073741510 / 0xC000013a / STATUS_CONTROL_C_EXIT
if Ctrl+C is used to halt the program execution.
Out of memory might be 0xC0000017 / STATUS_NO_MEMORY.