FindinSite-CD: Search engine for CD/DVD
 

 

Indexing parser plug-ins


Introduction

If FindinSite-CD does not support a file type then you can write a plug-in in Java to index it. Plug-ins can only be added to the Findex Java indexer, and cannot be added to FindinSite-CD-Wizard.

You must also make sure that your users will be able to view files of your new file type, i.e. they must have a suitable viewer already available, or one must be provided on the CD.

Java programmer information: a plug-in must implement the fisParser interface. Its main method, parse(), must read an InputStream and report any information it finds through the caller's ParseHandler interface.

Any number of plug-ins can be added into Findex by adding ParserN properties to the index instructions file. Additional plug-in specific properties can also be added to this file.

FindinSite-CD comes with an example plug-in, phdccRDF, for indexing RDF/XML files. This plug-in is slightly unusual because RDF files only contain meta-data about other files. However, the techniques shown in the source are exactly the same as would be used by a normal parser.

Using a plug-in

To use a plug-in, you must:
  • tell Findex about it in the index instructions file
  • add the plug-in runtime (.jar file) to the CLASSPATH when running Findex

Findex ParserN properties

To tell Findex to use a plug-in, add a ParserN property to the index instructions file. Any number of parsers can be added; the first must be called Parser1, the second Parser2, etc.

The value for each ParserN property must contain these fields, separated by semi-colons:

Field                                 Example
File Type Short Name                  RDF
Default file-spec (comma-separated)   *.rdf,*.xml
fisParser implementing class          com.phdcc.findex.rdf.ParseRDF
AddPage                               false
TranslateEscapeSequences              false

For each ParserN property you can also specify the following additional properties, where Name is the File Type Short Name given in the ParserN value:

Property     Example           Description                   Type
ParseName    ParseRDF=true     Whether indexing is enabled   Boolean
Name_Files   RDF_Files=*.rdf   Actual file-spec desired      Comma-separated file-specs
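
As an illustration, the example values from the two tables above can be combined into a single set of properties for the supplied RDF plug-in. This fragment is assembled from those example values only; it is not copied from a shipped index instructions file, so treat it as a sketch.

```properties
Parser1=RDF;*.rdf,*.xml;com.phdcc.findex.rdf.ParseRDF;false;false
ParseRDF=true
RDF_Files=*.rdf
```

The five semi-colon separated fields of Parser1 appear in the same order as the rows of the first table.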

You can also add your own custom properties. Provide a semi-colon separated list of these in a ParserNparams property. For each parameter you can set a default value after a colon. The actual property value can be any string value (though there is special support for retrieving boolean values).

This is an example set of properties added to a Findex index instructions file. Two additional parameters are supported; the actual properties override the null and false default values.

parser1=XYZ;*.rdf;com.phdcc.findex.xyz.ParseXYZ;true;false
parser1params=XYZ_Passwords;XYZ_ReportErrors:false
ParseXYZ=yes
XYZ_Files=*.xyz
XYZ_Passwords=secret,shhhh
XYZ_ReportErrors=true

Plug-in runtimes

You need to add each plug-in runtime to the Java CLASSPATH. You can do this when you run Findex by adding the plug-in .jar file to the -cp option of the java command. For example, to add the phdccRDF runtime, append phdccRDF.jar after a semi-colon (the Windows path separator; Java on Unix-like systems uses a colon instead). Note that you may need to use a full path to the .jar file.

java -cp Findex.jar;phdccRDF.jar com.phdcc.findex.Findex @index.properties

Programming a Plug-in

You need to be a competent Java programmer to write a Findex file indexer plug-in. You will also need to know how to extract the required information from your files.

There is source code for the required interfaces in the plugin sub-directory of the Windows installation directory, e.g.:
C:\Program files\PHD\fisCD\plugin\
You will probably need to create a suitable directory structure, matching the com.phdcc.findex package, if you are going to develop with these files.

fisParser.java, fisParser.class: fisParser interface source and compiled class
fisParserInformation.java, fisParserInformation.class: fisParserInformation interface source and compiled class
ParseHandler.java, ParseHandler.class: ParseHandler interface source and compiled class
ParseRDF.java: RDF example plug-in class source
The ParseRDF.java example is supplied already compiled and packaged as phdccRDF.jar, which is in the main installation directory.

fisParser interface

Your plug-in must have a class that implements the com.phdcc.findex.fisParser interface, given below.

A single instance of your class is created when Findex starts.

Your parse() routine is called once to parse each file. If appropriate, use the com.phdcc.findex.fisParserInformation interface (if info is not null) to retrieve any values set for your custom parameters. Then, as you read the InputStream, report any information you find to Findex using the com.phdcc.findex.ParseHandler callback. Finally, return the number of bytes indexed, i.e. the total number of bytes in the file.

Your GetPosition() method may be called at any time to get some information about where you are in the file for error reporting purposes. For example, you could return "line 123" or "page 5". Return null if there is no such information, or if you are not processing a parse() call.

The SetCharset() method is used by Findex to tell its HTML parser what character set to use, based on the page's META content-type.

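package com.phdcc.findex;

import java.io.*;

public interface fisParser
{
    // Called once per file; returns the number of bytes indexed
    public int parse( InputStream is, ParseHandler callback, fisParserInformation info);
    // Current position for error reports, e.g. "line 123", or null
    public String GetPosition();
    // Used by Findex to tell its HTML parser what character set to use
    public void SetCharset( int charset);
}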
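
To make the life cycle above concrete, here is a minimal sketch of a plug-in that reports every non-empty line of a file as plain text. It is not taken from the PHDCC source. The class names LineParserSketch, LineParser and Recorder are hypothetical, the three interfaces are repeated as nested types only so that the file compiles on its own, and the meanings of the int results and of AbortNow() are assumptions because they are not documented here.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineParserSketch {

    // Local copies of the Findex interfaces so this file compiles on its own.
    // A real plug-in would import them from com.phdcc.findex instead.
    public interface fisParserInformation {
        String GetParam(String name);
        boolean GetBooleanParam(String name);
    }
    public interface ParseHandler {
        void SetPage(String URL, String target);
        int Tag(String tag);
        void TagEnd();
        int Attribute(String name, String text);
        int TaggedText(String text);
        int PlainText(String text);
        void ReportError(String Msg);
        PrintStream getErrorStream();
        void ShowProgress();
        boolean AbortNow();
    }
    public interface fisParser {
        int parse(InputStream is, ParseHandler callback, fisParserInformation info);
        String GetPosition();
        void SetCharset(int charset);
    }

    /** Hypothetical parser: reports every non-empty line as plain text. */
    public static class LineParser implements fisParser {
        private int line = 0; // 0 means "not inside parse()"

        public int parse(InputStream is, ParseHandler callback, fisParserInformation info) {
            int bytes = 0;
            StringBuilder current = new StringBuilder();
            InputStream in = new BufferedInputStream(is);
            line = 1;
            try {
                int b;
                while ((b = in.read()) != -1) {
                    bytes++;
                    if (b == '\n') {
                        report(current, callback);
                        line++;
                        if (callback.AbortNow()) break; // assumed meaning: stop early
                    } else if (b != '\r') {
                        current.append((char) b); // assumes single-byte (Latin-1) text
                    }
                }
                report(current, callback);
            } catch (IOException e) {
                callback.ReportError("Read failed at " + GetPosition() + ": " + e.getMessage());
            } finally {
                line = 0;
            }
            return bytes;
        }

        private static void report(StringBuilder text, ParseHandler callback) {
            if (text.length() > 0) callback.PlainText(text.toString());
            text.setLength(0);
        }

        public String GetPosition() { return line == 0 ? null : "line " + line; }

        public void SetCharset(int charset) { /* ignored in this sketch */ }
    }

    /** Test handler that records PlainText() calls and ignores everything else. */
    public static class Recorder implements ParseHandler {
        final List<String> texts = new ArrayList<>();
        public void SetPage(String URL, String target) { }
        public int Tag(String tag) { return 0; }
        public void TagEnd() { }
        public int Attribute(String name, String text) { return 0; }
        public int TaggedText(String text) { return 0; }
        public int PlainText(String text) { texts.add(text); return 0; }
        public void ReportError(String Msg) { System.err.println(Msg); }
        public PrintStream getErrorStream() { return System.err; }
        public void ShowProgress() { }
        public boolean AbortNow() { return false; }
    }

    /** Runs the parser over a string and summarises what it reported. */
    public static String demo(String content) {
        Recorder recorder = new Recorder();
        InputStream in = new ByteArrayInputStream(content.getBytes(StandardCharsets.ISO_8859_1));
        int bytes = new LineParser().parse(in, recorder, null);
        return bytes + " bytes: " + recorder.texts;
    }

    public static void main(String[] args) {
        System.out.println(demo("alpha\n\nbeta")); // prints "11 bytes: [alpha, beta]"
    }
}
```

The sketch does not close the stream it is given, since nothing here says whether Findex expects the plug-in to do so. GetPosition() returns null outside parse(), as the description above requires.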

fisParserInformation interface

Use the com.phdcc.findex.fisParserInformation interface to find any values set for your custom parameters. Any user-supplied values will override the default values you gave in the ParserNparams property.

Use either GetParam() or GetBooleanParam() to get a named parameter. If the parameter has not been set by the user and has no default, GetParam() returns null and GetBooleanParam() returns false.

package com.phdcc.findex;

public interface fisParserInformation
{
    public String GetParam(String name);
    public boolean GetBooleanParam(String name);
}
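
The following sketch shows one way a plug-in might read the two custom parameters from the XYZ example earlier on this page. It is an illustration only: ParamSketch, describe() and MapInfo are hypothetical names, the interface is repeated locally so that the file compiles on its own, and MapInfo is a stand-in for Findex whose rule for which strings count as true is an assumption. Splitting the password value on commas is a choice made by this sketch, not something Findex does for you.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParamSketch {

    // Local copy of the Findex interface so this file compiles on its own.
    public interface fisParserInformation {
        String GetParam(String name);
        boolean GetBooleanParam(String name);
    }

    /** Reads the two hypothetical XYZ parameters, tolerating a null info object. */
    public static String describe(fisParserInformation info) {
        List<String> passwords = Collections.emptyList();
        boolean reportErrors = false;
        if (info != null) {
            String raw = info.GetParam("XYZ_Passwords"); // null if unset and no default
            if (raw != null && !raw.isEmpty()) {
                passwords = Arrays.asList(raw.split(",")); // comma-splitting is this sketch's choice
            }
            reportErrors = info.GetBooleanParam("XYZ_ReportErrors");
        }
        return "passwords=" + passwords + " reportErrors=" + reportErrors;
    }

    /** Stand-in for Findex that serves values from a map (assumed boolean rule). */
    public static class MapInfo implements fisParserInformation {
        private final Map<String, String> values = new HashMap<>();

        public MapInfo set(String name, String value) {
            values.put(name, value);
            return this;
        }

        public String GetParam(String name) {
            return values.get(name);
        }

        public boolean GetBooleanParam(String name) {
            String v = values.get(name);
            return v != null && (v.equalsIgnoreCase("true") || v.equalsIgnoreCase("yes"));
        }
    }

    /** Uses the values from the example index instructions file. */
    public static String demo() {
        return describe(new MapInfo()
                .set("XYZ_Passwords", "secret,shhhh")
                .set("XYZ_ReportErrors", "true"));
    }

    public static void main(String[] args) {
        System.out.println(demo());         // passwords=[secret, shhhh] reportErrors=true
        System.out.println(describe(null)); // passwords=[] reportErrors=false
    }
}
```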

ParseHandler interface

Your plug-in passes any found information back to Findex through the com.phdcc.findex.ParseHandler interface, below. This interface is designed to accept HTML-like information.

Full information on the interface will be prepared soon.

package com.phdcc.findex;

import java.io.*;

public interface ParseHandler
{
    public void SetPage( String URL, String target);
    public int Tag( String tag);
    public void TagEnd();
    public int Attribute( String name, String text);
    public int TaggedText( String text);
    public int PlainText( String text);
    public void ReportError(String Msg);
    public PrintStream getErrorStream();
    public void ShowProgress();
    public boolean AbortNow();
}

RDF Plug-in Example

The example plug-in phdccRDF consists of a single class, ParseRDF (source file ParseRDF.java), in package com.phdcc.findex.rdf. It inspects RDF/XML to find meta-data field information about other files.

As required, ParseRDF implements fisParser. However, it can also be run from the command line for testing purposes, and so has a static main() method. You can pass the name of an RDF file as the first command line parameter. For this testing to work, ParseRDF also implements ParseHandler so that it can print out what it is reporting.
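
ParseRDF implements ParseHandler itself for its self-test. A different approach, sketched here and not taken from the PHDCC source, is to build a logging ParseHandler with the standard java.lang.reflect.Proxy class, so that every call a plug-in makes is recorded without writing out all ten methods. HandlerLogSketch and loggingHandler() are hypothetical names, the interface is repeated locally so that the file compiles on its own, and the values returned to the plug-in (0 and false) are placeholders because the meaning of the results is not documented here.

```java
import java.io.PrintStream;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HandlerLogSketch {

    // Local copy of the Findex interface so this file compiles on its own;
    // a real harness would import com.phdcc.findex.ParseHandler instead.
    public interface ParseHandler {
        void SetPage(String URL, String target);
        int Tag(String tag);
        void TagEnd();
        int Attribute(String name, String text);
        int TaggedText(String text);
        int PlainText(String text);
        void ReportError(String Msg);
        PrintStream getErrorStream();
        void ShowProgress();
        boolean AbortNow();
    }

    /** Builds a ParseHandler that appends each call, with its arguments, to 'log'. */
    public static ParseHandler loggingHandler(List<String> log) {
        InvocationHandler handler = (proxy, method, args) -> {
            // args is null for methods that take no parameters
            log.add(method.getName() + (args == null ? "[]" : Arrays.toString(args)));
            Class<?> type = method.getReturnType();
            if (type == int.class) return 0;           // placeholder result
            if (type == boolean.class) return false;   // AbortNow(): never abort
            if (type == PrintStream.class) return System.err;
            return null;                               // void methods
        };
        return (ParseHandler) Proxy.newProxyInstance(
                ParseHandler.class.getClassLoader(),
                new Class<?>[] { ParseHandler.class },
                handler);
    }

    /** Makes a few calls, as a plug-in under test might, and returns the log. */
    public static List<String> demo() {
        List<String> log = new ArrayList<>();
        ParseHandler callback = loggingHandler(log);
        callback.PlainText("hello");
        callback.ReportError("bad byte at line 3");
        callback.AbortNow();
        return log;
    }

    public static void main(String[] args) {
        for (String entry : demo()) {
            System.out.println(entry);
        }
    }
}
```

Passing such a handler to your plug-in's parse() method from a main() method gives a quick way to see what the plug-in reports before running it inside Findex.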

ParseRDF uses Java SAX technology to analyse XML files. SAX is provided in the standard Sun Java VM distribution.
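
For readers new to SAX, the sketch below shows the general technique using only standard Java classes. It is not the ParseRDF source: SaxSketch and Collector are hypothetical names, and the RDF snippet is made up for illustration. The handler collects the rdf:about attribute and the text of each element, which is the kind of information a plug-in would then pass on to its ParseHandler.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSketch {

    /** Records "about=..." for rdf:about attributes and "element=text" for element text. */
    static class Collector extends DefaultHandler {
        final List<String> found = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);
            String about = atts.getValue("rdf:about"); // looked up by qualified name
            if (about != null) found.add("about=" + about);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if (!value.isEmpty()) found.add(qName + "=" + value);
            text.setLength(0);
        }
    }

    /** Parses the stream with the default (non-namespace-aware) SAX parser. */
    public static List<String> extract(InputStream in) {
        Collector collector = new Collector();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, collector);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        return collector.found;
    }

    /** Runs the extractor over a made-up RDF snippet. */
    public static List<String> demo() {
        String xml = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
                + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<rdf:Description rdf:about=\"page.html\">"
                + "<dc:title>My page</dc:title>"
                + "</rdf:Description></rdf:RDF>";
        return extract(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [about=page.html, dc:title=My page]
    }
}
```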

Copyright © 1996-2011 PHD Computer Consultants Ltd, PHDCC

Last modified: 8 February 2006.
