Supplied code
- Runtime class DLLs - no source
  - phdcc.fis.Find.dll
  - phdcc.fis.Findex.dll
- Example C# console application - project source and binary
Getting started
- Unzip the supplied kit in a clean directory.
- Try out the supplied TextExtractTest console application.
- Load the TextExtractTest project into VS.NET and check that it compiles.
- In your VS.NET project, add a reference to
bin/phdcc.fis.Find.dll and bin/phdcc.fis.Findex.dll.
- Add code to your project to call TextExtractor.
TextExtractTest example application
TextExtractTest lets you try out all the features of TextExtractor.
TextExtractTest is a console application, so you will probably test it from a 'Command Prompt' DOS-box.
- Go to the directory where you installed the development kit.
- Enter
bin\TextExtractTest.exe followed by the filename or path of the file that you want to parse.
For example, to parse the supplied file test.htm, enter
bin\TextExtractTest.exe test.htm
- TextExtractTest outputs all the information received to the console output.
You can alter the output seen by changing the project source code and re-compiling.
- You can use TextExtractTest to parse any of the accepted file types.
TextExtractTest uses the file extension to determine the file type.
- TextExtractTest only opens local files, so URLs are not accepted.
However, TextExtractor itself will parse any Stream, including those obtained from URLs.
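The extension-based type detection described above could be sketched as follows. This is only an illustration: the actual table used by TextExtractTest is not shown in this document, so the extension list is an assumption, and `FileTypeGuess` and `TryFromPath` are hypothetical names. The `URLtoParse` stand-in merely mirrors the kit's definition so that the sketch compiles on its own. `Image` is left out because the accepted image extensions are not stated here.

```csharp
using System;
using System.IO;

// Stand-in mirroring the kit's URLtoParse definition so this sketch compiles
// alone. Delete it when your project references phdcc.fis.Findex.dll.
public class URLtoParse
{
    public enum Type { HTML, TXT, PDF, DOC, XLS, PPT, Image, PUB }
}

public static class FileTypeGuess
{
    // Hypothetical helper: maps a file extension to a URLtoParse.Type.
    // Returns false when the extension is not in this (assumed) table.
    public static bool TryFromPath(string path, out URLtoParse.Type type)
    {
        type = URLtoParse.Type.TXT; // placeholder value, only meaningful on success
        string ext = Path.GetExtension(path).ToLowerInvariant();
        switch (ext)
        {
            case ".htm":
            case ".html": type = URLtoParse.Type.HTML; return true;
            case ".txt":  type = URLtoParse.Type.TXT;  return true;
            case ".pdf":  type = URLtoParse.Type.PDF;  return true;
            case ".doc":  type = URLtoParse.Type.DOC;  return true;
            case ".xls":  type = URLtoParse.Type.XLS;  return true;
            case ".ppt":  type = URLtoParse.Type.PPT;  return true;
            case ".pub":  type = URLtoParse.Type.PUB;  return true;
            default:      return false;
        }
    }
}
```

A caller would test the return value first and only pass `type` on to `Parse()` when it is true.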
TextExtractTest VS.NET project
Open an existing solution or a blank solution in VS.NET.
Use "Add an existing project" to load TextExtractTest.csproj in the development kit directory.
The main example code is in C# source code file TextExtractTest.cs.
You may edit this code as you wish.
Class definitions
Here are the public definitions of the main TextExtractor class, the file type class URLtoParse,
and the associated event classes
FindexWordFoundEventArgs and FindexErrorEventArgs.
In summary, you need to make a new TextExtractor object, set up event handlers, open your stream
and then call TextExtractor.Parse(). Then process what you have received.
| C# |
namespace com.phdcc.findex
{
    // Main TextExtractor class
    public class TextExtractor
    {
        public event FindexWordFoundEventHandler WordFound;
        public event FindexErrorEventHandler Error;

        public TextExtractor();

        public int Parse(
            URLtoParse.Type type,
            Stream InputStream,
            TextWriter OutputWriter,
            StringBuilder AllText,
            ArrayList alIndividualWords,
            StringBuilder sbIndividualWords,
            Hashtable Fields
        );
    }

    // WordFound event definition
    public class FindexWordFoundEventArgs : EventArgs
    {
        public string word;
        public FindexWordFoundEventArgs(string word)
        {
            this.word = word;
        }
    }
    public delegate void FindexWordFoundEventHandler(
        object sender,
        FindexWordFoundEventArgs fwfe
    );

    // Error event definition
    public class FindexErrorEventArgs : EventArgs
    {
        public string msg;
        public FindexErrorEventArgs(string msg)
        {
            this.msg = msg;
        }
    }
    public delegate void FindexErrorEventHandler(
        object sender,
        FindexErrorEventArgs fee
    );

    // File type enumeration
    public class URLtoParse
    {
        public enum Type { HTML, TXT, PDF, DOC, XLS, PPT, Image, PUB }
    }
}
|
TextExtractor.Parse() method definition
You must supply two valid input parameters to Parse():
- type must be set to one of the file type enumeration values, eg URLtoParse.Type.HTML
- InputStream must be set to an open Stream to the file, URL or other resource.
There are five output parameters, any of which may be null if they are not desired:
- All plain text characters are sent to the
OutputWriter TextWriter
- All plain text characters are appended to the
AllText StringBuilder
- All unique (lower-cased) words are added to the
alIndividualWords ArrayList in sorted order
- All unique (lower-cased) words are appended to the
sbIndividualWords StringBuilder in sorted order, separated by spaces
- Each field string is added to the
Fields Hashtable, with a string key of the field name (in lower-case)
and a StringCollection value. The StringCollection contains one string element for each instance field found.
Note that fields are not split into words.
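As an illustration of the Fields layout described above (lower-case string keys, each with a StringCollection of instances), the following sketch walks such a table. `FieldDump` and `Describe` are hypothetical names, not part of the kit, and the field names in any test data are made up.

```csharp
using System;
using System.Collections;
using System.Collections.Specialized;
using System.Text;

public static class FieldDump
{
    // Formats a Fields table as "name: value" lines, one line per field instance.
    // Hashtable does not guarantee the order in which field names are visited.
    public static string Describe(Hashtable fields)
    {
        StringBuilder sb = new StringBuilder();
        foreach (DictionaryEntry entry in fields)
        {
            string name = (string)entry.Key;
            StringCollection values = (StringCollection)entry.Value;
            foreach (string value in values)
            {
                sb.Append(name).Append(": ").Append(value).Append('\n');
            }
        }
        return sb.ToString();
    }
}
```

After a call to Parse(), passing the same Hashtable to `FieldDump.Describe` would list every field instance that was found.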
Parse() returns the number of bytes processed, ie the size of the file, or:
- -1: InputStream is null
- -2: Unrecognised file type
- -3: Any error found
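A caller will usually want to turn the return value into something readable. The sketch below does this for the codes listed above; `ParseResult` and `Describe` are hypothetical names, and the handling of other negative values is an assumption, since none are documented here.

```csharp
using System;

public static class ParseResult
{
    // Translates the value returned by Parse() into a short message.
    public static string Describe(int result)
    {
        switch (result)
        {
            case -1: return "InputStream was null";
            case -2: return "Unrecognised file type";
            case -3: return "An error was found during parsing";
            default:
                // Non-negative values are byte counts; any other negative
                // value is not documented, so it is reported as unknown.
                return result >= 0
                    ? result + " bytes processed"
                    : "Unknown result code " + result;
        }
    }
}
```

For example, `Console.WriteLine(ParseResult.Describe(BytesParsed));` could be placed straight after the Parse() call.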
In addition Parse() may raise the following events during processing:
- Each unique (lower-cased) word is reported using the
WordFound event.
- Any error messages are reported using the
Error event.
Notes
- All the plain characters can be found using either the
OutputWriter or
AllText parameters, or both.
- The individual words can be found using one or all of the
alIndividualWords or sbIndividualWords parameters or the
WordFound event.
- Processing will be quicker if you do not look for individual words at all
(because the code has to maintain a list of words and check to see if the word has already been found).
Character canonicalisation
Characters are canonicalised before being processed.
Canonicalisation means converting characters to a basic root character, eg:
- Characters such as ª and á are changed to a, and ç to c
- Ligatures such as ﬁ are expanded to two characters, fi
Note that the HTML parser will already have decoded escaped 'named entity' characters such as
&ccedil; to ç.
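To get a feel for this kind of canonicalisation, the sketch below uses standard .NET Unicode normalisation: compatibility decomposition followed by removal of combining marks. This is not TextExtractor's own implementation, which is not documented here, and its results may differ for some characters. `CanonSketch` and `Canonicalise` are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CanonSketch
{
    // Approximates canonicalisation: decompose each character into its base
    // character plus combining marks (FormKD also expands ligatures), then
    // drop the combining marks.
    public static string Canonicalise(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormKD);
        StringBuilder sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, `Canonicalise("\u00E7")` (ç) gives c, and `Canonicalise("\uFB01")` (the fi ligature) gives the two characters fi.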
Word definition
A word is a sequence of characters delimited by white space, punctuation characters, line breaks, table breaks or similar.
Note that this means that John's code. is reported as three words: "john", "s" and "code".
Each non-Latin character, eg an Asian character, is reported as a single word. One character=one word.
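The word rule can be illustrated with a simple splitter that treats every character other than a letter or digit as a delimiter. This is a sketch only, under the assumption that digits count as word characters; it does not reproduce the one-character-per-word handling of non-Latin text, and unlike Parse() it keeps duplicates and document order. `WordSketch` and `Split` are hypothetical names.

```csharp
using System;
using System.Collections;
using System.Text;

public static class WordSketch
{
    // Splits text into lower-cased words, ending a word at any character
    // that is not a letter or digit (white space, punctuation, and so on).
    public static ArrayList Split(string text)
    {
        ArrayList words = new ArrayList();
        StringBuilder current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                // Delimiter reached: emit the word collected so far.
                words.Add(current.ToString());
                current.Length = 0;
            }
        }
        if (current.Length > 0)
        {
            words.Add(current.ToString());
        }
        return words;
    }
}
```

Calling `WordSketch.Split("John's code.")` yields the three words "john", "s" and "code", matching the behaviour described above.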
Calling TextExtractor
To call TextExtractor:
- Make a new instance of
TextExtractor
- Add
WordFound and Error event handlers if desired
- Open a System.IO.Stream to the file contents
- Determine the file type, eg
URLtoParse.Type.HTML
- Set up objects to receive the results
- Call
TextExtractor.Parse()
| C# |
using System;
using System.Collections;
using System.IO;
using System.Text;
using com.phdcc.findex;

class TextExtractExample
{
    static void Main(string[] args)
    {
        TextExtractor te = new TextExtractor();
        te.WordFound += new FindexWordFoundEventHandler(WordFound);
        te.Error += new FindexErrorEventHandler(ErrorFound);

        // Open input stream (closed automatically at the end of the using block)
        using (Stream InputStream = new FileStream(@"test.htm", FileMode.Open, FileAccess.Read))
        {
            // Determine file type
            URLtoParse.Type type = URLtoParse.Type.HTML;

            // Build objects to receive the output
            TextWriter OutputWriter = System.Console.Out;
            StringBuilder AllText = new StringBuilder();
            ArrayList alIndividualWords = new ArrayList();
            StringBuilder sbIndividualWords = new StringBuilder();
            Hashtable Fields = new Hashtable();

            // Parse file
            int BytesParsed = te.Parse(type, InputStream, OutputWriter,
                AllText, alIndividualWords, sbIndividualWords, Fields);
        }
    }

    // Word found event handler
    static void WordFound(object sender, FindexWordFoundEventArgs fwfe)
    {
        Console.WriteLine("Word found: " + fwfe.word);
    }

    // Error found event handler
    static void ErrorFound(object sender, FindexErrorEventArgs fee)
    {
        Console.WriteLine("Error: " + fee.msg);
    }
}
|
Please see the TextExtractTest code for examples of how to process the output.
Copyright and Licensing
All code, images and documentation are © Copyright 1998-2007 PHD Computer Consultants Ltd.
The FindinSite.TextExtractor trial version finds a restricted amount of information in each file.
To purchase, please contact sales@phdcc.com, mentioning FindinSite.TextExtractor.
Let us know what features you want provided.
This DLL may be provided as a component in future.
TextExtractorTest: version 1.4, 8 January 2007.