findinsite-cd-wizard
Introduction
FindinSite-CD-Wizard indexes or scans your existing CD to build a database of
all the words. You can then edit the search database.
FindinSite-CD-Wizard can build a basic search page to run FindinSite-CD with your search database.
It can copy the necessary FindinSite-CD program files and launch your default browser to
view the search page.
FindinSite-CD-Wizard is a Windows application. There is an alternative indexing tool,
Findex, which is a platform-independent Java application.
Findex indexes HTML, PDF, DOC, PPT/PPS, TXT and JPEG files, but will not build a search page and has no editor.
Findex also indexes meta-data for field searches, including RDF/XML files.
Search databases
A search database contains a list of the words on CD, from web pages, DOC, RTF, XLS, PPT and PDF files and
JPEG/TIFF image meta-data.
You have various scan options that will determine how many pages are scanned and how large
the database will be.
The database contains details of each web page or file in your site. It stores the
page title and an abstract, as well as the target frame. You can change these
details in the editor once the pages have been scanned.
Note that each search database is stored in 14-17 files, each having an extension
starting with ".hi", eg ".his", ".hi1", ".hi2", etc. However, when you open
or save a search database in FindinSite-CD-Wizard, it only asks you for the name of one of
the files, ie the one with extension ".his". Note that you are still dealing with
all the 14-17 search database files.
Make sure that you copy all the 14-17 files with extensions starting ".hi" to your CD.
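As an illustration of the rule above, the following Python sketch copies every file that shares the name of the ".his" file and has an extension starting with ".hi". It is not part of FindinSite-CD-Wizard; the function name copy_search_database and the paths in the usage note are assumptions made for this example.

```python
import shutil
from pathlib import Path


def copy_search_database(his_file, dest_dir):
    """Copy all files of one search database into dest_dir.

    A file belongs to the database if it has the same name as the
    .his file and an extension starting with ".hi" (.his, .hi1, ...).
    Returns the sorted list of file names that were copied.
    """
    his_path = Path(his_file)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for candidate in sorted(his_path.parent.iterdir()):
        same_name = candidate.stem == his_path.stem
        hi_extension = candidate.suffix.lower().startswith(".hi")
        if candidate.is_file() and same_name and hi_extension:
            shutil.copy2(candidate, dest / candidate.name)
            copied.append(candidate.name)
    return copied
```

For example, copy_search_database("C:/My CD/fiscd/mysite.his", "D:/cdimage/fiscd") would return the names of the files it copied, so you can check that the expected 14-17 files are all there.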
FindinSite-CD-Wizard cannot open search databases created by Findex that
contain field search information.
Abstracts
An abstract is a short description of a file that FindinSite-CD shows in the results
by default, to help your user choose a suitable page to view.
(You can change what is displayed in the results list -
see here.)
See the file types page for a summary of how the abstract
is obtained for each supported file type.
For web pages, the abstract is normally taken from the META DESCRIPTION of a page.
If this tag is not present, the first words of the page are used as abstract.
Therefore it is best to add META DESCRIPTION tags to all your pages.
Alternatively you can use a new META ABSTRACT tag to give the abstract, eg:
<META NAME="abstract" CONTENT="This page is the introduction<BR><BR>Start here">
Note that the abstract text may contain the characters <BR> to force a line
feed in the abstract. Put the META ABSTRACT after the META DESCRIPTION.
If there is no META DESCRIPTION or ABSTRACT (or if you have deselected these
scan options) then the abstract is built from the first words of the page body.
You can specify how many words to include in the abstract.
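The choice of abstract for a web page can be sketched in Python as follows. This is only an illustration of the rules described above, not the scanner's actual code. It assumes that a META ABSTRACT takes precedence when both tags are present (the text only says to put it after the META DESCRIPTION), and the function name, parameter names and the default of 20 words are invented for this example.

```python
def choose_abstract(meta_description, meta_abstract, body_text,
                    max_words=20, use_description=True, use_abstract=True):
    """Pick the abstract for a web page.

    Assumption: META ABSTRACT wins over META DESCRIPTION when both
    are present and both scan options are selected.
    """
    if use_abstract and meta_abstract:
        return meta_abstract
    if use_description and meta_description:
        return meta_description
    # No usable META tag: fall back to the first words of the page body.
    return " ".join(body_text.split()[:max_words])
```

Passing use_description=False and use_abstract=False corresponds to deselecting both scan options, so the abstract is always built from the page body.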
Scanning
Select File+New to build a new search database.
See the File Types page for details of the types of file that
FindinSite-CD-Wizard can scan.
A wizard takes you through several steps before the scan starts. First, enter a name for your project.
Then select how FindinSite-CD-Wizard will find the files on your CD. It can either
find all files in a directory (or directories) on a local disk or CD, or it can follow the
hypertext links from an initial file. This initial file can either be a file on a local disk/CD
or a URL on a web site.
Scans of local disk files will be quicker.
If following links, the scanner follows relative links to other web pages.
It will not follow links to absolute URLs, eg "http://...".
If a FRAME tag or an A HREF tag (or similar) has an attribute SPY=ignore or
REL=nofollow then the link is not followed, eg <A HREF="newpage.htm" REL="nofollow">
is not followed.
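The link-following rules above can be summarised in a short Python sketch. It is an illustration only and not the scanner's own logic. The function name should_follow is invented, and treating any href that starts with a scheme such as "http:" as an absolute URL is an assumption of this example.

```python
import re


def should_follow(href, attributes=None):
    """Return True if a link would be followed under the rules above."""
    # Normalise attribute names and values so SPY=ignore matches spy="Ignore".
    attrs = {k.lower(): str(v).lower() for k, v in (attributes or {}).items()}
    if attrs.get("spy") == "ignore":
        return False
    if "nofollow" in attrs.get("rel", "").split():
        return False
    # Assumption: an absolute URL is one that starts with a scheme, eg "http:".
    if re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*:", href):
        return False
    return True
```

So should_follow("newpage.htm") is True, while the same link with {"REL": "nofollow"} or a link to "http://..." is not followed.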
The third wizard page asks what type of files you want scanned. FindinSite-CD-Wizard can find
words in HTML web pages, PDF files, various Microsoft® Office files, TXT text files and JPEG/TIFF images.
You can also change the file specification to indicate which
files are recognised as belonging to a file type. For example, you could set the "HTML files"
file specification to *.htm, *.html, *.asp if you want ASP script files
scanned as well as basic HTML files.
A "File mapping" button lets you index one type of file but show a different filetype if the user gets a hit in this file.
The next wizard page asks you to select a local file to store the search database.
Just enter a pathname without a file extension. The wizard adds extension ".his"
automatically. As stated above, the scanner generates 14-17 files in the complete
search database, each with the same filename prefix but different extensions.
This wizard page gives you the option of making a sub-directory for all FindinSite-CD files.
Most people find that this is useful because it keeps all the FindinSite-CD files
separate from the rest of your CD.
Finally, you can set various scan options. First time round, just use the
default options.
If you want to rebuild your search database automatically, then use FindinSite-CD-Wizard's
Command-line interface.
- Stop words
- You can opt not to store stop words, ie common words such as 'the', 'and' and 'in'.
An English language stop word list is provided in StopWordsEn.txt.
A prototype French stop word list is in StopWordsFr.txt.
You can edit these stop word lists or create your own stop word lists easily:
put the words in a plain text file, one per line. Note that non-alphanumeric
characters should not be used, so you should not include "e.g." as a stop word.
- Store word positions
- FindinSite-CD-Wizard can also optionally store the positions of each word on each page.
This lets your customers find adjacent words.
This option is desirable, but will increase the database size.
- Stop lone word positions
- If you do not store stop word positions then some words may be surrounded
by stop words, eg time in The time of day.
A search for "The time of" (ie in exactly that order) will not get any hits, so it may
seem sensible not to store the position of time in this case.
- Use META description as abstract
- Select this option if you want the META DESCRIPTION to be used as the page abstract.
- Use META abstract as abstract
- Select this option if you want the META ABSTRACT to be used as the page abstract.
- Words in abstract
- Enter the number of words to include in the abstract,
if it is built from the page body.
- No title: ignore page
- Some simple pages may not have titles. If this option is selected then words
on pages without a title will not be included in the database,
and links will not be followed from this page.
- Case is significant
- This specifies whether the case of web page filenames is important, ie whether page
"haggis.html" is different from "Haggis.html". In Windows systems, both these
filenames will refer to the same page. (Note that on non-Windows servers, the
case of link page names is important.)
- Parse up directories
- If this box is checked, the scanner will follow links that go up a directory,
from the directory of the initial page.
- Report PDF character problems
- If this box is checked, the PDF Scanner will report any character code and glyph problems
(these can normally be ignored).
- PDF Passwords
- If any of your PDF files require passwords, type in the passwords here, comma separated.
Open (user) or master (security or owner) passwords are supported.
This option may not be available. If so, a message Sorry, password-protected PDFs
not supported is shown. See the main PDF page for more details.
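If you build your own stop word list, as described under "Stop words" above, a small script can check it before you scan. The Python sketch below keeps only alphanumeric words, removes duplicates and produces the one-word-per-line text. The function names are invented for this example, and the character encoding that FindinSite-CD-Wizard expects for the file is not covered here.

```python
def filter_stop_words(words):
    """Keep only alphanumeric words, dropping duplicates but keeping order."""
    kept = []
    for word in words:
        word = word.strip()
        # str.isalnum() rejects anything with punctuation, such as "e.g."
        if word.isalnum() and word not in kept:
            kept.append(word)
    return kept


def stop_word_file_text(words):
    """Return the stop word list as plain text, one word per line."""
    return "\n".join(filter_stop_words(words)) + "\n"
```

The returned text can then be saved to a plain text file of your choice and selected in the scan options.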
Having pressed Finish, the scan now starts. It displays its progress in a Scan
Report window. When complete, a list of any problems encountered is shown.
Click on OK to complete the scan and begin editing the search database.
FindinSite-CD-Wizard often finds some genuine errors in web sites - it is very easy to make
mistakes. Press the "Save Report As..." button if you want to save the list of
problems to a text file.
If you cancel a scan part way through, the database for the pages already scanned
will be written correctly.
Ignoring words in web pages
Words and links in between APPLET../APPLET, SCRIPT../SCRIPT and STYLE../STYLE tags are ignored,
along with ASP script code in between <% and %>.
Words and links in between a <DIV class=nospy> tag and the next </DIV> tag are ignored.
Words in between a <DIV class=nospytext> tag and the next </DIV> tag are ignored, but
links are followed.
Words and links in between a <DIV class=nospyabstract> tag and the next </DIV> tag are
put in the search database but not put in the abstract.
Words and links after a NOFRAMES tag are ignored, apart from storing the abstract.
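To make some of these rules concrete, here is a simplified Python sketch built on the standard html.parser module. It is not the scanner's code and only covers part of the behaviour: it skips APPLET, SCRIPT and STYLE content and ASP code, and handles the nospy and nospytext DIV classes. The nospyabstract class and NOFRAMES are left out, and the class and function names are invented for this example.

```python
import re
from html.parser import HTMLParser

SKIP_TAGS = {"applet", "script", "style"}


class IndexableText(HTMLParser):
    """Collect the words and links that would be kept under the rules above."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []
        self._skip_tag = None  # APPLET/SCRIPT/STYLE tag currently being skipped
        self._div_mode = None  # "nospy" or "nospytext" while inside such a DIV

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self._skip_tag is None and tag in SKIP_TAGS:
            self._skip_tag = tag
        elif tag == "div" and self._div_mode is None:
            css = (attr.get("class") or "").lower()
            if css in ("nospy", "nospytext"):
                self._div_mode = css
        elif tag == "a" and attr.get("href"):
            # Links are dropped inside skipped tags and nospy DIVs only.
            if self._skip_tag is None and self._div_mode != "nospy":
                self.links.append(attr["href"])

    def handle_endtag(self, tag):
        if tag == self._skip_tag:
            self._skip_tag = None
        elif tag == "div":
            self._div_mode = None  # the next </DIV> ends the ignored region

    def handle_data(self, data):
        if self._skip_tag is None and self._div_mode is None:
            self.words.extend(data.split())


def extract_words_and_links(html):
    """Return (words, links) for one page of HTML."""
    html = re.sub(r"<%.*?%>", " ", html, flags=re.DOTALL)  # drop ASP code
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return parser.words, parser.links
```

Note that, as in the description above, an ignored DIV region ends at the next </DIV> tag, so nested DIVs inside a nospy region are not tracked.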
Making and viewing a search page
FindinSite-CD-Wizard can build a basic search page for you, either when you first scan your existing
pages to build a search database, or later using the Test+Create search page.. menu.
FindinSite-CD-Wizard can also copy all the necessary FindinSite-CD program files into the same directory
as the search page.
Note that the FindinSite-CD runtime files include a com sub-directory. Make sure that
com and all its sub-directories are put onto your CD.
You can tailor the search page as you wish, eg in your favourite HTML editor.
FindinSite-CD-Wizard can run the Windows Notepad text editor to view or change the source HTML of
the search page.
FindinSite-CD-Wizard can then display the search page in your default browser, using the
Test+View search page.. menu or its toolbar short-cut.
In the scan wizard you can ask that all the FindinSite-CD files be grouped in a separate subdirectory.
Editing the search database
The FindinSite-CD-Wizard main window lets you edit a search database, ie the words themselves,
complete pages of words and the Base URLs.
Most of the time you will just be trying to slim down the search database, ie to remove
pages that should not be found, or words that people are unlikely to search for.
It is also a good idea to check the abstract for each page.
The top of the edit window displays the description and various statistics about the
search database. Only the description can be edited.
If your files contain any words with characters that cannot be displayed in the current
system locale then the characters will not be displayed correctly. However they can be edited
- with care. See the character sets page for full details.
Word Editing
Click on the "Words" tab to edit the words.
Initially, no words are displayed. You have to select one or more of the
check boxes to make some or all of the words appear.
Note that deleting words does not make them disappear straight away. Indeed, you
can still see them listed if you have "Show deleted words" checked.
This is useful as it allows you, say, to delete all words with non-alphabetic
characters and then go through the list undeleting the ones that are of interest.
| Check box | Shows |
| All | All words |
| Just numbers | Words consisting only of numbers |
| All capitals | Words with only capital letters |
| Any non-latin chars | Words with any non-latin characters |
| Starting with | Words starting with the given letters |
| Shorter than | Words with fewer than the given number of characters |
| Longer than | Words with more than the given number of characters |
Select one of the check boxes to indicate which words you want to see.
Note that you can select more than one option at once.
For example, the above screen shows words with All Capitals and Starting With C.
The word list shows each word on a separate line, sorted alphabetically.
In the "Cases" column, any instances of a word with different capital letters
are shown. Finally, the "Pages" column shows all the pages that a word appears on.
To the left of each word is a small state icon. This is a blue tick if the word is in
the database, or a thin red cross (X) if the word is deleted.
Click on the state icon to delete or undelete a word.
Alternatively press the Del key, right-click and select
"Delete", or select menu "Edit+Delete".
Select the "Delete all shown words" button to delete all the shown words.
You can click on the "Words" and "Cases" column headers to sort the word list in different
ways.
You can change the list of pages that a word refers to by right-clicking and selecting
"Word properties..." or select menu "Edit+Word properties...".
The Word properties box shows the different letter cases of the selected word and the
pages in which the word appears. Click on the "Remove page" button to remove the
highlighted page from the list of pages that the word finds.
Page Editing
Click on the "Pages" tab to edit the pages that the search database refers to.
First select a page to edit from the list.
Pressing "Delete" deletes the page and all the words it contains from the search database.
For each page, the Title, URL, Base URL, Target frame, Priority and Abstract are shown,
and can be edited. The list of anchors on the page is shown but cannot be edited.
(Base URLs are explained below.)
The Priority field can be used to re-order the FindinSite-CD results list.
You can set the Priority using the custom META phd-spy-priority tag.
Base URLs
Click on the "Base URLs" tab to edit the Base URLs of the pages.
A "Base URL" is the characters to put in front of the characters of the page URL.
So if the page URL is "index.html" and its Base URL is "http://www.phdcc.com/" then
the page displayed is "http://www.phdcc.com/index.html".
If you asked the FindinSite-CD-Wizard scan wizard to make the search database in a subdirectory,
then there will usually be a single Base URL "../".
Otherwise there will be no Base URLs, and the Base URL for each page will display <None>.
Normally you should leave the Base URLs alone. However you can add, edit and delete
the Base URLs in this Base URLs tab.
Simply press "Add" to add a new Base URL. Or select a Base URL and edit it below.
Note that you can only delete a Base URL if it is not used by any page.
(You may have to exit FindinSite-CD-Wizard and re-enter to get Delete enabled.)
Use "Set all URLs to this Base URL" to make all the pages use the currently
selected Base URL. Alternatively use "Set all URLs to <None>" to reset all
pages' Base URLs to <None>.
Back in the "Pages" tab, you can choose the Base URL for an individual page in the
"Base URL" combo box.
Command-line interface
You can run FindinSite-CD-Wizard from a command-line to rebuild a search database without any user
interaction. This lets you update a search database easily from an MS-DOS batch file or equivalent.
Running FindinSite-CD-Wizard from a command-line is equivalent to selecting menu
File+Rebuild this search database
and therefore must refer to an existing search database.
(You cannot currently make a new search database from the command-line.
You cannot edit the search database from the command line.)
Run FindinSite-CD-Wizard from the command-line as follows (change the path to fisCDWiz.exe if necessary):
"C:\Program Files\PHD\fisCDv5\fisCDWiz.exe" /rebuild search_database [-c] [optional_log_file]
You must specify /rebuild as the first parameter, and
a search database filename for the second parameter, including the .his extension.
Optionally specify -c if you want a list of files output to a new console
(note that this is not the standard output so it cannot be redirected).
Optionally add the name of a file for the scan report (in text format).
It is safest to use the full pathname for each file.
Put the filename(s) in double-quotes if they contain space characters, eg:
"C:\Program Files\PHD\fisCD\fisCDWiz.exe" /rebuild "C:\My CD\fiscd\mysite.his" C:\scanlog.txt
FindinSite-CD-Wizard does not output any information to the command window.
However it will return a non-zero error code for serious errors.
For successful runs (even if there are scan errors), FindinSite-CD-Wizard returns zero.
If FindinSite-CD-Wizard has been used for more than 30 days under the Free licence, it will show a nag message box
suggesting a purchase, stopping the command-line interface from running uninterrupted.
Similarly, scans of DOC, XLS or PPT files can result in message boxes appearing
if Word, Excel or PowerPoint finds, say, that a file is corrupted.
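If you prefer to drive the rebuild from a script rather than a batch file, the command-line described above can be wrapped as in the Python sketch below. The function names are invented for this example. Only the documented parameters (/rebuild, the .his filename, -c and the optional log file) are used, in the documented order.

```python
import subprocess


def build_rebuild_command(wizard_exe, database, console=False, log_file=None):
    """Build the argument list for a command-line rebuild."""
    if not database.lower().endswith(".his"):
        raise ValueError("the search database filename must include the .his extension")
    command = [wizard_exe, "/rebuild", database]
    if console:
        command.append("-c")  # list scanned files in a new console
    if log_file:
        command.append(log_file)  # scan report, in text format
    return command


def rebuild(wizard_exe, database, console=False, log_file=None):
    """Run the rebuild. Returns True if the wizard returned zero."""
    command = build_rebuild_command(wizard_exe, database, console, log_file)
    # Passing a list lets subprocess quote paths that contain spaces.
    result = subprocess.run(command)
    return result.returncode == 0
```

A zero return code only means that the run completed. As noted above, scan errors do not produce a non-zero code, so check the scan report file as well.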
Technical details
The Character sets page contains information on how FindinSite-CD-Wizard
scans HTML files. The PDF Scanning Support page has details of the
PDF Scanning module.
See the file types page for details of how FindinSite-CD-Wizard
indexes other file types.
If you use framesets, FindinSite-CD will usually not show a result page in the desired frameset.
Please consult the framesets page for several solutions to
this problem.
Directories and files that have a hash (#) character in their pathnames cannot be scanned
because FindinSite-CD-Wizard cannot distinguish them from anchor names. An appropriate error is
sent to the Scan Report.
META phd-spy-rebase
A custom META tag can be used to change the Base URL, filename and target for a page in the
search database. A META tag usually appears in the page header. The phd-spy-rebase META
tag must have an attribute called name with a value of phd-spy-rebase.
The content attribute specifies a replacement Base URL, filename and target - comma separated
with no spaces.
Each element is optional. If an element is not present then it is not changed.
If the replacement Base URL is present but empty, the page Base URL is not changed.
If the replacement filename is present but empty, then the page's filename is set to empty.
If the replacement target is present but empty, then the page's target is set to empty.
This example changes the Base URL for its page to http://www.xxx.com/, the filename
to newfilename.htm and the target to Main.
<META name="phd-spy-rebase" content="http://www.xxx.com/,newfilename.htm,Main">
If the replacement filename is -,
any directories in the page's filename are removed.
In this example, if the page filename is yy/z/apage.htm, then the Base URL is changed
to http://www.xxx.com/ and the filename
to apage.htm.
<META name="phd-spy-rebase" content="http://www.xxx.com/,-">
If the replacement filename is +,
then the page filename is not changed.
This is useful if you only want to change the target, eg:
<META name="phd-spy-rebase" content=",+,Main">
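The rules for the content attribute can be expressed as a short Python sketch. This illustrates the description above and is not the scanner's implementation. The function name apply_rebase is invented, and directories in a filename are assumed to be separated by "/" as in the example above.

```python
def apply_rebase(content, base_url, filename, target):
    """Apply a phd-spy-rebase content value to a page's details.

    Returns the new (base_url, filename, target).
    """
    parts = content.split(",")
    # Base URL: an empty element leaves the Base URL unchanged.
    if parts[0] != "":
        base_url = parts[0]
    # Filename: "-" strips directories, "+" keeps it, empty sets it to empty.
    if len(parts) > 1:
        if parts[1] == "-":
            filename = filename.rsplit("/", 1)[-1]
        elif parts[1] != "+":
            filename = parts[1]
    # Target: present but empty sets the target to empty.
    if len(parts) > 2:
        target = parts[2]
    return base_url, filename, target
```

With a page whose filename is yy/z/apage.htm, the content "http://www.xxx.com/,-" gives the Base URL http://www.xxx.com/ and the filename apage.htm, matching the second example above.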
META phd-spy-priority
A custom META tag can be used to set a "Priority number" for each web page.
Priority numbers can be used to change the order that FindinSite-CD displays the results pages.
See the Screen layout - Results Ordering section for more details
(you will need to set the ReorderResults FindinSite-CD search page parameter).
A META tag usually appears in the page header. The phd-spy-priority META
tag must have an attribute called name with a value of phd-spy-priority.
(The old value of phd-spy-revision is also supported.)
The content attribute specifies an integer in the range 0 to 255 to be used as the page
Priority.
By default, each page has a Priority of zero. This example changes the Priority to 10.
<META name="phd-spy-priority" content="10">
The source of this page contains an example of a META phd-spy-priority tag.
META robots
The standard robots META tag can be used to tell FindinSite-CD-Wizard whether to index a page
and whether to follow the links in the page (when following links).
See http://www.robotstxt.org/wc/meta-user.html
for full details.
A META tag usually appears in the page header. The robots META
tag must have an attribute called name with a value of robots.
The content attribute specifies one or more comma-separated directives:
| Directive | Description |
| index | Index the words on this page |
| noindex | Don't index the words on this page |
| follow | Follow the links on this page |
| nofollow | Don't follow the links on this page |
| none | Same as noindex,nofollow |
| all | Same as index,follow |
| spyignore | Ignore the other directives, ie index and follow regardless |
By default, FindinSite-CD-Wizard indexes each page and follows links.
This example indexes the page but does not follow links.
<META name="robots" content="index,nofollow">
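The directives in the table can be interpreted as in the following Python sketch. It is an illustration of the table and not the scanner's own code. The function name parse_robots is invented, and letting a later directive override an earlier conflicting one is an assumption of this example.

```python
def parse_robots(content):
    """Return (index, follow) for a robots META content value."""
    index, follow = True, True  # default: index the page and follow links
    directives = [d.strip().lower() for d in content.split(",") if d.strip()]
    if "spyignore" in directives:
        return True, True  # ignore all the other directives
    for directive in directives:
        if directive == "index":
            index = True
        elif directive == "noindex":
            index = False
        elif directive == "follow":
            follow = True
        elif directive == "nofollow":
            follow = False
        elif directive == "none":
            index, follow = False, False
        elif directive == "all":
            index, follow = True, True
    return index, follow
```

For the example above, parse_robots("index,nofollow") indexes the page but does not follow its links.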