FindinSite-MS: Search engine for an ASP.NET website

findinsite-ms indexing


Before you can search your web site, you must index it to build a search database.
  • findinsite-ms supports indexing of HTML, PDF, DOC, DOCX, XLS, XLSX, PPT/PPS, PPTX, PUB, TXT, JPEG and TIFF file types, featuring a regular indexing schedule and email of indexing results.

If need be, please see our notes on using findinsite-ms on a load balanced server farm/cluster.

Setting the findinsite-ms Search database

findinsite-ms can search one or more search databases. To tell findinsite-ms which search database to use, first make your search database as described below. Then go to the Control Panel Searching section and either:
  • add your new search database, or
  • change a search database to your new one and press Make Changes.

After an indexing run successfully rebuilds a current findinsite-ms search database, findinsite-ms automatically reloads the new search database. You can confirm the index dates and times for the current search databases by looking in the Control Panel Searching section.

The findinsite-ms Indexer

The findinsite-ms indexer only builds one search database at a time; any other pending indexing runs are held in an Immediate queue. The indexer checks its Scheduled list frequently to see if it has any work to do.

Application life-time and Indexing

The findinsite-ms indexer only ever does scheduled indexing runs on the hour. For the indexing system to work, the findinsite-ms application needs to keep running continuously. This will, for example, let it do an indexing run every day at 2am.

While some systems do run findinsite-ms continuously, other systems will stop the findinsite-ms application if it is not in regular use, eg if no searches are being carried out. Some systems are set up to stop an application after a certain number of hits, when its memory use reaches a limit, or after 29 hours.

findinsite-ms restarts if it has been stopped. At the next hour, findinsite-ms checks to see if any indexing runs have been missed, and runs them if need be.

If findinsite-ms is in regular use, then it will be restarted fairly quickly after stopping, so indexing runs should happen more-or-less when expected. If you expect your site to be very quiet overnight and want your indexing run to happen at that time, then you must use an external scheduling tool to wake up findinsite-ms. The scheduling tool can be set to access any findinsite-ms URL; however, it is recommended that you access the page search.aspx?keep=alive. If your indexing run is set up to run at 2am, then you should schedule your wake-up task for a short time before, eg 1:50am.
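
As an illustration of such a wake-up task, here is a minimal Python sketch that requests the recommended search.aspx?keep=alive page. The base URL http://www.example.org/findinsite/ and the function names are assumptions made for this example; they are not part of findinsite-ms.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def keep_alive_url(base_url: str) -> str:
    """Build the keep-alive URL for a findinsite-ms installation.

    base_url is the (assumed) folder where findinsite-ms is installed,
    eg http://www.example.org/findinsite/
    """
    if not base_url.endswith("/"):
        base_url += "/"  # urljoin would otherwise replace the last path segment
    return urljoin(base_url, "search.aspx?keep=alive")


def wake_up(base_url: str, opener=urlopen, timeout: int = 30) -> int:
    """Request the keep-alive page and return the HTTP status code.

    opener is injectable so the function can be tried without a network.
    """
    with opener(keep_alive_url(base_url), timeout=timeout) as response:
        return response.status
```

You could call wake_up("http://www.example.org/findinsite/") from whatever scheduling tool your server provides, timed a few minutes before the indexing run is due.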

Indexing Configuration

findinsite-ms's indexing is controlled from the Control Panel. The Indexing section of the Control Panel lets you set up immediate or regular indexing runs. Each indexing run builds one search database by indexing one web site.

The screenshot on the right shows the Indexing section of the Control Panel menu. Click on one of these options...

  • Indexing: shows the indexer status, and lets you set up indexing limits and email reporting.
    • Immediate queue: shows a list of indexing runs queued, in progress or run recently.
    • Scheduled list: shows a schedule of the regular indexing runs.
    • Create new: runs the wizard to set up a new indexing run.
[Screenshot: Control Panel Indexing menu, with links to Status/Limits/Emails, Immediate queue, Scheduled list and Create new wizard]

Creating a new Indexing Run

The Create new indexing run wizard has five steps:
  1. Select time to run: "now" or a regular schedule (see the regular indexing times listed after these steps)
  2. Choose a filename for the search database - see below
       Either: Reindex an existing search database
       Or: Build a new search database
  3. Enter URL of web site to index.  Check the file types that you want indexed.
  4. Enter any Advanced Options
  5. Confirm indexing run: Store in the schedule and optionally run it now

Regular indexing times:
  • Hourly, on the hour
  • Daily at a specified hour
  • Weekly at a specified week-day and hour
  • Monthly at a specified first week-day and hour
  • Monthly at a specified day and hour
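
To make the "Daily" and "Weekly" schedule types concrete, the following Python sketch works out when the next run on the hour would fall. It is only an illustration of the arithmetic; the function names are hypothetical and this is not the scheduler that findinsite-ms itself uses.

```python
from datetime import datetime, timedelta


def next_daily_run(now: datetime, hour: int) -> datetime:
    """Next occurrence of hour:00 strictly after now."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def next_weekly_run(now: datetime, weekday: int, hour: int) -> datetime:
    """Next occurrence of hour:00 on the given week-day (Monday is 0)."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```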

Search database filenames and the findinsite-ms Work directory

Filenames
A search database is actually stored in many files, each with the basic filename you choose, but with a different extension, eg for index1 the actual files are index1.his, index1.hi1, etc.
In addition, findinsite-ms may also use the basic filename with an underscore appended, eg index1_, ie files index1_.his, index1_.hi1, etc.

When findinsite-ms remakes an existing search database it chooses the oldest filename to remake, eg index1 or index1_.
This ensures that a good search database is still available in the event that an indexing run fails, eg because network access to the site fails.
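
The alternation between the two filenames can be sketched as follows. This Python example is an assumption about how such a choice could be made (by comparing file modification times in the work directory); the function names are hypothetical and the real findinsite-ms logic may differ.

```python
from pathlib import Path


def newest_mtime(work_dir: Path, name: str) -> float:
    """Latest modification time among the files of one database name.

    Returns 0.0 if no files exist yet, so an unused name counts as oldest.
    """
    times = [p.stat().st_mtime for p in work_dir.glob(name + ".*")]
    return max(times, default=0.0)


def name_to_remake(work_dir: Path, base: str) -> str:
    """Pick the older of base and base_ so the newer one stays searchable."""
    candidates = [base, base + "_"]
    return min(candidates, key=lambda name: newest_mtime(work_dir, name))
```

For a basic filename of index1, this returns index1 when index1_ holds the more recent database, and index1_ otherwise.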

Work directory
findinsite-ms always puts its search database files in its work directory. Make sure that this is in a suitable location by looking at the Control Panel General section. If you decide to change the work directory then you must change the work appSettings value in Web.Config, as described here.

Scheduled list of indexing runs

Clicking on the Scheduled list option in the Control Panel Indexing section displays a list of your regular indexing runs like this:

[Screenshot: Control Panel Indexing Scheduled list]

The list shows a summary of each indexing run, with various control options, as described in the next section. If an indexing run has completed then it will also show a summary of the run output - see the Immediate list screen below for an example.

Indexing run control options

When an indexing run summary is displayed on screen, click on the appropriate icon for the following options:

  • Details: Display a full description of the indexing run and the last output it generated.
  • Edit: Start the Create new wizard to edit the indexing run. Note that the indexing run will be removed from the scheduled list when you start an edit; therefore you must complete the wizard if you want to store your indexing run.
  • Run: Put the indexing run in the Immediate queue so that it is run when it comes to the front of the queue.
  • Stop/Remove: The action of this option depends on the state of the indexing run:
    • If in progress, then stop the run
    • If in the completed runs list, then remove from this list
    • If in the scheduled list, then remove and delete the run
In each case, you are asked to confirm the action first.

Immediate queue of indexing runs

Clicking on the Immediate queue option in the Control Panel Indexing section displays this information:
  • The indexing run in progress
  • The list of indexing runs queued waiting for execution
  • The list of recent completed indexing runs (within the last 20 minutes)
If any indexing runs are in progress, then the display updates every 20 seconds to show you the latest status.

The example screenshot below shows an indexing run in progress and one recently completed. Notice how each summary lists the number of pages and words found. If any problems are reported, click on the Details (i) icon for more information on these problems.

[Screenshot: Control Panel Indexing Immediate list]

Indexing status and general options

Clicking on the main Control Panel Indexing section displays the current indexing status and settings, with an option to make changes to your general configuration.

The Indexing limits values apply to all your indexing runs. Use them if you do not want a run to take too long or a search database to grow too large.

If you set all the Indexing Email reporting values then findinsite-ms will email you with details of each indexing run completed - useful to keep an eye on findinsite-ms.

Press the Make Changes button if you alter any settings.

Option: Description

Current status
  • Immediate queue: A summary of the size of the indexing queue, whether an indexing run is in progress, and the number of completed indexing runs.
  • Scheduled list: The number of regular scheduled indexing runs.

Indexing Limits
  • Time limit: The maximum length of an indexing run in minutes, or 0 to have no limit.
  • File limit: The maximum number of files for an indexing run, or 0 to have no limit.

Indexing Email reporting
All the following boxes must be completed to enable email reporting of index results.
  • SMTP send mail server: The name of your mail server, eg mail.mycompany.com
  • SMTP send port: The mail server port, eg 25 by default
  • From name: The name of the email sender, eg Julie Wilson
  • From email address: The email of the email sender, eg [email protected]
  • From email password: If your mail server requires send authentication, enter your password here. The password is stored in plain text in the work directory file findinsite.xml. For more security, store it in the Web.Config file appSettings EmailFromPassword value - see here for details.
  • To email address: The email of the email recipient, eg [email protected]
  • Send email if findinsite-ms restarted: If this box is checked, findinsite-ms will send an email whenever it is started by the servlet engine.
  • Send test email: Click to send a test email. Make sure that you press "Make Changes" first if you have just entered any changes.

If any of the above options are in grey boxes then you cannot change them; the value has been set by your webmaster or servlet administrator.
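
The "0 means no limit" convention of the Time limit and File limit options can be sketched in Python as below. The function and parameter names are hypothetical; this only illustrates the rule as described above.

```python
def limit_reached(elapsed_minutes: float, files_indexed: int,
                  time_limit: int = 0, file_limit: int = 0) -> bool:
    """Return True once either indexing limit has been hit.

    A limit of 0 disables that check, matching the option descriptions.
    """
    over_time = time_limit > 0 and elapsed_minutes >= time_limit
    over_files = file_limit > 0 and files_indexed >= file_limit
    return over_time or over_files
```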

Technical details

Indexer user-agent

When the findinsite-ms indexer spiders/crawls a web site, it calls itself FindInSiteBot. The user-agent HTTP header is set as follows:

FindInSiteBot/1.17.2235.31507 (http://www.phdcc.com/findinsite/bot.htm http://www.example.org/findinsite/)

This string includes the findinsite-ms version, a link to an explanation page (http://www.phdcc.com/findinsite/bot.htm) and the URL of the current instance of findinsite-ms. The latter field is useful in determining which instance of findinsite-ms is indexing the site.
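
If you want to pick these fields out of your web server logs, a regular expression along the following lines may help. This Python sketch assumes the user-agent always has exactly the layout shown above; the pattern and function name are illustrative only.

```python
import re

# Assumed layout: FindInSiteBot/<version> (<explanation URL> <instance URL>)
UA_PATTERN = re.compile(
    r"^FindInSiteBot/(?P<version>[\d.]+) "
    r"\((?P<info_url>\S+) (?P<instance_url>\S+)\)$"
)


def parse_findinsite_user_agent(user_agent: str):
    """Return a dict of version, info_url and instance_url, or None."""
    match = UA_PATTERN.match(user_agent.strip())
    return match.groupdict() if match else None
```

Applied to the example string above, instance_url would be http://www.example.org/findinsite/, which identifies the findinsite-ms instance doing the indexing.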

Indexing run sessions and cookies

The findinsite-ms indexer maintains cookies during an indexing run and so preserves session state.

robots.txt

findinsite-ms supports the robots.txt exclusion file - see the Robot Exclusion Standard for details. findinsite-ms looks for the FindInSiteBot user-agent; if this is not present it honours the commands for the * user-agent.

findinsite-ms supports the Crawl-Delay option in robots.txt. The Crawl-Delay number indicates the number of seconds between accesses; values greater than 60 are reduced to 60. The default value is zero.
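
To check how a robots.txt file would be interpreted under these rules, you can approximate them with Python's standard urllib.robotparser module. This is a sketch, not the findinsite-ms implementation: the 60-second cap and zero default are taken from the description above, and the helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

BOT_NAME = "FindInSiteBot"
MAX_CRAWL_DELAY = 60  # seconds; larger values are reduced to this


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def effective_crawl_delay(parser: RobotFileParser) -> int:
    """Crawl-Delay for FindInSiteBot, falling back to the * entry.

    Returns 0 when no Crawl-Delay is given, and never more than 60.
    """
    delay = parser.crawl_delay(BOT_NAME)
    if delay is None:
        return 0
    return min(int(delay), MAX_CRAWL_DELAY)


def may_index(parser: RobotFileParser, url: str) -> bool:
    """True if the rules allow FindInSiteBot (or *) to fetch url."""
    return parser.can_fetch(BOT_NAME, url)
```

The standard library parser uses the FindInSiteBot entry if one exists and otherwise the * entry, which mirrors the fallback described above.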

Page and link indexing control

The META robots tag is supported, including noindex and nofollow options.

The rel="nofollow" attribute for A tags is supported.
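
The effect of these two controls on a single page can be illustrated with a small Python parser. This is a simplified sketch based on the standard html.parser module; the class and attribute names are hypothetical and the findinsite-ms indexer may treat edge cases differently.

```python
from html.parser import HTMLParser


class RobotsAwareLinkParser(HTMLParser):
    """Collect links while honouring META robots and rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.index_page = True    # cleared by META robots noindex
        self.follow_links = True  # cleared by META robots nofollow
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            directives = {d.strip().lower() for d in content.split(",")}
            if "noindex" in directives:
                self.index_page = False
            if "nofollow" in directives:
                self.follow_links = False
        elif tag == "a" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "nofollow" not in rel:
                self.links.append(attrs["href"])

    def followable_links(self):
        """Links to spider; empty if the page as a whole says nofollow."""
        return list(self.links) if self.follow_links else []
```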

Excluding sections of pages

You can exclude portions of web pages from indexing or spidering as follows:

Comments

You can use commands within HTML comments. The commands "googleoff:" and "FindinSiteoff:" turn off the specified options, while "googleon:" and "FindinSiteon:" turn on the specified options. The following options are available:

  • index: text indexing
  • follow: following links
  • all: both index and follow

Examples:

  • <!--googleoff: index--> ... <!--googleon: index-->
  • <!-- FindinSiteoff:follow index--> ... <!--findinsiteon: all-->

The commands must appear at the start of the comment. The options must be space or comma separated. Commands cannot be nested.
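
The comment commands can be thought of as switches that are flipped as the page is read. The Python sketch below illustrates that idea with the standard html.parser module. It assumes that command names are case-insensitive and that leading spaces are allowed, as the examples above suggest; all names in the code are hypothetical.

```python
import re
from html.parser import HTMLParser

# The command must be the first thing in the comment.
COMMAND = re.compile(
    r"^\s*(googleoff|googleon|findinsiteoff|findinsiteon):(.*)$",
    re.IGNORECASE | re.DOTALL,
)


class CommentCommandParser(HTMLParser):
    """Track index/follow switches set by HTML comment commands."""

    def __init__(self):
        super().__init__()
        self.indexing = True
        self.following = True
        self.words = []
        self.links = []

    def handle_comment(self, data):
        match = COMMAND.match(data)
        if not match:
            return  # an ordinary comment
        turn_on = match.group(1).lower().endswith("on")
        # Options are space or comma separated.
        options = {o.lower() for o in re.split(r"[\s,]+", match.group(2)) if o}
        if options & {"index", "all"}:
            self.indexing = turn_on
        if options & {"follow", "all"}:
            self.following = turn_on

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href and self.following:
            self.links.append(href)

    def handle_data(self, data):
        if self.indexing:
            self.words.extend(data.split())


def scan(html: str) -> CommentCommandParser:
    """Feed a whole page through the parser and return it."""
    parser = CommentCommandParser()
    parser.feed(html)
    parser.close()
    return parser
```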

Other options
  • Words and links in between a <DIV class=nospy> tag and the next </DIV> tag are ignored.
  • Words in between a <DIV class=nospytext> tag and the next </DIV> tag are ignored, but links are followed.
  • Words and links in between a <DIV class=nospyabstract> tag and the next </DIV> tag are put in the search database but not put in the abstract.
  • Words and links in between APPLET../APPLET, SCRIPT../SCRIPT and STYLE../STYLE tags are ignored, along with ASP script code in between <% and %>.
  • Words and links after a NOFRAMES tag are ignored, apart from storing the abstract.
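
The first two DIV options can be illustrated in the same way. This Python sketch covers only class=nospy and class=nospytext, ends the excluded section at the next closing DIV tag as described above, and uses hypothetical names throughout.

```python
from html.parser import HTMLParser


class NoSpyParser(HTMLParser):
    """Skip text (and, for nospy, links) inside marked DIV sections."""

    def __init__(self):
        super().__init__()
        self.mode = None  # None, "nospy" or "nospytext"
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            css_class = (attrs.get("class") or "").lower()
            if self.mode is None and css_class in ("nospy", "nospytext"):
                self.mode = css_class
        elif tag == "a" and attrs.get("href") and self.mode != "nospy":
            self.links.append(attrs["href"])  # nospytext still follows links

    def handle_endtag(self, tag):
        if tag == "div":
            self.mode = None  # the next closing DIV ends the exclusion

    def handle_data(self, data):
        if self.mode is None:
            self.words.extend(data.split())
```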

Searches by web crawler robots

Some web crawler robots fill in forms with (random) words. These will appear as normal searches to FindinSite-MS. If the associated UserAgent contains http:// then the search is marked as coming from a robot in the logs and in the search count displayed in the Control Panel About section.
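
The robot test described here is a simple substring check, which could be written in Python as follows. The function name is hypothetical; you might use something similar when analysing your own logs.

```python
def search_is_from_robot(user_agent: str) -> bool:
    """True if the user-agent contains http://, as described above."""
    return "http://" in (user_agent or "")
```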

You can stop honest web crawlers from visiting the FindinSite-MS search page by adding a suitable Disallow entry to your site's robots.txt file, eg:
Disallow: /findinsite/search.aspx


All site Copyright © 1996-2014 PHD Computer Consultants Ltd, PHDCC


Last modified: 23 October 2009.