FindinSite: search engines for MS-servers, Java-servers and CDs/DVDs

 

How good are the major search engines at indexing?


Summary:

  • Search engines aren't as good as you might expect at finding all the information on your site.
  • Don't rely on the majors if you want a good site search.

1. The questions that need answering...

  • Do the search engines find all the information on my web site?
  • Is the information that they hold on my web site up to date?
  • What file types are found?

  • How do you ask the search engines to find your information?
  • If you use a major search engine as your site's search, how can you do better?
2. How do the search engines find the information on my website?
  • You usually submit the home page of your web site using the engine's Add URL function, eg adding http://www.phdcc.com/
  • The search engine then reads the home page and follows links to find new pages or files
  • The process of finding files is called spidering or crawling and the whole process including information storage is called indexing.

  • Search engines also find links into your site on external sites.
    For example, http://www.phdcc.com/lv/ is no longer officially part of our site; however other sites still provide links into these pages, so the main search engines still find them.

  • Google also has a system called SiteMaps in beta - you upload a sitemap which tells Google which pages to index on your site, and when.
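
The link-following process described above can be sketched as a breadth-first walk over a site's links. This is only an illustration under stated assumptions: the `SITE_LINKS` table and the `example.com` URLs are hypothetical stand-ins for real pages, and no network access is performed. It is not how any particular search engine is implemented.

```python
from collections import deque

# Hypothetical site: each URL maps to the URLs that page links to.
SITE_LINKS = {
    "http://www.example.com/": [
        "http://www.example.com/a.htm",
        "http://www.example.com/b.htm",
    ],
    "http://www.example.com/a.htm": [
        "http://www.example.com/b.htm",
        "http://www.example.com/c.htm",
    ],
    "http://www.example.com/b.htm": ["http://www.example.com/"],
    "http://www.example.com/c.htm": [],
    "http://www.example.com/orphan.htm": [],  # no page links here
}


def crawl(start_url, links):
    """Return the URLs reachable from start_url, in discovery order."""
    found = [start_url]
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:  # follow each link only once
                seen.add(target)
                found.append(target)
                queue.append(target)
    return found


found = crawl("http://www.example.com/", SITE_LINKS)
missed = sorted(set(SITE_LINKS) - set(found))
print("found:", found)
print("missed:", missed)
```

A page that nothing links to (`orphan.htm` in this sketch) is never reached from the home page. A crawler would only find it through a link on an external site or through a submitted sitemap.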
3. How often are the search engines likely to index my site?

Analysis summary

The results of our analysis have yielded the following empirical conclusions:

  • Search engines do not necessarily index your site every day.
  • When they do visit your site, they may only re-index certain pages.
  • Some pages are not currently being searched because the indexer has not got around to following new links.
  • Some pages have never been indexed by some search engines.
  • Search engines re-index pages more often if the pages change often.
  • The number of files found at a site can vary widely between search engines.
  • The number of files found at a site is often much lower than the number that actually exist.
4. How many files have the major search engines found at my site?

The simplest way to find out how many files have been found on your website is to enter site:domain as the search text, eg site:www.phdcc.com. Note that search engines usually use the word "about" when saying the site size, eg Results 61 - 70 of about 780.
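
As a rough sketch of the above, the search text can be built from a domain and the reported size pulled out of a results line. The helper names `site_query` and `parse_reported_count` are hypothetical, and the pattern assumes the line has the form quoted above; real engines word this line differently and change it over time.

```python
import re


def site_query(domain):
    """Build the search text that restricts results to one site."""
    return f"site:{domain}"


# Matches "... of about 780" or "... of 780"; the "about" part is optional.
RESULT_LINE = re.compile(r"of\s+(about\s+)?([\d,]+)", re.IGNORECASE)


def parse_reported_count(text):
    """Return (count, is_estimate) from a results line, or None if absent."""
    match = RESULT_LINE.search(text)
    if match is None:
        return None
    count = int(match.group(2).replace(",", ""))
    return count, match.group(1) is not None


print(site_query("www.phdcc.com"))
print(parse_reported_count("Results 61 - 70 of about 780"))
```

Keeping the `is_estimate` flag alongside the number is a reminder that the figure is approximate and may change as further result pages are viewed.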

phdcc.com

Use the following searches to see how many pages each search engine thinks are on the phdcc.com site. The numbers on the right of the table show how many were found on the date given:

Search: site:www.phdcc.com

Engine      Number of files reported (8 June 2005)
Google      763
MSN         2721 ... 810 ... 250
Yahoo       1000 ... 952 ... 958 ... 927
findinsite  712

As can be seen, the number reported by MSN was initially 2721 but this went down to 250 by the time the third or fourth sheet of results was shown. For Yahoo, the results kept reducing as each results sheet was shown.

phdcc's site search engine findinsite reported 712 files. Here are some reasons we found for the differences between the numbers (checked on 27 May 2005):

  • Google does not include this page in its index (http://www.phdcc.com/dircvs.html), even though this page has been available since at least 1999. Both MSN and Yahoo have found this page.
  • Google does not include this page in its index (http://www.phdcc.com/fis/phdcc_fis_intro.ppt), probably because it has not re-indexed this portion of our web site for a few days. The file went online 3 days ago. Other files are not indexed for the same reason, eg http://www.phdcc.com/brightdayler/access.htm. MSN and Yahoo have not included these files either.
  • Google does include an SWF Flash file. findinsite does not index this file type.
  • findinsite erroneously double-counts some directory URLs, eg it counts http://www.phdcc.com/findinsite/ and http://www.phdcc.com/findinsite/default.htm as separate files even though they are the same.
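
The double-counting in the last point could be avoided by reducing each URL to a canonical form before counting. The following is a minimal sketch, not findinsite's actual code: the `DEFAULT_DOCUMENTS` set is an assumption (the default document name depends on how the web server is configured), and `canonical_url` and `count_unique_files` are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default document names; the real list is server-specific.
DEFAULT_DOCUMENTS = {"default.htm", "index.htm", "index.html"}


def canonical_url(url):
    """Map a directory's default document onto the directory URL itself."""
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    if filename.lower() in DEFAULT_DOCUMENTS:
        path = directory + "/"
    else:
        path = parts.path or "/"
    # Host names are case-insensitive; any fragment is dropped.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def count_unique_files(urls):
    """Count URLs after collapsing equivalent directory forms."""
    return len({canonical_url(u) for u in urls})


urls = [
    "http://www.phdcc.com/findinsite/",
    "http://www.phdcc.com/findinsite/default.htm",
]
print(count_unique_files(urls))
```

This treats the two forms as one file only when the file name is in the assumed list, so a directory whose default document has a different name would still be counted twice.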

swlg.org.uk

Search: site:www.swlg.org.uk

Engine      Number of files reported (7 June 2005)
Google      141
MSN         644 ... 63
Yahoo       106 ... 105 ... 104
findinsite  142
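
To put the swlg.org.uk figures side by side, each engine's last reported number can be expressed as a share of a reference count. This sketch uses findinsite's 142 as the reference, which is an assumption made purely for illustration: findinsite's own count is not guaranteed to be the true number of files on the site.

```python
# Last figure each engine reported for site:www.swlg.org.uk (7 June 2005).
REPORTED = {"Google": 141, "MSN": 63, "Yahoo": 104}

# findinsite's count, used here only as an assumed reference point.
BASELINE = 142


def coverage(reported, baseline):
    """Return each engine's reported count as a fraction of the baseline."""
    return {engine: count / baseline for engine, count in reported.items()}


for engine, share in coverage(REPORTED, BASELINE).items():
    print(f"{engine}: {share:.0%} of the baseline")
```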

yeadonsailingclub.org.uk

Number of files reported:

Engine      Search                                     7 June 2005  8 June 2005
Google      site:http://www.yeadonsailingclub.org.uk/  4            4
Google      site:www.yeadonsailingclub.org.uk          9            4
Google      site:yeadonsailingclub.org.uk              12           11
MSN         site:www.yeadonsailingclub.org.uk          20           20
MSN         site:yeadonsailingclub.org.uk              20           20
Yahoo       site:www.yeadonsailingclub.org.uk          25           24
Yahoo       site:yeadonsailingclub.org.uk              30           30
findinsite  -                                          -            30

Google does seem to have performed poorly for this site. Is it because the site is based on frames?
Google's old list of 9 hits included various files that have not existed on the site for a long time.
Also, if you search Google for something useful (like "dinghy sailing yorkshire"), it does not find the Yeadon site at all.

5. How up to date is the information held by the major search engines?

We found that some information stored by the major search engines can be out of date.

Google and MSN store a copy of the information retrieved in a 'cache' - you can see the cached version with search words highlighted by clicking on the 'Cached' or 'Cached page' links for a result. The page shown indicates the date when the file was cached. The date shows when the file's information was last incorporated into the index.

phdcc's Brightdayler service keeps track of file cache dates (and your position in the results list).

6. Should I use the major search engines as my site search?

Many people use a major search engine to provide a site search. The conclusion from the above results is that not all your site's information may be found and that some of the information may be out of date.

Here's another white paper from phdcc: "Why bother with findinsite when Google, MSN, Yahoo etc do the same job?"
It explains how to provide a site search and gives other reasons for switching to a custom site search engine such as phdcc's findinsite. Reasons include:

  • The major search engines may show ads from your competitors
  • A custom site search can be integrated to match your site's appearance
  • Results can be highlighted

  All site Copyright © 1996-2005 PHD Computer Consultants Ltd, PHDCC   Privacy