FindinSite-CD: Search engine for CD/DVD

File Types and Viewers



Introduction

This page lists the file types supported by findinsite-cd, findinsite-js, findinsite-ms and findex. It also describes the viewer programs that your customers need on their computers, and gives some cautionary notes on filenames.

The table below summarises the supported file types. Each type is described in more detail further down this page.

Searching and Indexing

All the findinsite runtimes can search for hits in any file type that is stored in a search database.

However, a search database must be made in advance by one of the indexing programs.

  • The findinsite-cd-wizard Windows tool makes search databases for findinsite-cd, and supports HTML, PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX, TXT and Image file types.
  • The findex Java application and findinsite-js index HTML, PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX and TXT files, and JPEG image meta-data.
  • findinsite-ms indexes HTML, PDF, DOC, DOCX, XLS, XLSX, PUB, PPT, PPTX and TXT files, and JPEG and TIFF image meta-data.
  • The phdccRDF Java plug-in for findex indexes RDF/XML files.
  • The FindinSite.TextExtractor tool indexes the same file types as findinsite-ms.
Indexer support is shown in the four columns named after the indexers.

| Type | Description | Standard extensions | findinsite-cd-wizard | findex | findinsite-js | findinsite-ms, TextExtractor | Viewer software | More information |
|---|---|---|---|---|---|---|---|---|
| HTML | Web page | *.htm *.html | YES | YES | YES | YES | Browser | findinsite-cd-wizard, findex, Character sets |
| PDF | Adobe® Portable Document Format | *.pdf | YES | YES | YES | YES | Adobe Acrobat Reader | PDF Support |
| DOC | Microsoft® Word document | *.doc *.rtf | YES # | YES $ | YES $ | YES $ | MS-Word, Word Viewer or WordPad | |
| DOCX | Microsoft® Word 2007 document | *.docx *.docm | YES | YES | YES | YES | MS-Word 2007, (?)Word Viewer | |
| XLS | Microsoft® Excel document | *.xls | YES ## | YES | YES | YES | MS-Excel | |
| XLSX | Microsoft® Excel 2007 spreadsheet | *.xlsx *.xlsm | YES | YES | YES | YES | MS-Excel 2007, (?)Excel Viewer | |
| PPT | Microsoft® PowerPoint presentation | *.ppt *.pps | YES ### | YES | YES | YES | MS-PowerPoint | |
| PPTX | Microsoft® PowerPoint 2007 presentation | *.pptx *.pptm | YES | YES | YES | YES | MS-PowerPoint | |
| PUB | Microsoft® Publisher publication | *.pub | NO | NO | NO | YES | MS-Publisher | |
| RDF/XML | Resource Description Framework | *.rdf | NO | YES $$ | NO | NO | Not applicable | RDF Support |
| TXT | Non-word processed text | *.txt | YES | YES | YES | YES | Browser | |
| JPEG/TIFF | Image meta-data | *.jpg *.jpeg *.tif *.tiff | YES #### | YES $$$ | YES $$$ | YES | JPEG: browser | |

# findinsite-cd-wizard requires MS-Word 97/2000+ to scan Word documents
## findinsite-cd-wizard requires MS-Excel 5/97/2000+ to scan Excel documents
### findinsite-cd-wizard requires MS-PowerPoint 97/2000+ to scan PowerPoint documents
#### findinsite-cd-wizard requires GDI+ to index images
$ Only Word 97 or later files are supported. Word is NOT needed. RTF files are NOT supported.
$$ Using phdccRDF plug-in.
$$$ findex and findinsite-js only index JPEG meta-data

Be careful to check that your files can be viewed correctly on your customers' computers.

  • Check they have a suitable viewer program, together with all necessary plug-ins.
  • Check that your files only use links to information that is present. For example, Word files may contain absolute links to files or images on your hard drive, so make sure that all pictures are saved in the document. In Word, select menu Edit+Links to see what links a document has.

File type identification

Each file type has an acronym, eg HTML for web pages. Each filename's extension identifies the file's type.

For example, a filename of index.html has a filename extension of .html, and a filename prefix of index. File types are identified by their file extension, using an asterisk to stand for any filename prefix, eg *.html means any file with a filename extension of .html. Web pages are normally identified as *.htm *.html because both the .htm and .html filename extensions are commonly used to indicate that the content is HTML.

In findinsite-cd-wizard on the "Scan File Types" wizard page, you can specify the file extensions that should be scanned in a particular way. For example, if you want to scan RTF files using the MS-Word scanner, then put *.rtf in the MS-Word DOC edit box.
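
The extension patterns described above behave like simple wildcard matches. As a minimal sketch only, the Python below maps a filename to a file type acronym. The `SCAN_TYPES` mapping and the `file_type` function are hypothetical names chosen for illustration, not settings or functions of any findinsite tool, and the case-insensitive matching is an assumption.

```python
from fnmatch import fnmatch

# Hypothetical mapping of type acronyms to extension patterns, loosely
# following the table on this page. This is not a findinsite setting.
SCAN_TYPES = {
    "HTML": ["*.htm", "*.html"],
    "PDF": ["*.pdf"],
    "DOC": ["*.doc", "*.rtf"],
    "TXT": ["*.txt"],
}


def file_type(filename, scan_types=SCAN_TYPES):
    """Return the acronym whose pattern matches the filename, or None."""
    name = filename.lower()  # assumption: extensions compare case-insensitively
    for acronym, patterns in scan_types.items():
        if any(fnmatch(name, pattern) for pattern in patterns):
            return acronym
    return None


print(file_type("index.html"))
print(file_type("Report.RTF"))
print(file_type("photo.png"))
```

Moving `*.rtf` from one list to another in such a mapping is the equivalent of typing it into a different edit box on the "Scan File Types" page.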

Filenames for findinsite-cd

Please be careful about your filenames to ensure that findinsite-cd will be able to show the pages on your customers' computers.

Filenames with spaces are not recommended, although they should work (because findinsite-cd converts them to %20 in the URL).
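
The %20 conversion mentioned above is standard URL percent-encoding. The sketch below shows the same conversion with Python's standard library. It illustrates the encoding only and is not findinsite-cd's own code; `to_url_path` is a hypothetical helper name.

```python
from urllib.parse import quote


def to_url_path(filename):
    """Percent-encode a relative filename for use in a URL path."""
    # By default quote() keeps "/" and unreserved characters as they are
    # and encodes each space as %20.
    return quote(filename)


print(to_url_path("user guide/chapter 1.html"))
```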

Filenames with non-English characters are definitely not recommended because some browsers will not display the files correctly. Microsoft Internet Explorer seems to cope with filenames correctly. However the Netscape browsers do not; even an e acute (é) in a filename will mean that the page cannot be shown.

In findinsite-cd-wizard, the scan directory or initial scan filename must be entered in the computer's default local character set. However findinsite-cd-wizard correctly converts these to a safe portable form, called Unicode.

On Windows NT, 2000 and XP, findinsite-cd-wizard will subsequently find all Unicode filenames correctly. For example, if you have a page with a Japanese filename on an English system, then findinsite-cd-wizard will find it correctly. However your customers may be running a different operating system and browser, so it is quite possible that they are unable to view such pages.
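
Before indexing, it can help to list the filenames that fall into the categories described in this section. The following is a rough sketch under the assumption that "non-English characters" means any character outside ASCII. The function names are hypothetical and not part of findinsite-cd-wizard.

```python
import os


def filename_problem(name):
    """Return a short reason if the filename may not display everywhere."""
    if not name.isascii():
        return "non-ASCII character"  # definitely not recommended
    if " " in name:
        return "space"  # should work, but not recommended
    return None


def risky_filenames(root):
    """Yield (path, reason) for every risky filename below a directory."""
    for folder, _dirs, files in os.walk(root):
        for name in files:
            problem = filename_problem(name)
            if problem:
                yield os.path.join(folder, name), problem


for path, reason in risky_filenames("."):
    print(f"{path}: {reason}")
```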

Non-latin Characters

In almost all cases, the findinsite and findex indexing programs will find the correct characters in your documents, even if they cannot be displayed correctly on the current computer.

For example, findinsite and findex correctly interpret most common web page character sets.
As another example, findinsite-cd-wizard extracts information from Microsoft Office documents in the Unicode format. Unicode has codes for the characters used in almost all languages. This works even on Western systems that cannot normally display non-latin characters such as Windows 98 and Me.

Document, field, target and link information

As well as reading the basic words in each type of file, the findinsite and findex indexers find more information if it is available.
  • For example, if an HTML web page specifies a "META description" then this is usually used as the page abstract.
  • Document meta-data is stored by all indexers. In addition, findex, findinsite-js and findinsite-ms store document meta-data for field searches.
  • If findinsite or findex is following links, then it finds further files to scan from the links in each file, ie the hypertext links in web pages. If specified, the target window name is also stored.

The summary information below indicates the document, field and link information that is found for each file type.


Summary Information for each File Type

Web pages (HTML) *.htm *.html and possibly *.asp *.php, etc
 Indexer findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.
  • These indexers can scan HTML web pages, which usually have file extensions of .html or .htm.
  • Dynamically generated files can be scanned as well, eg using file extensions .aspx .asp, .php, .jsp, provided you are indexing an online web site rather than source files.
  • The indexers can also scan source files that are meant to be interpreted first by server script engines, eg *.aspx, *.asp, *.php, *.jsp, etc. However, the indexers may not ignore the script text correctly.
  • The indexers will find all the plain text words in an XML file, but do not interpret any XML structure information, ie anything in tags is ignored unless it is HTML.

Each indexer has various options for scanning HTML files.

The findinsite-cd-wizard page describes how various tags are handled by all the indexers, eg how the <DIV class=nospy> tag can be used to indicate words that should not be stored.

See the Character sets page for details of the character sets (used by different languages) supported by the indexers.

  • Document information
Page title Obtained from the TITLE tag, if present. Otherwise set to the filename.
Abstract Obtained from the "META Abstract" or "META Description" tags, if present. Otherwise, from the first words of the page.
Word priority Words are prioritised in searches in this order:
  • Document meta-data, including title, description and keywords
  • Filename
  • Headings H1..H6
  • all other text
See the Character sets page and findinsite-cd-wizard page for details of other recognised tags.
• Field Information The following meta-data is found:
  • <META name=nnn content=xxx> as field "nnn" containing "xxx"
  • <META http-equiv=content-language content=xxx> as field "lang" containing "xxx"
  • <BODY lang=xxx> as field "lang" containing "xxx"
  • <TITLE>xxx</TITLE> as field "title" containing "xxx"
  • <IMG alt=xxx> as field "img" containing "xxx"
• Links If following links, the indexers follow hypertext links to non-absolute URLs. An option allows them to scan up the directory tree.

Hypertext links are normally A HREF=xx TARGET=yy tags. However, FRAME SRC=xx NAME=yy tags and any other tags (such as the AREA tag) that have HREF and TARGET attributes are also followed.

If a link has an attribute rel="nofollow", then the link is not followed, eg this link <A HREF=afile.htm rel="nofollow"> is ignored.

 Viewer program Please remember that your customers' HTML viewer (their browser) may not be the same as yours, so be careful in your choice of HTML features, plug-ins, etc. Dynamically generated pages designed to run on a server will usually not run correctly from a CD.
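
The title, field and link rules above can be illustrated with a short sketch built on Python's standard `html.parser` module. It is not the indexers' implementation: the `PageInfo` class and `summarise` function are hypothetical, only META name fields are collected, the "first words of the page" fallback for the abstract is left out, and absolute URLs are not filtered.

```python
from html.parser import HTMLParser


class PageInfo(HTMLParser):
    """Collect the title, META fields and followable links of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.fields = {}  # field name -> content, eg "description"
        self.links = []   # (url, target window name) pairs
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # HTMLParser lower-cases tag and attribute names
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and a.get("name") and a.get("content"):
            self.fields[a["name"].lower()] = a["content"]
        elif tag in ("a", "area") and a.get("href"):
            # rel may hold several space-separated values
            rel = (a.get("rel") or "").lower().split()
            if "nofollow" not in rel:
                self.links.append((a["href"], a.get("target")))
        elif tag == "frame" and a.get("src"):
            self.links.append((a["src"], a.get("name")))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarise(html, filename):
    """Return (title, abstract, links) for one page of HTML text."""
    parser = PageInfo()
    parser.feed(html)
    parser.close()
    title = parser.title.strip() or filename  # fall back to the filename
    abstract = parser.fields.get("abstract") or parser.fields.get("description")
    return title, abstract, parser.links
```

For a page containing <A HREF=afile.htm rel="nofollow">, the link to afile.htm would not appear in the returned list.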

Adobe® Portable Document Format (PDF) *.pdf
 Indexer findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.

The indexers scan PDF files that usually have a file extension of .pdf.

See the PDF Scanning Support page for full details of this feature.

Password-protected files can be read by the indexers if an appropriate password is supplied.

  • Document information Document information is extracted from the "General Document Info".
Title Obtained from the Title field, if present. Otherwise set to the filename.
Abstract Obtained from the Subject field, if present. Otherwise, from the first words of the page.
Word priority Words are prioritised in searches in this order:
  • Subject and Keywords fields
  • Filename
  • all other text
• Field Information The following meta-data is found:
  • The title is stored as field "title"
  • The subject is reported as field "description"
  • The keywords are reported as field "keywords"
  • The author is reported as field "author"
• Links Links to other files are supported. Only links with a /Launch or /URI action are currently supported.
 Viewer program Your customers must have a suitable viewer program to display PDF files - usually Adobe Acrobat Reader. It is common practice to provide a link to Adobe to download the latest version. Or you can put the Acrobat Reader installation kit on your CD - check that the licence permits this.

Microsoft® Word (DOC) *.doc *.rtf
 Indexer findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.

findinsite-cd-wizard can scan any file type recognised by Microsoft Word 97, Word 2000, Word XP, Word 2003 or Word 2007. Word files usually have a file extension of .doc. Word can usually scan RTF files (*.rtf). You must have Word installed on your computer.

In a few instances, Word appears on screen while scanning, requesting some input from you. If you cannot provide this information, start the Windows Task Manager and end the Word application or the WINWORD process. findinsite-cd-wizard should then report an error and continue.

findex, findinsite-js and findinsite-ms can index Word 97 or later files, without requiring Word to be installed. RTF files are not supported.

Password-protected files can be read by findinsite-cd-wizard only, if an appropriate password is supplied.

  • Document information Document information is extracted from the "Document Properties" Summary tab.
Page title Obtained from the Title field, if present. Otherwise set to the filename.
Abstract Obtained from the Subject field, if present. Otherwise, from the first words of the page.
Word priority Words are prioritised in searches in this order:
  • Document title, subject and keywords
  • Filename
  • all other text
• Field Information The following meta-data is found:
  • The title is stored as field "title"
  • The subject is reported as field "description"
  • The keywords are reported as field "keywords"
• Links Hyperlinks and targets to other files are supported.
 Viewer program Your customers must have a suitable viewer program to display your Word document files. This could be Word 97, 2000, XP, 2003 or 2007. However, Word 95 or earlier would do if your documents are in a format that can be read by these versions.

A more suitable viewer is Word Viewer which can be obtained freely from Microsoft at http://office.microsoft.com/downloads/.

As a last resort, Windows WordPad seems to be able to open most .doc and .rtf files.

Apart from Word for Macintosh, no viewer programs are known for non-Windows platforms.

Microsoft® Word 2007 (DOCX) *.docx *.docm
 Indexer findinsite-cd-wizard, findinsite-ms, findinsite-js and findex.

findinsite-cd-wizard can index Word 2007 files provided you have Word 2007 installed. Password-protected files can be read if an appropriate password is supplied.

findinsite-ms, findinsite-js and findex can index Word 2007 files, without requiring Word to be installed. Note: a word will be split if it contains internal formatting. Encrypted (password-protected) and Restricted permission files are not supported.

  • Document information Document information is extracted from the "Document Properties" Summary tab.
Page title Obtained from the Title field, if present. Otherwise set to the filename.
Abstract Obtained from the Subject field, if present. Otherwise, from the first words of the page.
Word priority Words are prioritised in searches in this order:
  • Document title, subject and keywords
  • Filename
  • all other text
• Field Information The following meta-data is found:
  • The title is stored as field "title"
  • The subject is reported as field "description"
  • The creator is reported as field "author"
  • The keywords are reported as field "keywords"
  • The comments are reported as field "comments"
• Links Hyperlinks to other files are supported.
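
Word 2007 files are ZIP packages, which is why they can be read without Word installed. The sketch below reads the document properties and maps them to the field names listed above. It assumes the properties live in the conventional `docProps/core.xml` part and that the Comments property is stored as `dc:description`. The `core_fields` function is a hypothetical illustration, not the indexers' code.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Search field name -> element in docProps/core.xml
FIELD_ELEMENTS = {
    "title": "dc:title",
    "description": "dc:subject",
    "author": "dc:creator",
    "keywords": "cp:keywords",
    "comments": "dc:description",
}


def core_fields(docx_file):
    """Return the non-empty core properties of a Word 2007 file."""
    # docx_file may be a path or a binary file object
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    fields = {}
    for field, element in FIELD_ELEMENTS.items():
        node = root.find(element, NS)
        if node is not None and node.text:
            fields[field] = node.text
    return fields
```

A password-encrypted file is not stored as a plain ZIP package, so this sketch would raise `zipfile.BadZipFile` for it, which is consistent with encrypted files not being supported without Word.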

Microsoft® Excel (XLS) *.xls
 Indexer findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.

findinsite-cd-wizard can scan any file type recognised by Microsoft Excel 5, 97, 2000, XP, 2003 or 2007. Excel files usually have a file extension of .xls. You must have Excel installed on your computer.
Password-protected files can be read by findinsite-cd-wizard only, if an appropriate password is supplied.
Known bug: when scanning using Excel 5, a repeat scan hangs in Windows 2000. Ending the wowexec process in Task Manager resolves the problem.

findex, findinsite-js and findinsite-ms can index any XLS file without requiring Excel to be installed. Password-protected files are not supported. There are various minor limitations.

  • Document information Document information is extracted from the "Document Properties" Summary tab.
Page title Obtained from the Title field, if present. Otherwise set to the filename.
Abstract Obtained from the Subject field, if present. Otherwise, from the first words of the page.
Word priority Words are prioritised in searches in this order:
  • Document title, subject and keywords
  • Filename
  • all other text
• Field Information The following meta-data is found:
  • The title is stored as field "title"
  • The subject is reported as field "description"
  • The keywords are reported as field "keywords"
• Links Excel 97, 2000, XP, 2003 and 2007 hyperlinks and targets to other files are supported.
 Viewer program Your customers must have a suitable viewer program to display your Excel spreadsheet files. This could be Microsoft Excel 5, Excel 97, Excel 2000, Excel XP, Excel 2003, or Excel 2007.

A more suitable viewer is Excel Viewer which can be obtained freely from Microsoft at http://office.microsoft.com/downloads/.

Microsoft® Excel 2007 (XLSX) *.xlsx *.xlsm
 Indexer findinsite-cd-wizard, findinsite-ms, findinsite-js and findex.

findinsite-cd-wizard can index Excel 2007 files provided you have Excel 2007 installed. Password-protected files can be read if an appropriate password is supplied.

findinsite-ms, findinsite-js and findex can index Excel 2007 files, without requiring Excel to be installed. Encrypted (password-protected) and Restricted permission files are not supported.

  • Document information Document information is extracted from the "Document Properties" Summary tab.
Page title Obtained from the Title field, if present. Otherwise set to the filename.
Abstract Obtained from the Subject field, if present. Otherwise, from the first words of the page.
Word priority Words are prioritised in searches in this order:
  • Document title, subject and keywords
  • Filename
  • all other text
• Field Information The following meta-data is found:
  • The title is stored as field "title"
  • The subject is reported as field "description"
  • The creator is reported as field "author"
  • The keywords are reported as field "keywords"
  • The comments are reported as field "comments"
• Links Hyperlinks to other files are supported.

Microsoft® PowerPoint (PPT) *.ppt, *.pps
 Indexer findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.

findinsite-cd-wizard can scan any file type recognised by Microsoft PowerPoint 97, 2000, XP, 2003 or 2007. PowerPoint files usually have a file extension of .ppt or .pps. You must have PowerPoint installed on your computer.

findex, findinsite-js and findinsite-ms can index PowerPoint 97 or later files, without requiring PowerPoint to be installed.

Password-protected files cannot be read by any indexer.

  • Document information Document information is extracted from the "Document Properties" Summary tab.
Page title Obtained from the Title field, if present. Otherwise set to the filename.
Abstract Obtained from the Subject field, if present. Otherwise, from the first words of the page.
Word priority Words are prioritised in searches in this order:
  • Document title, subject and keywords
  • Filename
  • all other text
• Field Information The following meta-data is found:
  • The title is stored as field "title"
  • The subject is reported as field "description"
  • The keywords are reported as field "keywords"
• Links Hyperlinks and targets to other files are supported in findinsite-cd-wizard, but not findex or findinsite-js.
 Viewer program Your customers must have a suitable viewer program to display your PowerPoint slide show files. This could be Microsoft PowerPoint 97, 2000, XP, 2003 or 2007.

A more suitable viewer is PowerPoint Viewer which is included on your PowerPoint CD. The PowerPoint 2000 documentation says that this can be distributed freely. PowerPoint Viewer can also be obtained freely from Microsoft at http://office.microsoft.com/downloads/.

Microsoft® PowerPoint 2007 (PPTX) *.pptx *.pptm
 Indexer findinsite-cd-wizard, findinsite-ms, findinsite-js and findex.

findinsite-cd-wizard can index PowerPoint 2007 files provided you have PowerPoint 2007 installed. Password-protected files can be read if an appropriate password is supplied.

findinsite-ms, findinsite-js and findex can index PowerPoint 2007 files, without requiring PowerPoint to be installed. Encrypted (password-protected) and Restricted permission files are not supported.

  • Document information Document information is extracted from the "Document Properties" Summary tab.
Page title Obtained from the Title field, if present. Otherwise set to the filename.
Abstract Obtained from the Subject field, if present. Otherwise, from the first words of the page.
Word priority Words are prioritised in searches in this order:
  • Document title, subject and keywords
  • Filename
  • all other text
• Field Information The following meta-data is found:
  • The title is stored as field "title"
  • The subject is reported as field "description"
  • The creator is reported as field "author"
  • The keywords are reported as field "keywords"
  • The comments are reported as field "comments"
• Links Hyperlinks to other files are supported.

Microsoft® Publisher (PUB) *.pub
 Indexer findinsite-ms.
  • Document information Document information is extracted from the "Document Properties" Summary tab.
Page title Obtained from the Title field, if present. Otherwise set to the filename.
Abstract Obtained from the Subject field, if present. Otherwise, from the first words of the page.
Word priority Words are prioritised in searches in this order:
  • Document title, subject and keywords
  • Filename
  • all other text
• Field Information The following meta-data is found:
  • The title is stored as field "title"
  • The subject is reported as field "description"
  • The keywords are reported as field "keywords"
• Links Not supported.
 Viewer program Your customers must have a suitable viewer program to display your Publisher files. This is usually Microsoft Publisher.

Resource Description Framework (RDF) *.rdf
 Indexer findex using the phdccRDF plug-in.

RDF files only contain meta field data that describe other files. A single RDF file may contain meta descriptions for any number of other files.

Only a limited subset of RDF/XML is supported; see the RDF page for full details.

  • Field information RDF can define any field any number of times.
 Viewer program RDF files are not shown directly. They only contain meta field information about other files.

Non-word processed text (TXT) *.txt
 Indexer findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.

These indexers find words in any plain text file. The indexers detect whether the text is in Unicode (if the byte-order mark U+FEFF, stored as the bytes FF FE in little-endian files, is at the start) and decode it correctly. Otherwise, the single-byte characters are assumed to be in the computer's default code page. Other plain text encodings, eg UTF-8, are not supported.
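
The decoding rule above can be sketched as follows, assuming the marker is the standard UTF-16 byte-order mark. The `cp1252` default is only a stand-in for "the computer's default code page". The function names are hypothetical.

```python
import codecs

# Assumption: stands in for the computer's default code page
DEFAULT_CODE_PAGE = "cp1252"


def decode_text(raw, default=DEFAULT_CODE_PAGE):
    """Decode bytes as UTF-16 if they start with a byte-order mark."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the marker, picks the byte order
        # and drops the marker from the result.
        return raw.decode("utf-16")
    # No marker: treat the bytes as single-byte characters.
    return raw.decode(default, errors="replace")


def title_of(text, filename):
    """Use the first line as the title, falling back to the filename."""
    lines = text.splitlines()
    first = lines[0].strip() if lines else ""
    return first or filename
```

Under this rule a UTF-8 file without a marker is read byte by byte in the default code page, so its accented characters come out wrong.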

  • Document information
Page title Obtained from the first line of the file, if present. Otherwise set to the filename.
Word priority Words are prioritised in searches in this order:
  • Title (ie first line of file)
  • all other text
• Field Information The following meta-data is found:
  • The title (ie first line of file) is stored as field "title"
• Links Not supported
 Viewer program Your customers' browsers will display text files.

Images *.jpg *.jpeg *.tif *.tiff
 Indexer findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.

The indexers find meta-data text information including the Title, Subject, Keywords, Comments and Author fields stored by Windows Explorer in Windows XP. Note that numeric information such as camera exposure time is not currently stored.

findinsite-cd-wizard and findinsite-ms find meta-data text information stored in JPEG and TIFF images.
findex and findinsite-js find meta-data text information stored in JPEG images only, not TIFF files.

GDI+ must be installed for image indexing in findinsite-cd-wizard. This is installed by default in Windows XP. GDI+ is available as a 1MB Download from Microsoft for Windows 98, NT4, Me, 2000 and XP. Run the download EXE - click on Unzip, then copy the unzipped file gdiplus.dll to the findinsite-cd installation directory.
findinsite-ms also finds XMP (Extensible Metadata Platform) data in JPEG and TIFF files, eg as produced by Microsoft Windows Vista. The following meta-fields are found (with Vista usage in brackets):
  • dc:subject (Tags)
  • dc:title (Title)
  • dc:creator (Author)
  • dc:description (Subject)
  • dc:rights (Copyright)
  • tiff:artist (Author)
  • exif:UserComment (Comments)
  • xmp:Rating (Star rating number)
  • Document information
Page title If embedded in a web page, then the IMG tag ALT text is used as the title. If the meta-data defines a title then this is used instead.
Word priority All words are stored at the same priority, equivalent to document title etc.
• Field Information All text meta-data is stored.
• Links Not supported
 Viewer program Your customers' browsers will display JPEG files. An external viewer is usually required for TIFF files.
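
XMP data is embedded in an image file as an XML packet, so the dc: fields mentioned above can be located by scanning the file bytes. This is a rough sketch only and does not show how findinsite-ms works. It handles just the dc: properties written as rdf:li lists, ignores properties written as attributes, and reads only the first packet. `xmp_fields` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
DC = "{http://purl.org/dc/elements/1.1/}"
END_TAG = b"</x:xmpmeta>"


def xmp_fields(image_bytes):
    """Return {dc property name: [values]} from the first XMP packet."""
    start = image_bytes.find(b"<x:xmpmeta")
    end = image_bytes.find(END_TAG, start)
    if start < 0 or end < 0:
        return {}  # no XMP packet in this file
    root = ET.fromstring(image_bytes[start:end + len(END_TAG)])
    fields = {}
    for node in root.iter():
        if node.tag.startswith(DC):
            # Values are held in rdf:li items inside rdf:Alt, rdf:Bag or rdf:Seq
            values = [li.text for li in node.iter(RDF + "li") if li.text]
            if values:
                fields[node.tag[len(DC):]] = values
    return fields
```

Calling `xmp_fields(open("photo.jpg", "rb").read())` on a hypothetical photo.jpg tagged in Windows Vista would be expected to return entries such as "subject" (Tags) and "title".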

All site Copyright © 1996-2011 PHD Computer Consultants Ltd, PHDCC

Last modified: 6 November 2009.
