|
File Types and Viewers
Introduction
This page lists the file types supported by findinsite-cd,
findinsite-js, findinsite-ms and findex.
It also describes the viewer programs that are necessary
on your customers' computers. And there are some cautionary notes on filenames.
The table below summarises the supported file types. Click on the type acronym for a summary
further down this page.
Searching and Indexing
All the findinsite runtimes can search for hits
in any file type stored in a search database.
However, a search database must be made in advance by one of the indexing programs.
- The findinsite-cd-wizard Windows tool makes search databases for
findinsite-cd,
and supports HTML, PDF, DOC, XLS, PPT, TXT and Image file types.
- The findex Java application and findinsite-js
index HTML, PDF, DOC, XLS, PPT and TXT files,
and JPEG image meta-data.
- findinsite-ms indexes HTML, PDF, DOC, DOCX, XLS, XLSX, PUB, PPT and TXT files,
and JPEG and TIFF image meta-data.
- Finally, the phdccRDF Java plug-in for findex indexes RDF/XML files.
- The FindinSite.TextExtractor tool indexes
as per findinsite-ms.
| Type | Description | Standard extensions | Indexer support (findinsite-cd-wizard) | Indexer support (findex) | Indexer support (findinsite-js) | Indexer support (findinsite-ms, TextExtractor) | Viewer software | More information |
|---|---|---|---|---|---|---|---|---|
| HTML | Web page | *.htm *.html | YES | YES | YES | YES | Browser | findinsite-cd-wizard, findex, Character sets |
| PDF | Adobe® Portable Document Format | *.pdf | YES | YES | YES | YES | Adobe Acrobat Reader | PDF Support |
| DOC | Microsoft® Word document | *.doc *.rtf | YES # | YES $ | YES $ | YES $ | MS-Word, Word Viewer or WordPad | |
| DOCX | Microsoft® Word 2007 document | *.docx *.docm | NO | NO | NO | YES | MS-Word 2007, (?)Word Viewer | |
| XLS | Microsoft® Excel document | *.xls | YES ## | YES | YES | YES | MS-Excel | |
| XLSX | Microsoft® Excel 2007 spreadsheet | *.xlsx *.xlsm | NO | NO | NO | YES | MS-Excel 2007, (?)Excel Viewer | |
| PPT | Microsoft® PowerPoint presentation | *.ppt, *.pps | YES ### | YES | YES | YES | MS-PowerPoint | |
| PPTX | Microsoft® PowerPoint 2007 presentation | *.pptx, *.pptm | NO | NO | NO | YES | MS-PowerPoint | |
| PUB | Microsoft® Publisher publication | *.pub | NO | NO | NO | YES | MS-Publisher | |
| RDF/XML | Resource Description Framework | *.rdf | NO | YES $$ | NO | NO | Not applicable | RDF Support |
| TXT | Non-word processed text | *.txt | YES | YES | YES | YES | Browser | |
| JPEG/TIFF | Image meta-data | *.jpg, *.jpeg, *.tif, *.tiff | YES #### | YES $$$ | YES $$$ | YES | JPEG: browser | |
| Note | Meaning |
|---|---|
| # | findinsite-cd-wizard requires MS-Word 97/2000/XP/2003 to scan Word documents |
| ## | findinsite-cd-wizard requires MS-Excel 5/97/2000/XP/2003 to scan Excel documents |
| ### | findinsite-cd-wizard requires MS-PowerPoint 97/2000/XP/2003 to scan PowerPoint documents |
| #### | findinsite-cd-wizard requires GDI+ to index images |
| $ | Only Word 97 or later files are supported. Word is NOT needed. RTF files are NOT supported. |
| $$ | Using phdccRDF plug-in. |
| $$$ | findex and findinsite-js only index JPEG meta-data |
Be careful to check that your files can be viewed correctly on your customers' computers.
Check they have a suitable viewer program, together with all necessary plug-ins.
Check that your files only use links to information that is present.
For example, Word files may contain absolute links to files or images on your hard drive -
make sure that all pictures are saved in the document.
In Word, select menu Edit+Links to see what links a document has.
File type identification
Each file type has an acronym, eg HTML for web pages. Each filename's extension identifies the file's type.
For example, a filename of index.html has a filename extension of .html,
and a filename prefix of index. The file types are identified
by the file extension using an asterisk to indicate any filename prefix, eg *.html means any file
with a filename extension of .html. Web pages are normally identified as
*.htm *.html because both the .htm and .html filename extensions
are commonly used to indicate that the content is in HTML.
In findinsite-cd-wizard on the "Scan File Types" wizard page, you can specify the file extensions that should be
scanned in a particular way. For example, if you want to scan RTF files using the MS-Word scanner, then
put *.rtf in the MS-Word DOC edit box.
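As an illustration only, the extension matching described above can be sketched in Python. The patterns are copied from the table of supported file types on this page; the function name `identify_file_type` and the case-insensitive comparison are assumptions made for this sketch, not a description of how the indexers are implemented.

```python
from fnmatch import fnmatch

# Patterns copied from the table of supported file types on this page.
FILE_TYPE_PATTERNS = {
    "HTML": ["*.htm", "*.html"],
    "PDF": ["*.pdf"],
    "DOC": ["*.doc", "*.rtf"],
    "DOCX": ["*.docx", "*.docm"],
    "XLS": ["*.xls"],
    "XLSX": ["*.xlsx", "*.xlsm"],
    "PPT": ["*.ppt", "*.pps"],
    "PPTX": ["*.pptx", "*.pptm"],
    "PUB": ["*.pub"],
    "RDF/XML": ["*.rdf"],
    "TXT": ["*.txt"],
    "JPEG/TIFF": ["*.jpg", "*.jpeg", "*.tif", "*.tiff"],
}


def identify_file_type(filename):
    """Return the type acronym whose pattern matches filename, or None."""
    # Assumption for this sketch: extensions are compared case-insensitively.
    name = filename.lower()
    for acronym, patterns in FILE_TYPE_PATTERNS.items():
        # The asterisk stands for any filename prefix, as described above.
        if any(fnmatch(name, pattern) for pattern in patterns):
            return acronym
    return None
```

Moving a pattern such as `*.rtf` from one entry to another mirrors what the "Scan File Types" wizard page lets you do.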
Filenames for findinsite-cd
Please be careful about your filenames to ensure that findinsite-cd will be able to show the pages
on your customers' computers.
Filenames with spaces are not recommended, although they should work
(because findinsite-cd converts them to %20 in the URL).
Filenames with non-English characters are definitely not recommended because some browsers will
not display the files correctly. Microsoft Internet Explorer seems to cope with such filenames correctly.
However, the Netscape browsers do not; even an e acute (é) in a filename will mean that
the page cannot be shown.
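A small pre-flight check along these lines may help you spot risky filenames before you make a CD. This is a sketch, not part of findinsite-cd: the function names are hypothetical, and the percent-encoding uses Python's standard `urllib.parse.quote`, which may differ in detail from the conversion that findinsite-cd performs.

```python
from urllib.parse import quote


def filename_to_url_path(filename):
    """Percent-encode a filename for use in a URL (a space becomes %20)."""
    return quote(filename)


def risky_filename_reasons(filename):
    """Return the reasons, if any, why this page advises against a filename."""
    reasons = []
    if " " in filename:
        reasons.append("contains spaces")
    if not filename.isascii():
        reasons.append("contains non-English (non-ASCII) characters")
    return reasons
```

For example, `risky_filename_reasons("café menu.html")` reports both problems, while `risky_filename_reasons("index.html")` returns an empty list.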
In findinsite-cd-wizard, the scan directory or initial scan filename must be entered in the computer's
default local character set. However findinsite-cd-wizard correctly converts these to a safe portable form,
called Unicode.
On Windows NT, 2000 and XP, findinsite-cd-wizard will subsequently find all Unicode filenames correctly. For example, if you have
a page with a Japanese filename on an English system, then findinsite-cd-wizard will find it correctly.
However your customers may be running a different operating system and browser, so it is quite possible
that they are unable to view such pages.
Non-Latin Characters
In almost all cases, the findinsite and findex indexing programs
will find the correct characters in your documents, even if they
cannot be displayed correctly on the current computer.
For example, findinsite and findex correctly interpret most common web page character sets.
As another example, findinsite-cd-wizard extracts information from Microsoft Office documents
in the Unicode format.
Unicode has codes for the characters used in almost all languages.
This works even on Western systems that cannot normally display non-Latin characters,
such as Windows 98 and Me.
Document, field, target and link information
As well as reading the basic words in each type of file,
the findinsite and findex indexers find more information if it is available.
- For example, if an HTML web page specifies a "META description" then this is usually used as the page abstract.
- Document meta-data is stored by all indexers.
In addition, findex, findinsite-js and findinsite-ms store document meta-data for field searches.
- If findinsite or findex is following links, then it finds files to scan from the links in each file,
ie the hypertext links in web pages. If specified, the target window name is also stored.
The summary information below indicates the document, field and link information that are found for
each file type.
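To make the link-following idea concrete, here is a minimal Python sketch of link discovery in a web page. It follows the rules given in the web pages summary on this page (HREF/TARGET attributes, FRAME SRC/NAME, non-absolute URLs only, and rel="nofollow" links ignored). The class name `LinkCollector` is hypothetical, and treating any URL with a scheme or host as "absolute" is an assumption of this sketch; the real indexers may decide differently.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (url, target) pairs for links that would be followed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        attributes = dict(attrs)
        if tag == "frame":
            # FRAME tags carry the link in SRC and the window name in NAME.
            url, target = attributes.get("src"), attributes.get("name")
        else:
            # A, AREA and any other tag with HREF (and optionally TARGET).
            url, target = attributes.get("href"), attributes.get("target")
        if not url:
            return
        if "nofollow" in (attributes.get("rel") or "").lower().split():
            return  # rel="nofollow" links are not followed
        parsed = urlparse(url)
        if parsed.scheme or parsed.netloc:
            return  # assumption: a scheme or host marks the URL as absolute
        self.links.append((url, target))
```

Feeding a page to `LinkCollector().feed(html_text)` and reading `links` gives the files that a link-following scan would visit next, together with any target window names.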
Summary Information for each File Type
|
Web pages (HTML)
|
*.htm *.html and possibly *.asp *.php, etc
|
| Indexer |
findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.
- These indexers can scan HTML web pages, which
usually have file extensions of
.html or .htm.
- Dynamically generated files can be scanned as well, eg using file extensions
.aspx, .asp, .php, .jsp, provided you are indexing an online web site rather than source files.
- The indexers can also scan source files that are meant to be interpreted
first by server script engines, eg
*.aspx, *.asp, *.php, *.jsp, etc.
However, the indexers may not ignore the script text correctly.
- The indexers will find all the plain text words in an XML file, but do not interpret
any XML structure information, ie anything in tags is ignored unless it is HTML.
Each indexer has various options for scanning HTML files.
The findinsite-cd-wizard page
describes how various tags are handled by all the indexers, eg how the <DIV class=nospy> tag can be used
to indicate words that should not be stored.
See the Character sets page for details
of the character sets (used by different languages) supported by the indexers.
|
| |
Document information |
| Page title |
Obtained from the TITLE tag, if present. Otherwise set to the filename.
|
| Abstract |
Obtained from the "META Abstract" or "META Description" tags, if present.
Otherwise, from the first words of the page.
|
| Word priority |
Words are prioritised in searches in this order:
Document meta-data, including title, description and keywords
Filename
Headings H1..H6
all other text
|
See the Character sets page and
findinsite-cd-wizard page for details of other recognised tags.
|
| Field Information |
The following meta-data is found:
<META name=nnn content=xxx>
as field "nnn" containing "xxx"
<META http-equiv=content-language content=xxx> as field "lang" containing "xxx"
<BODY lang=xxx> as field "lang" containing "xxx"
<TITLE>xxx</TITLE> as field "title" containing "xxx"
<IMG alt=xxx> as field "img" containing "xxx"
|
| Links |
If following links, the indexers follow hypertext links to non-absolute URLs.
An option allows them to scan up the directory tree.
Hypertext links are normally A HREF=xx TARGET=yy tags. However,
FRAME SRC=xx NAME=yy tags and any other tags (such as the AREA tag) that
have HREF and TARGET attributes are also followed.
If a link has an attribute rel="nofollow", then the link is not followed,
eg this link <A HREF=afile.htm rel="nofollow"> is ignored.
|
| Viewer program |
Please remember that your customers' HTML viewer (their browser) may not be the same as yours,
so be careful in your choice of HTML features, plug-ins, etc.
Dynamically generated pages designed to run on a server
will usually not run correctly from a CD.
|
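The Field Information rules for web pages listed above can be sketched in Python as follows. This is an illustrative sketch built on the standard `html.parser` module, not the indexers' own code; the class name `FieldCollector` is hypothetical, and details such as how repeated fields are handled are assumptions.

```python
from html.parser import HTMLParser


class FieldCollector(HTMLParser):
    """Collect (field name, value) pairs from the tags listed above."""

    def __init__(self):
        super().__init__()
        self.fields = []
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        attributes = dict(attrs)
        if tag == "meta":
            content = attributes.get("content")
            if content is None:
                return
            if attributes.get("name"):
                # <META name=nnn content=xxx> becomes field "nnn"
                self.fields.append((attributes["name"], content))
            elif (attributes.get("http-equiv") or "").lower() == "content-language":
                self.fields.append(("lang", content))
        elif tag == "body" and attributes.get("lang"):
            self.fields.append(("lang", attributes["lang"]))
        elif tag == "img" and attributes.get("alt"):
            self.fields.append(("img", attributes["alt"]))
        elif tag == "title":
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.fields.append(("title", "".join(self._title_parts).strip()))
```

Running the collector over your own pages is a quick way to check which fields a field search is likely to see.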
|
Adobe® Portable Document Format (PDF)
|
*.pdf
|
| Indexer |
findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.
The indexers scan PDF files that usually have a file extension of .pdf.
See the PDF Scanning Support page for full details of this feature.
Password-protected files can be read by the indexers if an appropriate password is supplied.
|
| |
Document information |
Document information is extracted from the "General Document Info".
| Title |
Obtained from the Title field, if present. Otherwise set to the filename.
|
| Abstract |
Obtained from the Subject field, if present.
Otherwise, from the first words of the page.
|
| Word priority |
Words are prioritised in searches in this order:
Subject and Keywords fields
Filename
all other text
|
|
| Field Information |
The following meta-data is found:
The title is stored as field "title"
The subject is reported as field "description"
The keywords are reported as field "keywords"
The author is reported as field "author"
|
| Links |
Links to other files are supported.
Only links with a /Launch or /URI action are currently supported.
|
| Viewer program |
Your customers must have a suitable viewer program to display PDF files - usually Adobe Acrobat Reader.
It is common practice to provide a link to Adobe to download the latest version.
Or you can put the Acrobat Reader installation kit on your CD -
check that the licence permits this.
|
|
Microsoft® Word (DOC)
|
*.doc *.rtf
|
| Indexer |
findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.
findinsite-cd-wizard can scan any file type recognised by Microsoft Word 97, Word 2000, Word XP or Word 2003.
Word files usually have a file extension of .doc. Word can usually
scan RTF files (*.rtf).
You must have Word installed on your computer.
In a few instances, Word appears on screen while scanning, requesting some input from you.
If you cannot provide this information, start the Windows Task Manager and end the Word
application or the WINWORD process. findinsite-cd-wizard should then report an error and continue.
findex, findinsite-js and findinsite-ms can index Word 97 or later files, without
requiring Word to be installed. RTF files are not supported.
Password-protected files can be read by findinsite-cd-wizard only, if an appropriate password is supplied.
|
| |
Document information |
Document information is extracted from the "Document Properties" Summary tab.
| Page title |
Obtained from the Title field, if present. Otherwise set to the filename.
|
| Abstract |
Obtained from the Subject field, if present.
Otherwise, from the first words of the page.
|
| Word priority |
Words are prioritised in searches in this order:
Document title, subject and keywords
Filename
all other text
|
|
| Field Information |
The following meta-data is found:
The title is stored as field "title"
The subject is reported as field "description"
The keywords are reported as field "keywords"
|
| Links |
Hyperlinks and targets to other files are supported.
|
| Viewer program |
Your customers must have a suitable viewer program to display your Word document files.
This could be Word 97, 2000, XP or 2003. However, Word 95 or earlier would do
if your documents are in a format that can be read by these versions.
A more suitable viewer is Word Viewer which can be obtained freely from Microsoft at
http://office.microsoft.com/downloads/.
As a last resort, Windows WordPad seems to be able to open most .doc and .rtf files.
Apart from Word for Macintosh, no viewer programs are known for non-Windows platforms.
|
|
Microsoft® Word 2007 (DOCX)
|
*.docx *.docm
|
| Indexer |
findinsite-ms.
findinsite-ms can index Word 2007 files, without
requiring Word to be installed. Note: a word will be split if it contains internal formatting.
Encrypted (password-protected) and Restricted permission files are not supported.
|
| |
Document information |
Document information is extracted from the "Document Properties" Summary tab.
| Page title |
Obtained from the Title field, if present. Otherwise set to the filename.
|
| Abstract |
Obtained from the Subject field, if present.
Otherwise, from the first words of the page.
|
| Word priority |
Words are prioritised in searches in this order:
Document title, subject and keywords
Filename
all other text
|
|
| Field Information |
The following meta-data is found:
The title is stored as field "title"
The subject is reported as field "description"
The creator is reported as field "author"
The keywords are reported as field "keywords"
The comments are reported as field "comments"
|
| Links |
Hyperlinks to other files are supported.
|
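If you want to check which properties a Word 2007 file carries before indexing it, the following Python sketch reads them directly. It relies on the fact that a .docx file is a ZIP package whose core properties are conventionally stored in `docProps/core.xml`; that fixed path is an assumption of this sketch (strictly, the package relationships determine it). The mapping to field names follows the Field Information list above, and the function name `read_core_fields` is hypothetical. It is not a description of how findinsite-ms works internally.

```python
import zipfile
import xml.etree.ElementTree as ET

NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Core property element -> field name, following the list above.
# In core.xml the "Comments" property is stored as dc:description.
PROPERTY_TO_FIELD = {
    "dc:title": "title",
    "dc:subject": "description",
    "dc:creator": "author",
    "cp:keywords": "keywords",
    "dc:description": "comments",
}


def read_core_fields(docx_file):
    """Return a dict of field name -> text for the properties that are set."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    fields = {}
    for prop, field in PROPERTY_TO_FIELD.items():
        element = root.find(prop, NAMESPACES)
        if element is not None and element.text:
            fields[field] = element.text
    return fields
```

The same approach should apply to .xlsx and .pptx files, which use the same package format.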
|
Microsoft® Excel (XLS)
|
*.xls
|
| Indexer |
findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.
findinsite-cd-wizard can scan any file type recognised by Microsoft Excel 5, 97, 2000, XP or 2003.
Excel files usually have a file extension of .xls.
You must have Excel installed on your computer.
Password-protected files can be read by findinsite-cd-wizard only if an appropriate password is supplied.
Known bug:
When scanning using Excel 5, a repeat scan hangs in Windows 2000.
Ending the wowexec process in Task Manager resolves the problem.
findex, findinsite-js and findinsite-ms
can index any XLS file without requiring Excel to be installed.
Password-protected files are not supported. There are various minor limitations.
|
| |
Document information |
Document information is extracted from the "Document Properties" Summary tab.
| Page title |
Obtained from the Title field, if present. Otherwise set to the filename.
|
| Abstract |
Obtained from the Subject field, if present.
Otherwise, from the first words of the page.
|
| Word priority |
Words are prioritised in searches in this order:
Document title, subject and keywords
Filename
all other text
|
|
| Field Information |
The following meta-data is found:
The title is stored as field "title"
The subject is reported as field "description"
The keywords are reported as field "keywords"
|
| Links |
Excel 97, 2000, XP and 2003 hyperlinks and targets to other files are supported.
|
| Viewer program |
Your customers must have a suitable viewer program to display your Excel spreadsheet files.
This could be Microsoft Excel 5, Excel 97, Excel 2000, Excel XP or Excel 2003.
A more suitable viewer is Excel Viewer which can be obtained freely from Microsoft at
http://office.microsoft.com/downloads/.
|
|
Microsoft® Excel 2007 (XLSX)
|
*.xlsx *.xlsm
|
| Indexer |
findinsite-ms.
findinsite-ms can index Excel 2007 files, without
requiring Excel to be installed.
Encrypted (password-protected) and Restricted permission files are not supported.
|
| |
Document information |
Document information is extracted from the "Document Properties" Summary tab.
| Page title |
Obtained from the Title field, if present. Otherwise set to the filename.
|
| Abstract |
Obtained from the Subject field, if present.
Otherwise, from the first words of the page.
|
| Word priority |
Words are prioritised in searches in this order:
Document title, subject and keywords
Filename
all other text
|
|
| Field Information |
The following meta-data is found:
The title is stored as field "title"
The subject is reported as field "description"
The creator is reported as field "author"
The keywords are reported as field "keywords"
The comments are reported as field "comments"
|
| Links |
Hyperlinks to other files are supported.
|
|
Microsoft® PowerPoint (PPT)
|
*.ppt, *.pps
|
| Indexer |
findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.
findinsite-cd-wizard can scan any file type recognised by Microsoft PowerPoint 97, 2000, XP or 2003.
PowerPoint files usually have a file extension of .ppt or .pps.
You must have PowerPoint installed on your computer.
findex, findinsite-js and findinsite-ms can index PowerPoint 97 or later files, without
requiring PowerPoint to be installed.
Password-protected files cannot be read by any indexer.
|
| |
Document information |
Document information is extracted from the "Document Properties" Summary tab.
| Page title |
Obtained from the Title field, if present. Otherwise set to the filename.
|
| Abstract |
Obtained from the Subject field, if present.
Otherwise, from the first words of the page.
|
| Word priority |
Words are prioritised in searches in this order:
Document title, subject and keywords
Filename
all other text
|
|
| Field Information |
The following meta-data is found:
The title is stored as field "title"
The subject is reported as field "description"
The keywords are reported as field "keywords"
|
| Links |
Hyperlinks and targets to other files are supported in findinsite-cd-wizard, but not findex or findinsite-js.
|
| Viewer program |
Your customers must have a suitable viewer program to display your PowerPoint slide show files.
This could be Microsoft PowerPoint 97, 2000, XP or 2003.
A more suitable viewer is PowerPoint Viewer which is included on your PowerPoint
CD. The PowerPoint 2000 documentation says that this can be distributed freely.
PowerPoint Viewer can also be obtained freely from Microsoft at
http://office.microsoft.com/downloads/.
|
|
Microsoft® PowerPoint 2007 (PPTX)
|
*.pptx *.pptm
|
| Indexer |
findinsite-ms.
findinsite-ms can index PowerPoint 2007 files, without
requiring PowerPoint to be installed.
Encrypted (password-protected) and Restricted permission files are not supported.
|
| |
Document information |
Document information is extracted from the "Document Properties" Summary tab.
| Page title |
Obtained from the Title field, if present. Otherwise set to the filename.
|
| Abstract |
Obtained from the Subject field, if present.
Otherwise, from the first words of the page.
|
| Word priority |
Words are prioritised in searches in this order:
Document title, subject and keywords
Filename
all other text
|
|
| Field Information |
The following meta-data is found:
The title is stored as field "title"
The subject is reported as field "description"
The creator is reported as field "author"
The keywords are reported as field "keywords"
The comments are reported as field "comments"
|
| Links |
Hyperlinks to other files are supported.
|
|
Microsoft® Publisher (PUB)
|
*.pub
|
| Indexer |
findinsite-ms.
|
| |
Document information |
Document information is extracted from the "Document Properties" Summary tab.
| Page title |
Obtained from the Title field, if present. Otherwise set to the filename.
|
| Abstract |
Obtained from the Subject field, if present.
Otherwise, from the first words of the page.
|
| Word priority |
Words are prioritised in searches in this order:
Document title, subject and keywords
Filename
all other text
|
|
| Field Information |
The following meta-data is found:
The title is stored as field "title"
The subject is reported as field "description"
The keywords are reported as field "keywords"
|
| Links |
Not supported.
|
| Viewer program |
Your customers must have a suitable viewer program to display your Publisher files.
This is usually Microsoft Publisher.
|
|
Resource Description Framework (RDF)
|
*.rdf
|
| Indexer |
findex using the phdccRDF plug-in.
RDF files only contain meta field data that describe other files.
A single RDF file may contain meta descriptions for any number
of other files.
Only a limited subset of RDF/XML is supported;
see the RDF page for full details.
|
| |
Field information |
RDF can define any field any number of times.
|
| Viewer program |
RDF files are not shown directly. They only contain meta field information
about other files.
|
|
Non-word processed text (TXT)
|
*.txt
|
| Indexer |
findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.
These indexers find words in any plain text file.
The indexers detect if the text is
in Unicode (if the U+FEFF byte order mark is at the start) and decode it correctly.
Otherwise the single-byte characters are assumed to be in the computer's default
code page.
Other plain text formats (eg UTF-8) are not supported.
|
| |
Document information |
| Page title |
Obtained from the first line of the file, if present. Otherwise set to the filename.
|
| Word priority |
Words are prioritised in searches in this order:
Title (ie first line of file)
all other text
|
|
| Field Information |
The following meta-data is found:
The title (ie first line of file) is stored as field "title"
|
| Links |
Not supported
|
| Viewer program |
Your customers' browsers will display text files.
|
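The plain text handling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the indexers' code: "Unicode" is taken to mean UTF-16 with a byte order mark at the start of the file, the computer's default code page is approximated by `locale.getpreferredencoding`, and the function names are hypothetical.

```python
import codecs
import locale


def decode_text_file(raw):
    """Decode the bytes of a plain text file as described above."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the byte order mark, picks the matching
        # byte order and removes the mark from the result.
        return raw.decode("utf-16")
    # No marker: assume single-byte text in the default code page.
    return raw.decode(locale.getpreferredencoding(False), errors="replace")


def title_of(text, filename):
    """Use the first line of the file as the title, else the filename."""
    first_line = text.splitlines()[0].strip() if text.strip() else ""
    return first_line or filename
```

A UTF-8 file without a marker would be decoded with the default code page here, which matches the limitation noted above.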
|
Images
|
*.jpg *.jpeg *.tif *.tiff
|
| Indexer |
findinsite-cd-wizard, findex, findinsite-js and findinsite-ms.
The indexers find meta-data text information
including the Title, Subject, Keywords, Comments and Author fields stored by Windows Explorer in Windows XP.
Note that numeric information such as camera exposure time is not stored currently.
findinsite-cd-wizard and findinsite-ms find meta-data text information stored in JPEG and TIFF images.
findex and findinsite-js find meta-data text information stored in JPEG images only, not TIFF files.
GDI+ must be installed for image indexing in findinsite-cd-wizard.
This is installed by default in Windows XP.
GDI+ is available as a 1MB Download from Microsoft
for Windows 98, NT4, Me, 2000 and XP. Run the download EXE - click on Unzip, then copy the unzipped file
gdiplus.dll to the findinsite-cd installation directory.
findinsite-ms also finds XMP (Extensible Metadata Platform) data in JPEG
and TIFF files, eg as produced by Microsoft Windows Vista. The following meta-fields are found
(with Vista usage in brackets):
• dc:subject (Tags)
• dc:title (Title)
• dc:creator (Author)
• dc:description (Subject)
• dc:rights (Copyright)
• tiff:artist (Author)
• exif:UserComment (Comments)
• xmp:Rating (Star rating number)
|
| |
Document information |
| Page title |
If embedded in a web page, then the IMG tag ALT text is used as the title.
If the meta-data defines a title then this is used instead.
|
| Word priority |
All words are stored at the same priority, equivalent to document title etc.
|
|
| Field Information |
All text meta-data is stored.
|
| Links |
Not supported
|
| Viewer program |
Your customers' browsers will display JPEG files. An external viewer is usually required for TIFF files.
|
|