PDF Scanning Support
Introduction
FindinSite-CD-Wizard, Findex, FindinSite-JS and FindinSite-MS can index PDF files.
(For FindinSite-CD-Wizard, this indexing is done by a separate PDF Scanner module.)
If one of your customers searches for text and finds it in a PDF file, the browser
attempts to display that file. If the customer has a suitable viewer installed,
e.g. Acrobat Reader, the viewer is run. Acrobat Reader usually displays the PDF file within
the browser window, but may instead show it in a separate window.
Search words are not highlighted in the displayed PDF file.
PDF Scanning
The File types page has a Summary of PDF Scanning features.
The PDF indexers read the words in PDF files so that they can be added to the FindinSite
search database.
PDF allows the characters seen in a PDF display program to be specified in any order.
It is therefore possible that the PDF indexers will not see the words in the intended
order, or even that the characters of a word will not be kept together. In practice,
however, words are usually complete and in the correct order.
Words in text annotations are added to the search database.
Links to other files are supported, but currently only links with a /Launch or /URI action.
Only the Flate and ASCII85 stream filters are supported.
The old LZW compression filter is not supported.
Streams in separate files are not supported.
PDF 1.5 hybrid-reference files (as generated by Acrobat 6) are supported; however, object streams and xref streams
are not currently supported.
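As a rough illustration of the two supported stream filters, the Python sketch below passes stream data through ASCII85 and then Flate decoding using only the standard library. The function name decode_stream and the filter-name strings it accepts are assumptions made for this example; it is not the FindinSite implementation.

```python
import base64
import zlib
from typing import Iterable


def decode_stream(data: bytes, filters: Iterable[str]) -> bytes:
    """Apply PDF stream filters in order (hypothetical helper).

    Only the two filters named in the text are handled; anything else,
    including the old LZW filter, is rejected.
    """
    for name in filters:
        if name == "ASCII85Decode":
            # PDF ASCII85 data ends with "~>" but has no "<~" prefix,
            # so strip the end marker and decode without Adobe framing.
            body = data.strip()
            if body.endswith(b"~>"):
                body = body[:-2]
            data = base64.a85decode(body)
        elif name == "FlateDecode":
            # Flate is the zlib/deflate format.
            data = zlib.decompress(data)
        else:
            raise ValueError(f"unsupported stream filter: {name}")
    return data
```

A stream declared with both filters would be decoded by calling decode_stream(data, ["ASCII85Decode", "FlateDecode"]); a stream using LZWDecode would raise an error rather than yield text.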
Password-protected files
Security/Password-protected (encrypted) files are supported in the FindinSite-CD-Wizard
PDF Scanner if your Windows operating system permits.
Please specify any open (user) or master (security or owner) passwords on the
Scan options wizard page of FindinSite-CD-Wizard.
If you are outside the USA then Password-protected files may not be supported because
Microsoft restricts the key length supported by its CryptoAPI to 40 or 56 bits.
It looks as though Windows Me may be OK.
Windows XP should support 128 bit encryption for most of the world.
Microsoft has a free update to 128 bits for most of the world for Windows 2000:
http://www.microsoft.com/windows2000/downloads/recommended/encryption/default.asp.
This link updates Internet Explorer to 128 bits:
http://www.microsoft.com/windows/ie/download/128bit/intro.htm.
Also see http://www.microsoft.com/security/
and http://www.microsoft.com/exporting/.
PDF 1.4 files that use the "Adobe Standard Security" 40-bit and 128-bit encryption options are supported.
Such PDF files generated by Acrobat 5+ are therefore supported. Remember that users will
need Acrobat Reader 5+ to read such files.
Document Information
The PDF Document Information is inspected by the PDF indexers. The Title is used as the
page title, the Subject is used as the META Description and the Keywords are used
as the META Keywords.
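The mapping above can be pictured as a simple lookup. This Python sketch assumes the Document Information has already been read into a dictionary keyed by the PDF entry names; the function name, the output keys and the empty-string default for missing entries are hypothetical choices for this example.

```python
def info_to_page_metadata(info: dict) -> dict:
    """Map PDF Document Information entries to page metadata (illustrative)."""
    return {
        "title": info.get("Title", ""),          # used as the page title
        "description": info.get("Subject", ""),  # used as the META Description
        "keywords": info.get("Keywords", ""),    # used as the META Keywords
    }
```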
Character Encodings
PDF specifies several ways in which characters are encoded in PDF files. The
PDF indexers only support some of these methods.
- The following PDF named Font Encodings are supported: PDFDocEncoding, StandardEncoding,
MacRomanEncoding, WinAnsiEncoding and MacExpert.
- Fonts with single byte ToUnicode mappings are supported.
Identity-H and Identity-V double byte encodings with ToUnicode mappings are also supported.
- Font Encodings with Differences from a base encoding are supported. If a BaseEncoding is
given, it may be any of the named encodings listed above. The following
glyph names are supported in the Differences array.
Glyph names:
- All glyph names in the Adobe® glyphlist.txt and zapfdingbats.txt are supported
- uni<CODE> glyph names are supported, e.g. "/uni20AC" represents Unicode character U+20AC, the Euro sign
- Glyph names in decimal are supported, e.g. "/162"
- Characters for glyph names in a /Font /CharProcs dictionary entry are ignored
- Characters for glyph names in a /Font /FontDescriptor /CharSet entry are ignored
- If the glyphlist.txt contains duplicate codes for a single glyph name, the duplicates are not used
- Ligature glyph names with underscore characters are not currently supported
- /uni<CODE> glyph names with more than one <CODE> are not currently supported
Glyph name code translations:
- ".xxx" variant descriptors in glyph names are silently removed, e.g. "A.swash" becomes "A"
- "small", "oldstyle", "inferior" and "superior" are silently removed from the end of glyph names, e.g. "Asmall" becomes "A"
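The glyph name rules above can be sketched in Python as follows. This is an illustration only: GLYPH_LIST is a tiny stand-in for the Adobe glyphlist.txt, the function name is hypothetical, and both the order of the checks and the reading of a decimal name as a Unicode code point are assumptions, since the text does not specify them.

```python
import re

# Tiny stand-in for Adobe's glyphlist.txt (assumption: a real table is far larger).
GLYPH_LIST = {"A": 0x0041, "Euro": 0x20AC, "space": 0x0020}

# Style suffixes that are silently removed from the end of a glyph name.
SUFFIXES = ("small", "oldstyle", "inferior", "superior")


def glyph_to_char(name: str):
    """Return the Unicode character for a glyph name, or None if unsupported."""
    name = name.lstrip("/")
    # Ligature names joined by underscores are not supported.
    if "_" in name:
        return None
    # "uniXXXX" with exactly one four-digit hex code; longer forms fall through.
    match = re.fullmatch(r"uni([0-9A-Fa-f]{4})", name)
    if match:
        return chr(int(match.group(1), 16))
    # Decimal names such as "162", read here as a Unicode code point.
    if re.fullmatch(r"[0-9]+", name):
        code = int(name)
        return chr(code) if code <= 0x10FFFF else None
    # Drop a ".xxx" variant descriptor: "A.swash" -> "A".
    base = name.split(".", 1)[0]
    # Drop one trailing style suffix: "Asmall" -> "A".
    for suffix in SUFFIXES:
        if base.endswith(suffix) and len(base) > len(suffix):
            base = base[: -len(suffix)]
            break
    code = GLYPH_LIST.get(base)
    return chr(code) if code is not None else None
```

With this sketch, "/uni20AC" yields the Euro sign, "A.swash" and "Asmall" both yield "A", and a ligature name such as "f_i" yields nothing.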
If the Font has no named encoding, or the Encoding Differences has no BaseEncoding, then a default
encoding is used that maps each single-byte code directly into double-byte Unicode format.
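One plausible reading of this default encoding is that each byte value becomes the Unicode code point with the same number. The Python sketch below shows that interpretation; the function name is hypothetical and the exact behaviour of the PDF indexers may differ.

```python
def decode_default(raw: bytes) -> str:
    """Map each single-byte code n directly to Unicode code point U+00nn (assumed)."""
    return "".join(chr(byte) for byte in raw)
```

Under this assumption the result is the same as decoding the bytes as ISO 8859-1 (Latin-1).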
Bugs
Possible improvements
- Get Outline text and report as headings
- Harder: decode font information further to reduce glyph and code errors.