PDF Scanning Support
FindinSite-CD-Wizard, Findex, FindinSite-JS and FindinSite-MS can index PDF files.
(For FindinSite-CD-Wizard, this indexing is done by a separate PDF Scanner module.)
If one of your customers searches for text and finds it in a PDF file then the browser
attempts to display the PDF file. If the customer has a suitable viewer installed,
eg Adobe Reader, then it is run. Adobe Reader usually displays the PDF file within
the browser window, but may instead show it in a separate window.
Search words are not highlighted.
FindinSite-MS 1.65+ moves to the page that contains the first instance of any search word
(if in pages 1 to 31).
The File types page has a Summary of PDF Scanning features.
The PDF indexers read the words in PDF files so that they can be added to the FindinSite search database.
PDF allows the characters that are seen in a PDF display program to be specified in any order.
It is therefore possible that the PDF indexers will not see the words in the intended
order, or even that the characters of a word will not be kept together. In practice,
however, words are usually complete and in the correct order.
Words in the text annotations are added to the search database.
Links to other files are supported. Only links with a /Launch or /URI action are currently supported.
Only Flate and ASCII85 stream filters are supported.
FindinSite-CD-Wizard and FindinSite-MS support the Flate filter /DecodeParms /Predictor 12/Up.
The old LZW compression filter is not supported.
Streams in separate files are not supported.
PDF 1.5+ hybrid-reference files are supported, though /XRefStm is not used.
PDF 1.5+ pure cross-reference streams and object streams are supported by FindinSite-CD-Wizard and FindinSite-MS.
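As an illustration of the Flate filter with /DecodeParms /Predictor 12 (PNG "Up" prediction), here is a minimal Python sketch. It is not the FindinSite code. The function name decode_flate_png_up is hypothetical, and the sketch assumes one byte per column (/Colors 1, /BitsPerComponent 8) and that every row carries a filter tag of 0 (None) or 2 (Up).

```python
import zlib


def decode_flate_png_up(data, columns):
    """Inflate a stream, then undo PNG 'Up' prediction row by row.

    Assumption: one byte per column; each row starts with a one-byte
    filter tag (0 = None, 2 = Up), as PNG predictors require.
    """
    raw = zlib.decompress(data)
    row_len = columns + 1  # tag byte plus the row data
    if len(raw) % row_len != 0:
        raise ValueError("stream length is not a whole number of rows")

    prev = bytearray(columns)  # the row above the first row is all zeros
    out = bytearray()
    for start in range(0, len(raw), row_len):
        tag = raw[start]
        row = bytearray(raw[start + 1:start + row_len])
        if tag == 2:
            # Up: each byte was stored as the difference from the byte above
            for i in range(columns):
                row[i] = (row[i] + prev[i]) & 0xFF
        elif tag != 0:
            raise ValueError("unsupported PNG filter tag %d" % tag)
        out += row
        prev = row
    return bytes(out)
```

For example, two rows 1 2 3 and 4 5 6 are stored as the tagged rows 2 1 2 3 and 2 3 3 3 before compression, and the function restores the original six bytes.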
Security/Password-protected (encrypted) files are supported in the FindinSite-CD-Wizard
PDF Scanner if your Windows operating system permits.
Please specify any open (user) or master (security or owner) passwords on the
Scan options wizard page of FindinSite-CD-Wizard.
If you are outside the USA then Password-protected files may not be supported because
Microsoft restricts the key length supported by its CryptoAPI to 40 or 56 bits.
It looks as though Windows Me may be OK.
Windows XP+ should support 128 bit encryption for most of the world.
Microsoft has a free update to 128 bits for most of the world for Windows 2000:
This link updates Internet Explorer to 128 bits:
Also see http://www.microsoft.com/security/
PDF 1.4 files that use the "Adobe Standard Security" 40-bit and 128-bit encryption options are supported.
Therefore such PDF files generated by Adobe Acrobat 5+ are supported. Remember that users will
need Adobe Reader 5+ to read such files.
The PDF Document Information is inspected by the PDF indexers. The Title is used as the
page title, the Subject is used as the META Description and the Keywords are used
as the META Keywords.
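The mapping above can be pictured with a short Python sketch. The function name pdf_info_to_page_fields and the output field names are hypothetical, and the Document Information is assumed to have already been read into a plain dictionary.

```python
def pdf_info_to_page_fields(info):
    """Map PDF Document Information entries to page-style fields.

    Assumption: `info` is a dict of already-decoded strings; missing
    entries simply become empty strings.
    """
    return {
        "title": info.get("Title", ""),               # used as the page title
        "meta_description": info.get("Subject", ""),  # used as META Description
        "meta_keywords": info.get("Keywords", ""),    # used as META Keywords
    }
```

For instance, a PDF whose Document Information holds only a Title and a Subject would produce a page title and a META Description, with empty META Keywords.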
PDF specifies several ways in which characters are encoded in PDF files. The
PDF indexers only support some of these methods.
- The following PDF named Font Encodings are supported: PDFDocEncoding, StandardEncoding,
MacRomanEncoding, WinAnsiEncoding and MacExpertEncoding.
- Fonts with single byte ToUnicode mappings are supported.
Identity-H and Identity-V double byte encodings with ToUnicode mappings are also supported.
- Font Encodings with Differences from a base encoding are supported. If a BaseEncoding is
given, it may be any of the named encodings listed above. The glyph names that are
supported in the Differences array are listed in the glyph name code translations that follow.
If the Font has no named encoding, or the Encoding Differences has no BaseEncoding, then a default
encoding is used, mapping each single byte code directly into double byte Unicode format.
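To show how a Differences array modifies a base encoding, here is a minimal Python sketch. It is an illustration only. The function name apply_differences is hypothetical, the array is assumed to be already parsed into Python integers and glyph name strings, and unknown glyph names are assumed to leave the base mapping unchanged. In a Differences array each integer sets the current code, and each glyph name that follows is assigned to the current code, which then increases by one.

```python
def apply_differences(differences, glyph_to_char, base=None):
    """Return a code -> character table after applying a Differences array.

    Assumptions: `differences` mixes ints and glyph name strings;
    `glyph_to_char` is a caller-supplied glyph name lookup; with no base
    encoding each single byte code maps directly to the same Unicode value.
    """
    if base is None:
        base = {code: chr(code) for code in range(256)}  # default encoding
    table = dict(base)

    code = None
    for item in differences:
        if isinstance(item, int):
            code = item  # start of a new run of codes
            continue
        if code is None:
            raise ValueError("glyph name appears before any code")
        if item in glyph_to_char:
            table[code] = glyph_to_char[item]
        code += 1  # the next name applies to the next code
    return table
```

With the array 65 Euro bullet 200 A, codes 65 and 66 become the Euro sign and the bullet, code 200 becomes "A", and all other codes keep their default mapping.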
Glyph name code translations:
- All glyph names in the Adobe® zapfdingbats.txt are supported
- uni<CODE> glyph names are supported, eg "/uni20AC" represents Unicode character U+20AC, the Euro sign
- Glyph names in decimal are supported, eg "/162".
- Characters for glyph names in a /Font /CharProcs dictionary entry are ignored
- Characters for glyph names in a /Font /FontDescriptor /CharSet entry are ignored
- If the glyphlist.txt contains duplicate codes for a single glyph name, the duplicates are not used.
- Ligature glyph names with underscore characters are not currently supported
- /uni<CODE> glyph names with more than one <CODE> are not currently supported
- ".xxx" variant descriptors in glyph names are silently removed eg "A.swash" becomes "A"
- "small", "oldstyle", "inferior" and "superior" are silently removed from the end of glyph names, eg "Asmall" becomes "A"
- Get Outline text and report as headings
- Harder: decode font information further to reduce glyph and code errors.
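The glyph name code translation rules listed earlier can be sketched in Python as follows. This is a rough illustration, not the indexers' actual code. The function name glyph_to_text and the tiny GLYPHS table (a stand-in for glyphlist.txt and zapfdingbats.txt) are hypothetical, decimal names are assumed to be Unicode values, and the order in which the rules are tried is also an assumption.

```python
import re

# Hypothetical stand-in for the Adobe glyph lists
GLYPHS = {"A": "A", "Euro": "\u20ac", "one": "1"}

SUFFIXES = ("small", "oldstyle", "inferior", "superior")


def glyph_to_text(name, glyphs=GLYPHS):
    """Translate a glyph name to text, or return None if unsupported."""
    name = name.lstrip("/")
    name = name.split(".", 1)[0]  # drop ".xxx" variant, "A.swash" -> "A"
    if not name or "_" in name:
        return None  # empty names and underscore ligatures are not handled

    if re.fullmatch(r"[0-9]+", name):
        code = int(name)  # decimal glyph name such as "162"
        return chr(code) if code <= 0x10FFFF else None

    match = re.fullmatch(r"uni([0-9A-Fa-f]{4})", name)
    if match:
        return chr(int(match.group(1), 16))  # a single <CODE> only

    if name in glyphs:
        return glyphs[name]

    for suffix in SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            return glyphs.get(name[:-len(suffix)])  # "Asmall" -> "A"
    return None
```

Under these assumptions "/uni20AC" gives the Euro sign, "A.swash" and "Asmall" both give "A", and a name such as "f_i" or "uni20AC0308" gives no result.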