FindinSite-CD: Search engine for CD/DVD
 

 

PDF Scanning Support


Introduction

FindinSite-CD-Wizard, Findex, FindinSite-JS and FindinSite-MS can index PDF files. (For FindinSite-CD-Wizard, this indexing is done by a separate PDF Scanner module.)

If one of your customers searches for text and finds it in a PDF file, the browser attempts to display the PDF file. If the customer has a suitable viewer installed, e.g. Adobe Reader, it is launched. Adobe Reader usually displays the PDF file within the browser window, but may instead show it in a separate window. Search words are not highlighted. FindinSite-MS 1.65+ moves to the page that contains the first instance of any search word (provided it is within pages 1 to 31).

PDF Scanning

The File types page has a Summary of PDF Scanning features.

The PDF indexers read the words in PDF files so that they can be added to the FindinSite search database.

PDF allows the characters that are seen in a PDF display program to be specified in any order. It is therefore possible that the PDF indexers will not see the words in the intended order, or even that the characters of a word will not be kept together. In practice, however, words are usually complete and in the correct order.
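The ordering problem can be illustrated with a small sketch. The fragment list and its coordinates are hypothetical, and the sorting step is shown only to make the difference visible; it is not a description of what the PDF indexers do.

```python
# Each fragment is (x, y, text), listed in the order it appears in the
# page content stream. The coordinates are made up for illustration;
# in PDF the y axis grows upwards.
fragments = [
    (120.0, 700.0, "world"),
    (72.0, 700.0, "Hello"),
    (72.0, 680.0, "Second"),
    (110.0, 680.0, "line"),
]


def stream_order_text(frags):
    """Join the fragments exactly as they occur in the file."""
    return " ".join(text for _, _, text in frags)


def visual_order_text(frags):
    """Join the fragments top to bottom, then left to right."""
    ordered = sorted(frags, key=lambda f: (-f[1], f[0]))
    return " ".join(text for _, _, text in ordered)


print(stream_order_text(fragments))
print(visual_order_text(fragments))
```

A reader sees the second form on screen, while a program that simply reads the file in sequence sees the first.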

Words in text annotations are added to the search database.

Links to other files are supported. Only links with a /Launch or /URI action are currently supported.
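As a rough sketch of what this restriction means, the function below picks a link target out of an action dictionary that has already been parsed into a Python dict. The function name and the dict representation are assumptions for illustration; the /S, /URI and /F keys are the standard PDF action entries.

```python
def link_target(action):
    """Return the target of a /URI or /Launch action, or None otherwise.

    `action` is a hypothetical, already-parsed action dictionary whose
    keys are PDF names without the leading slash.
    """
    kind = action.get("S")
    if kind == "URI":
        return action.get("URI")
    if kind == "Launch":
        target = action.get("F")
        # /F may be a plain string or a file specification dictionary.
        if isinstance(target, dict):
            target = target.get("F")
        return target
    return None  # e.g. /GoTo and /GoToR actions are not followed


print(link_target({"S": "URI", "URI": "http://www.example.com/"}))
print(link_target({"S": "Launch", "F": "manual.pdf"}))
print(link_target({"S": "GoTo", "D": "chapter1"}))
```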

Only Flate and ASCII85 stream filters are supported. FindinSite-CD-Wizard and FindinSite-MS also support the Flate filter with /DecodeParms /Predictor 12 (PNG Up).
The old LZW compression filter is not supported.
Streams in separate files are not supported.
PDF 1.5+ hybrid-reference files are supported, though /XRefStm is not used.
PDF 1.5+ pure cross-reference streams and object streams are supported by FindinSite-CD-Wizard and FindinSite-MS.
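The filter handling described above might look roughly like the following sketch, which uses Python's standard zlib and base64 modules. The function names are hypothetical, and the predictor code assumes one 8-bit component per column; it is an illustration of the technique, not the indexers' actual code.

```python
import base64
import zlib


def undo_png_up(data, columns):
    """Reverse PNG prediction for rows tagged 0 (None) or 2 (Up)."""
    row_len = columns + 1  # each row starts with a one-byte filter tag
    if len(data) % row_len:
        raise ValueError("data is not a whole number of rows")
    prev = bytes(columns)  # the row above the first row is all zeros
    out = bytearray()
    for start in range(0, len(data), row_len):
        tag = data[start]
        row = data[start + 1:start + row_len]
        if tag == 0:
            cur = bytes(row)
        elif tag == 2:
            # Up: add the byte directly above, modulo 256
            cur = bytes((b + p) & 0xFF for b, p in zip(row, prev))
        else:
            raise ValueError(f"unsupported PNG filter tag {tag}")
        out += cur
        prev = cur
    return bytes(out)


def decode_stream(data, filters, predictor=1, columns=1):
    """Apply the stream filters in order; reject anything else (e.g. LZW)."""
    for name in filters:
        if name == "ASCII85Decode":
            data = data.strip()
            if data.endswith(b"~>"):  # PDF end-of-data marker
                data = data[:-2]
            data = base64.a85decode(data)
        elif name == "FlateDecode":
            data = zlib.decompress(data)
            if predictor == 12:
                data = undo_png_up(data, columns)
            elif predictor != 1:
                raise NotImplementedError(f"predictor {predictor}")
        else:
            raise NotImplementedError(f"filter {name} is not supported")
    return data
```

For example, `decode_stream(raw, ["ASCII85Decode", "FlateDecode"], predictor=12, columns=3)` first undoes the ASCII85 text encoding, then inflates the data, then reverses the Up prediction row by row.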

Password-protected files

Security/Password-protected (encrypted) files are supported in the FindinSite-CD-Wizard PDF Scanner if your Windows operating system permits. Please specify any open (user) or master (security or owner) passwords on the Scan options wizard page of FindinSite-CD-Wizard.

If you are outside the USA then Password-protected files may not be supported because Microsoft restricts the key length supported by its CryptoAPI to 40 or 56 bits. It looks as though Windows Me may be OK.

Windows XP+ should support 128 bit encryption for most of the world.

Microsoft has a free update to 128 bits for most of the world for Windows 2000: http://www.microsoft.com/windows2000/downloads/recommended/encryption/default.asp. This link updates Internet Explorer to 128 bits: http://www.microsoft.com/windows/ie/download/128bit/intro.htm. Also see http://www.microsoft.com/security/ and http://www.microsoft.com/exporting/.

PDF 1.4 files that use the "Adobe Standard Security" 40-bit and 128-bit encryption options are supported. Therefore such PDF files generated by Adobe Acrobat 5+ are supported. Remember that users will have to have Adobe Reader 5+ to read such files.

Document Information

The PDF Document Information is inspected by the PDF indexers. The Title is used as the page title, the Subject is used as the META Description and the Keywords are used as the META Keywords.
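The mapping can be summarised in a short sketch. The function name, the output field names and the already-parsed `info` dict are assumptions for illustration only.

```python
def info_to_search_fields(info):
    """Map PDF Document Information entries onto search-database fields.

    `info` is a hypothetical dict of Document Information entries,
    keyed by PDF name without the leading slash.
    """
    return {
        "title": info.get("Title", ""),          # page title
        "description": info.get("Subject", ""),  # META Description
        "keywords": info.get("Keywords", ""),    # META Keywords
    }


fields = info_to_search_fields({
    "Title": "User Guide",
    "Subject": "How to install the product",
    "Keywords": "install, setup",
    "Author": "Someone",  # not used by the mapping
})
print(fields)
```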

Character Encodings

PDF specifies several ways in which characters are encoded in PDF files. The PDF indexers support only some of these methods:
  • The following PDF named Font Encodings are supported: PDFDocEncoding, StandardEncoding, MacRomanEncoding, WinAnsiEncoding and MacExpert.

  • Fonts with single byte ToUnicode mappings are supported. Identity-H and Identity-V double byte encodings with ToUnicode mappings are also supported.

  • Font Encodings with Differences from a base encoding are supported. If a BaseEncoding is given, it may be any of the named encodings listed above. The following glyph names are supported in the Differences array.
Glyph names:
  • All glyph names in the Adobe® glyphlist.txt and zapfdingbats.txt are supported
  • uni<CODE> glyph names are supported, e.g. "/uni20AC" represents Unicode character U+20AC, the Euro sign
  • Glyph names in decimal are supported, e.g. "/162"
  • Characters for glyph names in a /Font /CharProcs dictionary entry are ignored
  • Characters for glyph names in a /Font /FontDescriptor /CharSet entry are ignored

  • If the glyphlist.txt contains duplicate codes for a single glyph name, the duplicates are not used.
  • Ligature glyph names with underscore characters are not currently supported
  • /uni<CODE> glyph names with more than one <CODE> are not currently supported
Glyph name code translations:
  • ".xxx" variant descriptors in glyph names are silently removed, e.g. "A.swash" becomes "A"
  • "small", "oldstyle", "inferior" and "superior" are silently removed from the end of glyph names, e.g. "Asmall" becomes "A"
If the Font has no named encoding, or the Encoding Differences has no BaseEncoding, then a default encoding is used, which maps each single-byte code directly into double-byte Unicode format.
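The glyph name rules above can be sketched as follows. This is a minimal illustration under stated assumptions: GLYPH_LIST is a tiny stand-in for Adobe's glyphlist.txt, the function names are hypothetical, a decimal name is taken to be the Unicode code point, and trying the full name before stripping a suffix is a choice made here rather than documented behaviour.

```python
import re

# Tiny hypothetical stand-in for Adobe's glyphlist.txt.
GLYPH_LIST = {"A": "A", "Euro": "\u20ac", "space": " ", "one": "1"}

SUFFIXES = ("small", "oldstyle", "inferior", "superior")


def glyph_to_unicode(name):
    """Return the character for a glyph name, or None if unrecognised."""
    name = name.lstrip("/")
    if "_" in name:
        return None  # ligature names with underscores are not supported
    base = name.split(".", 1)[0]  # drop ".xxx" variant descriptors
    if not base:
        return None
    candidates = [base]
    for suffix in SUFFIXES:
        if base.endswith(suffix) and len(base) > len(suffix):
            candidates.append(base[:-len(suffix)])
            break
    for cand in candidates:
        if cand in GLYPH_LIST:
            return GLYPH_LIST[cand]
        match = re.fullmatch(r"uni([0-9A-Fa-f]{4})", cand)
        if match:  # exactly one <CODE>; longer forms are not supported
            return chr(int(match.group(1), 16))
        if re.fullmatch(r"[0-9]{1,5}", cand):
            return chr(int(cand))  # decimal glyph name
    return None


def default_decode(code):
    """Default encoding: the single-byte code is used as the Unicode value."""
    return chr(code)


for glyph in ("/uni20AC", "/162", "/A.swash", "/Asmall", "/f_i", "/G7F"):
    print(glyph, repr(glyph_to_unicode(glyph)))
```

Names that resolve to None here correspond to the "Unrecognised glyph name" reports described under Bugs.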

Bugs

  • Quite a few PDF files are in an invalid format. PDF reader programs can often cope with these problems, but the PDF indexers will report an obscure error. If you get an obscure error from a PDF indexer, check that the PDF file can be read correctly before contacting PHD to report a problem. An example of such an error:
    XXX.PDF:  PDF Scanner error:    PDFGetByte: invalid BufferPtr

  • Some types of error result in minor memory leaks.

  • The PDF indexers will optionally report errors (such as the following) when they find characters that they cannot convert into Unicode. These errors can usually be safely ignored, because they correspond to graphics symbols.
    XXX.PDF:  Undefined code 0x0011 in font /F16
    YYY.PDF:  Unrecognised glyph name /G7F in font /F5.  Characters with this code are ignored.
    These problems are only reported if the "Report PDF Character problems" checkbox is set in the FindinSite-CD-Wizard Scan Options wizard page.

  • The PDF indexers do not understand some text streams generated by unusual programs.

Possible improvements

  • Get Outline text and report as headings
  • Harder: decode font information further to reduce glyph and code errors.
All site Copyright © 1996-2011 PHD Computer Consultants Ltd, PHDCC

Last modified: 6 November 2009.
