ICEpdf / PDF-280

Error parsing malformed cmap file

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.2
    • Fix Version/s: 4.2.2
    • Component/s: Core/Parsing, Font Engine
    • Labels:
      None
    • Environment:
      any

      Description

      A client has sent in a file where the text "These fine ffine ften ffton flen fflen" is not correctly extracted. The text in the PDF is represented by:

      [(\037)7(e)8(s)-3(e \036)13(n)7(e \035n)7(e \034)10(en \033)10(o)10(n \032)4(en \031)4(en)-25( )]TJ
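To make the byte values in that TJ array easier to read, here is a minimal sketch that turns one literal string from the array into glyph codes. The class name `TjOctalCodes` is hypothetical and not part of ICEpdf, and the sketch assumes only octal escapes (`\ddd`) and plain one-byte characters occur, as in the line above; other PDF escapes such as `\n` or `\(` are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the ICEpdf code base.
final class TjOctalCodes {

    /** Converts the inside of a PDF literal string, e.g. "e \036", to glyph codes. */
    static List<Integer> codes(String literal) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '\\') {
                // Read up to three octal digits after the backslash.
                int j = i + 1;
                int value = 0;
                int digits = 0;
                while (j < literal.length() && digits < 3
                        && literal.charAt(j) >= '0' && literal.charAt(j) <= '7') {
                    value = value * 8 + (literal.charAt(j) - '0');
                    j++;
                    digits++;
                }
                if (digits == 0) {
                    throw new IllegalArgumentException(
                            "only octal escapes are handled in this sketch");
                }
                out.add(value & 0xFF);
                i = j;
            } else {
                // Any other character stands for its own one-byte code.
                out.add((int) c);
                i++;
            }
        }
        return out;
    }
}
```

For instance, `TjOctalCodes.codes("\\037")` yields the single code 0x1F, which is the source code of the `<1F>` entry in the cmap shown further down.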

      The glyphs Th, fi, ffi, ft, fl, fft, ffl are each referenced by a single glyph ID. The cmap file that maps each glyph ID to a Unicode value is actually malformed. The cmap file looks like this:

      <19> <00660066006C>
      <1A> <0066006C>
      <1B> <006600660074>
      <1C> <00660074>
      <1D> <006600660069>
      <1E> <00660069>
      <1F> <00540068>
      <20> <0020>
      <65> <0065>
      <6E> <006E>
      <6F> <006F>

      According to the cmap specification, <###> represents only one Unicode value, and as a result our parser reads only the first Unicode value and ignores the rest. According to the spec, a mapping from one cid to more than one Unicode value should open with a bracket when defining the Unicode values, for example:

      <19> [<00660066006C>]

      Long story short, I need to do further work on our cmap parser so that it correctly interprets the range format. The document in question was encoded with Adobe cs7, which means we'll likely see more files like this as time goes on.
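As a rough illustration of the lenient behaviour described above, the following sketch accepts a destination with or without the surrounding brackets and decodes the whole hex string as UTF-16BE instead of stopping after the first value. It is not the actual change made to CMap.java; the class `LenientBfCharParser` and its methods are hypothetical, and the sketch assumes one destination string per entry and source codes that fit in an int.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the ICEpdf CMap class.
final class LenientBfCharParser {

    // <src> followed by <dst>, with optional [ ] around the destination.
    private static final Pattern ENTRY = Pattern.compile(
            "<([0-9A-Fa-f]+)>\\s*\\[?\\s*<([0-9A-Fa-f]+)>\\s*\\]?");

    /** Builds a code -> text map from lines such as "<19> <00660066006C>". */
    static Map<Integer, String> parse(String cmapBody) {
        Map<Integer, String> mapping = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(cmapBody);
        while (m.find()) {
            int code = Integer.parseInt(m.group(1), 16);
            mapping.put(code, decodeUtf16Be(m.group(2)));
        }
        return mapping;
    }

    /** Decodes every 16-bit unit of the hex string, not just the first one. */
    static String decodeUtf16Be(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException(
                    "expected whole UTF-16 code units: " + hex);
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    /** Maps glyph codes to text; unmapped codes become U+FFFD. */
    static String toText(int[] codes, Map<Integer, String> mapping) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(mapping.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }
}
```

With the entries for `<1E>`, `<6E>` and `<65>` from the listing above, `toText(new int[] {0x1E, 0x6E, 0x65}, mapping)` gives "fine", which is the second word of the client's text. Treating the bracketed and unbracketed forms identically keeps the parser tolerant of files like the attached one without changing how well-formed entries are read.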
      1. 9589_test2.pdf
        301 kB
        Patrick Corless

        Activity

        Patrick Corless created issue -
        Patrick Corless added a comment -

        Cmap parsing issue for gid-> many cids.

        Patrick Corless made changes -
        Field Original Value New Value
        Attachment 9589_test2.pdf [ 12992 ]
        Tyler Johnson made changes -
        Salesforce Case [5007000000EUSuL]
        Repository: ICEsoft Public SVN Repository
        Revision: #24276
        Date: Tue Mar 29 07:35:07 MDT 2011
        User: patrick.corless
        Message: PDF-280 OS cmap fix for mappings from one gliphid -> many cid.
        Files Changed:
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/fonts/ofont/CMap.java
        Patrick Corless added a comment -

        Applied patch for nfont pro library. It is now possible to properly extract one-to-many unicode values even if the vector notation '[]' was omitted.

        Patrick Corless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Ken Fyten made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Patrick Corless
            Reporter:
            Patrick Corless
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved: