[PDF-280] Error parsing malformed cmap file - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.2
Fix Version/s: 4.2.2
Component/s: Core/Parsing, Font Engine
Labels:
None
Environment:
any

Description

A client has sent in a file where the text "These fine ffine ften ffton flen fflen" is not correctly extracted. The text in the PDF is represented by:

[(\037)7(e)8(s)-3(e \036)13(n)7(e \035n)7(e \034)10(en \033)10(o)10(n \032)4(en \031)4(en)-25( )]TJ
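To make the codes in that TJ array easier to see, here is a minimal Python sketch that pulls the character codes out of the literal strings. It is only an illustration: the function name `tj_codes` is hypothetical, it is not part of our parser, and it handles only octal escapes and simple backslash escapes (no nested parentheses, no `\n`-style escapes).

```python
import re

def tj_codes(tj_array):
    """Return the character codes found in the literal strings of a TJ array."""
    codes = []
    # Each (...) group is a literal string; the numbers between groups are
    # position adjustments and are ignored here.
    for literal in re.findall(r"\(((?:\\.|[^\\)])*)\)", tj_array):
        i = 0
        while i < len(literal):
            if literal[i] == "\\":
                octal = re.match(r"[0-7]{1,3}", literal[i + 1:])
                if octal:
                    # Octal escape such as \037 -> 0x1F
                    codes.append(int(octal.group(), 8))
                    i += 1 + len(octal.group())
                    continue
                # Simplification: any other escaped character is taken literally.
                i += 1
            codes.append(ord(literal[i]))
            i += 1
    return codes

tj = r"[(\037)7(e)8(s)-3(e \036)13(n)7(e \035n)7(e \034)10(en \033)10(o)10(n \032)4(en \031)4(en)-25( )]TJ"
codes = tj_codes(tj)
print([hex(c) for c in codes[:6]])  # ['0x1f', '0x65', '0x73', '0x65', '0x20', '0x1e']
```

The codes 0x19 through 0x1F are the single-code ligature glyphs that the cmap has to expand into several Unicode characters.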

The glyphs Th, fi, ffi, ft, fft, fl, and ffl are each referenced by a single glyph ID. The cmap file that maps each glyph ID to a Unicode value is actually malformed. The cmap file looks like this:

<19> <00660066006C>
<1A> <0066006C>
<1B> <006600660074>
<1C> <00660074>
<1D> <006600660069>
<1E> <00660069>
<1F> <00540068>
<20> <0020>
<65> <0065>
<6E> <006E>
<6F> <006F>

According to the cmap specification, <###> represents only one Unicode value, and as a result our parser reads only the first Unicode value and ignores the rest. According to the spec, a mapping from a CID to more than one Unicode value should wrap the Unicode values in brackets, for example:

<19> [<00660066006C>]
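The difference between reading only the first value and reading all of them can be sketched as follows. This is a rough Python illustration, assuming the destination hex string is a sequence of 16-bit big-endian values as in the listing above; both function names are hypothetical and do not come from our code base.

```python
def first_value_only(hex_dst):
    """Mimic the behaviour described above: keep only the first 16-bit value."""
    return bytes.fromhex(hex_dst[:4]).decode("utf-16-be")

def all_values(hex_dst):
    """Decode every 16-bit value in the destination string."""
    return bytes.fromhex(hex_dst).decode("utf-16-be")

print(first_value_only("00660066006C"))  # f
print(all_values("00660066006C"))        # ffl
```

With the first approach every ligature collapses to its first letter, which matches the broken extraction seen in the client's file.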

Long story short, I need to do further work on our cmap parser so that it correctly interprets the range format. The document in question was encoded with Adobe cs7, which means we'll likely see more files like this as time goes on.

Attachments

9589_test2.pdf

21/Mar/11 8:45 AM

301 kB

Patrick Corless

Activity

Patrick Corless added a comment - 21/Mar/11 8:45 AM

Cmap parsing issue for gid-> many cids.

Patrick Corless added a comment - 18/Aug/11 7:40 AM

Applied patch for nfont pro library. It is now possible to properly extract one-to-many Unicode values even if the vector notation '[]' was omitted.
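For anyone hitting the same kind of file, the general idea of accepting both notations can be sketched like this in Python. This is not the actual nfont patch, just an assumed approach: the name `parse_bfchar_entries` and the regular expression are hypothetical, and the sketch only looks at one entry per line.

```python
import re

# <code> followed by <hex> with optional surrounding brackets.
ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*\[?\s*<([0-9A-Fa-f]+)>\s*\]?")

def parse_bfchar_entries(text):
    """Map each source code to its full Unicode string, with or without '[]'."""
    mapping = {}
    for line in text.splitlines():
        match = ENTRY.fullmatch(line.strip())
        if not match:
            continue  # skip anything that is not a simple one-line entry
        code = int(match.group(1), 16)
        # Decode every 16-bit value, not just the first one.
        mapping[code] = bytes.fromhex(match.group(2)).decode("utf-16-be")
    return mapping

sample = """<19> <00660066006C>
<1A> [<0066006C>]
<1E> <00660069>
<65> <0065>
<6E> <006E>"""
table = parse_bfchar_entries(sample)
print("".join(table[c] for c in (0x1E, 0x6E, 0x65)))  # fine
```

Both the bracketed and the unbracketed entry end up as a multi-character string, so the ligature codes expand to their full text.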

People

Assignee:

Patrick Corless

Reporter:

Patrick Corless

Votes:

0

Watchers:

0

Dates

Created:

21/Mar/11 8:43 AM

Updated:

29/Mar/12 11:42 AM

Resolved:

18/Aug/11 7:40 AM