[PDF-376] New Content Parser - ICEsoft JIRA Issue Tracker

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.3
Fix Version/s: 5.0.0 alpha1, 5.0.0 beta1, 5.0
Component/s: Core/Parsing
Labels:
None
Environment:
any

Description

When building the PostScript calculator for type 4 function support I did quite a bit of research into parsing techniques. The end result was a relatively quick parsing engine. Once this work was completed I started working on a new PDF Content Parser system using the same techniques. In theory the new parser should be on the order of 50x faster than the current one.

The ContentParser in ICEpdf is tightly coupled with the generic Parser class. The Parser class feeds the ContentParser tokens for processing. This Parser is multipurpose, handling both stream and dictionary parsing as well as providing tokens in a page content stream. The main problem here is that content stream operand tokens are returned as strings from the Parser, and then .equals is used by the ContentParser to execute a found command. There are 90-plus operand tokens, which is a lot of string comparisons that we could be doing more efficiently.
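
As an illustration only, here is a minimal sketch of one way to avoid the chain of .equals calls: pack the raw bytes of a command token into an int and dispatch with a switch. It assumes command tokens are at most four ASCII bytes long. The class name, the constants and the handful of commands shown are hypothetical and are not taken from the ICEpdf source.

```java
class OperatorDispatchSketch {

    // Hypothetical command codes: the token's ASCII bytes packed big-endian into an int.
    // These are compile-time constants, so they can be used as switch case labels.
    static final int OP_BT = ('B' << 8) | 'T';
    static final int OP_ET = ('E' << 8) | 'T';
    static final int OP_Tj = ('T' << 8) | 'j';
    static final int OP_q = 'q';
    static final int OP_Q = 'Q';

    /** Packs buf[start .. start + length) into a single int without creating a String. */
    static int pack(byte[] buf, int start, int length) {
        if (length < 1 || length > 4) {
            throw new IllegalArgumentException("token must be 1 to 4 bytes long");
        }
        int code = 0;
        for (int i = 0; i < length; i++) {
            code = (code << 8) | (buf[start + i] & 0xFF);
        }
        return code;
    }

    /** One integer switch replaces a sequence of String.equals comparisons. */
    static String dispatch(int code) {
        switch (code) {
            case OP_BT: return "begin text";
            case OP_ET: return "end text";
            case OP_Tj: return "show text";
            case OP_q:  return "save graphics state";
            case OP_Q:  return "restore graphics state";
            default:    return "unknown";
        }
    }
}
```

With this layout, a token found at offset 2 with length 2 in the bytes of "q BT" would be handled by `dispatch(pack(bytes, 2, 2))`, and no intermediate String is allocated for the comparison.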

One further problem with the Parser class is that it assumes that a content stream is always well formed and that operands, names and numbers will always be whitespace-separated. This is not the case, and a new setup should be able to determine tokens even if spaces are not present.
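
To make the whitespace problem concrete, here is a rough sketch of a tokenizer that ends a token at a PDF delimiter character or at a change from number to non-number, instead of relying on spaces. It is a simplified illustration under stated assumptions, not the parser described in this issue: the class and method names are hypothetical, it works on a String for brevity, and it does not handle hex strings, dictionaries, inline images, or a command immediately followed by a digit.

```java
import java.util.ArrayList;
import java.util.List;

class ContentTokenizerSketch {

    static boolean isWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    static boolean isDelimiter(char c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    static boolean isNumberChar(char c) {
        return (c >= '0' && c <= '9') || c == '.' || c == '+' || c == '-';
    }

    static boolean endsRegularToken(char c) {
        return isWhitespace(c) || isDelimiter(c);
    }

    static List<String> tokenize(String content) {
        List<String> tokens = new ArrayList<>();
        int n = content.length();
        int i = 0;
        while (i < n) {
            char c = content.charAt(i);
            if (isWhitespace(c)) {
                i++;
                continue;
            }
            if (c == '%') {
                // Comment: skip to the end of the line.
                while (i < n && content.charAt(i) != '\n' && content.charAt(i) != '\r') i++;
                continue;
            }
            int start = i;
            if (c == '/') {
                // Name: runs until the next whitespace or delimiter.
                i++;
                while (i < n && !endsRegularToken(content.charAt(i))) i++;
            } else if (c == '(') {
                // Literal string: track nested parentheses and skip escaped characters.
                int depth = 0;
                do {
                    char s = content.charAt(i);
                    if (s == '\\') i++;
                    else if (s == '(') depth++;
                    else if (s == ')') depth--;
                    i++;
                } while (i < n && depth > 0);
            } else if (isDelimiter(c)) {
                // Remaining delimiters such as [ and ] are emitted as one-character tokens.
                i++;
            } else if (isNumberChar(c)) {
                // Number: stops at the first non-numeric character, space or not.
                while (i < n && isNumberChar(content.charAt(i))) i++;
            } else {
                // Anything else is treated as a command token.
                while (i < n && !endsRegularToken(content.charAt(i))) i++;
            }
            tokens.add(content.substring(start, Math.min(i, n)));
        }
        return tokens;
    }
}
```

For example, `tokenize("/F1 12Tf[(Hi)]TJ")` yields the tokens `/F1`, `12`, `Tf`, `[`, `(Hi)`, `]`, `TJ`, even though only one space appears in the input.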

I've already done quite a bit of work on this. I will likely create a 4.3 branch and use the trunk to start checking in work for this optimization.

People

Assignee:

Patrick Corless

Reporter:

Patrick Corless

Votes:

0

Watchers:

2

Dates

Created:

17/Jan/12 10:23 AM

Updated:

01/Apr/15 3:01 PM

Resolved:

02/Apr/13 1:46 PM