[PDF-376] New Content Parser - ICEsoft JIRA Issue Tracker

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.3
Fix Version/s: 5.0.0 alpha1, 5.0.0 beta1, 5.0
Component/s: Core/Parsing
Labels:
None
Environment:
any

Description

When building the PostScript calculator for type 4 function support, I did quite a bit of research into parsing techniques. The end result was a relatively quick parsing engine. Once this work was completed, I started working on a new PDF Content Parser system using the same techniques. In theory the new parser should be on the order of 50x faster than the current one.

The ContentParser in ICEpdf is tightly coupled with the generic Parser class. The Parser class feeds the ContentParser tokens for processing. This Parser is multipurpose, handling both stream and dictionary parsing as well as providing tokens in a page content stream. The main problem here is that content stream operand tokens are returned as strings from the Parser, and then .equals is used by the ContentParser to execute a found command. There are 90-plus operand tokens, which is a lot of comparisons that we could be doing more efficiently.
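As a rough illustration of the cheaper comparison, here is a minimal sketch that packs each token's ASCII bytes into an int and dispatches with a switch instead of a chain of .equals calls. The class, constant, and method names are hypothetical and are not taken from ICEpdf; the packing scheme is only one possible encoding and relies on content stream tokens being at most three characters long.

```java
// Hypothetical sketch: none of these names come from ICEpdf.
final class OperatorSwitchSketch {

    // Compile-time constants so they can be used as case labels.
    static final int OP_q  = 'q';
    static final int OP_Q  = 'Q';
    static final int OP_cm = ('c' << 8) | 'm';
    static final int OP_Tf = ('T' << 8) | 'f';
    static final int OP_BT = ('B' << 8) | 'T';
    static final int OP_ET = ('E' << 8) | 'T';

    // Pack the ASCII bytes of a short token into one int, e.g. "Tf" -> ('T' << 8) | 'f'.
    // A real lexer would build this value directly from the byte stream
    // and never allocate a String at all.
    static int encode(String token) {
        int code = 0;
        for (int i = 0; i < token.length(); i++) {
            code = (code << 8) | (token.charAt(i) & 0xFF);
        }
        return code;
    }

    // One integer switch replaces up to 90-plus sequential String.equals calls.
    static String dispatch(int code) {
        switch (code) {
            case OP_q:  return "save graphics state";
            case OP_Q:  return "restore graphics state";
            case OP_cm: return "concatenate matrix";
            case OP_Tf: return "set font and size";
            case OP_BT: return "begin text";
            case OP_ET: return "end text";
            default:    return "unknown";
        }
    }
}
```

For example, `OperatorSwitchSketch.dispatch(OperatorSwitchSketch.encode("cm"))` selects the matrix branch with a single integer comparison path rather than a run of string comparisons.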

One further problem with the Parser class is that it assumes that a content stream is always well formed and that operands, names and numbers will always be whitespace separated. This is not the case, and a new setup should be able to determine tokens even if spaces are not present.
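To make the whitespace problem concrete, here is a minimal sketch of a lexer that ends a token whenever the character class changes, rather than waiting for a space. It is an illustration only, not the ICEpdf lexer: the class and method names are hypothetical, it works on a String instead of a byte stream, and it deliberately ignores cases such as the d0/d1 operators, hex strings, comments and inline images.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the ICEpdf lexer.
final class ContentLexerSketch {

    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    static boolean isNumberStart(char c) { return isDigit(c) || c == '+' || c == '-' || c == '.'; }

    // Letters plus the few punctuation characters used by operators such as T*, ' and ".
    static boolean isOperatorChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '*' || c == '\'' || c == '"';
    }

    // Characters that may continue a name: anything except whitespace and PDF delimiters.
    static boolean isRegular(char c) {
        return !Character.isWhitespace(c) && "()<>[]{}/%".indexOf(c) < 0;
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        final int n = s.length();
        int i = 0;
        while (i < n) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (c == '/') {                       // name: runs until whitespace or a delimiter
                i++;
                while (i < n && isRegular(s.charAt(i))) i++;
            } else if (isNumberStart(c)) {        // number: ends at the first non-numeric character
                i++;
                while (i < n && (isDigit(s.charAt(i)) || s.charAt(i) == '.')) i++;
            } else if (c == '(') {                // literal string: balanced parentheses, backslash escapes
                int depth = 0;
                do {
                    char d = s.charAt(i);
                    if (d == '\\') i++;           // skip the escaped character
                    else if (d == '(') depth++;
                    else if (d == ')') depth--;
                    i++;
                } while (i < n && depth > 0);
            } else if (isOperatorChar(c)) {       // operator: ends at the first non-operator character
                i++;
                while (i < n && isOperatorChar(s.charAt(i))) i++;
            } else {                              // any other delimiter becomes a one-character token
                i++;
            }
            tokens.add(s.substring(start, Math.min(i, n)));
        }
        return tokens;
    }
}
```

With this approach, a fragment such as `q1 0 0 1 50 700cm/F1 12Tf(Hi)Tj` still separates `700` from `cm`, `cm` from `/F1`, and `12` from `Tf`, even though no spaces appear between them.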

I've already done quite a bit of work on this. I will likely create a 4.3 branch and use the trunk to start checking in work for this optimization.

Activity

Patrick Corless added a comment - 20/Nov/12 6:30 PM - edited

A rather massive check-in will follow these comments. A new content parser has been cut into the core. New features include, but are not limited to:

Handling of malformed content streams.
Operands are now collected as Integers and switched accordingly.
PDF-13 has been solved, and multiple threads can now access a document's pages.
Vectors and Hashtables have been replaced with Lists and Maps where appropriate.
Name objects are now preserved between parse dictionary lookups.
Shape painting has been rewritten to use a command pattern for paint operations, which avoids costly "if" calls.
Image loading has been moved onto a worker thread, which allows content parsing to continue.
Optional image loading models for scaled and MipMapped images.
Configurable initial shapes size for CAD drawings.
Removal of the memory manager class; the References class is now used instead, resulting in lower memory overhead.
Creation of an ImageStream object to handle image data; some more work is still needed here to speed up image loading, reduce memory consumption, and apply the various masks generically.
Updated the Document class so that it will rebuild the cross-reference table when doing a linear traversal of the document. This will ensure that when page resources are garbage collected, the new cross-reference entries can re-fetch the respective objects from the file when needed.
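
The command pattern mentioned for shape painting can be sketched roughly as follows. This is an assumption about the general shape of such a design, not the actual ICEpdf code: `Canvas`, `DrawCmd`, `ColorCmd` and `FillCmd` are hypothetical names, and `Canvas` is a plain stand-in for whatever wraps Graphics2D in a real renderer. The point is that the paint loop makes one virtual call per entry instead of testing each entry's type with a chain of "if" checks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the ICEpdf shapes implementation.
final class PaintCommandSketch {

    // Stand-in for the drawing target; a real renderer would wrap Graphics2D here.
    static final class Canvas {
        String color = "black";
        final List<String> painted = new ArrayList<>();
    }

    // Each parsed paint operation becomes one command object.
    interface DrawCmd {
        void execute(Canvas canvas);
    }

    static final class ColorCmd implements DrawCmd {
        private final String color;
        ColorCmd(String color) { this.color = color; }
        @Override public void execute(Canvas canvas) { canvas.color = color; }
    }

    static final class FillCmd implements DrawCmd {
        private final String shape;
        FillCmd(String shape) { this.shape = shape; }
        @Override public void execute(Canvas canvas) {
            canvas.painted.add("fill " + shape + " in " + canvas.color);
        }
    }

    // The paint loop has no type checks: it simply executes each command in order.
    static Canvas paint(List<DrawCmd> commands) {
        Canvas canvas = new Canvas();
        for (DrawCmd cmd : commands) {
            cmd.execute(canvas);
        }
        return canvas;
    }
}
```

The command list is built once during content parsing and can then be replayed on every repaint without re-inspecting what kind of operation each entry is.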

Patrick Corless added a comment - 02/Apr/13 1:46 PM

New content parser has been checked in and is now part of the PRO release. The NContentParser and the OContentParser are created via a new ContentParserFactory, which detects whether the PRO jars are available. The parser logic has been abstracted with regard to the PostScript-to-Java2D translation, but the lexers differ quite substantially.
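
The comment does not say how the factory detects the PRO jars. One common way to do this kind of detection in Java is a reflective class lookup with a fallback, sketched below. Everything here is an assumption for illustration: the `ContentParser` interface, the `name()` method and the class name passed in by the caller are hypothetical, and only the reflection calls themselves are standard JDK API.

```java
// Hypothetical sketch: the real ContentParserFactory may work differently.
final class ParserFactorySketch {

    interface ContentParser {
        String name();
    }

    // Stand-in for the open-source parser that is always on the classpath.
    static final class OContentParser implements ContentParser {
        @Override public String name() { return "OContentParser"; }
    }

    // True if the named class can be found, without initializing it.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ParserFactorySketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Prefer the optional parser when its class is present and usable;
    // otherwise fall back to the bundled one.
    static ContentParser create(String optionalParserClass) {
        if (isAvailable(optionalParserClass)) {
            try {
                Object parser = Class.forName(optionalParserClass)
                        .getDeclaredConstructor()
                        .newInstance();
                return (ContentParser) parser;
            } catch (ReflectiveOperationException | ClassCastException e) {
                // Class exists but cannot be used as a parser: use the fallback below.
            }
        }
        return new OContentParser();
    }
}
```

Calling `ParserFactorySketch.create` with a class name that is not on the classpath returns the fallback parser, so callers never need to know which jars are installed.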
