Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0 - Beta
    • Fix Version/s: 4.0
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      All

      Description

      When looking at the JBIG2 library's code, some memory optimisations came to mind, to reduce the size of the memory spikes. One was done in PDF-9. This is for any that are after the 4.0 Beta release.

        Issue Links

          Activity

          Mark Collette added a comment -

          When using the JBIG2 library, it first decodes the JBIG2 compressed input from a byte array, and then it allows for getting the BufferedImage. PDF-9 optimised the memory usage of getting the BufferedImage.

          This optimisation reduces memory usage after decoding, but before getting the BufferedImage. Analysis of the JBIG2 decoder's internal data structures shows that intermediary sections of the JBIG2 image are assembled together to form the final image. While decoding, these subsections must remain accessible, but after decoding, these subsections (called "segments") become redundant, as the page segment then holds the entire image. So, I've added some cleanup code to clear away every segment except the page segments. Depending on the format of a given JBIG2 image in my test suite, I found that these other segments could account for anywhere from several kilobytes to more than 4 megabytes for a full-page fax scan.

          Initial analysis does not indicate that those segments could be cleared away progressively, as they are decoded. The file consists of different segment types, one after the other, and the PDF spec dictates that they be processed in turn. Since any later segment may refer to any earlier segment, there is no way of knowing that a segment is no longer needed until the last segment has been processed. There is a random-access mode of JBIG2 decoding that might allow for more intelligent dependency calculations, but it does not appear to be relevant to JBIG2 images embedded in PDFs.
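          To illustrate why progressive cleanup is hard: a segment header lists the numbers of the earlier segments it refers to, so the last use of a segment is only computable once every header has been read, which random-access mode could in principle do up front. This sketch assumes a hypothetical Header record; it is not the library's API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Shows why segments can't be freed as they are decoded in sequential mode:
// segment 0's last use is only discovered when the final referring header
// is parsed. With all headers available up front (random-access mode), the
// last referencing segment for each segment can be computed in one pass.
public class SegmentLifetimeSketch {
    record Header(int number, int[] referredTo) {}

    static Map<Integer, Integer> lastUse(List<Header> headers) {
        Map<Integer, Integer> last = new HashMap<>();
        for (Header h : headers) {
            for (int ref : h.referredTo()) {
                last.put(ref, h.number()); // headers arrive in order, so
            }                              // later entries overwrite earlier
        }
        return last;
    }

    public static void main(String[] args) {
        List<Header> headers = List.of(
            new Header(0, new int[] {}),   // symbol dictionary
            new Header(1, new int[] {0}),  // text region using dictionary 0
            new Header(2, new int[] {0})); // a later region reuses it too
        // Segment 0 is last needed by segment 2 -- unknowable at segment 1.
        System.out.println(lastUse(headers).get(0)); // prints 2
    }
}
```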

          Here is the post-decode cleanup commit.

          Subversion 19987
          icepdf\core\src\org\icepdf\core\pobjects\Stream.java
          icepdf\core\src\org\jpedal\jbig2\JBIG2Decoder.java
          icepdf\core\src\org\jpedal\jbig2\decoders\JBIG2StreamDecoder.java

          Mark Collette added a comment - edited

          I'm not sure it's worthwhile optimising the fact that the decoder takes its input as a byte[] instead of an InputStream. Technically it can take an InputStream, but it then derives a byte[] from it. It uses its StreamReader class, which acts as a sort of specialised pseudo input stream hard-coded to be backed by a byte[]. In theory, I could extract its methods into an interface and write one implementation that works off of our SeekableInput and another that works off of an InputStream, optimising both the filtered and unfiltered input cases. But since even a huge fax page typically only takes around 100 KB, I'm not sure how worthwhile this would be.
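          The interface extraction being weighed here could look something like the following. The BitReader interface and both implementations are hypothetical sketches of the idea, not StreamReader's actual methods or the library's API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of extracting a reader interface so both a byte[]-backed and an
// InputStream-backed implementation can feed the decoder. Names are
// illustrative stand-ins for StreamReader's operations.
public class ReaderExtractionSketch {
    interface BitReader {
        int readByte() throws IOException;              // next byte, or -1 at end
        void movePointer(int delta) throws IOException; // relative seek
    }

    // Mirrors the existing behaviour: the whole input held in memory.
    static class ByteArrayReader implements BitReader {
        private final byte[] data;
        private int pos;
        ByteArrayReader(byte[] data) { this.data = data; }
        public int readByte() { return pos < data.length ? data[pos++] & 0xFF : -1; }
        public void movePointer(int delta) { pos += delta; }
    }

    // Streaming variant: forward seeks only, but avoids buffering the input.
    static class StreamBackedReader implements BitReader {
        private final InputStream in;
        StreamBackedReader(InputStream in) { this.in = in; }
        public int readByte() throws IOException { return in.read(); }
        public void movePointer(int delta) throws IOException {
            if (delta < 0) throw new IOException("backward seek unsupported");
            in.skip(delta);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {10, 20, 30};
        BitReader a = new ByteArrayReader(data);
        BitReader b = new StreamBackedReader(new ByteArrayInputStream(data));
        a.movePointer(1);
        b.movePointer(1);
        System.out.println(a.readByte() + " " + b.readByte()); // prints "20 20"
    }
}
```

          Whether the streaming variant pays off depends on whether the decoder ever seeks backwards; if it does, only a SeekableInput-backed implementation would work for that path.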

          Mark Collette added a comment -

          The current optimisations seem good for now.


            People

            • Assignee:
              Mark Collette
              Reporter:
              Mark Collette
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: