Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0 - Beta
    • Fix Version/s: 4.0
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      All

      Description

      When looking at the JBIG2 library's code, some memory optimisations came to mind, to reduce the size of the memory spikes. One was done in PDF-9. This is for any that are after the 4.0 Beta release.

        Issue Links

          Activity

          Mark Collette added a comment -

          When using the JBIG2 library, it first decodes the JBIG2 compressed input from a byte array, and then it allows for getting the BufferedImage. PDF-9 optimised the memory usage of getting the BufferedImage.

          This optimisation reduces memory usage after decoding, but before getting the BufferedImage. Analysis of the JBIG2 decoder's internal data structures shows that intermediary sections of the JBIG2 image are assembled together to form the final image. While decoding, these subsections must remain accessible, but after decoding, these subsections (called "segments") become redundant, as the page segment then holds the entire image. So, I've added some cleanup code to clear away every segment except the page segments. Depending on the format of a given JBIG2 image in my test suite, I found that these other segments could account for anywhere from several kilobytes to more than 4 megabytes for a full-page fax scan.

          Initial analysis does not indicate that those segments could be cleared away progressively, as they are decoded. The file consists of different segment types, one after the other, and the PDF spec dictates that they be processed in turn. Since any later segment may refer to any earlier segment, there is no way of knowing that a segment is no longer needed until the last segment has been processed. There is a random-access mode of JBIG2 decoding that might allow for more intelligent dependency calculations, but it does not appear to be relevant to JBIG2 images embedded in PDFs.
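          To illustrate why progressive cleanup is hard: a segment header lists the numbers of the earlier segments it refers to, so the last use of a segment is only computable once every header has been read, which random-access mode could in principle do up front. This sketch assumes a hypothetical Header record; it is not the library's API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Shows why segments can't be freed as they are decoded in sequential mode:
// segment 0's last use is only discovered when the final referring header
// is parsed. With all headers available up front (random-access mode), the
// last referencing segment for each segment can be computed in one pass.
public class SegmentLifetimeSketch {
    record Header(int number, int[] referredTo) {}

    static Map<Integer, Integer> lastUse(List<Header> headers) {
        Map<Integer, Integer> last = new HashMap<>();
        for (Header h : headers) {
            for (int ref : h.referredTo()) {
                last.put(ref, h.number()); // headers arrive in order, so
            }                              // later entries overwrite earlier
        }
        return last;
    }

    public static void main(String[] args) {
        List<Header> headers = List.of(
            new Header(0, new int[] {}),   // symbol dictionary
            new Header(1, new int[] {0}),  // text region using dictionary 0
            new Header(2, new int[] {0})); // a later region reuses it too
        // Segment 0 is last needed by segment 2 -- unknowable at segment 1.
        System.out.println(lastUse(headers).get(0)); // prints 2
    }
}
```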

          Here is the post-decode cleanup commit.

          Subversion 19987
          icepdf\core\src\org\icepdf\core\pobjects\Stream.java
          icepdf\core\src\org\jpedal\jbig2\JBIG2Decoder.java
          icepdf\core\src\org\jpedal\jbig2\decoders\JBIG2StreamDecoder.java

          Mark Collette added a comment - edited

          I'm not sure it's worthwhile optimising the fact that the decoder takes its input as a byte[] instead of an InputStream. Technically it can take an InputStream, but it then derives a byte[] from it. It uses its StreamReader class, which acts as a sort of specialised pseudo input stream hard-coded to be backed by a byte[]. In theory, I could extract its methods into an interface and write one implementation that works off of our SeekableInput and another that works off of an InputStream, optimising both the filtered and unfiltered input cases. But since even a huge fax page typically only takes around 100 KB, I'm not sure how worthwhile this would be.
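          The interface extraction being weighed here could look something like the following. The BitReader interface and both implementations are hypothetical sketches of the idea, not StreamReader's actual methods or the library's API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of extracting a reader interface so both a byte[]-backed and an
// InputStream-backed implementation can feed the decoder. Names are
// illustrative stand-ins for StreamReader's operations.
public class ReaderExtractionSketch {
    interface BitReader {
        int readByte() throws IOException;              // next byte, or -1 at end
        void movePointer(int delta) throws IOException; // relative seek
    }

    // Mirrors the existing behaviour: the whole input held in memory.
    static class ByteArrayReader implements BitReader {
        private final byte[] data;
        private int pos;
        ByteArrayReader(byte[] data) { this.data = data; }
        public int readByte() { return pos < data.length ? data[pos++] & 0xFF : -1; }
        public void movePointer(int delta) { pos += delta; }
    }

    // Streaming variant: forward seeks only, but avoids buffering the input.
    static class StreamBackedReader implements BitReader {
        private final InputStream in;
        StreamBackedReader(InputStream in) { this.in = in; }
        public int readByte() throws IOException { return in.read(); }
        public void movePointer(int delta) throws IOException {
            if (delta < 0) throw new IOException("backward seek unsupported");
            in.skip(delta);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {10, 20, 30};
        BitReader a = new ByteArrayReader(data);
        BitReader b = new StreamBackedReader(new ByteArrayInputStream(data));
        a.movePointer(1);
        b.movePointer(1);
        System.out.println(a.readByte() + " " + b.readByte()); // prints "20 20"
    }
}
```

          Whether the streaming variant pays off depends on whether the decoder ever seeks backwards; if it does, only a SeekableInput-backed implementation would work for that path.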

          Mark Collette added a comment -

          The current optimisations seem good for now.


            People

            • Assignee:
              Mark Collette
              Reporter:
              Mark Collette
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: