Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0 - Beta
    • Fix Version/s: 4.0
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      All

      Description

      When trying to run the PDF-114 tagging on the ICEpdf 4.0 code-base, which includes the new text extraction and selection facilities, the tagging could not complete because the JVM would run out of memory. Investigation showed that memory usage stayed between 40 MB and 90 MB when parsing image-centric PDFs, but regularly climbed to between 325 MB and 785 MB when parsing text-heavy PDFs. Certain PDFs went beyond that and caused the out-of-memory errors.

      Investigation via memory profiler showed that the largest memory allocations were tied to GeneralPath objects, which were being used by the text facilities.

      The profiler investigation also showed that we were unnecessarily allocating memory with certain redundant reflection calls, which I fixed. That might speed up the code, but it did not fix the memory leak.
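
      The report does not say how the redundant reflection calls were removed. One common approach, sketched here purely as an assumption, is to cache the result of Class.forName() so that repeated lookups of the same name return the stored Class object. The class name ClassLookupCache and the method lookup() are hypothetical and are not part of ICEpdf.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not ICEpdf code: caches Class.forName() results by name.
final class ClassLookupCache {
    private static final ConcurrentMap<String, Class<?>> CACHE =
            new ConcurrentHashMap<String, Class<?>>();

    /** Returns the named class, or null if it cannot be found. */
    static Class<?> lookup(String name) {
        Class<?> cached = CACHE.get(name);
        if (cached != null) {
            return cached; // repeated lookups skip Class.forName() entirely
        }
        try {
            Class<?> loaded = Class.forName(name);
            CACHE.putIfAbsent(name, loaded);
            return loaded;
        } catch (ClassNotFoundException e) {
            return null; // failed lookups are not cached in this sketch
        }
    }

    public static void main(String[] args) {
        // Both calls resolve to the same Class object; the second is served from the cache.
        System.out.println(lookup("java.util.ArrayList") == lookup("java.util.ArrayList"));
    }
}
```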

      Investigation via code review showed that the Page's dispose method calls the Shapes' dispose method, which clears out the text. So, in theory, while the memory is being allocated, it should be cleared out as needed, and not cause any leaks that would explain the memory exhaustion.

      I theorised that the PageText object was somehow responsible for the leak, due to its ArrayList<LineText> object. I then prototyped a solution by heavily modifying PageText so that the ArrayList<LineText> was nulled out on dispose and reconstructed when called upon again. The solution worked: memory usage went down to between 9 MB and 85 MB. I then investigated the Java source code for ArrayList and found that the clear() method does, in effect, leak memory. It allows garbage collection of everything that had been in the ArrayList, but leaves the backing Object[] at its maximally allocated capacity. I then replaced the prototype solution with a single call to ArrayList.trimToSize() after ArrayList.clear(), which replaces the very large allocated Object[] with a very small one. That worked. The memory usage stayed low.
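
      The retained-capacity behaviour can be observed directly on java.util.Vector, which exposes capacity() (ArrayList has no public equivalent, but clear() and trimToSize() are documented to behave the same way there). This is an illustrative sketch rather than the actual ICEpdf change; TrimAfterClearDemo and clearAndTrim() are hypothetical names.

```java
import java.util.Vector;

// Hypothetical demo, not ICEpdf code: clear() keeps the backing array, trimToSize() releases it.
final class TrimAfterClearDemo {
    /** Empties the vector, then shrinks its backing array to match the (now zero) size. */
    static <T> void clearAndTrim(Vector<T> v) {
        v.clear();      // elements become collectable, but capacity is unchanged
        v.trimToSize(); // backing array is reallocated at the current size
    }

    public static void main(String[] args) {
        Vector<Integer> lines = new Vector<Integer>();
        for (int i = 0; i < 100000; i++) {
            lines.add(i);
        }
        lines.clear();
        System.out.println("capacity after clear(): " + lines.capacity());
        lines.trimToSize();
        System.out.println("capacity after trimToSize(): " + lines.capacity());
    }
}
```

      The same two-call pattern applies to ArrayList; only the capacity() inspection is specific to Vector.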

      I then searched for all occurrences of ArrayList.clear() and Vector.clear(), and where it made sense I added the corresponding trimToSize(). I also searched for any ArrayList and Vector creation where we did not specify the initial capacity, which would default to 10. In cases like coordinates, where we have Vectors of size 4, that can add up to a lot of wasted memory, so I added sane initial capacities wherever possible. In Parser I made two different fixes in the two places where we parse out all of our array Vectors, to either pre-size them exactly or trim them down to size. Some of these changes help reduce the page-to-page memory leak, and some reduce the memory used within a single page. Both types of improvement are quite useful.
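
      The two sizing strategies mentioned for Parser can be sketched as follows: pre-size when the element count is known up front, and trim once filling is finished when it is not. This is an assumption about the general shape of such a fix, not the Parser code itself; PresizeDemo, rectangle() and parseArray() are hypothetical names.

```java
import java.util.Vector;

// Hypothetical demo, not ICEpdf code: exact pre-sizing versus trimming after the fact.
final class PresizeDemo {
    /** Known size: allocate exactly four slots for a rectangle's coordinates. */
    static Vector<Float> rectangle(float x1, float y1, float x2, float y2) {
        Vector<Float> coords = new Vector<Float>(4);
        coords.add(x1);
        coords.add(y1);
        coords.add(x2);
        coords.add(y2);
        return coords;
    }

    /** Unknown size: collect the tokens, then drop any unused capacity. */
    static Vector<Object> parseArray(Iterable<?> tokens) {
        Vector<Object> out = new Vector<Object>();
        for (Object token : tokens) {
            out.add(token);
        }
        out.trimToSize();
        return out;
    }

    public static void main(String[] args) {
        System.out.println("pre-sized capacity: " + rectangle(0f, 0f, 612f, 792f).capacity());
        System.out.println("unsized capacity:   " + new Vector<Float>().capacity());
    }
}
```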

        Activity

        Mark Collette added a comment -

        To summarise:

        • Call trimToSize() after clear() on ArrayList and Vector objects
        • Pre-size ArrayList and Vector objects initially
        • Remove redundant Class.forName() calls

        Subversion 20202

        Mark Collette added a comment -

        PDF-114 tagging now uses even less memory than it did under ICEpdf 2.7. Before I had to use -Xmx750m and now it works with -Xmx512m.


          People

          • Assignee:
            Mark Collette
            Reporter:
            Mark Collette
          • Votes:
            0
            Watchers:
            0
