ICEpdf / PDF-317

Page.getText() is not returning all page text

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.2.1
    • Fix Version/s: 4.2.2
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      any

      Description

      A forum user has identified a bug where text is missing from the result of page.getText(). This method is used in the RI for the TextExtractionTask. It turns out that the "optimized" text extraction call does not initialize PDF XForm objects, so any text they contain is left out of the extraction.

      The content parser method parseTextBlocks() needs to be updated to ensure that XForm objects are correctly initialized and parsed.
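      The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class XFormTextSketch, its Op type, and the method parseTextBlocksSkippingForms are hypothetical and are not ICEpdf classes or the actual patch. It assumes a content stream is a list of operations that either show text or invoke a named XObject, and that a form XObject carries its own nested content stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these types belong to ICEpdf. */
class XFormTextSketch {

    /** One content-stream operation: show text, or invoke a named XObject. */
    static final class Op {
        final boolean showsText; // true: text operation, false: XObject invocation
        final String operand;    // the text shown, or the XObject name

        private Op(boolean showsText, String operand) {
            this.showsText = showsText;
            this.operand = operand;
        }

        static Op text(String s) { return new Op(true, s); }

        static Op invoke(String name) { return new Op(false, name); }
    }

    /** Buggy variant: every XObject invocation is skipped, so form text is lost. */
    static List<String> parseTextBlocksSkippingForms(List<Op> stream) {
        List<String> out = new ArrayList<>();
        for (Op op : stream) {
            if (op.showsText) {
                out.add(op.operand);
            }
        }
        return out;
    }

    /**
     * Fixed variant: still skips images (names absent from the forms map),
     * but descends into form XObjects to collect their text.
     */
    static List<String> parseTextBlocks(List<Op> stream, Map<String, List<Op>> forms) {
        List<String> out = new ArrayList<>();
        collect(stream, forms, out, 0);
        return out;
    }

    private static void collect(List<Op> stream, Map<String, List<Op>> forms,
                                List<String> out, int depth) {
        if (depth > 32) {
            return; // guard against forms that reference themselves
        }
        for (Op op : stream) {
            if (op.showsText) {
                out.add(op.operand);
            } else {
                List<Op> nested = forms.get(op.operand);
                if (nested != null) {
                    collect(nested, forms, out, depth + 1);
                }
                // Otherwise the XObject is treated as an image and skipped.
            }
        }
    }
}
```

      In this model, a page stream that shows some text and then invokes both an image and a form yields only the page-level text from the first method, while the second also returns the text nested inside the form. Image XObjects are skipped in both cases, which preserves the intent of the optimization.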

        Activity

        Patrick Corless added a comment -

        Updated the content parser's parseTextBlocks method to include xobject processing. This feels like deja vu, as I'm sure I've fixed this in the past. Regardless, text extraction now works correctly while keeping the optimization that skips page objects unrelated to text, such as images.


          People

          • Assignee:
            Patrick Corless
            Reporter:
            Patrick Corless
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved: