Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 4.0 - Beta, 4.0
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      All

      Description

      If anything new is added to a PDF file, or something is modified, we'll have to support appending an incremental update.

      Relevant issues:

      1. Linearised file becoming unlinearised. We don't support linearisation, and we'd probably want to avoid re-writing the entire file.
      2. If existing file uses cross reference streams, should appended section use cross reference stream or xref table? There are also hybrid files that use both.
      3. If existing file uses object streams, should new objects go into an object stream (which allows for compression), or should we simply write objects directly?
      4. When new objects are created, they have to be assigned object numbers. The next number to use comes from the existing xref table.
      5. Are we only adding objects, or will we be editing and deleting them? This affects the xref writing.
      6. If the original document was encrypted, the updated document must also be encrypted.
      7. Will the save feature be a standard or a pro feature?

      A. Know how to write a trailer
      B. Know how to write an xref table, and possibly a cross reference stream
      C. Possibly know how to write an object stream
      D. Know how to write whichever object is being added or modified
      E. Know how to write a Page object
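      Items A and B, together with issue 4, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the IncrementalUpdater code: the class and method names (XrefSketch, entry, appendSection) are hypothetical, every written object is assumed to have generation 0, and the /Encrypt and /ID trailer entries that an encrypted file would also need are left out. The next free object number is the previous trailer's /Size, since /Size is one greater than the highest object number in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;

class XrefSketch {
    /** One 20-byte in-use entry: 10-digit offset, 5-digit generation, 'n', 2-char EOL. */
    static String entry(long offset, int generation) {
        return String.format(Locale.ROOT, "%010d %05d n \n", offset, generation);
    }

    /**
     * Builds the xref section and trailer appended by an incremental update.
     * offsets maps object number -> byte offset of the new or changed object.
     * size is the new /Size, prevXref is the offset of the previous xref section,
     * startXref is the offset at which this xref section itself begins.
     */
    static String appendSection(SortedMap<Integer, Long> offsets, int size,
                                long prevXref, int rootObj, long startXref) {
        StringBuilder out = new StringBuilder("xref\n");
        List<Integer> nums = new ArrayList<>(offsets.keySet());
        int i = 0;
        while (i < nums.size()) {
            // Each run of consecutive object numbers becomes one subsection.
            int j = i;
            while (j + 1 < nums.size() && nums.get(j + 1) == nums.get(j) + 1) {
                j++;
            }
            out.append(nums.get(i)).append(' ').append(j - i + 1).append('\n');
            for (int k = i; k <= j; k++) {
                out.append(entry(offsets.get(nums.get(k)), 0)); // generation 0 assumed
            }
            i = j + 1;
        }
        out.append("trailer\n<< /Size ").append(size)
           .append(" /Prev ").append(prevXref)
           .append(" /Root ").append(rootObj).append(" 0 R >>\n")
           .append("startxref\n").append(startXref).append("\n%%EOF\n");
        return out.toString();
    }
}
```

      The /Prev entry is what chains the appended section back to the original xref, so the original bytes of the file never have to be rewritten.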

          Activity

          Patrick Corless added a comment -

          Encryption issue for action dictionary string writes.

          Mark Collette added a comment -

          Analysis of string encryption that we've agreed on:

          There are 6 cases for strings being written out: three string types (LiteralStringObject, HexStringObject, and java.lang.String), each in two main scenarios (the document was not encrypted, or it was).

          When parsed from a PDF, both LiteralStringObject and HexStringObject hold the raw original bytes from the PDF. That means there are 8 bits of information in each char.

          Let me go on a tangent for a second. I was previously worried about us corrupting the strings, since we read in bytes and convert to String, and who knows which platform-specific character encoding is used. I'm not worried anymore. In our Parser we're not using any String constructor that takes a byte[], so it doesn't apply. We're using StringBuffer, reading from an InputStream (a Reader would cause the same encoding problems), and casting each value to char. There's no inadvertent sign extension, since InputStream.read() returns an int holding only the one byte of data (0 to 255). When we go back to bytes, we're just casting back. So, bit-wise, nothing's being corrupted. There's still the issue of user input strings containing non-ASCII characters, which will need to be handled, but right now we only allow editing of URIs, which permit only a subset of ASCII characters.
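          The round trip described above can be checked with a small sketch. It is not the Parser code: the class and method names are made up for illustration, and StringBuilder stands in for StringBuffer. It reads bytes one at a time with InputStream.read(), casts each to char, then casts each char back to byte.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ByteCharRoundTrip {
    /** Reads raw bytes and stores each one in a char, with no charset decoding. */
    static String readAsChars(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            int b;
            // read() returns 0..255, or -1 at end of stream, so the cast
            // never sign-extends a byte such as 0xE9 into 0xFFE9.
            while ((b = in.read()) != -1) {
                sb.append((char) b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Casts each char back to a byte, keeping only the low 8 bits. */
    static byte[] toBytes(CharSequence s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }
}
```

          Feeding all 256 byte values through readAsChars and then toBytes should give back the identical byte array, which is the bit-wise guarantee the comment relies on.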

          Ok, back to encryption. In the unencrypted scenario, the LiteralStringObject holds the actual 7 bit ASCII string characters. Any 8 bit values are octal escaped. In the encrypted scenario, it holds 8 bit binary values. But there's no way to know whether a given LiteralStringObject is encrypted or not, since there are no flags per LSO. When we need to access the unencrypted values, we just decrypt it. If the PDF is encrypted, every LSO is encrypted. If the PDF is not encrypted, there's no key to decrypt with, so we pass through the underlying data. Either way, the LSO already holds whatever values, as bytes, that the IncrementalUpdater should just write out, without trying to process them.

          For HexStringObject, it's the same story: it either contains ASCII characters (because they're hex, so they're 0-9, a-f, A-F), or the 8 bit binary encrypted data. Likewise there's no way to know if one specific HSO is encrypted; you just assume that if the PDF is encrypted, the HSO is too. Either way, those raw bytes are what the IncrementalUpdater should output.

          In those two scenarios it's clear: if page/annotation/action/destination editing or creating involves making a LiteralStringObject or a HexStringObject from a java.lang.String returned from a Swing editor component, then it needs to do the proper escaping (LSOs do octal escaping of 8 bit values, HSOs do hex escaping of all values), then encrypt those bytes, and store the result in the LSO or HSO. Then the IncrementalUpdater can find them in the dictionaries and simply write out those bytes, with no processing.
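          The two kinds of escaping mentioned above could look roughly like this. It is a sketch only, with hypothetical names (PdfStringEscapes, escapeLiteral, toHex) that are not the LiteralStringObject or HexStringObject API. It assumes each char carries a single 8-bit value, as described earlier, and it also backslash-escapes parentheses and backslashes, which a literal string needs in order to stay well-formed. The encryption step is not shown.

```java
class PdfStringEscapes {
    /** Literal-string escaping: octal escapes for anything outside printable ASCII. */
    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i) & 0xFF; // assumption: only the low 8 bits are meaningful
            if (c == '(' || c == ')' || c == '\\') {
                sb.append('\\').append((char) c);
            } else if (c < 0x20 || c > 0x7E) {
                sb.append('\\').append(String.format("%03o", c)); // e.g. 0xE9 -> \351
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    /** Hex-string escaping: every value becomes two hex digits. */
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("%02X", s.charAt(i) & 0xFF));
        }
        return sb.toString();
    }
}
```

          On output, the literal form is wrapped in parentheses and the hex form in angle brackets.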

          That leaves java.lang.String objects found in the dictionary. The Parser would only make those as tokens for the stack, not for bona fide strings in an object. So our editing would be what's making them. We'll just make sure that our editing doesn't make java.lang.String objects, and only makes LiteralStringObjects, as described above. For now, the IncrementalUpdater will fail fast on java.lang.String values, to help us discover if we're incorrectly creating and storing them.
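          A fail-fast check of that kind might be shaped like the sketch below. Everything here is hypothetical: RawPdfString is a stand-in for LiteralStringObject and HexStringObject, and writeValue is not the actual IncrementalUpdater method. The point is only that a java.lang.String is rejected immediately instead of being written out unescaped and unencrypted.

```java
class FailFastWriter {
    /** Hypothetical stand-in for a string object that already holds its final bytes. */
    static final class RawPdfString {
        final String delimited; // already escaped, encrypted if needed, and delimited

        RawPdfString(String delimited) {
            this.delimited = delimited;
        }
    }

    /** Writes one dictionary value, refusing java.lang.String outright. */
    static void writeValue(Object value, StringBuilder out) {
        if (value instanceof String) {
            // A bare String means some editing code skipped the escape/encrypt step.
            throw new IllegalStateException(
                    "java.lang.String found in dictionary: " + value);
        }
        if (value instanceof RawPdfString) {
            out.append(((RawPdfString) value).delimited); // written with no processing
        } else {
            out.append(value); // numbers, names and so on, simplified here
        }
    }
}
```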

          Mark Collette added a comment -

          When editing pre-existing objects, and when creating new objects, we found that it's a best practice to have all dictionary derived objects be indirect references.

          For example, when altering an Annotation, if it has a direct Action, that has to become an indirect Action. The reason is how we do encryption. Let's say, in an encrypted PDF, there's a LinkAnnotation with a direct URIAction, and the user edits the URI. As the code stands, the new LiteralStringObject will get its Reference from the URIAction. But that will be null, since the action is direct. Encryption will likely fail, or decryption will fail, since the Parser will actually wire the reference correctly, from the top-level annotation object.

          Of course, there are other ways of dealing with that specific example, such as:
          1. If the string was indirect, only changing that, and carrying forward the reference.
          2. If the string was direct, but the action indirect, finding the annotation's reference.
          But that's more complicated, and misses the point that if we just adhere to the best practice, we'll skip over any similar issues.
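          Why the Reference matters can be illustrated with the per-object key derivation of the PDF standard security handler (the RC4 variant, Algorithm 1 in the PDF 1.7 specification): each string is encrypted with a key mixed from the file key and the object number and generation of the indirect object that contains it. The sketch below uses hypothetical names (ObjectKeySketch, objectKey) and is not the StandardEncryption code.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class ObjectKeySketch {
    /** Derives the RC4 key for strings and streams inside object (objNum, gen). */
    static byte[] objectKey(byte[] fileKey, int objNum, int gen) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(fileKey);
            // Low-order 3 bytes of the object number, then low-order 2 bytes
            // of the generation number, both least significant byte first.
            md5.update(new byte[] {
                (byte) objNum, (byte) (objNum >> 8), (byte) (objNum >> 16),
                (byte) gen, (byte) (gen >> 8)
            });
            byte[] digest = md5.digest();
            // The key is the first min(n + 5, 16) bytes of the digest.
            // (The AES variant mixes in extra salt bytes, omitted here.)
            return Arrays.copyOf(digest, Math.min(fileKey.length + 5, 16));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

          A string encrypted under one object's number but read back under another's gets a different key, which is the mismatch described above when the writer and the Parser disagree about which reference owns the string.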

          Mark Collette added a comment -

          Properly write out encrypted PDFs with encrypted strings. Fail fast when writing java.lang.String, so we can debug why there are java.lang.String objects in our data structures.

          Subversion 19907
          icepdf\core\src\org\icepdf\core\application\Capabilities.java
          icepdf\core\src\org\icepdf\core\pobjects\Document.java
          icepdf\core\src\org\icepdf\core\pobjects\StateManager.java

          Subversion 22029
          icepdf-pro\font-engine\src\org\icepdf\core\util\IncrementalUpdater.java

          Mark Collette added a comment -

          Fix encryption's conversion between bytes and chars.

          Subversion 19914
          C:\Documents and Settings\Mark Collette\IdeaProjects\ICEpdf3\core\src\org\icepdf\core\pobjects\HexStringObject.java
          C:\Documents and Settings\Mark Collette\IdeaProjects\ICEpdf3\core\src\org\icepdf\core\pobjects\LiteralStringObject.java
          C:\Documents and Settings\Mark Collette\IdeaProjects\ICEpdf3\core\src\org\icepdf\core\pobjects\security\StandardEncryption.java
          C:\Documents and Settings\Mark Collette\IdeaProjects\ICEpdf3\core\src\org\icepdf\core\util\Utils.java


            People

            • Assignee:
              Mark Collette
              Reporter:
              Mark Collette
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: