Bug 50852

Summary: [PATCH] Improve generation of PDFs with accessibility information
Product: Fop - Now in Jira Reporter: Martin K <martin.koegler>
Component: generalAssignee: fop-dev
Status: RESOLVED FIXED    
Severity: normal CC: vhennebert
Priority: P2    
Version: trunk   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: Patch 01/20
Patch 02/20
Patch 03/20
Patch 04/20
Patch 05/20
Patch 06/20
Patch 07/20
Patch 08/20
Patch 09/20
Patch 10/20
Patch 11/20
Patch 12/20
Patch 13/20
Patch 14/20
Patch 15/20
Patch 16/20
Patch 17/20
Patch 18/20
Patch 19/20
Patch 20/20
Handle empty nodes
Mark no longer relevant patches as obsolete
Treat some content as artifact
Treat some content as artifact – Documentation

Description Martin K 2011-03-02 06:55:11 UTC
The current accessibility implemenation does not scale and is slow. The following implementation is memory efficient (only structure information of pages in processing is kept in memory) and faster.

 [PATCH 01/20] container element for structure tree

Adds object for storing the structure tree

 [PATCH 02/20] Add methods for storing structure tree to area tree
 [PATCH 03/20] Helper functions for StructureElement construction
 [PATCH 04/20] Parse role attributes
 [PATCH 05/20] Implement mapping functions of FO objects to structure tree
 [PATCH 06/20] Store new structure tree in area tree
 [PATCH 07/20] Handle structure tree only in new format
 [PATCH 08/20] Remove unused code

These patches switch to the new internal format for the structure tree. Only the structure tree of unfinished page sequences is kept in memory, so it the overhead for documents with not too large page sequences is limited. The content of the structure-tree tag in area-tree / intermediate XML is slightly changed:

The page-sequences shows up as tag, so that it possible to give it roles [eg. Part does not  make sense for a one page document].

 [PATCH 09/20] Workaround: Some test cases don't like ptr

I don't recommend to add this patch, as the additional ptr attributes do no harm - but someone has to do some work on the testsuite otherwise.

 [PATCH 10/20] Avoid overhead of creating writers

Adding structure information to a 100MB PDF can mean adding 300-400 MB of structure information =>  Multi Mio of additional dictionaries are written => So many BufferedWriter are created. Getting rid of both stream representation improves performance large document and simplifies code as there is no Writer to forget to flush anymore.

This patch is not required for the following patches.

 [PATCH 11/20] Add support for clearing objects at write time
 [PATCH 12/20] Add support for lazy object number assignment
 [PATCH 13/20] Improve PDFArray/Dictionary
 [PATCH 14/20] Free structure tree ID map at the end of the page sequence
 [PATCH 15/20] Don't write empty leaf structure to PDF

Structure information must be written in between - freeing resources after writing reduces memory pressure. Assigning objects number latter allows to  prune empty structure elements.

 [PATCH 16/20] Simplify references to the text

Use a more compact reference representation, if possible

-----------------------------------------
Enhancements:

 [PATCH 17/20] Dupplicate static content structures for each pages

Static regions, which are put on multiple pages, generate strange results. Duplicating them in the structure tree yields to more logical results.

 [PATCH 18/20] Generate shorter PDF documents

Structure information means much larger PDFs - some bytes can be safed.

 [PATCH 19/20] New roles for accessibility

Role "hidden":
Kept whole subtree out of structure tree.
Role "inherit":
Hide this node in the structure tree and place content directly in the parent.

Both functions are necessary to create "beautiful" structure trees. The role names need to be discussed.

 [PATCH 20/20] Support role for Flow & PageSequence

Allow changing the appearance of Flow/PageSequence in the structure tree. Especially on small documents, the Structure Doc->Part->Sect can not be appropriate.

PageSequence allows only assigning a tag name, but (because of implementation/performance issues) it can not be removed.
Flow/static-content can be changed in every way.
Comment 1 Martin K 2011-03-02 06:56:43 UTC
Created attachment 26701 [details]
Patch 01/20
Comment 2 Martin K 2011-03-02 06:57:28 UTC
Created attachment 26702 [details]
Patch 02/20
Comment 3 Martin K 2011-03-02 06:58:14 UTC
Created attachment 26703 [details]
Patch 03/20
Comment 4 Martin K 2011-03-02 06:59:03 UTC
Created attachment 26704 [details]
Patch 04/20
Comment 5 Martin K 2011-03-02 06:59:46 UTC
Created attachment 26705 [details]
Patch 05/20
Comment 6 Martin K 2011-03-02 07:00:38 UTC
Created attachment 26706 [details]
Patch 06/20
Comment 7 Martin K 2011-03-02 07:01:14 UTC
Created attachment 26707 [details]
Patch 07/20
Comment 8 Martin K 2011-03-02 07:01:59 UTC
Created attachment 26708 [details]
Patch 08/20
Comment 9 Martin K 2011-03-02 07:02:29 UTC
Created attachment 26709 [details]
Patch 09/20
Comment 10 Martin K 2011-03-02 07:03:35 UTC
Created attachment 26710 [details]
Patch 10/20
Comment 11 Martin K 2011-03-02 07:04:19 UTC
Created attachment 26711 [details]
Patch 11/20
Comment 12 Martin K 2011-03-02 07:04:43 UTC
Created attachment 26712 [details]
Patch 12/20
Comment 13 Martin K 2011-03-02 07:05:34 UTC
Created attachment 26713 [details]
Patch 13/20
Comment 14 Martin K 2011-03-02 07:06:11 UTC
Created attachment 26714 [details]
Patch 14/20
Comment 15 Martin K 2011-03-02 07:06:52 UTC
Created attachment 26715 [details]
Patch 15/20
Comment 16 Martin K 2011-03-02 07:07:29 UTC
Created attachment 26716 [details]
Patch 16/20
Comment 17 Martin K 2011-03-02 07:08:25 UTC
Created attachment 26717 [details]
Patch 17/20
Comment 18 Martin K 2011-03-02 07:09:11 UTC
Created attachment 26718 [details]
Patch 18/20
Comment 19 Martin K 2011-03-02 07:09:37 UTC
Created attachment 26719 [details]
Patch 19/20
Comment 20 Martin K 2011-03-02 07:10:00 UTC
Created attachment 26720 [details]
Patch 20/20
Comment 21 Vincent Hennebert 2011-06-16 17:58:06 UTC
Hi Martin,

I finally got round to having a look at your patch. First, I’d like to 
thank you for having taken the time to create 20 smaller self-contained 
patches. This made the review much easier. So thanks for that!

The new data structure is indeed much more efficient than the DOM that 
FOP manipulates at the moment. Unfortunately, this doesn’t solve (AFAIU) 
the fundamental problem that we recently discovered: empty content may 
be the cause for a wrong final structure tree. Take the following table:

  <fo:table width="100%" table-layout="fixed">
    <fo:table-body>
      <fo:table-row>
        <fo:table-cell>
          <fo:block>Cell 1.1</fo:block>
        </fo:table-cell>
        <fo:table-cell>
          <fo:block>Cell 1.2</fo:block>
        </fo:table-cell>
      </fo:table-row>
      <fo:table-row>
        <fo:table-cell>
          <fo:block/>
        </fo:table-cell>
        <fo:table-cell>
          <fo:block>Cell 2.2</fo:block>
        </fo:table-cell>
      </fo:table-row>
    </fo:table-body>
  </fo:table>

The content of the first cell in the second row is empty, which will 
result into a TR structure element having only one TD kid for the second 
cell; that TD element will be mistakenly interpreted by a screen reader 
as belonging to the first column.

See also discussion here:
http://markmail.org/message/mn7jdbxmjdq7ey52

To solve this we need to integrate the handling of the structure tree 
into the normal processing chain (FO tree -> Layout Engine -> Area tree) 
instead of bypassing it.


That said, I have a few comments and questions relating to some of your 
specific patches:
[PATCH 10/20] Avoid overhead of creating writers
I can imagine that if there are a lot of PDF objects to stream, creating 
an instance of BufferedWriter and OutputStreamWriter for each of them 
may have quite some performance impact. However replacing them with 
calls to PDFObject.encode everywhere it is necessary is not really an 
option. This makes the code difficult to read and maintain, and is 
error-prone as it’s very easy to miss one call somewhere.

I think the problem of encoding text into the output should be solved by 
defining a specialized PDFOutputStream that would be able to stream both 
String and bytes. That PDFOutputStream would be passed around to objects 
that then wouldn’t have to handle their own wrapper or make calls to 
encode. Does that make sense?

[PATCH 11/20] Add support for clearing objects at write time
I’m wondering why this is necessary? Isn’t it just possible to null out 
references to the objects and let the garbage collector do the work?

[PATCH 12/20] Add support for lazy object number assignment
Same here, what’s exactly the purpose of lazy object number assignment?


Thanks,
Vincent
Comment 22 Martin K 2011-06-16 21:36:22 UTC
I'll split my answers.

Add "Add support for lazy object number assignment":

The codes creates an PDF structure tree containing objects, which will be trimmed away (eg. because they are empty leaf objects).

Without lazy object number assignment such objects will get an object number too, as they are assigned while creating these objects. It will increase the xref size and the xref entries for such numbers will be incorrect (as there is no support for free object numbers).

Lazy assignment is the simplest solution for this problem without rewriting larger parts of the code. For such objects, the assignment will happen, when the first reference to such objects (or the object itself) is written into the PDF.
Comment 23 Martin K 2011-06-16 21:51:39 UTC
Add "empty table cell problem":

This should be easy to solve:
In PDFStructElement.prepare() 
change 
        if (kids.length() == 0) {

If the test is extended to only trigger, if the current structure element is not a table cell [getStructureType() not in (TD, TH, ..) && kids.length() == 0], an empty table cell will be put in the structure tree.

Maybe PDFStructElement should get a trim empty leaf flag and the decision moved to the FOToPDFRoleMap.
Comment 24 Martin K 2011-06-16 22:32:37 UTC
Add "Avoid overhead of creating writers":

By default, nearly each FO tag results in an PDF structure element [= object], so there are lots of OutputStreamWriters created.

The trunk implementation with two objects (Writer+Stream) has the weak point, that forgetting to flush the writer before using the stream can lead to incorrect PDFs.

As OutputStream.write only supports bytes, any forgotten encode will result in an syntax error => My solution will catch errors at compile time.

> I think the problem of encoding text into the output should be solved by 
> defining a specialized PDFOutputStream that would be able to stream both 
> String and bytes. That PDFOutputStream would be passed around to objects 
> that then wouldn’t have to handle their own wrapper or make calls to 
> encode. Does that make sense?

If the PDFOutputStream is persistent (eg, not recreated for each batch of objects), it is an interesting idea.

The current code creates lots of CountingOutputStreams too, which would be needed to be wrapped in a PDFOutputStream. To overcome this, the new PDFOutputStream could count the total number of written bytes. Instead of using a CountingOutputStream, the code can store the start number of written bytes and use the difference to the end number of written bytes.

This is an optional enhancement, therefore I would currently like to concentrate on the other patches.
Comment 25 Martin K 2011-06-16 23:11:44 UTC
"Add support for clearing objects at write time":

The structure tree is one big document wide tree. The root is written after the last page. 

The structure tree has circular links (parent=>child, child=>parent) and external links to tree nodes (eg. from the per page element list). 

If we don't want to keep the whole thing into memory, we need to write parts of the tree and then purge all references to them.

Without this patch, a write of structure elements would be:
* add the object to the PDF document
* callback to the caller do the actual write or get the OutputStream somehow to do the write (The whole logic is in PDFLogicalStructureHandler, which currently only has access to the PDF document, not the output stream).
* Traverse the written structure elements again and break the loops

With this patch, a write of structure elements is only adding the objects to the document.
The next write operation will store the streams and call clean the objects afterwards. For a structure element, the clean operation is mostly removing it's links to child elements, which also results in breaking the loops.

The same logic could eg. be used for writing pages: The clean function of a pages would be deleting the page content.
Comment 26 Vincent Hennebert 2011-06-17 11:15:57 UTC
(In reply to comment #23)
> Add "empty table cell problem":
> 
> This should be easy to solve:
> In PDFStructElement.prepare() 
> change 
>         if (kids.length() == 0) {
> 
> If the test is extended to only trigger, if the current structure element is
> not a table cell [getStructureType() not in (TD, TH, ..) && kids.length() ==
> 0], an empty table cell will be put in the structure tree.

I tried commenting out all the code in that method that removes elements and it does not work.

Anyway, as described in the mailing list discussion that I linked to in comment #21, I think the structure tree should not be trimmed, as there is structural information there that would be necessary for a round-trip process. But in the current design there is no way to create the right order of text and other structure elements.


> Maybe PDFStructElement should get a trim empty leaf flag and the decision moved
> to the FOToPDFRoleMap.
Comment 27 Martin K 2011-06-18 07:51:42 UTC
Created attachment 27171 [details]
Handle empty nodes

The attached patch should be able to handle both corner cases for empty blocks/table-cells (at least my testcase).

Empty blocks can have a meaning in the structure tree. Or they are created for technical reasons without any meaning. I created an interface for this policy decision in FOToPDFRoleMap. The patch includes a simple test policy - the final policy will require further discussion.
Comment 28 Glenn Adams 2012-04-07 01:43:46 UTC
resetting P2 open bugs to P3 pending further review
Comment 29 Glenn Adams 2012-04-08 09:11:08 UTC
increase priority due to presence of a patch
Comment 30 Glenn Adams 2012-04-08 19:15:53 UTC
vincent/peter, what is the status of this patch with your recent accessibility improvements? have these patches been applied? if not, then do expect to apply them? (when?)
Comment 31 Glenn Adams 2012-04-24 05:53:04 UTC
(In reply to comment #30)
> vincent/peter, what is the status of this patch with your recent accessibility
> improvements? have these patches been applied? if not, then do expect to apply
> them? (when?)

vincent/peter?
Comment 32 Vincent Hennebert 2012-04-24 14:53:44 UTC
(In reply to comment #31)
> (In reply to comment #30)
> > vincent/peter, what is the status of this patch with your recent accessibility
> > improvements? have these patches been applied? if not, then do expect to apply
> > them? (when?)
> 
> vincent/peter?

Some of those patches may still be worth applying as they may bring improvements in areas that were not touched by the work on accessibility. I'll need to have another look.
Comment 33 Vincent Hennebert 2012-05-23 18:54:42 UTC
Created attachment 28824 [details]
Mark no longer relevant patches as obsolete

The accessibility code has been refactored to fix the fundamental limitation that empty content was not properly handled. It also suppresses the need for an intermediate representation of the structure tree and directly generates the appropriate PDF structure elements from the FO tree.

Marking as obsolete the patches that are no longer relevant in the context of the new implementation. (Adding an empty patch is the quickest way I found to avoid editing each attachment individually.)
Comment 34 Vincent Hennebert 2012-05-23 19:02:56 UTC
Created attachment 28825 [details]
Treat some content as artifact

This is an adaptation of patch #19 to the new accessibility code. I chose the value 'artifact' instead of 'hidden' to follow the naming convention of the PDF Reference.

I also chose to implement this special property value on fo:wrapper only, for simplicity. It should be easy to wrap any content that must be treated as artifact into an appropriate fo:wrapper element.

I didn't implement the 'inherit' value, as I believe the PDF Reference already provides a standard structure type ('NonStruct') that allows to achieve a similar result. The node is not removed from the structure tree but the end result should be the same.

Any feedback is welcome. If I don't hear anything within 3 working days, I'll assume lazy consensus and apply the patch.

Thanks,
Vincent
Comment 35 Vincent Hennebert 2012-05-23 19:04:47 UTC
Created attachment 28826 [details]
Treat some content as artifact – Documentation

This patch is against the staging CMS website, that sooner or later will become the official website.
Comment 36 Vincent Hennebert 2012-05-23 19:27:44 UTC
Comment on attachment 26718 [details]
Patch 18/20

Applied patch #18 in rev. 1341992:
http://svn.apache.org/viewvc?rev=1341992&view=rev

I left the criteria for compact output in PDFDictionary to 2, as we have to decide on a size and I don't think the savings are significant anyway (2 bytes per entry).

Thanks for the suggestion,
Vincent
Comment 37 Vincent Hennebert 2012-05-24 10:50:30 UTC
Comment on attachment 26717 [details]
Patch 17/20

I must mark this patch as invalid. The structure tree corresponds to the logical structure of the document and is not meant to reflect the way it is laid out on pages. So the static-content cannot be duplicated in the structure tree as many times as there are pages in the final output.

As hinted by the PDF Reference, the proper way to deal with content that is repeated over pages is to treat it as artifact. This will become possible once patches #20 and artifact have been handled.

Vincent
Comment 38 Vincent Hennebert 2012-05-25 10:11:35 UTC
Comment on attachment 26716 [details]
Patch 16/20

This patch does no longer apply after the modifications brought to the accessibility code. Feel free to submit an updated patch, although the matter has been made slightly more complicated by the presence of placeholder structure elements. Also, it would be good to improve encapsulation, immutability and robustness by avoiding some of the tests (instanceof PDFDictionary, type equal to "/MCR"). Maybe a method like addMarkedContentKid on PDFStructElem would do. That would also avoid creating an instance of PDFDictionary, only to discard it later on because the reference to the kid can be simplified.

This simplification has been made less critical by the implementation of object streams, that allow to achieve a much more significant reduction of the file size. It may be of interest in cases where for some reason the user has to stick to PDF 1.4. But I'm not sure how common such a use case is.

Thanks,
Vincent
Comment 39 Vincent Hennebert 2012-05-25 15:18:57 UTC
Comment on attachment 26720 [details]
Patch 20/20

I applied the changes suggested by this patch in rev. 1342680:
http://svn.apache.org/viewvc?rev=1342680&view=rev

I adapted the changes to the new accessibility code and I also implemented support for the intermediate format.

Thanks,
Vincent
Comment 40 Vincent Hennebert 2012-05-29 09:53:33 UTC
I applied the artifact patch in rev. 1343632:
http://svn.apache.org/viewvc?rev=1343632&view=rev

This fixes the last remaining issue for this bug.