Bug 46705

Summary: [PATCH] Enhancement: PDF Accessibility
Product: Fop - Now in Jira Reporter: Jost Klopfstein <jost.klopfstein>
Component: pdfAssignee: fop-dev
Status: CLOSED FIXED    
Severity: enhancement CC: benoit.wiart, vinayak.patil1
Priority: P2 Keywords: PatchAvailable
Version: trunk   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: PDF accessibility
Documentation - will be merged into trunk by Jeremias
Fixed NPE when used from command line
Removed unused code in area/inline/Image & related
added compiled stylesheets for improved performance in server environments (suggested by Jeremias)

Description Jost Klopfstein 2009-02-12 02:36:31 UTC
Created attachment 23250 [details]
PDF accessibility

See attached html file and patch
Comment 1 Jost Klopfstein 2009-02-12 02:45:47 UTC
Created attachment 23251 [details]
Documentation - will be merged into trunk by Jeremias
Comment 3 Chris Bowditch 2009-02-16 02:52:44 UTC
Hi Jost,

I've just been trying out your patch. I applied the patch to a fresh check out of the new IF branch, compiled it okay. But when I tried to generate a PDF from the command line I got the following NPE:

INFO: Accessibility is enabled
16-Feb-2009 10:25:44 org.apache.fop.cli.Main startFOP
SEVERE: Exception
java.lang.NullPointerException
        at org.apache.fop.apps.Fop.getDefaultHandler(Fop.java:138)
        at org.apache.fop.cli.InputHandler.renderTo(InputHandler.java:126)
        at org.apache.fop.cli.Main.startFOP(Main.java:174)
        at org.apache.fop.cli.Main.main(Main.java:205)

Thanks,

Chris
Comment 4 Jeremias Maerki 2009-02-16 05:54:06 UTC
Generally, the patch looks pretty good to me, besides the NPE which I know Jost is already working on a fix for. That particular bug occurs because the stylesheets get loaded via java.io.File instead of Class.getResource(AsStream).

I've got a few nits which I'll gladly address myself after we've applied to patch to a temporary branch:

- FOUserAgent: set/getResult are not very speaking names for what they do. It might also be better to just save a DOM instead of a byte array.
- area/inline/Image: ptr is normally set as a trait in the area tree but not for the Image area. That can be homogenized.
- I'd rename fox:alt to a more speaking fox:alt-text.
- Some classes have just non-semantic changes (like commented code that is never used or unnecessary "final" modifiers which look like they have come from "Save Actions" in Eclipse). I'll remove those while processing the patch to reduce the noise.
- There are some backwards-incompatible changes in the Renderer implementations which are avoidable.
- I'd like to avoid adding the "ptr" attribute to the IFPainter methods since most implementations don't support that anyway. I'd rather do it via the IFContext like I've done for the foreign attributes.
- Static code analysis indicates a bug in PDFImageHandlerSVG concerning the save/restore pairs.
- I'd move the FO -> PDF Struct Type mapping (in PDFStructElem) out of the PDF library into the render/pdf package (separation of concerns and opening an option for a later implementation of a custom role map).

Please speak up if anyone objects to any of these proposed changes.
Comment 5 Jost Klopfstein 2009-02-16 12:44:31 UTC
Created attachment 23261 [details]
Fixed NPE when used from command line
Comment 6 Jost Klopfstein 2009-02-16 13:06:21 UTC
Created attachment 23262 [details]
Removed unused code in area/inline/Image & related

[Jeremias:]area/inline/Image: ptr is normally set as a trait in the area tree but not
Comment 7 Vincent Hennebert 2009-02-17 03:00:27 UTC
That's great new functionality. I've only had time to look at a part of the patch so far, so I'm posting my few comments below. More later if I have time.

In the o.a.f.accessibility package:
- why are there two classes? Only the sub-class seems to be externally used, so it may as well be merged with the super-class. If some more general functionality turns out to be necessary, the extraction of a common super-class can always be done later.
- Likewise, only one constructor seems to be used ATM. If other constructors are needed, they can also be added later on.
- AFAIU mTranHandler is set in all of the constructors. Why systematically test it for null in the methods then?

There is an encapsulation problem in o.a.f.apps.Fop.getDefaultHandler(): the lines of code dealing with accessibility should be moved to the accessibility package.

In FOUserAgent:
- the "accessibility" string should be defined only once in a public final static field (ideally somewhere inside the accessibility package)
- there's no need to explicitly put false for the accessibility option in the constructor, as the value is tested for null in the accessibilityEnabled method

In FopFactoryConfigurator.configure():
- any reason why the accessibility option is not handled like the other options? (using getValueAsBoolean)

And a nit about coding style: you seem to be using Hungarian notation to name fields ('mTheClassField'). I don't have anything against this notation, but it's not used inside the rest of FOP's codebase. For consistency we may want to agree upon a notation. FWIW I tend to think that, thanks to modern development environments that now highlight class members, this notation is no longer so useful.

Thanks,
Vincent
Comment 8 Andreas L. Delmelle 2009-02-17 12:51:03 UTC
Interesting stuff! I've also read some of the Wiki, and one thought that popped in my head:
Do we really need the extension 'ptr'? Could this be simplified by using the standard 'id' property perhaps? 

IIUC, I definitely think so, but that would mean introducing:
a) auto-generated and assigned initial value for the 'id' property (mandated by the Rec, BTW: "the initial value ... is a random and unique identifier..." --XSL-FO 7.30.8 "id")
b) a way for the FONodes to echo themselves back into SAX events, so they can be sent through reduceFOTree.xsl (maybe that reduction could even be hardcoded into FOP)

a) would be relatively easy to implement. I've already been playing with that in the past. It may have a noticeable impact on performance for large documents, if we're not careful, since the set of ids to check becomes entirely dependent on the number of FO nodes, whereas currently, that set is limited to the ids that have been specified.

b) would require more effort. Something we have never needed so far, but may be a worthwhile addition, if you think about it. The emitted FO will not correspond exactly (1-1) with the source, since shorthands will be expanded, relative specs may have been resolved and such, but still...

That said, maybe to emulate that effect with the current proposal of a pre-processing stylesheet (minimal effort), we could already do something like:

[in addPtr.xsl -> addId.xsl]

<xsl:template name="addId">
  <xsl:element name="{name()}" namespace="{namespace-uri()}">
    <xsl:apply-templates select="@*"/>
    <xsl:if test="not(@id)">
      <xsl:attribute name="id">
        <xsl:value-of select="generate-id()" />
      </xsl:attribute>
    <xsl:apply-templates />
  </xsl:element>
</xsl:template>

May have unwanted side-effects only in case the author happens to have an explicit id that is identical to the value returned by generate-id() (but that's a bit of a stretch ;-))

[in reduceFOTree.xsl]

<xsl:template match="*">
  <xsl:element name="{name()}">
    <xsl:copy-of select="@id" />
  </xsl:element>
</xsl:template>

If that could make it even simpler (one less extension attribute necessary), I'm all for it... 
I'm not sure I'm comfortable with introducing a native extension, scattered in so many classes, while we already have the standard 'id' property conveniently available in FObj, which seems to serve the purpose nicely.

Another small detail: why exactly do we need the explicit value of the xml:lang shorthand in Block.java? IIC, Block should already have the implied language and country available (via getCommonHyphenation()). Just asking, since if you look closely, you will notice that none of the shorthands are stored on the FObjs (only the expanded, native FO properties), so making this exception should have a good reason.
Comment 9 Jost Klopfstein 2009-02-17 22:10:25 UTC
Hi Andreas,

The approach with the @id seems a bit risky. The @ptr replacement has to have unique values. What happens if a user has a few empty @id in his FO or multiple times the same value for @id?

Cheers, Jost
Comment 10 Andreas L. Delmelle 2009-02-17 22:21:47 UTC
(In reply to comment #9)

> The approach with the @id seems a bit risky. The @ptr replacement has to have
> unique values. What happens if a user has a few empty @id in his FO or multiple
> times the same value for @id?

Could be, and I'm definitely not blocking the proposal on this count. Just an idea.
Having multiple times the same value for an id will cause an error, since @id has to be unique for every element. If it is not specified, then it could be added using a stylesheet, initializing it with generate-id(), just like you propose for @ptr.

The only thing FOP currently does not do is generate and assign ids automatically, but the uniqueness constraint is already checked for.
Comment 11 Jeremias Maerki 2009-02-18 01:20:59 UTC
Jost asked me off-list about my opinion on Vincent's feedback. I'll put it here. Generally good input. The merging of TransformerNode and TransformerNodeEndProcessing is probably a matter of taste. I personally prefer keeping the general functionality seperate from the concrete functionality of the subclass. As mentioned before, I'd like to put this new code into a branch before putting it in Trunk. So we can also address these points in a second step. I assume we all agree that this is great new functionality and if it's applied now (in a separate branch) we can all help improve the whole thing rather than off-loading so many little things on Jost. He can of course continue to send follow-up patches.

On Andreas' comments: I didn't realize that the spec says to initialize the "id" property to an initial value. Of course, it would make sense to use that instead of the foi:ptr. I am, however, concerned about the performance implications for a collision detection which I don't think should be ignored. Also, the area tree and AT XML will increase in size of all elements have IDs. BTW, Andreas' proposal for adding the IDs is not entirely correct. Not all FO element take an "id" property. Maybe this should be done inside the FO tree instead of the addPtr.xsl.
Comment 12 Andreas L. Delmelle 2009-02-18 11:28:58 UTC
(In reply to comment #11)
> 
> On Andreas' comments: I didn't realize that the spec says to initialize the
> "id" property to an initial value. Of course, it would make sense to use that
> instead of the foi:ptr. I am, however, concerned about the performance
> implications for a collision detection which I don't think should be ignored.

Well, what I'm personally a bit concerned about is the case where you have:
1) auto-generated ids by the 'main' stylesheet
2) additional auto-generated ids by the 'pre-processing' stylesheet

Since we have two passes, are we still 100% certain that the XSLT processor will always return unique values that are used nowhere as an id elsewhere in the document? 
For explicit ids, this problem would also exist, but AFAIK, no one specifies explicit ids like "N897654", which is a format returned by some generate-id() implementations.

If the answer to my above question is yes, then there is no real overhead (on the FOP-side) involved for collision-detection, since the XSLT processor will take care of that (or even the XML parser: Is 'id' not a standard XML attribute? Or does it have nothing to do with the standard 'id()' XPath function? I'd have to check in the respective specifications to be sure...)

> Also, the area tree and AT XML will increase in size of all elements have IDs.

True, but then again, doesn't the foi:ptr also have to be carried through to the area tree or IF?
Maybe to address this concern, we could limit the case where the ids are actually auto-generated to this one: when accessibility is enabled. IOW, if and when it is needed (not default behavior).

> BTW, Andreas' proposal for adding the IDs is not entirely correct. Not all FO
> element take an "id" property. 

Again, very true, but we do implement getId() in FObj. That currently means that, although the property may not apply to a given element, like fo:declarations, as long as the Java object extends FObj, the id property will be bound and available if it is specified. 
Checking closer, Declarations.java overrides bind() rather than relying on the superclass implementation. Removing that empty implementation would be enough to make it work.

Not 100% compliant, but it seems like this actually may have added value here.

> Maybe this should be done inside the FO tree instead of the addPtr.xsl.

What I was thinking indeed. Something like a dedicated IDPropertyMaker that extends StringProperty.Maker, and would offer an implementation for make(FObj, PropertyList) that returns a 'proper' initial value. I once implemented it as such (and committed it, but it was a bit too quick and caused other issues, so it was reverted almost immediately)

Downside is that, if we implement it like that in the FO tree, it would also be necessary to implement the b) part in my earlier response, as the ids will never be available in the raw FO source, so reduceFOTree.xsl would not yet 'see' them (unless they would be added by a pre-processing stylesheet, like Jost proposes)
Comment 13 Jeremias Maerki 2009-02-18 11:50:41 UTC
(In reply to comment #12)
<snip/>
> Since we have two passes, are we still 100% certain that the XSLT processor
> will always return unique values that are used nowhere as an id elsewhere in
> the document? 

I would assume: no.

<snip/>
> True, but then again, doesn't the foi:ptr also have to be carried through to
> the area tree or IF?
> Maybe to address this concern, we could limit the case where the ids are
> actually auto-generated to this one: when accessibility is enabled. IOW, if and
> when it is needed (not default behavior).

Agreed.

<snip/>

Given all these observations, I think it would make sense to take the XSLT approach first (add an XSLT-generated ID if non exists) and then gather experience with this approach. This is easy to implement and can be extended later if we see the need.

As I've heard no objections to my proposal from Monday, yet, I'll process Jost's patch (as is) tomorrow morning (my local time) into a branch off the IF branch. From there, we can take it further and incorporate all the feedback. That also makes it easier to track what is changed.
Comment 14 Andreas L. Delmelle 2009-02-18 15:22:09 UTC
(In reply to comment #13)
> (In reply to comment #12)
> 
> I would assume: no.

FWIW, just checked the XML and XPath specifications, and that one question regarding 'id' is answered. If one specifies an 'xml:id', that would be a different matter entirely. Since, in XSL-FO, the attribute is in the default namespace (think 'fo:id'), they have nothing to do with each other.

I agree with your assumption. There seems to be absolutely no guarantee that generate-id() returns unique values across the two passes. Even worse, if the processor would implement generate-id() as a deterministic function (for example: if the id value is solely based on an index which is incremented with every element), then in both passes, generate-id() would always return something like "N0" for the root node, and the risk of collisions could become significant...

Placing the stress on the XSLT processor could offer a way out:

[in addPtr.xsl]
<!-- key for the explicit ids -->
<xsl:key name="idKey" match="fo:*" use="@id" />
...
<xsl:template name="addId">
  <xsl:param name="checkId" select="generate-id()" />
  <xsl:choose>
    <xsl:when test="key('idKey',$checkId)">
       <xsl:call-template name="addId">
         <xsl:with-param name="checkId" select="generate-id(key('idKey',$checkId))" />
       </xsl:template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$checkId" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
...
<xsl:attribute name="id">
  <xsl:call-template name="addId" />
</xsl:attribute>
...

Untested for performance. The mere addition of the key could be problematic, as the first call to key('idKey',...) cannot be evaluated unless the entire source document is scanned.
I keep looking at that recursive call, and wonder whether that could lead to problems. Strictly speaking, if the id starting the recursion will always be unique (by definition so), then the recursion should always stop at some point, and the path should always run over different nodes.
The depth will, in the most extreme case, be equal to the maximum number of collisions, which is the same as the number of nodes that already have an explicit id, or the size of the xsl:key map.

Assuming that the explicit ids added in the first pass adhere to the uniqueness constraint:
If, for any given node, the recursion would go through all possible collisions, then that would mean that the collision-values all refer to each other. That is, every value returned by generate-id() for a node in the key-map, is already specified as an id on another node in that same map. That can happen at most N times for N nodes with explicit ids, and it already happened once, for a node not in that map, to trigger the recursion. So, there will then be one of those N nodes for which the generated id is not yet occupied. Moreover, this most extreme case can present itself at most for one node in the document. The more collisions with nodes not in the map, the less deep the recursions will go.

> <snip/>
> 
> Given all these observations, I think it would make sense to take the XSLT
> approach first (add an XSLT-generated ID if non exists) and then gather
> experience with this approach. This is easy to implement and can be extended
> later if we see the need.

Agreed.

Comment 15 Jost Klopfstein 2009-02-19 02:23:01 UTC
Created attachment 23278 [details]
added compiled stylesheets for improved performance in server environments (suggested by Jeremias)
Comment 16 Jeremias Maerki 2009-02-19 11:58:34 UTC
I've applied the patch (2009-02-19 02:23 PST) to a new branch as discussed (with modifications). The code is now under:
https://svn.apache.org/repos/asf/xmlgraphics/fop/branches/Temp_Accessibility
http://svn.apache.org/viewvc?rev=745949&view=rev

I'll continue working on the code when my batteries are filled again.

Jost, thanks a lot for your work here. This is great new functionality. It will still take a bit to polish a few areas. I hope to see you become a regular contributor now that the hardest part is over. Please don't hesitate to publish follow-up patches against the branch (or trunk if you have other ideas). I'll keep the issue here open until we've adressed all the feedback, after which we'll see to merging this into trunk.
Comment 17 Jeremias Maerki 2009-04-17 07:42:17 UTC
Most of the feedback has now been incorporated into the development branch. One notable exception is the discussion around replacing foi:ptr with the "id" property but I'm running out of time to work on this. Another point that I'm not entirely happy with is the fact that the enriched FO is fully buffered in a byte array before FOP starts processing. An idea here would be to build a reduced structure tree per page-sequence and attach it as such to the area tree instead of passing it to PDFDocumentHandler through the user agent. Anyway, when accessibility is disabled (the default), FOP does not suffer any performance drawbacks. So, IMO, the development branch could be merged with the Trunk. At any rate, the end-user documentation is now also in place. I've added a hopefully comprehensive list of limitations in addition to what Jost has written up.
Comment 18 Andreas L. Delmelle 2009-06-05 12:02:37 UTC
*** Bug 47130 has been marked as a duplicate of this bug. ***
Comment 19 Vincent Hennebert 2009-06-26 03:48:13 UTC
I've been (slowly) reviewing the patch and I noticed that classes in the layoutmgr package are affected by the changes. There's no reason why it should be the case, layout has little to do with accessibility.
In fact the layout managers just pass the ptr trait over to the area tree. This calls for a more generic way of doing this. Indeed there are other properties that don't affect layout but only areas (color, z-index among others).

I'll try to see how this can be handled. Just thought I'd made a note for later.

Vincent
Comment 20 Vincent Hennebert 2009-09-17 09:07:20 UTC
Note: I've been asked to look at integrating this patch into Trunk, but also to implement tagged PDF in the legacy PDFRenderer. More later.

Vincent
Comment 21 Vincent Hennebert 2009-09-22 04:59:24 UTC
I've just noticed that the structure tree stored in the IF XML is not re-parsed. Running FOP with the IF as input (-afin) and the -a switch even leads to a NPE. So tagged PDF is currently not supported when processing the FO document in two steps, first to the intermediate format then to PDF.
Comment 22 Vincent Hennebert 2009-10-01 04:02:45 UTC
(In reply to comment #21)
> I've just noticed that the structure tree stored in the IF XML is not
> re-parsed. Running FOP with the IF as input (-afin) and the -a switch even
> leads to a NPE. So tagged PDF is currently not supported when processing the FO
> document in two steps, first to the intermediate format then to PDF.

This has now been addressed:
http://svn.apache.org/viewvc?rev=819585&view=rev
Comment 23 Vincent Hennebert 2009-10-01 04:05:43 UTC
The number tree corresponding to the ParentTree entry in the StructTreeRoot object is invalid, as the values are directly stored in the array instead of being reference. See discussion on fop-dev:
http://markmail.org/thread/oijggvxzfgjyewwk
Comment 24 Vincent Hennebert 2009-10-01 04:10:58 UTC
Are images supposed to work? I've tested a simple document that contains an fo:external-graphic referring to a PNG image. The alt-text isn't read aloud by Acrobat, and the image doesn't appear in the list of tags, nor the Order tab.
Comment 25 benoit.wiart 2009-10-01 08:19:20 UTC
table-and-caption and table-caption are not supported by FOP but they do not crash the engine
On the Accessibility branch, a file with a table-and-caption / table-caption will generate a stacktrace (NPE) because foi:ptr is missing on those elements.
Comment 26 benoit.wiart 2009-10-01 09:03:41 UTC
the NPE is at the line
structElemType.put(ptr, structElem.get("S").toString());

structElem.get("S") returns null

it seems to be caused by FOToPDFRoleMap.mapFormattingObject(s, parent) : 
the STANDARD_MAPPINGS map does not know table-and-caption and table-caption
Comment 27 Jeremias Maerki 2009-10-01 09:13:20 UTC
(In reply to comment #24)
> Are images supposed to work? I've tested a simple document that contains an
> fo:external-graphic referring to a PNG image. The alt-text isn't read aloud by
> Acrobat, and the image doesn't appear in the list of tags, nor the Order tab.

Yes, they are supposed to work. One of my accessibility test cases I've just run through cause the alt-text for the images to be read out loud. I'm running off a current and clean checkout. If you want to send me some PDF for analysis, I'm happy to take a look.
Comment 28 Vincent Hennebert 2009-10-01 09:34:41 UTC
(In reply to comment #27)
> (In reply to comment #24)
> > Are images supposed to work? I've tested a simple document that contains an
> > fo:external-graphic referring to a PNG image. The alt-text isn't read aloud by
> > Acrobat, and the image doesn't appear in the list of tags, nor the Order tab.
> 
> Yes, they are supposed to work. One of my accessibility test cases I've just
> run through cause the alt-text for the images to be read out loud. I'm running
> off a current and clean checkout. If you want to send me some PDF for analysis,
> I'm happy to take a look.

Must be my local changes then. My local copy is getting a bit messy. I'll have another look with a fresh checkout.

But why are images handled like text, using marked-content sequences? IIUC they should be handled as PDF Objects (see "PDF Object as Content Items" in section 9.6.3 of the PDF Reference, Third Edition).

Thanks,
Vincent
Comment 29 Jeremias Maerki 2009-10-02 00:49:27 UTC
(In reply to comment #28)
> But why are images handled like text, using marked-content sequences? IIUC they
> should be handled as PDF Objects (see "PDF Object as Content Items" in section
> 9.6.3 of the PDF Reference, Third Edition).

Ideally maybe, but not necessarily so IMO. See the note on page 599 of PDF 1.4: "If it is important to distinguish between multiple renditions of the same XObject on the same page, they should be accessed via marked content sequences enclosing particular invocations of the Do operator, rather than via object references." So the pattern is not illegal. FOP can reuse the same XObject multiple times on the same or on different pages.
Comment 30 Vincent Hennebert 2009-10-09 02:29:20 UTC
Jost or Jeremias,

In the PDFPainter.prepareImageMCID method, there is a "fix for Acro Checker" comment. Do you have a sample FO file triggering that bug in Acrobat? I haven't been able to reproduce it.

Thanks,
Vincent
Comment 31 Vincent Hennebert 2009-10-13 10:15:35 UTC
Hi BenoƮt,

(In reply to comment #25)
> table-and-caption and table-caption are not supported by FOP but they do not
> crash the engine
> On the Accessibility branch, a file with a table-and-caption / table-caption
> will generate a stacktrace (NPE) because foi:ptr is missing on those elements.

This has been fixed in revision 824845:
http://svn.apache.org/viewvc?rev=824845&view=rev

Thanks for reporting the bug,
Vincent
Comment 32 Vincent Hennebert 2010-01-11 08:21:23 UTC
Separate bug (48518) has been opened about the ID discussion. Closing this one.
Comment 33 Glenn Adams 2012-04-01 06:26:21 UTC
batch transition pre-FOP1.0 resolved+fixed bugs to closed+fixed