Bug 60329

Summary: Avoid NPE when styleid is null
Product: POI Reporter: Tim Allison <tallison>
Component: XWPFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.16-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: File provided by Seva Alekseyev on TIKA-2152
mylyn/context/zip

Description Tim Allison 2016-11-01 18:55:51 UTC
On TIKA-2152, Seva Alekseyev submitted a docx file that triggers an NPE when trying to get the styleid.  If I read OOXML Ecma 1 [1] correctly, styleid is optional.  We should try to avoid an NPE if the styleid is null.

[1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-376,%20Fourth%20Edition,%20Part%201%20-%20Fundamentals%20And%20Markup%20Language%20Reference.zip : pdf page 3861/printed page 3851
Comment 1 Tim Allison 2016-11-02 11:07:15 UTC
From Mark Murphy on our dev list:

Seems to me that style ID is not optional because it is used to refer to the style from elsewhere in the document, and the style ID must also be unique. However, if the style ID is missing from the document, one should be assigned in any manner we choose, and it won't be able to be referenced.
That is ok as the document is mal-formed.
Comment 2 Tim Allison 2016-11-02 11:07:52 UTC
Sounds good.  Do we want to auto-generate a styleid and risk collision with an actual styleid (chances would be very, very low) or handle the null or simply not load a style that doesn't contain an id?
Comment 3 Mark Murphy 2016-11-02 11:37:00 UTC
Can you attach the document in question? I want to see what Word does, though that might be a different situation as Word could assign an ID and allow a user to interactively assign the style to an element in the document. 

The applicable use cases in POI would be to read a document, update it, and save it. Do we have (or want) a facility in POI to reference a style by some means other than its ID. If so, we should go ahead and generate a new ID. I suggest something like a negative 10 digit number. The spec also indicates that if a document has two styles with the same ID, the first one is to be used. For our case I think we should check for duplicate style ID's on load, and mangle duplicates by appending a sequential integer (in parenthesis) to duplicates unless the duplicate was caused by an auto-generated id. In that case the auto-generated id should be re-generated.
Comment 4 Tim Allison 2016-11-02 12:58:30 UTC
Created attachment 34414 [details]
File provided by Seva Alekseyev on TIKA-2152
Comment 5 Tim Allison 2016-11-02 13:01:12 UTC
>Seems to me that style ID is not optional because it is used to refer to the style from elsewhere in the document

I agree with this point.  However, from a spec standpoint it appears to be "optional" or am I misreading?

Also, thank you for your consideration of the full lifecycle -- open, modify, save.  I have to admit that I suffer from tunnel vision on open+extract. :)
Comment 6 Mark Murphy 2016-11-05 05:17:03 UTC
(In reply to Tim Allison from comment #5)
> >Seems to me that style ID is not optional because it is used to refer to the style from elsewhere in the document
> 
> I agree with this point.  However, from a spec standpoint it appears to be
> "optional" or am I misreading?
> 
> Also, thank you for your consideration of the full lifecycle -- open,
> modify, save.  I have to admit that I suffer from tunnel vision on
> open+extract. :)

Sorry, thought I replied to this. Yes the spec does say optional, and after my initial reading I couldn't think of how a style could be used without an ID, but after more thought, a style without an ID could be used indirectly if it were a default style for a run, paragraph, or table. So, to avoid other than explicit changes to documents, I believe we should just suppress any NPE, or better yet, correct the situation that is causing the NPE. I have found several places in other parts of the code where it is assumed that an XML object exists, and just goes and gets it. Maybe that is the issue here?
Comment 7 Mark Murphy 2016-11-29 04:19:35 UTC
Tim, do we have more information about this, like a stack trace, or some code which triggers the NPE? I can't recreate it. In fact when looking at the file in the attachment, all the styles have a styleId. I can modify it so that one does not. But that would defeat the purpose of attaching a file that exhibits this behavior.
Comment 8 Tim Allison 2016-11-29 13:19:31 UTC
Doh, y, this fell off my plate too.  Sorry.

In styles.xml, I see this (a style with no id):
<w:style><w:name w:val="EmptyCellLayoutStyle"/><w:basedOn w:val="Normal"/><w:rPr><w:sz w:val="2"/></w:rPr></w:style>


Full stack trace in Tika:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@131ef10

	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
	at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:215)
	at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:207)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testOneOff2(OOXMLParserTest.java:1311)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.lang.NullPointerException
	at org.apache.poi.xwpf.usermodel.XWPFStyles.getStyle(XWPFStyles.java:198)
	at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:150)
	at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
	at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:381)
	at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaderText(XWPFWordExtractorDecorator.java:433)
	at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaders(XWPFWordExtractorDecorator.java:423)
	at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:89)
	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:104)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:122)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
Comment 9 Mark Murphy 2016-12-01 02:23:26 UTC
fixed by r1772138
Comment 10 Mark Murphy 2016-12-01 02:23:28 UTC
Created attachment 34493 [details]
mylyn/context/zip