Created attachment 27618
Patch for issue

Sporadic bug with some Word files (I am unable to submit a sample due to the sensitive nature of the files). LittleEndian.getShort sometimes returns a length of -1, which causes the subsequent getFromUnicodeLE() call to fail. The patch adds a check for an invalid length before proceeding. In my test file the invalid length belonged to the last entry, and leaving a null value in the entries array does not appear to cause any additional errors downstream.

Stack trace (POI version 3.8-beta4):

Caused by: java.lang.IllegalArgumentException: Illegal length -1
    at org.apache.poi.util.StringUtil.getFromUnicodeLE(StringUtil.java:73)
    at org.apache.poi.hwpf.model.RevisionMarkAuthorTable.<init>(RevisionMarkAuthorTable.java:89)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:375)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:67)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:196)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 45 more
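
For readers without access to the attachment, here is a minimal sketch of the kind of length check described above. It is not the attached patch and does not use POI's classes: the class SttbLengthCheck and the methods readShortLE and readEntries are hypothetical names, and the layout (a signed 16-bit little-endian character count followed by UTF-16LE data) is an assumption made for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone illustration; not the POI implementation.
final class SttbLengthCheck {

    // Reads a signed 16-bit little-endian value, so bytes FF FF yield -1.
    static short readShortLE(byte[] data, int offset) {
        return (short) ((data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8));
    }

    // Reads up to 'count' length-prefixed UTF-16LE strings starting at 'offset'.
    // An invalid length stops parsing; the remaining slots stay null.
    static String[] readEntries(byte[] data, int offset, int count) {
        String[] entries = new String[count];
        int pos = offset;
        for (int i = 0; i < count; i++) {
            if (pos + 2 > data.length) {
                break; // no room left for a length prefix
            }
            short len = readShortLE(data, pos);
            pos += 2;
            // Guard: a negative or out-of-bounds length would otherwise
            // make the string decoding step throw.
            if (len < 0 || pos + len * 2 > data.length) {
                break;
            }
            entries[i] = new String(data, pos, len * 2, StandardCharsets.UTF_16LE);
            pos += len * 2;
        }
        return entries;
    }
}
```

With this guard, an input whose last entry carries a length of -1 yields a null in that slot instead of an IllegalArgumentException. Whether stopping at the first bad length or skipping it is the better policy depends on the file format, so treat the early break as one possible choice.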
I applied the patch to SttbUtils and rewrote RevisionMarkAuthorTable to use it. Please check with r1177644 or later.
That appears to have fixed it! I'll mark the bug as resolved. Thanks very much for your attention to this.

(In reply to comment #1)
> I applied patch to SttbUtils and rewrite RevisionMarkAuthorTable to use it.
> Please, check with r1177644 or later.