Created attachment 35171 [details] triggering file A fuzzed version of testException2.doc in Tika's test corpus triggers: Caused by: java.lang.AssertionError at org.apache.poi.hwpf.usermodel.Range.sanityCheck(Range.java:1158) at org.apache.poi.hwpf.usermodel.Range.<init>(Range.java:195) at org.apache.poi.hwpf.usermodel.HeaderStories.getSubrangeAt(HeaderStories.java:357) at org.apache.poi.hwpf.usermodel.HeaderStories.getOddHeaderSubrange(HeaderStories.java:196) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:180) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:175) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) We should definitely have a range check, but I think it would be better to throw a RecordFormatException or RuntimeException? More generally, there is room to make our "broken record" exception handling more consistent. My preference would be to throw RecordFormatException for this kind of thing. Other ideas?
I see 73 instances of "assert(", and, yes, several of these are my fault. In most cases, I'd think we could convert these to a RecordFormatException. If there were a use case for turning assertions off and hoping for the best, I'd want to leave the asserts in. However, it looks (on quick review) like turning the assertions off will yield corrupt objects/data. So, I don't see a use case for assert instead of a RecordFormatException. I'm happy to make the changes, but given that this will be not a small patch, I'd like to get feedback before I fix this globally.
Of course, there are many, many more without the paren: "assert "
+1 for replacing java-assert I think we need at least two different cases: asserts based on record format (throws RecordFormatException) and other asserts (throws ? extends RuntimeException)
I'll probably start on this one class/family at a time unless there are objections. It looks like there are basically 4 places where we rely on assert. 1) assert that things are or aren't null, as in DrawTextParagraph: String buFontStr = bulletStyle.getBulletFont(); if (buFontStr == null) { buFontStr = paragraph.getDefaultFontFamily(); } assert(buFontStr != null); FontInfo buFont = new DrawFontInfo(buFontStr); 2) assert instanceof, as in Range: assert ( _doc instanceof HWPFDocument ); 3) assert x == y to confirm that a record is not wonky, as in HwmfBitmapDib: assert(introSize == headerSize); or in LittleEndianByteArrayInputStream: assert skipped == size : "Buffer overrun"; 4) checks on limitations of implementation as in: assert false : "hashCode not designed"; There may be other uses as well... Proposed solutions: 1) RecordFormatException? Or something else? 2) What exception should we use here? 3) is straightforward, I think: RecordFormatException 4) is straightforward, I think: implement/auto-generate a hashcode
in r1803092, I made some modifications to hwpf.Range. I left in the asserts in the binary search code. I added DocumentFormatException as a RuntimeException to handle larger scale problems with parsing the document than RecordFormatException. I think I'd want to use this for 1) and 2), RecordFormatException for 3), and just implement 4). I'll wait a bit before making any other changes to give folks a chance to review and offer feedback.