Parse Failed for doc file
gaurav.chd3@gmail.com - can you provide some context on why Apache POI should support all these files? It seems to me that if you want to read these very old files, you should use MS Word to convert them to newer formats. Apache POI is a volunteer project, and if this support matters to you or your organisation, maybe you can provide patches.
Thanks for the response! This is a new file from 2015, not an old file. I am just testing it to see how it compares to other options. Have a good day ahead!
Missing attachment, missing error message, missing reproducible test case, missing other helpful information such as POI version. If you have a set of Microsoft Office files that can't be read, please do some investigation on your end, submit one and only one file for a given issue, and suggest an improvement in the form of a patch for POI to be able to read said file.
Sorry for the inconvenience. The file is attached now. The test cases 61265, 61267, 61266, and 61268 are completely different test cases/issues. They will have different root causes and resolutions. The point regarding improvement suggestions is noted. Thanks!
Created attachment 35107
2014 doc file

File size is 6 MB. It can be downloaded from the link below:
http://www.3gpp.org/ftp/tsg_sa/WG3_Security/TSGS3_76_Sophia/Docs/S3-142235.zip
The file is "S3-142235 Comments on S3-142030 VF proposal TR 33969-071_rm.doc" inside the zip file.
POI 3.16 / Tika 1.15
S3-142235/S3-142235 Comments on S3-142030 VF proposal TR 33969-071_rm.doc

Caused by: java.lang.NegativeArraySizeException
    at org.apache.poi.ddf.UnknownEscherRecord.fillFields(UnknownEscherRecord.java:71)
    at org.apache.poi.ddf.EscherContainerRecord.fillFields(EscherContainerRecord.java:81)
    at org.apache.poi.hwpf.model.PICFAndOfficeArtData.<init>(PICFAndOfficeArtData.java:61)
    at org.apache.poi.hwpf.usermodel.Picture.<init>(Picture.java:112)
    at org.apache.poi.hwpf.model.PicturesTable.extractPicture(PicturesTable.java:162)
    at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:233)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:710)
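For readers unfamiliar with this kind of failure: Java throws NegativeArraySizeException whenever an array is allocated with a negative size, which can happen when a parser trusts a signed length field read from a damaged or unusual file. The sketch below only illustrates that general failure mode and one possible guard. It is not POI's code and not the content of any POI revision; the class RecordLengthSketch, its methods, and the simplified 8-byte header layout (2 bytes options, 2 bytes record id, 4 bytes little-endian length) are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; none of these names exist in Apache POI.
class RecordLengthSketch {
    // Assumed simplified header: 2 bytes options, 2 bytes record id, 4 bytes length.
    static final int HEADER_SIZE = 8;

    // Reads the signed 32-bit little-endian length field that follows the
    // first four header bytes. Assumes the full header is present in data.
    public static int readDeclaredLength(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset + 4, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }

    // Naive version: trusts the declared length, so a negative value
    // makes the array allocation throw NegativeArraySizeException.
    public static byte[] readPayloadUnchecked(byte[] data, int offset) {
        int length = readDeclaredLength(data, offset);
        byte[] payload = new byte[length];
        System.arraycopy(data, offset + HEADER_SIZE, payload, 0, length);
        return payload;
    }

    // Defensive version: clamps the declared length to the range
    // [0, bytes actually available after the header].
    public static byte[] readPayloadChecked(byte[] data, int offset) {
        int declared = readDeclaredLength(data, offset);
        int available = Math.max(0, data.length - (offset + HEADER_SIZE));
        int length = Math.min(Math.max(declared, 0), available);
        byte[] payload = new byte[length];
        System.arraycopy(data, offset + HEADER_SIZE, payload, 0, length);
        return payload;
    }

    public static void main(String[] args) {
        // Header whose length field is 0xFFFFFFFF, i.e. -1 as a signed int,
        // followed by two payload bytes.
        byte[] corrupt = {0x0F, 0x00, 0x00, (byte) 0xF0,
                (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, 1, 2};
        try {
            readPayloadUnchecked(corrupt, 0);
        } catch (NegativeArraySizeException e) {
            System.out.println("unchecked read failed: " + e);
        }
        System.out.println("checked read returned "
                + readPayloadChecked(corrupt, 0).length + " bytes");
    }
}
```

Clamping to an empty payload is just one design choice; a parser could instead reject the record with a descriptive exception so that callers can skip the picture and continue with the rest of the document.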
I added a workaround in https://svn.apache.org/viewvc?view=revision&revision=1801395