Created attachment 34596
File on which the code fails

I'm extracting text with the WordExtractor class (Apache POI), but I get an error for some .doc files. Here is the code:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.OfficeXmlFileException;

public class [class name] {

    public static void main(String... args) throws FileNotFoundException, IOException, NullPointerException, OfficeXmlFileException {
        File[] files = new File("[input path]").listFiles();
        showFiles(files);
    }

    public static void showFiles(File[] files) throws FileNotFoundException, IOException, NullPointerException, OfficeXmlFileException {
        File log = new File("[output name]/out.tsv");
        for (File file : files) {
            if (file.isDirectory()) {
                //System.out.println("Directory/" + file.getName());
                showFiles(file.listFiles()); // Calls same method again.
            } else {
                String N = file.getName();
                // .docx case
                if (N.toLowerCase().endsWith(".docx") && !N.toLowerCase().startsWith("~")) {
                    System.out.println(file.getAbsolutePath());
                    XWPFDocument docx = new XWPFDocument(new FileInputStream(file));
                    XWPFWordExtractor we = new XWPFWordExtractor(docx);
                    String T = we.getText().replaceAll("\\n", " ").replaceAll("\\r", " ");
                    // Write to the output file
                    try {
                        // if(!log.exists()){
                        //     System.out.println("We had to make a new file.");
                        //     log.createNewFile();
                        // }
                        FileWriter fileWriter = new FileWriter(log, true);
                        BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);
                        bufferedWriter.write(file.getAbsolutePath() + "\t" + T + "\n");
                        bufferedWriter.close();
                    } catch (IOException e) {
                        System.err.println("Problem writing .DOCX to the file out.tsv " + e.getMessage());
                    }
                // .doc case
                } else if (N.toLowerCase().endsWith(".doc") && !N.toLowerCase().startsWith("~")) {
                    System.out.println(file.getAbsolutePath());
                    HWPFDocument doc = new HWPFDocument(new FileInputStream(file));
                    WordExtractor we = new WordExtractor(doc);
                    //WordExtractor we = new WordExtractor(new FileInputStream(file));
                    String T = we.getText().replaceAll("\\n", " ").replaceAll("\\r", " ");
                    // Write to the output file
                    try {
                        // if(!log.exists()){
                        //     log.createNewFile();
                        // }
                        FileWriter fileWriter = new FileWriter(log, true);
                        BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);
                        bufferedWriter.write(file.getAbsolutePath() + "\t" + T + "\n");
                        bufferedWriter.close();
                    } catch (IOException e) {
                        System.err.println("Problem writing .DOC to the file out.tsv " + e.getMessage());
                    }
                }
            }
        }
    }
}

For most .docx and .doc files it works fine. The error message is:

Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: The end (4958) must not be before the start (4990)

How can I fix it?
The problem seems to be with the document rather than the code, specifically with the hidden bookmark named "_Toc263095067". Apparently POI infers the end of that bookmark to be located before its start. Deleting that particular hidden bookmark makes everything work fine. Let me know if there are other corrective measures, or if this highlights an actual problem in the POI code.
Thank you very much. That was the problem. Now it works fine! Regards.
Seems to have been caused by a slightly broken document.
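For anyone batch-processing a folder and hitting the same exception: since the cause is a broken document, one possible corrective measure on the code side is to catch the RuntimeException per file, so that a single bad .doc does not abort the whole run. Below is a minimal sketch of that idea, not POI code. The names SafeExtraction, Result and extractAll are hypothetical, and the extractor in main is a stand-in that only simulates the failure reported above.

```java
import java.io.File;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: runs an extractor over many files and keeps going
// when one of them throws, recording the failure instead of crashing.
class SafeExtraction {

    /** Extracted text per file, plus the failure message for each skipped file. */
    static final class Result {
        final Map<File, String> texts = new LinkedHashMap<>();
        final Map<File, String> failures = new LinkedHashMap<>();
    }

    static Result extractAll(List<File> files, Function<File, String> extractor) {
        Result result = new Result();
        for (File file : files) {
            try {
                result.texts.put(file, extractor.apply(file));
            } catch (RuntimeException e) {
                // The reported error was a RuntimeException wrapping an
                // IllegalArgumentException, so prefer the cause's message.
                Throwable cause = e.getCause() != null ? e.getCause() : e;
                result.failures.put(file, String.valueOf(cause.getMessage()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in extractor (assumption): simulates one broken document.
        Function<File, String> fakeExtractor = f -> {
            if (f.getName().equals("broken.doc")) {
                throw new RuntimeException(new IllegalArgumentException(
                        "The end (4958) must not be before the start (4990)"));
            }
            return "text of " + f.getName();
        };

        Result r = extractAll(
                Arrays.asList(new File("ok.doc"), new File("broken.doc")),
                fakeExtractor);
        System.out.println("extracted: " + r.texts.keySet());
        System.out.println("skipped:   " + r.failures);
    }
}
```

In the real program the extractor would wrap the HWPFDocument / WordExtractor calls from the original code (converting the checked IOException to an unchecked one), and the skipped files could then be opened in Word to remove the offending bookmark.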