Hi, We seem to have an issue when text is extracted from a Word or PDF document: certain pages that were inserted using MS Field Control are dropped. No error is thrown. Every page is extracted except the ones inserted using MS Field Control.
Which extraction? What code in Apache POI are you calling? Do you have an example file that shows up the problem?
Created attachment 33298 Example of file requested
file attached. Thanks.
Thanks for the file. How are you calling Apache POI though? Any chance of a short code snippet and/or junit unit test to show how to reproduce the problem?
FYI:

    org.apache.poi.hwpf.extractor.WordExtractor extractor1 =
        new org.apache.poi.hwpf.extractor.WordExtractor(getInputStream(contents));
    return extractor1.getText();

    XWPFDocument doc = new XWPFDocument(getInputStream(contents));
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    return extractor.getText();
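The two snippets above target different containers: the HWPF WordExtractor reads the binary .doc format and the XWPF extractor reads .docx packages. As a minimal sketch of how a caller might decide which path to take from the raw bytes, the helper below checks the leading signature bytes. The class name WordFormatSniffer and its enum are hypothetical and not part of Apache POI. A matching signature only identifies the container (OLE2 or ZIP), so other file types that share the container would be reported the same way.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Apache POI): guesses the container format from the first bytes. */
final class WordFormatSniffer {

    enum Format { OLE2_DOC, OOXML_DOCX, UNKNOWN }

    // Signature of an OLE2 compound document, the container used by binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature; .docx packages are ZIP archives.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Returns the container type suggested by the leading bytes of the content. */
    static Format detect(byte[] contents) {
        if (startsWith(contents, OLE2_MAGIC)) {
            return Format.OLE2_DOC;      // candidate for the HWPF WordExtractor
        }
        if (startsWith(contents, ZIP_MAGIC)) {
            return Format.OOXML_DOCX;    // candidate for the XWPF extractor
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A caller could switch on WordFormatSniffer.detect(contents) and run the matching extractor, rather than trying both.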
Created attachment 33331
Hi, We have updated our POI to the latest version:

    poi-3.13.jar
    poi-ooxml-3.13
    poi-ooxml-schemas-3.13.jar
    poi-scratchpad-3.13.jar

We are using the following Java code to extract the attached docx file. It successfully extracts the text inside the form controls, but it fails to extract the name and address at the top of the Word file (which also has Content Control enabled, similar to the actual body that was successfully extracted on page 3). Please advise how to fix this.

    /*
     * To change this template, choose Tools | Templates
     * and open the template in the editor.
     */
    package com.brainhunter.frontoffice.biz.util.extract;

    import com.brainhunter.frontoffice.biz.exception.UnableExtractException;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    /**
     *
     * @author Mohankumars
     */
    public class DocxExtractor extends TextExtractor {

        /** Creates a new instance of DocExtractor */
        public DocxExtractor() {
        }

        public String getText(byte[] contents) throws UnableExtractException {
            try {
                XWPFDocument doc = new XWPFDocument(getInputStream(contents));
                XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                return extractor.getText();
            } catch (Exception e) {
                throw new UnableExtractException(e);
            }
        }
    }
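To narrow down a report like this, it can help to see which part of the .docx package actually holds the missing text (main body, a header, a footer, and so on). The sketch below does not use POI at all: it opens the package as a ZIP archive and collects the text of every w:t element per XML part under word/, including text nested inside content controls. The class name DocxPartTextDump is hypothetical, and the namespace constant assumes a document saved in the common (transitional) WordprocessingML format. This is a diagnostic aid only, not a fix for the extractor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical diagnostic helper: lists the w:t text found in each XML part under word/. */
final class DocxPartTextDump {

    // Assumption: transitional WordprocessingML namespace (strict documents use a different URI).
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Maps part name (e.g. "word/header1.xml") to the concatenated text of its w:t elements. */
    static Map<String, String> textByPart(byte[] docx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Map<String, String> result = new LinkedHashMap<>();

        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docx))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (!name.startsWith("word/") || !name.endsWith(".xml")) {
                    continue; // skip relationships, media and other non-XML parts
                }
                // Copy the entry first so the XML parser cannot close the ZIP stream.
                byte[] xml = readAll(zip);
                Document dom = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
                // Finds w:t anywhere in the part, including inside w:sdt content controls.
                NodeList texts = dom.getElementsByTagNameNS(W_NS, "t");
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < texts.getLength(); i++) {
                    sb.append(texts.item(i).getTextContent());
                }
                if (sb.length() > 0) {
                    result.put(name, sb.toString());
                }
            }
        }
        return result;
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Comparing the returned map with the output of extractor.getText() shows which parts contain text that the extractor did not return, which is useful detail to attach to the report.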
Attachments deleted by ASF infrastructure team.