Bug 58654 - Extraction fails when data is inserted using Microsoft Field Control
Status: NEEDINFO
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
 
Reported: 2015-11-25 17:47 UTC by raymond.cabrera
Modified: 2015-12-07 17:02 UTC

Attachments
.doc example of MS Field Control insert not being extracted (38.11 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2015-11-25 21:15 UTC, raymond.cabrera
Word Content Control test file. (39.46 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2015-12-07 16:58 UTC, raymond.cabrera

Description raymond.cabrera 2015-11-25 17:47:12 UTC
Hi,

We seem to have an issue when extracting data from a Word or PDF document: certain pages that were inserted using MS Field Control are dropped. No error is thrown; every page is extracted except the ones inserted using MS Field Control.
Comment 1 Nick Burch 2015-11-25 17:49:23 UTC
Which extraction? What code in Apache POI are you calling? Do you have an example file that shows up the problem?
Comment 2 raymond.cabrera 2015-11-25 21:15:42 UTC
Created attachment 33298
.doc example of MS Field Control insert not being extracted

Example of file requested
Comment 3 raymond.cabrera 2015-11-25 21:15:52 UTC
file attached.

Thanks.
Comment 4 Nick Burch 2015-11-25 21:58:38 UTC
Thanks for the file. How are you calling Apache POI though? Any chance of a short code snippet and/or junit unit test to show how to reproduce the problem?
Comment 5 raymond.cabrera 2015-11-26 19:34:34 UTC
FYI:

            // .doc (binary OLE2) files are read through HWPF:
            org.apache.poi.hwpf.extractor.WordExtractor extractor1 = new org.apache.poi.hwpf.extractor.WordExtractor(getInputStream(contents));
            return extractor1.getText();

            // .docx (OOXML) files are read through XWPF:
            XWPFDocument doc = new XWPFDocument(getInputStream(contents));
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            return extractor.getText();
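
The two snippets above target different file formats, and the first attachment is described as ".doc" while carrying a .docx MIME type, so it may help to confirm which container a file really uses before choosing an extractor. A minimal sketch using only the JDK follows; `WordFormatSniffer` and `Container` are hypothetical names for illustration, not part of Apache POI. A ZIP signature only suggests OOXML and an OLE2 signature only suggests a binary Office file; neither proves the file is a Word document.

```java
/** Hypothetical helper (not part of Apache POI): guesses the container format from the leading bytes. */
final class WordFormatSniffer {

    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (used by binary .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature of a ZIP archive (used by .docx files).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    static Container sniff(byte[] contents) {
        if (startsWith(contents, OLE2_MAGIC)) {
            return Container.OLE2;   // candidate for the HWPF WordExtractor
        }
        if (startsWith(contents, ZIP_MAGIC)) {
            return Container.ZIP;    // candidate for the XWPFWordExtractor
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `WordFormatSniffer.sniff(contents)` on the same byte array that is passed to `getText` would show whether a file named .doc is in fact a ZIP-based .docx.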
Comment 6 raymond.cabrera 2015-12-07 16:58:21 UTC
Created attachment 33331
Word Content Control test file.
Comment 7 raymond.cabrera 2015-12-07 17:02:26 UTC
Hi,

We have updated POI to the latest version:
poi-3.13.jar
poi-ooxml-3.13.jar
poi-ooxml-schemas-3.13.jar
poi-scratchpad-3.13.jar

We are using the following Java code to extract text from the attached docx file. It extracts the text inside the form controls successfully; however, it fails to extract the name and address at the top of the Word file (which also had Content Control enabled, similar to the actual body that was successfully extracted on page 3).



Please advise how to fix this.

/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/

package com.brainhunter.frontoffice.biz.util.extract;

import com.brainhunter.frontoffice.biz.exception.UnableExtractException;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

/**
*
* @author Mohankumars
*/
public class DocxExtractor extends TextExtractor {

    /** Creates a new instance of DocxExtractor */
    public DocxExtractor() {
    }

    /** Extracts the plain text of a .docx file supplied as a byte array. */
    public String getText(byte[] contents) throws UnableExtractException {

        try {
            XWPFDocument doc = new XWPFDocument(getInputStream(contents));
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            return extractor.getText();
        }
        catch (Exception e) {
            throw new UnableExtractException(e);
        }
    }

}
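
To narrow down where the missing name and address live, the text held in content controls can be compared against what `XWPFWordExtractor.getText()` returns. Below is a minimal diagnostic sketch using only the JDK. It assumes the missing text sits inside `w:sdt` (structured document tag) elements of a given package part; `ContentControlDump` and its method names are hypothetical and not part of Apache POI.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical diagnostic helper (not part of Apache POI). */
final class ContentControlDump {

    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /**
     * Returns the text of every content control (w:sdt) found in one part of
     * a .docx package, e.g. "word/document.xml". Nested controls are listed
     * too, so the text of an outer control includes that of its inner ones.
     */
    static List<String> contentControlText(byte[] docx, String partName) throws Exception {
        List<String> result = new ArrayList<>();
        byte[] part = readPart(docx, partName);
        if (part == null) {
            return result; // part not present in this package
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // A .docx part never needs a DOCTYPE; refuse it to avoid XXE issues.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document xml = factory.newDocumentBuilder().parse(new ByteArrayInputStream(part));

        NodeList controls = xml.getElementsByTagNameNS(W_NS, "sdt");
        for (int i = 0; i < controls.getLength(); i++) {
            NodeList texts = ((Element) controls.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                text.append(texts.item(j).getTextContent());
            }
            result.add(text.toString());
        }
        return result;
    }

    /** Reads one entry of the ZIP package into memory, or returns null if it is absent. */
    private static byte[] readPart(byte[] docx, String partName) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docx))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!partName.equals(entry.getName())) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                return out.toByteArray();
            }
        }
        return null;
    }
}
```

Running `ContentControlDump.contentControlText(contents, "word/document.xml")` and checking each returned string against the output of `getText(contents)` would show which content controls the extractor skips. If the name and address are in a page header instead, the same call can be pointed at the header part (often named like `word/header1.xml`, though the actual name depends on the file).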