58654 – Extraction fails when data is inserted using Microsoft Field Control


Status:	NEEDINFO

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	POI Overall
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-11-25 17:47 UTC by raymond.cabrera
Modified:	2021-06-14 17:56 UTC
CC List:	1 user


Description raymond.cabrera 2015-11-25 17:47:12 UTC

Hi,

We seem to have an issue when extracting text from a Word or PDF document: pages that were inserted using MS Field Control are dropped. No error is thrown; every other page is extracted, but not the ones inserted using MS Field Control.

Comment 1 Nick Burch 2015-11-25 17:49:23 UTC

Which extraction? What code in Apache POI are you calling? Do you have an example file that shows up the problem?

Comment 2 raymond.cabrera 2015-11-25 21:15:42 UTC

Created attachment 33298

Example of file requested

Comment 3 raymond.cabrera 2015-11-25 21:15:52 UTC

File attached.

Thanks.

Comment 4 Nick Burch 2015-11-25 21:58:38 UTC

Thanks for the file. How are you calling Apache POI though? Any chance of a short code snippet and/or junit unit test to show how to reproduce the problem?

Comment 5 raymond.cabrera 2015-11-26 19:34:34 UTC

FYI:

            // Binary .doc files (HWPF, from poi-scratchpad):
            org.apache.poi.hwpf.extractor.WordExtractor extractor1 = new org.apache.poi.hwpf.extractor.WordExtractor(getInputStream(contents));
            return extractor1.getText();

            // .docx files (XWPF, from poi-ooxml):
            XWPFDocument doc = new XWPFDocument(getInputStream(contents));
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            return extractor.getText();
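
The HWPF and XWPF calls above target different containers: binary .doc files are OLE2 compound files, while .docx files are ZIP packages. As a minimal sketch of how a caller might pick between the two paths from the raw bytes, the class below checks the leading signature. The class name `WordContainerSniffer` and its `Container` enum are hypothetical names chosen for illustration; they are not part of POI or of the reporter's code. The check only identifies the container, so other OLE2 or ZIP based formats (for example spreadsheets) would match as well.

```java
import java.util.Arrays;

public class WordContainerSniffer {
    /** Container types distinguishable from the first few bytes (hypothetical enum). */
    public enum Container { OLE2, ZIP, UNKNOWN }

    // OLE2 compound file signature, used by binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\3\4", used by .docx packages.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    /** Classifies the byte array by its leading signature. */
    public static Container sniff(byte[] contents) {
        if (startsWith(contents, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(contents, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    // True when data is non-null and begins with every byte of prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A caller could route `Container.OLE2` to the `WordExtractor` path and `Container.ZIP` to the `XWPFWordExtractor` path, and report `UNKNOWN` input instead of attempting extraction.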

Comment 6 raymond.cabrera 2015-12-07 16:58:21 UTC

Created attachment 33331

Comment 7 raymond.cabrera 2015-12-07 17:02:26 UTC

Hi,

We have updated our POI to the latest version:
poi-3.13.jar
poi-ooxml-3.13.jar
poi-ooxml-schemas-3.13.jar
poi-scratchpad-3.13.jar

We are using the following Java code to extract text from the attached docx file. It extracts the text inside the form controls successfully, but it fails to extract the name and address at the top of the Word file (which also had Content Control enabled, similar to the body text on page 3 that was extracted successfully).



Please advise how to fix this.

/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/

package com.brainhunter.frontoffice.biz.util.extract;

import com.brainhunter.frontoffice.biz.exception.UnableExtractException;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

/**
*
* @author Mohankumars
*/
public class DocxExtractor extends TextExtractor {

    /** Creates a new instance of DocxExtractor */
    public DocxExtractor() {
    }

    /** Returns the text that XWPFWordExtractor produces for the given .docx bytes. */
    public String getText(byte[] contents) throws UnableExtractException {

        try {
            XWPFDocument doc = new XWPFDocument(getInputStream(contents));
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            return extractor.getText();
        }
        catch (Exception e) {
            // Wrap any parsing or I/O failure in the application's own exception type.
            throw new UnableExtractException(e);
        }
    }

}
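
To narrow down where the missing name and address are stored, it may help to inspect the .docx package directly, independent of POI. The sketch below is a diagnostic aid, not a fix. It relies on two facts about the format: a .docx file is a ZIP package whose parts live under `word/`, and content controls are stored as `w:sdt` elements whose text sits in `w:t` elements. The class name `SdtTextProbe` and the method `sdtTextByPart` are hypothetical names chosen for illustration. The sketch lists, per part, the text found inside content controls, which would show whether the missing text sits in the main document part or in some other part such as a header. That location is an assumption to verify against the actual file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class SdtTextProbe {
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Maps each word/*.xml part name to the text of its w:t nodes that sit inside a w:sdt. */
    public static Map<String, String> sdtTextByPart(byte[] docx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since uploaded documents are untrusted input.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        Map<String, String> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docx))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith("word/") || !name.endsWith(".xml")) {
                    continue;
                }
                // Copy the entry first so the parser cannot close the ZIP stream.
                Document xml = builder.parse(new ByteArrayInputStream(readAll(zip)));
                NodeList texts = xml.getElementsByTagNameNS(W_NS, "t");
                StringBuilder found = new StringBuilder();
                for (int i = 0; i < texts.getLength(); i++) {
                    Node text = texts.item(i);
                    if (hasSdtAncestor(text)) {
                        found.append(text.getTextContent());
                    }
                }
                if (found.length() > 0) {
                    result.put(name, found.toString());
                }
            }
        }
        return result;
    }

    // True when any ancestor of the node is a w:sdt element.
    private static boolean hasSdtAncestor(Node node) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (W_NS.equals(p.getNamespaceURI()) && "sdt".equals(p.getLocalName())) {
                return true;
            }
        }
        return false;
    }

    // Reads the current ZIP entry fully into memory.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Comparing this map with the output of `extractor.getText()` would indicate which content controls, and which parts, the extractor is skipping. That information would make the report easier to reproduce.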

Comment 8 Mark Thomas 2021-06-14 17:56:55 UTC

Attachments deleted by ASF infrastructure team.