Bug 45622 - Header/footer extraction for Word documents incomplete
Summary: Header/footer extraction for Word documents incomplete
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: unspecified
Hardware: All All
Importance: P1 critical
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-08-12 11:13 UTC by Dmitry Goldenberg
Modified: 2008-08-12 12:58 UTC

Attachments
Simple Word doc with headers and footers. (26.00 KB, application/msword)
2008-08-12 11:13 UTC, Dmitry Goldenberg
Word doc with some macros used in its header. (22.00 KB, application/msword)
2008-08-12 11:15 UTC, Dmitry Goldenberg

Description Dmitry Goldenberg 2008-08-12 11:13:21 UTC
Created attachment 22435
Simple Word doc with headers and footers.

There are two issues with header/footer extraction for Word documents as currently implemented.

1. The newly added methods are on WordExtractor as follows:
public String getHeaderText()
public String getFooterText()

These methods do not account for the use-case of headers/footers defined differently for odd vs. even pages in Word.

I propose a different model:

HWPFHeader header = extractor.getHeader();
String oddHeader = header.getOddHeader();
String evenHeader = header.getEvenHeader();

HWPFFooter footer = extractor.getFooter();
String oddFooter = footer.getOddFooter();
String evenFooter = footer.getEvenFooter();

This matches Word's own model and is in line with the model already used by the Excel header/footer extraction code:

HSSFHeader header = sheet.getHeader();
String leftHeader = header.getLeft();
String centerHeader = header.getCenter();
String rightHeader = header.getRight();

2. The second issue is macros. You can define macros in headers and footers, and currently they show up in the extracted text. For example, in the attached file HeadersFooters2.doc, the Author field was used in the header, and the string "AUTHOR" gets returned. It would be great if the header/footer methods returned only the actual text and never the macros, or if they took a boolean flag to strip the macros.

For example, for the attached HeadersFooters2.doc, the following gets returned:

HEADER GOES HERE. 8/12/2008  AUTHOR \* MERGEFORMAT Eric Roch

It would be great if the returned text was simply:

HEADER GOES HERE. 8/12/2008 Eric Roch

In the interest of being generic, a flag for stripping off this extra markup is probably best.
Comment 1 Dmitry Goldenberg 2008-08-12 11:15:01 UTC
Created attachment 22436
Word doc with some macros used in its header.
Comment 2 Dmitry Goldenberg 2008-08-12 11:59:26 UTC
Per Nick's comments -- found the HeaderStories object. We only need to look into the macros now. Disregard my comment about getHeaderText/getFooterText.
Comment 3 Nick Burch 2008-08-12 12:58:17 UTC
I've just added field (e.g. macro) stripping to HWPF. This is a static method on Range: Range.stripFields().

In addition, HeaderStories has an option to always strip out fields from the text it returns (off by default). With this turned on, the header extracted from your test document comes out as one would expect.
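
For readers wondering what "stripping fields" amounts to, here is a minimal, self-contained sketch. It is not the code behind Range.stripFields(). It assumes the raw text uses the usual Word binary field marks (0x13 begins a field, 0x14 separates the field instruction from its result, 0x15 ends the field), and it keeps the result while dropping the instruction. The class name, the method name, and the sample string are made up for illustration; the real raw text of HeadersFooters2.doc may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only; names are hypothetical and this is not POI code.
class FieldStripSketch {
    // Assumed field marks of the Word binary format.
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    /** Drops field instructions and field marks, keeps field results and plain text. */
    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: true while we are still in its instruction part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int hidden = 0; // number of open fields currently in their instruction part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from instruction to result.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                // A field without a separator ends while still in its instruction.
                if (!openFields.isEmpty() && openFields.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                out.append(c); // plain text or a visible field result
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical raw header text modelled on the example in the description.
        String raw = "HEADER GOES HERE. 8/12/2008 "
                + FIELD_BEGIN + " AUTHOR \\* MERGEFORMAT "
                + FIELD_SEPARATOR + "Eric Roch" + FIELD_END;
        System.out.println(stripFieldCodes(raw));
    }
}
```

The stack is there so that a field nested inside another field's instruction stays hidden until the outer field reaches its own separator.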