Bug 45622

Summary: Header/footer extraction for Word documents incomplete
Product: POI
Reporter: Dmitry Goldenberg <dgoldenberg>
Component: HWPF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: critical    
Priority: P1    
Version: unspecified   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: Simple Word doc with headers and footers.
Word doc with some macros used in its header.

Description Dmitry Goldenberg 2008-08-12 11:13:21 UTC
Created attachment 22435
Simple Word doc with headers and footers.

There are several issues with the header/footer extraction for Word as it is implemented now.

1. The newly added methods are on WordExtractor as follows:
public String getHeaderText()
public String getFooterText()

These methods do not account for the use-case of headers/footers defined differently for odd vs. even pages in Word.

I propose a different model:

HWPFHeader header = extractor.getHeader();
String oddHeader = header.getOddHeader();
String evenHeader = header.getEvenHeader();

HWPFFooter footer = extractor.getFooter();
String oddFooter = footer.getOddFooter();
String evenFooter = footer.getEvenFooter();

This would match Word's own model and be in line with the model adopted in the Excel header/footer extraction code:

HSSFHeader header = sheet.getHeader();
String leftHeader = header.getLeft();
String centerHeader = header.getCenter();
String rightHeader = header.getRight();

2. The second issue is macros. You can define macros in headers and footers and currently they show up in the extracted text. For example, in the attached file HeadersFooters2.doc, the Author field was used in the header, and the string "AUTHOR" gets returned. It would be great if the headers/footers would only return the actual text and never the macros, or if the methods had a boolean flag to strip off the macros.

For example, for the attached HeadersFooters2.doc, the following gets returned:

HEADER GOES HERE. 8/12/2008  AUTHOR \* MERGEFORMAT Eric Roch

It would be great if the returned text was simply:

HEADER GOES HERE. 8/12/2008 Eric Roch

In the interest of being generic, a flag for stripping off this extra markup is probably best.
Comment 1 Dmitry Goldenberg 2008-08-12 11:15:01 UTC
Created attachment 22436
Word doc with some macros used in its header.
Comment 2 Dmitry Goldenberg 2008-08-12 11:59:26 UTC
Per Nick's comments -- found the HeaderStories object. We only need to look into the macros now. Disregard my comment about getHeaderText/getFooterText.
Comment 3 Nick Burch 2008-08-12 12:58:17 UTC
I've just added field (e.g. macro) stripping to HWPF. This is a static method on Range: Range.stripFields().

In addition, HeaderStories has an option to always strip out fields from the text it returns (off by default). With this turned on, the header returned for your test document is as one would expect.
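
For illustration, field stripping of this kind can be sketched roughly as follows. This is not the POI implementation. It assumes the usual Word binary text layout, in which a field is stored as a begin mark (0x13), the field code (such as AUTHOR \* MERGEFORMAT), an optional separator mark (0x14), the cached result text, and an end mark (0x15). The class name FieldStripper is hypothetical and exists only for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper, not part of POI: drops field codes and keeps field results. */
public class FieldStripper {
    // Control characters Word uses to delimit fields in the text stream.
    private static final char FIELD_BEGIN = (char) 0x13;
    private static final char FIELD_SEPARATOR = (char) 0x14;
    private static final char FIELD_END = (char) 0x15;

    public static String stripFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while still inside its code part,
        // FALSE once its separator has been seen (result part).
        Deque<Boolean> inCode = new ArrayDeque<>();
        int codeDepth = 0; // number of open fields currently in their code part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inCode.push(Boolean.TRUE);
                codeDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from code to result; stray separators are dropped.
                if (!inCode.isEmpty() && inCode.peek()) {
                    inCode.pop();
                    inCode.push(Boolean.FALSE);
                    codeDepth--;
                }
            } else if (c == FIELD_END) {
                // A field that ends without a separator had no result text.
                if (!inCode.isEmpty() && inCode.pop()) {
                    codeDepth--;
                }
            } else if (codeDepth == 0) {
                // Ordinary text, or result text not nested inside any field code.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "HEADER GOES HERE. " + FIELD_BEGIN + " AUTHOR \\* MERGEFORMAT "
                + FIELD_SEPARATOR + "Eric Roch" + FIELD_END;
        System.out.println(stripFields(raw)); // prints "HEADER GOES HERE. Eric Roch"
    }
}
```

Tracking a per-field state rather than a single flag keeps nested fields from leaking their codes into the output: text is emitted only when no enclosing field is still in its code part.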