When extracting plain text with org.apache.poi.hssf.extractor.ExcelExtractor, the header and footer of each sheet are extracted unconditionally. Most often those contain something like "&page", which isn't resolved. I'd prefer not to extract the header and footer. Ideally there would be a method such as setHeaderFooterExtraction() to control this behaviour. Thanks, Axel.
Patch suggestion:

--- src/java/org/apache/poi/hssf/extractor/ExcelExtractor.java (revision 722394)
+++ src/java/org/apache/poi/hssf/extractor/ExcelExtractor.java (working copy)
@@ -46,6 +46,7 @@
     private boolean formulasNotResults = false;
     private boolean includeCellComments = false;
     private boolean includeBlankCells = false;
+    private boolean includeHeaderFooter = true;
 
     public ExcelExtractor(HSSFWorkbook wb) {
         super(wb);
@@ -79,6 +80,12 @@
         this.includeCellComments = includeCellComments;
     }
     /**
+     * Should header and footer be included? Default is true
+     */
+    public void setIncludeHeaderFooter(boolean includeHeaderFooter) {
+        this.includeHeaderFooter = includeHeaderFooter;
+    }
+    /**
      * Should blank cells be output? Default is to only
      * output cells that are present in the file and are
      * non-blank.
@@ -111,7 +118,7 @@
             }
 
             // Header text, if there is any
-            if(sheet.getHeader() != null) {
+            if(sheet.getHeader() != null && includeHeaderFooter) {
                 text.append(
                     _extractHeaderFooter(sheet.getHeader())
                 );
@@ -201,7 +208,7 @@
             }
 
             // Finally Feader text, if there is any
-            if(sheet.getFooter() != null) {
+            if(sheet.getFooter() != null && includeHeaderFooter) {
                 text.append(
                     _extractHeaderFooter(sheet.getFooter())
                 );
Thanks. Applied with tweaks, and with the same change made to XSSFExcelExtractor too.
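For anyone reading this later, here is a minimal, self-contained sketch of the toggle as proposed in the patch above. It does not use POI at all: SheetModel and TextExtractorSketch are hypothetical stand-ins invented for illustration, and only the flag name, its default of true, and the two guarded checks are taken from the patch. Since the change was applied with tweaks, the committed names may differ from what is shown here.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a sheet: optional header/footer text plus rows of cell text. */
final class SheetModel {
    final String header;      // may be null, like sheet.getHeader() in the patch
    final List<String> rows;
    final String footer;      // may be null, like sheet.getFooter() in the patch

    SheetModel(String header, List<String> rows, String footer) {
        this.header = header;
        this.rows = rows;
        this.footer = footer;
    }
}

/** Hypothetical extractor that mirrors the flag added by the patch. */
final class TextExtractorSketch {
    // Same default as in the patch: headers and footers are included.
    private boolean includeHeaderFooter = true;

    /** Should header and footer be included? Default is true. */
    public void setIncludeHeaderFooter(boolean includeHeaderFooter) {
        this.includeHeaderFooter = includeHeaderFooter;
    }

    public String getText(SheetModel sheet) {
        StringBuilder text = new StringBuilder();

        // Header text, only if present and not switched off
        if (sheet.header != null && includeHeaderFooter) {
            text.append(sheet.header).append('\n');
        }

        // Cell text is always extracted
        for (String row : sheet.rows) {
            text.append(row).append('\n');
        }

        // Footer text, only if present and not switched off
        if (sheet.footer != null && includeHeaderFooter) {
            text.append(sheet.footer).append('\n');
        }
        return text.toString();
    }
}

final class HeaderFooterToggleDemo {
    public static void main(String[] args) {
        SheetModel sheet = new SheetModel("&page", Arrays.asList("A1\tB1", "A2\tB2"), "&page");
        TextExtractorSketch extractor = new TextExtractorSketch();

        System.out.println("Default (header/footer included):");
        System.out.print(extractor.getText(sheet));

        extractor.setIncludeHeaderFooter(false);
        System.out.println("After setIncludeHeaderFooter(false):");
        System.out.print(extractor.getText(sheet));
    }
}
```

Defaulting the flag to true keeps existing callers unaffected; only code that explicitly opts out stops seeing the unresolved "&page" text.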