Bug 50076

Summary: [Patch] A Simple Extractor and Workbook are proposed
Product: POI
Reporter: ssmeets
Component: XSSF
Assignee: POI Developers List <dev>
Severity: enhancement    
Priority: P2    
Version: 3.7-dev   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: svn diff output
New classes

Description ssmeets 2010-10-11 19:54:15 UTC
Created attachment 26160
svn diff output

This patch proposes a SimpleExtractor and an XSSFSimpleWorkbook so that Tika can parse XLSX spreadsheets more efficiently (SAX-based parsing). This is related to TIKA-521 (https://issues.apache.org/jira/browse/TIKA-521).
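For readers unfamiliar with the approach, here is a rough sketch of what SAX-based parsing of a worksheet part can look like. It uses only the JDK's SAX parser, not the classes in the attached patch; the class name SheetTextSketch and the method extractText are made up for illustration. It is also deliberately incomplete: it only reads inline strings and raw <v> values, and does not resolve shared-string indexes or apply number formats.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class, not part of POI or of the attached patch.
class SheetTextSketch {

    /**
     * Streams worksheet XML through SAX and returns the cell values,
     * tab-separated within a row and with a newline after each row.
     */
    static String extractText(String sheetXml) {
        final StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder value = new StringBuilder();
            private boolean inValue;
            private boolean firstCellInRow;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("row".equals(qName)) {
                    firstCellInRow = true;
                } else if ("v".equals(qName) || "t".equals(qName)) {
                    // <v> holds a raw value, <t> holds inline string text
                    inValue = true;
                    value.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inValue) {
                    value.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("v".equals(qName) || "t".equals(qName)) {
                    inValue = false;
                    if (!firstCellInRow) {
                        out.append('\t');
                    }
                    out.append(value);
                    firstCellInRow = false;
                } else if ("row".equals(qName)) {
                    out.append('\n');
                }
            }
        };

        try {
            // The default factory is not namespace-aware, so qName is the plain element name.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(sheetXml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sheet XML", e);
        }
        return out.toString();
    }
}
```

The point of the design is that nothing is held in memory beyond the current cell value, which is where the efficiency gain over building a full workbook model would come from.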

Test cases will follow once the proposed approach is approved.
Comment 1 ssmeets 2010-10-11 19:54:57 UTC
Created attachment 26161
New classes
Comment 2 Nick Burch 2010-11-19 13:18:28 UTC
I've done some refactoring of XSSFEventBasedExcelExtractor in r1036968, which should help on the Tika side when it comes to outputting the values as XHTML.

Next, I'll need to expand on your XSSFSimpleWorkbook to cover all the different file parts we might need in order to replicate the functionality in XSSFExcelExtractorDecorator (this may need some more POI refactoring as well as new code).

Finally, we'd then need to go to the Tika side and update XSSFExcelExtractorDecorator to use the new simple workbook, and implement a SheetContentsHandler which generates the XHTML events.
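
As a rough idea of what that last step could look like, here is a minimal sketch of a row/cell callback handler that turns sheet contents into XHTML SAX events. The SimpleSheetHandler interface below is a hypothetical stand-in written for this example, not POI's actual SheetContentsHandler (whose exact method signatures should be checked against the POI source), and XhtmlEventHandler and EventRecorder are likewise made-up names. Only JDK SAX classes are used.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical callback interface, assumed here for illustration only.
interface SimpleSheetHandler {
    void startRow(int rowNum);
    void endRow();
    void cell(String cellReference, String formattedValue);
}

// Translates row/cell callbacks into XHTML <tr>/<td> SAX events.
class XhtmlEventHandler implements SimpleSheetHandler {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    private final ContentHandler out;

    XhtmlEventHandler(ContentHandler out) {
        this.out = out;
    }

    public void startRow(int rowNum) {
        start("tr");
    }

    public void endRow() {
        end("tr");
    }

    public void cell(String cellReference, String formattedValue) {
        start("td");
        try {
            out.characters(formattedValue.toCharArray(), 0, formattedValue.length());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        end("td");
    }

    private void start(String name) {
        try {
            out.startElement(XHTML_NS, name, name, new AttributesImpl());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    private void end(String name) {
        try {
            out.endElement(XHTML_NS, name, name);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}

// Small recording ContentHandler so the emitted events can be inspected.
class EventRecorder extends DefaultHandler {
    final StringBuilder log = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        log.append('<').append(qName).append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        log.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        log.append(ch, start, length);
    }
}
```

Driving the handler with startRow(0), cell("A1", "Hello"), cell("B1", "42"), endRow() makes the recorder see one tr element containing two td elements. On the Tika side the ContentHandler would be whatever XHTML handler the parser is given, rather than a recorder.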
Comment 3 Nick Burch 2010-11-22 10:18:36 UTC
I've done some more work in r1037753. We can now use XSSFEventBasedExcelExtractor, wire in our own way to get at the text, and get at comments + headers.