Bug 50076 - [Patch] A Simple Extractor and Workbook are proposed
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF
Version: 3.7-dev
Hardware: All All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2010-10-11 19:54 UTC by ssmeets
Modified: 2010-11-22 10:18 UTC
0 users

svn diff output (1.70 KB, text/plain)
2010-10-11 19:54 UTC, ssmeets
New classes (2.65 KB, application/x-bzip)
2010-10-11 19:54 UTC, ssmeets

Description ssmeets 2010-10-11 19:54:15 UTC
Created attachment 26160
svn diff output

Proposed are a SimpleExtractor and an XSSFSimpleWorkbook that allow Tika to parse XLSX spreadsheets more efficiently, using SAX-based parsing. This is related to TIKA-521 (https://issues.apache.org/jira/browse/TIKA-521).
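
For illustration, here is a minimal sketch of what SAX-based parsing of a worksheet part can look like, using only the JDK. It is not the code from the attached patch: the class name SaxSheetTextSketch and the method extractText are hypothetical, shared strings and formulas are not handled, and the sample XML is a simplified stand-in for a real sheet part.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only: these names are assumptions, not taken from the patch or from POI.
class SaxSheetTextSketch {

    /** Streams worksheet XML and returns cell values, tab-separated, one line per row. */
    static String extractText(String sheetXml) throws Exception {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder value = new StringBuilder();
            private boolean inValue;
            private boolean firstCellInRow;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("row".equals(qName)) {
                    firstCellInRow = true;
                } else if ("v".equals(qName) || "t".equals(qName)) {
                    // <v> holds a raw value, <t> holds inline string text
                    inValue = true;
                    value.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so accumulate
                if (inValue) {
                    value.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("v".equals(qName) || "t".equals(qName)) {
                    inValue = false;
                } else if ("c".equals(qName)) {
                    // End of a cell: emit whatever value was collected for it
                    if (!firstCellInRow) {
                        out.append('\t');
                    }
                    out.append(value);
                    value.setLength(0);
                    firstCellInRow = false;
                } else if ("row".equals(qName)) {
                    out.append('\n');
                }
            }
        };
        // The default factory is not namespace-aware, so qName is the plain element name here
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(sheetXml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<worksheet><sheetData>"
            + "<row r=\"1\"><c r=\"A1\"><v>1</v></c>"
            + "<c r=\"B1\" t=\"inlineStr\"><is><t>two</t></is></c></row>"
            + "<row r=\"2\"><c r=\"A2\"><v>3.5</v></c></row>"
            + "</sheetData></worksheet>";
        System.out.print(extractText(xml));
    }
}
```

The handler keeps only the current cell's text in memory, which is the main reason a streaming approach is lighter than building a full workbook model for large sheets.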

Testcases will follow when the proposed approach is approved.
Comment 1 ssmeets 2010-10-11 19:54:57 UTC
Created attachment 26161
New classes
Comment 2 Nick Burch 2010-11-19 13:18:28 UTC
I've done some refactoring of XSSFEventBasedExcelExtractor in r1036968, which should help with the Tika side when it comes to outputting the values as XHTML.

Next I'll need to expand on your XSSFSimpleWorkbook to cover all the different file parts we might need in order to replicate the functionality in XSSFExcelExtractorDecorator (this may need some more POI refactoring as well as new code).

Finally, we'd then need to go to the Tika side and update XSSFExcelExtractorDecorator to use the new simple workbook, and implement a SheetContentsHandler which generates the XHTML events.
Comment 3 Nick Burch 2010-11-22 10:18:36 UTC
I've done some more work in r1037753. We can now use XSSFEventBasedExcelExtractor, wire in our own way to get at the text, and get at comments + headers.