Bug 50750

Summary: Support MS OneNote file format
Product: POI
Component: POI Overall
Status: NEW
Severity: enhancement
Priority: P2
Version: unspecified
Target Milestone: ---
Hardware: All
OS: All
Reporter: Jan Høydahl <jan.asf>
Assignee: POI Developers List <dev>
CC: bonniot

Description Jan Høydahl 2011-02-10 11:55:07 UTC
Support extracting text content from .one files, as per this file format spec: http://msdn.microsoft.com/en-us/library/dd924743(v=office.12).aspx
Comment 1 Nick Burch 2011-02-10 13:12:59 UTC
Any chance you could create a few sample documents and upload them?

Ideally we'd want, say, 2 or 3 files. For each one, we'd also want a text file with the textual contents of the file (so we can make sure we get most of the contents), and possibly also a screenshot of the file when it's open in OneNote (so we can get a feel for how the text might come out).
Comment 2 Jan Høydahl 2011-02-14 13:15:11 UTC
Here are some sample OneNote files in a zip file:

https://docs.google.com/leaf?id=0B5l8CG0AFbx2ZWRiYjRiY2QtYzAzOC00ODgxLWIwZGEtNGRlOTdlYzRmNDQ5&hl=no

Zip contains:
sample-onenote-2007.one
sample-onenote-2010.one
sample-onenote-package.onepkg
sample-onenote.pdf
sample-onenote.txt

The files are the default sample document in OneNote 2010, created with OneNote 2010. The document has one section with 2 pages. The 2007 file is exported from OneNote 2010. The .onepkg file has the same contents as the other files, but saved as a package. The txt file was created by selecting all text on the page and copying it, so you get an idea of what is graphics and what is text. The PDF gives a visual impression of the original workbook.
Comment 3 Nick Burch 2011-02-14 14:24:10 UTC
Thanks for these.

I can't promise I'll be able to work on this very soon, but I should be able to add in Tika support just as soon as I've done the POI bit...