47687 – Is there any limitation at size of the MS Office document to extract using POI library?

Status:	RESOLVED INVALID

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	POI Overall
Version:	3.2-FINAL
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	POI Developers List

Reported:	2009-08-12 06:18 UTC by Bijju
Modified:	2009-08-12 06:47 UTC
CC List:	1 user

Description Bijju 2009-08-12 06:18:52 UTC

We have been extracting many Office documents successfully using POI 3.2. However, extraction failed for one particularly large document (over 19 MB).

In practice we will also have documents larger than 500 MB (in fact, there is no upper limit on size). Since POI is a Java library, I would not expect size to matter when obtaining a handle to the document. I am using event-driven logic for document extraction.

I have noticed that POI extracts the document when its size is reduced, and fails otherwise. Is there a reason for this? Am I missing a basic technical point here?

Also, POI appears to treat HTML content inside a Word document as a separate document rather than as plain text. I still need to look into this further. If that is the case, please let me know the reason.

Comment 1 Nick Burch 2009-08-12 06:47:49 UTC

Please ask questions on the mailing list. Try checking the list archives too; your question is almost certainly about needing a bigger Java heap size.

Also, http://poi.apache.org/poifs/embeded.html might be of interest to you with regard to embedded documents.
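
For anyone hitting the same failure, the heap-size suggestion above can be checked with a few lines of plain Java before involving POI at all. This is only a rough sketch: the class name HeapCheck, the helper likelyFits and the expansion factor of 4 are hypothetical and chosen for illustration, not taken from POI. The real in-memory footprint of a parsed document depends on the format and on the extraction API used.

```java
import java.io.File;

// Hypothetical helper, not part of POI: reports the JVM's maximum heap and
// makes a crude guess at whether a file of a given size can be processed.
public class HeapCheck {

    // Maximum heap the JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Crude heuristic: assume the parsed document needs roughly
    // expansionFactor times its on-disk size. The factor is an assumption.
    static boolean likelyFits(long fileBytes, long maxHeapBytes, int expansionFactor) {
        // Divide rather than multiply so very large sizes cannot overflow.
        return fileBytes <= maxHeapBytes / expansionFactor;
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
        if (args.length > 0) {
            long size = new File(args[0]).length();
            long max = Runtime.getRuntime().maxMemory();
            System.out.println("File size: " + size + " bytes, likely fits: "
                    + likelyFits(size, max, 4));
        }
    }
}
```

The heap limit is raised with the standard -Xmx launcher option, for example `java -Xmx1024m HeapCheck large.doc`, where large.doc is a placeholder file name. If extraction works with a larger -Xmx value and fails with a smaller one, the failure is a memory limit and not a size restriction in the library.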