We have been extracting many Office documents successfully using POI 3.2. However, extraction fails for one specific large document (over 19 MB). In practice we will also have documents larger than 500 MB (in fact, there is no restriction on size). Technically, since POI is a Java library, I would not expect size to be a concern when getting a handle on the document. I am using event-driven logic for document extraction. I have noticed that when the document size is reduced, POI extracts it; otherwise it fails. Is there a reason for this? Am I missing a basic technical point here? Also, POI seems to treat the HTML content of a Word document as a separate document rather than as simple text. I need to check more on this. If this is the case, please let me know what the reason would be.
Please ask questions on the mailing list. Try checking the list archives too; your question is almost certainly about needing a bigger Java heap size. Also, http://poi.apache.org/poifs/embeded.html might be of interest to you with regard to embedded documents.
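To see whether heap size is the likely culprit, you can compare the JVM's maximum heap against the file size before extracting. The sketch below uses only the standard `Runtime` API and does not call POI. The class name `HeapCheck`, the method names, and the overhead factor of 3 are assumptions made for illustration; the real memory cost of parsing a document depends on the format and on how it is read.

```java
public class HeapCheck {
    // Assumed multiplier: a guess that parsing needs a few times the file size
    // in heap. This is NOT a figure taken from POI documentation.
    static final long ASSUMED_OVERHEAD_FACTOR = 3;

    // Maximum heap the JVM will try to use, in bytes (controlled by -Xmx).
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // Rough check: does the file, scaled by the assumed factor, fit in the heap?
    // Dividing the heap rather than multiplying the file size avoids overflow.
    static boolean likelyFits(long fileSizeBytes, long heapBytes) {
        return fileSizeBytes <= heapBytes / ASSUMED_OVERHEAD_FACTOR;
    }

    public static void main(String[] args) {
        long fileSize = 19L * 1024 * 1024; // the 19 MB document from the question
        long heap = maxHeapBytes();
        System.out.println("Max heap (MB): " + heap / (1024 * 1024));
        System.out.println("Likely fits: " + likelyFits(fileSize, heap));
    }
}
```

If the check suggests the heap is too small, the limit can be raised with the standard `-Xmx` option, for example `java -Xmx1g HeapCheck`.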