We're dealing with a scenario where very large MS Office files are being processed with the heap size tightly limited to 100MB. This causes OutOfMemoryErrors in RawDataBlockList:

java.lang.OutOfMemoryError: KERNEL-10 : Java heap space
    at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:68)
    at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:53)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:155)

RawDataBlockList loads all the blocks up to the end of the file. Is there any way to limit this, perhaps with an optional "sliding window" of blocks which gets repopulated on demand?

As a quicker fix, it would be sufficient to have a way to ascertain whether a given Office file is Excel, Word, or PPT. The way we do this now is: once we know from the magic bytes that it's an Office doc, we try to read the 'application name' within the POI filesystem:

    public boolean isRecognized(DocumentPayload payload) {
        String application = null;
        try {
            application = getApplicationName(payload.getContentStream(), payload.getDocId());
        } catch (Exception ex) {
            log.warn(TextExtractionError.ERROR, ex,
                "NON-FATAL error (proceeding with text extraction). Failed to determine application for document. Payload: %s.",
                payload);
        }
        return (application == null) ? false
            : application.toLowerCase().contains(EXCEL) && application.toLowerCase().contains(MICROSOFT);
    }

where getApplicationName is:

    protected String getApplicationName(InputStream is, String docId) throws IOException {
        String application = null;
        try {
            POIFSFileSystem filesystem = new POIFSFileSystem(is);

            // First, try to extract the application name from the metadata
            SummaryInformation si = null;
            PropertySet ps2 = getPropertySet(filesystem, SummaryInformation.DEFAULT_STREAM_NAME, docId);
            if (ps2 instanceof SummaryInformation) {
                si = (SummaryInformation) ps2;
            }
            application = (si == null) ? null : StringUtils.trim(si.getApplicationName());

            // Unfortunately, the app name may not be present in the document metadata.
            // If that is the case, see if the file system has an entry by which we can tell
            // that the document matches the type.
            if (StringUtils.isEmpty(application) && hasDistinguishedEntry(filesystem)) {
                application = getDefaultApplicationName();
            }
        } finally {
            is.close();
        }
        return application;
    }

And 'hasDistinguishedEntry' is as follows, e.g. for Excel:

    protected boolean hasDistinguishedEntry(POIFSFileSystem filesystem) {
        boolean hasIt = true;
        // See if the Workbook entry is there
        try {
            filesystem.getRoot().getEntry("Workbook");
        } catch (FileNotFoundException fe) {
            // Try the upper case form
            try {
                filesystem.getRoot().getEntry("WORKBOOK");
            } catch (FileNotFoundException wfe) {
                // Try Book
                try {
                    filesystem.getRoot().getEntry("Book");
                } catch (FileNotFoundException wfee) {
                    hasIt = false;
                }
            }
        }
        return hasIt;
    }

If we can avoid doing all this, then the OutOfMemoryError issue becomes less significant. Otherwise we need a way to curtail the memory consumption on the block list side and still have access to properties and entries. Any advice/recommendations?
It's blocking a customer patch here. Would greatly appreciate your help!
For now you'll just have to bump up the heap size. There have been discussions on the dev list over the years about ways to reduce the memory footprint of POIFS. However, as yet no-one has been willing to sponsor the work for it. If all you want is the names of the streams in the file, then you might be able to cheat a bit to get them. It'd mean some NIO work, and taking advantage of the FAT entries being special, so you ought to be able to find them via the header without touching the main data parts. It'd still take some work, though.
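To make the "cheat" concrete, here is a rough sketch of what that NIO work might look like. It is not part of POI: the class name Ole2EntryLister and the method listEntryNames are made up for illustration, and the byte offsets are taken from the published OLE2 compound file layout as I understand it (sector shift at byte 30, first directory sector at byte 48, 109 DIFAT entries from byte 76, 128-byte directory entries). It reads only the header, the directory sectors, and the FAT entries needed to follow the directory chain, so memory use should not depend on file size. It assumes the relevant FAT sectors are all reachable from the header's DIFAT entries, and it lists every directory entry, not only the children of the root.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a POI class): lists OLE2 directory entry names. */
final class Ole2EntryLister {
    // Signature bytes D0 CF 11 E0 A1 B1 1A E1, read as a little-endian long.
    private static final long MAGIC = 0xE11AB1A1E011CFD0L;
    private static final int HEADER_DIFAT_ENTRIES = 109;
    private static final int DIR_ENTRY_SIZE = 128;

    static List<String> listEntryNames(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer header = read(ch, 0, 512);
            if (header.getLong(0) != MAGIC) {
                throw new IOException("Not an OLE2 file");
            }
            int shift = header.getShort(30);
            if (shift != 9 && shift != 12) {
                throw new IOException("Unexpected sector shift: " + shift);
            }
            int sectorSize = 1 << shift;
            int fatEntriesPerSector = sectorSize / 4;
            long maxSectors = ch.size() / sectorSize; // guard against cyclic chains
            List<String> names = new ArrayList<>();

            // Special FAT values (end of chain, free, ...) are negative as ints.
            int sector = header.getInt(48);
            for (long visited = 0; sector >= 0; visited++) {
                if (visited > maxSectors) {
                    throw new IOException("Corrupt directory chain");
                }
                ByteBuffer dir = read(ch, offsetOf(sector, sectorSize), sectorSize);
                for (int off = 0; off + DIR_ENTRY_SIZE <= sectorSize; off += DIR_ENTRY_SIZE) {
                    int type = dir.get(off + 66);                   // 0 = unused entry
                    int nameBytes = dir.getShort(off + 64) & 0xFFFF; // includes terminator
                    if (type == 0 || nameBytes < 2 || nameBytes > 64) {
                        continue;
                    }
                    byte[] raw = new byte[nameBytes - 2];
                    dir.position(off);
                    dir.get(raw);
                    names.add(new String(raw, StandardCharsets.UTF_16LE));
                }
                sector = nextSector(ch, header, sector, sectorSize, fatEntriesPerSector);
            }
            return names;
        }
    }

    /** Looks up a single FAT entry instead of loading the whole FAT. */
    private static int nextSector(FileChannel ch, ByteBuffer header, int sector,
                                  int sectorSize, int perSector) throws IOException {
        int fatIndex = sector / perSector;
        if (fatIndex >= HEADER_DIFAT_ENTRIES) {
            throw new IOException("FAT sector beyond header DIFAT; not handled in this sketch");
        }
        int fatSector = header.getInt(76 + 4 * fatIndex);
        if (fatSector < 0) {
            throw new IOException("Missing FAT sector");
        }
        long pos = offsetOf(fatSector, sectorSize) + 4L * (sector % perSector);
        return read(ch, pos, 4).getInt(0);
    }

    /** Sector n starts right after the header, which occupies one sector. */
    private static long offsetOf(int sector, int sectorSize) {
        return (sector + 1L) * sectorSize;
    }

    private static ByteBuffer read(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.hasRemaining()) {
            if (ch.read(buf, pos + buf.position()) < 0) {
                throw new IOException("Unexpected end of file");
            }
        }
        buf.flip();
        return buf;
    }
}
```

With something like this, the Excel check from the original report could become a test for whether the returned names contain "Workbook", "WORKBOOK" or "Book", without ever constructing a POIFSFileSystem. It needs a file on disk rather than an InputStream, and it has not been tried against real-world corrupt documents.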
Try NIO reading using NPOIFSFileSystem; see http://poi.apache.org/poifs/how-to.html. It should be more efficient in terms of memory consumption. Yegor