Bug 50428 - Need a way to avoid OutOfMemoryError's in RawDataBlockList
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: POIFS
Version: 3.2-FINAL
Hardware: PC Windows XP
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
Reported: 2010-12-07 18:03 UTC by Dmitry Goldenberg
Modified: 2011-06-20 16:53 UTC (History)



Description Dmitry Goldenberg 2010-12-07 18:03:37 UTC
We are processing very large MS Office files under a tight heap limit of 100 MB.

This causes OutOfMemoryErrors in RawDataBlockList:

java.lang.OutOfMemoryError: KERNEL-10 : Java heap space
at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:68)
at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:53)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:155)

RawDataBlockList loads every block up to the end of the file. Is there any way to limit this, perhaps with an optional "sliding window" of blocks that is repopulated on demand?
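
To make the "sliding window" idea concrete, here is a minimal sketch of a bounded block cache that loads blocks on demand and evicts the least recently used one. BlockWindow, getBlock and cachedBlocks are hypothetical names and not part of POI; the sketch also assumes the header occupies the first block of the file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded block cache; NOT part of POI. */
public class BlockWindow {
    private final FileChannel channel;
    private final int blockSize;
    private final Map<Integer, ByteBuffer> cache;

    public BlockWindow(FileChannel channel, int blockSize, final int maxBlocks) {
        this.channel = channel;
        this.blockSize = blockSize;
        // Access-ordered map: once more than maxBlocks are held,
        // the least recently used block is dropped.
        this.cache = new LinkedHashMap<Integer, ByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, ByteBuffer> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    /** Returns block {@code index} (0-based, counted after the header block), loading it if needed. */
    public ByteBuffer getBlock(int index) throws IOException {
        ByteBuffer block = cache.get(index);
        if (block == null) {
            block = ByteBuffer.allocate(blockSize);
            // Assumption: the header occupies the first block, so data block 0 starts at blockSize.
            long start = (long) (index + 1) * blockSize;
            while (block.hasRemaining()) {
                int n = channel.read(block, start + block.position());
                if (n < 0) {
                    break; // end of file: the last block may be short
                }
            }
            block.flip();
            cache.put(index, block);
        }
        // Hand out a read-only view so callers cannot disturb the cached buffer.
        return block.asReadOnlyBuffer();
    }

    /** Number of blocks currently held in memory. */
    public int cachedBlocks() {
        return cache.size();
    }
}
```

With 512-byte blocks and a window of, say, 1024 blocks, the cache itself would hold roughly 512 KB regardless of file size, at the cost of re-reading evicted blocks from disk.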

As a quicker fix, it would be sufficient to have a way to ascertain whether a given Office file is Excel, Word, or PowerPoint. Currently, once we know from the magic bytes that it is an Office document, we try to read the 'application name' from the POI filesystem:

  public boolean isRecognized(DocumentPayload payload) {
    String application = null;

    try {
      application = getApplicationName(payload.getContentStream(), payload.getDocId());
    } catch (Exception ex) {
      log.warn(TextExtractionError.ERROR, ex, "NON-FATAL error (proceeding with text extraction). Failed to determine application for document. Payload: %s.", payload);
    }

    return application != null && application.toLowerCase().contains(EXCEL) && application.toLowerCase().contains(MICROSOFT);
  }

Where getApplicationName is defined as:

protected String getApplicationName(InputStream is, String docId) throws IOException {
    String application = null;

    try {
      POIFSFileSystem filesystem = new POIFSFileSystem(is);

      // First, try to extract the application name from the metadata
      SummaryInformation si = null;
      PropertySet ps2 = getPropertySet(filesystem, SummaryInformation.DEFAULT_STREAM_NAME, docId);
      if (ps2 instanceof SummaryInformation) {
        si = (SummaryInformation) ps2;
      }
      application = (si == null) ? null : StringUtils.trim(si.getApplicationName());

      // Unfortunately, the app name may not be present in the document metadata.

      // If that is the case, see if the file system has an entry by which we can tell
      // that the document matches the type.
      if (StringUtils.isEmpty(application) && hasDistinguishedEntry(filesystem)) {
        application = getDefaultApplicationName();
      }

    } finally {
      is.close();
    }

    return application;
  }

And 'hasDistinguishedEntry' is as follows, e.g. for Excel:

protected boolean hasDistinguishedEntry(POIFSFileSystem filesystem) {
    // Stream names that identify an Excel workbook, in the order they are tried
    String[] names = { "Workbook", "WORKBOOK", "Book" };

    for (String name : names) {
      try {
        filesystem.getRoot().getEntry(name);
        return true;
      } catch (FileNotFoundException fe) {
        // Not present under this name; try the next one
      }
    }

    return false;
  }

If we can avoid doing all this, then the OutOfMemoryError issue becomes less significant. Otherwise, we need a way to curtail memory consumption on the block list side while still having access to properties and entries.

Any advice/recommendations?
Comment 1 Dmitry Goldenberg 2010-12-07 18:04:54 UTC
It's blocking a customer patch here. Would greatly appreciate your help!
Comment 2 Nick Burch 2010-12-07 19:19:52 UTC
For now you'll just have to bump up the heap size.

There have been discussions on the dev list over the years about ways to reduce the memory footprint of POIFS. However, as yet no-one has been willing to sponsor the work for it.

If all you want is the names of the streams in the file, then you might be able to cheat a bit to get them. It would mean some NIO work, taking advantage of the fact that the FAT entries are special: you ought to be able to find them via the header without touching the main data parts. It would still take some work, though.
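
As a rough illustration of that header-based shortcut, the sketch below lists directory entry names by reading only the header, the directory sectors, and the individual FAT entries needed to follow the directory chain. Ole2DirectoryNames is a hypothetical helper, not a POI class, and the field offsets are taken from the published OLE2 compound file layout rather than from POI's own code, so treat it as a starting point rather than a tested implementation.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, NOT part of POI: lists directory entry names of an OLE2 file. */
public class Ole2DirectoryNames {
    // Magic bytes D0 CF 11 E0 A1 B1 1A E1 read as a little-endian long.
    private static final long SIGNATURE = 0xE11AB1A1E011CFD0L;

    private final SeekableByteChannel ch;
    private final ByteBuffer header;
    private final int sectorSize;

    private Ole2DirectoryNames(SeekableByteChannel ch) throws IOException {
        this.ch = ch;
        this.header = read(0, 512);
        if (header.getLong(0) != SIGNATURE) throw new IOException("Not an OLE2 file");
        this.sectorSize = 1 << header.getShort(30); // sector shift: 9 means 512 bytes
    }

    /** Returns the names of all directory entries, including those in nested storages. */
    public static List<String> list(SeekableByteChannel ch) throws IOException {
        Ole2DirectoryNames r = new Ole2DirectoryNames(ch);
        List<String> names = new ArrayList<>();
        int sector = r.header.getInt(48); // first directory sector
        int guard = 0;                    // protects against cyclic chains in corrupt files
        while (sector >= 0 && guard++ < 1_000_000) {
            ByteBuffer dir = r.read(r.offsetOf(sector), r.sectorSize);
            for (int off = 0; off + 128 <= dir.limit(); off += 128) {
                int type = dir.get(off + 66);               // 0 means unused entry
                int nameBytes = dir.getShort(off + 64) - 2; // stored length includes the terminator
                if (type != 0 && nameBytes > 0 && nameBytes <= 62) {
                    byte[] raw = new byte[nameBytes];
                    dir.position(off);
                    dir.get(raw, 0, nameBytes);
                    names.add(new String(raw, StandardCharsets.UTF_16LE));
                }
            }
            sector = r.nextSector(sector); // negative value (end of chain) stops the loop
        }
        return names;
    }

    private long offsetOf(int sector) throws IOException {
        if (sector < 0) throw new IOException("Corrupt sector reference: " + sector);
        return (long) (sector + 1) * sectorSize; // the header occupies the first sector
    }

    /** Looks up a single FAT entry without loading the whole FAT. */
    private int nextSector(int sector) throws IOException {
        int perSector = sectorSize / 4;
        int fatSector = fatSectorLocation(sector / perSector);
        return read(offsetOf(fatSector) + 4L * (sector % perSector), 4).getInt(0);
    }

    /** Finds the k-th FAT sector: the first 109 are listed in the header, the rest in DIFAT sectors. */
    private int fatSectorLocation(int k) throws IOException {
        if (k < 109) return header.getInt(76 + 4 * k);
        int perDifat = sectorSize / 4 - 1; // last slot of a DIFAT sector links to the next one
        int difat = header.getInt(68);
        for (int hops = (k - 109) / perDifat; hops > 0; hops--) {
            difat = read(offsetOf(difat) + 4L * perDifat, 4).getInt(0);
        }
        return read(offsetOf(difat) + 4L * ((k - 109) % perDifat), 4).getInt(0);
    }

    private ByteBuffer read(long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len).order(ByteOrder.LITTLE_ENDIAN);
        ch.position(pos);
        while (buf.hasRemaining() && ch.read(buf) >= 0) { /* keep reading */ }
        if (buf.hasRemaining()) throw new IOException("Truncated file at offset " + pos);
        buf.flip();
        return buf;
    }
}
```

A caller could open the file with Files.newByteChannel(path) and check whether the returned list contains "Workbook", "WORKBOOK" or "Book". Because the list is flat, a match may also come from a workbook embedded inside another document, so this is only a heuristic.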
Comment 3 Yegor Kozlov 2011-06-20 16:53:23 UTC
Try NIO reading using NPOIFSFileSystem; see http://poi.apache.org/poifs/how-to.html

It should be more efficient in terms of memory consumption.

Yegor