Bug 64322 - Very poor performance reading OLE2
Summary: Very poor performance reading OLE2
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: POIFS
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-04-07 17:06 UTC by Luis Filipe Nassif
Modified: 2021-03-14 23:07 UTC



Attachments
Part 1 of 3 (976.56 KB, application/x-7z-compressed)
2020-04-07 17:06 UTC, Luis Filipe Nassif
Part 2 of 3 (976.56 KB, application/octet-stream)
2020-04-07 17:07 UTC, Luis Filipe Nassif
Part 3 of 3 (111.40 KB, application/octet-stream)
2020-04-07 17:07 UTC, Luis Filipe Nassif
Performance fix for OLE2 files with a large number of small entries (5.13 KB, patch)
2021-03-14 22:35 UTC, sits

Description Luis Filipe Nassif 2020-04-07 17:06:37 UTC
Created attachment 37154 [details]
Part 1 of 3

The attached 3.7 MB file takes more than 10 minutes to parse with POI 4.0.1. Profiling with jvisualvm, most of the time is spent in IOUtils.readFully() (360s), FileBackedDataSource.size() (248s) and FileBackedDataSource.read() (88s). Code to reproduce below:

import java.io.File;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.HashMap;

import org.apache.poi.hpsf.Property;
import org.apache.poi.hpsf.PropertySet;
import org.apache.poi.hpsf.Section;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.Entry;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class Ole2PerfRepro {

    public static void main(String[] args) {
        File file = new File("F:\\teste-files\\thumbs.db\\New Folder\\Thumbs.db");
        try (POIFSFileSystem poiFS = new POIFSFileSystem(file)) {
            recurseDir(poiFS.getRoot());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Walk the whole POIFS tree and read the properties of every document entry
    private static void recurseDir(DirectoryEntry dir) {
        for (Entry entry : dir) {
            try {
                if (entry instanceof DirectoryEntry) {
                    recurseDir((DirectoryEntry) entry);
                }
                if (entry instanceof DocumentEntry) {
                    DocumentEntry de = (DocumentEntry) entry;
                    HashMap<String, String> props = new HashMap<>();
                    try (DocumentInputStream dis = new DocumentInputStream(de)) {
                        props.put("name", de.getName());

                        if (PropertySet.isPropertySetStream(dis)) {
                            dis.mark(10000000);
                            PropertySet ps = null;
                            try {
                                ps = new PropertySet(dis);
                            } catch (UnsupportedEncodingException e) {
                                // ignore
                            }
                            if (ps != null) {
                                for (Section section : ps.getSections()) {
                                    for (Property p : section.getProperties()) {
                                        String prop = section.getDictionary() != null
                                                ? section.getDictionary().get(p.getID())
                                                : String.valueOf(p.getID());
                                        if (p.getValue() != null) {
                                            props.put("property_" + prop, p.getValue().toString());
                                        }
                                    }
                                }
                            }
                            dis.reset();
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
Comment 1 Luis Filipe Nassif 2020-04-07 17:07:16 UTC
Created attachment 37155 [details]
Part 2 of 3
Comment 2 Luis Filipe Nassif 2020-04-07 17:07:39 UTC
Created attachment 37156 [details]
Part 3 of 3
Comment 3 Luis Filipe Nassif 2020-04-07 17:08:54 UTC
I needed to split the file into three 7z segments because of the upload size limit.
Comment 4 Dominik Stadler 2020-05-16 06:28:24 UTC
For me the code runs in 16 seconds. What kind of hardware/memory/CPU/disk are you using here, so we can see where the huge difference comes from?
Comment 5 Dominik Stadler 2020-05-16 13:09:24 UTC
I could optimize this somewhat with some internal changes to how the file channel is handled; see the changes made via r1877816.

However, you should be able to get a much larger improvement by opening the file via an InputStream instead of via java.io.File. That way Apache POI will use more memory, but will process the file much more quickly.

For me, the runtime of the test went from 5 seconds to 200ms when using the other constructor!
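
For illustration, a minimal sketch of both ways to open the file (the Thumbs.db path stands in for the reporter's file; the entry-count print is only a placeholder workload):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    public class OpenComparison {
        public static void main(String[] args) throws IOException {
            File file = new File("Thumbs.db");

            // File-backed: low memory use, but block reads go back to the channel
            try (POIFSFileSystem fs = new POIFSFileSystem(file)) {
                System.out.println(fs.getRoot().getEntryCount());
            }

            // InputStream-backed: POI buffers the whole file in memory up front,
            // which is what took this test from ~5s down to ~200ms
            try (InputStream is = new FileInputStream(file);
                 POIFSFileSystem fs = new POIFSFileSystem(is)) {
                System.out.println(fs.getRoot().getEntryCount());
            }
        }
    }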
Comment 6 Luis Filipe Nassif 2020-05-17 19:49:44 UTC
Thank you, Dominik, for taking a look and for the patch! Unfortunately, I am not able to reproduce the huge slowness anymore, probably because the disk I used before was a hardware RAID0 on a Marvell chipset; I lost that setup and don't have it anymore (fortunately I recovered the data).

Now the test takes 15-20 seconds to finish with the other disks in my machine. And using the InputStream constructor, it finishes in 300ms! Thank you very much for pointing that out!

But as I understand from the POI code, that constructor loads the entire file onto the heap, right? Sometimes I have large files to process, so I will check the file size before calling the other constructor.

Thanks
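
A minimal sketch of that size check (the 100 MB threshold is an arbitrary example value, not a number from this thread):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    public class Ole2Opener {
        // Arbitrary example threshold - tune to your own heap budget
        private static final long MAX_IN_MEMORY = 100L * 1024 * 1024;

        static POIFSFileSystem open(File file) throws IOException {
            if (file.length() <= MAX_IN_MEMORY) {
                // Small enough: buffer the whole file in memory for the fast path
                try (InputStream is = new FileInputStream(file)) {
                    return new POIFSFileSystem(is);
                }
            }
            // Too large: stay file-backed so memory use remains bounded
            return new POIFSFileSystem(file);
        }
    }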
Comment 7 Nick Burch 2020-05-18 01:44:51 UTC
FYI for people here - a few years ago we were talking about having three different open modes: read-only, read with writes elsewhere (copy-on-write), and read-write (in-place writes). If we implemented that, at least some modes (and certainly read-only) could be optimised quite a bit. IIRC, writes that grow the file would also be able to get a big boost.

Sadly, I couldn't convince $DAYJOB of the benefits to fund the work, and no-one else volunteered, so nothing happened. See https://bz.apache.org/bugzilla/show_bug.cgi?id=60741 and threads on the dev list from around that time for more.
Comment 8 sits 2021-03-09 03:43:49 UTC
Hi - I don't think this issue is fixed. I still have massive performance issues processing OLE2 files which have a very large number of small entries in them (for example, MSG files with huge numbers of recipients).

The culprit is POIFSMiniStore, which has the code below. To get the data for a specific block, the for loop reads all block data preceding that block before returning the block's data. If you have code like mine which iterates over all document entries in the OLE2 file, you end up with O(N^2) complexity: fetching block k re-reads the k blocks before it, so touching all N blocks costs roughly N^2/2 block reads, re-reading massive parts of the file over and over again (a standalone illustration follows the snippet below).

Yes, it is true that using the InputStream argument to POIFSFileSystem dodges this, since POI reads the entire file into memory, but that is not practical when processing massive OLE2 files.

Can the code below be fixed in some way to read just the requested block rather than all of those preceding it?

It seems like a version of StreamBlockByteBufferIterator which doesn't read the current block's data but returns the block's offset would be a vast improvement.

    /**
     * Load the block at the given offset.
     */
    protected ByteBuffer getBlockAt(final int offset) {
       // Which big block is this?
       int byteOffset = offset * POIFSConstants.SMALL_BLOCK_SIZE;
       int bigBlockNumber = byteOffset / _filesystem.getBigBlockSize();
       int bigBlockOffset = byteOffset % _filesystem.getBigBlockSize();

       // Now locate the data block for it
       Iterator<ByteBuffer> it = _mini_stream.getBlockIterator();
       for(int i=0; i<bigBlockNumber; i++) {
          it.next();
       }
       ByteBuffer dataBlock = it.next();
       assert(dataBlock != null);

       // Position ourselves, and take a slice
       dataBlock.position(
             dataBlock.position() + bigBlockOffset
       );
       ByteBuffer miniBuffer = dataBlock.slice();
       miniBuffer.limit(POIFSConstants.SMALL_BLOCK_SIZE);
       return miniBuffer;
    }
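
To make the cost concrete, here is a standalone illustration (plain Java, not POI code) of what that loop implies when every block is requested once, as an iterating reader does:

    public class QuadraticReadDemo {
        public static void main(String[] args) {
            for (int n : new int[] { 1_000, 10_000, 100_000 }) {
                long steps = 0;
                for (int block = 0; block < n; block++) {
                    // getBlockAt(block) walks a fresh iterator 'block + 1' times,
                    // reading each intervening block's data as it goes
                    steps += block + 1;
                }
                System.out.printf("%,d blocks -> %,d block reads (~n^2/2)%n", n, steps);
            }
        }
    }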
Comment 9 Dominik Stadler 2021-03-09 08:30:41 UTC
This issue was about the initial parsing of the document, not about specific processing done on the file afterwards.

It would be best if you could create a new issue with a file and code snippet that allow us to reproduce this.
Comment 10 PJ Fanning 2021-03-13 18:02:38 UTC
Raised https://bz.apache.org/bugzilla/show_bug.cgi?id=65184
Comment 11 sits 2021-03-14 22:35:17 UTC
Created attachment 37770 [details]
Performance fix for OLE2 files with a large number of small entries

I've attached a patch which takes processing an MSG file with a massive number of recipients (which translates to a massive number of OLE2 entries in the mini-store) from 20+ minutes to under 2 minutes.

The patch changes things so that when returning the bytes for an entry in the mini-store, rather than re-reading all preceding entries entirely, it only reads the entry specified. When iterating over all entries in the mini-store, this cuts out the excessive re-reads.
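
Conceptually, the idea looks something like this (a sketch only, not the attached patch; miniStreamStartBlock and nextBlockInChain are hypothetical stand-ins for the mini stream's start block and the FAT chain lookup):

    protected ByteBuffer getBlockAt(final int offset) throws IOException {
        // Which big block of the mini stream holds this small block?
        int byteOffset = offset * POIFSConstants.SMALL_BLOCK_SIZE;
        int bigBlockNumber = byteOffset / _filesystem.getBigBlockSize();
        int bigBlockOffset = byteOffset % _filesystem.getBigBlockSize();

        // Walk only the chain *indices* - no block data is read here
        int block = miniStreamStartBlock;            // hypothetical field
        for (int i = 0; i < bigBlockNumber; i++) {
            block = nextBlockInChain(block);         // hypothetical FAT lookup
        }

        // Read just the single big block containing the requested mini block
        ByteBuffer dataBlock = _filesystem.getBlockAt(block);
        dataBlock.position(dataBlock.position() + bigBlockOffset);
        ByteBuffer miniBuffer = dataBlock.slice();
        miniBuffer.limit(POIFSConstants.SMALL_BLOCK_SIZE);
        return miniBuffer;
    }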
Comment 12 PJ Fanning 2021-03-14 23:07:10 UTC
Please use https://bz.apache.org/bugzilla/show_bug.cgi?id=65184