Bug 64322 - Very poor performance reading OLE2
Summary: Very poor performance reading OLE2
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: POIFS (show other bugs)
Version: unspecified
Hardware: PC All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-04-07 17:06 UTC by Luis Filipe Nassif
Modified: 2020-05-18 01:44 UTC (History)
0 users



Attachments
Part 1 of 3 (976.56 KB, application/x-7z-compressed)
2020-04-07 17:06 UTC, Luis Filipe Nassif
Details
Part 2 of 3 (976.56 KB, application/octet-stream)
2020-04-07 17:07 UTC, Luis Filipe Nassif
Details
Part 3 of 3 (111.40 KB, application/octet-stream)
2020-04-07 17:07 UTC, Luis Filipe Nassif
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Luis Filipe Nassif 2020-04-07 17:06:37 UTC
Created attachment 37154 [details]
Part 1 of 3

Attached file with 3.7MB takes more than 10min to be parsed with POI-4.0.1. Profiling with jvisualvm, most time is spent in IOUtils.readFully() (360s), FileBackedDataSource.size() (248s) and FileBackedDataSource.read() (88s). Code to reproduce below:

    public static void main(String[] args) {
        
        File file = new File("F:\\teste-files\\thumbs.db\\New Folder\\Thumbs.db");
        try (POIFSFileSystem poiFS = new POIFSFileSystem(file)) {
            recurseDir(poiFS.getRoot());

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    
    private static void recurseDir(DirectoryEntry dir) {
        for (Entry entry : dir) {
            try {
                if (entry instanceof DirectoryEntry) {
                    recurseDir((DirectoryEntry) entry);
                }
                if (entry instanceof DocumentEntry) {
                    DocumentEntry de = (DocumentEntry) entry;
                    HashMap<String, String> props = new HashMap<>();
                    try (DocumentInputStream dis = new DocumentInputStream(de)) {
                        props.put("name", de.getName());

                        if (PropertySet.isPropertySetStream(dis)) {
                            dis.mark(10000000);
                            PropertySet ps = null;
                            try {
                                ps = new PropertySet(dis);

                            } catch (UnsupportedEncodingException e) {
                                // ignore
                            }
                            if (ps != null) {
                                for (Section section : ps.getSections()) {
                                    for (Property p : section.getProperties()) {
                                        String prop = section.getDictionary() != null
                                                ? section.getDictionary().get(p.getID())
                                                : String.valueOf(p.getID());
                                        if (p.getValue() != null)
                                            props.put("property_" + prop, p.getValue().toString());
                                    }
                                }
                            }
                            dis.reset();
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }

        }

    }
Comment 1 Luis Filipe Nassif 2020-04-07 17:07:16 UTC
Created attachment 37155 [details]
Part 2 of 3
Comment 2 Luis Filipe Nassif 2020-04-07 17:07:39 UTC
Created attachment 37156 [details]
Part 3 of 3
Comment 3 Luis Filipe Nassif 2020-04-07 17:08:54 UTC
Needed to divide file into 3 7z segments because of file size upload limit.
Comment 4 Dominik Stadler 2020-05-16 06:28:24 UTC
For me the code runs in 16seconds, what king of hardware/memory/cpu/disk are you using here to let us see where the huge difference comes from?
Comment 5 Dominik Stadler 2020-05-16 13:09:24 UTC
I could optimize this somewhat by some internal changes in how the file-channel is handled, see changes done via r1877816. 

However you should be able to make a large improvement by opening the file via InputStream and not via java.io.File. This way Apache POI will use more memory, but process the file much quicker. 

For me runtime of the test went from 5 seconds to 200ms when using the other constructor!
Comment 6 Luis Filipe Nassif 2020-05-17 19:49:44 UTC
Thank you, Dominik, for taking a look and for the patch! Unfortunately, I am not able to reproduce the huge slowness anymore, probably because the previous used disk was a hardware RAID0 in a marvel chipset, but I lost it and don't have it anymore (fortunately I recovered the data).

Now the test takes 15-20 seconds to finish, with different disks in my machine. And using the InputStream constructor, it ends in 300ms! Thank you very much for pointing it!

But as I understood from POI code, the entire file is loaded in heap, right? Sometimes I have large files to process, will check file size first before calling the other constructor.

Thanks
Comment 7 Nick Burch 2020-05-18 01:44:51 UTC
FYI for people here - a few years ago we were talking about having three different open modes. Read-only, Read for write elsewhere (copy-on-write) and Read Write (in-place write). If we implemented that, at least some modes (and certainly read-only) could be optimised quite a bit. IIRC writing when growing a file size would also be able to get a big boost

Sadly, I couldn't convince $DAYJOB of the benefits to fund the work, and no-one else volunteered, so nothing happened. See https://bz.apache.org/bugzilla/show_bug.cgi?id=60741 and threads on the dev list at around that time for more