Created attachment 37154 [details] Part 1 of 3 Attached file with 3.7MB takes more than 10min to be parsed with POI-4.0.1. Profiling with jvisualvm, most time is spent in IOUtils.readFully() (360s), FileBackedDataSource.size() (248s) and FileBackedDataSource.read() (88s). Code to reproduce below: public static void main(String[] args) { File file = new File("F:\\teste-files\\thumbs.db\\New Folder\\Thumbs.db"); try (POIFSFileSystem poiFS = new POIFSFileSystem(file)) { recurseDir(poiFS.getRoot()); } catch (IOException e) { e.printStackTrace(); } } private static void recurseDir(DirectoryEntry dir) { for (Entry entry : dir) { try { if (entry instanceof DirectoryEntry) { recurseDir((DirectoryEntry) entry); } if (entry instanceof DocumentEntry) { DocumentEntry de = (DocumentEntry) entry; HashMap<String, String> props = new HashMap<>(); try (DocumentInputStream dis = new DocumentInputStream(de)) { props.put("name", de.getName()); if (PropertySet.isPropertySetStream(dis)) { dis.mark(10000000); PropertySet ps = null; try { ps = new PropertySet(dis); } catch (UnsupportedEncodingException e) { // ignore } if (ps != null) { for (Section section : ps.getSections()) { for (Property p : section.getProperties()) { String prop = section.getDictionary() != null ? section.getDictionary().get(p.getID()) : String.valueOf(p.getID()); if (p.getValue() != null) props.put("property_" + prop, p.getValue().toString()); } } } dis.reset(); } } } } catch (Exception e) { e.printStackTrace(); } } }
Created attachment 37155 [details] Part 2 of 3
Created attachment 37156 [details] Part 3 of 3
Needed to divide file into 3 7z segments because of file size upload limit.
For me the code runs in 16seconds, what king of hardware/memory/cpu/disk are you using here to let us see where the huge difference comes from?
I could optimize this somewhat by some internal changes in how the file-channel is handled, see changes done via r1877816. However you should be able to make a large improvement by opening the file via InputStream and not via java.io.File. This way Apache POI will use more memory, but process the file much quicker. For me runtime of the test went from 5 seconds to 200ms when using the other constructor!
Thank you, Dominik, for taking a look and for the patch! Unfortunately, I am not able to reproduce the huge slowness anymore, probably because the previous used disk was a hardware RAID0 in a marvel chipset, but I lost it and don't have it anymore (fortunately I recovered the data). Now the test takes 15-20 seconds to finish, with different disks in my machine. And using the InputStream constructor, it ends in 300ms! Thank you very much for pointing it! But as I understood from POI code, the entire file is loaded in heap, right? Sometimes I have large files to process, will check file size first before calling the other constructor. Thanks
FYI for people here - a few years ago we were talking about having three different open modes. Read-only, Read for write elsewhere (copy-on-write) and Read Write (in-place write). If we implemented that, at least some modes (and certainly read-only) could be optimised quite a bit. IIRC writing when growing a file size would also be able to get a big boost Sadly, I couldn't convince $DAYJOB of the benefits to fund the work, and no-one else volunteered, so nothing happened. See https://bz.apache.org/bugzilla/show_bug.cgi?id=60741 and threads on the dev list at around that time for more