Created attachment 37154 [details] Part 1 of 3

The attached 3.7MB file takes more than 10 minutes to be parsed with POI-4.0.1. Profiling with jvisualvm, most of the time is spent in IOUtils.readFully() (360s), FileBackedDataSource.size() (248s) and FileBackedDataSource.read() (88s). Code to reproduce below:

import java.io.File;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.HashMap;

import org.apache.poi.hpsf.Property;
import org.apache.poi.hpsf.PropertySet;
import org.apache.poi.hpsf.Section;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.Entry;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public static void main(String[] args) {
    File file = new File("F:\\teste-files\\thumbs.db\\New Folder\\Thumbs.db");
    try (POIFSFileSystem poiFS = new POIFSFileSystem(file)) {
        recurseDir(poiFS.getRoot());
    } catch (IOException e) {
        e.printStackTrace();
    }
}

private static void recurseDir(DirectoryEntry dir) {
    for (Entry entry : dir) {
        try {
            if (entry instanceof DirectoryEntry) {
                recurseDir((DirectoryEntry) entry);
            }
            if (entry instanceof DocumentEntry) {
                DocumentEntry de = (DocumentEntry) entry;
                HashMap<String, String> props = new HashMap<>();
                try (DocumentInputStream dis = new DocumentInputStream(de)) {
                    props.put("name", de.getName());
                    if (PropertySet.isPropertySetStream(dis)) {
                        dis.mark(10000000);
                        PropertySet ps = null;
                        try {
                            ps = new PropertySet(dis);
                        } catch (UnsupportedEncodingException e) {
                            // ignore
                        }
                        if (ps != null) {
                            for (Section section : ps.getSections()) {
                                for (Property p : section.getProperties()) {
                                    String prop = section.getDictionary() != null
                                            ? section.getDictionary().get(p.getID())
                                            : String.valueOf(p.getID());
                                    if (p.getValue() != null)
                                        props.put("property_" + prop, p.getValue().toString());
                                }
                            }
                        }
                        dis.reset();
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Created attachment 37155 [details] Part 2 of 3
Created attachment 37156 [details] Part 3 of 3
Needed to divide the file into three 7z segments because of the file-size upload limit.
For me the code runs in 16 seconds. What kind of hardware/memory/CPU/disk are you using here, so we can see where the huge difference comes from?
I could optimize this somewhat with some internal changes to how the file channel is handled; see the changes done via r1877816. However, you should be able to make a much larger improvement by opening the file via an InputStream rather than via java.io.File. This way Apache POI will use more memory, but process the file much more quickly. For me, the runtime of the test went from 5 seconds to 200ms when using the other constructor!
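A minimal sketch of that InputStream-based open (the file path is just a placeholder; note that this constructor buffers the whole file in memory):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class StreamBasedOpen {
    public static void main(String[] args) {
        File file = new File("Thumbs.db"); // placeholder path
        // The InputStream constructor reads the whole file into memory up front,
        // avoiding the repeated FileChannel reads of the File-based constructor.
        try (FileInputStream fis = new FileInputStream(file);
             POIFSFileSystem poiFS = new POIFSFileSystem(fis)) {
            // walk poiFS.getRoot() as in the reproduction code above
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}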
Thank you, Dominik, for taking a look and for the patch! Unfortunately, I am not able to reproduce the huge slowness anymore, probably because the previously used disk was a hardware RAID0 on a Marvell chipset, but I lost it and don't have it anymore (fortunately I recovered the data). Now the test takes 15-20 seconds to finish with different disks in my machine. And using the InputStream constructor, it finishes in 300ms! Thank you very much for pointing that out! But as I understand from the POI code, the entire file is loaded into the heap, right? Sometimes I have large files to process, so I will check the file size before calling the other constructor. Thanks
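A rough sketch of that size check (the 100MB threshold below is an arbitrary illustration, not a POI recommendation):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class SizeAwareOpen {
    // Arbitrary cut-off: below this, buffer the whole file in memory for speed.
    private static final long MAX_IN_MEMORY = 100L * 1024 * 1024;

    static POIFSFileSystem open(File file) throws IOException {
        if (file.length() <= MAX_IN_MEMORY) {
            // Small file: the InputStream constructor loads it fully into the heap.
            try (InputStream is = new FileInputStream(file)) {
                return new POIFSFileSystem(is);
            }
        }
        // Large file: stay with the file-backed constructor to keep memory usage low.
        return new POIFSFileSystem(file);
    }
}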
FYI for people here - a few years ago we were talking about having three different open modes: read-only, read for write elsewhere (copy-on-write), and read-write (in-place write). If we implemented that, at least some modes (and certainly read-only) could be optimised quite a bit. IIRC, writing when growing a file would also be able to get a big boost. Sadly, I couldn't convince $DAYJOB of the benefits to fund the work, and no-one else volunteered, so nothing happened. See https://bz.apache.org/bugzilla/show_bug.cgi?id=60741 and threads on the dev list at around that time for more.
Hi - I don't think this issue is fixed. I still have massive performance issues processing OLE2 files which have a very large number of small entries in them (for example, MSG files with huge numbers of recipients). The culprit is POIFSMiniStore, which has the code below. You can see that to get the data for a specific block, the for loop reads all block data preceding that block before returning the block's data. If you have code like mine which iterates over all document entries in the OLE2 file, you end up with O(N^2) complexity - reading massive parts of the file over and over again. Yes, it is true that using the InputStream argument to POIFSFileSystem dodges this, since POI reads the entire file into memory, but that is not practical when processing massive OLE2 files.

Can the code below be fixed in some way to just read the requested block rather than all those preceding it? It seems like a version of StreamBlockByteBufferIterator which doesn't read the current block's data but returns the block's offset would be a vast improvement.

/**
 * Load the block at the given offset.
 */
protected ByteBuffer getBlockAt(final int offset) {
    // Which big block is this?
    int byteOffset = offset * POIFSConstants.SMALL_BLOCK_SIZE;
    int bigBlockNumber = byteOffset / _filesystem.getBigBlockSize();
    int bigBlockOffset = byteOffset % _filesystem.getBigBlockSize();

    // Now locate the data block for it
    Iterator<ByteBuffer> it = _mini_stream.getBlockIterator();
    for (int i = 0; i < bigBlockNumber; i++) {
        it.next();
    }
    ByteBuffer dataBlock = it.next();
    assert(dataBlock != null);

    // Position ourselves, and take a slice
    dataBlock.position(dataBlock.position() + bigBlockOffset);
    ByteBuffer miniBuffer = dataBlock.slice();
    miniBuffer.limit(POIFSConstants.SMALL_BLOCK_SIZE);
    return miniBuffer;
}
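Purely to illustrate the direction suggested above (the methods getBlockOffsetIterator and readBigBlockAt below are hypothetical, not existing POI API), the lookup could walk the chain by offsets and read only the one big block that is actually needed:

// Hypothetical sketch only: getBlockOffsetIterator() and readBigBlockAt()
// do not exist in POI; they stand in for "follow the mini-stream chain
// without reading each preceding block's bytes".
protected ByteBuffer getBlockAt(final int offset) {
    int byteOffset = offset * POIFSConstants.SMALL_BLOCK_SIZE;
    int bigBlockNumber = byteOffset / _filesystem.getBigBlockSize();
    int bigBlockOffset = byteOffset % _filesystem.getBigBlockSize();

    // Skip over preceding blocks cheaply, tracking only their file offsets.
    Iterator<Long> offsets = _mini_stream.getBlockOffsetIterator(); // hypothetical
    for (int i = 0; i < bigBlockNumber; i++) {
        offsets.next();
    }
    long fileOffset = offsets.next();

    // Read exactly the one big block that backs the requested mini block.
    ByteBuffer dataBlock = _filesystem.readBigBlockAt(fileOffset); // hypothetical
    dataBlock.position(dataBlock.position() + bigBlockOffset);
    ByteBuffer miniBuffer = dataBlock.slice();
    miniBuffer.limit(POIFSConstants.SMALL_BLOCK_SIZE);
    return miniBuffer;
}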
This issue was about the initial parsing of the document, not about doing some specific processing of the file. It would be best if you could create a new issue with a file and code snippet that allows reproducing it.
Raised https://bz.apache.org/bugzilla/show_bug.cgi?id=65184
Created attachment 37770 [details] Performance fix for OLE2 files with a large number of small entries

I've attached a patch which cuts processing of an MSG file with a massive number of recipients (which translates to a massive number of OLE2 entries in the mini-store) from 20+ minutes to under 2 minutes. The patch changes things so that when returning the bytes for an entry in the mini-store, rather than re-reading all preceding entries entirely, it only reads the entry specified. When iterating over all entries in the mini-store, this cuts down on excessive re-reads.
Please use https://bz.apache.org/bugzilla/show_bug.cgi?id=65184