Bug 54982

Summary: [PATCH] ExtractorFactory does not close files when extracting via OCPPackage.open()
Product: POI Reporter: Dominik Stadler <dominik.stadler>
Component: XSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal CC: dominik.stadler
Priority: P2    
Version: 3.9-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Patch to properly close the OCPPackage
Different approach adding a close() to the POITextExtractor and implementing it in a few sub-classes

Description Dominik Stadler 2013-05-16 08:23:47 UTC
Created attachment 30286 [details]
Patch to properly close the OCPPackage

This leaves files open longer than necessary and causes problems when working with temporary files on Windows as they cannot be deleted unless they are closed properly.

Attached patch fixes this.
Comment 1 Nick Burch 2013-05-16 11:40:25 UTC
I'm not sure this'll work, as some of the extractors (eg XSSFEventBasedExcelExtractor, or fetching the properties extractor) will still need access to the OPCPackage during their processing of the file. If we close the package during the create call, these would then break when you tried to get the text / properties

Also, ExtractorFactory and the associated classes don't receive a lot of work at the moment, almost all the text extraction effort goes into Apache Tika. You might want to switch to Tika instead, that uses POI but now generally has better support + cleaner ways of dealing with temporary resoures
Comment 2 Dominik Stadler 2013-05-25 16:23:24 UTC
Hmm, I did some more tests and unit tests, it seems to work as far as I see, however if you state the class is dormant, I'll stop on this Bug.

The Extractor-stuff was an easy way to read the text from an Excel file in unit tests, having to use another framework like Tika just for this would bloat the unit test framework, so I am currently using a locally fixed version of the ExtractorFactory which works fine for my purposes.

BTW, I also saw something like this in the class, which makes no sense at all, using a locally created lock is not doing anything useful:

		// Save the content
		ReentrantReadWriteLock l = new ReentrantReadWriteLock();
		try {
			l.writeLock().lock();
                        ...


Maybe at least the description at http://poi.apache.org/text-extraction.html should state that the classes are dormant instead of sending users to a broken class that is not updated currently...
Comment 3 Nick Burch 2013-05-26 17:41:03 UTC
I've tried to update the docs on the site, please could you take a look and let me know if you think that covers it?

In terms of freeing resources, maybe we'd be better to add a close method to the extractor base clases? That way, people can explicitly tidy up when they're doing if they want (or fallback to the current lazy tidy), without the issues of us trying to close resources before the user is done making calls with it. Thoughts?
Comment 4 Dominik Stadler 2013-05-28 21:39:04 UTC
Created attachment 30337 [details]
Different approach adding a close() to the POITextExtractor and implementing it in a few sub-classes

Hi, a close() sounds fine to me. I have created a patch which adds a close() interface to POITextExtractor which can be used to free resources later. I have implemented the close() in all classes that seem to make use of closeable-resources. It also adds calls to close() in tests and runs existing unit tests for XSSF-extractors also against the Extractor that is built via the Factory. Finally it also add a small test-suite to quickly execute all extractor-related tests.

Regarding updated documentation, I think that is fine and should it make easier to spot the state of Extractors.
Comment 5 Nick Burch 2013-06-12 17:43:25 UTC
Patch looks good to me, please feel free to commit it yourself! :)
Comment 6 Dominik Stadler 2013-06-17 07:56:31 UTC
Patch is committed on trunk now, Extractor now implements Closeable and tests are adjusted/enhanced accordingly

        M       src/java/org/apache/poi/POITextExtractor.java
        M       src/ooxml/java/org/apache/poi/POIXMLTextExtractor.java
        M       src/ooxml/java/org/apache/poi/xssf/extractor/XSSFEventBasedExcelExtractor.java
        M       src/ooxml/testcases/org/apache/poi/TestXMLPropertiesTextExtractor.java
        M       src/ooxml/testcases/org/apache/poi/xslf/extractor/TestXSLFPowerPointExtractor.java
        M       src/ooxml/testcases/org/apache/poi/xssf/extractor/TestXSSFEventBasedExcelExtractor.java
        A       src/ooxml/testcases/org/apache/poi/xssf/extractor/TestXSSFEventBasedExcelExtractorUsingFactory.java
        M       src/ooxml/testcases/org/apache/poi/xssf/extractor/TestXSSFExcelExtractor.java
        A       src/ooxml/testcases/org/apache/poi/xssf/extractor/TestXSSFExcelExtractorUsingFactory.java
        M       src/ooxml/testcases/org/apache/poi/xwpf/extractor/TestXWPFWordExtractor.java
        M       src/testcases/org/apache/poi/hssf/extractor/TestExcelExtractor.java
Committed r1493669