Bug 64020

Summary: getHostAddress called unexpectedly, causing significant performance hit
Product: POI
Reporter: Jamie <jamie>
Component: XWPF
Assignee: POI Developers List <dev>
Status: RESOLVED WONTFIX    
Severity: normal    
Priority: P2    
Version: 4.1.1-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: stack trace

Description Jamie 2019-12-19 11:22:20 UTC
Created attachment 36923: stack trace

Our server uses POI for text extraction. When processing some documents, performance deteriorates because of an unexpected call to URLStreamHandler.getHostAddress(). Please refer to the attached stack trace for an illustration of how this happens. It is due to a known oddity in the way URL.hashCode() is implemented: it attempts to resolve the URL's host name to an IP address so that hosts can be compared for equality. Would a possible workaround be to use the URI class instead of URL?
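
To illustrate the suggested workaround, here is a minimal sketch, not taken from POI, XMLBeans or Tomcat. It keeps locations in a HashSet as java.net.URI values, whose equals() and hashCode() compare the text of the components and never perform a DNS lookup. The class name, method name and the example.invalid host are made up for the illustration.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

public class UriKeyDemo {
    // Counts distinct locations by storing them as URIs.
    // URI.hashCode()/equals() are purely textual (scheme and host are
    // compared case-insensitively), so no host name is ever resolved.
    static int countDistinct(String... locations) {
        Set<URI> seen = new HashSet<>();
        for (String location : locations) {
            seen.add(URI.create(location));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Hypothetical locations; the .invalid host never resolves.
        System.out.println(countDistinct(
                "http://example.invalid/a.dtd",
                "HTTP://EXAMPLE.INVALID/a.dtd",
                "http://example.invalid/b.dtd"));
    }
}
```

The same set declared as Set<URL> could block on name resolution for every add() or contains() call, which is the behaviour visible in the stack trace.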
Comment 1 PJ Fanning 2019-12-19 11:53:23 UTC
In your stack trace, it appears to be org.apache.catalina.loader.WebappClassLoaderBase that is using the HashSet, not XMLBeans or POI code.
Comment 2 PJ Fanning 2019-12-19 11:55:10 UTC
I'm not sure it would help, but it might be useful if we added some options to XMLBeans that let it configure the SAX parser not to read external files at all.
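
For reference, a rough sketch of what "not reading external files" can look like with plain JAXP/SAX. This is not an existing XMLBeans option, only an assumption about how such an option might configure the underlying parser. It disables the two standard SAX external-entity features and installs an EntityResolver that answers every external reference with an empty source. The class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class NoExternalFetchDemo {
    // Builds an XMLReader that should never fetch external DTDs or entities.
    static XMLReader newOfflineReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Standard SAX2 feature names.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        // Any external reference that is still requested gets empty content.
        reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return reader;
    }

    // Parses the XML text offline and returns the number of elements seen.
    static int countElements(String xml) throws Exception {
        final int[] count = {0};
        XMLReader reader = newOfflineReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                count[0]++;
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document whose DTD location does not exist.
        String xml = "<!DOCTYPE r SYSTEM \"http://example.invalid/missing.dtd\"><r><a/><b/></r>";
        System.out.println(countElements(xml));
    }
}
```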
Comment 3 Jamie 2019-12-19 12:02:49 UTC
My apologies. I guess I was skimming the stack too quickly and missed that. Yes, it would be a great help if there was an option not to read external files. It would be especially useful when performing text extraction on older documents for which the external references are likely to no longer exist. It could also be beneficial if some sort of parsing timeout could be implemented.
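
Until something like that exists, a caller-side timeout is one possible stopgap. The sketch below is only an assumption about how an application might wrap its own extraction call and is not a POI feature. It runs the work on a separate thread and gives up after a deadline. The names extractWithTimeout and fallback are hypothetical. Note that cancel(true) only interrupts the worker, so a thread blocked in a name lookup may keep running in the background until the lookup returns.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ExtractionTimeoutDemo {
    // Runs the extraction on a daemon thread and returns the fallback
    // value if it does not finish within timeoutMillis.
    static String extractWithTimeout(Callable<String> extraction,
                                     long timeoutMillis,
                                     String fallback) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "text-extraction");
            t.setDaemon(true); // do not keep the JVM alive for a stuck parse
            return t;
        });
        Future<String> future = executor.submit(extraction);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the worker thread
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a slow extraction call.
        String result = extractWithTimeout(() -> {
            Thread.sleep(5_000);
            return "extracted text";
        }, 100, "");
        System.out.println(result.isEmpty() ? "timed out" : result);
    }
}
```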