Created attachment 21595 [details] zip file containing 3 Visio files which produced the error

I'm working on Linux and trying to extract text from visio files with VisioTextExtractor. A sample of the code I've written to do this is below.

    protected static String extractVSD(String filename){
        try {
            FileInputStream fin = new FileInputStream(filename);
            VisioTextExtractor extractor = new VisioTextExtractor(fin);
            . . .
    }

When it goes to open a new VisioTextExtractor, I get the following error. This error occurs for every VSD file that I have tried. I'm using POI scratchpad version 3.0.2, but I've also tried version 3.0.1 and encountered the same error. I've attached a zip file containing 3 of the simple test files that produced the error.

Stacktrace:
java.lang.ArrayIndexOutOfBoundsException: 1991
    at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:492)
    at org.apache.poi.util.LittleEndian.getUInt(LittleEndian.java:164)
    at org.apache.poi.hdgf.chunks.ChunkHeader.createChunkHeader(ChunkHeader.java:43)
    at org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:108)
    at org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:54)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:92)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:99)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:99)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:92)
    at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:46)
    at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:50)
    at com.lmco.atl.soser.collector.FileContentsUtil.extractVSD(FileContentsUtil.java:298)
    at com.lmco.atl.soser.collector.FileContentsUtil.main(FileContentsUtil.java:442)
I have the same problem with my visio files (even an empty one).
Out of interest, if you open one of these files up in visio, and do "save as", does the resulting file still have the same problem? (It looks like there's more data in a chunk stream than there are chunks, so we're running out of data when creating)
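To illustrate the failure mode described above (a chunk header or body being read past the end of the stream data), here is a minimal, self-contained sketch. It is not the POI implementation: the class name ChunkScanSketch, the 8-byte header layout (4-byte type followed by a 4-byte little-endian length) and the method names are assumptions made up for the example, and the real Visio chunk format is more involved. It only shows how checking the remaining byte count before each read turns an ArrayIndexOutOfBoundsException into a clean stop at the first truncated chunk.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkScanSketch {
    // Hypothetical header layout for this sketch only:
    // 4-byte type followed by a 4-byte little-endian payload length.
    static final int HEADER_SIZE = 8;

    /** Reads an unsigned 32-bit little-endian value, refusing to read past the array end. */
    static long getUInt(byte[] data, int offset) {
        if (offset < 0 || offset + 4 > data.length) {
            throw new IllegalArgumentException(
                "Need 4 bytes at offset " + offset + " but data has length " + data.length);
        }
        return (data[offset] & 0xFFL)
             | (data[offset + 1] & 0xFFL) << 8
             | (data[offset + 2] & 0xFFL) << 16
             | (data[offset + 3] & 0xFFL) << 24;
    }

    /**
     * Walks the stream and returns the payload length of every complete chunk.
     * Stops at the first chunk whose header or payload would run off the end.
     */
    static List<Integer> findChunkLengths(byte[] stream) {
        List<Integer> lengths = new ArrayList<>();
        int pos = 0;
        while (stream.length - pos >= HEADER_SIZE) {
            long length = getUInt(stream, pos + 4);
            long remaining = stream.length - pos - HEADER_SIZE;
            if (length > remaining) {
                break; // truncated chunk: the declared length exceeds the data left
            }
            lengths.add((int) length);
            pos += HEADER_SIZE + (int) length;
        }
        return lengths;
    }

    public static void main(String[] args) {
        // One complete chunk (2-byte payload) followed by 3 stray trailing bytes.
        byte[] stream = new byte[HEADER_SIZE + 2 + 3];
        stream[4] = 2;
        System.out.println(findChunkLengths(stream));
    }
}
```

Whether silently stopping, logging, or throwing a descriptive exception is the right reaction to leftover bytes is a design choice; the sketch stops quietly only to keep the example short.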
Created attachment 21624 [details] test file

I have the same problem with all my visio files.
I can't extract the content of my Visio file (even if I do "save as" before extracting). I have attached my document, and this is the trace of my JUnit test.

java.lang.ArrayIndexOutOfBoundsException: 57
    at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:492)
    at org.apache.poi.util.LittleEndian.getDouble(LittleEndian.java:220)
    at org.apache.poi.hdgf.chunks.Chunk.processCommands(Chunk.java:174)
    at org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:171)
    at org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:54)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:92)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:92)
    at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:46)
    at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:50)
    at fr.sacem.amely.document.repository.core.extractor.office.VisioContentExtractor.getPlainText(VisioContentExtractor.java:56)
    at fr.sacem.amely.document.repository.core.extractor.ContentExtractorManager.getPlainText(ContentExtractorManager.java:117)
    at fr.sacem.amely.document.repository.core.extractor.ContentExtractorManagerTestCase.testExtract(ContentExtractorManagerTestCase.java:78)
Do you happen to know what version of visio produced the files? (It's odd that you're both finding lots of files that trigger this, but none of mine ever have) I've added your problem files to svn, so they're available for writing tests against
This is the version of Visio I use: Microsoft Visio Professional 2002 SP-2 (10.0.6865) (French version). (Do you have embedded stencils in your Visio document?)
My Visio files were created with Microsoft Office Visio Professional 2007 (12.0.4518.1014) MSO (12.0.6017.5000)
Additional info has been provided.
*** Bug 44594 has been marked as a duplicate of this bug. ***
I think before we spend a lot of time trying to work around these short chunks, we'll want to be sure we're correctly decoding them in the first place. So, I'll put this bug on hold until we've fixed the decompression problem from bug #43670
*** Bug 44687 has been marked as a duplicate of this bug. ***
*** Bug 44717 has been marked as a duplicate of this bug. ***
This should now be fixed in svn trunk
Created attachment 21800 [details] java.lang.IllegalArgumentException: Found a chunk with a negative length, which isn't allowed
I have tried with the new release "poi-bin-3.1-beta1-20080428". I still have the same problem.

Stacktrace:
java.lang.ArrayIndexOutOfBoundsException: 57
    at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:502)
    at org.apache.poi.util.LittleEndian.getDouble(LittleEndian.java:220)
    at org.apache.poi.hdgf.chunks.Chunk.processCommands(Chunk.java:174)
    at org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:171)
    at org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:58)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:92)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:92)
    at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:47)
    at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:51)
Bug still exists in POI 3.5 beta. Is there any idea about the error? I can do nothing if I can't create the HDGFDiagram object...
We're seeing the same bug in Apache POI 3.8.0 beta 3. Is there anyone who is able to fix this?
The same in 3.8 beta 4.

Stack trace:
java.lang.ArrayIndexOutOfBoundsException: Illegal offset 8 (String data is of length 8)
    at org.apache.poi.util.StringUtil.getFromUnicodeLE(StringUtil.java:70)
    at org.apache.poi.hdgf.chunks.Chunk.processCommands(Chunk.java:203)
    ...
the same bug with Apache POI 3.8-20120326

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 57
    at org.apache.poi.util.LittleEndian.getLong(LittleEndian.java:191)
    at org.apache.poi.util.LittleEndian.getDouble(LittleEndian.java:104)
    at org.apache.poi.hdgf.chunks.Chunk.processCommands(Chunk.java:175)
    at org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:180)
    at org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:59)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:93)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:106)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:70)
    at VsdTextExtractor.<init>(VsdTextExtractor.java:56)
    at VsdTextExtractor.<init>(VsdTextExtractor.java:53)
    at VsdTextExtractor.<init>(VsdTextExtractor.java:66)
    at TestVsdTextExtractor.test(TestVsdTextExtractor.java:13)
    at TestVsdTextExtractor.main(TestVsdTextExtractor.java:6)
Has there been any update on this? In 3.8 I get the same error with some simple Visio documents.
I tried all of the files provided here, and all could be read successfully without any exception. I am therefore closing this bug now. If you still see this, it would be best to file a new bug entry with a sample file and code to reproduce the problem, preferably as a JUnit test.