Bug 49435

Summary: Metadata extractor fails to extract metadata when using a ByteArrayInputStream
Product: POI
Reporter: David <dvergnaud>
Component: XWPF
Assignee: POI Developers List <dev>
Status: RESOLVED WORKSFORME    
Severity: normal    
Priority: P2    
Version: 3.6-FINAL   
Target Milestone: ---   
Hardware: All   
OS: All   

Description David 2010-06-14 10:51:25 UTC
I'm using POITextExtractor.getMetadataTextExtractor() on a docx document to extract its metadata. If the main POITextExtractor is created from a plain java.io.File, the metadata extraction works as expected. If it is created from a ByteArrayInputStream instead (holding exactly the same content, read through a simple FileInputStream on the same File object), some values are changed or missing.

Example: using the following code: 

POITextExtractor te = ExtractorFactory.createExtractor( new java.io.File( "X:/projects/termDB/Frederick - Terminology database.docx" ) );
POITextExtractor te2 = te.getMetadataTextExtractor();
String t2 = te2.getText();
System.out.println( t2 );

I get the following output: 
Category = null
ContentStatus = null
ContentType = null
Created = Mon Apr 26 07:01:00 CEST 2010
CreatedString = 2010-04-26T07:01:00Z
Creator = David Vergnaud
Description = null
Identifier = null
Keywords = null
Language = null
LastModifiedBy = David Vergnaud
LastPrinted = null
LastPrintedString = 2010-04-26T07:01:00Z
Modified = Fri Apr 30 10:06:00 CEST 2010
ModifiedString = 2010-04-30T10:06:00Z
Revision = 31
Subject = null
Title = null
Version = null
Application = Microsoft Macintosh Word
AppVersion = 12.0000
Characters = 48573
CharactersWithSpaces = 59651
Company = Finnova AG Bankware
HyperlinkBase = null
HyperlinksChanged = false
Lines = 404
LinksUpToDate = false
Manager = null
Pages = 19
Paragraphs = 97
PresentationFormat = null
Template = Normal.dotm
TotalTime = 835

When using the following code instead:
java.io.File file = new java.io.File( "X:/projects/termDB/Frederick - Terminology database.docx" );
byte[] content = new byte[ (int)( file.length() ) ];
java.io.FileInputStream fis = new java.io.FileInputStream( file );
fis.read( content );
POITextExtractor te = ExtractorFactory.createExtractor( new ByteArrayInputStream( content ) );
POITextExtractor te2 = te.getMetadataTextExtractor();
String t2 = te2.getText();
System.out.println( t2 );

I get the following output: 
Category = null
ContentStatus = null
ContentType = null
Created = Mon Jun 14 16:39:13 CEST 2010
CreatedString = 2010-06-14T16:39:13Z
Creator = David Vergnaud
Description = null
Identifier = null
Keywords = null
Language = null
LastModifiedBy = null
LastPrinted = null
LastPrintedString = 2010-06-14T16:39:13Z
Modified = null
ModifiedString = 2010-06-14T16:39:18Z
Revision = 31
Subject = null
Title = null
Version = null
Application = Microsoft Macintosh Word
AppVersion = 12.0000
Characters = 48573
CharactersWithSpaces = 59651
Company = Finnova AG Bankware
HyperlinkBase = null
HyperlinksChanged = false
Lines = 404
LinksUpToDate = false
Manager = null
Pages = 19
Paragraphs = 97
PresentationFormat = null
Template = Normal.dotm
TotalTime = 835

Some values (Created, CreatedString, LastPrintedString, ModifiedString) have changed, and others (LastModifiedBy, Modified) are simply null in the ByteArrayInputStream version.

Interestingly, all dates shown in the ByteArray version are actually the date when the program was executed (today, about 5 minutes ago). 

Incidentally, the Modified date in the "File" version (30 April) doesn't match the date shown by my various operating systems, which all agree on 3 May. However, I guess that's a difference between Word's internally stored modification date and the date of the last physical modification of the file itself at the OS level.
Comment 1 Nick Burch 2010-06-14 10:55:53 UTC
There shouldn't be any differences.

Could you please create a simple unit test which loads the file the two different ways and detects that they don't agree? We can then use that when trying to fix the bug, and to ensure it stays fixed.

It might also be worth digging down into the properties themselves, rather than just the extractor-level text, to see whether you can spot where the problem is introduced.
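
Before comparing the two extractors, it may be worth ruling out the byte array itself. InputStream.read(byte[]) is not guaranteed to fill the buffer in a single call, so the stream-based reproduction code in the description could in principle hand POI an incomplete document. The following is a minimal sketch using only java.io; the class name StreamBytes is hypothetical and not part of POI.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of POI): reads a stream completely into memory.
final class StreamBytes {

    // Keeps reading until end-of-stream, because one read(byte[]) call
    // may return fewer bytes than the buffer can hold.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

If the array returned by readAll has the same length as file.length(), a short read can be excluded as the cause.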
Comment 2 David 2010-06-15 02:49:10 UTC
OK, I'll do that as soon as I can find some time -- might take 1-2 weeks though. 

Can you just tell me how the test program should behave? Simply return 0 on success (same values on both sides) and -1 otherwise?
Comment 3 Nick Burch 2010-06-15 09:45:36 UTC
The class should extend junit.framework.TestCase.

I would suggest you open the file both ways, then loop over the two objects doing assertEquals, assertNotNull etc. In theory, we'd expect one to pass and one to fail as they differ.
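
As a rough sketch of the comparison step, assuming the two metadata dumps are available as "Key = value" text in the form printed in the description, the differing keys could be collected as below. The class MetadataDiff and its methods are hypothetical helpers, not part of POI, and comparing the text output is only a stand-in for comparing the property objects themselves.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not part of POI): compares two "Key = value" metadata dumps.
final class MetadataDiff {

    // Parses lines of the form "Key = value" into an ordered map.
    static Map<String, String> parse(String text) {
        Map<String, String> props = new LinkedHashMap<String, String>();
        for (String line : text.split("\\r?\\n")) {
            int sep = line.indexOf(" = ");
            if (sep < 0) {
                continue; // skip blank or malformed lines
            }
            props.put(line.substring(0, sep).trim(), line.substring(sep + 3).trim());
        }
        return props;
    }

    // Returns the keys whose values differ, or that appear on only one side.
    static Set<String> differingKeys(String fromFile, String fromStream) {
        Map<String, String> a = parse(fromFile);
        Map<String, String> b = parse(fromStream);
        Set<String> keys = new TreeSet<String>(a.keySet());
        keys.addAll(b.keySet());
        Set<String> diff = new TreeSet<String>();
        for (String key : keys) {
            if (!Objects.equals(a.get(key), b.get(key))) {
                diff.add(key);
            }
        }
        return diff;
    }
}
```

Inside a TestCase, the two strings would come from getMetadataTextExtractor().getText() on each extractor, and the check could be assertTrue(diff.toString(), diff.isEmpty()). Applied to the two dumps in the description, this would flag Created, CreatedString, LastModifiedBy, LastPrintedString, Modified and ModifiedString.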
Comment 4 Dominik Stadler 2016-02-14 08:29:36 UTC
No update for a long time, so I am closing this for now. Please reopen with more information if this is still a problem for you.