I'm using function POITextExtractor.getMetadataTextExtractor() on a docx document to extract its metadata. If the main POITextExtractor has been created using a "normal" java.io.File, then extracting the metadata seems to work alright. When using a ByteArrayInputStream instead (for exactly the same content, read using a simple FileInputStream on the same File object), then information is changed or missing.

Example: using the following code:

    POITextExtractor te = ExtractorFactory.createExtractor( new java.io.File( "X:/projects/termDB/Frederick - Terminology database.docx" ) );
    POITextExtractor te2 = te.getMetadataTextExtractor();
    String t2 = te2.getText();
    System.out.println( t2 );

I get the following output:

    Category = null
    ContentStatus = null
    ContentType = null
    Created = Mon Apr 26 07:01:00 CEST 2010
    CreatedString = 2010-04-26T07:01:00Z
    Creator = David Vergnaud
    Description = null
    Identifier = null
    Keywords = null
    Language = null
    LastModifiedBy = David Vergnaud
    LastPrinted = null
    LastPrintedString = 2010-04-26T07:01:00Z
    Modified = Fri Apr 30 10:06:00 CEST 2010
    ModifiedString = 2010-04-30T10:06:00Z
    Revision = 31
    Subject = null
    Title = null
    Version = null
    Application = Microsoft Macintosh Word
    AppVersion = 12.0000
    Characters = 48573
    CharactersWithSpaces = 59651
    Company = Finnova AG Bankware
    HyperlinkBase = null
    HyperlinksChanged = false
    Lines = 404
    LinksUpToDate = false
    Manager = null
    Pages = 19
    Paragraphs = 97
    PresentationFormat = null
    Template = Normal.dotm
    TotalTime = 835

When using that code instead:

    java.io.File file = new java.io.File( "X:/projects/termDB/Frederick - Terminology database.docx" );
    byte[] content = new byte[ (int)( file.length() ) ];
    java.io.FileInputStream fis = new java.io.FileInputStream( file );
    fis.read( content );
    POITextExtractor te = ExtractorFactory.createExtractor( new ByteArrayInputStream( content ) );
    POITextExtractor te2 = te.getMetadataTextExtractor();
    String t2 = te2.getText();
    System.out.println( t2 );

I get the following output:

    Category = null
    ContentStatus = null
    ContentType = null
    Created = Mon Jun 14 16:39:13 CEST 2010
    CreatedString = 2010-06-14T16:39:13Z
    Creator = David Vergnaud
    Description = null
    Identifier = null
    Keywords = null
    Language = null
    LastModifiedBy = null
    LastPrinted = null
    LastPrintedString = 2010-06-14T16:39:13Z
    Modified = null
    ModifiedString = 2010-06-14T16:39:18Z
    Revision = 31
    Subject = null
    Title = null
    Version = null
    Application = Microsoft Macintosh Word
    AppVersion = 12.0000
    Characters = 48573
    CharactersWithSpaces = 59651
    Company = Finnova AG Bankware
    HyperlinkBase = null
    HyperlinksChanged = false
    Lines = 404
    LinksUpToDate = false
    Manager = null
    Pages = 19
    Paragraphs = 97
    PresentationFormat = null
    Template = Normal.dotm
    TotalTime = 835

Some pieces of information (Created, LastPrintedString, ModifiedString) have changed, and some pieces are simply not available in the ByteArray version. Interestingly, all dates shown in the ByteArray version are actually the date when the program was executed (today, about 5 minutes ago).

Incidentally, the first LastModified (in the "File" version) doesn't match the date as shown by my various operating systems -- all agree on the 3rd of May. However, I guess that's a difference between Word's internally stored modification date and the date of the last physical modification of the file itself at the OS level.
There shouldn't be any differences. Could you please create a simple unit test which loads the file the two different ways and detects that they don't agree? We can then use that when trying to fix the bug, and to ensure it stays fixed. It might also be worth digging down into the properties themselves, rather than just the extractor-level text, to see if you can spot where the problem is introduced.
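One possible way to look below the extractor, as a rough sketch only: a .docx file is a ZIP package, and its core properties (creator, created and modified dates, last modified by) conventionally sit in a part named docProps/core.xml. The class name CorePropsDump and the method readCoreXml are hypothetical, and the part name is assumed here rather than resolved through the package relationships. The sketch uses only java.util.zip, so it shows what is stored in the bytes independently of POI.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class CorePropsDump {
    // Assumption: the core properties part uses its conventional name.
    static final String CORE_PART = "docProps/core.xml";

    /** Returns the raw XML of the core properties part, or null if no such entry exists. */
    static String readCoreXml(InputStream packageStream) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(packageStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (CORE_PART.equals(entry.getName())) {
                    // Copy the current entry's bytes; read() returns -1 at the end of the entry.
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = zip.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    return new String(out.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CorePropsDump <file.docx>");
            return;
        }
        // Load the whole file into memory, then read the package from the byte array.
        byte[] content = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(readCoreXml(new ByteArrayInputStream(content)));
    }
}
```

If the XML read from the byte array contains the expected dates and names, the stored data is intact, which would suggest the difference arises while the package is opened from a stream rather than in the file contents.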
OK, I'll do that as soon as I can find some time -- might take 1-2 weeks though. Can you just tell me how the test program should behave? Simply return 0 on success (same values on both sides) and -1 otherwise?
The class should extend junit.framework.TestCase. I would suggest you open the file the two ways, then loop over the two objects doing assertEquals, assertNotNull, etc. In theory, we'd expect one to pass and one to fail, as they differ.
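The comparison step suggested above could be sketched roughly as follows. This is not POI code: the class MetadataDiff and its methods parse and diff are hypothetical helpers, and the sketch assumes the metadata extractor prints one "Key = value" pair per line, as in the output quoted in the report. It takes the two getText() results as plain strings, so it depends only on the JDK.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class MetadataDiff {
    /** Parses "Key = value" lines into an ordered map; lines without " = " are skipped. */
    static Map<String, String> parse(String metadataText) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String line : metadataText.split("\\r?\\n")) {
            int sep = line.indexOf(" = ");
            if (sep > 0) {
                props.put(line.substring(0, sep).trim(), line.substring(sep + 3).trim());
            }
        }
        return props;
    }

    /** Maps every key whose value differs to "fileValue -> streamValue". */
    static Map<String, String> diff(String fromFile, String fromStream) {
        Map<String, String> a = parse(fromFile);
        Map<String, String> b = parse(fromStream);
        Set<String> keys = new LinkedHashSet<>(a.keySet());
        keys.addAll(b.keySet());
        Map<String, String> differences = new LinkedHashMap<>();
        for (String key : keys) {
            String va = a.get(key);
            String vb = b.get(key);
            if (!Objects.equals(va, vb)) {
                differences.put(key, va + " -> " + vb);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // A few values taken from the two outputs in the report.
        String fromFile = "Creator = David Vergnaud\n"
                + "LastModifiedBy = David Vergnaud\n"
                + "Modified = Fri Apr 30 10:06:00 CEST 2010\n"
                + "Revision = 31";
        String fromStream = "Creator = David Vergnaud\n"
                + "LastModifiedBy = null\n"
                + "Modified = null\n"
                + "Revision = 31";
        System.out.println(diff(fromFile, fromStream));
    }
}
```

Inside a junit.framework.TestCase, the two strings would come from getText() on each metadata extractor, and the check could be as simple as assertTrue(differences.toString(), differences.isEmpty()), so a failure lists exactly which properties disagree.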
No update for a long time, so I am closing this for now. Please reopen with more information if this is still a problem for you.