Bug 53816 - Extracted word count is incorrect
Summary: Extracted word count is incorrect
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HPSF
Version: 3.9-dev
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-09-02 14:53 UTC by mikemccand
Modified: 2015-03-22 20:47 UTC
CC List: 0 users



Attachments
Word document showing incorrect PID_WORDCOUNT=11 (96.00 KB, application/msword)
2012-09-02 14:53 UTC, mikemccand

Description mikemccand 2012-09-02 14:53:00 UTC
Created attachment 29316
Word document showing incorrect PID_WORDCOUNT=11

I have a Word doc (attached) that has 6 words, plus an embedded PDF document (not sure that's relevant).  When I view the word count with Word it correctly says 6.  But when I run org.apache.poi.hpsf.extractor.HPSFPropertiesExtractor the word count incorrectly says 11:

1 = 1252
PID_TITLE = 
PID_SUBJECT = 
PID_AUTHOR = IBMer
PID_KEYWORDS = 
PID_TEMPLATE = Normal.dot
PID_LASTAUTHOR = IBMer
PID_REVNUMBER = 3
PID_APPNAME = Microsoft Office Word
PID_EDITTIME = Sun Dec 31 19:03:00 EST 1600
PID_CREATE_DTM = Tue Jul 17 07:16:00 EDT 2012
PID_LASTSAVE_DTM = Mon Jul 23 07:21:00 EDT 2012
PID_PAGECOUNT = 1
PID_WORDCOUNT = 11
PID_CHARCOUNT = 55
PID_SECURITY = 0
PID_CODEPAGE = 1252
PID_COMPANY = IBM
PID_LINECOUNT = 1
PID_PARCOUNT = 1
17 = 65
23 = 730895
PID_SCALE = false
PID_LINKSDIRTY = false
19 = false
22 = false
PID_DOCPARTS =
Comment 1 mikemccand 2012-09-02 14:55:42 UTC
I also have a Word document (unfortunately can't share), which doesn't have an embedded document, that has 3 pages yet POI shows PID_PAGECOUNT=1.

Are there known cases where the properties will not be extracted correctly?
Comment 2 mikemccand 2012-09-02 15:41:19 UTC
PID_EDITTIME and PID_CREATE_DTM also seem to be wrong, at least when I compare this to the Document Properties via Word.
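
A possible explanation for the odd PID_EDITTIME value, offered as a sketch rather than a confirmed diagnosis: as far as I understand the OLE property set format, the edit time is stored as a FILETIME that holds a duration (100 ns ticks), not a point in time. Printing it as a date therefore yields an instant shortly after the FILETIME epoch, 1601-01-01 00:00 UTC. The class and method names below are hypothetical, and the fixed -05:00 offset is an assumption about how "EST" was rendered in the output above.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class EditTimeDemo {
    // FILETIME values count from 1601-01-01T00:00:00Z.
    static final Instant FILETIME_EPOCH = Instant.parse("1601-01-01T00:00:00Z");

    // Reinterpret a date that was really a duration since the FILETIME epoch.
    static Duration asDuration(ZonedDateTime misreadDate) {
        return Duration.between(FILETIME_EPOCH, misreadDate.toInstant());
    }

    public static void main(String[] args) {
        // "Sun Dec 31 19:03:00 EST 1600" from the extractor output,
        // assuming EST means a fixed offset of -05:00.
        ZonedDateTime printed =
                ZonedDateTime.of(1600, 12, 31, 19, 3, 0, 0, ZoneOffset.ofHours(-5));
        System.out.println(asDuration(printed).toMinutes() + " minutes");
    }
}
```

Under these assumptions the reported value corresponds to 3 minutes of editing time, which may be what Word shows as "Total editing time". This does not explain the PID_CREATE_DTM or word count differences.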
Comment 3 Nick Burch 2012-09-02 16:33:52 UTC
POI will give you exactly what is stored in the file, without any changes. If Word happens to store duff data, there's not a lot we can do about it :(
Comment 4 mikemccand 2012-09-02 17:12:37 UTC
But what confuses me is when I display the document properties in Word, they are correct.  It's as if POI is somehow pulling from a different (stale) set of properties stored in the Word doc or something...
Comment 5 Nick Burch 2012-09-02 17:48:13 UTC
If you load the file in Word and do a save-as, does that fix what POI sees? How about just opening it in Word and doing a save (no save-as)?
Comment 6 mikemccand 2012-09-02 17:58:20 UTC
First I tried "Save As..." to a new file, and then POI reports PID_WORDCOUNT = 13. That is still wrong (it should be 6), but curiously it is now wrong "differently": 13 vs 11 before.

Then I tried "Save" and then POI also reports PID_WORDCOUNT = 13 (still wrong).  I made another change (add space then remove it), saved again, and POI still says PID_WORDCOUNT = 13.