Created attachment 29316
Word document showing incorrect PID_WORDCOUNT=11

I have a Word doc (attached) that has 6 words, plus an embedded PDF document (not sure that's relevant). When I view the word count in Word it correctly says 6. But when I run org.apache.poi.hpsf.extractor.HPSFPropertiesExtractor the word count incorrectly says 11:

1 = 1252
PID_TITLE =
PID_SUBJECT =
PID_AUTHOR = IBMer
PID_KEYWORDS =
PID_TEMPLATE = Normal.dot
PID_LASTAUTHOR = IBMer
PID_REVNUMBER = 3
PID_APPNAME = Microsoft Office Word
PID_EDITTIME = Sun Dec 31 19:03:00 EST 1600
PID_CREATE_DTM = Tue Jul 17 07:16:00 EDT 2012
PID_LASTSAVE_DTM = Mon Jul 23 07:21:00 EDT 2012
PID_PAGECOUNT = 1
PID_WORDCOUNT = 11
PID_CHARCOUNT = 55
PID_SECURITY = 0
PID_CODEPAGE = 1252
PID_COMPANY = IBM
PID_LINECOUNT = 1
PID_PARCOUNT = 1
17 = 65
23 = 730895
PID_SCALE = false
PID_LINKSDIRTY = false
19 = false
22 = false
PID_DOCPARTS =
I also have a Word document (unfortunately can't share), which doesn't have an embedded document, that has 3 pages yet POI shows PID_PAGECOUNT=1. Are there known cases where the properties will not be extracted correctly?
PID_EDITTIME and PID_CREATE_DTM also seem to be wrong, at least when I compare this to the Document Properties via Word.
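A possible explanation for the odd PID_EDITTIME value, offered as a hedged sketch rather than a confirmed diagnosis: the summary information stream stores this property in the Windows FILETIME format (100-nanosecond ticks since 1601-01-01 UTC), but for the edit time the ticks represent an elapsed duration, not a point in time. Printing such a value as a date gives something just after the 1601 epoch, which in a UTC-5 zone looks like "Dec 31 19:03:00 EST 1600". The class name, helper names, and the raw tick value below are hypothetical and chosen only to illustrate the two readings; nothing here calls POI.

```java
import java.time.Duration;
import java.time.Instant;

public class EditTimeDemo {
    // FILETIME counts 100-nanosecond ticks since 1601-01-01T00:00:00Z.
    static final Instant FILETIME_EPOCH = Instant.parse("1601-01-01T00:00:00Z");
    static final long TICKS_PER_SECOND = 10_000_000L;

    /** Reads the tick count as an elapsed duration (the intended meaning for an edit time). */
    static Duration asDuration(long ticks) {
        long seconds = ticks / TICKS_PER_SECOND;
        long nanos = (ticks % TICKS_PER_SECOND) * 100L;
        return Duration.ofSeconds(seconds, nanos);
    }

    /** Reads the same tick count as an absolute timestamp (the misleading reading). */
    static Instant asTimestamp(long ticks) {
        return FILETIME_EPOCH.plus(asDuration(ticks));
    }

    public static void main(String[] args) {
        // Hypothetical raw value: three minutes of editing, expressed in ticks.
        long ticks = 3L * 60L * TICKS_PER_SECOND;
        System.out.println("as timestamp: " + asTimestamp(ticks));
        System.out.println("as duration:  " + asDuration(ticks).toMinutes() + " min");
    }
}
```

Under this reading, the reported PID_EDITTIME would correspond to a total editing time of about three minutes rather than a date in 1600. Whether that matches what Word shows for this document is not confirmed here.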
POI will give you exactly what is stored in the file, without any changes. If Word happens to store duff data, there's not a lot we can do about it :(
But what confuses me is when I display the document properties in Word, they are correct. It's as if POI is somehow pulling from a different (stale) set of properties stored in the Word doc or something...
If you load the file in Word and do a save-as, does that fix what POI sees? How about just opening it in Word and doing a save (no save-as)?
First I tried "Save As..." to a new file, and then POI reports PID_WORDCOUNT = 13. That is still wrong (it should be 6), but it is curious that it is now wrong "differently" (13 vs 11 before). Then I tried "Save", and POI also reports PID_WORDCOUNT = 13 (still wrong). I made another change (added a space, then removed it), saved again, and POI still says PID_WORDCOUNT = 13.