Bug 56880 - Non-extended character Pascal strings are not supported in RevisionMarkAuthorTable
Status: NEEDINFO
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.10-FINAL
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Duplicates: 54937
Depends on:
Blocks:
 
Reported: 2014-08-22 11:19 UTC by venkatesh
Modified: 2016-07-17 21:19 UTC

Attachments
Small Word document (19.50 KB, application/octet-stream)
2014-10-09 17:50 UTC, jan.vanhoecke
proposal to log instead of throw (2.02 KB, patch)
2016-01-19 19:17 UTC, Tim Allison

Description venkatesh 2014-08-22 11:19:58 UTC
I am not able to extract text from some documents.
This is the exception. Can somebody please reply to this?

Exception in thread "main" java.lang.UnsupportedOperationException: Non-extended character Pascal strings are not supported right now. Please, contact POI developers for update. 
        at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:81) 
        at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:60) 
        at org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52) 
        at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53) 
        at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361) 
        at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186) 
        at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174) 
        at com.cobbsystemsgroup.proform.resume.service.Main.main(Main.java:74)
Comment 1 Nick Burch 2014-08-22 11:22:22 UTC
Can you please attach a small, simple Word file that shows the problem?
Comment 2 venkatesh 2014-08-22 11:44:35 UTC
Sorry Nick,
I can't upload the document that is failing.
If I edit a small part of the original document, I am able to extract the text from it.
Comment 3 Nick Burch 2014-08-22 11:53:00 UTC
Without a file that demonstrates the problem, which we can use both to investigate and to unit test against, there's very little we're going to be able to do here.
Comment 4 venkatesh 2014-08-22 12:26:08 UTC
I understand, but that document is highly confidential,
so I can't upload it.
Please do as much as you can on this.
Comment 5 Nick Burch 2014-08-22 12:33:32 UTC
As previously mentioned, without a sample file we can't do a great deal.

Also, please remember that everyone involved in the project is a volunteer!

If this truly is urgent and important for you, and if you are unable to assist with solving this, you will need to pay someone to give you commercial-style support. (Post to the dev list if none of your current support organisations offer Apache POI support and you need to seek other consultants.)
Comment 6 venkatesh 2014-08-22 12:55:59 UTC
Thanks Nick,
I am ready to pay for a commercial product.
Can you suggest a good product?
Comment 7 jan.vanhoecke 2014-10-09 17:50:43 UTC
Created attachment 32099
Small Word document

We are seeing the same issue. 
Attached a small Word file that should allow you to replicate the issue. 

The following bit of Scala code will throw the exception:

import java.io.FileInputStream
import org.apache.poi.hwpf.HWPFDocument

val fis = new FileInputStream("test.doc") // the attached document
val h = try new HWPFDocument(fis) finally fis.close() // the constructor throws


java.lang.UnsupportedOperationException: Non-extended character Pascal strings are not supported right now. Please, contact POI developers for update.
	at org.apache.poi.hwpf.model.Sttb.fillFields(POI.sc2189530460748253791.tmp:77)
	at org.apache.poi.hwpf.model.Sttb.<init>(POI.sc2189530460748253791.tmp:56)
	at org.apache.poi.hwpf.model.SttbUtils.readSttbfRMark(POI.sc2189530460748253791.tmp:42)
	at org.apache.poi.hwpf.model.RevisionMarkAuthorTable.<init>(POI.sc2189530460748253791.tmp:47)
	at org.apache.poi.hwpf.HWPFDocument.<init>(POI.sc2189530460748253791.tmp:364)
	at org.apache.poi.hwpf.HWPFDocument.<init>(POI.sc2189530460748253791.tmp:182)
	at org.apache.poi.hwpf.HWPFDocument.<init>(POI.sc2189530460748253791.tmp:170)
	at #worksheet#.h$lzycompute(POI.sc2189530460748253791.tmp:5)
	at #worksheet#.h(POI.sc2189530460748253791.tmp:5)
	at #worksheet#.#worksheet#(POI.sc2189530460748253791.tmp:5)
Comment 8 Nick Burch 2014-10-09 17:59:24 UTC
Thanks for that, Jan. I've used it to create a small (disabled) unit test in r1630543 that shows the problem.

Now we just need someone to dive into that bit of HWPF code and work out what's needed...
Comment 9 Tim Allison 2016-01-19 17:59:15 UTC
*** Bug 54937 has been marked as a duplicate of this bug. ***
Comment 10 Tim Allison 2016-01-19 18:00:25 UTC
Is depended upon by: https://issues.apache.org/jira/browse/TIKA-1836
Comment 11 Tim Allison 2016-01-19 18:17:04 UTC
Would anybody object to logging this and skipping the rest of the SavedByTable so that the rest of the document can be read?  Warning: I haven't actually tested this option to confirm that there is no further corruption...
Comment 12 Tim Allison 2016-01-19 19:17:31 UTC
Created attachment 33466
proposal to log instead of throw

If this is OK, I'll commit it with an added unit test early next week. I propose leaving this issue open, though, because this is just papering over the larger limitation that we don't support this type of entry in a RevisionMarkAuthorTable. Perhaps open a new issue?

If there are any objections, though, please let me know.
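
For readers following along, the log-instead-of-throw idea can be illustrated roughly as below. This is only a sketch of the general pattern, not the attached patch: the class LenientTableReader, the method readOrEmpty, and the Supplier-based strict reader are hypothetical names invented for illustration, and no POI classes are used.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of POI: wraps a strict table reader so that an
// unsupported table is logged and skipped instead of aborting the whole document.
final class LenientTableReader {
    private static final Logger LOG =
            Logger.getLogger(LenientTableReader.class.getName());

    private LenientTableReader() {}

    // strictReader stands in for whatever code parses the string table.
    static List<String> readOrEmpty(String tableName,
                                    Supplier<List<String>> strictReader) {
        try {
            return strictReader.get();
        } catch (UnsupportedOperationException e) {
            // Record the limitation, then carry on with an empty table.
            LOG.log(Level.WARNING, "Skipping " + tableName + ": " + e.getMessage());
            return Collections.emptyList();
        }
    }
}
```

The trade-off is the one noted above: the caller gets the rest of the document, but the entries of the skipped table are silently absent apart from the log message.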
Comment 13 Tim Allison 2016-02-04 19:55:44 UTC
Applied patch (without println!) to change exception to logging in r1728547.  The full fix would be to add handling for this type of string.  For now, at least, users will be able to get some content out of the document, even if some fields in the RevisionMarkAuthorTable cannot be read.
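
As a rough idea of what handling both string types might involve, here is a self-contained sketch. It is not POI's Sttb class. It assumes the STTB layout as described in the [MS-DOC] specification: an optional 0xFFFF fExtend marker, a 2-byte cData count, a 2-byte cbExtra, then length-prefixed strings whose length field is 1 byte and whose characters are 1 byte each when the marker is absent. The name SttbSketch and the use of ISO-8859-1 for the 1-byte characters are assumptions; the real code page depends on the document.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not POI code: reads extended and non-extended STTBs.
final class SttbSketch {
    private SttbSketch() {}

    // Assumes a 2-byte cData field; some STTB variants use 4 bytes instead.
    static List<String> readStrings(byte[] buf) {
        ByteBuffer in = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);

        // 0xFFFF in the first two bytes marks an extended (2-byte character) table.
        boolean extended = (in.getShort(0) & 0xFFFF) == 0xFFFF;
        if (extended) {
            in.position(2); // skip fExtend; the field is absent otherwise
        }

        int count = in.getShort() & 0xFFFF;   // cData: number of strings
        int cbExtra = in.getShort() & 0xFFFF; // extra bytes stored after each string

        List<String> out = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // The length prefix is 2 bytes when extended, 1 byte otherwise.
            int cch = extended ? (in.getShort() & 0xFFFF) : (in.get() & 0xFF);
            byte[] data = new byte[extended ? cch * 2 : cch];
            in.get(data);
            // Assumption: ISO-8859-1 stands in for the document's real code page.
            out.add(new String(data, extended
                    ? StandardCharsets.UTF_16LE
                    : StandardCharsets.ISO_8859_1));
            in.position(in.position() + cbExtra); // skip the per-string extra data
        }
        return out;
    }
}
```

Whether this matches the entries in the attached document has not been checked here; it only shows where the 1-byte branch would sit next to the existing 2-byte one.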
Comment 14 Dominik Stadler 2016-07-17 21:19:33 UTC
The related unit-test was enabled in r1753121.