I am not able to extract text from some documents. This is the exception I get. Can somebody please reply?
Exception in thread "main" java.lang.UnsupportedOperationException: Non-extended character Pascal strings are not supported right now. Please, contact POI developers for update.
    at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:81)
    at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:60)
    at org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
    at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174)
    at com.cobbsystemsgroup.proform.resume.service.Main.main(Main.java:74)
Can you please attach a small, simple word file that shows the problem?
Sorry Nick, I can't upload the document that is failing. If I edit a small part of the original document, I am then able to extract the text from it.
Without a file that demonstrates the problem, which we can use to both investigate and to unit test against, there's very little we're going to be able to do here.
I understand, but that document is highly confidential, so I can't upload it. Please do whatever you can on this.
As previously mentioned, without a sample file we can't do a great deal. Also, please remember that everyone involved in the project is a volunteer! If this truly is urgent and important for you, and if you are unable to assist with solving this, you will need to pay someone to give you commercial-style support. (Post to the dev list if none of your current support organisations offer Apache POI support and you need to seek other consultants.)
Thanks Nick. I am even ready to pay for a commercial product. Can you suggest a good one?
Created attachment 32099: Small Word document
We are seeing the same issue. Attached a small Word file that should allow you to replicate the issue. The following bit of Scala code will throw the exception:
    import java.io.FileInputStream
    import org.apache.poi.hwpf.HWPFDocument
    val fos = new FileInputStream("test.doc")
    val h = new HWPFDocument(fos)
java.lang.UnsupportedOperationException: Non-extended character Pascal strings are not supported right now. Please, contact POI developers for update.
    at org.apache.poi.hwpf.model.Sttb.fillFields(POI.sc2189530460748253791.tmp:77)
    at org.apache.poi.hwpf.model.Sttb.<init>(POI.sc2189530460748253791.tmp:56)
    at org.apache.poi.hwpf.model.SttbUtils.readSttbfRMark(POI.sc2189530460748253791.tmp:42)
    at org.apache.poi.hwpf.model.RevisionMarkAuthorTable.<init>(POI.sc2189530460748253791.tmp:47)
    at org.apache.poi.hwpf.HWPFDocument.<init>(POI.sc2189530460748253791.tmp:364)
    at org.apache.poi.hwpf.HWPFDocument.<init>(POI.sc2189530460748253791.tmp:182)
    at org.apache.poi.hwpf.HWPFDocument.<init>(POI.sc2189530460748253791.tmp:170)
    at #worksheet#.h$lzycompute(POI.sc2189530460748253791.tmp:5)
    at #worksheet#.h(POI.sc2189530460748253791.tmp:5)
    at #worksheet#.#worksheet#(POI.sc2189530460748253791.tmp:5)
Thanks for that, Jan. I've used it to create a small (disabled) unit test in r1630543 that shows the problem. Now we just need someone to dive into that bit of HWPF code and work out what's needed...
*** Bug 54937 has been marked as a duplicate of this bug. ***
Is depended upon by: https://issues.apache.org/jira/browse/TIKA-1836
Would anybody object to logging this and skipping the rest of the SavedByTable so that the rest of the document can be read? Warning: I haven't actually tested this option to confirm that there is no further corruption...
Created attachment 33466: proposal to log instead of throw
If this is ok, I'll commit it with an added unit test early next week. I propose leaving this issue open, though, because this is just wallpapering over the larger limitation that we don't support this type of entry in a RevisionMarkAuthorTable. Perhaps open a new issue? If there are any objections, though, please let me know.
Applied patch (without println!) to change exception to logging in r1728547. The full fix would be to add handling for this type of string. For now, at least, users will be able to get some content out of the document even if not out of some fields in the RevisionMarkAuthorTable.
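For readers following along, here is a minimal, self-contained sketch of the "log instead of throw" idea described above. It is not the code committed in r1728547 and not POI's actual Sttb class: the class name StringTableSketch, the 0xFFFF marker check, and the simplified byte layout (2-byte marker, 2-byte entry count, then a 2-byte character count plus UTF-16LE characters per entry) are all assumptions made for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical, simplified string-table reader; NOT POI's real Sttb implementation. */
public class StringTableSketch {
    private static final Logger LOG = Logger.getLogger(StringTableSketch.class.getName());

    /** Assumed marker meaning "entries use 16-bit (extended) characters". */
    private static final int EXTENDED_MARKER = 0xFFFF;

    /**
     * Reads a table in the simplified layout assumed for this sketch.
     * An unsupported (non-extended) table is logged and skipped, so the
     * caller can carry on reading the rest of the document.
     */
    public static List<String> read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int marker = buf.getShort() & 0xFFFF;
        if (marker != EXTENDED_MARKER) {
            // Before the change, this is where an UnsupportedOperationException would be thrown.
            LOG.warning("Non-extended character Pascal strings are not supported; skipping table");
            return Collections.emptyList();
        }
        int count = buf.getShort() & 0xFFFF;
        List<String> entries = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int chars = buf.getShort() & 0xFFFF;  // length in characters, not bytes
            byte[] raw = new byte[chars * 2];
            buf.get(raw);
            entries.add(new String(raw, StandardCharsets.UTF_16LE));
        }
        return entries;
    }

    public static void main(String[] args) {
        byte[] extended = {(byte) 0xFF, (byte) 0xFF, 1, 0, 2, 0, 'H', 0, 'i', 0};
        byte[] nonExtended = {1, 0, 2, 'H', 'i'};
        System.out.println(read(extended));     // prints [Hi]
        System.out.println(read(nonExtended));  // logs a warning, prints []
    }
}
```

The trade-off is the one already noted in this thread: the caller gets an empty table rather than a failure, so the author or revision names stored in an unsupported table are silently missing from the result until proper handling for that string type is added.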
The related unit-test was enabled in r1753121.