Bug 55645

Summary: ChunkNotFoundException when trying to getRtfBody
Product: POI
Reporter: Paolo <paolo.asioli>
Component: HSMF
Assignee: POI Developers List <dev>
Status: RESOLVED LATER
Severity: normal
CC: paolo.asioli
Priority: P2
Version: 3.9-FINAL
Target Milestone: ---
Hardware: Other
OS: other
Attachments: Outlook MSG that gives the above error

Description Paolo 2013-10-10 08:00:24 UTC
Created attachment 30916: Outlook MSG that gives the above error

Hello

I have a message (attached to this bug), saved from Outlook 2010, that throws ChunkNotFoundException when I call getRtfBody.

Could you please check whether this is a bug? I'm using the latest stable release, 3.9, on Android 4.0.1.

Thanks a lot!

Paolo
Comment 1 Nick Burch 2013-10-10 16:08:15 UTC
Are you sure your Outlook file actually has an RTF part? (Not all of them do.)
Comment 2 Paolo 2013-10-10 18:09:49 UTC
(In reply to Nick Burch from comment #1)
> Are you sure your outlook file actually has a RTF part? (Not all of them do)

I don't know for sure (I don't know how to read an MSG by hand the way I can an EML), but Outlook shows some richly formatted text.
Comment 3 Nick Burch 2013-10-10 18:29:57 UTC
Not all richly formatted text in Outlook is done using RTF! Does your message have an HTML chunk instead?

Take a look at the Tika Outlook parser if you want a detailed example of using HSMF to process msg files:
https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
Comment 4 Paolo 2013-10-10 18:40:55 UTC
(In reply to Nick Burch from comment #3)
> Not all richly formatted text in Outlook is done using RTF! Does your
> message have a HTML chunk instead?
> 
> Take a look at the Tika Outlook parser if you want a detailed example of
> using HSMF to process msg files:
> https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/
> apache/tika/parser/microsoft/OutlookExtractor.java

You're right, but I already try to get the text, HTML and RTF bodies, and it looks like none of them is there.

So I thought there might be some kind of bug, since Outlook shows some formatted text.

Here is an extract of my code:

try {
	this.messaggioHTML = msg.getHtmlBody();
	if (MainActivity.DEBUG) {
		android.util.Log.d(MainActivity.TAG, "HTML Body: "
				+ this.messaggioHTML);
	}
} catch (ChunkNotFoundException e) {
	android.util.Log.e(MainActivity.TAG,
			"HTML Body: not found");
	this.messaggioHTML = "";
}

try {
	this.messaggioTesto = msg.getTextBody();
	if (MainActivity.DEBUG) {
		android.util.Log.d(MainActivity.TAG, "TXT Body: "
				+ this.messaggioTesto);
	}
} catch (ChunkNotFoundException e) {
	android.util.Log.e(MainActivity.TAG,
			"TXT  Body: not found");
	this.messaggioTesto = "";
}

try {
	String messaggioRtf = msg.getRtfBody();

	if (MainActivity.DEBUG) {
		android.util.Log.d(MainActivity.TAG, "RTF Body: "
				+ messaggioRtf);
	}

} catch (ChunkNotFoundException e) {
	android.util.Log.e(MainActivity.TAG,
			"RTF  Body: not found");
} catch (Exception e) {
	// Any other failure while reading the RTF body is logged with its stack trace
	android.util.Log.e(MainActivity.TAG,
			"RTF  Body: unexpected error", e);
}
Comment 5 Nick Burch 2013-10-10 19:31:56 UTC
Outlook files tend to have one or two of plain text, RTF and HTML; it's very rare to have all three. If your file only has RTF and you really want something like HTML, you'd be best off using Apache Tika, as that can convert it for you.
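
Since any given message may carry only some of the three body formats, one way to handle that is to try each getter in turn and keep the first body that is actually present. The following is only a minimal sketch of that fallback idea: firstAvailable is a hypothetical helper (not part of POI), and the lambdas in main are stand-ins for calls such as msg.getHtmlBody(), msg.getTextBody() and msg.getRtfBody(), each of which can throw ChunkNotFoundException.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;

/** Sketch only: firstAvailable is a hypothetical helper, not part of POI. */
class BodyFallbackSketch {

    /** Returns the first non-empty body produced by the getters, tried in order. */
    static Optional<String> firstAvailable(List<Callable<String>> getters) {
        for (Callable<String> getter : getters) {
            try {
                String body = getter.call();
                if (body != null && !body.isEmpty()) {
                    return Optional.of(body);
                }
            } catch (Exception missing) {
                // With HSMF this would typically be ChunkNotFoundException:
                // the format is absent, so move on to the next getter.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-ins for msg.getHtmlBody(), msg.getTextBody() and msg.getRtfBody().
        List<Callable<String>> getters = Arrays.<Callable<String>>asList(
                () -> { throw new Exception("no HTML chunk"); },
                () -> "Hello",
                () -> { throw new Exception("no RTF chunk"); });
        System.out.println(firstAvailable(getters).orElse("(no body found)"));
    }
}
```

If every getter fails, the result is empty, which is the situation reported in this bug: none of the three body chunks is present in the message.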
Comment 6 Paolo 2013-10-10 19:47:33 UTC
(In reply to Nick Burch from comment #5)
> Outlook files tend to have one or two out of plain, rtf and html. It's very
> rare to have all 3. If your file only has rtf, and you really wanted
> something like html, you'd be best off using Apache Tika as that can convert
> for you

Maybe I didn't explain myself clearly. According to Apache POI, the attached example apparently has:
NO plain text
NO HTML
NO RTF

But since I see text when opening it in Outlook, I think there may be a problem.

Did you test the MSG attachment to confirm my report?
Comment 7 Paolo 2013-10-11 22:38:45 UTC
What kind of additional information do you need? There's an MSG attached that, in my tests, shows this anomaly (no plain text, no HTML and no RTF, yet when opened in Outlook it presents formatted text).
I don't know what the problem is, but I think I gave ample information to investigate...

Please let me know.
Comment 8 Nick Burch 2013-10-12 21:19:33 UTC
Are you able to use one of the tools like POIFSViewer or POIFSDump to identify which chunk (POIFS Entry) actually contains your text? That will help us narrow down what's wrong
Comment 9 Paolo 2013-10-13 11:16:38 UTC
(In reply to Nick Burch from comment #8)
> Are you able to use one of the tools like POIFSViewer or POIFSDump to
> identify which chunk (POIFS Entry) actually contains your text? That will
> help us narrow down what's wrong

Thanks for the tip. I'll try that and get back with relevant information.
Cheers
Comment 10 Dominik Stadler 2015-08-17 20:58:17 UTC
No update for a long time, so I am closing this for now. Please reopen with the promised additional information if this is still an issue for you.
Comment 11 Adrian Conlon 2016-04-10 18:09:38 UTC
An old problem, I realise, but here's some extra information that explains the symptoms (and why this *isn't* a POI issue):

The email message class is: IPM.Note.SMIME.MultipartSigned.  This means it's a digitally signed email with modifications effectively disabled.

Because the email is signed, the content is held in an attachment (one of "smime.p7m", "smime.txt" or "smime.p7s").

In this instance, the signed content is held in "smime.p7m" (which is pretty much the most common place).
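
The check described above could be sketched roughly as follows. This is an illustration under assumptions, not POI code: the class and method names are hypothetical, the message class and attachment file names are passed in as plain strings (how they are read from the MSG is left out), and the case-insensitive comparison is my assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Sketch only: hypothetical helpers, not part of POI. */
class SignedMessageSketch {

    /** Message class reported for the attached file in this bug. */
    static final String MULTIPART_SIGNED = "IPM.Note.SMIME.MultipartSigned";

    /** Attachment names that may hold the signed content, as listed in this comment. */
    static final List<String> SIGNED_CONTENT_NAMES =
            Arrays.asList("smime.p7m", "smime.txt", "smime.p7s");

    /** True when the message class marks a multipart-signed email (case-insensitive by assumption). */
    static boolean isMultipartSigned(String messageClass) {
        return MULTIPART_SIGNED.equalsIgnoreCase(messageClass);
    }

    /** Returns the first attachment name that matches one of the known signed-content names. */
    static Optional<String> findSignedContentAttachment(List<String> attachmentNames) {
        for (String name : attachmentNames) {
            if (name != null && SIGNED_CONTENT_NAMES.contains(name.toLowerCase(Locale.ROOT))) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(isMultipartSigned("IPM.Note.SMIME.MultipartSigned"));
        System.out.println(findSignedContentAttachment(Arrays.asList("logo.png", "smime.p7m")));
    }
}
```

When the check succeeds, the body should be looked for inside the matched attachment rather than in the plain text, HTML or RTF chunks of the message itself.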

The content of this attachment is MIME-encoded.  Look for the "multipart/alternative" set, then pick out whichever part is best suited to your needs (I usually just pick out the "text/plain" part for body text extraction). Apache James Mime4j should do the trick...
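
To illustrate what "pick out the text/plain part" amounts to, here is a deliberately simplified sketch that uses only the JDK. It is not a substitute for a real MIME parser such as Mime4j: it assumes the boundary string is already known, that the "multipart/alternative" body is not nested any further, and that no transfer decoding (base64, quoted-printable) or header folding is needed. The class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

/** Sketch only: a naive splitter for a flat multipart/alternative body. */
class PlainTextPartSketch {

    /** Returns the content of the first part whose headers declare text/plain. */
    static Optional<String> plainTextPart(String multipartBody, String boundary) {
        String delimiter = "--" + boundary;
        for (String part : multipartBody.split(Pattern.quote(delimiter))) {
            String normalized = part.replace("\r\n", "\n");
            // Headers and content are separated by the first empty line.
            int blank = normalized.indexOf("\n\n");
            if (blank < 0) {
                continue; // preamble, closing "--" marker, or a part without content
            }
            String headers = normalized.substring(0, blank).toLowerCase(Locale.ROOT);
            if (headers.contains("content-type: text/plain")) {
                return Optional.of(normalized.substring(blank + 2).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample body; the boundary "b1" is illustrative.
        String body = "--b1\r\n"
                + "Content-Type: text/plain; charset=us-ascii\r\n\r\n"
                + "Hello Paolo\r\n"
                + "--b1\r\n"
                + "Content-Type: text/html\r\n\r\n"
                + "<p>Hello Paolo</p>\r\n"
                + "--b1--\r\n";
        System.out.println(plainTextPart(body, "b1").orElse("(no text/plain part)"));
    }
}
```

For real messages the signed content wraps the "multipart/alternative" set in further MIME structure, which is why a proper parser is the safer choice.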