Bug 51901 - [PATCH] StringChunk.parseAs7BitData - Encoding not found - US-ASCII; format=flowed
Alias: None
Product: POI
Classification: Unclassified
Component: HSMF
Version: 3.8-dev
Hardware: PC All
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2011-09-27 23:36 UTC by Jeremy
Modified: 2011-10-03 19:26 UTC

Patch for issue (822 bytes, application/octet-stream)
2011-09-27 23:36 UTC, Jeremy

Description Jeremy 2011-09-27 23:36:02 UTC
Created attachment 27616
Patch for issue

Some message files appear to carry additional parameters after the charset name in the Content-Type header, at least for some US-ASCII messages.

Patch attached: it looks for a semicolon in the charset string and, if one is present, keeps only the substring before it. NOTE: this won't work if a valid charset name can itself contain a semicolon. Another option would be to modify the Pattern used to extract the charset.

Actual m.group(1) string returned from Content-Type: "US-ASCII; format=flowed; delsp=yes"
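
For readers without access to the attachment, the semicolon-stripping idea can be sketched roughly as below. This is not the attached patch: the class CharsetNameCleaner and the method stripParameters are hypothetical names used only for illustration, and the trimming of whitespace is an added assumption.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of POI: illustrates the approach of
// cutting the charset string at the first semicolon.
final class CharsetNameCleaner {

    /** Returns the charset name with any "; param=value" suffix removed. */
    static String stripParameters(String raw) {
        int semicolon = raw.indexOf(';');
        // Keep only the text before the first semicolon, if there is one.
        String name = (semicolon >= 0) ? raw.substring(0, semicolon) : raw;
        // Assumption: surrounding whitespace is never significant here.
        return name.trim();
    }

    public static void main(String[] args) {
        String raw = "US-ASCII; format=flowed; delsp=yes";
        String name = stripParameters(raw);
        // Look up the cleaned name; the raw value is not a legal charset name.
        System.out.println(name + " resolves to " + Charset.forName(name));
    }
}
```

On the NOTE above: the legal charset-name characters documented for java.nio.charset.Charset do not include semicolons or spaces, so cutting at the first semicolon should be safe for any name Java itself can resolve.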

Unable to attach sample file due to sensitive nature.

Exception message and stack trace (POI-3.8-beta4):

BaseTextExtractionService - Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2ddd595d
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2ddd595d

Caused by: java.lang.RuntimeException: Encoding not found - US-ASCII; format=flowed
	at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:155)
	at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:86)
	at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
	at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:413)
	at org.apache.poi.hsmf.MAPIMessage.guess7BitEncoding(MAPIMessage.java:373)
	at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:73)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:219)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 49 more
Caused by: java.io.UnsupportedEncodingException: US-ASCII; format=flowed
	at java.lang.StringCoding.decode(StringCoding.java:170)
	at java.lang.String.<init>(String.java:443)
	at java.lang.String.<init>(String.java:515)
	at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:153)
	... 56 more
Comment 1 Maxim Valyanskiy 2011-09-28 08:20:21 UTC
Thanks. I updated the regular expression that searches for the charset in revision r1176780.
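
The pattern committed in r1176780 is not reproduced in this report. As a rough sketch only, a charset-extracting expression that stops at the first semicolon, quote, or whitespace might look like the following; the class ContentTypeCharset, the method extract, and the pattern itself are assumptions made for illustration, not POI code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the pattern used by POI.
final class ContentTypeCharset {

    // Capture the text after "charset=", skipping an optional opening quote
    // and stopping at the first semicolon, quote, or whitespace character.
    private static final Pattern CHARSET =
            Pattern.compile("charset=[\"']?([^;\"'\\s]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name found in the header, or null if none is present. */
    static String extract(String header) {
        Matcher m = CHARSET.matcher(header);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String header = "Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes";
        System.out.println(extract(header));
    }
}
```

With a pattern of this shape, trailing parameters such as format=flowed never reach m.group(1), so no later substring step is needed.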
Comment 2 Jeremy 2011-10-03 19:26:02 UTC
Thanks for the quick fix. I tested it and confirmed it works against the files I was having an issue with.

(In reply to comment #1)
> Thanx. I updated regular expression that searches for charset in revision
> r1176780.