Please see https://issues.apache.org/jira/browse/TIKA-1002. Tika cannot decode the unicode-1-1-utf-7 encoding.

These might be the mail headers involved:

Subject: Benachrichtigung =?unicode-1-1-utf-7?Q?+APw-ber Zustellstatus (Relay)?=
--9B095B5ADSN=_01C542A985ACDA72000063DAApollo.foobarbaz
Content-Type: text/plain; charset=unicode-1-1-utf-7

ERROR 2012-10-04 10:13:01,589 - de.uplanet.lucy.server.searchengine.index.SearchIndexCompiler[SearchIndexCompilerThread] Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3b7304b
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3b7304b
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    at de.uplanet.lucy.server.docplug.tika.TikaDocPlug.prepare(Unknown Source)
    at de.uplanet.lucy.server.searchengine.index.AddFileIndexAction.performAction(Unknown Source)
    at de.uplanet.lucy.server.searchengine.index.SearchIndexCompiler.a(Unknown Source)
    at de.uplanet.lucy.server.searchengine.index.SearchIndexCompiler.processQueuedActions(Unknown Source)
    at de.uplanet.lucy.server.searchengine.index.SearchIndexCompiler$SearchIndexCompilerThread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Encoding not found - unicode-1-1-utf-7
    at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:155)
    at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:86)
    at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
    at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:421)
    at org.apache.poi.hsmf.MAPIMessage.guess7BitEncoding(MAPIMessage.java:380)
    at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:80)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:210)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 8 more
Caused by: java.io.UnsupportedEncodingException: unicode-1-1-utf-7
    at java.lang.StringCoding.decode(StringCoding.java:170)
    at java.lang.String.<init>(String.java:443)
    at java.lang.String.<init>(String.java:515)
    at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:153)
    ... 16 more

Ref.:
http://tools.ietf.org/html/rfc1642
http://tools.ietf.org/html/rfc2152
Do you know what's the closest JVM-supported encoding we should be mapping this onto?
This is not an encoding like the ones the JVM defines as charsets, but rather something like a MIME encoding. See RFC 1642 and RFC 2152.
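To make the RFC 2152 scheme concrete, here is a minimal decoding sketch. It is not how Tika or POI handle this; the class name Utf7Sketch and its method are hypothetical, and the code assumes well-formed input (malformed base64 runs cause an IllegalArgumentException). A '+' opens a run of modified base64 that carries UTF-16BE code units, an optional '-' closes the run, and "+-" stands for a literal '+'.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, for illustration only: a minimal RFC 2152 (UTF-7) decoder.
final class Utf7Sketch {

    // True for characters of the base64 alphabet (no '=' padding in UTF-7).
    private static boolean isBase64(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || c == '+' || c == '/';
    }

    static String decode(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c != '+') {          // directly encoded ASCII character
                out.append(c);
                i++;
                continue;
            }
            // Find the end of the base64 run that follows the '+'.
            int j = i + 1;
            while (j < in.length() && isBase64(in.charAt(j))) {
                j++;
            }
            if (j == i + 1 && j < in.length() && in.charAt(j) == '-') {
                out.append('+');     // "+-" encodes a literal plus sign
            } else {
                // java.util.Base64's basic decoder does not require padding.
                byte[] raw = Base64.getDecoder().decode(in.substring(i + 1, j));
                // The payload is UTF-16BE; ignore a trailing odd byte, if any.
                out.append(new String(raw, 0, raw.length & ~1, StandardCharsets.UTF_16BE));
            }
            if (j < in.length() && in.charAt(j) == '-') {
                j++;                 // the terminating '-' is absorbed
            }
            i = j;
        }
        return out.toString();
    }
}
```

Applied to the payload of the Subject header above, Utf7Sketch.decode("+APw-ber Zustellstatus (Relay)") yields "über Zustellstatus (Relay)".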
Any suggestions on how we should be decoding the bytes for this case then?
It should be possible to solve this by installing a codec for UTF-7 in the JVM installation that you are using, see e.g. https://confluence.atlassian.com/jirakb/parsing-utf-7-email-messages-643137684.html and http://jutf7.sourceforge.net/ Therefore resolving this as WORKSFORME.
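A quick way to check whether such a codec is visible to the JVM is to ask java.nio.charset.Charset for the name from the stack trace. The sketch below is only an illustration: the class name CharsetLookupSketch and the US-ASCII fallback are assumptions, not what POI or Tika do. Whether an installed provider such as jutf7 registers the exact name unicode-1-1-utf-7 is also an assumption to verify against that library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class CharsetLookupSketch {

    // Returns the named charset if the running JVM (including any installed
    // CharsetProvider) supports it, otherwise the given fallback.
    static Charset resolve(String name, Charset fallback) {
        try {
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names; use the fallback.
        }
        return fallback;
    }

    // Decodes 7-bit data without throwing when the charset is unknown.
    // Assumption: falling back to US-ASCII is acceptable; UTF-7 escape
    // sequences such as "+APw-" then remain undecoded in the result.
    static String decode(byte[] data, String charsetName) {
        return new String(data, resolve(charsetName, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        String name = "unicode-1-1-utf-7";
        System.out.println(name + " supported: "
                + (resolve(name, null) != null));
    }
}
```

If the check reports the name as unsupported after the codec has been installed, the provider is probably not on the classpath the application actually uses, or it registers UTF-7 under a different alias.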