Hi,

I'm using Nutch to crawl websites, with Tika to parse documents. I encountered the following error and thought that this would be the place to log it.

2012-09-22 22:30:03,321 ERROR tika.TikaParser - Error parsing http://www.montpelier-vt.org/upload/groups/384/files/meac_11.17.10.doc
java.io.UnsupportedEncodingException: Codepage number may not be 0
    at org.apache.poi.hpsf.VariantSupport.codepageToEncoding(VariantSupport.java:338)
    at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:240)
    at org.apache.poi.hpsf.Property.<init>(Property.java:164)
    at org.apache.poi.hpsf.Section.<init>(Section.java:277)
    at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:452)
    at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:247)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:67)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:57)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:124)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)
2012-09-22 22:30:03,322 WARN parse.ParseUtil - Unable to successfully parse content http://www.montpelier-vt.org/upload/groups/384/files/meac_11.17.10.doc of type application/x-tika-msoffice
poi-trunk can parse the referenced file without problems. Please upgrade the POI jars in your Nutch distribution or wait for the next Tika release.

Yegor
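
For readers wondering what the exception means: the trace shows the HPSF code rejecting a property set whose codepage field is 0 while converting the codepage number to an encoding. The sketch below is only an illustration of how a lenient lookup could behave, written against the plain JDK. It is not POI's code and not necessarily how trunk handles the file; the class name `CodepageFallback`, the method `toCharset`, and the choice of ISO-8859-1 as the fallback for codepage 0 are all assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI or Tika.
public class CodepageFallback {

    // Maps a Windows codepage number to a Java Charset.
    // Assumption: codepage 0 is treated as "unspecified" and falls back
    // to ISO-8859-1 instead of raising an error.
    static Charset toCharset(int codepage) {
        switch (codepage) {
            case 0:
                return StandardCharsets.ISO_8859_1; // lenient fallback (assumption)
            case 1200:
                return StandardCharsets.UTF_16LE;   // Windows "Unicode" codepage
            case 28591:
                return StandardCharsets.ISO_8859_1;
            case 65001:
                return StandardCharsets.UTF_8;
            default:
                // Many JDKs register aliases such as "cp1252" or "cp1251".
                String name = "cp" + codepage;
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
                throw new IllegalArgumentException("Unsupported codepage: " + codepage);
        }
    }

    public static void main(String[] args) {
        // Decode a few bytes using whatever charset codepage 0 resolves to.
        byte[] raw = {72, 105};
        System.out.println(new String(raw, toCharset(0)));
    }
}
```

Whether a silent fallback is appropriate depends on the caller; failing loudly, as the version in the trace does, avoids mis-decoded metadata at the cost of rejecting the whole document.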