Bug 30810 - Reading special characters from the MS Word Document thru POI and webdav
Summary: Reading special characters from the MS Word Document thru POI and webdav
Alias: None
Product: POI
Classification: Unclassified
Component: HPSF
Version: 2.5-FINAL
Hardware: Sun Solaris
Importance: P3 critical
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2004-08-23 18:16 UTC by Srivani
Modified: 2006-02-03 11:48 UTC

Checkinfilter.java (4.91 KB, text/plain)
2004-08-26 16:43 UTC, Srivani
Sample Document (27.00 KB, application/octet-stream)
2004-08-26 16:48 UTC, Srivani

Description Srivani 2004-08-23 18:16:51 UTC
MS Word documents are dragged and dropped into the CMS through WebDAV, and POI
reads the title from each Word document. Our CMS server runs on Sun Solaris, and
the WebDAV URL is configured for each user through Windows Explorer.

POI does not read special characters such as è and ® from the title field of
the Word document when the file is dragged and dropped in through WebDAV. It
works fine if the server is on Windows and WebDAV is also on Windows. It does
not work if the CMS server is on Sun Solaris and the WebDAV URL is accessed
through Windows Explorer.
Comment 1 Rainer Klute 2004-08-24 17:12:47 UTC
I don't understand what you are doing and especially I don't know what it means
to "read from the MS Word Document thru POI and webdav". You should give more
details so we can help better.

However, I suppose that this is not a POI problem since - as you say - reading
the POI file under Windows works. Did you set the LANG environment under Solaris
to a sensible value? If you don't, the JVM reads ASCII characters only and
transforms anything else to '?' characters.
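
The effect described above can be reproduced outside the CMS. The following is
a minimal sketch, assuming a Java 7 or later runtime. The class name
CharsetRoundTrip and the sample title are made up for illustration and do not
come from the reporter's filter. It encodes a string containing è and ® in a
given charset and decodes it again. US-ASCII has no mapping for these two
characters, so they come back as '?'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    /** Encodes text to bytes in the given charset, then decodes it back. */
    public static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // byte ('?' for US-ASCII) for characters it cannot encode.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Hypothetical title containing è (U+00E8) and ® (U+00AE).
        String title = "Press Release \u00e8\u00ae";

        // The charset the JVM derived from the environment (e.g. LANG).
        System.out.println("default charset: " + Charset.defaultCharset());

        System.out.println("US-ASCII   : " + roundTrip(title, StandardCharsets.US_ASCII));
        System.out.println("ISO-8859-1 : " + roundTrip(title, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8      : " + roundTrip(title, StandardCharsets.UTF_8));
    }
}
```

If Charset.defaultCharset() reports an ASCII variant on the Solaris server,
output written through System.out would be expected to show the same
substitution.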
Comment 2 Srivani 2004-08-25 01:06:15 UTC
I did change the lang property to UTF8/ISO8859-1, but I still have the problem.
What I am trying to do here is:

1. WebDAV folders behave like Windows Explorer folders, follow the HTTP
protocol, and are accessible from My Network Places.
2. WebDAV: drop the MS Word doc into the WebDAV folder through My Network Places
(this should automatically check the doc in to the CMS server).
3. When I drop in the file, I apply the POI library to read the title from
the MS Word doc before it is checked in to the content server (just FYI, the
content server allows check-in filters, and the code is enclosed here).

public int doFilter(Workspace ws, DataBinder binder, ExecutionContext cxt)
		throws DataException, ServiceException
{
		if(isWordDoc(binder)) {
			String fileName = binder.getLocal("primaryFile:path");
			try {
				POIFSReader r = new POIFSReader();
				MyPOIFSReaderListener listner = new MyPOIFSReaderListener();
				// Assumption: the registration call is missing from the pasted
				// code; "\005SummaryInformation" is the summary stream name.
				r.registerListener(listner, "\005SummaryInformation");
				r.read(new FileInputStream(fileName));
				String title = listner.getTitle();
				System.out.println(" My Title: \"" + title + "\"");
				if(title != null) {
					// (statement missing from the pasted code)
				}
			}catch(java.io.FileNotFoundException e) {
				System.out.println("FileNotFoundException : " + fileName);
			}catch(java.io.IOException e) {
				System.out.println("IOException : " + fileName);
			}
		}

		// filter executed correctly.  Return CONTINUE
		return CONTINUE;
}

Comment 3 Rainer Klute 2004-08-26 09:56:50 UTC
Two questions:

- What does your MyPOIFSReaderListener look like?

- Can you provide a sample document together with the output of your CMS filter?
Comment 4 Srivani 2004-08-26 16:43:00 UTC
Created attachment 12537
Checkinfilter.java
Comment 5 Srivani 2004-08-26 16:48:05 UTC
Created attachment 12538
Sample Document
Comment 6 Srivani 2004-08-26 16:52:14 UTC
I have attached CheckinFilter.java and the sample document that I am reading from.
The output of getTitle is:

������ Network Appliance - Press Release - 02/17/2004�议�� 

Comment 7 Rainer Klute 2006-02-03 20:48:24 UTC
The sample document contains those funny characters in the title, and POI
extracts them correctly. The rest of the sample document looks fine. How the
special characters got into the title property and whether that's correct or not
is outside the scope of POI, and of HPSF in particular.
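
To check whether unexpected characters are already part of the string returned
by getTitle(), rather than introduced when the string is printed, the code
points can be written out as hex, which does not depend on the console
encoding. A minimal sketch: the class TitleDump and its describe method are
hypothetical helpers and are not part of POI.

```java
public class TitleDump {

    /** Returns the code points of s as space-separated U+XXXX values. */
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Advance by one or two chars, depending on surrogate pairs.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical title; in the filter this would be listner.getTitle().
        String title = "Press Release \u00e8\u00ae";
        System.out.println(describe(title));
    }
}
```

Output consisting only of ASCII is safe to compare between the Solaris and
Windows servers, so any difference in the dump would point at the extracted
data and not at the display.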