Bug 43736 - Chainsaw does not honor encoding when loading XML files
Summary: Chainsaw does not honor encoding when loading XML files
Status: ASSIGNED
Alias: None
Product: Log4j - Now in Jira
Classification: Unclassified
Component: chainsaw
Version: 1.2
Hardware: Other
Importance: P2 normal
Target Milestone: ---
Assignee: log4j-dev
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-10-30 13:18 UTC by Curt Arnold
Modified: 2008-08-05 17:25 UTC



Attachments
This is the file containing Chinese characters. (3.36 KB, application/octet-stream)
2007-10-30 15:02 UTC, jlin
the problem still seems to exist (189.29 KB, application/x-zip-compressed)
2007-11-06 15:48 UTC, jlin

Description Curt Arnold 2007-10-30 13:18:38 UTC
On Oct 30, 2007, at 2:16 PM on log4j-user, Jessica Lin wrote:

I want to use Chainsaw to view a log file containing Chinese characters. The log file was written by a
FileAppender whose encoding I defined as “UTF-8”. Here is part of my log4j.properties file.


# xml format file appender
log4j.appender.xml=org.apache.log4j.FileAppender
log4j.appender.xml.file=xml.log
log4j.appender.xml.encoding=UTF-8
log4j.appender.xml.append=false
log4j.appender.xml.layout=org.apache.log4j.xml.XMLLayout

Then I use Chainsaw to load the “xml.log” file. The Chinese characters are shown as “ åŠ è¿™ä¸ªåŠŸèƒ½”.
The original characters are “?????”.

I double-checked that “xml.log” was indeed saved with UTF-8 encoding. The XMLDecoder class which
Chainsaw uses to load XML files also uses UTF-8 encoding.

Can you help me?

Thanks,

Jessica


---------

The problem appears to be in o.a.l.xml.XMLDecoder in the receivers companion, where at lines 186 and
188, InputStreamReaders are allocated without explicitly specifying an encoding.  That causes the
InputStreamReader to use the default platform encoding, which appears not to be UTF-8 in this
instance.

The approach is broken and needs to be rewritten to handle any arbitrary encoding.  The XML parser 
should be presented with a minimal document like:

<!DOCTYPE log4j:eventSet [
<!ENTITY content SYSTEM "...">
]>
<log4j:eventSet version="1.2" xmlns:log4j="...">
    &content;
</log4j:eventSet>

and an entity resolver should then load the URL as a byte stream in response to the resolveEntity call. 
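
A minimal sketch of that entity-resolver approach follows. The class and method names are illustrative, not from any actual patch; the log4j namespace URI is an assumption about what XMLLayout emits.

```java
import java.io.IOException;
import java.net.URL;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Illustrative sketch only: resolve &content; to the raw bytes of the
// log file so the XML parser performs its own encoding detection.
public class LogFileResolver implements EntityResolver {
    private final URL logFile;

    public LogFileResolver(URL logFile) {
        this.logFile = logFile;
    }

    // Builds the minimal wrapper document described above.  The log4j
    // namespace URI here is assumed, not taken from the bug report.
    public static String wrapperDocument(String entityUrl) {
        return "<!DOCTYPE log4j:eventSet [\n"
             + "<!ENTITY content SYSTEM \"" + entityUrl + "\">\n"
             + "]>\n"
             + "<log4j:eventSet version=\"1.2\" "
             + "xmlns:log4j=\"http://jakarta.apache.org/log4j/\">\n"
             + "    &content;\n"
             + "</log4j:eventSet>";
    }

    // Called back by the parser for &content;: hand over an undecoded
    // byte stream, never a Reader, so the parser sniffs the encoding.
    public InputSource resolveEntity(String publicId, String systemId)
            throws IOException {
        InputSource src = new InputSource(logFile.openStream());
        src.setSystemId(systemId);
        return src;
    }
}
```

The parser would be given the wrapper document plus this resolver; when it expands `&content;` it receives raw bytes, so the spec's encoding detection applies to the log file itself.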

As a workaround, anything that sets the default charset for the JVM to UTF-8 should avoid the
problem until it can be fixed.  There is no clearly documented way to do that, and it is platform
dependent.  On a *nix machine, you could try

export LC_CTYPE=UTF-8

on Windows you could try:

java -Dfile.encoding=UTF-8 org.apache.log4j.chainsaw...
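For illustration, a small standalone demo (not part of Chainsaw) of why the missing charset argument matters: the same UTF-8 bytes decoded with a Windows single-byte charset, which is what an InputStreamReader without an explicit encoding uses on such a platform, produce exactly the garbage quoted in the report. The Chinese text is recovered from that mojibake.

```java
import java.io.UnsupportedEncodingException;

// Demonstrates the mojibake mechanism: UTF-8 bytes decoded with the
// wrong single-byte charset versus decoded correctly.
public class MojibakeDemo {
    public static String decode(byte[] bytes, String charset)
            throws UnsupportedEncodingException {
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = "加这个功能".getBytes("UTF-8");
        // What an InputStreamReader with no explicit encoding does on a
        // windows-1252 platform: prints the garbage reported above.
        System.out.println(decode(utf8, "windows-1252"));
        // What the explicit-encoding fix does: prints the original text.
        System.out.println(decode(utf8, "UTF-8"));
    }
}
```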
Comment 1 jlin 2007-10-30 15:02:02 UTC
Created attachment 21059 [details]
This is the file containing Chinese characters.

Here is the file for your testing.

Thanks,
Jessica
Comment 2 Curt Arnold 2007-10-30 17:12:26 UTC
Minimal fix would be something like:

===================================================================
--- src/main/java/org/apache/log4j/xml/XMLDecoder.java  (revision 570571)
+++ src/main/java/org/apache/log4j/xml/XMLDecoder.java  (working copy)
@@ -183,9 +183,9 @@
     if (owner != null) {
       reader = new LineNumberReader(new InputStreamReader(
               new ProgressMonitorInputStream(owner,
-                      "Loading " + url , url.openStream())));
+                      "Loading " + url , url.openStream(), "UTF-8")));
     } else {
-      reader = new LineNumberReader(new InputStreamReader(url.openStream()));
+      reader = new LineNumberReader(new InputStreamReader(url.openStream(), "UTF-8"));
     }
 
     Vector v = new Vector();


With comparable changes likely needed for UtilLoggingXMLDecoder.  That isn't sufficient to handle UTF-16 encoded
log files, but it would be a good interim fix.  A proper fix probably means rewriting the whole thing (likely
using a SAX parser instead of a DOM parser, to reduce memory use).  It would be good to get unit tests around it
first, before a rewrite.
Comment 3 Paul Smith 2007-11-02 22:19:14 UTC
I have applied Curt's patch to XMLDecoder with a very minor tweak.  The DOCTYPE
preamble was always assuming that the file was UTF-8 anyway, so it seems safe
to do this, but an arbitrary encoding cannot be supported without quite a bit of
fiddly UI work at this stage.

I have elected not to modify the UtilLoggingXMLDecoder at this point, as it does
not use a preamble and is therefore assuming that the file is in whatever the
default platform encoding is anyway.

Comment 4 Curt Arnold 2007-11-03 10:44:50 UTC
The presence of an encoding in the document type declaration in the string representation of the
document is ignored by the parser.  It is a relic of the document once being encoded as a byte
stream; by the time the parser sees it, the byte stream has already been decoded into a string by the
InputStreamReader, which is oblivious to any encoding declaration in the byte stream.

An external parsed entity (such as our and JUL's XML log files) that is not in UTF-8 or UTF-16 requires
an explicit text declaration (http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding).
Without a text declaration, a parser will sniff the file, determine whether it is UTF-16BE or UTF-16LE by the
presence of alternating 0 bytes, and otherwise assume that it is UTF-8.  More detail at
http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing.  It will never consult the platform default
encoding.
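A simplified sketch of that autodetection, with an illustrative class name. This mirrors the spec's Appendix F in outline only; real parsers examine more byte patterns (UTF-32, EBCDIC families, the UTF-8 BOM) than shown here.

```java
// Simplified encoding sniffer for an external parsed entity with no
// text declaration: a BOM or a zero byte around '<' signals UTF-16;
// otherwise the spec's fallback is UTF-8 -- never the platform default.
public class EncodingSniffer {
    public static String sniff(byte[] head) {
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE"; // BOM FE FF
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE"; // BOM FF FE
            if (b0 == 0x00 && b1 == 0x3C) return "UTF-16BE"; // '<' as 00 3C
            if (b0 == 0x3C && b1 == 0x00) return "UTF-16LE"; // '<' as 3C 00
        }
        return "UTF-8"; // the spec's fallback assumption
    }
}
```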

However, if it were written in the platform encoding with the proper text declaration, like:

<?xml encoding="ISO-8859-1"?>
<log4j:event../>
<log4j:event../>

the current XML decoders would fail to read the file, since they would just append that content to the string,
leaving the text declaration in the wrong place and causing a parsing error.
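A quick demonstration of that failure mode, using a hypothetical wrapper element name: a conformant parser rejects any `<?xml ...?>` declaration that is not at the very start of the entity.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Shows that gluing file content carrying a text declaration after a
// decoder-supplied preamble produces a fatal parse error.
public class MisplacedTextDecl {
    public static boolean parses(String doc) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(
                            new ByteArrayInputStream(doc.getBytes("UTF-8"))),
                    new DefaultHandler());
            return true;
        } catch (Exception e) {
            return false; // a PI targeting "xml" mid-document is rejected
        }
    }

    public static void main(String[] args) {
        String fileContent = "<?xml encoding=\"ISO-8859-1\"?><event/>";
        // false: the text declaration is no longer at the entity start
        System.out.println(parses("<wrapper>" + fileContent + "</wrapper>"));
    }
}
```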

The only way to do it right is to rewrite, which I'm willing to do after getting log4cxx out the door.  But
there is never a case where using the platform encoding will get you the right content while using
"UTF-8" would get you the wrong content.
Comment 5 Curt Arnold 2007-11-03 10:54:36 UTC
On fiddly UI changes: there should be no changes to the user interface necessary, since the UI appears to
be able to display the right characters if the log file is parsed correctly, and all the information needed to
parse the file correctly is in the document itself.
Comment 6 Jacob Kjome 2007-11-03 13:37:56 UTC
I suggest looking at what the Rome developers have done with their XMLReader class [1].  It goes through an elaborate process to figure out the proper charset for the document.  It's explained here [2] and here [3].  It's copyrighted by Sun Microsystems, but it's under an Apache license.  I adapted it for the XMLC project [4] and modified it to use gnu-regexp instead of the JDK 1.4 regexp, since my project depends on JDK 1.3, not 1.4.  I also slightly modified a couple of constructors to make it easier to provide a per-instance defaultEncoding if, for some reason, none can be detected.  I use it like this:

try {
    InputSource inputSource = new ClosingInputSource(url);
    try {
        XmlReader reader = new XmlReader(InputSourceOps.openSystemId(url), false, defaultEncoding);
        inputSource.setCharacterStream(reader);
        inputSource.setEncoding(reader.getEncoding());
    } catch (XmlReaderException xre) {
        //This is somewhat unlikely to happen, but doesn't hurt to have
        //extra fallback, which XmlReader conveniently allows for by
        //providing access to the original unconsumed inputstream via
        //the XmlReaderException
        inputSource.setByteStream(xre.getInputStream());
        inputSource.setEncoding(defaultEncoding);
    }
    return inputSource;
} catch (IOException ioe) {
    throw new XMLCError("Couldn't load file "+url, ioe);
}

A lot of thought was put into this by the Rome team.  Seems like it would make sense to reuse it in Log4j rather than reinvent the wheel with something, likely, not nearly as robust.

Jake

[1] https://rome.dev.java.net/source/browse/rome/src/java/com/sun/syndication/io/XmlReader.java
[2] http://wiki.java.net/bin/view/Javawsxml/Rome05CharsetEncoding
[3] http://blogs.sun.com/tucu/entry/detecting_xml_charset_encoding_again
[4] http://cvs.xmlc.forge.objectweb.org/cgi-bin/viewcvs.cgi/xmlc/xmlc/xmlc/modules/xmlc/src/org/enhydra/xml/io/XmlReader.java

Comment 7 Curt Arnold 2007-11-03 22:41:53 UTC
Any XML parser will have the proper handling for encodings; you just need to allow it to work with the
byte streams instead of trying to help it by passing it character strings, which will already have made
assumptions about encoding that can't be undone.  I'm not familiar with Rome, but it seems like
everything is well within the capabilities of JAXP if called properly, and there should be no need to add
another dependency.
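As a sketch of that, plain JAXP SAX parsing over an InputStream lets the parser do its own encoding detection with no extra dependency. The class and method names here are illustrative, not from Chainsaw.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Event-based parsing straight from bytes: no intermediate DOM, and the
// parser, not the JVM default charset, decides how to decode.
public class ByteStreamParse {
    public static String textContent(InputStream rawBytes) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(rawBytes), new DefaultHandler() {
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // collect character data
            }
        });
        return text.toString();
    }
}
```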

The current approach is also wasteful since it first populates a DOM document and then extracts the 
info into an event stream when using an event based parser could eliminate the intermediate DOM 
document and the associated memory use.  I could see that being a pretty substantial performance 
issue with large log files.

I've got a lot of experience in this area, and it would not take much time for me to rewrite the code; I
will try to get to it quickly.
Comment 8 jlin 2007-11-06 11:01:11 UTC
Since the workaround in the description doesn't work for me, do you think I
can get the fix within one week? Can you send the Chainsaw bundle as a zip, or
the affected jars?


Thanks,
Jessica
Comment 9 Paul Smith 2007-11-06 12:57:20 UTC
You could try this link:

http://people.apache.org/builds/logging/repo/log4j/apache-chainsaw/1.99.0-SNAPSHOT/apache-chainsaw-1.99.0-20071103.061102-3-standalone.zip

I'm currently experimenting with maven packaging options.  If you unpack the zip
and run the .sh or .bat file in the bin/ subdirectory it should launch Chainsaw
with the encoding change.
Comment 10 jlin 2007-11-06 15:48:38 UTC
Created attachment 21092 [details]
the problem still seems to exist
Comment 11 Thorbjørn Ravn Andersen 2008-07-03 04:02:52 UTC
(In reply to comment #4)

> The only way to do it right is to rewrite, which I'm willing to do after
> getting log4cxx out the door.  But 
> there is never any case where using the platform encoding will get you the
> right content and using 
> "UTF-8" would get you the wrong content.

I would suggest looking at the XML snippet generating code instead, and changing it so that all non-ASCII characters, as well as <, > and &, are encoded as &#...; (using the Unicode value).  This will - I guess - parse correctly and completely circumvent the encoding problem.

The CDATA wrapper will not be necessary then, since all problematic characters are properly encoded.

Additionally, this will be backward compatible.
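A sketch of that escaping, with an illustrative class name. Note this simple version escapes each UTF-16 code unit separately; supplementary-plane characters would need codePointAt handling to emit a single correct reference.

```java
// Encode all non-ASCII characters plus <, > and & as decimal numeric
// character references, so the output is pure ASCII and parses the same
// under any single-byte encoding.
public class NcrEscaper {
    public static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F || c == '<' || c == '>' || c == '&') {
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```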
Comment 12 Thorbjørn Ravn Andersen 2008-08-02 09:55:43 UTC
My experience with XML parsers has indicated that, in order to get character encoding right, they should be given a File object to work with instead of a stream or a reader.

Is that an option with this issue in the code?
Comment 13 Curt Arnold 2008-08-05 17:25:57 UTC
Streams are okay; they don't carry the base information needed to resolve relative URLs, but that isn't an issue here.  XML parsers need to work with the raw, undecoded byte stream, so InputStream, InputSource, File, and URL all work fine.  However, String and Reader are both decoded upstream of the parser based on the default encoding (which XML purposefully ignores), and by the time the parser gets the content, everything may be corrupt.

This might be a blocker for a receivers release, but isn't part of log4j so would not be considered a blocker for log4j 1.2.16.