On Oct 30, 2007, at 2:16 PM on log4j-user, Jessica Lin wrote:

I want to use Chainsaw to view a log file that contains Chinese characters. The log file was recorded using a FileAppender for which I defined the encoding as "UTF-8". Here is part of my log4j.properties file:

# xml format file appender
log4j.appender.xml=org.apache.log4j.FileAppender
log4j.appender.xml.file=xml.log
log4j.appender.xml.encoding=UTF-8
log4j.appender.xml.append=false
log4j.appender.xml.layout=org.apache.log4j.xml.XMLLayout

Then I use Chainsaw to load the "xml.log" file. The Chinese characters are shown as "åŠ è¿™ä¸ªåŠŸèƒ½". The original characters are "?????". I double-checked that "xml.log" was indeed saved with UTF-8 encoding. The XMLDecoder class which Chainsaw uses to load the XML file also uses UTF-8 encoding. Can you help me? Thanks, Jessica

---------

The problem appears to be in o.a.l.xml.XMLDecoder in the receivers companion, where at lines 186 and 188 InputStreamReaders are allocated without explicitly specifying an encoding. That causes the InputStreamReader to use the default platform encoding, which appears not to be UTF-8 in this instance.

The approach is broken and needs to be rewritten to handle any arbitrary encoding. The XML parser should be presented with a minimal document like:

<!DOCTYPE log4j:eventSet [
  <!ENTITY content SYSTEM "...">
]>
<log4j:eventSet version="1.2" xmlns:log4j="...">
  &content;
</log4j:eventSet>

and an entity resolver should then load the URL as a byte stream in response to the resolveEntity call.

As a workaround, anything that sets the default charset for the JVM to UTF-8 should avoid the problem until it can be fixed. There is no clearly documented way to do that, and it is platform dependent. On a *nix machine, you could try:

export LC_CTYPE=UTF-8

On Windows you could try:

java -Dfile.encoding=UTF-8 org.apache.log4j.chainsaw...
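The entity-resolver approach described above can be sketched with plain SAX. This is a minimal standalone illustration, not Chainsaw code: the class name, the dummy SYSTEM id, the namespace URI, and the in-memory log content are all invented for the demo. The key point is that the resolver hands the parser a byte stream, so the parser applies its own encoding detection to the log data.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class EntityResolverSketch {

    // Stand-in for the log file's raw bytes; a real receiver would open
    // the log URL as an InputStream and hand the bytes to the parser
    // undecoded, letting it sniff UTF-8 vs UTF-16 itself.
    static final byte[] LOG_BYTES =
        "<log4j:event logger=\"demo\" level=\"INFO\"/>"
            .getBytes(StandardCharsets.UTF_8);

    public static int countEvents() throws Exception {
        // Minimal wrapper document that pulls the log content in as an
        // external parsed entity (the SYSTEM id here is a dummy).
        String wrapper =
            "<!DOCTYPE log4j:eventSet [<!ENTITY content SYSTEM \"file:///log\">]>"
          + "<log4j:eventSet version=\"1.2\" xmlns:log4j=\"http://jakarta.apache.org/log4j/\">"
          + "&content;</log4j:eventSet>";

        final int[] count = {0};
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // The resolver answers the resolveEntity call with a *byte* stream,
        // so the parser decodes the log data itself.
        parser.setEntityResolver((publicId, systemId) ->
            new InputSource(new ByteArrayInputStream(LOG_BYTES)));
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.endsWith("event")) {
                    count[0]++;
                }
            }
        });
        parser.parse(new InputSource(new StringReader(wrapper)));
        return count[0];
    }
}
```

The wrapper itself may safely be a String, since it is pure ASCII; only the log content needs to reach the parser undecoded.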
Created attachment 21059 [details] This is the file containing Chinese characters. Here is the file for your testing. Thanks, Jessica
Minimal fix would be something like:

===================================================================
--- src/main/java/org/apache/log4j/xml/XMLDecoder.java (revision 570571)
+++ src/main/java/org/apache/log4j/xml/XMLDecoder.java (working copy)
@@ -183,9 +183,9 @@
     if (owner != null) {
         reader = new LineNumberReader(new InputStreamReader(
             new ProgressMonitorInputStream(owner,
-                "Loading " + url, url.openStream())));
+                "Loading " + url, url.openStream()), "UTF-8"));
     } else {
-        reader = new LineNumberReader(new InputStreamReader(url.openStream()));
+        reader = new LineNumberReader(new InputStreamReader(url.openStream(), "UTF-8"));
     }
     Vector v = new Vector();

With comparable changes likely for UtilLoggingXMLDecoder. That isn't sufficient to handle UTF-16 encoded log files, but it would be a good interim fix. The proper fix probably means rewriting the whole thing (likely using a SAX parser instead of a DOM parser to reduce memory use). It would be good to get unit tests around it before a rewrite.
I have applied Curt's patch to XMLDecoder with a very minor tweak. The DOCTYPE preamble was always assuming that the file was UTF-8 anyway, so it seems safe to do this, but any arbitrary encoding cannot be supported without quite a bit of fiddly UI work at this stage. I have elected not to modify the UtilLoggingXMLDecoder at this point, as it does not use a preamble and is therefore assuming that the file is in whatever the default platform encoding is anyway.
The presence of an encoding in the document type declaration in the string representation of the document is ignored by the parser. It is a relic of the document once being encoded as a byte stream; by the time the parser sees it, the byte stream has been decoded into a string by the InputStreamReader, which is oblivious to any encoding declaration in the byte stream.

An external parsed entity (such as our and JUL's XML log files) that is not in UTF-8 or UTF-16 requires an explicit text declaration (http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding). Without a text declaration, a parser will sniff the file, determine whether it is UTF-16BE or UTF-16LE by the presence of alternating 0 bytes, and otherwise assume that it is UTF-8. More detail at http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing. It will never consult the platform default encoding.

However, if the file were written in the platform encoding with the proper text declaration, like:

<?xml encoding="ISO-8859-1"?>
<log4j:event../>
<log4j:event../>

the current XML decoders would fail to read it, since they would just append that to the string, and the text declaration would then be in the wrong place, leading to a parsing error.

The only way to do it right is a rewrite, which I'm willing to do after getting log4cxx out the door. But there is never any case where using the platform encoding will get you the right content and using "UTF-8" would get you the wrong content.
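As a quick illustration of the spec behavior described above: a parser that is given raw bytes honors an explicit encoding declaration and never consults the platform default. This standalone sketch (the class and method names are mine, not Chainsaw's) parses ISO-8859-1 bytes correctly because the declaration names the encoding:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TextDeclDemo {

    public static String parseLatin1() throws Exception {
        // "café" encoded as ISO-8859-1 bytes (the é is a single 0xE9 byte),
        // with a declaration naming the encoding. Because the parser gets
        // the raw bytes, it decodes them per the declaration.
        byte[] bytes =
            "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><msg>caf\u00e9</msg>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new ByteArrayInputStream(bytes)));
        return doc.getDocumentElement().getTextContent();
    }
}
```

Had those same bytes been pre-decoded by an InputStreamReader using a different charset, the declaration would be ignored and the content corrupted, which is exactly the failure mode in the current decoders.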
On fiddly UI changes: there should be no changes to the user interface necessary, since the UI appears to be able to display the right characters if the log file is parsed correctly, and all the information needed to parse the file correctly is in the document itself.
I suggest looking at what the Rome developers have done with their XMLReader class [1]. It goes through an elaborate process to figure out the proper charset for the document. It's explained here [2] and here [3]. It's copyrighted by Sun Microsystems, but it's under an Apache license. I adapted it for the XMLC project [4] and modified it to use gnu-regexp instead of JDK 1.4 regexp, since my project depends on JDK 1.3, not 1.4. I also slightly modified a couple of constructors to make it easier to provide a per-instance defaultEncoding if, for some reason, none can be detected. I use it like this:

try {
    InputSource inputSource = new ClosingInputSource(url);
    try {
        XmlReader reader = new XmlReader(InputSourceOps.openSystemId(url),
                false, defaultEncoding);
        inputSource.setCharacterStream(reader);
        inputSource.setEncoding(reader.getEncoding());
    } catch (XmlReaderException xre) {
        // This is somewhat unlikely to happen, but it doesn't hurt to have
        // an extra fallback, which XmlReader conveniently allows for by
        // providing access to the original unconsumed input stream via
        // the XmlReaderException.
        inputSource.setByteStream(xre.getInputStream());
        inputSource.setEncoding(defaultEncoding);
    }
    return inputSource;
} catch (IOException ioe) {
    throw new XMLCError("Couldn't load file " + url, ioe);
}

A lot of thought was put into this by the Rome team. It seems like it would make sense to reuse it in log4j rather than reinvent the wheel with something likely not nearly as robust.

Jake

[1] https://rome.dev.java.net/source/browse/rome/src/java/com/sun/syndication/io/XmlReader.java
[2] http://wiki.java.net/bin/view/Javawsxml/Rome05CharsetEncoding
[3] http://blogs.sun.com/tucu/entry/detecting_xml_charset_encoding_again
[4] http://cvs.xmlc.forge.objectweb.org/cgi-bin/viewcvs.cgi/xmlc/xmlc/xmlc/modules/xmlc/src/org/enhydra/xml/io/XmlReader.java
Any XML parser will have the proper handling for encodings; you just need to let it work with the byte stream instead of trying to help it by passing it character strings, which will already have made assumptions about encoding that can't be undone. I'm not familiar with Rome, but it seems like everything here is well within the capabilities of JAXP if called properly, and there should be no need to add another dependency. The current approach is also wasteful, since it first populates a DOM document and then extracts the info into an event stream; using an event-based parser could eliminate the intermediate DOM document and the associated memory use. I could see that being a pretty substantial performance issue with large log files. I've got a lot of experience in this area; it would not take much time for me to rewrite the code, and I will try to get to it quickly.
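The difference between feeding a parser raw bytes and feeding it pre-decoded characters can be demonstrated with plain JAXP. In this sketch (class and method names are invented for the demo), the same UTF-8 bytes are parsed both ways; only the byte-stream route survives intact:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ByteStreamVsReader {

    // The two CJK characters are the interesting payload: their UTF-8
    // form is multi-byte, so any wrong upstream decode is visible.
    static final byte[] UTF8_XML =
        "<?xml version=\"1.0\"?><msg>\u4e2d\u6587</msg>"
            .getBytes(StandardCharsets.UTF_8);

    static String parse(InputSource in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(in);
        return doc.getDocumentElement().getTextContent();
    }

    // Hand the parser the raw bytes: it sniffs the encoding itself
    // (no declaration and no UTF-16 byte pattern, so it assumes UTF-8).
    public static String viaByteStream() throws Exception {
        return parse(new InputSource(new ByteArrayInputStream(UTF8_XML)));
    }

    // Decode upstream with the wrong charset: by the time the parser
    // sees characters, the content is already mojibake.
    public static String viaWrongReader() throws Exception {
        return parse(new InputSource(new InputStreamReader(
            new ByteArrayInputStream(UTF8_XML), StandardCharsets.ISO_8859_1)));
    }
}
```

The second route mirrors what the unpatched XMLDecoder does on a platform whose default charset is not UTF-8.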
Since the workaround in the description doesn't work for me, do you think I can get the fix within one week? Can you send the Chainsaw bundle as a zip, or the affected jars? Thanks, Jessica
You could try this link: http://people.apache.org/builds/logging/repo/log4j/apache-chainsaw/1.99.0-SNAPSHOT/apache-chainsaw-1.99.0-20071103.061102-3-standalone.zip I'm currently experimenting with maven packaging options. If you unpack the zip and run the .sh or .bat file in the bin/ subdirectory it should launch Chainsaw with the encoding change.
Created attachment 21092 [details] the problem still seems to exist
(In reply to comment #4)
> The only way to do it right is to rewrite, which I'm willing to do after
> getting log4cxx out the door. But there is never any case where using the
> platform encoding will get you the right content and using "UTF-8" would
> get you the wrong content.

I would suggest looking at the XML snippet generating code instead, and changing it so that all non-ASCII characters, as well as <, > and &, are encoded as &#...; (using the Unicode value). This will - I guess - parse correctly and completely circumvent the encoding problem. The CDATA wrapper will not be necessary then, since all problematic characters are properly encoded. Additionally, this will be backward compatible.
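A minimal sketch of the escaping this comment suggests, for element content only; the class name is made up, and real layout code would additionally have to escape quotes inside attribute values:

```java
public class XmlEscaper {

    // Encode <, > and & plus every non-ASCII code point as a numeric
    // character reference, so the serialized form is pure ASCII and
    // survives any byte-oriented encoding the file might be read with.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        // Iterate by code point, not char, so characters outside the BMP
        // become a single &#...; reference instead of two surrogate halves.
        s.codePoints().forEach(cp -> {
            if (cp == '<' || cp == '>' || cp == '&' || cp > 0x7e) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.append((char) cp);
            }
        });
        return sb.toString();
    }
}
```

For example, escape("a<b & 中") yields "a&#60;b &#38; &#20013;", which any parser decodes back to the original text regardless of the byte encoding used to store the file.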
My experience with XML parsers has indicated that, in order to get character encoding right, they should have a File object to work with instead of a stream or a reader. Is that an option with this issue in the code?
Streams are okay; they don't carry the base information needed to resolve relative URLs, but that isn't the case here. XML parsers need to work with the raw undecoded byte stream, so InputStream, InputSource, File, and URL all work fine. However, String and Reader both decode upstream of the parser based on the default encoding (which XML purposefully ignores), and by the time the parser gets the content, everything may be corrupt. This might be a blocker for a receivers release, but it isn't part of log4j, so it would not be considered a blocker for log4j 1.2.16.