Bug 51408

Summary:	String.getBytes() and new String(byte[]) use default charset - may cause problems in some Locales
Product:	Tomcat 7	Reporter:	Sebb <sebb>
Component:	Catalina	Assignee:	Tomcat Developers Mailing List <dev>
Status:	RESOLVED FIXED
Severity:	enhancement
Priority:	P2
Version:	trunk
Target Milestone:	---
Hardware:	PC
OS:	Windows XP

Description Sebb 2011-06-21 15:47:16 UTC

There seem to be rather a lot of instances where Strings and bytes are cconverted using the default charset.

Since the default charset is unknown, its behaviour cannot be relied on.

Comment 1 Mark Thomas 2011-06-21 15:52:58 UTC

It has been like this for many years with zero issues reported by users.

Comment 2 Christopher Schultz 2011-06-21 16:28:53 UTC

I object to marking this bug as INVALID, since I think it's something that should be fixed.

How about marking it "minor" and fixing it lazily. Just being explicit about the character set (ISO-8859-1? UTF-8?) should be easy to do and is unlikely to affect anyone.

Comment 3 Mark Thomas 2011-06-21 16:48:00 UTC

What exactly needs fixing? I don't see a report from a user reporting an error. Neither do I see any analysis that demonstrates how an error may occur.

Given that the code has remained unchanged for many years without any reports of errors from a world-wide user base, I find it unlikely there is actually anything here to fix.

There is a greater risk, in my view, of a change introducing a regression. "If it isn't broken, don't fix it."

I have changed this to enhancement. If there has been no activity when I next come back to this, it is likely to get re-closed.

Comment 4 Konstantin Kolinko 2011-06-21 23:57:20 UTC

(In reply to comment #0)
> There seem to be rather a lot of instances where Strings and bytes are
> cconverted using the default charset.

Please be specific!

Searching through the TC7 source code I found

=====================================
1) 0 wrong calls to String.getBytes()

That is
1. Disregarding the test classes - have not looked through them at all,
2. Disregarding some never called test code in o.a.c.tribes.membership.Constants.main()
3. Disregading legit calls to this method, when encoding is configurable and the default encoding is used as fallback.

Nothing to do there.

=====================================
2) 2 places where new String(byte[]) calls might be fixed:

2.1. In SSIFilter#doFilter(), the following line:

   res.getWriter().write(new String(bytes));

It is some fallback code, when there is no OutputStream available, but only a writer. Is it possible to get encoding of the writer there, e.g. from content-type header?

Patches are welcome. It is an odd and rare use case though.

2.2. In JNDIRealm#getAttributeValue(String,Attributes)

  valueString = new String((byte[]) value);

I do not know JNDI API enough to comment here.

The rest are fallbacks, debug code etc.

=====================================
3) new String(byte[],int,int):

3.1. WebappClassLoader#findResourceInternal()

  String str = new String(binaryContent,0,pos);

That is the correct code and should stay as is. See r303915.

The rest are fallbacks, debug code etc.

=====================================
4) new FileWriter(String) or new FileWriter(String,[boolean]) instead of OutputStreamWriter + FileOutputStream

Is used in one place only:

4.1. in AccessLogValve.open()

This can be fixed - to add "encoding" configuration option for the valve. I think there was an enhancement request for that.

=====================================

That is all.

Comment 5 Sebb 2011-06-22 00:19:06 UTC

Sorry, I see now that the subject was misleading.

I was actually referring to the calls to getBytes(Charset.defaultCharset()) which were added recently in r1138019.

There are lots more of those, for example:

DigestAuthenticator.generateNonce
JNDIRealm.compareCredentials

etc.

Comment 6 Mark Thomas 2012-01-19 15:05:26 UTC

I'm reluctant to change this for 7.0.x since there is a risk it could break something and no-one has reported a bug as a result of this.

I have changed these to explicitly use ISO-8859-1 for 8.0.x onwards. There are some places where UTF-8 may be a better choice in the future as standards move towards UTF-8. We can change these as users request it.