Bug 28049

Summary: sql task not able to import "windows UTF-8"
Product: Ant
Reporter: bruno arliguy <barliguy>
Component: Core tasks
Assignee: Ant Notifications List <notifications>
Status: NEW ---    
Severity: minor    
Priority: P3    
Version: 1.6.1   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   

Description bruno arliguy 2004-03-30 15:36:47 UTC
UTF-8 files generated with Windows applications like Notepad have a BOM (byte-order
mark) at the beginning of the file. When I try to import a UTF-8 .sql file with the
"sql" Ant task, it is not able to read the file. If I erase the BOM, all is OK.

Of course I added encoding="UTF-8" in the task definition.

A BOM is not mandatory in a UTF-8 file. The BOM is composed of the bytes EF BB BF;
read as ISO-8859-1 these are the chars "ï»¿".
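
As an illustration of the byte values above, here is a minimal sketch that decodes
the three BOM bytes as ISO-8859-1. The class and method names are made up for this
example, and it uses java.nio.charset.StandardCharsets, which is newer than the Java
versions current when this report was filed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ant.
class BomAsLatin1 {
    // The three bytes of the UTF-8 BOM, as given in the report.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // ISO-8859-1 maps every byte to the char with the same code point,
    // so the BOM turns into three visible characters.
    static String decodeAsLatin1() {
        return new String(UTF8_BOM, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = decodeAsLatin1();
        for (int i = 0; i < s.length(); i++) {
            // Prints U+00EF, U+00BB, U+00BF, one per line.
            System.out.printf("U+%04X%n", (int) s.charAt(i));
        }
    }
}
```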

Perhaps I missed a parameter for the sql task definition?
Comment 1 bruno arliguy 2004-03-30 19:23:48 UTC
In fact, according to http://www.unicode.org/unicode/faq/utf_bom.html#25 a UTF-8
file can have this BOM... It seems it's Java that doesn't take care of it when
opening an InputStreamReader with the "utf-8" charsetName.
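
The behaviour described here can be reproduced with a small sketch. It assumes
nothing about Ant: it feeds a BOM-prefixed byte array (made-up sample data) to an
InputStreamReader and returns the first char read. The class name is hypothetical
and the code uses modern Java syntax.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ant.
class Utf8BomRead {
    // Returns the first char a UTF-8 InputStreamReader yields for the given bytes.
    static int firstChar(byte[] data) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // EF BB BF followed by the start of some SQL text (sample data).
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'S', 'E', 'L'};
        // The decoder passes the BOM through as U+FEFF instead of dropping it.
        System.out.printf("U+%04X%n", firstChar(withBom));
    }
}
```

The U+FEFF char then ends up in front of the first SQL statement, which is
presumably what the database rejects.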

To solve my problem I made a little (and not good) hack in SQLExec.java, in the
runTransaction method. I replaced these lines:

---
Reader reader =
                    (encoding == null) ? new FileReader(tSrcFile)
                                       : new InputStreamReader(
                                             new FileInputStream(tSrcFile),
                                             encoding);
---

with this block:

---
                // Check the raw bytes; a FileReader would first decode them
                // with the platform default encoding.
                InputStream in = new FileInputStream(tSrcFile);
                boolean hasBom;
                try
                {
                    hasBom = in.read() == 0xEF && in.read() == 0xBB
                        && in.read() == 0xBF;
                }
                finally
                {
                    in.close();
                }

                Reader reader;
                if (hasBom)
                {
                    //Has to be UTF8
                    reader = new InputStreamReader(
                                 new FileInputStream(tSrcFile), "utf-8");
                    //Read the BOM char;
                    reader.read();
                }
                else
                {
                    reader =
                    (encoding == null) ? new FileReader(tSrcFile)
                                       : new InputStreamReader(
                                             new FileInputStream(tSrcFile),
                                             encoding);
                }
---

What does it do? It opens the file and checks whether it starts with the UTF-8 BOM
(EF BB BF). If the BOM exists, it opens an InputStreamReader with the UTF-8 charset
and reads the first char (the BOM); otherwise it does the same as before.

If the file is shorter than 3 bytes, read() returns -1, so the check just fails and
it falls through to the else branch.

Comment 2 Stefan Bodewig 2004-04-05 12:42:43 UTC
This is an issue for so many other classes in Ant that any local fix for the
sql task would be wrong.

Sun's own docs don't talk about a BOM for utf-8, so it's pretty likely they'll not
support it properly.  Even the unicode FAQ you link to says "Note that some
recipients of UTF-8 encoded data do not expect a BOM." and says it wouldn't make
any difference for the endianness of the stream.  So Notepad is allowed to do that,
but it's useless and dangerous.  Obviously Java clients are in the "do not expect
a BOM" department.

I'm not really sure what to do here.  Your patch should probably only apply if
requested or native encoding is UTF-8 and it should be farmed out into a helper
class so that other tasks can reuse it.
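
A rough sketch of what such a helper could look like follows. Everything in it is
an assumption made for illustration: the class name BomSkippingReaders, the open()
and readAll() methods, and the use of PushbackInputStream are not taken from Ant,
and the code uses modern Java syntax. It skips the BOM only when the requested
encoding (or, for a null encoding, the platform default) is UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Ant.
final class BomSkippingReaders {
    private BomSkippingReaders() {
    }

    // encoding == null means "use the platform default", as in SQLExec.
    static Reader open(InputStream in, String encoding) throws IOException {
        Charset cs = (encoding == null)
            ? Charset.defaultCharset() : Charset.forName(encoding);
        if (!StandardCharsets.UTF_8.equals(cs)) {
            // Any other encoding: leave the stream untouched.
            return new InputStreamReader(in, cs);
        }
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = readUpTo(pin, head);
        boolean bom = n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            // Not a BOM: push the bytes back so the decoder sees them.
            pin.unread(head, 0, n);
        }
        return new InputStreamReader(pin, cs);
    }

    // Reads until buf is full or the stream ends; returns the byte count.
    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int r = in.read(buf, total, buf.length - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        return total;
    }

    // Convenience for trying the helper out on in-memory data.
    static String readAll(byte[] data, String encoding) {
        try (Reader r = open(new ByteArrayInputStream(data), encoding)) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Working on the byte stream rather than on a Reader means the file is opened only
once, and files shorter than three bytes need no special case.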
Comment 3 peter reilly 2004-04-05 13:03:55 UTC
There is a (long term) outstanding java bug for this:
http://developer.java.sun.com/developer/bugParade/bugs/4508058.html