Apache OpenOffice (AOO) Bugzilla – Issue 100112
UTF-8 encoded text file opens with system encoding when there is no BOM
Last modified: 2013-01-29 21:43:18 UTC
There was a bug report through the Korean community forum that a UTF-8 encoded text file was displayed garbled when opened in OO.o. I checked the file with a binary editor and found that it has no BOM, which caused OO.o to apply the system encoding (in this case, MS949) instead of UTF-8. As far as I know, a BOM is not mandatory for UTF-8 encoded text files, so OO.o should handle this case better. I understand that it is possible to force a specific encoding when the user opens a text file in OO.o. However, for ordinary users who have no idea about encodings, that is rather difficult. The possible solutions I can think of are:
1. Letting the user choose the encoding when the file is opened. This is already implemented for text files without an extension, and it would not be bad to apply the same thing to UTF-8 encoded text files without a BOM.
2. Implementing auto-detection of the encoding, as in Firefox.
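To illustrate the symptom outside OO.o, here is a minimal Python sketch. It assumes Python's "cp949" codec as a stand-in for the MS949 system encoding, and the sample string is made up for illustration; it is not the content of the attached file.

```python
import codecs

# Illustrative sample only; not the content of the attached file.
text = "한글 텍스트"
data = text.encode("utf-8")  # UTF-8 bytes, written without a BOM

# A UTF-8 BOM would be the three bytes EF BB BF at the start of the file.
has_bom = data.startswith(codecs.BOM_UTF8)

# Decoding with the correct encoding recovers the text.
as_utf8 = data.decode("utf-8")

# Decoding with the assumed system encoding yields different characters.
# errors="replace" keeps the decode from raising on invalid byte pairs.
as_system = data.decode("cp949", errors="replace")

print("has BOM:", has_bom)
print("UTF-8 decode matches:", as_utf8 == text)
print("cp949 decode matches:", as_system == text)
```

The bytes are identical in both cases; only the assumed encoding differs, which is why the file looks broken on a system whose default encoding is not UTF-8.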
Created attachment 60877 [details] A sample UTF-8 encoded text file which has no BOM
@ sba: Please have a look.
SBA: When choosing the file format "Text (encoded)" in the file open dialog, an encoding selection dialog comes up. In that dialog, choose "UTF-8" -> the file opens fine.
@sba: I already knew that it works fine when users explicitly choose the correct filter. However, in this issue I'd like to address the default behavior for text files without a BOM. The points of this issue are:
- It is not mandatory to have a BOM in a UTF-8 encoded text file.
- If there is no BOM in a UTF-8 encoded text file, OO.o applies the system encoding by default.
- When the system encoding is not UTF-8 (which is the case for most Windows XP/Vista systems), the files are decoded incorrectly.
I believe the best solution is to figure out the correct encoding from the data. However, if that is technically hard or not feasible to implement, then please consider at least a workaround. For example, the default behavior for a 'text file without extension' is to ask the user to choose the encoding. What about implementing the same thing for a 'text file without BOM'? It is certainly better than just displaying broken characters.
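One way to figure out the encoding from the data could be sketched as follows. This is only an illustration in Python, not OO.o's actual import code: the function name `guess_text_encoding` and its `system_encoding` parameter are hypothetical, and the approach assumes the whole file is available, since a truncated buffer could cut a multi-byte sequence and fail the check.

```python
import codecs


def guess_text_encoding(data: bytes, system_encoding: str) -> str:
    """Hypothetical helper: pick an encoding name for a plain text file.

    Order of checks:
      1. A UTF-8 BOM, if present, settles the question.
      2. Otherwise, bytes that decode as strict UTF-8 are treated as UTF-8.
      3. Otherwise, fall back to the system encoding (or ask the user).
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    try:
        data.decode("utf-8")  # strict mode: raises on any invalid sequence
    except UnicodeDecodeError:
        return system_encoding
    return "utf-8"


# Usage with made-up sample data and "cp949" standing in for MS949.
sample = "한글"
print(guess_text_encoding(sample.encode("utf-8"), "cp949"))
print(guess_text_encoding(codecs.BOM_UTF8 + sample.encode("utf-8"), "cp949"))
print(guess_text_encoding(sample.encode("cp949"), "cp949"))
```

Strict UTF-8 validation works reasonably well as a heuristic because text in legacy multi-byte encodings such as MS949 rarely happens to form valid UTF-8 sequences. Pure ASCII files pass the check too, which is harmless since ASCII decodes identically either way. Where the check fails, the fallback branch is the natural place to show the encoding selection dialog instead of silently using the system encoding.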