Apache OpenOffice (AOO) Bugzilla – Issue 76649
sfx2: html parser support for html with missing encoding
Last modified: 2017-05-20 10:47:34 UTC
The problem is that sc/sw and the editengine all treat html from cut and paste differently from eachother in selecting a default encoding. e.g. in konqueror go to http://gu.wikipedia.org/wiki/ and cut and paste the content from there into calc, writer and draw as html. In calc it works fine, in writer and in draw (editengine) the text is loaded as the wrong encoding Looking at calc it has some extra stuff for setting UTF-8 as the default encoding for cut and paste source. I've take the calc solution in the attached patch and moved it down to the shared core in sfx2 so that we are consistent between our three apps.
Created attachment 44636 [details] patch to fix
I think it doesn't make sense to attach a patch to a QA engineer if you want to get it integrated soon. :-) I took the liberty to take it over as the patch contains parts for sw and sfx2. Niklas, the patch also contains something for sc. Could you make a review? Thanks.
It's impractical for an outside contributer to know who actually *is* the responsible developer for any given module.
Understood. I recommend to send patches to the project lead of the selected component. This will allow to handle the patch faster.
SetEncodingByHTTPHeader isn't called from EditHTMLParser ctor if there are no http header attributes, so this doesn't work in Calc. Use of SfxObjectShell::Current is always a bit fragile, but I'll leave it to MBA to judge that.
Yes, using SfxObjectShell::Current() should be avoided where possible. Perhaps we can generate the "fake" header outside the method SfxHTMLParser::SetEncodingByHTTPHeader() where the ObjectShell() is known.
target 2.3
@cmc: can you point me to the code that should call the SFX code you added when Calc is in play? I can't see it, sorry. Maybe you forgot some code added to SfxObjectShell::GetHeaderAttributes()? This is the only code left in Calc that is called in SFX2.
Created attachment 45215 [details] I think I forgot to include this bit
i.e. have the editengine html parser ctor always call SetEncodingByHTTPHeader. i.e handle the "draw" case as in comment #1 It's somewhat confusing to have different results for different apps, but I'm easy as to "how" the differences between sd/sc/sw is resolved, so if there's difficulty/overly ugliness with the approach taken here then feel free to unset "patch".
So the latest patch should be the addition that lets the code be executed in Calc? I still don't like the SfxObjectShell::Current. In fact this Objectshell can't be the ObjectShell for loading, so the code you added to SFX2 in fact is the same as if you had added faking the header always if none is given. I think we should move the decision whether a fake header should be used to the place where the ObjectShell is known. I have attached an alternative patch. Caolan, would you like to test if it works for you? I hope that I didn't overlook something
Created attachment 45227 [details] Proposed alternative patch from mba
Oops, the patch in sw is wrong. Patch corrected.
Created attachment 45228 [details] revised patch
BTW: in case of Writer on Windows the new code wasn't used; Writer used UTF-8 anyway.
Hmm, isn't it the case that no one will free that returned SvKeyValueIterator*. before it belonged to the xValues Ref ? And what about pasting such http headless html text into draw/impress. They won't have a fakehtml header set for the "paste from clipboard" case I think. Maybe it would be better to just the ditch the whole fake header thing and reverse the logic, i.e. in the SfxHTMLParser::SfxHTMLParser ctor set the default html encoding to UTF-8 and add just SetSrcEncoding(GetExtendedCompatibilityTextEncoding(RTL_TEXTENCODING_ISO_8859_1)) calls to the sc/sw html parsers for the "not coming from clipboard" case before calling GetEncodingByHttpHeader with the real headers if any.
Deleting should be easy as it can be deleted where it was created (as before in Calc). I just forgot to add the line. :-[ For Draw/Impress: could be done the same way. I don't know what will happen if we changed the default behavior. I tend to be cautious with such changes. Perhaps this is something MIB or NN should answer.
So overall I think we should tackle the problem differently. Caolan, do you agree to reject the patch and set the issue to "DEFECT"?
sure, whatever works for you.
OK, as this problem needs to be tackled differently I reject the patch and set the type to "DEFECT". I'll keep the 2.3 target for the time being.
As my vacation starts next week :-) -> 2.x
according to release status meeting -> target 3.x
Created attachment 65271 [details] just a refresh of my original patch
thanks, will have a look
Created attachment 70611 [details] new minimal patch for 3.3
So, lets roll back a bit. Now in 3.3 calc and writer both do the right thing, while draw is still broken. How about this far more minimal patch to set the default encoding for the HTML parser to be UTF-8
I will take care of this after my vacation
set target to 3.x since not release relevant for 3.4 release.
Hi Matias ... Can this patch be committed?
I think that the patch does not make the code more broken as it is - and at least it makes it better in some regard. :-)
I committed the band aid (one liner) patch as revision 1181314. The issue still remains open for future review.
I'm adding this comment to all open issues with Issue Type == PATCH. We have 220 such issues, many of them quite old. I apologize for that. We need your help in prioritizing which patches should be integrated into our next release, Apache OpenOffice 4.0. If you have submitted a patch and think it is applicable for AOO 4.0, please respond with a comment to let us know. On the other hand, if the patch is no longer relevant, please let us know that as well. If you have any general questions or want to discuss this further, please send a note to our dev mailing list: dev@openoffice.apache.org Thanks! -Rob
Reset assigne to the default "issues@openoffice.apache.org".