Issue 76649 - sfx2: html parser support for html with missing encoding
Summary: sfx2: html parser support for html with missing encoding
Status: CONFIRMED
Alias: None
Product: General
Classification: Code
Component: code (show other issues)
Version: OOo 2.2
Hardware: All Linux, all
: P3 Trivial (vote)
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks: 90439
  Show dependency tree
 
Reported: 2007-04-24 10:53 UTC by caolanm
Modified: 2017-05-20 10:47 UTC (History)
2 users (show)

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
patch to fix (5.35 KB, patch)
2007-04-24 10:54 UTC, caolanm
no flags Details | Diff
I think I forgot to include this bit (604 bytes, patch)
2007-05-18 12:02 UTC, caolanm
no flags Details | Diff
Proposed alternative patch from mba (4.35 KB, text/plain)
2007-05-18 17:58 UTC, Mathias_Bauer
no flags Details
revised patch (4.27 KB, text/plain)
2007-05-18 18:17 UTC, Mathias_Bauer
no flags Details
just a refresh of my original patch (6.12 KB, patch)
2009-10-09 14:21 UTC, caolanm
no flags Details | Diff
new minimal patch for 3.3 (473 bytes, patch)
2010-07-14 11:19 UTC, caolanm
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this issue.
Description caolanm 2007-04-24 10:53:49 UTC
The problem is that sc/sw and the editengine all treat html from cut and paste
differently from eachother in selecting a default encoding.

e.g. in konqueror go to http://gu.wikipedia.org/wiki/ and cut and paste the
content from there into calc, writer and draw as html.

In calc it works fine,
in writer and in draw (editengine) the text is loaded as the wrong encoding

Looking at calc it has some extra stuff for setting UTF-8 as the default
encoding for cut and paste source. I've take the calc solution in the attached
patch and moved it down to the shared core in sfx2 so that we are consistent
between our three apps.
Comment 1 caolanm 2007-04-24 10:54:33 UTC
Created attachment 44636 [details]
patch to fix
Comment 2 Mathias_Bauer 2007-04-24 22:17:33 UTC
I think it doesn't make sense to attach a patch to a QA engineer if you want to
get it integrated soon. :-)

I took the liberty to take it over as the patch contains parts for sw and sfx2.
Niklas, the patch also contains something for sc. Could you make a review? Thanks.
Comment 3 caolanm 2007-04-25 07:43:11 UTC
It's impractical for an outside contributer to know who actually *is* the
responsible developer for any given module.
Comment 4 Mathias_Bauer 2007-04-25 09:38:07 UTC
Understood. I recommend to send patches to the project lead of the selected
component. This will allow to handle the patch faster.
Comment 5 niklas.nebel 2007-04-25 11:19:00 UTC
SetEncodingByHTTPHeader isn't called from EditHTMLParser ctor if there are no
http header attributes, so this doesn't work in Calc.

Use of SfxObjectShell::Current is always a bit fragile, but I'll leave it to MBA
to judge that.
Comment 6 Mathias_Bauer 2007-04-25 11:33:44 UTC
Yes, using SfxObjectShell::Current() should be avoided where possible. Perhaps
we can generate the "fake" header outside the method
SfxHTMLParser::SetEncodingByHTTPHeader() where the ObjectShell() is known.
Comment 7 Mathias_Bauer 2007-05-14 09:24:24 UTC
target 2.3
Comment 8 Mathias_Bauer 2007-05-14 20:33:10 UTC
@cmc: can you point me to the code that should call the SFX code you added when
Calc is in play? I can't see it, sorry.
Maybe you forgot some code added to SfxObjectShell::GetHeaderAttributes()? This
is the only code left in Calc that is called in SFX2.
Comment 9 caolanm 2007-05-18 12:02:12 UTC
Created attachment 45215 [details]
I think I forgot to include this bit
Comment 10 caolanm 2007-05-18 12:05:09 UTC
i.e. have the editengine html parser ctor always call SetEncodingByHTTPHeader.
i.e handle the "draw" case as in comment #1

It's somewhat confusing to have different results for different apps, but I'm
easy as to "how" the differences between sd/sc/sw is resolved, so if there's
difficulty/overly ugliness with the approach taken here then feel free to unset
"patch".
Comment 11 Mathias_Bauer 2007-05-18 17:57:22 UTC
So the latest patch should be the addition that lets the code be executed in Calc?

I still don't like the SfxObjectShell::Current. In fact this Objectshell can't
be the ObjectShell for loading, so the code you added to SFX2 in fact is the
same as if you had added faking the header always if none is given.

I think we should move the decision whether a fake header should be used to the
place where the ObjectShell is known. I have attached an alternative patch.
Caolan, would you like to test if it works for you? I hope that I didn't
overlook something
Comment 12 Mathias_Bauer 2007-05-18 17:58:10 UTC
Created attachment 45227 [details]
Proposed alternative patch from mba
Comment 13 Mathias_Bauer 2007-05-18 18:06:00 UTC
Oops, the patch in sw is wrong. Patch corrected.
Comment 14 Mathias_Bauer 2007-05-18 18:17:45 UTC
Created attachment 45228 [details]
revised patch
Comment 15 Mathias_Bauer 2007-05-18 18:20:30 UTC
BTW: in case of Writer on Windows the new code wasn't used; Writer used UTF-8
anyway.
Comment 16 caolanm 2007-05-18 18:34:38 UTC
Hmm, isn't it the case that no one will free that returned SvKeyValueIterator*.
before it belonged to the xValues Ref ? And what about pasting such http
headless html text into draw/impress. They won't have a fakehtml header set for
the "paste from clipboard" case I think.

Maybe it would be better to just the ditch the whole fake header thing and
reverse the logic, i.e. in the SfxHTMLParser::SfxHTMLParser ctor set the default
html encoding to UTF-8 and add just
SetSrcEncoding(GetExtendedCompatibilityTextEncoding(RTL_TEXTENCODING_ISO_8859_1))
calls to the sc/sw html parsers for the "not coming from clipboard" case before
calling GetEncodingByHttpHeader with the real headers if any.
Comment 17 Mathias_Bauer 2007-05-18 18:58:42 UTC
Deleting should be easy as it can be deleted where it was created (as before in
Calc). I just forgot to add the line. :-[

For Draw/Impress: could be done the same way.

I don't know what will happen if we changed the default behavior. I tend to be
cautious with such changes. 

Perhaps this is something MIB or NN should answer.
Comment 18 Mathias_Bauer 2007-06-06 11:39:14 UTC
So overall I think we should tackle the problem differently. Caolan, do you
agree to reject the patch and set the issue to "DEFECT"?
Comment 19 caolanm 2007-06-06 11:51:09 UTC
sure, whatever works for you.
Comment 20 Mathias_Bauer 2007-06-08 15:07:50 UTC
OK, as this problem needs to be tackled differently I reject the patch and set
the type to "DEFECT". I'll keep the 2.3 target for the time being.
Comment 21 Mathias_Bauer 2007-07-16 13:44:59 UTC
As my vacation starts next week :-) -> 2.x
Comment 22 Mathias_Bauer 2007-12-04 16:17:14 UTC
according to release status meeting -> target 3.x
Comment 23 caolanm 2009-10-09 14:21:29 UTC
Created attachment 65271 [details]
just a refresh of my original patch
Comment 24 Mathias_Bauer 2009-11-11 14:20:52 UTC
thanks, will have a look
Comment 25 caolanm 2010-07-14 11:19:48 UTC
Created attachment 70611 [details]
new minimal patch for 3.3
Comment 26 caolanm 2010-07-14 11:21:02 UTC
So, lets roll back a bit. Now in 3.3 calc and writer both do the right thing,
while draw is still broken.

How about this far more minimal patch to set the default encoding for the HTML
parser to be UTF-8
Comment 27 Mathias_Bauer 2010-07-28 18:23:45 UTC
I will take care of this after my vacation
Comment 28 Martin Hollmichel 2011-03-15 21:00:04 UTC
set target to 3.x since not release relevant for 3.4 release.
Comment 29 Pedro Giffuni 2011-10-10 20:26:11 UTC
Hi Matias ...

Can this patch be committed?
Comment 30 Mathias_Bauer 2011-10-10 21:20:33 UTC
I think that the patch does not make the code more broken as it is - and at least it makes it better in some regard. :-)
Comment 31 Pedro Giffuni 2011-10-29 19:30:01 UTC
I committed the band aid (one liner) patch as revision 1181314.

The issue still remains open for future review.
Comment 32 Rob Weir 2013-03-11 14:58:44 UTC
I'm adding this comment to all open issues with Issue Type == PATCH.  We have 220 such issues, many of them quite old.  I apologize for that.  

We need your help in prioritizing which patches should be integrated into our next release, Apache OpenOffice 4.0.

If you have submitted a patch and think it is applicable for AOO 4.0, please respond with a comment to let us know.

On the other hand, if the patch is no longer relevant, please let us know that as well.

If you have any general questions or want to discuss this further, please send a note to our dev mailing list:  dev@openoffice.apache.org

Thanks!

-Rob
Comment 33 Marcus 2017-05-20 10:47:34 UTC
Reset assigne to the default "issues@openoffice.apache.org".