Bug 64443

Summary: POSTing form data through proxy_html with different frontend / backend charsets
Product: Apache httpd-2
Reporter: Antonio Suárez <a.suarez>
Component: mod_proxy_html
Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED WORKSFORME
Severity: normal
Keywords: FixedInTrunk
Priority: P2
Version: 2.4.43
Target Milestone: ---
Hardware: PC
OS: Linux

Description Antonio Suárez 2020-05-15 07:09:59 UTC
By design and by default, proxy_html translates HTML content into UTF-8 regardless of the backend charset. This is fine, since UTF-8 is widely supported by browsers as far as I know.

In that scenario, the browser will encode POSTed form data in UTF-8, but that may not match the backend charset when proxy_html re-submits the form content upstream. E.g.:

GET:  Client <--(UTF-8)--- proxy_html <--(ISO-8859-1)--- Backend
POST: Client ---(UTF-8)--> proxy_html ---(UTF-8)-------> Backend
                                     (encoding mismatch!)
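
The mismatch can be reproduced outside Apache with a minimal Python sketch. It assumes the form is posted as application/x-www-form-urlencoded and that the backend decodes the bytes as ISO-8859-1; the field name `name` and the sample value are made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

# Browser side: the page arrived as UTF-8, so the form data is encoded as UTF-8.
body = urlencode({"name": "Suárez"}, encoding="utf-8")

# Backend side (assumption): the application still expects ISO-8859-1,
# so each byte of the two-byte UTF-8 sequence becomes its own character.
decoded_wrong = parse_qs(body, encoding="iso-8859-1")["name"][0]

# For comparison: decoding with the charset the browser actually used.
decoded_right = parse_qs(body, encoding="utf-8")["name"][0]

print(body)
print(decoded_wrong)
print(decoded_right)
```

The percent-encoded body is passed through unchanged, so the garbling only shows up once the backend decodes it with the wrong charset.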

A simple workaround is to specify the backend charset by adding an accept-charset attribute to the HTML <form> tags. That attribute isn't usually needed, as the form encoding usually matches that of the HTML document, so I guess it's rarely used. When moving a site from direct to proxied publishing, that means the whole site would need to be checked and recoded to add the accept-charset attribute to every <form>.
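
The effect of that workaround can be illustrated with a short Python sketch. It assumes a form declared with accept-charset="ISO-8859-1", an application/x-www-form-urlencoded body, and a backend that decodes as ISO-8859-1; the field name and value are made up.

```python
from urllib.parse import urlencode, parse_qs

# Assumption: the form carries accept-charset="ISO-8859-1", so the browser
# encodes the submitted data in that charset even though the page is UTF-8.
body_latin1 = urlencode({"name": "Suárez"}, encoding="iso-8859-1")

# The backend decodes with the same charset, so the value round-trips intact.
received = parse_qs(body_latin1, encoding="iso-8859-1")["name"][0]

print(body_latin1)
print(received)
```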

As proxy_html deals automatically with different frontend / backend charsets in downstream content, maybe it would be expected to do the same with upstream POSTed form data. Maybe a "stateful" approach to it (i.e. proxy_html keeping track of every form translated downstream that should be reverse-translated when posted upstream) isn't convenient or even feasible. In my very humble opinion (with no knowledge of its internals), a simpler solution could be to have proxy_html add that accept-charset attribute automatically when translating HTML forms. As per the docs, proxy_html's mission is just to "rewrite HTML links in a proxy situation", but maybe it could be more widely scoped to make HTML content coherent across an Apache HTTP proxy.

Thank you in advance. Best regards,

Antonio
Comment 1 Nick Kew 2020-06-06 22:58:49 UTC
Committed a fix to trunk in r1878553.

This is the patch I posted and you tested on-list, fleshed out to test whether the attribute is really necessary rather than insert it willy-nilly:

(a) if the input is utf-8, then we can't have broken anything, so don't fix it.
(b) if ProxyHTMLCharsetOut is set, assume the sysop is in charge, and don't fix anything.
(c) if the backend set its own accept-charset attribute, don't mess with it!
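
Read as a decision procedure, the three conditions above could be sketched as follows. This is only an illustration in Python, not the code committed in r1878553 (mod_proxy_html itself is written in C); the function name and its parameters are hypothetical.

```python
def should_add_accept_charset(input_charset, charset_out_configured, form_attrs):
    """Decide whether a <form> should get an accept-charset attribute.

    input_charset: charset of the backend response, e.g. "ISO-8859-1"
    charset_out_configured: True if ProxyHTMLCharsetOut is set
    form_attrs: attribute names (or a dict of attributes) found on the <form>
    """
    # (a) UTF-8 input: nothing can have been broken, so leave the form alone.
    if input_charset.lower().replace("_", "-") in ("utf-8", "utf8"):
        return False
    # (b) ProxyHTMLCharsetOut is set: assume the sysop is in charge.
    if charset_out_configured:
        return False
    # (c) The backend already set accept-charset: do not override it.
    if any(name.lower() == "accept-charset" for name in form_attrs):
        return False
    return True


print(should_add_accept_charset("ISO-8859-1", False, {"action": "/submit"}))
print(should_add_accept_charset("ISO-8859-1", False, {"accept-charset": "ISO-8859-1"}))
```

Only when none of the three conditions applies would the attribute be inserted, which keeps the rewrite from touching forms that are already correct.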
Comment 2 Antonio Suárez 2020-06-08 10:14:24 UTC
Seems well thought out :)

Also: checked out from trunk and works fine.

(only tested the <form> handling part under conditions (b) and (c) so far; willing to test it more widely as soon as I am able)

Thanks for the great job!