Bug 64443 - POSTing form data through proxy_html with different frontend / backend charsets
Summary: POSTing form data through proxy_html with different frontend / backend charsets
Status: RESOLVED WORKSFORME
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_proxy_html
Version: 2.4.43
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FixedInTrunk
Depends on:
Blocks:
 
Reported: 2020-05-15 07:09 UTC by Antonio Suárez
Modified: 2020-06-08 10:14 UTC (History)

Description Antonio Suárez 2020-05-15 07:09:59 UTC
Per design and by default, proxy_html will translate HTML content into UTF-8 regardless of the backend charset. This is fine, since UTF-8 has wide browser support as far as I know.

In that scenario, the browser will encode POSTed form data in UTF-8, but that may not match the backend charset when proxy_html re-submits the form content upstream. E.g.:

GET:  Client <--(UTF-8)--- proxy_html <--(ISO-8859-1)--- Backend
POST: Client ---(UTF-8)--> proxy_html ---(UTF-8)-------> Backend
                                     (encoding mismatch!)
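
To make the mismatch concrete, here is a minimal Python sketch of what a backend expecting ISO-8859-1 would see when it receives form bytes that the browser encoded as UTF-8. It is only an illustration: the sample value is an assumption and nothing here comes from proxy_html itself.

```python
# Hypothetical value typed into a form field by the user.
value = "Suárez"

# The browser saw a UTF-8 page (as rewritten by proxy_html), so it posts UTF-8 bytes.
posted_bytes = value.encode("utf-8")

# The backend still assumes its own charset when decoding the request body.
seen_by_backend = posted_bytes.decode("iso-8859-1")

print(seen_by_backend)  # SuÃ¡rez
```

The two bytes that UTF-8 uses for "á" are read by the backend as two separate ISO-8859-1 characters, which is the corruption described above.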

A simple workaround is to specify the backend charset by adding an accept-charset attribute to HTML <form> tags. That attribute isn't usually needed, as form encoding usually matches that of the HTML document, so I guess it's rarely used. When moving a site from direct to proxied publishing, that means the whole site would need to be checked and recoded to add the accept-charset attribute to every <form>.
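
As a rough illustration of why accept-charset helps, the Python sketch below compares the URL-encoded body a browser would send for the same field under each charset. The field name and value are made up, and urlencode only stands in for the browser's form serialization.

```python
from urllib.parse import urlencode

form = {"name": "Suárez"}  # hypothetical form field and value

# No accept-charset: the browser falls back to the page charset (UTF-8 behind proxy_html).
body_utf8 = urlencode(form, encoding="utf-8")

# accept-charset="ISO-8859-1" on the <form>: the browser encodes for the backend charset.
body_latin1 = urlencode(form, encoding="iso-8859-1")

print(body_utf8)    # name=Su%C3%A1rez
print(body_latin1)  # name=Su%E1rez
```

Only the second body decodes correctly on a backend that expects ISO-8859-1.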

As proxy_html deals automatically with different frontend / backend charsets in downstream content, it might be expected to do the same with upstream POSTed form data. A "stateful" approach (i.e. proxy_html keeping track of every form translated downstream that should be reverse-translated when posted upstream) may not be convenient or even feasible. In my very humble opinion (with no knowledge of the internals), a simpler solution could be to have proxy_html add the accept-charset attribute automatically when translating HTML forms. As per the docs, proxy_html's mission is just to "rewrite HTML links in a proxy situation", but maybe it could be more widely scoped to make HTML content coherent across an Apache HTTP proxy.

Thank you in advance. Best regards,

Antonio
Comment 1 Nick Kew 2020-06-06 22:58:49 UTC
Committed a fix to trunk in r1878553.

This is the patch I posted and you tested on-list, fleshed out to test whether the attribute is really necessary rather than insert it willy-nilly:

(a) if the input is utf-8, then we can't have broken anything, so don't fix it.
(b) if ProxyHTMLCharsetOut is set, assume the sysop is in charge, and don't fix anything.
(c) if the backend set its own accept-charset attribute, don't mess with it!
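
The three conditions above can be read as a single yes/no decision per <form>. The Python sketch below restates them for clarity only. It is not the code committed in r1878553 (mod_proxy_html is written in C), and the function and parameter names are hypothetical.

```python
from typing import Optional


def should_add_accept_charset(input_charset: str,
                              charset_out_configured: bool,
                              form_accept_charset: Optional[str]) -> bool:
    """Return True if an accept-charset attribute should be inserted into a <form>."""
    # (a) UTF-8 input: nothing was transcoded, so nothing can have been broken.
    if input_charset.strip().lower() in ("utf-8", "utf8"):
        return False
    # (b) ProxyHTMLCharsetOut is set: assume the administrator is in charge.
    if charset_out_configured:
        return False
    # (c) The backend already set accept-charset on this form: leave it alone.
    if form_accept_charset is not None:
        return False
    # Otherwise the page was transcoded to UTF-8 and the form needs the hint.
    return True


# An ISO-8859-1 backend, no ProxyHTMLCharsetOut, and a form without the attribute.
print(should_add_accept_charset("ISO-8859-1", False, None))  # True
```

When the function returns True, the value to insert would presumably be the backend's input charset, so that the browser encodes the POST in the charset the backend expects.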
Comment 2 Antonio Suárez 2020-06-08 10:14:24 UTC
Seems well thought out :)

Also: checked out from trunk and works fine.

(Only tested the <form> handling part under conditions (b) and (c) so far; willing to test it more widely as soon as I'm able.)

Thanks for the great job!