Bug 62371 - Improve logging in AbstractProcessor.parseHost()
Summary: Improve logging in AbstractProcessor.parseHost()
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 9.0.7
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
: 62383 62399 62437 (view as bug list)
Depends on:
Blocks:
 
Reported: 2018-05-11 08:38 UTC by Alex
Modified: 2018-06-12 09:50 UTC (History)
4 users (show)



Attachments
Fixes hyphen validation (1.75 KB, patch)
2018-05-11 22:50 UTC, Robert Rettig
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex 2018-05-11 08:38:09 UTC
It now sends 400 code without any traces in logs of what is the source of the problem. In my case it was quite hard to trace the issue. I had:

java.lang.IllegalArgumentException: The character [_] is never valid in a domain name.
        at org.apache.tomcat.util.http.parser.HttpParser$DomainParseState.next(HttpParser.java:781)
        at org.apache.tomcat.util.http.parser.HttpParser.readHostDomainName(HttpParser.java:673)
        at org.apache.tomcat.util.http.parser.Host.parse(Host.java:66)
        at org.apache.tomcat.util.http.parser.Host.parse(Host.java:40)
        at org.apache.coyote.AbstractProcessor.parseHost(AbstractProcessor.java:269)
        at org.apache.coyote.http11.Http11Processor.prepareRequest(Http11Processor.java:760)
        at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:383)
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:754)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1376)
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.base/java.lang.Thread.run(Thread.java:844)


I spend lots of time trying to debug the issue I had.


If you need more info on config - I have:

Apache 2.4 -> ProxyPass ajp -> tomcat (java 10)
           -> ProxyPass websocket (http11) -> tomcat (same)

All this runs as docker containers so in Apache config I connect to tomcat using name "default_tomcat_1" - given by docker (compose). As a result after upgrading from tomcat 7.0.56 to 9.0.7 I have all websocket requests just failing with code 400. After debugging tomcat I understand that websocket requests go via Http11Processor and fail cause it receives this "default_tomcat_1" as host name from Apache. At the same time other requests routed via AJP are working fine cause they've got Apache public IP as "host".

    On apache config:

    ProxyPass /websocket ws://default_tomcat_1:8080/websocket retry=0
    ProxyPassReverse /websocket ws://default_tomcat_1:8080/websocket

    ProxyPass / ajp://default_tomcat_1:8009/ retry=0
    ProxyPassReverse / ajp://default_tomcat_1:8009/
Comment 1 Luko 2018-05-11 12:28:21 UTC
I have the same issue.
In my opinion the issue is in Tomcat host validation.
My application DNS alias looks like this : myapp-t.my-dommain
where -t is env (test)
my-domain is the domain name (yes, with minus sign (-))

When I change the header request to name:myappt.mydomain everything is OK
When header request host name:myapp-t.my-dommain I get the HTTP 400 bad request.
Comment 2 Christopher Schultz 2018-05-11 15:06:11 UTC
Hyphens (-) are allowed to be allowed in hostnames, but not in TLDs[1]

I wonder if this is too restrictive for Tomcat, and whether or not it would apply (unfairly) to punycode hostnames. My sense is that these hostname restrictions should apply AFTER any punycode transformation takes place, but this parser appears to (a) perform no punycode transformation and therefore (b) would fail to handle any non-US-ASCII domain names.

[1] https://tools.ietf.org/html/draft-liman-tld-names-06#section-1
Comment 3 Mark Thomas 2018-05-11 16:13:23 UTC
The data on the wire will be after the punnycode encoding so the validation performed by this parser should be correct (Tomcat allows '-' in every element apart from the gTLD).

To get to the original report, logging the exception at debug is probably the way to go unless we want to use the UserDataHelper.
Comment 4 Robert Rettig 2018-05-11 22:50:24 UTC
Created attachment 35931 [details]
Fixes hyphen validation

This problem occurs in typical docker deployments, especially docker swarm deployments where service names contain hyphens.
In Docker Swarm, the service names can be specified with the prefix "tasks.". to get concrete container addresses through the embedded DNS instead of the virtual address of the service.
In such environments you can no longer use Tomcat in the embedded version with Spring Boot.
Comment 5 Robert Rettig 2018-05-12 00:36:16 UTC
This effects version 8.5.31 too, which has much bigger impact to other projects!

see: http://svn.apache.org/viewvc/tomcat/tc8.5.x/tags/TOMCAT_8_5_31/java/org/apache/tomcat/util/http/parser/HttpParser.java?r1=1830182&r2=1830188

This validation is a new feature introduced in a minor version change!

Please check if such changes really correspond to your project policies.
Comment 6 Mark Thomas 2018-05-12 16:15:57 UTC
(In reply to Robert Rettig from comment #4)
> Created attachment 35931 [details]
> Fixes hyphen validation

This patch is not consistent with the RFCs for host / domain names. I'm currently -1 on applying it for that reason.
Comment 7 Mark Thomas 2018-05-12 16:19:00 UTC
(In reply to Mark Thomas from comment #6)
> (In reply to Robert Rettig from comment #4)
> > Created attachment 35931 [details]
> > Fixes hyphen validation
> 
> This patch is not consistent with the RFCs for host / domain names. I'm
> currently -1 on applying it for that reason.

While Tomcat doesn't have a formal policy, the general expectation is that clients confirm to the relevant RFCs. Therefore, Tomcat does, from time to time, tighten up the validation of input data (In reply to Robert Rettig from comment #5)
> This effects version 8.5.31 too, which has much bigger impact to other
> projects!
> 
> see:
> http://svn.apache.org/viewvc/tomcat/tc8.5.x/tags/TOMCAT_8_5_31/java/org/
> apache/tomcat/util/http/parser/HttpParser.java?r1=1830182&r2=1830188
> 
> This validation is a new feature introduced in a minor version change!
> 
> Please check if such changes really correspond to your project policies.

While Tomcat doesn't have a formal policy, the general expectation is that clients confirm to the relevant RFCs. Therefore, Tomcat does, from time to time, tighten up the validation of input data when gaps in validation are identified.
Comment 8 Mark Thomas 2018-05-12 16:45:47 UTC
My limited understanding after reading the Docket documentation is that tasks.<service-name> is used (via DNS) to get a list of all of the tasks backing the service.

Why would there be a HTTP request to "tasks.<service-name>" rather than
"<service-name>"?
Comment 9 Robert Rettig 2018-05-12 17:57:35 UTC
(In reply to Mark Thomas from comment #8)
> My limited understanding after reading the Docket documentation is that
> tasks.<service-name> is used (via DNS) to get a list of all of the tasks
> backing the service.
> 
> Why would there be a HTTP request to "tasks.<service-name>" rather than
> "<service-name>"?

(In reply to Mark Thomas from comment #6)
> (In reply to Robert Rettig from comment #4)
> > Created attachment 35931 [details]
> > Fixes hyphen validation
> 
> This patch is not consistent with the RFCs for host / domain names. I'm
> currently -1 on applying it for that reason.

I totally agree not to applying that patch. I tried just to show what would work for the hyphen problem as it will be the one which will fail now in many environments. I just checked docker swarm, or even openshift deployments.
in openshift there is really a hostname declared as  maybe 'data-service.tenant1-apps.svc' that worked before tomcat 8.5.31 
This is just relevant because many microservices are deploped with spring boot and the current relase pulls in tomcat 8.5.31 which stops working wihtout reporting clear errors about the failures:
https://mvnrepository.com/artifact/org.springframework.boot/spring-boot-starter-web/2.0.2.RELEASE
Comment 10 Robert Rettig 2018-05-12 18:03:25 UTC
(In reply to Mark Thomas from comment #8)
> My limited understanding after reading the Docket documentation is that
> tasks.<service-name> is used (via DNS) to get a list of all of the tasks
> backing the service.
> 
> Why would there be a HTTP request to "tasks.<service-name>" rather than
> "<service-name>"?

For example in a docker swarm environment the service-name will resolve to a virtual-ip (loadbalanced). Some applications have the requirements to get the address of a dedicated service task. Therefore the "hostname"(which is misleading) would be 'tasks.service-name'. It will resolve to a list of dedicated ip. in most cases the first will be used and maybe cached for further requests.
Comment 11 Mark Thomas 2018-05-12 18:20:40 UTC
'data-service.tenant1-apps.svc' is a valid domain name so that should be OK.

I don't know enough about docker to know if using 'tasks.service-name' in that way is a valid usage or not.
Comment 12 Robert Rettig 2018-05-13 02:06:30 UTC
(In reply to Mark Thomas from comment #11)
> 'data-service.tenant1-apps.svc' is a valid domain name so that should be OK.
> 
> I don't know enough about docker to know if using 'tasks.service-name' in
> that way is a valid usage or not.

I think you are right. The docker feature for resolving tasks with dns should just be used in special cases like discovery https://github.com/docker/docker.github.io/pull/6420/files 

But even in that case is there are possibility to disable the new host domain validation in tomcat (environment, java runtime system property)?

Here are some additional observations within docker swarm from a debian container with 'getent' from libc-bin package and 'host' binary from bind9-host package.

:::RESOLVE (A)
::libc-bin::
root@a9348509c12f:/# /usr/bin/getent hosts data-service
10.0.0.21       data-service
root@a9348509c12f:/# /usr/bin/getent hosts tasks.data-service
10.0.0.13       tasks.data-service

::bind9-host::
root@a9348509c12f:/# /usr/bin/host data-service
data-service has address 10.0.0.21
Host data-service not found: 3(NXDOMAIN)
root@a9348509c12f:/# /usr/bin/host tasks.data-service
tasks.data-service has address 10.0.0.13
Host tasks.data-service not found: 3(NXDOMAIN)

:::REVERSE (PTR)
::libc-bin::
root@a9348509c12f:/# /usr/bin/getent hosts 10.0.0.21           
root@a9348509c12f:/# /usr/bin/getent hosts 10.0.0.13
10.0.0.13       data-test_data-service.1.j37493y41n0t9eehio3dze3vl.data-test_backend

::bind9-host::
root@a9348509c12f:/# /usr/bin/host -t PTR 10.0.0.21
Host 21.0.0.10.in-addr.arpa. not found: 3(NXDOMAIN)
root@a9348509c12f:/# /usr/bin/host -t PTR 10.0.0.13
13.0.0.10.in-addr.arpa domain name pointer data-test_data-service.1.j37493y41n0t9eehio3dze3vl.data-test_backend.



:::RESOLVE (A) the name from reverse
::libc-bin::
root@a9348509c12f:/# /usr/bin/getent hosts data-test_data-service.1.j37493y41n0t9eehio3dze3vl.data-test_backend
10.0.0.13       data-test_data-service.1.j37493y41n0t9eehio3dze3vl.data-test_backend
::bind9-host::
root@a9348509c12f:/# /usr/bin/host data-test_data-service.1.j37493y41n0t9eehio3dze3vl.data-test_backend
data-test_data-service.1.j37493y41n0t9eehio3dze3vl.data-test_backend has address 10.0.0.13
Host data-test_data-service.1.j37493y41n0t9eehio3dze3vl.data-test_backend not found: 3(NXDOMAIN)
Comment 13 Alex 2018-05-14 07:05:03 UTC
> While Tomcat doesn't have a formal policy, the general expectation is that 
> clients confirm to the relevant RFCs. Therefore, Tomcat does, from time to 
> time, tighten up the validation of input data when gaps in validation are 
> identified.

Hi, 

to me this "from time to time", no review of potentially affected users, no logging and no way to switch off this added strictness looks very questionable from the user perspective. Maybe the flag for strict validation will be also good here, not just improved logging.
Comment 14 Mark Thomas 2018-05-14 09:58:30 UTC
Generally, the tightening up of validation like this stems from a security vulnerability report where mal-formed input results in unintended consequences. Usually information disclosure of some form. In this case, the changes can be traced back to CVE-2016-6816. That vulnerability report identified some gaps in our validation of the request line. When we receive such a report, we don't just fix the one issue identified in the report, we look more widely. The reason we look more widely is that if one gap in validation can lead to a security vulnerability then other gaps may do the same. Even if we can't see how a validation gap could be exploited, we still fix it as we assume that an attacker may find something we haven't. When we reviewed the request line validation after CVE-2016-6816, we identified various gaps in the request line validation and have been working on tightening them up over time. Host name validation was one of these issues.

We recognise that clients do not always conform to the specifications. While our default position is to implement the specs and that bugs in clients should be fixed, we do recognise that this can take time.

The new host validation has been in 9.0.x since 9.0.2 (2017-11-30) where it logged failures but took no other action. After fixing some edge cases reported by users it was switched to rejecting invalid hosts in 9.0.5 (2018-02-11) and we received no reports of problems as a result of enabling the validation.

The changes to request line validation have been causing other problems (again due to specification non-compliant clients). See bug 62273 for the latest information on this aspect. It was largely as a result of these issues that we introduced the host validation in logging only mode first and only enabled it once we thought all the issues had been ironed out. 

As a result of bug 62273, we wanted to back-port that enhancement to all versions. The host validation was wrapped up in those changes and it was difficult to untangle it. Since it had been running in 9.0.x without issue and that it should not be possible to register an invalid host/domain name it was felt that back-porting all validation changes - including the host validation - would be safe.

It appears that some uses of Docker are FQDN being passed to to Tomcat that include a '-' in the final segment.

Tomcat does not permit a '-' character to appear in the final segment of a FQDN. This appears to be based on RFC 920 and/or https://tools.ietf.org/html/draft-liman-tld-names-06#section-1

Ignoring the original report which requested better logging of these failures (fixing that is in hand and should happen later today) the key question at this point is whether or not '-' is valid in the final segment of a FQDN.

RFC 952 does allow '-' in the final segment. RFC 1123 does not change this. Therefore it is is both possible and valid that '-' could appear in the final segment of a intranet FQDN.

RFC 920 and https://tools.ietf.org/html/draft-liman-tld-names-06#section-1 are also rather dated. The introduction of IDNA means that '-' can appear in the final segment of an internet FQDN.

In light of the above, I am going to change Tomcat's host name validation to allow '-' in the final segment. This change will be made at (roughly) the same time as the additional logging.

Ideally, this issue would have been caught in one of the releases since 9.0.2. Unfortunately it wasn't. Given the circumstances, back-porting the bug 62273 enhancement looked to be sufficiently low risk. This issue highlights that Tomcat can always use more real-world testing and I would encourage folks to download the release candidates as the votes are announced and test them in their environments. The more folks that do this, the more issues like this we will catch and the sooner we will catch them.
Comment 15 Mark Thomas 2018-05-14 10:31:55 UTC
Ah. Found the reference for the final segment being alphabetic:

From RFC 1123
<quote>
However, a valid host name can never have the dotted-decimal form #.#.#.#, since at least the highest-level component label will be alphabetic.
</quote>

There is some interesting discussion of this in the errata.

Where things get 'interesting' is whether the final segment can be purely numeric or not. Per RFC 952 and RFC 1123 they can. There are currently no such gTLDs registered with ICANN. However, they could still be present on an intranet. Therefore, I am leaning towards accepting them. That means 0.0.0.256 would be treated as a valid FQDN rather than as an invalid IPv4 address. Whether any client would let a user specify such a string is a different question.
Comment 16 Alex 2018-05-14 11:25:26 UTC
> This issue highlights that Tomcat can always use more real-world testing and I would encourage folks to download the release candidates as the votes are announced and test them in their environments. The more folks that do this, the more issues like this we will catch and the sooner we will catch them.

Maybe adding workaround flag in one version, changing the default behaviour and then dropping flag some versions later may be better in terms of real-world testing then logging and testing RC's as an approach for such a serious things?
Comment 17 Mark Thomas 2018-05-14 14:12:03 UTC
Improved logging fixed in:
- trunk for 9.0.9 onwards
- 8.5.x for 8.5.32 onwards
- 8.0.x for 8.0.53 onwards
- 7.0.x for 7.0.89 onwards
Comment 18 Christopher Schultz 2018-05-17 01:46:27 UTC
(In reply to Alex from comment #16)
> > This issue highlights that Tomcat can always use more real-world testing and I would encourage folks to download the release candidates as the votes are announced and test them in their environments. The more folks that do this, the more issues like this we will catch and the sooner we will catch them.
> 
> Maybe adding workaround flag in one version, changing the default behaviour
> and then dropping flag some versions later may be better in terms of
> real-world testing then logging and testing RC's as an approach for such a
> serious things?

You are presuming that there were no 9.0.x releases (beta!) which included this change with no comments for months. In fact, it was included in 9.0.2 with logging, then completed in 9.0.5 as Mark details in comment #14. I think this qualifies as a reasonably-slow roll-out. There is no reason to wait many years to change things... the alternative is an internet where it takes 20 years to widely-deploy new encryption capabilities (TLS) and effectively NEVER to properly-implement some IETF specifications (e.g. cookies). Sometimes you have to just have to remove the headphone jack.

You took the big step of a 4-major-release-version jump and seem incensed that things aren't working exactly as they had worked before. This is the purpose of testing. In this case, you found a problem, engaged the community, and got a fix. Instead of complaining bitterly, how about a "thanks for the 5-day turnaround on a blocking issue I'm having"? If you wanted zero changes, you should have stayed on the version where you were.

If you'd like to debate Tomcat's development methodologies, release cycles, or test-coverage, you are welcome to join the dev mailing list.
Comment 19 Alex 2018-05-17 07:25:21 UTC
> If you'd like to debate Tomcat's development methodologies, release cycles, or test-coverage, you are welcome to join the dev mailing list.

I don't know if this reply should go there but:

(In reply to Christopher Schultz from comment #18)
> (In reply to Alex from comment #16)
> You are presuming that there were no 9.0.x releases (beta!) which included
> this change with no comments for months. In fact, it was included in 9.0.2
> with logging, then completed in 9.0.5 as Mark details in comment #14. I
> think this qualifies as a reasonably-slow roll-out. There is no reason to
> wait many years to change things... the alternative is an internet where it
> takes 20 years to widely-deploy new encryption capabilities (TLS) and
> effectively NEVER to properly-implement some IETF specifications (e.g.
> cookies). Sometimes you have to just have to remove the headphone jack.

I don't see the reason for catching the error and removing logging. This is not about timeframe or moving progress.
The second point - I would prefer to have workaround. For the workaround the timeframe is important and two months between releases seems to be really quick cause upgrade cycle is not so fast usually, I guess. I think there should be more than a year before removing the workaround if it was provided.

> You took the big step of a 4-major-release-version jump and seem incensed that things aren't working exactly as they had worked before. This is the purpose of testing.

Agree, but I'm talking about logging here.

> Instead of complaining bitterly, how about a
> "thanks for the 5-day turnaround on a blocking issue I'm having"? If you
> wanted zero changes, you should have stayed on the version where you were.

Thanks for the 5-day turnaround! Really fast! For some libraries I know it would take a year... and more than 20 years for the internet! )
Comment 20 Mark Thomas 2018-05-18 06:05:47 UTC
*** Bug 62383 has been marked as a duplicate of this bug. ***
Comment 21 Mark Thomas 2018-05-18 09:50:10 UTC
*** Bug 62383 has been marked as a duplicate of this bug. ***
Comment 22 Mark Thomas 2018-05-18 20:32:03 UTC
*** Bug 62383 has been marked as a duplicate of this bug. ***
Comment 23 Mark Thomas 2018-05-23 06:18:58 UTC
*** Bug 62399 has been marked as a duplicate of this bug. ***
Comment 24 ZhiFeng Hu 2018-05-23 09:10:51 UTC
How to remove the validation for host name?
I want to use any string as the host name .

Would you please let us choice ?
Comment 25 Mark Thomas 2018-05-23 09:25:56 UTC
The Host validation is not optional. It is a specification requirement.

The changes discussed in comment #14 and comment #15 (using the same rules for the final segment as the other segments) have been made in the versions listed in comment #17.
Comment 26 ZhiFeng Hu 2018-05-23 09:31:01 UTC
Though it was a specification. why not gave us an setting or configuration to disable the check ? 

Gave us a switch please. or we can not upgrade our projects to latest tomcat. or we should have to switch to other (Jetty, undertow)
Comment 27 Mark Thomas 2018-05-23 09:52:49 UTC
Simply wait (until early next month) for next release round and upgrade then.
Comment 28 Tim Levett 2018-05-29 19:10:27 UTC
I'd like to provide some clarification for the docker swarm users that are experiencing this issue. We are using docker stacks in docker swarm that are deploying spring-boot applications with embedded tomcat. The challenge for us was the underscore that is supplied by docker that separates the stack name and the service name in the fully qualified DNS name. The good news is docker swarm registers many DNS names inside the docker networking. Including stack_service, service, and tasks.stack_service. If you have unique enough service names you may be able to get away with just service name DNS resolution and still use the load balancer that is shipped with docker swarm.

For example using my-stack_my-service. We noticed my-stack_my-service:8080/actuator/health would return a 400, but my-service:8080/actuator/health worked as expected. 

As mentioned above, you shouldn't use tasks.stack_service or stack_service for http, just for auto discovery.

Hope this helps.
Comment 29 Mark Thomas 2018-06-07 08:45:11 UTC
*** Bug 62437 has been marked as a duplicate of this bug. ***
Comment 30 Robert Gacki 2018-06-12 09:50:49 UTC
Docker Swarm is not the only environment that may experience a regression. I have a client that has an "acme-xy" TLD for the internal network. We upgraded our Spring Boot applications to 2.0.2, which ships with Tomcat 5.8.31 and we were bitten by the same issue.

Whether this is RFC compliant or not, I rather like to see such changes to be toggled and / or introduced gradually (like a warning first).