This is an Ubuntu server running PHP 7 on Apache, with a website which enforces TLS (using the standard port). To my understanding,
https://example.com and https://example.com:443 are exactly equivalent (and, indeed, in my browser the port number disappears from the address bar when I type it in). And yet
HTTP_HOST usually contains just the domain name, but sometimes also contains the port number. This could be down to bot visitors (I haven’t analysed the logs), but even so I don’t see how it would happen. Is there any actual difference?
(This is causing some problems, as some of our logs, work queues and server-side caches are keyed by
HTTP_HOST, so having the same site show up under different host values is confusing.)
The PHP documentation states that
Contents of the Host: header from the current request, if there is one.
Indeed, every variable in this associative array whose key begins with the string
HTTP_ is a copy of the corresponding HTTP request variable sent by the user agent.
So, why does it sometimes contain the hostname, and sometimes contain both the hostname and port number?
It turns out that both syntaxes are legal and equivalent. The port number is required if the server uses a non-default port, but is optional otherwise.
In what circumstances would a user agent send the port number even when it is the default?
RFC 7230 section 5.4 explains that the Host: header’s value is an exact copy of the authority component of the URI.
If the target URI includes an authority component, then a client MUST send a field-value for Host that is identical to that authority component, excluding any userinfo subcomponent and its “@” delimiter . . .
What is the authority component?
This comes from the definition of a URI in RFC 3986 section 3.2, which explains that the authority consists of optional user information (username and password), the host, and an optional port. It explains that the port SHOULD be omitted if it is the default port for the scheme, but SHOULD does not equal MUST. (See RFC 2119.)
So, to put this all together, a user agent is expected to send the port number in the Host: header if it also appears in the URI. Thus, if the user agent has the URL
https://example.com:443/robots.txt then it will send the header
Host: example.com:443. There’s no real way to tell how the user agent got such a URL. It might have been sent by your application, or it might have been constructed by the user agent.
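To illustrate the rule, here is a minimal sketch of how a client could derive the Host: value from a URL: the authority component with any userinfo dropped. The function name hostHeaderForUrl is hypothetical and only for illustration; it relies on PHP's built-in parse_url() and is not how any particular user agent is implemented.

```php
<?php
// Hypothetical helper (not part of PHP): build the Host header value that
// RFC 7230 section 5.4 describes, i.e. host plus port if the URL has one,
// with any "user:password@" part left out.
function hostHeaderForUrl(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return ''; // no authority component in this URL
    }
    $host = $parts['host'];
    if (isset($parts['port'])) {
        // The port is copied only when it was written in the URL,
        // even if it happens to be the scheme's default.
        $host .= ':' . $parts['port'];
    }
    return $host;
}

echo hostHeaderForUrl('https://example.com/robots.txt'), "\n";
echo hostHeaderForUrl('https://example.com:443/robots.txt'), "\n";
echo hostHeaderForUrl('https://user:pw@example.com:443/'), "\n";
```

The first call yields the bare hostname, while the other two keep the explicit :443, which is how both forms can end up in HTTP_HOST for the same site.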
RFC 7230 section 2.7.3 explains URL normalization which, for this case, indicates that a URL containing no port number and a URL containing the default port number are equivalent.
TL;DR: Your application must expect that a port number may appear in this header and deal with it in some way appropriate to the context in which it is used.
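One way to deal with it, if the value is used as a key for logs, queues or caches, is to strip the port whenever it is the default for the scheme. The sketch below assumes that is the behaviour you want; normalizeHost is a hypothetical name, and the caller is assumed to know whether the request arrived over HTTPS.

```php
<?php
// Hypothetical helper: map "example.com" and "example.com:443" to one key.
// $isHttps is an assumption supplied by the caller (for example derived
// from $_SERVER['HTTPS']); it selects which port counts as the default.
function normalizeHost(string $hostHeader, bool $isHttps): string
{
    // Hostnames are case-insensitive, so lower-case before comparing.
    $host = strtolower(trim($hostHeader));
    $defaultPort = $isHttps ? '443' : '80';

    // Match a trailing ":port" (possibly empty). A bracketed IPv6 literal
    // such as "[::1]" ends in "]" and therefore does not match.
    if (preg_match('/^(.+):(\d*)$/', $host, $m)
        && ($m[2] === '' || $m[2] === $defaultPort)) {
        return $m[1];
    }
    return $host; // no port, or a non-default port that must be kept
}

echo normalizeHost('example.com', true), "\n";
echo normalizeHost('Example.com:443', true), "\n";
echo normalizeHost('example.com:8443', true), "\n";
```

The first two calls produce the same key, while the non-default port 8443 is preserved. In a request handler the input would typically be $_SERVER['HTTP_HOST'] ?? ''. Since the header is client-supplied, it is worth treating the result as untrusted input.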
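You may consider instead using
$_SERVER['SERVER_NAME'], which contains the value of the
ServerName directive in the Apache
<VirtualHost> which processed the request. Note that this holds only when UseCanonicalName is set to On; otherwise the value reflects the hostname supplied by the client.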
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.