2021-Oct-7: Running webservers behind a transparent reverse proxy

Looks like the most requested choice for a topic for October 7 is setting up webservers behind transparent reverse proxies. This is partially a sequel to my recorded LFNW2020 presentation on implementing transparent reverse proxies, but if you haven’t seen that, don’t worry, I’ll give a brief recap. I think there will be enough time to go over how to run about 4 different webservers, which should be sufficient to apply the lessons to any webserver. Given there are more options for “other webservers” than there is time to cover, if you have a preference, please pick here. I’ll put the presentation together in priority order.

  • PHP (WordPress, Nextcloud, et cetera)
  • WSGI (Django and similar)
  • uHTTPd/lighttpd
  • Node.js
  • nginx
  • other (closed source, java, the obstinate and weird)

Related thread: Reverse proxies

Terminology

As mentioned in the first post, this follows my presentation from a year ago on transparent reverse proxies. While I will begin this presentation by covering the basics again, there are several ambiguous and confusing terms involved. So this post will attempt to provide some clarity of terminology, and a written version of the background information.

Open Files

The first set of confusing terms is file handle vs file descriptor vs file description.
All three refer to an open file, which can be a logical file on a filesystem, a socket, or a device. Everything is a file.

File Handle

File Handles are a userspace structure for referencing a file (open or not). The key here is userspace. As such, they are process-specific.

File Descriptor

File Descriptors are still process-specific, but they are the kernelspace counterpart to the File Handle. These are presented to userspace as just an integer. File Descriptors can be copied, including between processes. The duplicate File Descriptor will be assigned a new integer. Multiple File Descriptors can point to the same File Description.

File Description

File Descriptions are the internal kernelspace data structure for an open file. Each call to open() creates exactly one File Description, no matter how many File Descriptors later point to it. File Descriptions hold metadata about the opened file (such as read/write mode, file offset, underlying location, and so on).
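To make the Descriptor/Description split concrete, here is a minimal Python sketch (not from the presentation; the function name `demo` is made up for illustration). It assumes a Unix-like system. A duplicated File Descriptor gets a new integer but shares the offset of the original, while a second open() of the same path gets its own File Description and its own offset.

```python
import os
import tempfile


def demo():
    """Show that dup() shares a File Description while a second open() does not."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello world")
        tmp.flush()

        fd = os.open(tmp.name, os.O_RDONLY)        # first File Description
        dup_fd = os.dup(fd)                        # new integer, same File Description
        other_fd = os.open(tmp.name, os.O_RDONLY)  # second, independent File Description
        try:
            # Move the offset through the original descriptor only.
            os.lseek(fd, 6, os.SEEK_SET)
            via_dup = os.read(dup_fd, 5)      # sees the moved offset
            via_other = os.read(other_fd, 5)  # still starts at offset 0
            return fd != dup_fd, via_dup, via_other
        finally:
            for d in (fd, dup_fd, other_fd):
                os.close(d)


if __name__ == "__main__":
    print(demo())
```

The read through `dup_fd` starts at offset 6 because the offset lives in the shared File Description, not in the descriptor.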

Topology

Load-balanced or Virtual Hosted services are another area of confusing terminology. There is less standardization of terms here, but for the purposes of this presentation, I will use the following.

Load Balancer

A load balancer is a service that splits incoming connections between multiple service providers. This can be done at the DNS level, at the application level, or at the edge server. Application- and DNS-level load balancing split the connections at Layer 4 (TCP/UDP), so the main connection (HTTP GET requests, et cetera) is routed to different logical service providers. When load balancing at the edge server level, all the traffic hits the edge server.

Edge Server

The logical server that accepts incoming connections once the client has resolved the domain name. When multiple logical servers are involved, often only the edge server listens on the public internet. Note that the Edge Server can run on the same logical machine (physical or virtual) as other logical servers.

Upstream Server

A logical server which handles a subset of incoming connections or requests. These are hidden from the “outside world” behind the edge server. A Reverse Proxy can intelligently select between upstream servers.

Reverse Proxy

A service running on the Edge Server which examines incoming connections and relays or routes them to different upstream servers based on the name of the server the client requested.

Transparent Reverse Proxy

A Reverse Proxy which uses the TLS Server Name Indication (SNI) field or packet inspection to determine the name of the requested Upstream Server without affecting the state of the connection. The connection is then passed (usually by duplicating the File Descriptor for the socket) to the Upstream Server. At the application level, this process is transparent.
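As a rough illustration of passing a connection by duplicating its File Descriptor, the sketch below hands an accepted TCP socket from a "proxy" to an "upstream" over a Unix domain socket. This is not the code from the presentation: the helper names `hand_off` and `adopt` are invented here, both roles run in one process for brevity, and it assumes Python 3.9+ on a Unix-like system (for `socket.send_fds` and `socket.recv_fds`).

```python
import socket


def hand_off(conn, channel):
    """Proxy side: send the client socket's descriptor over a Unix socket."""
    socket.send_fds(channel, [b"fd"], [conn.fileno()])


def adopt(channel):
    """Upstream side: receive the descriptor and wrap it in a socket object."""
    _msg, fds, _flags, _addr = socket.recv_fds(channel, 16, 1)
    return socket.socket(fileno=fds[0])


def demo():
    # The Unix socket pair stands in for the proxy-to-upstream control channel.
    proxy_end, upstream_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()
    client.sendall(b"ping")

    hand_off(conn, proxy_end)
    conn.close()  # the proxy's copy is no longer needed

    served = adopt(upstream_end)
    data = served.recv(4)    # the unread request is still waiting on the socket
    served.sendall(b"pong")  # the reply goes straight to the client
    reply = client.recv(4)

    for s in (served, client, listener, proxy_end, upstream_end):
        s.close()
    return data, reply


if __name__ == "__main__":
    print(demo())
```

In a real deployment the two ends of the channel would live in different processes. The point is only that the upstream ends up holding the same connection the client opened, rather than a second connection made by the proxy.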

Other Terms

Server Name Indication (SNI)

An extension in the TLS ClientHello message, which indicates in plain text the desired logical server name.
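For the curious, here is a hedged sketch of how a proxy could read the SNI name without consuming any data from the connection. The function names are hypothetical and the use of `MSG_PEEK` is an assumption for illustration, not necessarily how the proxy in the presentation does it. The parser follows the ClientHello layout (record header, handshake header, version, random, session ID, cipher suites, compression methods, extensions) and only handles a ClientHello that arrives whole in a single record.

```python
import socket
import struct


def sni_from_client_hello(data):
    """Return the SNI host name from a raw TLS ClientHello record, or None."""
    try:
        if data[0] != 0x16:          # not a TLS handshake record
            return None
        pos = 5                      # skip the 5-byte record header
        if data[pos] != 0x01:        # not a ClientHello
            return None
        pos += 4                     # handshake type + 3-byte length
        pos += 2 + 32                # client version + random
        pos += 1 + data[pos]         # session ID
        (suites_len,) = struct.unpack_from("!H", data, pos)
        pos += 2 + suites_len        # cipher suites
        pos += 1 + data[pos]         # compression methods
        (ext_total,) = struct.unpack_from("!H", data, pos)
        pos += 2
        end = min(pos + ext_total, len(data))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", data, pos)
            pos += 4
            if ext_type == 0:        # server_name extension
                # Layout: 2-byte list length, 1-byte name type, 2-byte name length.
                name_type = data[pos + 2]
                (name_len,) = struct.unpack_from("!H", data, pos + 3)
                if name_type != 0:   # only host_name entries are understood here
                    return None
                return data[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None
    return None


def peek_sni(conn, limit=4096):
    """Look at the pending bytes with MSG_PEEK so they stay unread on the socket."""
    return sni_from_client_hello(conn.recv(limit, socket.MSG_PEEK))
```

Because the bytes are only peeked, whichever server ends up with the socket still sees the complete handshake from the first byte.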

Named Virtual Host

A Named Virtual Host is a logical website (or other HTTP service) identified by a unique hostname. The hostname is provided by the SNI extension of the TLS ClientHello, or by the Host: header of an HTTP request.

Notes

  1. In the description of Topology, HTTP is mentioned only in passing. Reverse Proxies are used for a wide variety of application-level protocols, and one of the major advantages of Transparent Reverse Proxies is they are application-protocol agnostic (provided the initial handshake has some way for the Reverse Proxy server to determine which Upstream Server to use).

  2. Transparent Reverse Proxies are not always superior to HTTP Reverse Proxies. HTTP Reverse Proxies can cache results, are often supported out of the box, and often produce equivalent results. This is not a dig at them, or at those who write or use them.

  3. Load balancing (but not Named Virtual Hosting) can also be done by the kernel (starting in Linux 4.something). I’m going to ignore that for expediency.

  4. I am assuming all logical services are running on the same logical computer on the same physical computer. It is possible to use a Transparent Reverse Proxy when that is not true, but demonstrating it is more difficult, the setup is often trivialized, and some of the performance benefits of a Transparent Reverse Proxy are lost.

This is related to 2021-Oct-7: Running webservers behind a transparent reverse proxy . I’m unable to comment on that poll.

I’ve been using Traefik Proxy in front of Docker containers for running various web services like Nextcloud, Jellyfin, Wallabag, and Pi-hole. Traefik Proxy is interesting because it dynamically discovers new web services when you start them up (based on labels associated with containers). I really prefer this setup and I don’t miss manual config with apache/haproxy/nginx.

Anyway, I just wanted to put a plug in for this talk including traefik or something like it.

I gave a related workshop at LibrePlanet this year, see Adam Monsen / LibrePlanet 2021 Private Cloud Workshop · GitLab

Odd that you cannot reply there… Are you a member of the BLUG group? If not, you’ll need to join before you can post on BLUG threads. I believe there’s an invite link here somewhere.

Traefik looks like a neat project. I have considered trying to do something similar, but UI programming is not something I particularly enjoy. It is worth noting that Traefik is not transparent, it must pass connection metadata (remote IP/port, and so on) out-of-band. It likely will have scaling issues once you get to where you need to be using sendfile(2), but for a dead-simple-to-use reverse proxy, it has promise.

You mention Docker, so I should probably include a bit on how to use dockerized services behind the proxy too. I don’t normally use Docker; it tries to be fancier than a container manager should, which leads to many of its users not understanding what it’s doing, but I think I can make an exception here. (Normally I use firejail to manage starting new kernel namespaces.)

@meonkeys - would you like to join the BLUG group?

Maybe not. I’ll reply to @bear454 below.

UI programming?

Oh huh, and this is different than using maybe HAProxy or nginx / apache as a reverse proxy? I’d love to learn more about this. To be clear, I’m not using Traefik for performance reasons, more for convenience–so I can easily serve many container-based web services with different domain names from the same IP address (and, because I’m cheap/lazy, the same physical server).

Cool, I’d like to learn more about that too. I’ve heard of several different ways to containerize and just went with Docker because it seemed like the popular option (and there are tons of prebuilt images available).

Sure! I think I got it, I just clicked Join on Blug members - LinuxFest Northwest .

Ah yep, that was why I couldn’t reply to the poll. Thanks @logan and @bear454 !

One of the most compelling features of Traefik is its own web user interface for monitoring things. While I appreciate good UIs, and can make them, given sufficient time, I lack the talent to make them quickly.

I am something of a “black sheep”. I’ve been doing asynchronous programming since the days when Java was the language for enterprise software and multiplexing connections meant just fork() a few extra times.

Docker is popular because Docker is popular. Docker is fairly terrible from a security perspective (when’s the last time you looked inside a docker container to see what it’s really doing?), but it reduces the friction on deploying software to “the cloud” (computers you don’t control), so it is popular. Containerization is good, and is likely here to stay. Docker will likely be replaced eventually (or morph to something different).

Yes. The key word is transparent. HAProxy, nginx, and similar are HTTP proxies, using the HTTP proxy protocol. This is a Layer-5 proxy protocol, and has side-effects at layer 5. For instance, the SSL certificates must be handled by the proxy itself. The proxied server cannot directly see the remote IP address (that must be included in the HTTP-Proxy metadata instead). As a related issue, socket options on the outgoing connection cannot be changed by the proxied server.

The transparent reverse proxy I use is a Layer-4 proxy protocol. It slightly complicates establishing the layer-4 (TCP) link, but is perfectly transparent once the connection is established. The SSL certs are handled by the proxied server, remote client information is available through the usual syscalls, and the data generally need never pass through the proxy server process once the initial connection is established (if the proxied server is physically separate, having the proxy server pass the data is one option).
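A small Python sketch of the "usual syscalls" point, under some loud assumptions: everything runs in one process on loopback, `conn.dup()` stands in for the real cross-process hand-off, and the function name `demo` is made up. It compares what `getpeername()` reports on a relayed connection (a second TCP connection opened by the proxy) with what it reports on a socket that was handed over.

```python
import socket


def demo():
    """Compare the peer address seen via a relay with the one seen on a handed-over socket."""
    edge = socket.create_server(("127.0.0.1", 0))
    upstream = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(edge.getsockname())
    conn, _ = edge.accept()

    # Relaying proxy: the proxy opens its own connection to the upstream,
    # so the upstream's peer is the proxy, not the client.
    relay = socket.create_connection(upstream.getsockname())
    relayed, _ = upstream.accept()
    relayed_peer = relayed.getpeername()

    # Handed-over socket: the upstream holds a duplicate of the socket the
    # edge accepted, so its peer is the real client.
    handed = conn.dup()
    handed_peer = handed.getpeername()

    result = (client.getsockname(), relay.getsockname(), relayed_peer, handed_peer)
    for s in (handed, relayed, relay, conn, client, upstream, edge):
        s.close()
    return result


if __name__ == "__main__":
    client_addr, proxy_addr, relayed_peer, handed_peer = demo()
    print("client:", client_addr)
    print("upstream sees via relay:", relayed_peer)
    print("upstream sees via handed-over socket:", handed_peer)
```

With the relay, the upstream would need the client address delivered some other way (for example in proxy metadata); with the handed-over socket it is simply there.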

I realize this probably just raises a few more questions. I’m working on a post for the other thread with some terminology and background information. Essentially the material I covered last year for LFNW2020.

https://www.blug.org/event/virtualblug-meeting-10-7-logan-perkins-running-webservers-behind-a-transparent-reverse-proxy/

As these things typically go, it is at the end that the right choice is apparent. I just went to put the code on github, and found it’s already there. Complete with some pieces that would have been useful for the presentation.

Anyway, the sources are here, complete with a patch to let nginx work, a working LD_PRELOAD library, and a shim to support TCP upstream hosts.

The slides are available here

Thanks for a great presentation, @logan !