How TCP over HTTP worked

When I first started hosting natechoe.dev, I discovered that my high school's strict network filtering blocked my own website, labeling it as "Unmanaged". This sent me down a years-long journey to profile and crack that filtering as much as possible.

Over a few weeks during class downtime, I'd do some experiments with netcat, figuring out which connections were blocked and which weren't. I made a few interesting observations:

  • SSH connections were completely blocked, regardless of port number
  • The school was very strict with HTTPS connections, but quite lenient with HTTP connections. For example, https://natechoe.dev was blocked, but http://natechoe.dev went through without a problem. I assume some sysadmin accidentally configured it that way and didn't bother testing HTTP. The entire .dev TLD is HSTS-preloaded anyway, so browsers force HTTPS and this loophole couldn't unblock my website.
  • The school uses, likely among other things, the domain name and web host to determine whether a website should be blocked. For example, natechoe.dev, which is hosted from my homelab on a residential IP address, is blocked. natechoe.com, however, which is hosted with Cloudflare (and earlier GitHub Pages), isn't. This is why I bought the natechoe.com domain.
  • When HTTPS connections aren't blocked, the website is passed through with the origin server's actual TLS certificate. This means that the school can't see the contents of the webpage; I assume that in this case it's filtering entirely based on the hostname and web host.
  • When HTTPS connections are blocked, the school injects its own global self-signed certificate into the TLS connection and returns a different web page.
  • Most protocols other than SSH, HTTP(S), and the various VPN protocols are completely unfiltered. This notably includes SMTP on port 25 (which can be used for email spam) and Gopher.

Based on these and various other observations I developed a mental model for the school's internet filtering which looked kind of like this:

If a student makes a TCP connection:
    If this is an SSH connection:
        Drop this connection
    Else if this is a VPN connection:
        Drop this connection
        Block this user's internet access for 5 minutes
    Else if this is an HTTP(S) connection:
        Get the domain name of the endpoint, either through SNI or the `Host`
        HTTP header

        Evaluate the endpoint on various metrics, including:
            Top level domain
            Second level domain name (?)
            Whether the connection is encrypted or not
            Destination IP address
            Whether the endpoint has been explicitly allowed/blocked by a
            sysadmin
        If the endpoint isn't trustworthy:
            Drop this connection
    If this connection hasn't been dropped:
        Allow it to go through as-is
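
To make that concrete, here's a rough sketch in Python of the kind of first-packet classification I imagine the filter doing. To be clear, this is just my mental model written out as code, not anything the school actually runs: the blocklist is hypothetical, and I've skipped the VPN heuristics and the TLS SNI parsing entirely.

# Hypothetical sketch of the filter's first-packet classification.
BLOCKED_HOSTS = {"natechoe.dev"}  # made-up blocklist

def classify(first_bytes):
    # Guess the protocol of a TCP connection from its first few bytes.
    if first_bytes.startswith(b"SSH-"):
        return "ssh"  # SSH clients open with a banner like "SSH-2.0-..."
    if first_bytes.startswith((b"GET ", b"POST ", b"HEAD ", b"PUT ")):
        return "http"
    if first_bytes[:1] == b"\x16":
        return "tls"  # first byte of a TLS handshake record
    return "other"

def http_host(first_bytes):
    # Pull the Host header out of a plaintext HTTP request.
    # (For TLS you'd parse the SNI extension out of the ClientHello instead.)
    for line in first_bytes.split(b"\r\n"):
        if line.lower().startswith(b"host:"):
            return line.split(b":", 1)[1].strip().decode(errors="replace")
    return None

def verdict(first_bytes):
    kind = classify(first_bytes)
    if kind == "ssh":
        return "drop"
    if kind == "http" and http_host(first_bytes) in BLOCKED_HOSTS:
        return "drop"
    return "allow"  # everything else passes through unmodified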

Note that with this model, if a TCP connection isn't dropped its contents are not modified in any way. HTTP requests aren't buffered, HTTP headers go through as-is, and so on. The contents of a TCP connection are only modified if the connection is dropped.

Also note that since I have a server, I can write custom proxy protocols that the school isn't checking for. At first I spent a while trying to build a complicated generic TCP-to-HTTP encapsulation protocol, until I realized that we could just run the following useless handshake at the beginning of our TCP connection:

> GET / HTTP/1.1
> Content-Length: 99999999999999
> Host: example.com
>
< HTTP/1.1 500 Internal Server Error
< Content-Length: 99999999999999
<

According to the IETF, this is a perfectly valid HTTP exchange.

All responses, regardless of the status code (including interim responses) can be sent at any time after a request is received, even if the request is not yet complete. A response can complete before its corresponding request is complete (Section 6.1). Likewise, clients are not expected to wait any specific amount of time for a response. Clients (including intermediaries) might abandon a request if the response is not received within a reasonable period of time.

RFC 9110

At the end of this exchange, the client is sending a really really long request body and at the same time the server is sending back a really really long response body. At this point, regardless of what data is sent through this TCP connection, any reasonable network sniffer would interpret this TCP connection as an HTTP connection. We can now run some other protocol, like SSH, through this disguised connection. The general model looks like this:

                     My school's filter
                              |
   My laptop (localhost)      |     My server (natechoe.dev)
                              |
                              |
   SSH <> TCP over HTTP  <--------->   TCP over HTTP  <-------->  SSH
 client       client     injects fake     server        strips   server
                            header                   fake header
                              |
                              |
                              |

Some existing SSH client connects to the TCP over HTTP client running on localhost. The TCP over HTTP client does that fake exchange with the TCP over HTTP server and naively shovels data back and forth between the SSH client and TCP over HTTP server. The TCP over HTTP server receives that fake exchange from the client and shovels data back and forth to the SSH server. From the SSH client and server's perspective, they're just making a regular SSH connection. From my code's perspective, we're just making a small exchange and then shuffling data around. From the school's perspective, I'm just making a weirdly long but perfectly fine HTTP request.
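
In code, the client side of that picture fits on a page. Here's a minimal sketch in Python; it's not the actual TCP over HTTP implementation (see the repo for that), the endpoint and ports are placeholders, and real code would need actual error handling.

import socket
import threading

FAKE_REQUEST = (
    b"GET / HTTP/1.1\r\n"
    b"Content-Length: 99999999999999\r\n"
    b"Host: example.com\r\n"
    b"\r\n"
)

def pump(src, dst):
    # Naively shovel bytes from src to dst until EOF.
    while data := src.recv(4096):
        dst.sendall(data)
    dst.shutdown(socket.SHUT_WR)  # pass the EOF along

def handle(local):
    remote = socket.create_connection(("natechoe.dev", 80))  # placeholder endpoint
    remote.sendall(FAKE_REQUEST)

    # Swallow the fake "500 Internal Server Error" response headers.
    buf = b""
    while b"\r\n\r\n" not in buf:
        chunk = remote.recv(4096)
        if not chunk:
            return  # server hung up during the handshake
        buf += chunk

    # Anything past the header terminator is already tunnel data.
    extra = buf.split(b"\r\n\r\n", 1)[1]
    if extra:
        local.sendall(extra)

    # From here on this is a raw byte pipe. The filter has already decided
    # this connection is HTTP, so we can run SSH (or anything) through it.
    threading.Thread(target=pump, args=(local, remote), daemon=True).start()
    pump(remote, local)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 2222))  # then e.g.: ssh -p 2222 user@127.0.0.1
listener.listen()
while True:
    conn, _ = listener.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()

The server side is the mirror image: accept a connection, read the fake request headers, send back the fake 500 response, then shovel bytes to and from the real SSH server.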

This is how TCP over HTTP works. Around a year after I first built this system, it started getting flagged as a VPN. It turns out that this mechanism is similar enough to an existing VPN app called HTTP Injector to get detected. Eventually this got bad enough that I hacked together (but never published) another quick proxy using WebSockets and Node.js, and hosted it at natechoe.com behind Cloudflare.
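
I never published that code, but the core idea is small enough to sketch. Here's roughly what the client side looks like, written here in Python with the third-party websockets package rather than the original Node.js; the tunnel URL is a placeholder, and the server side would just relay frames to the SSH server.

import asyncio
import websockets  # third-party: pip install websockets

TUNNEL_URL = "wss://natechoe.com/tunnel"  # placeholder path

async def handle(reader, writer):
    async with websockets.connect(TUNNEL_URL) as ws:
        async def to_ws():
            while data := await reader.read(4096):
                await ws.send(data)
        async def from_ws():
            async for message in ws:  # assuming the server sends binary frames
                writer.write(message)
                await writer.drain()
        # Shovel in both directions; shutdown handling omitted for brevity.
        await asyncio.gather(to_ws(), from_ws())

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 2222)
    async with server:
        await server.serve_forever()

asyncio.run(main())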

Cloudflare is interesting. A decent chunk of the internet is proxied through Cloudflare, to the point where it's not at all practical to indiscriminately block every Cloudflare-proxied site. At the same time, Cloudflare websites are slowly becoming indistinguishable from one another. From what I understand, my school was already doing most if not all of its filtering based on domain names and web hosts, but with new technologies like Encrypted Client Hello it may become impossible to distinguish between any two Cloudflare websites in any meaningful way. I think that in the future, all of the tens of millions of websites that Cloudflare proxies will present the same web host and the same TLS handshake, so it will be impossible to filter anything without shutting down the internet entirely.

The internet has been getting quite centralized lately; every time Cloudflare or AWS has an outage, it makes the news because large chunks of the internet go down all at once. That centralization comes with some significant problems, but I think in this one specific case it might actually help.