
Enumerating every UTD personal website

The following work is entirely my own; it is not endorsed by the University of Texas at Dallas.

The University of Texas at Dallas allows every student and faculty member to create their own personal website, hosted on UTD's servers. Here's mine! Of course, immediately after finding out about this resource I got distracted, learned PHP, created my own markup language, and then didn't use that markup language on the site that I made it for.

UTD hosts a number of servers which any student can SSH into. The network diagram looks kind of like this:

+----------+     +-UTD private network--------------------------------------+
|  Public  | ------> (pubssh.utdallas.edu)                              mars|
| Internet |     |   (giant.utdallas.edu)                               axon|
+----------+     |   (cs1.utdallas.edu)                              Malthus|
                 |   <various other servers>     <various other NFS servers>|
                 +----------------------------------------------------------+

If you're connected to UTD's private network, you can SSH directly into any of the servers in the left column of that diagram. Notably, pubssh.utdallas.edu is the only server accessible from outside UTD's network; everything else is hidden behind NAT. This is so that people can still access these systems even when they're off campus: you just SSH into pubssh, then SSH again into whatever server you actually want.
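
If you're off campus, that two-hop dance can be collapsed into a single command with OpenSSH's jump-host option. A minimal sketch, assuming your NetID is your username on both machines:

# hop through pubssh and land on cs1 in one command
ssh -J NETID@pubssh.utdallas.edu NETID@cs1.utdallas.edu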

The servers in the right column hold the home directories. axon stores the home directories of the faculty members associated with the School of Behavioral and Brain Sciences (BBS), Malthus stores the home directories of people in the School of Economic, Political and Policy Sciences (EPPS), and so on. Each of those servers exposes its directories as NFS shares, which are mounted onto the servers in the left column.
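
You can see this arrangement from any of the login servers, since the home directory you land in is really an NFS mount served by one of the machines on the right. A quick way to check which one (the server name in the output depends on which school you belong to):

# show which NFS server and export back your home directory
df -hT ~
# or, for just the source and filesystem type
findmnt -T ~ -o SOURCE,FSTYPE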

Side quest: Finding these servers

When I first created my personal website, I found this page listing every server that stores home directories. I wanted to link to it in this article, but it got taken down and I didn't save the link to look up in the Wayback Machine.

About an hour of fiddling with the Wayback Machine API and jq later, I had this script, which downloads every archived article posted by the UTD Office of Information Technology so that I could grep through them for the specific article I was looking for.

#!/bin/sh

# get all pages on the utd oit website
curl 'https://web.archive.org/cdx/search?output=json&url=https%3A%2F%2Fatlas%2Eutdallas%2Eedu%2FTDClient%2F30%2FPortal%2FKB%2F%2A' > index

# print all articles
#cat index | jq --raw-output 'map(select(.[0] | test("ArticleDet";"i"))) | .[] | .[0]'

# create a directory structure
cat index | jq --raw-output 'map(select(.[0] | test("ArticleDet";"i"))) | .[] | "/\(.[2])"' | cut -b 2- | jq -Rr '@uri' | sed "s/^/.\/out\//" | xargs mkdir -p

# fetch all articles
cat index | jq --raw-output 'map(select(.[0] | test("ArticleDet";"i"))) | .[] | "-o\n./out/\(.[2] | @uri)/\(.[1])\nhttps://web.archive.org/web/\(.[1])id_/\(.[2] | @uri)"' | xargs curl

Some interesting things to note are my heavy use of xargs for performance (while loops are very slow in shell scripts) and the fact that I invoke curl only once, so every request can be made over a single TCP connection.
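
To make that concrete, here's a rough sketch of the two approaches over a hypothetical urls.txt with one URL per line: the loop forks a new curl process, and a new TCP handshake, for every URL, while the xargs version hands the whole list to one curl process that can reuse its connection between requests.

# slow: one curl process (and one TCP connection) per URL
while read -r url; do curl -s -O "$url"; done < urls.txt

# fast: a single curl invocation fetches everything, reusing connections
xargs curl -s --remote-name-all < urls.txt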

Side quest: The operating system that these servers run on

For my class CS 3377 "Systems Programming in UNIX and Other Environments", we have to do our labs on cs1. As I was doing these labs, I was quite surprised to find that cs1, cs2, and csgrads are all still running CentOS 7, despite the fact that it reached end of life in 2024. What's even weirder is that every other server I checked is running some up-to-date version of Red Hat Enterprise Linux (RHEL), so clearly some IT guy somewhere was in charge of migrating from CentOS to RHEL and just forgot about all of the CS servers.
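
If you want to check a box yourself, the version string is easy to pull; a rough sketch that loops over a few of the hostnames mentioned above (expect to authenticate to each one, and note the output format differs slightly between CentOS and RHEL):

# print each server's OS name and version
for host in cs1 cs2 csgrads giant; do
    printf '%s: ' "$host"
    ssh "$host.utdallas.edu" 'grep PRETTY_NAME /etc/os-release'
done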

I really have no idea why these specific servers are running a deprecated operating system. My two best guesses: either the cs servers are mainly used by students to complete homework assignments, so a consistent environment matters more than a secure one, or someone simply forgot about these servers when migrating away from CentOS after the end-of-life announcement. Both seem reasonable to me.

End of side quests

Iterating through public websites is easy. We just look in every user's home directory for a public_html subdirectory and print out their NetIDs. We can start by finding all of the home directories by listing autofs mount points.

{cslinux1:~} mount -l | grep 'type autofs' | grep -oP 'on .*? ' | cut -b 4-
/proc/sys/fs/binfmt_misc 
/misc 
/net 
/people/cs/u 
/usr/local 
/people/cs/o 
/people/cs/t 
/home/012 
/courses/cs4396 
/people/advising 
... extra output truncated

Side quest: The /courses directory

The /courses directory contains three subdirectories: cs4396, cs6390, and se6367. cs6390 is entirely empty except for a single file called remove.txt, which just says this:

remove this when convenient
bnelson 7/21/2020

cs4396 contains a 57 gigabyte tar file called cs-tech.tar and a file called remove.txt, which says this:

remove this when convenient`
bnelson 7/21/2020

se6367 contains a software testing utility called xsuds, and a file called remove.txt, which says this:

ownership removed from wew021000 
bnelson 7/21/2020
remove in 2 years if no complaint
they should be using the xsudsu serve:

Based on this personal website, it seems like wew021000 is a professor who teaches se6367 and uses xsuds in his course. It's been four years and I don't think there have been any complaints, but this is the sort of thing that sysadmins just forget about for years.

End of side quest

It turns out that when you get rid of all the miscellaneous mounts, all of the home directories are stored in /home and /people. Who would have guessed!

Side quest: The removed users

If you look into any of the larger home directory stashes, you'll find that a lot of home directories are marked to be removed:

{cslinux1:/home/010/n/nx} ls | head
nxa130430.REMOVE.2024-06-30-175222
nxa154130
nxa161130.REMOVE.2024-09-30-000212
nxa161230
nxa164430
nxa170006
nxa170430
nxa170930
nxa180002
nxa180007
{cslinux1:/home/010/n/nx} ls | grep REMOVE | wc -l
1514

You'll also find that most of these users seem to have had public websites at one point.

{cslinux1:/home/010/n/nx} ls | grep REMOVE | wc -l
1514
{cslinux1:/home/010/n/nx} ls -d *REMOVE* | while read line ; do test -d "$line/public_html" && echo "$line" ; done | wc -l
1103

According to this small experiment, over 70% of these users had a public_html directory. I don't know who these people are; I'd assume they're mostly alumni, but the idea that 70% of students create a personal web page before graduating seems a bit fishy to me. I suspect that the vast majority of these "websites" are just empty directories with nothing in them, which from my spot checks seems relatively common.
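
That suspicion is easy to test without a while loop; a rough sketch, run from the same directory as the listings above:

# count REMOVEd users whose public_html exists but contains nothing at all
find . -maxdepth 2 -path '*REMOVE*/public_html' -type d -empty | wc -l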

End of side quest

So let's find every website! I really, really want to avoid a bash while loop, so I wrote this Python script to filter out real websites reasonably efficiently.

#!/usr/bin/env python3

import os
from os.path import basename
import stat

while True:
    try:
        line = input()
    except EOFError:
        break

    public_html = line + "/public_html"
    try:
        html_stat = os.stat(public_html)
    except FileNotFoundError:
        continue

    # public_html must be a directory
    if not stat.S_ISDIR(html_stat.st_mode):
        continue

    # public_html must be accessible for the web server to function
    if html_stat.st_mode & 0o055 != 0o055:
        continue

    # public_html must actually have something in it
    if len(os.listdir(public_html)) == 0:
        continue

    print(basename(line))

The one slight compromise I made was to check for 0o055 permissions (read and execute for both group and others) instead of 0o005 (others only). Either I slightly underestimate the real number of UTD personal sites, or I massively overestimate it. I decided to go with the former.
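
If you're curious how much that choice matters, find can count the two cases directly; a rough sketch, run from inside one of the letter buckets like /home/010/n/nx (-perm -055 means "all of these bits are set"):

# strict check: group and others can both read and traverse public_html
find . -maxdepth 2 -type d -name public_html -perm -055 | wc -l
# loose check: only others need read and traverse
find . -maxdepth 2 -type d -name public_html -perm -005 | wc -l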

{cslinux1:/home/010} find . -type d -maxdepth 3 -not -name '*.REMOVE*' 2>/dev/null | ~/has-home.py | tee ~/mars-websites
jcp016300
jct220002
jcw200002
jcarden
jce180001
jcg053000
...
{cslinux1:~} wc -l mars-websites 
1074 mars-websites

I ran this same script on every home directory mount location and got the following data:

{cslinux1:~} wc -l mars-websites axon-websites eng-websites malthus-websites people-websites 
 1074 mars-websites
   78 axon-websites
  191 eng-websites
   42 malthus-websites
  115 people-websites
 1500 total

As far as I can tell, there are exactly 1500 personal websites hosted by UTD. I should note that the names of these files were taken directly from the existing directory names: websites in /home/axon were recorded in the axon-websites file, websites in /home/eng were recorded in the eng-websites file, and so on. people-websites is for every website stored in the /people directory.

{cslinux1:/home} ls
010  011  012  013  014  axon  eng  malthus  nsm
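
Each of those files came from the same pipeline pointed at a different mount; a sketch of what the axon run might have looked like (the exact -maxdepth depends on how that mount lays out its home directories):

find /home/axon -maxdepth 3 -type d -not -name '*.REMOVE*' 2>/dev/null | ~/has-home.py > ~/axon-websites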

Notably, every single website in the /people directory was run by someone in CS. There is not a single academic advisor at UTD who uses the UTD-provided personal websites.
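
A quick way to sanity-check the advising half of that claim is to look for any public_html at all under the advising mount; a rough sketch (the depth of the automounted home directories here is a guess):

# any advisor with even a bare public_html directory would show up here
find /people/advising -maxdepth 2 -type d -name public_html 2>/dev/null | wc -l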

Let's take a random sample of sites to see generally who's running them:

{cslinux1:~} cat *-websites > combined-websites
{cslinux1:~} shuf -n30 combined-websites 
ptw190000
jxy170007
nkumar
chasteen
kxa051000
dga071000
kxl172530
yxw158830
csr170000
nxs135730
mxz173130
jbb130330
nxa190029
cje160030
cxs180003
jxa220048
nxh150030
nai160030
sxb180041
mxy171630
fass
mxh143930
axv210014
sxn177430
ted
sxk220505
jpb170330
jjo190001
bag190002
mxc220049

Here's how I classified each of them:

NetID      Owner classification  Other notes
ptw190000  Student               Undergraduate; website was created for a class
jxy170007  Website is broken
nkumar     Professor
chasteen   Professor
kxa051000  Professor
dga071000  Professor
kxl172530  Website is broken
yxw158830  TA                    Website looks very nice
csr170000  Website is broken
nxs135730  Website is broken
mxz173130  Student               Ph.D. student
jbb130330  Website is broken
nxa190029  cppfile.exe
cje160030  cppfile.exe
cxs180003  Website is broken
jxa220048  cppfile.exe
nxh150030  Website is broken
nai160030  Website is broken
sxb180041  Website is broken
mxy171630  Website is broken
fass       Professor
mxh143930  Website is broken
axv210014  Student               Ph.D. student
sxn177430  Professor
ted        Professor             I'm assuming this guy is a professor; he might have a more managerial role though, since the website is kind of vague.
sxk220505  Student               Graduate student
jpb170330  Website is broken
jjo190001  Professor             Website looks very nice
bag190002  Student               Undergraduate; website was created for a class
mxc220049  cppfile.exe

I sorted these websites into four categories: websites by professors, by students, broken websites, and cppfile.exe. I have no idea what these cppfile.exe websites are, but they all look identical. I assume that they were all created for the same class.

In total we had 5 students, 8 professors, 1 TA, 12 broken websites, and 4 cppfile.exe sites. All of these broken websites have some content in them; they're just misconfigured in some way. For example, the last broken website I looked at was actually a cppfile.exe website in disguise.

{pubssh:/home/010/j/jp/jpb170330/public_html} ls -l
total 96
----------. 1 574231 studunionx 9224 Sep  4  2020 cppfile.exe
-r--r-----. 1 574231 studunionx   15 Sep  4  2020 trial.txt

If you want to play around with this data yourself, the lists I created can be found here.