Enumerating every UTD personal website
The following work is entirely my own; it is not endorsed by the University of Texas at Dallas.
The University of Texas at Dallas allows every student and faculty member to create their own personal website, hosted on UTD's servers. Here's mine! Of course, immediately after finding out about this resource I got distracted, learned PHP, created my own markup language, and then didn't use that markup language on the site that I made it for.
UTD hosts a number of servers which any student can SSH into. The network diagram looks kind of like this:
+----------+      +-UTD private network-------------------------------------+
| Public   | ---> | (pubssh.utdallas.edu)                              mars |
| Internet |      | (giant.utdallas.edu)                               axon |
+----------+      | (cs1.utdallas.edu)                              Malthus |
                  | <various other servers>     <various other NFS servers> |
                  +---------------------------------------------------------+
If you're connected to UTD's private network, you can directly SSH into any of the servers on the left side of the diagram. Notably, pubssh.utdallas.edu is the only server accessible to people outside of UTD's network; everything else is hidden behind NAT. This is so that people can still access these systems even if they're off campus: you just SSH into pubssh, then SSH again into some other server.
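In practice, that two-hop connection looks something like the sketch below ("netid" and the inner server are placeholders; substitute your own):
# Hop 1: the public bastion; hop 2: an internal server, run from the pubssh shell.
ssh netid@pubssh.utdallas.edu
ssh netid@cs1.utdallas.edu

# Or collapse both hops into one command with OpenSSH's ProxyJump option:
ssh -J netid@pubssh.utdallas.edu netid@cs1.utdallas.edu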
The servers on the right contain a bunch of home directories. axon contains the home directories of all the faculty members associated with the School of Behavioral and Brain Sciences (BBS), Malthus contains the home directories of people in the School of Economic, Political and Policy Sciences (EPPS), and so on. All of those servers expose these directories as NFS shares, which are mounted onto the servers on the left.
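You can see these NFS-backed mounts from any of the login servers; here's one quick way to peek (illustrative only, and since the mounts are managed by autofs, a given home directory only shows up after someone has accessed it):
# List the NFS-backed filesystems currently mounted on a login server.
df -hT | grep -i nfs | head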
Side quest: Finding these servers
When I first created my personal website, I found this page listing every server that stores home directories. I wanted to link to it in this article, but it has since been taken down, and I didn't save the link, so I couldn't just look it up in the Wayback Machine.
About an hour of fiddling with the Wayback Machine API and jq later, I wrote this script to automatically download every single archived article posted by the UTD Office of Information Technology, so that I could grep through them to find the specific article I was looking for.
#!/bin/sh
# get all pages on the utd oit website
curl 'https://web.archive.org/cdx/search?output=json&url=https%3A%2F%2Fatlas%2Eutdallas%2Eedu%2FTDClient%2F30%2FPortal%2FKB%2F%2A' > index
# print all articles
#cat index | jq --raw-output 'map(select(.[0] | test("ArticleDet";"i"))) | .[] | .[0]'
# create a directory structure
cat index | jq --raw-output 'map(select(.[0] | test("ArticleDet";"i"))) | .[] | "/\(.[2])"' | cut -b 2- | jq -Rr '@uri' | sed "s/^/.\/out\//" | xargs mkdir -p
# fetch all articles
cat index | jq --raw-output 'map(select(.[0] | test("ArticleDet";"i"))) | .[] | "-o\n./out/\(.[2] | @uri)/\(.[1])\nhttps://web.archive.org/web/\(.[1])id_/\(.[2] | @uri)"' | xargs curl
Some interesting things to note are my heavy use of xargs for performance (while loops are very slow in shell scripts), and the fact that I only invoke curl once so that all of the requests can be made over a single TCP connection.
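For comparison, the obvious while-loop version would look roughly like this (a sketch I did not actually run): it forks a fresh curl, and opens a fresh TCP connection, for every single article.
# Slow alternative: one curl process and one TCP connection per article.
cat index \
  | jq --raw-output 'map(select(.[0] | test("ArticleDet";"i"))) | .[] | "https://web.archive.org/web/\(.[1])id_/\(.[2] | @uri)"' \
  | while read -r url ; do
      curl -s "$url" > /dev/null
    done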
Side quest: The operating system that these servers run on
For my class CS 3377, "Systems Programming in UNIX and Other Environments", we have to do our labs on cs1. As I was doing these labs, I was quite surprised to find that cs1, cs2, and csgrads are all still running CentOS 7, despite the fact that it reached end of life in 2024. What's even weirder is that every other server I checked is running an up-to-date version of Red Hat Enterprise Linux (RHEL), so clearly some IT guy somewhere was in charge of migrating from CentOS to RHEL and just forgot about all of the CS servers.
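If you want to check for yourself, the version string is sitting right there in /etc/os-release on every one of these machines:
# After SSHing into a server, check which distribution and release it runs.
# cs1, cs2, and csgrads report CentOS Linux 7; the other servers I checked
# report a current RHEL release.
grep PRETTY_NAME /etc/os-release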
I really have no idea why these specific servers are using a deprecated operating system. My two best guesses are that the cs servers are mainly used by students to complete homework assignments, so a consistent environment is more important than a secure environment, or that someone just forgot about these servers when migrating from CentOS after the deprecation announcement. Both seem reasonable to me.
End of side quests
Iterating through public websites is easy. We just look in every user's home directory for a public_html subdirectory and print out their NetIDs. We can start by finding all of the home directories by listing autofs mount points.
{cslinux1:~} mount -l | grep 'type autofs' | grep -oP 'on .*? ' | cut -b 4-
/proc/sys/fs/binfmt_misc
/misc
/net
/people/cs/u
/usr/local
/people/cs/o
/people/cs/t
/home/012
/courses/cs4396
/people/advising
... extra output truncated
Side quest: The /courses directory
The /courses directory contains three subdirectories: cs4396, cs6390, and se6367. cs6390 is entirely empty except for a single file called remove.txt, which just says this:
remove this when convenient
bnelson 7/21/2020
cs4396 contains a 57 gigabyte tar file called cs-tech.tar and a file called remove.txt, which says this:
remove this when convenient
bnelson 7/21/2020
se6367 contains a software testing utility called xsuds, and a file called remove.txt, which says this:
ownership removed from wew021000
bnelson 7/21/2020
remove in 2 years if no complaint
they should be using the xsudsu serve:
Based on this personal website, it seems like wew021000 is a professor who teaches se6367 and uses xsuds in his course. It's been four years and I don't think there have been any complaints, but this is the sort of thing that sysadmins just forget about for years.
End of side quest
It turns out that when you get rid of all the miscellaneous mounts, all of the home directories are stored in /home and /people. Who would have guessed!
Side quest: The removed users
If you look into any of the larger home directory stashes, you'll find that a lot of home directories are marked to be removed:
{cslinux1:/home/010/n/nx} ls | head
nxa130430.REMOVE.2024-06-30-175222
nxa154130
nxa161130.REMOVE.2024-09-30-000212
nxa161230
nxa164430
nxa170006
nxa170430
nxa170930
nxa180002
nxa180007
{cslinux1:/home/010/n/nx} ls | grep REMOVE | wc -l
1514
You'll also find that most of these users seem to have had public websites at one point.
{cslinux1:/home/010/n/nx} ls | grep REMOVE | wc -l
1514
{cslinux1:/home/010/n/nx} ls -d *REMOVE* | while read line ; do test -d "$line/public_html" && echo "$line" ; done | wc -l
1103
According to this small experiment, over 70% of these users had a public_html directory. I don't know who these people are; I'd assume that they're mostly alumni, but the idea that 70% of students create a personal web page before graduating seems a bit fishy to me. I suspect that the vast majority of these "websites" are actually just empty directories with no content inside of them; from my experiments, that seems relatively common.
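Those "experiments" were just quick one-liners in the same vein as the ones above. Something like the following sketch (not the exact command I ran) counts how many of the removed users' public_html directories are completely empty:
# Count removed users whose public_html directory exists but has nothing in it.
ls -d *REMOVE* | while read line ; do
    dir="$line/public_html"
    # `ls -A` prints nothing for an empty directory, so test -z catches it
    test -d "$dir" && test -z "$(ls -A "$dir")" && echo "$line"
done | wc -l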
End of side quest
So let's find every website! I really, really want to avoid a bash while loop, so I wrote this Python script to filter out real websites reasonably efficiently.
#!/usr/bin/env python3
import os
from os.path import basename
import stat

while True:
    try:
        line = input()
    except EOFError:
        break
    public_html = line + "/public_html"
    try:
        html_stat = os.stat(public_html)
    except FileNotFoundError:
        continue
    # public_html must be a directory
    if not stat.S_ISDIR(html_stat.st_mode):
        continue
    # public_html must be accessible for the web server to function
    if html_stat.st_mode & 0o055 != 0o055:
        continue
    # public_html must actually have something in it
    if len(os.listdir(public_html)) == 0:
        continue
    print(basename(line))
The one slight compromise I made was to check for 0o055 permissions instead of 0o005 permissions. Either I slightly underestimate the real number of UTD personal sites, or I massively overestimate it. I decided to go with the former.
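Concretely, 0o005 only requires read and execute for "other", while 0o055 requires them for both group and other, so the stricter check can only reject directories that the looser check would accept. A small illustration with two hypothetical directories:
# Two hypothetical public_html directories with different permission modes.
mkdir -p demo1/public_html demo2/public_html
chmod 755 demo1/public_html   # rwxr-xr-x: passes both the 0o005 and 0o055 checks
chmod 705 demo2/public_html   # rwx---r-x: passes 0o005 but fails 0o055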
{cslinux1:/home/010} find . -type d -maxdepth 3 -not -name '*.REMOVE*' 2>/dev/null | ~/has-home.py | tee ~/mars-websites
jcp016300
jct220002
jcw200002
jcarden
jce180001
jcg053000
...
{cslinux1:~} wc -l mars-websites
1074 mars-websites
I ran this same script on every home directory mount location and got the following data:
{cslinux1:~} wc -l mars-websites axon-websites eng-websites malthus-websites people-websites
1074 mars-websites
78 axon-websites
191 eng-websites
42 malthus-websites
115 people-websites
1500 total
As far as I can tell, there are exactly 1500 personal websites hosted by UTD. I should note that the names of these files were taken directly from the existing directory names. Websites in /home/axon were recorded in the axon-websites file, websites in /home/eng were recorded in the eng-websites file, and so on. people-websites is for every website stored in the /people directory.
{cslinux1:/home} ls
010 011 012 013 014 axon eng malthus nsm
It's also worth noting that every single website in the /people directory was run by someone in CS. There is not a single academic advisor at UTD who uses the UTD-provided personal websites.
Let's take a random sample of sites to see generally who's running them:
{cslinux1:~} cat *-websites > combined-websites
{cslinux1:~} shuf -n30 combined-websites
ptw190000
jxy170007
nkumar
chasteen
kxa051000
dga071000
kxl172530
yxw158830
csr170000
nxs135730
mxz173130
jbb130330
nxa190029
cje160030
cxs180003
jxa220048
nxh150030
nai160030
sxb180041
mxy171630
fass
mxh143930
axv210014
sxn177430
ted
sxk220505
jpb170330
jjo190001
bag190002
mxc220049
| NetID | Owner classification | Other notes |
|---|---|---|
| ptw190000 | Student | Undergraduate, website was created for a class |
| jxy170007 | Website is broken | |
| nkumar | Professor | |
| chasteen | Professor | |
| kxa051000 | Professor | |
| dga071000 | Professor | |
| kxl172530 | Website is broken | |
| yxw158830 | TA | Website looks very nice |
| csr170000 | Website is broken | |
| nxs135730 | Website is broken | |
| mxz173130 | Student | Ph.D. student |
| jbb130330 | Website is broken | |
| nxa190029 | cppfile.exe | |
| cje160030 | cppfile.exe | |
| cxs180003 | Website is broken | |
| jxa220048 | cppfile.exe | |
| nxh150030 | Website is broken | |
| nai160030 | Website is broken | |
| sxb180041 | Website is broken | |
| mxy171630 | Website is broken | |
| fass | Professor | |
| mxh143930 | Website is broken | |
| axv210014 | Student | Ph.D. student |
| sxn177430 | Professor | |
| ted | Professor | I'm assuming this guy is a professor; he might have a more managerial role though, since the website is kind of vague. |
| sxk220505 | Student | Graduate student |
| jpb170330 | Website is broken | |
| jjo190001 | Professor | Website looks very nice |
| bag190002 | Student | Undergraduate, website was created for a class |
| mxc220049 | cppfile.exe | |
I sorted these websites into a handful of categories: websites by professors, by students and TAs, broken websites, and cppfile.exe. I have no idea what these cppfile.exe websites are, but they all look identical. I assume that they were all created for the same class.
In total we had 5 students, 8 professors, 1 TA, 12 broken websites, and 4 cppfile.exe sites. All of these broken websites have some content in them; they're just misconfigured in some way. For example, the last broken website I looked at was actually a cppfile.exe website in disguise.
{pubssh:/home/010/j/jp/jpb170330/public_html} ls -l
total 96
----------. 1 574231 studunionx 9224 Sep 4 2020 cppfile.exe
-r--r-----. 1 574231 studunionx 15 Sep 4 2020 trial.txt
If you want to play around with this data yourself, the lists I created can be found here.