Ask HN: Stanford CS 153 help
hi hn - i'm volunteering at Stanford next quarter to co-teach cs 153 (infrastructure at scale) - a course i wish had existed during my undergrad years. rather than pure theory, it's focused on how large-scale systems actually work in production
the format combines hands-on projects with a speaker series. we've confirmed some solid speakers (Jensen Huang from NVIDIA, Matthew Prince from Cloudflare etc), but i'm also keen to bring in perspectives from folks who don't fit the standard mold. tbh, many of the best systems eng/devs/infra ppl i've worked with are pretty weird - they think differently, take unconventional paths, and often learn by obsessively building and breaking things rather than following traditional routes. i think it would be cool for the students to realize it's a feature, not a bug, to be weirdly obsessive
if you're interested in this kind of stuff, i'd value your thoughts on:
1/ who are the fascinating/unsung heroes in infra/systems eng that students should learn from? especially interested in people who've solved hard scaling problems through unconventional thinking or unique approaches
2/ what kind of projects do you think would be fun and meaningfully demonstrate real-world infrastructure challenges while still being achievable in an academic quarter?
prerequisites are CS106/CS111 level programming. draft syllabus here: https://explorecourses.stanford.edu/search?view=catalog&filt...
email: anjney at alumni dot stanford edu if you prefer to share thoughts privately. thank you in advance for any and all help
Rachel by the Bay (https://rachelbythebay.com/) has long impressed me as someone who clearly is deep in the actual work of systems, day in and day out, and can write well about it.
Julia Evans has a wonderful approach as well, and has amazing talent for teaching: https://jvns.ca/
Kellan Elliott-McCrea (https://laughingmeme.org/) has given the world some of the better advice on the hardest parts of software scaling, which is of course scaling the human organizations. New grads are virtually always underestimating that part of the work; eventually you realize the hard problems are usually social and not technical.
i've followed Rachel and Julia for a long time, but didn't know about Kellan - thanks so much for that.
re: human org scaling - true and this was the most surprising thing for me when i was running the platform org at discord. companies ship their org charts whether they like it or not. and refactoring org charts correctly, at scale, is essentially untested in the modern era
A progression of projects that comes to mind:
1) CI and IAC that deploy a web app running in a container
2) Add horizontal scaling and load balancer
3) Add long running tasks / scheduled task support
4) Deploys will likely break long running tasks. Implement blue/green or rolling deploys or some other sort of advanced deployment scheme
5) Implement rollbacks
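A rough sketch of what steps 4 and 5 could look like, assuming an in-memory fleet rather than real servers. The names `Instance`, `rolling_deploy`, `is_healthy`, and `batch_size` are all made up for illustration; a real project would replace the version assignment with draining and redeploying, and the health check with an actual probe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """Hypothetical stand-in for one server or container in the fleet."""
    name: str
    version: str


def rolling_deploy(
    fleet: List[Instance],
    new_version: str,
    is_healthy: Callable[[Instance], bool],
    batch_size: int = 1,
) -> bool:
    """Upgrade the fleet batch by batch; roll back on a failed health check.

    Returns True if the whole fleet ends on new_version, False if rolled back.
    """
    # Remember what each instance was running so a rollback can restore it.
    previous = {inst.name: inst.version for inst in fleet}

    for start in range(0, len(fleet), batch_size):
        batch = fleet[start:start + batch_size]
        for inst in batch:
            # Assumption: stands in for draining traffic and redeploying.
            inst.version = new_version

        if not all(is_healthy(inst) for inst in batch):
            # Roll back every instance touched so far, including this batch.
            for inst in fleet[:start + batch_size]:
                inst.version = previous[inst.name]
            return False

    return True
```

Only one batch is out of service at a time, which is what keeps the app up during a deploy. Long running tasks are the part this sketch skips: students would have to decide whether to wait for them, checkpoint them, or hand them off before an instance is replaced.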
This! This is what I’ve seen at my companies and is super salient to today’s real life work ~
Love this. Easy to Advanced, with 5 for extra credit. Thank you
6) Feature flags, telemetry, soaking
7) Alarms
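For the feature flag part, one common approach is a percentage rollout keyed on a hash of the user id, so the same user always gets the same answer. A minimal sketch, assuming string user ids; `flag_enabled` and its parameters are hypothetical names, not any particular flag library's API.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the rollout for this flag."""
    # Hash flag and user together so different flags pick different users.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    # Map the first 4 bytes of the hash onto a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Because the bucket is fixed per user, raising the percentage only adds users and never flips anyone back off, which makes soaking a change at 1%, then 10%, then 100% predictable.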
2/
Build a multi-cloud architecture. And by this, I mean connect two clouds' networks so that two applications, one running in each cloud, can talk to each other without traversing the public internet. And then, put that into IaC. It sounds like not much, but the issues you uncover are pretty illuminating and it is a fantastic interview question to give to senior-ish infra guys to see how they approach it and the challenges they expect.
And you're right, we're all weird.
I am curious how to connect without the public internet. You mean a VPN?
AWS Direct Connect and Azure ExpressRoute
We are all nerds because we love the technology, science, and math behind it.
this is exactly the type of pointer i was hoping for, thank you
At multiple points in my career I stumbled upon stuff from Brendan Gregg. He is highly skilled in large-scale distributed computing but also down to the nitty gritty details (bits).
Are there any downloadable materials and lecture videos?
kyle kingsbury/aphyr of jepsen seems like an exemplar of #1
this is an awesome rec thank you
I don't have recommendations like others here. But as a junior engineer still coming up to speed with real engineering, I'd really appreciate it if this course were made open (in terms of lectures, assignments etc) to help folks like me audit & learn
1) in addition to the excellent recommendations already mentioned:
Brendan Gregg has a lot of good stuff about monitoring and performance analysis https://brendangregg.com/ https://github.com/brendangregg
Also Jess Frazelle (lots of good stuff, esp around containerization): https://blog.jessfraz.com/ https://github.com/jessfraz
Marc Brooker @mjb https://news.ycombinator.com/user?id=mjb
Also Mahesh Balakrishnan (Yale, Facebook/Meta, Confluent) https://muratbuffalo.blogspot.com/search?q=mahesh
1) you should reach out to the Convex.dev folks. They have built a solid infra platform, and their backend is open sourced(ish). They are ex-Dropbox as well. And finally they love to share!
2) I think multiplayer games could be interesting! Lots of meat while still having a lot of space to calibrate the scope.
convex is really elegant and now that you mention it, multiplayer games like their ai-town agent sim is such a great fit for the class - thank you
About deployment to cloud: https://news.ycombinator.com/item?id=38988238
Not unsung, but Jay Kreps has made original contributions to the practice of building large scale systems. He also built a big business around it, so that perspective might also be interesting to students.
Charity Majors (https://charity.wtf) is a great writer and speaker, and her work on observability is directly relevant to infra at scale.
Quite a strong cast of presenters back in Jan 2024 https://cs153.stanford.edu/syllabus.html
thanks for noticing! this is the first time we're expanding it from 'security at scale' to 'infra at scale', but we've taught this course 2 yrs in a row now
curious to learn how many undergrads took this?
couldn't find the syllabus
deploy something like cassandra and make a system that can update the kernel on the servers running the databases without downtime or losing data
or come up with some distributed blob store thing/cdn for world wide users
my whole career has been automating updates for software or operating systems lol
Maybe reach out to Netflix's live streaming dept. since we all learn so much more from our own failures.
Cheers!
i didn't know you could do that! how does one volunteer to teach a course?
Infrastructure for gov cloud is another beast and might make a fun case study
Also the folks at a company like render, railway, or even supabase might be fascinating - what it takes to write an infra abstraction at scale