LeftHandPath 6 days ago

Immutability is a fantastic tool, especially when working with enterprise data. It's relatively easy to implement your own temporal tables on most existing databases, no special libraries or tools required. I'll admit I first stumbled into the concept using the AS400 at work: if you make a mistake on payroll in IBM's old MAPICS program, you don't overwrite or delete it. You introduce a new "backout record" to nullify it, then (maybe) insert another record with the correct data. It seems trivial once you've seen the pattern.
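
Something like this, as a toy sketch (hypothetical field names, not MAPICS's actual layout): a correction never touches the original row, it appends a negating row plus a corrected one, and the current figure is just a fold over the history.

  // Append-only payroll ledger: corrections are new rows, never edits.
  struct PayrollEntry {
      employee_id: u32,
      amount_cents: i64, // a backout record simply negates a prior amount
  }

  fn net_pay(entries: &[PayrollEntry], employee_id: u32) -> i64 {
      entries
          .iter()
          .filter(|e| e.employee_id == employee_id)
          .map(|e| e.amount_cents)
          .sum()
  }

  fn main() {
      let ledger = vec![
          PayrollEntry { employee_id: 7, amount_cents: 150_000 },  // mistake
          PayrollEntry { employee_id: 7, amount_cents: -150_000 }, // backout
          PayrollEntry { employee_id: 7, amount_cents: 155_000 },  // corrected
      ];
      assert_eq!(net_pay(&ledger, 7), 155_000);
  }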

I've made a few non-technical eyes go wide by explaining A) that this is done and B) how it is done. The non-tech crypto/blockchain enthusiasts I've met get really excited when they learn you can make a set of data immutable without blockchain / merkle trees. Actually, explaining that is a good way to introduce the concept of a merkle tree / distributed ledger, and why "blockchain" is specifically for systems without a central authority.

(Bi)Temporal and immutable tables are especially useful for things like HR, PTO, employee clock activity, etc. Helps keep things auditable and correct.

  • layer8 6 days ago

    Without specific support from the RDBMS, bitemporal schemas are difficult with regard to cross-table references, such as foreign keys. Rows that need to be consistent between tables aren’t necessarily 1:1 anymore, but instead each row in one table needs to be consistent with all corresponding rows in the other table having an intersecting time interval. You then run into problems with transaction isolation and visibility.
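
    To make that concrete, a rough sketch in Rust (hypothetical types, half-open intervals): the referential check stops being a row lookup and becomes an interval-coverage problem, which a plain FOREIGN KEY can't express.

      // Half-open valid-time interval: [from, to)
      #[derive(Clone, Copy)]
      struct Interval { from: i64, to: i64 }

      fn intersects(a: Interval, b: Interval) -> bool {
          a.from < b.to && b.from < a.to
      }

      // A child row is consistent only if every instant of its interval
      // is covered by some version of the referenced parent row.
      // Assumes parent versions are sorted by `from` and non-overlapping.
      fn covered(child: Interval, parents: &[Interval]) -> bool {
          let mut cursor = child.from;
          for p in parents {
              if p.from <= cursor && p.to > cursor {
                  cursor = p.to;
                  if cursor >= child.to {
                      return true;
                  }
              }
          }
          false
      }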

    • pyrale 5 days ago

      > bitemporal schemas are difficult with regard to cross-table references

      Who needs more than one table? >:)

      More complex models can be built and stored separately. The great benefit of this method is that, once you're unhappy with your table model, you can trash it and rebuild it from scratch without worrying about data migration.

      • layer8 5 days ago

        Your last sentence sounds more like event sourcing than bitemporal databases, which are quite different concepts. I don’t see how bitemporal schemas simplify schema migration.

        • pyrale 5 days ago

          > I don’t see how bitemporal schemas simplify schema migration.

          It's not the bitemporality that helps, it's primary data immutability.

          The event sourcing community has its own specifics (it advocates saving decisions, not outside data), but not on this aspect: if you store events immutably, as this article describes, you are bound to benefit from read models that you can trash and rebuild at will.
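
          In code terms, a minimal sketch (toy event type, nothing framework-specific): the log is the source of truth and any read model is a fold over it, so a new model is one replay away.

            use std::collections::HashMap;

            // Immutable, append-only primary data.
            enum Event {
                Deposited { account: String, cents: i64 },
                Withdrawn { account: String, cents: i64 },
            }

            // A read model is just a fold over the log; trash it and
            // replay whenever you change your mind about its shape.
            fn balances(log: &[Event]) -> HashMap<String, i64> {
                let mut view = HashMap::new();
                for e in log {
                    match e {
                        Event::Deposited { account, cents } => {
                            *view.entry(account.clone()).or_insert(0) += *cents
                        }
                        Event::Withdrawn { account, cents } => {
                            *view.entry(account.clone()).or_insert(0) -= *cents
                        }
                    }
                }
                view
            }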

    • hobs 6 days ago

      Pretty much, you want triggers to store things in a schemaless fashion in an audit format so that you are free to migrate tables.

      This does require either knowing the schema at the point in time or recording enough information to do a schema on read.

      The other option is, of course, to basically run a table like an API: always adding, never removing.
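
      For the schema-on-read variant, one sketch of the read side (using serde_json; the field names are made up): the trigger dumps rows as loosely-typed JSON, and readers probe for whichever field name the schema used at that point in time.

        use serde_json::Value;

        // Audit rows captured by a trigger as schemaless JSON blobs;
        // the reader tolerates every schema the table has ever had.
        fn employee_name(audit_row: &Value) -> Option<&str> {
            audit_row
                .get("name") // current schema
                .or_else(|| audit_row.get("full_name")) // older schema
                .and_then(Value::as_str)
        }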

  • refset 6 days ago

    > It's relatively easy to implement your own temporal tables on most existing databases

    It gets tricky when you need to change the schema without breaking historical data or queries. SQL databases could do a lot more to make immutability easier and widespread.

    • jiggawatts 6 days ago

      One fundamental issue I’ve noticed is that typical SQL databases have a single schema per table defining both the logical and physical aspects, usually with a strong correlation between the two.

      Databases could treat the columns as the fundamental unit with tables being not much more than a view of a bunch of columns that can change over both space (partitioning) and time (history).

      • bobnamob 5 days ago

        That’s effectively how Datomic works. Datoms are the fundamental unit, with attributes being analogous to column names and views being the four indexes that Datomic keeps.
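
        Roughly this shape (a simplified sketch, not Datomic's actual types): every fact is a five-tuple, and a "table" is nothing more than a sort order over those tuples (EAVT, AEVT, AVET, VAET).

          // The datom: entity / attribute / value / transaction / op.
          struct Datom {
              entity: u64,
              attribute: String, // analogous to a column name
              value: String,
              tx: u64,
              added: bool, // false = retraction; nothing is overwritten
          }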

  • teleforce 6 days ago

    > Actually, explaining that is a good way to introduce the concept of a merkle tree / distributed ledger, and why "blockchain" is specifically for systems without a central authority

    This is a very important point: whatever systems or solutions you build, do not overengineer, and always remember that premature optimization is the root of all evil.

    It used to be blockchain, and now it seems ML/AI is the new fad. Most likely the majority of solutions being designed with ML/AI today don't need it, and using it anyway just makes them expensive/slow/complex/non-deterministic/etc.

    People need to wake up and smell the coffee: ultimately ML/AI is just one tool among many in the toolbox.

gatane 6 days ago

My main gripe with immutability is that updating data requires building a full copy of the data with the changes. Sure, you could have zippers to aid in the updating process by acting as a kind of cursor/pointer, but raw access to data beats them anytime (even if you optimize for cache).

So if you had to optimize for raw speed, why not choose mutable data?

https://ksvi.mff.cuni.cz/~sefl/papers/zippers.pdf

  • KingMob 5 days ago

    > My main gripe with immutability is that updating data requires building a full copy of the data with the changes.

    That's not generally true. Many immutable languages use "persistent" data structures, where "persistent" means that much of the original structure persists in the new one.

    For more, see:

    - Purely Functional Data Structures by Okasaki: https://www.cs.cmu.edu/~rwh/students/okasaki.pdf

    - Phil Bagwell's research, e.g. https://infoscience.epfl.ch/record/64398/files/idealhashtree...
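
    A tiny illustration of the sharing in Rust (an Rc-based cons list; real implementations like Bagwell's hash tries are far more elaborate): "updating" allocates one node and shares the entire tail.

      use std::rc::Rc;

      // A persistent cons list: "updating" prepends a node; the old
      // and new versions share the whole tail.
      enum List {
          Nil,
          Cons(i32, Rc<List>),
      }

      fn main() {
          let tail = Rc::new(List::Cons(2, Rc::new(List::Cons(3, Rc::new(List::Nil)))));
          let old = List::Cons(1, Rc::clone(&tail)); // [1, 2, 3]
          let new = List::Cons(9, Rc::clone(&tail)); // [9, 2, 3], no copy of [2, 3]
          let _ = (old, new); // both versions persist, sharing structure
      }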

  • munchler 6 days ago

    > My main gripe with immutability is that updating data requires building a full copy of the data with the changes.

    That is not true in general. There are plenty of data structures that can be updated without forcing a full copy: lists, trees, sets, maps, etc. All of these are common in functional programming. This is discussed in the article (e.g. "Append-Only Computing").

    • sarchertech 6 days ago

      If you really care about performance, iterating over all of those is going to be much, much slower than iterating over an array.

      • munchler 6 days ago

        If you really care about multi-threading, mutating array elements is going to be much buggier than using an immutable data structure.

        • sarchertech 6 days ago

          Well sure, but the OP wrote

          >if you had to optimize for raw speed, why not choose mutable data?

          So in context we are talking about a case where we have to optimize for raw speed.

          It doesn’t matter that immutable data is easier to reason about if you don’t have the performance budget to go that route.

          • reubenmorais 5 days ago

            Raw speed these days means concurrent processing, so those two are more and more often the same case. The whole "rewrite it in Rust" trend is a very clear example of the benefits of easier correctness of concurrent programming - Rust programs end up being faster than other alternatives even though on paper C has better "raw speed" (e.g. no bounds checking).

            • sarchertech 5 days ago

              1. Raw speed on modern CPUs means taking advantage of data locality more than anything else. Even concurrency. Cache misses will cost you a few hundred cycles, far too much to make up for with concurrency in most cases.

              2. Of course given a sufficiently large array, iterating over it with 16 processors is faster than with 1. Arrays still dominate other data structures for raw performance here.

              3. Concurrency doesn’t just mean multithreading. SIMD instructions can perform simultaneous operations on multiple operands in your array. Can’t do this with a linked list.
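
              To illustrate (a toy shape, actual numbers depend on the machine): the contiguous version below is auto-vectorizable and prefetch-friendly, while the list version chases a pointer per element.

                use std::collections::LinkedList;

                // Contiguous: 16 i32s per cache line, and the compiler is
                // free to auto-vectorize this loop with SIMD.
                fn sum_slice(v: &[i32]) -> i64 {
                    v.iter().map(|&x| x as i64).sum()
                }

                // Pointer chasing: every node is a potential cache miss,
                // and SIMD can't load "the next 8 nodes" at once.
                fn sum_list(l: &LinkedList<i32>) -> i64 {
                    l.iter().map(|&x| x as i64).sum()
                }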

              • reubenmorais 5 days ago

                Yes you can write a very fast SIMD loop over densely packed data. But if that data is mutable and you need to acquire a lock before you work with it, it's very easy to lose all the performance you gained. Immutability can reduce coordination costs and improve effective parallelism.

                For a similar reason immutability also helps you write code with fewer data races.
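
                One common shape for that (a std-only sketch): the writer publishes a fresh immutable snapshot behind a brief pointer swap, so the hot loop never holds a lock.

                  use std::sync::{Arc, Mutex};

                  // Readers briefly lock to clone the Arc (a pointer copy),
                  // then crunch the immutable snapshot with no lock held.
                  fn read_sum(shared: &Mutex<Arc<Vec<f64>>>) -> f64 {
                      let snapshot = Arc::clone(&shared.lock().unwrap());
                      snapshot.iter().sum()
                  }

                  // Writers never mutate in place: build a new snapshot,
                  // swap the pointer.
                  fn publish(shared: &Mutex<Arc<Vec<f64>>>, next: Vec<f64>) {
                      *shared.lock().unwrap() = Arc::new(next);
                  }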

                • sarchertech 5 days ago

                  A single-threaded SIMD loop over densely packed data will outperform the same transformation on a linked list running on 50 threads (this is obviously an overgeneralization and there are transformations and data layouts where this doesn’t hold, but it’s very common. You could also construct cache-line-aware hybrid data structures, but there are trade-offs).

                  The only reason you’d need to deal with increasing parallelism (beyond SIMD) is if you wanted it even faster than that.

                  I’m not saying immutable data isn’t a good idea in many cases (my primary day job language these days is Elixir). What I am saying is that if you are “optimizing for raw speed” immutable data structures are almost never the right choice.

                  That doesn’t mean immutable data structures can’t be fast enough to be the best choice in many situations.

        • MrJohz 5 days ago

          Unless you encode ownership into the type system, and then you kind of have the best of both worlds: you don't have functions mutating things unexpectedly or by accident, but you can explicitly opt into mutation when it would be beneficial. Opting into mutation requires you to have exclusive control over the data (i.e. nowhere else in your program will mutate it at the same time), which avoids issues where different threads are trying to change the same data at the same time.
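
          In Rust terms, a minimal sketch: a `&mut` borrow is a compile-time proof of exclusivity, so mutation is explicit and alias-free.

            // `&mut` claims exclusive access: while `bump` holds it, no
            // other reference to `scores` can exist, let alone mutate it.
            fn bump(scores: &mut Vec<i32>) {
                for s in scores.iter_mut() {
                    *s += 1;
                }
            }

            fn main() {
                let mut scores = vec![1, 2, 3];
                let shared = &scores; // any number of shared borrows is fine
                println!("{shared:?}");
                bump(&mut scores); // ok: the shared borrow has ended
                // Calling bump while `shared` was still live would not compile.
            }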

          • mrkeen 5 days ago

            Requiring exclusive ownership avoids the issues, but it also avoids the features.

            Sometimes you actually do want multiple threads working with data.

            • MrJohz 5 days ago

              And there are patterns for that which allow you to convert a static exclusivity check into a dynamic one: something like a mutex, where multiple threads can share the mutex simultaneously, but only one thread at a time can gain access to its contents. You still enforce that mutation requires exclusive access to an object, but you are now enforcing that at runtime instead of compile time.

              You never want multiple threads to be mutating the same data without some form of synchronisation, but with ownership rules, you can still have that synchronisation.
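
              A sketch of that runtime-checked version: many threads share the mutex, and `lock()` hands out exclusive access to its contents one thread at a time.

                use std::sync::{Arc, Mutex};
                use std::thread;

                fn main() {
                    // Many threads share ownership of the mutex...
                    let counter = Arc::new(Mutex::new(0u64));
                    let handles: Vec<_> = (0..4)
                        .map(|_| {
                            let counter = Arc::clone(&counter);
                            // ...but only one at a time can reach the data,
                            // enforced at runtime rather than compile time.
                            thread::spawn(move || *counter.lock().unwrap() += 1)
                        })
                        .collect();
                    for h in handles {
                        h.join().unwrap();
                    }
                    assert_eq!(*counter.lock().unwrap(), 4);
                }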

  • mrkeen 5 days ago

    Someone should try it with Postgres. Make a raw-speed branch that gets rid of the overhead of MVCC:

      while querying a database each transaction sees a snapshot of data (a database version) as it was some time ago, regardless of the current state of the underlying data
    
      https://www.postgresql.org/docs/7.1/mvcc.html

    • ahoka 5 days ago

      That’s not exactly how PostgreSQL works. This is true only at certain isolation levels.

gleenn 6 days ago

I love the quote "accountants don't use erasers". So many things should be modeled over time and track change right out of the gate. Little things like Ruby on Rails always adding timestamps to model tables were super helpful, but also a bit of a code smell: if this is obvious enough to be useful everywhere, what is the next level? One more reason Datomic is so cool: nothing is overwritten, it is overlaid with a newer record, so you can always look back, and you can always take a slice of the DB at a specific time and get a complete and consistent view of the universe at that time. Immutability!
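
The time-slice trick is easy to picture (a toy sketch, not Datomic's actual API): every fact carries the transaction that asserted it, and a historical view is just a filter.

  // Every fact carries the transaction time that asserted it.
  struct Fact {
      entity: u64,
      attribute: &'static str,
      value: &'static str,
      tx: u64,
  }

  // The raw material for a snapshot "as of" tx `t`: every fact asserted
  // at or before `t`. (A real reader would then keep only the newest
  // assertion per entity/attribute pair.)
  fn as_of(facts: &[Fact], t: u64) -> Vec<&Fact> {
      facts.iter().filter(|f| f.tx <= t).collect()
  }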

  • yencabulator 4 days ago

    Accountants also have trivially simple schemas. (Though lots of complexity elsewhere.)

cowsandmilk 5 days ago

The “right to be forgotten” has caused a lot of conflicts with certain immutable data stores. If I can reconstruct a snapshot with a user’s data, have I actually “forgotten” them? Having a deadline by which merges fully occur and old data is rendered inaccessible is sometimes a legal necessity.

  • hcarvalhoalves 5 days ago

    You can always "redact" previous data. You can treat the sensitive entries themselves as mutable, without breaking the system design around immutable data.

    I have also seen a scheme where you store a hash in the log and keep a separate lookup table for the sensitive data, which you can redact more easily without messing with the log.
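
    A sketch of that second scheme (names made up): the append-only log only ever holds an opaque token, and forgetting someone is a delete in the mutable lookup table, leaving the log and its hashes intact.

      use std::collections::HashMap;

      // The append-only log stores only an opaque token per subject.
      struct LogEntry {
          token: u64,
          action: String,
      }

      // The sensitive data lives in a separate, mutable lookup table;
      // "forgetting" deletes the row here and never touches the log.
      fn forget(pii: &mut HashMap<u64, String>, token: u64) {
          pii.remove(&token);
      }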

  • mrkeen 5 days ago

    Likewise with database backups.

prydt 6 days ago

One of my favorite papers! This reminds me of Martin Kleppmann's work on Apache Samza and the idea of "turning the database inside out" by hosting the write-ahead log on something like Kafka and then having many different materialized views consume that log.

Seems like a very powerful architecture that is both simple and decouples many concerns.

  • 082349872349872 5 days ago

    In their 1992 Transaction Processing book*, Gray and Reuter extrapolate h/w and s/w trends forward and predict that the DBMS of their far future would look like a tape robot for backing store with materialised views in main memory.

    Substitute streams for tape i/o, and this description of Samza sounds like it could be very similar to that vision.

    * as far as I know, their exposition of the WAL and tradeoffs in its implementation has aged well. Any counter opinions?

lbj 6 days ago

I have to say, I really love the title :)

  • cacozen 6 days ago

    I guess “Immutability changes nothing” wouldn’t have the same impact

skybrian 6 days ago

Editors and form validation are where this gets tricky. The user isn't just reporting new, independent observations to append to a log. They're looking at existing state and deciding how to react to it. Sometimes avoiding constraint violations with other state that they're not looking at is also important.

It often works out, but if you're not looking at the right version then you're risking a merge conflict.
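
The usual mitigation is optimistic concurrency, sketched minimally below: the edit carries the version the user was looking at, and the write is rejected if the state has moved on.

  struct Document {
      version: u64,
      body: String,
  }

  // The editor submits the version it rendered. A stale version means
  // the user reacted to state that no longer exists, so surface a
  // conflict instead of silently overwriting.
  fn apply_edit(doc: &mut Document, based_on: u64, new_body: String) -> Result<(), String> {
      if doc.version != based_on {
          return Err(format!("conflict: edit based on v{}, doc is at v{}", based_on, doc.version));
      }
      doc.body = new_body;
      doc.version += 1;
      Ok(())
  }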

niuzeta 6 days ago

Semi-related, but are there any repositories that collect these technical white papers? I'm fascinated by these papers whenever they show up in my feed and I gorge on them, and I'd love more. I can't be the only one thinking this way.

sstanfie 6 days ago

Needs more exclamation points!