Feldera Incremental Compute Engine
gzel | 145 points | 9mon ago | github.com
arn3n|9mon ago
If you don’t want to change your whole stack, ClickHouse’s Materialized Views do something extraordinarily similar: computations are run on inserts to the source table in an online/streaming manner. I’m curious how this solution compares in its set of features/guarantees.
atombender|9mon ago
ClickHouse's materialized views are wonderful, but they do require very careful design up front.
In particular, aggregations need to be defined using the special AggregateFunction data types, which must be paired with the corresponding aggregation functions such as countMerge(). Joins are possible in CH views, but they operate in a specific way (against the insert batch) that you must know about; joins against other tables are generally a bad idea for performance, and you should use dictionaries as much as possible for fast in-memory lookups. Lastly, it's also hard to update MVs, because their entire source query has to be replaced as a whole. Adding a column requires re-declaring the whole MV, which introduces the possibility of making mistakes in your migrations.
CH views are really more like triggers, and so they're a little misleadingly named. Very powerful, of course. In short, a lot more "manual" than this other system.
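To make the AggregateFunction point concrete, here's a rough Python sketch (illustrative only, not ClickHouse code) of the partial-state-plus-merge pattern that AggregateFunction columns and the -State/-Merge combinators implement:

    # Illustrative Python, not ClickHouse code. Each insert batch produces
    # a partial aggregate state (roughly what countState() stores in an
    # AggregateFunction column); reads merge states (what countMerge() does).
    class CountState:
        def __init__(self, n=0):
            self.n = n

        @staticmethod
        def from_batch(rows):
            return CountState(len(rows))         # per-batch partial state

        def merge(self, other):
            return CountState(self.n + other.n)  # read-side merge

    batches = [[1, 2], [3], [4, 5, 6]]
    states = [CountState.from_batch(b) for b in batches]
    total = CountState()
    for s in states:
        total = total.merge(s)
    print(total.n)  # 6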
lsuresh|9mon ago
For incremental computation, Feldera is just way more powerful and general. It can evaluate arbitrarily sophisticated SQL programs incrementally (tables and deeply nested layers of views). For example, it can do rolling aggregates over joins, handle late and out-of-order arrivals, compute over infinite streams with finite state (via automatic garbage collection), and stay strongly consistent. ClickHouse's materialized views are much simpler and more restricted in comparison.
That said, we are /not/ ever a replacement for ClickHouse or any other historical warehouse. In fact, we pair best with one of them. Have data flow through or be teed into Feldera to maintain sophisticated standing queries -- and maintain historical data in your warehouse.
nextaccountic|9mon ago
Could Feldera be a Postgres extension?
rebanevapustus|9mon ago
Big fan of Feldera here.
I would advise everybody to steer clear of anything that isn't Feldera or Materialize. Nobody aside from these guys has an IVM product that is grounded in proper theory.
If you are interested in trying out the theory (DBSP) underneath Feldera, but in Python, then check this out: https://github.com/brurucy/pydbsp
It works with pandas, polars...anything.
jamesblonde|9mon ago
It's based on Z-Sets - a generalization of relational algebra. Many of the aggregations, projections, and filters from SQL are associative and can be implemented on Z-Sets. Z-Sets support incremental operations (adding one value to a set while computing the 'max' just takes the max of the previous result and the new value, rather than recomputing the 'max' over the entire set).
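A rough Python sketch of the Z-set idea (illustrative only): a collection is a map from rows to integer weights, an insert is weight +1, a delete is weight -1, and applying a change is just adding weights:

    # Minimal Z-set sketch: rows mapped to integer weights.
    # Applying a delta is summing weights, which is what makes
    # the relational operators incremental.
    from collections import Counter

    def apply_delta(zset, delta):
        out = Counter(zset)
        for row, w in delta.items():
            out[row] += w
            if out[row] == 0:
                del out[row]          # weight 0 = row absent
        return out

    z = Counter({("alice",): 1, ("bob",): 1})
    delta = Counter({("bob",): -1, ("carol",): 1})  # delete bob, insert carol
    print(apply_delta(z, delta))  # Counter({('alice',): 1, ('carol',): 1})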
nikhilsimha|9mon ago
dumb question: how do z-sets or feldera deal with updates to values that were incorporated into the max already?
For example - max over {4, 5} is 5. Now I update the 5 to a 3, so the set becomes {4, 3} with a max of 4. This seems to imply that the z-sets would need to store ALL the values - again, in their internal state.
Also there needs to be some logic somewhere that says that the data structure for updating values in a max aggregation needs to be a heap. Is that all happening somewhere?
lsuresh|9mon ago
We use monotonicity detection for various things. I believe (I can double check) that it's used for max as well. But you're correct that in the general case, max is non-linear, so it will need to maintain state.
Update from Leonid on current implementation: each group is ordered by the column on which we compute max, so it's O(1) to pick the last value from the index.
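Schematically, it's something like the following (hypothetical Python; a plain sorted list stands in for the LSM-backed index, so its writes are O(N), unlike the real index):

    # Per-group index kept ordered by the MAX column: reading the max
    # is O(1) (last entry). Feldera uses an LSM tree for the index;
    # bisect.insort on a flat list is only a stand-in.
    import bisect

    class GroupIndex:
        def __init__(self):
            self.values = []                   # sorted by the MAX column

        def insert(self, v):
            bisect.insort(self.values, v)

        def delete(self, v):
            i = bisect.bisect_left(self.values, v)
            if i < len(self.values) and self.values[i] == v:
                self.values.pop(i)

        def max(self):
            return self.values[-1] if self.values else None  # O(1) read

    g = GroupIndex()
    g.insert(4); g.insert(5)
    g.delete(5); g.insert(3)   # the 5 -> 3 update from the example above
    print(g.max())             # 4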
nikhilsimha|9mon ago
So the writes are O(N) then - to keep reads at O(1)?
ryzhyk|9mon ago
Both reads and writes are O(1) in time complexity. Writes additionally have the log(N) amortized cost of maintaining the LSM tree.
nikhilsimha|9mon ago
gotcha! thanks for the clarification
shuaiboi|9mon ago
Just a guess... would like to hear the answer as well.
they probably have a monotonicity detector somewhere, which can decide whether to keep all the values or discard them. If they keep them, they probably use something like a segment tree to index.
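For reference, a toy version of the segment-tree idea (point updates and a global max query, both O(log n)):

    # Toy segment tree over a fixed-size array of values.
    class SegTree:
        def __init__(self, n):
            self.n = n
            self.t = [float("-inf")] * (2 * n)

        def update(self, i, v):
            i += self.n
            self.t[i] = v
            while i > 1:                       # propagate max upward
                i //= 2
                self.t[i] = max(self.t[2 * i], self.t[2 * i + 1])

        def max(self):
            return self.t[1]                   # root holds the global max

    st = SegTree(4)
    st.update(0, 4); st.update(1, 5)
    print(st.max())  # 5
    st.update(1, 3)  # the 5 becomes a 3
    print(st.max())  # 4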
ryzhyk|9mon ago
That's right, we perform static dataflow analysis to determine what data can get discarded. GC itself is done lazily as part of LSM tree maintenance. For MAX specifically, we don't have this optimization yet. In the general case, incrementally maintaining the MAX aggregate in the presence of insertions and deletions requires tracking the entire contents of the group, which is what we do. If the collection can be proved to be append-only, then it's sufficient to store only the current max element. This optimization is still coming to Feldera.
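Schematically, the two cases look something like this (hypothetical Python, not Feldera's actual code):

    # General case vs. append-only case for incremental MAX.
    from collections import Counter

    class GeneralMax:
        """Insertions AND deletions: must retain the whole group."""
        def __init__(self):
            self.group = Counter()           # value -> multiplicity

        def apply(self, value, weight):      # +1 = insert, -1 = delete
            self.group[value] += weight
            if self.group[value] == 0:
                del self.group[value]

        def max(self):
            return max(self.group) if self.group else None

    class AppendOnlyMax:
        """Provably append-only: the current max is all the state needed."""
        def __init__(self):
            self.current = None

        def apply(self, value):
            if self.current is None or value > self.current:
                self.current = value

    g = GeneralMax()
    g.apply(4, +1); g.apply(5, +1)
    g.apply(5, -1)                 # delete the 5
    print(g.max())                 # 4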
lsuresh|9mon ago
Yes, we do a lot of work with monotonicity detection. It's central to how we perform automatic garbage collection based on lateness.
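As a toy illustration (all names made up), lateness-based GC boils down to dropping window state once the watermark guarantees no more late arrivals can touch it:

    # Toy sketch: tumbling windows of size 5 with a lateness bound of 10.
    # Once watermark >= window_end + LATENESS, that window can receive no
    # more updates, so its state is dropped.
    WINDOW, LATENESS = 5, 10
    state, watermark = {}, 0       # window_start -> running max

    def ingest(ts, value):
        global watermark
        w = (ts // WINDOW) * WINDOW
        state[w] = max(state.get(w, value), value)
        watermark = max(watermark, ts)
        for closed in [s for s in state if s + WINDOW + LATENESS <= watermark]:
            del state[closed]      # garbage-collect sealed windows

    for ts, v in [(1, 4), (3, 5), (7, 2), (25, 9)]:
        ingest(ts, v)
    print(state)  # {25: 9} -- windows 0 and 5 were GC'd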
nextaccountic|9mon ago
> Z-Sets - a generalization of relational algebra
Does this have a paper or some material describing the theory?
lsuresh|9mon ago
This is our paper describing the theory underlying Feldera: https://www.vldb.org/pvldb/vol16/p1601-budiu.pdf
shuaiboi|9mon ago
This is pretty neat, but I'm wondering how well this implementation obeys dataframe algebra. Ponder goes into detail about how dataframes and relations aren't the same, but your dataframe zset seems to be more or less the exact same thing as the relation zset?
https://youtu.be/7TyIjqvfWto?si=CMFH30DFEWxkltlw&t=1095
rebanevapustus|9mon ago
It does not. The example I give in the README is only meant to show how easy it is to use it to "streamify" regular relational dataframe operations.