Thesis: Partial State in Dataflow-Based Materialized Views
Oct 23, 2020 17:43 · 8347 words · 40 minute read
Hey everyone! Hello. Welcome to Jon Gjengset’s thesis defense. And thank you for attending. I want to say just a couple of words before handing over to Jon. In his research life, Jon’s been pursuing a really neat vision for how busy web sites ought to manage their data. It’s been exciting to watch him develop this vision and impressive to see his skill in realizing it. But in parallel with research, Jon’s been a hugely energetic contributor to all the communities he touches. From his infinite patience as a TA, to organizing graduate student events, to live-streaming systems programming tutorials.
Jon’s sure to be valuable and valued whatever he sets his hand to. And assuming his committee lets him go, many people at MIT are going to miss him. I certainly will. So with that, I’ll hand over to Jon. Thank you Robert, I appreciate that! So, thank you everyone for coming, and welcome to my doctoral dissertation presentation where I’ll be presenting my thesis on Partial State in Dataflow-Based Materialized Views. Now, my goal with this presentation is to try to summarize the past six years of work or so in about 45 minutes. And hopefully I will succeed. And at least, hopefully, by the end, you will understand what the words in the title mean.
And why I think it was worthwhile to spend six years doing this work. Now, don’t worry, this work was supervised. My committee, Robert, who you just saw, but also Frans, Sam, and Malte, have been working with me to figure out whether this work is worthwhile, and to make sure that what comes out of it is something that is valuable. So, I want to start this presentation with “Why are we here?”. And not in the existential sense of, like, “why are we all on this planet?”.
01:48 - But more in the sense of “Why am I here?” — “Why am I talking to you at all?”. I am here because I want to make databases better. And in order to understand the way in which I want to do that, we first need to talk a little bit about what databases are, in order to understand the problems that some applications have with them. So let’s do a little bit of database 101. So here, in databases, you just sort of take some tables. In this case, we’re going to work with a stories table and a votes table.
02:16 - And these two tables hold stories and votes. And then the idea is that, as an application, you can query these tables by issuing “queries”. And these queries might do all sorts of things, like they might aggregate and combine values, or they might join different tables together, in order to produce some application data that it might care about. And the query operations are here shown in orange. If you want to modify the data, then you perform an “insert” or an “update” or a “delete” directly into the tables at the bottom. So here, if you wanted to insert a new vote, then you just make that insert operation directly to the votes table.
02:55 - And even though inserts in SQL, and updates and deletes, are “queries”, I will not be referring to them as such in this talk. Rather, I will be talking about queries as things that read data. Think SELECTs. Whereas inserts, updates, and deletes are “writes” or “updates”. And now that we’ve looked at this picture, you might see that there is a lot more orange here than blue. And what this indicates is that if you do a read, you have to do more work than if you’re doing a write.
But this is unfortunate, because in practice many applications, and in particular web applications, are read-heavy. They do a lot more reads relative to the number of writes. And so it seems unfortunate that the reads are also the most expensive operation. Ideally we want some way to mitigate this problem, because it becomes pretty severe once you look at applications that issue many, many read queries that are either identical or similar, where a lot of the work of executing any given query is the same.
And it’d be great if we didn’t have to do this repeated, unnecessary work. And you might say, well Jon, this problem has been solved: you just use a cache. And it’s true that caches are great. Caches are essentially a storage system that you place in front of your database. And the idea is that the queries look in the cache, and if the result is there, then they just return immediately, and the reads are now fast. And then only if you miss in the cache — if the result you’re after is not there — you issue a query to the backend database.
04:24 - And it’s true that this does make queries fast. There are also some other similar schemes, like denormalization, where you store and maintain derived values (such as the number of votes for a story) directly in the stories table, which has a similar effect. And it’s true that caches are great, but unfortunately they’re also really hard to get right. This diagram is an attempt to summarize some of the things that you have to get right in the application in order to use this kind of caching strategy correctly. So, the cache is great, but as the data in the tables changes, the cache becomes out of date. And it needs to somehow be refreshed.
So imagine that we insert a new vote into the votes table: the cached vote count is now wrong. And so the code path in the application that inserts this new vote also needs to invalidate, or somehow mark the cache as being outdated, and maybe refresh that value. Furthermore, even though it is easy to say that if you miss in the cache you have to query, it’s hard to figure out how this should work in practice. Imagine that 100,000 people want to read a particular story, but that story is not present in the cache. Do all those application queries swarm the database, and all execute the same query? That seems wasteful too.
And even if they do that, we still need some way to fill the cache. If all of them read when they miss, someone has to write that value back into the cache. And orchestrating who should do that is not entirely trivial. Beyond this, over time, the cache is going to accumulate more and more values. And we only have so many resources on the server that’s serving these requests. So we need some way to evict entries that are old or unpopular from this cache. But then the question arises again: who does this eviction, and how? Hopefully this image gives you an idea of just how much needs to be figured out by the application developers in order to make databases serve these kinds of read-heavy workloads. And so in my research, I’ve been looking specifically at this problem of automatic database caching. Specifically, is there a way that we can get all this cache logic to not reside in the application, and instead be provided by the database? That is what my thesis implements, and what we’re going to be looking at for the rest of the talk.
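To make the amount of application-side machinery concrete, here is a minimal, illustrative sketch of the cache-aside pattern described above. Everything here (the in-memory `Cache` and `Database` types, the vote-count query) is invented for the example; it is not Noria or any particular library.

```rust
use std::collections::HashMap;

// Hypothetical in-process stand-ins for the cache and the database,
// just to make the control flow concrete.
struct Cache(HashMap<u64, u64>); // story id -> cached vote count
struct Database(HashMap<u64, u64>); // story id -> vote count (source of truth)

impl Database {
    // The "expensive" query that the cache is trying to avoid re-running.
    fn count_votes(&self, story: u64) -> u64 {
        *self.0.get(&story).unwrap_or(&0)
    }
}

// Read path: check the cache, fall back to the database on a miss,
// and write the result back so later readers hit.
fn read_vote_count(cache: &mut Cache, db: &Database, story: u64) -> u64 {
    if let Some(&count) = cache.0.get(&story) {
        return count; // cache hit: the fast path
    }
    // Cache miss: every concurrent reader that misses here re-runs the same
    // query unless the application adds its own coordination.
    let count = db.count_votes(story);
    cache.0.insert(story, count); // someone has to fill the cache
    count
}

// Write path: the application must also remember to invalidate (or refresh)
// the derived value, or readers will see a stale count indefinitely.
// Deciding when and what to evict for capacity reasons is also left to us.
fn add_vote(cache: &mut Cache, db: &mut Database, story: u64) {
    *db.0.entry(story).or_insert(0) += 1;
    cache.0.remove(&story);
}
```

None of this is hard in isolation; the point is that all of it lives in the application, and every query with a derived result needs its own version of it.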
06:48 - Now, the way we’re going to do that is we’re going to be walking through the title of the thesis. Which, to remind you, is partial state in dataflow-based materialized views. And what we’re going to do is we’re going to parse the title from right to left. And then move through the words and figure out what they mean and how they fit together. So let’s start with materialized views. And in particular, why they’re useful in this context. Materialized views have been around for a long time; they were invented by the database community in the early 1980s. And essentially they are running the query and then remembering the result. Which sounds like a fairly straightforward thing — it sounds very much like a cache. And the key question with materialized views, just like with caches, is how to maintain that materialization. What happens if the data in the tables change? And so this is what is called materialized view maintenance.
And ideally we want this maintenance to be incremental. We want it to be so that we don’t have to re-execute the query that is materialized every time the underlying data changes. For example, if some story has 100,000 votes, and then one more vote comes in, we’d like to not have to count to 100,001; instead, the system should realize it can take the current count of 100,000 and just increment it by one rather than recomputing the query from scratch. We also have the question, when it comes to materialized view maintenance, of whether we maintain on write, so proactively when a new change happens to the underlying data, or whether we do it sort of lazily and on demand on later reads.
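As a tiny sketch of what “incremental” means for the vote-count example (the types and names here are made up for illustration, not Noria’s):

```rust
use std::collections::HashMap;

// Materialized result of something like
//   SELECT story_id, COUNT(*) FROM votes GROUP BY story_id
// kept as a map from story id to its current count.
struct VoteCountView {
    counts: HashMap<u64, u64>,
}

impl VoteCountView {
    // Incremental maintenance: apply the effect of one new vote directly,
    // instead of re-running the COUNT over all 100,001 votes.
    fn on_new_vote(&mut self, story_id: u64) {
        *self.counts.entry(story_id).or_insert(0) += 1;
    }

    fn read(&self, story_id: u64) -> u64 {
        *self.counts.get(&story_id).unwrap_or(&0)
    }
}
```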
08:25 - And this brings us to the next part of the thesis title, which is how we maintain these kinds of materializations. And the title gives us a clue: dataflow-based. Now, dataflow has a lot of meanings in different parts of academia, and in particular within Computer Science, but essentially it is a systems architecture. And in this talk, we’re going to be talking about dataflow as having data move to compute. The idea here is that instead of code fetching data from tables, you’re going to have the tables send data towards the compute. You can think of this a little bit like push-based computation.
09:03 - And these data changes are going to propagate through a graph of operators. These are relational operators like joins, aggregations, and filters. And you can sort of think of them as whatever operators you might use in SQL. This might sound a little hard to get your head around, and so I’m going to give an example shortly of how this works out in practice. And then each edge of the dataflow is going to indicate a data dependency. So for example, a join is going to depend on all of the inputs of that join. An aggregation is going to depend on the table, or the data, that it is aggregating over. And then the messages that flow over the edges of the dataflow are going to be deltas. A delta here is a full row: it has columns just like the rows that are coming out of the operator above, but it also has a “sign”, which is either positive or negative. If a delta has a positive sign, you can think of it as an addition, an additional query result row on top of what came before. If it has a negative sign, it’s a revocation of some result that was previously issued. You can think of it as sort of “add this” or “remove this”.
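Roughly, you can picture a delta as a signed row. The following sketch is purely illustrative (the type and field names are invented, not Noria’s internal representation):

```rust
/// A dataflow delta: a full row plus a sign.
#[derive(Clone, Debug, PartialEq)]
enum Delta {
    /// "Add this row to the result."
    Positive(Vec<Value>),
    /// "Remove this previously-reported row from the result."
    Negative(Vec<Value>),
}

/// Column values of a row (just enough variants for the examples here).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
}

// Operators consume deltas from their ancestors and emit deltas of their own.
// For instance, a count operator that already knows the old count can turn a
// single "+1 vote for story 7" delta into a retraction of the old count and
// an assertion of the new one (we will see this exact example shortly).
fn count_update(story_id: i64, old_count: i64) -> [Delta; 2] {
    [
        Delta::Negative(vec![Value::Int(story_id), Value::Int(old_count)]),
        Delta::Positive(vec![Value::Int(story_id), Value::Int(old_count + 1)]),
    ]
}
```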
And in particular, we’re going to be looking at this whole dataflow model in the context of “Noria”, which is an eventually consistent materialized view system that is built using dataflow. So let’s take a look at a concrete example of how dataflow can be used to maintain materialized views. Here on the left I show you a query, and on the right, the dataflow that Noria constructs for that query.
So the query here creates a materialized view by the name of StoryWithVC. VC for vote count here. And then defines a query that does a join between stories and votes on the story id. And then counts the number of votes for each story. You can see it groups by stories.id. And if you squint at this, you can see how some of the parts from the relational query appear in the dataflow graph. For example, there’s an aggregation here, a count, indicated by the sum operator in the dataflow.
And it groups by the stories id, which we know from the join clause is the same as the story_id column of votes. Similarly, there’s a join between stories and votes, and that is represented by the join operator in the dataflow graph. You might notice here that these seem a little out of order: the join is between stories and votes, not stories and the vote count operator. And what we see here is the effect of query optimization. Noria has decided that it’s going to do the aggregation first for optimization purposes.
11:42 - And now the question becomes: what if the application wants to read from this view at the bottom, StoryWithVC? Well, what happens is that the application issues a query over that view, which might look something like this: SELECT * FROM StoryWithVC WHERE id = ?. The question mark here means some parameter that is supplied by the application at runtime. And that parameter is essentially going to be a lookup key into the materialized view, to get out just the subset that the application cares about.
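For reference, here is a plausible reconstruction of the view definition and the read query being described (the exact column names and syntax are guesses from the description, and may differ from the thesis; the SQL is shown as string constants for consistency with the other sketches):

```rust
// The materialized view: join stories and votes on the story id,
// and count the votes per story (grouped by stories.id).
const STORY_WITH_VC: &str = "
    CREATE VIEW StoryWithVC AS
    SELECT stories.id, stories.title, COUNT(votes.user) AS vcount
    FROM stories
    JOIN votes ON votes.story_id = stories.id
    GROUP BY stories.id;
";

// The application's read over that view, with `?` filled in at runtime:
const READ_STORY_WITH_VC: &str = "SELECT * FROM StoryWithVC WHERE id = ?";
```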
12:16 - Now imagine that some data changes in the underlying tables. So imagine for example that a new vote comes in for some story. What Noria is going to do is take that input change and inject it into the dataflow at the votes table node in the dataflow graph. And then it’s going to stream that as an update through the dataflow graph, along each of the edges, down the dataflow, all the way down to any materialized views we might have. And the idea here is that the update here is modified by the operators it passes through.
12:51 - So for example, when the vote — let’s say it’s for story 7 — passes through the count operator, the count is going to replace, or convert, that +1 vote into a negative of the old count — a negative delta — and a positive delta of the new count. So in this case, it’ll be a negative delta with story id 7 and the old count 42, and a positive delta with story id 7 and the new count 43. You can think of this message as communicating: “It used to be that the count for story 7 was 42. Forget that, it is now 43”. These two deltas then flow along to the join, and the join needs to convert that again into some kind of delta that can be applied to the downstream materialized view. From the query, we know that the materialized view has a bunch of additional columns that come from the stories table.
And so the join performs a join by doing lookups into the stories table and then stitching together the output deltas so that they have the additional columns. Those deltas then flow through the dataflow again, down to the StoryWithVC materialized view where they update that view in place. And now subsequent reads from that materialized view are going to see the updated result. The result that has the count of 43 rather than 42. Now that we’ve walked through dataflow-based materialized views at a relatively high level, let’s start to look at the last part — or first part — of the title: partial state.
And partial state is the core contribution of this thesis, and what we’ll be looking at for the rest of this talk. So partial state can be summarized as “learning to forget”. The observation here is that, chances are, most entries in any given view are not accessed. Old and unpopular stories are just sitting there wasting memory; if no one ever reads story 7, why are we storing its result? Furthermore, why are we even computing, and keeping the results up to date, if no one reads them? Similarly, if we don’t have the memory to keep all the results for every story in there, we might need to choose to only keep some of them. Say, only the popular ones that speed up reads the most.
We want to trade off in favor of the most popular and most frequently accessed things. But traditional materialized views don’t really give us this opportunity. As we saw before, you just have a query that gets all of the results, and then you have to query over that view. And what we need is some way to evict old entries from materialized views, and only add new ones on demand. Which brings me to three of the main contributions of this thesis.
15:27 - The first is the notion of missing state in materialized views: the ability to mark some state as not being there, and not expending memory on it. The second is the mechanism of upqueries, which allow populating missing state using dataflow. And finally, an implementation and evaluation of partial state in the Noria dataflow materialized view system. So in order to get at partial state, one thing we first need to figure out is this separation between the definition of the view and the query over that view. Because unfortunately this doesn’t give us quite enough information to make the view partial.
We don’t know what key to use as the basis for deciding whether a subset of the view should be made missing or not. We don’t know the query key that the application is going to use. And therefore, in partial state and in partially materialized views, what we’re going to do is introduce the query parameter into the view definition itself. So instead of writing what you currently see on screen, and having them be separate, we can have a single view definition that also includes this clause of where the story id equals question mark. And then have the question mark be a parameter for the view.
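In other words, something like the following (again a guess at the concrete syntax; Noria’s actual parameterized-view syntax may differ):

```rust
// The query parameter (`?`) now lives inside the view definition itself,
// and becomes the key that decides which subsets of the view can be missing.
const STORY_WITH_VC_PARTIAL: &str = "
    CREATE VIEW StoryWithVC AS
    SELECT stories.id, stories.title, COUNT(votes.user) AS vcount
    FROM stories
    JOIN votes ON votes.story_id = stories.id
    WHERE stories.id = ?
    GROUP BY stories.id;
";
```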
And we’re going to use this key to determine if a given result is present or missing. And then when the application wants to query over this view, it’s going to execute that view without any additional query, and just supply the parameters that are identified by the question mark placeholders. And what’s neat about this is that when an application wants to execute a given view, let’s say for story 7, that gets sent as a request to Noria, which receives it, looks up story 7 in the view’s index, and might find that that entry is missing, indicated here by a hollow green box. Now when this happens, some mechanism needs to be in place in order to fill that result, that subset of the view, so that we can respond to the application. The way we do this is using an “upquery”.
And an upquery is, as the name implies, a query that goes up the dataflow graph. In this case, immediately to the join operator. And you can think of an upquery as a request for a summary of all the relevant past deltas to be retransmitted through the dataflow. You can think of it sort of as the downstream saying: “tell me about 7 again, because I forgot all about it”. Now, the join in this case is stateless — it doesn’t have a way to send such a summary — and so in order to support this, upqueries can also recurse.
In this case, the join might choose to forward this upquery to one of its ancestors, which might be stateful. In this case, let’s say it decides to forward that upquery to the count operator on the right. And you can think of this as sort of asking: “can you tell me about 7 so that I can tell the join — the downstream — about 7?”. Now, a quick aside here: we’ve only talked so far about state in tables and in views. But there is other state too. Remember how I mentioned that the count takes a +1 vote and turns it into a negative for the old count and a positive for the new count? Well, in order to do that, it needs to remember that the current count was 42.
18:44 - And the way it does this is it also keeps its own little materialized view internally. It’s a materialized view of its past output. And so this might actually be state that is present in the dataflow, and that can be used to satisfy these upqueries. In this case, it might be that story 7 is known in that materialized state, and so we don’t need another upquery to recurse all the way back to the individual votes in the votes table. When the aggregation responds to the upquery, it sends that response just directly through the existing dataflow.
And this is a key point: there is no need for special operators or different forward or backwards query processing in this design. Instead, the response that gets sent through the dataflow is just a single message that holds the current state of the source for the requested key. And it’s processed just like any other dataflow update. When that response arrives at the join, the join does the same thing it would have done for any other update: looks up into stories, patches together the output, and then forwards that back to the materialized view. And that materialized view, when it receives this upquery response, takes that and uses it to fill in the hollow box because this represents the answer for 7.
At this point we now have enough state to respond to the application query for 7, so we can just send the response back. What’s nice about this is that the state has now been populated. So if we now get later queries for the same value, 7, we can just serve them directly from the same materialized view because that state is no longer missing. What all of this enables, and the sort of key feature of partial state, is that at some point later, if story 7 falls out of favor — becomes old — we can evict that entry. In fact, we can evict any state we want internally in the system related to that key.
So for example, we might choose a year later to evict story 7 from both the aggregation state and also from the materialized view. And this lets us save memory that can then be used to materialize other results that are popular. I mentioned also that we might waste compute to keep materialized view results up to date that are no longer being accessed. Eviction and missing state allow us to work around this too. Imagine that after evicting 7, some change happens to the stories table where, say, the title of story 7 changes.
This is going to introduce a delta in the dataflow graph that changes the story. And when that arrives at the join, and the join does a lookup into the state of the aggregation to figure out what the count is, it’s going to encounter missing state. And when it does, we can safely discard that update to the story. The reasons for this are a little subtle, and the thesis goes into more detail, but essentially if you observe missing state for a given key in a sibling in the graph, it generally also means that that state is missing downstream. And therefore there is nothing to update — there is no state for 7 downstream — and so we don’t need to update it and can discard this update entirely.
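Putting the last few pieces together, here is a deliberately simplified, single-threaded sketch of what missing state, upqueries on read, discard-on-miss, and eviction look like for one view keyed by story id. None of this is Noria’s actual code or API; the names and types are invented for illustration.

```rust
use std::collections::HashMap;

// One partially materialized view, keyed by story id. A key that is absent
// from the map is "missing" (as opposed to "present with zero rows").
type Row = Vec<i64>;
struct PartialView {
    rows: HashMap<u64, Vec<Row>>,
}

impl PartialView {
    // Read path: on a miss, ask upstream to recompute just this key (the
    // "upquery"), store the response, and serve this and later reads from it.
    fn read(&mut self, key: u64, upquery: impl Fn(u64) -> Vec<Row>) -> &[Row] {
        if !self.rows.contains_key(&key) {
            let rows = upquery(key); // may itself recurse further upstream
            self.rows.insert(key, rows); // the state is no longer missing
        }
        &self.rows[&key]
    }

    // Write path: a delta for a key we have evicted can simply be dropped,
    // because the next read of that key will upquery and recompute it anyway.
    fn apply_delta(&mut self, key: u64, row: Row, positive: bool) {
        match self.rows.get_mut(&key) {
            None => { /* missing: nothing materialized here to update */ }
            Some(rows) if positive => rows.push(row),
            Some(rows) => rows.retain(|r| r != &row),
        }
    }

    // Eviction: forget everything about a key to reclaim memory.
    fn evict(&mut self, key: u64) {
        self.rows.remove(&key);
    }
}
```

The real system additionally has to deal with upqueries racing with in-flight updates, which is where the hazards discussed later in the talk come from.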
21:55 - At this point I’ve talked enough about materialized views and dataflow and partial state that I think it’s time for an intermission where we talk a little bit about related work. And there’s been a lot of related work in this general area. And the first of course is materialized view maintenance. So, materialized views, in general, have traditionally been used for a slightly different workload than what I’m going for with this thesis work. In general, materialized views are used for something like an analyst that comes in to check results occasionally.
And it would be too slow to run the complex query the analyst wants those few times they come in. Therefore, over time, we’re just going to keep this materialized view up to date, and then when the analyst comes in they can just open the view and it opens immediately. And so the focus has been more on the maintenance than on the reads, because the reads are infrequent. Think of it as: there is a high frequency and volume of writes, and so we want to make sure that the maintenance is as efficient as possible.
But it’s okay if the read has to sort of do a bunch of work when the analyst sits down, as long as it’s much less work than executing the query from scratch. These systems also generally have little or no support for on-demand queries. The queries are often compiled into a program, or something similar, and it’s not really built for the kind of dynamic query setting that we often see in something like a web application. And for similar reasons these systems rarely have support for eviction, because it’s not really needed; the analyst’s query is what it is. That is the view. And there’s no subset of the view that is more important than the others.
23:35 - Another area of related work is automated caching systems. We’ve seen a number of these come out of both industry and academia. Unfortunately, especially the ones out of industry, these tend to be very tailored for a particular purpose. They’re not really general purpose things that you can plug and play into your own application. What usually happens is some large company has a particular caching problem, they build a solution that works for them, but you can’t just spin that up in your Ruby on Rails application.
24:05 - These systems also often only support invalidation, and not incremental updates. They just focus on evicting related things from the cache, rather than updating them in place, which has the downside that you might miss a lot and have to go to the backend a lot more than is necessary. Furthermore, these systems are often limited to specific database interactions: they require that you go through a framework or an ORM. As opposed to what we’re targeting here, which is sort of general-purpose SQL where you just write SQL queries, you write views, and they just work, similar to parameterized prepared statements in SQL. And finally, there’s been a lot of work on dataflow and stream processing systems.
24:46 - And these systems are also usually focused on write performance, similar to materialized views, where there is a data pipeline and you want to perform all these ongoing computations over that data pipeline. They tend to focus on strong consistency, which usually comes at the cost of read latency. The reads generally have to be coordinated with the data pipeline somehow; often the reads even have to go through the write processing path in order to give consistent results. Whereas with Noria, we can exploit eventual consistency to give much faster reads, because results are allowed to be stale. These dataflow systems, and stream processing systems, also tend to have limited support for on-demand compute and eviction.
Here, too, you sort of set up some queries, and they run for a while, and you can’t evict partial results from a given materialized view. If you wanted to, you would sort of have to terminate the process, or something along those lines. So now that I’ve talked to you for a bit about what the system does, you might wonder: well, are we done? Is everything you told me everything I need to know, so I can just go run this in my application? Unfortunately, that’s not the case. Although if it were, this thesis wouldn’t be very valuable, and so I’m kind of glad that there are more challenges. Because it turns out that, in practice, things are hard.
In particular, we need to ensure that data changes to the tables take effect exactly once. For example, if you do an insert to the base table, well that insert has to happen. It can’t just… The row you inserted can’t just vanish. And if the database applied the insert multiple times so the table now contains that row multiple times, that would also not be great. Now, this might strike you as weird if you come from a traditional database world, because the only way that could really happen is through a bug.
The database sort of goes through and looks at indices and scans tables and such, and would only encounter any given row once. But in this model, it’s a little bit harder. And the reason is two-fold. First, upqueries are summaries of past state. Things like all the state in the past, all the deltas in the past, for story 7. But upqueries can happen concurrently with updates that flow through the dataflow graph. And those updates might be for the same state. Imagine, for example, there’s an upquery for story 7 at the same time as story 7 is being updated.
We need to ensure that these don’t conflict with each other and end up violating this sort of exactly-once rule. Similarly, we want the ability to discard updates early to not maintain state unnecessarily. However, if we discard things erroneously — if we encounter missing state but there is downstream state that actually depended on the update that’s flowing through the graph — then that’d be really unfortunate: that result would be permanently stale. There are a lot of hazards that can cause these kinds of problems, and the thesis goes through a fair number of them, but in this talk I’m going to focus on one just because of limited time. The one we’re going to focus on is incongruent join evictions.
And if you don’t know what these are, it’s not that weird because it’s a term that I made up. But hopefully you’re going to see in a little bit why I think the name makes sense. So let’s start then with “what is an incongruent join?”. What I’m showing you here is another query and dataflow side by side. It’s a different query and dataflow from the ones we looked at before even though they look kind of similar. Here, we have a stories table like before, and then we have a users table. And the idea is that the stories table has an author id column. And in order to display a given story we probably want the author’s name rather than just their user id. And so we store a separate table that has the user ids and various information about that user, such as their name. And so in this StoriesWithAuthor view we’re doing a join between stories and users which ends up combining the results such that the output rows actually contain the author’s name pulled from the users table.
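As a rough reconstruction of that view (the column names are guesses from the description; the actual slide may differ):

```rust
// A story is looked up by its own id, but the join into `users`
// is on the story's author id.
const STORIES_WITH_AUTHOR: &str = "
    CREATE VIEW StoriesWithAuthor AS
    SELECT stories.id, stories.title, users.name AS author_name
    FROM stories
    JOIN users ON stories.author = users.id
    WHERE stories.id = ?;
";
```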
What makes this join incongruent is the fact that the query key is different from the join key. The query key here is the story id, whereas the join key is the author id of the story. That’s what we end up looking into users with. Now, in general this is not a problem. Let’s first work through the correct case of what happens when an upquery occurs. Let’s say that someone queries this view for story 7. An upquery goes to the join, the join is stateless, and it forwards the upquery to the stories table. The stories table looks up the state for 7 and sends back a message along the dataflow saying “here’s story 7, the author is 42”. It might include some other columns too, but let’s just consider these for now. The join then dutifully does the lookup into the users table, and wants to look for the user information for user 42. And let’s say it finds the name for user 42 is Lena.
29:41 - Well in that case it takes that information, it stitches it together with the update that flowed in as a response to the upquery, and then it forwards the response downstream to the materialized view to fill the hollow box — the missing state. So this includes 7, the various columns from the stories table, and the author name it fetched from users. So far so good. But now recall this figure. Where if we encounter missing state in a sibling, we end up just discarding the update, assuming that there’s no state downstream that we might have to update. This ends up causing us a problem in this particular case. Let’s consider what happens if the author changes.
So this would be represented in the dataflow as two deltas: one that removes the story with the old author, and one that adds the story with the new author. These two deltas flow down through the dataflow graph, and when they arrive at the join, the join then needs to do the lookups as before. So it first does a lookup for 42. It again finds the state for Lena, and populates the author name column for that in preparation for sending it to the materialized view. And then it needs to do a lookup for author 43 to do the same for the positive delta. However, what if that lookup misses? So the lookup misses in the users table, and now we don’t know what the author’s name is! Now, you might wonder: how could this happen if users is a table? And there are two answers I can give you to that.
One is you could imagine that the users table is, say, only partially stored in memory, and 43 would have to be fetched from disk. Now, Noria doesn’t actually support this — keeping materialized state in memory versus disk — but in practice there are other ways that this can occur. For example, here I’ve given you a simplified view where users is just a table. But you could imagine that users is actually a view in and of itself, and has some large dataflow upstream. And then there could totally be results in users that are missing.
Regardless of how this happens, the join still now has a problem: it needs to complete the processing of that update, but it doesn’t have the state that’s required to do so. So what do we do? We can’t produce the needed update. We cannot forward just the negative, and just sort of discard the positive, because if we did, the downstream materialized view would now have no rows for story 7. And any subsequent query for story 7 would get no results, which is obviously not correct. We also can’t just drop the update altogether. Because if we did, any subsequent read for 7 would now see the old author name rather than the new one. And this would also be a problem. So what do we do? Well, your first instinct might be that we can just fill the missing state. Right? We just have the users table do an upquery — or something — fill in the state for 43, and then everything is great. And while this seems like an attractive solution, in practice it doesn’t work so well. Because we can’t… we would have to hold up the dataflow at the join until that upquery finishes.
32:46 - And that might potentially take a long time. Remember, the users table might be a view that has huge dataflow above it. And satisfying that upquery might take forever. Well, at least a very long time. And during that time, the join can’t process any more updates. And we’re sort of stuck just holding up the dataflow — not processing more writes. In fact, it turns out this is even worse. There are cases — I won’t go through them here — where this might end up in deadlock.
Where the join can’t process more updates until the upquery finishes, but the upquery cannot finish until the join processes more updates. And now we’re stuck. You might think: well, can the join just process later updates to avoid this problem? But we can’t process them out of order, because there might be later updates that depend on this one, and so we need to finish processing this one first. Okay, so that’s not a viable solution. What do we do instead? Well, partial state actually gives us the mechanism to solve this problem, and that is evictions.
It’s true that we don’t know what the author for story 7 now is, but we can just communicate that downstream with an eviction. We can evict story 7 from the downstream materialized view. And now all Noria has to do is wait: when a later query comes in for 7, it will fill the required state through the normal upquery flow, which we already showed works just fine. And it turns out the system can actually detect when you have incongruent joins, and only send these evictions in the cases where they’re necessary. It doesn’t have to do it for every join. So you might wonder now: does all of this work? Like, this seems like there’s a bunch of mechanism internally, does it end up just killing performance — is this even worthwhile? In order to evaluate that, we need a realistic test subject. And for this thesis I chose Lobste.rs, which is a Hacker News-like news aggregator.
Users submit stories, users can vote for and comment on those stories, look at top lists of the most popular stories, that kind of stuff. And I chose Lobste.rs for two reasons. One is that it is open-source, which allows us to see the queries that are issued for different page requests. And the second is that the data statistics for Lobste.rs are available. This is stuff like how many requests come in per second in general, how those requests are distributed across different pages, and which users and which stories are most popular and most active. And ultimately all of this data allowed me to build a workload generator for Lobste.rs that can synthesize Lobste.rs-like requests.
35:31 - And the reason why we want a generator here is so that we can change the load on the system artificially, and see whether the system keeps up. The generator also has a pluggable backend so that we can choose to run it against MySQL or against Noria, and just see how they perform. It also lets us bypass Ruby on Rails so that we can benchmark the true backend performance rather than the language that’s serving the frontend. So let’s first look at how MySQL does. This is MySQL running on Amazon EC2, on a 16-core instance. And this is showing the throughput that the workload generator can get to before the system falls over, that is, before all the cores on the machine are saturated and the latency starts to spike. So this gets to about 400 pages per second.
36:23 - And this is across all the different page types — it generates a sort of mix of page requests. Now I want to point out that this is already with a denormalized schema. The Lobste.rs developers have done a decent amount of work on their queries to make sure that things like the vote count — or their equivalent of it that they call story “hotness” — is actually a column of the stories table. So they don’t generally have to do joins and aggregations in their queries. Although their update paths in the application have to do a lot of work to make sure those values are kept up to date.
If you don’t include this kind of manual denormalization, MySQL can’t even keep up with a single page request per second. Now let’s contrast that with running Noria without partial state. Noria without partial state already does significantly better — this is about an order of magnitude improvement. And this shows you the power of materialized views, especially in a case like Lobste.rs where reads are the majority. All those reads don’t have to do joins; they’re essentially all just key/value lookups.
Here, however, we run into another bottleneck: we’re not CPU-bound, we’re memory-bound. When Noria without partial gets to about 4500 pages per second, it runs out of the 128GB of memory on this machine. And in fact, it’s a little worse than that, because the memory use tends to increase over time, so if you ran the benchmark for longer it might not even keep up with this load. And this is already with some amount of optimization within Noria to make sure we only materialize requested values. So if no one ever asks for a particular query, we don’t materialize results for it.
38:10 - And now let’s look at what happens if we run Noria with partial state. It does significantly better — this is about 67% over Noria without partial, and about 18x MySQL. And here, there’s actually a little bit of room to grow. This benchmark falls over not because of saturating all the cores, and not because of memory. It falls over because of processing upqueries. It turns out that there’s a particularly update-heavy path through the dataflow that currently bottlenecks Noria.
And this might be somewhere where additional optimizations could help. And we can see how Noria with partial state gets to this higher performance number by looking at the memory use. So here I’m showing you the memory use in GB for Noria without partial and Noria with partial. At the capacity just below where Noria without partial falls over. You see that Noria without partial uses over 100GB of memory, and Noria with partial state uses only a third of that.
39:13 - And you might say well, Jon, forget about MySQL, no one does that in practice. If I implemented caching myself, is Noria really going to keep up with that? Is Noria really a serious competitor for that? And this is a really hard question to answer, because it depends on the implementation, it depends on the application in question, it depends on the workload. And so instead what I did is construct a benchmark that is sort of an idealized caching setup. This is one where there are no evictions, there are no misses, almost all requests are reads, and almost all requests turn into a single key lookup. And there’s only a single query in these benchmarks, so this is not all of Lobste.rs, but instead just the stories and vote count query I showed you in the very beginning.
And I ran this against Redis, which is a popular key/value store that’s often used to back things like caches. And in this idealized workload, Redis gets to just under a million requests per second. Which might sound pretty good. However, Redis is single threaded — it can only run on one core — and so if we really wanted to compare these, we have to assume that someone implemented perfect scalability across cores for Redis, and so what I’m going to show you is 16 times that number. This is not… the number on the right here is not a real benchmark number, it is just 16 times the number on the left.
Think of it as a theoretical maximum for what you could get out of Redis here. And then I ran the same benchmark against Noria. Now before I show you the result, remember that Noria provides SQL and automatic maintenance of this cache, and it doesn’t require that everything be implemented in the application logic; Noria does it automatically for you. And Noria here gets pretty close to this theoretical maximum for Redis. Noria uses as many cores as it needs to satisfy reads. It handles all the eviction and cache updating for you.
And yet it gets within a factor of two of what you could get with a theoretical Redis deployment on this machine. Now, I’ve been talking for a while, and so I’m going to start to wrap things up now. And I want to start first by talking a little bit about future work on this, because Noria is neither perfect nor complete. There are a number of things that are missing both from Noria without partial state, and from partial state itself. For example, there is currently no support for things like range queries, cursors, and time-windowed operators that might be useful for applications; this is somewhere there’s room for innovation both within Noria’s dataflow and within partial state.
It’d also be really nice if there was a way to integrate Noria with an upstream database. So imagine that the tables in Noria were not stored in Noria itself, but were stored in Postgres or MySQL or some other source-of-truth database that the user or application developer is already running. Similarly, there’s an attractive prospect here: because Noria uses dataflow to manage all of the updates to materialized views, there’s no real reason it has to stop at the materialized views. Imagine that the user, or the application developer, also has client applications running on devices that users are holding. Well, it might be that those are showing subsets of views that are on the server.
And you could imagine that the deltas are allowed to flow all the way to the end-user device and update the views there in place. This is the sort of reactive programming, or reactive applications, that are becoming a bit of a trend in web application development. And finally, fault tolerance is something that Noria has a somewhat weak story for. But partial state might be able to help here. The idea would be that if a given node goes away, we can just mark the state that that node held as missing, and then have the partial state and upquery mechanisms take on the role of repopulating that state.
43:11 - Now, before I end, I want to acknowledge the influence of a couple of people on this work. And I want to start off with the committee. Robert and Frans have — over the years — just been continuously asking and trying to make me build a better story for Noria. And for partial. And I think this is one of the reasons why the work is where it is today. I think initially I had these sort of hazy ideas for something that might be cool, and I think the two of them have really helped hone the story, hone the argument, for why this is useful and what the work should focus on.
43:44 - And this project wouldn’t be anything like what it is today without their invaluable input. Malte as well was a post-doc in my group during many of the years of Noria’s development, and without Malte Noria wouldn’t have SQL. It wouldn’t have query optimization. It would essentially just be a dataflow system. And his contributions have been invaluable and without him this work would not be in its current state. I’m also very glad that I have Sam on my committee so that I can draw from some of his database experience.
Which is something that he has a lot more of than us mere systems people. I also want to thank the people of the parallel and distributed operating systems group at MIT. The people there have just been giving me… they just have this endless curiosity and insight and support that I think has made me the researcher I am today. And I don’t think I would be here without having them surround me throughout the years. I also want to thank my family — my parents and step-parents — who have always encouraged me to pursue my interests no matter how geeky they might be.
44:53 - And I think they’ve sort of been nudging me along all the way until I got where I am today — where I’m now giving a very geeky presentation — and I think they’d be happy with that. I also want to thank my girlfriend Talia. I am amazed that she’s tolerated me just locking myself in my room for hours on end working on this work, on the evaluation, on the thesis, on this presentation. I am so grateful and happy that I have her, and… I love you. To conclude: my thesis enables materialized views to be used as caches. It does so by allowing state to be missing from materializations, and using upqueries to populate missing state on demand.
The resulting system provides automated caching for SQL queries, and reduces the need for complex, ad hoc caching logic. Thank you all for listening, and please ask any questions you might have.