eXtreme Modelling in Practice - Jesse Davis, Max Hirschhorn, and Judah Schvimer
Oct 2, 2020 03:14 · 2895 words · 14 minute read
Hi, I’m Jesse, I work at MongoDB. My colleagues Max Hirschhorn and Judah Schvimer and I have done two case studies testing that our software conforms to its specification, and that’s what I’m going to describe to you today. This presentation is adapted from a paper that we presented at VLDB last month. Compared to our VLDB presentation, this one will give you a lot more detail, particularly about how we used TLA+. At MongoDB, we’ve been adopting TLA+ more and more over the last few years.
00:39 - We use it to specify distributed systems algorithms and to prevent bugs. But how do we know that our implementations of these specs actually conform? This is particularly important because we make incremental changes to both sides over time. We looked around for a testing technique, a way to test conformance, and we came across a 2011 paper, “Concurrent Development of Model and Implementation.” In this paper, the authors propose a software methodology called eXtreme Modelling. In eXtreme Modelling, the specification and the implementation evolve together over time, and there are two techniques for testing conformance.
01:28 - In one of them, model-based trace checking, you run your implementation and collect an execution trace, and then check that that trace is permitted by the spec. The other technique, called model-based test-case generation, starts with the spec, generates all possible behaviors, and then generates unit tests to check that the implementation follows each of these behaviors. So we’ve got two testing techniques, and we did two case studies where we tried each of these techniques with one of our products. We tried model-based trace checking with the MongoDB Server, and we tried model-based test-case generation with MongoDB Realm. The rest of this talk is simple: I’m going to talk about the MongoDB Server case study first, and then the MongoDB Realm one.
02:25 - The MongoDB Server is a distributed database, and it uses a replication protocol that is a lot like Raft. You write to the leader, the followers asynchronously replicate the write, and once the write has been acknowledged by a majority, the leader considers it to be majority committed. The leader keeps track of a majority-committed timestamp. All writes up to that timestamp are committed, and all of the writes after that are not. The leader needs to keep track of this majority-committed timestamp despite crashes, network outages, and elections.
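As a quick aside, here is a minimal Python sketch of the majority-commit idea, assuming one integer timestamp per member; it illustrates the concept and is not the server's actual code.

```python
# Minimal sketch of the majority-commit idea, not MongoDB's actual code.
# Each member reports the timestamp of the last write it has applied; the
# leader's majority-committed timestamp is the newest timestamp that a
# majority of members have reached.

def majority_commit_point(last_applied):
    """last_applied: one timestamp per member, leader included."""
    majority = len(last_applied) // 2 + 1
    # Sort newest first; the timestamp in the majority-th position has been
    # acknowledged by at least a majority of members.
    return sorted(last_applied, reverse=True)[majority - 1]

# 3-node set: leader at ts 10, followers at 8 and 5. A majority (2 of 3)
# has reached ts 8, so writes up to ts 8 are majority committed.
assert majority_commit_point([10, 8, 5]) == 8
```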
03:08 - That’s what we specified in TLA+: we wrote a short spec of this aspect of the server, and we wanted to test the implementation’s conformance to this spec. For this first case study we used model-based trace checking, and the model-based trace checking flow goes like this. You run some test of your implementation, and while the implementation is running, it logs all of the state transitions that it experiences, outputting all of the variables that are defined in the spec. We collect this execution trace and translate it into a TLA+ data structure. This allows us to do the last step, which is to use the model checker to check that this particular state sequence was permitted by the spec.
04:07 - We developed a Python script and a data pipeline for doing this testing with the MongoDB Server. To begin with, we chose a test to run: a JavaScript randomized test called the rollback fuzzer, which does random operations on the server while, at the same time, triggering rollbacks of uncommitted writes. This test generates database requests to each of the three nodes in the replica set, and each of them logs every state transition in its own log file. We then need to combine these log files into one state sequence. We can just sort the events by timestamp because all of the servers are on the same machine with a shared clock, and we introduced a short sleep after each event to ensure that events have distinct timestamps.
05:11 - Once we have the combined logs, we have a Python script which parses them and creates a state sequence that expresses all of the state transitions for the system as a whole. We then output that state sequence as a TLA+ data structure, and we use the model checker to check that it was permitted by the spec. That’s our pipeline, and I want to talk about two parts of it in particular detail. First of all, how do we process trace events? At every moment the system as a whole has some state. For example, here node 1 is the leader, node 2 is a follower, and the election term is 1.
06:05 - To get the next state in the sequence, we combine this with a trace event, so let’s say that node 2 becomes leader in term 2. Our Python script combines the current state with the new event to produce a new state where node 2 is the leader and node 2’s term is 2. Notice here, the Python script makes an assumption that if there’s a new leader then the previous leader is instantly a follower. This is actually a lie. In fact, two leaders may co-exist for a short period in different terms. This assumption is going to be a problem, and I’ll explain that in a minute.
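To make this concrete, here is a simplified Python sketch of that event-folding step. The field names and event shapes are invented for illustration (this is not the server's real log format or our exact script), and it bakes in the same single-leader assumption I just described.

```python
# Simplified sketch of folding a merged event log into a state sequence.
# Field names and event shapes are invented for illustration.

def next_state(state, event):
    """Combine the current system state with one trace event."""
    new = {"role": dict(state["role"]), "term": dict(state["term"])}
    if event["action"] == "BecomeLeader":
        # The assumption described above: the previous leader instantly
        # becomes a follower. In reality two leaders can briefly coexist
        # in different terms, which is why this fudge became a problem.
        for node in new["role"]:
            new["role"][node] = "follower"
        new["role"][event["node"]] = "leader"
        new["term"][event["node"]] = event["term"]
    return new

def fold(initial_state, events):
    """Build the whole-system state sequence from a timestamp-sorted log."""
    states = [initial_state]
    for event in sorted(events, key=lambda e: e["timestamp"]):
        states.append(next_state(states[-1], event))
    return states

init = {"role": {"n1": "leader", "n2": "follower"}, "term": {"n1": 1, "n2": 1}}
events = [{"timestamp": 1, "action": "BecomeLeader", "node": "n2", "term": 2}]
print(fold(init, events)[-1])  # n2 is now the leader in term 2; n1 demoted
```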
06:53 - That’s the first stage of this pipeline that I wanted to describe. The other one is how we use the model checker to check that this was a permitted state sequence. We’re following a process here that was proposed by Ron Pressler. We output the state sequence as a series of TLA+ tuples, where each tuple contains the action that led to this state, the value of all of the state variables, and also the server log source of this trace event. For example, the initial state here is a tuple; the action that brought us here is init.
07:38 - We include all of the state variables, which you don’t need to pay attention to right now, and after that, we start processing trace events; maybe the next event is that a member became primary. We call this action "become primary by magic" because our spec doesn’t model the election protocol in detail, so we just assume that some member becomes a leader. You can see here that the role variable has changed its value, and also that we’ve got the server log name and line number. This is not needed for trace checking, but we found it very helpful for debugging problems. The rest of this method is adapted from Ron Pressler: you declare your state variables and instantiate the handwritten TLA+ spec. This is the spec that we are checking conformance to.
08:46 - There’s an index variable that says which event we’re on, and a read operator that loads that trace event into the spec variables. There is another operator, read next, which loads the primed variables from the next event in the state sequence. The trace spec’s initial state is that it has read the first event; its next-state relation reads the next event and loads it into the variables. The behavior of the spec as a whole is that it starts at the first event and proceeds until it reaches the last one. We use the model checker to check this spec, which is trivially correct, and also to check the spec behavior of our instantiated handwritten spec.
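Here is a rough Python sketch of the emission side of this: writing the state sequence out as a TLA+ constant that a trace spec like the one just described can read. The module, field, and variable names (and the log line) are placeholders, not the names in our real spec.

```python
# Sketch of emitting a state sequence as a TLA+ data structure. Module and
# field names are placeholders for illustration, not our real spec's names.

def tla_value(v):
    """Render a Python value as a very small subset of TLA+."""
    if isinstance(v, str):
        return '"%s"' % v
    if isinstance(v, dict):
        fields = ", ".join("%s |-> %s" % (k, tla_value(x)) for k, x in v.items())
        return "[%s]" % fields
    return str(v)

def write_trace(states, path="Trace.tla"):
    """Write the states as a TLA+ sequence of records. Each record carries
    the action that produced the state, the spec variables, and the server
    log line it came from (used only for debugging)."""
    with open(path, "w") as f:
        f.write("---- MODULE Trace ----\n")
        f.write("Trace == <<\n")
        f.write(",\n".join("  " + tla_value(s) for s in states))
        f.write("\n>>\n====\n")

write_trace([
    {"action": "Init", "role": {"n1": "leader", "n2": "follower"},
     "logline": "none"},
    {"action": "BecomePrimaryByMagic", "role": {"n1": "follower", "n2": "leader"},
     "logline": "n2.log:1234"},
])
```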
09:49 - By checking the trace spec and the instantiated handwritten spec at the same time, we check that this particular behavior matched the spec. That’s model-based trace-checking. The idea is that the more we test, the more coverage we would achieve of both the implementation and the specification, and the more confidence we would get that our implementation conforms. There are a lot of really interesting questions here, like how do we measure coverage? What do we mean by confidence that the implementation conforms? Unfortunately, we didn’t get to explore any of these questions because we actually failed. Our traces never matched our specification. We didn’t get trace-checking to pass. That’s the result of this first half of the paper, but we have some lessons that we hope are going to be useful for anybody who tries this in the future.
10:51 - There are three reasons why we couldn’t do model-based trace-checking. The first of these is that we have to snapshot all of the modeled state within one of the nodes in order to log it every time it does a state transition. But MongoDB is highly concurrent, it’s multi-threaded, it’s designed to avoid global locking. As a result, it took us a month to figure out how to use locks and MVCC to get a consistent snapshot of the values that we needed to log. The second problem is that we actually began with a spec that didn’t match the implementation. There were a few discrepancies.
11:35 - For example, in our implementation, we can have two leaders at a time. This is true of Raft and many other consensus protocols. We didn’t want to deal with that complexity in the spec, so we had ignored it and assumed that there’s only one leader at a time. In our Python script that processes events, we had tried to fudge this by marking the previous leader as a follower as soon as we learn about a new leader. This kind of post-processing might be useful for bridging differences between spec and implementation.
12:18 - Sometimes I think this might be a good idea; sometimes I think that formal refinement is a better way to bridge the gap between an abstract spec and the spec that you want to trace-check. In our case, this fudging, this need for post-processing, was a symptom that we were on the wrong track. We should have modeled what the implementation actually did. The final problem with this case study is that the marginal cost of trace-checking was just too high. Our original vision was that we would gather execution traces from all of our tests in our continuous integration system, check all of them against all of our five or six specs, rapidly write more specs, and trace-check against those as well.
13:13 - But we had spent 10 weeks so far and had failed to successfully trace-check even one spec. We discovered that most of the work we had done was specific to that one spec. So we guessed that the marginal cost of trace-checking each additional spec in our repository was not going to be worth it. As a result, we stopped the project. I still think that model-based trace-checking can be a practical and useful technique, and I hope that some of the lessons we learned will help make it practical.
13:54 - If we’d started again, we would have ensured that the spec and the implementation conformed, to the best of our knowledge, before we began. We would have fixed any discrepancies, and if trace-checking found additional discrepancies, we would’ve stopped and fixed those either by changing the implementation or by changing the spec. The other thing we would have done differently is that we would’ve modeled easily observed operations like network messages rather than internal state that requires locking and snapshotting. If we had rewritten the spec in those terms, it would have been a lot more like the original Raft spec and a lot easier to trace-check. We found that the TLA+ Toolbox and TLC were very capable for this problem, but for more widespread adoption, I think a couple of wishlist items would make this easier for future engineers.
14:53 - One would be a standard format for inputting execution traces. We created a TLA+ data structure, but it would help if we could output something more common, like JSON. And if there were just a button in the TLA+ Toolbox where you could say, “Check this execution trace against this specification,” people would be much more likely to use it. I do want to say thank you to Markus Kuppe. He was unbelievably helpful throughout this whole process. He was doing on-the-fly development, fixing bugs and adding features to TLC while we were working on this project, and we wouldn’t have been able to reach this point without his help. Same for Ron Pressler, who talked us through his method and gave us very useful advice throughout. That’s the end of the first half, the first case study. Now, on to the second product and the second test method.
16:03 - MongoDB Realm is a mobile database where disconnected clients can write to their replicas of the data. They periodically connect to a central server, upload their changes, and download changes from other clients. We have a conflict resolution algorithm which ensures that all members of this system eventually converge on identical copies of the data. Originally, we knew that the central server and all of the clients used the same conflict resolution algorithm because they were all running the same C++ code. But recently we re-implemented the central server in Go.
16:50 - Now we needed a way to make sure that the client and the server have implemented the same algorithm in different programming languages. For this, we used the second of the two conformance testing techniques, which is model-based test-case generation. The first step is to actually write a TLA+ spec. Before this, we had an algorithm written out in English and implemented in C++, but now we needed a formal spec, so we wrote it up in TLA+, ran the model checker, and verified that the algorithm does indeed work. Now we have some confidence that our spec is correct, but we still don’t know whether our implementations conform to it or to each other.
17:37 - The next step is to output all possible behaviors of this specification. We did this with a TLC command-line option that outputs all of the states, and the actions that lead to them, in dot format. This is practical in our case because the state space is fairly small, on the order of tens of thousands of states. Once we’ve got this, we turn each behavior of the specification into a unit test of the C++ implementation. We wrote a Go program that read TLC’s output and converted each state sequence to a unit test.
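To give a flavor of this step, here is a rough Python sketch that enumerates behaviors from a dot-format state graph. Our real converter was a Go program, and the edge pattern below is an assumption about the dot output rather than a description of TLC's exact format.

```python
# Sketch of enumerating behaviors from a dot-format state graph. This
# assumes edges are printed roughly one per line as:
#     src -> dst [label="ActionName"];
# TLC's exact dot output may differ; our real converter was written in Go.

import re
from collections import defaultdict

EDGE = re.compile(r'(-?\d+)\s*->\s*(-?\d+)\s*\[label="([^"]*)"')

def behaviors(dot_text):
    """Enumerate root-to-leaf paths, assuming the state graph is acyclic."""
    children = defaultdict(list)
    has_parent, nodes = set(), set()
    for src, dst, action in EDGE.findall(dot_text):
        children[src].append((action, dst))
        has_parent.add(dst)
        nodes.update((src, dst))

    def walk(node, path):
        if not children[node]:
            yield path
        for action, nxt in children[node]:
            yield from walk(nxt, path + [(action, nxt)])

    for root in nodes - has_parent:  # initial states have no incoming edge
        yield from walk(root, [("Init", root)])
```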
18:27 - By the way, we have an advantage here in that our state space has no cycles. The algorithm always stops after a few steps, so we have a well-defined notion of a distinct state sequence that starts at the beginning and stops at the end. If you had an algorithm that didn’t always halt, you’d have to be more clever. For each sequence we create a unit test. These follow a template: they begin by instantiating two clients and starting off with some initial data, then each of the clients independently does some operation on the data, they all synchronize with each other, and at the end, we check that they have converged on the same data and that both of their operations are reflected in the final data.
19:23 - One example of a test begins with the data 1, 2, 3. One of the clients sets the third element of the array to 4, and the other client removes the second element from the array. At the end, we check that they have the same data and that both of these operations have taken effect. In Realm, there are six different operations that can be performed on arrays; these combine into 21 different conflict resolution rules. TLC found about 30,000 states, and since each sequence contains several states, this yielded about 5,000 test cases.
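To make the template concrete, here is that example rendered in Python. The real generated tests are C++ and call Realm's actual merge code; the sync parameter below is a stand-in for the implementation under test, and the concrete expected array in a real generated test comes from the final state of the corresponding TLC behavior.

```python
# One generated test, rendered in Python for illustration only. The real
# generated tests are C++ and exercise Realm's actual conflict-resolution
# code; "sync" is a stand-in for the implementation under test.

def test_set_vs_remove(sync):
    # Both clients start from the same initial array.
    initial = [1, 2, 3]

    # Client 1 sets the third element to 4; client 2 removes the second.
    op1 = ("set", 2, 4)
    op2 = ("remove", 1)

    # Each client applies its own operation locally, then both synchronize
    # through the central server, which applies the conflict-resolution rules.
    merged1, merged2 = sync(initial, op1, op2)

    # The clients must converge on identical data that reflects both
    # operations; the exact expected array in a real generated test is taken
    # from the final state of the TLC behavior it was generated from.
    assert merged1 == merged2
```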
20:08 - For model-based test-case generation, TLC and the TLA+ Toolbox worked well too. We have just one wishlist item, which would be a standard format for exporting the state space with the variable values, the transitions, and the action names. We had to write a Go program to parse the dot format that TLC outputs. I think that’s a barrier to adoption, and if TLC could output a CSV or a JSON file, people would be much more likely to try out this technique. In conclusion, model-based test-case generation worked great for us.
20:50 - We were motivated to write a TLA+ spec, which led us to find a bug, and the bug was indeed already present in our implementation. The generated tests achieved 100 percent branch coverage. Model-based trace-checking we found to be much more difficult and eventually not worth the effort for us. I do still think that it’s a good idea if you ensure that your spec and implementation conform, to the best of your knowledge, when you begin the project, and if you write your spec in terms of easily observed operations like network messages.
21:25 - On behalf of my colleagues Max and Judah, thank you very much. If you want to read the complete paper, this bit.ly link will take you there.