Xpand – an internal review - Max Mether - MariaDB HA MiniFest 2021
Apr 7, 2021 09:46 · 4698 words · 23 minute read
Hey everyone, and welcome to this talk about Xpand! So I’m going to spend a few minutes describing a bit what Xpand is and a few of the key features of Xpand. This talk is only 25 or 30 minutes long so we can’t go into all of the details, but I will talk about some of the key features of Xpand. So without further ado let’s get right into it! So what is Xpand? Xpand is based out of the Clustrix technology that was acquired by MariaDB Corporation around two and a half years ago.
Clustrix was an existing product and at MariaDB we have now taken this existing product and updated it a bit and basically tied it to MariaDB. So what it does is that it brings distributed SQL to MariaDB, so in that sense it’s similar to something like CockroachDB or Yugabyte or even, for those who are old-timers in the MySQL world, it’s similar to MySQL Cluster back in the day, which I myself was involved in as part of my tenure at MySQL. So it has a lot of similarities.
All of the distributed SQL have kind of similar features, but the main idea is that you’re able to distribute the data across multiple nodes or multiple servers so that you’re not constrained by the limitations of one server. And the limitations could be due to size, it could be CPU power or many other things, but the whole idea is that it gives you a flexible architecture that allows you to then scale out more and more as you need, and without any specific penalties.
So it’s basically linear scalability, more or less, which is different from many of the solutions available in MariaDB before Xpand. Now Xpand is a smart engine. It does exist in MariaDB as a storage engine, which means you can connect to it like a storage engine, but because Xpand is a smart engine, it has a SQL parser of its own. You can also connect to it directly and use MaxScale as a load balancer, so there are kind of two ways of connecting to it.
I’m not going to talk in detail about the difference between these two, but obviously if you connect through MariaDB you get all of the features of MariaDB. If you use a direct connection, what in our recommendation is called the performance topology, then you basically don’t have all of the features of MariaDB, but you have the Xpand native features available. All right, so let’s look a bit further at the characteristics of Xpand. So it’s built on a shared-nothing architecture, which basically means that the nodes are fully independent, there’s no shared memory or shared disk or anything like that.
All the nodes have their own hardware and then they communicate over the network. Xpand provides total elasticity. You can add nodes or remove nodes from an Xpand deployment pretty much on demand. Obviously, you have to do some operations to do that and there are some things done in the background, but from a user point of view it is pretty much on demand. Xpand is very self-managing in that it does a lot of things automatically, behind the scenes: for example, when you add nodes, data is automatically redistributed.
And we’re going to look at some of the details on how that is done as well. Xpand is fully ACID. It basically provides strong consistency. It’s not like some of the NoSQL solutions out there where there’s eventual consistency; this is strong consistency. When a transaction has been committed it’s been committed everywhere, there’s no way of seeing data in a previous version, or seeing data somehow out of date. It’s completely ACID.
04:45 - It supports standard SQL. The Xpand nodes themselves, when connected directly, have a very rich SQL parser. It doesn’t support all of the SQL that you get through MariaDB using it as a storage engine, but most of it. And one of the cool features that I’m not going to dig into much deeper in this talk is how joins are handled, where they’re kind of partially executed by each node: the join is executed by the first node, then goes to the second node and is executed further, so it’s kind of a really cool way of handling joins, optimized for being distributed, as opposed to using something like a nested loop join or something that’s the default in other engines in MariaDB.
It has continuous availability, basically automatic failover when a node goes down. It basically transparently handles failovers and data distribution and that’s something we are going to look at how this is done. Anyway so those are the pretty cool features of Xpand and let’s look a bit further at the architecture.
06:03 - Here I have listed an architecture where Xpand is used as a storage engine, which means that you do have a MariaDB Server in front of the Xpand storage engine, but if you are using the performance topology you would basically have a similar setup, you just wouldn’t have MariaDB Server in front, but you would connect directly from the application through MaxScale or something and then directly to the Xpand nodes. The green box here could be the application that connects directly, or it could also be that you have MaxScale in between.
Typically we always recommend having MaxScale or a similar load balancer in between, to kind of try and transparently hide things like node failures and adding nodes and so forth. If you use MaxScale, the application doesn’t have to be aware of what the current node topology is, which nodes are available, if you have added more nodes and so forth. MaxScale will transparently take care of that and the applications just have to connect to MaxScale. So I think for the purpose of this example it’s better to look at the green box as MaxScale and then the application next to MaxScale.
In order to guarantee high availability, you need to have at least three nodes in Xpand. Xpand does run in a single node setup but, as it is used mostly for distributed SQL for HA, you do want to have HA and you do want to have [at least] three nodes. The reason why you need three nodes is of course to avoid split brain scenarios. If you only have two nodes then it’s impossible to distinguish a network failure from a node failure. With a two node scenario, basically as soon as one node fails the other node would have to fail as well to ensure that there is no split brain scenario, hence three nodes is the required minimum.
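The majority argument can be illustrated with a minimal sketch. The function name `has_quorum` is made up for this example and is not part of Xpand; the rule shown (a strict majority of the configured nodes) is the usual way to avoid split brain, assumed here to match what the talk describes.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A group of nodes may keep running only if it is a strict majority."""
    return reachable_nodes > cluster_size // 2


# Two-node cluster: after a crash or a network split, each side sees 1 of 2,
# so neither side can safely continue.
print(has_quorum(1, 2))  # False
# Three-node cluster: after one failure the other two still see 2 of 3.
print(has_quorum(2, 3))  # True
# A single isolated node in a three-node cluster has to stop.
print(has_quorum(1, 3))  # False
```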
With three nodes, if one node fails, the other two can continue because they still have a majority of the nodes. Right, so let’s look a bit at how the data is distributed and what goes on inside Xpand. So in Xpand, each table and each index will be divided into something called slices. Here we’re going to look at how it would be done for one table, but in your database this will be done for every single table, and every single index would have the same kind of distribution.
So here we have an Xpand table, which will be divided into what are called slices across the Xpand nodes. Basically a slice is like a certain amount of rows in the table. It’s like a partition except that you can’t really specify how this table is partitioned. It’s done basically automatically by Xpand, typically using a hash value of the primary key. Anyway, so in this example we have three nodes, so we will have three slices, which are represented here by the different colors.
You have a blue, a gray and a green. So basically each of the parts of the table will go to one node like this. And now of course each node has a third of the table, but of course that’s not enough for HA. So in order to have HA, we also need to have a copy or a replica of each slice, so basically each node will have a slice that it’s a primary [for] and a slice that it’s a replica [for]. So here you see these nodes, one, two and three: each will have a primary and each will have a replica, like this.
And by doing this we basically guarantee that there’s HA because now any of these nodes could fail and we would still have all data available. If the third node fails here, well, its primary slice was the green slice, but that green slice also exists as a replica on the first node, so we’re fine. And so forth: if the second node fails, its primary slice also exists as a replica on the third node. Obviously two nodes cannot fail in this scenario. Now in Xpand, you can actually set it up to have more copies of the data than two, but the default is two and for most use cases that’s the best option, because adding more copies of the data makes writes slower because you have to write in more locations.
Like this, every time you write something, you have to write in two locations, in the primary slice and the replica of that slice. But, in theory, if you need it for some reason, you can also set up Xpand to have more copies of each slice and you can even set up Xpand so that every node has all of the data, and this actually can be really useful for specific scenarios, because it of course makes read performance faster if all of the data is in each node, but of course it means that it doesn’t scale as well because, every time you add a node, it has to have all of the data.
So that’s something that can be used for specific use cases, but also what’s really cool about Xpand is that this is highly configurable, so you could actually set this up per table, so you could have some small table that you often use for references in joins and stuff, so you could have this table set up to be fully available on each node and then have the other larger tables distributed. So that’s something that can be used for optimization purposes.
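The three-node layout described above can be sketched in a few lines. This is an illustration only, not Xpand’s actual algorithm: CRC32 stands in for whatever hash Xpand uses on the primary key, the "replica on the next node" rule is an assumption chosen to match the slide (node 3’s primary has its replica on node 1), and all function names are hypothetical.

```python
import zlib

NODES = ["node1", "node2", "node3"]


def slice_for_key(primary_key, num_slices):
    """Map a primary key to a slice with a stable hash (CRC32 is a stand-in)."""
    return zlib.crc32(str(primary_key).encode("utf-8")) % num_slices


def place_slices(num_slices, nodes):
    """Give every slice a primary node and a replica on the next node."""
    placement = {}
    for s in range(num_slices):
        primary = nodes[s % len(nodes)]
        replica = nodes[(s + 1) % len(nodes)]
        placement[s] = (primary, replica)
    return placement


def survives_loss_of(failed, placement):
    """Data stays available if each slice has a copy on a surviving node."""
    return all(any(n != failed for n in copies) for copies in placement.values())


placement = place_slices(3, NODES)
print(placement)
# Any single node can fail without losing a slice.
print(all(survives_loss_of(n, placement) for n in NODES))  # True
```

With only two copies per slice, the same check would fail for the loss of two nodes at once, which matches the remark that two nodes cannot fail in this scenario.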
All right, so let’s look at how elasticity works then. So we have our three nodes, now this table is divided into more slices than three. I think I have 3 times 4. I have 12 slices here. Now by default you will have at least as many slices as there are nodes, but in addition there is a maximum slice size, which by default is eight gigabytes, so as soon as a slice grows beyond eight gigabytes you will get more slices as well. Anyway this table is divided into twelve slices, and they’re evenly distributed across the nodes like we see here.
I’m not very good at making slides perhaps, but these colors are supposed to represent the different slices. That’s what we have here. And for each slice there’s the primary and there’s a replica. So what happens if we have this setup and we now want to add a node? Let’s say our traffic is increasing. We’re starting to reach the limits of our computational power on each node, and we want to add more to be able to scale. So let’s go through the process.
First thing we do, there is an empty node added to the Xpand deployment. You of course need to have the hardware, then there’s a simple command that says alter cluster, basically add node. And the node is added to the cluster. It will be empty in the beginning. There’s a short time period when this node is accepted into the cluster, which is called a group change, and after that everything is fully available. This node has no data yet, but everything is fully available.
Then the background operation starts. So basically in the background, the Rebalancer starts copying primary and replica slices from the existing nodes to the new node. The purpose of this is of course to rebalance the data. So we had our 12 slices distributed on the three nodes so that each node had four primary slices and four replica slices. Now when we add a new node, basically it will take one primary slice from each node, this one, this one and this one, and it will take one replica slice from each node and copy them over to the new node.
So we end up with a new setup where instead of having four plus four slices on each node we now have three plus three slices on each node. And we can see that by that we have basically been able to balance the traffic and the load more evenly, over four nodes instead of over three nodes, so that’s pretty cool! And that’s done in the background, you don’t have to worry about copying slices or whatever. The Rebalancer does it for you, so basically you get the hardware there and you do your alter cluster add node command and everything else is done in the background.
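The move from four plus four slices per node to three plus three can be sketched as follows. This is a toy model, not the Rebalancer’s real logic: the starting layout, the "take from the fullest node" rule and the function names are all assumptions made for illustration. The one constraint the sketch does enforce is that a slice’s primary and replica never end up on the same node.

```python
from collections import Counter


def initial_layout(num_slices, nodes):
    """Slice s gets its primary on one node and its replica on the next."""
    primary = {s: nodes[s % len(nodes)] for s in range(num_slices)}
    replica = {s: nodes[(s + 1) % len(nodes)] for s in range(num_slices)}
    return primary, replica


def add_node(primary, replica, new_node):
    """Move copies to new_node until it holds an even share of each role."""
    nodes = set(primary.values()) | set(replica.values()) | {new_node}
    for role, other in ((primary, replica), (replica, primary)):
        target = len(role) // len(nodes)
        for _ in range(target):
            load = Counter(role.values())
            # Take one copy from the currently fullest existing node.
            donor = max((n for n in load if n != new_node), key=lambda n: load[n])
            # Skip slices whose other copy is already on new_node, so both
            # copies of a slice never share a node. (A real implementation
            # would need a fallback if no such slice exists.)
            s = next(s for s, n in sorted(role.items())
                     if n == donor and other[s] != new_node)
            role[s] = new_node


primary, replica = initial_layout(12, ["node1", "node2", "node3"])
add_node(primary, replica, "node4")
# Every node now holds 3 primary and 3 replica slices.
print(sorted(Counter(primary.values()).items()))
print(sorted(Counter(replica.values()).items()))
```

In the sketch a "move" is a single dictionary update; in the talk the slice is copied to the new node in the background while the cluster stays available.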
So that’s pretty cool! Now, what about the opposite? What about removing a node, or if there’s a node crash? The operations are pretty much the same so this works in both scenarios. It could be that a node fails or that you for some reason want to remove a node. When you want to remove a node it’s a command, but the same kind of steps are taken if a node crashes. So first of all, let’s say this node crashes in this scenario, so node 4 has crashed.
First thing to figure out: can this cluster continue? Do we have all of the data? The answer is yes obviously because the cluster was set up so that a node can fail. All of the slices that were primary on the fourth node exist as replica slices on the other nodes. So that’s the first thing. It goes into the group change protocol again, where the cluster nodes figure out: can they actually continue? And yes they can. And then what happens after this depends on a timeout.
Because of course, if a node fails it could be that it just had a temporary error and it’s going to come back up. So you can actually set a timeout for how long you let your cluster continue in a mode where it’s more vulnerable, because here of course we don’t have two copies of every slice. We have all of the data available, but we lost a fourth of everything, which means that we don’t have copies of all of the data. So there are some slices here where we only have the primary and there is no copy, so it’s kind of a vulnerable situation.
Right now if we would lose another node we would be in trouble. But, of course, if the fourth node is just rebooting, it’s not worth starting to move slices around yet so there’s a timeout and once you hit this timeout or time limit, then the Rebalancer kicks in again and the first thing we do is well we reconstruct, or we basically copy new primary slices from the replicas. So of course all the replicas are available, so we, for each primary that was lost, we take the replica and it’s copied as a primary to another node.
So that’s the first thing. Now we have all of our 12 primary slices available as primaries, and then we do the same thing with the replicas. So we recreate replicas of all of the primary slices so now we have 12 replica slices as well, after this operation. And again this is done in the background and it doesn’t really affect traffic in any other sense than that it keeps the nodes busy. And now we’re back to three nodes and we have a copy of everything.
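The two recovery steps (new primaries from surviving replicas, then new replicas from primaries) can be sketched like this. It is a simplified model under stated assumptions: the four-node starting layout and the "least loaded surviving node" choice are invented for the example, and the function names are hypothetical. The sketch only restores two copies per slice; evening out the load afterwards would be a separate rebalancing pass.

```python
def reprotect(primary, replica, failed, survivors):
    """Restore two copies of every slice once `failed` has timed out."""

    def least_loaded(exclude):
        load = {n: 0 for n in survivors}
        for n in list(primary.values()) + list(replica.values()):
            if n in load:
                load[n] += 1
        return min((n for n in survivors if n != exclude), key=lambda n: load[n])

    # Step 1: each slice that lost its primary gets a new primary, copied
    # from the surviving replica onto a different node.
    for s, node in list(primary.items()):
        if node == failed:
            primary[s] = least_loaded(exclude=replica[s])
    # Step 2: each slice that lost its replica gets a new replica, copied
    # from its primary onto a different node.
    for s, node in list(replica.items()):
        if node == failed:
            replica[s] = least_loaded(exclude=primary[s])


nodes = ["node1", "node2", "node3", "node4"]
primary = {s: nodes[s % 4] for s in range(12)}
replica = {s: nodes[(s + 1) % 4] for s in range(12)}
reprotect(primary, replica, failed="node4", survivors=nodes[:3])
# All 12 slices have a primary and a replica on two different surviving nodes.
print(all(primary[s] != replica[s] for s in range(12)))  # True
```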
So that’s basically how removing a node or losing a node is done, because the steps are the same. So that’s pretty cool because you can both add nodes and remove nodes and the Rebalancer does everything behind the scenes. So basically the coolest part of Xpand is the Rebalancer. So let’s look at what the Rebalancer does: so with the initial data, the Rebalancer is the part of Xpand that basically distributes your data into slices across nodes.
It’s the one that decides which slices go where and so forth. So the Rebalancer is key already with the initial data. When you have data growth, it also splits large slices into multiple smaller ones. So for example when you reach your 8 gig limit on a slice the Rebalancer will split it into smaller slices and then of course these will be distributed across the nodes as well. In some cases you might have skewed data, where some nodes hold much more data than others, and the Rebalancer handles that too.
It redistributes data to even it out across nodes. You might have hot data, you might have one slice that is used all the time for some reason. The Rebalancer would notice this and it will split the slice into smaller slices and just redistribute them across nodes. And all of this is done, again, behind the scenes. And as we saw with node failures, what the Rebalancer does is reprotect the data by ensuring that there are multiple copies.
So basically, as we saw in our example, when one node failed we lost one copy of many of the slices, either the primary or the replica, but we still had the other copy. The Rebalancer was the one that made sure that we recreated copies of all of those slices, so we ended up in a state where we had two copies of every single slice. That’s the real answer. And of course, elasticity: when you add nodes or remove nodes, it’s the Rebalancer that makes sure that slices are copied. That is similar to what happens with node failures.
It redistributes them or copies them and so forth.
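The size-based splitting mentioned above can be sketched as a small loop. The 8 GB limit is the default quoted in the talk; splitting a slice into two equal halves is an assumption made for the example, and `split_oversized` is a hypothetical name, not an Xpand function.

```python
MAX_SLICE_GB = 8.0  # default maximum slice size quoted in the talk


def split_oversized(slice_sizes_gb):
    """Halve any slice above the limit until every slice fits."""
    result = []
    pending = list(slice_sizes_gb)
    while pending:
        size = pending.pop()
        if size > MAX_SLICE_GB:
            # Assumed policy: split into two equal halves and re-check both.
            pending.extend([size / 2, size / 2])
        else:
            result.append(size)
    return sorted(result)


print(split_oversized([3.0, 9.0, 20.0]))
# [3.0, 4.5, 4.5, 5.0, 5.0, 5.0, 5.0]
```

The resulting smaller slices would then be spread across nodes like any other slices.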
20:39 - You can also set up Xpand to be availability zone aware. So if you’re using Xpand on EC2, or some other cloud scenario where you have availability zones, you might want to set up Xpand so that in case a whole availability zone is lost, you still have all of your data available. And basically the Rebalancer can be made availability zone aware so that it balances data not only across nodes but even across availability zones. So it’s really cool! It’s pretty much the secret sauce, one of the key features of Xpand that makes everything in Xpand work well.
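Zone-aware placement comes down to one extra constraint: the two copies of a slice must sit in different zones. A minimal sketch, assuming a made-up mapping of nodes to zones and hypothetical function names (Xpand’s actual configuration and placement logic are not shown in the talk):

```python
def place_across_zones(num_slices, node_zone):
    """Pick a primary and a replica for each slice in two different zones."""
    nodes = sorted(node_zone)
    placement = {}
    for s in range(num_slices):
        primary = nodes[s % len(nodes)]
        # Walk forward from the primary until a node in another zone is found.
        for step in range(1, len(nodes)):
            candidate = nodes[(s + step) % len(nodes)]
            if node_zone[candidate] != node_zone[primary]:
                placement[s] = (primary, candidate)
                break
        else:
            raise ValueError("need nodes in at least two zones")
    return placement


def survives_zone_loss(placement, node_zone, lost_zone):
    """Every slice must keep at least one copy outside the lost zone."""
    return all(any(node_zone[n] != lost_zone for n in copies)
               for copies in placement.values())


# Hypothetical deployment: six nodes spread over three zones.
node_zone = {"node1": "az-a", "node2": "az-a", "node3": "az-b",
             "node4": "az-b", "node5": "az-c", "node6": "az-c"}
placement = place_across_zones(12, node_zone)
print(all(survives_zone_loss(placement, node_zone, z)
          for z in ("az-a", "az-b", "az-c")))  # True
```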
And that’s the basic intro into Xpand. There’s a lot more. Obviously we could go into how joins and other queries are executed and so forth. We don’t really have time for that in this talk so I’m going to stop here.
21:45 - If you want to find out more information about Xpand, the first version of Xpand was released as part of the Enterprise Server release in December and now we have had the first maintenance release come out last week, where we also put emphasis on the performance topology. We’ve added lots of documentation, so if you go to the docs section of mariadb.com, there’s a whole lot there, so I would recommend going there. We’ve also created quite a few blogs that go into some of the key functionalities of Xpand, so you can also go there to find out more.
And of course I will answer some questions after this talk, so thanks everyone for listening and talk to you later! So, great presentation Max. I quite liked it and welcome now to the Q&A session. Thank you Kaj.
22:49 - So you compared Xpand to Cockroach and Yugabyte and MySQL Cluster and I think I got three key technical takeaways from what you said, and I’d like to hear whether you think those are the right takeaways to have.
23:03 - So, number one is that Xpand distributes data on multiple nodes, getting away from the constraint of having all data on one server.
23:15 - Correct. Yep, good, and then the second one was the Rebalancer. You portrayed the Rebalancer as the secret sauce of Xpand that does all the hard stuff and all the cool stuff, so learning about Xpand is a lot about understanding the Rebalancer, is that also correct? Yeah, that’s true, Rebalancer is kind of the automation, the thing that does kind of the things behind the scenes, that makes things smooth, like adding nodes and so forth, because that’s the one that makes sure that when you add a node, data goes to the new node, when you remove a node everything is reduplicated and stuff like that.
23:55 - Yeah, so that’s what makes it possible, all the difficult stuff that Xpand can do, and then the third one was what you called performance topology, so that, but there are several ways of using Xpand. One is as a storage engine over MariaDB directly, or directly, indirectly however you put it, and the other way where one is to use it independently of MariaDB and you call that performance topology. Right, yeah, so I mean that’s what we call it, yes, in our documentation, so I want to use the same name so that it could be easily found but because Xpand has its own parser, it allows you to do this, yes.
So then the first question I have is if you have this performance topology, and it sounds cool with performance, so what’s the benefit for the end user of using Xpand as a storage engine? The benefit is that MariaDB, the MariaDB Server in itself, has a lot more features than Xpand has baked in, so you get the benefit of all of the features of MariaDB, for example, the audit plugin, if you need auditing if you use MariaDB Server, then you get all of the auditing features for example, all of the security features, the password plug-ins, and all that, MariaDB has all of these that Xpand doesn’t and of course you can also then use native or local InnoDB tables and mix with Xpand and things like that, so it adds a lot more compatibility with MariaDB and also a lot more flexibility in what you do.
25:37 - Okay, makes a lot of sense, so on that note you touched upon the topic of joins, and I remember from the old days of MySQL Cluster that there was a tricky thing, so what about these joins, I mean judging from your answer right now you would sometimes… For the things that need lots of joins, you might pick another storage engine than Xpand, and optimize it that way, and you also pointed out that you could set up the slices in such a way that all data for certain tables would be on every single node.
I suppose that’s a way to optimize joins as well, is it? Yeah, exactly right, so I mean the data is still distributed, so you still have some penalties when you do lots of complicated things where you have to go to different tables in different places, you do have network hops. It’s not as bad as it could be because if you do a nested loop join, then you typically go and get one row from the first table then you go and get the second row, and every time you go and get a new row, it’s network hops, you have a lot of network hops.
Now in Xpand, there’s an optimized way of doing joins where it builds a program and then it sends it to each node. Each node does its own version of the program, so you kind of do a lot of local computations on each node, but it’s still more network hops obviously than doing everything locally, so that’s why if you have some reference tables that you often join with, and they’re not that large, you can have those in what we call all nodes format, where all of the data is on each node and then Xpand is aware of this, and it knows that hey, this table, I have everything locally, so I can basically use it as something that does not change, I can just locally look it up, and quickly.
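The effect of keeping a small reference table on every node can be shown with a toy count of remote row lookups. This models only the row-at-a-time case mentioned in the answer, not Xpand’s distributed join program; the modulo placement rule and the name `remote_lookups` are assumptions made for the example.

```python
NODES = ["node1", "node2", "node3"]


def remote_lookups(outer_keys, owner_of, local_node):
    """Count inner-table row lookups that must cross the network."""
    return sum(1 for key in outer_keys if owner_of(key) != local_node)


keys = range(9)
# Reference table sliced across nodes: the owner depends on the key.
distributed = remote_lookups(keys, lambda k: NODES[k % len(NODES)], "node1")
# Reference table kept in full on every node: the owner is always local.
all_nodes = remote_lookups(keys, lambda k: "node1", "node1")
print(distributed, all_nodes)  # 6 0
```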
So to me, that sounds like a much more positive scenario than what it was in MySQL Cluster where joins were working more or less theoretically in most scenarios, but in practice they had long response times, and you described quite a number of safety nets, and ways to get the joins to be efficient enough.
28:00 - Yeah, I mean for me I was heavily involved in NDB or MySQL Cluster back in the day, so for me this is basically, like the really bare bone basics is the same, but it has all of the automation and all of the optimizations that NDB lacked in the day, so it’s very exciting because it’s what I always wanted NDB to have. Now we have this in one product. Yeah, that’s a great perspective to have.
28:26 - So you see quite a lot of things in the video and you get good answers here on the Q&A. If I want to learn more, the documentation and the blogs, where should I start? That’s a good question. I would definitely start with the documentation because in the documentation we recently added the Xpand part, so it’s very new and it’s basically there’s no bloat, you won’t have a hard time, hey where should I start, it’s not 200 pages, it’s 20 pages, so it’s easy to get started in.
28:58 - And Xpand, how ready to use is it, you said Enterprise Server December was the first one, and now there’s a maintenance release so is it ready to roll? Definitely, I mean we already have customers using it right, so it’s definitely ready to roll. It’s also going GA in SkySQL in an upcoming release in a couple of weeks, so it’s definitely ready to roll.
29:20 - And actually one thing I forgot to mention is that we now also have a trial version you can use from our website, so it’s a lot easier than it used to be in the past. That’s something we launched a couple of weeks ago, so it’s really new, but you can now download it and use it anywhere on prem.
29:37 - So you are now on a MariaDB Foundation event, and this is more or less closed source, it’s commercial only, so we’re used to doing everything open source, and this is obviously not, so what kind of… you mentioned there’s a trial version if one wants to try this anyway, even if it’s not open source? Right, yeah so you get basically a trial license that allows you to use it for, I don’t remember, 90 days or something like that. We’re still looking into what to do, right now it’s a proprietary license, or source code, we’re looking at different options going forward, but we haven’t yet decided exactly where to go with that, but it’s something we are looking into actively.
Great, so thank you Max for that. Thank you, Kaj.