Debugging without debugger - investigating SQL Servers internal structures
May 11, 2021 20:00 · 10230 words · 49 minute read
Hello everyone. My name is Hugo Kornelis and in the next hour or so I am going to take you on a journey where we’re going to investigate some of the internals within SQL Server that we were probably not supposed to see, using tools that were not really intended for that specific purpose. Welcome to: Debugging without debugger: Investigating SQL Server’s internal structures. My name is Hugo Kornelis, and I make SQLServerFast. com. SQLServerFast. com is the website where I am building the Execution Plan Reference.
The Execution Plan Reference is a full set of documentation for everything related to execution plans. Because, you see, execution plans are great. Execution plans can always help you understand why a query is running slow when you expect it to go fast. Execution plans can help you understand what you need to change to make that query go faster. Execution plans give a wealth of information about how a query was executed. But what we are missing is a wealth of information about execution plans, so that’s what I’m trying to provide in the Execution Plan Reference.
As a reference, it is not really a good place to learn about execution plans. For that, I am recording the SQLServerFast Execution Plan Video Training, a huge selection of extensive videos where I teach viewers everything about execution plans; and whatever your level is now, you will learn a lot by watching those videos. So if you’re interested in learning more about execution plans, go check it out. I also have other content on SQLServerFast. com, like my blog where I regularly write about, usually, execution plans; sometimes other stuff as well.
I have longer articles there; and I have a section called resources, where I have the slide deck and the demo code for the session you are about to see, and for many other sessions. Just go to SQLServerFast. com, then click Other - Presentations - Presentation resources in the menu, select one of the many presentations, and download the slide deck and the demo code so you can play with it, and do your own experiments, and check the stuff that you wanted me to do but that I didn’t.
I do other stuff in the community. I do stuff for money. You don’t really care about that. What you do care about is how to contact me, because it is possible after watching this video that you have questions. This is not a live delivery where you can raise your hand and Ii will ask ask you what your question is and answer it. I would do that if this were a live delivery, but this is a recording. So if you have questions, just mail me, tweet me, and I will try to answer you as soon as possible.
If you mail me and you don’t get a reply within a week, send me a next mail; it’s probably swamped between all the other stuff I get. I’m not deliberately ignoring you, I’m just sloppy. With that out of the way: Welcome to the session. We are going to investigate the internals of an operator called Window Spool; and before we do that let’s look at what the Window Spool does. So what Window Spool does, it’s an operator in execution plans that reads rows, stores them internally in a worktable, and then the parent operator can receive multiple copies of those rows.
Now that is not unique to Window Spool. In fact, all spool operators do that. Table Spool, Index Spool, do the same thing as Window Spool. But Window Spool has some specific functionality within it, to make it work for the windowing functions the so-called over OVER clause. You know what, let’s just jump into a demo and see it in action.
03:56 - So I’m going to go to the demo window and Ii will enable the live query statistics, the live execution plan, and execute the query. And while it’s running, I’ll talk you through the execution plan, but first the query. So the query simply reads data from the SalesOrderDetail table. This is one of the tables in the AdventureWorks sample database that Microsoft makes available, you can just download it from Microsoft. And we just return some of the columns; then we have one column which is the maximum carrier tracking number, based on a window.
This window is defined as the ten thousand preceding and the three following rows, so it’s a total of ten thousand four rows, the current row is there as well; and preceding and following is based on an ORDER BY specification. So if you order the data on SalesOrderID, SalesOrderDetailID and then take the ten thousand preceding and the three following rows, you have the window of rows that each current row looks at to determine the maximum carrier tracking number.
So this is not the maximum of the entire set, it’s the maximum for each row for its window of rows.
05:05 - For those who don’t know the OVER clause, you can specify a PARTITION BY clause as well; in which case it means for each partition you basically reset and you only look at the window in that specific partition. I didn’t do that for this demo.
05:20 - Now if you look at the execution plan, and I will start at the far right. Execution plans can be read in multiple ways, in this case I choose to follow the data, and following the data means start at the right. So I start at the far right where you find the Clustered Index Scan; and I hover it for the properties, and I will zoom in a bit, and then you will see that the Ordered property is set to true. This means that optimizer forces the operator to return rows in the order in which they are logically defined by that index, and I happen to know that this index, called PK_SalesOrderDetailID… well, long name; it’s the primary key of the SalesOrderDetail table. I happen to know that if you read this ordered you always get the data ordered by SalesOrderID, SalesOrderDetailID, and it’s no coincidence that this this order… whoops, that was wrong key… It’s no coincidence that it’s this order because… The reason is, by ordering the data this way we can easily identify those windows, because the rows are already in the proper order. SQL Server has to have the rows in this order to even enable this windowing functionality.
In this case it can do this by reading from a Clustered Index Scan with ordered. If that had not been possible, if I had chosen to order by something where there is no supporting index, SQL Server would have had to use a Sort operator. Now, apart from the performance consideration, which usually is that you prefer not to have a Sort; in this specific case I am trying to investigate stuff in tempdb. I want to look into the internals in tempdb, and the less messy clutter I have in tempdb, the better.
A Sort can spill to tempdb, so a Sort can allocate stuff in tempdb, which is exactly the messy clutter I try to avoid. So that’s why I deliberately used a sample where no Sort is needed because there is an index that can be read. Now after reading data from that index, the data goes through a Segment operator, and the Segment operator is used to mark logical segments. Normally, if there were a PARTITION BY clause, this would mark a new segment for the start of each partition, so that other operators in the execution plan know: “Hey, we have a new partition here”.
In this case there is no PARTITION BY, so why is the segment column there? Well it’s there, and I’ll also show you that it’s… It helps to click correctly… The Group By which defines the segment boundary is empty, so the Segment defines the entire data set as one single segment, which makes sense because there is no partitioning, so everything is one partition, one segment. So why is the Segment operator even there? Well, it’s just technical. The next operator, Sequence Project, requires a segment column in its input, so it needs to have a segment column even if it’s logically not really needed, so it’s just technical.
The Sequence Project then adds an extra column to the data. In this extra column is a row number as you see here. So this simply adds one extra column that numbers the rows: one, two,… well, the total number of rows is 121,317; it will just count them and give each of them a consecutive number. This number is needed later, for instance in the Compute Scalar here. And the Compute Scalar adds two more computed columns. One of them is called TopRowNumber and is computed as the RowNumber minus ten thousand.
That looks like ten thousand preceding! And - let’s remove the marks - the second is defined as the… is called BottomRowNumber and is defined as RowNumber plus three, that is the three following, so now we start to see already how this all interacts. We see that the data is read in the correct order; then the rows are numbered so they are each given a unique number; and then based on that number we compute the row number of the first row in the window and the last row in the window.
09:38 - Then there is another Segment operator, and this Segment operator is in the plan for the same reason as the previous one. And why don’t we simply reuse the previous segment column? Well, because Window Spool just requires a Segment operator as its direct input. Segment operators are incredibly cheap, so it makes no sense for the optimizer team to do all kinds of optimizations to prevent the double Segment operator, when those optimizations cost more than the extremely low overhead of that Segment operator.
So it’s just this way. Technically the operator would not be needed but it’s there. Now the Window Spool operator. If you look at the row count, you see that it is producing insane amounts of rows, and that’s because how Window Spool works. What Window Spool does is: all the rows it gets from its child, from this Segment operator, are stored in this worktable, and then for each row the Window Spool operator will return that specific row and then this entire window of ten thousand preceding up to three following rows.
So the Window Spool returns 1005 rows for each input row. That’s why the number of input rows is 121,000; the number of output rows is in the millions.
10:57 - What Window Spool returns, if you look at the Output List, is a lot of data that you expected and also a column called WindowCount1010, and that’s simply a number that goes up every time a new row plus window is output. So it outputs a current row plus the window that belongs to it, with a number; then it outputs the next row plus the window that belongs with that next row, with a new number. That enables the Stream Aggregate to then… whoops… to then group by that number and reduce this entire set of rows back to one.
So what’s the use of having one row, exploding it into ten thousand five rows, and then aggregating it back to one? Well, the use is that now this Stream Aggregate can look… and I need to go to the Defined Values for that… can look at that window of ten thousand five rows to determine the maximum carrier tracking number that we want to return for this OVER expression we have here. For all other columns, it uses the ANY operator, which simply uses the value from the first row, because here we tell the operator just use any of them and it happens to be the first one that’s how ANY works.
12:16 - So that’s how all this works. We have some preparation steps to make sure that the rows have a number and that we know the number of the first and the last row in the window. Then the Window Spool stores it in a temporary worktable, returns those rows multiple times so that each row gets its entire window of rows with it, and Stream Aggregate can then do the aggregation so that it determines the correct maximum of the carrier tracking number.
12:46 - Let’s start investigating. Because we were not here to talk about Window Spool. We were here to talk about investigating SQL Server’s internal structures. So I want to know: this worktable. how is it stored exactly? And if you start by doing some internet searching, you will find very little information. What you will find is that if, based on the text in the query, the number of rows can never exceed ten thousand, so it’s based on the numbers in between so many PRECEDING and so many FOLLOWING, if that is the case, then Window Spool will store its worktable in memory.
If it can be more than ten thousand it will store its rows on disk, and I want to investigate the on disk version. That’s why I used a window BETWEEN 10000 PRECEDING AND 3 FOLLOWING. That’s a window of 10,004 rows, current row is included, which means it’s more than ten thousand, and it will be stored on disk. Now when I want to investigate a worktable, I need to overcome some problems. First of all, queries don’t run as long as I want to. Queries run in finite time.
At one point they finish, and if there is a structure used internally as a work object by that query, SQL Server will of course not keep it allocated. As soon as that query is done, SQL Server realizes: “I don’t need the structure any more, I’m going to release it”. And my investigation will hit a dead end, and I need to restart the query and restart the investigation. How much time do I need to investigate this? I mean, this is a one hour video, but of course I spent a lot of time figuring things out, and I’m only showing you the end result.
I had a lot of dead ends. I needed much more time than one hour. I didn’t know in advance how much time i need. So no matter how long I make the query run, it can always be too short. If I let the query run for six hours, turns out i need seven, then the structures are gone before I’m done, and the problem is, the next time the query runs I need to find the structures again. So of course you can use the debugger to halt the process. Tou can just start the query, make sure that the temporary worktable is allocated, and then use the debugger, halt the process.
But problem is: the debugger is an incredibly complex tool. Some people can work with it. I’m not, I’m not smart enough to use the debugger. Also, when there is something stored in tempdb in a worktable, due to how SQL Server works, it can be in memory only, in the buffer pool; or it can be flushed from the buffer pool to physical disk; or it might be in both and they might even be different if the page is dirty. And if the SQL Server process is halted, I need to figure that out all by myself, so I need to figure out where in the buffer pool is the page stored, where on disk is the page stored, which version is correct.
That’s hard. If SQL Server’s process is running, I can use the statement DBCC PAGE. It’s not a documented statement, but it’s a quite well-known statement; it’s often used for purposes like this. And that statement will make SQL Server itself figure out where the data for that page is stored, and just show me the data. Much easier, but that means the debugger is off the table. Now I can create a very slow query, as I already said: make the query run for 6 hours.
But then if I need 7 hours, structures are gone before I’m done, and if i then restart the query, because they are internal structures there’s no guarantee that they will be allocated in the same place. They might be allocated somewhere else, so I have to restart my search. That’s a waste of time so I don’t want to do that either. What I do instead is, use documented stuff, supported stuff. SQL Server supports concurrent use. There can be multiple users.
SQL Server has a process called “locking” to say: “Hey, this row is protected, you shouldn’t touch it. And if you want to touch that row, you are blocked. You need to wait until the lock is released”. Wait until the lock is released, that sounds like exactly what I need, because I can determine how long I keep the lock. So I can make you wait as long as I need. That’s exactly the solution I need. And bonus, it’s completely supported! I’m not doing any weird stuff.
I’m using supported statements only. So let’s jump in the demo and see how this works.
17:22 - Now what I have here is first, as a comment, the original statement. The reason I have it here is because we need the results. I already executed the statement, I have the results here. I’m not going to change this window, I’m not going to run anything anymore here. I already have the result, so I don’t need it anymore. Then I have a bunch of code in the next commented section, and I’m going to copy that, and then open a new window and I’m going to paste it there.
What this code does is, I’m going to create a table called BlockWindowSpool, because this table is used to block the progress in the Window Spool operator. This table has four columns, and these four columns correspond exactly to the four columns in the output here. I checked the AdventureWorks schema, I checked the data types of all the columns, I checked the uniqueness rules, just to make sure that I know that these are the correct data types and I can store the output in those columns; and I also know that the output will be unique on the combination SalesOrderID, SalesOrederDetailID, and that’s important because I need a primary key, or at least a unique index, to do what I want to do.
Now I want to determine where I want the execution to stop, because i want the execution to run for a while, and then after we exceed the ten thousand rows - let’s make sure that we have a full window, so i want to be past the first ten thousand rows, and let’s say I want to go to… that’s a bit too far, go back a bit… this always takes a little while but we’re there. Let’s say I want to stop at this row. So we have SalesOrderID 46,359 and SalesOrderDetailID 10,101.
I think that’s the one i have in my code. No, it’s 10,100, sorry I mean this. So we’re going to stop at this row. So I take these two values, I don’t care about the other values, I care about the values that are in this primary key. So I use those two values here, and because I start a transaction, then add data to a table, and then… this is commented out, I don’t roll back or commit the transaction, I leave the transaction open, so at this point we have a running transaction, it’s not completed yet, that has data stored in this BlockWindowSpool.
Now I’m going to first check the execution plan that is used for this statement, so let’s make sure that we’re in the correct database. Execute. Now I’m going to select this statement. I want to check the execution plan only. The reason I do that is because adding the insert can affect the execution plan. Sometimes SQL Server realizes: “Oh, if I do another execution plan so I get the rows in another order, then I have better performance”. I want this execution plan to produce the rows in the same order.
I want full control. I want to know for sure that when this row is produced, it is in fact row number 10,100, and the rows are not produced in another order, because I want to have this full control. So what I did was, I checked not only the graphical execution plan, I checked all the properties of all the operators to make sure that there’s zero difference. I will not do that now on camera, it’s a lot of time, it’s boring, but it’s needed. In this case the execution plan is the same, so even though I won’t see the results, they are inserted into a table so I can’t see them, I know I can trust that the results will be produced in the same order as when I did see the results.
So let’s run this query and see what happens. The query just starts running normal. There are no results because they are now inserted, as a Clustered Index Insert is the only addition compared to the original execution plan we had. But as rows are being processed and inserted you will soon see that now the counters stop. Yes, the timing counters still run, but the row counters have all stopped. You will see that the Segment has returned 10,105 rows, that’s a bit more than we expected but that’s because of the 3 FOLLOWING, the Window Spool operator has to read ahead.
You cannot return row number 10,100 without also having seen rows 10,101, 10,102, etcetera. The Stream Aggregate has just returned row number 10,100. Clustered Index Insert tries to insert row number 10,100 but it can’t. It has inserted 10,099 rows; it tries to insert row number 10,100, but because I have this row inserted in an uncommitted transaction, that value is blocked, and the operation is not allowed to insert. And this will continue as long as A want to until I say: “Okay, you know what, let’s just roll back this transaction”.
And now this process is running again. I don’t need it right now, so I’m going to stop it. But now we have figured out a way to block the query for as long as I need it. For as long as I want. I can just leave it open for days if I want to. No problem. Okay, I can’t reset my computer, I can’t use it for other things, but I can.
22:49 - Now that we have the query frozen, that we have all the time in the world, it’s time to find the objects. So where do I find this worktable? Well, what will not work is just type SELECT * FROM… because from what? What’s the name of this worktable? Sure, let’s move back in the demo, and I ran this original query with STATISTICS IO ON. That means that I get information here about tables used, and as you see I used a table called “Worktable”. But here’s the thing, this is not the name, this is just a token.
For SalesOrderDetail, this is actually the name. But for Worktable, it’s a symbolic name. The table isn’t named ‘Worktable’, and if you type SELECT * FROM Worktable, you will get an error because there is no object named Worktable. Doesn’t exist. So that doesn’t work. Okay, something else then, perhaps? Uhm, use DMV’s, like sys. indexes or sys. sysindexes. And the best way to find stuff if you don’t know the name but you want to see what is added to sys.
ndexes, the best way to do that is to have two query windows open, one before and one after, and then we can compare the results of the two. Let’s try it.
24:13 - So I have a query here. I’m going to copy it entirely, and create a copy in a new window, and I’m going to run this query already. So this is the sys. indexes and sys. sysindexes when the Window Spool is not running. This has been stopped, so the Window Spool has deallocated its structures, there is no worktable now. Now I am going to set up my block again, to give myself as much time as needed; start the Window Spool again; and now at this point a worktable… not at this point, I had something highlighted, but at this point a worktable should be allocated for the Window Spool. So here, I’m not going to touch this, hands off. This is the image of sys. indexes and sys. sysindexes before executing the Window Spool. Now with the Window Spool there, I’m going to look at sys. indexes. and sys. sysindexes again, and I could Ctrl-Tab between the two windows to compare differences. But let’s first look at something else.
Let’s first look at the total number of rows returned. So now with Window Spool running, it’s 380 for the two result sets combined. Before I started the Window Spool, that was this window, it was… also 380. So the resultsets have not grown. I have allocated something in tempdb, but it’s not in sys. indexes nor in sys. sysindexes. Okay that makes sense. It’s an internal object. Why should it be exposed in sys. indexes and sys. sysindexes. Okay, perhaps it was, it was worth a try.
But there is no real need for SQL Server to expose this. So, yeah, this is a failed attempt, but it was worth a try. You know what? What I also can do is, instead of looking at sys. indexes and sys. sysindexes, which are DMV’s, I can look at the underlying system tables. Those those are normally protected from view. You cannot simply query them. But they have information much closer to the engine, more raw information. So i can compare them by looking at the number of rows returned in the execution plan, I’ll show that in the demo.
You could also use DBCC PAGE, for instance. The euuhh or you can use the datamate… datamated… I can’t pronounce words anymore! Uhm, you can also use the dedicated administrator connection, but there’s only one dedicated admin connection, so then you can’t use the Ctrl-Tab trick to alternate between the two windows and you have to do more work. So what I did instead was look at execution plans to see how many rows are returned from each underlying actual system table, and then if needed you can use DBCC PAGE to check the actual data in those tables.
So let’s look at whether this is a good approach. I deliberately executed these two queries with the execution plan enabled so this was… the window with the copy is the one I executed before starting the Window Spool. And this one is the one I executed after starting the Window Spool, so even though sys. indexes and sys. sysindexes show the same number of rows, in the execution plans you can look at all the underlying system tables, and I’ll save you some time and tell you that the most important data where the actual indexes are stored is this Clustered Index Scan on sysidxstats.
27:55 - That was the wrong click, my apologies. And if you look, you will see that the number of rows is 166 but the number of rows read, that’s the number of rows actually in the system table, is 214. So there were rows in the sysidxstats system table that are not returned. See! SQL Server is hiding some internal tables from us! They’re not intended for us, but they’re stored anyway, and they’re hidden based on this Predicate, which looks incredibly hexadecimal and very illegible to us.
But it does confirm that there is more stored in sysidxstats than we see in sys. indexes. 214 rows after the Window Spool started. And before the Window Spool started there were… also 214 rows. And yes, when I investigated this I did more work. I actually did go to DBCC PAGE to look at the actual data within sysidxstats to make extra sure that I wasn’t overlooking something. But the reality is, this entire worktable used by the Window Spool, is simply not stored in the DMV’s, nor in the system tables.
And again, it makes sense. There’s only one location within SQL Server that needs to know the details of this worktable, and that is this Window Spool operator. And that Window Spool operator can store that data in its work memory. It doesn’t need to store it in DMV’s. So nothing about that worktable is stored. Let’s stop this, and we’ll return to it later, but let’s first make sure that everything is clean again. So now I have seen what doesn’t work.
However, there’s one thing that you should be aware of. SQL Server supports concurrency. Even if nobody except the Window Spool itself uses the pages allocated to that worktable, Microsoft still wants to make sure that, while the Window Spool is running. other sessions that also do work in tempdb don’t use the same page. So that’s where it gets interesting, because even though the information about the worktable is never stored anywhere, one thing does get stored, that is: “This page is mine, keep your hands off it because it’s mine.
it’s reserved”. One of the places that is used for that is the PFS page, the Page Free Space page, and if you have a database file of I believe it’s up to eight gigabyte, I don’t know the exact number, then there is one PFS page; if you have larger database files or if you have a database with multiple files, then you have multiple PFS pages. I like to keep my life easy and that’s why, for this specific research, as shown in the prepare script, I have created a separate instance that I’m using only to test here.
And I have created tempdb with just one single file. For performance, that’s a bad idea. . That’s not how I normally configure my instances. But now it helps me because it’s a single file; it’s small, so i know that there’s just one PFS page. I will know that if anything changes in which pages are free and which are not, it will be visible in that single PFS page. I don’t have to go hunt for other PFS pages. I’ve also switched off parallelism, to make sure that all the rows are always returned in the same order and you don’t have multiple threads returning rows.
Because then I could block one thread but other threads would continue to run, and that would make my life a lot harder again. So that’s why I also disabled parallelism. Let’s clear out some stuff, while I’m in the demo anyway, and then I’ll return to the demo later once we checked what we’re going to do. So I’m going to look at the PFS page, and I’m going to use the same trick with two windows and Ctrl-Tab between them to see the differences. As soon as I find the page that was free and then later gets allocated, that is a page that is probably allocated to the worktable in the Window Spool.
At that point I’m going to check the header of that page to do some double checking, and then use a DMO called dm_os_buffer_descriptors to quickly find all other pages of that object. An alternative way would have been to check all the pages that were first not allocated and then allocated later; or to use the previous page and next page pointers. But this is the quicker way so that’s what I used. And once I found all the pages allocated to the temporary table used by the Spool, I can use DBCC PAGE to inspect their contents.
And that’s the strategy I will use to find and inspect the object. Let’s look at this in the demo again.
33:01 - Now one more thing I’m going to do before running this demo is, I’m going to restart the instance. And the reason I do that is because there is some optimization going on in SQL Server, tempdb specifically, because SQL Server knows that tempdb constantly uses space, releases it again, uses it again, releases it again… so if some pages were allocated, then are free, it doesn’t immediately give them back; it keeps some pages allocated because it knows it’s probably going to need it again.
So after running a few experiments, tempdb will simply reuse pages it already has, instead of allocating pages that were free. To quickly find pages that were actually allocated, I need to start with a clean tempdb. At this point, nothing has been allocated in tempdb yet, and I can actually start looking. So let’s take some code, and this is code that looks at page 1 in file… sorry… page 1 in file 1 of tempdb. The 3 is the output style, 3 is a rather readable output style, it formats the output in a way that is easier to parse than the other alternatives.
It’s official name is dump style. I can run this, and if I look at the output I see some interesting stuff about the page header, then below that a list of all the pages that were allocated or not allocated, and some information about how they are allocated. Now if I use the same code in this window and execute it here as well, I obviously get the exact same results because we did nothing in tempdb. Both are are executed immediately after each other without doing anything.
But now we’re going to do something in tempdb. I’m going to set up the block again, we have a row that’s blocking the Window Spool. I’m going to execute the Window Spool again, and now I’m going to wait for a bit because at this time stuff is moving in tempdb. Data is being allocated, data is being read, data might even be released. I don’t know exactly what Window Spool is doing, but it’s doing stuff, and it’s doing stuff in tempdb. So if I look now, it will change from one millisecond to the next.
Instead of doing that, I will wait until, as you see now on the screen, all the movement has stopped. At this point, all this query does is wait for Clustered Index Insert to say: “Hey, I managed to insert a row and you can give me the next row”. But Clustered Index Insert will not tell us that because I have this block that I’m not going to release. Now I’m going in this window to refresh the data, and look at how the PFS page looks now. So this is the window that shows the PFS page after Window Spool was started.
This is where you see the PFS page before the Window Spool was started, and I can Ctrl-Tab between the two, and if you look at the output results you see no difference. And if I page down, you see again no difference. If you page down again you see a difference in the page allocation status, but that’s not really intended for human use and I’m not going to look at it. What I’m interested in is that long list of page allocations below it. Now sometimes when I experiment with this, there is a difference in the has ghost or has no ghost.
This time it seems to be not the case, because I don’t see that difference. Whoops I scrolled the wrong window, let’s make sure they’re the same pages again. No difference. At one point you see in… Here we see it! Now I see a place where I’m looking at the same set of pages, but there is a difference when I Ctrl-Tab. So here is what we had before we started the Window Spool. Page 125 is allocated; 126 and 127 are both not allocated. Then I started the Window Spool, and now 125 and 126 are both allocated; 127 still isn’t.
So the difference is in page 126. This is one page and… if I scroll further there will be more, but I have an easier way to find them, but I now know that at least page 126 was allocated when I started the Window Spool. That doesn’t make it 100 percent sure that it is for the Window Spool, but it makes it very likely. So I’m plopping 126 in DBCC PAGE to check the contents and I’m using dump style 1 because I’m at this point only looking at the page header.
What I want to see in the page header is first the object id. The object id is normally a positive number that corresponds to an object in the sys. objects table. Here it’s a negative number, and that only happens for internal objects that are not officially documented in the objects table. So here we have a confirmation that this page belongs to something that is an internal object and not a normal table.
38:07 - That means that it still might be our worktable. I’m still not 100 percent sure, but I’m getting more and more convinced. So I’m looking at the allocation unit id now; I’m going to copy that into… Ctrl-C and then Ctrl-V to paste it here. And now I can run this query to find all the pages that have the same allocation unit. This is the fastest way. There are other ways, but this is the fastest way to find all pages that have the same allocation unit, with their page type and some other information.
And i see an index page, an IAM page, and a lot of data pages; and if i scroll down you will see that the rest is all data pages.
38:49 - As one of my experiments I’ve done the same after letting the Window Spool run for a much longer time, after having itprocess a hundred thousand rows. At that point there were way more pages here, so that shows that all the data read is stored in this worktable. It doesn’t release data that’s not needed anymore. Technically it only needs the last 10,000 rows. But it can store up to 100,000 rows or more if you use a larger set. At that point you also get multiple index pages, because this data is effectively a B-tree.
Good, we have a B-tree. There’s more confirmation of that later. I’m not going to point it out, but I had a lot of things confirm that this is a B-tree. This is a relatively small B-tree with just one index page, which is the root, and a bunch of data pages. If I have more data, there will be more index pages. The root page is at page 369, so I’m going to copy that, paste it here, and check the data on this page. And for the record, the reason I use dump style 1 here is because I have to.
Normally when I check data pages, I use dump style 3. If I do that here, I get some header information and then if I scroll down I get an error message. Dump style 3 is not possible. That’s because this page is allocated to an internal object that is not documented in the sys. indexes tables. DBCC PAGE uses sys. indexes for the formatting for dump style 3. So if that information is not available, dump style 3 doesn’t work; that’s why I have to use dump style 1.
Dump style 1 does mean that the data will be harder to read. This is all we get. Hexadecimal data, and you have to know exactly what all this means. Well, the good news is, I know what all this means. I have read this book by Kalen Delaney. It’s the SQL Server 2005 version. I probably should buy a new version one of these days, but her description of how rows are stored on pages is still accurate, even on SQL Server 2019. So I use the information in that book to help me understand the data that is here.
And one of the things that I noticed… that I happen to know is these two bytes here are a pointer to a file. Now tempdb in my sample database only has one file, so obviously that has to be the number 1. These numbers are always read in a right to left fashion, byte by byte. So you read it as zero zero, zero one; and that is the number one. The four bytes before that are therefore read as zero zero, zero zero, zero one, seven zero. Hexadecimal 170.
Now I’m going to bring up a calculator, put it to hexadecimal mode, and 170 is decimal 368. So what I see here is that in this index row, there is a pointer. We have data on page number… what was the number again… 368. Now double check: run this query again. 368 is indeed a page that’s also allocated to this object, so i didn’t make an error. Probably. Let’s look at that page, 368, and again I need to use dump style 1. Dump style 3 here doesn’t give an error; it simply gives nothing even though there is data there, which is even more confusing than an error.
Dump style 1 shows me all the data that is there. This is one of the many pages in the worktable, and this is one of the rows. And when I first saw this, I got very excited, and I’ll tell you why. I got excited when I saw 4911-403c-98. Why does this information make me excited? Because I had looked at the results of the query before, and when I go to the first row in the results and then look at its carrier tracking number: that’s 4911-403c-98. That’s the same data we see here: 4911-403c-98.
We have here the carrier tracking number of the first row in our data set. So now I know for sure that I have found data that belongs to this worktable. I have found the worktable I was looking for! I was not supposed to find this. SQL Server tries to make it hard for me, but I managed to get there. That made me very happy. But it was not yet the end of my journey. So let’s return to the presentation, and look at what I did next. So I now have found an index page, in this case one, if I let the query run longer there will be more.
I found lots of data pages, and on those data pages I found the data that is stored for the worktable. But how exactly? What is the structure? How is this data stored? Well, before diving into all that hexadecimal garbage, let’s think about what we expect. So I returned to the execution plan, and I looked at the Segment operator that feeds data to the Window Spool. And I looked at its Output List property. The Output List property tells us what data do we expect to find… sorry, the Output List property tells us what columns are returned from this Segment operator to the Window Spool operator, and that’s this list of seven columns. I also looked at the Window Spool operator, at its Output List property, to see which columns the Window Spool returns to its parent, the Stream Aggregate. And when you compare those two lists you see that the SalesOrderID, SalesOrderDetailID, CarrierTrackingNo, and RowNumber, are in both, so it makes sense that those are all stored in the worktable.
Logic, I mean, they get in, they get out, they are stored somewhere. That’s what the worktable is for.
45:03 - Same for Segment1009, that’s also a column that’s passed into and returned from the Window Spool. so it makes sense that it is stored. TopRowNumber and BottomRowNumber are presented by Segment to Window Spool but are not returned; but based on how Window Spool works I would expect those in the worktable as well. WindowCount1010 is produced by the Window Spool as it returns rows; and based on how it works I personally did not expect this column to be in the worktable.
So my expectation is, the seven columns SalesOrderID, SalesOrderDetailID, CarrierTrackingNo, RowNumber, TopRowNumber, BottomRowNumber, and Segment. Of course that’s an expectation. I can be wrong. But it helps to know what we are looking for, to find either a confirmation or, okay, we found something else and then we need to figure out why my expectation was wrong. It does give me something to look out for, because now I know what data I’m looking for.
46:05 - And then I started reading data. Like I said, Inside Microsoft SQL Server 2005, the Storage Engine, written by Kalen Delaney; or any other books… there is lots of documentation about how to read DBCC PAGE output; this is not something you do on a daily basis, it’s probably something you never do on your normal work. But if you are investigating internals, information like that helps. So I started with the index page, and here you see a screenshot of one of the earlier times I did this.
It was a different run at a different time; data was found on different pages. On the index rows that means that you find different data there because index rows have pointers to other pages. And if those other pages are allocated at a different location, then those pointers change. So these are not the same values you just saw in the demo, but it’s another time I ran the same queries.
47:03 - So the first byte in an index row is always a status byte, and the value 06, you can just look it up in the in the documentation, means this is an index row and there is no null bitmap. DBCC PAGE also tells us that because it says here: it’s an index record, and there is no null bitmap. But even if that were not… you can now verify that indeed the 06 matches. And as I also already said, in the index page you find at this specific location the pointer to the file number and the page number where a page is stored.
And that page is the page that starts with… well not this value. In a B-tree index, on every index level on the first page in the level, on the first row, the index key is unused. And when SQL Server doesn’t use data, it doesn’t bother to initialize it. So whatever garbage was in that location on disk or in memory when this page was allocated is left unchanged. The bytes are reserved because it’s a fixed length field, but there’s nothing there. It’s just whatever garbage is there; this value is never used because it’s the first entry on the first page of a level.
If you go to the next index row here the status byte is 16, meaning it’s an index row with a null bitmap; and again DBCC PAGE tells us that: index record, null bitmap. And here, you see also a pointer to in this case file number 1 page number 282. And we have an actual value here saying that on page 282, the first row will have the index value 113 decimal. 71 hexadecimal. Also, because there is a null bitmap, we find the number of columns as two bytes at the end, and the null bitmap itself.
Well this one column here is not null and we already know that because the value is 113. Then finally, for the last… for the third row on this screenshot, we have something similar: we have page 283 starting with index key 225, and I checked some of those pages and they did indeed start with those values, so it’s all still a very normal B-tree index.
49:33 - Let’s now look at the data page. The data page is much more interesting, because that’s where the data is stored. Well, the index page was interesting as well because it keeps confirming me that this worktable is created as a normal B-tree index; it’s not using specific structures for worktables only, it’s using a B-tree to store data. But the data stored is also interesting, so here we have a screenshot of some of the data on such a page. The same one we saw in the demo, the first page.
50:04 - Because it’s a data row we have two status bytes, and the two status bytes is 3000, that encodes it’s a data row, there is a null bitmap, and there is at least one variable length column. Cool, we know that now. Then the next two bytes is the position where the number of columns is stored, and that position is further in the row because the way SQL Server stores data on page is that after the status bytes and the position of the number of columns first comes all the fixed length data.
So we’re going to skip some bytes that are the fixed length data, and go to the place where the number of columns is stored. Hexadecimal 25 happens to be this location, and the value here is 0008, so we have eight columns in total in this row.
50:59 - After that we have the null bitmap. The null bitmap is one or more bytes, depending on the number of columns. If there’s 1 to 8 columns the null bitmap will be one byte if there is 9 to 16 columns, the null bitmap will be two bytes, etc. Here we have 8 columns total so the null bitmap is one byte.
51:20 - After that come two bytes that represent the number of variable length columns, so in this case we now know that this row stores eight columns; two of them are variable length, which means six of them are fixed length. Those two variable length columns each have their end location encoded, so the first variable length column ends just before byte number hexadecimal 2E in this row, and the second variable length column ends just before byte number hexadecimal 46.
I counted the bytes, and the first variable length column ends immediately after the… so just before that 34, and that means it’s a zero length column. Yes, variable length columns can be zero length. Obviously, if it’s zero length, then the content is empty and it’s a bit harder to understand the data that’s there. So at least for this row, the variable length column is very hard to decode because it’s empty. I can look at other rows later, but let’s first check this row further.
So I’m going to skip this first… you know what, let’s move my head around a bit because I’m in the way of my slides, so I’m now on the other side of the slide deck. And now you can see that the first variable length column I put question marks in there because I don’t know yet what is there. I’m going to skip that. The second variable length column ends at the end of the entire row, and we already saw in the ASCII representation on the right of this data that this is the carrier tracking number, an nvarchar column, which is where there’s two bytes for each character.
So we found the carrier tracking number. We found a second variable length column that we don’t understand yet, and we know that there are six fixed length columns. Let’s look at those first. Now, the fixed length portion starts with eight bytes that’s if you read them in the correct order a bunch of zeros and the end of one. Zero zero zero blah blah blah one. That’s hexadecimal, but in decimal is also the value 1. And I was expecting a column for the RowNumber, and the first row obviously has row number 1.
It’s 8 bytes which is a bigint. That means that I found that this is the RowNumber, stored as a bigint, and for row number one it’s the value 1. Now then I cheated a bit. Instead of trying to understand what 8BAA was, I did something else. I went to the data. So let’s go to the data, and I expect that at least this value, 43659, is stored. So I copy this value into a calculator, decimal, and then I say what is the hexadecimal version, it’s AA8B. Well if you know that you’re looking for AA8B then it’s quite simple to understand that these four bytes are the integer AA8B.
So that’s how I found that this should be the SalesOrderID. Cool. Then, after that, is four more bytes for the SalesOrderDetailID, and I knew that the value was one in that integer as well so that was also easy to find. And then I ran into a real head scratcher, because after that, if you look at the screen you see the value 05F1D8, and I looked at it and it didn’t make sense. And I looked at it a bit more and it still didn’t make sense. And I ran into a wall there.
So what i then did was, I know that there’s six columns total in the fixed length portion and on the starting point… I also know the ending point. So let’s work backwards from the end of the fixed data. In the end of the fixed data there’s eight bytes there that are very easily recognized as the number four. Because I already checked the execution plan, I know that there is a BottomRowNumber column, that’s computed as row number plus three; well that’s four, so these are the eight bytes that store the BottomRowNumber.
And once you are there, you realize that just before that sixth column, so in the fifth column, there’s a lot of F’s. And in the hexadecimal representation of an integer, if there’s a lot of F’s, it’s a negative number. Well, surprise, we needed a negative number, because we also have the column called TopRowNumber, which is computed as RowNumber minus 10,000. For row number one, that’s minus 9,999. I use my calculator. Minus 9,999 in hexadecimal using uh um the normal notation with uh one reverse… I forget the name but using the normal encoding of integers in hexadecimal, minus 9,999 is represented as a bunch of F’s and then D8F1. And that’s exactly what I found in this row. So now I have found that it made sense that I didn’t understand 05F1D8 because it’s just a part of one column plus another column. And the only missing column is one single byte. And that single byte is the only thing that’s still giving me a headache. On the first row you see the value is 05 and on the second row you see the value is FE.
And it’s hard to relate that to the data. I was unable to make sense of it. So I ran a bunch of experiments, and I stopped for the day and I restarted the next day, and then at one point I realized something. There’s something weird with that weird mysterious single byte column, and that is: If i run the same code multiple times, it’s not always the same data. Now for an index page that makes sense. For an index page the data can be different. Because index pages point to other pages and data is not always allocated in the same location and if it’s allocated somewhere else then the pointer needs to change.
For data pages I don’t expect the data to ever change, and yet this byte changed. And then I noticed something else. The first row always has an odd number. No matter how it changes, it’s always odd. All other rows in the work table that I looked at, and of course I didn’t look at all 121,000, but I looked at a lot of them, and all of them had an even number except in that first row. And then I realized this is not a byte I’m looking at. Yes it takes a byte, but I’m actually looking at eight bits.
Seven of them are unused. Remember what I told you about what SQL Server does with locations that are unused: nothing! It just leaves whatever garbage is there. So those seven bits can be different between runs, because they’re not initialized, it’s just whatever is there. Only that eighth bit is controlled, and that’s for the segment column that is always one on the first row which is why it’s always an odd number on the first row and always even on the rest.
So now I figured out that fourth byte is simply seven unused bits and one bit used for the segment column. That’s seven of the eight columns down. The only thing left was this variable length column. And I will admit, I probably would not have figured this out if I had only looked at Windows Spool. But before looking at Window Spool, I had already investigated Table Spool and Index Spool. And there is an interesting similarity between all of them: they all use clustered indexes, they all use B-trees and they all define that b-tree as a nonunique clustered index.
Now in a nonunique clustered index, you always have something called the uniquifier. So once I noticed when I went through all the pages, run after run, comparing data after data, and that column was always zero length on every row on every page wherever I looked, it was always zero length. It was always there, and it never had any data. Once I realized that, and I saw the similarity with Table Spool and Index Spool, I realized this B-tree is the B-tree of a clustered index not a nonclustered index, and the clustered index is declared as not unique.
When you declare a clustered index as not unique, then SQL Server will define a hidden column, called the uniquifier, and the first time it sees a value that uniquifier remains empty. It’s a zero length column. If you then insert the same value a second time, you have a duplicate, which is allowed in a nonunique index. But for a clustered index there still needs to be some uniquity, and that’s why the uniquifier column is then used to add an extra four-byte ascending integer to make the rows unique.
Now in this case, I had a clustered index defined on the RowNumber. Due to how the execution plan works, there will never be duplicates there, so the index didn’t need to be defined as a nonunique clustered index. It could have been defined as a unique clustered index; then this column would not have been there. But for whatever reason, the way the product is built, a Window Spool defines its worktable as a nonunique clustered index; and that’s why we get this always empty column in the rows.
It’s there; it’s always empty, it’s always zero length, but it’s there because a clustered index needs it if it is defined as nonunique. And with that, I have finished the last problem to overcome: reading and interpreting the data. I have managed to do that. Give me a second while I move to my favorite corner, and then I’ll talk you through the conclusions. So what I have seen is: if a Window Spool uses an on-disk worktable, that worktable will be defined as a clustered index, defined as nonuniquel on the column that contains the row number, which is a unique number but it’s still declared as nonunique.
61:50 - My suspicion was confirmed that indeed all the input columns are stored in that worktable, but the output column WindowCount that is generated by the Window Spool is not.
62:04 - And that’s all I have. If you have questions, please mail me or tweet me. I will try to answer them. If you want to learn more about execution plans, go to my website. Check the reference if you have specific things you want to know. Check out the training videos if you want to learn more. And also, if you want to see more of my videos: comment, like, subscribe. I think that’s what all the cool kids say when they’re on YouTube. Thanks for watching.
This was Debugging without debugger: Investigating SQL Server’s internal structures.
62:40 - My name is Hugo Kornelis. Bye-bye!.