TDT4205 Compiler Construction lecture 19-1: Data flow analysis overview
Mar 19, 2020 11:29 · 3321 words · 16 minute read
Welcome to our first video lecture on compiler construction. Today we’ll continue with an overview of how dataflow analysis actually works and how it fits together with the general framework we introduced in the previous lecture. In upcoming lectures we’ll take a closer look at different dataflow analyses that make use of our framework. Today we’re taking a closer look at classical bit vector dataflow analyses. These are called bit vector analyses because the information they have to process can be represented using sets of bits.
Bit vectors representing the elements of a set can be kept in a word consisting of 16, 32, or however many bits, and all operations we have to perform can be carried out using bit vector operations alone. Note that there are also forms of dataflow analysis which require additional operations that are not binary operations, but we won’t consider them here. We’ll give a quick overview of how to represent sets as bit vectors, because we want an efficient implementation of sets. Essentially we can keep a number of set elements inside a single word: for example, if we have a 32-bit word, then each bit simply indicates the presence or absence of an element in our set.
If a bit is set to one, the corresponding element is present in the set; if it is set to zero, the element is absent. For example, we can represent the variables in our previous example as a 5-bit set: we have five variables c, d, x, y, z that we assign to a 5-bit word consisting of bits 0 to 4. Bit 0 represents variable c, bit 1 represents variable d, and so on. The example here, one, zero, one, one, zero, would mean that c is part of the set, d is not part of the set, x and y are again part of the set, and z is not included. So we have represented a set consisting of the elements c, x, and y.
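As a sketch of how this encoding might look in code (the variable-to-bit mapping is the one from the example; the helper names are made up):

```python
# Map each variable to a bit position, as in the lecture's example
# (bit 0 = c, bit 1 = d, bit 2 = x, bit 3 = y, bit 4 = z).
BIT = {"c": 0, "d": 1, "x": 2, "y": 3, "z": 4}

def encode(names):
    """Encode a set of variable names as an integer bit vector."""
    bits = 0
    for name in names:
        bits |= 1 << BIT[name]
    return bits

def decode(bits):
    """Recover the set of variable names from a bit vector."""
    return {name for name, pos in BIT.items() if bits & (1 << pos)}

s = encode({"c", "x", "y"})
# The lecture writes bit strings with bit 0 on the left ("10110");
# Python's binary literals put bit 0 on the right, so this is 0b01101.
assert s == 0b01101
assert decode(s) == {"c", "x", "y"}
```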
If we take a look at our example, we have all these sets of different variables in our analysis, and now we can simply replace them by the equivalent sets of bits. This makes them harder to read for us as humans, but easier for a computer to operate upon. Just remember that this is efficient as long as the number of set elements is reasonably small; usually, all the set elements should fit into a single machine word. When we want to do set operations on bit vectors, we can implement the typical set operations using the boolean logic operators we already know. For example, the union of two sets can simply be implemented using the OR operation: if we have two sets {c, x, y}, as before represented by 10110, and another set {x, z} represented by 00101, we can simply do a bitwise OR. 0 OR 1 is 1, and 1 OR 1 is also 1, so an element that occurs in both original sets still shows up only once in the target set. Using this we can perform the union or join operation of these two sets, yielding the joined set 10111, which represents {c, x, y, z}, the union of all variables contained in the original input sets.
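The union example, written with Python integers as bit vectors (Python’s binary literals put bit 0 on the right, so the lecture’s 10110 appears reversed as 0b01101):

```python
# bit 0 = c, bit 1 = d, bit 2 = x, bit 3 = y, bit 4 = z
s1 = 0b01101             # {c, x, y} -- written 10110 in the lecture
s2 = 0b10100             # {x, z}    -- written 00101 in the lecture
union = s1 | s2          # bitwise OR implements set union
assert union == 0b11101  # {c, x, y, z}: x shows up only once
```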
We can also do this for the opposite operation, the intersection or meet operator of two sets. We have implemented join as OR, so conversely AND implements the intersection operator. If we have two sets {c, x, y} and {c, y, z}, we use a bitwise AND, and only those elements present in both sets will have a one in the resulting set, and thus be part of it: we have {c, y} here as common elements, while x and z are no longer in the resulting set. If we need a set complement, we can implement it using XOR, or a NOT and AND operation, as we’ll see in a bit. We can also use bit vectors to represent graphs themselves, since bit vectors can represent a graph’s adjacency matrix. Starting from our control flow graph on the left-hand side, we can transform it into a somewhat simpler representation that just shows the numbers of the basic blocks and the transitions, the jumps or control flow edges, that are possible between them. We can then transform this into a set of bit vectors, or a matrix: each row corresponds to the basic block we’re considering and each column to a basic block that is a target of a control flow edge. For example, we can go from block one to block five, so in the row for block one we have a 1 in column five, which indicates that there is an edge outgoing from one and incoming to five; another edge from two to three would be represented in the same way.
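The meet and complement operations can be sketched the same way; here the 5-bit universe 0b11111 stands for the set of all five variables:

```python
ALL = 0b11111            # universe {c, d, x, y, z}
s1 = 0b01101             # {c, x, y}
s2 = 0b11001             # {c, y, z}
meet = s1 & s2           # bitwise AND implements intersection
assert meet == 0b01001   # {c, y}: only the common elements survive

# Complement within the universe: XOR with ALL, or NOT followed by AND.
assert (meet ^ ALL) == (~meet & ALL) == 0b10110  # {d, x, z}
```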
Now when we look at control flow graphs we need to get a couple of definitions right. Given the edges in the control flow graph (we didn’t indicate all of them with explicit arrows, but of course control only flows in one direction), we can define successors and predecessors of the basic block we’re currently looking at. For example, the successor of block 2 would be block 3, the predecessor of block 2 would be block 1, and the predecessors of block 3 would be block 2 and block 5, because there’s also an edge going up there. We have two distinguished unique nodes: a start node, which has no predecessor, so we have to figure out the initial conditions; and an end node, which has no successor, so for this we have to define suitable ending conditions. We do a bit of simplification here: in general control flow graphs we might have situations where the end block is not reachable, because we have something like an infinite loop. Here we assume that every basic block in our control flow graph is reachable from the start block and, the other way around, that the end block is reachable from any block inside the control flow graph. When we want to perform dataflow analysis on this representation, we view the computation of data through expressions and the transition of data through assignments to variables.
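A sketch of the adjacency matrix representation together with the successor/predecessor definitions, using one integer bit vector per row. The edge set is a guess that merely matches the edges mentioned in the lecture (1→5, 2→3, and predecessors {2, 5} for block 3):

```python
N = 6          # blocks numbered 1..5; index 0 is unused
adj = [0] * N  # adj[src] has bit dst set iff there is an edge src -> dst

for src, dst in [(1, 2), (1, 5), (2, 3), (5, 3)]:
    adj[src] |= 1 << dst

def successors(b):
    """Blocks reachable by one edge from b: read the bits of row b."""
    return {d for d in range(N) if adj[b] & (1 << d)}

def predecessors(b):
    """Blocks with an edge into b: scan column b of the matrix."""
    return {s for s in range(N) if adj[s] & (1 << b)}

assert successors(2) == {3}
assert predecessors(2) == {1}
assert predecessors(3) == {2, 5}
```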
The properties of programs we have explicitly described in our control flow graph are now defined as properties of program entities: expressions, variables, and definitions that make up our program. Again we take a very simple approach here. We restrict our expressions to primitive expressions involving a single operator. We’ve already seen this with three-address code: we split the operations we perform into simple operations, so that we can operate on the intermediate representation we’ve already introduced. We also restrict ourselves to simple scalar variables, with definitions being assignments to scalar variables. Of course, real programming languages have richer and more complex data types, but again, we want to keep things simple here. Our general approach for performing dataflow analysis, if we have something like an expression in a program, involves two steps.
The first step is to discover the effect of the individual statements on the expression; the second is to relate these effects across the statements in our program. As we’ve already seen, we usually don’t do this on a single-statement basis: for reasons of efficiency, both steps are usually carried out over a basic block instead of a single statement, because then we have fewer blocks to care about. There are actually approaches that consider going back to single statements again, because the original concatenation of instructions into basic blocks was mostly done for efficiency, and computers might now be fast enough for single-instruction analysis. There’s an interesting PhD thesis on this.
But we’ll consider basic blocks, because that’s what’s common in the literature so far. So we have two steps. The discovery of the effect of individual statements on the expression is called local dataflow analysis, and it is performed once for every basic block. The relation of the effects across statements in the program, as we’ve already seen in our introductory example two lectures ago, is our global dataflow analysis, and this may require repeated traversals of the basic blocks in the control flow graph until we reach a fixed point, when we figure out that no more changes are going to happen. Take care: “global” here does not mean we’re looking at the complete program; we’re only looking at the effects of dataflow inside a single procedure, so we’re not assuming that effects traverse beyond procedure bounds.
We start with something inside a procedure and we end with something: we have incoming and outgoing data inside a procedure. So calling this “global” might be a bit misleading, but again, that’s what’s used in the literature. How we model the discovery of local dataflow information actually depends on the analysis we want to perform, and we’re going to have a closer look at this in the upcoming lectures. Nevertheless, there’s a common pattern of generation and/or invalidation of dataflow information, and we have these elements here: for the entities in our simple intermediate representation, we have variables x, expressions e, and definitions of a variable.
A definition assigns something, the result of an expression, to a variable, and for each of these entities we can have two operations. For a variable x, we can read (use) it, or we can modify its value. For an expression e, we can compute it, or we can modify one of its operands. And a definition of x may simply occur, any time x is assigned. A variable may be used, or an expression computed, on the right-hand side of an assignment statement, in a condition that alters control flow (a loop, an if condition, or a case statement), as an actual parameter in a function call, or as a return value from a function. All other operations on variables or expressions involve an assignment statement to the relevant variable.
You might be asking yourself what happens if we read a variable from the input; we’ve seen that we can’t really predict what happens then. We can safely treat this as an assignment statement that assigns an unknown value, a question mark or something similar we can use to represent it, so we simply don’t know much about the variable afterwards. When we look at local and global dataflow information in our control flow graph, we need to capture the relationship between the two, and we capture it using so-called dataflow equations. Dataflow equations are, in general, a system of simultaneous equations. We have to solve these, and, as you probably know from linear algebra, such systems may in general have multiple solutions.
How can we actually set up dataflow equations, and how can we obtain them? In a forward flow analysis, the exit state of a basic block B, which we call out(B), is a function of what happens inside the basic block, like this assignment here. So out(B) = trans_B(in(B)), where trans_B is the transfer function depending on the code contained in the block. Since a basic block in general contains multiple instructions, this is actually the composition of the effects of all the statements in the block. We look at what’s coming in, apply all the operations inside the block to our input variables, and then calculate what happens to the output variables. This is simple as long as we have a single basic block. What happens if control flow joins? Then we have to consider every path we might be arriving on.
For example, we have one path going via B1 and another path going via B2, and they are joined in block B3. As the input set of block B3, we now have to consider the join of the information sets coming in from B1 and B2. So the entry state of a basic block is, in general, another dataflow equation, a function of the exit states of all predecessors: in(B) is the join, over all predecessors P of the block we’re looking at, of their output values out(P). For a single incoming block this is simple, because there is nothing to join; with joined control flow we have to consider all of the predecessors. These combined exit states of all the predecessors form the entry state for the block B we’re currently looking at, in this case block B3.
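The two forward equations can be sketched directly on bit vectors; the block names and the transfer function below are made-up placeholders:

```python
def entry_state(block, preds, out_states):
    """in(B) = join over all predecessors P of out(P); join = union (OR) here."""
    state = 0
    for p in preds[block]:
        state |= out_states[p]
    return state

def exit_state(block, trans, in_state):
    """out(B) = trans_B(in(B))."""
    return trans[block](in_state)

# B3 joins the paths coming through B1 and B2, as in the example.
preds = {"B3": ["B1", "B2"]}
out_states = {"B1": 0b0011, "B2": 0b0100}
assert entry_state("B3", preds, out_states) == 0b0111

# A hypothetical transfer function that kills bit 0 and generates bit 3.
trans = {"B3": lambda s: (s & ~0b0001) | 0b1000}
assert exit_state("B3", trans, 0b0111) == 0b1110
```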
We can perform dataflow analysis in two directions, and each type of dataflow analysis has a specific transfer function and a specific join operation. In a forward analysis we traverse the control flow graph along the direction of control flow. This is used in several analyses we’re going to look at, like reaching definitions and available expressions. We can also do it the other way around: in a backward analysis we look from the output to the input and see what happens there, so we traverse the control flow graph against the direction of control flow. This is used, for example, in live variables analysis, or in something called very busy expressions analysis. Here we apply the transfer function to the exit state to yield the entry state of a block, and the join operation works on the entry states of the successors to yield the exit state. So it’s just the other way around, and depending on which analysis we perform, we need one direction or the other.
When we’re doing bit vector analysis we can consider two special sets, called “gen” and “kill” sets. We have already seen that bit vector dataflow analysis works on sets of facts. Facts are just the information you care about, and these sets can be represented efficiently as bit vectors. We’ve also seen that join and transfer functions can be implemented as logical bitwise operations: the join can be the typical intersection or union operation, implemented using a logical AND or a logical OR. The transfer functions, and this is important because it makes bit vector analyses quite a bit simpler than other analyses, can be decomposed into two sets, which we call the gen and kill sets.
gen(B) describes the points of the graph where a fact you care about becomes true: the dataflow information that is generated within basic block B. kill(B), on the other hand, describes the points in the graph where a fact you care about becomes false: the dataflow information which becomes invalid, as opposed to generated, in block B. As we’ve seen, we cannot give just one very general abstract definition, because the gen and kill sets have to be specified depending on the facts we care about, that is, depending on the analysis we’re actually going to perform. We have an example of gen and kill sets here: live variable analysis.
For live variable analysis, the join operation is union. The kill set, since we want to see which variables are live, consists of the variables that are written within a block, and the gen set consists of the variables that are read without being written first. The related dataflow equations are: out(B) is the join of the input information of all successors (we are going backward here), and in(B) is obtained by taking out(B), removing the elements killed in the current basic block, and joining in the ones that are generated in the current basic block: in(B) = (out(B) − kill(B)) ∪ gen(B). This is always expressed in the form of set operations, so if we actually want to implement it, we can do so using logical operations, with this piece of pseudocode. We start off with an empty set for out(B), then we populate out(B) by looking at all the successors of B, adding the input set of each successor to the output set of B. When we have traversed all the successors of B, we can finally construct the input set from the output set we just computed: using an AND NOT operation (we invert the kill set) we remove the killed elements from out(B), and then we join the result, using the OR operation, with the elements that are generated in B. That’s all for today.
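The pseudocode above can be sketched as a small round-robin fixed-point solver; the three-block example program is made up for illustration:

```python
# Hypothetical example program, one statement per block:
#   B1: x = 1        gen = {}     kill = {x}
#   B2: y = x + 1    gen = {x}    kill = {y}
#   B3: return y     gen = {y}    kill = {}
# Variables as bits: x = bit 0, y = bit 1.
X, Y = 1 << 0, 1 << 1
succ = {1: [2], 2: [3], 3: []}
gen  = {1: 0, 2: X, 3: Y}
kill = {1: X, 2: Y, 3: 0}

def live_variables(succ, gen, kill):
    blocks = list(succ)
    live_in = {b: 0 for b in blocks}
    live_out = {b: 0 for b in blocks}
    changed = True
    while changed:                      # iterate until a fixed point is reached
        changed = False
        for b in blocks:
            out = 0
            for s in succ[b]:           # out(B) = union of in(S) over successors
                out |= live_in[s]
            new_in = (out & ~kill[b]) | gen[b]   # in(B) = (out(B) - kill(B)) | gen(B)
            if new_in != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b] = new_in, out
                changed = True
    return live_in, live_out

live_in, live_out = live_variables(succ, gen, kill)
assert live_in[2] == X               # x is live entering B2
assert live_out[1] == X              # so x is live leaving B1
assert live_in[3] == Y and live_out[3] == 0
```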
We have some references here; there are some interesting books covering dataflow analysis alone. As you’ve probably imagined, this is quite a complex topic, and we’re only scratching the surface here. Dataflow analysis actually started very early with Gary Kildall, who unfortunately passed away many years ago. He’s famous for developing the first common personal computer operating system, CP/M, for 8-bit computers, but he did his PhD thesis, back in 1973, on dataflow analysis, and he was the first to perform global program optimization. He was thinking about translating code between processors with different instruction sets: he was writing a converter from the Intel 8080, the first (actually second) 8-bit processor by Intel, to 8086 code, because both processors were supported by his CP/M operating system, and he actually built a translator that was able to transfer code at the assembly source code level from the 8080 to the 8086. It performed some optimizations, and these were the result of his analysis. So he used a very practical problem to derive a very interesting theoretical approach, and his paper is actually interesting to read. As I already announced, these lectures are going to be a bit shorter than our regular lectures, and we have a couple more of them; upcoming lectures will look at several different examples of forward and backward dataflow analysis. But that’s all for now, so thanks for listening and see you next time. Bye!