Will Students Write Tests Early Without Coercion?
Nov 21, 2020 23:29 · 1661 words · 8 minute read
Hi, my name is John Wrenn, and I’m a PhD student from Brown University, here to answer the age-old question: Will students test early without coercion? But first, a quick disclaimer: I come from a pedagogic school that uses the term “examples” to refer to what others call “early tests”. By “examples” or “early tests”, we just mean test cases you write before implementing to build confidence that you understand the problem. In contrast, the test cases you write after implementing serve more to confirm that your implementation matches your understanding. In any case, either kind of testing is a form of self-regulation that students are pretty poor at. In study after study, we find that students often begin implementing with an incomplete understanding of the problem and, consequently, waste their time, hurt their grade, and fail to meet the learning objectives of the assignment.
00:51 - A formal problem-solving methodology like the Design Recipe from “How to Design Programs” can help students avoid this by scaffolding their problem-solving process. For instance, a student who first reframes a problem by writing some input-output examples has a chance to build a better understanding of the problem before they implement. And, once they complete their implementation, these examples serve as test cases with which they can check their work. Scaffolding-based interventions like this can significantly improve the productivity and self-efficacy of students. But we also see that, left to their own devices, students trained in a scaffold might not actually make use of it when they need it most.
01:30 - Unfortunately, realizing you need to use a scaffold is itself a feat of self-regulation. So we ask instead: What is it about implementation that makes it so much more naturally compelling to students than example writing? Well, compared to implementation, early test writing is almost boring. Implementing a problem rewards you with feedback upon every click of ‘Run’. Even a student who creates a flawed implementation will still be guided by language errors and runtime results. In contrast, examples are inert; until you have an implementation to run them against, you can’t get the same sort of feedback about them.
02:08 - Consequently, even a student who does write examples is ill-equipped to do so well. Consider these examples for “median”. Are they valid? Well, yes: they all accurately represent the behavior of a median function. But are they thorough? Not really. A student who has written only these examples might very well be confusing median for mean; these are valid tests of both functions. Unfortunately, there’s often no feedback to inform the student that they might be confusing median for mean, and if they are confusing median for mean (say, they write an outright wrong test case), there’s no feedback to inform them of that either! So, even students who do write tests as part of a programming methodology might not actually do it very well. Some educators have thus taken the approach of requiring students to solve a small number of pre-authored test cases before they begin implementing.
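The examples in question were shown on a slide, so the sketch below uses made-up inputs (in Python rather than the course’s language) purely to illustrate the same trap: every example is valid for median, yet an implementation of mean would pass them all too.

```python
from statistics import mean, median

# Hypothetical examples for a "median" function (illustrative inputs, not
# the ones from the talk's slide). Each example is *valid*: a correct
# median implementation satisfies all of them. But the suite is not
# *thorough*: every input below has median == mean, so a student who
# accidentally implemented mean would pass these tests too.
examples = [
    ([1, 2, 3], 2),
    ([2, 4, 6], 4),
    ([5], 5),
]

for xs, expected in examples:
    assert median(xs) == expected   # valid for median...
    assert mean(xs) == expected     # ...but equally valid for mean
```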
02:58 - That is: rather than asking students to formulate an open-ended set of examples from scratch, the instructor provides partial test cases that have inputs but not outputs, and has students fill in the blanks. This intervention ensures that students have explored the problem correctly and thoroughly before they begin their implementation. However, these approaches miss the learning opportunity provided by developing valid and thorough examples for oneself. We believe that if example writing can be made helpful and compelling, students will do it well and of their own volition! To provide students with feedback, we observe that a valid suite of examples ought to accept correct implementations (we call these implementations the wheat) and a thorough suite ought to reject buggy implementations (we call these implementations the chaff). If students could run their examples against sets of wheat and chaff implementations, then they could evaluate for themselves whether their examples are valid and thorough.
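To make the wheat/chaff idea concrete, here is a minimal sketch (my own illustration in Python, not Examplar’s actual machinery). A suite is valid if every wheat passes it, and its thoroughness is the number of chaffs it rejects:

```python
from statistics import mean, median

# A test suite is a list of (input, expected) pairs. It is *valid* if
# every wheat (correct implementation) passes it; its *thoroughness* is
# how many chaffs (buggy implementations) it rejects.
wheats = [lambda xs: median(xs)]
chaffs = [
    lambda xs: mean(xs),        # bug: computes the mean instead
    lambda xs: sorted(xs)[0],   # bug: returns the minimum
]

def passes(impl, suite):
    return all(impl(xs) == expected for xs, expected in suite)

def evaluate(suite):
    valid = all(passes(w, suite) for w in wheats)
    caught = sum(not passes(c, suite) for c in chaffs)
    return valid, caught, len(chaffs)

# [1, 1, 10] separates median (1) from mean (4), so both chaffs are caught.
print(evaluate([([1, 2, 3], 2), ([1, 1, 10], 1)]))   # (True, 2, 2)
```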
03:57 - We prototyped this idea two years ago with Examplar: an editing environment specifically for examples. When you click ‘Run’ in Examplar, your tests are not run against your own implementation (which might not even exist yet); rather, they’re run against sets of wheat and chaff implementations provided by the instructor. This prototype was only for examples: when students wanted to develop or test their implementations, they had to switch IDEs. Nonetheless, students found Examplar compelling: nearly all students opted to use it, and they used it a lot. Examplar was helpful, too. When we compared students with Examplar to those in a previous year of the course who did not have it, we found that the students with Examplar were far less likely to submit invalid test suites.
04:41 - However, Examplar still required a high degree of self-regulation from students: they needed to realize, on their own, when they would benefit from using it, and then make the choice to switch. And, because it was a research prototype, we didn’t know when students were using Examplar relative to when they were using the usual IDE for testing and implementation. This paper is about Examplar V2, which resolves both of these concerns by providing a unified programming environment that encourages students to write examples before they begin implementing. Examplar V2 provides two nudges towards early and effective example writing. First, when you open Examplar for an assignment, you are greeted with a file for writing your tests; the implementation file is hidden behind a “Begin Implementation” button.
05:26 - Second, Examplar provides validity and thoroughness feedback with every run. It doesn’t matter whether you’re working on your test file or your code file: every run will tell you how thoroughly you’ve explored the problem with your tests. We deployed this IDE in a fall-semester accelerated introduction to computer science. 59 students, mostly first-years, completed the course. They typically had some prior background in programming, but not in testing.
05:53 - Our foremost question was: “Did these students test first?” That is: “How thoroughly did they explore the problem with test cases before implementing it?” To answer this, we began by dividing each student’s progress into a sequence of run intervals. A run interval is the period between clicks of ‘Run’. These periods reflect the amount of work the student was willing to undertake before asking the IDE for additional feedback. Within each run interval, the student might edit their test file or their code file (or, rarely, neither or both files), and with each click of ‘Run’ they get additional feedback from Examplar. This student began by editing their test file, clicked run, and learned that they had caught one out of five chaffs.
06:37 - Then, they wrote some more tests and achieved a thoroughness score of two-out-of-five. In their third round of testing, they weren’t able to increase their thoroughness, so they switched to implementation and made three rounds of edits to their code file. Within these intervals, their thoroughness score stayed at two-out-of-five, because only edits to their test file, not their code file, could increase their thoroughness score. Next, they returned to writing tests for one more interval and caught one more chaff. Having achieved a three-out-of-five thoroughness score, they did one last round of implementation, and then finally two rounds of test writing, ultimately managing to achieve perfect thoroughness.
07:15 - Given such a sequence, we ask: “How thoroughly did this student explore the problem before implementing it?” We begin by examining each of the intervals in which they edited their implementation and computing the peak thoroughness they achieved prior to that interval. By the time this student began their implementation, their thoroughness score was already two-out-of-five. Occasionally a student (like this one) might encounter an error or, rarely, their thoroughness score might dip because they commented out a test case. We ignore these blips and always look at the maximum thoroughness the student achieved prior to that interval. Overall, this student completed three implementation intervals having caught two out of five chaffs and one interval having caught three out of five chaffs; taking the mean of these numbers produces a measure of this student’s test-firstness: the thoroughness they achieved before the bulk of their implementation work. In this case, it’s 0.45.
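The arithmetic: this student had four implementation intervals whose prior peak thoroughness was 2/5, 2/5, 2/5, and 3/5, and the mean of those fractions is 0.45. Below is a small sketch of that computation in Python (my reading of the talk’s description, not the paper’s actual analysis code; the intermediate 4/5 score is assumed for illustration).

```python
# Each run interval records which file was edited and the thoroughness
# (fraction of chaffs caught) reported at the end of that run.
intervals = [
    ("tests", 1/5), ("tests", 2/5), ("tests", 2/5),   # example writing
    ("code",  2/5), ("code",  2/5), ("code",  2/5),   # implementation work
    ("tests", 3/5),                                   # one more chaff caught
    ("code",  3/5),                                   # last implementation round
    ("tests", 4/5), ("tests", 5/5),                   # perfect thoroughness, but too late to count
]

def test_firstness(intervals):
    peak = 0.0        # best thoroughness achieved so far
    prior_peaks = []  # peak thoroughness before each implementation interval
    for edited_file, thoroughness in intervals:
        if edited_file == "code":
            prior_peaks.append(peak)
        # Track the maximum, ignoring dips from errors or commented-out tests.
        peak = max(peak, thoroughness)
    return sum(prior_peaks) / len(prior_peaks)

print(round(test_firstness(intervals), 2))   # 0.45
```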
08:10 - The fact that the student eventually achieved perfect thoroughness is immaterial: thoroughness achieved after implementation doesn’t contribute to this test-firstness measure. We computed this measure for each student on each assignment. The figure on the left plots every student’s test-firstness score on our first assignment, DocDiff. This assignment had six chaffs students could catch, and each level of thoroughness is shaded distinctly in the plot. Of 64 students, only six (those points at the very bottom of this plot) did not achieve some level of thoroughness before the bulk of their implementation work.
08:44 - The median student, on the other hand, caught between five and six chaffs before most of their implementation. This pattern is repeated on every assignment, with at most 11 students failing to achieve some level of thoroughness on TourGuide. There is an apparent decline in test-firstness as the semester progresses, but I’m doubtful this has much to do with time; it likely has more to do with the increasing complexity of these assignments, or perhaps just a really hard-to-catch chaff. This is a question ripe for future research. Overall, our experiences with Examplar convince us that students can test early without coercion. But before you race off to build your own IDE to encourage example writing, check out our paper.
09:30 - While we believe that in-flow feedback is a major contributor to our success, this paper is not a rigorous A/B study and there might be other factors at work. Nonetheless, we hope you’ll encourage example writing in your courses, and have a sense of how you might assess your success!