Psychonomics 2020 - Power Simulations for Linguistic Data

Nov 16, 2020 17:00 · 1338 words · 7 minute read

Hi everyone, and welcome to our talk on power simulations for cognitive studies. My name is Erin Buchanan, and I’ll be presenting work today that was supported by my colleagues KD, Nick, Jack, and Maria. There’s a long-standing tradition in the cognitive sciences of using controlled stimuli in our experimental studies. For example, the Snodgrass and Vanderwart line drawings have been cited over 6000 times and are a very common set of stimuli for picture naming studies. The current lens on replication and reproducibility has increased the focus on the methods and materials we use in our studies.

00:39 - Recently KD, Nick, and I published an annotated bibliography of the stimuli sets and linguistic norms, and we found there’s been a rapid increase in the publication rates of normed datasets, especially in Behavior Research Methods. While it’s very exciting that these data are available, we have to stop and consider the implications of publication of “reliable”, validated data. Normally, when we power our studies, we focus on the desired sample size to achieve a specific power probability. This sample size planning is driven by the research design, choice of hypothesis test, and the effect size estimation. Stimuli norming has nearly none of these parameters.

01:27 - Often, the choice is to meet some minimum, well-established criterion like N = 30. The issue of power and sample size planning has been mostly ignored for normed data collection. Here I’ll discuss how one might plan sample sizes for qualitative data collection, such as the semantic feature (property) listing task, and for quantitative data collection. Further, power increases in complexity for cognitive designs with many items and the use of multi-level models. The ideas in this presentation can be used even when you have a specific hypothesis test in mind.

02:04 - First, power in qualitative studies is often called coverage or saturation, and this term denotes adequate sampling, meaning that further sampling is unnecessary. In the semantic property listing task, you might ask a participant, "What is a zebra?" and they would list options like has stripes, is a horse, and is mean. Over many participants, these qualitative data are summed into feature frequencies. I want to highlight a great new paper by Canessa and colleagues that covers how to estimate the necessary sample size for these types of studies. First, you define a minimum coverage criterion as a percentage.

02:44 - Then, you sample a small number of participants to estimate the coverage space initially. Given this current coverage space, you can then estimate the remaining sample size needed for your desired coverage. You repeat this process until the coverage criteria for each item have been met. The procedure for quantitative studies is remarkably similar if we use accuracy in parameter estimation (AIPE). In AIPE, the focus shifts away from p values and hypothesis testing to calculating the desired sample size to accurately measure a parameter by providing a sufficiently narrow confidence interval.
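As an illustration, the sample-until-coverage idea can be sketched in a few lines of Python. Everything here is hypothetical: the feature pool, the salience weights, and the simple coverage definition (proportion of distinct features observed) are stand-ins, not the actual estimation procedure from Canessa and colleagues.

```python
import random

random.seed(1)

# Hypothetical feature pool and salience weights for one concept ("zebra").
# These values are illustrative only, not from any actual norming study.
features = ["has_stripes", "is_a_horse", "is_mean", "lives_in_africa",
            "eats_grass", "is_black_and_white", "is_an_animal", "runs_fast"]
weights = [0.9, 0.6, 0.2, 0.5, 0.4, 0.7, 0.8, 0.3]

def simulate_participant(n_listed=4):
    """One simulated participant lists a few features, drawn by salience."""
    return set(random.choices(features, weights=weights, k=n_listed))

# Minimum coverage criterion: proportion of the feature pool observed.
target_coverage = 0.90
observed = set()
n_participants = 0

# Keep sampling participants until the coverage criterion is met.
while len(observed) / len(features) < target_coverage:
    observed |= simulate_participant()
    n_participants += 1

print(n_participants, len(observed))
```

In the real procedure the total feature space is unknown and must itself be estimated from the early participants, which is what makes the Canessa et al. approach iterative.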

03:26 - For this process, you would define a minimum acceptable sample size, define a stopping rule, and finally define a maximum sample size. I’m going to demonstrate a simulation example specifically for a study with no hypothesis testing. However, you can find articles on how to estimate sample size with a hypothesis test by checking out work by Ken Kelley. Let’s look at an example of a study that uses word response latencies. You can use your own previous data for these types of simulations, or, in this case, I’m going to use the English Lexicon Project, a normed dataset of lexical decision and naming response latencies for many words.

04:04 - These data provide a good metric for the variability in simple response latencies. Because we know that participants have a somewhat arbitrary base response latency, the latencies have first been z-scored within each participant’s data collection session. I’ve also filtered the data to examine only correct answers to real words. First, let’s figure out a stopping rule for data collection. What should a sufficiently narrow confidence interval be in a response latency study? What parameter do I want to measure accurately? Since I don’t have a hypothesis test in mind for my study, I will use the response latency as my parameter, and define sufficiently narrow by the standard error, which controls the actual width of a confidence interval. I could, however, define this with a hypothesis test in mind, saying I wanted the confidence interval of Cohen’s d to be approximately .20 on each side of the final effect size.
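To see how a standard error criterion translates into a sample size, note that the standard error of a mean is sd / sqrt(n), so a target standard error implies n = (sd / target)^2. A minimal sketch, assuming an illustrative per-item standard deviation of 1.0 for z-scored latencies (a placeholder, not the actual ELP value):

```python
import math

# Stopping rule: a standard error of about .16 for z-scored latencies
# (the average standard error observed in the example data).
target_se = 0.16

# Assumed per-item standard deviation of z-scored response latencies;
# 1.0 is an illustrative placeholder, not an estimate from the real data.
sd = 1.0

# SE = sd / sqrt(n), so solving for n gives n = (sd / SE)^2.
n_required = math.ceil((sd / target_se) ** 2)
print(n_required)  # 40 under these assumptions
```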

05:09 - So what is the average standard error of a response latency for real English words? As we look at the graph here of each item’s standard error, I’ll note that the average sample size for each word is currently 27. There’s a lot of variability in response latency variance, and the average standard error is approximately .16. If I assume these data are representative of my potential stimuli list, what sample size should I expect to meet that standard error? I randomly selected a hundred words from the larger set, then sampled with replacement to achieve sample sizes of 5, 10, 15, 20, and so on up to 200. So while the real data average approximately 30 participants per word, I can simulate larger sample sizes for testing. This graph indicates that small samples are pretty variable, while larger samples show the expected decrease in variance.
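The resampling procedure can be sketched as follows. The simulated trial-level latencies here are toy stand-ins for the English Lexicon Project observations, but the logic is the same: resample each word’s data with replacement at increasing sample sizes and track the standard error.

```python
import random
import statistics

random.seed(42)

# Toy trial-level z-scored latencies for one word; a stand-in for the
# ~27 observations per word in the English Lexicon Project data.
observed_rts = [random.gauss(0, 1) for _ in range(27)]

def bootstrap_se(data, n, reps=1000):
    """Average standard error of the mean across resamples of size n."""
    ses = []
    for _ in range(reps):
        resample = [random.choice(data) for _ in range(n)]
        ses.append(statistics.stdev(resample) / n ** 0.5)
    return statistics.mean(ses)

# Resample with replacement at increasing simulated sample sizes.
ses = {n: bootstrap_se(observed_rts, n) for n in [5, 10, 25, 50, 100, 200]}
for n, se in ses.items():
    print(n, round(se, 3))
```

In a full simulation you would repeat this for each of the hundred sampled words rather than a single one.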

06:09 - Given the simulations, what should the sample size be? At N = 25, we found that 80% of our samples meet the standard error criterion, and we would need to increase the sample size to 50 for 95% of sampled words to meet our criterion. Therefore, we can define our minimum sample size, and I’ve selected 35 to meet our confidence interval goals. Note this isn’t too far off from the original study. I would also want to define a maximum sample size; this estimate is based on the time, money, and effort for the study, and we’ve selected 300 participants because we know we can afford that. Now what? I have a minimum, a maximum, and a stopping rule of a standard error of .16. You should pre-register these plans.
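The decision rule, finding the smallest N at which a desired proportion of words meets the standard error criterion, can be sketched like this, using illustrative per-word standard deviations rather than the real ELP values:

```python
import random

random.seed(7)

# Illustrative trial-level standard deviations for 100 hypothetical words;
# in practice these would come from the normed data.
word_sds = [random.uniform(0.5, 1.5) for _ in range(100)]

target_se = 0.16  # the standard error criterion from the stopping rule

def prop_meeting(n):
    """Proportion of words whose expected SE (sd / sqrt(n)) meets the target."""
    return sum(sd / n ** 0.5 <= target_se for sd in word_sds) / len(word_sds)

# Scan candidate sample sizes for the smallest N that satisfies a cutoff,
# e.g. 80% of words meeting the criterion.
for n in range(5, 205, 5):
    if prop_meeting(n) >= 0.80:
        print("smallest N with 80% of words meeting the criterion:", n)
        break
```

With the real data, the same scan produced the 80% and 95% cutoffs at N = 25 and N = 50 reported above; the numbers here differ because the standard deviations are made up.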

07:00 - Next, collect data for the minimum sample size. With this data, you can calculate your confidence interval, or the standard error as our proxy for the confidence interval. Did you meet your criteria? If so, then you can stop. If not, continue collection and repeat until you’ve met your criteria or you’ve reached the maximum sample size. You can calculate after each participant or after a set of participants, depending on your time and coding skills.
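Putting the pre-registered minimum, maximum, and stopping rule together, the sequential collection loop might look like the following sketch. The `collect` function is a hypothetical stand-in for real data collection, here simulated with random z-scored latencies.

```python
import random
import statistics

random.seed(3)

n_min, n_max = 35, 300   # pre-registered minimum and maximum sample sizes
target_se = 0.16         # pre-registered stopping rule (standard error)
batch = 10               # check the criterion after every 10 participants

def collect(n):
    """Hypothetical stand-in for data collection: n z-scored latencies."""
    return [random.gauss(0, 1.1) for _ in range(n)]

# Collect the minimum, then continue in batches until the stopping rule
# is met or the maximum sample size is reached.
data = collect(n_min)
while True:
    se = statistics.stdev(data) / len(data) ** 0.5
    if se <= target_se or len(data) >= n_max:
        break
    data += collect(batch)

print(len(data), round(se, 3))
```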

07:28 - Some other considerations are possible for this type of targeted sample size procedure. You could consider using an adaptive design that probabilistically samples stimuli based on their potential variability. In our study, we are planning a thousand stimuli, and participants will only see a subset of these. However, we do not wish to bias participants by only showing them the “weird” words at the end. So we can use the previous variability to help us evenly spread stimuli across participants.

07:59 - Since this procedure focuses on accurately estimating items, I also recommend pairing it with hypothesis tests, when appropriate, that consider and control for items, like multi-level models. Brysbaert and Stevens have an excellent paper on these designs and their power considerations as well. Last, I’ll mention this work is in preparation for a new study, the SPAML, or Semantic Priming Across Many Languages, and that study is in partnership with the Psychological Science Accelerator. The PSA is a global network of research labs that partner together to engage in worldwide research. This project’s lofty goals include providing a large multilingual normed dataset for computational analysis, code packages for accessing and interacting with the data, and more.

08:51 - If you’re interested in joining this project, including three pre-projects that are currently ongoing, please contact me. We’re looking for collaborators for all parts of the project. Now, I’m happy to answer questions, and thanks for your attention.