Week 2 ABCD: Data Access and DEAP

Oct 6, 2020 18:30 · 10664 words · 51 minute read

Hello, my name is Wes Thompson. I'm a professor of biostatistics at the University of California San Diego, the director of biostatistics for the ABCD Study, and an associate director of the Data Analysis and Informatics Resource Center (DAIRC), which is also located at UCSD. Today I'm going to talk about the Data Exploration and Analysis Portal, or DEAP. This is an informatics tool that the DAIRC has made available to all qualified users of the ABCD data. I'll first go through some background on the study and the study design, and how the design relates to the analyses one might perform, and then show you how those are instantiated in DEAP. I'll go through the slide presentation, and at the end I'll give a live demo showing how to access DEAP and how to use it, working through some very simple examples.

01:10 - The goals of the course are pretty straightforward. The first is to learn some pertinent details about the ABCD study design, in particular the aspects of the study that are relevant to how you might analyze the data. The second goal is to think about incorporating the study design into your analyses: I'll show a particular example using mixed-effects models that controls for certain aspects of nesting in the study, and I'll briefly touch on other aspects you might consider going forward. The last goal is to learn how to access the data from DEAP: I'll show you how to download the data, how to perform analyses, and how to explore the data set in DEAP. So what is DEAP? Again, DEAP stands for Data Exploration and Analysis Portal. It's a web-based interface deployed on the cloud, hosted by the NDA, the NIMH Data Archive. If you register for access to the ABCD data, that's where you access the data, and you access DEAP in the same place; I'll show you where to go in the live demo. It has a number of tools built in, for example multi-level analyses that incorporate design covariates and random effects for site, family, and subject.

03:00 - I'll go over why that's potentially important. DEAP also has visualizations of the data and interactive tools for exploring the data and the outputs, including diagnostic plots and model fits, and all of this is shareable and downloadable. It's intended to lower the activation energy for people who are new to DEAP or new to ABCD, or who don't have a lot of experience analyzing data sets with multi-level models; the idea is to make it as easy as possible for people to get started working with the data. Before we get to DEAP, I want to point out how the DAIRC has tried to incorporate open science into its ethic. One thing we've done is put all of our source code on GitHub; if you do a Google search for "ABCD GitHub", it'll be the first thing that shows up. That includes all the code the DAIRC has used to process data and to create DEAP, covering a number of different things including geocoding, handling the E-Prime files for the fMRI tasks, and the source code for DEAP itself. So if you wanted to take DEAP and implement it on a different study, the code is there and you can play with it. I think this is an important part of the ABCD ethic: open science. We want everything we do to be out there for people to see, to be reproducible, and to be something people can take and understand. There shouldn't be any mystery, and nothing hidden in ABCD. In fact, one thing the NIH has mandated is that nobody, including anybody inside the consortium, can publish on anything in ABCD that's not publicly available to everybody; there's no data inside the consortium that we can publish on that isn't also available to anybody who can access the data outside the consortium. That's an important aspect of ABCD that we've really bought into, and DEAP is one of the things we've implemented to live up to that open-science standard. Reproducibility is super important: there's nothing you should do in a paper with ABCD data that anybody else with access to the same data shouldn't be able to replicate exactly. Whatever you publish or show in a talk, you should make the code available on GitHub or somewhere else so that other people can reproduce exactly what you've done; I think that's true open science. There are a number of other things here I won't go into in detail, but one example: I published a paper in 2018 that was essentially a PCA of the neurocognitive measures in ABCD. I wrote the paper in R Markdown and didn't hardcode any numbers; the numbers were read in dynamically from the R outputs, and I made all of the R code and the R Markdown file available on the ABCD GitHub. If anybody wants to reproduce my paper, they can take that R Markdown file, run the R code on the 2.0.1 release, get exactly the same numbers I got, and have them read directly into the paper. You should be able to reproduce my paper by clicking a button and compiling the R Markdown file. I tried to make it as reproducible as I possibly could, and I strongly encourage all of you to make your code available when you write papers with ABCD data as well.
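
Just to illustrate the idea of not hardcoding numbers (this is a generic sketch, not the actual code from that paper; the file name and the variable selection are made up), an R Markdown setup chunk might compute the quantities, and the prose would then reference them inline:

```r
# Inside an R Markdown setup chunk: compute every number the text will cite.
nda <- readRDS("nda2.0.1.Rds")                              # assumed local copy of the release file
cog <- nda[, grep("^nihtbx_.*_uncorrected$", names(nda))]   # hypothetical neurocognition columns
pc  <- prcomp(na.omit(cog), scale. = TRUE)
var_pc1 <- round(100 * summary(pc)$importance["Proportion of Variance", 1], 1)

# In the manuscript text you would then write, for example:
#   "The first component explains `r var_pc1`% of the variance."
# so recompiling against a new release updates the number automatically.
```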

So I'm going to cover a little bit about the design of ABCD, because it's important 08:03 - to understand some basic aspects. I'm not going to go through every single measure we're collecting; there are a lot of them, specifically 65,854 measures per person at the baseline visit. This is what's called the tabulated data, the files you can download directly from the NDA. It does not include the images themselves; it has the tabulated imaging data, but not the full voxel- or vertex-level images. It also does not include the genetics data (almost everybody in ABCD at this point has whole-genome genotyping data). There's also a Fitbit sub-study and other sources of data that are not part of the tabulated data, but the tabulated data alone is almost 66,000 measures per person at baseline. These are organized into a number of domains, corresponding roughly to the workgroups in ABCD that specialize in certain aspects of data collection: culture and environment, mental health, physical health, substance use, neurocognition, mobile technology (which includes the Fitbit study), the brain imaging scans, and what we used to call passive data but now call linked external data, which includes geocoded environmental exposures and will soon also incorporate school records and local policy information, and hopefully other sources of data such as electronic health records. The number of measures we're collecting is expanding all the time, so the 65,854 measures at baseline is going to grow. Having said that, if you look at the year-one visit there are only about five thousand measures. The big difference is that the imaging data are collected every other year, and a lot of those 65,000 baseline measures are the tabulated data from the imaging: ROI-level data for cortical thickness, surface area, and volume, the resting-state MRI, the task MRI, and so forth. So a lot of this is basically ROI-level imaging data, but there are still a number of other things in there that are not imaging, and it will be ever-growing. It's a little daunting to navigate; even if you've been involved in the study, it's hard to locate variables when there are 65,000 to look through, and hopefully that's where DEAP can help. A little bit more about the study design: it's a longitudinal study, and here's a snapshot of the design, with release years 1 through 14 on the vertical axis.

11:25 - The horizontal axis gives you the visit structure. We have yearly in-person visits: baseline, year one, year two, year three, and so on. There are also interim six-month data collections, which are not visits but phone calls that collect a very abbreviated set of measures, so for the most part when you analyze the data you'll be focusing on the yearly in-person visits. The imaging data are collected every other year, at baseline, year two, year four, year six, year eight, and year ten, so we should have roughly six imaging assessments per person by the end of the study. Again, this is a longitudinal study projected to last at least 10 years (hopefully longer, but 10 years minimum), with almost 12,000 kids at baseline, 11,880 to be exact, and we're hoping to keep as many of them as we can in the study through year 10. So far there's been minimal dropout. If you look at the top panel, it shows the number of people at each visit for each release. For example, the NDA 1.0 release was a little under half the baseline sample, roughly 5,000 kids.

In the 2.0 release we had the full baseline sample, plus 13:17 - some six-month and year-one data, and so on and so forth. There's a staggered structure because recruitment of the kids took place over a couple of years, so at no yearly release will we have a complete data set for all of the visits; the last visit in any given release will consist of roughly half the kids. For example, the 3.0 release, which is coming imminently, any week now, will have the full baseline, the full year one, and roughly half the kids in the year-two follow-up. So for the 3.0 release we'll have two imaging time points on half the sample and one imaging time point on the other half. These are the kinds of things you need to be aware of when you download the data: there will be some missing data not by design, like missing visits or missing values for an individual person, but there's also missing data by design, because there won't be a full sample for every visit at any given release. I also wanted to put this on the screen: this is a paper the biostatistics workgroup just put together that's available on bioRxiv, which we're planning to submit as a peer-reviewed article in the coming two or three weeks, but you can access it right now. It was geared to cover the basics of the study design, for example how to make population inferences with ABCD, how to think about hypothesis testing and effect sizes, and best practices for reproducibility, with some worked examples. This is the reading material I'm submitting along with this class, so hopefully there's information there that you find useful for learning about the study design and what we consider best practices for analyses and for presentation of results. Now for DEAP itself: I want to get into the portal, so I'll go through the basic aspects in slides, and when I'm done I'll do a live demo. This is the opening screen you'll see when you first log in. It has several modules: there's information on the Getting Started page; a Plan page, which I'll go into in a little detail; Explore, which I find super useful and use all the time, because with 65,000 variables it's hard to figure out what's going on, so we've built in some functionality to help you explore the data set; Limit and Extend, which I'll only mention briefly; and then Analyze, which I'll spend more time on. Analyze is the main module for running analyses using the mixed-effects models and R code, and I'll show you an example of that in detail.

17:10 - Explore exploits an ontology we put together for the study: we've put all the variables into a hierarchical ontology based on the workgroup structure I mentioned before, so we have culture and environment, imaging, mental and physical health, neurocognition, and so on. The variables have special names in DEAP that are not the same as the names in the NDA; there's a one-to-one correspondence and we provide the map between them, but we decided that in DEAP we wanted a more explorable way of naming the variables, so we did that with a prefix structure that follows the ontology, which I'll show in more detail. On the left side there's a tree: you click on the nodes and it opens up the leaves, so you can interactively look for variables organized according to the ontology. When you click on one of these nodes, the information is copied over to the right, and when you click on a variable it shows you the data dictionary information, a histogram for continuous variables, and the cross-tabs with the percentage breakdown for categorical variables. There's also a Google-like search functionality in the bar here, so the left side and the right side interact, whichever you prefer. Limit I'm just going to mention briefly: it's a way to restrict the data set you want to analyze. For example, here's sex equals F, so if you wanted to look only at the females in the study, it would find all of the females, put them into a collection, and you could save that; then in Analyze you can analyze that subset according to your specifications. I'll briefly show how to do that in the live demo. There's also Extend, which is a way to create new variables from the existing variables. An example might be computing body mass index from height and weight, and there's a way to interactively create such variables; you can also upload your own variables (there's a link here for uploading variables you've computed yourself that you want to add to your analysis). You can save these extended variables and they can then be incorporated into the Analyze module. Again, I'll cover that briefly when we get to the live demo.
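
Just as a sketch of the kind of derived variable Extend creates, here is a BMI computed in R; the column names for height and weight are assumptions, and you'd substitute whatever the release actually calls them (and check the units):

```r
# Sketch: derive BMI the way an Extend variable might.
nda <- readRDS("nda2.0.1.Rds")        # assumed local copy of the DEAP rds file
height_in <- nda$anthro_height_calc   # hypothetical column, assumed in inches
weight_lb <- nda$anthro_weight_calc   # hypothetical column, assumed in pounds

# BMI = 703 * weight (lb) / height (in)^2
nda$bmi <- 703 * weight_lb / height_in^2
summary(nda$bmi)
```
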
In the Analyze module, which is where I'll spend the most time in the demo, we've implemented a mixed-effects model. All the code in Analyze is implemented in R, which you can access through expert mode. What we've done is take the gamm4 library and implement a mixed-effects model that incorporates site random effects and family random effects; we also now have subject random effects, because there are repeated measures on subjects, so we actually allow for three levels of nesting in the data: within subject, within family, within site. A note about sites: there are 21 sites in ABCD, and you should probably account for site variation in your analysis in some way; you can do that via a random or a fixed effect (here we have it as a random effect, but you could use a fixed effect if you wanted). For family you pretty much have to use a random effect, because there are thousands of families in ABCD. There are a lot of twins, for example: four of the sites are twin hub sites that intentionally recruited twins for behavior-genetics analyses, and I think there are about 1,200 twin pairs, but there are also other family members in the study, siblings who happen to be enrolled together. If I remember correctly, on the order of 4,000 of the kids are in a family with more than one member in the study; it's a considerable number, and you need to account for that in your degrees of freedom when you're doing, say, a regression analysis. One way to do that is via random effects, and that's how we have it implemented in Analyze. There are other ways: you could use a generalized estimating equations approach, or you could randomly select one member of each family, which is a bit wasteful but sometimes necessary if your method doesn't allow for nesting. In any case, site and especially family are aspects of the design that you should account for in your analyses. And then of course there's the longitudinal data: we already have a little bit in the 2.0.1 release.

23:01 - In the 3.0 release, which is coming out (hopefully) in a couple of weeks, there will be at least two imaging time points for roughly half the subjects, and a number of other variables will have two or three time points, because there will be baseline, year-one, and year-two follow-up data for some variables. So you have to account for nesting within subject across time, and that can be handled via a random subject effect, for example, which is what we have implemented in DEAP. One of the nice things about GAMs is that they allow for non-normal distributions from the exponential family, like logistic regression or Poisson regression. Having said that, there are potentially some issues with convergence if you use nonlinear link functions: we have a logistic link function implemented, and a number of people have reported convergence problems with the random-effects models under that link, so soon after the 3.0 data release we're going to switch over to a generalized estimating equations approach for those cases, because it seems to be more stable for nonlinear link functions. That's something we'll be addressing in an upcoming version of DEAP, hopefully in a couple of months. I'm not going to spend too much time on this because I'm going to go to the live demo, but this is where you enter your regression variables: your main independent variable of interest, your dependent variable of interest, and other covariates, including interactions or transformations of them. We also list a set of covariates that we recommend people consider including in their analyses, and the reason is that they're actually part of the design of the study. If you're not familiar with how the ABCD subjects were recruited, I'll go over it briefly: the primary form of recruitment was through schools. There was a random selection of schools in each of the 21 catchment areas, designed so that the catchment areas reflect the population of the United States, so it's truly a population-based study. It was designed to match the American Community Survey, a very large census-based survey: we looked at the nine- and ten-year-olds in that survey and tried to match the ABCD sampling with respect to race, ethnicity, sex, education level of the family, income, marital status, and a couple of other variables, but those are the primary ones. Those are included as default covariates for analyses; you can click them off, you don't have to include them, and in general we don't recommend that you just default to a given set of covariates without thinking carefully about which covariates belong in your model. The reading material, the meaningful-effects paper, has a lengthy discussion about this and a couple of examples. We put these in not to say you should always include them, but to encourage you to strongly consider them; if you think they're not appropriate, that's fine, but they are part of the design of ABCD, and so we have that coded in here as something to think about. We also have random effects for site and family, as I said, and now also for subject with the longitudinal data, and those are essentially hard-coded in, because we couldn't think of an example where you wouldn't want to control for them in a basic analysis. Having said that, all of this code is available in R; you can download it and modify it as you see fit, and if for some reason you wanted to take those out you could, but I think you should think carefully first. A quick note on the site and family random effects: there is variation in the study due to site and variation due to family, and generally the variation due to family is quite a bit more substantial, maybe on the order of ten times as much variation accounted for by family as by site. So it's quite important to have a family random effect, or to somehow control for family in your analysis; site too, but family is particularly important, and it's not something you can address via a fixed effect, whereas site you could potentially include as a fixed effect if you really wanted to. I'll get into that in more detail when I go through the live demo. There are some outputs, which I won't get into here because I'll show them in the live demo: a scatter plot with the regression model fit, displays of the data, diagnostic plots, and the table of coefficients and p-values and so forth. There's a tutorial mode in DEAP, which I'll show you: you click a button and it opens up text that describes each of the inputs and outputs in Analyze and gives you some intuition about what they're doing and how to interpret them. There's also expert mode, which I'll show you in a second as well: when you click on expert mode, the R code opens up and you can edit and modify it and run your analysis tailored to your specifications. And of course, as I said, you can download everything and run it locally and do whatever you want with it. Here's the screen where you input the variables; I won't go into this because I'm going to give a live demo in a second, so let me just skip the model builder.
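
To make that model structure concrete, here is a minimal sketch of the kind of gamm4 call that Analyze builds; it is not DEAP's actual code (which is on the ABCD GitHub), and the column names for the outcome, predictor, covariates, and the site/family/subject identifiers are assumptions that may not match the release data dictionary:

```r
# Sketch of a DEAP-style mixed-effects analysis with gamm4:
# subject nested in family nested in site, plus the design covariates.
library(gamm4)

nda <- readRDS("nda2.0.1.Rds")   # assumed local copy of the DEAP rds file

fit <- gamm4(
  nihtbx_fluidcomp_uncorrected ~               # dependent variable (hypothetical name)
    nihtbx_picvocab_uncorrected +              # independent variable of interest
    age + female + race_ethnicity +            # design covariates (hypothetical names)
    high_educ + household_income + married,
  random = ~ (1 | site_id_l / rel_family_id / src_subject_id),  # subject in family in site
  data   = nda
)

summary(fit$gam)   # fixed-effect (and smooth-term) estimates
summary(fit$mer)   # the underlying lme4 fit, with the random-effect variances

# DEAP's "smooth" option corresponds to replacing the linear predictor term
# with a data-driven smooth, e.g. s(nihtbx_picvocab_uncorrected).
```

For a binary outcome you would add `family = binomial()` to the call, which is where the convergence issues mentioned above tend to show up.
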
All right, before I go through the live demo I want to go through some of the updates that are coming down the pike; some of them are actually ready to go, and in fact we'll probably release some of them this coming week. For example, population weighting: even though ABCD was designed to match the American Community Survey on these key socio-demographic variables, because of self-selection into the study we don't match it exactly in some cases. So with Steve Heeringa at the University of Michigan we've created weights that weight our study to the American Community Survey, so you can try to get your analyses to match more closely the population of nine- and ten-year-olds in the United States. That's been implemented and will be ready in a week or two. We also have enhanced interactive download of data, which will also be available in a week or two; it interacts with Explore, so you can put your variables in a cart and then download subsets of the data in whatever format you want, like a CSV file, an RDS file, or a SAS file. Image analyses will also be implemented, at least partly, in a couple of weeks; that will be at the ROI level, and it's something we're going to extend to the voxel and vertex level in the coming months. So the first three here are going to be implemented in a couple of weeks, which means that by the time you see this lecture they'll probably be available in DEAP on the NDA. Longitudinal analyses will come after the 3.0 release.

31:33 - That's because we don't currently have a lot of longitudinal data. As I said, we already have a random effect for subject implemented, but we're going to extend the functionality of DEAP to think more about trajectories, even though that's still a little preliminary: with 3.0 we'll have two, or at most three, time points for a lot of variables, so there aren't really extensive trajectories yet. At some point there will be; out at year 10 we'll have a lot of time points for every subject, and we'll add things like spaghetti plots and mixed-effects models that look at time-lag structure or change or whatnot. There will be a lot to implement for the longitudinal data, because the number of analyses you can do grows combinatorially as you get more and more time points. One thing we're actively working on, which will be available after the 3.0 release, is twin analyses, so you can look at heritability with the twins; that was one of the points of getting them into the sample. You can use behavior-genetics analyses as implemented in OpenMx, which is an R package, and we're working on getting OpenMx implemented in DEAP as well. There you can look at genetic, unique environmental, and common environmental sources of variation and decompose any measure into those sources, univariately or bivariately; there are a lot of things you can do with twin data that I think are pretty cool. We're also working on getting more advanced machine-learning methods implemented, along with cross-validation and out-of-sample estimation of effect sizes; that's something we're actively working on right now. Another thing is missing-data imputation, which we'll have to deal with more and more as we get more time points: we'll have missing covariates and independent variables, and we'll either have to do listwise deletion, which is not optimal, or something better, so we're going to try to implement some kind of multiple imputation in DEAP at some point. Okay, I'm going to stop the slideshow and switch to the live demo, so bear with me for a couple of minutes while I bring up my browser. Okay, let me just go to Google; all right, now I'm going to share my screen again.

34:24 - Okay, I'm going to take you through how to access DEAP and some basic examples of using the different modules. The best way to find it is just to type in "ABCD DEAP"; it'll take you straight there. I'll go to this one because it also shows you where you connect to access the data itself: this is the NIMH Data Archive, the NDA data repository for ABCD, where you access all your collections, and you can also access DEAP here with this button. Once you've gone through the process and gotten permission to download the data, you'll have a username and a password, and you use that same username and password to access DEAP. Before I get into DEAP: the rds file (that's an R data format) that powers DEAP is also available for direct download, and it has essentially everything in the full release. The advantage of downloading the rds file, as opposed to downloading the text files directly from the annual release, is that the text files number in the dozens, they're split up, and they're all in character format; before you analyze them you'd have to merge them and convert everything to the correct data types. If you want to avoid doing that, just download the rds file: we've already done all of that for you. All of the variables available in 2.0.1 are in the rds file, and we sometimes add a few extra variables that are processed from the existing data; we go through a substantial amount of work to get them in there for DEAP, and that saves you from doing that processing yourself. But I want to emphasize that the rds file contains the same data as the 2.0.1 release; it is the 2.0.1 release, just processed, merged, and converted to the right data types, with, in some cases, new variables derived from the existing ones that would be effortful for people to create from scratch. I also want to point out that the code for doing this is available on the ABCD GitHub, so you can see exactly the steps we went through to create the rds file from the existing 2.0.1 data.
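
A minimal sketch of working with that file locally (the file name and the column names below are assumptions; use whatever your download and the data dictionary actually say):

```r
# Read the DEAP rds file and export a piece of it in another format.
nda <- readRDS("nda2.0.1.Rds")        # assumed file name of the downloaded release

dim(nda)                              # rows are subject-by-visit records
table(nda$event_name)                 # hypothetical visit/event column showing the longitudinal structure

# Export whatever subset you need, e.g. as CSV
write.csv(nda[, c("src_subject_id", "interview_age", "sex")],   # hypothetical column names
          "abcd_demographics.csv", row.names = FALSE)
```
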
So you can download the data directly; again, it's an rds file, and if you read it into R you can export it to whatever format you want. Also, as I said, the version of DEAP coming out in a week or two will have added functionality for interactive download of data into whatever format you want. So, on to DEAP: this is where you put in your NDA username and password, and it takes you here, to the entry screen. Getting Started has some basic information about the study, about the data, about the different modules, how to cite DEAP, and so forth. If we go Home, we have a list of recent updates; this tells you the different features we've been adding. Here's one from August 17th, so maybe this has already been updated and I can show the ROI analysis. Then we go to Plan. Plan is something we put in to encourage people to consider doing a hypothesis registration. It talks about the sampling, the design, the analysis, and the analysis scripts, which are things you'll want to share, and it gives some relevant references for the ABCD design, for recommended analyses, for where to share your scripts, and for hypothesis registration. One of the places you can go for hypothesis registration is the Open Science Framework, OSF, and we've created an OSF template, so if you want you can take this template, modify it for your particular analysis, and submit it to OSF to pre-register your hypotheses and your analysis plan. There are a number of journals that take registered reports as an option; we've partnered with two journals to do this, Cerebral Cortex and Developmental Cognitive Neuroscience, which both have specific options for registered reports with ABCD data. If you have an interest in this, even if you've never done it before (I think it's new for a lot of people, but it's something we've been trying to encourage you to at least consider as an option), please send me an email and I'll be happy to give you more detail about how registered reports work and point you to the resources that are available, including the two journals I just mentioned. So here's the template, and now let's go back Home. Let me go to the Explore function, which I find quite useful. Again, this is based on the ontology that the DAIRC put together. There are sub-studies; there aren't a lot of them right now, just the Irma sub-study, but there are a lot more coming down the pike. These are subsets of data in ABCD that have been collected that weren't part of the original design. For example, after Hurricane Irma there was a desire among some of the investigators to collect information about the effects of Irma and the exposure to Irma, particularly at the Florida site, so that became a sub-study, and there are a number of other sub-studies coming as well; they'll be here in the sub-study leaf.

The main study is where most of it is, of course, and we currently have it divided up 42:18 - into these categories. Again, as I said before, a lot of this is based on the workgroup structure of the ABCD consortium, in terms of how we organize data collection: experts in each of these areas are in charge of determining which measures should go into ABCD, QC'ing the measures, and so on. So we have biospecimens, which includes hair samples, and soon there will be blood samples, plus hormones and some other measures you can see here. This is probably going to grow over time; we had just started collecting blood before the COVID pandemic, so hopefully we'll be able to get that going again at some point, and that will lead to a number of other biosamples. Then culture and environment, demographics, core demographics, and a lot of others. Let's look at core demographics: these are the ones I mentioned before, and a lot of them are part of the study design, like household education, household income, age, marital status, race, ethnicity, and so on. There's the imaging; I'm not going to click on that. Mental health, neurocognition (there's the NIH Toolbox and a few others), and novel technologies. Let's see what's in novel tech: screen time and Fitbit. There's the Fitbit sub-study that I mentioned, where each of the kids gets a Fitbit and wears it for a couple of weeks, and we're making summaries of those data available here. Screen time is how much time the children report spending on screens in various activities like gaming and social media and so forth. Then there's physical health, and residential history derived geolocation scores: one of the things we've done is take everybody's address in ABCD and geocode it with latitude and longitude, and then we can link that up to a number of different data sources. For example, if we look at current address one, here are all the different variables (by the way, every time I do this you can see it copies them over to the right side; I'll show you how to interact with that in a second). There are a number of variables that we've linked to, and this is a very active area: we actually have a geolocation workgroup, which I co-chair, that is trying to link in a lot more neighborhood-level environmental exposures, like water pollution, air pollution, and crime rates; we have a neighborhood deprivation measure, the area deprivation index, and hopefully a lot more of these variables coming down the pike. We're also trying to get residential history in a lot more detail going forward: we're trying to get it back to birth and then proactively going forward, so we'll have a dynamic set of exposures that are geocoded using residential history. I think this is an interesting and expanding area where we're going to try to get a lot more information into ABCD. The screener questionnaire isn't so useful, potentially. Then there's substance use; there are a lot of different substance use measures in ABCD.

ABCD is not exclusively, 46:15 - but is predominantly, funded by NIDA, the National Institute on Drug Abuse, and so one of the primary interests of the funders is the impact of drug exposures on health and mental health and on trajectories of brain and neurocognitive development, so we have a number of variables in those domains. Let's look in a little more detail at neurocognition. Here are the different tasks; let's take the NIH Toolbox. There are a number of variables that you get out of that, and again all of these are now copied over to the right-hand side, where I could also actively search for things in this toolbar if I wanted to. Let's take one of them as an example: the NIH Toolbox picture vocabulary, age-corrected. If I click on that, it opens up a histogram and gives you the NDA name; again, this is our DEAP name, with the prefix structure that follows the ontology we put together to make it easier to search, but we also tell you what the official NDA name is. For continuous variables it opens up a histogram and a five-number summary of the data, so you can get a sense of what's in there and what it looks like. For discrete variables, let me see if I can find one; here's language, which has to be discrete. We have English 7,719, blank 154, and NA 15,495. So essentially this is all English, we have missing data for 154, and the NAs are something we're actually going to parse out in more detail in the next version of DEAP: the reason there are so many NAs is that this was data collected at baseline, so the NAs are just records that are post-baseline, like the six-month and year-one follow-ups and so forth. Again, we have to be cognizant of the fact that this is a longitudinal study and there's a structure to the data. As the study progresses and becomes more and more longitudinal, we'll have to morph DEAP to handle the visit structure in a better way; we've already made some progress on that, so hopefully it'll be relatively easy to figure out what's going on when you click on different variables. So let's go back to Home.
I'm not going to go into too much detail on Limit; again, this is where you basically create a filter, and we have some example filters here. Let's do all right-handed. You can go in here and modify the logic statement to pick out whatever you want, and you can use multiple criteria, like all right-handed females. Here you can see we've started developing this out into the longitudinal structure: we have 9,425 records that are right-handed and 1,943 that are not. This is the baseline data, so you can see there were 9,425 people at baseline who were right-handed, and if you look over here you can see the 2,450 people at baseline who were left-handed. The reason these other records show up as NA is that we didn't collect handedness data at the six-month, one-year, or 18-month assessments; for those follow-ups, handedness was only collected at baseline. So again, you need to be cognizant of the longitudinal structure of ABCD to understand why certain variables might be present or missing in a given record. You can modify this code however you want, save it, and then it will be available here as something you can pull up; it's also something you'll be able to reference in Analyze, which I'll show in a minute, so you can limit your analysis to a particular subset of subjects you're interested in. And again, Extend I'm not going to go into in much detail: you can create new variables based on existing variables and then use those in your analyses as well. We have video tutorials for all of this; I can't get into too much detail because I'll probably run out of time, so let me just go through Analyze quickly and then wrap up my lecture. This is the last thing I'll cover, and it's what I was showing in the slides: essentially a regression analysis that incorporates the design structure of ABCD in terms of the nesting and the design covariates. Just to point out expert mode: if you click here, it opens up the right side and you can edit the R code; this is all the R code that's used to run the models, so you can go in, define the variables or change the code in different ways, and run it. There's also the tutorial mode I was talking about before: if you click on tutorial mode, it opens up text that describes the models, and then when the outputs appear down here after running, it describes the outputs, so you can take a look at that if you want. Let me go through how to run the model. We have the dependent variable of interest; if you click on it, it opens up a little box down here with a histogram and an interactive way to perform transformations of the data. For example, if the data were skewed you could do a log transformation; these are pretty normal, so you don't really need to. If there are some extreme observations, really extreme ones like you sometimes get with the fMRI beta weights, you can censor them: it censors the top and the bottom half a percent of values and gives you a little summary. You can do the same thing for your primary independent variable of interest.
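
Just to illustrate what those transformations amount to (a generic sketch, not DEAP's internal code; the variable name is hypothetical, and DEAP's "censor" may drop rather than clip the extreme values):

```r
# Sketch of the log and censoring transformations on a single variable.
nda <- readRDS("nda2.0.1.Rds")
y <- nda$some_fmri_beta_weight        # hypothetical column with a few extreme values

# Log transform for a skewed, positive-valued measure
y_log <- log(y)

# "Censor": find the bottom and top 0.5% of values...
lims <- quantile(y, probs = c(0.005, 0.995), na.rm = TRUE)
# ...and either clip to those limits (winsorize) or set them to NA
y_clip <- pmin(pmax(y, lims[1]), lims[2])
y_na   <- replace(y, y < lims[1] | y > lims[2], NA)

summary(y_clip)
```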

53:57 - You can add more variables, which I'll talk about in a second, but this is the one shown in the plotting functions, and here we have a few more transformations available: again log, a polynomial transformation if you want to run a polynomial model, censoring, and also a smooth transformation. This is running with the gamm4 package, so it's a generalized additive mixed model, which allows for data-driven smooths; if you think the relationship is not linear and you don't want to use a polynomial for whatever reason, you can use the smooth, and I'll show an example of that in a second. Then this is for grouping, this is for interactions with the independent variable of interest, and here you can add other independent variables. This is where you select your subsets, say females only; I'll just do all subjects for now. And again, these are the design covariates for ABCD: race, ethnicity, sex, highest education of the guardians, household income, household marital status, age of the subject, and whether or not the subject is Hispanic. These are all highlighted in green; you can click them off if you don't want to include them in your model, but again we just encourage people to think about which covariates to include, and these are important ones that people tend to control for. Having said that, for every analysis you should think through carefully what you want to do. For the random effects, family is hard-coded; you can't click it off, and I can't think of a good reason not to control for family. Site you can click off, and the reason is the device, meaning the MRI scanner: if you're doing an analysis that incorporates the imaging data in some way, like the tabulated imaging data, you might want to control for the scanner instance instead of the site. Again, there are 21 sites, roughly half the sites have one scanner and half have two, and you don't want to put site and device in at the same time because they're largely confounded with each other; put in one or the other, not both. Subject you put in if it's a longitudinal measure; DEAP will detect if there are repeated measures and add in the random effect for you, or you can click it on yourself. I'll just put site in for now and do a baseline analysis. We'll run a kind of dumb model nobody would ever run, just as an example: the NIH Toolbox fluid component, uncorrected, which is a neurocognitive measure of fluid intelligence, and we're going to use the picture vocabulary to predict it, controlling for these covariates. I just click Submit; this typically takes roughly 10 to 12 seconds to run, and hopefully that's what happens now. Okay, 12 and a half seconds, and it takes a couple of seconds to render. Here's the rendering. Here are the data displays: histograms of the dependent variable and the primary independent variable of interest, and a scatter plot of the data along with the regression fit with 95% confidence intervals.
Each of these data points is clickable, so you can click on one and see the subject ID, site, sex, race, age, and the values for the dependent variable and the independent variable of interest. So if you have an extreme outlier, like this person here, you can see who that is and whether there's anything specific about them. On the margins we have the histograms. Here's the R formula for the fixed effects, and here's the R formula for the random effects. Next we have a kind of Table 1: a summary table of the variables going into your analysis. For continuous variables we have the mean and the standard deviation, and for categorical variables we have the numbers in each cell and the percentages. With the population weights there will be another column with the population-weighted versions; by the time you get access to this and play around with DEAP, you'll see an additional column that essentially weights these back to the population of the United States, or at least to the American Community Survey. Next is the effect-size table. We've quantified effect size as the delta R-squared, the change in R-squared: when you click the button, the model is actually fit twice, once with the covariates only (these covariates plus the random effects) and once with those plus the independent variable of interest; it computes the R-squared for each and takes the difference. Here the delta R-squared is 0.05594, so roughly 5.6% of the variance is additional variance explained by putting picture vocabulary in on top of the covariates of no interest.
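
A minimal sketch of that delta-R-squared computation, fitting the covariates-only and the full model on the same complete cases; the column names are again assumptions, and the simple squared-correlation R-squared used here is just one reasonable definition, not necessarily the one DEAP uses:

```r
# Delta R^2: extra variance explained by the predictor of interest,
# over and above the design covariates and random effects.
library(gamm4)

nda  <- readRDS("nda2.0.1.Rds")
vars <- c("nihtbx_fluidcomp_uncorrected", "nihtbx_picvocab_uncorrected",
          "age", "female", "race_ethnicity", "high_educ",
          "household_income", "married", "site_id_l", "rel_family_id")
dat <- na.omit(nda[, vars])           # use the same rows for both fits

base <- gamm4(nihtbx_fluidcomp_uncorrected ~ age + female + race_ethnicity +
                high_educ + household_income + married,
              random = ~ (1 | site_id_l / rel_family_id), data = dat)

full <- gamm4(nihtbx_fluidcomp_uncorrected ~ nihtbx_picvocab_uncorrected +
                age + female + race_ethnicity + high_educ +
                household_income + married,
              random = ~ (1 | site_id_l / rel_family_id), data = dat)

r2 <- function(m) cor(dat$nihtbx_fluidcomp_uncorrected, fitted(m$gam))^2
delta_r2 <- r2(full) - r2(base)       # e.g. roughly 0.056 in the demo above
delta_r2
```
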
Next is an ANOVA table, which gives you the significance of the different terms in the model, and then a parameter table, which gives you the estimates: the regression (beta) weights, the standard errors, and the p-values for the independent variables. Here we just have picture vocabulary, which is of course highly significant. The next set of plots are diagnostic plots. This is a plot of the fitted values versus the residuals; it should be flat, with no relationship. There's a lowess curve, and you can see a little bit of a curve here, so maybe I'll try a non-linear fit next. I don't know how worrisome it is, but there's a little bit of departure from a flat line at both ends. Here are histograms of the predicted random effects, with site on the left and family on the right; as usual, family explains a lot more variance than site does. Typically, like I said, family seems to explain roughly ten times the amount of variation that site does in most of the analyses I've seen. The last plot is a QQ plot of the residuals: if you're running a normal model you hope the residuals are normal, so you hope they follow a straight line on top of the dotted red line. It's important to check your model assumptions, so we built that functionality into DEAP. Now, if I were a little worried about this departure of the residuals from a linear fit, I might go here, click on the smooth transformation, and rerun the model. Let's see; it takes a little longer because the smooth is a little more computationally demanding. That was 18 seconds, plus a little more time to render. Here we go: you can see that instead of a linear fit we now have a curvilinear fit; this was data-determined, with the basis functions and smooths chosen automatically (I think it defaults to B-splines). And now if I go down and look at the residual plot, you can see it's a bit flatter, so maybe this model conforms to the data better than the first one. I'm not sure it would lead to substantively different conclusions, but it's a better model. So let me click off the smooth and make that linear again. Oh, by the way, before I go on: this interacts directly with Explore. If I type the prefix in, it gives me a bunch of options for variables, but if I wanted to search from scratch I could click on the plus here, which takes me over to Explore, and I can try to find variables I need to include in my analysis; if I want one, I just click add and it copies it over here. Let's see, this is something I could analyze; probably not, so I'll just put the old one back. Let's say, oops, yeah, let's do crystallized, uncorrected.

04:41 - So now I'm predicting (nobody would ever run this model, it's just an example) fluid intelligence from crystallized intelligence. Now maybe we want to look at whether the effect is moderated by sex, that is, whether the relationship between crystallized intelligence and fluid intelligence interacts with sex. If I click Submit, it will incorporate that interaction and show it in the plots and the results. Here we go: the regression line is now two regression lines, one for females and one for males. You can see that the ANOVA table has a main effect for crystallized intelligence, a main effect for sex, and an interaction between crystallized intelligence and sex; neither the sex term nor the interaction is even close to significant, and you can see that in the regression plot, where the two lines are pretty much right on top of each other, with the dots colored according to female or male. You can do this with pretty much any categorical variable as the plotting/grouping variable. Now, if you want to add in other variables, you can; there's no limit to the number of variables or interactions you can incorporate. So if I wanted to put in, say, BMI, and an interaction between BMI and sex (again, this is not a model anybody would actually run, it's just an example), let's click that, get rid of this one, and run it and see what happens. You can also do smooths of these variables, or polynomial transformations, or log, or whatever. Okay, so now the results include the BMI term, which is significant, though probably not very meaningfully: with roughly 12,000 kids, a p-value of 0.038 is not that impressive, and the interaction is nothing. So anyway, you can build more complex regression models if you want to. The last thing I'll show is how to save your results and download them. If you click on this little hamburger menu here, it opens up a little interactive screen; you can save your model for later and you can share it. Okay, I need to give it a name and then save it. You can also click on this download button, and it will build a little package you can download that has all of the R code, all the plots, all the model outputs, and all the data that went into your analysis, and you can take that and work on it locally. I'm not going to go through that here, sorry, but you can download everything and work on it locally. The idea may be that you can get your analyses started in DEAP, and then if you want to do something more complex that's not so easy to do directly in DEAP, or some other analysis, you can download everything and use it as the basis for extending the work, or for sharing it with others with whom you share a signed data use agreement, so that you're allowed to share at least the code. All right, I'm going to stop here. If you have any questions, please let me know; my email is wkthompson@health.ucsd.edu, and I'd be happy to put you in touch with somebody if I can't answer your questions myself. I hope you have a good experience if you use DEAP; let us know what you think if you do use it, and if you have any ideas
for extending it or improving it in different ways, we'd like to hear from you. In any case, best of luck and happy analyzing. Bye.