Creating a Data Management Plan Webinar

Feb 24, 2021 14:29

Erin Antognoli: I know a few people are still joining us but I’m going to get started here.

00:18 - Hello, everyone, who decided to join us on this late Friday afternoon. Thank you for coming. And welcome to the National Agricultural Library’s Creating a Data Management Plan Webinar. My name is Erin Antognoli.

00:31 - I’m Data Curation Team Lead in NAL’s Knowledge Services Division. Before we begin I want to go through a few housekeeping items.

00:39 - This webinar is being recorded and all participants have been muted.

00:45 - If you have a question at any point during the webinar you can type it into the chat.

00:48 - Please type to everyone if possible and we will attempt to answer as many as we can during the Q and A period at the end of the presentation. My fellow data curators John Sears and Michal McCullough are monitoring the chat and they’ll help you out with your questions along the way if possible.

01:06 - And with that let’s get started. We’ll begin by explaining what a data management plan or DMP is and why it’s important to create one.

01:19 - We’ll review each expected section of a data management plan at which time we’ll also look at a real data management plan’s implementation of these components.

01:29 - And we’ll conclude with time for questions.

01:36 - First we should start out with an explanation of what a data management plan is and conversely point out what a data management plan is not. Your data management plan should cover the basic guidelines for dealing with the data you will produce over its entire life cycle. This will take the form of a brief outline that covers the who, what, when, where, and sometimes the how in terms of management and preservation protocol for your data.

02:00 - This document should be very easy for someone who’s not familiar with your project to understand. Be clear and concise.

02:12 - A data management plan is not an in-depth documentation of your protocol.

02:18 - You don’t need to include workflows, instructions, charts, diagrams, names of every person or company who will interact with the project, or in-depth review guidelines. You do need these in-depth project management tools for internal operation and use but this data management plan is meant to be a brief overview with a very specific purpose.

02:39 - In that vein, this plan is not the end of the conversation about data management but rather the beginning. You should expect to revisit this plan and flesh out your protocol and workflow procedure after your funding comes through and you’re ready to begin your research. Finally, a data management plan is not concerned with paper or manuscript publication or promotion strategy.

03:02 - This plan concerns only the data and all points should address data management. Of course there are always exceptions especially if a funder requests information about a certain topic or activity.

03:16 - An exception might be if data storage is bundled with paper publication, in which case you should note that in the appropriate section. Also to note, if you’re depositing your data in one location but cataloging it in another such as the Ag Data Commons, you should note who is responsible for preserving the data and who’s responsible for providing access.

03:44 - When we refer to the data life cycle, this graphic from DataONE’s education module series visualizes the process.

03:53 - Managing data is an ongoing process and a good plan is a crucial first step to ensure your data is handled properly.

04:02 - We are focusing on the plan portion of this life cycle in this webinar which incidentally reflects all the other steps of the life cycle.

04:15 - There are many reasons why a solid data management plan is important.

04:19 - Data are valuable and often unique assets that should be properly managed in order to be accessible, understandable, and reusable into the future.

04:28 - Following the FORCE11 Open Data FAIR principles, the goal of a DMP is to form a blueprint to follow in order to make data Findable, Accessible, Interoperable, and Reusable.

04:43 - FAIR is mostly applicable to the steps after data are released whereas a DMP also covers part of the data life cycle before the data are released.

04:53 - So the DMP ensures the data are kept safe and understandable both during and after projects.

05:02 - The guidelines I present in this webinar follow US federal public access and open data directives and also comply with a broad range of current funding agency requirements for DMPs including P&P 630. In other words, DMPs may be more work on the front end. But a well thought out plan will save you lots of time, money, and headaches throughout the life cycle of the data.

05:31 - Before creating a new DMP, researchers should determine if a DMP specific to their agency, program, or previous project already exists and use that as a template.

05:44 - If a DMP does not exist, follow the guidelines already in place to create yours.

05:49 - Don’t reinvent the wheel. Note that current DMP guidelines for ARS are for two to three pages integrated into the project plan. Also note that under the USDA Public Access Implementation Plan, most research data generated with USDA funding will be required to be cataloged in the Ag Data Commons.

06:11 - This means that whether the data is deposited directly in the Ag Data Commons or linked from another repository, a record for that data must exist in the Ag Data Commons.

06:27 - The requirements laid out in the OSQR project plan call for general DMPs written at the project level. It’s important to lay out the access and sharing procedures between team members during the course of the project and indicate access procedures after the project is complete.

06:45 - Define when data collection is considered complete.

06:49 - In most cases, this will be when the paper is accepted for publication.

06:54 - Identify a reputable repository for the data.

06:58 - In general, use domain-specific repositories for the best support and management of your data. Otherwise, use the Ag Data Commons, USDA’s generalist Ag repository. Consider how to maintain public access to your data in the long term. Obtain unique identifiers like a DOI for your data sets and outline any public access provisions, restrictions, or exclusions in the plan.

07:29 - I will now review each of the six sections expected in the DMP and provide examples of what belongs in each section.

07:41 - We begin our plan by outlining Expected Data Types.

07:45 - This section covers what you will produce. Describe the type of data, for example, digital or non-digital, and how they will be generated; the different methods, such as lab work, field work, surveys, custom software, and so on, should be specified.

08:04 - What kind of metadata will be generated and how? Will metadata be manually entered or automatically generated by the data collection method? Following a specific community-approved metadata standard consistently is encouraged to facilitate wider understanding and reuse of the data.

08:23 - And the specifics of the standard used will be covered in the next section.

08:28 - For example, you may be collecting environmental data from real-time sensors or images from phenocams. You may be conducting interviews with digital video and audio recordings and subsequent digital transcriptions. You may have field notebooks from crop management experiments or field trials that are not born digital.

08:49 - You may be generating sequence data for whole genomes or metagenomics. During analysis or modeling you may be creating customized computer code or scripts for transformation or data cleaning. Metadata describing the data you have collected should be recorded for each experiment, each physical sample, or be embedded in the files produced by the sensors or sequencing machines. Describe whether any raw or processed data will be reused from other studies, and if so, name the anticipated sources.
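The per-experiment metadata described above can be as simple as a small structured record. The sketch below is an illustrative, hypothetical layout (the field names and values are invented, not drawn from any USDA or community standard, which real projects should follow instead):

```python
import json

# Hypothetical metadata record for a single sample; real projects should
# follow a community-approved metadata standard rather than this ad hoc layout.
record = {
    "sample_id": "soil_2021_014",          # invented identifier
    "collected_on": "2021-05-03",
    "instrument": "real-time soil moisture sensor",
    "units": {"moisture": "percent volumetric"},
}

# Serializing to JSON keeps the metadata machine readable alongside the data.
print(json.dumps(record, indent=2))
```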

09:30 - This is a screenshot of a DMP with a comprehensive Expected Data Type section. This project gathered genetic data in multiple physical formats and states that they will arrange the resulting data in an Excel file. We will follow this example throughout the presentation.

09:54 - Next, the Data Formats and Standards section covers the data and metadata formats and schemas chosen for the project. Describe the data formats, for example csv, pdf, doc, tiff, and so on, for both raw and processed data. If data are in a non-digital format, are there plans to digitize the data? Use of machine-readable formats is strongly encouraged and will soon be required by most U.S. federal funding agencies. To make data more machine readable, consider using CSV in place of Microsoft Excel or text in place of Microsoft Word.

The nature of your data will determine if this shift is possible but planning a structure from the beginning of a project makes it easier to arrange your data to comply with these formats. What standards or schemas will be used to structure, store, or share the data and metadata? Community-recognized and non-proprietary standards are strongly encouraged. Be specific when noting this information.
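As a rough illustration of the CSV-over-spreadsheet point, this snippet writes a few hypothetical field-trial rows to plain CSV using only the Python standard library (the column names and values are invented):

```python
import csv
import io

# Hypothetical field-trial observations that might otherwise live in a
# proprietary spreadsheet; plain CSV keeps them machine readable.
rows = [
    {"plot_id": "A1", "treatment": "control", "yield_kg": 4.2},
    {"plot_id": "A2", "treatment": "irrigated", "yield_kg": 5.1},
]

buffer = io.StringIO()  # stands in for a file opened with open("trial.csv", "w")
writer = csv.DictWriter(buffer, fieldnames=["plot_id", "treatment", "yield_kg"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

An existing spreadsheet can usually be exported to CSV from the spreadsheet application itself; the key point is that the archived copy is the non-proprietary, machine-readable one.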

11:05 - Name and link to any published data dictionaries, data standards, or ontologies that you are using.

11:13 - Examples include ICASA Master Variable List, Gene Ontology, or Integrated Taxonomic Information System.

11:21 - If data will be deposited in a professional database or repository, refer to their data and metadata standards.

11:34 - Following the same DMP as before, we see their Data Formats and Standards section.

11:40 - This DMP notes all the data file formats resulting from various parts of their project. This is genetic data, and the FASTA format is a community standard for storing this type of data.

11:53 - Note that they specify formats for sequence files, text files, and image files. And note how they’re saving their methods of replication in addition to the data they are generating. Anyone using this data later will know exactly what to expect when they open this file package.
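For readers unfamiliar with the FASTA layout mentioned above, here is a minimal sketch: a “>” header line naming each record, followed by one or more sequence lines. The records below are invented for illustration, and the tiny parser is only a sketch (real work should use an established library such as Biopython):

```python
# Invented example records in FASTA layout: ">" header, then sequence lines.
fasta_text = """\
>sample_001 hypothetical locus
ATGCGTACGTTAG
GGCTA
>sample_002 hypothetical locus
TTACGGA
"""

def parse_fasta(text):
    """Collect sequences keyed by the first token of each header line."""
    records = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line.strip()
    return records

records = parse_fasta(fasta_text)
print(records["sample_001"])  # the two sequence lines joined together
```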

12:18 - The Data Storage and Preservation of Access section covers provisions for depositing data in a long-term preservation and archiving environment. This includes backups, cloud storage, access protocols, obsolescence avoidance, data migration strategy, and so on. Where will the data be stored during and after the life of the project? Name specific workspaces and repositories and link to them as appropriate. For example, you may initially manage data on a local or network hard drive while working on the project and then transfer the data to a repository such as Ag Data Commons for long-term access and preservation.

13:02 - You may maintain data on a high-speed computing platform such as Scinet or Sciverse or on a shared workspace like Open Science Framework during analysis. You may deposit data in an institutional repository if one is available to you.

13:19 - What is the technical infrastructure and staff expertise? How and why are they qualified to preserve the data? Funders want to know that competent people with adequate arrangement will maintain the data. This is especially important information to include if researchers are publishing data using their own infrastructure.

13:42 - Specify plans for long-term preservation. Some items to address include approximately how much data are expected to be archived.

13:52 - Ideally this includes raw data and/or minimally processed data depending on your situation. Record the planned retention period for the data. Outline strategies, tools, and contingency plans that will be used to avoid data loss, degradation, or damage. There are several factors to consider as you evaluate your data repository choices.
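One concrete loss-avoidance tool worth naming in this section is fixity checking: recording a checksum when data are archived and re-verifying it later to detect silent corruption. A minimal sketch (the archived file content here is invented):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of a byte string."""
    return hashlib.sha256(data).hexdigest()

# Invented archived file content.
archived = b"plot_id,yield_kg\nA1,4.2\n"

recorded = sha256_of(archived)  # stored in a manifest alongside the data

# Later fixity check: recompute and compare; a mismatch signals corruption.
assert sha256_of(archived) == recorded
print(recorded)
```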

14:22 - We encourage researchers to archive their data in a subject-specific community-recognized repository whenever possible.

14:30 - These repositories often standardize their data and metadata requirements to suit their designated research community.

14:38 - In the absence of a subject-specific repository, the Ag Data Commons provides storage and access for USDA-supported data.

14:47 - The Ag Data Commons may also be appropriate for archiving associated supplemental data for primary data that’s deposited in a subject-specific repository. Data should have a DOI whenever possible to provide a persistent access location. Your repository choice should also support open licenses like U.S. public domain and Creative Commons that apply to most federally-funded data. Look for the repository’s policies and make sure they’re clear. And make sure the repository provides long-term permanent storage and access and not just a temporary workspace.

15:29 - On the right is a list of several examples of preferred repositories.

15:33 - The Ag Data Commons is the preferred general Ag repository for USDA-supported data. But you can see some repositories geared toward very specific types of data like soil carbon and food microbiology.

15:47 - You should put your data in the location where your research community can most easily and effectively find and use your data for the long term.

15:57 - The FAIRsharing and re3data links at the bottom of this slide provide even more information about preferred repositories for anyone who’s interested.

16:11 - The DMP we’re following records methods for storing and preserving their original source material which is in a physical form and also specifies where the digital data will be deposited.

16:24 - In this case, they’re submitting their data to Genbank for public access, which outsources much of the long-term access burden.

16:32 - They’ve also specified who will be responsible for maintaining data backup and preservation long term. In this case the overseer of the lab, a position rather than a named individual, is responsible for maintaining the infrastructure and workflow.

16:54 - The Data Sharing and Public Access section explains any restrictions, embargo periods, license, or public access level. Data generated by federal employees usually has either U.S. public domain or Creative Commons CC0 status, while federally-funded data and non-federal data may vary depending on funder requirements. You should indicate in this section when data collection is considered complete because that starts the 30-month clock for public release of the data.

Describe your data access and sharing procedures during and after the data collection process.

17:34 - For example, publication or public release. Name specific repositories, databases, and catalogs as appropriate.

17:44 - For example, data may be shared by publishing in a genomic database, or open source code can be shared in a public code repository such as GitHub, but it should also be cataloged in the Ag Data Commons if it’s funded by USDA.

18:01 - Outline restrictions such as copyright, proprietary and company secrets, confidentiality, patents, appropriate credit, disclaimers, or conditions of use of the data. Limiting distribution of data to project personnel or project or personal websites is strongly discouraged and grant panels probably won’t be impressed. Similarly, in most cases it’s not sufficient to make data available only on request.

18:31 - Depositing data in a reputable long-term archiving and access environment is preferred and may be required by many journals before articles are published. Indicate how you will ensure the appropriate funding project numbers will be acknowledged with the data.

18:50 - For example, CRIS numbers such as NIFA award numbers or ARS project numbers should be noted in a consistent way.

19:02 - There are some circumstances which make release of the data within 30 months of completion problematic. Under these circumstances, you can request a waiver to this requirement.

19:14 - You can submit a waiver request to your research leader or area director at any time prior to the 30-month release requirement.

19:23 - If you expect this issue ahead of time, the DMP Data Sharing and Public Access section should outline the anticipated need for the waiver.

19:37 - Requesting the waiver is straightforward. The researcher prepares a decision memorandum (you can find an example of this in the exhibits of the P&P 630 document) and passes the request through their normal chain of command up to the Office of National Programs.

19:56 - ONP then passes the request to the Office of the Chief Scientist for final decision.

20:03 - Some reasons that you might request a waiver include when a follow-on study is imminent and the data will be needed by that paper.

20:11 - If there’s a need for a deeper analysis of potential dual use and security issues or if you need additional resources and time to convert the data into machine readable formats.

20:29 - As we reach the bottom of the first page of the DMP we’re following we see there are no restrictions placed on the data.

20:36 - Because this data is federally-funded this is the default expectation.

20:41 - However, if your project has multiple funding sources or other contingencies, note those in this section as needed. This particular plan should have added the distribution method of the data into this section.

20:55 - They did indicate Genbank in the previous section about storage and preservation. And if this repository is also their planned method for making the data accessible, they should note that here as well.

21:05 - According to the new P&P guidelines they should also indicate when the data gathering is considered complete.

21:20 - The Roles and Responsibilities section provides information about project team members and tasks associated with data management activities over the course of a project. State who will be primarily responsible to ensure DMP implementation. This is particularly important for multi-investigator and multi-institutional projects. Define key roles of the data management team. This is especially appropriate for a larger scale project where you should identify who will perform which tasks.

Will graduate students or post docs or technicians have day-to-day responsibilities along with their research roles? Or is there a full-time data manager or database administrator? Provide a contingency plan in case key personnel leave the project.

22:12 - For example, if data are managed individually or collaboratively on a platform such as ARS’s Scinet and an investigator leaves, who then becomes responsible for the data? What resources are needed to carry out the DMP? If funds are needed, have they been added to the budget request and budget narrative? Projects must budget for sufficient resources to implement the proposed DMP.

22:40 - For example, there may be a data publication charge, data storage charge, or salary for data managers.

22:52 - At the top of the second page of our sample DMP we see the Roles and Responsibilities clearly outlined. They specify the platform they will use to transfer data from their lab to collaborators.

23:06 - Note again that they are outsourcing their access responsibilities to Genbank.

23:11 - They also note who will submit which portions of the project to Genbank.

23:22 - The final section, the Monitoring and Reporting section, contains information on how the researcher plans to monitor and report on the implementation of the DMP during and after the project as required by the funders. This may include progress in data sharing including publications, database, software, and so on, among other information. The plan should also indicate who is responsible for this duty. Cataloging everything in the Ag Data Commons should help researchers easily pull together a list of data products shared from a particular project.

The sample DMP concludes with information about the specifics of the monitoring and reporting.

24:10 - While they specify principal investigators are responsible for reviewing and revising the DMP, a link to the information about the specifics of NIFA’s monitoring protocol and requirements would be welcome.

24:23 - Always be specific when referencing a policy that you plan to use.

24:32 - As you can see, this entire example DMP is under two pages.

24:37 - Creating a DMP does not have to be a labor intensive task, but it does make you think about your project to ensure you’re planning for contingencies of data gathering, formatting, access, preservation, and future use. Your DMP can take up to three pages within the OSQR project plan. Consider the DMP like you would a roadmap to help you stay on track. To help ARS researchers navigate DMPs, NAL will review DMP drafts.

25:15 - Send your DMP draft and a project summary draft to agref@usda.gov. We’ll provide a general review of your DMP and propose improvements. But you are ultimately responsible for the final decisions on the content. Please plan ahead. We are a very small team and we require at least five business days to complete this review process. We’re also always looking for examples of good DMPs representing the various ARS programs to share. If you have a DMP that you think might help others in your program more easily create higher quality DMPs, please let us know.

We’d love to collaborate with you to provide the best resources to our researchers. We currently have one exceptional DMP available for download on our DMP guidance page and hope to add more.

26:11 - Our pilot DMP review service is dynamic and we welcome your feedback to improve this process. And to that note, just to let you all know, you will be getting a survey after this presentation concludes, and we would very much appreciate if you would fill that out and return it to us so that we can target our services to help you more easily achieve these goals.

26:40 - And as we near the end of this presentation I just want to summarize everything that we covered today. DMPs are an important part of the data life cycle. They save time and effort in the long run and ensure data are relevant and useful into the future. Funders, journals, and institutions including our own are beginning to mandate data management plans. And so it’s important for scientists to understand what a DMP entails. As you plan your next project and get ready to write your own DMPs consider the expected data types, data formats and standards, data storage and preservation of access, data sharing and public access, roles and responsibilities, and monitoring and reporting.

And keep in mind NAL is here to help you fine-tune your DMP draft to make it the best that it can be.

27:34 - And as we get ready to answer some of your questions I want to share a few links that will help you complete your data management plans.

27:43 - The first link is NAL’s guidelines for data management resources page that I mentioned earlier in this presentation, which contains a data management plan guideline, a link to an exemplary data management plan, and a lot of other resources to help you in composing your own DMP. The second link, dmptool.org, can be found on NAL’s Data Management resource page, but I’m highlighting it here as well because you can go there to see examples of many types of real DMPs for comparison.

28:16 - And finally I wanted to include the email again to send your DMP draft review questions, any questions for our team, or questions for the rest of the library. This is a general library email, and any question you have about this process will be forwarded to the appropriate person or unit. And with that I’m going to ask John if we’ve got any questions in the queue lined up, and you can all start asking your questions here.

28:46 - And we’ll try to answer them. John Sears: All right.

28:50 - Yes we do, from Hans Chang. Could you please expand upon data types, especially those that are not digital, and how do you recommend that they be stored? Erin Antognoli: Non-digital, there it really depends on your field. And I think most of you today are working with animals, I believe. So that could take a wide variety of avenues. It could be samples that you’re collecting, whether it’s tissue samples or notebooks that you’re recording by hand; it just really depends on your collection method and what you’re collecting.

29:37 - Where those things are stored, first of all, some of it depends on whether it needs to be stored. If things are being transcribed, say, you’re taking notes by hand but they’re all being transcribed digitally, you may not need to keep the originals if everything is being converted. But, you know, there may be… I know there are tissue banks… I think… I forget if one of those examples that I shared… I know NAL and USDA in general maintain a lot of different physical samples… a lot of different repositories for physical samples. So it would really depend on what you’re collecting and where is appropriate to deposit.

30:23 - And I know that probably didn’t answer the question very specifically but it really does depend.

30:47 - And Jon? Jon, if you’re reading off the questions, you’re muted.

30:57 - Jon Sears: Yes, sorry. Joanne says, “I share data with an international database.

31:02 - I have the website address but not a DOI.

31:06 - Is that sufficient?” Erin Antognoli: I know they recommend DOIs. And this is where I don’t know if we have anyone from higher up with the policies who can speak more closely to that, whether or not a DOI is going to be required. I know it’s strongly encouraged to have one, but maybe, I don’t know, Susan, you may know? I see Cyndy’s on the line. Cyndy typed into the chat; she said if there’s a persistent identifier that may be sufficient.

Yes I can unmute you. Cyndy: Thanks Erin. Essentially there are certainly persistent identifiers that a professional repository such as Genbank will provide, and that would be sufficient for this.

32:07 - What we want to avoid is some sort of repository or database that is not handling things in a way that you can have a persistent reference to the data. So, again, as Erin said on that other question, it may depend on the specifics here.

32:27 - Susan McCarthy: This is Susan McCarthy. I’d like to expand on this a little bit in that if you’ve deposited data with another database that does not provide a DOI, there’s also the potential that NAL could provide a DOI if you would submit a copy of that same data to us. So we have some ability to create DOIs as well.

32:53 - Jon Sears: Okay, thank you. And the next question is from Jeff Silverstein.

32:59 - Is there an FAQ for DMPs? The NAL… I could just jump in here real quick. NAL has some information pages with useful links as well, and I think Data One, that’s a good set of information there with the primer and so on.

33:20 - And I’m going to paste that into the chat. Erin Antognoli: That is a good thought, though. We are in the process of revamping our Data Management Plan web pages, and if Frequently Asked Questions is something that people think would be useful for them, it’s something that we’ll definitely consider. If we start compiling a lot of the questions that people have, we can start putting something like that together. But as we’re gathering feedback from people, you can use the email to contact us with any feedback on any services or information that you think would be particularly helpful to you over the course of creating your DMPs.

And as we’re revamping our pages we will try to incorporate those… those comments.

34:21 - Jon Sears: And okay, next question from John Down. I assume not every piece of data needs to go in a repository, like photos of every gel, or do we just use common sense to determine? Is there unlimited storage space for us in Ag Data Commons? Erin Antognoli: So there’s a few questions there. The first is does every piece of data need to go into a repository, and that is something that you’ll probably…

this is something you want to outline in your DMP. You’re generating a lot of data, but does all of it need to be stored indefinitely? And again, that’s going to… it’s going to depend on your project and what is needed to be able to verify those results.

35:05 - Sometimes… sometimes as you’re generating things, the raw data isn’t what people would use to be able to verify results or rerun the tests. It might be a slightly cleaned version of the data. And so, again, I can’t really answer that for you. It’s going to depend on what is the most useful data, what’s going to be useful in the short term versus the long term. And you’ll want to make sure you’re specifying in your project plan that you’re generating this data but that this particular subset of the data is what gets stored over time, and why that is.

Maybe you’d want to make sure that you’re outlining why that is and that you can answer that question. What was the next part of that question? Jon Sears: Is there unlimited storage space for us in Ag Data Commons? Erin Antognoli: Right now, there is not unlimited storage.

36:05 - Right, it’s… If you’re uploading individual files there’s a limit of 20 gigabytes per file that you can upload directly to us. Now we had run a Big Data pilot. We’re not sure exactly how sustainable that is for us right now, but you know, if that’s a requirement that you have, if you don’t have anywhere to put your data, that’s something that you can make known. And if enough people are requesting this, then it’s something that the Library may have to look into.

36:37 - So… but right now there is a 20 gigabyte limit per file on the data that you’re uploading directly to us.

36:49 - Jon Sears: Okay… Erin Antognoli: Was that the last part of that question? Jon Sears: Yes.

36:54 - Erin Antognoli: Okay. Excellent. Jon Sears: And Dana asks, “It seems that using digital notebooks may help facilitate a compilation of data that are to be shared. What options for digital notebooks are approved by the agency, or where can we find a listing?” Erin Antognoli: As far as… as far as what options for using digital notebooks or storing the digital notebooks? Jon Sears: I think it means…

37:27 - which ones are approved… which actual notebooks.

37:39 - Erin Antognoli: Yeah, I don’t know if there’s an approval process for using them versus not.

37:46 - Again, as far as storing, it’s like any other data: as long as you have it in a package, you can zip the contents together and store in the Ag Data Commons, or if it’s code, a code repository. We are in the process of putting together preferred repositories for a variety of types of data, as that’s something that we get asked about a lot, and so hopefully that will be an addition to the Data Management Plan guidance pages that Jon linked.

38:19 - I’m sorry I don’t have a more specific answer to that question right now.

38:24 - Jon Sears: Now there’s a related question from Lindsay: “Where should lab notebook data be stored?” Erin Antognoli: Again, maybe… maybe I don’t know if there’s someone else on the call who’s more familiar with some of these projects who would be able to suggest? But this is another thing: if anybody knows of preferred repositories for your specific type of data, please let us know where everyone is depositing so that we can look into these repositories and evaluate and get them added to the list, so that that information is more widely known to everyone. Because I know there’s a lot of different types of data out there and we’re not 100 percent familiar with every single repository that everyone deposits to.

39:16 - Jon Sears: All right, another question from Joan, “What about supplemental data in a manuscript that’s attached to the DOI for the manuscript?” I assume that’s sufficient… sufficient for meeting the funder requirements, and yeah.

39:36 - Erin Antognoli: So the supplemental data is attached to the article, published by the journal as well? Jon Sears: Yes, so it’s covered by the DOI for the article.

39:45 - Erin Antognoli: That should be. If the journal is issuing a DOI and making that data available, that should be fine. Just make sure it’s a reputable journal.

39:56 - Not all journals are equal and I know the Library does have a list of reputable journals. And I don’t have that link. Maybe, I don’t know if Jon or Michal can find it and put it in the chat.

40:10 - But yes, as long as it’s a reputable journal, that should be all right. Just make sure that there’s a catalog record in the Ag Data Commons so that people can find it.

40:21 - Because the journals are focused on the articles, not the data, and so it’s going to be really hard for people to find the data if it’s just cataloged with the journal.

40:33 - Jon Sears: And then it may come under a different license for attribution, if it’s part of the article. Erin Antognoli: Yeah, yeah. You have to make sure about the license there and make sure that it’s open.

40:53 - Jon Sears: Jeff Silverstein says, “I think with DMPs for ARS project plans there will be useful answers to questions. For ARS, would encourage an FAQ.” Thank you. That’s something we’ll need to work on.

41:13 - Erin Antognoli: Yeah, I think that’s a great idea. And doing these webinars will give us a lot of questions that we’ll have to work on answering.

41:26 - Jon Sears: And then Cyndy adds, “Remember, these are peer-reviewed.

So your community may have expectations you should meet.” Erin Antognoli: Yes. Yeah. With all of this you want to make sure that you’re in line with your research community standards, and each research community is a little bit different, so it’s hard to generalize these things. But you know, when in doubt, ask around.

Ask your fellow researchers and your supervisors where is appropriate. And make sure someone’s communicating that to us too. We’ll put together resources for you, but your community is the one that’s going to know best what’s expected of your data and where and how it’s found.

42:14 - Jon Sears: All right. David Alt asks, “In the relatively small project of which I’m a member, a large number of data types are collected, from animal tissues to isolates of bacteria, sequence data, proteomic data, flow cell data, and so on.

42:32 - In the end, the list is relatively extensive and the raw data quite expensive. I’m concerned that many projects will face this kind of challenge, and I’m concerned with respect to the limits provided. Do you have any thoughts with regard to the relatively narrow example you provided?” Erin Antognoli: So the limits as far as storage space? Or…

43:03 - Jon Sears: Let me just re-read that. Erin Antognoli: Page limits. Again, this is something that the people who are a little higher up may want to address. So, can anyone speak to having a DMP that’s more than three pages long? Jon Sears: Yeah. I haven’t really heard of that but…

43:37 - Susan McCarthy: This is Susan. I think we’ve more or less discussed this a little bit, in the sense that if your research involves standard protocols or something, and there is a standard reference article that you can point to that lists most of the data types you are going to be using in your protocols, you might be able to simply cite another reference that provides a more complete list.

44:15 - Jon Sears: Jumping back to lab notebooks, Susan says, “Paper lab notebooks may have specific locations for storage. Check with your area office.” Thank you for that. And Cyndy says, “Supplemental data can also be uploaded to the Ag Data Commons on its own in machine-readable formats with good metadata.” Erin Antognoli: Yes, you can always… Regardless of whether your data is stored somewhere else (and this could be for a variety of reasons: if you’re not sure the journal is going to be around long term, or if it’s not open, or if it doesn’t issue a DOI), if you want to upload that data to the Ag Data Commons, you’re welcome to do that as well, even if it’s attached to the journal. Again, you’ll have to check the journal’s policies. Sometimes they don’t want it shared elsewhere. But for the most part the journals you’re submitting to should allow that, and so any of those reasons would be fine for uploading your supplemental data to the Ag Data Commons.

45:28 - Jon Sears: Right. And then Jill at NAL has provided a link to the NAL’s list of trusted journals.

45:39 - Erin Antognoli: Ah, thank you. Yes, that’s the one I was referring to. So if you’re not sure whether your journal is a reputable journal, check out the list that Jill put in the chat. Those journals have been reviewed by NAL staff and should be good places.

45:59 - Jon Sears: And then Susan points out a possible issue with supplemental data: they’re often in PDF format, which is sort of a dead end. So it’s not machine readable and not easily reusable. Good point. And Jill also points out the NAL DigiTop web page for integrity and impact in publishing. And I think, let’s see…

I… Jennifer Wilson Welder said, “I think David means the Ag Data Commons storage space limit. The project we just completed had 18 gigabytes of photos alone.” So, to clarify, in the Ag Data Commons the file size limit is currently 20 gigabytes per file, correct? Erin Antognoli: Yes.

47:05 - You can also zip contents, so as long as the grand total is under 20 gigabytes you can zip files together; you don’t have to upload them one at a time. Now if it goes above that… again, if this is something we’re going to see more of, we can’t provide that at the moment, but we are looking into options for people. So the more we hear about it, the more we know we need to look.
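[Editor’s note: the zip-and-upload approach described here can be sketched in Python with the standard zipfile module. The `bundle_for_upload` helper and file names are hypothetical illustrations; only the 20-gigabyte per-file limit comes from the discussion above.]

```python
import os
import zipfile

# Ag Data Commons per-file limit mentioned in the webinar: 20 GB
MAX_BYTES = 20 * 1024**3

def bundle_for_upload(paths, archive_name="dataset_bundle.zip"):
    """Zip several data files into a single archive and check it
    stays under the per-file upload limit."""
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            # store each file by its base name, not its full path
            zf.write(p, arcname=os.path.basename(p))
    size = os.path.getsize(archive_name)
    if size > MAX_BYTES:
        raise ValueError(
            f"Archive is {size} bytes; split the data into "
            "multiple archives of under 20 GB each."
        )
    return archive_name
```

[A bundle created this way can be uploaded as one file, rather than uploading each data file individually.]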

47:36 - Jon Sears: And it’s 20 gigabytes per file, but you can have multiple files? Erin Antognoli: Yes. Yes. You can have multiple files. Jon Sears: All right. Hans Cheng says, “I would welcome guiding principles on what data, samples, etc., need to be recorded or stored. Like David Alt, I’m quite concerned with the vast amount and breadth of data that is being generated. Uploading sequence data is already problematic enough, and the method is known.” Erin Antognoli: Yeah.

This would be something that we’d need to collaborate on with each individual research community. It really depends on the standards that are set. You know, what is going to be needed to reproduce your data? Or is it necessary to reproduce that data? If someone wants to check the results, is what you’re saving enough, or do you need to save more? You may not need to save the raw data in a lot of cases. It just depends on the standards of your community.

And so, talking about making data interoperable and reusable: what gets stored to make it reusable? That’s highly dependent on your individual project, your research community, and what’s expected there.

48:55 - And we can help, we can definitely help work with you on that, but we can’t necessarily tell you. We have to work with the researchers as well and get some input from that community before we can give you a more definitive answer, because it’s really geared toward the people who understand and use your data. And that’s who this needs to be for.

49:23 - Jon Sears: And Cyndy’s added to that, “David, if they’re concerned about storage limits on data, please consider raising the issue with us, with the NAL, and with your NPL and supervisor so we can help to find a solution.” And Jennifer’s added to that discussion; I can go with Jen: “Also, is there going to be additional personnel to assist with transcribing and entering and uploading all of this data?” Jeff Silverstein: Hi, this is Jeff Silverstein.

Am I off mute? Erin Antognoli: You are. I can hear you. Jeff Silverstein: Okay. Thanks. Yeah.

50:07 - I just wanted to mention, I know that there’s a lot of discussion around what kind of data should be stored locally so that you don’t lose it, but is not data that would necessarily be shared publicly; rather, some of the data that’s shared publicly would be derived from it. One example I can think of is genetic sequencing, a genome sequencing image file. I think, you know, in some cases there’s discussion over whether image files should be saved at all.

But those individual image files may not need to be made publicly available. You may want to have them as a reference, and make the sequence that you’ve generated publicly available. And if someone were to come back and ask for further information, that might be something you’d be willing to share. But I know there’s active discussion around what should be privately stored and/or stored at the lab level and what should be stored at a more public level. And I don’t know if anyone from the Library can comment further on that or if there are some good examples.

But I think as as Erin was saying, some of those things will be community driven.

51:31 - Erin Antognoli: Yeah. Definitely that’s going to be community driven what are the expectations.

51:38 - Now as far as this Data Management Plan goes, in the Access section, where you should outline any restrictions or things of that nature, you can note if there’s some reason that you can’t make the raw data, or the primary data, publicly available; say you had to de-identify a lot of your data or take out PII, so you can’t supply the raw data, or even the unclean data, publicly.

And so you would outline in that section that “I can’t make this part of the data public; I’ll make the clean data public,” or something like that. But again, it depends on what’s expected and the nature of your data in particular. Jon Sears: All right. The final discussion point here.

52:35 - I think that I see from Hans again, “‘Community’ is not clearly defined, as one hopes

52:43 - to impact well beyond your own field of expertise.

52:47 - So is a better question: what data needs to be preserved to reproduce the results, to get at the reproducibility issue?” Erin Antognoli: Yeah, these are all very tricky questions.

And it is. Yeah, some communities are better defined than others.

53:10 - A lot have specific data and metadata standards, and for some, you know, it’s a little more loose.

53:18 - It’s hard to really pin down, in a lot of instances, what should be saved or how. And yes, the question is what should we save to reproduce the results, or to verify results? You can look at it that way too.

53:38 - If you’re coming out with a paper that has findings, and someone wants to check those results to see how you came up with your findings, does whoever is researching have the ability to do that based on the data that you’ve saved and shared?

54:00 - And yeah, Susan, I’m looking at the chat as well. Susan typed in, “In some cases, like meteorology data, raw data would be preferred, as algorithms are used to analyze the data and these can change over time. Having the raw data is preferable to allow re-analysis.” So, yeah, again, a lot of your processes and methods are going to factor in to how much and what you need to share, because if you’re processing a certain way, you should let people know how you’re processing the data, so that if they run that same process, do they get the same results? Or if they run a different process, do they get wildly different results? Again, it’s very dependent on the type of data and the subject.

Jon Sears: Right. Erin Antognoli: These are all questions we’re working on at the Library.

55:02 - So, you know, we may not have immediate answers for you, but if you have questions, feel free to contact us and we’ll help work through them with you.

55:14 - Jon Sears: And Cyndy says, “If you already have files you used in the statistical analysis of your data, these could be very close to what could be shared. You are not required to transcribe non-digital data.” And Cyndy again adds, “The first time through writing a DMP will be the hardest. Your OSQR reviews will provide the peer feedback to help make them better.” Erin Antognoli: Yeah, and again, once you have your first DMP, if your subsequent projects are similar, you can always take that DMP and change the pertinent parts for the new project.

55:58 - Which is why we’re looking for good examples from different programs: if everybody in your program is depositing their data in a certain place, those sections will already mostly be filled out for you, and you can just change the specifics of your project, like the dates and possibly the types of data. The closer your community or your research group is, the more that could help. It’s one of the reasons we’re trying to find a good variety of samples.

56:40 - Jon Sears: Yeah, that’s all we have. Oh, here’s Susan again. Susan McCarthy: Yeah. I’d like to address the issue about the community and community standards. This is something we’ve been thinking about and doing a little bit of work on. We had a couple of NIFA-funded grants where we were able to convene working groups to help start to define what should be saved. And I think over time we want to do more of this type of work, where we can help to better define what the standards of practice are, you know, in animal health, for example: what should be saved? So, as Erin said, we don’t have all the top-drawer answers right now.

But this is an area of active development. And we certainly look forward to interacting with you and your communities to come up with some better guidelines for everybody.

57:38 - Jon Sears: All right, Julie says, “Reproducing or verifying data requires the materials and methods, unless these, and” or… it’s jumping around.

Let me repeat that. “Reproducing or verifying data requires the materials and methods. Unless these stored raw data are linked to the experimental design, it will not be possible to reproduce.

58:06 - It is not immediately clear how experimental methods and raw data will be linked in this data storage process or how this information should be addressed within the DMP.”

58:21 - Erin Antognoli: A lot of your methods and things of that nature, that comes in with the metadata. So if there’s a paper that outlines your methods, you can link that in the metadata.

58:34 - And I know the Ag Data Commons has a place to do that. And there are a bunch of other metadata standards depending on your subject area that also account for linking articles or other types of publications or data dictionaries or things that define your measurements and things of that nature.

58:54 - And so, link wherever you can. If these things are defined already somewhere else, a link to them is perfectly sufficient, as long as that is outlined somewhere and you’re specifying, “I’m following these procedures… these are the measures that I’m using.” Try to reuse wherever possible.

59:17 - Did that sort of answer the question? If that doesn’t exist, yeah, it’s a lot harder of a process. And that does limit your reusability if you don’t have those things defined already, and it creates more work in that you have to define things in your DMP. But yeah, in the DMP, if there’s already a protocol, link to it. Link to it wherever possible.

59:50 - And that’ll help standardize research all across the board, really, if more people are linking to the same methods and protocols.

60:04 - Jon Sears: And presumably there will be more room for details on materials and methods in the overall project plan.

60:13 - Erin Antognoli: Yeah. The DMP is just a small part of the overall project plan.

60:19 - There’s a lot more to the project plan. Jon Sears: And then Jeff says, “Thanks for the presentation, Erin. Good job.” And Jeffrey Vallet says, “There are two goals of making data available to the public. One is to support reanalysis; the other is to support combination of your data with other data to support novel analyses. Some of the questions suggest some confusion regarding data versus metadata. For some of the questions about images, like gels or histology slides, one might want to post both the images and the densities or other measures derived from them.

Single gels to confirm something may not be that valuable. Batches of gels that are used to make measures supporting a statistical analysis would likely be more valuable to share.” Thank you for that.

61:28 - Erin Antognoli: And Susan actually wrote here wanting to know if Jeff, either of the…

61:34 - Jeffs, Silverstein or Vallet, would want to offer a wrap-up, because we are at the end of the hour. Do either of you want to say anything? You may have to type in, and we can unmute you.

62:03 - Alright. Actually, I don’t see him in here. Did he… did I just miss him? I think Jeff S. left, so it might be up to Jeff V.

62:57 - Oh, all right, I guess no one’s… we don’t have… Susan, do you want to say anything, or should we just wrap it up? Susan McCarthy: I do want to say again how much I appreciate that everybody spent the time to come and attend this webinar today. This is our first round of developing specific data management training for ARS researchers, and we really appreciate your input.

63:26 - We will be sending a short… a very short survey: 10 questions.

63:30 - Should take no more than two or three minutes to answer.

63:33 - And we would appreciate your support in helping us to continue to refine our work so that we can better serve ARS moving forward. And again, thank you, and best wishes for your project write-ups and a happy, healthy holiday season.

63:50 - Erin Antognoli: And actually, Jeff is now unmuted. He does want to say something. Jeffrey Vallet: So sorry and so yeah I don’t…

that was a good wrap-up, Susan. Can everybody hear me? Erin Antognoli: I can. Yes. Jeffrey Vallet: Okay. So yeah, these are great questions. And so, obviously, we’ve got some collaborating to do. We’ve got some discussions and whatnot to provide good answers to these questions. We appreciate you guys’ time in listening and asking the questions. And we will work together to try to get this sorted out.

64:48 - That’s all I have. Erin Antognoli: All right, well, thank you everyone for coming. That’s the end of our webinar. We have recorded this, so we’re going to make the recording available, and hopefully we’ll be sending that link around soon so everyone can review. Please let us know if you have any questions. We’re here to help, so thank you.

65:16 - Bye. USDA is an equal opportunity provider and employer.