Fedora Migration Paths and Tools: Pilot Project Update

Dec 3, 2020 05:19

Thanks for joining us today; we’ll get started in about 90 seconds. Glad you could join us; we’ll get started in about a minute. We’ll get started in about 30 seconds or so. Thanks for joining us. Why don’t we go ahead and get started. Thanks for joining us today. I’m Cliff Lynch, the director of the Coalition for Networked Information, and I’ll be introducing this session, a project briefing that is part of week three of the CNI Fall 2020 Virtual Member Meeting. Just to remind you, week three focuses primarily on standards, technology, and infrastructure considerations.

03:14 - I want to note that, along with the live sessions that make up week three of the meeting, we’ve also released a number of pre-recorded videos on this theme, and I invite you to have a look at those as well. A few things about the session logistically: we are recording this, and we will make it publicly available afterward. Closed captioning is available; please make use of it if it’s helpful to you. There is a chat, so feel free to use that, and there’s also a Q&A tool at the bottom of your screen. You can pose questions as they occur to you during the presentations.

03:52 - After we hear from all three of our speakers, Diane Goldenberg-Hart from CNI will moderate a Q&A session, and we’ll try to address those questions, or as many as we have time for. So with that, let me introduce the session a little bit. We have three speakers with us: David Wilcox from LYRASIS, Robin Ruggaber from the University of Virginia, and Amy Blau from Whitman College. And I guess what I’ll say about this project is that this is one of those things that people tend not to think about strategically. You know, we have to migrate a whole diverse community of implementations of a platform.

04:47 - But while nobody tends to want to think about this strategically, it is a huge pain point for the institutions involved. I was very pleased to see this project, and to see that IMLS recognized its importance and funded it, because it really is, I think, a very well-designed effort to work as a community on a community problem and make a significant difference. We have two implementer organizations represented here. The University of Virginia is a sort of heavily customized, high-end Fedora 3 site; Whitman is an Islandora site, and there are a lot of those out there too. So I think the project has really looked at the diversity of implementations in the field and tried to do things that are helpful to all. I’m really delighted to have Robin, David, and Amy here, and I thank them for joining us. And I’ll turn it over to David. Great.

06:12 - Well, thanks very much for the introduction, Cliff; I really appreciate it. And thanks to everyone for joining us here today. What follows is really just a brief update. I want to provide a summary of the activities we’re working on, hear from Robin and Amy on a couple of the pilot projects that are ongoing right now, and then talk a little bit about what’s next for this work. So, the grant itself was awarded earlier this year. There are some links embedded in these slides, and we’ll share them out if you’d like to take a look at any of these resources in detail. But as Cliff mentioned, this was a grant awarded by the Institute of Museum and Library Services for a little bit less than $250,000 over 18 months, and the focus is on moving from Fedora 3 installations to Fedora 6 installations, so on the migration side of things. And just a note that Fedora 6 is the latest version of the software; it is currently in an alpha state, and we hope to have it released in production early next year.

07:25 - This presentation really doesn’t focus on the particulars of Fedora or on the features of this new release. I’ve spoken about that at length elsewhere, and there are lots of other places you can go to find those details. There are some links in the presentation, and I’m also happy to answer questions if folks have specific questions about the software itself. But really the grant is focused much more on the challenge, which is simply that most Fedora installations out in the world are running version 3 or earlier, which is unsupported and has been for several years at this point. So these are repositories on aging technology, and really our concern is not so much the software; it’s the content. It’s all the content in these legacy systems that is becoming more and more at risk as the years pass, security updates are no longer applied, and versions of dependencies are no longer supported by systems administration teams. And so the focus really is on trying to move this content forward in time, while recognizing that migrations take a lot of time and effort. This was something we learned in the planning grant that preceded this one, where we investigated all the reasons why folks in the Fedora community were having such a hard time moving forward to one of the more modern and supported versions of the software.

08:53 - So fundamentally, in a sentence, we’re really just trying to bring the community forward to a modern and supported version of Fedora. That’s the overarching goal here, but again the focus is not so much on the software as on the content, and on trying to make sure that all of this great content that’s in these repositories all over the world doesn’t get lost to aging technology. So the process we’re following here is pretty clear. We are starting out by working with pilot partners, two of whom are on the line here with me and will be speaking shortly. We’re working to develop, test, and refine migration tools; there are some tools that already exist, and there are some that we’re developing as part of the grant. And we’re working with these pilot partners to do upgrades and migrations.

09:43 - We’re doing this so that we can improve these tools, but also produce documentation and best practices, and combine all these things into a kind of toolkit that we can then disseminate to the community at large, at which point we can get feedback and iterate on the toolkit itself to, hopefully, help everyone else move along the same path. And finally, we hope to host a dedicated migration training event at the end of this, so I’ll say a bit more on that in a few minutes. So just a bit more detail here on the phases. We’re currently in phase one, which began in September and runs roughly until May of next year, 2021. The goal here again is to document the migration and upgrade process, working with pilot partners, and in particular taking a look at metadata mapping, decision making, and all the steps that one needs to follow in preparing for and executing a migration between version 3 and version 6 of Fedora, though this I think would also apply more broadly, in many cases, to other kinds of migrations. The goal here again is to produce a community toolkit, which is something that we hope to share early next year.

So, the pilot partners: the University of Virginia, and as Cliff mentioned, Virginia has a number of Fedora repositories, but Robin will talk about the specific focus for this particular grant project, which has a custom front-end environment; and Whitman College, which is an Islandora installation and has a particular set of use cases there. Phase two begins roughly June of next year, continuing through to September, and this is where we plan to take the toolkit that we develop as part of phase one and really disseminate it to the community, validate it, and solicit feedback. We’ll be hosting webinars, providing lots of community channels, and reaching out both to groups and to individuals in the community that we know are running Fedora 3 repositories, to really encourage them to take up these tools, work with them, and let us know what’s missing and what would help their institution better prepare for and execute migrations in their own local environments.

12:01 - And finally, phase three is really trying to recognize that there’s no replacement for hands-on learning. So in the last phase of the grant, running from roughly October of next year to the following February, as I was saying earlier, we intend to host a migration training workshop. A lot of this depends on travel restrictions and the status of COVID at that time.

12:33 - If in-person travel is possible, then we hope to host it in person. And of course, all the way through this project we’re collecting feedback from the pilots, as well as from those who attend the workshop and those who use the toolkit, doing as much as we can to gather feedback and improve the outputs that we’re generating as part of this work. And of course, once the grant ends we want to make sure that this work continues. Fortunately, we do have ongoing year-over-year funding for the Fedora program that will help us make sure that these tools, this training, and everything that we produce live on past the grant and continue to be updated and supported. That’s really only possible through the support from all of the member institutions that fund this year over year and support full-time staff on the project, as well as all of our efforts to provide training and updates for the software and everything else that we do around community events to support these activities. So I do want to say thank you to all these institutions for your ongoing support. And of course, I encourage anyone who is using Fedora and gets benefit out of it to consider becoming a member, if you’re not already, just to make sure that we’re able to continue to support and sustain the software over time. So with that I want to turn it over to Robin, who’s going to talk more specifically about one of the pilots that we’re running in the first phase of this grant. Thanks, David.

14:35 - So I wanted to start off talking a little bit about our goals. Our primary goal was to save this at-risk content, which is in our oldest Fedora repository, a version 3.2.1. But we also wanted to test the migration tools and make sure that they had the features that would help other people who find themselves in similar circumstances to ours. So today we have three different versions of Fedora. We have a Fedora 4, which supports three of our Samvera applications, which manage our ETDs, our open access, and our audio and video collections. We have a 3.4 repository, which was always used for access derivatives.

15:21 - Most of that has already been migrated to an IIIF server, so we have a good idea of what’s in that particular repository and have fewer problems with it. For this 3.2.1 repository, the content is largely unknown. The repository hasn’t been touched for a decade, and it predates most of the people working there today. What we do know so far is that we have around 90 gigabytes of content to migrate. We know that some percentage of the content is in older, no-longer-supported formats such as MrSID, which we had tried briefly before Djatoka was ever a thing. We also know that we have some content that’s now in the public domain and readily available in other places, such as a collection of Mark Twain, but we also have some rare and unique content, such as transcribed manuscripts and a three-volume set of UVA history.

16:30 - The loss of all this content is considered high risk. It’s on older infrastructure that we want to get rid of. It’s an old version that is largely undocumented for us in the way that we’ve been using it, so people have been afraid to touch this system. We don’t think we’re alone in this situation, so we wanted to test migration tools to tackle the problem and provide feedback to the community. Considering all of these factors, we made the decision to migrate the entire contents as is from this Fedora 3.2.1 repository to Fedora 6, to get it stored in the Oxford Common File Layout, which is the persistence layer of Fedora 6. We believe this is the best path to stabilize the content and get it into a standard format. That will then facilitate our ability to analyze the content, prune it, and perform any necessary format migration. We’re also migrating all of our infrastructure to the cloud.

17:44 - The first thing we needed was the ability to install Fedora 6 using AWS and Docker, and the Fedora team quickly updated the installation tools to accommodate a cloud installation. We’ve run into a few problems, primarily around content that didn’t have a necessary component, which stopped the migration. As a result of that problem and some other things that we’ve noticed, we’ve been able to give that information back to the Fedora team, and they’ve been able to quickly turn around extra features such as progress tracking. They’re also working on tools to validate what’s been migrated. Obviously this validation is an important thing for everyone, but especially for us, given that we don’t have a good idea of all of the content, so knowing that everything that was read out of the repository is in this new standard format is very critical for us. So the benefits we see from being part of this pilot are these: we’ve prioritized a long-delayed project; we’re gaining knowledge about our content; the migration tools now enable the use of Amazon AWS and Docker; the migration tools have gained a couple of user-friendly features, such as the progress update that I mentioned before; and we’ve identified some content lacking information, so now we can get together with the content stewards and discuss these problems.

19:29 - We can find out whether this is something we can prune or whether it’s something we’re going to need other people to help us with later. And I think most important, the content, in whatever state, will now be in a standard persistence format and therefore better protected, which gives us time for further evaluation. So with that, I’m going to turn it over to Amy. Thanks, Robin. So I’ll tell you a little bit about Whitman College’s repository. We call it Arminda.

20:08 - It contains around 30,000 digital objects, which include undergraduate honors theses, other student and faculty works, archival collections of digitized photographs and other documents, and student newspapers from 1896 to 2015. We have materials in almost every Islandora content type, which means that our migration will document migration pathways for a wide range of use cases. One of our main goals in preparing for this migration has been to remediate our metadata, both to make the metadata mapping from MODS to RDF, which is required for Islandora 8, less complicated, and to improve the user experience with our collections.

20:50 - The documentation of our functional requirements related to our range of content types, and the documentation of our metadata remediation and mapping, are two of our major contributions to this project, especially at this early phase. The functional requirements that we laid out for our Islandora 8 site, for the most part, stipulate functional parity with our Islandora 7 site. We had listed a number of these requirements at the object and system level, and the LYRASIS team broke these out into some further categories. Things that are important to us include SSL integration, access control at the object level, and search and filter across content types. Probably the most important functional requirement related specifically to Fedora is ensuring Amazon S3 bucket storage capability within Fedora for Islandora objects, because this will reduce the cost of cloud storage for us. That’s also a potential incentive for others to migrate to Islandora 8 together with Fedora 6. Many of our functional requirements can be met using the affordances within Drupal and Islandora 8, which is great. There will be some custom development required, it looks like, specifically to support paged serial content in our news collection, so we’re looking ahead to that. Our metadata remediation was managed by a small metadata working group of Whitman College librarians.

22:18 - We re-evaluated metadata fields across all of our collections in order to streamline and standardize fields, as I said, partly in preparation to map MODS to RDF and also to improve display. We standardized some elements, such as date encoding and creator name, and we rewrote titles and descriptions (descriptive metadata) for archives collections. This working group comprised our metadata and digital assets librarian, our associate archivist, and the repository manager, who is the scholarly communications librarian, who is me. So we have expertise in description from both the library and archives sides, and some knowledge of how metadata is displayed in Islandora, and having this representation was really helpful to ensure that the standards we were deciding upon would really work across the broad range of our collections. We started this work at the beginning of March 2020. We met biweekly through the spring and summer and weekly starting in the fall, and obviously almost every one of those meetings was online, but that’s fine. We pulled together documentation on all of the fields that we were using in our Islandora 7 instance, and the working group members evaluated all 158 of these fields.

23:36 - So we got rid of fields that had irrelevant, outdated, or duplicate information, and we combined some fields. We introduced a couple of new fields that would be helpful in sorting or faceting; an example is genre. Reducing the number of fields also really streamlined the mapping process. We’re currently down to 54 fields, of which I believe 38 have been mapped to RDF, so we only need to map a few more. Our metadata librarian mapped these MODS fields to RDF, building on the mappings of the Islandora Metadata Interest Group, and selected some alternative mappings only where those didn’t really match up with our metadata needs.
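Editorial aside: the field-by-field mapping described here can be pictured as a lookup table from MODS fields to RDF predicates, plus a check of which fields are still unmapped. The following is only a minimal sketch under stated assumptions. The field names, the predicates, and the helper functions `split_by_mapping` and `map_record` are all hypothetical and chosen for illustration; they are not the Islandora Metadata Interest Group mapping or Whitman's actual decisions.

```python
# Hypothetical MODS-field -> RDF-predicate table, for illustration only.
# These pairs are assumptions, not the Islandora Metadata Interest Group mapping.
FIELD_MAP = {
    "titleInfo/title": "dcterms:title",
    "abstract": "dcterms:abstract",
    "genre": "dcterms:type",
    "originInfo/dateIssued": "dcterms:issued",
}


def split_by_mapping(fields, field_map):
    """Return (mapped, unmapped) lists of field names, preserving input order."""
    mapped = [f for f in fields if f in field_map]
    unmapped = [f for f in fields if f not in field_map]
    return mapped, unmapped


def map_record(record, field_map):
    """Translate one flat MODS-like record into {predicate: [values]}.

    Fields without a mapping are skipped here; split_by_mapping reports them.
    """
    out = {}
    for field, value in record.items():
        predicate = field_map.get(field)
        if predicate is None:
            continue
        out.setdefault(predicate, []).append(value)
    return out


if __name__ == "__main__":
    in_use = ["titleInfo/title", "note", "genre"]
    mapped, unmapped = split_by_mapping(in_use, FIELD_MAP)
    print(f"{len(mapped)} of {len(in_use)} fields mapped; still to do: {unmapped}")
```

Keeping the table separate from the code that applies it means a working group could review and revise the mapping as a document, which matches the per-field documentation approach described in the talk.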

24:15 - A really essential aspect of this work was the documentation of how we were going to use these metadata fields, because this documentation both guides the remediation we’re doing now and should improve our metadata generation going forward. So we tracked the requirements that we came up with in individual documents by field name, and we have a larger draft guide to metadata. Our requirements include definitions to clarify field usage, controlled vocabularies, spelling and capitalization conventions, date encoding conventions, whether a field can be repeated, et cetera.

24:52 - We have lots and lots of... I think, though, my conclusion from this really is that the remediation and mapping of our collection metadata has been very valuable in preparing us to make decisions about how to structure our metadata and our collections in the new Islandora 8 system. Our deep familiarity with our collection metadata helps us to really consider the ramifications of some decisions we’re still working on, such as how we’re going to deal with linked agents. As we work with our project partners in the coming weeks to plan the specifics of our Islandora 8 site and the migration pathways, we’re really going to be able to draw on this knowledge. The more we can draw those connections between our metadata work, the site building, and the migration pathways, the more useful the models we should be able to provide to other institutions that plan to migrate to Islandora 8. That’s my take, and I’ll hand it back to David. Thanks. Hey, thanks very much, Amy and Robin, for the updates. I’ll just quickly wrap things up here, since we do want to leave some time for questions. We are, as you can tell, right in the middle of these pilot projects, so there’s still lots to come before we’re able to put out a toolkit and share it with the community, but we expect to have something available early next year. If you’d like to follow along with this project, there are lots of ways to do so.

26:28 - There is a landing page on the Fedora wiki that is a good jumping-off spot for all the work that we’re doing. We’re putting out monthly blog posts on the Fedora website, and you can follow those if you’d like to get some of that information. We have lots of active conversations going on in Slack, which you can join, as well as on mailing lists. And of course you can support us by becoming a member and supporting Fedora. But I want to leave my contact information here.

26:58 - Since I’m leading this effort, in case anyone would like to get in touch for more information, I’m always happy to talk about the grant project or Fedora in particular. And if you have a migration use case, we’d love to hear from you and see if this toolkit might be something that might be of use. But I’ll stop there. I think we do have a couple of minutes, so maybe we can address questions if there are any. Terrific. Thank you, David, and thanks to Robin and Amy for the examples from your pilots. We do have some questions. Our first question is actually for Amy: for Whitman, will the custom scripts for S3 bucket storage be bespoke for Fedora, or might they be leveraged for other repositories? And will you share them in the expected toolkit release? I will take a quick stab, and I’ll probably have to hand that back to David. My understanding is that what we produce in the course of the project will be shared. I imagine that, for our site, things will be bespoke for Fedora, but I don’t know the extent to which they can be leveraged for something else. David, maybe you’ve got a little bit more in the way of specifics on that.

28:20 - Yeah, I could say a little bit there. Fedora 6 will have native support for S3, and it really just underlies Islandora 8. So the way we’re going to migrate content into Islandora, it will be migrated through Islandora, and then the content will be persisted to Fedora and S3. So there’s really nothing custom there; that’s a pathway that anyone who’s doing a similar migration could follow. If, on the other hand, you’re wondering whether you could use something other than Fedora and have S3 storage under Islandora 8, I believe that is also possible; I think Islandora 8 has a Flysystem module and other means of storing data in different ways. So I don’t think there’s really going to be anything particularly custom here, and if you have use cases around S3, I think those will be supported. Great, thank you. Thank you for the question.

29:16 - And thank you for addressing that question. Now we have a question for Robin: for UVA, can you explain more about the reasons behind the OCFL step? So if you’re familiar with the earlier versions of Fedora, you’ll know that there was a persistence layer in Fedora 3, but you have to know a lot about that repository and the way it laid content out on disk. So even though it’s readable, it still takes a lot of knowledge about that specific version of Fedora. And while it might be possible to bring up another repository over it, I don’t have much faith that it would come up without having problems with what’s stored there.

30:02 - So, migrating the content from Fedora 3.2.1 to Fedora 6, the way Fedora 6 stores it on disk is in this Oxford Common File Layout standard. It’s more easily read, and it’s a persistence layer that is more in line with best practices for preservation. It also means that if we wanted to use completely different repository software of some kind, the content is in this common layout that other repositories can honor. And we think it’ll be in a format on disk that we can more easily understand, because it’s standard, so we can write software to parse through it and prune it of things that we don’t want there anymore. Does that cover what you were looking for?

31:00 - We have documentation for the standard and documentation for Fedora 6. David, I don’t know if you had anything else you wanted to add. I think you mostly covered it. It’s worth noting that the Oxford Common File Layout is a separate but related effort. As you were saying, Robin, it’s a more standardized approach: there are repositories that are not using Fedora but are using OCFL, and so there are tools that can inspect an OCFL repository regardless of whether it’s Fedora based or some other application. So that standardized approach really does make the kind of work that you’re talking about, Robin, a lot easier, I think, because there is a wide variety of tools that can understand and parse that data without relying on Fedora 3, for example, which is its own sort of custom application. I’m seeing a thumbs up in the Q&A box, so I think that addressed the question. Thank you, Robin, and thanks, David. And thank you for the question. I’m not seeing any other questions at this time, and I see that we are at time, so I am going to once again thank our presenters for sharing your work on this project with us here at CNI, and also our attendees for making time out of your day to join us. I will go ahead and turn off the recording, and if you are still with us and wish to approach the podium and have a chat with any of our speakers or ask a question, please feel free to do so.

32:44 - And with that I will say goodbye, and thanks, everyone. Have a great rest of your day; hope to see you back at.
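Editorial aside: the Q&A point that OCFL content can be parsed by software that knows nothing about Fedora can be illustrated with a short sketch. It assumes an OCFL object directory whose root holds an `inventory.json` with the `head`, `versions`, `state`, and `manifest` keys defined by the OCFL specification. The function name `head_state` is hypothetical, and the sketch performs no validation or digest checking, so it is a reading aid rather than a migration or validation tool.

```python
import json
from pathlib import Path


def head_state(object_root):
    """Return {logical_path: content_path} for the head version of an OCFL object.

    Assumes object_root contains a spec-conformant inventory.json.
    Content paths are relative to object_root. No fixity checks are done.
    """
    inventory_path = Path(object_root) / "inventory.json"
    inventory = json.loads(inventory_path.read_text(encoding="utf-8"))

    head = inventory["head"]                      # e.g. "v3"
    state = inventory["versions"][head]["state"]  # digest -> [logical paths]
    manifest = inventory["manifest"]              # digest -> [content paths]

    files = {}
    for digest, logical_paths in state.items():
        # Any content path listed for this digest holds identical bytes,
        # so the first entry is enough for locating the file.
        content_path = manifest[digest][0]
        for logical_path in logical_paths:
            files[logical_path] = content_path
    return files
```

Calling `head_state` on an object directory gives a plain dictionary of what the current version contains and where the bytes live on disk, which is the kind of inspection that could precede decisions about pruning or format migration.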