Code4Lib 2021 Tuesday Welcome and Lightning Talks
Apr 5, 2021 23:43 · 6355 words · 30 minute read
Thank you to our Captioning Sponsors: MIT Libraries and Balsamiq.

I have a few announcements before we dive into the talks — we have six of them. A reminder that our official conference hashtag on social media is #c4l21, and if you would like to join our Code4Lib Slack community, please go to code4lib.org/slack. After filling out the form, if you didn't receive an invite immediately, please check your spam folder.

Have you hit information overload? Have you hit that wall during the conference, even on day two? Remember, if you need to take a break from the talks, the virtual quiet room on Whova has a variety of quieter, meditative, and restorative activities to help you rest up for the next batch of talks — and these talks are recorded, so you can catch up on anything you missed during your break. You can find the virtual quiet room page under the Logistics menu option in Whova.

We have our Community Support Squad, or CSS, here to assist attendees. You can find their schedules and photos beneath the Conduct and Safety section of the website. The community support volunteers for the first half of today are Eric Phetteplace — that's phette23 on Slack — and Louisa Kwasigroch (sorry if I messed up that pronunciation) — louisa on Slack.

So without further ado, we're going to enter right into our lightning talks. Each presenter has about six minutes. If you haven't already signed up for a lightning talk and are interested in participating — I think the fact that we have this opportunity to share our knowledge without having to prepare much for a presentation is one of the nice features of the Code4Lib community — after today's session we also have slots available tomorrow, Wednesday, at 3 p.m. Eastern time and Thursday at 1:10 p.m. Eastern time. All of our talk chat and Q&A will be in the same session listing in Whova, a little different from the talk blocks yesterday, where we moved around from session to session. If you have a question for one of the presenters, please enter the presenter's first or last name, then your question. They will be presenting live — they're here with us — so likely we will not be able to address the questions until their presentation is over.

And without further ado, I'd like to introduce our first speaker, Hector Correa, who will talk about marcli, a command-line utility to parse MARC files. Take it away, Hector.

Hi — let me share my screen. My name is Hector Correa, I work at Princeton University Library, and I'm going to talk about this command-line utility that I wrote to parse large MARC files. This started because I'm a developer, not a librarian. When I started working in libraries, I saw that people look at bibliographic records like this, while librarians tend to look at them more in this form, which is the MARC data behind the record. This is really good — it has a lot of nuanced information — and if you're a developer you need to know a lot of the details of the record, and all of that information is specified in MARC. So this is great, but when I, as a developer, got MARC data, it usually looked like this, and I was like, what is that? Well, that's a MARC file in binary. Sometimes they come in XML, which is a little more readable, but it's still not the same as reading what the librarian was looking at. So what I've done is write this small utility that lets us look at a MARC file on the terminal in that readable format.

Let me show you what I mean. I have several MARC files here on my machine — this is one. If I look at this file directly, it looks like this; but if I run it through the utility — marcli, oops, then the file name — it appears in a much more readable way. Now, if you're a developer and you parse MARC regularly, you probably already have utilities to do this; in Ruby, in Java, there are many libraries for it. This is just in case you need to parse a very large file to find a particular record or analyze something really quickly — it works really well with very large files. Let me show you: for example, this is a file that is almost 500 megs. If I run it through the less command in Linux, you can see that I can start looking at the file right away. And there are several quick options to do filtering, like maybe only wanting certain fields — I'm going to pipe it to the less command so it doesn't keep scrolling forever — so you can see the records right away. You can do filtering: say I only want records that somewhere have the word "water" — and again, somewhere in each of these records the word "water" appears. This has been useful for me to find out how many records we have that have, or don't have, a given field in MARC. And it works like any Linux utility will work, because it outputs text files. So with the big file I have here, I can say I only want a certain field — and that's the data. You can grep it, because we're just generating a regular text file: you can pull out the IDs, then count how many lines there were, and then you know that there are 10,000 records in this file. Maybe you want to find out how many of them have the word "fish" somewhere — and you have seven records. So that's the goal of the utility: it lets you parse MARC files quickly, and it has no dependencies — it's an executable that you can just download and run on your machine. This is where you can go and get the files: there's an executable for Linux, one for Mac, one for Windows, and if you're interested in the source code you can also find it here. It's written in Go, which is not that popular in libraries but is very readable.
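The demo above — dump a binary MARC file as readable text, then filter and count — hinges on the MARC21 binary layout: a 24-byte leader, a directory of 12-byte entries, and delimiter bytes separating fields and subfields. marcli itself is written in Go; as a rough illustration of what such a tool does, here is a minimal standard-library Python sketch. The function names and the `$a`-style rendering are mine, and for real work you would use marcli or a library like pymarc.

```python
"""Minimal MARC21 binary reader, in the spirit of the marcli demo (a sketch)."""

LEADER_LEN = 24
DIR_ENTRY_LEN = 12
FIELD_TERM = b"\x1e"       # end of directory and of each variable field
SUBFIELD_DELIM = b"\x1f"   # precedes each subfield code
RECORD_TERM = b"\x1d"      # end of record

def read_records(data: bytes):
    """Yield one parsed record at a time from raw MARC21 bytes."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 5].decode())  # leader bytes 0-4: record length
        yield parse_record(data[pos:pos + length])
        pos += length

def parse_record(record: bytes):
    """Return a list of (tag, value) pairs with subfields rendered as $a-style text."""
    base = int(record[12:17].decode())            # leader bytes 12-16: base address
    directory = record[LEADER_LEN:record.index(FIELD_TERM)]
    fields = []
    for i in range(0, len(directory), DIR_ENTRY_LEN):
        entry = directory[i:i + DIR_ENTRY_LEN]
        tag = entry[:3].decode()
        length = int(entry[3:7].decode())
        start = int(entry[7:12].decode())
        value = record[base + start:base + start + length].rstrip(FIELD_TERM)
        # Render subfield delimiters the way catalogers write them: $a, $b, ...
        fields.append((tag, value.replace(SUBFIELD_DELIM, b" $").decode("utf-8")))
    return fields

def contains(fields, term: str) -> bool:
    """Case-insensitive 'does this record mention the term anywhere' filter."""
    return any(term.lower() in text.lower() for _tag, text in fields)
```

With `read_records` and `contains`, the "how many records mention fish" question from the demo becomes a generator expression over the file's bytes — the same shape as the `marcli | grep | wc -l` pipeline shown on screen.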
If you have any thoughts, questions, or comments, feel free to contact me. That's all I have — thank you.

I clicked — I clicked "turn off video" instead of "unmute." Thank you for the great talk, Hector. Just as a reminder, if you have any questions for Hector or any of the other presenters, please put them in the Q&A. Our next speaker is Matt Lincoln, who will present on "Too Many Photos, Not Enough Metadata: Computer Vision for Similarity Search in the Archives." Take it away, Matt.

All right, thanks so much. So yeah, the longer title of this is a project that we did at Carnegie Mellon last summer, the Computer-Aided Metadata for Photo Archives initiative, and this was a project that I worked on with our wonderful university archivist Julia Corrin and Emily Davis, and another of our former colleagues, Scott Weingart. But yes, as the short title says: too many photos, not enough metadata. And I'm going to be completely stealing from Andromeda Yelton for kind of this entire short talk.

The problem we were facing here at CMU: our General Photograph Collection is the institutional photo archive for the university. There are over a million photographs, they estimate, between prints, 35-millimeter negatives, and born-digital photos, going back about 120 years. We have about 20,000 digitized or ingested so far, but precisely none of them have been put onto our now very old digital collections platform. We're in the midst of migrating to a new platform, but one of the main things stopping us from getting these photos out, public, and searchable was the real lack of metadata — how could we make these things findable? I'll give you a sense of the assignment problem we're facing. The collection is mostly organized at the level of a single roll of film — or, for stuff from the late '90s, the CDs that we get — and what you see on the right-hand side is a mocked-up contact sheet of what we would get in a roll of negatives, with the description the photographer would have written: "portraits of an unidentified man / freshman camp." Which ones are the portraits of the unidentified man? Which ones are the freshman camp? There's nothing attached to the single item that tells us — unless a human being is looking at it, and then it's obvious. So we couldn't just inherit — copy and paste — all this container-level info onto the items, and keying in all of that item-level metadata manually would be an amount of labor we simply couldn't do.

So what was the summer project for this initiative? A prototype implementation of a system and a user interface that would not replace our archivists with computer vision, but instead give them a user interface with just enough of a sprinkling of computer-vision sugar on it to help them quickly pull up similar photos and class them together. One output was a white paper reporting on the effectiveness of a bunch of methods — similarity search, duplicate detection, tagging at scale, even testing out object detection with Google Vision (spoiler alert: it's terrible) — but also some production requirements: what would a system look like if we were going to do this for real?

I'll give you a hint of some of the results we got out of this. Here's one of those lovely t-SNE — or I think this was a UMAP — clouds of those photos in visual space. We did feature extraction using off-the-shelf models, and I'm happy to chat more about that in Slack or in questions afterwards. Again, we weren't using these models to create labels for these photos; we were using them to cluster photos together — which ones sit close to each other in visual space. These off-the-shelf models aren't perfect for this collection — primarily black-and-white, historical photographs, and we talk about that more in the paper — but they got us close enough for pretty effective visual search. Here's an example: in the upper left-hand corner is a picture of our College of Fine Arts building, and here are all the photos that came out of a search for it. A bunch of them, as you can tell, are from the same roll of film, but others are from different parts of the shoot, and some are from completely different years — which is exactly what we needed.

That was step one. Step two was building out an interactive user interface for our archivists to experiment with: how would we use visual search not just to locate photos, but to actually apply some textual metadata — tagging at scale? This is what we mocked up, using the existing metadata we had, those very broad labels inherited from the roll of film. An archivist could pull up, say, "I want to start with a picture of commencement," and on picking an index photo, visual search would return a bunch of possible photos that look similar to it. The archivist could then flip back into the archival context: all right, show me that whole roll of film, let me pick out which of these are actually, say, students at commencement versus pictures of professors at commencement or honorary speakers at commencement. Our interface was really saying: visual search is the start, but then comes the human eye to look through everything — making it easy to pull in that archival context, and not forgetting it.

So I invite you to check out the full white paper, where we talk about the ups and downs of this: what worked, what didn't, what we learned from an archival standpoint, what we learned from a UI standpoint, and also a lot of important considerations about computer vision itself — what are good things to do with it, and what are bad things to do with it. We also have sample code. Again, this is not a production system — it's completely an alpha test — but you can get an idea of where we're going with it. I'm happy to answer questions in Slack, so hit me up there, and I think Julia Corrin may also be on if you have questions about the archive. So, thank you.
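The heart of the visual search described in this talk — represent each photo as a feature vector from an off-the-shelf model, then rank other photos by how close their vectors are — reduces to a nearest-neighbor lookup. Here is a toy standard-library sketch of just the ranking step; the photo names and three-dimensional vectors below are invented for illustration (real feature vectors have hundreds of dimensions, and a real system would use numpy and an approximate-nearest-neighbor index rather than a sort).

```python
"""Toy nearest-neighbor ranking over (made-up) photo feature vectors."""
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(index_photo, photos, k=3):
    """Rank all photos by similarity to the chosen index photo, excluding itself."""
    ranked = sorted(photos.items(),
                    key=lambda item: cosine(photos[index_photo], item[1]),
                    reverse=True)
    return [name for name, _vec in ranked if name != index_photo][:k]

# Hypothetical embeddings: the two Fine Arts shots point the same way in
# feature space, so a search seeded on one surfaces the other first.
photos = {
    "fine_arts_1950": [0.9, 0.1, 0.3],
    "fine_arts_1972": [0.8, 0.2, 0.3],
    "commencement":   [0.1, 0.9, 0.5],
}
```

The point of the CAMPI workflow is that this ranking is only the first pass: the archivist, not the model, decides which of the returned photos actually belong together.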
Thank you, Matt. Okay, the screen has stopped sharing. Now our next speaker will be Cary Gordon, who will give an Islandora update. Take it away, Cary.

Ah, there you go. Thank you. This is going to be a remarkably unvisual overview of where Islandora is right now. Islandora, as most of you know, is a digital asset management system built for libraries. Islandora started in 2006, integrating Drupal, Fedora, and Solr through custom software built by three developers at the University of Prince Edward Island. Through Islandora 7, it uses Drupal mostly for chrome and ephemeral content; the display is still driven by Fedora, and Fedora was — or is — the center of the Islandora universe up till then. Drupal modules were used, but they were really just wrappers for Islandora custom code. Dozens of developers, 331 known installations — it's grown quite a bit since then, although through 7 it is basically the same approach.

Also, in the last couple of years, ISLE — which was called "Islandora for Everyone" and is now "Islandora Enterprise," recycling the acronym — began as a consortium of eastern colleges using Docker as the way to mount Islandora, which lessened the technical debt; it was built largely by a company called Born-Digital. Before then, installing Islandora was quite a process; in the last few years, using Ansible to build it simplified things a lot, and moving to Docker helps even more. Islandora 8 is a complete rewrite: it's centered on Drupal 8/9 and a bevy of microservices — not Microsoft, microservices. It supports, but doesn't need, Fedora 6.

I'll just throw this up — that's pretty much the microservice map. At the beginning of today's chat I put links to the documentation, where you can find this chart and an explanation of what the latest version of Islandora 8 is doing, and some links to the alpha code for mounting this in Docker. So, as I said, it doesn't need Fedora 6, but it can use it: simple systems that don't need that level of preservation and metadata tooling can be built just in Drupal, and it also features things like being able to build it in Drupal and use an outside storage mechanism like S3 directly. It can still be built with Ansible, but the recommended path going forward is going to be through ISLE — container-based, and Docker for now. The new ISLE install is an alpha right now; it's got 27 Docker images, and we're trying to narrow that down a little. The development of Islandora has been supported heavily by LYRASIS, discoverygarden, Born-Digital, and to some degree my company, Cherry Hill. I just wanted to make this quick, and that's pretty much it — ask any questions you like, and give it a try.

Thank you, Cary. Just a reminder: there were two questions put in Whova that were for Matt. Please do make sure you include the speaker's name — I think it was clear those were for him, but for any future questions, put the name of the speaker. Thank you for your presentation. Our next speaker is Ryuuji Yoshimoto, who will introduce the Open Book Camera version 1, a DIY camera for taking pictures of books. Take it away.

Okay — yeah, I hear you. Okay, and now we see you. Hello, everyone. I'm nervous, and my English is not good, so let's try. Hello everyone, I'm Yosh from Japan, and I'll introduce the Open Book Camera version 1. I'm a web engineer in Japan, and I operate several library web services — an OPAC and a discovery service. I want to show book spines in the OPAC and other web services, and book cover images are provided by commercial
data services, but there is no data for spine images — nothing. So I wanted a fast way to take spine pictures. During summer vacation I tried developing a camera for books — there are many ways to shoot books with a camera — and this is what I completed: the Open Book Camera version 1. This camera shoots three sides, about 15 books per minute. It takes three shots — top, bottom, and side — at the same time, so it's about four seconds to shoot one book: put the book in, exposure is automatic, and the three images are stored together in the same folder, under the same path. It is open hardware and open-source software: all designs and code are provided under CC0 and published on GitHub. As much as possible it uses commodity parts you can buy on Amazon — a USB camera, an Arduino, Python, OpenCV. This is the architecture of the Open Book Camera: a computer controls the light and the sensors, a book detector, and a thickness detector, and the top camera is focus-controlled. I built many cameras at home; this camera is a little bit large, and I assembled it from a camera kit. It's now being tested — by me personally, and in a special library, a public library, and a bookstore… A special library in Tokyo is trying automatic OCR with the Google Vision API. I'm now planning further development — let's make it together if you're interested. Thank you.

Thank you very, very much. Our next speaker is Carolyn Cole, who is going to talk about deploying everything with Capistrano. Go ahead, Carolyn.

Hi, everyone. So yeah: deploy everything with Capistrano. It sounds a little crazy — whoops — and so I wanted to take a step back and ask, where do good ideas come from? Per my usual, this one came from a conversation. Most ideas — sometimes they're conversations with yourself, but most of the time they're conversations with others; it gives you a different perspective. Your head is down, and maybe when you look up you can get a good idea. So the conversation was a little bit like this. Francis and Esme and I were in a room, and I went, ugh, I hate this thing that I've done. I had created this Ansible role, and I wanted it to deploy my code, but the whole role was really bad because it had to go look at and rebuild the whole server. So I created a separate role — that's still slow, I'm the only one on my team who can run it, and it's stored separately from the source code. And Francis says to me, "Carolyn, why aren't you deploying this with Capistrano?" My first reaction was, what the heck — this is Drupal code, why would I ever deploy it with Capistrano? Capistrano is a Ruby app. And I said, well, I'm just going to stick with Ansible for now. That just rattled around in my head for the longest time, until I said, why not try it?

So I picked a tiny app — a PHP application — and then I did a few things. I created a Gemfile: a tiny little file, not that much to maintain. I ran bundle install. I ran bundle exec cap install — that's the instructions from Capistrano, not hard. I then went in and added my servers, pointed it at my code base, and set a few things like where to deploy to. And then I did one little thing which was really a little bit weird, not Ruby-ish: PHP has an environment file, so I linked it in — but Capistrano has all of these lovely "after" hooks, so that wasn't even really that hard. And my mind was blown. I'm like, whoa, wait a second — can I really deploy everything with Capistrano? It was so easy. So then I thought, let me just try something else, and I will tell you that Drupal 7, Drupal 8, Drupal 9 — they're not quite as easy. There's a lot more configuration, there's some attaching to Solr, there's playing with stuff — but it really wasn't that hard. The thing we now have is that even though I maintain Ruby applications, Drupal applications, PHP applications, Craft 3 applications — no matter what code I'm working on, I can deploy it the same way.
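Pieced together, the steps Carolyn lists — a tiny Gemfile, `bundle install`, `bundle exec cap install`, adding servers, pointing at the code base, and linking in PHP's environment file — amount to a `config/deploy.rb` along these lines. This is a sketch only: the application name, repository URL, host, and paths are placeholders, not Princeton's actual configuration.

```ruby
# config/deploy.rb — hedged sketch of a Capistrano 3 setup for a PHP app.
#
# Gemfile (the "tiny little file"):
#   gem "capistrano", "~> 3.16"

lock "~> 3.16"

set :application, "my_php_app"
set :repo_url,    "git@example.com:libraries/my_php_app.git"
set :deploy_to,   "/var/www/my_php_app"

# The "little bit weird, not Ruby-ish" part: persist PHP's .env file across
# releases by symlinking it from Capistrano's shared directory.
append :linked_files, ".env"

# config/deploy/production.rb would then list the servers, e.g.:
#   server "app1.example.com", user: "deploy", roles: %w[app web]
```

After `bundle install` and `bundle exec cap install`, deploying any of these apps — Ruby, Drupal, PHP, or Craft — is the same command: `bundle exec cap production deploy`.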
So if I have an update — and usually my updates to these applications come when something has gone horribly wrong — I don't really have to think that hard. I just say, oh yeah, cap deploy. Awesome. The one thing I haven't gotten yet: Princeton has a Slack integration with Hubot, and I would like to be able to deploy through that. I haven't quite gotten it to work, but this was incredibly easy and I would never go back. That's my whole talk. There are URLs — I put the URL into the Slack chat if you want to look at things. I'm happy to answer any questions, but thank you, Francis, for asking me that question. That's me.

Great, thank you, Carolyn. We have one more scheduled talk lined up, but we still have about 25 minutes left in this session. So if anyone has been inspired by what you've heard and has something to share with the community — it doesn't have to be anything formal with slides or a more formal PowerPoint presentation — you can raise your hand in Zoom, or reach out to us in Slack or on Whova, and we can promote you to a panelist so you get a chance to speak to the group, even if it's just a brief thing you want to say. We'll also use the remaining time to make sure any questions asked of our presenters are addressed. We have one addressed to Cary that hasn't been answered yet, and it looks like we're getting responses on the one asked of Ryuuji — good job, everyone. But without further ado, I will introduce our next speaker, Eric Phetteplace, who will talk about "Mrs. MARC" and learning from Code4Lib. Go ahead, Eric.

Hello, everybody — let me get my screen sharing started here. So this is actually just a redo of a talk that our cataloger gave last week. I encouraged her to create a lightning talk about a critical cataloging project we had that actually came out of work from Code4Lib. Our cataloger is Amber Bales — there's her picture; she threw the pictures of other staff in here, so I feel like she deserves a shout-out. In the upper left, that's her, and she gave this at the SCELC colloquium, a local California consortium, last week. Basically, the project was identifying biased or antiquated descriptions in name fields of our metadata. It was a problem originally brought to our attention by Janine Scarborough, our archives technician, who had noticed it with regard to the wives of our trustees at California College of the Arts — formerly California College of Arts and Crafts, so CCAC. You can see the caption on the right of this image says "CCAC trustee Mrs. Charles Henri Hine, left, and Mrs. Albert G. Churchill, wife of a trustee," etc. Spoilers: their names are not actually Albert and Charles — they have their own names. But in lots of older records, women were described this way: they were referred to by their husband's name, even when they were the author or creator of a work, for instance in MARC records.

So this was something we had been aware of and had seen a lot in our records. But then in 2019 Noah Geraci gave a Code4Lib talk, "Programmatic approaches to bias in descriptive metadata," that I found to be really great, highlighting a lot more than just this issue — but specifically producing a tool to analyze names and sort of guess whether this "Mrs. Husband's Name" construction existed. Really quickly: this is the talk abstract, and there's a recording linked in the slides here if you want to go look at it. Noah, in this script, notes the complexity of names and how this is an approximate approach: gender is not a property of a name — you can't just look at a name and know the gender of a person — which is why the tool used is called a gender *guesser*. It's always approximate, and it always takes some human intervention and analysis to determine what's going on, but this tool does a really good job of approximating where there are biased descriptions. All I did was very little: I just stitched it together — let me refresh; the white is a lot more readable, I think — I just stitched it together with pymarc, essentially. The tool is generic; it's not meant to analyze any particular type of metadata — it's a text analysis tool — but you can simply iterate over all the non-control fields in MARC, which is all I did, looking for these constructions and printing them out. Then I compiled them into a spreadsheet where we could walk through them one by one, doing some research and analysis. Sometimes nothing was wrong, and the gender guesser had simply been inaccurate; other times the correct name in the right form was somewhere in the record; and a lot of the time we had to do research, utilizing primarily LC authority records — since these are bibliographic records, that should be good a lot of the time. But oftentimes that wasn't enough, and you can see there are some WorldCat links, Wikipedia, even a Trove link in here: all sorts of different sources to try to determine the actual names of these creators, authors, or people being referenced in works. And for a few of them — I think it was only about two out of some fifty — we actually could not find the correct name form at all.
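The core of the workflow described here — iterate over the name fields of each MARC record, flag any that look like "Mrs." followed by a conventionally masculine forename, and export the hits for human review — can be sketched as below. This is a stand-in, not Noah's tool: the real project used the gender-guesser package with pymarc, while this sketch uses a tiny hard-coded name list and plain `(tag, value)` tuples so that it runs with the standard library alone.

```python
"""Sketch of a 'Mrs. Husband's Name' audit over MARC name fields."""
import re

# Stand-in for the gender-guesser package: a few conventionally
# masculine first names, purely for illustration.
MASCULINE = {"albert", "charles", "george", "henry", "john", "william"}

MRS_PATTERN = re.compile(r"\bMrs\.?\s+([A-Z][a-z]+)")
NAME_TAGS = {"100", "600", "700"}  # personal-name fields worth auditing

def flag_fields(fields):
    """Return the (tag, value) pairs that look like Mrs. + a masculine forename.

    Matches are only *candidates* for review: a name's form never tells you
    a person's gender, which is why the real tool is a gender *guesser* and
    why every hit in the project got manual research before any change.
    """
    flagged = []
    for tag, value in fields:
        if tag not in NAME_TAGS:
            continue
        match = MRS_PATTERN.search(value)
        if match and match.group(1).lower() in MASCULINE:
            flagged.append((tag, value))
    return flagged
```

Running this across a file of records and writing the flagged pairs out as CSV reproduces the spreadsheet step: a worklist for catalogers, not an automated correction.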
This is part of a few critical cataloging and metadata practices we're trying to undertake at California College of the Arts. We were inspired by projects like Archives for Black Lives, which is linked here — a series of anti-racist description resources, particularly for people at predominantly white institutions like ours, where most of our library employees are white. This is really important work for us, especially because we do end up describing materials by people who are not white, by minorities. The Cataloging Lab is another really valuable and inspirational resource that introduces new subject headings and keeps track of subject heading changes. Specifically, one thing we've been engaged in is migrating away from the "Illegal aliens" subject heading to "Undocumented immigrants." There's a great documentary on this that people might be familiar with, where basically the Library of Congress wanted to make this change and Congress blocked it and politicized it. So that's something we do, and it does take quite a bit of work and manual maintenance on our part, but it's important to us. We as a library all watched this documentary and discussed it, and other similar issues, and places where we can make our metadata less biased. I think I posted the link to these slides — there are a few useful embedded links in there, like to Noah's talk from 2019. Thank you very much.

Okay, thank you, Eric. That concludes all the scheduled speakers for today, so before we break, I want to give us a few minutes — since we do have all the speakers available — for answers to any remaining questions from the talks. Please put any questions you may have in the chat or the Q&A, and we can get those answered, maybe even live. Okay, I do see a question here from Sarah Hammerman — hi, Sarah. This is for Eric: are there any instances in which this tool could be used to identify authorized LCSH with "Mrs. Husband" in the heading, and to work on revising those headings?

Yeah, that's a great question, and I absolutely think you could. Iterating over the entirety of LCSH or the Library of Congress authority headings would probably take quite a long time, but I have no doubt you would find a lot of them. Just anecdotally, from the instances we saw, most of the LC authority records did refer to women by their own names, and they might have an alternate access point — like, what is that, a 700 field or something — for other constructions like "Mrs. Husband's Name." So mostly the problems were with our catalog and not with LC, but absolutely, I think that would be a valuable project to undertake. Thanks.

Okay — another question for you, Eric: did you meet resistance in breaking from LCSH when you changed to the "Undocumented immigrants" heading? If so, how did you address it?

No, we didn't, locally. We're all pretty much on board with this — we have a pretty progressive staff, and even though it's a little bit of extra labor for us to make this change, everybody thought it was the right decision. But for a look at resistance, I would actually recommend watching that documentary, Change the Subject. The problem kind of originated at Dartmouth — that's what the documentary is about — and it gives you an interesting look at the complexities of it and the resistance that librarians can sometimes put up. I'm hoping I can convince one of my colleagues to give a lightning talk later this week about this exact subject, but I'll paste a link in the chat: if you do a search for "undocumented immigrants" in the Franklin catalog at Penn, we've introduced an interesting approach to show preferred terms — to still preserve the data, but to have the records display "undocumented immigrants" instead of "illegal aliens."

Let's see — any more questions? Okay, another one for you, Eric: could deliberately altering "Mrs. Husband's Last Name" also be considered a form of deadnaming?

I honestly am not informed enough to answer that question. All I can say is that it does speak to the complexities of names. It is somewhat of an assumption that people want to go by their own names: one thing we came across in our research was that Black women, for a while, wanted to be referred to by their husband's names, because for a while Black people weren't allowed to marry — in America, literally, we repressed that — so there was a sort of point of pride to it. I think most, if not all, of the records we were correcting were for Anglo-Saxon women, so we felt pretty comfortable in doing that, but that's certainly a point we'll have to look into more.

Okay, thank you for answering these questions. Anything we haven't already covered, we can continue to discuss in Slack later. Oh no — another question for you, Eric: not a question about the technology — I really like that you all watched and discussed Change the Subject, and I think we might want to do something similar here. Was that part of an existing staff development structure?

Yes and no. With the Black Lives Matter protests, our administration is trying to be more thoughtful and actually materially invested in change and in progressive projects like what we're working on. So we set aside Juneteenth for everybody to read, research, and look into things they wanted to apply in their work, and we had an hour set aside — this was actually during 2020, so we were working remotely — where we all got together and watched the documentary virtually. So it will become something
that is a recurring annual project, but it only just started this year.

Okay, thank you. So it looks like I just got a message here: Cary, I think, can also do a quick minute or two about OBS, so I will pass the microphone to him. And if someone still wants to join in, we do have ten minutes left in this session.

Okay, give me one second here — almost there. Okay, so, OBS. There are — oops, I just blew this whole thing, sorry — there are a number of people using OBS, running their video through OBS, and let me start my video — there we go. So, a number of people are using OBS, and I started using it when our co-chair — my co-chair, friend, and sometimes co-worker Peter Murray — well, I saw that he had an Index Data logo in the corner of his videos, and I said, that looks cool. So let me just give you a screen share — let me find it here; I only have 4,000 windows open; it's in here somewhere. Here it goes — sorry, I have to do something here; this is taking amazingly long; sorry, this is totally impromptu. Yes, so — okay, that should work. Screen sharing... I'm still not screen sharing. Let me try again. Okay, well, this is screwed up, because it says I have to allow screen sharing for OBS, and I have, and it's still not showing up. Maybe I can show my entire screen and just make this bigger — I'll try one more time. Yep, Zoom is not letting me share my screen in any way, shape, or form. So anyway, this would have been interesting — maybe I'll try to put it together for later — but right now it's a complete fail, because I can't really show you what I'm doing. I tried to come up with something.

Sorry about that experience, Cary. But yeah, as I said earlier in the session, there are still slots available Wednesday (tomorrow) at 3 p.m. and Thursday at 1:10 p.m. — do sign up. And I guess I will just take a quick look to make sure nothing — well, I think we've addressed everything; I don't see any new questions. So with that, I think we'll end the session a few minutes early. The next session will be in a separate Whova session and will start at 1:05. Thank you all very much.