Code4Lib 2021 Monday Talks Group 2 and Closing

Apr 8, 2021 17:59 · 8623 words · 41 minute read

Thank you to our Captioning Sponsors: balsamiq and MIT Libraries

hello everyone welcome to session two i hope you had a good break and enjoyed those breakout sessions our first speaker in this session is david pixton he’ll be talking about machine learning apps for non-programmers he is an engineering subject liaison librarian in technology at brigham young university

let’s talk a little bit about machine learning in libraries and more particularly a process called topic modeling i’m david pixton and have been working for the past few years as the engineering and technology librarian at brigham young university though my background is technical from the engineering field i am not a coder and so the perspective that i bring to this discussion is from a librarian’s point of view

so what can topic modeling do for libraries well two areas of interest in this technology are found in the cataloging and also the collection development realms in the first area topic modeling can help catalogers efficiently sift through lots of text to provide information relating to the aboutness of many library resources this then makes them more discoverable through subject searching in the second area topic modeling using various data sources can help provide insights into research interests at an institution and thereby help more fully align purchases of library materials with research directions this latter area is where we’ll focus this discussion

now in a traditional organization such analysis may be relegated to a very few experts in data science however there is a trend today towards the democratization of machine learning or more broadly data and data science the thinking here is that by pushing this type of analysis deeper into organizations the organization becomes more effective one author of a recent harvard business review article suggests that if the workforce is more data literate a data team can shift its focus from quote doing everybody else’s work to quote building
the tools that enable everyone to do their work faster indeed if those with the background to interpret data in a particular context such as in this case a subject librarian if they can also be given the tools to crunch the data that can be quite productive gartner the analyst firm best known for publishing the hype cycle in the tech world suggests that a couple of ways democratization can be manifest are in citizen developers and also citizen data scientists gartner adds that low code and no code development tools support these citizen efforts

so what might that look like in the library environment specifically relating to topic modeling well as i’ve looked around inside this space it seems that potential librarian developers have some key needs thankfully we also have some key enablers that help fill those needs first of all the data sources that can be of most use to librarians in their topic modeling efforts come in different formats so a librarian that’s not savvy in manipulating data formats needs a simple means of handling these secondly the process of modeling including such steps as data collection tokenization and so forth is not often part of a librarian’s day job so citizen data scientists need to be led through or at least reminded of these processes and thirdly interpretation of the data requires first presenting the data in a format that best helps the interpreter to see important relationships and for many that’s going to be a visual format

now for the enablers thankfully there is a robust open source community that shares machine learning and visualization routines that support low code python development this includes some really nice visualization routines as well the jupyter notebook is another enabler because it can walk a citizen data scientist through the data science process in addition the kaggle environment provides a python kernel that’s complete with stable builds of the necessary python libraries this overcomes a
big barrier to entry for python coding which is the issue of version compatibility in a fast-moving open source world so let’s see one fairly simple way of implementing topic modeling and visualization using these tools

now here is my jupyter notebook containing an explanation of how to do topic modeling mixed with executable code blocks the user can make changes to any code block and then run the code by clicking the execute button to its left now i’ll note that we are inside the kaggle data environment which facilitates sharing of data and notebooks i’ve included an example data set but the user can actually select their own data set copy it into the correct code line and then read that data now because the librarian may be using data sets that include different field names i print out the field names from the data set that we just read so that the user can select which is of interest the user then inserts the field names of interest into the next code block now in this case you will note that the field author keywords used in the sample data set is named uncontrolled terms in the present data set so we’ll just replace those

next the user is walked through some validation and data cleaning processes to ensure that the right fields are selected and that they have meaningful data this is followed by pre-processing routines including identifying stop words punctuation and special characters that will be removed from the analysis now it’s actually time to execute the topic modeling routine first we choose how many topics to identify and then we execute the main code block now you’ll notice when the code is finished executing that i used an open source visualization tool called pyLDAvis to help me understand and refine the results

now what we see here is a typical topic model from my analysis the circles on the left represent each identified topic area and visually show how unique some topics are and how much other topics overlap these overlapping topics may
indicate we need to revisit our choice of number of topics they also might be candidates for consolidation now the bar chart on the right shows the individual terms that constitute a particular topic and we can see in which topics a particular term appears and where it dominates by simply clicking on that term the red bar shows an estimate of how frequently that particular term would show up in the topic and the blue bar indicates how frequently the term appears in all of the publication documents this helps us to determine how good a particular term is at discriminating one topic from all the others

so how has this topic modeling code helped me as a librarian well first of all it’s given me the ability to better learn topic information from a variety of recorded evidence so far i’ve looked at journal abstracts and keywords from bibliographic databases which shed light on faculty interests i’ve also analyzed our institutional repository which highlights student research areas from theses and dissertations and also interlibrary loan data which largely represents undergraduate studies these data sources do differ in scope and quality though i found that looking at publication information from two different databases in this case compendex and scopus is important because it gives me different topics of interest i found that approximately half of the topics from each database were similar to each other and the other half were unique to that particular database and this is due to differences in the journal coverage of each and where different faculty members publish i was a bit disappointed by the interlibrary loan data though it offered the most records of any of my sources its information was limited to just titles which proved to be less informative than other sources providing titles abstracts and keywords

now secondly because the analysis is a relatively simple and quick process i have autonomy to update this regularly without burdening someone
else’s queue this is important because evidence-based decision-making is always looking to some extent in the rearview mirror so you don’t want to just run the analysis every few years if you want to stay up to date and because the jupyter notebook leads me through the process i can pick it up in six months without a lot of wondering how i did it the last time

through this process i’ve come to appreciate how topic modeling can provide a more complete understanding of research directions at my university now this slide lists a few of those directions that i found depending on the number of topics i choose i can get not only broad subject areas but can find some of the nuances that better describe what’s of interest some of the topics in my topic models are quite detailed in fact for example one of the dominant topics that the topic model identified is origami-based compliant mechanisms this is a niche but a key area of research at my institution this niche area is not apparent when i just analyze a list of database indexing terms

there’s still more that i wish to explore with this tool certainly i’m interested in determining if there are yet other data sources that provide meaningful understanding i’ve actually done a little work in modeling topics using author supplied natural language versus database supplied index terms and there are some interesting differences between these two though both are useful also a lot of this analysis appropriately relies on subject expertise but we need to ask ourselves if there is a way to make some of the process steps more objective one candidate is perhaps the selection of the number of topics based on the size of the data set or other factors

so i thank you for your interest in this topic i’ve included a link to the code if anyone is interested in using it or better yet improving on it i also invite you to look for other ways to support librarian data scientists in parting i acknowledge the work by marina zhang and kari
kozak which has brought this particular application of machine learning to light in libraries and i also thank andromeda yelton who has provided me excellent training with respect to machine learning and who has encouraged me to share this with you thanks again

thank you david i’m not able to turn on my video at the moment i’m sure it’ll be back on in a moment but yeah there we go i wanted to thank david for that really interesting talk that got a couple of questions which he’s already answered in the question and answer section on machine learning apps for non-programmers we had one question asking him to share the link which he’s already done and another about what data formats to import and what that process looks like in jupyter notebooks so if there are no further questions i’m going to move on to our next speakers who are clara turp and lucy kiester with ethics of ai algorithms looking into pubmed’s best match algorithm clara is the discovery systems librarian at mcgill and her research interests include linked data and the ethical implications of algorithms in libraries lucy is the liaison librarian for undergraduate medical education at mcgill i’m looking forward to hearing both of your talks

hi everyone thank you for being here i’m clara turp and i’m a systems librarian and i’m with lucy kiester who’s a medical liaison librarian we both work at mcgill university and we’re here to talk about the ethics of ai algorithms specifically looking into pubmed’s best match algorithm so here’s a quick outline of our presentation but we will dive right in

to introduce we thought we’d make it really simple and talk about what algorithms are because you can define algorithms in a very mathematical way with steps a and b and c but we actually think it makes more sense especially in the context of this talk to see algorithms as institutions because of their power to
structure behavior influence preferences guide consumption produce content signal quality and sway commodification so we really like this definition because it looks at algorithms as a whole environment and you look at it holistically with all their parts instead of specifically looking at the very kind of mathematical steps but looking at how they work within their entire environment and so a good way to look at this specifically is actually through an example so we will now go into what is pubmed

so pubmed is actually a great way to look at an environment holistically but a little background information first what is pubmed it is a free access database produced by the ncbi for the national library of medicine or the nlm out of the usa critically what it does is give access to medline and medline is about 5 000 of what are considered to be the core biomedical journals that make up basically the backbone of western medical and biomedical science so it’s kind of a big deal pubmed itself is indexed using something called medical subject headings or mesh but critically because pubmed is free access the full text isn’t necessarily but the database itself is and it is free worldwide which has led to it being one of the most used databases in the medical fields for example in just 2017 which is the most current data that has been released by the nlm pubmed had 3.3 billion individual searches so it is a highly utilized database

so why pubmed why are we using the example of pubmed well there’s sort of a twofold reason here first is in 2020 pubmed came out with a new interface and they did a bunch of sort of cosmetic things to make it look more modern and nice and sleek and all that good stuff but critically it introduced a new default sort order for its results and that new default sort order is something called best match which is actually an ai based algorithm that uses something called learning to rank more on that in just a minute so this new default is ai based and critically from a researcher perspective because pubmed is publicly financed it is all public domain information meaning that we can go in and access the algorithms we can go in and see their statistics their decision-making processes are all freely accessible to anyone who wants to know which makes this the perfect sort of system to investigate and to get really into ai and algorithms as an entire system not just a mathematical model next slide please

so really briefly how is pubmed used like most users of most databases pubmed users click on the first page eighty percent of clicks happen on the first page of results there is also because pubmed and how it is used have been widely studied a known prevalence of short overly broad searches so those who are seeking health information especially health practitioners they are not necessarily expert searchers and they want quick and easy results and so often they will do quick and easy searches to get those results so critically in pubmed as the total number of results increases the likelihood of the user clicking on any result decreases and this is actually really important because pubmed has thousands of new articles added every day the medical field is a prolific one and so it’s very easy to get hundreds of thousands of results and not find anything that’s useful in a search so now i
will take over for a little tech talk so we’ll dive right in so best match uses a classic information retrieval model as a first layer it uses bm25 which is a probabilistic model to kind of rank all documents as a first step so how do classic models actually calculate relevancy very briefly it does a comparison between query terms and indexed document terms and counts the matches and then weighs the documents based on the matches it found so the weights are given depending on how many times a term appears in a document but it also looks at how many times it appears in the entire collection so if a word appears a lot of times in every document it might not be that relevant instead if it appears a lot of times in only one document the document is probably very very relevant so the more weight is given the more relevant the document is and the higher it ranks and then in all that mix you also add a layer of document length normalization which basically makes sure that you don’t give more power to long books over short articles

so that’s the first layer you have a first layer of classic relevance sort bm25 and then you add a second layer which uses learning to rank which is kind of a bunch of models that all use artificial intelligence and machine learning to rank and the one specifically used by pubmed is lambdamart and so lambdamart reorders the first 500 documents so it takes the entire result bm25 gave it and reorders the top 500 into a new relevance order based on its algorithm and its algorithm is influenced partly by features that would impact relevancy some of the features are query features others document features and some the relationship between query and document so an example the easiest is for a document feature so date document format all those are examples of features the algorithm also uses ai to figure out the importance of each feature and that might change with every search and then machine learning needs
to learn as it’s nicely called so you need to train it to learn and to evolve and to recognize what is a good result so the way pubmed uses training is through click-throughs so anytime a user that searched using best match clicks on a result that kind of becomes a click-through and that might be a component for the gold standard that will train the algorithm so learning to rank this is kind of a visual representation where you have bm25 then the top 500 results are re-ranked in lambdamart which is trained using click-throughs and is influenced by query and document features and then finally you get the results in pubmed

so a little bit about other ways pubmed uses artificial intelligence query expansion and query suggestion so query expansion is basically when you modify a query whether it’s with spell check automatic term mapping or other ways to fit it better basically so the algorithm assumed that if you spelled canker it was probably not what you meant because it’s not actually a word so it kind of spell checks it and suggests cancer instead and then you get the results for cancer and you get a nice little line that says did you mean cancer and then you also have query suggestion which suggests queries based on terms typed by the user so as you type cancer it will give you popular searches with the word cancer and you can click on any one you want to get the results for that specific word so those are two behaviors that users tend to be very used to because of google

so now that we know a little bit about pubmed best match and how it uses ai we’ll go a little bit into ethics of ai so algorithms are black boxes they’re this mysterious thing that we the techie crowd kind of understand and the non-techie crowd thinks is magic and that increases with ai so this is where it’s really important to think of algorithms as more than mathematical steps because you really have to understand the whole environment for it not to be a black
box not just abc for every algorithm but how every algorithm interacts with the others and modifies the outputs of the others and what is the end result and it becomes a very complex environment that’s not easy to figure out one way we kind of like to show this is with the accountability gap and that is basically who holds the responsibility for the outputs so if for example google did a search and the results are ethically very concerning which has happened a fair amount of times then if you complain about the algorithm you might get an answer along the lines of yeah the algorithm kind of happened and that shows how much of a black box it is because it might not even be explainable by the designers or the creators of the system themselves and finally when we think about query suggestion we also have to think about how it presents a mainstream worldview because it might present common searches it might bring you towards something that more documents talk about and that can change the user behavior and be very very dangerous especially when you don’t know what’s happening

then you have transparency which is our second big concern transparency can be defined as accessibility and comprehensibility pubmed is actually pretty good on that compared to other databases it is publicly funded so it’s not a highly guarded secret it’s not a competitive advantage you can see the algorithm you can read on it which is an attempt to achieve comprehensibility it’s not bad but it’s not perfect and given the complexity of it is it actually comprehensible is it actually accessible is it actually transparent and that’s a question we’re asking ourselves and everyone else and a specific example of this which lucy will dive a bit more into later is how and when user queries are modified so when you do a query suggestion is it always clear to the user that the query was modified and in which way it was modified and if it’s not always transparent
then it becomes very dangerous because it manipulates the query and presents results without the user knowing that happened

and the last kind of general ethical issue we have is with user click-throughs so there’s a limited amount of data that fits this gold standard threshold that pubmed expects because it’s a relatively new system and they’re using user click-throughs that use best match you don’t have thousands and thousands and thousands of searches which means that you’re training it on limited data which could be problematic this also brings us to the idea of a vicious circle bias will become harder to identify because the system is trained by the results of the system so if there’s a bias it will become very inherent to the whole thing and it’s going to become harder to know where the issue is another concern we have with user click-throughs is is a click actually proof of relevancy we’ve all been in cases where you type a search there’s nothing really great so you click on one because you’re like maybe i’ll get something in the intro we’ve all been there i’m sure and so that’s kind of worrying because some of the click-throughs are assumed to be gold standard but if they were just a yeah that might do that’s not a great gold standard and then fiorini actually outlines how there is bias in ignoring the document’s ranking we said people click on the first page that’s where most clicks happen so if a document is the first result versus the first result of the second page one might get more clicks just because it’s higher where the other one could have been more relevant so there should be some weighting based on where the document was ranked and that could bring bias

so i will pass the torch back to lucy who will talk about some medical ethics yeah so when you add the layer of medicine the ethics actually get more complicated and part of this is because of something called evidence-based medicine so what evidence-based
medicine is is something that seems fairly intuitive but it’s the idea that when you’re making a clinical decision you want to be using the best most current information there’s a couple of different ways that you identify the best most current information one is date the date of the article really matters especially when you’re looking at drug information new information is coming out so quickly that an article that is even three years old could be completely out of date and defunct there’s also a hierarchy of article types that are preferred so something like a systematic review which is a synthesis of multiple individual trials is preferred over a single one of those trials because you’ve got a higher sort of pool of numbers to look at and hopefully draw more reliable conclusions when it comes to actually selecting the best article there is sort of that question of well do i want a systematic review from two years ago or a clinical trial that came out two weeks ago and there’s not always a right answer but it is something that has to be considered and further we know that doctors are doing quick searches they don’t have time to even read multiple articles let alone evaluate or synthesize multiple articles they want a one and done get in get the answer get out

which leads to our next slide which is some user experience so when a clinician or a doctor or any sort of medical practitioner goes into pubmed they want in and out fast and this leads to a lot of perceived versus actual relevance which we’ve already sort of talked about but it’s important to highlight again that most users of pubmed are not expert searchers they are going to run these very simple broad searches and they’re going to pick what they think fits best based on the results presented to them not necessarily on what is the actual best result which maybe isn’t even showing up because of their search and the default sort so we know and now you know that best match as the default sort of pubmed uses ai
but there is no flag that says best match uses ai when you’re actually in pubmed all it says is sorted by best match so you actually would have to go digging further to get that information to know that best match is an ai based algorithm so there are these sort of different layers that lead to some additional complication also when we discussed the features that are used to calculate relevance we don’t actually know the weight of each feature so we know that document type is considered in the best match ai algorithm and we know that date is considered but we don’t know what weights are given to them so when you do a search and it is default sorted by best match it’s entirely possible that an article from 2016 is placed first and an article from 2020 maybe doesn’t appear until fourth or fifth position why that is only the algorithm can tell you but that actually is something that is potentially concerning from an evidence-based medicine perspective where date really carries a lot of weight but so does article type and then best match is considering other factors as well and all of a sudden you’re not really sure why things are being presented in the order they’re being presented next slide please

so there’s a risk benefit analysis that has to kind of happen here best match is meant to assist users in finding helpful articles quickly it was designed to help clinicians find that good article and find it fast especially given that there are hundreds of thousands of article results possible for any given search it’s meant to be helpful but ethically if you’re making a clinical decision based on the article that you read out of pubmed could the display order be biasing the articles you see well yes as we’ve discussed it probably is so is this risk of bias worth the time that is saved for the clinician and we don’t have an answer to that but it is something that as we’ve dug into it has become the sort of really big looming question this matters how do we expand on how
it matters that leads us to algorithmic literacy so i am a teacher i spend a lot of time teaching pubmed as do many health sciences librarians pubmed is a big one to teach because it is open access so our students can use it and access it long after they leave university for the great wide world and i have found that the more we get into this research the more i realize how important it is to highlight the place of ai in the databases as i teach so one of the things that we as librarian instructors do is teach critical thinking and often it’s about the cycle of research we teach you to evaluate how you got your results we teach you to evaluate the contents of the articles but something that we need to add in is how the articles were presented to you because so often in databases it is that order of presentation that has that ai influence on it and that’s not something we currently teach and it is something that is going to become increasingly critical especially with these medical ethical issues sort of getting all wrapped up in it

so we’ve thrown a lot at you but in conclusion we think it’s really really important to consider not only the black box ai algorithms but also to understand how different disciplines will have different ethical implications in the use of ai and crucially that algorithmic and ai literacy even at a superficial level is going to be absolutely critical for librarians to teach as we move forward into this increasingly technologically complex world thank you for your time here’s our bibliography we encourage you to check it out and we appreciate your attention

thank you very much clara and lucy that was a very interesting talk we don’t have time for questions because that ran a little bit over time so we’re going straight into andromeda yelton’s discussion of navigating without a map discovery in a world beyond subject access andromeda is a software engineer and a librarian investigating human applications of machine learning and she’s an adjunct faculty member at the ischool in san jose

hi i’m andromeda yelton that andromeda on the twitters and the slack and this is navigating without a map discovery in a world without subject access so imagine that you have a collection for example 44 000 electronic theses and dissertations representing the graduate student output of your institution and imagine that you would like to explore this collection but it does not have subject access metadata sad face but wait idea perhaps you could have humans lovingly hand craft subject headers for all say 44 000 items in your collection sad face if you do have the cataloging labor at your institution available to do that please tell us what planet you live on and if it’s nice there but i think most institutions really don’t have the human labor available to provide handcrafted subject access headers to all of their distinctive collections so can we have robots help this is the story of how i trained a neural net to help me create ways of navigating a collection of about forty four thousand graduate theses which was the collection of masters and phd theses for mit as of about 2017.
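[editor's note: the moving-window idea the talk goes on to describe can be sketched in a few lines of python — note this is only a toy illustration of window co-occurrence counting, not the actual neural-net training the speaker used; the sample tokens and the helper name are invented for illustration, reusing words from the aldrin thesis title mentioned later in the talk]

```python
from collections import Counter

def cooccurrence_counts(tokens, window=4):
    """Count how often two words share a sliding context window.

    This captures only the counting intuition behind the talk: words
    that land in the same window often are treated as related. A real
    doc2vec model trains a neural net to learn dense vectors rather
    than keeping raw counts, so treat this as a sketch.
    """
    pairs = Counter()
    for i, word in enumerate(tokens):
        # look at the words inside the window following position i
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

# hypothetical sample text echoing the Aldrin thesis title
tokens = ("line of sight guidance techniques for manned orbital rendezvous "
          "trajectories for rendezvous and intercept trajectories").split()
counts = cooccurrence_counts(tokens)

# rendezvous/trajectories share several windows, so their pair count is
# higher than for an arbitrary pair such as line/trajectories
print(counts[("rendezvous", "trajectories")],
      counts[("line", "trajectories")])
```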

this collection had some metadata applied to it like author year title department but it didn’t have any subject access and what i ended up with was all 44 000 of these arranged by conceptual similarity so this is a spoiler alert for the rest of my talk where i will tell you what you’re actually seeing here but in brief each of these dots is a thesis in general ones that are closer together are more similar in meaning and the different colors represent different departments so you may be wondering how they got here to find out let’s make a neural net

i used the doc2vec machine learning algorithm to train a neural net on all of these theses and i’m going to give you the super quick crash course on how it works i would love to answer more questions in slack because time is very limited for this but in brief you have your algorithm read through your corpus of documents and it’s kind of following a moving window of words across and so although i’ve given you windows of different lengths of words here just because they’re sort of easier to read as a human when they have sort of meaning boundaries around them in practice it would be a fixed window of words ten or so at a time and you assume that words that appear close together probably have something in common with one another so a word that appears in the context of this moving window is more likely to be somehow related to other words in this window than not

so as an example we see the word rendezvous appears in a couple of windows in edwin eugene aldrin jr’s thesis line of sight guidance techniques for manned orbital rendezvous and if we look closely we note that the word trajectories also happens to appear in both of those contexts it appears right next to the word in both cases but that’s not important as long as it appears somewhere in the window we count it and so we conclude that maybe rendezvous and trajectories have something in common and then as we continue reading our documents through multiple
passes we might note that the word trajectories appears without the word rendezvous in some other contexts and if we look at what words it co-occurs with we see that sometimes it co-occurs with the word intercept so we’re like huh maybe they have something in common so we conclude that the word rendezvous is somehow closer to the word trajectories than it is to the average word and the word trajectories is somehow closer to the word intercept than it is to the average word and what that means further is that maybe the word intercept is a bit closer to the word rendezvous than it is to an average word sort of transitively now a couple things that are important to know here does our neural net have any idea what these words mean no it doesn’t have a dictionary it doesn’t have a brain i did not teach it what these words mean words mean is it constructing subject labels for clusters of words that it puts together absolutely not it only knows that some words are closer and some words are farther um but this turns out to be pretty useful it is for instance robust around things that a simple word count would not be because it has a sense of synonymy right it knows that two words that appear in the same context frequently such as rendezvous and intercept which appear around a lot of the same words um might be synonyms and the sense of might be close together so it’s it’s able to avoid some problems that naive word counts fall into and if we do this process a whole lot of times um on maybe a whole lot of documents and if we include a label for the document in that context so our neural net can also learn which document labels um partake more of which contexts and others which ones are closer together or further apart then eventually the algorithm will learn oh some are closer together and some are farther apart and here they are all depicted now this is a slight over simplification because the algorithm actually is uh i believe 300 dimensions by default and i felt like that 
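the moving-window co-occurrence counting described above can be sketched in a few lines of python. this is a toy illustration with a made-up snippet of text, not the actual neural-net training; in practice you'd use a library such as gensim's Doc2Vec, which turns these co-occurrence signals into dense vectors:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tokens, window=10):
    """count how often each unordered pair of words shares a sliding window."""
    counts = Counter()
    for i in range(len(tokens)):
        span = tokens[i : i + window]
        # every unordered pair inside the window co-occurs once;
        # overlapping windows double-count, which is fine for a sketch
        for pair in combinations(sorted(set(span)), 2):
            counts[pair] += 1
    return counts

# a tiny invented "corpus" echoing the talk's example; real theses are far longer
tokens = ("line of sight guidance techniques for manned orbital rendezvous "
          "intercept trajectories for orbital rendezvous missions "
          "optimal intercept trajectories").split()

counts = cooccurrence_counts(tokens, window=10)
# rendezvous and trajectories land in shared windows, so the pair gets counted,
# and so do intercept and trajectories, giving the transitive hint from the talk
print(counts[("rendezvous", "trajectories")], counts[("intercept", "trajectories")])
```

counts like these are only the raw training signal; the real model learns a vector per word and per document from them.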
so i've projected it down to two dimensions, and that means that there are some things that are close together here that are not actually close together in the original, in the same way that two people depicted in a photograph who are standing quite far apart in three-dimensional space might nonetheless have their pixels be close together in the two-dimensional photograph. but by and large, if you see a cluster of things that are the same color, that's probably a real phenomenon: they're probably close together. all of these dots, and i'll show you how to explore them on your own in a moment, are theses, and each one has some associated metadata, such as the author, the title, the year, and the department. so let's zoom in on a couple of areas in this graph to see what kind of stories we can uncover. for example, at the very bottom of this graph there's biology, which is this tomato red area kind of on the outside, and then there's an orange cluster at the bottom but on the inside where, if you hover over those theses, you'll see that they're all chemistry theses. so that's cool: i didn't tell it at all about departments when i did the training, and for the most part it figured out that biology theses should be together and chemistry theses should be together. i did use a bit of knowledge about departments afterward, to select which neural nets had done the best job, because i trained a bunch of them with different parameters. so it's not entirely surprising that they cluster, but i didn't tell it how to do that: it did that itself, and i just picked the best results. we see biology and chemistry are close together, which is great, because biochemistry should in fact sit on the border between them. but you may be able to see, at the very bottom here, there's also a clump of sort of a buttery yellow color that isn't biology or chemistry. that color turns out to be electrical engineering and computer science.
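as an aside before we dig into that yellow cluster: the photograph analogy from a moment ago, where the high-dimensional doc2vec space gets squashed down to two dimensions, can be made concrete with a tiny sketch. the coordinates here are made up purely for illustration:

```python
import math

def dist(p, q):
    """euclidean distance between two points of any (equal) dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# two "people" in 3-d space: nearly the same spot in the camera frame (x, y),
# but one stands much farther from the camera (z)
front = (0.0, 0.0, 0.0)
back = (0.1, 0.1, 9.0)

d3 = dist(front, back)           # far apart in 3-d space
d2 = dist(front[:2], back[:2])   # nearly touching in the 2-d "photograph"
print(d3, d2)
```

the same distortion can happen when a layout algorithm projects document vectors down for plotting, which is why a few dots that look adjacent on the map are not truly neighbors in the full space.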
now, there are of course many, many electrical engineering and computer science theses at mit; this is only a tiny subset of the whole, and you'll see a really large buttery yellow cluster kind of halfway up on the outside next time we see the full picture. so this is just a few of them, and what are they doing down here in biochemistry land? i looked at a bunch of them and their titles, and i realized they all had titles along the lines of computational regulatory genomics: motifs, networks and dynamics. which is to say, all of these yellow dots, the ones i noticed anyway, are applying computational methods to biological subject area problems. so while they're awarded by the eecs department, they really belong closely with biology. this was one of my favorite things about doing this work, because i wanted to know: could i find interdisciplinary connections between theses? the closest thing i had to subject access was the department, but that's actually really terrible for subject access, because there are things in the same department that are really quite conceptually dissimilar. there's an end of electrical engineering that is basically mechanical engineering, and there's an end of computer science that's basically pure math, but these are the same department at mit. the flip side of that is, of course, that there are things in different departments, such as some of those cs and math theses, that really belong quite close together conceptually. so i was extremely pleased that doc2vec was able to find that. i promised you that you would be able to play along, so here's the url; i'll put this in slack as well. you can pull this up at bit.ly/3vsoyWq (that's 3, lowercase vsoy, capital w, lowercase q) and play along. so let's look at the big map again, and this time let's zoom in on the very top of that big island, where you see there's a big cluster of another sort of bright red color. it's not biology; there
are more departments than colors, so there's some reuse. just south of that big red cluster there's a paler green cluster. this is unhelpful, color-wise, but i didn't really know how to control the colors when i did the first pass of this, so i'm sorry if it's hard to make out the differences; i will narrate them as best i can. so what are these colors representing? well, the red one is architecture, the big green one is urban studies and planning, and there's a little bit of orange, quite hard to see, sort of tucked in just under architecture toward the outside of the slide, and that turns out to be media arts and sciences. this is interesting because, at mit, those three departments constitute the school of architecture and planning, and that's something else that i did not tell my algorithm about: it reverse-engineered this fact about organizational structure. you can see there's another couple of colors sprinkled in throughout here, and for the most part those turn out to be things that also properly belong within the school of architecture and planning. for instance, there's the center for real estate development; it's not a department, there are only three departments in this school, but there is also the center. and there's the department of city and regional planning, which is what urban studies and planning was named in the 40s, 50s, and 60s. for me, one of the big joys of doing this layout was that i was able to learn about the history of the institution, and see departments that i'd never heard of because they haven't existed in decades, and how they connect to the intellectual heritage as a whole. again, lots of stories here that i've cut in the interest of time, but i am happy to babble about them in slack all you want if you are interested in learning more. so let's look at the big island again, because another thing that jumped out to me very fast, and probably to you as well, is that there's a secondary island that looks kind of like a manta ray
swimming as fast as possible away from the main island. so let's look at that, because i really wondered what it was, and i especially wondered because it looks like static: there are all these dots of different colors, so it's not a cluster from one department. i wondered, is there some sort of interdisciplinary concept that people approach from lots of different departments, so they're all hanging out together? but as i looked at the titles, i couldn't really see a pattern. i mean, how best to redevelop vacant big box retail property in texas and a system for the interactive classification of knowledge really just don't have anything in common with each other. eventually i thought of looking at the original thesis files in dspace, and i realized most of these are older theses. in many cases the paper is thin and has grayed, or text on the reverse side of the paper has bled through to the front; it's really janky typewriter output, image files rather than clean scanned text. in other words, this right here is bad ocr island. sad face. i have a lot of feelings about this, but you probably have the same feelings, and in the interest of time i have to cut all of the feelings, so let's talk about that in slack and wish that we could talk about it in hotels afterward. i promised you some other ways to navigate this collection in my talk description, so let's see what else we can do when you have a concept of which theses are closer to which others. all of this you can play along with at hamlet.andromedayelton.com; i'll stick that in the slack as well. when you go to hamlet, the first thing you see is a recommendation engine, and if you click on that you'll have the opportunity to search by either author or title for theses of interest. so let's search for the same thesis we were looking at in doc2vec before, where, despite the name edwin eugene being used on the title page of the thesis, the way he is written in the metadata is the somewhat more familiar buzz aldrin. as we go to his thesis, we see the title and a little link to be able to read it, but we also see a list of the theses that occur closest to it in the doc2vec space. those include things like system study of propulsion technologies for orbit and attitude control of microspacecraft, and navigation of a manned satellite supply vehicle back to earth. if you scroll down the page farther, you'd see things like lunar descent using sequential engine shutdown and star occultation measurements as an aid to navigation in cislunar space. i am not an expert on aeronautical and astronautical engineering, but i can tell that all of these are in fact about the guidance of spacecraft, and if you were interested in guidance techniques for orbital rendezvous, you might well be interested in these other theses about similar topics. what else do we have in hamlet? well, there's a thing labeled your literature review buddy that says: upload a text or docx file and find out what works have been cited by conceptually similar theses. that's right, we do your lit review for you. it turns out that once i have this map of what's closer to what, i can also allow people to upload their own documents, for instance a chapter of a work in progress, and hamlet will figure out which works that it knows about are closest to it. that's the oracle you may have noticed on an earlier slide; it'll tell you, oh, you're writing this, you might be interested in reading these. but once i've got that, why not try to strip out the bibliographic data from those files and say: hey, people who wrote similar works have cited this; maybe you are also interested in these sources.
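under the hood, "closest in the doc2vec space" is a nearest-neighbor lookup over vectors. here's a minimal sketch using cosine similarity; the titles and 3-d vectors are invented stand-ins for the real 300-dimensional ones, and hamlet's actual implementation may differ:

```python
import math

def cosine(u, v):
    """cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, vectors, k=3):
    """titles of the k theses whose vectors are most similar to the query's."""
    scores = [(cosine(vectors[query], vec), title)
              for title, vec in vectors.items() if title != query]
    return [title for _, title in sorted(scores, reverse=True)[:k]]

# invented 3-d stand-ins for the real 300-dimensional doc2vec vectors
vectors = {
    "orbital rendezvous guidance": (0.9, 0.1, 0.0),
    "microspacecraft propulsion": (0.8, 0.2, 0.1),
    "lunar descent engine shutdown": (0.7, 0.3, 0.0),
    "big box retail redevelopment": (0.0, 0.1, 0.9),
}

# the space-guidance theses rank ahead of the retail one
print(nearest("orbital rendezvous guidance", vectors, k=2))
```

uploading a new document, as the lit review buddy does, is the same lookup after first inferring a vector for the uploaded text.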
so what i did with the lit review buddy was upload a 1960-something paper by henriette avram, the mother of marc, that was about the early development of marc records and using cobol to process them. i was trying to think of a paper that was relevant to libraries but also likely to have related objects well represented in the mit collection. when i did that, it found a couple of theses that had to do with database formats and the foundations of computer science, so i was pretty pleased about that. and here, for example, is what it could find that looked like bibliographic citations in the first of those theses; it'll give you all of the ones it finds. we're lucky with henriette avram that it actually did a pretty good job: these do in fact look like citations. often the lit review buddy is not quite as helpful, because sadly bibliographies do not come in a really standard structured format, and it's hard to parse them out of free text. lots of feelings about that as well. but as you can see, there are a bunch of interfaces you can put on top of a concept of what is conceptually closer to what. there's some more stuff on hamlet i haven't had time for, and a bunch of other ideas that i know could be implemented that i would love to babble about with you. so let's imagine those: imagine all of the ways you could explore a collection, even if it is impossible to assign metadata to it, simply because you are able to deduce which items are, in a sense, closer to which others. thank you. thank you, andromeda, that was a really fascinating talk. we're running low on time, but i think she's answered all of the questions in the q&a section. i see one, though: are there dreams or nightmares about getting these similarity results into the official discovery system
for your etds? what would that look like? yeah, so i was just in the process of responding to that question, and i would love to hear everyone else's dreams or nightmares about this. i don't really have access to that part of the world right now, but when i did this i was definitely thinking about my frustration that the etd system didn't allow me to explore in a way that was discovery oriented, and wondering what else could be in front of this system that might be more fun and useful. great comment; i'm looking forward to seeing what people have written in slack and also in the q&a sections and chat. so it's time to wrap up this session. if you have further questions, please again use the networking channels that are available to us. i have three announcements for you. the talks might be over, but this is where the fun just begins: if you haven't already done so, check out the virtual meets thread on the whova community board for diy events posted this evening, or for more information about the upcoming social events this week. we also have virtual sponsor booths, so go to the sponsor tab to see those; that's listed on the left side of the screen in your main navigation. the sponsors will have resources available for download as well as live demo events scheduled. lastly, if you have an idea, a project, or a practice that you'd like to share with this community, please sign up for a quick 5-minute lightning talk. we're still accepting them, and we'll have lightning talk sessions once a day for the next three days. use the links for lightning talks and the breakout room signups; these links will be shared with you in the code4libcon slack channel. again, that's hashtag code4libcon, c-o-n. thank you very much, everyone.