Mining ETDs for Trends in Graduate Research
Nov 13, 2020 03:40 · 4965 words · 24 minute read
Cliff Lynch: Yeah. Welcome, everybody. Thanks for joining us; we’ll get started in about two minutes. Cliff Lynch: Thank you for joining us today. And welcome; we’ll be getting started in a little over a minute. Cliff Lynch: Welcome, everybody; we’ll get started very shortly. Cliff Lynch: All right, I think it’s about time to get started. Let me welcome you all to the second day of the spring 20,
I’m sorry, of the fall 2020 CNI virtual member meeting. I’m delighted you joined us today. I’m Cliff Lynch. I’m the director of CNI, and I will be introducing this session very briefly. After we hear from our speaker, Diane Goldenberg-Hart from CNI will manage and moderate the question and answer session at the end. You have both a chat tool, which you should feel free to use to comment on the discussion as we go along, and also a Q&A tool, which I invite you to use to pose questions at any point, although we will get to all the questions at the end of the session. I’ll just also note that we do have closed captioning available, which you are welcome to turn on if you wish. And for those of you who didn’t hear the conversation between Bill and I at the beginning of the session, the session is being recorded, and the recording will be available afterward through our usual channels. And I think that’s all my announcements.
So let me move on and take us right to Bill Ingram from 04:01 - Cliff Lynch: From Virginia Tech. Bill is going to talk to us today about mining electronic theses and dissertations for identifying and understanding trends in graduate research. I’m familiar with a fair body of work that tries to mine scholarly published literature, journal literature, to identify emerging trends, but the ETD corpus is really quite a special corpus with some unique properties, and I think it offers some unique insights into things. So I’m going to be very interested to hear what Bill has to tell us about this. And with that, thank you for joining us today, Bill, and over to you.
Bill Ingram: Great, thank you. I hope everybody can hear me. And thank you for the warm welcome. So I’ll just jump right in. I am a librarian at Virginia Tech and a researcher, and my research explores the application of computational methods and techniques on library collections.
So, the idea of collections as data and using computational methods to mine them. I’m interested in machine learning, natural language processing, and the like. And what you’re looking at here is just a summary slide of the grant that I am working under. This is from IMLS. It’s funded for three years to explore all of these techniques against the corpus of ETDs. We’re particularly interested in ETDs because they are longer documents; they resemble books. And we’re focusing on three areas: information extraction, classification, and summarization, using machine learning and deep learning, and ultimately building better digital libraries by adding value through these services. Sorry, timer. This is the team that I’m working with.
So my co-PIs are Ed Fox and Jian Wu 06:35 - Bill Ingram: From computer science at Virginia Tech and Old Dominion, and then there are two graduate students that work with us full time. Bipasha works for me, and Muntabir works for Dr. Wu at Old Dominion. I also wanted to just list a few names of students, past and present, that have been working in our lab. Two of them just finished master’s degrees with theses on ETDs, and the rest of the folks listed down there are also working on ETD-related research. So those of you who were at CNI last year might remember that I gave a talk about this project, about bringing computational access to book-length documents. After the talk, I was approached by the chief strategy officer at ProQuest. I had mentioned in the talk that we were interested in the ProQuest subject categories for doing automatic classification, and so we had a nice conversation.
And that led to 07:47 - Bill Ingram: Me meeting the team that is responsible for the new TDM Studio, the text and data mining studio at ProQuest. And so that led to conversations with, among others, the two folks that I’ve got on this slide, John Dillon and Austin McLean. I’d actually known Austin before, but this opened up a collaboration, which led to a pilot of their new software. So this talk is about that pilot. It’s about the data that we were using. It’s about the study that we did. So I’ll try to address all these things. I want to give a quick overview of the TDM Studio.
I’ll introduce the research question that we’re trying to answer, talk about the data and the methodology, and then share some results and hopefully have some time at the end for discussion. Before we do that, though, I just want to have a disclaimer here that this isn’t a product endorsement for the studio. These are just my opinions. This is my research, and they don’t reflect the views of ProQuest, or of IMLS, or Virginia Tech Libraries, or anyone else. That said, we did have an enjoyable and very positive experience working with ProQuest and with the TDM Studio. This is sort of a high-level overview of what the studio is. So there’s an interface where you select content, and for us
this was obviously ETDs, but any content that, I believe, your library 09:43 - Bill Ingram: Subscribes to is available, including a lot of newspapers, which I thought was very interesting, and that leads right up to the present; I think you could even bring in yesterday’s news, perhaps even today’s news, into what they’re calling the workbench, which is a Jupyter interface for interacting with the data, either with Python or with R, and then exporting your results and graphing them, etc. And so I’ll be showing you actually lots of graphs here at the end. I should mention that, although I’m not an expert on the studio, I’ll try to field any questions, but there’s a slide that I’ll put at the end.
10:29 - Bill Ingram: If you want to get in touch with the ProQuest folks, because they’re the experts on this. Obviously, in the beginning you log in, and once you have an account, you are given this web interface for selecting the data that you want to bring into your studio. So this should be fairly familiar for anyone who’s worked with digital libraries. You select either publication titles or a particular database that you want to bring into your instance. I think what’s interesting here is that the content rights have all been cleared for TDM. So I think that’s actually kind of a big deal, that you can bring in, like I said, newspaper articles, The New York Times. It’s not so much of an issue with ETDs, since most of them are openly licensed, but for the content that other folks might be interested in.
11:29 - Bill Ingram: It’s all cleared for doing text and data mining, and you can build data sets of up to 2 million documents, which we almost did. So, you know, it’s sort of a faceted search and browse. We chose the ProQuest Dissertations and Theses Global collection and drilled down into that. As you see, there’s the 2 million document limit, so we had to winnow this down a bit in order to meet that limit, but we finally did. And so once you have your collection set up, you set off the movement of the files. This takes a fair amount of time for the files to move over. But once you have them moved over, then you can interact with the data using a standard Jupyter notebook, which folks that are doing data science are really familiar with. So this is just a screenshot of kind of what that looks like. So the question that we wanted to do.
And I should say that this pilot was three months, and so 12:43 - Bill Ingram: We didn’t have a lot of time, and we wanted to do a study; we wanted to see what we could do with the data. This is a bit different from my normal research in that it’s more of a text mining, less of a machine learning classification type of task that we set for ourselves, but I still think it’s really interesting. So what do we want to do? We want to see what we can learn through text mining of the ETD corpus about how graduate research topics have evolved over time. And especially interesting is the interdisciplinarity between or among different majors or departments in graduate research.
This is particularly interesting for me because early in my career in libraries, I was working on putting together the systems to collect and display ETDs. And so now we’ve been gathering ETDs for almost 20 years, and we have this great corpus, and it is really a reflection of research that’s happening across the country or across the world. And it’s just really exciting to be on the other side, to mine it and make use of it, especially in this way. So we were able to move over roughly 1.3 million ETDs. Now, this is from the year 2000 to 2018. We were talking earlier and I can’t remember why we stopped at 2018, but I think it could have been that that was the limit for when full collections were available. But we started around 2000 because we wanted to get born-digital documents.
I didn’t want to have the added 14:52 - Bill Ingram: Complication of having to use OCR text. But still, this is a lot of dissertations. What you get in the studio is full-text XML and metadata. We wanted to get department metadata; that was important for the study. So we ended up having to winnow this down even further, down to just 600,000 documents, so that we would have the department metadata. And so this is the top 20 departments that we harvested and are working with, minus the one there in the middle, “department not provided.”
15:36 - Bill Ingram: So with this data set we extracted title, abstract, department, and the year of publication. We organized this into batches by years and by major. And the intuition behind this is that the top terms would be present in the title and abstract, which would indicate the research topic of the paper. A bit more on that here in a moment. The sources of data ended up being, I think, over 1,000, but this is the top 20 institutions. So this was completely random. We didn’t really look at what the sources were; we were more concerned with getting the numbers. But this is just how the spread looks, and then a little bit more detail here. As you can see in the top graph up there, the green one, there’s a long tail of universities that the data came from, but the majority are from these top 20 that are shown here in pink.
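The batching by year range and by department described in this part of the talk could be sketched roughly as follows. This is a minimal illustration, not the team’s actual code: the record field names, the 2001 start year, and the five-year width are all assumptions (the talk mentions a 2001 to 2005 batch but does not give the exact ranges), and the sample records are made up.

```python
from collections import defaultdict


def year_bucket(year, start=2001, width=5):
    """Return the (first, last) years of the fixed-width range containing `year`."""
    first = start + ((year - start) // width) * width
    return (first, first + width - 1)


def batch_records(records, start=2001, width=5):
    """Group title-plus-abstract text by (department, year range)."""
    batches = defaultdict(list)
    for rec in records:
        key = (rec["department"], year_bucket(rec["year"], start, width))
        # Only title and abstract are kept, as in the talk.
        batches[key].append(f'{rec["title"]}. {rec["abstract"]}')
    return dict(batches)


# Made-up records for illustration only; field names are hypothetical.
records = [
    {"title": "Routing in sensor networks", "abstract": "We study routing.",
     "department": "Computer Science", "year": 2003},
    {"title": "Digital library interfaces", "abstract": "We evaluate interfaces.",
     "department": "Computer Science", "year": 2005},
    {"title": "Gene expression in T cells", "abstract": "We measure expression.",
     "department": "Biology", "year": 2007},
]

batches = batch_records(records)
for (department, years), texts in sorted(batches.items()):
    print(department, years, len(texts))
```

Each batch then holds the text that a later step would turn into topics for one department over one span of years.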
16:52 - Bill Ingram: So the methodology that we used: again, what we’re concerned with is trying to find out what the research topic of this paper is. And so to do that, our first attempt was to use TF-IDF, which is term frequency-inverse document frequency, a measure to try to collect the most important two- or three-word phrases across the corpus, so within a major. And initially this looked very promising. You can see from the first two columns, this is computer science. I don’t know if you’re seeing my mouse moving, but this is computer science here: user interfaces, digital libraries, etc. That seems good. These look like research topics in computer science. Over here in biology, these also seem to be working well. However, the problem was.
We’re also turning up a lot of other, mainly irrelevant phrases, such as “results show” and “future work,” “high level,” these kinds of things that just made way too much noise, and 18:02 - Bill Ingram: So that wasn’t working. What we ended up doing is using this tool called Wikifier that was developed by Dan Roth’s lab when he was at Illinois; he’s since moved to UPenn. What it does is take a stream of text, and it’s mainly for disambiguation, or named entity disambiguation. It runs the text through a search into Wikipedia and tries to return the entities that have matching Wikipedia articles. And so here you see there are the underlined terms. Those are actually links to the articles in Wikipedia, and the bolded terms, I believe, are terms that they had pulled out as being context in order to disambiguate what the entity is. So this is not our work. This is just a tool that we are using if you’re interested in that.
I suggest reading the paper that I have linked here below. Okay, so what we did was use the Wikifier to identify what the terms are, and then we were able to use this to rank them so that we could figure out what the topic of the paper was. So this is kind of a step-by-step here of what we did. The first was, for every document in the batch, to use Wikifier to wikify the text that we extracted, and again this text is just the abstract and title. And then ranking those terms to find out what the research topics for that department or major were. And then the way we rank those is by calculating the document frequency across different periods of time. So, what I mean is, how many documents contained that phrase.
20:18 - Bill Ingram: Over a certain batch of time, and we normalized against the number of documents so that, you know, if one document had just that phrase over and over and over again, it wouldn’t unfairly weight the results. And then plot the results on a graph, so that we can see what the highest document frequency terms ended up being for the department, and then compare these with other departments, and then finally plotting multiple departments, so that we could see the shared topics that they have across them. So, I made a lot of these. And what I am going to focus on is the intersection of computer science and biology, which I just think is interesting, myself, and I know the most about computer science out of any of these, not so much about biology, but let’s explore the data. So I’m going to show a series of these sort of bubble graphs, and I made these with Gephi on the data.
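The ranking steps just described (count how many documents in a batch mention each topic, normalize by the number of documents, then look for topics shared across departments) can be sketched as below. This is a hedged illustration under stated assumptions, not the project’s implementation: the input is assumed to be a list of already-extracted topic labels per document (for example, Wikipedia titles returned by a wikification step, which is not shown), the function names are hypothetical, and scoring a shared topic by the smaller of its two frequencies is an assumption, since the talk does not say how shared topics were weighted.

```python
from collections import Counter


def document_frequency(batch_topics):
    """Fraction of documents in a batch that mention each topic.

    `batch_topics` is a list with one list of topic labels per document.
    """
    n_docs = len(batch_topics)
    if n_docs == 0:
        return {}
    counts = Counter()
    for topics in batch_topics:
        # set() counts a topic once per document, however often it repeats.
        counts.update(set(topics))
    return {topic: count / n_docs for topic, count in counts.items()}


def top_topics(batch_topics, k=10):
    """The k topics with the highest document frequency (ties broken by name)."""
    df = document_frequency(batch_topics)
    return sorted(df.items(), key=lambda item: (-item[1], item[0]))[:k]


def shared_topics(df_a, df_b):
    """Topics present in both departments, scored by the smaller frequency.

    The min() scoring is an assumption made for this sketch.
    """
    return {t: min(df_a[t], df_b[t]) for t in df_a.keys() & df_b.keys()}


# Made-up topic lists for one time batch in two departments.
cs_batch = [
    ["Machine learning", "Machine learning", "Gene expression"],
    ["Machine learning", "Sensor network"],
    ["Gene expression"],
]
bio_batch = [
    ["Gene expression", "T cell"],
    ["Gene expression"],
    ["Climate change"],
    ["T cell"],
]

cs_df = document_frequency(cs_batch)
bio_df = document_frequency(bio_batch)
print(top_topics(cs_batch, k=2))
print(shared_topics(cs_df, bio_df))
```

Counting each topic once per document is what keeps a single abstract that repeats one phrase many times from dominating the ranking. The resulting scores could then be exported as node weights for a bubble graph.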
So this is computer science from, let’s see, 2001 to 2005, and these are the major topics here. For various reasons, and we could talk about that if we have time, there’s some sparsity here in these early years. It doesn’t really get interesting until 21:58 - Bill Ingram: Until here. So now you can really start to see these research topics emerging. The size of each circle represents the ranking, the weight of the topic. And so you can see here, now that we’re nearing the end of the first decade of the 2000s, these are the topics that are emerging in computer science. And so it’s interesting. Machine learning, of course, very big. There’s these sensor networks and wireless sensor networks. A lot of the sensor network stuff you don’t really see a lot anymore, so that was definitely very hot in this area during this time period.
22:46 - Bill Ingram: See social network starting to emerge, social networks. That wasn’t there really in the data at all pre-2006. As we move into the 2010s, you see machine learning continuing to have a very strong presence here. The sensor networks are starting to get smaller. Data mining starting to get bigger, social networking still bigger, social networks. I will say that I put this together pretty quickly, and I didn’t notice until it was too late that the plural version of a lot of these things is represented. So
So 23:30 - Bill Ingram: If you just imagine that these like social networking social networks, just imagine that they’re bigger and together. Bill Ingram: But you can see how how this is trending. Nonetheless, so finally here is here’s the most recent batch again machine learning where you’re starting to see the deep learning and neural nets really starting to rise. Bill Ingram: As, as well as big data. And there’s something I yeah let me get back one. If you notice right here really small. There’s big data right in the middle. That’s a big it really didn’t take off until I don’t know 2015 and now suddenly big data is is and I think if we continue this into Bill Ingram: Into 2020 we would see that continue to grow. Bill Ingram: Okay, so that’s computer science wanted to do the same thing with bio Bill Ingram: This is particularly, I don’t know, sort of funny to me that again in these first few years there isn’t as much data so that the topics on emerging in the same way as they do later but Bill Ingram: It seems that the biologists were interested in prairie grass, for the most part, until now you’re starting to see what I more of what I expected with with T cells with gene expression genetic analysis.
25:01 - Bill Ingram: A lot of these words I don’t even know how to pronounce; I assume that these are genes that they are studying. This goes into, let me go back for a minute. One of the things that’s really been interesting in doing these is seeing how different topics emerge and are sort of growing. So on this slide, you don’t see anything about climate change. But it really emerges in the next one, here in this batch of four years. So now climate change is suddenly on the map; people are interested in studying that. And that surprised me.
I thought that that would have been 25:45 - Bill Ingram: Going on in the research much earlier. But it turns out it wasn’t. And again, nothing else too surprising in this. Finally, the last of the bio stuff. Gene expression of course just keeps showing up. T cells have just gotten bigger and bigger. And stem cells were another one that I saw sort of rise through the data. I have to hurry up. So here’s the intersection of CS and biology.
26:28 - Bill Ingram: The first batch of years really didn’t have many results at all. But you start to see them here with DNA sequencing, statistical analysis. Actually not very surprising. Huge gene expression. Again, climate change. And so keep in mind this is the intersection of CS papers and bio papers, so that means there were CS papers that were written about climate change, about gene expression, about DNA sequences. And then finally, here’s the most recent batch, where you see the
Where were you see the the 27:10 - Bill Ingram: Intersection of CS and via what what surprised me about this. And I don’t know if you share this surprise or maybe it was my naivete. Bill Ingram: I expected to see more of the computer science terms showing up in this intersection, but for the most part this intersection is all biology terms and Bill Ingram: In that that could just be and I’m reminded that computational biology is a subfield of computer science. It’s not a subfield of biology. So that, that alone could just explain it, but I thought that was fascinating that, you know, computer science is Bill Ingram: Becoming a lot of other other things. Bill Ingram: Just, just for reference. I wanted to show the intersection of econ and math.
28:03 - Bill Ingram: No real surprises here, that the big interdisciplinary topic between econ and math is game theory. So that’s what you see here, and then the most recent, again, Monte Carlo, Markov chains, etc. But you do see machine learning here, so we’ve got computer science in this area as well. So let’s see, my God. Right, so we don’t have very much time, but I wanted to open up for questions. Before I do, I just want to revisit the research question. So, what can we learn through text and data mining about the evolution of research topics? I think that we’ve shown that it is possible to determine the research focus of an ETD using methods from natural language processing, specifically using the Illinois Wikifier.
29:00 - Bill Ingram: For concept disambiguation, and then graphing the document frequency of these research topics allows us to visualize the relative importance of these topics within and across disciplines. So that’s it. I wanted to just thank ProQuest for allowing us to use the studio and to do this. And of course, thank you to IMLS for their continued support. And I have a slide here at the very end: if you want to get in touch with ProQuest about this product, you should try one of these options here. Okay, so if there are questions, I’m full screen, so I can’t see if there are questions, but
Diane Goldenberg-Hart (CNI): Great. Thank you, Bill. There actually are some questions. That was a really interesting talk, and I know people are curious to know more.
29:57 - Diane Goldenberg-Hart (CNI): We have first a comment from Rebecca Bryant and then a question. Rebecca says, “This is more of a comment than a question. I think that the Council of Graduate Schools would be very interested in hearing about your research, as you help inform how graduate education has changed in the past two decades.” And she goes on to ask: a number of institutions have now gone “no Quest,” meaning they’re no longer sending their data to ProQuest. Is your research here dependent upon the data set ProQuest maintains, i.e., institutions not sending copies or metadata to ProQuest are not included?
30:39 - Bill Ingram: In this particular experiment, they are not. Although, from what I know, and again, somebody from ProQuest would know more about this, because of the copyright clearance you can’t pull data out of the studio, obviously, but you can bring in your own data. And so we’ve actually, as part of the larger grant-funded research, been amassing quite a large corpus of ETDs for our own research, and this is all by harvesting from open repositories.
And so we’ve got about 500,000 of those. 31:18 - Bill Ingram: The opportunity with ProQuest, though, is to be able to really have the fire hose of ProQuest. I don’t know if it would have been possible or feasible to collect 1.3 million ETDs by crawling institutional repositories, so that was the advantage here.
Diane Goldenberg-Hart (CNI): Interesting. And that answers the question, but yeah. Well, that’s a really interesting question. Thanks, Rebecca, for bringing that up, and thank you for addressing it, Bill. Next we have a question from Michael Siegel, who asks: How are you handling multi-language issues, or is the focus mainly on English-language works? It looked like your corpus
was mostly US-based, Canadian, and British.
32:06 - Bill Ingram: Yeah, I mean, that’s why we made it easy for ourselves, although there is a woman in my lab who is doing her master’s work on Arabic ETDs and is doing automatic classification of them. It’s actually been a challenge to gather enough Arabic ETDs in order to have training data for her models. So if anybody knows a source of Arabic ETDs, please share that with me. But yeah, that’s an interesting topic as well.
Diane Goldenberg-Hart (CNI): Great, thank you. And thanks, Michael, for the question. Now we have a question from Cliff Lynch.
32:55 - Diane Goldenberg-Hart (CNI): Cliff says it looks like the time to produce a thesis is much longer than the time to get papers or conference papers published, so you’re going to be recognizing topic emergence more slowly in the ETD database than in literature analysis. How much more slowly? What’s the average time between selection of research topic and acceptance of thesis?
Bill Ingram: That’s interesting. I don’t know the answer to that, but I could think of a good study to do where you would compare the emergence of topics in the journal and conference literature versus the ETD literature. I think that would be interesting.
Diane Goldenberg-Hart (CNI): Indeed. And Rebecca Bryant is piggybacking on Cliff’s question.
33:43 - Diane Goldenberg-Hart (CNI): “I think it would be really interesting to combine this study with data from the Survey of Earned Doctorates.” So lots of fodder to follow on your project there. Well, I want to thank Bill so much for coming to CNI to present the results of his work. In fact, his project was really interesting, and we’ll look forward to hearing more about this. I also want to thank our attendees for joining us. I see that we are a little bit past time, so I’m going to go ahead and turn off the recording.
34:24 - Diane Goldenberg-Hart (CNI): On this session and just invite any attendees who are still around and have time, if they’d like to stay back and chat with Bill or ask questions, just raise your hand. I’ll be happy to unmute you. And we’ll have another session as part of the fall meeting here at two o’clock, “Summarizing Web Archives Through Storytelling with the Dark and Stormy Archives Project,” with Shawn Jones of Los Alamos National Laboratory. So we hope to see you there, or at another session from the conference. Be well, everyone. Take care. Bye-bye.
Bill Ingram: Thank you for attending.
34:58 - Diane Goldenberg-Hart (CNI): Thank you, Bill.