SCIP 2020 - Using STRUDEL for Semantic Concept-Feature Norms
Nov 23, 2020 17:00 · 994 words · 5 minute read
Hi everyone, and welcome to our talk today on creating concept-feature norms. My name is Erin Buchanan, and my collaborators on this project are Jonathan Korn and Mark Newman. I know the SCIP crowd appreciates large data sets, because computational research runs on corpora. These data can be used for hypothesis testing or for stimuli selection for experimental questions. The Open Subtitles Corpus is a recently updated, large collection of corpora that provides linguistic data for 50 or more languages.
00:33 - These data sets are a collection of movie subtitles, which should represent more naturalistic language. Previous research with the subtitle corpora has shown their usefulness and effectiveness in computational research. They are freely available to download from the Opus website. But specifically, I want to talk about how we might exploit these large data sets to understand semantic concept-feature relations. Traditionally, semantic feature production norms are created using the property listing task.
In this task, participants are asked to list 01:11 - the properties of a given concept. If shown dog, they may list an animal, barks, or has a tail. These norms require a lot of time, effort, and data processing to create and publish. However, this work is worthwhile because the norms are useful for many tasks, especially measures of semantic similarity. Therefore, we might be able to explore the use of computational models to create linguistic norms in many languages in a more cost-effective and time-effective way.
This session focuses on computation for the 01:48 - social good, and computational analysis would allow us to provide these data in many languages beyond English that do not have adequate data sets to answer questions about their own language, much less make cross-linguistic comparisons. STRUDEL, or structured dimension extraction and labeling, is a model proposed by Baroni and colleagues to extract concept-feature relations from large corpora. STRUDEL showed some promising results, such as providing relevant features for the concept book: reader, author, library, and chapter, as compared to the McRae norms. However, this model has not been used or explored much since publication. One key limitation of STRUDEL is its focus on English as the primary language, with extraction rules that are somewhat generic but may not cover the various forms of sentence structure in many languages.
Last, the 02:49 - code is not only difficult to find but also hard to navigate because of its use of command-line corpus software and Perl. The core concept of STRUDEL's extraction rules is a version of dependency parsing, which creates a relationship between a head word and a related token. The image on the right presents a simple sentence in which authors and write are related, and books and write are related. The advantage of dependency parsing is the availability of models and simplified processing in many languages. Specifically, we are using udpipe, an excellent R package with models trained on treebanks.
03:33 - Our process is to import the tokenized text from the Open Subtitles corpus and use udpipe to process each sentence for part-of-speech tags, lemmas, and dependency relations. From this, we extract all nouns, adjectives, and adverbs as our concepts and features. Only noun, adjective, and object modifiers are selected from the dependency relations. Last, we provide the frequency of all of these combinations in our final output for all to use. Here's an example of the output provided by udpipe, with a token, lemma, part of speech, and dependency relation.
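To make the extraction step concrete, here is a minimal Python sketch of the same idea. Note the actual pipeline uses the udpipe R package; the token table here is hand-written for the example sentence, and the exact set of kept dependency relations is an assumption on my part:

```python
from collections import Counter

# Hypothetical udpipe-style output for "Authors write several books."
# (the real pipeline parses the Open Subtitles corpus with the udpipe R package)
tokens = [
    {"token_id": 1, "lemma": "author",  "upos": "NOUN", "head_token_id": 2, "dep_rel": "nsubj"},
    {"token_id": 2, "lemma": "write",   "upos": "VERB", "head_token_id": 0, "dep_rel": "root"},
    {"token_id": 3, "lemma": "several", "upos": "ADJ",  "head_token_id": 4, "dep_rel": "amod"},
    {"token_id": 4, "lemma": "book",    "upos": "NOUN", "head_token_id": 2, "dep_rel": "obj"},
]

# Assumed set of kept relations: noun, adjective, and object modifiers,
# plus the subject relation so author-write is captured as in the example
KEEP = {"nmod", "amod", "obj", "nsubj"}

by_id = {t["token_id"]: t for t in tokens}
pairs = Counter()
for t in tokens:
    head = by_id.get(t["head_token_id"])  # head_token_id 0 means the root
    if head is not None and t["dep_rel"] in KEEP:
        pairs[(head["lemma"], t["lemma"])] += 1  # (head word, related word)

print(dict(pairs))
```

Summing these counters over every sentence in the corpus would yield the concept-feature frequency table described above.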
04:16 - We use the head token ID to assign that relationship, as write is the head word related to both author and books. Additionally, several and books will be related as an adjective modifier. The English corpus is approximately half a billion tokens, which creates 27 million different dependency relations for nouns, verbs, and adjectives. These relations account for 308 million instances, where the frequency of those instances ranges from 1 to nearly 400,000 for "last night". We will use a few words as an example of the results from this data, which will be compared to the recently published Buchanan feature norms. Two real questions arise with this data.
One: which way does the dependency 05:11 - make the most sense? Should the head word in the dependency be the concept and the related word the feature, or vice versa? Second, while we provide all concept-feature frequencies, this processing does create a fair amount of noise. What should the frequency cutoff be for usage? This graph, let me scroll down a little here, shows the cosine between the Buchanan norms and the English STRUDEL-processed norms. The x-axis shows the cutoff percentage for the STRUDEL norms, which have been normalized by the total number of occurrences in the data set. We can see that the likely cutoff is very small, as increasing values past .5% do not increase cosine values greatly. Second, we can see that the backward relations, from the dependent word to the head word as the concept, appear to work better for most words, with dry as a notable exception. The cosine values are low.
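The normalize-cutoff-compare step can be sketched as follows; the feature frequencies here are invented toy values for the concept dog, not numbers from the actual norms or our data set:

```python
import math

# Invented feature frequencies for the concept "dog":
# human-produced norms vs. raw STRUDEL-style dependency counts
norms   = {"animal": 20, "bark": 15, "tail": 10, "pet": 8}
strudel = {"bark": 4000, "animal": 2500, "tail": 900, "walk": 350, "tuesday": 12}

# Normalize the dependency counts by the total occurrences, then drop
# features below a proportion cutoff (0.5%, the rough elbow in the graph)
total = sum(strudel.values())
cutoff = 0.005
filtered = {f: c / total for f, c in strudel.items() if c / total >= cutoff}

def cosine(a, b):
    """Cosine similarity between two sparse feature vectors."""
    feats = set(a) | set(b)
    dot = sum(a.get(f, 0) * b.get(f, 0) for f in feats)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

print(sorted(filtered))  # low-frequency noise like "tuesday" is dropped
print(round(cosine(norms, filtered), 3))
```

Raising the cutoff much beyond this point mostly removes legitimate low-frequency features rather than noise, which is consistent with the flat cosine curve past .5%.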
06:15 - However, this preliminary comparison is on the final processed data set from the Buchanan norms, which excludes many small feature combinations. A final analysis will explore the original non-reduced data with more fine-grained tuning. Last, I want to note that this project is part of a larger mega-study that focuses on semantic priming, partnered with the Psychological Science Accelerator. The PSA is a global network of research labs that partner together to engage in worldwide research. The project's lofty goals include providing a large multilingual normed data set for computational analysis, code packages for accessing and interacting with the data, and more.
07:03 - We’re looking for collaborators, and if you’re interested in learning more, please contact me. Thank you for listening, and I’ll be excited to answer your questions now.