Will it Unblend?
Oct 16, 2020 18:55 · 596 words · 3 minute read
Hi, I’m Yuval Pinter, presenting our paper titled “Will it Unblend?”. This is work with Cassandra Jacobs from the University of Wisconsin, and Jacob Eisenstein, my PhD advisor, now at Google Research. A few weeks ago, my seven-year-old daughter first heard someone mention the word “brunch”. She asked me what it meant, so instead of just telling her, I walked her through it: okay, that person mentioned that there’s going to be food at brunch.
00:27 - What sounds kind of like this word and also has food? So she said, “lunch”, and I said, “yeah, that’s right! What about the other part, the br-?”, and she ultimately figured out that it comes from “breakfast”, which led her to understand the meaning of the full word. Words like brunch are called “portmanteaux”, or “blends”, because they take two or more words and blend them together. In order to understand them, we make use of both the form, or what makes up the word, and the context, or how we hear it used. Our research asks, “Can machines do the same?” We were wondering about systems like BERT, a complex model that learns patterns about language from very large resources using vast computational power, and is quite successful at standard language understanding tasks. What happens when we challenge these systems with new blends? Our first problem is that BERT has already seen a lot of language during training, and we’re looking to surprise it, so we can’t use blends it may already have encountered.
01:26 - We want to find truly novel blends, so we enlisted the help of a Twitter bot that scrapes the New York Times website and tweets out every new word that appears there. Another bot links the context that the word appeared in. We manually combed over a year and a half’s worth of these words to find about 140 novel blends, like innoventor (innovate + inventor), and compared the ability of BERT to treat them as if they were the words forming them against its handling of a similar class, compounds: words that get joined together without losing letters in the middle, such as bodyhacking or humblebrag. We showed in a series of experiments that BERT really has a hard time with blends, in a way that’s traceable to how they’re formed, in contrast to compounds. So next, we tried to see how we can help systems approach novel blends.
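As a rough, hands-on illustration of what BERT is up against (this is not the paper’s experimental code, just a sketch using the Hugging Face transformers library and the bert-base-uncased model), you can look at how BERT’s WordPiece tokenizer splits novel blends and compounds before the model ever sees them:

```python
# Sketch: inspect how BERT's subword tokenizer breaks up novel words.
# The word list is illustrative, taken from the examples mentioned above.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["brunch", "innoventor", "bodyhacking", "humblebrag"]:
    print(f"{word:>12} -> {tokenizer.tokenize(word)}")
```

For a compound, the subword pieces have a reasonable chance of lining up with the original words; for a blend, the split has no particular reason to respect the br- / -unch boundary, which is one intuition for why blends are the harder case.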
02:17 - What if, like seven-year-olds, we first identified the parts of the blend that come from each of the original words, as in br- and -unch, and then tried to recover each original word, “breakfast” and “lunch”, from the parts we isolated? It turns out that both of these tasks are harder than they seem. Even the best system we tried, which has learned the ways letters in English join to form sequences based on gigabytes of text, mis-segments more than half of the time, and correctly segments the whole blend only about a quarter of the time. On the recovery task, which attempts to find the original words in a two-component blend, the best system out of the ones we tried, which is based on BERT, found the correct original word for both parts of the blend in less than thirty percent of the cases. We share our results, the data, and the code to motivate the NLP community to work on this and other challenging use cases for large models, where the problems of understanding human language are far from solved.
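To give a concrete feel for the recovery task, here is a hedged sketch of one very simple baseline idea (not the method evaluated in the paper): mask the blend in its context and ask BERT’s masked language model for probable vocabulary words that begin with one of the isolated segments, such as br-. The example sentence, model name, and helper function are all assumptions for illustration.

```python
# Sketch of a naive recovery baseline: score single-WordPiece vocabulary
# words that start with a given blend segment, using BERT's masked LM.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def recover_candidates(context: str, prefix: str, top_k: int = 5):
    """Rank vocabulary words starting with `prefix` by their
    probability at the [MASK] position in `context`."""
    inputs = tokenizer(context, return_tensors="pt")
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        probs = model(**inputs).logits[0, mask_index].softmax(dim=-1)
    candidates = [
        (token, probs[token_id].item())
        for token, token_id in tokenizer.get_vocab().items()
        if token.startswith(prefix)
    ]
    return sorted(candidates, key=lambda pair: -pair[1])[:top_k]

# e.g., the br- segment of "brunch", in an invented context sentence:
print(recover_candidates("We met for [MASK] on Sunday morning.", "br"))
```

A real system also has to handle original words that span several WordPieces and segments that come from the end of a word, like -unch, which is part of what makes the task genuinely hard.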