Canning the can’t
As a company specialized in natural language understanding, we rely on word embeddings as one of the building blocks of our technology. Our NLU models need to be capable of correctly ‘understanding’ what is written or said. One way of doing this is by using an intent classification model: when a user inputs a sentence, the model predicts an intent. Accurate intent classification is thus crucial for the interaction to flow smoothly. Although our models are the most accurate in the world for Dutch and French, sometimes the classification goes wrong, and occasionally this can result in funny conversations. Very recently, a chatbot built on our NLU models failed to recognize a user’s witty sarcasm, and responded in the following passive-aggressive manner:
User: Thanks for not answering, genius. Bot: You’re welcome!
The sentence was misclassified as a “thank you” intent, to which the bot very politely gave a correct response. A lot of the misclassifications we have identified can be retraced to the word embeddings we make use of and the way they are created.
There are several ways of training word embeddings, but what most of them have in common is that the whole conception of word embeddings is based on the idea that lexical semantics is distributional: words with similar meanings occur in similar contexts, and similar contexts contain similar words.
“You shall know a word by the company it keeps!”
— John Rupert Firth – A synopsis of linguistic theory
But is that really enough? Can you get the full meaning of a word by just looking at its context? What does context mean? And where does it go wrong? By sharing these questions here I hope the marvelous complexity of human language can provide you with the same sense of astonishment as it gives me.
Putting the context into the word
Simply put, the classic word2vec word embeddings represent words as the average of the contexts they can appear in. The ‘context’ a word occurs in can be described at a variety of levels. You can speak of the syntactic context of a word. For instance, some words have a different part of speech, and thus also a different meaning, depending on the syntactic context they occur in.
I work very hard because I love my work.
The first occurrence of work is a verb and designates an action. The second occurrence, on the other hand, is a noun and designates an activity. Although we can agree that both words are similar in several respects (they are homophones and homonyms and share the same root work), they are dissimilar in meaning and are two different words. Nevertheless, they will have one single vector representation that summarizes both of these words and the contexts they can appear in.
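To make the "one single vector" point concrete, here is a minimal sketch. It uses plain co-occurrence counts as a stand-in for a trained embedding (this is not how word2vec itself is computed), and the function name `context_counts` and the window size are assumptions made for illustration. The verb and the noun share one entry, so their contexts are simply pooled.

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=1):
    """Count, for every word form, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[token][tokens[j]] += 1
    return counts

# The example sentence from above, split into its two clauses.
corpus = [
    "i work very hard",  # work as a verb
    "i love my work",    # work as a noun
]
counts = context_counts(corpus)

# A single entry for the form "work" mixes verb contexts (i, very)
# with noun contexts (my); nothing records which use produced which.
print(counts["work"])
```

Because the lookup key is the written form, there is no place in this representation to keep the verb and the noun apart.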
Let’s look at another example:
There is coffee on the kitchen floor. I would like a coffee, please.
Here, both occurrences of coffee are nouns, but their meaning is different due to the context in which they occur. In the first sentence, coffee designates a substance (it is a mass noun), while in the second sentence, coffee designates a cup of black gold. How do we know? The syntactic context. The presence of the indefinite determiner ‘a’ indicates that this occurrence of coffee is countable. Hence we automatically infer that there should be a cup involved. However, you cannot do this trick with all words. Sand will never be countable, no matter which determiner you put in front of it. So yes, semantics is distributional, in the sense that the meaning of words can depend on the context they occur in, and that some words cannot occur in certain settings because their meaning simply doesn’t allow for it.
Also, the semantic context can ‘change’ the meaning of a word. Take a look at the following example.
The bass played the bass.
Both instances of bass are nouns, the same determiner precedes both, but they obviously have a different meaning. We are dealing with a fish with musical aspirations here. How do we know? Well, only basses can be played, while basses cannot. Observe that although the syntactic context (more specifically the word order and subject-object relations) helps us to interpret each occurrence of bass, it is mainly the semantics of played interacting with the semantics of the basses that helps us disambiguate the meaning. In the example below, we switch around the subject-object relation and the word order, but we still get the same two interpretations of bass thanks to the presence of play:
The bass was played by the bass. (wait… maybe basses can be played after all… The drama of being a fish!)
Finally, we have pragmatic context. Pragmatics refers to language use, and a lot of human nature comes into play here. Humans use language for a variety of purposes, not only for transmitting ideas and information, but also to get other people to do things. Humans use different variants of a language to show their inclusion in social groups or to comply with particular social rules. Most relevant for our discussion here is the role of register. In different social contexts, we tend to use different language. The differences can be in pronunciation (I dunno vs. I don’t know) but also in the vocabulary we choose to use. We can ask a friend to grab us a coffee (in a cup), but we are likely to ask a colleague to bring us a coffee, please. Same syntax, same semantics, even same intention (I am too lazy to get a coffee, so I want you to do it without explicitly telling you what to do, because I value our relationship, and the social rules we live by prescribe that we be polite and hence use questions instead of imperatives to maintain good relations). Different contexts, though. Not linguistic, but pragmatic.
Can you can a can as a canner can can a can?
One of the reasons (there are many more, word embeddings are fantastic!) why word embeddings are so popular is because they allow us to calculate word similarity. As mentioned above, similar words can occur in similar contexts. Hence, given that word vectors represent the average context a word can occur in, we expect similar words to have similar word vectors. However, just like context, word similarity can refer to several concepts.
What exactly does it mean for words to be “similar”? Words can be similar in many respects: they can have the same spelling (address and address) or the same or a similar pronunciation (too and two). Words can be similar in that they have the same root: canner, can and can have the same root can, but can doesn’t, it has the other root “can”. Remember to breathe, and don’t get confused.
Words can also have the same part of speech: work and sing are similar in that they are both verbs, work and song are similar in that they are both nouns (and work and work are similar in that they have the same root “work”).
Finally, words can have the same or a similar meaning. But again, what does it mean to have a similar meaning? The meaning of two words can be similar in that they refer to entities of the same ontological class, like for instance apple and steak, which both refer to entities of the class of edible objects. Apple and pear have even more similar meanings, since they do not only both refer to entities in the class of edible objects, but also to entities in the class of fruits, and even more restrictively, the class of pome fruits. However, most people think of synonyms or near-synonyms when speaking of similar words: words that refer to the same entity or action or concept. Word embeddings don’t. They think of several types of similarity and mix them all.
Survival of the fittest
As I mentioned above, word vectors represent the average of the contexts they occur in. We feed a massive corpus to a smart algorithm, and we get a bunch of numerical representations of the words in the corpus. Similar vectors represent similar words, because they occur in similar contexts.
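"Similar vectors" is usually measured with cosine similarity. The sketch below shows the computation on three hand-made, three-dimensional toy vectors; the numbers are invented for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors, NOT real embeddings.
toy_vectors = {
    "apple": [0.9, 0.8, 0.1],
    "pear": [0.8, 0.9, 0.2],
    "steak": [0.7, 0.1, 0.6],
}

apple_pear = cosine_similarity(toy_vectors["apple"], toy_vectors["pear"])
apple_steak = cosine_similarity(toy_vectors["apple"], toy_vectors["steak"])

# With these toy numbers, apple ends up closer to pear than to steak.
print(round(apple_pear, 3), round(apple_steak, 3))
```

The score only says how closely two vectors point in the same direction. It does not say which kind of similarity (syntactic, semantic, or pragmatic) is responsible for that.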
Due to the complexity of the word-context interplay, vector similarity won’t always represent the type of similarity that we are after. For instance, one of the words with a vector most similar to the vector of can is cannot (check your favorite word vectors!). I think we can all agree that these two words do not exactly mean the same thing, so why are their vectors so similar? It is pretty simple actually: one aspect of the context both words can occur in has a much bigger weight than the other aspects, and that aspect determines the shape of their vectors. In this case the syntactic context wins. Both can and cannot can only occur in a very restricted set of syntactic contexts. Being auxiliary verbs, they can only occur in a position preceding other auxiliary verbs or lexical verbs, or at the beginning of a sentence in case the sentence is a question.
Additionally, their poor semantic content makes both can and cannot compatible with a vast variety of verbs and nouns, so the semantic context they can appear in is pretty much undefined. As a consequence, the vectors of these words will mainly represent their syntactic distribution, and from a word vector point of view, they will be very similar to words with the same syntactic distribution and poor semantic content, such as all other auxiliary verbs. The presence of the negation in “cannot” does not have sufficient weight to compensate for the syntactic similarity between can and cannot. Luckily there are ways to make sure that the quite relevant semantic difference is taken into account, for instance by making use of a hardcoded negation extractor, which identifies negative elements such as no, but also un- and anti-. Since we started making use of such a negation extractor, our bots have become less cocky.
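As a rough idea of what a hardcoded negation extractor might look like, here is a minimal sketch. It is not the extractor we use in production: the word list, the prefix list, the `vocabulary` argument, and the function name `extract_negations` are all assumptions made for illustration. A prefix only counts as a negation when the rest of the word is a known word, which keeps uncle and under from being flagged.

```python
import re

# Hypothetical, deliberately small lists; a real extractor would need far more.
NEGATION_WORDS = {"no", "not", "never", "cannot", "nobody", "nothing"}
NEGATION_PREFIXES = ("un", "anti")

def extract_negations(sentence, vocabulary):
    """Return the tokens of `sentence` that carry a negation.

    `vocabulary` is a set of known base words, used to check that what
    remains after stripping a prefix is itself a word.
    """
    tokens = re.findall(r"[a-z]+(?:[’'][a-z]+)?", sentence.lower())
    found = []
    for token in tokens:
        # Free-standing negations and contracted ones (can't, don't, ...).
        if token in NEGATION_WORDS or token.endswith(("n't", "n’t")):
            found.append(token)
            continue
        # Negative prefixes, accepted only in front of a known word.
        for prefix in NEGATION_PREFIXES:
            stem = token[len(prefix):]
            if token.startswith(prefix) and stem in vocabulary:
                found.append(token)
                break
    return found

vocabulary = {"happy", "social"}
print(extract_negations("Thanks for not answering, genius.", vocabulary))
```

The extracted negations can then be passed to the intent classifier as an extra signal next to the word vectors, so that the sarcastic “thanks” from the introduction no longer looks like a plain thank-you.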
The syntactic context determines the vector representation of all function words, such as determiners, numerals, prepositions, and pronouns. Their semantic content simply isn’t strong enough to show through in their vector representation, and their syntactic distribution is so restricted that the word vectors will basically represent the syntactic context they so frequently occur in.
Another case is when the pragmatic context has the most prominent weight. If you look up the words that are most similar to nope in the fastText pre-trained word vectors, you will notice that the top ten contains a lot of short words that are frequently used in an informal register, such as anyways, fwiw, yeah and hmmm. Again, these words have little semantic content. They are also syntactically quite independent (they can be ‘sentences’ on their own). This leaves room for pragmatics to determine their distribution, and this will be represented in their word vectors.
Does the semantic context have any influence at all? Of course it does! Especially with words that have a strong, specialized semantic content, a rather free syntactic distribution, and no pragmatic ties to any register. This is the case for a lot of nouns, verbs, and adjectives. The words most similar to furious include angry, infuriated, and enraged, which are synonyms. However, the more general the semantic content of a word, and thus the bigger the variety of semantic contexts it can occur in, the more ‘general’ the semantic representation in its word vector will be. Take for instance make, which has “create” in its top ten, but also give, bring and get, which, like make, are verbs that have multiple meanings, depending on the context they occur in (I made it! vs. I made an omelet). So apparently the average meaning (whatever that means) of make is similar to the average meaning of give. Very often, the semantic similarity represented by the word vectors will rather be class similarity. Apricot is similar to plums and pears because they appear in fruity contexts, and flu is similar to dengue and measles because they occur in feverish settings.
The reality-possibility discrepancy
So why do synonyms rarely pop up when looking for similar vectors? Just because two words can appear in the same context does not mean that they will. Synonyms often differ from each other in the sense that one of the alternatives is used very frequently with one of its other meanings. Think, for instance, of our beloved friend the bass. The most similar word is guitar, and there is no fish whatsoever in the top ten, probably because we don’t talk about the fish as often as we do about the instrument. Another reason is that words are rarely synonyms in all contexts: dialect and variant can be used interchangeably when talking linguistics, but that is the only context in which that is possible. Variant can occur in so many other contexts, and these contexts will pull the word vector in a certain direction, away from dialect. Synonyms are also not always synonyms in all variants of a language. Verlof (holiday or leave) is a synonym of vakantie (holiday) in Flemish Dutch, but not in Dutch from the Netherlands. Finally, synonyms can be preferred in certain registers. Bro is a synonym of friend, but not when you are talking to your boss.
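The pull of the more frequent meaning can be illustrated with a toy calculation. The sketch below assumes, as a simplification, that a word’s vector behaves like a frequency-weighted average of one vector per sense. All vectors and the 95-to-5 frequency split are invented for illustration; they are not measurements.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mix_senses(sense_vectors, frequencies):
    """Frequency-weighted average of per-sense vectors."""
    total = sum(frequencies)
    dims = len(sense_vectors[0])
    return [
        sum(f * vec[d] for vec, f in zip(sense_vectors, frequencies)) / total
        for d in range(dims)
    ]

# Invented 2-d toy space: first axis "music", second axis "water".
bass_instrument = [1.0, 0.0]
bass_fish = [0.0, 1.0]
guitar = [0.9, 0.1]
trout = [0.1, 0.9]

# Assume the instrument sense is seen 95 times for every 5 fish uses.
bass = mix_senses([bass_instrument, bass_fish], [95, 5])

print(cosine(bass, guitar) > cosine(bass, trout))
```

With such a lopsided split, the single vector for bass sits almost on top of the instrument sense, and the fish has next to no say in which neighbours turn up.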
It is thus clear that using word vectors to find synonyms isn’t a very good idea. There are some alternative approaches to building a good synonym suggester, though. For one, you can make use of a built-in thesaurus and predict which one of the candidate synonyms is the most likely to occur in the given context using word prediction techniques. In any case, a rule of thumb is that you need to take into account the specific context, because whether two words are synonyms or not often depends on the context they occur in.
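The thesaurus-plus-prediction idea can be sketched in a few lines. Here a tiny hand-written thesaurus and raw bigram counts stand in for a real thesaurus and a real word prediction model; `THESAURUS`, `train_bigrams`, `suggest_synonym`, and the four-sentence corpus are all hypothetical and only meant to show the shape of the approach.

```python
from collections import Counter

# Hypothetical thesaurus entry; a real system would use a full resource.
THESAURUS = {"friend": ["bro", "companion"]}

def train_bigrams(sentences):
    """Count adjacent word pairs; a crude stand-in for a word prediction model."""
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams

def suggest_synonym(word, left, right, bigrams):
    """Pick the thesaurus candidate that fits best between `left` and `right`.

    Returns None when the word has no candidates or none was ever seen
    in that context.
    """
    candidates = THESAURUS.get(word, [])

    def score(candidate):
        return bigrams[(left, candidate)] + bigrams[(candidate, right)]

    ranked = sorted(candidates, key=score, reverse=True)
    if ranked and score(ranked[0]) > 0:
        return ranked[0]
    return None

corpus = [
    "my bro is here",
    "hey bro what's up",
    "a dear companion of mine",
    "my dear companion arrived",
]
bigrams = train_bigrams(corpus)

print(suggest_synonym("friend", "hey", "what's", bigrams))
print(suggest_synonym("friend", "dear", "of", bigrams))
```

The same word gets a different suggestion depending on its neighbours, which is exactly the context sensitivity that a nearest-neighbour lookup on a single word vector cannot provide.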
An average conclusion
Word vectors represent the average context of a word – all aspects of the context: syntax, semantics, and pragmatics. If one aspect has more weight than another, this will be visible in the word vector representation. If a word occurs more frequently in one context than in the other, this will be visible in the word vector representation. If a word occurs more frequently with one of its meanings, this will be visible in the word vector representation. So if you want to use word vectors to calculate similarity, beware of the similarity complexity.
Although word vectors are handy in many respects and allowed us to make significant progress in NLU, they are based on a simplistic approach to language. There are so many subtleties in language that humans understand intuitively and juggle with without even thinking about it. We are capable of inferring much meaning that isn’t even linguistically expressed, just based on the context in which we use language, or our knowledge about and relation with the people we interact with, and so many other factors. So let’s keep up the good work and deal with the shortcomings of word vectors, because one day I want a pet robot.