Natural language in 2020 at-a-glance:
The world of natural language AI has changed significantly in a relatively short space of time, driven by the shift to transformer architectures.
Beyond this shift, there appears to be a bifurcation in approach: left-to-right (auto-regressive) language modeling, exemplified by OpenAI’s GPT models, versus bi-directional modeling, exemplified by Google’s BERT family.
Which one you use depends on what needs to be prioritized. Left-to-right modeling is better at content, while bi-directional modeling is better at style (vocabulary, sentence length and internal consistency). Bi-directional modeling is a kind of “improv” version of natural language generation: it takes what it has and renders it in a consistent style, while left-to-right modeling can extend ideas and concepts further, but is less efficient.
Prioritizing style makes language sound more “human.” Google’s bi-directional models and their various derivatives are more consistently setting NLP benchmarks (this could be due to the availability and network effects of Google’s infrastructure). The bi-directional nature of the model allows for better performance on tasks such as finding answers in text, synonym matching and text editing.
However, OpenAI’s model is impressive in how it can generate realistic and coherent continuations of whatever topic interests the user. For this reason, the left-to-right language modeling approach is often considered the most interesting from an AGI perspective. The big trade-off is efficiency: left-to-right models need more trainable parameters per unit of downstream accuracy.
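To make the contrast concrete, here is a minimal sketch using the Hugging Face transformers library (the public bert-base-uncased and gpt2 checkpoints are chosen purely for illustration, not the specific models discussed above): a bi-directional model fills in a blank using context from both sides, while a left-to-right model keeps extending a prompt.

```python
# Minimal sketch using Hugging Face's `transformers` pipelines
# (bert-base-uncased and gpt2 are public checkpoints chosen for illustration).
from transformers import pipeline

# Bi-directional (masked) modeling: context from both sides, fill in a blank.
# A good fit for answer-finding, synonym matching and editing-style tasks.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# Left-to-right (autoregressive) modeling: only the left context is visible,
# and the model keeps extending it, which suits open-ended continuation.
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=10))
```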
Both of these models are based on transformer architectures. Transformers rapidly pushed out the LSTM and RNN models previously used to compute “attention” (the calculation that underlies how networks determine context) because transformers are more scalable, more parallelizable and relatively simple compared with large-scale LSTMs. There are now transformer models with billions of parameters. So far, we don’t appear to have hit diminishing returns and bigger is still better (although large models aren’t parameter efficient, which is why transfer learning is now being used to pre-seed the parameters, so we may see scale taper off at some point as researchers optimize this approach).
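As a reference point, here is a minimal sketch of scaled dot-product self-attention, the computation referred to above; it is illustrative only (a single head, no masking and no learned projections).

```python
# Illustrative single-head self-attention: each position ends up as a
# context-weighted mix of every other position, computed in parallel
# rather than step by step as in a recurrent network.
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    # x has shape (seq_len, d_model); queries, keys and values are all x here.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                       # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each row
    return weights @ x                                  # blend values by attention weight

tokens = np.random.randn(5, 16)                         # 5 tokens, 16-dim embeddings
print(self_attention(tokens).shape)                     # (5, 16)
```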
Probably the most interesting development from Google last year was T5, which explores the limits of transfer learning with a unified text-to-text transformer. What really came out of this work was that an encoder-decoder architecture offers a significant improvement over either an encoder or a decoder alone. This matters because it moves the Google architecture closer to what the GPT models are good at: rather than ad-libbing and “yes-anding,” the Google model can answer questions and test knowledge. The researchers also generalized the downstream tasks and trained for them specifically, so the model is able to summarize and translate. It was not trained for open-ended, auto-regressive text generation; if that capability were trained as a downstream objective, it would level the “AGI playing field” with GPT. This model was huge: 11 billion parameters.
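The text-to-text framing is easy to see in code. The sketch below uses the Hugging Face transformers library and the small public t5-small checkpoint (an assumption for illustration, not the 11-billion-parameter model): every task, translation and summarization included, is expressed as text in, text out via a task prefix.

```python
# Sketch of T5's text-to-text framing with Hugging Face's `transformers`
# (t5-small is the small public checkpoint, not the 11B model discussed above).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: Transformer models have rapidly replaced recurrent networks "
    "because they are more scalable and easier to parallelize.",
]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```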
OpenAI’s recent progress mirrors what Google’s architecture is strong at: style and summarization. They prioritize reinforcement learning for summarization training because it is far better at maintaining truthfulness (they are able to instruct a value such as “don’t lie”). Summarization is hard because, even for humans, there is a lot of subjectivity, which is something the AI can’t evaluate well. OpenAI says the primary constraint is the quality of the data: most NLP training data comes from the internet, and the internet is getting worse as a training corpus, making human evaluation of NLG inescapable for the foreseeable future.
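As a loose illustration of how a learned preference signal can steer summaries, here is a toy best-of-n sketch; the reward function is a hypothetical stand-in invented for this example, not OpenAI’s learned human-preference model.

```python
# Toy best-of-n selection with a hypothetical reward function standing in
# for a learned human-preference model (not OpenAI's actual setup).
def reward(summary: str, source: str) -> float:
    # Placeholder faithfulness proxy: fraction of summary words that
    # actually appear in the source text ("don't make things up").
    source_words = set(source.lower().split())
    words = summary.lower().split()
    return sum(w in source_words for w in words) / max(len(words), 1)

def best_of_n(candidates, source):
    # Sample several candidate summaries elsewhere, then keep the one
    # the reward function prefers.
    return max(candidates, key=lambda s: reward(s, source))

source = "The transformer replaced recurrent networks because it parallelizes well."
candidates = [
    "The transformer replaced recurrent networks.",
    "Recurrent networks are the future of NLP.",   # less faithful to the source
]
print(best_of_n(candidates, source))
```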
Transformer architectures do seem to be the future, especially if you care about nuanced and noisy labels like emotional states (fear, anger, trust etc.). Google is working to make them more memory efficient: researchers take a pre-trained network and insert zero-initialized “adapter modules,” and only these modules are trainable. Performance is almost as good as full fine-tuning, and the original model is preserved and “roll-backable.”
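Here is a minimal PyTorch sketch of the adapter idea as described above (a small bottleneck module with a zero-initialized output projection, added residually so the frozen pre-trained network starts out unchanged); the layer sizes are assumptions for illustration.

```python
# Minimal sketch of an adapter module: a small bottleneck whose output
# projection is zero-initialized, added residually so the frozen
# pre-trained network is untouched until the adapter learns something.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)   # zero-init: contributes nothing at first
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the original behaviour "roll-backable".
        return hidden + self.up(torch.relu(self.down(hidden)))

# Usage idea: freeze the pre-trained model, insert adapters after each
# sub-layer, and train only the adapter parameters.
adapter = Adapter(d_model=768)
print(sum(p.numel() for p in adapter.parameters()))  # far fewer than the base model
```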
Facebook has also done some interesting work. Facebook seemed to spend 2019 bedding in the self-supervised learning approach across everything. They are slick at hybridization, and their latest move is to take Google’s BERT model and combine vision with language. Researchers use visual representations to pre-train the ground truth and enhance the language side, essentially building a joint visual-linguistic representation. The innovation is in linking the models so they reason jointly across vision and language, with separate streams for vision and language processing that communicate with each other through transformer layers.
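An illustrative sketch (my own, not Facebook’s code) of two streams communicating through cross-attention: queries come from one modality while keys and values come from the other, so vision and language can condition on each other inside a transformer-style layer.

```python
# Illustrative two-stream cross-attention layer: each modality attends
# over the other's features, with residual updates per stream.
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.vision_attends_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_attends_vision = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, vision: torch.Tensor, text: torch.Tensor):
        # Queries from one stream, keys/values from the other.
        v_out, _ = self.vision_attends_text(vision, text, text)
        t_out, _ = self.text_attends_vision(text, vision, vision)
        return vision + v_out, text + t_out

layer = CrossModalLayer()
vision_feats = torch.randn(1, 36, 256)   # e.g. 36 image-region features
text_feats = torch.randn(1, 20, 256)     # e.g. 20 token embeddings
v, t = layer(vision_feats, text_feats)
print(v.shape, t.shape)
```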
Amazon recently published research on “skinnying down” transformer models. It’s interesting because Amazon has potentially conflicting goals in reducing model size. It’s also interesting because Amazon (for Alexa) has really doubled down on LSTMs, if last year’s Alexa Prize is anything to go by.
The net-net of this latest work is that Amazon researchers tried out three different ways of skinnying down the transformer encoder/decoder structure. The key strength of the transformer architecture is its ability to capture long-term dependencies using self-attention and positional encoding (it focuses on the most important words while also capturing where those words sit in a sentence). Making a skinnier network means finding ways to reduce the number of connections (and associated parameters) whilst preserving this long-term dependency capability, which is highly related to the complexity of the connections.
In essence, the work balances two competing objectives: long-term context with fewer connections.
Amazon borrowed from convolutional structures to create a “dilated transformer” model. Dilated transformers are a bit like convolutions (which prioritize proximity): they act as a wide-angle lens that lets the model see how different components of the input relate to each other, whilst maintaining as much self-attention as possible (the strength of the encoder/decoder), so there are still structures that act like a telephoto lens and let the model zero in on specific details.
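A toy sketch of the dilation idea (my own illustration, not Amazon’s implementation): instead of letting every position attend to every other position, keep a small local window plus a strided long-range pattern, which removes most connections while retaining some long-range reach.

```python
# Toy dilated-attention mask: a local window ("telephoto") plus a strided
# long-range pattern ("wide-angle"), cutting most of the full N x N connections.
import torch

def dilated_attention_mask(seq_len: int, dilation: int, window: int = 1) -> torch.Tensor:
    # True where attention is allowed.
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()
    local = dist <= window            # nearby detail
    strided = dist % dilation == 0    # sparse long-range links
    return local | strided

mask = dilated_attention_mask(seq_len=12, dilation=4)
print(mask.int())
print("connections kept:", mask.sum().item(), "of", 12 * 12)
```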
They experimented with three different architectures.
The end result across all three variations was a 70% reduction in parameters with some, but not a lot of, reduction in performance. This research was fairly narrow in terms of its training set, but the idea of dilating a transformer network to reduce computation requirements is interesting. It also strikes me as early, so it will be interesting to see who follows and expands on these ideas.
Where is the frontier of NLG now? Three open challenges stick out: encoding context, incorporating personality and reducing boring answers. The shift to transformer architecture is most certainly helping with context; it seems likely that a combination of scale, fresh or niche approaches to data, and techniques such as transfer learning will drive important progress there. Personality modeling seems to be shifting towards a more implicit approach, where using reinforcement learning to vectorize an individual user could see progress; conversational AI and chat both have a lot to gain from this. The same goes for fixing boring responses such as “I don’t know,” where people are looking to reinforcement learning to penalize the AI for saying something boring.
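As a toy illustration of the “penalize boring answers” idea, here is a hypothetical reward function of the kind an RL fine-tuning loop might use; the blocklist and the informativeness proxy are invented for this example.

```python
# Hypothetical "boringness" reward for RL fine-tuning of a dialogue model;
# the blocklist and scoring are placeholders for illustration only.
GENERIC_REPLIES = {"i don't know", "i'm not sure", "ok", "yes", "no"}

def boringness_reward(response: str) -> float:
    text = response.strip().lower().rstrip(".!?")
    if text in GENERIC_REPLIES:
        return -1.0                                   # penalize canned answers
    unique_words = len(set(text.split()))
    return min(unique_words / 10.0, 1.0)              # crude proxy for informativeness

print(boringness_reward("I don't know."))                                  # -1.0
print(boringness_reward("T5 casts every task as text-to-text transfer."))  # positive
```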
Researchers funded by DARPA identify two key directions: cognitive architectures and emotional intelligence. Cognitive architectures could be an analogy for how to deal with large-scale language models; here there would be specific models for long- and short-term memory, with some form of action-selection mechanism as a bridge. Emotional content encoding is seen as a “hack” or an efficiency mechanism whereby the AI is simply better able to make a choice based on an understanding of the emotional context of the situation.
One diversion of interest is recent research on poetry generation. It sounds a bit wonky but it’s actually pretty interesting. The researchers’ goal (using GPT-2) was to see whether they could produce more creative and emotionally engaging content using poetry as a base. Poetry is unique as a language construct because it uses all sorts of linguistic devices and underlying imagery that elicit feelings and emotions in the reader. Poems were categorized as joy, anticipation, trust, anger, sadness, surprise, disgust and fear. The model generated poems that correctly elicited sadness and joy 87.5% and 85% of the time, respectively. The researchers see this work as indicating that context and emotion are inseparable.
What’s coming? The priorities and research directions seem to be the ones flagged above: richer context encoding, personality modeling, fixing boring responses, and the cognitive-architecture and emotional-intelligence work DARPA is funding.
This is a fast-moving frontier. It keeps pace with computational power, which means that scale will continue to deliver performance gains. But the true subtleties of language are yet to be cracked, especially in chat and language generation.