How machines learn, a primer on machine learning

Machine learning is a new way to tell a computer what you want it to do. It’s a new way to program.

Traditional technology relies on humans to write rules and translate them into instructions for a computer which creates an output based on applying the rules. Humans translate their knowledge of the world into code, lock down the design in a software release and update the rules on a periodic basis.

The new world of machine learning is the other way around. Outcomes, via data, guide the computer to define rules. The raw material for computers is data. Machines find underlying structure in data which allows them to sort, cluster, and derive new rules.

Machine learning techniques have been around for a long time, but it is the recent increase in computing power and the scale of data that has made modern AI systems possible. This can’t be over-emphasized: the fuel for AI is data, and the type of data matters. The explosion of the social internet, with its artifacts of video, images, and audio, means that AI has a different kind of fuel than just numbers. AI is now able to learn things that previously only humans could do, such as being able to see or talk.

AI doesn’t evolve as living organisms have—it is designed. But while humans choose the data and set the parameters of AI, AI is not under the complete control of humans because an AI’s behavior depends on its post-design experience. There is no traditional version control, only an evolution of what the machine learns through its exposure to new data. When rules that AI finds no longer seem to apply, it’s not always easy to know whether the AI’s model is incorrect or whether the world has changed.

The old way to work with computers is rules in, data out. This was the world of expert systems and IFTTT. Give a system rules and it gives you answers. The new way is data in, rules out. These new rules reflect some underlying structure in the data. New forms of data allow a machine to learn things that humans evolved to do.

It is this change, from rules-based systems to data-driven systems, that has powered many of our favorite consumer products. Google’s search algorithm learns from the data it receives as queries. Netflix’s movie recommendations learn from what you’ve watched and what people like you have watched. Apple Maps learns where you go and how you like to get there. Machine learning systems that are part of large internet platforms can use the vast scale of data to adjust to consumer experience in real time. Each time you watch a movie or drive your car, the systems can adjust for the new data.

How does that work? Let’s explain.

Machine learning systems provide computers with the ability to learn without explicitly being programmed. All machine learning algorithms use data to train a model which then updates and changes as it is exposed to new data. The model makes predictions or decisions against some performance measure. Performance is measured based on how accurately the program predicts a value for an output that was already known. The difference between the predicted output and the actual output is the error.

Showing the flowchart of how machines learn: using statistical techniques to make models & predictions against performance or measure then determining if it's doing well or not

The machine reduces error and improves performance—or learns—by updating its programs or by updating its rules. It learns from new data and adjusts to reduce error using some form of calculus or statistical technique. As we said, this is the key differentiator for machine learning. We live in a world where we don’t have to give rules to a machine anymore. Now we can give it data and a goal and it gives us the rules.
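To make the predict, measure, adjust loop concrete, here is a minimal sketch in Python. It assumes a toy data set that follows y = 2x + 1, a straight-line model, and plain gradient descent as the "form of calculus" used to reduce error. The data, the learning rate, and the function names are all illustrative assumptions, not a description of any particular production system.

```python
# Toy data: the outputs follow y = 2x + 1, the "rule" we hope the machine recovers.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mean_squared_error(w, b):
    """Average squared gap between the predicted and the actual outputs."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(steps=2000, learning_rate=0.05):
    """Start from a bad rule and repeatedly nudge it in the direction that lowers error."""
    w, b = 0.0, 0.0  # initial rule: predict 0 for everything
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = train()
print(f"learned rule: y = {w:.2f}x + {b:.2f}, error = {mean_squared_error(w, b):.6f}")
```

Nobody typed the rule into the program. It was given data and a goal (low error), and the slope and intercept came out the other end.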

The learning style depends on the type of data, the type of problem, and the type of algorithm. Let’s walk through four ways in which systems learn: supervised learning, unsupervised learning, reinforcement learning, and generative adversarial networks.

Machines learning from humans: Supervised learning

Supervised learning is the most common form of machine learning but it has some serious limitations, namely that machines need a lot of labeled examples. In supervised learning, humans label examples to create the data used to train the model. For instance, a human might label many images of company logos with the name of the company so that an algorithm can create a model that will predict that a swoosh image is the Nike logo. The number of examples that need to be labeled varies based on the type of algorithm and the complexity of the prediction. A specially designed algorithm for identifying a specific image like a logo may not need many examples while an algorithm for identifying something in the wild may need many. How many times have you had to identify the number of squares in a captcha that includes a traffic light or a bicycle? Those clicks are labeling data and the complexity of identifying a traffic light from any perspective with any background requires a lot of labeled examples.
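As a rough illustration of learning from labeled examples, the sketch below uses a k-nearest-neighbors vote, one of the simplest supervised techniques. The two numeric features (curviness and symmetry), their values, and the labels are invented for this example. A real logo classifier would work on pixels and need far more data.

```python
import math

# Hypothetical human-labeled examples: (curviness, symmetry) -> label.
labeled_examples = [
    ((0.90, 0.20), "swoosh"),
    ((0.80, 0.30), "swoosh"),
    ((0.85, 0.25), "swoosh"),
    ((0.20, 0.90), "other logo"),
    ((0.10, 0.80), "other logo"),
    ((0.30, 0.95), "other logo"),
]

def predict(features, examples=labeled_examples, k=3):
    """Label a new point by majority vote among its k closest labeled examples."""
    nearest = sorted(examples, key=lambda ex: math.dist(features, ex[0]))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

print(predict((0.88, 0.22)))
print(predict((0.15, 0.85)))
```

Everything the model knows comes from the labels, which is why both the quantity and the quality of human labeling matter so much.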

Supervised learning’s power resides in the ability for humans to teach a machine to do something. The machine is able to amplify that teaching well beyond human scope and speed. The weakness of supervised learning, however, resides in its reliance on humans as teachers. Machines learn the good and the bad from humans. For instance, a machine may learn to recognize facial hair as a beard but it might also learn to recognize people with beards as potential terrorists if its teacher is racist. This kind of bias can be learned but hidden in a model and not discovered until something unintentional and unfortunate happens. We’ll dig more into this topic later in this section.

Supervised learning is the workhorse of AI. Many of the AI systems you are in contact with every day use supervised learning, such as image classification, sentiment analysis (like analyzing tweets for customer sentiment), and predicting house prices. While there has been huge progress toward more sophisticated and flexible machine learning algorithms, the vast majority of AI systems still use human labeled data and rely on humans to teach, test, and monitor performance.

Great Machine Strength: Once an AI has been trained to classify data, it can classify millions of data points in multiple dimensions much faster than humans.

Great Machine Weakness: Training AI takes a lot of examples, is costly, and inherits the good and bad of humans.

Humans learning from machines: Unsupervised learning

Another style of learning is when the machine discovers the knowledge for itself.

In unsupervised learning, data points have no labels and the goal instead is to organize the data by similarity and understand its structure. There isn’t a known result or a “correct answer” from which to create a model. So, instead of testing against correct answers, in unsupervised learning, a model is prepared by deducing structures in the data. An important goal of unsupervised learning is to get the machine to find data patterns that humans don’t know about.  
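One common way to organize unlabeled data by similarity is k-means clustering. The sketch below is a bare-bones version with made-up two-dimensional points and a deliberately naive starting guess. It is only meant to show that no labels or "correct answers" are involved.

```python
import math

def k_means(points, k, iterations=10):
    """Group unlabeled points into k clusters by similarity (Lloyd's algorithm)."""
    centroids = list(points[:k])  # naive initialisation: use the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of the points assigned to it.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
    return centroids, clusters

# Made-up data with two obvious groups; the algorithm is never told which is which.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
centroids, clusters = k_means(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

The structure (two groups) comes out of the data itself. What the groups mean is left for a human to interpret.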

This technique can be extremely powerful for discovering new knowledge beyond our human perception. It is used in many applications in scientific discovery, such as protein modeling and DNA analysis. Perhaps the most easy-to-grasp yet startling example of unsupervised learning comes from researchers at DeepMind who used this method to discover that an AI could predict the sex of a person from an image of their retina at an accuracy of 97%. Let’s reiterate this astounding discovery: no human expert has ever been able to do this, and no human expert can figure out what the AI is seeing. The researchers aren’t sure we ever will, stating that “the retinal features apparent to domain experts for this task may go unanswered, as the power of deep learning in integrating population-level patterns from billions of pixel-level variations is impossible for humans to match.”

Semi-supervised learning is when the techniques of supervised and unsupervised learning are combined. These hybrid systems use a little human setup and tweaking in order to reduce the scale of a problem. It’s a way of getting more “bang for your buck” out of labeled data sets by reducing the labeling bottleneck.
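A simple way to picture this is label spreading: start with a handful of human labels and let each unlabeled point inherit the label of its closest already-labeled neighbor. The sketch below is one assumed, simplified variant of that idea with invented points and labels. Real semi-supervised systems are considerably more careful about when to trust a propagated label.

```python
import math

labeled = [((1.0, 1.0), "A"), ((9.0, 9.0), "B")]   # the few human-labeled points
unlabeled = [(1.5, 0.5), (2.0, 2.0), (8.0, 8.5), (3.0, 3.5), (7.0, 6.5)]

def propagate_labels(labeled, unlabeled):
    """Repeatedly label the unlabeled point that sits closest to any labeled point."""
    known = list(labeled)
    remaining = list(unlabeled)
    while remaining:
        # Pick the (unlabeled point, labeled neighbour) pair with the smallest distance.
        point, (_, label) = min(
            ((p, ex) for p in remaining for ex in known),
            key=lambda pair: math.dist(pair[0], pair[1][0]),
        )
        known.append((point, label))   # the new label can now spread further
        remaining.remove(point)
    return known[len(labeled):]

machine_labeled = propagate_labels(labeled, unlabeled)
print(machine_labeled)
```

Two human labels end up covering seven points, which is the "bang for your buck" in miniature.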

A form of unsupervised learning known as self-supervised learning is responsible for the progress made in recent years in large-scale language models. Language is extremely complex for AI. However, using self-supervised learning techniques, enabled by huge compute power, the AI can discover representations and patterns in language on its own without the use of examples which have been labeled by humans.

Self-supervised learning has had a profound impact on the development of large language models. In natural language processing (NLP), part of a sentence is hidden and the AI has to predict the hidden words based on the words that remain. The AI can discover signals in the data for itself because it uses the structure of the data itself. Because it does not rely on human labels, it can learn from far larger data sets than could ever be labeled by hand. Next time you use the translation function in Facebook, it’s likely that self-supervised learning is at work.
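The hide-a-word idea can be sketched with nothing more than counting. In the toy example below, the only input is raw text; the training targets are words the program hides from itself. The four-sentence corpus and the use of just one neighbor on each side are simplifying assumptions. Large language models use neural networks and far wider context.

```python
from collections import Counter, defaultdict

# Unlabeled text is the only input; the "labels" are words we hide ourselves.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat lay on the mat",
    "a dog sat on a mat",
]

def train(sentences):
    """Count which word appears between each (left, right) pair of neighbours."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(1, len(words) - 1):
            # Hide words[i]: its neighbours are the input, the hidden word is the target.
            model[(words[i - 1], words[i + 1])][words[i]] += 1
    return model

def fill_mask(model, left, right):
    """Predict the hidden word from the words on either side of it."""
    candidates = model.get((left, right))
    return candidates.most_common(1)[0][0] if candidates else None

model = train(corpus)
print(fill_mask(model, "sat", "the"))   # fills the gap in "sat ___ the"
print(fill_mask(model, "on", "mat"))    # fills the gap in "on ___ mat"
```

No human ever told the program what the right answers were. The text supplied its own.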

Great Machine Strength: AI can find patterns and associations that humans can’t.

Great Machine Weakness: AI might not be able to explain what it’s found.

Machines learning from experience: Reinforcement learning

Another learning method that’s becoming increasingly popular is reinforcement learning, a technique that mimics how humans learn based on experience and reward, and trial and error.  

These systems learn to take actions within an environment so that they get as many rewards as possible. These rewards could be points or some other bonus. For example, in chess the reward would be capturing pieces. The machine doesn’t know whether an individual action taken was good or bad. Instead, the machine learns from time-differentiated results whether it is making progress toward a goal. The system has to resolve a classic dilemma in all decision-making—balancing exploring new states while still maximizing overall reward. Resolving this dilemma might mean making short term sacrifices so it’s important to keep track of which choices work best. The trick is to collect information about these short-term decisions so as to make the best overall decision.
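The explore-versus-exploit dilemma is easiest to see in a "multi-armed bandit": a row of slot machines with unknown payout rates. The sketch below uses an epsilon-greedy strategy, which is one common approach among several. The payout rates, the 10% exploration rate, and the function name are assumptions made for illustration.

```python
import random

def epsilon_greedy_bandit(true_rewards, steps=5000, epsilon=0.1, seed=0):
    """Balance trying new options against repeating the best option found so far."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_rewards)   # running guess of each option's payout
    pulls = [0] * len(true_rewards)         # how often each option was chosen
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(true_rewards))                         # explore
        else:
            action = max(range(len(true_rewards)), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_rewards[action] else 0.0
        pulls[action] += 1
        # Keep track of which choices work best: update the running average.
        estimates[action] += (reward - estimates[action]) / pulls[action]
        total_reward += reward
    return estimates, pulls, total_reward

estimates, pulls, total_reward = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print("estimated payouts:", [round(e, 2) for e in estimates])
print("times each option was chosen:", pulls)
```

The occasional random choice is the short-term sacrifice: it costs a little reward now, but without it the machine could settle on the first option that ever paid out.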

One of the strengths of reinforcement learning is that the machine can learn without knowing whether it's correct or not. It learns what a guess should have been from the next guess it makes. Reinforcement learning optimizes reward in a particularly clever way—the machine is programmed to do two things at once. At the same time as it acts, it also self-evaluates.
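Learning "what a guess should have been from the next guess" is the idea behind temporal-difference learning. The sketch below applies the simplest version, usually called TD(0), to an assumed toy problem: a random walk along five positions that pays 1 for falling off the right end and 0 for falling off the left. The environment and the step size are illustrative choices.

```python
import random

def td_zero(num_states=5, episodes=5000, alpha=0.02, seed=0):
    """Estimate, for each position, the chance of eventually exiting on the right."""
    rng = random.Random(seed)
    values = [0.5] * num_states              # initial guess for every position
    for _ in range(episodes):
        state = num_states // 2              # every walk starts in the middle
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:
                target = 0.0                 # fell off the left end: final score 0
            elif next_state >= num_states:
                target = 1.0                 # fell off the right end: final score 1
            else:
                target = values[next_state]  # otherwise, trust the next guess
            # Move the current guess a small step toward the target.
            values[state] += alpha * (target - values[state])
            if next_state < 0 or next_state >= num_states:
                break
            state = next_state
    return values

values = td_zero()
print([round(v, 2) for v in values])
```

Most updates happen long before the walk ends, using only the neighboring guess as feedback. The true outcome arrives once per walk and gradually filters back through the chain of guesses.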

Part of the system learns to predict the final outcome, or score. This is the critic. In a game, the critic can make a prediction of whether the agent is winning or losing. In the case of a robot learning to pick up objects, the critic may make a prediction about the success of its grasp. This half of the system gets better and better at assessing performance at each moment, rather than waiting until the end. The other half of the system is called the actor. The actor chooses the actions and uses the critic’s prediction to correct itself. The actor learns to act wisely and the critic gets better at evaluating the consequences of actions. They learn and progress together.

Reinforcement learning is used in situations where the optimal outcomes are not known or are more difficult to define and where there is less feedback. The feedback can be much later in the process and may not provide any information about the steps themselves. The steps have to be discovered through exploration, testing out a lot of different strategies and measuring performance against the long-term goal.

Machines learning from experience aka reinforcement learning: showing a car learning how to stay on the road.

Reinforcement learning is so effective that you’ll see it everywhere. Many recommendation systems (such as how Facebook and YouTube decide on what to show you) are based on reinforcement learning. For instance, YouTube’s recommendations are based on a long-term goal of keeping a viewer on the platform as long as possible. The reinforcement learning algorithm doesn’t optimize for the next view, it optimizes for the longest viewing time. The algorithm has to understand not just what engages a user but also what makes them bored. While these algorithms are very complex, there is nothing inherently sinister or magical about them. Designers might downweight factors such as whether a user has already watched videos by the same creator or something similar that day.

Reinforcement learning is commonly used in robotics, including the software for autonomous or self-driving vehicles. It isn’t possible to define the optimal path for a vehicle to drive. There are too many roads and too many obstacles in the world to define the positive result in every case. But we can define that success is staying on the road and failure is going off the road. Using reinforcement learning, the car will get a reward when it stays on the road but will not when it goes off the road. The car incorporates the feedback loop given the circumstances and builds its own model for how to drive to reach the defined success of staying on the road.
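A heavily simplified version of the stay-on-the-road setup can be sketched with Q-learning, a standard reinforcement learning algorithm. Everything about the world below is an assumption made for illustration: the road is five positions wide, the car can steer left, hold, or steer right, and a random gust nudges it sideways at every step. Real driving software is vastly more complex and is not described by the source.

```python
import random

POSITIONS = range(5)      # positions 0..4 across the road; anything else is off the road
ACTIONS = (-1, 0, 1)      # steer left, hold, steer right

def step(position, action, rng):
    """Move the car; a random gust shifts it by -1, 0, or +1 (an assumed toy world)."""
    new_position = position + action + rng.choice((-1, 0, 1))
    on_road = new_position in POSITIONS
    reward = 1.0 if on_road else 0.0   # reward for staying on the road, none for leaving
    return new_position, reward, on_road

def q_learning(episodes=3000, max_steps=50, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn the long-run value of each (position, action) pair by trial and error."""
    rng = random.Random(seed)
    q = {(p, a): 0.0 for p in POSITIONS for a in ACTIONS}
    for _ in range(episodes):
        position = rng.choice(POSITIONS)
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                            # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(position, a)])   # exploit
            new_position, reward, on_road = step(position, action, rng)
            # Value of the best follow-up action; zero once the car has left the road.
            future = max(q[(new_position, a)] for a in ACTIONS) if on_road else 0.0
            q[(position, action)] += alpha * (reward + gamma * future - q[(position, action)])
            if not on_road:
                break
            position = new_position
    return q

q = q_learning()
policy = {p: max(ACTIONS, key=lambda a: q[(p, a)]) for p in POSITIONS}
print(policy)
```

The program is never told "steer away from the edge". It is only told what counts as success, and the steering preference near each edge emerges from the rewards.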

It is through reinforcement learning that a car can learn on its own how to accomplish goals that humans can’t specify ahead of time. In 2015, Chris Urmson, who was then head of Google’s self-driving program, explained how one of Google’s self-driving cars was driving through Mountain View when it encountered a woman in an electric wheelchair, waving a broom and chasing a duck. This isn’t something that could ever have been anticipated in the DMV handbook!

Great Machine Strength: AI can learn for itself by trial and error.

Great Machine Weakness: AI needs a human to set the goal and needs many trials to learn.

Machines learning from competition: Generative adversarial networks

The last style of learning we will cover is when machines learn by trying to fool another machine. You know what they say: if you can’t fake it, you don’t understand it.

Generative adversarial networks, or GANs, are relatively new. They were introduced in 2014 and are considered one of the most interesting ideas in machine learning in a long time. GANs pit two neural networks against each other so that each improves through competition. The two competitors are called discriminative algorithms and generative algorithms.

Discriminative algorithms map features to labels. So, given a series of features, what label would it give? Generative algorithms do the opposite. Instead of predicting a label given certain features, they attempt to predict features given a certain label. You can think of this in the world of spam. A discriminative algorithm will predict if an email is spam or not (label) given the words in the email (features) while a generative algorithm will predict what words would be in an email if it is spam or not.
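The two directions in the spam example can be sketched with simple word counts. One caveat up front: the code below is a naive Bayes style model, which is technically a generative model that is also used to classify via Bayes' rule. A purely discriminative model, such as logistic regression, would learn the spam boundary directly. The four emails, the equal weighting of the two labels, and the function names are invented for illustration.

```python
import math
import random
from collections import Counter

# Tiny invented training set of (email text, label) pairs.
emails = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
for text, label in emails:
    word_counts[label].update(text.split())
vocabulary = sorted(set(word_counts["spam"]) | set(word_counts["ham"]))

def word_probability(word, label):
    """P(word | label), with add-one smoothing so unseen words are not impossible."""
    counts = word_counts[label]
    return (counts[word] + 1) / (sum(counts.values()) + len(vocabulary))

def predict_label(text):
    """Features -> label: pick the label under which these words are most likely."""
    def score(label):
        return sum(math.log(word_probability(w, label)) for w in text.split())
    return max(word_counts, key=score)

def generate_words(label, n=4, seed=0):
    """Label -> features: sample words that are typical of the given label."""
    rng = random.Random(seed)
    weights = [word_probability(w, label) for w in vocabulary]
    return rng.choices(vocabulary, weights=weights, k=n)

print(predict_label("win prize"))
print(predict_label("monday meeting"))
print(generate_words("spam"))
```

`predict_label` answers "given these words, which label?", while `generate_words` answers "given this label, which words?".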

Computers learning from competition aka Generative Adversarial Networks or GANs: showing a discriminator computer determining if a generator computer's output is fake or real.

Let’s look at how these work in the world of images. The generator generates new data instances (images), while the discriminator evaluates them for authenticity against a dataset of known, real images. In other words, the discriminator decides whether each image created by the generator is part of the actual training dataset or not.

The generator acts as a forger, hoping that its synthetic images will be deemed authentic even though they are fake. The goal of the generator is to generate passable images: to lie without being caught. The discriminator acts as law enforcement, hoping to identify images coming from the generator as fake.

In a world where we sometimes wonder what’s true and what’s false, GANs matter. The machines are getting better and better at producing fake images. It is getting harder and harder to spot the fakes. The same is increasingly true of audio, video, and text. Given our hair-trigger reactions, we are particularly vulnerable to manipulation by fakes designed to provoke viral outrage. Our brains are ill-equipped to cope.

One application of GANs is to help de-bias datasets. GANs can generate synthetic data which can balance a dataset when the data is unrepresentative. For example, if a dataset has a high proportion of white males, GANs can generate data that is representative of Black females. When subsequent models use the dataset, the AI performs better on predictions for Black females than it otherwise would have. This also has the upside of reducing privacy-invasive data collection problems which can disproportionately affect underrepresented groups, such as when Google was caught out for scanning the faces of Black homeless people in order to improve the performance of its facial scanning technology.

Great Machine Strength: AI can be creative and generate new data for itself.

Great Machine Weakness: AI can fool us.

Why language is special

Humans are the only species to invent language. Machines that can communicate with us in natural language represent an enormous change in our use and perception of artificial intelligence.

There has been incredible progress in chat systems through the invention of large language models that use self-attention and generative architectures to build enormous, sophisticated statistical models. But we have to remember that these models, as powerful and as useful as they are, are not human intelligence.

When we talk, information conveyed as language transfers between us at a maximum rate of around 60 bits per second. And that’s if we say something simple like one-zero-zero-one. Between computers, however, it’s gigabits per second. We have a theory of mind for others, the idea that we are modeling each other all the time. Our communication is genuinely synchronous and builds on the other’s ideas. It is a creative partnership of second-guessing what the other person is going to say, and it massively amplifies our human-to-human transfer rate.

What gives language meaning?

Language gives us meaning through its structure and is made up of three things: words, rules and interfaces.


On the surface of it, words are the easy part. And, yes, for generation of written language it’s a straightforward exercise to build a lexicon of words. A typical high school graduate has around 60,000 words stored in long-term memory, with about 20,000 actively used. Computers can obviously store more and learn them faster. But we store our words packed with a lot more information; for every word we store, we also store how it sounds, what it means, and a host of subtle associations and links to other words and concepts.

Even teaching a machine to hear and speak words is more complex than you’d initially expect. For example, take the phrase “cool cat.” If you say this out loud to yourself a few times you can hear that the two hard “c’s” sound quite different. Somehow, a computer has to be told this (and many things like it) or learn it on its own.

Rules, Algorithms, Grammar

The set of algorithms that we use to assemble bits into complex sequences is called grammar. There are rules for combining vowels and consonants into the smallest words, rules for assembling bits of words into more complex words and then there’s syntax, which are the rules that allow us to build phrases and sentences from words. When we use these rules in speech, we do so innately. We effortlessly construct unique and meaningful sentences all the time. We know instinctively whether a sentence conforms to these rules and has meaning.

There’s a famous example given by Noam Chomsky in 1956: “Colorless green ideas sleep furiously.” We instantly recognize this conforms to English syntax. But we also know it’s meaningless. We know it’s meaningless because the transition probabilities – the probability of one word following another – are almost zero. How many times have you heard “green” follow “colorless”? The point is that the rules don’t necessarily equate with the meaning. Many machine learning language-processing algorithms use a combination of rules-based and probability models to understand and generate language in a similar, but far more limited, way to us. A natural language processing engine should recognize, as you do, that “furiously sleep ideas dream colorless” is word salad.
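Transition probabilities are straightforward to estimate by counting which words follow which. The sketch below does this for a tiny invented corpus, so the numbers say nothing about real English. It only shows the mechanics of a so-called bigram model.

```python
from collections import Counter, defaultdict

# A tiny invented corpus; a real model would count over millions of sentences.
corpus = (
    "green ideas are new ideas . "
    "new ideas sleep in old books . "
    "green leaves sleep in winter . "
    "tired children sleep soundly ."
).split()

# For every word, count the words that immediately follow it.
follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def transition_probability(current_word, next_word):
    """Estimate P(next_word | current_word) from the raw counts."""
    counts = follow_counts.get(current_word, Counter())
    total = sum(counts.values())
    return counts[next_word] / total if total else 0.0

def sentence_probability(sentence):
    """Multiply the transition probabilities along the sentence."""
    words = sentence.split()
    probability = 1.0
    for current_word, next_word in zip(words, words[1:]):
        probability *= transition_probability(current_word, next_word)
    return probability

print(transition_probability("green", "ideas"))
print(transition_probability("colorless", "green"))
print(sentence_probability("green ideas sleep in winter"))
print(sentence_probability("colorless green ideas sleep furiously"))
```

In this corpus "green" never follows "colorless", so the famous sentence scores zero even though its grammar is fine. That is the gap between following the rules and being likely.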

Add in accents, colloquialisms, slang, technology and new generations of people being born and deciding new rules for themselves and you realize how language is a dynamic thing.

Interface – the Link with the Mind

By far the most difficult technology challenge, though, is the third component of language: the connection between language and the rest of our mind. Linguists call this “pragmatics.” We understand language in context. Our knowledge of the world and assumptions about how other people speak and behave is an intricate and vital part of how we understand language.

We have an inherent expectation that the person we are speaking with is working with us to get meaning across; we cooperate as we converse. We can’t yet assume this with a machine. And this principle of cooperation lies at the center of an artificial intelligence gap.

The big leap forward: self-attention and transformers

Transformers are a type of deep learning model that has revolutionized the field of natural language processing (NLP). They were designed to handle sequential data, such as text, and have outperformed previous methods in a wide range of NLP tasks such as machine translation, question answering, and sentiment analysis. This section explains how transformers work and why they are so much better at detecting context than other methods for language.

A traditional approach to NLP involves breaking down the input text into individual words and representing each word with a fixed-length vector. These word vectors are then fed into a recurrent neural network (RNN) or a convolutional neural network (CNN) to learn the relationships between the words. However, this method has several limitations, including the inability to capture long-term dependencies between words and the difficulty in processing variable-length sequences.

Transformers, on the other hand, use self-attention mechanisms to learn the relationships between the words in the input sequence. The self-attention mechanism allows each word to attend to other words in the sequence, taking into account their relative position and importance, to build a more comprehensive representation of the input. This allows transformers to process all positions of a sequence in parallel rather than one word at a time, which speeds up training on long sequences.

One way to think about transformers moving through language is as if they are snow plows moving along a snowed-in street. They create words and sentences like snow plows clear roads—the driver can adjust the angle and approach of the plow depending on where it is most effective just as the transformer’s “attention” can be guided. In this way, transformers are “proto-concept” builders. They can’t establish meaning but they do establish a basic context in a novel way.

The self-attention mechanism in transformers can be understood as a query-key-value operation. Each word in the input sequence is transformed into a query, a key, and a value, and the attention mechanism calculates a weighted sum of the values for each query based on the similarity between the query and the keys. This weighted sum is then used to build a more comprehensive representation of the input.
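The query-key-value operation fits in a few lines of plain Python. The sketch below follows the commonly used scaled dot-product form, in which similarity is a dot product divided by the square root of the key size and then passed through a softmax. That scaling and the softmax are details of the standard formulation rather than something stated above. The two-dimensional embeddings and the identity weight matrices are made up to keep the numbers readable.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(vector, matrix):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [dot(vector, column) for column in zip(*matrix)]

def self_attention(embeddings, w_query, w_key, w_value):
    """For each word, blend the values of all words, weighted by query-key similarity."""
    queries = [project(x, w_query) for x in embeddings]
    keys = [project(x, w_key) for x in embeddings]
    values = [project(x, w_value) for x in embeddings]
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        all_weights.append(weights)
        # Weighted sum of the value vectors, one dimension at a time.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs, all_weights

# Made-up 2-dimensional embeddings for a three-word input.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for learned weight matrices
outputs, weights = self_attention(embeddings, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one word attends to every word in the sequence, itself included. In a trained transformer the three weight matrices are learned from data, and many such attention "heads" run side by side.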

The use of self-attention mechanisms has several advantages in NLP tasks. Firstly, transformers can capture long-term dependencies between words, as the attention mechanism allows each word to attend to all other words in the sequence. This allows transformers to better understand the context in which a word is used. Secondly, transformers can handle variable-length sequences, as the self-attention mechanism processes all the words in a sequence in parallel, however many there are.

Finally, transformers have significantly improved the performance of NLP tasks compared to previous methods. This is because they can capture the complex relationships between words in a sequence, leading to a more comprehensive representation of the input. This improved representation allows transformers to better understand the meaning and context of the input, leading to improved performance on NLP tasks.

Transformers are a powerful tool for NLP tasks, as they can handle variable-length sequences and capture long-term dependencies between words in a sequence. The use of self-attention mechanisms in transformers has allowed for a more comprehensive representation of the input, leading to improved performance on a wide range of NLP tasks. These benefits have made transformers a popular choice for NLP applications, and they continue to be an active area of research and development.