
Meta’s Yann LeCun is betting on self-supervised learning to unlock human-compatible AI

This article is part of our coverage of the latest in AI research.

What is the next step toward bridging the gap between natural and artificial intelligence? Scientists and researchers are divided on the answer. Yann LeCun, Chief AI Scientist at Meta and the recipient of the 2018 Turing Award, is betting on self-supervised learning, an approach in which machine learning models are trained without the need for human-labeled examples.

LeCun has been thinking and talking about self-supervised and unsupervised learning for years. But as his research and the fields of AI and neuroscience have progressed, his vision has converged around several promising concepts and trends.

At a recent event held by Meta AI, LeCun discussed possible paths toward human-level AI, the challenges that remain, and the impact of advances in AI.

World models are at the heart of efficient learning

Among the known limits of deep learning are the need for massive training data and a lack of robustness in dealing with novel situations. The latter is referred to as poor “out-of-distribution generalization” or sensitivity to “edge cases.”

Those are problems that humans and animals learn to solve very early in their lives. You don’t need to drive off a cliff to know that your car will fall and crash. You know that when an object occludes another object, the latter still exists even if it can’t be seen. You know that if you hit a ball with a club, you will send it flying in the direction of the swing.

We learn most of these things without being explicitly instructed, purely by observation and acting in the world. We develop a “world model” during the first few months of our lives and learn about gravity, dimensions, physical properties, causality, and more. This model helps us develop common sense and make reliable predictions of what will happen in the world around us. We then use these basic building blocks to accumulate more complex knowledge.

Current AI systems are missing this commonsense knowledge, which is why they are data hungry, require labeled examples, and are very rigid and sensitive to out-of-distribution data.

The question LeCun is exploring is this: how do we get machines to learn world models mostly by observation, and to accumulate the enormous amount of knowledge that babies pick up just by watching the world?

Self-supervised learning

LeCun believes that deep learning and artificial neural networks will play a big role in the future of AI. More specifically, he advocates for self-supervised learning, a branch of ML that reduces the need for human input and guidance in the training of neural networks.

The more popular branch of ML is supervised learning, in which models are trained on labeled examples. While supervised learning has been very successful at various applications, its requirement for annotation by an outside actor (mostly humans) has proven to be a bottleneck. First, supervised ML models require enormous human effort to label training examples. And second, supervised ML models can’t improve themselves because they need outside help to annotate new training examples.

In contrast, self-supervised ML models learn by observing the world, discerning patterns, making predictions (and sometimes acting and making interventions), and updating their knowledge based on how their predictions match the outcomes they see in the world. It is like a supervised learning system that does its own data annotation.
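The predict, compare, and update cycle described above can be illustrated with a toy sketch. This is not LeCun's method or any production system: the function name `learn_by_prediction`, the one-parameter model, and the learning rate are all assumptions chosen for illustration. The point is only that the training signal comes from the next observation in the stream, not from a human annotator.

```python
def learn_by_prediction(observations, lr=0.1, epochs=200):
    """Fit x[t+1] ~ w * x[t] using only the observation stream itself.

    Hypothetical toy model: a single weight w, trained by gradient
    descent on the squared prediction error.
    """
    w = 0.0
    for _ in range(epochs):
        # Pair each observation with the one that follows it.
        for current, nxt in zip(observations, observations[1:]):
            prediction = w * current       # the model's guess about what comes next
            error = prediction - nxt       # the "label" is simply the next observation
            w -= lr * error * current      # update knowledge based on the mismatch
    return w


# A stream in which every value is half of the previous one.
stream = [1.0, 0.5, 0.25, 0.125]
learned_w = learn_by_prediction(stream)
print(learned_w)
```

On this stream the learned weight settles close to 0.5, the rule that generated the data, even though nobody ever told the model what the rule was.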

The self-supervised learning paradigm is much more attuned to the way humans and animals learn. We humans do a lot of supervised learning, but we learn most of our fundamental and commonsense skills through self-supervised learning.

Self-supervised learning is an enormously sought-after goal in the ML community because only a very small fraction of the data that exists is annotated. Being able to train ML models on huge stores of unlabeled data has many applications.

In recent years, self-supervised learning has found its way into several areas of ML, including large language models. Basically, a self-supervised language model is trained on excerpts of text from which some words have been removed. The model must try to predict the missing parts. Since the original text contains the missing parts, this process requires no manual labeling and can scale to very large corpora of text such as Wikipedia and news websites. The trained model learns solid representations of how text is structured. It can be used for tasks such as text generation or fine-tuned on downstream tasks such as question answering.
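To make the "no manual labeling" point concrete, here is a minimal sketch of how training examples can be generated from raw text by hiding words. The `[MASK]` placeholder, the function name `make_masked_example`, the whitespace tokenization, and the 15 percent masking rate are assumptions for illustration, not details taken from the talk or from any specific model.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token for a hidden word


def make_masked_example(text, mask_prob=0.15, seed=0):
    """Turn raw text into (masked_tokens, targets) with no human labels.

    targets maps each masked position to the word that was hidden there,
    so the original text itself supplies the answers.
    """
    rng = random.Random(seed)
    tokens = text.split()  # naive whitespace tokenization (an assumption)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token  # what the model must predict
        else:
            masked.append(token)
    return masked, targets


sentence = "the quick brown fox jumps over the lazy dog"
masked_tokens, answers = make_masked_example(sentence, mask_prob=0.5, seed=1)
print(" ".join(masked_tokens))
print(answers)
```

A model would be trained to fill each masked position with the word stored in `answers`. Because the answers are read straight from the source text, the same procedure can be run over an arbitrarily large corpus.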

Scientists have also managed to apply self-supervised learning to computer vision tasks such as medical imaging. One widely used technique here is “contrastive learning,” in which a neural network is trained to create latent representations of unlabeled images. For example, during training, the model is given several copies of the same image, each modified in a different way (e.g., rotation, crops, zoom, color modifications, different angles of the same object). The network adjusts its parameters until its output remains consistent across different variations of the same image. The model can then be fine-tuned on a downstream task with fewer labeled images.
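The consistency objective described above can be sketched as a loss function. The article does not say which loss is used, so the version below is an assumption: a softmax-based contrastive loss in the style often called InfoNCE, written in plain Python over already-computed embedding vectors. The names `cosine` and `contrastive_loss` and the temperature value are hypothetical, and a real system would compute the embeddings with a neural network and backpropagate through this loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(views_a, views_b, temperature=0.5):
    """Average contrastive loss over a batch of image embeddings.

    views_a[i] and views_b[i] are embeddings of two modified copies of
    image i. The loss is low when each pair is similar to each other and
    dissimilar to the embeddings of the other images in the batch.
    """
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Negative log-softmax of the matching pair, computed stably.
        largest = max(logits)
        log_denominator = largest + math.log(
            sum(math.exp(value - largest) for value in logits)
        )
        total += log_denominator - logits[i]
    return total / len(views_a)


# Two images, two views each. In the first case the views of the same
# image agree; in the second case they have been deliberately mismatched.
views_a = [[1.0, 0.0], [0.0, 1.0]]
consistent_b = [[0.9, 0.1], [0.1, 0.9]]
mismatched_b = [[0.1, 0.9], [0.9, 0.1]]
print(contrastive_loss(views_a, consistent_b))
print(contrastive_loss(views_a, mismatched_b))
```

The consistent pairing produces the smaller loss, which is the signal that pushes the network toward giving the same output for different variations of one image.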