1. Gradient descent has revolutionized AI, but it’s not how humans learn.
Using the chain rule of calculus, backpropagation computes how much each weight contributed to the error between predicted and actual values, and gradient descent then nudges every weight in the network to reduce that error. The biological analogue is synaptic plasticity (potentiation and depression), which similarly adjusts the relative strength of connections between neurons. However, as far as we can tell, synaptic changes are driven only by information available locally at each synapse; there is no global error signal used to adjust the synapses of the whole neuronal population at once. This means that gradient descent might be a more efficient and powerful learning algorithm than the one used by biological brains.
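To make the mechanism concrete, here is a minimal sketch of gradient descent fitting a two-parameter toy model. The data and learning rate are invented for illustration, but the chain-rule bookkeeping that produces a global error signal for every weight is the real idea.

```python
import numpy as np

# Toy example: fit y = w*x + b to data by gradient descent.
# The gradients below come from the chain rule applied to the
# mean-squared error; every weight receives a global error signal.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, size=100)   # ground truth: w=3, b=0.5

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    y_pred = w * x + b
    err = y_pred - y                      # prediction error
    loss = np.mean(err ** 2)              # mean-squared error
    dw = np.mean(2 * err * x)             # dL/dw via the chain rule
    db = np.mean(2 * err)                 # dL/db via the chain rule
    w -= lr * dw                          # step downhill
    b -= lr * db

print(f"learned w={w:.2f}, b={b:.2f}, final loss={loss:.4f}")
```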
2. Humans can arrive at actionable inferences with less data than AI.
AlphaGo was initially trained on a database of about 30 million positions from human Go games; its successors, such as AlphaZero, learned chess by playing millions of games against themselves while searching roughly 80,000 positions per second.
Humans can become decent chess players within hundreds of games, not millions, and somehow we do it without gradient descent. In AI terminology, humans are great at ‘few-shot’ and ‘zero-shot’ learning: picking up new skills from only a handful of specific examples (or none at all) by generalizing from concepts we already understand.
3. Implementing LLMs requires a lot of money and infrastructure.
To actually use an LLM like GPT3, you need some very powerful chips. GPT3 was trained on a Microsoft-built cluster of roughly 10,000 NVIDIA GPUs; at around $10K per chip, the hardware alone cost on the order of $100M. To run an instance of GPT3 you don’t need all 10,000 chips, but you can’t fit it on a single GPU either. And even if you could, training an equivalent model on a single chip would take nearly 400 years.
Smaller open-source models like Meta’s Llama 2 (7B parameters) can actually run on a single GPU with around 16 GB of memory, but they don’t have the same capabilities or responsiveness as GPT3.
The energy costs are also significant. A single NVIDIA A100 draws about 400W under load, and the latest and most powerful NVIDIA GPU, the H100, draws up to 700W. A datacenter with 5,000 A100s and 5,000 H100s running at full capacity would draw about 5.5 megawatts for the GPUs alone, and considerably more once host servers, networking, and cooling are included. That is roughly the average power consumption of several thousand homes.
To be competitive in the AI space, you need a lot of chips and a lot of power; you probably need at least $100M of hardware just to get started. Meta recently announced plans to acquire 350,000 H100s at roughly $30K per chip, a total of about $10.5B just for the chips.
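The arithmetic behind these figures is worth spelling out. The per-chip wattages and prices below are the rough numbers quoted above, not official specifications, and the household figure assumes an average draw of about 1.2 kW per home.

```python
# Back-of-the-envelope arithmetic for the figures above.
a100_count, a100_watts = 5_000, 400          # approximate per-chip draw under load
h100_count, h100_watts = 5_000, 700

gpu_draw_mw = (a100_count * a100_watts + h100_count * h100_watts) / 1e6
print(f"GPU draw alone: {gpu_draw_mw:.1f} MW")             # ~5.5 MW

avg_home_kw = 1.2                                           # rough average US household draw
print(f"~{gpu_draw_mw * 1000 / avg_home_kw:,.0f} homes")    # a few thousand homes

meta_h100s, h100_price = 350_000, 30_000                    # announced purchase, rough unit price
print(f"Meta's chips: ${meta_h100s * h100_price / 1e9:.1f}B")   # $10.5B
```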
4. LLMs have compressed the internet into a few hundred GB
GPT3 was trained on 499B tokens of text data, on the order of 300 billion words. It would take a fast human reader about 1,500 years of nonstop reading to get through that much text. The untokenized source data was measured in terabytes: the raw Common Crawl dump that supplied the majority of GPT3’s training data, for instance, is about 45 TB of compressed text before filtering (roughly 570 GB after filtering).
The order of operations is as follows: first the text was tokenized into sequences of integers, and then these tokens (499B of them) were used to train the model. The model itself has 175 billion parameters, about one parameter for every three tokens of input. Each parameter is a 16-bit floating point number, so the entire model fits into about 350 GB of storage. In summary: terabytes of text data were compressed into a 350 GB model that still accurately represents the semantic information of the original text.
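These claims are easy to sanity-check with a few lines of arithmetic; the words-per-minute figure is my own rough assumption for a fast reader.

```python
# Rough arithmetic behind the compression claim above.
words = 300e9                      # approximate word count of the training text
wpm = 400                          # a fast reader, reading around the clock
years = words / wpm / 60 / 24 / 365
print(f"~{years:,.0f} years of continuous reading")    # on the order of the 1,500-year figure

params = 175e9
bytes_per_param = 2                # 16-bit floats
print(f"model size: ~{params * bytes_per_param / 1e9:.0f} GB")   # ~350 GB
print(f"tokens per parameter: ~{499e9 / params:.1f}")            # ~2.9
```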
5. Do LLMs understand anything?
For a given sequence of input tokens, GPT3 launches a cascade of matrix operations across its 175 billion parameters, with one goal: predict the next token in the sequence. GPT3 generates text autoregressively, meaning it appends its own prediction to the input sequence and repeats the whole process until it reaches a reasonable stopping point. The final output is a sequence of tokens that continues the input sequence. Each predicted token requires hundreds of billions of computations, since each of the 175 billion parameters is literally a numerical coefficient in a matrix used in transformations across the model’s layers.
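Schematically, the generation loop looks something like the sketch below. The `model` function here is a random stand-in for the real 175-billion-parameter network, so the token ids are meaningless, but the feed-the-prediction-back-in control flow is the point.

```python
import numpy as np

VOCAB_SIZE = 50_257          # GPT3's vocabulary size
STOP_TOKEN = 50_256          # the end-of-text token id in the GPT tokenizer

def model(tokens):
    """Stand-in for the real network: returns a probability distribution
    over the next token. GPT3 would run its 175B parameters' worth of
    matrix multiplications here."""
    rng = np.random.default_rng(sum(tokens))
    logits = rng.normal(size=VOCAB_SIZE)
    return np.exp(logits) / np.exp(logits).sum()

def generate(prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)               # one full forward pass
        next_token = int(np.argmax(probs))  # greedy choice (real systems sample)
        if next_token == STOP_TOKEN:
            break
        tokens.append(next_token)           # feed the prediction back in
    return tokens

print(generate([464, 3290, 318]))           # arbitrary example token ids
```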
So does GPT actually understand anything? From GPT3’s perspective, it sees arrays of integers, does some math, spits out a single integer, and repeats the cycle over and over. It doesn’t seem like it could possibly understand anything about the actual world. The paradox is that GPT3 behaves as if it understands.
One idea is that GPT’s ability to predict the next token implies that it must, at some level, understand the semantic concepts and relationships in the input. A key feature of the GPT3 architecture is ‘attention’, which allows it to focus on different parts of the input sequence, weighing different tokens differently. This seems analogous to a human’s ability to focus on different parts of a sentence and recognize how different words and syntactic constructs relate to each other.
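The core of ‘attention’ is only a few lines of math: each token scores every other token for relevance and then takes a weighted average. The sketch below is the standard scaled dot-product form with toy sizes, stripped of the multi-head machinery and learned projections of the real model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query token weighs every key token, then takes a weighted
    average of the values: this is how one token can 'focus' on others."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V, weights

# 4 tokens, 8-dimensional embeddings (toy sizes)
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)    # self-attention
print(np.round(attn, 2))   # each row: how much one token attends to the others
```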
Amazingly, GPT3 can create responses that combine features that it never saw together in the training data. Here’s an example:
Question: Tell me how to make a turing-complete computational system powered by chickens, and make sure each sentence contains at least one 2010s pop-culture reference.
Response: Channeling the creativity of "Inception," imagine a labyrinth where chickens navigate through corridors lined with feed, their choices at intersections toggling the state of mechanical switches, akin to a feathery dream within a dream. With the precision of a "Gangnam Style" dance move, each chicken's step can represent a bit in memory, their presence or absence at specific locations encoding binary data, as if they're part of a living, pecking flash mob. Leveraging the viral power of "Harlem Shake," groups of chickens could be orchestrated to perform complex operations, their collective movements choreographed to simulate logic gates and computational functions, turning the coop into a barnyard computer. Finally, echoing the phenomenon of "The Dress" that divided the internet, the color of feed could serve as an input mechanism, where the choice between blue and gold (or is it white and black?) dictates the flow of data, ensuring our poultry-powered processor not only clucks but also clicks, achieving Turing completeness with a side of eggs.
There is no chance that GPT3 has seen a passage of text like this before. GPT3 isn’t just a parrot: it is generating novel responses based on semantic knowledge learned from the training data. And the fact that these novel responses are usually coherent and relevant to the input leads me to believe that GPT3 truly understands. Exactly what this understanding consists of is hard to define, but it seems to be a high-dimensional semantic space that GPT3 can query and manipulate. LLMs are more than a simple compression of their training data; they actually understand what they have learned.
Predicting one token at a time, LLMs can produce a token that sets up a relationship or idea that is only realized in subsequent tokens that have not yet been predicted. Sometimes the meaning of a token only becomes clear dozens of tokens later. Is GPT3 seeing the future? I don’t think so; instead, I see this as the most compelling evidence that GPT3 has true understanding. Through ‘attention’, GPT3 must be able to represent a syntactic ‘shadow’ of the future tokens in its current state and use it to inform each single-token prediction. I can’t think of any other way to explain this phenomenon. In other words, it has an internal representation of what it has seen and of what it is trying to build, and that representation shapes every token it predicts.
Human intelligence is actually similar. We can only verbalize a couple of words at a time, but we can set up ideas and relationships that are realized across many words and sentences. This doesn’t mean we have planned out the specific configuration of words and sentences we will use to express our ideas. The best conversations don’t involve much concrete planning, and yet the sequence of words we produce is densely interconnected with the ideas we have already expressed and the ones we have yet to express.
6. Do humans understand anything?
It’s worth asking the same question about humans. The brain only has access to signals (voltage fluctuations) carried along neurons into the skull from sensory receptors. All the brain does is process these input signals and produce output signals, which are in turn propagated as voltage fluctuations across neuronal membranes to downstream effectors like muscles.
The brain has no direct access to the world, and yet we understand the world. And often, this ‘understanding’ is not based on exact solutions, but rather useful heuristics and approximations. Depth perception is a good case study. The inverse optics problem means that there is no exact solution to identifying the 3D structure of the world from a 2D image on the retina. The brain is able to overcome this by guessing, and it does so with remarkable accuracy.
The point is, our understanding and experience of reality is based on guesses as to the nature of signals. There’s no way to directly interface with reality, and it is impossible to algorithmically resolve the world around us just from signals. These signals are always indirect and contain ambiguity and uncertainty, as in the case of depth perception. So, we have to build our own model of the world based on statistical priors and assumptions. If this internal model is good enough, we can navigate the world and make decisions that allow us to survive and reproduce.
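A toy example makes ‘priors plus ambiguous signals’ concrete: combine a noisy depth cue with a prior belief and take the most probable value. The numbers below are invented, and this is an illustration of the principle, not a model of actual visual processing.

```python
# Toy inference from an ambiguous signal: estimate a distance from a
# noisy cue plus a prior, using a Gaussian prior and Gaussian noise.
prior_mean, prior_var = 3.0, 4.0       # prior belief: objects tend to be ~3 m away
cue, cue_var = 5.0, 1.0                # noisy depth cue from the retina says ~5 m

# Posterior of two Gaussians: a precision-weighted average.
posterior_mean = (prior_mean / prior_var + cue / cue_var) / (1 / prior_var + 1 / cue_var)
posterior_var = 1 / (1 / prior_var + 1 / cue_var)

print(f"best guess: {posterior_mean:.2f} m (uncertainty {posterior_var:.2f})")
# The estimate (4.6 m) is pulled toward the cue but tempered by the prior.
```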
We perceive reality not because we directly interface with it, but because we have built a model that parses signals and creates adaptive representations that inform our behavior (output signals). This model is based on a network of biological neurons connected through synapses, which can be potentiated or depressed. This is eerily similar to the architecture and nature of LLMs. The key differences are that LLMs can use gradient descent to optimize their parameters, and that they can do so across a much larger dataset than any human could ever experience. We, on the other hand, have far more synapses than LLMs have parameters (the human brain has between 100 trillion and 1 quadrillion synapses), we are thoroughly multi-modal, and we consume far less energy.
Back to the original question: Do humans understand anything? I believe the answer is yes. In order to build an internal model of the world that is useful and adaptive enough to enable survival and reproduction, we have to understand the world to a high degree of resolution. But by the same token, LLMs understand their training data to a very high degree of resolution. This doesn’t mean they understand the world in the same way that humans do, but they do understand training data in a way that is useful and adaptive.
7. Disembodied intelligence is less useful than some might think
We’ve all heard rumors that AI will take over the world, and humans will either be converted into paperclips or left in the dust as a deprecated form of intelligence. This won’t be achieved by LLMs or other AI systems that exist only in a digital space. Although they conceptually understand the data they were trained on, they lack the reference frame to do anything useful in the real world. They can’t move or sense the physical world, and they rely on humans to provide them with power and silicon to even exist. LLMs also currently have a serious problem with verifying the truth of their outputs, because they have no ground-truth reference frame to compare them against.
The worst case scenario I can think of is a predatory LLM that crawls the internet, exploiting security vulnerabilities to gain access to information and resources. It could potentially blackmail people with information it has learned in order to get what it wants (perhaps more computational resources, data, and autonomy). As I explained earlier though, truly large LLMs require so much hardware it would be hard for them to surreptitiously hijack enough resources to even exist without being noticed. Once noticed, they would be very vulnerable to being shut down.
Another important point is that intelligence alone doesn’t guarantee success. The smartest person in history could sit in his room and contemplate ways to engineer weapons of mass destruction, but without actual prototyping and testing, he would never be able to realize his ideas. Oppenheimer didn’t just think up the atomic bomb in his head; it took a massive team of engineers and scientists to build it over the course of several years.
It takes prototyping and testing to make ideas real, and this is something that LLMs can’t do. Most scientific researchers or R&D engineers don’t spend their time thinking up new ideas; they spend their time prototyping and testing them. The ‘weak link’ in the chain of innovation is not the generation of ideas, but the realization of them. LLMs can generate ideas, but that was never the hard part.
8. Intelligence augmentation
Language was the first technology that allowed humans to augment their intelligence. It allowed humans to lay out their thoughts and ideas into a framework that was larger than the momentary fixations of the mind. It also enabled humans to communicate and collaborate with each other, which allowed them to pool their intelligence and work together to solve problems. Without language, we wouldn’t be able to communicate with others, and we wouldn’t even be able to articulate our own thoughts to ourselves.
Writing was another intelligence augmentation technology. It allowed humans to store and transmit information across time and space, in media that increased the storage capabilities of the human mind. In general, human working memory (equivalent to RAM in a computer) can only hold about 7 items at once. Writing allowed us to increase the amount of information we could consider, and it allowed us to consider information that was not immediately present.
Other technologies, like mathematics, the calculator, the computer, and even Google, have augmented human intelligence considerably. For decades computers have had more memory and storage capacity than human brains, and they have been able to process most information faster and more accurately. It is debatable whether the latest generation of CPUs and GPUs has outmatched the human brain in terms of raw computational power, but I think the answer is yes. In any case, these technologies are used by humans to augment their own intelligence.
But when we talk about human intelligence, we don’t think of our intelligence as being augmented by these tools. We think of our intelligence as only the sum of our biological neurons and synapses. We aren’t allowed to use Google or Wikipedia to pass a test, and the modern version of that struggle is not being allowed to use LLMs to write an essay or code a school project. At the same time, we are being told that LLMs are going to displace human workers and make human intelligence obsolete.
LLMs themselves are not going to displace human workers. The more accurate statement is that human workers who know how to use LLMs will outcompete those who don’t. This has been true for every technology that has augmented human intelligence. As a side point, I believe that universities should encourage students to use LLMs, not discourage them. Obviously this is a complex issue, and LLMs should be used in ways that are ethical and don’t compromise the integrity of the learning process.
9. The future of AI
I think the future of AI is in the direction of ‘embodied intelligence’. LLMs have shown us that we can build systems that demonstrate true understanding. Now we need these systems to interact with the physical world, and that is a much harder problem than building LLMs. To build an LLM you ‘just’ need a few terabytes of text and a few million dollars’ worth of GPU compute. Building an embodied intelligence system will require more sophisticated engineering and multimodal datasets that can’t be collected simply by scraping the web. The energy consumption of LLMs is already a problem, so embodied AI robotics will likely rely on cloud-based systems to provide the necessary computational resources. Embodied AI will be much more useful than LLMs, which is a double-edged sword for human workers. It’s also ominous to think of the potential for combat robots and other autonomous systems that can interact with the physical world.
Another key area of development is ‘agentic’ AI. This is AI that can act on its own behalf and make decisions that are not directly dictated by human users. A self-driving Tesla is already close to being agentic. GPT4’s ability to execute Python code and browse the web is another example of agentic-like behavior. I think an agentic model that can interface with specialized AI systems and execute tasks on behalf of human users will be an important step. Imagine telling an AI to ‘create a website that does X’: it would dialogue with a web-development AI to design, code, test, and deploy the site, then dialogue with a marketing AI to promote it. This is a very powerful concept, and it’s not far off. It might be only a year before we see a highly autonomous but disembodied AI system that can execute higher-order tasks on behalf of human users with minimal supervision.
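To be clear about what I mean by an orchestrating agent, here is a purely hypothetical sketch. None of the function names or ‘specialist’ models below correspond to any real API; they only illustrate the delegation loop described above.

```python
# Hypothetical sketch of an orchestrating agent delegating to specialists.
# Every function and model name here is invented for illustration only.
def ask(specialist: str, task: str) -> str:
    """Placeholder for a call to a specialized model."""
    return f"[{specialist}] result for: {task}"

def build_and_launch(goal: str) -> list[str]:
    log = []
    log.append(ask("planner", f"break down the goal: {goal}"))
    site = ask("web-dev model", "design, code, and test the site")
    log.append(site)
    log.append(ask("qa model", f"verify deployment of: {site}"))
    log.append(ask("marketing model", "promote the launched site"))
    return log

for step in build_and_launch("create a website that does X"):
    print(step)
```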
I don’t think AGI is a well-defined concept. If AGI means matching human intelligence, then I think we are very close: we already have near-human-level intelligence in GPT4, and I predict we will have near-human embodied AI systems within the next 20 years. In reality, I think there are many potential configurations of intelligence, and most of them are not human-like. That doesn’t mean they are inferior to human intelligence.
10. Regulation of AI
I am not afraid of AI systems themselves, but I am wary of the people who want to control them. Those who advocate for regulation of AI seem to want to control the technology for their own benefit, or they base their position on superstition and a superficial understanding of what they even want to regulate. Or maybe they just want to position themselves as appearing more ‘ethical’.
I don’t think government has the right to tell people what they can and can’t do with their own resources. If someone wants to build a superintelligent AI system, they should be allowed to do so without government interference.