Summary of Andrej's LLM Sessions for a General Audience
After watching three videos by Andrej Karpathy[1], I have gained new perspectives on current LLMs.
These were my misunderstandings and questions about LLMs before watching:
- Can LLMs truly think? I doubted this when Anthropic released the Claude 3.5 Sonnet model. There is a script called Thinking Claude that makes Claude explicitly show, in natural language, its thinking process while answering questions. Though it appeared as if Claude could think step by step and verify its results, I was hesitant about the nature of the chatbot under the hood. What if it just makes things up and performs the chain of thought that users think it is supposed to have?
- Why does Claude hallucinate and behave inconsistently? It takes so much calibration to make a simple app work.
- How does thinking help chatbots improve their answers?
- Claude seems more human-like to me than ChatGPT.
- How does ChatGPT work?
- Cognitive ability: are LLMs self-conscious?
Conclusions of the intro sessions:
- Next-word prediction: Building a chatbot breaks down into two main stages: pre-training and post-training. In the first stage, a very large collection of internet documents is extracted and processed to train a base model. The problem it solves is very simple: predicting the next word. This stage is the most time- and computation-consuming, taking many months or even a year, and it outputs a base model containing millions to billions of parameters, depending on how much data is fed in and on the computational capacity of the chips/hardware. In the second stage, we make the base model talk and answer questions, so that it functions as an assistant model. This stage is trained on Q&A documents, first collected from human labelers and later also generated by the model itself. The assistant model can then answer questions in the Q&A style. By nature, an LLM merely predicts the next word, one word at a time (a toy sketch of this loop follows below).
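To make the next-word idea concrete, here is a minimal, deliberately silly sketch of the loop: a "model" that only knows how often one word followed another in a tiny pretend corpus. A real LLM replaces the lookup table with billions of learned parameters, but the generation loop is the same one-word-at-a-time process.

```python
import random

# Toy "base model": next-word counts gathered from a tiny pretend corpus.
# A real LLM does the same job with billions of parameters instead of a table.
counts = {
    "the": {"cat": 3, "dog": 2},
    "cat": {"sat": 4, "ran": 1},
    "dog": {"ran": 3, "sat": 1},
    "sat": {"down": 5},
    "ran": {"away": 5},
}

def next_word(word):
    """Sample the next word in proportion to how often it followed `word`."""
    words, weights = zip(*counts[word].items())
    return random.choices(words, weights=weights)[0]

def generate(prompt, n_words=4):
    words = [prompt]
    for _ in range(n_words):
        if words[-1] not in counts:
            break  # nothing ever followed this word in the "training data"
        words.append(next_word(words[-1]))
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat down"
```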
- Context windows and tokens:
- Token: Today's language models take only tokens as input. All forms of human-readable files such as PDFs, DOCs, TXTs, Excel files, etc., are transcribed into one-dimensional token sequences. The model understands the world as tokens, captures the statistical relationships between them, and produces the most statistically likely next tokens. Images, videos, and audio are likewise converted into token sequences (a small tokenization sketch follows after this list).
- Lossy simulation: A language model only has direct access to a limited context window. It has two kinds of memory: base memory and working memory. Base (knowledge) memory is the huge set of parameters derived from internet data; it is a vague, lossy recollection of that data and can be outdated, because the pre-training stage is expensive and is rarely rerun in full. Working memory is whatever is loaded into the context window, which the model can access directly and quickly, and it usually provides more detail than making the model recall tokens from its base memory. When you ask a model to summarize a book chapter, it's best to paste in the text of the chapter rather than just asking for the summary by name.
- Language models see the world as tokens, and each token passes through a finite number of layers of computation, so it is beneficial to spread the computational work across many tokens. If we force the model to pack an entire calculation into a single token, that fixed per-token budget of computation is easily exceeded and the answer is likely to be wrong. For example, take a math question such as "What's the price of an apple if we spend 15 euros on 2 oranges (whose price is 3 euros each) and 3 apples?" Solution 1 states the price of an apple first and then shows the analysis, while Solution 2 does the inverse. Solution 2 is better because it doesn't make one token (the price of an apple) carry the entire calculation; the intermediate steps are spread across many tokens.
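As a concrete illustration of "everything becomes tokens," here is a small sketch using the open-source tiktoken tokenizer (pip install tiktoken); the encoding name and the context-window limit below are illustrative choices, not tied to any particular chatbot.

```python
import tiktoken

# One of the publicly documented encodings; other models use different tokenizers.
enc = tiktoken.get_encoding("cl100k_base")

chapter = "Call me Ishmael. Some years ago..."  # imagine the full chapter text here
tokens = enc.encode(chapter)

print(tokens[:8])          # the model only ever sees these integer ids
print(enc.decode(tokens))  # decoding gives the original text back

# Before asking for a summary, check that the chapter actually fits in the
# model's context window (the limit below is just an illustrative number).
CONTEXT_WINDOW = 128_000
print(f"{len(tokens)} tokens, fits: {len(tokens) <= CONTEXT_WINDOW}")
```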
- Thinking and self-consciousness: We must decide what constitutes thinking or self-consciousness before judging a language model. In Andrej's view, intelligence is information processing and reorganization. As mentioned above, LLMs are just next-word prediction engines computed by large neural networks, so they are not self-conscious and have no persistent persona. However, they can "think" in the sense of producing statistical results sampled from large, unverified portions of internet documents. What an LLM gives you is not its own opinions, thoughts, or ideas, but a reflection of human intelligence found on the internet, possibly resembling an expert in a certain domain.
- Neural networks: Don't overthink this concept. It is what I was eager to understand when deciding on a major in college: I used to believe there were deep, inherent connections between biology and computer science, an idea I had mistakenly picked up from some "famous" journal, and I realized it was wrong shortly after being admitted to the college of biology. Artificial neural networks are just a collection of mathematical formulas with knobs (parameters) to tweak; they borrow their form from biological neural networks but are otherwise not specifically related to biological structures or processes. Through a neural network, the model tweaks its parameters to capture the characteristics of the pre-processed data extracted from internet documents (a toy example follows below).
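Here is a toy illustration of "formulas and knobs": a two-layer network in plain NumPy. The sizes and random numbers are arbitrary; the point is only that the whole thing is matrices of adjustable numbers plus a simple formula, nothing biological.

```python
import numpy as np

# A tiny neural network: just adjustable numbers ("knobs") and a formula.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # 16 knobs
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # 5 more knobs

def forward(x):
    hidden = np.maximum(0, x @ W1 + b1)  # linear step + ReLU nonlinearity
    return hidden @ W2 + b2              # another linear step

print(forward(np.array([1.0, 2.0, 3.0])))
```

Training simply means nudging these knobs, millions or billions of them in a real model, until the outputs match the data.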
- Transformer: This is one type of neural network architecture that has proven to work well across many training setups. And yes, it is built from mathematical formulas; its core operation, attention, fits in a few lines (see the sketch below).
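For the curious, this is a bare-bones sketch of the transformer's central formula: single-head attention without masking or the learned projection matrices, just to show it really is a short mathematical expression.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # how much each token attends to the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the scores
    return weights @ V

# 5 tokens, each represented by an 8-dimensional vector
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
print(attention(x, x, x).shape)  # (5, 8): every token's vector gets updated using the others
```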
- Hallucinations: Each time you ask a model a question, you may receive a different answer. Going back to the pre-training stage: the base model is just a large set of parameters representing a portion of internet documents. When you ask a question, it is converted into tokens, and whether the model actually "knows" the correct answer is irrelevant; it simply samples a statistically likely continuation from its base memory, which is why it can state wrong information with complete confidence (the sketch below shows why repeated runs differ).
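A tiny sketch of why the same question can produce different answers: the model assigns a probability to every possible next token and then samples from that distribution, so repeated runs can land on different continuations. The tokens and scores below are made up.

```python
import numpy as np

rng = np.random.default_rng()
tokens = ["Paris", "London", "Rome", "Madrid"]
logits = np.array([3.0, 1.0, 0.5, 0.2])        # made-up scores for the next token

probs = np.exp(logits) / np.exp(logits).sum()   # softmax turns scores into probabilities
for _ in range(3):
    print(rng.choice(tokens, p=probs))          # usually "Paris", occasionally something else
```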
- Thinking models: "Thinking" can be considered the next stage of assistant models. OpenAI hides the chains of thought of its reasoning models, partly because of the risk of distillation through imitation of that reasoning. For simple factual questions, a thinking model might be overkill.
- Tool use: it's a two-sided story.
- Language models are not good at math, at least not on their own. The reason is simple: if some internet documents claim that 2 + 2 is 3.9, that will skew the statistically sampled result. Basic tools such as Python interpreters and calculators are therefore made available to language models, and search tools give them up-to-date information on specific topics. These tools feed new information into the model's context window as tokens (a sketch of this loop follows after this list).
- From a user's perspective, we should take advantage of various LLMs to work more efficiently on everyday tasks. As long as we are aware of the strengths and weaknesses of language models, they can be more helpful than expected. They are good at summarizing chapters, books, and papers when given the exact text of the content. Different models have unique features: Claude has Artifacts, which can run app prototypes for you, while ChatGPT can use search tools and serve as a data-analysis tool.
- Answers from language models need to be fact-checked. They may hallucinate all kinds of information and alter your input data. If you want to do something with a picture, it's best to transcribe the image into text and check whether the model interpreted it correctly. Model outputs, whether cited sources or values estimated from a diagram, need to be verified, and code they provide, especially code you don't fully understand, is better checked line by line. I personally find it annoying that model-generated code often contains small bugs: they may get a basic version right, but the debugging time needed to reach something polished can be considerable.
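To illustrate the tool-use loop mentioned above, here is a minimal sketch with an invented TOOL:CALC convention and a tiny safe calculator; real systems use structured function calls, but the flow is the same: the model asks for a tool, the host runs it, and the result comes back into the context window as ordinary tokens. The arithmetic reuses the apple example from earlier, and the model reply is hard-coded for illustration.

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr):
    """Safely evaluate a simple arithmetic expression like '(15 - 2 * 3) / 3'."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

context = ["User: 2 oranges cost 3 euros each, the total is 15 euros, and there are 3 apples. Price per apple?"]
model_reply = "TOOL:CALC (15 - 2 * 3) / 3"   # pretend the model asked for the calculator

if model_reply.startswith("TOOL:CALC"):
    result = calculator(model_reply.removeprefix("TOOL:CALC").strip())
    context.append(f"Tool result: {result}")  # new tokens the model can now read

print(context)  # ['User: ...', 'Tool result: 3.0']
```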
- LLM OS: LLMs are not just word generators but the kernel process of an emerging kind of operating system. We can build peripheral devices and browsers around an LLM and make better use of storage, file-system embeddings, RAM, and so on. On the other hand, doesn't the phrase "a new computation stack" come up a bit too often?
- RLHF: Reinforcement learning from human feedback is a practical way to tackle questions in unverifiable domains. It uses a neural-net simulator of human preference as a reward model. The upside is that it narrows the discriminator-generator gap, since it is much easier to judge answers than to generate them. The downside also comes from this lossy simulation of humans: the reward model can be misleading, and the main model can learn to game it and drift away from what people actually want (a toy sketch follows below).
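Finally, a toy sketch of the RLHF idea: a reward model is fitted to human preference pairs so that preferred answers score higher, and the generator is then steered toward answers the reward model likes. The two hand-written features and the tiny preference dataset here are invented purely for illustration; a real reward model is itself a large neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # the reward model's learnable weights

def features(answer):
    # Deliberately silly features: answer length and a politeness keyword.
    return np.array([len(answer.split()) / 10.0, float("thank" in answer.lower())])

def reward(answer):
    return float(w @ features(answer))

# Human preference data: (preferred answer, rejected answer)
pairs = [("Thanks for asking! The capital of France is Paris.", "idk"),
         ("Thank you, here is a short summary of the chapter...", "read it yourself")]

# Fit the reward model so preferred answers score higher (pairwise logistic loss).
for _ in range(200):
    for good, bad in pairs:
        margin = reward(good) - reward(bad)
        grad = -(1 - 1 / (1 + np.exp(-margin))) * (features(good) - features(bad))
        w -= 0.1 * grad

# The generator side: among candidate answers, prefer the one the reward model likes.
candidates = ["idk", "Thank you! Here is the summary you asked for..."]
print(max(candidates, key=reward))
```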