The following are my notes on this video by Andrej Karpathy, which gives a high-level overview of what LLMs are and how they're trained. They are by no means meant to replace the video itself, and I recommend giving it a watch.
The idea of an LLM can be distilled into two files
- The model parameters
- Code to run the forward inference of the NN (~500 lines of C code)
Together you could take these and run the model locally if desired
Model training tends to be the more involved and computationally intensive part, as opposed to inference, which is comparatively simple and well understood: a forward pass through the network.
What is the NN doing?
Given a sequence of words the neural network produces a prediction of which word comes next.
- Turns out that next word prediction forces the neural network to learn a lot about the world.
- If you’re the NN and your objective is to predict the next word, then to do well on a given context you have to learn a ton about the world, and that knowledge is ultimately compressed into the network’s parameters.
- The neural network in a sense “dreams” internet documents. If you ask it to produce some text, it doesn’t parrot the text it has seen; it mimics the form, populated with content it thinks is correct. This is where hallucinations come from.
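To make next-word prediction concrete, here is a minimal sketch of sampling one next token (my own illustration, not from the video), assuming the Hugging Face transformers library and the small public gpt2 checkpoint as a stand-in base model:

```python
# Minimal sketch: the model assigns a probability to every token in its vocabulary,
# and repeatedly sampling from that distribution is how it "dreams" plausible text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]               # scores for the next token
probs = torch.softmax(logits, dim=-1)               # distribution over the whole vocabulary
next_id = torch.multinomial(probs, num_samples=1)   # sample one token from it
print(tokenizer.decode(next_id))
```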
How does the Neural Network work?
- We train a transformer neural network architecture
- We know all the mathematical operations that happen at each step in the network, but the billions of parameters are dispersed throughout it.
- We know how to iteratively adjust them to make the network better at prediction (backprop).
- We can measure that this works, but we don’t really know how the billions of parameters collaborate to do it.
- Some work is being done here in a field called interpretability.
- We know that, at some level, LLMs build and maintain some kind of knowledge database, but it is a bit strange and imperfect
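As a toy illustration of “iteratively adjust the parameters” (my own sketch with made-up sizes, not anything from the video), here is a single backprop and gradient-descent step on a tiny next-token predictor in PyTorch:

```python
import torch
import torch.nn.functional as F

vocab, dim = 100, 16
emb = torch.nn.Embedding(vocab, dim)           # token embeddings
head = torch.nn.Linear(dim, vocab)             # maps a context summary to next-token scores
opt = torch.optim.SGD(list(emb.parameters()) + list(head.parameters()), lr=0.1)

context = torch.tensor([[5, 42, 7]])           # token ids of the context
target = torch.tensor([13])                    # the "correct" next token
logits = head(emb(context).mean(dim=1))        # crude context summary -> next-token scores
loss = F.cross_entropy(logits, target)         # how wrong the prediction is
loss.backward()                                # backprop: a gradient for every parameter
opt.step()                                     # nudge the parameters to predict better
```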
Training
There are two stages to training an LLM
1. Pretraining
The general idea is to take a large chunk of the internet (~10TB of text), take a cluster of GPUs (~6,000) and run for ~12 days to obtain parameters that can be thought of as a zip file of that chunk of the internet. Whereas a zip file is lossless, this compression into parameters is lossy.
- At this point, once pretraining is done, the model can produce a corpus of text in some mimicked form; essentially it just produces these “documents” or webpages.
- In a sense it is an internet document sampler: it doesn’t actually answer inputs, it just produces documents/text by sampling from a distribution
- The knowledge from the training data is distilled in the model
2. Finetuning
- The assistant aspect comes from fine-tuning the model. We would like to ask questions and have the assistant give us answers
- The process behind finetuning keeps the optimization identical to the pretraining step, but swaps out the dataset we train on
- Instead of training on the internet corpus/documents, we swap in data we collect manually. Humans are brought in to create it
Training example format:
<user>
prompt
<assistant>
- Humans at this point will fill in the ideal response to answer the prompt
- We care more about quality over quantity here
- After finetuning we have an assistant
- The model now understands that it should produce answers in a format that is helpful for the prompt, while still using all the knowledge it developed during the pretraining phase
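For concreteness, here is a sketch of what one labeled example could look like and how it might be turned into fine-tuning text (the field names and the loss-masking detail are my own illustrative assumptions, not something specified in the video):

```python
# Hypothetical labeled example in the <user>/<assistant> format shown above.
example = {
    "prompt": "<user>\nCan you explain why the sky is blue?\n<assistant>\n",
    "ideal_response": "The sky looks blue because shorter (blue) wavelengths of "
                      "sunlight are scattered more strongly by the atmosphere ...",
}

# The optimization is the same next-token prediction as in pretraining; only the data
# changes. A common convention (an assumption here) is to compute the loss only on the
# response tokens, so the model learns to answer rather than to echo the prompt.
training_text = example["prompt"] + example["ideal_response"]
```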
Summary
Stage 1: Pretraining (~annually)
- Download ~10TB of text
- Get a cluster of GPUs
- Compress text into a neural network with training
- Obtain base model
Stage 2: Finetuning (~every week)
- Write labelling instructions (how should your assistant behave, personality, attributes, etc.)
- Hire people (or use scale.ai) to collect ~100k high-quality ideal Q&A responses and/or comparisons
- Finetune base model on this data (takes ~1 day)
- Obtain assistant model
- Run a lot of evaluations
- Deploy
- Monitor, collect misbehavior, go to step 1.
The way we fix misbehaviour is to take the conversation with the incorrect answer, have a person fill in the correct response, and then fine-tune on it.
Llama by Meta releases both the base model and the assistant model, so we can take the base model with all its knowledge and then fine-tune it for our own specialized use case, whatever that may be.
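As an illustration of what that looks like in practice, here is a sketch assuming the Hugging Face transformers library (the Llama checkpoints are gated and require accepting Meta’s license):

```python
from transformers import AutoModelForCausalLM

# The released base model: a raw internet-document sampler full of knowledge.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# The released assistant model: the same network after Meta's own fine-tuning.
chat = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Starting from `base`, you could fine-tune on your own specialized Q&A data
# (for example with the peft/LoRA or trl libraries) to get a custom assistant.
```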
Stage 3: Finetuning (optional)
- In this stage we use a second kind of label: comparisons
- It is often much easier to compare answers rather than writing new answers
- We take n candidate responses to a prompt and a human labeler picks the preferred one
- This is called RLHF (Reinforcement Learning from Human Feedback)
- Stage three takes these comparisons and trains the model further on them
- Increasingly labeling is a human-machine collaboration
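A typical way such comparisons get used (a hedged sketch of the standard reward-model recipe, not necessarily the exact method any given lab follows) is a pairwise loss that pushes the score of the preferred answer above the rejected one:

```python
import torch
import torch.nn.functional as F

def comparison_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: maximize the probability that the
    # human-preferred answer scores higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to two (chosen, rejected) answer pairs.
loss = comparison_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, 0.9]))
# The trained reward model then supplies the reward signal for further RL training.
```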
LLM Scaling Laws
- Performance of LLMs is a smooth, well-behaved, predictable function of:
- N: Number of parameters in the network
- D: the amount of text we train on
- We expect more intelligence “for free” just by scaling up N and D
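One commonly cited functional form for this relationship (from the “Chinchilla” scaling-law paper, Hoffmann et al. 2022; not something stated in the video) models the loss as a sum of terms that shrink as N and D grow:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here E is the irreducible loss of the data and A, B, α, β are fitted constants; lower loss corresponds to better next-word prediction.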
Tool Use
If you ask ChatGPT a question that requires quite specific information about a world event or something similar, it will need to do a search. The model emits a special output that signals a search should take place; the search runs and, much like a human scrolling through links, the results from various pages are aggregated and fed back into the model, which can then synthesize a response.
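Roughly, the control flow looks like the sketch below (the special-token names and the search_engine function are hypothetical stand-ins for illustration, not a real API):

```python
def run_assistant(prompt: str, model, search_engine) -> str:
    output = model.generate(prompt)
    if output.startswith("<|search|>"):                    # model asks for a web search
        query = output.removeprefix("<|search|>").strip()
        results = search_engine(query)                     # aggregate text from the result pages
        # Feed the gathered results back in so the model can synthesize a final answer.
        output = model.generate(prompt + "\n<|search_results|>\n" + results)
    return output
```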
System 1 vs System 2
Humans have 2 systems of thinking
- Instinctual (2 + 2)
- Deep thought (17 * 24)
One of those problems requires us to slow down a little and work out the answer by thinking and reasoning through it.
Currently LLMs only have System 1, but work is being done on what it might mean for an LLM to have System 2 behaviour.
Self Improvement
- Inspired by AlphaGo, which had 2 stages:
- learn by imitating expert human players
- learn by self-improvement (reward = win the game) in a sandboxed environment with a very clearly defined goal in which it kept playing against itself
Can we bring this second stage to LLMs? If we only train on human data, then the model's knowledge is bounded by human knowledge.
- The challenge in doing this is the lack of a reward criterion in the language domain. How does an LLM know whether what it sampled is good or not?