The following are my notes on this video by Andrej Karpathy, which gives a high-level overview of what LLMs are and how they're trained. They are by no means meant to replace the video itself, and I recommend giving it a watch.
The idea of an LLM can be distilled into two files
- The model parameters
- Code to run the forward inference of the NN (~500 lines of C code)
Together you could take these and run the model locally if desired
Model training tends to be the more involved and computationally intensive part, as opposed to inference, which is comparatively simple and well understood: a forward pass through the network.
What is the NN doing?
Given a sequence of words the neural network produces a prediction of which word comes next.
- Turns out that next word prediction forces the neural network to learn a lot about the world.
- If you’re the NN and your objective is to predict the next word, then to do well on a given context you have to learn a ton about the world, and that knowledge is ultimately compressed into the network’s parameters.
- The neural network in a sense “dreams” internet documents. If you ask it to produce some text, it doesn’t parrot the text it has seen; it mimics the form, populated with content it thinks is correct. This is where hallucinations come from.
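To make next-word prediction concrete, here is a minimal sketch of sampling one next token (my own illustration, not from the video), assuming the Hugging Face transformers library and the small public gpt2 checkpoint as a stand-in base model:

```python
# Minimal sketch: the model assigns a probability to every token in its vocabulary,
# and repeatedly sampling from that distribution is how it "dreams" plausible text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]               # scores for the next token
probs = torch.softmax(logits, dim=-1)               # distribution over the whole vocabulary
next_id = torch.multinomial(probs, num_samples=1)   # sample one token from it
print(tokenizer.decode(next_id))
```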
How does the Neural Network work?
- We train a transformer neural network architecture
- We know all the mathematical operations that happen at each step in the network, but the billions of parameters are dispersed throughout it.
- We know how to iteratively adjust them to make the network better at prediction (backprop).
- We can measure that this works, but we don’t really know how the billions of parameters collaborate to do it.
- Some work is being done here in a field called interpretability.
- We know that, at some level, LLMs build and maintain some kind of knowledge database, but it is a bit strange and imperfect
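As a toy illustration of “iteratively adjust the parameters” (my own sketch with made-up sizes, not anything from the video), here is a single backprop and gradient-descent step on a tiny next-token predictor in PyTorch:

```python
import torch
import torch.nn.functional as F

vocab, dim = 100, 16
emb = torch.nn.Embedding(vocab, dim)           # token embeddings
head = torch.nn.Linear(dim, vocab)             # maps a context summary to next-token scores
opt = torch.optim.SGD(list(emb.parameters()) + list(head.parameters()), lr=0.1)

context = torch.tensor([[5, 42, 7]])           # token ids of the context
target = torch.tensor([13])                    # the "correct" next token
logits = head(emb(context).mean(dim=1))        # crude context summary -> next-token scores
loss = F.cross_entropy(logits, target)         # how wrong the prediction is
loss.backward()                                # backprop: a gradient for every parameter
opt.step()                                     # nudge the parameters to predict better
```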
Training
There are two stages to training an LLM
1. Pretraining
The general idea is to take a large chunk of the internet (~10TB of text), take a cluster of GPUs (~6,000) and run for ~12 days to obtain parameters that can be thought of as a zip file of that chunk of the internet. Whereas a zip file is lossless, this compression into parameters is lossy.
- At this point, once pretraining is done, the model can produce a corpus of text in some mimicked form; essentially it just produces these “documents” or webpages.
- In a sense it is an internet document sampler: it doesn’t actually answer inputs, it just produces documents/text by sampling from a distribution
- The knowledge from the training data is distilled in the model
2. Finetuning
- The assistant aspect comes from fine-tuning the model. We would like to ask questions and have the assistant give us answers
- The process behind finetuning keeps the optimization identical to the pretraining step, but swaps out the dataset we train on
- Instead of training on the internet corpus/documents, we swap in data we collect manually. Humans are brought in to create it
Training example format:
<user>
prompt
<assistant>
- Humans at this point will fill in the ideal response to answer the prompt
- We care more about quality over quantity here
- After finetuning we have an assistant
- The model now understands that it should produce answers in a format that is helpful for the prompt, while still using all the knowledge it developed during the pretraining phase
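For concreteness, here is a sketch of what one labeled example could look like and how it might be turned into fine-tuning text (the field names and the loss-masking detail are my own illustrative assumptions, not something specified in the video):

```python
# Hypothetical labeled example in the <user>/<assistant> format shown above.
example = {
    "prompt": "<user>\nCan you explain why the sky is blue?\n<assistant>\n",
    "ideal_response": "The sky looks blue because shorter (blue) wavelengths of "
                      "sunlight are scattered more strongly by the atmosphere ...",
}

# The optimization is the same next-token prediction as in pretraining; only the data
# changes. A common convention (an assumption here) is to compute the loss only on the
# response tokens, so the model learns to answer rather than to echo the prompt.
training_text = example["prompt"] + example["ideal_response"]
```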
Summary
Stage 1: Pretraining (~annually)
- Download ~10TB of text
- Get a cluster of GPUs
- Compress text into a neural network with training
- Obtain base model
Stage 2: Finetuning (~every week)
- Write labelling instructions (how should your assistant behave, personality, attributes, etc.)
- Hire people (or use scale.ai) to collect ~100k high-quality ideal Q&A responses and/or comparisons
- Finetune base model on this data (takes ~1 day)
- Obtain assistant model
- Run a lot of evaluations
- Deploy
- Monitor, collect misbehavior, go to step 1.
The way we fix misbehaviour is to take the conversation with the incorrect answer, have a person fill in the correct response, and then fine-tune on it.
Llama by Meta releases both the base model and the assistant model, so we can take the base model with all its knowledge and then fine-tune it for our own specialized use case, whatever that may be.
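As an illustration of what that looks like in practice, here is a sketch assuming the Hugging Face transformers library (the Llama checkpoints are gated and require accepting Meta’s license):

```python
from transformers import AutoModelForCausalLM

# The released base model: a raw internet-document sampler full of knowledge.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# The released assistant model: the same network after Meta's own fine-tuning.
chat = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Starting from `base`, you could fine-tune on your own specialized Q&A data
# (for example with the peft/LoRA or trl libraries) to get a custom assistant.
```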
Stage 3: Finetuning (optional)
- In this stage we use a second kind of label: comparisons
- It is often much easier to compare answers rather than writing new answers
- We take n candidate responses to a prompt and a human labeler picks the preferred one
- This is called RLHF (Reinforcement Learning from Human Feedback)
- Stage three takes these comparisons and trains the model further on them
- Increasingly labeling is a human-machine collaboration
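A typical way such comparisons get used (a hedged sketch of the standard reward-model recipe, not necessarily the exact method any given lab follows) is a pairwise loss that pushes the score of the preferred answer above the rejected one:

```python
import torch
import torch.nn.functional as F

def comparison_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: maximize the probability that the
    # human-preferred answer scores higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to two (chosen, rejected) answer pairs.
loss = comparison_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, 0.9]))
# The trained reward model then supplies the reward signal for further RL training.
```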
LLM Scaling Laws
- Performance of LLMs is a smooth, well-behaved, predictable function of:
- N: Number of parameters in the network
- D: the amount of text we train on
- We expect more intelligence “for free” just by scaling up N and D
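One commonly cited functional form for this relationship (from the “Chinchilla” scaling-law paper, Hoffmann et al. 2022; not something stated in the video) models the loss as a sum of terms that shrink as N and D grow:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here E is the irreducible loss of the data and A, B, α, β are fitted constants; lower loss corresponds to better next-word prediction.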
Tool Use
If you ask ChatGPT a question that requires quite specific information about a world event or something similar, it will need to do a search. The model emits a special output that signals a search should take place; the search runs and, much like a human scrolling through links, the results from various pages are aggregated and fed back into the model, which can then synthesize a response.
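Roughly, the control flow looks like the sketch below (the special-token names and the search_engine function are hypothetical stand-ins for illustration, not a real API):

```python
def run_assistant(prompt: str, model, search_engine) -> str:
    output = model.generate(prompt)
    if output.startswith("<|search|>"):                    # model asks for a web search
        query = output.removeprefix("<|search|>").strip()
        results = search_engine(query)                     # aggregate text from the result pages
        # Feed the gathered results back in so the model can synthesize a final answer.
        output = model.generate(prompt + "\n<|search_results|>\n" + results)
    return output
```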
System 1 vs System 2
Humans have 2 systems of thinking
- Instinctual (2 + 2)
- Deep thought (17 * 24)
One of those problems requires us to slow down a little and work out the answer by thinking and reasoning through it.
Currently LLMs only have System 1, but work is being done on what it might mean for an LLM to have System 2 behaviour.
Self Improvement
- Inspired by AlphaGo, which had 2 stages:
- learn by imitating expert human players
- learn by self-improvement (reward = win the game) in a sandboxed environment with a very clearly defined goal in which it kept playing against itself
Can we bring this second stage to LLMs? If we only train on human data, then the model's knowledge is bounded by human knowledge.
- The challenge in doing this is the lack of a reward criterion in the language domain. How does an LLM know whether what it sampled is good or not?