swyx's must-read list (incl a lot of history)
ELI5: https://news.ycombinator.com/item?id=35977891
2017 paper: "Attention Is All You Need" (https://arxiv.org/abs/1706.03762)

- LSTM is dead. Long Live Transformers! (talk)
- The talk traces the evolution of natural language processing (NLP) techniques, starting with the limitations of bag-of-words models and vanilla recurrent neural networks (RNNs), which suffer from vanishing and exploding gradients.
- Long short-term memory (LSTM) networks resolved these gradient issues but remained hard to train and offered poor transfer learning, which motivated the development of the transformer.
- The transformer maps input sequences to output sequences using self-attention followed by a feedforward network, with multi-headed attention computing several attention outputs in parallel, each with its own set of parameters.
- Other key innovations of the transformer include positional encoding and the use of ReLU activation functions.
- The speaker highlights the advantages of transformers and models like RoBERTa for pretraining on large-scale unsupervised text, enabling transfer learning and reducing training time and resources.
- Despite being replaced by transformers in most areas, LSTMs still have applications in real-time control.
- The speaker also compares word-level CNNs to transformers and observes that transformers capture context across an entire document more efficiently.
- https://youtu.be/-uyXE7dY5H0?si=Lbgi7nPw4ROpu53L (from RNNs to LSTMs)
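The multi-headed attention described in the talk notes above can be sketched in a few lines of numpy. This is a minimal illustration, not the talk's code; the shapes and variable names are my own:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); each W*: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project the inputs to queries, keys, values, then split into heads.
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, one set of weights per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    out = softmax(scores) @ V                            # (heads, seq, d_head)
    # Concatenate the heads and mix them with the output projection.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo
```

Each head gets its own slice of the projected queries, keys, and values, which is what lets the model attend to different relationships in parallel with different parameters.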
- Little Book of Deep Learning (pdf)
- The notion of layer
- Linear layer
- Activation functions
- Pooling
- Dropout
- Normalizing layers
- Skip connections
- Attention layers
- Token embedding
- Positional encoding
- Architectures
- Multi-Layer Perceptrons
- Convolutional networks
- Attention models
- Reminder that my deep learning course @unige_en is entirely available on-line. 1000+ slides, ~20h of screen-casts. https://fleuret.org/dlc/
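The positional-encoding entry in the book's outline above refers, in the original transformer, to the sinusoidal scheme; a minimal numpy sketch (assuming an even d_model):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings from 'Attention Is All You Need':
    PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))"""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model/2)
    angle = pos / 10000 ** (i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions
    return pe
```

Each position gets a unique pattern of phases, and because the wavelengths form a geometric progression, nearby positions have similar encodings.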
- https://e2eml.school/transformers.html Transformers from Scratch
https://www.jvoderho.com/blog.html?blogid=Transformer%20as%20a%20general%20purpose%20computer
explaining QKV https://arpitbhayani.me/blogs/qkv-matrices/
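To make the QKV roles concrete, here is a toy single-query example (all numbers invented for illustration): the query is compared against every key by dot product, the scores are softmaxed into weights, and the weights blend the values:

```python
import numpy as np

q = np.array([1.0, 0.0])                  # what this token is looking for
K = np.array([[1.0, 0.0],                 # key of token A (matches q)
              [0.0, 1.0]])                # key of token B (orthogonal to q)
V = np.array([[10.0, 0.0],                # value carried by token A
              [0.0, 10.0]])               # value carried by token B

scores = K @ q / np.sqrt(2)               # dot-product similarity, scaled
weights = np.exp(scores) / np.exp(scores).sum()
output = weights @ V
# Token A's key matches the query, so its value dominates the output.
```

The separation matters: keys decide *how much* each token is attended to, while values decide *what* gets passed along once it is.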
https://news.ycombinator.com/item?id=35712334
The Illustrated Transformer is fantastic, but those going into it really should first read the earlier articles in the series to build a foundation, plus the later articles that go into GPT and BERT. Here's the list:
A Visual and Interactive Guide to the Basics of Neural Networks - https://jalammar.github.io/visual-interactive-guide-basics-n...
A Visual And Interactive Look at Basic Neural Network Math - https://jalammar.github.io/feedforward-neural-networks-visua...
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) - https://jalammar.github.io/visualizing-neural-machine-transl...
I made a transformer by hand (no training)
"Thinking Like Transformers" [1]. They introduce a primitive programming language, RASP, which is composed of operations capable of being modeled with transformer components, and demonstrate how different programs can be written with it, e.g. histograms, sorting. Sasha Rush and Gail Weiss have an excellent blog post on it as well [2]. Follow on work actually demonstrated how RASP-like programs could actually be compiled into model weights without training [3].
[1] https://arxiv.org/abs/2106.06981
[2] https://srush.github.io/raspy/
[3] https://arxiv.org/abs/2301.05062
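The histogram program is a nice concrete case; here is a plain-Python sketch of what the RASP-style computation does (select the positions holding an equal token, then count them), not actual RASP syntax:

```python
def histogram(tokens):
    """For each position, count how many positions hold the same token:
    roughly RASP's select(tokens, tokens, ==) followed by selector_width."""
    return [sum(t == u for u in tokens) for t in tokens]
```

For example, `histogram("hello")` yields per-position counts, so both `l` positions report 2 while the rest report 1. In a transformer, the `select` step corresponds to an attention pattern and the counting to an aggregation over it.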
3Blue1Brown visualizing attention https://news.ycombinator.com/item?id=40035514
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) - https://jalammar.github.io/illustrated-bert/
The Illustrated GPT-2 (Visualizing Transformer Language Models) - https://jalammar.github.io/illustrated-gpt2/
How GPT3 Works - Visualizations and Animations - https://jalammar.github.io/how-gpt3-works-visualizations-ani...
The Illustrated Retrieval Transformer - https://jalammar.github.io/illustrated-retrieval-transformer...
The Illustrated Stable Diffusion - https://jalammar.github.io/illustrated-stable-diffusion/
The math behind Attention Mechanisms: the math behind the Keys, Queries, and Values matrices, explained in a friendly pictorial way using diagrams and linear transformations
https://www.youtube.com/watch?v=UPtG_38Oq8o
If you want to learn how to code them, this book is great: https://d2l.ai/chapter_attention-mechanisms-and-transformers...
Transformer Taxonomy (the last lit review)
https://kipp.ly/blog/transformer-taxonomy/
- It covers 22 models, 11 architectural changes, 7 post-pre-training techniques and 3 training techniques (and 5 things that are none of the above).
attention visualization https://catherinesyeh.github.io/attn-docs/
Explainers
Courses
- Stanford CS25: Transformers United, an online seminar on Transformers.
- Stanford CS324: Large Language Models with Percy Liang, Tatsu Hashimoto, and Chris Re, covering a wide range of technical and non-technical aspects of LLMs.
- Predictive learning, NIPS 2016: In this early talk, Yann LeCun makes a strong case for unsupervised learning as a critical element of AI model architectures at scale. Skip to 19:20 for the famous cake analogy, which is still one of the best mental models for modern AI.
- AI for full-self driving at Tesla: Another classic Karpathy talk, this time covering the Tesla data collection engine. Starting at 8:35 is one of the great all-time AI rants, explaining why long-tailed problems (in this case stop sign detection) are so hard.
- The scaling hypothesis: One of the most surprising aspects of LLMs is that scaling — adding more data and compute — just keeps increasing accuracy. GPT-3 was the first model to demonstrate this clearly, and Gwern’s post does a great job explaining the intuition behind it.
- Chinchilla’s wild implications: Nominally an explainer of the important Chinchilla paper (see below), this post gets to the heart of the big question in LLM scaling: are we running out of data? This builds on the post above and gives a refreshed view on scaling laws.
- A survey of large language models: Comprehensive breakdown of current LLMs, including development timeline, size, training strategies, training data, hardware, and more.
- Sparks of artificial general intelligence: Early experiments with GPT-4: Early analysis from Microsoft Research on the capabilities of GPT-4, the current most advanced LLM, relative to human intelligence.
- The AI revolution: How Auto-GPT unleashes a new era of automation and creativity: An introduction to Auto-GPT and AI agents in general. This technology is very early but important to understand — it uses internet access and self-generated sub-tasks in order to solve specific, complex problems or goals.
- The Waluigi Effect: Nominally an explanation of the “Waluigi effect” (i.e., why “alter egos” emerge in LLM behavior), but interesting mostly for its deep dive on the theory of LLM prompting.
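The Chinchilla discussion above leans on two popular rules of thumb, which are simplifications of the paper's actual fits: training compute C ≈ 6ND (N parameters, D tokens) and a compute-optimal ratio of roughly 20 tokens per parameter. A back-of-envelope sketch under those assumptions:

```python
# Back-of-envelope Chinchilla arithmetic, assuming C ≈ 6*N*D and
# D_opt ≈ 20*N. Substituting the second into the first gives
# C ≈ 120*N**2, so N_opt = sqrt(C / 120).
def chinchilla_optimal(flops):
    n_params = (flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# At roughly Chinchilla's training budget (~5.88e23 FLOPs) this recovers
# the paper's ballpark: ~70B parameters trained on ~1.4T tokens.
n, d = chinchilla_optimal(5.88e23)
```

The "running out of data" worry follows directly: optimal token counts grow with the square root of compute, so bigger budgets demand ever more text.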
- QKV https://x.com/karpathy/status/1794021159895507173
FlashAttention
https://twitter.com/amanrsanger/status/1657835933503479808?s=46&t=90xQ8sGy63D2OtiaoGJuww
Hyena
OpenAI's paper using GPT-4 to label neurons in GPT-2 ("Language models can explain neurons in language models")
https://twitter.com/generatorman_ai/status/1664410300110766082?s=20
Applications of Transformers: new survey paper highlighting major applications of Transformers for deep learning tasks. Includes a comprehensive list of Transformer models. https://arxiv.org/abs/2306.07303
Efficient Methods for Natural Language Processing: A Survey. It lists effective techniques that use fewer resources while producing comparable outcomes to resource-intensive NLP systems.
![](https://pbs.twimg.com/media/FhnMjm7WYAA8e18?format=jpg&name=medium)
abs: https://arxiv.org/abs/2209.00099