# Large Language Models (LLMs)

## Overview

Large Language Models (LLMs) are AI models made popular by tools such as GPT-4 (OpenAI), ChatGPT (OpenAI), [[Claude]] (Anthropic), Gemini (Google), Mistral, and many others. LLMs mainly generate text, but there are also models that generate images, such as [[DALL-E]], [[FLUX.1]], [[Stable Diffusion]], etc. Others deal with sound/voice, and even video. There are also multi-modal LLMs that can handle several modalities.

LLMs work great to explore what is at the borders of your knowledge. They are also great at doing boring tasks for which you can provide perfect guidance (but that would still take you hours).

## LLM efficiency vs communication ability

A key thing with LLMs is that their ability to help you, as a tool, varies vastly with your ability to communicate. Add context, add references, and ask very precise questions that explain the background.

## How to think about LLMs

LLMs use induction. Current LLMs are unable to use deduction; if you think they use deduction, it's an illusion. LLMs generate tons of hallucinations because of the (statistical) associations they make between words, sentences, and ideas. It's systematic.

LLMs are trained with induction. They learn by predicting the next word in a sequence of words. This makes them great at outputting words that *look* and *sound* correct in sequence. They have learned patterns for how to do this by seeing billions of word sequence completions. Because they're so good at outputting correct-sounding words in a sequence, their outputs will often be logically correct: after all, it's more likely that correct-sounding words are logically correct than incorrect-sounding ones.

> A parrot that lives in a courthouse will regurgitate more correct statements than a parrot that lives in a madhouse

This explains why [[Chain-of-Thought (CoT) prompting]], test-time compute, and other strategies that make the LLM "think" are effective.

LLMs make highly educated guesses, but they are still guesses based on statistics. In fact, induction, by definition, is guessing: you use past experience, not premises or understanding, to draw conclusions. There is no "reason" why past experience will hold, or that you are extrapolating the correct conclusion from your experience. Induction is inherently probabilistic. Deduction is not.

The key is treating the LLM as an evolution engine: using it both to generate initial attempts and to improve upon successful approaches through guided iteration.

## Transformers

- Composed of an Encoder and a Decoder
- Encoder: encodes sets of tokens (words, subwords, punctuation) into a meaningful numerical representation (aka an embedding)
- Decoder: guesses the next token based on the tokens/words that have been seen before (e.g., GPT, ChatGPT, Claude, ...)

![[LLM - example input output.png]]

LLMs are "left-to-right autoregressive language models". They determine what the next words should be, based on what the previous ones were (aka the context window).

## Encoders

The output of an encoder is an embedding: a numerical representation, as a vector, that captures the semantics of the encoder's input.

Definition: an encoder learns latent representations (embeddings, i.e., numerical representations) of words and sentences *in context*, in a self-supervised way.

Examples of encoders: BERT, RoBERTa, etc.
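As a concrete illustration, here is a minimal sketch of turning a sentence into an embedding with a BERT-style encoder. It assumes the Hugging Face `transformers` library and PyTorch are available; the model name and the mean-pooling step are illustrative choices, not something prescribed by this note.

```python
# Minimal sketch: encode a sentence into a single embedding vector.
# Assumes `transformers` and `torch` are installed; bert-base-uncased is just one possible encoder.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

sentence = "Shall we go on a Valentine date?"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# last_hidden_state has shape (batch, tokens, hidden_size);
# mean-pooling over the tokens gives one vector for the whole sentence.
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(embedding.shape)  # torch.Size([768]) for bert-base-uncased
```

Such a vector is what gets compared (e.g., with cosine similarity) in the RAG setup described later in this note.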
![[LLM - encoder example.png]]

To learn the "meaning" of a word/sentence, encoders are trained by replacing some parts of the sentences with certain probabilities. In the example above: mask the word "Valentine" (80% probability), replace "Valentine" with another word such as "lunch" (10% probability), or keep the sentence as is (10% probability).

## Decoders

Decoders output probabilities (of the next word based on the previous words). They output one word at a time. Examples: GPT, Claude, etc.

To train decoders, we give the model pairs of Input + Output.

![[LLM - decoder example.png]]

## Attention mechanism

The meaning of a specific word depends on the words before and after it. Attention allows capturing that meaning in a sensible way. Visualization:

![[LLM - self-attention.png]]

The semantic meaning of a word depends on the weights attached to each surrounding word:

![[LLM - attention calculation.png]]

In the example above, the word "Valentine" has a higher weight, because it determines what "date" means in the context of this sentence.

Actually, there are multiple "contexts" to consider to correctly determine the meaning of a sentence. Decoders support multiple "attention heads" (aka features):

![[LLM - attention multiple heads.png]]

This represents multiple attention mechanisms running in parallel, helping to better determine the most probable next words.

## Context window

Given how important context is, the larger the context window, the better LLMs can determine the most probable next words. The context window of an LLM is the number of tokens it can take into account when generating a response. Examples:

- GPT-4o: 128k tokens
- Claude 3.5 Sonnet: 200k tokens

## Reinforcement Learning from Human Feedback (RLHF)

LLMs are fine-tuned on human-gathered data: suggested prompts, desired answers, and human rankings of model outputs. This is what enables alignment with human preferences. It is also where a lot of bias can be introduced (or eliminated).

## Knowledge cut-off dates

When using LLMs, always keep in mind that each model was trained on data up to a certain cut-off date, meaning it doesn't have the most recent information "in mind". Some examples:

- GPT-4o: October 2023
- Google Gemini Pro 1.5: November 2023
- Claude 3.5 Sonnet: April 2024

## Creativity

LLMs are limited to their input data. Consider Pedro Domingos' tweet:

```
Levels of learning:
- Things we know and can explain (e.g., addition)
- Things we know but can't explain (e.g., riding a bike)
- Things we don't know (e.g., curing cancer)
LLMs are only good for the first, at best.
```

## Retrieval Augmented Generation Systems (RAGs)

Retrieval Augmented Generation Systems (RAGs) enable LLMs to access more recent data to improve their outputs.

![[LLM - rag sequence diagram.png]]

In a RAG setup, an encoder is used to create embeddings of the documents, which are stored in a vector database. LLM queries then go through the RAG system: the query itself is encoded into an embedding, the vector database looks for the stored embeddings that best match it, and the corresponding content is provided back to the LLM as context. Example:

![[LLM - RAG example steps.png]]
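To make that flow concrete, here is a minimal sketch of the retrieval step. It assumes a hypothetical `embed()` helper (for instance, the mean-pooled encoder sketched earlier) and uses a plain in-memory list with cosine similarity in place of a real vector database.

```python
# Minimal sketch of the RAG retrieval step: embed documents, embed the query,
# find the closest documents, and inject them into the LLM prompt.
# `embed` is a hypothetical helper; a real system would use a vector database.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_index(documents: list[str], embed) -> list[tuple[str, np.ndarray]]:
    # "Vector database": each document stored next to its embedding.
    return [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, index: list[tuple[str, np.ndarray]], embed, top_k: int = 3) -> list[str]:
    # Encode the query, then return the top_k most similar documents.
    query_embedding = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(query_embedding, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def build_prompt(query: str, context_docs: list[str]) -> str:
    # The retrieved documents are injected into the prompt as extra context.
    context = "\n".join(context_docs)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```

A production setup would swap the list for an actual vector database, but the flow stays the same: encode the documents, encode the query, find the closest matches, and pass them to the LLM as context.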
> By treating robots like humans, won't we end up treating humans like robots? (Louis de Diesbach, 2024)

## 6P Framework for Responsible LLM Deployment

Goal: mitigate anthropomorphization risks. Paper by Reusens and Baesens, 2024. Framework:

- Prepare for accountability: own the output
  - The model is not a human. The implementer of the LLM is responsible for the generated output. Therefore, act with caution when implementing it
- Pick the right tool: assess appropriateness
  - Assess appropriateness, both in terms of whether an anthropomorphic tool itself might be appropriate and in terms of the implemented tool itself
- Promote critical thinking: shift from creation to curation
  - When applying LLMs within workflows, ensure that employees know it is a machine and remain critical of its output. Educate them on the system to limit anthropomorphization
- Protect privacy: ensure data security
  - Make sure that no data leakage can occur when using third-party providers. Also inform users correctly about the data that will be stored and how (or if) it will be used
- Prioritize people: ensure fairness
  - Make sure the model's output is aligned with the end user's interests and avoid any form of discrimination. Refrain from using LLMs if they might lead to discrimination
- Phrase carefully: set realistic expectations
  - The way LLMs are talked about should be approached with care. Avoid anthropomorphic language and do not overpromise the capabilities of LLMs

## Moravec's paradox

Things that are easy for humans are difficult for AI, and vice versa. Examples:

- Writing, mathematics, creating art: difficult for humans, easy for AI
- Filling the dishwasher, riding a bike: easy for humans, hard for AI

## Jevons paradox

Increasing the efficiency with which a resource is used leads to greater consumption of that resource, not less: the falling cost of using the resource increases demand enough that total use goes up.

## References

- Great visualization: https://bbycroft.net/llm