Module 2: Foundation Models & Architectures

This module delves into the foundational concepts and architectural underpinnings of modern large language models (LLMs), exploring their core components, various model types, and efficient adaptation techniques.

2.1 Fine-tuning vs. Prompt Engineering

Understanding the differences between fine-tuning and prompt engineering is crucial for effectively leveraging LLMs.

  • Prompt Engineering:

    • Definition: The process of crafting specific inputs (prompts) to guide an LLM towards generating desired outputs without modifying the model's underlying weights.
    • Mechanism: Relies on the LLM's pre-trained knowledge and its ability to interpret and respond to natural language instructions.
    • Advantages:
      • No retraining required, making it fast and computationally inexpensive.
      • Accessible to users without deep ML expertise.
      • Can achieve good results for a wide range of tasks.
    • Disadvantages:
      • Performance can be sensitive to prompt phrasing.
      • May not achieve optimal performance for highly specialized or complex tasks.
      • Limited ability to impart new knowledge or skills to the model.
    • Example: Asking a general-purpose LLM, "Summarize the following article in three bullet points: [Article Text]"
  • Fine-tuning:

    • Definition: The process of further training a pre-trained LLM on a smaller, task-specific dataset to adapt its behavior and knowledge.
    • Mechanism: Updates the model's weights based on the new data, allowing it to specialize in a particular domain or task.
    • Advantages:
      • Can significantly improve performance on specific tasks or domains.
      • Enables the model to learn new knowledge or skills.
      • Can lead to more consistent and robust outputs for specialized applications.
    • Disadvantages:
      • Requires a labeled dataset for training.
      • Computationally more expensive than prompt engineering.
      • Requires more technical expertise in machine learning.
    • Example: Taking a general LLM and fine-tuning it on a dataset of medical research papers to create a model specialized in medical question answering. Both approaches are sketched in code below.
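
The contrast can be made concrete in code. Below is a minimal sketch, assuming the Hugging Face transformers library; the "gpt2" checkpoint stands in for any causal LM, and `tokenized_medical_dataset` is a hypothetical, already-tokenized dataset.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments, pipeline)

# --- Prompt engineering: the model's weights stay frozen; only the input changes ---
generator = pipeline("text-generation", model="gpt2")  # "gpt2" is a stand-in checkpoint
article_text = "..."  # the article to summarize
prompt = f"Summarize the following article in three bullet points:\n{article_text}"
summary = generator(prompt, max_new_tokens=128)[0]["generated_text"]

# --- Fine-tuning: the same model is further trained, updating its weights ---
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

args = TrainingArguments(output_dir="medical-qa-model", num_train_epochs=3,
                         per_device_train_batch_size=4)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_medical_dataset,  # hypothetical pre-tokenized dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()  # every weight in the model may be updated
```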

2.2 LLM Internals: Transformers, Attention, Positional Encoding

The Transformer architecture is the backbone of most modern LLMs. This section explains its key components.

2.2.1 The Transformer Architecture

The Transformer model, introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al.), revolutionized sequence processing by relying entirely on attention mechanisms and discarding recurrence and convolutions.

2.2.2 Attention Mechanisms

Attention allows the model to weigh the importance of different input tokens when processing a given token.

  • Self-Attention:

    • Concept: Enables the model to relate different positions of a single sequence to compute a representation of the sequence. Each word in a sentence can "attend" to every other word.
    • Mechanism: Involves calculating query (Q), key (K), and value (V) vectors for each input token. The attention score between two tokens is computed by taking the dot product of their Q and K vectors, scaling by $\sqrt{d_k}$, and applying a softmax function. The resulting weights are then used to form a weighted sum of the V vectors.
    • Formula: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$ where $d_k$ is the dimension of the key vectors.
  • Multi-Head Attention:

    • Concept: Instead of performing a single attention function, multi-head attention runs several attention functions in parallel, each with different learned linear projections of the queries, keys, and values, then concatenates their outputs and applies a final linear projection.
    • Benefit: Allows the model to jointly attend to information from different representation subspaces at different positions. Each "head" can learn to focus on a different aspect of the input sequence. A minimal sketch of single-head attention follows this list.
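
To make the formula concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention; the shapes and random inputs are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Computes softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise dot products, scaled by sqrt(d_k)
    # Row-wise softmax (subtracting the max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted sum of value vectors

# Toy example: a sequence of 4 tokens with d_k = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```

Multi-head attention would run this function several times on different learned projections of the inputs and concatenate the results.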

2.2.3 Positional Encoding

Since the Transformer architecture does not inherently process sequences in order (unlike RNNs), it needs a way to incorporate positional information.

  • Concept: Positional encodings are added to the input embeddings to inject information about the relative or absolute position of tokens in the sequence.
  • Method: In the original Transformer, sinusoidal functions of different frequencies generate these encodings, allowing the model to distinguish positions and learn the order of words (a sketch follows this list).
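
A minimal NumPy sketch of the sinusoidal scheme from the original paper, assuming an even `d_model`:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)  # cosine on odd dimensions
    return pe

# The encodings are simply added to the token embeddings:
# inputs = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
```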

2.3 Model Types

LLMs come in various flavors, tailored for specific use cases and capabilities.

  • Instruction-Tuned Models:

    • Description: These models are fine-tuned on datasets of instructions and their corresponding outputs. They are designed to follow instructions given in natural language.
    • Use Cases: Question answering, summarization, text generation based on specific commands, code generation.
    • Example: Asking an instruction-tuned model to "Write a poem about a starry night" or "Translate this sentence to French."
  • Chat-Tuned Models:

    • Description: Trained on conversational data and typically refined with Reinforcement Learning from Human Feedback (RLHF) to engage in multi-turn dialogue, maintain context, and respond in a natural, conversational manner.
    • Use Cases: Virtual assistants, chatbots, customer service agents, interactive storytelling.
    • Example: Engaging in a back-and-forth conversation about a topic, asking follow-up questions, and receiving coherent, contextually relevant responses (see the chat-template sketch after this list).
  • Multilingual Models:

    • Description: Trained on vast amounts of text data from multiple languages. They can understand, generate, and translate text across various languages.
    • Use Cases: Cross-lingual communication, translation services, sentiment analysis across different languages.
    • Example: Translating an English document into Japanese or answering a question posed in German.
  • Multimodal Models:

    • Description: Capable of processing and understanding information from multiple modalities, such as text, images, audio, and video.
    • Use Cases: Image captioning, visual question answering, generating images from text descriptions, analyzing video content.
    • Example: Providing an image and asking, "What is happening in this picture?" or generating an image based on the prompt, "A cat wearing a party hat."
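
As one concrete illustration of the chat-tuned case, the Hugging Face transformers library can render a multi-turn conversation into the exact prompt format a chat model was tuned on; the model name below is a placeholder for any checkpoint that ships a chat template.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-org/some-chat-model")  # placeholder checkpoint

# A multi-turn dialogue: the model receives the full history on every turn,
# which is how chat-tuned models maintain context across follow-up questions.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is positional encoding?"},
    {"role": "assistant", "content": "It injects token-order information into a Transformer."},
    {"role": "user", "content": "Why is that needed?"},  # follow-up relying on prior context
]

# Renders the conversation into the string format the model expects
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```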

2.4 Parameter-Efficient Fine-Tuning (PEFT) Methods

Fine-tuning large LLMs can be computationally intensive and require significant memory. PEFT methods offer efficient ways to adapt models without updating all parameters.

2.4.1 PEFT (Parameter-Efficient Fine-Tuning)

  • Concept: A family of techniques that reduces the number of trainable parameters during fine-tuning, making it more efficient in terms of computation, memory, and storage.

2.4.2 LoRA (Low-Rank Adaptation)

  • Concept: LoRA injects trainable low-rank decomposition matrices into specific layers (typically attention layers) of the pre-trained model. Only these low-rank matrices are trained, while the original pre-trained weights remain frozen.
  • Mechanism: For a given weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA introduces two smaller matrices, $B$ ($d \times r$) and $A$ ($r \times k$), such that the adapted weight is $W_0 + BA$. Because the rank $r$ is much smaller than $d$ and $k$, the number of trainable parameters drops dramatically.
  • Benefits:
    • Dramatically reduces the number of trainable parameters.
    • Faster training than full fine-tuning; after training, the adapters can be merged into the base weights, adding no inference latency.
    • Smaller model checkpoints.
    • Can be easily swapped out to support multiple tasks with a single base model.
  • Example: When adapting a large text-generation model for medical summarization, LoRA might be applied to the attention projection matrices (a minimal layer sketch follows).
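
A minimal PyTorch sketch of the mechanism: a frozen linear layer wrapped with trainable low-rank factors $B$ and $A$. In practice a library such as Hugging Face peft handles this; the layer below is illustrative only.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: W0 x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pre-trained weights stay frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # r x k, small random init
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r, zeros => no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
# Trainable: 2 * 768 * 8 = 12,288 parameters vs. 589,824 frozen base weights.
```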

2.4.3 QLoRA

  • Concept: QLoRA is an optimization of LoRA that further reduces memory usage by quantizing the pre-trained model weights to a lower precision (e.g., 4-bit) while keeping the LoRA adapters in higher precision.
  • Mechanism:
    1. 4-bit NormalFloat Quantization: The pre-trained model weights are quantized to 4-bit using a new data type called NF4 (NormalFloat4), which is optimized for normally distributed weights.
    2. Double Quantization: Further reduces memory by quantizing the quantization constants themselves.
    3. Paged Optimizers: Uses NVIDIA unified memory to page optimizer states between GPU and CPU, preventing out-of-memory spikes during training (e.g., when gradient checkpointing processes long sequences).
  • Benefits:
    • Enables fine-tuning of very large models on consumer-grade hardware (e.g., GPUs with less VRAM).
    • Maintains performance close to full fine-tuning or standard LoRA.
  • Example: Fine-tuning a 65-billion-parameter model on a single 48 GB GPU using QLoRA, which would be impossible with full fine-tuning or even standard LoRA on the same hardware (see the configuration sketch below).
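
A sketch of a QLoRA setup using the Hugging Face transformers, peft, and bitsandbytes stack; the model name is a placeholder and the hyperparameters are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, as in the QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 for normally distributed weights
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # computation runs in higher precision
)

model = AutoModelForCausalLM.from_pretrained("some-org/large-llm",  # placeholder checkpoint
                                             quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# LoRA adapters (kept in higher precision) sit on top of the frozen 4-bit base
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Paged optimizers are selected at training time, e.g.
# TrainingArguments(..., optim="paged_adamw_8bit")
```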