LangChain Core Concepts: Document Loaders & Text Splitting


Module 2: LangChain Core Concepts

This module delves into the fundamental building blocks and key concepts of LangChain, empowering you to construct sophisticated LLM-powered applications.

1. Document Loaders and Text Splitting

Efficiently preparing and managing data for LLMs is crucial. LangChain provides powerful tools for loading documents from various sources and splitting them into manageable chunks for processing.

Document Loaders

Document loaders enable you to ingest data from a wide range of sources, including:

  • Files: .txt, .pdf, .csv, .docx, etc.
  • Websites: HTML content from URLs.
  • Databases: Vector databases, SQL databases.
  • APIs: Social media, cloud storage.

Example: Loading a text file.

from langchain_community.document_loaders import TextLoader

loader = TextLoader("my_document.txt")
documents = loader.load()
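Loaders return a list of Document objects, each pairing the file's text (`page_content`) with source metadata. As an illustration only (not LangChain's actual implementation), a minimal text-file loader could be sketched like this:

```python
import os
import tempfile

def load_text_file(path):
    """Minimal sketch of what a text-file loader produces: one
    document carrying the file's content plus source metadata."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return [{"page_content": text, "metadata": {"source": path}}]

# Demo with a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Hello, LangChain!")
    path = f.name

docs = load_text_file(path)
print(docs[0]["page_content"])  # Hello, LangChain!
os.remove(path)
```

Real loaders follow the same shape but handle parsing (PDF pages, CSV rows, HTML tags) and richer metadata for you.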

Text Splitting

LLMs have context window limitations. Text splitting divides large documents into smaller, semantically coherent chunks that can be processed individually by LLMs.

Common Splitting Strategies:

  • Character Splitting: Splits based on a fixed number of characters.
  • Recursive Character Splitting: Tries an ordered list of separators (paragraphs first, then lines, then words), recursively breaking oversized chunks down until they fit. This is the recommended general-purpose splitter.
  • Token Splitting: Splits based on token count, respecting LLM token limits.
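To make the chunk_size/chunk_overlap mechanics concrete, here is a plain-Python sketch of fixed-size character splitting (not LangChain's implementation): each chunk holds at most chunk_size characters and repeats the last chunk_overlap characters of the previous chunk, so context is not lost at chunk boundaries.

```python
def split_by_characters(text, chunk_size, chunk_overlap):
    """Fixed-size character splitting: chunks of at most chunk_size
    characters, each overlapping the previous by chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_by_characters("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

LangChain's splitters add the separator-aware logic on top of this basic windowing idea.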

Example: Recursive character splitting.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

2. Integrating LLM Models

LangChain integrates with a wide range of LLM providers, including OpenAI, Hugging Face, and Cohere, behind a common interface.

OpenAI

Integrate models from OpenAI, such as gpt-3.5-turbo and gpt-4.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo")

HuggingFace

Leverage models hosted on HuggingFace Hub, offering a vast collection of open-source LLMs.

from langchain_community.llms import HuggingFaceHub

# Requires the HUGGINGFACEHUB_API_TOKEN environment variable to be set.
llm = HuggingFaceHub(repo_id="google/flan-t5-large")

Cohere

Utilize models provided by Cohere for various natural language tasks.

from langchain_cohere import ChatCohere

llm = ChatCohere(model="command")

3. LangChain Memory

Memory components allow LangChain applications to retain and utilize information from previous interactions, enabling contextual awareness and coherent conversations.

ConversationBufferMemory

Stores the entire conversation history. Suitable for shorter conversations where all context is important.

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()

ConversationTokenBufferMemory

Stores a limited number of recent messages, using a token count to manage the buffer size. This is useful for preventing overly long inputs.

from langchain.memory import ConversationTokenBufferMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI() # Ensure you have an LLM instance
memory = ConversationTokenBufferMemory(llm=llm, max_token_limit=100)
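The trimming behavior itself is simple: once the buffer exceeds the token limit, the oldest messages are dropped first. A plain-Python sketch of that idea (using word count as a stand-in for a real tokenizer, which LangChain gets from the LLM):

```python
def trim_to_token_limit(messages, max_tokens,
                        count_tokens=lambda m: len(m.split())):
    """Drop the oldest messages until the total token count fits the
    limit. Word count stands in for a real tokenizer here."""
    trimmed = list(messages)
    while trimmed and sum(count_tokens(m) for m in trimmed) > max_tokens:
        trimmed.pop(0)  # oldest message goes first
    return trimmed

history = ["hi there", "how are you today", "fine thanks"]
print(trim_to_token_limit(history, max_tokens=6))
# ['how are you today', 'fine thanks']
```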

ConversationSummaryMemory

Summarizes past conversations to conserve context window space. This is ideal for long-running dialogues where a condensed history is sufficient.

from langchain.memory import ConversationSummaryMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI() # Ensure you have an LLM instance
memory = ConversationSummaryMemory(llm=llm)

4. Chains: Orchestrating LLM Interactions

Chains are fundamental to LangChain, enabling you to combine multiple LLM calls and other components into a single, cohesive workflow.

LLMChain

A basic chain that takes a prompt and an LLM, formats the prompt, and passes it to the LLM.

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI

llm = OpenAI()
prompt = PromptTemplate.from_template("Tell me a joke about {topic}")
chain = LLMChain(llm=llm, prompt=prompt)

SequentialChain

Executes a sequence of chains, passing the output of one chain as input to the next. This allows for multi-step reasoning processes.

from langchain.chains import SequentialChain

# Assume chain1 and chain2 are pre-defined LLMChains
overall_chain = SequentialChain(
    chains=[chain1, chain2],
    input_variables=["input_var1"],
    output_variables=["output_var1", "output_var2"]
)
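The data flow underlying SequentialChain can be sketched in plain Python (stub "chains" in place of real LLMChains, with the variable names above reused for illustration): each chain reads from a shared dict of variables and merges its output back in for the chains that follow.

```python
def make_chain(fn, output_key):
    """Stub 'chain': maps an inputs dict to a one-key outputs dict."""
    return lambda inputs: {output_key: fn(inputs)}

# Two hypothetical steps: uppercase a string, then count its characters.
chain1 = make_chain(lambda d: d["input_var1"].upper(), "output_var1")
chain2 = make_chain(lambda d: len(d["output_var1"]), "output_var2")

def run_sequential(chains, inputs):
    """Sketch of sequential data flow: each chain's output is merged
    into the shared state for the chains after it."""
    state = dict(inputs)
    for chain in chains:
        state.update(chain(state))
    return state

result = run_sequential([chain1, chain2], {"input_var1": "hello"})
print(result)
# {'input_var1': 'hello', 'output_var1': 'HELLO', 'output_var2': 5}
```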

SimpleSequentialChain

A simpler version of SequentialChain where each chain has a single input and a single output, and the output of one chain directly becomes the input of the next.

from langchain.chains import LLMChain, SimpleSequentialChain
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI

llm = OpenAI()
prompt1 = PromptTemplate.from_template("Translate '{text}' to French.")
chain1 = LLMChain(llm=llm, prompt=prompt1, output_key="french_translation")

prompt2 = PromptTemplate.from_template("Summarize the following French text: '{french_translation}'")
chain2 = LLMChain(llm=llm, prompt=prompt2, output_key="summary")

simple_sequential_chain = SimpleSequentialChain(chains=[chain1, chain2])

5. Prompt Templates

Prompt templates are essential for crafting effective prompts that guide LLMs to produce desired outputs. LangChain offers flexible ways to create and manage these templates.

PromptTemplate

A template for creating string-based prompts. It allows you to define placeholders that will be filled with specific input variables.

from langchain.prompts import PromptTemplate

template = "Here is a question: {question}\nAnswer:"
prompt = PromptTemplate(template=template, input_variables=["question"])

ChatPromptTemplate

A template for creating chat-based prompts, which are structured as a sequence of messages with different roles (system, human, AI). This is particularly useful for conversational LLMs.

from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate

chat_template = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template("You are a helpful AI assistant."),
    HumanMessagePromptTemplate.from_template("What is the capital of {country}?"),
])
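Formatting a chat template produces the role/content message pairs with the placeholders filled in. A plain-Python sketch of that behavior (not the LangChain message classes, just the underlying idea):

```python
def format_chat_messages(template_messages, **variables):
    """Fill each message template's placeholders and keep its role,
    mimicking what formatting a chat prompt template yields."""
    return [(role, content.format(**variables))
            for role, content in template_messages]

messages = format_chat_messages(
    [("system", "You are a helpful AI assistant."),
     ("human", "What is the capital of {country}?")],
    country="France",
)
print(messages)
# [('system', 'You are a helpful AI assistant.'),
#  ('human', 'What is the capital of France?')]
```

In LangChain itself, `chat_template.format_messages(country="France")` plays this role, returning SystemMessage and HumanMessage objects ready to pass to a chat model.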