LangChain Core Concepts: Document Loaders & Text Splitting
Module 2: LangChain Core Concepts
This module delves into the fundamental building blocks and key concepts of LangChain, empowering you to construct sophisticated LLM-powered applications.
1. Document Loaders and Text Splitting
Efficiently preparing and managing data for LLMs is crucial. LangChain provides powerful tools for loading documents from various sources and splitting them into manageable chunks for processing.
Document Loaders
Document loaders enable you to ingest data from a wide range of sources, including:
- Files: .txt, .pdf, .csv, .docx, etc.
- Websites: HTML content from URLs.
- Databases: Vector databases, SQL databases.
- APIs: Social media, cloud storage.
Example: Loading a text file.
from langchain_community.document_loaders import TextLoader
loader = TextLoader("my_document.txt")
documents = loader.load()
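Conceptually, every loader does the same job: it reads from a source and returns a list of Document objects, each holding the text plus source metadata. The following is a minimal stand-in sketch (not the real LangChain classes) illustrating that contract:

```python
# Conceptual sketch, not the actual LangChain implementation: a loader
# returns a list of Document objects carrying text plus source metadata.
from dataclasses import dataclass, field


@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Minimal stand-in for TextLoader: reads one file into one Document."""

    def __init__(self, path: str):
        self.path = path

    def load(self) -> list[Document]:
        with open(self.path, encoding="utf-8") as f:
            text = f.read()
        return [Document(page_content=text, metadata={"source": self.path})]
```

Every LangChain loader follows this shape, which is why downstream components (splitters, retrievers) can consume output from any of them interchangeably.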
Text Splitting
LLMs have context window limitations. Text splitting divides large documents into smaller, semantically coherent chunks that can be processed individually by LLMs.
Common Splitting Strategies:
- Character Splitting: Splits based on a fixed number of characters.
- Recursive Character Splitting: Attempts to split based on a list of separators, recursively breaking down larger chunks.
- Token Splitting: Splits based on token count, respecting LLM token limits.
Example: Recursive character splitting.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)
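To see how chunk_size and chunk_overlap interact, here is a simplified fixed-window splitter in plain Python (LangChain's actual recursive algorithm also tries a list of separators, but the overlap mechanics are the same):

```python
# Simplified illustration of chunk_size / chunk_overlap (fixed windows,
# not LangChain's separator-aware recursive splitting).
# Assumes chunk_overlap < chunk_size.
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=2)
# Each chunk repeats the last 2 characters of the previous chunk, so
# information at chunk boundaries is not lost to either side.
```

The overlap is what preserves context that straddles a boundary; with chunk_overlap=200 and chunk_size=1000 (as above), each chunk shares its first 200 characters with the end of the previous one.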
2. Integrating LLM Models
LangChain facilitates seamless integration with various Large Language Models (LLMs), including popular providers like OpenAI, HuggingFace, and Cohere.
OpenAI
Integrate models from OpenAI, such as gpt-3.5-turbo and gpt-4.
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
HuggingFace
Leverage models hosted on HuggingFace Hub, offering a vast collection of open-source LLMs.
from langchain_community.llms import HuggingFaceHub
llm = HuggingFaceHub(repo_id="google/flan-t5-large")
Cohere
Utilize models provided by Cohere for various natural language tasks.
from langchain_cohere import ChatCohere
llm = ChatCohere(model="command")
3. LangChain Memory
Memory components allow LangChain applications to retain and utilize information from previous interactions, enabling contextual awareness and coherent conversations.
ConversationBufferMemory
Stores the entire conversation history. Suitable for shorter conversations where all context is important.
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory()
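The behaviour is easy to picture with a plain-Python sketch (a conceptual stand-in, not the real class): every exchange is appended verbatim, so the context handed to the LLM grows with the conversation.

```python
# Conceptual stand-in for ConversationBufferMemory: store every exchange
# verbatim and replay the whole transcript as context.
class BufferMemory:
    def __init__(self):
        self.messages: list[tuple[str, str]] = []

    def save_context(self, user_input: str, ai_output: str) -> None:
        self.messages.append(("Human", user_input))
        self.messages.append(("AI", ai_output))

    def load_memory(self) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self.messages)


memory = BufferMemory()
memory.save_context("Hi!", "Hello, how can I help?")
memory.save_context("What's LangChain?", "A framework for LLM apps.")
```

Because nothing is ever dropped, this strategy eventually overflows the model's context window, which motivates the token-limited and summary variants below.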
ConversationTokenBufferMemory
Stores a limited number of recent messages, using a token count to manage the buffer size. This is useful for preventing overly long inputs.
from langchain.memory import ConversationTokenBufferMemory
from langchain_openai import ChatOpenAI
llm = ChatOpenAI() # Ensure you have an LLM instance
memory = ConversationTokenBufferMemory(llm=llm, max_token_limit=100)
ConversationSummaryMemory
Summarizes past conversations to conserve context window space. This is ideal for long-running dialogues where a condensed history is sufficient.
from langchain.memory import ConversationSummaryMemory
from langchain_openai import ChatOpenAI
llm = ChatOpenAI() # Ensure you have an LLM instance
memory = ConversationSummaryMemory(llm=llm)
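The idea can be sketched with a stand-in summarizer: instead of storing every message, keep a single condensed summary that is rewritten after each exchange. (The real class asks the LLM to do the rewriting; here a crude truncation stands in for it.)

```python
# Sketch of summary memory: one running summary, rewritten per exchange.
# fake_summarize is a stand-in for the LLM call the real class makes.
def fake_summarize(previous_summary: str, new_exchange: str) -> str:
    return (previous_summary + " " + new_exchange).strip()[:80]  # crude cap


summary = ""
for exchange in ["Human asked about LangChain; AI explained it.",
                 "Human asked about memory; AI described the classes."]:
    summary = fake_summarize(summary, exchange)
```

However long the dialogue runs, the context cost stays roughly constant, which is the trade-off: bounded size in exchange for lossy history.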
4. Chains: Orchestrating LLM Interactions
Chains are fundamental to LangChain, enabling you to combine multiple LLM calls and other components into a single, cohesive workflow.
LLMChain
A basic chain that takes a prompt and an LLM, formats the prompt, and passes it to the LLM.
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI
llm = OpenAI()
prompt = PromptTemplate.from_template("Tell me a joke about {topic}")
chain = LLMChain(llm=llm, prompt=prompt)
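Under the hood the chain does two things: format the prompt, then pass it to the LLM. This stand-in sketch (with a plain function in place of a real model, so no API key is needed) makes the flow visible:

```python
# What an LLMChain does, sketched with a stand-in "LLM" (a plain function):
# 1. fill the prompt template, 2. hand the result to the model.
def fake_llm(prompt: str) -> str:
    return f"[LLM response to: {prompt}]"


def run_chain(template: str, llm, **inputs) -> str:
    prompt = template.format(**inputs)   # 1. format the prompt
    return llm(prompt)                   # 2. pass it to the LLM


result = run_chain("Tell me a joke about {topic}", fake_llm, topic="cats")
```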
SequentialChain
Executes a sequence of chains, passing the output of one chain as input to the next. This allows for multi-step reasoning processes.
from langchain.chains import SequentialChain
# Assume chain1 and chain2 are pre-defined LLMChains whose output_keys
# match the output_variables listed below.
overall_chain = SequentialChain(
    chains=[chain1, chain2],
    input_variables=["input_var1"],
    output_variables=["output_var1", "output_var2"],
)
SimpleSequentialChain
A simpler version of SequentialChain where each chain has a single input and a single output, and the output of one chain directly becomes the input of the next.
from langchain.chains import LLMChain, SimpleSequentialChain
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI
llm = OpenAI()
prompt1 = PromptTemplate.from_template("Translate '{text}' to French.")
chain1 = LLMChain(llm=llm, prompt=prompt1)
prompt2 = PromptTemplate.from_template("Summarize the following French text: '{text}'")
chain2 = LLMChain(llm=llm, prompt=prompt2)
# Each chain takes exactly one input and produces one output, so they can
# be piped directly; SimpleSequentialChain manages its own input/output keys.
simple_sequential_chain = SimpleSequentialChain(chains=[chain1, chain2])
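Stripped of the LangChain classes, the piping behaviour is just function composition: each step's output string becomes the next step's input. A plain-Python sketch (with stand-in lambdas in place of real chains):

```python
# SimpleSequentialChain's piping behaviour as plain function composition:
# each step's output string is fed to the next step as its input.
def run_sequential(steps, text: str) -> str:
    for step in steps:
        text = step(text)
    return text


steps = [
    lambda t: f"translated({t})",   # stand-in for the translation chain
    lambda t: f"summary({t})",      # stand-in for the summarization chain
]
result = run_sequential(steps, "hello")
```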
5. Prompt Templates
Prompt templates are essential for crafting effective prompts that guide LLMs to produce desired outputs. LangChain offers flexible ways to create and manage these templates.
PromptTemplate
A template for creating string-based prompts. It allows you to define placeholders that will be filled with specific input variables.
from langchain.prompts import PromptTemplate
template = "Here is a question: {question}\nAnswer:"
prompt = PromptTemplate(template=template, input_variables=["question"])
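Filling the placeholders is essentially Python string formatting; this plain-Python sketch mirrors what calling the template's format method would produce for the template above:

```python
# Placeholder filling is essentially str.format; this mirrors what
# formatting the PromptTemplate above would produce.
template = "Here is a question: {question}\nAnswer:"
filled = template.format(question="What is LangChain?")
```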
ChatPromptTemplate
A template for creating chat-based prompts, which are structured as a sequence of messages with different roles (system, human, AI). This is particularly useful for conversational LLMs.
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
chat_template = ChatPromptTemplate.from_messages([
SystemMessagePromptTemplate.from_template("You are a helpful AI assistant."),
HumanMessagePromptTemplate.from_template("What is the capital of {country}?"),
])
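Once formatted, a chat prompt is a role-tagged list of messages rather than a single string. This sketch shows the shape of the rendered output as plain dicts (the actual objects are LangChain message classes, but the structure is the same):

```python
# The rendered chat prompt is a role-tagged message list; plain dicts
# stand in here for LangChain's message classes to show the structure.
def format_chat(country: str) -> list[dict]:
    return [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": f"What is the capital of {country}?"},
    ]


messages = format_chat("France")
```

Conversational models consume this list directly, keeping the system instruction separate from user turns instead of flattening everything into one string.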