
LangExtract Guide: Master Google's AI Data Extraction Library

The Community-Led Developer's Handbook. Learn how to use LangExtract to extract structured data from any text using Gemini, OpenAI, and Ollama.

What is LangExtract?

LangExtract is an official Google open-source library for extracting structured data (such as JSON) from unstructured text, including long documents, PDFs, and invoices.

Unlike general-purpose LLM wrappers, LangExtract is built for enterprise-grade extraction with three core differentiators:

  1. Precise Grounding: Every extracted field is mapped back to exact character offsets in the source text for verification.
  2. Schema Enforcement: Output structure is constrained to match your few-shot examples, using controlled generation on supported models such as Gemini.
  3. Model Agnostic: Built first for Google Gemini, but also works with OpenAI, local models via Ollama, and OpenAI-compatible APIs such as DeepSeek, and slots into LlamaIndex and LangChain workflows.

It is the robust, production-ready alternative to fragile regex patterns or unpredictable LLM prompting.
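Grounding in practice means every extraction can be audited against the source. A minimal pure-Python sketch of that check (the `Span` dataclass and `verify_grounding` helper here are illustrative, not LangExtract's actual API):

```python
from dataclasses import dataclass

@dataclass
class Span:
    # Illustrative stand-in for LangExtract's character-interval grounding.
    start: int
    end: int

def verify_grounding(source: str, extracted_text: str, span: Span) -> bool:
    # A grounded extraction must match the source exactly at the recorded
    # offsets, so every field can be traced back and verified.
    return source[span.start:span.end] == extracted_text

source = "ROMEO. But soft! What light through yonder window breaks?"
print(verify_grounding(source, "ROMEO", Span(0, 5)))  # True
```

This is the property that lets LangExtract's HTML visualizer highlight each extraction in place.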

Why Trust This Guide?

Official documentation is great, but real-world projects are messy. As developers using LangExtract in production, we built this guide to bridge the gap between "Hello World" and deployment.

We cover local LLM setups, cost optimization, and handling complex documents — things you won't find in the standard README.


Quick Start

Get up and running in 30 seconds.

1. Install

Install via pip. Requires Python 3.10+.

bash
pip install langextract

TIP

Use a virtual environment to avoid dependency conflicts: python -m venv venv && source venv/bin/activate

2. Configure API Key

By default, LangExtract uses Google Gemini. Get your key from Google AI Studio.

bash
export LANGEXTRACT_API_KEY="your-api-key-here"

3. Your First Extraction

Extract characters from a simple text using few-shot examples.

python
import langextract as lx

# Define your extraction prompt
prompt = "Extract characters and their emotions from the text."

# Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
        ]
    )
]

# Input text to process
text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo."

# Run extraction (uses Gemini Flash by default)
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

print(result.extractions)
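Each item in `result.extractions` carries the same fields as the `lx.data.Extraction` used in the few-shot example above. A runnable sketch of a typical inspection loop (the `result` object here is a stand-in built with `SimpleNamespace`; a real run returns it from `lx.extract`):

```python
from types import SimpleNamespace

# Stand-in for a real lx.extract result so the loop below is runnable;
# field names mirror the few-shot Extraction in the Quick Start.
result = SimpleNamespace(extractions=[
    SimpleNamespace(extraction_class="character",
                    extraction_text="Lady Juliet",
                    attributes={"emotional_state": "longing"}),
])

for extraction in result.extractions:
    print(f"{extraction.extraction_class}: {extraction.extraction_text}")
    for key, value in (extraction.attributes or {}).items():
        print(f"  {key} = {value}")
```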

4. Visualize the Results (Interactive Visualization) 📊

LangExtract's killer feature. Generate an interactive HTML report for easy verification.

python
# 1. Save as JSONL
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

# 2. Generate interactive HTML
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content.data if hasattr(html_content, 'data') else html_content)

View Visualization Example


Framework Integrations

LangExtract is designed to fit into your modern AI stack.

LlamaIndex Integration

Combine LlamaIndex's powerful RAG (Retrieval-Augmented Generation) capabilities with LangExtract's precise structuring. Use LlamaIndex to retrieve relevant document chunks, then pass them to LangExtract to ensure the final output is a clean, validated JSON object.
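The wiring is simple: retrieve, then structure. A conceptual sketch of that pipeline (`retrieve_chunks` is a placeholder for your LlamaIndex retriever, and `extract_structured` stands in for an `lx.extract` call on one chunk):

```python
def retrieve_chunks(query: str) -> list[str]:
    # Placeholder for a LlamaIndex retriever, e.g.
    # index.as_retriever().retrieve(query).
    return ["Lady Juliet gazed longingly at the stars."]

def extract_structured(chunk: str) -> dict:
    # Placeholder for lx.extract(...) on a single retrieved chunk;
    # in real code this returns grounded, schema-constrained extractions.
    return {"source_chunk": chunk, "characters": ["Juliet"]}

records = [extract_structured(c) for c in retrieve_chunks("Who pines for Romeo?")]
print(records)
```

Keeping retrieval and extraction as separate steps also means each record retains the chunk it came from, preserving provenance end to end.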

LangChain Support

Easily wrap LangExtract as a Runnable in your LangChain pipelines. Perfect for building complex agents that need to "read" documents and populate a database reliably.
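The wrapper shape is just a function from text to structured output. A sketch using plain function composition (in real code you would wrap `extract_step` in `langchain_core.runnables.RunnableLambda` and pipe it with `|`; `extract_step` here is a stand-in for `lx.extract`):

```python
def extract_step(text: str) -> dict:
    # Stand-in for lx.extract(...); returns a payload ready for downstream steps.
    return {"text": text, "entities": [w for w in text.split() if w.istitle()]}

def to_db_row(payload: dict) -> tuple:
    # Flatten the extraction payload into a row for a database insert.
    return (payload["text"], ",".join(payload["entities"]))

# Plain-Python composition standing in for:
# RunnableLambda(extract_step) | RunnableLambda(to_db_row)
row = to_db_row(extract_step("Romeo met Juliet in Verona"))
print(row)
```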


LLM Configuration Guide

LangExtract supports multiple backends. Here is how to configure them.

Local LLMs with Ollama 🏠

Great for privacy and zero cost.

  1. Install Ollama: ollama.com
  2. Pull a Model: ollama pull gemma2:2b
  3. Run Ollama Server: ollama serve
  4. Configure Code:
python
import langextract as lx

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemma2:2b",  # Automatically selects Ollama provider
    model_url="http://localhost:11434",
    fence_output=False,
    use_schema_constraints=False
)

OpenAI GPT-4 🧠

Best for complex reasoning tasks. Requires optional dependency: pip install langextract[openai]

bash
export OPENAI_API_KEY="sk-..."
python
import os
import langextract as lx

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",  # Automatically selects OpenAI provider
    api_key=os.environ.get('OPENAI_API_KEY'),
    fence_output=True,
    use_schema_constraints=False
)

Note

OpenAI models require fence_output=True and use_schema_constraints=False because LangExtract does not yet implement schema constraints for the OpenAI backend.

OpenAI-Compatible APIs 🔌

LangExtract works with any OpenAI-compatible API, including DeepSeek, Qwen, Doubao, and more.

python
import langextract as lx

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="deepseek-chat",
    api_key="your-api-key",
    language_model_params={
        "base_url": "https://api.deepseek.com/v1"  # Replace with provider's URL
    },
    fence_output=True,
    use_schema_constraints=False
)

Google Vertex AI Batch (Enterprise) 🏢

For large-scale tasks, enable Batch mode to save costs.

python
result = lx.extract(
    ...,
    language_model_params={"vertexai": True, "batch": {"enabled": True}}
)

Advanced Installation: Docker 🐳

Run without polluting your local environment (this assumes a local image tagged langextract, e.g. built from the repository's Dockerfile):

bash
docker run --rm -e LANGEXTRACT_API_KEY="your-key" langextract python your_script.py

Scaling to Longer Documents 📚

How do you handle books or PDFs larger than the context window? LangExtract features built-in chunking and parallel processing.

No need to split text manually. Just pass the URL or the long text:

python
# Example: Process the entire "Romeo and Juliet"
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    # Core parameters
    extraction_passes=3,    # Multiple passes to improve recall
    max_workers=20,         # Parallel workers for speed
    max_char_buffer=1000    # Context buffer size
)

Real-World Data Extraction Examples

Explore our collection of Google AI data extraction examples for common use cases.

🏥 Medical Report Extraction

Extract medication names, dosages, and frequencies from clinical notes. View Full Example

📚 Long Text Extraction

Handling PDFs or books that exceed token limits. View Full Example

🇯🇵 Multilingual Extraction

Working with non-English text (Japanese, Chinese). View Full Example


LangExtract vs. Alternatives

How does LangExtract compare to other tools like LlamaExtract or Docling?

Feature             LangExtract 🚀                 LlamaExtract              Docling
Core Focus          Structured Data Extraction     Document Parsing / RAG    Document Conversion (PDF to MD)
Grounding           Native (Character-level)
Schema Validation   Strict (few-shot schema)
Model Support       Gemini, OpenAI, any LLM        OpenAI Focus              Local / Cloud
Best For            Complex Schemas, Provenance    Fast RAG pipelines        Markdown Conversion

FAQ

Q: What is the difference between LangExtract and Docling? A: Docling specializes in parsing documents (like PDFs) into Markdown, handling layout analysis. LangExtract focuses on extracting structured data (like JSON) from text. They work great together: use Docling to parse, and LangExtract to structure the data.

Q: Is LangExtract an official Google product? A: Yes, LangExtract is an official Google open-source library (GitHub: google/langextract). This documentation guide is designed to help developers use it more effectively in production.

Q: Can I use DeepSeek, Groq, or other OpenAI-compatible models? A: Absolutely. LangExtract supports any model with an OpenAI-compatible API. Just set the base_url to your provider's endpoint. It works seamlessly with DeepSeek V3/R1, Groq, local vLLM, etc.

Q: How do I handle documents longer than the context window? A: LangExtract has built-in chunking mechanisms. Check out our Long Text Extraction Example to see how it automatically splits long texts, processes them (in parallel or sequence), and merges the results.

Q: Can I run this locally for privacy? A: Yes. Integrate with Ollama to run models like Llama 3 or Mistral locally. This is free and ensures data never leaves your machine, ideal for medical or legal data.

Q: Is LangExtract free? A: The library is 100% open-source and free. You only pay for the LLM API usage (e.g., Google Gemini, OpenAI). If run locally with Ollama, it operates completely cost-free.


For Chinese Users (中文用户)

Looking for a LangExtract tutorial (教程) or installation guide (安装指南)? This guide covers everything from running local models with Ollama to hands-on, tested examples.


Community Guide. Dedicated to open-source development.