How to load PDFs
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.
This guide covers how to load PDF
documents into the LangChain Document format that we use downstream.
LangChain integrates with a host of PDF parsers. Some are simple and relatively low-level; others will support OCR and image-processing, or perform advanced document layout analysis. The right choice will depend on your application. Below we enumerate the possibilities.
Using PyPDF
Here we load a PDF using pypdf
into array of documents, where each document contains the page content and metadata with page
number.
%pip install pypdf
from langchain_community.document_loaders import PyPDFLoader
file_path = (
"../../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf"
)
loader = PyPDFLoader(file_path)
pages = loader.load_and_split()
pages[0]
Document(page_content='LayoutParser : A Unified Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model configurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\nefforts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines like natural language processing and computer\nvision, none of them are optimized for challenges in the domain of DIA.\nThis represents a major gap in the existing toolkit, as DIA is central to\nacademic research across a wide range of disciplines in the social sciences\nand humanities. This paper introduces LayoutParser , an open-source\nlibrary for streamlining the usage of DL in DIA research and applica-\ntions. The core LayoutParser library comes with a set of simple and\nintuitive interfaces for applying and customizing DL models for layout de-\ntection, character recognition, and many other document processing tasks.\nTo promote extensibility, LayoutParser also incorporates a community\nplatform for sharing both pre-trained models and full document digiti-\nzation pipelines. We demonstrate that LayoutParser is helpful for both\nlightweight and large-scale digitization pipelines in real-word use cases.\nThe library is publicly available at https://layout-parser.github.io .\nKeywords: Document Image Analysis ·Deep Learning ·Layout Analysis\n·Character Recognition ·Open Source library ·Toolkit.\n1 Introduction\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\ndocument image analysis (DIA) tasks including document image classification [ 11,arXiv:2103.15348v2 [cs.CV] 21 Jun 2021', metadata={'source': '../../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf', 'page': 0})
An advantage of this approach is that documents can be retrieved with page numbers.
Vector search over PDFs
Once we have loaded PDFs into LangChain Document
objects, we can index them (e.g., a RAG application) in the usual way:
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
docs = faiss_index.similarity_search("What is LayoutParser?", k=2)
for doc in docs:
print(str(doc.metadata["page"]) + ":", doc.page_content[:300])
13: 14 Z. Shen et al.
6 Conclusion
LayoutParser provides a comprehensive toolkit for deep learning-based document
image analysis. The off-the-shelf library is easy to install, and can be used to
build flexible and accurate pipelines for processing documents with complicated
structures. It also supports hi
0: LayoutParser : A Unified Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1( ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1Allen Institute for AI
shannons@allenai.org
2Brown University
ruochen zhang@brown.edu
3Harvard University
Extract text from images
Some PDFs contain images of text-- e.g., within scanned documents, or figures. Using the rapidocr-onnxruntime
package we can extract images as text as well:
%pip install rapidocr-onnxruntime
loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf", extract_images=True)
pages = loader.load()
pages[4].page_content
'LayoutParser : A Unified Toolkit for DL-Based DIA 5\nTable 1: Current layout detection models in the LayoutParser model zoo\nDataset Base Model1Large Model Notes\nPubLayNet [38] F / M M Layouts of modern scientific documents\nPRImA [3] M - Layouts of scanned modern magazines and scientific reports\nNewspaper [17] F - Layouts of scanned US newspapers from the 20th century\nTableBank [18] F F Table region on modern scientific and business document\nHJDataset [31] F / M - Layouts of history Japanese documents\n1For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy\nvs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101\nbackbones [ 13], respectively. One can train models of different architectures, like Faster R-CNN [ 28] (F) and Mask\nR-CNN [ 12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained\nusing the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model\nzoo in coming months.\nlayout data structures , which are optimized for efficiency and versatility. 3) When\nnecessary, users can employ existing or customized OCR models via the unified\nAPI provided in the OCR module . 4)LayoutParser comes with a set of utility\nfunctions for the visualization and storage of the layout data. 5) LayoutParser\nis also highly customizable, via its integration with functions for layout data\nannotation and model training . We now provide detailed descriptions for each\ncomponent.\n3.1 Layout Detection Models\nInLayoutParser , a layout model takes a document image as an input and\ngenerates a list of rectangular boxes for the target content regions. Different\nfrom traditional methods, it relies on deep convolutional neural networks rather\nthan manually curated rules to identify content regions. It is formulated as an\nobject detection problem and state-of-the-art models like Faster R-CNN [ 28] and\nMask R-CNN [ 12] are used. This yields prediction results of high accuracy and\nmakes it possible to build a concise, generalized interface for layout detection.\nLayoutParser , built upon Detectron2 [ 35], provides a minimal API that can\nperform layout detection with only four lines of code in Python:\n1import layoutparser as lp\n2image = cv2. imread (" image_file ") # load images\n3model = lp. Detectron2LayoutModel (\n4 "lp :// PubLayNet / faster_rcnn_R_50_FPN_3x / config ")\n5layout = model . detect ( image )\nLayoutParser provides a wealth of pre-trained model weights using various\ndatasets covering different languages, time periods, and document types. Due to\ndomain shift [ 7], the prediction performance can notably drop when models are ap-\nplied to target samples that are significantly different from the training dataset. As\ndocument structures and layouts vary greatly in different domains, it is important\nto select models trained on a dataset similar to the test samples. A semantic syntax\nis used for initializing the model weights in LayoutParser , using both the dataset\nname and model name lp://<dataset-name>/<model-architecture-name> .'
Using PyMuPDF
This is the fastest of the PDF parsing options, and contains detailed metadata about the PDF and its pages, as well as returns one document per page.
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("example_data/layout-parser-paper.pdf")
data = loader.load()
data[0]
Additionally, you can pass along any of the options from the PyMuPDF documentation as keyword arguments in the load
call, and it will be pass along to the get_text()
call.
Using MathPix
Inspired by Daniel Gross's https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21
from langchain_community.document_loaders import MathpixPDFLoader
file_path = (
"../../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf"
)
loader = MathpixPDFLoader(file_path)
data = loader.load()
Using Unstructured
Unstructured supports a common interface for working with unstructured or semi-structured file formats, such as Markdown or PDF. LangChain's UnstructuredPDFLoader integrates with Unstructured to parse PDF documents into LangChain Document objects.
from langchain_community.document_loaders import UnstructuredPDFLoader
file_path = (
"../../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf"
)
loader = UnstructuredPDFLoader(file_path)
data = loader.load()
Retain Elements
Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying mode="elements"
.
file_path = (
"../../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf"
)
loader = UnstructuredPDFLoader(file_path, mode="elements")
data = loader.load()
data[0]
Document(page_content='1 2 0 2', metadata={'source': '../../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': '../../../docs/integrations/document_loaders/example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-03-18T13:22:22', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText'})
See the full set of element types for this particular document:
set(doc.metadata["category"] for doc in data)
{'ListItem', 'NarrativeText', 'Title', 'UncategorizedText'}
Fetching remote PDFs using Unstructured
This covers how to load online PDFs into a document format that we can use downstream. This can be used for various online PDF sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
Note: all other PDF loaders can also be used to fetch remote PDFs, but OnlinePDFLoader
is a legacy function, and works specifically with UnstructuredPDFLoader
.
from langchain_community.document_loaders import OnlinePDFLoader
loader = OnlinePDFLoader("https://arxiv.org/pdf/2302.03803.pdf")
data = loader.load()
Using PyPDFium2
from langchain_community.document_loaders import PyPDFium2Loader
file_path = (
"../../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf"
)
loader = PyPDFium2Loader(file_path)
data = loader.load()
Using PDFMiner
from langchain_community.document_loaders import PDFMinerLoader
file_path = (
"../../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf"
)
loader = PDFMinerLoader(file_path)
data = loader.load()
Using PDFMiner to generate HTML text
This can be helpful for chunking texts semantically into sections as the output html content can be parsed via BeautifulSoup
to get more structured and rich information about font size, page numbers, PDF headers/footers, etc.
from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader
file_path = (
"../../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf"
)
loader = PDFMinerPDFasHTMLLoader(file_path)
data = loader.load()[0]
from bs4 import BeautifulSoup
soup = BeautifulSoup(data.page_content, "html.parser")
content = soup.find_all("div")
import re
cur_fs = None
cur_text = ""
snippets = [] # first collect all snippets that have the same font size
for c in content:
sp = c.find("span")
if not sp:
continue
st = sp.get("style")
if not st:
continue
fs = re.findall("font-size:(\d+)px", st)
if not fs:
continue
fs = int(fs[0])
if not cur_fs:
cur_fs = fs
if fs == cur_fs:
cur_text += c.text
else:
snippets.append((cur_text, cur_fs))
cur_fs = fs
cur_text = c.text
snippets.append((cur_text, cur_fs))
# Note: The above logic is very straightforward. One can also add more strategies such as removing duplicate snippets (as
# headers/footers in a PDF appear on multiple pages so if we find duplicates it's safe to assume that it is redundant info)
from langchain.docstore.document import Document
cur_idx = -1
semantic_snippets = []
# Assumption: headings have higher font size than their respective content
for s in snippets:
# if current snippet's font size > previous section's heading => it is a new heading
if (
not semantic_snippets
or s[1] > semantic_snippets[cur_idx].metadata["heading_font"]
):
metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
metadata.update(data.metadata)
semantic_snippets.append(Document(page_content="", metadata=metadata))
cur_idx += 1
continue
# if current snippet's font size <= previous section's content => content belongs to the same section (one can also create
# a tree like structure for sub sections if needed but that may require some more thinking and may be data specific)
if (
not semantic_snippets[cur_idx].metadata["content_font"]
or s[1] <= semantic_snippets[cur_idx].metadata["content_font"]
):
semantic_snippets[cur_idx].page_content += s[0]
semantic_snippets[cur_idx].metadata["content_font"] = max(
s[1], semantic_snippets[cur_idx].metadata["content_font"]
)
continue
# if current snippet's font size > previous section's content but less than previous section's heading than also make a new
# section (e.g. title of a PDF will have the highest font size but we don't want it to subsume all sections)
metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
metadata.update(data.metadata)
semantic_snippets.append(Document(page_content="", metadata=metadata))
cur_idx += 1
semantic_snippets[4]
Document(page_content='Recently, various DL models and datasets have been developed for layout analysis\ntasks. The dhSegment [22] utilizes fully convolutional networks [20] for segmen-\ntation tasks on historical documents. Object detection-based methods like Faster\nR-CNN [28] and Mask R-CNN [12] are used for identifying document elements [38]\nand detecting tables [30, 26]. Most recently, Graph Neural Networks [29] have also\nbeen used in table detection [27]. However, these models are usually implemented\nindividually and there is no unified framework to load and use such models.\nThere has been a surge of interest in creating open-source tools for document\nimage processing: a search of document image analysis in Github leads to 5M\nrelevant code pieces 6; yet most of them rely on traditional rule-based methods\nor provide limited functionalities. The closest prior research to our work is the\nOCR-D project7, which also tries to build a complete toolkit for DIA. However,\nsimilar to the platform developed by Neudecker et al. [21], it is designed for\nanalyzing historical documents, and provides no supports for recent DL models.\nThe DocumentLayoutAnalysis project8 focuses on processing born-digital PDF\ndocuments via analyzing the stored PDF data. Repositories like DeepLayout9\nand Detectron2-PubLayNet10 are individual deep learning models trained on\nlayout analysis datasets without support for the full DIA pipeline. The Document\nAnalysis and Exploitation (DAE) platform [15] and the DeepDIVA project [2]\naim to improve the reproducibility of DIA methods (or DL models), yet they\nare not actively maintained. OCR engines like Tesseract [14], easyOCR11 and\npaddleOCR12 usually do not come with comprehensive functionalities for other\nDIA tasks like layout analysis.\nRecent years have also seen numerous efforts to create libraries for promoting\nreproducibility and reusability in the field of DL. Libraries like Dectectron2 [35],\n6 The number shown is obtained by specifying the search type as ‘code’.\n7 https://ocr-d.de/en/about\n8 https://github.com/BobLd/DocumentLayoutAnalysis\n9 https://github.com/leonlulu/DeepLayout\n10 https://github.com/hpanwar08/detectron2\n11 https://github.com/JaidedAI/EasyOCR\n12 https://github.com/PaddlePaddle/PaddleOCR\n4\nZ. Shen et al.\nFig. 1: The overall architecture of LayoutParser. For an input document image,\nthe core LayoutParser library provides a set of off-the-shelf tools for layout\ndetection, OCR, visualization, and storage, backed by a carefully designed layout\ndata structure. LayoutParser also supports high level customization via efficient\nlayout annotation and model training functions. These improve model accuracy\non the target samples. The community platform enables the easy sharing of DIA\nmodels and whole digitization pipelines to promote reusability and reproducibility.\nA collection of detailed documentation, tutorials and exemplar projects make\nLayoutParser easy to learn and use.\nAllenNLP [8] and transformers [34] have provided the community with complete\nDL-based support for developing and deploying models for general computer\nvision and natural language processing problems. LayoutParser, on the other\nhand, specializes specifically in DIA tasks. LayoutParser is also equipped with a\ncommunity platform inspired by established model hubs such as Torch Hub [23]\nand TensorFlow Hub [1]. It enables the sharing of pretrained models as well as\nfull document processing pipelines that are unique to DIA tasks.\nThere have been a variety of document data collections to facilitate the\ndevelopment of DL models. Some examples include PRImA [3](magazine layouts),\nPubLayNet [38](academic paper layouts), Table Bank [18](tables in academic\npapers), Newspaper Navigator Dataset [16, 17](newspaper figure layouts) and\nHJDataset [31](historical Japanese document layouts). A spectrum of models\ntrained on these datasets are currently available in the LayoutParser model zoo\nto support different use cases.\n', metadata={'heading': '2 Related Work\n', 'content_font': 9, 'heading_font': 11, 'source': '../../../docs/integrations/document_loaders/example_data/layout-parser-paper.pdf'})
PyPDF Directory
Load PDFs from directory
from langchain_community.document_loaders import PyPDFDirectoryLoader
directory_path = "../../../docs/integrations/document_loaders/example_data/"
loader = PyPDFDirectoryLoader("example_data/")
docs = loader.load()
Using PDFPlumber
Like PyMuPDF, the output Documents contain detailed metadata about the PDF and its pages, and returns one document per page.
from langchain_community.document_loaders import PDFPlumberLoader
data = loader.load()
data[0]
Using AmazonTextractPDFParser
The AmazonTextractPDFLoader calls the Amazon Textract Service to convert PDFs into a Document structure. The loader does pure OCR at the moment, with more features like layout support planned, depending on demand. Single and multi-page documents are supported with up to 3000 pages and 512 MB of size.
For the call to be successful an AWS account is required, similar to the AWS CLI requirements.
Besides the AWS configuration, it is very similar to the other PDF loaders, while also supporting JPEG, PNG and TIFF and non-native PDF formats.
from langchain_community.document_loaders import AmazonTextractPDFLoader
loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
documents = loader.load()
Using AzureAIDocumentIntelligenceLoader
Azure AI Document Intelligence (formerly known as Azure Form Recognizer
) is machine-learning
based service that extracts texts (including handwriting), tables, document structures (e.g., titles, section headings, etc.) and key-value-pairs from
digital or scanned PDFs, images, Office and HTML files. Document Intelligence supports PDF
, JPEG/JPG
, PNG
, BMP
, TIFF
, HEIF
, DOCX
, XLSX
, PPTX
and HTML
.
This current implementation of a loader using Document Intelligence
can incorporate content page-wise and turn it into LangChain documents. The default output format is markdown, which can be easily chained with MarkdownHeaderTextSplitter
for semantic document chunking. You can also use mode="single"
or mode="page"
to return pure texts in a single page or document split by page.
Prerequisite
An Azure AI Document Intelligence resource in one of the 3 preview regions: East US, West US2, West Europe - follow this document to create one if you don't have. You will be passing <endpoint>
and <key>
as parameters to the loader.
%pip install --upgrade --quiet langchain langchain-community azure-ai-documentintelligence
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
file_path = "<filepath>"
endpoint = "<endpoint>"
key = "<key>"
loader = AzureAIDocumentIntelligenceLoader(
api_endpoint=endpoint, api_key=key, file_path=file_path, api_model="prebuilt-layout"
)
documents = loader.load()