Manage PDF Metadata with pypdf in Python

Posted: 2025-05-19 | Tags: Python, PDF

The Python library pypdf (formerly PyPDF2) allows you to retrieve, remove, and modify metadata in PDF files, including details such as author, title, and more.

py-pdf/pypdf: A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Contents

Install pypdf
Metadata Fields in PDF Files
Retrieve PDF Metadata
Remove PDF Metadata
- Remove All Metadata
- Remove Specific Metadata Fields
Add or Update PDF Metadata
Handle Password-Protected PDF Files
Read XMP Metadata (PDF 2.0)

The sample PDFs used in this article are available at the following link. All password-protected files use "password" as the password:

python-snippets/notebook/data/src/pdf

Install pypdf

pypdf has no external dependencies and can be installed via pip (or pip3). If you need support for AES encryption and decryption, install it with the [crypto] extra.

$ pip install pypdf
$ pip install "pypdf[crypto]"

The examples in this article use pypdf version 5.5.0.
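To confirm what is installed before running the examples, a small check like the one below may help. It is only a sketch and not part of the original sample code: installed_version is a hypothetical helper built on the standard library, and it assumes the [crypto] extra installs the cryptography package.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing.

    Hypothetical helper for illustration; it only queries package metadata
    and never imports the package itself.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "cryptography" is assumed to be what the [crypto] extra installs.
for name in ("pypdf", "cryptography"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If cryptography is reported as not installed, AES-encrypted files are the case most likely to fail.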

The library was previously known as PyPDF2; the project was renamed to pypdf in late 2022.

History of pypdf — pypdf 5.5.0 documentation

Metadata Fields in PDF Files

PDF is an ISO-standardized file format.

The ISO 32000-1 (PDF 1.7) standard is freely available. Metadata is explained in section "14.3 Metadata" (page 548) of the document below:

PDF32000_2008.pdf

Metadata is stored either in Metadata Streams or in the Document Information Dictionary. Common fields in the dictionary include:

Title: Document title
Author: Author
Subject: Subject
Keywords: Keywords
Creator: Tool that created the original document
Producer: Tool that converted the original document to PDF
CreationDate: Creation date
ModDate: Modification date
Trapped: Trapping status

These fields are optional, and custom fields may also be added.

While Author, Creator, and Producer may sound similar, they refer to different entities. Author refers to the person or organization who created the content. Creator and Producer refer to the software used to create or convert the document.

The ISO 32000-2 (PDF 2.0) specification, published in July 2017, deprecates most of the Document Information Dictionary. Metadata is expected to be stored in an Extensible Metadata Platform (XMP) metadata stream instead.

PDF 2.0: The worldwide standard for electronic documents has evolved – PDF Association

The following examples mainly cover PDF 1.7 and earlier. PDF 2.0 handling is briefly discussed at the end.

Retrieve PDF Metadata

You can access metadata from the Document Information Dictionary using the metadata attribute of the PdfReader object.

Create a PdfReader object by passing the file path to its constructor. The metadata attribute returns an instance of DocumentInformation.

The DocumentInformation Class — pypdf 5.5.0 documentation

You can access fields like title and author as attributes:

import pypdf

print(pypdf.__version__)
# 5.5.0

pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')

print(type(pdf.metadata))
# <class 'pypdf._doc_common.DocumentInformation'>

print(pdf.metadata.title)
# sample1

source: pypdf_metadata_get.py

Since DocumentInformation is a subclass of dict, you can use dictionary-style access and methods like items():

Iterate Over Dictionary Keys, Values, and Items in Python

Some files may return IndirectObject(...) when printed directly. You can still extract values by using the keys. The following example uses a file created with Apple Keynote and exported to PDF.

print(isinstance(pdf.metadata, dict))
# True

print(pdf.metadata)
# {'/Title': IndirectObject(33, 0, 4424533392), '/Producer': IndirectObject(34, 0, 4424533392), '/Creator': IndirectObject(35, 0, 4424533392), '/CreationDate': IndirectObject(36, 0, 4424533392), '/ModDate': IndirectObject(36, 0, 4424533392)}

print(pdf.metadata['/Title'])
# sample1

for k, v in pdf.metadata.items():
    print(f'{k}: {v}')
# /Title: sample1
# /Producer: macOS バージョン10.14.2（ビルド18C54） Quartz PDFContext
# /Creator: Keynote
# /CreationDate: D:20190114072947Z00'00'
# /ModDate: D:20190114072947Z00'00'

source: pypdf_metadata_get.py

Note that keys in the metadata dictionary begin with a leading slash (e.g., /Title).
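The date fields in the output above use the PDF date format (D:YYYYMMDDHHmmSS followed by a time zone offset). As a rough sketch of how such a string maps to a datetime, the hypothetical parse_pdf_date below handles only the full form with seconds. It is not pypdf's own parser.

```python
import re
from datetime import datetime, timedelta, timezone

# Full form only: D:YYYYMMDDHHmmSS, optionally followed by Z, + or - and
# an offset such as 09'00'. Shorter date forms are not handled here.
PDF_DATE = re.compile(
    r"D:(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})"
    r"(?P<sign>[Z+-])?(?:(?P<tzh>\d{2})'?)?(?:(?P<tzm>\d{2})'?)?"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a PDF date string to a datetime (hypothetical helper)."""
    m = PDF_DATE.fullmatch(value)
    if m is None:
        raise ValueError(f'unsupported PDF date: {value!r}')
    offset = timedelta(hours=int(m['tzh'] or 0), minutes=int(m['tzm'] or 0))
    if m['sign'] == '-':
        offset = -offset
    # Without a Z/+/- marker the result is a naive datetime.
    tz = timezone(offset) if m['sign'] else None
    return datetime(
        int(m['year']), int(m['month']), int(m['day']),
        int(m['hour']), int(m['minute']), int(m['second']),
        tzinfo=tz,
    )


print(parse_pdf_date("D:20190114072947Z00'00'").isoformat())
```

In practice you may not need this: DocumentInformation also provides creation_date and modification_date attributes that return datetime objects.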

Remove PDF Metadata

Remove All Metadata

To remove all metadata from a PDF:

Load the original file using PdfReader.
Create a PdfWriter object with the contents cloned from the reader.
Set metadata to None.
Save the result using write().

src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)

dst_pdf.metadata = None

dst_pdf.write('data/temp/sample1_no_meta.pdf')

source: pypdf_metadata_remove.py

Remove Specific Metadata Fields

To safely remove metadata fields, make a copy with dict() and remove items from the copy with the pop() method or the del statement.

Remove an Item from a Dictionary in Python: pop, popitem, clear, del

For example, remove /Creator and /Producer:

src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)

metadata = dict(src_pdf.metadata)

print(metadata.keys())
# dict_keys(['/Title', '/Producer', '/Creator', '/CreationDate', '/ModDate'])

metadata.pop('/Creator')
del metadata['/Producer']

print(metadata.keys())
# dict_keys(['/Title', '/CreationDate', '/ModDate'])

source: pypdf_metadata_remove.py

Assign this dictionary to the metadata attribute and save it using the write() method.

dst_pdf.metadata = metadata
dst_pdf.write('data/temp/sample1_remove_meta.pdf')

source: pypdf_metadata_remove.py

Add or Update PDF Metadata

You can add or update metadata as follows:

Load the original file with PdfReader.
Create a PdfWriter object with the contents cloned from the reader.
Use add_metadata() with a dictionary of new/updated fields.
Save the result using write().

The add_metadata() method replaces existing values for any specified keys.

src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)

new_metadata = {
    '/Title': 'new title',
    '/Producer': 'new producer',
    '/NewItem': 'special data'
}

dst_pdf.add_metadata(new_metadata)
dst_pdf.write('data/temp/sample1_new_meta.pdf')

print(pypdf.PdfReader('data/temp/sample1_new_meta.pdf').metadata)
# {'/Title': 'new title', '/Producer': 'new producer', '/Creator': IndirectObject(35, 0, 4398476304), '/CreationDate': IndirectObject(36, 0, 4398476304), '/ModDate': IndirectObject(36, 0, 4398476304), '/NewItem': 'special data'}

source: pypdf_metadata_set.py

To completely replace existing metadata, assign the dictionary directly to the metadata attribute.

dst_pdf.metadata = new_metadata
dst_pdf.write('data/temp/sample1_new_meta_replace.pdf')

print(pypdf.PdfReader('data/temp/sample1_new_meta_replace.pdf').metadata)
# {'/Title': 'new title', '/Producer': 'new producer', '/NewItem': 'special data'}

source: pypdf_metadata_set.py
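When you update /CreationDate or /ModDate, the value is expected to be a string in the PDF date format shown in the earlier output. The hypothetical to_pdf_date helper below is one way to build such a string from a timezone-aware datetime using only the standard library. Treat it as a sketch rather than an official pypdf utility.

```python
from datetime import datetime, timedelta, timezone


def to_pdf_date(dt: datetime) -> str:
    """Format a timezone-aware datetime as D:YYYYMMDDHHmmSS+HH'mm'."""
    offset = dt.utcoffset()
    if offset is None:
        raise ValueError('a timezone-aware datetime is required')
    total_minutes = int(offset.total_seconds() // 60)
    sign = '+' if total_minutes >= 0 else '-'
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"D:{dt:%Y%m%d%H%M%S}{sign}{hours:02d}'{minutes:02d}'"


now = datetime.now(timezone.utc)
print(to_pdf_date(now))
```

The resulting string can then be used as the value for '/ModDate' in the dictionary passed to add_metadata().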

Handle Password-Protected PDF Files

Reading metadata from an encrypted PDF file raises an error unless the file is decrypted first.

After creating a PdfReader, call decrypt() with the password.

# src_path is the path to the encrypted PDF; password is its password (str).
src_pdf = pypdf.PdfReader(src_path)
src_pdf.decrypt(password)

To encrypt the output PDF, use encrypt() before calling write().

# Encrypt before writing; dst_path is the output file path.
dst_pdf.encrypt(password)
dst_pdf.write(dst_path)

For more on passwords and encryption, see:

Encrypt and Decrypt PDFs with pypdf in Python

Read XMP Metadata (PDF 2.0)

As mentioned earlier, PDF 2.0 stores metadata using the XMP format.

Here’s an example using a sample file from the repository below:

In this case, metadata is None.

pdf = pypdf.PdfReader('data/temp/Simple PDF 2.0 file.pdf')
print(pdf.metadata)
# None

source: pypdf_metadata_xmp.py

Use the xmp_metadata attribute to access XMP metadata. This returns an XmpInformation object:

The XmpInformation Class — pypdf 5.5.0 documentation

You can extract various fields from it:

print(type(pdf.xmp_metadata))
# <class 'pypdf.xmp.XmpInformation'>

print(pdf.xmp_metadata.dc_title)
# {'x-default': 'A simple PDF 2.0 example file'}

print(pdf.xmp_metadata.pdf_keywords)
# PDF 2.0 sample example

print(pdf.xmp_metadata.xmp_metadata_date)
# 2017-07-11 07:55:11

source: pypdf_metadata_xmp.py

As of pypdf 5.5.0, there is no built-in method like add_metadata() for writing XMP metadata.

The PdfWriter Class — pypdf 5.5.0 documentation
