Manage PDF Metadata with pypdf in Python

Posted: | Tags: Python, PDF

The Python library pypdf (formerly PyPDF2) allows you to retrieve, remove, and modify metadata in PDF files, including details such as author, title, and more.

The sample PDFs used in this article are available at the following link. Every password-protected file uses the literal string "password" as its password:

Install pypdf

pypdf has no external dependencies and can be installed via pip (or pip3). If you need support for AES encryption and decryption, install it with the [crypto] extra.

$ pip install pypdf
$ pip install pypdf[crypto]

The examples in this article use pypdf version 5.5.0.

The library was known as PyPDF2 until it was renamed to pypdf with version 3.1.0, released in December 2022.

Metadata Fields in PDF Files

PDF is an ISO-standardized file format.

The ISO 32000-1 (PDF 1.7) standard is freely available. Metadata is explained in section "14.3 Metadata" (page 548) of the document below:

Metadata is stored either in Metadata Streams or in the Document Information Dictionary. Common fields in the dictionary include:

  • Title: Document title
  • Author: Author
  • Subject: Subject
  • Keywords: Keywords
  • Creator: Tool that created the original document
  • Producer: Tool that converted the original document to PDF
  • CreationDate: Creation date
  • ModDate: Modification date
  • Trapped: Trapping status

These fields are optional, and custom fields may also be added.

While Author, Creator, and Producer may sound similar, they refer to different entities. Author refers to the person or organization who created the content. Creator and Producer refer to the software used to create or convert the document.

ISO 32000-2 (PDF 2.0), published in July 2017, deprecates most of the Document Information Dictionary and stores metadata in an Extensible Metadata Platform (XMP) metadata stream instead.

The following examples mainly cover PDF 1.7 and earlier. PDF 2.0 handling is briefly discussed at the end.

Retrieve PDF Metadata

You can access metadata from the Document Information Dictionary using the metadata attribute of the PdfReader object.

Create a PdfReader object by passing the file path to its constructor. The metadata attribute returns an instance of DocumentInformation.

You can access fields like title and author as attributes:

import pypdf

print(pypdf.__version__)
# 5.5.0

pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')

print(type(pdf.metadata))
# <class 'pypdf._doc_common.DocumentInformation'>

print(pdf.metadata.title)
# sample1

Since DocumentInformation is a subclass of dict, you can also use dictionary-style access and methods such as items().

For some files, printing the metadata object directly shows the values as IndirectObject(...). Looking a value up by its key still returns the actual content. The following example uses a file created with Apple Keynote and exported to PDF:

print(isinstance(pdf.metadata, dict))
# True

print(pdf.metadata)
# {'/Title': IndirectObject(33, 0, 4424533392), '/Producer': IndirectObject(34, 0, 4424533392), '/Creator': IndirectObject(35, 0, 4424533392), '/CreationDate': IndirectObject(36, 0, 4424533392), '/ModDate': IndirectObject(36, 0, 4424533392)}

print(pdf.metadata['/Title'])
# sample1

for k, v in pdf.metadata.items():
    print(f'{k}: {v}')
# /Title: sample1
# /Producer: macOS バージョン10.14.2(ビルド18C54) Quartz PDFContext
# /Creator: Keynote
# /CreationDate: D:20190114072947Z00'00'
# /ModDate: D:20190114072947Z00'00'

Note that keys in the metadata dictionary begin with a leading slash (e.g., /Title).
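
The /CreationDate and /ModDate values in the output above are PDF date strings of the form D:YYYYMMDDHHmmSS followed by a time zone offset. As a minimal sketch, such a string could be converted to a datetime with the standard library alone. The helper name parse_pdf_date is hypothetical and not part of pypdf, and the pattern only assumes the common layout shown above, not every malformed variant found in real files.

```python
import re
from datetime import datetime, timedelta, timezone

# Everything after the year is optional in a PDF date string.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[Zz+-])?(?:(?P<tz_hour>\d{2})'?)?(?:(?P<tz_minute>\d{2})'?)?$"
)


def parse_pdf_date(text):
    """Convert a PDF date string to a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(text)
    if match is None:
        raise ValueError(f'not a PDF date string: {text!r}')
    p = match.groupdict()
    dt = datetime(
        int(p['year']), int(p['month'] or 1), int(p['day'] or 1),
        int(p['hour'] or 0), int(p['minute'] or 0), int(p['second'] or 0),
    )
    if p['sign'] is None:
        return dt  # no time zone given: return a naive datetime
    if p['sign'] in 'Zz':
        offset = timedelta(0)  # Z means UTC
    else:
        offset = timedelta(hours=int(p['tz_hour'] or 0),
                           minutes=int(p['tz_minute'] or 0))
        if p['sign'] == '-':
            offset = -offset
    return dt.replace(tzinfo=timezone(offset))


print(parse_pdf_date("D:20190114072947Z00'00'"))
# 2019-01-14 07:29:47+00:00
```

pypdf also exposes parsed dates directly through the creation_date and modification_date attributes of DocumentInformation, so a helper like this is mainly useful when you only have the raw string.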

Remove PDF Metadata

Remove All Metadata

To remove all metadata from a PDF:

  1. Load the original file using PdfReader.
  2. Create a PdfWriter object with the contents cloned from the reader.
  3. Set metadata to None.
  4. Save the result using write().

src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)

dst_pdf.metadata = None

dst_pdf.write('data/temp/sample1_no_meta.pdf')

Remove Specific Metadata Fields

To remove individual fields safely, copy the metadata into a plain dictionary with dict(), then delete keys with the pop() method or the del statement.

For example, remove /Creator and /Producer:

src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)

metadata = dict(src_pdf.metadata)

print(metadata.keys())
# dict_keys(['/Title', '/Producer', '/Creator', '/CreationDate', '/ModDate'])

metadata.pop('/Creator')
del metadata['/Producer']

print(metadata.keys())
# dict_keys(['/Title', '/CreationDate', '/ModDate'])

Assign this dictionary to the metadata attribute and save it using the write() method.

dst_pdf.metadata = metadata
dst_pdf.write('data/temp/sample1_remove_meta.pdf')
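
Because every metadata field is optional, a key such as /Creator may be absent, in which case pop() without a default and del both raise KeyError. The following is a minimal sketch of a more forgiving alternative. The helper name drop_fields is hypothetical, and it works on a plain dictionary, so its result can be assigned to the metadata attribute in the same way as above.

```python
def drop_fields(metadata, *names):
    """Return a copy of metadata without the given fields (hypothetical helper)."""
    # Accept 'Creator' as well as '/Creator' for convenience.
    unwanted = {name if name.startswith('/') else '/' + name for name in names}
    # Keys that are not present are simply ignored.
    return {key: value for key, value in metadata.items() if key not in unwanted}


metadata = {
    '/Title': 'sample1',
    '/Producer': 'some producer',
    '/Creator': 'Keynote',
}

cleaned = drop_fields(metadata, '/Creator', 'Producer', '/Subject')
print(cleaned)
# {'/Title': 'sample1'}
```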

Add or Update PDF Metadata

You can add or update metadata as follows:

  1. Load the original file with PdfReader.
  2. Create a PdfWriter object with the contents cloned from the reader.
  3. Use add_metadata() with a dictionary of new/updated fields.
  4. Save the result using write().

The add_metadata() method replaces existing values for any specified keys and leaves all other fields unchanged.

src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)

new_metadata = {
    '/Title': 'new title',
    '/Producer': 'new producer',
    '/NewItem': 'special data'
}

dst_pdf.add_metadata(new_metadata)
dst_pdf.write('data/temp/sample1_new_meta.pdf')

print(pypdf.PdfReader('data/temp/sample1_new_meta.pdf').metadata)
# {'/Title': 'new title', '/Producer': 'new producer', '/Creator': IndirectObject(35, 0, 4398476304), '/CreationDate': IndirectObject(36, 0, 4398476304), '/ModDate': IndirectObject(36, 0, 4398476304), '/NewItem': 'special data'}

To completely replace existing metadata, assign the dictionary directly to the metadata attribute.

dst_pdf.metadata = new_metadata
dst_pdf.write('data/temp/sample1_new_meta_replace.pdf')

print(pypdf.PdfReader('data/temp/sample1_new_meta_replace.pdf').metadata)
# {'/Title': 'new title', '/Producer': 'new producer', '/NewItem': 'special data'}
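
If you update /CreationDate or /ModDate in this way, the value should itself be a PDF date string in the same D:YYYYMMDDHHmmSS format that appears in the Keynote output earlier. The sketch below produces one from a datetime. The helper name format_pdf_date is hypothetical, and treating naive datetimes as UTC is an assumption of this sketch, not something pypdf prescribes.

```python
from datetime import datetime, timedelta, timezone


def format_pdf_date(dt):
    """Format a datetime as a PDF date string (hypothetical helper)."""
    if dt.tzinfo is None:
        # Assumption of this sketch: treat naive datetimes as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    offset_minutes = int(dt.utcoffset().total_seconds() // 60)
    sign = '-' if offset_minutes < 0 else '+'
    hours, minutes = divmod(abs(offset_minutes), 60)
    return f"D:{dt:%Y%m%d%H%M%S}{sign}{hours:02d}'{minutes:02d}'"


jst = timezone(timedelta(hours=9))
print(format_pdf_date(datetime(2024, 1, 2, 3, 4, 5, tzinfo=jst)))
# D:20240102030405+09'00'

# A dictionary like this could then be passed to add_metadata().
date_metadata = {'/ModDate': format_pdf_date(datetime.now(timezone.utc))}
```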

Handle Password-Protected PDF Files

Reading an encrypted PDF file raises an error unless you decrypt it first.

After creating a PdfReader, call decrypt() with the password.

src_pdf = pypdf.PdfReader(src_path)
src_pdf.decrypt(password)

To encrypt the output PDF, use encrypt() before calling write().

dst_pdf.encrypt(password)
dst_pdf.write(dst_path)

For more on passwords and encryption, see:

Read XMP Metadata (PDF 2.0)

As mentioned earlier, PDF 2.0 stores metadata using the XMP format.

Here’s an example using a sample file from the repository below:

In this case, metadata is None.

pdf = pypdf.PdfReader('data/temp/Simple PDF 2.0 file.pdf')
print(pdf.metadata)
# None

Use the xmp_metadata attribute to access XMP metadata. It returns an XmpInformation object, from which you can read individual fields:

print(type(pdf.xmp_metadata))
# <class 'pypdf.xmp.XmpInformation'>

print(pdf.xmp_metadata.dc_title)
# {'x-default': 'A simple PDF 2.0 example file'}

print(pdf.xmp_metadata.pdf_keywords)
# PDF 2.0 sample example

print(pdf.xmp_metadata.xmp_metadata_date)
# 2017-07-11 07:55:11
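
A script that has to handle both older files and PDF 2.0 files could try the Document Information Dictionary first and fall back to XMP. The following is a minimal sketch of that idea. The function name get_title is hypothetical, and the demonstration uses a SimpleNamespace stand-in instead of a real PdfReader so that it runs without any sample file.

```python
from types import SimpleNamespace


def get_title(reader):
    """Return the title from the Info dictionary or, failing that, from XMP."""
    info = reader.metadata
    if info is not None and info.title:
        return info.title

    xmp = reader.xmp_metadata
    if xmp is not None and xmp.dc_title:
        titles = xmp.dc_title  # maps language codes to titles
        return titles.get('x-default') or next(iter(titles.values()))

    return None


# Stand-in for a PdfReader of a PDF 2.0 file: no Info dictionary, XMP only.
fake_reader = SimpleNamespace(
    metadata=None,
    xmp_metadata=SimpleNamespace(
        dc_title={'x-default': 'A simple PDF 2.0 example file'},
    ),
)
print(get_title(fake_reader))
# A simple PDF 2.0 example file
```

With a real file, the same function would be called as get_title(pypdf.PdfReader(path)).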

As of pypdf 5.5.0, there is no built-in method like add_metadata() for writing XMP metadata.
