Manage PDF Metadata with pypdf in Python
The Python library pypdf (formerly PyPDF2) allows you to retrieve, remove, and modify metadata in PDF files, including details such as author, title, and more.
The sample PDFs used in this article are available at the following link. All password-protected files use password as their password:
Install pypdf
pypdf has no external dependencies and can be installed via pip (or pip3). If you need support for AES encryption and decryption, install it with the [crypto] extra.
$ pip install pypdf
$ pip install "pypdf[crypto]"
The examples in this article use pypdf version 5.5.0.
The library was previously known as PyPDF2 until it was renamed to pypdf in 2023.
Metadata Fields in PDF Files
PDF is an ISO-standardized file format.
The ISO 32000-1 (PDF 1.7) standard is freely available. Metadata is explained in section "14.3 Metadata" (page 548) of the document below:
Metadata is stored either in Metadata Streams or in the Document Information Dictionary. Common fields in the dictionary include:
- Title: Document title
- Author: Author
- Subject: Subject
- Keywords: Keywords
- Creator: Tool that created the original document
- Producer: Tool that converted the original document to PDF
- CreationDate: Creation date
- ModDate: Modification date
- Trapped: Trapping status
These fields are optional, and custom fields may also be added.
While Author, Creator, and Producer may sound similar, they refer to different entities. Author refers to the person or organization who created the content. Creator and Producer refer to the software used to create or convert the document.
As of ISO 32000-2 (PDF 2.0), published in July 2017, metadata is stored in the Extensible Metadata Platform (XMP) format instead of the Document Information Dictionary.
The following examples mainly cover PDF 1.7 and earlier. PDF 2.0 handling is briefly discussed at the end.
Retrieve PDF Metadata
You can access metadata from the Document Information Dictionary using the metadata attribute of the PdfReader object.
Create a PdfReader object by passing the file path to its constructor. The metadata attribute returns an instance of DocumentInformation.
You can access fields like title and author as attributes:
import pypdf
print(pypdf.__version__)
# 5.5.0
pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
print(type(pdf.metadata))
# <class 'pypdf._doc_common.DocumentInformation'>
print(pdf.metadata.title)
# sample1
Since DocumentInformation is a subclass of dict, you can use dictionary-style access and methods like items().
Some files may show IndirectObject(...) when the dictionary is printed directly, but you can still extract the values by key. The following example uses a file created with Apple Keynote and exported to PDF:
print(isinstance(pdf.metadata, dict))
# True
print(pdf.metadata)
# {'/Title': IndirectObject(33, 0, 4424533392), '/Producer': IndirectObject(34, 0, 4424533392), '/Creator': IndirectObject(35, 0, 4424533392), '/CreationDate': IndirectObject(36, 0, 4424533392), '/ModDate': IndirectObject(36, 0, 4424533392)}
print(pdf.metadata['/Title'])
# sample1
for k, v in pdf.metadata.items():
    print(f'{k}: {v}')
# /Title: sample1
# /Producer: macOS バージョン10.14.2(ビルド18C54) Quartz PDFContext
# /Creator: Keynote
# /CreationDate: D:20190114072947Z00'00'
# /ModDate: D:20190114072947Z00'00'
Note that keys in the metadata dictionary begin with a leading slash (e.g., /Title).
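The /CreationDate and /ModDate values shown above are PDF date strings of the form D:YYYYMMDDHHmmSS followed by an optional time zone. pypdf also exposes parsed values through attributes such as creation_date, so the following is only a minimal standard-library sketch of how the raw string can be interpreted. The helper name parse_pdf_date is hypothetical and not part of pypdf.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern for PDF date strings such as "D:20190114072947Z00'00'".
# Everything after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Z])?(?P<tz_hour>\d{2})?'?(?P<tz_minute>\d{2})?'?$"
)


def parse_pdf_date(raw: str) -> datetime:
    """Convert a raw PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    parts = match.groupdict()
    naive = datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
    )
    sign = parts["sign"]
    if sign is None:
        return naive  # no time zone information in the string
    if sign == "Z":
        offset = timedelta(0)  # "Z" means UTC
    else:
        offset = timedelta(
            hours=int(parts["tz_hour"] or 0),
            minutes=int(parts["tz_minute"] or 0),
        )
        if sign == "-":
            offset = -offset
    return naive.replace(tzinfo=timezone(offset))


print(parse_pdf_date("D:20190114072947Z00'00'"))
# 2019-01-14 07:29:47+00:00
```

Strings without a time zone marker are returned as naive datetime objects, so callers can decide how to interpret them.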
Remove PDF Metadata
Remove All Metadata
To remove all metadata from a PDF:
- Load the original file using PdfReader.
- Create a PdfWriter object with the contents cloned from the reader.
- Set metadata to None.
- Save the result using write().
src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)
dst_pdf.metadata = None
dst_pdf.write('data/temp/sample1_no_meta.pdf')
Remove Specific Metadata Fields
To safely remove metadata fields, make a copy with dict() and remove keys with the pop() method or the del statement.
For example, remove /Creator and /Producer:
src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)
metadata = dict(src_pdf.metadata)
print(metadata.keys())
# dict_keys(['/Title', '/Producer', '/Creator', '/CreationDate', '/ModDate'])
metadata.pop('/Creator')
del metadata['/Producer']
print(metadata.keys())
# dict_keys(['/Title', '/CreationDate', '/ModDate'])
Assign this dictionary to the metadata attribute and save it using the write() method.
dst_pdf.metadata = metadata
dst_pdf.write('data/temp/sample1_remove_meta.pdf')
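Both pop() with a single argument and del raise KeyError when the field is missing, which can happen because all fields are optional. A minimal sketch of a more tolerant variant follows; drop_fields is a hypothetical helper, and the metadata is represented by a plain dict so that the example runs without any PDF file.

```python
def drop_fields(metadata, fields):
    """Return a copy of metadata without the given keys.

    Missing keys are ignored instead of raising KeyError.
    """
    cleaned = dict(metadata)  # copy, so the original mapping is untouched
    for field in fields:
        cleaned.pop(field, None)  # the default value suppresses KeyError
    return cleaned


# Plain dict standing in for dict(src_pdf.metadata).
metadata = {
    '/Title': 'sample1',
    '/Producer': 'some producer',
    '/CreationDate': "D:20190114072947Z00'00'",
    '/ModDate': "D:20190114072947Z00'00'",
}

# '/Creator' is absent here, but no exception is raised.
cleaned = drop_fields(metadata, ['/Creator', '/Producer'])
print(sorted(cleaned))
# ['/CreationDate', '/ModDate', '/Title']
```

The returned dictionary can then be assigned to the metadata attribute of the writer in the same way as above.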
Add or Update PDF Metadata
You can add or update metadata as follows:
- Load the original file with PdfReader.
- Create a PdfWriter object with the contents cloned from the reader.
- Use add_metadata() with a dictionary of new/updated fields.
- Save the result using write().
The add_metadata() method replaces existing values for any specified keys.
src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)
new_metadata = {
    '/Title': 'new title',
    '/Producer': 'new producer',
    '/NewItem': 'special data'
}
dst_pdf.add_metadata(new_metadata)
dst_pdf.write('data/temp/sample1_new_meta.pdf')
print(pypdf.PdfReader('data/temp/sample1_new_meta.pdf').metadata)
# {'/Title': 'new title', '/Producer': 'new producer', '/Creator': IndirectObject(35, 0, 4398476304), '/CreationDate': IndirectObject(36, 0, 4398476304), '/ModDate': IndirectObject(36, 0, 4398476304), '/NewItem': 'special data'}
To completely replace existing metadata, assign the dictionary directly to the metadata attribute.
dst_pdf.metadata = new_metadata
dst_pdf.write('data/temp/sample1_new_meta_replace.pdf')
print(pypdf.PdfReader('data/temp/sample1_new_meta_replace.pdf').metadata)
# {'/Title': 'new title', '/Producer': 'new producer', '/NewItem': 'special data'}
Handle Password-Protected PDF Files
Encrypted PDF files will raise errors unless decrypted first.
After creating a PdfReader, call decrypt() with the password.
src_pdf = pypdf.PdfReader(src_path)
src_pdf.decrypt(password)
To encrypt the output PDF, use encrypt() before calling write().
dst_pdf.encrypt(password)
dst_pdf.write(dst_path)
For more on passwords and encryption, see:
Read XMP Metadata (PDF 2.0)
As mentioned earlier, PDF 2.0 stores metadata using the XMP format.
Here’s an example using a sample file from the repository below:
- pdf-association/pdf20examples: PDF 2.0 example files
- https://github.com/pdf-association/pdf20examples/raw/master/Simple%20PDF%202.0%20file.pdf
In this case, metadata is None.
pdf = pypdf.PdfReader('data/temp/Simple PDF 2.0 file.pdf')
print(pdf.metadata)
# None
Use the xmp_metadata attribute to access XMP metadata. It returns an XmpInformation object, from which you can extract various fields:
print(type(pdf.xmp_metadata))
# <class 'pypdf.xmp.XmpInformation'>
print(pdf.xmp_metadata.dc_title)
# {'x-default': 'A simple PDF 2.0 example file'}
print(pdf.xmp_metadata.pdf_keywords)
# PDF 2.0 sample example
print(pdf.xmp_metadata.xmp_metadata_date)
# 2017-07-11 07:55:11
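As the output above shows, dc_title is a dictionary keyed by language code rather than a plain string. A small sketch for picking one value out of such a dictionary follows; pick_language is a hypothetical helper, and the order of preferred languages is an arbitrary choice.

```python
def pick_language(values, preferred=('x-default', 'en')):
    """Pick one string from a language-keyed dictionary such as dc_title.

    Returns None for an empty or missing dictionary.
    """
    if not values:
        return None
    for language in preferred:
        if language in values:
            return values[language]
    # Fall back to the first available language.
    return next(iter(values.values()))


# Plain dict standing in for pdf.xmp_metadata.dc_title.
title = pick_language({'x-default': 'A simple PDF 2.0 example file'})
print(title)
# A simple PDF 2.0 example file
```

Handling None as well keeps the helper usable for files where the field is missing.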
As of pypdf 5.5.0, there is no built-in method like add_metadata() for writing XMP metadata.