Manage PDF Metadata with pypdf in Python
The Python library pypdf (formerly PyPDF2) allows you to retrieve, remove, and modify metadata in PDF files, including details such as author, title, and more.
The sample PDFs used in this article are available at the following link. All password-protected files use "password" as their password:
Install pypdf
pypdf has no external dependencies and can be installed via pip (or pip3). If you need support for AES encryption and decryption, install it with the [crypto] extra.
$ pip install pypdf
$ pip install "pypdf[crypto]"
The examples in this article use pypdf version 5.5.0.
The library was published as PyPDF2 until the project returned to the name pypdf at the end of 2022, starting with pypdf 3.1.0.
Metadata Fields in PDF Files
PDF is an ISO-standardized file format.
The ISO 32000-1 (PDF 1.7) standard is freely available. Metadata is explained in section "14.3 Metadata" (page 548) of the document below:
Metadata is stored either in metadata streams or in the Document Information Dictionary. Common fields in the dictionary include:
- Title: Document title
- Author: Name of the person who created the document
- Subject: Subject of the document
- Keywords: Keywords associated with the document
- Creator: Tool that created the original document
- Producer: Tool that converted the original document to PDF
- CreationDate: Creation date
- ModDate: Modification date
- Trapped: Trapping status
These fields are optional, and custom fields may also be added.
While Author, Creator, and Producer may sound similar, they refer to different entities. Author refers to the person or organization who created the content. Creator and Producer refer to the software used to create or convert the document.
The ISO 32000-2 (PDF 2.0) specification, published in July 2017, deprecates most of the Document Information Dictionary in favor of metadata streams in the Extensible Metadata Platform (XMP) format.
The following examples mainly cover PDF 1.7 and earlier. PDF 2.0 handling is briefly discussed at the end.
Retrieve PDF Metadata
You can access metadata from the Document Information Dictionary using the metadata attribute of the PdfReader object.
Create a PdfReader object by passing the file path to its constructor. The metadata attribute returns an instance of DocumentInformation.
You can access fields like title and author as attributes:
import pypdf
print(pypdf.__version__)
# 5.5.0
pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
print(type(pdf.metadata))
# <class 'pypdf._doc_common.DocumentInformation'>
print(pdf.metadata.title)
# sample1
Since DocumentInformation is a subclass of dict, you can use dictionary-style access and methods like items().
Some files may return IndirectObject(...) when printed directly. You can still extract values by using the keys. The following example uses a file created with Apple Keynote and exported to PDF.
print(isinstance(pdf.metadata, dict))
# True
print(pdf.metadata)
# {'/Title': IndirectObject(33, 0, 4424533392), '/Producer': IndirectObject(34, 0, 4424533392), '/Creator': IndirectObject(35, 0, 4424533392), '/CreationDate': IndirectObject(36, 0, 4424533392), '/ModDate': IndirectObject(36, 0, 4424533392)}
print(pdf.metadata['/Title'])
# sample1
for k, v in pdf.metadata.items():
    print(f'{k}: {v}')
# /Title: sample1
# /Producer: macOS バージョン10.14.2(ビルド18C54) Quartz PDFContext
# /Creator: Keynote
# /CreationDate: D:20190114072947Z00'00'
# /ModDate: D:20190114072947Z00'00'
Note that keys in the metadata dictionary begin with a leading slash (e.g., /Title).
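The /CreationDate and /ModDate values shown above are plain strings in the PDF date format (D:YYYYMMDDHHmmSS followed by a time zone part). As a rough sketch, they can be converted to datetime objects with the standard library alone. The function name parse_pdf_date and the regular expression are illustrative assumptions, and the pattern covers only the common shapes of the format. pypdf's DocumentInformation also has creation_date and modification_date attributes, which may be preferable when they handle the file at hand.

```python
import re
from datetime import datetime, timedelta, timezone

# Every component after the year is optional in the PDF date format.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Z])?(?P<tz_hour>\d{2})?'?(?P<tz_minute>\d{2})?'?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string such as D:20190114072947Z00'00' to a datetime."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = match.groupdict()

    tz = None  # stays naive when the string carries no time zone part
    if parts["sign"] == "Z":
        tz = timezone.utc
    elif parts["sign"] in ("+", "-"):
        offset = timedelta(
            hours=int(parts["tz_hour"] or 0),
            minutes=int(parts["tz_minute"] or 0),
        )
        tz = timezone(-offset if parts["sign"] == "-" else offset)

    return datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
        tzinfo=tz,
    )


parsed = parse_pdf_date("D:20190114072947Z00'00'")
print(parsed.isoformat())
# 2019-01-14T07:29:47+00:00
```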
Remove PDF Metadata
Remove All Metadata
To remove all metadata from a PDF:
- Load the original file using PdfReader.
- Create a PdfWriter object with the contents cloned from the reader.
- Set metadata to None.
- Save the result using write().
src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)
dst_pdf.metadata = None
dst_pdf.write('data/temp/sample1_no_meta.pdf')
Remove Specific Metadata Fields
To safely remove metadata fields, make a copy with dict() and use the pop() method or the del statement.
For example, remove /Creator and /Producer:
src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)
metadata = dict(src_pdf.metadata)
print(metadata.keys())
# dict_keys(['/Title', '/Producer', '/Creator', '/CreationDate', '/ModDate'])
metadata.pop('/Creator')
del metadata['/Producer']
print(metadata.keys())
# dict_keys(['/Title', '/CreationDate', '/ModDate'])
Assign this dictionary to the metadata attribute and save it using the write() method.
dst_pdf.metadata = metadata
dst_pdf.write('data/temp/sample1_remove_meta.pdf')
Add or Update PDF Metadata
You can add or update metadata as follows:
- Load the original file with PdfReader.
- Create a PdfWriter object with the contents cloned from the reader.
- Use add_metadata() with a dictionary of new/updated fields.
- Save the result using write().
The add_metadata() method replaces existing values for any specified keys.
src_pdf = pypdf.PdfReader('data/src/pdf/sample1.pdf')
dst_pdf = pypdf.PdfWriter(clone_from=src_pdf)
new_metadata = {
    '/Title': 'new title',
    '/Producer': 'new producer',
    '/NewItem': 'special data',
}
dst_pdf.add_metadata(new_metadata)
dst_pdf.write('data/temp/sample1_new_meta.pdf')
print(pypdf.PdfReader('data/temp/sample1_new_meta.pdf').metadata)
# {'/Title': 'new title', '/Producer': 'new producer', '/Creator': IndirectObject(35, 0, 4398476304), '/CreationDate': IndirectObject(36, 0, 4398476304), '/ModDate': IndirectObject(36, 0, 4398476304), '/NewItem': 'special data'}
To completely replace existing metadata, assign the dictionary directly to the metadata attribute.
dst_pdf.metadata = new_metadata
dst_pdf.write('data/temp/sample1_new_meta_replace.pdf')
print(pypdf.PdfReader('data/temp/sample1_new_meta_replace.pdf').metadata)
# {'/Title': 'new title', '/Producer': 'new producer', '/NewItem': 'special data'}
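When updating metadata it is often useful to refresh /ModDate as well. Date fields are stored as strings in the PDF date format, so the value has to be formatted first. The following is a small standard-library sketch; the function name format_pdf_date is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone


def format_pdf_date(moment):
    """Format a timezone-aware datetime as a PDF date string."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("expected a timezone-aware datetime")
    total_minutes = int(offset.total_seconds() // 60)
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    # Layout: D:YYYYMMDDHHmmSS, then the UTC offset written as +HH'mm'
    return f"D:{moment:%Y%m%d%H%M%S}{sign}{hours:02d}'{minutes:02d}'"


jst = timezone(timedelta(hours=9))
mod_date = format_pdf_date(datetime(2019, 1, 14, 16, 29, 47, tzinfo=jst))
print(mod_date)
# D:20190114162947+09'00'
```

The resulting string can be supplied as the value of '/ModDate' in the dictionary passed to add_metadata().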
Handle Password-Protected PDF Files
Reading metadata from an encrypted PDF file raises an error unless the file is decrypted first.
After creating a PdfReader, call decrypt() with the password.
src_pdf = pypdf.PdfReader(src_path)
src_pdf.decrypt(password)
To encrypt the output PDF, use encrypt() before calling write().
dst_pdf.encrypt(password)
dst_pdf.write(dst_path)
Read XMP Metadata (PDF 2.0)
As mentioned earlier, PDF 2.0 stores metadata using the XMP format.
Here’s an example using a sample file from the repository below:
- pdf-association/pdf20examples: PDF 2.0 example files
- https://github.com/pdf-association/pdf20examples/raw/master/Simple%20PDF%202.0%20file.pdf
In this case, metadata is None.
pdf = pypdf.PdfReader('data/temp/Simple PDF 2.0 file.pdf')
print(pdf.metadata)
# None
Use the xmp_metadata attribute to access XMP metadata. This returns an XmpInformation object, from which you can extract various fields:
print(type(pdf.xmp_metadata))
# <class 'pypdf.xmp.XmpInformation'>
print(pdf.xmp_metadata.dc_title)
# {'x-default': 'A simple PDF 2.0 example file'}
print(pdf.xmp_metadata.pdf_keywords)
# PDF 2.0 sample example
print(pdf.xmp_metadata.xmp_metadata_date)
# 2017-07-11 07:55:11
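The dictionary returned by dc_title reflects how XMP stores the title: as a list of language alternatives, each tagged with xml:lang. To make that structure visible, the sketch below parses a hand-written XMP fragment with the standard library only. The SAMPLE_XMP packet and the function name read_dc_title are illustrative assumptions; this is not how pypdf implements the attribute.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written fragment for illustration; real packets carry many more fields.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">A simple PDF 2.0 example file</rdf:li>
        </rdf:Alt>
      </dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_dc_title(xmp_xml):
    """Return the dc:title alternatives as a {language: text} dictionary."""
    root = ET.fromstring(xmp_xml)
    titles = {}
    for item in root.findall(".//dc:title/rdf:Alt/rdf:li", NAMESPACES):
        titles[item.get(XML_LANG, "x-default")] = item.text or ""
    return titles


titles = read_dc_title(SAMPLE_XMP)
print(titles)
# {'x-default': 'A simple PDF 2.0 example file'}
```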
As of pypdf 5.5.0, there is no built-in method like add_metadata() for writing XMP metadata.