Tutorial
This tutorial will show you the use of PyMuPDF, MuPDF in Python, step by step.
Because MuPDF supports not only PDF, but also XPS, OpenXPS, CBZ, CBR, FB2 and EPUB formats, so does PyMuPDF [1]. Nevertheless we will only talk about PDF files for the sake of brevity. At places where indeed only PDF files are supported, this will be mentioned explicitely.
Importing the Bindings
The Python bindings to MuPDF are made available by this import statement:
>>> import fitz
You can check your version by printing the docstring:
>>> print (fitz.__doc__)
PyMuPDF 1.9.1: Python bindings for the MuPDF 1.9a library,
built on 2016-07-01 13:06:02
Opening a Document
To access a supported document, it must be opened with the following statement:
>>> doc = fitz.open(filename) # or fitz.Document(filename)
This creates a Document object doc
. filename
must be a Python string specifying the name of an existing file.
It is also possible to open a document from memory data, or to create a new, empty PDF. See Document for details.
A document contains many attributes and functions. Among them are meta information (like “author” or “subject”), number of total pages, outline and encryption information.
Some Document Methods and Attributes
Method / Attribute | Description |
---|---|
Document.pageCount |
number of pages (int) |
Document.metadata |
metadata (dict) |
Document.getToC() |
table of contents (list) |
Document.loadPage() |
read a page (Page) |
Accessing Meta Data
PyMuPDF fully supports standard metadata. Document.metadata
is a Python dictionary with the following keys. It is available for all document types, though not all entries may contain data in every single case. For details of their meanings and formats consult the PDF manuals, e.g. Adobe PDF Reference 1.7. Further information can also be found in chapter Document. The meta data fields are strings (or None
) if not otherwise indicated. Be aware that not all of them necessarily contain meaningful data.
Key | Value |
---|---|
producer | producer (producing software) |
format | PDF format, e.g. ‘PDF-1.4’ |
encryption | encryption method used |
author | author |
modDate | date of last modification |
keywords | keywords |
title | title |
creationDate | date of creation |
creator | creating application |
subject | subject |
Note
Apart from these standard metadata, PDF documents of PDF version 1.4 or later may also contain so-called “metadata streams”. Information in such streams is coded in XML. PyMuPDF deliberately contains no XML components, so we do not directly support access to information contained therein. But you can extract the stream as a whole, inspect or modify it using a package like lxml and then store the result back into the PDF. If you want, you can also delete these data altogether.
Working with Outlines
The easiest way to get all outlines (also called “bookmarks”) of a document, is creating a table of contents:
>>> toc = doc.getToC()
This will return a Python list of lists [[lvl, title, page, ...], ...]
.
lvl
is the hierarchy level of the entry (starting from 1), title
is the entry’s title, and page
the page number (1-based!). Other parameters describe details of the bookmark target.
Working with Pages
Tasks that can be performed with a Page are at the core of MuPDF’s functionality.
- You can render a page into an image, optionally zooming, rotating, shifting or shearing it.
- You can extract a page’s text or search for text strings.
First, a page object must be created:
>>> page = doc.loadPage(n) # represents page n of the document (0-based)
>>> page = doc[n] # short form
n
may be any integer less than the total number of pages of the document. doc[-1]
is the last page, like with Python lists.
Some typical uses of Pages follow:
Inspecting the Links of a Page
Here is how to get all links and their types:
>>> # get all links on a page
>>> links = page.getLinks()
links
is a Python list of dictionaries. For details see Page.getLinks()
.
Rendering a Page
This example creates an image out of a page’s content:
>>> pix = page.getPixmap()
Now pix
is a Pixmap object that contains an RGBA image of the page, ready to be used. This method offers lots of variations for controlling the image: resolution, colorspace, transparency, rotation, mirroring, shifting, shearing, etc.
Saving the Page Image in a File
We can simply store the image in a PNG file:
>>> pix.writePNG("test.png")
Displaying the Image in Dialog Managers
We can also use it in GUI dialog managers. Pixmap.samples
represents the area of bytes of all the pixels as a Python bytes object. Here are two examples, find more here.
wxPython:
>>> bitmap = wx.BitmapFromBufferRGBA(pix.width, pix.height, pix.samples)
Tkinter:
>>> # the following requires: "from PIL import Image, ImageTk"
>>> img = Image.frombytes("RGBA", [pix.width, pix.height], pix.samples)
>>> photo = ImageTk.PhotoImage(img)
Now, photo
can be used as an image in TK.
Extracting Text
We can also extract all text of a page in one chunk of string:
>>> text = page.getText(type)
Use one of the following strings for type
:
"text"
: (default) plain text with line breaks. No formatting, no text position details."html"
: creates a full visual version of the page including any images, which can be displayed in browsers."json"
: same information level as HTML. Use a JSON module to interpret."xhtml"
: text information level as the TEXT version, but includes images and can also be displayed in browsers."xml"
: contains no images, but full position and font information about each single text character. Use an XML module to interpret.
To give you an idea about the output of these alternatives, we did text example extracts. See Appendix 2: Details on Text Extraction.
Searching Text
You can find out, exactly where on a page a certain string appears:
>>> areas = page.searchFor("mupdf", hit_max = 16)
The variable areas
will contain a list of up to 16 Rectangles, each of which surrounds one occurrence of the string “mupdf” (case insensitive). You could use this information to e.g. highlight those areas or create a cross reference of the document.
Please also do have a look at chapter Working together: DisplayList and TextPage and at demo program demo.py. Among other things they contain details on how the TextPage, Device and DisplayList classes can be used for a more direct control, e.g. when performance considerations suggest it.
PDF Maintenance
Since version 1.9, PyMuPDF provides several options to modify PDF documents (only).
Document.save()
always stores a PDF in its current (potentially modified) state on disk.
Apart from your changes, there are less obvious ways for a PDF to becoming “modified”:
- During open, integrity checks are used to determine the health of the PDF structure. Any errors will be corrected as far as possible to present a repaired document in memory for further processing. If this is the case, the document is regarded as being modified.
- After a document has been decrypted, the document in memory has changed and also counts as being modified.
In these two cases, Document.save()
will store a repaired and / or decrypted version, and saving must occur to a new file.
The following describe some more intentional ways to manipulate PDF documents. This description is by no means exhaustive: much more can be found in the following chapters.
Modifying, Creating, Re-arranging and Deleting Pages
There are several ways to manipulate the page tree of a PDF:
Methods Document.deletePage()
and Document.deletePageRange()
delete pages.
Methods Document.copyPage()
and Document.movePage()
copy or move a page to another location within the document.
Document.insertPage()
and Document.newPage()
insert pages.
Method Document.select()
shrinks a document down to selected pages. It accepts a sequence of integers as argument. These integers must be in range 0 <= i < pageCount
. When executed, all pages missing in this list will be deleted. Only pages that do occur will remain - in the sequence specified and as many times (!) as specified.
So you can easily create new PDFs with the first or last 10 pages, only the odd or only the even pages (for doing double-sided printing), pages that do or don’t contain a certain text, reverse their sequence, … whatever you may think of.
The saved new document will contain all still valid links, annotations and bookmarks.
Pages themselves can moreover be modified by a range of methods (e.g. page rotation, annotation and link maintenance, text and image insertion).
Joining and Splitting PDF Documents
Method Document.insertPDF()
inserts pages from another PDF at a specified place of the current one. Here is a simple joiner example (doc1
and doc2
being openend PDFs):
>>> # append complete doc2 to the end of doc1
>>> doc1.insertPDF(doc2)
Here is how to split doc1
. This creates a new document of its first and last 10 pages (could also be done using Document.select()
):
>>> doc2 = fitz.open() # new empty PDF
>>> doc2.insertPDF(doc1, to_page = 9)
>>> doc2.insertPDF(doc1, from_page = len(doc1) - 10)
>>> doc2.save(...)
More can be found in the Document chapter. Also have a look at PDFjoiner.py.
Saving
As mentioned above, Document.save()
will always save the document in its current state.
Since MuPDF 1.9, you can write changes back to the original PDF by specifying incremental = True
. This process is (usually) extremely fast, since changes are appended to the original file without rewriting it.
Document.save()
supports all options of MuPDF’s command line utility mutool clean
, see the following table (corresponding mutool clean
option is indicated as “mco”).
Option | mco | Effect |
---|---|---|
garbage = 1 | g | garbage collect unused objects |
garbage = 2 | gg | in addition to 1, compact xref tables |
garbage = 3 | ggg | in addition to 2, merge duplicate objects |
garbage = 4 | gggg | in addition to 3, skip duplicate streams |
clean = 1 | s | clean content streams |
deflate = 1 | z | deflate uncompressed streams |
ascii = 1 | a | convert binary data to ASCII format |
linear = 1 | l | create a linearized version |
expand = 1 | i | decompress images |
expand = 2 | f | decompress fonts |
expand = 255 | d | decompress all |
incremental = 1 | n/a | append changes to the original |
For example, mutool clean -ggggz file.pdf
yields excellent compression results. It corresponds to doc.save(filename, garbage=4, deflate=1)
.
Closing
It is often desirable to “close” a document to relinquish control of the underlying file to the OS, while your program continues.
This can be achieved by the Document.close()
method. Apart from closing the underlying file, buffer areas associated with the document will be freed.
Example: Dynamically Cleaning up Corrupt PDF Documents
This shows a potential use of PyMuPDF with another Python PDF library (pdfrw).
If a clean, non-corrupt or decompressed PDF is needed, one could dynamically invoke PyMuPDF to recover from problems like so:
import sys
from pdfrw import PdfReader
import fitz
from io import BytesIO
#---------------------------------------
# 'tolerant' PDF reader
#---------------------------------------
def reader(fname):
ifile = open(fname, "rb")
idata = ifile.read() # put in memory
ifile.close()
ibuffer = BytesIO(idata) # convert to stream
try:
return PdfReader(ibuffer) # let us try
except: # problem! heal it with PyMuPDF
doc = fitz.open("pdf", idata) # open and save a corrected
c = doc.write(garbage = 4) # version in memory
doc.close()
doc = idata = None # free storage
ibuffer = BytesIO(c) # convert to stream
return PdfReader(ibuffer) # let pdfrw retry
#---------------------------------------
# Main program
#---------------------------------------
pdf = reader("pymupdf.pdf")
print pdf.Info
# do further processing
With the command line utility pdftk
(available for Windows only) a similar result can be achieved, see here. However, you must invoke it as a separate process via subprocess.Popen
, using stdin and stdout as communication vehicles.
Further Reading
Also have a look at PyMuPDF’s Wiki pages. Especially those named in the sidebar under title “Recipies” cover over 15 topics written in “How-To” style.
Footnotes
[1] | PyMuPDF lets you also open several image file types just like normal documents. See section Supported Input Image Types in chapter Pixmap for more comments. |