Page

PyMuPDF

Page

Class representing a document page. A page object is created by Document.loadPage() or, equivalently, via indexing the document like doc[n] - it has no independent constructor.

There is a parent-child relationship between a document and its pages. If the document is closed or deleted, all existing page objects (and their respective children, too) become unusable. If a page property or method is used after that, an exception is raised saying that the page object is “orphaned”.

Several page methods have a Document counterpart for convenience. At the end of this chapter you will find a synopsis.

Methods insertText(), insertTextbox() and draw*() are for PDF pages only. They provide “stand-alone” shortcut versions for the same-named methods of the Shape class. For detailed descriptions have a look in that chapter.

In contrast to Shape, the results of page methods are not interconnected: they do not share properties like colors, line width / dashing, morphing, etc.
Each page draw*() method invokes a Shape.finish() and then a Shape.commit() and consequently accepts the combined arguments of both these methods.
Text insertion methods (insertText() and insertTextbox()) do not need Shape.finish() and therefore only invoke Shape.commit().

Method / Attribute	Short Description
`Page.bound()`	rectangle of the page
`Page.deleteAnnot()`	PDF only: delete an annotation
`Page.deleteLink()`	PDF only: delete a link
`Page.drawBezier()`	PDF only: draw a cubic Bézier curve
`Page.drawCircle()`	PDF only: draw a circle
`Page.drawCurve()`	PDF only: draw a special Bézier curve
`Page.drawLine()`	PDF only: draw a line
`Page.drawOval()`	PDF only: draw an oval / ellipse
`Page.drawPolyline()`	PDF only: connect a point sequence
`Page.drawRect()`	PDF only: draw a rectangle
`Page.drawSector()`	PDF only: draw a circular sector
`Page.drawSquiggle()`	PDF only: draw a squiggly line
`Page.drawZigzag()`	PDF only: draw a zig-zagged line
`Page.getFontList()`	PDF only: get list of used fonts
`Page.getImageList()`	PDF only: get list of used images
`Page.getLinks()`	get all links
`Page.getPixmap()`	create a Pixmap
`Page.getSVGimage()`	convert page image to SVG format
`Page.getText()`	extract the page’s text
`Page.getTextBlocks()`	extract text blocks as a Python list
`Page.getTextWords()`	extract text words as a Python list
`Page.insertImage()`	PDF only: insert an image
`Page.insertLink()`	PDF only: insert a new link
`Page.insertText()`	PDF only: insert text
`Page.insertTextbox()`	PDF only: insert a text box
`Page.loadLinks()`	return the first link on a page
`Page.newShape()`	PDF only: start a new Shape
`Page.searchFor()`	search for a string
`Page.setRotation()`	PDF only: set page rotation
`Page.showPDFpage()`	PDF only: display PDF page image
`Page.updateLink()`	PDF only: modify a link
`Page.CropBoxPosition`	top-left point of /CropBox
`Page.CropBox`	the page’s /CropBox
`Page.MediaBoxSize`	bottom-right point of /MediaBox
`Page.MediaBox`	the page’s /MediaBox
`Page.firstAnnot`	first Annot on the page
`Page.firstLink`	first Link on the page
`Page.number`	page number
`Page.parent`	owning document object
`Page.rect`	rectangle (mediabox) of the page
`Page.rotation`	PDF only: page rotation

Class API

class Page

bound()

Determine the rectangle (before transformation) of the page. For PDF documents this usually coincides with the /MediaBox and the /CropBox objects, but not always. The best description hence is probably “relocated /CropBox such that top-left coordinates are (0, 0)”. Also see attributes Page.CropBox and Page.MediaBox.

Return type:	Rect
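
The “relocated /CropBox” description can be illustrated with plain coordinate arithmetic. The helper below is a hypothetical sketch that works on (x0, y0, x1, y1) tuples instead of Rect objects; it is not part of PyMuPDF, and the sample coordinates are made up.

```python
def relocate_to_origin(box):
    """Shift a rectangle so that its top-left corner becomes (0, 0)."""
    x0, y0, x1, y1 = box
    return (0, 0, x1 - x0, y1 - y0)


# Hypothetical /CropBox that does not start at the origin.
cropbox = (36, 50, 576, 770)
relocated = relocate_to_origin(cropbox)
print(relocated)
```

Width and height are preserved; only the position changes.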

deleteAnnot(annot)

PDF only: Delete the specified annotation from the page and (for all document types) return the next one.

Parameters:	annot (Annot) – the annotation to be deleted.
Return type:	Annot
Returns:	the annotation following the deleted one.

deleteLink(linkdict)

PDF only: Delete the specified link from the page. The parameter must be a dictionary of format as provided by the getLinks() method (see below).

Parameters:	linkdict (dict) – the link to be deleted.

insertLink(linkdict)

PDF only: Insert a new link on this page. The parameter must be a dictionary of format as provided by the getLinks() method (see below).

Parameters:	linkdict (dict) – the link to be inserted.

updateLink(linkdict)

PDF only: Modify the specified link. The parameter must be a dictionary of format as provided by the getLinks() method (see below).

Parameters:	linkdict (dict) – the link to be modified.

getLinks()

Retrieves all links of a page.

Return type:	list
Returns:	A list of dictionaries. The entries are in the order as specified during PDF generation. For a description of the dictionary entries see below. Always use this method if you intend to make changes to the links of a page.

insertText(point, text = text, fontsize = 11, fontname = "Helvetica", fontfile = None, idx = 0, color = (0, 0, 0), rotate = 0, morph = None, overlay = True): PDF only: Insert text.

insertTextbox(rect, buffer, fontsize = 11, fontname = "Helvetica", fontfile = None, idx = 0, color = (0, 0, 0), expandtabs = 8, align = TEXT_ALIGN_LEFT, charwidths = None, rotate = 0, morph = None, overlay = True): PDF only: Insert text into the specified rectangle.

drawLine(p1, p2, color = (0, 0, 0), width = 1, dashes = None, roundCap = True, overlay = True, morph = None): PDF only: Draw a line from Point objects p1 to p2.

drawZigzag(p1, p2, breadth = 2, color = (0, 0, 0), width = 1, dashes = None, roundCap = True, overlay = True, morph = None): PDF only: Draw a zigzag line from Point objects p1 to p2.

drawSquiggle(p1, p2, breadth = 2, color = (0, 0, 0), width = 1, dashes = None, roundCap = True, overlay = True, morph = None): PDF only: Draw a squiggly (wavy, undulated) line from Point objects p1 to p2.

drawCircle(center, radius, color = (0, 0, 0), fill = None, width = 1, dashes = None, roundCap = True, overlay = True, morph = None): PDF only: Draw a circle around center with a radius of radius.

drawOval(rect, color = (0, 0, 0), fill = None, width = 1, dashes = None, roundCap = True, overlay = True, morph = None): PDF only: Draw an oval (ellipse) within the given rectangle.

drawSector(center, point, angle, color = (0, 0, 0), fill = None, width = 1, dashes = None, roundCap = True, fullSector = True, overlay = True, closePath = False, morph = None): PDF only: Draw a circular sector, optionally connecting the arc to the circle’s center (like a piece of pie).

drawPolyline(points, color = (0, 0, 0), fill = None, width = 1, dashes = None, roundCap = True, overlay = True, closePath = False, morph = None): PDF only: Draw several connected lines defined by a sequence of points.

drawBezier(p1, p2, p3, p4, color = (0, 0, 0), fill = None, width = 1, dashes = None, roundCap = True, overlay = True, closePath = False, morph = None): PDF only: Draw a cubic Bézier curve from p1 to p4 with the control points p2 and p3.

drawCurve(p1, p2, p3, color = (0, 0, 0), fill = None, width = 1, dashes = None, roundCap = True, overlay = True, closePath = False, morph = None): PDF only: This is a special case of drawBezier().

drawRect(rect, color = (0, 0, 0), fill = None, width = 1, dashes = None, roundCap = True, overlay = True, morph = None): PDF only: Draw a rectangle.

Note

An efficient way to background-color a PDF page with the old Python paper color is page.drawRect(page.rect, color = py_color, fill = py_color, overlay = False), where py_color = getColor("py_color").

insertImage(rect, filename = None, pixmap = None, overlay = True)

PDF only: Fill the given rectangle with an image. Width and height need not have the same proportions as the image: it will be adjusted to fit. The image is either taken from a pixmap or from a file - exactly one of these parameters must be specified.

Parameters:

rect (Rect) – where to put the image on the page. rect must be finite, not empty and be completely contained in the page’s rectangle.
filename (str) – name of an image file (all MuPDF supported formats - see Pixmap chapter).
pixmap (Pixmap) – pixmap containing the image. When inserting the same image multiple times, this should be the preferred option, because the overhead of opening the image and decompressing its content will occur every time with the filename option.

For a description of the other parameters see Common Parameters.

Returns:	zero

This example puts the same image on every page of a document:

>>> doc = fitz.open(...)
>>> rect = fitz.Rect(0, 0, 50, 50)   # put thumbnail in upper left corner
>>> pix = fitz.Pixmap("some.jpg")    # an image file
>>> for page in doc:
...     page.insertImage(rect, pixmap = pix)
>>> doc.save(...)

Notes:

If that same image had already been present in the PDF, then only a reference will be inserted. This of course considerably saves disk space and processing time. But to detect this fact, existing PDF images need to be compared with the new one. This is achieved by storing an MD5 code for each image in a table and comparing only the new image’s code against its entries. Generating this MD5 table, however, is done only when triggered by the first image insertion, which may therefore have an extended response time.
You can use this method to provide a background image for the page, like a copyright, a watermark or a background color. Or you can combine it with searchFor() to achieve a textmarker effect.
The image may be inserted uncompressed, e.g. if a Pixmap is used or if the image has an alpha channel. Therefore, consider using deflate = True when saving the file.
The image content is stored in its original size - which may be much bigger than the size you want to get displayed. Consider decreasing the stored image size by using the pixmap option and then shrinking it or scaling it down (see Pixmap chapter). The file size savings can be very significant.
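
The duplicate detection described in the first note can be pictured as a lookup table keyed by MD5 digest. The following is only a toy sketch of that idea, not PyMuPDF's actual implementation: the class `ImageTable`, its reference counter and the byte strings are all hypothetical.

```python
import hashlib


class ImageTable:
    """Toy registry mapping an image's MD5 digest to a stored reference."""

    def __init__(self):
        self._by_digest = {}
        self._next_ref = 1  # hypothetical counter, not a real PDF xref

    def insert(self, image_bytes):
        """Return (reference, is_new) for the given image content."""
        digest = hashlib.md5(image_bytes).hexdigest()
        if digest in self._by_digest:
            # Same content seen before: hand back the existing reference.
            return self._by_digest[digest], False
        ref = self._next_ref
        self._next_ref += 1
        self._by_digest[digest] = ref
        return ref, True


table = ImageTable()
first = table.insert(b"fake image data")
second = table.insert(b"fake image data")       # duplicate content
other = table.insert(b"different image data")   # new content
```

Only the digests are compared, so a repeated insertion costs one hash computation and one dictionary lookup.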

getText(output = 'text')

Retrieves the text of a page. Depending on the output parameter, the results of the TextPage extract methods are returned.

If 'text' is specified, plain text is returned in the order as specified during PDF creation (which is not necessarily the normal reading order). If the result does not look as expected, consider using (and probably modifying) the example program PDF2TextJS.py, which tries to re-arrange text according to the Western reading layout convention “from top-left to bottom-right”.

Parameters:	output (str) – A string indicating the requested text format, one of `"text"` (default), `"html"`, `"json"`, `"xml"` or `"xhtml"`.
Return type:	string
Returns:	The page’s text as one string.

Note

Use this method to convert the document into a valid HTML version by wrapping it with appropriate header and trailer strings, see the following snippet. Creating XML, XHTML or JSON documents works in exactly the same way. For XML and JSON you may also include an arbitrary filename like so: fitz.ConversionHeader("xml", filename = doc.name). Also see Controlling Quality of HTML Output.

>>> doc = fitz.open(...)
>>> ofile = open(doc.name + ".html", "w")
>>> ofile.write(fitz.ConversionHeader("html"))
>>> for page in doc: ofile.write(page.getText("html"))
>>> ofile.write(fitz.ConversionTrailer("html"))
>>> ofile.close()

getTextBlocks(images = False)

Extract all text blocks as a Python list. Provides basic positioning information without the need to interpret the output of TextPage.extractJSON() or TextPage.extractXML(). The block sequence is as specified in the document. All lines of a block are concatenated into one string, separated by a space.

Parameters:	images (bool) – also extract image blocks. Default is false. This serves as a means to get complete page layout information. Only metadata, not the image data itself is extracted. Use `TextPage.extractJSON()` for accessing this information.
Return type:	list
Returns:	a list whose items have the following entries. `x0, y0, x1, y1`: 4 floats defining the bbox of the block. `text`: concatenated text lines in the block (str). If this is an image block, a text like this is contained: `<image: DeviceRGB, width 511, height 379, bpc 8>` (original image’s width and height). `block_n`: 0-based block number (int). `type`: block type (int), 0 = text, 1 = image.
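
As a minimal sketch of how the returned items might be processed, the helper below separates text blocks from image blocks using the `type` entry and sorts text blocks from top-left to bottom-right. It assumes each item is a sequence in the order listed above (x0, y0, x1, y1, text, block_n, type); the function name and the sample data are hypothetical.

```python
def split_blocks(blocks):
    """Return (text_blocks, image_blocks); text blocks sorted by (y0, x0)."""
    text_blocks = [b for b in blocks if b[6] == 0]
    image_blocks = [b for b in blocks if b[6] == 1]
    text_blocks.sort(key=lambda b: (b[1], b[0]))  # top first, then left
    return text_blocks, image_blocks


# Hand-made sample in the assumed item layout.
blocks = [
    (50, 300, 500, 340, "Second paragraph", 0, 0),
    (50, 100, 500, 140, "First paragraph", 1, 0),
    (50, 400, 300, 600, "<image: DeviceRGB, width 511, height 379, bpc 8>", 2, 1),
]
text_blocks, image_blocks = split_blocks(blocks)
texts = [b[4] for b in text_blocks]
```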

getTextWords()

Extract all words as a Python list. Provides positioning information for words without having to interpret the output of TextPage.extractXML(). The word sequence is as specified in the document. The accompanying rectangle coordinates can be used to re-arrange the final text output to your liking. Block and line numbers help keep track of the original position.

Return type:	list
Returns:	a list whose items are lists with the following entries: `x0, y0, x1, y1`: 4 floats defining the bbox of the word. `word`: the word, spaces stripped off (str). Note that any non-space character is accepted as part of a word - not just letters. So, `Hello world!` will yield the two words `Hello` and `world!`. `block_n, line_n, word_n`: 0-based numbers for block, line and word (int).
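
The block, line and word numbers can be used to reassemble lines of text. The sketch below assumes each item has the layout listed above (x0, y0, x1, y1, word, block_n, line_n, word_n); the helper name and the sample words are hypothetical.

```python
from collections import defaultdict


def words_to_lines(words):
    """Join words into lines, grouped by (block_n, line_n), ordered by word_n."""
    lines = defaultdict(list)
    for x0, y0, x1, y1, word, block_n, line_n, word_n in words:
        lines[(block_n, line_n)].append((word_n, word))
    return [
        " ".join(word for _, word in sorted(items))
        for _, items in sorted(lines.items())
    ]


# Hand-made sample, deliberately out of order.
words = [
    (72, 100, 100, 112, "world!", 0, 0, 1),
    (40, 100, 70, 112, "Hello", 0, 0, 0),
    (40, 120, 80, 132, "Second", 0, 1, 0),
    (85, 120, 110, 132, "line", 0, 1, 1),
]
lines = words_to_lines(words)
```

Sorting by the coordinates instead of the numbers would be the alternative when the document's own sequence does not match the desired reading order.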

getFontList(): PDF only: Return a list of fonts referenced by the page. Same as Document.getPageFontList().

getImageList(): PDF only: Return a list of images referenced by the page. Same as Document.getPageImageList().

getSVGimage(matrix = fitz.Identity)

Create an SVG image from the page. Only full page images are currently supported.

Parameters:	matrix (Matrix) – a Matrix, default is Identity. Valid operations include scaling and rotation.
Returns:	a UTF-8 encoded string that contains the image. This is XML syntax and can be saved in a text file with extension `.svg`.

getPixmap(matrix = fitz.Identity, colorspace = fitz.csRGB, clip = None, alpha = True)

Create a pixmap from the page. This is probably the most often used method to create pixmaps.

Parameters:

matrix (Matrix) – a Matrix, default is Identity.
colorspace (string, Colorspace) – Defines the required colorspace, one of `GRAY`, `RGB` or `CMYK` (case insensitive). Or specify a Colorspace, e.g. one of the predefined ones: `csGRAY`, `csRGB` or `csCMYK`.
clip (IRect) – restrict rendering to the rectangle’s area. The default will render the full page.
alpha (bool) – A bool indicating whether an alpha channel should be included in the pixmap. Choose `False` if you do not really need transparency. This will save a lot of memory (25% in case of RGB … and pixmaps are typically large!), and also processing time in most cases. Also note an important difference in how the image will appear:
`True`: pixmap’s samples will be pre-cleared with `0x00`, including the alpha byte. This will result in transparent areas where the page is empty.
`False`: pixmap’s samples will be pre-cleared with `0xff`. This will result in white where the page has nothing to show.

Return type:	Pixmap
Returns:	Pixmap of the page.
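
The 25% figure for RGB follows from simple arithmetic. The sketch below assumes one byte per color component and one extra byte per pixel for alpha; the function name and the page dimensions are illustrative only.

```python
def sample_bytes(width, height, components, alpha):
    """Size of the pixel data, assuming one byte per component per pixel."""
    return width * height * (components + (1 if alpha else 0))


# RGB has 3 color components; 595 x 842 is just an example size.
with_alpha = sample_bytes(595, 842, 3, True)
without_alpha = sample_bytes(595, 842, 3, False)
saving = 1 - without_alpha / with_alpha
```

With alpha each pixel takes 4 bytes, without it 3 bytes, so dropping the alpha channel removes one quarter of the sample data.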

loadLinks()

Return the first link on a page. Synonym of property firstLink.

Return type:	Link
Returns:	first link on the page (or `None`).

setRotation(rot)

PDF only: Sets the rotation of the page.

Parameters:	rot (int) – An integer specifying the required rotation in degrees. Should be a (positive or negative) multiple of 90.
Returns:	zero if successful, `-1` if not a PDF.
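
Since the value should be a multiple of 90, a caller may want to check it beforehand. This is a hypothetical pre-check, not something the method is documented to do; reducing the value to the range 0 to 270 is an added convenience of the sketch.

```python
def normalize_rotation(rot):
    """Validate a rotation and reduce it to one of 0, 90, 180, 270."""
    if rot % 90 != 0:
        raise ValueError("rotation must be a multiple of 90")
    return rot % 360


quarter_turn_back = normalize_rotation(-90)
more_than_full = normalize_rotation(450)
```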

showPDFpage(rect, docsrc, pno = 0, keep_proportion = True, overlay = True, reuse_xref = 0, clip = None)

PDF only: Display the page of another PDF as a vector image.

Parameters:

rect (Rect) – where to place the image.
docsrc (Document) – source PDF document containing the page. Must be a different document object, but may be the same file.
pno (int) – page number (0-based) to be shown.
keep_proportion (bool) – control whether to scale width and height synchronously (default).
overlay (bool) – put image in foreground (default) or background.
reuse_xref (int) – specify an xref number if an already stored page shall be shown. This suppresses copying the source page once more.
clip (Rect) – choose which part of the source page to show. Default is its /CropBox.

Returns:

xref number of the stored page image if successful. Use this as the value of argument reuse_xref to show the same page again.

Note

This is a multi-purpose method. For instance, it can be used to create “2-up” / “4-up” or posterized versions of existing PDF files (see examples 4-up.py and posterize.py). Or use it to include PDF-based vector images (company logos, watermarks, etc.).
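
The geometry behind a “4-up” layout can be sketched without the library. Both helpers below are hypothetical and operate on plain (x0, y0, x1, y1) tuples: one computes the four target rectangles, the other shows what proportional scaling into a target means. Anchoring the result at the target's top-left corner is an assumption of this sketch; the method's actual placement may differ.

```python
def four_up_rects(width, height):
    """Split a page of the given size into four equal quadrants."""
    half_w, half_h = width / 2, height / 2
    return [
        (0, 0, half_w, half_h),            # top-left
        (half_w, 0, width, half_h),        # top-right
        (0, half_h, half_w, height),       # bottom-left
        (half_w, half_h, width, height),   # bottom-right
    ]


def fit_keeping_proportion(src_w, src_h, target):
    """Scale a source of size src_w x src_h into target, keeping its aspect ratio."""
    x0, y0, x1, y1 = target
    scale = min((x1 - x0) / src_w, (y1 - y0) / src_h)
    return (x0, y0, x0 + src_w * scale, y0 + src_h * scale)


quadrants = four_up_rects(600, 800)
fitted = fit_keeping_proportion(600, 800, (0, 0, 300, 300))
```

Each quadrant would then serve as the `rect` argument for one source page.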

Note

Unfortunately, garbage collection currently does not detect multiple copies of a to-be-displayed source page. Therefore, use the reuse_xref argument to prevent multiple creations as follows. For a technical description of how this function is implemented, see Design of Method Page.showPDFpage().

>>> # the first showPDFpage will copy the page, the following
>>> # will reuse the result via its xref.
>>> xref = 0
>>> for page in doc:
...     xref = page.showPDFpage(rect, docsrc, pno,
...                             reuse_xref = xref)

newShape()

PDF only: Create a new Shape object for the page.

Return type:	Shape
Returns:	a new Shape to use for compound drawings. See description there.

searchFor(text, hit_max = 16)

Searches for text on a page. Identical to TextPage.search().

Parameters:	text (str) – Text to be searched for. Upper / lower case is ignored. hit_max (int) – Maximum number of occurrences accepted.
Return type:	list
Returns:	A list of Rect rectangles each of which surrounds one occurrence of `text`.

rotation

PDF only: contains the rotation of the page in degrees. For other document types the value is -1.

Type:	int

CropBoxPosition

Contains the top-left coordinates of the page’s /CropBox for a PDF, otherwise the top-left coordinates of the page’s rectangle.

Type:	Point

CropBox

The page’s /CropBox for a PDF, otherwise the page’s rectangle.

Type:	Rect

MediaBoxSize

Contains the width and height of the page’s /MediaBox for a PDF, otherwise the bottom-right coordinates of the page’s rectangle.

Type:	Point

MediaBox

The page’s /MediaBox for a PDF, otherwise the page’s rectangle.

Type:	Rect

Note

For non-PDF documents (and for most PDF documents, too) page.rect == page.CropBox == page.MediaBox is true. For some PDF documents however, page.rect may be a true subset of the /MediaBox. In these cases the above attributes help to correctly position / evaluate elements of the page.

firstLink

Contains the first Link of a page (or None).

Type:	Link

firstAnnot

Contains the first Annot of a page (or None).

Type:	Annot

number

The page number.

Type:	int

parent

The owning document object.

Type:	Document

rect

Contains the rectangle (“mediabox”, before transformation) of the page. Same as result of method bound().

Type:	Rect

Description of `getLinks()` Entries

Each entry of the getLinks() list is a dictionary with the following keys:

kind: (required) an integer indicating the kind of link. This is one of LINK_NONE, LINK_GOTO, LINK_GOTOR, LINK_LAUNCH, or LINK_URI. For values and meaning of these names refer to Link Destination Kinds.
from: (required) a Rect describing the “hot spot” location on the page’s visible representation (where the cursor changes to a hand image, usually).
page: a 0-based integer indicating the destination page. Required for LINK_GOTO and LINK_GOTOR, else ignored.
to: either a fitz.Point, specifying the destination location on the provided page, default is fitz.Point(0, 0), or a symbolic (indirect) name. If an indirect name is specified, page = -1 is required and the name must be defined in the PDF in order for this to work. Required for LINK_GOTO and LINK_GOTOR, else ignored.
file: a string specifying the destination file. Required for LINK_GOTOR and LINK_LAUNCH, else ignored.
uri: a string specifying the destination internet resource. Required for LINK_URI, else ignored.
xref: an integer specifying the PDF cross reference entry of the link object. Do not change this entry in any way. Required for link deletion and update, otherwise ignored. For non-PDF documents, this entry contains -1. It is also -1 for all entries in the getLinks() list, if any of the links is not supported by MuPDF - see the note below.
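
The “required” rules above can be turned into a small check before a dictionary is passed to insertLink() or updateLink(). This is a hypothetical helper, not a PyMuPDF function. The LINK_* names defined here are stand-in values for illustration only; real code should use the constants provided by the fitz module. The text also gives a default for `to`, so that part of the check is on the conservative side.

```python
# Stand-in values for illustration only; use fitz.LINK_GOTO etc. in real code.
LINK_NONE, LINK_GOTO, LINK_GOTOR, LINK_LAUNCH, LINK_URI = range(5)

REQUIRED_BY_KIND = {
    LINK_NONE: [],
    LINK_GOTO: ["page", "to"],
    LINK_GOTOR: ["page", "to", "file"],
    LINK_LAUNCH: ["file"],
    LINK_URI: ["uri"],
}


def missing_keys(link):
    """Return the required keys that are absent from a link dictionary."""
    missing = [key for key in ("kind", "from") if key not in link]
    kind = link.get("kind")
    missing += [key for key in REQUIRED_BY_KIND.get(kind, []) if key not in link]
    return missing


# Plain tuples stand in for Rect objects in this sketch.
incomplete = {"kind": LINK_URI, "from": (50, 50, 150, 70)}
complete = {"kind": LINK_URI, "from": (50, 50, 150, 70), "uri": "https://example.org"}
```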

Notes on Supporting Links

MuPDF’s support for links has changed in v1.10a. These changes affect link types LINK_GOTO and LINK_GOTOR.

Reading (pertains to method `getLinks()` and the `firstLink` property chain)

If MuPDF detects a link to another file, it will supply either a LINK_GOTOR or a LINK_LAUNCH link kind. In case of LINK_GOTOR, destination details may be given either as a page number (possibly including position information) or as an indirect destination.

If an indirect destination is given, then this is indicated by page = -1, and link.dest.dest will contain this name. The dictionaries in the getLinks() list will contain this information as the to value.

Internal links are always of kind LINK_GOTO. If an internal link specifies an indirect destination, it will always be resolved and the resulting direct destination will be returned. Names are never returned for internal links, and undefined destinations will cause the link to be ignored.
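
When processing the dictionaries from getLinks(), the `page = -1` convention makes it possible to tell indirect from direct destinations. The helper below is a hypothetical sketch working on plain dictionaries; the sample entries are made up, and a tuple stands in for a Point.

```python
def classify_destination(link):
    """Return ('indirect', name) or ('direct', (page, to)) for a link dictionary."""
    if link.get("page") == -1:
        # For indirect destinations, 'to' holds the symbolic name.
        return "indirect", link.get("to")
    return "direct", (link.get("page"), link.get("to"))


named = classify_destination({"page": -1, "to": "chapter2", "file": "other.pdf"})
direct = classify_destination({"page": 4, "to": (0, 0), "file": "other.pdf"})
```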

Writing

PyMuPDF writes (updates, inserts) links by constructing and writing the appropriate PDF object source. This makes it possible to specify indirect destinations for LINK_GOTOR and LINK_GOTO link kinds (pre PDF 1.2 file formats are not supported).

Caution

If a LINK_GOTO indirect destination specifies an undefined name, this link can later on not be found / read again with MuPDF / PyMuPDF. Other readers however will detect it, but flag it as erroneous.

Indirect LINK_GOTOR destinations can in general of course not be checked for validity and are therefore always accepted.

Homologous Methods of Document and Page

This is an overview of homologous methods on the Document and on the Page level.

Document Level	Page Level
Document.getPageFontList(pno)	Page.getFontList()
Document.getPageImageList(pno)	Page.getImageList()
Document.getPagePixmap(pno, …)	Page.getPixmap(…)
Document.getPageText(pno, …)	Page.getText(…)
Document.searchPageFor(pno, …)	Page.searchFor(…)
Document._getPageXref(pno)	Page._getXref()

The page number pno is 0-based and can be any negative or positive number < len(doc). The document methods invoke their page counterparts via Document[pno].<method>.
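
One plausible reading of “any negative or positive number < len(doc)” is that negative numbers count backwards from the end, wrapping around as often as needed. The sketch below illustrates that interpretation only; it is an assumption, and the helper is hypothetical.

```python
def normalize_pno(pno, page_count):
    """Map a possibly negative page number to the range 0 .. page_count - 1."""
    if pno >= page_count:
        raise ValueError("page number out of range")
    while pno < 0:
        pno += page_count  # assumed: negative numbers wrap from the end
    return pno


last_page = normalize_pno(-1, 10)
wrapped_twice = normalize_pno(-11, 10)
unchanged = normalize_pno(3, 10)
```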
