Functions

The following are miscellaneous functions to be used by the experienced PDF programmer.

Function	Short Description
`Document.FontInfos`	PDF only: information on inserted fonts
`Annot._cleanContents()`	PDF only: clean the annot’s `/Contents` objects
`Annot._getXref()`	PDF only: return XREF number of annotation
`ConversionHeader()`	return header string for `getText` methods
`ConversionTrailer()`	return trailer string for `getText` methods
`Document._delXmlMetadata()`	PDF only: remove XML metadata
`Document._getGCTXerrmsg()`	retrieve C-level exception message
`Document._getNewXref()`	PDF only: create and return a new XREF entry
`Document._getObjectString()`	PDF only: return object source code
`Document._getOLRootNumber()`	PDF only: return / create XREF of `/Outlines`
`Document._getPageObjNumber()`	PDF only: return XREF and generation number of a page
`Document._getPageXref()`	PDF only: same as `_getPageObjNumber()`
`Document._getXmlMetadataXref()`	PDF only: return XML metadata XREF number
`Document._getXrefLength()`	PDF only: return length of XREF table
`Document._getXrefStream()`	PDF only: return content of a stream
`Document._getXrefString()`	PDF only: return object source code
`Document._updateObject()`	PDF only: insert or update a PDF object
`Document._updateStream()`	PDF only: replace the stream of an object
`Document.extractFont()`	PDF only: extract embedded font
`Document.getCharWidths()`	PDF only: return a list of glyph widths of a font
`Document.getPageRawText()`	PDF only: return raw string between two points
`getPDFnow()`	return the current timestamp in PDF format
`getPDFstr()`	return PDF-compatible string
`Page._cleanContents()`	PDF only: clean the page’s `/Contents` objects
`Page._getContents()`	PDF only: return a list of content numbers
`Page._getXref()`	PDF only: return XREF number of page
`Page.getDisplayList()`	create the page’s display list
`Page.extractTextLines()`	return text between two points
`Page.extractTextRect()`	return text inside a rectangle
`Page.insertFont()`	PDF only: store a new font in the document
`Page.run()`	run a page through a device
`PaperSize()`	return width, height for known paper formats

PaperSize(s)

Convenience function to return the width and height of a known paper format code. The values are given in pixels at the standard resolution of 72 pixels per inch.

Currently defined formats include A0 through A10, B0 through B10, C0 through C10, Card-4x6, Card-5x7, Commercial, Executive, Invoice, Ledger, Legal, Legal-13, Letter, Monarch and Tabloid-Extra, each in either portrait or landscape format.

A format name must be supplied as a string (case insensitive), optionally suffixed with “-L” (landscape) or “-P” (portrait). No suffix defaults to portrait.

Parameters: s (str) – a format name like "A4" or "letter-l".

Return type: tuple

Returns: (width, height) of the paper format. For an unknown format, (-1, -1) is returned. Examples: PaperSize("A4") returns (595, 842) and PaperSize("letter-l") returns (792, 612).
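
The suffix handling described above can be sketched in plain Python. This is a minimal illustration, not PyMuPDF's implementation: the function name `paper_size_sketch` and the two-entry table are assumptions made for the example, with the dimensions derived from the values quoted above.

```python
# Hypothetical two-entry table (portrait sizes in pixels at 72 per inch);
# the real function knows many more formats.
PORTRAIT_SIZES = {"a4": (595, 842), "letter": (612, 792)}

def paper_size_sketch(s):
    """Return (width, height) for a format name, or (-1, -1) if unknown."""
    name = s.lower()                      # format names are case insensitive
    landscape = False
    if name.endswith("-l"):
        name, landscape = name[:-2], True
    elif name.endswith("-p"):
        name = name[:-2]                  # portrait is also the default
    width, height = PORTRAIT_SIZES.get(name, (-1, -1))
    # Landscape swaps the portrait dimensions.
    return (height, width) if landscape else (width, height)
```

Under these assumptions, `paper_size_sketch("A4")` and `paper_size_sketch("a4-p")` give the same tuple, while `paper_size_sketch("letter-l")` gives the swapped one.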

getPDFnow()

Convenience function to return the current local timestamp in PDF compatible format, e.g. D:20170501121525-04'00' for local datetime May 1, 2017, 12:15:25 in a timezone 4 hours westward of the UTC meridian.

Return type: str

Returns: current local PDF timestamp.
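
To make the timestamp layout concrete, here is a small standard-library sketch that formats a timezone-aware datetime in the same way. The helper name `pdf_timestamp` is hypothetical, and the code only illustrates the format; it is not how `getPDFnow()` is implemented.

```python
from datetime import datetime, timedelta, timezone

def pdf_timestamp(dt):
    """Format a timezone-aware datetime as D:YYYYMMDDHHMMSS+HH'MM'."""
    offset = dt.utcoffset()
    if offset is None:
        raise ValueError("a timezone-aware datetime is required")
    total_minutes = int(offset.total_seconds() // 60)
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return "D:%s%s%02i'%02i'" % (dt.strftime("%Y%m%d%H%M%S"), sign, hours, minutes)

# The example from the text: May 1, 2017, 12:15:25, four hours west of UTC.
example = pdf_timestamp(
    datetime(2017, 5, 1, 12, 15, 25, tzinfo=timezone(timedelta(hours=-4)))
)
```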

getPDFstr(obj, brackets = True)

Make a PDF-compatible string: if obj contains code points ord(c) > 255, then it will be converted to UTF-16BE as a hexadecimal character string like <feff...>. Otherwise, if brackets = True, it will enclose the argument in () replacing any characters with code points ord(c) > 127 by their octal number \nnn prefixed with a backslash. If brackets = False, then the string is returned unchanged.

Parameters: obj (str or bytes or unicode) – the object to convert

Return type: str

Returns: PDF-compatible string enclosed in either () or <>.
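
The three cases described above can be sketched as follows. This is an illustration of the stated rules only, assuming a Python 3 `str` as input; the name `pdf_str_sketch` is hypothetical, and the real function may handle further details (such as escaping special characters) that are not covered here.

```python
import binascii

def pdf_str_sketch(s, brackets=True):
    """Mimic the conversion rules described for getPDFstr()."""
    # Case 1: any code point above 255 forces UTF-16BE in hexadecimal notation.
    if any(ord(c) > 255 for c in s):
        hexdigits = binascii.hexlify(s.encode("utf-16-be")).decode("ascii")
        return "<feff" + hexdigits + ">"
    # Case 2: without brackets the string is returned unchanged.
    if not brackets:
        return s
    # Case 3: enclose in (), writing code points above 127 as \nnn (octal).
    body = "".join(c if ord(c) <= 127 else "\\%03o" % ord(c) for c in s)
    return "(" + body + ")"
```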

ConversionHeader(output = "text", filename = "UNKNOWN")

Return the header string required to make a valid document out of page text outputs.

Parameters:

output (str) – type of document. Use the same as the output parameter of getText().

filename (str) – optional arbitrary name to use in output types “json” and “xml”.

Return type:
str

ConversionTrailer(output)

Return the trailer string required to make a valid document out of page text outputs. See Page.getText() for an example.

Parameters: output (str) – type of document. Use the same as the output parameter of getText().

Return type: str

Document._delXmlMetadata()

Delete an object containing XML-based metadata from the PDF. (Py-)MuPDF does not support XML-based metadata. Use this if you want to make sure that the conventional metadata dictionary will be used exclusively. Many third-party PDF programs insert their own metadata in XML format and thus may override what you store in the conventional dictionary. This method deletes any such reference, and the corresponding PDF object will be deleted during the next garbage collection of the file.

Document._getXmlMetadataXref()

Return the XML-based metadata object id from the PDF, if present. Also refer to Document._delXmlMetadata(). You can use it to retrieve the content via Document._getXrefStream() and then work with it using some XML software.

Document._getPageObjNumber(pno)

or

Document._getPageXref(pno)

Return the XREF and generation number for a given page.

Parameters: pno (int) – Page number (zero-based).

Return type: list

Returns: XREF and generation number of page pno as a list [xref, gen].

Page._getXref()

Page version of Document._getPageObjNumber(), delivering only the XREF (not the generation number).

Page.run(dev, transform)

Run a page through a device.

Parameters:

dev (Device) – Device, obtained from one of the Device constructors.

transform (Matrix) – Transformation to apply to the page. Set it to Identity if no transformation is desired.

Page.insertFont(fontname = "Helvetica", fontfile = None, idx = 0, set_simple = False)

Store a new font for the page and return its XREF. If the page already references this font, it is a no-operation and just the XREF is returned.

Parameters:

fontname (str) – The reference name of the font. If the name already appears in Page.getFontList(), the method ignores all other parameters and exits with the xref number. Otherwise, this must be either the name of one of the PDF Base 14 Fonts, or fontfile must also be given. After this method has been called, the font name prefixed with a slash “/” can be used to refer to the font in text insertions.

fontfile (str) – font file name. This file will be embedded in the PDF.

idx (int) – index of the font in the given file. Has no meaning and is ignored if fontfile is not specified. Default is zero. An invalid index will cause an exception.

Note

Certain font files can contain more than one font. This parameter can be used to select the right one. PyMuPDF has no way to tell whether the font file indeed contains a font for any non-zero index.

Caution

Only the first choice of idx will be honored - subsequent specifications are ignored.

set_simple (bool) – When inserting from a font file, a “Type0” font will be installed by default. This option causes the font to be installed as a simple font instead. Only 1-byte characters will then be presented correctly, others will appear as “?” (question mark).

Caution

Only the first choice of set_simple will be honored. Subsequent specifications are ignored.

Return type:
int

Returns:
the XREF of the font. PyMuPDF records inserted fonts in two places:

An inserted font will appear in Page.getFontList().

Document.FontInfos records information about all fonts that have been inserted by this method on a document-wide basis.

Page.getDisplayList()

Run a page through a list device and return its display list.

Return type: DisplayList

Returns: the display list of the page.

Page._getContents()

Return a list of XREF numbers of /Contents objects belonging to the page. The length of this list will always be at least one.

Return type: list

Returns: a list of XREF integers.

Each page has one or more associated contents objects (streams) which contain PDF operator syntax describing what appears where on the page (like text or images, etc. See the Adobe PDF Reference 1.7, chapter “Operator Summary”, page 985). This function only enumerates the XREF number(s) of such objects. To get the actual stream source, use function Document._getXrefStream() with one of the numbers in this list. Use Document._updateStream() to replace the content [1] [2].
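
As a rough sketch of how the two calls combine, the helper below collects the complete contents source of a page. The name `page_contents_source` is hypothetical; `doc` and `page` are assumed to be a PyMuPDF Document and one of its pages under Python 3 (where streams are `bytes`), and joining with a newline is an assumption that relies on streams being split only between lexical tokens.

```python
def page_contents_source(doc, page):
    """Return the page's contents streams joined into one bytes object.

    Uses only Page._getContents() and Document._getXrefStream() as
    described in this section.
    """
    streams = [doc._getXrefStream(xref) for xref in page._getContents()]
    # Whitespace between the parts keeps the last token of one stream
    # separate from the first token of the next.
    return b"\n".join(streams)
```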

Page._cleanContents()

Clean all /Contents objects associated with this page (including contents of all annotations). “Cleaning” includes syntactical corrections, standardizations and “pretty printing” of the contents stream. If a page has several contents objects, they will be combined into one. Any discrepancies between /Contents and /Resources objects are also resolved / corrected. Note that the resulting contents stream will be stored uncompressed (if you do not specify deflate on save). See Page._getContents() for more details.

Return type: int

Returns: 0 on success.

Annot._getXref()

Return the xref number of an annotation.

Return type: int

Returns: XREF number of the annotation.

Annot._cleanContents()

Clean the /Contents streams associated with the annotation. This is the same type of action Page._cleanContents() performs - just restricted to this annotation.

Return type: int

Returns: 0 if successful (exception raised otherwise).

Document.getCharWidths(xref = 0, limit = 256)

Return a list of character glyphs and their widths for a font that is present in the document. A font must be specified by its PDF cross reference number xref. This function is called automatically from Page.insertText() and Page.insertTextbox(). So you should rarely need to do this yourself.

Parameters:

xref (int) – cross reference number of a font embedded in the PDF. To find a font xref, use e.g. doc.getPageFontList(pno) of page number pno and take the first entry of one of the returned list entries.

limit (int) – limits the number of returned entries. The default of 256 is enforced for all fonts that only support 1-byte characters, so-called “simple fonts” (checked by this method). All PDF Base 14 Fonts are simple fonts.

Return type:
list

Returns:
a list of limit tuples. Each character c has an entry (g, w) in this list with an index of ord(c). Entry g (integer) of the tuple is the glyph id of the character, and float w is its normalized width. The actual width for some fontsize can be calculated as w * fontsize. For simple fonts, the g entry can always be safely ignored. In all other cases g is the basis for graphically representing c.

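This function calculates the pixel width of a string called text:
def pixlen(text, widthlist, fontsize):
    # widthlist is the list returned by getCharWidths()
    try:
        # each entry is a tuple (g, w); index 1 is the normalized width w
        return sum([widthlist[ord(c)][1] for c in text]) * fontsize
    except IndexError:
        # a character's code point lies beyond the end of the list
        m = max([ord(c) for c in text])
        raise ValueError("max. code point found: %i, increase limit" % m)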

Document.getPageRawText(pno, p1, p2)

Return lines of raw text contained between a pair of points.

Parameters:

pno (int) – page number.

p1 (Point) – Text delimiter point.

p2 (Point) – Text delimiter point.

Return type:
string

Returns:
see the page version of this method.

Page.extractTextLines(p1, p2)

Return lines of text contained between a pair of points.

Parameters:

p1 (Point) – text delimiter point.

p2 (Point) – text delimiter point.

Return type:
str

Returns:
text lines between the two points (UTF-8 encoded).

Page.extractTextRect(rect)

Return lines of text contained in a rectangle.

Parameters: rect (Rect) – rectangle.

Return type: str

Returns: text occurring inside the rectangle.

Document._getObjectString(xref)

Document._getXrefString(xref)

Return the string (“source code”) representing an arbitrary object. For stream objects, only the non-stream part is returned. To get the stream content, use _getXrefStream().

Parameters: xref (int) – XREF number.

Return type: string

Returns: the string defining the object identified by xref.

Document._getGCTXerrmsg()

Retrieve the exception message text issued by PyMuPDF’s low-level code. In most cases, but not always, these are MuPDF messages. This string will never be cleared, only overwritten as needed. Only rely on it if a RuntimeError has been raised.

Return type: str

Returns: last C-level error message on occasion of a RuntimeError exception.

Document._getNewXref()

Increase the XREF by one entry and return that number. This can then be used to insert a new object.

Return type: int

Returns: the number of the new XREF entry.

Document._updateObject(xref, obj_str, page = None)

Associate the object identified by string obj_str with the XREF number xref, which must already exist. If xref pointed to an existing object, this will be replaced with the new object. If a page object is specified, links and other annotations of this page will be reloaded after the object has been updated.

Parameters:

xref (int) – XREF number.

obj_str (str) – a string containing a valid PDF object definition.

page (Page) – a page object. If provided, it indicates that annotations of this page should be refreshed (reloaded) to reflect changes to links and / or annotations.

Return type:
int

Returns:
zero if successful, otherwise an exception will be raised.
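
Since the XREF number must already exist, inserting a brand-new object is a two-step operation. The sketch below combines Document._getNewXref() and Document._updateObject() in that order; the helper name `add_object` is hypothetical, and `doc` is assumed to be an open PDF Document.

```python
def add_object(doc, obj_str):
    """Create a new XREF entry, store obj_str under it, and return its number."""
    xref = doc._getNewXref()           # step 1: extend the XREF table by one entry
    doc._updateObject(xref, obj_str)   # step 2: associate the object definition
    return xref
```

A call such as `add_object(doc, "<</Type/Example>>")` would then return the number under which the dictionary was stored; the dictionary shown is only a placeholder for a valid PDF object definition.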

Document._getXrefLength()

Return length of XREF table.

Return type: int

Returns: the number of entries in the XREF table.
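
One possible use of the table length is to scan all object definitions for a given text. The following sketch is hedged accordingly: the helper name `find_objects` is hypothetical, and it assumes that entry 0 is the reserved head of the XREF table and that every entry from 1 to length - 1 can be read with Document._getXrefString().

```python
def find_objects(doc, needle):
    """Return the XREF numbers whose object source contains the text needle."""
    hits = []
    # Assumption: entry 0 is reserved, so valid numbers run from 1 to length - 1.
    for xref in range(1, doc._getXrefLength()):
        if needle in doc._getXrefString(xref):
            hits.append(xref)
    return hits
```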

Document._getXrefStream(xref)

Return the decompressed stream content of the object referenced by xref. If the object has no stream, an exception is raised.

Parameters: xref (int) – XREF number.

Return type: str or bytes

Returns: the (decompressed) stream of the object. This is a string in Python 2 and a bytes object in Python 3.

Document._updateStream(xref, stream)

Replace the stream of an object identified by xref. If the object has no stream, an exception is raised. The function automatically performs a compress operation (“deflate”).

Parameters:

xref (int) – XREF number.

stream (bytes or bytearray) – the new content of the stream.

Return type:
int

This method is intended to manipulate streams containing PDF operator syntax (see page 985 ff. of the Adobe PDF Reference 1.7), as is the case for page contents streams, for example.

If you update a contents stream, you should use save parameter clean = True. This ensures consistency between PDF operator source and the object structure.

Example: Let us assume that you no longer want a certain image to appear on a page. This can be achieved by deleting [2] the respective reference in its contents source(s), and indeed the image will be gone after reloading the page. But the page’s /Resources object would still [3] show the image as being referenced by the page. The save option clean = True will clean up any such mismatches.
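
The deletion step in this example could look roughly like the sketch below, which works on the bytes of a contents stream only. The helper name `drop_image_invocation` and the image name `Im1` are hypothetical; the approach assumes the image is invoked in the form `/name Do`, as in footnote [1], and that both tokens are in the same stream.

```python
import re

def drop_image_invocation(stream, image_name):
    """Remove every '/<image_name> Do' invocation from a contents stream.

    stream and image_name are bytes. The surrounding q ... cm ... Q
    operators are left in place, which is harmless.
    """
    pattern = rb"/" + re.escape(image_name) + rb"\s+Do\b"
    return re.sub(pattern, b"", stream)

old_stream = b"q 100 0 0 50 10 10 cm /Im1 Do Q"
new_stream = drop_image_invocation(old_stream, b"Im1")
```

The result would then be written back with Document._updateStream(xref, new_stream), followed by a save with clean = True as described above.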

Document._getOLRootNumber()

Return XREF number of the /Outlines root object (this is not the first outline entry!). If this object does not exist, a new one will be created.

Return type: int

Returns: XREF number of the /Outlines root object.

Document.extractFont(xref, info_only = False)

Return an embedded font file’s data and appropriate file extension. This can be used to store the font as an external file. The method does not throw exceptions (other than via checking for PDF).

Parameters:

xref (int) – PDF object number of the font to extract.

info_only (bool) – only return font information, not the buffer. To be used for information-only purposes, saves allocation of large buffer areas.

Return type:
tuple

Returns:
a tuple (basename, ext, subtype, buffer), where ext is a 3-byte suggested file extension (str), basename is the font’s name (str), subtype is the font’s type (e.g. “Type1”) and buffer is a bytes object containing the font file’s content (or b""). For possible extension values and their meaning see Font File Extensions. Return details on error:

("", "", "", b"") - invalid xref or xref is not a (valid) font object.

(basename, "n/a", "Type1", b"") - basename is one of the PDF Base 14 Fonts, which cannot be extracted.

Example:

>>> # store font as an external file
>>> name, ext, subtype, buffer = doc.extractFont(4711)
>>> # buffer is b"" if the font could not be extracted
>>> if buffer:
...     with open(name + "." + ext, "wb") as ofile:
...         ofile.write(buffer)

Caution

The basename is returned unchanged from the PDF. So it may contain characters (such as blanks) which disqualify it as a valid filename for your operating system. Take appropriate action.
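
One way to take such action is to replace every character outside a conservative set before building the file name. This is only a sketch: the helper name `safe_font_filename`, the set of allowed characters and the fallback name "font" are all assumptions, and your operating system may permit more.

```python
import re

def safe_font_filename(basename, ext):
    """Build a file name from a font's basename and suggested extension."""
    # Keep letters, digits, dot, underscore and hyphen; replace the rest.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", basename).strip("._")
    # Fall back to a generic name if nothing usable remains.
    return (cleaned or "font") + "." + ext
```

With this helper, a basename like "Arial Bold+MT" and extension "ttf" would be stored as "Arial_Bold_MT.ttf".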

Document.FontInfos

Contains the following information for each font inserted via Page.insertFont():

xref (int) - XREF number of the /Type/Font object.

info (dict) - detail font information with the following keys:

name (str) - name of the basefont

idx (int) - index number for multi-font files

type (str) - font type (like “TrueType”, “Type0”, etc.)

ext (str) - extension to be used, when font is extracted to a file (see Font File Extensions).

glyphs (list) - list of glyph numbers and widths (filled by text insertion methods).

Return type: list

Footnotes

[1] If a page has multiple contents streams, they are treated as being one logical stream when the document is processed by reader software. A single operator cannot be split between stream boundaries, but a single instruction may well be. E.g. invoking the display of an image looks like this: q a b c d e f cm /imageid Do Q. Any single one of these items (PDF notation: “lexical tokens”) is always contained in one stream, but q a b c d e f cm may be in one and /imageid Do Q in the next one.

[2] Note that /Contents objects (similar to /Resources) may be shared among pages. A change to a contents stream may therefore affect other pages, too. To avoid this: (1) use Page._cleanContents(), (2) read the /Contents object (there will now be only one left), (3) make your changes.

[3] Resources objects are inheritable. This means that many pages can share one. Keeping a page’s /Resources object in sync with changes of its /Contents may therefore require creating a separate /Resources object for the page. This can best be achieved by using clean when saving, or by invoking Page._cleanContents().
