Functions

The following are miscellaneous functions to be used by the experienced PDF programmer.

Function Short Description
Document.FontInfos PDF only: information on inserted fonts
Annot._cleanContents() PDF only: clean the annot’s /Contents objects
Annot._getXref() PDF only: return XREF number of annotation
ConversionHeader() return header string for getText methods
ConversionTrailer() return trailer string for getText methods
Document._delXmlMetadata() PDF only: remove XML metadata
Document._getGCTXerrmsg() retrieve C-level exception message
Document._getNewXref() PDF only: create and return a new XREF entry
Document._getObjectString() PDF only: return object source code
Document._getOLRootNumber() PDF only: return / create XREF of /Outline
Document._getPageObjNumber() PDF only: return XREF and generation number of a page
Document._getPageXref() PDF only: same as _getPageObjNumber()
Document._getXmlMetadataXref() PDF only: return XML metadata XREF number
Document._getXrefLength() PDF only: return length of XREF table
Document._getXrefStream() PDF only: return content of a stream
Document._getXrefString() PDF only: return object source code
Document._updateObject() PDF only: insert or update a PDF object
Document._updateStream() PDF only: replace the stream of an object
Document.extractFont() PDF only: extract embedded font
Document.getCharWidths() PDF only: return a list of glyph widths of a font
Document.getPageRawText() PDF only: return raw string between two points
getPDFnow() return the current timestamp in PDF format
getPDFstr() return PDF-compatible string
Page._cleanContents() PDF only: clean the page’s /Contents objects
Page._getContents() PDF only: return a list of content numbers
Page._getXref() PDF only: return XREF number of page
Page.getDisplayList() create the page’s display list
Page.extractTextLines() return text between two points
Page.extractTextRect() return text inside a rectangle
Page.insertFont() PDF only: store a new font in the document
Page.run() run a page through a device
PaperSize() return width, height for known paper formats

PaperSize(s)

Convenience function to return width and height of a known paper format code. The values are given in pixels at the standard resolution of 72 pixels per inch.

Currently defined formats include A0 through A10, B0 through B10, C0 through C10, Card-4x6, Card-5x7, Commercial, Executive, Invoice, Ledger, Legal, Legal-13, Letter, Monarch and Tabloid-Extra, each in either portrait or landscape format.

A format name must be supplied as a string (case insensitive), optionally suffixed with “-L” (landscape) or “-P” (portrait). No suffix defaults to portrait.

Parameters:s (str) – a format name like "A4" or "letter-l".
Return type:tuple
Returns:(width, height) of the paper format. For an unknown format (-1, -1) is returned. Examples: PaperSize("A4") returns (595, 842) and PaperSize("letter-l") returns (792, 612).
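The suffix handling described above can be sketched in a few lines. This is an illustration only, not the library's code: the name paper_size and the two-entry table SIZES are hypothetical, and the table holds just the two formats whose values are quoted in this section.

```python
# Hypothetical two-entry table; the real function knows many more formats.
SIZES = {"a4": (595, 842), "letter": (612, 792)}


def paper_size(s):
    """Return (width, height) for a format code such as "A4" or "letter-l"."""
    name = s.lower()              # format names are case insensitive
    landscape = False
    if name.endswith("-l"):       # "-L" requests landscape
        name, landscape = name[:-2], True
    elif name.endswith("-p"):     # "-P" requests portrait (the default)
        name = name[:-2]
    width, height = SIZES.get(name, (-1, -1))   # unknown format: (-1, -1)
    return (height, width) if landscape else (width, height)
```

Landscape simply swaps the two portrait values, so an unknown name still yields (-1, -1) whatever suffix it carries.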

getPDFnow()

Convenience function to return the current local timestamp in PDF compatible format, e.g. D:20170501121525-04'00' for local datetime May 1, 2017, 12:15:25 in a timezone 4 hours westward of the UTC meridian.

Return type:str
Returns:current local PDF timestamp.
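To make the timestamp layout concrete, here is a minimal sketch that formats a timezone-aware datetime the same way. The helper name pdf_timestamp is an assumption made for this example and is not part of the library.

```python
from datetime import datetime, timedelta, timezone


def pdf_timestamp(dt):
    """Format a timezone-aware datetime as D:YYYYMMDDhhmmss+HH'mm'."""
    offset = dt.utcoffset()
    if offset is None:
        raise ValueError("dt must be timezone-aware")
    minutes = int(offset.total_seconds() // 60)   # UTC offset in minutes
    sign = "+" if minutes >= 0 else "-"
    hh, mm = divmod(abs(minutes), 60)
    return dt.strftime("D:%Y%m%d%H%M%S") + "%s%02d'%02d'" % (sign, hh, mm)


# The example from the description: May 1, 2017, 12:15:25, four hours west of UTC.
stamp = pdf_timestamp(
    datetime(2017, 5, 1, 12, 15, 25, tzinfo=timezone(timedelta(hours=-4)))
)
```

With this input, stamp equals the string D:20170501121525-04'00' quoted above.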

getPDFstr(obj, brackets = True)

Make a PDF-compatible string: if obj contains code points ord(c) > 255, then it will be converted to UTF-16BE as a hexadecimal character string like <feff...>. Otherwise, if brackets = True, it will enclose the argument in () replacing any characters with code points ord(c) > 127 by their octal number \nnn prefixed with a backslash. If brackets = False, then the string is returned unchanged.

Parameters:obj (str or bytes or unicode) – the object to convert
Return type:str
Returns:PDF-compatible string enclosed in either () or <>.
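The three cases described above can be sketched as follows. This is a simplified illustration under stated assumptions: the name pdf_string is hypothetical, only str input is handled, and escaping of parentheses or backslashes inside the text, which a complete implementation would have to deal with, is deliberately left out.

```python
def pdf_string(s, brackets=True):
    """Simplified sketch of the conversion rules described above."""
    if any(ord(c) > 255 for c in s):
        # Code points beyond 255: UTF-16BE as hexadecimal, with a byte order mark.
        return "<feff" + s.encode("utf-16-be").hex() + ">"
    if not brackets:
        return s                   # returned unchanged
    # Code points above 127 become a backslash plus three octal digits.
    body = "".join(c if ord(c) <= 127 else "\\%03o" % ord(c) for c in s)
    return "(" + body + ")"
```

For example, pdf_string("abc") gives (abc), while a string containing the euro sign is returned as <feff20ac>.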

ConversionHeader(output = "text", filename = "UNKNOWN")

Return the header string required to make a valid document out of page text outputs.

Parameters:
  • output (str) – type of document. Use the same as the output parameter of getText().
  • filename (str) – optional arbitrary name to use in output types “json” and “xml”.
Return type:

str

ConversionTrailer(output)

Return the trailer string required to make a valid document out of page text outputs. See Page.getText() for an example.

Parameters:output (str) – type of document. Use the same as the output parameter of getText().
Return type:str

Document._delXmlMetadata()

Delete an object containing XML-based metadata from the PDF. (Py-) MuPDF does not support XML-based metadata. Use this if you want to make sure that the conventional metadata dictionary will be used exclusively. Many third-party PDF programs insert their own metadata in XML format and thus may override what you store in the conventional dictionary. This method deletes any such reference, and the corresponding PDF object will be deleted during the next garbage collection of the file.


Document._getXmlMetadataXref()

Return the XML-based metadata object id from the PDF if present - also refer to Document._delXmlMetadata(). You can use it to retrieve the content via Document._getXrefStream() and then work with it using some XML software.


Document._getPageObjNumber(pno)

or

Document._getPageXref(pno)
Return the XREF and generation number for a given page.
Parameters:pno (int) – Page number (zero-based).
Return type:list
Returns:XREF and generation number of page pno as a list [xref, gen].

Page._getXref()

Page version of _getPageObjNumber() that delivers only the XREF (not the generation number).


Page.run(dev, transform)

Run a page through a device.

Parameters:
  • dev (Device) – Device, obtained from one of the Device constructors.
  • transform (Matrix) – Transformation to apply to the page. Set it to Identity if no transformation is desired.

Page.insertFont(fontname = "Helvetica", fontfile = None, idx = 0, set_simple = False)

Store a new font for the page and return its XREF. If the page already references this font, it is a no-operation and just the XREF is returned.

Parameters:
  • fontname (str) – The reference name of the font. If the name already appears in Page.getFontList(), the method ignores all other parameters and returns the xref number. Otherwise, this must be either the name of one of the PDF Base 14 Fonts, or fontfile must also be given. After this method has been called, the font name prefixed with a slash “/” can be used to refer to the font in text insertions.
  • fontfile (str) – font file name. This file will be embedded in the PDF.
  • idx (int) –

    index of the font in the given file. Has no meaning and is ignored if fontfile is not specified. Default is zero. An invalid index will cause an exception.

    Note

    Certain font files can contain more than one font. This parameter can be used to select the right one. PyMuPDF has no way to tell whether the font file indeed contains a font for any non-zero index.

    Caution

    Only the first choice of idx will be honored - subsequent specifications are ignored.

  • set_simple (bool) –

    When inserting from a font file, a “Type0” font will be installed by default. This option causes the font to be installed as a simple font instead. Only 1-byte characters will then be presented correctly, others will appear as “?” (question mark).

    Caution

    Only the first choice of set_simple will be honored. Subsequent specifications are ignored.

Return type:

int

Returns:

the XREF of the font. PyMuPDF records inserted fonts in two places:

  1. An inserted font will appear in Page.getFontList().
  2. Document.FontInfos records information about all fonts that have been inserted by this method on a document-wide basis.

Page.getDisplayList()

Run a page through a list device and return its display list.

Return type:DisplayList
Returns:the display list of the page.

Page._getContents()

Return a list of XREF numbers of /Contents objects belonging to the page. The length of this list will always be at least one.

Return type:list
Returns:a list of XREF integers.

Each page has one or more associated contents objects (streams) which contain PDF operator syntax describing what appears where on the page, like text or images (see the Adobe PDF Reference 1.7, chapter “Operator Summary”, page 985). This function only enumerates the XREF number(s) of such objects. To get the actual stream source, use function Document._getXrefStream() with one of the numbers in this list. Use Document._updateStream() to replace the content [1] [2].


Page._cleanContents()

Clean all /Contents objects associated with this page (including contents of all annotations). “Cleaning” includes syntactical corrections, standardizations and “pretty printing” of the contents stream. If a page has several contents objects, they will be combined into one. Any discrepancies between /Contents and /Resources objects are also corrected. Note that the resulting contents stream will be stored uncompressed unless you specify deflate when saving. See Page._getContents() for more details.

Return type:int
Returns:0 on success.

Annot._getXref()

Return the xref number of an annotation.

Return type:int
Returns:XREF number of the annotation.

Annot._cleanContents()

Clean the /Contents streams associated with the annotation. This is the same type of action Page._cleanContents() performs - just restricted to this annotation.

Return type:int
Returns:0 if successful (exception raised otherwise).

Document.getCharWidths(xref = 0, limit = 256)

Return a list of character glyphs and their widths for a font that is present in the document. A font must be specified by its PDF cross reference number xref. This function is called automatically from Page.insertText() and Page.insertTextbox(). So you should rarely need to do this yourself.

Parameters:
  • xref (int) – cross reference number of a font embedded in the PDF. To find a font xref, use e.g. doc.getPageFontList(pno) of page number pno and take the first entry of one of the returned list entries.
  • limit (int) – limits the number of returned entries. The default of 256 is enforced for all fonts that only support 1-byte characters, so-called “simple fonts” (checked by this method). All PDF Base 14 Fonts are simple fonts.
Return type:

list

Returns:

a list of limit tuples. Each character c has an entry (g, w) in this list with an index of ord(c). Entry g (integer) of the tuple is the glyph id of the character, and float w is its normalized width. The actual width for some fontsize can be calculated as w * fontsize. For simple fonts, the g entry can always be safely ignored. In all other cases g is the basis for graphically representing c.

This function calculates the pixel width of a string called text:

def pixlen(text, widthlist, fontsize):
    try:
        # each entry of widthlist is a tuple (glyph, width)
        return sum([widthlist[ord(c)][1] for c in text]) * fontsize
    except IndexError:
        m = max([ord(c) for c in text])
        raise ValueError("max. code point found: %i, increase limit" % m)

Document.getPageRawText(pno, p1, p2)

Return lines of raw text contained between a pair of points.

Parameters:
  • pno (int) – page number.
  • p1 (Point) – Text delimiter point.
  • p2 (Point) – Text delimiter point.
Return type:

string

Returns:

see the page version of this method.


Page.extractTextLines(p1, p2)

Return lines of text contained between a pair of points.

Parameters:
  • p1 (Point) – text delimiter point.
  • p2 (Point) – text delimiter point.
Return type:

str

Returns:

text lines between the two points (UTF-8 encoded).


Page.extractTextRect(rect)

Return lines of text contained in a rectangle.

Parameters:rect (Rect) – rectangle.
Return type:str
Returns:text occurring inside the rectangle.

Document._getObjectString(xref)
Document._getXrefString(xref)

Return the string (“source code”) representing an arbitrary object. For stream objects, only the non-stream part is returned. To get the stream content, use _getXrefStream().

Parameters:xref (int) – XREF number.
Return type:string
Returns:the string defining the object identified by xref.

Document._getGCTXerrmsg()

Retrieve the exception message text issued by PyMuPDF’s low-level code. In most cases, but not always, these are MuPDF messages. This string will never be cleared - only overwritten as needed. Only rely on it if a RuntimeError has been raised.

Return type:str
Returns:last C-level error message on occasion of a RuntimeError exception.

Document._getNewXref()

Increase the XREF by one entry and return that number. This can then be used to insert a new object.

Return type:int
Returns:the number of the new XREF entry.

Document._updateObject(xref, obj_str, page = None)

Associate the object identified by string obj_str with the XREF number xref, which must already exist. If xref pointed to an existing object, this will be replaced with the new object. If a page object is specified, links and other annotations of this page will be reloaded after the object has been updated.

Parameters:
  • xref (int) – XREF number.
  • obj_str (str) – a string containing a valid PDF object definition.
  • page (Page) – a page object. If provided, it indicates that the annotations of this page should be refreshed (reloaded) to reflect changes made to links and / or annotations.
Return type:

int

Returns:

zero if successful, otherwise an exception will be raised.


Document._getXrefLength()

Return length of XREF table.

Return type:int
Returns:the number of entries in the XREF table.

Document._getXrefStream(xref)

Return the decompressed content stream of the object referenced by xref. If the object is not a stream, an exception is raised.

Parameters:xref (int) – XREF number.
Return type:str or bytes
Returns:the (decompressed) stream of the object. This is a string in Python 2 and a bytes object in Python 3.

Document._updateStream(xref, stream)

Replace the stream of an object identified by xref. If the object has no stream, an exception is raised. The function automatically performs a compress operation (“deflate”).

Parameters:
  • xref (int) – XREF number.
  • stream (bytes or bytearray) – the new content of the stream.
Return type:

int

This method is intended to manipulate streams containing PDF operator syntax (see page 985 of the Adobe PDF Reference 1.7), as is the case for e.g. page content streams.

If you update a contents stream, you should use save parameter clean = True. This ensures consistency between PDF operator source and the object structure.

Example: Let us assume that you no longer want a certain image to appear on a page. This can be achieved by deleting [2] the respective reference in its contents source(s) - and indeed: the image will be gone after reloading the page. But the page’s /Resources object would still [3] show the image as being referenced by the page. This save option will clean up any such mismatches.
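The automatic “deflate” step mentioned above refers to the zlib / deflate compression that PDF uses for its FlateDecode filter. The following sketch only illustrates that round trip with Python's zlib module standing in for the library's internal step; the operator string and the image name /Img1 are invented sample data.

```python
import zlib

# Invented sample content stream: the same image-drawing instruction, repeated.
raw = b"q 100 0 0 100 50 50 cm /Img1 Do Q\n" * 20

packed = zlib.compress(raw)          # what a deflate step produces
restored = zlib.decompress(packed)   # what reading the stream back yields
```

Compression is lossless, so restored is identical to raw, while packed is noticeably shorter for repetitive operator text like this.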


Document._getOLRootNumber()

Return XREF number of the /Outlines root object (this is not the first outline entry!). If this object does not exist, a new one will be created.

Return type:int
Returns:XREF number of the /Outlines root object.

Document.extractFont(xref, info_only = False)

Return an embedded font file’s data and appropriate file extension. This can be used to store the font as an external file. The method does not throw exceptions (other than via checking for PDF).

Parameters:
  • xref (int) – PDF object number of the font to extract.
  • info_only (bool) – only return font information, not the buffer. To be used for information-only purposes, saves allocation of large buffer areas.
Return type:

tuple

Returns:

a tuple (basename, ext, subtype, buffer), where ext is a 3-byte suggested file extension (str), basename is the font’s name (str), subtype is the font’s type (e.g. “Type1”) and buffer is a bytes object containing the font file’s content (or b""). For possible extension values and their meaning see Font File Extensions. Return details on error:

  • ("", "", "", b"") - invalid xref or xref is not a (valid) font object.
  • (basename, "n/a", "Type1", b"") - basename is one of the PDF Base 14 Fonts, which cannot be extracted.

Example:

>>> # store font as an external file
>>> name, ext, subtype, buffer = doc.extractFont(4711)
>>> if buffer:  # empty if the font could not be extracted
...     with open(name + "." + ext, "wb") as ofile:
...         ofile.write(buffer)

Caution

The basename is returned unchanged from the PDF. So it may contain characters (such as blanks) which disqualify it as a valid filename for your operating system. Take appropriate action.
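One possible form of “appropriate action” is to replace every character outside a conservative whitelist before building the file name. The helper below is a sketch only: the name safe_font_filename, the whitelist and the fallback name "font" are all assumptions made for this example, not library behavior.

```python
import re


def safe_font_filename(basename, ext):
    """Build a conservative file name from a font basename and extension."""
    # Keep letters, digits, dot, underscore, plus and hyphen; replace the rest.
    cleaned = re.sub(r"[^A-Za-z0-9._+-]", "_", basename).strip("._")
    # Fall back to a fixed (hypothetical) name if nothing usable remains.
    return (cleaned or "font") + "." + ext
```

A basename such as "ABCDEF+Arial Bold" with extension "ttf" becomes ABCDEF+Arial_Bold.ttf.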

Document.FontInfos

Contains the following information for any font inserted via Page.insertFont():

  • xref (int) - XREF number of the /Type/Font object.

  • info (dict) - detail font information with the following keys:

    • name (str) - name of the basefont
    • idx (int) - index number for multi-font files
    • type (str) - font type (like “TrueType”, “Type0”, etc.)
    • ext (str) - extension to be used, when font is extracted to a file (see Font File Extensions).
    • glyphs (list) - list of glyph numbers and widths (filled by text insertion methods).
Return type:list

Footnotes

[1]If a page has multiple contents streams, they are treated as being one logical stream when the document is processed by reader software. A single operator cannot be split between stream boundaries, but a single instruction may well be. E.g. invoking the display of an image looks like this: q a b c d e f cm /imageid Do Q. Each of these items (PDF notation: “lexical tokens”) is always contained in one stream, but q a b c d e f cm may be in one and /imageid Do Q in the next one.
[2](1, 2) Note that /Contents objects (similar to /Resources) may be shared among pages. A change to a contents stream may therefore affect other pages, too. To avoid this: (1) use Page._cleanContents(), (2) read the /Contents object (there will now be only one left), (3) make your changes.
[3]Resources objects are inheritable. This means that many pages can share one. Keeping a page’s /Resources object in sync with changes of its /Contents may therefore require creating a separate /Resources object for the page. This can best be achieved by using clean when saving, or by invoking Page._cleanContents().