Functions
The following are miscellaneous functions to be used by the experienced PDF programmer.
Function | Short Description |
---|---|
Document.FontInfos | PDF only: information on inserted fonts |
Annot._cleanContents() | PDF only: clean the annot’s /Contents objects |
Annot._getXref() | PDF only: return XREF number of annotation |
ConversionHeader() | return header string for getText methods |
ConversionTrailer() | return trailer string for getText methods |
Document._delXmlMetadata() | PDF only: remove XML metadata |
Document._getGCTXerrmsg() | retrieve C-level exception message |
Document._getNewXref() | PDF only: create and return a new XREF entry |
Document._getObjectString() | PDF only: return object source code |
Document._getOLRootNumber() | PDF only: return / create XREF of /Outline |
Document._getPageObjNumber() | PDF only: return XREF and generation number of a page |
Document._getPageXref() | PDF only: same as _getPageObjNumber() |
Document._getXmlMetadataXref() | PDF only: return XML metadata XREF number |
Document._getXrefLength() | PDF only: return length of XREF table |
Document._getXrefStream() | PDF only: return content of a stream |
Document._getXrefString() | PDF only: return object source code |
Document._updateObject() | PDF only: insert or update a PDF object |
Document._updateStream() | PDF only: replace the stream of an object |
Document.extractFont() | PDF only: extract embedded font |
Document.getCharWidths() | PDF only: return a list of glyph widths of a font |
Document.getPageRawText() | PDF only: return raw string between two points |
getPDFnow() | return the current timestamp in PDF format |
getPDFstr() | return PDF-compatible string |
Page._cleanContents() | PDF only: clean the page’s /Contents objects |
Page._getContents() | PDF only: return a list of content numbers |
Page._getXref() | PDF only: return XREF number of page |
Page.getDisplayList() | create the page’s display list |
Page.extractTextLines() | return text between two points |
Page.extractTextRect() | return text inside a rectangle |
Page.insertFont() | PDF only: store a new font in the document |
Page.run() | run a page through a device |
PaperSize() | return width, height for known paper formats |
PaperSize(s)
Convenience function to return the width and height of a known paper format code. These values are given in pixels at the standard resolution of 72 pixels per inch.
Currently defined formats include A0 through A10, B0 through B10, C0 through C10, Card-4x6, Card-5x7, Commercial, Executive, Invoice, Ledger, Legal, Legal-13, Letter, Monarch and Tabloid-Extra, each in either portrait or landscape format.
A format name must be supplied as a string (case insensitive), optionally suffixed with “-L” (landscape) or “-P” (portrait). No suffix defaults to portrait.
Parameters: s (str) – a format name like "A4" or "letter-l".
Return type: tuple
Returns: (width, height) of the paper format. For an unknown format, (-1, -1) is returned. Examples: PaperSize("A4") returns (595, 842) and PaperSize("letter-l") returns (792, 612).
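The suffix handling described above can be illustrated with a small stand-alone sketch. It is not the library's implementation: the function name paper_size_sketch and the two-entry table KNOWN_FORMATS are hypothetical, and the table holds only the two sizes that follow from the examples given in the text.

```python
# Illustrative table only: two formats derived from the examples in the text.
# The real function knows many more formats.
KNOWN_FORMATS = {
    "a4": (595, 842),
    "letter": (612, 792),
}


def paper_size_sketch(s):
    """Return (width, height) for a format name, or (-1, -1) if unknown."""
    name = s.lower()              # format names are case insensitive
    landscape = False
    if name.endswith("-l"):       # landscape suffix
        name, landscape = name[:-2], True
    elif name.endswith("-p"):     # portrait suffix (also the default)
        name = name[:-2]
    size = KNOWN_FORMATS.get(name, (-1, -1))
    if landscape:
        return (size[1], size[0])  # swap width and height
    return size
```

With this sketch, landscape is simply the portrait size with width and height exchanged, which is why an unknown format still yields (-1, -1) whichever suffix is given.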
getPDFnow()
Convenience function to return the current local timestamp in PDF-compatible format, e.g. D:20170501121525-04'00' for local datetime May 1, 2017, 12:15:25 in a timezone 4 hours west of the UTC meridian.
Return type: str
Returns: current local PDF timestamp.
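To make the timestamp layout concrete, here is a minimal sketch that builds a string of the same shape with the standard library only. The helper name pdf_timestamp is hypothetical and the code is not taken from PyMuPDF; it assumes a timezone-aware datetime as input.

```python
from datetime import datetime, timedelta, timezone


def pdf_timestamp(dt):
    """Format a timezone-aware datetime as D:YYYYMMDDHHMMSS+HH'MM'."""
    offset_minutes = int(dt.utcoffset().total_seconds()) // 60
    sign = "-" if offset_minutes < 0 else "+"
    hours, minutes = divmod(abs(offset_minutes), 60)
    return "%s%s%02d'%02d'" % (dt.strftime("D:%Y%m%d%H%M%S"), sign, hours, minutes)


# The example from the text: May 1, 2017, 12:15:25, four hours west of UTC.
example = datetime(2017, 5, 1, 12, 15, 25, tzinfo=timezone(timedelta(hours=-4)))
print(pdf_timestamp(example))  # D:20170501121525-04'00'
```

For the current local time, a timezone-aware value can be obtained with datetime.now().astimezone() and passed to the same helper.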
getPDFstr(obj, brackets = True)
Make a PDF-compatible string: if obj contains code points ord(c) > 255, then it will be converted to UTF-16BE as a hexadecimal character string like <feff...>. Otherwise, if brackets = True, the argument is enclosed in (), and any character with a code point ord(c) > 127 is replaced by its octal number, prefixed with a backslash (\nnn). If brackets = False, then the string is returned unchanged.
Parameters: obj (str or bytes or unicode) – the object to convert
Return type: str
Returns: PDF-compatible string enclosed in either () or <>.
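The three rules above can be written down as a short sketch. The name pdf_str_sketch is hypothetical, the code accepts Python 3 str only, and it implements nothing beyond what the description states; any further escaping the real function may perform (for example of parentheses or backslashes) is deliberately left out.

```python
def pdf_str_sketch(text, brackets=True):
    """Apply the conversion rules described above to a Python 3 str."""
    if any(ord(c) > 255 for c in text):
        # Wide characters: UTF-16BE with byte order mark, written as hex digits.
        return "<feff" + text.encode("utf-16-be").hex() + ">"
    if not brackets:
        # No brackets requested: the string is returned unchanged.
        return text
    # 8-bit text: keep ASCII, write other characters as backslash + 3 octal digits.
    body = "".join(c if ord(c) <= 127 else "\\%03o" % ord(c) for c in text)
    return "(" + body + ")"
```

Note that the check for wide characters comes first, so the brackets argument has no effect on strings that need the hexadecimal form.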
ConversionHeader(output = "text", filename = "UNKNOWN")
Return the header string required to make a valid document out of page text outputs.
Parameters:
- output (str) – type of document. Use the same as the output parameter of getText().
- filename (str) – optional arbitrary name to use in output types “json” and “xml”.
Return type: str
ConversionTrailer(output)
Return the trailer string required to make a valid document out of page text outputs. See Page.getText() for an example.
Parameters: output (str) – type of document. Use the same as the output parameter of getText().
Return type: str
Document._delXmlMetadata()
Delete an object containing XML-based metadata from the PDF. (Py-) MuPDF does not support XML-based metadata. Use this if you want to make sure that the conventional metadata dictionary will be used exclusively. Many third-party PDF programs insert their own metadata in XML format and thus may override what you store in the conventional dictionary. This method deletes any such reference, and the corresponding PDF object will be deleted during the next garbage collection of the file.
Document._getXmlMetadataXref()
Return the XML-based metadata object id from the PDF if present. Also refer to Document._delXmlMetadata(). You can use it to retrieve the content via Document._getXrefStream() and then work with it using some XML software.
Document._getPageObjNumber(pno) or Document._getPageXref(pno)
Return the XREF and generation number for a given page.
Parameters: pno (int) – page number (zero-based).
Return type: list
Returns: XREF and generation number of page pno as a list [xref, gen].
Page._getXref()
Page version of _getPageObjNumber(), delivering only the XREF (not the generation number).
Page.insertFont(fontname = "Helvetica", fontfile = None, idx = 0, set_simple = False)
Store a new font for the page and return its XREF. If the page already references this font, nothing is changed and just the XREF is returned.
Parameters:
- fontname (str) – The reference name of the font. If the name does not occur in Page.getFontList(), then this must be either the name of one of the PDF Base 14 Fonts, or fontfile must also be given. Following this method, the font name prefixed with a slash “/” can be used to refer to the font in text insertions. If the name appears in the list, the method ignores all other parameters and exits with the xref number.
- fontfile (str) – font file name. This file will be embedded in the PDF.
- idx (int) – index of the font in the given file. Has no meaning and is ignored if fontfile is not specified. Default is zero. An invalid index will cause an exception.
Note
Certain font files can contain more than one font. This parameter can be used to select the right one. PyMuPDF has no way to tell whether the font file indeed contains a font for any non-zero index.
Caution
Only the first choice of idx will be honored - subsequent specifications are ignored.
- set_simple (bool) – When inserting from a font file, a “Type0” font will be installed by default. This option causes the font to be installed as a simple font instead. Only 1-byte characters will then be presented correctly, others will appear as “?” (question mark).
Caution
Only the first choice of set_simple will be honored. Subsequent specifications are ignored.
Return type: int
Returns: the XREF of the font. PyMuPDF records inserted fonts in two places:
- An inserted font will appear in Page.getFontList().
- Document.FontInfos records information about all fonts that have been inserted by this method on a document-wide basis.
Page.getDisplayList()
Run a page through a list device and return its display list.
Return type: DisplayList
Returns: the display list of the page.
Page._getContents()
Return a list of XREF numbers of /Contents objects belonging to the page. The length of this list will always be at least one.
Return type: list
Returns: a list of XREF integers. Each page has one or more associated contents objects (streams) which contain PDF operator syntax describing what appears where on the page (like text or images, etc. See the Adobe PDF Reference 1.7, chapter “Operator Summary”, page 985). This function only enumerates the XREF number(s) of such objects. To get the actual stream source, use function Document._getXrefStream() with one of the numbers in this list. Use Document._updateStream() to replace the content [1] [2].
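As a sketch of how the two calls named above fit together, the following helper collects the operator source of all contents objects of a page. The helper name page_contents_source is hypothetical; doc and page stand for an open PDF Document and one of its pages, and the streams are assumed to be bytes objects (Python 3).

```python
def page_contents_source(doc, page):
    """Concatenate the streams of all /Contents objects of a page."""
    streams = [doc._getXrefStream(xref) for xref in page._getContents()]
    # Reader software treats the streams as one logical stream (footnote [1]).
    # Joining them with a line break is an assumption of this sketch.
    return b"\n".join(streams)
```

This is for inspection only. To change the content, it is safer to work on a single contents object as outlined in footnote [2].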
Page._cleanContents()
Clean all /Contents objects associated with this page (including contents of all annotations). “Cleaning” includes syntactical corrections, standardizations and “pretty printing” of the contents stream. If a page has several contents objects, they will be combined into one. Any discrepancies between /Contents and /Resources objects are also resolved / corrected. Note that the resulting contents stream will be stored uncompressed (if you do not specify deflate on save). See Page._getContents() for more details.
Return type: int
Returns: 0 on success.
Annot._getXref()
Return the XREF number of an annotation.
Return type: int
Returns: XREF number of the annotation.
Annot._cleanContents()
Clean the /Contents streams associated with the annotation. This is the same type of action that Page._cleanContents() performs, just restricted to this annotation.
Return type: int
Returns: 0 if successful (exception raised otherwise).
Document.getCharWidths(xref = 0, limit = 256)
Return a list of character glyphs and their widths for a font that is present in the document. A font must be specified by its PDF cross reference number xref. This function is called automatically from Page.insertText() and Page.insertTextbox(), so you should rarely need to do this yourself.
Parameters:
- xref (int) – cross reference number of a font embedded in the PDF. To find a font xref, use e.g. doc.getPageFontList(pno) of page number pno and take the first entry of one of the returned list entries.
- limit (int) – limits the number of returned entries. The default of 256 is enforced for all fonts that only support 1-byte characters, so-called “simple fonts” (checked by this method). All PDF Base 14 Fonts are simple fonts.
Return type: list
Returns: a list of limit tuples. Each character c has an entry (g, w) in this list with an index of ord(c). Entry g (integer) of the tuple is the glyph id of the character, and float w is its normalized width. The actual width for some fontsize can be calculated as w * fontsize. For simple fonts, the g entry can always be safely ignored. In all other cases g is the basis for graphically representing c.
This function calculates the pixel width of a string called text:
    def pixlen(text, widthlist, fontsize):
        # widthlist is the list returned by getCharWidths();
        # each entry is a tuple (glyph, width), so the width is item 1
        try:
            return sum([widthlist[ord(c)][1] for c in text]) * fontsize
        except IndexError:
            m = max([ord(c) for c in text])
            raise ValueError("max. code point found: %i, increase limit" % m)
Page.extractTextRect(rect)
Return lines of text contained in a rectangle.
Parameters: rect (Rect) – rectangle.
Return type: str
Returns: text occurring inside the rectangle.
Document._getObjectString(xref) or Document._getXrefString(xref)
Return the string (“source code”) representing an arbitrary object. For stream objects, only the non-stream part is returned. To get the stream content, use _getXrefStream().
Parameters: xref (int) – XREF number.
Return type: string
Returns: the string defining the object identified by xref.
Document._getGCTXerrmsg()
Retrieve the exception message text issued by PyMuPDF’s low-level code. In most cases, but not always, these are MuPDF messages. This string will never be cleared - only overwritten as needed. Only rely on it if a RuntimeError has been raised.
Return type: str
Returns: last C-level error message on occasion of a RuntimeError exception.
Document._getNewXref()
Increase the XREF table by one entry and return the number of that entry. This can then be used to insert a new object.
Return type: int
Returns: the number of the new XREF entry.
Document._updateObject(xref, obj_str, page = None)
Associate the object identified by string obj_str with the XREF number xref, which must already exist. If xref pointed to an existing object, this will be replaced with the new object. If a page object is specified, links and other annotations of this page will be reloaded after the object has been updated.
Parameters:
- xref (int) – XREF number.
- obj_str (str) – a string containing a valid PDF object definition.
- page (Page) – a page object. If provided, indicates that annotations of this page should be refreshed (reloaded) to reflect changes to links and / or annotations.
Return type: int
Returns: zero if successful, otherwise an exception will be raised.
Document._getXrefLength()
Return length of XREF table.
Return type: int
Returns: the number of entries in the XREF table.
Document._getXrefStream(xref)
Return the decompressed content stream of the object referenced by xref. If the object has no stream, an exception is raised.
Parameters: xref (int) – XREF number.
Return type: str or bytes
Returns: the (decompressed) stream of the object. This is a string in Python 2 and a bytes object in Python 3.
Document._updateStream(xref, stream)
Replace the stream of an object identified by xref. If the object has no stream, an exception is raised. The function automatically performs a compress operation (“deflate”).
Parameters:
- xref (int) – XREF number.
- stream (bytes or bytearray) – the new content of the stream.
Return type: int
This method is intended to manipulate streams containing PDF operator syntax (see page 985 of the Adobe PDF Reference 1.7), as is the case for e.g. page content streams.
If you update a contents stream, you should use save parameter clean = True. This ensures consistency between PDF operator source and the object structure.
Example: Let us assume that you no longer want a certain image to appear on a page. This can be achieved by deleting [2] the respective reference in its contents source(s) - and indeed: the image will be gone after reloading the page. But the page’s /Resources object would still [3] show the image as being referenced by the page. This save option will clean up any such mismatches.
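The three steps recommended in footnote [2] for changing a contents stream can be sketched as follows. The helper name replace_in_page_stream is hypothetical; doc and page stand for an open PDF Document and one of its pages, and the search and replacement values are assumed to be bytes (Python 3).

```python
def replace_in_page_stream(doc, page, old, new):
    """Replace the bytes `old` by `new` in the contents stream of a page.

    Returns True if the stream was changed and written back.
    """
    page._cleanContents()               # (1) combine all /Contents into one object
    xref = page._getContents()[0]       # (2) only one contents object is left now
    stream = doc._getXrefStream(xref)   # decompressed operator source
    changed = stream.replace(old, new)  # (3) make the change
    if changed == stream:
        return False                    # nothing found, leave the object untouched
    doc._updateStream(xref, changed)
    return True
```

A plain byte replacement is the simplest possible edit and only suitable when the exact operator sequence is known. After such a change, saving with clean = True keeps /Resources consistent with the modified contents, as explained above.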
Document._getOLRootNumber()
Return XREF number of the /Outlines root object (this is not the first outline entry!). If this object does not exist, a new one will be created.
Return type: int
Returns: XREF number of the /Outlines root object.
Document.extractFont(xref, info_only = False)
Return an embedded font file’s data and appropriate file extension. This can be used to store the font as an external file. The method does not throw exceptions (other than via checking for PDF).
Parameters:
- xref (int) – PDF object number of the font to extract.
- info_only (bool) – only return font information, not the buffer. To be used for information-only purposes, saves allocation of large buffer areas.
Return type: tuple
Returns: a tuple (basename, ext, subtype, buffer), where ext is a 3-byte suggested file extension (str), basename is the font’s name (str), subtype is the font’s type (e.g. “Type1”) and buffer is a bytes object containing the font file’s content (or b""). For possible extension values and their meaning see Font File Extensions. Return details on error:
- ("", "", "", b"") - invalid xref or xref is not a (valid) font object.
- (basename, "n/a", "Type1", b"") - basename is one of the PDF Base 14 Fonts, which cannot be extracted.
Example:
    >>> # store font as an external file
    >>> name, ext, subtype, buffer = doc.extractFont(4711)
    >>> # assuming buffer is not empty:
    >>> ofile = open(name + "." + ext, "wb")
    >>> ofile.write(buffer)
    >>> ofile.close()
Caution
The basename is returned unchanged from the PDF. So it may contain characters (such as blanks) which disqualify it as a valid filename for your operating system. Take appropriate action.
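One possible form of such "appropriate action" is to replace every character outside a conservative set before building the file name. This is only a sketch: the helper name safe_font_filename, the set of characters kept, and the fallback name "font" are all assumptions, not part of PyMuPDF.

```python
import re


def safe_font_filename(basename, ext):
    """Build a file name from the basename and ext returned by extractFont()."""
    # Keep letters, digits, dot, underscore and hyphen; replace everything else.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", basename)
    # Fall back to a fixed name if the basename is empty.
    return (cleaned or "font") + "." + ext
```

For example, a basename containing blanks or slashes then yields a name that can be passed to open() on common operating systems.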
Document.FontInfos
Contains the following information for any font inserted via Page.insertFont():
- xref (int) - XREF number of the /Type/Font object.
- info (dict) - detailed font information with the following keys:
  - name (str) - name of the basefont
  - idx (int) - index number for multi-font files
  - type (str) - font type (like “TrueType”, “Type0”, etc.)
  - ext (str) - extension to be used when the font is extracted to a file (see Font File Extensions).
  - glyphs (list) - list of glyph numbers and widths (filled by text insertion methods).
Return type: list
Footnotes
[1] | If a page has multiple contents streams, they are treated as being one logical stream when the document is processed by reader software. A single operator cannot be split between stream boundaries, but a single instruction may well be. E.g. invoking the display of an image looks like this: q a b c d e f cm /imageid Do Q . Each of these items (PDF notation: “lexical tokens”) is always contained in one stream, but q a b c d e f cm may be in one and /imageid Do Q in the next one. |
[2] | Note that /Contents objects (similar to /Resources) may be shared among pages. A change to a contents stream may therefore affect other pages, too. To avoid this: (1) use Page._cleanContents() , (2) read the /Contents object (there will now be only one left), (3) make your changes. |
[3] | Resources objects are inheritable. This means that many pages can share one. Keeping a page’s /Resources object in sync with changes of its /Contents may therefore require creating a separate /Resources object for the page. This can best be achieved by using clean when saving, or by invoking Page._cleanContents() . |