Document
This class represents a document. It can be constructed from a file or from memory.
Since version 1.9.0 there exists the alias open for this class.
For additional details on embedded files refer to Appendix 3.
Method / Attribute | Short Description |
---|---|
Document.authenticate() | decrypt the document |
Document.close() | close the document |
Document.copyPage() | PDF only: copy a page to another location |
Document.deletePage() | PDF only: delete a page by its number |
Document.deletePageRange() | PDF only: delete a range of pages |
Document.embeddedFileAdd() | PDF only: add a new embedded file from buffer |
Document.embeddedFileDel() | PDF only: delete an embedded file entry |
Document.embeddedFileGet() | PDF only: extract an embedded file buffer |
Document.embeddedFileInfo() | PDF only: metadata of an embedded file |
Document.embeddedFileSetInfo() | PDF only: change metadata of an embedded file |
Document.getPageFontList() | make a list of fonts on a page |
Document.getPageImageList() | make a list of images on a page |
Document.getPagePixmap() | create a pixmap of a page by page number |
Document.getPageText() | extract the text of a page by page number |
Document.getToC() | create a table of contents |
Document.insertPage() | PDF only: insert a new page |
Document.insertPDF() | PDF only: insert pages from another PDF |
Document.loadPage() | read a page |
Document.movePage() | PDF only: move a page to another location |
Document.newPage() | PDF only: insert a new empty page |
Document.save() | PDF only: save the document |
Document.saveIncr() | PDF only: save the document incrementally |
Document.searchPageFor() | search for a string on a page |
Document.select() | PDF only: select a subset of pages |
Document.setMetadata() | PDF only: set the metadata |
Document.setToC() | PDF only: set the table of contents (TOC) |
Document.write() | PDF only: writes the document to memory |
Document.embeddedFileCount | number of embedded files |
Document.isClosed | has document been closed? |
Document.isPDF | is document type PDF? |
Document.metadata | metadata |
Document.name | filename of document |
Document.needsPass | require password to access data? |
Document.openErrCode | > 0 if repair occurred during open |
Document.openErrMsg | last error message if openErrCode > 0 |
Document.outline | first Outline item |
Document.pageCount | number of pages |
Document.permissions | permissions to access the document |
Class API
-
class Document
-
__init__(self[, filename])
Constructs a Document object from filename.
Parameters: filename (str) – A string containing the path / name of the document file to be used. The file will be opened and remain open until either explicitly closed (see below) or until end of program. If omitted or None, a new empty PDF document will be created.
Return type: Document
Returns: A Document object.
-
__init__(self, filetype, stream)
Constructs a Document object from memory area stream.
Parameters:
- filetype (str) – A string specifying the type of document contained in stream. This may be either something that looks like a filename (e.g. "x.pdf"), in which case MuPDF uses the extension to determine the type, or a MIME type like application/pdf. Using the filename scheme is recommended, or even the name of the original file for documentation purposes. Plain strings like "pdf" will also work.
- stream (bytes) – A memory area representing the content of a supported document type. A bytearray is supported, too.
Return type: Document
Returns: A Document object.
-
authenticate(password)
Decrypts the document with the string password. If successful, all of the document’s data can be accessed (e.g. for rendering).
Parameters: password (str) – The password to be used.
Return type: int
Returns: True (1) if decryption with password was successful, False (0) otherwise. If successful, indicator isEncrypted is set to False.
-
loadPage(pno = 0)
Loads a Page for further processing like rendering, text searching, etc. See the Page object.
Parameters: pno (int) – page number, zero-based (0 is default and the first page of the document) and < doc.pageCount. If pno < 0, then page pno % pageCount will be loaded (in other words, pageCount will be added to pno until the result is no longer negative). For example: to load the last page, you can specify doc.loadPage(-1). After this you have page.number == doc.pageCount - 1.
Return type: Page
Note
Conveniently, pages can also be loaded via indexes over the document: doc.loadPage(n) == doc[n]. Consequently, a document can also be used as an iterator over its pages, e.g. for page in doc: ... and for page in reversed(doc): ... will yield the Pages of doc as page.
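The rule for negative page numbers can be tried out without opening a document. The following is only a sketch: resolve_page_number is a hypothetical helper, not part of PyMuPDF, and it merely mirrors the pno % pageCount arithmetic described for loadPage().

```python
def resolve_page_number(pno, page_count):
    """Hypothetical helper: map a possibly negative pno to a zero-based index."""
    if page_count < 1:
        raise ValueError("document has no pages")
    if pno >= page_count:
        raise ValueError("pno must be < page_count")
    # For a positive divisor, Python's % never returns a negative value.
    # This equals adding page_count until the result is no longer negative.
    return pno % page_count


print(resolve_page_number(-1, 10))   # 9, the last page
print(resolve_page_number(-12, 10))  # 8
```

The exception type used for out-of-range values is an assumption made for this sketch.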
-
getToC(simple = True)
Creates a table of contents out of the document’s outline chain.
Parameters: simple (bool) – Indicates whether a simple or a detailed ToC is required. If simple == False, each entry of the list also contains a dictionary with linkDest details for each outline entry.
Return type: list
Returns: a list of lists. Each entry has the form [lvl, title, page, dest]. Its entries have the following meanings:
- lvl - hierarchy level (integer). The first entry has hierarchy level 1, and entries in a row increase by at most one level.
- title - title (string)
- page - 1-based page number (integer). Page numbers < 1 either indicate a target outside this document or no target at all (see next entry).
- dest - included only if simple = False is specified. A dictionary containing details of the link destination.
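The level rule for ToC entries (the first entry has level 1, each following entry is at most one level deeper than its predecessor) can be checked with a few lines of plain Python. This is a sketch only: check_toc_levels is a hypothetical name, and the sample list is made up in the [lvl, title, page] form described above.

```python
def check_toc_levels(toc):
    """Hypothetical validator: index of the first offending entry, or -1 if none."""
    previous = 0  # forces the first entry to have level 1
    for i, entry in enumerate(toc):
        lvl = entry[0]
        if lvl < 1 or lvl > previous + 1:
            return i
        previous = lvl
    return -1


# made-up sample data for illustration
toc = [
    [1, "The PyMuPDF Documentation", 1],
    [2, "Introduction", 1],
    [3, "License", 1],
    [1, "Appendix", 40],
]
print(check_toc_levels(toc))                  # -1, all levels are consistent
print(check_toc_levels([[2, "Orphan", 1]]))   # 0, first entry must be level 1
```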
-
getPagePixmap(pno, *args, **kwargs)
Creates a pixmap from page pno (zero-based). Invokes Page.getPixmap().
Return type: Pixmap
-
getPageImageList(pno)
PDF only: Return a list of all image descriptions referenced by a page.
Parameters: pno (int) – page number, zero-based. Any value < len(doc) is acceptable.
Return type: list
Returns: a list of images shown on this page. Each entry looks like [xref, smask, width, height, bpc, colorspace, alt. colorspace, name], where xref is the image object number, smask is the object number of its soft-mask image (if present), width and height are the image dimensions, bpc denotes the number of bits per component (a typical value is 8), colorspace is a string naming the colorspace (like DeviceRGB), alt. colorspace is any alternate colorspace depending on the value of colorspace, and name is the symbolic name (str) by which the page references this particular image in its content stream. See below how this information can be used to extract PDF images as separate files. Another demonstration:
>>> doc = fitz.open("pymupdf.pdf")
>>> imglist = doc.getPageImageList(0)
>>> for img in imglist: print(img)
((241, 0, 1043, 457, 8, 'DeviceRGB', '', 'Im1'))
>>> pix = fitz.Pixmap(doc, 241)
>>> pix
fitz.Pixmap(DeviceRGB, fitz.IRect(0, 0, 1043, 457), 0)
-
getPageFontList(pno)
PDF only: Return a list of all fonts referenced by the page.
Parameters: pno (int) – page number, zero-based, any value < len(doc).
Return type: list
Returns: a list of fonts referenced by this page. Each entry looks like (xref, ext, type, basefont, name), where xref is the font object number, ext is the font file extension, type is the font type (like Type1 or TrueType etc.), basefont is the base font name and name is the reference name by which the page references it in its contents stream:
>>> doc = fitz.open("pymupdf.pdf")
>>> for f in doc.getPageFontList(85): print(f)
(344, 'pfa', 'Type1', 'HVTNTB+SFSX1000', 'F18')
(343, 'pfa', 'Type1', 'KPGUVC+SFTT1000', 'F16')
(745, 'pfa', 'Type1', 'OBIJJJ+SFRM1440', 'F38')
(470, 'pfa', 'Type1', 'AFLLUK+SFTI1000', 'F49')
(342, 'pfa', 'Type1', 'GWNVMD+SFRM1000', 'F15')
(341, 'pfa', 'Type1', 'MFMRXE+SFBX1000', 'F41')
(523, 'pfa', 'Type1', 'LDRDRB+SFIT1000', 'F74')
Note
Fonts are stored on the document level (like images). The reference name is specific for the page. Other pages may use a different name for the same font. Also note that a font may appear in this list although no text actually uses it. But conversely, every piece of text on the page will refer to exactly one of these entries. Look here for the meaning of Font File Extensions.
Note
For more background see Adobe PDF Reference 1.7 chapters 5.4 to 5.8, pp 410.
-
getPageText(pno, output = "text")
Extracts the text of a page given its page number pno (zero-based). Invokes Page.getText().
Parameters:
- pno (int) – Page number, zero-based. Any value < len(doc) is acceptable.
- output (str) – A string specifying the requested output format: text, html, json or xml. Default is text.
Return type: str
-
select(list)
PDF only: Keeps only those pages of the document whose numbers occur in the list. Empty lists or elements outside the range 0 <= page < doc.pageCount will cause a ValueError. For more details see remarks at the bottom of this chapter.
Parameters: list (sequence) – A list (or tuple) of page numbers (zero-based) to be included. Pages not in the list will be deleted (from memory) and become unavailable until the document is reopened. Page numbers can occur multiple times and in any order: the resulting document will reflect the list exactly as specified.
Return type: int
Returns: Zero upon successful execution. All document information will be updated to reflect the new state of the document, like outlines, number and sequence of pages, etc. Changes become permanent only after saving the document. Incremental save is supported.
-
setMetadata(m)
PDF only: Sets or updates the metadata of the document as specified in m, a Python dictionary. As with method select(), these changes become permanent only when you save the document. Incremental save is supported.
Parameters: m (dict) – A dictionary with the same keys as metadata (see below). All keys are optional. A PDF’s format and encryption method cannot be set or changed, these keys therefore have no effect and will be ignored. If any value should not contain data, do not specify its key or set the value to None. If you use m = {} all metadata information will be cleared to the string "none". If you want to selectively change only some values, modify a copy of doc.metadata and use it as the argument for this method.
Return type: int
Returns: Zero upon successful execution and doc.metadata will be updated.
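The advice to modify a copy of doc.metadata when only some values should change might look as follows. This is a sketch on a plain dictionary: merge_metadata is a hypothetical helper, and the sample dictionary merely stands in for doc.metadata. The result would then be handed to setMetadata().

```python
def merge_metadata(current, **changes):
    """Hypothetical helper: copy `current` and overwrite only the given keys."""
    m = dict(current)   # shallow copy, the original dictionary stays untouched
    m.update(changes)
    return m


# stand-in for doc.metadata (values are illustrative)
current = {"title": "The PyMuPDF Documentation", "author": "Jorj X. McKie", "subject": None}
m = merge_metadata(current, subject="Draft")
print(m["author"], "|", m["subject"])
```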
-
setToC(toc)
PDF only: Replaces the complete current outline tree (table of contents) with a new one. After successful execution, the new outline tree can be accessed as usual via method getToC() or via property outline. Like with other output-oriented methods, changes become permanent only via save() (incremental save supported). Internally, this method consists of the following two steps. For a demonstration see example below.
- Step 1 deletes all existing bookmarks.
- Step 2 creates a new TOC from the entries contained in toc.
Please note that currently the is_open flag is set to False. Therefore all entries other than level 1 will initially be shown collapsed in PDF readers.
Parameters: toc (sequence) – A Python nested sequence with all bookmark entries that should form the new table of contents. Each entry is a list with the following format. Output variants of method getToC() are also acceptable as input.
[lvl, title, page, dest], where
- lvl is the hierarchy level (int > 0) of the item, starting with 1 and being at most 1 higher than that of the predecessor,
- title (str) is the title to be displayed,
- page (int) is the target page number (attention: 1-based to support getToC()-output), must be in valid page range if positive. Set this to -1 if there is no target, or the target is external,
- dest (optional) is a dictionary or a number. If a number, it will be interpreted as the desired height (in points) this entry should point to on page in the current document. Use a dictionary (like the one given as output by getToC(simple = False)) if you want to store destinations that are either “named”, or reside outside this document (other files, internet resources, etc.).
Return type: int
Returns: outline and getToC() will be updated upon successful execution. The return code will either equal the number of inserted items (len(toc)) or the number of deleted items if toc is an empty sequence.
Note
We currently always set the Outline attribute is_open to False. This shows all entries below level 1 as collapsed.
-
save(outfile, garbage=0, clean=0, deflate=0, incremental=0, ascii=0, expand=0, linear=0)
PDF only: Saves the document in its current state under the name outfile. A document may have changed for a number of reasons: e.g. after a successful authenticate, a decrypted copy will be saved, and, in addition (even without optional parameters), some basic cleaning may also have occurred, e.g. broken xref tables may have been repaired and earlier incremental changes may have been resolved. If you executed any modifying methods, their results will also be reflected in the saved version.
Parameters:
- outfile (str) – The file name to save to. Must be different from the original filename if incremental=False. When saving incrementally, garbage and linear must be False / 0 and outfile must equal the original filename (for convenience use doc.name).
- garbage (int) – Do garbage collection: 0 = none, 1 = remove unused objects, 2 = in addition to 1, compact xref table, 3 = in addition to 2, merge duplicate objects, 4 = in addition to 3, check streams for duplication. Excludes incremental.
- clean (int) – Clean content streams [1]: 0 / False, 1 / True.
- deflate (int) – Deflate uncompressed streams: 0 / False, 1 / True.
- incremental (int) – Only save changed objects: 0 / False, 1 / True. Excludes garbage and linear. Cannot be used for decrypted files and for files opened in repair mode (openErrCode > 0). In these cases saving to a new file is required.
- ascii (int) – Where possible make the output ASCII: 0 / False, 1 / True.
- expand (int) – Decompress contents: 0 = none, 1 = images, 2 = fonts, 255 = all. This convenience option generates a decompressed file version that can be better read by some other programs.
- linear (int) – Save a linearised version of the document: 0 = False, 1 = True. This option creates a file format for improved performance when read via internet connections. Excludes incremental.
Return type: int
Returns: Zero upon successful execution.
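The mutual exclusions between the save() options are easy to get wrong, so a small pre-check can help. The following is a sketch only: save_option_conflicts is a hypothetical helper, not part of PyMuPDF, and it covers just the file name, garbage, incremental and linear rules stated above (not the decrypted or repair-mode cases).

```python
def save_option_conflicts(outfile, original_name, garbage=0, incremental=0, linear=0):
    """Hypothetical pre-check: list violations of the documented save() rules."""
    problems = []
    if incremental:
        if garbage:
            problems.append("incremental excludes garbage")
        if linear:
            problems.append("incremental excludes linear")
        if outfile != original_name:
            problems.append("incremental save must target the original file")
    elif outfile == original_name:
        problems.append("non-incremental save needs a new file name")
    return problems


print(save_option_conflicts("any.pdf", "any.pdf", incremental=1))             # []
print(save_option_conflicts("any.pdf", "any.pdf", garbage=4, incremental=1))  # one conflict
print(save_option_conflicts("any.pdf", "any.pdf", garbage=4))                 # one conflict
```

An empty list means the combination is consistent with the rules above. Whether the library itself rejects a given combination is not asserted here.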
-
saveIncr()
PDF only: saves the document incrementally. This is a convenience abbreviation for doc.save(doc.name, incremental = True).
Caution
A PDF may not be encrypted, but still be password protected against changes - see the permissions property. Performing incremental saves if permissions["edit"] == False can lead to unpredictable results. Save to a new file in such a case. We also consider raising an exception under this condition.
-
searchPageFor(pno, text, hit_max = 16)
Search for text on page number pno. Works exactly like the corresponding Page.searchFor(). Any integer pno < len(doc) is acceptable.
-
write(garbage=0, clean=0, deflate=0, ascii=0, expand=0, linear=0)
PDF only: Writes the current content of the document to a bytes object instead of to a file like save(). Obviously, you should be wary about memory requirements. The meanings of the parameters exactly equal those in Document.save(). The tutorial contains an example for using this method as a pre-processor to pdfrw.
Return type: bytes
Returns: a bytes object containing the complete document data.
-
insertPDF(docsrc, from_page = -1, to_page = -1, start_at = -1, rotate = -1, links = True)
PDF only: Copy the page range [from_page, to_page] (including both) of PDF document docsrc into the current one. Inserts will start with page number start_at. Negative values can be used to indicate default values. All pages thus copied will be rotated as specified. Links can be excluded in the target, see below. All page numbers are zero-based.
Parameters:
- docsrc (Document) – An opened PDF Document which must not be the current document object. However, it may refer to the same underlying file.
- from_page (int) – First page number in docsrc. Default is zero.
- to_page (int) – Last page number in docsrc to copy. Default is the last page.
- start_at (int) – First copied page will become page number start_at in the destination. If omitted, the page range will be appended to current document. If zero, the page range will be inserted before current first page.
- rotate (int) – All copied pages will be rotated by the provided value (degrees). If you do not specify a value (or -1), the original will not be changed. Otherwise it must be an integer multiple of 90 (not checked). Rotation is counter-clockwise if rotate is positive, else clockwise.
- links (bool) – Choose whether (internal and external) links should be included with the copy. Default is True. An internal link is always excluded if its destination is not one of the copied pages.
Return type: int
Returns: Zero upon successful execution.
Note
If from_page > to_page, pages will be copied in reverse order. If 0 <= from_page == to_page, then one page will be copied.
Note
docsrc bookmarks will not be copied. It is easy however, to recover a table of contents for the resulting document. Look at the examples below and at program PDFjoiner.py in the examples directory: it can join PDF documents and at the same time piece together respective parts of the tables of contents.
-
insertPage(to = -1, text = None, fontsize = 11, width = 595, height = 842, fontname = "Helvetica", fontfile = None, color = (0, 0, 0))
PDF only: Insert an empty page. Default page dimensions are those of A4 portrait paper format. Optionally, text can also be inserted - provided as a string or as a sequence.
Parameters:
- to (int) – page number (0-based) in front of which to insert. Valid specifications must be in range -1 <= pno <= len(doc). The default -1 and pno = len(doc) indicate end of document, i.e. after the last page.
- text (str or sequence) – optional text to put on the page. If given, it will start at 72 points (one inch) below top and 50 points from left. Line breaks (\n) will be honored, if it is a string. No care will be taken as to whether lines are too wide. However, text output stops when no more lines will fit on the page (discarding any remaining text). If a sequence is specified, its entries must be of type string. Each entry will be put on one line. Line breaks within an entry will be treated as any other white space. If you want to calculate the number of lines fitting on a page beforehand, use this formula: int((height - 108) / (fontsize * 1.2)). So, this method reserves one inch at the top and 1/2 inch at the bottom of the page as free space.
- fontsize (float) – font size in pixels. Default is 11. If more than one line is provided, a line spacing of fontsize * 1.2 (fontsize plus 20%) is used.
- width (float) – width in pixels. Default is 595 (A4 width). Choose 612 for Letter width.
- height (float) – page height in pixels. Default is 842 (A4 height). Choose 792 for Letter height.
- fontname (str) – name of one of the PDF Base 14 Fonts (default is “Helvetica”) if fontfile is not specified.
- fontfile (str) – file path of a font existing on the system. If this parameter is specified, specifying fontname is mandatory. If the font is new to the PDF, it will be embedded. Of the font file, index 0 is used. Be sure to choose a font that supports horizontal, left-to-right spacing.
- color (sequence) – RGB text color specified as a triple of floats in range 0 to 1. E.g. specify black (default) as (0, 0, 0), red as (1, 0, 0), some gray value as (0.5, 0.5, 0.5), etc.
Return type: int
Returns: number of text lines put on the page. Use this to check which part of your text did not fit.
Notes:
This method can be used to
- create a PDF containing only one empty page of a given dimension. The size of such a file is well below 500 bytes and hence close to the theoretical PDF minimum.
- create a protocol page of which files have been embedded, or separator pages between joined pieces of PDF Documents.
- convert textfiles to PDF like in the demo script text2pdf.py.
Further remarks:
- For now, the inserted text should restrict itself to one byte character codes.
- An easy way to create pages with a usual paper format is a statement like width, height = fitz.PaperSize("A4-L").
- To simplify color specification, we provide a Color Database. This allows you to specify color = getColor("turquoise"), without bothering about any more details.
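The line capacity formula given for the text parameter can be used to split a longer text into page-sized portions before calling insertPage(). A minimal sketch, assuming the formula int((height - 108) / (fontsize * 1.2)); the names lines_per_page and split_text are hypothetical and not part of PyMuPDF.

```python
def lines_per_page(height=842, fontsize=11):
    """Number of text lines that fit, following the documented formula."""
    return max(0, int((height - 108) / (fontsize * 1.2)))


def split_text(lines, height=842, fontsize=11):
    """Hypothetical helper: return (lines that fit on one page, remaining lines)."""
    n = lines_per_page(height, fontsize)
    return lines[:n], lines[n:]


print(lines_per_page())           # 55 lines on A4 at fontsize 11
print(lines_per_page(height=792)) # 51 lines on Letter at fontsize 11

fits, rest = split_text(["line %d" % i for i in range(60)])
print(len(fits), len(rest))       # 55 5
```

The second part of the tuple could be fed to a further page, repeating until nothing remains.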
-
newPage(to = -1, width = 595, height = 842)
PDF only: Convenience method: insert an empty page like insertPage() does. Valid parameters have the same meaning. However, no text can be inserted, instead the inserted page object is returned.
Return type: Page
Returns: the page object just inserted.
-
deletePage(pno)
PDF only: Delete a page given by its 0-based number in range 0 <= pno < len(doc).
Parameters: pno (int) – the page to be deleted.
-
deletePageRange(from_page = -1, to_page = -1)
PDF only: Delete a range of pages specified as 0-based numbers. Any negative parameter will first be replaced by len(doc) - 1. After that, condition 0 <= from_page <= to_page < len(doc) must be true. If the parameters are equal, one page will be deleted.
Parameters:
- from_page (int) – the first page to be deleted.
- to_page (int) – the last page to be deleted.
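How the two parameters are normalised can be sketched in plain Python. The helper resolve_delete_range is hypothetical, and raising ValueError for an invalid range is an assumption made for this sketch only; the text above does not say which exception the library uses.

```python
def resolve_delete_range(from_page, to_page, page_count):
    """Hypothetical helper: apply the documented defaults and range condition."""
    last = page_count - 1
    if from_page < 0:   # any negative value means "last page"
        from_page = last
    if to_page < 0:
        to_page = last
    if not 0 <= from_page <= to_page < page_count:
        raise ValueError("invalid page range")  # exception type is an assumption
    return from_page, to_page


print(resolve_delete_range(-1, -1, 10))  # (9, 9): the defaults address only the last page
print(resolve_delete_range(2, -1, 10))   # (2, 9): from page 2 through the end
```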
-
copyPage(pno, to = -1)
PDF only: Copy a page within the document.
Parameters:
- pno (int) – the page to be copied. Number must be in range 0 <= pno < len(doc).
- to (int) – the page number in front of which to insert the copy. To insert at end of document (default), specify a negative value.
-
movePage(pno, to = -1)
PDF only: Move (copy and then delete original) page to another location.
Parameters:
- pno (int) – the page to be moved. Number must be in range 0 <= pno < len(doc).
- to (int) – the page number in front of which to insert the moved page. To insert at end of document (default), specify a negative value. Must not be in (pno, pno + 1).
-
embeddedFileInfo(n)
PDF only: Retrieve information of an embedded file identified by either its number or by its name.
Parameters: n (int or str) – index or name of entry. Obviously 0 <= n < embeddedFileCount must be true if n is an integer.
Return type: dict
Returns: a dictionary with the following keys:
- name - (str) name under which this entry is stored
- file - (str) filename associated with the entry
- desc - (str) description of the entry
- size - (int) original content size
- length - (int) compressed content length
-
embeddedFileSetInfo(n, filename = filename, desc = desc)
PDF only: Change some information of an embedded file given its entry number or name. At least one of filename and desc must be specified. Response will be zero if successful, else an exception is raised.
Parameters:
- n (int or str) – index or name of entry. Obviously 0 <= n < embeddedFileCount must be true if n is an integer.
- filename (str) – sets the filename of the entry.
- desc (str) – sets the description of the entry.
-
embeddedFileGet(n)
PDF only: Retrieve the content of embedded file by its entry number or name. If the document is not a PDF, or entry cannot be found, an exception is raised.
Parameters: n (int or str) – index or name of entry. Obviously 0 <= n < embeddedFileCount must be true if n is an integer.
Return type: bytes (Python 3), str (Python 2)
-
embeddedFileDel(name)
PDF only: Remove an entry from the portfolio. As always, physical deletion of the embedded file content (and file space regain) will occur when the document is saved to a new file with garbage option. With an incremental save, the associated object will only be marked deleted.
Note
We do not support entry numbers for this function yet. If you need to e.g. delete all embedded files, scan through all embedded files by number, and use the returned dictionary’s name entry to delete each one. This function will delete the first entry with this name it finds. Be wary that for arbitrary PDF files, this may not have been the only one, because PDF itself has no mechanism to prevent duplicate entries …
Parameters: name (str) – name of entry.
-
embeddedFileAdd(stream, name, filename = filename, desc = desc)
PDF only: Add new content to the document’s portfolio.
Parameters:
- stream (bytes or bytearray or str (Python 2 only)) – contents
- name (str) – new entry identifier, must not already exist in embedded files.
- filename (str) – optional filename or None, documentation only, will be set to name if None or omitted.
- desc (str) – optional description or None, arbitrary documentation text, will be set to name if None or omitted.
Return type: int
Returns: the index given to the new entry. In the current (April 11, 2017) MuPDF version, this is not reliably true (for this reason we have decided to restrict embeddedFileDel() to entries identified by name). Use character string look up to find your entry again. For any error condition, an exception is raised.
-
close()
Release objects and space allocations associated with the document. If created from a file, also closes filename (releasing control to the OS).
-
outline
Contains the first Outline entry of the document (or None). Can be used as a starting point to walk through all outline items. Accessing this property for encrypted, not authenticated documents will raise an AttributeError.
Type: Outline
-
isClosed
False / 0 if document is still open, True / 1 otherwise. If closed, most other attributes and methods will have been deleted / disabled. In addition, Page objects referring to this document (i.e. created with Document.loadPage()) and their dependent objects will no longer be usable. For reference purposes, Document.name still exists and will contain the filename of the original document (if applicable).
Type: bool
-
isPDF
True if this is a PDF document, else False.
Type: bool
-
needsPass
Contains an indicator showing whether the document is encrypted (True (1)) or not (False (0)). This indicator remains unchanged - even after the document has been authenticated. Precludes incremental saves if set.
Type: bool
-
isEncrypted
This indicator initially equals needsPass. After successful authentication, it is set to False to reflect the situation.
Type: bool
-
permissions
Shows the permissions to access the document. Contains a dictionary like this:
>>> doc.permissions
{'print': True, 'edit': True, 'note': True, 'copy': True}
The keys have the obvious meaning of permissions to print, change, annotate and copy the document, respectively.
Type: dict
-
metadata
Contains the document’s meta data as a Python dictionary or None (if isEncrypted = True and needsPass = True). Keys are format, encryption, title, author, subject, keywords, creator, producer, creationDate, modDate. All item values are strings or None.
Except format and encryption, the key names correspond in an obvious way to the PDF keys /Creator, /Producer, /CreationDate, /ModDate, /Title, /Author, /Subject, and /Keywords respectively.
- format contains the PDF version (e.g. ‘PDF-1.6’).
- encryption either contains None (no encryption), or a string naming an encryption method (e.g. 'Standard V4 R4 128-bit RC4'). Note that an encryption method may be specified even if needsPass = False. In such cases, probably not all permissions will have been granted. Check dictionary permissions for details.
If the date fields contain valid data (which need not be the case at all!), they are strings in the PDF-specific timestamp format “D:<TS><TZ>”, where
- <TS> is the 14 character ISO timestamp YYYYMMDDhhmmss (YYYY - year, MM - month, DD - day, hh - hour, mm - minute, ss - second), and
- <TZ> is a time zone value (time interval relative to GMT) containing a sign (‘+’ or ‘-’), the hour (hh), and the minute ('mm', note the apostrophes!).
A Paraguayan value might hence look like D:20150415131602-04'00', which corresponds to the timestamp April 15, 2015, at 1:16:02 pm local time Asuncion.
Type: dict
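Since creationDate and modDate are plain strings, converting them to datetime objects is left to the caller. The following sketch parses the “D:<TS><TZ>” form described above using only the standard library. parse_pdf_date is a hypothetical helper, not part of PyMuPDF, and it deliberately accepts only the full 14-digit timestamp with an optional zone; anything else (including the string "none") yields None.

```python
import re
from datetime import datetime, timedelta, timezone

# D:<TS><TZ> with a 14-digit timestamp and an optional zone such as -04'00'
PDF_DATE = re.compile(r"^D:(\d{14})(?:([+-])(\d{2})'(\d{2})'?)?$")


def parse_pdf_date(value):
    """Hypothetical helper: return a datetime, or None if value is not a valid PDF date."""
    match = PDF_DATE.match(value or "")
    if match is None:
        return None
    stamp, sign, hours, minutes = match.groups()
    try:
        result = datetime.strptime(stamp, "%Y%m%d%H%M%S")
        if sign is not None:
            offset = timedelta(hours=int(hours), minutes=int(minutes))
            if sign == "-":
                offset = -offset
            result = result.replace(tzinfo=timezone(offset))
    except ValueError:  # e.g. month 13, or an offset of 24 hours or more
        return None
    return result


when = parse_pdf_date("D:20150415131602-04'00'")
print(when.isoformat())  # 2015-04-15T13:16:02-04:00
```

A value without a time zone part is returned as a naive datetime, so callers that compare dates should handle both cases.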
-
name
Contains the filename or filetype value with which Document was created.
Type: str
-
pageCount
Contains the number of pages of the document. May return 0 for documents with no pages. Function len(doc) will also deliver this result.
Type: int
-
openErrCode
If openErrCode > 0, errors have occurred while opening / parsing the document, which usually means document structure issues. In this case incremental save cannot be used.
Type: int
-
openErrMsg
Contains either an empty string or the last open error message if openErrCode > 0. Together with any other error messages of MuPDF’s C library, it will also appear on SYSERR.
Type: str
-
embeddedFileCount
Contains the number of files in the embedded / portfolio files list (also known as collection or attached files). If the document is not a PDF, -1 will be returned.
Type: int
-
Note
For methods that change the structure of a PDF (insertPDF(), select(), copyPage(), deletePage() and others), be aware that objects or properties in your program may have been invalidated or orphaned. Examples are Page objects and their children (links and annotations), variables holding old page counts, tables of content and the like. Remember to keep such variables up to date or delete orphaned objects.
Remarks on select()
Page numbers in the list need not be unique nor be in any particular sequence. This makes the method a versatile utility to e.g. select only the even or the odd pages, re-arrange a document from back to front, duplicate it, and so forth. In combination with text search or extraction you can also omit / include pages with no text or containing a certain text, etc.
You can execute several selections in a row. The document structure will be updated after each method execution.
Any of those changes will become permanent only with a doc.save(). If you have de-selected many pages, consider specifying the garbage option to eventually reduce the resulting document’s size (when saving to a new file).
Also note that this method preserves all links, annotations and bookmarks that are still valid. In other words: deleting pages only deletes references which point to de-selected pages. Page numbers of bookmarks (outline items) are automatically updated when a TOC is retrieved again with getToC(). If a bookmark’s destination page happened to be deleted, then its page number in getToC() will be set to -1.
The results of this method can of course also be achieved using combinations of methods copyPage(), deletePage() and movePage(). While there are many cases when these methods are more practical, select() is easier and safer to use when many pages are involved.
select() Examples
In general, any list of integers within the document’s page range can be used. Here are some illustrations.
Delete pages with no text:
import fitz
doc = fitz.open("any.pdf")
r = list(range(len(doc)))                  # list of page numbers
for page in doc:
    if not page.getText():                 # page contains no text
        r.remove(page.number)              # remove page number from list
if len(r) < len(doc):                      # did we actually delete anything?
    doc.select(r)                          # apply the list
doc.save("out.pdf", garbage = 4)           # save result to new PDF, OR
# update the original document ... *** VERY FAST! ***
doc.saveIncr()
Create a sub document with only the odd pages:
>>> import fitz
>>> doc = fitz.open("any.pdf")
>>> r = list(range(0, len(doc), 2))
>>> doc.select(r) # apply the list
>>> doc.save("oddpages.pdf", garbage = 4) # save sub-PDF of the odd pages
Concatenate a document with itself:
>>> import fitz
>>> doc = fitz.open("any.pdf")
>>> r = list(range(len(doc)))
>>> r += r # turn PDF into a copy of itself
>>> doc.select(r)
>>> doc.save("any+any.pdf") # contains doubled <any.pdf>
Create document copy in reverse page order (well, don’t try with a million pages):
>>> import fitz
>>> doc = fitz.open("any.pdf")
>>> r = list(range(len(doc) - 1, -1, -1))
>>> doc.select(r)
>>> doc.save("back-to-front.pdf")
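The page lists used in these examples can be prepared and checked without opening a document. The sketch below rebuilds three of them for an assumed page count of 6 and verifies that every entry lies within the page range; check_selection is a hypothetical helper, not a PyMuPDF function.

```python
def check_selection(pages, page_count):
    """Return pages as a list; raise ValueError on an invalid page number."""
    result = list(pages)
    for pno in result:
        # page numbers are 0-based, so the valid range is 0 .. page_count - 1
        if not isinstance(pno, int) or not 0 <= pno < page_count:
            raise ValueError("bad page number: %r" % (pno,))
    return result


page_count = 6  # stands in for len(doc)

every_other = check_selection(range(0, page_count, 2), page_count)
doubled = check_selection(list(range(page_count)) * 2, page_count)
back_to_front = check_selection(range(page_count - 1, -1, -1), page_count)

print(every_other)
print(doubled)
print(back_to_front)
```

Each of the resulting lists could then be handed to doc.select() as in the examples above.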
setMetadata() Example
Clear metadata information. If you do this out of privacy / data protection concerns, make sure you save the document as a new file with garbage > 0. Only then will the old /Info object also be physically removed from the file. In this case, you may also want to clear any XML metadata inserted by several PDF editors:
>>> import fitz
>>> doc = fitz.open("pymupdf.pdf")
>>> doc.metadata # look at what we currently have
{'producer': 'rst2pdf, reportlab', 'format': 'PDF 1.4', 'encryption': None, 'author':
'Jorj X. McKie', 'modDate': "D:20160611145816-04'00'", 'keywords': 'PDF, XPS, EPUB, CBZ',
'title': 'The PyMuPDF Documentation', 'creationDate': "D:20160611145816-04'00'",
'creator': 'sphinx', 'subject': 'PyMuPDF 1.9.1'}
>>> doc.setMetadata({}) # clear all fields
0
>>> doc.metadata # look again to show what happened
{'producer': 'none', 'format': 'PDF 1.4', 'encryption': None, 'author': 'none',
'modDate': 'none', 'keywords': 'none', 'title': 'none', 'creationDate': 'none',
'creator': 'none', 'subject': 'none'}
>>> doc._delXmlMetadata() # clear any XML metadata
0
>>> doc.save("anonymous.pdf", garbage = 4) # save anonymized doc
0
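If only some fields should be removed, a reduced dictionary can be prepared first. The following sketch assumes that fields left out of the dictionary are cleared, as the setMetadata({}) call above suggests. The helper keep_only is hypothetical, the key names are taken from the doc.metadata output above, and treating "format" and "encryption" as read-only is an assumption of this sketch.

```python
# Keys as they appear in the doc.metadata output above. Treating "format"
# and "encryption" as read-only is an assumption of this sketch.
SETTABLE_KEYS = (
    "author", "producer", "creator", "title",
    "creationDate", "modDate", "subject", "keywords",
)


def keep_only(meta, keep):
    """Build a metadata dictionary that retains only the keys named in keep."""
    return {key: meta[key] for key in keep
            if key in SETTABLE_KEYS and key in meta}


# Shortened copy of the metadata shown in the example above
current = {
    "producer": "rst2pdf, reportlab", "format": "PDF 1.4", "encryption": None,
    "author": "Jorj X. McKie", "title": "The PyMuPDF Documentation",
    "subject": "PyMuPDF 1.9.1",
}
new_meta = keep_only(current, keep=("title", "subject", "format"))
print(new_meta)
```

The result would then be passed on as doc.setMetadata(new_meta) before saving with garbage > 0.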
setToC() Example
This shows how to modify or add a table of contents. Also have a look at csv2toc.py and toc2csv.py in the examples directory:
>>> import fitz
>>> doc = fitz.open("test.pdf")
>>> toc = doc.getToC()
>>> for t in toc: print(t) # show what we have
[1, 'The PyMuPDF Documentation', 1]
[2, 'Introduction', 1]
[3, 'Note on the Name fitz', 1]
[3, 'License', 1]
>>> toc[1][1] += " modified by setToC" # modify something
>>> doc.setToC(toc) # replace outline tree
3 # number of bookmarks inserted
>>> for t in doc.getToC(): print(t) # demonstrate it worked
[1, 'The PyMuPDF Documentation', 1]
[2, 'Introduction modified by setToC', 1] # <<< this has changed
[3, 'Note on the Name fitz', 1]
[3, 'License', 1]
insertPDF() Examples
(1) Concatenate two documents including their TOCs:
>>> doc1 = fitz.open("file1.pdf") # must be a PDF
>>> doc2 = fitz.open("file2.pdf") # must be a PDF
>>> pages1 = len(doc1) # save doc1's page count
>>> toc1 = doc1.getToC(simple = False) # save TOC 1
>>> toc2 = doc2.getToC(simple = False) # save TOC 2
>>> doc1.insertPDF(doc2) # doc2 at end of doc1
>>> for t in toc2: # increase toc2 page numbers
...     t[2] += pages1 # by old len(doc1)
>>> doc1.setToC(toc1 + toc2) # now result has total TOC
Similar approaches work in more general situations. Just make sure that the hierarchy level never increases by more than one from one entry to the next. Inserting dummy bookmarks before and after toc2 segments would heal such cases. A ready-to-use GUI (wxPython) solution can be found in script PDFjoiner.py of the examples directory.
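The level rule and the dummy bookmarks can be sketched in plain Python. The functions merge_tocs and heal_levels below are hypothetical helpers, not part of PyMuPDF; they operate on simple [level, title, page] entries, and the filler title is an arbitrary choice.

```python
def merge_tocs(toc1, toc2, page_offset):
    """Append toc2 to toc1, shifting toc2 page numbers by page_offset."""
    merged = [list(entry) for entry in toc1]
    for entry in toc2:
        shifted = list(entry)       # copy, so toc2 itself stays unchanged
        shifted[2] += page_offset
        merged.append(shifted)
    return merged


def heal_levels(toc, filler="(untitled)"):
    """Insert dummy entries wherever the level rises by more than one."""
    healed = []
    prev_level = 0                  # the first entry must have level 1
    for entry in toc:
        level, page = entry[0], entry[2]
        while level > prev_level + 1:
            prev_level += 1
            healed.append([prev_level, filler, page])
        healed.append(list(entry))
        prev_level = level
    return healed


toc1 = [[1, "First document", 1]]
toc2 = [[3, "Deeply nested chapter", 2]]
total = heal_levels(merge_tocs(toc1, toc2, page_offset=5))
for t in total:
    print(t)
```

In this sample, one level 2 dummy entry pointing to page 7 is inserted between the two bookmarks, so the combined list satisfies the level rule.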
(2) More examples:
>>> # insert 5 pages of doc2, where its page 21 becomes page 15 in doc1
>>> doc1.insertPDF(doc2, from_page = 21, to_page = 25, start_at = 15)
>>> # same example, but pages are rotated and copied in reverse order
>>> doc1.insertPDF(doc2, from_page = 25, to_page = 21, start_at = 15, rotate = 90)
>>> # put copied pages in front of doc1
>>> doc1.insertPDF(doc2, from_page = 21, to_page = 25, start_at = 0)
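The page arithmetic behind these calls can be made explicit with a small model. The sketch below is not library code: copy_plan is a hypothetical function that mirrors what the comments above describe, assuming 0-based page numbers, inclusive from_page and to_page boundaries, and reverse copying when from_page is greater than to_page.

```python
def copy_plan(from_page, to_page, start_at):
    """Return (source page, target page) pairs for a range of copied pages."""
    step = 1 if from_page <= to_page else -1     # descending range: reverse order
    sources = range(from_page, to_page + step, step)
    return [(src, start_at + i) for i, src in enumerate(sources)]


forward = copy_plan(21, 25, 15)
backward = copy_plan(25, 21, 15)
in_front = copy_plan(21, 25, 0)

print(forward)
print(backward)
print(in_front)
```

In this model, source page 21 lands on page 15 in the first case and source page 25 lands there in the second, in line with the comments of the examples.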
Other Examples
Extract all page-referenced images of a PDF into separate PNG files:
for i in range(len(doc)):
    imglist = doc.getPageImageList(i)
    for img in imglist:
        xref = img[0] # xref number
        pix = fitz.Pixmap(doc, xref) # make pixmap from image
        if pix.n - pix.alpha < 4: # can be saved as PNG
            pix.writePNG("p%s-%s.png" % (i, xref))
        else: # CMYK: must convert first
            pix0 = fitz.Pixmap(fitz.csRGB, pix)
            pix0.writePNG("p%s-%s.png" % (i, xref))
            pix0 = None # free Pixmap resources
        pix = None # free Pixmap resources
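The test pix.n - pix.alpha < 4 in the loop above decides whether a conversion to RGB is needed. The following sketch spells this out for typical combinations, assuming that n counts the components per pixel including alpha and that alpha is 0 or 1. The function png_ready and the table of sample values are illustrative only.

```python
def png_ready(n, alpha):
    """True if the pixmap has fewer than 4 color components (gray or RGB)."""
    return n - alpha < 4


# (description, n, alpha): assumed component counts for common colorspaces
samples = [
    ("gray", 1, 0),
    ("gray with alpha", 2, 1),
    ("RGB", 3, 0),
    ("RGB with alpha", 4, 1),
    ("CMYK", 4, 0),
    ("CMYK with alpha", 5, 1),
]
for name, n, alpha in samples:
    action = "write directly" if png_ready(n, alpha) else "convert to RGB first"
    print("%-16s %s" % (name, action))
```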
Rotate all pages of a PDF:
>>> for page in doc: page.setRotation(90)
Footnotes
[1] Content streams describe what (e.g. text or images) appears where and how on a page. PDF uses a specialized mini language similar to PostScript to do this (pp. 985 in Adobe PDF Reference 1.7), which gets interpreted when a page is loaded.