Working together: DisplayList and TextPage

PyMuPDF

Here are some instructions on how to use these classes together.

In some situations, working at the lower level of detail described here can improve performance.

Create a DisplayList

A DisplayList represents an interpreted document page. Behind the scenes, the methods for pixmap creation, text extraction and text search all use the page’s display list to perform their tasks. If a page must be rendered several times (e.g. because of changed zoom levels), or if both text search and text extraction are to be performed, overhead can be saved by creating the display list only once and reusing it for all of these tasks.

>>> dl = page.getDisplayList()              # create the display list

You can also create display lists for many pages in advance and keep them in a list, for example when the document is opened, or you can store each one when its page is visited for the first time.

Note that for everything that follows, only the display list is needed; the corresponding Page object could have been deleted.
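The idea of storing a display list when a page is first visited can be sketched as a small cache. This is only an illustration, not part of PyMuPDF: `DisplayListCache` and `fake_build` are hypothetical names, and the builder is a stand-in so that the snippet runs without a document. With a real document one would pass a callable that returns `page.getDisplayList()` for the requested page number.

```python
from typing import Callable, Dict, Generic, List, TypeVar

T = TypeVar("T")


class DisplayListCache(Generic[T]):
    """Build each page's display list on first access, then reuse it."""

    def __init__(self, build: Callable[[int], T]) -> None:
        # `build` maps a page number to a display list; with a real document
        # it would wrap page.getDisplayList() for that page (assumption).
        self._build = build
        self._cache: Dict[int, T] = {}

    def get(self, pno: int) -> T:
        if pno not in self._cache:      # first visit: interpret the page once
            self._cache[pno] = self._build(pno)
        return self._cache[pno]         # later visits: reuse the stored object


# Stand-in builder so the sketch runs on its own; it records every call.
built: List[int] = []


def fake_build(pno: int) -> str:
    built.append(pno)
    return f"display list of page {pno}"


cache = DisplayListCache(fake_build)
cache.get(0)
cache.get(0)        # served from the cache, the builder is not called again
cache.get(3)
print(built)        # [0, 3]
```

Because only the display list is kept, the cache does not need to hold on to the Page objects themselves.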

Generate Pixmap

The following creates a Pixmap from a DisplayList. Parameters are the same as for Page.getPixmap().

>>> pix = dl.getPixmap()                    # create the page's pixmap

The execution time of this statement may be 20% to 50% shorter than that of Page.getPixmap().
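The actual saving depends on the document, so it can be worth measuring it on your own pages. The following is a minimal, library-independent sketch: `best_time`, `slow` and `fast` are hypothetical names, and the two workloads are stand-ins for the page-based and display-list-based pixmap calls.

```python
import time
from typing import Callable


def best_time(func: Callable[[], object], repeats: int = 5) -> float:
    """Return the shortest wall-clock time (in seconds) of `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workloads so the sketch runs without a document. With a real
# page, one would pass the page's pixmap method and dl.getPixmap instead.
def slow() -> None:
    sum(range(200_000))


def fast() -> None:
    sum(range(1_000))


t_slow = best_time(slow)
t_fast = best_time(fast)
print(f"time saved: {100 * (1 - t_fast / t_slow):.0f}%")
```

Taking the best of several runs reduces the influence of other processes on the measurement.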

Perform Text Search

With the display list from above, we can also search for text.

For this we need to create a TextPage.

>>> tp = dl.getTextPage()                    # display list from above
>>> rlist = tp.search("needle")              # look up "needle" locations
>>> for r in rlist:                          # work with found locations:
...     pix.invertIRect(r.irect)             # e.g. invert colors in rectangle

Extract Text

With the same TextPage object from above, we can now immediately use any or all of the 5 text extraction methods.

Note

Above, we created our text page without an argument. This results in the default value of 3 = fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE; in other words, images will not be extracted (see below).

>>> txt  = tp.extractText()                  # plain text format
>>> json = tp.extractJSON()                  # json format
>>> html = tp.extractHTML()                  # HTML format
>>> xml  = tp.extractXML()                   # XML format
>>> xhtml = tp.extractXHTML()                # XHTML format

Further Performance improvements

Pixmap

As explained in the Page chapter:

If you do not need transparency, set alpha = 0 when creating pixmaps. This saves 25% of memory (for RGB, the most common case) and possibly 5% of execution time (depending on the GUI software).
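The 25% figure follows from simple arithmetic: an RGB pixel with alpha needs four bytes, without alpha only three. The sketch below assumes one byte per component and a hypothetical page rendered at 595 x 842 pixels; `pixmap_bytes` is an illustrative helper, not a PyMuPDF function.

```python
def pixmap_bytes(width: int, height: int, components: int, alpha: bool) -> int:
    """Size of the raw samples, assuming one byte per component per pixel."""
    return width * height * (components + (1 if alpha else 0))


# Hypothetical page size in pixels; RGB has 3 color components.
with_alpha = pixmap_bytes(595, 842, 3, alpha=True)
without_alpha = pixmap_bytes(595, 842, 3, alpha=False)
saving = 1 - without_alpha / with_alpha

print(with_alpha, without_alpha, f"{saving:.0%}")   # 2003960 1502970 25%
```

For a grayscale pixmap (one color component) the same calculation gives a saving of 50%.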

TextPage

If you do not need images extracted alongside the text of a page, you can set the following option:

>>> flags = fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE
>>> tp = dl.getTextPage(flags)

This will save about 25% of the overall execution time for the HTML, XHTML and JSON text extractions and greatly reduce the amount of storage (memory and disk space) if the document is graphics-oriented.
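The flags argument is a bit mask, which is why the two options combine to the value 3. The sketch below uses local stand-ins for the `fitz.TEXT_*` constants so that it runs without the library. The numeric values and the image bit `TEXT_PRESERVE_IMAGES` are assumptions here and should be checked against your installed version; `has_flag` is a hypothetical helper.

```python
# Local stand-ins for the fitz.TEXT_* constants (assumed values).
TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_PRESERVE_IMAGES = 4      # assumption: the bit that requests images


def has_flag(flags: int, bit: int) -> bool:
    """Return True if `bit` is set in the bit mask `flags`."""
    return flags & bit != 0


# Text only: the combination used in the text above.
text_only = TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE
# Text plus images: additionally set the image bit.
with_images = text_only | TEXT_PRESERVE_IMAGES

print(text_only, with_images)                          # 3 7
print(has_flag(text_only, TEXT_PRESERVE_IMAGES))       # False
```

Leaving the image bit out is what avoids the extra time and storage for graphics-heavy documents.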
