Working together: DisplayList and TextPage
This section shows how to use these two classes together.
Working at the level of detail described here can improve performance in some situations.
Create a DisplayList
A DisplayList represents an interpreted document page. Behind the scenes, the methods for pixmap creation, text extraction and text search all use the page’s display list to perform their tasks. If a page must be rendered several times (e.g. because of changed zoom levels), or if both text search and text extraction should be performed, overhead can be saved by creating the display list only once and then using it for all other tasks.
>>> dl = page.getDisplayList() # create the display list
You can also create display lists for many pages up front and keep them in a list, perhaps when the document is opened, or you can store each one when its page is visited for the first time.
Note that everything that follows needs only the display list; the corresponding Page object could already have been deleted.
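The idea of storing a display list when its page is first visited can be sketched as a small cache. This is only an illustration: the class name DisplayListCache is hypothetical and not part of PyMuPDF, and the sketch assumes a document object that can be indexed by page number (doc[pno]) and pages that offer getDisplayList() as shown above.

```python
class DisplayListCache:
    """Hypothetical helper (not part of PyMuPDF): build each page's
    display list on first access and reuse it afterwards."""

    def __init__(self, doc):
        self.doc = doc        # assumed: doc[pno] returns a Page
        self._lists = {}      # page number -> DisplayList

    def get(self, pno):
        """Return the display list of page pno, creating it only once."""
        if pno not in self._lists:
            page = self.doc[pno]                      # load the page once
            self._lists[pno] = page.getDisplayList()  # interpret it once
        return self._lists[pno]
```

With an open document doc, cache = DisplayListCache(doc) followed by dl = cache.get(0) would return the same display list object on every later call for page 0.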
The following creates a Pixmap from a DisplayList. Parameters are the same as for
>>> pix = dl.getPixmap() # create the page's pixmap
The execution time of this statement may be 20% up to 50% shorter than that of
Perform Text Search
With the display list from above, we can also search for text.
For this we need to create a TextPage.
>>> tp = dl.getTextPage()             # display list from above
>>> rlist = tp.search("needle")       # look up "needle" locations
>>> for r in rlist:                   # work with found locations:
...     pix.invertIRect(r.irect)      # e.g. invert colors in rectangle
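Because the TextPage does not depend on the search term, one TextPage can serve several searches. A minimal sketch, assuming only that tp.search(needle) returns a list of hit rectangles as in the loop above; the helper name find_all is hypothetical.

```python
def find_all(tp, needles):
    """Search one TextPage for several terms.

    Returns a dict mapping each needle to the list of hits that
    tp.search() reported for it (an empty list if nothing was found).
    """
    hits = {}
    for needle in needles:
        hits[needle] = list(tp.search(needle) or [])
    return hits
```

Called as find_all(tp, ["needle", "haystack"]), this reuses the text page created once from the display list instead of building a new one per term.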
With the same TextPage object from above, we can now immediately use any or all of the 5 text extraction methods.
Above, we created our text page without an argument. This leads to a default value of 3 = fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE; in other words, images will not be extracted (see below).
>>> txt = tp.extractText()            # plain text format
>>> json = tp.extractJSON()           # json format
>>> html = tp.extractHTML()           # HTML format
>>> xml = tp.extractXML()             # XML format
>>> xhtml = tp.extractXHTML()         # XHTML format
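The default value of 3 mentioned above is a bitwise OR of individual flags. The sketch below uses local stand-in constants so that it runs without PyMuPDF. The numeric values 1 and 2 are assumptions chosen to agree with the 3 quoted above, and the image flag with value 4 is likewise an assumption; check them against the fitz.TEXT_* constants of your installed version.

```python
# Stand-ins for the fitz.TEXT_* constants; the values are assumed.
TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_PRESERVE_IMAGES = 4

default_flags = TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE
with_images = default_flags | TEXT_PRESERVE_IMAGES


def wants_images(flags):
    """True if the image bit is set in a flags value."""
    return bool(flags & TEXT_PRESERVE_IMAGES)


print(default_flags, wants_images(default_flags))  # 3 False
print(with_images, wants_images(with_images))      # 7 True
```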
Further Performance improvements
As explained in the Page chapter:
If you do not need transparency, set alpha = 0 when creating pixmaps. This will save 25% memory (if RGB, the most common case) and possibly 5% execution time (depending on the GUI software).
If you do not need images extracted alongside the text of a page, you can set the following option:
>>> flags = fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE
>>> tp = dl.getTextPage(flags)
This will save about 25% of the overall execution time for the HTML, XHTML and JSON text extractions and greatly reduce the amount of storage (memory and disk space) if the document is graphics-oriented.
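The 25% memory figure given for alpha = 0 follows from the bytes stored per pixel: an RGB pixmap with alpha holds four bytes per pixel, and one without alpha holds three. A small arithmetic sketch; the page size and the helper name pixmap_bytes are illustrative only, and the calculation assumes one byte per component.

```python
def pixmap_bytes(width, height, colors=3, alpha=True):
    """Approximate size of the raw sample buffer: one byte per color
    component per pixel, plus one byte per pixel if alpha is stored."""
    return width * height * (colors + (1 if alpha else 0))


w, h = 595, 842                      # illustrative: A4 page at 72 dpi
with_alpha = pixmap_bytes(w, h, alpha=True)
without_alpha = pixmap_bytes(w, h, alpha=False)
saving = 1 - without_alpha / with_alpha
print(f"{saving:.0%}")               # 25%
```

For a grayscale pixmap (colors=1) the same calculation gives a saving of 50%, which is why the 25% figure is qualified with "if RGB".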