TextPage
TextPage represents the text of a page.
| Method | Short Description |
|---|---|
| TextPage.extractText() | Extract the page’s plain text |
| TextPage.extractTEXT() | synonym of previous |
| TextPage.extractHTML() | Extract the page’s text in HTML format |
| TextPage.extractJSON() | Extract the page’s text in JSON format |
| TextPage.extractXHTML() | Extract the page’s text in XHTML format |
| TextPage.extractXML() | Extract the page’s text in XML format |
| TextPage.search() | Search for a string in the page |
Class API
-
class TextPage
-
extractText()
-
extractTEXT() Extract the text from a TextPage object. Returns a string of the page’s complete text. The text is UTF-8 encoded Unicode and appears in the same sequence as the PDF creator specified it. If this looks awkward for your document, consider using a program that re-arranges the text according to a more familiar layout, e.g. PDF2TextJS.py in the examples directory. Alternatively, use another extraction method that also provides text position information, such as TextPage.extractHTML(), TextPage.extractXML(), or Page.extractTextList().
Return type: str
-
extractHTML() Extract all text and images in HTML format. This version contains complete formatting and positioning information at line level. Images are included as base64 strings. You need an HTML package to interpret the output. Also see Controlling Quality of HTML Output.
Return type: str
-
extractJSON() Extract all text in JSON format. Provides the same level of detail as HTML (including images). You need a JSON module to interpret the output. The de-serialized result consists of nested Python dictionaries and lists. See below for the structure.
Return type: str
-
extractXHTML() Extract all text in XHTML format. Text information detail is comparable with extractTEXT, but also contains images. This method makes no attempt to re-create the original visual appearance.
Return type: str
-
extractXML() Extract all text in XML format. This contains complete formatting information about every single character on the page: font, size, line, paragraph, location, etc. Contains no images.
Return type: str
-
search(string, hit_max = 16) Search for string.
Parameters:
- string (str) – The string to search for.
- hit_max (int) – Maximum number of expected hits (default 16).
Return type: list
Returns: a list of Rect objects (without transformation), each surrounding a found string occurrence.
Note
All of the above can be achieved by using the appropriate Page.getText() and Page.searchFor() methods. Also see further down and in the Page chapter for examples of how to create a valid output file by adding the respective headers and trailers.
Structure of TextPage.extractJSON()
A text page in JSON format is a nested object consisting of dictionaries and lists.
Page Dictionary
| Key | Value |
|---|---|
| width | page width in pixels (float) |
| height | page height in pixels (float) |
| blocks | list of blocks (list) |
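Because the method returns a string, the result has to be de-serialized before these keys can be read. The following is a minimal sketch using the standard json module. The variable sample_json is a hand-written stand-in for the string returned by TextPage.extractJSON(), and its values are invented for illustration.

```python
import json

# Hand-written stand-in for the string returned by TextPage.extractJSON().
# The numbers are invented and do not come from a real document.
sample_json = '{"width": 595.0, "height": 842.0, "blocks": []}'

# De-serialize into nested Python dictionaries and lists.
page_dict = json.loads(sample_json)

width = page_dict["width"]
height = page_dict["height"]
blocks = page_dict["blocks"]
print(f"page size: {width} x {height}, {len(blocks)} block(s)")
```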
Block Dictionaries
Blocks come in two types with a different structure: image blocks and text blocks.
Image block:
| Key | Value |
|---|---|
| type | 1 = image (int) |
| bbox | block / image rectangle, formatted as list(fitz.Rect) |
| imgtype | image type (int), see list below |
| width | original image width (float) |
| height | original image height (float) |
| image | image content (base64 str), may be None |
Image type values:
- 0 (unknown): image type could not be determined and is provided as PNG if possible
- 1 (raw): uncompressed samples
- 2 (FAX)
- 3 (flate)
- 4 (LZW)
- 5 (RLD)
- 6 (BMP)
- 7 (GIF)
- 8 (JPEG)
- 9 (JPX)
- 10 (JXR)
- 11 (PNG)
- 12 (PNM)
- 13 (TIFF)
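As an illustration of how an image block might be consumed, the sketch below maps the imgtype code to the names listed above and decodes the base64 payload. The helper decode_image_block and the sample block are hypothetical and not part of the library. The payload is simply the base64 encoding of the bytes b"hello", not a real image.

```python
import base64

# Image type codes as listed above.
IMAGE_TYPES = {
    0: "unknown", 1: "raw", 2: "FAX", 3: "flate", 4: "LZW", 5: "RLD",
    6: "BMP", 7: "GIF", 8: "JPEG", 9: "JPX", 10: "JXR", 11: "PNG",
    12: "PNM", 13: "TIFF",
}


def decode_image_block(block):
    """Return (type name, raw bytes or None) for an image block dictionary.

    Hypothetical helper, written only for this example.
    """
    if block["type"] != 1:
        raise ValueError("not an image block")
    name = IMAGE_TYPES.get(block["imgtype"], "unknown")
    data = block["image"]
    # The image entry may be None, in which case there is nothing to decode.
    raw = base64.b64decode(data) if data is not None else None
    return name, raw


# Invented sample block following the key layout of the table above.
sample_block = {
    "type": 1,
    "bbox": [0.0, 0.0, 10.0, 10.0],
    "imgtype": 8,
    "width": 10.0,
    "height": 10.0,
    "image": base64.b64encode(b"hello").decode("ascii"),
}

name, raw = decode_image_block(sample_block)
```

Checking for None before decoding matters because the image entry is documented as optional.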
Text block:
| Key | Value |
|---|---|
| type | 0 = text (int) |
| bbox | block rectangle, formatted as list(fitz.Rect) |
| lines | list of text lines (list) |
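Since the two block types have different keys, code that walks a page usually needs to branch on type first. A minimal sketch, assuming a page dictionary that has already been de-serialized. The helper split_blocks and the sample data are hypothetical.

```python
def split_blocks(page_dict):
    """Separate a page dictionary's blocks into (text blocks, image blocks).

    Hypothetical helper, written only for this example.
    """
    text_blocks = [b for b in page_dict["blocks"] if b["type"] == 0]
    image_blocks = [b for b in page_dict["blocks"] if b["type"] == 1]
    return text_blocks, image_blocks


# Invented sample page with one text block and one image block.
sample_page = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {"type": 0, "bbox": [72.0, 72.0, 300.0, 100.0], "lines": []},
        {"type": 1, "bbox": [72.0, 120.0, 172.0, 220.0], "imgtype": 11,
         "width": 100.0, "height": 100.0, "image": None},
    ],
}

text_blocks, image_blocks = split_blocks(sample_page)
print(len(text_blocks), "text block(s),", len(image_blocks), "image block(s)")
```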
Line Dictionary
| Key | Value |
|---|---|
| bbox | line rectangle, formatted as list(fitz.Rect) |
| wmode | writing mode (int): 0 = horizontal, 1 = vertical |
| dir | writing direction (tuple of floats): [x, y] |
| spans | list of spans (list) |
The entries of writing direction dir should be interpreted as follows:
- x: positive = “left-right”, negative = “right-left”, 0 = neither
- y: positive = “top-bottom”, negative = “bottom-top”, 0 = neither
The values indicate the “relative writing speed” in each direction, such that x² + y² = 1. In other words dir = [cos(beta), sin(beta)] where beta is the writing angle relative to the horizontal.
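The relation between dir and the writing angle can be inverted with the two-argument arctangent. The sketch below is an illustration only. The helper writing_angle is hypothetical, and the angle is expressed in the same orientation as dir, where positive y means top-to-bottom.

```python
import math


def writing_angle(direction):
    """Return the writing angle beta in degrees for dir = [cos(beta), sin(beta)].

    Hypothetical helper, written only for this example.
    """
    x, y = direction
    return math.degrees(math.atan2(y, x))


horizontal = writing_angle([1.0, 0.0])   # left-to-right text
downward = writing_angle([0.0, 1.0])     # top-to-bottom text
reversed_ = writing_angle([-1.0, 0.0])   # right-to-left text
```

Using atan2 rather than a plain arctangent of y / x avoids a division by zero for vertical text and keeps the sign information of both components.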
Span Dictionary
Spans contain the actual text. In contrast to MuPDF versions up to 1.11, a span no longer includes positioning information. Therefore, to reconstruct the text of a line, the text pieces of its spans must be concatenated. A span now contains font information. A line contains more than one span only when the font or one of its attributes changes.
| Key | Value |
|---|---|
| font | name of font (str) |
| size | font size (float) |
| flags | font characteristics (int) |
| text | text (str) |
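Rebuilding the text of a line therefore comes down to joining the text entries of its spans in order. A minimal sketch; the helper line_text and the sample line, including its font names, are invented for illustration.

```python
def line_text(line):
    """Concatenate the text of all spans in a line dictionary.

    Hypothetical helper, written only for this example.
    """
    return "".join(span["text"] for span in line["spans"])


# Invented sample line: the font changes mid-line, so there are two spans.
sample_line = {
    "bbox": [72.0, 72.0, 200.0, 84.0],
    "wmode": 0,
    "dir": [1.0, 0.0],
    "spans": [
        {"font": "Helvetica", "size": 11.0, "flags": 0, "text": "Hello, "},
        {"font": "Helvetica-Bold", "size": 11.0, "flags": 16, "text": "world"},
    ],
}

text = line_text(sample_line)
```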
flags is an integer whose bits act as booleans describing the font:
- bit 0: superscripted text
- bit 1: italic
- bit 2: serifed
- bit 3: monospaced
- bit 4: bold
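The individual characteristics can be read from flags with bit masks. The sketch below follows the bit assignments listed above. The helper decode_flags and the label strings are hypothetical.

```python
# Labels indexed by bit position, following the list above.
FLAG_NAMES = ["superscript", "italic", "serifed", "monospaced", "bold"]


def decode_flags(flags):
    """Return the names of all font characteristics set in a flags integer.

    Hypothetical helper, written only for this example.
    """
    return [name for bit, name in enumerate(FLAG_NAMES) if flags & (1 << bit)]


# 18 is binary 10010, i.e. bit 1 (italic) and bit 4 (bold) are set.
styles = decode_flags(18)
```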
Full Document Output in JSON Format
Converting a whole document to JSON format requires some care, because the per-page outputs must be combined into one structure. Use the following schema to create a valid (i.e. de-serializable) JSON document:
>>> import fitz
>>> doc = fitz.open(...) # may be any supported document type
>>> jsonfile = open("document.json", "w")
>>> pno = 0
>>> jsonfile.write(fitz.ConversionHeader("json", filename = doc.name))
>>> for page in doc:
...     if pno > 0:
...         jsonfile.write(",\n") # comma needed between pages!
...     jsonfile.write(page.getText("json"))
...     pno += 1
...
>>> jsonfile.write(fitz.ConversionTrailer("json"))
>>> jsonfile.close()
The document level dictionary then looks like so:
| Key | Value |
|---|---|
| document | specified filename (str) |
| pages | list of pages (list) |
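To make the resulting shape concrete, the sketch below assembles a dictionary with these two keys from per-page JSON strings using only the standard json module. It is an alternative illustration, not a reproduction of what fitz.ConversionHeader() and fitz.ConversionTrailer() emit. The helper build_document and the sample page strings are hypothetical.

```python
import json


def build_document(filename, page_json_strings):
    """Combine per-page JSON strings into one document-level JSON string.

    Hypothetical helper, written only for this example.
    """
    pages = [json.loads(s) for s in page_json_strings]
    return json.dumps({"document": filename, "pages": pages})


# Invented stand-ins for the per-page JSON strings of a two-page document.
sample_pages = [
    '{"width": 595.0, "height": 842.0, "blocks": []}',
    '{"width": 595.0, "height": 842.0, "blocks": []}',
]

document_json = build_document("example.pdf", sample_pages)

# Reading the document back yields the two documented keys.
document = json.loads(document_json)
print(document["document"], len(document["pages"]))
```

Parsing each page before re-serializing costs some time, but it means an invalid page string is detected immediately instead of producing a file that fails to load later.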