image_extractor
Pull bitmap images out of PDFs.
pypaperretriever.image_extractor.ImageExtractor
Extract figures from a PDF file.
The extractor handles both native PDFs (containing embedded images) and scanned PDFs where each page is an image.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pdf_file_path | str | Path to the PDF file to process. | required |
Attributes:
Name | Type | Description |
---|---|---|
filepath | str | Path to the PDF file. |
dir | str | Directory containing the PDF file. |
is_valid_pdf | bool | Whether the file can be opened by PyMuPDF. |
is_native_pdf | bool | Whether the PDF is native (contains embedded images) rather than scanned. |
img_paths | list[str] | Paths to extracted image files. |
img_counter | int | Counter used to name extracted images. |
id | str \| None | Optional identifier prefix for saved images. |
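The img_counter and id attributes suggest that saved files are named from a running counter with an optional prefix. The exact file name pattern is not given here, so the sketch below uses a made-up one: build_image_path, its parameters, and the "img-N" pattern are all hypothetical and only illustrate the idea.

```python
import os


def build_image_path(directory, counter, identifier=None, ext="png"):
    """Return a path such as <directory>/<identifier>_img-<counter>.<ext>.

    The naming pattern is an assumption made for illustration only; it is
    not taken from the pypaperretriever source.
    """
    stem = f"img-{counter}"
    if identifier is not None:
        # Prepend the optional identifier so files from different PDFs
        # can share one output directory without colliding.
        stem = f"{identifier}_{stem}"
    return os.path.join(directory, f"{stem}.{ext}")
```

With this pattern, a counter of 2 and an identifier of "paper42" give a file named paper42_img-2.png inside the chosen directory.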
Initialize the extractor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pdf_file_path | str | Path to the PDF file to process. | required |
Source code in pypaperretriever/image_extractor.py
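A minimal usage sketch, based only on the names documented on this page (the constructor argument, is_valid_pdf, extract_images, and img_paths). The wrapper function extract_figures is hypothetical, and the sketch assumes that is_valid_pdf is already set once the constructor returns.

```python
def extract_figures(pdf_path):
    """Run the extractor on one PDF and return the saved image paths."""
    # Imported lazily so this helper can be defined even where the
    # package is not installed.
    from pypaperretriever.image_extractor import ImageExtractor

    extractor = ImageExtractor(pdf_path)
    if not extractor.is_valid_pdf:
        # Assumption: the flag is set during construction. If PyMuPDF
        # could not open the file, there is nothing to extract.
        return []
    # extract_images() returns the same instance, so img_paths can be
    # read straight off the result.
    return extractor.extract_images().img_paths
```

Checking is_valid_pdf first keeps a broken or missing file from reaching the extraction step.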
extract_images
Extract images from the PDF.
The method determines the PDF type and delegates to the appropriate
extraction routine. Extracted image paths are stored in img_paths.
Returns:
Name | Type | Description |
---|---|---|
Self | Self | This instance with img_paths populated. |
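The delegation described for extract_images can be pictured with a small stand-in class. This is a sketch of the pattern only: ExtractorSketch is hypothetical, reads no PDF, and its two routines just record placeholder paths. The method and attribute names mirror the ones documented on this page.

```python
class ExtractorSketch:
    """Stand-in that mirrors only the dispatch step; it reads no PDF."""

    def __init__(self, is_native_pdf):
        self.is_native_pdf = is_native_pdf
        self.img_paths = []

    def extract_images(self):
        # Choose the routine from the PDF type, then hand back the same
        # instance so calls can be chained.
        if self.is_native_pdf:
            self.extract_from_native_pdf()
        else:
            self.handle_image_based_pdf()
        return self

    def extract_from_native_pdf(self):
        # Placeholder for the embedded-image path.
        self.img_paths.append("native-placeholder.png")

    def handle_image_based_pdf(self):
        # Placeholder for the scanned-page path.
        self.img_paths.append("scanned-placeholder.png")
```

Returning the instance is what makes a one-liner such as ExtractorSketch(True).extract_images().img_paths possible.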
extract_from_native_pdf
Extract figures from a native PDF using PyMuPDF.
Saves each valid image to disk and records its file path.
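As a rough illustration of pulling embedded images out of a native PDF with PyMuPDF, the sketch below walks each page, looks up every referenced image, and writes it to disk. It is not the library's own implementation: save_embedded_images, min_side, and the "img-N" file names are hypothetical, and the size threshold is only an assumed stand-in for whatever "valid" means in the real method.

```python
import os


def save_embedded_images(pdf_path, out_dir, min_side=50):
    """Save embedded images whose width and height are both >= min_side."""
    import fitz  # PyMuPDF; imported here so it is only needed when called

    os.makedirs(out_dir, exist_ok=True)
    saved = []
    seen_xrefs = set()
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for image_entry in page.get_images(full=True):
                xref = image_entry[0]  # first field is the image's xref number
                if xref in seen_xrefs:
                    continue  # the same image can be referenced on several pages
                seen_xrefs.add(xref)
                info = doc.extract_image(xref)
                if info["width"] < min_side or info["height"] < min_side:
                    continue  # assumed validity rule: skip tiny icons and rules
                path = os.path.join(out_dir, f"img-{len(saved) + 1}.{info['ext']}")
                with open(path, "wb") as handle:
                    handle.write(info["image"])
                saved.append(path)
    return saved
```

The returned list plays the role that img_paths plays on the extractor: a record of every file written.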
handle_image_based_pdf
Process a scanned PDF.
Each page is converted to an image and potential figures are extracted
using _crop_boxes_in_image.
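How _crop_boxes_in_image finds figure regions is not described on this page. To give a feel for the cropping step, the sketch below trims a grayscale page, represented as a plain list of pixel rows, down to the bounding box of its non-background pixels. The function crop_to_content, the list-of-rows representation, and the single-box behavior are all assumptions; the real method may detect several boxes and use a different technique entirely.

```python
def crop_to_content(pixels, background=255):
    """Crop a grayscale grid to the bounding box of non-background pixels.

    pixels is a list of equally long rows of integer intensities.
    Returns an empty list when the page holds only background.
    """
    # Rows and columns that contain at least one non-background pixel.
    rows = [r for r, row in enumerate(pixels) if any(v != background for v in row)]
    if not rows:
        return []
    cols = [
        c
        for c in range(len(pixels[0]))
        if any(row[c] != background for row in pixels)
    ]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    # Slice out the box, keeping both edges inclusive.
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]
```

For example, a 4x4 white page with a 2x2 dark patch in the middle is reduced to just that 2x2 patch.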