
PicAxe: Extracting Figures from Structurally and Syntactically Heterogeneous Corpora of PDF Files
References
- Lee B, Seo MK, Kim D, Shin I, Schich M, Jeong H, Han SK. Dissecting Landscape Art History with Information Theory. Proceedings of the National Academy of Sciences. 2020;117(43):26580–26590. DOI: 10.1073/pnas.2011927117
- Miton H, Sperber D, Hernik M. A Forward Bias in Human Profile-Oriented Portraits. Cognitive Science. 2020;44(6):e12866. DOI: 10.1111/cogs.12866
- Moreira D, Cardenuto JP, Shao R, Baireddy S, Cozzolino D, Gragnaniello D, Abd-Almageed W, Bestagini P, Tubaro S, Rocha A, Scheirer W, Verdoliva L, Delp E. SILA: A System for Scientific Image Analysis. Scientific Reports. 2022;12(1):18306. DOI: 10.1038/s41598-022-21535-3
- Chen Y, Sherren K, Smit M, Lee KY. Using Social Media Images as Data in Social Science Research. New Media & Society. 2023;25(4):849–871. DOI: 10.1177/14614448211038761
- Soh LK, Lorang L, Pack C, Liu Y. Applying Image Analysis and Machine Learning to Historical Newspaper Collections. The American Historical Review. 2023;128(3):1382–1389. DOI: 10.1093/ahr/rhad369
- Valente J, Antonio J, Mora C, Jardim S. Developments in Image Processing Using Deep Learning and Reinforcement Learning. Journal of Imaging. 2023;9(10):207. DOI: 10.3390/jimaging9100207
- Artifex. PyMuPDF; 2024. URL: https://github.com/pymupdf/PyMuPDF
- Binmakhashen GM, Mahmoud SA. Document Layout Analysis: A Comprehensive Survey. ACM Computing Surveys. 2019;52(6):109. DOI: 10.1145/3355610
- Yu F, Huang J, Luo Z, Zhang L, Lu W. An Effective Method for Figures and Tables Detection in Academic Literature. Information Processing & Management. 2023;60(3). DOI: 10.1016/j.ipm.2023.103286
- Subramani N, Matton A, Greaves M, Lam A. A Survey of Deep Learning Approaches for OCR and Document Understanding. ArXiv; 2020. DOI: 10.48550/arXiv.2011.13534
- Ultralytics. YOLOv8; 2024. URL: https://github.com/ultralytics/ultralytics/blob/main/docs/en/models/yolov8.md
- Community P. PaddleOCR; 2024. URL: https://github.com/PaddlePaddle/PaddleOCR
- Guerrero AC, Kamath K, Zhou Q, Felalaga B, Dinner AR. PicAxe; 2025. DOI: 10.5281/zenodo.14873182
- Damerow J, Peirson BRE, Laubichler MD. The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents. Journal of Open Research Software. 2017;5(1):26. DOI: 10.5334/jors.164
- Otsu N. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics. 1979;9(1):62–66. DOI: 10.1109/TSMC.1979.4310076
- Vincent O, Folorunso O. A Descriptive Algorithm for Sobel Image Edge Detection. Informing Science + IT Education Conference; 2009. DOI: 10.28945/3351
- Kahu SY, Ingram WA, Fox EA, Wu J. ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. ACM/IEEE Joint Conference on Digital Libraries; 2021. pp. 180–191. DOI: 10.1109/JCDL52503.2021.00030
- Peter. Find the Images Dataset; 2022. URL: https://universe.roboflow.com/peter-j1jzx/findthe-images
- Da C, Luo C, Zheng Q, Yao C. Vision Grid Transformer for Document Layout Analysis. ArXiv; 2023. DOI: 10.1109/ICCV51070.2023.01783
- Rezanezhad V, Baierer K, Gerber M, Labusch K, Neudecker C. Document Layout Analysis with Deep Learning and Heuristics. Proceedings of the 7th International Workshop on Historical Document Imaging and Processing. 2023;73–78. DOI: 10.1145/3604951.3605513
- Taraday V, Baskin C. Enhanced Meta Label Correction for Coping with Label Corruption. ArXiv; 2023. DOI: 10.1109/ICCV51070.2023.01493
- Hudson L. pyzbar; 2022. URL: https://github.com/NaturalHistoryMuseum/pyzbar
- Belval E. pdf2image; 2024. URL: https://github.com/Belval/pdf2image
- Shen Z, Zhang R, Dell M, Lee BCG, Carlson J, Li W. LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. Document Analysis and Recognition – ICDAR. 2021;12821:131–146. DOI: 10.1007/978-3-030-86549-8_9
- Hoffstaetter S, Lee M. pytesseract; 2024. URL: https://github.com/h/pytesseract
- Khow ZJ, Tan YF, Karim HA, Rashid HAA. Improved YOLOv8 Model for a Comprehensive Approach to Object Detection and Distance Estimation. IEEE Access. 2024;12:63754–63767. DOI: 10.1109/ACCESS.2024.3396224
- Groleau A, Chee KW, Larson S, Maini S, Boarman J. ShabbyPages: A Reproducible Document Denoising and Binarization Dataset. ArXiv; 2023. DOI: 10.48550/arXiv.2303.09339
DOI: https://doi.org/10.5334/jors.574 | Journal eISSN: 2049-9647
Language: English
Submitted on: Apr 28, 2025
Accepted on: Dec 1, 2025
Published on: Dec 16, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year
© 2025 Anna C. Guerrero, Krishna Kamath, Qilin Zhou, Bruno Felalaga, Julia Damerow, Aaron R. Dinner, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.