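Abstract: The PDF format plays an extremely important role in the publication and distribution of electronic academic literature. Its complex technical rules, however, mean that PDF files cannot be read directly by machines, which causes considerable inconvenience for research on academic literature. This paper proposes a computer-vision-based method for recognizing the structure of PDF documents. Targeting common PDF academic papers, the method maps the visual objects in a PDF file to its text objects to obtain the geometric and textual attributes of each content object, then applies heuristic algorithms to determine each content object's type, yielding both the physical structure and the logical structure of the document. In an intuitive way, the method overcomes the shortcomings of other PDF parsing approaches, which require extensive manual feature engineering or large-scale corpus training and have difficulty recognizing formulas and tables. It was successfully applied to structure recognition and full-text extraction on the proceedings of the ACL (Association for Computational Linguistics).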
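The mapping and type-judgment steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `TextSpan`, `ContentObject`, `map_spans`, and `classify`, the 0.5 overlap threshold, the 1.2 font-size ratio, and the 120-character limit are all hypothetical choices made for illustration. The sketch assumes that visual regions have already been detected as bounding boxes and that text spans with coordinates have already been extracted from the PDF.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x0, y0, x1, y1) in page coordinates; the coordinate convention is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class TextSpan:
    """A text object taken from the PDF, with its geometry (hypothetical type)."""
    box: Box
    text: str
    font_size: float


@dataclass
class ContentObject:
    """A visual region together with the text spans mapped into it (hypothetical type)."""
    box: Box
    spans: List[TextSpan]

    @property
    def text(self) -> str:
        return " ".join(s.text for s in self.spans)


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Return the fraction of `inner`'s area that lies inside `outer`."""
    w = min(inner[2], outer[2]) - max(inner[0], outer[0])
    h = min(inner[3], outer[3]) - max(inner[1], outer[1])
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def map_spans(regions: List[Box], spans: List[TextSpan],
              min_overlap: float = 0.5) -> List[ContentObject]:
    """Assign each text span to the visual region that covers it best."""
    objects = [ContentObject(box=r, spans=[]) for r in regions]
    for span in spans:
        best = max(objects, key=lambda o: overlap_ratio(span.box, o.box), default=None)
        if best is not None and overlap_ratio(span.box, best.box) >= min_overlap:
            best.spans.append(span)
    return objects


def classify(obj: ContentObject, body_font_size: float) -> str:
    """Toy heuristic; the thresholds are illustrative assumptions, not from the paper."""
    if not obj.spans:
        return "figure"  # a region with no mapped text is treated as a figure
    mean_size = sum(s.font_size for s in obj.spans) / len(obj.spans)
    if mean_size >= 1.2 * body_font_size and len(obj.text) < 120:
        return "heading"
    return "paragraph"


# Made-up example page: three detected regions and three extracted text spans.
regions = [(50, 40, 550, 70), (50, 80, 550, 300), (50, 320, 550, 500)]
spans = [
    TextSpan((60, 45, 300, 65), "1 Introduction", 14.0),
    TextSpan((60, 90, 540, 102), "PDF is widely used", 10.0),
    TextSpan((60, 104, 540, 116), "for academic publishing.", 10.0),
]
objects = map_spans(regions, spans)
labels = [classify(o, body_font_size=10.0) for o in objects]
print(labels)
```

Each resulting object carries both geometric attributes (its box) and textual attributes (the mapped spans), which is the information the abstract says the type-judgment heuristics rely on. A real system would need further labels such as table, formula, and caption.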
Yu Fengchang, Lu Wei. Structural Recognition of PDF Academic Literature Based on Computer Vision[J]. Journal of the China Society for Scientific and Technical Information (情报学报), 2019, 38(4): 384-390.