Structural Recognition of PDF Academic Literature Based on Computer Vision

doi:10.3772/j.issn.1000-0135.2019.04.006

情报学报 (Journal of the China Society for Scientific and Technical Information), 2019, Vol. 38, Issue 4: 384-390

Intelligence Analysis Methods and Techniques


Yu Fengchang (于丰畅), Lu Wei (陆伟)

School of Information Management, Wuhan University, Wuhan 430072


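Abstract: The PDF format plays an extremely important role in the publication and distribution of electronic academic literature, but its complex technical specification means that PDF files cannot be read directly by machines, which causes considerable inconvenience for research on academic literature. This paper proposes a computer-vision-based method for recognizing the structure of PDF documents. Targeting typical PDF academic papers, the method maps the visual objects in a PDF file to its text objects to obtain the geometric and textual attributes of each content object, and then applies heuristic rules to determine each content object's type, yielding the physical and logical structure of the document. In an intuitive way, the method overcomes shortcomings of other PDF parsing approaches, such as the need for extensive manual feature engineering or large-scale training corpora and the difficulty of recognizing formulas and tables. It was successfully applied to structure recognition and full-text extraction on the ACL (Association for Computational Linguistics) proceedings.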


Keywords: PDF, academic literature, computer vision, structure recognition

Received: 2018-09-26
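
The mapping step summarized in the abstract (pairing visual objects with text objects, then typing each content object with heuristics) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the visual regions have already been detected on the rendered page by some layout detector and that the text objects come from a PDF parser. The class names, the `min_overlap` threshold, and the classification rules are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Axis-aligned rectangle: (x0, y0) is top-left, (x1, y1) is bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection(self, other: "Box") -> float:
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return max(0.0, w) * max(0.0, h)


@dataclass
class TextObject:
    """A text run extracted from the PDF (hypothetical structure)."""
    box: Box
    text: str


@dataclass
class ContentObject:
    """A visual region plus the text mapped onto it."""
    box: Box
    texts: list = field(default_factory=list)


def map_text_to_regions(regions, text_objects, min_overlap=0.5):
    """Attach each text object to the region that covers most of its area."""
    contents = [ContentObject(box=r) for r in regions]
    for t in text_objects:
        t_area = t.box.area()
        best, best_ratio = None, 0.0
        for c in contents:
            ratio = c.box.intersection(t.box) / t_area if t_area else 0.0
            if ratio > best_ratio:
                best, best_ratio = c, ratio
        if best is not None and best_ratio >= min_overlap:
            best.texts.append(t.text)
    return contents


def classify(content):
    """Toy heuristics, invented for illustration only."""
    text = " ".join(content.texts).strip()
    if not text:
        return "figure"      # a region with no text mapped onto it
    if text.lower().startswith(("figure", "fig.", "table")):
        return "caption"
    if len(text.split()) <= 12 and text[0].isdigit():
        return "heading"     # short line starting with a section number
    return "paragraph"


# Hypothetical page: four detected regions and three text objects.
regions = [
    Box(50, 50, 300, 70),
    Box(50, 80, 300, 200),
    Box(320, 50, 560, 200),
    Box(320, 205, 560, 225),
]
texts = [
    TextObject(Box(52, 52, 200, 68), "1 Introduction"),
    TextObject(Box(52, 82, 298, 198), "PDF is widely used ..."),
    TextObject(Box(322, 207, 500, 223), "Figure 1: Overview"),
]
labels = [classify(c) for c in map_text_to_regions(regions, texts)]
print(labels)
```

Matching on the fraction of the text object's own area, rather than on intersection over union, keeps a small text run attached to a much larger enclosing region. A real system would need far richer rules than these to recover the logical structure.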

Cite this article:

Yu Fengchang, Lu Wei. Structural Recognition of PDF Academic Literature Based on Computer Vision[J]. 情报学报, 2019, 38(4): 384-390.

Link to this article:

https://qbxb.istic.ac.cn/CN/10.3772/j.issn.1000-0135.2019.04.006 or https://qbxb.istic.ac.cn/CN/Y2019/V38/I4/384
