Structural Recognition of PDF Academic Literature Based on Computer Vision
Yu Fengchang, Lu Wei
School of Information Management, Wuhan University, Wuhan 430072
Abstract Portable Document Format (PDF) documents play an important role in the publication of academic electronic literature. However, because of the technical and structural complexity of the PDF format, their content cannot be read directly by machines, which hinders research that builds on academic electronic literature. This paper therefore proposes a computer-vision-based method for the structural recognition of PDF documents. The method, supplemented by a heuristic algorithm, maps the graphic objects and text objects in the PDF files of academic documents to obtain their geometric and text attributes. It then identifies the category of each PDF object in order to determine the physical and logical structures of the document. Conventional PDF analysis methods require extensive manual feature construction and training on large-scale lexical corpora, and they cannot identify formulae or tables. The proposed method overcomes these shortcomings and successfully performs full-text extraction and structural recognition on ACL data collections.
Received: 26 September 2018