IntelliExtract: An End-to-End Framework for Chinese Resume Information Extraction from Document Images

Authors

  • Yijing Liu

DOI:

https://doi.org/10.56028/aetr.6.1.570.2023

Keywords:

End-to-End, document information extraction, document intelligent question-answering, named entity extraction.

Abstract

 Manually extracting and organizing the information in a document is labor-intensive, time-consuming, and error-prone. To improve the efficiency and accuracy of document data processing, we develop IntelliExtract, an end-to-end framework for document information extraction. The framework covers image text detection and recognition, information extraction, and intelligent document question-answering. It employs recent models and algorithms: OCR models that convert scanned documents into machine-readable text, layout analysis algorithms that capture the spatial arrangement of document elements, and information extraction techniques that derive structured data from unstructured documents. To evaluate the framework, we conducted experiments on a Chinese Talent Resumes Dataset and visualized the results. For named entity extraction, the confidence level of the results extracted from the text in the images is generally above 0.95. The proposed framework offers enterprises, educational institutions, and other organizations a practical tool for processing document information and holds promise for significant practical applications.
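
The three stages named in the abstract (text recognition, layout analysis, information extraction) can be pictured with a minimal sketch. This is not the IntelliExtract implementation: the `OcrToken` type, the `group_into_lines` and `extract_entities` helpers, and the regular-expression patterns are all hypothetical stand-ins, and using 0.95 as a filtering threshold is an assumption borrowed from the confidence level reported above. A real system would replace the hand-written tokens with OCR model output and the patterns with a trained extraction model.

```python
import re
from dataclasses import dataclass


@dataclass
class OcrToken:
    """One recognized text fragment; a stand-in for real OCR output."""
    text: str
    x: float           # left edge of the bounding box
    y: float           # top edge of the bounding box
    confidence: float  # recognition confidence in [0, 1]


def group_into_lines(tokens, y_tolerance=5.0):
    """Toy layout analysis: cluster tokens into lines by vertical position."""
    lines = []
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        # Compare against the first token of the current line.
        if lines and abs(lines[-1][0].y - tok.y) <= y_tolerance:
            lines[-1].append(tok)
        else:
            lines.append([tok])
    # Restore left-to-right reading order within each line.
    return [sorted(line, key=lambda t: t.x) for line in lines]


# Hypothetical patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"1\d{10}"),
}


def extract_entities(lines, min_confidence=0.95):
    """Keep the first match per label among sufficiently confident tokens."""
    entities = {}
    for line in lines:
        for tok in line:
            if tok.confidence < min_confidence:
                continue  # skip low-confidence recognition results
            for label, pattern in PATTERNS.items():
                match = pattern.search(tok.text)
                if match and label not in entities:
                    entities[label] = (match.group(), tok.confidence)
    return entities


if __name__ == "__main__":
    # Hand-written tokens standing in for the output of an OCR model.
    tokens = [
        OcrToken("电话", 10, 40, 0.99),
        OcrToken("13800000000", 60, 42, 0.98),
        OcrToken("姓名", 10, 10, 0.99),
        OcrToken("张三", 60, 11, 0.97),
        OcrToken("邮箱", 10, 70, 0.99),
        OcrToken("zhangsan@example.com", 60, 71, 0.90),
    ]
    lines = group_into_lines(tokens)
    print(extract_entities(lines))
```

In this sketch the email token is dropped because its recognition confidence falls below the assumed threshold, which illustrates one way a confidence score could gate downstream extraction.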

Published

2023-07-18