Paper fragmentation information extraction method based on machine learning

An information extraction and machine learning technology, applied in the field of information extraction, that addresses problems such as poor extraction results caused by the large number and complexity of academic paper formats.

Status: Inactive. Publication date: 2018-09-14
同方知网数字出版技术股份有限公司
5 Cites, 7 Cited by

AI Technical Summary

Problems solved by technology

Although considerable progress has been made in metadata classification and extraction, each of the existing individual methods performs poorly in some cases because academic papers are complex and varied.
Most traditional studies also focus only on extracting metadata and do not describe the content structure of the paper or the information and data within the content.
Previous studies show that a single method can be highly effective at extracting metadata in some cases and very poor in others.

Examples

Embodiment Construction

[0022] To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings.

[0023] As shown in Figure 1, the paper fragmentation information extraction method based on machine learning includes the following steps:

[0024] Step 10: use XPDF to extract the text content, pictures and tables of the PDF, and save them in XML form;
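The text part of Step 10 can be illustrated with a minimal sketch. It assumes that the Xpdf command-line tool `pdftotext` is installed and on the PATH, and that a paragraph block is any run of text separated by blank lines. The `<document>` and `<para>` element names and the function names are hypothetical and are not taken from the patent; pictures and tables are not handled here.

```python
import subprocess
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def pdf_to_text(pdf_path: str) -> str:
    """Run Xpdf's pdftotext (assumed to be on PATH) and return the extracted text."""
    with tempfile.TemporaryDirectory() as tmp:
        out_file = Path(tmp) / "out.txt"
        subprocess.run(
            ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, str(out_file)],
            check=True,
        )
        return out_file.read_text(encoding="utf-8", errors="replace")


def text_to_xml(text: str, doc_id: str) -> ET.Element:
    """Wrap blank-line-separated blocks in <para> elements (hypothetical schema)."""
    root = ET.Element("document", {"id": doc_id})
    blocks = [block.strip() for block in text.split("\n\n")]
    for index, block in enumerate(b for b in blocks if b):
        para = ET.SubElement(root, "para", {"index": str(index)})
        # Collapse internal line breaks and other whitespace into single spaces.
        para.text = " ".join(block.split())
    return root


def save_xml(root: ET.Element, xml_path: str) -> None:
    """Write the element tree to disk as UTF-8 XML."""
    ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)
```

A call such as `save_xml(text_to_xml(pdf_to_text("paper.pdf"), "doc-1"), "paper.xml")` would chain the three helpers; the file names are placeholders.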

[0025] The PDF documents are first unified in the library. Files in Word, PPT, PDF and other formats are all converted to PDF, so that the PDFs in the database can then be converted to XML in a uniform way. Figure 2 shows the unified structure of the database: the unique-identifier attribute identifies each article, the title attribute holds the title of each article, and the fragmentation-task status records the recognition status of the article. This algorithm mainly...
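The three attributes described for the database (unique identifier, title, fragmentation-task status) could be modelled as follows. This is only a sketch using SQLite; the table name, column names and status values (`pending`, `done`) are assumptions made for illustration, since the source does not specify them.

```python
import sqlite3


def create_library(conn: sqlite3.Connection) -> None:
    """Create a hypothetical article table with the three described attributes."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               article_id      TEXT PRIMARY KEY,
               title           TEXT NOT NULL,
               fragment_status TEXT NOT NULL DEFAULT 'pending'
           )"""
    )


def register_article(conn: sqlite3.Connection, article_id: str, title: str) -> None:
    """Add an article; its fragmentation task starts in the 'pending' state."""
    conn.execute(
        "INSERT INTO articles (article_id, title) VALUES (?, ?)",
        (article_id, title),
    )


def mark_status(conn: sqlite3.Connection, article_id: str, status: str) -> None:
    """Update the fragmentation-task status of one article."""
    conn.execute(
        "UPDATE articles SET fragment_status = ? WHERE article_id = ?",
        (status, article_id),
    )


def pending_articles(conn: sqlite3.Connection) -> list:
    """Return the identifiers of articles that still await fragmentation."""
    rows = conn.execute(
        "SELECT article_id FROM articles "
        "WHERE fragment_status = 'pending' ORDER BY article_id"
    )
    return [row[0] for row in rows]
```

Keeping the status in the same row as the identifier lets a batch job pick up unprocessed articles with a single query.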


Abstract

The invention discloses a paper fragmentation information extraction method based on machine learning. The method comprises: extracting the text content, pictures and tables of a PDF using XPDF and saving them in XML form; analyzing the paragraph texts in the XML, computing and extracting a feature vector for each paragraph block (para), converting the feature vector of each para into a feature vector for a machine learning model, selecting suitable feature vectors according to the chosen models and their accuracy, and using the resulting paragraph feature vectors to train a support vector machine model and a random forest model; and predicting the title and structure information of a target PDF article from the feature vectors of the machine learning model and saving the result in a database in XML format. The method makes full use of the advantages of machine learning in information classification: it selects sample features to construct a training set and selects the RF (random forest) model to complete the machine-learning-based information extraction.
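As an illustration of turning a paragraph block into a feature vector, the sketch below computes six numeric features per para. The abstract does not list the actual features, so every feature here (word count, relative font size, bold flag, vertical position, numbered-heading pattern, sentence-final punctuation) and every name in the code is an assumption chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Para:
    """Hypothetical paragraph block as it might be read back from the XML."""
    text: str
    font_size: float
    bold: bool
    y_ratio: float  # vertical position on the page: 0.0 = top, 1.0 = bottom


# Matches leading section numbers such as "2 ", "2.1 " or "3.4.1. ".
HEADING_NUMBER = re.compile(r"^\d+(\.\d+)*\.?\s")


def para_features(para: Para, body_font_size: float) -> list:
    """Return an illustrative numeric feature vector for one paragraph block."""
    words = para.text.split()
    return [
        float(len(words)),                                # short blocks hint at titles
        para.font_size / body_font_size,                  # size relative to body text
        1.0 if para.bold else 0.0,                        # bold flag
        para.y_ratio,                                     # position on the page
        1.0 if HEADING_NUMBER.match(para.text) else 0.0,  # numbered heading pattern
        1.0 if para.text.rstrip().endswith((".", "。")) else 0.0,  # ends like a sentence
    ]
```

Lists of such vectors, paired with labels such as title, heading or body, are the kind of input a support vector machine or random forest classifier could be trained on; the training step itself is not shown here.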

Description

Technical field

[0001] The invention relates to the technical field of information extraction, in particular to a machine learning-based method for extracting paper fragmented information.

Background technique

[0002] With the development of the Internet and information technology, big data has become the most popular term in various fields. In the face of massive information and data resources, quickly obtaining potential and useful knowledge is an important direction of data mining today. Academic papers have strong professionalism and accuracy. The information and data in the papers can play a great role in many professional fields, and can provide underlying data support for many application technologies. Therefore, it is very meaningful to extract information and data from academic papers.

[0003] At present, academic papers at home and abroad are mostly stored in PDF format. There are two main ways to extract the content of PDF documents. One is to directly extract...


Application Information

IPC(8): G06F17/27, G06K9/62, G06N99/00
CPC: G06F40/205, G06F40/289, G06F18/2411
Inventor: 段飞虎, 吴盼盼, 冯自强, 张宏伟
Owner: 同方知网数字出版技术股份有限公司