Paper fragmentation information extraction method based on machine learning

An information extraction and machine learning technology, applied in the field of information extraction, that addresses the poor extraction results caused by the large number and complexity of academic paper formats.

Status: Inactive; Publication Date: 2018-09-14

同方知网数字出版技术股份有限公司

Cites: 5; Cited by: 7

Problems solved by technology

Metadata classification and extraction have advanced considerably, but because academic papers are complex and varied in format, each existing method gives unsatisfactory results in some cases when used on its own.

Most traditional studies also focus only on metadata extraction and do not adequately describe the content structure of a paper or the information and data within that content.

Previous studies show that a single method is sometimes highly effective at extracting metadata and at other times performs very poorly.

Embodiment Construction

[0022] To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and accompanying drawings.

[0023] As shown in Figure 1, the machine-learning-based paper fragmentation information extraction method includes the following steps:

[0024] Step 10: use XPDF to extract the text content, pictures, and tables of the PDF, and save them in XML form;
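The text part of Step 10 might be sketched as follows. This is only an illustration under stated assumptions: it calls the `pdftotext` command-line tool that ships with Xpdf, treats blank lines as paragraph boundaries, and writes a hypothetical XML layout (`article` and `para` elements) that the patent text does not specify. Picture and table extraction are not covered.

```python
import subprocess
import xml.etree.ElementTree as ET


def pdf_to_text(pdf_path: str) -> str:
    """Run Xpdf's pdftotext and return the extracted text.

    Requires the Xpdf command-line tools on PATH; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def text_to_xml(text: str, article_id: str) -> ET.Element:
    """Wrap blank-line-separated blocks in <para> elements.

    The element and attribute names are assumptions made for this sketch.
    """
    root = ET.Element("article", {"id": article_id})
    blocks = [block.strip() for block in text.split("\n\n")]
    for index, block in enumerate(b for b in blocks if b):
        para = ET.SubElement(root, "para", {"index": str(index)})
        # Collapse line breaks inside a block into single spaces.
        para.text = " ".join(block.split())
    return root


# Demonstration on an in-memory string, so no PDF file is needed here.
sample = "A Study of X\n\nAbstract: we study X.\n\n1 Introduction\nX matters."
root = text_to_xml(sample, "demo-001")
xml_string = ET.tostring(root, encoding="unicode")
```

For a real file, `text_to_xml(pdf_to_text("paper.pdf"), "some-id")` would chain the two helpers; the file name and identifier are placeholders.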

[0025] Unify the PDF documents entering the library. Files in Word, PPT, PDF, and other formats are first converted to PDF, so that the PDFs in the database can then be converted to XML in a uniform way. Figure 2 shows the unified structure of the database: the unique identifier attribute identifies each article, the title is the title of each article, and the fragmentation task status records the status of identifying the article. This algorithm mainly...
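The library table described in [0025] could look roughly like the sketch below. It uses SQLite purely for illustration; the table name, column names, and the status values `pending` and `done` are assumptions, since the source only names the three attributes (unique identifier, title, fragmentation task status).

```python
import sqlite3


def create_library(conn: sqlite3.Connection) -> None:
    """Create a table with the three attributes named in the description."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS article (
            article_id      TEXT PRIMARY KEY,                 -- unique identifier
            title           TEXT NOT NULL,                    -- article title
            fragment_status TEXT NOT NULL DEFAULT 'pending'   -- task status (assumed values)
        )
        """
    )


def register_article(conn: sqlite3.Connection, article_id: str, title: str) -> None:
    """Add an article that still has to be fragmented."""
    conn.execute(
        "INSERT INTO article (article_id, title) VALUES (?, ?)",
        (article_id, title),
    )


def mark_done(conn: sqlite3.Connection, article_id: str) -> None:
    """Record that fragmentation finished for one article."""
    conn.execute(
        "UPDATE article SET fragment_status = 'done' WHERE article_id = ?",
        (article_id,),
    )


def pending_ids(conn: sqlite3.Connection) -> list:
    """Return identifiers of articles that are still waiting."""
    rows = conn.execute(
        "SELECT article_id FROM article "
        "WHERE fragment_status = 'pending' ORDER BY article_id"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_library(conn)
register_article(conn, "a-001", "First paper")
register_article(conn, "a-002", "Second paper")
mark_done(conn, "a-001")
```

Keeping the status in the same row as the identifier lets a batch job pick up unfinished articles with a single query, which is one plausible reason for storing it there.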

Abstract

The invention discloses a paper fragmentation information extraction method based on machine learning. The method comprises: extracting the text content, pictures, and tables of a PDF using XPDF and saving them in XML form; analyzing the paragraph text in the XML, and calculating and extracting a feature vector for each paragraph block (para); converting the feature vector of each para into a feature vector for the machine learning model, selecting reasonable feature vectors by analyzing model choice and accuracy, and using the resulting paragraph feature vectors to train a support vector machine model and a random forest model; and predicting the title and structure information of a target PDF article from the feature vectors of the machine learning model, and saving the result to a database in XML format. The method makes full use of the advantages of machine learning in information classification, selects sample features to construct a training set, and selects the random forest (RF) model to complete the machine-learning-based information extraction.
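As a rough illustration of the per-paragraph feature step, the sketch below turns each `para` element into a numeric vector. The source does not list the actual features, so the five used here (character count, word count, leading section number, trailing period, relative position) are assumptions chosen only to show the shape of the data; the `article`/`para` XML layout is likewise hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical feature set; the patent text shown here does not enumerate its features.
FEATURE_NAMES = [
    "char_count",
    "word_count",
    "starts_with_number",
    "ends_with_period",
    "relative_position",
]


def para_features(text: str, index: int, total: int) -> list:
    """Build one feature vector for a paragraph block."""
    words = text.split()
    return [
        float(len(text)),
        float(len(words)),
        # Headings such as "1 Introduction" or "2.1 Method" start with a number.
        1.0 if re.match(r"\d+(\.\d+)*\s", text) else 0.0,
        1.0 if text.rstrip().endswith(".") else 0.0,
        # 0.0 for the first paragraph, 1.0 for the last one.
        index / max(total - 1, 1),
    ]


def article_features(root: ET.Element) -> list:
    """Return one feature vector per <para> element, in document order."""
    paras = [p.text or "" for p in root.iter("para")]
    return [para_features(text, i, len(paras)) for i, text in enumerate(paras)]


doc = ET.fromstring(
    "<article>"
    "<para>1 Introduction</para>"
    "<para>Academic papers are mostly stored as PDF.</para>"
    "</article>"
)
vectors = article_features(doc)
```

Rows in this form, paired with labels such as title, heading, or body, are the kind of input that classifiers like scikit-learn's `SVC` and `RandomForestClassifier` accept through `fit` and `predict`; comparing their held-out accuracy would be one way to carry out the model selection the abstract mentions.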

Description

Technical field

[0001] The invention relates to the technical field of information extraction, and in particular to a machine-learning-based method for extracting fragmented information from papers.

Background

[0002] With the development of the Internet and information technology, big data has become the most popular term in various fields. Faced with massive information and data resources, quickly obtaining latent, useful knowledge is an important direction in data mining today. Academic papers are highly specialized and accurate. The information and data in them can play a large role in many professional fields and provide underlying data support for many application technologies. Extracting information and data from academic papers is therefore very valuable.

[0003] At present, academic papers at home and abroad are mostly stored in PDF format. There are two main ways to extract the content of PDF documents. One is to directly extract...

Application Information

IPC(8): G06F17/27, G06K9/62, G06N99/00

CPC: G06F40/205, G06F40/289, G06F18/2411

Inventors: 段飞虎, 吴盼盼, 冯自强, 张宏伟

Owner: 同方知网数字出版技术股份有限公司
