Table extraction method based on machine learning

A machine-learning-based table technique, applied in the field of data processing, that achieves accurate and complete extraction.

Pending Publication Date: 2020-07-10
苏州机数芯微科技有限公司

AI Technical Summary

Problems solved by technology

Tables in the literature are designed for human readability, so computer recognition of tables is a major challenge.

Examples

Embodiment 1

[0043] In this embodiment, the automatic extraction tool is chemdataextractor. To help chemdataextractor recognize the table more accurately, the preprocessing of the original XML file in step S1 specifically includes the following steps; for details, see Figure 2.

[0044] S11. Add marker text at the beginning of the XML table. The marker text prevents the table content from being ignored, assists later table recognition, and makes it easier for chemdataextractor to read the entire table.

[0045] S12. Identify and mark the title, and move the marked title to the following position at the beginning of the XML table.

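[0046] S13. Convert the content of the superscript tags in the XML table into LaTeX form. This ensures that superscripts are handled uniformly, makes later processing and recognition easier, and avoids information loss and errors.

[0047] S14. Mark the footnotes in the XML table and place the marked footnotes at the following position in the XML table. Specifically, in this step, footnotes can be identified by the id attribute of the tag.

[0048] S15. Extract the column-width attributes in the XML table and mark them. This avoids column misalignment caused by hidden column widths and ensures accurate recognition of the table.

[0049] S16. For the processed xm...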
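
As an illustration of step S13, the following is a minimal sketch of converting superscript elements into inline LaTeX text. It is not the patent's implementation: the tag name `sup`, the `$^{...}$` output form, and the function name `sup_to_latex` are assumptions made for this example, since the text above does not preserve the actual tag name or the exact LaTeX form.

```python
import xml.etree.ElementTree as ET


def sup_to_latex(table):
    """Replace each <sup> element (assumed tag name) with inline LaTeX text."""
    # Snapshot the elements first so removals do not disturb the traversal.
    for parent in list(table.iter()):
        last_kept = None  # most recent child that stays in the tree
        for child in list(parent):
            if child.tag == "sup":
                # Superscript content as LaTeX, followed by the text that
                # came right after the closing tag.
                latex = "$^{" + "".join(child.itertext()) + "}$" + (child.tail or "")
                if last_kept is None:
                    parent.text = (parent.text or "") + latex
                else:
                    last_kept.tail = (last_kept.tail or "") + latex
                parent.remove(child)
            else:
                last_kept = child


table = ET.fromstring("<table><cell>10<sup>3</sup> kg</cell></table>")
sup_to_latex(table)
print(table[0].text)
```

Folding the superscript into the surrounding text keeps each cell a single string, which is consistent with the stated goal of handling superscripts uniformly before recognition.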

Embodiment 2

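[0051] In step S15 of Embodiment 1, the content between two cell-width marker rows is the range of rows governed by the preceding cell-width marker row.

[0052] In step S3 of this embodiment, the method for identifying cross-column subheadings and filling them into the corresponding columns specifically includes:

[0053] S31. Obtain the column width of each column in the table as a baseline, and use the baseline to determine the starting position and exact column span of each cross-column cell.

[0054] S32. For each cross-column cell, determine whether it is a subheading; if so, fill it to the right.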
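
Steps S31 and S32 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented method: each cell is assumed to be a `(text, width)` pair with integer widths, the function name `expand_row` is hypothetical, and the subheading test is passed in as a caller-supplied predicate because the text above does not say how that decision is made.

```python
def expand_row(cells, col_widths, is_subheading=lambda text: True):
    """Expand a row of (text, width) cells to one entry per baseline column.

    col_widths is the baseline from S31; a cell wider than one baseline
    column is treated as a cross-column cell.
    """
    out = []
    col = 0  # index of the baseline column where the current cell starts
    for text, width in cells:
        # S31: count how many baseline columns this cell covers.
        span, covered = 0, 0
        while col + span < len(col_widths) and covered < width:
            covered += col_widths[col + span]
            span += 1
        span = max(span, 1)
        if span > 1 and is_subheading(text):
            # S32: a cross-column subheading is filled to the right.
            out.extend([text] * span)
        else:
            out.extend([text] + [""] * (span - 1))
        col += span
    return out


row = expand_row(
    [("Sample", 30), ("Yield (%)", 40), ("Ref", 30)],
    col_widths=[30, 20, 20, 30],
)
print(row)  # ['Sample', 'Yield (%)', 'Yield (%)', 'Ref']
```

After expansion every row has one entry per column, so later steps can treat the table as a regular two-dimensional list.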

Embodiment 3

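[0056] In step S4 of this embodiment, machine learning is used to identify the range of rows in which the header lies, and the header rows are then merged.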
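
The merging half of step S4 can be sketched as below. The text above does not describe the machine-learning model, so this sketch assumes the number of header rows is already known and passed in as `n_header`; the function name `merge_header` and the `" / "` separator are also assumptions for illustration.

```python
def merge_header(rows, n_header, sep=" / "):
    """Merge the first n_header rows of a 2-D list into one header row.

    n_header stands in for the output of the header-range classifier,
    which is not specified in the source text.
    """
    header = []
    for column in zip(*rows[:n_header]):
        parts = []
        for text in column:
            # Skip empty cells and immediate repeats from filled subheadings.
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        header.append(sep.join(parts))
    return [header] + rows[n_header:]


rows = [
    ["Sample", "Yield (%)", "Yield (%)"],
    ["", "Run 1", "Run 2"],
    ["A", "91", "88"],
]
print(merge_header(rows, n_header=2))
```

Joining the header cells top to bottom gives each data column a single label that still carries its cross-column subheading.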

Abstract

The invention provides a table extraction method based on machine learning. The method comprises the following steps: preprocessing an original XML file to obtain a new XML file that can be recognized by an automatic extraction tool; recognizing the new XML file with the automatic extraction tool and converting it into a two-dimensional list that Python can process; separating table titles and footnotes from the two-dimensional list, recognizing cross-column subheadings, and filling the corresponding columns with the subheadings; distinguishing the range of rows in which the header lies through machine learning, and then merging the headers; and merging cross-row data to obtain the final table data. The method extracts clean table content from XML files based on machine learning and ensures accurate and complete extraction of the information in the XML file.

Description

Technical field

[0001] The invention relates to the technical field of data processing, and in particular to a table extraction method based on machine learning.

Background

[0002] Tables are widely used to present data. The volume of data they hold is so large that it can be described as a mine. However, manual editing and sorting is too time-consuming and labor-intensive. With the development of big data technology, automatic extraction and data cleaning by software is the general trend and can greatly improve work efficiency. Yet tables in the literature are designed for human readability, and computer recognition of tables is a major challenge.

Contents of the invention

[0003] In view of the technical problems described in the background, the present invention proposes a table extraction method based on machine learning.

[0004] The table extraction method based on machine learning that the present invention propo...

Application Information

IPC(8): G06F40/154, G06F40/169, G06F40/174, G06F40/279, G06F16/11, G06N20/00
CPC: G06F16/113, G06N20/00
Inventor: 李鑫, 郑磊, 鲍琦
Owner: 苏州机数芯微科技有限公司