Table extraction method based on machine learning
A table extraction technology based on machine learning, applied in the field of data processing, that achieves accurate and complete extraction of tables
Examples
Embodiment 1
[0043] In this embodiment, the automatic extraction tool is chemdataextractor. To help chemdataextractor identify tables more accurately, the method for preprocessing the original XML file in step S1 specifically includes the following steps; for details, see Figure 2.
[0044] S11. Add marker text at the beginning of the XML table. The marker text prevents the content of the table from being ignored, assists later table recognition, and makes it easier for chemdataextractor to read the entire table.
[0045] S12. Identify and mark the title, and move the marked title to the position immediately following the beginning of the XML table.
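[0046] S13. Convert the content of the superscript tags in the XML table into LaTeX form. This ensures that superscripts are handled uniformly, which facilitates later processing and recognition and avoids information loss and information errors.
[0047] S14. Mark the footnotes in the XML table, and place the marked footnotes at the next following position in the XML table. Specifically, in this step, footnotes can be identified from the id attribute of the tag.
[0048] S15. Extract the column-width attributes in the XML table and mark them. This prevents hidden column widths from causing the columns of the table to be misaligned, and ensures accurate recognition of the table.
[0049] S16. For the processed xm...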
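Steps S11 and S13 above might be sketched as follows. This is only a minimal illustration using Python's standard `xml.etree.ElementTree`, not the patented implementation: the tag names `sup` and `marker`, the marker text `TABLE_START`, and the `^{...}` output form are all assumptions made for the example, since the source does not specify them.

```python
import xml.etree.ElementTree as ET

MARKER_TEXT = "TABLE_START"  # hypothetical marker text (not given in the source)


def add_marker(table, marker=MARKER_TEXT):
    """S11 (sketch): put a marker element at the very beginning of the table."""
    marker_el = ET.Element("marker")  # hypothetical tag name
    marker_el.text = marker
    table.insert(0, marker_el)


def sup_to_latex(root, sup_tag="sup"):
    """S13 (sketch): replace each superscript element with inline ^{...} text."""
    for parent in list(root.iter()):
        last_kept = None  # last sibling that was not removed
        for child in list(parent):
            if child.tag != sup_tag:
                last_kept = child
                continue
            # Text inside the superscript, followed by the text that came after it.
            latex = "^{" + "".join(child.itertext()) + "}" + (child.tail or "")
            if last_kept is None:
                parent.text = (parent.text or "") + latex
            else:
                last_kept.tail = (last_kept.tail or "") + latex
            parent.remove(child)


def preprocess_table(table):
    """Apply the two sketched steps in order: S13, then S11."""
    sup_to_latex(table)
    add_marker(table)
    return table
```

Rewriting the superscript as plain text, rather than leaving it as a child element, keeps the exponent attached to the cell text so that a later text-based reader cannot drop it.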
Embodiment 2
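[0051] In step S15 of Embodiment 1, the content between two cell-width marker rows is the range of rows to which the preceding cell-width marker row applies.
[0052] In step S3 of this embodiment, the method for identifying column-spanning subheadings and filling each subheading into its corresponding columns specifically includes:
[0053] S31. Obtain the column width of each column in the table as a baseline value, and use the baseline values to determine the starting position and the exact column span of each column-spanning cell.
[0054] S32. For each column-spanning cell, determine whether it is a subheading; if it is, fill it to the right.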
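The scoping rule for width marker rows could be illustrated as below. This is a hedged sketch only: the source does not say how a marker row is encoded, so the `COLWIDTHS:` prefix and the comma-separated width format are hypothetical.

```python
WIDTH_PREFIX = "COLWIDTHS:"  # hypothetical marker format (not given in the source)


def rows_by_width_marker(lines):
    """Group table rows under the width marker row that precedes them.

    Returns a list of (widths, rows) pairs. Rows that appear before the
    first marker row have no known widths and are skipped in this sketch.
    """
    groups = []
    for line in lines:
        if line.startswith(WIDTH_PREFIX):
            widths = [float(w) for w in line[len(WIDTH_PREFIX):].split(",")]
            groups.append((widths, []))
        elif groups:
            # The row falls within the scope of the most recent marker row.
            groups[-1][1].append(line)
    return groups
```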
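Steps S31 and S32 might look roughly like the following. It is a minimal sketch under stated assumptions, not the method as claimed: a row is assumed to be a left-to-right list of `(text, width)` pairs, widths are assumed to add up exactly to whole baseline columns, and `is_subheading` is a hypothetical predicate because the source does not describe how a subheading is recognized.

```python
def expand_row(cells, base_widths, is_subheading, tol=1e-6):
    """Map a row of (text, width) cells onto the baseline columns.

    A cell wider than one baseline column spans several columns; if it is
    a subheading, its text is copied into every column it spans.
    """
    out = []
    col = 0
    for text, width in cells:
        start = col
        covered = 0.0
        # S31: consume baseline columns until the cell width is accounted for
        # (always at least one column while any remain).
        while col < len(base_widths) and (col == start or covered + tol < width):
            covered += base_widths[col]
            col += 1
        span = col - start
        if span == 0:
            raise ValueError("row is wider than the baseline columns")
        if span > 1 and is_subheading(text):
            out.extend([text] * span)  # S32: fill the subheading to the right
        else:
            out.extend([text] + [""] * (span - 1))
    return out
```

Filling the subheading into every spanned column means that each data column carries its own complete label when the header rows are combined later.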
Embodiment 3
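[0056] In step S4 of this embodiment, machine learning is used to distinguish the range of rows occupied by the table header, and the header rows are then merged.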
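The merging half of step S4 can be sketched as follows. The source does not describe the machine learning model, so this example simply assumes that the number of header rows, `n_header_rows`, has already been predicted by some classifier. Joining the labels of each column with a space is also an assumption.

```python
def merge_header_rows(rows, n_header_rows, sep=" "):
    """Merge the first n_header_rows rows into one header row, column by column.

    rows is a list of equal-length lists of cell strings. Empty labels and
    labels that repeat the one directly above them are skipped.
    """
    header_rows = rows[:n_header_rows]
    body = rows[n_header_rows:]
    merged = []
    for column in zip(*header_rows):
        parts = []
        for label in column:
            label = label.strip()
            if label and (not parts or parts[-1] != label):
                parts.append(label)
        merged.append(sep.join(parts))
    return merged, body
```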