Method for extracting and organizing unstructured sheet document data under big data environment

A technology for structured and unstructured data, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses problems of existing extraction techniques such as their lack of flexibility and their inability to extract data from unstructured table documents.

Active Publication Date: 2016-06-01
ZHEJIANG UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0007] Due to the lack of flexibility of the existing data extraction technology ...



Examples


Embodiment Construction

[0096] With reference to Figure 2, an actual unstructured tabular document is used to illustrate the specific implementation of the method proposed by the present invention for extracting and organizing unstructured tabular document data in a big data environment. The steps are as follows: (1) Define the basic characteristics of the tabular document and the extraction rules;

[0097] (1.1) Define the structural features of the table document;

[0098] (1.1.1) As shown in Figure 2, according to the rule that one title area of a single-value area corresponds to exactly one data area, Figure 2(a) is a single-value area; according to the rule that one title area of a multi-value area corresponds to one or more data areas, Figure 2(b) is a multi-value area;

[0099] (1.1.2) As shown in Figure 2, in Figure 2(a) "Name" is the title area and "Chen" is the data area; in Figure 2(b) "Start and end time" is the title area and "2009.12.14-12.16" is the data area;
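
The title-area and data-area rules in (1.1.1) and (1.1.2) can be illustrated with a small data model. This is only a sketch under assumptions: the class `Area`, its fields, and the helper `flatten` are hypothetical names that do not appear in the patent, and the multi-value flag is declared by hand because the text does not say how an area type is detected automatically.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Area:
    """One title area together with the data areas it labels (hypothetical model)."""
    title: str
    data: List[str]
    multi_valued: bool = False  # declared by the rule author, not inferred

    def is_valid(self) -> bool:
        # Single-value area: one title area maps to exactly one data area.
        # Multi-value area: one title area maps to one or more data areas.
        if self.multi_valued:
            return len(self.data) >= 1
        return len(self.data) == 1


def flatten(areas: List[Area]) -> List[Tuple[str, str]]:
    """Turn valid areas into (title, value) pairs, one pair per data area."""
    return [(a.title, value) for a in areas if a.is_valid() for value in a.data]


# The two examples quoted from Figure 2(a) and Figure 2(b).
name_area = Area(title="Name", data=["Chen"])
period_area = Area(
    title="Start and end time",
    data=["2009.12.14-12.16"],
    multi_valued=True,
)
pairs = flatten([name_area, period_area])
```

Keeping the area type as an explicit flag mirrors the text: Figure 2(b) is a multi-value area even though the quoted example holds only one data area, so the type cannot be derived from the number of values alone.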

[0100] (1.2) Define the data ...


Abstract

The invention relates to a method for extracting and organizing unstructured sheet document data in a big data environment. The method comprises the following steps: first, analyzing the structural features and data stream features of an unstructured sheet document and defining data extraction rules; second, giving a data extraction flow and an extraction algorithm for unstructured sheet documents; third, giving an organizing method for converting the extraction results into structured data; and finally, giving a method for analyzing the resulting structured data set based on the MapReduce parallel programming model. The method can provide technical support for mining the knowledge contained in unstructured sheet documents in a big data environment.
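
The final step, analyzing the structured data set with the MapReduce model, can be sketched in a single process. This is not the patent's algorithm and it does not use Hadoop or any real MapReduce framework; it only shows the map, shuffle, and reduce phases on a few made-up records. The function names, the `Department` field, and every record value other than the name "Chen" are assumptions introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, str]  # one extracted document as title -> value (assumed shape)


def map_phase(record: Record, field: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (value, 1) for the chosen field, skipping records that lack it."""
    if field in record:
        yield record[field], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted counts by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)


def count_by_field(records: Iterable[Record], field: str) -> Dict[str, int]:
    """Run map, shuffle, and reduce in order and return value -> count."""
    pairs = (pair for record in records for pair in map_phase(record, field))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


# Made-up sample records standing in for the structured data set.
records = [
    {"Name": "Chen", "Department": "A"},
    {"Name": "Li", "Department": "B"},
    {"Name": "Wang", "Department": "A"},
]
counts = count_by_field(records, "Department")
```

In a real deployment the map and reduce functions would run on separate workers over partitions of the data; the sequential version here only makes the data flow between the phases visible.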

Description

Technical field

[0001] The invention relates to a method for extracting and organizing the data of unstructured table documents in a big data environment. First, the structural features and data flow characteristics of unstructured tabular documents are analyzed, and the data extraction rules are defined; second, the data extraction process and algorithm for unstructured tabular documents are given; third, an organization method for converting the extraction results into structured data is given; finally, a method for analyzing the resulting structured data set based on the MapReduce parallel programming model is given. This method can provide technical support for mining the knowledge contained in unstructured tabular documents in a big data environment.

Background technique

[0002] With the wide application of office automation, form documents are widely used in the daily affairs of enterprises, institutions and government affairs, such as survey forms, perfor...


Application Information

IPC(8): G06F17/30
CPC: G06F16/3331, G06F16/3344
Inventor: 张元鸣, 肖刚, 陈苗, 陆佳炜, 徐俊, 高飞, 沈志鹏, 高亚琳
Owner: ZHEJIANG UNIV OF TECH