Data cleaning method

A data cleaning technology for two-dimensional data tables, applied in the field of data cleaning. It addresses problems such as the inability of existing approaches to complete yearbook data cleaning and the failure of OCR recognition on complex tables, and achieves a high degree of automation and low labor cost with good results.

Active Publication Date: 2019-02-26
盐城优易数据有限公司

AI Technical Summary

Problems solved by technology

[0006] OCR recognition is fast, but it fails on forms with complex structures, so OCR alone cannot complete the cleaning of yearbook data.



Examples


Embodiment 1

[0053] As shown in Figure 1, this embodiment provides a data cleaning method comprising the following steps:

[0054] S1. Determine area coordinates corresponding to preset areas in the two-dimensional data table, where the preset areas include the area where row headers are located and the area where column headers are located.

[0055] This step specifically includes either of the following: determining the areas corresponding to different fill colors in the two-dimensional data table, and then determining the area coordinates of each preset area according to the coordinate value characteristics of the area coordinates of those color-filled areas; or determining the area coordinates of each preset area according to a preset correspondence between different fill colors and different preset areas.
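The second alternative in step S1 (a preset correspondence between fill colors and preset areas) can be sketched as follows. This is only a minimal illustration, not the patented implementation: the table is assumed to be already loaded as a grid of fill-color names, and `color_regions`, `preset_area_coords`, `COLOR_TO_AREA`, and the color names are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (top_row, left_col, bottom_row, right_col), all inclusive and 0-based
Box = Tuple[int, int, int, int]

# Hypothetical preset correspondence between fill colors and preset areas.
COLOR_TO_AREA = {"yellow": "column_header", "green": "row_header"}


def color_regions(fills: List[List[Optional[str]]]) -> Dict[str, Box]:
    """Return the bounding box of the cells that share each fill color."""
    boxes: Dict[str, Box] = {}
    for r, row in enumerate(fills):
        for c, color in enumerate(row):
            if color is None:  # unfilled cell: not part of any marked area
                continue
            if color in boxes:
                top, left, bottom, right = boxes[color]
                boxes[color] = (min(top, r), min(left, c),
                                max(bottom, r), max(right, c))
            else:
                boxes[color] = (r, c, r, c)
    return boxes


def preset_area_coords(fills: List[List[Optional[str]]]) -> Dict[str, Box]:
    """Map each known fill color to its preset area and area coordinates."""
    return {
        COLOR_TO_AREA[color]: box
        for color, box in color_regions(fills).items()
        if color in COLOR_TO_AREA
    }


# Toy grid of fill colors (made up for illustration): the first row is
# marked as the column header and the first column as the row header.
fills = [
    [None, "yellow", "yellow"],
    ["green", None, None],
    ["green", None, None],
]
areas = preset_area_coords(fills)
```

Here each area is reduced to a rectangular bounding box, which assumes every preset area is a single rectangle. The first alternative in the paragraph, which infers the areas from the coordinate value characteristics of the colored regions, would replace the `COLOR_TO_AREA` lookup with a rule on the boxes themselves.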

[0056] For example, when performing data cleaning...

Embodiment 2

[0070] As shown in Figure 2, this embodiment provides another data cleaning method. It differs from Embodiment 1 in that the preset areas in step S1 also include the area where the table title of the two-dimensional data table is located, and in that the following steps are added after step S36:

[0071] S37. Determine the last column of the area where the column header is located according to the area coordinates of that area;

[0072] S38. Write the header attributes into the cells jointly identified by the preset columns after that last column and the first preset row among the rows occupied by the area where the table title is located, where the number of preset columns equals the number of header attributes;

[0073] S39. According to the semantics of the words constituting the table title obtained through analysis, write the corresponding words in the column of the table header attribute to which the corre...


Abstract

The invention discloses a data cleaning method comprising the following steps: determining the area coordinates corresponding to preset areas in a two-dimensional data table, where the preset areas include the area where the row header is located and the area where the column header is located; reading the contents of the areas corresponding to the area coordinates and analyzing them with a preset natural language processing algorithm; determining, according to the analysis result, the header attribute to which the contents of each preset area belong; and writing the value of each header attribute and the corresponding data value into a one-dimensional data table as a row or a column to obtain the cleaned one-dimensional data table. The invention offers a high degree of automation and low labor cost, and can process large amounts of yearbook data with complex formats with good results.
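The final step, writing each header attribute value and its data value into a one-dimensional table, can be illustrated with a small sketch. It assumes the simplest possible layout (one header row, one header column) and that the header attributes have already been determined; the natural language processing step is not shown. The function `flatten`, the attribute names `region` and `year`, and the sample values are hypothetical and made up for illustration.

```python
from typing import Any, Dict, List


def flatten(table: List[List[Any]], row_attr: str, col_attr: str) -> List[Dict[str, Any]]:
    """Turn a two-dimensional table into one record per data cell.

    Assumes table[0] is the column header row and the first cell of every
    later row is that row's header.
    """
    header, *body = table
    col_values = header[1:]  # skip the top-left corner cell
    records: List[Dict[str, Any]] = []
    for row in body:
        row_value, *cells = row
        for col_value, cell in zip(col_values, cells):
            records.append({row_attr: row_value, col_attr: col_value, "value": cell})
    return records


# Toy two-dimensional table (values are invented, not yearbook data).
table = [
    ["", "2016", "2017"],
    ["Region A", 10, 12],
    ["Region B", 7, 9],
]
records = flatten(table, row_attr="region", col_attr="year")
for record in records:
    print(record)
```

Each record carries one value per header attribute plus the data value, so the output has one row per data cell of the original table. Real yearbook tables with multi-level or merged headers would need the area coordinates from step S1 rather than the fixed first-row, first-column assumption used here.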

Description

Technical field

[0001] The invention relates to computer data processing, in particular to a data cleaning method.

Background technique

[0002] The data in the Statistical Yearbook are very complicated, mainly including national economic accounting, population, employment and wages, fixed asset investment and real estate, foreign economic and trade, energy, finance, price index, people's life, urban overview, resources and environment, agriculture, industry, construction, transportation and post and telecommunications, total retail sales of social consumer goods, wholesale and retail, accommodation and catering, tourism, finance, education, science and technology, health, social services, culture, sports, public management, social security and others. Because the statistical systems and statistical standards adopted for statistical yearbook data differ between regions, and because the statistical yearbooks are published as web pages or PDF files, the data f...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8) G06F16/215
Inventor: 辅小红, 唐诚
Owner: 盐城优易数据有限公司