User address data cleaning method based on word segmentation

A user address data cleaning technology, applied in the field of data processing, that addresses the long cleaning time and low cleaning efficiency of existing methods. It places low standardization requirements on user address data, improves data cleaning speed, reduces development workload, and saves time.

Active Publication Date: 2018-06-29
BEIJING GAS GRP
8 Cites, 10 Cited by

AI Technical Summary

Problems solved by technology

[0007] To solve the problems of long time consumption, high complexity, and heavy workload in existing address data cleaning methods, the present invention proposes a user address data cleaning method based on word segmentation. The method constructs a metadata database that holds standard address data, then segments, extracts, and corrects the user address data against it to achieve cleaning. The method places low requirements on how standardized the user address data is and is therefore widely applicable, which addresses the heavy workload, long cleaning time, and low cleaning efficiency of existing methods.

Examples


Embodiment Construction

[0028] The method for cleaning user address data based on word segmentation of the present invention will be explained and illustrated in detail below in conjunction with the accompanying drawings.

[0029] The principle of data cleaning is to analyze the causes and forms of dirty data, examine the data flow process, and derive methods (such as mathematical statistics, data mining, or predefined rules) that transform dirty data into data that meets data quality requirements.

[0030] As shown in Figure 1, the present invention discloses a method for cleaning user address data based on word segmentation. The method includes the following steps.

[0031] Step 1. Construct a metadata database based on administrative areas, streets, communities, buildings, units, and house numbers, and store standard address data and word segmentation rule data in the metadata database. The standard address data includes all types of special c...
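A minimal sketch of how the metadata database in Step 1 might be laid out, assuming an in-memory SQLite store. The table names, column names, marker characters, and sample row are all hypothetical illustrations; the source does not specify a schema or a storage engine.

```python
import sqlite3

# Address levels named in Step 1, used here as column names (an assumption).
LEVELS = ["district", "street", "community", "building", "unit", "house_number"]


def build_metadata_db():
    """Create an in-memory metadata database with two hypothetical tables:
    standard_address (one row per standard address) and
    segmentation_rule (feature character -> address level)."""
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f"{level} TEXT" for level in LEVELS)
    conn.execute(f"CREATE TABLE standard_address (id INTEGER PRIMARY KEY, {columns})")
    conn.execute(
        "CREATE TABLE segmentation_rule (marker TEXT PRIMARY KEY, level TEXT NOT NULL)"
    )
    # Illustrative word segmentation rules: a marker string closes a segment
    # of the given level. These markers are assumptions, not from the source.
    conn.executemany(
        "INSERT INTO segmentation_rule (marker, level) VALUES (?, ?)",
        [
            ("区", "district"),
            ("路", "street"),
            ("小区", "community"),
            ("号楼", "building"),
            ("单元", "unit"),
            ("号", "house_number"),
        ],
    )
    # One made-up standard address, for illustration only.
    placeholders = ", ".join("?" for _ in LEVELS)
    conn.execute(
        f"INSERT INTO standard_address ({', '.join(LEVELS)}) VALUES ({placeholders})",
        ("朝阳区", "建国路", "阳光小区", "3号楼", "2单元", "501号"),
    )
    conn.commit()
    return conn
```

Keeping the segmentation rules in the same database as the standard addresses mirrors the description in Step 1, where both kinds of data are stored in the metadata database.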

Abstract

The invention discloses a user address data cleaning method based on word segmentation. The method includes the following steps: 1, a metadata database is established, and standard address data is stored in it; 2, feature characters in the user address data are read and identified based on the metadata database, and the user address data is then segmented according to those feature characters, so that multiple pieces of address sub-data are extracted; 3, the pieces of address sub-data are matched against the standard address data, and the user address data is corrected using the standard address data. The method does not require the original data to be standardized and places low requirements on data sources, so its range of application is wide. By establishing a metadata database of actual addresses, nonstandard or indeterminate addresses can be matched and cleaned, which reduces the data cleaning workload, shortens data cleaning time, and solves the problem that nonstandard addresses are difficult to match, achieving rapid and effective matching of nonstandard addresses.
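The three steps in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the marker table, the standard address lists, and the function names `segment` and `clean` are hypothetical, and fuzzy matching with Python's `difflib` is a stand-in, since the source does not say how sub-data is matched against the standard data.

```python
import difflib

# Hypothetical feature characters mapped to address levels (assumption).
MARKERS = {"区": "district", "路": "street", "小区": "community",
           "号楼": "building", "单元": "unit", "号": "house_number"}

# Hypothetical standard address data per level, standing in for the
# metadata database of step 1.
STANDARD = {
    "district": ["朝阳区", "海淀区"],
    "street": ["建国路", "朝阳路"],
    "community": ["阳光小区"],
}


def segment(address):
    """Step 2: split an address at feature characters.
    Longer markers are tried first so that '小区' wins over '区'."""
    ordered = sorted(MARKERS, key=len, reverse=True)
    parts = []
    start = i = 0
    while i < len(address):
        # A marker only closes a segment that has at least one character.
        hit = next((m for m in ordered
                    if i > start and address.startswith(m, i)), None)
        if hit is None:
            i += 1
            continue
        end = i + len(hit)
        parts.append((MARKERS[hit], address[start:end]))
        start = i = end
    if start < len(address):
        # Trailing text with no recognized marker is kept as-is.
        parts.append(("unknown", address[start:]))
    return parts


def clean(address):
    """Step 3: replace each segment with its closest standard value,
    leaving it unchanged when no standard value is close enough."""
    cleaned = []
    for level, text in segment(address):
        candidates = STANDARD.get(level, [])
        best = difflib.get_close_matches(text, candidates, n=1, cutoff=0.5)
        cleaned.append(best[0] if best else text)
    return "".join(cleaned)
```

For example, `clean("朝阳区建囯路阳光小区3号楼2单元501号")` corrects the variant character in the street segment and returns `"朝阳区建国路阳光小区3号楼2单元501号"`. Levels with no standard data in this toy table (building, unit, house number) pass through unchanged.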

Description

Technical field

[0001] The present invention relates to the technical field of data processing, and more specifically to a method for cleaning user address data based on word segmentation.

Background technique

[0002] "Dirty data" mainly refers to inconsistent or inaccurate data, outdated data, data containing human errors, and the like. It directly affects data quality, which in turn affects the accuracy of corporate decision-making and the amount of cost input. According to statistics, the data error rate of some enterprises is estimated at 1%-5%, and for some it may be higher. "Dirty data" brings risks and additional costs to enterprises. Address data is an important category of enterprise data, and dirty address data directly affects the enterprise's actual business operations. Address data cleaning is therefore of great help in advancing the enterprise's big data business. Existing address data cleaning methods mainly include the foll...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/215, G06F40/289
Inventor: 韩金丽, 李洪根, 张大兵, 赵新磊
Owner: BEIJING GAS GRP