
Data cleaning method, device, equipment and readable storage medium

A data cleaning technology, applied in the field of data processing, that addresses inefficient data cleaning and the resulting inability to obtain high-quality text data sets.

Status: Inactive. Publication Date: 2021-11-30
INSPUR SUZHOU INTELLIGENT TECH CO LTD

AI Technical Summary

Problems solved by technology

However, because of the huge amount of data, many data cleaning solutions are not efficient. Although these solutions use parallel computing frameworks such as Hadoop and Spark, flaws in the cleaning schemes themselves mean they still cannot clean the data efficiently, and so cannot produce a high-quality text data set.



Examples


Detailed Description of Embodiments

[0052] To enable those skilled in the art to better understand the solution of the present application, the application is described in further detail below with reference to the drawings and specific embodiments. Clearly, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments in this application, without creative effort, fall within the scope of protection of this application.

[0053] Refer to Figure 1, which is a flow chart of a data cleaning method in an embodiment of this application. The method can be applied in a framework such as the one shown in Figure 2, and includes the following steps:

[0054] S101. Acquire text data to be cleaned.

[0055] In this embodiment, the text data to be cleaned can be obtained by receiving the text data, the text data to ...



Abstract

The invention discloses a data cleaning method, device, equipment and readable storage medium. The method comprises: obtaining text data to be cleaned; segmenting the text data by article to obtain article data; performing punctuation mark detection on each text line in the article data to identify target text lines that do not end with a punctuation mark; and deleting the target text lines from the article data to obtain target article data. Because cleaning is performed article by article, it is accurate and efficient, so that a high-quality text data set is obtained.
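The four steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The article separator (a blank line), the set of end punctuation marks, and all function names are assumptions made for illustration, since the excerpt does not specify how articles are delimited or which punctuation marks are checked. Blank lines are treated here as lines without end punctuation and are dropped.

```python
from typing import List

# Assumption: articles are separated by a blank line (not stated in the excerpt).
ARTICLE_SEPARATOR = "\n\n"

# Assumption: an illustrative set of sentence-ending punctuation marks.
END_PUNCTUATION = frozenset("。！？；….!?;")


def split_articles(text: str, separator: str = ARTICLE_SEPARATOR) -> List[str]:
    """Segment raw text into articles, skipping empty segments."""
    return [part for part in text.split(separator) if part.strip()]


def ends_with_punctuation(line: str) -> bool:
    """Return True if the line's last non-space character is end punctuation."""
    stripped = line.rstrip()
    return bool(stripped) and stripped[-1] in END_PUNCTUATION


def clean_article(article: str) -> str:
    """Delete every text line that does not end with a punctuation mark."""
    kept = [line for line in article.splitlines() if ends_with_punctuation(line)]
    return "\n".join(kept)


def clean_text(text: str) -> List[str]:
    """Clean each article and drop articles left with no lines."""
    cleaned = (clean_article(article) for article in split_articles(text))
    return [article for article in cleaned if article]


raw = (
    "Home | News | Login\n"
    "The market rose today.\n"
    "Share this article\n"
    "\n"
    "Related links\n"
    "它是一个测试。\n"
)
result = clean_text(raw)
print(result)
```

In this sketch, navigation-style lines such as "Home | News | Login" are removed because they lack end punctuation, while complete sentences are kept. Because each article is cleaned independently, `clean_article` could in principle be mapped over articles in a parallel framework such as Spark, though the excerpt does not describe how the patent distributes the work.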

Description

Technical field

[0001] The present application relates to the technical field of data processing, in particular to a data cleaning method, device, equipment and readable storage medium.

Background

[0002] The Internet holds more and more text data such as news, blogs, and forums. How to use this massive text data to generate high-quality text data sets for the training and inference of artificial intelligence models has become a hot research direction.

[0003] To clean massive amounts of data and generate high-quality text data sets, many cleaning frameworks have emerged, such as the Hadoop (an open source software framework that supports data-intensive distributed applications) MapReduce (a programming model for parallel computing over large-scale data sets, greater than 1 TB) computing framework and the Spark (a fast, general-purpose computing engine designed for large-scale data processing) framework. However, due to the huge amount of data, many d...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/31, G06F16/33, G06F16/335, G06F16/35, G06F16/955
CPC: G06F16/322, G06F16/3335, G06F16/335, G06F16/35, G06F16/955
Inventor: 张荣国
Owner: INSPUR SUZHOU INTELLIGENT TECH CO LTD