A method and device for generating data labels

A data labeling technology applied in the field of Internet data. It addresses the problems of reduced topic-clustering quality, time-consuming and expensive manual processing, and scattered, unstructured labels, and it achieves refined and rich content, accurate content division, and a complete structure.

Active Publication Date: 2021-02-12
北京融数云途科技有限公司

AI Technical Summary

Problems solved by technology

At this scale, analyzing and summarizing the topics of text manually is no longer adequate.
Manual text processing is not only time-consuming and expensive but also introduces subjective bias, which reduces the quality of topic clustering.
[0004] Manually created labels can form a coherent system, but a purely manual approach cannot be mass-produced, so such a label system does not scale and will not be very rich. User-defined labels are too fragmented and free-form to be structured, which makes them hard to use. Keyword extraction with a simple word segmentation algorithm can generate a large number of labels by machine, but those labels are unrepresentative and fragmented.


Examples


Embodiment 1

[0056] A method for generating data labels: obtaining original text data; performing top-level classification on the original text data by using a top-level topic database to obtain multiple sets of top-level topic text data; performing de-redundancy preprocessing on each set of top-level topic text data to obtain multiple sets of top-level topic preprocessed text data; obtaining the total number of documents N and the total number of words M in each set of top-level topic preprocessed text data, extracting the TF-IDF feature value of each word in each document, and obtaining matrix data V, where V has N rows (one row per document) and M columns (each column holds the TF-IDF feature values of one word across the N documents); performing topic clustering on the matrix data V to obtain X different topic clusters; selecting from each topic cluster 20-50 keywords that are closely related to that topic cluster; sort according to the...
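The construction of the matrix V can be illustrated with a minimal sketch. The source does not state which TF-IDF variant or tokenizer is used, so the following are assumptions: term frequency is the word count divided by document length, inverse document frequency is log(N / df), and documents are split on whitespace (Chinese text would need a word segmenter instead). The function name `tfidf_matrix` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, V): V has one row per document and one column per word.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    # Assumption: whitespace tokenization stands in for real word segmentation.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    V = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # guard against empty documents
        V.append([(counts[w] / length) * math.log(n_docs / df[w]) for w in vocab])
    return vocab, V


docs = [
    "stock market rises as bank shares gain",
    "bank cuts interest rate",
    "team wins football match",
]
vocab, V = tfidf_matrix(docs)
N, M = len(V), len(vocab)  # N documents (rows), M distinct words (columns)
```

Each row of `V` can then be passed to whatever clustering algorithm is chosen for the topic clustering step; the source excerpt does not name one.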

Embodiment 2

[0093] A data label generating device, comprising: an original data acquisition module; a top-level topic database module, used to perform top-level classification on the original text data and obtain the top-level topic text data of the original text data;

[0094] A data preprocessing module, configured to perform de-redundancy preprocessing on each set of top-level topic text data to obtain multiple sets of top-level topic preprocessed text data;

[0095] A matrix data acquisition module, used to obtain the total number of documents and the total number of words in each set of top-level topic preprocessed text data, and to extract the TF-IDF feature value of each word in each document of the same top-level topic preprocessed text data to obtain matrix data; wherein the number of rows of the matrix data is the total number of documents, one row is one document, the number of columns of the matrix data is the total number of words, and one column is the Tf-idf feature value of a word in multi...
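The excerpt does not say what the de-redundancy preprocessing in the data preprocessing module consists of. As one possible reading, and purely as an assumption, the sketch below drops empty documents and exact duplicates after normalizing whitespace and case. The function name `deduplicate` is hypothetical.

```python
import re


def deduplicate(docs):
    """Drop empty documents and exact duplicates, preserving first-seen order.

    Assumption: two documents count as redundant if they are identical after
    collapsing whitespace and lowercasing. The source does not define this.
    """
    seen = set()
    kept = []
    for doc in docs:
        key = re.sub(r"\s+", " ", doc).strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(doc.strip())
    return kept


cleaned = deduplicate(["Bank cuts rate", "bank  cuts rate ", "", "Team wins"])
```

A production system would more likely also need near-duplicate detection, which this sketch does not attempt.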


Abstract

The invention discloses a method and device for generating data labels, relating to the field of Internet data. The method comprises: obtaining original text data; analyzing it with a top-level topic database to obtain multiple sets of top-level topic text data; preprocessing the top-level topic text data to obtain top-level topic preprocessed text data; extracting the TF-IDF feature values of all documents in the top-level topic preprocessed text data to obtain matrix data; performing topic clustering on the matrix data to obtain multiple distinct topic clusters; ranking the keywords of each topic cluster from high to low; correcting the keyword ranking table according to the actual application to obtain keywords that are closely related to, and correctly express, the content of the corresponding topic cluster; and deriving the labels of each topic cluster from the new keyword ranking table. The method can perform topic clustering on massive data quickly and efficiently, and the resulting label system has a complete structure and rich content and is closer to practical application, which makes it convenient for users.
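The ranking and correction steps described above can be sketched as follows. The ranking criterion is not visible in this excerpt, so summing each word's TF-IDF weight over the documents of a cluster is an assumption, as are the hypothetical helpers `rank_keywords` and `correct_keywords` and the form of the manual correction (a list of keywords to remove and a list to add).

```python
def rank_keywords(vocab, V, assignments, cluster_id, top_k=20):
    """Rank words for one cluster, highest score first.

    Assumption: a word's score is the sum of its TF-IDF weights over the
    documents assigned to the cluster.
    """
    rows = [row for row, c in zip(V, assignments) if c == cluster_id]
    scores = [sum(column) for column in zip(*rows)]
    # Sort by descending score; break ties alphabetically for stable output.
    ranked = sorted(zip(vocab, scores), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, score in ranked[:top_k] if score > 0]


def correct_keywords(ranked, remove=(), add=()):
    """Apply an application-specific correction to a ranked keyword list."""
    removed = set(remove)
    kept = [word for word in ranked if word not in removed]
    return kept + [word for word in add if word not in kept]


# Toy data: 3 documents, 4 words, cluster assignments chosen by hand.
vocab = ["bank", "match", "rate", "the"]
V = [
    [0.5, 0.0, 0.3, 0.1],
    [0.4, 0.0, 0.2, 0.1],
    [0.0, 0.6, 0.0, 0.1],
]
assignments = [0, 0, 1]

ranked = rank_keywords(vocab, V, assignments, cluster_id=0, top_k=3)
labels = correct_keywords(ranked, remove=["the"], add=["finance"])
```

Here the uninformative word "the" is removed by the correction step and a domain term is added, which mirrors the idea of adjusting the ranking table to the actual application before deriving labels.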

Description

Technical field

[0001] The invention relates to the technical field of Internet data, in particular to a method and device for generating data labels.

Background

[0002] A label (tag) is a keyword that is more accurate and specific than a category and can summarize the content of a piece of information. The label system is an important part of websites, apps, digital marketing, advertising, recommendation systems, and similar products in the Internet age, and it is the basis for user profiling and precise targeting. In most systems, labels are created manually, defined by users, or produced by mapping after machine keyword extraction.

[0003] In the context of big data, people are exposed to more and more text information, and the amount of text data is increasing exponentially. At this scale, analyzing and summarizing the topics of text manually is no longer adequate. Manual text processing is not only time-consuming and expensive, but also involve...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/242, G06F16/35
CPC: G06F16/2425, G06F16/353
Inventor: 李晖胡宁杭郑悦
Owner: 北京融数云途科技有限公司