Data tag generation method and apparatus

A data tag generation technique, applied in the field of Internet data, that addresses several problems: manual labeling is time-consuming and expensive, subjective bias reduces the quality of topic clustering, and user-defined tags are scattered and free-form. It yields a tag system with a complete structure, accurate content division, and detailed, rich content.

Active Publication Date: 2017-10-27
北京融数云途科技有限公司

AI Technical Summary

Problems solved by technology

As the volume of text data grows, analyzing and summarizing the themes of text manually is no longer sufficient.
Manual text processing is not only time-consuming and expensive but also introduces subjective bias, which reduces the quality of topic clustering.
[0004] Manually created tags can form a coherent system, but a purely manual approach cannot be mass-produced, so such a tag system does not scale and is not very rich. User-defined tags are too fragmented and free-form to be structured, which makes them hard to use. Keyword extraction with a simple word segmentation algorithm can generate a large number of tags by machine, but those tags are unrepresentative and fragmented.



Examples


Embodiment 1

[0056] A method for generating data tags comprises: obtaining original text data; performing top-level classification on the original text data using a top-level topic database to obtain multiple sets of top-level topic text data; performing de-redundancy preprocessing on each set to obtain multiple sets of preprocessed top-level topic text data; obtaining the total number of documents N and the total number of words M in each set of preprocessed top-level topic text data, and extracting the TF-IDF feature value of each word in each document to obtain matrix data V, where V has N rows, one per document, and M columns, one per word, each column holding the TF-IDF feature values of that word across the N documents; performing topic clustering on the matrix data V to obtain X different topic clusters; selecting from each topic cluster 20-50 keywords closely related to that cluster; sort according to the...
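The construction of the matrix V described in this embodiment can be illustrated with a minimal sketch. It is not the patented implementation: the whitespace tokenizer, the specific TF-IDF formula (term frequency normalized by document length, multiplied by log(N / document frequency)), and the function name `tfidf_matrix` are all assumptions made for illustration. Chinese text would first need a word segmentation step, which is omitted here.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a TF-IDF matrix: one row per document, one column per word.

    Hypothetical helper. The tokenizer and the TF-IDF variant are
    assumptions; the source text does not specify either.
    """
    # Assumption: whitespace tokenization stands in for word segmentation.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # guard against empty documents
        row = [(counts[w] / length) * math.log(n_docs / df[w]) for w in vocab]
        matrix.append(row)
    return vocab, matrix

docs = [
    "loan interest rate bank",
    "bank loan credit score",
    "football match goal score",
]
vocab, V = tfidf_matrix(docs)
N, M = len(V), len(V[0])  # N documents (rows), M distinct words (columns)
print(N, M)
```

Each row of `V` can then be passed to a clustering step to obtain the X topic clusters. The clustering algorithm is left out of the sketch because this excerpt does not name one.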

Embodiment 2

[0093] A data tag generating device comprises: an original data acquisition module; a top-level topic database module, used to perform top-level classification on the original text data and obtain the top-level topic text data of the original text data;

[0094] A data preprocessing module, configured to perform de-redundancy preprocessing on each set of top-level topic text data to obtain multiple sets of preprocessed top-level topic text data;

[0095] A matrix data acquisition module, used to obtain the total number of documents and the total number of words in each set of preprocessed top-level topic text data, and to extract the TF-IDF feature value of each word in each document of the same set to obtain matrix data; wherein the number of rows of the matrix data is the total number of documents, one row per document, the number of columns is the total number of words, and one column is the Tf-idf feature value of a word in multi...


Abstract

The invention discloses a data tag generation method and apparatus, and relates to the field of Internet data. The method comprises: obtaining original text data; classifying it using a top-level topic database to obtain multiple sets of top-level topic text data; preprocessing the top-level topic text data to obtain preprocessed top-level topic text data; extracting the TF-IDF feature values of all documents in the preprocessed top-level topic text data to obtain matrix data; performing topic clustering on the matrix data to obtain a plurality of different topic clusters; sorting the keywords in each topic cluster from high to low; correcting the keyword ranking table according to the actual application, retaining the keywords that are closely related to the corresponding topic cluster and correctly express its content; and obtaining a tag for each topic cluster from the new keyword ranking table. The method clusters massive data by topic quickly and efficiently, and the resulting tags form a complete system structure, are rich in content, fit the actual application more closely, and are convenient for users.
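The keyword sorting and correction steps summarized above can be sketched as follows. This is only an illustration under stated assumptions: the excerpt does not say which score the keywords are sorted by, so the sketch assumes the mean TF-IDF weight of each word across the documents of one cluster. It also models the "correction according to an actual application" as removing a caller-supplied set of rejected words. The names `rank_keywords` and `correct_ranking` and the sample weights are hypothetical.

```python
def rank_keywords(vocab, cluster_rows, top_k=3):
    """Rank a cluster's words from high to low by mean weight.

    Assumption: the ranking score is the mean TF-IDF weight over the
    cluster's documents; the source does not state the criterion.
    """
    n = len(cluster_rows)
    means = [sum(row[j] for row in cluster_rows) / n for j in range(len(vocab))]
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(zip(vocab, means), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, score in ranked[:top_k] if score > 0]

def correct_ranking(ranked, rejected):
    """Drop words judged unsuitable for the application, keeping the order."""
    return [word for word in ranked if word not in rejected]

# Made-up TF-IDF rows for two documents that fell into the same cluster.
vocab = ["bank", "loan", "rate", "today"]
cluster_rows = [
    [0.30, 0.20, 0.10, 0.25],
    [0.10, 0.40, 0.00, 0.25],
]
ranked = rank_keywords(vocab, cluster_rows, top_k=4)
tags = correct_ranking(ranked, rejected={"today"})
print(ranked)
print(tags)
```

Keeping the correction as a separate pass over an already ranked list means the manual review only removes entries and never has to re-score them, so the relative order of the surviving keywords is preserved.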

Description

Technical field

[0001] The invention relates to the technical field of Internet data, in particular to a method and device for generating data tags.

Background

[0002] A tag is a keyword that is more accurate and specific than a category and can summarize the content of an information topic. The tag system is an important part of websites, apps, digital marketing, advertising, recommendation systems, and similar products in the Internet age. It is the basis for building user profiles and for precise targeting. In most systems, tags are created manually, defined by users, or produced by mapping the output of machine keyword extraction.

[0003] In the context of big data, people are exposed to more and more text information, and the amount of text data is increasing exponentially. Under these conditions, analyzing and summarizing the themes of text manually is no longer sufficient. Manual text processing is not only time-consuming and expensive, but also involve...


Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F17/30
CPC: G06F16/2425, G06F16/353
Inventor: 李晖胡宁杭郑悦
Owner: 北京融数云途科技有限公司