
Method and device for labeling corpus

A corpus labeling technology, applied in semantic tool creation, natural language data processing, unstructured text data retrieval, and similar fields. It addresses problems such as process redundancy, and achieves a simpler algorithm, higher labeling efficiency, and less repetitive manual processing work.

Active Publication Date: 2021-06-01
XIAMEN KUAISHANGTONG INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

[0006] The second is to use an unsupervised algorithm (a clustering algorithm) to cluster the corpus data and then label each resulting category. This method can label the corpus directly without relying on much prior information, but some manual intervention is required afterwards.
[0007] For the second labeling method, the most common approach is to pre-label the corpus with k-means as the core algorithm. The drawback is that k-means requires the number of clusters to be given in advance. In practice, a cluster count is first specified based on experience and then adjusted repeatedly according to the clustering results until a suitable value is found. The whole process is too redundant.
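
The trial-and-error loop described in [0007] can be sketched roughly as follows. This is only an illustration, not the prior-art implementation: the function `kmeans`, the farthest-point initialization, the inertia score, and the 2-D toy vectors are all assumptions made for this example.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with a deterministic farthest-point start (illustrative)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point that is farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    # Inertia: sum of squared distances from each point to its nearest center.
    inertia = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return centers, inertia

# Made-up 2-D stand-ins for text vectors, forming three obvious groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),
          (9.0, 0.0), (9.1, 0.2), (9.2, 0.1)]

# The manual tuning loop: pick k, inspect the result, adjust k, repeat.
inertia_by_k = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, inertia in inertia_by_k.items():
    print(f"k={k}: inertia={inertia:.3f}")
```

Someone still has to read the scores and decide which cluster count is acceptable, which is the repeated adjustment the paragraph calls redundant.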



Examples


Embodiment Construction

[0034] In order to make the technical problems to be solved, the technical solutions, and the beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention and are not intended to limit it.

[0035] As shown in Figure 1, a corpus labeling method of the present invention includes the following steps:

[0036] a. Vectorize the corpus to be processed to obtain text vectors of the corpus;

[0037] b. Based on the text vectors of the corpus, cluster the corpus using the DBSCAN clustering algorithm to obtain long-tail corpus and to-be-labeled corpus;

[0038] c. For the long-tail corpus, return to step b and cluster again; for the to-be-labeled corpus, set labels to obtain labeled corpus;

[0039] d. Merge all labeled corpus to obtain the final labeled corpus.
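
Steps b to d can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the patented implementation. The text does not say how the long-tail corpus is identified or how re-clustering differs from the first pass, so this sketch assumes that the long tail is whatever DBSCAN marks as noise and that `eps` is relaxed on every pass. The names `dbscan`, `label_corpus`, `name_cluster`, and `eps_growth` are hypothetical, `name_cluster` stands in for the manual label-setting step, and the 2-D vectors are made-up stand-ins for the text vectors of step a. DBSCAN is written out in plain Python only to keep the example self-contained; a library implementation such as scikit-learn's `DBSCAN` would normally be used.

```python
import math

NOISE = -1

def dbscan(vectors, eps, min_samples):
    """Minimal DBSCAN: one cluster id per vector, NOISE for outliers."""
    n = len(vectors)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        return [j for j in range(n) if math.dist(vectors[i], vectors[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:  # j is a core point, keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels

def label_corpus(texts, vectors, name_cluster, eps=0.5, min_samples=2,
                 eps_growth=2.0, max_rounds=3):
    """Steps b-d: cluster, label dense clusters, send the long tail back to step b."""
    labeled = {}                          # index into texts -> label
    remaining = list(range(len(texts)))
    for _ in range(max_rounds):
        if not remaining:
            break
        ids = dbscan([vectors[i] for i in remaining], eps, min_samples)  # step b
        clusters = {}
        for idx, cid in zip(remaining, ids):
            clusters.setdefault(cid, []).append(idx)
        long_tail = clusters.pop(NOISE, [])   # ASSUMPTION: long tail = DBSCAN noise
        for members in clusters.values():
            tag = name_cluster([texts[i] for i in members])  # step c: set a label
            for i in members:
                labeled[i] = tag                             # step d: merge
        remaining = long_tail                 # step c: long tail returns to step b
        eps *= eps_growth                     # ASSUMPTION: relax eps on each pass
    return labeled, [texts[i] for i in remaining]

texts = ["price?", "how much", "cost", "opening hours", "when open",
         "refund please", "money back", "hello"]
vectors = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0), (5.1, 5.1),
           (10.0, 10.0), (10.6, 10.0), (20.0, 20.0)]

# Stand-in for the human annotator: name each cluster after its first member.
labeled, leftover = label_corpus(texts, vectors, lambda members: "topic:" + members[0])
for i, tag in sorted(labeled.items()):
    print(texts[i], "->", tag)
print("still long-tail:", leftover)
```

No cluster count is supplied anywhere in this loop; only the density parameters `eps` and `min_samples` are set, which is the contrast with k-means that the method relies on.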

[0040] Step a further includes:

[0041] a1. Process the corpus to obtain a word re...


Abstract

The invention discloses a corpus labeling method and device. The corpus to be processed is vectorized to obtain text vectors of the corpus; the corpus is clustered with the DBSCAN clustering algorithm according to those text vectors to obtain long-tail corpus and to-be-labeled corpus; the long-tail corpus is returned for clustering again; labels are set for the to-be-labeled corpus to obtain labeled corpus; and all labeled corpus is merged to obtain the final labeled corpus. The number of clusters does not need to be adjusted repeatedly, so the algorithm is simpler, the labeling efficiency is higher, and the reliability is better.

Description

Technical field

[0001] The present invention relates to the field of natural language processing, in particular to a corpus labeling method and a device applying the method.

Background technique

[0002] A corpus is the basic resource of corpus linguistics research and the main resource of empirical language research methods. Traditionally, corpora have mainly been used in dictionary compilation, language teaching, traditional language research, and statistics-based or example-based natural language processing. With the development of Internet big data and artificial intelligence technology, corpora are now used even more widely.

[0003] A corpus stores language material that has actually occurred in real use of the language, for example user messages taken directly from web pages, customer service dialogues, etc. A corpus is the basic resource that carries language knowledge, but it is not the same as language knowledge; a raw corpus needs to be processed to become a useful resource, and the proc...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/36, G06F40/295, G06K9/62
CPC: G06F40/295, G06F18/2321
Inventor: 林志伟, 肖龙源, 蔡振华, 李稀敏, 刘晓葳, 谭玉坤
Owner: XIAMEN KUAISHANGTONG INFORMATION TECH CO LTD