Text clustering method based on contrastive learning incorporating a dynamic adjustment mechanism
A text clustering technology with a dynamic adjustment mechanism, applied to clustering/classification in text databases, unstructured text data retrieval, instruments, and related fields; it improves the quality of negative examples and the clustering effect.
Examples
Embodiment 1
[0062] As shown in Figure 1 and Figure 2, a text clustering method based on contrastive learning and incorporating a dynamic adjustment mechanism comprises the following specific steps:
[0063] Step1. Download public text clustering datasets from the Internet, specifically the eight datasets SearchSnippets, StackOverflow, Biomedical, AgNews, Tweet, GoogleNews-TS, GoogleNews-T, and GoogleNews-S, where GoogleNews-TS, GoogleNews-T, and GoogleNews-S are obtained from the GoogleNews [25] dataset by taking the title together with the snippet, the title only, and the snippet only, respectively.
[0064] The details of the datasets obtained above are shown in Table 1:
[0065] Table 1 Dataset Details
[0068] Step2. First, based on the contextual augmentation method, an augmented text pair is generated for each text by two different masked-word prediction models; the pair is then fed into a pre-trained BERT model with shared parameters to obtain feature representations, and finally the initial semantic representation of the text is obtained...
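As an illustration only, the following is a minimal sketch of Step 2 under stated assumptions: two off-the-shelf masked-word prediction models (bert-base-uncased and roberta-base are placeholders; the patent does not name specific models) generate an augmented view pair for each text, and a single shared-parameter BERT encoder produces the initial representations. The 15% mask ratio and mean pooling are likewise assumptions, not taken from the patent.

```python
# Hypothetical sketch of Step 2: contextual augmentation via two different
# masked-word prediction models, then a shared-parameter BERT encoder.
import random
import torch
from transformers import AutoModel, AutoTokenizer, pipeline

# Two different masked-LM fillers yield two augmented views of the same text.
filler_a = pipeline("fill-mask", model="bert-base-uncased")
filler_b = pipeline("fill-mask", model="roberta-base")

def augment(text: str, filler, mask_ratio: float = 0.15) -> str:
    """Mask a fraction of words and replace each with the filler's top prediction."""
    words = text.split()
    for i in random.sample(range(len(words)), max(1, int(len(words) * mask_ratio))):
        masked = words.copy()
        masked[i] = filler.tokenizer.mask_token
        words[i] = filler(" ".join(masked))[0]["token_str"].strip()
    return " ".join(words)

# One BERT encoder with shared parameters maps both views to features.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch).last_hidden_state
    return out.mean(dim=1)  # mean pooling -> initial semantic representation

text = "stack overflow question about python list comprehension"
view_a, view_b = augment(text, filler_a), augment(text, filler_b)
z_a, z_b = encode([view_a]), encode([view_b])  # positive pair for contrastive loss
```

The two representations z_a and z_b then serve as a positive pair for the contrastive objective in the subsequent training stage.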
Embodiment 2
[0119] (1) Comparative test
[0120] The proposed method is compared with other text clustering methods on the eight datasets, specifically against the eight benchmark text clustering methods BoW, TF-IDF, KMeans, DEC, STCC, Self-Train, HAC-SD, and SCCL. The experimental results are shown in Table 2.
[0121] Table 2 Comparison of experimental effects
[0125] Analysis of Table 2 shows that the proposed model outperforms the existing benchmark models on most datasets, and in particular improves on the SCCL model, which likewise combines contrastive learning with clustering.
[0126] (2) Ablation experiment
[0127] To further verify the effectiveness of the model, ablation experiments are conducted in this section. On the SearchSnippets dataset, the model is compared against sequential learning and fixed-weight joint learning, and the effect of negative-example screening on the model is validated. The ...
Embodiment 3
[0134] The present invention proposes a method for dynamically adjusting the loss weights to alleviate the inconsistency between the contrastive learning objective and the clustering objective. During training, the model balances the contrastive loss and the clustering loss through an adjustment function, achieving a smooth transition from contrastive learning to clustering. By assigning pseudo-labels to data whose cluster-assignment probability is of high confidence, negative examples are screened, which solves the problem of samples from the same cluster serving as negative examples of each other and effectively improves the quality of negative examples. The data representations obtained by contrastive learning in this way are more clustering-friendly. Compared with existing contrastive clustering methods, the present invention achieves a significant improvement in effect and outperforms existing short-text clustering methods on most datasets.
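The excerpt does not give the adjustment function itself, so the following is a hedged sketch of the idea under stated assumptions: a sigmoid schedule shifts weight from the contrastive loss to the clustering loss over training, and cluster-assignment probabilities above a confidence threshold yield pseudo-labels used to remove same-cluster pairs from the negative set of an InfoNCE-style loss. The sigmoid shape, the 0.9 confidence threshold, and the temperature 0.5 are all assumptions, not values from the patent.

```python
# Sketch only: dynamic loss weighting plus pseudo-label negative screening.
import math
import torch
import torch.nn.functional as F

def loss_weight(step: int, total_steps: int) -> float:
    """Sigmoid schedule: ~1 early (pure contrastive) -> ~0 late (pure clustering)."""
    t = step / total_steps
    return 1.0 / (1.0 + math.exp(-10.0 * (0.5 - t)))

def filtered_contrastive_loss(z_a, z_b, cluster_probs, tau=0.5, conf=0.9):
    """InfoNCE over two views; pairs sharing a confident pseudo-label are
    dropped from the negative set instead of being treated as negatives."""
    z = F.normalize(torch.cat([z_a, z_b]), dim=1)               # (2N, d)
    sim = z @ z.t() / tau                                        # (2N, 2N)
    n = z_a.size(0)
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])   # index of each row's positive view

    probs = torch.cat([cluster_probs, cluster_probs])            # (2N, K)
    confident = probs.max(dim=1).values > conf
    pseudo = probs.argmax(dim=1)
    same_cluster = (pseudo[:, None] == pseudo[None, :]) \
        & confident[:, None] & confident[None, :]

    # Mask self-similarity and confident same-cluster pairs, but keep positives.
    mask = torch.eye(2 * n, dtype=torch.bool) | same_cluster
    mask[torch.arange(2 * n), pos] = False
    sim = sim.masked_fill(mask, float("-inf"))
    return F.cross_entropy(sim, pos)

def total_loss(step, total_steps, z_a, z_b, cluster_probs, clustering_loss):
    alpha = loss_weight(step, total_steps)
    return alpha * filtered_contrastive_loss(z_a, z_b, cluster_probs) \
        + (1 - alpha) * clustering_loss
```

Under this schedule, early training is dominated by the contrastive term and late training by the clustering term, giving the smooth transition the method describes, while the mask prevents confidently same-cluster samples from being pushed apart as false negatives.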