Language model fine-tuning method for low-resource agglutinative language text classification

A language text and language model technology, applied in text database clustering/classification, semantic analysis, neural learning methods, etc., which can solve problems such as vocabulary redundancy caused by highly uncertain written forms, difficulty of text classification, and uncertainty of spelling and encoding.

Pending Publication Date: 2021-06-25
XINJIANG UNIVERSITY

AI Technical Summary

Problems solved by technology

Data collected from the internet is noisy and uncertain in terms of encoding and spelling. The main problems for NLP tasks in Uyghur, Kazakh and Kyrgyz are the uncertainty of spelling and encoding and the insufficiency of annotated data sets, which makes classifying short, noisy text data a great challenge.



Examples


Embodiment 1

[0050] Step S1 uses the XLM-R model for language modelling. The XLM-R model uses a single shared vocabulary: sentences are randomly sampled from the monolingual corpora and concatenated, the BPE splits are learned, and the text is processed with byte pair encoding (BPE). This greatly improves the alignment of embedding spaces across languages that share the same alphabet or anchor tokens such as digits and proper nouns.
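
As an illustration of this shared subword vocabulary, the following minimal sketch assumes the Hugging Face transformers library and the publicly released xlm-roberta-base checkpoint (neither is named in the patent) and tokenizes text from different scripts with the same BPE/SentencePiece vocabulary:

```python
# Sketch: one shared cross-lingual subword vocabulary, assuming the
# Hugging Face "transformers" library and the public "xlm-roberta-base"
# checkpoint (both are assumptions, not specified in the patent text).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# The same vocabulary covers many scripts; anchor tokens such as digits
# are shared across languages.
for text in ["News topic classification", "2021 年 6 月 25 日"]:
    print(tokenizer.tokenize(text))
```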

[0051] In step S1, the random sampling of sentences is carried out according to a multinomial distribution with probabilities {q_i}, i = 1, 2, 3, ..., N, specifically:

[0052] q_i = p_i^α / Σ_{j=1}^{N} p_j^α

[0053] where p_i = n_i / Σ_{k=1}^{N} n_k, n_i is the number of sentences in the monolingual corpus of the i-th language, and α = 0.3.

[0054] Sampling with this distribution increases the number of tokens drawn from low-resource languages and mitigates the bias toward high-resource languages. In particular, it prevents words in low-resource languages from being split down to the character level.
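
The sampling probabilities can be computed directly from corpus sizes. The sketch below assumes n_i is the sentence count of the i-th monolingual corpus; the sizes used are made-up illustrative values:

```python
# Sketch of the multinomial sampling probabilities q_i with smoothing
# exponent alpha = 0.3; the corpus sizes below are illustrative only.
def sampling_probs(corpus_sizes, alpha=0.3):
    total = sum(corpus_sizes)
    p = [n / total for n in corpus_sizes]   # p_i: raw share of language i
    weights = [pi ** alpha for pi in p]     # p_i^alpha
    z = sum(weights)
    return [w / z for w in weights]         # q_i: smoothed sampling probability

# Example: one high-resource and two low-resource corpora (sentence counts).
sizes = [10_000_000, 200_000, 50_000]
print(sampling_probs(sizes))   # low-resource shares are up-weighted
```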

Embodiment 2

[0056] The steps of fine-tuning the cross-lingual model in step S2 are:

[0057] S21: A suffix-based semi-supervised morpheme tokenizer is used; for each candidate word, it applies an iterative search algorithm that generates all possible segmentation results by matching a stem set and a suffix set (a simplified sketch is given after these steps);

[0058] S22: When morphemes are merged into words, the phonemes at the morpheme boundary change their surface form according to phonetic and orthographic rules, and the morphemes harmonize with and influence each other's pronunciation;

[0059] S23: When pronunciation is represented accurately, this phonetic harmony can be observed clearly in the text; in the low-resource agglutinative text classification task, an independent statistical model is used to select the best result from the n-best segmentation results;

[0060] S24: The necessary terms are collected by extracting word stems to form a less noisy fine-tuning data set, and the XLM-R model is then fine-tuned on this data set.
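
As referenced in step S21, the following toy sketch illustrates the stem/suffix matching and n-best selection idea of steps S21 to S23; the stem set, suffix set, and scoring rule here are hypothetical placeholders, not the patent's actual morpheme analyzer:

```python
# Toy sketch of suffix-based segmentation (steps S21-S23): enumerate all
# stem + suffix-sequence splits of a word, then pick the best-scoring one.
# The stem/suffix sets and the scoring rule are hypothetical examples.
STEMS = {"kitab", "mekteb"}            # placeholder stem set
SUFFIXES = {"lar", "da", "ni", "im"}   # placeholder suffix set

def segmentations(word, prefix=()):
    """Return every split of `word` into one stem followed by suffixes."""
    results = []
    for i in range(1, len(word) + 1):
        head, tail = word[:i], word[i:]
        if not prefix and head in STEMS:
            results += segmentations(tail, (head,)) if tail else [(head,)]
        elif prefix and head in SUFFIXES:
            results += segmentations(tail, prefix + (head,)) if tail else [prefix + (head,)]
    return results

def best_segmentation(word):
    # Stand-in for the statistical model that selects among the n-best results:
    # here we simply prefer the split with the fewest morphemes.
    candidates = segmentations(word)
    return min(candidates, key=len) if candidates else (word,)

print(best_segmentation("kitablarda"))   # ('kitab', 'lar', 'da')
```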

Embodiment 3

[0062] The specific method of the discriminative fine-tuning in step S3 is as follows:

[0063] Different layers of the neural network capture different levels of syntactic and semantic information, and the lower layers of the XLM-R model tend to contain more general information. Discriminative (layer-wise) learning rates are therefore used during fine-tuning: the parameters θ are divided into {θ^1, ..., θ^L}, where θ^l contains the parameters of the l-th layer, and the parameters are updated as follows:

[0064] θ_t^l = θ_{t-1}^l − η^l · ∇_{θ^l} J(θ)

[0065] where η^l denotes the learning rate of the l-th layer and t denotes the update step. The base learning rate of the top layer is set to η^L, and η^{k−1} = ξ · η^k, where ξ ≤ 1 is the decay factor. When ξ < 1, the lower layers learn more slowly than the upper layers; when ξ = 1, all layers share the same learning rate, which is equivalent to regular stochastic gradient descent (SGD).
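
A minimal sketch of these layer-wise learning rates, assuming PyTorch and the Hugging Face transformers implementation of XLM-RoBERTa (module paths such as roberta.encoder.layer and the num_labels value are assumptions, not taken from the patent):

```python
# Sketch: discriminative fine-tuning via per-layer learning rates,
# eta^{l-1} = xi * eta^l, assuming PyTorch + Hugging Face transformers.
import torch
from transformers import XLMRobertaForSequenceClassification

model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=5)   # num_labels is an illustrative value

base_lr, xi = 2e-5, 0.95                # top-layer rate and decay factor
layers = model.roberta.encoder.layer    # transformer layers, bottom to top
num_layers = len(layers)

# The classification head gets the base rate; each layer below it is decayed.
param_groups = [{"params": model.classifier.parameters(), "lr": base_lr}]
for l, layer in enumerate(layers):
    # Lower layers (small l) get smaller rates: eta^l = xi^(L - l) * eta^L
    param_groups.append({"params": layer.parameters(),
                         "lr": base_lr * (xi ** (num_layers - l))})
param_groups.append({"params": model.roberta.embeddings.parameters(),
                     "lr": base_lr * (xi ** (num_layers + 1))})

optimizer = torch.optim.AdamW(param_groups)
```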



Abstract

The invention discloses a language model fine-tuning method for low-resource agglutinative language text classification, and relates to the technical field of language processing. A low-noise fine-tuning data set is constructed through morphological analysis and stem extraction, and a cross-lingual pre-trained model is fine-tuned on this data set to obtain the fine-tuning result. This provides a meaningful and easy-to-use feature extractor for downstream text classification: relevant semantic and syntactic information is better selected from the pre-trained language model, and these features are used for the downstream text classification task.

Description

Technical field

[0001] The invention relates to the technical field of language processing, in particular to a language model fine-tuning method for low-resource agglutinative language text classification.

Background technique

[0002] Text classification is the backbone of most natural language processing tasks, such as sentiment analysis, news topic classification, and intent recognition. Although deep learning models have achieved state-of-the-art results in many natural language processing (NLP) tasks, these models are trained from scratch, which requires large data sets. However, many low-resource languages lack the richly annotated data sets needed to support the various tasks in text classification.

[0003] The main challenges of low-resource agglutinative text classification are the lack of labeled data in the target domain and the morphological diversity of derived words in the linguistic structure. For low-resource agglutinative languages such as Uyghur, Kazakh, and Kyrgyz ...

Claims


Application Information

IPC(8): G06F16/35; G06F40/211; G06F40/284; G06F40/30; G06N3/04; G06N3/08
CPC: G06F16/35; G06F40/211; G06F40/284; G06F40/30; G06N3/08; G06N3/047; Y02D10/00
Inventor: 柯尊旺, 李哲, 蔡茂昌, 曹如鹏
Owner XINJIANG UNIVERSITY