Language model fine-tuning method for low-resource agglutinative language text classification

A language text and language model technology, applied in text database clustering/classification, semantic analysis, neural learning methods, etc., which can solve problems such as vocabulary redundancy caused by highly uncertain written forms, difficulty of text classification, and uncertainty of spelling and encoding.

Pending Publication Date: 2021-06-25
XINJIANG UNIVERSITY

AI Technical Summary

Problems solved by technology

Data collected from the internet is noisy and uncertain in terms of encoding and spelling. The main problems for NLP tasks in Uyghur, Kazakh and Kyrgyz are the uncertainty of spelling and encoding and the insufficiency of annotated data sets, which makes classifying short, noisy text data a great challenge.



Examples


Embodiment 1

[0050] Step S1 uses the XLM-R model for language modelling. The XLM-R model uses a single shared vocabulary: sentences are randomly sampled from the monolingual corpora and concatenated, the BPE splits are learned, and the text is processed with byte pair encoding (BPE). This greatly improves the alignment of embedding spaces across languages that share the same alphabet or anchor tokens such as digits and proper nouns.
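
As an illustration of this shared subword vocabulary, the following minimal sketch assumes the Hugging Face transformers library and the publicly released xlm-roberta-base checkpoint (neither is named in the patent) and tokenizes text from different scripts with the same BPE/SentencePiece vocabulary:

```python
# Sketch: one shared cross-lingual subword vocabulary, assuming the
# Hugging Face "transformers" library and the public "xlm-roberta-base"
# checkpoint (both are assumptions, not specified in the patent text).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# The same vocabulary covers many scripts; anchor tokens such as digits
# are shared across languages.
for text in ["News topic classification", "2021 年 6 月 25 日"]:
    print(tokenizer.tokenize(text))
```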

[0051] In step S1, the random sampling of sentences is carried out according to a multinomial distribution with probabilities {q_i}, i = 1, 2, 3, ..., N, specifically:

[0052] q_i = p_i^α / Σ_{j=1}^{N} p_j^α

[0053] where p_i = n_i / Σ_{k=1}^{N} n_k, n_i is the number of sentences in the monolingual corpus of the i-th language, and α = 0.3.

[0054] Sampling with this distribution increases the number of tokens drawn from low-resource languages and mitigates the bias toward high-resource languages. In particular, it prevents words in low-resource languages from being split down to the character level.
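
The sampling probabilities can be computed directly from corpus sizes. The sketch below assumes n_i is the sentence count of the i-th monolingual corpus; the sizes used are made-up illustrative values:

```python
# Sketch of the multinomial sampling probabilities q_i with smoothing
# exponent alpha = 0.3; the corpus sizes below are illustrative only.
def sampling_probs(corpus_sizes, alpha=0.3):
    total = sum(corpus_sizes)
    p = [n / total for n in corpus_sizes]   # p_i: raw share of language i
    weights = [pi ** alpha for pi in p]     # p_i^alpha
    z = sum(weights)
    return [w / z for w in weights]         # q_i: smoothed sampling probability

# Example: one high-resource and two low-resource corpora (sentence counts).
sizes = [10_000_000, 200_000, 50_000]
print(sampling_probs(sizes))   # low-resource shares are up-weighted
```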

Embodiment 2

[0056] The steps of fine-tuning the cross-lingual model in step S2 are:

[0057] S21: A suffix-based semi-supervised morpheme tokenizer is used; for each candidate word, it applies an iterative search algorithm that generates all possible segmentation results by matching a stem set and a suffix set (a simplified sketch is given after these steps);

[0058] S22: When morphemes are merged into words, the phonemes at the morpheme boundary change their surface form according to phonetic and orthographic rules, and the morphemes harmonize with and influence each other's pronunciation;

[0059] S23: When pronunciation is represented accurately, this phonetic harmony can be observed clearly in the text; in the low-resource agglutinative text classification task, an independent statistical model is used to select the best result from the n-best segmentation results;

[0060] S24: The necessary terms are collected by extracting word stems to form a less noisy fine-tuning data set, and the XLM-R model is then fine-tuned on this data set.
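
As referenced in step S21, the following toy sketch illustrates the stem/suffix matching and n-best selection idea of steps S21 to S23; the stem set, suffix set, and scoring rule here are hypothetical placeholders, not the patent's actual morpheme analyzer:

```python
# Toy sketch of suffix-based segmentation (steps S21-S23): enumerate all
# stem + suffix-sequence splits of a word, then pick the best-scoring one.
# The stem/suffix sets and the scoring rule are hypothetical examples.
STEMS = {"kitab", "mekteb"}            # placeholder stem set
SUFFIXES = {"lar", "da", "ni", "im"}   # placeholder suffix set

def segmentations(word, prefix=()):
    """Return every split of `word` into one stem followed by suffixes."""
    results = []
    for i in range(1, len(word) + 1):
        head, tail = word[:i], word[i:]
        if not prefix and head in STEMS:
            results += segmentations(tail, (head,)) if tail else [(head,)]
        elif prefix and head in SUFFIXES:
            results += segmentations(tail, prefix + (head,)) if tail else [prefix + (head,)]
    return results

def best_segmentation(word):
    # Stand-in for the statistical model that selects among the n-best results:
    # here we simply prefer the split with the fewest morphemes.
    candidates = segmentations(word)
    return min(candidates, key=len) if candidates else (word,)

print(best_segmentation("kitablarda"))   # ('kitab', 'lar', 'da')
```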

Embodiment 3

[0062] The specific method of the discriminative fine-tuning in step S3 is as follows:

[0063] Different layers of the neural network capture different levels of syntactic and semantic information, and the lower layers of the XLM-R model tend to contain more general information. Discriminative (layer-wise) learning rates are therefore used during fine-tuning: the parameters θ are divided into {θ^1, ..., θ^L}, where θ^l contains the parameters of the l-th layer, and the parameters are updated as follows:

[0064] θ_t^l = θ_{t-1}^l − η^l · ∇_{θ^l} J(θ)

[0065] where η^l denotes the learning rate of the l-th layer and t denotes the update step. The base learning rate of the top layer is set to η^L, and η^{k−1} = ξ · η^k, where ξ ≤ 1 is the decay factor. When ξ < 1, the lower layers learn more slowly than the upper layers; when ξ = 1, all layers share the same learning rate, which is equivalent to regular stochastic gradient descent (SGD).
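
A minimal sketch of these layer-wise learning rates, assuming PyTorch and the Hugging Face transformers implementation of XLM-RoBERTa (module paths such as roberta.encoder.layer and the num_labels value are assumptions, not taken from the patent):

```python
# Sketch: discriminative fine-tuning via per-layer learning rates,
# eta^{l-1} = xi * eta^l, assuming PyTorch + Hugging Face transformers.
import torch
from transformers import XLMRobertaForSequenceClassification

model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=5)   # num_labels is an illustrative value

base_lr, xi = 2e-5, 0.95                # top-layer rate and decay factor
layers = model.roberta.encoder.layer    # transformer layers, bottom to top
num_layers = len(layers)

# The classification head gets the base rate; each layer below it is decayed.
param_groups = [{"params": model.classifier.parameters(), "lr": base_lr}]
for l, layer in enumerate(layers):
    # Lower layers (small l) get smaller rates: eta^l = xi^(L - l) * eta^L
    param_groups.append({"params": layer.parameters(),
                         "lr": base_lr * (xi ** (num_layers - l))})
param_groups.append({"params": model.roberta.embeddings.parameters(),
                     "lr": base_lr * (xi ** (num_layers + 1))})

optimizer = torch.optim.AdamW(param_groups)
```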



Abstract

The invention discloses a language model fine-tuning method for low-resource agglutinative language text classification, and relates to the technical field of language processing. A low-noise fine-tuning data set is constructed through morphological analysis and stem extraction, and a cross-lingual pre-trained model is fine-tuned on this data set to obtain the fine-tuning result. This provides a meaningful and easy-to-use feature extractor for downstream text classification: relevant semantic and syntactic information is better selected from the pre-trained language model, and these features are used for the downstream text classification task.

Description

Technical field

[0001] The invention relates to the technical field of language processing, in particular to a language model fine-tuning method for low-resource agglutinative language text classification.

Background technique

[0002] Text classification is the backbone of most natural language processing tasks, such as sentiment analysis, news topic classification, and intent recognition. Although deep learning models have achieved state-of-the-art results in many natural language processing (NLP) tasks, these models are trained from scratch, which requires large data sets. However, many low-resource languages lack the richly annotated data sets needed to support the various tasks in text classification.

[0003] The main challenges of low-resource agglutinative text classification are the lack of labeled data in the target domain and the morphological diversity of derived words in the linguistic structure. For low-resource agglutinative languages such as Uyghur, Kazakh, and Kyrgyz ...

Claims


Application Information

IPC(8): G06F16/35; G06F40/211; G06F40/284; G06F40/30; G06N3/04; G06N3/08
CPC: G06F16/35; G06F40/211; G06F40/284; G06F40/30; G06N3/08; G06N3/047; Y02D10/00
Inventor: 柯尊旺, 李哲, 蔡茂昌, 曹如鹏
Owner XINJIANG UNIVERSITY