
Cross-language text classification method based on cross-language word vector representation and classifier joint training

A text classification and word-vector technology, applied in natural language translation, natural language data processing, instruments, etc. It addresses the low classification accuracy, long training time, and large corpus requirements of existing methods, and achieves strong practicability and good performance.

Active Publication Date: 2018-12-07
HARBIN INST OF TECH

AI Technical Summary

Problems solved by technology

[0008] The purpose of the present invention is to solve the following problems: the existing method based on synonym replacement has a low classification accuracy rate; the existing translation-based method has a high accuracy rate, but training the translator requires a large amount of corpus and takes a long time, and the complexity and time consumption of that task far exceed those of the relatively simple task of text classification, so it is not practical. A cross-language text classification method based on cross-language word vector representation and classifier joint training is therefore proposed.

Method used



Examples


Specific Embodiment 1

[0025] Specific Embodiment 1: This embodiment is described with reference to figure 1. The specific process of the cross-language text classification method based on cross-language word vector representation and classifier joint training in this embodiment is:

[0026] Traditional text classification tasks usually represent each word as a one-hot vector and represent the text as a high-dimensional vector through the bag-of-words model. The dimension of the vector equals the size of the vocabulary, and the component in each dimension represents the weight of a word in the text; commonly, word frequency is used as the weight, or 0 and 1 indicate the absence or presence of the word. This bag-of-words representation causes serious sparsity and dimensionality problems, so large-scale text classification requires more computing resources. In addition, the bag-of-words representation ignores context information and word order info...
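The bag-of-words representation described above can be sketched minimally as follows. This is a generic illustration of the baseline the patent criticizes, not the patent's own method; the toy documents and function names are assumptions for demonstration:

```python
# A minimal bag-of-words sketch: each document becomes a vocabulary-sized
# vector whose components are term frequencies.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc: str) -> list[int]:
    """Term-frequency vector over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [bow_vector(d) for d in docs]
# The dimension equals the vocabulary size, so most entries are 0 for large
# vocabularies (sparsity), and word order is lost entirely.
```

With a realistic vocabulary of hundreds of thousands of words, almost every component of such a vector is zero, which is exactly the sparsity and dimensionality problem the paragraph describes.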

Specific Embodiment 2

[0049] Embodiment 2: This embodiment differs from Embodiment 1 in that the total loss function loss in step 2 is computed as follows:

[0050] The total loss function consists of three terms:

[0051] The first is the source-language loss, that is, the loss on the source language S, obtained from the source-language part of the parallel corpus;

[0052] The second is the target-language loss, that is, the loss on the target language T, obtained from the target-language part of the parallel corpus;

[0053] The third is the classifier loss;

[0054] The total loss function loss is constructed from the source-language loss, the target-language loss, and the classifier loss.

[0055] Other steps and parameters are the same as those in Embodiment 1.
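The three-term construction above can be sketched as a simple sum. The patent excerpt does not specify how the terms are combined, so an unweighted sum is assumed here purely for illustration; the function and variable names are hypothetical:

```python
def total_loss(source_loss: float, target_loss: float,
               classifier_loss: float) -> float:
    """Total loss = source-language loss + target-language loss +
    classifier loss. An unweighted sum is assumed, since the excerpt
    gives no weighting scheme."""
    return source_loss + target_loss + classifier_loss

# Each term would be computed on its own data: the two monolingual losses
# from the respective halves of the parallel corpus, the classifier loss
# from the labeled source-language training set.
loss = total_loss(1.2, 0.8, 0.5)
```

Minimizing this single scalar by gradient descent is what makes the training "joint": the word vectors receive gradients from all three terms at once.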

Specific Embodiment 3

[0056] Embodiment 3: This embodiment differs from Embodiment 1 or 2 in that the source-language loss, that is, the loss on the source language S, is obtained from the source-language part of the parallel corpus; the specific process is:

[0057] In C_s, the monolingual (C_s only) loss is:

[0058] Obj(C_s|C_s) = Σ_{s∈C_s} Σ_{w∈adj(s)} log p(w|s)

[0059] Here, C_s represents the source-language part; Obj(C_s|C_s) represents the monolingual loss on the source language in the parallel corpus; w represents one of the words in the context of the word s in the source language; p(w|s) represents the probability of predicting the words in the window of s given that the center word is s; adj(s) represents the words in the context of the word s in the source language;

[0060] The probability value p in the formula is obtained from a two-layer fully connected feedforward neural network; the process is:

[0061] The word vectors of all the words in C_s are input into the neural network as the c...
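The objective in [0058] is skip-gram-like: p(w|s) is a distribution over the vocabulary computed from the center word s. A minimal NumPy sketch, where the embedding layer and a single output layer stand in for the two-layer feedforward network (the layer shapes, sizes, and names are illustrative assumptions, not the patent's specification):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 16                 # vocabulary size and vector dimension (assumed)
E = rng.normal(size=(V, d))    # word vectors: input layer
W = rng.normal(size=(d, V))    # fully connected output layer

def p_context_given_center(s: int) -> np.ndarray:
    """Softmax distribution p(w | s) over the whole vocabulary, given the
    index s of the center word -- the probability used in Obj(C_s|C_s)."""
    logits = E[s] @ W
    logits -= logits.max()     # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def monolingual_loss(pairs: list[tuple[int, int]]) -> float:
    """Negative of sum over (center, context) pairs of log p(w|s); minimizing
    this loss maximizes the objective Obj(C_s|C_s)."""
    return -sum(np.log(p_context_given_center(s)[w]) for s, w in pairs)

loss = monolingual_loss([(3, 7), (3, 9), (5, 2)])
```

In training, the (s, w) pairs would be enumerated from every center word and its context window adj(s) in C_s, and gradients would flow into both E and W.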



Abstract

A cross-language text classification method based on cross-language word vector representation and classifier joint training is disclosed. The invention relates to a cross-language text classification method and aims to solve the following problems: the existing method based on synonym replacement has low classification accuracy; the existing translation-based method has high accuracy, but training a translator requires a large amount of corpus and takes a long time, and the task's complexity and time consumption far exceed those of the simple task of text categorization, so the existing translation-based method is not practical. The method of the invention comprises the steps of (1) performing corpus preprocessing, (2) optimizing a total loss function, which corresponds to a set of word vectors and a classifier, by a gradient optimization method so that the total loss function reaches a minimum value, and (3) taking the label with the highest probability as the classification result of a test text in the target language T, comparing the result with the standard result of the test set, and obtaining the test accuracy and recall indexes. The method is used in the field of cross-language text classification.
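Step (3) of the abstract, taking the highest-probability label and scoring against the test set, can be sketched generically as follows. This is a standard evaluation sketch, not the patent's code; the example labels and probabilities are invented for illustration:

```python
def classify(probabilities: list[float], labels: list[str]) -> str:
    """Pick the label with the highest predicted probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best]

def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of test texts whose predicted label matches the standard result."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def recall(predicted: list[str], gold: list[str], cls: str) -> float:
    """Per-class recall: fraction of gold instances of `cls` recovered."""
    relevant = [p for p, g in zip(predicted, gold) if g == cls]
    return sum(p == cls for p in relevant) / len(relevant)

labels = ["sports", "politics"]
probs = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]]     # classifier outputs on T
pred = [classify(p, labels) for p in probs]
gold = ["sports", "politics", "politics"]        # standard results of the test set
acc = accuracy(pred, gold)                       # 2 of 3 correct
```

Here the classifier is applied to target-language (T) test texts even though it was trained jointly with source-language data, which is the point of the cross-language setup.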

Description

Technical field

[0001] The invention relates to a cross-language text classification method.

Background technique

[0002] Text classification is one of the most important fundamental technologies in the fields of natural language processing, machine learning, and information retrieval. Its task is to classify a piece of text into a specific category, or to apply one or more labels to a piece of text. It is also an important field of research.

[0003] The background of the cross-language text classification task is: there are texts in two languages, respectively defined as the source-language text and the target-language text, and the training corpus in the target language is insufficient to train a text classifier with qualified performance, so assistance from the source language is required. The goal of the task is to train a text classifier on the source language, so that the classifier can be tested on the target-language text and achieve good classification...

Claims


Application Information

Patent Type & Authority Applications(China)
IPC(8): G06K9/62; G06F17/27; G06F17/28
CPC: G06F40/247; G06F40/40; G06F18/24; G06F18/214
Inventor 曹海龙杨沐昀赵铁军高国骥
Owner HARBIN INST OF TECH