Text structuring technology based on sliding window and random discrete sampling

A text structuring technology based on discrete sampling and a sliding window, applied in the fields of natural language processing and deep learning, addresses the problem that short texts yield unclear semantic representations, and achieves the effect of improving semantic representation and classification accuracy.

Pending Publication Date: 2021-08-10
XIANGTAN UNIV

AI Technical Summary

Problems solved by technology

To solve the problem that semantic representation is weak because the text is too short, the method is implemented in PyTorch, an open-source neural network framework from Facebook based on Python. First, each text in the training set is divided into two subsequence matrices, each with strong semantics. The two matrices then iteratively enhance each other's semantics. Finally, multi-class classification is performed on the resulting feature matrix, and the category with the largest weight in the result is selected as the final classification result.
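The final step above selects the category with the largest weight from the classifier's output. A minimal sketch of that selection, using a softmax over raw scores (the patent does not specify its output normalization, so `classify` and the four-category example are illustrative assumptions):

```python
import numpy as np

def classify(logits):
    """Hypothetical helper: normalize raw class scores and pick the
    category with the largest weight, as the method's final step does."""
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights = weights / weights.sum()
    return int(np.argmax(weights)), weights

# Raw scores for four hypothetical categories; category 1 wins.
label, w = classify(np.array([0.2, 1.5, -0.3, 0.9]))
```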




Embodiment Construction

[0025] The practical application environment of the present invention is short text classification. The present invention is further described in detail below in conjunction with the accompanying drawings.

[0026] When the present invention is implemented, the steps shown in Figure 1 are as follows:

[0027] S1: Input the text to be classified; first perform word segmentation on the text, then train word vectors with Word2Vec, and then add word position information to obtain new word vectors;
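The "add word position information" part of S1 can be sketched by summing a positional encoding into the Word2Vec vectors. The patent does not specify its position scheme, so the sinusoidal encoding below (from "Attention Is All You Need") and the 6-token, 8-dimensional example are assumptions:

```python
import numpy as np

def positional_encoding(seq_len, dim):
    # Sinusoidal positional encoding: sin on even dimensions, cos on odd ones.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

word_vecs = np.random.randn(6, 8)                 # 6 tokens, 8-dim Word2Vec vectors
new_vecs = word_vecs + positional_encoding(6, 8)  # S1's "new word vectors"
```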

[0028] S2: After obtaining the text matrix composed of word vectors, use the sliding window method to obtain multiple subsequences whose contexts are close together, forming a new text matrix;
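The sliding window in S2 extracts contiguous subsequences. A minimal sketch, where the window size and stride are illustrative assumptions (the patent does not give its values):

```python
def sliding_windows(tokens, size=3, stride=1):
    """Extract contiguous subsequences (close contexts) from a token sequence."""
    return [tokens[i:i + size]
            for i in range(0, len(tokens) - size + 1, stride)]

windows = sliding_windows(["w1", "w2", "w3", "w4", "w5"], size=3, stride=1)
# -> [['w1','w2','w3'], ['w2','w3','w4'], ['w3','w4','w5']]
```

Stacking the vectors of each window's tokens yields the "new text matrix" the step describes.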

[0029] S3: Use random discrete sampling to obtain multiple subsequences that are distant in context but can enhance semantics, forming a new text matrix;
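In contrast to S2's contiguous windows, S3 samples non-adjacent positions. A sketch of one plausible reading, where token positions are drawn at random but kept in original order; the subsequence length and count are illustrative assumptions:

```python
import random

def random_discrete_sample(tokens, k=3, n_subseqs=2, seed=None):
    """Draw subsequences of k tokens at discrete (non-contiguous) random
    positions, preserving original word order within each subsequence."""
    rng = random.Random(seed)
    subseqs = []
    for _ in range(n_subseqs):
        idx = sorted(rng.sample(range(len(tokens)), k))
        subseqs.append([tokens[i] for i in idx])
    return subseqs

samples = random_discrete_sample(list("abcdefg"), k=3, n_subseqs=2, seed=0)
```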

[0030] S4: Input the matrices obtained in S2 and S3 into the Encoder layers belonging to ...



Abstract

The invention is applicable to the field of short text classification and provides a processing technology based on a sliding window and random discrete sampling, aimed at the problem that semantic representation is not obvious when a text is too short. The specific scheme comprises the following steps: S1, inputting a text, and performing word segmentation and training on the text to obtain a plurality of word vectors; S2, after a text matrix composed of the word vectors is obtained, adopting a sliding window method to obtain a plurality of subsequences with close contexts, forming a new text matrix; S3, adopting random discrete sampling to obtain a plurality of subsequences that are far apart in context but can enhance semantics, forming a new text matrix; S4, respectively inputting the matrices obtained in S2 and S3 into the Encoder layers of different Transformers belonging to the same layer, where each layer interactively enhances semantics; and S5, repeating S4 until two matrices with strong features and strong semantics are trained, inputting the two matrices into a CNN to obtain two one-dimensional vectors, splicing the two vectors, and inputting the result into a fully connected neural network for classification.
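Steps S4 and S5 of the abstract can be sketched end to end in PyTorch. Everything dimensional here is an assumption (embedding size, head count, kernel size, number of repetitions), and the cross-branch interaction is a simple weighted sum standing in for the patent's unspecified mutual-enhancement mechanism:

```python
import torch
import torch.nn as nn

class DualBranchClassifier(nn.Module):
    """Sketch of S4-S5: two Transformer encoders (one per matrix from S2/S3),
    a shared CNN, concatenation of the two pooled vectors, and a fully
    connected classifier. All sizes are illustrative assumptions."""
    def __init__(self, dim=32, n_classes=4):
        super().__init__()
        self.enc_a = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.enc_b = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.cnn = nn.Conv1d(dim, 16, kernel_size=3)
        self.fc = nn.Linear(32, n_classes)

    def forward(self, mat_a, mat_b):
        for _ in range(2):  # S5: repeat S4's encoding with interaction
            a, b = self.enc_a(mat_a), self.enc_b(mat_b)
            # Assumed stand-in for the mutual semantic enhancement step.
            mat_a, mat_b = a + 0.5 * b, b + 0.5 * a
        # CNN + max-pool each matrix down to a one-dimensional vector.
        pool = lambda m: self.cnn(m.transpose(1, 2)).amax(dim=2)
        vec = torch.cat([pool(mat_a), pool(mat_b)], dim=1)  # splice the two vectors
        return self.fc(vec)  # class weights; the largest gives the label

model = DualBranchClassifier()
out = model(torch.randn(2, 10, 32), torch.randn(2, 10, 32))  # batch of 2 texts
```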

Description

Technical field

[0001] The invention relates to natural language processing and deep learning, and belongs to the field of computer application technology. More specifically, it relates to a text structuring technology based on sliding windows and random discrete sampling.

Background technique

[0002] Google open-sourced word2vec in 2013; it is a simple and efficient toolkit for obtaining word vectors. word2vec uses two important models: the CBOW model (Continuous Bag-of-Words Model) and the Skip-gram model (Continuous Skip-gram Model). Both models contain three layers: an input layer, a projection layer, and an output layer. The CBOW model is trained by inputting the word vectors of the context words surrounding a feature word and outputting the word vector of that feature word. The idea of the Skip-gram model is just the opposite: it inputs the word vector of a specific word and outputs the word vectors of the context words surrounding that specific word. In short, it...


Application Information

IPC(8): G06F16/35; G06F40/289; G06F40/216; G06N3/04
CPC: G06F16/353; G06F40/289; G06F40/216; G06N3/045
Inventors: 刘新, 马中昊, 李广, 黄浩钰, 张远明
Owner XIANGTAN UNIV