Text Semantic Representation Method Based on Aggregation Weighted Matrix Compression Algorithm

The method combines a compression algorithm with weighted-matrix techniques and applies to computing, instrumentation, and electrical digital data processing. It addresses three problems: neglect of how word vectors are combined, loss of information in document vectors, and bias in weight estimation.

Active Publication Date: 2019-01-29

NANKAI UNIV

Cites: 5. Cited by: 0.


AI Technical Summary

Problems solved by technology

Existing methods that generate document vectors from word vectors usually focus on how the word vectors are generated and ignore how they are combined.

The weighted accumulation method tends to let a few word vectors with large weights represent the entire document. The resulting document vector loses information, which in turn causes the similarity between documents to be misestimated.
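The dominance effect described above can be illustrated with a minimal sketch. The two word vectors and their weights are hypothetical values chosen for illustration, not data from the patent, and `weighted_sum` and `cosine` are illustrative helper names.

```python
import math

def weighted_sum(vectors, weights):
    """Weighted accumulation: add up weights[i] * vectors[i], dimension by dimension."""
    dim = len(vectors[0])
    return [sum(w * v[j] for v, w in zip(vectors, weights)) for j in range(dim)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two hypothetical, orthogonal word vectors with very unequal weights.
word_vectors = [[1.0, 0.0], [0.0, 1.0]]
weights = [10.0, 0.1]

doc_vector = weighted_sum(word_vectors, weights)

# The document vector is almost parallel to the heavily weighted word vector,
# so the lightly weighted word contributes almost nothing to similarity scores.
similarity_to_heavy = cosine(doc_vector, word_vectors[0])
similarity_to_light = cosine(doc_vector, word_vectors[1])
```

In this toy case the document vector is nearly indistinguishable from the heavily weighted word vector, which is the information loss the passage refers to.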

In addition, existing methods do not consider the semantic similarity among the word vectors themselves when estimating word vector weights, which biases the weight estimates.


Embodiment Construction

[0017] 1. Concepts and definitions related to the present invention.

[0018] 1. Vocabulary

[0019] For a given corpus, the set of all distinct words v in the corpus is called the vocabulary V.

[0020] 2. Word frequency in the document

[0021] For any word $w_i$, its frequency of occurrence in a document $d_k$ is defined as its word frequency in the document, computed as:

[0022] $$\mathrm{tf}_{i,k} = \frac{c_{i,k}}{l_k}$$

[0023] where $c_{i,k}$ is the number of occurrences of $w_i$ in document $d_k$, and $l_k$ is the length of document $d_k$.
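As a minimal sketch of this definition, assuming a document is already tokenized into a list of words: the function name `term_frequency`, the sample document, and the convention of returning 0.0 for an empty document are assumptions for illustration, not part of the patent text.

```python
def term_frequency(word, document):
    """Occurrences of `word` in `document` divided by the document length."""
    if not document:
        # Assumed convention: an empty document gives frequency 0.
        return 0.0
    return document.count(word) / len(document)

# Hypothetical tokenized document of length 6.
doc = ["the", "cat", "sat", "on", "the", "mat"]
tf_the = term_frequency("the", doc)  # 2 occurrences out of 6 tokens
```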

[0024] 3. Document frequency

[0025] For any word $w_i$, its document frequency is defined as:

[0026] $$\mathrm{df}_i = \frac{n_i}{n}$$

[0027] where $n$ is the number of documents in the given corpus and $n_i$ is the number of documents in the corpus that contain the word $w_i$.

[0028] 4. Inverse document frequency

[0029] The natural logarithm of the reciprocal of the document frequency is generally called the inverse document frequency:

[0030] $$\mathrm{idf}_i = \ln\frac{n}{n_i}$$
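The document frequency and inverse document frequency defined above can be sketched as follows, assuming the corpus is a non-empty list of tokenized documents. The function names and the sample corpus are hypothetical, and raising an error for a word that appears in no document is an assumed convention, since the definition does not cover that case.

```python
import math

def document_frequency(word, corpus):
    """n_i / n: fraction of documents in `corpus` that contain `word`."""
    n = len(corpus)
    n_i = sum(1 for document in corpus if word in document)
    return n_i / n

def inverse_document_frequency(word, corpus):
    """ln(n / n_i): natural log of the reciprocal of the document frequency."""
    df = document_frequency(word, corpus)
    if df == 0:
        # Assumed convention: the reciprocal is undefined for unseen words.
        raise ValueError(f"{word!r} does not occur in the corpus")
    return math.log(1.0 / df)

# Hypothetical corpus of four tokenized documents.
corpus = [
    ["text", "mining", "method"],
    ["text", "vector"],
    ["word", "vector", "weight"],
    ["matrix", "compression"],
]
df_text = document_frequency("text", corpus)           # 2 of 4 documents
idf_text = inverse_document_frequency("text", corpus)  # ln(4 / 2)
```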

[0031] 5. Co-occurrence frequency

[0032] F...


Abstract

The invention belongs to the technical field of data mining and provides a text semantic representation method based on an aggregation weighted matrix compression algorithm. The method has three steps. First, it constructs a global smooth context matrix, which weakens the influence of language-usage habits on word semantics. Second, it weights the global smooth contexts with a word vector aggregation algorithm, which corrects the way word vector weights are measured. Finally, it represents each text by converting the global smooth contexts into a vector with an infinity-norm compression algorithm. Compared with existing methods, this retains the largest value in each dimension of the vectors and represents the text information more fully.
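The abstract does not spell out the compression step, so the following is only a rough sketch of one plausible reading: each row of a matrix is one word's context vector, each row is scaled by a word weight, and the matrix is collapsed to a single vector by taking the infinity norm (largest absolute value) of each column. The matrix values, the weights, the helper names, and the choice to discard signs are all assumptions made for illustration. The smoothing and aggregation steps that would produce the real matrix and weights are not reproduced here.

```python
def weight_rows(matrix, weights):
    """Scale each row (one word's context vector) by that word's weight."""
    return [[w * x for x in row] for row, w in zip(matrix, weights)]

def infinity_norm_compress(matrix):
    """Collapse a matrix to one vector: the largest absolute value in each column."""
    return [max(abs(x) for x in column) for column in zip(*matrix)]

# Hypothetical 3-word, 4-dimensional context matrix and word weights.
context = [
    [0.2, -0.9, 0.1, 0.0],
    [0.7, 0.3, -0.4, 0.5],
    [-0.1, 0.2, 0.8, -0.6],
]
weights = [1.0, 0.5, 2.0]

# One fixed-size vector for the whole text, regardless of how many words it has.
text_vector = infinity_norm_compress(weight_rows(context, weights))
```

Unlike a weighted sum, a column-wise maximum lets every dimension be set by whichever word is strongest in that dimension, which matches the abstract's claim that the largest value of each dimension is retained.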

Description

Technical field

[0001] The invention belongs to the technical field of data mining, and specifically proposes a text representation method based on an aggregation weighted matrix compression algorithm.

Background art

[0002] As a widespread form of data (web pages of various types, digitized books, log files, etc.), plain text has always been a focus of research in natural language processing, machine learning, and artificial intelligence. As network data keeps growing, techniques for analyzing and processing text data are becoming more and more important. A common requirement of traditional text processing techniques is that text presented in natural paragraphs must be transformed into a computable data form. In particular, converting text into a fixed-size vector to serve as input is a prerequisite for many text processing techniques. Taking text clustering analysis as an example, the core idea of many clustering algorit...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F17/27

Inventors: 卫金茂, 韦阳, 徐恒鹏, 樊文哲

Owner: NANKAI UNIV
