Text Semantic Representation Method Based on Aggregation Weighted Matrix Compression Algorithm

A compression algorithm and weighted-matrix technique, applied in fields such as computing, instrumentation, and electrical digital data processing, addresses three problems: the combination of word vectors being ignored, information missing from document vectors, and biased weight estimation.

Active Publication Date: 2019-01-29
NANKAI UNIV

AI Technical Summary

Problems solved by technology

Word-vector-based methods for generating document vectors usually focus on how the word vectors are generated and ignore how they are combined.
Weighted accumulation tends to let a few heavily weighted word vectors represent the entire document, so the resulting document vector loses information, which in turn causes the similarity between documents to be misestimated.
In addition, existing methods do not consider the semantic similarity among the word vectors themselves when estimating word vector weights, which biases the weight estimates.



Examples


Embodiment Construction

[0017] 1. Concepts and definitions related to the present invention.

[0018] 1. Vocabulary

[0019] For a given corpus, the set of all distinct words v in the corpus is called the vocabulary V.

[0020] 2. Word frequency in the document

[0021] For any word $w_i$, its frequency of occurrence in any document $d_k$ is defined as the word frequency in the document, computed as:

[0022] $\mathrm{tf}(w_i, d_k) = \frac{c_{i,k}}{l_k}$

[0023] where $c_{i,k}$ is the number of occurrences of $w_i$ in document $d_k$, and $l_k$ is the length of document $d_k$.

[0024] 3. Document frequency

[0025] For any word $w_i$, the document frequency is defined as:

[0026] $\mathrm{df}(w_i) = \frac{n_i}{n}$

[0027] where $n$ is the number of documents in the given corpus, and $n_i$ is the number of documents in the corpus that contain the word $w_i$.

[0028] 4. Inverse document frequency

[0029] The natural logarithm of the reciprocal of the document frequency is generally called the inverse document frequency:

[0030] $\mathrm{idf}(w_i) = \ln \frac{n}{n_i}$
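The three statistics defined above can be sketched in a few lines of Python. This is a minimal illustration, assuming each document is already tokenized into a list of words and that document length means the token count; the function names and the toy corpus are hypothetical and do not come from the patent text.

```python
import math
from collections import Counter


def term_frequency(word, doc):
    """Occurrences of `word` in `doc` divided by the document length."""
    if not doc:
        return 0.0
    return Counter(doc)[word] / len(doc)


def document_frequency(word, corpus):
    """Fraction of documents in `corpus` that contain `word` (n_i / n)."""
    n_i = sum(1 for doc in corpus if word in doc)
    return n_i / len(corpus)


def inverse_document_frequency(word, corpus):
    """Natural log of the reciprocal of the document frequency."""
    df = document_frequency(word, corpus)
    if df == 0:
        # Assumption: a word absent from the corpus has no defined value.
        raise ValueError(f"{word!r} does not occur in the corpus")
    return math.log(1.0 / df)


# Toy corpus (illustrative data only): four tokenized documents.
corpus = [
    ["text", "vector", "text"],
    ["word", "vector"],
    ["text", "mining"],
    ["data", "mining"],
]

# The vocabulary is the set of all distinct words in the corpus.
vocabulary = sorted({w for doc in corpus for w in doc})

print(vocabulary)
print(term_frequency("text", corpus[0]))           # 2 occurrences / length 3
print(document_frequency("text", corpus))          # 2 of 4 documents
print(inverse_document_frequency("text", corpus))  # ln(4 / 2)
```

Raising an error for a word that never occurs is one possible design choice; the excerpt does not say how such words should be treated.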

[0031] 5. Co-occurrence frequency

[0032] F...


Abstract

The invention belongs to the technical field of data mining and provides a text semantic representation method based on an aggregation weighted matrix compression algorithm. The method constructs a global smooth context matrix to weaken the influence of language-usage habits on word semantics; it weights the global smooth context with a word vector aggregation algorithm, which corrects the way word vector weights are measured; and finally it represents a text by using an infinity-norm compression algorithm to convert the global smooth context into a vector. Compared with existing methods, the method retains the largest value in each dimension of the vectors and therefore represents the text information more fully.
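To illustrate the contrast the abstract draws, the sketch below compares a plain weighted sum of word vectors with a per-dimension infinity-norm compression. It is a minimal sketch under stated assumptions, not the patented procedure: the rows of the matrix are taken to be weighted word vectors, the compression keeps, for each dimension, the entry with the largest absolute value, and the sign of that entry is preserved. The construction of the global smooth context matrix and the aggregation weights are not shown in this excerpt, so the weights and all names here are hypothetical.

```python
def weighted_sum(vectors, weights):
    """Baseline: accumulate weighted word vectors dimension by dimension."""
    dims = len(vectors[0])
    return [sum(w * v[j] for v, w in zip(vectors, weights)) for j in range(dims)]


def infinity_norm_compress(vectors, weights):
    """Keep, per dimension, the weighted entry of largest absolute value.

    Assumption: the sign of the selected entry is preserved.
    """
    dims = len(vectors[0])
    weighted = [[w * x for x in v] for v, w in zip(vectors, weights)]
    return [max((row[j] for row in weighted), key=abs) for j in range(dims)]


# Illustrative data only: three 2-dimensional word vectors and their weights.
vectors = [[1.0, 0.0], [0.0, -2.0], [0.5, 0.5]]
weights = [1.0, 1.0, 2.0]

print(weighted_sum(vectors, weights))            # [2.0, -1.0]
print(infinity_norm_compress(vectors, weights))  # [1.0, -2.0]
```

In this toy case the weighted sum partly cancels the strong second-dimension signal of the second word, while the compression keeps it intact, which is the kind of information loss the abstract attributes to accumulation.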

Description

Technical field

[0001] The invention belongs to the technical field of data mining and proposes a text representation method based on an aggregation weighted matrix compression algorithm.

Background

[0002] As a widespread form of data (web pages of various types, digitized books, log files, etc.), plain text has long been a focus of research in natural language processing, machine learning, and artificial intelligence. As network data continues to grow, techniques for analyzing and processing text data are becoming increasingly important. Traditional text processing techniques share a common requirement: text presented as natural-language paragraphs must be transformed into a computable form. In particular, converting text into a fixed-size vector as input is a prerequisite for many text processing techniques. Taking text cluster analysis as an example, the core idea of many clustering algorit...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/27
Inventor: 卫金茂, 韦阳, 徐恒鹏, 樊文哲
Owner: NANKAI UNIV