Efficient short text similarity determination method and device

A short text similarity determination method and device, applicable to character and pattern recognition, instruments, computation, and related fields. It addresses problems such as bumps in text vectors, large fluctuations in similarity, and excessively high IDF values for low-frequency words.

Pending Publication Date: 2022-04-29
ALIPAY (HANGZHOU) INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

[0005] However, similarity measurement for short texts has several problems: 1) character-level similarity measures ignore word order; 2) for large-scale corpora, the dimension of one-hot vectors grows linearly, and data sparseness makes similarity scores hard to distinguish, while for small-scale corpora the TF-IDF formula makes the IDF value of low-frequency words too large, which produces bumps in the text vector and large fluctuations in similarity.

Embodiment Construction

[0018] Term frequency-inverse document frequency (TF-IDF) is a commonly used weighting technique in information retrieval and text mining. It evaluates the importance of a single word to a text in a text database or corpus. The importance of a word increases in proportion to the number of times it appears in the document (its term frequency, TF) and decreases with the frequency with which it appears across the corpus, which the inverse document frequency (IDF) captures. If a word is relatively rare in the corpus but appears many times in a given text, it probably reflects the characteristics of that text and is a desired keyword.
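As a rough illustration of the weighting described in [0018], the sketch below computes one common TF-IDF variant: TF as the count divided by the text length, and IDF as log(N / df). The excerpt does not state which variant the application uses, so the formula, the function name `tf_idf`, and the sample texts are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical, already-segmented short texts.
docs = [
    ["transfer", "money", "to", "bank", "card"],
    ["transfer", "money", "to", "friend"],
    ["open", "bank", "card"],
]
w = tf_idf(docs)
```

In this toy corpus "friend" occurs in only one text, so it receives a higher weight in the second text than "transfer", which occurs in two of the three texts.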

[0019] To find the keywords of a text, the text can first be segmented into words, and the word frequency of each word can then be counted. Word frequency refers to the number of times a given word appears in the text. The keywords of a text tend to appear more frequently in it. However, high-frequency words with no signif...

Abstract

The invention relates to an efficient short text similarity determination method. The method comprises the following steps: performing word segmentation on short texts in a corpus to obtain corresponding word sequences; determining a penalty based on the total number of the short texts in the corpus, wherein the penalty is reduced along with the increase of the total number of the short texts in the corpus; determining a word frequency of each word in the word sequence and an adjusted inverse document frequency, wherein the adjusted inverse document frequency is calculated based on the penalty; weighting the word frequency of each word by using the adjusted inverse document frequency; combining the weighted word frequency of each word in the word sequence to determine a word frequency vector of the short text; and determining the similarity between the short text and other short texts based on the word frequency vector. The disclosure also relates to other related aspects.
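The steps in the abstract can be sketched roughly as follows. The excerpt does not give the penalty formula, the adjusted IDF formula, or the similarity measure, so `penalty`, `adjusted_idf`, the parameter `alpha`, and the use of cosine similarity are hypothetical choices that only respect the stated property that the penalty shrinks as the corpus grows. Whitespace splitting stands in for a real word segmenter.

```python
import math
from collections import Counter

def penalty(n_texts, alpha=1.0):
    # Hypothetical penalty: positive and strictly decreasing in corpus size.
    return alpha / math.log(math.e + n_texts)

def adjusted_idf(n_texts, doc_freq, pen):
    # Hypothetical adjustment: the penalty enlarges the denominator, which
    # damps the IDF of rare words most strongly when the corpus is small.
    return math.log(1.0 + n_texts / (doc_freq + pen))

def vectorize(corpus):
    """Map each short text to a weighted word-frequency vector."""
    # Assumption: whitespace splitting replaces a real word segmenter.
    token_lists = [text.split() for text in corpus]
    n = len(token_lists)
    pen = penalty(n)
    # Document frequency of each word across the corpus.
    df = Counter(w for tokens in token_lists for w in set(tokens))
    vocab = sorted(df)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Weight each raw word count by the adjusted IDF.
        vectors.append([
            counts[w] * adjusted_idf(n, df[w], pen) for w in vocab
        ])
    return vocab, vectors

def cosine(u, v):
    # Cosine similarity is an assumed choice of similarity measure.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical short texts.
corpus = [
    "how to transfer money to a bank card",
    "transfer money to my bank card",
    "how to change the login password",
]
vocab, vectors = vectorize(corpus)
sim_close = cosine(vectors[0], vectors[1])
sim_far = cosine(vectors[0], vectors[2])
```

With these assumptions, the first two texts, which share most of their words, score higher than the first and third, which share only "how" and "to".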

Description

Technical field

[0001] The present application relates generally to natural language processing (NLP), and in particular to efficient short text similarity determination.

Background

[0002] Text similarity measurement is a common problem in NLP. Academia and industry have studied different measurement methods for long texts and short texts.

[0003] Similarity measurement for long text usually follows one of two paradigms: 1. Represent characters or words as vectors, aggregate them into a vector representation of the long text, and then calculate the similarity; common examples are Word2vec and the bag-of-words (BoW) model. 2. Introduce a deep learning network structure that learns a vector for the sentence or text from its contextual semantics, with ELMo and BERT as common examples, and calculate the similarity directly from the resulting sentence vectors.

[0004] For short text similarity measurement methods, there are usually two ...

Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06K9/62, G06F40/289, G06F40/216
CPC: G06F40/289, G06F40/216, G06F18/22
Inventor: 刘东亚
Owner: ALIPAY (HANGZHOU) INFORMATION TECH CO LTD