Short text query expansion and indexing method based on word vector

A query expansion and short text technology, applied in the field of word-vector-based short text query expansion and retrieval. It addresses reduced retrieval accuracy, topic drift and noise, and it avoids having to specify the number of clusters or run an iterative process, which reduces time complexity while still meeting the clustering requirements.

Active Publication Date: 2015-07-08
DALIAN UNIV OF TECH

AI Technical Summary

Problems solved by technology

The disadvantage of this type of method is that, given a search term, the search engine can return only documents that contain that term; it cannot return documents that are semantically related but worded differently.
The disadvantage of this type of method is that it introduces a lot of noise: although the recall of the retrieval system improves to some extent, a large amount of irrelevant text is also returned, which reduces search accuracy.
[0016] These methods only enrich the semantic representation of the query words. They do not attempt to understand the user's query intent; instead they find words similar to each individual query word and use them for expansion, which can easily lead to topic drift and the introduction of noise.

Embodiment

[0068] To illustrate the working process of the system in detail, its specific steps are described below with a concrete example.

[0069] A. Short text corpus information preprocessing

[0070] Short texts and forwarded texts of fewer than 20 characters are deleted outright. The remaining texts in the corpus are segmented into words. A corpus dictionary is built that records the number of occurrences of each word, and words that appear too infrequently are removed. An inverted index is then created for the remaining short texts.
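
The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the 20-character cutoff comes from the text, while the function name `preprocess`, the whitespace tokenizer (a real Chinese corpus would need a word segmenter) and the minimum word count of 2 are hypothetical choices.

```python
from collections import Counter, defaultdict

MIN_CHARS = 20  # cutoff stated in the text
MIN_COUNT = 2   # hypothetical threshold for "too infrequent" words


def preprocess(texts, tokenize=str.split, min_chars=MIN_CHARS, min_count=MIN_COUNT):
    """Filter short texts, build a dictionary with counts, and an inverted index."""
    # 1. Drop texts shorter than the character cutoff.
    kept = [t for t in texts if len(t) >= min_chars]
    # 2. Segment each remaining text (whitespace split is an assumption).
    tokenized = [tokenize(t) for t in kept]
    # 3. Count occurrences of every word across the corpus.
    counts = Counter(w for doc in tokenized for w in doc)
    # 4. Keep only words that occur often enough.
    vocab = {w for w, c in counts.items() if c >= min_count}
    # 5. Inverted index: word -> set of ids of the texts that contain it.
    index = defaultdict(set)
    for doc_id, doc in enumerate(tokenized):
        for w in doc:
            if w in vocab:
                index[w].add(doc_id)
    return kept, counts, vocab, dict(index)


texts = [
    "word vectors improve short text search",
    "too short",
    "short text search needs query expansion",
]
kept, counts, vocab, index = preprocess(texts)
```

In this toy corpus the second text is discarded by the length filter, and only the words shared by the two remaining texts survive the frequency filter and enter the index.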

[0071] B. Train a model to represent each word in the corpus dictionary as a word vector

[0072] As shown in Figure 2, each word is encoded and classified, and a logistic regression model is trained on its context information to perform the classification, which yields the vector representation of each word.

[0073] For ease of illustration, assume that the input data X = [0.2, -0.1, 0.3, -0.2]^T, training to...
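
To make the logistic-regression training concrete, here is a minimal sketch of a single gradient step on one binary decision, in the style of word2vec-type training. It uses the input X given above; the weight vector `theta`, the label and the learning rate are hypothetical values, and whether the patent uses exactly this update rule is an assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(x, theta, label, lr):
    """One stochastic gradient step on the log-loss of a logistic classifier.

    Both the classifier weights and the input vector are updated, which is
    how the input (word) vector itself gets trained.
    """
    z = sum(xi * ti for xi, ti in zip(x, theta))
    p = sigmoid(z)          # predicted probability of label 1
    g = lr * (label - p)    # scaled prediction error
    new_theta = [ti + g * xi for xi, ti in zip(x, theta)]
    new_x = [xi + g * ti for xi, ti in zip(x, theta)]
    return p, new_x, new_theta


x = [0.2, -0.1, 0.3, -0.2]      # input vector X from the text
theta = [0.0, 0.0, 0.0, 0.0]    # hypothetical initial classifier weights
p, new_x, new_theta = sgd_step(x, theta, label=1, lr=0.1)
```

With zero initial weights the predicted probability is 0.5, so the first step moves `theta` toward X and leaves X itself unchanged; later steps would update both.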

Abstract

The invention discloses a short text query expansion and indexing method based on word vectors. The method comprises: A, preprocessing the short text corpus; B, training a model to represent every word in the corpus dictionary as a word vector; C, query expansion; D, obtaining a candidate text set using the query expansion word set and a BM25 index model; E, extracting the topic of each short text; F, computing the text vector of each short text; G, re-ranking the short texts returned by the traditional indexing model. The method satisfies the user's retrieval needs more accurately and effectively; moreover, the query expansion module can find, from the existing data, words that express the user's intent and use them for query expansion.
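
Step D can be illustrated with a small BM25 scorer applied to an already expanded query. This is a hedged sketch rather than the patent's implementation: the parameters `k1 = 1.5` and `b = 0.75`, the smoothed IDF variant, the `top_k` cutoff, and the example expansion of "phone" with "mobile" are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each word.
    df = Counter(w for d in docs for w in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for q in set(query_terms):
            if tf[q] == 0:
                continue
            # Smoothed IDF that stays non-negative (one common variant).
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[q] * (k1 + 1) / norm
        scores.append(s)
    return scores


def candidate_set(query_terms, docs, top_k=2):
    """Return ids of the top_k documents with a positive BM25 score."""
    scores = bm25_scores(query_terms, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:top_k] if scores[i] > 0]


docs = [
    ["phone", "battery", "life"],
    ["laptop", "screen"],
    ["mobile", "battery", "charger"],
]
expanded_query = ["phone", "mobile"]  # hypothetical expansion of "phone"
scores = bm25_scores(expanded_query, docs)
candidates = candidate_set(expanded_query, docs)
```

The third document does not contain the original query word "phone" but still enters the candidate set through the expansion word, which is the effect query expansion is meant to achieve.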

Description

Technical field

[0001] The invention relates to the technical fields of data mining and search engines, and in particular to a short text query expansion and retrieval method based on word vectors.

Background

[0002] With the rapid development of computers and the Internet, it is becoming increasingly difficult to obtain accurate information from massive information resources. A large part of this information exists in the form of short text, which is also an indispensable form of data in people's daily lives. Short text mainly includes blog messages, microblog posts, SMS messages, chat records and the like, and is characterized by short length, flexible language, huge data scale, strong timeliness and fast updates. Traditional search engines are not very accurate at retrieving such short texts and cannot meet people's needs for accurate information. Therefore, the present invention designs and implements a search en...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventors: 林鸿飞 (Lin Hongfei), 王琳 (Wang Lin)
Owner: DALIAN UNIV OF TECH