Subject-based searching method and device

A topic-based search technology, applied in the computer field, that addresses problems such as relevant documents being difficult to rank near the top or impossible to recall, and redundant query expressions.

Status: Inactive
Publication Date: 2013-12-04
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD
Cites: 3; Cited by: 46

AI Technical Summary

Problems solved by technology

[0004] First, when a document shares few or no keywords with the query, it is difficult for the document to rank near the top, and it may not be recalled at all, even if its actual content satisfies the query.
For example, if a user enters the query "beautiful Lincoln", some documents contain phrases such as "Lincoln with a streamlined design" or "white, black or red Lincoln". These documents also concern the appearance of the Lincoln, but because the keyword "beautiful" does not appear in them, they may not be recalled or ranked near the top, although in fact these documents reflect that users want to...

Examples

Embodiment 1

[0068] Figure 1 is the main flowchart of the subject-based search method provided by Embodiment 1 of the present invention. As shown in Figure 1, the method may include the following steps:

[0069] Step 101: Use the topic analysis model to perform topic analysis on the query input by the user to determine the topic distribution corresponding to the query, and use the topic analysis model to perform topic analysis on each document in the document library to determine the topic distribution corresponding to each document.

[0070] The topic analysis model used in this step is established in advance and contains the subject terms of each topic and the weight of each subject term within the topic to which it belongs. Using the topic analysis model, the topic distribution corresponding to the query and the topic distribution corresponding to each document can be determined. The establishment process and content of the topic analysis model will be described i...
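As a minimal sketch of what such a model and a topic distribution might look like, the Python below stores each topic as a mapping from subject terms to weights and scores a text against it. The topic names, terms, weights, whitespace tokenization, and the sum-and-normalize scoring rule are all illustrative assumptions; the patent's actual inference procedure is not shown in this excerpt.

```python
from collections import Counter

# Hypothetical toy model: topic -> {subject term: weight within that topic}.
# All names and numbers here are made up for illustration.
TOPIC_MODEL = {
    "car_appearance": {"beautiful": 0.4, "streamlined": 0.3, "design": 0.2, "red": 0.1},
    "car_price": {"price": 0.5, "cheap": 0.3, "discount": 0.2},
}


def topic_distribution(text, model=TOPIC_MODEL):
    """Return a normalized {topic: probability} mapping for `text`."""
    # Assumption: whitespace tokenization; real text would need word segmentation.
    counts = Counter(text.lower().split())
    scores = {
        topic: sum(weight * counts[term] for term, weight in terms.items())
        for topic, terms in model.items()
    }
    total = sum(scores.values())
    if total == 0:
        # No subject term matched: fall back to a uniform distribution.
        return {topic: 1.0 / len(model) for topic in model}
    return {topic: score / total for topic, score in scores.items()}
```

Under these assumptions, the query "beautiful Lincoln" and the document text "Lincoln with a streamlined design" are both assigned entirely to the hypothetical `car_appearance` topic, even though they share no descriptive keyword.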

Embodiment 2

[0077] In the embodiments of the present invention, the topic analysis model may be a probability model that describes topics, including but not limited to the Probabilistic Latent Semantic Analysis (PLSA) model, the Latent Dirichlet Allocation (LDA) model, and so on.

[0078] LSA uses mathematical and statistical methods to extract the terms in documents, infer the semantic relationships between them, and build a semantic index, organizing documents into a semantic space in which terms with high semantic relevance are mapped to the same topic. Building on the latent semantic indexing of LSA, PLSA uses a probability model to describe the relationships between documents and latent semantics and between latent semantics and terms. The latent semantics are what the embodiments of the present invention refer to as topics.
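For reference, the standard PLSA formulation (given here in its general textbook form, not quoted from the patent) expresses the co-occurrence of a document and a term through a latent topic. The symbols d (document), w (term), z (latent topic) and Z (the set of topics) are notation assumed for this sketch:

```latex
P(d, w) = P(d) \sum_{z \in Z} P(w \mid z)\, P(z \mid d)
```

Read against the embodiments, P(z | d) plays the role of the topic distribution of a document, and P(w | z) plays the role of the weight of a subject term within its topic.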

[0079] LDA is an unsupervised machine learning technique used to identify hidden topic information in large-scale document collections or corpora. It uses ...

Embodiment 3

[0091] Figure 2 is the detailed flowchart of the topic-based search method provided by Embodiment 3 of the present invention. As shown in Figure 2, the process specifically includes the following steps:

[0092] Step 201: Analyze the subject terms of each document in the document library to obtain the subject term set of each document.

[0093] Conventionally, the keywords of a document are analyzed by first segmenting the document into words and then selecting keywords based on TF or TF-IDF, that is, selecting the words whose TF or TF-IDF meets a requirement. This method usually performs well, but for documents whose words are scattered, word-frequency statistics show no obvious characteristics. In addition, in some cheating documents the cheater piles up words that have nothing to do with the topic of the text, so a selection based purely on word frequency obviously cannot reflect the topic accurately. Therefore, the embodiment of the present inven...
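The conventional TF-IDF selection described in this paragraph can be sketched roughly as follows. This is only the baseline, not the patent's improved method (which is cut off in this excerpt). The function name `tfidf_keywords`, the `top_k` cutoff, and the plain `tf * log(N / df)` weighting are assumptions chosen for illustration, and documents are assumed to be already segmented into word lists.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result
```

A term that appears in every document receives an IDF of zero, so it drops to the bottom of the ranking no matter how often it is repeated. Words stuffed into a single cheating document, by contrast, score highly under this scheme, which is the weakness the paragraph points out.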

Abstract

The invention provides a subject-based searching method and device. The method comprises: performing subject analysis on each document in a document library by using a subject analysis model to determine the subject distribution corresponding to each document; performing subject analysis on a query input by a user by using the subject analysis model to determine the subject distribution corresponding to the query; calculating the subject matching degree between the query and each document by using the subject distribution corresponding to the query and the subject distribution corresponding to each document; and obtaining the matching degree between the query and each document by using the subject matching degree, and determining the search results according to the matching degree between the query and each document. Because subject matching is used instead of keyword matching, documents can still be recalled even when their wording differs from the user's query or when they do not match redundant terms in the query, and the search results match the subject of the query as closely as possible, so that search recall and accuracy are improved.
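The matching and ranking steps in the abstract might be sketched as below. The abstract does not say how the subject matching degree is computed or how it feeds into the final matching degree, so this sketch assumes cosine similarity between the two subject distributions and uses that value directly as the matching degree. The function names and the sample distributions are hypothetical.

```python
import math


def topic_match(query_dist, doc_dist):
    """Cosine similarity between two {topic: probability} mappings."""
    topics = set(query_dist) | set(doc_dist)
    dot = sum(query_dist.get(t, 0.0) * doc_dist.get(t, 0.0) for t in topics)
    norm_q = math.sqrt(sum(v * v for v in query_dist.values()))
    norm_d = math.sqrt(sum(v * v for v in doc_dist.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank_documents(query_dist, doc_dists):
    """Return document ids sorted by descending topic match with the query."""
    scored = {
        doc_id: topic_match(query_dist, dist) for doc_id, dist in doc_dists.items()
    }
    return sorted(scored, key=lambda doc_id: (-scored[doc_id], doc_id))


# Hypothetical distributions over two made-up topics.
query = {"appearance": 0.9, "price": 0.1}
documents = {
    "d1": {"appearance": 0.8, "price": 0.2},
    "d2": {"appearance": 0.1, "price": 0.9},
}
ranking = rank_documents(query, documents)
```

Other similarity measures over probability distributions would fit the same interface; cosine similarity is used here only because it is simple and bounded.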

Description

Technical field

[0001] The invention relates to the field of computer technology, and in particular to a subject-based search method and device.

Background technique

[0002] With the continuous development of computer network technology, search engines have become an important means for people to obtain information. A user inputs a search item (query) into a search engine, and the search engine retrieves the documents related to the query from the documents it has crawled and sorts them by relevance. The most widely used search model is the vector space model. Its basic idea is to represent the query and the document as word vectors, where the weight of each vector component can be the term frequency (TF) or the term frequency-inverse document frequency (TF-IDF), and then to calculate the similarity between the word vector of the query and the word vector of the document as a measure of relevance. In practical applications, there are various variants, but essentially they all c...

Application Information

IPC(8): G06F17/30, G06F17/28
Inventors: 方高林, 王海峰
Owner BEIJING BAIDU NETCOM SCI & TECH CO LTD