
Method and device for searching documents

A document retrieval technology, operating on documents and document sets, that addresses the problem that existing ranking methods cannot be applied well to heterogeneous academic networks.

Active Publication Date: 2010-05-12
北京智谱华章科技有限公司

AI Technical Summary

Problems solved by technology

[0007] In view of the defects and deficiencies of the prior art, the object of the present invention is to provide a device and method for ranking documents by importance, and a retrieval device and method that use this ranking device and method to retrieve webpages and documents, thereby effectively solving the problem that the ranking used in existing search cannot be applied well to heterogeneous academic networks.



Examples


Embodiment 1

[0076] A preferred embodiment of the document-importance ranking method proposed by the present invention comprises:

[0077] Step 1: Topic Modeling

[0078] The purpose of step one is to discover topics from document collections using a probabilistic topic model. Probabilistic topic models can efficiently mine topics in document collections. In these methods, documents are usually assumed to be generated from a mixture of |T| probabilistic models. Latent Dirichlet Allocation (LDA) is a widely used topic model. In this model, the likelihood of a document set D is defined as:

[0079] $P(z, w \mid \Theta, \Phi) = \prod_{d \in D} \prod_{z \in T} \theta_{dz}^{n_{dz}}$ ...

Embodiment 2

[0115] A preferred embodiment of the document retrieval device proposed by the present invention comprises:

[0116] A topic identification module, which uses a probabilistic topic model to identify topics from the document set and obtains the topic distribution of each document according to the identified topics;

[0117] A random walk module, which calculates a random walk ranking score for each document according to its topic distribution;

[0118] A retrieval module, which calculates the relevance score of each document to the query keywords, and combines the random walk ranking score with the relevance score to obtain the retrieval result.
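The random walk and retrieval modules described above can be illustrated with a minimal Python sketch. It is not the patented formula, which is not reproduced in this excerpt. It assumes a topic-biased PageRank in which each topic has its own walk with teleportation proportional to P(z | d), and it assumes the final score is the product of the relevance score and the topic-weighted walk score. The names `topic_random_walk`, `combine_scores`, `damping` and `query_topic`, and all input values, are hypothetical.

```python
import numpy as np

def topic_random_walk(links, doc_topic, damping=0.85, iters=50):
    """Assumed topic-biased PageRank: one score per (document, topic) pair.

    links[i] lists the documents that document i links to.
    doc_topic[i, z] is P(z | document i) from the topic model.
    """
    # Teleport distribution per topic, normalized over documents.
    teleport = doc_topic / doc_topic.sum(axis=0, keepdims=True)
    scores = teleport.copy()
    for _ in range(iters):
        spread = np.zeros_like(scores)
        for src, targets in enumerate(links):
            if targets:
                # Split the source's per-topic mass evenly over its out-links.
                share = scores[src] / len(targets)
                for dst in targets:
                    spread[dst] += share
            else:
                # Dangling document: redistribute its mass via the teleport vector.
                spread += scores[src] * teleport
        scores = (1.0 - damping) * teleport + damping * spread
    return scores

def combine_scores(relevance, walk_scores, query_topic):
    """Assumed combination rule: relevance times topic-weighted importance.

    query_topic[z] is P(z | query); relevance[i] is the document's query relevance.
    """
    importance = walk_scores @ query_topic
    return relevance * importance

# Toy data (hypothetical): three documents, two topics.
links = [[1, 2], [2], [0]]
doc_topic = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
walk = topic_random_walk(links, doc_topic)
combined = combine_scores(np.array([0.3, 0.6, 0.1]), walk, np.array([1.0, 0.0]))
ranking = np.argsort(-combined)  # document indices, best first
```

Keeping one score column per topic lets the same precomputed walk serve any query: only the cheap weighting by `query_topic` happens at query time.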

[0119] Wherein, the topic identification module includes:

[0120] A parameter calculation submodule, which calculates the posterior probability distribution over topic z using Gibbs sampling: ...
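The posterior formula itself is elided in this excerpt. As a hedged illustration of the general technique only, the sketch below uses the textbook collapsed Gibbs sampler for LDA with symmetric priors, where the conditional for one word is proportional to (n_dz + alpha) * (n_zv + beta) / (n_z + V * beta). The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy corpus are assumptions, not taken from the patent.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    """Textbook collapsed Gibbs sampling for LDA; returns P(z | d) per document.

    docs is a list of documents, each a list of integer word ids.
    """
    rng = np.random.default_rng(seed)
    n_dz = np.zeros((len(docs), n_topics))   # words in doc d assigned to topic z
    n_zv = np.zeros((n_topics, vocab_size))  # times word v is assigned to topic z
    n_z = np.zeros(n_topics)                 # total words assigned to topic z
    assign = []
    # Random initial topic assignment for every word occurrence.
    for d, words in enumerate(docs):
        zs = rng.integers(n_topics, size=len(words))
        assign.append(zs)
        for w, z in zip(words, zs):
            n_dz[d, z] += 1
            n_zv[z, w] += 1
            n_z[z] += 1
    for _ in range(sweeps):
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                # Remove the current assignment from the counts.
                z = assign[d][i]
                n_dz[d, z] -= 1
                n_zv[z, w] -= 1
                n_z[z] -= 1
                # Conditional posterior over topics for this word occurrence.
                p = (n_dz[d] + alpha) * (n_zv[:, w] + beta) / (n_z + vocab_size * beta)
                z = rng.choice(n_topics, p=p / p.sum())
                assign[d][i] = z
                n_dz[d, z] += 1
                n_zv[z, w] += 1
                n_z[z] += 1
    # Smoothed estimate of the per-document topic distribution.
    return (n_dz + alpha) / (n_dz.sum(axis=1, keepdims=True) + n_topics * alpha)

# Toy corpus (hypothetical): two documents over a four-word vocabulary.
theta = lda_gibbs([[0, 1, 0, 1], [2, 3, 2, 3]], n_topics=2, vocab_size=4, sweeps=20)
```

The returned `theta` plays the role of the per-document topic distribution that the topic identification module passes on to the random walk module.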


Abstract

The invention discloses a device and a method for searching documents, addressing the problem that conventional topic models cannot automatically identify topics. The device comprises a topic identification module, a random walk module and a retrieval module. The method comprises the following steps: identifying topics from a document set using a probabilistic topic model, and obtaining the topic distribution of each document according to the identified topics; calculating a topic-level random walk ranking score for each document; and calculating the relevance score of each document to the query keywords according to the query keywords and the topics, and combining the topic-level random walk ranking score with the relevance score to obtain the search result.

Description

Technical field

[0001] The invention relates to retrieval technology, in particular to a retrieval device and method applicable to the retrieval of webpages and documents.

Background art

[0002] With the popularization of computers and networks, the way people obtain information has changed greatly. How to quickly find the information a user needs within the vast World Wide Web has become an important research topic. On the World Wide Web, each web page can be regarded as a document, and the Web as a whole can be regarded as a collection of documents connected by countless hyperlinks. Therefore, one of the most important approaches to document retrieval is based on the analysis of hyperlink relationships.

[0003] In prior-art techniques for analyzing hyperlink relationships, the random walk is widely used. A random walk is a mathematical formalization of a trajectory made up of successive random steps. For example, the ex...


Application Information

IPC(8): G06F17/30
Inventor: 唐杰, 杨子
Owner: 北京智谱华章科技有限公司