Webpage searching result sequencing method based on content reference

A ranking method for web search results, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the problem of interference in the results and achieves the effect of avoiding that interference.

Inactive Publication Date: 2009-09-09
TSINGHUA UNIV
Cites: 0; Cited by: 23

AI Technical Summary

Problems solved by technology

These reference blocks will interfere with the results t


Examples

Embodiment Construction

[0050] In this embodiment, we use the Google search engine as the query tool to retrieve 100 candidate web pages. The jericho-html-2.5 toolkit extracts the text of each page and converts the page to plain text. The Sogou Internet Corpus serves as the large-scale Internet corpus from which the list of invalid reference blocks is generated. The specific steps of the algorithm are described below for an actual query, "cross star":

[0051] Preparation: Split the Sogou Internet Corpus into text blocks, find the 50 blocks that occur most often, and use them as the list of invalid reference blocks.
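The preparation step can be illustrated with a minimal Python sketch. The excerpt does not say how the corpus is split into blocks (chunks), so the splitting rule here (newlines and sentence-ending punctuation) is an assumption, and the names `split_into_blocks` and `build_invalid_block_list` are hypothetical. Only the idea of keeping the 50 most frequent blocks comes from the text.

```python
import re
from collections import Counter


def split_into_blocks(text):
    """Split plain text into blocks.

    Assumption: a block ends at a newline or at sentence-ending
    punctuation. The source does not define the blocking rule.
    """
    parts = re.split(r"[。！？!?\n]+", text)
    return [part.strip() for part in parts if part.strip()]


def build_invalid_block_list(corpus_docs, top_k=50):
    """Return the top_k blocks that occur most often across the corpus."""
    counts = Counter()
    for doc in corpus_docs:
        # Every occurrence is counted, including repeats within one document.
        counts.update(split_into_blocks(doc))
    return [block for block, _ in counts.most_common(top_k)]


if __name__ == "__main__":
    docs = [
        "Home | About\nNews today",
        "Home | About\nOther story",
        "Home | About\nNews today",
    ]
    print(build_invalid_block_list(docs, top_k=2))
```

Blocks that appear on very many pages, such as navigation bars or copyright footers, rise to the top of this list, which is what makes them candidates for exclusion.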

[0052] 1. Call the Google search engine with the query "cross star" and take the first 100 pages it returns. These pages serve as the relevant documents for the query. We do not use the page ranking given by Google; instead, this algorithm recomputes the ranking of these 100 pages.

[0053] 2. Call the jericho-html-2.5 to...

Abstract

The invention provides a method for ranking web search results based on content reference, belonging to the field of computer information retrieval. First, for a variety of search terms from different users, the full set of returned web pages is collected. Through text extraction, text blocking, and the construction of reference lists, the reference list of every text block in that set is obtained, and after the page ranking calculation, the fifty most-referenced text blocks are taken as a reference blacklist. Second, when a user then enters a search term, the reference blacklist is used as a text block index table while the reference lists are built, and the page lists in that table serve as the reference objects for the page ranking calculation, which yields a ranking of all pages containing the user's search term. The method eliminates interference from web pages of a navigational nature and also speeds up searching and ranking.
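As a rough illustration of how a reference blacklist can be applied when reference lists are built, the following Python sketch maps each text block to the pages that contain it and skips blacklisted blocks. The scoring rule is an assumption made only for this example: a page earns one reference for every other page that shares one of its non-blacklisted blocks. This excerpt does not give the patent's actual ranking formula, and the names `build_reference_lists` and `rank_pages` are hypothetical.

```python
from collections import defaultdict


def build_reference_lists(page_blocks, blacklist):
    """Map each non-blacklisted text block to the set of pages containing it."""
    blocked = set(blacklist)
    refs = defaultdict(set)
    for page_id, blocks in page_blocks.items():
        for block in blocks:
            if block not in blocked:
                refs[block].add(page_id)
    return refs


def rank_pages(page_blocks, blacklist):
    """Order pages by an assumed reference score, highest first."""
    refs = build_reference_lists(page_blocks, blacklist)
    scores = {page_id: 0 for page_id in page_blocks}
    for pages in refs.values():
        # Assumed rule: a block found on n pages gives each of them n - 1 references.
        for page_id in pages:
            scores[page_id] += len(pages) - 1
    # Ties are broken by page id so the order is deterministic.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


if __name__ == "__main__":
    pages = {
        "a": ["nav", "x", "y"],
        "b": ["nav", "x"],
        "c": ["nav", "y", "x"],
        "d": ["nav", "z"],
    }
    print(rank_pages(pages, blacklist=["nav"]))
```

In this toy input the block `"nav"` appears on every page. Blacklisting it keeps it from contributing references, so page `"d"`, which shares nothing else with the other pages, ends up with a score of zero.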

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing.

Background

[0002] With the rapid growth of the Internet, how to obtain the information users need from it has become an important research topic, and search engine technology emerged in response. A search engine returns a series of web pages that may be relevant to the user's query, sorts these pages according to some algorithm, and finally presents them to the user. Search engine performance is evaluated mainly by the following metrics: precision, recall, and precision of the first page (or the first N results). Because the amount of information on the Internet is extremely large, and what users care about is finding the information they need quickly and accurately, the most direct measure of the experience of real users is the precision of the first page (or the first N ...
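The accuracy of the first N results mentioned above is commonly called precision at N: the fraction of the top N returned pages that are relevant. A minimal Python sketch follows. The function name `precision_at_n` and the representation of relevance judgments as a set of page ids are assumptions made for illustration, not part of the source.

```python
def precision_at_n(ranked_pages, relevant_pages, n):
    """Fraction of the first n ranked pages that are judged relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    relevant = set(relevant_pages)
    top = ranked_pages[:n]
    hits = sum(1 for page in top if page in relevant)
    # Dividing by n (not len(top)) penalizes result lists shorter than n.
    return hits / n


if __name__ == "__main__":
    ranking = ["a", "b", "c", "d"]
    relevant = {"a", "c"}
    print(precision_at_n(ranking, relevant, 2))
```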

Application Information

IPC(8): G06F17/30, G06F17/20, G06F40/00
Inventor: 高嵩, 周强
Owner: TSINGHUA UNIV