Searching system and method based on web page extraction

A search system and search method, applied in the field of information search, that address the low extraction accuracy and poor operability of search engines and achieve the effect of reducing complexity and speeding up

Status: Inactive
Publication Date: 2008-06-04
TENCENT TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

[0006] The technical problem to be solved by the present invention is to provide a search system and search method based on web page extraction that address the low extraction accuracy and poor operability of the above-mentioned search engine.



Embodiment Construction

[0037] The present invention uses a preset template to extract target content accurately and discard irrelevant information, which improves the accuracy and fault tolerance of information extraction and, in turn, the accuracy of search results. Unlike ordinary text files, HTML pages carry explicit hierarchical information that can be described as a tree structure, namely the DOM (Document Object Model). Because the DOM has a unified specification and programming interface, this embodiment builds a DOM tree for each HTML page, and any node in the tree can then be accessed conveniently through the DOM interface.
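The idea of building a tree from an HTML page and then reaching any node through a uniform interface can be illustrated with a minimal sketch. It uses only Python's standard `html.parser` module and is not the patent's implementation: the `Node` class, the `find_all` method, the `build_tree` function and the `VOID_TAGS` set are all hypothetical names introduced here, and the tree is a simplified stand-in for a full DOM.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, assumed sufficient here).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of the simplified tree (hypothetical, not the W3C DOM API)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Yield descendants with the given tag, in document order."""
        for child in self.children:
            if child.tag == tag:
                yield child
            yield from child.find_all(tag)


class TreeBuilder(HTMLParser):
    """Turns the parser's start/end/data events into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

With such a tree, a call like `list(build_tree(page).find_all("p"))` returns every paragraph node, and each node's `parent`, `attrs` and `text` can be inspected directly, which is the kind of uniform node access the paragraph above attributes to the DOM interface.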

[0038] Figure 1 is a schematic structural diagram of an embodiment of a search system based on web page extraction according to the present invention. In this embodiment, the search system includes a web page downloading unit 11, a web page extracting unit 12, a template storage unit 13 and a result storage unit 14. W...



Abstract

The present invention discloses a search system based on web page extraction. The system comprises a web page download unit for downloading web pages and a result storage unit for storing search results, together with a template storage unit and a web page extraction unit. The template storage unit stores one or more templates that record the properties of preset web pages. The web page extraction unit takes, as a search result, the content of a web page downloaded by the web page download unit that matches a template. The invention also discloses a corresponding search method based on web page extraction. By matching the properties of the downloaded web page against those of the preset web page, the invention achieves more accurate search results.
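The flow described in the abstract (download a page, match it against stored templates, keep only the matching content as the search result) can be sketched as follows. This is an illustration under stated assumptions, not the claimed method: the source text available here does not say which page properties a template records, so the sketch assumes a template is a tag path plus required attribute values. `Template`, `Extractor`, `search` and the injected `fetch` callable are hypothetical names.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


@dataclass(frozen=True)
class Template:
    """Assumed template shape: tag path from the root plus required attributes."""
    name: str
    path: tuple        # e.g. ("html", "body", "h1")
    attrs: tuple = ()  # e.g. (("class", "title"),)


class Extractor(HTMLParser):
    """Stand-in for the extraction unit: collects text under matching elements."""

    def __init__(self, templates):
        super().__init__()
        self.templates = templates
        self.stack = []  # open elements as (tag, names of templates matched here)
        self.results = {t.name: [] for t in templates}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        path = tuple(t for t, _ in self.stack) + (tag,)
        present = dict(attrs)
        matched = [
            t.name for t in self.templates
            if t.path == path and all(present.get(k) == v for k, v in t.attrs)
        ]
        for name in matched:
            self.results[name].append("")  # start a new extracted fragment
        self.stack.append((tag, matched))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closes.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        for _, matched in self.stack:
            for name in matched:
                self.results[name][-1] += data


def search(urls, fetch, templates):
    """fetch(url) -> HTML string plays the download unit; the dict returned
    here plays the result storage unit. Pages with no match are not stored."""
    store = {}
    for url in urls:
        extractor = Extractor(templates)
        extractor.feed(fetch(url))
        extractor.close()
        hits = {name: [s.strip() for s in found]
                for name, found in extractor.results.items() if found}
        if hits:
            store[url] = hits
    return store
```

Passing `fetch` in as a parameter keeps the sketch free of network code; in practice it could wrap any HTTP client. Text outside the matched elements, such as navigation bars or advertisements, never enters the result store, which mirrors the stated goal of eliminating irrelevant information.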

Description

Technical field

[0001] The present invention relates to the field of information search and, more specifically, to a search system and search method based on web page extraction.

Background

[0002] With the development of search engine technology, the accuracy of search results has become a common concern. At present, most search engines can display a large number of search results, but users often pay attention only to records that are highly relevant and accurate. Specialized search, characterized by strong pertinence, accurate information and timely updates, is therefore widely used.

[0003] Across the entire search engine, the downloading and analysis of web pages is the data source for search results, so the web page extraction algorithm is one of its key technologies. The complexity, operability, fault tolerance and accuracy of the algorithm are all important factors affecting the quantity and quality of search results, and may even become the bottlen...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventors: 杜建强, 邓大付
Owner: TENCENT TECH (SHENZHEN) CO LTD