Self-adaptive incremental deep web data source discovery method

A deep web data source discovery method, applied in the computer field, that addresses problems such as low crawling efficiency and the difficulty of ensuring high coverage. It achieves higher coverage, improved efficiency, and an expanded set of sites.

Active Publication Date: 2014-04-09
HUAZHONG UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0004] Existing data source discovery mechanisms fall mainly into two categories: general crawlers and topical crawlers. General crawlers crawl exhaustively and therefore download a large number of irrelevant pages, which makes them inefficient. Topical crawlers crawl by domain: a page classifier filters out pages irrelevant to the topic, and a link classifier then filters links to speed up crawling. However, factors such as topic drift and limited link classifier precision reduce crawling efficiency. In addition, because data sources are sparsely distributed, topical crawlers speed up crawling by setting termination conditions, which leaves a large number of pages unvisited; the page classifier and link classifier also filter out many pages and links. As a result, it is difficult to ensure high coverage.

Examples

Embodiment Construction

[0025] The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments.

[0026] As shown in Figure 1, the self-adaptive incremental deep web data source discovery method of the embodiment of the present invention includes a site location stage and an in-site search stage.

[0027] (1) The site location stage includes site collection, site sorting and site classification.

[0028] Site collection is used to discover new sites and to ensure that the site queue holds enough site links to choose from during crawling. As shown in Figure 2, site collection includes the following steps:

[0029] (1-1) Determine whether the size of the site queue is smaller than a predefined threshold. If it is, go to step (1-2); otherwise, end directly;

[0030] (1-2) Submit the discovered deep web sites as input to a search engine for reverse search, then extract the links in the...
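
Steps (1-1) and (1-2) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the names `collect_sites`, `site_root` and `reverse_search` are hypothetical, the reverse search is passed in as a callable because the source does not name a search engine or API, and reducing each extracted link to its site root is an assumption. The steps after (1-2) are truncated in the source and are not modelled.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Set
from urllib.parse import urlparse


def site_root(url: str) -> str:
    """Reduce a URL to scheme://host so each site is queued only once (assumption)."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"


def collect_sites(
    site_queue: Deque[str],
    known_deep_sites: Iterable[str],
    reverse_search: Callable[[str], List[str]],
    threshold: int,
) -> int:
    """Top up site_queue from reverse-search results; return how many sites were added."""
    # Step (1-1): end directly when the queue is already at or above the threshold.
    if len(site_queue) >= threshold:
        return 0

    seen: Set[str] = set(site_queue)
    added = 0
    # Step (1-2): submit each discovered deep web site for reverse search
    # and extract the links from the results.
    for deep_site in known_deep_sites:
        for link in reverse_search(deep_site):
            root = site_root(link)
            if root not in seen:
                seen.add(root)
                site_queue.append(root)
                added += 1
    return added


if __name__ == "__main__":
    # Hypothetical stand-in for a search engine's reverse (backlink) search.
    def fake_reverse_search(site: str) -> List[str]:
        return ["http://b.example/page1", "http://b.example/page2"]

    queue: Deque[str] = deque(["http://a.example"])
    collect_sites(queue, ["http://deep.example"], fake_reverse_search, threshold=5)
    print(list(queue))
```

Passing the search function as a parameter keeps the sketch independent of any particular search engine and makes the threshold check easy to exercise in isolation.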

Abstract

The invention discloses a self-adaptive incremental deep web data source discovery method. The discovery process comprises a site location stage and an in-site search stage. In the site location stage, a site discovery mechanism is introduced so that the set of sites can be expanded efficiently and crawling effectiveness improved. An adaptive sorting mechanism is applied to both sites and in-site links so that deep web sites and queryable forms can be discovered more quickly. The method achieves automatic, incremental, and efficient acquisition of deep web data sources. It can be applied to deep web data integration and hidden web crawlers, and is also suitable for building online database directory websites.

Description

Technical field

[0001] The invention belongs to information retrieval and data mining in the field of computers, and in particular relates to an adaptive incremental deep web data source discovery method, which can automatically and efficiently discover deep web data sources by domain.

Background technique

[0002] With the explosive growth of Internet data, more and more websites adopt web database technology. A large number of pages on the Internet are generated dynamically from databases. This information cannot be crawled through static links and must instead be obtained by submitting queries. Since search engine crawlers cannot fill in forms automatically, this part of the data cannot be indexed by search engines and remains hidden behind the web database. This part of the data is called the deep web or dark web.

[0003] The white paper on the deep web released by BrightPlanet in 2001 made a relatively comprehensive macro-statistics o...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 赵峰, 金海, 聂昶, 陈恒
Owner: HUAZHONG UNIV OF SCI & TECH