Data source discovery method for Deep Web data source integration

A data source discovery method for Deep Web pages, applied to Deep Web data source discovery and Deep Web data source integration. It addresses the inability of existing approaches to obtain in-depth information, their topic drift, and their poor control over the crawling process, thereby improving discovery efficiency and preventing topic drift.

Status: Inactive
Publication Date: 2007-10-10
束兰

AI Technical Summary

Problems solved by technology

In the process of training the classifier, it uses search engines such as Google to retrieve all the outer pages that point to the inner pages. The disadvantage of this method is that the farther outward it goes, the more pages there are, and many of



Examples


Embodiment 1

[0031] Embodiment 1: Referring to Figure 1 to Figure 2, a data source discovery method for deep webpage data source integration includes the following steps:

[0032] (1) Provide the topic of the data to be queried, construct the site root link queue and the local link queue respectively, put at least one seed root link address in the site root link queue, and give a weight according to its relationship with the topic;

[0033] (2) If the local link queue is empty, take the root link address with the largest weight from the site root link queue and put it into the local link queue; then take the highest-scoring page link from the local link queue and have the crawling module download the page;

[0034] (3) Use the form classifier to process the page downloaded in step (2). If the page contains a form query interface, add it to the set of Deep Web data sources;

[0035] (4) Use the page classifier to process the page downloaded in step (2). The page classifier adopts the best-first strategy for sub...
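
Step (3) depends on deciding whether a downloaded page contains a form query interface. The excerpt does not describe how the form classifier is built, so the following is only a minimal rule-based sketch of that decision, not the classifier from the patent. The names `FormScanner` and `has_query_interface` are hypothetical, and the rule itself (a form with at least one text or select control and no password field) is an assumption chosen for illustration.

```python
from html.parser import HTMLParser


class FormScanner(HTMLParser):
    """Counts, for each <form> on a page, the input controls it contains."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one counter dict per <form> encountered
        self._current = None   # counters of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"text": 0, "password": 0, "select": 0}
            self.forms.append(self._current)
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type attribute defaults to a text box.
                kind = (attrs.get("type") or "text").lower()
                if kind in ("text", "search"):
                    self._current["text"] += 1
                elif kind == "password":
                    self._current["password"] += 1
            elif tag == "select":
                self._current["select"] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def has_query_interface(html):
    """Assumed heuristic: a query form has text/select controls and no password field."""
    scanner = FormScanner()
    scanner.feed(html)
    return any(
        f["password"] == 0 and f["text"] + f["select"] > 0
        for f in scanner.forms
    )
```

A trained classifier would replace the hard-coded rule; the sketch only shows where such a decision sits relative to the downloaded page.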



Abstract

A data source discovery method for Deep Web data integration comprises: constructing a site root link queue and a local link queue; taking the highest-scoring page link from the local link queue and downloading the page with the crawling module; processing the downloaded page with the form classifier and, if the page contains a form query interface, adding it to the Deep Web data sources; processing the downloaded page with the page classifier and, if its topic score is below the threshold, returning to the step of taking a page link; otherwise extracting the link addresses in the page and placing them in the local link queue; and repeating the steps from taking a page link through extracting link addresses, so that Deep Web data sources are crawled automatically.
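
The loop in the abstract can be sketched with two priority queues. This is a simplified illustration under stated assumptions, not the patented implementation: the function `crawl` and its parameters are hypothetical, the classifiers and the downloader are passed in as plain callables, every extracted link is placed in the local link queue, and each link inherits the topic score of the page it was found on (the excerpt does not say how link scores are computed).

```python
import heapq


def crawl(seeds, fetch, has_form, topic_score, extract_links,
          threshold=0.5, max_pages=100):
    """Return the URLs judged to be Deep Web data sources.

    seeds         -- list of (weight, root_url) pairs for the site root link queue
    fetch         -- callable url -> page, or None if the download fails
    has_form      -- stand-in for the form classifier: page -> bool
    topic_score   -- stand-in for the page classifier: page -> float
    extract_links -- callable page -> iterable of URLs
    """
    # heapq is a min-heap, so priorities are stored negated.
    site_roots = [(-weight, url) for weight, url in seeds]
    heapq.heapify(site_roots)
    local = []                          # local link queue
    seen = {url for _, url in seeds}    # avoid queueing a URL twice
    sources = []
    visited = 0

    while (local or site_roots) and visited < max_pages:
        if not local:
            # Local queue is empty: move the heaviest site root into it.
            heapq.heappush(local, heapq.heappop(site_roots))
        _, url = heapq.heappop(local)   # highest-scoring link
        page = fetch(url)
        visited += 1
        if page is None:
            continue
        if has_form(page):
            sources.append(url)         # page exposes a form query interface
        score = topic_score(page)
        if score < threshold:
            continue                    # off-topic: do not follow its links
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits the score of its parent page.
                heapq.heappush(local, (-score, link))
    return sources
```

Dropping the links of pages whose topic score falls below the threshold is what keeps the crawl from drifting off topic, and `max_pages` is an added safety bound that the source does not mention.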

Description

Technical field

[0001] The invention relates to a network-based method for discovering data sources, and in particular to a method for discovering Deep Web data sources that are reached through a web query interface, for use in Deep Web data source integration.

Background art

[0002] With the widespread use of web databases, the Web is rapidly "deepening". A large number of pages on the Internet are generated dynamically by a back-end database. This information cannot be reached directly through static links; it can only be obtained by filling in a form and submitting a query. Because traditional web crawlers cannot fill in forms, they cannot obtain these pages. Existing search engines therefore cannot index this part of the Web, which leaves the information hidden and invisible to users. We call it Deep Web (also known a...


Application Information

IPC(8): G06F17/30
Inventor: 崔志明, 赵朋朋, 方巍
Owner: 束兰