Data source discovery method for deep web data source integration

A data source discovery method for deep web pages, applied to deep web data source integration. It addresses problems such as topic drift, the inability to obtain page depth information, and poor control of the crawling process; its effects are to prevent topic drift and improve discovery efficiency.

Inactive Publication Date: 2009-01-14
束兰

AI Technical Summary

Problems solved by technology

When training its classifier, the existing method uses search engines such as Google to obtain all the outer pages that point to the inner pages. The drawback is that the further out it goes, the more pages there are, and many of them are irrelevant, which causes problems such as "topic drift".
Moreover, this method cannot obtain accurate depth information for a page within the site it belongs to, so the crawling process cannot be well controlled.



Examples


Embodiment 1

[0031] Embodiment 1: As shown in Figures 1 and 2, a data source discovery method for deep web data source integration includes the following steps:

[0032] (1) Provide the topic of the data to be queried; build a site root link queue and a local link queue; put at least one seed root link address into the site root link queue and assign it a weight according to its relevance to the topic;

[0033] (2) If the local link queue is empty, take the root link address with the greatest weight from the site root link queue and put it into the local link queue; then take the page link with the highest score from the local link queue and have the crawling module download the page;

[0034] (3) Process the page downloaded in step (2) with the form classifier; if the page contains a form query interface, add it to the deep web data sources;

[0035] (4) Process the page downloaded in step (2) with the page classifier; the page classifier adopts ...
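
Steps (1) and (2) describe a two-level crawl frontier. The following is a minimal sketch of that structure, not the patent's implementation: the class name `TwoLevelFrontier`, its method names, and the de-duplication of URLs are assumptions made for illustration, and weights and scores are supplied by the caller.

```python
import heapq


class TwoLevelFrontier:
    """Site root link queue plus local link queue (hypothetical sketch).

    Both queues are max-priority queues, built on heapq by storing
    negated weights/scores.
    """

    def __init__(self):
        self._site_roots = []  # heap of (-weight, url)
        self._local = []       # heap of (-score, url)
        self._seen = set()     # assumption: each URL is queued at most once

    def add_site_root(self, url, weight):
        """Step (1): enqueue a seed root link with its topic weight."""
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._site_roots, (-weight, url))

    def add_local_link(self, url, score):
        """Enqueue a link found inside the site currently being crawled."""
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._local, (-score, url))

    def next_page(self):
        """Step (2): refill the local queue from the best site root when it
        is empty, then return the highest-scoring local link, or None when
        both queues are exhausted."""
        if not self._local and self._site_roots:
            heapq.heappush(self._local, heapq.heappop(self._site_roots))
        if not self._local:
            return None
        _, url = heapq.heappop(self._local)
        return url
```

Because a new site root is taken only when the local link queue is empty, the crawler finishes the links of one site before moving to the next.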


Abstract

A data source discovery method for deep web data integration comprises: building a site root link queue and a local link queue; taking the page link with the highest score from the local link queue and downloading it with the crawling module; processing the downloaded page with a form classifier and adding the page to the deep web data sources if it contains a form query interface; processing the downloaded page with a page classifier and returning to the link-taking step if the topic score is below a threshold; extracting the link addresses in the page and placing them in the local link queue; and repeating the steps from taking a page link to extracting link addresses, so that deep web data sources are crawled automatically.
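
The loop in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: `discover_sources` and all of its parameters are hypothetical names, the crawling module, form classifier, page classifier and link extractor are passed in as plain callables, `max_pages` is an added safety bound, and extracted links simply inherit the topic score of the page they were found on, since the source does not say how links are scored.

```python
import heapq


def discover_sources(seed_roots, fetch, has_query_form, topic_score,
                     extract_links, threshold=0.5, max_pages=100):
    """Return the URLs of pages judged to contain a form query interface.

    seed_roots is an iterable of (url, weight) pairs. The four callables
    are hypothetical stand-ins supplied by the caller.
    """
    site_roots = [(-weight, url) for url, weight in seed_roots]
    heapq.heapify(site_roots)            # max-heap via negated weights
    local = []                           # local link queue: (-score, url)
    seen = {url for _, url in site_roots}
    sources = []
    pages = 0
    while pages < max_pages:
        if not local:
            if not site_roots:
                break                    # both queues exhausted
            heapq.heappush(local, heapq.heappop(site_roots))
        _, url = heapq.heappop(local)    # highest-scoring page link
        page = fetch(url)
        pages += 1
        if has_query_form(page):         # form classifier
            sources.append(url)
        score = topic_score(page)        # page classifier
        if score < threshold:
            continue                     # off-topic page: do not follow its links
        for link in extract_links(page, url):
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits its parent page's topic score.
                heapq.heappush(local, (-score, link))
    return sources
```

Skipping link extraction on pages that score below the threshold is what keeps the crawl from wandering into off-topic regions of a site.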

Description

Technical field

[0001] The invention relates to a network-based data source discovery method, in particular to a method for discovering deep web data sources that are accessed through web query interfaces, for use in deep web data source integration.

Background

[0002] With the widespread use of web databases, the Web is "deepening" at an accelerating rate. A large number of pages on the Internet are generated dynamically from back-end databases. This information cannot be reached directly through static links; it can only be obtained by filling out forms and submitting queries. Because traditional web crawlers cannot fill out forms, they cannot retrieve these pages. Existing search engines therefore cannot search this content, which remains hidden from users. We call it the Deep Web (De...
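
To make the notion of a page that is reached by filling out a form concrete, here is a rough heuristic for spotting a candidate query form in HTML. It is a simple rule-based stand-in, not the form classifier described in the patent: the names `QueryFormDetector` and `has_query_form`, and the rule "a form with a text, search, select or textarea field and no password field", are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class QueryFormDetector(HTMLParser):
    """Heuristic: a <form> with a free-text or select field and no password
    field is treated as a candidate query interface (assumed rule)."""

    def __init__(self):
        super().__init__()
        self._in_form = False
        self._has_query_field = False
        self._has_password = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self._has_query_field = False
            self._has_password = False
        elif self._in_form and tag == "input":
            # An <input> without a type attribute defaults to a text field.
            kind = (attrs.get("type") or "text").lower()
            if kind in ("text", "search"):
                self._has_query_field = True
            elif kind == "password":
                self._has_password = True
        elif self._in_form and tag in ("select", "textarea"):
            self._has_query_field = True

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            if self._has_query_field and not self._has_password:
                self.found = True
            self._in_form = False


def has_query_form(html):
    """Return True if the HTML contains at least one candidate query form."""
    detector = QueryFormDetector()
    detector.feed(html)
    detector.close()
    return detector.found
```

Excluding forms with a password field is a crude way to skip login forms, which accept input but do not query a database for content.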


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 崔志明, 赵朋朋, 方巍
Owner: 束兰