Data source discovery method for deep web data source integration

A web page data source discovery method, applied to the discovery and integration of deep web data sources. It addresses the inability to obtain a page's depth information within its site, topic drift, and poor control over the crawling process, so as to improve discovery efficiency and prevent topic drift.

Inactive Publication Date: 2007-10-10

束兰

0 Cites, 20 Cited by

Problems solved by technology

When training the classifier, the existing method uses search engines such as Google to retrieve all outer pages that link to the inner pages. Its drawback is that the further out it goes, the more pages there are, and many of them are irrelevant, which causes problems such as "topic drift".

Moreover, the existing method cannot obtain accurate information about how deep a given page lies within its site, so the crawling process cannot be controlled well.

Examples

Embodiment 1

[0031] Embodiment 1: Referring to Figures 1 and 2, a data source discovery method for deep web data source integration includes the following steps:

[0032] (1) Specify the topic of the data to be queried, and construct a site root link queue and a local link queue. Put at least one seed root link address into the site root link queue, and assign each a weight according to its relevance to the topic;

[0033] (2) If the local link queue is empty, take the root link address with the largest weight from the site root link queue and put it into the local link queue. Take the highest-scoring page link from the local link queue, and have the crawling module download the page;

[0034] (3) Process the page downloaded in step (2) with the form classifier. If it contains a form query interface, add it to the deep web data sources;

[0035] (4) Process the page downloaded in step (2) with the page classifier. The page classifier adopts the best-first strategy for sub...
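The two queues in steps (1) and (2) can be sketched as a pair of priority queues. This is a minimal illustration, not the patent's implementation: the class name `LinkQueue`, the function `next_link`, the example URLs, and the weights are all hypothetical, and the sketch assumes that a root link moved into the local queue keeps its weight as its score.

```python
import heapq
import itertools


class LinkQueue:
    """Max-priority queue of links, built on heapq (which is a min-heap)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: earlier pushes win

    def push(self, url, score):
        # Negate the score so the largest score is popped first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def next_link(site_roots, local_links):
    """Step (2): if the local queue is empty, move the heaviest root link
    into it; then return the highest-scoring local link (None if no links)."""
    if not local_links:
        if not site_roots:
            return None
        root_url, weight = site_roots.pop()
        local_links.push(root_url, weight)  # assumption: score = root weight
    url, _ = local_links.pop()
    return url


# Step (1): seed the site root queue with weighted root links (made-up URLs).
site_roots = LinkQueue()
local_links = LinkQueue()
site_roots.push("http://books.example/", 0.9)
site_roots.push("http://news.example/", 0.4)

first = next_link(site_roots, local_links)
```

Here `first` is the root with the largest weight, because the local queue starts out empty. Once links from that site are pushed into `local_links`, later calls drain the local queue before touching the next site root.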

Abstract

A data source discovery method for deep web data integration. The method sets up a site root link queue and a local link queue; takes the highest-scoring page link from the local link queue and downloads it with the crawling module; processes the downloaded page with a form classifier and adds the page to the deep web data sources if it contains a form query interface; processes the downloaded page with a page classifier and returns to the link-taking step if the topic score is below a threshold; otherwise extracts the link addresses in the page and places them in the local link queue. Repeating the steps from taking a page link to extracting link addresses achieves automatic crawling of deep web data sources.
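The loop described in the abstract can be sketched as follows. This is a hedged illustration under several assumptions that are not in the source: the form classifier is replaced by a simple check for any `<form>` tag, the page classifier is replaced by a keyword-overlap score (`topic_score`), extracted links inherit the score of the page they were found on, and pages are served from an in-memory dictionary with made-up URLs instead of a real crawling module.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScanner(HTMLParser):
    """Collects hyperlinks and records whether the page contains a <form>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.has_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.has_form = True  # stand-in for the form classifier
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def topic_score(html, topic_words):
    """Stand-in page classifier: fraction of topic words present in the page."""
    text = html.lower()
    return sum(word in text for word in topic_words) / len(topic_words)


def discover(seed_roots, fetch, topic_words, threshold=0.5, max_pages=100):
    """seed_roots: list of (url, weight). fetch: url -> HTML string or None."""
    site_roots = [(-weight, url) for url, weight in seed_roots]
    heapq.heapify(site_roots)
    local_links = []
    seen = {url for url, _ in seed_roots}
    sources = []
    pages = 0
    while pages < max_pages and (local_links or site_roots):
        if not local_links:  # local queue empty: move in the best site root
            heapq.heappush(local_links, heapq.heappop(site_roots))
        _, url = heapq.heappop(local_links)  # highest-scoring link
        html = fetch(url)
        pages += 1
        if html is None:
            continue
        scanner = PageScanner()
        scanner.feed(html)
        if scanner.has_form:
            sources.append(url)  # candidate deep web data source
        score = topic_score(html, topic_words)
        if score < threshold:
            continue  # off-topic page: do not follow its links
        for href in scanner.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                # assumption: a link inherits the score of its parent page
                heapq.heappush(local_links, (-score, link))
    return sources


# Made-up pages standing in for the crawling module's downloads.
PAGES = {
    "http://books.example/": '<a href="/search">Search</a> <a href="/about">About</a> books catalog',
    "http://books.example/search": '<form action="/q"><input name="title"></form> books search',
    "http://books.example/about": '<a href="/jobs">Jobs</a> company history',
    "http://books.example/jobs": '<form><input name="cv"></form> careers',
}
sources = discover([("http://books.example/", 1.0)], PAGES.get, ["books"])
```

In this toy run the off-topic page `/about` scores below the threshold, so its link to `/jobs` is never queued and the form on `/jobs` is never reported. That pruning is the mechanism by which the threshold limits topic drift. A real form classifier would also need to tell query interfaces apart from login or feedback forms, which the `<form>` check here does not attempt.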

Description

Technical field

[0001] The invention relates to a network-based data source discovery method, in particular to a method for discovering deep web data sources that are accessed through web query interfaces, for use in deep web data source integration.

Background art

[0002] With the widespread use of web databases, the web is rapidly "deepening". A large number of pages on the Internet are generated dynamically by back-end databases. This information cannot be reached directly through static links; it can only be obtained by filling in a form and submitting a query. Because traditional web crawlers cannot fill in forms, they cannot retrieve these pages. Existing search engines therefore cannot search this part of the page information, which leaves it hidden and invisible to users. We call it the Deep Web (also known a...

Application Information

IPC(8): G06F17/30

Inventor: 崔志明, 赵朋朋, 方巍

Owner: 束兰
