Deep web self-adapting crawling method based on minimum searchable mode

A query-pattern and self-adaptive technology, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses three problems of existing methods: the lack of sufficient basis for keyword selection, limited processing capacity, and the absence of any method for extracting the minimum queryable pattern of Deep Web query forms.

Inactive Publication Date: 2012-11-28
XI AN JIAOTONG UNIV

AI Technical Summary

Problems solved by technology

The shortcomings of crawling methods based on prior knowledge are: 1. The query form must contain enough information for learning prior knowledge. For query forms that contain little information, such as a single-text-box form that only accepts keyword queries, the processing ability of these methods is relatively limited; 2. Each query requires filling in the entire form, which reduces crawling efficiency.
Compared with crawling methods based on prior knowledge, crawling methods without prior knowledge improve crawling capability. However, this type of method still has two problems: 1. It can only crawl a single text box, and it assumes by default that the obtained keywords match that text box; 2. The selection of keywords for the initial crawl lacks sufficient basis.
[0020] After analysis and comparison, the Deep Web crawling methods described in the domestic and foreign literature involve neither a method for extracting the minimum queryable pattern of a Deep Web query form nor a crawling method based on the minimum queryable pattern.


Examples


Detailed Description of the Embodiments

[0053] A method for crawling the Deep Web based on the minimum queryable pattern, specifically comprising the following steps:

[0054] 1) Generate the minimum queryable pattern set S_mep of the target Deep Web query form;

[0055] 2) Add the seed candidate query q_i to the candidate query set. A candidate query can be expressed as q_i(kv, mep_j), where mep_j is a minimum queryable pattern in S_mep and kv is the keyword vector filled into mep_j;

[0056] 3) For each minimum queryable pattern mep_j in the minimum queryable pattern set, predict its pattern return rate P_new(q(mep_j)), which is the expected rate of return of new records for that minimum queryable pattern;

[0057] 4) For each candidate query q_i(kv, mep_j) in the candidate query set, estimate the conditional rate of return P_new(q_i(kv|mep_j)) of its keyword vector kv for new records;

[0058] 5) For each query q_i(kv, mep_j) in the candidate query set, compute the rate of return of new records for the query q_i, P_n...
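
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented algorithm: the names `CandidateQuery`, `pattern_rate`, `keyword_rate`, `score` and `best_query` are hypothetical, the two rates are supplied as plain numbers rather than estimated, and because the text of step 5 is truncated, combining the two rates by multiplication is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateQuery:
    """A candidate query q_i(kv, mep_j): a keyword vector filled into one pattern."""
    kv: tuple   # keyword vector, one value per form element of the pattern
    mep: tuple  # minimum queryable pattern, given as a tuple of form element names


def score(query, pattern_rate, keyword_rate):
    """Assumed combination of the two estimates (step 5 is truncated in the source)."""
    # pattern_rate[mep] stands in for P_new(q(mep_j)) from step 3.
    # keyword_rate[(kv, mep)] stands in for P_new(q_i(kv|mep_j)) from step 4.
    return pattern_rate[query.mep] * keyword_rate[(query.kv, query.mep)]


def best_query(candidates, pattern_rate, keyword_rate):
    """Pick the candidate with the highest assumed expected rate of new records."""
    return max(candidates, key=lambda q: score(q, pattern_rate, keyword_rate))


# Illustrative values only; they are not taken from the patent.
title = ("title",)
author_year = ("author", "year")
candidates = [
    CandidateQuery(("crawler",), title),
    CandidateQuery(("Smith", "2008"), author_year),
]
pattern_rate = {title: 0.5, author_year: 0.25}
keyword_rate = {
    (("crawler",), title): 0.5,
    (("Smith", "2008"), author_year): 0.5,
}
chosen = best_query(candidates, pattern_rate, keyword_rate)
print(chosen.mep, chosen.kv)
```

Representing a pattern as a tuple of form element names keeps a single text box and a multi-field pattern in the same shape, so the same scoring function can rank both kinds of candidate.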


Abstract

The invention discloses a Deep Web self-adaptive crawling method based on the minimum queryable pattern. To address the low crawling efficiency that existing Deep Web crawling methods suffer because of data islands, the invention first introduces the concept of a minimum queryable pattern (MEP), and then provides an MEP generation algorithm and an MEP-based self-adaptive crawling method. The invention generalizes the query interface from a single text box to a set of minimum queryable patterns: each query is jointly determined by one MEP and a keyword vector matching that MEP, and the next query with the optimal expectation is generated adaptively until the query stop conditions are satisfied. Using minimum queryable patterns not only improves form-filling accuracy but also allows the characteristics of all patterns to be exploited when selecting keywords, which better overcomes the data island problem.
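
The adaptive loop described in the abstract can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the in-memory `RECORDS` list stands in for a Deep Web source, `MEPS` is given rather than generated, the stop conditions are a query budget or an empty candidate set, and "optimal expectation" is replaced by a simple stand-in that picks the candidate seen most often in the records retrieved so far.

```python
from collections import Counter

# Toy stand-in for a Deep Web source: records are only reachable through query().
RECORDS = [
    {"author": "Liu", "year": "2008", "topic": "crawling"},
    {"author": "Liu", "year": "2009", "topic": "forms"},
    {"author": "Zheng", "year": "2009", "topic": "crawling"},
    {"author": "Wu", "year": "2010", "topic": "mining"},
]
# Assumed pattern set; the MEP generation algorithm is not shown here.
MEPS = [("author",), ("topic",), ("year",)]


def query(mep, kv):
    """Return indices of records whose fields in `mep` equal the keyword vector `kv`."""
    return {i for i, r in enumerate(RECORDS)
            if all(r[f] == v for f, v in zip(mep, kv))}


def crawl(seed, max_queries=10):
    """Adaptive loop: issue a query, harvest keyword vectors, pick the next query."""
    seen, issued = set(), []
    counts = Counter({seed: 1})  # candidate (mep, kv) -> times observed
    while counts and len(issued) < max_queries:  # assumed stop conditions
        # Stand-in for "optimal expectation": the most frequently observed candidate.
        (mep, kv), _ = counts.most_common(1)[0]
        del counts[(mep, kv)]
        issued.append((mep, kv))
        new = query(mep, kv) - seen
        seen |= new
        for i in sorted(new):  # harvest candidates from newly retrieved records
            for m in MEPS:
                cand = (m, tuple(RECORDS[i][f] for f in m))
                if cand not in issued:
                    counts[cand] += 1
    return seen, issued


seed = (("author",), ("Liu",))
seen, issued = crawl(seed)
print(sorted(seen), len(issued))
```

With this seed the loop retrieves records 0, 1 and 2 in six queries and never reaches record 3, which shares no field value with the others. That unreachable record is a small picture of the data island situation the abstract refers to.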

Description

Technical field

[0001] The invention belongs to the field of computer applications and mainly relates to Web mining and information acquisition, in particular to a Deep Web self-adaptive crawling method based on the minimum queryable pattern, which mainly solves the data island problem found in current crawling methods of the same kind.

Background art

[0002] The goal of Deep Web crawling is to obtain as many Deep Web data records as possible [2], and the key lies in how to generate suitable queries. At present, Deep Web crawling methods can be divided into two types: those based on prior knowledge and those without prior knowledge.

[0003] The crawling method based on prior knowledge needs to establish a corresponding prior knowledge base before crawling, and then generates queries under the guidance of that prior knowledge. The shortcomings of this type of method are: 1. The query form must contain enough information for learning prior knowledge. For query for...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 刘均, 郑庆华, 蒋路, 吴朝晖, 常晓
Owner: XI AN JIAOTONG UNIV