
Sorting type hidden network database data acquisition method

A data acquisition method in the field of database technology, applied to acquiring data from sorted hidden network databases. It addresses the lack of research on, and the incomplete solution of, crawling keyword selection, and achieves a lower repetition rate, a higher coverage rate, and low cost.

Inactive Publication Date: 2018-05-29
CENTRAL UNIVERSITY OF FINANCE AND ECONOMICS
View PDF4 Cites 3 Cited by

AI Technical Summary

Problems solved by technology

Existing algorithms have not fully solved the problem of crawling keyword selection in sorted hidden web crawling, and research in this area is still relatively lacking.




Embodiment Construction

[0036] The present invention will be described in detail below with reference to the accompanying drawings and examples.

[0037] The invention provides a method for obtaining data from a sorted hidden network database by crawling the target database. During crawling, the crawling keywords are selected using a method based on document frequency estimation, which comprises four parts: acquiring a sample data set, extracting a candidate keyword set from the sample data set, estimating the document frequency of the candidate keywords, and determining the crawling keywords. Finally, the documents of the sorted hidden web database are crawled with the selected keywords to obtain the database data.
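The four parts named in [0037] can be sketched as follows. This is a minimal illustration, not the patented procedure: the function names, the `max_sample_df` cutoff, and in particular the document frequency estimator (scaling the sample frequency by an assumed database size) are assumptions, because the text here does not specify them.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each word, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def select_crawl_keywords(sample_docs, db_size, return_limit, max_sample_df=2):
    """Pick crawling keywords from a document sample (hypothetical sketch)."""
    # Part 1: the sample data set is passed in as sample_docs.
    sample_df = document_frequencies(sample_docs)

    # Part 2: candidate keywords are words with a low sample document frequency.
    candidates = [w for w, f in sample_df.items() if f <= max_sample_df]

    # Part 3 (ASSUMED estimator): scale the sample frequency to the database size.
    scale = db_size / len(sample_docs)
    estimated_df = {w: sample_df[w] * scale for w in candidates}

    # Part 4: keep candidates whose estimated frequency is below the return limit.
    return sorted(w for w, est in estimated_df.items() if est < return_limit)


sample = [
    "deep web crawler",
    "deep web database",
    "ranked query interface",
    "deep keyword selection",
]
keywords = select_crawl_keywords(sample, db_size=40, return_limit=15)
print(keywords)
```

In this toy run, "deep" is dropped as a candidate because it appears in three sample documents, and "web" is dropped in the last step because its estimated frequency (20) exceeds the return limit (15).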

[0038] The specific idea of the method is as follows: first, obtain a certain number of documents from the sorted hidden network data source DB to form a document sample set D; then, ob...


Abstract

The invention discloses a data acquisition method for sorted hidden network databases. The method avoids the high redundancy caused by the return limit of a sorted hidden network and obtains better results at minimum cost. It takes into account the influence of the return limit on the data returned by a sorted hidden network, starts from document frequencies, and uses low document frequency in a sample set to select crawling keywords, which increases the coverage of the crawled data and reduces repetition. The method first selects words with low document frequency from the document sample set as candidate crawling keywords, which improves the coverage rate of the returned results. A document frequency estimation method is then used to estimate the document frequency of every candidate keyword in the target database, and the candidates whose document frequency is smaller than the limit on the number of returned results are taken as the crawling keywords. In this way the number of target database documents obtained is maximized, the repetition rate of the crawled results is reduced, and a large amount of hidden network data can be acquired at minimum cost.
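The effect of the return limit described in the abstract can be illustrated with a toy simulation. Everything here is an assumption made for illustration: the six-document database, ranking by document id, and the definition of repetition rate as the share of returned results that were already seen.

```python
def ranked_search(db, keyword, return_limit):
    """Return ids of matching documents in rank order, cut off at return_limit."""
    hits = [doc_id for doc_id, text in enumerate(db) if keyword in text.split()]
    return hits[:return_limit]


def crawl(db, keywords, return_limit):
    """Issue one query per keyword and report (coverage, repetition rate)."""
    seen = set()
    returned = 0
    for kw in keywords:
        results = ranked_search(db, kw, return_limit)
        returned += len(results)
        seen.update(results)
    coverage = len(seen) / len(db)
    repetition = 1 - len(seen) / returned if returned else 0.0
    return coverage, repetition


# Toy database: "common" is in every document, the other words in two each.
toy_db = [
    "common alpha", "common alpha",
    "common beta", "common beta",
    "common gamma", "common gamma",
]

# Same number of queries, return limit of 2 results per query.
print(crawl(toy_db, ["common", "alpha", "beta"], return_limit=2))
print(crawl(toy_db, ["alpha", "beta", "gamma"], return_limit=2))
```

With the high-frequency keyword "common" in the query set, the first query is truncated to two documents that "alpha" returns again, so coverage stays below 1 and some results repeat. The three low-frequency keywords each fit within the return limit and together reach every document without repeats.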

Description

Technical field

[0001] The invention relates to the technical field of information retrieval, in particular to a method for acquiring data from sorted hidden network databases.

Background art

[0002] In today's information age, networks and information are present in every area of production and daily life. The World Wide Web includes not only the surface web, which has been indexed by standard search engines, but also pages that are dynamically generated by online databases or file systems and are not indexed by search engines, called the deep web or hidden web. According to data from 2001, the hidden web is 500 times the size of the surface web (500 billion hidden web pages versus about 1 billion surface web pages), and, compared with the surface web, hidden web databases store a large amount of high-quality information. However, a hidden network database cannot be accessed directly, and can only be obtained by entering keywords through the search inter...


Application Information

IPC(8): G06F17/30, G06F17/22
CPC: G06F16/316, G06F16/3346, G06F16/335, G06F40/151
Inventor: 王焱, 谭艳欢, 陈炜琛, 肖飞, 李亚欣
Owner: CENTRAL UNIVERSITY OF FINANCE AND ECONOMICS