A Keyword-Based Oriented Web Page Acquisition Method

A keyword-based collection method, applied in electrical digital data processing, digital data information retrieval, instruments, etc. It addresses the low accuracy and efficiency of classifiers, the difficulty of accurately crawling topic web pages, and frequent communication between nodes. Its effects are improved overall collection accuracy, a higher data collection rate, and better global searchability.

Active Publication Date: 2019-12-10
深圳市东晟数据有限公司
6 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0005] Current directional network data collection technology has the following main problems: (1) A topic crawler must judge a page's relevance before saving it, and saves only pages related to the topic. At present, most topic identification methods are based on classifiers, whose accuracy and efficiency are low, so it is difficult to capture topic web pages accurately.
(2) In topic collection algorithms based on link structure, the calculated link value correlates only weakly with the topic. This easily causes "topic drift", in which pages unrelated to the topic are collected. Topic search strategies based on link content evaluation, in turn, are inefficient.
(3) Current distributed directional data acquisition systems require frequent communication between nodes, and their scalability is limited.

Embodiment Construction

[0033] The specific implementation manners of the present invention will be further described in detail below in conjunction with the accompanying drawings.

[0034] The keyword-based directional web page collection method designed by the present invention works as follows. (1) To address the low collection accuracy of topic collection, it proposes a directional data collection method that uses historical collection data as a reference, dynamically sets a suitable threshold, and adjusts the system's acquisition model in time, so that pages are captured both accurately and quickly. It can also improve global searchability to a certain extent, keep web page collection from falling into a local optimum, and improve the overall collection accuracy of the system through adaptive algorithms. (2) Based on a distributed platform, it optimizes the distributed configuration environment and uses the Nutch open source crawler framework to realize the distributed multi-threaded ...
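The excerpt does not give the rule used to set the threshold from historical collection data, so the following Python sketch is only one plausible reading. It assumes the "history" is a sliding window of recent accept/reject decisions and that the threshold is nudged toward a target harvest rate. The class name `AdaptiveThreshold` and every parameter (`target_rate`, `step`, `window`, `low`, `high`) are hypothetical and do not come from the patent.

```python
from collections import deque


class AdaptiveThreshold:
    """Hypothetical relevance threshold driven by recent collection history."""

    def __init__(self, initial=0.5, target_rate=0.4, step=0.05,
                 window=100, low=0.1, high=0.9):
        self.value = initial            # current relevance threshold
        self.target_rate = target_rate  # assumed desired fraction of kept pages
        self.step = step                # size of each adjustment
        self.low, self.high = low, high # bounds that keep the threshold sane
        self.history = deque(maxlen=window)  # 1 = page kept, 0 = page dropped

    def accept(self, relevance):
        """Decide whether to keep a page, then update the threshold."""
        kept = relevance >= self.value
        self.history.append(1 if kept else 0)
        self._adjust()
        return kept

    def _adjust(self):
        # Harvest rate over the sliding window of recent decisions.
        rate = sum(self.history) / len(self.history)
        if rate < self.target_rate:
            # Too few pages kept: relax the threshold so the crawl
            # does not stall in an unproductive region.
            self.value = max(self.low, self.value - self.step)
        elif rate > self.target_rate:
            # Plenty of pages kept: tighten the threshold for accuracy.
            self.value = min(self.high, self.value + self.step)
```

In this sketch, lowering the threshold when the harvest rate drops is what stands in for avoiding a local optimum; the actual adaptive algorithm in the patent may differ.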

Abstract

The invention relates to a keyword-based oriented web page collection method. The method comprises the steps of: introducing a text weighting algorithm to set a weight for each keyword; calculating the topic relevance of a web page with a vector space model algorithm; judging the importance of a web page from its link structure and topic relevance; gathering related topic web page files with a text clustering algorithm; calculating the probability that a page to be crawled belongs to the topic with a naive Bayes algorithm; setting a fitness function and screening for pages related to the topic; and dynamically adjusting the system model according to the real-time crawling situation. Based on a distributed platform, oriented crawling of topic web pages is realized by combining an adaptive topic algorithm with an open source network collection architecture. Parallel web page crawling is realized with distributed technology, so that the computing resources of each node are fully utilized and the web page crawling rate is improved.
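The first two steps of the abstract (keyword weighting and vector space model relevance) can be illustrated with a minimal Python sketch. It assumes keyword weights are supplied directly as a dictionary, represents a page by raw term frequencies, and scores relevance as the cosine between the page vector and the weighted keyword vector. The function names `tokenize` and `topic_relevance` are hypothetical, and the patent's actual weighting algorithm is not specified in this excerpt.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def topic_relevance(page_text, keyword_weights):
    """Cosine similarity between a page's term-frequency vector and a
    weighted keyword vector (both in the same term space)."""
    tf = Counter(tokenize(page_text))
    if not tf or not keyword_weights:
        return 0.0
    # Dot product only needs the terms that appear in the keyword vector.
    dot = sum(w * tf.get(k, 0) for k, w in keyword_weights.items())
    page_norm = math.sqrt(sum(c * c for c in tf.values()))
    topic_norm = math.sqrt(sum(w * w for w in keyword_weights.values()))
    if topic_norm == 0:
        return 0.0
    return dot / (page_norm * topic_norm)


# Illustrative weights only; a real system would derive them from a corpus.
topic = {"crawler": 2.0, "web": 1.0}
score = topic_relevance("web crawler crawler", topic)
```

A score near 1 means the page's term distribution matches the weighted keywords closely, and a score of 0 means no keyword occurs on the page. A score like this is what a relevance threshold would be compared against.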

Description

Technical field

[0001] The invention relates to a method for collecting directional webpages based on keywords, and belongs to the intersecting technical fields of theme web crawlers and distributed computing.

Background technique

[0002] With the rapid development of information technologies such as electronic computers, storage devices, and mobile communication networks, the rapid popularization of mobile Internet, social networks, and the Internet of Things has led to a rapid increase in the amount of data on Internet platforms. The era of big data has come. According to statistics, as of mid-March 2016, the total number of known web pages (excluding hidden web pages) on the Internet alone has exceeded 4.6 billion worldwide. How to efficiently collect network data is particularly important.

[0003] Data collection is the premise of subsequent data mining, analysis, and decision-making. The capture efficiency of network data collection directly determines the effect of d...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/9535, G06F16/35, G06F16/33, G06F17/27
CPC: G06F16/3344, G06F16/35, G06F16/9535, G06F40/284
Inventor: 徐小龙, 杨春春
Owner: 深圳市东晟数据有限公司