Keyword-based oriented webpage collection method

A collection method based on keyword technology, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as the difficulty of accurately crawling topic web pages, low classifier accuracy and efficiency, and topic drift. Its effects are a higher data acquisition rate, improved overall acquisition accuracy, and a shorter acquisition time.

Active Publication Date: 2017-09-15
深圳市东晟数据有限公司

AI Technical Summary

Problems solved by technology

[0005] Existing techniques for directional collection of network data have the following main problems:
(1) A topic crawler must judge the relevance of a page before saving it, and saves only pages related to the topic. Most current topic identification methods are classifier-based; the accuracy and efficiency of these classifiers are low, so it is difficult to capture topic web pages accurately.
(2) In topic collection algorithms based on link structure, the calculated link value has little correlation with the topic. This easily causes "topic drift", in which web pages unrelated to the topic are collected. Topic search strategies based on link content evaluation, in turn, are inefficient.
(3) Current distributed directional data acquisition systems require frequent communication between nodes, and their scalability is limited.


Examples

Embodiment Construction

[0033] The specific embodiments of the present invention will be described in further detail below in conjunction with the accompanying drawings of the specification.

[0034] The keyword-based directional webpage collection method designed by the present invention has the following aspects. (1) To address the low collection accuracy of topic collection, a directional data collection method is proposed that takes historically collected data as a reference, dynamically sets a suitable threshold, and adjusts the system's collection model in time, so that capture is both accurate and fast. It also improves global search capability to some extent, keeps web page collection from falling into a local optimum, and improves the overall collection accuracy of the system through an adaptive algorithm. (2) Based on a distributed platform, the distributed configuration environment is optimized, and the Nutch open source crawler framework is used to realize the ...
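
The dynamic threshold described in (1) can be illustrated with a minimal sketch. The text available here does not give the update rule, so everything below is an assumption made only for illustration: the class name `AdaptiveThreshold`, the sliding window of recent relevance scores, and the rule that sets the threshold to the clamped mean of that window are all hypothetical.

```python
from collections import deque


class AdaptiveThreshold:
    """Hypothetical relevance threshold driven by recently collected pages."""

    def __init__(self, initial=0.5, window=100, floor=0.1, ceiling=0.9):
        self.value = initial                  # current acceptance threshold
        self.floor = floor                    # assumed lower bound
        self.ceiling = ceiling                # assumed upper bound
        self.history = deque(maxlen=window)   # most recent relevance scores

    def accept(self, score):
        """Decide whether to keep a page, then update the threshold."""
        keep = score >= self.value
        self.history.append(score)
        # Assumed rule: the threshold follows the mean of recent scores,
        # clamped so it can neither reject nor accept everything.
        mean = sum(self.history) / len(self.history)
        self.value = min(self.ceiling, max(self.floor, mean))
        return keep


threshold = AdaptiveThreshold(initial=0.5, window=3)
decisions = [threshold.accept(s) for s in (0.8, 0.2, 0.6, 0.4)]
```

Under this assumed rule, a run of highly relevant pages raises the bar and a run of weak pages lowers it, which is one simple way to use historical data as the reference for the threshold.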

Abstract

The invention relates to a keyword-based oriented webpage collection method. The method comprises: introducing a text weighting algorithm to set a weight for each keyword; calculating the topic relevance of a web page with a vector space model algorithm; judging the importance of a web page from the web page link structure and the topic relevance; gathering related topic web page files with a text clustering algorithm; calculating, with a naive Bayes algorithm, the probability that a web page to be crawled belongs to the topic; setting a fitness function and screening web pages related to the topic; and dynamically adjusting the system model according to the real-time crawling situation. Based on a distributed platform, oriented crawling of topic web pages is realized with an adaptive topic algorithm combined with an open source network collection architecture. Parallel web page crawling is realized with distributed technology, so the computing resources of each node are fully used and the web page crawling rate is improved.
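
The first two steps of the abstract (weighting keywords, then scoring a page against the topic in a vector space model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented algorithm: the keyword weights are supplied by hand instead of coming from the unspecified text weighting algorithm, the tokenizer is a simple regular expression, and the function names `tokenize` and `topic_relevance` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def topic_relevance(page_text, keyword_weights):
    """Cosine similarity between the topic vector and the page vector.

    The topic vector holds the (assumed) keyword weights; the page vector
    holds the term frequency of every token found in the page.
    """
    counts = Counter(tokenize(page_text))
    dot = sum(w * counts[k] for k, w in keyword_weights.items())
    topic_norm = math.sqrt(sum(w * w for w in keyword_weights.values()))
    page_norm = math.sqrt(sum(c * c for c in counts.values()))
    if topic_norm == 0 or page_norm == 0:
        return 0.0  # empty page or empty topic: no relevance
    return dot / (topic_norm * page_norm)


# Hand-picked weights standing in for the text weighting step.
keywords = {"crawler": 2.0, "distributed": 1.0}
on_topic = topic_relevance("Distributed crawler design: a crawler per node", keywords)
off_topic = topic_relevance("Recipes for summer salads", keywords)
```

A crawler built along these lines would keep a page only when its score meets the current relevance threshold. The later steps of the abstract (link structure, clustering, naive Bayes, and the fitness function) are not modelled here.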

Description

Technical field

[0001] The invention relates to a keyword-based directional webpage collection method, which belongs to the interdisciplinary technical field of topic web crawlers and distributed computing.

Background art

[0002] With the rapid development of information technologies such as computers, storage devices, and mobile communication networks, the rapid spread of the mobile Internet, social networking, and the Internet of Things has led to rapid growth of data on Internet platforms, and the era of big data has arrived. According to statistics, as of mid-March 2016 the total number of known web pages on the Internet (excluding hidden web pages) exceeded 4.6 billion globally, so collecting network data efficiently has become particularly important.

[0003] Data collection is the prerequisite for subsequent data mining, analysis, and decision-making. The capture efficiency of network data collection directly determines the effect of data processing....

Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/3344, G06F16/35, G06F16/9535, G06F40/284
Inventor: 徐小龙, 杨春春
Owner: 深圳市东晟数据有限公司