A Timely and Efficient Internet Information Crawling Method

A technology for crawling Internet and web page information, applied in the information field, that simplifies resource allocation, reduces crawl scope and complexity, and reduces misjudgments.
CN103176985B · Active · Publication Date: 2016-06-29 · COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI
Publication Date
2016-06-29

Abstract

The invention discloses a timely and efficient crawling method for Internet information and belongs to the technical field of information. The method comprises the following steps: (1) setting seed addresses, crawling and storing web page information, and determining the navigation pages; (2) crawling each navigation page more than once, and parsing and labeling the crawled web pages; (4) building a topic judgment model and, for each website, a time series prediction model of navigation page changes; (5) predicting the next change time of each website's navigation page, determining the next crawl time, crawling the navigation page, and extracting the subpage addresses and anchor texts that have not yet been crawled; (6) using the topic judgment model to judge the subpage addresses and anchor texts extracted in the previous step, and processing them according to the judgment result; (7) based on the newly crawled topic-related pages, forming or updating the current change time series of each website's navigation page, and determining the next crawl time for further crawling. The method ensures the freshness and topical relevance of the collected information under a low load.
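The loop formed by steps (5) to (7) can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not specify either model, so the moving-average predictor, the keyword match standing in for the topic judgment model, and all names (`NavigationPage`, `predict_next_crawl`, `is_on_topic`, `process_crawl`) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class NavigationPage:
    """A navigation page and the times (in seconds) at which it was seen to change."""
    url: str
    change_times: list = field(default_factory=list)
    seen_links: set = field(default_factory=set)


def predict_next_crawl(change_times, now, window=3, default_interval=3600.0):
    """Assumed predictor: last change time plus the mean of the most recent intervals."""
    intervals = [b - a for a, b in zip(change_times, change_times[1:])]
    step = mean(intervals[-window:]) if intervals else default_interval
    last = change_times[-1] if change_times else now
    predicted = last + step
    # If the predicted change is already overdue, check again one step from now.
    return predicted if predicted > now else now + step


def is_on_topic(anchor_text, keywords):
    """Assumed stand-in for the topic judgment model: keyword match on the anchor text."""
    text = anchor_text.lower()
    return any(k in text for k in keywords)


def process_crawl(page, links, keywords, now):
    """Handle one crawl of a navigation page.

    `links` is a list of (subpage_url, anchor_text) pairs extracted from the page.
    Returns the on-topic subpages not crawled before, and the next crawl time.
    """
    # Step (5): keep only subpage addresses that have not been crawled yet.
    new_links = [(u, a) for u, a in links if u not in page.seen_links]
    page.seen_links.update(u for u, _ in new_links)
    # Step (6): judge each new link by its anchor text.
    relevant = [u for u, a in new_links if is_on_topic(a, keywords)]
    # Step (7): a new on-topic page counts as a change of the navigation page.
    if relevant:
        page.change_times.append(now)
    return relevant, predict_next_crawl(page.change_times, now)


page = NavigationPage("http://example.com/news", change_times=[0.0, 100.0, 220.0])
links = [
    ("http://example.com/a", "New crawler released"),
    ("http://example.com/b", "Sports results"),
]
relevant, next_time = process_crawl(page, links, ["crawler"], now=300.0)
```

In this sketch, the navigation page is revisited only around the time it is expected to change, and only subpages whose anchor text passes the topic check are queued, which is how the method keeps the crawl load low.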

Description

Technical field

[0001] The invention belongs to the field of information technology, and in particular relates to a timely and efficient Internet information crawling method.

Background art

[0002] With its rapid development, the Internet has become the largest public data source in the world, and its scale is still growing. In terms of content, the Internet holds a large volume of web page information linked together by hyperlinks, and a considerable part of it changes dynamically. On this basis, many services can be provided on the Internet, and the communication between people and organizations forms a virtual society that corresponds and relates to the real one. For this reason, Web data mining, which aims to find useful knowledge in the structure, content, and logs of the Internet, has received great attention and development, especially the content mining that takes the content...
