A Timely and Efficient Internet Information Crawling Method

A technology for crawling Internet information and web page information, applied in the information field to simplify resource allocation, reduce scope and complexity, and reduce misjudgments

Active Publication Date: 2016-06-29
COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI
Cites: 3; Cited by: 0

AI Technical Summary

Problems solved by technology

After consulting the literature, it is found that there is no research work involving this

Examples

Embodiment Construction

[0037] A specific embodiment of the present invention is shown in Figure 1. The steps are detailed below.

[0038] 1. Information collection and collation (as shown in Figure 2)

[0039] 1. Collect the URLs of relevant information

[0040] Based on the predetermined topic, first select a small number of topic keywords (for example, 3-5). Enter these keywords into a general-purpose search engine to obtain a list of query results, then collate the results and extract their URLs to obtain a set of URLs for relevant information.
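The URL extraction in this step can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the search result page has already been saved as HTML, and the names `extract_seed_urls` and `_LinkCollector` are hypothetical. It collects every link, resolves it against the result page address, drops fragments and non-HTTP links, and removes duplicates.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_seed_urls(result_html, base_url):
    """Return de-duplicated absolute http(s) URLs in first-seen order."""
    parser = _LinkCollector()
    parser.feed(result_html)
    seen, seeds = set(), []
    for href in parser.hrefs:
        # Resolve relative links and strip any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if absolute not in seen:
            seen.add(absolute)
            seeds.append(absolute)
    return seeds
```

In practice the result list would still need manual or rule-based collation, as the step describes, because a result page also contains links that are unrelated to the topic.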

[0041] 2. Set the initial URLs and crawl web page information

[0042] Select Internet crawler software (such as Heritrix or Nutch) and set the URLs obtained in sub-step 1 of step 1 as the seed URLs in the software. Set parameters such as the number of pages to crawl (determined in advance) in the software, and then use the general Internet information crawling method (without topic-relevance judgment or timeliness prediction)...
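A general crawl of this kind, bounded by a page count and with no topic or timeliness logic, can be sketched as a breadth-first traversal. This is an illustrative stand-in and does not use the Heritrix or Nutch APIs; the function name `crawl` and the caller-supplied `fetch_links` callback are assumptions made for the example.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages):
    """Breadth-first crawl from the seeds, visiting at most max_pages pages.

    fetch_links(url) is a caller-supplied function (hypothetical here) that
    returns the list of URLs linked from the page at url.
    """
    unique_seeds = list(dict.fromkeys(seed_urls))  # de-duplicate, keep order
    queue = deque(unique_seeds)
    queued = set(unique_seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in queued:  # never enqueue the same URL twice
                queued.add(link)
                queue.append(link)
    return visited
```

Passing the link fetcher in as a parameter keeps the traversal testable against an in-memory link graph; a real deployment would delegate fetching, politeness delays, and storage to the crawler software.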

Abstract

The invention discloses a timely and efficient method for crawling Internet information and belongs to the technical field of information. The method comprises the following steps: (1) setting seed addresses, crawling and storing web page information, and determining the navigation pages; (2) crawling each navigation page more than once, and analyzing and labeling the crawled web pages; (4) building a topic judgment model and, for each website, a time series prediction model of navigation page changes; (5) predicting the next change time of each website's navigation page, determining the next crawl time, crawling the navigation page, and extracting the subpage addresses and anchor texts that have not yet been crawled; (6) using the topic judgment model to judge the subpage addresses and anchor texts extracted in the previous step, and processing them according to the judgment result; (7) based on the newly crawled topic-relevant pages, forming or updating the current change time series of each website's navigation page, and determining the next crawl time at which to crawl. The method guarantees the novelty and topical relevance of the collected information under a small load.
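The scheduling idea in steps (5) and (7) can be illustrated with a small sketch. The abstract does not specify the time series model, so the example below substitutes simple exponential smoothing over the intervals between observed changes; the function name `predict_next_change` and the parameter `alpha` are assumptions made for the example, not part of the disclosed method.

```python
def predict_next_change(change_times, alpha=0.5):
    """Estimate when a navigation page will next change.

    change_times: ascending timestamps (e.g. seconds) of observed changes.
    alpha: smoothing weight given to the most recent interval (assumed value).
    Uses exponential smoothing as a stand-in for the unspecified model.
    """
    if len(change_times) < 2:
        raise ValueError("need at least two observed change times")
    intervals = [b - a for a, b in zip(change_times, change_times[1:])]
    smoothed = intervals[0]
    for interval in intervals[1:]:
        # Blend the newest interval with the running estimate.
        smoothed = alpha * interval + (1 - alpha) * smoothed
    return change_times[-1] + smoothed
```

Each time a crawl finds new topic-relevant pages, the new change time would be appended to the series and the prediction recomputed, which corresponds to the update described in step (7).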

Description

Technical field

[0001] The invention belongs to the field of information technology, and in particular relates to a timely and efficient Internet information crawling method.

Background

[0002] With its rapid development, the Internet has become the largest public data source in the world, and its scale is still growing. In terms of content, the Internet contains a large amount of web page information linked together by hyperlinks, and a considerable part of it changes dynamically. On this basis, many services can be provided over the Internet, and the communication between people and organizations forms a virtual society that corresponds and relates to the real one. For this reason, Web data mining, which aims to find useful knowledge from the structure, content, and logs of the Internet, has received great attention and development, especially the content mining that takes the content...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 杨风雷, 黎建辉, 杨俊峰, 虞路清, 周园春
Owner: COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI