RSS (Really Simple Syndication)-based multi-thread graphic information synchronization crawling control method and system

A control method and information synchronization technology, applied in the field of web crawlers, that addresses problems such as lack of pertinence, missing semantic information, and difficulty supporting semantic queries, and that improves access efficiency, enhances effectiveness, and better matches user needs.

Inactive Publication Date: 2012-07-25
EAST CHINA NORMAL UNIV
2 Cites, 20 Cited by

AI Technical Summary

Problems solved by technology

[0004] However, traditional crawlers have certain limitations in practical applications. First, users in different fields and with different backgrounds often have different retrieval purposes and needs, but the results returned by traditional crawlers are generic and untargeted, including a large number of... Second, with the development of the World Wide Web, large amounts of information in different data types have appeared, such as pictures, databases, audio, and video, which the inherent methods of traditional crawlers cannot handle. Finally, simple crawling by traditional crawlers lacks semantic information, which makes semantic queries difficult to support.



Examples


Embodiment Construction

[0058] The purpose of the present invention is to propose a multi-threaded method, based on RSS technology, for synchronously capturing web page information. The method constructs a focused crawler that classifies and acquires the pictures and text in a web page using a breadth-first strategy, and uses the weight contribution of hyperlink information to improve the crawler's search strategy, so that content is filtered and extracted effectively and both matching accuracy and speed are maximized. In particular, picture data, which traditional crawlers do not handle well, is analyzed and processed in a targeted way, so that picture and text information is captured synchronously and in real time and the captured information is more complete.
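The separate handling of pictures and text described above can be illustrated with a minimal sketch. It is not the patent's implementation: it assumes plain HTML input and uses Python's standard html.parser module, and the names TextImageExtractor and classify_page are hypothetical, introduced only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TextImageExtractor(HTMLParser):
    """Collects visible text and image URLs from one HTML page (illustrative only)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []      # absolute image URLs, in document order
        self.text_parts = []  # non-empty text fragments
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative image paths against the page URL.
                self.images.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style and not pure whitespace.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def classify_page(html, base_url):
    """Return the page's text and picture URLs as two separate groups."""
    parser = TextImageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.text_parts), "images": parser.images}
```

Keeping the two groups in one result per page is one simple way to keep a page's pictures associated with its text; the patent's own synchronization mechanism is not shown in this excerpt.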

[0059] To achieve the above object, the technical solution adopted by the present invention is as follows: for the network information to be captured, analyze the different characteristics of text and pictures, analyze the XML file...
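As a hedged illustration of the XML analysis step, the sketch below reads an RSS 2.0 document with Python's standard xml.etree.ElementTree module and returns each item's link together with any image enclosures. It assumes a standard RSS 2.0 layout (item, title, link, enclosure); the function name parse_rss and the returned dictionary shape are hypothetical and do not come from the patent text.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Extract title, link, and image enclosure URLs from each RSS 2.0 item."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # an item without a link gives the crawler nothing to fetch
        images = [
            enc.get("url")
            for enc in item.findall("enclosure")
            if (enc.get("type") or "").startswith("image/") and enc.get("url")
        ]
        entries.append({
            "title": (item.findtext("title") or "").strip(),
            "link": link,
            "images": images,
        })
    return entries
```

The item links returned here could serve as the seed URLs for a crawl, which is how this sketch would connect to the steps in the abstract.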


Abstract

The invention provides an RSS (Really Simple Syndication)-based multi-thread graphic information synchronization crawling control method for classifying and acquiring the picture and text information in a webpage. The method comprises the following steps: a, analyzing a target webpage file to be crawled through an RSS document analysis program; b, acquiring the uniform resource locator (URL) of the target webpage; c, filtering and analyzing the data of the target webpage, and acquiring the URLs of useful information by a breadth-first strategy; d, storing the URLs of the useful information; e, downloading the webpage content corresponding to each URL stored in step d; and f, executing step a on each webpage content downloaded in step e. The invention also provides an RSS-based multi-thread graphic information synchronization crawling control system. The control method and the control system have the following advantages: (1) a suitable recall ratio and a high precision ratio; (2) modular functions and high portability; (3) pertinence; (4) real-time operation; and (5) maintainability.
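The loop formed by steps a to f can be sketched as a level-by-level breadth-first crawl in which each level is downloaded by a thread pool. This is one reading of the abstract, not the patented implementation. The function crawl_breadth_first and its parameters fetch, extract_links, is_useful, max_depth, and workers are all hypothetical; the download, link extraction, and filtering logic is passed in by the caller, so the sketch makes no assumption about how the patent performs them.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_breadth_first(seed_urls, fetch, extract_links, is_useful,
                        max_depth=2, workers=4):
    """Breadth-first crawl; each depth level is downloaded in parallel.

    fetch(url) -> page body
    extract_links(url, body) -> iterable of URLs found in that page
    is_useful(url) -> bool, the filter applied before a URL is stored
    """
    frontier = list(dict.fromkeys(seed_urls))  # de-duplicate, keep order
    seen = set(frontier)
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_depth + 1):
            if not frontier:
                break
            # Step e: download every stored URL of this level concurrently.
            bodies = list(pool.map(fetch, frontier))
            next_frontier = []
            for url, body in zip(frontier, bodies):
                pages[url] = body
                # Steps a-c and f: analyze each downloaded page and filter its links.
                for link in extract_links(url, body):
                    if link not in seen and is_useful(link):
                        seen.add(link)
                        next_frontier.append(link)  # step d: store useful URLs
            frontier = next_frontier
    return pages
```

Processing one whole level before moving to the next is what makes the traversal breadth-first; the seen set prevents the same URL from being downloaded twice when several pages link to it.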

Description

Technical field

[0001] The invention relates to the real-time capture of graphic and text information on webpages, and belongs mainly to the technical field of web crawlers. Specifically, the present invention relates to an RSS-based multi-thread graphic information synchronous crawling control system and a corresponding control method.

Background technique

[0002] With the development of the Internet, information floods the entire network environment, which makes it convenient for people to obtain information. However, how to extract the information one needs from this vast amount of data is an urgent problem. Web crawler technology came into being against this background. A web crawler is a program that automatically extracts web pages. It downloads web pages from the World Wide Web for search engines and is an important component of search engines.

[0003] Traditional crawlers start from the URL of one or several initial webpages, ob...


Application Information

IPC(8): G06F17/30
Inventor 吕钊, 李琴, 黄小霞, 俞云飞, 梁璐, 蔡颂梅, 陈鹏
Owner EAST CHINA NORMAL UNIV