Webpage information acquisition system and method

A webpage information acquisition system and method, applied in the field of webpage information acquisition. It addresses the problem that the accuracy of information obtained by general-purpose acquisition cannot meet the requirements of vertical fields, and it provides a clearly structured way to obtain vertical webpage information. Its effects are expanded system processing capacity, clear function division, and simpler control of repeated downloads.

Inactive Publication Date: 2013-01-30
ALIBABA (CHINA) CO LTD
Cites: 1; Cited by: 18

AI Technical Summary

Problems solved by technology

Therefore, the accuracy of information obtained by general, network-wide web page information acquisition cannot meet the requirements.
[0004] Current webpage acquisition systems mainly solve the problem of how to obtain the webpages that the system needs, but the accuracy of the information extracted from those pages cannot meet the requirements of the video vertical field. The video vertical field also places requirements on information update frequency, and existing page fetching cannot precisely control how often pages are updated.
[0005] Currently, there is no general-purpose system and method for obtaining vertical web page information with a clear structure, independent functions, and easy control of repeated downloads and update frequencies.



Examples


Embodiment 1

[0058] Suppose we want to download information about all videos on the page http://www.youku.com/. The specific implementation plan is as follows:

[0059] I. As shown in Figure 4, the preparations are as follows:

[0060] 1. Start the storage service to ensure normal access to data.

[0061] 2. Start the task queue service to ensure normal access to tasks.

[0062] 3. Implement a universal URL parser, which is responsible for parsing out the URLs that appear on any page. If a URL has associated information such as pictures, the parser is also responsible for parsing that out. Existing web page analysis technology is used to extract all links, titles, pictures, durations, high-definition status, and other information from any page. This step is a general analysis of any page on which a video link appears. For example, the page http://tv.youku.com/ does not itself have a video playback box, but it has many pictures and video titles pointing to videos. Using the analysis of th...
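The universal URL parser described in [0062] can be illustrated with a minimal Python sketch. It is not the patent's implementation: the class name GenericLinkParser, the function parse_links, and the choice of the standard-library html.parser module are assumptions made for illustration, and only links, titles, and pictures are collected (duration and high-definition status depend on site-specific markup that the text does not describe).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class GenericLinkParser(HTMLParser):
    """Hypothetical generic parser: collects every link on a page,
    together with the title and picture found inside that link."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # one dict per <a href="..."> element
        self._current = None   # the link being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._current = {
                "url": urljoin(self.base_url, attrs["href"]),
                "title": attrs.get("title") or "",
                "image": None,
            }
        elif tag == "img" and self._current is not None and attrs.get("src"):
            # A picture nested inside the link belongs to that URL.
            self._current["image"] = urljoin(self.base_url, attrs["src"])
            if not self._current["title"]:
                self._current["title"] = attrs.get("alt") or ""

    def handle_data(self, data):
        # Fall back to the link text when no title has been found yet.
        if self._current is not None and not self._current["title"]:
            self._current["title"] = data.strip()

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def parse_links(html, base_url):
    """Return a list of {"url", "title", "image"} dicts for one page."""
    parser = GenericLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Because the parser only looks at generic anchor and image tags, it can run on a listing page such as http://tv.youku.com/ that has no playback box of its own. Site-specific fields would need a dedicated resolver per website.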


Abstract

The invention provides a webpage information acquisition system and method. The system comprises a task delivery device, a task queue, a task actuator and a memory. All modules run independently as services, which guarantees good expandability of the whole system. The method comprises the steps of: packaging a URL into a task to be downloaded; delivering the task to the task queue; receiving the task through the task queue; acquiring the task from the task queue; acquiring the corresponding task resolver according to the task type; analyzing the downloaded webpage source code through the task resolver; storing the data; packaging each sub-URL parsed from the URL's webpage into a new task; and delivering the new task to the task queue again. The system and method can be applied to webpage information download in any vertical field. Resolvers for different websites can easily be added with only site-specific customization, which guarantees the accuracy of webpage information acquisition across the whole network, and repeated downloads and update frequencies are easy to control in a general-purpose way.
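The flow in the abstract can be sketched in Python as a single-process toy. This is only an illustration under stated assumptions: the patent runs each module as an independent service, whereas here the task queue is an in-memory deque, the memory is a plain dict, and the download function is injected by the caller. All names (Task, TaskQueue, TaskDispatcher, TaskExecutor, run_once) are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A URL packaged together with the type used to pick its resolver."""
    url: str
    task_type: str


class TaskQueue:
    """In-memory stand-in for the task queue service."""

    def __init__(self):
        self._tasks = deque()

    def put(self, task):
        self._tasks.append(task)

    def get(self):
        # Returns None when no task is waiting.
        return self._tasks.popleft() if self._tasks else None


class TaskDispatcher:
    """Stand-in for the task delivery device."""

    def __init__(self, queue):
        self.queue = queue

    def deliver(self, url, task_type):
        self.queue.put(Task(url, task_type))


class TaskExecutor:
    """Stand-in for the task actuator.

    resolvers maps a task type to a function (url, source) -> (data, sub_tasks),
    where sub_tasks is a list of (sub_url, sub_type) pairs.
    download is any function url -> page source (assumed, injected by the caller).
    """

    def __init__(self, queue, dispatcher, storage, resolvers, download):
        self.queue = queue
        self.dispatcher = dispatcher
        self.storage = storage
        self.resolvers = resolvers
        self.download = download

    def run_once(self):
        task = self.queue.get()
        if task is None:
            return False
        resolver = self.resolvers[task.task_type]   # pick resolver by task type
        source = self.download(task.url)            # fetch the page source
        data, sub_tasks = resolver(task.url, source)
        self.storage[task.url] = data               # store the parsed data
        for sub_url, sub_type in sub_tasks:         # re-deliver sub-URLs as new tasks
            self.dispatcher.deliver(sub_url, sub_type)
        return True
```

A caller would deliver a seed URL through the dispatcher and call run_once in a loop until it returns False. The sketch does not track already-visited URLs, so pages that link to each other would be fetched repeatedly. The control of repeated downloads and update frequency that the abstract mentions is deliberately left out here.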

Description

Technical field

[0001] The invention relates to the field of network video information acquisition, in particular to a web page information acquisition system and method.

Background technique

[0002] At present, web page information acquisition technologies mainly focus on two areas: the acquisition of web page information across the entire network, and the acquisition of vertical web page information.

[0003] Ordinary web search engines, such as Google Search (www.google.com) and Baidu Search (www.baidu.com), mainly extract important information such as text from each downloaded page according to certain conditions, but this extraction can tolerate less accurate information (for example, on a video playback page, it does not pay much attention to who the director is or how long the video is; likewise, on a TV program preview page, it does not need to care whether a program starts at 7:00 or 7:30), and ve...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 刘云剑, 姚健, 潘柏宇, 卢述奇
Owner: ALIBABA (CHINA) CO LTD