Distributed acquisition method and system oriented to user generated content

A distributed collection method for user generated content, applicable to transmission systems, special data processing applications, instruments, etc. It addresses the problems that existing schemes do not prioritize efficiency and cannot meet the high real-time requirements and diverse page collection requirements of news certification and early warning, and it achieves faster collection and improved real-time performance.

Active Publication Date: 2015-06-24
INST OF COMPUTING TECH CHINESE ACAD OF SCI
4 Cites, 34 Cited by

AI Technical Summary

Problems solved by technology

[0004] Judging from technical progress at home and abroad, existing distributed collection schemes focus on continuity and stability rather than on efficiency. In addition, current collection tasks focus on single-page collection, with each sub-node usually collecting one page. Th



Embodiment Construction

[0032] Figure 1 shows a framework diagram of a UGC news distributed collection system according to an embodiment of the present invention. The system includes a clue preprocessing module, a collection entity selection module, a collection cluster, a storage management module, a login management module and an anti-blocking management module. These modules are introduced separately below.

[0033] 1. Clue preprocessing module

[0034] The clue preprocessing module preprocesses the collection clues. A collection clue includes a short description or phrase about the news, the possible start time and end time of the news, and so on. It contains various news elements, but is often not suitable as direct input for subsequent data processing. The clue preprocessing module therefore performs word segmentation, keyword extraction, invalid word filtering, semantic entity recognition and other preprocessing on the collection clues to extract the news elements. These news elements wi...
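The preprocessing steps named in [0034] can be illustrated with a minimal sketch. It is not the patent's implementation: the `Clue` and `NewsElements` types, the `STOP_WORDS` list, the regex-based `segment` function and the frequency-based keyword ranking are all assumptions made for illustration, and semantic entity recognition is left out because the source does not say how it is done. A real system handling Chinese text would need a proper word segmenter in place of the regex.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical "invalid word" list; a real system would use a full lexicon.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "and", "to", "is", "was"}


@dataclass
class Clue:
    """A collection clue: short description plus an optional time window."""
    description: str
    start_time: Optional[str] = None
    end_time: Optional[str] = None


@dataclass
class NewsElements:
    """News elements extracted from a clue (assumed shape)."""
    keywords: List[str]
    start_time: Optional[str]
    end_time: Optional[str]


def segment(text: str) -> List[str]:
    """Naive word segmentation: lowercase alphanumeric runs (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def preprocess_clue(clue: Clue, top_k: int = 5) -> NewsElements:
    """Segment, filter invalid words, and keep the most frequent terms."""
    tokens = [t for t in segment(clue.description) if t not in STOP_WORDS]
    keywords = [word for word, _ in Counter(tokens).most_common(top_k)]
    return NewsElements(keywords, clue.start_time, clue.end_time)


if __name__ == "__main__":
    clue = Clue("Earthquake in the city: earthquake damage reported",
                start_time="2014-01-01", end_time="2014-01-02")
    print(preprocess_clue(clue, top_k=3))
```

Ranking by raw frequency is only a stand-in for keyword extraction; any weighting scheme could be substituted without changing the shape of the output.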


Abstract

The invention provides a distributed collection method oriented to user generated content. The method includes two steps. First, page types are divided according to the collection amount and collection difficulty of the pages to be collected, and collection tasks are built on the basis of the page types and added to a collection queue; the collection tasks include composite collection tasks, each formed by grouping a plurality of collection pages of the same type into a single task. Second, the collection tasks are taken out of the collection queue concurrently and executed, and the acquired information is fed back. The invention further provides a corresponding distributed collection system comprising a master control node and a plurality of sub-nodes, wherein the master control node constructs and maintains the collection task queue and the sub-nodes execute the collection tasks concurrently. The method and system collect quickly and markedly improve the real-time performance of UGC news collection; they are applicable to pages of different types and can execute diversified collection tasks; and they can avoid the monitoring measures of the collected objects.
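The two steps of the abstract can be sketched roughly as follows, under stated assumptions. The `Page` and `CollectionTask` types, the `batch_size` parameter, the placeholder `fetch` function and the page-type labels are hypothetical. Sub-nodes are modeled here as threads in one process pulling from a shared in-memory queue, whereas the patent describes a master control node and separate sub-nodes.

```python
import queue
import threading
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Page:
    url: str
    page_type: str  # hypothetical label derived from collection amount and difficulty


@dataclass
class CollectionTask:
    page_type: str
    urls: List[str]  # more than one URL makes this a composite task


def build_tasks(pages: List[Page], batch_size: int = 3) -> List[CollectionTask]:
    """Step 1: group same-type pages into (composite) collection tasks."""
    by_type: Dict[str, List[str]] = defaultdict(list)
    for page in pages:
        by_type[page.page_type].append(page.url)
    tasks = []
    for page_type, urls in by_type.items():
        for i in range(0, len(urls), batch_size):
            tasks.append(CollectionTask(page_type, urls[i:i + batch_size]))
    return tasks


def fetch(url: str) -> str:
    """Placeholder for real page collection (assumption: no network access)."""
    return f"content of {url}"


def run_tasks(tasks: List[CollectionTask], num_workers: int = 4) -> Dict[str, str]:
    """Step 2: workers take tasks from the queue concurrently and feed back results."""
    task_queue: "queue.Queue[CollectionTask]" = queue.Queue()
    for task in tasks:
        task_queue.put(task)
    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                task = task_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            fetched = {url: fetch(url) for url in task.urls}
            with lock:
                results.update(fetched)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    pages = [Page(f"http://example.com/light/{i}", "light") for i in range(5)]
    pages += [Page(f"http://example.com/heavy/{i}", "heavy") for i in range(2)]
    tasks = build_tasks(pages)
    print(len(tasks), "tasks,", len(run_tasks(tasks)), "pages collected")
```

Grouping several same-type pages into one task reduces the number of queue round trips per page, which is one plausible reading of why composite tasks would help collection speed; the source excerpt does not state the exact mechanism.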

Description

Technical field

[0001] The present invention relates to the technical field of information collection, and in particular to a distributed collection method and system for user generated content.

Background

[0002] User Generated Content is abbreviated as UGC. UGC news is news event information uploaded or shared by users on social media (such as Weibo, blogs and social networks). Because it responds promptly and spreads quickly, UGC has also become a major source of information for traditional media. With the popularization of Internet technology and the vigorous development of Web 2.0 technology, ordinary users have become the main producers of content on the Internet. However, because the threshold for UGC news is low and any user can upload content to the Internet, UGC news lacks effective supervision and contains a large amount of fake news.

[0003] UGC-based news certification and early warning ...


Application Information

IPC(8) H04L29/08, G06F17/30
CPC G06F16/95, H04L67/02
Inventor 张勇东, 吴波, 曹娟, 郭俊波, 李锦涛
Owner INST OF COMPUTING TECH CHINESE ACAD OF SCI