A distributed topic discovery method and system for big data

A distributed topic discovery method applied in the news media industry, Web big data analysis, and the Internet industry. It addresses problems such as the inability to process massive data and an imperfect keyword scoring mechanism, and it reduces memory pressure and computing pressure.

Active Publication Date: 2017-03-29
INST OF COMPUTING TECH CHINESE ACAD OF SCI
3 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0008] To solve the above problems, the present invention addresses three weaknesses of the traditional single-pass incremental clustering method: it cannot handle massive data, it depends heavily on the input order, and its keyword scoring mechanism is imperfect. The invention proposes a distributed topic discovery method for big data that improves on the traditional single-pass process by using the big data processing framework Hadoop and its Map/Reduce mechanism. The data is split into smaller blocks, and local clustering is performed on these blocks on multiple Mappers, which reduces the memory and computing pressure on a single machine. The clusters generated on the Mapper side are then re-aggregated on the Reducer side to form a global clustering. Randomly extracted documents serve as the initial clustering seeds, which solves the problem of heavy dependence on the input order. An improved scoring mechanism takes both keyword frequency and document frequency into account, which reduces the interference of a very small number of "abnormal documents".
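The local-then-global flow described in [0008] can be sketched in plain Python, without Hadoop. This is a minimal illustration under stated assumptions, not the patented procedure: cosine similarity, the `threshold` value, and the function names are all hypothetical, and a random shuffle stands in for "randomly extracted documents as seeds". The global step also treats every local centroid equally, regardless of cluster size.

```python
import math
import random
from typing import Dict, List

Vector = Dict[int, float]  # sparse vector: word id -> weight


def normalize(vec: Vector) -> Vector:
    """Scale a sparse vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)


def cosine(a: Vector, b: Vector) -> float:
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def single_pass(vectors: List[Vector], threshold: float,
                rng: random.Random) -> List[Vector]:
    """Single-pass clustering; returns one unit-length centroid per cluster."""
    order = list(vectors)
    rng.shuffle(order)  # assumption: random order stands in for random seeds
    sums: List[Vector] = []  # running sum of member vectors per cluster
    for vec in order:
        vec = normalize(vec)
        best, best_sim = -1, threshold
        for i, total in enumerate(sums):
            sim = cosine(vec, normalize(total))
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            sums.append(dict(vec))  # no cluster is close enough: start a new one
        else:
            for k, v in vec.items():
                sums[best][k] = sums[best].get(k, 0.0) + v
    return [normalize(total) for total in sums]


def distributed_cluster(blocks: List[List[Vector]], threshold: float = 0.5,
                        seed: int = 0) -> List[Vector]:
    """Cluster each block locally ("Mapper" side), then merge ("Reducer" side)."""
    rng = random.Random(seed)
    local = [c for block in blocks for c in single_pass(block, threshold, rng)]
    return single_pass(local, threshold, rng)
```

For example, `distributed_cluster([[{1: 1.0}, {1: 2.0}, {2: 1.0}], [{1: 3.0}, {2: 5.0}]])` groups the documents that use only word 1 separately from those that use only word 2, whichever block they arrived in.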

Examples

Embodiment Construction

[0048] The present invention is built on the open-source software platform Hadoop and uses the Map/Reduce programming framework to improve the traditional single-pass clustering topic discovery process so that the computation becomes distributed. The framework is used for parallel computation over large-scale, TB-level data sets and follows a "divide and conquer" approach: operations on a large data set are distributed to sub-nodes managed by a master node, and the final result is obtained by integrating the intermediate results of each sub-node.

[0049] The technical solution as a whole is divided into three Map/Reduce processes.

[0050] The initial input is a text file stored on HDFS (the Hadoop Distributed File System) that contains all documents to be processed. Each line has the format: "webpage name $ webpage title name\t word number in the dictionary: word frequency".
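To make the line format in [0050] concrete, here is a small parsing sketch. It assumes details the excerpt does not state: that a line may carry several "word number: word frequency" pairs separated by whitespace, and that "$" never appears inside the webpage name. The function name `parse_document_line` is hypothetical.

```python
from typing import Dict, Tuple


def parse_document_line(line: str) -> Tuple[str, str, Dict[int, int]]:
    """Parse one input line into (webpage name, title, {word id: frequency})."""
    # The header and the word list are separated by a tab.
    header, _, body = line.rstrip("\n").partition("\t")
    # The webpage name and the title are separated by "$".
    page_name, _, title = header.partition("$")
    vector: Dict[int, int] = {}
    for token in body.split():  # assumption: pairs are whitespace-separated
        word_id, _, freq = token.partition(":")
        vector[int(word_id)] = vector.get(int(word_id), 0) + int(freq)
    return page_name.strip(), title.strip(), vector
```

Calling `parse_document_line("page1$Some title\t3:2 17:1 42:5")` yields the name `"page1"`, the title `"Some title"`, and a sparse vector mapping dictionary ids 3, 17, and 42 to their frequencies.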

[0051] The big...

Abstract

The invention relates to a distributed topic discovery system and method for big data. The system and method comprise three parallel processing processes, each composed of one or more of a Map functional module, a Combine functional module, and a Reduce functional module. The feature vector of each input document is normalized, and the number of occurrences of each word in the documents is counted. On the Map side, each document is treated as an original cluster and document frequency is counted. On the Combine side, the original clusters generated on the Map side are clustered locally to produce local clusters. On the Reduce side, the local clusters produced by the Combine modules on multiple remote physical nodes are clustered to produce global clusters. The keywords within the global clusters generated in the second Map/Reduce process are scored and ranked, and the K highest-scoring keywords requested by the user are output to represent the topics. As a result, TB-level data can be processed, computing capacity scales linearly, truly distributed computation is achieved, and performance and efficiency are improved.
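The final scoring and ranking step can be illustrated as follows. The excerpt does not give the patent's actual scoring formula, so the score below (total keyword frequency multiplied by the fraction of cluster documents that contain the keyword) is only an assumed example of combining keyword frequency with document frequency; `top_k_keywords` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_k_keywords(cluster_docs: List[Dict[int, int]],
                   k: int) -> List[Tuple[int, float]]:
    """Return the k best (word id, score) pairs for one cluster."""
    term_freq: Counter = Counter()  # total occurrences of each word
    doc_freq: Counter = Counter()   # number of documents containing each word
    for doc in cluster_docs:
        for word_id, freq in doc.items():
            term_freq[word_id] += freq
            doc_freq[word_id] += 1
    n_docs = len(cluster_docs)
    # Assumed score: frequency weighted by how widely the word is spread.
    scores = {w: term_freq[w] * doc_freq[w] / n_docs for w in term_freq}
    # Highest score first; ties are broken by the smaller word id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
```

With `docs = [{1: 2, 2: 1}, {1: 1, 3: 10}, {1: 3, 2: 2}]`, word 3 has the highest raw frequency but appears in only one document, so `top_k_keywords(docs, 2)` ranks word 1 ahead of it. This is the kind of damping of a single "abnormal document" that the improved scoring mechanism is meant to provide.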

Description

Technical field

[0001] The present invention relates to the Internet industry, the news media industry, and the Web big data analysis industry, and in particular to a distributed topic discovery method and system for big data.

Background

[0002] The main task of topic discovery is to aggregate a large number of news reports that discuss the same event or related topics into the same cluster in order to reduce duplication and redundancy. For governments and telecom operators, topic discovery over massive volumes of news and comments helps them understand social conditions and public opinion faster and in real time. In processing order, topic discovery can be divided into the following steps: event-related web page crawling, web page text parsing, text word segmentation, dictionary generation, text modeling, and single-pass incremental clustering of the text. Event-related web page crawling uses crawler tools to capture original Internet information related to cu...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/182, G06F16/35, G06F16/36
Inventor: 吴新宇, 何清, 庄福振, 敖翔
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI