
High-efficiency text data mining method

A high-efficiency text data technology, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It can solve problems such as load imbalance and reduced overall and computing efficiency, with the effects of reducing data volume, avoiding overhead, and reducing quantity.

Active Publication Date: 2013-03-20
COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI

AI Technical Summary

Problems solved by technology

Viewed across the entire calculation process, such an implementation can compute the results, but several problems limit further improvement of computing efficiency:
[0008] (1) When reading files, because each web page text is small (usually a few kilobytes) but the files are numerous, the initial file reading and Map task creation impose a heavy burden.
[0009] (2) Because input files are distributed to the Map nodes (Map tasks) without regard to file size or the word distribution within each file, the workload differs from one Map node (Map task) to another, which causes load imbalance among them. In the Map/Reduce computing model, the Map phase as a whole ends only after every Map node task has finished, so the slowest part limits the efficiency of the whole (a "short board" effect). Assuming that every server has the same configuration, if the most heavily loaded Map task has not completed, the Map phase does not end even if all other Map tasks have finished.
However, this copy process is performed after the Map node task is executed and before the Reduce node task is executed. If the cache of the Map node cannot hold the intermediate results, they are periodically written to the local disk, which causes a large number of disk write operations. After the Map task finishes, the intermediate results are copied from the Map node to the Reduce node, which in turn causes a large number of disk read operations and network transfers. Likewise, if the cache of the Reduce node cannot hold the copied intermediate results, they are first written to disk for later reading, which causes additional disk read and write operations.
All of this significantly delays the start of computation on the Reduce node.
[0011] (4) In the Map/Reduce computing model, the Reduce phase faces a situation similar to that of the Map nodes (Map tasks). Intermediate results are usually sorted and grouped (and copied to the designated Reduce task after grouping) by hashing the key, that is, the word. Given the distribution characteristics of words, this often causes load imbalance among Reduce nodes (Reduce tasks). Similarly, the Reduce phase as a whole ends only after all Reduce node tasks have finished, so the same short-board effect limits overall efficiency.
Together, these phenomena greatly reduce overall data mining efficiency.
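
The load-imbalance problems in (2) and (4) can be illustrated with a small sketch. It is not taken from the patent: the word frequencies are a hypothetical Zipf-like distribution, the function names are invented for illustration, and `zlib.crc32` stands in for whatever hash function a real partitioner uses. It shows that hashing the key ignores how often each word occurs, and that the phase lasts as long as its most heavily loaded task.

```python
import zlib


def hash_partition(word: str, num_reducers: int) -> int:
    """Pick a Reduce task by hashing the key (crc32 is used because it is stable across runs)."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def reducer_loads(word_counts: dict, num_reducers: int) -> list:
    """Sum the word occurrences that land on each Reduce task."""
    loads = [0] * num_reducers
    for word, count in word_counts.items():
        loads[hash_partition(word, num_reducers)] += count
    return loads


def phase_time(loads: list) -> int:
    """The phase ends only when the most heavily loaded task ends."""
    return max(loads)


# Hypothetical Zipf-like frequencies: word i occurs about 10000 / i times.
word_counts = {f"w{i}": 10000 // i for i in range(1, 201)}
loads = reducer_loads(word_counts, num_reducers=4)
ideal = sum(loads) / len(loads)

print("load per Reduce task:", loads)
print("perfectly balanced load:", ideal)
print("phase time (slowest task):", phase_time(loads))
```

Because a single frequent word is always sent to one task, the slowest task typically carries more than the balanced share, which is the gap the redirection stage of the method targets.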



Examples


Embodiment Construction

[0042] The specific implementation of the present invention is described below, taking keyword extraction based on word-frequency calculation over web page text as an example. The implementation is based on the Hadoop system described above.

[0043] 1. File preprocessing (Preprocess) (as shown in Figure 4)

[0044] (1) The work at this stage is completed before the frequency calculation starts, and it is implemented through custom modification of the Archive tool in Hadoop.

[0045] (2) Suppose the number of web page text files to be processed is n (each file has an id as a mark), the maximum number of Map tasks configured in the Hadoop system to run simultaneously is m, and the number of Reduce tasks is r; at the same time, set the adjustment coefficient k1 to an integer between 5000 and 10000 (assuming that t...
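
The general idea of the preprocessing stage, merging many small files into a few larger inputs, can be sketched as follows. This is only an illustration under stated assumptions: the source text is truncated before it says how the number of new files is derived from n, m, r and k1, so `num_groups` is a hypothetical parameter, and the largest-first greedy balancing by file size is a generic technique chosen for the sketch, not the patent's rule.

```python
import heapq


def group_files(file_sizes: dict, num_groups: int) -> list:
    """Assign file ids to num_groups merged files, balancing total size.

    Greedy rule (an assumption, not from the source): take files from
    largest to smallest and add each to the currently lightest group.
    """
    heap = [(0, g) for g in range(num_groups)]  # (total size so far, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    for file_id, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, g = heapq.heappop(heap)
        groups[g].append(file_id)
        heapq.heappush(heap, (total + size, g))
    return groups


# Hypothetical file ids and sizes in kilobytes.
sizes = {"f1": 4, "f2": 3, "f3": 3, "f4": 2, "f5": 2, "f6": 2}
groups = group_files(sizes, num_groups=2)
totals = [sum(sizes[f] for f in group) for group in groups]
print(groups, totals)
```

Each merged group would then be read by one Map task instead of one task per small file, which addresses the file-reading and task-creation burden described in problem (1).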


Abstract

The invention discloses a high-efficiency text data mining method and belongs to the technical field of information. The method comprises the following steps: 1) in a file preprocessing stage, combining the original files, whose contents have been word-segmented, into a number of new files; 2) in a data mapping (Map) stage, computing the total frequency of each word in the new files, the frequency of each word in each original file, the relative frequency pr, and so on, and sending the results to a redirection module; 3) in a redirection stage, computing the payload of each Reduce task and setting a payload indicator pay_i for each Reduce task; 4) determining whether the current word has already been allocated to a Reduce task: if not, allocating the current word to a task Reduce_j for which pay_j + pr*100 is less than or equal to the payload, and then updating the payload indicator pay_j of Reduce_j; otherwise, allocating the current word to its corresponding task Reduce_i; 5) in a data reduction (Reduce) stage, computing parameters such as the final frequency of each allocated word; and 6) according to the reduction results, extracting, within a set range, the words whose frequency is greater than a set threshold. The method greatly improves the efficiency of frequency computation and data mining.
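
The allocation rule in steps 3) and 4) can be sketched as follows. Several details are assumptions rather than statements from the abstract: pr is taken to be a fraction between 0 and 1 (so pr*100 is a percentage), the payload defaults to an equal share of 100 / r per Reduce task, candidate tasks are scanned in index order, and a least-loaded fallback is used when no task has room. The function and variable names are hypothetical.

```python
def redirect(words, num_reducers, payload=None):
    """Allocate words to Reduce tasks using per-task payload indicators.

    words: iterable of (word, pr) pairs in arrival order.
    Returns (assignment, pay), where assignment maps word -> task index
    and pay[j] is the payload indicator of task j.
    """
    if payload is None:
        payload = 100.0 / num_reducers  # assumption: equal share, in percent
    pay = [0.0] * num_reducers
    assignment = {}
    for word, pr in words:
        if word in assignment:
            # Already allocated: the word stays with its existing task.
            continue
        cost = pr * 100
        # First task j with pay[j] + pr*100 <= payload.
        target = next(
            (j for j in range(num_reducers) if pay[j] + cost <= payload), None
        )
        if target is None:
            # Fallback not specified in the source: use the least-loaded task.
            target = min(range(num_reducers), key=pay.__getitem__)
        assignment[word] = target
        pay[target] += cost  # update the payload indicator
    return assignment, pay


# Hypothetical relative frequencies, chosen to be exact in binary floating point.
stream = [("a", 0.5), ("b", 0.25), ("c", 0.125), ("d", 0.125), ("a", 0.5)]
assignment, pay = redirect(stream, num_reducers=2)
print(assignment)
print(pay)
```

In this example the frequent word "a" fills one task on its own while the three rarer words share the other, so both indicators end at the same value; a plain hash of the key gives no such guarantee.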

Description

Technical field

[0001] The invention belongs to the field of information technology and relates to an efficient text data mining method, which is mainly used in data mining, Web data mining, natural language processing, intelligent search, and other fields.

Background

[0002] With its rapid development, the Internet has become the largest public data source in the world, and its scale is still growing. Most of its content is linked together through hyperlinks, and a considerable part of it changes dynamically. As a result, many services can be provided on top of the Internet, and communication among people, organizations, and so on forms a virtual society. For this reason, Web data mining, which aims to find useful knowledge in the structure, content, and logs of the Internet, has received great attention and development, especially content mining that uses the content on ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventor: 杨风雷, 黎建辉, 吴开超, 薛正华, 张波
Owner: COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI