High-efficiency text data mining method

A high-efficiency text data mining technology, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It can solve problems such as load imbalance, which lowers both overall efficiency and computing efficiency, with the effects of reducing data volume, avoiding overhead, and reducing quantity.

Active Publication Date: 2012-04-04
COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI

AI Technical Summary

Problems solved by technology

From the perspective of the entire calculation process, such an implementation can produce the results, but several problems prevent further improvement of the calculation efficiency:
[0008] (1) When reading files, because each web page text is small (usually a few kilobytes) but the files are numerous, the initial file reading and Map task creation impose a heavy burden.
[0009] (2) Because the distribution of input files to the Map nodes (Map tasks) does not consider file size or the word distribution within the files, the workload differs from one Map node (Map task) to another, which causes load imbalance among them. In the Map/Reduce computing model, however, the Map stage as a whole ends only after every Map node task has finished, so a local imbalance can reduce overall efficiency (assuming that every server has the same configuration, if the most heavily loaded Map task has not completed, the Map stage does not end even when all other Map tasks have finished).
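The effect described in item (2) can be illustrated with a minimal sketch. It assumes, hypothetically, that the running time of a Map task is proportional to its load and that all servers are identical; the function name `map_stage_time` and the load figures are illustrative and do not come from the patent.

```python
def map_stage_time(task_loads, speed=1.0):
    """Duration of the whole Map stage under the stated assumptions.

    The stage ends only when the most heavily loaded task ends,
    so its duration is set by the maximum load, not the average.
    """
    return max(task_loads) / speed


# Same total work (100 units) split across four Map tasks in two ways.
balanced = [25, 25, 25, 25]
skewed = [10, 10, 10, 70]

print(map_stage_time(balanced))
print(map_stage_time(skewed))
```

With the same total work, the skewed split takes longer because three tasks sit idle while the fourth finishes.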
However, this copy process is performed after the Map node task is executed and before the Reduce node task is executed. If the cache of the Map node is not large enough to hold these intermediate results, they are periodically written to the local disk, which means a l...

Embodiment Construction

[0042] The specific implementation of the present invention is further described below, taking as an example the extraction of keywords based on word frequencies calculated from webpage text. The implementation is based on the above-mentioned Hadoop system.

[0043] 1. File preprocessing (Preprocess) (as shown in Figure 4)

[0044] (1) This stage is completed before the frequency calculation starts, and is implemented through custom modification of the Archive tool in Hadoop.

[0045] (2) Suppose the number of webpage text files to be processed is n (each file has an id as a mark), the maximum number of Map tasks configured in the Hadoop system to run simultaneously is m, and the number of Reduce tasks is r; at the same time, set the adjustment coefficient k1 to an integer between 5000 and 10000 (assuming that t...

Abstract

The invention discloses a high-efficiency text data mining method and belongs to the technical field of information. The method comprises the following steps: 1) in the file preprocessing stage, combining the original files, whose contents have undergone word segmentation, into a plurality of new files; 2) in the data mapping (Map) stage, computing the total frequency of each word in the new files, the frequency of each word in each original file, the relative frequency pr, and the like, and sending the results to a redirection module; 3) in the redirection stage, computing the payload of each Reduce task and setting a payload indicator pay_i for each Reduce task; 4) judging whether the current word has already been allocated to a Reduce task; if it has not, allocating the current word to a task Reduce_j for which pay_j + pr*100 is less than or equal to the payload, and then updating the payload indicator pay_j of Reduce_j; otherwise, allocating the current word to the task Reduce_i to which it was already allocated; 5) in the data reduction (Reduce) stage, computing parameters such as the final frequency of each allocated word; and 6) according to the reduction result, extracting the words whose frequency is greater than a set threshold within a set range. The method greatly improves the efficiency of frequency computation and of data mining.
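Steps 2) to 4) of the abstract can be sketched in plain Python, outside Hadoop. This is a minimal illustration under stated assumptions, not the patented implementation: the relative frequency pr is assumed to be a word's share of all word occurrences, the payload is assumed to be an equal share of 100 percent per Reduce task, and the fallback for a word that fits no task is invented for the sketch. The names `map_stats` and `redirect` are hypothetical.

```python
from collections import Counter


def map_stats(files):
    """Compute word statistics for files that are already segmented.

    files: dict mapping file id -> list of words.
    Returns (total, per_file, pr). pr[w] is the share of all word
    occurrences that belong to w (assumed definition of "pr").
    """
    total = Counter()
    per_file = {}
    for file_id, words in files.items():
        counts = Counter(words)
        per_file[file_id] = counts
        total.update(counts)
    n_occurrences = sum(total.values())
    pr = {w: c / n_occurrences for w, c in total.items()}
    return total, per_file, pr


def redirect(word_stream, pr, r):
    """Assign each word arriving in word_stream to one of r Reduce tasks.

    Returns (assignment, pay), where pay[i] is the load indicator of task i.
    """
    payload = 100.0 / r  # assumption: equal share of 100 percent per task
    pay = [0.0] * r
    assignment = {}
    for word in word_stream:
        if word in assignment:
            # Already allocated: the word stays with its existing task.
            continue
        cost = pr[word] * 100
        # First task j with pay_j + pr*100 <= payload.
        target = next((j for j in range(r) if pay[j] + cost <= payload), None)
        if target is None:
            # Assumed fallback (not in the source): least-loaded task.
            target = min(range(r), key=lambda j: pay[j])
        assignment[word] = target
        pay[target] += cost
    return assignment, pay


# Illustrative input: the word "a" arrives twice, as it would from two Map tasks.
pr_example = {"a": 0.45, "b": 0.30, "c": 0.15, "d": 0.04, "e": 0.06}
assignment, pay = redirect(["a", "b", "a", "c", "d", "e"], pr_example, r=2)
print(assignment)
print(pay)
```

Keeping a repeated word on the task it was first assigned to is what lets the Reduce stage compute that word's final frequency in one place.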

Description

Technical field

[0001] The invention belongs to the field of information technology and relates to an efficient text data mining method, mainly used in data mining, Web data mining, natural language processing, intelligent search, and other fields.

Background technique

[0002] With the rapid development of the Internet, it has become the largest public data source in the world, and its scale is still growing. Most of its content is linked together through hyperlinks, and a considerable part of it changes dynamically. On this basis, many services can be provided over the Internet, and communication among people, organizations, etc. forms a virtual society. For this reason, Web data mining, which aims to find useful knowledge from the structure, content, and logs of the Internet, has received great attention and development, especially content mining that uses the content on ...

Application Information

IPC(8): G06F17/30
Inventor: 杨风雷, 黎建辉, 吴开超, 薛正华, 张波
Owner: COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI