Document feature extraction method based on distributed mutual information

A document feature extraction method, applied in the field of document feature extraction based on distributed mutual information, that addresses limits on data processing scale and insufficient performance.

Active Publication Date: 2013-09-04
STATE GRID CORP OF CHINA +4
6 cites, 23 cited by

AI Technical Summary

Problems solved by technology

[0009] To address the bottlenecks of limited data processing scale and insufficient performance in massive document processing, the present invention provides a document feature extraction method based on distributed mutual information. The method uses the MapReduce distributed computing framework to extract the feature words used for document classification together with their weights, which improves the speed and scalability of document classification. Through the design of the key-value pairs, the weights of feature words in documents are calculated in parallel while the feature words are being extracted, which improves the efficiency of document classification.
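The key-value idea can be illustrated with a minimal single-process sketch in Python. It is not the patented implementation: the excerpt does not show the actual key design, so the composite key `(word, category)` and the function names `map_document`, `reduce_counts`, and `run_job` are assumptions made for illustration. The map phase emits one pair per distinct word in a document, and the reduce phase sums the pairs that share a key, which yields the per-category document frequencies that a mutual information computation could start from.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, str]  # hypothetical composite key: (word, category)


def map_document(category: str, words: List[str]) -> Iterator[Tuple[Key, int]]:
    """Map phase: emit ((word, category), 1) once per distinct word in a document."""
    for word in set(words):
        yield (word, category), 1


def reduce_counts(key: Key, values: Iterable[int]) -> Tuple[Key, int]:
    """Reduce phase: sum all values emitted for the same key."""
    return key, sum(values)


def run_job(corpus: List[Tuple[str, List[str]]]) -> Dict[Key, int]:
    """Simulate the shuffle step locally by grouping mapped pairs by key."""
    grouped: Dict[Key, List[int]] = defaultdict(list)
    for category, words in corpus:
        for key, value in map_document(category, words):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Toy corpus (made up for illustration): (category, segmented words)
corpus = [
    ("power", ["grid", "voltage", "grid"]),
    ("power", ["grid", "load"]),
    ("finance", ["market", "load"]),
]
doc_freq = run_job(corpus)
print(doc_freq[("grid", "power")])  # number of "power" documents containing "grid"
```

Because each key is reduced independently, a real MapReduce cluster could process different keys on different nodes; the local grouping here only stands in for that shuffle.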



Examples


Detailed Description of the Embodiments

[0042] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0043] As shown in Figure 1, a document feature extraction method based on distributed mutual information is provided. The method includes the following steps:

[0044] Step 1: Collect documents and initialize documents;

[0045] Step 2: Calculate the frequency of each segmented word in the documents and the mutual information value of each segmented word in the different categories, and select the feature word set accordingly;

[0046] Step 3: Calculate the weights of all feature words to form the final document vector set.
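Steps 2 and 3 can be sketched on a single machine as follows. This is an illustration under stated assumptions, not the method claimed in the patent: the excerpt gives neither the exact mutual information formula nor the weighting scheme, so the sketch uses the common pointwise form MI(t, c) = log(P(t, c) / (P(t) P(c))) estimated from document counts, and relative term frequency as an assumed weight. The names `mutual_information`, `select_features`, and `document_vector` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Doc = Tuple[str, List[str]]  # (category label, segmented words)


def mutual_information(docs: List[Doc]) -> Dict[Tuple[str, str], float]:
    """Pointwise MI(t, c) = log(P(t, c) / (P(t) * P(c))) from document counts."""
    n = len(docs)
    cat_count = Counter(cat for cat, _ in docs)
    word_count: Counter = Counter()
    joint_count: Counter = Counter()
    for cat, words in docs:
        for w in set(words):  # count each word once per document
            word_count[w] += 1
            joint_count[(w, cat)] += 1
    return {
        (w, cat): math.log(joint * n / (word_count[w] * cat_count[cat]))
        for (w, cat), joint in joint_count.items()
    }


def select_features(mi: Dict[Tuple[str, str], float], k: int) -> List[str]:
    """Keep the k words with the highest MI over any category (ties alphabetical)."""
    best: Dict[str, float] = {}
    for (w, _), score in mi.items():
        best[w] = max(score, best.get(w, float("-inf")))
    return sorted(best, key=lambda w: (-best[w], w))[:k]


def document_vector(words: List[str], features: List[str]) -> List[float]:
    """Assumed weighting: relative term frequency of each feature word."""
    tf = Counter(words)
    total = len(words) or 1
    return [tf[f] / total for f in features]


# Toy corpus (made up for illustration)
docs: List[Doc] = [
    ("power", ["grid", "voltage", "grid"]),
    ("power", ["grid", "load"]),
    ("finance", ["market", "load"]),
]
features = select_features(mutual_information(docs), k=2)
vectors = [document_vector(words, features) for _, words in docs]
```

Taking the maximum MI over categories is one of several ways to reduce per-category scores to a single score per word; a category-weighted average is another common choice.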

[0047] In step 1, initializing the documents includes word segmentation, simplification, and distributed representation of the documents.
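A rough sketch of such an initialization is shown below. Every detail is an assumption, since the excerpt does not specify the segmenter, the simplification rules, or the storage layout: the regular-expression tokenizer stands in for a real word segmenter (Chinese text would need a dedicated one), the stop list `STOP_WORDS` is hypothetical, and round-robin assignment to input splits stands in for whatever distributed representation the method actually uses.

```python
import re
from typing import Dict, List, Set

STOP_WORDS: Set[str] = {"the", "a", "of", "and", "is"}  # hypothetical stop list


def segment(text: str) -> List[str]:
    """Split text into lowercase word tokens (stand-in for a real word segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simplify(words: List[str]) -> List[str]:
    """Drop stop words; the actual simplification rules are not given in the excerpt."""
    return [w for w in words if w not in STOP_WORDS]


def partition(doc_ids: List[str], num_splits: int) -> Dict[int, List[str]]:
    """Assign documents to input splits round-robin (stand-in for distributed storage)."""
    splits: Dict[int, List[str]] = {i: [] for i in range(num_splits)}
    for index, doc_id in enumerate(doc_ids):
        splits[index % num_splits].append(doc_id)
    return splits


# Toy documents (made up for illustration)
raw = {"d1": "The load of the grid is high", "d2": "Market and grid"}
tokens = {doc_id: simplify(segment(text)) for doc_id, text in raw.items()}
splits = partition(sorted(tokens), num_splits=2)
```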

[0048] Step 1 comprises the following steps:

[0049] Step 1-1: Let $D = \{d_1, d_2, \ldots, d_j, \ldots, d_N\}$ represent the corpus, where $d_j$ represents a document in the corpus and $N$ represents the num...


Abstract

The invention provides a document feature extraction method based on distributed mutual information to address the bottlenecks of limited data processing scale and poor performance when processing large numbers of documents. The method comprises a first step of collecting and initializing the documents; a second step of calculating the frequency with which each segmented word appears in the documents and the mutual information value of each segmented word in the different categories, and selecting the feature word set accordingly; and a third step of calculating the weights of all feature words to form the final document vector set. The feature words for document classification and their weights are extracted using the MapReduce distributed computing framework, which improves the speed and scalability of document classification. Through the key-value pair design, the weights of the feature words in the documents can be calculated while the feature words are being extracted, which improves the efficiency of document classification.

Description

Technical field

[0001] The invention belongs to the technical field of distributed computing and data mining, and in particular relates to a document feature extraction method based on distributed mutual information.

Background

[0002] The rapid development of the Internet has brought with it an explosion of information. Processing the huge amount of data on the Internet is a severe test that Internet companies must face. To solve the problem of "rich data and poor information", massive data must be analyzed and mined. A common and practical way of processing massive data is to sort documents into categories, that is, document classification.

[0003] The task of document classification is to assign a document with an unknown category label to categories according to its content under a given classification system. With respect to a given set of categories, a document may be assigned to multiple categories or to none.

[0004] Common document...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
Inventor: 林为民, 张涛, 马媛媛, 邓松, 李伟伟, 时坚, 汪晨, 王玉斐, 周诚
Owner: STATE GRID CORP OF CHINA