
Document classification method based on Hadoop data mining

A document classification and data mining technology, applied in the field of data classification, that improves stability, provides strong analysis capability, and reduces computational complexity.

Status: Inactive
Publication Date: 2018-07-10
NANJING UNIV OF POSTS & TELECOMM
6 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

In the prior art, document classification requires a huge number of text similarity matching calculations, and the time and space they consume are more than the system can bear.



Examples


Embodiment Construction

[0031] The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

[0032] Referring to Figure 1, the present invention provides a technical solution: a document classification method based on Hadoop data mining, comprising the following steps:

[0033] A. Preprocessing the data documents, and determining each keyword in the data document library and the correspondence between each keyword and the document to which it belongs;

[0034] B. Using the attribute feature conversion method to describe the attribute feature of the da...
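Step A can be illustrated with a minimal Python sketch that maps each keyword to the documents containing it. The tokenization rule, the stop-word list, and the names `STOP_WORDS`, `extract_keywords`, and `build_keyword_index` are assumptions made for illustration; the patent text shown here does not specify how keywords are extracted.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the patent does not say how keywords are selected.
STOP_WORDS = {"a", "an", "the", "of", "and", "on", "in", "is", "to"}

def extract_keywords(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_keyword_index(documents):
    """Map each keyword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for keyword in extract_keywords(text):
            index[keyword].add(doc_id)
    return dict(index)

docs = {
    "d1": "Hadoop stores data on a distributed file system",
    "d2": "Document classification on Hadoop",
}
index = build_keyword_index(docs)
print(sorted(index["hadoop"]))  # ['d1', 'd2']
```

Storing the correspondence as a keyword-to-documents map makes it cheap to look up which documents share a keyword, which is one plausible way to narrow the later similarity comparisons.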


Abstract

The invention discloses a document classification method based on Hadoop data mining. The method comprises the following steps: A. preprocessing the data documents to determine the keywords and the correspondence between each keyword and the document to which it belongs; B. describing the attribute characteristics of the data in a document by means of attribute feature transformation; C. using a matching rule to generate keyword vectors from the keyword set, and generating concept vectors from the keyword vectors and the data attribute characteristics obtained in step B; D. calculating the similarity between any two text documents among the data documents to be classified, according to the keyword vectors and concept vectors from step C; E. performing a clustering-based classification operation on the attribute vectors to obtain a classification result, which indicates the class of the target object corresponding to each attribute vector; F. having Hadoop automatically collect the above classification results and classify the data documents accordingly. The invention has the notable advantages of easy implementation and high classification accuracy.
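Steps C to E of the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: term-frequency keyword vectors, cosine similarity as the similarity measure, and a greedy threshold grouping standing in for the clustering step. None of these choices is confirmed by the text shown here, concept vectors are omitted because their construction is not described, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def keyword_vector(keywords, vocabulary):
    """Term-frequency vector of a document over a fixed, ordered vocabulary."""
    counts = Counter(keywords)
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def group_by_similarity(vectors, threshold):
    """Greedy grouping (an assumed stand-in for clustering): each document
    joins the first group whose first member is at least `threshold` similar."""
    groups = []
    for doc_id, vec in vectors.items():
        for group in groups:
            if cosine_similarity(vec, vectors[group[0]]) >= threshold:
                group.append(doc_id)
                break
        else:
            groups.append([doc_id])
    return groups

# Keyword lists as they might come out of step A (illustrative data).
docs = {
    "d1": ["hadoop", "hdfs", "storage"],
    "d2": ["hadoop", "hdfs", "cluster"],
    "d3": ["image", "pixel", "filter"],
}
vocabulary = sorted({k for kws in docs.values() for k in kws})
vectors = {doc_id: keyword_vector(kws, vocabulary) for doc_id, kws in docs.items()}

# Step D: similarity between any two documents.
pairwise = {(a, b): cosine_similarity(vectors[a], vectors[b])
            for a, b in combinations(docs, 2)}

# Step E: group documents whose similarity reaches the threshold.
groups = group_by_similarity(vectors, threshold=0.5)
print(groups)  # [['d1', 'd2'], ['d3']]
```

Comparing every pair of n documents takes n(n-1)/2 similarity computations, which is the cost the method aims to make tractable by distributing the work with Hadoop.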

Description

Technical Field

[0001] The invention belongs to the technical field of data classification, and in particular relates to a document classification method based on Hadoop data mining.

Background Technique

[0002] Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and suits applications with very large data sets. HDFS relaxes some POSIX requirements so that file system data can be accessed as streams.

[0003] With the rapid development of Internet technology, the number of network documents is growing explosively. This mass of documents makes it convenient for users to obtain documents, but it also makes it very challenging to find documents that are usable and meet expectations. Document classification technology is a technology for classifying documents efficiently. This method quickl...


Application Information

IPC(8): G06F17/30
CPC: G06F16/182, G06F16/334, G06F16/35
Inventors: 王海勇, 窦敏
Owner: NANJING UNIV OF POSTS & TELECOMM