Document classification method and device

A document classification technology based on a feature frequency-inverse document frequency representation, applied in the computer field. It addresses unreasonable document classification and low semantic understanding and semantic recognition, and achieves a good classification effect, high semantic recognition and reasonable classification.

Active Publication Date: 2016-11-16
BEIJING QIYI CENTURY SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

[0005] The technical problem to be solved by the embodiments of the present invention is to provide a document classification method and device that address two problems common to prior-art document clustering methods: low semantic understanding and semantic recognition when classifying documents, and unreasonable document classification.


Embodiment Construction

[0027] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0028] One of the core ideas of the embodiments of the present invention is that each word segment of a document is represented by a vector produced by a deep neural network model, and the vectors of similar word segments are clustered. Subsequent classification is performed on the features obtained from this clustering, so the context of each word segment in its specific language environment is taken into account when classifying documents, and each class of documents has a high degree of semantic understanding and semantic recognition. In addition, based on preset termination conditions, the hierarchical clustering tree...
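The idea in [0028] of clustering word-segment vectors and using the resulting sets as features can be sketched as follows. This is a minimal illustration, not the patented implementation: the 2-D vectors are hypothetical stand-ins for the output of a trained deep neural network language model, the greedy threshold grouping is a swapped-in simplification of whatever clustering the embodiment uses, and the names `cluster_words`, `feature_tfidf` and the `log(N / df)` weighting are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical 2-D word vectors; in the source these come from a trained
# deep neural network language model.
WORD_VECTORS = {
    "film": (0.9, 0.1),
    "movie": (0.85, 0.15),
    "goal": (0.1, 0.9),
    "match": (0.2, 0.8),
}

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def cluster_words(vectors, threshold=0.95):
    """Greedy grouping (an assumption): a word joins the first set whose
    seed vector is at least `threshold` similar, else it starts a new set."""
    seeds = []        # one representative vector per similar-word set
    word_to_set = {}
    for word, vec in vectors.items():
        for set_id, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                word_to_set[word] = set_id
                break
        else:
            word_to_set[word] = len(seeds)
            seeds.append(vec)
    return word_to_set, len(seeds)

def feature_tfidf(docs, word_to_set, n_sets):
    """Build a documents x features matrix where each feature is a
    similar-word set, weighted by frequency times log(N / df)."""
    counts = [Counter(word_to_set[w] for w in doc if w in word_to_set)
              for doc in docs]
    df = [sum(1 for c in counts if c[s] > 0) for s in range(n_sets)]
    matrix = []
    for c in counts:
        total = sum(c.values()) or 1
        matrix.append([
            (c[s] / total) * math.log(len(docs) / df[s]) if df[s] else 0.0
            for s in range(n_sets)
        ])
    return matrix

# Toy documents, already split into word segments.
docs = [["film", "movie", "film"], ["goal", "match"], ["movie", "goal"]]
word_to_set, n_sets = cluster_words(WORD_VECTORS)
matrix = feature_tfidf(docs, word_to_set, n_sets)
```

Because "film" and "movie" fall into the same set, a document using either word contributes to the same feature column, which is how context-aware vectors influence the later classification steps.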

Abstract

The embodiments of the invention provide a document classification method and device. The document classification method comprises the following steps: each word segment in the documents to be classified is converted into a vector by training a deep neural network language model; the vectors are clustered to generate sets of similar word segments; the documents to be classified are converted into a feature frequency-inverse document frequency matrix, using the sets as features; the matrix is converted into a hierarchical clustering tree by calculating the similarity between the vectors of any two documents to be classified; and the hierarchical clustering tree is dynamically cut at different heights based on preset termination conditions to obtain the classified documents. With this method, the context of each word segment in its specific language environment is considered during document classification, so each class of documents has higher semantic understanding and semantic recognition. In addition, because the hierarchical clustering tree is dynamically cut at different heights based on the preset termination conditions, large differences in the number of documents between classes are avoided, and document classification is more reasonable.
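The last two steps of the abstract, building a hierarchical clustering tree from document vectors and cutting it at different heights, can be sketched as below. This is an illustration under stated assumptions rather than the claimed method: the visible text does not specify the preset termination conditions, so a maximum class size (`max_size`) is assumed here, average linkage over cosine distance is an assumed choice, and `build_tree` and `dynamic_cut` are hypothetical names.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; treats an all-zero vector as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def build_tree(rows):
    """Average-linkage agglomerative clustering (an assumed linkage).
    Each node is (members, left, right); leaves have left = right = None."""
    nodes = [((i,), None, None) for i in range(len(rows))]

    def linkage(a, b):
        pairs = [(i, j) for i in a[0] for j in b[0]]
        return sum(cosine_distance(rows[i], rows[j]) for i, j in pairs) / len(pairs)

    while len(nodes) > 1:
        # Merge the two closest clusters into a new parent node.
        x, y = min(
            ((p, q) for p in range(len(nodes)) for q in range(p + 1, len(nodes))),
            key=lambda pq: linkage(nodes[pq[0]], nodes[pq[1]]),
        )
        merged = (nodes[x][0] + nodes[y][0], nodes[x], nodes[y])
        nodes = [n for k, n in enumerate(nodes) if k not in (x, y)] + [merged]
    return nodes[0]

def dynamic_cut(node, max_size):
    """Descend from the root and stop on each branch as soon as the assumed
    termination condition (class size <= max_size) holds, so different
    branches are cut at different heights."""
    members, left, right = node
    if left is None or len(members) <= max_size:
        return [sorted(members)]
    return dynamic_cut(left, max_size) + dynamic_cut(right, max_size)

# Toy document vectors (rows of a feature matrix); values are illustrative.
rows = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
classes = sorted(dynamic_cut(build_tree(rows), max_size=2))
```

A single fixed-height cut would either leave the three similar documents in one large class or split everything into singletons; stopping per branch keeps class sizes closer to one another, which is the effect the abstract attributes to dynamic cutting.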

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a document classification method and a document classification device.

Background

[0002] The explosive growth of information on the Internet has made information harder to manage and use. To reveal the potentially valuable information or structure hidden in Web data, Web data mining technology has developed rapidly and been widely applied in recent years. Document clustering is one of the most important tools in the field of Web data mining. Prior-art document clustering methods mainly include K-means, hierarchical clustering methods, and the like.

[0003] However, the document clustering methods in the prior art still have the following problems: the context information of the words in the document in a specific context is not considered when classifying the document, so the resulting classified document...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/35, G06F16/367, G06F40/295
Inventor: 丁希晨
Owner: BEIJING QIYI CENTURY SCI & TECH CO LTD