
Document Classification Method and System Based on LDA Topic Model

A document classification and topic model technology, applied to text database clustering/classification, unstructured text data retrieval, and related fields, that addresses the problems that LDA alone cannot meet classification requirements and cannot achieve classification in practice.

Active Publication Date: 2020-07-17
北京智通云联科技有限公司

AI Technical Summary

Problems solved by technology

LDA computes only a single optimal clustering, whereas in practice many attributes need to be mined, so LDA alone cannot meet classification requirements.
Pure LDA clustering cannot reflect human intent, so it cannot achieve classification in practice.



Examples


Embodiment 1

[0056] Following the method of the present invention, sentiment classification of the documents to be classified by title is taken as an example, as follows:

[0057] The initial supervision dictionary contains 10 preset words: "like", "first", "desperate", "divorce", "upgrade", "disband", "depressed", "transfer", "also", and "for". According to the sentiment of the words, four categories are preset, corresponding to four topics: positive, negative, neutral, and others. Negative includes "divorce", "disband", and "depressed"; neutral includes "transfer"; and others, which holds words unrelated to sentiment, includes "also" and "for".
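
As a minimal sketch of how such an initial supervision dictionary might be represented, the snippet below stores only the category assignments stated in [0057] and computes which preset words are left unassigned. The variable names are hypothetical, and treating the leftover words as the "positive" category is an assumption made by elimination, since the text does not list the positive words.

```python
# The ten preset words quoted in Embodiment 1.
initial_words = {
    "like", "first", "desperate", "divorce", "upgrade",
    "disband", "depressed", "transfer", "also", "for",
}

# Category assignments that the text states explicitly.
supervision_dict = {
    "negative": {"divorce", "disband", "depressed"},
    "neutral": {"transfer"},
    "others": {"also", "for"},
}

# Words that the text does not assign to any category.
assigned = set().union(*supervision_dict.values())
unassigned = initial_words - assigned

# ASSUMPTION: by elimination, the unassigned words form the "positive" category.
supervision_dict["positive"] = unassigned

# Inverted view (word -> category) for fast lookup during classification.
word_to_category = {
    word: category
    for category, words in supervision_dict.items()
    for word in words
}
```

The inverted `word_to_category` mapping is a design convenience: classification only needs to ask which topic a given word belongs to.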

[0058] A txt file contains 100 documents to be classified. After reading the 100 documents, first perform Chinese word segmentation on the title of each document to be classified, and then remove the words that appear in the title of the document to be ...


Abstract

The invention discloses a document classification method based on an LDA topic model, comprising the following steps: (1) pre-compiling an initial supervision dictionary, in which the words are divided into a plurality of classes that correspond one-to-one to the topics of the LDA topic model; (2) obtaining all the words in the documents to be classified, calculating the probability that each word belongs to each topic, and obtaining a clustering dictionary; (3) composing a new supervision dictionary from the clustering dictionary; (4) for each document to be classified, looking up the topics corresponding to the words of the new supervision dictionary that the document contains, taking the topic with the largest number of words as the topic of the document, and thereby completing the classification of the documents to be classified. The invention also discloses a document classification system based on the LDA topic model. The method retains the accuracy of rule-based classification while gaining the associative clustering ability of LDA; the classification results are accurate and the engineering effort is small.
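
The dictionary-building and voting steps described in the abstract can be illustrated with a short sketch. This is not the patented implementation: the word-topic probabilities below are made-up placeholder values standing in for the output of a fitted LDA model, and the `min_prob` threshold, the rule that initial entries win on conflict, and the `default` fallback topic are all assumptions introduced for illustration.

```python
from collections import Counter

TOPICS = ["positive", "negative", "neutral", "others"]

# Initial supervision dictionary (word -> topic), using the assignments
# stated in Embodiment 1.
initial_dict = {
    "divorce": "negative", "disband": "negative", "depressed": "negative",
    "transfer": "neutral",
    "also": "others", "for": "others",
}

# HYPOTHETICAL per-word topic probabilities, in TOPICS order. In practice
# these would come from a fitted LDA model; the numbers here are invented.
word_topic_prob = {
    "breakup":  [0.05, 0.80, 0.10, 0.05],
    "win":      [0.75, 0.05, 0.15, 0.05],
    "announce": [0.20, 0.15, 0.60, 0.05],
    "the":      [0.25, 0.25, 0.25, 0.25],
}

def build_clustering_dict(probs, topics, min_prob=0.5):
    """Assign each word to its most probable topic.

    ASSUMPTION: words whose best probability is below min_prob are dropped.
    """
    clustering = {}
    for word, p in probs.items():
        best = max(range(len(topics)), key=lambda i: p[i])
        if p[best] >= min_prob:
            clustering[word] = topics[best]
    return clustering

def merge_dicts(initial, clustering):
    """Build the new supervision dictionary.

    ASSUMPTION: hand-written initial entries override clustered ones.
    """
    merged = dict(clustering)
    merged.update(initial)
    return merged

def classify(tokens, supervision, default="others"):
    """Return the topic with the most matching dictionary words in tokens.

    ASSUMPTION: a document with no dictionary words gets the default topic.
    """
    counts = Counter(supervision[t] for t in tokens if t in supervision)
    if not counts:
        return default
    return counts.most_common(1)[0][0]

new_dict = merge_dicts(initial_dict, build_clustering_dict(word_topic_prob, TOPICS))
label_a = classify(["win", "transfer", "win"], new_dict)
label_b = classify(["breakup", "divorce", "announce"], new_dict)
label_c = classify(["the"], new_dict)
```

With these placeholder values, the first title is labeled positive (two positive words against one neutral), the second negative, and the third falls back to the default topic because "the" never clears the threshold. Ties are resolved in favor of the topic encountered first, which is another choice the abstract does not specify.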

Description

Technical field

[0001] The invention relates to the technical field of document classification, in particular to a document classification method and system based on an LDA topic model.

Background technique

[0002] Existing classification methods, whether rule-based, statistical, or deep learning methods, rely on a large labeled corpus. In reality, it is very difficult to obtain a fully annotated corpus, and often only part of the corpus and some keywords can be determined, so the task is to obtain high-precision classification results given only partial prior knowledge. The technical contradiction here is the contradiction between the whole and the part of the annotated corpus, between the infinite and the limited.

[0003] Statistical classification with limited samples, depending on the context, will seriously damage the recall rate of classification, that is, for some obvious classification results, unexpected classi...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35
Inventors: 史晓凌, 唐先明, 景帅, 刘锋, 陈新荣, 王晓丽
Owner: 北京智通云联科技有限公司