Lexical classification method and system and realization method

A vocabulary and document classification technology, applied in the field of document analysis, that addresses the problems of web pages being duplicated across categories and of a web page's contribution to a category not being clearly indicated.

Active Publication Date: 2013-07-03
CHINA MOBILE COMM GRP CO LTD

AI Technical Summary

Problems solved by technology

For example, in Sina's classification scheme, the first-level category is sports, while the second-level categories change with current events. The football and World Cup categories can exist at the same time, so some web pages appear in both. Existing technology does not solve this problem.
[0013] 3. Existing approaches do not clearly indicate how much each web page contributes to a category, even though a web page's importance to a category can be obtained and has considerable application value.


Examples

Embodiment Construction

[0095] The preferred embodiments of the present invention will be described below in conjunction with the accompanying drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the present invention, and are not intended to limit the present invention.

[0096] Figure 1 is a schematic structural diagram of the vocabulary classification system in Embodiment 1 of the present invention. As shown in Figure 1, the vocabulary classification system includes a document classification training set module, a document preprocessing module, a word frequency statistics module, a term frequency-inverse document frequency (TF-IDF) value calculation module, a vocabulary category generation module and a document contribution degree calculation module.

[0097] The document classification training set module stores the document classification training set and provides it to the do...

Abstract

The invention discloses a lexical classification method, system, and implementation method. The lexical classification method comprises the following steps: first, obtaining a document classification training set comprising documents and the category to which each document belongs; second, preprocessing all the documents to form the words to be classified; third, obtaining the TF (term frequency) value and the IDF (inverse document frequency) value of each word to be classified in one document category, and summing the TF value and the IDF value to obtain a TF-IDF value; and last, dividing that TF-IDF value by the sum of the word's TF-IDF values over all document categories, taking the quotient as the probability that the word belongs to the document category, and generating a lexical classification database comprising the words to be classified, the categories corresponding to each word, and the probability that each word belongs to each corresponding category. With this technical scheme, lexical classification can be completed automatically, at very low cost and with more accurate results.
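The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: documents are assumed to be already preprocessed into token lists, TF is computed per category as a relative frequency, and IDF uses a smoothed logarithm so it never reaches zero. The abstract as rendered here says the TF and IDF values are summed, whereas conventional TF-IDF multiplies them, so the hypothetical `combine` parameter leaves that choice open (defaulting to the conventional product). The names `build_lexicon` and `TRAINING_SET` are illustrative only.

```python
import math
import operator
from collections import Counter, defaultdict


def build_lexicon(training_set, combine=operator.mul):
    """Return {word: {category: probability}} from (category, tokens) pairs.

    Assumption: `combine` merges TF and IDF into one score. Pass
    operator.add to follow the abstract's wording literally.
    """
    counts = defaultdict(Counter)   # per-category word counts
    doc_freq = Counter()            # number of documents containing each word
    for category, tokens in training_set:
        counts[category].update(tokens)
        doc_freq.update(set(tokens))
    n_docs = len(training_set)

    # Score every (word, category) pair.
    scores = defaultdict(dict)
    for category, counter in counts.items():
        total = sum(counter.values())
        for word, count in counter.items():
            tf = count / total
            # Assumed smoothing: log(1 + N/df) stays positive for every word.
            idf = math.log(1 + n_docs / doc_freq[word])
            scores[word][category] = combine(tf, idf)

    # Normalize each word's scores across categories into probabilities.
    lexicon = {}
    for word, by_category in scores.items():
        denom = sum(by_category.values())
        lexicon[word] = {c: s / denom for c, s in by_category.items()}
    return lexicon


# Toy training set (invented for illustration): (category, tokens).
TRAINING_SET = [
    ("sports", ["yao", "ming", "basketball"]),
    ("sports", ["football", "match"]),
    ("entertainment", ["yao", "ming", "movie", "star"]),
]

lexicon = build_lexicon(TRAINING_SET)
for word in ("yao", "football"):
    print(word, lexicon[word])
```

Because each word's scores are divided by their sum over categories, the probabilities for any word add up to 1, which matches the layout of a lexical classification database that stores a word, its categories, and one probability per category.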

Description

Technical field

[0001] The present invention relates to the technical field of document analysis, in particular to a vocabulary classification method, system and implementation method.

Background technique

[0002] In document analysis, a thesaurus is an important resource that can be used for many purposes and in many fields. For example, in user behavior analysis, a user's basic category can be determined from the keywords the user enters: the search keywords entered by the user are collected, the category of each keyword is looked up in the thesaurus, and the user is then labeled with a category.

[0003] Table 1

[0004]
Vocabulary   Category         Confidence probability
Yao Ming     sports           90%
             entertainment    10%
fund         finance          72%
             public welfare   28%

[0005] As shown in Table 1, vocabulary classification can be used in dictionary editing, sema...
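The user-labeling use case described in the background can be sketched as a simple lookup. This is a hedged illustration: the lexicon holds the Table 1 entries (with category labels rendered in English), and the aggregation rule, summing each keyword's category probabilities and picking the highest total, is an assumption, since the text only says that keyword categories are looked up and the user is then labeled. The names `LEXICON` and `tag_user` are hypothetical.

```python
from collections import defaultdict

# Table 1 entries: word -> {category: confidence probability}.
LEXICON = {
    "Yao Ming": {"sports": 0.90, "entertainment": 0.10},
    "fund": {"finance": 0.72, "public welfare": 0.28},
}


def tag_user(keywords, lexicon):
    """Return the category with the highest summed probability, or None.

    Assumption: keywords missing from the lexicon are ignored.
    """
    totals = defaultdict(float)
    for keyword in keywords:
        for category, prob in lexicon.get(keyword, {}).items():
            totals[category] += prob
    if not totals:
        return None
    return max(totals, key=totals.get)


print(tag_user(["Yao Ming", "fund"], LEXICON))
```

Summing probabilities lets a user who searches several sports-related terms accumulate evidence for that category; other rules (for example, majority vote over each keyword's top category) would fit the description equally well.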

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30, G06F17/27
Inventor: 徐萌何洪凌邓超罗治国孙少陵陶涛
Owner: CHINA MOBILE COMM GRP CO LTD