A method and device for obtaining classification weights of domain document terms

A technology concerning domain terms and their weights, applied in instruments, electrical digital data processing, computing, etc. It addresses the problem that word frequency alone cannot characterize document topics well (some low-frequency domain terms have great representational significance, while some high-frequency general words have little), in order to achieve a more accurate document representation.

Active Publication Date: 2021-06-15
CENT SOUTH UNIV

AI Technical Summary

Problems solved by technology

However, a general term-document matrix relies purely on the number of occurrences of a term to represent how well that term reflects the topic, taking terms that are high-frequency in a specific document and relatively low-frequency in other documents as topic terms. High or low word frequency alone cannot represent the theme of a document well. For example, a high-frequency common word such as "appearance" cannot characterize a document well, whereas the high-frequency word "mineralization" in geological documents can. In addition, low-frequency general words contribute little to document representation, but low-frequency domain terms such as "Qibaoshan Mine" are highly significant for it.
The TextRank algorithm ranks candidate keywords using the relationships between nearby words (a co-occurrence window), and does not consider the difference in representational power between domain terms and general terms.

Examples

Embodiment 1

[0041] As shown in Figure 1, the method for obtaining domain document term classification weights in this embodiment includes the following steps:

[0042] A1. Based on a preset plurality of terms that includes a plurality of specific terms, the term level corresponding to each term, and the word frequency corresponding to each specific term, obtain: the total number of domain terms, the total number of general terms, the total domain word frequency, the total general word frequency, the total number of terms at each term level, and the total word frequency of the specific terms at each term level.

[0043] The term levels include: multiple domain levels and multiple general levels.

[0044] The total domain word frequency is the sum of the word frequencies of the specific terms at the domain levels.

[0045] The total number of common word frequencies is: the total number of word frequencie...
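The aggregates named in step A1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the record layout `(term, level, freq)`, the level labels such as `"D1"` and `"G1"`, the sample frequencies, and the function name `aggregate_term_stats` are all hypothetical, since the excerpt does not say how levels are encoded. The weight formula itself is not reproduced because the source text is truncated before it.

```python
from collections import defaultdict

def aggregate_term_stats(records, domain_levels):
    """Compute the six A1 aggregates from (term, level, freq) records.

    `domain_levels` is the set of level labels treated as domain levels;
    every other label is treated as a general level (an assumption).
    """
    stats = {
        "domain_terms": 0, "general_terms": 0,
        "domain_freq": 0, "general_freq": 0,
        "terms_per_level": defaultdict(int),
        "freq_per_level": defaultdict(int),
    }
    for term, level, freq in records:
        kind = "domain" if level in domain_levels else "general"
        stats[kind + "_terms"] += 1          # count of terms of this kind
        stats[kind + "_freq"] += freq        # summed word frequency of this kind
        stats["terms_per_level"][level] += 1
        stats["freq_per_level"][level] += freq
    return stats

# Hypothetical sample data; level labels and frequencies are made up.
records = [
    ("mineralization", "D1", 40),
    ("Qibaoshan Mine", "D2", 3),
    ("appearance", "G1", 25),
    ("structure", "G2", 2),
]
stats = aggregate_term_stats(records, domain_levels={"D1", "D2"})
print(stats["domain_terms"], stats["domain_freq"])
```

Keeping the per-level counts and per-level frequency sums in separate dictionaries mirrors the way A1 lists them as distinct quantities, so a later weighting step could read each one directly.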

Embodiment 2

[0067] (1) Input description in this embodiment

[0068] The input is the term classification table words_list of domain-specific documents, which is a database table containing the basic information of all terms extracted from domain-specific documents. See Table 1 for specific field definitions.

[0069] Table 1 Definition of Term Grading Table

[0070]

Abstract

The present invention relates to a method and device for obtaining the classification weights of domain document terms. The method includes: based on a preset plurality of terms that includes a plurality of specific terms, the term level corresponding to each term, and the word frequency corresponding to each specific term, obtaining the total number of domain terms, the total number of general terms, the total domain word frequency, the total general word frequency, the total number of terms at each term level, and the total word frequency of the specific terms at each term level; and, based on these six quantities, obtaining the final classification weight of each term level. Because different terms play different roles in representing a document's topic, terms are graded by their importance to document representation, so that the differences in how terms represent a document are ultimately reflected in the weights of the different term grades.

Description

Technical field

[0001] The present invention relates to the field of language processing, in particular to a method and a device for obtaining classification weights of domain document terms.

Background

[0002] At present, most Chinese text classification systems use words as feature items, called feature words. These feature words serve as the intermediate representation of a document and are used to compute the similarity between documents, and between documents and user targets. Usually, a score is calculated for each feature according to a feature evaluation function, the features are sorted by score, and the several highest-scoring features are selected as feature words.

[0003] The most commonly used and effective text representation method is to establish a term-document matrix. Each element value in the term-document matrix represents the weight of the term on the corresponding row to the documen...
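The background procedure (build a term-document matrix, score each feature, sort, keep the highest-scoring ones) can be sketched as below. This is only an illustrative sketch: it assumes documents are already segmented into word lists, uses raw occurrence counts as the matrix weights, and uses the row sum as a stand-in feature evaluation function. The function names `term_document_matrix` and `top_k_features` and the sample documents are hypothetical.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[i] in docs[j]."""
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*docs))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

def top_k_features(vocab, matrix, k, score=sum):
    """Rank terms by `score` applied to their matrix row; keep the k best.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(zip(vocab, matrix), key=lambda tr: (-score(tr[1]), tr[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical pre-segmented documents.
docs = [
    ["mineralization", "appearance", "mineralization"],
    ["appearance", "ore"],
]
vocab, matrix = term_document_matrix(docs)
features = top_k_features(vocab, matrix, k=2)
print(features)
```

With a purely frequency-based score, the common word "appearance" ranks alongside the domain word "mineralization", which is the weakness of count-only weighting that the text describes.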

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/289, G06F40/295
Inventor: 邓吉秋, 路馥毓, 刘文毅, 李晨菡, 何美香
Owner: CENT SOUTH UNIV