A News Classification Method Based on Semantic Analysis and Multiple Cosine Theorem

Applied in the field of information processing, this technique combines the cosine theorem with semantic analysis. It addresses the low accuracy, error-prone classification, and poor flexibility of existing news classification methods.

Active Publication Date: 2021-05-14
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

The present invention improves on current news classification methods. It mainly addresses the poor accuracy, error-prone classification, and poor flexibility of existing techniques, and aims to increase the accuracy of computer-based news classification that relies on the law of cosines.



Examples


Embodiment 1

[0042] Embodiment 1: As shown in Figures 1-4, a news classification method based on semantic analysis and the multiple cosine theorem comprises the following specific steps:

[0043] Step 1: Obtain the news text X to be classified and preprocess it. First, use named entity recognition to pick out the special terms in X. Then perform word segmentation, stop-word removal, synonym substitution, and similar operations on the remaining text. This generates the substantive word set $X = \{x_1, x_2, \ldots, x_m\}$ of the news text to be classified, which includes the special terms.
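The preprocessing in Step 1 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the source does not name a named entity recognizer, a segmenter, a stop-word list, or a synonym dictionary, so `SPECIAL_TERMS`, `STOP_WORDS`, `SYNONYMS`, and `preprocess` are all hypothetical stand-ins, and a whitespace split on English text stands in for real Chinese word segmentation.

```python
# Toy stand-ins (assumptions): the source does not specify an NER model,
# a segmenter, a stop-word list, or a synonym dictionary.
SPECIAL_TERMS = {"Kunming University", "Yao Ming"}  # pretend NER output
STOP_WORDS = {"the", "a", "of", "at", "in", "and"}
SYNONYMS = {"hoops": "basketball", "match": "game"}


def preprocess(text, special_terms=SPECIAL_TERMS):
    """Return the substantive word list for `text`, special terms included."""
    found = []
    remaining = text
    # 1. Pull out special terms first so segmentation cannot split them.
    for term in sorted(special_terms, key=len, reverse=True):
        if term in remaining:
            found.append(term)
            remaining = remaining.replace(term, " ")
    # 2. Segment the rest (whitespace split stands in for a real segmenter).
    tokens = [t.strip(".,!?").lower() for t in remaining.split()]
    # 3. Drop empty tokens and stop words, then map synonyms to one form.
    tokens = [SYNONYMS.get(t, t) for t in tokens if t and t not in STOP_WORDS]
    # 4. Deduplicate while keeping first-seen order.
    seen, result = set(), []
    for word in found + tokens:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


words = preprocess("Yao Ming played a hoops match at Kunming University.")
print(words)
```

Extracting the special terms before segmentation mirrors the order given in Step 1 and keeps multi-word names intact in the resulting word set.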

[0044] Step 2: Calculate the weights. The weight is based mainly on the TF-IDF value, supplemented by part of speech and word length. Traverse the substantive word set $X = \{x_1, x_2, \ldots, x_m\}$ obtained in Step 1, compute the weight of each substantive word $x_i$, $i \in [1, m]$, and generate a weight set Y of substantive...
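One way to picture the weighting in Step 2 is a TF-IDF base value scaled by part-of-speech and word-length factors. The excerpt is truncated before the actual formula, so everything below is an assumption for illustration: the multiplicative combination, the `POS_FACTOR` table, the linear `length_factor`, and the smoothed IDF are hypothetical choices, not the patent's definitions.

```python
import math
from collections import Counter

# Hypothetical part-of-speech factors; the excerpt does not give real values.
POS_FACTOR = {"noun": 1.2, "verb": 1.0, "other": 0.8}


def tfidf(word, doc_tokens, corpus):
    """TF-IDF with a smoothed IDF so the value is always positive."""
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    containing = sum(1 for doc in corpus if word in doc)
    idf = math.log((1 + len(corpus)) / (1 + containing)) + 1
    return tf * idf


def length_factor(word):
    """Assumed form: a mild linear boost for each character beyond the first."""
    return 1.0 + 0.1 * (len(word) - 1)


def word_weights(doc_tokens, corpus, pos_tags):
    """Map each distinct word to TF-IDF x POS factor x length factor."""
    weights = {}
    for word in dict.fromkeys(doc_tokens):  # distinct words, original order
        base = tfidf(word, doc_tokens, corpus)
        pos = POS_FACTOR.get(pos_tags.get(word, "other"), POS_FACTOR["other"])
        weights[word] = base * pos * length_factor(word)
    return weights


doc = ["election", "election", "vote"]
corpus = [doc, ["game", "score"], ["vote", "game"]]
tags = {"election": "noun", "vote": "verb"}
print(word_weights(doc, corpus, tags))
```

In this toy corpus, "election" ends up with a larger weight than "vote" because it is more frequent in the document, rarer across the corpus, tagged as a noun, and longer.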

Embodiment 2

[0076] Embodiment 2: As shown in Figures 1-4, this embodiment builds on Embodiment 1. Most text similarity measures ignore special terms such as personal names, place names, organization names, and professional terms, on the assumption that they carry no useful information. The present invention, by contrast, treats these special terms as important indicators of the category to which a news text belongs. For example, if the names of national leaders appear frequently in a news text, the text can largely be assigned to the political category without reading it in full. Likewise, if the names of athletes appear frequently, the text can largely be assigned to the sports category without reading it in full. This is also the reason why the present invent...
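The extra emphasis on special terms described above might be sketched as a simple additive bonus on top of existing word weights. The source does not state how the additional weight is computed, so the function name `boost_special_terms` and the constant `extra` are hypothetical.

```python
def boost_special_terms(weights, special_terms, extra=0.5):
    """Return a copy of `weights` with `extra` added to every special term.

    `extra` is an assumed constant; the source does not specify the bonus.
    """
    return {
        word: (value + extra if word in special_terms else value)
        for word, value in weights.items()
    }


base = {"Messi": 1.0, "goal": 0.25}
print(boost_special_terms(base, special_terms={"Messi"}))
```

With a bonus like this, an athlete's name contributes more to the match against a sports category than an ordinary word with the same base weight.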

Embodiment 3

[0077] Embodiment 3: As shown in Figures 1-4, this embodiment builds on Embodiment 1. The present invention also uses word length as an indicator when weighing word weight. According to research, the length of Chinese words follows a χ² distribution under certain conditions. In other words, the longer a word is, the less likely it is to appear in a text, which in turn means that longer words discriminate well between classes. For example, if a term such as "People's Republic of China" appears in a news text, the text can largely be assigned to the international news category without reading it in full, because most domestic news uses the abbreviation "China" instead of "People's Republic of China".


Abstract

The invention relates to a news classification method based on semantic analysis and the multiple cosine theorem, belonging to the technical field of information processing. The invention redefines word weight, applies the multiple cosine theorem, and improves on current news classification methods. Instead of using the TF-IDF value alone as the word weight, it combines a weighted TF-IDF value with part of speech, word length, and other factors, and assigns additional weight to special terms such as names of people, places, and technical terms. In addition, the multiple cosine theorem is used to calculate the matching degree of news: the matching degrees of substantive words and of keywords are calculated separately, and the news category is then determined according to the relevant definitions.
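The two matching degrees mentioned in the abstract can be illustrated with ordinary cosine similarity over sparse weight vectors. This is only a sketch under stated assumptions: the abstract does not give the decision rule, so blending the two scores with a mixing weight `alpha` and picking the highest total is a hypothetical choice, as are the names `cosine` and `classify` and the layout of `categories`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {word: weight}."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(doc_words, doc_keywords, categories, alpha=0.5):
    """Pick the category with the highest blended cosine score.

    categories: {name: (word_vector, keyword_vector)}
    alpha: assumed mixing weight between the two cosine scores.
    """
    scores = {}
    for name, (cat_words, cat_keywords) in categories.items():
        s_words = cosine(doc_words, cat_words)      # substantive-word match
        s_keys = cosine(doc_keywords, cat_keywords)  # keyword match
        scores[name] = alpha * s_words + (1 - alpha) * s_keys
    return max(scores, key=scores.get), scores


categories = {
    "politics": ({"election": 1.0, "vote": 1.0}, {"election": 1.0}),
    "sports": ({"game": 1.0, "score": 1.0}, {"game": 1.0}),
}
label, scores = classify({"election": 2.0, "vote": 0.5}, {"election": 2.0}, categories)
print(label, scores)
```

Computing the substantive-word and keyword similarities separately keeps each signal visible, so a different combination rule can be swapped in without touching `cosine`.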

Description

Technical field

[0001] The invention relates to a news classification method based on semantic analysis and the multiple cosine theorem, belonging to the technical field of information processing.

Background art

[0002] News classification is an important direction in information processing. Organizing a large number of news texts into a few meaningful clusters, and ensuring that the texts within a cluster are similar to a certain extent, makes retrieval more effective.

[0003] At present, text similarity measures fall mainly into two categories: statistical methods and methods based on semantic analysis. Each type has its own advantages and disadvantages. For collections of fewer than about one million news texts, classification relies largely on the law of cosines. However, at this stage, the technology of using computers to classify news based on the cosine theorem is immature, and there are...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F40/295, G06F40/289, G06F40/30
CPC: G06F16/35, G06F40/289, G06F40/295, G06F40/30
Inventors: 龙华, 祁俊辉, 邵玉斌, 杜庆治
Owner: KUNMING UNIV OF SCI & TECH