Multi-subject extracting method based on semantic categories

A semantic-category-based extraction method, applied in fields such as special data processing applications, instruments, and electrical digital data processing, that addresses the low quality and low algorithm efficiency of subject-word extraction.

Active Publication Date: 2014-08-06
HOHAI UNIV
3 Cites, 42 Cited by

AI Technical Summary

Problems solved by technology

[0006] The technical problem addressed by the present invention is twofold: traditional text processing technology, being based on word-frequency statistics, can extract only a single topic from a text, and it suffers from low algorithm efficiency and low-quality extracted keywords caused by its handling of context information. To solve these problems, a multi-topic extraction method based on semantic classes is provided.

Examples

Embodiment Construction

[0089] Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary only for explaining the present invention and should not be construed as limiting the present invention.

[0090] Those skilled in the art will understand that, unless otherwise stated, the singular forms "a", "an", "said" and "the" used herein may also include plural forms. It should be further understood that the word "comprising" used in the description of the present invention refers to the presence of said features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understoo...

Abstract

The invention provides a multi-subject extracting method based on semantic categories. The method comprises the following steps. First, a document is preprocessed by conventional methods to obtain a preliminary vector of feature words. Second, synonyms are merged using the correspondence between word meanings and concepts in 'HowNet', polysemous words are disambiguated according to the correlation between semantic categories and the context, and a concept vector model is constructed to represent the document. The concept vector model is then converted into a semantic category model according to the one-to-one correspondence between concepts and semantic categories. Concept similarity is calculated from the related semantic information of the concepts in 'HowNet', and the semantic similarity is then obtained. The semantic categories are clustered with a K-means algorithm improved by presetting seeds, forming a plurality of subject semantic category clusters. Finally, a plurality of sub-subject word sets are obtained by mapping back through the correspondences between semantic categories and concepts and between concepts and words. The method takes semantic information into account, overcomes the K-means algorithm's sensitivity to the initial centers and its unstable time and space cost, and improves the quality of the extracted subjects.
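
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the word-to-concept table, the concept-to-class table, and the two-dimensional class vectors are hypothetical stand-ins for 'HowNet' and its similarity computation, squared Euclidean distance replaces the semantic similarity, and polysemous-word disambiguation is omitted. All function and variable names are illustrative assumptions. What the sketch does show is the overall flow: merge synonyms into concepts, map concepts to semantic classes, cluster the classes with K-means started from preset seeds instead of random centers, then map each cluster back to its words.

```python
from collections import defaultdict

# Hypothetical stand-in for HowNet: synonyms map to the same concept.
WORD_TO_CONCEPT = {
    "premier": "leader", "prime minister": "leader",
    "visit": "travel",
    "trade": "commerce", "export": "commerce",
    "tariff": "tax",
}
# Hypothetical one-to-one mapping from concepts to semantic classes.
CONCEPT_TO_CLASS = {
    "leader": "S_leader", "travel": "S_travel",
    "commerce": "S_commerce", "tax": "S_tax",
}
# Toy vectors standing in for HowNet-derived semantic similarity.
CLASS_VECTORS = {
    "S_leader": (0.9, 0.1), "S_travel": (0.8, 0.2),
    "S_commerce": (0.1, 0.9), "S_tax": (0.2, 0.8),
}


def to_concepts(words):
    """Merge synonyms: group the document's known words by concept."""
    concepts = defaultdict(list)
    for word in words:
        concept = WORD_TO_CONCEPT.get(word)
        if concept is not None and word not in concepts[concept]:
            concepts[concept].append(word)
    return dict(concepts)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(points, seeds, max_iter=100):
    """K-means whose initial centers are preset seeds, not random picks.

    points: dict mapping a name to its vector.
    seeds:  names in `points` used as the initial cluster centers.
    Returns a dict mapping each name to a cluster index.
    """
    centers = [points[s] for s in seeds]
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            name: min(range(len(centers)),
                      key=lambda k: sq_dist(vec, centers[k]))
            for name, vec in points.items()
        }
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for k in range(len(centers)):
            members = [points[n] for n, a in assignment.items() if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return assignment


def extract_topics(words, seeds):
    """Words -> concepts -> semantic classes -> clusters -> word sets."""
    concept_words = to_concepts(words)
    points = {CONCEPT_TO_CLASS[c]: CLASS_VECTORS[CONCEPT_TO_CLASS[c]]
              for c in concept_words}
    assignment = seeded_kmeans(points, seeds)
    class_to_concept = {cls: c for c, cls in CONCEPT_TO_CLASS.items()}
    topics = defaultdict(list)
    for cls, k in assignment.items():
        # Reverse mapping: semantic class -> concept -> original words.
        topics[k].extend(concept_words[class_to_concept[cls]])
    return [sorted(topics[k]) for k in range(len(seeds))]


if __name__ == "__main__":
    doc = ["premier", "prime minister", "visit", "trade", "export", "tariff"]
    print(extract_topics(doc, seeds=["S_leader", "S_commerce"]))
```

Because the initial centers are fixed by the seeds, repeated runs on the same input give the same clusters, which is the stability property the abstract attributes to the seeded variant. How the seeds are chosen in the actual method is not stated in this summary; here they are simply passed in by the caller.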

Description

Technical field

[0001] The invention relates to the technical field of text information extraction, in particular to a multi-topic extraction method based on semantic classes.

Background technique

[0002] Since human society entered the information age, electronic texts of all kinds have emerged in large numbers. Among these massive texts there are many multi-topic texts, which contain rich and diverse subject information. For example, a report on Premier Li Keqiang's visit to Europe is political news, but also economic news. With the development of science and technology, disciplines are becoming increasingly integrated, most research spans multiple disciplines, and many scientific and technological texts contain multiple topics from different aspects. A text on biological gene information mining, for instance, covers topics from both computer science and biomedicine. Therefore, there are a large number of multi-theme texts in the rea...

Application Information

IPC(8): G06F17/27, G06F17/30
Inventor: 马甲林, 王志坚
Owner: HOHAI UNIV