Multi-subject extraction method based on concept vector model

A concept-vector extraction method, applied in the field of text information extraction, that addresses the problems of low algorithm efficiency and low quality of extracted keywords.

Inactive Publication Date: 2014-08-27
HOHAI UNIV
Cites: 2; Cited by: 28

AI Technical Summary

Problems solved by technology

[0006] The technical problem to be solved by the present invention is twofold: traditional text processing technology, being based on word-frequency statistics, can extract only a single topic from a text, and not taking context information into account leads to low algorithm efficiency and low-quality extracted subject words. To solve these problems, a multi-topic extraction method based on a concept vector model is provided. Word meanings are mapped one-to-one to concepts, so that the text is expressed as a concept model; synonyms are automatically merged into the same concept during the mapping, which reduces the vector dimensionality; and polysemous words appearing in the text are disambiguated according to the correlation between semantic classes and context.
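The synonym-merging step can be illustrated with a minimal sketch. It assumes a tiny hand-written word-to-concept table (`TOY_LEXICON`, hypothetical) in place of the real lexicon, and it leaves out polysemy disambiguation; the function name `to_concept_vector` is likewise illustrative and does not come from the patent.

```python
from collections import Counter

# Hypothetical word -> concept table; a stand-in for a real semantic lexicon.
TOY_LEXICON = {
    "computer": "COMPUTER",
    "pc": "COMPUTER",
    "doctor": "DOCTOR",
    "physician": "DOCTOR",
    "gene": "GENE",
}


def to_concept_vector(words, lexicon):
    """Map feature words to concepts and count occurrences per concept.

    Synonyms share one concept key, so they collapse into one dimension.
    Words missing from the lexicon are skipped in this sketch.
    """
    counts = Counter()
    for word in words:
        concept = lexicon.get(word.lower())
        if concept is not None:
            counts[concept] += 1
    return dict(counts)


feature_words = ["computer", "PC", "doctor", "physician", "gene", "gene"]
concept_vector = to_concept_vector(feature_words, TOY_LEXICON)
distinct_words = {w.lower() for w in feature_words}
print(len(distinct_words), "word dimensions ->", len(concept_vector), "concept dimensions")
```

In this toy input, five distinct words collapse into three concept dimensions, which is the dimensionality reduction the paragraph describes.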



Examples


Embodiment Construction

[0081] Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals designate the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary only for explaining the present invention and should not be construed as limiting the present invention.

[0082] Those skilled in the art will understand that, unless otherwise stated, the singular forms "a", "an", "said" and "the" used herein may also include plural forms. It should be further understood that the word "comprising" used in the description of the present invention refers to the presence of said features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be underst...



Abstract

The invention provides a multi-subject extraction method based on a concept vector model. The method includes the following steps: first, a document is preprocessed by a traditional method to obtain a preliminary vector of feature words; then synonyms are merged through the correspondence between word meanings and concepts in HowNet, polysemous words are disambiguated through the correlation between semantic classes and contexts, and a concept vector model is established to represent the document; concept similarity is calculated from the semantic information of the concepts in HowNet, and the concepts are clustered with a K-means algorithm improved by a 'preset seed' method, forming a plurality of subject concept clusters; finally, a plurality of sub-subject term sets are obtained according to the correspondence between concepts and words. The method takes semantic information into account, overcomes defects of the K-means algorithm such as sensitivity to the initial centers and unstable time and space cost, and improves the quality of the extracted subjects.
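The idea of starting K-means from preset seeds instead of random centers can be sketched as follows. This is only an illustration under stated assumptions: concepts are represented as toy 2-D points with Euclidean distance, whereas the method described here derives similarity from HowNet semantic information, and this excerpt does not specify how the seeds are chosen. The function name `seeded_kmeans` and the sample data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seeded_kmeans(points, seeds, max_iter=100):
    """K-means whose initial centers are the caller-supplied seeds.

    Fixing the seeds makes the result deterministic, so it no longer
    depends on a random choice of initial centers.
    """
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Toy "concept" coordinates (assumed data, two obvious groups).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = [(0.0, 0.0), (10.0, 10.0)]
labels, centers = seeded_kmeans(points, seeds)
print(labels, centers)
```

Each resulting cluster would correspond to one subject concept cluster; mapping its concepts back to words then yields one sub-subject term set.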

Description

Technical field

[0001] The invention relates to the technical field of text information extraction, in particular to a multi-subject extraction method based on a concept vector model.

Background technique

[0002] Since human society entered the information age, electronic texts of all kinds have emerged in large numbers. Among these massive texts there are many multi-theme texts, which contain rich and diverse subject information. For example, a report on Premier Li Keqiang's visit to Europe is political news, but also economic news. With the development of science and technology, the degree of integration between disciplines keeps increasing, most research spans multiple disciplines, and many scientific and technological texts contain multiple topics from different aspects; a text on biological gene information mining, for instance, covers topics from both the computer science and biomedical domains. Therefore, there are a large number of multi-theme tex...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
Inventor: 马甲林, 王志坚
Owner: HOHAI UNIV