Method for extracting multiple subject terms from single Chinese text

A technique for extracting subject terms from text, applied in the field of text information extraction, that addresses the problems of low-quality subject terms and low algorithm efficiency.

Inactive Publication Date: 2014-08-06
HOHAI UNIV
View PDF7 Cites 22 Cited by

AI Technical Summary

Problems solved by technology

[0006] The technical problem addressed by the present invention is that traditional text-processing techniques, which are based on word-frequency statistics, can extract only a single topic from a text, and that they suffer from low algorithm efficiency and low-quality subject words related to context information. To solve these problems, a method for extracting multiple subject words from a single Chinese text is provided. The feature words are mapped one by one to concepts, so that the text is represented as a concept model. Synonyms are automatically merged into the same concept during the mapping, which reduces the dimensionality of the vector. Polysemous words appearing in the text are disambiguated according to the correlation between semantic classes and the context.
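The mapping step described above can be sketched roughly as follows. This is a minimal illustration, not the patented procedure: the `LEXICON` table, its concept identifiers and semantic classes, and the function `to_concept_vector` are all hypothetical stand-ins for a real HowNet lookup, and the disambiguation rule (pick the sense whose semantic class is most common among the unambiguous context words) is an assumed simplification.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> list of (concept_id, semantic_class).
# A real system would look these up in HowNet; these entries are invented.
LEXICON = {
    "computer": [("C_computer", "machine")],
    "PC": [("C_computer", "machine")],
    "laptop": [("C_computer", "machine")],
    "apple": [("C_apple_fruit", "food"), ("C_apple_brand", "machine")],
    "banana": [("C_banana", "food")],
}


def to_concept_vector(words):
    """Map feature words to concept counts, merging synonyms on the way."""
    # Pass 1: unambiguous words vote for the semantic classes of the context.
    class_votes = Counter()
    for w in words:
        senses = LEXICON.get(w, [])
        if len(senses) == 1:
            class_votes[senses[0][1]] += 1

    # Pass 2: every known word contributes to exactly one concept.
    # A polysemous word takes the sense whose class has the most votes
    # (ties fall back to the first listed sense).
    vector = Counter()
    for w in words:
        senses = LEXICON.get(w, [])
        if not senses:
            continue  # words missing from the lexicon are skipped
        concept, _ = max(senses, key=lambda s: class_votes[s[1]])
        vector[concept] += 1
    return vector


if __name__ == "__main__":
    print(to_concept_vector(["computer", "PC", "laptop", "apple"]))
    print(to_concept_vector(["apple", "banana"]))
```

In the first call, three synonymous words collapse into a single concept dimension, which is the dimensionality reduction the paragraph refers to, and "apple" follows the machine-dominated context. In the second call the same word follows the food context instead.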



Examples


Embodiment Construction

[0081] Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary only for explaining the present invention and should not be construed as limiting the present invention.

[0082] Those skilled in the art will understand that unless otherwise stated, the singular forms "a", "an", "said" and "the" used herein may also include plural forms. It should be further understood that the word "comprising" used in the description of the present invention refers to the presence of said features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood...


Abstract

The invention provides a method for extracting multiple subject terms from a single Chinese text. The method comprises the following steps. First, the text is preprocessed with a traditional method to obtain an initial vector of feature words. Second, synonyms are merged using the correspondence between word senses and concepts in HowNet, polysemous words are disambiguated according to the correlation between semantic classes and the context, and a concept vector model is built to represent the text. Third, concept similarity is calculated from the related semantic information of the concepts in HowNet, and the K-means algorithm, improved through a "seed presetting" method, is used to cluster the concepts into multiple subject concept clusters. Fourth, multiple sub-subject terms are obtained from the correspondence between concepts and words. The method takes semantic information into account, overcomes the K-means algorithm's sensitivity to its initial centers, its instability, and its space overhead, and improves the quality of the extracted subject terms.
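The third step, clustering concepts around preset seeds, might look something like the sketch below. Several things here are assumptions rather than details taken from the patent: the `FEATURES` table and the Jaccard-based `sim` function are invented stand-ins for HowNet concept similarity, the seeds are hand-picked, and because only pairwise similarities are available the update step chooses the most central member of each cluster (a k-medoids-style substitute for the K-means mean).

```python
# Hypothetical semantic features per concept; a real system would derive
# similarity from HowNet definitions instead of this invented table.
FEATURES = {
    "election": {"politics", "event"},
    "government": {"politics", "institution"},
    "diplomacy": {"politics", "institution", "activity"},
    "trade": {"economy", "activity"},
    "investment": {"economy", "activity", "finance"},
    "currency": {"economy", "finance"},
}


def sim(a, b):
    """Toy concept similarity: Jaccard overlap of the feature sets."""
    fa, fb = FEATURES[a], FEATURES[b]
    return len(fa & fb) / len(fa | fb)


def seeded_cluster(concepts, sim, seeds, max_iter=20):
    """Cluster concepts around preset seeds using pairwise similarity."""
    centers = list(seeds)  # preset seeds replace random initialisation
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each concept joins its most similar center.
        clusters = [[] for _ in centers]
        for c in concepts:
            best = max(range(len(centers)), key=lambda i: sim(c, centers[i]))
            clusters[best].append(c)
        # Update step: the new center is the member with the highest total
        # similarity to its cluster (medoid in place of a mean).
        new_centers = [
            max(members, key=lambda m: sum(sim(m, o) for o in members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    centers, clusters = seeded_cluster(list(FEATURES), sim, ["government", "trade"])
    for center, members in zip(centers, clusters):
        print(center, members)
```

Starting from fixed seeds makes the result deterministic for a given input, which is the kind of stability that presetting seeds is meant to provide. Each resulting cluster would then be mapped back to its words to yield one sub-subject term group per cluster.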

Description

Technical field

[0001] The invention relates to the technical field of text information extraction, in particular to a method for extracting multiple subject words from a single Chinese text.

Background

[0002] Since human society entered the information age, electronic texts of all kinds have emerged in large numbers. Among these massive texts are many multi-topic texts, which contain rich and diverse subject information. For example, a report on Premier Li Keqiang's visit to Europe is political news, but it is also economic news. With the development of science and technology, disciplines are becoming increasingly integrated, most research spans multiple disciplines, and many scientific and technological texts contain multiple topics from different aspects. A text on biological gene information mining, for instance, covers topics from both computer science and the biomedical domain. Therefore, there are a large number of multi-t...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
Inventors: 马甲林, 王志坚
Owner: HOHAI UNIV