Semantic kernel method for a co-occurrence latent semantic vector space model based on topic clustering of literature resources

A semantic kernel method for a co-occurrence latent semantic vector space model, applied to topic clustering of literature resources. It addresses high model dimension, high time and space complexity of clustering algorithms, and insufficient extraction of semantic information, and achieves the effect of reducing dimensionality and improving clustering results.

Active Publication Date: 2017-05-24
SHANXI UNIV
2 Cites, 24 Cited by

AI Technical Summary

Problems solved by technology

[0004] The present invention mainly addresses the problems of current semantic kernel methods for semantic vector space models: relatively complex semantic information extraction, insufficient extraction of semantic information, high model dimension, and high time and space complexity when applied to clustering algorithms. It provides a semantic kernel method for a co-occurrence latent semantic vector space model for topic clustering of text resources.


Examples


Embodiment 1

[0045] Step 1: data preprocessing. Clean the data, label the documents, extract the keywords of each document, and retain the correspondence between keywords and their documents.

[0046] The data come from CNKI. Following CNKI's classification, 300 documents were selected from each of three disciplines under information science, "Publishing", "Library Information and Digital Library", and "Archives and Museums", as the documents for analysis. After excluding the 4 documents without keywords, 896 documents remained: 299 from "Publishing", 298 from "Library Information and Digital Library", and 299 from "Archives and Museums". A total of 2509 distinct keywords were obtained. That is, the number of documents is n=896 and the number of keywords is m=2509. The following table shows the first 20 documents and all corresponding keywords. In Table 1, LM is the document category, ID is the docum...
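The preprocessing in Step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the toy records, the tuple layout, and the function name `preprocess` are assumptions made for the example. It drops documents without keywords, keeps the keyword-to-document correspondence, and checks that the counts reported above are mutually consistent.

```python
from collections import defaultdict

# Hypothetical toy records: (category LM, document ID, keywords).
records = [
    ("publishing", 1, ["digital publishing", "copyright"]),
    ("library", 2, ["digital library", "copyright"]),
    ("archives", 3, []),  # no keywords: dropped during cleaning
    ("archives", 4, ["archives management", "digital library"]),
]

def preprocess(records):
    """Drop keyword-less documents and index document IDs by keyword."""
    kept = [r for r in records if r[2]]
    keyword_to_docs = defaultdict(set)
    for _category, doc_id, keywords in kept:
        for kw in keywords:
            keyword_to_docs[kw].add(doc_id)
    return kept, dict(keyword_to_docs)

kept, keyword_to_docs = preprocess(records)
n = len(kept)             # number of documents after cleaning
m = len(keyword_to_docs)  # number of distinct keywords

# Consistency check on the counts reported in the text:
# 3 disciplines x 300 documents, minus 4 without keywords.
assert 3 * 300 - 4 == 299 + 298 + 299 == 896
```

On the real data set, the same two counts would correspond to n=896 and m=2509.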



Abstract

The invention belongs to the technical field of semantic kernel methods for semantic vector space models, and in particular relates to a semantic kernel method for a co-occurrence latent semantic vector space model based on topic clustering of literature resources. The method mainly solves the problems that existing semantic kernel methods for semantic vector space models have high semantic information extraction complexity, extract semantic information insufficiently, have a high model dimension, and have high time and space complexity when applied to clustering algorithms. The method comprises the following steps: 1) preprocessing the literature data; 2) computing word frequency statistics for the extracted keywords, for later use in building a co-occurrence matrix; 3) constructing a vector space model representation of the literature, using whether a keyword is present in a document as the weight; 4) constructing a co-occurrence latent semantic vector space model; 5) constructing a semantic kernel function; and 6) clustering the literature.
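Steps 2 to 5 can be illustrated with a small sketch. The formulas below are assumptions, since this page does not reproduce the patent's definitions: the co-occurrence matrix is taken as C = AᵀA over a binary document-keyword matrix A, documents are enriched as D = AC, and the kernel is taken as K = DDᵀ, which is one common form of a semantic kernel and not necessarily the one claimed. The toy corpus and helper names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: document ID -> extracted keywords.
docs = {
    "d1": ["clustering", "kernel"],
    "d2": ["kernel", "semantics"],
    "d3": ["clustering", "semantics", "archives"],
}

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    cols = transpose(Y)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

# Step 2: keyword frequency statistics (number of documents per keyword).
freq = Counter(kw for kws in docs.values() for kw in set(kws))
vocab = sorted(freq)

# Step 3: binary vector space model, A[i][j] = 1 if keyword j is in document i.
A = [[1 if kw in kws else 0 for kw in vocab] for kws in docs.values()]

# Step 4 (assumed form): keyword co-occurrence matrix C = A^T A, where
# C[j][k] counts the documents that contain both keyword j and keyword k.
C = matmul(transpose(A), A)

# Enrich each document with co-occurring keywords: D = A C.
D = matmul(A, C)

# Step 5 (assumed form): kernel K = D D^T, symmetric by construction.
K = matmul(D, transpose(D))
```

The matrix K could then be passed to any clustering algorithm that accepts a precomputed similarity matrix, which corresponds to step 6.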

Description

Technical field

[0001] The invention belongs to the technical field of semantic kernel methods for semantic vector space models, and in particular relates to a semantic kernel method for a co-occurrence latent semantic vector space model for topic clustering of literature resources.

Background technique

[0002] The era of big data has brought a large number of unstructured text resources. As an unsupervised machine learning method, clustering is one of the main means of mining text resources. Text clustering differs from general data clustering in that the text must first be represented in a data structure. The basic model of text representation is the vector space model (VSM), which maps each document to a high-dimensional sparse vector in the text space, so the problem of computing semantic similarity between texts can be transformed into a computation on vectors in the vector space, that is, by calculating the similarity between vectors to measure...
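As an illustration of similarity between document vectors in a VSM, the sketch below uses cosine similarity on binary keyword vectors. Cosine similarity is chosen here as a common example measure and is an assumption; the truncated passage does not say which measure the patent uses. The vocabulary and vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary; each document is a binary vector over it.
vocab = ["archives", "clustering", "kernel", "semantics"]
d1 = [0, 1, 1, 0]  # clustering, kernel
d2 = [0, 0, 1, 1]  # kernel, semantics
d3 = [1, 0, 0, 0]  # archives only

sim_12 = cosine(d1, d2)  # one shared keyword out of two each: about 0.5
sim_13 = cosine(d1, d3)  # no shared keywords: 0.0
```

With thousands of keywords, most entries of each vector are zero, so two related documents that happen to use different keywords get a similarity of 0. That sparsity is the kind of limitation that motivates adding co-occurrence information.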


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/35, G06F40/30
Inventors: 牛奉高, 张亚宇
Owner: SHANXI UNIV