Keyword vectorization method based on topic semantic information and application thereof

Pending Publication Date: 2022-04-08
NANJING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

Keyword vectors produced by the word2vec vectorization method do not adequately reflect the semantic information of the keywords.
[0004] Practical applications often require keyword vectorization, for example extracting topic words in document classification or vectorizing search keywords in information retrieval. In the prior art, keywords are mainly vectorized with the LDA topic model, which converts keywords into topic vectors that reflect the relationship between keywords and topics. However, vectorization based on the LDA topic model likewise fails to adequately reflect the semantic information of the keywords.



Examples


Embodiment Construction

[0044] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0045] Figure 1 is a flowchart of the present invention, describing the process of keyword vectorization based on topic semantic information. For convenience of description, a specific example is given below. The example addresses document retrieval and is based on the 20newsgroups data set, which contains 20 different categories of news and a total of 11315 articles. The relevant symbols are defined as follows:

[0046] The document set is D = {d_1, d_2, ..., d_n}. Stop words are removed and keywords are extracted from each document in D to form the keyword set W = {w_1, w_2, ..., w_u}. The topic set obtained by the HDBSCAN clustering algorithm is T = {t_1, t_2, ..., t_m}. The document vector matrix is generated by the Sentence-BERT model on the document set D. The reduced document vector matrix is output by the UMAP dimensi...
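The step that forms the keyword set W from the document set D can be illustrated with a minimal sketch. The excerpt does not state which stop-word list or keyword extraction method is used, so the tiny `STOP_WORDS` set, the regular-expression tokenizer, and the helper names `extract_keywords` and `build_keyword_set` are all assumptions made for illustration; here every remaining token is simply treated as a keyword.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a full list.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on", "for"}


def extract_keywords(document):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def build_keyword_set(documents):
    """Return the per-document keyword lists and the sorted keyword set W."""
    per_doc = [extract_keywords(d) for d in documents]
    vocabulary = sorted({w for keywords in per_doc for w in keywords})
    return per_doc, vocabulary


# Toy document set D with n = 2 documents (hypothetical data).
D = ["The launch of the orbiter is delayed.", "A driver for the GPU."]
per_doc, W = build_keyword_set(D)
```

Keeping the per-document keyword lists alongside W is a deliberate choice: later steps need to know which keywords occur in which documents in order to relate keywords to topics.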


Abstract

The invention discloses a keyword vectorization method based on topic semantic information and an application thereof. The method comprises the following steps. First, a Sentence-BERT model generates, for each document, a vector carrying the document's semantic information. Second, the UMAP dimension reduction algorithm reduces the dimension of the generated document vectors and highlights local semantic features. Third, HDBSCAN topic clustering is applied to the reduced document vectors, and each document is assigned to one or more topics. Finally, the relationship between documents and topics is used to compute a topic term frequency-inverse topic frequency (TTF-ITF) score for each keyword in each topic, and a keyword's TTF-ITF scores across all topics are merged to form its final keyword vector. The method achieves high-precision keyword vectorization that carries topic semantic information, and it can be applied to topic word extraction, text classification and document retrieval.
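The final step, scoring each keyword per topic and merging the scores into a vector, can be sketched as follows. The excerpt names the TTF-ITF score but does not give its formula, so the definition used here is an assumption modeled on TF-IDF: TTF(w, t) is the share of keyword occurrences in topic t that are w, and ITF(w) is the logarithm of the number of topics divided by the number of topics in which w occurs. The function name `ttf_itf_vectors` and the toy data are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def ttf_itf_vectors(doc_keywords, doc_topics, num_topics):
    """Build one vector per keyword with one score per topic.

    Assumed formula (a TF-IDF analogy, not taken from the patent text):
      TTF(w, t) = count of w in documents of topic t / all keyword counts in t
      ITF(w)    = log(num_topics / number of topics in which w occurs)
    """
    # Aggregate keyword counts per topic from the document-topic relationship.
    topic_counts = defaultdict(Counter)
    for keywords, topics in zip(doc_keywords, doc_topics):
        for t in topics:  # a document may belong to several topics
            topic_counts[t].update(keywords)

    # For each keyword, count the topics in which it occurs at least once.
    topics_with = Counter()
    for counts in topic_counts.values():
        topics_with.update(counts.keys())

    vectors = {}
    for w, n_topics in topics_with.items():
        itf = math.log(num_topics / n_topics)
        vec = []
        for t in range(num_topics):
            total = sum(topic_counts[t].values())
            ttf = topic_counts[t][w] / total if total else 0.0
            vec.append(ttf * itf)
        vectors[w] = vec  # merged scores: one dimension per topic
    return vectors


# Toy input: three documents, two topics, the third document in both topics.
docs = [["gpu", "driver", "gpu"], ["orbit", "launch"], ["gpu", "launch"]]
topics = [[0], [1], [0, 1]]
vectors = ttf_itf_vectors(docs, topics, num_topics=2)
```

Under this assumed formula, a keyword that occurs in every topic (such as "gpu" in the toy data) receives a zero vector, while a keyword confined to one topic (such as "driver") is non-zero only in that topic's dimension.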

Description

Technical field

[0001] The invention relates to the fields of natural language processing, text mining and searchable encryption, and in particular to a keyword vectorization method based on topic semantic information and its application.

Background technique

[0002] With the continuous development of Internet technology and the advent of the big data era, the scale of data keeps growing. Faced with large volumes of data of many types, it is particularly important to classify the data, obtain keywords strongly related to a given category, and use them effectively to guide practical activities. Document data contains a large number of keywords that computers cannot use directly, and keyword vectorization is an effective way to solve this problem. Keyword vectorization therefore plays an important role in the effective use of document data. For example, in an information retrieval scenario, given a search keyword, the user's retrieval intention...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/279, G06F40/216, G06F40/30, G06K9/62
Inventor: 戴华, 胡正, 刘源龙, 陆佳行, 杨庚, 陈燕俐
Owner: NANJING UNIV OF POSTS & TELECOMM