Automatic extraction method for text labels in combination with theme model and semantic analyses

A technique combining a topic model with semantic analysis, applied in semantic analysis, natural language data processing, and special data processing applications. It addresses the problem that labeling a training set is time-consuming and labor-intensive, and unrealistic to keep up to date.

Active Publication Date: 2016-10-26
DATAGRAND TECH INC

AI Technical Summary

Problems solved by technology

However, labeling a training set is very time-consuming and laborious, and the subject matter of documents often changes drastically over time, so it is unrealistic to relabel the training set every time it does.



Embodiment Construction

[0042] The present invention is further described below with reference to the accompanying drawings:

[0043] As shown in Figure 2, a method for automatically extracting text labels that combines a topic model with semantic analysis includes the following steps:

[0044] Step 1: preprocessing;

[0045] Step 2: LDA modeling and context analysis;

[0046] Step 3: label extraction.

[0047] The preprocessing in Step 1 works as follows: where the text contains low-frequency words, stop words, or markup information, preprocessing removes the low-frequency words, the stop words, and the markup information. Low-frequency words are words that appear in only one or two texts. Stop words are auxiliary words that carry almost no information, words that merely reflect the grammatical structure of a sentence, all function words, and punctuation marks. Markup information is web-page text or text in other markup languages; other markup language text information in...
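The preprocessing described in [0047] can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the tiny English stop-word list, the regular-expression tag stripper, and the `min_doc_freq=3` threshold (chosen so that words appearing in only one or two texts are dropped) are all assumptions made for the example.

```python
import re
from collections import Counter
from html import unescape

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

TAG_RE = re.compile(r"<[^>]+>")      # crude match for HTML/XML-style tags
TOKEN_RE = re.compile(r"[a-z0-9]+")  # alphanumeric runs; punctuation is dropped


def strip_markup(text):
    """Remove tag-like markup and decode character entities such as &amp;."""
    return unescape(TAG_RE.sub(" ", text))


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def preprocess(docs, min_doc_freq=3, stop_words=STOP_WORDS):
    """Strip markup, drop stop words, then drop low-frequency words.

    A word is treated as low-frequency when it occurs in fewer than
    `min_doc_freq` documents (assumed threshold: one or two texts).
    """
    tokenized = [
        [t for t in tokenize(strip_markup(d)) if t not in stop_words]
        for d in docs
    ]
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    return [[t for t in tokens if doc_freq[t] >= min_doc_freq]
            for tokens in tokenized]


if __name__ == "__main__":
    sample = [
        "<p>The topic model &amp; labels</p>",
        "Topic model of labels.",
        "A topic model is rare here",
        "labels, labels!",
    ]
    for tokens in preprocess(sample):
        print(tokens)
```

Document frequency is computed after stop-word removal so that the threshold only has to deal with content words. A production system for Chinese text would also need a word segmenter, which is outside the scope of this sketch.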


Abstract

The invention relates to a method for automatically extracting text labels by combining a topic model with semantic analysis, and belongs to the technical field of computer applications. The method comprises preprocessing, LDA modeling, context analysis, and label extraction. Preprocessing comprises removing low-frequency words, removing stop words, and removing markup information, where stop words are auxiliary words that carry no information, words that only reflect the grammatical structure of a sentence, all function words, and punctuation marks. LDA modeling yields two matrices: an N*K document-topic matrix, which gives the latent topic distribution of each document, and a K*M topic-word matrix, which gives the word distribution of each topic. Building on the conventional counting approach, the method also takes the correlations between words in a document into account, making full use of context information as a key feature, and thereby obtains the label information of the documents.
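To make the shapes of the two LDA matrices concrete, the following is a minimal sketch that fits LDA with a plain collapsed Gibbs sampler using only the Python standard library. The sampler, the function name `lda_gibbs`, and the hyperparameters `alpha`, `beta`, and `iters` are assumptions for illustration; the source does not say which inference method or settings the invention uses.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not optimized).

    docs: list of token lists. Returns (vocab, doc_topic, topic_word), where
    doc_topic is N x K and topic_word is K x M, each row summing to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    N, K, M = len(docs), num_topics, len(vocab)

    ndk = [[0] * K for _ in range(N)]  # tokens in document d assigned to topic k
    nkw = [[0] * M for _ in range(K)]  # times word w is assigned to topic k
    nk = [0] * K                       # total tokens assigned to topic k
    z = []                             # current topic of every token

    # Random initial topic assignment.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][index[w]] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, k = index[w], z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][wi] -= 1
                nk[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wi] + beta) / (nk[t] + M * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][wi] += 1
                nk[k] += 1

    # Smoothed estimates of the two matrices.
    doc_topic = [
        [(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
        for d in range(N)
    ]
    topic_word = [
        [(nkw[k][w] + beta) / (nk[k] + M * beta) for w in range(M)]
        for k in range(K)
    ]
    return vocab, doc_topic, topic_word


if __name__ == "__main__":
    corpus = [["apple", "banana", "apple"], ["goal", "match", "goal"],
              ["banana", "apple"], ["match", "goal"]]
    vocab, doc_topic, topic_word = lda_gibbs(corpus, num_topics=2)
    print(len(doc_topic), "x", len(doc_topic[0]))    # N x K
    print(len(topic_word), "x", len(topic_word[0]))  # K x M
```

Here N is the number of documents, K the number of topics, and M the vocabulary size, matching the N*K and K*M matrices named in the abstract. How the invention combines these matrices with context analysis to pick labels is not detailed in this excerpt, so the sketch stops at the matrices.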

Description

Technical field

[0001] The invention relates to a method for automatically extracting text labels by combining a topic model with semantic analysis, and belongs to the technical field of computer applications.

Background technique

[0002] In the DT (data technology) era, Internet information is growing explosively, and text data of every kind keeps emerging, such as diversified news and the massive volume of original articles from We Media. Faced with such a wealth of information, people urgently need automated tools that help them find the key information they need, accurately and quickly, in this vast ocean of information. Label extraction emerged against this background. Labels are an important way to quickly obtain the key information of a text and grasp its theme, and they have important applications in information retrieval, natural language processing, intelligent recommendation, and other fields. Many websites provide users with the function of labeling objects of interest (such as pic...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
CPC: G06F40/289, G06F40/30
Inventor: 于敬
Owner: DATAGRAND TECH INC