A text subtopic discovery method based on improved LDA

A text subtopic discovery technology, applied in text database querying, unstructured text data retrieval, text database browsing/visualization, etc.

Active Publication Date: 2019-06-14
HEFEI UNIV OF TECH
Cites: 5; Cited by: 8

AI Technical Summary

Problems solved by technology

The existing method is labor-intensive and not universal, and the keywords it extracts are hard to understand.



Examples


Embodiment Construction

[0052] In this example, as shown in Figure 1, a text subtopic discovery method based on improved LDA is carried out as follows:

[0053] Step 1. In this embodiment, the selected document collection is web news data. Two weeks of content is crawled from web news around three event keywords, more than 12,000 articles in total. Each event forms one document collection, and each news item is treated as one document. A domain dictionary is constructed according to the domain of the event. In this embodiment the events belong to the financial field, so a financial news word segmentation dictionary and a financial news stop word list are constructed for text preprocessing. The preprocessing steps include removing stop words and word segmentation. The preprocessed document collection is denoted $D=\{D_1,\dots,D_d,\dots,D_{|D|}\}$, where $D_d$ denotes the $d$-th preprocessed document, $1 \le d \le |D|$, and $|D|$ denotes the total number of documents in the collection; ...
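The preprocessing in Step 1 can be sketched roughly as follows. This is a minimal illustration and not the patent's implementation: the function names `preprocess` and `whitespace_segment`, the sample sentences, and the stop word set are all hypothetical. The whitespace splitter is only a stand-in; real Chinese financial news would need a dictionary-based segmenter loaded with the domain dictionary described above.

```python
from typing import Callable, Iterable, List, Set


def whitespace_segment(text: str) -> List[str]:
    """Stand-in segmenter (assumption): split on whitespace.

    A dictionary-based segmenter using the domain dictionary would be
    plugged in here for Chinese text.
    """
    return text.split()


def preprocess(
    raw_docs: Iterable[str],
    stop_words: Set[str],
    segment: Callable[[str], List[str]] = whitespace_segment,
) -> List[List[str]]:
    """Return D = [D_1, ..., D_|D|], where each D_d is a list of kept tokens."""
    collection = []
    for text in raw_docs:
        tokens = segment(text)                              # word segmentation
        kept = [t for t in tokens if t not in stop_words]   # stop word removal
        collection.append(kept)
    return collection


# Hypothetical toy data; the embodiment uses more than 12,000 news articles.
raw_news = [
    "the central bank raised the interest rate",
    "shares of the bank fell after the rate decision",
]
stop_words = {"the", "of", "after"}

D = preprocess(raw_news, stop_words)
```

Passing the segmenter as a parameter keeps the stop word filtering independent of whichever segmentation tool and domain dictionary are used.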


Abstract

The invention discloses an improved LDA-based text subtopic discovery method. The method comprises the following steps: 1, calculating the TF-IDF value of words in a text set, and taking nouns and verbs with TF-IDF values greater than a threshold as feature words for subsequent weighting; 2, finding subtopics and corresponding keywords based on a feature-word-weighted LDA model; 3, optimizing the subtopics based on a TSR method and KL divergence; 4, using a Word2Vec model to expand the subtopic keywords, improving their semantic comprehensiveness; and 5, constructing a subtopic word vector and a title word vector, and clustering by cosine distance. The method improves subtopic discovery in terms of topic discrimination and semantic comprehensiveness.
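As a rough illustration of the quantities named in steps 1, 3 and 5 of the abstract, the sketch below uses textbook definitions of TF-IDF, KL divergence and cosine distance. Several points are assumptions, because the excerpt does not give the patent's exact formulas: each word is scored by its highest TF-IDF over the documents, the KL divergence is smoothed with a small `eps`, the threshold `0.3` and the toy documents are hypothetical, and part-of-speech filtering (nouns and verbs) is assumed to have been done already. The weighted LDA model, the TSR method and Word2Vec expansion are not shown.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def tfidf_scores(docs: List[List[str]]) -> Dict[str, float]:
    """Highest TF-IDF of each word over the collection (aggregation is an assumption)."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    best: Dict[str, float] = {}
    for doc in docs:
        if not doc:
            continue
        for word, count in Counter(doc).items():
            tf = count / len(doc)                    # term frequency in this document
            idf = math.log(n_docs / doc_freq[word])  # inverse document frequency
            best[word] = max(best.get(word, 0.0), tf * idf)
    return best


def select_feature_words(docs: List[List[str]], threshold: float) -> Set[str]:
    """Step 1: keep words whose TF-IDF value exceeds the threshold."""
    return {w for w, s in tfidf_scores(docs).items() if s > threshold}


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """Smoothed KL(p || q) between two topic-word distributions (step 3)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity, used to compare word vectors (step 5)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for zero vectors
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy documents, already segmented and filtered to nouns/verbs.
docs = [
    ["bank", "rate", "rate", "market"],
    ["bank", "stock", "market"],
    ["bank", "merger", "merger", "deal"],
]
features = select_feature_words(docs, threshold=0.3)
```

In this toy collection, a word that appears in every document gets an IDF of zero and is never selected. A large KL divergence between two subtopics' word distributions indicates that they are well separated, which is the property step 3 aims to improve.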

Description

Technical field

[0001] The invention belongs to the field of data mining, and in particular relates to a text subtopic discovery method based on improved LDA.

Background technique

[0002] With the rapid development of Internet information technology, a large amount of unstructured data is generated on the network, and people urgently need to extract valuable information and knowledge from it. Topic discovery is a common technique for analyzing such unstructured data. A topic consists of a seed event and the subsequent events or activities directly related to it. Subtopics are related descriptions of one of those events, that is, different aspects of the seed event. Subtopic discovery has been applied with good results in news classification, tracking event hotspots, and detecting event development trends, and has quickly become a research hotspot. Due to the strong similarity of related reports belonging to the same event, it is difficult to find subtopics wi...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/34, G06F16/33, G06F17/27
Inventor: 倪丽萍, 李想, 倪志伟, 朱旭辉, 李应, 夏千姿
Owner: HEFEI UNIV OF TECH