Automatic hot topic mining system based on internet corpora

A technology for automatic hot topic mining, applied in special data processing applications, instruments, unstructured text data retrieval, etc., that addresses problems such as poor scalability and non-reusable matching templates.

Active Publication Date: 2016-04-13
北京一览群智数据科技有限责任公司

AI Technical Summary

Problems solved by technology

[0003] The method based on rule matching requires extensive prior knowledge. Its accuracy is high, but it scales poorly, and matching templates built for one field cannot be reused in another. The method based on site statistics needs a large volume of logs collected from a large user base, and small and medium-sized companies or research institutes cannot obtain such data. The method based on event detection must first generate high-quality candidate words; because information on the Internet changes daily and new words keep emerging, unregistered (out-of-vocabulary) words are a challenge for this method.

Examples

Embodiment Construction

[0036] The present invention will be further described below in combination with specific embodiments.

[0037] First, some concepts relevant to the present invention are described as follows:

[0038] Named entities: names of people, institutions, places, and all other entities identified by names.

[0039] Named entity recognition: Named entity recognition is a subtask of information extraction. The purpose is to locate and identify named entities that appear in text. The main difficulty of named entity recognition lies in the problem of ambiguity.

[0040] Tf-idf: Tf-idf is a model for evaluating the importance of a word to a document. Tf is the term frequency, the frequency with which word w appears in document d. Idf is the inverse document frequency, based on the total number of documents multiplied by the reciprocal of the number of documents containing the word w (in common practice, the logarithm of this ratio is taken).
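As an illustration of the tf-idf definition above, here is a minimal sketch. It assumes the common logarithmic form of idf, log(N / df), which the text does not spell out; the function name tf_idf and the sample documents are hypothetical and not taken from the patent.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # tf = relative frequency in this document;
            # idf = log(N / df), an assumed (common) variant.
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    ["hot", "topic", "mining"],
    ["hot", "word", "statistics"],
    ["hot", "topic", "detection"],
]
scores = tf_idf(docs)
```

With this toy corpus, a word that occurs in every document ("hot") receives weight zero, while a word unique to one document ("mining") receives the highest weight in that document.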

[0041] Cosine similarity: Two vectors with the same dimension exist i...
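The definition above is cut off in this copy. For orientation only, the following sketch shows cosine similarity as it is commonly defined (dot product divided by the product of the vector lengths); it is not a reconstruction of the patent's wording. The function name and the handling of zero vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score 1, and orthogonal vectors score 0, regardless of their lengths.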

Abstract

The invention discloses an automatic hot topic mining system based on Internet corpora. The system consists of two routes: 1) crawling hot words from existing hot-word statistics sites and generating a series of hot topics through clustering, entity extraction, and keyword mining; and 2) extracting n-grams from massive news documents, mining high-frequency hot words from those documents by calculating the mutual information and conditional entropy of the n-grams, and recognizing new topics with an event detection method based on time series. The system can mine current hot events in real time and can also mine the keywords and named entities related to a hot topic when the topic is generated.
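The second route scores n-grams by mutual information and conditional entropy. The abstract does not give the exact formulas, so the sketch below is one common reading rather than the patented method: pointwise mutual information measures how strongly the two words of a bigram stick together, and the entropy of the word that follows the bigram measures how freely it combines with its context. The function name bigram_scores and the sample sentence are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bigram_scores(tokens):
    """Return {(w1, w2): (pmi, right_entropy)} for every bigram in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    # Count which word follows each bigram (its right context).
    followers = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        followers[(w1, w2)][w3] += 1

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Pointwise mutual information: log2 of p(w1, w2) / (p(w1) * p(w2)).
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))

        # Entropy of the next word given the bigram.
        nxt = followers.get((w1, w2), Counter())
        total = sum(nxt.values())
        entropy = 0.0
        for c in nxt.values():
            p = c / total
            entropy -= p * math.log2(p)
        scores[(w1, w2)] = (pmi, entropy)
    return scores


# Hypothetical sample text, tokenized on whitespace.
tokens = "new york is big and new york is old and new york has parks".split()
scores = bigram_scores(tokens)
```

In this reading, a bigram with high mutual information and high right-context entropy (here "new york") behaves like a self-contained word, which makes it a plausible hot-word candidate.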

Description

Technical field

[0001] The invention relates to an automatic hot topic mining system based on Internet corpus.

Background technique

[0002] There are three main methods in existing hot word mining systems: the method based on rule matching, the method based on site statistics and the method based on event detection. The method based on rule matching requires a lot of domain knowledge, and hot words are mined by using manually established hot word matching templates. The method based on site statistical information mainly utilizes the statistical data of site traffic, such as news access logs of portal websites, query logs of search engines, etc., and mines hot words from frequently accessed content. The method based on event detection first uses named entity recognition, high-frequency string statistics and other methods to mine candidate hot words, and then uses related methods of time series analysis to select words with obvious hot trends in the candidate set as the fin...
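The event-detection method described above selects candidate words whose frequency shows an obvious hot trend over time. The source does not name a specific time-series technique, so the following is only a sketch of one simple option: flag a word when its latest daily count lies well above the mean of its earlier counts, measured in standard deviations. The function name is_trending, the threshold value and the sample counts are all assumptions.

```python
from statistics import mean, pstdev


def is_trending(daily_counts, threshold=2.0):
    """daily_counts: chronological counts for one word; the last item is today."""
    if len(daily_counts) < 3:
        return False  # too little history to judge a trend
    history, today = daily_counts[:-1], daily_counts[-1]
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: any increase over the constant level counts as a rise.
        return today > mu
    # z-score of today's count relative to the word's own history.
    return (today - mu) / sigma > threshold


# Hypothetical daily mention counts for two candidate words.
spiking = [5, 6, 5, 7, 6, 40]
steady = [5, 6, 5, 7, 6, 6]
```

Calling is_trending(spiking) flags the word, while is_trending(steady) does not, because the final count stays within the word's usual range.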

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/35, G06F16/9535
Inventor: 窦志成, 文继荣, 江政宝
Owner: 北京一览群智数据科技有限责任公司