Automatic hot topic mining system based on internet corpora

A hot topic automatic mining technology, applied in special data processing applications, instruments, unstructured text data retrieval and similar fields, that addresses problems such as poor scalability and non-reusable matching templates.
CN105488196A · Active · Publication Date: 2016-04-13 · 北京一览群智数据科技有限责任公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: 北京一览群智数据科技有限责任公司
Publication Date: 2016-04-13

Abstract

The invention discloses an automatic hot topic mining system based on internet corpora. The system comprises two routes: 1) crawling hot words from existing hot word statistics sites and generating a series of hot topics through clustering, entity extraction and keyword mining; and 2) extracting n-grams from massive news documents, mining high-frequency hot words by calculating the mutual information and conditional entropy of each n-gram, and recognizing new topics with an event detection method based on time series. With this system, current hot events can be mined in real time, and the relevant keywords and named entities of a hot topic can be mined as soon as the topic is generated.
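The n-gram scoring in route 2 can be illustrated with a minimal sketch. The abstract does not give the formulas, so the following assumes pointwise mutual information for bigram cohesion and the conditional entropy of the token that follows the bigram as a boundary measure. The function name `score_bigrams` and the pre-tokenized input are hypothetical choices for illustration, not the patented implementation.

```python
import math
from collections import Counter, defaultdict


def score_bigrams(docs):
    """Return {bigram: (pmi, right_entropy)} for a list of token lists.

    Assumption: PMI measures how strongly the two tokens stick together,
    and the entropy of the following token measures how freely the bigram
    combines with its right context. Both are illustrative choices.
    """
    unigrams = Counter()
    bigrams = Counter()
    followers = defaultdict(Counter)  # bigram -> counts of the next token

    for tokens in docs:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            bg = (tokens[i], tokens[i + 1])
            bigrams[bg] += 1
            if i + 2 < len(tokens):
                followers[bg][tokens[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for bg, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[bg[0]] / n_uni
        p_y = unigrams[bg[1]] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))

        # Conditional entropy of the next token given this bigram.
        total = sum(followers[bg].values())
        entropy = 0.0
        for k in followers[bg].values():
            p = k / total
            entropy -= p * math.log2(p)
        scores[bg] = (pmi, entropy)
    return scores


docs = [["a", "b", "c"], ["a", "b", "d"], ["x", "y", "z"]]
scores = score_bigrams(docs)
```

Under these assumptions, a bigram with both high PMI and high right entropy would be a candidate hot word; the thresholds would need tuning on real news data.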

Description

Technical field

[0001] The invention relates to an automatic hot topic mining system based on internet corpora.

Background art

[0002] Existing hot word mining systems use three main methods: rule matching, site statistics and event detection. The method based on rule matching requires a lot of domain knowledge, and hot words are mined with manually established hot word matching templates. The method based on site statistics mainly uses site traffic data, such as the news access logs of portal websites and the query logs of search engines, and mines hot words from frequently accessed content. The method based on event detection first mines candidate hot words using named entity recognition, high-frequency string statistics and similar techniques, and then uses time series analysis to select words with obvious hot trends in the candidate set as the fin...
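The time series selection step of the event detection approach can be sketched as follows. The source does not name a specific technique, so this assumes a simple rule: a candidate word shows a hot trend when its latest daily count exceeds the mean of the preceding window by several standard deviations. The function `is_bursting` and its parameters `window`, `k` and `min_count` are hypothetical.

```python
from statistics import mean, pstdev


def is_bursting(counts, window=7, k=3.0, min_count=5):
    """Flag a candidate word whose latest count spikes above its recent history.

    counts: daily occurrence counts for one word, oldest first.
    Assumption: a mean-plus-k-sigma threshold stands in for the time
    series analysis mentioned in the text.
    """
    if len(counts) < window + 1:
        return False  # not enough history to judge a trend
    history = counts[-(window + 1):-1]
    latest = counts[-1]
    mu = mean(history)
    # Floor sigma at 1.0 so that flat histories do not trigger on tiny changes.
    sigma = max(pstdev(history), 1.0)
    return latest >= min_count and latest > mu + k * sigma


spike = is_bursting([2, 3, 2, 3, 2, 3, 2, 40])
steady = is_bursting([2, 3, 2, 3, 2, 3, 2, 3])
```

A production system would likely also account for weekly seasonality and overall corpus volume, which this sketch ignores.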

Claims
