New event theme extraction method

A method for extracting the themes of new events, applied in the field of network information. It addresses the heavy resource expenditure, the dependence on professional expertise, and the poor model performance on new data found in existing approaches, and offers a simple method with accurate expression.

Active Publication Date: 2020-08-28
QINGDAO UNIV
13 Cites 3 Cited by

AI-Extracted Technical Summary

Problems solved by technology

However, this method still has the following disadvantages: first, this method can only be used for specific domain data sets, and is not suitable for general data sets in various fields; Professiona...

Abstract

The invention belongs to the technical field of network information and relates to a new event theme extraction method. The news event text data set is vectorized based on BERT, so that contextual connections are captured more closely and the representation is more accurate. A bidirectional long short-term memory (BiLSTM) network with an attention mechanism learns from the large volume of news text on the network in order to discover new events, making efficient and accurate use of the data. Combining a supervised method with an unsupervised method is more efficient than using either one alone. The method is simple and extracts semantic information in depth. It can analyze and mine news texts on the network and discover new events, which helps supervisory departments and individual users keep track of new events in real time and facilitates subsequent work.


Examples


Example Embodiment

[0025] Example:
[0026] In this embodiment of the present invention, new event theme extraction is carried out in the following steps:
[0027] Step 1: Obtain the news event text data stream according to the event keywords, and construct the news event text data set from it. Each record in the data set includes the event type label of the news text and the specific text description of the event. The news event text data set is divided into a training set Train, a validation set Val, and a test set Test, specifically:
[0028] Step 1.1: Determine the keywords of specific news events according to the acquisition requirements for news event text data;
[0029] Step 1.2: For the determined news event keywords, build a data crawler system based on the Scrapy framework, obtain links to news event text data through the Baidu search engine, and from those links obtain the news event text data stream;
[0030] Step 1.3: Standardize the text content of the obtained news event text data stream: remove invalid content such as spaces, and splice the remaining valid content into a single record that serves as the standardized representation of one news text. The records together form the news event text set;
[0031] Step 1.4: Divide the news event text set obtained in step 1.3 into the training set Train, validation set Val, and test set Test in the ratio 7:2:1;
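Steps 1.3 and 1.4 can be illustrated with a minimal sketch. The function names, the record layout, and the shuffling with a fixed seed are assumptions made for illustration; the source only specifies removing invalid content such as spaces and splitting 7:2:1.

```python
import random
import re


def normalize_record(label, raw_text):
    """Remove all whitespace and splice the remaining content into one record."""
    # Dropping every space suits Chinese text; the sample data below is
    # English only for readability, so its words are merged together.
    text = re.sub(r"\s+", "", raw_text)
    return {"label": label, "text": text}


def split_dataset(records, ratios=(7, 2, 1), seed=0):
    """Shuffle the records, then split them into Train / Val / Test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to Test
    return train, val, test


# Hypothetical sample data: 20 records with the same event type label.
records = [normalize_record("flood", f" news  item {i} \n") for i in range(20)]
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))  # 14 4 2
```

Integer division sends any rounding remainder to the test set, which is one possible convention among several.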
[0032] Step 2: Vectorize the text of the training set Train, validation set Val, and test set Test divided in step 1 using the BERT representation model, and output a high-dimensional dense vector representation of the news event text data set. The BERT model parameters are: 12 layers, a hidden size of 768, and 12 attention heads. The resulting high-dimensional dense vector has 768 dimensions, for example: [8.3772335e-05, 3.9696515e-05, 3.854327e-05, 0.0018235502, 0.00028364992, 3.3392924e-05, 3.613378e-05, 0.0011939545, 8.937488e-06, 0.00028550622, 1.6984109e-06, 0.014312873, 4.2274103e-05, 0.0057512685, 0.008945758, 2.318987e-05, 1.9686187e-05, 3.6920403e-05, …]
[0033] Step 3: Take the high-dimensional dense vector representation of the news event text data set obtained in step 2 as input. Using the training set Train and validation set Val, initialize the neural network parameters with Xavier initialization, apply the dropout strategy, and update the neural network parameters and input feature vectors by gradient descent. This yields a new event discovery model based on BERT and a bidirectional long short-term memory (BiLSTM) network with an attention mechanism;
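Two of the training ingredients named in Step 3, Xavier initialization and dropout, can be sketched in plain Python. This is only an illustration of those two techniques, not the model from the source: the layer width of 128, the dropout rate of 0.5, and the choice of the uniform variant of Xavier initialization are assumptions, and the BiLSTM, attention layer, and gradient descent loop are omitted.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def dropout(vector, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(vector)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


rng = random.Random(0)
# 768 matches the BERT vector size in Step 2; 128 is a hypothetical layer width.
weights = xavier_uniform(768, 128, rng)
limit = math.sqrt(6.0 / (768 + 128))
# With rate 0.5, every surviving unit is scaled from 1.0 to 2.0.
hidden = dropout([1.0] * 10, rate=0.5, rng=rng)
```

Rescaling the surviving units keeps the expected activation unchanged, so dropout can simply be switched off at inference time.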
[0034] Step 4: Set the threshold of the new event discovery model to 0.9. If the recognition result is greater than this threshold, the event is judged to belong to a known news event type and the subject of the event is given. If the recognition result is less than the threshold, the event is judged to be a new event, and the news texts judged to be new events are collected and stored to obtain a new event text data set;
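The thresholding rule in Step 4 can be sketched as follows. The source does not say how the recognition result is computed, so this sketch assumes it is the largest softmax probability over the known event types. The label names and the `NEW_EVENT` marker are hypothetical, and a score exactly equal to the threshold is treated as a new event, a case the source leaves open.

```python
import math

THRESHOLD = 0.9  # value given in Step 4


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_or_flag(logits, labels, threshold=THRESHOLD):
    """Return (label, confidence); label is 'NEW_EVENT' below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] > threshold:
        return labels[best], probs[best]
    # Assumption: a score equal to the threshold also counts as a new event.
    return "NEW_EVENT", probs[best]


labels = ["earthquake", "election", "merger"]  # hypothetical known event types
known = classify_or_flag([8.0, 1.0, 0.5], labels)
unknown = classify_or_flag([1.0, 0.9, 1.1], labels)
print(known[0], unknown[0])  # earthquake NEW_EVENT
```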
[0035] Step 5: Remove the useless information contained in the new event text data set obtained in step 4 and retain only the descriptive content of each news text. Use the Jieba ("stutter") Chinese word segmentation tool to segment the text, and establish a custom dictionary to improve segmentation accuracy. The useless information includes special characters, stop words, and other tokens that carry no real value;
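The cleaning part of Step 5 can be sketched on text that has already been segmented. The segmentation tool and custom dictionary are not invoked here; the token list stands in for their output. The stop-word list and the rule that a token must contain at least one word character are assumptions for illustration.

```python
import re

# Hypothetical stop-word list; a real one would be loaded from a file.
STOP_WORDS = {"的", "了", "在", "是"}
# A token is kept only if it contains a letter, digit, or CJK character.
HAS_CONTENT = re.compile(r"\w")


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop empty tokens, stop words, and tokens made only of special characters."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if not token or token in stop_words or not HAS_CONTENT.search(token):
            continue
        cleaned.append(token)
    return cleaned


# Assumed output of the word segmentation stage for one news sentence.
tokens = ["青岛", "的", "新", "地铁", "线路", "开通", "了", "！", "#", " "]
print(clean_tokens(tokens))
```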
[0036] Step 6: Extract entity features and LDA topic hot-word features from the preprocessed new event text data set obtained in step 5, then splice them at the word level with the original text to form a new news text description. The entity features and LDA topic hot-word features are weighted by increasing the word frequency of the feature words. Entity features include person entity features, location entity features, and organization name entity features;
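One possible reading of the word-level splicing and frequency-based weighting in Step 6 is sketched below. The entities and LDA hot words are passed in as ready-made lists, since named entity recognition and LDA are outside this sketch. The `boost` parameter and the idea of appending each feature word a fixed number of extra times are assumptions, not details given in the source.

```python
from collections import Counter


def splice_features(tokens, entities, topic_words, boost=2):
    """Append each feature word `boost` times so its word frequency rises."""
    augmented = list(tokens)
    for word in list(entities) + list(topic_words):
        augmented.extend([word] * boost)
    return augmented


tokens = ["青岛", "地铁", "线路", "开通"]  # segmented news text
entities = ["青岛"]                        # assumed output of entity extraction
topic_words = ["地铁"]                     # assumed LDA topic hot words
augmented = splice_features(tokens, entities, topic_words)
counts = Counter(augmented)
print(counts["青岛"], counts["地铁"], counts["线路"])  # 3 3 1
```

Because the weighting is expressed through word frequency, it carries over directly into any later frequency-based measure computed on the augmented text.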
[0037] Step 7: For the news text data set processed in step 6, calculate the term frequency-inverse document frequency (TF-IDF) of each word to measure its importance relative to the current topic, and assign each word a corresponding weight. An example of the resulting weights is: 0.11178106295272044, 0.11178106295272044, 0.11178106295272044, 0.11178106295272044, 0.11178106295272044, 0.16767159442908067…
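The weighting in Step 7 can be sketched with a basic TF-IDF computation. The source does not state which TF-IDF variant it uses, so this sketch assumes the plain form tf × log(N / df) without smoothing; it is not expected to reproduce the example weights above. The sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [["flood", "rain", "rain"], ["flood", "rescue"]]  # hypothetical data
weights = tf_idf(docs)
# "flood" occurs in every document, so its weight is 0 in this variant.
print(weights[0])
```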
[0038] Step 8: Using the features and weight values obtained in steps 6 and 7, apply the K-means algorithm to cluster the new event text data set obtained in step 7 into multiple events, and perform topic modeling analysis on the new events. Combining the clustering result with the TF-IDF representation of the new event text set, ten keywords are extracted for each event as the subject words of the new event, which completes the extraction of the new event topic. K-means new event topic extraction is an iterative process divided into four steps. First, k objects in the news text set are selected as initial centers, each representing a cluster center. Second, each data object in the sample is assigned, according to its Euclidean distance to these cluster centers, to the class of the nearest cluster center. Then, the mean of all objects in each class is taken as the new cluster center of that class, and the value of the objective function is calculated. Finally, it is judged whether the cluster centers and the value of the objective function have changed: if not, the result is output; if so, the process returns to the second step. After the final clustering is completed, the keywords of each event category are extracted by combining the TF-IDF representation of the new event text.
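The four K-means steps and the keyword extraction in Step 8 can be sketched in plain Python. The random choice of initial centers, the sum of squared distances as the objective function, the iteration cap, and ranking keywords by summed TF-IDF weight within a cluster are assumptions; the toy 2-D points and weight dictionaries are hypothetical stand-ins for real document vectors.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k objects as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    previous_objective = None
    assignments = []
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest center (Euclidean distance).
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 3: recompute each center as the mean of its members,
        # then evaluate the objective (sum of squared distances).
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        objective = sum(
            squared_distance(p, new_centers[a]) for p, a in zip(points, assignments)
        )
        # Step 4: stop when neither the centers nor the objective changed.
        if new_centers == centers and objective == previous_objective:
            break
        centers, previous_objective = new_centers, objective
    return assignments, centers


def top_keywords(tfidf_docs, assignments, cluster, n=10):
    """Rank the words of one cluster by their summed TF-IDF weight."""
    totals = {}
    for doc_weights, assigned in zip(tfidf_docs, assignments):
        if assigned == cluster:
            for word, value in doc_weights.items():
                totals[word] = totals.get(word, 0.0) + value
    return sorted(totals, key=lambda w: (-totals[w], w))[:n]


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
tfidf_docs = [
    {"flood": 0.5, "rain": 0.2},
    {"flood": 0.3, "dam": 0.4},
    {"vote": 0.6},
    {"vote": 0.1, "poll": 0.3},
]
assignments, centers = kmeans(points, k=2)
keywords = top_keywords(tfidf_docs, assignments, assignments[0], n=2)
```

With `n=10`, `top_keywords` returns up to ten subject words per event, matching the count given in Step 8.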

