An event fine-grained public opinion analysis method based on a BERT model

A BERT-based, event-level fine-grained sentiment analysis method addresses the low efficiency of data acquisition and analysis in fine-grained sentiment analysis of Chinese microblog texts. It achieves efficient six-class sentiment classification and meets sentiment analysis needs in specific scenarios.

CN117009513B · Active · Publication Date: 2026-06-12 · NANJING UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV
Filing Date
2023-05-09
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies for fine-grained sentiment analysis of Chinese microblog texts suffer from difficulties in data acquisition, low data value density, and challenges in achieving public opinion analysis in specific scenarios, resulting in low efficiency in sentiment analysis.

Method used

The method performs fine-grained sentiment analysis based on the BERT model. It builds a new crawler system, applies label noise discrimination and smoothing, uses a Transformer encoder architecture, and applies a difference-in-differences model. Together, the data acquisition, category label correction, and model building steps optimize data acquisition and sentiment analysis and address the problems of data noise and class imbalance.

Benefits of technology

It achieves efficient six-class classification of Chinese microblog text sentiment, improves the efficiency and accuracy of sentiment analysis, meets the needs of public opinion analysis in specific scenarios, and reduces the complexity of analysis.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses an event fine-grained public opinion analysis method based on a BERT model and relates to the field of natural language processing sentiment analysis. Specifically, the method comprises the following steps: in a data acquisition step, a Python web crawler collects microblog information containing specific keywords from a microblog platform and stores it in a TIDB database; a label noise discrimination and smoothing method guided by the loss distribution is used to reduce the extent to which label noise misleads the model; a sentiment analysis model is constructed in which the attention mechanism is modified according to domain deviation and prior knowledge is combined with the model, so that a fine-grained classification model for Chinese microblog text, WEBERT, is learned and generated and six-class classification of Chinese text sentiment is realized; finally, the predicted results are input into a difference-in-differences model. The application can help researchers perform fast and effective sentiment analysis and construct an effective analysis process in limited time.

Description

Technical Field

[0001] This invention relates to a fine-grained event sentiment analysis method based on the BERT model, belonging to the field of natural language processing, and is particularly applicable to the field of fine-grained sentiment analysis of Chinese microblog texts.

Background Technology

[0002] Traditional techniques for analyzing online public opinion primarily include: content analysis, which describes and infers the trends in the generation and changes of public opinion information; and text data mining, which includes feature extraction, text classification, correlation analysis, and trend prediction. In the era of big data, the ability of computers to process information has greatly improved, and the real-time nature and complexity of public information exchange are increasing. Utilizing key computer technologies such as data processing, text mining, and sentiment analysis has gradually become a key research direction in online public opinion analysis.

[0003] Public opinion research based on public platforms such as Weibo has become a key focus. Currently, fine-grained sentiment analysis technology in Chinese is a major direction in the field of computer science. Due to the proliferation of social media and the low barrier to entry for posting messages, sentiment or opinions from social media provide the latest and hottest information. The increase in available information on social media makes sentiment analysis even more important. Sentiment analysis has significant practical value. It can provide online suggestions and recommendations for customers and businesses; sentiment analysis can serve multiple downstream tasks such as public opinion detection. Methods based on sentiment vocabularies and machine learning are often used to perform sentiment analysis tasks. Since sentiment vocabularies require expert knowledge to maintain and often exhibit domain relevance, supervised learning strategies based on labeled data have gradually become the mainstream approach for this task in recent years. Thanks to the powerful natural language understanding capabilities of pre-trained language models, sentiment analysis models using pre-trained language models (such as BERT) as the foundation for semantic understanding are constantly emerging and have become the most advanced architecture in academia. However, research on mining niche emotions under fine-grained sentiment classification and extracting more value is still insufficient. In public opinion analysis research, there is also the problem of difficulty in constructing effective variables using data processing. Information extraction and text sentiment analysis are the main directions of public opinion analysis research both domestically and internationally. 
Given the defining characteristics of public opinion data (large volume, diverse types, low data value density, and rapid data generation and processing), utilizing information technology to assist in the analysis of public opinion information is an important trend. However, due to the wide and scattered distribution of online public opinion information, obtaining sufficient and effective information is difficult. Furthermore, the limited richness of public opinion data in specific scenarios makes it challenging to conduct public opinion analysis targeting specific periods and regions.

Summary of the Invention

[0004] The purpose of this invention is to overcome the shortcomings of existing technologies and to address fine-grained sentiment analysis of Chinese microblog texts. To that end, it proposes a fine-grained public opinion analysis method based on the BERT model that helps researchers formulate reasonable public opinion research plans, build analyses within a limited time, and obtain fine-grained sentiment predictions.

[0005] This invention adopts the following technical solution: a fine-grained public opinion analysis method based on the BERT model, comprising the following steps:

[0006] Step SS1: Data acquisition step, including: building a new crawler system based on the cookie pool to provide raw data for subsequent Gaussian mixture model prediction and analysis; adding an expiration time to each account in the cookie pool; randomly selecting an account for each request; and excluding a selected account from the subsequent random rotation until its expiration time has passed.

[0007] Step SS2: Category label correction step, which includes adopting a label noise discrimination and smoothing method based on loss distribution to provide effective label smoothing for label noise, thereby reducing the impact of noisy labels in the labeled dataset on the training process of Gaussian mixture model;

[0008] Step SS3: Model building steps, specifically including: building a pre-trained language model based on the Transformer encoder architecture, wherein the Transformer encoder is composed of several identical layers stacked together, and each layer of the Transformer encoder includes a multi-head self-attention mechanism and a feedforward neural network;

[0009] Step SS4: Analysis and prediction steps, specifically including: using the difference-in-differences (DID) model to characterize the emotions generated by the same type of event at different stages, using the same event at different times as the independent variable and the public's emotions at the two stages as the dependent variable, constructing the difference-in-differences model and conducting analysis.

[0010] In a preferred embodiment, step SS1 specifically includes: the Cookies pool includes a PoCookie module, a CoManage module, and a SpCookie module. The PoCookie module is responsible for adding, obtaining, and deleting cookies, and checks the validity of cookies through the request module; the CoManage module checks and manages cookies by periodically executing the cookie manager; the SpCookie module implements the function of crawling cookies by simulating login, and achieves efficient maintenance of the cookie pool based on Redis Hash mapping.
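The rotation rule described for the cookie pool (random selection per request, with a selected account excluded until its expiration time passes) can be sketched as follows. This is a minimal illustration only: the class name `CookiePool`, the `cooldown_seconds` parameter, and the in-memory dictionaries standing in for the Redis hash are all assumptions, not the patent's implementation.

```python
import random
import time


class CookiePool:
    """In-memory stand-in for the Redis-hash-backed cookie pool (illustrative only)."""

    def __init__(self, cooldown_seconds, clock=time.monotonic, rng=None):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock                 # injectable so the sketch can be tested
        self._rng = rng or random.Random()
        self._cookies = {}                  # account -> cookie string
        self._busy_until = {}               # account -> time at which it may be reused

    def add(self, account, cookie):
        self._cookies[account] = cookie
        self._busy_until[account] = 0.0

    def delete(self, account):
        self._cookies.pop(account, None)
        self._busy_until.pop(account, None)

    def available(self):
        """Accounts whose expiration time has already passed."""
        now = self._clock()
        return sorted(a for a, t in self._busy_until.items() if t <= now)

    def acquire(self):
        """Pick a random available account and exclude it until its expiration time."""
        candidates = self.available()
        if not candidates:
            return None
        account = self._rng.choice(candidates)
        self._busy_until[account] = self._clock() + self.cooldown_seconds
        return account, self._cookies[account]
```

A production pool would also validate cookies and refresh them by simulated login, as the PoCookie, CoManage, and SpCookie modules are described as doing; those parts are omitted here.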

[0011] In a preferred embodiment, the main structure of the new crawler system is divided into a control crawler, a UID crawler, and a Weibo detailed content crawler. The control crawler is mainly used to control the crawler process and the progress status of the overall crawler process. The progress status includes whether it is necessary to change the account or proxy. The UID crawler is used to solve the problem of random data loss caused by the special content retrieval engine of the Weibo platform. The Weibo detailed content crawler is used for dynamic data crawling.

[0012] In a preferred embodiment, step SS1 specifically includes the following steps:

[0013] Step SS11: Initial state;

[0014] Step SS12: The UID crawler builds the next batch of URLs;

[0015] Step SS13: Construct the URL of the next page;

[0016] Step SS14: Send a data retrieval request to the Weibo platform and receive dynamic data sent by the Weibo platform;

[0017] Step SS15: Determine whether the webpage returns data and whether the crawler has been detected by the Weibo platform. If no data is returned or the crawler has been detected, proceed to step SS16; otherwise, proceed to step SS17.

[0018] Step SS16: Perform main account replacement through IP proxy pool and cookie pool, and match the corresponding cookies;

[0019] Step SS17: Obtain Weibo UID data;

[0020] Step SS18: Write the Weibo UID data into the database and hand it over to the Weibo detailed content crawler for subsequent content crawling;

[0021] Step SS19: Repeatedly check whether the current crawling process is complete, until all crawling tasks are completed;

[0022] Step SS10: End state.
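The control flow of steps SS12 to SS19 can be read as a retry loop that rotates the account whenever a request is blocked. The sketch below assumes two hypothetical callables, `fetch` and `rotate_account`, that stand in for the real request and the IP proxy/cookie pool logic; neither name comes from the patent.

```python
def crawl_uids(batches, fetch, rotate_account, max_retries=3):
    """Collect UIDs batch by batch, rotating the account when a fetch is blocked.

    `fetch(url, account)` returns a list of UIDs, or None when no data came back
    or the platform detected the crawler. Both callables are hypothetical.
    """
    account = rotate_account()
    collected = []
    for url in batches:
        for _ in range(max_retries):
            uids = fetch(url, account)
            if uids is not None:
                collected.extend(uids)      # SS17/SS18: obtain and store UID data
                break
            account = rotate_account()      # SS16: switch account and cookies
    return collected
```

In this sketch a batch is skipped after `max_retries` blocked attempts; the patent text does not say how persistent failures are handled, so that limit is an assumption.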

[0023] In a preferred embodiment, step SS2 specifically includes:

[0024] Step SS21: Construct a noisy pseudo-data set Uˆ. For the selection of the clean sample data set Dˆ, according to the principle of minimizing loss, the loss distributions of clean samples and noisy samples tend to follow two Gaussian distributions during training. The loss distributions of clean samples and noisy samples are modeled as a Gaussian mixture model. The parameters of the Gaussian mixture model are estimated by maximizing the log-likelihood function.

[0025] Step SS22: Train the Gaussian mixture model for a binary classification task using the constructed datasets Uˆ and Dˆ. The noise label discriminator needs to determine which distribution the input data comes from. When the input data is sampled from Uˆ or Dˆ, the Gaussian mixture model needs to determine the conditional probability that the input data belongs to Uˆ. The Gaussian mixture model predicts the following for the input samples:

[0026] $p(y = 1 \mid X) = \sigma(W h + b)$;

[0027] where $p(y = 1 \mid X)$ represents the conditional probability given by the discriminant network and $\sigma$ represents the sigmoid function; $W$ and $b$ represent the classification head parameters of the noise label discriminator; $X$ represents the concatenated input of the text $x$ and $T(y)$, where $T(y)$ is the label $y$ converted into its corresponding natural language form; and $h$ is the embedding extracted at the first [CLS] token of the encoder;

[0028] Step SS23: Through noisy data augmentation, the data loss distribution predicted by the noise label discriminator will produce significant differences. The Gaussian mixture model mitigates the impact of noisy labeled samples on the training of the Gaussian mixture model by smoothing possible noisy samples.
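To make the loss-distribution modelling concrete, the following sketch fits a two-component one-dimensional Gaussian mixture to per-sample losses with the EM algorithm and returns, for each sample, the posterior probability of belonging to the small-mean (clean) component. The function name, the initialisation at the extremes, and the variance floor are illustrative assumptions; a library implementation could be used instead.

```python
import math


def fit_two_gaussians(losses, max_iter=200, tol=1e-8):
    """EM for a two-component 1-D Gaussian mixture over per-sample losses."""
    lo, hi = min(losses), max(losses)
    means = [lo, hi]                          # initialise at the extremes
    variances = [max((hi - lo) ** 2 / 4, 1e-6)] * 2
    weights = [0.5, 0.5]
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: responsibility of each component for each loss value
        resp, ll = [], 0.0
        for x in losses:
            dens = [
                w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) + 1e-300
            resp.append([d / total for d in dens])
            ll += math.log(total)
        # M-step: re-estimate weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk,
                1e-6,
            )
        if abs(ll - prev_ll) < tol:           # log-likelihood has converged
            break
        prev_ll = ll
    clean = 0 if means[0] <= means[1] else 1  # small-mean component = clean samples
    # Responsibilities come from the final E-step.
    return means, [r[clean] for r in resp]
```

Samples with a high clean probability would form the clean set, and one minus that probability can serve as a per-sample noise estimate.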

[0029] In a preferred embodiment, step SS2 includes the following specific steps:

[0030] Step SS21: Initial state;

[0031] Step SS22: Construct a noisy pseudo-data set Uˆ, where yˆi is a noise label randomly sampled from the non-training set labels;

[0032] Step SS23: Solve using the EM algorithm, stopping iteration when the log-likelihood function converges or the maximum number of iterations is reached;

[0033] Step SS24: Select samples whose loss follows a Gaussian distribution with a mean below a set threshold as the clean sample set, denoted as Dˆ;

[0034] Step SS25: Train the binary classification task using the constructed datasets Uˆ and Dˆ;

[0035] Step SS26: Obtain the noise probability for each sample, which will be used to smooth the original data labels;

[0036] Step SS27: Adjust the final loss function calculation during the model training phase;

[0037] Step SS28: End state.

[0038] As a preferred embodiment, step SS3 specifically includes: assuming that only first-order dependencies exist in the context of a sentence, so that only the two words immediately to the left and right of a given word are considered; to align the distributions of the two domains at the word level, reducing the influence of words with large deviations on the context by controlling the information interaction between words; the pre-trained language model corrects the attention score, reducing the attention score of words with large domain deviations in the context. For the query $q$ and keys $k_1, \ldots, k_n$, the corrected attention score is calculated as:

[0039] $\hat{A}_i = \frac{q \cdot k_i}{\sqrt{n(k_i)}} - \lambda\, d(w_i)$;

[0040] where $\hat{A}_i$ denotes the corrected attention score between the query $q$ and the key $k_i$, $d(w_i)$ is the domain deviation of word $w_i$, the hyperparameter $\lambda$ controls the degree of correction, and $n(\cdot)$ represents the dimension of its argument. Furthermore, the training objective of the model is adjusted by dynamically utilizing the distribution of labels in the training set so that the distribution of the model's output labels conforms as closely as possible to the distribution of labels in the original training set. First, the frequency p(y) of each category is counted from the training set as prior knowledge. Based on this, the prior knowledge is combined with the pre-trained language model, and the output distribution of the pre-trained language model is used to fit the original distribution p(y). After the training process ends, the trained sentiment analysis model WEBERT is output, which achieves six-class classification of Chinese sentiment into the following categories: neutral; happy and encouraging; angry; sad and disgusted; afraid; surprised and exclaiming.

[0041] In a preferred embodiment, step SS3 specifically includes the following steps:

[0042] Step SS301: Initial state;

[0043] Step SS302: Analyze the PMI data for each word in different domain datasets;

[0044] Step SS303: Calculate the domain offset of each word W and context c in distribution p and distribution q based on KL divergence;

[0045] Step SS304: Based on first-order context dependencies, simplify the calculation of the probability that word $w_t$ occurs in the context of $w_{t-1}$ and $w_{t+1}$, where $w_{t-1}$ and $w_{t+1}$ denote the words at the (t-1)th and (t+1)th positions in a sentence;

[0046] Step SS305: Modify the attention score Ai, and introduce a hyperparameter λ to control the degree of correction, where n(·) represents the dimension of ·;

[0047] Step SS306: Perform softmax processing;

[0048] Step SS307: Output the corrected attention score;

[0049] Step SS308: Model the mutual information of the two distributions before and after introducing prior probabilities;

[0050] Step SS309: Softmax normalization operation, which transforms the distribution calculation into a probability distribution;

[0051] Step SS310: Perform label distribution weighting and adjust the original loss value based on different category distribution information;

[0052] Step SS311: Improve the loss calculation of the model by introducing adjustable weights to the estimated p(y);

[0053] Step SS312: End state.
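One reading of steps SS305 to SS307 is that a penalty proportional to each word's domain offset is subtracted from the scaled dot-product score before the softmax. The sketch below follows that reading; the function name, the argument layout, and the exact form of the penalty (`lam * d`) are assumptions made for illustration.

```python
import math


def corrected_attention(query, keys, domain_offsets, lam=0.1):
    """Scaled dot-product scores minus lam * d(w_i), followed by softmax."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale - lam * d
        for key, d in zip(keys, domain_offsets)
    ]
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With `lam=0` the function reduces to ordinary scaled dot-product attention; increasing `lam` shifts weight away from words with a large domain offset.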

[0054] In a preferred embodiment, step SS4 specifically includes: the difference-in-differences model is:

[0055] $Y_{it} = \beta_0 + \beta_1 G_i + \beta_2 T_t + \beta_3 (G_i \times T_t) + \lambda X_{it} + \varepsilon_{it}$;

[0056] where $Y_{it}$ is the explained variable, $G_i$ is the grouping variable, $T_t$ is the time dummy variable, $\varepsilon_{it}$ is the random error term, the constant term $\beta_0$ represents the baseline difference between the predictor and the predicted variable, $\beta_1$ captures the group effect of the treatment group, $\beta_2$ controls for the time effect of the treatment period, $\beta_3$ measures the true effect on the treatment group during the treatment period, $\lambda$ is a hyperparameter used to control the trade-off between the empirical loss function and the regularization term, thereby controlling model complexity and preventing overfitting, and $X_{it}$ denotes other control variables that may affect the results. On this basis, a spatiotemporal comparative experiment is conducted to study and analyze the impact of the event on public opinion.
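In the simplest two-group, two-period case without control variables, the difference-in-differences effect equals the change in the treated group's mean outcome minus the change in the control group's mean outcome. The sketch below computes that quantity directly; the row layout `(outcome, treated, post)` and the function name are assumptions for illustration, and a full analysis with controls would use a regression package instead.

```python
from statistics import mean


def did_estimate(rows):
    """rows: iterable of (outcome, treated, post) with treated/post in {0, 1}.

    Every (treated, post) combination must appear at least once.
    """
    cell = {(g, t): [] for g in (0, 1) for t in (0, 1)}
    for outcome, treated, post in rows:
        cell[(treated, post)].append(outcome)
    m = {key: mean(values) for key, values in cell.items()}
    # (treated change over time) minus (control change over time)
    return (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])
```

Here the outcome could be, for example, the share of posts that the sentiment model assigns to a given emotion category in each group and period.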

[0057] In a preferred embodiment, step SS4 specifically includes the following steps:

[0058] Step SS41: Initial state;

[0059] Step SS42: Identify the explanatory and explained variables and construct the difference-in-differences model;

[0060] Step SS43: Measure variable characteristics based on propensity score matching.

[0061] Step SS44: Perform regression using the difference-in-differences model and conduct time difference analysis;

[0062] Step SS45: Output the test results of the spatiotemporal characteristics of public opinion to verify the correctness of the conclusion;

[0063] Step SS46: End state.

[0064] As a preferred embodiment, the specific process of data acquisition in step SS1 is as follows: automatic account login and verification, obtaining the corresponding Weibo account and password from the TIDB database according to the designed time rotation algorithm for simulated login; automatic cookie acquisition and update, establishing a cookie pool, and determining which accounts need to acquire cookies based on Hash information; in terms of data storage, storing Weibo data in the TIDB distributed database according to the structural characteristics of the data, and designing a deduplication strategy to achieve high-efficiency data deduplication on the same timeline.

[0065] In a preferred embodiment, in step SS2, the EM algorithm stops iterating when the log-likelihood function converges or the maximum number of iterations is reached, and the model parameters Θ are output. During training, samples whose loss follows a small-mean Gaussian distribution are selected as clean samples, denoted as Dˆ. After noise label smoothing, the probability distribution of the corresponding data labels is adjusted to (1−pi), which allows for adjustment of the final loss calculation during model training. Smoothing potentially noisy samples mitigates the impact of noisy labeled samples on model training.
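The smoothing step can be illustrated with a soft target that keeps probability (1 − p_i) on the given label. How the remaining mass p_i is distributed is not stated in the text; the sketch below assumes it is spread uniformly over the other classes, and both function names are hypothetical.

```python
import math


def smooth_label(label, noise_prob, num_classes=6):
    """Soft target: 1 - p_i on the given label, p_i spread over the other classes."""
    other = noise_prob / (num_classes - 1)   # assumption: uniform over other classes
    return [1.0 - noise_prob if c == label else other for c in range(num_classes)]


def soft_cross_entropy(probs, target):
    """Cross-entropy between predicted class probabilities and a soft target."""
    return -sum(t * math.log(max(p, 1e-12)) for p, t in zip(probs, target))
```

A sample with noise probability 0 keeps its one-hot label, while a sample judged likely to be noisy contributes a flatter target and therefore a weaker training signal for its original label.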

[0066] As a preferred embodiment, in step SS3, the model is built based on the Transformer architecture, improving the attention mechanism and optimizing for the class imbalance problem. Because d(wi) ∈ [0, ∞), the attention score of out-of-domain words receives a negative correction, so the attention that the context pays to out-of-domain words becomes smaller, which indirectly reduces the influence of out-of-domain information and reduces the difference between the two domains. A hyperparameter λ is used to control the strength of the correction, preventing the correction from becoming so large, given the unbounded range of d(wi), that it distorts the semantic representation. A conditional probability model is designed based on the mutual information of different distributions, and the output distribution of the model is used to fit the original distribution, alleviating the class label imbalance problem of the original dataset. Thus, a fine-grained sentiment prediction model for Chinese microblogs, WEBERT, is learned and generated, achieving six-class classification of Chinese text sentiment.
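One common way to combine a class prior with a classifier's output, in the spirit of the mutual-information formulation described here, is to shift the logits by a weighted log prior before the softmax. The sketch below assumes that form; the weight `tau`, the function names, and the use of plain training-set frequencies are illustrative choices rather than the patent's stated formula.

```python
import math
from collections import Counter


def class_priors(labels, num_classes=6):
    """Empirical frequency p(y) of each class in the training labels."""
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in range(num_classes)]


def prior_adjusted_loss(logits, label, priors, tau=1.0):
    """Cross-entropy on logits shifted by tau * log p(y) (assumed form)."""
    shifted = [z + tau * math.log(max(p, 1e-12)) for z, p in zip(logits, priors)]
    peak = max(shifted)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in shifted))
    return log_norm - shifted[label]
```

With equal logits the loss for a rare class is larger than for a frequent class, so training pushes the raw logits of rare classes up; setting `tau=0` recovers ordinary cross-entropy.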

[0067] As a preferred embodiment, in step SS4, based on the premise that the selection and grouping of experimental samples are completely random "natural events", the propensity score matching (PSM) method is used to measure the micro-characteristics of each user, thereby predicting the impact on changes in public sentiment and realizing the analysis and judgment of the spatiotemporal characteristics of public opinion.

[0068] The beneficial effects achieved by this invention are as follows: This invention optimizes large-text web crawlers based on web crawler design principles, proposes time-based replacement and data filtering mechanisms to meet the real-time requirements of data, and realizes a high-efficiency microblogging data crawler that meets the requirements. It solves problems such as relying entirely on individual crawling capabilities, large file sizes, blocked crawler IPs, CAPTCHAs, and empty content. Based on this, an optimized BERT model is proposed, employing a domain-biased attention mechanism for correction, a Gaussian mixture model for label correction, and a distribution-based class label imbalance optimization method to adapt to the microblogging sentiment multi-classification requirements of this invention, achieving six-class classification of Chinese text sentiment. Finally, it satisfies the presuppositions of the DID (Difference-In-Differences) model, enabling the study of public opinion. This invention solves the current problem of unmet needs for fine-grained sentiment analysis of Chinese microblogging, thereby helping researchers conduct rapid and effective sentiment analysis, and helping them build effective analysis workflows within a limited time. Ultimately, it aims to improve the efficiency of fine-grained Chinese sentiment analysis models and reduce the complexity of public opinion analysis.

Attached Figure Description

[0069] Figure 1 is a flowchart of a fine-grained public opinion analysis method based on the BERT model in an embodiment of the present invention.

[0070] Figure 2 is a flowchart of the data acquisition step in Figure 1.

[0071] Figure 3 is a flowchart of the class label correction step in Figure 1.

[0072] Figure 4 is a flowchart of the model construction step in Figure 1.

[0073] Figure 5 is a flowchart of the analysis and prediction step in Figure 1.

Detailed Implementation

[0074] The present invention will be further described below with reference to the accompanying drawings. The following embodiments are only used to more clearly illustrate the technical solution of the present invention, and should not be used to limit the scope of protection of the present invention.

[0075] Example 1: As shown in Figure 1, Figure 2, Figure 3, Figure 4 and Figure 5, this invention proposes a fine-grained public opinion analysis method based on the BERT model, which includes the following steps.

[0076] S1 Data Acquisition: The specific process of data acquisition is as follows: Automatic account login and verification: Retrieve the corresponding Weibo account and password from the TIDB database according to the designed time rotation algorithm for simulated login; Automatic cookie acquisition and update: Establish a cookie pool and determine which accounts need to acquire cookies based on Hash information; In terms of data storage, store Weibo data in the TIDB distributed database according to its structural characteristics, and design a deduplication strategy to achieve high-efficiency data deduplication on the same timeline.

[0077] S2 Class label correction: The EM algorithm stops iterating when the log-likelihood function converges or the maximum number of iterations is reached, and outputs the model parameters Θ. During training, samples whose loss follows the small-mean Gaussian distribution are selected as clean samples, denoted as Dˆ. After noise label smoothing, the probability of the corresponding data label is adjusted to (1−pi), which adjusts the final loss calculation during model training. Smoothing potentially noisy samples mitigates the impact of noisy labeled samples on model training.

[0078] S3 Model building: The model is built on the Transformer architecture, improving the attention mechanism and optimizing for class imbalance. Because d(wi) ∈ [0, ∞), the attention score for out-of-domain words receives a negative correction, so the attention that the context pays to out-of-domain words becomes smaller. This reduces the influence of out-of-domain information and minimizes the difference between the two domains. A hyperparameter λ is used to control the strength of the correction, preventing the correction from becoming so large, given the unbounded range of d(wi), that it distorts the semantic representation. A conditional probability model is designed based on mutual information from different distributions. The model's output distribution is used to fit the original distribution, alleviating the class label imbalance problem in the original dataset. This leads to the learning and generation of a fine-grained sentiment prediction model for Chinese microblogs, WEBERT, achieving six-class classification of Chinese text sentiment.

[0079] S4 Analysis and prediction: Based on the premise that the selection and grouping of experimental samples are completely random "natural events", the propensity score matching (PSM) method is used to measure the micro-characteristics of each user, thereby predicting the impact on changes in public sentiment and realizing the analysis and judgment of the spatiotemporal characteristics of public opinion.

[0080] Figure 2 is a flowchart for data acquisition. Addressing the issues of low crawling efficiency and random data loss in current web crawling systems, this invention introduces a cookie pool design and a distributed database, and improves the Weibo crawling process by breaking down the crawler into a UID sub-crawler and a detailed content sub-crawler. This improves crawling efficiency and ensures data richness. The specific steps are as follows: Step 1: Initial state; Step 2: The UID sub-crawler constructs the next batch of URLs; Step 3: Construct the URL of the next page; Step 4: Send a request and receive data; Step 5: Determine whether the webpage returns data and whether the crawler has been detected by the platform. If no data is returned or the crawler has been detected, proceed to Step 6; otherwise, proceed to Step 7; Step 6: Use the IP proxy pool and cookie management pool to handle the main account change and automatically match the corresponding cookies; Step 7: Obtain Weibo UID data; Step 8: Write the data to the database and hand it over to the detailed content crawler for subsequent content crawling; Step 9: Repeat the judgment until all crawling tasks are completed; Step 10: End state.

[0081] Figure 3 is a flowchart for class label correction. A label noise discrimination and smoothing strategy guided by loss distribution is proposed, which can effectively smooth label noise, thereby mitigating the misleading effect of label noise on the model. The calculation process uses the EM algorithm. The specific steps are as follows: Step 1: Initial state; Step 2: Construct a noisy pseudo-data set Uˆ, where yˆi is a noise label randomly sampled from the non-training set labels; Step 3: Solve based on the EM algorithm, stopping iteration when the log-likelihood function converges or reaches the maximum number of iterations; Step 4: Select samples whose loss follows a Gaussian distribution with a small mean as clean samples, denoted as Dˆ; Step 5: Train the binary classification task using the constructed datasets Uˆ and Dˆ; Step 6: Obtain the noise probability of each sample, which will be used to smooth the original data labels; Step 7: Adjust the final loss calculation during the model training phase; Step 8: End state.

[0082] Figure 4 is a flowchart for model construction. Based on the powerful pre-trained model BERT, attention score correction is implemented to address the domain offset problem. Simultaneously, the frequency p(y) of each category is acquired as prior knowledge and combined with the model. This leads to the learning and generation of a fine-grained sentiment prediction model for Chinese microblogs, WEBERT. The specific steps are as follows: Step 1: Initial state; Step 2: Statistical analysis of PMI data for each word in different domain datasets; Step 3: Calculation of the domain offset of each word W and its context c in distributions p and q based on KL divergence; Step 4: Based on first-order context dependencies, simplify the calculation of the probability that word $w_t$ occurs in the context of $w_{t-1}$ and $w_{t+1}$; Step 5: Modify the attention score Ai, with hyperparameter λ controlling the degree of correction, and n(·) representing the dimension of ·; Step 6: Perform softmax processing; Step 7: Output the corrected attention score; Step 8: Model the mutual information of the two distributions before and after introducing the prior probability; Step 9: Perform softmax normalization to transform the distribution calculation into a probability distribution; Step 10: Perform label distribution weighting to adjust the original loss value based on different category distribution information; Step 11: Improve the model's loss calculation by introducing adjustable weights to the estimated p(y), and output the WEBERT model; Step 12: End state.

[0083] Figure 5 is a flowchart for analysis and prediction. Based on the difference-in-differences model commonly used in current analytical methods, and addressing the prevalent endogeneity problem, the WEBERT model is used to predict current Weibo sentiment. The difference-in-differences model is then used for regression analysis to demonstrate the correlation of public opinion. The specific steps are as follows: Step 1: Initial state; Step 2: Determine the explanatory and explained variables and construct the model; Step 3: Measure variable characteristics based on propensity score matching; Step 4: Perform regression analysis and conduct time difference analysis; Step 5: Output the spatiotemporal characteristics test results of public opinion to verify the correctness of the conclusions; Step 6: End state.

[0084] In summary, this invention solves the current problem of unmet needs for fine-grained sentiment analysis in Chinese microblogs, thereby helping researchers conduct rapid and effective sentiment analysis, enabling them to build effective analysis processes within a limited time, and ultimately improving the efficiency of fine-grained sentiment analysis models in Chinese and reducing the complexity of public opinion analysis.

[0085] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0086] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram. The computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.

[0087] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the specific implementation of the present invention. Any modifications or equivalent substitutions that do not depart from the spirit and scope of the present invention should be covered within the scope of protection of the claims of the present invention.

Claims

1. A fine-grained public opinion analysis method based on the BERT model, characterized in that it includes the following steps:
Step SS1: a data acquisition step, including: building a new crawler system based on a cookie pool to provide raw data for subsequent Gaussian mixture model prediction and analysis; adding an expiration time to each account in the cookie pool; and randomly selecting an account for each request, wherein a selected account does not participate in the subsequent random rotation process until its expiration time has elapsed.
Step SS2: a category label correction step, including: adopting a label noise discrimination and smoothing method based on the loss distribution to provide effective label smoothing for label noise, thereby reducing the impact of noisy labels in the labeled dataset on the training process of the Gaussian mixture model.
Step SS3: a model building step, specifically including: building a pre-trained language model based on the Transformer encoder architecture, wherein the Transformer encoder is composed of stacked identical layers, and each layer includes a multi-head self-attention mechanism and a feedforward neural network. Step SS3 further includes: assuming that only first-order dependencies exist in the context of a sentence, and therefore focusing only on the two words to the left and right of each word; and, in order to align the distributions of the two domains at the word level, reducing the influence of words with large deviations on the context by controlling the information interaction between words. The pre-trained language model corrects the attention score, reducing the attention score of words with large domain deviations in the context. For a query and a set of keys [symbols omitted], the corrected attention score is calculated by a formula [formula omitted], whose result denotes the adjusted attention score between the query and the key; the hyperparameter λ controls the degree of correction, and n(·) denotes the dimension of its argument. Furthermore, the training objective of the model is adjusted dynamically by using the distribution of labels in the training set, so that the distribution of the model's output labels conforms to the distribution of labels in the original training set. First, the frequency p(y) of each category is counted from the training set as prior knowledge. This prior knowledge is then organically combined with the pre-trained language model, whose output distribution is used to fit the original distribution p(y). At the end of the training process, the trained sentiment analysis model WEBERT is output, achieving six-category Chinese sentiment classification with the following six categories: neutral; happy and encouraging; angry; sad and disgusted; afraid; surprised and exclaiming.
Step SS4: an analysis and prediction step, specifically including: using the difference-in-differences (DID) model to characterize the emotions generated by the same type of event at different stages, taking the same event at different times as the independent variable and the public's emotions at the two stages as the dependent variable, constructing the difference-in-differences model, and conducting the analysis.
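The claim states that the class frequencies p(y) are combined with the model so that its output distribution fits the training distribution, but the exact formula is not given. A minimal sketch of one common way to do this, logit adjustment (adding log p(y) to the logits before the softmax cross-entropy), is shown below. The function names, the `tau` scaling factor, and the choice of logit adjustment itself are assumptions for illustration, not the patented formulation.

```python
import math
from collections import Counter


def class_prior(labels, num_classes):
    """Estimate p(y) as the relative frequency of each class in the training labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def prior_adjusted_loss(logits, target, prior, tau=1.0, eps=1e-12):
    """Cross-entropy computed on logits shifted by tau * log p(y).

    This is an assumed form (logit adjustment); tau = 0 recovers plain cross-entropy.
    """
    shifted = [z + tau * math.log(p + eps) for z, p in zip(logits, prior)]
    m = max(shifted)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in shifted))
    return log_norm - shifted[target]


if __name__ == "__main__":
    prior = class_prior([0, 0, 0, 1], num_classes=2)
    # With equal logits, the rarer class (index 1) incurs the larger loss.
    print(prior_adjusted_loss([0.0, 0.0], 0, prior))
    print(prior_adjusted_loss([0.0, 0.0], 1, prior))
```

With equal logits the adjusted prediction reduces to the prior itself, which is the sense in which the output distribution is pulled toward p(y).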

2. The event-based fine-grained public opinion analysis method according to claim 1, characterized in that step SS1 specifically includes: the cookie pool includes a PoCookie module, a CoManage module, and a SpCookie module; the PoCookie module is responsible for adding, retrieving, and deleting cookies, and checks the validity of cookies through the request module; the CoManage module checks and manages cookies by periodically executing the cookie manager; the SpCookie module obtains cookies by simulating login, and the cookie pool is maintained efficiently on the basis of a Redis Hash mapping.
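The rotation rule from claim 1 (each account carries an expiration time, an account is picked at random per request, and a picked account sits out until its time elapses) can be sketched as follows. This is an illustration under stated assumptions: an in-memory dictionary stands in for the Redis Hash, and the class name `CookiePool`, its methods, and the `cooldown_seconds` parameter are hypothetical rather than the PoCookie, CoManage, or SpCookie modules themselves.

```python
import random
import time


class CookiePool:
    """In-memory stand-in for a Redis-Hash-backed cookie pool (hypothetical API)."""

    def __init__(self, cooldown_seconds=60.0, rng=None):
        self.cooldown = cooldown_seconds
        self.cookies = {}      # account -> cookie string
        self.busy_until = {}   # account -> time before which the account is skipped
        self.rng = rng or random.Random()

    def add(self, account, cookie):
        self.cookies[account] = cookie
        self.busy_until[account] = 0.0

    def remove(self, account):
        self.cookies.pop(account, None)
        self.busy_until.pop(account, None)

    def acquire(self, now=None):
        """Pick a random account whose expiration time has passed and restart its timer.

        Returns (account, cookie), or None when every account is still cooling down.
        """
        now = time.time() if now is None else now
        ready = sorted(a for a, t in self.busy_until.items() if t <= now)
        if not ready:
            return None
        account = self.rng.choice(ready)
        self.busy_until[account] = now + self.cooldown
        return account, self.cookies[account]


if __name__ == "__main__":
    pool = CookiePool(cooldown_seconds=10.0)
    pool.add("account_a", "cookie-a")
    pool.add("account_b", "cookie-b")
    print(pool.acquire(now=0.0), pool.acquire(now=1.0), pool.acquire(now=2.0))
```

Passing `now` explicitly keeps the sketch deterministic to test; a real deployment would more likely rely on Redis key expiry or stored timestamps.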

3. The event-based fine-grained public opinion analysis method according to claim 1, characterized in that the main structure of the new crawler system is divided into a control crawler, a UID crawler, and a Weibo detailed content crawler; the control crawler is mainly used to control the crawling process and to track the progress status of the overall crawling process, the progress status including whether the account or proxy needs to be changed; the UID crawler is used to solve the problem of random data loss caused by the special content retrieval engine of the Weibo platform; and the Weibo detailed content crawler is used for dynamic data crawling.

4. The event-based fine-grained public opinion analysis method according to claim 1, characterized in that the specific steps of step SS1 include: Step SS11: Initial state; Step SS12: The UID crawler builds the next batch of page URLs; Step SS13: Construct the URL of the next page (nextpage) for pagination; Step SS14: Send a data retrieval request to the Weibo platform and receive the dynamic data it returns; Step SS15: Determine whether the webpage returns data and whether the crawler has been detected by the Weibo platform; if no data is returned or the crawler has been detected, proceed to step SS16; otherwise, proceed to step SS17; Step SS16: Replace the main account through the IP proxy pool and cookie pool, and match the corresponding cookies; Step SS17: Obtain the Weibo UID data; Step SS18: Write the Weibo UID data into the database and hand it over to the Weibo detailed content crawler for subsequent content crawling; Step SS19: Repeatedly check whether the current crawling process is complete, until all crawling tasks are completed; Step SS10: End state.
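The control flow of steps SS12 to SS19 can be sketched as a retry loop. The sketch reads step SS15 as "rotate the account when a page returns no data or the crawler is detected", which is an interpretation of the claim. The callables `fetch` and `rotate_account`, the response dictionary shape, and `max_retries` are all hypothetical placeholders, and no real Weibo endpoint is used.

```python
def crawl_uids(page_urls, fetch, rotate_account, max_retries=3):
    """Fetch each page; on an empty or blocked response, rotate the account and retry.

    fetch(url) is assumed to return a dict such as {"blocked": bool, "uids": [...]}.
    rotate_account() is assumed to switch to another proxy/cookie pair.
    """
    collected = []
    for url in page_urls:                      # SS12/SS13: iterate over page URLs
        for _attempt in range(max_retries):
            response = fetch(url)              # SS14: request and receive data
            if response.get("blocked") or not response.get("uids"):
                rotate_account()               # SS15 -> SS16: replace the account
                continue
            collected.extend(response["uids"])  # SS17: obtain UID data
            break
    return collected                           # SS18 would persist these UIDs


if __name__ == "__main__":
    def demo_fetch(url):
        return {"blocked": False, "uids": [url + "#uid"]}

    print(crawl_uids(["page1", "page2"], demo_fetch, lambda: None))
```

A page that stays empty after `max_retries` attempts is skipped, so the loop always terminates.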

5. The event-based fine-grained public opinion analysis method according to claim 1, characterized in that step SS2 specifically includes: Step SS21: Construct a noisy pseudo-data set Uˆ and select a clean sample data set Dˆ. According to the principle of minimizing loss, the loss distributions of clean samples and noisy samples tend to follow two Gaussian distributions during training; they are therefore modeled as a Gaussian mixture model, whose parameters are estimated by maximizing the log-likelihood function. Step SS22: Train the Gaussian mixture model for a binary classification task using the constructed datasets Uˆ and Dˆ. The noise label discriminator needs to determine which distribution the input data comes from: when the input data is sampled from Uˆ or Dˆ, the Gaussian mixture model needs to determine the conditional probability that the input data belongs to Uˆ. The prediction for an input sample is given by a formula [formula omitted], in which the result is the conditional probability given by the discriminant network, σ denotes the sigmoid function, two further symbols [symbols omitted] denote the classification head parameters of the noise label discriminator, X denotes the concatenation of the input sample and its label, the label being converted into its corresponding natural-language form, and the encoder output is the embedding extracted at the first [CLS] token. Step SS23: Through noisy data augmentation, the loss distribution of the data predicted by the noise label discriminator shows significant differences. The Gaussian mixture model mitigates the impact of noisy labeled samples on the training of the Gaussian mixture model by smoothing the noisy samples.
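Step SS21 models per-sample training losses as a mixture of two Gaussians fitted by maximum likelihood. A minimal pure-Python sketch of that fit using EM is given below. The function names, the initialisation at the smallest and largest loss, and the variance floor `min_var` are assumptions made for illustration; the low-mean component is treated as the clean one.

```python
import math


def _normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(losses, max_iter=200, tol=1e-8, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture to per-sample losses with EM.

    Returns (means, variances, weights, clean_prob), where clean_prob[i] is the
    responsibility of the lower-mean component for sample i (from the last E-step).
    """
    lo, hi = min(losses), max(losses)
    means = [lo, hi]                                   # start at the extremes
    spread = max((hi - lo) ** 2 / 4.0, min_var)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    prev_ll = -math.inf
    resp = []
    for _ in range(max_iter):
        # E-step: responsibility of each component for each loss value.
        resp, ll = [], 0.0
        for x in losses:
            p = [weights[k] * _normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1] + 1e-300               # guard against underflow to zero
            resp.append([p[0] / total, p[1] / total])
            ll += math.log(total)
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, min_var)
        if abs(ll - prev_ll) < tol:                    # log-likelihood has converged
            break
        prev_ll = ll
    clean = 0 if means[0] <= means[1] else 1
    return means, variances, weights, [r[clean] for r in resp]


if __name__ == "__main__":
    demo_losses = [0.10, 0.12, 0.09, 0.11, 2.0, 2.1, 1.9, 2.05]
    print(fit_two_gaussians(demo_losses)[3])
```

Samples whose clean probability falls below a chosen threshold would be the candidates for label smoothing; the threshold itself is not specified in the claim.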

6. The event-based fine-grained public opinion analysis method based on the BERT model according to claim 1, characterized in that the specific steps of step SS2 include: Step SS21: Initial state; Step SS22: Construct a set of noisy pseudo-data Uˆ, where yˆi are noise labels randomly sampled from the non-training set labels; Step SS23: Solve using the EM algorithm, stopping iteration when the log-likelihood function converges or the maximum number of iterations is reached; Step SS24: Select the samples whose loss follows the Gaussian distribution with a mean below a set threshold as the clean sample set, denoted as Dˆ; Step SS25: Train for a binary classification task using the constructed datasets Uˆ and Dˆ; Step SS26: Obtain the noise probability of each sample, which is used to smooth the original data labels; Step SS27: Adjust the final loss function calculation during the model training phase; Step SS28: End state.
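Steps SS26 and SS27 use each sample's noise probability to smooth its label and adjust the loss, without giving the formula. One plausible reading, sketched below as an assumption, keeps mass 1 - p on the annotated label and spreads the noise probability p evenly over the other classes, then takes cross-entropy against that soft target. The function names are hypothetical.

```python
import math


def smooth_label(label, noise_prob, num_classes):
    """Soft target: 1 - noise_prob on the given label, the rest spread uniformly."""
    if not 0.0 <= noise_prob <= 1.0:
        raise ValueError("noise_prob must be in [0, 1]")
    off = noise_prob / (num_classes - 1)
    return [1.0 - noise_prob if c == label else off for c in range(num_classes)]


def soft_cross_entropy(log_probs, soft_target):
    """Cross-entropy between a soft target and the model's log-probabilities."""
    return -sum(t * lp for t, lp in zip(soft_target, log_probs))


if __name__ == "__main__":
    target = smooth_label(label=2, noise_prob=0.3, num_classes=6)
    uniform = [math.log(1.0 / 6.0)] * 6
    print(target, soft_cross_entropy(uniform, target))
```

A sample judged clean (noise probability 0) keeps its one-hot label, so the adjustment only affects samples the discriminator flags.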

7. The event-based fine-grained public opinion analysis method based on the BERT model according to claim 1, characterized in that the specific steps of step SS3 include: Step SS301: Initial state; Step SS302: Compute PMI (pointwise mutual information) statistics for each word in the different domain datasets; Step SS303: Based on KL divergence, calculate the domain deviation of each word w and context c between distribution p and distribution q; Step SS304: Simplify the computation of the probability that a word occurs in its context by using first-order context dependencies, so that the context reduces to the words at the (t-1)th and (t+1)th positions in the sentence; Step SS305: Correct the attention score Ai, introducing a hyperparameter λ to control the degree of correction, where n(·) denotes the dimension of its argument; Step SS306: Perform softmax processing; Step SS307: Output the corrected attention score; Step SS308: Model the mutual information of the two distributions before and after introducing prior probabilities; Step SS309: Apply softmax normalization to convert the result into a probability distribution; Step SS310: Perform label distribution weighting and adjust the original loss value based on the distribution information of the different categories; Step SS311: Improve the loss calculation of the model by introducing adjustable weights to the estimated p(y); Step SS312: End state.
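Steps SS305 to SS307 lower the attention paid to words with a large domain deviation, with λ controlling the strength of the correction. The formula itself is not reproduced, so the sketch below assumes one simple form: a scaled dot-product score minus λ times the key word's deviation, followed by softmax. The subtractive penalty, the use of the query dimension for scaling, and the function name are all assumptions.

```python
import math


def corrected_attention(query, keys, deviations, lam=1.0):
    """Attention weights with a per-key domain-deviation penalty (assumed form).

    score_j = (query . key_j) / sqrt(dim) - lam * deviation_j, then softmax over j.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale - lam * dev
        for key, dev in zip(keys, deviations)
    ]
    m = max(scores)                       # subtract the maximum for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    q = [1.0, 0.0]
    ks = [[1.0, 0.0], [1.0, 0.0]]
    # Identical keys, but the second word has a larger domain deviation.
    print(corrected_attention(q, ks, deviations=[0.0, 2.0], lam=1.0))
```

Setting `lam=0` recovers ordinary scaled dot-product attention, which makes the effect of the correction easy to compare.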

8. The event-based fine-grained public opinion analysis method according to claim 1, characterized in that step SS4 specifically includes: the difference-in-differences (double difference) model is given by a formula [formula omitted], whose terms [symbols omitted] are, in order: the explained variable; the grouping variable; the time dummy variable; the random error term; the constant term, which represents the fundamental difference between the predictor and the predicted variable; the parameter capturing the group effect of the treatment group; the parameter controlling the time effect of the treatment period; the parameter determining the true effect on the treatment group during the treatment period; a hyperparameter used to control the trade-off between the empirical loss function and the regularization term, thereby controlling model complexity and preventing overfitting; and the control variables, that is, other variables that may affect the results. A spatiotemporal comparative experiment is conducted to study and analyze the impact of the event on public opinion.
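The terms listed in this claim match the textbook difference-in-differences regression, commonly written as Y = β0 + β1·G + β2·T + β3·(G×T) + ε, where G is the group indicator and T the time dummy. These symbols are the standard notation, not necessarily the patent's. Without control variables or regularization, the least-squares coefficients of this model equal differences of the four cell means, which the sketch below computes directly. The function name and record layout are hypothetical.

```python
from statistics import mean


def did_estimate(records):
    """2x2 difference-in-differences from (treated, post, outcome) records.

    Every one of the four (group, period) cells must contain at least one record.
    """
    cell = {(g, t): [] for g in (0, 1) for t in (0, 1)}
    for treated, post, outcome in records:
        cell[(int(treated), int(post))].append(outcome)
    m = {key: mean(values) for key, values in cell.items()}
    return {
        "const": m[(0, 0)],                                        # control group, pre-period
        "group": m[(1, 0)] - m[(0, 0)],                            # group effect
        "time": m[(0, 1)] - m[(0, 0)],                             # time effect
        "did": (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)]),  # interaction effect
    }


if __name__ == "__main__":
    # Hypothetical sentiment scores: (treated group?, after the event?, score)
    data = [
        (0, 0, 0.2), (0, 0, 0.4), (0, 1, 0.4), (0, 1, 0.6),
        (1, 0, 0.3), (1, 0, 0.5), (1, 1, 0.8), (1, 1, 1.0),
    ]
    print(did_estimate(data))
```

The "did" entry is the estimate of the true effect on the treatment group during the treatment period; adding control variables would require a full regression rather than cell means.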

9. The event-based fine-grained public opinion analysis method according to claim 1, characterized in that the specific steps of step SS4 include: Step SS41: Initial state; Step SS42: Identify the explanatory and explained variables and construct the difference-in-differences model; Step SS43: Measure variable characteristics based on propensity score matching; Step SS44: Perform regression using the difference-in-differences model and conduct time difference analysis; Step SS45: Output the test results of the spatiotemporal characteristics of public opinion to verify the correctness of the conclusion; Step SS46: End state.