A method and system for sentiment classification of a homestay review based on a multi-factor dynamic threshold

This patent application proposes a sentiment classification method for homestay reviews based on multi-factor dynamic thresholds. The method addresses the low recall of negative reviews and the insufficient robustness of traditional methods, and aims at high-precision, highly interpretable sentiment classification. It adapts to dynamic changes in homestay review data and targets the high-throughput, high-credibility requirements of online travel platforms.

CN122241218A · Pending · Publication Date: 2026-06-19 · NANTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANTONG UNIV
Filing Date
2026-03-06
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Traditional sentiment classification methods suffer from extreme class imbalance in homestay reviews, making it difficult to balance the classification accuracy of reviews with different characteristics. Negative reviews have low recall rates and lack multi-dimensional information fusion and dynamic threshold adaptability, resulting in insufficient model robustness and an inability to meet the high throughput and high credibility requirements of online travel platforms.

Method used

A sentiment classification method for homestay reviews using multi-factor dynamic thresholds is proposed. This method involves multi-source heterogeneous data collection and preprocessing, multi-source feature fusion and weight optimization, logistic regression model training and inference, four-factor dynamic threshold calculation, and Bayesian optimization and manual calibration to construct a closed-loop optimization mechanism, thereby achieving dynamic threshold adaptation and feature weight update.

Benefits of technology

It improves the recall rate and classification interpretability of negative reviews, achieves high-precision and robust sentiment classification, adapts to the dynamic changes in homestay review data, and meets the business needs of online travel platforms.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method and system for sentiment classification of homestay reviews based on multi-factor dynamic thresholds, including: collecting and preprocessing review text, time-series metadata, and entity profile data; performing multi-source feature fusion and weight optimization; training a logistic regression model to output the negative probability; extracting four factors, obtaining the optimal weights through Bayesian optimization, and calculating the dynamic classification threshold; classifying reviews at multiple levels based on the dynamic threshold and confidence level, and manually correcting low-confidence samples; generating an analysis report and providing feedback on optimized model parameters. This invention achieves adaptive dynamic threshold calculation through four-factor fusion and Bayesian optimization, solving the problem of missed negative reviews caused by traditional fixed thresholds; it enhances the collaborative representation capability of multi-dimensional features through multi-source data fusion and weight optimization; and it continuously updates the model and rules through a closed-loop collaborative optimization mechanism, improving the recall rate and classification interpretability of negative reviews, thus achieving high-precision and robust sentiment classification of homestay reviews.

Description

Technical Field

[0001] This invention relates to the fields of natural language processing and data analysis technology, and in particular to a method and system for classifying sentiment in homestay reviews based on multi-factor dynamic thresholds.

Background Technology

[0002] With the rapid development of online travel platforms, homestay reviews have become a core basis for user decision-making and business service optimization. Sentiment classification technology, as a key means of extracting the sentiment (positive, neutral, negative) of reviews, directly impacts business value. However, homestay review data commonly suffers from extreme category imbalance, and traditional sentiment classification methods have significant limitations in this scenario, making it difficult to meet practical application needs.

[0003] Currently, traditional sentiment classification methods primarily rely on fixed thresholds for decision-making, employing a globally uniform threshold to categorize sentiment. This approach fails to adapt to the individualized differences in the attributes of individual reviews. Homestay reviews exhibit significant variations in text quality, clarity of expression, timeliness, and structural complexity. Fixed thresholds struggle to ensure accurate classification across reviews with diverse characteristics, particularly resulting in low recall rates for negative reviews (a minority category), leading to the systematic omission of a large amount of valuable negative feedback. Furthermore, existing methods largely focus on the semantic features of the review text itself, failing to effectively integrate multi-dimensional information such as temporal metadata and entity profile data. This results in a somewhat one-sided feature representation, further impacting the reliability of classification results. Simultaneously, traditional techniques lack a closed-loop optimization mechanism based on classification results, failing to continuously optimize feature weights and threshold parameters based on manual annotation corrections and historical classification patterns. This leads to insufficient model robustness, making it difficult to adapt to the dynamic changes in homestay review data.

[0004] In recent years, sentiment classification techniques based on machine learning and deep learning have been gradually introduced into this field. Some studies have attempted to improve performance by optimizing model structure or adjusting training data distribution, but significant shortcomings remain. Existing techniques typically use deep learning models or simple linear weighting methods to process text features alone. While this can improve the accuracy of semantic understanding to some extent, it often ignores the dynamic adaptability of classification thresholds and fails to personalize threshold adjustments for the features of individual comments. Furthermore, complex deep learning-based models often have high computational resource requirements, and their decision-making logic is black-box, lacking interpretability, making it difficult to meet the high-throughput and high-reliability business scenarios required by online travel platforms.

Summary of the Invention

[0005] In view of the shortcomings of the prior art, the purpose of this invention is to provide a method and system for classifying the sentiment of homestay reviews based on multi-factor dynamic thresholds, so as to solve one or more problems in the prior art.

[0006] To achieve the above objectives, the technical solution of the present invention is as follows:

[0007] A method and system for classifying sentiment in homestay reviews based on multi-factor dynamic thresholds, wherein the classification method includes the following steps:

[0008] S1. Multi-source heterogeneous data collection and preprocessing: Collect homestay review text data, time-series metadata and entity profile data, and perform data preprocessing respectively;

[0009] S2. Multi-source feature fusion and weight optimization: The features of the preprocessed data are concatenated, and the weights of the feature types are iteratively optimized based on classification feedback;

[0010] S3. Training and Inference of Logistic Regression Model: Train the logistic regression model using labeled datasets and output the negative probability and feature weight coefficients of the comments;

[0011] S4. Four-factor dynamic threshold calculation: Extract the temporal seasonal composite decay factor, model decision consistency factor, scoring profile balance factor and text signal-to-noise ratio factor, obtain the optimal weights through Bayesian optimization and calculate the dynamic classification threshold and confidence interval.

[0012] S5. Comment sentiment classification and manual calibration: Classify comments into multiple levels based on the dynamic threshold and confidence level, and manually label and correct comments with low confidence levels;

[0013] S6. Analysis Report Generation and Collaborative Optimization: Generate analysis reports based on classification results and historical data, and provide feedback to optimize model parameters and feature weights.

[0014] Furthermore, the data preprocessing in step S1 produces the following:

[0015] Text feature vector: obtained by sequentially performing word segmentation, text cleaning, TF-IDF feature extraction, and domain sentiment dictionary weighting on the comment text data;

[0016] Time decay factor: obtained by sequentially removing outliers, standardizing time format, and generating time decay factor from time series metadata;

[0017] Entity profile feature vector: obtained by performing data verification, labeling, and splicing predetermined features on the entity profile data.

[0018] Furthermore, step S2, multi-source feature fusion and weight optimization, includes the following steps:

[0019] S21. Feature concatenation: Concatenate the text feature vector, time decay factor and entity portrait feature vector into a fused feature vector;

[0020] S22. Relevance Validation: By calculating the correlation between text sentiment and entity profile tags, invalid data with a correlation lower than the preset value is removed.

[0021] S23. Missing value handling: Remove data samples whose proportion of missing feature values exceeds a preset value;

[0022] S24. Adaptive weight initialization: Initial weight allocation is performed on the text feature vector, time decay factor and entity profile feature vector. The initial weight allocation is based on the historical feature contribution statistics of the homestay review scenario.

[0023] S25. Closed-loop controller iterative learning: Based on the feedback of classification results, construct a classification accuracy-weight relationship model, and perform iterative convergence of weights based on preset step size and iteration trigger conditions.

[0024] Furthermore, the training and inference of the logistic regression model in step S3 includes the following steps:

[0025] S31. L1 regularization feature selection: Set the regularization strength parameter and filter features based on contribution;

[0026] S32. Feature normalization processing: Based on the Min-Max normalization method, the filtered feature vectors are mapped to eliminate dimensional differences;

[0027] S33. Model training data preparation: Construct the training set from the labeled homestay review dataset, which includes negative samples;

[0028] S34. Logistic Regression Model Construction: Build a model based on the Scikit-learn library and set the number of iterations and convergence threshold;

[0029] S35. Model Inference: Input the feature vectors into the trained model and output the negative probability, classification result, and weight coefficients of each feature for each comment.
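As a minimal, non-limiting sketch of steps S31, S32 and S35, the code below shows Min-Max scaling followed by an L1-penalised logistic regression. The embodiment builds the model with Scikit-learn; this sketch instead swaps in a standard-library proximal gradient loop (soft-thresholding) purely to make visible how the L1 penalty drives uninformative coefficients to exactly zero. All function names, the learning rate, the iteration count and the regularization strength are illustrative assumptions, not values from the source.

```python
import math

def min_max_scale(rows):
    """Map each feature column to [0, 1]; constant columns become 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / sp if sp > 0 else 0.0 for v, lo, sp in zip(row, lows, spans)]
        for row in rows
    ]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(value, amount):
    """Shrink value towards zero by amount; the proximal step for an L1 penalty."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_l1_logistic(X, y, l1_strength=0.05, lr=0.5, n_iter=500):
    """Proximal gradient descent on mean log-loss plus l1_strength * ||w||_1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        # Gradient step on the smooth loss, then soft-threshold the weights.
        w = [soft_threshold(wj - lr * gj, lr * l1_strength) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the intercept is not penalised
    return w, b

def predict_negative_proba(w, b, x):
    """Negative-class probability for one scaled feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

Features whose coefficient ends at zero can be dropped, and the surviving coefficients serve as the per-feature weights reported in S35.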

[0030] Furthermore, the four-factor dynamic threshold calculation in step S4 includes the following steps:

[0031] S41. Four-factor extraction: The following factors are obtained through calculation: time-series seasonal composite decay factor, model decision consistency factor, score profile balance factor, and text signal-to-noise ratio factor, among which:

[0032] Time-series seasonal composite decay factor: generated jointly by time decay factor and seasonal adjustment weight;

[0033] Model decision consistency factor: Weights are assigned based on the negative probability and the degree of feature conflict determined by the standard deviation of feature classification contribution.

[0034] Rating profile balance factor: calculated based on the root mean square error of the overall rating and the sub-item ratings of the homestay;

[0035] Text signal-to-noise ratio factor: calculated from the effective sentiment signal strength of words that match the domain sentiment dictionary and the proportion of irrelevant information from words that do not match it.

[0036] S42. Bayesian optimization framework construction: Based on the GPyOpt library, a Gaussian process regression model is built, and the recall and precision of negative reviews are used as optimization objectives.

[0037] S43. Initialize the weights of the four factors: Preset and constrain the initial weight array of the four factors;

[0038] S44. Hyperparameter Iterative Optimization: Obtain the optimal weight combination by pre-setting the number of iterations and based on the expected improvement criteria;

[0039] S45. Multi-factor weighted fusion: The four factors are weighted and summed using optimized weights to generate a factor fusion value;

[0040] S46. Dynamic Threshold Calculation: The dynamic threshold is calculated based on a preset negative comment baseline threshold, and the threshold range is controlled. The calculation formula is as follows:

[0041]

[0042] In the formula, the terms denote, respectively, the dynamic classification threshold, the preset baseline threshold for negative reviews, and the weighted fusion value of the four factors.

[0043] S47. Confidence Interval Output: Based on the prediction results of the Gaussian process regression model, output the preset confidence interval of the threshold.
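Steps S45 and S46 can be illustrated with the short sketch below. The weighted sum follows the text directly. The mapping from the fused value to the threshold is NOT the formula of the invention, which is not reproduced in this text: the linear shift, the `scale` parameter and the range bounds are placeholder assumptions chosen only to show where the baseline threshold, the fused value and the range control fit together. The Bayesian search for the factor weights (S42 to S44) is not shown.

```python
def fuse_factors(factors, weights):
    """S45: weighted sum of the four factor values."""
    if len(factors) != len(weights):
        raise ValueError("factors and weights must have the same length")
    return sum(f * w for f, w in zip(factors, weights))

def dynamic_threshold(base, fused, lower=0.2, upper=0.8, scale=0.2):
    """S46 (placeholder mapping): shift the baseline by the fused value, then clamp.

    ASSUMPTION: the linear form `base - scale * fused` and the bounds are
    illustrative only; the source formula is not available in this text.
    """
    raw = base - scale * fused
    return min(upper, max(lower, raw))
```

Clamping to `[lower, upper]` corresponds to the statement that the threshold range is controlled; any other monotone mapping could be substituted for the placeholder line.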

[0044] Furthermore, in the model decision consistency factor, the feature classification contribution is calculated as follows:

[0045]

[0046] In the formula, the terms denote, respectively, the feature classification contribution, the feature weight coefficient output by the logistic regression model, and the normalized feature vector value;

[0047] The weighting grade is determined according to preset rules from the standard deviation of the feature classification contribution and the negative probability.
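The degree of feature conflict used by the model decision consistency factor can be sketched as follows. The sketch assumes that each feature's contribution is the product of its logistic regression coefficient and its normalized value, which is one natural reading of the term definitions but is not confirmed by the formula itself; the function names are hypothetical, and the preset grading rules are not shown because the source does not state them.

```python
import statistics

def feature_contributions(coefficients, normalized_features):
    """ASSUMPTION: contribution_i = coefficient_i * normalized feature value_i."""
    if len(coefficients) != len(normalized_features):
        raise ValueError("coefficients and features must have the same length")
    return [w * x for w, x in zip(coefficients, normalized_features)]

def feature_conflict(coefficients, normalized_features):
    """Population standard deviation of the per-feature contributions."""
    return statistics.pstdev(feature_contributions(coefficients, normalized_features))
```

A large standard deviation means some features push strongly towards the negative class while others push away from it, which is the situation the factor is meant to flag.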

[0048] Furthermore, the text signal-to-noise ratio factor is calculated as follows:

[0049]

[0050]

[0051]

[0052] In the formulas, the terms denote, respectively, the text signal-to-noise ratio factor, the effective sentiment signal strength, the proportion of irrelevant information, the TF-IDF weights of words matching the domain sentiment lexicon, the number of words not matching the domain sentiment lexicon, and the total number of words in a single comment after word segmentation and cleaning.
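The two ingredients named above, signal strength and the proportion of irrelevant information, can be computed as in this sketch. How the invention combines them into the final factor is not recoverable from this text; the ratio with a small `eps` used in `text_snr` is an assumption made only so that the example is complete. All names are hypothetical.

```python
def signal_strength(tokens, tfidf, lexicon):
    """Sum of TF-IDF weights over token occurrences that match the lexicon."""
    return sum(tfidf.get(t, 0.0) for t in tokens if t in lexicon)

def noise_ratio(tokens, lexicon):
    """Share of tokens that do not match the lexicon (0.0 for empty input)."""
    if not tokens:
        return 0.0
    unmatched = sum(1 for t in tokens if t not in lexicon)
    return unmatched / len(tokens)

def text_snr(tokens, tfidf, lexicon, eps=1e-6):
    """ASSUMPTION: signal divided by (noise + eps); the source form is unknown."""
    return signal_strength(tokens, tfidf, lexicon) / (noise_ratio(tokens, lexicon) + eps)
```

For example, a four-word comment with two lexicon hits has a noise ratio of 0.5, and its signal strength is the sum of the two matched TF-IDF weights.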

[0053] Furthermore, step S5, which involves sentiment grading and manual calibration of comments, includes the following steps:

[0054] S51. Confidence level calculation: The calculation formula is as follows:

[0055]

[0056] In the formula, the terms denote, respectively, the confidence level, the dynamic threshold, and the negative probability.

[0057] S52. Confidence Level Data Classification: Comments with confidence levels higher than the preset value are classified by the automatic multi-level classification unit into critical, general, and minor levels based on negative probability;

[0058] S53. Confidence Level Data Filtering: Export comments with confidence levels below the preset value to the manual annotation platform;

[0059] S54. Manual annotation: Annotate and reclassify comments with confidence levels below the preset value, and perform incremental training or hybrid retraining based on the preset cumulative annotation amount;

[0060] S55. Classification result synchronization: The manually corrected classification results are added to the automatic multi-level classification results.
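The routing logic of S51 to S53 can be sketched as below. The confidence formula of the invention is not reproduced in this text, so `confidence` uses a stand-in: the distance between the negative probability and the dynamic threshold, normalised to [0, 1]. That definition, the minimum confidence, the level cut-offs and the label strings are all assumptions for illustration.

```python
def confidence(p_negative, threshold):
    """ASSUMPTION: normalised distance of the probability from the threshold."""
    span = max(threshold, 1.0 - threshold)
    return abs(p_negative - threshold) / span

def route_review(p_negative, threshold, min_confidence=0.3,
                 critical_cut=0.9, general_cut=0.7):
    """Send low-confidence reviews to annotators; grade the rest automatically."""
    if confidence(p_negative, threshold) < min_confidence:
        return "manual_review"          # S53: exported to the annotation platform
    if p_negative < threshold:
        return "non_negative"
    if p_negative >= critical_cut:      # S52: levels by negative probability
        return "critical"
    if p_negative >= general_cut:
        return "general"
    return "minor"
```

Reviews returned as `manual_review` would later be merged back with the automatic results once annotated, as described in S55.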

[0061] Furthermore, step S6, which involves generating the analysis report and co-optimizing, includes the following steps:

[0062] S61. Explainability Description Generation: For each negative comment, output the top-ranking contribution features and their corresponding weights;

[0063] S62. Performance metrics statistics: Negative review recall rate, precision rate, and distribution percentage of each category level within the statistical period;

[0064] S63. Report Output: Generate and push visual reports;

[0065] S64. Model parameter update: Within a preset period, manually labeled data and newly labeled data are mixed at a preset ratio and the logistic regression model is retrained.

[0066] S65. Historical Model Records: Store the four factor values, the optimal threshold, and the classification performance index for each threshold adjustment to build a historical model library.

[0067] S66. Feedback optimization: This includes adjusting the adaptive weight fusion unit by feeding back the classification performance indicators and reusing historical parameters in the feature selection unit.
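The statistics named in S62 are standard and can be computed as in the following sketch, where label `1` marks a negative review. The function names and label encoding are illustrative assumptions.

```python
from collections import Counter

def negative_recall_precision(y_true, y_pred):
    """Recall and precision for the negative class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

def level_distribution(levels):
    """Share of each classification level within the statistical period."""
    total = len(levels)
    if total == 0:
        return {}
    return {level: count / total for level, count in Counter(levels).items()}
```

Storing these figures alongside the factor values and threshold at each adjustment gives the records described in S65.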

[0068] A classification system applying the aforementioned method for classifying the sentiment of homestay reviews based on multi-factor dynamic thresholds, wherein the system comprises:

[0069] The multi-source heterogeneous data input module is used to input collected comment text data, time-series metadata, and entity profile data into the system;

[0070] The comment text data processing module is configured to perform word segmentation, text cleaning, TF-IDF feature extraction, and domain sentiment dictionary weighting on the comment text data, and output text feature vectors.

[0071] The time series metadata processing module is configured to remove outliers, unify time formats, and generate and output time decay factors for time series metadata.

[0072] The entity profile data processing module is configured to perform data verification, tagging, and pre-defined feature splicing on entity profile data, and output entity profile feature vectors.

[0073] The multi-source feature fusion and verification module is configured to perform correlation verification and remove invalid data;

[0074] The classification model inference module is configured to use a logistic regression model to output the negative probability, classification result, and weights of each feature.

[0075] The dynamic decision threshold optimization module is configured to obtain the optimal classification threshold based on the fusion result.

[0076] The analysis report generation module is configured to output interpretable reports and key performance indicator statistics.

[0077] The collaborative optimization control module is configured to update model parameters and record historical patterns.

[0078] Compared with the prior art, the beneficial technical effects of the present invention are as follows:

[0079] This invention effectively solves the problem of missed negative reviews caused by threshold limitations in traditional methods by introducing a temporal and seasonal composite decay factor, a model decision consistency factor, a rating profile balance factor, and a text signal-to-noise ratio factor, and combining this with Bayesian optimization to achieve adaptive calculation of the four-factor dynamic threshold. Simultaneously, through deep fusion of multi-source heterogeneous data and adaptive weight optimization, it enhances the collaborative representation capability of multi-dimensional features such as text semantics, temporal dynamics, and user profiles. Furthermore, this invention constructs a closed-loop collaborative optimization mechanism based on classification feedback, continuously updating model parameters and rule weights through manual calibration and incremental training, significantly improving the recall rate and classification interpretability of negative reviews, ultimately achieving a unity of high accuracy, high robustness, and business interpretability in sentiment classification of homestay reviews.

Attached Figure Description

[0080] Figure 1 is a flowchart of the sentiment classification method and system for homestay reviews based on multi-factor dynamic thresholds, according to an embodiment of the present invention.

[0081] Figure 2 is a flowchart of the multi-source heterogeneous data processing of the homestay review sentiment classification method and system based on multi-factor dynamic thresholds, according to an embodiment of the present invention.

[0082] Figure 3 is a diagram of the four-factor dynamic threshold optimization process of the homestay review sentiment classification method and system based on multi-factor dynamic thresholds, according to an embodiment of the present invention.

[0083] Figure 4 is a flowchart of the closed-loop collaborative optimization of the homestay review sentiment classification method and system based on multi-factor dynamic thresholds, according to an embodiment of the present invention.

Detailed Implementation

[0084] To make the objectives, technical solutions, and advantages of this invention clearer, the method and system for classifying homestay reviews based on multi-factor dynamic thresholds are described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the accompanying drawings are in a highly simplified form and are not drawn to precise scale; they serve only to illustrate the embodiments conveniently and clearly. The structures, proportions, and sizes depicted in the drawings are provided only to aid those skilled in the art, are not intended to limit the conditions under which this invention may be implemented, and therefore have no substantive technical significance. Any modification of structure, change of proportion, or adjustment of size that does not affect the effects and objectives achieved by this invention still falls within the scope of the technical content disclosed herein.

[0085] Referring to Figures 1 to 4, a method and system for classifying sentiment in homestay reviews based on multi-factor dynamic thresholds is provided, wherein the classification method includes the following steps:

[0086] S1. Multi-source heterogeneous data collection and preprocessing: Collect homestay review text data, time-series metadata and entity profile data, and perform data preprocessing respectively.

[0087] Using open interfaces and database exports, three core data categories (homestay review text data, time-series metadata, and entity profile data) are automatically collected from online travel and homestay platforms. Homestay review text data, which contains explicit user evaluations and implicit requests, is stored uniformly as UTF-8 encoded text files. Time-series metadata includes the publication time and seasonal attribute of each review and is stored in CSV format with the fields "Review ID-Publication Time-Season Tag". Entity profile data includes the homestay category preference tags that users fill in during registration and the homestay types associated with their historical booking records, and is stored in JSON format. A data splitting script written in Python automatically distinguishes the three types of input data through data type identification functions and directs each to its corresponding processing module for parallel preprocessing, improving overall processing efficiency.
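A minimal sketch of such a data splitting script is given below. It assumes the three categories can be told apart by file suffix alone, which the text does not state; the category names and function names are hypothetical, and the parallel dispatch to the processing modules is not shown.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the data category described above.
SOURCE_BY_SUFFIX = {
    ".txt": "review_text",        # UTF-8 review text files
    ".csv": "time_series_meta",   # review ID, publication time, season tag
    ".json": "entity_profile",    # preference tags and booked homestay types
}

def identify_source(path):
    """Return the data category of a file, judged only by its suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in SOURCE_BY_SUFFIX:
        raise ValueError(f"unrecognised input type: {path}")
    return SOURCE_BY_SUFFIX[suffix]

def split_inputs(paths):
    """Group input paths by category so each group can go to its own module."""
    groups = {name: [] for name in SOURCE_BY_SUFFIX.values()}
    for path in paths:
        groups[identify_source(path)].append(str(path))
    return groups
```

Each resulting group could then be handed to its own worker, for instance through a process or thread pool, to obtain the parallel preprocessing described above.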

[0088] Specifically, data preprocessing includes:

[0089] This embodiment involves several adjustable parameters: vocabulary size, maximum document frequency threshold, minimum document frequency threshold, attenuation coefficient, sentiment enhancement coefficient, hygiene category weight, soundproofing category weight, and service category weight. All of these parameters are preset values, and their specific values can be determined for the business scenario through methods such as cross-validation.

[0090] Text feature vectors: These are obtained by sequentially performing word segmentation, text cleaning, TF-IDF feature extraction, and domain sentiment dictionary weighting on the review text data. Word segmentation utilizes the well-known Jieba Chinese word segmentation tool, combined with a custom dictionary for the homestay domain, to accurately segment the review text, including specific terms such as "homestay," "cleaning," "soundproofing," and "lighting," avoiding ambiguous segmentation and improving the accuracy of domain terminology recognition. Text cleaning uses regular expressions to remove special symbols, emoticons, and meaningless numeric strings, retaining core Chinese and English vocabulary to reduce noise interference.
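The text cleaning step can be sketched with a single regular expression, shown below. As a simplifying assumption, the sketch keeps only runs of Chinese characters (the basic CJK block) and ASCII letters and therefore drops every digit string, not only the meaningless ones; the function name is hypothetical. Word segmentation with Jieba and the custom dictionary would be applied to the cleaned text and is not shown.

```python
import re

# Runs of CJK Unified Ideographs (basic block) or ASCII letters are kept;
# punctuation, emoticons, digits and other symbols are discarded.
_KEEP = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z]+")

def clean_text(text):
    """Return the retained Chinese and English fragments joined by spaces."""
    return " ".join(_KEEP.findall(text))
```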

[0091] TF-IDF feature extraction sets the vocabulary size, maximum document frequency threshold, and minimum document frequency threshold, and generates a TF-IDF feature vector for each review, capturing frequent informative words and filtering out common stop words. Domain sentiment dictionary weighting uses a homestay sentiment lexicon containing multiple core negative words: the TF-IDF weight of each matched word is multiplied by the sentiment enhancement coefficient. The lexicon is built from the experience of homestay industry experts and small-sample annotation of reviews, and includes explicit weighting rules: Rule R1 (Hygiene): Reviews containing hygiene keywords such as "moldy," "cockroaches," and "odor" are weighted by the hygiene feature weight. Rule R2 (Soundproofing): Reviews containing soundproofing keywords such as "loud noise" and "sleepless nights" are weighted by the soundproofing feature weight. Rule R3 (Service): Reviews containing service keywords such as "poor attitude" and "fraudulent consumption" are weighted by the service feature weight. The weighting uses linear multipliers, and the coefficients are determined through 5-fold cross-validation to balance feature discrimination against model generalization ability.
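The TF-IDF weighting and the lexicon rules R1 to R3 can be illustrated with the sketch below. It uses a common smoothed IDF, ln((1 + n) / (1 + df)) + 1, with relative term frequency; the embodiment may use a different variant. The category weights, the enhancement coefficient, and the choice to multiply the two together are assumptions, since the text only says the multipliers are linear and tuned by cross-validation. Vocabulary size and document frequency limits are omitted for brevity.

```python
import math
from collections import Counter

# Hypothetical rule table: category -> (keywords, category weight).
RULES = {
    "hygiene": ({"moldy", "cockroaches", "odor"}, 1.2),
    "soundproofing": ({"loud noise", "sleepless nights"}, 1.1),
    "service": ({"poor attitude", "fraudulent consumption"}, 1.3),
}

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return vectors

def apply_lexicon_boost(vector, enhancement=1.5, rules=RULES):
    """Multiply the weight of each lexicon term by enhancement * category weight."""
    boosted = dict(vector)
    for keywords, category_weight in rules.values():
        for term in keywords:
            if term in boosted:
                boosted[term] *= enhancement * category_weight
    return boosted
```

Terms outside the lexicon keep their plain TF-IDF weight, so the boost only sharpens the negative signals the rules are meant to catch.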

[0092] Time decay factor: Obtained by sequentially performing outlier removal, time format standardization, and time decay factor generation on the time-series metadata. Outlier removal uses box plots to identify and remove abnormal publication times, including times in the future and stale records older than a preset period relative to the current time, ensuring the timeliness and reliability of the time-series features. Time format standardization converts publication times from different formats to the "YYYY-MM-DD HH:MM:SS" format, ensuring consistency in the time decay factor calculation. Time decay factor generation uses an exponential decay formula to calculate the base decay value, as shown in Formula 1 below:

[0093] $D_{\text{base}} = e^{-\lambda \Delta t}$ (1)

[0094] In the formula, $D_{\text{base}}$ is the base decay value, $\Delta t$ is the number of days since the comment was posted, and $\lambda$ is the preset attenuation coefficient.

[0095] Multiplying the base decay value by the seasonal adjustment weight yields the final time decay factor. In this embodiment, the seasonal adjustment weight takes separate preset values for peak season, shoulder season, and off-season, so that time decay is adapted in coordination with seasonal attributes.
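The time decay computation can be sketched as follows, assuming the standard exponential decay form exp(-coefficient × days) for the base value. The seasonal weights, the season labels, the default coefficient and the function names are hypothetical values chosen for illustration.

```python
import math
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Hypothetical seasonal adjustment weights; the embodiment uses preset values.
SEASON_WEIGHTS = {"peak": 1.2, "shoulder": 1.0, "off": 0.8}

def days_since(posted, now):
    """Days elapsed between two timestamps in the standardised format."""
    delta = datetime.strptime(now, TIME_FORMAT) - datetime.strptime(posted, TIME_FORMAT)
    return delta.total_seconds() / 86400.0

def time_decay_factor(days, season, decay_coeff=0.01):
    """Base exponential decay multiplied by the seasonal adjustment weight."""
    if days < 0:
        raise ValueError("publication time lies in the future")
    base = math.exp(-decay_coeff * days)
    return base * SEASON_WEIGHTS[season]
```

With these placeholder settings a review posted 100 days ago in peak season receives 1.2 × exp(-1), so recent reviews and peak-season reviews carry more weight.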

[0096] Entity profile feature vector: Obtained by performing data verification, tagging, and concatenation of predetermined features on the entity profile data. Data verification uses regular expressions to check that the user ID format is valid and that each preference tag belongs to a preset tag set; unmatched tags are marked as "general". The tagging process converts user preference tags into one-hot encoded vectors. In this embodiment, an entity profile feature vector of a preset dimension is generated by associating the features of the homestay types the user has historically booked with the tagged vectors.
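A minimal sketch of the verification, tagging and concatenation steps follows. The user ID pattern, the preset tag set and the homestay type list are invented for the example, as the source does not list them; only the fallback to "general" and the one-hot encoding come from the text.

```python
import re

# Hypothetical presets; the source does not enumerate them.
PRESET_TAGS = ["quiet", "family", "scenic", "budget", "general"]
HOMESTAY_TYPES = ["apartment", "villa", "farmhouse"]
USER_ID_PATTERN = re.compile(r"U\d{6}")

def one_hot(value, vocabulary):
    """One-hot encode value against an ordered vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

def entity_profile_vector(user_id, preference_tag, booked_type):
    """Validate the record, then concatenate tag and booked-type encodings."""
    if not USER_ID_PATTERN.fullmatch(user_id):
        raise ValueError(f"invalid user ID format: {user_id}")
    tag = preference_tag if preference_tag in PRESET_TAGS else "general"
    return one_hot(tag, PRESET_TAGS) + one_hot(booked_type, HOMESTAY_TYPES)
```

The resulting vector has a fixed length (here 5 + 3), matching the idea of a preset-dimensional entity profile feature vector.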

[0097] S2. Multi-source feature fusion and weight optimization: The preprocessed data is concatenated for features, and the weights of various features are iteratively optimized based on classification feedback. This includes the following steps:

[0098] S21. Feature Concatenation: In this embodiment, the text feature vector, time decay factor, and entity profile feature vector are concatenated into a fused feature vector of a preset dimension. This retains the complete multi-source information and ensures that the three types of features (text semantics, temporal dynamics, and user profile) work together within a unified vector space.

[0099] S22. Relevance Verification: By calculating the correlation between text sentiment and entity profile tags, invalid data with a correlation lower than a preset value is removed. In this embodiment, data with a correlation lower than 0.1 is removed, preventing noisy samples whose profile tags and comment content are seriously inconsistent from interfering with model training.

[0100] S23. Missing value handling: Remove data samples whose proportion of missing feature values exceeds a preset proportion, to ensure the integrity of the fused feature vector and avoid feature space distortion caused by missing data.

[0101] S24. Adaptive Weight Initialization: Initial weights are assigned to the text feature vector, time decay factor, and entity profile feature vector based on historical feature contribution statistics for the homestay review scenario. In this embodiment, the initial weights are preset values; because text sentiment features typically contribute significantly to the classification results, they are given a larger weight.

[0102] S25. Closed-Loop Controller Iterative Learning: Based on classification result feedback, a classification accuracy-weight relationship model is constructed, and the weights are iterated to convergence using a preset step size and an iteration trigger condition. In this embodiment, the relationship model is built from classification feedback within a preset time window, and the weight parameters are iterated on a preset period. The iteration trigger condition is that the negative review recall rate declines continuously by more than a preset threshold. The negative review recall rate is calculated on test set data that is updated daily; the test set sample size and the proportion of negative samples are both determined by the actual business data distribution, ensuring objective and consistent recall statistics. The step size is set to a preset value so that the weights converge to the optimum within a finite number of iterations without parameter oscillation. When the trigger condition is met, the weight update process recalculates the marginal contribution of each feature type to classification accuracy and adjusts the weight allocation by the preset step size until it converges to a new balance point. In this way the system perceives fluctuations in classification performance and adaptively adjusts the feature weights, so that text sentiment features, temporal dynamic features, and entity profile features continue to work well together under different data distributions.
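The trigger condition and one weight update of the closed-loop controller can be sketched as below. The sketch assumes that "continuous decline" means recall fell on each of the last few evaluations, and that the marginal contribution of each feature type is supplied from outside as a signed number; how the embodiment estimates it is not shown. Function names, the window length, the drop threshold and the step size are hypothetical.

```python
def recall_drop_triggered(recall_history, drop_threshold=0.02, consecutive=3):
    """True when recall fell on each of the last `consecutive` evaluations
    and the total decline over that window exceeds `drop_threshold`."""
    if len(recall_history) < consecutive + 1:
        return False
    window = recall_history[-(consecutive + 1):]
    falling = all(later < earlier for earlier, later in zip(window, window[1:]))
    return falling and (window[0] - window[-1]) > drop_threshold

def update_weights(weights, marginal_gain, step=0.05):
    """Shift each weight by step * marginal gain, clip at zero, renormalise to sum 1."""
    raw = {
        name: max(0.0, w + step * marginal_gain.get(name, 0.0))
        for name, w in weights.items()
    }
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all weights collapsed to zero")
    return {name: value / total for name, value in raw.items()}
```

In use, `update_weights` would be called repeatedly after a trigger until successive weight vectors stop changing appreciably, which corresponds to the convergence described above.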

[0103] S3. Training and Inference of the Logistic Regression Model: The logistic regression model is trained using a labeled dataset, and the negative probability and feature weight coefficients of the reviews are output. Specifically, the training and inference of the logistic regression model includes the following steps:

[0104] S31. L1 Regularized Feature Selection: A regularization strength parameter is set and features are filtered by contribution. An L1-regularized logistic regression model is built with the Scikit-learn library, the regularization strength parameter is set to a preset value, and the features that make up a predetermined proportion of the contribution are retained, giving an interpretable feature subset of controlled dimension for subsequent model training. The L1 norm penalty term forces the model coefficients to be sparse and shrinks the coefficients of redundant features to zero, so feature selection is embedded in model fitting.
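The embedded selection described in S31 can be sketched with Scikit-learn as follows. This is a minimal illustration, not the embodiment's implementation: the synthetic data, the value C=0.05, and the 1e-5 coefficient cut-off are assumptions standing in for the unspecified preset values, and the sketch keeps every feature with a non-zero coefficient rather than a predetermined proportion of the contribution.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the fused feature matrix: only the first two
# columns carry signal, the remaining eight are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + 0.8 * X[:, 1] + 0.3 * rng.normal(size=400) > 0).astype(int)

# C is the inverse regularization strength; 0.05 is a placeholder for the
# unspecified preset value.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
selector = SelectFromModel(l1_model, threshold=1e-5).fit(X, y)

mask = selector.get_support()        # True where |coefficient| exceeds the cut-off
X_selected = selector.transform(X)   # reduced feature matrix passed on to S32
```

A smaller C strengthens the penalty and removes more features, which is how the dimension of the retained subset could be controlled.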

[0105] S32. Feature Normalization Processing: Using Min-Max normalization, the filtered feature vectors are mapped to a common range to eliminate dimensional differences. In this embodiment, the filtered feature vectors are mapped to the [0,1] interval, which eliminates the dimensional differences between text features, time decay factors, and entity profile features, prevents features with large numerical ranges from dominating model training, and ensures that all feature types enter the probability calculation on a unified scale.
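As a small worked illustration of the Min-Max mapping in S32, the sketch below rescales two columns with very different ranges onto [0,1]. The column names and values are hypothetical, and mapping a constant column to zeros is an assumption made only to avoid division by zero.

```python
def min_max_scale(column):
    """Map a list of numbers onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

tfidf_weight = [0.12, 0.40, 0.26]   # hypothetical text feature column
review_count = [15, 480, 120]       # hypothetical entity profile column

scaled_tfidf = min_max_scale(tfidf_weight)
scaled_count = min_max_scale(review_count)
```

After scaling, both columns span the same interval, so the much larger raw range of the review count no longer dominates.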

[0106] S33. Model Training Data Preparation: A training set is constructed from a labeled homestay review dataset that contains negative samples. In this embodiment, the number of labeled samples and the proportion of negative samples in this dataset are determined by the actual business data distribution. This gives a training basis with a reasonable class distribution and reliable label quality, so that the model can learn adequately from the minority class of negative samples. Negative samples are obtained by combining manual annotation with active learning sampling, which prioritizes difficult samples with ambiguous boundaries and subtle emotional expression in order to improve the model's discrimination of complex negative sentiment.

[0107] S34. Logistic Regression Model Construction: A model is built based on the Scikit-learn library, and the number of iterations and convergence threshold are set. In this embodiment, the number of iterations and convergence threshold are set to preset values. The construction of the logistic regression model provides a probabilistic output basis for subsequent classification decisions. The model is trained using the training data prepared in S33.

[0108] S35. Model Inference: The feature vectors are input into the trained model, which outputs the negative probability, the classification result, and the weight coefficient of each feature for each comment. These outputs provide the basic data for the subsequent four-factor extraction and dynamic threshold calculation.
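Steps S32 to S35 can be sketched end to end with Scikit-learn. The toy data, the label convention (1 = negative review), and the values of max_iter and tol are assumptions standing in for the labeled dataset and the preset iteration count and convergence threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 3))                            # toy selected features
y_train = (X_train[:, 0] - X_train[:, 2] > 0.5).astype(int)    # 1 = negative review

scaler = MinMaxScaler().fit(X_train)                           # S32
# max_iter and tol are placeholders for the preset values in S34.
model = LogisticRegression(max_iter=200, tol=1e-4)
model.fit(scaler.transform(X_train), y_train)

X_new = rng.normal(size=(5, 3))
X_new_scaled = scaler.transform(X_new)
negative_prob = model.predict_proba(X_new_scaled)[:, 1]        # S35: P(negative)
predicted_label = model.predict(X_new_scaled)                  # S35: classification result
coefficients = model.coef_[0]                                  # S35: feature weight coefficients
```

Note that `predict` applies a fixed 0.5 cut-off; in the method described here the final decision would instead compare `negative_prob` with the dynamic threshold from S4.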

[0109] S4. Four-Factor Dynamic Threshold Calculation: Extract the temporal-seasonal composite decay factor, model decision consistency factor, score profile balance factor, and text signal-to-noise ratio factor. Obtain the optimal weights through Bayesian optimization and calculate the dynamic classification threshold and confidence interval. Specifically, the four-factor dynamic threshold calculation includes the following steps:

[0110] S41. Four-factor extraction: The following factors are obtained through calculation: time-series seasonal composite decay factor, model decision consistency factor, score profile balance factor, and text signal-to-noise ratio factor, among which:

[0111] Time-series seasonal composite decay factor: Generated jointly from the time decay factor and the seasonal adjustment weight. The time decay factor is obtained as in step S1, so this factor combines the time decay effect with the influence of seasonal attributes.

[0112] Model decision consistency factor: Weights are assigned based on the negative probability and the degree of feature conflict, which is determined by the standard deviation of the feature classification contributions. In this embodiment, preset weight levels are assigned according to the standard deviation of the feature classification contributions and the negative probability. Weights are graded from low to high according to the degree of feature conflict; specifically, the highest weight is assigned when the degree of feature conflict is low and the negative probability is high. The degree of feature conflict is measured by the standard deviation, that is, the dispersion of the classification contributions of the text sentiment features, temporal features, and entity profile features. The calculation formula is shown in Equation 2 below:

[0113] $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(c_i - \bar{c}\right)^2}$ (2)

[0114] In the formula: $c_i$ is the classification contribution of the $i$-th class of features, where $i$ = 1, 2, and 3 correspond to text sentiment features, temporal features, and entity profile features, respectively; $\bar{c}$ is the average of the three feature classification contributions; and $n = 3$. The formula for calculating the feature classification contribution is shown in Equation 3 below:

[0115] $c_i = \beta_i \, x_i$ (3)

[0116] In the formula: $c_i$ is the feature classification contribution; $\beta_i$ is the feature weight coefficient output by the logistic regression model; and $x_i$ is the normalized feature vector value.

[0117] The weight grade is determined from the standard deviation of the feature classification contributions and the negative probability according to preset rules.
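The model decision consistency factor can be sketched as below. The sketch assumes that each contribution is the product of a weight coefficient and a normalized feature value, and that the standard deviation is the population form over the three feature classes; the three-level grading and its cut-offs (`sigma_max`, `prob_min`) are hypothetical placeholders for the preset rules.

```python
from statistics import mean, pstdev

def feature_conflict(coefs, values):
    """Return (population std, mean) of the per-class contributions."""
    contributions = [b * x for b, x in zip(coefs, values)]
    return pstdev(contributions), mean(contributions)

def consistency_weight(sigma, negative_prob, sigma_max=0.1, prob_min=0.8):
    """Hypothetical three-level grading; the cut-offs are placeholders."""
    if sigma <= sigma_max and negative_prob >= prob_min:
        return 1.0   # low conflict and high negative probability: highest weight
    if sigma <= sigma_max or negative_prob >= prob_min:
        return 0.6
    return 0.3

# Order: text sentiment, temporal, entity profile (illustrative numbers).
sigma, avg = feature_conflict(coefs=[1.2, 0.5, 0.8], values=[0.5, 0.9, 0.6])
weight = consistency_weight(sigma, negative_prob=0.9)
```

Here the three contributions (0.60, 0.45, 0.48) are close together, so the conflict is low and the comment receives the highest grade.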

[0118] Rating profile balance factor: Calculated from the root mean square error (RMSE) between the overall rating and the sub-item ratings of the homestay. The overall and sub-item ratings are obtained, their RMSE is calculated, and the factor is defined as 1 / (1 + RMSE). The sub-item rating dimensions include hygiene, facility completeness, service attitude, soundproofing, location, and cost-effectiveness. In this embodiment, the factor value is limited to a preset interval.
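A short sketch of the 1 / (1 + RMSE) definition follows. It assumes the RMSE is taken over the differences between each sub-item rating and the overall rating; the clipping interval [0, 1] is a placeholder for the preset interval.

```python
import math

def rating_balance_factor(overall, sub_ratings, lower=0.0, upper=1.0):
    """1 / (1 + RMSE), clipped to [lower, upper] (placeholder interval)."""
    rmse = math.sqrt(sum((s - overall) ** 2 for s in sub_ratings) / len(sub_ratings))
    return min(max(1.0 / (1.0 + rmse), lower), upper)

# Six sub-items: hygiene, facilities, service, soundproofing, location, value.
consistent = rating_balance_factor(4.5, [4.5, 4.5, 4.5, 4.5, 4.5, 4.5])
mixed = rating_balance_factor(4.0, [5.0, 3.0, 5.0, 3.0, 5.0, 3.0])
```

A homestay whose sub-item ratings all equal the overall rating gets a factor of 1, while sub-items that each deviate by one point give an RMSE of 1 and a factor of 0.5.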

[0119] Text signal-to-noise ratio factor: Calculated from the effective sentiment signal strength of the words that match the domain sentiment dictionary and the proportion of irrelevant information contributed by the words that do not match it, as shown in Equations 4 to 6 below:

[0120] (4)

[0121] (5)

[0122] (6)

[0123] In the formulas, the quantities are: the text signal-to-noise ratio factor; the effective sentiment signal strength; the proportion of irrelevant information; the TF-IDF weights of the words that match the domain sentiment dictionary; the number of words that do not match the domain sentiment dictionary; and the total number of words in a single comment after word segmentation and cleaning.
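The quantities listed above can be illustrated with a small sketch. The signal is taken here as the sum of the TF-IDF weights of matched words and the noise as the share of unmatched words; the way the two are combined into one factor, signal / (signal + noise), is purely an assumption for illustration and may differ from the formula used in the embodiment. The lexicon contents are hypothetical.

```python
def text_snr(tokens, lexicon_tfidf):
    """Return (snr, signal, noise) for one segmented, cleaned comment."""
    if not tokens:
        return 0.0, 0.0, 0.0
    matched = [t for t in tokens if t in lexicon_tfidf]
    signal = sum(lexicon_tfidf[t] for t in matched)       # effective sentiment signal
    noise = (len(tokens) - len(matched)) / len(tokens)    # share of unmatched words
    # Assumed combination of signal and noise, for illustration only.
    snr = signal / (signal + noise) if signal + noise > 0 else 0.0
    return snr, signal, noise

lexicon = {"dirty": 0.6, "noisy": 0.4}                    # hypothetical lexicon weights
snr, signal, noise = text_snr(["room", "dirty", "and", "noisy"], lexicon)
```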

[0124] S42. Bayesian Optimization Framework Construction: Based on the GPyOpt library, a Gaussian process regression model is constructed, and the F1 score (harmonic mean of recall and precision) of negative review classification is used as the optimization objective, thereby providing an efficient Bayesian optimization framework for four-factor weight optimization.

[0125] S43. Four-Factor Weight Initialization: The initial weight array of the four factors is preset and constrained. In this embodiment, the initial weight array is set to a preset value, and the optimization parameter space restricts each factor weight to [0,1] with the weights summing to 1. The sum-to-one constraint is implemented through the equality constraint function of the GPyOpt library, with the constraint condition w1+w2+w3+w4=1, where w1 to w4 are the weights of the four factors. Weight initialization gives the four-factor weight optimization a reasonable starting point, reduces the risk of the optimization becoming stuck in a local optimum, and ensures that every factor has some influence in the initial state, so that its relative importance can be adjusted from data feedback in subsequent iterations.

[0126] S44. Hyperparameter Iterative Optimization: The optimal weight combination is obtained by running a preset number of iterations under the expected improvement criterion. In this embodiment, the number of iterations is set to a preset value so that the optimization process fully converges. The expected improvement criterion balances "exploration" and "exploitation" of the four-factor weight combinations: in each iteration it selects for evaluation the parameter point most likely to improve the F1 value, gradually approaching the global optimum. The calculation formula is shown in Equation 7 below:

[0127] $EI(w) = \mathbb{E}\left[\max\left(0,\ f(w) - f(w^{+})\right)\right]$ (7)

[0128] In the formula: $w$ is a four-factor weight combination; $w^{+}$ is the current optimal weight combination; and $f(w)$ is the F1 value corresponding to the weight combination $w$.
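When the surrogate model is a Gaussian process, the expected improvement used in S44 has a well-known closed form for a maximization objective. The sketch below evaluates it for a single candidate, given the posterior mean and standard deviation of the predicted F1 value; the function name and the example numbers are illustrative and do not come from GPyOpt.

```python
import math

def expected_improvement(mu, sigma, best_f1):
    """Closed-form EI for maximization under a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(0.0, mu - best_f1)
    z = (mu - best_f1) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best_f1) * cdf + sigma * pdf

# Two hypothetical candidates whose predicted F1 equals the incumbent's:
# the more uncertain one has the larger expected improvement (exploration).
ei_certain = expected_improvement(mu=0.80, sigma=0.01, best_f1=0.80)
ei_uncertain = expected_improvement(mu=0.80, sigma=0.05, best_f1=0.80)
```

The first term rewards candidates predicted to beat the current best (exploitation), and the second rewards candidates with high posterior uncertainty (exploration).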

[0129] S45. Multi-factor weighted fusion: The four factors are weighted and summed using optimized weights to generate a factor fusion value. The multi-factor weighted fusion comprehensively reflects the integrated characteristics of comments across four dimensions: time dimension, model decision stability, rating consistency, and text quality, providing an adaptive adjustment basis for dynamic threshold calculation.
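The sum-to-one constraint from S43 and the weighted fusion from S45 can be sketched without any optimization library. Rescaling a candidate onto the constraint, as done here, is an assumed stand-in and not the GPyOpt constraint mechanism; the factor values and weights are illustrative.

```python
def normalize_weights(raw):
    """Clip each weight to [0, 1] and rescale so the weights sum to 1."""
    clipped = [min(max(w, 0.0), 1.0) for w in raw]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(raw)] * len(raw)   # fall back to equal weights
    return [w / total for w in clipped]

def fuse(factors, weights):
    """Weighted sum of the four factor values."""
    return sum(w * f for w, f in zip(weights, factors))

weights = normalize_weights([0.4, 0.4, 0.1, 0.1])
# Order: time-seasonal decay, decision consistency, rating balance, text SNR.
fusion_value = fuse([0.9, 1.0, 0.5, 0.7], weights)
```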

[0130] S46. Dynamic Threshold Calculation: A dynamic threshold is calculated from a preset negative comment baseline threshold, and the threshold range is controlled. In this embodiment, the negative comment baseline threshold is a preset value that can be adjusted according to the business's requirements for recall and precision. The calculation formula is shown in Equation 8 below:

[0131] (8)

[0132] In the formula, the quantities are the preset negative comment baseline threshold and the four-factor weighted fusion value.

[0133] S47. Confidence Interval Output: Based on the prediction results of the Gaussian process regression model, the preset confidence interval of the threshold is output. In this embodiment, a 95% confidence interval is set for the output threshold to provide a probabilistic representation for the uncertainty quantification of the dynamic threshold. The width of the confidence interval reflects the model's confidence in the threshold estimate; the narrower the interval, the better the convergence of the optimization process and the more reliable the threshold estimate.

[0134] S5. Comment Sentiment Classification and Manual Calibration: Comments are classified into multiple levels based on the dynamic threshold and the confidence level, and low-confidence comments are manually labeled and corrected. Specifically, this includes the following steps:

[0135] S51. Confidence level calculation: The calculation formula is shown in Equation 9 below:

[0136] (9)

[0137] In the formula, the quantities are the confidence level, the dynamic threshold, and the negative probability.

[0138] S52. Confidence Level Data Classification: Comments with a confidence level higher than a preset value are passed to the automatic multi-level classification unit and categorized as critical, general, or minor based on negative probability. In this embodiment, a confidence level threshold is set, and comments exceeding it enter the automatic multi-level classification unit, where each level has a preset negative probability threshold. A comment is classified as critical only if it also contains preset security risk keywords or preset major service breach keywords.

[0139] S53. Confidence Level Data Filtering: In this embodiment, comments with a confidence level below a preset threshold are exported to a manual annotation platform for manual calibration. A hybrid mechanism of "automatic classification + manual calibration" is adopted to balance large-scale processing efficiency and classification accuracy.
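The routing in S52 and S53 can be sketched as a single function. All cut-offs and the keyword list are hypothetical placeholders for the preset values, the confidence level is taken as an input rather than computed, and demoting a high-probability comment without a critical keyword to the general level is an assumption.

```python
CRITICAL_KEYWORDS = {"fire", "theft", "injury"}   # hypothetical keyword list

def route_comment(text, negative_prob, confidence,
                  conf_min=0.7, critical_min=0.9, general_min=0.7):
    """Return 'manual' or a severity level; all cut-offs are placeholders."""
    if confidence < conf_min:
        return "manual"                     # S53: send to the annotation platform
    has_keyword = any(k in text.lower() for k in CRITICAL_KEYWORDS)
    if negative_prob >= critical_min and has_keyword:
        return "critical"                   # needs both high probability and a keyword
    if negative_prob >= general_min:
        return "general"
    return "minor"
```

For example, `route_comment("There was a fire hazard in the kitchen", 0.95, 0.9)` is routed to the critical level, while the same probability without a keyword falls to the general level.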

[0140] S54. Manual Annotation: Comments with confidence levels below a preset value are annotated and reclassified, and incremental training or hybrid retraining is performed based on a preset cumulative annotation amount. In this embodiment, incremental training of the model is triggered when the cumulative annotation amount reaches a preset number. If this threshold is not reached, the annotated data will be included in periodic hybrid retraining at a preset interval. In hybrid training, manually annotated data and newly annotated data are weighted according to a preset ratio. This ensures that the model continuously absorbs manual calibration feedback and gradually improves its ability to identify low-confidence samples.

[0141] In this embodiment, incremental training of the model proceeds as follows: the accumulated manually labeled data is divided into an incremental training set and an incremental validation set at a preset ratio, and training is terminated when the F1 value on the incremental validation set reaches a preset threshold. If the condition is not met, more labeled data is collected, and retraining is performed each time a preset number of new samples has been added.

[0142] S55. Classification Result Synchronization: The manually corrected classification results are added to the automatic multi-level classification results to form a complete classification result dataset. In this embodiment, the manually corrected classification results are synchronized to the automatic multi-level classification result database in real time. Seamless integration of classification results is achieved through a unified data interface, ensuring that subsequent data analysis and business applications obtain complete and consistent classification data.

[0143] S6. Analysis Report Generation and Collaborative Optimization: Generate an analysis report based on the classification results and historical data, and provide feedback to optimize model parameters and feature weights. This includes the following steps:

[0144] S61. Explainability Description Generation: For each negative comment, output the top-ranked contribution features and their corresponding weights. In this embodiment, the top 3 contribution features and their corresponding weights are output, thereby providing an intuitive and actionable basis for attributing negative comments.
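One way to produce the top-3 explanation in S61 is to rank features by the magnitude of their contribution. The sketch assumes the contribution is the absolute value of coefficient times normalized feature value; the feature names and numbers are hypothetical.

```python
def top_contributions(feature_names, coefs, values, k=3):
    """Rank features by |coefficient * value| and return the top k with weights."""
    scored = [(name, coef, abs(coef * val))
              for name, coef, val in zip(feature_names, coefs, values)]
    scored.sort(key=lambda item: item[2], reverse=True)
    return [(name, coef) for name, coef, _ in scored[:k]]

names = ["dirty", "noisy", "days_since_review", "host_rating", "price_tier"]
top3 = top_contributions(names, [1.5, 0.9, -0.2, -1.1, 0.1], [0.8, 0.5, 0.9, 0.7, 0.3])
```

In this example the report for the comment would list "dirty", "host_rating", and "noisy" together with their weight coefficients.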

[0145] S62. Performance Indicator Statistics: The recall rate and precision of negative reviews and the percentage distribution of each category are computed daily, weekly, and monthly. The recall rate is calculated on the daily updated test set data. In this embodiment, the test set sample size is determined by the actual business data distribution to ensure the objectivity and consistency of the recall rate statistics.
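The negative-class indicators in S62 follow the standard definitions of recall, precision, and F1. The sketch below computes them for one reporting period; the label convention (1 = negative review) and the sample labels are assumptions.

```python
def negative_class_metrics(y_true, y_pred, negative_label=1):
    """Recall, precision, and F1 for the negative-review class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == negative_label and p == negative_label for t, p in pairs)
    fn = sum(t == negative_label and p != negative_label for t, p in pairs)
    fp = sum(t != negative_label and p == negative_label for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1

recall, precision, f1 = negative_class_metrics(
    [1, 1, 1, 0, 0, 0, 0, 1],   # true labels for one day
    [1, 1, 0, 0, 1, 0, 0, 0],   # predicted labels
)
```

With two of four negative reviews found and one false alarm, recall is 0.5, precision is 2/3, and F1 is 4/7.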

[0146] S63. Report Output: Generate and push visual reports, supporting PDF and Excel formats, and push them to the homestay merchant management backend and platform operation backend.

[0147] S64. Model Parameter Update: Within a preset period, manually labeled data and newly labeled data are mixed at a preset ratio and the logistic regression model is retrained. In this embodiment, manually labeled data and newly labeled data are mixed at a preset ratio according to a preset period, and the logistic regression model is retrained, providing a periodic optimization mechanism for model parameter updates. Through regular mixed retraining, the model can continuously absorb the distribution change information in the newly labeled data, gradually adapt to the evolution of comment language style and emerging negative expression patterns, and avoid the model performance from decaying over time. The mixing ratio setting retains the high-precision supervision signal of manual calibration while making full use of the generalization ability of large-scale new data, achieving a balance between stability and adaptability.

[0148] S65. Historical Model Records: Store the four factor values, optimal threshold, and classification performance indicators for each threshold adjustment to build a historical model library, thereby providing a data traceability basis for the evolution of threshold optimization strategies.

[0149] S66. Feedback Optimization: This includes adjusting the adaptive weight fusion unit by feeding back classification performance metrics and reusing historical parameters to the feature selection unit. Feedback optimization adjusts the feature weights of the adaptive weight fusion unit by feeding back classification performance metrics to the closed-loop controller. In this embodiment, a two-way collaborative mechanism of "rule-weighted guidance for model training and model output optimization of rule parameters" is used. The domain sentiment dictionary weighted rules improve the model's sensitivity to negative keywords, and the model iteration results feed back into the rule weight adjustment, achieving dynamic adaptation between rules and the model. The core connotation of collaborative optimization is reflected in three points: first, closed-loop feedback, with rule weighting and dynamic thresholds having a two-way influence; second, unified objectives, with improving the recall rate and interpretability of negative comments as the core objective; and third, a non-serial process, with the weighting strategy and threshold calculation being interdependent.

[0150] In this embodiment, the closed-loop feedback optimization logic is as follows: when the recall rate or precision of negative comments falls below a preset threshold, an urgent iteration of the feature weights is triggered and the step size is adjusted to a preset value. If the F1 value is still below the preset threshold after a preset number of iterations, the domain sentiment dictionary is rebuilt and the labeled dataset is supplemented with a preset number of new samples.

[0151] Historical parameters are reused, and initial parameters for L1 regularization are provided based on the feature filtering results of similar scenarios in the historical pattern library. In this embodiment, the dimensions for judging similar scenarios are set as seasonal attributes, homestay type, and comment data distribution (the proportion of negative samples is within a preset deviation range). The matching priority of similar scenarios is set as follows: homestay type + seasonal attribute (highest) > homestay type + negative sample proportion > seasonal attribute + negative sample proportion. When matching two or more dimensions, the initial C value of L1 regularization for the corresponding scenario is reused. When the matching degree is 100%, the historical optimal threshold is directly reused.
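The matching priority for similar scenarios can be sketched as a ranked lookup. The record fields, the deviation range `max_ratio_gap`, and the reading of a 100% match as agreement on all three dimensions are assumptions made for illustration.

```python
def match_scenario(current, history, max_ratio_gap=0.05):
    """Pick the best historical scenario by the stated priority order.

    Returns (record, reuse_threshold); record is None when fewer than two
    dimensions match. max_ratio_gap is a placeholder deviation range.
    """
    best, best_rank = None, None
    for record in history:
        same_type = record["homestay_type"] == current["homestay_type"]
        same_season = record["season"] == current["season"]
        close_ratio = abs(record["negative_ratio"] - current["negative_ratio"]) <= max_ratio_gap
        if same_type and same_season and close_ratio:
            rank = 0            # full match: the historical threshold is reused too
        elif same_type and same_season:
            rank = 1            # highest two-dimension priority
        elif same_type and close_ratio:
            rank = 2
        elif same_season and close_ratio:
            rank = 3
        else:
            continue
        if best_rank is None or rank < best_rank:
            best, best_rank = record, rank
    return best, best_rank == 0

history = [
    {"homestay_type": "villa", "season": "summer", "negative_ratio": 0.20, "C": 0.5, "threshold": 0.42},
    {"homestay_type": "villa", "season": "winter", "negative_ratio": 0.08, "C": 0.3, "threshold": 0.38},
]
current = {"homestay_type": "villa", "season": "winter", "negative_ratio": 0.21}
record, reuse_threshold = match_scenario(current, history)
```

Here the winter villa record wins on homestay type plus seasonal attribute, so its initial C value would be reused, but its threshold would not, because the negative sample proportion differs too much.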

[0152] A classification system, applied to the aforementioned method and system for classifying homestay reviews based on multi-factor dynamic thresholds, the system comprising:

[0153] The multi-source heterogeneous data input module is used to input collected comment text data, time-series metadata, and entity profile data into the system, providing raw data support for subsequent feature extraction and sentiment analysis. This module supports batch import and real-time streaming of various data formats.

[0154] Comment text data processing module: configured to perform word segmentation, text cleaning, TF-IDF feature extraction, and domain sentiment dictionary weighting on comment text data, and output text feature vectors.

[0155] The time series metadata processing module is configured to remove outliers, standardize time formats, and generate and output time decay factors for time series metadata.

[0156] Entity profile data processing module: configured to perform data verification, tagging, and predefined feature splicing on entity profile data, and output entity profile feature vector.

[0157] The multi-source heterogeneous data input module and the three data processing modules achieve efficient collaboration through a unified data bus architecture.

[0158] The multi-source feature fusion and verification module is configured to perform multi-source feature fusion, correlation verification, and removal of invalid data. This module first performs correlation verification on the output features from the three data processing modules. By calculating the information values between features, it identifies and removes invalid feature combinations caused by abnormal data acquisition or preprocessing errors.

[0159] The classification model inference module is configured to use a logistic regression model to output the negative probability, classification result, and weights of each feature. This is used to predict the probability of negative reviews and make classification decisions based on the fused multi-source features. The module incorporates a Bayesian-optimized logistic regression model and supports both real-time batch inference and rapid response for single reviews. During inference, the linear weight coefficients of each feature are output synchronously, providing foundational data for subsequent interpretability analysis.

[0160] The dynamic decision threshold optimization module is configured to obtain the optimal classification threshold based on the fusion results. It achieves adaptive threshold adjustment through multi-factor weighted fusion and Bayesian optimization, so that the classification boundary evolves with the data distribution. This module receives the negative probability and the four factor values output by the classification model inference module, and sequentially performs factor calculation, weight optimization, fusion summation, and threshold generation. Finally, it outputs a dynamic threshold together with its confidence interval for subsequent hierarchical decision-making.

[0161] The analysis report generation module is configured to output interpretable reports and key performance indicator statistics, and present the analysis results through a visual interface. This module integrates three main functional units: interpretable report generation, performance indicator statistics, and visual report output. It supports the generation of analysis reports at multiple time granularities (daily, weekly, monthly) and pushes them to the homestay merchant management backend and platform operation backend in real time via API interface, providing data-driven support for business decisions.

[0162] The collaborative optimization control module is configured to update model parameters and record historical patterns, achieving a two-way feedback loop between rules and the model. This module dynamically triggers optimization actions such as feature weight adjustment, model retraining, and reuse of historical parameters by monitoring the classification performance indicators output by the analysis report generation module, forming a complete closed-loop control chain of "monitoring-analysis-optimization-verification".

[0163] Through the synergistic effect of the aforementioned modules, this system achieves a complete sentiment classification process, from multi-source heterogeneous data input to dynamic threshold optimization, and from automatic classification to manual calibration feedback. A closed-loop feedback mechanism enables dynamic adaptation between rules and the model, ultimately providing the homestay platform with high-precision, interpretable, and continuously evolving negative review recognition capabilities, effectively supporting improvements in merchant service quality and optimization of platform operational decisions.

[0164] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0165] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this invention patent should be determined by the appended claims.

Claims

1. A method and system for classifying sentiment in homestay reviews based on multi-factor dynamic thresholds, characterized by: The classification method includes the following steps: S1. Multi-source heterogeneous data collection and preprocessing: Collect homestay review text data, time-series metadata and entity profile data, and perform data preprocessing respectively; S2, Multi-source feature fusion and weight optimization: The preprocessed data is concatenated with features, and the weights of various features are iteratively optimized based on classification feedback; S3. Training and Inference of Logistic Regression Model: Train the logistic regression model using labeled datasets and output the negative probability and feature weight coefficients of the comments; S4. Four-factor dynamic threshold calculation: Extract the temporal seasonal composite decay factor, model decision consistency factor, scoring profile balance factor and text signal-to-noise ratio factor, obtain the optimal weights through Bayesian optimization and calculate the dynamic classification threshold and confidence interval. S5. Comment sentiment classification and manual calibration: Used to classify comments into multiple levels based on dynamic thresholds and confidence levels, and to manually label and correct comments with low confidence levels; S6. Analysis Report Generation and Collaborative Optimization: Generate analysis reports based on classification results and historical data, and provide feedback to optimize model parameters and feature weights.

2. The method and system for classifying the sentiment of homestay reviews based on multi-factor dynamic thresholds as described in claim 1, characterized in that, After data preprocessing in step S1, the following steps are included: Text feature vector: obtained by sequentially performing word segmentation, text cleaning, TF-IDF feature extraction, and domain sentiment dictionary weighting on the comment text data; Time decay factor: obtained by sequentially removing outliers, standardizing time format, and generating time decay factor from time series metadata; Entity profile feature vector: obtained by performing data verification, labeling, and splicing predetermined features on the entity profile data.

3. The method and system for classifying the sentiment of homestay reviews based on multi-factor dynamic thresholds as described in claim 1, characterized in that: Step S2, multi-source feature fusion and weight optimization, includes the following steps: S21. Feature concatenation: Concatenate the text feature vector, time decay factor and entity profile feature vector into a fused feature vector; S22. Relevance validation: By calculating the correlation between text sentiment and entity profile tags, invalid data with a correlation lower than the preset value is removed; S23. Missing value handling: Remove data samples with missing feature values exceeding a preset value; S24. Adaptive weight initialization: Initial weight allocation is performed on the text feature vector, time decay factor and entity profile feature vector. The initial weight allocation is based on the historical feature contribution statistics of the homestay review scenario; S25. Closed-loop controller iterative learning: Based on the feedback of classification results, construct a classification accuracy-weight relationship model, and perform iterative convergence of weights based on preset step size and iteration trigger conditions.

4. The method and system for classifying the sentiment of homestay reviews based on multi-factor dynamic thresholds as described in claim 1, characterized in that, Step S3, the training and inference of the logistic regression model, includes the following steps: S31. L1 regularized feature selection: Set the regularization strength parameter and filter features based on contribution; S32. Feature normalization processing: Based on the Min-Max normalization method, the filtered feature vectors are mapped to eliminate dimensional differences; S33. Model training data preparation: Construct a training set based on the labeled homestay review dataset, which includes negative samples; S34. Logistic regression model construction: Build a model based on the Scikit-learn library and set the number of iterations and convergence threshold; S35. Model inference: Input the feature vectors into the trained model and output the negative probability, classification result, and weight coefficients of each feature for each comment.

5. The method and system for classifying the sentiment of homestay reviews based on multi-factor dynamic thresholds as described in claim 1, characterized in that, Step S4, the calculation of the four-factor dynamic threshold, includes the following steps: S41. Four-factor extraction: The following factors are obtained through calculation: time-series seasonal composite decay factor, model decision consistency factor, score profile balance factor, and text signal-to-noise ratio factor, among which: Time-series seasonal composite decay factor: generated jointly by time decay factor and seasonal adjustment weight; Model decision consistency factor: Weights are assigned based on the negative probability and the degree of feature conflict determined by the standard deviation of feature classification contribution. Rating profile balance factor: calculated based on the root mean square error of the overall rating and the sub-item ratings of the homestay; Text signal-to-noise ratio factor: calculated based on the effective sentiment signal intensity of the matching domain sentiment dictionary and the proportion of irrelevant information in the unmatched dictionary; S42. Bayesian optimization framework construction: Based on the GPyOpt library, a Gaussian process regression model is built, and the recall and precision of negative reviews are used as optimization objectives. S43. Initialize the weights of the four factors: Preset and constrain the initial weight array of the four factors; S44. Hyperparameter Iterative Optimization: Obtain the optimal weight combination by pre-setting the number of iterations and based on the expected improvement criteria; S45. Multi-factor weighted fusion: The four factors are weighted and summed using optimized weights to generate a factor fusion value; S46. Dynamic Threshold Calculation: The dynamic threshold is calculated based on a preset negative comment baseline threshold, and the threshold range is controlled. 
The calculation formula is as follows: In the formula, the quantities are the dynamic classification threshold, the preset negative comment baseline threshold, and the four-factor weighted fusion value; S47. Confidence interval output: Based on the prediction results of the Gaussian process regression model, output the preset confidence interval of the threshold.

6. The method and system for classifying the sentiment of homestay reviews based on multi-factor dynamic thresholds as described in claim 5, characterized in that, in the model decision consistency factor, the feature classification contribution is calculated from the feature weight coefficients output by the logistic regression model and the normalized feature vector values; the graded weight is determined according to preset rules from the standard deviation of the feature classification contributions and the negative probability.
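
The following sketch shows one way the quantities in this claim could fit together. It assumes that each contribution is the product of a logistic regression coefficient and the corresponding normalized feature value, which is a common convention but is not confirmed by the text available here. The grading rules, cutoffs, and function names are hypothetical stand-ins for the "preset rules" of the claim.

```python
import statistics
from typing import List, Sequence


def feature_contributions(coefficients: Sequence[float],
                          normalized_values: Sequence[float]) -> List[float]:
    """ASSUMPTION: contribution_i = coefficient_i * normalized_value_i."""
    if len(coefficients) != len(normalized_values):
        raise ValueError("coefficients and values must have the same length")
    return [w * x for w, x in zip(coefficients, normalized_values)]


def consistency_weight(contributions: Sequence[float],
                       negative_probability: float,
                       conflict_cutoff: float = 0.5,
                       uncertain_band: float = 0.15) -> float:
    """Grade the factor from feature conflict and the negative probability.

    The feature conflict is the population standard deviation of the
    contributions. The three-level rule below is hypothetical.
    """
    conflict = statistics.pstdev(contributions)
    near_boundary = abs(negative_probability - 0.5) < uncertain_band
    if conflict >= conflict_cutoff and near_boundary:
        return 1.0   # conflicting features and an uncertain probability
    if conflict >= conflict_cutoff or near_boundary:
        return 0.5   # only one of the two warning signs is present
    return 0.0       # features agree and the model is decisive


contribs = feature_contributions([2.0, -1.0, 0.5], [0.5, 1.0, 0.0])
weight = consistency_weight(contribs, negative_probability=0.55)
```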

7. The method and system for classifying the sentiment of homestay reviews based on multi-factor dynamic thresholds as described in claim 5, characterized in that the text signal-to-noise ratio factor is calculated from the following quantities: the effective sentiment signal strength; the proportion of irrelevant information; the TF-IDF weights of the words that match the domain sentiment dictionary; the number of words that do not match the domain sentiment dictionary; and the total number of words in a single review after word segmentation and cleaning.
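
As a rough illustration of these quantities, the sketch below takes the signal strength as the sum of the TF-IDF weights of matched words and the noise proportion as the share of unmatched words. The way the two are combined into a single factor is not recoverable from this text; the bounded ratio signal / (signal + noise) is an assumption, and `text_snr_factor` and its parameters are hypothetical names.

```python
from typing import Mapping, Sequence


def text_snr_factor(tokens: Sequence[str],
                    matched_tfidf: Mapping[str, float]) -> float:
    """Text signal-to-noise factor for one segmented and cleaned review.

    tokens: words of the review after segmentation and cleaning.
    matched_tfidf: TF-IDF weight of each word found in the domain
        sentiment dictionary; any word absent from it counts as unmatched.
    ASSUMPTION: factor = signal / (signal + noise), which lies in [0, 1].
    """
    if not tokens:
        return 0.0
    matched = {t for t in tokens if t in matched_tfidf}
    signal = sum(matched_tfidf[t] for t in matched)
    unmatched_count = sum(1 for t in tokens if t not in matched_tfidf)
    noise = unmatched_count / len(tokens)
    if signal + noise == 0:
        return 0.0
    return signal / (signal + noise)


# Hypothetical review: two dictionary words, two irrelevant words.
snr = text_snr_factor(["dirty", "room", "noisy", "the"],
                      {"dirty": 0.6, "noisy": 0.4})
```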

8. The method and system for classifying the sentiment of homestay reviews based on multi-factor dynamic thresholds as described in claim 1, characterized in that Step S5, sentiment grading and manual calibration, includes the following steps:
S51. Confidence calculation: the confidence level is calculated from the dynamic threshold and the negative probability;
S52. High-confidence classification: reviews whose confidence level is above a preset value are classified by the automatic multi-level classification unit into critical, general, and minor levels according to the negative probability;
S53. Low-confidence filtering: reviews whose confidence level is below the preset value are exported to the manual annotation platform;
S54. Manual annotation: reviews whose confidence level is below the preset value are annotated and reclassified, and incremental training or hybrid retraining is performed according to a preset cumulative annotation amount;
S55. Classification result synchronization: the manually corrected classification results are added to the automatic multi-level classification results.
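
The routing in S51 to S53 might look like the sketch below. The confidence formula itself is missing from this text, so the distance between the negative probability and the dynamic threshold is used as an assumed stand-in. The cutoffs for the critical and general levels, the minimum confidence, and the label strings are all hypothetical.

```python
def confidence(negative_probability: float, threshold: float) -> float:
    """ASSUMPTION: confidence is the distance from the dynamic threshold."""
    return abs(negative_probability - threshold)


def route_review(negative_probability: float, threshold: float,
                 min_confidence: float = 0.1,
                 critical_cutoff: float = 0.9,
                 general_cutoff: float = 0.75) -> str:
    """Send a review to manual annotation or to an automatic level."""
    if confidence(negative_probability, threshold) < min_confidence:
        return "manual_annotation"       # S53: too close to the threshold
    if negative_probability < threshold:
        return "non_negative"
    # S52: grade confident negative reviews by their negative probability.
    if negative_probability >= critical_cutoff:
        return "critical"
    if negative_probability >= general_cutoff:
        return "general"
    return "minor"


labels = [route_review(p, threshold=0.6) for p in (0.95, 0.8, 0.72, 0.65, 0.2)]
```

Under this assumption, reviews that sit close to the threshold are exactly the ones sent for human annotation, which is where a manual label adds the most information.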

9. The method and system for classifying the sentiment of homestay reviews based on multi-factor dynamic thresholds as described in claim 1, characterized in that Step S6, analysis report generation and collaborative optimization, includes the following steps:
S61. Explainability description generation: for each negative review, the top-ranking contribution features and their corresponding weights are output;
S62. Performance metric statistics: the recall and precision of negative reviews and the percentage distribution of each classification level are computed for the statistical period;
S63. Report output: visual reports are generated and pushed;
S64. Model parameter update: within a preset period, manually labeled data and newly labeled data are mixed at a preset ratio and the logistic regression model is retrained;
S65. Historical model records: the four factor values, the optimal threshold, and the classification performance metrics of each threshold adjustment are stored to build a historical model library;
S66. Feedback optimization: the classification performance metrics are fed back to adjust the adaptive weight fusion unit, and historical parameters are reused in the feature selection unit.
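
The statistics of S61 and S62 can be sketched with the standard library alone. The function names, the label encoding (1 for negative, 0 otherwise), and ranking by absolute contribution are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def top_contributions(names: Sequence[str], contributions: Sequence[float],
                      k: int = 3) -> List[Tuple[str, float]]:
    """S61: the k features with the largest absolute contribution."""
    ranked = sorted(zip(names, contributions),
                    key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]


def negative_review_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                            levels: Sequence[str]) -> Dict[str, object]:
    """S62: recall and precision of the negative class, plus level shares."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    counts = Counter(levels)
    total = len(levels)
    distribution = {lvl: 100.0 * c / total for lvl, c in counts.items()}
    return {"recall": recall, "precision": precision,
            "level_percent": distribution}


report = negative_review_metrics(
    y_true=[1, 1, 0, 0, 1], y_pred=[1, 0, 1, 0, 1],
    levels=["critical", "minor", "minor", "general"])
```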

10. A classification system, said system being applied to the homestay review sentiment classification method and system based on multi-factor dynamic thresholds as described in any one of claims 1 to 9, characterized in that the system includes:
a multi-source heterogeneous data input module, configured to input the collected comment text data, time-series metadata, and entity profile data into the system;
a comment text data processing module, configured to perform word segmentation, text cleaning, TF-IDF feature extraction, and domain sentiment dictionary weighting on the comment text data, and to output text feature vectors;
a time-series metadata processing module, configured to remove outliers from the time-series metadata, unify its time formats, and generate and output time decay factors;
an entity profile data processing module, configured to perform data verification, tagging, and predefined feature concatenation on the entity profile data, and to output entity profile feature vectors;
a multi-source feature fusion and verification module, configured to perform correlation verification and remove invalid data;
a classification model inference module, configured to use a logistic regression model to output the negative probability, the classification result, and the weight of each feature;
a dynamic decision threshold optimization module, configured to obtain the optimal classification threshold based on the fusion result;
an analysis report generation module, configured to output interpretable reports and key performance indicator statistics;
a collaborative optimization control module, configured to update model parameters and maintain historical model records.