An equipment manufacturing enterprise risk prediction method, device, equipment and storage medium

By constructing a risk dictionary for equipment manufacturing enterprises and using the XGBoost model for feature fusion and calibration, the problems of text semantic understanding and data imbalance in risk prediction for equipment manufacturing enterprises are solved, and multi-dimensional and high-precision risk prediction is achieved.

CN122241530A · Pending · Publication Date: 2026-06-19 · WUHAN UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
WUHAN UNIV OF TECH
Filing Date
2026-03-27
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies lack domain risk dictionaries specifically for the equipment manufacturing industry, resulting in insufficient semantic understanding of text and a lack of systematic feature engineering strategies, making it difficult to achieve multi-dimensional and high-precision prediction of risks for equipment manufacturing enterprises.

Method used

A risk dictionary for equipment manufacturing enterprises is constructed, with expanded words generated using the BERT large language model. TF-IDF features and custom risk features are fused with weights, and the fused feature set is adaptively sampled to generate a balanced feature matrix. An XGBoost model is trained on the balanced feature matrix and used for prediction. Post-processing calibration is then performed to determine the final risk category.

Benefits of technology

It achieves multi-dimensional and high-precision prediction of risks for equipment manufacturing enterprises, solves the problems of imbalanced text data and insufficient semantic understanding, and improves the predictive performance of the model.

✦ Generated by Eureka AI based on patent content.

Figure CN122241530A_ABST

Abstract

This invention provides a method, apparatus, equipment, and storage medium for risk prediction in equipment manufacturing enterprises. The method includes: acquiring text data of equipment manufacturing enterprises; constructing a risk dictionary for equipment manufacturing enterprises based on knowledge of the equipment manufacturing field; labeling the text data using the risk dictionary to generate labeled text data; performing augmentation processing on the labeled text data based on a pre-trained language model; extracting TF-IDF features and custom risk features from the augmented text and fusing them with weights; adaptively sampling the fused feature set to generate a balanced feature matrix; training an XGBoost model using the balanced feature matrix; predicting the original probability distribution for the text data to be predicted; performing post-processing calibration based on the importance of risk categories; and determining the final risk prediction category according to the maximum a posteriori probability criterion. This invention achieves multi-dimensional and high-precision prediction of risks for equipment manufacturing enterprises.

Description

Technical Field

[0001] This invention relates to the field of text data processing technology, specifically to a risk prediction method, device, electronic equipment, and storage medium for equipment manufacturing enterprises.

Background Technology

[0002] Equipment manufacturing is a core industry characterized by high technology content, long industrial chains, and significant investment risks. It encompasses key areas such as high-end lithography machines, chip manufacturing, and large-scale integrated circuits, serving as a crucial support for promoting economic restructuring and upgrading, and enhancing international competitiveness. The operation and development of equipment manufacturing enterprises are influenced by multiple factors, including market conditions, policies, technology, and supply chains, resulting in risks characterized by complexity, suddenness, and transmissibility. With the adjustment of the global industrial landscape and the increasing complexity of the international situation, various unforeseen risk events are occurring frequently. Equipment manufacturing enterprises have an increasingly urgent need for risk prediction, which has become a key means for enterprises to shift from "post-event response" to "pre-event prevention." Related technologies have also become a research focus in the field of risk management in the manufacturing industry.

[0003] Currently, research on risk management in equipment manufacturing enterprises mainly focuses on risk identification. Existing methods mostly predict risk from numerical data, which limits the dimensions of prediction. As for textual data, existing research largely remains at the risk identification and assessment stage (that is, "discovering" risks from text) and has not systematically studied risk prediction based on textual data. As a result, textual analysis is mainly used for post-event explanation or description of the current status and cannot be effectively used for pre-event warning. Furthermore, existing risk prediction models are mostly built for general or financial domains and lack dedicated risk dictionaries and high-quality labeled data that deeply integrate knowledge of the equipment manufacturing industry (such as the industrial chain, technological iteration, and policy sensitivity). This makes it difficult for models to accurately understand and quantify industry-specific risk semantics. At the same time, faced with the high noise, class imbalance, and label scarcity of textual data, existing research applies only isolated, single-purpose processing strategies and has not formed a systematic solution that runs from data augmentation and feature fusion through to balanced sampling, which keeps machine learning models from reaching their full predictive performance.

[0004] In summary, the technical problem that this invention aims to solve is that the lack of a domain risk dictionary for the equipment manufacturing industry leads to insufficient semantic understanding of text, and the lack of a systematic feature engineering strategy for unbalanced text data results in limited model prediction performance, making it difficult to achieve multi-dimensional and high-precision prediction of risks for equipment manufacturing enterprises.

Summary of the Invention

[0005] In view of this, it is necessary to provide a method, device, electronic device and storage medium for risk prediction of equipment manufacturing enterprises, so as to solve the technical problem that it is difficult to achieve multi-dimensional and high-precision prediction of risks of equipment manufacturing enterprises in the existing technology.

[0006] To address the aforementioned technical problems, in a first aspect, the present invention provides a risk prediction method for equipment manufacturing enterprises, comprising: Text data of equipment manufacturing enterprises is acquired, a risk dictionary of equipment manufacturing enterprises is constructed based on knowledge of the equipment manufacturing field, and the text data is annotated using the risk dictionary to generate tagged text data. The labeled text data is augmented based on a pre-trained language model. The TF-IDF features of the augmented text and the custom risk features are extracted and weighted and fused. The fused feature set is then adaptively sampled to generate a balanced feature matrix. An XGBoost model is trained using the balanced feature matrix. The trained XGBoost model is then used to predict the balanced feature matrix generated from the text data to be predicted to obtain the original probability distribution. The original probability distribution is then post-processed and calibrated based on the importance of risk categories. Finally, the risk prediction category is determined according to the maximum a posteriori probability criterion.

[0007] In one possible implementation, constructing the equipment manufacturing enterprise risk dictionary based on knowledge of the equipment manufacturing field includes: based on the COSO-ERM framework and ISO and IEC international standards, and in combination with the characteristics of the equipment manufacturing industry, selecting six types of risks and their corresponding seed words to form the risk meta-dictionary of the equipment manufacturing industry; using the BERT large language model to generate expanded words for the seed words based on mainstream media news text; and merging the risk meta-dictionary and the expanded words to obtain the risk dictionary for the equipment manufacturing industry.

[0008] In one possible implementation, the step of using the BERT large language model to generate expanded words for the seed words based on mainstream media news text includes: Standardized text is obtained by preprocessing data from mainstream news media. The standardized text is input into the BERT large language model to generate a word vector table, and the seed word vectors corresponding to the seed words are extracted. Calculate the cosine similarity between each word in the word vector table and the seed word vector, and select words with a similarity greater than a preset threshold as expanded words.

[0009] In one possible implementation, the TF-IDF feature extraction step includes: Define a TF-IDF feature extraction function, which processes the enhanced text based on character-level n-grams to obtain the n-gram lexical patterns of each text; Calculate the word frequency and inverse document frequency of each n-gram lexical pattern in the enhanced text; The product of term frequency and inverse document frequency is standardized to obtain the TF-IDF feature vector of each enhanced text.

[0010] In one possible implementation, the custom risk feature extraction includes: For the set of words in the word vector vocabulary in the enhanced text, extract the embedding vector of each word and perform average pooling to generate word vector features; The statistical characteristics of the enhanced text were obtained from the text length, total frequency of risky keywords, and count of risky keywords in each category. By horizontally stacking word vector features and statistical features, a custom risk feature is obtained.
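
The custom risk feature described above can be illustrated with a minimal Python sketch. It is only an assumption-laden illustration, not the implementation of the invention: the function names, the toy embedding table, the toy dictionary, and the choice to measure text length in tokens are all hypothetical.

```python
from typing import Dict, List


def mean_pool(tokens: List[str], embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the embeddings of the distinct tokens found in the vocabulary."""
    # dict.fromkeys keeps first-seen order while dropping repeats (a word *set*).
    known = [embeddings[t] for t in dict.fromkeys(tokens) if t in embeddings]
    if not known:
        return [0.0] * dim  # no in-vocabulary word: fall back to a zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def stat_features(tokens: List[str], dictionary: Dict[str, List[str]]) -> List[float]:
    """Text length, total risk-keyword frequency, and one count per category."""
    per_category = [sum(tokens.count(kw) for kw in kws) for kws in dictionary.values()]
    return [float(len(tokens)), float(sum(per_category))] + [float(c) for c in per_category]


def custom_risk_features(tokens, embeddings, dim, dictionary):
    """Horizontally stack the word vector features and the statistical features."""
    return mean_pool(tokens, embeddings, dim) + stat_features(tokens, dictionary)


# Toy data (hypothetical, for illustration only).
toy_embeddings = {"supply": [1.0, 0.0], "failure": [0.0, 1.0]}
toy_dictionary = {"operational risk": ["supply"], "technical risk": ["failure"]}
toy_tokens = ["supply", "failure", "failure", "unknown"]

features = custom_risk_features(toy_tokens, toy_embeddings, 2, toy_dictionary)
```

Here the first two entries are the pooled word vector and the remaining entries are the statistics, so the vector length is the embedding dimension plus two plus the number of risk categories.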

[0011] In one possible implementation, the adaptive sampling of the fused feature set to generate a balanced feature matrix includes: counting the number of samples in each category of the fused feature set and identifying the minority and majority categories; oversampling the minority categories with the ADASYN algorithm to generate synthetic samples; undersampling the majority categories and retaining a subset of their samples; and combining the synthetic samples generated by oversampling, the samples retained by undersampling, and the original samples of the other categories to generate the balanced feature matrix.
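
The sampling flow above can be sketched in plain Python as follows. This is a simplified, ADASYN-style illustration rather than the full ADASYN algorithm: minority samples surrounded by more foreign neighbours are picked more often as interpolation bases, but the partner sample is drawn from the whole minority class instead of its nearest minority neighbours. The function names, the `target` count per category, and the toy data are assumptions.

```python
import math
import random
from collections import Counter


def _dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def adasyn_style_samples(X, y, minority, n_new, k=3, rng=None):
    """Create n_new synthetic rows for one minority class by interpolation."""
    rng = rng or random.Random(0)
    idx_min = [i for i, label in enumerate(y) if label == minority]
    if not idx_min or n_new <= 0:
        return []
    # Hardness: share of other-class rows among the k nearest neighbours.
    hardness = []
    for i in idx_min:
        nearest = sorted((j for j in range(len(X)) if j != i),
                         key=lambda j: _dist(X[i], X[j]))[:k]
        foreign = sum(1 for j in nearest if y[j] != minority)
        hardness.append(foreign / len(nearest) if nearest else 0.0)
    if sum(hardness) == 0:
        hardness = [1.0] * len(idx_min)  # no hard rows: sample bases uniformly
    synthetic = []
    for _ in range(n_new):
        i = rng.choices(idx_min, weights=hardness, k=1)[0]
        partners = [j for j in idx_min if j != i] or [i]
        j = rng.choice(partners)  # simplification: any minority row may be the partner
        lam = rng.random()
        synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


def balance(X, y, target, rng=None):
    """Oversample classes below target, undersample classes above it."""
    rng = rng or random.Random(0)
    X_out, y_out = [], []
    for label, n in Counter(y).items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        if n > target:      # majority class: keep a random subset
            rows = rng.sample(rows, target)
        elif n < target:    # minority class: append synthetic rows
            rows = rows + adasyn_style_samples(X, y, label, target - n, rng=rng)
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
    return X_out, y_out


# Toy feature matrix (hypothetical): class A is the majority, class B the minority.
X = [[5.0, 5.0], [5.1, 5.0], [5.0, 5.2], [5.2, 5.1], [4.9, 5.0], [5.1, 5.2],
     [0.0, 0.0], [1.0, 1.0],
     [9.0, 0.0], [9.1, 0.1], [9.0, 0.2], [9.2, 0.0]]
y = ["A"] * 6 + ["B"] * 2 + ["C"] * 4
Xb, yb = balance(X, y, target=4)
```

In practice a library implementation of ADASYN would normally be preferred; the sketch only makes the three-way combination of synthetic, retained, and untouched samples concrete.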

[0012] In one possible implementation, the XGBoost model is trained using the balanced feature matrix, and the original probability distribution is obtained by predicting the balanced feature matrix generated from the text data to be predicted based on the trained XGBoost model. The original probability distribution is then post-processed and calibrated based on the importance of risk categories, and the final risk prediction category is determined according to the maximum a posteriori probability criterion, including: The XGBoost multi-class model is trained using the balanced feature matrix, and the original probability distribution is obtained by predicting the balanced feature matrix generated based on the text data to be predicted based on the trained XGBoost model. Based on the risk characteristics of the equipment manufacturing field, a category importance weight vector is predefined; The original probability distribution is weighted and adjusted using the weight vector, and the weighting result is normalized to generate a calibrated probability distribution. Based on the maximum a posteriori probability criterion, the category with the highest probability after calibration is determined as the final risk prediction category and output.
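
The calibration and decision steps described above can be sketched as follows. The sketch assumes that the raw probability vector has already been produced by a trained multi-class model; the weight values, category order, and function names are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def calibrate(probs: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Weight each class probability by its importance, then renormalize to sum to 1."""
    if len(probs) != len(weights):
        raise ValueError("probs and weights must have the same length")
    weighted = [p * w for p, w in zip(probs, weights)]
    total = sum(weighted)
    if total <= 0:
        raise ValueError("weighted probabilities must have a positive sum")
    return [v / total for v in weighted]


def predict_category(probs: Sequence[float], weights: Sequence[float],
                     categories: Sequence[str]) -> Tuple[str, List[float]]:
    """Return the category with the highest calibrated probability."""
    calibrated = calibrate(probs, weights)
    best = max(range(len(calibrated)), key=calibrated.__getitem__)
    return categories[best], calibrated


CATEGORIES = ["market", "policy", "strategic", "operational", "technical", "financial"]
raw_probs = [0.30, 0.10, 0.10, 0.28, 0.12, 0.10]     # hypothetical model output
importance = [1.0, 1.0, 1.0, 1.2, 1.0, 1.0]          # hypothetical weight vector

label, calibrated_probs = predict_category(raw_probs, importance, CATEGORIES)
```

With these toy numbers the uncalibrated maximum is the market category, while the extra weight on the operational category moves the final decision to operational risk, which shows how the weight vector can shift borderline predictions toward categories considered more important.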

[0013] On the other hand, the present invention also provides a risk prediction device for equipment manufacturing enterprises, comprising: The data processing module is used to acquire text data from equipment manufacturing enterprises, construct a risk dictionary for equipment manufacturing enterprises based on knowledge in the equipment manufacturing field, and use the risk dictionary to annotate the text data to generate tagged text data. The feature engineering module is used to perform enhancement processing on the labeled text data based on a pre-trained language model, extract the TF-IDF features of the enhanced text and custom risk features and perform weighted fusion, and adaptively sample the fused feature set to generate a balanced feature matrix. The model prediction module is used to train an XGBoost model using the balanced feature matrix, predict the original probability distribution based on the balanced feature matrix generated from the text data to be predicted using the trained XGBoost model, perform post-processing calibration on the original probability distribution based on the importance of risk categories, and determine the final risk prediction category according to the maximum a posteriori probability criterion.

[0014] In a second aspect, the present invention also provides an electronic device, including a memory and a processor, wherein, The memory is used to store programs; The processor, coupled to the memory, is used to execute the program stored in the memory to implement the steps in the equipment manufacturing enterprise risk prediction method described in any of the above implementations.

[0015] Thirdly, the present invention also provides a computer-readable storage medium for storing a computer-readable program or instruction, which, when executed by a processor, can implement the steps in the equipment manufacturing enterprise risk prediction method described in any of the above implementations.

[0016] The beneficial effects of this invention are as follows: The risk prediction method for equipment manufacturing enterprises provided by this invention firstly acquires text data of equipment manufacturing enterprises, constructs a risk dictionary for equipment manufacturing enterprises based on knowledge of the equipment manufacturing field, and annotates the text data. This effectively solves the problems of existing technologies where general dictionaries cannot accurately match the unique risk semantics of the equipment manufacturing industry and the text annotation accuracy is low. Then, by performing pre-trained language model enhancement processing on the labeled text data, and combining the weighted fusion of TF-IDF features and custom risk features, it takes into account both global semantic features of the text and risk features specific to the equipment manufacturing field, making up for the deficiency of single feature representation capabilities. Furthermore, by adaptively sampling the fused feature set to generate a balanced feature matrix, using the balanced feature matrix to train an XGBoost model, and performing post-processing calibration on the model output based on the importance of risk categories, it effectively solves the problem of uneven distribution of risk text samples for equipment manufacturing enterprises, further optimizes the model prediction accuracy, avoids the problem of uneven attention to different risk categories in equipment manufacturing by general models, and achieves multi-dimensional and high-precision prediction of risks for equipment manufacturing enterprises.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a schematic flowchart of an embodiment of the risk prediction method for equipment manufacturing enterprises provided by the present invention; Figure 2 is a schematic flowchart of an embodiment of S101 in Figure 1; Figure 3 is a schematic flowchart of an embodiment of S202 in Figure 2; Figure 4 is a schematic flowchart of an embodiment of S102 in Figure 1; Figure 5 is a schematic flowchart of another embodiment of S102 in Figure 1; Figure 6 is a schematic flowchart of yet another embodiment of S102 in Figure 1; Figure 7 is a schematic flowchart of an embodiment of S103 in Figure 1; Figure 8 is a schematic diagram of an embodiment of the risk prediction device for equipment manufacturing enterprises provided by the present invention; Figure 9 is a schematic diagram of an embodiment of the electronic device provided by the present invention.

Detailed Implementation

[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0020] In the description of the embodiments of the present invention, unless otherwise stated, "multiple" means two or more. "And / or" describes the relationship between related objects, indicating that there can be three relationships. For example, A and / or B can represent three situations: A exists alone, A and B exist simultaneously, and B exists alone.

[0021] The terms "first," "second," etc., used in the embodiments of this invention are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a technical feature defined with "first" or "second" may explicitly or implicitly include at least one of that feature.

[0022] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0023] Before demonstrating the embodiments, the following terms will be explained.

[0024] Equipment Manufacturing Enterprise Risk Dictionary: Based on the COSO-ERM framework and ISO and IEC international standards, and integrating knowledge from the equipment manufacturing industry, it includes a three-level vocabulary system of risk categories, seed words, and extended words, used to accurately annotate risk semantics in text data.

[0025] TF-IDF: Term Frequency-Inverse Document Frequency, a statistical method for assessing the importance of words in text. This invention uses character-level n-gram extraction to adapt to the professional vocabulary characteristics of equipment manufacturing enterprise texts.

[0026] This invention provides a risk prediction method, device, electronic device, and storage medium for equipment manufacturing enterprises, which will be described below.

[0027] The risk prediction method for equipment manufacturing enterprises provided by this invention can be executed by computer equipment with data processing capabilities, including but not limited to servers, workstations, cloud service platforms, or embedded terminal devices. This method is applicable to technical application scenarios of risk identification, early warning, and risk management in equipment manufacturing enterprises. By deploying it on the aforementioned equipment or platform, this invention can automatically collect, analyze, and predict enterprise annual report text data and mainstream media news text data, achieving intelligent early warning of enterprise risks. As shown in Figure 1, the risk prediction method for equipment manufacturing enterprises includes: S101. Obtain text data of equipment manufacturing enterprises, construct a risk dictionary for equipment manufacturing enterprises based on knowledge of the equipment manufacturing field, and use the risk dictionary to annotate the text data to generate labeled text data.

[0028] It should be noted that the text data of the equipment manufacturing enterprises includes, but is not limited to, enterprise annual report text data and mainstream media news data. To ensure the standardization and timeliness of the research sample, this invention sets the data collection time window to 2017 to 2023. In selecting sample enterprises, the list of equipment manufacturing enterprises is extracted based on the "Results of Industry Classification of Listed Companies in the Third Quarter of 2021". As an example, this specific embodiment selects enterprises in category C39 (computer, communication and other electronic equipment manufacturing industry) as the research object. This subcategory belongs to a typical technology-intensive equipment manufacturing field and has strong industry representativeness. Mainstream media news texts are characterized by high structure, strong professionalism, and low noise, and can reflect changes in the external environment and industry dynamics. In this embodiment, the PyNewspaper package, a dedicated news content crawler, is used to obtain news corpus related to the risks of the computer, communication and other electronic equipment manufacturing industry from 2017 to 2023 by setting keywords related to the risks of the equipment manufacturing industry, as a supplementary data source.

[0029] Specifically, the acquired annual report text data of the equipment manufacturing enterprises is preprocessed and converted into standardized text. The standardized text is divided proportionally into a training set, a test set, and a validation set, which facilitates subsequent keyword matching with the risk dictionary, feature engineering, and model training.

[0030] S102. The labeled text data is augmented based on a pre-trained language model. The TF-IDF features and custom risk features of the augmented text are extracted and weighted and fused. The fused feature set is then adaptively sampled to generate a balanced feature matrix.

[0031] It should be noted that the pre-trained language model is the Word2Vec model. The labeled text data is segmented and filtered to retain effective words. Based on the semantic similarity of the Word2Vec model, synonymous industry words are replaced with core risk words. The text fragments are then reorganized in a semantically coherent way to generate semantically equivalent enhanced samples, and their risk labels are calibrated, ultimately yielding the enhanced text data.

[0032] S103. Train the XGBoost model using the balanced feature matrix, predict the original probability distribution based on the balanced feature matrix generated from the text data to be predicted using the trained XGBoost model, perform post-processing calibration on the original probability distribution based on the importance of risk categories, and determine the final risk prediction category according to the maximum a posteriori probability criterion. In summary, the risk prediction method for equipment manufacturing enterprises provided by this invention firstly acquires text data from equipment manufacturing enterprises, constructs a risk dictionary for these enterprises based on knowledge of the equipment manufacturing domain, and annotates the text data. This effectively solves the problems of existing technologies where general dictionaries cannot accurately match the unique risk semantics of the equipment manufacturing industry and where text annotation accuracy is low. Then, by using a pre-trained language model to enhance the labeled text data and fusing TF-IDF features with custom risk features by weight, it takes into account both the global semantic features of the text and the risk features specific to the equipment manufacturing domain, thus overcoming the limited representation capability of any single feature type. Furthermore, by adaptively sampling the fused feature set to generate a balanced feature matrix, training an XGBoost model using the balanced feature matrix, and post-processing and calibrating the model output based on the importance of risk categories, it effectively solves the problem of uneven distribution of risk text samples from equipment manufacturing enterprises, further optimizes the model's prediction accuracy, avoids the problem of uneven attention to different risk categories in equipment manufacturing by general models, and achieves multi-dimensional, high-precision prediction of risks for equipment manufacturing enterprises.

[0033] In some embodiments of the present invention, as shown in Figure 2, constructing the risk dictionary for equipment manufacturing enterprises based on knowledge in the equipment manufacturing field includes: S201. Based on the COSO-ERM framework and ISO and IEC international standards, and combined with the characteristics of the equipment manufacturing industry, six types of risks and their corresponding seed words are selected to form the risk meta-dictionary of the equipment manufacturing industry. S202. Using the BERT large language model, based on mainstream media news text, generate extended words for the seed words; S203. Merge the risk meta-dictionary and the extended words to obtain the risk dictionary for the equipment manufacturing industry.

[0034] The specific construction process of the risk meta-dictionary for the equipment manufacturing industry is as follows: Experts with experience in equipment manufacturing risk control extracted core keywords from standard clauses and historical risk cases, based on the dimensions of strategic risk, operational risk, reporting risk, and compliance risk defined in the COSO-ERM framework, combined with the technical specifications for the equipment manufacturing field in ISO risk management standards and IEC functional safety standards. For example, "failure," "malfunction," and "redundancy deficiency" were extracted from the clauses on "functional safety" in IEC international standards as seed words for technical risks; "supply disruption," "license," and "long lead time equipment" were extracted from historical supply chain disruption events as seed words for operational risks. After multiple rounds of expert discussions and cross-validation, six core risk categories—market risk, policy risk, strategic risk, operational risk, technical risk, and financial risk—and their corresponding seed word sets were finally determined, forming the risk meta-dictionary for the equipment manufacturing industry.

[0035] In some embodiments of the present invention, as shown in Figure 3, step S202, using the BERT large language model to generate expanded words for the seed words based on mainstream media news text, includes: S301. Standardized text is obtained by preprocessing data from mainstream news media. S302. Input the standardized text into the BERT large language model to generate a word vector table, and extract the seed word vectors corresponding to the seed words; S303. Calculate the cosine similarity between each word in the word vector table and the seed word vector, and select words with similarity greater than a preset threshold as extended words.

[0036] Preprocessing of mainstream news media data includes cleaning, stop word removal, and word segmentation to ensure the effectiveness of subsequent semantic expansion and feature extraction. The cleaning, stop word removal, and word segmentation technologies are relatively mature and will not be elaborated on here.

[0037] It should be noted that the preset threshold is determined based on the semantic relevance of risk terms in the equipment manufacturing field. It is used to select words that are highly similar to the seed word in terms of risk semantics. While ensuring the semantic accuracy of the extended words, it also takes into account the coverage and recognition accuracy of the risk dictionary. For example, if the preset similarity threshold is 0.95, words with a similarity greater than the threshold of 0.95 are selected as extended words to form a similar word dictionary.
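
The threshold-based selection of extended words can be illustrated with a short sketch. It assumes that a word vector table has already been produced by the language model; the three-dimensional toy vectors, the words, and the function names are hypothetical, and only the 0.95 threshold comes from the example above.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_seed(seed: str, vectors: Dict[str, Sequence[float]], threshold: float) -> List[str]:
    """Return words whose similarity to the seed word exceeds the threshold."""
    seed_vec = vectors[seed]
    return sorted(word for word, vec in vectors.items()
                  if word != seed and cosine(seed_vec, vec) > threshold)


# Toy word vector table (hypothetical values, not real BERT embeddings).
toy_vectors = {
    "supply disruption": [1.0, 0.0, 0.2],
    "supply interruption": [0.98, 0.05, 0.21],
    "delivery delay": [0.7, 0.6, 0.1],
    "dividend": [0.0, 1.0, 0.0],
}

expanded = expand_seed("supply disruption", toy_vectors, threshold=0.95)
```

With a threshold this strict only near-synonyms survive; lowering it would admit looser terms such as "delivery delay", which is the coverage-versus-accuracy trade-off the preset threshold is meant to balance.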

[0038] In some embodiments of the present invention, a pre-constructed risk meta-dictionary for the equipment manufacturing industry is converted into a risk classification mapping table. This mapping table is used to establish the correspondence between risk categories and corresponding seed words. Using the risk classification mapping table, each word in the similar word dictionary obtained through similarity screening is classified and integrated into the corresponding risk category according to its semantic belonging relationship with the seed word. This completes the classification and reorganization of similar words, ultimately forming a three-level vocabulary list of risk category, seed word, and extended word. Finally, experts manually screen the candidate extended words, remove noisy words, and merge them with the original seed words to form the final risk dictionary for the equipment manufacturing industry.

[0039] This invention, by combining the COSO-ERM framework with ISO and IEC international standards and considering the characteristics of the equipment manufacturing industry, ensures the professionalism and accuracy of the dictionary, guaranteeing comprehensive coverage of core risk dimensions. Then, through the semantic understanding capabilities of the BERT large language model, the dictionary is efficiently expanded, capturing the latest industry trends and the diversity of textual expressions. This avoids the inefficiency and subjectivity of purely manual construction and prevents the introduction of noisy vocabulary that might be introduced by purely data-driven methods, thus providing a high-quality semantic foundation for subsequent risk labeling and model training.

[0040] In some embodiments of the present invention, step S101, which involves using the risk dictionary to annotate the text data and generate labeled text data, includes: constructing, based on the risk dictionary of equipment manufacturing enterprises, a risk dictionary tagging system for the equipment manufacturing industry $D = \{(c, K_c, w_c) \mid c \in C\}$, where $C$ = {market risk, policy risk, strategic risk, operational risk, technological risk, financial risk, other risks}, $K_c$ is the keyword set of risk category $c$, and $w_c$ is the weight configuration of risk category $c$.

[0041] Keyword matching is performed separately on the training set, the test set, and the validation set, and the frequency of occurrence of the keywords corresponding to each risk category is counted. The counts are weighted by both word length and category importance, the risk points and risk scores are calculated with this weighting, and the true risk label of each text is finally obtained.
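
One possible reading of this labeling step is sketched below. The exact weighting formula is not recoverable from the text, so the scoring rule used here (keyword count multiplied by keyword length, then by a category weight) is an assumption, as are the function names, the toy dictionary, and the fallback to "other risks" when no keyword matches.

```python
from typing import Dict, List


def risk_scores(text: str, dictionary: Dict[str, List[str]],
                weights: Dict[str, float]) -> Dict[str, float]:
    """Score each category: sum of (keyword count * keyword length) * category weight."""
    scores = {}
    for category, keywords in dictionary.items():
        raw = sum(text.count(kw) * len(kw) for kw in keywords)
        scores[category] = weights.get(category, 1.0) * raw
    return scores


def label_text(text: str, dictionary: Dict[str, List[str]],
               weights: Dict[str, float], fallback: str = "other risks") -> str:
    """Assign the highest-scoring category, or the fallback if nothing matches."""
    scores = risk_scores(text, dictionary, weights)
    if not scores:
        return fallback
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else fallback


# Toy dictionary and weights (hypothetical, for illustration only).
toy_dictionary = {
    "operational risk": ["supply disruption", "long lead time"],
    "technical risk": ["failure", "malfunction"],
}
equal_weights = {"operational risk": 1.0, "technical risk": 1.0}
tech_heavy = {"operational risk": 1.0, "technical risk": 1.5}

sample = "a supply disruption followed the equipment failure and a second failure"
label_equal = label_text(sample, toy_dictionary, equal_weights)
label_tech = label_text(sample, toy_dictionary, tech_heavy)
```

The two calls show how category importance interacts with word length: with equal weights the single long operational keyword outweighs two short technical ones, while a higher technical weight reverses the label.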

[0042] It should be understood that the risk dictionary tagging system in the equipment manufacturing industry adopts a hierarchical structure of category-keyword to organize the risk semantic space, and each risk category corresponds to a set of core keywords verified by domain experts.
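
The dictionary-based labeling described above can be sketched as follows. The text does not give the exact scoring formula, so this sketch assumes one simple form: each keyword hit contributes its character length, the per-category sum is multiplied by the category weight, and the highest-scoring category becomes the label. The function name `label_text`, the fallback to "other risks" when nothing matches, and all keywords and weights are hypothetical.

```python
def label_text(text, keywords, weights, default="other risks"):
    """Score each risk category by weighted keyword hits and return the best one.

    Assumed scoring rule (not taken from the source):
    score_c = weight_c * sum(count(keyword) * len(keyword)).
    """
    scores = {}
    for category, words in keywords.items():
        raw = sum(text.count(word) * len(word) for word in words)
        scores[category] = weights.get(category, 1.0) * raw
    best = max(scores, key=scores.get)
    # Fall back to the default category when no keyword matched at all.
    label = best if scores[best] > 0 else default
    return label, scores


# Hypothetical category-keyword hierarchy and category weights.
keywords = {
    "market risk": ["demand", "price war"],
    "financial risk": ["debt", "default"],
}
weights = {"market risk": 1.0, "financial risk": 1.5}

label, scores = label_text(
    "Weak demand and rising debt raised the chance of default.", keywords, weights
)
```

Here the financial keywords contribute (4 + 7) × 1.5 = 16.5 against 6 × 1.0 = 6.0 for the market keywords, so the text is labeled as financial risk.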

[0043] In some embodiments of the present invention, as shown in Figure 4, the TF-IDF feature extraction includes: S401. Define a TF-IDF feature extraction function that processes the enhanced text based on character-level n-grams to obtain the n-gram lexical patterns of each text; S402. Calculate the term frequency and inverse document frequency of each n-gram lexical pattern in the enhanced text; S403. Standardize the product of term frequency and inverse document frequency to obtain the TF-IDF feature vector of each enhanced text.

[0044] It should be noted that the TF-IDF feature extraction function is defined as $f_{\text{tfidf}}: t \mapsto \mathbf{x}_{\text{tfidf}}(t) \in \mathbb{R}^{d_1}$, where $d_1$ is the dimension of the TF-IDF subspace vector. The TF-IDF feature vector is calculated as:

$$\mathbf{x}_{\text{tfidf}}(t) = \left[\omega(g_1, t),\ \omega(g_2, t),\ \ldots,\ \omega(g_{d_1}, t)\right]$$

[0045] where $g_j$ denotes the $j$-th n-gram lexical pattern and $\omega(g_j, t)$ is the standardized weight of the $j$-th n-gram lexical pattern $g_j$ in the enhanced text $t$, namely the standardized product of its term frequency and inverse document frequency.
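
Steps S401 to S403 can be illustrated with a dependency-free Python sketch. Several details are assumptions, since the text does not fix them: the n-gram range (1 to 2 characters), the smoothed IDF variant, and the use of L2 normalization as the "standardization" step. The function names `char_ngrams` and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=2):
    """S401: split a text into character-level n-grams (range is an assumption)."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def tfidf_vectors(texts, n_min=1, n_max=2):
    """S402 and S403: term frequency times IDF, then L2 normalization per text."""
    counts = [Counter(char_ngrams(t, n_min, n_max)) for t in texts]
    vocabulary = sorted(set().union(*counts))
    n_docs = len(texts)
    doc_freq = Counter(g for c in counts for g in c)
    # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1
    idf = {g: math.log((1 + n_docs) / (1 + doc_freq[g])) + 1 for g in vocabulary}
    vectors = []
    for c in counts:
        raw = [c[g] * idf[g] for g in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty texts
        vectors.append([v / norm for v in raw])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(["设备故障", "设备延期"])
```

Character-level n-grams need no word segmentation, which is why a two-character term such as "设备" appears in the vocabulary directly and is shared by both example texts.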

[0046] In some embodiments of the present invention, as shown in Figure 5, the custom risk feature extraction includes: S501. For the set of words in the enhanced text that exist in the word vector vocabulary, extract the embedding vector of each word and perform average pooling to generate the word vector feature, expressed as:

$$\mathbf{v}(t) = \frac{1}{|W_t|} \sum_{w \in W_t} \mathbf{e}_w \in \mathbb{R}^{d_2}$$

[0047] where $W_t$ is the set of words in the enhanced text $t$ that exist in the word vector vocabulary, $\mathbf{e}_w$ is the embedding vector of word $w$, and $d_2$ is the word vector feature dimension; S502. Count the text length, the total frequency of risk keywords, and the number of risk keywords of each category in the enhanced text to obtain the statistical feature, expressed as:

$$\mathbf{s}(t) = \left[\ell(t),\ n(t),\ n_{c_1}(t),\ \ldots,\ n_{c_{|C|}}(t)\right] \in \mathbb{R}^{d_3}$$

[0048] where $\ell(t)$ is the total number of characters in the enhanced text $t$, reflecting the level of detail and information content of the text; $n(t)$ is the total number of occurrences of all risk keywords in the enhanced text $t$; $n_c(t)$ is the number of occurrences of the keywords of risk category $c \in C$, with $C$ = {market risk, policy risk, strategic risk, operational risk, technological risk, financial risk, other risks}; and $d_3$ is the statistical feature dimension; S503. Stack the word vector feature and the statistical feature horizontally to obtain the custom risk feature, expressed as $\mathbf{x}_{\text{custom}}(t) = \left[\mathbf{v}(t),\ \mathbf{s}(t)\right] \in \mathbb{R}^{d_2 + d_3}$.
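
Steps S501 to S503 can be sketched in plain Python as below. The embedding table, the tokens, and the keyword lists are hypothetical toy values, and returning a zero vector when no token is found in the word vector vocabulary is an assumption that the text does not address.

```python
def word_vector_feature(tokens, embeddings, dim):
    """S501: average-pool the embeddings of tokens found in the vocabulary."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # assumed fallback when no token has an embedding
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


def statistical_feature(text, keywords, categories):
    """S502: text length, total keyword hits, and keyword hits per category."""
    per_category = [sum(text.count(w) for w in keywords.get(c, [])) for c in categories]
    return [float(len(text)), float(sum(per_category))] + [float(n) for n in per_category]


def custom_risk_feature(text, tokens, embeddings, dim, keywords, categories):
    """S503: horizontally stack the word vector and statistical features."""
    return word_vector_feature(tokens, embeddings, dim) + statistical_feature(
        text, keywords, categories
    )


# Hypothetical toy inputs.
embeddings = {"debt": [1.0, 0.0], "demand": [0.0, 1.0]}
categories = ["market risk", "financial risk"]
keywords = {"market risk": ["demand"], "financial risk": ["debt"]}
text = "weak demand rising debt"
tokens = text.split()

feature = custom_risk_feature(text, tokens, embeddings, 2, keywords, categories)
```

The resulting vector has the word vector dimension plus two global counts plus one count per category, so its length stays fixed for every text regardless of how many words it contains.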

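[0049] It should be noted that, to balance the impact of the difference in dimension between the TF-IDF features and the custom features, this invention adopts a weighted fusion strategy over the TF-IDF feature space and the custom feature space. Specifically, a balancing coefficient $\lambda$ is introduced; after the custom subspace is weighted, the two vectors are horizontally stacked, and the fused multi-feature vector is expressed as:

$$\mathbf{x}(t) = \left[\mathbf{x}_{\text{tfidf}}(t),\ \lambda\,\mathbf{x}_{\text{custom}}(t)\right]$$

[0050] where $\mathbf{x}_{\text{tfidf}}(t)$ is the TF-IDF feature vector of the enhanced text $t$, $\mathbf{x}_{\text{custom}}(t)$ is its custom risk feature vector, and $\lambda$ is the balancing coefficient.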
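
As a small worked example of the weighted fusion, the sketch below scales only the custom subspace by the balancing coefficient before concatenation, which is one reading of the description above. The coefficient value 0.1 and the input vectors are illustrative assumptions, and `fuse_features` is a hypothetical name.

```python
def fuse_features(tfidf_vec, custom_vec, balance):
    """Scale the custom subspace by the balancing coefficient, then concatenate."""
    return list(tfidf_vec) + [balance * v for v in custom_vec]


tfidf_vec = [0.6, 0.8]                # normalized TF-IDF weights, each within [0, 1]
custom_vec = [0.5, 0.5, 23.0, 2.0]    # pooled embedding plus raw counts (text length 23)
fused = fuse_features(tfidf_vec, custom_vec, balance=0.1)
```

Without the coefficient, a raw count such as the text length (23.0 here) would be far larger in magnitude than any normalized TF-IDF weight; scaling brings the two subspaces into a comparable range.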

[0051] This invention employs character-level n-gram lexical patterns to extract TF-IDF features, enabling the capture of word combination features in Chinese text. It demonstrates excellent representation capabilities for technical terms and compound words specific to the equipment manufacturing field. The extraction of custom risk features considers both the global semantic information of the text and the risk statistics of the equipment manufacturing field, allowing the feature vector to comprehensively represent the risk connotation in the text. Compared to single features, the model's risk prediction classification achieves improved consistency. In summary, by combining word vector features (custom risk features) with TF-IDF features, each standardized text data segment is processed into a feature vector composed of basic text features, keyword count features based on risk categories, and word vector semantic features. These vectors form a structured numerical feature matrix, which constitutes the combined risk features, providing complementary information input for the subsequent gradient boosting tree model. TF-IDF excels at capturing specific, surface-level keywords and N-gram patterns; word vector-based semantic features capture context and semantic information, compensating for the shortcomings of keyword matching; and keyword-based statistical features have the advantage of strong interpretability, introducing domain knowledge into the model. The combination of these three elements forms a composite text feature that enables feature transformation that is friendly to the XGBoost model. Through sample size and model feedback, dynamic adjustment of dimensions is achieved, making the features input to the XGBoost model more consistent with its tree splitting logic, thereby improving model efficiency and accuracy.

[0052] In some embodiments of the present invention, as shown in Figure 6, the adaptive sampling of the fused feature set to generate a balanced feature matrix includes: S601. Count the number of samples of each category in the fused feature set to identify the minority and majority categories. It should be noted that the fused feature set is divided into training, validation, and test sets in proportion, and the number of samples of each category in the training set is counted to identify the minority, majority, and intermediate categories. This distribution is consistent with the overall category distribution of the fused feature set and reflects the class imbalance of the original data.

[0053] It should also be noted that, to address the class imbalance problem in the training set, an adaptive target number of samples is set based on the class distribution of the training set, as shown in the following formula:

$$N_c^{\text{target}} = \begin{cases} \alpha \cdot N_{\max}, & c \in C_{\text{maj}} \\ \max\left(N_{\min},\ \beta \cdot N_c\right), & c \in C_{\text{min}} \\ N_c, & c \in C_{\text{mid}} \end{cases}$$

[0054] where $\alpha$ is the undersampling ratio for the majority classes, used to control the degree of sample compression; $N_{\max}$ is the number of samples in the category with the most samples in the original dataset; $N_{\min}$ is the minimum sample cardinality for oversampling the minority classes, ensuring that the minimum amount of data required for model training is met even if the original sample size is extremely small; $\beta$ is the oversampling factor for the minority classes, used to amplify the original sample size; $N_c$ is the actual number of samples of risk category $c$ in the original dataset; $C_{\text{maj}}$ is the majority class set, namely risk categories with a high proportion and a large number of samples; $C_{\text{min}}$ is the minority class set, namely risk categories with a low proportion and a very small number of samples; and $C_{\text{mid}}$ is the intermediate class set, namely risk categories whose sample size falls between those of the majority and minority classes and whose original sample size is kept unchanged during sampling.

[0055] S602. Oversample the minority class samples using the ADASYN algorithm to generate synthetic samples. It should be noted that the ADASYN algorithm adapts to the distribution of the data itself when generating samples, which addresses the sample imbalance problem. Specifically, for each sample in the minority class set, the proportion of majority-class samples among its K nearest neighbors is calculated to assess its learning difficulty, the number of synthetic samples is allocated according to this difficulty, and the synthetic samples are generated through linear interpolation.

[0056] S603. Undersample the majority class samples and retain a subset of them. It should be noted that, for each class in the majority class set, the target retention number is calculated according to the undersampling ratio, and that number of samples is drawn from the original samples of the class by random sampling without replacement as the retained samples, with the rest discarded. The random sampling process gives every sample an equal probability of being selected, thus maintaining the randomness of the sample distribution.

[0057] S604. Combine the oversampled synthetic samples, the undersampled retained samples, and the original samples of other categories to generate a balanced feature matrix, which is represented as follows:

$$\mathcal{D}_{\text{bal}} = \left\{(\mathbf{x}_i, y_i)\right\}_{i=1}^{N}$$

[0058] where $N$ is the total number of samples in the final balanced training set, $\mathbf{x}_i$ is the feature vector of the $i$-th sample, and $y_i$ is the corresponding category label.
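
The target-count and undersampling parts of steps S601 to S604 can be sketched as follows. This is a sketch under stated assumptions: the majority and minority sets are passed in explicitly because the text gives no thresholds for identifying them, the majority target is taken as the undersampling ratio times the largest class size (capped at the class's own size), and the parameter values are illustrative. `target_counts` and `undersample` are hypothetical names.

```python
import random
from collections import Counter


def target_counts(labels, majority, minority, alpha=0.5, beta=2.0, n_min=5):
    """Per-class target sample counts for adaptive sampling (assumed rule)."""
    counts = Counter(labels)
    n_max = max(counts.values())
    targets = {}
    for c, n in counts.items():
        if c in majority:
            # Compress majority classes; never ask for more samples than exist.
            targets[c] = min(n, int(alpha * n_max))
        elif c in minority:
            # Amplify minority classes, but keep at least n_min samples.
            targets[c] = max(n_min, int(beta * n))
        else:
            targets[c] = n  # intermediate classes stay unchanged
    return targets


def undersample(samples, labels, targets, majority, seed=0):
    """Random sampling without replacement for the majority classes only."""
    rng = random.Random(seed)
    kept = []
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        if c in majority:
            idx = rng.sample(idx, targets[c])
        kept.extend(idx)
    kept.sort()
    return [samples[i] for i in kept], [labels[i] for i in kept]


labels = ["A"] * 10 + ["B"] * 4 + ["C"] * 2
samples = [[float(i)] for i in range(len(labels))]
targets = target_counts(labels, majority={"A"}, minority={"C"})
X_kept, y_kept = undersample(samples, labels, targets, majority={"A"})
```

After this step, class "A" is reduced to its target of 5 samples while "B" and "C" are untouched; class "C" still needs 3 synthetic samples from the oversampling step to reach its target of 5.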

[0059] This invention employs the ADASYN algorithm to oversample minority class samples, generating synthetic samples. The adaptive nature of the ADASYN algorithm avoids the limitations of traditional algorithms that generate samples uniformly, and can specifically enhance difficult-to-classify boundary regions in the minority class. Undersampling of majority class samples reduces the sample size of the majority class while preserving its core distribution characteristics. The combination of these two approaches effectively enhances the model's ability to perceive key features of the minority class, thereby mitigating model bias caused by data imbalance while maintaining performance in the majority class. This significantly improves the prediction performance of the minority class, enabling the subsequent XGBoost model to learn the feature patterns of various risks equally, helping enterprises identify potential and difficult-to-identify risks.
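
The adaptive allocation that distinguishes ADASYN from uniform oversampling can be shown with a simplified, dependency-free sketch. It is not a full implementation: every non-minority neighbor counts as a "majority" neighbor (an assumption for the multi-class case), interpolation partners are chosen among the nearest minority samples, and rounding means the number of generated samples may differ slightly from the requested number. The data and the function name `adasyn` are hypothetical.

```python
import math
import random


def adasyn(X, y, minority, n_new, k=3, seed=0):
    """Simplified ADASYN-style oversampling for one minority class."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]

    def nearest(i, candidates):
        # k nearest points to X[i] among the candidate indices, excluding i itself
        others = [j for j in candidates if j != i]
        others.sort(key=lambda j: math.dist(X[i], X[j]))
        return others[:k]

    # Learning difficulty: share of non-minority points among the k nearest neighbors.
    ratios = []
    for i in min_idx:
        nbrs = nearest(i, range(len(X)))
        ratios.append(sum(y[j] != minority for j in nbrs) / len(nbrs))
    total = sum(ratios)
    if total == 0:
        weights = [1 / len(min_idx)] * len(min_idx)  # no hard samples: spread evenly
    else:
        weights = [r / total for r in ratios]

    synthetic = []
    for i, w in zip(min_idx, weights):
        partners = nearest(i, min_idx)
        for _ in range(round(w * n_new)):
            j = rng.choice(partners)
            gap = rng.random()  # interpolation position in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9],
     [0.9, 1.1], [1.2, 1.2], [0.8, 0.9], [0.3, 0.3]]
y = ["min", "min", "min", "maj", "maj", "maj", "maj", "maj", "maj"]
synthetic = adasyn(X, y, minority="min", n_new=6)
```

Because each synthetic point lies on the segment between two minority samples, it stays inside the region already occupied by the minority class, while samples with more majority neighbors receive a larger share of the new points.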

[0060] In some embodiments of the present invention, as shown in Figure 7, the steps of training an XGBoost model using the balanced feature matrix, predicting the original probability distribution with the trained XGBoost model from the balanced feature matrix generated from the text data to be predicted, performing post-processing calibration on the original probability distribution based on the importance of risk categories, and determining the final risk prediction category according to the maximum a posteriori probability criterion include: S701. Train an XGBoost multi-class classification model using the balanced feature matrix, and use the trained model to predict the original probability distribution from the balanced feature matrix generated from the text data to be predicted. It should be noted that, when the XGBoost multi-class classification model is trained on the balanced feature matrix, the objective function is defined as:

$$\mathcal{L} = \sum_{i=1}^{N} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega\left(f_k\right)$$

[0061] where $l(y_i, \hat{y}_i)$ is the multi-class log loss function for the $i$-th sample and $\Omega(f_k)$ is the regularization term of the $k$-th tree $f_k$. The model training process employs an early stopping strategy, optimizing model parameters based on validation set performance.

[0062] It should be noted that for the text data to be predicted, the same preprocessing operations as the training set text are first performed, including stop word removal, word segmentation, and removal of meaningless redundant information to form standardized text to be predicted, ensuring that the preprocessing process is completely consistent with the training stage. Then, for the standardized text to be predicted, TF-IDF feature extraction and custom risk feature extraction operations, consistent with the training stage, are performed. The TF-IDF feature vector of the text to be predicted and the custom risk feature vector are horizontally stacked to obtain the fused feature vector of the text to be predicted. The fusion method and feature dimensions are consistent with the fused feature vector in the training stage, ensuring that the structure of the fused features is consistent with the fused features in the training set. Since the text to be predicted is a single sample or batch sample, adaptive sampling is not required, but feature alignment is needed according to the feature dimensions and transformation rules determined in the training stage to generate a balanced feature matrix consistent with the feature space in the training stage. This balanced feature matrix serves as the input to the trained XGBoost model.

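[0063] Finally, for each sample in the balanced feature matrix generated from the text data to be predicted, the original score vector output by the XGBoost multi-class classification model is $\mathbf{z} = \left[z_1, z_2, \ldots, z_{|C|}\right]$, which is converted into a probability distribution $\mathbf{p} = \left[p_1, p_2, \ldots, p_{|C|}\right]$ by the softmax function as follows: $p_c = \exp(z_c) / \sum_{j=1}^{|C|} \exp(z_j)$.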

[0064] S702. Predefine a category importance weight vector $\mathbf{w} = \left[w_1, w_2, \ldots, w_{|C|}\right]$ based on the risk characteristics of the equipment manufacturing field; S703. Weight and adjust the original probability distribution $\mathbf{p}$ using the weight vector, and normalize the weighting result to generate a calibrated probability distribution, expressed as:

$$\tilde{\mathbf{p}} = \frac{\mathbf{w} \odot \mathbf{p}}{\lVert \mathbf{w} \odot \mathbf{p} \rVert_1}$$

[0065] where $\odot$ represents the Hadamard product. It should be noted that, to ensure the calibrated probability distribution satisfies the Kolmogorov probability axioms, the weighted result is $L_1$-norm normalized, which is expressed by the following formula: $\tilde{p}_c = w_c\, p_c / \sum_{j=1}^{|C|} w_j\, p_j$.

[0066] S704. Based on the maximum a posteriori probability criterion, the category with the highest probability after calibration is determined as the final risk prediction category and output. The formula is expressed as follows: $\hat{y} = \arg\max_{c} \tilde{p}_c$.
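
The softmax conversion and steps S702 to S704 can be illustrated with a short Python sketch. The category names, scores, and importance weights are hypothetical; the weights are assumed to be non-negative with at least one positive entry so that the normalization is well defined.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate(probs, weights):
    """S703: element-wise weighting followed by L1 normalization."""
    weighted = [w * p for w, p in zip(weights, probs)]
    total = sum(weighted)
    return [v / total for v in weighted]


def predict(scores, weights, categories):
    """S704: pick the category with the highest calibrated probability."""
    calibrated = calibrate(softmax(scores), weights)
    best = max(range(len(categories)), key=calibrated.__getitem__)
    return categories[best], calibrated


categories = ["market risk", "financial risk", "operational risk"]
scores = [1.0, 0.9, 0.0]       # hypothetical raw model scores
weights = [1.0, 1.5, 1.0]      # hypothetical category importance weights

raw_probs = softmax(scores)
label, calibrated = predict(scores, weights, categories)
```

In this example the uncalibrated distribution slightly favors market risk, but the higher importance weight on financial risk moves the final decision to that category while the calibrated values still sum to one.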

[0067] In some embodiments of this invention, the XGBoost model uses five-fold cross-validation for parameter optimization during training and adopts an early stopping strategy: training stops when the validation set loss fails to improve for 10 consecutive rounds, and the model version with the minimum validation set loss is saved as the final model. After model training is complete, the original probability distribution is output for each input sample, corresponding to the probabilities of the six risk classes.
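
The early stopping rule can be made concrete with a small library-free sketch that operates on a list of per-round validation losses. The loss values are made up for illustration, `early_stopping_round` is a hypothetical helper, and treating a tie as "no improvement" is an assumption.

```python
def early_stopping_round(val_losses, patience=10):
    """Return (best_round, stop_round) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive rounds; the best round is the one with the minimum loss.
    """
    best_loss = float("inf")
    best_round = -1
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, rnd
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(val_losses) - 1


# Hypothetical losses: improvement until round 3, then a plateau.
losses = [1.0, 0.8, 0.7, 0.65] + [0.66] * 20
best_round, stop_round = early_stopping_round(losses, patience=10)
```

With these values the minimum is reached at round 3 and training stops at round 13, after ten rounds without improvement; the model from round 3 is the one that would be kept. In practice the XGBoost library provides an `early_stopping_rounds` option that applies the same kind of rule during training.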

[0068] This invention employs domain knowledge to statically calibrate the model output, effectively improving the sensitivity of identifying key risk categories without altering the model's structure. Compared to directly adjusting the loss function weights, this post-processing method is more flexible, allowing for weight configuration adjustments based on different application scenarios without affecting the model's training stability.

[0069] To verify the XGBoost model's ability to handle imbalanced multi-class data, this invention also designed three comparative experiments based on the baseline results.

[0070] First, the model parameters were changed to compare the performance of the model before and after parameter optimization. Second, the dataset partitioning method was changed to compare the model's prediction performance under time series partitioning and stratified random sampling. Third, based on the framework of this invention, the prediction performance of the XGBoost model under the TF-IDF feature extraction strategy and the AME feature extraction strategy was compared to verify the performance improvement of the multi-feature extraction, data balancing, and data augmentation methods proposed in this invention over traditional feature extraction methods. In addition, the dictionary construction method based on word frequency statistics was compared with the equipment manufacturing risk dictionary constructed in this invention in terms of their impact on the prediction performance of the model. Finally, the prediction performance of the XGBoost, Random Forest, and LightGBM models was compared under the same feature extraction method and dictionary construction conditions to verify the superior prediction performance of XGBoost over the other models.

[0071] To better implement the risk prediction method for equipment manufacturing enterprises in the embodiments of this invention, and corresponding to that method, as shown in Figure 8, an embodiment of the invention also provides a risk prediction device for equipment manufacturing enterprises. The equipment manufacturing enterprise risk prediction device 800 includes: a data processing module, used to acquire text data from equipment manufacturing enterprises, construct a risk dictionary for equipment manufacturing enterprises based on knowledge in the equipment manufacturing field, and use the risk dictionary to annotate the text data to generate tagged text data; a feature engineering module, used to perform enhancement processing on the tagged text data based on a pre-trained language model, extract the TF-IDF features and custom risk features of the enhanced text and perform weighted fusion, and adaptively sample the fused feature set to generate a balanced feature matrix; and a model prediction module, used to train an XGBoost model using the balanced feature matrix, predict the original probability distribution with the trained XGBoost model from the balanced feature matrix generated from the text data to be predicted, perform post-processing calibration on the original probability distribution based on the importance of risk categories, and determine the final risk prediction category according to the maximum a posteriori probability criterion.

[0072] The equipment manufacturing enterprise risk prediction device 800 provided in the above embodiments can realize the technical solutions described in the above equipment manufacturing enterprise risk prediction method embodiments. The specific implementation principles of each module or unit can be found in the corresponding content in the above equipment manufacturing enterprise risk prediction method embodiments, which will not be repeated here.

[0073] As shown in Figure 9, the present invention also provides an electronic device 900. The electronic device 900 includes a processor 901, a memory 902, and a display 903. Figure 9 shows only some components of the electronic device 900, but it should be understood that not all of the components shown are required, and more or fewer components may be implemented instead.

[0074] In some embodiments, processor 901 may be a central processing unit (CPU), microprocessor, or other data processing chip, used to run program code stored in memory 902 or process data, such as the equipment manufacturing enterprise risk prediction method of the present invention.

[0075] In some embodiments, processor 901 may be a single server or a group of servers. The server group may be centralized or distributed. In some embodiments, processor 901 may be local or remote. In some embodiments, processor 901 may be implemented on a cloud platform. In one embodiment, the cloud platform may include a private cloud, public cloud, hybrid cloud, community cloud, distributed cloud, intranet, multi-cloud, etc., or any combination thereof.

[0076] In some embodiments, memory 902 may be an internal storage unit of electronic device 900, such as a hard disk or memory of electronic device 900. In other embodiments, memory 902 may also be an external storage device of electronic device 900, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc. equipped on electronic device 900.

[0077] Furthermore, the memory 902 may include both internal storage units of the electronic device 900 and external storage devices. The memory 902 is used to store application software and various types of data installed on the electronic device 900.

[0078] In some embodiments, display 903 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an OLED (Organic Light-Emitting Diode) touchscreen. Display 903 is used to display information from electronic device 900 and to display a visual user interface. Components 901-903 of electronic device 900 communicate with each other via a system bus.

[0079] In one embodiment, when the processor 901 executes the equipment manufacturing enterprise risk prediction program in the memory 902, the following steps can be implemented: Text data of equipment manufacturing enterprises is acquired, a risk dictionary of equipment manufacturing enterprises is constructed based on knowledge of the equipment manufacturing field, and the text data is annotated using the risk dictionary to generate tagged text data. The labeled text data is augmented based on a pre-trained language model. The TF-IDF features of the augmented text and the custom risk features are extracted and weighted and fused. The fused feature set is then adaptively sampled to generate a balanced feature matrix. The XGBoost model is trained using the balanced feature matrix. The original probability distribution is obtained by predicting the text data to be predicted based on the trained model. The original probability distribution is then post-processed and calibrated based on the importance of risk categories. Finally, the risk prediction category is determined according to the maximum a posteriori probability criterion.

[0080] It should be understood that when the processor 901 executes the equipment manufacturing enterprise risk prediction program in the memory 902, in addition to the functions mentioned above, it can also perform other functions, as can be found in the description of the corresponding method embodiments above.

[0081] Furthermore, this embodiment of the invention does not specifically limit the type of electronic device 900 mentioned. Electronic device 900 can be a mobile phone, tablet computer, personal digital assistant (PDA), wearable device, laptop computer, or other portable electronic device. Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running iOS, Android, Microsoft, or other operating systems. The aforementioned portable electronic device can also be other portable electronic devices, such as a laptop computer with a touch-sensitive surface (e.g., a touch panel). It should also be understood that in some other embodiments of the invention, electronic device 900 may not be a portable electronic device, but rather a desktop computer with a touch-sensitive surface (e.g., a touch panel).

[0082] Accordingly, this application also provides a computer-readable storage medium for storing computer-readable programs or instructions. When the programs or instructions are executed by a processor, they can implement the steps or functions of the equipment manufacturing enterprise risk prediction method provided in the above-described method embodiments.

[0083] Those skilled in the art will understand that all or part of the processes of the methods described in the above embodiments can be implemented by a computer program instructing related hardware (such as a processor, controller, etc.), and the computer program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a disk, optical disk, read-only memory, or random access memory, etc.

[0084] The above provides a detailed description of the risk prediction method, device, electronic equipment, and storage medium for equipment manufacturing enterprises provided by this invention. Specific examples have been used to illustrate the principles and implementation methods of this invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this invention. Therefore, the content of this specification should not be construed as a limitation of this invention.

Claims

1. A risk prediction method for equipment manufacturing enterprises, characterized in that it comprises: acquiring text data of equipment manufacturing enterprises, constructing a risk dictionary of equipment manufacturing enterprises based on knowledge of the equipment manufacturing field, and using the risk dictionary to annotate the text data to generate tagged text data; augmenting the tagged text data based on a pre-trained language model, extracting the TF-IDF features and custom risk features of the augmented text and performing weighted fusion, and adaptively sampling the fused feature set to generate a balanced feature matrix; and training an XGBoost model using the balanced feature matrix, using the trained XGBoost model to predict on the balanced feature matrix generated from the text data to be predicted to obtain the original probability distribution, performing post-processing calibration on the original probability distribution based on the importance of risk categories, and determining the final risk prediction category according to the maximum a posteriori probability criterion.

2. The risk prediction method for equipment manufacturing enterprises according to claim 1, characterized in that constructing the risk dictionary for equipment manufacturing enterprises based on knowledge in the equipment manufacturing field includes: selecting six types of risks and their corresponding seed words, based on the COSO-ERM framework and ISO and IEC international standards and combined with the characteristics of the equipment manufacturing industry, to form the risk meta-dictionary of the equipment manufacturing industry; using the BERT large language model to generate expanded words for the seed words based on mainstream media news text; and merging the risk meta-dictionary of the equipment manufacturing industry with the expanded words to obtain the risk dictionary for the equipment manufacturing industry.

3. The risk prediction method for equipment manufacturing enterprises according to claim 2, characterized in that using the BERT large language model to generate expanded words for the seed words based on mainstream media news text includes: preprocessing data from mainstream news media to obtain standardized text; inputting the standardized text into the BERT large language model to generate a word vector table, and extracting the seed word vectors corresponding to the seed words; and calculating the cosine similarity between each word in the word vector table and the seed word vector, and selecting words with a similarity greater than a preset threshold as expanded words.

4. The risk prediction method for equipment manufacturing enterprises according to claim 1, characterized in that the TF-IDF feature extraction includes: defining a TF-IDF feature extraction function that processes the enhanced text based on character-level n-grams to obtain the n-gram lexical patterns of each text; calculating the term frequency and inverse document frequency of each n-gram lexical pattern in the enhanced text; and standardizing the product of term frequency and inverse document frequency to obtain the TF-IDF feature vector of each enhanced text.

5. The risk prediction method for equipment manufacturing enterprises according to claim 1, characterized in that the custom risk feature extraction includes: for the set of words in the enhanced text that exist in the word vector vocabulary, extracting the embedding vector of each word and performing average pooling to generate word vector features; counting the text length, the total frequency of risk keywords, and the number of risk keywords of each category in the enhanced text to obtain statistical features; and horizontally stacking the word vector features and the statistical features to obtain the custom risk feature.

6. The risk prediction method for equipment manufacturing enterprises according to claim 1, characterized in that adaptively sampling the fused feature set to generate a balanced feature matrix includes: counting the number of samples of each category in the fused feature set to identify the minority and majority categories; oversampling the minority category samples using the ADASYN algorithm to generate synthetic samples; undersampling the majority category samples and retaining a subset of them; and combining the synthetic samples generated by oversampling, the samples retained by undersampling, and the original samples of the other categories to generate a balanced feature matrix.

7. The risk prediction method for equipment manufacturing enterprises according to claim 6, characterized in that the steps of training an XGBoost model using the balanced feature matrix, predicting the original probability distribution with the trained XGBoost model from the balanced feature matrix generated from the text data to be predicted, performing post-processing calibration on the original probability distribution based on the importance of risk categories, and determining the final risk prediction category according to the maximum a posteriori probability criterion include: training an XGBoost multi-class classification model using the balanced feature matrix, and using the trained XGBoost model to predict on the balanced feature matrix generated from the text data to be predicted to obtain the original probability distribution; predefining a category importance weight vector based on the risk characteristics of the equipment manufacturing field; weighting and adjusting the original probability distribution using the weight vector, and normalizing the weighting result to generate a calibrated probability distribution; and, based on the maximum a posteriori probability criterion, determining the category with the highest probability after calibration as the final risk prediction category and outputting it.

8. A risk prediction device for equipment manufacturing enterprises, characterized in that it comprises: a data processing module, used to acquire text data from equipment manufacturing enterprises, construct a risk dictionary for equipment manufacturing enterprises based on knowledge in the equipment manufacturing field, and use the risk dictionary to annotate the text data to generate tagged text data; a feature engineering module, used to perform enhancement processing on the tagged text data based on a pre-trained language model, extract the TF-IDF features and custom risk features of the enhanced text and perform weighted fusion, and adaptively sample the fused feature set to generate a balanced feature matrix; and a model prediction module, used to train an XGBoost model using the balanced feature matrix, predict the original probability distribution with the trained XGBoost model from the balanced feature matrix generated from the text data to be predicted, perform post-processing calibration on the original probability distribution based on the importance of risk categories, and determine the final risk prediction category according to the maximum a posteriori probability criterion.

9. An electronic device, characterized in that it comprises a memory and a processor, wherein the memory is used to store a program, and the processor, coupled to the memory, is used to execute the program stored in the memory to implement the steps in the risk prediction method for equipment manufacturing enterprises as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it is used to store computer-readable programs or instructions which, when executed by a processor, implement the steps in the risk prediction method for equipment manufacturing enterprises as described in any one of claims 1 to 7.