Privacy protection analysis method and device based on federated learning, equipment and medium
By constructing a data parsing and parameter encryption interaction mechanism in a federated learning network, the problem of data isolation between medical institutions is solved, enabling efficient and accurate risk assessment and underwriting decisions under the protection of privacy, thereby improving underwriting efficiency and ensuring privacy security.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- PING AN HEALTH INSURANCE CO LTD
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-19
AI Technical Summary
In health insurance underwriting, data isolation and privacy protection restrictions between medical institutions prevent cross-institutional sharing, leading to risk assessment relying on incomplete self-reported information. This results in low underwriting efficiency and a high risk of privacy breaches. Existing technologies struggle to achieve automated and accurate risk assessment while maintaining privacy compliance.
A privacy-preserving analysis method based on federated learning is constructed. By configuring a data parsing mechanism and a parameter encryption interaction mechanism in the federated learning network, local parsing of data nodes and interaction of encrypted model parameters are achieved, generating standardized feature data and training anomaly analysis models. The results are then verified in conjunction with business conditions to generate anomaly quantification results.
While meeting privacy compliance requirements, it improves underwriting efficiency and risk identification accuracy, enabling efficient risk assessment and underwriting decisions across institutions and avoiding the risk of privacy leaks due to the external transmission of raw data.
Figure CN122241737A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of intelligent decision-making technology, and in particular to a privacy-preserving analysis method, apparatus, device, and medium based on federated learning.
Background Technology
[0002] In health insurance underwriting scenarios, insurance institutions need to complete risk assessments and underwriting decisions based on the insured's past medical records, examination records, and treatment information. However, data silos are common among medical institutions. Core health data, such as electronic medical records, are limited by privacy protection, data security, and institutional management boundaries, making it difficult to share and collaborate across institutions and regions. Insurance institutions typically cannot directly obtain complete and accurate health information during underwriting and can only rely on self-reported information from policyholders or scattered supporting documents. The completeness and accuracy of this information are difficult to guarantee, resulting in a significant lack of basic data for risk assessment.
[0003] In actual underwriting processes, a large number of unstructured medical records, examination reports, and treatment records need to be read and understood manually, line by line. Underwriters need to extract key risk factors from complex medical information and make judgments, making the process cumbersome and time-consuming. As the scale of health insurance underwriting expands, the traditional underwriting model that relies on manual analysis and judgment can hardly meet the needs of high-concurrency business. The underwriting cycle is long and inefficient, affecting the insurance application experience while also increasing the operating costs and manpower burden of insurance institutions.
[0004] Meanwhile, some existing intelligent underwriting systems typically rely on centralized data aggregation for model training and risk assessment. This centralized data storage and processing poses significant privacy risks and fails to meet the compliance requirements for medical and health data in financial scenarios. Because there is no technical means to both utilize multi-source medical data for risk modeling and complete model training and analysis without the data leaving its local source, existing technologies struggle to balance privacy protection, risk assessment accuracy, and underwriting automation.
Summary of the Invention
[0005] The main objective of this invention is to provide a privacy-preserving analysis method, apparatus, device, and storage medium based on federated learning, which aims to solve the technical problem that existing health insurance underwriting technologies struggle to achieve automated, accurate, and efficient risk assessment and underwriting decisions using real and complete health data under the conditions that medical data cannot be shared and privacy compliance must be met.
[0006] To achieve the above objectives, this invention provides a privacy-preserving analysis method based on federated learning, comprising:
constructing a federated learning network that includes a coordination node, multiple data nodes, and business nodes, and configuring a data parsing mechanism and a parameter encryption interaction mechanism in the federated learning network;
performing, through the data parsing mechanism, structured processing on the unstructured raw sample data stored in the data nodes to generate standardized feature data for training;
controlling, based on the parameter encryption interaction mechanism, the data nodes and the coordination node to perform encrypted model parameter interaction, and training and generating an anomaly analysis model using the standardized feature data for training;
obtaining the identification information of the object to be processed, and extracting and processing, through the data parsing mechanism, the target unstructured raw data associated with the identification information in the data nodes to generate target standardized feature data;
inputting the target standardized feature data into the anomaly analysis model and performing joint verification in conjunction with preset business conditions to generate anomaly quantification results; and
determining a processing decision based on the anomaly quantification results, and feeding the processing decision back to the business node.
[0007] Furthermore, to achieve the above objectives, the present invention provides a privacy-preserving analysis apparatus based on federated learning, comprising:
a network initialization module, used to construct a federated learning network containing a coordination node, multiple data nodes, and business nodes, and to configure a data parsing mechanism and a parameter encryption interaction mechanism in the federated learning network;
a sample parsing module, used to perform structured processing on the unstructured raw sample data stored in the data nodes through the data parsing mechanism to generate standardized feature data for training;
a federated training module, used to control the data nodes and the coordination node to perform encrypted model parameter interaction based on the parameter encryption interaction mechanism, and to train and generate an anomaly analysis model using the standardized feature data for training;
a target parsing module, used to obtain the identification information of the object to be processed, and to extract and process, through the data parsing mechanism, the target unstructured raw data associated with the identification information in the data nodes to generate target standardized feature data;
an anomaly determination module, used to input the target standardized feature data into the anomaly analysis model and perform joint verification in combination with preset business conditions to generate anomaly quantification results; and
a decision feedback module, used to determine the processing decision based on the anomaly quantification results and to feed the processing decision back to the business node.
[0008] Furthermore, to achieve the above objectives, the present invention also provides a computer device, the computer device including a memory, a processor, and a privacy-preserving analysis program based on federated learning stored in the memory and executable on the processor, wherein when the privacy-preserving analysis program based on federated learning is executed by the processor, it implements the steps of the privacy-preserving analysis method based on federated learning as described above.
[0009] Furthermore, to achieve the above objectives, the present invention also provides a computer-readable storage medium storing a privacy-preserving analysis program based on federated learning, wherein the privacy-preserving analysis program based on federated learning, when executed by a processor, implements the steps of the privacy-preserving analysis method based on federated learning as described above.
[0010] Beneficial Effects: This invention relates to the field of intelligent decision-making technology, and discloses a privacy-preserving analysis method, apparatus, device, and medium based on federated learning. The method includes: constructing a federated learning network comprising a coordination node, multiple data nodes, and business nodes; configuring a data parsing mechanism and a parameter encryption interaction mechanism; using the data parsing mechanism to perform structured processing on the unstructured raw sample data stored by the data nodes, generating standardized feature data for training; achieving encrypted model parameter interaction through the parameter encryption interaction mechanism to train and generate an anomaly analysis model; obtaining the identification information of the object to be processed and generating target standardized feature data; inputting the target standardized feature data into the anomaly analysis model and verifying it in conjunction with business conditions to generate anomaly quantification results; and determining the processing decision based on the anomaly quantification results and feeding it back to the business nodes. This invention can be applied to business scenarios such as medical, financial, and insurance. Through local data node parsing and encrypted parameter interaction, it achieves model training and risk analysis without external transmission of raw data, improving underwriting efficiency and risk identification accuracy while meeting privacy compliance requirements.
Attached Figure Description
[0011] The present invention will be further described below with reference to the accompanying drawings and embodiments. In the accompanying drawings:
Figure 1 is a schematic diagram of an application environment for a privacy-preserving analysis method based on federated learning in one embodiment of the present invention;
Figure 2 is a flowchart illustrating an embodiment of the privacy-preserving analysis method based on federated learning of the present invention;
Figure 3 is a schematic diagram of the functional modules of a preferred embodiment of the privacy-preserving analysis device based on federated learning of the present invention;
Figure 4 is a schematic diagram of the structure of a computer device according to an embodiment of the present invention;
Figure 5 is another structural schematic diagram of a computer device according to one embodiment of the present invention.
Detailed Implementation
[0012] It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the scope of the invention.
[0013] The privacy-preserving analysis method based on federated learning provided in this invention can be applied in, for example, the application environment shown in Figure 1, in which the client communicates with the server via a network. The server can construct a federated learning network containing a coordination node, multiple data nodes, and business nodes through the client, and configure a data parsing mechanism and a parameter encryption interaction mechanism. The data parsing mechanism is used to structure the unstructured raw sample data stored by the data nodes, generating standardized feature data for training. The parameter encryption interaction mechanism is used to exchange encrypted model parameters so that an anomaly analysis model is trained and generated. The identification information of the object to be processed is obtained, and target standardized feature data is generated. The target standardized feature data is input into the anomaly analysis model and verified in conjunction with business conditions to generate anomaly quantification results. Based on the anomaly quantification results, a processing decision is determined and fed back to the business nodes. This invention can be applied to business scenarios such as medical, financial, and insurance. Through local data node parsing and encrypted parameter interaction, it enables model training and risk analysis without external transmission of raw data, improving underwriting efficiency and risk identification accuracy while meeting privacy compliance requirements. The client can be, but is not limited to, a personal computer, laptop, smartphone, tablet, or portable wearable device. The server can be implemented as a standalone server or as a server cluster consisting of multiple servers. The following detailed description of specific embodiments further illustrates this invention.
[0014] Please refer to Figure 2, which is a flowchart illustrating an embodiment of the privacy-preserving analysis method based on federated learning provided by the present invention. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order than that shown here.
[0015] As shown in Figure 2, the privacy-preserving analysis method based on federated learning proposed in this invention includes the following steps: S10, construct a federated learning network containing a coordination node, multiple data nodes, and business nodes, and configure a data parsing mechanism and a parameter encryption interaction mechanism in the federated learning network. In this embodiment, the federated learning network comprising a coordination node, multiple data nodes, and business nodes is the basic architecture for achieving multi-source data collaborative analysis while protecting privacy. The coordination node is typically deployed in an environment controlled by the data user, such as an insurance company's data center or cloud server, and runs a federated learning coordination service. This service is responsible for handling registration requests from data nodes and business nodes, assigning a unique network identifier to each participant, and maintaining a node registry. Based on this registry, a point-to-point secure transport layer protocol connection is established between the coordination node and each data node. This connection is encrypted using a transport layer security protocol to ensure the confidentiality and integrity of all subsequent parameter interactions. An application layer connection is established between the coordination node and the business nodes, such as an application programming interface based on Hypertext Transfer Protocol (HTTP), for transmitting business instructions and decision results. Data nodes are deployed within the internal environment of the original data holder, such as the medical information system server of a partner hospital. Each node independently runs a federated learning client program. 
The core function of the data nodes is to locally store and process unstructured raw data protected by privacy regulations, such as electronic medical record texts and examination report images, and to ensure that this raw data remains on local storage media throughout the entire federated learning process without physical transfer. The business node is the initiator and receiver of the analysis task. In the medical finance and insurance scenario, this node can be the insurance company's automated underwriting system or the underwriter's workbench. It submits a processing request containing the identification information of the object to be evaluated to the coordination node through the established application layer connection.
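The registration behavior described for the coordination node can be sketched as follows. This is a minimal illustration only: the class name `NodeRegistry`, its methods, and the example addresses are hypothetical and do not come from the patent text, which does not specify a data structure for the node registry.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class NodeRegistry:
    """Hypothetical in-memory node registry kept by the coordination node."""
    nodes: dict = field(default_factory=dict)

    def register(self, role: str, address: str) -> str:
        """Handle a registration request and return a unique network identifier."""
        if role not in ("data", "business"):
            raise ValueError(f"unknown role: {role}")
        node_id = uuid.uuid4().hex
        self.nodes[node_id] = {"role": role, "address": address}
        return node_id

    def by_role(self, role: str) -> list:
        """List identifiers of all registered nodes with the given role."""
        return [nid for nid, info in self.nodes.items() if info["role"] == role]


# Placeholder addresses, for illustration only.
registry = NodeRegistry()
hospital_a = registry.register("data", "hospital-a.example:8443")
hospital_b = registry.register("data", "hospital-b.example:8443")
underwriting = registry.register("business", "underwriting.example:443")
```

In a real deployment the registry would presumably be persisted and each entry tied to the node's transport-layer credentials; the sketch only shows identifier assignment and lookup by role.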
[0016] Configure a data parsing mechanism within the established federated learning network. This mechanism is a set of software components deployed locally on each data node. Its core objective is to automatically transform the diverse, unstructured raw data stored on the nodes into uniformly formatted, machine-readable structured feature data. This mechanism typically includes a data scheduling module, a text preprocessing engine, and a deep semantic extraction module. The data scheduling module reads specified raw data records from the local database or file system according to task instructions. The text preprocessing engine performs a series of cleaning and standardization operations on the read text content, including removing irrelevant characters, standardizing text encoding, performing word segmentation, and obfuscating sensitive information such as direct identifiers according to preset rules. The deep semantic extraction module is a computational unit built based on a pre-trained natural language processing model, such as a transformer model architecture finely tuned on a large amount of medical literature and medical record data. This module performs contextual encoding on the preprocessed text, identifies key entity boundaries and types in the text through sequence labeling technology, and extracts semantic relationships between entities through a relation classification network. In the field of health insurance underwriting, the identified entities can include disease diagnosis names, surgical records, medication history, and laboratory test values, while relationships include associations such as disease and symptoms, treatment and effects. The extracted entities and their sets of relationships are then mapped to a predefined, standardized feature labeling system that meets the needs of insurance risk assessment. 
This labeling system is defined by the business side, specifying the health dimensions to be focused on and their quantification methods; for example, mapping "history of diabetes" to a binary feature and "systolic blood pressure reading" to a continuous numerical feature. Through the operation of the data parsing mechanism, the messy raw medical text is transformed into a set of numerical feature vectors with fixed dimensions and clear meanings.
[0017] A parameter encryption interaction mechanism is configured in the federated learning network. This mechanism aims to ensure that, when model parameters or gradient updates are exchanged between participating nodes during federated learning, the original data of any data node is not leaked, thus meeting strict privacy protection compliance requirements. The parameter encryption interaction mechanism is implemented by cooperating software components deployed on the coordinating node and on each data node. On the coordinating node side, this mechanism includes a key management module and a ciphertext aggregation module. The key management module is responsible for generating cryptographic material, such as public and private key pairs for the Paillier encryption algorithm, which supports additively homomorphic operations, and can generate noise parameters to meet differential privacy requirements. The distribution and scale of the noise are determined according to a preset privacy budget. The generated public key and differential privacy noise parameters are distributed to each data node participating in training via the secure transport layer protocol connection. The ciphertext aggregation module is responsible for receiving the model parameter updates that each data node has encrypted and perturbed with noise, and for performing aggregation calculations on the ciphertexts, such as a weighted average over multiple encrypted parameter vectors. On the data node side, this mechanism includes a local encryption client and a noise injection unit. The local encryption client uses the homomorphic encryption public key obtained from the coordinating node to encrypt the model parameter updates (such as gradient vectors) generated during local training, producing ciphertext parameters. 
The noise injection unit, based on the differential privacy noise parameters issued by the coordinating node, injects random noise conforming to a specific distribution (such as a Gaussian or Laplace distribution) into the encrypted parameters. The encrypted and noise-added parameters are then uploaded to the coordinating node via a previously established secure transport layer protocol connection. This entire interaction process ensures that the original data remains locally on the data nodes, while only strictly privacy-protected ciphertext parameters flow between nodes, fundamentally preventing the possibility of reverse-engineering the original training data from the interaction parameters.
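The encrypt, inject-noise, and aggregate flow above can be illustrated with a toy additively homomorphic scheme. The sketch below assumes the Paillier construction with generator g = n + 1, a fixed-point encoding of gradients (`SCALE`), and Gaussian noise added homomorphically to the ciphertext; all function names are hypothetical. The key size is deliberately tiny and insecure, so this is a demonstration of the arithmetic only, not a usable implementation.

```python
import math
import random
import secrets

# Toy parameters: two Mersenne primes. Real deployments need >= 2048-bit moduli
# and a vetted cryptographic library.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                                          # public key
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p-1, q-1), private key
MU = pow(LAM, -1, N)                               # valid for generator g = N + 1
SCALE = 10**6                                      # fixed-point scale (assumption)


def encrypt(x: float) -> int:
    """Encrypt a real number: c = (1 + m*N) * r^N mod N^2."""
    m = round(x * SCALE) % N
    r = secrets.randbelow(N - 1) + 1
    while math.gcd(r, N) != 1:
        r = secrets.randbelow(N - 1) + 1
    return ((1 + m * N) % N2) * pow(r, N, N2) % N2


def decrypt(c: int) -> float:
    """Decrypt with (LAM, MU) and undo the signed fixed-point encoding."""
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    if m > N // 2:
        m -= N
    return m / SCALE


def encrypt_update(grad, sigma):
    """Data node: encrypt each coordinate, then add Gaussian noise homomorphically."""
    out = []
    for g in grad:
        noise_ct = encrypt(random.gauss(0.0, sigma))
        out.append(encrypt(g) * noise_ct % N2)     # Enc(g) * Enc(z) = Enc(g + z)
    return out


def aggregate(updates):
    """Coordinator: multiply ciphertexts coordinate-wise, which adds plaintexts."""
    total = [1] * len(updates[0])                  # 1 is a valid encryption of 0
    for upd in updates:
        total = [t * c % N2 for t, c in zip(total, upd)]
    return total


def decrypt_mean(total, num_nodes):
    """Key holder: decrypt the aggregated sum and divide by the node count."""
    return [decrypt(c) / num_nodes for c in total]
```

For example, `decrypt_mean(aggregate([encrypt_update(g1, 0.01), encrypt_update(g2, 0.01)]), 2)` yields the noisy coordinate-wise mean of two gradient vectors `g1` and `g2` without either plaintext gradient being sent on its own.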
[0018] The software implementation of the coordinating node can adopt various architectures to adapt to different scales of business needs and operational environments. A microservice-based architecture can be used, breaking down functions such as node registration, model management, key services, and parameter aggregation into independent, deployable, and scalable microservices, governed through a service mesh. An event-driven architecture can also be used, leveraging message middleware to handle asynchronous events such as node connections and parameter uploads, improving system throughput and responsiveness in high-concurrency scenarios. For the secure transport layer protocol connection between the coordinating node and data nodes, the implementation can support configuring different encryption suites and protocol versions to adapt to varying security policies and compliance audit requirements, such as mandating the use of a transport layer security protocol and specifying the use of a forward-secret encryption algorithm suite. Connection management can employ connection pooling technology to reuse persistent connections, significantly reducing the network and computational overhead caused by frequently establishing new connections.
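As one way to picture the configurable protocol version and forward-secret cipher suites mentioned above, the following sketch builds a server-side TLS context with Python's standard `ssl` module. The function name is hypothetical, the cipher string is an assumed policy rather than one stated in the patent, and certificate loading is left out because the file locations are deployment-specific.

```python
import ssl


def make_coordinator_tls_context() -> ssl.SSLContext:
    """Server-side TLS context for connections from data nodes (sketch)."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Mandate a minimum protocol version.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Restrict TLS 1.2 to ECDHE key exchange, which provides forward secrecy.
    # TLS 1.3 suites are forward-secret by design and are not affected here.
    ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")
    # Require each data node to present a client certificate (mutual TLS).
    ctx.verify_mode = ssl.CERT_REQUIRED
    # A real deployment would also call ctx.load_cert_chain(...) and
    # ctx.load_verify_locations(...) with its own certificate files.
    return ctx


tls_context = make_coordinator_tls_context()
```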
[0019] The specific structure of the data parsing mechanism can be flexibly adjusted and optimized according to the characteristics of the data to be processed and the available computing resources. The natural language processing model on which the deep semantic extraction module relies can use different backbone network architectures. Besides transformer models based on self-attention mechanisms, sequence models based on long short-term memory networks, or hybrid models combining convolutional neural networks and attention mechanisms, can also be used to balance the accuracy of entity and relation extraction against the model's inference speed. For data nodes deployed in edge environments with limited computing resources, model compression techniques such as knowledge distillation can be used to transfer the knowledge of a large, high-precision teacher model to a lighter-weight student model, thereby significantly reducing the model's memory footprint and computational latency while preserving an acceptable level of performance. The definition of the standardized feature label system is not fixed; business users can dynamically adjust the feature dimensions, value types, and mapping rules they need to focus on based on the specific risk control logic and risk factors of different insurance products. These business rules can be encapsulated into distributable feature extraction templates, which the coordinating node distributes to each data node to achieve centralized management and flexible adaptation of business strategies.
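The knowledge distillation step mentioned above is commonly expressed as a temperature-softened KL divergence between teacher and student outputs. The sketch below shows that loss in plain Python; the temperature value and function names are assumptions for illustration, and the patent does not state which distillation objective is used.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge; in practice it is usually combined with the ordinary supervised loss on labeled data.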
[0020] The privacy protection techniques used in the parameter encryption interaction mechanism can be selected and optimized based on the trade-off between security strength, computational cost, and model utility. Besides the Paillier cryptosystem, other public-key encryption schemes with partially homomorphic properties can also be used for homomorphic encryption. In scenarios with a limited number of participants and a certain level of trust among them, a secure multi-party computation protocol based on secret sharing can be used to achieve secure aggregation of model parameters, often with higher computational efficiency. The configuration of differential privacy noise parameters is crucial; the distribution type and scale of the noise need to be calculated precisely from the privacy budget parameters. The privacy budget defines the upper limit of the risk of privacy leakage, and the noise parameter configuration that minimizes the impact on the accuracy of the final aggregated model under a given budget can be determined through theoretical derivation or experimental simulation. At the communication level, compression algorithms can be applied to the high-dimensional encrypted parameter vector before network transmission to reduce bandwidth consumption, especially in federated learning tasks with numerous participating nodes or a large number of model parameters.
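To make the link between privacy budget and noise scale concrete, the sketch below uses the standard calibrations for the Laplace mechanism (scale = sensitivity / epsilon) and the classic Gaussian mechanism (valid for 0 < epsilon < 1). The function names are hypothetical, and the patent does not say which calibration it uses; these are textbook formulas offered as one possible reading.

```python
import math


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Scale b of Laplace noise giving epsilon-differential privacy."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Standard deviation for the classic (epsilon, delta) Gaussian mechanism."""
    if not 0 < epsilon < 1:
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
```

Here `sensitivity` is the maximum change one record can cause in the released update, which in practice is bounded by clipping each update's norm. A smaller privacy budget (smaller epsilon) gives a proportionally larger noise scale, which is the accuracy trade-off the paragraph describes.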
[0021] The overall deployment model of the federated learning network needs to adapt to diverse IT infrastructure environments. In a public cloud deployment, coordinating nodes, data nodes, and business nodes can all be packaged as container images, with automated deployment, scaling, and failover handled by a container orchestration platform. In hybrid cloud or edge computing scenarios, data nodes are typically deployed in the data owner's private data center or internal servers, while coordinating nodes may be deployed in the business's public cloud or virtual private cloud environment. In this case, site-to-site virtual private networks or software-defined wide area network (SD-WAN) technology needs to be configured to establish secure and efficient cross-network communication channels. For edge data nodes with unstable network conditions or intermittent connectivity, the federated learning process can adopt an asynchronous update strategy, allowing data nodes to upload parameters at any time after local training is completed, while the coordinating node continuously performs asynchronous aggregation, thereby improving the robustness and fault tolerance of the entire system.
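One simple way to realize asynchronous aggregation is to mix each arriving update into the global parameters with a weight that shrinks as the update becomes stale. The rule below is an assumption chosen for illustration (the patent does not specify the aggregation rule), and `async_merge`, `staleness`, and `base_rate` are hypothetical names.

```python
def async_merge(global_params, update, staleness, base_rate=0.5):
    """Blend one late-arriving update into the global model.

    staleness: number of global versions published since the data node
    downloaded the model it trained on (0 means fully up to date).
    """
    alpha = base_rate / (1 + staleness)  # older updates get less weight
    return [(1 - alpha) * g + alpha * u for g, u in zip(global_params, update)]
```

For instance, `async_merge([0.0], [1.0], staleness=0)` moves the parameter halfway toward the update, while the same update with `staleness=1` moves it only a quarter of the way.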
[0022] This embodiment constructs a federated learning network with clearly defined roles for coordination nodes, data nodes, and business nodes, and pre-configures data parsing and parameter encryption interaction mechanisms. This lays the foundation for utilizing dispersed, multi-source data under strict privacy constraints. The data parsing mechanism enables the localized and automated transformation of raw, unstructured data distributed across various data nodes into standardized, structured feature representations, overcoming the fundamental obstacle of directly aligning and comprehensively utilizing heterogeneous multi-source data. The parameter encryption interaction mechanism, by integrating homomorphic encryption and differential privacy technologies, ensures the confidentiality and privacy of interactive information during federated learning at the algorithm level, allowing data holders to securely contribute their data value without exposing the original data content. This series of pre-configuration efforts realizes a paradigm where data remains stationary while value flows, effectively solving the data silo problem caused by legal and regulatory restrictions. This provides an essential and solid technical infrastructure for building compliant and reliable collaborative risk assessment models in fields such as finance and insurance.
[0023] S20, the unstructured raw sample data stored in the data nodes is structured through the data parsing mechanism to generate standardized feature data for training. In this embodiment, the core of the data parsing mechanism is a localized, automated data transformation pipeline. Upon receiving a processing instruction, the data parsing mechanism first activates its data scheduling module. Based on a predefined sample selection strategy, this module accesses the local storage system of the data node and locates and loads the data files or database records that meet the criteria. This unstructured raw sample data typically exists as free text, such as medical records, discharge summaries, and examination reports in hospital information systems. In medical finance and insurance scenarios, this text describes the health status of the policyholder or insured and serves as a key source of information for risk assessment.
[0024] Subsequently, the text preprocessing engine performs a series of normalization operations on the loaded raw text. This includes encoding conversion to ensure the text is uniformly encoded in a common encoding. Text cleaning is performed, removing irrelevant metadata, formatting characters, and repeated spaces from the beginning and end of the document. Tokenization is then performed, dividing continuous text into meaningful words or sub-words; this segmentation process can be based on a domain dictionary to improve accuracy. Simultaneously, in accordance with privacy compliance requirements, the engine identifies and obfuscates direct personal identifiers in the text, such as names, identification numbers, and precise addresses. This process is typically completed based on preliminary screening using regular expression pattern matching or named entity recognition models.
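The cleaning and rule-based obfuscation steps can be sketched with regular expressions. The two patterns below (an 18-character identification number shape and an 11-digit mobile number shape) and the placeholder tokens are illustrative assumptions; a real rule set would be defined by the applicable compliance requirements and would also cover names and addresses, typically with a named entity recognition model.

```python
import re
import unicodedata

# Assumed identifier shapes, for illustration only.
ID_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
PHONE_PATTERN = re.compile(r"(?<!\d)1\d{10}(?!\d)")


def preprocess(text: str) -> str:
    """Normalize characters, mask direct identifiers, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # unify full-width and half-width forms
    text = ID_PATTERN.sub("[ID]", text)         # mask ID numbers before phone numbers
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return re.sub(r"\s+", " ", text).strip()    # drop repeated and edge whitespace
```

Digit lookarounds are used instead of `\b` so that the patterns still match when an identifier is directly adjacent to Chinese characters, which count as word characters in Python's `re`.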
[0025] The preprocessed text sequence is fed into the deep semantic extraction module. This module is a computational system built on deep neural networks, typically employing an encoder-decoder or pure encoder architecture. For example, a model based on a transformer encoder includes an input embedding layer, a positional encoding layer, stacked multi-head self-attention layers, and a feedforward neural network layer. In the case of optimization for medical text, the model's input embedding layer combines word embeddings with special token embeddings for medical terminology. The model typically undergoes two-stage learning during the training phase: first, pre-training on a large corpus of general language and medical literature for a masked language modeling task, learning general representations of language and medical knowledge; then, supervised fine-tuning on medical record text labeled with medical entities and relationships, with fine-tuning tasks including sequence labeling and relationship classification. In the inference phase, i.e., the current processing step, the preprocessed text sequence is converted into a word index and input into the model. The model's multi-layer self-attention mechanism captures long-distance contextual dependencies in the text, and its output passes through a specific task head, such as a linear chain conditional random field layer for entity boundary recognition and a fully connected layer for relationship classification. Through this process, the model identifies entities such as "diabetes", "coronary artery bypass surgery", and "aspirin" from the text and determines the relationships between entities, such as the "disease-treatment" relationship.
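The output of a sequence labeling head is usually a tag per token, which then has to be decoded into entity spans. The sketch below assumes the common BIO tagging scheme and hypothetical label names such as `DISEASE` and `DRUG`; the patent does not specify the tag set.

```python
def decode_bio(tokens, tags):
    """Group tokens into (entity_text, entity_type) pairs from BIO tags."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close the previous entity
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)            # continue the open entity
        else:
            if current:                      # "O" or inconsistent tag ends it
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities
```

Given the tokens `coronary artery bypass surgery` tagged `B-PROC I-PROC I-PROC I-PROC`, the function returns a single `PROC` entity covering all four tokens; these spans are what the relation classification layer then pairs up.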
[0026] The extracted entities and their relationships are passed to the feature mapping and standardization module. This module maintains or accesses a predefined feature labeling system, designed based on business requirements. In the insurance underwriting field, this system maps medical concepts to quantifiable risk assessment factors. For example, the identified entity "Type II diabetes" might be mapped to the feature dimension of "history of endocrine and metabolic diseases" based on its context, such as diagnosis time and control status, and assigned a numerical code based on severity, such as "1" representing a history of the disease but good control. A specific numerical entity, "systolic blood pressure 150 mmHg," is directly mapped to the "blood pressure index" feature and standardized, for example, through Z-score normalization, converting it to a value with a mean of 0 and a standard deviation of 1. The module executes strict mapping rules to ensure that the same medical concept is transformed into a completely consistent feature representation across different data nodes and medical records. For missing information, the module fills in the missing information according to a strategy, such as using default values or inferring from the context. Finally, all features are assembled into a fixed-dimensional numerical vector, i.e., the standardized feature data for training. Each dimension of this vector corresponds to a predefined term in the feature label system, and its value represents the quantification result of that feature on that sample. The entire process is completed locally on the data node, and neither the original text nor the intermediate semantic extraction results leave this node.
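The mapping, Z-score normalization, and missing-value handling described above can be sketched as follows. The feature names, the default values, and the reference mean and standard deviation for systolic blood pressure are all assumptions for illustration; in the described system they would come from the business-defined feature label system.

```python
# Fixed feature order shared by every data node (hypothetical names).
FEATURE_ORDER = ["endocrine_metabolic_history", "blood_pressure_index"]
DEFAULTS = {"endocrine_metabolic_history": 0.0, "blood_pressure_index": 0.0}
SBP_MEAN, SBP_STD = 120.0, 15.0  # assumed reference statistics, in mmHg


def z_score(value: float, mean: float, std: float) -> float:
    """Standardize a value to zero mean and unit standard deviation."""
    return (value - mean) / std


def assemble(sample: dict) -> list:
    """Map extracted values to a fixed-dimension vector, filling gaps with defaults."""
    features = dict(DEFAULTS)
    if "diabetes_severity_code" in sample:
        features["endocrine_metabolic_history"] = float(sample["diabetes_severity_code"])
    if "systolic_bp_mmhg" in sample:
        features["blood_pressure_index"] = z_score(
            sample["systolic_bp_mmhg"], SBP_MEAN, SBP_STD
        )
    return [features[name] for name in FEATURE_ORDER]
```

With these assumed statistics, a record with severity code 1 and a systolic reading of 150 mmHg becomes `[1.0, 2.0]`. Every node must apply the same order and the same reference statistics, otherwise the vectors are not comparable across institutions.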
[0027] The deep semantic extraction module can be implemented using different neural network architectures to suit performance and accuracy requirements. One approach is to use a bidirectional long short-term memory network combined with a conditional random field for entity recognition. This architecture has fewer parameters and is better suited to deployment on edge data nodes with limited computing resources. Relation extraction can then be handled separately by an attention-based neural network model that processes the context between entity pairs. Another approach is to use a unified pre-trained language model, such as clinical BERT or one of its variants trained specifically on medical literature. Such a model already contains rich medical knowledge from pre-training, and can be fine-tuned with task-specific prompts to perform entity recognition and relation extraction simultaneously. The model's training data can come from publicly available medical corpora, such as MIMIC-III, combined with manually annotated insurance-related medical record fragments for domain-adaptive training. Key training settings include the learning rate schedule, such as linear warm-up followed by cosine annealing; a batch size adjusted to the available GPU memory; and gradient clipping to prevent gradient explosion.
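As an illustration of the training settings named at the end of this paragraph, the sketch below computes a learning rate with linear warm-up followed by cosine annealing, and rescales a gradient vector whose norm exceeds a limit (norm-based gradient clipping). The function names and the example step counts are assumptions for illustration only.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate with linear warm-up, then cosine annealing to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so that the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm):
    """Scale the gradient vector down if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Example: 100 training steps, 10 of them warm-up, peak learning rate 1e-3.
schedule = [lr_at_step(s, 100, 10, 1e-3) for s in range(100)]
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```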
[0028] The specific operations of text preprocessing can be adjusted according to data quality and language characteristics. For Chinese medical text, word segmentation can employ a tool that combines dictionary and statistical models, incorporating a medical dictionary to improve segmentation accuracy. For non-standard abbreviations or transcription errors in doctors' handwriting, an error correction module can be introduced, which uses language models or edit distance algorithms for candidate correction. Sensitive information desensitization can be based not only on rules but also on a trained named entity recognition model to more accurately locate various types of privacy information, including less common identifier types.
[0029] The rule engine of the feature mapping and standardization module can be designed to be configurable and pluggable. The feature label system can be stored in configuration files or database tables, allowing business experts to update mapping rules without modifying code. In addition to direct key-value mapping, the mapping logic can support complex rule-based transformations. For example, when the entity "smoking" and the time description "30 years" appear together in the text, they can be mapped to the "smoking history" feature and assigned the value "heavy". Besides Z-score normalization, standardization can also use min-max scaling, quantile transformation, and similar methods, with the choice depending on the feature's distribution. Categorical features can use one-hot encoding or embedding encoding.
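A configurable rule table of this kind might look like the sketch below. Only the "smoking" plus "30 years" to "heavy" example comes from the text; the other thresholds, the level names, and the helper functions are hypothetical.

```python
# Hypothetical rule table; in practice it could be loaded from a
# configuration file or database table rather than hard-coded.
SMOKING_RULES = [
    {"min_years": 20, "label": "heavy"},
    {"min_years": 5, "label": "moderate"},
    {"min_years": 0, "label": "light"},
]
SMOKING_LEVELS = ["none", "light", "moderate", "heavy"]


def map_smoking(entities):
    """Combine the 'smoking' entity with its duration to pick a level."""
    if "smoking" not in entities:
        return "none"
    years = entities["smoking"].get("years", 0)
    for rule in SMOKING_RULES:  # ordered from most to least severe
        if years >= rule["min_years"]:
            return rule["label"]
    return "none"


def one_hot(label, levels):
    """Encode a categorical label as a one-hot vector."""
    return [1.0 if label == level else 0.0 for level in levels]


def min_max(value, lo, hi):
    """Scale a value into [0, 1], clamping values outside [lo, hi]."""
    if hi == lo:
        return 0.0
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


level = map_smoking({"smoking": {"years": 30}})
encoded = one_hot(level, SMOKING_LEVELS)
```

Keeping the rules as data rather than code is what allows them to be updated without a redeployment.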
[0030] This embodiment performs localized and structured deep processing on the unstructured raw sample data stored in the data nodes. The data parsing mechanism transforms the originally difficult-to-use free text medical records into high-quality, standardized machine-readable feature vectors. The deep semantic extraction module utilizes a pre-trained domain knowledge model to accurately extract entity and relationship information crucial for insurance risk assessment from complex medical descriptions, solving the problems of low efficiency and subjective inconsistency in manual reading. The feature mapping and standardization module, based on unified business rules, objectively and consistently quantifies this semantic information into fixed feature dimensions, eliminating the heterogeneity between raw data from different sources and in different formats.
[0031] S30, based on the parameter encryption interaction mechanism, control the data node and the coordination node to perform encrypted model parameter interaction, and use the standardized feature data for training to train and generate an anomaly analysis model. In this embodiment, based on the configured parameter encryption interaction mechanism, the data nodes and the coordination node are controlled to exchange encrypted model parameters and to collaboratively train an anomaly analysis model using the standardized feature data stored locally on each data node. This process constitutes the core training loop of federated learning. Training begins with the coordination node initializing the anomaly analysis model, which is the machine learning model to be trained and whose structure can be customized for the risk assessment task. In the medical finance and insurance scenario, this model is usually a supervised learning model, such as a gradient boosting tree or a deep neural network, designed to predict a probability or score representing anomaly risk from input health feature data. The coordination node defines the global architecture and initial parameters of the model, such as the initial values of the weight matrix and bias vector of each neural network layer. These constitute the initial global model parameters, which the coordination node distributes through the federated learning network to all data nodes participating in training.
[0032] Upon receiving the global model parameters, each data node initiates a local model training task. The data node uses its locally stored standardized feature data as the training sample set. This feature data consists of structured vectors extracted from the node's original medical records via a data parsing mechanism; each feature vector corresponds to a summary of the health status of a historical sample. The training process on the node executes standard model learning algorithms. For example, if the anomaly analysis model is a multilayer perceptron, the training process includes forward propagation to calculate predicted values, calculating the error between the predicted values and the true labels using a loss function, and then calculating the gradient of the loss function relative to the model parameters using a backpropagation algorithm. True labels, in insurance scenarios, are typically subsequent claims or disease diagnosis records; this label information is also securely stored locally on the data node. The data node uses a local optimizer, such as stochastic gradient descent or its variants, to iteratively update the received global model parameters based on the calculated gradients, generating updated local model parameters. The entire local training process is performed within the data node; the standardized feature data and labels never leave the node.
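The local training step can be illustrated with the smallest possible model. The sketch below uses logistic regression trained by stochastic gradient descent instead of a multilayer perceptron, purely to keep the example short; the function names, learning rate, and toy data are assumptions.

```python
import math


def sigmoid(z):
    """Logistic function (adequate for the small logits in this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))


def local_train(global_weights, global_bias, samples, labels, lr=0.1, epochs=5):
    """Start from the received global parameters and update them locally.

    Returns updated local parameters; the samples and labels are only
    read inside this function and are never returned.
    """
    w = list(global_weights)  # copy so the received parameters stay intact
    b = global_bias
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)  # forward pass
            err = p - y  # gradient of binary cross-entropy w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]  # SGD update
            b -= lr * err
    return w, b


# Toy local data: one standardized feature, label 1 = later claim occurred.
received = [0.0]
local_w, local_b = local_train(received, 0.0, [[1.0], [-1.0]], [1, 0])
```

Only `local_w` and `local_b` would be passed on to the encryption step; the feature vectors and labels stay inside the node.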
[0033] After completing local training, the data nodes need to securely contribute the updated model parameters to the global model. At this point, the parameter encryption interaction mechanism is invoked to protect parameter privacy. The local encryption client on the data node uses a homomorphic encryption public key obtained beforehand from the coordinating node to perform encryption operations on the updated local model parameters. The encryption process converts each floating-point value in the model parameter vector into a ciphertext element, allowing specific arithmetic operations to be performed even in the ciphertext state. Simultaneously, to meet differential privacy requirements, the noise injection unit injects random noise conforming to a specific probability distribution into the encrypted parameter ciphertext based on the differential privacy noise parameters issued by the coordinating node. This step ensures that even if inference is attempted from multiple updated parameters, it is difficult to reconstruct the original training data. The encrypted and noise-added parameters are packaged into an encrypted model parameter packet and uploaded to the coordinating node via a secure transport layer protocol connection.
[0034] The coordinating node receives encrypted model parameter packets from all participating data nodes. The ciphertext aggregation module on the coordinating node performs secure aggregation calculations on this ciphertext data. Due to the use of an encryption algorithm supporting additive homomorphism, the coordinating node can directly perform element-wise weighted summation on multiple ciphertext parameter vectors without decryption. The weighting method can be a simple average, or different weights can be assigned based on the amount of data from each data node or other trust indicators. The aggregation operation produces an aggregated encrypted global parameter. Subsequently, the coordinating node uses its held homomorphic encryption private key to decrypt this aggregated encrypted global parameter, obtaining the updated plaintext global model parameters. The coordinating node evaluates whether the model has reached convergence, which can be achieved by the change in global parameters being less than a certain threshold over multiple iterations, or by reaching a preset number of iterations. If convergence is not achieved, the coordinating node uses the updated global model parameters as the initial parameters for the next round and redistributes them to the data nodes to start the next training iteration. This cycle of "distribution-local training-encrypted upload-secure aggregation" is repeated continuously.
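To make the additive-homomorphic aggregation concrete, the sketch below implements a toy Paillier-style scheme in pure Python. The text does not name a specific algorithm, so Paillier is an assumption here, as are the fixed-point scale, the sample-count weighting, and the deliberately small key built from two known Mersenne primes. It is not secure and is meant only to show that a weighted sum can be computed on ciphertexts. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets

# Toy key material: two known Mersenne primes. Far too small for real use.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                     # public modulus
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(P-1, Q-1)
MU = pow(LAM, -1, N)          # private: valid for generator g = N + 1
SCALE = 10**6                 # fixed-point scale for encoding floats


def encrypt(x):
    """Encrypt one floating-point parameter with the public key."""
    m = round(x * SCALE) % N
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return ((1 + m * N) * pow(r, N, N2)) % N2


def decrypt(c):
    """Decrypt with the private key and undo the fixed-point encoding."""
    m = ((pow(c, LAM, N2) - 1) // N) * MU % N
    if m > N // 2:            # map the upper half back to negative values
        m -= N
    return m / SCALE


def aggregate(encrypted_updates, sample_counts):
    """Weighted sum on ciphertexts: Enc(a)^k * Enc(b)^j = Enc(k*a + j*b)."""
    dim = len(encrypted_updates[0])
    total = []
    for d in range(dim):
        acc = 1
        for vec, k in zip(encrypted_updates, sample_counts):
            acc = (acc * pow(vec[d], k, N2)) % N2
        total.append(acc)
    return total


# Two data nodes upload encrypted parameters; weights are sample counts.
node_a, node_b = [0.20, -1.50], [0.40, 0.50]
counts = [100, 300]
enc = [[encrypt(v) for v in node_a], [encrypt(v) for v in node_b]]
agg = aggregate(enc, counts)                      # no decryption needed here
global_params = [decrypt(c) / sum(counts) for c in agg]
```

The coordinating node only ever decrypts the aggregate, so in this sketch it never sees an individual node's plaintext update.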
[0035] When training meets the convergence condition, the coordinating node saves the final global model parameters, which define the trained anomaly analysis model. This model can now output a quantified risk score based on the input health feature vector. In the context of financial insurance underwriting applications, the model's input is a standardized health feature vector, and the output is a value between 0 and 1, representing the estimated probability of the insured experiencing a critical illness or high-risk event as defined in the insurance policy. The model's internal structure, such as the number of neural network layers, the number of neurons in each layer, and the type of activation function, is defined during initialization and optimized together during federated training.
[0036] The architecture of anomaly analysis models can be adjusted based on the complexity and interpretability requirements of the risk assessment task. Logistic regression or gradient boosting decision tree models can be used; these models have relatively few parameters, high training efficiency, and their decision-making process is interpretable, making them easy for insurance risk control personnel to understand. For capturing complex nonlinear interactions between health features, deep neural networks, such as multilayer perceptrons or convolutional neural networks, can be used. If the input features contain temporal information, such as multiple years of continuous health check-up records, recurrent neural networks or long short-term memory networks can be used. The model's structural parameters, such as the depth and width of the neural network, can be set through hyperparameter configuration during the initialization of the coordinating nodes and can remain unchanged during federated training, or be collaboratively optimized by introducing a hyperparameter federated learning mechanism.
[0037] The specific strategies for federated learning training loops can be adapted in various ways. A synchronous update strategy can be used, where the coordinating node waits for all selected data nodes to complete their current training round and upload their parameters before aggregating. This ensures that each update is based on data from all nodes, but is limited by the speed of the slowest node. An asynchronous update strategy can also be used, where the coordinating node performs partial aggregation and model updates immediately upon receiving parameters from a node. This improves training speed but may introduce noise. The number of iterations for local training is a key parameter, which can be uniformly specified at the coordinating node or allowed to be dynamically determined by each data node based on its local data volume or computing power. Optimizer parameters such as the learning rate can be globally uniform or allowed to be fine-tuned locally by each node.
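The difference between the two update strategies can be sketched on plaintext parameters, with encryption omitted for brevity. The synchronous function is a sample-count weighted average; the asynchronous function uses a fixed mixing rate, which is one common way to do partial aggregation and is an assumption here rather than something the text prescribes.

```python
def sync_round(updates, sample_counts):
    """Synchronous: average once every selected node has reported."""
    total = sum(sample_counts)
    dim = len(updates[0])
    return [
        sum(u[d] * n for u, n in zip(updates, sample_counts)) / total
        for d in range(dim)
    ]


def async_step(global_params, node_params, mixing_rate=0.5):
    """Asynchronous: blend in one node's parameters as soon as they arrive."""
    return [
        (1.0 - mixing_rate) * g + mixing_rate * p
        for g, p in zip(global_params, node_params)
    ]


synced = sync_round([[1.0, 2.0], [3.0, 4.0]], [1, 3])
blended = async_step([0.0, 0.0], [1.0, 2.0], mixing_rate=0.5)
```

A smaller `mixing_rate` dampens the influence of any single late or unrepresentative node, at the cost of slower progress.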
[0038] Privacy techniques in parameter encryption interaction mechanisms can be used interchangeably or in combination. Besides homomorphic encryption, secure multi-party computation protocols can be employed to achieve secure parameter aggregation. For example, a secret-sharing-based scheme splits each node's parameter update into multiple shares, which are then distributed to other nodes or a coordinating node. Aggregation is achieved through share operations without exposing the original values. Differential privacy noise can be added at different stages; it can be added after data node encryption or to the plaintext parameters after aggregation and decryption at the coordinating node. The type and scale of the noise distribution need to strictly match the chosen privacy budget model. For example, a centralized differential privacy model requires careful calculation of global sensitivity, while a localized differential privacy model adds noise independently to the output of each node.
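The secret-sharing alternative and the local noise step can be sketched as follows. This is a simplified additive secret sharing over a prime modulus with a hypothetical fixed-point scale; it assumes honest participants, and the Laplace sampler uses Python's general-purpose random generator, which would not be appropriate for a production differential privacy mechanism.

```python
import random
import secrets

MODULUS = 2**61 - 1   # prime modulus for the share arithmetic
SCALE = 10**4         # hypothetical fixed-point scale


def encode(x):
    return round(x * SCALE) % MODULUS


def decode(m):
    if m > MODULUS // 2:
        m -= MODULUS
    return m / SCALE


def split(value, n_shares):
    """Split one value into n additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_shares - 1)]
    last = (encode(value) - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(values):
    """Each node shares its value; only partial sums are ever revealed."""
    n = len(values)
    share_matrix = [split(v, n) for v in values]  # row i: shares sent by node i
    # Node j adds up the shares it received and publishes that partial sum.
    partial = [sum(share_matrix[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return decode(sum(partial) % MODULUS)


def laplace_noise(sensitivity, epsilon, rng=random):
    """Sample Laplace(0, sensitivity/epsilon) as a difference of exponentials."""
    rate = epsilon / sensitivity
    return rng.expovariate(rate) - rng.expovariate(rate)


total = secure_sum([0.5, -1.25, 2.0])
noisy_update = 0.5 + laplace_noise(sensitivity=1.0, epsilon=0.5)
```

In a local differential privacy setup, each node would add `laplace_noise` to its own update before sharing it, so that the noise scale depends only on that node's sensitivity and privacy budget.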
[0039] To address data imbalance or concept drift in the medical finance and insurance field, the federated training process can incorporate special mechanisms. For example, the coordinating node can dynamically adjust aggregation weights based on the label distribution of the samples contributed by each data node to mitigate class imbalance. For newly added medical institution data nodes, incremental learning or transfer learning strategies can be employed so that the trained model adapts quickly to the feature distribution of the new data source without retraining from scratch. A model performance validation phase can also be introduced during training: the coordinating node periodically distributes the current global model to some nodes, each of which evaluates it on a locally retained validation set, and the performance metrics are then securely aggregated to guide early stopping or hyperparameter tuning.
[0040] This embodiment controls the exchange of encrypted model parameters between the data nodes and the coordination node through the parameter encryption interaction mechanism, and uses locally generated standardized feature data for collaborative training. This allows sample knowledge scattered across different institutions to be aggregated into a jointly constructed anomaly analysis model while the original data of each data node remains private. The application of encryption and differential privacy technologies is intended to ensure that the model parameters exchanged during training do not leak any individual's sensitive health information, meeting the compliance requirements for medical data use. Using locally generated standardized feature data for training keeps the input data consistent and of high quality, enabling effective joint training. The final anomaly analysis model integrates the statistical patterns of multi-source data samples, and its risk assessment capability is more comprehensive and accurate than that of models trained on a single data source or on limited samples, providing insurance institutions with a reliable technical tool for accurate risk pricing and underwriting decisions within a compliant framework.
[0041] S40, obtain the identification information of the object to be processed, extract the target unstructured raw data associated with the identification information in the data node through the data parsing mechanism and process it to generate target standardized feature data. In this embodiment, obtaining the identification information of the object to be processed is the entry point for analysis of a specific individual. In medical finance and insurance business, the object to be processed is typically a natural person who has submitted an insurance application. Identification information is a data item that can uniquely or quasi-uniquely identify the object, such as a resident identity number, a social security number, a unique customer number assigned by an insurance institution, or a policy application number. This information is typically obtained at a business node, which receives an insurance application data packet from the front-end channel and parses specific fields in the packet to extract the identification information. In some processes, the identification information may also be entered manually by underwriters or obtained by querying a customer relationship management system.
[0042] After obtaining the identification information, it is necessary to locate and extract the raw data associated with that identification. This involves determining the specific data node storing the associated data. In a federated network containing multiple healthcare institution data nodes, coordinating nodes or business logic need to resolve one or more target data nodes corresponding to the identification information based on pre-defined mapping relationships or by querying a distributed index. For example, the hospital node where the data is mainly located can be determined by using the medical insurance card prefix or registered hospital information in the identification information. After determining the target data node, a data extraction request is initiated to that node, carrying the identification information in the request. The data access interface running locally on the target data node receives the request, and its internal data retrieval module queries the local storage system based on the identification information. The local storage system may be a relational database, document database, or file system, which stores unstructured raw data records with the identification information as the key index, such as all outpatient medical records, inpatient medical record cover images, and laboratory report texts of the patient in the institution. The retrieval module executes the query operation and returns a set of raw data documents that exactly match or fuzzily match the identification information; these documents constitute the target unstructured raw data. The entire retrieval process is completed within the data node, and the original data does not leave the boundary of that node.
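The routing and local retrieval steps can be sketched as below. The prefix-based routing table, the identifier format, and the in-memory store are all hypothetical stand-ins; a real data node would query its own database or file system behind the data access interface.

```python
class LocalRecordStore:
    """Hypothetical in-memory stand-in for a data node's local storage."""

    def __init__(self):
        self._by_id = {}

    def add(self, patient_id, document):
        self._by_id.setdefault(patient_id, []).append(document)

    def retrieve(self, patient_id):
        """Exact-match lookup; returns a copy of the matching documents."""
        return list(self._by_id.get(patient_id, []))


def resolve_target_nodes(patient_id, prefix_to_node):
    """Pick the data nodes whose registered prefix matches the identifier."""
    return [
        node for prefix, node in prefix_to_node.items()
        if patient_id.startswith(prefix)
    ]


# Hypothetical routing table and identifier.
routing = {"HA": "hospital_a", "HB": "hospital_b"}
store = LocalRecordStore()
store.add("HA-0001", "Outpatient note: ...")
store.add("HA-0001", "Laboratory report: ...")

targets = resolve_target_nodes("HA-0001", routing)
documents = store.retrieve("HA-0001")  # runs inside the target data node
```

The retrieved documents are handed to the node's own parsing pipeline; only the request carrying the identifier crosses the node boundary in this sketch.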
[0043] Next, on the target data node, the pre-deployed data parsing mechanism is triggered to perform structured processing on the retrieved unstructured raw data. This processing flow maintains consistency with the core process used to generate standardized feature data for training, ensuring consistency in the feature space. The data scheduling module in the data parsing mechanism loads the retrieved raw data documents into the processing pipeline. The text preprocessing engine then performs cleaning, word segmentation, encoding standardization, and sensitive information desensitization on the document content. The cleaning process removes formatting tags, irrelevant headers, and other noise from the document. Word segmentation divides continuous text into lexical units; for Chinese medical text, it may be necessary to combine medical dictionaries for terminology segmentation. Sensitive information desensitization masks or replaces direct personal identifiers other than those used for analysis.
[0044] The preprocessed text is fed into a deep semantic extraction module. This module is based on a pre-trained natural language processing model incorporating medical domain knowledge, such as a transformer model fine-tuned on clinical text. The model encodes the input text, understands the contextual semantics through its multi-layer neural network structure, and performs named entity recognition and relation extraction tasks. In the context of insurance risk assessment, the model is trained or configured to focus on entities related to disease diagnosis, surgical procedures, medication history, abnormal laboratory indicators, symptom descriptions, family medical history, and their temporal, degree, and causal relationships. For example, from a discharge summary, the model needs to identify "acute myocardial infarction" as a disease entity, "percutaneous coronary intervention" as a treatment entity, determine the "disease-treatment" relationship between the two, and extract time-modification information such as "3 days post-operation".
[0045] The identified entities and their relationship sets output by the deep semantic extraction module are passed to the feature mapping and standardization module. This module accesses a predefined feature label system identical to that used in the training phase. This system, defined by insurance risk control business requirements, maps various medical concepts to fixed, quantifiable risk factor dimensions. Based on preset mapping rules, the module converts the identified entities and their attributes into numerical values for the corresponding dimensions in the feature label system. For example, the identified entity "Type II diabetes," combined with contextual information such as "10-year disease duration" and "oral metformin," might be mapped to the "history of endocrine and metabolic diseases" feature and assigned a specific coded value representing "long-term medical history and medication control" according to the rule base. A specific numerical entity, "low-density lipoprotein cholesterol 4.5 mmol / L," is directly mapped to the "blood lipid index" feature and, after standardization compared with the overall sample database, converted into a standardized score. The mapping process needs to handle information conflicts, missing information, and uncertainties, and may rely on a rule engine or simple reasoning logic. Finally, all mapping results are assembled into a fixed-length and ordered numerical vector. Each position in this vector corresponds to a predetermined dimension in the feature label system, and its value represents the quantified evaluation result of that dimension on the object to be processed. This numerical vector is the target standardized feature data, and its format is fully compatible with the standardized feature data used for model training, and can be directly used as input to the trained model.
[0046] The methods for obtaining and verifying identification information can be diversified. Besides parsing from structured application forms, it can also be obtained indirectly through biometric recognition, bank card number binding, etc., but it must be ensured that it can ultimately be linked to the index key in the medical data system. For situations where identification information may correspond to multiple data nodes, a collaborative query mechanism can be used, with the coordinating node broadcasting query requests to all possible data nodes, or a centralized patient master index service can be used to determine the most relevant data node. The local retrieval implementation on data nodes can be optimized. For high-frequency queries, a cached index of identification information and its storage location can be built in memory to accelerate the retrieval process. For unstructured documents, a full-text search engine can be used, associating the document with the identification information during index building, improving the efficiency of locating specific patient records from massive amounts of documents.
[0047] When processing target data, the data parsing mechanism can employ slightly different strategies from those of the training phase in order to optimize real-time performance. The deep semantic extraction module can use a quantized, lightweight version of the model, or a small model obtained through knowledge distillation, to reduce the latency of each inference. For text preprocessing, given that the target data typically consists of relatively recent records, a word segmentation dictionary adapted to the latest medical terminology and abbreviations can be used. The feature mapping and standardization module can also be optimized for real-time processing, for example by precompiling mapping rules into efficient decision trees or lookup tables to avoid complex rule interpretation during inference.
[0048] The processing flow can be adjusted according to the real-time requirements of the business scenario. For online underwriting scenarios requiring extremely fast response, the entire processing flow can be highly pipelined and deployed on high-performance computing instances. Asynchronous processing and callback mechanisms can be adopted, meaning that the business node returns immediately after submitting a request, and the data node actively notifies the business node of the result after processing. For batch processing scenarios, such as centralized pre-underwriting of a batch of insurance applications, the identification information of multiple objects to be processed can be batch packaged and sent to the data node. The data node internally utilizes parallel computing resources to process multiple requests simultaneously, improving the overall throughput.
[0049] Dynamic updates to the feature labeling system need to be supported. When insurance product terms or risk control logic change, the feature labeling system and its mapping rules may need to be updated. New feature definitions and mapping rules can be encapsulated into update packages through a coordination node and securely distributed to all data nodes. Upon receiving the updates, data nodes dynamically load the new rules, enabling subsequent processing to immediately generate features based on the new rules without requiring service restarts or model retraining. This requires the feature mapping and standardization modules to have hot-update capabilities.
[0050] This embodiment acquires the identification information of the object to be processed and utilizes a configured data parsing mechanism to extract and process the associated unstructured raw data locally at the target data node. This enables the generation of standardized feature representations isomorphic to the training data for a specific individual without requiring centralized raw data. This process ensures that the input features used for model inference are completely aligned with the feature space learned during model training, guaranteeing the correct and reliable application of the trained anomaly analysis model. In financial insurance underwriting scenarios, this step allows insurance companies to utilize the latest medical record data distributed across different medical institutions in real time and in compliance with regulations, transforming it into structured risk assessment factors, providing high-quality, standardized input for subsequent automated risk quantification.
[0051] S50, Input the target standardized feature data into the anomaly analysis model, and perform joint verification in conjunction with preset business conditions to generate anomaly quantification results. In this embodiment, the target standardized feature data, serving as a quantitative summary of the health status of the object to be processed, is input into the previously trained anomaly analysis model. This data is a fixed-dimensional numerical vector, with each dimension corresponding to a predefined health risk factor, such as the standardized value of blood pressure, the encoding of a specific medical history, or the discrete level of a key physiological indicator. In the context of financial insurance underwriting, this vector reflects the applicant's quantitative status across a series of medical dimensions. The anomaly analysis model is a machine learning model whose parameters are optimized through the aforementioned federated learning process. A typical architecture is a deep neural network containing an input layer, several hidden layers, and an output layer. The number of neurons in the input layer equals the dimension of the target standardized feature data, so that the layer can receive the vector. The hidden layers are typically fully connected layers, each containing multiple neurons and using non-linear activation functions to capture complex interactions between features. The output layer is designed according to the task; for binary risk assessment tasks, it typically uses a single neuron with a sigmoid activation function that outputs a scalar value between 0 and 1. This value is the initial probability score, calculated by the model from the input features, that the individual belongs to the high-risk category. The connection weights and bias parameters between the layers of the model are determined during the training phase. In insurance risk assessment applications, the model is trained to identify individuals who will experience a critical illness or health event covered by an insurance contract within a certain future period. Its training labels are derived from historical claims records or disease diagnosis records.
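The inference step can be sketched as a forward pass through a small fully connected network with a sigmoid output. The layer sizes, the ReLU activation, and the parameter values in `PARAMS` are hypothetical; in practice the parameters would be those produced by federated training.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def predict(x, params):
    """Return the initial probability score for one feature vector."""
    hidden = relu(dense(x, params["W1"], params["b1"]))
    logit = dense(hidden, params["W2"], params["b2"])[0]
    return sigmoid(logit)


# Hypothetical parameters: 2 input features, 2 hidden neurons, 1 output.
PARAMS = {
    "W1": [[0.5, -0.2], [0.1, 0.3]],
    "b1": [0.0, 0.0],
    "W2": [[1.0, 1.0]],
    "b2": [-0.5],
}
score = predict([1.0, 1.0], PARAMS)
```

The length of each row in `W1` must equal the length of the input vector, which is the dimension-matching requirement described in the text.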
[0052] The pre-defined business conditions are a set of rules and thresholds based on insurance product terms, company risk control strategies, and actuarial experience. These conditions are independent of the data-driven model and reflect the knowledge of domain experts and compliance requirements. Business conditions may include numerical thresholds, such as boundary values for different risk levels set for the initial probability score of the model output; logical exclusion clauses, such as explicitly stipulating that if certain specific disease diagnoses or health conditions appear in the insured's characteristics, a special processing procedure must be triggered regardless of the model score; and feature weights or correction coefficients, used to adjust the model output under specific circumstances. These conditions are typically stored in a structured form in a rule base or configuration file and can be dynamically loaded according to different insurance product types.
[0053] The joint verification process refers to the collaborative processing of the initial probability score output by the anomaly analysis model with pre-defined business conditions to generate a more comprehensive and logically consistent final risk quantification result. This process first compares the initial probability score output by the model with the numerical thresholds defined in the business conditions, categorizing it into a pre-defined risk level range, such as "low risk," "medium risk," or "high risk." Simultaneously, the system iterates through the exclusion clauses in the business conditions, matching specific feature values in the target standardized feature data with the feature patterns defined in the clauses. A successful match indicates that a hard rule has been triggered, which may directly lead to an increase or decrease in the risk level or trigger an instruction for manual review. Subsequently, the system integrates the threshold comparison results with the rule matching results according to predetermined logic. This integration logic can be sequential, such as first applying exclusion rules to cover the model results and then applying threshold grading to uncovered cases; or it can be a weighted fusion, such as adjusting the model score based on the strength of the rule matching. Ultimately, this series of processes produces a single, quantified anomaly quantification result. The result can be a discrete risk level code or a continuous risk score after rule correction. Its purpose is to provide a comprehensive risk assessment conclusion that includes both data-driven insights and business rule constraints for subsequent decision-making.
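A sequential form of this joint verification, in which exclusion rules override the model result and threshold grading applies to the remaining cases, might look like the sketch below. The threshold values, the feature name `malignancy_history`, and the rule format are hypothetical.

```python
# Hypothetical business conditions; real ones would be loaded per product.
RISK_THRESHOLDS = [(0.7, "high"), (0.3, "medium")]  # checked from highest down
EXCLUSION_RULES = [
    {"feature": "malignancy_history", "equals": 1.0, "action": "manual_review"},
]


def grade(score):
    """Map the model's probability score to a risk level."""
    for bound, level in RISK_THRESHOLDS:
        if score >= bound:
            return level
    return "low"


def joint_verify(score, features):
    """Apply exclusion rules first, then threshold grading."""
    for rule in EXCLUSION_RULES:
        if features.get(rule["feature"]) == rule["equals"]:
            # A hard rule overrides the model result regardless of the score.
            return {"level": rule["action"], "score": score, "rule_hit": rule["feature"]}
    return {"level": grade(score), "score": score, "rule_hit": None}


result_clean = joint_verify(0.45, {"malignancy_history": 0.0})
result_flagged = joint_verify(0.05, {"malignancy_history": 1.0})
```

In the second call the model score is low, but the exclusion rule still routes the application to manual review, which is the "regardless of the model score" behavior described for exclusion clauses.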
[0054] The specific implementation architecture of anomaly analysis models can be selected and adjusted according to different requirements for accuracy, interpretability, and computational efficiency. Besides deep neural networks, gradient boosting decision tree models can be used. This model integrates multiple decision trees, iteratively building new trees to correct the residuals of previous trees. Its output is a weighted sum of the predictions from each tree, which can also be converted into probability scores. These models often perform well on tabular feature data and provide a certain measure of feature importance. Logistic regression models can also be used, whose output has a clear probabilistic interpretation, and whose model coefficients can directly correlate the direction and strength of the correlation between features and risk. For more complex time-series or multimodal data, a hybrid architecture of convolutional neural networks and recurrent neural networks can be designed. The design of the model's output layer can also be varied. For tasks that need to distinguish multiple risk levels, a softmax output layer can be used to directly output the probability distribution of each level.
[0055] The organization and execution of pre-defined business conditions can be diversified. Expert systems based on rule engines such as Drools can be used to manage complex, multi-condition business rules. Rules are written in the form of "when...then...", supporting flexible combinations and priority settings. For a large number of simple threshold rules, they can be stored in a configuration table of a relational database and executed through efficient query and comparison logic. Business conditions can also be partially embedded through feature engineering. For example, when generating target standardized feature data, some strong rules can be transformed into binary features and directly used as model input, allowing the model to learn the influence of these rules during training.
[0056] The integrated logic of joint validation can be designed as a configurable strategy. Decision trees or decision tables can be used to explicitly represent the mapping of the final result under different combinations of conditions. Score adjustment formulas can be designed; for example, when an exclusion clause is triggered, a fixed penalty value is added to the initial score of the model. Weight vectors can be introduced to perform a weighted average of the probability score output by the model and the score calculated based on the rules. The weights can be dynamically adjusted according to the credibility of the rules or the importance of the business. The validation process can be a strictly sequential process or a parallel computation of multiple sub-results followed by aggregation.
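The two score adjustment strategies mentioned above (a fixed penalty per triggered exclusion clause, and a weighted average of model score and rule score) might look like the following minimal sketch. The default penalty of 0.2, the default weight of 0.6, and the clamping to 1.0 are illustrative assumptions.

```python
def adjust_with_penalty(model_score: float, triggered_count: int,
                        penalty: float = 0.2) -> float:
    """Add a fixed penalty per triggered exclusion clause, capped at 1.0."""
    return min(1.0, model_score + penalty * triggered_count)


def weighted_fusion(model_score: float, rule_score: float,
                    model_weight: float = 0.6) -> float:
    """Weighted average of the model's probability score and a rule-based score.
    model_weight is assumed to be set from rule credibility or business importance."""
    if not 0.0 <= model_weight <= 1.0:
        raise ValueError("model_weight must lie in [0, 1]")
    return model_weight * model_score + (1.0 - model_weight) * rule_score
```

Exposing the penalty and weight as parameters is one way to make the integration logic a configurable strategy rather than hard-coded behavior.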
[0057] The performance of the entire joint verification system can be optimized. For high-concurrency online underwriting scenarios, model inference and rule verification can be deployed on GPU-accelerated inference servers, and frequently accessed business condition rules can be cached in memory. Asynchronous batch processing can be implemented, simultaneously inputting the feature data of a batch of insurance applications into the model for batch inference, and then applying business rules in batches, thereby improving overall throughput. The system should be designed with a monitoring module to continuously track the distribution and consistency of model output and business rule results, providing data support for model iteration and rule optimization.
[0058] This implementation obtains data-driven initial risk insights by inputting standardized target feature data into a pre-trained anomaly analysis model. This is then combined with pre-defined business conditions reflecting domain knowledge and compliance requirements for joint validation, generating a more robust and reliable anomaly quantification result. The model utilizes complex patterns learned from multi-source data to provide basic risk probability estimates, while the business conditions embed explicit business rules, actuarial logic, and compliance red lines. This combination effectively compensates for potential blind spots in pure data models or uncertainties related to out-of-distribution samples of the training data, while also supplementing subtle feature interactions that pure rule-based systems cannot capture. This joint validation mechanism ensures that the final risk assessment possesses both the foresight of statistical learning and is firmly rooted in actual insurance business logic and the regulatory framework, significantly improving the accuracy, interpretability, and business applicability of the risk quantification results.
[0059] S60, determine a processing decision based on the anomaly quantification result, and feed the processing decision back to the business node.
[0060] In this embodiment, the anomaly quantification result is a numerical value or level code characterizing the risk level of the object to be processed, such as a risk score of 0.85 or a classification label marked "high risk". The process of determining the processing decision involves mapping this quantification result to specific, actionable business actions. This mapping process relies on a predefined hierarchical mapping strategy. This strategy defines the correspondence between the numerical range or level of the anomaly quantification result and different processing decision types. In the underwriting scenario of medical financial insurance, typical processing decision types include automatic approval, automatic rejection, and marking as requiring manual review. The hierarchical mapping strategy might stipulate that when the risk score is below 0.1, the decision is automatic approval; when the risk score is above 0.7, the decision is automatic rejection; and when the risk score is between 0.1 and 0.7, the decision is to transfer to manual review. The mapping strategy is typically formulated by the insurance company's actuarial and risk control departments based on product characteristics, market strategies, and risk tolerance, and stored in the form of configuration files or rules in a location accessible to coordination nodes or business nodes.
[0061] The specific operations for determining the processing decision involve querying and matching. The system compares the generated anomaly quantification results with the various intervals defined in the hierarchical mapping strategy to identify the target numerical interval or matching level into which the result falls. Subsequently, the system searches the strategy for the target processing decision type associated with that target interval or level. This process can be a simple interval judgment or a complex multi-dimensional rule engine matching.
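A minimal sketch of the interval lookup is shown below, using the example thresholds from the preceding paragraph (below 0.1, above 0.7, and the range in between). The function name, the decision type strings, and the level-code table are hypothetical.

```python
from typing import Union

# Hypothetical mapping for level-coded results.
LEVEL_STRATEGY = {
    "low risk": "auto_approve",
    "medium risk": "manual_review",
    "high risk": "auto_reject",
}


def determine_decision(result: Union[float, str],
                       approve_below: float = 0.1,
                       reject_above: float = 0.7) -> str:
    """Map an anomaly quantification result to a processing decision type."""
    if isinstance(result, str):          # discrete risk level code
        return LEVEL_STRATEGY[result]
    if not 0.0 <= result <= 1.0:
        raise ValueError("risk score must lie in [0, 1]")
    if result < approve_below:
        return "auto_approve"
    if result > reject_above:
        return "auto_reject"
    return "manual_review"               # 0.1 <= score <= 0.7
```

For instance, `determine_decision(0.85)` yields `"auto_reject"`, while a score of exactly 0.7 falls in the manual review range under this reading of the thresholds.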
[0062] When the target processing decision type is automatic approval or automatic rejection, the system typically generates a structured analysis report as the formal document for the decision. This report not only includes the final decision conclusion but also integrates key judgment criteria, such as the main features leading to high risk, the raw scores output by the model, and the key business rules triggered. The report aims to provide transparency and auditability of the decision. When the target processing decision type requires manual review, the system generates a review work order. This work order extracts and highlights the key feature information leading to the abnormal quantitative results, such as the dimensions with the highest contribution in the risk feature vector and their values, providing a clear focus for subsequent manual underwriters. The review work order is sent to a task queue or workflow system, awaiting manual processing. After reviewing the work order, the manual underwriter may choose to maintain the original recommendation, adjust the risk level, or make a different final decision. The correction instructions input by the manual underwriter are captured by the system and packaged into the final processing decision.
[0063] Once a processing decision is generated, it needs to be sent back to the requesting business node via the established federated learning network. In a typical star topology, the coordinating node acts as a relay, responsible for receiving the final output from each data processing node and routing the encapsulated processing decision to the corresponding business node. Communication occurs through previously established application layer connections, such as by calling the application programming interfaces (APIs) provided by the business nodes. The feedback data packet contains the specific content of the processing decision, such as analysis reports, automatic pass / reject instruction codes, or manually revised conclusions. Upon receiving the processing decision, the business node triggers its corresponding internal business process, such as updating the policy status, notifying the customer of the underwriting conclusion, or assigning the manual review task to a specific underwriter's workbench.
[0064] The implementation and loading methods of hierarchical mapping strategies can be diversified to adapt to dynamic business needs. Strategies can be stored in configuration tables of a relational database, allowing business administrators to adjust the mapping relationship between threshold ranges and decision types in real time through a management interface, without requiring a service restart. Feature flags (feature toggles) can also be used to apply different mapping strategies to different product lines or customer groups. For more complex decision logic involving multiple feature combinations, a business rule management system can be integrated, with strategies written in a domain-specific language and interpreted and executed by the rule engine at runtime.
[0065] The generation and encapsulation format of processing decisions can be adjusted according to the integration requirements of downstream systems. For automated decision-making, the generated reports can use JavaScript object notation, Extensible Markup Language, or a predefined binary protocol buffer format. The reports can include instruction code for direct machine processing and text descriptions for human review. For manual review paths, the work order system can be integrated with existing transaction tracking systems or workflow engines, and work order status, assigned personnel, processing time limits, etc., can all be customized. The capture of manual correction instructions can be achieved by designing a dedicated review operation interface. This interface displays key features and model suggestions, and provides drop-down menus, sliders, or text input boxes for underwriters to enter their final conclusions and remarks.
[0066] The communication mechanism for decision feedback can be optimized to ensure reliability and timeliness. A synchronous application programming interface (API) call based on the Hypertext Transfer Protocol Secure (HTTPS) can be used, with business nodes waiting for and receiving responses within a specified time. In high-concurrency scenarios, an asynchronous message queue pattern can be adopted, where the coordinating node publishes decisions to a message topic, and business nodes subscribed to that topic asynchronously pull and process the decisions, achieving decoupling and smoothing of load peaks and troughs. To ensure message delivery, acknowledgment and retry mechanisms can be introduced, and messages can be persisted. The security of the feedback channel can be further enhanced through transport layer security protocol encryption and application layer digital signatures or token authentication.
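The acknowledgment and retry mechanism can be illustrated with a short sketch. The `send` callable stands in for whatever transport the business node exposes (an HTTPS call or a queue publish); it, the attempt limit, and the exponential backoff schedule are assumptions for illustration.

```python
import time
from typing import Callable


def deliver_with_retry(send: Callable[[dict], bool], decision: dict,
                       max_attempts: int = 3, base_delay: float = 0.5) -> int:
    """Send a decision until it is acknowledged.

    `send` returns True when the business node acknowledges receipt.
    Returns the number of attempts used; raises if no acknowledgment arrives.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if send(decision):
                return attempt
        except ConnectionError:
            pass  # treat a transport failure like a missing acknowledgment
        if attempt < max_attempts:
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError("decision not acknowledged after retries")
```

Because a retried message may arrive more than once, the receiving business node would typically need to process decisions idempotently, for example by keying on a decision identifier.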
[0067] This embodiment determines processing decisions based on anomaly quantification results and a pre-defined hierarchical mapping strategy, then feeds these decisions back to business nodes, achieving a seamless transition from automated risk assessment to executable business actions. The mapping strategy allows data-driven risk scores to be flexibly and controllably transformed into specific operational instructions that align with business logic. In automated decision-making scenarios, generating evidence-based reports enhances transparency and efficiency. In scenarios requiring human intervention, the structured review work orders direct the manual underwriter's attention to the key risk features, improving review efficiency. Finally, the federated learning network reliably feeds the decisions back to business nodes, triggering subsequent policy processing flows, thus completing a closed loop from distributed data privacy-preserving analysis to actual insurance business decisions. This process, while ensuring compliance and privacy security in the analysis process, shortens the underwriting decision-making cycle and improves the responsiveness and automation level of insurance services.
[0068] In one embodiment, step S10 above includes:
S101, deploy a network registration service at the coordinating node;
S102, at the network registration service, receive connection requests from the data nodes and business nodes;
S103, based on the received connection requests, assign a network identifier to each data node and business node that made a connection request, and record the assigned network identifiers in the node registry;
S104, based on the node registry, establish a secure transport layer protocol connection between the coordinating node and each data node;
S105, establish an application layer connection between the coordinating node and the business node;
S106, at the coordinating node, generate a feature extraction template according to the predefined data structure specification, and distribute the feature extraction template to each data node through the secure transport layer protocol connection;
S107, locally at each data node, configure a natural language processing parsing engine based on the received feature extraction template to form a data parsing mechanism;
S108, at the coordinating node, generate a homomorphic encryption public key and differential privacy noise parameters, and distribute them to each data node through the secure transport layer protocol connection;
S109, at each data node, initialize the local encryption client using the homomorphic encryption public key, and configure a local noise addition strategy according to the differential privacy noise parameters. The initialized local encryption client and the configured local noise addition strategy together constitute a parameter encryption interaction mechanism.
[0069] In this embodiment, the coordinating node deploys a network registration service, which is a continuously running background process or microservice that listens on a specific network port, waiting for and processing connection requests from data nodes and business nodes. The core function of the network registration service is to discover and manage the identities of nodes. Physically, this service can be deployed in virtual machines or containers within the insurance institution's data center and provides a unified endpoint externally through a load balancer. Data nodes are typically deployed within the internal networks of cooperating medical institutions, while business nodes reside within the insurance company's business systems. They access the coordinating node via the internet or dedicated lines.
[0070] Both data nodes and business nodes initiate connection requests to the network registration service. These requests contain metadata about the node itself, such as node type, affiliated organization, and supported protocol versions. Data node requests may include a summary of the data types they store, while business node requests may include the type of analysis task they require. Upon receiving the request, the network registration service performs authentication, for example, through a pre-shared key, digital certificate, or organization whitelist mechanism. After successful authentication, the network registration service assigns each node a globally unique network identifier, which can be a UUID string used to uniquely identify the node in all subsequent interactions.
[0071] The network registration service associates assigned network identifiers with node metadata, forming a node registry. This registry is typically stored in the coordinating node's persistent storage, such as a relational database or key-value store. The registry records information such as each node's identifier, type, connection status, network address, and registration time. Based on the node registry, the coordinating node can understand the topology of the entire federated learning network and knows how to establish subsequent dedicated connections with each node.
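A minimal in-memory sketch of the registration flow (authentication by organization whitelist, UUID assignment, and registry bookkeeping) is given below. The class and field names are hypothetical, and a real deployment would back the registry with persistent storage as described above.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class NodeRecord:
    node_id: str
    node_type: str          # "data" or "business"
    organization: str
    address: str
    status: str = "registered"
    registered_at: float = field(default_factory=time.time)


class NodeRegistry:
    """In-memory stand-in for the coordinating node's node registry."""

    def __init__(self, whitelist: Iterable[str]):
        self._whitelist = set(whitelist)
        self._nodes: Dict[str, NodeRecord] = {}

    def register(self, node_type: str, organization: str, address: str) -> str:
        if node_type not in ("data", "business"):
            raise ValueError(f"unknown node type: {node_type}")
        if organization not in self._whitelist:   # whitelist authentication
            raise PermissionError(f"organization not allowed: {organization}")
        node_id = str(uuid.uuid4())                # globally unique identifier
        self._nodes[node_id] = NodeRecord(node_id, node_type, organization, address)
        return node_id

    def data_nodes(self) -> List[NodeRecord]:
        """Nodes that need a dedicated secure connection for training."""
        return [r for r in self._nodes.values() if r.node_type == "data"]
```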
[0072] The coordinating node uses the network address information in the node registry to establish a secure transport layer protocol connection, that is, a Transport Layer Security (TLS) connection, with each data node. This TLS connection is built on top of the Transmission Control Protocol (TCP), negotiating encryption algorithms and keys through a handshake protocol to establish an encrypted channel. The coordinating node acts as the client and the data nodes as the server, or vice versa, depending on the network architecture. During connection establishment, both parties exchange digital certificates to verify identities, ensuring that the other end of the connection is a legitimate registered node. After the connection is established, both parties can transmit arbitrary data through this encrypted channel, guaranteeing confidentiality, integrity, and the authenticity of the other end's identity during transmission. The coordinating node also establishes an application layer connection with the business nodes. This connection can be based on application programming interface (API) calls over the Hypertext Transfer Protocol Secure (HTTPS) or on a remote procedure call (RPC) framework. The application layer connection is used to transmit business logic-related instructions and results, such as submitting analysis tasks and returning processing decisions.
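For the mutual certificate verification described above, a server-side context could be configured roughly as follows with Python's standard `ssl` module. The function name and the file path parameters are hypothetical; which side acts as server depends on the network architecture, as noted.

```python
import ssl
from typing import Optional


def make_mutual_tls_server_context(cert_file: Optional[str] = None,
                                   key_file: Optional[str] = None,
                                   ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a server-side TLS context that also requires a client certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED       # peer must present a certificate
    if cert_file:                             # this node's own certificate and key
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:                               # CA that signs registered nodes
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

The returned context would then be used to wrap an accepted TCP socket via `ctx.wrap_socket(sock, server_side=True)`.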
[0073] In the coordination node, a feature extraction template is generated according to a predefined data structure specification. This data structure specification, jointly developed by insurance risk control experts and data scientists, defines the standardized feature dimensions, data types, value specifications, and mapping rules to be extracted from the original medical text. The feature extraction template is an instantiation of this specification into a machine-readable configuration file, such as in JSON or YAML format, or compiled into a specific binary format. The template may contain feature names, the corresponding entity types that the natural language processing model should recognize, the conversion function from entity to feature value, and rules for handling missing values. The coordination node distributes the feature extraction template to each data node via an established secure transport layer protocol connection. The distribution process can be either a push mode or a data node actively pulling mode.
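To make the template concrete, the sketch below shows one possible JSON layout and a loader that checks the fields listed above (feature name, entity type, conversion, missing-value rule). The field names and the two example features are assumptions, not part of the specification.

```python
import json

# Hypothetical template instance; keys are illustrative only.
TEMPLATE_JSON = """
{
  "version": "1",
  "features": [
    {"name": "cardiovascular_disease_history", "entity_type": "DISEASE",
     "conversion": "presence_to_binary", "missing_value": 0.0},
    {"name": "systolic_bp", "entity_type": "MEASUREMENT",
     "conversion": "numeric_z_score", "missing_value": 0.0}
  ]
}
"""

REQUIRED_KEYS = {"name", "entity_type", "conversion", "missing_value"}


def load_template(text: str) -> dict:
    """Parse a feature extraction template and validate each feature entry."""
    template = json.loads(text)
    for feature in template["features"]:
        missing = REQUIRED_KEYS - feature.keys()
        if missing:
            raise ValueError(f"feature entry lacks keys: {sorted(missing)}")
    return template


def feature_order(template: dict) -> list:
    """The list order fixes the position of each feature in the output vector."""
    return [f["name"] for f in template["features"]]
```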
[0074] Locally on each data node, upon receiving the feature extraction template, a natural language processing (NLP) parsing engine is configured based on that template. The NLP parsing engine is a software module, potentially built on an open-source framework or a self-developed system. The configuration process includes loading the pre-trained model file specified in the template. This model file may be distributed by the coordinating node along with the template, or obtained by the data node from a designated model repository. The model is typically based on a transformer architecture and fine-tuned on large-scale medical text, capable of performing tasks such as named entity recognition, relation extraction, and attribute classification. Configuration also includes setting up the feature extraction pipeline according to the template, such as configuring text preprocessing parameters, entity filters, and feature transformers. Once configured, the engine can run independently locally, forming the core of the data parsing mechanism, thus enabling the transformation of raw medical record text into standardized feature data.
[0075] In the coordinating node, a homomorphic encryption public key and differential privacy noise parameters are generated. The generation of the homomorphic encryption public key depends on the chosen homomorphic encryption algorithm, such as the Paillier encryption algorithm. The key generation algorithm produces a public and private key pair; the public key is used for encryption, and the private key is securely stored by the coordinating node. The differential privacy noise parameters include the noise distribution type, the noise scale, and so on. These parameters are determined according to a pre-calculated privacy budget, which defines the upper limit of privacy leakage allowed throughout the federated learning process. The coordinating node distributes the homomorphic encryption public key and differential privacy noise parameters to each data node via a secure transport layer protocol connection. The distribution process needs to ensure that the parameters are not tampered with during transmission, typically using digital signatures or leveraging the integrity protection of the secure transport layer protocol connection itself.
[0076] At each data node, the local encryption client is initialized using the received homomorphic encryption public key. The local encryption client is a software library or service module that encapsulates the encryption, decryption, and ciphertext operation functions of the homomorphic encryption algorithm. Initialization includes loading the public key, setting encryption parameters, and allocating the encryption context. Simultaneously, the data node configures a local noise addition strategy based on the received differential privacy noise parameters. The noise addition strategy defines when and how noise is added to the model parameters to be uploaded, such as adding element-wise random noise satisfying a Laplace or Gaussian distribution to the encrypted gradient vector. Configuration includes setting the seed, distribution parameters, and possible noise injection points for the noise generator. The initialized local encryption client and the configured local noise addition strategy work together to constitute the implementation of the parameter encryption interaction mechanism on the data node side. This mechanism ensures that when a data node participates in federated learning training, the uploaded model parameter updates are both encrypted and satisfy differential privacy requirements, thus providing dual privacy guarantees at the information theory and cryptographic levels.
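The interplay of encryption, noise addition, and ciphertext aggregation can be illustrated with a toy sketch, assuming the additively homomorphic Paillier scheme. The primes are deliberately tiny and hard-coded, the fixed-point scale is arbitrary, and noise is added to the plaintext gradient before encryption, which is one possible injection point. None of this is suitable for production; it only shows the arithmetic.

```python
import math
import random

# Toy key material (two Mersenne primes); far too small for real use.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                      # public key (with g = N + 1)
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(p-1, q-1)
MU = pow(LAM, -1, N)           # private: valid because g = N + 1
SCALE = 10**6                  # assumed fixed-point scale for gradients


def encrypt(m: int, rng: random.Random) -> int:
    r = rng.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = rng.randrange(1, N)
    return (pow(N + 1, m % N, N2) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    """Coordinator-only: requires the private values LAM and MU."""
    m = ((pow(c, LAM, N2) - 1) // N) * MU % N
    return m - N if m > N // 2 else m   # map back to a signed integer


def laplace(scale: float, rng: random.Random) -> float:
    # Difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def protect(gradient, noise_scale: float, rng: random.Random):
    """Data-node side: add element-wise noise, then encrypt each element."""
    noisy = [g + laplace(noise_scale, rng) if noise_scale > 0 else g
             for g in gradient]
    return [encrypt(round(x * SCALE), rng) for x in noisy]


def aggregate(packages):
    """Multiplying Paillier ciphertexts adds the underlying plaintexts."""
    total = [1] * len(packages[0])
    for pkg in packages:
        total = [(t * c) % N2 for t, c in zip(total, pkg)]
    return total


def decode(ciphertexts):
    return [decrypt(c) / SCALE for c in ciphertexts]
```

With the noise scale set to zero, `decode(aggregate([protect(a, 0, rng), protect(b, 0, rng)]))` recovers the element-wise sum of `a` and `b` up to fixed-point rounding, which is the property that lets the coordinating node aggregate without seeing individual plaintext gradients.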
[0077] This embodiment constructs a federated learning network and configures a dual mechanism through the above steps, realizing the basic architecture for cross-institutional data collaborative analysis under strict privacy protection. Network registration and secure connection establish a trusted communication foundation, the distribution of feature extraction templates and the configuration of the parsing engine unify the processing standards for heterogeneous data, while the initialization of encryption and noise parameters provides cryptographic and differential privacy guarantees for secure collaborative training. This lays the foundation for subsequent compliant, efficient, and accurate risk assessment in the medical finance field.
[0078] In one embodiment, step S20 above includes:
S201, the data parsing mechanism is invoked at each data node, and a preset domain knowledge graph and feature alignment template are loaded. The domain knowledge graph contains entity terminology definitions and topological mapping relationships between entities;
S202, the unstructured raw sample data stored in the data node is denoised, cleaned, and segmented into a word sequence using the data parsing mechanism to generate a basic text sequence of the sample;
S203, the sample basic text sequence is input into the deep semantic extraction network integrated in the data parsing mechanism, and the domain knowledge graph is used to perform entity boundary localization and relationship extraction to obtain key business entities;
S204, based on the feature alignment template, the key business entities are mapped and converted into a unified numerical vector format to generate standardized feature data for training.
[0079] In this embodiment, a data parsing mechanism is invoked at each data node. This mechanism is initialized as a configuration management service, which loads a pre-defined domain knowledge graph and feature alignment template from local storage or a trusted remote repository based on a pre-set configuration list. The domain knowledge graph is a structured semantic network stored in a graph database or resource description framework format. Its nodes represent standardized coded medical entity concepts (such as International Classification of Diseases codes, surgical procedure classification codes, and generic drug names), and edges represent semantic relationships between concepts (such as "belongs to," "causes," and "contraindicated"). In insurance applications for cardiovascular disease risk assessment, the graph might include disease progression paths from "hypertension" to "heart failure" and drug classification relationships from "aspirin" to "antiplatelet therapy." The feature alignment template is a configuration file for a mapping rule engine, defining the transformation logic from entities and their attributes in the knowledge graph to specific dimensions in the final output feature vector. The template may contain the following rule: if the entity "diabetes" is identified and the attribute "disease duration" is greater than 5 years, then assign a value of 1.0 to the "endocrine and metabolic disease history_long-term" dimension of the feature vector; otherwise, assign a value of 0.0. The loading process ensures that the local processing logic of the data node is aligned with the global business rules.
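The example rule above (diabetes with a disease duration greater than 5 years) can be expressed as data and evaluated by a small engine, as in the following sketch. The rule schema, the entity dictionary layout, and the attribute name `disease_duration_years` are assumptions for illustration.

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}

# Hypothetical encoding of the rule from the text.
DIABETES_RULE = {
    "concept": "diabetes",
    "attribute": "disease_duration_years",
    "op": ">",
    "threshold": 5,
    "dimension": "endocrine_metabolic_disease_history_long_term",
    "value": 1.0,
    "otherwise": 0.0,
}


def apply_rule(rule: dict, entities: list) -> tuple:
    """Return (dimension, value) for one alignment rule.

    `entities` is a list such as
    [{"concept": "diabetes", "attributes": {"disease_duration_years": 7}}].
    """
    compare = OPS[rule["op"]]
    for entity in entities:
        if entity.get("concept") != rule["concept"]:
            continue
        observed = entity.get("attributes", {}).get(rule["attribute"])
        if observed is not None and compare(observed, rule["threshold"]):
            return rule["dimension"], rule["value"]
    return rule["dimension"], rule["otherwise"]
```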
[0080] The unstructured raw sample data is denoised, cleaned, and segmented sequentially using a data parsing mechanism. This process is completed by a dedicated text preprocessing pipeline. The denoising and cleaning module reads the raw data file and uses regular expressions to remove formatting tags, headers and footers, repeated spaces, and garbled characters irrelevant to medical descriptions. Error correction algorithms based on dictionaries or statistical language models are used to correct potential errors introduced by optical character recognition. A sensitive information desensitization module runs concurrently, locating and masking direct personal identifiers based on pattern recognition, such as replacing names with "[Name]" and shifting precise dates to relative time intervals. The sequential segmentation module then operates on the cleaned continuous text, using a segmenter based on a maximum matching algorithm combined with a medical dictionary to divide sentences into sequences of words or sub-words. For English text, byte-pair encoding or the WordPiece algorithm may be used to generate sub-words. The final output is a basic sample text sequence composed of a word index sequence and its positional information. This sequence preserves the semantic structure of the original text but eliminates formatting noise and privacy risks, providing standardized input for downstream in-depth analysis.
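A compressed sketch of the preprocessing pipeline follows: regex-based cleaning, a simplified desensitization step, and forward maximum matching segmentation. Several simplifications are assumed here: names are masked from a supplied list rather than by pattern recognition, dates are replaced with a `[Date]` token instead of being shifted to relative intervals, and the segmentation demo uses unspaced ASCII purely for portability (the technique targets unspaced text such as Chinese).

```python
import re
from typing import Iterable, List, Set


def clean(text: str) -> str:
    """Strip markup-like formatting tags and collapse repeated whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def mask_identifiers(text: str, known_names: Iterable[str]) -> str:
    """Simplified desensitization of names and ISO-formatted dates."""
    for name in known_names:
        text = text.replace(name, "[Name]")
    return re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "[Date]", text)


def max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest
    dictionary entry; fall back to a single character."""
    max_len = max(map(len, dictionary))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```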
[0081] The sample basic text sequence is input into the deep semantic extraction network integrated in the data parsing mechanism. The network is a pre-trained language model based on a Transformer encoder architecture, for example, pre-trained on corpora containing billions of words of medical literature, electronic health records, and medical question-and-answer pairs through masked language modeling and next-sentence prediction tasks. The model's input layer maps word indices to high-dimensional embedding vectors and adds them to positional encodings. The core consists of stacked Transformer blocks, each containing a multi-head self-attention mechanism and a feedforward neural network, with residual connections and layer normalization applied. The self-attention mechanism enables the model to compute contextual dependency weights between any two words in a sequence, thus capturing the strong association between "myocardial infarction" and "chest pain" in a medical context. In the fine-tuning phase for insurance risk control, the model is trained in a supervised manner on medical record segments labeled with medical entity boundaries and relationships. The training objective is to minimize the joint loss function of entity recognition (typically modeled as a sequence labeling task using conditional random fields) and relationship classification (modeled as an entity pair classification task). The optimizer is AdamW, and the learning rate follows a linear decay schedule with warm-up. During forward inference, the model uses its attention weights to focus on text fragments that are related to concepts in the domain knowledge graph. For example, when there is a "diagnosis method" relationship between "coronary heart disease" and "coronary angiography" in the knowledge graph, the model will increase its attention to the text fragments describing the angiography results, thereby more accurately locating entity boundaries and classifying their relationships, and outputting a structured set of key business entities, such as identifying the entity "positive coronary angiography result" and establishing a "confirmed diagnosis" relationship with the entity "coronary heart disease".
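The learning rate schedule mentioned above for fine-tuning (a warm-up phase followed by linear decay) can be written as a small function. The linear shape of the warm-up and the decay to exactly zero are assumptions, since the text does not fix them.

```python
def learning_rate(step: int, total_steps: int, warmup_steps: int,
                  peak_lr: float) -> float:
    """Linear warm-up from 0 to peak_lr, then linear decay to 0."""
    if not 0 < warmup_steps < total_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

The value returned for each step would be passed to the optimizer before the corresponding parameter update.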
[0082] Based on the feature alignment template, key business entities are mapped and converted into a unified numerical vector format. This process is executed by a feature mapping engine, which parses the rule set defined in the template. The engine first aligns and disambiguates the entities and relationships output by the deep semantic extraction network with the loaded domain knowledge graph. For example, it normalizes the description of "heart attack" in the text to the standard concept "myocardial infarction" in the graph. Then, it iterates through each mapping rule in the template. Rules may be direct key-value mappings, such as "Entity: Myocardial infarction → Feature dimension: Cardiovascular disease history, value: 1"; or they may be complex conditional logic, such as "IF Entity: Blood pressure reading AND Attribute value > 140 / 90 mmHg AND Existence relationship: Taking (Medication: Antihypertensive drug) THEN Feature dimension: Hypertension control status, value: 2 (Uncontrolled after medication)". The engine executes these rules, converting each entity and its contextual attributes into one or more numerical values for the corresponding feature dimension. For continuous numerical values, such as age or laboratory indicators, standardization is applied based on global sample statistics, such as Z-score normalization, which rescales the original values to zero mean and unit variance. The transformation results across all dimensions are assembled into a fixed-length and ordered dense or sparse numerical vector. The index and semantics of each position in this vector are uniformly defined by the template, ensuring that the standardized feature data generated from samples from different data nodes and different original medical records are strictly aligned and comparable in the feature space, and can be directly used for subsequent federated machine learning model training.
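The mapping steps above (synonym normalization, direct key-value rules, Z-score standardization, and assembly into a fixed order) are sketched below. The dictionaries, the dimension names, and the global statistics for age are illustrative assumptions; in the described system they would come from the knowledge graph and the feature alignment template.

```python
from typing import Dict, List

SYNONYMS = {"heart attack": "myocardial infarction"}       # graph normalization
DIRECT_RULES = {"myocardial infarction": ("cardiovascular_disease_history", 1.0)}
GLOBAL_STATS = {"age": (50.0, 10.0)}                       # hypothetical (mean, std)
TEMPLATE_ORDER = ["cardiovascular_disease_history", "age_z"]


def build_vector(mentions: List[str], numeric: Dict[str, float]) -> List[float]:
    """Convert entity mentions and numeric attributes into an ordered vector."""
    values: Dict[str, float] = {}
    for mention in mentions:
        concept = SYNONYMS.get(mention.lower(), mention.lower())
        if concept in DIRECT_RULES:
            dimension, value = DIRECT_RULES[concept]
            values[dimension] = value
    for name, raw in numeric.items():
        mean, std = GLOBAL_STATS[name]
        values[f"{name}_z"] = (raw - mean) / std           # Z-score
    # Positions follow the template order; absent dimensions default to 0.0.
    return [values.get(dimension, 0.0) for dimension in TEMPLATE_ORDER]
```

Because every data node uses the same `TEMPLATE_ORDER`, index 0 always means cardiovascular disease history and index 1 always means standardized age, which is what makes vectors from different institutions comparable.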
[0083] This embodiment, through the aforementioned process, transforms disorganized, unstructured medical text distributed across various data nodes into high-quality, standardized training features in a localized and automated manner. A deep semantic extraction network, combined with a domain knowledge graph, achieves accurate and interpretable semantic understanding and information structuring of complex medical text. Feature alignment templates ensure that this understanding is objectively and consistently quantified into numerical features closely aligned with business objectives. The standardized training feature data produced by this process overcomes the core obstacle of using multi-source heterogeneous medical data directly for joint modeling, providing an input foundation for building a unified and powerful risk assessment model while protecting privacy.
[0084] In one embodiment, step S30 above includes:
S301, the global model parameters are initialized at the coordinating node and distributed to each data node through the federated learning network;
S302, each data node is controlled to load the locally stored standardized feature data for training, and to use it to perform local iterative training on the global model parameters to generate local model gradients;
S303, the parameter encryption interaction mechanism is invoked, the homomorphic encryption algorithm is used to encrypt the local model gradient, and differential privacy noise is added to generate an encrypted gradient parameter package;
S304, each data node is controlled to upload the encrypted gradient parameter package to the coordinating node, and the coordinating node performs a ciphertext aggregation operation on the encrypted gradient parameter packages to update the global model parameters. The process is repeated until the preset convergence condition is met, and an anomaly analysis model is generated.
[0085] In this embodiment, the global model parameters are initialized at the coordinating node. This step defines the basic architecture and initial state of the anomaly analysis model. The anomaly analysis model is a computational graph instantiated as a machine learning model to be trained. In healthcare insurance risk prediction tasks, this model typically employs a supervised learning paradigm, and its architecture can be concretized as a multilayer perceptron. This network contains an input layer with the number of neurons strictly equal to the dimension of the standardized feature data used for training, which receives the individual's health feature vector. This is followed by several fully connected hidden layers, each containing a specific number of neurons, and using non-linear activation functions such as ReLU or GELU to introduce the model's non-linear fitting capability. The final output layer typically uses a neuron with a sigmoid activation function, mapping the previous layer's output to a scalar between 0 and 1, which is interpreted as the predicted probability of an individual experiencing a major health event defined by the insurance contract within a specific future time window. Initializing the global model parameters involves assigning initial values to all weight matrices and bias vectors in the network, typically using Xavier or He initialization strategies to ensure stability in the early stages of training. After initialization, the coordinating node connects via the secure transport layer protocol established in the federated learning network and broadcasts or unicasts data packets containing all initial parameter values to each data node participating in the training.
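As an illustration only, the initialization and the forward pass of such a multilayer perceptron might be sketched as follows in pure Python. The layer sizes, the choice of He initialization for the ReLU hidden layers and Xavier initialization for the sigmoid output layer, and the function names are assumptions made for the example.

```python
import math
import random


def init_mlp(layer_sizes, seed=0):
    """Return [(weights, biases), ...]: He-normal for hidden layers, Xavier-uniform for the output."""
    rng = random.Random(seed)
    params = []
    last = len(layer_sizes) - 2
    for i, (fan_in, fan_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        if i < last:
            std = math.sqrt(2.0 / fan_in)                      # He initialization
            w = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]
        else:
            limit = math.sqrt(6.0 / (fan_in + fan_out))        # Xavier initialization
            w = [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]
        params.append((w, [0.0] * fan_out))
    return params


def forward(params, x):
    """ReLU on hidden layers, sigmoid on the single output neuron."""
    a = list(x)
    for i, (w, b) in enumerate(params):
        z = [sum(wi * ai for wi, ai in zip(row, a)) + bj for row, bj in zip(w, b)]
        if i < len(params) - 1:
            a = [max(0.0, v) for v in z]
        else:
            a = [1.0 / (1.0 + math.exp(-v)) for v in z]
    return a[0]
```

The input layer width must equal the dimension of the standardized feature vector; the returned scalar lies strictly between 0 and 1.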
[0086] At each data node, the locally stored standardized feature data for training is loaded. This data is a set of structured vectors extracted from the historical medical record samples of this node through the data parsing mechanism. Each vector corresponds to a feature snapshot of a historical individual and is associated with a ground-truth binary label indicating whether the individual subsequently experienced the target health event. The data node uses this local data to perform local iterative training on the received global model parameters. The training process executes mini-batch stochastic gradient descent or one of its variants. For each mini-batch of feature vectors, the model performs forward propagation: the input vector is passed sequentially through layers of linear transformations and non-linear activations to obtain the predicted probability. The loss function uses binary cross-entropy to measure the difference between the predicted probability and the ground-truth label. Then, the backpropagation algorithm is executed, using the chain rule to calculate the partial derivative of the loss function with respect to each trainable parameter in the model, i.e., the gradient. This gradient indicates the direction and magnitude by which each parameter should be adjusted to reduce the loss on the current batch. Based on the calculated gradient, the local optimizer updates a copy of the global model parameters according to the set learning rate. This process is repeated locally until a preset number of local iterations is reached, ultimately producing a set of updated parameters. The difference vector between the initial global parameters and the final updated parameters, or the gradient accumulated during optimization, is extracted as the local model gradient to be uploaded. The entire process is completed within the data node, and the original feature data and labels do not leave the node.
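The local training loop can be sketched as below. To keep the example short, a single sigmoid neuron (logistic regression) stands in for the multilayer perceptron; this simplification, the hyperparameter values, and the function names are assumptions. The returned vector is the difference between the received global parameters and the locally updated parameters.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(w, X, y):
    """Mean binary cross-entropy of the model w on (X, y)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)


def local_update(global_w, X, y, lr=0.1, epochs=5, batch_size=2, seed=0):
    """Run mini-batch SGD on a copy of the global weights; return global_w - local_w."""
    rng = random.Random(seed)
    w = list(global_w)
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            grad = [0.0] * len(w)
            for i in batch:
                p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])))
                # For a sigmoid output with BCE loss, dL/dw_j = (p - y) * x_j.
                for j, xj in enumerate(X[i]):
                    grad[j] += (p - y[i]) * xj / len(batch)
            w = [wj - lr * gj for wj, gj in zip(w, grad)]
    # Only this difference vector leaves the training routine; X and y stay local.
    return [g - l for g, l in zip(global_w, w)]
```

Subtracting the returned vector from the global parameters reproduces the locally trained parameters, which matches the sign convention of a gradient-style update.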
[0087] The parameter encryption interaction mechanism is invoked to encrypt the local model gradients and apply privacy protection to them. The initialized local encryption client on the data node uses the pre-configured public key of the Paillier homomorphic encryption algorithm to operate on the gradient vector. The Paillier algorithm is a public-key encryption scheme that supports additive homomorphism. The encryption process is as follows: each floating-point value in the gradient vector is encoded as an element of the ring of integers modulo n and then encrypted with the public key to generate the corresponding ciphertext. The gradient vector is thereby transformed from a plaintext numerical sequence into a ciphertext sequence. Differential privacy noise is injected into the gradient vector according to the configured local noise addition strategy. The noise addition strategy specifies the distribution and scale of the noise; for example, noise values are independently sampled from a Laplace distribution with a mean of zero and a scale parameter of b. The scale parameter b is determined by the total privacy budget of the federated learning task and the global sensitivity of the gradient. In terms of technical implementation, since homomorphically adding random noise to the ciphertext yields, after decryption, the same result as adding that noise to the plaintext before encryption, this step is usually implemented as follows: first, random noise that meets the differential privacy requirements is added element-wise to the plaintext gradient, and then homomorphic encryption is performed on the noisy gradient. The final generated encrypted gradient parameter package contains the complete ciphertext representation of the noise-perturbed gradient, ensuring that its content is unreadable to the coordinating node and other parties, and satisfies strict differential privacy guarantees.
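The noise-then-encode step can be sketched as follows. The sketch assumes the common Laplace-mechanism calibration b = sensitivity / epsilon and a fixed-point encoding with a hypothetical precision of 10^6; negative values are represented in the upper half of the modular ring. The modulus here is an arbitrary stand-in for the public-key modulus.

```python
import random


def laplace(b, rng):
    """Sample Laplace(0, b) as the difference of two exponentials with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def perturb(gradient, epsilon, sensitivity, rng):
    """Add element-wise Laplace noise with scale b = sensitivity / epsilon (assumed calibration)."""
    b = sensitivity / epsilon
    return [g + laplace(b, rng) for g in gradient]


def encode(x, modulus, precision=10**6):
    """Fixed-point encode a float as an element of the integers modulo `modulus`."""
    return int(round(x * precision)) % modulus


def decode(v, modulus, precision=10**6):
    """Inverse of encode: values above modulus // 2 represent negatives."""
    if v > modulus // 2:
        v -= modulus
    return v / precision
```

The encoded integers are what the encryption client would then encrypt under the public key; the precision bounds the rounding error introduced per element.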
[0088] The data node is controlled to upload the encrypted gradient parameter package to the coordinating node via a secure transport layer protocol connection. The coordinating node's ciphertext aggregation module receives the encrypted packages from all participating nodes. Owing to the additive homomorphism of Paillier encryption, the ciphertext of the sum of two encrypted values can be computed from their ciphertexts without knowledge of the corresponding plaintexts. The aggregation module uses this property to perform element-wise homomorphic addition on the ciphertext gradients uploaded by all nodes. If weighted aggregation is used, each ciphertext can be homomorphically multiplied by its weight, which is supplied in plaintext form. The output of the aggregation operation is a new ciphertext vector, namely the aggregated encrypted global gradient. The coordinating node then decrypts this aggregated encrypted global gradient using its securely stored Paillier private key to obtain the plaintext aggregated gradient. This aggregated gradient represents the average update direction of the local data of all participating nodes. The coordinating node uses this aggregated gradient to update its maintained global model parameters according to the set global learning rate. This update can be formulated as follows: the new global parameter equals the original parameter minus the global learning rate multiplied by the aggregated gradient.
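For illustration, the additive homomorphism relied on above can be demonstrated with a toy Paillier implementation using the simplified generator g = n + 1. The two primes below are far too small for real use and are chosen only so that the example runs instantly; a production system would use a vetted cryptographic library and keys of appropriate length. All function names are hypothetical.

```python
import math
import random

# Toy key material (two Mersenne primes); NOT secure, for demonstration only.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # lcm(p - 1, q - 1)
MU = pow(LAM, -1, N)                                 # valid because g = n + 1


def encrypt(m, rng):
    """Encrypt integer m (negatives wrap modulo N) under the public key N."""
    r = rng.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = rng.randrange(1, N)
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def decrypt(c):
    """Decrypt with the private key (LAM, MU) and map back to a signed integer."""
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m


def add(c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds plaintexts."""
    return c1 * c2 % N2


def scale(c, k):
    """Homomorphic multiplication of the plaintext by a non-negative integer weight k."""
    return pow(c, k, N2)


def aggregate(ciphertext_vectors):
    """Element-wise homomorphic sum of the vectors uploaded by all nodes."""
    total = list(ciphertext_vectors[0])
    for vec in ciphertext_vectors[1:]:
        total = [add(a, b) for a, b in zip(total, vec)]
    return total
```

The aggregator only ever multiplies ciphertexts; the integer gradients are assumed to be fixed-point encodings of the noisy floating-point values, and the decrypted sum is divided by the node count or total weight afterward if an average is required.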
[0089] The process described above (distributing global parameters, local training, encrypted gradient upload, and secure aggregation update) constitutes one complete federated learning communication round. The coordinating node determines whether to terminate training based on preset convergence conditions. These conditions may include the norm of the change in the global model parameters remaining below a threshold over several consecutive rounds, a preset maximum number of communication rounds being reached, or the model performance no longer improving on a reserved global validation set. If the convergence conditions are not met, the coordinating node uses the updated global model parameters as the initial parameters for the next round of iteration. If the convergence conditions are met, training terminates, and the final global model parameters held by the coordinating node define the trained anomaly analysis model, ready for inference. This model encapsulates joint knowledge learned from the distributed data of all participating nodes for predicting health risks, and its training process never requires the centralization of any raw or individual-level medical data.
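A minimal sketch of such a termination check is given below. It assumes that the norm-based criterion is applied to the round-to-round change in the global parameters, and the tolerance, patience, and round limit are hypothetical values.

```python
import math


def should_stop(history, val_scores, round_idx, tol=1e-4, patience=3, max_rounds=100):
    """history: global parameter vectors per round; val_scores: validation metric per round."""
    # Condition 1: maximum number of communication rounds reached.
    if round_idx >= max_rounds:
        return True
    # Condition 2: parameter change below tol for `patience` consecutive rounds.
    if len(history) > patience:
        deltas = [
            math.sqrt(sum((a - b) ** 2 for a, b in zip(history[-k], history[-k - 1])))
            for k in range(1, patience + 1)
        ]
        if all(d < tol for d in deltas):
            return True
    # Condition 3: no validation improvement during the last `patience` rounds.
    if len(val_scores) > patience:
        if max(val_scores[-patience:]) <= max(val_scores[:-patience]):
            return True
    return False
```

The coordinating node would call this once per round after the aggregation update and start another round whenever it returns `False`.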
[0090] This embodiment, through the aforementioned encrypted federated training process, effectively aggregates dispersed sample knowledge while strictly protecting the privacy of the original data at each data node, collectively optimizing a powerful anomaly analysis model. Homomorphic encryption ensures the confidentiality of gradient interactions, while differential privacy noise provides strict statistical privacy guarantees, ensuring the training process complies with the most stringent data protection regulations. Locally generated standardized feature data serves as input, guaranteeing the consistency and effectiveness of model learning. The final generated model integrates the statistical patterns of multi-source data, and its risk assessment capabilities surpass the limitations of a single data source, providing insurance institutions with a reliable and secure core technology for achieving accurate risk quantification within a compliant framework.
[0091] In one embodiment, step S40 above includes: S401, receiving a business request for the object to be processed, and parsing the business request to obtain the identity code as identification information; S402, determining the corresponding target data node based on the identification information; S403, retrieving, locally at the target data node, the target unstructured raw data that matches the identification information; S404, triggering the data parsing mechanism to perform noise removal, cleaning, and serialization word segmentation operations on the target unstructured raw data to generate the target basic text sequence; S405, inputting the target basic text sequence into the deep semantic extraction network integrated in the data parsing mechanism to identify and separate the target key business entities in the target basic text sequence; S406, invoking the feature alignment template to map and convert the target key business entities into a unified numerical vector format and generate the target standardized feature data.
[0092] In this embodiment, a business request for an object to be processed is received. This request is typically initiated by a business node and transmitted to the coordination node via an established application layer connection. Structurally, the business request is a data packet conforming to a predetermined protocol, such as a JSON object or a protocol buffer message. Its payload contains basic information about the object to be processed, which must include a string that uniquely identifies the object, i.e., an identity code. In the medical finance and insurance scenario, this identity code is typically the insured's (i.e., the policy applicant's) resident identity card number, social security number, or a unique customer identifier assigned by the insurance core system. The coordination node runs a request parsing service, which performs syntax parsing and field extraction on the received data packet according to a predefined interface specification, extracting the identity code field. The extraction process may involve traversing nested structures, data format verification (such as verifying the validity of the resident identity card number), and character encoding conversion, ultimately yielding a standardized identification information string that can be used for subsequent queries.
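As one possible illustration, the request parsing service could extract and validate the identity code as sketched below. The JSON field names `applicant` and `identity_code` are hypothetical, and the example assumes the identity code is an 18-character resident identity number validated with the check-character rule of GB 11643-1999; other identifier types would need their own validators.

```python
import json

# Weights and check characters of the GB 11643-1999 check-digit scheme.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def valid_resident_id(code):
    """Return True if the 18-character code has a consistent check character."""
    if len(code) != 18 or not code[:17].isdigit():
        return False
    total = sum(int(d) * w for d, w in zip(code, WEIGHTS))
    return CHECK_CHARS[total % 11] == code[17].upper()


def extract_identity_code(raw_request):
    """Parse a JSON request and return the normalized identity code."""
    request = json.loads(raw_request)
    # Hypothetical nested field layout; a real interface specification would define it.
    code = str(request["applicant"]["identity_code"]).strip().upper()
    if not valid_resident_id(code):
        raise ValueError("identity code failed format verification")
    return code
```

Normalizing case and whitespace before validation means that downstream routing and retrieval always see one canonical string per individual.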
[0093] The target data node is determined based on the identification information. The coordinating node maintains a node registry and a routing mapping service. The node registry stores metadata about all registered data nodes, including node identifiers, network addresses, and descriptions of the data subjects they serve. The routing mapping service encapsulates the mapping logic from identification information to data node identifiers. This mapping logic can be based on pre-configured static rules, such as mapping to the corresponding regional medical data center node based on the administrative division prefix in the identity code; or it can be based on dynamic queries, such as the coordinating node querying a patient master index service, which returns a list of medical institutions where the patient's medical records are primarily stored based on the identity code, and the coordinating node then matching this list with the node registry to determine the specific data node identifier. In some implementations, the mapping result may be a list of nodes, in which case the coordinating node needs to select a master node or initiate parallel queries to all nodes according to a strategy. After determining the target data node, the coordinating node forwards the query command containing the identification information to that node via the corresponding secure transport layer protocol connection.
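The two routing strategies described above might be sketched as follows. The registry contents, node identifiers, addresses, and the dictionary standing in for the patient master index service are all hypothetical.

```python
# Hypothetical node registry: identifier -> metadata.
NODE_REGISTRY = {
    "node-a": {"address": "https://node-a.example.invalid", "region_prefix": "11"},
    "node-b": {"address": "https://node-b.example.invalid", "region_prefix": "31"},
}


def route(identity_code, master_index=None):
    """Return the list of registered data node identifiers for an identity code."""
    # Dynamic strategy: consult the master index first, if one is available.
    if master_index is not None:
        institutions = master_index.get(identity_code, [])
        matched = [node for node in institutions if node in NODE_REGISTRY]
        if matched:
            return matched
    # Static strategy: fall back to the administrative division prefix.
    prefix = identity_code[:2]
    return [node for node, meta in NODE_REGISTRY.items() if meta["region_prefix"] == prefix]
```

When the returned list has more than one entry, the caller chooses between selecting a master node and querying all nodes in parallel, as described above.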
[0094] On the target data node, the local data retrieval service receives the query command. Based on the identification information, this service performs a retrieval operation in the node's local storage system. This local storage system might be a relational database, where the medical record table is keyed by the patient's identification code; it could also be a document database or file system, where each document's metadata includes an identification code field. The retrieval service constructs and executes a query, such as a `SELECT ... FROM medical_records WHERE patient_id = 'specific identification code'` statement in an SQL database, or an exact-match query on the identification code in an Elasticsearch cluster. The query returns one or more unstructured raw data documents, such as complete discharge summary texts, radiology reports, laboratory test results, etc. These documents constitute the target unstructured raw data that precisely matches the identification information. The retrieval process may involve access control checks to ensure that the current federated learning task context has permission to access the specific type of record for that patient.
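For the relational case, a minimal sketch using an in-memory SQLite database is shown below. The column names `doc_type` and `body` and the sample rows are assumptions added for the example; only the table name `medical_records` and the `patient_id` column come from the description. The identification code is passed as a bound parameter rather than concatenated into the statement.

```python
import sqlite3


def open_demo_store():
    """Create an in-memory stand-in for a node's local medical record table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE medical_records (patient_id TEXT, doc_type TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO medical_records VALUES (?, ?, ?)",
        [
            ("ID-001", "discharge_summary", "Admitted with chest pain ..."),
            ("ID-001", "radiology_report", "Chest X-ray: no acute findings ..."),
            ("ID-002", "lab_result", "Serum creatinine 140 umol/L ..."),
        ],
    )
    return conn


def fetch_records(conn, patient_id):
    """Exact-match retrieval by identification code using a parameterized query."""
    rows = conn.execute(
        "SELECT doc_type, body FROM medical_records WHERE patient_id = ? ORDER BY rowid",
        (patient_id,),
    ).fetchall()
    return [{"doc_type": doc_type, "body": body} for doc_type, body in rows]
```

Binding the value with `?` keeps a malformed or hostile identity code from altering the query.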
[0095] The data parsing mechanism deployed on the target data node is triggered to perform noise removal, cleaning, and serialization word segmentation operations on the retrieved target unstructured raw data. The preprocessing pipeline within the data parsing mechanism is instantiated. The noise removal and cleaning module reads the binary or text stream of the original document and applies a series of filters: removing formatting noise such as XML / HTML tags and table border characters; applying rule-based or statistical spelling correction to fix common errors, such as correcting "cardiac infarction" to "myocardial infarction"; and normalizing the formats of numbers and dates. The sensitive information desensitization module runs synchronously, using regular expressions or lightweight named entity recognition models to locate direct identifiers other than the identity code (such as names, phone numbers, and detailed addresses) in the document and to replace or mask them. The serialization word segmentation module then operates on the cleaned plain text, loading a segmentation model or dictionary optimized for the medical field to split each sentence into a sequence of words or subwords. For Chinese, a segmentation model based on a bidirectional long short-term memory network combined with a conditional random field may be adopted; for English, subword segmentation based on byte pair encoding may be adopted. This process outputs the target basic text sequence, a structured representation consisting of vocabulary indices and their position information, which strips the formatting noise and direct identifiers of the original document while retaining its semantic content for in-depth analysis.
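A simplified sketch of the cleaning and desensitization filters follows. The regular expressions (in particular the 11-digit mobile number pattern), the `[PHONE]` placeholder, and the regex tokenizer are assumptions; the tokenizer is only a stand-in for the BiLSTM-CRF or byte pair encoding models mentioned above.

```python
import re

TAG = re.compile(r"<[^>]+>")                             # XML / HTML tags
BORDER = re.compile(r"[|+\-=]{3,}")                      # table border runs
DATE = re.compile(r"(\d{4})[/.](\d{1,2})[/.](\d{1,2})")  # 2024/3/5 or 2024.3.5
PHONE = re.compile(r"\b1\d{10}\b")                       # assumed mobile number format


def clean(text):
    """Strip markup and borders, normalize dates to YYYY-MM-DD, mask phone numbers."""
    text = TAG.sub(" ", text)
    text = BORDER.sub(" ", text)
    text = DATE.sub(lambda m: f"{m[1]}-{int(m[2]):02d}-{int(m[3]):02d}", text)
    text = PHONE.sub("[PHONE]", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Placeholder tokenizer: mask tokens, word runs, and single punctuation marks."""
    return re.findall(r"\[[A-Z]+\]|\w+|[^\w\s]", text)
```

Masking before tokenization means that direct identifiers never appear in the token sequence passed to the semantic extraction network.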
[0096] The target basic text sequence is input into the deep semantic extraction network integrated in the data parsing mechanism for inference. This network is a pre-trained and task-fine-tuned deep learning model, typically based on a Transformer architecture (such as BERT or RoBERTa variants). During fine-tuning, the model is optimized on medical record segments densely annotated with medical entities (e.g., diseases, symptoms, examinations, treatments, drugs) and relationships (e.g., disease-symptom, drug-treatment), by minimizing a joint loss function of sequence labeling (e.g., using the BIOES tagging scheme) and entity-relationship classification. The optimizer is AdamW, with gradient clipping and learning rate decay applied. In forward inference mode, the preprocessed text sequence is converted into token IDs with positional encodings before being input into the model. The model's multi-layer Transformer encoder computes the representation of each token in the sequence relative to the global context through a self-attention mechanism. The hidden states of the last layer are input into two task-specific heads: a conditional random field layer predicts the entity type label for each token, thereby identifying entity boundaries; an attention-based or fully connected relation classifier classifies candidate entity pairs (such as two disease entities identified in a sentence) and determines whether a preset relation type exists between them. The inference process uses the loaded domain knowledge graph as an external constraint or feature enhancement. For example, prior association information of entity concepts is injected into the model's attention calculation through knowledge graph embeddings, thereby improving the confidence in recognizing co-occurring, strongly related entities such as "coronary atherosclerosis" and "myocardial ischemia". The model outputs a structured set of target key business entities, including the text fragment of each entity, its type, its position in the sequence, and the relation triples between entities.
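The conversion from per-token BIOES labels to entity spans can be sketched as follows. The tag strings such as `B-Disease` and the output dictionary keys are illustrative assumptions; the sketch covers only the decoding of predicted labels, not the network itself.

```python
def decode_bioes(tokens, tags):
    """Turn BIOES tags into entity records with text, type, and token positions."""
    entities = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            entities.append({"text": tokens[i], "type": label, "start": i, "end": i})
            start = None
        elif prefix == "B":
            # Open a multi-token entity.
            start, etype = i, label
        elif prefix == "I" and start is not None and label == etype:
            continue
        elif prefix == "E" and start is not None and label == etype:
            # Close the open entity.
            text = " ".join(tokens[start:i + 1])
            entities.append({"text": text, "type": label, "start": start, "end": i})
            start = None
        else:
            # "O" or an ill-formed sequence: discard any open span.
            start = None
    return entities
```

For Chinese text the tokens would be joined without spaces; the space join here assumes English tokens.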
[0097] The feature alignment template is invoked to convert the target key business entities into a unified numerical vector format. The feature alignment template, acting as a container for a set of mapping rules, is loaded and interpreted in this step. The mapping engine iterates through each rule in the template. Rules can be direct lookup mappings, such as "Entity Type: Disease, Standardized Name: Diabetes → Feature Dimension Index: 15, Value: 1". More complex rules involve conditional judgments on entity attributes, such as "IF Entity Type: Test Indicator AND Standardized Name: Serum Creatinine AND Numerical Attribute > 133 μmol / L THEN Feature Dimension Index: 28, Value: 2 (representing abnormal elevation)". For entities with relationships, rules may trigger the generation of combined features, such as "IF Entity A: Disease (Hypertension) AND Entity B: Drug (Nifedipine) AND Relationship: Treatment THEN Feature Dimension Index: 42 (History of Hypertension Medication), Value: 1". The engine needs to handle entity disambiguation (e.g., determining "CA" as "cancer" rather than "calcium" based on context) and numerical standardization. For continuous numerical features (such as age 65), the same normalizer as used in the training phase (e.g., the previously saved mean and standard deviation) is applied for Z-score transformation. After all rules are executed, a fixed-length (e.g., 300-dimensional) numerical vector is generated. Each position in this vector corresponds to a predefined risk factor dimension in the feature label system, and its value represents the quantitative evaluation result of the object to be processed in that dimension. This vector is the target standardized feature data, which is fully compatible with the standardized feature data used for training the model in terms of format, dimensional order, and numerical meaning. It can be directly used as input to the trained anomaly analysis model for forward propagation to calculate the risk score.
[0098] This embodiment achieves real-time, privacy-compliant characterization for specific individuals through the aforementioned process. Precise routing and localized retrieval of identification information ensure that raw medical data remains within the storage node. Integrated deep semantic extraction and rule-based feature mapping objectively and consistently transform messy, unstructured text into standardized numerical vectors perfectly aligned with the trained model. This enables the generation of high-quality input features for each insurance applicant, directly usable for automated risk assessment, while strictly protecting patient data privacy, thus bridging the crucial gap between distributed medical data and centralized intelligent decision-making.
[0099] In one embodiment, step S50 above includes: S501, Input the target standardized feature data into the anomaly analysis model, and perform forward propagation processing through the anomaly analysis model to obtain the predicted probability value representing the anomaly probability; S502, parse the preset business conditions and obtain the numerical limit threshold and exclusion conditions; S503, determine whether the predicted probability value falls within the numerical range limited by the numerical limit threshold, and compare whether the target standardized feature data contains the feature items defined by the exclusion condition; S504, based on the judgment result and the comparison result, the predicted probability value is weighted and corrected or the result is overwritten to generate an anomaly quantification result that characterizes the degree of anomaly of the object to be processed.
[0100] In this embodiment, the target standardized feature data is input into the anomaly analysis model, which is a pre-trained machine learning model instance. A typical architecture for the anomaly analysis model can be a deep neural network, such as a multilayer perceptron, where the number of neurons in the input layer is strictly equal to the dimension of the target standardized feature data. This network contains one or more hidden layers, each consisting of multiple neurons connected by trainable weight matrices and bias vectors, and applies nonlinear activation functions, such as rectified linear units or hyperbolic tangent functions. The output layer is designed according to the task; for binary risk assessment tasks, it typically uses a neuron with a sigmoid activation function, mapping the linear combination of the previous layer's output to the (0,1) interval. The model's parameters (weights and biases) have been optimized through distributed data training during federated learning. During forward propagation, the target standardized feature data, as the input vector, enters from the input layer, sequentially undergoes linear transformations and nonlinear activations in each hidden layer, and finally, a scalar value, i.e., the predicted probability, is calculated through the output layer. This value represents the probability estimate calculated by the model based on the input features that the object to be processed belongs to the high-risk category. In the context of healthcare insurance, this value can be interpreted as the estimated likelihood of the insured experiencing a specific health event as stipulated in the insurance contract in the future.
[0101] The system parses pre-defined business conditions, which are stored in a structured form, such as in configuration files, database tables, or the knowledge base of a rules engine. These business conditions are formulated based on the insurance product's risk control strategy, actuarial data, and regulatory requirements. The parsing process is executed by a condition parsing module, which reads and interprets the condition definitions. Numerical limit thresholds are one type of condition, defining the risk level boundaries for predicted probability values; for example, [0.0, 0.1] is defined as low risk, (0.1, 0.3] as medium risk, and (0.3, 1.0] as high risk, so that every probability value falls into exactly one interval. Exclusion conditions are another type, defining hard rules based on the feature data itself, usually expressed as logical expressions, used to identify specific high-risk feature combinations. For example, an exclusion condition might be stated as: a high-risk flag is triggered when the feature value of "history of malignant tumor" is true and the feature value of "recent treatment time" is less than 3 years. The parsing module converts the conditions in text or code form into in-memory data structures, such as threshold range objects and feature matching rule trees, for use by subsequent validation logic.
[0102] The operation of determining whether the predicted probability value falls within the numerical range defined by the numerical limit threshold is performed by an interval determiner. The interval determiner compares the predicted probability value with all the threshold intervals obtained from the analysis to determine its corresponding interval, thus obtaining a preliminary risk level classification. Simultaneously, the operation of comparing whether the target standardized feature data contains features defined by the exclusion conditions is performed by a feature matcher. The feature matcher iterates through all exclusion conditions, and for each condition, checks whether the value of the corresponding feature dimension in the target standardized feature data satisfies the logical relationship specified in the condition (such as equal to, greater than, or contained in a set). For example, for the above exclusion conditions, the feature matcher will check whether the value of the "history of malignant tumors" dimension is 1 (true) and whether the value of the "time of most recent treatment" dimension is less than 3. The comparison result is a Boolean value indicating whether at least one exclusion condition has been triggered.
[0103] Based on the judgment and comparison results, the predicted probability values are either weighted or overwritten. This step is performed by a result synthesizer. The synthesis logic can be predefined as a decision table or rule set. For example, a typical synthesis logic is: if the comparison result triggers any exclusionary conditions, the interval judgment result of the predicted probability value is ignored, and a preset maximum risk quantification value is directly output, i.e., result overwriting. If no exclusionary conditions are triggered, the predicted probability value is mapped to a corresponding quantification value according to the interval it falls into, such as 0.2 for low risk, 0.5 for medium risk, and 0.8 for high risk, or the original predicted probability value can be used directly. Weighted correction is another optional logic. For example, when certain non-mandatory exclusionary conditions are triggered, a penalty term can be added to the predicted probability value, and then the interval is re-judged or the corrected value is directly output. The synthesizer performs calculations according to the preset logic and finally outputs an anomaly quantification result that characterizes the degree of anomaly of the object to be processed. This result can be a continuous risk score or a discrete risk level code, but its design aims to comprehensively reflect the dual conclusions of data-driven model prediction and domain knowledge rule verification.
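The interplay of the interval determiner, the feature matcher, and the result synthesizer can be sketched as below. The sketch assumes contiguous intervals with inclusive upper bounds, uses the example quantification values 0.2, 0.5, and 0.8 from the text, and treats a maximum risk value of 1.0, the feature keys, and the additive penalty as hypothetical choices.

```python
# (inclusive upper bound, level name, quantification value); bounds are assumptions.
LEVELS = [(0.1, "low", 0.2), (0.3, "medium", 0.5), (1.0, "high", 0.8)]
MAX_RISK = 1.0  # hypothetical preset maximum risk quantification value


def risk_level(p):
    """Interval determiner: map a probability to (level, quantification value)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    for upper, name, score in LEVELS:
        if p <= upper:
            return name, score
    raise AssertionError("unreachable: the last bound is 1.0")


def exclusion_triggered(features):
    """Feature matcher for the example rule: tumor history and treatment within 3 years."""
    recent = features.get("years_since_last_treatment", float("inf")) < 3
    return features.get("malignant_tumor_history") == 1 and recent


def synthesize(p, features, soft_penalty=0.0):
    """Result synthesizer: hard override first, otherwise penalty-adjusted interval mapping."""
    if exclusion_triggered(features):
        return {"level": "high", "score": MAX_RISK, "overridden": True}
    adjusted = min(1.0, p + soft_penalty)
    level, score = risk_level(adjusted)
    return {"level": level, "score": score, "overridden": False}
```

Evaluating the hard rules before the interval lookup guarantees that a low model probability cannot mask a triggered exclusion condition.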
[0104] This embodiment combines data-driven model prediction with domain-knowledge-based business rules through the aforementioned joint verification process. The resulting anomaly quantification results possess both the foresight of statistical learning and the robustness of business rules. The model provides risk probabilities based on complex pattern recognition, while the business conditions embed clear risk control red lines and actuarial logic. These complementary approaches effectively enhance the accuracy, interpretability, and ability to handle boundary cases in risk assessment. This lays a solid foundation for making reliable and compliant insurance decisions based on the quantification results.
[0105] In one embodiment, step S60 above includes: S601, Obtain a preset hierarchical mapping strategy, wherein the hierarchical mapping strategy defines the correspondence between the numerical range of the anomaly quantification result and the processing decision type, wherein the processing decision type includes pass, review and rejection processing; S602, determine the target value range into which the abnormal quantification result falls, and find the target processing decision type corresponding to the target value range in the hierarchical mapping strategy; S603, when the target processing decision type is pass or reject processing, an analysis report containing the judgment criteria is generated as the processing decision; S604, when the target processing decision type is review, extract the key feature information associated with the abnormal quantification result to generate a review work order; S605, Receive the correction instruction for the review work order, and use the correction instruction as a processing decision; S606, The processing decision is sent to the business node through the federated learning network.
[0106] In this embodiment, a preset hierarchical mapping strategy is obtained. This strategy is typically stored in the form of a configuration file, database table, or rule engine knowledge base, and is predefined by the risk management or product department of the insurance institution based on the risk tolerance, market strategy, and actuarial model of the specific insurance product. The core data structure of the hierarchical mapping strategy defines the mapping relationship between the numerical range of abnormal quantification results (e.g., continuous risk score range or discrete risk level) and three processing decision types (pass, review, and rejection). For example, the strategy may stipulate that a risk score in the range [0.0, 0.2) corresponds to a "pass" decision, in the range [0.2, 0.6) a "review" decision, and in the range [0.6, 1.0] a "reject" decision. The loading and parsing of the strategy is handled by a configuration management module. When the coordination node or business node starts up or receives a processing request, this module reads the strategy data from persistent storage and converts it into an efficient in-memory lookup structure, such as a range tree or hash table, for fast matching.
[0107] Determining which target numerical interval an abnormal quantification result falls into involves matching calculations between numerical intervals. The system compares the generated abnormal quantification result (a floating-point number or discrete level code) with all numerical intervals defined in the hierarchical mapping strategy. The comparison logic typically employs interval boundary checks to determine which interval the result belongs to. For example, for a risk score of 0.45, the system iterates through the intervals defined in the strategy and finds that 0.45 is greater than or equal to 0.2 and less than 0.6, thus determining that it falls into the interval corresponding to the "review" decision. The identifier of this target numerical interval is recorded. Subsequently, in the data structure of the hierarchical mapping strategy, the system uses the identifier of this target numerical interval as a key to search for and retrieve the corresponding target processing decision type. The search operation is a direct key-value retrieval with a time complexity of O(1) or O(log n), ensuring high efficiency.
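A minimal sketch of the interval lookup, using the example ranges [0.0, 0.2), [0.2, 0.6), and [0.6, 1.0] and a binary search over the sorted boundaries, is shown below. The function and constant names are hypothetical.

```python
from bisect import bisect_right

# Interior boundaries of the example strategy: [0.0, 0.2) pass, [0.2, 0.6) review, [0.6, 1.0] reject.
BOUNDARIES = [0.2, 0.6]
DECISIONS = ["pass", "review", "reject"]


def decide(score):
    """Return the processing decision type for a risk score in O(log n)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("risk score must lie in [0, 1]")
    # bisect_right places a score equal to a boundary in the interval to its right,
    # which matches the left-closed, right-open ranges above.
    return DECISIONS[bisect_right(BOUNDARIES, score)]
```

With this layout a score of 0.45 maps to "review", consistent with the worked example in the text.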
[0108] When the target processing decision type is pass or rejection, the system generates an analysis report containing the basis for the decision as the processing decision. The generation of the analysis report is executed by a report generation engine. The report engine collects and integrates key data and logical judgment results from the entire risk assessment process, including but not limited to: input identification information, key feature values in the generated target standardized feature data, predicted probability values output by the anomaly analysis model, results of business condition verification (such as triggered exclusionary conditions), anomaly quantification results, and the final matched hierarchical mapping strategy range and decision type. The report engine organizes this information into a structured document according to predefined templates (such as XML, JSON, or natural language paragraphs). The report typically includes machine-readable fields and human-readable explanations, aiming to provide transparency, traceability, and compliance audit basis for the decision. The generated report itself serves as the carrier of the processing decision, its content clearly indicating the automatic pass or automatic rejection conclusion.
[0109] When the target processing decision type is review, the system initiates a manual review workflow. First, the system extracts key feature information associated with the anomaly quantification result. This information is typically derived from the process of generating the anomaly quantification result, and includes the feature dimensions and their values that significantly contribute to the final risk score or trigger specific rules. For example, the system might select the top five feature dimensions with the highest values from the target standardized feature data, or select feature items that trigger exclusionary conditions. A work order generation module encapsulates this key feature information, the anomaly quantification result, and related context (such as the policyholder's basic information and product type) into a structured review work order. This work order usually includes a detailed description of the matter to be reviewed, information points requiring manual confirmation or supplementation, and possible operation options. The work order is submitted to a task queue or workflow management system, awaiting assignment to a human underwriter with the appropriate permissions. Subsequently, the system enters a waiting state, listening for input from the human underwriter's interface. After reviewing the work order, the human underwriter submits a correction instruction through the interface. This instruction may be a new decision conclusion (such as "Approved but excluding a certain liability," "Rejected"), an adjustment to the risk level, or additional remarks. The system receives this correction instruction through an event listener or API endpoint, verifies and formats it, and determines it as the final processing decision.
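The work order flow above can be sketched as two small functions: one that selects key features and packages the work order, and one that validates the underwriter's correction instruction. The dictionary layout, the set of allowed conclusions, and the function names are assumptions for illustration; queueing and the underwriter interface are left out.

```python
# Hypothetical set of conclusions an underwriter may submit.
ALLOWED_CONCLUSIONS = {"approve", "approve_with_exclusion", "reject"}


def build_work_order(order_id, features, triggered_items, anomaly_result, context, k=5):
    """Package the top-k feature values plus any exclusion-triggering items."""
    ranked = sorted(features.items(), key=lambda item: item[1], reverse=True)
    key_features = dict(ranked[:k])
    for name in triggered_items:
        if name in features:
            key_features[name] = features[name]
    return {
        "order_id": order_id,
        "anomaly_result": anomaly_result,
        "key_features": key_features,
        "context": context,
        "options": sorted(ALLOWED_CONCLUSIONS),
    }


def accept_correction(order, instruction):
    """Verify and format a correction instruction into the final decision."""
    if instruction.get("order_id") != order["order_id"]:
        raise ValueError("instruction does not belong to this work order")
    conclusion = instruction.get("conclusion")
    if conclusion not in ALLOWED_CONCLUSIONS:
        raise ValueError(f"unsupported conclusion: {conclusion!r}")
    return {
        "order_id": order["order_id"],
        "decision": conclusion,
        "remarks": instruction.get("remarks", ""),
    }
```

Merging the exclusion-triggering items into the top-k selection is one possible reading of "or" in the paragraph above; a system could equally use only one of the two selection rules.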
[0110] The processing decisions are sent to business nodes via a federated learning network. Whether the decision is an automatically generated analysis report or a manually reviewed correction instruction, the system encapsulates it into data packets conforming to the communication protocol. The sending operation is performed by the coordinating node or the component responsible for decision aggregation. Data packets are transmitted through a pre-established application-layer connection between the coordinating node and the business nodes, typically using HTTPS POST requests, remote procedure calls, or message queues. In addition to the decision content (such as reports or instructions), the data packets may also contain transaction IDs, timestamps, and recipient identifiers to ensure reliable and traceable transmission. Upon receiving the data packets, the business nodes parse and confirm them, triggering subsequent internal business processes, such as updating policy status, notifying customers, or recording underwriting conclusions. The entire feedback process ensures that decision information is accurately, securely, and promptly returned to the requesting business node, completing the closed loop of the federated learning analysis process.
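A minimal sketch of the packet encapsulation and receipt confirmation is given below. It covers only the envelope (transaction ID, timestamp, recipient identifier, decision content) and deliberately omits the transport, which the text leaves open (HTTPS POST, remote procedure call, or message queue). The JSON layout and function names are assumptions.

```python
import json
import uuid
from datetime import datetime, timezone

REQUIRED_FIELDS = {"transaction_id", "timestamp", "recipient", "decision"}


def build_packet(decision: dict, recipient_id: str) -> bytes:
    """Wrap a processing decision in a traceable envelope (hypothetical layout)."""
    envelope = {
        "transaction_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "recipient": recipient_id,
        "decision": decision,
    }
    return json.dumps(envelope).encode("utf-8")


def receive_packet(raw: bytes, own_id: str) -> dict:
    """Parse a packet on the business node and return an acknowledgement."""
    packet = json.loads(raw.decode("utf-8"))
    missing = REQUIRED_FIELDS - packet.keys()
    if missing:
        raise ValueError(f"packet is missing fields: {sorted(missing)}")
    if packet["recipient"] != own_id:
        raise ValueError("packet addressed to a different node")
    return {"transaction_id": packet["transaction_id"], "status": "received"}
```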
[0111] For example, to improve the accuracy and efficiency of critical illness insurance underwriting, an insurance company plans to train a risk assessment model using electronic medical record data from multiple partner hospitals. Due to the sensitivity of medical data, centralized processing of data from each hospital is not feasible. The insurance company deploys a coordination node, and each partner hospital deploys its own data node on its internal network. The insurance company's automated underwriting system acts as the business node. After startup, the hospital servers and the insurance company's system initiate connection requests to the coordination node's network registration service. The coordination node assigns a unique network identifier to each party and records it in the node registry. Based on this registry, it establishes a secure connection with each hospital's data node using a transport layer security protocol and an application-layer connection with the business node. The coordination node defines the feature dimensions to be extracted from medical records based on insurance risk control requirements, such as medical history, surgical history, and key laboratory indicators, and generates feature extraction templates, which are then distributed to each hospital's data node via the secure connections. Based on these templates, the hospital data nodes locally configure a natural language processing parsing engine built upon a pre-trained clinical BERT model, thus forming the data parsing mechanism. At the same time, to protect data privacy, the coordination node generates a public key for the Paillier homomorphic encryption algorithm, along with a set of Gaussian noise parameters that satisfy the differential privacy budget, and distributes them to each hospital data node. The hospital data nodes use the public key to initialize their local encryption clients and configure local noise addition strategies based on the noise parameters, together forming the parameter encryption interaction mechanism.
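The text does not say how the Gaussian noise parameters are derived from the differential privacy budget. As one possible illustration, the sketch below uses the classical Gaussian mechanism calibration, sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon, which applies for 0 < epsilon < 1. The choice of this calibration, the gradient clipping step, and all function names are assumptions rather than details from the embodiment.

```python
import math
import random


def gaussian_sigma(epsilon: float, delta: float, sensitivity: float) -> float:
    """Noise scale for the classical Gaussian mechanism (assumed calibration)."""
    if not (0.0 < epsilon < 1.0):
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon


def clip_gradient(gradient, max_norm: float):
    """Scale the gradient so its L2 norm is at most max_norm (bounds sensitivity)."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def add_gaussian_noise(gradient, sigma: float, rng: random.Random):
    """Local noise addition strategy: independent Gaussian noise per coordinate."""
    return [g + rng.gauss(0.0, sigma) for g in gradient]
```

In such a setup, the coordination node would distribute values like epsilon, delta, and the clipping norm, and each data node would compute sigma locally before perturbing its gradients.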
[0112] Once the network and mechanisms are ready, the model training phase begins. Each hospital's data node invokes its local data parsing mechanism. This mechanism first loads an insurance domain knowledge graph containing standardized medical terms such as diseases, surgeries, and medications, along with their interrelationships. It also loads feature alignment templates defining how to map medical terms to insurance risk features. Next, the data nodes read a large amount of locally stored historical electronic medical records as unstructured raw sample data. The data parsing mechanism denoises and cleans these medical record texts, removing irrelevant formatting characters and desensitizing sensitive information such as patient names. Then, it performs word segmentation to generate a structured sample base text sequence. This sequence is fed into an integrated deep semantic extraction network. This network is a finely tuned Transformer model that utilizes the semantic information of the knowledge graph to accurately identify key business entities in the text, such as "type II diabetes," "coronary artery bypass grafting," and "taking aspirin," and extracts the relationships between them, including time and treatment details. Finally, based on the feature alignment template, these entities are mapped and converted into a unified numerical vector. For example, "diabetes history of more than 10 years" is mapped to the feature "endocrine and metabolic disease history" and assigned a higher risk code, generating a batch of standardized feature data for training.
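The final mapping step, from extracted entities to a unified numerical vector, can be illustrated with a small lookup-based sketch. The entity extraction itself (the Transformer network and knowledge graph) is out of scope here and is represented by already-extracted entity dictionaries. The feature names other than the endocrine and metabolic history example, the risk code values, and the 10-year rule encoding are assumptions for illustration.

```python
# Hypothetical feature alignment template: fixed feature order plus an
# entity-to-feature mapping. Only the diabetes example comes from the text.
FEATURE_ORDER = [
    "endocrine_metabolic_history",
    "cardiac_surgery_history",
    "antiplatelet_medication",
]
ENTITY_TO_FEATURE = {
    "type II diabetes": "endocrine_metabolic_history",
    "coronary artery bypass grafting": "cardiac_surgery_history",
    "taking aspirin": "antiplatelet_medication",
}


def risk_code(entity: dict) -> int:
    """Assumed coding: 2 for a history longer than 10 years, otherwise 1."""
    return 2 if entity.get("duration_years", 0) > 10 else 1


def to_feature_vector(entities: list) -> list:
    """Map extracted entities onto the fixed-order numerical vector."""
    vector = [0] * len(FEATURE_ORDER)
    for entity in entities:
        feature = ENTITY_TO_FEATURE.get(entity["name"])
        if feature is None:
            continue  # entity not covered by the template
        index = FEATURE_ORDER.index(feature)
        vector[index] = max(vector[index], risk_code(entity))
    return vector


print(to_feature_vector([
    {"name": "type II diabetes", "duration_years": 12},
    {"name": "taking aspirin"},
]))
```

Because every data node applies the same template and feature order, vectors produced at different hospitals are directly comparable during training.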
[0113] The coordinating node initializes a neural network model for predicting the risk of cardiovascular disease incidence within five years and distributes the initial parameters to all hospital data nodes. Each hospital uses locally generated standardized feature data for training and its corresponding historical disease labels to perform multiple rounds of local training on the model parameters, calculating the updated gradient of the model parameters. To protect privacy, the hospitals invoke a parameter encryption interaction mechanism, encrypting the gradient using a homomorphic encryption public key and adding Gaussian noise according to the configuration to generate an encrypted gradient parameter package, which is then uploaded to the coordinating node. The coordinating node aggregates the encrypted gradient packages from all hospitals, calculates the average value in the encrypted state, decrypts it, and updates the global model parameters. This process iterates hundreds of times until the model performance converges, ultimately generating a high-performance anomaly analysis model.
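The encrypted aggregation round can be sketched with a toy additively homomorphic scheme. The sketch below implements a textbook Paillier cryptosystem with deliberately tiny, fixed primes so that it runs with the standard library only; it is not secure and is not the embodiment's implementation. The fixed-point scale, the function names, and the choice to sum in ciphertext and divide by the node count after decryption are assumptions. As in the text, the coordination node is assumed to hold the decryption key.

```python
import math
import random

# Toy key material: two known Mersenne primes, far too small for real use.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1), private
MU = pow(LAM, -1, N)                               # private (Python 3.8+)
SCALE = 10**6                                      # assumed fixed-point scale


def encrypt(m: int, rng: random.Random) -> int:
    """Paillier encryption with generator g = N + 1."""
    while True:
        r = rng.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return ((1 + (m % N) * N) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    """Paillier decryption; values above N // 2 are read as negatives."""
    m = ((pow(c, LAM, N2) - 1) // N) * MU % N
    return m - N if m > N // 2 else m


def encrypt_gradient(gradient, sigma, rng):
    """Data node: add Gaussian noise, encode as fixed point, encrypt."""
    return [encrypt(round((g + rng.gauss(0.0, sigma)) * SCALE), rng) for g in gradient]


def aggregate(packages):
    """Coordination node: multiplying ciphertexts adds the plaintexts."""
    total = list(packages[0])
    for package in packages[1:]:
        total = [(a * b) % N2 for a, b in zip(total, package)]
    return total


def decrypt_average(total, count):
    """Decrypt the encrypted sum and divide by the number of data nodes."""
    return [decrypt(c) / SCALE / count for c in total]
```

For instance, two data nodes submitting gradients [0.5, -0.25] and [0.1, 0.05] with sigma set to 0 would yield an average close to [0.3, -0.1]; with a non-zero sigma the result is perturbed by the added noise.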
[0114] When a new customer submits an insurance application, the insurance company's business nodes send a request containing the customer's ID number to the coordination node. Based on this identification information, the coordination node determines that the customer's historical medical records are primarily stored in the data node of cooperating hospital A and initiates a query to that node. Hospital A's data node retrieves the customer's electronic medical record from its local database as the target unstructured raw data. The data parsing mechanism is triggered, performing the same cleaning and word segmentation operations on this medical record to generate the target basic text sequence. This sequence is input into a deep semantic extraction network to identify key target business entities, such as "5-year history of hypertension" and "high cholesterol." Using the same feature alignment template, these entities are mapped and converted into a target standardized feature data vector that is completely consistent with the training data format.
[0115] The feature vector is input into the trained anomaly analysis model. The model performs forward propagation and outputs a predicted probability value of 0.65, representing the customer's basic risk probability. The system simultaneously parses preset business conditions, obtaining a numerical threshold of "risk score greater than 0.6 requires manual review" and exclusionary conditions such as "risk label upgrade if there is a history of both hypertension and dyslipidemia." The system determines that the score of 0.65 falls within the threshold range requiring review, and the feature data matches the exclusionary condition of "hypertension combined with dyslipidemia." Based on this comparison result, the system decides to overwrite the model's prediction, generating a higher anomaly quantification result labeled "high risk."
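The joint verification in this example can be sketched as a threshold check combined with an exclusionary-condition check. The 0.6 threshold, the hypertension plus dyslipidemia condition, and the "high risk" label come from the example above; the three-level label ladder, the one-level upgrade rule, and the function signature are assumptions made for the sketch.

```python
# Assumed label ladder; only "high risk" appears in the text.
LEVELS = ["low", "elevated", "high risk"]


def joint_verify(probability, features, threshold=0.6,
                 exclusion=("hypertension", "dyslipidemia")):
    """Combine the model probability with preset business conditions."""
    above_threshold = probability > threshold
    exclusion_hit = all(features.get(name, 0) > 0 for name in exclusion)
    level = 1 if above_threshold else 0
    if exclusion_hit:
        # Overwrite the model-based level with an upgraded risk label.
        level = min(level + 1, len(LEVELS) - 1)
    return {
        "probability": probability,
        "above_threshold": above_threshold,
        "exclusion_hit": exclusion_hit,
        "label": LEVELS[level],
    }


result = joint_verify(0.65, {"hypertension": 1, "dyslipidemia": 1})
print(result["label"])
```

For the customer in the example, the probability of 0.65 exceeds the threshold and both exclusion features are present, so the sketch returns the "high risk" label.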
[0116] Subsequently, based on a pre-defined hierarchical mapping strategy, the system determines that the target processing decision type corresponding to this "high risk" anomaly quantification result is "review." The system extracts the key feature information leading to the high risk, such as "history of hypertension" and "dyslipidemia indicators," automatically generates a detailed review work order, and assigns it to the expert workbench of the insurance company's underwriting department. After reviewing the work order and drawing on clinical experience, the underwriting expert inputs the correction instruction "underwriting after excluding cardiovascular and cerebrovascular disease liability." The coordination node takes this correction instruction as the final processing decision and sends it back to the business node through the secure channel of the federated learning network. Upon receiving the decision, the business node automatically updates the application status, generates a draft policy with exclusion clauses, and notifies the customer of the underwriting conclusion. The entire process is completed without centralizing any original medical data, protecting patient privacy while achieving efficient and accurate underwriting.
[0117] This embodiment transforms risk quantification results into clear and actionable business decisions through preset strategies, and achieves transparency in automated decision-making and precision in manual review through structured reports and work orders. Decision information is reliably fed back through a federated network, ensuring a seamless connection between risk assessment results and business execution. This improves the efficiency and quality of underwriting automation while maintaining the compliance and auditability of the process.
[0118] In one embodiment, a privacy-preserving analysis device based on federated learning is provided, which corresponds one-to-one with the privacy-preserving analysis method based on federated learning in the above embodiments. Referring to Figure 3, which is a schematic diagram of the functional modules of a preferred embodiment of the privacy-preserving analysis device based on federated learning of the present invention, the device includes a network initialization module 10, a sample parsing module 20, a federated training module 30, a target parsing module 40, an anomaly determination module 50, and a decision feedback module 60. Detailed descriptions of each functional module are as follows: The network initialization module 10 is used to construct a federated learning network containing a coordination node, multiple data nodes, and business nodes, and to configure a data parsing mechanism and a parameter encryption interaction mechanism in the federated learning network. The sample parsing module 20 is used to perform structured processing on the unstructured raw sample data stored in the data nodes through the data parsing mechanism to generate standardized feature data for training. The federated training module 30 is used to control the data nodes and the coordination node to perform encrypted model parameter interaction based on the parameter encryption interaction mechanism, and to train and generate an anomaly analysis model using the standardized feature data for training. The target parsing module 40 is used to obtain the identification information of the object to be processed, extract the target unstructured raw data associated with the identification information in the data node through the data parsing mechanism, and process it to generate target standardized feature data. The anomaly determination module 50 is used to input the target standardized feature data into the anomaly analysis model and perform joint verification in combination with preset business conditions to generate an anomaly quantification result. The decision feedback module 60 is used to determine a processing decision based on the anomaly quantification result and to feed the processing decision back to the business node.
[0119] Specific limitations regarding the privacy-preserving analysis apparatus based on federated learning can be found in the foregoing limitations of the privacy-preserving analysis method based on federated learning, and will not be repeated here. Each module in the aforementioned privacy-preserving analysis apparatus can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can invoke and execute the operations corresponding to each module.
[0120] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 4. The computer device includes a processor, memory, a network interface, and a database connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile and/or volatile storage media and internal memory. The non-volatile storage media store the operating system, computer programs, and database. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The network interface is used to communicate with external clients via a network connection. When the computer program is executed by the processor, it implements the server-side functions or steps of a privacy-preserving analysis method based on federated learning.
[0121] In one embodiment, a computer device is provided, which may be a client, and its internal structure may be as shown in Figure 5. The computer device includes a processor, memory, a network interface, a display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media store the operating system and computer programs. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The network interface is used to communicate with an external server via a network connection. When executed by the processor, the computer program implements the client-side functions or steps of a privacy-preserving analysis method based on federated learning.
[0122] In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the following steps: Construct a federated learning network that includes a coordination node, multiple data nodes, and business nodes, and configure a data parsing mechanism and a parameter encryption interaction mechanism in the federated learning network; The data parsing mechanism is used to perform structured processing on the unstructured raw sample data stored in the data nodes to generate standardized feature data for training. Based on the parameter encryption interaction mechanism, the data node and the coordination node are controlled to perform encrypted model parameter interaction, and the anomaly analysis model is trained and generated using the standardized feature data for training. Obtain the identification information of the object to be processed, extract the target unstructured raw data associated with the identification information in the data node through the data parsing mechanism and process it to generate target standardized feature data; The target standardized feature data is input into the anomaly analysis model and jointly verified in conjunction with preset business conditions to generate anomaly quantification results; Based on the anomaly quantification results, a processing decision is determined, and the processing decision is fed back to the business node.
[0123] In one embodiment, a computer-readable storage medium is provided, which may be non-volatile or volatile, and a computer program is stored thereon, which, when executed by a processor, performs the following steps: Construct a federated learning network that includes a coordination node, multiple data nodes, and business nodes, and configure a data parsing mechanism and a parameter encryption interaction mechanism in the federated learning network; The data parsing mechanism is used to perform structured processing on the unstructured raw sample data stored in the data nodes to generate standardized feature data for training. Based on the parameter encryption interaction mechanism, the data node and the coordination node are controlled to perform encrypted model parameter interaction, and the anomaly analysis model is trained and generated using the standardized feature data for training. Obtain the identification information of the object to be processed, extract the target unstructured raw data associated with the identification information in the data node through the data parsing mechanism and process it to generate target standardized feature data; The target standardized feature data is input into the anomaly analysis model and jointly verified in conjunction with preset business conditions to generate anomaly quantification results; Based on the anomaly quantification results, a processing decision is determined, and the processing decision is fed back to the business node.
[0124] It should be noted that, for the functions or steps implemented by the computer-readable storage medium or computer device described above, reference may be made to the server-side and client-side descriptions in the foregoing method embodiments. To avoid repetition, they are not described one by one here.
[0125] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0126] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
[0127] It should be noted that any software tools or components not belonging to this company appearing in the embodiments of this application are merely illustrative examples and do not represent actual use. All user personal information involved in the embodiments of this application has been authorized (with knowledge and consent) by the relevant parties or has been fully authorized by all parties, and the executing entity may obtain it through various legal and compliant means. The collection, storage, use, processing, transmission, provision, and disclosure of the information, data, and signals involved all comply with relevant laws and regulations and do not violate public order and good morals.
[0128] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims
1. A privacy-preserving analysis method based on federated learning, characterized in that, Includes the following steps: Construct a federated learning network that includes a coordination node, multiple data nodes, and business nodes, and configure a data parsing mechanism and a parameter encryption interaction mechanism in the federated learning network; The data parsing mechanism is used to perform structured processing on the unstructured raw sample data stored in the data nodes to generate standardized feature data for training. Based on the parameter encryption interaction mechanism, the data node and the coordination node are controlled to perform encrypted model parameter interaction, and the anomaly analysis model is trained and generated using the standardized feature data for training. Obtain the identification information of the object to be processed, extract the target unstructured raw data associated with the identification information in the data node through the data parsing mechanism and process it to generate target standardized feature data; The target standardized feature data is input into the anomaly analysis model and jointly verified in conjunction with preset business conditions to generate anomaly quantification results; Based on the anomaly quantification results, a processing decision is determined, and the processing decision is fed back to the business node.
2. The privacy-preserving analysis method based on federated learning as described in claim 1, characterized in that, A federated learning network is constructed, comprising a coordination node, multiple data nodes, and business nodes. A data parsing mechanism and a parameter encryption interaction mechanism are configured within the federated learning network, including: Deploy a network registration service at the coordination node; The network registration service receives connection requests from the data nodes and business nodes. Based on the received connection requests, a network identifier is assigned to each data node and business node that made a connection request, and the assigned network identifier is recorded in the node registry. Based on the node registry, a secure transport layer protocol connection is established between the coordination node and each data node; Establish application-layer connections between the coordination node and the business nodes; At the coordination node, a feature extraction template is generated according to the predefined data structure specifications, and the feature extraction template is distributed to each data node through the secure transport layer protocol connection. Locally on each data node, a natural language processing parsing engine is configured based on the received feature extraction template to form a data parsing mechanism; At the coordination node, a homomorphic encryption public key and differential privacy noise parameters are generated, and distributed to each data node through the secure transport layer protocol connection. At each data node, the local encryption client is initialized using the homomorphic encryption public key, and a local noise addition strategy is configured according to the differential privacy noise parameters. The initialized local encryption client and the configured local noise addition strategy together constitute a parameter encryption interaction mechanism.
3. The privacy-preserving analysis method based on federated learning as described in claim 1, characterized in that, The data parsing mechanism performs structuring processing on the unstructured raw sample data stored in the data nodes to generate standardized feature data for training, including: The data parsing mechanism is invoked at each data node, and a preset domain knowledge graph and feature alignment template are loaded. The domain knowledge graph contains entity terminology definitions and topological mapping relationships between entities. The data parsing mechanism is used to denoise, clean, and serialize the unstructured raw data of the samples stored in the data nodes to generate a basic text sequence of the samples. The sample basic text sequence is input into the deep semantic extraction network integrated by the data parsing mechanism, and the domain knowledge graph is used to locate entity boundaries and extract relationships to obtain key business entities. Based on the feature alignment template, the key business entities are mapped and converted into a unified numerical vector format to generate standardized feature data for training.
4. The privacy-preserving analysis method based on federated learning as described in claim 1, characterized in that, Based on the parameter encryption interaction mechanism, the data node and the coordination node are controlled to exchange encrypted model parameters, and an anomaly analysis model is trained and generated using the standardized feature data for training, including: The global model parameters are initialized at the coordination node and distributed to each data node through the federated learning network. Control each data node to load locally stored standardized feature data for training, and use the standardized feature data for training to perform local iterative training on the global model parameters to generate local model gradients; The parameter encryption interaction mechanism is invoked, and the local model gradient is encrypted using a homomorphic encryption algorithm and differential privacy noise is added to generate an encrypted gradient parameter package. The data node is controlled to upload the encrypted gradient parameter package to the coordination node. The coordination node performs a ciphertext aggregation operation on the encrypted gradient parameter package to update the global model parameters. This process is repeated until a preset convergence condition is met, thereby generating an anomaly analysis model.
5. The privacy-preserving analysis method based on federated learning as described in claim 1, characterized in that, Obtain the identification information of the object to be processed, extract the target unstructured raw data associated with the identification information from the data node through the data parsing mechanism, and process it to generate target standardized feature data, including: Receive a business request for an object to be processed, parse the business request to obtain an identity code as identification information; The corresponding target data node is determined based on the identification information; Locally at the target data node, retrieve the target unstructured raw data that matches the identification information; The data parsing mechanism is triggered to perform noise removal, cleaning, and serialization word segmentation operations on the target unstructured raw data to generate the target basic text sequence; The target basic text sequence is input into the deep semantic extraction network integrated by the data parsing mechanism to identify and separate the target key business entities in the target basic text sequence; The feature alignment template is invoked to map the target key business entity into a unified numerical vector format, generating target standardized feature data.
6. The privacy-preserving analysis method based on federated learning as described in claim 1, characterized in that, The standardized feature data of the target is input into the anomaly analysis model, and joint verification is performed in conjunction with preset business conditions to generate anomaly quantification results, including: The target standardized feature data is input into the anomaly analysis model, and forward propagation processing is performed through the anomaly analysis model to obtain the predicted probability value representing the anomaly probability. Analyze the preset business conditions to obtain numerical limit thresholds and exclusion conditions; Determine whether the predicted probability value falls within the numerical range defined by the numerical limit threshold, and compare whether the target standardized feature data contains the feature items defined by the exclusion condition; Based on the judgment and comparison results, the predicted probability values are weighted and corrected or the results are overwritten to generate an anomaly quantification result that characterizes the degree of anomaly of the object to be processed.
7. The privacy-preserving analysis method based on federated learning as described in claim 1, characterized in that, Based on the anomaly quantification results, a processing decision is determined, and the processing decision is fed back to the business node, including: Obtain a preset hierarchical mapping strategy, wherein the hierarchical mapping strategy defines the correspondence between the numerical range of the anomaly quantification result and the processing decision type, wherein the processing decision type includes pass, review, and rejection; Determine the target numerical range into which the anomaly quantification result falls, and find the target processing decision type corresponding to the target numerical range in the hierarchical mapping strategy; When the target processing decision type is pass or rejection, an analysis report containing the basis for the decision is generated as the processing decision; When the target processing decision type is review, extract the key feature information associated with the anomaly quantification result to generate a review work order; Receive correction instructions for the review work order and use the correction instructions as the processing decision; The processing decision is sent to the business node through the federated learning network.
8. A privacy-preserving analysis device based on federated learning, characterized in that the privacy-preserving analysis device based on federated learning includes: a network initialization module, used to construct a federated learning network containing a coordination node, multiple data nodes, and business nodes, and to configure a data parsing mechanism and a parameter encryption interaction mechanism in the federated learning network; a sample parsing module, used to perform structured processing, through the data parsing mechanism, on the unstructured raw sample data stored in the data nodes to generate standardized feature data for training; a federated training module, used to control the data nodes and the coordination node to exchange encrypted model parameters based on the parameter encryption interaction mechanism, and to train and generate an anomaly analysis model using the standardized feature data for training; a target parsing module, used to obtain the identification information of an object to be processed, and to extract and process, through the data parsing mechanism, the target unstructured raw data associated with the identification information in the data nodes to generate target standardized feature data; an anomaly determination module, used to input the target standardized feature data into the anomaly analysis model and to perform joint verification in combination with preset business conditions to generate an anomaly quantification result; and a decision feedback module, used to determine a processing decision based on the anomaly quantification result and to feed the processing decision back to the business node.
9. A computer device, characterized in that the computer device includes a memory, a processor, and a privacy-preserving analysis program based on federated learning that is stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the privacy-preserving analysis method based on federated learning as described in any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores a privacy-preserving analysis program based on federated learning which, when executed by a processor, implements the steps of the privacy-preserving analysis method based on federated learning as described in any one of claims 1-7.