Entity extraction-oriented federated learning optimization method, system, device and terminal
By constructing an approximate IID entity annotation dataset, the problems of data silos and non-independent identically distributed entities in entity extraction are solved, achieving the effect of improving model accuracy and reducing communication costs while protecting data privacy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIDIAN UNIV
- Filing Date
- 2022-09-22
- Publication Date
- 2026-06-26
AI Technical Summary
Existing deep learning-based entity extraction methods require large amounts of labeled data and are expensive. Data privacy leads to data silos, and non-independent, identically distributed client data causes a decrease in model accuracy.
By constructing an approximate IID entity annotation dataset and utilizing the server to collect shareable data information from the client, an approximately independent and identically distributed dataset is built for model training, reducing data communication volume and improving model accuracy.
While protecting data privacy, the model accuracy is close to that of centralized training, reducing the risk of data leakage, improving the model's accuracy and convergence speed, and reducing training communication costs.
Figure CN115392492B_ABST
Description
Technical Field
[0001] This invention belongs to the field of natural language processing technology, and in particular relates to a federated learning optimization method, system, device and terminal for entity extraction.
Background Technology
[0002] Currently, the main function of entity extraction is to identify words with specific meanings from text data. It involves two steps: first, identifying the start and end positions of entities in a text sequence; and second, classifying the extracted entities. Entity extraction has significant applications across various fields. For example, in the medical field, entity extraction identifies disease-type and drug-type entities in electronic medical records, playing a crucial role in building pharmacovigilance systems, clinical decision support systems, and scientific research and teaching. Furthermore, most natural language processing technologies rely on entity extraction tasks. For instance, in question-answering systems, the accuracy of the system's semantic understanding and analysis of user questions depends on the entity information in the question, and the inferred answer is typically composed of entities from a knowledge base.
[0003] In recent years, deep learning-based entity extraction methods have demonstrated promising results. Deep learning can uncover finer-grained semantic features from text. Furthermore, its powerful ability to automatically extract features eliminates the need for extensive feature engineering, reducing the reliance on domain-specific knowledge. Examples include CNN-CRF, BiLSTM-CRF, CNN-BiLSTM-CRF, BERT-CRF, and ALBERT-BiLSTM-CRF. However, these methods typically require large amounts of labeled data, while the available labeled data for training entity extraction tasks on various platforms is limited. Manually labeling data for entity extraction is also extremely expensive and time-consuming, requiring significant domain expertise. Moreover, textual data in some domains is highly privacy-sensitive, such as patient symptoms, gene sequences, and pathology reports in the medical field. Furthermore, relevant laws and regulations explicitly require companies to protect user data privacy and strictly limit the scope of data transactions. This prevents platforms from sharing entity labeling data, leading to the "data silo" problem, which poses a significant challenge to traditional entity extraction methods.
[0004] Federated learning effectively addresses the aforementioned issues, enabling platforms to perform machine learning while protecting data privacy and complying with laws and regulations, thus solving the problems of data scarcity and data silos. Federated learning aims to coordinate multiple clients to collaboratively build machine learning models in a distributed environment. Furthermore, each client's training dataset does not need to be exposed to other clients; only information related to model training needs to be exchanged. The performance of the resulting federated learning model approaches that of a centralized model (a machine learning model trained by pooling the training datasets of all clients). In recent years, federated learning has been widely applied in fields such as medical image processing, natural language processing, and recommender systems, helping to solve the problem of insufficient data for model training due to data privacy protection, promoting the development of artificial intelligence in the enterprise world, and bringing significant commercial application value.
[0005] Currently, only Ge S et al. have applied federated learning frameworks to entity extraction tasks. This study proposes a medical entity extraction model for medical texts in English corpora and builds a personalized federated learning framework based on the FedAVG algorithm. The scheme achieves F1 scores of 65.16, 82.57, and 32.69 on the CADEC, ADE Corpus, and SMM4H datasets, respectively. However, this study suffers from the following problem: when the client data is non-IID, the client model deviates significantly from the global model, leading to a decrease in model accuracy.
[0006] In summary, entity extraction is a key technology in natural language processing. Existing research largely employs deep learning models for entity extraction tasks, requiring ample entity-annotated data for training. However, the entity-annotated data for a single client is typically limited, and data between clients often requires privacy protection and cannot be directly shared. Current research proposes using federated learning to address the privacy and security issues of shared entity-annotated data. However, the non-independent and identically distributed nature of data across clients leads to significant differences in model parameter spaces between clients, resulting in a decrease in the accuracy of the global model.
[0007] Based on the above analysis, the problems and shortcomings of the existing technology are as follows:
[0008] (1) Existing deep learning-based entity extraction methods require a large amount of labeled data, while manually labeling data for entity extraction tasks is expensive and time-consuming, and requires a lot of domain expertise.
[0009] (2) Text data in some fields is highly private, which makes it impossible for different platforms to share entity annotation data, resulting in the problem of "data silos". This poses a huge challenge to traditional entity extraction methods.
[0010] (3) In existing federated learning-based entity extraction methods, the model parameter spaces of the clients differ greatly because the data of each client is non-independent and identically distributed, which reduces the accuracy of the global model.
Summary of the Invention
[0011] To address the problems existing in the prior art, this invention provides a federated learning optimization method, system, device, and terminal for entity extraction.
[0012] This invention is implemented as follows: a federated learning optimization method oriented towards entity extraction, wherein the federated learning optimization method oriented towards entity extraction includes:
[0013] During the shared data metadata transmission phase, the server sends shared data requests to each selected client participating in federated learning. Each client calculates the shared dataset metadata and uploads it to the server. In the phase of constructing approximate IID entity annotation data, the server constructs an approximate IID entity annotation data index set based on the metadata uploaded by the clients and requests the corresponding data from the respective clients. During the model training phase, the server is treated as a client participating in federated learning training, using the approximate IID entity annotation dataset for model training, resulting in an approximate centralized model $W_{IID}$.
[0014] Furthermore, the metadata includes the sample's index, number of entities, and category.
[0015] Furthermore, the federated learning optimization method for entity extraction includes the following steps:
[0016] Step 1: for each client k in $C_t$, the server sends a data sharing request to the client, obtains the metadata of the client's local shared data, and saves it to the set M;
[0017] Step 2: the server constructs an approximate IID entity annotation data index set U based on M; the main reason for obtaining the index set instead of the dataset is to reduce data communication volume;
[0018] Step 3: the server requests from each client the sample data corresponding to U, obtaining the approximate IID entity annotation dataset $D_{IID}$. The constructed IID entity annotation dataset can be used to train centralized models. A method for constructing an IID dataset suitable for federated learning oriented towards entity extraction is thereby proposed;
[0019] Step 4: the server performs model training on $D_{IID}$ to obtain an approximate centralized model $W_{IID}$.
[0020] Furthermore, M is a set of entity annotation dataset metadata, where each element is the metadata of an entity annotation data sample; U is a set of data sample indices, where each element is the index of a data sample; $D_k(i)$ represents the i-th entity annotation sample in the dataset $D_k$; L represents the set of entity categories.
[0021] Furthermore, the construction of entity annotation dataset metadata in step one includes:
[0022] The metadata m of an entity annotation data sample $D_k(i)$ consists of two parts: the index u and the entity category vector V, where $V = [v_1, v_2, \ldots, v_{|L|}]$ and $v_j$ is the number of entities in $D_k(i)$ that belong to category $l_j$. u and V are each encoded as binary numbers and then concatenated. Two binary digits are used to encode each component $v_j$ of V, giving a total of $(2 \times |L|)$ bits of binary data. When $|D_k|$ lies in the range 0 to $2^{32}$, the index is encoded as a 32-bit binary number, so each sample's metadata occupies $32 + 2|L|$ bits and the metadata generated for the entity annotation dataset $D_k$ amounts to at most $|D_k| \times (32 + 2|L|)/8$ bytes.
[0023] Furthermore, the construction of the entity annotation data index set in step two includes:
[0024] The entity category vectors V of all samples in M are summed to obtain the global vector, and the variance of the global vector is calculated. This variance serves as the evaluation metric for entity annotation datasets and is called entity dispersion. For a single piece of metadata m in M, let u(m) and V(m) represent the index and entity category vector corresponding to m, respectively.
[0025] The formula for calculating the entity dispersion is as follows:
[0026]
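As a hedged sketch of what "variance of the global vector" could mean here, assume that entity dispersion is the population variance of the components of the summed vector. The symbols $\bar{V}$, $\bar{v}_j$, and $\mu$ are introduced only for illustration and do not come from the source:

```latex
\bar{V} = \sum_{m \in M} V(m) = [\bar{v}_1, \bar{v}_2, \ldots, \bar{v}_{|L|}], \qquad
\mu = \frac{1}{|L|} \sum_{j=1}^{|L|} \bar{v}_j, \qquad
\sigma^2(\bar{V}) = \frac{1}{|L|} \sum_{j=1}^{|L|} \left( \bar{v}_j - \mu \right)^2
```

Under this assumption, a smaller value means the entity categories are more evenly represented, which is the sense in which a dataset is "approximately IID" with respect to entity categories.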
[0027] The method for constructing the approximate IID entity annotation data index set includes:
[0028] (1) The server decodes the binary entity annotation data metadata sent by the client;
[0029] (2) Initialize the approximate IID entity annotation data index set U and the global entity category vector.
[0030] (3) Traverse M and select the piece of metadata m for which the entity dispersion of the updated global vector (the current global vector plus V(m)) reaches its minimum;
[0031] (4) If this value is less than the threshold ε, add u(m) to U, remove m from M, and update the global vector by adding V(m) to it. Repeat steps (3) and (4) until the value exceeds the threshold ε or M is empty.
[0032] Another object of the present invention is to provide an entity extraction-oriented federated learning optimization system that applies the aforementioned entity extraction-oriented federated learning optimization method, the entity extraction-oriented federated learning optimization system comprising:
[0033] The metadata transmission module is used by the server to send shared data requests to each selected client participating in federated learning. Each client calculates the metadata of the shared dataset and uploads the metadata to the server.
[0034] The approximate IID entity annotation data construction module is used by the server to construct an index set of approximate IID entity annotation data based on the metadata uploaded by the client, and to request the corresponding data of the index set from the corresponding client.
[0035] The model training module treats the server as a client participating in federated learning training, using an approximate IID entity annotation dataset for model training, resulting in an approximate centralized model $W_{IID}$.
[0036] Another object of the present invention is to provide a computer device including a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the federated learning optimization method for entity extraction.
[0037] Another object of the present invention is to provide a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the steps of the federated learning optimization method for entity extraction.
[0038] Another objective of this invention is to provide an information data processing terminal for implementing the aforementioned federated learning optimization system for entity extraction.
[0039] Based on the above technical solutions and the technical problems solved, please analyze the advantages and positive effects of the technical solution to be protected by this invention from the following aspects:
[0040] First, addressing the technical problems existing in the prior art and the difficulty of solving them, this paper closely analyzes, in conjunction with the technical solution to be protected by this invention and the results and data obtained during the research and development process, how the technical solution of this invention solves the technical problems, and the inventive technical effects brought about by solving these problems. The specific description is as follows:
[0041] This invention proposes a federated optimization algorithm, FedSD, based on shared data. This algorithm reduces the variability in the model parameter spaces among clients, improves the accuracy and convergence speed of the global model, and reduces training communication costs. Experimental results show that the federated learning optimization method for entity extraction proposed in this invention can effectively overcome the problem of decreased global model accuracy under non-independent and identically distributed data from various clients. While reducing the risk of data privacy leakage, the accuracy of the model provided by this invention is close to that of the centralized training model, with an F1 score only 1.91% lower. Compared with existing entity extraction methods based on the federated learning framework FedAVG, the proposed method significantly improves the F1 score on the Boson dataset and the Weibo dataset, and significantly reduces the total data transmission volume per client. In particular, the performance improvement is more significant with the increase in the proportion of participating clients and the number of local training iterations.
[0042] This invention proposes a federated learning-based entity extraction method, FedSD. This method can coordinate the training of entity extraction models without exchanging data among clients, achieving model accuracy close to that of centrally trained models while reducing the risk of data privacy leakage. The proposed federated learning framework consists of four steps: client selection, model training, model optimization, and model aggregation. The key focus is on model optimization based on the FedSD shared data-based federated optimization algorithm. First, it constructs an approximately independent and identically distributed entity annotation dataset based on the shared dataset of clients. Then, it uses this dataset to train a model as an additional client model, participating in the subsequent model aggregation module. This addresses the problem of decreased model accuracy caused by non-independent and identically distributed client data. Simulation experiments on multiple datasets demonstrate that the proposed federated learning optimization method for entity extraction effectively overcomes the problem of decreased model accuracy under non-independent and identically distributed data and reduces the communication cost of model training.
[0043] Second, considering the technical solution as a whole or from a product perspective, the technical effects and advantages of the technical solution to be protected by this invention are specifically described as follows:
[0044] This invention proposes a federated learning optimization method, FedSD, for entity extraction. It constructs approximate IID entity annotation data based on a shareable dataset from all clients and uses this data to optimize the global model, thereby reducing the differences in parameter spaces among various clients and improving the accuracy of the global model.
[0045] Among the FedAVG and FedSD algorithms, the FedSD algorithm of this invention outperforms the baseline algorithm FedAVG, indicating that the FedSD algorithm can improve the F1 score of the model under Non-IID data, increase the convergence speed of the model, reduce the number of federated communication rounds required for model training, and reduce the communication cost of training.
[0046] Third, as supplementary evidence of the inventive step of the claims of this invention, it is also reflected in the following important aspects:
[0047] The technical solution of this invention fills a technological gap in the industry both domestically and internationally: existing federated learning methods do not employ a technique to construct an IID dataset by collecting publicly available datasets from various clients for training a global model. This invention is the first to propose constructing an IID dataset based on publicly available datasets from various clients for federated learning, alleviating the problem of slow and difficult convergence in federated learning caused by significant differences in datasets from various clients.
Attached Figure Description
[0048] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0049] Figure 1 is a flowchart of the federated learning optimization method for entity extraction provided in an embodiment of the present invention;
[0050] Figure 2 is a schematic diagram of the federated optimization algorithm process based on shared data provided in an embodiment of the present invention;
[0051] Figure 3 is a schematic diagram illustrating the entity annotation dataset metadata construction process provided in an embodiment of the present invention;
[0052] Figure 4 is a schematic diagram of the IID entity annotation data construction process provided in an embodiment of the present invention;
[0053] Figure 5 shows the F1 score curves of the two algorithms provided in an embodiment of the present invention;
[0054] Figure 6 is a schematic diagram showing the number of communication rounds during model training for the two algorithms provided in an embodiment of the present invention.
Detailed Implementation
[0055] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0056] To address the problems existing in the prior art, this invention provides a federated learning optimization method, system, device, and terminal for entity extraction. The invention will be described in detail below with reference to the accompanying drawings.
[0057] I. Explanation and Description of Embodiments. To enable those skilled in the art to fully understand how the present invention is specifically implemented, this section provides an explanation and description of the embodiments that expand upon the technical solutions of the claims.
[0058] As shown in Figure 1, the federated learning optimization method for entity extraction provided in this embodiment of the invention includes the following steps:
[0059] S101: transmit the shared data metadata;
[0060] S102: construct an approximate IID entity annotation dataset;
[0061] S103: upload the data to the server for model training.
[0062] As a preferred embodiment, the federated learning optimization method for entity extraction provided in this invention specifically includes the following steps:
[0063] 1. Basic Idea
[0064] The principle of federated learning model optimization is as follows: the client uploads local model parameters along with a portion of the local dataset that can be shared. Then, the server optimizes the model by training a new model based on the shared dataset uploaded by the client, which is then used to participate in model aggregation.
[0065] Optimization of federated learning models for entity extraction mainly involves three steps: shared data metadata transmission, construction of an approximate IID entity annotation dataset, and model training, as shown in Figure 2.
[0066] During the shared data metadata transmission phase, the server sends shared data requests to each selected client participating in federated learning. Each client calculates the shared dataset metadata, including the sample index, entity count, and category, and uploads the metadata to the server. In the phase of constructing approximate IID entity annotation data, the server constructs an approximate IID entity annotation data index set based on the metadata uploaded by the clients, requests the corresponding data from the respective clients, and finally uploads this data to the server for model training. Besides uploading the approximate IID entity annotation dataset, the additional communication overhead comes from the shared dataset metadata sent by the clients to the server initially. The metadata construction algorithm is described in detail in Algorithm 2 of Section 2. The approximate IID entity annotation data construction algorithm is described in detail in Algorithm 3 of Section 3.
[0067] During the model training phase, the server is treated as a client participating in federated learning training. The model is trained using an approximate IID entity annotation dataset, resulting in an approximate centralized model $W_{IID}$.
[0068] The FedSD algorithm is described in detail below, as shown in Algorithm 1. For ease of description, the following definitions are used: M is the set of metadata for the entity annotation dataset, where each element represents the metadata of a single entity annotation data sample. U is the set of data sample indices, where each element represents the index of a single data sample. $D_k(i)$ represents the i-th entity annotation sample in the dataset $D_k$. L represents the set of entity categories.
[0069] In the t-th round of federated learning training, FedSD consists of the following four steps. Step 1: for each client k in $C_t$, the server sends a data sharing request to the client, obtains the metadata of the client's local shared data, and saves it to the set M. Step 2: the server constructs an approximate IID entity annotation data index set U based on M. Step 3: the server requests from each client the sample data corresponding to U, obtaining the approximate IID entity annotation dataset $D_{IID}$. Step 4: the server performs model training on $D_{IID}$ to obtain an approximate centralized model $W_{IID}$.
[0070] Table 1 FedSD Algorithm
[0071]
[0072]
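The aggregation idea behind FedSD, in which the server-trained model $W_{IID}$ joins the aggregation as one additional client, can be sketched as follows. This is a minimal illustration, not the patented algorithm: the flat parameter format `Params`, the function names, and the choice of FedAVG-style weighting by sample count are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical parameter format: layer name -> flat list of weights.
Params = Dict[str, List[float]]


def weighted_average(models: List[Tuple[Params, int]]) -> Params:
    """FedAVG-style aggregation: average parameters weighted by sample count."""
    total = sum(n for _, n in models)
    reference = models[0][0]
    return {
        key: [
            sum(params[key][i] * n for params, n in models) / total
            for i in range(len(reference[key]))
        ]
        for key in reference
    }


def fedsd_aggregate(
    client_models: List[Tuple[Params, int]], w_iid: Params, n_iid: int
) -> Params:
    """Aggregate the client models together with W_IID as an extra client.

    Assumption: W_IID is weighted by the size of D_IID, like any other client.
    """
    return weighted_average(list(client_models) + [(w_iid, n_iid)])
```

Treating $W_{IID}$ as just another entry in the weighted average keeps the aggregation step identical to FedAVG; only the list of participating models changes.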
[0073] 2. Constructing Meta-information
[0074] This invention describes a method for constructing metadata for entity annotation datasets; the construction process is shown in Figure 3. The metadata m of an entity annotation data sample $D_k(i)$ consists of two parts: the index u and the entity category vector V, where $V = [v_1, v_2, \ldots, v_{|L|}]$ and $v_j$ is the number of entities in $D_k(i)$ that belong to category $l_j$. To reduce the communication cost of data transmission, this invention encodes u and V as binary numbers respectively and then concatenates them. Typically, the number of times any one type of entity appears in $D_k(i)$ does not exceed 4. Therefore, this invention uses two binary digits to encode each component $v_j$ of V, giving a binary number of $(2 \times |L|)$ bits. Assuming $|D_k|$ lies in the range 0 to $2^{32}$, the index is encoded as a 32-bit binary number, so each sample's metadata occupies $32 + 2|L|$ bits and the metadata generated for the entity annotation dataset $D_k$ amounts to at most $|D_k| \times (32 + 2|L|)/8$ bytes. The algorithm for constructing shared data metadata is shown in Algorithm 2.
[0075] Table 2 Algorithm for Constructing Shared Data Meta-Information
[0076]
[0077]
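The bit layout described in this section (a 32-bit index followed by two bits per entity category) can be sketched as below. The function names are hypothetical, the result is kept as a string of '0'/'1' characters for readability rather than packed bytes, and clamping counts above 3 to the largest two-bit value is an assumption, since the text does not say how larger counts are handled.

```python
from typing import List, Tuple

INDEX_BITS = 32  # index width stated in the text
COUNT_BITS = 2   # bits per entity-category count stated in the text
MAX_COUNT = (1 << COUNT_BITS) - 1  # 3; larger counts are clamped (assumption)


def encode_metadata(index: int, counts: List[int]) -> str:
    """Concatenate the binary index u and the binary entity category vector V."""
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index does not fit in 32 bits")
    if any(count < 0 for count in counts):
        raise ValueError("entity counts must be non-negative")
    bits = format(index, f"0{INDEX_BITS}b")
    for count in counts:
        bits += format(min(count, MAX_COUNT), f"0{COUNT_BITS}b")
    return bits


def decode_metadata(bits: str, num_categories: int) -> Tuple[int, List[int]]:
    """Server-side inverse of encode_metadata: recover (u, V)."""
    index = int(bits[:INDEX_BITS], 2)
    counts = []
    for j in range(num_categories):
        start = INDEX_BITS + COUNT_BITS * j
        counts.append(int(bits[start:start + COUNT_BITS], 2))
    return index, counts
```

With three entity categories, for example, each record is 32 + 2 × 3 = 38 bits long, which is far smaller than the annotated sentence it describes.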
[0078] 3. Construct an entity annotation data index set
[0079] This invention introduces a method for constructing an entity annotation data index set. First, the IID evaluation method for an entity annotation dataset is described. The entity category vectors V of all samples in M are summed to obtain the global vector, and the variance of the global vector is then calculated. This variance is used as the evaluation metric for entity annotation datasets, called entity dispersion; the calculation formula is shown in (1). For a piece of metadata m in M, let u(m) and V(m) represent the index and entity category vector corresponding to m, respectively.
[0080]
[0081] The following describes a method for constructing an approximate IID entity annotation data index set; the construction process is shown in Figure 4. In the first step, the server decodes the binary entity annotation data metadata sent by the clients. In the second step, the approximate IID entity annotation data index set U and the global entity category vector are initialized. In the third step, M is traversed and the piece of metadata m is selected for which the entity dispersion of the updated global vector (the current global vector plus V(m)) reaches its minimum. In the fourth step, if this value is less than the threshold ε, u(m) is added to U, m is removed from M, and the global vector is updated by adding V(m) to it. Steps three and four are repeated until the value exceeds the threshold ε or M is empty. See Algorithm 3 for details.
[0082] Table 3. Constructing an index set of approximate IID entity annotation data.
[0083]
[0084]
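The greedy selection described above can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of Algorithm 3: entity dispersion is taken to be the population variance of the global vector's components, sample indices are assumed to be unique across clients so they can serve as dictionary keys, and the names `entity_dispersion` and `build_index_set` are hypothetical.

```python
from typing import Dict, List, Set


def entity_dispersion(vector: List[int]) -> float:
    """Population variance of the per-category entity counts (assumption)."""
    mean = sum(vector) / len(vector)
    return sum((v - mean) ** 2 for v in vector) / len(vector)


def build_index_set(metadata: Dict[int, List[int]], epsilon: float) -> Set[int]:
    """Greedily pick sample indices that keep the global vector balanced.

    metadata maps a sample index u(m) to its entity category vector V(m).
    """
    remaining = dict(metadata)  # the set M
    if not remaining:
        return set()
    num_categories = len(next(iter(remaining.values())))
    global_vec = [0] * num_categories  # global entity category vector
    selected: Set[int] = set()         # the index set U

    while remaining:
        best_idx, best_vec, best_disp = None, None, float("inf")
        # Step 3: find the m whose addition gives the smallest dispersion.
        for idx, vec in remaining.items():
            candidate = [g + v for g, v in zip(global_vec, vec)]
            disp = entity_dispersion(candidate)
            if disp < best_disp:
                best_idx, best_vec, best_disp = idx, candidate, disp
        # Step 4: stop once even the best candidate reaches the threshold.
        if best_disp >= epsilon:
            break
        selected.add(best_idx)
        global_vec = best_vec
        del remaining[best_idx]
    return selected
```

Each pass scans all remaining metadata, so the sketch costs on the order of |M|² vector additions in the worst case. That work runs on the server over compact metadata only, without touching the samples themselves.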
[0085] The federated learning optimization system for entity extraction provided in this embodiment of the invention includes:
[0086] The metadata transmission module is used by the server to send shared data requests to each selected client participating in federated learning. Each client calculates the metadata of the shared dataset and uploads the metadata to the server.
[0087] The approximate IID entity annotation data construction module is used by the server to construct an index set of approximate IID entity annotation data based on the metadata uploaded by the client, and to request the corresponding data of the index set from the corresponding client.
[0088] The model training module treats the server as a client participating in federated learning training, using an approximate IID entity annotation dataset for model training, resulting in an approximate centralized model $W_{IID}$.
[0089] II. Application Examples. To demonstrate the inventiveness and technical value of the technical solution of this invention, this section provides application examples of the technical solution of the claims on specific products or related technologies.
[0090] The implementation process is shown in Figure 1. By collecting publicly available datasets from the various clients, constructing IID data, and training a global model, the invention alleviates the slow and difficult convergence of federated learning caused by dataset differences. On the 1998 People's Daily Chinese entity annotation dataset, the Boson entity annotation dataset, and the Weibo entity annotation dataset, the proposed algorithm FedSD outperformed FedAVG on all metrics by approximately 1%. Therefore, for various federated learning tasks, the performance of the final model can be improved by collecting publicly available datasets from the various clients, constructing IID data, and training a global model.
[0091] III. Evidence of the Relevant Effects of the Embodiments. The embodiments of the present invention have achieved some positive effects during research and development or use, and indeed possess significant advantages compared to existing technologies. The following description, in conjunction with data, charts, and other materials from the experimental process, illustrates these advantages.
[0092] 1. Experimental Data and Evaluation Indicators
[0093] This invention uses the 1998 People's Daily Chinese entity annotation dataset, the Boson entity annotation dataset, and the Weibo entity annotation dataset.
[0094] First, we introduce the dataset preprocessing method. To prevent the Bi-LSTM model from experiencing gradient vanishing or descent due to excessively long input sequences, this invention limits the length of all samples in the dataset to a maximum of 256 characters. Samples exceeding 256 characters are truncated and split into multiple samples. Samples shorter than 256 characters are padded with the label "O". The dataset is divided into training and test sets in a 4:1 ratio, with the training set containing 20,864 samples and the test set containing 4,636 samples.
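The length normalization described above can be sketched as follows. The padding character `[PAD]` and the function name are assumptions, since the text only specifies that padded positions carry the label "O". The naive split at fixed offsets may cut an entity that straddles a chunk boundary, which the source does not address.

```python
from typing import List, Tuple

MAX_LEN = 256        # maximum sample length stated in the text
PAD_TOKEN = "[PAD]"  # hypothetical padding character; not given in the text
PAD_LABEL = "O"      # padding label stated in the text


def split_and_pad(
    chars: List[str], labels: List[str], max_len: int = MAX_LEN
) -> List[Tuple[List[str], List[str]]]:
    """Split an over-long sample into chunks and pad each chunk to max_len."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    samples = []
    for start in range(0, len(chars), max_len):
        chunk_chars = chars[start:start + max_len]
        chunk_labels = labels[start:start + max_len]
        pad = max_len - len(chunk_chars)
        samples.append(
            (chunk_chars + [PAD_TOKEN] * pad, chunk_labels + [PAD_LABEL] * pad)
        )
    return samples
```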
[0095] Next, we introduce the simulation method for the federated learning environment. First, the training set is classified. Specifically, the number of entities in each entity category for each sentence sample is counted, and the sentence is classified into the category with the highest number of occurrences of a certain entity. If the sentence does not contain any entity, it is denoted as category "O". Next, the training set is sorted according to the classification results, and then the training set excluding category "O" is equally divided into n subsets as n different clients. Then, to simulate inconsistent client data sizes and low-quality datasets, the training set for category "O" is equally divided into n subsets, and each client selects one subset with a probability of 0.3 until the training set for category "O" is empty. Finally, each client trains its model using its local subset and then participates in the federated aggregation.
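The client simulation described above can be sketched as below. This is one reading of the procedure, not the authors' script: each sentence is represented only by the list of its entities' categories, ties for the dominant category are broken by first occurrence, the last client shard may be smaller when the division is not exact, and the names `dominant_category` and `partition_non_iid` are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def dominant_category(entity_categories: Sequence[str]) -> str:
    """Class of a sentence: its most frequent entity category, or "O" if none."""
    if not entity_categories:
        return "O"
    return Counter(entity_categories).most_common(1)[0][0]


def partition_non_iid(
    sentences: Sequence[Sequence[str]],
    n_clients: int,
    p_take: float = 0.3,
    seed: int = 0,
) -> List[List[int]]:
    """Return, for each simulated client, the indices of its sentences."""
    if n_clients < 1 or not 0 < p_take <= 1:
        raise ValueError("need n_clients >= 1 and 0 < p_take <= 1")
    rng = random.Random(seed)
    classes = [dominant_category(s) for s in sentences]
    # Sort entity-bearing sentences by class so each client sees few classes.
    with_entities = sorted((c, i) for i, c in enumerate(classes) if c != "O")
    without_entities = [i for i, c in enumerate(classes) if c == "O"]

    clients: List[List[int]] = [[] for _ in range(n_clients)]
    shard_size = -(-len(with_entities) // n_clients)  # ceiling division
    for pos, (_, idx) in enumerate(with_entities):
        clients[pos // shard_size].append(idx)

    # Split the "O" sentences into n subsets; clients grab them with p_take.
    pending = [without_entities[j::n_clients] for j in range(n_clients)]
    pending = [shard for shard in pending if shard]
    while pending:
        for client in clients:
            if pending and rng.random() < p_take:
                client.extend(pending.pop())
    return clients
```

Because the "O" subsets are handed out at random, some clients end up with several of them and others with none, which produces the uneven data sizes and low-quality local datasets the text aims to simulate.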
[0096] This invention was developed using Python 3.6 on an Ubuntu 16.04 system. Hardware resources included four 2.2GHz 10-core CPUs, 12GB of RAM, and a GeForce RTX 2080 graphics card.
[0097] This invention uses precision, recall, and F1 score as evaluation metrics for entity extraction models. TP is defined as the number of entities correctly identified by the model, FP as the number of entities incorrectly identified by the model at entity boundaries or in the wrong entity category, and FN as the number of true entities not identified by the model. The formulas for the three evaluation metrics are as follows:
[0098] $P = \frac{TP}{TP + FP}$
[0099] $R = \frac{TP}{TP + FN}$
[0100] $F1 = \frac{2 \times P \times R}{P + R}$
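Using the definitions of TP, FP, and FN given above, the three metrics can be computed with the standard formulas. The function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```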
[0101] 2. Experimental Setup
[0102] This section explains the advantages of the FedSD algorithm, with the selected baseline algorithm being FedAVG.
[0103] The federated learning parameters are set as follows: the number of clients is 10; the proportion of clients selected in each round is 0.2; and the number of federated learning training rounds is 20. The model training parameters are set as follows: the ALBERT-BiLSTM-CRF model is used, with 2 epochs, a batch size of 32, a learning rate of $10^{-3}$ for the CRF layer, and a learning rate of $10^{-5}$ for the neural network module. In each round of federated communication in the FedSD algorithm, each client is assumed to provide 10% of its local training set as the shared dataset.
[0104] 3. Ablation test
[0105] The F1 score curves for model training with the two algorithms are shown in Figure 5. The FedSD algorithm converges faster than the FedAVG algorithm: its F1 score reaches an inflection point in the second round of federated communication and approaches convergence from the tenth round onwards. The FedAVG algorithm converges more slowly, only approaching convergence in the fourteenth round. The final experimental results are shown in Table 4. After 20 rounds of federated communication, the FedSD algorithm achieves an F1 score of 90.51%, approximately 0.81% higher than the FedAVG algorithm. The reasons for these experimental results are analyzed below.
[0106] In the FedAVG algorithm, with Non-IID data, each client trains its model on local data, and these local datasets are typically small, making the model prone to overfitting. This leads to inconsistent update trends among the client models, so that the update trend of the aggregated global model is less stable than in centralized training, which reduces model accuracy. The FedSD algorithm, on the other hand, constructs approximate IID entity annotation data from the datasets shared by the clients and trains on this data to obtain an approximate centralized model. By adding this approximate centralized model as an additional client in model aggregation, it corrects the inconsistent update trends of the client model parameters caused by Non-IID data, strengthens the global model's ability to model the overall data distribution, and thus improves model accuracy.
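The aggregation idea described above, in which the approximate centralized model takes part as one more client, can be sketched as a weighted parameter average. This is only an illustration under assumptions: the function name `aggregate` is hypothetical, models are represented as dictionaries mapping parameter names to flat lists of floats, and weighting every participant (including the server model) by its number of training samples follows the usual FedAVG convention rather than a weighting stated in the text.

```python
def aggregate(client_weights, client_sizes, server_weights, server_size):
    """Sample-count-weighted average with the server model as one extra participant."""
    # The server-trained (approximate centralized) model is appended to the
    # list of client models and averaged exactly like any other client.
    models = list(client_weights) + [server_weights]
    sizes = list(client_sizes) + [server_size]
    total = sum(sizes)
    return {
        name: [
            sum(model[name][i] * size for model, size in zip(models, sizes)) / total
            for i in range(len(models[0][name]))
        ]
        for name in models[0]
    }
```

With this form, the larger the approximate IID dataset held by the server, the more strongly the approximate centralized model pulls the global model toward the overall data distribution.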
[0107] Table 4. Model training results of the two algorithms
[0108]
[0109] Next, the communication cost of model training is analyzed for the two algorithms. The number of federated communication rounds required for the model to reach an F1 score of 88% is measured, and the experimental results are shown in Figure 6. The FedSD model reaches an F1 score of 88% first, requiring only 11 rounds of federated communication, whereas FedAVG requires 19 rounds. This indicates that, to achieve the same F1 score, the FedSD algorithm needs fewer federated communication rounds than the baseline algorithm, resulting in higher model training efficiency and reduced training communication costs.
[0110] The total data transmission volume of a single client is analyzed further. In this experiment, clients are randomly selected to participate in federated learning training with a probability of 0.2 in each round of federated communication, so the total data transmission volume may differ between clients. This invention therefore takes the average total data transmission volume over all clients as the total data transmission volume of a single client. The experimental results are shown in Table 5. The total data transmission volume of the FedAVG algorithm is 147.7 MB, higher than that of the FedSD algorithm. This is because the FedAVG algorithm trains less efficiently and requires more rounds of federated communication, so the client uploads more data in total. The total data transmission volume of the FedSD algorithm is 92.28 MB. Compared to FedAVG, in each round of federated communication the FedSD client additionally transmits a shared dataset, which is 445 KB in size and contains 1259 data samples, including 803 "person name" entities, 1525 "geographic location" entities, and 917 "organization" entities. The FedAVG algorithm only needs to upload the model parameters, totaling 41.5 MB. Although the FedSD algorithm transmits an additional shared dataset in each round of federated communication, it improves the convergence speed of the model and achieves a higher F1 score with fewer federated communication rounds, and therefore the total amount of data transmitted by the client is lower.
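The relationship between these figures can be checked with a rough expected-value estimate. The sketch below rests on several assumptions that the text does not state explicitly: that 41.5 MB is the model upload per participating round, that 445 KB is counted as 0.445 MB, that the rounds counted are those needed to reach an F1 score of 88% (11 for FedSD and 19 for FedAVG), and that a client takes part in a round with probability 0.2. The function name `expected_client_upload_mb` is hypothetical.

```python
def expected_client_upload_mb(rounds, participation_prob, model_mb, shared_mb=0.0):
    """Expected total upload of one client, in MB, over a number of rounds."""
    # Expected number of rounds the client participates in, multiplied by
    # the upload size of one participating round.
    return rounds * participation_prob * (model_mb + shared_mb)


# FedSD: model parameters plus the shared dataset in every participating round.
fedsd_mb = expected_client_upload_mb(11, 0.2, 41.5, shared_mb=0.445)
# FedAVG: model parameters only.
fedavg_mb = expected_client_upload_mb(19, 0.2, 41.5)
```

Under these assumptions the FedSD estimate is about 92.28 MB, in line with the reported value. The FedAVG estimate is about 157.7 MB, which differs from the reported 147.7 MB, so the reported figures may have been obtained differently, for example as a measured average over the actual random client selection.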
[0111] Table 5. Data transmission volume of the two algorithms
[0112]
[0113] In summary, the FedSD algorithm outperforms the baseline algorithm FedAVG. This indicates that the FedSD algorithm can improve the F1 score of the model on Non-IID data, increase the model's convergence speed, reduce the number of federated communication rounds required for model training, and lower the communication cost of training.
[0114] It should be noted that embodiments of the present invention can be implemented in hardware, software, or a combination of both. The hardware portion can be implemented using dedicated logic; the software portion can be stored in memory and executed by a suitable instruction execution system, such as a microprocessor or dedicated-design hardware. Those skilled in the art will understand that the above-described devices and methods can be implemented using computer-executable instructions and / or included in processor control code, for example, such code provided on a carrier medium such as a disk, CD, or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The devices and modules of the present invention can be implemented by hardware circuitry such as very large-scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field-programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of the above-described hardware circuitry and software, such as firmware.
[0115] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any modifications, equivalent substitutions, and improvements made by those skilled in the art within the scope of the technology disclosed in the present invention, and within the spirit and principles of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. A federated learning optimization method for entity extraction, characterized in that the federated learning optimization method for entity extraction includes: in the shared data metadata transmission phase, the server sends a shared data request to each selected client participating in federated learning, and each client calculates the metadata of its shared dataset and uploads it to the server; in the phase of constructing approximate IID entity annotation data, the server constructs an approximate IID entity annotation data index set based on the metadata uploaded by the clients and requests the corresponding data from the respective clients; in the model training phase, the server is treated as a client participating in federated learning training and uses the approximate IID entity annotation dataset for model training, obtaining an approximate centralized model; the federated learning optimization method for entity extraction includes the following steps: Step 1, the server sends a data sharing request to each client, obtains the metadata of each client's local shared data, and saves it to the set M; Step 2, the server constructs an approximate IID entity annotation data index set U based on M; Step 3, the server requests from each client the sample data corresponding to U to obtain an approximate IID entity annotation dataset; Step 4, the server performs model training using the approximate IID entity annotation dataset to obtain an approximate centralized model; wherein M is a set of metadata of an entity annotation dataset, each element of which is the metadata of one entity annotation data sample, and U is a set of data sample indexes, each element of which is the index of one data sample.
The construction of the entity annotation dataset metadata includes: the meta-information of an entity annotation data sample includes two parts, the index and the entity category vector, where each component of the entity category vector expresses the number of times that entities belonging to the corresponding entity category appear in the sample; each part is encoded into a binary number and the parts are then concatenated; two binary digits are used to encode each component of the entity category vector, and the encoded components together constitute one binary number; when the index is within the range 0 to $2^{32}$, the index is encoded using a 32-bit binary number, so that the maximum amount of metadata generated for the entity annotation dataset is 1 byte; (i) indicates the i-th entity annotation sample, and the components of the entity category vector range over the set of entity categories. The construction of the approximate IID entity annotation data index set includes: the entity category vectors of all samples in M are accumulated to obtain the global vector; the variance of the global vector is calculated as an evaluation metric for the entity annotation dataset and is called the entity dispersion; for a piece of meta-information m in M, u(m) and V(m) denote the index and the entity category vector corresponding to m, respectively. The method for constructing the approximate IID entity annotation data index set includes: (1) the server decodes the binary entity annotation data metadata sent by the clients; (2) the approximate IID entity annotation data index set U and the global entity category vector are initialized; (3) M is traversed and one metadata item m is selected such that the entity dispersion after adding V(m) to the global entity category vector reaches its minimum; (4) if this entity dispersion is less than the threshold, u(m) is added to U, m is removed from M, and the global entity category vector is updated by adding V(m); steps (3) and (4) are repeated until the entity dispersion is greater than the threshold or M is empty; the entity annotation datasets used are the 1998 People's Daily Chinese entity annotation dataset, the Boson entity annotation dataset, and the Weibo entity annotation dataset.
2. The federated learning optimization method for entity extraction as described in claim 1, characterized in that the metadata includes the sample's index, number of entities, and category.
3. A federated learning optimization system for entity extraction, applying the federated learning optimization method for entity extraction as described in any one of claims 1 to 2, characterized in that the federated learning optimization system for entity extraction includes: a metadata transmission module, used by the server to send shared data requests to each selected client participating in federated learning, each client calculating the metadata of its shared dataset and uploading the metadata to the server; an approximate IID entity annotation data construction module, used by the server to construct an index set of approximate IID entity annotation data based on the metadata uploaded by the clients and to request the data corresponding to the index set from the corresponding clients; and a model training module, which treats the server as a client participating in federated learning training and uses the approximate IID entity annotation dataset for model training to obtain an approximate centralized model.
4. A computer device, characterized in that the computer device includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the federated learning optimization method for entity extraction as described in any one of claims 1 to 2.
5. A computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the federated learning optimization method for entity extraction as described in any one of claims 1 to 2.
6. An information data processing terminal, characterized in that the information data processing terminal is used to implement the federated learning optimization system for entity extraction as described in claim 3.
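The metadata encoding and the greedy construction of the index set recited in claim 1 can be illustrated with the following Python sketch. It is a reading of the claim under explicit assumptions, not the patented implementation: the category set `CATEGORIES` and all function names are hypothetical, counts larger than 3 are assumed to saturate at 3 so that they fit in two bits, the entity dispersion is assumed to be the population variance of the accumulated per-category counts, and selection is assumed to stop as soon as the best candidate reaches the threshold.

```python
CATEGORIES = ("PER", "LOC", "ORG")  # hypothetical entity category set


def encode_meta(index, counts):
    """Pack a 32-bit sample index followed by one 2-bit count per category."""
    if not 0 <= index < 2 ** 32:
        raise ValueError("index must fit in 32 bits")
    bits = index
    for count in counts:
        bits = (bits << 2) | min(count, 3)  # assumption: counts above 3 saturate
    return bits


def decode_meta(bits, n_categories=len(CATEGORIES)):
    """Inverse of encode_meta: return (index, per-category counts)."""
    counts = []
    for _ in range(n_categories):
        counts.append(bits & 0b11)
        bits >>= 2
    return bits, tuple(reversed(counts))


def dispersion(vector):
    """Population variance of the per-category counts (assumed form)."""
    mean = sum(vector) / len(vector)
    return sum((v - mean) ** 2 for v in vector) / len(vector)


def build_index_set(metadata, threshold):
    """Greedily add the sample that keeps the global category counts most balanced."""
    remaining = [decode_meta(m) for m in metadata]
    global_vec = [0] * len(CATEGORIES)
    index_set = []
    while remaining:
        # Step (3): pick the metadata item that minimises the dispersion.
        best = min(
            remaining,
            key=lambda m: dispersion([g + v for g, v in zip(global_vec, m[1])]),
        )
        candidate = [g + v for g, v in zip(global_vec, best[1])]
        # Step (4): accept it only while the dispersion stays below the threshold.
        if dispersion(candidate) >= threshold:
            break
        index_set.append(best[0])
        global_vec = candidate
        remaining.remove(best)
    return index_set, global_vec
```

Only the packed integers travel from the clients to the server in this sketch, so the server can balance the category counts without seeing any sentence text until it requests the selected samples.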