Model training method and text processing method
By combining full and compressed data processing networks and dynamically selecting suitable network paths for training, the problem of balancing resource consumption and speed performance in large-scale model training is solved, achieving efficient and accurate model training and inference.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- CLOUD INTELLIGENCE ASSETS HOLDING (SINGAPORE) PTE LTD
- Filing Date
- 2025-11-07
- Publication Date
- 2026-06-18
AI Technical Summary
Large-scale model training leads to increased resource consumption and electricity costs, while slowing down response times when dealing with simple problems, making it impossible to balance speed and performance.
A combination of full data processing network and compressed data processing network is adopted. The target data processing network is constructed by acquiring multiple reference models, and a data processing model is generated. At different task stages, the appropriate network path is selected for training according to the requirements.
The model's parameters and computational efficiency were optimized, improving the utilization of computing resources, ensuring high-precision task processing capabilities, and reducing resource consumption during training and inference.
Description
Model training methods and text processing methods
[0001] This disclosure claims priority to Chinese Patent Application No. 202411804544.3, filed with the China Patent Office on December 9, 2024, entitled “Model Training Method and Text Processing Method”, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This disclosure relates to the field of computer technology, and in particular to a model training method.
Background Technology
[0003] With the continuous development of large-scale model technology, the requirements for model performance and scale are increasing, leading to a significant increase in training costs. In order to ensure smooth convergence of the training process and avoid catastrophic forgetting of other data, the training set must be continuously expanded, resulting in increased resource consumption and electricity costs.
[0004] Currently, gating mechanisms reduce the cost of model training and address the resource consumption issues associated with large-scale model training. However, this technique suffers from slower response times when handling simple problems due to the use of larger models, failing to balance speed and performance. Therefore, a new model training method is needed to overcome these drawbacks.
Summary of the Invention
[0005] In view of the above, embodiments of this disclosure provide a model training method, a text processing method, and a text processing method applied to a cloud device. One or more embodiments of this disclosure also relate to a training platform, a model training apparatus, a computing device, a computer-readable storage medium, and a computer program product, to address the technical deficiencies existing in the prior art.
[0006] According to a first aspect of the present disclosure, a model training method is provided, comprising:
[0007] Obtain at least two reference models, and based on each reference model, obtain at least two target data processing networks, wherein the target data processing networks include at least one full data processing network and at least one compressed data processing network;
[0008] A data processing model is generated based on each target data processing network, wherein the data processing model includes at least one data encoding layer, and the data encoding layer includes at least one full data processing network and at least one compressed data processing network;
[0009] Obtain target sample data and train the data processing model based on the target sample data until the training stop condition of the data processing model is met, wherein the target sample data is the data used to train the data processing model.
[0010] According to a second aspect of the present disclosure, a text processing method is provided, comprising:
[0011] Get the text to be processed;
[0012] The text to be processed is input into the data processing model, and the text processing result output by the data processing model is obtained, wherein the data processing model is trained by the above-described model training method.
[0013] According to a third aspect of the present disclosure, a text processing method is provided, applied to a cloud device, comprising:
[0014] Receive a text processing request sent by a terminal device, wherein the text processing request includes text to be processed;
[0015] The text to be processed is input into the data processing model, and the text processing result output by the data processing model is obtained, wherein the data processing model is trained by the above-described model training method;
[0016] The text processing result is sent to the terminal device.
[0017] According to a fourth aspect of the present disclosure, a training platform is provided, including a request interface and a response unit;
[0018] The request interface is used to receive task generation requests sent by the terminal device;
[0019] The response unit, based on the task generation request, obtains a data processing model, wherein the data processing model is trained by the aforementioned model training method; and generates task information based on the data processing model, wherein the task information is used by the terminal device to perform a text processing task.
[0020] According to a fifth aspect of the present disclosure, a computing device is provided, comprising:
[0021] Memory and processor;
[0022] The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions. When the computer-executable instructions are executed by the processor, they implement the steps of the above-mentioned model training method, text processing method, and text processing method applied to cloud devices.
[0023] According to a sixth aspect of the present disclosure, a computer-readable storage medium is provided that stores computer-executable instructions, which, when executed by a processor, implement the steps of the above-described model training method, text processing method, and text processing method applied to a cloud device.
[0024] According to a seventh aspect of the present disclosure, a computer program product is provided, including a computer program / instructions that, when executed by a processor, implement the steps of the above-described model training method, text processing method, and text processing method applied to a cloud device.
[0025] By applying the scheme of this disclosure, and by acquiring multiple reference models and constructing a target data processing network based on them, the model structure can be flexibly adjusted to meet the needs of different tasks and application scenarios. The combination of a full data processing network and a compressed data processing network not only optimizes the number of model parameters and computational efficiency, but also enables the model to automatically select the most suitable network path according to specific needs at different training stages. This method improves the utilization of computing resources while ensuring the processing capability of high-precision tasks, effectively reducing resource consumption during training and inference.
Attached Figure Description
[0026] Figure 1 is a flowchart of a model training method provided in an embodiment of this disclosure;
[0027] Figure 2 is a schematic diagram of the structure of a data processing model provided in an embodiment of this disclosure;
[0028] Figure 3 is a flowchart of a text processing method provided in an embodiment of this disclosure;
[0029] Figure 4 is a flowchart of a text processing method applied to a cloud device according to an embodiment of the present disclosure;
[0030] Figure 5 is an architecture diagram of a text processing system provided in an embodiment of this disclosure;
[0031] Figure 6 is a flowchart of the processing procedure of an automatic question answering model training method provided in an embodiment of this disclosure;
[0032] Figure 7 is a schematic diagram of a model training device provided in an embodiment of this disclosure;
[0033] Figure 8 is a schematic diagram of the structure of a cloud training platform provided in an embodiment of this disclosure;
[0034] Figure 9 is a structural block diagram of a computing device provided in an embodiment of this disclosure.
Detailed Implementation
[0035] Numerous specific details are set forth in the following description to provide a full understanding of this disclosure. However, this disclosure can be implemented in many other ways than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this disclosure. Therefore, this disclosure is not limited to the specific implementations disclosed below.
[0036] The terminology used in one or more embodiments of this disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of this disclosure. The singular forms “a,” “an,” and “the” as used in one or more embodiments of this disclosure and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in one or more embodiments of this disclosure refers to and includes any or all possible combinations of one or more associated listed items.
[0037] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this disclosure, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this disclosure, and similarly, second may also be referred to as first. Depending on the context, the word “if” as used herein may be interpreted as “upon”, “when”, or “in response to a determination”.
[0038] Furthermore, it should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in one or more embodiments of this disclosure are all information and data authorized by the user or fully authorized by all parties. Moreover, the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.
[0039] In one or more embodiments of this disclosure, a large model refers to a deep learning model with a large number of model parameters, typically containing hundreds of millions, tens of billions, hundreds of billions, trillions, or even tens of trillions of model parameters. A large model can also be called a foundation model. It is pre-trained using large-scale unlabeled corpora to produce a pre-trained model with hundreds of millions of parameters. Such models can adapt to a wide range of downstream tasks and have good generalization ability. Examples include Large Language Models (LLMs) and multimodal pre-training models.
[0040] In practical applications, large models only require a small number of samples to fine-tune the pre-trained model before they can be applied to different tasks. Large models can be widely used in fields such as Natural Language Processing (NLP) and Computer Vision. Specifically, they can be applied to computer vision tasks such as Visual Question Answering (VQA), Image Captioning (IC), and Image Generation, as well as NLP tasks such as text-based sentiment classification, text summarization, and machine translation. The main application scenarios for large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design.
[0041] First, the terms and concepts involved in one or more embodiments of this disclosure will be explained.
[0042] Mixture of Experts (MoE) is a machine learning approach that improves task performance by combining multiple sub-models (expert models). In this model, different expert models process different input features, and the level of participation of each expert model is controlled by a gating mechanism. By dynamically selecting appropriate expert models for computation, MoE can more efficiently utilize the strengths of different experts. Its flexible structure allows it to automatically adjust the participating experts according to different task requirements, thereby improving the model's expressive power and performance.
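As an illustrative, non-limiting sketch of the gating mechanism described above, the following Python example combines the outputs of several expert models according to participation levels produced by a softmax gate. The linear experts, the gating matrix, and the names `moe_forward`, `expert_weights`, and `gate_weights` are assumptions made for illustration and are not taken from this disclosure.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, expert_weights, gate_weights):
    """Combine expert outputs using gate-assigned participation levels.

    x:              (d_in,) input features
    expert_weights: list of (d_in, d_out) matrices, one hypothetical linear expert each
    gate_weights:   (d_in, n_experts) hypothetical gating matrix
    """
    participation = softmax(x @ gate_weights)            # (n_experts,), sums to 1
    outputs = np.stack([x @ w for w in expert_weights])  # (n_experts, d_out)
    return participation @ outputs, participation

rng = np.random.default_rng(0)
x = rng.normal(size=4)
experts = [rng.normal(size=(4, 3)) for _ in range(2)]
gate = rng.normal(size=(4, 2))
y, participation = moe_forward(x, experts, gate)
```

In this sketch every expert is evaluated and the gate only weights the results; a sparse variant would evaluate only the highest-scoring experts to save computation.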
[0043] This disclosure provides a model training method, a text processing method, and a text processing method applied to cloud devices. One or more embodiments of this disclosure also relate to a training platform, a model training apparatus, a computing device, a computer-readable storage medium, and a computer program product, which will be described in detail in the following embodiments.
[0044] MoE (Mixture of Experts) is a common cost control solution after large-scale model scaling and is a very important part of the long-term strategy of large-scale models. This technology determines whether we can control the cost of electricity, energy and other resources after providing services at a large scale and on a regular basis in the future.
[0045] According to a recent report, a certain AI model responds to approximately 200 million requests daily, consuming over 500,000 kilowatt-hours of electricity, equivalent to the daily electricity consumption of 17,000 households. Without the MoE framework, this power consumption would be significantly higher. The MoE framework not only significantly reduces costs but also ensures high-performance requirements are met. In practical applications, small models can quickly provide accurate answers to 80% of simple problems, whereas unified large models can solve more complex problems at the cost of slower response times.
[0046] Referring to Figure 1, Figure 1 shows a flowchart of a model training method provided according to an embodiment of the present disclosure, which specifically includes the following steps.
[0047] Step 102: Obtain at least two reference models, and based on each reference model, obtain at least two target data processing networks, wherein the target data processing networks include at least one full data processing network and at least one compressed data processing network.
[0048] In practical applications, the reference model is the initial model used for model building and training. The target data processing network is a network structure built from the reference model and used for data processing tasks. The full data processing network uses the entire network's resources when processing data, while the compressed data processing network reduces computation and model parameters by optimizing the network structure for specific scenarios.
[0049] For example, a reference model can be understood as the foundational model provided for each expert module during the construction of a data processing model. Each expert module has its unique functionality and processing method, and multiple expert models work together within the framework of the overall data processing model to accomplish tasks. The reference model here refers to the basic structure used to initialize the expert modules; these may have been trained during the pre-training phase and used as the basis for building complex data processing model architectures. For instance, one expert module might be based on a simpler language model, while another might rely on a more complex deep neural network model. Different reference models can provide customized solutions for different input data.
[0050] A data processing model can be understood as a model built using different data processing networks for a specific task; it can be understood as a deep learning model built on the MoE framework. The data processing model selects appropriate experts for task processing through different expert modules and routing mechanisms, thereby improving the model's adaptability and efficiency across various tasks. Essentially, the data processing model is a specific manifestation of the MoE model. Through carefully designed expert modules and selection mechanisms, it can effectively and dynamically select the most suitable processing path for different data tasks. The design of the data processing model is crucial to the accuracy and computational efficiency of the processing task. For example, in speech recognition tasks, the data processing model can select different experts based on the task complexity; simple tasks are handled by smaller experts, while complex tasks are handled by larger experts, avoiding redundant computation and improving the overall efficiency of task processing.
[0051] A target data processing network can be understood as an expert module generated based on a reference model to process real-world data. In a data processing model, a target data processing network typically consists of multiple expert modules, each responding differently based on the characteristics of the input data, thereby effectively improving the overall system's processing capacity. Depending on the task requirements, these target data processing networks may be full data processing networks or compressed data processing networks, determined by the complexity of the task and the computational resource requirements.
[0052] A full data processing network can be understood as a network that processes data using its complete network structure. It does not perform any compression or simplification, fully utilizing all network parameters and computational resources to pursue high-precision results. In tasks requiring the processing of high-precision, large-scale data, using a full data processing network can ensure the quality of the output. For example, in complex natural language understanding, a full data processing network might be used to ensure the model's performance.
[0053] Compressed data processing networks can be understood as networks that improve efficiency by reducing computational complexity and the number of network parameters when processing data. Compression techniques can be implemented through methods such as pruning, quantization, or low-rank decomposition. For resource-constrained devices or scenarios, using compressed data processing networks can significantly reduce computational and storage overhead while maintaining good performance. For example, mobile devices often employ compressed data processing networks to adapt to limitations in computing power and storage capacity, while finding a balance between processing speed and accuracy.
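By way of example only, two of the compression techniques named above, pruning and quantization, might be sketched in Python as follows. The magnitude-based pruning criterion, the symmetric int8 scheme, the 50% sparsity level, and all function names are assumptions chosen for illustration; this disclosure does not prescribe them.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the fraction `sparsity` of entries with the smallest magnitude."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]   # k-th smallest magnitude
    return np.where(np.abs(w) > threshold, w, 0.0)

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization; returns (q, scale)."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate floating-point weights."""
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))            # stand-in for a layer's weights
w_pruned = magnitude_prune(w, sparsity=0.5)
q, scale = quantize_int8(w)
w_restored = dequantize(q, scale)
```

Pruning reduces the number of non-zero parameters, whereas quantization keeps every parameter but stores it in fewer bits; the rounding error of this scheme is bounded by half the quantization step `scale`.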
[0054] By combining full data processing networks and compressed data processing networks, the efficiency of computing resource utilization can be optimized without compromising task accuracy. This combination allows the system to select the most suitable network path based on requirements at different task and data processing stages, thereby improving efficiency while reducing resource consumption.
[0055] Furthermore, at least two target data processing networks are obtained based on each reference model, including:
[0056] A target reference model is determined, wherein the target reference model is any one of the reference models;
[0057] A reference model data processing layer is determined in the target reference model, and a target data processing network is obtained based on the reference model data processing layer.
[0058] In practical applications, the reference model data processing layer is the neural network layer that significantly impacts the model's output. It typically contains the main computational units of the neural network, capable of in-depth analysis and transformation of the input data's features, generating feature representations suitable for downstream tasks. The role of the reference model data processing layer is to provide fundamental data transformation and feature extraction support for the entire model, directly affecting the accuracy and expressive power of the model's output. For example, in image recognition models, the reference model data processing layer might consist of a set of convolutional layers. These layers extract features from the image's pixel information, transforming the image into a more structured feature map for further analysis by subsequent layers. In natural language processing tasks, the reference model data processing layer might contain multiple self-attention layers that extract contextual relationships and semantic information from sentences, thereby improving the accuracy of language understanding.
[0059] It should be noted that determining the data processing layer in the reference model can be understood as identifying the network layers in the model that are crucial for data feature extraction and processing. Specifically, this can be achieved by analyzing the structural characteristics of the reference model to identify network layers with outstanding feature extraction performance; for example, identifying deeper convolutional layers in a convolutional neural network as data processing layers. Alternatively, layers with higher weights can be selected by calculating the influence weights of each layer on the model output. Furthermore, layers with larger gradients can be selected by observing the gradient changes of each layer during model training, thus ensuring the effective transmission of feature information. This disclosure does not impose any limitations in this regard.
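As a minimal sketch of the gradient-based option mentioned above, the following Python example ranks layers by the L2 norm of their gradients and keeps the top k. The layer names, the gradient values, and the choice of the L2 norm as the measure of "larger gradients" are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def select_layers_by_gradient(layer_grads, k):
    """Return the names of the k layers with the largest gradient L2 norm."""
    norms = {name: float(np.linalg.norm(g)) for name, g in layer_grads.items()}
    return sorted(norms, key=norms.get, reverse=True)[:k]

# Hypothetical per-layer gradients recorded during one training step.
layer_grads = {
    "embed": np.full((2, 2), 0.01),
    "attn_1": np.full((2, 2), 0.50),
    "ffn_1": np.full((2, 2), 0.20),
}
selected = select_layers_by_gradient(layer_grads, k=2)
```

In practice the norms would likely be averaged over many training steps rather than read from a single step, since a single gradient can be noisy.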
[0060] By determining the reference model data processing layer in the target reference model and then determining the target data processing network based on this reference model data processing layer, representative layers in the reference model can be effectively extracted. Therefore, compared with directly using the reference model to obtain the target data processing network, computational and storage overhead can be significantly reduced while ensuring good performance.
[0061] Furthermore, obtaining the target data processing network based on the reference model data processing layer includes:
[0062] Obtain the initial parameter information corresponding to the data processing layer of the reference model;
[0063] Based on the initial parameter information, obtain the parameter feature information corresponding to the reference model data processing layer;
[0064] A compressed data processing network is generated based on the parameter feature information.
[0065] In practical applications, the initial parameter information refers to the original parameters used to describe the model weights in the data processing layer of the reference model; the parameter feature information is the key statistical or structured information extracted based on the initial parameter information, reflecting the characteristics and importance of the parameters.
[0066] Initial parameter information can be understood as the original weight configuration or parameter settings of the data processing layer in the reference model. These parameters are obtained during the training phase of the reference model and are used to reflect the model's basic capabilities in the data processing process. Parameter feature information can be understood as the feature information extracted from the initial parameter information, reflecting the importance, distribution characteristics, or numerical range of each parameter. By extracting features from the initial parameter information, parameter information or combinations of specific parameter information that contribute significantly to model performance can be identified.
[0067] It should be noted that obtaining the parameter feature information corresponding to the data processing layer of the reference model based on the initial parameter information can be understood as extracting and analyzing key feature data from the initial parameters. Specifically, the mean and variance of the parameters can be calculated to analyze their distribution characteristics and determine the contribution of different parameters. Alternatively, the initial parameters can be decomposed into parameter scaling feature information and parameter output feature information to identify important features and optimize the model structure. As a further alternative, gradient analysis can be used to calculate the gradient changes of each parameter, identifying the parameters that have a significant impact on the output results so that they can be prioritized. This disclosure does not impose any limitations in this regard.
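For illustration, the statistical option described above might be sketched in Python as follows. The function name `parameter_statistics` and the use of per-row variance as a rough indicator of contribution are assumptions, not part of this disclosure.

```python
import numpy as np

def parameter_statistics(w):
    """Summarize a weight matrix: overall mean/variance and per-row variance."""
    return {
        "mean": float(w.mean()),
        "variance": float(w.var()),
        "row_variance": w.var(axis=1),   # one value per output unit
    }

# Toy weights: the first row varies, the second row is constant (all zeros).
w = np.array([[1.0, -1.0, 1.0, -1.0],
              [0.0,  0.0, 0.0,  0.0]])
stats = parameter_statistics(w)
```

Under the assumption made here, a row with near-zero variance, such as the second row in this toy matrix, would be a candidate for removal during compression.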
[0068] By generating a compressed data processing network based on parameter feature information derived from initial parameter information, computational and storage requirements can be significantly reduced while ensuring that key model features are not lost, thereby improving model efficiency. Parameter feature information helps identify redundant parts that contribute little to model performance, making the compression process more accurate and avoiding unnecessary performance loss. The generated compressed data processing network can achieve lower resource consumption and increased processing speed while meeting task requirements, providing a more advantageous performance for the model in resource-constrained environments.
[0069] Furthermore, based on the initial parameter information, the parameter feature information corresponding to the reference model data processing layer is obtained, including:
[0070] Based on the initial parameter information, parameter scaling feature information and parameter output feature information are obtained;
[0071] Parameter feature information is obtained based on the parameter scaling feature information and the parameter output feature information.
[0072] In practical applications, parameter scaling features are information extracted from the initial parameters to describe the proportional changes of the parameters across different dimensions; parameter output features are features representing the degree of influence of each parameter in the model's output layer on the output result. Parameter scaling features can be understood as feature data that proportionally adjusts each parameter in the model across different dimensions. Parameter output features can be understood as a measure of the influence of each parameter in the model on the final output result.
[0073] In one embodiment provided in this disclosure, the method for obtaining parameter scaling feature information and parameter output feature information is as shown in Formula 1:
$$A = U \Sigma V^{T} \qquad \text{(Formula 1)}$$
[0074] Where A represents the initial parameter information, U represents the parameter output feature information, Σ represents the scaling feature information, and V represents the input feature information, defining the transformation direction of the input data.
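Formula 1 has the form of a singular value decomposition. One possible way to turn it into a compressed data processing network, offered here only as a sketch, is to keep the leading `rank` singular values and store the two resulting factors instead of A. The matrix sizes, the rank of 8, and the function name `compress_layer` are assumptions made for illustration.

```python
import numpy as np

def compress_layer(A, rank):
    """Factor A ≈ B @ C with B = U_r Σ_r and C = V_r^T via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    B = U[:, :rank] * s[:rank]   # scale each kept column of U by its singular value
    C = Vt[:rank, :]             # kept input directions
    return B, C

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 32))    # stand-in for initial parameters, shape (d_out, d_in)
B, C = compress_layer(A, rank=8)

full_params = A.size                     # 64 * 32
compressed_params = B.size + C.size      # 64 * 8 + 8 * 32
rel_error = np.linalg.norm(A - B @ C) / np.linalg.norm(A)
```

With these assumed sizes the factored form stores 768 values instead of 2048. A random matrix compresses poorly, so `rel_error` is large here; trained weight matrices are often closer to low rank, which is the case such a scheme relies on.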
[0075] It's important to note that obtaining parameter feature information based on parameter scaling and output features aims to comprehensively understand and optimize model parameters, thereby maintaining performance during model compression and optimization. Parameter scaling and output features reflect the model's data processing capabilities across different dimensions and the impact of each parameter on the output, respectively. Combining this information provides deeper parameter analysis, helping to identify which parameters are crucial for maintaining model performance and which may be redundant.
[0076] By integrating these features, core parameters can be effectively selected during model compression, reducing unnecessary computational and storage burdens. For example, high parameter scaling and output features often indicate the importance of parameters in a specific task. Preserving or optimizing these parameters ensures that the compressed model maintains accuracy and computational efficiency. Therefore, by acquiring these features, the model can effectively improve computational efficiency in resource-constrained environments while ensuring that performance is not significantly affected. This has significant value for deployment and maintenance in practical applications.
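One way to decide how many core parameters to keep, sketched below under stated assumptions, is to treat the squared singular values in Σ as an importance measure and retain the smallest number that preserves a chosen share of their total. The 90% and 95% thresholds and the function name `rank_for_energy` are hypothetical.

```python
import numpy as np

def rank_for_energy(singular_values, energy=0.9):
    """Smallest rank whose squared singular values retain `energy` of the total."""
    sq = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(sq) / sq.sum()
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(sq))            # guard against floating-point overshoot

s = np.array([3.0, 2.0, 1.0, 0.5])       # toy singular values, sorted descending
rank_90 = rank_for_energy(s, energy=0.9)
rank_95 = rank_for_energy(s, energy=0.95)
```

For these toy values the first two singular values carry about 91% of the squared total, so a 90% target keeps two components and a 95% target keeps three.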
[0077] Furthermore, obtaining the target data processing network based on the reference model data processing layer includes:
[0078] Obtain the initial parameter information corresponding to the data processing layer of the reference model;
[0079] The initial parameter information is determined to be the full data processing network.
[0080] In practical applications, when the target data processing network is obtained based on the reference model's data processing layer, acquiring the initial parameter information corresponding to that layer ensures that the generated target data processing network retains the fundamental performance of the reference model. The initial parameter information typically includes the weights, biases, and other information obtained during model training, which determine the model's performance on different tasks. Obtaining it therefore ensures that the target data processing network inherits the characteristics of the reference model when processing specific data, laying the foundation for further optimization.
[0081] For example, directly determining the initial parameter information as the full data processing network preserves the data processing capabilities of the reference model, ensuring the processing power and accuracy of the subsequently generated data processing model in complex tasks. In this way, the target data processing network in the data processing model can better handle complex data features, thereby improving the model's adaptability and robustness. This process not only helps ensure the efficiency of the data processing model but also provides more flexible applications in different environments, optimizing processing speed and performance, especially in large-scale data processing and high-precision tasks.
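A minimal sketch of this step, assuming the layer is a simple affine map stored as a dictionary, is given below. The dictionary layout, the function names, and the use of a deep copy (so that later training of the full network does not alter the reference model) are illustrative assumptions.

```python
import copy
import numpy as np

def make_full_network(reference_layer):
    """Use the reference layer's initial parameters, unchanged, as the full network."""
    return copy.deepcopy(reference_layer)

def forward(layer, x):
    """Affine map y = W x + b."""
    return layer["weight"] @ x + layer["bias"]

reference_layer = {"weight": np.array([[1.0, 2.0], [3.0, 4.0]]),
                   "bias": np.array([0.5, -0.5])}
full_network = make_full_network(reference_layer)

x = np.array([1.0, 1.0])
same_output = np.array_equal(forward(full_network, x), forward(reference_layer, x))

# Later updates to the full network leave the reference model untouched.
full_network["weight"] += 1.0
```

Because no parameters are altered at construction time, the full network initially reproduces the reference layer's output exactly.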
[0082] Step 104: Generate a data processing model based on each target data processing network, wherein the data processing model includes at least one data encoding layer, and the data encoding layer includes at least one full data processing network and at least one compressed data processing network.
[0083] In practical applications, the data encoding layer is the network layer that actually performs data processing, primarily used for in-depth processing of the input data. The data encoding layer contains at least one full data processing network and one compressed data processing network. The full data processing network is responsible for processing the raw data, preserving rich information details; the compressed data processing network reduces redundancy and optimizes the use of computing resources by appropriately processing the data.
[0084] The data encoding layer can be understood as a network layer specifically designed for deep data processing, aiming to improve model performance through different processing methods at different levels. Full data processing networks are suitable for tasks that require preserving all details, ensuring information integrity; while compressed data processing networks reduce computational resource consumption and accelerate the data processing process by optimizing data processing methods. In practical applications, such as natural language processing tasks, full data processing networks retain all semantic information, helping the model better understand text, while compressed data processing networks effectively compress data, improving processing efficiency while ensuring accuracy.
[0085] Furthermore, a data processing model is generated based on each target data processing network, including:
[0086] In each target data processing network, a target full data processing network and at least one target compressed data processing network are determined, wherein the target full data processing network is any one of the full data processing networks, and the target compressed data processing network is any one of the compressed data processing networks;
[0087] The target data encoding layer is obtained based on the target full data processing network and each target compressed data processing network, wherein the target data encoding layer is any one of the data encoding layers;
[0088] A data processing model is generated based on each target data encoding layer.
[0089] In practical applications, obtaining the target data encoding layer based on the target full data processing network and various target compressed data processing networks can be understood as generating the data encoding layer of the data processing model based on one full data processing network and one or more compressed data processing networks. The core purpose of this step is to ensure that the selected target full data processing network can cover a wide range of data processing capabilities and is suitable for various complex task scenarios. Simultaneously, selecting a target compressed data processing network from multiple compressed data processing networks aims to optimize the storage efficiency and utilization of computational resources in data processing. The selection of these two networks helps achieve a balance between model efficiency and performance, ensuring that the model maintains both high performance and good computational efficiency when processing high-dimensional data.
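By way of non-limiting illustration, the composition described above (one full data processing network plus one or more compressed data processing networks per data encoding layer) can be sketched as follows. The class names `DataProcessingNetwork` and `DataEncodingLayer`, the function `build_model`, and the rule that assigns networks to layers are hypothetical and are not specified by this disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataProcessingNetwork:
    """Hypothetical stand-in for one target data processing network."""
    name: str
    kind: str  # "full" or "compressed"


@dataclass
class DataEncodingLayer:
    """One full network plus at least one compressed network."""
    full: DataProcessingNetwork
    compressed: List[DataProcessingNetwork]

    def __post_init__(self) -> None:
        if self.full.kind != "full":
            raise ValueError("the layer needs a full data processing network")
        if not self.compressed:
            raise ValueError("at least one compressed network is required")
        if any(n.kind != "compressed" for n in self.compressed):
            raise ValueError("remaining networks must be compressed networks")


def build_model(networks: List[DataProcessingNetwork],
                num_layers: int) -> List[DataEncodingLayer]:
    """Assemble `num_layers` data encoding layers from the target networks."""
    fulls = [n for n in networks if n.kind == "full"]
    compressed = [n for n in networks if n.kind == "compressed"]
    if not fulls:
        raise ValueError("no full data processing network available")
    # Assumption: layers cycle through the available full networks and each
    # layer receives every compressed network; other assignments are possible.
    return [
        DataEncodingLayer(fulls[i % len(fulls)], list(compressed))
        for i in range(num_layers)
    ]
```

In this sketch, the validation in `__post_init__` only enforces the structural constraint stated in the method (at least one network of each kind per layer); how the networks themselves are derived from the reference models is outside its scope.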
[0090] By constructing data encoding layers, input data can be effectively processed from multiple dimensions and perspectives, helping the model better understand and extract important features from the data. The construction of data encoding layers ensures that the model has greater flexibility and adaptability, especially when facing complex and dynamically changing input data. The steps of generating a data processing model based on each data encoding layer mark the establishment process of a complete data processing model. This process not only helps improve the model's performance in different tasks but also maintains the model's accuracy while optimizing computational efficiency. By flexibly adjusting the parameters and configurations of each layer, the model can adapt to more diverse application scenarios and needs, especially in large-scale datasets and real-time processing requirements, ensuring efficient and stable performance.
[0091] Furthermore, obtaining the target data encoding layer based on the target full data processing network and each target compressed data processing network includes:
[0092] Obtain a path selection network, wherein the path selection network is used to select at least one data processing network for data processing in this round from the target full data processing network and each target compressed data processing network;
[0093] The target data encoding layer is obtained based on the path selection network, the target full data processing network, and each target compressed data processing network.
[0094] In practical applications, a path selection network is a neural network that selects the appropriate data processing network for a given data encoding layer in each round. A path selection network can be understood as an intelligent selection mechanism that decides which data processing network to use based on specific input conditions or task requirements. For example, when processing image data, a path selection network might choose several suitable compressed data processing networks based on the image's complexity or other characteristics to improve processing accuracy. The role of a path selection network is to enhance system flexibility, allowing the entire data processing process to be adjusted dynamically according to different data or task requirements, thereby optimizing resource allocation and improving processing efficiency.
[0095] It should be noted that using a path selection network to determine the appropriate data processing network among multiple compressed data processing networks can be understood as the path selection network choosing one or more suitable networks for task processing based on the compatibility between the current data and each compressed data processing network. The specific methods for path selection can include calculating the similarity or matching degree between the current data and each network, for example, using specific similarity metrics to select the best network; or dynamically adjusting the selection strategy based on data characteristics, i.e., identifying key attributes in the data (such as dimensionality, distribution, noise, etc.) and comparing the matching degree between the processing capabilities of different networks (such as network depth, convolutional kernel size, activation function, etc.) and these features, prioritizing networks with higher matching degrees to the data features when processing large-scale or complex data; or continuously optimizing the selection criteria of the path selection network through a real-time feedback mechanism to more accurately select suitable networks for different data, etc. This disclosure does not impose any limitations in these aspects.
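As one non-limiting sketch of the similarity-based option mentioned above, the example below scores each compressed data processing network by the cosine similarity between the current data features and a per-network prototype vector, then keeps the best matches. The prototype vectors, the choice of cosine similarity, and the function names are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_by_similarity(features: Sequence[float],
                         prototypes: Dict[str, Sequence[float]],
                         k: int = 1) -> List[str]:
    """Return the names of the k networks whose prototype best matches the data.

    `prototypes` maps a network name to a hypothetical vector summarizing the
    kind of data that network handles well.
    """
    scores = {name: cosine_similarity(features, proto)
              for name, proto in prototypes.items()}
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    prototypes = {"c0": [1.0, 0.0], "c1": [0.0, 1.0], "c2": [1.0, 1.0]}
    print(select_by_similarity([0.9, 0.1], prototypes, k=2))
```

A learned path selection network would replace the fixed prototypes with trainable parameters; the ranking step stays the same.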
[0096] Referring to Figure 2, which is a schematic diagram of the structure of a data processing model provided in an embodiment of this disclosure, the left side shows the overall structure of the data processing model. The data processing model has a data embedding layer, at least one data encoding layer, and a data decoding layer. The data embedding layer is connected to the first data encoding layer, the data encoding layers are connected sequentially, and the last data encoding layer is connected to the data decoding layer. The right side shows the internal structure of a data encoding layer in the data processing model. The data encoding layer has a self-attention network, a path selection network, a full data processing network, one or more compressed data processing networks, a data fusion network, a normalization network, and an activation network. The input data to the data encoding layer is first input into the self-attention network. The self-attention network then inputs the processed data into the full data processing network and the path selection network, respectively. The path selection network determines, based on the input data, which compressed data processing network to use for further processing. The data generated by the full data processing network and the data generated by the compressed data processing network selected by the path selection network are input into the data fusion network to obtain fused feature data. The fused feature data is then input into the normalization network to make the fused features more continuous. Finally, the normalized data is input into the activation network so that the data can be input into subsequent data encoding layers or the data decoding layer for further processing.
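The data flow through a single data encoding layer described with reference to Figure 2 can be sketched as follows, purely for illustration. The sub-networks are passed in as plain callables; the element-wise mean used for fusion, the layer-style normalization, and the ReLU activation are assumptions chosen to keep the sketch short, since this disclosure does not fix the concrete form of these networks.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Network = Callable[[Vector], Vector]


def mean_fuse(outputs: Sequence[Vector]) -> Vector:
    """Data fusion stand-in: element-wise mean of the path outputs."""
    return [sum(vals) / len(outputs) for vals in zip(*outputs)]


def layer_normalize(v: Vector, eps: float = 1e-5) -> Vector:
    """Normalization stand-in: zero mean and unit variance over the vector."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def relu(v: Vector) -> Vector:
    """Activation stand-in: element-wise ReLU."""
    return [max(0.0, x) for x in v]


def encoding_layer_forward(x: Vector,
                           self_attention: Network,
                           full_network: Network,
                           compressed_networks: Sequence[Network],
                           path_selector: Callable[[Vector], List[int]]) -> Vector:
    """Run one data encoding layer in the order described for Figure 2."""
    attended = self_attention(x)
    # The full network always runs on the self-attention output.
    full_out = full_network(attended)
    # The path selection network returns indices of compressed networks to run.
    chosen = path_selector(attended)
    compressed_outs = [compressed_networks[i](attended) for i in chosen]
    fused = mean_fuse([full_out] + compressed_outs)
    return relu(layer_normalize(fused))
```

Only the compressed networks named by `path_selector` are evaluated, which is where the computational saving in this arrangement would come from.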
[0097] In practical applications, the data embedding layer is a network layer that converts the original input data into an embedded representation that the model can process; the data decoding layer is a network layer that generates the final output from the processed representation; the term "self-attention network" refers to a network layer used to focus on the internal correlations of the input data; the data fusion network is a network layer that integrates feature information from different data processing paths; the normalization network is a network layer that standardizes the data to improve the stability of the model; and the activation network is a network layer that introduces non-linear features into the processed data.
[0098] The data embedding layer can be understood as a network layer responsible for converting raw input data (such as text or images) into embedding vectors needed by the model. This step enables the model to efficiently process input information by compressing high-dimensional data into low-dimensional vectors. For example, in natural language processing tasks, the data embedding layer transforms each word into a word vector, allowing the semantic information of the words to be passed to subsequent network layers. The data decoding layer can be understood as a network layer used to transform the embedded representations processed by the model into interpretable output. This layer is the final output layer of the data processing model, responsible for generating results that meet the task requirements. For example, in machine translation tasks, the data decoding layer transforms the encoded language features into text in the target language, thereby completing the translation task.
[0099] Self-attention networks can be understood as a network layer used to compute relationships between different parts of data, capturing dependencies within the input sequence. By focusing on different parts of the input data, this type of network extracts more useful feature information. For example, in text processing tasks, self-attention networks can identify relationships between words in a sentence, helping the model understand the sentence's contextual structure. Data fusion networks can be understood as a network layer that integrates feature information from different sources or paths, merging feature data generated by full data processing networks and compressed data processing networks into a comprehensive representation. Data fusion networks can help models efficiently combine information from different feature sources in multi-task learning or multi-channel data processing, improving the comprehensiveness of data processing.
[0100] A normalization network can be understood as a network layer that standardizes input data, adjusting the data distribution to make it more suitable for model processing. This layer is typically used after feature fusion. Through normalization, the impact of numerical fluctuations can be reduced, helping the model remain stable during training and inference. For example, batch normalization is one of the common normalization methods. An activation network can be understood as a network layer that introduces non-linear features into the data through non-linear activation functions. This layer is usually applied after normalization, aiming to increase the model's expressive power through activation functions, enabling it to learn complex patterns and features, thereby improving the model's performance when processing complex data.
[0101] By constructing the data encoding layer using a path selection network, a suitable compressed data processing network can be dynamically selected based on the characteristics of the current data, achieving optimized resource utilization. This approach not only improves processing efficiency but also reduces unnecessary computational overhead, making the data encoding layer more flexible under different task requirements. Furthermore, the path selection network can adaptively select the best processing path based on the characteristics of the input data, ensuring the model maintains a high response speed while guaranteeing performance. This structural design helps balance processing quality and computational cost in complex data processing scenarios, providing more representative feature data for subsequent network layers.
[0102] Step 106: Obtain target sample data and train the data processing model based on the target sample data until the data processing model reaches the model training stopping condition, wherein the target sample data is the data used to train the data processing model.
[0103] In practical applications, target sample data refers to the specific dataset used to train the model, and model training stopping condition refers to the criteria or conditions used to determine when to stop training during the training process.
[0104] For example, target sample data can be understood as the input dataset used during model training. This dataset can be a portion of any publicly available dataset or a dataset specific to the target project. Model training stopping conditions can be understood as a standard for determining when to end the training process. Typically, training stopping conditions include several aspects, such as whether the model's performance on the validation set corresponding to the target sample data reaches the predetermined target, or whether the model's error converges to a fixed low level. They may also include limitations on the number of training iterations, training time, or stopping when the validation loss no longer shows significant improvement. For example, when the model's validation error no longer decreases significantly after several training iterations, the model is usually considered sufficiently trained, and training can be stopped to avoid overfitting caused by overtraining. In practical applications, setting appropriate training stopping conditions helps improve model training efficiency and prevents unnecessary waste of computational resources.
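A minimal sketch of a model training stopping condition combining the criteria listed above (a target validation loss, an iteration limit, and no significant improvement of the validation loss) is given below. The function name `should_stop` and the default values of `patience`, `min_delta`, and `max_steps` are hypothetical.

```python
from typing import Optional, Sequence


def should_stop(val_losses: Sequence[float],
                patience: int = 3,
                min_delta: float = 1e-3,
                max_steps: int = 1000,
                target_loss: Optional[float] = None) -> bool:
    """Decide whether training should stop, given the validation loss history."""
    if not val_losses:
        return False
    # Limit on the number of training iterations.
    if len(val_losses) >= max_steps:
        return True
    # Predetermined performance target reached.
    if target_loss is not None and val_losses[-1] <= target_loss:
        return True
    # No significant improvement over the last `patience` evaluations.
    if len(val_losses) > patience:
        best_before = min(val_losses[:-patience])
        recent_best = min(val_losses[-patience:])
        if best_before - recent_best < min_delta:
            return True
    return False
```

The function would typically be called after each validation pass, with training ending the first time it returns `True`.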
[0105] Furthermore, the data processing model is trained based on the target sample data until the data processing model reaches the model training stopping condition, including:
[0106] Based on the target sample data, adjust the target parameters corresponding to the data processing model and / or the path selection parameters corresponding to the path selection network in each data encoding layer until the data processing model reaches the model training stopping condition.
[0107] In practical applications, the target parameters are the parameters of each neural network layer in the data processing model, used to optimize model performance, while the path selection parameters are the parameters in the path selection network, used to control the selection mechanism of different data processing networks.
[0108] For example, target parameters can be understood as key parameters that directly affect model performance and output during model training. These parameters are typically weights and biases adjusted by optimization algorithms during training to enable the model to achieve the desired results when processing data. For instance, in neural networks, target parameters include the weight matrix and biases of each layer; adjusting these parameters directly determines the model's accuracy and generalization ability. By optimizing the target parameters, the model can exhibit better performance on a given task and adapt to different data characteristics.
[0109] Path selection parameters can be understood as control parameters in a path selection network that determine the direction of data flow. These parameters are used to dynamically select the data processing network to perform data processing based on the characteristics of the input data, ensuring that the data is routed to a more suitable compressed data processing network or a full data processing network. Optimizing path selection parameters can effectively improve the processing efficiency of the model, enabling it to automatically select the best processing path under complex tasks or different input conditions. For example, in a multi-task learning environment, path selection parameters can help the model decide which data processing network to call based on task requirements, thereby achieving a balance between computational efficiency and processing accuracy.
[0110] It should be noted that when training the data processing model, one can choose to adjust only the target parameters, only the path selection parameters, or both; this disclosure does not impose any restrictions in this regard. This choice allows different aspects of the model to be optimized flexibly, so that the model can perform well under specific task requirements. Adjusting only the target parameters focuses on improving the model's core accuracy and predictive ability, while adjusting only the path selection parameters helps optimize the efficiency of data flow and improve the model's switching performance between different processing networks. When both are adjusted simultaneously, the model can achieve a balance between accuracy and efficiency, thereby meeting diverse task requirements. This flexible adjustment method not only improves the model's adaptability in various scenarios but also helps reduce resource consumption and improve the overall efficiency and stability of the model.
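The three adjustment options can be illustrated with a plain gradient-descent step that updates only the selected parameter groups. In this sketch, the convention that path selection parameters carry the hypothetical name prefix `router.` and the function name `sgd_step` are assumptions; any mechanism that separates the two groups would serve equally well.

```python
from typing import Dict, List, Set


def sgd_step(params: Dict[str, List[float]],
             grads: Dict[str, List[float]],
             lr: float,
             trainable: Set[str]) -> Dict[str, List[float]]:
    """Apply one gradient-descent step to the selected parameter groups only.

    `trainable` may contain "target", "path_selection", or both.
    """
    updated = {}
    for name, values in params.items():
        # Assumption: path selection parameters are named "router.*".
        group = "path_selection" if name.startswith("router.") else "target"
        if group in trainable:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            # Frozen group: values are copied unchanged.
            updated[name] = list(values)
    return updated
```

For example, passing `trainable={"path_selection"}` leaves every target parameter untouched while the path selection parameters are updated.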
[0111] Furthermore, based on the target sample data, the target parameters corresponding to the data processing model and the path selection parameters corresponding to the path selection network in each data encoding layer are adjusted until the model training stopping condition is met, including:
[0112] The target sample data is processed based on the data processing model to obtain a first model loss value, and the target parameters corresponding to the data processing model are adjusted according to the first model loss value until the first model training stopping condition is reached to obtain a reference data processing model. The first model loss value is a model loss value used to improve the accuracy of the model output results.
[0113] Obtain project sample data for the target project;
[0114] The project sample data is processed based on the data processing model to obtain a second model loss value, and the path selection parameters corresponding to each reference data encoding layer are adjusted according to the second model loss value until the second model training stopping condition is reached. The second model loss value is a model loss value used to reduce the difference in selection probability between target data processing networks with similar functions in the model.
[0115] In practical applications, the first model loss value is the loss value used to measure the accuracy of the model's output during training, and the first model training stopping condition is the standard for judging whether training is complete; the reference data processing model is the model adjusted based on the target sample data and the first model loss value; the project sample data is sample data collected for a specific task or project; the second model loss value is the loss value used to measure the model's resource allocation effect when processing project sample data; and the second model training stopping condition is the standard for judging whether the resource allocation effect has reached stability during training.
[0116] For example, the first model loss value can be understood as the error calculated by the model when processing the target sample data, used to evaluate the accuracy of the output. During training, by continuously adjusting the target parameters to reduce this loss value, the model gradually improves its performance on the dataset. For example, in a text generation task, the cross-entropy loss function can be used to calculate the difference between the model output and the target text, helping to improve the quality of text generation. Correspondingly, the first model training stopping condition can be understood as the standard used to determine whether to stop further adjustments at a specific training stage, usually related to the trend of the loss value. For example, when the first model loss value no longer decreases significantly after multiple rounds of training, the model reaches the first model training stopping condition, indicating that it tends to converge on the current dataset. A reasonable stopping condition helps prevent overfitting and saves computational resources.
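For the text generation example above, a first model loss value based on cross-entropy can be sketched as follows. The representation of the model output as one list of logits per token and the function names are assumptions for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_token: Sequence[Sequence[float]],
                  target_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target tokens."""
    losses = [-math.log(softmax(logits)[target])
              for logits, target in zip(logits_per_token, target_ids)]
    return sum(losses) / len(losses)
```

The loss is small when the model assigns high probability to each target token and grows as that probability falls, which is why reducing it improves the accuracy of the output.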
[0117] The reference data processing model can be understood as the version of the model obtained after reaching the training stopping condition. Having undergone initial adjustments and optimizations, it possesses the ability to process the target data. This model version not only provides high accuracy on the target data but also lays the foundation for subsequent parameter tuning and network selection processes, ensuring the data processing system has good performance. Project sample data can be understood as a specific dataset used for further evaluation and fine-tuning of the reference model, typically containing samples reflecting project requirements. This data simulates the model's input in real-world scenarios, enabling it to better adapt to specific needs. For example, in a speech recognition task, project sample data might include speech data with a specific accent to test the model's recognition performance under that accent.
[0118] The second model loss value can be understood as the error value used to measure the effectiveness of resource allocation when processing project sample data. Unlike the first model loss value, this loss value focuses on the selection accuracy of the path selection network and expert modules, reducing selection differences between functionally similar networks in the model, thereby improving resource utilization and making multi-task processing more efficient. Correspondingly, the second model training stopping condition can be understood as the standard for judging whether the model has reached a convergent state during the resource allocation parameter adjustment process. Usually, when the loss value no longer decreases significantly, the model meets the second model training stopping condition, thus ensuring a better balance between task allocation and resource utilization, improving processing efficiency and stability.
[0119] It should be noted that obtaining the second model loss value based on the data processing model for processing project sample data can be understood as calculating the difference in selection probabilities between target data processing networks with similar functions in the model based on the model input data. Specifically, this can be achieved by calculating the difference between the selected path and other potential paths. For example, in the path selection network, the selection results of different data processing paths can be compared to calculate the loss value. Alternatively, in the expert module, several suitable expert modules can be selected based on the input data, and the weighted loss value between the output generated by these experts and the actual sample data can be calculated to adjust the path selection parameters. Furthermore, a target output can be set based on task requirements, and the actual output of the model can be compared with the target output, with the difference measured as the loss value, thereby optimizing the accuracy and efficiency of resource allocation, etc. This disclosure does not impose any limitations in this regard.
[0120] By sequentially adjusting the model's target parameters and path selection parameters using the first and second model loss values, the model's performance across different tasks can be optimized, achieving a balance between accuracy and efficiency. Adjustment based on the first model loss value focuses on improving accuracy, making predictions more precise on specific datasets, while adjustment based on the second model loss value focuses on the rationality of resource allocation. By optimizing the path selection parameters, the model can effectively allocate computing resources in a multi-task environment, avoiding overuse or waste of computing power. This dual optimization strategy not only ensures the model's accuracy in task processing but also improves response speed and computational efficiency, thus providing better support for the model's practical applications.
[0121] Furthermore, processing the project sample data based on the data processing model to obtain the second model loss value includes:
[0122] A first reference data encoding layer is determined, wherein the first reference data encoding layer is any one of the reference data encoding layers, and the first reference data encoding layer includes a path selection network and at least two target data processing networks;
[0123] The project sample data is input into the path selection network to obtain the path selection information generated by the path selection network.
[0124] Based on the path selection information, at least one first target data processing network and at least one second target data processing network are determined in each target data processing network. The first target data processing network is a data processing network determined by the path selection network to be used to process the project sample data. The second target data processing network is a target data processing network other than the first target data processing network in each target data processing network.
[0125] The project sample data is processed by each first target data processing network to obtain first data feature information, and the project sample data is processed by each second target data processing network to obtain second data feature information.
[0126] The second model loss value is obtained based on the first data feature information and the second data feature information.
[0127] In practical applications, path selection information refers to the processing path information determined by the path selection network based on the input data. The first target data processing network is the main data processing network used to process the project sample data, and the first data feature information is the feature information generated by the first target data processing network. The second target data processing network is a backup or auxiliary data processing network, and the second data feature information is the feature information generated by the second target data processing network.
[0128] Path selection information can be understood as the optimal processing path calculated by the path selection network after receiving input data. This information guides the data flow to the appropriate target data processing network to optimize processing efficiency. For example, in a multi-expert model, path selection information might direct the data flow to the top-k expert modules for more efficient processing of specific input data, ensuring optimal task execution.
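One way to picture path selection information in the multi-expert example is as the indices of the k highest-scoring networks together with renormalized weights. The sketch below assumes the path selection network emits one score (logit) per target data processing network; the function name `top_k_paths` and the returned structure are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the path selection scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_paths(router_logits: Sequence[float],
                k: int) -> Tuple[List[int], Dict[int, float], List[int]]:
    """Split the networks into selected and unselected sets.

    Returns the selected indices (best first), their weights renormalized to
    sum to 1, and the indices of the networks that were not selected.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected = ranked[:k]
    total = sum(probs[i] for i in selected)
    weights = {i: probs[i] / total for i in selected}
    unselected = [i for i in range(len(probs)) if i not in weights]
    return selected, weights, unselected
```

In the terms used by the method, the selected indices correspond to the first target data processing networks and the unselected indices to the second target data processing networks.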
[0129] The first target data processing network can be understood as the data processing network determined by the path selection network based on path selection information. It is the main execution network of the model when processing the current input data, and is usually selected as the network that best matches the current task requirements. For example, in natural language processing tasks, the first target data processing network is the most suitable expert module for processing the given context. The first data feature information can be understood as the core feature information generated by the first target data processing networks, that is, the feature information obtained after fusing the data feature information output by each first target data processing network. This feature information reflects the key information extracted from the input data by the selected networks, and is used for further data analysis or decision-making.
[0130] The second target data processing network can be understood as an alternative processing network determined by the path selection network after selecting the first target data processing network. The second data feature information can be understood as the feature information resulting from the fusion of auxiliary feature information generated by each of the second target data processing networks. This information supplements the model's main decisions, helping to improve the comprehensiveness of the processing. For example, in image recognition tasks, the second data feature information may contain features of secondary objects, used to supplement the recognition effect of the main object, thereby improving the overall performance of the model.
[0131] It should be noted that obtaining the second model loss value based on the first and second data feature information can be understood as evaluating the model's resource allocation effect through a difference or weighted analysis of the first feature information generated by the selected network and the second feature information generated by the unselected network. Specifically, this can be achieved by calculating the difference between the outputs generated by the first and second data feature information, for example, by comparing the similarity of the output features generated by the two, and using the difference value as the loss value; alternatively, a weighted method can be used, where the first and second feature information are weighted and superimposed according to task priority, and the error between the weighted result and the target output is calculated as the loss value; or the first and second feature information can be used to generate different prediction results, and these results can be independently compared with the true values to calculate their weighted average error as a reference for resource allocation optimization, etc. This disclosure does not impose any limitations in this regard.
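The similarity-based option above can be sketched as follows. This is only one assumed formulation, not the formula of this disclosure: the squared gap between the mean selection probability of the selected networks and that of the unselected networks is weighted by the cosine similarity of the first and second data feature information, so that the penalty is largest when functionally similar networks receive very different selection probabilities.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def second_model_loss(first_features: Sequence[float],
                      second_features: Sequence[float],
                      selected_probs: Sequence[float],
                      unselected_probs: Sequence[float]) -> float:
    """Assumed loss: feature similarity times squared selection-probability gap."""
    # Dissimilar (or opposed) outputs contribute no penalty.
    similarity = max(0.0, cosine_similarity(first_features, second_features))
    gap = (sum(selected_probs) / len(selected_probs)
           - sum(unselected_probs) / len(unselected_probs))
    return similarity * gap ** 2
```

Under this assumed form, minimizing the loss with respect to the path selection parameters pulls the selection probabilities of similar networks toward each other, while leaving networks with dissimilar outputs free to be selected at different rates.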
[0132] By calculating the second model loss value using the first data feature information generated by the selected networks and the second data feature information generated by the unselected networks, load balancing of the model can be effectively achieved, thereby optimizing the efficiency of computing resource utilization. The first data feature information focuses on core processing, ensuring accurate task execution, while the networks that generate the second data feature information can share some of the workload with the first target data processing networks when the load is high, resulting in more rational resource allocation. By calculating and adjusting the second model loss value, the model can intelligently switch processing loads between different tasks, avoiding performance bottlenecks caused by concentrating resources on a few networks. This achieves efficient collaboration and load balancing among multiple networks, ensuring the model's processing stability and response speed in a multi-tasking environment.
[0133] The scheme applied in this disclosure introduces multiple reference models and constructs diverse target data processing networks based on these models, including full data processing networks and compressed data processing networks, enabling flexible structural configuration during model training. This multi-layered data processing network design allows the system to intelligently select the most suitable data processing path according to the specific task and training stage requirements, thereby optimizing the efficiency of computing resource utilization. In particular, the combination of full data processing networks and compressed data processing networks not only improves processing accuracy but also effectively reduces model parameters and computational load, allowing the model to achieve a better balance between high accuracy and resource consumption.
[0134] Furthermore, by generating a data processing model based on the target data processing network and employing a path selection network mechanism, the training path can be dynamically adjusted according to the characteristics of different data processing networks. This dynamic path selection not only enhances the model's adaptability but also automatically selects the optimal processing method based on the characteristics of different tasks and training stages, avoiding excessive computation and redundant resource consumption, thereby effectively improving the speed of training and inference. Moreover, through a multi-layered network structure and flexible path selection, the model ensures that it can optimize the use of computing resources under different data samples and task requirements, providing efficient and high-precision solutions, significantly improving the utilization of computing resources and reducing unnecessary training and inference overhead.
[0135] Corresponding to the above method embodiments, this disclosure also provides a text processing method embodiment. FIG. 3 shows a flowchart of a text processing method provided according to an embodiment of this disclosure, which specifically includes the following steps.
[0136] Step 302: Obtain the text to be processed.
[0137] Step 304: Input the text to be processed into the data processing model and obtain the text processing result output by the data processing model, wherein the data processing model is trained by the above-described model training method.
[0138] In practical applications, the text to be processed is raw text data that needs to be processed or analyzed in some way, and the text processing result is the output result obtained after the text to be processed is processed by the data processing model.
[0139] For example, the text to be processed can be understood as the input text that needs to be analyzed, transformed, or optimized during text processing. This text can be content from different sources and of different types, such as articles, comments, reports, social media posts, or any textual information that requires further understanding or manipulation. The text to be processed typically contains raw information, which may include noise, spelling errors, or grammatical problems, and needs to be adapted to further analysis or tasks through specific processing procedures. For example, in natural language processing tasks, the text to be processed could be a user-input question, a paragraph from an article, or a document; processing this text helps the model understand its meaning and respond.
[0140] Text processing results can be understood as the output data or information generated after a data processing model processes the text to be processed. This result typically includes key information extracted by the model, classification labels, summaries, translated text, or corrections to the original text. For example, in response to a user's query, the model can convert it into a machine-understandable format and provide an appropriate answer, or extract key information from the text for subsequent analysis. In practical applications, the quality and accuracy of text processing results directly affect the effectiveness of subsequent tasks. The results processed by the model can be used for various tasks such as information retrieval, sentiment analysis, and text classification.
[0141] The above is an illustrative scheme of a text processing method according to this embodiment. It should be noted that the technical solution of this text processing method and the technical solution of the model training method described above belong to the same concept. For details not described in detail in the technical solution of the text processing method, please refer to the description of the technical solution of the model training method described above.
[0142] By applying the scheme of this embodiment, the text to be processed is acquired and input into the trained data processing model, so that the text processing task is handled by the target data processing networks constructed as described above, yielding efficient and accurate text processing results. Because the data processing model combines the full data processing network and the compressed data processing network and is optimized during training, it can select an appropriate data processing path for each input. The method is therefore more flexible and efficient when processing different types of text.
[0143] Corresponding to the above method embodiments, this disclosure also provides an embodiment of a text processing method applied to a cloud device. FIG. 4 shows a flowchart of a text processing method applied to a cloud device according to an embodiment of this disclosure, which specifically includes the following steps.
[0144] Step 402: Receive a text processing request sent by the edge device, wherein the text processing request includes the text to be processed.
[0145] Step 404: Input the text to be processed into the data processing model and obtain the text processing result output by the data processing model, wherein the data processing model is trained by the above-described model training method.
[0146] Step 406: Send the text processing result to the edge device.
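Steps 402 to 406 can be sketched as a single request handler on the cloud device. This is a minimal illustration under stated assumptions: the request and response types, the `device_id` field, and the `toy_model` stand-in are hypothetical, and the transport between the edge device and the cloud device is omitted.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TextProcessingRequest:
    device_id: str  # hypothetical identifier of the requesting device
    text: str       # the text to be processed

@dataclass
class TextProcessingResponse:
    device_id: str
    result: str     # the text processing result

def handle_request(request: TextProcessingRequest,
                   model: Callable[[str], str]) -> TextProcessingResponse:
    """Step 402: receive the request; Step 404: run the model; Step 406: return the result."""
    if not request.text.strip():
        raise ValueError("the text to be processed is empty")
    result = model(request.text)
    return TextProcessingResponse(device_id=request.device_id, result=result)

def toy_model(text: str) -> str:
    """Stand-in for the trained data processing model (assumption for illustration)."""
    return text.upper()

response = handle_request(TextProcessingRequest("edge-1", "hello"), toy_model)
```

Passing the model in as a callable keeps the handler independent of how the data processing model was trained or deployed.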
[0147] In practical applications, cloud devices are remote server devices with strong processing capabilities and abundant resources, while edge devices are devices that run locally on the user's side and typically have lower processing capabilities. Text processing requests are processing requests initiated by edge devices to cloud devices, and usually contain text content that needs to be analyzed or transformed.
[0148] Cloud devices can be understood as high-performance server clusters or computing resource pools deployed on cloud computing platforms. These devices typically possess powerful processing capabilities, storage capacity, and network bandwidth, making them suitable for performing complex data processing tasks. For example, cloud devices can be used to run large models, perform data analysis, or provide cloud storage services. In text processing scenarios, cloud devices often undertake the main computing tasks, receiving requests from edge devices, processing large amounts of data, generating processing results, and returning the results to the edge devices. The advantages of cloud devices lie in their scalability and efficient processing capabilities, making them suitable for handling complex text analysis and reasoning tasks.
[0149] Edge devices can be understood as devices used locally by the user, typically with limited computing and storage resources, such as smartphones, laptops, and IoT devices. Edge devices are generally used to receive and send data requests and perform relatively simple local processing tasks. By collaborating with cloud devices, edge devices delegate more complex computing tasks to the cloud. For example, a user inputs text on an edge device, sends a request to the cloud device to request text analysis or translation, and then the cloud returns the analysis results for the user to view. In text processing scenarios, edge devices play the role of data acquisition and request sending, reducing the local computing burden.
[0150] A text processing request can be understood as a request initiated by an edge device to a cloud device. It typically includes the text data to be processed and the corresponding processing instructions. Such requests may be for tasks such as text analysis, natural language understanding, sentiment analysis, and translation. Text processing requests are usually sent in a standardized data format, carrying the text to be processed and the task requirements. A request is usually triggered by the user or automatically by the device. After receiving the request, the cloud device executes the corresponding processing logic according to the requirements and returns the processing result. For example, when a user enters a piece of text by voice on a smartphone for translation, the phone sends a text processing request to the cloud, and the cloud translation service processes the request and returns the translated text.
[0151] The above is an illustrative scheme of a text processing method applied to a cloud device according to this embodiment. It should be noted that the technical solution of the text processing method applied to the cloud device belongs to the same concept as the technical solution of the text processing method described above. For details not described in detail in the technical solution of the text processing method applied to the cloud device, please refer to the description of the technical solution of the text processing method described above.
[0152] By applying the scheme of this disclosure, which receives text processing requests from edge devices and inputs the text to be processed into an optimized and trained data processing model, this method can provide an efficient and accurate cloud-based text processing solution. The data processing model is designed based on diverse target data processing networks, including full data processing networks and compressed data processing networks. This design enables cloud devices to intelligently select the most suitable data processing path according to different task requirements and actual processing scenarios, thereby effectively reducing the consumption of computing resources while ensuring text processing accuracy.
[0153] Referring to Figure 5, Figure 5 shows an architecture diagram of a text processing system provided in an embodiment of the present disclosure. The text processing system may include a client 100 and a server 200.
[0154] Client 100 is used to send a text processing request to server 200, wherein the text processing request includes the text to be processed.
[0155] Server 200 is used to input the text to be processed into the data processing model and obtain the text processing result output by the data processing model, wherein the data processing model is trained by the above-described model training method, and to send the text processing result to client 100.
[0156] Client 100 is also used to receive text processing results sent by server 200.
[0157] By applying the solutions of this disclosure, the text processing system provides an efficient and distributed text processing solution through a client-server architecture. The client sends the text to be processed to the server, which processes the text based on a carefully trained data processing model, generates the processing result, and returns it to the client. This architecture not only improves the system's scalability but also optimizes resource utilization, allowing text processing tasks to be distributed across different computing nodes, avoiding bottlenecks caused by a single computing resource.
[0158] A text processing system may include multiple clients 100 and a server 200. Clients 100 can be referred to as edge devices, and server 200 can be referred to as cloud devices. Multiple clients 100 can establish communication connections through server 200. In a text processing scenario, server 200 is used to provide text processing services between multiple clients 100. Each client 100 can act as a sender or receiver, communicating through server 200.
[0159] Users can interact with server 200 through client 100 to receive data sent by other clients 100, or send data to other clients 100, etc. In a text processing scenario, users can publish data streams to server 200 through client 100, server 200 can generate text processing results based on the data stream, and push the text processing results to other clients that have established communication.
[0160] In this system, client 100 and server 200 establish a connection via a network. The network provides the medium for communication between client 100 and server 200. The network can include various connection types, such as wired or wireless communication links or fiber optic cables. Data transmitted by client 100 may need to undergo encoding, transcoding, compression, or other processing before being published to server 200.
[0161] Client 100 can be a browser, an app (application), a web application such as an H5 (HyperText Markup Language 5) application, a lightweight application (also known as a mini-program), or a cloud application. Client 100 can be developed based on the software development kit (SDK) of the corresponding service provided by server 200, such as a real-time communication (RTC) SDK. Client 100 can be deployed on electronic devices and depends on the device or certain apps on the device to run. Electronic devices may have displays and support information browsing, such as personal mobile terminals like mobile phones, tablets, and personal computers. Various other types of applications can also be configured on electronic devices, such as human-computer interaction applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social media platform software.
[0162] Server 200 may include servers providing various services, such as servers providing communication services to multiple clients, servers supporting backend training of models used on clients, and servers processing data sent by clients. It should be noted that server 200 can be implemented as a distributed server cluster composed of multiple servers, or as a single server. The server can also be a server in a distributed system, or a server integrated with blockchain. The server can also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
[0163] It is worth noting that the text processing method provided in the embodiments of this disclosure is generally executed by the server. However, in other embodiments of this disclosure, the client may also have similar functions to the server, thereby executing the text processing method provided in the embodiments of this disclosure. In other embodiments, the text processing method provided in the embodiments of this disclosure may also be executed jointly by the client and the server.
[0164] The following description, in conjunction with Figure 6, uses the application of the model training method provided in this disclosure in automatic question-answering model training as an example to further illustrate the model training method. Figure 6 shows a flowchart of the processing procedure of an automatic question-answering model training method according to an embodiment of this disclosure, specifically including the following steps.
[0165] Step 602: Obtain at least two reference models, and based on each reference model, obtain at least two target data processing networks, including a full data processing network and a compressed data processing network.
[0166] In practical applications, acquiring the above-mentioned multiple reference models as expert models lays the foundation for building the subsequent automatic question-answering model.
[0167] Step 604: Determine the reference model data processing layer in the reference model, and obtain the target data processing network based on the reference model data processing layer.
[0168] Specifically, by obtaining the structure of the data processing network in the reference model, this process ensures that the obtained network is a relatively important network in the expert model (i.e., the reference model), thus avoiding excessive structural redundancy.
[0169] Step 606: Obtain the initial parameter information corresponding to each reference model data processing layer, and generate the full data processing network and the compressed data processing network based on this initial parameter information.
[0170] Step 608: Based on the pre-configured settings, select a full data processing network and at least two compressed data processing networks to construct the various data encoding layers of the automatic question answering model.
[0171] Step 610: Obtain target sample data and adjust the parameters of each module in the automatic question answering model based on the target sample data.
[0172] Step 612: Obtain project sample data related to the target project, and based on the project sample data, obtain the model load balancing loss value corresponding to each data encoding layer of the trained automatic question answering model.
[0173] In practical applications, the model load balancing loss value is used to reduce the difference in selection probability between target data processing networks with similar functions in the model.
[0174] Step 614: Train the parameters of the path selection network in each data encoding layer of the automatic question answering model using the above load balancing loss value until the model training stops, so that the trained automatic question answering model can improve the accuracy of the generated answers to questions related to the target project.
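The disclosure does not give a formula for the model load balancing loss value. The sketch below is one possible form, offered only as an assumption: for every pair of a selected and an unselected network, it measures how similar their outputs are (cosine similarity, clipped at zero) and penalizes the squared gap between their selection probabilities, so that networks with similar functions are pushed toward similar selection probabilities. The function names and the choice of cosine similarity are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two feature vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def load_balance_loss(probs, features, selected):
    """Assumed loss: similarity-weighted squared gap in selection probability.

    probs    -- selection probability of each target data processing network
    features -- output feature vector of each network for the same sample
    selected -- indices chosen by the path selection network
    """
    chosen = sorted(set(selected))
    unselected = [j for j in range(len(probs)) if j not in chosen]
    pairs = len(chosen) * len(unselected)
    if pairs == 0:
        return 0.0
    total = 0.0
    for i in chosen:
        for j in unselected:
            similarity = max(0.0, cosine(features[i], features[j]))
            total += similarity * (probs[i] - probs[j]) ** 2
    return total / pairs

# Network 1 behaves like the selected network 0, network 2 does not.
probs = [0.7, 0.2, 0.1]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
loss = load_balance_loss(probs, features, selected=[0])
```

In this toy case only the pair (0, 1) contributes, because network 2 produces an orthogonal output. In Step 614, a loss of this kind would be minimized with respect to the path selection parameters only, leaving the data processing networks themselves unchanged.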
[0175] By applying the scheme of this disclosure, and by introducing multiple reference models and constructing diverse target data processing networks based on these models, this method provides a flexible and efficient network architecture for training automatic question-answering models. In this process, the combination of the full data processing network and the compressed data processing network not only enhances the model's processing accuracy but also effectively reduces computational resource consumption. In particular, extracting important network structures from the reference models and combining them with the initial parameter information of the data processing layers avoids unnecessary redundant structures, improving the model's structural simplicity and computational efficiency.
[0176] Furthermore, the intelligent selection and generation of the target data processing network allows the automatic question-answering model to flexibly adjust its network structure based on different data samples and task requirements. Through the path selection network mechanism, the model can dynamically adjust the data processing path, optimizing the use of computational resources during training. This flexible path selection not only avoids over-computation and resource waste but also significantly improves the speed of training and inference, greatly reducing redundant computation while ensuring high accuracy.
[0177] By employing this multi-layered, dynamically adjustable data processing network structure, this method effectively improves the adaptability of the automatic question-answering model. Particularly during training for the target project, the model load balancing loss value is used to optimize network selection, ensuring the model can answer specific questions more accurately. Furthermore, it optimizes the use of computational resources, avoiding performance bottlenecks and overcomputation, thus achieving an efficient and highly accurate automatic question-answering solution.
[0178] Corresponding to the above method embodiments, this disclosure also provides a model training device embodiment. Figure 7 shows a schematic diagram of the structure of a model training device provided in one embodiment of this disclosure. As shown in Figure 7, the device includes:
[0179] The acquisition module 702 is configured to acquire at least two reference models and acquire at least two target data processing networks based on each reference model, wherein the target data processing networks include at least one full data processing network and at least one compressed data processing network.
[0180] The generation module 704 is configured to generate a data processing model based on each target data processing network, wherein the data processing model includes at least one data encoding layer, and the data encoding layer includes at least one full data processing network and at least one compressed data processing network.
[0181] The training module 706 is configured to acquire target sample data and train the data processing model based on the target sample data until the data processing model reaches the model training stopping condition, wherein the target sample data is the data used to train the data processing model.
[0182] Optionally, the acquisition module 702 is further configured to:
[0183] Determine a target reference model, wherein the target reference model is any one of the reference models;
[0184] Determine a reference model data processing layer in the target reference model, and obtain a target data processing network based on the reference model data processing layer.
[0185] Optionally, the acquisition module 702 is further configured to:
[0186] Obtain the initial parameter information corresponding to the data processing layer of the reference model;
[0187] Based on the initial parameter information, obtain the parameter feature information corresponding to the reference model data processing layer;
[0188] Generate a compressed data processing network based on the parameter feature information.
[0189] Optionally, the acquisition module 702 is further configured to:
[0190] Based on the initial parameter information, obtain parameter scaling feature information and parameter output feature information;
[0191] Obtain the parameter feature information based on the parameter scaling feature information and the parameter output feature information.
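The disclosure does not state how the parameter feature information is computed from the initial parameter information. One common way to compress a weight matrix is low-rank factorization, and the sketch below uses it only as an illustrative stand-in: it extracts the leading rank-1 factor of a matrix by power iteration. Treating the scalar `sigma` as scaling information and the vector `u` as output-side information is an assumption of this sketch, not a statement of the claimed method.

```python
import math

def _norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def rank1_factor(weights, iters=50):
    """Leading singular triple (sigma, u, v) of a matrix, found by power iteration."""
    rows, cols = len(weights), len(weights[0])
    v = [1.0 / math.sqrt(cols)] * cols
    u = [0.0] * rows
    sigma = 0.0
    for _ in range(iters):
        u = [sum(weights[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        nu = _norm(u)
        if nu == 0.0:
            return 0.0, [0.0] * rows, v
        u = [x / nu for x in u]
        w = [sum(weights[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = _norm(w)
        if sigma == 0.0:
            return 0.0, u, v
        v = [x / sigma for x in w]
    return sigma, u, v

def reconstruct(sigma, u, v):
    """Rebuild the rank-1 approximation sigma * u * v^T."""
    return [[sigma * ui * vj for vj in v] for ui in u]

# A rank-1 matrix is recovered exactly (up to rounding) from rows + cols + 1 numbers.
initial_parameters = [[2.0, 4.0], [1.0, 2.0]]
sigma, u, v = rank1_factor(initial_parameters)
approximation = reconstruct(sigma, u, v)
```

A compressed network built this way stores `rows + cols + 1` values per matrix instead of `rows * cols`, at the cost of approximation error when the original matrix is not close to rank 1.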
[0192] Optionally, the acquisition module 702 is further configured to:
[0193] Obtain the initial parameter information corresponding to the data processing layer of the reference model;
[0194] Determine the initial parameter information as the full data processing network.
[0195] Optionally, the generation module 704 is further configured to:
[0196] Among the target data processing networks, determine a target full data processing network and at least one target compressed data processing network, wherein the target full data processing network is any one of the full data processing networks, and the target compressed data processing network is any one of the compressed data processing networks;
[0197] Obtain a target data encoding layer based on the target full data processing network and each target compressed data processing network, wherein the target data encoding layer is any one of the data encoding layers;
[0198] Generate the data processing model based on each target data encoding layer.
[0199] Optionally, the generation module 704 is further configured to:
[0200] Obtain a path selection network, wherein the path selection network is used to select at least one data processing network for data processing in this round from the target full data processing network and each target compressed data processing network;
[0201] Obtain the target data encoding layer based on the path selection network, the target full data processing network, and each target compressed data processing network.
[0202] Optionally, the training module 706 is further configured to:
[0203] Based on the target sample data, adjust the target parameters corresponding to the data processing model and / or the path selection parameters corresponding to the path selection network in each data encoding layer until the data processing model reaches the model training stopping condition.
[0204] Optionally, the training module 706 is further configured to:
[0205] Process the target sample data based on the data processing model to obtain a first model loss value, and adjust the target parameters corresponding to the data processing model according to the first model loss value until the first model training stopping condition is reached, to obtain a reference data processing model, wherein the first model loss value is a model loss value used to improve the accuracy of the model output results;
[0206] Obtain project sample data for the target project;
[0207] Process the project sample data based on the data processing model to obtain a second model loss value, and adjust the path selection parameters corresponding to each reference data encoding layer according to the second model loss value until the second model training stopping condition is reached, wherein the second model loss value is a model loss value used to reduce the difference in selection probability between target data processing networks with similar functions in the model.
[0208] Optionally, the training module 706 is further configured to:
[0209] Determine a first reference data encoding layer, wherein the first reference data encoding layer is any one of the reference data encoding layers and includes a path selection network and at least two target data processing networks;
[0210] Input the project sample data into the path selection network to obtain the path selection information generated by the path selection network;
[0211] Based on the path selection information, determine at least one first target data processing network and at least one second target data processing network among the target data processing networks, wherein the first target data processing network is a data processing network that the path selection network selects to process the project sample data, and the second target data processing network is a target data processing network other than the first target data processing network;
[0212] Process the project sample data with each first target data processing network to obtain first data feature information, and process the project sample data with each second target data processing network to obtain second data feature information;
[0213] Obtain the second model loss value based on the first data feature information and the second data feature information.
[0214] By applying the scheme of this disclosure, and by introducing multiple reference models and generating diverse target data processing networks (including full data processing networks and compressed data processing networks) based on these reference models, this method achieves the function of flexibly configuring the data processing structure during model training. This multi-layered structural design enables the system to dynamically select the most suitable processing path according to the needs of different tasks and training stages, thereby optimizing the use of computing resources and improving training efficiency. In particular, the combination of the full data processing network and the compressed data processing network not only improves processing accuracy but also significantly reduces the number of model parameters and computational complexity, allowing the model to maintain high accuracy while minimizing resource consumption.
[0215] The above is an illustrative scheme of a model training device according to this embodiment. It should be noted that the technical solution of this model training device and the technical solution of the model training method described above belong to the same concept. For details not described in detail in the technical solution of the model training device, please refer to the description of the technical solution of the model training method described above.
[0216] Corresponding to the above method embodiments, this disclosure also provides a cloud training platform embodiment. Figure 8 shows a schematic diagram of the structure of a cloud training platform provided in one embodiment of this disclosure. As shown in Figure 8, the cloud training platform includes a request interface 802 and a response unit 804.
[0217] The request interface 802 is used to receive a task generation request sent by the terminal device.
[0218] The response unit 804 is configured to obtain a data processing model based on the task generation request, wherein the data processing model is trained by the aforementioned model training method, and to generate task information based on the data processing model, wherein the task information is used by the terminal device to perform a text processing task.
[0219] Optionally, the response unit 804 is further configured to:
[0220] Parse the task generation request to obtain request information, and obtain a data processing model based on the request information.
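The behavior of the request interface and the response unit can be sketched as follows. The JSON request format, the `task_type` field, and the `MODEL_REGISTRY` mapping are all hypothetical choices made for illustration. The disclosure does not specify how the request information is encoded or how a data processing model is looked up.

```python
import json

# Hypothetical registry mapping a task type to an identifier of a trained model.
MODEL_REGISTRY = {
    "qa": "qa-model-v1",
    "translation": "translation-model-v1",
}

def parse_task_request(raw_request: str) -> dict:
    """Parse the task generation request into request information."""
    info = json.loads(raw_request)
    if "task_type" not in info:
        raise ValueError("request information must contain 'task_type'")
    return info

def generate_task_info(raw_request: str) -> dict:
    """Obtain a model from the request information and build the task information."""
    info = parse_task_request(raw_request)
    model_id = MODEL_REGISTRY.get(info["task_type"])
    if model_id is None:
        raise KeyError("no data processing model for task type " + repr(info["task_type"]))
    return {"task_type": info["task_type"], "model_id": model_id}

task_info = generate_task_info('{"task_type": "qa"}')
```

The returned task information is what the terminal device would use to perform its text processing task. A real platform would return a model handle or endpoint rather than a plain identifier.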
[0221] By introducing flexible request interfaces and response units into the cloud training platform, the solution of this disclosure can quickly respond to and provide the data processing model required for the text processing task after receiving a task generation request from the terminal device. This design not only improves the response speed of task processing but also intelligently selects and optimizes the data processing model when different requests are received. Specifically, by combining diverse target data processing networks, the platform can select a more suitable processing path based on the characteristics of the request information to ensure the efficiency and accuracy of text processing.
[0222] The above is an illustrative scheme of a cloud training platform according to this embodiment. It should be noted that the technical solution of this cloud training platform belongs to the same concept as the technical solution of the above-described model training method. For details not described in the technical solution of the cloud training platform, please refer to the description of the technical solution of the above-described model training method.
[0223] Figure 9 shows a structural block diagram of a computing device 900 according to an embodiment of the present disclosure. The components of the computing device 900 include, but are not limited to, a memory 910 and a processor 920. The processor 920 is connected to the memory 910 via a bus 930, and a database 950 is used to store data.
[0224] The computing device 900 also includes an access device 940, which enables the computing device 900 to communicate via one or more networks 960. Examples of these networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 940 may include one or more of any type of wired or wireless network interface (e.g., a network interface controller (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a Near Field Communication (NFC) interface.
[0225] In one embodiment of this disclosure, the aforementioned components of the computing device 900, as well as other components not shown in FIG. 9, may also be connected to each other, for example, via a bus. It should be understood that the computing device structural block diagram shown in FIG. 9 is merely for illustrative purposes and is not intended to limit the scope of this disclosure. Those skilled in the art can add or replace other components as needed.
[0226] The computing device 900 can be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable computing devices (e.g., smartwatches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or personal computers (PCs). The computing device 900 can also be a mobile or stationary server.
[0227] The processor 920 is used to execute computer-executable instructions which, when executed by the processor, implement the steps of the above-mentioned model training method, text processing method, and text processing method applied to cloud devices.
[0228] The above is an illustrative scheme of a computing device according to this embodiment. It should be noted that the technical solution of this computing device belongs to the same concept as the above-mentioned model training method, text processing method, and text processing method applied to cloud devices. For details not described in detail in the technical solution of the computing device, please refer to the descriptions of the above-mentioned model training method, text processing method, and text processing method applied to cloud devices.
[0229] An embodiment of this disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the above-described model training method, text processing method, and text processing method applied to a cloud device.
[0230] The above is an illustrative scheme of a computer-readable storage medium according to this embodiment. It should be noted that the technical solution of this storage medium belongs to the same concept as the technical solutions of the above-described model training method, text processing method, and text processing method applied to cloud devices. For details not described in detail in the technical solution of the storage medium, please refer to the descriptions of the technical solutions of the above-described model training method, text processing method, and text processing method applied to cloud devices.
[0231] An embodiment of this disclosure also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the above-described model training method, text processing method, and text processing method applied to a cloud device.
[0232] The above is an illustrative scheme of a computer program product according to this embodiment. It should be noted that the technical solution of this computer program product belongs to the same concept as the technical solutions of the above-mentioned model training method, text processing method, and text processing method applied to cloud devices. For details not described in detail in the technical solution of the computer program product, please refer to the descriptions of the technical solutions of the above-mentioned model training method, text processing method, and text processing method applied to cloud devices.
[0233] The foregoing has described specific embodiments of this disclosure. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
[0234] The computer instructions include computer program code, which may be in the form of source code, object code, executable file, or certain intermediate forms. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium may be appropriately added or removed according to the requirements of patent practice. For example, in some regions, according to patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
[0235] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of this disclosure are not limited to the described order of actions, because according to the embodiments of this disclosure, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to the embodiments of this disclosure.
[0236] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0237] The preferred embodiments disclosed above are merely illustrative of this disclosure. The optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the embodiments of this disclosure. These embodiments are selected and specifically described in this disclosure to better explain the principles and practical applications of the embodiments of this disclosure, thereby enabling those skilled in the art to better understand and utilize this disclosure. This disclosure is limited only by the claims and their full scope and equivalents.
Claims
1. A model training method, comprising: obtaining at least two reference models, and obtaining at least two target data processing networks based on each reference model, wherein the target data processing networks include at least one full data processing network and at least one compressed data processing network; generating a data processing model based on each target data processing network, wherein the data processing model includes at least one data encoding layer, and the data encoding layer includes at least one full data processing network and at least one compressed data processing network; and obtaining target sample data and training the data processing model based on the target sample data until the data processing model reaches a model training stopping condition, wherein the target sample data is data used to train the data processing model.
2. The method of claim 1, wherein obtaining at least two target data processing networks based on each reference model comprises: determining a target reference model, wherein the target reference model is any one of the reference models; and determining a reference model data processing layer in the target reference model, and obtaining a target data processing network based on the reference model data processing layer.
3. The method of claim 2, wherein determining the reference model data processing layer comprises: evaluating the feature importance of each reference model, wherein the feature importance evaluation includes determining the gradient contribution of each neural network layer in each reference model; and identifying, based on the feature importance evaluation results, the neural network layers whose gradient contribution is higher than a preset threshold as the reference model data processing layers.
4. The method of claim 2 or 3, wherein obtaining the target data processing network based on the reference model data processing layer comprises: obtaining initial parameter information corresponding to the reference model data processing layer; obtaining, based on the initial parameter information, parameter feature information corresponding to the reference model data processing layer; and generating a compressed data processing network based on the parameter feature information.
5. The method of claim 4, wherein obtaining the parameter feature information corresponding to the reference model data processing layer based on the initial parameter information comprises: obtaining parameter scaling feature information and parameter output feature information based on the initial parameter information; and obtaining the parameter feature information based on the parameter scaling feature information and the parameter output feature information.
6. The method of claim 4, wherein obtaining the parameter feature information corresponding to the reference model data processing layer based on the initial parameter information further comprises: normalizing the initial parameter information to generate a normalized parameter matrix; and extracting parameter features from the normalized parameter matrix to obtain the parameter feature information.
7. The method of any one of claims 2-6, wherein obtaining the target data processing network based on the reference model data processing layer comprises: obtaining initial parameter information corresponding to the reference model data processing layer; and determining the initial parameter information to be from a full data processing network.
8. The method of any one of claims 1-7, wherein generating the data processing model based on each target data processing network comprises: determining, among the target data processing networks, a target full data processing network and at least one target compressed data processing network, wherein the target full data processing network is any one of the full data processing networks, and the target compressed data processing network is any one of the compressed data processing networks; obtaining a target data encoding layer based on the target full data processing network and each target compressed data processing network, wherein the target data encoding layer is any one of the data encoding layers; and generating the data processing model based on each target data encoding layer.
9. The method of claim 8, wherein obtaining the target data encoding layer based on the target full data processing network and each target compressed data processing network comprises: obtaining a path selection network, wherein the path selection network is used to select, from the target full data processing network and each target compressed data processing network, at least one data processing network for the current round of data processing; and obtaining the target data encoding layer based on the path selection network, the target full data processing network, and each target compressed data processing network.
10. The method of any one of claims 1-9, wherein training the data processing model based on the target sample data until the data processing model reaches the model training stopping condition comprises: adjusting, based on the target sample data, the target parameters corresponding to the data processing model and/or the path selection parameters corresponding to the path selection network in each data encoding layer until the data processing model reaches the model training stopping condition.
11. The method of claim 10, wherein adjusting, based on the target sample data, the target parameters corresponding to the data processing model and the path selection parameters corresponding to the path selection network in each data encoding layer until the data processing model reaches the model training stopping condition comprises: processing the target sample data based on the data processing model to obtain a first model loss value, and adjusting the target parameters corresponding to the data processing model according to the first model loss value until a first model training stopping condition is reached, to obtain a reference data processing model, wherein the first model loss value is a model loss value used to improve the accuracy of the model output results; obtaining project sample data for a target project; and processing the project sample data based on the data processing model to obtain a second model loss value, and adjusting the path selection parameters corresponding to each reference data encoding layer according to the second model loss value until a second model training stopping condition is reached, wherein the second model loss value is a model loss value used to reduce the difference in selection probability between target data processing networks with similar functions in the model.
12. The method of claim 11, wherein processing the project sample data based on the data processing model to obtain the second model loss value comprises: determining a first reference data encoding layer, wherein the first reference data encoding layer is any one of the reference data encoding layers, and the first reference data encoding layer includes a path selection network and at least two target data processing networks; inputting the project sample data into the path selection network to obtain path selection information generated by the path selection network; determining, based on the path selection information, at least one first target data processing network and at least one second target data processing network among the target data processing networks, wherein the first target data processing network is a data processing network determined by the path selection network to be used to process the project sample data, and the second target data processing network is a target data processing network other than the first target data processing network; processing the project sample data by each first target data processing network to obtain first data feature information, and processing the project sample data by each second target data processing network to obtain second data feature information; and obtaining the second model loss value based on the first data feature information and the second data feature information.
13. The method of claim 12, wherein obtaining the second model loss value based on the first data feature information and the second data feature information further comprises: determining the cosine similarity between the first data feature information and the second data feature information; and generating a similarity loss value based on the cosine similarity, and incorporating the similarity loss value into the second model loss value.
14. The method of any one of claims 1-13, further comprising: performing data augmentation processing on the target sample data, wherein the data augmentation processing includes random cropping, rotation, or adding noise; and training the data processing model based on the augmented target sample data.
15. A text processing method, comprising: obtaining text to be processed; and inputting the text to be processed into a data processing model to obtain a text processing result output by the data processing model, wherein the data processing model is trained by the method of any one of claims 1-14.
16. A text processing method applied to a cloud device, comprising: receiving a text processing request sent by an end-side device, wherein the text processing request includes text to be processed; inputting the text to be processed into a data processing model to obtain a text processing result output by the data processing model, wherein the data processing model is trained by the method of any one of claims 1-14; and sending the text processing result to the end-side device.
17. A training platform, comprising a request interface and a response unit, wherein: the request interface is used to receive a task generation request sent by a terminal device; and the response unit is used to obtain a data processing model based on the task generation request, wherein the data processing model is trained by the method of any one of claims 1-14, and to generate task information based on the data processing model, wherein the task information is used by the terminal device to perform text processing tasks.
18. A computing device, comprising a memory and a processor, wherein the memory is used to store a computer program / instructions, and the processor is used to execute the computer program / instructions, which, when executed by the processor, implement the steps of the method according to any one of claims 1-16.
19. A computer-readable storage medium storing a computer program / instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1-16.
20. A computer program product comprising a computer program / instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1-16.
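By way of non-limiting illustration of the layer selection recited in claim 3, the following is a minimal sketch and not the claimed implementation. It assumes that the gradient contribution of a layer is measured as that layer's share of the summed gradient L2 norms; the claims do not fix the metric. The function names, layer names, gradient values, and threshold are hypothetical.

```python
import math


def gradient_contributions(layer_grads):
    """Return each layer's share of the summed gradient L2 norms.

    layer_grads maps a layer name to a flat list of gradient values.
    Assumption: contribution = layer norm / sum of all layer norms.
    """
    norms = {
        name: math.sqrt(sum(g * g for g in grads))
        for name, grads in layer_grads.items()
    }
    total = sum(norms.values())
    if total == 0.0:
        # No gradient signal at all: every layer contributes nothing.
        return {name: 0.0 for name in norms}
    return {name: norm / total for name, norm in norms.items()}


def select_layers(layer_grads, threshold):
    """Keep the layers whose contribution is higher than the preset threshold."""
    contributions = gradient_contributions(layer_grads)
    return [name for name, c in contributions.items() if c > threshold]


# Hypothetical per-layer gradients collected from one reference model.
example_grads = {
    "ffn_0": [3.0, 4.0],  # L2 norm 5
    "ffn_1": [0.0, 1.0],  # L2 norm 1
    "ffn_2": [0.0, 4.0],  # L2 norm 4
}
selected = select_layers(example_grads, threshold=0.3)
```

With these hypothetical values the contributions are 0.5, 0.1, and 0.4, so the layers named "ffn_0" and "ffn_2" would serve as reference model data processing layers.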
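By way of non-limiting illustration of the data encoding layer recited in claims 8 and 9, the following is a minimal sketch of a path selection network choosing between one full and one compressed data processing network. It assumes a linear router followed by softmax and top-k selection, which the claims do not prescribe; the two stand-in networks, the router weights, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def full_network(x):
    # Hypothetical stand-in for a full data processing network.
    return [2.0 * v + 1.0 for v in x]


def compressed_network(x):
    # Hypothetical stand-in for a cheaper, compressed network.
    return [2.0 * v for v in x]


def route(x, router_weights, top_k=1):
    """Score every candidate network and return the top_k indices."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:top_k], probs


def encoding_layer(x, networks, router_weights, top_k=1):
    """Run only the selected networks and mix their outputs.

    Outputs are weighted by the selection probabilities, renormalized
    over the chosen networks (an assumption of this sketch).
    """
    chosen, probs = route(x, router_weights, top_k)
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        y = networks[i](x)
        out = [o + (probs[i] / norm) * v for o, v in zip(out, y)]
    return out, chosen


networks = [full_network, compressed_network]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # one row per candidate network
output, chosen = encoding_layer([0.5, 2.0], networks, router_weights)
```

In this sketch only the selected networks are evaluated in a given round, which is where the saving in computation would come from; with `top_k=2` both networks run and their outputs are blended.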
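By way of non-limiting illustration of the second model loss value recited in claims 11 to 13, the following is a minimal sketch. The claims state only that a similarity loss is generated from the cosine similarity and that the loss reduces the difference in selection probability between networks with similar functions. The specific form used here, a similarity-weighted squared probability gap, is an assumption of this sketch, as are the function names and the example values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_loss(selected, unselected):
    """Penalize probability gaps between networks with similar outputs.

    selected / unselected: lists of (selection_probability, feature_vector)
    for the first (chosen) and second (not chosen) target networks.
    Assumed form: sum of max(0, cos_sim) * (p_i - p_j) ** 2 over all pairs.
    """
    loss = 0.0
    for p_i, f_i in selected:
        for p_j, f_j in unselected:
            sim = max(0.0, cosine_similarity(f_i, f_j))
            loss += sim * (p_i - p_j) ** 2
    return loss


# Hypothetical selection probabilities and output features for one sample.
first = [(0.7, [1.0, 0.0])]
second = [(0.2, [2.0, 0.0]), (0.1, [0.0, 3.0])]
loss_value = similarity_loss(first, second)
```

Under this assumed form, the second network whose features point in the same direction as the selected network contributes its squared probability gap to the loss, while the network with orthogonal features contributes nothing, so minimizing the loss pulls the selection probabilities of functionally similar networks together.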