A hybrid expert network optimization method based on partial quantization
By performing data stream sampling and iterative quantization on the hybrid expert network and optimizing the high-frequency subnet, the problems of high computational pressure and low throughput of the hybrid expert network are solved, and the computational efficiency and performance are improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI FUDIAN INTELLIGENT TECH CO LTD
- Filing Date
- 2022-12-30
- Publication Date
- 2026-06-26
Abstract
Description
Technical Field
[0001] This invention relates to the field of information technology, and specifically to a hybrid expert network optimization method based on partial quantization.
Background Technology
[0002] Mixture-of-Experts (MoE) networks, referred to in this document as hybrid expert networks, are a technique for organizing neural networks in a sparse manner. This technique can integrate more network parameters while keeping the increase in computational power requirements limited. A MoE network can be viewed as a large number of relatively small neural networks (such as fully connected networks or transformers) sparsely connected together through expert selection switches. MoE networks can provide effective support for tasks such as complex object discrimination, and can therefore serve as a basic model service for city-level artificial intelligence applications. Continuous optimization of MoE networks, in terms of computational efficiency and throughput, contributes to high-performance intelligent applications.
[0003] Quantization is a method of network model compression. It approximates the network weights or activation values represented by high bit width (e.g., 32-bit floating-point numbers) with lower bit width (16-bit floating-point numbers, 8-bit integers, or even 2 bits). Numerically, this means discretizing continuous values.
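The discretization described above can be illustrated with a minimal sketch. It assumes symmetric, per-tensor uniform quantization, which is one common scheme and is not stated to be the scheme used by this invention; the function names and the toy weights are hypothetical.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integer codes of the given bit width.

    Assumption: symmetric per-tensor scaling; other schemes exist.
    """
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.50, -1.27, 0.03, 1.00]             # toy FP32-style weights
codes, scale = quantize_symmetric(weights, num_bits=8)
restored = dequantize(codes, scale)
```

Each restored value differs from the original by at most half a quantization step, which is the approximation error that a quality threshold has to bound.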
[0004] Modern smart city systems increasingly rely on complex artificial intelligence models to discriminate and analyze spatial objects, and the computational pressure brought by large neural networks has become a technical bottleneck for improving intelligent applications. Existing hybrid expert network optimization methods usually quantize the entire network. Once a large neural network is deployed, the computational pressure is high and the optimization process itself consumes substantial computing power, leading to problems such as a large amount of computation, high cost, unbalanced load, low network throughput, and insufficient performance support.
Summary of the Invention
[0005] To overcome the technical problems of high computational load, high cost, and low network throughput in existing technologies, this invention provides a hybrid expert network optimization method based on partial quantization.
[0006] To achieve the above objectives, the present invention is implemented through the following technical solution:
[0007] A hybrid expert network optimization method based on partial quantization includes the following steps:
[0008] S1. Select a data sample set and perform hybrid expert network sampling;
[0009] S2. Establish the correspondence between subnets and datasets, and select high-frequency subnets and their corresponding datasets;
[0010] S3. Perform iterative quantization on the selected high-frequency subnet using the corresponding dataset.
[0011] Preferably, in step S1, information is sampled from the control gateway of each layer in the hybrid expert network to obtain the execution path of inference on the data sample.
[0012] Preferably, in step S1, the hybrid expert network is sampled to obtain information on frequently used subnets and execution paths, as well as the correspondence between them and the corresponding data sample set information.
[0013] Preferably, step S1 includes the following steps:
[0014] S11. For a given hybrid expert network N, the data sample set D = {d0, d1…dN};
[0015] S12, Implant sampling code for N;
[0016] S13. For each sample di in D, repeat the following steps:
[0017] S131. Write the ID number i of di to the log file;
[0018] S132, Call N to perform inference calculation on di, and write the accessed subnet set EN to the log file through sampling code.
[0019] Preferably, in step S2, each record in the sampled log file expresses the correspondence between a sample ID and its execution path EN. EN is composed of subnet IDs, so each record can be decomposed into data pairs, and the data pairs can be summarized to obtain the correspondence between the subnet ID and the sample subset Dk.
[0020] Preferably, step S2 includes the following steps:
[0021] S21. For a given hybrid expert network N, the sample data set D is used to obtain the sampled data PD;
[0022] S22. Inductively summarize PN to obtain n pairs of relationships between ENk and Dk;
[0023] S23. From the n relation pairs, select the r subnets {EN0, EN1, ..., ENr} whose sample subset Dk is larger than the threshold t as candidate subnets for quantization processing.
[0024] Preferably, in step S3, multiple subnets that do not have contextual dependencies are optimized simultaneously in parallel.
[0025] As a preferred approach, context-related processing further divides the original correspondence between subnets and sample sets by using the execution path of data sample inference to define the correspondence between subnet IDs, execution paths EN and sample sets. Here, execution path EN serves as context information. Based on this, the subnets are iteratively quantized using the sample set to achieve the effect of context-related processing.
[0026] Preferably, in step S3, the optimal quantization bit width configuration is found for the high-frequency expert subnet in the hybrid expert network through iterative quantization.
[0027] Preferably, step S3 includes the following steps:
[0028] S31. Initialization: Set the data sample set Dr, select the high-frequency subnet ENr, quantization threshold qt, set the optimized network ENr'=ENr, and set the current quantization configuration qc to QC1;
[0029] S32. Determine whether qc is empty. If yes, return the current optimized network ENr'; otherwise, proceed to the next step.
[0030] S33. Apply Dr to quantize ENr with the qc bit width configuration to obtain ENr1;
[0031] S34. Apply the quantization threshold qt to evaluate whether the quality of ENr1 is compliant. If yes, proceed to the next step; otherwise, return the current optimized network ENr'.
[0032] S35. Set ENr'=ENr1, reduce the quantization bit width, select a lower bit width quantization configuration and set qc, then return to S32.
[0033] Compared with the prior art, the advantages of the present invention are:
[0034] This invention samples the data stream of the inference process of a hybrid expert network to obtain the correspondence between different datasets and different subnetworks in the hybrid expert network. Then, it performs quantization optimization on the corresponding datasets for each subnetwork, thereby reducing the computational burden required for overall optimization of the hybrid expert network. The optimized hybrid expert network can be processed in high-frequency user scenarios using computational optimization, thereby improving the overall network throughput.
[0035] This invention utilizes the correspondence between the sampled subnet and the data, and uses the dataset corresponding to the frequently used subnet to quantize the subnet, thus avoiding quantization of the entire network and avoiding the high computational overhead caused by overall quantization. It is simple, efficient, and improves the overall network performance.
[0036] This invention uses network sampling to locate the relationship between subnets and data samples, preparing data for subsequent quantization processing. For subnet quantization, frequently used subnets are optimized. Quantization can be performed in context-independent, correlated, or partially correlated ways, and multiple subnets without context-dependent correlation can be quantized in parallel. By integrating these methods, quantization optimization of a given hybrid expert network in an application environment is achieved. The advantages of quantization include reduced model size (weight values are represented by low-bit-width data, reducing storage space), reduced computational pressure (high-bit-width floating-point calculations are reduced to low-bit-width floating-point or even integer calculations, greatly reducing computational overhead), and reduced power consumption and increased throughput.
Attached Figure Description
[0037] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the structures shown in these drawings without creative effort.
[0038] Figure 1 is a schematic diagram of a standard hybrid expert network architecture according to an embodiment of the present invention;
[0039] Figure 2 is a schematic diagram of the data flow path for hybrid expert network inference according to an embodiment of the present invention;
[0040] Figure 3 is a schematic diagram of context-independent quantization according to an embodiment of the present invention;
[0041] Figure 4 is a schematic diagram of context-dependent quantization according to an embodiment of the present invention;
[0042] Figure 5 is a flowchart illustrating the iterative quantization process according to an embodiment of the present invention.
Detailed Implementation
[0043] Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain the present invention, and should not be construed as limiting the present invention.
[0044] In the description of this invention, it should be understood that the terms "center," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing this invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on this invention.
[0045] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this invention, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0046] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," "linking," and "fixing," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection, an electrical connection, or a connection that allows communication between them; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components, unless otherwise explicitly limited. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.
[0047] Figures 1-5 show an embodiment of a hybrid expert network optimization method based on partial quantization according to the present invention. This embodiment includes the following steps:
[0048] S1. Select a data sample set and perform hybrid expert network sampling;
[0049] S2. Establish the correspondence between subnets and datasets, and select high-frequency subnets and their corresponding datasets;
[0050] S3. Perform iterative quantization on the selected high-frequency subnet using the corresponding dataset.
[0051] In this embodiment, in step S1, information is sampled from the control gateway of each layer in the hybrid expert network to obtain the execution path of inference on the data sample.
[0052] In this embodiment, in step S1, the hybrid expert network is sampled to obtain information on frequently used subnets and execution paths, as well as the correspondence between them and the corresponding data sample set information.
[0053] In this embodiment, step S1 includes the following steps:
[0054] S11. For a given hybrid expert network N, the data sample set D = {d0, d1…dN};
[0055] S12, Implant sampling code for N;
[0056] S13. For each sample di in D, repeat the following steps:
[0057] S131. Write the ID number i of di to the log file;
[0058] S132, Call N to perform inference calculation on di, and write the accessed subnet set EN to the log file through sampling code.
[0059] In this embodiment, in step S2, the data expression in the sampled log file shows the correspondence between the sample ID and the execution path EN. EN is composed of subnet IDs, so data pairs can be decomposed, and the correspondence between the subnet ID and the sample subset Dk can be obtained by summarizing the data pairs.
[0060] In this embodiment, step S2 includes the following steps:
[0061] S21. For a given hybrid expert network N, the sample data set D is used to obtain the sampled data PD;
[0062] S22. Inductively summarize PN to obtain n pairs of relationships between ENk and Dk;
[0063] S23. From the n relation pairs, select the r subnets {EN0, EN1, ..., ENr} whose sample subset Dk is larger than the threshold t as candidate subnets for quantization processing.
[0064] In this embodiment, in step S3, multiple subnets that do not have contextual dependencies are optimized simultaneously in parallel.
[0065] In this embodiment, context-related processing further divides the original correspondence between subnets and sample sets by using the execution path of data sample inference to divide the correspondence between subnet IDs, execution paths EN and sample sets. Here, execution path EN serves as context information. Based on this, the subnets are iteratively quantized using the sample set to obtain the context-related effect.
[0066] In this embodiment, in step S3, the optimal quantization bit width configuration is found for the high-frequency expert subnet in the hybrid expert network through iterative quantization.
[0067] In this embodiment, step S3 includes the following steps:
[0068] S31. Initialization: Set the data sample set Dr, select the high-frequency subnet ENr, quantization threshold qt, set the optimized network ENr'=ENr, and set the current quantization configuration qc to QC1;
[0069] S32. Determine whether qc is empty. If yes, return the current optimized network ENr'; otherwise, proceed to the next step.
[0070] S33. Apply Dr to quantize ENr with the qc bit width configuration to obtain ENr1;
[0071] S34. Apply the quantization threshold qt to evaluate whether the quality of ENr1 is compliant. If yes, proceed to the next step; otherwise, return the current optimized network ENr'.
[0072] S35. Set ENr'=ENr1, reduce the quantization bit width, select a lower bit width quantization configuration and set qc, then return to S32.
[0073] In this embodiment, as shown in Figure 1, the network can be divided into L layers, with N expert networks (i.e., sub-networks) in each layer. These N expert networks are scheduled by a gateway, which routes the data flow to one or more of the expert networks. During inference, only a subset of the expert networks participates in the computation; this can therefore be seen as expanding the network's learning capacity while keeping the computational power requirement relatively unchanged. Figure 2 shows a typical hybrid expert network inference execution path.
[0074] Hybrid expert network sampling is used to locate the execution path of a given data sample within a hybrid expert network, that is, the set of subnetworks invoked by the hybrid expert network during inference on that data sample. For example, as shown in Figure 2, a hybrid expert network consists of L layers of expert networks, with one expert subnet participating in inference at each layer. The execution path of the given data sample in the figure includes the subnets EN={EL1E2, EL2E6, ... ELL-2E5, ELL-1E3, ELLE1} (where ELi represents the i-th layer and Ej represents the j-th expert subnet of that layer).
[0075] Sampling of the hybrid expert network is obtained by sampling the output of each data sample as it flows through the control gateway (gate) of each layer. As can be seen from the structural characteristics of the hybrid expert network, the expert subnet through which each sample flows is determined by the gate. Therefore, it is unnecessary to sample the expert subnet itself; instead, the gate's decision needs to be recorded. Sampling is achieved by injecting profiler code into the operator code of the gate unit. The sampling code writes the subnet ID selected by the gate to a log file. Its pseudocode representation is as follows:
[0076] moe_gate_i(data) { // Gate of the i-th layer of the hybrid expert network; data is the input data
[0077] // Select the required expert subnet
[0078] j = select_expert(data);
[0079] write_log(j); // Sampling code: writes the selected expert subnet's ID to the log file
[0080] // Call the j-th expert subnet
[0081] data' = net_inference(experts[j], data);
[0082] }
[0083] The process of hybrid expert network sampling includes the following steps:
[0084] S11. For a given hybrid expert network N, the data sample set D = {d0, d1…dN};
[0085] S12, Implant sampling code for N;
[0086] S13. For each sample di in D, repeat the following steps:
[0087] S131. Write the ID number i of di to the log file;
[0088] S132, Call N to perform inference calculation on di, and write the accessed subnet set EN to the log file through sampling code.
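Steps S11 to S132 can be sketched as follows. This is a toy illustration under stated assumptions, not the patented implementation: `select_expert` and `run_expert` are hypothetical stand-ins for a learned gate and an expert subnet, the network sizes are arbitrary, and the log is assumed to be one JSON record per line.

```python
import io
import json

NUM_LAYERS = 3          # assumed toy depth (L)
EXPERTS_PER_LAYER = 4   # assumed toy number of experts per layer


def select_expert(layer, data):
    """Hypothetical top-1 gate: a deterministic stand-in for a learned router."""
    return (data * 7 + layer * 3) % EXPERTS_PER_LAYER


def run_expert(layer, expert, data):
    """Hypothetical expert subnet: any function of the input would do here."""
    return data + layer + expert


def moe_inference(data, path):
    """Run every layer; the injected sampling line records each gate decision."""
    for layer in range(NUM_LAYERS):
        j = select_expert(layer, data)
        path.append(f"EL{layer + 1}E{j}")        # sampling code (S12)
        data = run_expert(layer, j, data)
    return data


def sample_network(samples, log_file):
    """S13: log the sample ID (S131) and the accessed subnet set EN (S132)."""
    for i, d in enumerate(samples):
        path = []
        moe_inference(d, path)
        log_file.write(json.dumps({"sample_id": i, "EN": path}) + "\n")


log = io.StringIO()                              # in-memory stand-in for a log file
sample_network([5, 8, 13], log)
records = [json.loads(line) for line in log.getvalue().splitlines()]
```

Only the gate decision is recorded, so the expert subnets themselves run unmodified, which matches the observation above that sampling the expert subnet itself is unnecessary.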
[0089] Selecting high-frequency subnets and their corresponding datasets: in the sampled log file, each record expresses the correspondence between a sample ID and its execution path EN, and EN is composed of subnet IDs. Each record can therefore be decomposed into data pairs of sample ID and subnet ID. Summarizing these data pairs yields the sample subset Dk corresponding to each subnet ID, where Dk = {dk, dk+1, ..., dm}. Subnets that correspond to large sample subsets are high-frequency subnets and can be selected for quantization. The steps for selecting high-frequency subnets for quantization are as follows:
[0090] S21. For a given hybrid expert network N, the sample data set D is used to obtain the sampled data PD;
[0091] S22. Inductively summarize PN to obtain n pairs of relationships between ENk and Dk;
[0092] S23. From the n relation pairs, select the r subnets {EN0, EN1, ..., ENr} whose sample subset Dk is larger than the threshold t as candidate subnets for quantization processing.
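The aggregation and selection in S21 to S23 can be sketched as below. The record layout (`sample_id`, `EN`) and the function names are assumptions made for illustration, and the threshold t is interpreted as a minimum number of samples in Dk.

```python
from collections import defaultdict


def group_samples_by_subnet(records):
    """S22: split each (sample ID, EN) record into (sample ID, subnet ID)
    pairs and collect the sample subset Dk for every subnet ID."""
    subsets = defaultdict(set)
    for rec in records:
        for subnet_id in rec["EN"]:
            subsets[subnet_id].add(rec["sample_id"])
    return dict(subsets)


def select_high_frequency(subsets, t):
    """S23: keep the subnets whose sample subset holds more than t samples."""
    return {k: sorted(v) for k, v in subsets.items() if len(v) > t}


# Toy sampled data: four samples routed through a two-layer network.
records = [
    {"sample_id": 0, "EN": ["EL1E2", "EL2E6"]},
    {"sample_id": 1, "EN": ["EL1E2", "EL2E1"]},
    {"sample_id": 2, "EN": ["EL1E4", "EL2E6"]},
    {"sample_id": 3, "EN": ["EL1E2", "EL2E6"]},
]
candidates = select_high_frequency(group_samples_by_subnet(records), t=2)
```

With t=2, only EL1E2 and EL2E6 are kept, each paired with the three samples that passed through it.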
[0093] Iterative quantization processing: after a subnet is selected for quantization, its quantization is optimized iteratively. This invention uses the typical precision bit widths of deep neural networks: 32-bit floating point (FP32) is the starting bit width, followed by 16-bit brain floating point (BF16), 8-bit integer (INT8), 4-bit integer (INT4), ternary (3-value) representation, and binary (2-value) representation, giving five quantization options from high to low (labeled here as quantization configurations QC1, QC2, QC3, QC4, QC5). Quantization quality is typically evaluated against a quantization threshold qt, which ensures that the precision of the quantized network is not lower than qt.
[0094] Given a hybrid expert network N, a subnet ENr, the corresponding dataset Dr, and a quantization threshold qt, quantization is performed sequentially with quantization configurations from high to low, and each result is verified against the quantization threshold. The subnet quantized with the lowest quantization configuration (bit width configuration) that still meets the quantization threshold is taken as the final optimization result. The workflow steps of the iterative quantization process are as follows:
S31. Initialization: Set the data sample set Dr, select the high-frequency subnet ENr, quantization threshold qt, set the optimized network ENr'=ENr, and set the current quantization configuration qc to QC1;
[0095] S32. Determine whether qc is empty. If yes, return the current optimized network ENr'; otherwise, proceed to the next step.
[0096] S33. Apply Dr to quantize ENr with the qc bit width configuration to obtain ENr1;
[0097] S34. Apply the quantization threshold qt to evaluate whether the quality of ENr1 is compliant. If yes, proceed to the next step; otherwise, return the current optimized network ENr'.
[0098] S35. Set ENr'=ENr1, reduce the quantization bit width, select a lower bit width quantization configuration and set qc, then return to S32.
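The loop S31 to S35 can be sketched as follows, reading S32 as "stop when no quantization configuration is left". The `quantize` and `evaluate` callables, the configuration names, and the quality scores are all hypothetical placeholders; a real system would plug in its own quantizer and accuracy measurement.

```python
def iterative_quantize(subnet, dataset, qt, configs, quantize, evaluate):
    """Return the subnet quantized at the lowest configuration that meets qt.

    configs  -- quantization configurations ordered from high to low bit width
    quantize -- hypothetical callable (subnet, dataset, qc) -> quantized subnet
    evaluate -- hypothetical callable (subnet, dataset) -> quality score
    """
    best = subnet                       # S31: ENr' = ENr
    remaining = list(configs)           # S31: qc starts at the first configuration
    while remaining:                    # S32: return once qc would be empty
        qc = remaining.pop(0)
        candidate = quantize(subnet, dataset, qc)    # S33: quantize the original ENr
        if evaluate(candidate, dataset) < qt:        # S34: quality not compliant
            return best
        best = candidate                # S35: accept, then try a lower bit width
    return best


CONFIGS = ["BF16", "INT8", "INT4", "TERNARY", "BINARY"]
# Made-up quality scores, used only to drive the toy example.
TOY_SCORES = {"BF16": 0.95, "INT8": 0.94, "INT4": 0.91, "TERNARY": 0.80, "BINARY": 0.60}


def toy_quantize(subnet, dataset, qc):
    return {"name": subnet["name"], "config": qc}


def toy_evaluate(subnet, dataset):
    return TOY_SCORES[subnet["config"]]


original = {"name": "EL1E2", "config": "FP32"}
result = iterative_quantize(original, [0, 1, 3], 0.90, CONFIGS, toy_quantize, toy_evaluate)
```

In this toy run the search stops at the ternary configuration, which falls below the threshold, and returns the INT4 result. Each candidate is quantized from the original subnet rather than from the previous candidate, following the wording of S33.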
[0099] Context-dependent processing: in a hybrid expert network, a high-frequency subnet may appear in multiple execution paths at the same time. For example, as shown in Figure 3, subnet ELL-1E3 is shared by two execution paths. Quantizing it therefore affects the inference performance of both paths, and the quantization is in turn influenced by the data samples of both paths, so the optimization result is a trade-off between the two sample sets. With the context-dependent method, subnet ELL-1E3 is quantized and optimized separately with the samples of each execution path, generating two different optimized subnets, ELL-1E31 and ELL-1E32. This avoids the trade-off caused by data samples from multiple execution paths and allows a better quantization scheme to be found for each specific path.
[0100] Context-dependent processing can be achieved by further dividing the original correspondence between subnets and sample sets according to the execution path of data sample inference, which yields a correspondence between subnet ID, execution path EN, and sample set. Here, the execution path EN serves as context information (EN is already recorded in the sampling log, so the sampling method does not need to be modified). On this basis, iteratively quantizing the subnet with each sample set achieves context-dependent processing.
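One way to picture this finer division is to key the sample sets by the pair (subnet ID, execution path) instead of by subnet ID alone. The sketch below assumes the same hypothetical record layout (`sample_id`, `EN`) as a JSON-style sampling log; the function name is illustrative.

```python
from collections import defaultdict


def group_by_subnet_and_path(records):
    """Map (subnet ID, execution path EN) -> sample IDs, using EN as context."""
    groups = defaultdict(list)
    for rec in records:
        path = tuple(rec["EN"])          # tuples are hashable, so usable as keys
        for subnet_id in path:
            groups[(subnet_id, path)].append(rec["sample_id"])
    return dict(groups)


# Toy log: subnet EL2E3 is shared by two different execution paths.
records = [
    {"sample_id": 0, "EN": ["EL1E2", "EL2E3"]},
    {"sample_id": 1, "EN": ["EL1E5", "EL2E3"]},
    {"sample_id": 2, "EN": ["EL1E2", "EL2E3"]},
]
groups = group_by_subnet_and_path(records)
shared = {path: ids for (subnet, path), ids in groups.items() if subnet == "EL2E3"}
```

Here the shared subnet EL2E3 ends up with two separate sample sets, one per path, and each set could drive its own iterative quantization run.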
[0101] In the multi-subnet parallel quantization of hybrid expert networks, the context-dependent property described in the previous section can be used to identify two sets of unrelated data sample sets, each corresponding to a subnet for optimization (due to context-dependent property, the same subnet can be optimized in different contexts). Therefore, in this invention, multiple subnets that are not correlated by context can be optimized simultaneously in parallel, further improving optimization efficiency.
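A minimal sketch of running independent quantization jobs side by side is given below. It uses a thread pool from the Python standard library purely for illustration; `quantize_job` is a hypothetical stand-in for one iterative quantization run, and a process pool or separate devices may suit compute-bound work better.

```python
from concurrent.futures import ThreadPoolExecutor


def quantize_job(job):
    """Hypothetical stand-in for one iterative quantization run."""
    subnet_id, sample_ids = job
    return subnet_id, f"{subnet_id}-quantized-on-{len(sample_ids)}-samples"


def quantize_in_parallel(jobs, max_workers=4):
    """Run (subnet, sample subset) jobs that share no context concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(quantize_job, jobs))


# Each job pairs one candidate subnet with its own sample subset.
jobs = [("EL1E2", [0, 1, 3]), ("EL2E6", [0, 2, 3])]
results = quantize_in_parallel(jobs)
```

Because the jobs share no state, the results do not depend on the order in which the jobs finish.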
[0102] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0103] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.
Claims
1. A hybrid expert network optimization method based on partial quantization, characterized in that it includes the following steps: S1. Select a data sample set, perform hybrid expert network sampling, and write the subnet set EN visited after sampling to the log file; S2. Establish the correspondence between subnets and datasets, and select high-frequency subnets and their corresponding datasets; S3. Perform iterative quantization processing on the selected high-frequency subnet using the corresponding dataset; S2 includes the following steps: S21. For a given hybrid expert network N, the sample data set D is used to obtain the sampled data PD; S22. Summarize the sampled structural data PN to obtain n pairs of relationships between subnets ENk and sample subsets Dk; S23. Select r subnets {EN0, EN1, ..., ENr} from n relation pairs where Dk is greater than the threshold t as candidate subnets for quantization. S3 includes the following steps: S31. Initialization: Set the data sample set Dr, select the high-frequency subnet ENr, quantization threshold qt, set the optimized network ENr'=ENr, and set the current quantization configuration qc to QC1; S32. Determine whether qc is empty. If yes, return the current optimized network ENr'; otherwise, proceed to the next step. S33. Apply Dr to quantize ENr with the qc bit width configuration to obtain ENr1; S34. Apply the quantization threshold qt to evaluate whether the quality of ENr1 is compliant. If yes, proceed to the next step; otherwise, return the current optimized network ENr'. S35. Set ENr'=ENr1, reduce the quantization bit width, select a lower bit width quantization configuration and set qc, then return to S32.
2. The hybrid expert network optimization method based on partial quantization according to claim 1, characterized in that, In step S1, information is sampled from the control gateway of each layer in the hybrid expert network to obtain the execution path for reasoning on the data sample.
3. The hybrid expert network optimization method based on partial quantization according to claim 2, characterized in that, In step S1, the hybrid expert network is sampled to obtain information on frequently used subnets and execution paths, as well as the correspondence between them and the corresponding data sample set information.
4. The hybrid expert network optimization method based on partial quantization according to claim 3, characterized in that, S1 includes the following steps: S11. For a given hybrid expert network N, the data sample set D = {d0, d1…dN}; S12, Implant sampling code for N; S13. For each sample di in D, repeat the following steps: S131. Write the ID number i of di to the log file; S132, Call N to perform inference calculation on di, and write the accessed subnet set EN to the log file through sampling code.
5. The hybrid expert network optimization method based on partial quantization according to claim 4, characterized in that, In S2, the data in the sampled log file is expressed as the correspondence between sample ID and execution path EN, and the execution path EN is composed of a set of subnet IDs. Therefore, each record can be decomposed into a data pair (sample ID, subnet ID). By summarizing the data pairs, the correspondence between subnet ID and sample subset Dk can be obtained.
6. The hybrid expert network optimization method based on partial quantization according to claim 5, characterized in that, In S3, multiple subnets that do not have contextual dependencies are optimized simultaneously in parallel.
7. The hybrid expert network optimization method based on partial quantization according to claim 6, characterized in that, Context-dependent processing further divides the original correspondence between subnets and sample sets by using the execution path of data sample inference to establish the correspondence between subnet IDs, execution paths EN and sample sets. Here, execution path EN serves as context information. Based on this, the subnets are iteratively quantized using the sample set to achieve the effect of context-dependent processing.
8. The hybrid expert network optimization method based on partial quantization according to claim 7, characterized in that, In S3, the optimal quantization bit width configuration is found for the high-frequency expert subnet in the hybrid expert network through iterative quantization.