Large model dataset processing method and apparatus, device, and storage medium
By labeling and correcting the relationships between target factors, the large model dataset processing method solves the problem of inaccurate human judgment, achieves more accurate data screening and model correction, and improves the efficiency of large model training and the accuracy of generated content.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: KUO CHIA-LUN
- Filing Date: 2025-12-03
- Publication Date: 2026-06-25
AI Technical Summary
During the training and iteration process, existing large-scale language models suffer from inaccurate human judgment and simulated data feeding, resulting in large differences in answers and generated content that does not match user input or known world knowledge, thus affecting the application performance of downstream tasks.
By acquiring a pre-built large model algorithm and a learning dataset of target factor relationships, factor relationships are labeled and corrected, and processing strategies are determined, including adding simulation algorithms, updating the learning dataset, and filtering convolutional layers, thus achieving automated data filtering and model correction.
The method improves the accuracy and efficiency of large model training, reduces iterative learning time and labor costs, keeps generated content accurate and consistent, and adapts to complex downstream application scenarios.
Description
A method, apparatus, device, and storage medium for processing large model datasets.
[0001] This application claims priority to Chinese Patent Application No. CN202411864254.8, filed on December 17, 2024, entitled “A method, apparatus, device and storage medium for processing large model datasets”, the disclosure of which is incorporated herein by reference.
Technical Field
[0002] This invention relates to the field of artificial intelligence technology, and more specifically to a method, apparatus, device, and storage medium for processing large model datasets.
Background Technology
[0003] In recent years, with the popularization and rapid development of artificial intelligence (AI), applications that incorporate AI have also developed rapidly. Traditionally, a user first trains a model and then places the trained model into an inference system, which deploys it to realize a specific AI application. Existing large AI models have various problems. Taking large language models (LLMs) as an example, they often rely on manual judgment, self-written programs, or mechanical external design, or on feeding existing large models with artificially simulated data. However, human judgment may be inaccurate, and judgment standards may differ from person to person, which leads to large differences in answers. For example, the generated content may not match the user input, may contradict previously generated content, or may be inconsistent with known world knowledge. Even large models that were previously highly rated may see their performance on downstream tasks directly affected after iteration or drift. Therefore, a scheme for processing large model datasets is needed to improve model training performance and thereby meet the needs of practical applications.
Summary of the Invention
[0004] This invention provides at least one method, apparatus, device, and storage medium for processing large model datasets. By correcting learning data according to target factor relationships, it achieves more accurate filtering of large datasets, strengthens the correction and anchoring of the large model, and reduces training iteration time.
In a first aspect, this invention provides a method for processing large model datasets, comprising: acquiring a pre-constructed large model algorithm and a first learning dataset required for large model training, together with a second learning dataset pre-simulated with target factor relationships and the simulation algorithm required for simulating the second learning dataset; adding the simulation algorithm to the large model algorithm to obtain an updated large model algorithm, and adding the second learning dataset to the first learning dataset to obtain an updated first learning dataset; labeling each learning data item in the updated first learning dataset with factor relationships based on the updated large model algorithm, and determining the labeling results; and, in response to whether the labeling results indicate that the target factor relationship appears, determining a processing strategy for the updated first learning dataset and/or a modification strategy for the updated large model algorithm.
Optionally, the processing strategy for the updated first learning dataset is determined in one of the following ways: if the labeling results indicate that the target factor relationship is present, the updated first learning dataset is used as input data for large model training; if the labeling results indicate that the target factor relationship is absent but a compatible factor relationship associated with the target factor relationship exists, a new second learning dataset is added to the updated first learning dataset until the labeling results show the target factor relationship.
Optionally, the method further includes: obtaining a third learning dataset to be analyzed; inputting the third learning dataset into the final large model algorithm to determine whether the target factor relationship exists, where the final large model algorithm is obtained by iteratively training a large model using at least one of the processed learning dataset obtained through the processing strategy for the updated first learning dataset and the modified large model algorithm obtained through the modification strategy for the updated large model algorithm; and filtering the third learning dataset based on the determination result.
Optionally, the target factor relationship includes one or more of the following: a functional relationship between an independent variable and a dependent variable; a clustering factor relationship with at least one clustering factor; a time series relationship or frequency relationship between different learning data; or a time series relationship or frequency relationship between clustering factors within the learning data.
Optionally, when the target factor relationship is the clustering factor relationship, the method further includes: performing a clustering test based on a statistical clustering test method to determine whether the target factor relationship conforms to a specified clustering state; and, if it does, confirming the clustering distribution map or geometric distribution shape of the target factor relationship against the second learning dataset and then filtering again; or, through analysis and searching of the second learning dataset and the first learning dataset, determining that the distribution of all data conforms to the clustering distribution map or geometric distribution shape of the target factor relationship.
Optionally, when the target factor relationship is the time series relationship, the method further includes: adding random factor relationships that lie outside the Fourier transform interval and conform to the target factor relationship, and confirming the target factor time series relationship; clustering the second learning dataset by the target factor or factor relationship based on the target time series relationship; re-clustering the clustered factors based on the statistical clustering test method to confirm the new target factor time series relationship; and/or summing or mixing the results of each cluster based on the target factor to determine a new clustering factor relationship.
Optionally, after obtaining the second learning dataset, the method further includes: randomizing all or part of the learning data in the second learning dataset to obtain a processed second learning dataset; and updating the first learning dataset based on the processed second learning dataset.
Optionally, when there are multiple target factor relationships, the method further includes: determining common factor relationships among the multiple target factor relationships; filtering each convolutional layer included in the large model algorithm based on the common factor relationships; and training the large model based on the filtered convolutional layers.
Optionally, filtering each convolutional layer included in the large model algorithm based on the common factor relationships includes: for each convolutional layer, determining extreme values, or obtaining point values through Fourier transformation, based on statistical tests of the relationship between convolutional layers of the same type and the target factor, and filtering based on the determined extreme values or obtained point values; and/or, for external learning data to be corrected, performing convolutional layer filtering or data transformation based on statistical tests of the relationship between convolutional layers of the same type and the target factor.
Optionally, the method further includes: obtaining a calibration index for dataset calibration; and adding the calibration index to the second learning dataset so that the calibration index is calibrated during the training of the large model.
In a second aspect, the present invention also provides a large model dataset processing apparatus, comprising: an acquisition module, configured to acquire a pre-constructed large model algorithm and a first learning dataset required for training the large model, together with a second learning dataset pre-simulated with target factor relationships and the simulation algorithm required for simulating the second learning dataset; an addition module, configured to add the simulation algorithm to the large model algorithm to obtain an updated large model algorithm, and to add the second learning dataset to the first learning dataset to obtain an updated first learning dataset; an annotation module, configured to annotate each learning data item in the updated first learning dataset with factor relationships based on the updated large model algorithm, and to determine the annotation results; and a processing module, configured to determine, in response to whether the annotation results indicate that the target factor relationship appears, a processing strategy for the updated first learning dataset and/or a modification strategy for the updated large model algorithm.
In a third aspect, the present invention also provides an electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory via the bus, and when the machine-readable instructions are executed by the processor, the large model dataset processing method described in the first aspect or any of its embodiments is performed.
In a fourth aspect, the present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, performs the large model dataset processing method described in the first aspect or any of its embodiments.
With the above large model dataset processing method, apparatus, device, and storage medium, given a pre-constructed large model algorithm and a first learning dataset required for large model training, as well as a second learning dataset pre-simulated with target factor relationships and the simulation algorithm required for simulating it, the large model algorithm and the learning dataset can both be updated. Then, based on the updated large model algorithm, each learning data item in the updated first learning dataset is labeled with factor relationships, and the labeling results are used to determine a processing strategy for the updated first learning dataset and/or a modification strategy for the updated large model algorithm. The first learning dataset obtained through the processing strategy has been corrected for target factor relationships, which yields more accurate filtering of large datasets.
At the same time, modifying the large model algorithm yields better correction and anchoring of the large model, significantly reduces the time and computing power required for subsequent training iterations, and lowers the labor cost of data filtering.
Other advantages of the present invention are explained in more detail below with reference to the accompanying drawings. It should be understood that the above description is merely an overview of the technical solution of the present invention, provided to give a general understanding of its technical means and to facilitate implementation in accordance with the specification. To make the above and other objects, features, and advantages of the present invention more apparent, specific embodiments are illustrated below.
Attached Figure Description
[0005] To illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the embodiments are briefly described below. The drawings are incorporated in and constitute a part of this specification; they illustrate embodiments consistent with the present invention and, together with the specification, serve to explain its technical solutions. It should be understood that the drawings show only certain embodiments and should not be regarded as limiting the scope of protection. Those skilled in the art can obtain other related drawings from these drawings without creative effort. The same reference numerals denote the same components throughout the drawings. In the drawings:
Figure 1 shows a flowchart of a large model dataset processing method provided by an embodiment of the present invention;
Figure 2 shows a schematic diagram of a large model dataset processing apparatus provided by an embodiment of the present invention;
Figure 3 shows a schematic diagram of an electronic device provided by an embodiment of the present invention.
Detailed Implementation
[0006] Exemplary embodiments of the present invention will now be described in more detail with reference to the accompanying drawings. Although the drawings show exemplary embodiments, it should be understood that the invention can be implemented in various forms and is not limited to the embodiments set forth herein. Rather, these embodiments are provided to enable a more thorough understanding of the invention and to fully convey its scope to those skilled in the art.
In the description of the embodiments, terms such as “comprising” or “having” indicate the presence of the disclosed features, numbers, steps, actions, components, portions, or combinations thereof, and do not exclude the presence of one or more other features, numbers, steps, actions, components, portions, or combinations thereof. Unless otherwise stated, “/” means “or”; for example, A/B can mean A or B. “And/or” merely describes the relationship between related objects and covers three cases; for example, A and/or B can mean A alone, both A and B, or B alone. The terms “first,” “second,” and so on are used only to distinguish the same or similar technical features for ease of description, and should not be construed as indicating or implying the relative importance or number of these features. A feature defined by “first” or “second” can therefore explicitly or implicitly include one or more such features. Unless otherwise stated, “multiple” means two or more.
Research has found that the development of existing large artificial intelligence models has deviated from the scaling law.
Taking large language models (LLMs) as an example, some bottlenecks arise because data often requires manual judgment, or because artificially simulated data is fed to existing large models. However, human judgment may be inaccurate, and judgment criteria may differ from person to person. Artificially simulated data may also introduce simulation artifacts. The result can be significant differences in answers, or generated content that does not match the user input, contradicts previously generated content, or is inconsistent with known world knowledge. Even large models that were previously highly rated may see their performance on downstream tasks directly affected after iteration or drift.
On the other hand, AI has in the past been able to identify or manipulate specified factor relationships through self-written programs or mechanical external designs, which improved the identification of theoretical data, avoided drift, and reduced the need for manual screening and the errors it introduces. Compared with the underlying large model, such approaches based on traditional statistical methods can provide more accurate cluster screening and time series relationship identification, but they cannot be integrated into the learning and iteration of existing large models.
To at least partially address one or more of the aforementioned problems and other potential issues, this invention provides at least one large model dataset processing scheme. The scheme uses target factor relationships to simulate data for large model learning, uses those target factor relationships to confirm what the large model has learned, and further filters content that self-written programs or mechanical external designs cannot confirm from statistical relationships alone. From large datasets that differ from the specific simulated data, it can rapidly establish a large model agent, and it can be used to extend and correct existing large model agents.
It can also be used for timely judgment of, or anomaly warnings on, the input and output data of agents with specific factor relationships, especially when the current agent model has insufficient data or too few factors. By specifying target factor relationships, it can quickly filter out cases that exceed the limits, which helps when agent models are deployed on terminals and reduces hallucinated outputs. It can also be applied where existing data is insufficient and large model data must be filtered for the relevant specified factors. The processing mode provided by this invention does not conflict with other large model data filtering methods, such as existing competitive learning methods that manually confirm positive and negative cases, or reinforcement learning methods that use data generated by large models; these methods can be used together and complement one another.
Furthermore, the embodiments of this invention achieve automated screening, which further improves the theoretical accuracy of data identification and the learning effect of large models on the screened and identified datasets. This provides a more precise measurement, which in turn supports the progress of human culture and technological development. Confirming the relevant target factor relationships and relearning on the screened data is expected to further improve the learning outcomes of large models. The embodiments also provide supporting methods for accelerating convolution and correction, which speed up large model iteration and save computational and labor costs.
To facilitate understanding of this embodiment, the large model dataset processing method disclosed in this invention is described in detail first. The method is generally executed by an electronic device with certain computing capabilities, such as a server or other processing device.
In some possible implementations, this large model dataset processing method can be implemented by a processor calling computer-readable instructions stored in memory.
Referring to Figure 1, which shows a flowchart of the large model dataset processing method provided in an embodiment of the present invention, the method includes the following steps S101 to S104:
S101: Obtain a pre-constructed large model algorithm and a first learning dataset required for training the large model, as well as a second learning dataset pre-simulated with target factor relationships and the simulation algorithm required for simulating the second learning dataset;
S102: Add the simulation algorithm to the large model algorithm to obtain an updated large model algorithm, and add the second learning dataset to the first learning dataset to obtain an updated first learning dataset;
S103: Based on the updated large model algorithm, annotate each learning data item in the updated first learning dataset with factor relationships, and determine the annotation results;
S104: In response to whether the annotation results indicate that the target factor relationship appears, determine a processing strategy for the updated first learning dataset and/or a modification strategy for the updated large model algorithm.
To facilitate understanding of the large model dataset processing method provided in this embodiment, its application scenarios are briefly introduced below. The method can mainly be used in the field of artificial intelligence technology, including sub-fields such as image recognition and face detection, without specific limitation.
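The flow of steps S101 to S104 can be sketched in a few lines. This is only an illustrative toy under stated assumptions: the "large model algorithm" is replaced by a direct residual check against a hypothetical target relationship y = 2x + 1, and the names `simulate`, `label`, `choose_strategy`, and the `min_fraction` threshold are inventions for this sketch, not part of the disclosed method.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (independent variable, dependent variable)


def target_relation(x: float) -> float:
    """Hypothetical target factor relationship used only for illustration."""
    return 2.0 * x + 1.0


def simulate(relation: Callable[[float], float], xs: List[float]) -> List[Sample]:
    """S101: build a simulated (second) dataset that follows the relationship."""
    return [(x, relation(x)) for x in xs]


def label(data: List[Sample], relation: Callable[[float], float],
          tol: float = 1e-6) -> List[bool]:
    """S103 stand-in: mark each sample True if it satisfies the relationship."""
    return [abs(y - relation(x)) <= tol for x, y in data]


def choose_strategy(labels: List[bool], min_fraction: float = 0.8) -> str:
    """S104 stand-in: pick a strategy from the share of conforming samples."""
    fraction = sum(labels) / len(labels)
    if fraction >= min_fraction:
        return "train_on_updated_dataset"
    return "add_simulated_data_or_modify_algorithm"


# S101: first dataset (one conforming sample, one not) and simulated second dataset.
first_dataset = [(1.0, 3.0), (2.0, 9.0)]
second_dataset = simulate(target_relation, [0.0, 1.0, 2.0, 3.0])

# S102: merge the second dataset into the first.
updated_dataset = first_dataset + second_dataset

# S103 and S104: label every sample, then decide what to do next.
labels = label(updated_dataset, target_relation)
strategy = choose_strategy(labels)
```

In a real system the labeling step would be performed by the updated large model algorithm rather than by an explicit formula; the sketch only shows how the four steps feed into one another.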
In the large model training stage in particular, the processed learning dataset or the modified large model algorithm gives better training performance and therefore adapts to more complex and varied downstream application scenarios.
The simulation algorithm that is added differs from one application scenario to another. For example, a fitting algorithm for a specific factor relationship can be added, specified down to concrete parameters such as the free-fall relation (1/2) * 9.8 * t^2 or a battery wave equation, or parameters in the current large model algorithm can be specified.
In practical applications there can be two or more target factor relationships, and these can share, or be related to, one or more target factors or target factor relationships (or correlations). After the individual target factor relationships are confirmed, the shared target factors or target factor relationships are confirmed. When a target factor is shared, it is confirmed whether the target factors are identical, functionally related (by a linear or curvilinear relationship), or correlated. For discontinuous or sequential target factors, a clustering method can also be used for confirmation. If the algorithm does not meet the requirements, the following steps are taken: confirm the compatibility of the algorithm, modify and add algorithm modules, and add simulation data for learning.
Assuming the common target factor is Y, or the target factor relationship is Y = cy, first confirm the other target factor relationships, then confirm the common target factor or target factor relationship (or correlation). For example, y = Z^2 or y' = 2 ln Z are possible compatible factor relationships, but they may appear in the parameter part of the non-specialized algorithm, possibly with similar values.
In this case, once it is confirmed that Y and y have a linear relationship, or that Y and e^(y') have a linear relationship (since e^(2 ln Z) = Z^2), the confirmation and correction are complete, and the parameters are changed to those that the compatible target factor relationship should have.
For complex target factor relationships and compatible factor relationships, all factor relationships can be written into a single algorithm, or they can be split and recombined. For example, some target factor relationships can be confirmed separately, after which the common target factors or target factor relationships (or correlations) are confirmed. It can be confirmed whether the target factors are identical, functionally related (by a linear or curvilinear relationship), or correlated. For discontinuous or sequential target factors, a clustering method can also be used for confirmation. If the requirements are not met, the compatibility algorithm is confirmed, algorithm modules are modified and added, and simulation data for learning is added. Confirming the compatibility algorithm means checking whether the parameters obtained after the various splits and recombinations are the same as, or similar to, the parameters of the built-in compatibility algorithm, and then replacing the parameters.
The following is an example of splitting and recombination. The entire equation set can be used, or the equation set can be split and recombined into a co-structured equation set. For example, there are many different compatibility algorithms for electromagnetic wave functions. After splitting and recombination, however, different data formats, data sources, or convolutional layers may appear. The characteristics of the factor relationship after splitting can then be used to find the compatible factor relationships.
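The linear-relationship confirmation discussed earlier in this section (Y = cy with a compatible factor y = Z^2) can be illustrated with a plain correlation check. This is a minimal sketch, assuming a Pearson correlation threshold as the linearity criterion; the threshold value and the names `pearson` and `is_linear` are hypothetical, and the source does not prescribe a particular statistical test.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def is_linear(xs: Sequence[float], ys: Sequence[float],
              threshold: float = 0.999) -> bool:
    """Treat two factors as linearly related if |r| reaches the threshold."""
    return abs(pearson(xs, ys)) >= threshold


# Hypothetical data: Z values, the compatible factor y = Z^2, and Y = c * y.
zs = [1.0, 2.0, 3.0, 4.0, 5.0]
c = 3.0
compatible_y = [z ** 2 for z in zs]
target_Y = [c * y for y in compatible_y]

linear_in_y = is_linear(target_Y, compatible_y)  # Y against y = Z^2
linear_in_z = is_linear(target_Y, zs)            # Y against the raw factor Z

# Least-squares estimate of c in Y = c * y (line through the origin),
# i.e. the parameter that would replace the non-specialized one.
estimated_c = (sum(Y * y for Y, y in zip(target_Y, compatible_y))
               / sum(y * y for y in compatible_y))
```

Here Y is confirmed as linear in y but not in Z itself, which is the situation in which the parameter would be replaced by the one from the compatible relationship.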
In practical applications, compatible factor relationships are more likely to emerge when the target factor relationship is handled as a whole. After decomposition, by contrast, factor units, convolutions, or clustering methods are more likely to change the factor values; even so, this approach corrects the target factor relationship algorithm as far as possible. If, during further iterations, the specialized algorithm becomes insignificant (that is, it is excluded from the calculation of factor values in the large model), this indicates drift. Drift may also occur when compatible algorithms are used, in which case the large model must be recalibrated.
With a second learning dataset that carries the simulation algorithm, the target factor relationship can be checked using the labels produced by the large model algorithm. The second learning dataset, which conforms to the target factor relationship, is added to the first learning dataset required for training the large model, and after iteration the target factor relationship is checked. If no target factor relationship appears, the learning dataset labels are used to confirm whether a compatible factor relationship exists or whether the algorithm module needs modification; otherwise the learning dataset is enlarged and iterated again until the target factor relationship appears. If not all target factor relationships can ultimately be confirmed, it is necessary to check whether the composition of the target factor relationships contains contradictions or errors; partial correction of the large model after iteration can still be achieved. In subsequent iterations, if the previous correction used a compatible algorithm, or if the specialized algorithm for the target factor relationship has become insignificant (that is, it is excluded from the calculation of factor values in the large model), correction must be performed again.
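The "add simulated data and iterate until the target factor relationship appears" loop described above can be sketched with a deliberately tiny stand-in for training. In this sketch the "model" is just a least-squares slope through the origin, the target relationship is a hypothetical y = 2x, and `tol` and `max_rounds` are assumed parameters; none of these are specified in the source.

```python
from typing import List, Tuple

Sample = Tuple[float, float]


def fit_slope(data: List[Sample]) -> float:
    """Toy 'training': least-squares slope k of y = k * x through the origin."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


def augment_until_target(first: List[Sample], simulated_batch: List[Sample],
                         target_slope: float, tol: float = 0.15,
                         max_rounds: int = 20):
    """Add simulated batches until the fitted slope is close to the target.

    Returns (dataset, fitted slope, number of batches added). Raises if the
    target relationship never appears, which in the text is the cue to check
    the target factor relationships for contradictions or errors.
    """
    data = list(first)
    for rounds in range(max_rounds + 1):
        slope = fit_slope(data)
        if abs(slope - target_slope) <= tol:
            return data, slope, rounds
        data.extend(simulated_batch)
    raise RuntimeError("target factor relationship did not appear")


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
first_dataset = [(x, 3.0 * x) for x in xs]      # follows y = 3x, not the target
simulated_batch = [(x, 2.0 * x) for x in xs]    # follows the target y = 2x

final_data, final_slope, batches_added = augment_until_target(
    first_dataset, simulated_batch, target_slope=2.0)
```

With these numbers the fitted slope after k batches is (3 + 2k) / (1 + k), so the loop stops once enough simulated data outweighs the non-conforming first dataset.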
In this embodiment of the invention, the processing strategy for the updated first learning dataset can be as follows: if the annotation results indicate that a target factor relationship is present, the updated first learning dataset can be used as input data for training the large model; alternatively, if the annotation results indicate that the target factor relationship is absent but a compatible factor relationship associated with it exists, a new second learning dataset can be added to the updated first learning dataset until the annotation results show the target factor relationship.
Once the final large model algorithm has been obtained by iterative training with at least one of the processed learning data obtained through the processing strategy for the updated first learning dataset and the modified large model algorithm obtained through the modification strategy for the updated large model algorithm, the large model can be validated using the third learning dataset to be analyzed. The algorithm-corrected large model labels the large dataset to be analyzed or identified (that is, the third learning dataset) to confirm whether it exhibits the same target factor relationship as the simulated dataset.
The target factor relationship includes, but is not limited to, functional relationships between independent and dependent variables, clustering factor relationships with at least one clustering factor, time series relationships or frequency relationships between different learning data, and time series relationships or frequency relationships between clustering factors within the learning data. In practical applications, if there is no built-in large model fitting program, appropriate checks can be performed to further confirm whether the target factor relationship exists or falls within the correlation interval of the target factor.
This allows for further analysis such as data classification, removal of non-compliant data, or issuance of warnings.
When this invention is applied to clustering factor relationships, data updates can be performed according to the following steps:
Step 1: Perform a clustering test based on a statistical clustering test method to determine whether the target factor relationship conforms to the specified clustering state;
Step 2: If the target factor relationship conforms to the clustering state as determined by the statistical clustering test method, confirm the clustering distribution map or geometric distribution pattern of the target factor relationship against the second learning dataset, and then filter again; or, through analysis and searching of the second and first learning datasets, determine that the distribution of all data conforms to the clustering distribution map or geometric distribution pattern of the target factor relationship.
When the target factor relationship contains a clustering relationship, there may be more than one clustering factor. The clustering relationship can correspond to a statistical test of the distribution map of the clustering factor, or to a more specialized and precise data format, such as a clustering table, multidimensional matrix, image, map, Cayley diagram, mapping relationship, or multiplication table. These clustering factors may or may not be factors or relationships shared with other target factors, but they will exhibit correlations. Clustering factors can therefore be used as independent learning data items (for example, adding the corresponding temperature and pressure items to a simulated learning dataset for a three-phase diagram).
When factors are shared with other target factor relationships, the clustering factor can be identified after confirming whether each cluster conforms to the corresponding target factor relationship (because it is shared with other target factor relationships, it can share their labels). Various statistical clustering tests can then be applied to confirm whether it conforms to the specified clustering state. If no factors are shared with other target factor relationships, the clustering factor is a non-shared factor. In this case, the clustering factor can be added to the learning dataset; or a simulated target factor relationship dataset containing this clustering factor can be used; or, after statistical clustering tests, a clustering method can be used to analyze and find the clusters of the simulated dataset and the original large dataset in order to determine the clustering factor labels.
Many different clustering methods and statistical tests can be used to determine clustering factor labels. In the example below, the numbers represent the values or ranges of the clustering condition factor. Through clustering, clusters whose fitted factor relationships are similar or dissimilar are compared and candidates are eliminated step by step until one "factor" or one group of "factors" remains. For example, the corresponding learning dataset or a large dataset is added first, and irrelevant factor relationships are then eliminated. The cluster containing (1, 2, 3, 4, 5) is compared with the cluster containing (6, 7, 8, 9) to find the factor relationships that have changed (for example, with a t-test or a breakpoint test). Then the cluster containing (2, 4, 6, 8) is compared with the cluster containing (1, 3, 5, 7, 9), and even a set containing one factor can be statistically tested against a set containing two factors. Statistical tests (such as factor clustering tests or breakpoint tests) are used to find the factors that change.
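The cluster comparison just described, (1, 2, 3, 4, 5) against (6, 7, 8, 9) and then (2, 4, 6, 8) against (1, 3, 5, 7, 9), can be illustrated with a t-statistic. This is a minimal sketch using Welch's t-statistic as one possible choice of test; the fitted factor values in `factor_by_condition` are made-up numbers for illustration, and the helper names are hypothetical.

```python
import math
from statistics import mean, variance
from typing import Dict, Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t-statistic for two samples with possibly unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical fitted factor value observed under each clustering condition 1..9.
factor_by_condition: Dict[int, float] = {
    1: 1.00, 2: 1.10, 3: 0.90, 4: 1.00, 5: 1.05,
    6: 2.00, 7: 2.10, 8: 1.90, 9: 2.05,
}


def split_statistic(group_a: Sequence[int], group_b: Sequence[int]) -> float:
    """Absolute t-statistic between the factor values of two condition groups."""
    values_a = [factor_by_condition[c] for c in group_a]
    values_b = [factor_by_condition[c] for c in group_b]
    return abs(welch_t(values_a, values_b))


# Split by range: conditions 1-5 against 6-9.
t_range = split_statistic((1, 2, 3, 4, 5), (6, 7, 8, 9))
# Split by parity: even against odd conditions.
t_parity = split_statistic((2, 4, 6, 8), (1, 3, 5, 7, 9))
```

In this made-up data the range split separates the factor values sharply while the parity split does not, which would point to a breakpoint between conditions 5 and 6 rather than an even/odd effect. Converting the statistic to a p-value would need the Welch degrees of freedom and a t-distribution, which are omitted here.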
In a learning dataset using the three-phase relationship of water as an example, the relationships are grouped according to temperature and pressure. If the simulation data already includes density, thermal conductivity, and refractive index, after fitting the aforementioned target factor relationships, the temperature label for the grouping factor can be found. At this point, a pressure label can be manually added to the learning dataset, or to the pressure-related (simulation data) target factor relationships that can be detected and observed, for simulation. After confirming the grouping, further verification can be performed to confirm whether the grouping relationships match the three-phase diagram. In the case of further confirmation and correction of the "target factor relationship distribution map" of the cluster, the "cluster target factor relationship map" can use images, three-dimensional Bayesian tests or various relationship maps and mapping maps in group theory. Although it is called a map here, it can be one-dimensional or multi-dimensional. After considering the relationship between compatible factors that are compatible with the target factor relationship (such as 2.1.2Y=cy), Gaussian transformation, group theory, topology and other methods (distortion, flipping, combination, rotation, exchange, permutation, subgroup, product and quotient, regularization, normalization, folding, adjacency, conjugate, flipping, rotation, homomorphism, modular operation, direct product, coprime) are used alone or in combination to transform the "cluster target factor relationship map", or after various preset transformations, it is fitted, or after Fourier transformation (inverse transformation) and then statistical test fitting is performed to complete the confirmation of the target factor relationship. 
If some data in the simulation dataset or the large model dataset do not match, it is necessary to confirm whether the target factor relationships contain contradictions, errors, unsolvable problems, or missing information (e.g., the three-phase diagram of water was later found to have more than 10 states), or whether the simulation dataset or large model dataset should be removed. Here, similarity or identity can be confirmed directly by testing the fit, or a correlation can be calculated, for example with Bayesian sampling tests, which use randomly sampled points to confirm correlation, or with Bayesian area tests, which shift the transformed mapping graph and observe how the calculated overlapping area changes. The latter also applies to the "factor relationship distribution map" (which may be a cluster distribution map or a frequency distribution map) after Fourier transform or inverse Fourier transform. Different clusters correspond to a set of target factor relationships (density function, thermal conductivity function, etc.). After the target factor relationships are fitted separately, the labels of the different groups of factor relationships are statistically tested to confirm whether they conform to the cluster state.
At this point, Bayesian sampling verification is performed. The sampled simulation data and large model data contain the aforementioned target factor relationship data. Whether this data matches the "target factor relationship diagram" is further confirmed through clustering.
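The idea of shifting a map and watching the overlapping area change can be illustrated as follows. The source does not define the "Bayesian area test" precisely, so this sketch substitutes a plain intersection-over-union score on binary grids; the grid contents, the function names, and the shift range are assumptions made for the example.

```python
def overlap_ratio(ref, obs, dx=0, dy=0):
    """Intersection-over-union of two binary grids, with obs shifted by (dx, dy)."""
    rows, cols = len(ref), len(ref[0])
    inter = union = 0
    for r in range(rows):
        for c in range(cols):
            src_r, src_c = r - dy, c - dx
            inside = 0 <= src_r < rows and 0 <= src_c < cols
            o = obs[src_r][src_c] if inside else 0
            a = ref[r][c]
            inter += 1 if (a and o) else 0
            union += 1 if (a or o) else 0
    return inter / union if union else 1.0


def best_shift(ref, obs, max_shift=2):
    """Shift (dx, dy) within +/- max_shift that maximizes the overlap ratio."""
    span = range(-max_shift, max_shift + 1)
    shifts = [(dx, dy) for dx in span for dy in span]
    return max(shifts, key=lambda s: overlap_ratio(ref, obs, s[0], s[1]))


# Reference map: a 2x2 region. Observed map: the same region one column to the right.
ref = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
obs = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

print(overlap_ratio(ref, obs))  # one third of the combined area overlaps
print(best_shift(ref, obs))     # (-1, 0): moving obs one column left aligns the maps
```

A map that reaches a high overlap after a small shift is a candidate match; one that never overlaps well under any allowed transformation points to a mismatch worth investigating.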
In practical applications of time series relationships, this invention can perform clustering verification according to the following steps: Step 1: Add random factor relationships outside the Fourier variation interval that conform to the target factor relationship to confirm the target factor time series relationship; Step 2: Based on the target time series relationship, cluster the second learning dataset using the target factor or factor relationship; Step 3: Based on statistical clustering verification methods, re-cluster the clustered factors to confirm the new target factor time series relationship; and / or, based on the target factor, sum or mix the results of each cluster to determine the new clustering factor relationship. A time series relationship is a set of more than one time series data, in which there is a clustering target factor relationship. The target factor relationships can be between different time periods, geographical locations, subjects (people, events, time, places, and objects), or other grouping relationships. Alternatively, different groups can be further mixed and segmented to form new target factor relationships for confirmation and analysis (e.g., summing time-series data into 5-minute, hourly, daily, or monthly units, or summing geographical data by district, city, or province; the new target factor relationships for each group are derived from the original group target factor relationships). In practical applications, the grouping relationships here mainly refer to external input data or the correlation of target factor relationships. Changes in target factors or target factor relationships in the time-series data constitute the target factor relationship. First, the target factors or target factor relationships are confirmed, then image comparison or Fourier transform image comparison is performed. (Example 1) The Fourier transform relationship can be a transformation of time-series data, images, or multidimensional data. 
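The regrouping of time-series data mentioned above (summing into 5-minute or hourly units) can be sketched as a simple bucketed sum. The timestamps are assumed to be integer seconds, and the sample counts and the function name are hypothetical.

```python
from collections import defaultdict


def aggregate(samples, bucket_seconds):
    """Sum (timestamp_seconds, value) pairs into fixed-width time buckets.

    Each bucket is keyed by the timestamp at which it starts.
    """
    totals = defaultdict(float)
    for ts, value in samples:
        totals[ts - ts % bucket_seconds] += value
    return dict(sorted(totals.items()))


# Hypothetical vehicle counts keyed by seconds since the start of observation.
counts = [(0, 2), (60, 3), (290, 1), (300, 4), (3599, 5), (3600, 6)]

five_minute = aggregate(counts, 300)
hourly = aggregate(counts, 3600)

print(five_minute)  # {0: 6.0, 300: 4.0, 3300: 5.0, 3600: 6.0}
print(hourly)       # {0: 15.0, 3600: 6.0}
```

Each aggregated series can then be treated as a new group whose target factor relationship is derived from the original one.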
The Fourier transform relationship is obtained through a parameter search over the factor features and topological transformations of the target factor relationship. The Fourier transform or inverse transform can be applied to time-series data, sound, images, or multidimensional data. The data can be divided into groups before being transformed. After the Fourier transform or inverse transform, direct fitting or correlation testing can be performed.
The large model dataset processing method provided in this embodiment of the invention, after obtaining the second learning dataset, randomizes all or part of the learning data in the second learning dataset to obtain a processed second learning dataset; then, the first learning dataset is updated based on the processed second learning dataset. For example, for the specific application of car recognition, car colors can be randomized, or randomized according to the proportions in the license plate issuance data for that area. Another example is traffic flow: after Fourier transform, the simulated data is compared with the Fourier transform of the on-site data. Traffic flow outside the simulation interval can be added as random content, which changes the simulation data and improves the recognizability of the simulation interval.
In the actual large-scale model training process, this embodiment of the invention can also use common factor relationships to screen convolutional layers. Specifically, this can be done through the following steps: Step 1: Determine the common factor relationships among multiple target factor relationships; Step 2: Screen each convolutional layer included in the large-scale model algorithm based on the common factor relationships, and train the large-scale model based on the screened convolutional layers.
Specifically, for each convolutional layer included in the large-scale model algorithm, extreme values are determined, or point values are obtained through Fourier transformation, based on a statistical test of the relationship between convolutional layers of the same type and the target factor; screening is then performed based on the determined extreme values or the obtained point values. For external learning data to be corrected, it is also possible to screen convolutional layers or transform the data based on the same statistical test. Here, the statistical test of the relationship between convolutional layers of the same type and the target factor can calculate extreme values, or prioritize the convolutional layers that stand out after Fourier transformation; or further, the common factors of several target factor relationships can be used to confirm the suitable convolutional layers for each target factor relationship. In a specific embodiment, the convolution of linear data can be divided into [1][2][3][4][5][6][7][8][9][10][11][12] ... A simple Fourier transform can be performed on [1][2][3][4][5][6][7][8][9][10][11][12] .... The components with high proportions after the transform (such as frequencies 3 and 10) are prioritized, and a [1]+[2]+[3] convolution and a [1]+[2]+[3]+....+[10] convolution are performed to see whether there is a neural fit better suited to that convolution. If there are two sets of target factor relationships with shared factors (i.e., they have common factor relationships) and they may call for different convolutional layers, the convolutional layer that can fit the common factor relationship can be selected first.
In practical applications, the large model dataset processing method provided in this embodiment of the invention can also perform correction through the following steps: Step 1: Obtain correction indices for dataset correction; Step 2: Add the correction indices to the second learning dataset so that correction against the correction indices is achieved during the large model training process. Here, correction indices (such as local landmarks and focal-length movement image data) can be added to the simulated data, allowing the large model to learn and detect the color temperature, focus correction, or convolution of the data.
To facilitate understanding of the large model dataset processing method provided in this embodiment of the invention, the following detailed explanation is provided in conjunction with some specific embodiments.
Embodiment 1: The target factor relationship is the simulated frequency function of vehicles passing through a certain intersection. The target factor relationship is the vehicle frequency map (matrix) after Fourier transform. The learning data is a simulated intersection time series image considering traffic lights, weather, time, and date. The learning data may include third-party data, and the image may contain specific landmarks at that location. After collecting local data and performing Fourier transform, other frequency functions outside the simulated frequency function (interval) can be included to correct the simulated data.
The colors of passing vehicles are randomly distributed, essentially randomizing unrelated factors.
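The frequency prioritization described above for linear data split into elements [1], [2], [3], ... can be sketched with a direct discrete Fourier transform followed by summing windows. This is an illustrative sketch only: the test signal is synthetic, the function names are hypothetical, and using the dominant component indices (3 and 10) directly as window widths follows the source's example rather than any stated rule.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitudes of the discrete Fourier transform for bins 0..n//2, scaled by 1/n."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) / n
        for k in range(n // 2 + 1)
    ]


def dominant_components(x, top=2):
    """Indices of the strongest non-constant frequency bins, in ascending order."""
    mags = dft_magnitudes(x)
    ranked = sorted(range(1, len(mags)), key=lambda k: mags[k], reverse=True)
    return sorted(ranked[:top])


def window_sums(x, width):
    """Sliding sums x[i] + ... + x[i + width - 1] (convolution with a kernel of ones)."""
    return [sum(x[i:i + width]) for i in range(len(x) - width + 1)]


# Synthetic signal with components at 3 and 10 cycles per 60 samples.
n = 60
signal = [
    math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
    for t in range(n)
]

widths = dominant_components(signal)
print(widths)  # [3, 10]

# Candidate summing windows to try first, one per dominant component.
candidates = {w: window_sums(signal, w) for w in widths}
```

Trying the candidate windows in order of spectral weight limits the search to a few convolutions instead of every possible width.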
The target factor function is the simulated car frequency function, and the target factor relationship function is a vehicle frequency map approximated by a Fourier transform and fitted with Bayesian statistics (matrix), followed by accelerated convolution using Fourier transform and other methods. The target factor relationship could be, for example, vehicle speed and its upper limit (pointing to a functional relationship), vehicle passing frequency (pointing to a clustering factor relationship), traffic light duration (pointing to a time series relationship), vehicle type, license plate recognition factor relationships, etc., used to create the simulation data. During the creation process, random factor frequencies outside the target factor relationship range can be added using vehicle speed or frequency (since the two are related, adding one is sufficient). Vehicle types, colors, etc., which do not affect safety, are randomly selected from the overall data or conform to local sampling. Data from local cameras and landmarks are also added to the training data. Here, the algorithm for the target factor relationship is written, and the clustering method is confirmed, etc., and added to the vehicle number time series data after simulation as needed. This can be achieved through landmark-based accelerated convolution or correction, or by using accelerated convolution selection. After the large model is calibrated, the intersection camera data can be imported. The target factor relationships fitted by the large model and the Fourier transform data of the time series after clustering are then subjected to Bayesian testing for real-time fitting to check for anomalies. Example 2: Robot Walking Learning Data Screening. Walking simulation data (simulated data of joint sensor positions), images and joint position scans, data from limb position sensors, tension sensors, etc. (large model dataset). 
Calibration is performed based on the target factor relationships of the simulation data. For example, this simulation data inputs the swing frequencies of individual feet and hands, and the computer simulates walking under the influence of gravity on different inclined planes. The presence of gravity relationships is confirmed; Fourier transforms are performed on the joints relative to the center of gravity to determine if target factor relationships exist, and data screening is performed. In practical applications, the joint control factor relationships of the simulation data are used, for example, with the center of gravity as the origin, and the translation lines relative to the orthocenter, shoulder line, and line of sight perpendicular to the three axes of a three-dimensional coordinate system; the distribution maps of other limbs and joints are drawn separately based on the four quadrants or positions of the feet off the ground, and data screening is performed again using these maps. At this point, convolutional layer selection can be performed on the position sensor and tension sensor data of the filtered data. 
This can be used for daily pre-use calibration, or if a specific sensor or a similar sensor is replaced, iterative learning can be performed on the filtered data to determine whether target factor relationships appear in the unaffected parts of other data, and then convolutional layer selection can be performed on the data affected by the sensor replacement.It is known that the large model dataset processing method provided by the embodiments of the present invention mainly has the following technical features: (1) By simulating the learning dataset with target factor relationships, the algorithm module written with target factor relationships has high specificity and exclusivity, and can confirm whether target factor relationships, "target factor relationship graph" relationships, or compatibility relationships with the aforementioned relationships exist, thus completing the large model correction (automatic or semi-automatic addition of algorithms or programs); it can also use this large model to label big data, and provide automated big data or large model dataset screening with the same target factor relationship confirmation method. Among them, the correction is performed with target factor relationships and factor values, model anchoring and large model dataset screening are also performed. In addition, three types of target factor relationship correction methods are provided, which are used as basic units. Target factor relationships can be decomposed, combined, and transformed, providing operable methods for multiple, multi-layered, and multi-type target factor relationships. (2) The embodiments of the present invention also include conversion methods for various compatible factor relationships of target factor relationships, algorithm module confirmation methods, and operable conversion confirmation methods for factors or factor relationships that share factors. 
This method allows for external correction or embedding of large models, and can incorporate large models after further iterations to leverage the advantages of large models in screening big data (e.g., simulated datasets are text-based, but can also screen graphic data). (3) It provides a systematic simulation data optimization scheme (how and when to add random factors), and provides convolution correction and accelerated fitting methods for hardware and software, and target factor relationships. It also provides a matching accelerated convolution layer selection method. Therefore, the training results based on simulated data can achieve model correction and better learning effects. When using it for big data screening, it can incorporate non-target factor factors and exclude hardware, data format, and other issues, thus achieving better big data data analysis results. Compared with related technologies, this invention provides a dataset processing scheme for screening and confirming big data data using large model algorithms. In practical applications, semi-automatic or fully automatic methods can be used to add data verification and transformation methods to the large model (or Agent model) to iteratively fit the simulated data. After that, the target factor relationship is used to confirm whether the large model correction has been completed, and this method can be used to further screen big data data or specified datasets. For example, it can be used to find data that matches or does not match specific target factors; it can also be used for real-time data monitoring or to confirm whether the data exceeds the scope of the Agent's existing data.This invention uses target factor relationships, factor values, and corresponding simulated data for correction. 
Compared to methods that rely solely on factor values, data filtering, and relearning, this approach offers better anchoring effects for large-scale model correction, reduces iterative learning time, and achieves better and more accurate large-scale data filtering. Furthermore, it provides a systematic simulated data optimization scheme, as well as convolutional correction methods for both hardware and software, and similar convolutional layer selection methods for target factor relationship correction, accelerating the fitting of target factor relationships. Using target factor relationships for large-scale data filtering allows for more precise filtering, such as using scientific functions and corresponding simulated data for correction, followed by further deduction through the large model to automatically search for and confirm the large-scale data to be analyzed. Finally, it can also be applied to other large-scale model needs. For example, it can be used for filtering data that is input (or re-input) into large-scale models; when the large-scale model agent at the terminal lacks sufficient hardware, software, and data, using this method to filter out user questions that exceed the built-in data range can prevent hallucinations or erroneous answers, thus making the agent more widely applicable. In the description of this specification, references to terms such as "some possible implementations," "some implementations," "example," "specific example," or "some examples" indicate that a specific feature, structure, material, or characteristic described in connection with that implementation or example is included in at least one implementation or example of the invention, and the aforementioned terms do not necessarily refer to the same implementation or example. Furthermore, the described specific features, structures, materials, or characteristics can be combined in any suitable manner in one or more implementations or examples.
Moreover, those skilled in the art can combine and integrate the different implementations or examples described in this specification and the features of different implementations or examples without contradiction. Regarding the method flowcharts of embodiments of the present invention, certain operations are described as different steps performed in a certain order. Such flowcharts are illustrative and not restrictive. Certain steps described herein may be grouped together and performed in a single operation, or certain steps may be divided into multiple sub-steps, and certain steps may be performed in an order different from that shown herein. The various steps shown in the flowcharts can be implemented in any way by any circuit structure and / or tangible mechanism (e.g., software running on a computer device, hardware (e.g., logic functions implemented by a processor or chip), and / or any combination thereof). Those skilled in the art will understand that the order in which the steps are written in the methods described in the above specific embodiments does not imply a strict execution order. The specific execution order of each step should be determined by its function and possible internal logic. Based on the same inventive concept, the embodiments of the present invention also provide a large model dataset processing device corresponding to the large model dataset processing method. Since the principle of the device in the embodiments of the present invention for solving the problem is similar to the large model dataset processing method described above in the embodiments of the present invention, the implementation of the device can refer to the implementation of the method, and repeated details will not be described again.Referring to Figure 2, a schematic diagram of a large model dataset processing device provided in an embodiment of the present invention is shown. 
The device includes: an acquisition module 201, an adding module 202, an annotation module 203, and a processing module 204. The acquisition module 201 is used to acquire a pre-constructed large model algorithm and a first learning dataset required for training the large model, as well as a second learning dataset with pre-simulated target factor relationships and a simulation algorithm required for simulating the second learning dataset. The adding module 202 is used to add the simulation algorithm to the large model algorithm to obtain an updated large model algorithm, and to add the second learning dataset to the first learning dataset to obtain an updated first learning dataset. The annotation module 203 is used to annotate the factor relationships of each learning dataset in the updated first learning dataset based on the updated large model algorithm, and to determine the annotation results. The processing module 204 is used to determine a processing strategy for the updated first learning dataset and / or a modification strategy for the updated large model algorithm in response to whether a target factor relationship exists, based on the annotation results indicating whether such a relationship exists. Using the aforementioned large model dataset processing device, given a pre-constructed large model algorithm, a first learning dataset required for large model training, a second learning dataset with pre-simulated target factor relationships, and a simulation algorithm required for simulating the second learning dataset, the large model algorithm and learning dataset can be updated. Then, based on the updated large model algorithm, factor relationships are labeled for each learning data point in the updated first learning dataset. The labeling results determine the processing strategy for the updated first learning dataset and / or the modification strategy for the updated large model algorithm. 
The first learning dataset obtained through this processing strategy is a learning dataset corrected for target factor relationships, achieving a more accurate large data filtering effect. At the same time, modifications to the large model algorithm can achieve better large model correction and anchoring effects, significantly reducing the iteration time and computing power required for subsequent large model training, and reducing the human resource costs required for data filtering.
Optionally, the processing module 204 is specifically configured to determine a processing strategy for the updated first learning dataset in one of the following ways: if the annotation results indicate the presence of a target factor relationship, the updated first learning dataset is used as input data for training a large model; if the annotation results indicate the absence of a target factor relationship, but a compatible factor relationship associated with the target factor relationship exists, a new second learning dataset is added to the updated first learning dataset until the annotation results show a target factor relationship.
Optionally, the processing module 204 is further configured to: acquire a third learning dataset to be analyzed; input the third learning dataset into the final large model algorithm to determine whether a target factor relationship exists, where the final large model algorithm is obtained by iteratively training a large model using at least one of the processed learning data obtained by the processing strategy for the updated first learning dataset and the modified large model algorithm obtained by the modification strategy for the updated large model algorithm; and filter the third learning dataset based on the judgment result.
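The strategy selection and the filtering of a third dataset described above can be sketched as follows. This is a minimal sketch under stated assumptions: the enum, the function names, and the example relationship y = 2x with a tolerance of 0.2 are hypothetical, and the fallback for the case where neither a target nor a compatible relationship is found is not specified in this passage, so it is marked for review.

```python
from enum import Enum


class Strategy(Enum):
    USE_FOR_TRAINING = "use the updated first dataset as training input"
    ADD_SIMULATED_DATA = "add a new second dataset and annotate again"
    REVIEW = "case not covered by this passage; flag for review"


def choose_strategy(has_target, has_compatible):
    """Map annotation results to a processing strategy for the updated dataset."""
    if has_target:
        return Strategy.USE_FOR_TRAINING
    if has_compatible:
        return Strategy.ADD_SIMULATED_DATA
    return Strategy.REVIEW  # assumption: the source states no rule for this case


def filter_dataset(records, has_target_relationship):
    """Keep only the records judged to exhibit the target factor relationship."""
    return [record for record in records if has_target_relationship(record)]


def fits_double(record, tolerance=0.2):
    """Hypothetical stand-in for the trained model's judgment: y is close to 2 * x."""
    return abs(record["y"] - 2 * record["x"]) <= tolerance


third_dataset = [{"x": 1, "y": 2.0}, {"x": 2, "y": 4.1}, {"x": 3, "y": 9.0}]

print(choose_strategy(False, True).name)  # ADD_SIMULATED_DATA
print(filter_dataset(third_dataset, fits_double))
```

In the described device the judgment would come from the final large model algorithm; the predicate here only stands in for that step.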
Optionally, the target factor relationship includes one or more of the following relationships: a functional relationship between the independent variable and the dependent variable; a clustering factor relationship with at least one clustering factor; a time series relationship or frequency relationship between different learning datasets; and a time series relationship or frequency relationship between clustering factors within the learning dataset. Optionally, when the target factor relationship is a clustering factor relationship, the adding module 202 is further configured to: perform a clustering test based on a statistical clustering test method to determine whether the target factor relationship conforms to the specified clustering state; if the clustering state is confirmed based on the statistical clustering test method, confirm the clustering distribution map or geometric distribution pattern of the target factor relationship against the second learning dataset, and then filter again; or, through analysis and searching of the second learning dataset and the first learning dataset, determine that the distribution of all data conforms to the clustering distribution map or geometric distribution pattern of the target factor relationship. Optionally, when the target factor relationship is a time series relationship, the processing module 204 is further configured to: add random factor relationships outside the Fourier variation interval that conform to the target factor relationship, and confirm the target factor time series relationship; cluster the second learning dataset based on the target time series relationship using the target factor or factor relationship; re-cluster the clustered factors based on the statistical clustering test method to confirm the new target factor time series relationship; and / or, based on the target factor, sum or mix the results of each cluster to determine the new clustering factor relationship.
Optionally, after obtaining the second learning dataset, the processing module 204 is further configured to: randomize all or part of the learning data in the second learning dataset to obtain a processed second learning dataset; and update the first learning dataset based on the processed second learning dataset. Optionally, when there are multiple target factor relationships, the processing module 204 is further configured to: determine the common factor relationships among the multiple target factor relationships; screen the convolutional layers included in the large model algorithm based on the common factor relationships, and train the large model based on the screened convolutional layers.
Optionally, the processing module 204 is configured to screen the convolutional layers included in the large model algorithm based on the common factor relationships according to the following steps: for each convolutional layer included in the large model algorithm, determine the extreme values or obtain the point values after Fourier transformation based on the statistical test of the relationship between convolutional layers of the same type and the target factor; screen by the determined extreme values or the obtained point values; and / or, for external learning data to be corrected, perform convolutional layer screening or data transformation based on the statistical test of the relationship between convolutional layers of the same type and the target factor. Optionally, the processing module 204 is further configured to: obtain correction indices for dataset correction; add the correction indices to the second learning dataset so that correction against the correction indices is achieved during the large model training process. It should be noted that the device in the embodiment of the present invention can implement the various processes of the aforementioned method embodiments and achieve the same effects and functions, which will not be repeated here.
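The randomization of the second learning dataset mentioned above can be sketched as a weighted redraw of one unrelated factor, such as vehicle color. The record layout, the color list, and the proportions are assumptions made for illustration; in practice the weights would come from a source such as local license plate issuance data.

```python
import random


def randomize_factor(records, factor, choices, weights=None, seed=None):
    """Return copies of the records with one unrelated factor redrawn at random.

    With weights=None every choice is equally likely; otherwise values are drawn
    in proportion to the given weights. The input records are not modified.
    """
    rng = random.Random(seed)
    values = rng.choices(choices, weights=weights, k=len(records))
    return [{**record, factor: value} for record, value in zip(records, values)]


# Hypothetical simulated vehicle records; speed belongs to the target relationship.
simulated = [{"speed": 40 + i} for i in range(5)]

colors = ["white", "black", "red"]
local_proportions = [0.5, 0.3, 0.2]  # assumed local color distribution

processed = randomize_factor(simulated, "color", colors, local_proportions, seed=0)
for record in processed:
    print(record)
```

Only the unrelated factor is redrawn, so the target factor relationship carried by the remaining fields is left intact.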
The embodiment of the present invention also provides an electronic device, as shown in FIG3, which is a schematic diagram of the structure of the electronic device provided in the embodiment of the present invention, including: processor 301, memory 302, and bus 303. The memory 302 stores machine-readable instructions executable by the processor 301 (e.g., execution instructions corresponding to the acquisition module 201, addition module 202, annotation module 203, and processing module 204 in the device of Figure 2). When the electronic device is running, the processor 301 communicates with the memory 302 via the bus 303. When the machine-readable instructions are executed by the processor 301, the following processes are performed: Acquire a pre-built large model algorithm and a first learning data set required for training the large model, as well as a second learning data set pre-simulated with target factor relationships and a simulation algorithm required for simulating the second learning data set; Add the simulation algorithm to the large model algorithm to obtain an updated large model algorithm, and add the second learning data set to the first learning data set to obtain an updated first learning data set; Based on the updated large model algorithm, perform factor relationship annotation on each learning data in the updated first learning data set to determine the annotation results; In response to the annotation results indicating whether a target factor relationship exists, determine a processing strategy for the updated first learning data set and / or a modification strategy for the updated large model algorithm. This invention also provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program performs the steps of the large model dataset processing method described in the above-described method embodiments. 
The storage medium can be a volatile or non-volatile computer-readable storage medium. This invention also provides a computer program product carrying program code. The instructions included in the program code can be used to execute the steps of the large model dataset processing method described in the above-described method embodiments, as detailed in the above-described method embodiments, which will not be repeated here. The above-described computer program product can be implemented using hardware, software, or a combination thereof.In one optional embodiment, the computer program product is specifically embodied in a computer storage medium; in another optional embodiment, the computer program product is specifically embodied in a software product, such as a software development kit (SDK), etc. The various embodiments in this invention are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on describing the differences from other embodiments. In particular, for the embodiments of apparatus, devices, and computer-readable storage media, since they are basically similar to the method embodiments, their descriptions have been simplified, and relevant parts can be referred to the descriptions of the method embodiments. The apparatus, devices, and computer-readable storage media provided by the embodiments of this invention correspond one-to-one with the method; therefore, the apparatus, devices, and computer-readable storage media also have similar beneficial technical effects to their corresponding methods. Since the beneficial technical effects of the method have been described in detail above, the beneficial technical effects of the apparatus, devices, and computer-readable storage media will not be repeated here. 
Those skilled in the art should understand that the embodiments of this invention can be implemented as methods and apparatus (devices or systems), or computer-readable storage media. Therefore, this invention can be implemented in a completely hardware manner, a completely software manner, or a combination of software and hardware. Furthermore, the present invention can take the form of a computer-readable storage medium implemented on one or more computer-readable storage media (including, but not limited to, disk storage, read-only optical disc storage (CD-ROM), optical storage, etc.) containing computer-readable program code.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices or systems), and computer-readable storage media according to embodiments of the invention. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more blocks of the flowchart illustrations and/or block diagrams.
These computer program instructions can also be stored in a computer-readable storage medium capable of directing a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce a product including instruction means, wherein the instruction means implement the functions specified in one or more blocks of the flowchart illustrations and/or block diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, thereby providing steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media. Computer-readable media include both persistent and non-persistent, removable and non-removable media, which can be used to store information by any method or technology. Information may be computer-readable instructions, data structures, modules of a program, or other data.
Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory, read-only memory, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Furthermore, although the operations of the method of the invention are described in a specific order in the accompanying drawings, this does not require or imply that these operations must be performed in that specific order, or that all of the operations shown must be performed to achieve the desired result. Additionally, certain steps may be omitted, multiple steps may be combined into one step, and/or a step may be broken down into multiple sub-steps.
While the spirit and principles of the invention have been described above with reference to several specific embodiments, it should be understood that the invention is not limited to the specific embodiments disclosed, and the division into aspects does not imply that features in these aspects cannot be combined. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims
1. A method for processing large model datasets, comprising: obtaining a pre-built large model algorithm and a first learning dataset required for training the large model, as well as a second learning dataset pre-simulated with target factor relationships and a simulation algorithm required to simulate the second learning dataset; adding the simulation algorithm to the large model algorithm to obtain an updated large model algorithm, and adding the second learning dataset to the first learning dataset to obtain an updated first learning dataset; performing, based on the updated large model algorithm, factor relationship annotation on each learning data item in the updated first learning dataset to determine annotation results; and in response to whether the annotation results indicate that the target factor relationship appears, determining a processing strategy for the updated first learning dataset and/or a modification strategy for the updated large model algorithm.
2. The method according to claim 1, wherein the processing strategy for the updated first learning dataset is determined in one of the following ways: if the annotation results indicate that the target factor relationship appears, using the updated first learning dataset as input data for training the large model; or if the annotation results indicate that the target factor relationship does not appear but a compatible factor relationship associated with the target factor relationship appears, adding a new second learning dataset to the updated first learning dataset until the annotation results indicate that the target factor relationship appears.
3. The method according to claim 1 or 2, further comprising: obtaining a third learning dataset to be analyzed; inputting the third learning dataset into a final large model algorithm to determine whether the target factor relationship exists, wherein the final large model algorithm is obtained by iteratively training the large model using at least one of a processed learning dataset obtained by the processing strategy for the updated first learning dataset and a modified large model algorithm obtained by the modification strategy for the updated large model algorithm; and filtering the third learning dataset based on the determination result.
4. The method according to any one of claims 1 to 3, wherein the target factor relationship includes one or more of the following: a functional relationship between an independent variable and a dependent variable; a clustering factor relationship involving at least one clustering factor; a time series or frequency relationship between different learning data; and a time series or frequency relationship between clustering factors within the learning data.
5. The method according to claim 4, wherein, when the target factor relationship is the clustering factor relationship, the method further comprises: performing a clustering test based on a statistical clustering test method to determine whether the target factor relationship conforms to a specified clustering state; and if the statistical clustering test determines that the target factor relationship conforms to the specified clustering state, comparing the clustering distribution map or geometric distribution shape of the target factor relationship with the second learning dataset and then performing further screening; or analyzing and searching the second learning dataset and the first learning dataset to determine the clustering distribution or geometric distribution shape of all data that conform to the target factor relationship.
6. The method according to claim 4 or 5, wherein, when the target factor relationship is the time series relationship, the method further comprises: adding random factor relationships outside the Fourier variation range that conform to the target factor relationship, to confirm the time series relationship of the target factor; grouping the second learning dataset according to the target time series relationship based on the target factor or factor relationship; and re-clustering the clustered factors based on the statistical clustering test method to confirm a new target factor time series relationship, and/or summing or mixing the results of each clustering based on the target factor to determine a new clustering factor relationship.
7. The method according to any one of claims 1 to 6, wherein, after obtaining the second learning dataset, the method further comprises: randomizing all or part of the learning data in the second learning dataset to obtain a processed second learning dataset; and updating the first learning dataset based on the processed second learning dataset.
8. The method according to any one of claims 1 to 6, wherein, when there are multiple target factor relationships, the method further comprises: determining a common factor relationship shared by the multiple target factor relationships; and filtering the convolutional layers included in the large model algorithm based on the common factor relationship, and training the large model based on the filtered convolutional layers.
9. The method according to claim 8, wherein filtering the convolutional layers included in the large model algorithm based on the common factor relationship comprises: for each convolutional layer included in the large model algorithm, determining extreme values, or obtaining point values through a Fourier transform, based on a statistical test of the relationship between convolutional layers of the same type and the target factor, and filtering using the determined extreme values or the obtained point values; or, for external learning data to be corrected, performing convolutional layer filtering or data transformation based on the statistical test of the relationship between convolutional layers of the same type and the target factor.
10. The method according to any one of claims 1 to 6, further comprising: obtaining a calibration index used for dataset calibration; and adding the calibration index to the second learning dataset, so as to calibrate the calibration index during training of the large model.
11. A large model dataset processing apparatus, comprising: an acquisition module configured to acquire a pre-built large model algorithm and a first learning dataset required for training the large model, as well as a second learning dataset pre-simulated with target factor relationships and a simulation algorithm required to simulate the second learning dataset; an addition module configured to add the simulation algorithm to the large model algorithm to obtain an updated large model algorithm, and to add the second learning dataset to the first learning dataset to obtain an updated first learning dataset; an annotation module configured to perform, based on the updated large model algorithm, factor relationship annotation on each learning data item in the updated first learning dataset to determine annotation results; and a processing module configured to determine, in response to whether the annotation results indicate that the target factor relationship appears, a processing strategy for the updated first learning dataset and/or a modification strategy for the updated large model algorithm.
12. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory via the bus; and when the machine-readable instructions are executed by the processor, the large model dataset processing method according to any one of claims 1 to 10 is performed.
13. A computer-readable storage medium storing a computer program that, when executed by a processor, performs the large model dataset processing method as described in any one of claims 1 to 10.
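For illustration only, the processing strategy recited in claim 2 might be sketched as the loop below. This is a sketch under stated assumptions, not the claimed implementation: the target factor relationship is assumed to be y = 2x, a "compatible factor relationship" is assumed to mean that samples follow the same line within a looser tolerance, and the function names, thresholds, and the `max_rounds` guard are all hypothetical additions that do not appear in the claims.

```python
from typing import Callable, List, Tuple

# Hypothetical record shape: (independent variable, dependent variable).
Sample = Tuple[float, float]


def share_within(data: List[Sample], slope: float, tol: float) -> float:
    """Fraction of samples whose y lies within tol of slope * x."""
    if not data:
        return 0.0
    return sum(abs(y - slope * x) <= tol for x, y in data) / len(data)


def simulate_batch(n: int, slope: float = 2.0) -> List[Sample]:
    """Assumed simulation algorithm: a new second learning dataset obeying y = slope * x."""
    return [(float(x), slope * x) for x in range(1, n + 1)]


def apply_processing_strategy(
    dataset: List[Sample],
    simulate: Callable[[int, float], List[Sample]] = simulate_batch,
    slope: float = 2.0,
    batch_size: int = 3,
    target_share: float = 0.5,      # assumed: share of exact matches that counts as "appears"
    compatible_tol: float = 1.0,    # assumed: looser tolerance defining "compatible"
    compatible_share: float = 0.5,
    max_rounds: int = 10,           # safety guard, not part of the claim
) -> Tuple[List[Sample], str, int]:
    data = list(dataset)
    rounds = 0
    while True:
        # Target relationship appears: use the dataset as training input.
        if share_within(data, slope, 1e-6) >= target_share:
            return data, "use_for_training", rounds
        # Neither the target nor a compatible relationship appears:
        # claim 2 does not cover this case, so no strategy is chosen here.
        if share_within(data, slope, compatible_tol) < compatible_share:
            return data, "undetermined", rounds
        if rounds == max_rounds:
            return data, "round_limit_reached", rounds
        # Compatible relationship only: add a new simulated dataset and re-check.
        data.extend(simulate(batch_size, slope))
        rounds += 1
```

With four noisy samples near y = 2x and batches of three simulated samples, this sketch needs two rounds before at least half of the data matches the target relationship exactly.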