A software testing data generation system based on big data analysis
By combining big data analysis with an improved TabDDPM coverage-guided diffusion model, the invention addresses the low efficiency and insufficient coverage of existing software test data generation. The resulting system generates and continuously optimizes test data efficiently, and can effectively cover low-coverage paths and high-risk defect scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING TIANYI HONGTU TECHNOLOGY CO LTD
- Filing Date
- 2026-05-14
- Publication Date
- 2026-06-19
AI Technical Summary
Existing software test data generation methods are inefficient and lack sufficient coverage when faced with a large number of interface parameter combinations, complex field dependencies, boundary input combinations, and abnormal path coverage requirements. They also lack a continuous feedback and update mechanism for changes in code coverage, abnormal triggering results, and defect triggering results.
A software test data generation system based on big data analytics is adopted. The test big data analysis module cleans, unifies, and aligns interface call data, test case data, code coverage data, and defect record data to build a hybrid test feature table. The improved TabDDPM coverage guided diffusion model is used to generate candidate test data. Combined with test rule verification and feedback updates, a closed-loop optimization process is formed.
It improves the efficiency and coverage of test data generation, effectively constructs abnormal scenarios, and continuously adjusts the test data generation model according to software version changes and defect triggering, thereby enhancing the coverage of critical code paths and abnormal scenarios.
Figure CN122240508A_ABST
Description
Technical Field
[0001] This invention relates to the field of software testing and test data generation technology, and specifically to a software test data generation system based on big data analysis.
Background Technology
[0002] As software systems continue to expand in scale, microservice architecture, distributed interfaces, complex business processes, and multiple version iterations are widely used in software development. This places higher demands on the quantity, coverage, and ability to construct abnormal scenarios for software testing. Existing software test data typically relies on manual writing, rule template generation, or simple variations based on historical test cases. While this can meet some routine interface testing and regression testing needs, it remains insufficient in terms of test data generation efficiency and coverage sufficiency when faced with a large number of interface parameter combinations, complex field dependencies, boundary input combinations, and abnormal path coverage requirements.
[0003] Existing software testing methods based on big data analytics can typically perform statistical analysis on historical test cases, API call logs, code coverage data, and defect records. However, most methods only use the analysis results as a basis for test planning or test case selection, lacking a mechanism to directly incorporate coverage gaps, execution paths, and defect triggering conditions into the test data generation process. While some automated test data generation methods can generate large amounts of input data based on rules or random strategies, the generated results are prone to problems such as parameter type mismatches, incorrect field dependencies, invalid business state transitions, and incomplete API call sequences. This results in the generated data being unable to be effectively executed, or, although it can be executed, it is difficult to cover low-coverage paths and high-risk defect scenarios.
[0004] Furthermore, existing generative test data methods typically focus more on fitting data distributions and lack a continuous feedback and update mechanism for changes in code coverage, anomaly triggering results, and defect triggering results. The coverage improvement information and defect triggering information generated after test execution are not promptly fed back into the generative model, making it difficult for the model to dynamically adjust according to changes in the software version under test and historical defect distributions.
[0005] Therefore, how to provide a software test data generation system based on big data analysis is a problem that urgently needs to be solved by those skilled in the art.
Summary of the Invention
[0006] One objective of this invention is to propose a software test data generation system based on big data analysis. This invention fully utilizes big data analysis of historical software tests, hybrid test feature construction, and an improved TabDDPM coverage-guided diffusion model. It details the process of extracting test features from interface call data, test case data, code coverage data, defect record data, and regression test data, and combining them with coverage gaps, execution paths, and defect triggering conditions to generate candidate software test data. Then, through test rule verification and execution feedback updates, executable test data and the generation model are formed through a closed-loop optimization process. This system has the advantages of high test data generation efficiency, strong low-coverage path coverage capability, sufficient abnormal scenario construction, and timely test feedback updates.
[0007] A software test data generation system based on big data analysis according to an embodiment of the present invention includes:
[0008] The test big data analysis module collects interface call data, test case data, code coverage data, defect record data, and regression test data of the software under test. It then performs cleaning, unification, and alignment processing on the collected data to generate standardized test data.
[0009] The test feature construction module extracts interface parameters, boundary conditions, abnormal inputs, execution paths, coverage gaps, defect triggers, and test result features based on standardized test data, and generates a hybrid test feature table;
[0010] The diffusion condition construction module performs condition encoding on the coverage gap features, execution path features, and defect triggering features in the hybrid test feature table to generate coverage gap conditions, path constraint conditions, and defect triggering conditions.
[0011] The candidate data generation module inputs the hybrid test feature table, coverage gap conditions, path constraint conditions, and defect triggering conditions into the improved TabDDPM coverage-guided diffusion model. After noise mapping, condition fusion, and reverse denoising generation, candidate software test data is obtained.
[0012] The test rule verification module verifies the parameter types, value ranges, field dependencies, call order, and business status of candidate software test data, and generates executable software test data.
[0013] The test execution module inputs executable software test data into the software under test to perform tests, collects test execution results, coverage change results, anomaly trigger results, and defect trigger results, and generates test execution feedback data.
[0014] The feedback update module updates the hybrid test feature table based on test execution feedback data and updates the parameters of the improved TabDDPM coverage guided diffusion model to generate the updated software test data generation model.
[0015] Optionally, the test big data analysis module specifically includes:
[0016] Collect interface call data, test case data, code coverage data, defect record data, and regression test data generated during the historical testing process of the software under test to form a historical test data set;
[0017] Perform field cleaning on the historical test data set, remove or correct records with missing metadata, duplicate records and records with inconsistent field types, and generate field-cleaned test data.
[0018] Perform format unification on the field-cleaned test data, converting the interface name, parameter name, return status, path number, defect number, and test result fields into a unified field format to generate uniformly formatted test data.
[0019] Time alignment processing is performed on the uniformly formatted test data to map the interface call time, test case execution time, coverage collection time, defect submission time and regression test time to a unified test batch, generating time-aligned test data.
[0020] Duplicate records are removed from the time-aligned test data to generate standardized test data.
[0021] Optionally, the test feature construction module specifically includes:
[0022] Acquire standardized test data and build a test sample index according to interface identifier, test case identifier, version identifier and test batch identifier, and generate a test sample index set;
[0023] Based on the test sample index set, extract interface parameter information, boundary value information, abnormal input information, execution path information, coverage gap information, defect triggering information and test result information, and generate interface parameter features, boundary condition features, abnormal input features, execution path features, coverage gap features, defect triggering features and test result features respectively;
[0024] The interface parameter features, boundary condition features, abnormal input features, execution path features, coverage gap features, defect triggering features, and test result features are processed according to the test sample index set to perform row and column organization and field alignment to generate a hybrid test feature table.
[0025] Optionally, the diffusion condition construction module specifically includes:
[0026] Obtain the hybrid test feature table and read the coverage gap feature, execution path feature, and defect trigger feature from the hybrid test feature table;
[0027] Based on the coverage gap feature, information on uncovered code units, low-coverage paths, and coverage differences are extracted to generate coverage gap condition data;
[0028] Based on the execution path feature, the interface call order, function call chain, branch jump relationship and exception return path are extracted to generate path constraint condition data;
[0029] Based on defect triggering features, defect type, triggering parameter combination, failure return code and repair status are extracted to generate defect triggering condition data;
[0030] Numerical normalization, category encoding, and vector concatenation are performed on the coverage gap condition data, path constraint condition data, and defect triggering condition data, respectively, to generate coverage gap condition vector, path constraint condition vector, and defect triggering condition vector.
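The encoding step above (numerical normalization, category encoding, vector concatenation) can be illustrated with a minimal sketch. It assumes min-max scaling and one-hot encoding, which the description does not specify; the function names, value ranges, and the gap-level vocabulary are hypothetical.

```python
from typing import List, Sequence, Tuple


def min_max(value: float, lo: float, hi: float) -> float:
    """Scale a numeric value into [0, 1]; a degenerate range maps to 0.0."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def one_hot(category: str, vocabulary: Sequence[str]) -> List[float]:
    """One-hot encode a category; unknown categories yield all zeros."""
    return [1.0 if category == item else 0.0 for item in vocabulary]


def encode_condition(
    numeric: Sequence[Tuple[float, float, float]],
    categorical: Sequence[Tuple[str, Sequence[str]]],
) -> List[float]:
    """Concatenate normalized numeric parts and one-hot categorical parts."""
    vector: List[float] = []
    for value, lo, hi in numeric:
        vector.append(min_max(value, lo, hi))
    for category, vocabulary in categorical:
        vector.extend(one_hot(category, vocabulary))
    return vector


# Hypothetical coverage gap record: 30 uncovered units on a 0..120 scale,
# a coverage difference of 0.5, and a gap located at branch level.
coverage_gap_vector = encode_condition(
    numeric=[(30.0, 0.0, 120.0), (0.5, 0.0, 1.0)],
    categorical=[("branch", ("function", "branch", "path"))],
)
print(coverage_gap_vector)  # [0.25, 0.5, 0.0, 1.0, 0.0]
```

The same helper could be applied to the path constraint and defect triggering condition data, each with its own ranges and vocabularies.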
[0031] Optionally, the candidate data generation module specifically includes:
[0032] Obtain the hybrid test feature table, coverage gap condition vector, path constraint condition vector, and defect triggering condition vector, and identify the continuous test field and discrete test field in the hybrid test feature table;
[0033] Perform numerical normalization on continuous test fields to generate continuous field encoded data;
[0034] Perform category mapping processing on discrete test fields to generate discrete field encoded data;
[0035] Continuous field encoded data and discrete field encoded data are input into the improved TabDDPM coverage guided diffusion model, and noise mapping processing is performed according to a preset noise scheduling sequence to generate test feature noise variables.
[0036] The conditional noise variables are generated by performing conditional fusion processing on the coverage gap condition vector, path constraint condition vector, and defect triggering condition vector with the test feature noise variables.
[0037] Perform inverse denoising and generation processing on the conditional noise variables to generate candidate test feature data;
[0038] Perform field restoration processing on the candidate test feature data to obtain candidate software test data.
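As a rough illustration of the encoding and field restoration steps, the sketch below normalizes a continuous field, maps a discrete field to integer categories, and then inverts both. Min-max scaling and sorted category indexing are assumptions made for the example; all names are hypothetical.

```python
from typing import Dict, List, Tuple


def fit_range(values: List[float]) -> Tuple[float, float]:
    """Record the observed range of a continuous test field."""
    return min(values), max(values)


def normalize(value: float, bounds: Tuple[float, float]) -> float:
    lo, hi = bounds
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def denormalize(scaled: float, bounds: Tuple[float, float]) -> float:
    """Inverse of normalize, used during field restoration."""
    lo, hi = bounds
    return lo + scaled * (hi - lo)


def fit_categories(values: List[str]) -> Dict[str, int]:
    """Assign a stable integer index to each distinct category."""
    return {name: index for index, name in enumerate(sorted(set(values)))}


def restore_category(index: int, mapping: Dict[str, int]) -> str:
    """Inverse category mapping, used during field restoration."""
    inverse = {i: name for name, i in mapping.items()}
    return inverse[index]


# Hypothetical fields: parameter length (continuous), request method (discrete).
bounds = fit_range([4.0, 8.0, 20.0])
methods = fit_categories(["POST", "GET", "POST"])
encoded = (normalize(8.0, bounds), methods["POST"])
restored = (denormalize(encoded[0], bounds), restore_category(encoded[1], methods))
print(encoded, restored)  # (0.25, 1) (8.0, 'POST')
```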
[0039] Optionally, the improved TabDDPM coverage-guided diffusion model specifically includes:
[0040] The field encoding unit receives continuous field encoded data and discrete field encoded data, performs numerical embedding processing on the continuous field encoded data, and performs category embedding processing on the discrete field encoded data to generate test field embedded data.
[0041] The noise scheduling unit receives test field embedded data and a preset noise scheduling sequence, performs step-by-step noise mapping processing on the test field embedded data, and generates test feature noise variables corresponding to different noise steps.
[0042] The condition fusion unit receives the coverage gap condition vector, the path constraint condition vector, and the defect triggering condition vector, and performs weight mapping, conflict filtering, and splicing fusion processing on the coverage gap condition vector, the path constraint condition vector, and the defect triggering condition vector to generate coverage guidance condition embedding data.
[0043] The denoising network unit receives test feature noise variables, noise step encoded data, and coverage guidance condition embedding data, performs stepwise reverse denoising on the test feature noise variables, and generates candidate test feature data.
[0044] The field restoration unit performs inverse normalization on continuous fields in the candidate test feature data and inverse category mapping on discrete fields in the candidate test feature data to generate candidate software test data.
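The step-by-step noise mapping performed by the noise scheduling unit can be sketched for continuous fields using the standard Gaussian diffusion closed form, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The linear beta schedule, its endpoints, and the function names are assumptions; the description only states that a preset noise scheduling sequence is used, and discrete fields are omitted here.

```python
import math
import random
from typing import List, Tuple


def linear_betas(steps: int, start: float = 1e-4, end: float = 0.02) -> List[float]:
    """Assumed preset noise schedule: betas rising linearly from start to end."""
    if steps < 2:
        return [start]
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """abar_t is the running product of (1 - beta_s) for s <= t."""
    products, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        products.append(running)
    return products


def add_noise(
    x0: List[float], step: int, alpha_bars: List[float], rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Map clean field embeddings to their noisy version at the given step."""
    signal = math.sqrt(alpha_bars[step])
    sigma = math.sqrt(1.0 - alpha_bars[step])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    noisy = [signal * x + sigma * e for x, e in zip(x0, noise)]
    return noisy, noise


alpha_bars = cumulative_alphas(linear_betas(steps=100))
noisy, noise = add_noise([0.25, 0.8], step=99, alpha_bars=alpha_bars,
                         rng=random.Random(0))
```

Later steps have smaller cumulative alphas, so the noisy variable retains less of the original field values as the step index grows.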
[0045] Optionally, the denoising network unit specifically includes:
[0046] Obtain the test feature noise variables, noise step encoding data, and coverage guidance condition embedding data corresponding to the current noise step;
[0047] The test feature noise variables and noise step encoded data are concatenated and mapped to generate noise state embedding data.
[0048] Inject the coverage guidance condition embedding data into the noise state embedding data to generate coverage-guided noise state data;
[0049] Based on the coverage-guided noise state data, the noise residuals corresponding to the continuous test fields are predicted to obtain the continuous field residual prediction data.
[0050] Based on the coverage guided noise state data, predict the category distribution corresponding to the discrete test field to obtain the discrete field category prediction data;
[0051] Based on the continuous field residual prediction data and the discrete field category prediction data, the test feature noise variable is updated by reverse denoising in the current noise step to generate the test feature variable corresponding to the next noise step.
[0052] When the reverse denoising update reaches the termination noise step, candidate test feature data is output.
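A single reverse denoising update can be sketched as follows. This is a minimal illustration using the standard DDPM ancestral sampling step for continuous fields, with the denoising network replaced by a caller-supplied residual prediction. Taking the most probable category for discrete fields is a simplification chosen for the example, not something the description prescribes, and all names are hypothetical.

```python
import math
import random
from typing import List


def reverse_step(
    x_t: List[float],
    eps_pred: List[float],
    step: int,
    betas: List[float],
    alpha_bars: List[float],
    rng: random.Random,
) -> List[float]:
    """Move continuous fields from noise step `step` to the previous step."""
    alpha = 1.0 - betas[step]
    coef = betas[step] / math.sqrt(1.0 - alpha_bars[step])
    mean = [(x - coef * e) / math.sqrt(alpha) for x, e in zip(x_t, eps_pred)]
    if step == 0:
        return mean  # termination step: no further noise is added
    sigma = math.sqrt(betas[step])
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def pick_category(probabilities: List[float]) -> int:
    """Simplified discrete update: take the most probable category index."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Sanity check with a perfect residual prediction at the termination step.
betas = [0.1, 0.2]
alpha_bars = [0.9, 0.9 * 0.8]
x0, eps = [0.5, -1.0], [0.3, -0.7]
x_noisy = [
    math.sqrt(alpha_bars[0]) * x + math.sqrt(1.0 - alpha_bars[0]) * e
    for x, e in zip(x0, eps)
]
recovered = reverse_step(x_noisy, eps, 0, betas, alpha_bars, random.Random(0))
```

When the predicted residual equals the true noise, the termination step recovers the original field values, which is the property the check above exercises.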
[0053] Optionally, the test rule verification module specifically includes:
[0054] Acquire candidate software test data and read interface parameter type rules, parameter value range rules, field dependency rules, interface call order rules, and business status rules;
[0055] Based on interface parameter type rules, parameter value range rules, field dependency rules, interface call order rules, and business status rules, parameter type verification, parameter range verification, field dependency verification, interface order verification, and business status verification are performed on candidate software test data to generate comprehensive verification results.
[0056] Based on the comprehensive verification results, the candidate software test data is processed by deletion, replacement, field correction, or regeneration to generate executable software test data.
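The verification checks can be illustrated with a small rule-driven validator. The rule tables, field names, and interface names below are hypothetical; the sketch covers type, value range, field dependency, and call order checks, and returns a list of violations that a later step could use to delete, replace, correct, or regenerate a candidate.

```python
from typing import Any, Dict, List

# Hypothetical parameter rules for one interface.
RULES: Dict[str, Dict[str, Any]] = {
    "amount": {"type": float, "min": 0.0, "max": 10000.0},
    "currency": {"type": str, "allowed": {"CNY", "USD"}},
}
# Hypothetical field dependency: coupon_id is only valid together with user_id.
DEPENDENCIES: Dict[str, str] = {"coupon_id": "user_id"}


def verify(candidate: Dict[str, Any]) -> List[str]:
    """Return the list of rule violations for one candidate test record."""
    violations: List[str] = []
    for name, rule in RULES.items():
        if name not in candidate:
            violations.append(f"{name}: missing")
            continue
        value = candidate[name]
        if not isinstance(value, rule["type"]):
            violations.append(f"{name}: wrong type")
            continue
        if "min" in rule and not rule["min"] <= value <= rule["max"]:
            violations.append(f"{name}: out of range")
        if "allowed" in rule and value not in rule["allowed"]:
            violations.append(f"{name}: not allowed")
    for field, required in DEPENDENCIES.items():
        if field in candidate and required not in candidate:
            violations.append(f"{field}: requires {required}")
    return violations


def check_call_order(calls: List[str], must_follow: Dict[str, str]) -> List[str]:
    """Flag calls issued before the call they are required to follow."""
    seen, violations = set(), []
    for call in calls:
        prerequisite = must_follow.get(call)
        if prerequisite is not None and prerequisite not in seen:
            violations.append(f"{call}: must follow {prerequisite}")
        seen.add(call)
    return violations


print(verify({"amount": -5.0, "currency": "EUR", "coupon_id": "c1"}))
print(check_call_order(["pay", "create_order"], {"pay": "create_order"}))
```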
[0057] Optionally, the test execution module specifically includes:
[0058] Obtain executable software test data and generate a set of test execution tasks according to interface identifier, test case identifier, and execution batch identifier;
[0059] The test execution task set is input into the test execution environment of the software under test, and the executable software test data is processed by interface calls, business process execution and exception input handling to obtain test execution process data;
[0060] The interface return codes, response times, exception logs, assertion results, and test pass status are collected from the test execution process data to generate test execution results.
[0061] Function coverage, branch coverage, path coverage, and newly added coverage nodes are collected from the test execution process data to generate the coverage change results.
[0062] The exception type, exception location, exception stack, and exception input combination are collected from the test execution process data to generate the anomaly trigger results.
[0063] The test execution process data is matched against the defect record data to generate defect triggering results;
[0064] The test execution results, coverage change results, anomaly trigger results, and defect trigger results are summarized to generate test execution feedback data.
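One way to derive a coverage change result is to compare the set of covered nodes before and after a test run. The sketch below assumes coverage is available as sets of node identifiers; the function name and the identifier scheme are hypothetical.

```python
from typing import Dict, Set


def coverage_change(before: Set[str], after: Set[str], total: int) -> Dict[str, object]:
    """Summarize newly covered nodes and the coverage ratio before and after."""
    newly_covered = after - before
    return {
        "newly_covered": sorted(newly_covered),
        "coverage_before": len(before) / total,
        "coverage_after": len(before | after) / total,
    }


# Hypothetical branch identifiers for a unit with four branches.
result = coverage_change({"b1", "b2"}, {"b2", "b3"}, total=4)
print(result)
# {'newly_covered': ['b3'], 'coverage_before': 0.5, 'coverage_after': 0.75}
```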
[0065] Optionally, the feedback update module specifically includes:
[0066] Acquire test execution feedback data, and extract test pass status, coverage change results, anomaly trigger results, and defect trigger results from the test execution feedback data;
[0067] Based on the test pass status, coverage change results, anomaly trigger results, and defect trigger results, execution result tags, coverage contribution tags, anomaly trigger tags, and defect trigger tags are added to the executable software test data to generate feedback-annotated test sample data.
[0068] Write the feedback-annotated test sample data into the hybrid test feature table to generate an updated hybrid test feature table;
[0069] Reconstruct the coverage gap conditions, path constraints, and defect triggering conditions based on the updated hybrid test feature table;
[0070] Based on the updated hybrid test feature table, coverage gap conditions, path constraints, and defect triggering conditions, the parameters of the improved TabDDPM coverage-guided diffusion model are updated to generate the updated software test data generation model.
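The tagging step of the feedback update can be sketched as a function that copies an executed test sample and attaches the four tags. The feedback field names and tag encodings are hypothetical; the parameter update of the diffusion model itself is not shown.

```python
from typing import Any, Dict


def label_sample(sample: Dict[str, Any], feedback: Dict[str, Any]) -> Dict[str, Any]:
    """Attach execution feedback tags to a copy of an executed test sample."""
    labeled = dict(sample)
    labeled["execution_result_tag"] = "pass" if feedback["passed"] else "fail"
    labeled["coverage_contribution_tag"] = len(feedback["newly_covered"])
    labeled["anomaly_trigger_tag"] = feedback["exception_type"] is not None
    labeled["defect_trigger_tag"] = feedback["matched_defect_id"] is not None
    return labeled


sample = {"interface": "/orders", "amount": -1.0}
feedback = {
    "passed": False,
    "newly_covered": ["b3", "b7"],
    "exception_type": "ValidationError",
    "matched_defect_id": None,
}
labeled = label_sample(sample, feedback)
```

Rows labeled this way could then be appended to the hybrid test feature table before the conditions are rebuilt.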
[0071] The beneficial effects of this invention are as follows: This invention uses a test big data analysis module to uniformly collect, clean, standardize the format, and align the time of interface call data, test case data, code coverage data, defect record data, and regression test data, generating standardized test data. This enables data scattered across test platforms, coverage collection tools, defect management systems, and regression test environments to enter a unified analysis process, solving the problems of scattered data sources, inconsistent field expressions, and insufficient utilization of historical test information in the traditional test data generation process.
[0072] This invention constructs a hybrid test feature table based on standardized test data, incorporating interface parameters, boundary conditions, abnormal inputs, execution paths, coverage gaps, defect triggers, and test result features into a unified feature structure. This allows the test data generation process to no longer rely solely on manual rules or random combinations, but to comprehensively reflect the interface constraints, historical execution paths, low-coverage areas, and high-defect scenarios of the software under test. Through a diffusion condition construction module, coverage gaps, path constraints, and defect trigger information are converted into generation conditions. The improved TabDDPM coverage-guided diffusion model can generate candidate software test data around low-coverage paths, high-risk input combinations, and historical defect trigger conditions during noise mapping, condition fusion, and reverse denoising generation, thereby improving the test data's coverage of critical code paths and abnormal scenarios.
[0073] This invention also utilizes a test rule verification module to validate candidate software test data for parameter types, value ranges, field dependencies, interface call order, and business status. This reduces the generation of invalid, unexecutable, and business logic conflicting data, improving the executability of generated test data in a real test environment. The test execution module collects test execution results, coverage change results, anomaly trigger results, and defect trigger results. The feedback update module writes the execution feedback data into a hybrid test feature table and updates the parameters of the improved TabDDPM coverage-guided diffusion model. This allows the test data generation process to continuously adjust based on software version changes, coverage changes, and defect triggering, forming a closed-loop optimization mechanism for test data generation, rule verification, execution verification, and model updates.
Attached Figure Description
[0074] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
[0075] Figure 1 is a schematic diagram of the structure of a software test data generation system based on big data analysis proposed in this invention;
[0076] Figure 2 is a schematic diagram illustrating the process of test big data analysis, hybrid test feature table construction, and diffusion condition construction in this invention;
[0077] Figure 3 is a schematic diagram illustrating the process of generating candidate software test data, verifying test rules, and updating feedback using the improved TabDDPM coverage-guided diffusion model in this invention.
Detailed Implementation
[0078] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.
[0079] Referring to Figures 1-3, a software test data generation system based on big data analysis includes the following modules:
[0080] The test big data analysis module collects interface call data, test case data, code coverage data, defect record data, and regression test data of the software under test. It then performs cleaning, unification, and alignment processing on the collected data to generate standardized test data.
[0081] The test feature construction module extracts interface parameters, boundary conditions, abnormal inputs, execution paths, coverage gaps, defect triggers, and test result features based on standardized test data, and generates a hybrid test feature table;
[0082] The diffusion condition construction module performs condition encoding on the coverage gap features, execution path features, and defect triggering features in the hybrid test feature table to generate coverage gap conditions, path constraint conditions, and defect triggering conditions.
[0083] The candidate data generation module inputs the hybrid test feature table, coverage gap conditions, path constraint conditions, and defect triggering conditions into the improved TabDDPM coverage-guided diffusion model. After noise mapping, condition fusion, and reverse denoising generation, candidate software test data is obtained.
[0084] The test rule verification module verifies the parameter types, value ranges, field dependencies, call order, and business status of candidate software test data, and generates executable software test data.
[0085] The test execution module inputs executable software test data into the software under test to perform tests, collects test execution results, coverage change results, anomaly trigger results, and defect trigger results, and generates test execution feedback data.
[0086] The feedback update module updates the hybrid test feature table based on test execution feedback data and updates the parameters of the improved TabDDPM coverage guided diffusion model to generate the updated software test data generation model.
[0087] In this embodiment, the test big data analysis module specifically includes: collecting interface call data, test case data, code coverage data, defect record data and regression test data generated by the software under test during historical testing, and uniformly collecting the software test data from the above sources to form a historical test data set.
[0088] In this embodiment, the test big data analysis module performs field cleaning processing on the historical test data set. Field cleaning processing includes processing missing metadata records, duplicate records, records with inconsistent field types, records with abnormal formats, records with incorrect codes, and records with invalid times.
[0089] For records with missing interface names, test case numbers, defect numbers, software version numbers, and test batch numbers, the test big data analysis module completes the missing information based on the interface call context, test case execution logs, service gateway logs, and defect association records within the same execution batch. If the missing information cannot be completed, the corresponding record is marked as a low-confidence record, and the sampling weight of the corresponding record is reduced during subsequent feature construction.
[0090] For duplicate API call records, the test big data analysis module identifies duplicates based on API name, request method, parameter combination, return status code, execution path, and execution time interval. Completely duplicate records are deleted, while records with high coverage variation, inconsistent exception triggering results, or high degree of defect correlation are retained.
[0091] For records with inconsistent field types, the test big data analysis module reads the interface parameter definition information and converts numeric, string, boolean, enumeration, date, and array fields into a unified type format respectively;
[0092] For records with abnormal formats, the test big data analysis module corrects them according to interface protocol rules, field length rules, and business coding rules.
[0093] For records with incorrect codes, the test big data analysis module performs code mapping according to the interface dictionary table, status code dictionary table, defect type dictionary table, and exception type dictionary table;
[0094] For invalid time records, the test big data analysis module filters or relocates them based on the test batch time range and pipeline execution time, generating cleaned test data.
[0095] In this embodiment, the test big data analysis module performs format unification processing on the test data after field cleaning, converting the field names, field types, field units, and field codes from different test platforms, interface automation test systems, service gateway logs, defect management systems, regression test platforms, and code coverage collection tools into a unified field format.
[0096] Because software test data from different sources have different field names, different status value representations, different exception log structures, and different coverage path numbers, the test big data analysis module pre-establishes field mapping tables, status mapping tables, exception type mapping tables, and coverage path mapping tables.
[0097] Convert the path numbers in different coverage acquisition tools into a unified path number structure;
[0098] Convert defect levels from different defect management platforms into a unified defect level field;
[0099] Write the mapping between the interface return status code, business error code, and exception type into a unified exception field set.
[0100] After format unification, uniformly formatted test data is generated.
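The mapping-table approach to format unification can be sketched as follows. The source field names, status values, and defect levels in the tables are hypothetical examples of what different platforms might emit.

```python
from typing import Any, Dict

# Hypothetical pre-established mapping tables.
FIELD_MAP = {"api": "interface_name", "apiName": "interface_name",
             "ret": "return_status", "status": "return_status",
             "severity": "defect_level", "priority": "defect_level"}
STATUS_MAP = {"OK": "success", "200": "success", "FAIL": "failure", "500": "failure"}
DEFECT_LEVEL_MAP = {"P0": "critical", "blocker": "critical", "P2": "minor"}


def unify_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Rename fields and map status and defect level values to unified codes."""
    unified = {FIELD_MAP.get(key, key): value for key, value in raw.items()}
    if "return_status" in unified:
        unified["return_status"] = STATUS_MAP.get(str(unified["return_status"]), "unknown")
    if "defect_level" in unified:
        unified["defect_level"] = DEFECT_LEVEL_MAP.get(str(unified["defect_level"]), "unknown")
    return unified


print(unify_record({"apiName": "/orders", "ret": 200}))
# {'interface_name': '/orders', 'return_status': 'success'}
print(unify_record({"api": "/pay", "status": "FAIL", "severity": "blocker"}))
# {'interface_name': '/pay', 'return_status': 'failure', 'defect_level': 'critical'}
```

Values missing from a mapping table are marked "unknown" here rather than guessed, so they remain visible to later cleaning steps.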
[0101] In this implementation, the test big data analysis module performs time alignment processing on the test data with a unified format, mapping the interface call time, test case execution time, coverage collection time, defect submission time, defect repair time, and regression test time to a unified test batch.
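Time alignment can be illustrated by assigning each event timestamp to the test batch whose time window contains it. The batch identifiers and windows below are hypothetical, and half-open windows are an assumption made so that adjacent batches do not overlap.

```python
from datetime import datetime
from typing import List, Optional, Tuple

Batch = Tuple[str, datetime, datetime]

# Hypothetical test batches with half-open time windows [start, end).
BATCHES: List[Batch] = [
    ("batch-001", datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 5, 12, 0)),
    ("batch-002", datetime(2026, 1, 5, 12, 0), datetime(2026, 1, 5, 18, 0)),
]


def assign_batch(timestamp: datetime, batches: List[Batch]) -> Optional[str]:
    """Return the batch whose window contains the timestamp, or None."""
    for batch_id, start, end in batches:
        if start <= timestamp < end:
            return batch_id
    return None


call_time = datetime(2026, 1, 5, 9, 30)
defect_time = datetime(2026, 1, 5, 12, 0)
stray_time = datetime(2026, 1, 6, 8, 0)
print(assign_batch(call_time, BATCHES))    # batch-001
print(assign_batch(defect_time, BATCHES))  # batch-002
print(assign_batch(stray_time, BATCHES))   # None
```

A timestamp that falls outside every window returns None, which corresponds to a record that would need to be filtered or relocated.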
[0102] In this embodiment, the test big data analysis module performs duplicate record removal and credibility labeling on the time-aligned test data.
[0103] The duplicate record removal process involves merging test samples that have the same interface parameter values, execution paths, return status, exception logs, and coverage changes, while retaining the first trigger record, the record with the highest coverage contribution, and the defect association record.
[0104] The credibility labeling process includes generating credibility identifiers for test data based on field completeness, time alignment accuracy, defect correlation, and coverage collection completeness.
[0105] The test big data analysis module outputs the data after duplicate record removal and credibility labeling as standardized test data to the test feature construction module.
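The duplicate merging rule described above (keep the first trigger record, the record with the highest coverage contribution, and defect-associated records) can be sketched as a grouping pass. The record fields and the reduced grouping key are hypothetical; exception logs and coverage changes are left out of the key for brevity.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def merge_duplicates(records: List[Record]) -> List[Record]:
    """Group equivalent samples and keep only the representative records."""
    groups: Dict[tuple, List[Record]] = {}
    for record in records:
        key = (record["params"], record["path"], record["status"])
        groups.setdefault(key, []).append(record)
    kept: List[Record] = []
    for group in groups.values():
        first = min(group, key=lambda r: r["time"])
        best = max(group, key=lambda r: r["coverage_gain"])
        keep = [first]
        if best is not first:
            keep.append(best)
        for record in group:
            if record.get("defect_id") and all(record is not k for k in keep):
                keep.append(record)
        kept.extend(keep)
    return kept


def make(i, params, time, gain, defect=None):
    """Build a hypothetical test record for the example."""
    return {"id": i, "params": params, "path": "p1", "status": 200,
            "time": time, "coverage_gain": gain, "defect_id": defect}


records = [
    make(1, ("a", 1), 1, 0), make(2, ("a", 1), 2, 3),
    make(3, ("a", 1), 3, 1, "D-7"), make(4, ("a", 1), 4, 0),
    make(5, ("b", 2), 5, 0),
]
kept_ids = [r["id"] for r in merge_duplicates(records)]
print(kept_ids)  # [1, 2, 3, 5]
```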
[0106] Standardized test data retains not only regular pass samples, but also abnormal failure samples, boundary input samples, historical defect trigger samples, and low-coverage path samples, thus providing a complete data foundation for the subsequent construction of hybrid test feature tables.
[0107] In this embodiment, referring to Figure 2, the test feature construction module specifically includes: receiving standardized test data output by the test big data analysis module, building a test sample index according to interface identifier, test case identifier, software version identifier and test batch identifier, and generating a test sample index set.
[0108] In this embodiment, the test feature construction module extracts interface parameter features based on standardized test data. The interface parameter features include interface category, request method, parameter name, parameter type, parameter length, parameter value range, parameter default value, whether the parameter is required, parameter combination relationship, and parameter source location.
[0109] The sources of parameters include path parameters, query parameters, request body parameters, request header parameters, and session parameters.
[0110] The test feature construction module determines the parameter combination relationship based on the interface definition document and historical call records. The interface parameter features are stored in the hybrid test feature table in a mixed manner of continuous and discrete fields. The parameter length, parameter value, response time and historical call count are used as continuous fields, while the interface category, request method, parameter type, parameter source location and parameter requirement are used as discrete fields.
[0111] In this embodiment, the test feature construction module extracts boundary condition features and abnormal input features based on interface parameter features and historical test case data.
[0112] Boundary condition features include minimum, maximum, critical, zero, negative, out-of-range, and precision boundary values for numeric fields; empty strings, excessively long strings, special character strings, strings with abnormal encoding, and strings with injection risks for string fields; valid, invalid, and missing enumeration values for enumeration fields; and expired times, future times, incorrectly formatted times, and times with abnormal time zones for time fields.
[0113] Abnormal input characteristics include illegal permission input, duplicate submission input, business status conflict input, missing request body input, incorrect parameter type input, abnormal parameter order input, missing interface dependency input, and high frequency failure input.
[0114] The test feature construction module writes boundary condition features and abnormal input features into the boundary field group and abnormal field group of the hybrid test feature table, respectively. The boundary field group records whether the input value is in the boundary area, and the abnormal field group records the deviation type between the input value and the interface rule or business rule.
[0115] In this embodiment, the test feature construction module extracts execution path features and coverage gap features based on code coverage data.
[0116] Execution path characteristics include function call chain, interface call order, branch jump relationship, exception return path, business process node order, and cross-interface status transmission relationship.
[0117] Coverage gap characteristics include uncovered functions, low-coverage functions, uncovered branches, low-coverage branches, uncovered paths, low-coverage paths, coverage differences, and newly added covered nodes.
[0118] The test feature construction module associates specific test samples with specific code paths based on the interface call records and coverage collection records within the same test batch.
[0119] In this embodiment, the test feature construction module extracts defect trigger features and test result features based on defect record data and regression test data.
[0120] Defect triggering characteristics include defect level, defect type, defect triggering input, failure return code, exception stack summary, key fields of exception log, repair status, retest status, and the module to which the defect belongs.
[0121] Test result characteristics include assertion pass status, test execution status, interface return status, exception trigger status, coverage contribution status, and defect trigger status.
[0122] The test feature construction module associates defect trigger inputs with corresponding interface parameter combinations, associates the module to which the defect belongs with coverage gap features, and associates defect repair status with regression test results. This allows the hybrid test feature table to record not only the input data of the test sample itself, but also the defect risk, coverage contribution, and regression verification results corresponding to the test sample.
[0123] In this embodiment, the test feature construction module performs row and column organization and field alignment processing on interface parameter features, boundary condition features, abnormal input features, execution path features, coverage gap features, defect triggering features, and test result features according to the test sample index set to generate a hybrid test feature table.
[0124] In the hybrid test feature table, each row corresponds to a test sample, and each column corresponds to a test feature field. Continuous fields in the hybrid test feature table include amount, quantity, response time, parameter length, coverage difference, API call time, historical failure count, and historical success count.
[0125] Discrete fields include interface type, request method, permission level, order status, payment status, refund status, exception type, defect level, defect repair status, and test execution status.
[0126] The hybrid test feature table is organized by row according to the test sample index and by column according to the field type.
[0127] Continuous fields are mapped to a uniform numerical range through numerical normalization, while discrete fields are converted into discrete codes through category mapping.
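The encoding step described above can be illustrated with a minimal sketch. It assumes min-max scaling to [0, 1] as the "uniform numerical range" and sorted integer codes as the category mapping; the source does not fix either choice, and the field names and rows are hypothetical.

```python
def fit_min_max(values):
    """Record the original numeric range of a continuous field."""
    return min(values), max(values)


def normalize(value, lo, hi):
    """Map a continuous value into [0, 1]; a constant field maps to 0.0."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def fit_category_dict(values):
    """Assign each distinct category a stable integer code (sorted order)."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}


# Hypothetical rows of a hybrid test feature table.
rows = [
    {"amount": 10.0, "request_method": "GET"},
    {"amount": 250.0, "request_method": "POST"},
    {"amount": 130.0, "request_method": "GET"},
]

lo, hi = fit_min_max([r["amount"] for r in rows])
method_codes = fit_category_dict([r["request_method"] for r in rows])

encoded = [
    {
        "amount": normalize(r["amount"], lo, hi),
        "request_method": method_codes[r["request_method"]],
    }
    for r in rows
]
```

Keeping the recorded range and the category dictionary alongside the table is what later allows generated values to be mapped back to concrete parameter values.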
[0128] The test feature construction module also adds data credibility identifiers, coverage contribution identifiers, and defect association identifiers to each row of test samples, which can be called by the diffusion condition construction module and the candidate data generation module.
[0129] In this embodiment, the test feature construction module further performs feature statistics and risk stratification processing on the hybrid test feature table. Feature statistics include interface call frequency statistics, parameter value distribution statistics, anomaly trigger frequency statistics, path access frequency statistics, and defect trigger frequency statistics.
[0130] Risk stratification involves dividing test samples into regular test samples, boundary test samples, abnormal input test samples, low coverage path test samples, and historical defect regression test samples based on defect level, defect triggering frequency, number of low coverage paths, and interface call failure rate.
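As an illustration only, risk stratification could be sketched as a rule cascade. The priority order, the failure-rate threshold, and all field names below are assumptions; the source lists the input signals and the five layers but not the decision rule.

```python
def stratify(sample, failure_rate_threshold=0.2):
    """Return one risk layer for a test sample.

    The priority order (defects first, then coverage, then abnormal
    input, then boundary) is an assumption made for this sketch.
    """
    if sample["defect_level"] is not None and sample["defect_trigger_count"] > 0:
        return "historical_defect_regression"
    if sample["low_coverage_path_count"] > 0:
        return "low_coverage_path"
    if (sample["deviation_type"] is not None
            or sample["call_failure_rate"] >= failure_rate_threshold):
        return "abnormal_input"
    if sample["in_boundary_area"]:
        return "boundary"
    return "regular"


# Hypothetical sample: touches two low-coverage paths, no defect history.
sample = {
    "defect_level": None,
    "defect_trigger_count": 0,
    "low_coverage_path_count": 2,
    "deviation_type": None,
    "call_failure_rate": 0.05,
    "in_boundary_area": True,
}
layer = stratify(sample)
```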
[0131] The risk stratification results are written into the hybrid test feature table, serving as a crucial basis for the diffusion condition construction module to determine coverage gap conditions, path constraints, and defect triggering conditions. Through the above processing, the test feature construction module can transform historical test data from raw logs, use case records, and defect records into hybrid structured data that the model can process.
[0132] In this embodiment, referring to Figure 2, the diffusion condition construction module specifically performs the following: receiving the hybrid test feature table and reading coverage gap features, execution path features, and defect triggering features from it. Based on the coverage gap features, the diffusion condition construction module extracts information on uncovered code units, low-coverage paths, coverage differences, and newly added coverage nodes to generate coverage gap condition data.
[0133] In this embodiment, the diffusion condition construction module generates path constraint data by extracting, from the execution path features, the interface call order, function call chain, branch jump relationship, exception return path, and business process node relationship.
[0134] The interface call order records the sequential relationship between different interfaces in the business process; the function call chain records the code call path after an interface request enters the business logic.
[0135] The branch jump relationship records the correspondence between parameter values and code branches.
[0136] The exception return path records the return path caused by input exceptions, permission exceptions, status exceptions, and system exceptions.
[0137] The business process node relationship records the dependencies between user login, permission verification, order creation, inventory deduction, payment confirmation, refund processing, and message notification.
[0138] After being categorized and serialized, the path constraint data forms a path constraint vector. This vector then restricts the interface call order and business state flow relationship of the candidate test data during the subsequent diffusion generation process.
[0139] In this embodiment, the diffusion condition construction module extracts defect type, defect level, trigger parameter combination, failure return code, exception stack summary, repair status and retest results based on defect trigger features to generate defect trigger condition data.
[0140] The defect triggering condition data records the relationship between historical defects and input fields.
[0141] After being categorized, weighted, and concatenated, the defect triggering condition data forms a defect triggering condition vector. This vector increases the probability of generating high-risk input combinations and similar input combinations to historical defects during the subsequent diffusion generation process.
[0142] In this embodiment, the diffusion condition construction module performs weight initialization processing on the coverage gap condition vector, path constraint condition vector, and defect triggering condition vector.
[0143] The weight initialization process determines the initial weights of different condition vectors based on the coverage gap size, path importance, defect level, and historical trigger frequency.
[0144] Paths with lower coverage correspond to higher coverage gap weights, historical defects with higher defect levels correspond to higher defect trigger weights, and API call chains with stronger dependencies in the business process correspond to higher path constraint weights.
[0145] The diffusion condition construction module performs vector concatenation and field alignment on the three types of condition vectors to form a set of coverage guidance conditions, and outputs this set to the candidate data generation module.
[0146] This set of coverage guidance conditions enables the subsequent generation process to not only conform to the overall distribution of historical test data, but also to generate data based on low-coverage paths, high-risk interfaces, and defect-triggered input combinations.
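The weight initialization and concatenation described above can be sketched as follows. This is a minimal illustration, assuming all three inputs are pre-scaled to [0, 1] and that weights are normalized to sum to 1; the function names and the scaling are hypothetical and not taken from the source.

```python
def initial_weights(coverage_ratio, defect_level, dependency_strength):
    """Derive initial condition weights; all inputs assumed scaled to [0, 1]."""
    raw = {
        "coverage_gap": 1.0 - coverage_ratio,    # lower coverage -> higher weight
        "defect_trigger": defect_level,          # higher level -> higher weight
        "path_constraint": dependency_strength,  # stronger dependency -> higher weight
    }
    total = sum(raw.values())
    if total == 0:
        return {k: 1.0 / len(raw) for k in raw}
    return {k: v / total for k, v in raw.items()}


def build_guidance_condition(gap_vec, path_vec, defect_vec, weights):
    """Scale each condition vector by its weight, then concatenate."""
    return (
        [weights["coverage_gap"] * x for x in gap_vec]
        + [weights["path_constraint"] * x for x in path_vec]
        + [weights["defect_trigger"] * x for x in defect_vec]
    )


weights = initial_weights(coverage_ratio=0.2, defect_level=0.5, dependency_strength=0.3)
condition = build_guidance_condition([1.0, 0.0], [1.0], [0.0, 1.0, 1.0], weights)
```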
[0147] In this embodiment, referring to Figure 1 and Figure 3, the candidate data generation module specifically performs the following: receiving the hybrid test feature table, the coverage gap condition vector, the path constraint condition vector, and the defect triggering condition vector, and identifying the continuous test fields and discrete test fields in the hybrid test feature table.
[0148] The candidate data generation module performs numerical normalization on the continuous test fields to generate continuous field encoded data.
[0149] The candidate data generation module performs category mapping on the discrete test fields to generate discrete field encoded data.
[0150] Continuous field encoded data and discrete field encoded data are input into the improved TabDDPM coverage guided diffusion model in the order of test sample index. The improved TabDDPM coverage guided diffusion model performs noise mapping processing according to a preset noise scheduling sequence to generate test feature noise variables.
[0151] The test feature noise variables are fused with the coverage gap condition vector, path constraint condition vector and defect triggering condition vector in the condition fusion unit to generate condition noise variables;
[0152] The condition noise variables are processed by reverse denoising to generate candidate test feature data; the candidate test feature data are processed by field restoration to generate candidate software test data.
[0153] In this embodiment, referring to Figure 3, the improved TabDDPM coverage-guided diffusion model specifically includes a field encoding unit, a noise scheduling unit, a condition fusion unit, a denoising network unit, and a field restoration unit.
[0154] The field encoding unit receives continuous field encoded data and discrete field encoded data, performs numerical embedding processing on the continuous field encoded data, performs category embedding processing on the discrete field encoded data, and generates test field embedded data.
[0155] The test field embedding data is organized with test samples as rows and test fields as columns. Each test field corresponds to an embedding representation. The embedding results of continuous fields retain the numerical size relationship, while the embedding results of discrete fields retain the category difference relationship.
[0156] The output of the field encoding unit is connected to the noise scheduling unit. The noise scheduling unit performs step-by-step noise mapping processing on the embedded data of the test field according to the preset noise scheduling sequence, and generates test feature noise variables corresponding to different noise steps.
[0157] In this embodiment, the noise scheduling unit records the current noise step state through noise step encoding data and outputs the noise step encoding data and test feature noise variables synchronously to the denoising network unit.
[0158] The noise scheduling sequence includes multiple consecutive noise steps, with different noise steps corresponding to different noise intensities. During the forward noise mapping process, random noise is gradually added to the embedded data of the test field, forming test feature noise variables that approximate the noise distribution.
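For continuous fields, the forward noise mapping can be illustrated with the standard Gaussian diffusion closed form, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The sketch below assumes a linear schedule from 1e-4 to 0.02, a common DDPM default; the source only states that a preset noise scheduling sequence is used, so the schedule and all names here are assumptions. Discrete fields are not covered.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise intensity per step, increasing linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """alpha_bar[t] = product of (1 - beta_s) for all s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample a noised vector at step t directly from the clean vector x0."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + sigma * rng.gauss(0.0, 1.0) for x in x0]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha(betas)
rng = random.Random(0)
x0 = [0.0, 0.5, 1.0]                         # normalized continuous fields
x_early = add_noise(x0, 0, alpha_bars, rng)  # still close to x0
x_late = add_noise(x0, 999, alpha_bars, rng) # close to pure noise
```

At the last step ᾱ_t is nearly zero, so the noised variable carries almost no information about the original sample, which matches the statement that the variables approximate the noise distribution.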
[0159] In the reverse denoising generation process, the denoising network unit uses the coverage guidance condition embedding data to gradually remove noise, so that the random noise variable is gradually restored to the candidate test feature data.
[0160] The noise scheduling unit and the condition fusion unit are connected through the noise step encoded data and the test feature noise variables, so that condition information participates in the generation process at each reverse denoising stage.
[0161] In this embodiment, the condition fusion unit receives the coverage gap condition vector, the path constraint condition vector, and the defect triggering condition vector, and performs weight mapping, conflict filtering, and splicing fusion on these three vectors to generate coverage guidance condition embedding data.
[0162] The weight mapping process adjusts the condition vector weights based on coverage difference, importance of low-coverage paths, defect level, and historical trigger frequency.
[0163] The conflict filtering process identifies and handles logical conflicts between conditions, and the splicing and fusion process maps the condition vectors obtained after weight mapping and conflict filtering to a unified condition space and outputs the coverage guidance condition embedding data.
[0164] The coverage guidance condition embedding data is passed to the denoising network unit and fused with the noise state embedding data at each reverse denoising stage.
[0165] In this embodiment, the denoising network unit includes a noise state mapping layer, a conditional fusion layer, a residual prediction layer, and a category prediction layer.
[0166] The noise state mapping layer receives test feature noise variables and noise step encoded data, performs concatenation mapping on the test feature noise variables and noise step encoded data, and generates noise state embedded data.
[0167] The conditional fusion layer receives the noise state embedding data and the coverage guidance condition embedding data, injects the coverage guidance condition embedding data into the noise state embedding data, and generates coverage-guided noise state data.
[0168] The residual prediction layer predicts the noise residuals corresponding to continuous test fields based on the coverage-guided noise state data, while the category prediction layer predicts the category distributions corresponding to discrete test fields based on the coverage-guided noise state data.
[0169] The denoising network unit performs reverse denoising update on the test feature noise variable corresponding to the current noise step based on the noise residual of the continuous field and the category distribution of the discrete field, generates the test feature variable corresponding to the next noise step, and outputs candidate test feature data when the termination noise step is reached.
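A single reverse denoising update can be sketched as below. This is a simplified illustration, assuming the standard DDPM posterior mean with variance beta_t for continuous fields and direct sampling from the predicted category distribution for discrete fields. The network outputs (`eps_pred`, `probs`) are stand-ins supplied by hand, and all names are hypothetical.

```python
import math
import random


def reverse_step_continuous(x_t, eps_pred, beta_t, alpha_bar_t, rng, last_step):
    """One DDPM-style reverse update for continuous fields.

    eps_pred is the noise residual predicted by the residual prediction
    layer; here it is passed in rather than computed by a network.
    """
    alpha_t = 1.0 - beta_t
    coef = beta_t / math.sqrt(1.0 - alpha_bar_t)
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_pred)]
    if last_step:
        return mean  # no fresh noise is added at the termination step
    sigma = math.sqrt(beta_t)  # assumed variance choice
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def sample_category(probs, rng):
    """Draw a category code from a predicted category distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
x_prev = reverse_step_continuous(
    x_t=[0.3, -1.2], eps_pred=[0.1, -0.4],
    beta_t=0.02, alpha_bar_t=0.5, rng=rng, last_step=False,
)
code = sample_category([0.1, 0.7, 0.2], rng)
```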
[0170] The above structure enables the model to process both continuous and discrete test fields simultaneously, and to continuously integrate coverage gap, path constraint, and defect triggering information during the denoising process.
[0171] In this embodiment, the field restoration unit receives candidate test feature data, performs inverse normalization processing on continuous fields in the candidate test feature data, and performs category inverse mapping processing on discrete fields in the candidate test feature data to generate candidate software test data.
[0172] The inverse normalization process recovers the amount, quantity, response time, parameter length, and coverage difference based on the original numerical ranges recorded in the hybrid test feature table.
[0173] The category inverse mapping process restores the interface category, permission level, business status, exception type, and defect level based on the category dictionary.
[0174] The field restoration unit also restores the parameter name, parameter position, and parameter format based on the interface parameter definition information, so that the candidate software test data can enter the test rule verification module for further processing.
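Field restoration can be illustrated as the inverse of min-max scaling plus a reverse dictionary lookup. The sketch assumes a [0, 1] normalized range and clips overshoot before rescaling; clipping is an assumption of this example, and the field names and values are hypothetical.

```python
def denormalize(value, lo, hi):
    """Map a value from [0, 1] back to the recorded range, clipping overshoot."""
    clipped = min(max(value, 0.0), 1.0)
    return lo + clipped * (hi - lo)


def decode_category(code, category_dict):
    """Invert a category -> code dictionary and look up one code."""
    inverse = {c: cat for cat, c in category_dict.items()}
    return inverse[code]


# Hypothetical metadata recorded when the feature table was encoded.
ranges = {"amount": (10.0, 250.0)}
category_dicts = {"request_method": {"GET": 0, "POST": 1}}

candidate_features = {"amount": 0.5, "request_method": 1}
candidate = {
    "amount": denormalize(candidate_features["amount"], *ranges["amount"]),
    "request_method": decode_category(
        candidate_features["request_method"], category_dicts["request_method"]
    ),
}
```

If out-of-range boundary inputs are wanted as test data, the clipping step would be dropped and the range check left to the test rule verification module.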
[0175] The candidate software test data output by the field restoration unit includes interface identifier, test case identifier, input parameter set, business status set, interface call order, and expected test objective.
[0176] In this embodiment, the improved TabDDPM coverage-guided diffusion model uses historical test samples as training data during the training process. The training data sources include interface automated testing platforms, regression testing platforms, defect verification platforms, and coverage acquisition platforms.
[0177] Training data annotation methods include adding execution result labels based on test execution results, adding coverage contribution labels based on coverage change results, adding exception trigger labels based on exception logs, and adding defect trigger labels based on defect records.
[0178] The execution result labels include Pass, Fail, and Incomplete;
[0179] Coverage contribution labels include new coverage, high coverage contribution, low coverage contribution, and no coverage contribution;
[0180] The exception trigger tags include interface exception, permission exception, status exception, data format exception, and system exception;
[0181] Defect trigger tags include historical defect triggers, new defect triggers, and untriggered defects.
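The four label sets listed above could be represented as enumerations attached to each training row. This is one possible encoding; the class names, string values, and row layout are assumptions of this sketch.

```python
from enum import Enum


class ExecutionResult(Enum):
    PASS = "pass"
    FAIL = "fail"
    INCOMPLETE = "incomplete"


class CoverageContribution(Enum):
    NEW = "new coverage"
    HIGH = "high coverage contribution"
    LOW = "low coverage contribution"
    NONE = "no coverage contribution"


class ExceptionTrigger(Enum):
    INTERFACE = "interface exception"
    PERMISSION = "permission exception"
    STATUS = "status exception"
    DATA_FORMAT = "data format exception"
    SYSTEM = "system exception"


class DefectTrigger(Enum):
    HISTORICAL = "historical defect trigger"
    NEW = "new defect trigger"
    UNTRIGGERED = "untriggered defect"


def label_sample(features, execution, coverage, exception, defect):
    """Attach the four feedback labels to one training-matrix row.

    `exception` may be None when no exception was triggered
    (an assumption of this sketch).
    """
    return {
        **features,
        "execution_result": execution.value,
        "coverage_contribution": coverage.value,
        "exception_trigger": exception.value if exception else None,
        "defect_trigger": defect.value,
    }


row = label_sample(
    {"amount": 0.5, "request_method": 1},
    ExecutionResult.FAIL,
    CoverageContribution.NEW,
    ExceptionTrigger.STATUS,
    DefectTrigger.HISTORICAL,
)
```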
[0182] Training data is written into the training matrix according to the test sample index. The training matrix is organized with test samples as rows and test feature fields as columns, and simultaneously records continuous fields, discrete fields, conditional fields and feedback label fields.
[0183] In this embodiment, the training parameters of the improved TabDDPM coverage-guided diffusion model include training batch size, learning rate, number of noise steps, conditional weight parameters, and convergence conditions.
[0184] The training batch size is set to 64, the initial learning rate is set to 0.0001, the number of noise steps is set to 1000, the initial weight of the coverage gap condition is set according to the number of low coverage paths, the initial weight of the path constraint condition is set according to the complexity of the business process, and the initial weight of the defect trigger condition is set according to the historical defect level.
[0185] Model training losses include continuous field noise residual loss, discrete field category distribution loss, coverage condition consistency loss, and defect condition consistency loss;
[0186] The continuous field noise residual loss constrains the numerical recovery of continuous test fields; the discrete field category distribution loss constrains the category generation of discrete test fields; the coverage condition consistency loss constrains the consistency between generated samples and coverage gap conditions; and the defect condition consistency loss constrains the consistency between generated samples and defect triggering conditions.
[0187] The training convergence condition is set to a loss decrease of less than a preset threshold for 10 consecutive training rounds, or a cessation of increases in the proportion of executable samples and coverage contribution samples on the validation set. Once the convergence condition is met, the initial software test data generation model is generated.
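The first convergence criterion (loss decrease below a threshold for 10 consecutive rounds) can be sketched as a check over the loss history. The function name and the threshold value are hypothetical; the second criterion, based on validation-set proportions, is not covered here.

```python
def loss_converged(loss_history, threshold, patience=10):
    """Return True if each of the last `patience` rounds reduced the loss
    by less than `threshold` (a loss increase also counts as "less")."""
    if len(loss_history) < patience + 1:
        return False
    recent = loss_history[-(patience + 1):]
    return all(prev - curr < threshold for prev, curr in zip(recent, recent[1:]))


# Hypothetical per-round losses: 11 values give 10 consecutive differences.
flat = [1.0 - 0.0001 * i for i in range(11)]
dropping = [1.0 - 0.1 * i for i in range(11)]

stop_flat = loss_converged(flat, threshold=0.001)
stop_dropping = loss_converged(dropping, threshold=0.001)
```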
[0188] In this embodiment, compared to the original TabDDPM model, the improved TabDDPM coverage-guided diffusion model includes at least three improvements:
[0189] First, the coverage gap condition, path constraint condition and defect triggering condition are introduced into the reverse denoising generation process, so that the model generation result no longer depends solely on the distribution of historical test data, but simultaneously considers low-coverage code areas, business execution paths and high-risk defect triggering inputs.
[0190] Second, weight mapping and conflict filtering are added to the condition fusion unit so that coverage gaps, path constraints and defect triggering conditions can be dynamically adjusted according to the test objectives, and the number of unexecutable samples caused by logical conflict conditions is reduced.
[0191] Third, the test execution feedback data is written back to the hybrid test feature table, and the parameters of the generation model are updated based on the feedback data, so that the model can continuously adjust the generation direction according to changes in software version, defect distribution, and coverage.
[0192] The above improvements make the model more suitable for software test data generation scenarios, and enable the generated data to take into account historical distribution similarity, execution feasibility, coverage targeting, and defect triggering relevance.
[0193] In this embodiment, referring to Figure 1 and Figure 3, the test rule verification module specifically performs the following: receiving candidate software test data and reading interface parameter type rules, parameter value range rules, field dependency rules, interface call order rules, and business status rules.
[0194] The rules for interface parameter types come from the interface definition document and interface protocol configuration; the rules for parameter value ranges come from business field constraints and historical test sample distribution; the rules for field dependencies come from the business data model and database field constraints; the rules for interface call order come from the business process definition and interface call chain analysis results; and the rules for business status come from the state machine configuration and historical business process execution records.
[0195] The test rule verification module performs parameter type verification, parameter range verification, field dependency verification, interface order verification, and business status verification on the candidate software test data according to the above rules, and generates a comprehensive verification result.
[0196] In this embodiment, parameter type verification includes performing type consistency checks on string, integer, floating-point, boolean, date, enumeration, and array parameters;
[0197] Parameter range verification includes performing value consistency checks against the amount range, quantity range, string length range, time range, and enumeration range;
[0198] Field dependency validation includes determining the consistency of associations between user identifier and permission level, order identifier and order status, payment serial number and payment amount, and inventory identifier and inventory quantity.
[0199] Interface sequence verification includes checking the consistency of the call order among login, permission verification, order creation, inventory deduction, payment confirmation, refund processing, and order closure.
[0200] Business status verification includes judging the consistency of the flow between order status, payment status, refund status, inventory status, and user status.
[0201] For candidate software test data that passes the verification, the test rule verification module writes the data into the executable test data set;
[0202] For data where some fields do not meet the rules but can be corrected, the test rule verification module performs field replacement or field correction.
[0203] For data that cannot be corrected, the test rule verification module performs deletion or regeneration. After processing, executable software test data is generated.
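The three outcomes above (pass, correct, delete or regenerate) can be sketched for the type and range rules only. The rule table, field names, and the choice of type coercion and range clamping as the correction strategy are assumptions of this example; dependency, call order, and business status checks are omitted.

```python
# Hypothetical rule table: expected type and allowed range per field.
RULES = {
    "quantity": {"type": int, "min": 1, "max": 100},
    "amount": {"type": float, "min": 0.01, "max": 10000.0},
}


def verify(candidate, rules=RULES):
    """Return (status, data); status is 'pass', 'corrected', or 'rejected'."""
    fixed, corrected = dict(candidate), False
    for name, rule in rules.items():
        if name not in fixed:
            return "rejected", None          # missing field: not correctable here
        value = fixed[name]
        if not isinstance(value, rule["type"]):
            try:
                value = rule["type"](value)  # field correction by type coercion
            except (TypeError, ValueError):
                return "rejected", None
            corrected = True
        clamped = min(max(value, rule["min"]), rule["max"])
        if clamped != value:
            corrected = True                 # field replacement by range clamping
        fixed[name] = clamped
    return ("corrected" if corrected else "pass"), fixed


executable = []
for cand in [
    {"quantity": 5, "amount": 99.5},
    {"quantity": "7", "amount": 20000.0},
    {"quantity": "many", "amount": 1.0},
]:
    status, data = verify(cand)
    if status != "rejected":
        executable.append(data)
```

Rejected candidates would be handed back for regeneration rather than silently dropped, so the reason for rejection remains available as feedback.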
[0204] In this embodiment, the test execution module specifically includes: receiving executable software test data, and generating a set of test execution tasks according to the interface identifier, test case identifier, and execution batch identifier.
[0205] In this embodiment, the test execution module generates test execution results, coverage change results, anomaly trigger results, and defect trigger results based on the test execution process data.
[0206] Test execution results include interface return codes, response times, assertion results, and test pass status;
[0207] The coverage change results include new coverage functions, new coverage branches, new coverage paths, coverage changes of low-coverage paths, and overall coverage changes;
[0208] The exception trigger results include the exception type, exception location, exception stack, exception input combination, and exception reproduction status.
[0209] Defect triggering results include historical defect triggers, new defect triggers, defect similarity matching results, and defect retesting results.
[0210] The test execution module summarizes the above results according to the test sample index, generates test execution feedback data, and outputs the test execution feedback data to the feedback update module.
[0211] In this embodiment, the feedback update module specifically includes: receiving test execution feedback data, and extracting test pass status, coverage change results, anomaly trigger results, and defect trigger results from the test execution feedback data.
[0212] In this embodiment, the feedback update module writes the feedback labeled test sample data into the hybrid test feature table and generates an updated hybrid test feature table.
[0213] The coverage gap conditions, path constraints, and defect triggering conditions are reconstructed based on the updated hybrid test feature table;
[0214] Based on the updated hybrid test feature table, coverage gap conditions, path constraints, and defect triggering conditions, the parameters of the improved TabDDPM coverage-guided diffusion model are updated to generate the updated software test data generation model.
[0215] During the parameter update process, the feedback update module uses the coverage contribution label, anomaly trigger label, and defect trigger label as feedback signals to adjust the condition weight parameters in the condition fusion unit, so that the subsequently generated data is more inclined to the combination of inputs related to low coverage paths, high-risk interfaces, and historical defects.
[0216] When the software version changes, the feedback update module updates the interface, path, and defect fields in the hybrid test feature table according to the version change description and regression test results, so that the model can adapt to the new version of the software under test.
[0217] In this embodiment, the test big data analysis module, test feature construction module, diffusion condition construction module, candidate data generation module, test rule verification module, test execution module, and feedback update module exchange data through a unified test sample index.
[0218] The unified test sample index includes interface identifier, test case identifier, software version identifier, and test batch identifier, ensuring that each test sample maintains a consistent data relationship from historical data collection, feature construction, condition generation, candidate generation, rule verification, test execution to feedback updates.
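One way to picture the unified test sample index is as an immutable four-part key shared by every module's records. The class name, field names, and sample values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleIndex:
    """Four-part key; frozen so that it is hashable and usable as a dict key."""
    interface_id: str
    test_case_id: str
    software_version: str
    test_batch_id: str


idx = SampleIndex("order.create", "TC-001", "v2.3.0", "batch-17")

# Records produced by different modules, keyed by the same index.
feature_table = {idx: {"amount": 0.5}}
feedback = {idx: {"assertion_passed": True, "new_branches_covered": 2}}

# Joining feedback onto the feature table is a plain key lookup.
merged = {k: {**feature_table[k], **feedback.get(k, {})} for k in feature_table}
```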
[0219] The standardized test data output by the big data analysis module provides the basic input for the test feature construction module; the hybrid test feature table output by the test feature construction module provides the basis for the diffusion condition construction module and the candidate data generation module.
[0220] The three types of conditional data output by the diffusion condition construction module provide generation guidance for the candidate data generation module;
[0221] The executable software test data output by the test rule verification module provides execution input for the test execution module;
[0222] The test execution feedback data output by the test execution module provides model update input for the feedback update module.
[0223] A unified test sample index runs through the data transfer process between modules, ensuring a clear connection between the generation, execution, and feedback processes.
[0224] In this embodiment, the system can be deployed on a software testing platform server, a continuous integration server, or a cloud testing environment.
[0225] The test big data analysis module can establish data interfaces with interface automation testing platforms, defect management systems, code coverage collection tools, and regression testing platforms;
[0226] The candidate data generation module can be deployed on computing nodes with graphics processors to improve the efficiency of diffusion model training and generation.
[0227] The test rule verification module can connect to the interface definition repository, business rule repository, and state machine configuration repository; the test execution module can connect to the test environment of the software under test, simulation service, database test environment, and log collection system.
[0228] The feedback update module can trigger model updates periodically or after each round of testing.
[0229] Each module can be deployed on the same testing platform, or it can be distributed across multiple server nodes according to the functions of data collection, model generation, test execution, and feedback updates.
[0230] Modules exchange data through standardized test data interfaces, hybrid test feature table interfaces, conditional data interfaces, candidate data interfaces, and feedback data interfaces, enabling the system to adapt to software testing environments of different scales.
[0231] In this embodiment, for software systems with a large number of interfaces and complex business processes, the test big data analysis module can manage historical test data by partitioning according to business domain, interface set, and software version.
[0232] When constructing a hybrid test feature table, the test feature construction module can generate multiple sub-feature tables according to the business domain, and each sub-feature table is input into the improved TabDDPM coverage-guided diffusion model in the candidate data generation module to generate candidate data.
[0233] For business domains with high defect incidence, the diffusion condition construction module increases the condition weight of defect triggering conditions; for business domains with consistently low coverage, the diffusion condition construction module increases the condition weight of coverage gap conditions.
[0234] For business domains with a strict interface call order, the diffusion condition construction module increases the condition weight of path constraint conditions.
[0235] By implementing partition management and adjusting conditional weights, the system can develop differentiated test data generation strategies across different business domains.
[0236] In this embodiment, when the interface fields, business status, and code path change due to the iteration of the software under test, the feedback update module identifies the newly added interfaces, deleted interfaces, changed parameters, and changed paths based on the version change record, and writes the version change information into the hybrid test feature table.
[0237] For new interfaces, the test feature construction module constructs initial interface parameter features based on the interface definition document and similar historical interfaces;
[0238] For changes in parameters, the test rule verification module updates the corresponding parameter type rules and parameter value range rules; for new code paths, the diffusion condition construction module writes the new code paths into the coverage gap condition data.
[0239] For historical defects that may recur even after being repaired, the feedback update module writes the combination of trigger parameters corresponding to the historical defects into the defect trigger condition data.
[0240] Through the above version adaptation process, the system can continue to generate software test data that matches the new version after a software version change.
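The identification of added interfaces, deleted interfaces, and changed parameters in [0236] can be illustrated with a small comparison of two interface definition maps. This is a hedged sketch: the function `diff_interfaces`, the map layout (interface name to parameter name and type), and the sample interfaces are hypothetical and are not taken from this application.

```python
def diff_interfaces(old: dict, new: dict) -> dict:
    """Compare two maps of interface name -> {parameter name: parameter type}."""
    added = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    changed = {}
    for name in sorted(set(old) & set(new)):
        if old[name] != new[name]:
            params = set(old[name]) | set(new[name])
            # A parameter counts as changed if it was added, removed, or retyped.
            changed[name] = sorted(
                p for p in params if old[name].get(p) != new[name].get(p)
            )
    return {"added": added, "deleted": deleted, "changed_params": changed}


# Hypothetical interface definitions of two software versions.
old_version = {
    "create_order": {"sku": "str", "qty": "int"},
    "legacy_report": {"day": "str"},
}
new_version = {
    "create_order": {"sku": "str", "qty": "int", "coupon": "str"},
    "refund": {"order_id": "str"},
}
change_record = diff_interfaces(old_version, new_version)
```

The resulting `change_record` is the kind of version change information that could then be written into the hybrid test feature table.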
[0241] In this embodiment, when the proportion of unexecutable generated data is too high, the feedback update module identifies the cause of the unexecutability based on the comprehensive verification result output by the test rule verification module.
[0242] Reasons for non-executability include incorrect parameter type, parameter value out of range, field dependency conflict, incorrect interface call order, and inconsistent business state.
[0243] The feedback update module writes the reason for non-executability into the rule validation sample set and reduces the generation weight of similar field combinations in the next round of model update.
[0244] At the same time, the diffusion condition construction module updates the conflict screening rules in the path constraints and defect triggering conditions according to the rule verification sample set, so that the candidate data generation module reduces the number of candidate software test data that violate business rules in the subsequent generation process.
[0245] By combining rule feedback with model updates, the system can gradually improve the executable ratio of candidate software test data.
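The rule feedback in [0243] can be sketched as a tally of rejection reasons plus a reduction of the generation weight of the rejected field combinations. The multiplicative `decay` factor, the reason labels, and the data layout are assumptions made for illustration; the application does not state how the weight is reduced.

```python
from collections import Counter


def update_generation_weights(weights: dict, rejected: list, decay: float = 0.8):
    """Lower the weight of each rejected field combination and count the reasons.

    weights:  field-combination key -> generation weight
    rejected: list of (field-combination key, reason) pairs
    decay:    hypothetical factor applied once per rejection
    """
    reason_counts = Counter(reason for _, reason in rejected)
    new_weights = dict(weights)
    for key, _ in rejected:
        new_weights[key] = new_weights.get(key, 1.0) * decay
    return new_weights, reason_counts


weights = {("payment_status", "order_status"): 1.0, ("amount",): 1.0}
rejected = [
    (("payment_status", "order_status"), "field_dependency_conflict"),
    (("payment_status", "order_status"), "inconsistent_business_state"),
    (("amount",), "value_out_of_range"),
]
new_weights, reason_counts = update_generation_weights(weights, rejected)
```

A combination that is rejected twice ends up with a lower weight than one rejected once, so frequently failing combinations are sampled less often in the next round.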
[0246] In this embodiment, for cases where low-coverage paths cannot be triggered for a long time, the feedback update module identifies the still uncovered code paths based on the coverage change results output by the test execution module, and rewrites the still uncovered code paths into the coverage gap condition data.
[0247] The diffusion condition construction module regenerates the coverage gap condition vector and path constraint condition vector based on the interface call order, input field combination, and similar historical paths corresponding to the still-uncovered paths.
[0248] The candidate data generation module generates new candidate software test data based on the updated condition vector.
[0249] For candidate samples that are repeatedly generated but still fail to trigger the target path, the system writes the corresponding failure reason into the path constraint feedback field, and adjusts the parameter combination and business state combination in subsequent generation, thereby gradually approaching the input conditions and execution conditions required for the uncovered path.
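The handling of paths that remain uncovered in [0246] to [0249] can be sketched as a set difference plus an attempt counter. The function name, the `max_attempts` limit, and the path identifiers are hypothetical; the application does not define when a path counts as repeatedly failing.

```python
def update_coverage_gaps(target_paths, covered_paths, attempts, max_attempts=3):
    """Return (still-uncovered paths, paths that need path constraint feedback).

    attempts maps a path identifier to the number of rounds it stayed uncovered
    and is updated in place.
    """
    uncovered = sorted(set(target_paths) - set(covered_paths))
    for path in uncovered:
        attempts[path] = attempts.get(path, 0) + 1
    # Paths that failed max_attempts times get their failure reason recorded.
    stuck = [path for path in uncovered if attempts[path] >= max_attempts]
    return uncovered, stuck


attempts = {"refund_rollback": 2}
targets = ["refund_rollback", "stock_concurrent_deduct", "payment_boundary"]
covered = ["payment_boundary"]
uncovered, stuck = update_coverage_gaps(targets, covered, attempts)
```

Here `uncovered` would be rewritten into the coverage gap condition data, while `stuck` marks the paths whose failure reasons would be written into the path constraint feedback field.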
[0250] In this embodiment, to address the issue of insufficient defect triggering capability, the feedback update module identifies high-risk parameter combinations, abnormal return codes, and key fields of the abnormal log based on the abnormal triggering results and defect triggering results.
[0251] The diffusion condition construction module writes the above information into the defect trigger condition data and updates the defect trigger condition weights based on the defect level, trigger stability, and number of recurrences.
[0252] The candidate data generation module incorporates the updated defect triggering condition vector during the reverse denoising generation process, making it easier for the generated data to cover similar input combinations of historical defects and potentially high-risk input combinations.
[0253] The test rule verification module performs rule verification on high-risk input combinations, retains executable abnormal input data, and deletes invalid abnormal data that has no execution meaning.
[0254] By triggering feedback updates based on defects, the system can improve its ability to construct abnormal scenarios and discover defects.
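Paragraph [0251] states that the defect trigger condition weight depends on defect level, trigger stability, and number of recurrences, without giving a formula. One possible scoring rule is sketched below; the `SEVERITY` scores, the logarithmic recurrence term, and the multiplicative form are all assumptions chosen only so that the weight grows with each of the three factors.

```python
import math

# Hypothetical severity scores per defect level.
SEVERITY = {"minor": 1.0, "major": 2.0, "critical": 3.0}


def defect_trigger_weight(level: str, trigger_stability: float, recurrences: int) -> float:
    """Combine defect level, trigger stability (0..1), and recurrence count."""
    if not 0.0 <= trigger_stability <= 1.0:
        raise ValueError("trigger_stability must be within [0, 1]")
    if recurrences < 0:
        raise ValueError("recurrences must be non-negative")
    # log1p dampens the effect of very large recurrence counts.
    return SEVERITY[level] * trigger_stability * (1.0 + math.log1p(recurrences))


w_critical = defect_trigger_weight("critical", 0.9, 2)
w_minor = defect_trigger_weight("minor", 0.9, 2)
w_baseline = defect_trigger_weight("major", 1.0, 0)
```

With this rule, a critical defect that triggers reliably and has recurred receives a larger weight than a minor defect with the same stability and recurrence count.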
[0255] Through data collaboration, condition guidance, rule verification, and feedback optimization among the above modules, this invention can transform historical test data, code coverage data, defect record data, and regression test data into a unified hybrid test feature table, and generate candidate software test data through the improved TabDDPM coverage-guided diffusion model.
[0256] Meanwhile, the test rule verification module improves the executability of the generated data, the test execution module verifies the coverage contribution and anomaly triggering capability of the generated data in the real test environment, and the feedback update module continuously updates the hybrid test feature table and generated model parameters.
[0257] Compared with test data generation methods that rely on manual writing or template replacement, this invention can increase the generation ratio of low-coverage path test data, boundary condition test data, and abnormal input test data, reduce the amount of non-executable test data, and improve the efficiency of software test data generation, abnormal scenario coverage, and software test coverage.
[0258] Example 1: To verify the feasibility of this invention in practice, it was applied to the automated testing environment of an enterprise business management software. This software includes modules such as user login, permission verification, order creation, inventory deduction, payment confirmation, refund processing, and report query, involving a total of 86 interfaces, 312 interface parameters, and 128 code branches. Existing testing methods mainly rely on manually writing test cases and generating data through template replacement. For scenarios such as permission anomalies, inventory boundaries, payment amount exceeding limits, order status conflicts, and abnormal interface call order, the efficiency of test data construction is low, and some low-coverage paths are difficult to trigger over a long period.
[0259] In this embodiment, the test big data analysis module collects historical test data from the test management platform, interface automated testing platform, code coverage collection tool, defect management system, and regression test records. This includes 118,000 interface call records, 26,500 historical test cases, 312 sets of code coverage statistics records, 1,840 defect records, and 7,200 regression test records. The system performs field cleaning, format standardization, time alignment, and duplicate record removal on the above data. It unifies the mapping of interface number, test case number, defect number, version number, and execution batch number, corrects approximately 2,100 records with inconsistent field types, deletes approximately 9,300 duplicate call records, and forms approximately 132,000 standardized test data records.
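The cleaning steps in [0259] (field cleaning, format standardization, time alignment, and duplicate removal) can be sketched on a few interface call records. The record fields, the use of the calendar date as the test batch, and the deduplication key are assumptions for illustration and do not reproduce the actual pipeline of this embodiment.

```python
def clean_records(records: list) -> list:
    """Clean, standardize, align, and deduplicate interface call records."""
    seen = set()
    cleaned = []
    for rec in records:
        # Field cleaning: drop records with missing metadata.
        if not rec.get("interface") or rec.get("status") is None or not rec.get("timestamp"):
            continue
        normalized = {
            "interface": rec["interface"].strip().lower(),  # format standardization
            "case_id": rec.get("case_id"),
            "status": int(rec["status"]),                   # fix inconsistent types, "200" -> 200
            "batch": rec["timestamp"][:10],                 # time alignment (assumed: one batch per day)
        }
        key = (normalized["interface"], normalized["case_id"],
               normalized["status"], normalized["batch"])
        if key in seen:                                     # duplicate removal
            continue
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


raw = [
    {"interface": "CreateOrder ", "case_id": "TC-1", "status": "200", "timestamp": "2026-01-05T10:00:00"},
    {"interface": "createorder", "case_id": "TC-1", "status": 200, "timestamp": "2026-01-05T10:00:00"},
    {"interface": "", "case_id": "TC-2", "status": 500, "timestamp": "2026-01-05T11:00:00"},
]
standardized = clean_records(raw)
```

The first two records differ only in formatting and field type, so they collapse into one standardized record, and the third is dropped because its interface name is missing.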
[0260] The test feature construction module establishes a test sample index based on standardized test data and extracts features such as interface parameters, boundary conditions, abnormal inputs, execution paths, coverage gaps, defect triggers, and test results to form a hybrid test feature table. Continuous fields include amount, quantity, response time, coverage difference, and parameter length, while discrete fields include interface category, user permissions, order status, payment status, exception type, and defect level. The diffusion condition construction module identifies 17 low-coverage functions, 36 insufficiently covered branches, and 24 low-coverage paths from the hybrid test feature table. For example, the refund status rollback path has a coverage rate of 42.6%, the inventory concurrent deduction exception path has a coverage rate of 38.4%, and the payment amount boundary path has a coverage rate of 45.1%. Simultaneously, it extracts defect triggering conditions such as duplicate order submissions when inventory is 0, payment amounts exceeding order amounts, ordinary users calling the administrator interface, and inconsistencies between refund and payment statuses, generating coverage gap conditions, path constraint conditions, and defect triggering conditions.
[0261] The candidate data generation module inputs a hybrid test feature table, along with coverage gap conditions, path constraints, and defect triggering conditions, into the improved TabDDPM coverage-guided diffusion model. The model performs numerical normalization on continuous test fields, class mapping on discrete test fields, and integrates coverage gaps, path constraints, and defect triggering conditions during the reverse denoising generation process, prioritizing generated data that is closer to low-coverage paths and high-risk parameter combinations. After generation and processing, the system obtains 12,000 candidate software test data entries.
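The numerical normalization of continuous fields and the class mapping of discrete fields mentioned in [0261], together with the inverse operations used for field restoration, can be sketched as follows. Min-max scaling and an integer index per category are assumed here as one common choice; the application does not state which normalization or mapping is used.

```python
def fit_min_max(values):
    """Return the (min, max) pair used to scale a continuous field."""
    return min(values), max(values)


def normalize(value, lo, hi):
    """Scale a continuous value into [0, 1]; constant fields map to 0.0."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def denormalize(scaled, lo, hi):
    """Inverse of normalize, used when restoring generated fields."""
    return lo + scaled * (hi - lo)


def fit_categories(values):
    """Map each distinct category to an integer index (sorted for determinism)."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


# Hypothetical continuous field (amount) and discrete field (order status).
amounts = [0.0, 49.9, 100.0]
lo, hi = fit_min_max(amounts)
scaled_amount = normalize(49.9, lo, hi)
restored_amount = denormalize(scaled_amount, lo, hi)

status_to_index = fit_categories(["paid", "created", "refunded", "paid"])
index_to_status = {index: status for status, index in status_to_index.items()}
```

Keeping the fitted `(lo, hi)` pair and the `index_to_status` map alongside the model is what makes it possible to convert generated feature vectors back into concrete parameter values.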
[0262] The test rule verification module validates the candidate software test data for parameter types, value ranges, field dependencies, interface call order, and business status. Verification revealed approximately 860 data entries with inconsistent parameter types, approximately 730 data entries with conflicting payment and order statuses, and approximately 420 data entries with incomplete interface call order. The system then performs deletion, replacement, field correction, or regeneration on the aforementioned data, ultimately yielding 10,350 executable software test data entries, including 3,600 entries for regular business process testing, 2,400 entries for boundary condition testing, 2,100 entries for abnormal input testing, 1,450 entries for low-coverage path testing, and 800 entries for historical defect regression testing.
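The five checks in [0262] (parameter type, value range, field dependency, interface call order, and business status) can be sketched as a function that returns the list of violated rules for one candidate. The concrete rules below, such as the quantity range of 1 to 999 and the `CALL_ORDER` sequence, are hypothetical stand-ins for the real rule set of the software under test.

```python
# Hypothetical required interface call order.
CALL_ORDER = ["login", "create_order", "confirm_payment"]


def verify(candidate: dict) -> list:
    """Return the names of violated rules; an empty list means executable."""
    violations = []

    qty = candidate.get("quantity")
    if isinstance(qty, bool) or not isinstance(qty, int):
        violations.append("parameter_type")
    elif not 1 <= qty <= 999:                      # assumed value range
        violations.append("value_range")

    # Assumed dependency: a cancelled order cannot be in the paid state.
    if candidate.get("payment_status") == "paid" and candidate.get("order_status") == "cancelled":
        violations.append("field_dependency")

    # The call sequence must be a prefix of the required order.
    calls = candidate.get("calls", [])
    if calls != CALL_ORDER[:len(calls)]:
        violations.append("call_order")

    # Assumed business rule: only paid orders can be refunded.
    if candidate.get("action") == "refund" and candidate.get("payment_status") != "paid":
        violations.append("business_state")

    return violations


good = {"quantity": 2, "payment_status": "paid", "order_status": "confirmed",
        "calls": ["login", "create_order", "confirm_payment"], "action": "pay"}
bad = {"quantity": "2", "payment_status": "paid", "order_status": "cancelled",
       "calls": ["create_order"], "action": "refund"}
good_result = verify(good)
bad_result = verify(bad)
```

Returning the violated rule names, rather than a single pass or fail flag, lets a later step decide between deletion, replacement, field correction, and regeneration.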
[0263] The test execution module inputs the executable software test data into the test execution environment of the software under test and collects interface return codes, response times, exception logs, assertion results, coverage changes, and defect triggering results. After execution, 11 functions and 28 branches were newly covered. The average coverage of low-coverage paths increased from 41.8% to 68.5%, and the overall branch coverage increased from 72.4% to 83.7%. Regarding exception triggering, the system found that 37 sets of abnormal input combinations could reliably trigger exception logs, of which 5 sets formed new defect records, involving payment amount precision handling exceptions, inconsistent inventory rollback status, duplicate-submission exceptions in the refund interface, and permission verification exceptions.
[0264] The feedback update module adds execution result tags, coverage contribution tags, anomaly trigger tags, and defect trigger tags to the executable software test data based on the test execution results, and writes the feedback-annotated test sample data into the hybrid test feature table. For samples with significant coverage improvements, the system increases the generation weight of the corresponding coverage gap conditions; for samples that trigger anomalies or defects, the system writes the trigger parameter combination and anomaly path into the defect trigger conditions, and updates the parameters of the improved TabDDPM coverage-guided diffusion model. This embodiment demonstrates that the present invention can generate software test data with high executability and strong ability to cover low-coverage paths, and continuously optimizes the subsequent generation process through execution feedback.
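The four feedback tags named in [0264] can be illustrated with a small labeling function. The structure of the execution result (`assertions_passed`, `newly_covered_branches`, `exception_log`, `matched_defect_id`) and the sample values are hypothetical; the application does not define a concrete data format.

```python
def label_sample(sample: dict, result: dict) -> dict:
    """Attach execution, coverage, anomaly, and defect tags to one test sample."""
    tags = {
        "execution_result": "pass" if result["assertions_passed"] else "fail",
        # Coverage contribution measured as the number of newly covered branches.
        "coverage_contribution": len(result["newly_covered_branches"]),
        "anomaly_trigger": bool(result["exception_log"]),
        "defect_trigger": result["matched_defect_id"] is not None,
    }
    return {**sample, **tags}


sample = {"case_id": "GEN-0001", "amount": 100.01}
result = {
    "assertions_passed": False,
    "newly_covered_branches": ["refund.rollback#3"],
    "exception_log": "DecimalPrecisionError",
    "matched_defect_id": None,
}
labeled = label_sample(sample, result)
```

A labeled record of this kind is what would be appended to the hybrid test feature table before the conditions are rebuilt and the model parameters are updated.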
[0265] Table 1 Comparison of Software Test Data Generation Effects Based on the Invention
[0266]
Comparison item | Traditional manual / template generation method | System of this invention | Result
Total number of test data entries generated | 4,200 | 12,000 | Increased generation scale
Number of executable test data entries | 3,180 | 10,350 | Significantly more executable data
Test data executability | 75.7% | 86.3% | An increase of 10.6 percentage points
Number of low-coverage path test data entries | 360 | 1,450 | More low-coverage path samples
Number of boundary condition test data entries | 820 | 2,400 | Enhanced coverage of boundary scenarios
Number of abnormal input test data entries | 690 | 2,100 | Enhanced abnormal input construction
Number of historical defect regression test data entries | 410 | 800 | Increased defect regression coverage
Overall branch coverage | 72.4% | 83.7% | An increase of 11.3 percentage points
Average coverage of low-coverage paths | 41.8% | 68.5% | An increase of 26.7 percentage points
Number of newly covered functions | 3 | 11 | More newly covered functions
Number of newly covered branches | 9 | 28 | More newly covered branches
Number of abnormal combinations that trigger stably | 14 groups | 37 groups | Enhanced ability to trigger exceptions
Number of new defects triggered | 1 | 5 | Improved defect detection capability
Number of data entries with parameter type errors | 620 | 860 candidate entries verified, corrected, or deleted | Candidate data is processed through rule validation
Number of data entries with conflicting field status | 510 | 730 candidate entries verified, corrected, or deleted | Field dependency issues were identified
Number of data entries with incomplete call order | 390 | 420 candidate entries verified, corrected, or deleted | Interface order issues were identified
Number of valid samples stored after generation | 3,180 | 10,350 | More effective test samples
Support for feedback updates | Not supported | Supported | Forms a closed-loop optimization
[0267] As shown in Table 1, compared with traditional manual or template-based generation methods, the system of this invention significantly improves the scale of test data generation, executability, coverage, and defect triggering capability. The traditional method generates a total of 4200 test data entries, with 3180 executable, resulting in an execution rate of 75.7%. The system of this invention generates 12000 candidate test data entries, and after rule validation, 10350 executable test data entries are obtained, achieving an execution rate of 86.3%. This result demonstrates that through big data analysis, construction of a hybrid test feature table, and validation of parameter types, field dependencies, call order, and business status, this invention can expand the scale of test data while reducing the proportion of invalid samples, making the generated data more suitable for execution in a real testing environment.
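The execution rates quoted in [0267] follow directly from the counts in Table 1, and the short check below reproduces them. It uses `Decimal` with half-up rounding because 10,350 / 12,000 is exactly 86.25%, which Python's built-in `round` would round to 86.2 rather than the 86.3 reported here; the function name `executability` is an illustrative choice.

```python
from decimal import Decimal, ROUND_HALF_UP


def executability(executable: int, generated: int) -> Decimal:
    """Percentage of executable entries, rounded half-up to one decimal place."""
    pct = Decimal(100) * executable / generated
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


traditional = executability(3180, 4200)    # counts for the traditional method
invention = executability(10350, 12000)    # counts for the system of this invention
gain = invention - traditional             # difference in percentage points

# The five categories of executable data in [0262] add up to the executable total.
category_total = 3600 + 2400 + 2100 + 1450 + 800
```

The computed values are 75.7%, 86.3%, and a gain of 10.6 percentage points, and the category counts sum to 10,350, which matches the figures in the table.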
[0268] Regarding coverage improvement, the traditional method achieves an overall branch coverage rate of 72.4% and an average coverage rate of 41.8% for low-coverage paths. The system of this invention improves these rates to 83.7% and 68.5% respectively, with the number of newly covered functions increasing from 3 to 11 and the number of newly covered branches increasing from 9 to 28. These changes are related to the introduction of coverage gap conditions and path constraints in this invention. The improved TabDDPM coverage-guided diffusion model no longer generates data solely based on historical sample distribution during the reverse denoising process. Instead, it combines uncovered code units, low-coverage paths, and execution path relationships to generate more targeted test samples, thus effectively improving the reach of low-coverage areas.
[0269] Regarding abnormal inputs and defect triggers, the system of this invention generates 2400 boundary condition test data, 2100 abnormal input test data, and 800 historical defect regression test data, respectively, all significantly higher than traditional methods. The number of stable triggering abnormal combinations increased from 14 to 37, and the number of new defect triggers increased from 1 to 5, indicating that the introduction of defect triggering conditions during the generation process can increase the occurrence rate of high-risk parameter combinations. Data with incorrect parameter types, conflicting field states, and incomplete call order were identified in the candidate stage and corrected or deleted through rule verification, reflecting that this invention does not simply pursue the quantity of generated data, but rather improves the executability and business consistency of test data through rule constraints after generation.
[0270] Furthermore, traditional methods do not support feedback updates, while the system of this invention can write coverage changes, anomaly triggers, and defect triggers into a hybrid test feature table and update the improved TabDDPM coverage-guided diffusion model. This closed-loop mechanism allows subsequent test data generation to continuously adjust its direction based on actual execution results, which is beneficial for increasing the generation ratio of low-coverage path samples and defect-triggered samples. Based on the data in the table, this invention demonstrates significant improvements in test data scale, number of effective samples, coverage enhancement, and defect detection capabilities, validating the application value of a software test data generation system based on big data analysis and a coverage-guided diffusion generation mechanism.
[0271] The above are merely preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. A software test data generation system based on big data analysis, characterized in that it comprises: a test big data analysis module, which collects interface call data, test case data, code coverage data, defect record data, and regression test data of the software under test, and performs cleaning, unification, and alignment processing on the collected data to generate standardized test data; a test feature construction module, which extracts interface parameter, boundary condition, abnormal input, execution path, coverage gap, defect triggering, and test result features based on the standardized test data, and generates a hybrid test feature table; a diffusion condition construction module, which performs condition encoding based on the coverage gap, execution path, and defect triggering features in the hybrid test feature table to generate coverage gap conditions, path constraint conditions, and defect triggering conditions; a candidate data generation module, which inputs the hybrid test feature table, coverage gap conditions, path constraint conditions, and defect triggering conditions into the improved TabDDPM coverage-guided diffusion model, and obtains candidate software test data after noise mapping, condition fusion, and reverse denoising generation; a test rule verification module, which verifies the parameter types, value ranges, field dependencies, call order, and business status of the candidate software test data, and generates executable software test data; a test execution module, which inputs the executable software test data into the software under test to perform tests, collects test execution results, coverage change results, anomaly trigger results, and defect trigger results, and generates test execution feedback data; and a feedback update module, which updates the hybrid test feature table based on the test execution feedback data and updates the parameters of the improved TabDDPM coverage-guided diffusion model to generate an updated software test data generation model.
2. The software test data generation system based on big data analysis according to claim 1, characterized in that, The test big data analysis module specifically includes: Collect interface call data, test case data, code coverage data, defect record data, and regression test data generated during the historical testing process of the software under test to form a historical test data set; Perform field cleaning on the historical test data set, remove or correct records with missing metadata, duplicate records and records with inconsistent field types, and generate field-cleaned test data. After cleaning the fields, the test data is processed to unify the format, converting the interface name, parameter name, return status, path number, defect number, and test result fields into a unified field format to generate test data with a unified format. Time alignment processing is performed on the uniformly formatted test data to map the interface call time, test case execution time, coverage collection time, defect submission time and regression test time to a unified test batch, generating time-aligned test data. Duplicate records are removed from the time-aligned test data to generate standardized test data.
3. The software test data generation system based on big data analysis according to claim 1, characterized in that, The test feature construction module specifically includes: Acquire standardized test data and build a test sample index according to interface identifier, test case identifier, version identifier and test batch identifier, and generate a test sample index set; Based on the test sample index set, extract interface parameter information, boundary value information, abnormal input information, execution path information, coverage gap information, defect triggering information and test result information, and generate interface parameter features, boundary condition features, abnormal input features, execution path features, coverage gap features, defect triggering features and test result features respectively; The interface parameter features, boundary condition features, abnormal input features, execution path features, coverage gap features, defect triggering features, and test result features are processed according to the test sample index set to perform row and column organization and field alignment to generate a hybrid test feature table.
4. The software test data generation system based on big data analysis according to claim 1, characterized in that, The diffusion condition construction module specifically includes: Obtain the hybrid test feature table and read the coverage gap feature, execution path feature, and defect trigger feature from the hybrid test feature table; Based on the coverage gap feature, information on uncovered code units, low-coverage paths, and coverage differences are extracted to generate coverage gap condition data; Based on the execution path feature extraction, the interface call order, function call chain, branch jump relationship and exception return path are extracted to generate path constraint condition data; Based on defect triggering features, defect type, triggering parameter combination, failure return code and repair status are extracted to generate defect triggering condition data; Numerical normalization, category encoding, and vector concatenation are performed on the coverage gap condition data, path constraint condition data, and defect triggering condition data, respectively, to generate coverage gap condition vector, path constraint condition vector, and defect triggering condition vector.
5. The software test data generation system based on big data analysis according to claim 1, characterized in that, The candidate data generation module specifically includes: Obtain the hybrid test feature table, coverage gap condition vector, path constraint condition vector, and defect triggering condition vector, and identify the continuous test field and discrete test field in the hybrid test feature table; Perform numerical normalization on continuous test fields to generate continuous field encoded data; Perform category mapping processing on discrete test fields to generate discrete field encoded data; Continuous field encoded data and discrete field encoded data are input into the improved TabDDPM coverage guided diffusion model, and noise mapping processing is performed according to a preset noise scheduling sequence to generate test feature noise variables. The conditional noise variables are generated by performing conditional fusion processing on the coverage gap condition vector, path constraint condition vector, and defect triggering condition vector with the test feature noise variables. Perform inverse denoising and generation processing on the conditional noise variables to generate candidate test feature data; Perform field restoration processing on the candidate test feature data to obtain candidate software test data.
6. The software test data generation system based on big data analysis according to claim 5, characterized in that, The improved TabDDPM coverage-guided diffusion model specifically includes: The field encoding unit receives continuous field encoded data and discrete field encoded data, performs numerical embedding processing on the continuous field encoded data, and performs category embedding processing on the discrete field encoded data to generate test field embedded data. The noise scheduling unit receives test field embedded data and a preset noise scheduling sequence, performs step-by-step noise mapping processing on the test field embedded data, and generates test feature noise variables corresponding to different noise steps. The condition fusion unit receives the coverage gap condition vector, the path constraint condition vector, and the defect triggering condition vector, and performs weight mapping, conflict filtering, and splicing fusion processing on the coverage gap condition vector, the path constraint condition vector, and the defect triggering condition vector to generate coverage guidance condition embedding data. The denoising network unit receives test feature noise variables, noise step encoded data, and coverage guidance condition embedding data, performs stepwise reverse denoising on the test feature noise variables, and generates candidate test feature data. The field restoration unit performs inverse normalization on continuous fields in the candidate test feature data and inverse category mapping on discrete fields in the candidate test feature data to generate candidate software test data.
7. The software test data generation system based on big data analysis according to claim 6, characterized in that the denoising network unit specifically includes: obtaining the test feature noise variables, noise step encoded data, and coverage guidance condition embedding data corresponding to the current noise step; concatenating and mapping the test feature noise variables and the noise step encoded data to generate noise state embedding data; injecting the coverage guidance condition embedding data into the noise state embedding data to generate coverage-guided noise state data; predicting, based on the coverage-guided noise state data, the noise residuals corresponding to the continuous test fields to obtain continuous field residual prediction data; predicting, based on the coverage-guided noise state data, the category distribution corresponding to the discrete test fields to obtain discrete field category prediction data; performing, based on the continuous field residual prediction data and the discrete field category prediction data, a reverse denoising update on the test feature noise variables of the current noise step to generate the test feature variables corresponding to the next noise step; and outputting candidate test feature data when the reverse denoising update reaches the termination noise step.
8. The software test data generation system based on big data analysis according to claim 1, characterized in that, The test rule verification module specifically includes: Acquire candidate software test data and read interface parameter type rules, parameter value range rules, field dependency rules, interface call order rules, and business status rules; Based on interface parameter type rules, parameter value range rules, field dependency rules, interface call order rules, and business status rules, parameter type verification, parameter range verification, field dependency verification, interface order verification, and business status verification are performed on candidate software test data to generate comprehensive verification results. Based on the comprehensive verification results, the candidate software test data is processed by deletion, replacement, field correction, or regeneration to generate executable software test data.
9. The software test data generation system based on big data analysis according to claim 1, characterized in that the test execution module specifically includes: obtaining the executable software test data and generating a test execution task set according to interface identifier, test case identifier, and execution batch identifier; inputting the test execution task set into the test execution environment of the software under test, and performing interface calls, business process execution, and abnormal input handling on the executable software test data to obtain test execution process data; acquiring interface return codes, response times, exception logs, assertion results, and test pass status from the test execution process data to generate test execution results; collecting function coverage, branch coverage, path coverage, and newly covered nodes from the test execution process data to generate coverage change results; collecting exception type, exception location, exception stack, and abnormal input combination from the test execution process data to generate anomaly trigger results; matching the test execution process data against the defect record data to generate defect trigger results; and summarizing the test execution results, coverage change results, anomaly trigger results, and defect trigger results to generate test execution feedback data.
10. The software test data generation system based on big data analysis according to claim 1, characterized in that the feedback update module specifically includes: acquiring the test execution feedback data, and extracting the test pass status, coverage change results, anomaly trigger results, and defect trigger results from the test execution feedback data; adding, based on the test pass status, coverage change results, anomaly trigger results, and defect trigger results, execution result tags, coverage contribution tags, anomaly trigger tags, and defect trigger tags to the executable software test data to generate feedback-annotated test sample data; writing the feedback-annotated test sample data into the hybrid test feature table to generate an updated hybrid test feature table; reconstructing the coverage gap conditions, path constraint conditions, and defect triggering conditions based on the updated hybrid test feature table; and updating, based on the updated hybrid test feature table, coverage gap conditions, path constraint conditions, and defect triggering conditions, the parameters of the improved TabDDPM coverage-guided diffusion model to generate the updated software test data generation model.