Information system cost evaluation method based on large language model and deep learning

By combining large language models with deep learning, an information system cost prediction model is constructed, which solves the problems of nonlinear coupling and omission of implicit cost factors in the existing technology of information system cost assessment, and realizes high-precision intelligent assessment and decision guidance.

CN122243593A · Pending · Publication Date: 2026-06-19 · GUANGDONG TOBACCO ZHAOQING

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUANGDONG TOBACCO ZHAOQING
Filing Date
2026-03-19
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing information system cost assessment methods struggle to capture nonlinear coupling relationships when dealing with complex projects, lack the ability to analyze deep semantic information in requirements documents, resulting in inaccurate assessment results and the omission of implicit cost factors, and an inability to effectively utilize historical data for accurate assessment.

Method used

The invention adopts a method based on a large language model and deep learning. It constructs a historical project cost database, extracts multi-dimensional cost influencing factors from it, and combines them with deep semantic parsing of the requirements document to generate a comprehensive feature representation. A cost prediction model then performs a nonlinear mapping on this representation and outputs the predicted cost value and a confidence interval.

Benefits of technology

It achieves high-precision, interpretable, and intelligent assessment of information system costs, can identify key driving factors, and provides a comprehensive cost assessment report, including cost breakdown, analysis of major influencing factors, and risk warnings.



Abstract

This invention belongs to the field of intelligent cost assessment technology and discloses a method for cost assessment of information systems based on a large language model and deep learning. The method includes: collecting cost data from various historical projects, establishing a historical project cost database, and extracting a multi-dimensional cost influencing factor feature set from it; inputting the requirement document of the project to be assessed into a pre-constructed large language model, extracting the requirement semantic feature vector, and aligning it with the multi-dimensional cost influencing factor feature set to construct a comprehensive feature representation; constructing a cost prediction model based on the multi-dimensional cost influencing factor feature set and the actual cost results of various historical projects; inputting the comprehensive feature representation of the project to be assessed into the cost prediction model to obtain the predicted cost value and confidence interval, and combining it with the requirement semantic feature vector to generate a cost assessment report. This invention can provide a high-precision, interpretable, and decision-guiding intelligent assessment scheme for the cost assessment of information system construction projects.

Description

Technical Field

[0001] This invention relates to the field of intelligent cost assessment technology, and more specifically, to a method for cost assessment of information systems based on large language models and deep learning.

Background Technology

[0002] Information system cost assessment is a crucial step in the initiation, bidding, and budgeting of IT projects. Accurate cost assessment directly impacts the project's feasibility analysis and resource allocation efficiency. Currently, information system cost assessment mainly relies on expert judgment and analogical estimation methods. This involves experienced assessors subjectively predicting the cost of the project under assessment by referring to historical cost data of similar projects and combining their own experience. Meanwhile, some assessment practices have introduced parametric estimation models, using linear regression calculations based on basic parameters such as the number of function points and development cycle, which has improved the standardization of assessments to some extent. However, these methods have significant limitations when dealing with complex information system projects. Information system costs are influenced by multiple factors, including demand complexity, technical architecture selection, development cycle, and human resource costs. Traditional linear parametric models struggle to capture the non-linear coupling relationships between these factors, leading to significant discrepancies between the assessment results and the actual cost.

[0003] Furthermore, requirement documents for information system projects often contain a large number of implicit technical requirements, quality constraints, and potential risk factors. This information is scattered throughout the document in natural language form, making it difficult to fully extract and quantify through manual reading or simple keyword searches. Existing parameter estimation models mainly calculate costs based on explicit functional module quantity and scale indicators, lacking the ability to analyze the deep semantic information of requirement documents. This leads to the omission of implicit cost factors and insufficient completeness and accuracy of the evaluation results. At the same time, although various industries have accumulated a large amount of historical project cost data, existing evaluation methods lack the ability to deeply mine and learn features from historical data, failing to effectively utilize the cost patterns contained in historical projects to provide data-driven reference for the evaluation of new projects. Therefore, there is an urgent need for an intelligent cost evaluation method that can integrate the semantic understanding capabilities of large language models with the predictive capabilities of deep learning. Through deep semantic analysis of requirement documents and feature learning from historical cost data, it can achieve accurate evaluation and intelligent analysis of information system costs.

[0004] In view of this, the present invention proposes an information system cost evaluation method based on large language models and deep learning to solve the above problems.

Summary of the Invention

[0005] To overcome the aforementioned shortcomings of existing technologies and achieve the above objectives, this invention provides the following technical solution: a method for evaluating the cost of information systems based on large language models and deep learning, comprising:
Step S1: Collect cost data for each historical project, and clean, label, and structure the cost data to establish a historical project cost database.
Step S2: Analyze the cost influencing factors based on the historical project cost database, and extract feature vectors for functional complexity, technical difficulty, labor cost, and time cost to form a multi-dimensional cost influencing factor feature set.
Step S3: Input the requirements document of the project to be evaluated into the pre-built large language model, extract the explicit functional requirements, implicit technical requirements, quality constraints, and risk factors in the requirements document, generate the requirements semantic feature vector, and align the requirements semantic feature vector with the multi-dimensional cost influencing factor feature set to construct a comprehensive feature representation of the project to be evaluated.
Step S4: Based on the feature vectors in the multi-dimensional cost influencing factor feature set and the actual cost results of each historical project in the historical project cost database, construct a cost prediction model by learning the nonlinear mapping relationship between the feature vector combination and the actual cost results.
Step S5: Input the comprehensive feature representation of the project to be evaluated into the cost prediction model to obtain the predicted cost value and confidence interval. Combined with the requirements semantic feature vector, generate a cost assessment report that includes the cost prediction results, cost composition decomposition, analysis of major influencing factors, risk warnings, and optimization suggestions.

[0006] Furthermore, methods for establishing a historical project cost database include: Cost data for each historical project is collected. The cost data includes basic project information, requirements documents, functional module lists, technical architecture solutions, human resource allocation data, development cycle records, and actual cost results. The actual cost results include the actual settlement amount and cost composition details. The cost composition details include the proportion of labor costs, hardware procurement costs, software licensing costs, implementation and deployment costs, and project management costs. Each cost data point is cleaned to form a cleaned cost dataset; each cost data point in the cleaned cost dataset is labeled to form a labeled dataset; each cost data point in the labeled dataset is structured to form structured indicators for functional scale, technical architecture, human resources, and time dimension, and these are integrated with the actual cost results of the corresponding cost data to form a structured cost record for each historical project; the structured cost records of all historical projects are summarized to establish a historical project cost database; Methods for forming a feature set of multidimensional cost influencing factors include: The functional complexity feature vector, technical difficulty feature vector, labor cost feature vector, and time cost feature vector of each historical project are correlated and integrated to form a multi-dimensional cost influencing factor feature record for each historical project; the multi-dimensional cost influencing factor feature records of all historical projects are summarized to form a multi-dimensional cost influencing factor feature set.

[0007] Furthermore, methods for generating demand semantic feature vectors include: Obtain the requirements document for the project to be evaluated and mark it as the evaluation document; input the evaluation document into the large language model, and extract explicit functional requirements, implicit technical requirements, quality constraints, and risk factors in sequence; obtain quantitative indicators for the explicit functional requirements, implicit technical requirements, quality constraints, and risk factors; arrange these quantitative indicators in order to form the demand semantic feature vector of the project; perform cost sensitivity calibration on the demand semantic feature vector to obtain the calibrated demand semantic feature vector. The method for calibrating the cost sensitivity of the demand semantic feature vector is as follows: mark the requirements documents of historical projects as historical documents, input all historical documents into the large language model to obtain the demand semantic feature vector of each historical project, and mark it as a historical demand feature vector; for each dimension in the historical demand feature vector, calculate the cost correlation degree of that dimension based on the values of all historical projects in that dimension and the corresponding actual settlement amounts, and calculate the cost sensitivity coefficient of that dimension based on the cost correlation degree; multiply the value of each dimension in the demand semantic feature vector by the corresponding cost sensitivity coefficient to obtain the calibrated demand semantic feature vector.
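The calibration described in [0007] can be illustrated with a minimal sketch. The text does not give formulas for the cost correlation degree or the cost sensitivity coefficient, so the code below assumes the Pearson correlation for the former and its absolute value for the latter. The function names are hypothetical and chosen only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def sensitivity_coefficients(historical_vectors, settlement_amounts):
    """One coefficient per dimension of the historical demand feature vectors.

    Assumption: coefficient = |Pearson correlation with the settlement amount|.
    """
    dims = len(historical_vectors[0])
    return [
        abs(pearson([vec[d] for vec in historical_vectors], settlement_amounts))
        for d in range(dims)
    ]

def calibrate(vector, coefficients):
    """Element-wise product of a demand semantic feature vector and the coefficients."""
    return [value * coef for value, coef in zip(vector, coefficients)]
```

Under these assumptions, a dimension that never varies across historical projects receives a coefficient of zero and drops out of the calibrated vector, while a dimension that tracks the settlement amount closely is passed through almost unchanged.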

[0008] Furthermore, methods for constructing a comprehensive feature representation of the project to be evaluated include: The calibrated demand semantic feature vector is marked as the demand calibration feature vector. The cosine similarity between the demand calibration feature vector of the project to be evaluated and the historical demand feature vector of each historical project is calculated to obtain the semantic similarity for each historical project. All historical projects are sorted from high to low semantic similarity, and a preset number of top-ranked historical projects are selected as anchor projects, forming the anchor project set. The multi-dimensional cost influencing factor feature record corresponding to each anchor project in the anchor project set is obtained from the multi-dimensional cost influencing factor feature set. Based on the semantic similarity of each anchor project in the anchor project set, the similarity weight of each anchor project is calculated. Using these similarity weights, the elements of each feature vector in the anchor projects' multi-dimensional cost influencing factor feature records are weighted and summed to obtain the estimated functional complexity feature vector, estimated technical difficulty feature vector, estimated labor cost feature vector, and estimated time cost feature vector. These are then concatenated with the demand calibration feature vector to form the comprehensive feature representation of the project to be evaluated.
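The anchor matching in [0008] can be sketched as follows. The text does not state how similarity weights are derived from the semantic similarities, so this sketch assumes the similarities are non-negative and normalizes them to sum to one. Each feature record is represented as a single flat list standing in for the four concatenated feature vectors, and all names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_anchors(query, historical, k):
    """Return the k (project_id, similarity) pairs with the highest similarity."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in historical.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def similarity_weights(anchors):
    """Assumption: weights are similarities normalized to sum to 1."""
    total = sum(sim for _, sim in anchors)
    if total <= 0:
        return {pid: 1.0 / len(anchors) for pid, _ in anchors}
    return {pid: sim / total for pid, sim in anchors}

def estimate_features(weights, feature_records):
    """Similarity-weighted sum of the anchors' feature records."""
    length = len(feature_records[next(iter(weights))])
    estimate = [0.0] * length
    for pid, weight in weights.items():
        for i, value in enumerate(feature_records[pid]):
            estimate[i] += weight * value
    return estimate

def comprehensive_representation(query, historical, feature_records, k):
    """Estimated structured features concatenated with the demand calibration vector."""
    weights = similarity_weights(select_anchors(query, historical, k))
    return estimate_features(weights, feature_records) + list(query)
```

Because the estimated structured features are a weighted average over the anchors, a project with no structured cost data of its own still receives values in the same feature space the prediction model was trained on.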

[0009] Furthermore, methods for constructing cost prediction models include: Obtain the historical demand feature vector and cost sensitivity coefficients for each dimension of each historical project, and calculate the demand calibration feature vector for each historical project based on the cost sensitivity coefficients for each dimension. Concatenate the multi-dimensional cost influencing factor feature records of each historical project with the demand calibration feature vector to form the training input feature vector for each historical project. For each historical project, calculate the corresponding cost benchmark anchor value and cost residual ratio, and use the cost residual ratio as the cost prediction label. Obtain the cost composition details from the actual cost results of each historical project to form the cost composition label vector for each historical project. Integrate the training input feature vector, cost prediction label, and cost composition label vector for each historical project to form the training sample for each historical project. Summarize the training samples of all historical projects to form the model training sample set. Construct a cost prediction network, perform joint training on the cost prediction network based on the model training sample set to obtain the batch joint training loss; iteratively update the model parameters in the cost prediction network based on the batch joint training loss until the batch joint training loss converges to the preset loss convergence threshold or the number of training iterations reaches the preset maximum number of iterations, and use the trained cost prediction network as the cost prediction model.
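The construction of a training sample in [0009] can be sketched as below. This passage does not define how the cost benchmark anchor value or the cost residual ratio is computed. The sketch assumes the anchor value is a similarity-weighted mean of the anchors' actual settlement amounts, and that the residual ratio is the relative deviation of the actual cost from that anchor value, which is the inverse of the reconstruction in [0011] (anchor value plus residual times anchor value). All names are hypothetical.

```python
def benchmark_anchor_value(anchor_costs, weights):
    """Assumption: similarity-weighted mean of the anchors' settlement amounts."""
    total_weight = sum(weights)
    return sum(c * w for c, w in zip(anchor_costs, weights)) / total_weight

def residual_ratio(actual_cost, anchor_value):
    """Assumption: relative deviation of the actual cost from the anchor value."""
    return (actual_cost - anchor_value) / anchor_value

def build_training_sample(feature_record, calibrated_vector, actual_cost,
                          anchor_value, composition):
    """Bundle the input vector, residual label, and cost composition label."""
    return {
        "input": list(feature_record) + list(calibrated_vector),
        "residual_label": residual_ratio(actual_cost, anchor_value),
        "composition_label": list(composition),
    }
```

Predicting a ratio rather than an absolute amount keeps labels from projects of very different sizes on a comparable scale, which is the motivation given in [0015].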

[0010] Furthermore, methods for constructing cost prediction models also include: The cost prediction network comprises a dimensional gated routing layer, a shared feature extraction layer, a cost residual prediction branch, and a cost composition prediction branch. The dimensional gated routing layer, consisting of fully connected layers and normalized activation functions, calculates five-dimensional gate weights and gated routing feature vectors based on the training input feature vectors. The shared feature extraction layer, connected to the dimensional gated routing layer, contains multiple fully connected layers and nonlinear activation functions, performing deep nonlinear feature transformations on the gated routing feature vectors to output a shared deep feature representation. The cost residual prediction branch, connected to the shared feature extraction layer, includes a residual value prediction head and a residual variance prediction head. The residual value prediction head outputs the predicted residual values, and the residual variance prediction head outputs the predicted residual variance values. The cost composition prediction branch, connected to the shared feature extraction layer, outputs five cost percentage prediction values: labor cost percentage, hardware procurement cost percentage, software licensing cost percentage, implementation and deployment cost percentage, and project management cost percentage.
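The dimensional gated routing layer in [0010] can be illustrated with a small forward pass in plain Python. The text says only that the layer consists of fully connected layers and a normalized activation function. The sketch assumes a single fully connected layer followed by softmax, and assumes the input splits into five blocks (the four cost factor feature vectors plus the demand calibration feature vector), each scaled by its own gate weight. The names and the block split are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def gated_routing(blocks, gate_weights, gate_bias):
    """Compute one gate weight per block and scale each block by its gate.

    blocks: list of five feature blocks (lists of floats).
    gate_weights: 5 rows, each as long as the concatenated input.
    """
    x = [value for block in blocks for value in block]
    gates = softmax(linear(x, gate_weights, gate_bias))
    routed = [gate * value for gate, block in zip(gates, blocks) for value in block]
    return gates, routed
```

With softmax as the normalizing activation, the five gate weights sum to one, so they can be read directly as the relative contribution of each dimension when the report later ranks cost drivers.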

[0011] Furthermore, methods for obtaining cost forecasts and confidence intervals include: The comprehensive characteristic representation of the project to be evaluated is input into the cost prediction model. The cost prediction model outputs the residual prediction value, residual variance prediction value, five cost percentage prediction values, and five dimension gate weight values. Based on the anchor point project set, the cost benchmark anchor value of the project to be evaluated is calculated. The product of the residual prediction value and the cost benchmark anchor value is calculated to obtain the residual adjustment amount. The sum of the cost benchmark anchor value and the residual adjustment amount is calculated to obtain the cost prediction value. Based on the residual variance prediction value and the cost benchmark anchor value, the confidence interval of the project to be evaluated is calculated, and the anchor point consistency reverse verification correction is performed on the confidence interval. 
The method for performing anchor point consistency backtesting correction on the confidence interval is as follows: For each anchor project in the anchor project set, the training input feature vector of that anchor project is input into the cost prediction model to obtain the corresponding residual prediction value, which is marked as the anchor residual prediction value; based on the anchor residual prediction value and the corresponding cost prediction label of each anchor project, the anchor backtesting deviation is calculated; the mean of the anchor backtesting deviations over all anchor projects is calculated to obtain the average anchor backtesting deviation; a backtesting deviation amplification factor is preset, and the average anchor backtesting deviation, the backtesting deviation amplification factor, and the cost benchmark anchor value are multiplied together to obtain the empirical correction amount; based on the empirical correction amount, the lower and upper bounds of the confidence interval are updated in turn.
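The prediction and interval correction in [0011] can be sketched as follows. The cost reconstruction (anchor value plus residual times anchor value) and the empirical correction amount (mean deviation times amplification factor times anchor value) follow the text. The interval formula and the deviation measure are not specified, so the sketch assumes a normal-approximation interval with a hypothetical multiplier `z`, takes the deviation as the absolute difference between prediction and label, and widens the interval symmetrically. All names are hypothetical.

```python
import math

def predict_cost(anchor_value, residual_pred):
    """Cost prediction = anchor value + residual prediction * anchor value."""
    return anchor_value + residual_pred * anchor_value

def confidence_interval(cost, anchor_value, residual_var, z=1.96):
    """Assumption: half-width = z * sqrt(residual variance) * anchor value."""
    half_width = z * math.sqrt(residual_var) * anchor_value
    return cost - half_width, cost + half_width

def backtest_correction(interval, anchor_residual_preds, anchor_labels,
                        amplification, anchor_value):
    """Widen the interval by the empirical correction amount.

    Assumption: the per-anchor deviation is |prediction - label|.
    """
    deviations = [abs(p - y) for p, y in zip(anchor_residual_preds, anchor_labels)]
    mean_deviation = sum(deviations) / len(deviations)
    correction = mean_deviation * amplification * anchor_value
    lower, upper = interval
    return lower - correction, upper + correction
```

The correction only ever widens the interval: if the model backtests perfectly on the anchors, the correction is zero and the variance-based interval is returned unchanged.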

[0012] Furthermore, methods for generating cost assessment reports include: Based on the five cost percentage predictions and the project cost forecast, a cost composition decomposition result is generated. Based on the five-dimensional gating weight values and the anchor project set, a layer-by-layer source analysis of cost drivers is performed, generating the analysis result of major influencing factors. Based on the demand calibration feature vector, residual variance predictions, and the cost composition decomposition result, multi-source risk cross-validation and risk transmission link deduction are performed, generating risk warning information. Based on the risk warning information, cost composition decomposition result, and major influencing factor analysis result, a cost optimization path deduction and coupled linkage optimization suggestion generation are performed, forming optimization suggestions. The project cost forecast, confidence interval, cost composition decomposition result, major influencing factor analysis result, risk warning information, and optimization suggestions are integrated to generate a cost assessment report.
The method for generating the analysis results of the main influencing factors is as follows: Based on the gating weight values of the five dimensions, the primary and secondary cost-driving dimensions are determined; for the primary and secondary cost-driving dimensions, fine-grained attribution calculations are performed within each dimension to obtain the key driving elements of the primary and secondary dimensions; the cost-driving difference index between the project to be evaluated and the anchor project set is calculated; for the primary and secondary cost-driving dimensions, cost-driving coupling analysis between dimensions is performed to calculate the coupling coefficient between dimensions; the primary cost-driving dimension, secondary cost-driving dimension, key driving elements of the primary dimension, key driving elements of the secondary dimension, cost-driving difference index, and coupling coefficient between dimensions are integrated to form the analysis results of the main influencing factors.

[0013] Furthermore, methods for generating risk warning information include: The system extracts quantitative indicators of risk factors from the demand calibration feature vector, and obtains demand clarity scores, technology maturity scores, and schedule constraint scores from these indicators. Based on the residual variance prediction values, it calculates the model prediction uncertainty risk level. It performs cross-validation of semantic risk and prediction uncertainty to generate risk warnings. These risk warnings can be composite high-risk warnings, model confidence warnings, or single-source risk warnings. It then performs a risk-to-cost transmission chain deduction to form a risk transmission chain description. Finally, it summarizes the risk warnings and risk transmission chain descriptions to form risk warning information. The method for forming a risk transmission chain description is as follows: Based on the comparison results of demand clarity scores, technology maturity scores, and schedule constraint scores with corresponding preset trigger thresholds, each risk source is determined; a preset risk transmission mapping set is established, containing the transmission mapping relationships between different risk sources and their corresponding affected cost categories; for each risk source, the affected cost category is determined based on the transmission mapping relationship between the risk source and the cost category; for each affected cost category, the corresponding estimated cost amount is obtained from the cost composition decomposition results; a preset risk transmission amplification coefficient is established for each risk source, and the estimated risk transmission increment is calculated based on the estimated cost amount and the risk transmission amplification coefficient; each risk source, affected cost category, and estimated risk transmission increment are integrated to form a risk transmission chain description.
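The risk transmission chain deduction in [0013] can be sketched as below. The text does not say in which direction a score is compared with its trigger threshold, nor how the increment is computed from the estimated cost amount and the amplification coefficient. The sketch assumes a risk source is triggered when its score falls below the threshold, and that the increment is the product of the amount and the coefficient. The dictionary keys and function name are hypothetical.

```python
def risk_transmission_chain(scores, thresholds, mapping, amplification, cost_breakdown):
    """Build the list of (risk source, cost category, estimated increment) entries.

    scores / thresholds: per risk source.
    mapping: risk source -> list of affected cost categories.
    amplification: risk source -> preset amplification coefficient.
    cost_breakdown: cost category -> estimated cost amount.
    Assumption: a source is triggered when score < threshold.
    """
    chain = []
    for source, score in scores.items():
        if score >= thresholds[source]:
            continue  # not triggered, so it contributes no chain entry
        for category in mapping[source]:
            amount = cost_breakdown[category]
            chain.append({
                "risk_source": source,
                "cost_category": category,
                # Assumption: increment = estimated amount * coefficient.
                "estimated_increment": amount * amplification[source],
            })
    return chain
```

Each entry ties a qualitative risk source to a specific cost category and a monetary figure, which is what lets the report state a risk as a quantified cost impact.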

[0014] Furthermore, the methods for generating optimization recommendations include: For each cost category, the cost percentage corresponding to all anchor point projects is obtained, and a weighted average is calculated based on similarity weights to obtain the weighted average cost percentage of anchor points for each cost category. For each cost category, the structural deviation is calculated based on the predicted cost percentage and the weighted average cost percentage of anchor points, and the maximum structural deviation and the largest deviation category are determined. If the maximum structural deviation is greater than the preset cost excess threshold, cost reduction suggestions are generated. The cost categories are, in order, human resource costs, hardware procurement costs, software licensing costs, implementation and deployment costs, and project management costs. Based on the analysis of key influencing factors and risk warnings, targeted cost optimization strategies are generated. These strategies include functional optimization, technical solution optimization, requirement clarification, and schedule optimization. The coupling coefficient between dimensions is used to determine whether the primary and secondary cost-driving dimensions are marked as coupled-driving dimension pairs. If such pairs exist, the coupling direction is determined based on the coupling coefficient. Coupled-linkage optimization suggestions are generated based on the coupling coefficient and coupling direction, including collaborative cost reduction and balanced trade-off suggestions. All cost reduction suggestions, cost optimization strategies, and coupled-linkage optimization suggestions are integrated to form a final optimization recommendation.
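The structural deviation check in [0014] can be sketched as follows. The text does not give the deviation formula, so the sketch assumes it is the predicted cost percentage minus the anchors' similarity-weighted average percentage for the same category. The category keys, function names, and the wording of the suggestion are hypothetical.

```python
CATEGORIES = ("labor", "hardware", "software_license", "deployment", "project_management")

def anchor_weighted_percentages(anchor_percentages, weights):
    """Similarity-weighted average cost percentage per category over the anchors."""
    total = sum(weights)
    return {
        cat: sum(w * pct[cat] for w, pct in zip(weights, anchor_percentages)) / total
        for cat in CATEGORIES
    }

def largest_structural_deviation(predicted, anchor_average):
    """Assumption: deviation = predicted percentage - anchor weighted average."""
    deviations = {cat: predicted[cat] - anchor_average[cat] for cat in CATEGORIES}
    category = max(deviations, key=deviations.get)
    return category, deviations[category]

def cost_reduction_suggestion(predicted, anchor_average, threshold):
    """Return a suggestion only if the largest deviation exceeds the threshold."""
    category, deviation = largest_structural_deviation(predicted, anchor_average)
    if deviation > threshold:
        return (f"Share of '{category}' exceeds the anchor average by "
                f"{deviation:.1%}; review this category for cost reduction.")
    return None
```

Comparing shares rather than amounts means the check flags an unusual cost structure even when the total predicted cost is close to that of the anchors.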

[0015] The technical effects and advantages of this invention's information system cost evaluation method based on large language models and deep learning are as follows: A historical project cost database is constructed, and multi-dimensional cost influencing factors are extracted along four dimensions (functional complexity, technical difficulty, labor cost, and time cost). Combined with a pre-built large language model, explicit functional requirements, implicit technical requirements, quality constraints, and risk factors are deeply extracted from the requirements documents of the projects to be evaluated, generating demand semantic feature vectors. The cost sensitivity coefficients of each semantic dimension are calculated and calibrated using actual historical project cost data, thus compensating for the inability of the large language model itself to perceive cost correlations. Through an anchor project matching mechanism, the semantic features of requirements are aligned with historical structured features, effectively solving the cold start problem caused by the lack of structured cost data for the projects to be evaluated. The cost benchmark anchor value and cost residual ratio prediction mechanism transforms the prediction target from an absolute cost value to a relative deviation ratio, eliminating scale differences between projects of different sizes. This effectively reduces the model learning difficulty and improves the balance and stability of prediction accuracy across projects of different sizes. It overcomes the shortcoming of traditional deep learning cost prediction models that directly predict the absolute cost value, in which large-scale projects dominate the training process while accuracy on small-scale projects drops significantly.
The dimension-level gating routing mechanism designed in the cost prediction network can adaptively strengthen the driving dimensions that significantly contribute to the cost and suppress secondary dimensions based on input features. This allows the model to automatically identify key cost driving factors when predicting different types of projects. Moreover, the gating weights themselves have clear meanings for cost factor analysis, unlike the traditional feature concatenation method that treats all dimensions equally and cannot distinguish the differences in cost contribution of each dimension. Through joint training of the cost residual prediction branch and the cost composition prediction branch, the feature representation learned by the shared feature extraction layer can not only fit the cost value but also understand the internal composition law of the cost, giving the model a structured cognitive ability of cost formation mechanism. In the cost assessment output stage, the confidence interval is empirically broadened and corrected by the anchor point consistency backtesting correction mechanism using the actual backtesting performance of the model on the anchor point projects. This compensates for the deficiency that the confidence interval may be too narrow when relying solely on the uncertainty of the model's residual variance estimation. Through layer-by-layer source tracing analysis of cost driving factors, fine-grained attribution from the dimension level to the feature element level and quantitative analysis of the coupling relationship between dimensions are achieved. Through multi-source risk cross-validation and risk transmission link deduction, qualitative risk warning is upgraded to structured risk analysis with quantifiable cost impact. Through coupled linkage optimization suggestions, the optimization strategy is upgraded from single-dimensional independent optimization to multi-dimensional collaborative optimization. 
Finally, a comprehensive cost assessment report is generated, which includes cost prediction results, cost composition decomposition, analysis of major influencing factors, risk warnings and optimization suggestions. This provides a high-precision, interpretable and decision-guiding intelligent assessment solution for the cost assessment of information system construction projects.

Description of the Drawings

[0016] Figure 1 is a flowchart of the information system cost evaluation method based on large language model and deep learning according to Embodiment 1 of the present invention.

Detailed Implementation

[0017] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

Embodiment 1:

[0018] As shown in Figure 1, the information system cost assessment method based on large language models and deep learning in this embodiment includes: Step S1: Collect cost data for each historical project, and clean, label, and structure the cost data to establish a historical project cost database.

[0019] Methods for collecting cost data for historical projects include: The project management system (a project management platform used to manage and record information throughout the entire process of information system construction projects) is used to obtain basic project information and requirements documents for each historical project. Historical projects refer to information system construction projects that have been completed, delivered, and accepted. Basic project information includes a unique project code, project name, project type, project scale level, and project acceptance time. The unique project code is a predefined unique identifier for each historical project. Project types include, but are not limited to, management information systems, e-commerce systems, data analysis platforms, and IoT application systems. The project scale level reflects the overall size of the project, including small, medium, large, and extra-large. Requirements documents are text documents recording project construction requirements, including functional requirement descriptions, non-functional requirement descriptions, system boundary descriptions, and acceptance criteria. The functional module list and technical architecture scheme of each historical project are obtained from the development management system (i.e., the development process management platform used to record and manage technical solutions and functional module information during the development of information systems). The functional module list records all functional module information included in the historical project, specifically multiple functional module records. Each functional module record includes a module code, module name, module function description, module level, and number of module interfaces. The module code is a unique identifier for the functional module. 
The module level identifies the functional module's position in the system's functional hierarchy, including first-level, second-level, and third-level modules. The number of module interfaces is the total number of interfaces between the corresponding functional module and other modules or external systems. The technical architecture scheme records the technical architecture information used in the historical project, specifically including the technical architecture type, development language, database type, number of middleware components, and number of third-party components. The technical architecture type includes, but is not limited to, monolithic architecture, microservice architecture, distributed architecture, and hybrid architecture. The system retrieves human resource allocation data for each historical project from the human resource management system (a human resource management platform used to record and manage project human resource input and allocation information). This data records human resource input information during the development process of historical projects and includes multiple allocation records. Each allocation record includes job roles, number of personnel, working hours, and unit price per person. Job roles include, but are not limited to, project managers, system architects, development engineers, test engineers, UI designers, and operations engineers. Working hours refer to the cumulative working hours of the corresponding job role in the project. Unit price per person is the standard cost per unit of working hours for the corresponding job role. Retrieve development cycle records for each historical project from the project schedule management system (i.e., a schedule management platform used to record and track the time progress of each stage of project development). 
These development cycle records document the time information for each stage of a historical project from project initiation to acceptance, specifically including multiple stage records. Each stage record includes the stage name, stage start time, stage end time, and stage delay days. Stage names include requirements analysis stage, system design stage, coding development stage, system testing stage, and deployment and launch stage. The stage delay days are the number of days the actual completion time of the corresponding stage exceeds the planned completion time. Obtain the actual cost results of each historical project from the financial settlement system (i.e., the financial management platform used to record and manage project contract amounts and actual settlement information); the actual cost results are used to record the cost information of the final settlement of historical projects, specifically including the actual settlement amount and cost composition details; the cost composition details include the proportion of labor costs, hardware procurement costs, software licensing costs, implementation and deployment costs, and project management costs. The project basic information, requirements documents, functional module list, technical architecture plan, human resource allocation data, development cycle records and actual cost results of each historical project are linked and integrated to form the cost data of each historical project.

[0020] Methods for cleaning cost data include: Perform a data integrity check on all cost data, that is, check each cost data item for missing information. If the unique project code, project type, or project scale level in the basic project information is missing, mark the corresponding cost data as a critical missing record and remove it. If any of the functional module list, technical architecture scheme, human resource allocation data, development cycle records, or actual cost results is missing in its entirety, mark the corresponding cost data as a critical missing record and remove it. If only a few non-critical fields are missing, fill them with the average value of the corresponding non-critical fields under the same project type. Non-critical fields are fields, such as module function descriptions and stage delay days, that do not have a decisive impact on subsequent feature extraction. Numerical anomaly detection is then performed on the cost data that have passed the integrity check. Specifically, all cost data that have passed the integrity check are grouped by project type and project scale level to obtain multiple groups of projects of the same type and scale. For each such group, the mean and standard deviation of the actual settlement amounts of all cost data in the group are calculated to obtain the in-group settlement amount mean and the in-group settlement amount standard deviation. The absolute value of the difference between the actual settlement amount of each cost data item and the corresponding in-group mean is divided by the corresponding in-group standard deviation to obtain the settlement amount deviation. A cost anomaly threshold is preset by those skilled in the art based on the statistical characteristics of historical cost data. 
If the settlement amount deviation is greater than the cost anomaly threshold, the corresponding cost data is marked as an anomaly record and removed. Anomaly detection for other numerical fields, such as working hours and unit price per person, uses the same method as for the actual settlement amount. For cost data that pass the anomaly detection, a duplicate record check is performed. Specifically, the unique project code is used as the basis for uniqueness determination to check whether multiple cost data items share the same unique project code. If so, the cost data item with the most recent project acceptance time is retained and the others are removed. The cost data that pass the duplicate record check are integrated to form a cost cleaning dataset.
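The anomaly detection and duplicate record check described in [0020] can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the field names (code, ptype, scale, amount, accepted), the default threshold of 3.0, the use of the population standard deviation, and the handling of a zero standard deviation are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

def remove_anomalies(records, threshold=3.0):
    """Drop records whose settlement amount deviation exceeds the threshold.

    Records are grouped by (project type, scale level); the threshold value
    is an assumed placeholder for the preset cost anomaly threshold.
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r["ptype"], r["scale"])].append(r)
    kept = []
    for members in groups.values():
        amounts = [r["amount"] for r in members]
        mu, sigma = mean(amounts), pstdev(amounts)
        for r in members:
            # Deviation = |amount - in-group mean| / in-group standard deviation.
            # Assumption: a group with zero spread has no anomalies.
            deviation = abs(r["amount"] - mu) / sigma if sigma > 0 else 0.0
            if deviation <= threshold:
                kept.append(r)
    return kept

def deduplicate(records):
    """Keep, for each unique project code, the record accepted most recently."""
    latest = {}
    for r in records:
        prev = latest.get(r["code"])
        if prev is None or r["accepted"] > prev["accepted"]:
            latest[r["code"]] = r
    return list(latest.values())
```

Acceptance times are compared directly, so ISO-formatted date strings or date objects both work in this sketch.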

[0021] Methods for annotating cost data include: A pre-defined project type labeling system is established, which includes multiple standard project types. Each standard project type is pre-set by a person skilled in the art based on the information system industry classification standards. Each piece of cost data in the cost cleaning dataset is matched with the standard project type in the project type labeling system according to its corresponding project type, and the successfully matched standard project type is used as the standardized project type label for the corresponding cost data. A pre-defined cost range labeling system is established, which includes multiple cost levels and their corresponding cost amount ranges. Each cost level and cost amount range is pre-set by a person skilled in the art based on the industry statistical distribution of information system costs. Each cost data in the cost cleaning dataset is matched with each cost amount range in the cost range labeling system according to the corresponding actual settlement amount. The cost level corresponding to the cost amount range into which the actual settlement amount falls is used as the cost level label for the corresponding cost data. For each cost data entry in the cost cleaning dataset, the cost composition details are compared, including the proportions of labor costs, hardware procurement costs, software licensing costs, implementation and deployment costs, and project management costs. The dominant cost type is determined based on the cost category corresponding to the largest cost proportion. The cost categories are, in order: labor costs, hardware procurement costs, software licensing costs, implementation and deployment costs, and project management costs. 
Specifically, if the labor cost proportion is the largest, the dominant cost type for the corresponding cost data is labeled as "labor-dominated"; if the hardware procurement cost proportion is the largest, it is labeled as "hardware-dominated"; if the software licensing cost proportion is the largest, it is labeled as "software-dominated"; and if the implementation and deployment cost proportion or the project management cost proportion is the largest, it is labeled as "service-dominated." This dominant cost type label identifies the most significant cost factor category affecting the project cost. Standardized project type labels, cost grade labels, and dominant cost type labels are added to each cost data entry to form a labeled dataset.
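As a rough illustration of the labeling rules in [0021], the sketch below assigns a cost level from preset amount ranges and derives the dominant cost type from the cost composition proportions. The category keys, the half-open range convention, and the tie-breaking rule (the first category in the listed order wins) are assumptions; the text only fixes the mapping from the largest proportion to the four labels.

```python
# Cost categories in the order listed in the text; used here for tie-breaking (assumed).
ORDER = ["labor", "hardware", "software", "deployment", "management"]

DOMINANT_LABEL = {
    "labor": "labor-dominated",
    "hardware": "hardware-dominated",
    "software": "software-dominated",
    "deployment": "service-dominated",
    "management": "service-dominated",
}

def cost_level(amount, ranges):
    """Return the level whose half-open range [low, high) contains the amount.

    ranges: list of (level, low, high) tuples; None is returned if nothing matches.
    """
    for level, low, high in ranges:
        if low <= amount < high:
            return level
    return None

def dominant_cost_type(proportions):
    """Map the category with the largest proportion to its dominant cost label."""
    # max() returns the first maximal element, so ties resolve in ORDER.
    category = max(ORDER, key=lambda c: proportions.get(c, 0.0))
    return DOMINANT_LABEL[category]
```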

[0022] Methods for structuring cost data include: For each cost data point in the labeled dataset, the total number of records for each functional module is counted to obtain the total number of functional modules. The number of first-level, second-level, and third-level modules is counted separately to obtain the number of modules at each level. The number of module interfaces recorded for all functional modules is summed to obtain the total number of interfaces. The total number of functional modules, the number of modules at each level, and the total number of interfaces for the same cost data are integrated to form a structured indicator of the functional scale of each cost data point. For the technical architecture scheme of each cost data in the labeled dataset, the technical architecture type, development language and database type are encoded and converted to obtain the corresponding numerical code; the numerical code of the same cost data, the number of middleware and the number of third-party components are integrated to form the technical architecture structure index of each cost data; the specific implementation method of encoding conversion is a conventional technical means in this field, and will not be described in detail here. For each cost data entry in the labeled dataset, the human resource allocation data is calculated by multiplying the working hours and unit price of each personnel in each allocation record to obtain the personnel cost for each job role. The personnel costs for all job roles are summed to obtain the total human resource cost. The total number of personnel corresponding to all job roles in the allocation records is calculated to obtain the total number of team members. The ratio of the total human resource cost to the total number of team members is calculated to obtain the average human resource cost per person. 
The total human resource cost, total number of team members, and average human resource cost per person for the same cost data are integrated to form a structured human resource indicator for each cost data entry. For the development cycle records of each cost data entry in the labeled dataset, the difference between the stage end time and the stage start time is calculated to obtain the actual stage duration of each stage record. The actual stage durations of all stage records are summed to obtain the total project duration. The stage delay days of all stage records are summed to obtain the cumulative delay days. The ratio of the cumulative delay days to the total project duration is calculated to obtain the delay rate, which reflects the degree to which the project schedule deviated from the plan during development. The total project duration, cumulative delay days, and delay rate of the same cost data are integrated to form a time-dimensional structured indicator for each cost data entry.
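The human resource and time-dimension structuring in [0022] reduces to a few sums and ratios, sketched below. The record layouts (keys headcount, hours, unit_price, start, end, delay_days), the use of days as the duration unit, and the zero-division guards are assumptions for illustration only.

```python
from datetime import date

def human_resource_indicators(allocations):
    """allocations: list of dicts with 'headcount', 'hours', 'unit_price' per job role."""
    # Personnel cost per job role = working hours x unit price per person.
    total_cost = sum(a["hours"] * a["unit_price"] for a in allocations)
    team_size = sum(a["headcount"] for a in allocations)
    avg_cost = total_cost / team_size if team_size else 0.0  # guard is an assumption
    return {"total_cost": total_cost, "team_size": team_size, "avg_cost": avg_cost}

def time_indicators(stages):
    """stages: list of dicts with 'start' and 'end' (date objects) and 'delay_days'."""
    # Actual stage duration = stage end time - stage start time, in days.
    total_duration = sum((s["end"] - s["start"]).days for s in stages)
    cumulative_delay = sum(s["delay_days"] for s in stages)
    delay_rate = cumulative_delay / total_duration if total_duration else 0.0
    return {
        "total_duration": total_duration,
        "cumulative_delay": cumulative_delay,
        "delay_rate": delay_rate,
    }
```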

[0023] Methods for establishing a historical project cost database include: The project's basic information, standardized project type labeling, cost level labeling, dominant cost type labeling, requirements document, functional scale structured indicators, technical architecture structured indicators, human resource structured indicators, and time dimension structured indicators for each cost data point are integrated with the actual cost results to form a structured cost record for each historical project. All structured cost records for all historical projects are then compiled to establish a historical project cost database.

[0024] Step S2: Based on the historical project cost database, analyze the cost influencing factors, and extract feature vectors for functional complexity, technical difficulty, labor cost, and time cost to form a multi-dimensional feature set of cost influencing factors.

[0025] Methods for extracting feature vectors of functional complexity include: Obtain the structured cost record of each historical project from the historical project cost database; extract the corresponding functional scale structured indicators from each structured cost record, and calculate the functional complexity derived indicators for each historical project based on them. The functional complexity derived indicators include interface density, functional depth, breadth-depth balance index, and weighted interface complexity. The functional scale structured indicators and functional complexity derived indicators of the same historical project are integrated to form the functional complexity feature vector of each historical project. Specifically, the interface density is calculated by dividing the total number of interfaces by the total number of functional modules. Interface density reflects the average complexity of interactions between functional modules; a higher interface density indicates tighter coupling between modules. The number of lower-level modules is calculated by summing the numbers of second-level and third-level modules. The functional depth is calculated by dividing the number of lower-level modules by the total number of functional modules. Functional depth reflects how far the information system's functional hierarchy extends downward; a higher functional depth indicates a higher degree of functional subdivision. The functional breadth ratio is calculated by dividing the number of first-level modules by the total number of functional modules. The breadth-depth balance index is calculated by dividing the functional depth by the functional breadth ratio. The breadth-depth balance index reflects the degree of balance between the horizontal coverage breadth and the vertical subdivision depth of the information system's functions; a larger index indicates that the information system is biased towards a deep and complex structure, while a smaller index indicates that it is biased towards a broad coverage structure. 
A set of module size tiers is preset, which includes multiple module size tiers and their corresponding complexity weighting coefficients. Each module size tier is defined by a numerical range of the total number of functional modules, and each complexity weighting coefficient is preset by those skilled in the art based on how the functional complexity of information systems grows with size. The complexity weighting coefficient is obtained according to the module size tier into which the total number of functional modules falls. The product of the interface density and the complexity weighting coefficient is calculated to obtain the weighted interface complexity, which reflects the interface coupling complexity after correction for module size.
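The derived indicators of [0025] can be written down directly from their definitions, as in the following sketch. The tier table format (ascending upper bounds paired with coefficients), the example coefficient values used when calling it, and the error raised for degenerate inputs are assumptions; the text leaves the tiers and coefficients to those skilled in the art.

```python
def tier_coefficient(total_modules, tiers):
    """tiers: list of (inclusive upper bound, coefficient), sorted by ascending bound."""
    for upper, coeff in tiers:
        if total_modules <= upper:
            return coeff
    return tiers[-1][1]  # assumption: sizes above every bound use the last tier

def functional_complexity(total_modules, n_level1, n_level2, n_level3,
                          total_interfaces, tiers):
    if total_modules <= 0 or n_level1 <= 0:
        # Assumption: the ratios are undefined without modules or first-level modules.
        raise ValueError("need at least one module and one first-level module")
    interface_density = total_interfaces / total_modules
    functional_depth = (n_level2 + n_level3) / total_modules
    breadth_ratio = n_level1 / total_modules
    balance_index = functional_depth / breadth_ratio
    weighted_complexity = interface_density * tier_coefficient(total_modules, tiers)
    return {
        "interface_density": interface_density,
        "functional_depth": functional_depth,
        "balance_index": balance_index,
        "weighted_interface_complexity": weighted_complexity,
    }
```

For example, a project with 10 modules (2 first-level, 3 second-level, 5 third-level) and 30 interfaces has an interface density of 3.0 and a functional depth of 0.8.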

[0026] Methods for extracting technical difficulty feature vectors include: The corresponding technical architecture structure indicators are obtained from each cost structure record, and the technical difficulty derived indicators for each historical project are calculated based on the technical architecture structure indicators. Among them, the technical difficulty derived indicators include the architecture complexity benchmark score, language difficulty coefficient, total number of external dependencies, integration burden rate, and comprehensive technical difficulty value. The number of middleware, the number of third-party components, and the technical difficulty derived indicators of the same historical project are integrated to form the technical difficulty feature vector of each historical project. Specifically, a pre-defined architecture complexity mapping table is used, containing baseline scores for architecture complexity corresponding to each technical architecture type. Each baseline score is pre-set by those skilled in the art based on the technical implementation difficulty of different architecture types. The corresponding baseline scores are obtained from the architecture complexity mapping table based on the numerical codes corresponding to the technical architecture types in the structured technical architecture indicators. The total number of external dependencies is calculated by summing the number of middleware components and the number of third-party components; this total number reflects the degree of dependence of the information system on external technical components. 
The total number of functional modules is obtained from the functional scale structured indicators, and the ratio of the total number of external dependencies to the total number of functional modules is calculated to obtain the integration burden rate; the integration burden rate reflects the average workload of integrating external technical components undertaken by each functional module. A development language difficulty mapping table is preset, which includes the language difficulty coefficient corresponding to each development language; each language difficulty coefficient is preset by those skilled in the art based on the learning curves and development efficiency of different development languages. The corresponding language difficulty coefficient is obtained from the development language difficulty mapping table based on the numerical code of the development language in the technical architecture structured indicators. The product of the architecture complexity benchmark score and the language difficulty coefficient is calculated to obtain the basic technical difficulty value; the sum of the basic technical difficulty value and the integration burden rate is calculated to obtain the comprehensive technical difficulty value. The comprehensive technical difficulty value reflects the overall technical implementation difficulty of the project in terms of both technical architecture selection and technical component integration.
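A minimal sketch of the technical difficulty calculation in [0026] follows. The two mapping tables are passed in as plain dictionaries keyed by name rather than by numerical code, and the scores used in any call are hypothetical; the text states only that they are preset by those skilled in the art.

```python
def technical_difficulty(arch_type, language, n_middleware, n_third_party,
                         total_modules, arch_scores, lang_coeffs):
    """arch_scores and lang_coeffs are assumed lookup tables (name -> number)."""
    if total_modules <= 0:
        raise ValueError("total_modules must be positive")  # guard is an assumption
    benchmark_score = arch_scores[arch_type]
    language_coeff = lang_coeffs[language]
    # Total external dependencies = middleware components + third-party components.
    external_deps = n_middleware + n_third_party
    # Integration burden rate = external dependencies per functional module.
    burden_rate = external_deps / total_modules
    basic_difficulty = benchmark_score * language_coeff
    comprehensive = basic_difficulty + burden_rate
    return {
        "benchmark_score": benchmark_score,
        "language_coeff": language_coeff,
        "external_deps": external_deps,
        "burden_rate": burden_rate,
        "comprehensive_difficulty": comprehensive,
    }
```

Note that the comprehensive value adds a per-module ratio to a product of table scores, so the relative scale of the preset tables determines how much weight integration burden carries.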

[0027] Methods for extracting feature vectors of labor costs include: Based on the human resource allocation data, calculate the labor cost derived indicators for each historical project. The labor cost derived indicators include the number of role types, role diversity index, cost concentration, and labor cost intensity ratio. Obtain the corresponding human resource structured indicators from each structured cost record, and integrate the human resource structured indicators and labor cost derived indicators of the same historical project to form the labor cost feature vector of each historical project. Specifically, the number of distinct job roles in the human resource allocation data is counted to obtain the number of role types. The ratio of the number of role types to the total number of team members is calculated to obtain the role diversity index; the role diversity index reflects the dispersion of job roles within the project team, with a higher index indicating a more detailed division of labor. The personnel cost of each job role is obtained from the human resource allocation data, and the highest of these is marked as the highest job cost. The ratio of the highest job cost to the total human resource cost is calculated to obtain the cost concentration; the cost concentration reflects how concentrated labor costs are across job roles, with a higher concentration indicating that a few job roles account for the majority of labor costs. The actual cost results are obtained from the structured cost record, and the actual settlement amount is obtained from the actual cost results. The ratio of the total human resource cost to the actual settlement amount is calculated to obtain the labor cost intensity ratio; the labor cost intensity ratio reflects the proportion of labor input in the total project cost.
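The labor cost derived indicators of [0027] can be illustrated as follows. The record keys (role, headcount, hours, unit_price) are assumed names, and merging repeated allocation records for the same job role is an interpretation, since the text does not say whether a role can appear more than once.

```python
def labor_cost_indicators(allocations, settlement_amount):
    """allocations: list of dicts with 'role', 'headcount', 'hours', 'unit_price'."""
    role_costs = {}
    for a in allocations:
        # Personnel cost of a job role = working hours x unit price per person.
        role_costs[a["role"]] = role_costs.get(a["role"], 0.0) + a["hours"] * a["unit_price"]
    team_size = sum(a["headcount"] for a in allocations)
    total_cost = sum(role_costs.values())
    if team_size <= 0 or total_cost <= 0 or settlement_amount <= 0:
        raise ValueError("inputs must be positive")  # guard is an assumption
    n_role_types = len(role_costs)
    return {
        "n_role_types": n_role_types,
        "role_diversity": n_role_types / team_size,
        "cost_concentration": max(role_costs.values()) / total_cost,
        "labor_intensity": total_cost / settlement_amount,
    }
```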

[0028] Methods for extracting time cost feature vectors include: Obtain the corresponding time-dimensional structured indicators from each structured cost record. Based on the time-dimensional structured indicators and the development cycle records, calculate the time cost derived indicators for each historical project. The time cost derived indicators include the core development ratio, total man-days, unit duration cost, and delay concentration. Integrate the time-dimensional structured indicators and time cost derived indicators of the same historical project to form the time cost feature vector of each historical project. Specifically, the actual durations of the coding development stage and the system testing stage are obtained from the development cycle records, and their sum is calculated to obtain the core development duration. The ratio of the core development duration to the total project duration is calculated to obtain the core development ratio, which reflects the share of the total project duration taken by the two core stages (coding development and system testing). The total number of team members is obtained from the human resource structured indicators, and the product of the total number of team members and the total project duration is calculated to obtain the total man-days, which reflects the combined consumption of time and manpower by the project. The actual settlement amount is obtained from the actual cost results, and the ratio of the actual settlement amount to the total project duration is calculated to obtain the unit duration cost, which reflects the average cost per unit of project duration. 
Obtain the stage delay days recorded in the development cycle record for each stage, and take the maximum delay days among all stages as the maximum stage delay days. Calculate the ratio of the maximum stage delay days to the cumulative delay days to obtain the delay concentration. The delay concentration reflects the degree of distribution concentration of project delay risk among different stages. The closer the delay concentration is to one, the more concentrated the delay is in a single stage. The closer it is to zero, the more evenly the delay is distributed across stages. If the cumulative delay days are equal to zero, the delay concentration is set to zero.
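The time cost derived indicators of [0028] can be sketched as below. The stage names used as dictionary keys and the day as the unit of duration are assumptions; the zero-delay rule for the delay concentration follows the text.

```python
CORE_STAGES = ("coding development", "system testing")  # stage keys are assumed names

def time_cost_indicators(durations, delays, team_size, settlement_amount):
    """durations/delays: dicts mapping stage name -> days for one project."""
    total_duration = sum(durations.values())
    if total_duration <= 0:
        raise ValueError("total project duration must be positive")  # assumed guard
    core_duration = sum(durations.get(s, 0) for s in CORE_STAGES)
    cumulative_delay = sum(delays.values())
    # Per the text, the delay concentration is set to zero when there is no delay.
    if cumulative_delay > 0:
        delay_concentration = max(delays.values()) / cumulative_delay
    else:
        delay_concentration = 0.0
    return {
        "core_development_ratio": core_duration / total_duration,
        "total_man_days": team_size * total_duration,
        "unit_duration_cost": settlement_amount / total_duration,
        "delay_concentration": delay_concentration,
    }
```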

[0029] Methods for forming a feature set of multidimensional cost influencing factors include: The functional complexity feature vector, technical difficulty feature vector, labor cost feature vector, and time cost feature vector of each historical project are correlated and integrated to form a multi-dimensional cost influencing factor feature record for each historical project; the multi-dimensional cost influencing factor feature records of all historical projects are summarized to form a multi-dimensional cost influencing factor feature set.

[0030] Step S3: Input the requirements document of the project to be evaluated into the pre-built large language model, extract the explicit functional requirements, implicit technical requirements, quality constraints and risk factors in the requirements document, generate the requirements semantic feature vector, and align the requirements semantic feature vector with the feature set of multi-dimensional cost influencing factors to construct a comprehensive feature representation of the project to be evaluated.

[0031] Methods for generating the requirements semantic feature vector include: Obtain the requirements document of the project to be evaluated and mark it as the document to be evaluated. The project to be evaluated is an information system construction project that has not yet undergone cost evaluation. A large language model is pre-built. The large language model is a domain-adapted model obtained by fine-tuning a large-scale pre-trained language model on a professional corpus in the information system field, and it is capable of deep semantic understanding and structured information extraction of the professional terms, technical descriptions, and business logic in information system requirements documents. Large-scale pre-trained language models, such as GPT series models, LLaMA series models, ChatGLM series models, the Tongyi Qianwen model, and DeepSeek series models, are pre-trained language models with general semantic understanding capabilities. The fine-tuning method of the large language model is a conventional technique in this field and is not elaborated on here. The document to be evaluated is input into the large language model to extract explicit functional requirements. Specifically, an explicit functional requirement extraction instruction is preset, which is a structured prompt text guiding the large language model to identify and output all explicitly described functional requirements from the document to be evaluated. The document to be evaluated and the explicit functional requirement extraction instruction are input into the large language model together, and the large language model outputs the functional requirement parsing results. The functional requirement parsing results include a list of function points and a complexity rating for each function point. Each function point in the list includes a function point name and a function point description. 
A function point is a specific business function or operation unit that the information system needs to implement, that is, a function described in the document to be evaluated that can be directly used by users or external systems. The complexity rating is one of high complexity, medium complexity, and low complexity. The function points in the function point list are counted to obtain the estimated total number of function points. The numbers of function points rated high, medium, and low complexity are counted separately to obtain the number of function points at each complexity level. A level weight is assigned to each complexity level, with each weight preset by those skilled in the art based on the differences in cost contribution of function points of different complexity. Specifically, the level weight for high complexity is greater than that for medium complexity, and the level weight for medium complexity is greater than that for low complexity. Using these level weights, the numbers of function points at each complexity level are weighted and summed to obtain the function point scale, which reflects the overall scale of functional requirements after complexity correction. The ratio of the function point scale to the estimated total number of function points is calculated to obtain the average functional complexity, which reflects the average complexity of each function point. The estimated total number of function points, the number of function points at each complexity level, the function point scale, and the average functional complexity are integrated to form the explicit functional requirement quantitative indicators. The document to be evaluated is input into the large language model to extract implicit technical requirements. 
Specifically, an implicit technical requirement extraction instruction is preset, which is a structured prompt text guiding the large language model to identify and output the implicit technical implementation requirements from the document to be evaluated. The document to be evaluated and the implicit technical requirement extraction instruction are input into the large language model together, and the large language model outputs the technical requirement parsing results. The technical requirement parsing results include multiple technical requirement levels, specifically a performance requirement level, a security requirement level, and an integration complexity level. A technical level scoring mapping table is preset, which contains the quantitative scores corresponding to different technical requirement levels; the quantitative scores are preset by those skilled in the art based on industry experience regarding the technical implementation difficulty of information systems. Based on the technical requirement levels in the technical requirement parsing results, the corresponding quantitative scores are obtained from the technical level scoring mapping table to obtain the performance requirement score, security requirement score, and integration requirement score. The average of the performance requirement score, security requirement score, and integration requirement score is calculated to obtain the comprehensive technical requirement score, which reflects the overall technical implementation difficulty implicit in the document to be evaluated. The performance requirement score, security requirement score, integration requirement score, and comprehensive technical requirement score are integrated to form the implicit technical requirement quantitative indicators. The document to be evaluated is input into the large language model to extract quality constraints. 
Specifically, a quality constraint extraction instruction is preset, which is a structured prompt text guiding the large language model to identify and output the quality assurance requirements from the document to be evaluated. The document to be evaluated and the quality constraint extraction instruction are input into the large language model together, and the large language model outputs the quality constraint parsing results. These results include multiple quality requirement levels, specifically a reliability requirement level, an availability requirement level, and a maintainability requirement level. A quality level scoring mapping table is preset, which contains the quantitative scores corresponding to different quality requirement levels; the quantitative scores are preset by those skilled in the art based on industry experience regarding quality assurance costs. Based on each quality requirement level in the quality constraint parsing results, the corresponding quantitative scores are obtained from the quality level scoring mapping table to obtain the reliability score, availability score, and maintainability score. The average of the reliability, availability, and maintainability scores is calculated to obtain the comprehensive quality constraint score, which reflects the overall stringency of the quality constraints in the document to be evaluated. The reliability, availability, and maintainability scores are integrated with the comprehensive quality constraint score to form the quality constraint quantitative indicators. The document to be evaluated is input into the large language model to extract risk factors. Specifically, a risk factor extraction instruction is preset, which is a structured prompt text guiding the large language model to identify and evaluate potential risk factors affecting cost from the document to be evaluated. The document to be evaluated and the risk factor extraction instruction are input into the large language model together, and the large language model outputs the risk factor analysis results. 
The risk factor analysis results include multiple risk assessment results, specifically a requirement clarity assessment, a technology maturity assessment, and a schedule constraint assessment. Risk scoring rules are preset, which contain the quantitative scores corresponding to different assessment results; the scores are preset by those skilled in the art based on their experience in information system project risk assessment. According to the risk scoring rules, the quantitative score of each risk assessment result in the risk factor analysis results is obtained, yielding a requirement clarity score, a technology maturity score, and a schedule constraint score. A lower requirement clarity score indicates a vaguer requirement description and higher cost risk; a lower technology maturity score indicates that the technology used is newer and carries higher implementation risk; and a lower schedule constraint score indicates a tighter project schedule, stronger time constraints, and higher implementation difficulty and cost risk. The average of these three scores is calculated to obtain the comprehensive risk score, which reflects the overall cost risk level faced by the project to be evaluated. The requirement clarity score, technology maturity score, schedule constraint score, and comprehensive risk score are integrated to form the risk factor quantitative indicators. The quantitative indicators of explicit functional requirements, implicit technical requirements, quality constraints, and risk factors are arranged in order to form the requirements semantic feature vector of the project to be evaluated.
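The quantification steps of [0031] that follow the large language model's output can be sketched as below. The model call itself is not shown: the complexity ratings and requirement levels are assumed to have already been parsed from the model's responses. The level names, the example weights (3, 2, 1), and the score table are hypothetical placeholders for the preset values.

```python
from statistics import mean

LEVELS = ("high", "medium", "low")

def explicit_requirement_indicators(ratings, weights):
    """ratings: complexity rating per function point; weights: level -> level weight."""
    total = len(ratings)
    counts = {lvl: ratings.count(lvl) for lvl in LEVELS}
    # Function point scale = weighted sum of function point counts per level.
    scale = sum(weights[lvl] * counts[lvl] for lvl in LEVELS)
    avg_complexity = scale / total if total else 0.0  # guard is an assumption
    return [total, counts["high"], counts["medium"], counts["low"], scale, avg_complexity]

def scored_group(levels, score_table):
    """Map three extracted levels to scores and append their average."""
    scores = [score_table[lvl] for lvl in levels]
    return scores + [mean(scores)]

def requirement_semantic_vector(ratings, weights, tech, quality, risk, score_table):
    """Concatenate the four indicator groups in the order given in the text."""
    return (explicit_requirement_indicators(ratings, weights)
            + scored_group(tech, score_table)
            + scored_group(quality, score_table)
            + scored_group(risk, score_table))
```

In this sketch a single score table is shared by the technical, quality, and risk groups for brevity, whereas the text describes a separate preset table or rule set for each group.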

[0032] Cost sensitivity calibration is performed on the demand semantic feature vector. Specifically, the demand documents of all historical projects are retrieved from the historical project cost database and marked as historical documents. Each historical document is input into the large language model, and the same extraction and quantification methods used for the project to be evaluated are applied to obtain the demand semantic feature vector of each historical project, which is marked as a historical demand feature vector. The actual settlement amount of each historical project is obtained from its structured cost record. For each dimension of the historical demand feature vector, the absolute value of the Pearson correlation coefficient between the values of all historical projects in that dimension and their actual settlement amounts is calculated to obtain the cost correlation degree of the dimension. The cost correlation degree reflects the strength of the statistical correlation between a dimension of the demand semantic feature vector and the actual project cost; a higher cost correlation degree indicates a more significant impact of that dimension on cost. The number of dimensions in the historical demand feature vector is counted to obtain the total number of dimensions. The cost correlation degrees of all dimensions are summed to obtain the total cost correlation. The ratio of each dimension's cost correlation degree to the total cost correlation is calculated and multiplied by the total number of dimensions to obtain the cost sensitivity coefficient of that dimension. The cost sensitivity coefficient converts the cost correlation degree of each dimension into a relative importance weight; by construction, the coefficients average to one across dimensions.
A cost sensitivity coefficient greater than one indicates that the corresponding dimension has a higher-than-average impact on cost, while a coefficient less than one indicates a lower-than-average impact. The value of each dimension in the demand semantic feature vector is multiplied by the corresponding cost sensitivity coefficient to obtain the calibrated demand semantic feature vector. The cost sensitivity calibration uses the actual cost results of historical projects to identify retrospectively how strongly each semantic dimension is associated with cost, and recalibrates the importance of the cost-oriented demand semantic information extracted by the large language model. Semantic dimensions with a greater impact on cost thereby receive higher feature weights in subsequent prediction, compensating for the cost correlation that the large language model itself cannot perceive.
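The calibration above reduces to a per-dimension Pearson correlation followed by a normalization. The following is a minimal sketch under stated assumptions: the function names are illustrative, the toy data are invented, and the handling of constant columns and of an all-zero correlation total is an assumption, since the text does not address those cases.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either series is constant (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def sensitivity_coefficients(historical_vectors, settlement_amounts):
    """Cost correlation degree of each dimension, rescaled so the coefficients average to one."""
    n_dims = len(historical_vectors[0])
    correlations = [
        abs(pearson([vec[d] for vec in historical_vectors], settlement_amounts))
        for d in range(n_dims)
    ]
    total = sum(correlations)
    if total == 0:  # not covered by the source; fall back to neutral weights
        return [1.0] * n_dims
    return [c / total * n_dims for c in correlations]


def calibrate(vector, coefficients):
    """Scale each dimension of a demand semantic feature vector by its coefficient."""
    return [v * c for v, c in zip(vector, coefficients)]


# Toy history: three projects, two feature dimensions (illustrative numbers).
history = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0]]
amounts = [10.0, 20.0, 30.0]

coefficients = sensitivity_coefficients(history, amounts)
calibrated = calibrate([3.0, 3.0], coefficients)
```

In this toy data the first dimension correlates perfectly with cost and the second only weakly, so the first receives a coefficient above one and the second a coefficient below one.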

[0033] The comprehensive feature representation of the project to be evaluated is constructed as follows. The calibrated demand semantic feature vector is marked as the demand calibration feature vector. The cosine similarity between the demand calibration feature vector of the project to be evaluated and the historical demand feature vector of each historical project is calculated to obtain the semantic similarity of each historical project. The semantic similarity reflects how close the project to be evaluated is to each historical project at the demand semantic level. All historical projects are sorted by semantic similarity from high to low, and the top-ranked historical projects, up to the number of anchor points, are selected as anchor projects, forming the anchor project set. The number of anchor points is preset by those skilled in the art based on the data size of the historical project cost database; the anchor project set serves as the reference benchmark for the project to be evaluated in the structured feature space. The multidimensional cost influencing factor feature record of each anchor project in the anchor project set is obtained from the multidimensional cost influencing factor feature set, and the functional complexity, technical difficulty, labor cost, and time cost feature vectors are obtained from each record. The semantic similarities of all anchor projects in the anchor project set are summed to obtain the anchor similarity sum. The ratio of each anchor project's semantic similarity to the anchor similarity sum is calculated to obtain the similarity weight of that anchor project.
Based on the similarity weight of each anchor project, the functional complexity, technical difficulty, labor cost, and time cost feature vectors of all anchor projects are weighted and summed element by element to obtain the estimated functional complexity feature vector, estimated technical difficulty feature vector, estimated labor cost feature vector, and estimated time cost feature vector. These four estimated vectors characterize the estimated structured features of the project to be evaluated in the four dimensions of functional complexity, technical difficulty, labor cost, and time cost, so that the features of the project to be evaluated lie in the same feature space as the structured features of historical projects. The four estimated feature vectors and the demand calibration feature vector are concatenated to form the comprehensive feature representation of the project to be evaluated. The comprehensive feature representation expresses, in a unified form, the demand semantic information of the project to be evaluated and the feature-aligned structured cost influencing factor information, and serves as the input to the subsequent cost prediction model.
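The anchor selection and feature alignment described above can be sketched as follows. This is an illustration under stated assumptions: the four structured feature vectors of each historical project are collapsed into one list per project for brevity, all names and numbers are hypothetical, and the similarities are assumed to be non-negative so that the weights are well defined.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_anchors(query, historical, k):
    """Return [(project_id, similarity_weight)] for the k most similar historical projects."""
    ranked = sorted(
        ((cosine(query, vec), pid) for pid, vec in historical.items()), reverse=True
    )[:k]
    total = sum(sim for sim, _ in ranked)
    return [(pid, sim / total) for sim, pid in ranked]


def weighted_features(anchors, structured):
    """Similarity-weighted, element-wise sum of the anchors' structured feature vectors."""
    length = len(structured[anchors[0][0]])
    return [sum(w * structured[pid][i] for pid, w in anchors) for i in range(length)]


# Hypothetical data: demand feature vectors and structured features of three projects.
historical_demand = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
structured_features = {"A": [10.0, 2.0], "B": [50.0, 9.0], "C": [20.0, 4.0]}

query_vector = [1.0, 0.0]  # demand calibration feature vector of the project to be evaluated
anchors = select_anchors(query_vector, historical_demand, k=2)
estimated = weighted_features(anchors, structured_features)

# Comprehensive feature representation: estimated structured features + demand features.
comprehensive = estimated + query_vector
```

Project B is orthogonal to the query and is dropped; the estimated structured features are an interpolation between projects A and C, pulled toward A because it is more similar.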

[0034] Step S4: Based on the feature vectors in the multidimensional cost influencing factors feature set and the actual cost results of each historical project in the historical project cost database, a cost prediction model is constructed by learning the nonlinear mapping relationship between the feature vector combination and the actual cost results.

[0035] The cost prediction model is constructed as follows. The multidimensional cost influencing factor feature record of each historical project is obtained from the multidimensional cost influencing factor feature set, and the functional complexity, technical difficulty, labor cost, and time cost feature vectors are extracted from each record. The historical demand feature vector of each historical project and the cost sensitivity coefficient of each dimension are also obtained. The value of each dimension in the historical demand feature vector of each historical project is multiplied by the corresponding cost sensitivity coefficient to obtain the demand calibration feature vector of that historical project. The functional complexity, technical difficulty, labor cost, and time cost feature vectors are then concatenated with the demand calibration feature vector to form the training input feature vector of each historical project. For each historical project, the corresponding cost benchmark anchor value and cost residual ratio are calculated. Specifically, the actual settlement amount is obtained from the structured cost record of the current project (the historical project for which the cost benchmark anchor value and cost residual ratio are currently being calculated), and the cosine similarity between the demand calibration feature vector of the current project and that of each remaining project (every historical project other than the current project) is calculated to obtain the inter-project semantic similarity of each remaining project. All remaining projects are sorted by inter-project semantic similarity from high to low, and the top-ranked remaining projects, up to the number of training anchor points, are selected as training anchor projects.
The inter-project semantic similarities of all training anchor projects are summed to obtain the total training anchor similarity. The ratio of each training anchor project's inter-project semantic similarity to the total training anchor similarity is calculated to obtain the training anchor weight of that project. Based on the training anchor weights, the actual settlement amounts of the training anchor projects are weighted and summed to obtain the cost benchmark anchor value of the current project. The cost benchmark anchor value represents the benchmark cost level estimated from the group of projects similar to the current project. The difference between the actual settlement amount of the current project and the cost benchmark anchor value is calculated and divided by the cost benchmark anchor value to obtain the cost residual ratio, which serves as the cost prediction label of the current project. The cost residual ratio characterizes how far the actual cost of the current project deviates from the benchmark cost level. It should be noted that this mechanism changes the prediction target from the absolute cost value to the deviation ratio relative to the cost benchmark anchor value, which removes the scale difference between projects of different sizes. The model therefore only needs to learn how much the current project deviates from the cost benchmark of its similar project group, which reduces the learning difficulty and improves the balance and stability of prediction accuracy across projects of different sizes.
This avoids a shortcoming of traditional deep learning cost prediction models that predict the absolute cost value directly, in which large-scale projects dominate the training process and the accuracy on small-scale projects drops significantly. The cost composition details are obtained from the actual cost results of each historical project. The proportions of labor cost, hardware procurement cost, software licensing cost, implementation and deployment cost, and project management cost in the cost composition details are arranged in order to form the cost composition label vector of each historical project. The training input feature vector, cost prediction label, and cost composition label vector of each historical project are integrated to form the training sample of that project. The training samples of all historical projects are collected to form the model training sample set.
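The label construction above can be sketched in a few lines. The function names and the numbers are illustrative assumptions; only the arithmetic (similarity-weighted anchor value, then relative deviation) follows the text.

```python
def benchmark_anchor(similarities, settlement_amounts):
    """Similarity-weighted average of the training anchor projects' settlement amounts."""
    total = sum(similarities)
    return sum(s / total * amount for s, amount in zip(similarities, settlement_amounts))


def residual_ratio(actual_amount, anchor_value):
    """Relative deviation of the actual cost from the cost benchmark anchor value."""
    return (actual_amount - anchor_value) / anchor_value


# A small project and one ten times larger with the same relative deviation.
small_anchor = benchmark_anchor([0.75, 0.25], [100.0, 200.0])
small_label = residual_ratio(150.0, small_anchor)

large_anchor = benchmark_anchor([0.75, 0.25], [1000.0, 2000.0])
large_label = residual_ratio(1500.0, large_anchor)
```

Both projects receive the same label even though their absolute costs differ by a factor of ten, which is the scale-invariance property the paragraph describes.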

[0036] A cost prediction network is constructed, which includes a dimension-level gated routing layer, a shared feature extraction layer, a cost residual prediction branch, and a cost composition prediction branch. Based on the input features, the dimension-level gated routing layer adaptively calculates saliency gating weights for five dimensions of the project cost: functional complexity, technical difficulty, labor cost, time cost, and demand semantics. Specifically, the training input feature vector is input into a gating computation network consisting of fully connected layers and a normalizing activation function. The network outputs five dimension gating weights: the functional complexity gating weight, technical difficulty gating weight, labor cost gating weight, time cost gating weight, and demand semantics gating weight. The five dimension gating weights sum to one. Each element of the functional complexity feature vector, technical difficulty feature vector, labor cost feature vector, time cost feature vector, and demand calibration feature vector is multiplied by the gating weight of its dimension to obtain the gated functional complexity feature vector, gated technical difficulty feature vector, gated labor cost feature vector, gated time cost feature vector, and gated demand semantic feature vector. These five gated feature vectors are concatenated to form the gated routing feature vector. Based on the input features, the dimension-level gated routing mechanism adaptively strengthens the dimensions that contribute significantly to the cost of the project to be evaluated and suppresses secondary dimensions. This enables the model to identify key cost drivers automatically when predicting different types of projects.
Furthermore, the gating weights are defined at the level of cost influencing factor dimensions, so each gating weight directly reflects the contribution of its dimension to the cost of the project to be evaluated and has a clear meaning in cost factor analysis. This differs from traditional feature concatenation, which treats all dimensions equally and cannot distinguish their cost contributions. The shared feature extraction layer is connected to the dimension-level gated routing layer and contains multiple fully connected layers and nonlinear activation functions. It applies deep nonlinear feature transformations to the gated routing feature vector and outputs a shared deep feature representation. The cost residual prediction branch is connected to the shared feature extraction layer and includes a residual value prediction head and a residual variance prediction head. The residual value prediction head outputs the residual prediction value, and the residual variance prediction head outputs the residual variance prediction value. The residual variance prediction value characterizes the model's estimate of the uncertainty of the residual prediction value; a larger residual variance prediction value indicates lower certainty in the current prediction. The cost composition prediction branch is connected to the shared feature extraction layer and outputs five cost percentage prediction values: labor cost percentage, hardware procurement cost percentage, software licensing cost percentage, implementation and deployment cost percentage, and project management cost percentage.
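The dimension-level gated routing can be illustrated with a single linear layer followed by a softmax. This is a simplified sketch, not the network described above: softmax is assumed here as one normalizing activation whose outputs sum to one, the single linear layer stands in for the fully connected layers, and the parameter values are placeholders rather than trained weights.

```python
import math

DIMENSIONS = (
    "functional_complexity",
    "technical_difficulty",
    "labor_cost",
    "time_cost",
    "demand_semantics",
)


def softmax(logits):
    """Numerically stable softmax; the outputs are positive and sum to one."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_routing(groups, weight_rows, biases):
    """Compute one gating weight per dimension and scale that dimension's features by it."""
    flat = [v for name in DIMENSIONS for v in groups[name]]
    logits = [sum(w * x for w, x in zip(row, flat)) + b for row, b in zip(weight_rows, biases)]
    gates = softmax(logits)
    routed = [g * v for g, name in zip(gates, DIMENSIONS) for v in groups[name]]
    return dict(zip(DIMENSIONS, gates)), routed


# Two features per dimension (illustrative values) and placeholder parameters.
groups = {name: [1.0, 2.0] for name in DIMENSIONS}
weight_rows = [[0.0] * 10 for _ in DIMENSIONS]  # untrained placeholder weights
biases = [1.0, 0.0, 0.0, 0.0, 0.0]              # favors functional complexity

gate_weights, routed_vector = gated_routing(groups, weight_rows, biases)
```

Because one weight is shared by every element of a dimension, the five values in `gate_weights` can be read directly as a ranking of cost-driving dimensions.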

[0037] Based on the model training sample set, the cost prediction network is trained jointly. Specifically, the training input feature vector of each training sample is input into the cost prediction network to obtain the residual prediction value, residual variance prediction value, five cost percentage prediction values, and five dimension gating weight values for that sample. Based on the residual prediction value, residual variance prediction value, and corresponding cost prediction label, the cost prediction loss of each training sample is obtained using the Gaussian negative log-likelihood loss. The Gaussian negative log-likelihood loss constrains both the accuracy of the residual prediction and the accuracy of the uncertainty estimate, so the cost prediction network learns not only to predict residual values accurately but also to assess the reliability of its own predictions. Its calculation is well known in the field and is not elaborated here. The mean square error between the five cost percentage prediction values and the corresponding elements of the cost composition label vector is calculated to obtain the cost composition loss of each training sample. A cost composition loss weight is preset by those skilled in the art based on the desired balance between cost prediction accuracy and cost composition analysis capability. The cost composition loss is multiplied by the cost composition loss weight to obtain the weighted cost composition error of each training sample. The cost prediction loss and the weighted cost composition error are summed to obtain the joint training loss of each training sample. The joint training losses of all training samples are averaged to obtain the batch joint training loss.
It should be noted that, because the cost composition prediction branch and the cost residual prediction branch are trained jointly, the feature representation learned by the shared feature extraction layer must fit not only the cost value but also the internal composition of the cost, which gives the model a structured representation of how costs are formed. The backpropagation algorithm and a gradient descent optimization method are used to update the parameters of the cost prediction network iteratively based on the batch joint training loss, until the batch joint training loss converges to a preset loss convergence threshold or the number of training iterations reaches a preset maximum. The loss convergence threshold and the maximum number of iterations are both preset by those skilled in the art based on the model training stability requirements. Backpropagation and gradient descent are well known in the field and are not elaborated here. The trained cost prediction network is used as the cost prediction model. This model receives input features with the same structure as the training input feature vector and outputs the residual prediction value, residual variance prediction value, five cost percentage prediction values, and five dimension gating weight values, which form the basis for the subsequent cost assessment.
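The joint training loss can be written out per sample as below. The text does not give the formula for the Gaussian negative log-likelihood, so the common form with the additive constant dropped is assumed here; the dictionary keys, the loss weight of 0.5, and the sample values are illustrative.

```python
import math


def gaussian_nll(residual_pred, variance_pred, label):
    """Gaussian negative log-likelihood with the constant term omitted (assumed form)."""
    return 0.5 * (math.log(variance_pred) + (label - residual_pred) ** 2 / variance_pred)


def composition_mse(pred_shares, label_shares):
    """Mean square error between predicted and actual cost percentages."""
    return sum((p - t) ** 2 for p, t in zip(pred_shares, label_shares)) / len(label_shares)


def joint_loss(sample, composition_weight):
    """Cost prediction loss plus the weighted cost composition error for one sample."""
    prediction_loss = gaussian_nll(
        sample["residual_pred"], sample["variance_pred"], sample["residual_label"]
    )
    composition_loss = composition_mse(sample["shares_pred"], sample["shares_label"])
    return prediction_loss + composition_weight * composition_loss


def batch_joint_loss(samples, composition_weight):
    """Average of the joint training losses over all samples in the batch."""
    return sum(joint_loss(s, composition_weight) for s in samples) / len(samples)


sample = {
    "residual_pred": 0.1,
    "variance_pred": 0.01,
    "residual_label": 0.2,
    "shares_pred": [0.5, 0.2, 0.1, 0.1, 0.1],
    "shares_label": [0.4, 0.3, 0.1, 0.1, 0.1],
}
loss_value = joint_loss(sample, composition_weight=0.5)
batch_value = batch_joint_loss([sample, sample], composition_weight=0.5)
```

The variance appears both in the logarithm and in the denominator, so the network is penalized for predicting a variance that is either too small for its actual error or needlessly large.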

[0038] Step S5: Input the comprehensive feature representation of the project to be evaluated into the cost prediction model to obtain the predicted cost value and confidence interval. Combined with the demand semantic feature vector, generate a cost assessment report that includes the cost prediction results, cost composition decomposition, analysis of major influencing factors, risk warnings and optimization suggestions.

[0039] The cost prediction value and confidence interval are obtained as follows. The comprehensive feature representation of the project to be evaluated is input into the cost prediction model, which outputs the residual prediction value, the residual variance prediction value, the five cost percentage prediction values, and the five dimension gating weight values. Based on the anchor project set, the cost benchmark anchor value of the project to be evaluated is calculated. Specifically, the actual settlement amount and similarity weight of each anchor project are obtained from the anchor project set, and the actual settlement amounts of the anchor projects are weighted by their similarity weights and summed to obtain the cost benchmark anchor value of the project to be evaluated. The residual prediction value is multiplied by the cost benchmark anchor value to obtain the residual adjustment amount. The cost benchmark anchor value and the residual adjustment amount are summed to obtain the cost prediction value, which represents the estimated total cost of the project to be evaluated. Based on the residual variance prediction value and the cost benchmark anchor value, the confidence interval of the project to be evaluated is calculated. Specifically, the square root of the residual variance prediction value is calculated to obtain the predicted residual standard deviation. A standard normal quantile is preset by those skilled in the art according to the confidence requirements of the cost assessment. The predicted residual standard deviation is multiplied by the standard normal quantile to obtain the residual offset.
The difference between the residual prediction value and the residual offset is multiplied by the cost benchmark anchor value, and the product is added to the cost benchmark anchor value to obtain the lower confidence limit of the cost. The sum of the residual prediction value and the residual offset is multiplied by the cost benchmark anchor value, and the product is added to the cost benchmark anchor value to obtain the upper confidence limit of the cost. The lower and upper confidence limits of the cost constitute the confidence interval. Anchor consistency backtesting correction is then performed on the confidence interval. Specifically, for each anchor project in the anchor project set, the training input feature vector of that anchor project is input into the cost prediction model to obtain the corresponding residual prediction value, which is marked as the anchor residual prediction value. The absolute value of the difference between the anchor residual prediction value and the cost prediction label of each anchor project is calculated to obtain the anchor backtesting deviation. The anchor backtesting deviations of all anchor projects are averaged to obtain the average anchor backtesting deviation, which reflects the actual prediction accuracy of the cost prediction model on the group of historical projects most similar to the project to be evaluated. A backtesting deviation amplification factor is preset by those skilled in the art according to the robustness requirements of the cost assessment.
The average anchor backtesting deviation, the backtesting deviation amplification factor, and the cost benchmark anchor value are multiplied together to obtain the empirical correction amount. The empirical correction amount is subtracted from the lower confidence limit of the cost, and the lower confidence limit is updated to the result; the empirical correction amount is added to the upper confidence limit of the cost, and the upper confidence limit is updated to the result. The anchor consistency backtesting correction uses the actual backtesting performance of the cost prediction model on the anchor projects to widen the confidence interval empirically, compensating for the possibility that the interval is too narrow when it relies solely on the model's own residual variance estimate.
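The prediction, interval, and backtesting correction above amount to a short chain of arithmetic. The sketch below follows that chain with hypothetical inputs; the quantile 1.96, the amplification factor 1.5, and all other numbers are illustrative assumptions, since the text leaves them to the practitioner.

```python
import math


def predict_with_interval(anchor_value, residual_pred, variance_pred, quantile):
    """Cost prediction value and uncorrected confidence limits, relative to the anchor value."""
    residual_std = math.sqrt(variance_pred)
    offset = residual_std * quantile
    prediction = anchor_value + residual_pred * anchor_value
    lower = anchor_value + (residual_pred - offset) * anchor_value
    upper = anchor_value + (residual_pred + offset) * anchor_value
    return prediction, lower, upper


def backtest_correction(lower, upper, anchor_preds, anchor_labels, amplification, anchor_value):
    """Widen the interval by the model's average absolute error on the anchor projects."""
    deviations = [abs(p - y) for p, y in zip(anchor_preds, anchor_labels)]
    average_deviation = sum(deviations) / len(deviations)
    correction = average_deviation * amplification * anchor_value
    return lower - correction, upper + correction


prediction, lower, upper = predict_with_interval(
    anchor_value=100.0, residual_pred=0.1, variance_pred=0.0004, quantile=1.96
)

# Residual predictions versus labels for two anchor projects (illustrative).
lower_corrected, upper_corrected = backtest_correction(
    lower, upper,
    anchor_preds=[0.06, 0.12], anchor_labels=[0.05, 0.15],
    amplification=1.5, anchor_value=100.0,
)
```

With these inputs the interval before correction is about 106.08 to 113.92 around a prediction of 110, and the correction widens it by 3 on each side.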

[0040] The cost assessment report is generated as follows. Based on the five cost percentage prediction values and the cost prediction value, the cost composition decomposition result is generated. Specifically, the cost prediction value is multiplied by each of the five cost percentage prediction values to obtain the estimated cost amount of each cost category. The estimated cost amount of each cost category is then integrated with its cost percentage prediction value to form the cost composition decomposition result, which shows how the cost prediction value is allocated across the cost categories.
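As a small worked example of the decomposition, the sketch below multiplies a cost prediction value by five percentages. The category keys and numbers are illustrative, and the percentages are assumed to sum to one, which the text does not state explicitly.

```python
CATEGORIES = (
    "labor",
    "hardware_procurement",
    "software_licensing",
    "implementation_deployment",
    "project_management",
)


def decompose(cost_prediction, shares):
    """Pair each cost category with its predicted percentage and estimated amount."""
    return {
        name: {"share": share, "amount": cost_prediction * share}
        for name, share in zip(CATEGORIES, shares)
    }


breakdown = decompose(110.0, [0.5, 0.2, 0.1, 0.1, 0.1])
total_amount = sum(item["amount"] for item in breakdown.values())
```

When the percentages sum to one, the estimated amounts add back up to the cost prediction value.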

[0041] Based on the five dimension gating weight values and the anchor project set, a layer-by-layer attribution analysis of cost drivers is performed to generate the analysis results of the main influencing factors. Specifically, the five dimension gating weight values are sorted from largest to smallest to obtain a dimension importance ranking; the dimension with the largest gating weight value is marked as the primary cost driving dimension, and the dimension with the second largest gating weight value is marked as the secondary cost driving dimension. For both the primary and secondary cost driving dimensions, a fine-grained attribution calculation is performed within the dimension to identify the feature elements that contribute most to cost, yielding the key driving elements of the primary dimension and of the secondary dimension. Specifically, the feature vector of each anchor project under the primary cost driving dimension and the actual settlement amount of each anchor project are obtained from the anchor project set, and the obtained feature vectors are marked as primary driving vectors. If the primary cost driving dimension is functional complexity, the functional complexity feature vector of the anchor project is obtained; if it is technical difficulty, the technical difficulty feature vector is obtained; and so on for the other dimensions. The values at each element position of the primary driving vectors of all anchor projects are collected to form a numerical sequence for each element position.
Based on the actual settlement amounts of all anchor projects, an amount sequence is constructed in the same anchor project order as the numerical sequences. The absolute value of the Pearson correlation coefficient between each numerical sequence and the amount sequence is calculated to obtain the intra-dimension cost contribution degree of the corresponding element position; this degree reflects the strength of the statistical correlation between each specific feature element and cost within the primary cost driving dimension. All element positions are sorted by intra-dimension cost contribution degree from largest to smallest, and the feature names of the top three element positions are selected as the key driving elements of the primary dimension; the feature name is the original name given to that element position in the definition of the feature vector. The same fine-grained attribution calculation is performed on the secondary cost driving dimension to obtain the key driving elements of the secondary dimension. The cost driving difference index between the project to be evaluated and the anchor project set is then calculated; it includes a primary deviation degree and a secondary deviation degree. Specifically, the feature vector of the project to be evaluated under the primary cost driving dimension is obtained and marked as the primary evaluation vector. The element-wise difference between the primary evaluation vector and the primary driving vector of each anchor project is calculated, and the differences are weighted by the similarity weights and summed to obtain the primary deviation vector. The mean of the absolute values of the elements of the primary deviation vector is calculated to obtain the primary deviation degree.
The primary deviation degree reflects how far the project to be evaluated deviates, in the primary cost driving dimension, from the average of the group of similar historical projects. The same calculation is performed on the secondary cost driving dimension to obtain the secondary deviation degree. For the primary and secondary cost driving dimensions, a cost driver coupling analysis is performed to calculate the inter-dimensional coupling coefficient. Specifically, for each anchor project, the mean of all elements of its feature vector under the primary cost driving dimension and the mean of all elements of its feature vector under the secondary cost driving dimension are calculated to obtain the primary feature mean and the secondary feature mean. From the primary and secondary feature means of all anchor projects, a primary feature mean sequence and a secondary feature mean sequence are constructed, with the anchor projects in the same order in both sequences. The Pearson correlation coefficient between the primary feature mean sequence and the secondary feature mean sequence is calculated to obtain the inter-dimensional coupling coefficient, which reflects the strength of the covariation between the two main cost driving dimensions in historical projects. A coupling significance threshold is preset by those skilled in the art according to statistical significance standards. If the absolute value of the inter-dimensional coupling coefficient is greater than the coupling significance threshold, the primary and secondary cost driving dimensions are marked as a coupled driving dimension pair, and the coupling direction is recorded; the coupling direction is determined by the sign of the inter-dimensional coupling coefficient.
If the inter-dimensional coupling coefficient is positive, the coupling direction is positive coupling; if it is negative, the coupling direction is negative coupling. The primary cost driving dimension, secondary cost driving dimension, key driving elements of the primary dimension, key driving elements of the secondary dimension, primary deviation degree, secondary deviation degree, inter-dimensional coupling coefficient, and coupling direction are integrated to form the analysis results of the main influencing factors. These results reveal the core dimensions driving the cost of the project to be evaluated, the key feature elements within each dimension, the synergistic relationship between dimensions, and the degree of difference between the project to be evaluated and the group of similar historical projects.
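The attribution steps above (ranking, intra-dimension contribution, deviation degree, and coupling) can be sketched together. Everything concrete here is assumed for illustration: the feature names `module_count` and `interface_count`, the threshold 0.5, the toy vectors, and returning `None` as the direction when the pair is not coupled.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either series is constant (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def rank_dimensions(gates):
    """Primary and secondary cost driving dimensions by descending gating weight."""
    ordered = sorted(gates, key=gates.get, reverse=True)
    return ordered[0], ordered[1]


def key_elements(anchor_vectors, amounts, feature_names, top=3):
    """Feature names with the largest absolute correlation to the settlement amounts."""
    degrees = [abs(pearson([v[i] for v in anchor_vectors], amounts))
               for i in range(len(feature_names))]
    order = sorted(range(len(feature_names)), key=lambda i: degrees[i], reverse=True)
    return [feature_names[i] for i in order[:top]]


def deviation_degree(eval_vector, anchor_vectors, weights):
    """Mean absolute value of the similarity-weighted element-wise differences."""
    deviation = [sum(w * (e - v[i]) for w, v in zip(weights, anchor_vectors))
                 for i, e in enumerate(eval_vector)]
    return sum(abs(d) for d in deviation) / len(deviation)


def coupling(primary_vectors, secondary_vectors, threshold):
    """Coupling coefficient of per-anchor feature means, plus direction if significant."""
    coefficient = pearson([sum(v) / len(v) for v in primary_vectors],
                          [sum(v) / len(v) for v in secondary_vectors])
    if abs(coefficient) <= threshold:
        return coefficient, None
    return coefficient, "positive" if coefficient > 0 else "negative"


gates = {"functional_complexity": 0.4, "technical_difficulty": 0.3,
         "labor_cost": 0.1, "time_cost": 0.1, "demand_semantics": 0.1}
primary, secondary = rank_dimensions(gates)

primary_vectors = [[1.0, 5.0], [3.0, 6.0], [2.0, 4.0]]  # three anchor projects
secondary_vectors = [[2.0], [6.0], [4.0]]
amounts = [100.0, 300.0, 200.0]

drivers = key_elements(primary_vectors, amounts, ["module_count", "interface_count"])
deviation = deviation_degree([3.0, 4.0], primary_vectors, [0.5, 0.3, 0.2])
coefficient, direction = coupling(primary_vectors, secondary_vectors, threshold=0.5)
```

Here the first feature tracks the settlement amounts exactly, so it ranks first among the key driving elements, and the two dimensions rise together across the anchors, giving a positive coupling.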

[0042] Based on the demand calibration feature vector, the residual variance prediction value, and the cost composition decomposition result, multi-source risk cross-validation and risk transmission chain deduction are performed to generate risk warning information. Specifically, the quantitative indicator of risk factors is obtained from the demand calibration feature vector, and the demand clarity score, technology maturity score, and schedule constraint score are obtained from that indicator. Based on the residual variance prediction value, the prediction uncertainty risk level is determined. Specifically, the standard deviation of the cost residual ratios of the anchor projects in the anchor project set is calculated to obtain the anchor residual standard deviation. The ratio of the predicted residual standard deviation to the anchor residual standard deviation is calculated to obtain the uncertainty amplification factor, which reflects how much larger the cost prediction model's uncertainty about the project to be evaluated is than the cost volatility of the group of similar historical projects. An uncertainty level set is preset; it contains multiple uncertainty intervals and the prediction uncertainty risk level corresponding to each. The intervals and levels are preset by those skilled in the art according to the accuracy requirements of the cost assessment. The prediction uncertainty risk level is determined by the uncertainty interval into which the uncertainty amplification factor falls. Semantic risk and prediction uncertainty are then cross-validated to generate risk alerts, which may be composite high-risk alerts, model confidence alerts, or single-source risk alerts. Specifically, a set of semantic risk trigger thresholds is preset, including a demand clarity trigger threshold, a technology maturity trigger threshold, and a schedule constraint trigger threshold.
Each threshold is pre-set by those skilled in the art based on their experience in information system cost risk control. If the demand clarity score is lower than the demand clarity trigger threshold, a demand ambiguity risk alert is generated, and the corresponding risk source is marked as a demand dimension. If the technology maturity score is lower than the technology maturity trigger threshold, a technology risk alert is generated, and the corresponding risk source is marked as a technology dimension. If the schedule constraint score is lower than the schedule constraint trigger threshold, a schedule tightness alert is generated, and the corresponding risk source is marked as a schedule dimension. Each risk source is then cross-compared with the predicted uncertainty risk level. If the predicted uncertainty risk level is high and there is at least one risk source, it is classified as a composite high risk, generating a composite high risk warning that includes risk warnings for all risk sources and the predicted uncertainty risk level. If the predicted uncertainty risk level is high but there are no risk sources, it is classified as insufficient model confidence risk, generating a model confidence warning that includes the predicted uncertainty risk level. If the predicted uncertainty risk level is not high but there is at least one risk source, it is classified as a single-source risk, generating a single-source risk warning that includes risk warnings for all risk sources. The risk-to-cost transmission chain is deduced to form a risk transmission chain description. Specifically, a risk transmission mapping set is preset, which contains the transmission mapping relationship between different risk sources and corresponding affected cost categories. The transmission mapping relationship is pre-set by those skilled in the art based on the risk transmission rules of information system projects. 
For example, if the risk source is a demand dimension, the affected cost categories are human resource costs and project management costs; if the risk source is a technology dimension, the affected cost categories are human resource costs and software licensing costs; if the risk source is a schedule dimension, the affected cost categories are human resource costs and implementation and deployment costs. For each risk source, based on the transmission mapping relationship between the risk source and cost categories, the affected cost category most likely to lead to an increase in cost is determined. For each affected cost category… The system identifies cost categories and extracts corresponding estimated cost amounts from cost breakdown results. It pre-sets risk transmission amplification coefficients for each risk source, with each coefficient pre-defined by those skilled in the art based on historical project cost increases following risk occurrences. The estimated cost amount is multiplied by the corresponding risk transmission amplification coefficient to obtain the estimated risk transmission increment. This increment quantifies the potential cost increase for the corresponding cost category after a risk source occurs. Each risk source, affected cost category, and estimated risk transmission increment are integrated to form a risk transmission chain description. The risk transmission chain deduction establishes a causal transmission path from the risk source to the affected cost category and quantifies the transmission consequences, elevating risk warnings from qualitative alerts to structured risk analysis with quantifiable cost impact. The risk warnings and risk transmission links are summarized to form risk warning information. 
The risk warning information is used to reveal to the assessors the multi-source risk factors that affect the reliability of cost prediction, the cross-superposition effect of risks, and the transmission path and quantitative impact of risks to costs.
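The cross-validation and transmission steps of paragraph [0042] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: every numeric threshold, the "high" boundary and the amplification coefficients are hypothetical placeholders (the text leaves them to those skilled in the art), the population standard deviation is an assumed choice, and the increment is computed for every mapped cost category rather than only the one most likely to increase.

```python
from statistics import pstdev

# All numeric values are hypothetical placeholders; the source leaves
# thresholds, level boundaries and coefficients to the practitioner.
TRIGGER_THRESHOLDS = {"demand": 0.6, "technology": 0.6, "schedule": 0.6}
HIGH_RISK_FACTOR = 1.5  # assumed lower bound of the "high" uncertainty interval
TRANSMISSION_MAP = {  # risk source -> affected cost categories (example given in the text)
    "demand": ["human_resources", "project_management"],
    "technology": ["human_resources", "software_licensing"],
    "schedule": ["human_resources", "implementation_deployment"],
}
AMPLIFICATION = {"demand": 0.15, "technology": 0.20, "schedule": 0.10}


def uncertainty_amplification(pred_residual_std, anchor_residual_ratios):
    """Ratio of the predicted residual std to the std of the anchors' residual ratios."""
    anchor_std = pstdev(anchor_residual_ratios)  # population std is an assumption
    if anchor_std == 0:
        raise ValueError("anchor residual ratios have zero spread")
    return pred_residual_std / anchor_std


def cross_validate(scores, factor):
    """Return (alert_type, risk_sources) following the three cases in the text."""
    sources = [dim for dim, thr in TRIGGER_THRESHOLDS.items() if scores[dim] < thr]
    high = factor >= HIGH_RISK_FACTOR
    if high and sources:
        return "composite_high_risk", sources
    if high:
        return "model_confidence", sources
    if sources:
        return "single_source_risk", sources
    return None, sources  # case not described in the text: no alert is raised


def transmission_chain(sources, cost_breakdown):
    """List (risk source, affected cost category, estimated increment) triples."""
    return [
        (src, cat, cost_breakdown[cat] * AMPLIFICATION[src])
        for src in sources
        for cat in TRANSMISSION_MAP[src]
    ]


if __name__ == "__main__":
    factor = uncertainty_amplification(0.16, [0.10, -0.10, 0.05, -0.05])
    alert, sources = cross_validate(
        {"demand": 0.4, "technology": 0.8, "schedule": 0.9}, factor
    )
    breakdown = {"human_resources": 600_000, "project_management": 100_000}
    print(alert, transmission_chain(sources, breakdown))
```

Keeping the thresholds, mapping and coefficients in plain dictionaries mirrors the "preset set" wording of the paragraph and makes each preset value easy to replace.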

[0043] Based on the risk warning information, cost composition decomposition results and main influencing factor analysis results, the cost optimization path deduction and coupled linkage optimization suggestions are generated to form optimization suggestions; Specifically, the cost composition details and similarity weights for each anchor project are obtained from the anchor project set. For each cost category, the cost proportion of all anchor projects in the corresponding cost category is obtained, and a weighted average is calculated based on the similarity weights to obtain the weighted average cost proportion of anchors for each cost category. For each cost category, the difference between the predicted cost proportion of the project to be evaluated and the weighted average cost proportion of the corresponding anchors is calculated to obtain the structural deviation of each cost category. The structural deviation with the largest value is marked as the maximum structural deviation, and the cost category corresponding to the maximum structural deviation is marked as the maximum deviation category. If the maximum structural deviation is greater than a preset cost excess threshold, it is determined that the maximum deviation category has room for optimization, and cost reduction suggestions are generated. The cost excess threshold is preset by those skilled in the art based on industry cost composition standards. The cost reduction suggestions include the maximum deviation category and the maximum structural deviation. Based on the analysis of key influencing factors and risk warning information, targeted cost optimization strategy recommendations are generated. These recommendations include functional optimization, technical solution optimization, requirement clarification, and schedule optimization. 
Specifically, if the primary cost driving dimension is functional complexity and the primary deviation degree exceeds a preset deviation warning threshold, a functional optimization recommendation is generated, suggesting simplification or phased implementation of the functional scope. If the primary cost driving dimension is technical difficulty and the risk warning information includes a technology risk alert, a technical solution optimization recommendation is generated. If the risk warning information includes a demand ambiguity risk alert, a requirement clarification recommendation is generated, suggesting that ambiguous requirements be refined and clarified before project initiation to reduce the risk of additional costs due to requirement changes. If the risk warning information includes a schedule tightness alert, a schedule optimization recommendation is generated, suggesting appropriately extending the project duration or increasing resource input to balance the impact of schedule constraints on cost. The deviation warning threshold is preset by those skilled in the art based on the statistical distribution of historical project cost deviations. Based on the coupling coefficient and coupling direction between dimensions, coupling and linkage optimization suggestions are generated. These suggestions include synergistic cost reduction suggestions and balancing trade-off suggestions. Specifically, if there is a pair of coupled driving dimensions with positive coupling, a synergistic cost reduction suggestion is generated, recommending joint optimization of the key driving elements of both dimensions to achieve a cumulative cost reduction effect. If there is a pair of coupled driving dimensions with negative coupling, a balancing trade-off suggestion is generated, recommending attention to the potential negative impact on the secondary cost driving dimension when optimizing the primary cost driving dimension to avoid cost transfer. 
The coupling and linkage optimization suggestions utilize historical synergistic relationships between dimensions to reveal the linkage effect between multi-dimensional cost driving factors for assessors, thus elevating the optimization strategy from single-dimensional independent optimization to multi-dimensional synergistic optimization. All cost reduction suggestions, cost optimization strategy suggestions, and coupled optimization suggestions are integrated to form optimization suggestions.
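The structural deviation check and the sign-based coupling suggestion of paragraph [0043] can be sketched as follows. This is an illustrative sketch only: the category keys, function names and the example threshold are hypothetical, and `coupling_suggestion` assumes the dimension pair has already been marked as a coupled driving dimension pair.

```python
CATEGORIES = [
    "human_resources", "hardware", "software_licensing",
    "implementation_deployment", "project_management",
]


def weighted_anchor_shares(anchor_shares, weights):
    """Similarity-weighted average cost share per category over the anchor projects."""
    return {
        c: sum(w * shares[c] for shares, w in zip(anchor_shares, weights))
        for c in CATEGORIES
    }


def cost_reduction_suggestion(predicted_shares, anchor_avg, excess_threshold):
    """Return (category, deviation) when the largest structural deviation exceeds the threshold."""
    deviations = {c: predicted_shares[c] - anchor_avg[c] for c in CATEGORIES}
    worst = max(deviations, key=deviations.get)
    if deviations[worst] > excess_threshold:
        return worst, deviations[worst]
    return None  # no category is judged to have room for optimization


def coupling_suggestion(coefficient):
    """Map the sign of the inter-dimensional coupling coefficient to a suggestion type."""
    if coefficient > 0:
        return "synergistic_cost_reduction"
    if coefficient < 0:
        return "balancing_trade_off"
    return None  # a zero coefficient is not covered by the text


if __name__ == "__main__":
    anchors = [
        dict(zip(CATEGORIES, [0.5, 0.2, 0.1, 0.1, 0.1])),
        dict(zip(CATEGORIES, [0.6, 0.1, 0.1, 0.1, 0.1])),
    ]
    avg = weighted_anchor_shares(anchors, [0.6, 0.4])
    predicted = dict(zip(CATEGORIES, [0.65, 0.10, 0.10, 0.08, 0.07]))
    print(cost_reduction_suggestion(predicted, avg, excess_threshold=0.05))
    print(coupling_suggestion(-0.3))
```

The deviation is taken as predicted share minus anchor average, so only categories whose share is higher than in similar historical projects can trigger a cost reduction suggestion.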

[0044] The cost forecast, confidence interval, cost breakdown results, analysis results of major influencing factors, risk warning information and optimization suggestions are integrated to generate a cost assessment report.

[0045] This embodiment constructs a historical project cost database and extracts multi-dimensional cost influencing factors from four dimensions: functional complexity, technical difficulty, labor cost, and time cost. Combined with a pre-built large language model, it deeply extracts explicit functional requirements, implicit technical requirements, quality constraints, and risk factors from the project's requirement documents, generating a requirement semantic feature vector. It then uses actual historical project cost data to calculate the cost sensitivity coefficient of each semantic dimension for calibration, compensating for the large language model's inability to perceive cost correlations. Furthermore, an anchor project matching mechanism aligns the requirement semantic features with historical structured features, effectively addressing the cold-start problem in which the project under evaluation lacks structured cost data. The cost benchmark anchor value and cost residual ratio prediction mechanism transforms the prediction target from the absolute cost value to a relative deviation ratio, eliminating scale differences between projects of different sizes. This effectively reduces the model learning difficulty and improves the balance and stability of prediction accuracy across projects of different scales. It overcomes the shortcoming of traditional deep learning cost prediction models that directly predict the absolute cost value, in which large-scale projects dominate the training process while the accuracy on small-scale projects drops significantly. The dimension-level gating routing mechanism designed in the cost prediction network can, based on the input features, adaptively strengthen the driving dimensions that contribute significantly to the cost and suppress secondary dimensions. This allows the model to automatically identify key cost driving factors when predicting different types of projects. 
Moreover, the gating weights themselves have clear meanings for cost factor analysis, unlike the traditional feature concatenation method that treats all dimensions equally and cannot distinguish the differences in cost contribution of each dimension. Through joint training of the cost residual prediction branch and the cost composition prediction branch, the feature representation learned by the shared feature extraction layer can not only fit the cost value but also understand the internal composition law of the cost, giving the model a structured cognitive ability of cost formation mechanism. In the cost assessment output stage, the confidence interval is empirically broadened and corrected by the anchor point consistency backtesting correction mechanism using the actual backtesting performance of the model on the anchor point projects. This compensates for the deficiency that the confidence interval may be too narrow when relying solely on the uncertainty of the model's residual variance estimation. Through layer-by-layer source tracing analysis of cost driving factors, fine-grained attribution from the dimension level to the feature element level and quantitative analysis of the coupling relationship between dimensions are achieved. Through multi-source risk cross-validation and risk transmission link deduction, qualitative risk warning is upgraded to structured risk analysis with quantifiable cost impact. Through coupled linkage optimization suggestions, the optimization strategy is upgraded from single-dimensional independent optimization to multi-dimensional collaborative optimization. Finally, a comprehensive cost assessment report is generated, which includes cost prediction results, cost composition decomposition, analysis of major influencing factors, risk warnings and optimization suggestions. 
This provides a high-precision, interpretable and decision-guiding intelligent assessment solution for the cost assessment of information system construction projects.

Example 2:

[0046] This application also provides an electronic device. The electronic device may include one or more processors and one or more memories. The memories store computer-readable code, which, when executed by the one or more processors, can perform the information system cost assessment method based on large language models and deep learning as described above.

[0047] The method according to the embodiments of this application can also be implemented using the architecture of the electronic device shown in this application. The electronic device may include a bus, one or more CPUs, ROM, RAM, a communication port connected to a network, input / output, a hard disk, etc. The storage device in the electronic device, such as ROM or hard disk, may store the information system cost evaluation method based on large language models and deep learning provided in this application. Furthermore, the electronic device may also include a user interface. Of course, the architecture shown in this application is merely exemplary; when implementing different devices, one or more components in the electronic device shown in this application may be omitted according to actual needs.

[0048] Example 3

[0049] Please refer to the accompanying drawings. One embodiment of this application discloses a computer-readable storage medium. The computer-readable storage medium stores computer-readable instructions. When the computer-readable instructions are executed by a processor, the information system cost assessment method based on large language models and deep learning according to an embodiment of this application, as described above, can be performed. The storage medium includes, but is not limited to, volatile memory and / or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc.

[0050] Furthermore, according to embodiments of this application, the processes described in the above-referenced flowcharts can be implemented as computer software programs. For example, this application provides a non-transitory machine-readable storage medium storing machine-readable instructions that can be executed by a processor to perform instructions corresponding to the method steps provided in this application, such as an information system cost evaluation method based on large language models and deep learning. When this computer program is executed by a central processing unit (CPU), it performs the functions defined in the method of this application.

[0051] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

[0052] All formulas in this specification are dimensionless and are evaluated on numerical values only. The formulas were obtained by software simulation on a large amount of collected data so as to approximate real-world conditions as closely as possible. The preset parameters and thresholds in the formulas are set by those skilled in the art according to the actual situation.

[0053] Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for evaluating the cost of information systems based on large language models and deep learning, characterized in that it comprises:
Step S1: Collect cost data for each historical project, and clean, label, and structure the cost data to establish a historical project cost database;
Step S2: Analyze the cost influencing factors based on the historical project cost database, and extract feature vectors for functional complexity, technical difficulty, labor cost, and time cost to form a multi-dimensional feature set of cost influencing factors;
Step S3: Input the requirements document of the project to be evaluated into the pre-built large language model, extract the explicit functional requirements, implicit technical requirements, quality constraints and risk factors in the requirements document, generate the requirements semantic feature vector, and align the requirements semantic feature vector with the feature set of multi-dimensional cost influencing factors to construct a comprehensive feature representation of the project to be evaluated;
Step S4: Based on the feature vectors in the feature set of multi-dimensional cost influencing factors and the actual cost results of each historical project in the historical project cost database, construct a cost prediction model by learning the nonlinear mapping relationship between the feature vector combination and the actual cost results;
Step S5: Input the comprehensive feature representation of the project to be evaluated into the cost prediction model to obtain the predicted cost value and confidence interval, and, combined with the requirements semantic feature vector, generate a cost assessment report that includes the cost prediction results, cost composition decomposition, analysis of major influencing factors, risk warnings and optimization suggestions.

2. The information system cost assessment method based on large language models and deep learning according to claim 1, characterized in that, Methods for establishing a historical project cost database include: Cost data for each historical project is collected. The cost data includes basic project information, requirements documents, functional module lists, technical architecture solutions, human resource allocation data, development cycle records, and actual cost results. The actual cost results include the actual settlement amount and cost composition details. The cost composition details include the proportion of labor costs, hardware procurement costs, software licensing costs, implementation and deployment costs, and project management costs. Each cost data point is cleaned to form a cleaned cost dataset; each cost data point in the cleaned cost dataset is labeled to form a labeled dataset; each cost data point in the labeled dataset is structured to form structured indicators for functional scale, technical architecture, human resources, and time dimension, and these are integrated with the actual cost results of the corresponding cost data to form a structured cost record for each historical project; the structured cost records of all historical projects are summarized to establish a historical project cost database; Methods for forming a feature set of multidimensional cost influencing factors include: The functional complexity feature vector, technical difficulty feature vector, labor cost feature vector, and time cost feature vector of each historical project are correlated and integrated to form a multi-dimensional cost influencing factor feature record for each historical project; the multi-dimensional cost influencing factor feature records of all historical projects are summarized to form a multi-dimensional cost influencing factor feature set.

3. The information system cost assessment method based on large language models and deep learning according to claim 2, characterized in that, Methods for generating semantic feature vectors of demand include: Obtain the requirements document for the project to be evaluated and mark it as the evaluation document; input the evaluation document into the large language model, and extract explicit functional requirements, implicit technical requirements, quality constraints, and risk factors in sequence; obtain quantitative indicators for explicit functional requirements, implicit technical requirements, quality constraints, and risk factors; arrange the quantitative indicators for explicit functional requirements, implicit technical requirements, quality constraints, and risk factors in order to form the semantic feature vector of the project requirements; perform cost sensitivity calibration on the semantic feature vector of the requirements to obtain the calibrated semantic feature vector of the requirements. The method for calibrating the cost sensitivity of the demand semantic feature vector is as follows: mark the demand documents of historical projects as historical documents, input all historical documents into the large language model to obtain the demand semantic feature vector of each historical project, and mark it as a historical demand feature vector; for each dimension in the historical demand feature vector, calculate the cost correlation degree of each dimension based on the values of all historical projects in the corresponding dimension and the corresponding actual settlement amount, and calculate the cost sensitivity coefficient of each dimension based on the cost correlation degree; calculate the product of the value of each dimension in the demand semantic feature vector and the corresponding cost sensitivity coefficient to obtain the calibrated demand semantic feature vector.

4. The information system cost assessment method based on large language models and deep learning according to claim 3, characterized in that, Methods for constructing a comprehensive feature representation of a project to be evaluated include: The calibrated demand semantic feature vectors are labeled as demand calibration feature vectors. The cosine similarity between the demand calibration feature vector of the project to be evaluated and the historical demand feature vector of each historical project is calculated to obtain the semantic similarity for each historical project. All historical projects are sorted from high to low semantic similarity, and the top-ranked historical projects, up to a preset number of anchor points, are selected as anchor projects, forming a set of anchor projects. The multidimensional cost influencing factor feature records corresponding to each anchor project in the anchor project set are obtained from the multidimensional cost influencing factor feature set. Based on the semantic similarity of each anchor project in the anchor project set, the similarity weight of each anchor project is calculated. Based on the similarity weight of each anchor project, the elements in each feature vector of each anchor project's multidimensional cost influencing factor feature record are weighted and summed to obtain the estimated functional complexity feature vector, estimated technical difficulty feature vector, estimated labor cost feature vector, and estimated time cost feature vector. These are then concatenated and integrated with the demand calibration feature vector to form a comprehensive feature representation of the project to be evaluated.

5. The information system cost assessment method based on large language models and deep learning according to claim 4, characterized in that, Methods for constructing cost prediction models include: Obtain the historical demand feature vector and cost sensitivity coefficients for each dimension of each historical project, and calculate the demand calibration feature vector for each historical project based on the cost sensitivity coefficients for each dimension. Concatenate the multi-dimensional cost influencing factor feature records of each historical project with the demand calibration feature vector to form the training input feature vector for each historical project. For each historical project, calculate the corresponding cost benchmark anchor value and cost residual ratio, and use the cost residual ratio as the cost prediction label. Obtain the cost composition details from the actual cost results of each historical project to form the cost composition label vector for each historical project. Integrate the training input feature vector, cost prediction label, and cost composition label vector for each historical project to form the training sample for each historical project. Summarize the training samples of all historical projects to form the model training sample set. Construct a cost prediction network, perform joint training on the cost prediction network based on the model training sample set to obtain the batch joint training loss; iteratively update the model parameters in the cost prediction network based on the batch joint training loss until the batch joint training loss converges to the preset loss convergence threshold or the number of training iterations reaches the preset maximum number of iterations, and use the trained cost prediction network as the cost prediction model.

6. The information system cost assessment method based on large language models and deep learning according to claim 5, characterized in that, Methods for constructing cost prediction models also include: The cost prediction network comprises a dimensional gated routing layer, a shared feature extraction layer, a cost residual prediction branch, and a cost composition prediction branch. The dimensional gated routing layer, consisting of fully connected layers and normalized activation functions, calculates five-dimensional gate weights and gated routing feature vectors based on the training input feature vectors. The shared feature extraction layer, connected to the dimensional gated routing layer, contains multiple fully connected layers and nonlinear activation functions, performing deep nonlinear feature transformations on the gated routing feature vectors to output a shared deep feature representation. The cost residual prediction branch, connected to the shared feature extraction layer, includes a residual value prediction head and a residual variance prediction head. The residual value prediction head outputs the predicted residual values, and the residual variance prediction head outputs the predicted residual variance values. The cost composition prediction branch, connected to the shared feature extraction layer, outputs five cost percentage prediction values: labor cost percentage, hardware procurement cost percentage, software licensing cost percentage, implementation and deployment cost percentage, and project management cost percentage.

7. The information system cost assessment method based on large language models and deep learning according to claim 6, characterized in that, Methods for obtaining cost forecasts and confidence intervals include: The comprehensive feature representation of the project to be evaluated is input into the cost prediction model. The cost prediction model outputs the residual prediction value, residual variance prediction value, five cost percentage prediction values, and five dimension gate weight values. Based on the anchor project set, the cost benchmark anchor value of the project to be evaluated is calculated. The product of the residual prediction value and the cost benchmark anchor value is calculated to obtain the residual adjustment amount. The sum of the cost benchmark anchor value and the residual adjustment amount is calculated to obtain the cost prediction value. Based on the residual variance prediction value and the cost benchmark anchor value, the confidence interval of the project to be evaluated is calculated, and the anchor point consistency backtesting correction is performed on the confidence interval. 
The method for performing anchor point consistency backtesting correction on the confidence interval is as follows: For each anchor project in the anchor project set, the training input feature vector of the corresponding anchor project is input into the cost prediction model to obtain the corresponding residual prediction value, which is then marked as the anchor residual prediction value; based on the anchor residual prediction value and the corresponding cost prediction label of each anchor project, the anchor backtesting deviation is calculated; the mean of the anchor backtesting deviations of all anchor projects is calculated to obtain the average anchor backtesting deviation; a backtesting deviation amplification factor is preset, and the average anchor backtesting deviation, the backtesting deviation amplification factor, and the cost benchmark anchor value are multiplied together to obtain the empirical correction amount; based on the empirical correction amount, the lower and upper limits of the confidence interval are updated in turn.

8. The method for cost assessment of information systems based on large language models and deep learning according to claim 7, characterized in that, Methods for generating cost assessment reports include: Based on the five cost percentage predictions and the project cost forecast, a cost composition decomposition result is generated. Based on the five-dimensional gating weight values and the anchor point project set, a layer-by-layer source analysis of cost drivers is performed, generating the analysis result of major influencing factors. Based on the demand calibration feature vector, residual variance predictions, and the cost composition decomposition result, multi-source risk cross-validation and risk transmission link deduction are performed, generating risk warning information. Based on the risk warning information, cost composition decomposition result, and major influencing factor analysis result, a cost optimization path deduction and coupled linkage optimization suggestion generation are performed, forming optimization suggestions. The project cost forecast, confidence interval, cost composition decomposition result, major influencing factor analysis result, risk warning information, and optimization suggestions are integrated to generate a cost assessment report. 
The method for generating the analysis results of the main influencing factors is as follows: Based on the gating weight values of the five dimensions, the primary and secondary cost-driving dimensions are determined; for the primary and secondary cost-driving dimensions, fine-grained attribution calculations are performed within each dimension to obtain the key driving elements of the primary and secondary dimensions; the cost-driving difference index between the project to be evaluated and the anchor project set is calculated; for the primary and secondary cost-driving dimensions, cost-driving coupling analysis between dimensions is performed to calculate the coupling coefficient between dimensions; the primary cost-driving dimension, secondary cost-driving dimension, key driving elements of the primary dimension, key driving elements of the secondary dimension, cost-driving difference index, and coupling coefficient between dimensions are integrated to form the analysis results of the main influencing factors.

9. The method for cost assessment of information systems based on large language models and deep learning according to claim 8, characterized in that the method for generating risk warning information comprises: extracting quantitative indicators of risk factors from the demand calibration feature vector, and obtaining a demand clarity score, a technology maturity score, and a schedule constraint score from these indicators; calculating the model prediction uncertainty risk level based on the residual variance prediction values; cross-validating semantic risk against prediction uncertainty to generate risk warnings, each of which is a composite high-risk warning, a model confidence warning, or a single-source risk warning; performing a risk-to-cost transmission chain deduction to form a risk transmission chain description; and summarizing the risk warnings and the risk transmission chain description to form the risk warning information.

The method for forming the risk transmission chain description is as follows: each risk source is determined by comparing the demand clarity score, technology maturity score, and schedule constraint score with their corresponding preset trigger thresholds; a preset risk transmission mapping set is established, containing the transmission mapping relationships between the risk sources and the cost categories they affect; for each risk source, the affected cost categories are determined from these transmission mapping relationships; for each affected cost category, the corresponding estimated cost amount is obtained from the cost composition decomposition results; a preset risk transmission amplification coefficient is established for each risk source, and the estimated risk transmission increment is calculated based on the estimated cost amount and the risk transmission amplification coefficient; each risk source, affected cost category, and estimated risk transmission increment are integrated to form the risk transmission chain description.
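The risk transmission chain steps can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: a risk source is assumed to trigger when its score falls below its threshold (the claim does not give the comparison direction), and the increment is assumed to be the product of the estimated cost amount and the amplification coefficient (the claim only says it is calculated "based on" the two). All names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChainLink:
    risk_source: str
    cost_category: str
    estimated_increment: float


def build_risk_transmission_chain(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    transmission_map: Dict[str, List[str]],
    amplification: Dict[str, float],
    cost_breakdown: Dict[str, float],
) -> List[ChainLink]:
    chain: List[ChainLink] = []
    for source, score in scores.items():
        # Assumption: a score below its preset threshold marks a risk source.
        if score >= thresholds[source]:
            continue
        for category in transmission_map.get(source, []):
            base_cost = cost_breakdown[category]
            # Assumption: increment = estimated cost amount * amplification coefficient.
            chain.append(ChainLink(source, category, base_cost * amplification[source]))
    return chain


# Hypothetical inputs.
scores = {"demand_clarity": 0.4, "technology_maturity": 0.8, "schedule_constraint": 0.3}
thresholds = {"demand_clarity": 0.5, "technology_maturity": 0.5, "schedule_constraint": 0.5}
transmission_map = {
    "demand_clarity": ["human_resources"],
    "technology_maturity": ["software_licensing"],
    "schedule_constraint": ["human_resources", "project_management"],
}
amplification = {"demand_clarity": 0.25, "technology_maturity": 0.5, "schedule_constraint": 0.5}
cost_breakdown = {
    "human_resources": 400.0,
    "hardware_procurement": 200.0,
    "software_licensing": 100.0,
    "implementation_deployment": 150.0,
    "project_management": 80.0,
}

chain = build_risk_transmission_chain(
    scores, thresholds, transmission_map, amplification, cost_breakdown
)
```

With these inputs, only `demand_clarity` and `schedule_constraint` trigger, so the chain holds three links; `technology_maturity` stays above its threshold and contributes nothing.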

10. The method for cost assessment of information systems based on large language models and deep learning according to claim 9, characterized in that the method for generating optimization suggestions comprises: for each cost category, obtaining the cost percentages of all anchor projects and calculating their average weighted by similarity weights, to obtain the anchor weighted average cost percentage for that category; for each cost category, calculating the structural deviation from the predicted cost percentage and the anchor weighted average cost percentage, and determining the maximum structural deviation and the category in which it occurs; if the maximum structural deviation is greater than the preset cost excess threshold, generating cost reduction suggestions, the cost categories being, in order, human resource costs, hardware procurement costs, software licensing costs, implementation and deployment costs, and project management costs; generating targeted cost optimization strategies based on the analysis results of the main influencing factors and the risk warning information, the strategies including functional optimization, technical solution optimization, requirement clarification, and schedule optimization; using the coupling coefficient between dimensions to determine whether the primary and secondary cost-driving dimensions are marked as a coupled-driving dimension pair and, if so, determining the coupling direction from the coupling coefficient and generating coupled-linkage optimization suggestions based on the coupling coefficient and coupling direction, including collaborative cost reduction suggestions and balanced trade-off suggestions; and integrating all cost reduction suggestions, cost optimization strategies, and coupled-linkage optimization suggestions to form the final optimization recommendation.
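The structural deviation check in this claim can be sketched as below. The sketch assumes the deviation is the signed difference between the predicted cost percentage and the anchor weighted average (so only overshoot counts toward the maximum); the claim does not give the formula. Percentages are expressed as fractions, and the category keys, function names, and numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# Cost categories in the order listed in the claim (key names are illustrative).
CATEGORIES = [
    "human_resources",
    "hardware_procurement",
    "software_licensing",
    "implementation_deployment",
    "project_management",
]


def anchor_weighted_shares(
    anchor_shares: List[Dict[str, float]], similarity: List[float]
) -> Dict[str, float]:
    """Similarity-weighted average cost share per category over the anchor projects."""
    total = sum(similarity)
    if total <= 0:
        raise ValueError("similarity weights must sum to a positive value")
    return {
        c: sum(w * shares[c] for w, shares in zip(similarity, anchor_shares)) / total
        for c in CATEGORIES
    }


def max_structural_deviation(
    predicted: Dict[str, float], anchor_avg: Dict[str, float]
) -> Tuple[str, float]:
    # Assumption: deviation = predicted share - anchor weighted average share.
    deviations = {c: predicted[c] - anchor_avg[c] for c in CATEGORIES}
    worst = max(deviations, key=deviations.get)
    return worst, deviations[worst]


# Hypothetical anchor projects, similarity weights, and predicted shares.
anchors = [
    dict(zip(CATEGORIES, [0.5, 0.2, 0.1, 0.1, 0.1])),
    dict(zip(CATEGORIES, [0.3, 0.3, 0.2, 0.1, 0.1])),
]
similarity = [0.75, 0.25]
predicted = dict(zip(CATEGORIES, [0.6, 0.15, 0.1, 0.1, 0.05]))

anchor_avg = anchor_weighted_shares(anchors, similarity)
worst_category, worst_deviation = max_structural_deviation(predicted, anchor_avg)

COST_EXCESS_THRESHOLD = 0.1  # hypothetical preset threshold
needs_cost_reduction = worst_deviation > COST_EXCESS_THRESHOLD
```

Dividing by the sum of the similarity weights keeps the result a proper weighted average even when the weights are not normalized beforehand.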