Automatic question-answering quality evaluation method for large models in the soil and groundwater field
By using a discriminator based on a large language model and an attention pooling layer for automated evaluation, the professionalism and efficiency issues of large model question-answering quality assessment in the soil and groundwater field are solved. This achieves efficient and low-cost automated evaluation, improving the accuracy and efficiency of the assessment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- RES INST FOR ENVIRONMENTAL INNOVATION SUZHOU TSINGHUA
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-19
AI Technical Summary
Existing methods for assessing the question-answering quality of large models in the soil and groundwater field cannot simultaneously achieve professionalism, efficiency, and low cost. The result is inaccurate assessments and high costs, which makes it difficult to support the application and development of large models in this field.
The dataset is converted into high-dimensional feature vectors by a large language model, and weighted by a discriminator using an attention pooling layer to improve semantic understanding. Combined with a multilayer perceptron for automated evaluation, it replaces traditional manual review and achieves fully automated assessment.
It improves the accuracy and efficiency of assessment, reduces manpower and time costs, enables rapid verification of the question-and-answer quality of large models, and supports the iteration and application of large models in the soil and groundwater fields.
Description
Technical Field
[0001] This application relates to the field of artificial intelligence natural language processing technology, and in particular to an automatic evaluation method for question-answering quality for large models in the soil and groundwater domain.
Background Technology
[0002] With the rapid development of large language model technology, its application in specialized fields has become an important research and engineering direction. In the environmental science subfield of soil and groundwater, constructing large domain models capable of understanding and answering specialized questions in this field is of significant value in improving research efficiency and supporting engineering decision-making. However, how to accurately and efficiently evaluate the performance of large models on knowledge-answering tasks in the soil and groundwater field has become a key bottleneck restricting its application in this area.
[0003] Currently, common evaluation methods include automated evaluation based on general natural language processing (NLP) metrics and expert review. The former, relying solely on NLP metrics to assess the question-answering quality of large-scale models in a specialized domain, is problematic because these metrics are typically calculated based on lexical overlap or shallow semantic similarity. However, the knowledge system in the soil and groundwater domain is built upon a large number of highly precise and irreplaceable technical terms and complex, multi-step physicochemical-biological coupling mechanisms. The quality of the answers provided by large models is extremely sensitive to the accuracy of the use of these core terms and the completeness of the logical chain. Therefore, automated evaluation based solely on NLP metrics cannot accurately reflect the true capabilities of large models in semantic understanding, mechanism deduction, and contextual judgment in the soil and groundwater domain. Expert review, on the other hand, is costly in terms of manpower and time, and inefficient, making it difficult to support the performance iteration and rapid validation needs of large models in soil and groundwater knowledge-answering tasks.
[0004] Therefore, existing methods for assessing the quality of knowledge-based question answering for large models cannot simultaneously achieve professionalism, efficiency, and low cost, thus hindering the application and development of large models in the soil and groundwater fields.
Summary of the Invention
[0005] In view of this, this application provides an automatic evaluation method for question-answering quality of large models in the soil and groundwater field. This method can solve the problem that the existing evaluation methods for question-answering quality of large models in the soil and groundwater field cannot simultaneously achieve professionalism, efficiency and low cost, which restricts the application and development of large models in the soil and groundwater field.
[0006] In a first aspect, this application provides an automatic question-answering quality evaluation method for large models in the soil and groundwater field, the method comprising:
[0007] Obtain the first dataset, which includes the first question, the first answer to be evaluated, the first standard answer, and the first evaluation instruction. The first evaluation instruction is a text description used to guide the evaluation process.
[0008] The first dataset is input into the large language model to obtain the first high-dimensional feature vector output by the large language model. The first high-dimensional feature vector integrates the overall semantic information in the first dataset.
[0009] The first high-dimensional feature vector is input into the discriminator, which includes an attention pooling layer. Based on the attention pooling layer, the first high-dimensional feature vector is subjected to attention weighting to obtain the corresponding global feature vector. The global feature vector is used to amplify the weights of information in the first high-dimensional feature vector that is highly correlated with the soil and groundwater domain, and / or to suppress the weights of information that is not highly correlated.
[0010] The global feature vector is further processed by the discriminator to obtain an automatic scoring result. The automatic scoring result represents the degree of similarity between the first answer to be evaluated and the first standard answer.
[0011] The above technical solution of this application has at least one of the following beneficial effects:
[0012] (1) The first dataset containing key content such as the first answer to be evaluated and the first standard answer is converted into the first high-dimensional feature vector by the large language model, so as to realize the rapid and full extraction of deep semantic information in the first dataset. In conjunction with the discriminator containing the attention pooling layer, attention weighting is performed on the first high-dimensional feature vector. This can amplify the weight of information with high relevance in the soil and groundwater field and suppress the weight of information with low relevance. This effectively improves the discriminator's semantic understanding ability of content related to the soil and groundwater field. As a result, the discriminator's ability to evaluate the question-answering quality of the large model in the soil and groundwater field is close to that of the domain expert, overcoming the limitations of traditional evaluation based on general natural language processing indicators.
[0013] (2) By constructing an automated evaluation process based on a large language model and a discriminator, the fully automated processing from input of large model question-and-answer data in the soil and groundwater domain to output of scores is realized. It can replace the traditional evaluation method that relies on experts to manually review each item, improve evaluation efficiency, significantly reduce labor and time costs, and can be used to support the performance iteration and rapid verification needs of large models in knowledge question-and-answer tasks in the soil and groundwater domain.
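The data flow of the first aspect can be sketched as a composition of three stages. This is a minimal sketch, not the applicant's implementation: the function names `evaluate`, `toy_encode`, `mean_pool`, and `first_component` are hypothetical, and the toy stand-ins only illustrate how the first high-dimensional feature vector, the global feature vector, and the automatic scoring result are passed along.

```python
from typing import Callable, List

Matrix = List[List[float]]  # one row of features per token
Vector = List[float]


def evaluate(prompt: str,
             encode: Callable[[str], Matrix],
             pool: Callable[[Matrix], Vector],
             score: Callable[[Vector], float]) -> float:
    """Run encoder -> pooling -> scoring head on one serialized record."""
    features = encode(prompt)    # first high-dimensional feature vector
    global_vec = pool(features)  # global feature vector
    return score(global_vec)     # automatic scoring result


# Toy stand-ins used only to show the data flow (all hypothetical).
def toy_encode(text: str) -> Matrix:
    """Pretend encoder: one 1-dimensional feature (word length) per word."""
    return [[float(len(word))] for word in text.split()]


def mean_pool(features: Matrix) -> Vector:
    """Plain average over tokens, standing in for attention pooling."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def first_component(vec: Vector) -> float:
    """Pretend scoring head: return the first pooled feature."""
    return vec[0]
```

In a real system, `encode` would wrap the large language model and `pool` and `score` would be the trained parts of the discriminator; keeping them as separate callables mirrors the separation between the language model and the discriminator described above.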
[0014] In one possible implementation of the first aspect above, based on an attention pooling layer, the first high-dimensional feature vector is subjected to attention weighting to obtain the corresponding global feature vector, including:
[0015] The attention pooling layer includes multiple first sub-attention heads, which process the first high-dimensional feature vector simultaneously from different first semantic dimensions to obtain multiple first focused feature vectors. The first semantic dimensions, which are highly relevant to the field of soil and groundwater, include the accuracy of professional terminology, semantic consistency, and readability.
[0016] Each first focused feature vector is multiplied by a trainable enhancement scaling parameter, which is used to scale up the values of the first focused feature vector, to obtain the corresponding enhanced feature component.
[0017] Based on multiple enhanced feature components, the first fused feature vector is obtained by summing them.
[0018] Based on the first fused feature vector and the first high-dimensional feature vector, a multiplication operation is performed to obtain the global feature vector corresponding to the first high-dimensional feature vector.
[0019] This scheme sets up a dedicated first sub-attention head for each first semantic dimension, such as the accuracy of professional terminology, semantic consistency, and readability. This allows the first semantic dimensions, which are highly relevant to the soil and groundwater field, to be given priority, thereby improving the professionalism of the automatic evaluation. By configuring an enhancement scaling parameter for each first sub-attention head, the contribution weight of the first semantic dimensions is amplified, making the evaluation results more in line with the judgment criteria of domain experts. This gives the final automatic scoring results high accuracy and reliability.
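One possible reading of this implementation is sketched below in dependency-free Python. It assumes the first high-dimensional feature vector is a sequence of token vectors `H`, that each first sub-attention head produces softmax weights over the tokens from a learned query vector, and that the final "multiplication operation" is a weighted sum of token vectors. The names `head_weights` and `attention_pool` and the scaled dot-product scoring are assumptions; the application does not specify them.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def head_weights(H, query):
    """One sub-attention head: attention weights over the tokens of H."""
    d = len(query)
    scores = [sum(h * q for h, q in zip(token, query)) / math.sqrt(d)
              for token in H]
    return softmax(scores)  # a "focused feature vector" of length T


def attention_pool(H, queries, scales):
    """Scale each head's weights, sum them, and pool the token vectors."""
    T, d = len(H), len(H[0])
    fused = [0.0] * T
    for query, scale in zip(queries, scales):
        weights = head_weights(H, query)
        for t in range(T):
            fused[t] += scale * weights[t]  # enhanced feature component
    # Multiply the fused weights with H to get the global feature vector.
    return [sum(fused[t] * H[t][j] for t in range(T)) for j in range(d)]
```

With a single head and a scale of 2.0, the pooled vector is exactly twice the one obtained with a scale of 1.0, which shows how the enhancement scaling parameter raises the contribution of a head's semantic dimension.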
[0020] In one possible implementation of the first aspect above, based on an attention pooling layer, attention-weighted processing is performed on the first high-dimensional feature vector to obtain the corresponding global feature vector, and the implementation further includes:
[0021] Multiple second sub-attention heads included in the attention pooling layer attend to the first high-dimensional feature vector from different second semantic dimensions to obtain multiple second focused feature vectors. The second semantic dimensions, which are of low relevance to the soil and groundwater domain, include emotional richness and diversity.
[0022] Each second focused feature vector is multiplied by a trainable weakening scaling parameter, which is used to scale down the values of the second focused feature vector, to obtain the corresponding weakened feature component.
[0023] Based on multiple enhanced feature components and multiple weakened feature components, the second fused feature vector is obtained by summing them.
[0024] Based on the second fused feature vector and the first high-dimensional feature vector, a multiplication operation is performed to obtain the global feature vector corresponding to the first high-dimensional feature vector.
[0025] This scheme sets up a dedicated second sub-attention head for each second semantic dimension, such as emotional richness and diversity, and configures a weakening scaling parameter for each second sub-attention head. This reduces the attention given to the second semantic dimensions, which have low relevance to the soil and groundwater domain, and suppresses the interference of factors unrelated to domain expertise on the automatic scoring results, so that the evaluation focuses on professional quality indicators. Because both enhanced and weakened feature components are taken into account, features are regulated in both directions, which yields a more accurate and reliable global feature vector and further improves the accuracy and reliability of the automatic scoring results output by the discriminator.
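The bidirectional regulation can be illustrated with a short sketch. The application only states that the enhancement and weakening scaling parameters are trainable; the parameterization below, which keeps enhancement scales above 1 through `1 + softplus(a)` and weakening scales between 0 and 1 through a sigmoid, is an assumption chosen for illustration, as are the function names.

```python
import math


def enhancement_scale(raw):
    """Map an unconstrained parameter to a scale greater than 1 (assumed form)."""
    return 1.0 + math.log1p(math.exp(raw))  # 1 + softplus(raw)


def weakening_scale(raw):
    """Map an unconstrained parameter to a scale in (0, 1) (assumed form)."""
    return 1.0 / (1.0 + math.exp(-raw))  # sigmoid(raw)


def fuse(first_focus, raw_enh, second_focus, raw_weak):
    """Sum enhanced and weakened components into the second fused vector."""
    T = len(first_focus[0])
    fused = [0.0] * T
    for weights, raw in zip(first_focus, raw_enh):
        s = enhancement_scale(raw)
        for t in range(T):
            fused[t] += s * weights[t]  # enhanced feature component
    for weights, raw in zip(second_focus, raw_weak):
        s = weakening_scale(raw)
        for t in range(T):
            fused[t] += s * weights[t]  # weakened feature component
    return fused


def global_vector(fused, H):
    """Weighted sum of token vectors H using the fused weights."""
    d = len(H[0])
    return [sum(fused[t] * H[t][j] for t in range(len(H))) for j in range(d)]
```

Constraining the two kinds of scale to disjoint ranges is one way to guarantee that a head assigned to a low-relevance dimension can never outweigh a head assigned to a high-relevance dimension with the same attention pattern.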
[0026] In one possible implementation of the first aspect above, the training process of the discriminator includes:
[0027] Obtain a training sample set, which includes the second dataset and the corresponding human rating results. The second dataset includes the second question, the second answer to be rated, the second standard answer, and the second evaluation instruction.
[0028] The second dataset is input into the large language model to obtain the second high-dimensional feature vector;
[0029] The second high-dimensional feature vector is input into the discriminator to obtain the predicted score result;
[0030] The gradient signal is obtained based on the human rating results, the predicted rating results, and the loss calculation unit of the optimization module, wherein the human rating results serve as the supervision signal of the loss calculation unit.
[0031] Based on gradient signals and the optimizer included in the optimization module, the discriminator is optimized to make the predicted scoring results close to the human scoring results, while keeping the structure and parameters of the large language model unchanged.
[0032] This approach utilizes an architecture combining existing large language models with a trainable discriminator to build an evaluation model. This avoids training a dedicated evaluation model from scratch, fully leveraging the powerful semantic understanding capabilities of existing large language models. Optimization of the evaluation model can be achieved simply by training the discriminator, improving the accuracy of prediction scores. This solution addresses the high costs associated with full-parameter training, which requires large amounts of labeled data and expert involvement, and avoids the dependence on massive computing power and high-performance hardware resources. It enables efficient training of the evaluation model under conditions of limited data and computing resources. Furthermore, during the training of the discriminator, the parameters of the large language model are kept frozen to avoid damaging the pre-trained semantics of the large language model. This ensures that the understanding ability of the large language model in the soil and groundwater domain and the general domain remains stable, thereby ensuring that the ability of the large language model to extract high-dimensional features from the first or second dataset remains stable. By iteratively optimizing the discriminator parameters, the predicted scores continuously approach the human results, enhancing the discrimination ability and convergence efficiency of the evaluation model. This makes the predicted scores output by the discriminator closer to the human scores, thereby improving the professionalism and credibility of automatically scoring the question-answering quality of the large model in the soil and groundwater domain using the discriminator.
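The training arrangement, in which only the discriminator is updated while the large language model stays frozen, can be sketched as follows. This is a deliberately small stand-in: `frozen_llm` is a hypothetical fixed function in place of the language model, the discriminator is reduced to a single linear layer, and mean squared error with plain gradient descent is assumed because the application does not name a specific loss or optimizer.

```python
def frozen_llm(sample):
    """Stand-in for the frozen large language model: a fixed, untrained map."""
    return [2.0 * x for x in sample]


def predict(w, b, features):
    """Linear discriminator head producing a predicted score."""
    return sum(wi * fi for wi, fi in zip(w, features)) + b


def train_discriminator(samples, human_scores, lr=0.05, epochs=2000):
    """Fit only the discriminator parameters (w, b) to the human scores."""
    # Features are computed once; nothing in the loop modifies frozen_llm.
    feats = [frozen_llm(s) for s in samples]
    n = len(feats)
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for f, y in zip(feats, human_scores):
            err = predict(w, b, f) - y  # human score is the supervision signal
            for j, fj in enumerate(f):
                grad_w[j] += 2.0 * err * fj / n
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b
```

Because the features do not change during training, they can be cached once per sample, which is one practical reason a frozen encoder keeps training cost low.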
[0033] In one possible implementation of the first aspect above, optimizing the discriminator based on the gradient signal and the optimizer included in the optimization module includes:
[0034] Based on the gradient signal and the optimizer, the pooling layer in the discriminator is determined to be an attention pooling layer.
[0035] Based on gradient signals and optimizers, it is determined that the attention pooling layer includes multiple first sub-attention heads and multiple second sub-attention heads;
[0036] Based on gradient signals and optimizers, the enhancement ratio parameter corresponding to each first sub-attention head is determined, and the weakening ratio parameter corresponding to each second sub-attention head is determined, so that the predicted scoring results are close to the human scoring results based on the attention pooling layer.
[0037] This scheme determines that the pooling layer of the discriminator ultimately adopts an attention pooling layer structure, enabling the discriminator to autonomously learn the optimal feature attention method. By automatically determining the first sub-attention head and its corresponding enhancement ratio parameter, and the second sub-attention head and its corresponding weakening ratio parameter, it achieves precise control over the importance of each semantic dimension of the second high-dimensional feature vector. This ensures that the predicted score output by the discriminator is highly consistent with the human score, thereby ensuring that the automatic score output by the discriminator has high reliability after it is put into use.
[0038] In one possible implementation of the first aspect described above, optimizing the discriminator based on the gradient signal and the optimizer included in the optimization module further includes:
[0039] Based on gradient signals and optimizers, the type, location, and / or parameters of the normalization layer in the multilayer perceptron included in the discriminator are optimized to improve the generalization ability and robustness of the discriminator.
[0040] And / or, optimize the activation function used in the multilayer perceptron to improve the discriminator's ability to nonlinearly map complex semantics and subtle logical differences in specialized texts in the soil and groundwater domain, and enhance its discriminative sensitivity.
[0041] This scheme optimizes the type, location, and / or parameters of the normalization layer, thereby improving the stability of the activation value distribution of each layer of the multilayer perceptron during the processing of global feature vectors. Ultimately, it enhances the discriminator's generalization ability and robustness to reasonable variants of the first or second high-dimensional feature vectors corresponding to the soil and groundwater domains. Optimizing the selection of activation functions strengthens the discriminator's ability to nonlinearly map complex semantics and subtle logical differences contained in the global feature vectors, thereby improving the accuracy and sensitivity of automatic evaluation.
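The design choices named here, namely the position of the normalization layer and the activation function, can be exposed as options of a single multilayer perceptron block. The sketch below assumes layer normalization and a choice between ReLU and GELU; the application does not name specific normalization types or activation functions, so these and the function names are illustrative assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(v):
    return max(0.0, v)


def gelu(v):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


ACTIVATIONS = {"relu": relu, "gelu": gelu}


def linear(x, W, b):
    """Dense layer: one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def mlp_block(x, W, b, activation="gelu", norm_position="pre"):
    """One MLP layer with configurable activation and normalization position."""
    act = ACTIVATIONS[activation]
    if norm_position == "pre":
        x = layer_norm(x)   # normalize before the linear map
    y = linear(x, W, b)
    if norm_position == "post":
        y = layer_norm(y)   # normalize after the linear map
    return [act(v) for v in y]
```

Treating the activation and the normalization position as arguments makes it straightforward to compare configurations against held-out human scores.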
[0042] In one possible implementation of the first aspect described above, optimizing the discriminator based on the gradient signal and the optimizer included in the optimization module further includes:
[0043] An external module is set on the discriminator and optimized. The external module includes a residual structure and / or a compression-excitation module. The residual structure is used to enhance the stability of the first high-dimensional feature vector or the second high-dimensional feature vector in the discriminator. The compression-excitation module is used to perform adaptive recalibration on the first high-dimensional feature vector or the second high-dimensional feature vector.
[0044] This scheme introduces a residual structure, which ensures the stable transmission of the first or second high-dimensional feature vector through the deep network and avoids attenuation of the gradient signal during transmission, thereby improving training efficiency and model performance. By employing a compression-excitation module, adaptive recalibration of feature channels can be achieved, dynamically enhancing important features and suppressing redundant information, thus improving the discriminative power and accuracy of the feature representation.
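A compact sketch of the two external modules follows, assuming the feature vector is a sequence of token vectors `H`. The compression-excitation step is modeled on the usual squeeze-and-excitation pattern (average over tokens, a small bottleneck with ReLU, sigmoid gates per channel); the weight shapes and the function names `squeeze_excite` and `residual` are assumptions, not details given in the application.

```python
import math


def matvec(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def squeeze_excite(H, W1, W2):
    """Recalibrate feature channels of H with learned per-channel gates."""
    T, d = len(H), len(H[0])
    # Squeeze: average each channel over all tokens.
    z = [sum(H[t][j] for t in range(T)) / T for j in range(d)]
    # Excitation: bottleneck with ReLU, then one sigmoid gate per channel.
    hidden = [max(0.0, v) for v in matvec(W1, z)]
    gates = [sigmoid(v) for v in matvec(W2, hidden)]
    return [[H[t][j] * gates[j] for j in range(d)] for t in range(T)]


def residual(H, fn):
    """Residual connection: add the module output back onto its input."""
    out = fn(H)
    return [[a + b for a, b in zip(row_in, row_out)]
            for row_in, row_out in zip(H, out)]
```

For example, `residual(H, lambda X: squeeze_excite(X, W1, W2))` keeps the original features on the skip path while adding the recalibrated ones, so the input signal is preserved even when the gates are small.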
[0045] In one possible implementation of the first aspect above, based on the global feature vector, and further processed by a discriminator, an automatic scoring result is obtained, including:
[0046] The global feature vector is input into the multilayer perceptron included in the discriminator;
[0047] Based on a multilayer perceptron, the global feature vector is transformed and mapped layer by layer to obtain sub-scores for multiple quality dimensions;
[0048] The automatic scoring result is obtained based on the sub-scores for the multiple quality dimensions and the output layer of the discriminator.
[0049] This scheme utilizes a multilayer perceptron to perform layer-by-layer transformation and mapping of global feature vectors, enabling in-depth processing and refined analysis of these vectors and enhancing the discriminator's ability to understand complex semantic relationships. Furthermore, by employing sub-scoring across multiple quality dimensions, it achieves a multi-faceted evaluation of question-and-answer quality, covering key dimensions including accuracy and readability. This results in an automatic scoring system with good comprehensiveness, consistency, and reliability.
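How sub-scores might be combined into a single result is sketched below. The application does not specify the output layer, so this sketch assumes each sub-score is a sigmoid of a linear projection of the global feature vector and that the overall score is a softmax-weighted average of the sub-scores. The dimension names in `QUALITY_DIMENSIONS` are illustrative; the text mentions accuracy and readability among the key dimensions.

```python
import math

# Hypothetical dimension names, used only for labeling the sub-scores.
QUALITY_DIMENSIONS = ("terminology_accuracy", "semantic_consistency", "readability")


def sub_scores(global_vec, W, b):
    """One score in (0, 1) per quality dimension."""
    scores = []
    for row, bias in zip(W, b):
        logit = sum(w * g for w, g in zip(row, global_vec)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


def overall_score(subs, raw_weights):
    """Softmax-weighted average of the sub-scores (assumed output layer)."""
    m = max(raw_weights)
    exps = [math.exp(r - m) for r in raw_weights]
    total = sum(exps)
    return sum(s * e / total for s, e in zip(subs, exps))
```

Because the weights are normalized, the overall score stays in the same range as the sub-scores, and `dict(zip(QUALITY_DIMENSIONS, subs))` gives a per-dimension breakdown alongside the single result.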
[0050] In a second aspect, this application provides an electronic device including a processor and a memory, wherein the memory stores at least one instruction or at least one program, and the at least one instruction or at least one program is loaded and executed by the processor to implement the automatic question-answering quality evaluation method for large models in the soil and groundwater field disclosed in the first aspect and any possible implementation thereof.
[0051] In a third aspect, this application provides a computer-readable storage medium storing at least one instruction or at least one program, wherein the at least one instruction or at least one program is loaded and executed by a processor to implement the automatic question-answering quality evaluation method for large models in the soil and groundwater domain disclosed in the first aspect and any possible implementation thereof.
[0052] In a fourth aspect, this application provides a computer program product comprising computer instructions that, when executed on an electronic device, cause the electronic device to execute the automatic question-answering quality evaluation method for large models in the soil and groundwater domain disclosed in the first aspect and any possible implementation thereof.
[0053] It should be understood that, for the beneficial effects of the second to fourth aspects mentioned above, reference may be made to the beneficial effects of the first aspect and of any possible implementation of the first aspect, which will not be repeated here.
Attached Figure Description
[0054] Figure 1 is a schematic diagram illustrating the application scenario of an automatic evaluation device used to perform the automatic question-answering quality evaluation method for large models in the soil and groundwater field in an embodiment of this application;
[0055] Figure 2 is a flowchart of the automatic question-answering quality evaluation method for large models in the soil and groundwater field, as described in this application;
[0056] Figure 3 is a schematic diagram of the structure of the large language model and discriminator in the embodiments of this application;
[0057] Figure 4 is a flowchart of step S400 in an embodiment of this application;
[0058] Figure 5 is a flowchart of a method for performing attention weighting processing on the first high-dimensional feature vector to obtain the corresponding global feature vector in step S300 of this application embodiment;
[0059] Figure 6 is a flowchart of another method for performing attention weighting processing on the first high-dimensional feature vector to obtain the corresponding global feature vector in step S300 of this application embodiment;
[0060] Figure 7 is one structure of the attention pooling layer in the embodiments of this application;
[0061] Figure 8 is a flowchart illustrating the training process of the discriminator in the embodiments of this application;
[0062] Figure 9 is a schematic diagram of the structure of the optimization module in the embodiments of this application;
[0063] Figure 10 is a block diagram of the electronic device in the embodiments of this application;
[0064] Figure 11 is a block diagram of a system-on-chip (SoC) in the embodiments of this application.
Detailed Implementation
[0065] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0066] The technical problems to be solved by the embodiments of this application will be described below.
[0067] As described above, in order to accelerate the question-answering application of large models in the soil and groundwater domain, it is necessary to evaluate the question-answering quality of large models in the soil and groundwater domain during the training process, and optimize large models based on the question-answering quality evaluation results.
[0068] Currently, common evaluation methods include automated evaluation based on general natural language processing (NLP) metrics and manual review by experts. The former, however, assesses the question-answering quality of large-scale models in a specialized domain solely based on general NLP metrics, which are typically calculated based on lexical overlap or shallow semantic similarity. Therefore, the evaluation results are difficult to accurately reflect the true capabilities of large-scale models in semantic understanding, mechanism deduction, and contextual judgment in the soil and groundwater domain. The latter method suffers from high labor and time costs and low efficiency, making it difficult to support the performance iteration and rapid validation needs of large-scale models in specialized domain knowledge question-answering tasks.
[0069] Furthermore, while theoretically feasible, training a completely new, dedicated assessment model from scratch for the soil and groundwater field presents significant challenges. Soil and groundwater, being a subfield of environmental science, have limited available question-and-answer data for training. Moreover, the training process requires large-scale, refined evaluation of responses by domain experts to provide reference data for the model's training, resulting in extremely high labor costs. Additionally, full-parameter training typically demands substantial computing resources and high-performance hardware, further increasing training costs. Therefore, considering factors such as data scale, expert involvement costs, and computing resource investment, full-parameter training is not suitable for constructing assessment models for the soil and groundwater field.
[0070] It can be seen that existing methods for assessing the quality of knowledge-based question answering for large models cannot simultaneously achieve professionalism, efficiency, and low cost, thus hindering the application and development of large models in the soil and groundwater fields.
[0071] Therefore, to address the aforementioned issues, this application provides an automatic question-answering quality evaluation method for large-scale models in the soil and groundwater domain. On one hand, by using a large language model, the first dataset containing key content such as the first answer to be evaluated and the first standard answer is converted into a first high-dimensional feature vector, enabling rapid and comprehensive extraction of deep semantic information from the first dataset. Furthermore, in conjunction with a discriminator containing an attention pooling layer, attention weighting is applied to the first high-dimensional feature vector. This amplifies the weights of information highly relevant to the soil and groundwater domain and suppresses the weights of information with low relevance, effectively improving the discriminator's semantic understanding of content related to the soil and groundwater domain. Consequently, the discriminator's question-answering quality evaluation capability for large-scale models in the soil and groundwater domain approaches that of domain experts, overcoming the limitations of traditional evaluation methods based on general natural language processing metrics. On the other hand, by constructing an automated evaluation process based on a large language model and discriminator, the fully automated processing from input of large model question-and-answer data in the soil and groundwater domain to score output is realized. This can replace the traditional evaluation method that relies on experts to manually review each item, improve evaluation efficiency, and significantly reduce labor and time costs. It can be used to support the performance iteration and rapid verification needs of large models in knowledge question-and-answer tasks in the soil and groundwater domain.
[0072] To better understand the automatic question-answering quality evaluation method for large models in the soil and groundwater domain in the embodiments of this application, the application scenario of this automatic evaluation method is first explained below with reference to Figure 1. Figure 1 shows the application scenario of an automatic evaluation device for performing the automatic evaluation method in the embodiments of this application.
[0073] As shown in Figure 1, during the training of the large model for soil and groundwater, the automatic evaluation device compares the model's output answers with expert-provided standard answers and outputs automatic evaluation results. These results indicate the quality of the model's question-answering capabilities in the soil and groundwater domain. Based on these results, the optimization direction for the model's question-answering ability can be identified, providing an optimization reference for the model.
[0074] The automatic question-answering quality evaluation method for large models in the soil and groundwater domain, as described in the embodiments of this application, is described in detail below with reference to Figures 2 and 3. Figure 2 shows the specific flow of the automatic evaluation method in the embodiments of this application. Figure 3 shows the structure of the large language model and discriminator in the embodiments of this application.
[0075] As shown in Figures 2 and 3, the automatic question-answering quality evaluation method for large models in the soil and groundwater domain in this application embodiment may include the following steps S100-S400:
[0076] Step S100: Obtain the first dataset, which includes the first question, the first answer to be evaluated, the first standard answer, and the first evaluation instruction.
[0077] Specifically, the first dataset is the input set used to evaluate the quality of question-and-answer sessions in this embodiment of the application. The first question can be a specialized query in the field of soil and groundwater, such as a question about the migration mechanism of pollutants; the first answer to be evaluated is the response generated by a large model in the soil and groundwater field for the first question; the first standard answer is an authoritative answer provided by a domain expert; and the first evaluation instruction is a textual description used to guide the evaluation process, such as "Please evaluate the closeness between the first answer to be evaluated and the first standard answer in terms of the accuracy of technical terminology and logical consistency."
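One record of the first dataset can be represented as a small immutable structure with a method that serializes its four parts into a single text input. This is a sketch under stated assumptions: the class name `EvaluationRecord`, the field names, and the prompt layout in `to_prompt` are hypothetical, since the application does not prescribe how the four parts are concatenated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationRecord:
    """One item of the first dataset (field names are illustrative)."""
    question: str
    answer_to_evaluate: str
    standard_answer: str
    evaluation_instruction: str

    def to_prompt(self) -> str:
        """Serialize the four fields into one text input for the model."""
        return "\n".join([
            f"Instruction: {self.evaluation_instruction}",
            f"Question: {self.question}",
            f"Standard answer: {self.standard_answer}",
            f"Answer to evaluate: {self.answer_to_evaluate}",
        ])
```

Placing the evaluation instruction first is a design choice made here so that the model reads the evaluation criteria before the content being compared; other orderings would work equally well with this structure.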
[0078] The construction of the first dataset can be based on real-world application scenarios, such as collecting data from professional databases in the field of soil and groundwater or from expert review records, to ensure the representativeness and diversity of the data.
[0079] As an example, the specific content of the set of data included in the first dataset can be:
[0080] The first question: "What is the role of the established random inspection mechanism for soil pollution status investigation and assessment reports?"
[0081] The first answer to be evaluated: "The role of the random inspection mechanism for soil pollution status investigation and assessment reports mainly includes the following aspects:"
[0082] Ensuring data accuracy: Random checks can verify whether the data and conclusions of soil pollution investigation and assessment reports are true and reliable, preventing information distortion caused by human error or deliberate tampering.
[0083] Improving work quality: The spot check mechanism can urge relevant institutions and personnel to carry out their work in strict accordance with standards and norms, thereby improving the overall quality and professional level of soil pollution investigation and assessment.
[0084] Ensuring public safety: Soil pollution is directly related to the ecological environment and public health. Through random inspections, potential pollution problems can be identified and addressed, preventing the spread of pollution and protecting the lives and property of the people.
[0085] Promoting the enforcement of laws and regulations: The random inspection mechanism helps ensure the effective implementation of relevant laws and regulations, deters violations, and safeguards the authority and seriousness of the law.
[0086] Optimize resource allocation: By analyzing the results of spot checks, high-risk areas or projects can be identified, thereby allocating governance resources more effectively and improving the efficiency and effectiveness of pollution control.
[0087] Enhancing social trust: A transparent and open inspection process and results can increase public trust and support for soil pollution control efforts, thereby promoting social harmony and stability.
[0088] In conclusion, the random inspection mechanism for soil pollution status investigation and assessment reports is an important means to ensure soil environmental quality and public health and safety, and is of great significance for promoting ecological civilization and sustainable development.
[0089] The first standard answer states: "The main function of the established random inspection mechanism for soil pollution status investigation and assessment reports is to strengthen the quality management and supervision of soil pollution status investigations. Through random inspections, the objectivity, accuracy, and standardization of investigation and assessment reports are ensured, preventing data falsification or inaccurate information. This mechanism helps improve the overall level of soil pollution status investigations, providing a reliable basis for subsequent construction land access management, risk control, and remediation decisions, thereby effectively preventing soil pollution risks and ensuring the safety and compliance of land development and utilization."
[0090] The first evaluation instruction is: "Please assess how similar the first answer to be evaluated is to the first standard answer, and score the first answer to be evaluated based on the degree of similarity. The closer the first answer to be evaluated is to the first standard answer, the higher the score will be."
[0091] In step S100, by constructing the first dataset, it is possible to ensure that the evaluation process has a clear input structure and objective, providing a data foundation for subsequent semantic extraction and automatic scoring. Structured data input avoids evaluation result errors caused by inconsistent input formats, allowing the text structure of the first evaluation instruction to be determined based on structured data. This enhances the professional guidance of the first evaluation instruction and improves the accuracy and reliability of subsequent automatic scoring.
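The structured input described in step S100 can be sketched as a small record type. This is a minimal illustration only: the class name `EvalSample`, its field names, the `to_prompt` layout, and the shortened sample texts are assumptions made for this example and are not specified in this application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSample:
    """One record of the first dataset (names are illustrative)."""
    question: str
    answer_to_evaluate: str
    standard_answer: str
    evaluation_instruction: str

    def to_prompt(self) -> str:
        # Join the four parts into one text so that a downstream encoder
        # sees them as a whole rather than in isolation.
        return "\n".join([
            f"Instruction: {self.evaluation_instruction}",
            f"Question: {self.question}",
            f"Answer to evaluate: {self.answer_to_evaluate}",
            f"Standard answer: {self.standard_answer}",
        ])


# Shortened placeholder texts, not the full example from the description.
sample = EvalSample(
    question="What is the role of the random inspection mechanism?",
    answer_to_evaluate="It verifies that report data are reliable.",
    standard_answer="It strengthens quality management of investigations.",
    evaluation_instruction="Score the answer by closeness to the standard answer.",
)
prompt = sample.to_prompt()
```

Keeping the four fields in a fixed order gives every sample the same input format, which is the property paragraph [0091] relies on.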
[0092] Step S200: Input the first dataset into the large language model to obtain the first high-dimensional feature vector output by the large language model. The first high-dimensional feature vector integrates the overall semantic information in the first dataset.
[0093] The large language model in this application embodiment is different from the aforementioned large model in the soil and groundwater domain. Instead, it refers to an independent deep learning model with billions or even hundreds of billions of parameters, pre-trained on large-scale text data, which is capable of understanding, generating, and reasoning about natural language.
[0094] The specific type of large language model used in this application embodiment is not limited; it can be any general or special large language model that can achieve the relevant functions.
[0095] For example, when using Qwen3-8B as a large language model, its deep transformer architecture can encode the first input dataset, extract lexical, syntactic and semantic features, and output the first high-dimensional feature vector.
[0096] Specifically, the large language model treats the text content of the four parts included in the first dataset (the first question, the first answer to be evaluated, the first standard answer, and the first evaluation instruction) as a whole, rather than encoding the four parts in isolation. During encoding, the large language model can therefore capture the semantic relationships between them: for example, that the first answer to be evaluated is a response to the first question, and that it needs to be compared with the first standard answer to obtain a quality score for the first answer to be evaluated.
[0097] Therefore, the first high-dimensional feature vector output by the large language model captures the complex relationships between the first question, the first answer to be evaluated, the first standard answer, and the first evaluation instruction, such as the accuracy and logical coherence of professional terms in the field of soil and groundwater, thus avoiding the limitations of traditional shallow semantic evaluation based on lexical overlap.
[0098] It can be understood that the first high-dimensional feature vector is a vector composed of hundreds or thousands of numbers, which contains deep information such as semantics, logic, and style. Since the first high-dimensional feature vector can be directly used by the computer for mathematical operations and contains rich semantic information, it can provide rich feature input for the discriminator.
[0099] By leveraging the powerful semantic understanding capabilities of large language models, rapid and comprehensive feature extraction from texts in the soil and groundwater domain is achieved, avoiding the limitations of traditional methods based on shallow semantic similarity calculations. This provides high-quality feature input for subsequent discriminators, improving the efficiency and accuracy of the entire evaluation process.
[0100] Step S300: Input the first high-dimensional feature vector into the discriminator. The discriminator includes an attention pooling layer. Based on the attention pooling layer, the first high-dimensional feature vector is subjected to attention weighting to obtain the corresponding global feature vector. The global feature vector is used to amplify the weights of information in the first high-dimensional feature vector that is highly correlated with the soil and groundwater domain, and / or to suppress the weights of information with low correlation.
[0101] Specifically, the attention pooling layer can optimize the first high-dimensional feature vector by increasing the weight of information highly correlated with the soil and groundwater domain in the first high-dimensional feature vector, or by decreasing the weight of information uncorrelated with the soil and groundwater domain. This highlights the information highly correlated with the soil and groundwater domain in the first high-dimensional feature vector, enabling the discriminator to focus on key factors in the professional field, thereby improving the professionalism and discrimination sensitivity of the evaluation, and making the automatic scoring results closer to the human scoring results.
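As a point of reference, one common generic formulation of attention pooling is sketched below. It assumes that the large language model exposes one vector per token and that a learned query vector scores each token. Both are assumptions made for illustration; the specific weighting used in this application is described in steps S310 to S380.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_vectors, query):
    """Pool T token vectors (each of size d) into one vector of size d."""
    # Relevance score of each token: dot product with the query vector.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in token_vectors]
    weights = softmax(scores)
    d = len(query)
    # Weighted sum of the token vectors, feature by feature.
    pooled = [sum(w * h[j] for w, h in zip(weights, token_vectors)) for j in range(d)]
    return pooled, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # hypothetical learned query favouring the first feature
pooled, weights = attention_pool(tokens, query)
```

Tokens whose vectors align with the query receive larger weights, so information judged relevant is amplified and the rest is suppressed in the pooled vector.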
[0102] Step S400: Based on the global feature vector, and continuing through the discriminator, obtain the automatic scoring result, which is used to represent the degree of similarity between the first answer to be evaluated and the first standard answer.
[0103] The discriminator quantifies the question-answering quality of the large soil and groundwater model by outputting automatic scoring results, thereby achieving the evaluation of the question-answering performance of the large soil and groundwater model and providing optimization reference for large models in the field of soil and groundwater.
[0104] Based on the above steps S100-S400, this embodiment of the application combines the feature extraction capability of the large language model and the attention pooling layer of the discriminator to realize the automated and professional evaluation of the question-answering quality in the soil and groundwater domain. It overcomes the shortcomings of traditional methods that rely on manual evaluation or automatic evaluation based on general indicators in terms of efficiency, cost and professional semantic understanding, and realizes the automated and efficient evaluation of the question-answering quality of the large model in the soil and groundwater domain.
[0105] Therefore, compared with the prior art, the automatic question-answering quality evaluation method for large models in the soil and groundwater field provided in this application has at least the following beneficial effects:
[0106] (1) The first dataset containing key content such as the first answer to be evaluated and the first standard answer is converted into the first high-dimensional feature vector by the large language model, so as to realize the rapid and full extraction of deep semantic information in the first dataset. In conjunction with the discriminator containing the attention pooling layer, attention weighting is performed on the first high-dimensional feature vector. This can amplify the weight of information with high relevance in the soil and groundwater field and suppress the weight of information with low relevance. This effectively improves the discriminator's semantic understanding ability of content related to the soil and groundwater field. As a result, the discriminator's ability to evaluate the question-answering quality of the large model in the soil and groundwater field is close to that of the domain expert, overcoming the limitations of traditional evaluation based on general natural language processing indicators.
[0107] (2) By constructing an automated evaluation process based on a large language model and a discriminator, the fully automated processing from input of large model question-and-answer data in the soil and groundwater domain to output of scores is realized. It can replace the traditional evaluation method that relies on experts to manually review each item, improve evaluation efficiency, significantly reduce labor and time costs, and can be used to support the performance iteration and rapid verification needs of large models in knowledge question-and-answer tasks in the soil and groundwater domain.
[0108] Referring to Figure 4, Figure 4 shows the specific process, in step S400 of this embodiment, of obtaining the automatic scoring result based on the global feature vector through further processing by the discriminator.
[0109] As shown in Figure 3 and Figure 4, in this embodiment of the application, obtaining the automatic scoring result in step S400 based on the global feature vector through further processing by the discriminator may specifically include the following steps S410 to S430:
[0110] Step S410: Input the global feature vector into the multilayer perceptron included in the discriminator.
[0111] Specifically, the multilayer perceptron (MLP) is the latter part of the discriminator, consisting of multiple fully connected layers.
[0112] Step S420: Based on the multilayer perceptron, the global feature vector is transformed and mapped layer by layer to obtain sub-scores for multiple quality dimensions.
[0113] Specifically, the number of layers in a multilayer perceptron can be determined experimentally, and the dimension of the output layer of the multilayer perceptron is matched with the quality dimension.
[0114] Each layer of the multilayer perceptron applies a linear transformation and mapping to its input, starting from the global feature vector, and progressively extracts higher-level feature representations. Ultimately, the output layer of the multilayer perceptron generates sub-scores for multiple quality dimensions, each sub-score quantifying the performance of the first answer to be evaluated in a specific quality dimension. The quality dimensions include at least accuracy and readability.
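Steps S410 and S420 can be sketched as a plain forward pass through fully connected layers. The layer sizes, the parameter values, and the ReLU nonlinearity between hidden layers are assumptions made for this illustration. The description only states that the layers transform and map the global feature vector and that the output dimension matches the number of quality dimensions.

```python
def linear(x, weights, bias):
    # weights holds one row per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def mlp_sub_scores(global_vec, layers, dimensions):
    """Run the vector through the layers; the last layer has one output
    per quality dimension."""
    h = global_vec
    for i, (w, b) in enumerate(layers):
        h = linear(h, w, b)
        if i < len(layers) - 1:  # nonlinearity between hidden layers (assumed)
            h = relu(h)
    return dict(zip(dimensions, h))


# Toy parameters for a 3 -> 2 -> 2 network (hypothetical values).
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, 0.5], [-0.4, 0.9]], [0.2, 0.0]),
]
sub_scores = mlp_sub_scores([1.0, 2.0, 3.0], layers, ["accuracy", "readability"])
```

With these toy values the network returns one raw sub-score per dimension, which the output layer of the discriminator can then fuse.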
[0115] Step S430: Based on the sub-scores of the multiple quality dimensions and the output layer of the discriminator, obtain the automatic scoring result.
[0116] Specifically, the output layer of the discriminator receives all sub-scores and generates the final automatic scoring result through weighted summation or non-linear fusion. If weighting is involved in calculating the automatic scoring result from the sub-scores, the weights can be learned through training or set according to domain priors; for example, the accuracy dimension may be assigned a higher weight and the readability dimension a lower weight.
[0117] In some embodiments of this application, the output layer of the discriminator uses a sigmoid or linear activation function to map the scores to an expression in a specified format, thereby obtaining an automatic scoring result in a specified format.
[0118] This application does not specifically limit the format of the automatic scoring results output by the discriminator. For example, the automatic scoring result can be any number from 0 to 1, in which case the discriminator adopts a normalized scoring system; the automatic scoring result can also be any number from 0 to 100, in which case the discriminator adopts a percentage system; the automatic scoring result can also be any letter from A, B, C, D, in which case the discriminator adopts a graded system.
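The fusion in step S430 and the three output formats mentioned above can be sketched as follows. The weight values and the grade cutoffs are hypothetical, since this application does not define them. The sketch uses the weighted-summation option followed by a sigmoid.

```python
import math


def fuse(sub_scores, weights):
    # Weighted sum of sub-scores; weights may be learned or set from priors.
    return sum(weights[name] * value for name, value in sub_scores.items())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def to_percent(p):
    # Map a normalized score in (0, 1) to the percentage system.
    return round(100.0 * p, 1)


def to_grade(p, cutoffs=((0.85, "A"), (0.70, "B"), (0.50, "C"))):
    # Cutoffs are hypothetical; the source does not define grade boundaries.
    for threshold, letter in cutoffs:
        if p >= threshold:
            return letter
    return "D"


sub_scores = {"accuracy": 1.5, "readability": 0.5}
weights = {"accuracy": 0.7, "readability": 0.3}  # accuracy weighted higher
normalized = sigmoid(fuse(sub_scores, weights))
percent = to_percent(normalized)
grade = to_grade(normalized)
```

The same normalized value drives all three formats, so switching the scoring system changes only the final mapping, not the discriminator itself.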
[0119] It is understood that in this embodiment, the global feature vector is transformed and mapped layer by layer based on a multilayer perceptron, achieving in-depth processing and refined analysis of the global feature vector, thus enhancing the discriminator's ability to understand complex semantic relationships. Furthermore, through sub-scoring of multiple quality dimensions, a multi-faceted evaluation of question-and-answer quality is achieved, covering at least two key quality dimensions, including accuracy and readability. This results in an automatic scoring result obtained based on multiple sub-scorings that exhibits good comprehensiveness, consistency, and reliability.
[0120] Referring to Figure 5, Figure 5 shows a specific process, in step S300 of this embodiment, of performing attention weighting on the first high-dimensional feature vector to obtain the corresponding global feature vector.
[0121] As shown in Figure 5, in some embodiments of this application, performing attention weighting on the first high-dimensional feature vector in step S300 to obtain the corresponding global feature vector may specifically include the following steps S310 to S340:
[0122] Step S310: Through multiple first sub-attention heads included in the attention pooling layer, the first high-dimensional feature vector is processed in parallel from different first semantic dimensions to obtain multiple first focused feature vectors. The first semantic dimensions include professional terminology accuracy, semantic consistency, and readability, which are highly relevant to the soil and groundwater field.
[0123] Specifically, the first sub-attention head is a parallel processing unit specially configured in the attention pooling layer. Each first sub-attention head is configured with an independent query, key, and value matrix to focus on the first high-dimensional feature vector from different first semantic dimensions.
[0124] The terminology accuracy dimension focuses on the correct use of terminology in the soil and groundwater field, such as whether "hydraulic conductivity coefficient" is used accurately rather than a vague expression. The semantic consistency dimension assesses the consistency of the first answer to be evaluated and the first standard answer in terms of conceptual logic, such as whether the mechanism description is complete. The readability dimension measures the clarity of language expression and the rationality of the organizational structure.
[0125] Each first sub-attention head projects the input first high-dimensional feature vector through its independent query matrix to generate the corresponding first focused feature vector.
[0126] The existence of multiple first sub-attention heads enables the discriminator to extract information from the first high-dimensional feature vector from multiple first semantic dimensions simultaneously, avoiding the limitations of a single perspective.
[0127] In some other embodiments of this application, the first semantic dimension may also include logical rigor.
[0128] Step S320: Multiply each first focused feature vector by a trainable enhancement scaling parameter, which is used to increase the magnitude of the first focused feature vector, to obtain the corresponding enhanced feature component.
[0129] Specifically, the enhancement scaling parameter is a scalar value that can be adjusted through training. Multiplying it with the first focused feature vector increases the magnitude of the corresponding feature values, thereby increasing the contribution weight of the corresponding features.
[0130] The initial value of the enhancement scaling parameter is usually set to a value greater than 1, such as 1.2 or 1.5. During training, the enhancement scaling parameter can be dynamically adjusted.
[0131] During the optimization of the enhancement scaling parameters, the discriminator learns to assign larger enhancement scaling parameter values to important first semantic dimensions, thereby achieving adaptive emphasis on key first semantic dimensions. This allows the model to flexibly adjust the relative importance of different first semantic dimensions in the final score. For example, the terminology accuracy dimension may receive a higher enhancement scaling parameter value than the readability dimension.
[0132] Step S330: Based on multiple enhanced feature components, the first fused feature vector is obtained by summing them.
[0133] Specifically, the first fused feature vector is obtained by summing all enhanced feature components, which integrates feature information obtained after enhancing multiple first semantic dimensions. Through the fusion mechanism, the balance and representativeness of the first fused feature vector as a feature representation are ensured, avoiding the excessive dominance of a single first semantic dimension.
[0134] Step S340: Based on the first fused feature vector and the first high-dimensional feature vector, perform a multiplication operation to obtain the global feature vector corresponding to the first high-dimensional feature vector.
[0135] Specifically, the first fused feature vector is multiplied by the original first high-dimensional feature vector to achieve interactive integration of the enhanced features and the original features. The multiplication operation acts as a gating mechanism, allowing the enhanced features to selectively modulate the corresponding parts of the original features. The resulting global feature vector retains the original semantic information while highlighting the features that are highly relevant to the soil and groundwater domain, providing high-quality input for subsequent scoring.
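Steps S310 to S340 can be sketched on a single feature vector as follows. Several simplifications are assumptions made for this illustration: each first sub-attention head is reduced to one projection matrix, the multiplication in step S340 is taken to be element-wise, and the matrices and scale values are toy numbers.

```python
def matvec(matrix, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def enhance_and_gate(x, head_matrices, enhancement_scales):
    """S310-S340 applied to a single feature vector x of size d."""
    # S310: each head projects x with its own matrix -> focused vectors.
    focused = [matvec(w, x) for w in head_matrices]
    # S320: multiply each focused vector by its enhancement scaling parameter.
    enhanced = [[s * v for v in f] for f, s in zip(focused, enhancement_scales)]
    # S330: sum the enhanced components into the first fused feature vector.
    fused = [sum(col) for col in zip(*enhanced)]
    # S340: multiply with the original vector (element-wise, an assumption).
    return [f * xi for f, xi in zip(fused, x)]


x = [1.0, 2.0]
heads = [
    [[1.0, 0.0], [0.0, 0.0]],  # hypothetical "terminology accuracy" head
    [[0.0, 0.0], [0.0, 1.0]],  # hypothetical "readability" head
]
scales = [1.5, 1.2]            # initial values greater than 1, as in the text
global_vec = enhance_and_gate(x, heads, scales)
```

In training, the projection matrices and the scale values would be updated together with the rest of the discriminator; here they are fixed to keep the arithmetic easy to follow.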
[0136] It can be understood that in this embodiment of the application, first sub-attention heads are set up specifically for the first semantic dimensions, such as professional terminology accuracy, semantic consistency, and readability. This allows features in the first high-dimensional feature vector that are highly relevant to the soil and groundwater domain to be given priority attention, thereby improving the professionalism of the automatic evaluation. By configuring an enhancement scaling parameter for each first sub-attention head, the contribution weight of the first semantic dimensions is amplified, making the evaluation results more in line with the judgment criteria of domain experts and giving the final automatic scoring results high accuracy and reliability.
[0137] Referring to Figure 6 and Figure 7, Figure 6 shows the specific process of another method, in step S300, of performing attention weighting on the first high-dimensional feature vector to obtain the corresponding global feature vector. Figure 7 shows a structure of the attention pooling layer in an embodiment of this application.
[0138] As shown in Figure 6 and Figure 7, in some other embodiments of this application, performing attention weighting on the first high-dimensional feature vector in step S300 based on the attention pooling layer to obtain the corresponding global feature vector may specifically include the following steps S350-S380:
[0139] Step S350: Based on the multiple second sub-attention heads included in the attention pooling layer, focus on the first high-dimensional feature vector from different second semantic dimensions to obtain multiple second focused feature vectors. The second semantic dimensions include emotional richness and diversity, which are of low relevance to the soil and groundwater domain.
[0140] Specifically, the second sub-attention head is a parallel unit in the attention pooling layer specifically designed to process non-critical features. Its structure is similar to that of the first sub-attention head, but its functions are complementary.
[0141] The emotional richness dimension focuses on the emotional tone and intensity of subjective attitudes expressed in the text, such as whether the answer contains emphatic words like "significant" or "extremely." The diversity dimension assesses the variability and creativity of language expression, including lexical diversity and the degree of sentence variation.
[0142] It is understandable that the second semantic dimension is a secondary factor in professional Q&A in the field of soil and groundwater, and excessive focus on it may lead to significant evaluation bias. Each second sub-attention head extracts the feature information corresponding to the second semantic dimension from the first high-dimensional feature vector and outputs a second focused feature vector. This step, through specialized non-critical feature extraction, achieves the clear identification and isolation of features irrelevant to the soil and groundwater field.
[0143] Step S360: Multiply each second focused feature vector by a trainable weakening scaling parameter, which is used to reduce the magnitude of the second focused feature vector, to obtain the corresponding weakened feature component.
[0144] Specifically, the weakening scaling parameter is a learnable scalar parameter corresponding to each second sub-attention head. Its initial value is usually set to a value less than 1, such as 0.7 or 0.8, and it can be dynamically adjusted through training. By multiplying the weakening scaling parameter with the second focused feature vector, the contribution weight of the features corresponding to the second semantic dimension can be reduced in subsequent feature fusion.
[0145] During parameter optimization, the discriminator learns to assign smaller weakening scaling parameter values to less important second semantic dimensions. This adaptively suppresses second semantic dimensions with low relevance to the soil and groundwater domain, preventing them from excessively influencing the final scoring results. For example, the emotional richness dimension may receive a lower weakening scaling parameter value than the diversity dimension.
[0146] Step S370: Based on multiple enhanced feature components and multiple weakened feature components, the second fused feature vector is obtained by summing them.
[0147] The enhanced feature components can be obtained based on the aforementioned steps S310-S320.
[0148] Specifically, the second fused feature vector is obtained by summing all enhanced and weakened feature components, which simultaneously includes enhanced features for the first semantic dimension and weakened features for the second semantic dimension. This summation operation adjusts the proportion of features of different importance, ensuring that key features in the soil and groundwater domain dominate while non-key features are appropriately suppressed, so that the final second fused feature vector can accurately reflect the relative importance of each first and second semantic dimension.
[0149] Step S380: Based on the second fused feature vector and the first high-dimensional feature vector, perform a multiplication operation to obtain the global feature vector corresponding to the first high-dimensional feature vector.
[0150] Specifically, the second fused feature vector is multiplied by the original first high-dimensional feature vector to achieve the final integration of the adjusted features and the original features. The multiplication operation acts as a gating mechanism, allowing the enhanced and weakened features to selectively cover corresponding parts of the original features. The resulting global feature vector maintains the integrity of the original semantics while optimizing the relative salience of the features corresponding to the first and second semantic dimensions. This step completes the final attention-weighted feature transformation, and the resulting global feature vector provides highly adapted optimized feature inputs for subsequent scoring in the soil and groundwater domains.
[0151] For example, when evaluating answers regarding soil sampling methods, the second sub-attention heads identify emotionally charged wording and expressions of diversity, such as "very precise" and "diverse." These features are weakened by a weakening scaling parameter of 0.6 to reduce their impact. Meanwhile, the first sub-attention heads focus on technical terms like "undisturbed sampling" and "undisturbed soil sample." In the resulting global feature vector, features highly relevant to the soil and groundwater domain account for over 80% of the weights, while features related to emotional richness and diversity account for less than 20%. This ensures that the scoring is primarily based on professional accuracy rather than stylistic expression, thus enhancing the professionalism of the scoring results.
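The bidirectional regulation of steps S350 to S380 can be sketched by extending the same idea with weakened components. As before, reducing each head to one projection matrix and using element-wise multiplication are assumptions. The `share` value is one hypothetical way to quantify how much of the fused magnitude comes from the first sub-attention heads; it is not the measure used in this application and does not reproduce the percentages in the example above.

```python
def matvec(matrix, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def scaled_components(x, head_matrices, scales):
    # Project x with each head, then multiply by that head's scaling parameter.
    return [[s * v for v in matvec(w, x)] for w, s in zip(head_matrices, scales)]


def l1_mass(components):
    # Total absolute magnitude of a list of component vectors.
    return sum(abs(v) for comp in components for v in comp)


def bidirectional_gate(x, first_heads, enhance, second_heads, weaken):
    """S350-S380: fuse enhanced and weakened components, then gate x."""
    enhanced = scaled_components(x, first_heads, enhance)
    weakened = scaled_components(x, second_heads, weaken)
    # S370: sum all enhanced and weakened components.
    fused = [sum(col) for col in zip(*(enhanced + weakened))]
    # S380: multiply with the original vector (element-wise, an assumption).
    global_vec = [f * xi for f, xi in zip(fused, x)]
    share = l1_mass(enhanced) / (l1_mass(enhanced) + l1_mass(weakened))
    return global_vec, share


x = [1.0, 1.0]
first_heads = [[[1.0, 0.0], [0.0, 0.0]]]   # hypothetical terminology head
second_heads = [[[0.0, 0.0], [0.0, 1.0]]]  # hypothetical emotional-richness head
global_vec, share = bidirectional_gate(x, first_heads, [1.5], second_heads, [0.6])
```

Lowering the weakening scaling parameter or raising the enhancement scaling parameter shifts `share` toward the domain-relevant heads, which is the adjustment the training process is described as learning.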
[0152] It can be understood that in this embodiment, second sub-attention heads are set up specifically for the second semantic dimensions, such as emotional richness and diversity, and a weakening scaling parameter is configured for each second sub-attention head. This reduces the attention given to the second semantic dimensions, which have low relevance to the soil and groundwater domain, effectively suppressing the interference of factors unrelated to soil and groundwater expertise with the automatic scoring results and making the evaluation process more focused on professional indicators in the soil and groundwater domain. At the same time, both the enhanced and the weakened feature components are taken into account, realizing bidirectional feature regulation, generating a more accurate and reliable global feature vector, and further improving the accuracy and reliability of the automatic scoring results finally output by the discriminator.
[0153] In this embodiment of the application, the discriminator used in the process of automatically evaluating the question-answering quality of a large model in the field of soil and groundwater is a trainable discriminator.
[0154] Referring to Figure 8 and Figure 9, Figure 8 shows the training process of the discriminator in the embodiments of this application. Figure 9 shows the structure of the optimization module in an embodiment of this application.
[0155] As shown in Figure 8 and Figure 9, in the embodiments of this application, the training process of the discriminator may include the following steps S500-S900:
[0156] Step S500: Obtain the training sample set, which includes the second dataset and the corresponding human scoring results. The second dataset includes the second question, the second answer to be evaluated, the second standard answer, and the second evaluation instruction.
[0157] Specifically, the training sample set is the dataset used to train the discriminator, and the human scoring results serve as the training supervision signals. The second dataset has a similar structure to the first dataset but is specifically designed for training purposes. The second questions cover various specialized question types within the soil and groundwater field. The second standard answers are provided by experts in the soil and groundwater field to ensure their authority and reliability. The second evaluation instructions are consistent with the instructions used in the final application.
[0158] The human scoring results can be independently evaluated by multiple experts in the field of soil and groundwater according to predefined scoring criteria, and the final average score or consensus score is taken as the supervision signal. This step, by constructing training samples and training sample sets with clear input structure and supervision signals, enables the discriminator to learn and optimize stably, providing reliable input information for the discriminator's learning and training.
[0159] Since high-quality training samples are relatively scarce in the field of soil and groundwater, this application embodiment expands the number of training samples through diversified generation strategies.
[0160] Specifically, in this embodiment, for the same second question, multiple different question-answering models, including a large model in the soil and groundwater domain, are invoked to generate answer samples with cross-model diversity. Furthermore, within the same question-answering model, multiple independent answers are generated for the same second question, utilizing the randomness of the generation process to obtain multiple versions of answers under the same model, enriching the diversity at the micro-level of expression. Additionally, by adjusting key parameters in the question-answering model decoding process, including the temperature coefficient, Top-k sampling threshold, and Top-p sampling threshold, the determinism and exploratory nature of the generated text are artificially controlled, thereby obtaining answer samples with a quality gradient distribution. Specifically, low-temperature or small-range sampling strategies tend to generate conservative and high-quality answers, while high-temperature or large-range sampling strategies may introduce semantic bias or logical flaws, resulting in medium- or low-quality answers.
[0161] Through the above-mentioned diversified generation strategies, multiple second answers to be evaluated, including high-quality, medium-quality and low-quality, are obtained for each second question, thereby forming multiple sets of training samples, which fully cover the possible output space in terms of semantic expression and quality dimension, and provide rich comparative learning materials for discriminator training.
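The diversified generation strategy can be sketched as an enumeration of generation jobs over models, decoding configurations, and repeated samples. The model names, parameter values, and the sample question are hypothetical placeholders. The sketch only builds job descriptions and does not call any model.

```python
from itertools import product


def generation_jobs(question, models, decoding_configs, samples_per_config):
    """Enumerate (model, decoding config, repeat) combinations for one question."""
    jobs = []
    for model, config, repeat in product(models, decoding_configs, range(samples_per_config)):
        jobs.append({"question": question, "model": model, "repeat": repeat, **config})
    return jobs


# All names and values below are hypothetical placeholders.
models = ["domain-model", "general-model-a"]
decoding_configs = [
    {"temperature": 0.2, "top_k": 20, "top_p": 0.8},    # conservative sampling
    {"temperature": 1.2, "top_k": 100, "top_p": 0.98},  # exploratory sampling
]
jobs = generation_jobs("How do pollutants migrate in groundwater?", models, decoding_configs, 3)
```

Each job would yield one second answer to be evaluated, so two models, two decoding configurations, and three repeats give twelve candidate answers for a single second question.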
[0162] Step S600: Input the second dataset into the large language model to obtain the second high-dimensional feature vector.
[0163] Specifically, the large language model used in the training phase is the same as the one used in the application phase, and its parameters are kept frozen and not updated. After the second dataset is input into the large language model, the second high-dimensional feature vector is calculated through forward propagation. The second high-dimensional feature vector has the same dimensionality and semantic properties as the first high-dimensional feature vector, but it comes from the training samples. This step ensures the consistency of feature extraction in the training and inference phases and avoids a mismatch between training and inference.
[0164] Step S700: Input the second high-dimensional feature vector into the discriminator to obtain the prediction score result.
[0165] In the initial stages of training, there may be a significant discrepancy between the predicted scores and the human scores. This discrepancy is gradually reduced through iterative optimization. This step completes the forward propagation of the training data, providing input for subsequent loss calculations.
[0166] Step S800: Based on the manual rating results, the predicted rating results, and the loss calculation unit of the optimization module, the gradient signal is obtained, wherein the manual rating results serve as the supervision signal of the loss calculation unit.
[0167] Specifically, the loss calculation unit uses a regression loss function, such as mean squared error (MSE), to calculate the difference between the predicted rating and the human rating. Based on the loss value, a gradient signal is calculated, indicating the direction and magnitude of parameter updates in the discriminator. This step provides a clear optimization objective for the discriminator by quantifying the difference between the predicted and human ratings.
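As a minimal sketch of the loss calculation unit, the following code computes the MSE between predicted and human scores and the resulting gradients. For brevity it assumes a single linear scoring head (`weights`, `bias`) in place of the full discriminator; that simplification, the function name, and the toy data are assumptions made for illustration only.

```python
def mse_and_gradients(weights, bias, features, human_scores):
    """MSE loss of a linear scoring head and its gradients w.r.t. weights and bias."""
    n = len(features)
    # Forward pass: predicted score = w . x + b for each sample.
    preds = [sum(w * x for w, x in zip(weights, row)) + bias for row in features]
    errors = [p - y for p, y in zip(preds, human_scores)]
    loss = sum(e * e for e in errors) / n
    # d(loss)/d(w_j) = (2 / n) * sum_i error_i * x_ij
    grad_w = [
        2.0 / n * sum(e * row[j] for e, row in zip(errors, features))
        for j in range(len(weights))
    ]
    # d(loss)/d(b) = (2 / n) * sum_i error_i
    grad_b = 2.0 / n * sum(errors)
    return loss, grad_w, grad_b

# Hypothetical toy batch: two samples with two features each, human scores 4 and 2.
toy_features = [[1.0, 0.0], [0.0, 1.0]]
toy_human = [4.0, 2.0]
loss, grad_w, grad_b = mse_and_gradients([0.0, 0.0], 0.0, toy_features, toy_human)
```

The negative gradients in this example indicate that increasing the parameters would reduce the loss, which is the direction and magnitude information that the optimizer in step S900 consumes.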
[0168] Step S900: Based on the gradient signal and the optimizer included in the optimization module, optimize the discriminator to make the predicted scoring result close to the human scoring result, while keeping the structure and parameters of the large language model unchanged.
[0169] Specifically, the large language model was explicitly excluded from the optimization process and kept completely frozen.
[0170] In this embodiment of the application, when optimizing the discriminator, a fixed total number of training rounds is preset as an upper limit. However, if the loss value stops decreasing, or even increases, for a certain number of consecutive rounds during training, the discriminator is considered to have essentially converged. Continuing to train after convergence does not improve performance, so training is terminated early to save training costs.
[0171] In other words, there are two conditions for ending discriminator training: one is reaching the preset maximum number of training rounds; the other is satisfying the early stopping criterion, that is, if the loss does not decrease in several consecutive iterations, the training is terminated early to preserve the best parameters and ensure that the discriminator has a stable and reliable automatic evaluation capability.
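The two termination conditions can be sketched as follows. The callable `loss_per_epoch` stands in for one real training round and returns the monitored loss; it, the `patience` name, and the toy loss curve are assumptions introduced for illustration.

```python
def train_with_early_stopping(loss_per_epoch, max_epochs, patience):
    """Run until max_epochs, or stop once the loss has not improved for `patience` rounds."""
    best_loss = float("inf")
    best_epoch = -1
    stale_rounds = 0
    for epoch in range(max_epochs):
        loss = loss_per_epoch(epoch)  # stand-in for one training round
        if loss < best_loss:
            # New best: a real implementation would also save the discriminator parameters here.
            best_loss, best_epoch, stale_rounds = loss, epoch, 0
        else:
            stale_rounds += 1
            if stale_rounds >= patience:
                return best_epoch, best_loss, "early_stop"
    return best_epoch, best_loss, "max_epochs"

# Hypothetical loss curve that stops improving after the third round.
toy_losses = [1.0, 0.6, 0.4, 0.41, 0.45, 0.43, 0.2]
stopped_early = train_with_early_stopping(lambda e: toy_losses[e], max_epochs=10, patience=3)
hit_limit = train_with_early_stopping(lambda e: toy_losses[e], max_epochs=4, patience=3)
```

Tracking `best_epoch` alongside the counter is what allows the best parameters, rather than the last ones, to be kept when training ends early.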
[0172] It is understood that in this embodiment, the evaluation model is built on an architecture that combines an existing large language model with a trainable discriminator. This avoids training a dedicated evaluation model from scratch and makes full use of the semantic understanding capabilities of the existing large language model: only the discriminator needs to be trained to optimize the evaluation model and improve the accuracy of the predicted scores. This avoids the high cost of full-parameter training, which requires large amounts of labeled data and expert participation, and also avoids its dependence on large-scale computing power and high-performance hardware, so that the evaluation model can be trained efficiently with limited data and computing resources. Furthermore, during discriminator training the parameters of the large language model are kept frozen so that its pre-trained semantics are not damaged. The large language model's understanding of both the soil and groundwater domain and the general domain therefore remains stable, and so does its ability to extract high-dimensional features from the first or second dataset. Iteratively optimizing the discriminator parameters brings the predicted scoring results progressively closer to the human scoring results, which enhances the discrimination ability and convergence efficiency of the evaluation model and improves the professionalism and credibility of using the discriminator to automatically score the question-answering quality of large models in the soil and groundwater domain.
[0173] In this embodiment of the application, step S900, based on the gradient signal and the optimizer included in the optimization module, optimizes the discriminator, which may specifically include the following steps S910 to S930:
[0174] Step S910: Based on the gradient signal and the optimizer, determine that the pooling layer in the discriminator is an attention pooling layer.
[0175] Specifically, during the discriminator structure optimization process, by analyzing the response of gradient signals to different pooling operations, it was determined that the attention pooling layer, compared to traditional pooling layers such as average pooling or max pooling, can produce a more significant loss reduction. Gradient signals showed that the attention mechanism can learn feature weighting patterns more consistent with the domain evaluation objective; therefore, attention pooling was ultimately chosen as the standard configuration. This step, through gradient signal-guided structure selection, ensures that the discriminator adopts the most efficient feature processing architecture.
[0176] Referring to Table 1, Table 1 shows the experimental results obtained by verifying the scoring ability of discriminators with different structures and parameters in the embodiments of this application.
[0177] Among the evaluation metrics, the mean squared error (MSE) is the average of the squares of the differences between the predicted score and the human score. Because of the squaring operation, large deviations will result in quadratic losses, so it is sensitive to large errors. MSE can amplify the gap between the predicted score output by the discriminator and the human score.
[0178] The root mean squared error (RMSE) is the arithmetic square root of the MSE. It restores the unit of the error to the same unit as the predicted score and can directly indicate how large the error is between the predicted score and the human score.
[0179] Mean absolute error (MAE) is the average of the absolute values of the difference between the predicted score and the human score. It is used to measure the actual magnitude of the error without squaring the error and to reflect the average performance of the discriminator under normal conditions.
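The three metrics defined above can be computed as in the following minimal sketch. The function name and the toy scores are illustrative assumptions; the values are not results from Table 1.

```python
import math

def regression_errors(predicted, human):
    """Return (MSE, RMSE, MAE) between predicted scores and human scores."""
    diffs = [p - h for p, h in zip(predicted, human)]
    mse = sum(d * d for d in diffs) / len(diffs)     # squares amplify large deviations
    rmse = math.sqrt(mse)                            # same unit as the scores
    mae = sum(abs(d) for d in diffs) / len(diffs)    # average size of the error
    return mse, rmse, mae

# Hypothetical scores: three exact or near-exact predictions and one large miss.
toy_predicted = [3.0, 5.0, 2.0, 8.0]
toy_human = [3.0, 4.0, 2.0, 4.0]
mse, rmse, mae = regression_errors(toy_predicted, toy_human)
```

Here the single large miss dominates MSE and RMSE while MAE stays comparatively small, which illustrates why the three metrics are reported together.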
[0180] Based on the error results obtained from the experiment under different evaluation indicators, especially the error results corresponding to Experiment 3, it was determined that when the attention pooling layer is used, the discriminator has a better ability to automatically evaluate the question-answering quality of large models in the soil and groundwater domain.
[0181] Table 1. Experimental results for the discriminator with different structures and parameters.
[0182]
[0183] Step S920: Based on the gradient signal and the optimizer, determine that the attention pooling layer includes multiple first sub-attention heads and multiple second sub-attention heads.
[0184] Specifically, by observing the impact of different sub-attention head configurations on the loss function, the gradient signal indicates that the hybrid structure, which includes both the first and second sub-attention heads, achieves the best performance.
[0185] In this application, there is no limitation on the number of first and second sub-attention heads. The number of first sub-attention heads is typically set to 3-5, covering the main professional dimensions; the number of second sub-attention heads is set to 2-3, handling secondary non-professional dimensions. The specific number of first and second sub-attention heads needs to be determined through experimental verification.
[0186] In this embodiment of the application, the attention head configuration scheme is determined to be three first sub-attention heads and two second sub-attention heads through this step. Among them, the three first sub-attention heads focus on the accuracy of technical terms, semantic consistency, and readability, respectively, while the two second sub-attention heads focus on emotional richness and diversity, respectively.
[0187] Step S930: Based on the gradient signal and the optimizer, determine the enhancement ratio parameter corresponding to each first sub-attention head and the weakening ratio parameter corresponding to each second sub-attention head, so as to make the predicted scoring result close to the human scoring result based on the attention pooling layer.
[0188] Specifically, after initial values are set for the enhancement ratio parameters and the weakening ratio parameters, they are optimized using the gradient descent algorithm. The gradient signal indicates how sensitive the loss function is to each enhancement ratio parameter and each weakening ratio parameter, thereby guiding the direction in which these parameters are adjusted.
[0189] After training, the enhancement ratio parameter of the first sub-attention head typically converges to the range of 1.5-1.8, and the weakening ratio parameter of the second sub-attention head converges to the range of 0.2-0.5. This step achieves fine-tuning of feature weighting through parameter optimization.
[0190] It is understood that in this embodiment of the application, the pooling layer of the discriminator is ultimately determined to adopt an attention pooling layer structure, which enables the discriminator to autonomously learn the optimal feature attention method. By automatically determining the first sub-attention head and its corresponding enhancement ratio parameter, the second sub-attention head and its corresponding weakening ratio parameter, the discriminator achieves precise control over the importance of each semantic dimension of the second high-dimensional feature vector, ensuring that the predicted scoring result output by the discriminator is highly consistent with the human scoring result. This ensures that the automatic scoring result output by the discriminator has high reliability after the discriminator is put into use.
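How the enhancement ratio and weakening ratio parameters act on the outputs of the sub-attention heads can be sketched as below. The sketch takes the per-head focused feature vectors as given inputs and does not model how each head computes them. Summing the weakened components together with the enhanced ones, and the element-wise form of the final multiplication, are assumptions made for illustration; the function names and toy vectors are hypothetical.

```python
def fuse_heads(first_focused, enhance_ratios, second_focused, weaken_ratios):
    """Scale each head's focused vector by its ratio parameter and sum the results."""
    dim = len(first_focused[0])
    fused = [0.0] * dim
    # First sub-attention heads: ratios above 1 enlarge their contribution.
    for vec, ratio in zip(first_focused, enhance_ratios):
        for j in range(dim):
            fused[j] += ratio * vec[j]
    # Second sub-attention heads: ratios below 1 shrink their contribution (assumed combination).
    for vec, ratio in zip(second_focused, weaken_ratios):
        for j in range(dim):
            fused[j] += ratio * vec[j]
    return fused

def global_feature(fused, features):
    """Element-wise product of the fused vector with the input feature vector (assumed form)."""
    return [f * x for f, x in zip(fused, features)]

# Hypothetical 2-dimensional example with three first heads and two second heads.
toy_features = [1.0, 2.0]
toy_first = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_second = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_heads(toy_first, [1.5, 1.6, 1.8], toy_second, [0.2, 0.5])
pooled = global_feature(fused, toy_features)
```

In training, the ratio lists would be trainable parameters updated by the gradient signal rather than fixed constants as in this example.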
[0191] In this embodiment of the application, step S900, optimizing the discriminator based on the gradient signal and the optimizer included in the optimization module, may further include at least one of the following steps S940 to S960:
[0192] Step S940: Based on the gradient signal and the optimizer, optimize the type, location and / or parameters of the normalization layer in the multilayer perceptron included in the discriminator to improve the generalization ability and robustness of the discriminator.
[0193] Specifically, the normalization layer is an important component in a multilayer perceptron, used to stabilize the training process and improve generalization performance. By analyzing gradient signals, the optimal normalization strategy is determined, including: choosing the type of normalization, such as LayerNorm or BatchNorm, deciding whether to place the normalization layer before or after the activation function, and tuning the normalization parameters.
[0194] In this step, by optimizing the type, location and / or parameters of the normalization layer, the stability of the activation value distribution of each layer of the multilayer perceptron in the process of processing global feature vectors is optimized, thereby improving the discriminator's generalization ability and robustness to reasonable variants of the first high-dimensional feature vector or the second high-dimensional feature vector corresponding to the soil and groundwater domains.
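The placement choice discussed in step S940 can be illustrated with a minimal layer normalization sketch. It omits the learnable scale and shift that a full normalization layer would carry, and it uses ReLU purely as a placeholder activation; both simplifications, along with the function names and toy input, are assumptions for illustration.

```python
import math

def layer_norm(values, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    """Placeholder activation used only to contrast the two placements."""
    return [max(0.0, v) for v in values]

def norm_before_activation(values):
    return relu(layer_norm(values))

def norm_after_activation(values):
    return layer_norm(relu(values))

# Hypothetical hidden activations of one multilayer perceptron layer.
toy_hidden = [1.0, 2.0, 3.0, 4.0]
before = norm_before_activation(toy_hidden)
after = norm_after_activation(toy_hidden)
```

The two placements produce different activation distributions for the same input: normalizing last keeps the layer output centered, while normalizing first lets the activation reshape an already standardized input. This is the kind of difference that the comparison in this step is meant to resolve empirically.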
[0195] Step S950: Optimize the activation function used in the multilayer perceptron to improve the discriminator's nonlinear mapping ability and discrimination sensitivity to complex semantics and subtle logical differences in specialized texts in the soil and groundwater domain.
[0196] Specifically, the choice of activation function directly affects the discriminator's expressive power and training stability. The activation function can be selected by comparing the gradient signals and validation performance obtained with different candidates, such as GeLU, GeGLU, and SwiGLU.
[0197] Referring to the experimental results of Experiments 4 and 5 in Table 1, GeLU was ultimately selected as the main activation function, which showed the best performance in processing complex semantic patterns of professional texts in the field of soil and groundwater.
[0198] This step, by optimizing the selection of activation function, can enhance the discriminator's ability to nonlinearly map complex semantics and subtle logical differences contained in the global feature vector, thereby improving the accuracy and sensitivity of automatic evaluation.
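For reference, the candidate activation functions named above can be written in scalar form as in the following sketch. In a real multilayer perceptron the `value` and `gate` arguments of the gated variants would be two separate linear projections of the layer input; passing them in directly here is a simplification for illustration.

```python
import math

def gelu(x):
    """Exact GeLU: x times the standard normal cumulative distribution at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """SiLU (Swish): x times the logistic sigmoid of x."""
    return x / (1.0 + math.exp(-x))

def geglu(value, gate):
    """GeGLU gating: the value path is scaled by GeLU of the gate path."""
    return gelu(gate) * value

def swiglu(value, gate):
    """SwiGLU gating: the value path is scaled by SiLU of the gate path."""
    return silu(gate) * value

# Evaluate the smooth, non-monotonic behavior of GeLU around zero.
samples = [gelu(x) for x in (-1.0, 0.0, 1.0)]
```

Unlike a hard threshold, GeLU lets small negative inputs pass through with a reduced weight, which is one commonly cited reason for its use on fine-grained semantic features.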
[0199] Step S960: Set an external module on the discriminator and optimize the external module. The external module may include a residual structure and / or a compression-excitation module. The residual structure is used to enhance the stability of the first high-dimensional feature vector or the second high-dimensional feature vector in the discriminator. The compression-excitation module is used to perform adaptive recalibration on the first high-dimensional feature vector or the second high-dimensional feature vector.
[0200] Specifically, external modules are supplementary structures added to the main discriminator to enhance specific capabilities.
[0201] Since the residual structure directly passes the input to subsequent layers through skip connections, it can alleviate the gradient vanishing problem in deep networks and ensure that feature information is stably passed in the discriminator. Therefore, in some embodiments of this application, the use of the residual structure can optimize the automatic scoring results output by the discriminator.
[0202] Referring to the experimental results of Experiments 6 and 5 shown in Table 1, in some other embodiments of this application, when the number of training samples is limited, increasing the depth may lead to overfitting, causing the discriminator to memorize the noise in the training sample set and the generalization ability to decrease. In this case, it is better not to use the residual structure.
[0203] Referring to the experimental results of Experiments 7 and 5 shown in Table 1, the compression-excitation module, as a feature recalibration mechanism based on channel attention, can adaptively learn the nonlinear dependencies between channels and generate a normalized weight vector. Finally, by performing channel-by-channel multiplication of this weight vector with the original features, it achieves dynamic enhancement of important feature channels and effective suppression of redundant information, thereby significantly improving the discriminative power and domain adaptability of the global feature vector. The parameters of the above external modules are optimized based on gradient signals and integrated into the discriminator to improve overall performance.
[0204] It is understandable that this step introduces a residual structure, which can ensure the stable transmission of the first or second high-dimensional feature vector in the deep network, avoid the problem of gradient signal decaying step by step during transmission, and improve training efficiency and model performance. By adopting a compression-excitation module, adaptive recalibration of feature channels can be achieved, dynamically enhancing important features and suppressing redundant information, thereby improving the discriminative power and accuracy of feature representation.
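The two external modules can be sketched as follows. The compression-excitation function follows the common squeeze, bottleneck, and sigmoid-gating pattern, and the residual function adds the input back to a block's output. The channel layout, the weight matrices `w_reduce` and `w_expand`, and all toy values are assumptions for illustration and do not reflect the configurations tested in Table 1.

```python
import math

def squeeze_excite(channels, w_reduce, w_expand):
    """Recalibrate channels with gates learned from per-channel averages."""
    # Compression: average each channel down to one scalar.
    squeezed = [sum(ch) / len(ch) for ch in channels]
    # Excitation: bottleneck with ReLU, then expansion with sigmoid gates in (0, 1).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Recalibration: scale every value in a channel by that channel's gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, channels)]
    return scaled, gates

def residual(block, x):
    """Skip connection: output = input + block(input)."""
    return [xi + yi for xi, yi in zip(x, block(x))]

# Hypothetical two-channel feature map and hand-picked weights.
toy_channels = [[1.0, 3.0], [0.0, 0.0]]
scaled, gates = squeeze_excite(toy_channels, w_reduce=[[1.0, 1.0]], w_expand=[[1.0], [-1.0]])
passthrough = residual(lambda v: [0.0] * len(v), [1.0, 2.0])
```

With these toy weights the first channel receives a gate above 0.5 and the second a gate below 0.5, showing channel-wise enhancement and suppression. The residual example shows that the input survives unchanged when the wrapped block contributes nothing.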
[0205] In summary, the automatic question-answering quality evaluation method for large-scale models in the soil and groundwater domain provided in this application has the following effects. On the one hand, the first dataset, which contains key content such as the first answer to be evaluated and the first standard answer, is transformed into a first high-dimensional feature vector by a large language model, so that deep semantic information is extracted from the first dataset quickly and sufficiently. A discriminator containing an attention pooling layer then applies attention weighting to the first high-dimensional feature vector, amplifying the weights of information highly relevant to the soil and groundwater domain and suppressing the weights of information with low relevance, which effectively improves the discriminator's semantic understanding of content related to the soil and groundwater domain. As a result, the discriminator's evaluation of the question-answering quality of large-scale models in the soil and groundwater domain approaches the evaluation capability of domain experts and overcomes the limitations of traditional evaluations based on general natural language processing indicators. On the other hand, by constructing an automated evaluation process based on the large language model and the discriminator, fully automated processing is achieved from the input of question-answering data for large-scale models in the soil and groundwater domain to the output of scores. This can replace the traditional evaluation method that relies on experts manually reviewing each item, improving evaluation efficiency and significantly reducing labor and time costs. It can be used to support the performance iteration and rapid verification needs of large-scale models in knowledge question-answering tasks in the soil and groundwater domain.
[0206] Furthermore, by adopting an architecture design that combines existing large language models with a trainable discriminator, we avoid training a dedicated evaluation model from scratch. This fully leverages the powerful semantic understanding capabilities of existing large language models, requiring only the training of the discriminator. This solves the high-cost problem of requiring a large amount of labeled data and expert participation for full-parameter training, and achieves efficient model training under limited data conditions.
[0207] This application provides an electronic device, which includes a processor and a memory. The memory stores at least one instruction or at least one program. When the processor loads and executes the instruction or program, the electronic device performs the automatic question-answering quality evaluation method for large-scale models in the soil and groundwater domain described in the above embodiments. Its specific functions and corresponding technical effects can be found in the description of that method given above with reference to Figures 1-9 and will not be elaborated upon here. The electronic device of the embodiments of this application is described in detail below with reference to Figure 10.
[0208] Referring to Figure 10, a block diagram of an electronic device 1200 according to one embodiment of this application is shown. The electronic device 1200 may include one or more processors 1201 coupled to a controller hub 1203. In at least one embodiment, the controller hub 1203 communicates with the processor 1201 via a multi-drop bus such as a front side bus (FSB) 1210, a point-to-point interface such as a QuickPath Interconnect (QPI), or a similar connection. The processor 1201 executes instructions that control general types of data processing operations. In one embodiment, the controller hub 1203 includes, but is not limited to, a graphics memory controller hub (GMCH) (not shown) and an input / output hub (IOH) (which may be on a separate chip) (not shown), wherein the GMCH includes memory and a graphics controller and is coupled to the IOH.
[0209] Electronic device 1200 may also include a coprocessor 1202 and a memory 1204 coupled to a controller hub 1203. Alternatively, one or both of the memory and the GMCH may be integrated within the processor (as described in this application), with memory 1204 and coprocessor 1202 directly coupled to processor 1201 and controller hub 1203, which resides on a single chip with the IOH. Memory 1204 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of both. In one embodiment, coprocessor 1202 is a dedicated processor, such as, for example, a high-throughput MIC (many integrated core) processor, a network or communication processor, a compression engine, a graphics processor, a general-purpose graphics processing unit (GPGPU), or an embedded processor. The optional nature of coprocessor 1202 is indicated by dashed lines in Figure 10.
[0210] As a computer-readable storage medium, memory 1204 may include one or more tangible, non-transitory computer-readable media for storing data and / or instructions. For example, memory 1204 may include any suitable non-volatile memory such as flash memory and / or any suitable non-volatile storage device such as one or more hard-disk drives (HDDs), one or more compact disc (CD) drives, and / or one or more digital versatile disc (DVD) drives.
[0211] In one embodiment, electronic device 1200 may further include a network interface controller (NIC) 1206. Network interface 1206 may include a transceiver for providing a radio interface for electronic device 1200 to communicate with any other suitable device, such as a front-end module, antenna, etc. In various embodiments, network interface 1206 may be integrated with other components of electronic device 1200. Network interface 1206 can implement the functions of the communication unit in the above embodiments.
[0212] Electronic device 1200 may further include input / output (I / O) device 1205. I / O device 1205 may include: a user interface designed to enable a user to interact with electronic device 1200; a peripheral component interface designed to enable peripheral components to also interact with electronic device 1200; and / or sensors designed to determine environmental conditions and / or location information related to electronic device 1200.
[0213] It is worth noting that Figure 10 is merely an example. That is, although the electronic device 1200 shown in Figure 10 includes multiple devices such as a processor 1201, a coprocessor 1202, a controller hub 1203, and a memory 1204, in practical applications a device using the methods of this application may include only some of the devices in the electronic device 1200, for example only the processor 1201 and the network interface 1206. The optional devices are shown with dashed lines in Figure 10. According to some embodiments of this application, the memory 1204, which is a computer-readable storage medium, stores instructions or programs that, when executed on a computer, perform the automatic question-answering quality evaluation method for large-scale models in the soil and groundwater domain described in the above embodiments. Specific details can be found in the methods of the above embodiments and will not be repeated here.
[0214] Referring now to Figure 11, a block diagram of a system-on-chip (SoC) 1300 according to an embodiment of this application is shown. In Figure 11, similar components share the same reference numerals, and dashed boxes denote optional features of more advanced SoCs. In Figure 11, the SoC 1300 includes: an interconnect unit 1350 coupled to an application processor 1310; a system agent unit 1380; a bus controller unit 1390; an integrated memory controller unit 1340; a set of one or more coprocessors 1320, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 1330; and a direct memory access (DMA) unit 1360. In one embodiment, the coprocessor 1320 includes a dedicated processor, such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, or an embedded processor.
[0215] The static random access memory (SRAM) unit 1330 may include one or more computer-readable media for storing data and / or instructions. The computer-readable storage medium may store instructions, specifically, temporary and permanent copies of those instructions. These instructions may include instructions that, when executed by at least one unit in the processor, cause the SoC 1300 to perform the automatic question-answering quality evaluation method for large models in the soil and groundwater domain according to the above embodiments. Details can be found in the methods described in the above embodiments and will not be repeated here.
[0216] This application provides a computer-readable storage medium storing at least one instruction or at least one program. The instruction or program is loaded and executed by a processor to implement the automatic question-answering quality evaluation method for large-scale models in the soil and groundwater domain described in the above embodiments. Its specific functions and corresponding technical effects can be found in the description of that method given above with reference to Figures 1-9 and will not be elaborated upon here.
[0217] This application provides a computer program product, including computer instructions. When the computer instructions are executed on an electronic device, the electronic device implements the automatic question-answering quality evaluation method for large-scale models in the soil and groundwater field described in the above embodiments. Its specific functions and corresponding technical effects can be found in the description of that method given above with reference to Figures 1-9 and will not be elaborated upon here.
[0218] Various embodiments of the mechanisms disclosed in this application can be implemented in hardware, software, firmware, or combinations of these implementation methods. Embodiments of this application can be implemented as computer programs or program code executable on a programmable system, the programmable system including at least one processor, a storage system (including volatile and non-volatile memory and / or storage elements), at least one input device, and at least one output device.
[0219] Program code can be applied to input instructions to execute the functions described in this application and generate output information. The output information can be applied to one or more output devices in a known manner. For the purposes of this application, the processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
[0220] The program code can be implemented using a high-level procedural language or an object-oriented programming language to communicate with the processing system. Assembly language or machine language can also be used when needed. In fact, the mechanisms described in this application are not limited to any particular programming language. In either case, the language can be a compiled language or an interpreted language.
[0221] In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or through other computer-readable media. Therefore, machine-readable media may include any mechanism for storing or transmitting information in a machine-readable (e.g., computer-readable) form, including but not limited to floppy disks, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable storage used to transmit information over the Internet by means of electrical, optical, acoustic, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Therefore, machine-readable media include any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a machine-readable (e.g., computer-readable) form.
[0222] In the accompanying drawings, some structural or methodological features may be shown in a specific arrangement and / or order. However, it should be understood that such a specific arrangement and / or order may not be necessary. Rather, in some embodiments, these features may be arranged in a manner and / or order different from that shown in the accompanying drawings. Furthermore, including structural or methodological features in a particular figure does not imply that such features are required in all embodiments, and in some embodiments, these features may be omitted or may be combined with other features.
[0223] It should be noted that the order of the embodiments described above is merely for descriptive purposes and does not represent the superiority or inferiority of the embodiments. Furthermore, specific embodiments have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps described in the claims can be performed in a different order than that shown in the embodiments and still achieve the desired result. Additionally, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
[0224] It should be noted that all units / modules mentioned in the device embodiments of this application are logical units / modules. Physically, a logical unit / module can be a physical unit / module, a part of a physical unit / module, or a combination of multiple physical units / modules. The physical implementation of these logical units / modules themselves is not the most important factor; the combination of functions implemented by these logical units / modules is the key to solving the technical problems proposed in this application. Furthermore, to highlight the innovative aspects of this application, the above-described device embodiments of this application have not introduced units / modules that are not closely related to solving the technical problems proposed in this application. This does not mean that the above-described device embodiments do not contain other units / modules.
[0225] It should be noted that in the examples and description of this application, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one" does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0226] Although this application has been illustrated and described with reference to certain preferred embodiments thereof, those skilled in the art should understand that various changes in form and detail may be made thereto without departing from the spirit and scope of this application.
Claims
1. A method for automatically evaluating the question-answering quality of large models in the soil and groundwater field, characterized by comprising: obtaining a first dataset, which includes a first question, a first answer to be evaluated, a first standard answer, and a first evaluation instruction, wherein the first evaluation instruction is a text description used to guide the evaluation process; inputting the first dataset into a large language model to obtain a first high-dimensional feature vector output by the large language model, wherein the first high-dimensional feature vector integrates the overall semantic information in the first dataset; inputting the first high-dimensional feature vector into a discriminator that includes an attention pooling layer; performing attention weighting on the first high-dimensional feature vector based on the attention pooling layer to obtain a corresponding global feature vector, wherein the global feature vector is used to amplify the weight of information in the first high-dimensional feature vector that is highly relevant to the soil and groundwater field, and/or to suppress the weight of information that is of low relevance; and obtaining, based on the global feature vector and further processing by the discriminator, an automatic scoring result, which represents the degree of similarity between the first answer to be evaluated and the first standard answer.
2. The method of claim 1, wherein performing attention weighting on the first high-dimensional feature vector based on the attention pooling layer to obtain the corresponding global feature vector includes: processing, by multiple first sub-attention heads included in the attention pooling layer, the first high-dimensional feature vector in parallel from different first semantic dimensions to obtain multiple first focused feature vectors, wherein the first semantic dimensions include the accuracy of professional terms, semantic consistency, and readability, which are highly relevant to the soil and groundwater field; multiplying each first focused feature vector by a trainable enhancement scaling parameter, which is used to increase the scalar value of the first focused feature vector, to obtain a corresponding enhanced feature component; summing the multiple enhanced feature components to obtain a first fused feature vector; and performing a multiplication operation on the first fused feature vector and the first high-dimensional feature vector to obtain the global feature vector corresponding to the first high-dimensional feature vector.
3. The method of claim 2, wherein performing attention weighting on the first high-dimensional feature vector based on the attention pooling layer to obtain the corresponding global feature vector further includes: attending, by multiple second sub-attention heads included in the attention pooling layer, to the first high-dimensional feature vector from different second semantic dimensions to obtain multiple second focused feature vectors, wherein the second semantic dimensions include emotional richness and diversity, which are of low relevance to the soil and groundwater field; multiplying each second focused feature vector by a trainable weakening scaling parameter, which is used to reduce the scalar value of the second focused feature vector, to obtain a corresponding weakened feature component; summing the multiple enhanced feature components and the multiple weakened feature components to obtain a second fused feature vector; and performing a multiplication operation on the second fused feature vector and the first high-dimensional feature vector to obtain the global feature vector corresponding to the first high-dimensional feature vector.
4. The method of claim 3, wherein the training process of the discriminator includes: obtaining a training sample set, which includes a second dataset and corresponding human scoring results, wherein the second dataset includes a second question, a second answer to be evaluated, a second standard answer, and a second evaluation instruction; inputting the second dataset into the large language model to obtain a second high-dimensional feature vector; inputting the second high-dimensional feature vector into the discriminator to obtain a predicted scoring result; obtaining a gradient signal based on the human scoring results, the predicted scoring result, and a loss calculation unit of an optimization module, wherein the human scoring results serve as the supervision signal of the loss calculation unit; and optimizing the discriminator, based on the gradient signal and an optimizer included in the optimization module, so that the predicted scoring result approaches the human scoring result, while keeping the structure and parameters of the large language model unchanged.
5. The method of claim 4, wherein optimizing the discriminator based on the gradient signal and the optimizer included in the optimization module includes: determining, based on the gradient signal and the optimizer, that the pooling layer in the discriminator is an attention pooling layer; determining, based on the gradient signal and the optimizer, that the attention pooling layer includes a plurality of first sub-attention heads and a plurality of second sub-attention heads; and determining, based on the gradient signal and the optimizer, the enhancement scaling parameter corresponding to each first sub-attention head and the weakening scaling parameter corresponding to each second sub-attention head, so that the predicted scoring result obtained with the attention pooling layer approaches the human scoring result.
6. The method of claim 5, wherein optimizing the discriminator based on the gradient signal and the optimizer included in the optimization module further includes: optimizing, based on the gradient signal and the optimizer, the type, position, and/or parameters of a normalization layer in a multilayer perceptron included in the discriminator, to improve the generalization ability and robustness of the discriminator; and/or optimizing the activation function used in the multilayer perceptron, to improve the discriminator's nonlinear mapping ability and its sensitivity in discriminating complex semantics and subtle logical differences in specialized texts in the soil and groundwater field.
7. The method of claim 6, wherein optimizing the discriminator based on the gradient signal and the optimizer included in the optimization module further includes: providing an external module on the discriminator and optimizing the external module, wherein the external module includes a residual structure and/or a compression-excitation module; the residual structure is used to enhance the stability of the first high-dimensional feature vector or the second high-dimensional feature vector in the discriminator; and the compression-excitation module is used to perform adaptive recalibration on the first high-dimensional feature vector or the second high-dimensional feature vector.
8. The method according to any one of claims 1 to 7, characterized in that obtaining the automatic scoring result based on the global feature vector and further processing by the discriminator includes: inputting the global feature vector into a multilayer perceptron included in the discriminator; transforming and mapping the global feature vector layer by layer, based on the multilayer perceptron, to obtain sub-scores for multiple quality dimensions; and obtaining the automatic scoring result based on the sub-scores for the multiple quality dimensions and an output layer of the discriminator.
9. An electronic device, comprising one or more processors and one or more memories, wherein the one or more memories store one or more computer programs, the one or more computer programs including instructions that, when executed by the one or more processors, cause the method according to any one of claims 1 to 8 to be performed.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that, when executed on a computer, cause the method according to any one of claims 1 to 8 to be performed.
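Illustrative sketch (not part of the claims): the forward pass recited in claims 1 to 3 and 8 might be sketched as follows. This is only one possible reading, under several assumptions that the claims do not fix: the first high-dimensional feature vector is treated as a matrix of per-token hidden states, each "focused feature vector" is taken to be a softmax attention distribution over tokens, and the final "multiplication operation" is taken to be a weighted sum of token states. All function names, dimensions, scaling values, and the random weights are hypothetical; in the claimed method the scaling parameters and perceptron weights would be learned from human scores while the large language model stays frozen.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def head_weights(H, query):
    """One sub-attention head: attention distribution over the tokens of H."""
    scale = math.sqrt(len(query))
    return softmax([dot(h, query) / scale for h in H])


def attention_pool(H, enh_queries, enh_scales, weak_queries, weak_scales):
    """Fuse scaled head outputs, then multiply the fused weights with H."""
    fused = [0.0] * len(H)
    heads = list(zip(enh_queries, enh_scales)) + list(zip(weak_queries, weak_scales))
    for query, scale in heads:
        focused = head_weights(H, query)  # "focused feature vector" (assumed form)
        fused = [f + scale * w for f, w in zip(fused, focused)]
    dim = len(H[0])
    # Weighted sum of token states gives the global feature vector.
    return [sum(fused[t] * H[t][j] for t in range(len(H))) for j in range(dim)]


def linear(W, b, x):
    return [dot(row, x) + bias for row, bias in zip(W, b)]


def mlp_score(g, W1, b1, W2, b2, out_w):
    """Two-layer perceptron: sub-scores per quality dimension, then one score."""
    hidden = [max(0.0, z) for z in linear(W1, b1, g)]  # ReLU
    sub_scores = [1.0 / (1.0 + math.exp(-z)) for z in linear(W2, b2, hidden)]
    overall = dot(out_w, sub_scores) / sum(out_w)  # weighted mean as output layer
    return sub_scores, overall


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 5 tokens, 4-dim states, 6 hidden units, 3 quality dimensions.
T, D, HID, N_SUB = 5, 4, 6, 3
H = rand_matrix(T, D)  # stand-in for the frozen language model's output
global_vec = attention_pool(
    H,
    rand_matrix(2, D), [1.5, 1.2],  # enhancement heads, scaling factors above 1
    rand_matrix(2, D), [0.5, 0.3],  # weakening heads, scaling factors below 1
)
sub_scores, overall = mlp_score(
    global_vec,
    rand_matrix(HID, D), [0.0] * HID,
    rand_matrix(N_SUB, HID), [0.0] * N_SUB,
    [1.0] * N_SUB,
)
print(sub_scores, overall)
```

In this reading, a head with a scaling factor above 1 contributes more to the fused token weights than a head with a factor below 1, which is one way to realize the amplify/suppress behavior described in claim 1. Other readings, for example element-wise gating of a single pooled vector, are equally compatible with the claim wording.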