A text data quality evaluation method and system
By combining multi-scale feature extraction with dynamic weight fusion and reference-space calibration, the method addresses feature coupling between quality dimensions and scoring instability in text data quality assessment, enabling efficient and stable assessment of text data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA SOFTWARE TESTING CENT
- Filing Date
- 2026-05-13
- Publication Date
- 2026-06-12
AI Technical Summary
Existing text data quality assessment methods struggle to simultaneously characterize both text structural and semantic anomalies. They suffer from limited assessment dimensions, feature coupling, poor adaptability of fixed weights, and inconsistent scoring scales, leading to unstable assessment results and impacting model training stability.
By employing multi-scale feature extraction (character-level, sentence-level, and document-level) combined with multi-dimensional independent prediction units, dynamic fusion weights are generated. Furthermore, by constructing a reference quality space for calibration, multi-dimensional decoupling and cross-data source consistency in text quality assessment are achieved.
The method improves the identification of low-quality text, reduces score fluctuations, adapts the evaluation to each text type while keeping scores stable, and is suitable for large-scale text data evaluation by third-party institutions.
Figure CN122197856A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the technical field of data quality assessment, and in particular to a method and system for assessing the quality of text data based on multi-scale feature modeling, quality dimension decoupling prediction, dynamic weight fusion, and a reference quality space calibration mechanism. The method and system are suitable for third-party organizations conducting objective, automated quality assessments of text datasets.
Background Technology
[0002] With the development of artificial intelligence technology, large-scale text data is widely used for model training, knowledge construction, and intelligent applications. The quality of text data directly affects the performance of downstream models, inference stability, and system reliability.
[0003] Existing text data quality assessment methods typically score based on a single statistical indicator or a fixed weighting combination, which has the following technical drawbacks:
[0004] (1) It is difficult to characterize structural anomalies and semantic anomalies in text at the same time, and the evaluation covers too few dimensions;
[0005] (2) There is feature coupling between different quality dimensions, which leads to score fluctuations and unstable evaluation results;
[0006] (3) Fixed weights are difficult to adapt to different text types and lack adaptability;
[0007] (4) The scoring scales are inconsistent among texts from different sources, making horizontal comparisons difficult;
[0008] (5) Abnormal score fluctuations are likely to occur during batch data processing, affecting the stability of model training.
[0009] Therefore, there is an urgent need for a text quality assessment method that simultaneously accounts for multi-scale features, multi-dimensional decoupled prediction, dynamic adaptive weights, and cross-data source consistency calibration.
Summary of the Invention
[0010] In view of the above-mentioned technical problems in related technologies, the present invention provides a text data quality assessment method and system that can solve the above problems.
[0011] To achieve the above-mentioned technical objectives, the technical solution of the present invention is implemented as follows:
[0012] A method for assessing the quality of text data includes the following steps:
[0013] Obtain the text data to be tested;
[0014] Multi-scale feature extraction is performed on the text data to be tested to generate character-level feature representations, sentence-level feature representations, and document-level feature representations, respectively.
[0015] The character-level feature representation, sentence-level feature representation, and document-level feature representation are fused to obtain a unified text quality feature representation;
[0016] Based on the unified text quality feature representation, multiple quality dimension scores are output by multiple independently set quality prediction units. The parameters of each quality prediction unit are set independently to reduce feature coupling between different quality dimensions.
[0017] Dynamic fusion weights are generated for each quality dimension based on text type features and text statistical features.
[0018] Based on the aforementioned dynamic fusion weights, the scores of each quality dimension are weighted and fused to obtain a comprehensive quality score.
[0019] A reference quality space is constructed based on a high-quality reference text set. The reference quality space includes at least a reference center vector calculated from the quality feature representation of the reference text. The deviation between the quality feature representation of the text to be tested and the reference center vector is calculated, and the comprehensive quality score is calibrated based on the deviation.
[0020] Output the calibrated text quality score;
[0021] The calibration is used to unify the scoring scale of texts from different sources and improve the scoring stability during batch text data processing.
[0022] Furthermore, the character-level feature representation is obtained by embedding and encoding the word sequence and modeling it using a multi-layer self-attention network; the sentence-level feature representation is obtained by segmenting the text into sentences to generate sentence vectors and modeling them using an inter-sentence self-attention structure; and the document-level feature representation is obtained by performing global semantic modeling based on the sentence-level feature representation.
[0023] Furthermore, the quality dimensions include at least one of confusion level, information density, repetition level, completeness level, missing level, purity level, and coherence level.
[0024] Furthermore, the method also includes: during the training phase, introducing a decoupling regularization loss to the intermediate representation of each quality prediction unit to reduce the representation correlation between different quality dimensions.
[0025] Furthermore, the dynamic fusion weights are generated by a neural network whose inputs include text type embedding vectors, text length normalized values, and text statistical features, and are obtained through a normalization function.
[0026] Furthermore, the reference center vector is obtained by averaging the quality feature representation of the reference text; the deviation is the Euclidean distance between the quality feature representation of the text to be tested and the reference center vector; the calibration is calculated as follows: post-calibration score = pre-calibration comprehensive quality score - calibration coefficient × deviation.
[0027] Furthermore, the fusion of character-level feature representation, sentence-level feature representation, and document-level feature representation to obtain a unified text quality feature representation specifically includes: concatenating the feature vectors output from the three branches, reducing their dimensionality through a fully connected layer, and then processing them through a nonlinear activation function.
[0028] A text data quality assessment system, comprising:
[0029] The data acquisition module is used to acquire the text data to be tested;
[0030] The multi-scale feature extraction module is used to perform multi-scale feature extraction on the text data to be tested, and generate character-level feature representation, sentence-level feature representation and document-level feature representation respectively;
[0031] The feature fusion module is used to fuse the character-level feature representation, sentence-level feature representation, and document-level feature representation to obtain a unified text quality feature representation;
[0032] The multi-dimensional independent quality prediction module contains multiple independently configured quality prediction units, each with independently set parameters. It is used to output multiple quality dimension scores based on the unified text quality feature representation, thereby reducing feature coupling between different quality dimensions.
[0033] The dynamic weight generation module is used to generate dynamic fusion weights for each quality dimension based on text type features and text statistical features.
[0034] The weighted fusion module is used to perform weighted fusion of the scores of each quality dimension based on the dynamic fusion weights to obtain a comprehensive quality score.
[0035] The reference quality space calibration module is used to construct a reference quality space based on a set of high-quality reference texts. The reference quality space includes at least a reference center vector calculated from the quality feature representation of the reference texts. The module calculates the deviation between the quality feature representation of the text to be tested and the reference center vector, and calibrates the comprehensive quality score based on the deviation.
[0036] The output module is used to output the calibrated text quality score.
[0037] The beneficial effects of this invention are:
[0038] (1) Through multi-scale feature extraction (character level, sentence level, document level), it can simultaneously capture structural and semantic anomalies in text, significantly improving the ability to identify low-quality text;
[0039] (2) By using a multi-dimensional independent prediction mechanism and decoupling regularization constraints, the mutual interference between different quality dimensions is reduced, and the independence of each dimension score and the stability of the overall score are improved.
[0040] (3) Through a dynamic weight fusion mechanism, weights for each dimension are adaptively generated based on text type features and statistical features to achieve personalized quality assessment for different text types;
[0041] (4) By constructing a reference quality space and introducing a deviation calibration mechanism, the scoring scale of texts from different sources is unified, which effectively suppresses abnormal scoring fluctuations in the batch processing process and improves cross-data source consistency.
[0042] (5) The method of the present invention is applicable to third-party evaluation agencies conducting objective, automated, batch evaluation of large-scale text data, and has good versatility and scalability.
Attached Figure Description
[0043] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0044] Figure 1 is a flowchart of a text data quality assessment method according to an embodiment of the present invention;
[0045] Figure 2 is a schematic diagram of multi-scale feature modeling as described in an embodiment of the present invention;
[0046] Figure 3 is a schematic diagram of the dynamic fusion weight generation described in an embodiment of the present invention.
Detailed Implementation
[0047] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention are within the scope of protection of the present invention.
[0048] As shown in Figures 1-3, the present invention discloses a method for assessing the quality of text data, comprising the following steps:
[0049] Obtain the text data to be tested;
[0050] Multi-scale feature extraction is performed on the text data to be tested to generate character-level feature representations, sentence-level feature representations, and document-level feature representations, respectively.
[0051] The character-level feature representation, sentence-level feature representation, and document-level feature representation are fused to obtain a unified text quality feature representation;
[0052] Based on the unified text quality feature representation, multiple quality dimension scores are output by multiple independently set quality prediction units. The parameters of each quality prediction unit are set independently to reduce feature coupling between different quality dimensions.
[0053] Dynamic fusion weights are generated for each quality dimension based on text type features and text statistical features.
[0054] Based on the aforementioned dynamic fusion weights, the scores of each quality dimension are weighted and fused to obtain a comprehensive quality score.
[0055] A reference quality space is constructed based on a high-quality reference text set. The reference quality space includes at least a reference center vector calculated from the quality feature representation of the reference text. The deviation between the quality feature representation of the text to be tested and the reference center vector is calculated, and the comprehensive quality score is calibrated based on the deviation.
[0056] Output the calibrated text quality score;
[0057] The calibration is used to unify the scoring scale of texts from different sources and improve the scoring stability during batch text data processing.
[0058] In a specific embodiment of this invention, the text data to be tested can originate from local data files, database systems, data interfaces, or data stream acquisition systems. In this embodiment, the input text is assumed to be segmented into a subword sequence:
[0059] $$X = \{x_1, x_2, \dots, x_n\}$$
[0060] where $x_i$ denotes the $i$-th subword token and $n$ is the length of the subword sequence.
[0061] The maximum length of each text is set to 512 subword tokens to control computational complexity while preserving sequence structure information.
[0062] In a specific embodiment of the present invention, multi-scale feature extraction is implemented using a parallel neural network structure, including: 1) a character-level encoding branch; 2) a sentence-level encoding branch; and 3) a document-level encoding branch.
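[0063] (i) Character-level encoding branch
[0064] 1. Input processing
[0065] Perform subword segmentation on the text.
[0066] 2. Embedding layer
[0067] $$E = W_e[X] + P$$
[0068] where $W_e$ is the subword embedding matrix, with $W_e[X]$ denoting the lookup of its rows for the tokens in $X$; $P$ is the position encoding matrix; and $E \in \mathbb{R}^{n \times d_c}$, with $d_c$ the character-level feature dimension.
[0069] 3. Encoding structure
[0070] After passing through a 4-layer self-attention network, the output is a character-level representation matrix:
[0071] $$H_c = f_c(E)$$
[0072] where $f_c$ is the character-level self-attention network and $H_c \in \mathbb{R}^{n \times d_c}$.
[0073] The character-level global feature vector is obtained by average pooling:
[0074] $$h_c = \frac{1}{n} \sum_{i=1}^{n} H_c^{(i)}$$
[0075] where $H_c^{(i)}$ is the $i$-th row of $H_c$ and $h_c \in \mathbb{R}^{d_c}$.
[0076] The character-level branch captures abnormal character distributions, consecutive repeated characters, punctuation anomalies, and noisy text features.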
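To illustrate only the average pooling step of this branch, the following minimal sketch averages a character-level representation matrix over its token positions. The function name `mean_pool` and the toy 3 x 2 matrix are assumptions made for illustration; the 4-layer self-attention network itself is not reproduced here.

```python
from typing import List


def mean_pool(token_matrix: List[List[float]]) -> List[float]:
    """Average an n x d matrix over its n rows (token positions)."""
    if not token_matrix:
        raise ValueError("token_matrix must contain at least one row")
    n = len(token_matrix)
    d = len(token_matrix[0])
    return [sum(row[k] for row in token_matrix) / n for k in range(d)]


# Toy stand-in for the encoder output: 3 tokens, 2 feature dimensions.
H_c = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h_c = mean_pool(H_c)
print(h_c)  # [3.0, 4.0]
```

The pooled vector has the same width as each token row, so the length of the input text does not change the size of the character-level feature.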
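[0077] (ii) Sentence-level encoding branch
[0078] 1. Sentence segmentation
[0079] The text is segmented into sentences to obtain a sentence set:
[0080] $$U = \{u_1, u_2, \dots, u_m\}$$
[0081] where $u_i$ is the $i$-th sentence and $m$ is the number of sentences.
[0082] 2. Sentence vector generation
[0083] Each sentence is mapped to a sentence vector by a shared encoder:
[0084] $$v_i = g(u_i)$$
[0085] where $g$ is the sentence vector encoder and $v_i \in \mathbb{R}^{d_s}$, with $d_s$ the sentence-level feature dimension.
[0086] 3. Inter-sentence modeling
[0087] Using the Transformer architecture, the sentence-level representation matrix is output:
[0088] $$H_s = f_s(V)$$
[0089] where $V = [v_1, v_2, \dots, v_m]$ is the set of sentence vectors; $f_s$ is the inter-sentence self-attention network; and $H_s \in \mathbb{R}^{m \times d_s}$.
[0090] The sentence-level global feature is obtained through attention aggregation:
[0091] $$\alpha_i = \frac{\exp\left(w_a^{\top} H_s^{(i)}\right)}{\sum_{j=1}^{m} \exp\left(w_a^{\top} H_s^{(j)}\right)}$$
[0092] $$h_s = \sum_{i=1}^{m} \alpha_i H_s^{(i)}$$
[0093] where $\alpha_i$ is the attention weight for the $i$-th sentence; $w_a$ is a trainable weight vector; $H_s^{(i)}$ is the $i$-th row of $H_s$; $j$ is the summation index; $\exp$ is the exponential function; and $h_s \in \mathbb{R}^{d_s}$.
[0094] This branch models sentence coherence, missing information, repetitive sentence structure, and semantic consistency.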
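The attention aggregation in this branch can be sketched as a softmax over per-sentence scores followed by a weighted sum. This is a minimal illustration under assumptions: the scoring rule (a dot product with a single weight vector, here called `w_a`) and the toy matrix are hypothetical, and the sentence encoder and inter-sentence Transformer are not reproduced.

```python
import math
from typing import List, Tuple


def attention_pool(rows: List[List[float]],
                   w_a: List[float]) -> Tuple[List[float], List[float]]:
    """Return (pooled vector, attention weights) for an m x d matrix."""
    # Score each sentence row by its dot product with the weight vector.
    scores = [sum(w * x for w, x in zip(w_a, row)) for row in rows]
    # Softmax over sentences; subtracting the max keeps exp() stable.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # The weighted sum of the rows is the sentence-level global feature.
    dim = len(rows[0])
    pooled = [sum(a * row[k] for a, row in zip(alphas, rows))
              for k in range(dim)]
    return pooled, alphas


# Toy sentence-level matrix: 3 sentences, 2 feature dimensions.
H_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_a = [2.0, 0.0]  # assumed weight vector; it would be learned in training
h_s, alphas = attention_pool(H_s, w_a)
print(alphas, h_s)
```

With an all-zero weight vector every sentence receives the same weight, so the aggregation reduces to plain averaging.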
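[0095] (iii) Document-level encoding branch
[0096] Document-level features are constructed with the sentence-level representations as input.
[0097] Using the Transformer architecture, the document-level representation matrix is output:
[0098] $$H_d = f_d(H_s)$$
[0099] $$h_d = \mathrm{Pool}(H_d)$$
[0100] where $H_s$ is the sentence-level representation matrix; $f_d$ is the document-level modeling network; $\mathrm{Pool}$ aggregates $H_d$ over sentence positions into a single vector; and $h_d \in \mathbb{R}^{d_d}$, with $d_d$ the document-level feature dimension.
[0101] This branch captures global topic focus, paragraph-level semantic drift, and overall structural integrity.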
[0102] In a specific embodiment of this invention, during the multi-scale feature fusion process, the output vectors of the three branches are concatenated:
[0103] $$F = \mathrm{Concat}(h_c, h_s, h_d)$$
[0104] where $F$ is the fused text quality feature vector; $h_c$ is the character-level feature vector; $h_s$ is the sentence-level feature vector; $h_d$ is the document-level feature vector; and $\mathrm{Concat}$ denotes the vector concatenation operation.
[0105] Its dimension is:
[0106] $$F \in \mathbb{R}^{1280}$$
[0107] Dimensionality reduction is performed by a fully connected layer: 1) input dimension: 1280; 2) output dimension: 768; 3) activation function: GELU.
[0108] This yields the unified text quality representation vector:
[0109] $$H = \sigma(W_f F + b_f)$$
[0110] where $W_f \in \mathbb{R}^{768 \times 1280}$ is the fusion mapping weight matrix; $b_f$ is the bias vector; $\sigma$ is the nonlinear activation function (GELU); and $H \in \mathbb{R}^{768}$.
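The fusion step (concatenate, project through a fully connected layer, apply GELU) can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy sizes (4 + 2 + 2 inputs projected to 3 outputs) stand in for the 1280 to 768 layer, the weights are random rather than trained, and the names `fuse`, `W_f`, and `b_f` are assumptions.

```python
import math
import random
from typing import List


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def fuse(h_c: List[float], h_s: List[float], h_d: List[float],
         W_f: List[List[float]], b_f: List[float]) -> List[float]:
    """Concatenate the three branch vectors, project, and apply GELU."""
    concat = h_c + h_s + h_d  # list concatenation acts as vector concatenation
    if any(len(row) != len(concat) for row in W_f):
        raise ValueError("each row of W_f must match the concatenated length")
    return [gelu(sum(w * x for w, x in zip(row, concat)) + b)
            for row, b in zip(W_f, b_f)]


# Toy sizes (assumed): 4 + 2 + 2 = 8 inputs projected to 3 outputs.
rng = random.Random(0)
h_c = [rng.uniform(-1, 1) for _ in range(4)]
h_s = [rng.uniform(-1, 1) for _ in range(2)]
h_d = [rng.uniform(-1, 1) for _ in range(2)]
W_f = [[rng.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(3)]
b_f = [0.0, 0.0, 0.0]
H = fuse(h_c, h_s, h_d, W_f, b_f)
print(len(H))  # 3
```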
[0111] In a specific embodiment of the present invention, seven independent quality prediction units are set up, and the structure of each prediction unit is as follows: fully connected layer: 768 → 256; ReLU; Dropout: 0.1; fully connected layer: 256 → 1.
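[0112] The units output the following scores, respectively: 1) confusion level score; 2) information density score; 3) repetition level score; 4) completeness level score; 5) missing level score; 6) purity level score; 7) coherence level score.
[0113] For each quality dimension $i$:
[0114] $$z_i = \mathrm{ReLU}\left(W_i^{(1)} H + b_i^{(1)}\right)$$
[0115] $$q_i = W_i^{(2)} z_i + b_i^{(2)}$$
[0116] where $H$ is the unified text quality representation vector; $W_i^{(1)}$ and $W_i^{(2)}$ are the first-layer and second-layer weights of the $i$-th prediction unit, respectively; $b_i^{(1)}$ and $b_i^{(2)}$ are the first-layer and second-layer biases, respectively; $z_i \in \mathbb{R}^{256}$ is the intermediate representation of the unit; and $q_i$ is the score of the $i$-th quality dimension.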
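The idea of independently parameterized prediction units can be sketched as seven small two-layer networks that share an input but no weights. This is an illustrative sketch under assumptions: toy sizes (8 and 4) stand in for 768 and 256, weights are random rather than trained, dropout is omitted because it only acts during training, and the class name `QualityHead` and the dimension keys are hypothetical.

```python
import random
from typing import List, Tuple

DIMENSIONS = ["confusion", "information_density", "repetition",
              "completeness", "missing", "purity", "coherence"]


class QualityHead:
    """One prediction unit: linear -> ReLU -> linear, with its own parameters."""

    def __init__(self, in_dim: int, hidden_dim: int, rng: random.Random) -> None:
        self.W1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.W2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def forward(self, H: List[float]) -> Tuple[float, List[float]]:
        """Return (dimension score, intermediate representation)."""
        z = [max(0.0, sum(w * x for w, x in zip(row, H)) + b)
             for row, b in zip(self.W1, self.b1)]
        q = sum(w * v for w, v in zip(self.W2, z)) + self.b2
        return q, z


rng = random.Random(0)
in_dim, hidden_dim = 8, 4  # toy sizes standing in for 768 and 256
heads = {name: QualityHead(in_dim, hidden_dim, rng) for name in DIMENSIONS}
H = [rng.uniform(-1, 1) for _ in range(in_dim)]
scores = {name: head.forward(H)[0] for name, head in heads.items()}
print(scores)
```

Returning the intermediate representation alongside the score makes it available to a regularizer that compares representations across units.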
[0117] During the training phase, a decoupling regularization term is introduced to reduce the cosine similarity between the intermediate representations of each prediction unit. The decoupling regularization loss is as follows:
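[0118] $$L_{\mathrm{dec}} = \sum_{i \neq j} \frac{z_i \cdot z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$$
[0119] where $z_i$ and $z_j$ are the intermediate representations of the $i$-th and $j$-th prediction units; "·" denotes the vector dot product; $\lVert \cdot \rVert$ is the Euclidean norm of a vector; and $i \neq j$ are indices of different quality dimensions.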
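One reading of this regularizer is a sum of cosine similarities over all ordered pairs of distinct prediction units, which the sketch below assumes. The function names and the small `eps` guard against zero-norm vectors (which can occur after ReLU) are additions for illustration.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float], eps: float = 1e-12) -> float:
    """Dot product divided by the product of Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)  # eps avoids division by zero


def decoupling_loss(reps: List[List[float]]) -> float:
    """Sum cosine similarity over all ordered pairs (i, j) with i != j."""
    total = 0.0
    for i, z_i in enumerate(reps):
        for j, z_j in enumerate(reps):
            if i != j:
                total += cosine_similarity(z_i, z_j)
    return total


print(decoupling_loss([[1.0, 0.0], [0.0, 1.0]]))  # 0.0, orthogonal units
print(decoupling_loss([[1.0, 0.0], [2.0, 0.0]]))  # 2.0, fully aligned units
```

Orthogonal intermediate representations contribute nothing to the loss, while aligned ones are penalized, which is the behavior the decoupling term relies on.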
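[0120] In a specific embodiment of this invention, the input features of the dynamic weight generation network include: 1) a text type embedding vector (dimension 128); 2) a normalized text length value; and 3) basic statistical features. These are concatenated into a fused input vector:
[0121] $$x_w = [e_t; \ell; c]$$
[0122] where $e_t$ is the text type embedding vector; $\ell$ is the normalized text length; and $c$ is the statistical feature vector. With an input dimension of 138, the statistical feature vector accounts for the remaining 138 - 128 - 1 = 9 dimensions.
[0123] Network structure: 1) Input dimension: 138; 2) Fully connected layer: 138 → 64; 3) ReLU; 4) Fully connected layer: 64 → 7; 5) Softmax.
[0124] The network outputs seven fusion weights:
[0125] $$w = [w_1, w_2, \dots, w_7], \quad \sum_{i=1}^{7} w_i = 1$$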
[0126] The overall quality score is calculated as:
[0127] $$S = \sum_{i=1}^{7} w_i q_i$$
[0128] where $S$ is the overall quality score of the text; $i$ is the quality dimension index; $w_i$ is the fusion weight of the $i$-th quality dimension; and $q_i$ is the predicted score of the $i$-th quality dimension.
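The weight generation and weighted fusion steps can be sketched as a two-layer network ending in a softmax, followed by a weighted sum of the dimension scores. This is a minimal sketch under assumptions: toy sizes (5 inputs, 3 hidden units) stand in for 138 and 64, parameters are random rather than trained, and the function names and example scores are hypothetical.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalize logits into positive weights that sum to 1."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_weights(x_w: List[float], W1: List[List[float]], b1: List[float],
                    W2: List[List[float]], b2: List[float]) -> List[float]:
    """Linear -> ReLU -> linear -> softmax over the quality dimensions."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, x_w)) + b)
              for row, b in zip(W1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return softmax(logits)


def overall_score(weights: List[float], scores: List[float]) -> float:
    """Weighted sum of the per-dimension scores."""
    if len(weights) != len(scores):
        raise ValueError("weights and scores must have the same length")
    return sum(w * q for w, q in zip(weights, scores))


rng = random.Random(0)
in_dim, hidden_dim, n_dims = 5, 3, 7  # toy sizes standing in for 138, 64, 7
x_w = [rng.uniform(0, 1) for _ in range(in_dim)]
W1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
W2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(n_dims)]
w = dynamic_weights(x_w, W1, [0.0] * hidden_dim, W2, [0.0] * n_dims)
q = [0.8, 0.6, 0.9, 0.7, 0.5, 0.95, 0.85]  # assumed example scores
print(overall_score(w, q))
```

Because the softmax weights are positive and sum to 1, the overall score always lies between the smallest and largest dimension score.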
[0129] In one specific implementation of the present invention, a high-quality reference corpus (e.g., 100,000 manually selected high-quality texts) is constructed.
[0130] Encoding the reference texts yields a set of latent-space representations.
[0131] The reference center vector is calculated as:
[0132] $$\mu = \frac{1}{N} \sum_{k=1}^{N} H_k^{\mathrm{ref}}$$
[0133] where $N$ is the number of reference texts and $H_k^{\mathrm{ref}}$ is the quality representation vector of the $k$-th reference text.
[0134] The deviation between the text under test and the reference quality space is calculated as:
[0135] $$d = \lVert H - \mu \rVert_2$$
[0136] where $H$ is the quality feature representation vector of the text under test; $\mu$ is the reference center vector; and $\lVert \cdot \rVert_2$ is the Euclidean norm.
[0137] The calibrated score is:
[0138] $$S_{\mathrm{cal}} = S - \gamma d$$
[0139] where $S_{\mathrm{cal}}$ is the overall quality score after calibration; $S$ is the overall quality score of the text before calibration; $\gamma$ is the calibration coefficient, selected between 0.1 and 0.5; and $d$ is the deviation between the text under test and the reference quality space.
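The calibration step (average the reference representations, measure the Euclidean distance to that center, subtract a scaled deviation) can be sketched as follows. The function names are hypothetical, the two-dimensional vectors are toy data, and the default coefficient of 0.3 is simply an arbitrary value inside the stated 0.1 to 0.5 range.

```python
import math
from typing import List


def reference_center(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of the reference quality representations."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(dim)]


def deviation(H: List[float], mu: List[float]) -> float:
    """Euclidean distance between a representation and the center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(H, mu)))


def calibrate(score: float, H: List[float], mu: List[float],
              coeff: float = 0.3) -> float:
    """Calibrated score = score - coeff * deviation (coeff assumed in 0.1..0.5)."""
    return score - coeff * deviation(H, mu)


refs = [[0.0, 0.0], [2.0, 2.0]]  # toy reference representations
mu = reference_center(refs)
H = [4.0, 5.0]                   # toy representation of the text under test
print(mu)                        # [1.0, 1.0]
print(deviation(H, mu))          # 5.0
print(calibrate(0.9, H, mu, coeff=0.1))
```

A text whose representation coincides with the reference center keeps its score unchanged, and the penalty grows linearly with distance from the center.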
[0140] In a specific embodiment of this invention, for the text quality assessment method described in this application, it is necessary to pre-train a multi-scale feature extraction network, a multi-dimensional independent quality prediction unit, a dynamic weight generation network, and a reference quality space encoding network. During the training phase, a loss function incorporating multiple objectives is constructed, enabling the output of accurate, stable, decoupled, and highly discriminative quality scores during inference.
[0141] This embodiment uses a labeled, high-quality training dataset. Each training sample includes the text to be tested, human ratings for each quality dimension, quality level labels, and pairwise comparison ranking information. The total loss function during training is composed of the following five weighted components:
[0142] Quality regression loss ($L_{\mathrm{reg}}$): The mean squared error (MSE) is used to supervise the consistency between the dimension scores output by each quality prediction unit and the human scores. This loss directly ensures the numerical accuracy of the quality dimension scores.
[0143] Classification loss ($L_{\mathrm{cls}}$): Cross-entropy loss is employed to supervise the classification prediction of text quality levels (e.g., high / medium / low). This loss helps improve the semantic discriminative power of the unified text quality feature representation H, indirectly improving the quality of feature fusion.
[0144] Decoupling regularization loss ($L_{\mathrm{dec}}$): This loss sums the cosine similarities between the intermediate representations of the quality prediction units, penalizing correlation between different dimensions. It pushes each quality prediction unit to learn an independent feature subspace, reducing feature coupling between different quality dimensions and thereby improving the independence and stability of the scores.
[0145] Consistency loss ($L_{\mathrm{con}}$): For the same text, two forward passes are performed under minor data augmentation (such as random removal of stop words or synonym replacement), and the overall quality scores S of the two passes are constrained to be as consistent as possible. This loss suppresses prediction jitter in the weighted fusion score.
[0146] Ranking loss ($L_{\mathrm{rank}}$): A triplet loss is employed to maintain the correct relative order of the overall quality scores for texts of different quality (high quality, medium quality, low quality). This loss enhances the discriminative power of the scores after calibration in the reference quality space, ensuring reliable ranking during batch evaluation.
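The consistency and ranking terms can be illustrated with simple scalar formulations. Both are assumptions rather than forms given in the text: the consistency term is taken as the squared difference between two scores of the same text, and the ranking term as a margin-based hinge over a (high, medium, low) triplet. The margin value of 0.1 and the function names are hypothetical.

```python
def consistency_loss(score_a: float, score_b: float) -> float:
    """Squared difference between two scores of the same (augmented) text."""
    return (score_a - score_b) ** 2


def ranking_loss(s_high: float, s_mid: float, s_low: float,
                 margin: float = 0.1) -> float:
    """Hinge penalties that vanish once high > mid > low by at least margin."""
    upper = max(0.0, margin - (s_high - s_mid))
    lower = max(0.0, margin - (s_mid - s_low))
    return upper + lower


print(consistency_loss(1.0, 0.5))   # 0.25
print(ranking_loss(0.9, 0.5, 0.1))  # 0.0, correctly ordered with a wide gap
print(ranking_loss(0.5, 0.5, 0.5))  # penalized, the three scores are tied
```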
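[0147] The total loss function has the form:
[0148] $$L = \lambda_1 L_{\mathrm{reg}} + \lambda_2 L_{\mathrm{cls}} + \lambda_3 L_{\mathrm{dec}} + \lambda_4 L_{\mathrm{con}} + \lambda_5 L_{\mathrm{rank}}$$
[0149] where $L$ is the total loss function; $L_{\mathrm{reg}}$ is the quality regression loss; $L_{\mathrm{cls}}$ is the classification loss; $L_{\mathrm{dec}}$ is the decoupling regularization loss; $L_{\mathrm{con}}$ is the consistency loss; $L_{\mathrm{rank}}$ is the ranking loss; and $\lambda_1, \dots, \lambda_5$ are the loss weights. Example parameters for the loss weights: ; ; ; By minimizing the total loss $L$, the parameters of all the above modules are updated end-to-end using the backpropagation algorithm.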
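The combination of the five components into one training objective can be sketched as a weighted sum. The weight values below are placeholders chosen only for illustration; they are not the example values of the embodiment, and the dictionary keys and function name are hypothetical.

```python
from typing import Dict

COMPONENTS = ("reg", "cls", "dec", "con", "rank")


def total_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the five loss components."""
    missing = [k for k in COMPONENTS if k not in losses or k not in weights]
    if missing:
        raise KeyError(f"missing loss components or weights: {missing}")
    return sum(weights[k] * losses[k] for k in COMPONENTS)


# Placeholder weights (assumed, NOT the embodiment's example values).
weights = {"reg": 1.0, "cls": 0.5, "dec": 0.25, "con": 0.25, "rank": 0.5}
# Toy per-component loss values for a single training step.
losses = {"reg": 0.5, "cls": 1.0, "dec": 2.0, "con": 0.0, "rank": 1.0}
print(total_loss(losses, weights))  # 2.0
```

In an automatic differentiation framework the same weighted sum would be computed on tensors, so that one backward pass updates all modules jointly.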
[0150] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for assessing the quality of text data, characterized in that it includes the following steps:
obtaining the text data to be tested;
performing multi-scale feature extraction on the text data to be tested to generate character-level feature representations, sentence-level feature representations, and document-level feature representations, respectively;
fusing the character-level feature representation, sentence-level feature representation, and document-level feature representation to obtain a unified text quality feature representation;
outputting, based on the unified text quality feature representation, multiple quality dimension scores by multiple independently set quality prediction units, wherein the parameters of each quality prediction unit are set independently to reduce feature coupling between different quality dimensions;
generating dynamic fusion weights for each quality dimension based on text type features and text statistical features;
performing weighted fusion of the scores of each quality dimension based on the dynamic fusion weights to obtain a comprehensive quality score;
constructing a reference quality space based on a high-quality reference text set, wherein the reference quality space includes at least a reference center vector calculated from the quality feature representation of the reference text, calculating the deviation between the quality feature representation of the text to be tested and the reference center vector, and calibrating the comprehensive quality score based on the deviation;
outputting the calibrated text quality score;
wherein the calibration is used to unify the scoring scale of texts from different sources and improve the scoring stability during batch text data processing.
2. The text data quality assessment method according to claim 1, characterized in that the character-level feature representation is obtained by embedding and encoding the word sequence and modeling it using a multi-layer self-attention network; the sentence-level feature representation is obtained by segmenting the text into sentences to generate sentence vectors and modeling them using an inter-sentence self-attention structure; and the document-level feature representation is obtained by performing global semantic modeling based on the sentence-level feature representation.
3. The text data quality assessment method according to claim 1, characterized in that the quality dimensions include at least one of the following: confusion level, information density, repetition level, completeness level, missing level, purity level, and coherence level.
4. The text data quality assessment method according to claim 1, characterized in that a decoupling regularization loss is introduced into the intermediate representation of each quality prediction unit to reduce the representation correlation between different quality dimensions.
5. The text data quality assessment method according to claim 1, characterized in that the dynamic fusion weights are generated by a neural network whose inputs include text type embedding vectors, text length normalized values, and text statistical features, and are obtained through a normalization function.
6. The text data quality assessment method according to claim 1, characterized in that the reference center vector is obtained by averaging the quality feature representation of the reference text; the deviation is the Euclidean distance between the quality feature representation of the text to be tested and the reference center vector; and the calibration is calculated as follows: post-calibration score = pre-calibration comprehensive quality score - calibration coefficient × deviation.
7. The text data quality assessment method according to claim 1, characterized in that fusing the character-level feature representation, sentence-level feature representation, and document-level feature representation to obtain a unified text quality feature representation specifically includes: concatenating the feature vectors output from the three branches, reducing their dimensionality through a fully connected layer, and then processing them through a nonlinear activation function.
8. A text data quality assessment system, used to implement the text data quality assessment method as described in any one of claims 1 to 7, characterized in that it includes:
a data acquisition module, used to acquire the text data to be tested;
a multi-scale feature extraction module, used to perform multi-scale feature extraction on the text data to be tested and generate character-level feature representation, sentence-level feature representation, and document-level feature representation, respectively;
a feature fusion module, used to fuse the character-level feature representation, sentence-level feature representation, and document-level feature representation to obtain a unified text quality feature representation;
a multi-dimensional independent quality prediction module, containing multiple independently configured quality prediction units, each with independently set parameters, used to output multiple quality dimension scores based on the unified text quality feature representation, thereby reducing feature coupling between different quality dimensions;
a dynamic weight generation module, used to generate dynamic fusion weights for each quality dimension based on text type features and text statistical features;
a weighted fusion module, used to perform weighted fusion of the scores of each quality dimension based on the dynamic fusion weights to obtain a comprehensive quality score;
a reference quality space calibration module, used to construct a reference quality space based on a set of high-quality reference texts, wherein the reference quality space includes at least a reference center vector calculated from the quality feature representation of the reference texts, and the module calculates the deviation between the quality feature representation of the text to be tested and the reference center vector and calibrates the comprehensive quality score based on the deviation;
an output module, used to output the calibrated text quality score.