A low-quality data filtering system and method for multi-source questionnaire data

By analyzing the regularity of the answer sequence and the duration of responses in the questionnaire data, and combining the difficulty of the questions with the user's ability, the quality of the questionnaire data is dynamically evaluated. This overcomes the limitations of judging based on a single duration threshold and achieves efficient and accurate screening of low-quality data.

CN122309939A · Pending · Publication Date: 2026-06-30 · HENAN HENGDAI INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HENAN HENGDAI INFORMATION TECHNOLOGY CO LTD
Filing Date
2026-04-03
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, the method of using a single response time threshold to assess data quality in questionnaire data analysis cannot accurately identify low-quality data and fails to effectively distinguish response time anomalies caused by differences in ability and level of diligence.

Method used

By analyzing the regularity of answer sequences, the length of question text, and the duration of response, combined with the user's regularity index and the degree of thought required, the seriousness of the user's response is dynamically assessed, and abnormal data is filtered out.

Benefits of technology

It significantly improves the accuracy of identifying low-quality data, reduces misjudgments and biases, lays a reliable foundation for subsequent data analysis, and improves data cleaning efficiency.

Abstract

This invention relates to the field of electronic digital data processing technology, specifically to a low-quality data filtering system and method for multi-source questionnaire data. The method includes: acquiring raw questionnaire data from target users, including questions and answers; analyzing the regularity of responses using answer sequences to obtain a regularity performance index for the target users; determining the necessity of consideration for target questions using the text length parameter of the target questions and the response time data for each user; determining the degree of carelessness in the target users' responses by combining the regularity performance index, the necessity of consideration for target questions, and the response time data for each question; determining the differences in the target users' responses to similar questions; and identifying and filtering out abnormal response data based on the degree of carelessness and the differences in responses to obtain filtered, valid questionnaire data. The technical solution of this invention significantly improves the accuracy and reliability of data quality identification.
Description

Technical Field

[0001] This invention relates to the field of electronic digital data processing technology, specifically to a low-quality data filtering system and method for multi-source questionnaire data.

Background Technology

[0002] For questionnaire survey data, the corresponding survey results can be intuitively output through automated data analysis. To improve the quality of data analysis and the reliability of the results, data cleaning is necessary to distinguish between valid data and invalid data (low-quality data).

[0003] Traditional methods judge the seriousness and quality of a test taker's response solely against a response-time threshold. They simply assume that shorter response times indicate lower seriousness and quality, while longer response times indicate higher seriousness and quality. This approach fails to consider the dynamic impact of differing question difficulty and varying test-taker ability on response time. Such a static standard cannot distinguish reasonable variations in response time caused by differences in ability from abnormal time consumption caused by differing levels of seriousness, so it is not accurate enough in identifying low-quality records in questionnaire data.

Summary of the Invention

[0004] To address the technical problem that assessing data quality against a single response-time threshold identifies low-quality data with low accuracy during questionnaire data analysis, this invention aims to provide a low-quality data filtering system and method for multi-source questionnaire data. The specific technical solution adopted is as follows. This invention provides a method for filtering low-quality data from multi-source questionnaire data, the method comprising:
obtaining raw questionnaire data from target users, including questions and answers, and analyzing the regularity of responses using the answer sequence to obtain the target user's regularity performance index;
determining the degree of consideration required for a target question using the text length parameter of the target question and the response time data of each user;
determining the degree of carelessness of the target user in answering by combining the target user's regularity performance index, the degree of consideration required for the target question, and the target user's response time data for each question;
determining the differences in the target user's answers to similar questions, and combining the degree of carelessness with the differences in answers to identify abnormal response data and filter it out, obtaining the filtered valid questionnaire data.

[0005] Furthermore, analyzing the regularity of responses using the answer sequence to obtain the target user's regularity performance index includes: arranging the answer options in the original questionnaire data according to the question order to obtain the answer sequence, and identifying the adjacent questions in the answer sequence that have the same target option code; and determining the number of intervening questions between adjacent questions, and analyzing the regularity of responses based on the number of intervening questions between the adjacent questions of each group to obtain the target user's regularity performance index.

[0006] Furthermore, analyzing the regularity of responses based on the number of intervening questions between the adjacent questions of each group to obtain the target user's regularity performance index includes: determining the average number of intervening questions over the adjacent questions of all groups, and using the difference between each group's number of intervening questions and that average to determine the degree of difference in intervals for the target option code; and counting the number of options whose degree of difference is less than a preset difference threshold, and combining the number of options with the degrees of difference to obtain the target user's regularity performance index.

[0007] Furthermore, determining the degree of consideration required for the target question using the text length parameter of the target question and the response time data of each user includes: determining the target word count of the target question and the average response time of all users for the target question; and using the target word count and the average response time to determine the degree of consideration that the target question requires of any user.

[0008] Furthermore, using the target word count and the average response time to determine the degree of consideration that the target question requires of any user includes: determining the word count percentage, i.e., the target word count as a percentage of the maximum word count among all questions, and determining the duration percentage, i.e., the average response time as a percentage of the average maximum response time among all questions; and combining the word count percentage and the duration percentage to calculate the degree of consideration that the target question requires of any user.

[0009] Furthermore, determining the degree of carelessness of the target user in answering by combining the target user's regularity performance index, the degree of consideration required for the target question, and the target user's response time data for each question includes: determining the target user's average response time over all questions and target response time for the target question, and determining the response time difference between the target response time and the average response time; and combining the target user's regularity performance index, the degree of consideration required for the target question, and the response time difference to determine the degree of carelessness of the target user in answering.

[0010] Furthermore, after the degree of carelessness of the target user in answering has been determined by combining the target user's regularity performance index, the degree of consideration required for the target question, and the target user's response time data for each question, the method further includes: screening out, as abnormal data, all original questionnaire data of users whose degree of carelessness exceeds a preset carelessness threshold.

[0011] Furthermore, determining the differences in the target user's answers to similar questions includes: determining the graded number or the textual semantics corresponding to each answer option in questions of the same type, and determining the difference in number or in semantics between answer options; the difference in number or the difference in semantics is used as the difference in the target user's answers to similar questions.

[0012] Furthermore, combining the degree of carelessness with the differences in answers to identify abnormal response data and filter it out, obtaining the filtered valid questionnaire data, includes:
determining the first answer option, i.e., the answer option that accounts for the largest proportion of all answer options in questions of the same type, and determining the target proportion that the target answer option accounts for among all answer options;
determining the first answer difference between the target answer option and the first answer option, and determining the maximum answer difference between any two answer options;
determining the second answer difference from the first answer difference and the maximum answer difference, and combining the degree of carelessness, the target proportion, and the second answer difference to obtain the screening necessity of the target data corresponding to the target answer option;
if the screening necessity exceeds a preset screening threshold, treating the target data as abnormal response data and screening it out to obtain the filtered valid questionnaire data.

[0013] The present invention also provides a low-quality data filtering system for multi-source questionnaire data, the system being used to implement the low-quality data filtering method for multi-source questionnaire data as described in any of the preceding claims; the system includes:
a response analysis module, used to acquire raw questionnaire data from target users, including questions and answers; to analyze the regularity of responses using the answer sequence to obtain the target user's regularity performance index; to determine the degree of consideration required for the target question using the text length parameter of the target question and the response time data of each user; and to determine the degree of carelessness of the target user in answering by combining the target user's regularity performance index, the degree of consideration required for the target question, and the target user's response time data for each question;
a data filtering module, used to determine the differences in the target user's answers to similar questions, and to combine the degree of carelessness with the differences in answers to identify abnormal response data and filter it out, obtaining the filtered valid questionnaire data.

[0014] The present invention has the following beneficial effects: This invention targets user (tester) questionnaire data and analyzes the patterns in questionnaire answer selection to determine perfunctory responses. It assesses the necessity of thinking about a question based on the time different users spend answering a single question and the question's length. It also judges the seriousness of test-takers' responses based on the time spent answering questions of varying difficulty. Furthermore, it filters the questionnaire data by considering differences in test-takers' answers to semantically similar (same category) questions to identify valid and invalid (low-quality) data.

[0015] This invention identifies user response behavior characteristics through multi-dimensional analysis of questionnaire data, dynamically adapting to varying question difficulty and test-taker abilities, significantly improving the accuracy of identifying low-quality data. This method effectively overcomes the limitations of relying solely on a single time threshold, reducing misjudgments and biases, and laying a reliable foundation for subsequent high-quality data analysis. This invention is based on recognizing regularities in answer sequences; quantifying individual question engagement using a time-to-text ratio model; constructing a regression model of difficulty coefficient and time consumption to assess overall attentiveness; and analyzing the distance between answers of similar questions to detect logical contradictions in responses. Ultimately, it supports the output of multi-dimensional quality scores, providing quantitative evidence for data cleaning, assisting analysts in quickly screening high-reliability data, significantly improving the accuracy and reliability of data quality identification, while automated data screening greatly improves questionnaire processing efficiency.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions and advantages in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a flowchart of the steps of a low-quality data filtering method for multi-source questionnaire data provided in an embodiment of the present invention;
Figure 2 is a detailed flowchart of step S1 in a low-quality data filtering method for multi-source questionnaire data provided in an embodiment of the present invention;
Figure 3 is a detailed flowchart of step S2 in a low-quality data filtering method for multi-source questionnaire data provided in an embodiment of the present invention;
Figure 4 is a detailed flowchart of step S3 in a low-quality data filtering method for multi-source questionnaire data provided in an embodiment of the present invention;
Figure 5 is a detailed flowchart of step S4 in a low-quality data filtering method for multi-source questionnaire data provided in an embodiment of the present invention;
Figure 6 is a schematic diagram of the hardware operating environment of the low-quality data filtering device for multi-source questionnaire data involved in the embodiments of the present invention;
Figure 7 is a schematic diagram of the framework structure of a low-quality data filtering system for multi-source questionnaire data involved in an embodiment of the present invention.

Detailed Implementation

[0018] To further illustrate the technical means and effects adopted by the present invention to achieve its intended purpose, the following, in conjunction with the accompanying drawings and preferred embodiments, details the specific implementation, structure, features, and effects of a low-quality data filtering method for multi-source questionnaire data proposed according to the present invention. In the following description, different "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, specific features, structures, or characteristics in one or more embodiments can be combined in any suitable form.

[0019] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0020] The following description, in conjunction with the accompanying drawings, details a specific scheme for a low-quality data filtering method for multi-source questionnaire data provided by the present invention.

[0021] Example 1: For the low-quality data filtering method for multi-source questionnaire data provided by this invention, refer to Figure 1, which shows a flowchart of the method as provided in an embodiment of the present invention.

[0022] The method for filtering low-quality data from multi-source questionnaire data includes:
Step S1: Obtain the original questionnaire data of the target users, including questions and answers, and analyze the regularity of responses using the answer sequence to obtain the target user's regularity performance index.
In this embodiment, the questionnaire design process clearly defines the research objectives and operationalizes the core constructs to ensure that each question effectively corresponds to the dimensions to be measured. The design incorporates multiple quality control points, including attention checks, logic checks, and open-ended questions, to pre-set mechanisms for identifying invalid responses. Subsequently, appropriate question types and scales are selected based on the characteristics of the target population, and a professional platform is used to capture metadata such as jump logic and response time records. A key step is to conduct a small-scale pre-test and collect feedback in order to remove ambiguous wording and to adjust the completeness of the options and the fluency of the flow, thereby maximizing the validity and reliability of the questionnaire before data collection.

[0023] To identify low-quality data, the answers of the test takers ("users") must first be obtained after they have completed the questionnaire. First, target samples are recruited through multiple channels (such as panel libraries and social media) based on the sampling frame. During the initial deployment phase, response times and the pass rate of trick questions are monitored so that anomalies are identified promptly. Once the target sample size is reached, a complete raw dataset containing all answers, timestamps, IP addresses, and other metadata is exported from the platform. This dataset is then associated with the questions in the questionnaire to obtain the raw questionnaire data, which can be immediately and securely backed up. Subsequently, the data is standardized and imported into the analysis software, with unified variable naming and value labels. A data cleaning log is also established to record all subsequent operations. The core principle at this stage is "raw data preservation," ensuring that all subsequent screening and analysis are based on traceable, tamper-proof original records.

[0024] When answering a questionnaire, a target user (referring to any user) may answer arbitrarily, for example because of a careless attitude, such as selecting all options or selecting options in a certain order. Therefore, to judge the quality of the data associated with each answer, the regularity of the user's responses, as reflected in the chosen answer options, can be analyzed first.

[0025] Specifically, referring to Figure 2, step S1, analyzing the regularity of responses using the answer sequence to obtain the target user's regularity performance index, includes:
Step S11: Arrange the answer options in the original questionnaire data according to the question order to obtain the answer sequence, and determine the adjacent questions in the answer sequence that have the same target option code;
Step S12: Determine the number of intervening questions between adjacent questions, and analyze the regularity of responses based on the number of intervening questions between the adjacent questions of each group to obtain the target user's regularity performance index.

[0026] More specifically, step S12, analyzing the regularity of responses based on the number of intervening questions between the adjacent questions of each group to obtain the target user's regularity performance index, includes: determining the average number of intervening questions over the adjacent questions of all groups, and using the difference between each group's number of intervening questions and that average to determine the degree of difference in intervals for the target option code; and counting the number of options whose degree of difference is less than a preset difference threshold, and combining the number of options with the degrees of difference to obtain the target user's regularity performance index.

[0027] In this embodiment, based on the original questionnaire data, the answer data of a single tester a (as the target user) can be plotted in a Cartesian coordinate system whose horizontal axis is the question order (from smallest to largest, e.g., questions 1-20) and whose vertical axis is the corresponding answer option (option code, e.g., from A to D). The answer option for each question is then placed into this coordinate system. Equivalently, the answer options (codes) in the original questionnaire data can be arranged according to the question order to obtain an answer sequence.

[0028] For a single option (code) p (as the target option code, such as A, B, C, or D), find every pair (group) of adjacent questions in the Cartesian coordinate system or answer sequence, i.e., two questions that were both answered with option p with no other question answered with p between them, and calculate the number of intervening questions between each pair. For example, if the options chosen for questions 1 to 4 are (A, D, C, A), then questions 1 and 4 form a pair of adjacent questions for option A, the number of intervening questions is 2, and the options of the intervening questions are D and C.
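As an illustration of the interval count described above, the following is a minimal Python sketch. The function name `interval_counts` and the representation of the answer sequence as a list of option codes are assumptions made for this example and do not appear in the patent text.

```python
from collections import defaultdict


def interval_counts(answers):
    """Map each option code to the gaps between its consecutive occurrences.

    A gap is the number of questions lying strictly between two consecutive
    questions that were answered with the same option code.
    """
    last_seen = {}            # option code -> index of its latest occurrence
    gaps = defaultdict(list)  # option code -> list of gap sizes
    for index, option in enumerate(answers):
        if option in last_seen:
            gaps[option].append(index - last_seen[option] - 1)
        last_seen[option] = index
    return dict(gaps)


# The example from the text: questions 1 to 4 answered (A, D, C, A).
print(interval_counts(["A", "D", "C", "A"]))  # {'A': [2]}
```

Options that occur only once produce no gap and therefore do not appear in the result, which matches the later remark that such options are left out of the summation.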

[0029] For option p, calculate the mean of the numbers of intervening questions over all groups of adjacent questions. From this, the degree of difference in the intervals between adjacent questions for option p is calculated, where: the number of times option p appears in the answers is counted; k is the index of the k-th group of adjacent questions, and there is one group fewer than the number of occurrences of option p; and the quantity difference of each group is the absolute value of the difference between that group's number of intervening questions and the mean. The result is normalized using min-max normalization to obtain the degree of difference, whose range is [0,1].

[0030] Count the number of options (types), among the option codes (generally four types: A, B, C, and D; more can be added depending on the actual situation), whose degree of difference is less than a preset difference threshold; the threshold can be adjusted according to actual conditions. For example, if for tester a the degrees of difference of two options, A and B, meet this condition in the questionnaire, then the number of options is 2. It should be noted that, given the questionnaire scenario, if the number of times an option code p appears in the answer sequence is less than 2, this option is not included in the summation and its degree of difference is set directly to 0 (representing an extreme case).

[0031] The larger the number of options, and the smaller the cumulative sum of the degrees of difference in the intervals between adjacent questions over all option types (N being the total number of option types, such as A, B, C, and D), the more regular tester a's questionnaire answers are, and the less serious tester a may have been when answering; that is, the more perfunctory the answering behavior.

[0032] This gives the regularity performance index of tester a's questionnaire answers. The constant 0.01 in the formula prevents the numerator or denominator from being 0, which would render the calculation meaningless.
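The calculation in paragraphs [0029] to [0032] can be sketched in Python as follows. The formulas themselves are not reproduced in this text, so the concrete forms below are assumptions chosen to agree with the prose: the raw spread is taken as the mean absolute deviation of the gaps, min-max normalization is applied across the option codes of one tester, options seen fewer than twice are excluded from both the count and the sum, and the index is a ratio with 0.01 added to numerator and denominator. The function name and the default threshold of 0.3 are hypothetical.

```python
from collections import defaultdict


def regularity_index(answers, diff_threshold=0.3, eps=0.01):
    """Hypothetical regularity performance index for one answer sequence."""
    # Gaps between consecutive occurrences of the same option code.
    last_seen = {}
    gaps = defaultdict(list)
    for index, option in enumerate(answers):
        if option in last_seen:
            gaps[option].append(index - last_seen[option] - 1)
        last_seen[option] = index

    # ASSUMPTION: raw spread = mean absolute deviation of the gaps.
    # Options seen fewer than twice have no gaps and are left out.
    raw = {}
    for option, values in gaps.items():
        mean_gap = sum(values) / len(values)
        raw[option] = sum(abs(v - mean_gap) for v in values) / len(values)
    if not raw:
        return 1.0  # (0 + eps) / (0 + eps)

    # ASSUMPTION: min-max normalization across this tester's option codes.
    low, high = min(raw.values()), max(raw.values())
    span = high - low
    degree = {o: (v - low) / span if span else 0.0 for o, v in raw.items()}

    regular_options = sum(1 for d in degree.values() if d < diff_threshold)
    total_difference = sum(degree.values())
    # ASSUMPTION: ratio form; eps keeps numerator and denominator non-zero.
    return (regular_options + eps) / (total_difference + eps)


patterned = list("ABCD" * 5)   # strictly repeating A, B, C, D
mixed = list("AABAAAB")        # no fixed spacing for option A
print(regularity_index(patterned) > regularity_index(mixed))  # True
```

Under these assumptions a strictly repeating sequence yields a very large index because every option has identical gaps, while a sequence with uneven gaps yields a small one.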

[0033] By implementing the above process, we can determine the regularity of different test takers' answer choices and use this as one of the references for screening low-quality data in the questionnaire data, thus making the screening more accurate.

[0034] Step S2: Determine the degree of consideration required for the target question using the text length parameter of the target question and the response time data of each user.
Specifically, referring to Figure 3, step S2 includes:
Step S21: Determine the target word count of the target question and the average response time of all users for the target question;
Step S22: Using the target word count and the average response time, determine the degree of consideration that the target question requires of any user.

[0035] More specifically, step S22 includes: determining the word count percentage, i.e., the target word count as a percentage of the maximum word count among all questions, and determining the duration percentage, i.e., the average response time as a percentage of the average maximum response time (the largest average response time among all questions); and combining the word count percentage and the duration percentage to calculate the degree of consideration that the target question requires of any user.

[0036] In this embodiment, different questions on the answer sheet require different amounts of mental effort to answer carefully due to their varying word counts. Consequently, the time taken to answer different questions should also differ for each test taker. To accurately assess a test taker's attentiveness when answering different questions, the necessary level of thought for each question should first be determined by examining the time taken by different test takers to answer a single question and the length of the question itself.

[0037] For question b in the questionnaire (as the target question, referring to any question), calculate the average time taken by all test takers to answer question b, i.e., the average response time.

[0038] Obtain, by comparison, the minimum time taken by any test taker to answer question b, i.e., the minimum response time.

[0039] Count the number of words in question b, i.e., the target word count.

[0040] Obtain, by comparison, the maximum word count among all questions in the questionnaire.

[0041] The larger the average time taken by all test takers to answer question b, the larger the minimum time taken to answer question b, and the larger the ratio of the word count of question b to the maximum word count among all questions, the longer question b takes to answer and the greater the thought it requires.

[0042] From this, the degree of consideration required for question b of the questionnaire is obtained; its formula also uses the maximum of the average response times over all questions in the questionnaire, i.e., the average maximum response time. This process assesses the consideration required by different questions, which serves as one of the reference conditions for judging the seriousness of test takers' responses and prevents differences between questions from distorting the judgment of low-quality data. It should be noted that, under normal circumstances, the parameters involved are all non-zero. If any one of them is zero, this indicates a system error: the calculation stops and a data anomaly is reported.
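A minimal Python sketch of step S2 follows. It combines the word count percentage and the duration percentage by multiplication, which is an assumption: the text only says the two percentages are combined. The minimum response time mentioned in this embodiment is left out because its role in the formula cannot be recovered from this text. The function name and the dictionary-based inputs are hypothetical.

```python
def consideration_required(word_counts, response_times):
    """Hypothetical degree of consideration required, per question.

    word_counts:    {question_id: number of words in the question}
    response_times: {question_id: [response time of each user, in seconds]}
    """
    average_time = {q: sum(t) / len(t) for q, t in response_times.items()}
    max_words = max(word_counts.values())
    max_average_time = max(average_time.values())
    # A zero here is treated as a system error, as the text describes.
    if max_words == 0 or max_average_time == 0:
        raise ValueError("data anomaly: zero word count or response time")
    # ASSUMPTION: the two percentages are combined as a product.
    return {
        q: (word_counts[q] / max_words) * (average_time[q] / max_average_time)
        for q in word_counts
    }


scores = consideration_required(
    {"q1": 10, "q2": 40},
    {"q1": [2.0, 4.0], "q2": [10.0, 14.0]},
)
print(scores)  # {'q1': 0.0625, 'q2': 1.0}
```

With a product, a question scores highly only when it is both long and slow to answer on average; a weighted sum would be another reasonable reading of "combining".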

[0043] Step S3: Combine the target user's regularity performance index, the degree of consideration required for the target question, and the target user's response time data for each question to determine the degree of carelessness of the target user in answering.
Specifically, referring to Figure 4, step S3 includes:
Step S31: Determine the target user's average response time over all questions and target response time for the target question, and determine the response time difference between the target response time and the average response time;
Step S32: Combine the target user's regularity performance index, the degree of consideration required for the target question, and the response time difference to determine the degree of carelessness of the target user in answering.

[0044] In this embodiment, when test takers answer attentively, the time they spend should differ between questions of different difficulty. Therefore, the seriousness of a test taker's answers is judged by whether the response time for each question corresponds correctly to the degree of consideration the question requires, together with the regularity of the test taker's answers.

[0045] Calculate the average time taken by tester a to answer all questions in the questionnaire (the average response time). Then calculate the cumulative difference in response time of tester a across the different questions, using the time tester a takes to answer question b (the target response time) and the number of questions in the questionnaire. It should be noted that ... is greater than 0.

[0046] When tester a's response time is shorter on questions whose degree of consideration required is larger, the cumulative difference in response time is smaller. The smaller the cumulative difference, the smaller the average time tester a takes to answer all questions, and the larger the regularity performance index of tester a's questionnaire answers, the shorter the time tester a spends on more difficult questions, with no significant difference from the easier questions, and the more regular the answers appear; hence the greater the likelihood that tester a did not answer carefully.

[0047] This gives the degree of carelessness of tester a in answering the questions. The constant 0.01 in the formula prevents the denominator from being 0. The result is normalized using max-min normalization to obtain the degree of carelessness, whose range is [0,1].

[0048] In one embodiment, after step S3, the method further includes: screening out, as abnormal data, all original questionnaire data of any user whose degree of carelessness exceeds a preset carelessness threshold (the threshold is preset and can be adjusted). That is, if tester a is too careless in answering the questionnaire, all of tester a's questionnaire data is evaluated as low quality, treated as abnormal data, filtered out, and not used as a reference for subsequent analysis.
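Step S3 and the threshold screening above can be sketched in Python as follows. The formulas are not reproduced in this text, so the forms used here are assumptions consistent with the prose: the cumulative difference weights each absolute deviation from the tester's average response time by the degree of consideration the question requires, the raw score divides the regularity performance index by the product of cumulative difference and average response time plus 0.01, and max-min normalization is applied across testers. All names and the threshold value are hypothetical.

```python
def carelessness_scores(regularity, times, required, eps=0.01):
    """Hypothetical degree of carelessness per tester, scaled to [0, 1].

    regularity: {tester: regularity performance index}
    times:      {tester: {question_id: response time in seconds}}
    required:   {question_id: degree of consideration required}
    """
    raw = {}
    for tester, per_question in times.items():
        mean_time = sum(per_question.values()) / len(per_question)
        # ASSUMPTION: deviations weighted by the consideration required.
        cumulative = sum(
            required[q] * abs(t - mean_time) for q, t in per_question.items()
        )
        # ASSUMPTION: ratio form; eps keeps the denominator non-zero.
        raw[tester] = regularity[tester] / (cumulative * mean_time + eps)
    low, high = min(raw.values()), max(raw.values())
    if high == low:
        return {tester: 0.0 for tester in raw}
    return {tester: (v - low) / (high - low) for tester, v in raw.items()}


def careless_testers(scores, threshold=0.8):
    """Testers whose whole questionnaire would be screened out."""
    return {tester for tester, score in scores.items() if score > threshold}


scores = carelessness_scores(
    regularity={"careful": 1.0, "rushed": 50.0},
    times={"careful": {"q1": 5.0, "q2": 25.0}, "rushed": {"q1": 2.0, "q2": 2.0}},
    required={"q1": 0.2, "q2": 1.0},
)
print(careless_testers(scores))  # {'rushed'}
```

In this toy input the rushed tester spends the same short time on every question and has a high regularity index, so the assumed score places that tester at the top of the range.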

[0049] The above process is used to assess the overall level of carelessness among different test takers when answering questions, serving as a reference for judging the quality of individual responses in the test takers' questionnaires.

[0050] Step S4: Determine the differences in the target users' answers to similar questions, and combine the degree of carelessness with the differences in answers to identify abnormal data and filter it to obtain filtered valid questionnaire data.

[0051] Specifically, in step S4, determining the differences in the target user's answers to similar questions includes: determining the graded number or the textual semantics corresponding to each answer option in questions of the same type, and determining the difference in number or in semantics between answer options; the difference in number or the difference in semantics is used as the difference in the target user's answers to similar questions.

[0052] In this embodiment, although users may be carefully considering their answers, they may still make incorrect selections for some questions, which do not necessarily reflect their true thoughts. To filter out such cases, the differences in test takers' answers to similar questions can be used to identify incorrect selections, making the screening of low-quality data more detailed and comprehensive.

[0053] Multiple questions with different wording but similar meanings are preset in the questionnaire and are regarded as questions of the same type. One question type can include multiple questions in the questionnaire. For any question type v, the answers selected by test taker a for the different questions in that group are obtained.

[0054] Calculate the differences between the answer options selected for the different questions within question type v. Answer options fall into different categories (degree, size / frequency, text), where the category refers to the substantive content of the answer rather than the option code, and the difference between answer options is calculated differently for each category. For degree categories (e.g., severe, moderate, not severe) and size / frequency categories (numerical or range-based), different graded numbers are assigned to the answer options, and the difference between the numbers is used as the magnitude of the difference between the answer options. For text categories (whose core is a textual description), the degree of difference in textual semantics (obtained using natural language processing techniques, such as calculating cosine similarity and the Jaccard similarity coefficient) is used as the magnitude of the difference between the answer options.

[0055] It should be noted that the answer options in this embodiment refer to the substantive content of the answer. For answer options that can be hierarchically numbered, any two answer options with a number difference of 0 are considered to be the same answer option. For answer options that can be compared in terms of textual semantics, any two answer options with a semantic similarity greater than 0.8 (which can be adjusted) are considered to be the same answer option.
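The two kinds of option difference described above can be sketched as follows. This is an illustration only: the graded numbers in GRADE, the whitespace tokenization, and the choice of "1 minus Jaccard similarity" as the semantic difference are assumptions made here, and a production system would more likely use an embedding-based cosine similarity.

```python
from typing import Dict

# Hypothetical graded numbering for a degree-type question.
GRADE: Dict[str, int] = {"not severe": 1, "moderate": 2, "severe": 3}


def graded_difference(a: str, b: str, grades: Dict[str, int] = GRADE) -> int:
    """Difference between two gradable options: gap between their graded numbers."""
    return abs(grades[a] - grades[b])


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard coefficient over lower-cased whitespace tokens (a simplification)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def text_difference(a: str, b: str) -> float:
    """Semantic difference taken as 1 - similarity (an assumed convention)."""
    return 1.0 - jaccard_similarity(a, b)


def is_same_text_option(a: str, b: str, threshold: float = 0.8) -> bool:
    """Two text options count as the same option if similarity exceeds the threshold."""
    return jaccard_similarity(a, b) > threshold


degree_gap = graded_difference("severe", "not severe")
same_text = is_same_text_option("I exercise daily", "i exercise daily")
```

Gradable options with a number difference of 0 are treated as the same option, which mirrors the 0.8 similarity rule used for text options.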

[0056] Specifically, referring to Figure 5, in step S4, combining the degree of carelessness and the differences in answers to identify abnormal response data and filter it to obtain the filtered valid questionnaire data includes:
Step S41: Among all answer options in questions of the same type, determine the first answer option, which is the answer option with the largest proportion, and determine the target proportion of the target answer option among all answer options;
Step S42: Determine the first answer difference between the target answer option and the first answer option, and determine the maximum answer difference between any two answer options;
Step S43: Determine the second answer difference between the first answer difference and the maximum answer difference, and combine the degree of carelessness, the target proportion, and the second answer difference to obtain the screening necessity of the target data corresponding to the target answer option;
Step S44: If the screening necessity exceeds the preset screening threshold, treat the target data as abnormal response data and screen it out to obtain the filtered valid questionnaire data.

[0057] In this embodiment, the proportions of the answer options (which here still refer to the substantive content of the answer) among all answer options of question type v are compared to find the largest proportion, and the corresponding answer option is denoted as the first answer option.
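The quantities named in steps S41 to S43 can be computed as in the sketch below for gradable options. It is a hedged illustration: the function name, the tie-breaking rule, taking the maximum answer difference over the options actually selected, and reading the second answer difference as "maximum answer difference minus first answer difference" are all assumptions made here.

```python
from collections import Counter
from typing import Dict, List, Tuple


def answer_group_statistics(
    selected: List[int],
) -> Tuple[int, Dict[int, Dict[str, float]]]:
    """Summarize one test taker's answers within one question type.

    `selected` holds the graded numbers of the options chosen for each
    question of that type.
    """
    counts = Counter(selected)
    total = len(selected)
    # S41: the first answer option has the largest proportion
    # (assumption: ties are broken in favor of the smaller graded number).
    first_option = max(counts, key=lambda option: (counts[option], -option))
    # S42: maximum difference between any two options that were selected.
    max_difference = max(counts) - min(counts)
    stats: Dict[int, Dict[str, float]] = {}
    for option, count in counts.items():
        first_difference = abs(option - first_option)
        stats[option] = {
            "proportion": count / total,
            "first_difference": first_difference,
            # S43 (assumed reading): gap between the maximum and first differences.
            "second_difference": max_difference - first_difference,
        }
    return first_option, stats


# Four similar questions; the test taker chose grade 3 three times and grade 1 once.
first_option, stats = answer_group_statistics([3, 3, 3, 1])
```

In this sample, grade 3 is the first answer option, and grade 1 has a proportion of 0.25, a first answer difference of 2, and a second answer difference of 0.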

[0058] For question type v, determine the target proportion of answer option g (the target answer option, which may be any answer option) among all answer options of question type v. The difference between the first answer option and answer option g is the first answer difference; the maximum of the differences between any two answer options is the maximum answer difference; and the difference between the first answer difference and the maximum answer difference is the second answer difference. The degree of carelessness reflects how seriously test taker a answers the questions. The greater the difference between answer option g and the test taker's actual choice, the more likely it is that answer option g was selected by mistake, and the lower the quality of the corresponding answer.

[0059] From this, the screening necessity (degree of low quality) of the data corresponding to answer option g is determined, where the target data includes both the corresponding question and the answer. The value is normalized using the max-min normalization method to obtain the screening necessity, whose range is [0,1]. When the screening necessity exceeds the preset screening threshold (which can be adjusted), the question data corresponding to question type v and answer option g in the questionnaire answered by test taker a is treated as abnormal response data and filtered out, so that low-quality data is filtered more completely and the filtered valid questionnaire data is obtained. Here, exp denotes the exponential function with the natural constant as its base; it is used to implement a negative-correlation mapping so that the result is consistent with the actual analysis logic.
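The answer-level screening of step S44 and the role of exp can be illustrated as follows. This sketch does not reproduce the screening-necessity formula, which is not recoverable from the text above; it assumes that normalized screening-necessity values are already available. The names `negative_map` and `drop_flagged_answers`, the data layout, and the threshold of 0.7 are hypothetical.

```python
import math
from typing import Dict, Tuple

Answer = Tuple[str, int]  # (question type, graded number of the selected option)


def negative_map(x: float) -> float:
    """exp(-x): larger inputs give smaller outputs; the value is 1 at x = 0."""
    return math.exp(-x)


def drop_flagged_answers(
    answers: Dict[str, Answer],
    necessity: Dict[Answer, float],
    threshold: float = 0.7,
) -> Dict[str, Answer]:
    """Remove only the question data whose screening necessity exceeds the threshold.

    Unlike user-level screening, the rest of the questionnaire is kept.
    Answers without a necessity value are treated as 0.0 (an assumption).
    """
    return {
        question: answer
        for question, answer in answers.items()
        if necessity.get(answer, 0.0) <= threshold
    }


# Hypothetical answers of test taker a and pre-computed necessity values in [0, 1].
answers_a = {"q1": ("v", 3), "q5": ("v", 3), "q9": ("v", 1), "q2": ("w", 2)}
necessity_a = {("v", 3): 0.1, ("v", 1): 0.9}
valid_answers = drop_flagged_answers(answers_a, necessity_a)
```

Only the data for question q9 is removed here, since its (question type, option) pair is the only one above the threshold.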

[0060] The above process is used to screen low-quality data from the responses of different test takers.

[0061] After matching the screening results from different test takers with the corresponding complete questionnaire data, the data is compressed and stored in the database. The database data is then accessed using a bus, allowing test taker information and low-quality data to be visualized on the computer screen in tabular form, as shown in Table 1 below.

[0062] This invention identifies user response behavior characteristics through multi-dimensional analysis of questionnaire data, dynamically adapting to varying question difficulty and test-taker abilities, significantly improving the accuracy of identifying low-quality data. This method effectively overcomes the limitations of relying solely on a single time threshold, reducing misjudgments and biases, and laying a reliable foundation for subsequent high-quality data analysis. This invention is based on recognizing regularities in answer sequences; quantifying individual question engagement using a time-to-text ratio model; constructing a regression model of difficulty coefficient and time consumption to assess overall attentiveness; and analyzing the distance between answers of similar questions to detect logical contradictions in responses. Ultimately, it supports the output of multi-dimensional quality scores, providing quantitative evidence for data cleaning, assisting analysts in quickly screening high-reliability data, significantly improving the accuracy and reliability of data quality identification, while automated data screening greatly improves questionnaire processing efficiency.

[0063] Example 2: This invention also proposes a low-quality data filtering device for multi-source questionnaire data. The device can be a computer, server, or other data analysis and computing equipment, or a combination of multiple devices.

[0064] Figure 6 is a schematic diagram of the hardware operating environment of the low-quality data filtering device for multi-source questionnaire data involved in the embodiments of the present invention.

[0065] As shown in Figure 6, the low-quality data filtering device for multi-source questionnaire data may include: a processor 1001, such as a CPU; a network interface 1004; a user interface 1003; a memory 1005; and a communication bus 1002. The communication bus 1002 is used to establish communication between these components. The user interface 1003 may include a display or an input unit such as a control panel; the user interface 1003 may also include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface). The memory 1005 may be high-speed RAM or stable non-volatile memory, such as a disk drive. The memory 1005 may also optionally be a storage system independent of the aforementioned processor 1001. The memory 1005, as a computer storage medium, may include a low-quality data filtering program for multi-source questionnaire data (hereinafter referred to as the "low-quality data filtering program").

[0066] Those skilled in the art will understand that the hardware structure shown in Figure 6 does not constitute a limitation on the device, which may include more or fewer components than shown, combine certain components, or have a different component arrangement.

[0067] Continuing to refer to Figure 6, the memory 1005, which is a computer-readable storage medium, may include an operating system, a user interface module, a network communication module, and a low-quality data filtering program for multi-source questionnaire data.

[0068] In the device shown in Figure 6, the network communication module is mainly used to connect to the server and exchange data with the server, while the processor 1001 can call the low-quality data filtering program for multi-source questionnaire data stored in the memory 1005 and execute the steps in the above embodiments.

[0069] The hardware structure of the above low-quality data filtering device for multi-source questionnaire data is used to implement the various embodiments of the low-quality data filtering method for multi-source questionnaire data of the present invention.

[0070] In addition, this invention also provides a low-quality data filtering system for multi-source questionnaire data (hereinafter referred to as the "low-quality data filtering system"). Referring to Figure 7, the low-quality data filtering system for multi-source questionnaire data includes:
The response analysis module A10, which is used to acquire the original questionnaire data of the target user, including questions and answers; to analyze the regularity of the answer sequence to obtain the regularity performance index of the target user; to determine the necessity of consideration for the target question using the text length parameter of the target question and the response time data of each user; and to determine the degree of carelessness of the target user in answering the questions by combining the regularity performance index of the target user, the necessity of consideration for the target question, and the response time data of the target user for each question.
The data filtering module A20, which is used to determine the differences in the target user's answers to similar questions, and to identify and filter out abnormal response data by combining the degree of carelessness and the differences in answers, to obtain the filtered valid questionnaire data.

[0071] Furthermore, the response analysis module A10 is also used for: arranging the answer options in the original questionnaire data according to the question order to obtain the answer sequence, and identifying adjacent questions in the answer sequence that have the same target option code; and determining the number of interval questions between the adjacent questions, and analyzing the regularity of answering based on the number of interval questions between the adjacent questions in each group, to obtain the regularity performance index of the target user.

[0072] Furthermore, the response analysis module A10 is also used for: determining the average number of interval questions between the adjacent questions in all groups, and using the difference between the number of interval questions between the adjacent questions in each group and the average number of interval questions to determine the degree of difference in interval questions for the target option code; and counting the number of options whose degree of difference is less than a preset difference threshold, and combining the number of options and the degree of difference to obtain the regularity performance index of the target user.
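The interval analysis performed by module A10 can be sketched as below. This is one possible reading, not the patented formula: "adjacent questions" is interpreted as consecutive occurrences of the same option code, the degree of difference is taken as the mean absolute deviation of the interval counts from their average, and the threshold of 0.5 is arbitrary. The final combination into a regularity performance index is not shown because it is not specified in this passage.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def interval_counts(sequence: List[str]) -> Dict[str, List[int]]:
    """For each option code, count the questions between consecutive occurrences."""
    positions: Dict[str, List[int]] = defaultdict(list)
    for index, code in enumerate(sequence):
        positions[code].append(index)
    return {
        code: [later - earlier - 1 for earlier, later in zip(idx, idx[1:])]
        for code, idx in positions.items()
        if len(idx) > 1  # a code chosen once has no adjacent pair
    }


def interval_difference(gaps: List[int]) -> float:
    """Mean absolute deviation of the interval counts from their average (assumed)."""
    average = mean(gaps)
    return mean(abs(gap - average) for gap in gaps)


def regular_option_count(sequence: List[str], threshold: float = 0.5) -> int:
    """Number of option codes whose interval difference is below the threshold."""
    return sum(
        1
        for gaps in interval_counts(sequence).values()
        if interval_difference(gaps) < threshold
    )


patterned = list("ABABABAB")   # strictly alternating answers
irregular = list("ABCADBBA")   # no obvious pattern
patterned_count = regular_option_count(patterned)
irregular_count = regular_option_count(irregular)
```

A strictly alternating sequence yields constant intervals for both codes, so both are counted as regular, whereas the irregular sequence yields none.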

[0073] Furthermore, the response analysis module A10 is also used for: determining the target word count of the target question and the average response time of all users for the target question; and using the target word count and the average response time to determine the necessity of consideration of the target question for any user.

[0074] Furthermore, the response analysis module A10 is also used for: determining the word count percentage of the target word count relative to the maximum word count among all questions, and determining the time percentage of the average response time relative to the maximum average response time among all questions; and combining the word count percentage and the time percentage to calculate the necessity of consideration of the target question for any user.
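The two percentages described above can be computed as in the following sketch. How they are combined is not stated in this passage, so the equal-weight average used here is purely an assumption, as are the function name and the sample data.

```python
from typing import Dict


def consideration_necessity(
    word_counts: Dict[str, int], average_times: Dict[str, float]
) -> Dict[str, float]:
    """Per-question necessity of consideration from text length and average time.

    word_counts: word count of each question's text.
    average_times: average response time of all users for each question.
    """
    max_words = max(word_counts.values())
    max_time = max(average_times.values())
    result: Dict[str, float] = {}
    for question in word_counts:
        word_share = word_counts[question] / max_words
        time_share = average_times[question] / max_time
        # ASSUMPTION: equal-weight average; the source only says "combining".
        result[question] = 0.5 * (word_share + time_share)
    return result


# Hypothetical data: q2 is four times longer and takes four times as long as q1.
necessity = consideration_necessity(
    {"q1": 10, "q2": 40}, {"q1": 3.0, "q2": 12.0}
)
```

Both percentages lie in (0, 1], so under this assumed combination the result also stays in that range, with the longest and slowest question scoring 1.0.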

[0075] Furthermore, the response analysis module A10 is also used for: determining the average response time of the target user over all questions and the target response time for the target question, and determining the response time difference between the target response time and the average response time; and combining the regularity performance index of the target user, the necessity of consideration for the target question, and the response time difference to determine the degree of carelessness of the target user in answering the questions.

[0076] Furthermore, the response analysis module A10 is also used for: screening out, as abnormal data, all original questionnaire data corresponding to users whose degree of carelessness exceeds the preset carelessness threshold.

[0077] Furthermore, the data filtering module A20 is also used for: determining the hierarchical numbering or textual semantics corresponding to each answer option in questions of the same type, and determining the differences in numbering or in semantics between answer options; and using the differences in numbering or the differences in semantics as the differences in the target user's answers to similar questions.

[0078] Furthermore, the data filtering module A20 is also used for: among all answer options in questions of the same type, determining the first answer option, which is the answer option with the largest proportion, and determining the target proportion of the target answer option among all answer options; determining the first answer difference between the target answer option and the first answer option, and determining the maximum answer difference between any two answer options; determining the second answer difference between the first answer difference and the maximum answer difference, and combining the degree of carelessness, the target proportion, and the second answer difference to obtain the screening necessity of the target data corresponding to the target answer option; and if the screening necessity exceeds a preset screening threshold, treating the target data as abnormal response data and screening it out, to obtain the filtered valid questionnaire data.

[0079] Furthermore, the present invention also provides a computer-readable storage medium. The computer-readable storage medium stores a low-quality data filtering program for multi-source questionnaire data, wherein when executed by a processor, the low-quality data filtering program for multi-source questionnaire data implements the steps of the low-quality data filtering method for multi-source questionnaire data as described above.

[0080] For the method implemented when the low-quality data filtering program for multi-source questionnaire data is executed, reference may be made to the various embodiments of the low-quality data filtering method for multi-source questionnaire data of the present invention, which will not be repeated here.

[0081] It should be noted that the order of the above embodiments of the present invention is merely for descriptive purposes and does not represent the superiority or inferiority of the embodiments. The processes depicted in the accompanying drawings do not necessarily require a specific or sequential order to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

[0082] The various embodiments in this specification are described in a progressive manner. The same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on describing the differences from other embodiments.

[0083] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0084] The above description is only a preferred embodiment of the present invention and does not limit the scope of protection of the present invention. All equivalent structural / method transformations made under the inventive concept of the present invention using the contents of the present invention specification and drawings, or direct / indirect applications in other related technical fields, are included within the scope of protection of the present invention.

Claims

1. A method for filtering low-quality data from multi-source questionnaire data, characterized in that the method includes the following steps: obtaining raw questionnaire data of a target user, including questions and answers, and analyzing the regularity of the answer sequence to obtain the regularity performance index of the target user; determining the necessity of consideration for a target question using the text length parameter of the target question and the response time data of each user; determining the degree of carelessness of the target user in answering the questions by combining the regularity performance index of the target user, the necessity of consideration for the target question, and the response time data of the target user for each question; and determining the differences in the target user's answers to similar questions, and combining the degree of carelessness with the differences in answers to identify abnormal response data and filter it out, to obtain the filtered valid questionnaire data.

2. The method for filtering low-quality data from multi-source questionnaire data according to claim 1, characterized in that analyzing the regularity of the answer sequence to obtain the regularity performance index of the target user includes: arranging the answer options in the original questionnaire data according to the question order to obtain the answer sequence, and identifying adjacent questions in the answer sequence that have the same target option code; and determining the number of interval questions between the adjacent questions, and analyzing the regularity of answering based on the number of interval questions between the adjacent questions in each group, to obtain the regularity performance index of the target user.

3. The method for filtering low-quality data from multi-source questionnaire data according to claim 2, characterized in that analyzing the regularity of answering based on the number of interval questions between the adjacent questions in each group, to obtain the regularity performance index of the target user, includes: determining the average number of interval questions between the adjacent questions in all groups, and using the difference between the number of interval questions between the adjacent questions in each group and the average number of interval questions to determine the degree of difference in interval questions for the target option code; and counting the number of options whose degree of difference is less than a preset difference threshold, and combining the number of options and the degree of difference to obtain the regularity performance index of the target user.

4. The method for filtering low-quality data from multi-source questionnaire data according to claim 1, characterized in that determining the necessity of consideration for the target question using the text length parameter of the target question and the response time data of each user includes: determining the target word count of the target question and the average response time of all users for the target question; and using the target word count and the average response time to determine the necessity of consideration of the target question for any user.

5. The method for filtering low-quality data from multi-source questionnaire data according to claim 4, characterized in that using the target word count and the average response time to determine the necessity of consideration of the target question for any user includes: determining the word count percentage of the target word count relative to the maximum word count among all questions, and determining the time percentage of the average response time relative to the maximum average response time among all questions; and combining the word count percentage and the time percentage to calculate the necessity of consideration of the target question for any user.

6. The method for filtering low-quality data from multi-source questionnaire data according to claim 1, characterized in that determining the degree of carelessness of the target user in answering the questions by combining the regularity performance index of the target user, the necessity of consideration for the target question, and the response time data of the target user for each question includes: determining the average response time of the target user over all questions and the target response time for the target question, and determining the response time difference between the target response time and the average response time; and combining the regularity performance index of the target user, the necessity of consideration for the target question, and the response time difference to determine the degree of carelessness of the target user in answering the questions.

7. The method for filtering low-quality data from multi-source questionnaire data according to claim 1, characterized in that, after determining the degree of carelessness of the target user in answering the questions by combining the regularity performance index of the target user, the necessity of consideration for the target question, and the response time data of the target user for each question, the method further includes: screening out, as abnormal data, all original questionnaire data corresponding to users whose degree of carelessness exceeds a preset carelessness threshold.

8. The method for filtering low-quality data from multi-source questionnaire data according to claim 1, characterized in that determining the differences in the target user's answers to similar questions includes: determining the hierarchical numbering or textual semantics corresponding to each answer option in questions of the same type, and determining the differences in numbering or in semantics between answer options; and using the differences in numbering or the differences in semantics as the differences in the target user's answers to similar questions.

9. The method for filtering low-quality data from multi-source questionnaire data according to claim 1, characterized in that combining the degree of carelessness with the differences in answers to identify abnormal response data and filter it out, to obtain the filtered valid questionnaire data, includes: among all answer options in questions of the same type, determining the first answer option, which is the answer option with the largest proportion, and determining the target proportion of the target answer option among all answer options; determining the first answer difference between the target answer option and the first answer option, and determining the maximum answer difference between any two answer options; determining the second answer difference between the first answer difference and the maximum answer difference, and combining the degree of carelessness, the target proportion, and the second answer difference to obtain the screening necessity of the target data corresponding to the target answer option; and if the screening necessity exceeds a preset screening threshold, treating the target data as abnormal response data and screening it out, to obtain the filtered valid questionnaire data.

10. A low-quality data filtering system for multi-source questionnaire data, characterized in that the system is used to implement the low-quality data filtering method for multi-source questionnaire data as described in any one of claims 1 to 9, and the system includes: a response analysis module, which is used to acquire raw questionnaire data of a target user, including questions and answers, to analyze the regularity of the answer sequence to obtain the regularity performance index of the target user, to determine the necessity of consideration for the target question using the text length parameter of the target question and the response time data of each user, and to determine the degree of carelessness of the target user in answering the questions by combining the regularity performance index of the target user, the necessity of consideration for the target question, and the response time data of the target user for each question; and a data filtering module, which is used to determine the differences in the target user's answers to similar questions, and to identify and filter out abnormal response data by combining the degree of carelessness and the differences in answers, to obtain the filtered valid questionnaire data.