A method and system for constructing a question and answer knowledge base and a storage medium

By calculating the semantic repetition rate and constructing a representative question set for the question-and-answer knowledge base, and combining the weighted matching rate and business intent ratio, the extraction threshold and job base are dynamically adjusted. This solves the problems of semantic repetition and insufficient adaptability in existing question-and-answer databases, and achieves adaptive updating and accuracy of the knowledge base.

CN122309685A · Pending · Publication Date: 2026-06-30 · 国投人力资源服务有限公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: 国投人力资源服务有限公司
Filing Date: 2026-05-28
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing question-answering knowledge bases lack representative extraction of question texts with different expressions but the same intent for the same job position, and cannot evaluate the matching degree between newly extracted question-answer pairs and existing question-answer pairs in the knowledge base. This results in a large number of semantically repetitive question-answer pairs being stored in the knowledge base, and there is a lack of adaptive adjustment capability for knowledge base construction parameters.

Method used

By classifying the current batch of question-answer pairs, calculating the semantic repetition rate of any two question texts, grouping the two question texts with the highest semantic repetition rate into the same group, constructing a representative question set, and calculating the weighted matching rate based on the storage duration of in-place question-answer pairs, the extraction threshold and job base are dynamically adjusted to adapt to changes in business intent.

Benefits of technology

It enables the representative extraction of multiple question texts with different expressions but the same intent for the same position, avoiding the duplication of similar questions in the knowledge base, and dynamically adjusting the screening criteria for question-answer pairs to ensure the adaptability and accuracy of the knowledge base.



Abstract

This invention relates to the field of knowledge base construction technology, specifically to a question-and-answer knowledge base construction method, system, and storage medium. The invention categorizes question-and-answer pairs in the current batch to obtain question texts, their job tags, and business intents. It calculates the semantic repetition rate of any two question texts and groups the two question texts with the highest semantic repetition rate together. Through semantic repetition rate calculation and grouping, it achieves representative extraction of multiple question texts with different expressions but the same intent under the same job position, avoiding the duplicate storage of similar questions in the knowledge base. The invention constructs candidate question-and-answer pairs based on a representative question set, calculates a weighted matching rate using the storage duration of in-situ question-and-answer pairs in the knowledge base as a weight, and evaluates the degree of matching between candidate question-and-answer pairs and in-situ knowledge through the weighted matching rate.

Description

Technical Field

[0001] This invention relates to the field of knowledge base construction technology, specifically to a question-and-answer knowledge base construction method, system, and storage medium.

Background Technology

[0002] In Q&A knowledge base applications related to human resources and organizational management, common information such as reimbursement processes, job responsibilities, and salary calculation standards is typically retrieved from the knowledge base. While existing Q&A knowledge bases can store and retrieve common information, the question-and-answer pairs in the knowledge base become outdated over time. Different batches of user question-and-answer data may have different business intentions, resulting in the knowledge base retaining a large amount of outdated information that could be misleading, or missing key new questions and answers.

[0003] In the prior art, for example, Chinese invention patent with publication number CN119089905A discloses a method and apparatus for automatically extracting domain-specific question-answer pairs. It utilizes pre-constructed domain question extraction models and domain answer extraction models to process target text, outputting domain questions and domain answers to form domain question-answer pairs.

[0004] However, the existing technologies mentioned above have the following shortcomings: 1. They lack representative extraction of multiple question texts with different expressions but the same intent for the same position, resulting in a large number of semantically repetitive question-answer pairs stored in the knowledge base. 2. They cannot evaluate the matching degree between newly extracted question-answer pairs and existing question-answer pairs in the knowledge base, and lack the ability to adaptively adjust the parameters for building the knowledge base. When business intent changes, they cannot dynamically adjust the selection criteria for question-answer pairs.

Summary of the Invention

[0005] The purpose of this invention is to overcome the shortcomings of the prior art and provide a method, system and storage medium for constructing a question-and-answer knowledge base.

[0006] The technical solution adopted by the present invention to solve its technical problem is: a question-and-answer knowledge base construction method, including the following steps: classifying the current batch of question-and-answer pairs to obtain the question text and its job tags and business intent.

[0007] Calculate the semantic repetition rate of any two question texts, group the two question texts with the highest semantic repetition rate into the same group, and form a representative question set based on the most frequent question in each group, along with its job title and business intent.

[0008] Candidate question-answer pairs are constructed based on a representative question set. The weighted matching rate between candidate question-answer pairs and in-situ question-answer pairs is calculated using the storage duration of in-situ question-answer pairs in the knowledge base as the weight.

[0009] The intent ratio is obtained by dividing the number of business intent types of the unmatched candidate question-answer pairs by the total number of business intent types in the representative question set.

[0010] Based on the representative question set, the average length of the question text is used as the extraction threshold, and the ratio of the total number of questions to the number of job tag types is used as the job base.

[0011] Based on the sliding window differential change rate of the weighted matching rate and the comparison result of the intent ratios of adjacent periods, the adjustment direction and adjustment amount of the extraction threshold and the job base are determined.

[0012] Based on the adjusted extraction threshold and job base, new question-answer pairs are constructed from the next batch of question-answer pairs and stored in the knowledge base.

[0013] Compared with the prior art, the present invention has the following beneficial effects: 1. The present invention obtains question texts and their job tags and business intentions by classifying the current batch of question-answer pairs, calculates the semantic repetition rate of any two question texts, and groups the two question texts with the highest semantic repetition rate into the same group; through semantic repetition rate calculation and grouping and filtering, it realizes the representative extraction of multiple question texts with different expressions but the same intention under the same job position, avoiding the repeated storage of similar questions in the knowledge base.

[0014] 2. This invention constructs candidate question-answer pairs based on a representative question set, calculates a weighted matching rate using the storage duration of in-situ question-answer pairs in the knowledge base as the weight, and evaluates the degree of matching between candidate question-answer pairs and in-situ knowledge through the weighted matching rate.

[0015] 3. This invention obtains the intent ratio based on the number of business intent types of unmatched candidate question-answer pairs and the total number of business intent types in the representative question set. Based on the sliding window differential change rate of the weighted matching rate and the comparison result of the intent ratios of adjacent periods, the extraction threshold and job base are dynamically adjusted. According to the adjusted extraction threshold and job base, new question-answer pairs are constructed from the next batch of question-answer pairs and stored in the knowledge base, so that the knowledge base construction parameters adapt to changes in business intent.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a schematic diagram of the construction method of the present invention.

[0018] Figure 2 is a schematic diagram of the system module connections of the present invention.

[0019] Figure 3 is a schematic diagram of calculating the sliding window differential change rate of the weighted matching rate according to the present invention.

[0020] Figure 4 is a flowchart of obtaining the comparison result of the intent ratios of adjacent periods according to the present invention.

[0021] Figure 5 is a flowchart of determining the adjustment direction and adjustment amount of the extraction threshold and the job base according to the present invention.

Detailed Implementation

[0022] Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the invention. Furthermore, it should be understood that, for ease of description, the dimensions of the various parts shown in the drawings are not drawn to actual scale.

[0023] The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the invention or its application or use. Techniques, methods, and apparatus known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and apparatus should be considered part of the specification.

[0024] In all examples shown and discussed herein, any specific values should be interpreted as merely exemplary and not as limitations. Therefore, other examples of exemplary embodiments may have different values.

[0025] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0026] The following description, in conjunction with the accompanying drawings, details a specific scheme for constructing a question-and-answer knowledge base provided by this invention.

[0027] Please see Figure 1, which shows a flowchart of the question-and-answer knowledge base construction method provided by the present invention. The method specifically includes the following steps. Step S1: Classify the current batch of question-and-answer pairs to obtain the question text and its job tags and business intent.

[0028] In practice, a batch of unprocessed question-and-answer pairs that arrived in chronological order are retrieved from data sources such as user inquiry logs and FAQ documents, and these are designated as the current batch of question-and-answer pairs. For the current batch of question-and-answer pairs, statements that meet any of the following conditions are identified as question text: the statement begins with an interrogative word; the statement contains an interrogative sentence structure; or the statement ends with a question mark.

[0029] Interrogative words are specific words used to form interrogative sentences, such as how, how to, why, which, etc. If a sentence begins with an interrogative word, such as "how to process reimbursement," it is identified as question text.

[0030] Interrogative sentence structures refer to sentences containing grammatical structures used to express questions or inquiries, such as "Is there...?", "Is there...?", "Is...?", etc. If a sentence contains an interrogative sentence structure, such as "Does anyone know the reimbursement process?", it is identified as question text.
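The three identification conditions above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the word and structure lists are hypothetical English stand-ins for the examples in the text, and the function name `is_question_text` is an assumption.

```python
import re

# Hypothetical English stand-ins for the interrogative words and structures
# mentioned in the text; a real deployment would supply its own lists.
INTERROGATIVE_START = re.compile(r"(?:how to|how|why|which)\b")
INTERROGATIVE_STRUCTURES = re.compile(r"\b(?:is there|does anyone know)\b")


def is_question_text(statement: str) -> bool:
    """Return True if the statement meets any of the three conditions."""
    s = statement.strip().lower()
    if not s:
        return False
    if INTERROGATIVE_START.match(s):        # begins with an interrogative word
        return True
    if INTERROGATIVE_STRUCTURES.search(s):  # contains an interrogative structure
        return True
    return s.endswith(("?", "？"))          # ends with a question mark
```

The word-boundary check keeps a statement such as "However, it works." from being treated as starting with "how".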

[0031] After identifying the question text, each character is scanned sequentially from its starting position. When punctuation marks or spaces are encountered, the current continuous character sequence is treated as a word segmentation result. If the continuous character sequence contains multiple Chinese characters, it is further segmented according to common combinations of Chinese characters in the general vocabulary. The general vocabulary includes two-character words, three-character words, and common phrases, such as query, modify, print, export, process reimbursement, change password, and payslip.

[0032] The splitting rules are as follows: starting from the beginning of the continuous character sequence, take the next two consecutive Chinese characters as a candidate word each time; if the candidate word is an independent word in the general vocabulary, take it as a word segmentation result and move the scan starting position two characters to the right; if it is not, take a single Chinese character as a word segmentation result and move the scan starting position one character to the right; repeat the above process until the end of the question text is reached.
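A minimal sketch of this two-character splitting rule is given below. The tiny `GENERAL_VOCABULARY` set and the handling of non-Chinese stretches (kept whole) are assumptions made for illustration; the text does not specify either.

```python
import re

# Hypothetical miniature general vocabulary (query, modify, print, export,
# password, reimbursement); a real vocabulary would be far larger.
GENERAL_VOCABULARY = {"查询", "修改", "打印", "导出", "密码", "报销"}


def is_chinese(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"


def segment(question_text: str, vocabulary=GENERAL_VOCABULARY) -> list[str]:
    words = []
    # Punctuation and spaces delimit continuous character sequences.
    for run in re.split(r"[\W_]+", question_text):
        i = 0
        while i < len(run):
            if not is_chinese(run[i]):
                # Assumption: keep a non-Chinese stretch (e.g. "PDF") as one word.
                j = i
                while j < len(run) and not is_chinese(run[j]):
                    j += 1
                words.append(run[i:j])
                i = j
            elif run[i:i + 2] in vocabulary:
                words.append(run[i:i + 2])  # two-character word found, move two right
                i += 2
            else:
                words.append(run[i])        # single character, move one right
                i += 1
    return words
```

For the input "如何修改密码？" (how to change the password), the candidates "如何" and "何修" are not in the sample vocabulary, so those characters are emitted singly before "修改" and "密码" are matched.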

[0033] Next, all word segmentation results are arranged in the order of their appearance in the question text to obtain a word sequence. From the word sequence, words that meet one of the following types are selected: verbs indicating the action target and their direct objects, nouns indicating the object of the operation, and noun phrases indicating service requirements.

[0034] Specifically, the action goal refers to the action the questioner wishes to perform, such as modifying, querying, or deleting. The operation object refers to the entity or content affected by the action, such as passwords, salaries, or contracts. The service requirement refers to the type of service the questioner wishes to obtain, such as printing, exporting, or sending.

[0035] Based on this, words in the extracted question text that represent action goals, operation objects, or service requirements are taken as business intent.

[0036] Simultaneously, the metadata fields of the current batch of question-and-answer pairs are parsed to read the questioner's user ID, logged-in role, and department. These metadata fields are written by the upstream system when the question-and-answer pairs are generated. Based on the user ID, the user's job title, role name, or scope of function is retrieved from the user information database. If no result is found based on the user ID, the logged-in role is used as the job title tag. If the logged-in role is empty, the department is used as the job title tag.

[0037] If the department is also empty, then scan every word in the question text. If the word ends with a job title such as manager, specialist, supervisor, director, etc., then the word is used as a job title tag; otherwise, further determine whether the word is a general corporate job title such as finance, human resources, administration, technology, sales, etc. If so, then the word is used as a job title tag; otherwise, the word is not used as a job title tag.

[0038] Specifically, job title refers to a user's job title, such as Finance Manager or Human Resources Specialist. Role name refers to the user's assigned permissions and roles within a specific system, such as Approver or Administrator. Scope of function refers to the user's department or area of responsibility, such as Finance Department or Technology Department.

[0039] Based on this, words describing the questioner's job title, role name, or scope of function in the extracted question-and-answer pairs will be used as job tags.
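The fallback order for job tags (user information database, then logged-in role, then department, then words in the question text) could be sketched as below. The suffix and function lists come from the examples in the text; `user_db` as a plain dictionary, the function name, and returning the first matching word are assumptions.

```python
# Lists taken from the examples in the text; real lists would be configurable.
TITLE_SUFFIXES = ("manager", "specialist", "supervisor", "director")
GENERAL_FUNCTIONS = {"finance", "human resources", "administration", "technology", "sales"}


def resolve_job_tag(user_id, login_role, department, question_words, user_db):
    """Return a job tag using the fallback order described above, or None."""
    looked_up = user_db.get(user_id)  # user_db: hypothetical dict of user ID -> job info
    if looked_up:
        return looked_up
    if login_role:
        return login_role
    if department:
        return department
    # Last resort: scan the words of the question text itself.
    for word in question_words:
        w = word.lower()
        if w.endswith(TITLE_SUFFIXES) or w in GENERAL_FUNCTIONS:
            return word  # assumption: the first matching word is used
    return None
```

A result of `None` corresponds to a question-and-answer pair that still lacks a job tag after all four sources have been tried.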

[0040] Subsequently, using the unique identifier in the metadata field of the question-and-answer pair as an index, the job tags and business intents extracted from the same question-and-answer pair are stored as the job tags and business intents corresponding to the question text in this question-and-answer pair.

[0041] If a question-and-answer pair lacks a job title tag or business intent, then starting from the question-and-answer pair that lacks the job title tag or business intent, each question-and-answer pair is checked backwards in the order they arrived, until a question-and-answer pair containing both a job title tag and business intent is found, which is then designated as an adjacent question-and-answer pair. The job title tag or business intent from the adjacent question-and-answer pair is copied to the question-and-answer pair that lacks the job title tag or business intent.

[0042] If no adjacent question-and-answer pair is found in that first search, the question-and-answer pairs are checked one by one in the opposite direction until an adjacent pair is found. If no adjacent question-and-answer pair is found in either direction, the question-and-answer pair that lacks a job title tag or business intent is marked as invalid and will not participate in subsequent steps.
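The two-direction fill could look like the sketch below. It assumes the first search runs toward earlier-arrived pairs and the second toward later ones, which the text does not state unambiguously; the dictionary keys and the function name are also hypothetical.

```python
def fill_missing(pairs):
    """pairs: list of dicts with 'job_tag' and 'intent', in arrival order.

    Returns a new list in which incomplete pairs borrow the missing field from
    the nearest complete pair; pairs with no complete neighbour are dropped.
    """
    complete = [bool(p.get("job_tag")) and bool(p.get("intent")) for p in pairs]
    result = []
    for i, pair in enumerate(pairs):
        if complete[i]:
            result.append(dict(pair))
            continue
        # Assumption: look at earlier pairs first, then at later pairs.
        search_order = list(range(i - 1, -1, -1)) + list(range(i + 1, len(pairs)))
        donor = next((pairs[j] for j in search_order if complete[j]), None)
        if donor is None:
            continue  # marked invalid: excluded from subsequent steps
        filled = dict(pair)
        filled["job_tag"] = pair.get("job_tag") or donor["job_tag"]
        filled["intent"] = pair.get("intent") or donor["intent"]
        result.append(filled)
    return result
```

Only the missing field is copied, so a pair that already has a business intent keeps it and borrows just the job tag.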

[0043] Furthermore, multiple question texts under the same job title and corresponding to the same business intent are grouped into the same set. For example, all question texts with the job title "Finance" and the business intent "Expense Reimbursement" are grouped into one set; all question texts with the job title "Human Resources" and the business intent "Salary Inquiry" are grouped into another set.

[0044] For each set, a length filtering operation is performed. Specifically, the text content of each question text in the set is read and its number of characters is counted; all question texts in the set are sorted in ascending order of character count; the question texts with the most and the fewest characters are then deleted and the remaining question texts are kept, thereby eliminating the impact of extremely long or extremely short question texts on subsequent steps.
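As a rough sketch, the length filter amounts to sorting by character count and dropping both ends. The function name is an assumption, and how ties at the extremes are resolved is not specified in the text; here only one text is removed from each end.

```python
def length_filter(question_texts):
    """Sort by character count, then drop the shortest and the longest text."""
    ordered = sorted(question_texts, key=len)
    return ordered[1:-1]
```

Under this reading, a set holding two or fewer question texts ends up empty after filtering.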

[0045] If no question text remains in any set, skip the current batch, do not execute subsequent steps, and wait directly for the next batch of question-and-answer pairs.

[0046] Step S2: Calculate the semantic repetition rate of any two question texts, and group the two question texts with the highest semantic repetition rate into the same group. Use the most frequent question in each group, along with its job title and business intent, to form a representative question set.

[0047] First, based on all the question texts retained in step S1, if any two question texts have different job tags and business intentions, then they are semantically unrelated, and the semantic repetition rate is zero. If any two question texts have the same job tags and business intentions, then they are semantically equivalent, and the semantic repetition rate is one.

[0048] If any two question texts have the same job title tags but different business intentions, or different job title tags but the same business intentions, then it is necessary to further determine the semantic repetition rate based on the number of content words.

[0049] The further determination process is as follows: each question text is segmented and tagged with parts of speech, filtering out function words and retaining words belonging to nouns, verbs, adjectives, numerals, and quantifiers to form a set of content words; the intersection of the content word sets of the two question texts is taken, and the number of content words in the intersection is counted as the content word count; the union of the content word sets of the two question texts is taken, and the number of content words in the union is counted as the total number of unique content words; the ratio of the content word count to the total number of unique content words is calculated, and the ratio is used as the semantic repetition rate.

[0050] For example, if the set of content words in question text A is (process, reimbursement, procedure), and the set of content words in question text B is (reimbursement, procedure, need, materials), then the intersection of the two is (reimbursement, procedure), and the union is (process, reimbursement, procedure, need, materials). Therefore, the content word count is 2, the total number of unique content words is 5, and the semantic repetition rate is 2/5 = 0.4.
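The rule for the semantic repetition rate can be sketched as below. Content words are assumed to have been extracted already (segmentation and part-of-speech filtering are out of scope here), and the dictionary layout and function name are hypothetical.

```python
def semantic_repetition_rate(a, b):
    """a, b: dicts with 'job_tag', 'intent' and 'content_words' (a set of str)."""
    same_tag = a["job_tag"] == b["job_tag"]
    same_intent = a["intent"] == b["intent"]
    if same_tag and same_intent:
        return 1.0  # semantically equivalent
    if not same_tag and not same_intent:
        return 0.0  # semantically unrelated
    # Exactly one of the two labels matches: compare content words.
    union = a["content_words"] | b["content_words"]
    if not union:
        return 0.0  # assumption: no content words means no measurable overlap
    return len(a["content_words"] & b["content_words"]) / len(union)
```

The content-word branch is the ratio of shared content words to all distinct content words, which is the Jaccard similarity of the two sets.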

[0051] After obtaining the semantic repetition rate, if a unique pair of texts with the largest semantic repetition rate that is greater than zero is found, then this pair of texts is merged into one group. If multiple pairs of texts with the same largest semantic repetition rate that is greater than zero are found, then these multiple pairs of texts are merged into one group.

[0052] If the semantic repetition rate of every pair of question texts is zero, the two question texts that appear earliest in the current batch are merged into the same group.

[0053] After merging into groups, continue merging the two question texts with the highest semantic repetition rate from the remaining question texts until all question texts are assigned to a group; if only one question text remains after merging, then that question text is grouped separately.
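One possible reading of this grouping loop is sketched below. It assumes the question texts are supplied in arrival order and that a pairwise rate function is passed in as `rate`; both names are hypothetical, and exact floating-point equality is used to detect ties, which is adequate only for an illustration.

```python
from itertools import combinations


def group_questions(texts, rate):
    """Greedily merge the highest-rate pair(s) until every text is grouped."""
    remaining = list(range(len(texts)))  # indices, in arrival order
    groups = []
    while len(remaining) >= 2:
        rates = {p: rate(texts[p[0]], texts[p[1]]) for p in combinations(remaining, 2)}
        best = max(rates.values())
        if best > 0:
            # All pairs tied at the maximum are merged into one group.
            members = sorted({i for p, r in rates.items() if r == best for i in p})
        else:
            # Every rate is zero: merge the two earliest remaining texts.
            members = remaining[:2]
        groups.append([texts[i] for i in members])
        remaining = [i for i in remaining if i not in members]
    if remaining:
        groups.append([texts[remaining[0]]])  # a single leftover forms its own group
    return groups
```

Recomputing all pairwise rates on every pass is quadratic per pass; that is acceptable for a sketch but a real system would likely cache the rates.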

[0054] Then, the frequency of each question text in each group is counted, and the question text with the highest frequency is taken as the representative question text. If there are multiple question texts with the highest frequency in each group, the arithmetic mean of the number of characters in all question texts in the group is calculated first, and then the difference between the number of characters in each question text with the highest frequency and the arithmetic mean is calculated. The question text corresponding to the smallest absolute value of the difference is taken as the representative question text.

[0055] When multiple question texts have the same smallest absolute difference, select the question text with the smallest number of characters as the representative question text. Finally, combine the representative question texts of each group with their job titles and business intentions to form the representative question set for the current batch.
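The selection of a representative question text, including both tie-breaks, might be sketched like this. A group is assumed to be a list of question strings in which repeated strings count toward frequency; the function name is hypothetical.

```python
from collections import Counter


def pick_representative(group):
    """group: non-empty list of question texts; duplicates count as frequency."""
    counts = Counter(group)
    top = max(counts.values())
    tied = [text for text in counts if counts[text] == top]
    if len(tied) == 1:
        return tied[0]
    # Tie-break 1: closest to the mean character count of the whole group.
    # Tie-break 2: the fewest characters.
    mean_len = sum(len(text) for text in group) / len(group)
    return min(tied, key=lambda text: (abs(len(text) - mean_len), len(text)))
```

Sorting on the tuple `(distance to mean, length)` applies the two tie-breaks in the order the text gives them.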

[0056] Step S3: Construct candidate question-answer pairs based on the representative question set, and calculate the weighted matching rate between the candidate question-answer pairs and the in-situ question-answer pairs using the storage duration of the in-situ question-answer pairs in the knowledge base as the weight.

[0057] In practice, the text content of each representative question text in the current batch of representative questions is used as an index to search for the corresponding answer statement in the question-answer pairs of the current batch that have not been processed in step S1. For example, if the question text is "How to change my personal password?", the corresponding answer text is "Please log in to the system and click the password change option in your personal settings".

[0058] Next, the representative question text and answer text are combined into candidate question-answer pairs. In-situ question-answer pairs with the same job tag or business intent as each candidate question-answer pair are selected from the knowledge base; if no such pairs are found, the subsequent weighted matching rate calculation is skipped, the weighted matching rate of each candidate question-answer pair is set to zero, and step S4 is executed directly.

[0059] Calculate the elapsed time from the moment each selected in-place question-answer pair was stored in the knowledge base to the current moment, and use this as the storage duration for each in-place question-answer pair. Use the storage duration of each selected in-place question-answer pair as a weight.

[0060] Following the semantic repetition rate calculation process in step S2, the semantic repetition rate between the candidate question-answer pair and the question text of each corresponding in-situ question-answer pair is calculated. The semantic repetition rate is then multiplied by the corresponding weight, summed, and divided by the number of all in-situ question-answer pairs corresponding to each candidate question-answer pair to obtain the weighted matching rate. The larger the weighted matching rate, the higher the overlap between the knowledge in the current batch and the existing knowledge in the knowledge base.
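The weighted matching rate described above reduces to a duration-weighted sum divided by the number of in-situ pairs. The sketch below assumes the semantic repetition rates and storage durations are already available as parallel lists; the time unit of the durations is not specified in the text, and the function name is hypothetical.

```python
def weighted_matching_rate(repetition_rates, storage_durations):
    """Sum of rate * storage duration, divided by the number of in-situ pairs.

    Both lists refer to the in-situ pairs selected for one candidate pair.
    """
    if not repetition_rates:
        return 0.0  # no in-situ pair was selected: the rate is set to zero
    weighted_sum = sum(r * w for r, w in zip(repetition_rates, storage_durations))
    return weighted_sum / len(repetition_rates)
```

Because the divisor is the pair count rather than the sum of the weights, the result scales with the chosen time unit and is not bounded by one.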

[0061] Step S4: Divide the number of business intent types of the unmatched candidate question-answer pairs by the total number of business intent types in the representative question set to obtain the intent ratio.

[0062] Specifically, candidate question-answer pairs with a weighted matching rate of zero are considered unmatched candidate question-answer pairs. The business intents of all unmatched candidate question-answer pairs are collected, duplicate business intents are removed, and the number of remaining business intents is counted as the number of business intent types of the unmatched candidate question-answer pairs.

[0063] At the same time, collect the business intents of all representative question texts in the current batch's representative question set, remove duplicate business intents, and count the number of remaining business intents as the total number of business intent types in the representative question set.

[0064] Next, the number of business intent types of the unmatched candidate question-answer pairs is divided by the total number of business intent types in the representative question set to obtain the intent ratio. The higher the intent ratio, the more new business intents appear in the current batch, and the greater the degree to which the knowledge base needs to be updated.
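A minimal sketch of the intent ratio follows. It assumes one candidate question-answer pair exists per representative question, so the candidates' intents also cover the representative question set; the dictionary keys and function name are hypothetical.

```python
def intent_ratio(candidates):
    """candidates: dicts with 'intent' and 'weighted_matching_rate'."""
    all_intents = {c["intent"] for c in candidates}
    if not all_intents:
        return 0.0  # assumption: an empty batch yields a ratio of zero
    # Unmatched candidates are those whose weighted matching rate is zero.
    unmatched = {c["intent"] for c in candidates if c["weighted_matching_rate"] == 0}
    return len(unmatched) / len(all_intents)
```

Using sets removes duplicate business intents on both sides of the division, as the step requires.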

[0065] Step S5: Based on the representative question set, the average length of the question text is used as the extraction threshold, and the ratio of the total number of questions to the number of job tag types is used as the job base.

[0066] Iterate through each representative question text in the current batch's representative question set, obtain the character count of each representative question text, sum the character counts of all representative question texts, and then divide by the total number of representative question texts in the representative question set to obtain the average question text length. Use the average question text length as the extraction threshold.

[0067] Simultaneously, iterate through the representative question text of each question in the current batch's representative question set, extract its job tags, remove duplicate job tags, and count the remaining job tags as the number of job tag types. Divide the total number of questions by the number of job tag types to obtain the job base.
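Both parameters of step S5 can be computed in one pass, as in the sketch below. It assumes "total number of questions" means the number of representative question texts, which the text does not state explicitly; the dictionary keys and function name are hypothetical.

```python
def extraction_parameters(representatives):
    """representatives: non-empty list of dicts with 'text' and 'job_tag'.

    Returns (extraction_threshold, job_base).
    """
    total = len(representatives)
    # Extraction threshold: average character count of the representative texts.
    extraction_threshold = sum(len(r["text"]) for r in representatives) / total
    # Job base: questions per distinct job tag.
    job_tag_types = len({r["job_tag"] for r in representatives})
    job_base = total / job_tag_types
    return extraction_threshold, job_base
```

For three representative questions of 4, 6 and 8 characters spread over two job tags, this gives a threshold of 6.0 and a job base of 1.5.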

[0068] Step S6: Based on the sliding window differential change rate of the weighted matching rate and the comparison result of the intent ratios of adjacent periods, determine the adjustment direction and adjustment amount of the extraction threshold and the job base.

[0069] Please see Figure 3. Step S60: Calculate the sliding window differential change rate of the weighted matching rate. Specifically, the weighted matching rates of all candidate question-answer pairs in the current batch are arranged into a sequence ordered by the number of question text characters in the candidate question-answer pairs, from shortest to longest.

[0070] Set the width of the sliding window to three consecutive weighted matching rates, place the sliding window at the beginning of the sequence, and within the window subtract each weighted matching rate from the one that follows it to obtain two differences; use the arithmetic mean of the absolute values of the two differences as the differential change rate.

[0071] Then, the sliding window is moved backward by one weighted matching rate to form a new window. For example, if the initial sliding window covers three weighted matching rates A1, A2, and A3, then the new window contains A2, A3, and A4. The differential change rate is calculated based on the new window in the above manner. The operation is repeated until the movement reaches the end of the sequence, at which point it is no longer possible to move backward to form a complete new window.

[0072] Subsequently, the arithmetic mean of all the differential change rates is used as the sliding window differential change rate of the weighted matching rate. The larger the value of the sliding window differential change rate of the weighted matching rate, the more unbalanced the knowledge base's coverage of the issues in the current batch is.

[0073] If the total number of weighted matching rates in the sequence is less than three, a complete sliding window cannot be formed; the sliding window differential change rate of the weighted matching rate is set to zero and the method proceeds to step S61. If the total number of weighted matching rates in the sequence is equal to three, there is only one sliding window, and the differential change rate calculated for this window is used as the sliding window differential change rate of the weighted matching rate.
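Step S60 can be sketched as follows, including the edge case of fewer than three values. The input is assumed to be a list of (question text, weighted matching rate) tuples, and the function name is hypothetical.

```python
def sliding_window_change_rate(candidates):
    """candidates: list of (question_text, weighted_matching_rate) tuples."""
    # Order the rates by question length, shortest first.
    rates = [rate for _, rate in sorted(candidates, key=lambda c: len(c[0]))]
    if len(rates) < 3:
        return 0.0  # no complete window can be formed
    per_window = []
    for i in range(len(rates) - 2):
        a, b, c = rates[i:i + 3]
        # Mean absolute difference between neighbours inside the window.
        per_window.append((abs(b - a) + abs(c - b)) / 2)
    return sum(per_window) / len(per_window)
```

With ordered rates 0.0, 0.5, 1.0, 1.0 the two windows give 0.5 and 0.25, so the overall value is 0.375.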

[0074] Please see Figure 4. Step S61: Obtain the comparison result of the intent ratios of adjacent periods. Specifically, traverse each representative question text in the representative question set of the current batch, and group representative question texts with the same business intent into the same intent group; extract the job tags of all representative question texts in each intent group, remove duplicate job tags, and count the number of remaining job tags as the number of job tag types corresponding to each business intent.

[0075] Dividing the number of job tag types by the proportion of intents in the current batch yields the job dispersion for each business intent in the current batch. Job dispersion characterizes the distribution density of each business intent across different job positions. The higher the job dispersion value, the more job positions are involved in the corresponding business intent.

[0076] Furthermore, iterate through the business intent of each representative question text in the current batch's representative question set, remove duplicate business intents, and count the remaining business intents as the total number of business intents.

[0077] In the same way, the job dispersion of each business intent in the previous batch is calculated from the number of job tag types corresponding to each business intent in the representative question set of the previous batch and the intent ratio of the previous batch obtained in step S4. If there is no previous batch, the comparison result of the intent ratio of adjacent periods is set to zero, and the method proceeds to step S62.

[0078] Then, calculate the difference in job dispersion for the same business intent between the current batch and the previous batch as the dispersion change; sum the absolute values of all dispersion changes and divide by the total number of business intents. The result is used as the comparison result of the intent ratio of adjacent periods. The larger this comparison result, the greater the difference in the distribution of business intents between the previous batch and the current batch.
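Step S61 can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function names and the `(job_tag, business_intent)` input shape are hypothetical, only business intents present in both batches contribute a dispersion change, and an intent ratio of zero is rejected. The text above does not specify either of the last two cases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def job_dispersion(
    questions: Iterable[Tuple[str, str]], intent_ratio: float
) -> Dict[str, float]:
    """Job dispersion per business intent for one batch.

    `questions` holds (job_tag, business_intent) pairs taken from the
    representative question set; `intent_ratio` is the batch's intent ratio.
    """
    if intent_ratio <= 0:
        # Assumption: a zero intent ratio is treated as undefined here.
        raise ValueError("intent_ratio must be positive")

    tags_by_intent: Dict[str, Set[str]] = defaultdict(set)
    for job_tag, intent in questions:
        tags_by_intent[intent].add(job_tag)  # the set removes duplicate job tags

    return {
        intent: len(tags) / intent_ratio
        for intent, tags in tags_by_intent.items()
    }


def adjacent_period_comparison(
    current: Dict[str, float], previous: Optional[Dict[str, float]]
) -> float:
    """Sum of absolute dispersion changes divided by the number of current intents."""
    if previous is None or not current:
        return 0.0  # no previous batch: the comparison result is zero

    # Assumption: only intents seen in both batches are compared.
    shared = current.keys() & previous.keys()
    total_change = sum(abs(current[i] - previous[i]) for i in shared)
    return total_change / len(current)
```

For example, if an intent spans two job tags in the current batch and one in the previous batch, with an intent ratio of 0.5 in both, its dispersion rises from 2.0 to 4.0 and contributes a dispersion change of 2.0.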

[0079] Referring to Figure 5, step S62: determine the adjustment direction and adjustment amount for the extraction threshold and the base number of positions. Specifically, the sliding window differential change rate of the weighted matching rate is used as reference value one, and the comparison result of the intent ratio of adjacent periods is used as reference value two.

[0080] The absolute value of the difference between reference value one and reference value two is used as the adjustment amount for the extraction threshold; the adjustment amount for the base number of positions is the same as the adjustment amount for the extraction threshold.

[0081] If reference value one is greater than reference value two, the adjustment direction of the extraction threshold is determined to be increasing; if reference value one is less than reference value two, the adjustment direction of the extraction threshold is determined to be decreasing; if reference value one is equal to reference value two, the adjustment direction of the extraction threshold is determined to be unchanged.

[0082] When the extraction threshold is increased, fewer question texts are selected. To keep the knowledge base rich in that case, the base number of positions needs to be reduced. The adjustment direction of the base number of positions is therefore determined to be opposite to the adjustment direction of the extraction threshold.

[0083] Step S7: Based on the adjusted extraction threshold and job base, construct new question-answer pairs and store them in the knowledge base.

[0084] Once the adjustment direction and adjustment amount for the extraction threshold and the base number of positions have been obtained, the adjustment amount is added to or subtracted from the extraction threshold according to its adjustment direction to obtain the adjusted extraction threshold. The adjustment amount is subtracted from or added to the base number of positions according to the opposite adjustment direction to obtain the adjusted base number of positions.
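The adjustment rule of step S62 and its application in step S7 can be summarized in a short Python sketch. The function name `adjust_parameters` and its argument names are hypothetical; the logic follows the rule described above, in which the two parameters move by the same amount in opposite directions.

```python
from typing import Tuple


def adjust_parameters(
    threshold: float, job_base: float, ref_one: float, ref_two: float
) -> Tuple[float, float]:
    """Return (adjusted extraction threshold, adjusted base number of positions).

    ref_one: sliding window differential change rate of the weighted matching rate.
    ref_two: comparison result of the intent ratio of adjacent periods.
    """
    amount = abs(ref_one - ref_two)  # same adjustment amount for both parameters

    if ref_one > ref_two:
        # Threshold increases, so the base number of positions decreases.
        return threshold + amount, job_base - amount
    if ref_one < ref_two:
        # Threshold decreases, so the base number of positions increases.
        return threshold - amount, job_base + amount
    # Equal reference values: both parameters remain unchanged.
    return threshold, job_base
```

For instance, `adjust_parameters(12.0, 5.0, 0.75, 0.25)` gives an adjustment amount of 0.5, an adjusted extraction threshold of 12.5, and an adjusted base number of positions of 4.5.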

[0085] In order to enable the knowledge base to be updated adaptively in response to business changes, new question-answer pairs are constructed based on the next batch of question-answer pairs and stored in the knowledge base according to the adjusted extraction threshold and job base.

[0086] The specific construction process is as follows: Perform the same classification process as step S1 on the next batch of question-answer pairs to obtain the question text and its job tag and business intent, and obtain the answer text corresponding to each question text according to step S3.

[0087] Select question texts from the obtained question texts whose character count is greater than or equal to the adjusted extraction threshold. If no question texts with a character count greater than or equal to the adjusted extraction threshold are selected, skip this batch of question-answer pairs, do not build new question-answer pairs, and wait for the next batch of question-answer pairs.

[0088] Considering the differences in business needs and questioning habits among different positions, the selected question texts are grouped by job tag, with question texts having the same job tag placed in the same group. Then, based on the adjusted base number of positions, the number of question texts to be retained in each group is determined, ensuring that the total number of questions in each group is less than or equal to the adjusted base number of positions. This prevents the question texts for certain positions in the knowledge base from becoming excessively bloated.

[0089] The specific rules are as follows: If the total number of questions in a group is less than or equal to the adjusted base number of positions, all question texts in that group are retained. If the total number of questions in a group exceeds the adjusted base number of positions, the excess question texts are deleted from that group, starting with those having the smallest character count. If several question texts share the smallest character count, only as many of them as are needed to remove the excess are deleted.
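The screening and per-group capping rules above can be sketched in Python as follows. The function name `select_questions` and the `(job_tag, question_text)` input shape are hypothetical. Two further choices are assumptions not specified in the text: a fractional base number of positions is rounded down, and ties in character count are resolved by keeping the question texts that appear earlier in the batch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def select_questions(
    questions: List[Tuple[str, str]], threshold: float, job_base: float
) -> Dict[str, List[str]]:
    """Filter question texts by length, then cap each job tag group.

    `questions` holds (job_tag, question_text) pairs from the next batch.
    An empty result means the batch is skipped.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for job_tag, text in questions:
        # Keep only texts whose character count reaches the adjusted threshold.
        if len(text) >= threshold:
            groups[job_tag].append(text)

    # Assumption: a fractional base number of positions is rounded down.
    limit = max(int(job_base), 0)

    kept: Dict[str, List[str]] = {}
    for job_tag, texts in groups.items():
        # Stable sort, longest first, so the shortest texts are dropped first
        # and equal-length texts keep their batch order (tie-break assumption).
        ordered = sorted(texts, key=len, reverse=True)
        kept[job_tag] = ordered[:limit]
    return kept
```

Groups that are already within the limit are returned in full, because slicing a list beyond its length simply returns the whole list.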

[0090] Finally, each group of retained question texts, their answer texts, job tags, and business intentions are combined into a new question-and-answer pair and stored in the knowledge base, with a timestamp recording the moment the new question-and-answer pair is stored.

[0091] Referring to Figure 2, the present invention also provides a question-and-answer knowledge base construction system, including: a question-and-answer processing module, used to classify the current batch of question-and-answer pairs to obtain question texts and their job tags and business intents; calculate the semantic repetition rate of any two question texts, group the two question texts with the highest semantic repetition rate into the same group, and form a representative question set from the most frequent question text in each group together with its job tag and business intent.

[0092] The weighted matching module is used to construct candidate question-answer pairs based on a representative question set. It uses the storage duration of in-situ question-answer pairs in the knowledge base as the weight to calculate the weighted matching rate between candidate question-answer pairs and in-situ question-answer pairs.

[0093] The parameter setting module is used to divide the number of business intent types of unmatched candidate question-answer pairs by the total number of business intent types in the representative question set to obtain the intent ratio; based on the representative question set, the average question text length is used as the extraction threshold, and the ratio of the total number of questions to the number of job tag types is used as the job base.

[0094] The adjustment module is used to determine the adjustment direction and adjustment amount of the extraction threshold and the job base based on the sliding window differential change rate of the weighted matching rate and the comparison result of the intent ratio of adjacent periods, and, based on the adjusted extraction threshold and job base, to construct new question-answer pairs from the next batch of question-answer pairs and store them in the knowledge base.

[0095] In addition, the present invention also provides a question-and-answer knowledge base storage medium, wherein the storage medium stores a computer program, and when the computer program is read, the computer executes the above-described question-and-answer knowledge base construction method.

[0096] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented in software, the above embodiments can be implemented, in whole or in part, as a computer program product.

[0097] Those skilled in the art will recognize that the modules and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0098] In addition, the functional modules in the various embodiments of the present invention can be integrated into one processing module, or each module can exist physically separately, or two or more modules can be integrated into one module.

[0099] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

[0100] Finally, the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for constructing a question-and-answer knowledge base, characterized in that it includes the following steps: classifying the current batch of question-and-answer pairs to obtain question texts and their job tags and business intents; calculating the semantic repetition rate of any two question texts, grouping the two question texts with the highest semantic repetition rate into the same group, and forming a representative question set from the most frequent question text in each group together with its job tag and business intent; constructing candidate question-answer pairs based on the representative question set, and calculating the weighted matching rate between the candidate question-answer pairs and in-situ question-answer pairs, using the storage duration of the in-situ question-answer pairs in the knowledge base as the weight; dividing the number of business intent types of the unmatched candidate question-answer pairs by the total number of business intent types in the representative question set to obtain the intent ratio; based on the representative question set, using the average question text length as the extraction threshold and the ratio of the total number of questions to the number of job tag types as the job base; determining the adjustment direction and adjustment amount of the extraction threshold and the job base based on the sliding window differential change rate of the weighted matching rate and the comparison result of the intent ratio of adjacent periods; and, based on the adjusted extraction threshold and job base, constructing new question-answer pairs from the next batch of question-answer pairs and storing them in the knowledge base.

2. The question-and-answer knowledge base construction method according to claim 1, characterized in that, The process of obtaining the question text, its job title tags, and business intent is as follows: For the current batch of question-answer pairs, statements that meet any of the following conditions will be identified as question text: the statement begins with an interrogative word; The statement contains an interrogative sentence structure; the statement ends with a question mark; Extract words from the question text that indicate action goals, operation objects, or service needs as business intent; extract words from the question-and-answer pairs that describe the questioner's job title, role name, or scope of function as job tags; Record the job tags and business intentions extracted from the same question and answer pair as the job tags and business intentions corresponding to the question text in this question and answer pair; For question-and-answer pairs that lack job tags or business intent, use the job tags or business intents from adjacent question-and-answer pairs; Sort multiple question texts under the same job title and corresponding to the same business intent by the number of characters contained, from fewest to most. Delete the question texts with the most and fewest characters, and keep the remaining question texts.

3. The question-and-answer knowledge base construction method according to claim 2, characterized in that, The process of assembling the representative question set is as follows: For any two question texts: if both the job tags and the business intents of the two are different, the semantic repetition rate is zero; if both the job tags and the business intents of the two are the same, the semantic repetition rate is one; if the job tags of the two are the same but the business intents are different, or the job tags are different but the business intents are the same, then extract the content words that appear in both, and use the ratio of the number of these shared content words to the total number of distinct content words in the two as the semantic repetition rate. If a unique pair of question texts has the largest semantic repetition rate and that rate is greater than zero, this pair of question texts is merged into the same group. If multiple pairs of question texts share the same largest semantic repetition rate and that rate is greater than zero, these pairs of question texts are merged into the same group. If the semantic repetition rate of every pair of question texts is zero, the two question texts that appear first in the current batch, in chronological order of appearance, are merged into the same group. After merging, continue merging the two question texts with the highest semantic repetition rate from the remaining question texts until all question texts are assigned to a group; Count the frequency of each question text in each group, and take the question text with the highest frequency as the representative question text; if multiple question texts in a group share the highest frequency, take the question text whose number of characters differs least from the average number of characters of all question texts in the group as the representative question text. Each group's representative question text, along with its job tag and business intent, is compiled into the representative question set.

4. The question-and-answer knowledge base construction method according to claim 1, characterized in that, The weighted matching rate is calculated as follows: Extract the answer text corresponding to each representative question text from the current batch of question-answer pairs, and combine the representative question text and the answer text into candidate question-answer pairs; Select in-place question-and-answer pairs from the knowledge base that have the same job tags or business intent as each candidate question-and-answer pair, and use the storage duration of each selected in-place question-and-answer pair as a weight. Calculate the semantic repetition rate between the candidate question-answer pair and the question text of each corresponding in-situ question-answer pair, multiply the semantic repetition rate by the corresponding weight, sum them up, and then divide by the number of in-situ question-answer pairs selected to obtain the weighted matching rate. If no in-place question-and-answer pairs with the same job tags or business intentions as the candidate question-and-answer pairs are found, the weighted matching rate is determined to be zero.

5. The question-and-answer knowledge base construction method according to claim 1, characterized in that, The calculation process for the sliding window differential change rate of the weighted matching rate is as follows: The weighted matching rates of all candidate question-answer pairs in the current batch are arranged into a sequence according to the number of characters in the question text from shortest to longest. Set the width of the sliding window to three consecutive weighted matching rates, place the sliding window at the beginning of the sequence, calculate the difference between two adjacent weighted matching rates within the sliding window, and obtain two difference values; use the arithmetic mean of the absolute values of the two differences as the differential change rate. The sliding window is moved backward by one weighted matching rate to form a new window. The differential change rate is calculated based on the new window until the end of the sequence is reached. The mean of all differential change rates is used as the sliding window differential change rate of the weighted matching rate.

6. The question-and-answer knowledge base construction method according to claim 1, characterized in that, The comparison result of the intent ratio of adjacent periods is obtained as follows: Divide the number of job tag types corresponding to each business intent in the representative question set of the current batch by the intent ratio of the current batch to obtain the job dispersion of each business intent in the current batch. Based on the number of job tag types corresponding to each business intent in the representative question set of the previous batch, the job dispersion of each business intent in the previous batch is obtained in the same way. Calculate the difference in job dispersion for the same business intent between the current batch and the previous batch, and use it as the dispersion change. The ratio of the sum of the absolute values of all dispersion changes to the total number of business intents is used as the comparison result of the intent ratio of adjacent periods.

7. The question-and-answer knowledge base construction method according to claim 6, characterized in that, The process of determining the adjustment direction and adjustment amount is as follows: The sliding window difference change rate of the weighted matching rate is used as reference value one, and the comparison result of the intention ratio of adjacent periods is used as reference value two. The absolute value of the difference between reference value one and reference value two is used as the adjustment amount for the extraction threshold; the adjustment amount for the base number of positions is the same as the adjustment amount for the extraction threshold. Based on the comparison between reference value one and reference value two, the adjustment direction of the extraction threshold is determined; the adjustment direction includes increasing, decreasing or remaining unchanged; the adjustment direction of the job base is opposite to the adjustment direction of the extraction threshold.

8. The question-and-answer knowledge base construction method according to claim 1, characterized in that, The process of constructing a new question-answer pair is as follows: The next batch of question-and-answer pairs is categorized to obtain the question text, its job tag, and business intent, and the answer text corresponding to each question text is obtained. Select question texts from the obtained question texts whose number of characters is greater than or equal to the adjusted extraction threshold; group the selected question texts by job label, and determine the number of question texts to be retained in each group based on the adjusted job base number, so that the total number of questions in each group is less than or equal to the adjusted job base number. Combine the retained question text, answer text, job tag, and business intent of each group into a new question-and-answer pair.

9. A question-and-answer knowledge base construction system, characterized in that it includes: a question-and-answer processing module, used to classify the current batch of question-and-answer pairs to obtain question texts and their job tags and business intents, calculate the semantic repetition rate of any two question texts, group the two question texts with the highest semantic repetition rate into the same group, and form a representative question set from the most frequent question text in each group together with its job tag and business intent; a weighted matching module, used to construct candidate question-answer pairs based on the representative question set, and to calculate the weighted matching rate between the candidate question-answer pairs and in-situ question-answer pairs, using the storage duration of the in-situ question-answer pairs in the knowledge base as the weight; a parameter setting module, used to divide the number of business intent types of the unmatched candidate question-answer pairs by the total number of business intent types in the representative question set to obtain the intent ratio, and, based on the representative question set, to use the average question text length as the extraction threshold and the ratio of the total number of questions to the number of job tag types as the job base; and an adjustment module, used to determine the adjustment direction and adjustment amount of the extraction threshold and the job base based on the sliding window differential change rate of the weighted matching rate and the comparison result of the intent ratio of adjacent periods, and, based on the adjusted extraction threshold and job base, to construct new question-answer pairs from the next batch of question-answer pairs and store them in the knowledge base.

10. A question-and-answer knowledge base storage medium, characterized in that, The storage medium stores a computer program, and when the computer program is read, the computer executes a question-and-answer knowledge base construction method as described in any one of claims 1 to 8.