A data processing method
By using cross-node comparison and semantic alignment of technical feature words, the method addresses the problem of pseudo-prior art caused by discrepancies that arise in AI-generated content during dissemination. It restores the original content from multiple versions of that content and improves the credibility and reliability of the restoration results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- LIAONING ZHONGKE ZHICHAN HIGH TECH IND RES CO LTD
- Filing Date
- 2026-04-20
- Publication Date
- 2026-06-30
AI Technical Summary
AI-generated content is prone to discrepancies during dissemination, leading to pseudo-prior art and making it difficult for users to judge the authenticity and completeness of the original recorded content.
By acquiring multiple AI-generated content variants distributed across different propagation nodes, different expressions of technical feature words are identified. Candidate restoration values are generated using cross-node comparison and semantic alignment, and credibility is evaluated to output the restoration results.
It improves the accuracy and reliability of restoring the original content from multiple versions of AI-generated content, ensuring that the restoration results conform to statistical laws and do not violate common sense in the field, and provides a credible reference.
Figure CN122065987B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a data processing method.
Background Technology
[0002] In the data processing process, for example, when dealing with massive amounts of multi-source text data on the Internet, the authenticity, integrity and traceability of the data become key issues. Generative artificial intelligence technology can automatically generate a large amount of AI content, such as patent abstracts, technical solutions, academic literature, etc., and spread them on the Internet. AI-generated content can bring convenience to information acquisition, but in the process of automatic generation, the original recorded content is easily tampered with, and even fabricated patent numbers and false citations may occur, making it impossible for users to understand the original recorded content.
[0003] To address the aforementioned issues, some solutions use semantic similarity comparison to determine the relevance between the content and the original recorded content, or utilize knowledge graphs to verify consistency. Semantic similarity comparison can establish a connection between the content and the original recorded content, helping users identify potential copying or rewriting relationships. Knowledge graph verification can perform structured checks, helping to determine whether there are logical contradictions or factual errors in the AI-generated content.
[0004] However, AI-generated content can be disseminated at different nodes, and multiple different derivative versions may appear during dissemination. That is, the AI-generated content disseminated at different nodes differs, resulting in differences between the AI-generated content and the original recorded content, which can easily lead to pseudo-prior art.
Summary of the Invention
[0005] This application provides a data processing method to address the problem that AI-generated content differs from the original recorded content, which can easily lead to pseudo-prior art.
[0006] This application provides a data processing method, including:
[0007] Multiple AI-generated content variants distributed across different propagation nodes are obtained. These multiple AI-generated content variants include the same technical feature words, which include one or more combinations of performance index words, physical quantity words, material composition words, process step words, and limiting condition words.
[0008] Cross-node comparison of multiple AI-generated content variants is performed to identify different expressions of the technical feature words in different AI-generated content variants. The different expressions include at least one of numerical differences, terminological differences, or differences in limiting conditions.
[0009] Based on the distribution characteristics among the different expressions, the tendency type is identified, which includes overall shift, polarization, or random fluctuation.
[0010] Based on the tendency type and knowledge constraints, at least one candidate restoration value for the technical feature word is generated.
[0011] The credibility of the candidate restoration values is evaluated, and the restoration result is output. The restoration result includes the target restoration value and the corresponding credibility evaluation information.
[0012] The data processing method obtains multiple AI-generated content variants distributed across different propagation nodes and identifies different expressions in order to restore the original content from the differences between multiple versions. Specifically, by screening candidate content containing the same technical feature words and performing correlation analysis, the comparability of multiple AI-generated content variants can be ensured. Semantic alignment and extraction of target expressions unify the expression of the same technical meaning in different AI-generated content variants. By classifying target expressions into numerical differences, terminological differences, or limiting condition differences, and by calculating statistics, semantic distance, or conditional change, the degree of difference in numerical values, terms, or limiting conditions is quantified respectively. The concentration of technical topics is judged by comprehensively considering numerical and terminological differences, and the trend of technical scope changes is judged by combining limiting condition differences, thereby improving the accuracy of tendency type identification. By extracting statistical features and comparing them with knowledge constraints to generate candidate restoration values, the restoration results are made to conform to statistical laws and not violate domain common sense. By extracting concentration features, cluster centers, or preset quantiles according to tendency types, the generation method of candidate restoration values is ensured to match the variation pattern. By comparing statistical features with preset ranges and adjusting values exceeding the range, the restoration results are prevented from violating domain common sense, thus improving the reliability of the restoration results. 
By comprehensively considering propagation node features and cross-difference consistency, credibility assessment is performed and the results are sorted and output, providing users with selectable restoration results and credibility references.
Attached Figure Description
[0013] To more clearly illustrate the technical solution of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0014] Figure 1 is a flowchart of a data processing method provided in an embodiment of this application;
[0015] Figure 2 is a schematic diagram of the acquisition of multiple AI-generated content variants provided in an embodiment of this application;
[0016] Figure 3 is a schematic diagram of the determination of the tendency type provided in an embodiment of this application;
[0017] Figure 4 is a schematic diagram of the process for determining candidate restoration values provided in an embodiment of this application.
Detailed Implementation
[0018] The embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following examples do not represent all embodiments consistent with this application.
[0019] During the dissemination process, AI-generated content is often reposted, rewritten, or summarized multiple times by different platforms or users. For example, after the technical solution in the original patent document is generated into a blog post by AI, the article may be copied or modified by other websites to increase its appeal; another platform may generate a new summary based on the same original document, resulting in multiple AI versions with different expressions derived from the same original content.
[0020] These derivative versions differ from the original document in several ways, such as numerical differences, terminology substitutions, or the loss of key limitations, and each spreads independently on the internet. When users search for prior art, they may only find one of these derivative versions and mistakenly take it as the original published document, thus creating pseudo-prior art—content that is not a true original document but masquerades as prior art. If such pseudo-prior art is cited in patent examination or academic evaluation, it will directly lead to misjudgment.
[0021] To address the aforementioned problems, embodiments of this application provide a data processing method which, as shown in Figure 1, includes steps S100-S500.
[0022] S100: Obtain multiple AI-generated content variants distributed at different propagation nodes. The multiple AI-generated content variants include the same technical feature words, which include one or more combinations of performance index words, physical quantity words, material composition words, process step words, and limiting condition words.
[0023] AI-generated content variants can spread across different dissemination nodes, such as patent databases, academic literature platforms, technical blogs, news websites, and social media. During the dissemination process, the same original content may be rewritten, reproduced, or summarized in different ways by different dissemination nodes, resulting in multiple different AI-generated content versions, i.e., AI-generated content variants. These different versions may have numerical differences, such as modified technical parameters; terminology differences, such as replacing technical terms with synonyms or colloquial expressions; and differences in limiting conditions, such as the deletion or alteration of testing conditions or scope of application.
[0024] It should be noted that this application categorizes different expressions into three types: numerical differences, terminological differences, and differences in limiting conditions. This classification is based on the most common variations of technical feature words during the dissemination of AI-generated content. Technical feature words typically include performance index terms, physical quantity terms, material composition terms, process step terms, or limiting condition terms. These elements are most prone to three types of changes during reprinting, rewriting, or summarizing: first, specific numerical values are exaggerated or reduced, resulting in numerical differences; second, technical terms are replaced with synonyms, abbreviations, or colloquial expressions, resulting in terminological differences; and third, conditional adverbs, scope limitations, or environmental parameters are added, deleted, or modified, resulting in differences in limiting conditions.
[0025] These three types of differences can cover the vast majority of propagative tampering scenarios, and are respectively suitable for quantitative analysis using statistics, semantic distance, and conditional change, thus providing a clear analytical path for subsequent tendency type identification and candidate restoration value generation.
[0026] Compared to other possible types of differences (such as word order adjustment, punctuation changes, etc.), the three types selected in this application are directly related to the core elements of technical features, have substantial significance for restoring the original content, and are quantifiable and verifiable, meeting the requirements of practicality and feasibility of data processing methods.
[0027] By acquiring multiple AI-generated content variants, the original content can be reverse-engineered by leveraging the contradictions and consistency between multiple versions. This approach does not rely on matching a single piece of content but rather utilizes the statistical distribution patterns among multiple versions for reconstruction.
[0028] Multiple AI-generated content variants include the same technical feature words. Technical feature words are words that can characterize the essential features of a technical solution, including performance index words, such as efficiency and power; physical quantity words, such as temperature and pressure; material composition words, such as perovskite and silicon-based materials; process step words, such as annealing and deposition; and limiting condition words, such as lighting conditions and testing standards.
[0029] Specifically, in some embodiments, obtaining multiple AI-generated content variants distributed across different propagation nodes includes the following steps S110-S150.
[0030] S110: Obtain candidate AI-generated content from multiple propagation nodes.
[0031] Candidate AI-generated content refers to text content that may be generated by AI, including but not limited to patent abstracts, technical solution descriptions, and academic literature reviews. This content can be obtained from multiple dissemination nodes such as patent databases, academic literature platforms, technical blogs, news websites, and social media. In Figure 2, candidate AI-generated content 1, 2, and 3 are obtained from propagation nodes A, B, and C, respectively.
[0032] Unlike AI-generated content variants, multiple pieces of candidate AI-generated content do not necessarily share the same technical feature words, meaning that they may be completely unrelated.
[0033] S120: Perform natural language processing on candidate AI-generated content to identify technical feature words.
[0034] Natural Language Processing (NLP) is the use of computer technology to analyze and process natural language text. In this embodiment, NLP is mainly used to identify technical feature words from candidate AI-generated content.
[0035] Specifically, methods such as word segmentation, part-of-speech tagging, named entity recognition, and dependency parsing can be used to segment the text into word sequences and identify technical terms, performance index words, physical quantity words, material component words, process step words, and limiting condition words.
[0036] For example, named entity recognition using natural language processing (NLP) can identify material names such as "perovskite"; regular expressions using NLP can match performance metrics such as "efficiency 25%"; NLP can transform unstructured text into structured data, providing a foundation for subsequent semantic alignment and difference recognition. By identifying technical feature words, the technical fields and topics involved in candidate content can be preliminarily determined, providing a basis for subsequent screening.
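As a rough illustration of the regular-expression matching mentioned above, the following Python sketch pulls (feature word, value, unit) triples out of a sentence. The feature-word list, the unit list, and the name `extract_features` are assumptions made for this example; the application does not prescribe a particular pattern.

```python
import re

# Hypothetical pattern: a feature word, a short run of non-digit filler
# ("of this cell reaches", "of", ...), a number, and a unit.
# The feature-word and unit alternatives are illustrative, not exhaustive.
PATTERN = re.compile(
    r"(?P<feature>efficiency|temperature|pressure)"
    r"\D{0,30}?"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>%|℃|MPa)",
    re.IGNORECASE,
)

def extract_features(text):
    """Return (feature word, numeric value, unit) triples found in text."""
    return [
        (m.group("feature").lower(), float(m.group("value")), m.group("unit"))
        for m in PATTERN.finditer(text)
    ]

print(extract_features(
    "The photoelectric conversion efficiency of this solar cell reaches 25.3%"
))
```

A pattern like this simply binds a feature word to the next number that follows it, so it can mis-associate values in longer sentences; the dependency parsing described in later steps is what resolves such cases.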
[0037] It is understood that the recognition process can be implemented in various ways, and this application does not limit it. Any prior-art implementation capable of recognizing technical feature words may be used.
[0038] S130: Select candidate AI-generated content that includes the same technical feature words to obtain the first variant set.
[0039] In other words, only when multiple candidate AI-generated content involves the same technical topic, i.e. includes the same technical feature words, are they considered to be likely to originate from the same original content, and thus further screening is performed.
[0040] S140: Extract features from multiple candidate AI-generated content in the first variant set to obtain target feature information, which includes at least one of content fingerprint, propagation chain information, or metadata.
[0041] Further feature extraction is performed on the candidate AI-generated content in the first variant set. During the extraction process, the content fingerprint may include semantic hash values or text vector representations to characterize the semantic features of the content; the propagation chain information may include citation relationships, reprint paths, or publication time series, etc.; metadata may include original document identifiers, author information, publishing platforms, etc.
[0042] S150: Based on the target feature information, perform correlation analysis on the candidate AI-generated content in the first variant set to construct a candidate variant set, which includes multiple AI-generated content variants.
[0043] Association analysis utilizes content fingerprints, propagation chain information, or metadata to determine whether content generated by different candidate AIs originates from the same original content. Content fingerprints determine whether the content generated by different candidate AIs is highly similar semantically, thus providing a preliminary assessment of their homology. Propagation chain information includes citation relationships, reprint paths, and publication time sequences. If multiple candidate AI-generated content pieces have direct or indirect citation relationships, or if their publication times are sequential and their themes are consistent, they are likely to originate from the same original content. Metadata includes original document identifiers, such as patent numbers, author information, and publishing platforms. If the metadata of multiple pieces of content contains the same original document identifier, their homology can be directly confirmed.
[0044] In association analysis, one or more of the above information can be used to construct a candidate variant set through weighted scoring or decision trees, thereby ensuring that the variants in the set do indeed originate from the same original content, excluding cases where they are only similar in technical theme but actually originate from different original content. The multiple AI-generated content variants contained in the resulting candidate variant set are the input data used for reverse reconstruction analysis.
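The weighted-scoring option for association analysis might look like the following Python sketch. Everything specific here is an assumption for illustration: the record fields (`text`, `cites`, `source_id`), the weights, the 0.5 threshold, and the use of token-set Jaccard similarity as a stand-in for a content-fingerprint comparison.

```python
def jaccard(text_a, text_b):
    """Token-set Jaccard similarity, standing in for a fingerprint comparison."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def homology_score(x, y, weights=(0.5, 0.2, 0.3)):
    """Weighted sum of fingerprint, propagation-chain, and metadata evidence."""
    w_fingerprint, w_chain, w_meta = weights
    fingerprint = jaccard(x["text"], y["text"])
    # Propagation chain: one item directly cites the other.
    chain = 1.0 if y["id"] in x["cites"] or x["id"] in y["cites"] else 0.0
    # Metadata: both items carry the same original document identifier.
    meta = 1.0 if x["source_id"] and x["source_id"] == y["source_id"] else 0.0
    return w_fingerprint * fingerprint + w_chain * chain + w_meta * meta

def build_variant_set(seed, candidates, threshold=0.5):
    """Keep the candidates whose score against the seed reaches the threshold."""
    return [c for c in candidates if homology_score(seed, c) >= threshold]

# Hypothetical records; identifiers such as "DOC-1" are placeholders.
a = {"id": "A", "text": "perovskite cell efficiency reaches 25.3%",
     "cites": set(), "source_id": "DOC-1"}
b = {"id": "B", "text": "perovskite cell efficiency reaches 28%",
     "cites": {"A"}, "source_id": "DOC-1"}
c = {"id": "C", "text": "perovskite cell efficiency of a different device",
     "cites": set(), "source_id": "DOC-2"}

print([item["id"] for item in build_variant_set(a, [b, c])])
```

In this toy data, item C shares the technical theme but has no citation link and a different source identifier, so it falls below the threshold, which is the exclusion case the paragraph above describes.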
[0045] After obtaining multiple AI-generated content variations, step S200 is executed.
[0046] S200: Perform cross-node comparison of multiple AI-generated content variants to identify different expressions of technical feature words in different AI-generated content variants. The different expressions include at least one of numerical differences, terminological differences, or differences in limiting conditions.
[0047] Because different dissemination nodes may use different expressions during the process of reprinting or rewriting, the specific expression of the same technical feature may differ in different AI-generated content variants.
[0048] For example, the same performance metric, "photoelectric conversion efficiency," may be expressed as "PCE," "efficiency," or "photoelectric conversion efficiency" in different AI-generated content variants, and the corresponding numerical values may also differ. Therefore, to improve the accuracy of content reconstruction, it is necessary to identify the different expressions of technical feature terms.
[0049] In some embodiments, cross-node comparison is performed on multiple variants to identify different expressions of technical feature words in different AI-generated content variants, including the following steps S210-S240.
[0050] S210: Perform natural language processing on AI-generated content variants to locate the expression form of technical feature words in AI-generated content variants.
[0051] In this step, natural language processing is used to locate the specific representation of technical feature words in variants. The localization process may include: performing word segmentation and part-of-speech tagging on each variant to identify all possible technical feature word candidates; then using a pre-built domain dictionary or pre-trained language model to filter out words that match the technical feature words identified in step S120; and finally, determining the position of these technical feature words in the sentence and their modification relationships through dependency parsing.
[0052] For example, in the sentence "The photoelectric conversion efficiency of this solar cell reaches 25.3%", dependency parsing can determine that "photoelectric conversion efficiency" is the subject and "25.3%" is its numerical object, thus locating the technical feature term "photoelectric conversion efficiency" and its corresponding value.
[0053] S220: Semantically align different expressions representing the same technical meaning in different AI-generated content variants to obtain aligned feature words.
[0054] Semantic alignment can be achieved through methods such as synonym mapping (e.g., based on a domain-specific thesaurus) or vector space models (e.g., semantic similarity calculation based on pre-trained models such as Word2Vec and BERT).
[0055] Semantic alignment yields aligned feature words, that is, words that are expressed differently in different AI-generated content variants but carry the same technical meaning and are therefore placed in the same group.
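A minimal sketch of the synonym-mapping route to semantic alignment is shown below. The thesaurus contents and the function name `align` are hypothetical; a real domain thesaurus or an embedding model would take their place.

```python
from collections import defaultdict

# Hypothetical domain thesaurus: surface form -> canonical feature word.
THESAURUS = {
    "pce": "photoelectric conversion efficiency",
    "efficiency": "photoelectric conversion efficiency",
    "photoelectric conversion efficiency": "photoelectric conversion efficiency",
}

def align(mentions):
    """Group (variant id, surface form) pairs under a canonical feature word.

    Surface forms missing from the thesaurus are kept as their own group.
    """
    groups = defaultdict(list)
    for variant_id, surface in mentions:
        key = surface.strip().lower()
        groups[THESAURUS.get(key, key)].append((variant_id, surface))
    return dict(groups)

aligned = align([
    ("A", "PCE"),
    ("B", "efficiency"),
    ("C", "Photoelectric conversion efficiency"),
    ("D", "annealing"),
])
print(aligned)
```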
[0056] S230: Extract the target representation, which is the value or modification representation of the aligned feature word in the AI-generated content variant.
[0057] The value is the specific numerical parameter corresponding to the technical feature term, such as "25%", "100℃", "5MPa", etc., which is usually composed of numbers and units.
[0058] Modifying expressions are words that limit or describe technical features, including conditional adverbials (such as "under AM1.5G illumination"), range limitations (such as "at room temperature" or "pH value between 7 and 8"), and environmental parameters (such as "humidity 80%").
[0059] For example, for the aligned feature word "photoelectric conversion efficiency", the corresponding value in variant A is "25.3%", the corresponding value in variant B is "28%", and the corresponding value in variant C is "22%". These values are the target expressions.
[0060] As another example, for the aligned feature word "temperature", the modifying expressions could be "reaction temperature" or "heated to", which are also target expressions.
[0061] S240: Based on the target expression, determine the different expressions of technical feature words in different AI-generated content variants.
[0062] In this embodiment, different expressions are categorized into three types: numerical differences, terminological differences, and differences in limiting conditions. By classifying the target expressions, complex multi-dimensional difference problems can be decomposed into single-dimensional problems, making the reconstruction process more targeted and accurate. Simultaneously, this classification lays the foundation for subsequent judgments. For example, when multiple differences exist, the concentration of the technical topic is first determined through numerical and terminological differences, then the changing trend of the technical scope is determined through differences in limiting conditions, and finally, the tendency type is comprehensively determined.
[0063] In some embodiments, determining the different expressions of technical feature terms in different AI-generated content variants based on the target expression includes: when the target expression is a value, determining the difference in the corresponding value in different AI-generated content variants as a numerical difference; when the modified expression of the target expression is a different expression of a technical feature term with the same technical meaning, determining the difference in the corresponding expression in different AI-generated content variants as a terminological difference; when the modified expression of the target expression is a conditional adverbial, scope limitation, or environmental parameter attached to a technical feature term, determining the difference in the conditional adverbial, scope limitation, or environmental parameter in different AI-generated content variants as a limiting condition difference.
[0064] Among them, numerical differences refer to the different numerical parameters corresponding to the same technical feature words in different AI-generated content variants. For example, the efficiency recorded in variant A is 25%, while the efficiency recorded in variant B is 30%.
[0065] Terminology differences refer to the different expressions used for the same technical meaning in different AI-generated content variants. For example, "efficiency," "PCE," and "photoelectric conversion efficiency" are all different expressions of the same technical meaning.
[0066] Differences in limiting conditions refer to differences in the conditional adverbs, scope limitations, or environmental parameters attached to the technical feature terms. For example, variant A states "efficiency is 25% under AM1.5G standard illumination," while variant B only states "efficiency is 25%," lacking the limitation of illumination conditions.
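The three-way classification can be sketched in Python as follows, assuming each variant's target expression for one aligned feature word has already been reduced to a term, a value, and a set of limiting conditions. That record layout and the function name `difference_types` are assumptions for this example.

```python
def difference_types(expressions):
    """Report which difference types occur across variants for one feature word.

    Each expression is a dict with keys 'term' (surface form), 'value'
    (number or None) and 'conditions' (set of limiting-condition strings).
    """
    types = set()
    if len({e["value"] for e in expressions}) > 1:
        types.add("numerical")
    if len({e["term"].lower() for e in expressions}) > 1:
        types.add("terminological")
    if len({frozenset(e["conditions"]) for e in expressions}) > 1:
        types.add("limiting_condition")
    return types

# Variant B drops the illumination condition that variant A states.
condition_case = [
    {"term": "efficiency", "value": 25.0, "conditions": {"AM1.5G standard illumination"}},
    {"term": "efficiency", "value": 25.0, "conditions": set()},
]
# Variants differ in both the term used and the recorded value.
mixed_case = [
    {"term": "efficiency", "value": 25.0, "conditions": set()},
    {"term": "PCE", "value": 30.0, "conditions": set()},
]
print(difference_types(condition_case), difference_types(mixed_case))
```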
[0067] After identifying the different expressions, step S300 is executed.
[0068] S300: Identify the tendency type based on the distribution characteristics among the different expressions. The tendency type includes overall shift, polarization, or random fluctuation.
[0069] In the scenario of this application, AI-generated content variants may exhibit different change patterns. That is, the values of some AI-generated content variants have a consistent offset relative to the original values, the expressions of some AI-generated content variants are split into multiple obviously different clusters, and the changes of some AI-generated content variants are chaotic and without obvious rules.
[0070] In this embodiment, overall shift means that the expressions in the AI-generated content variants have a consistent offset direction relative to the original value; polarization means that the expressions split into two or more distinct clusters; and random fluctuation means that the expressions are randomly scattered with no obvious pattern.
[0071] By identifying the type of tendency, targeted restoration strategies can be selected. For example, in cases of overall shift, concentrated features can be extracted to offset the shift effect; in cases of polarization, multiple cluster centers can be retained for subsequent selection; and in cases of random fluctuation, ineffective restoration can be avoided. Therefore, by identifying the type of tendency, it can be ensured that the restoration process matches the data change pattern, thereby improving the accuracy and reliability of the restoration results.
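One way to picture this tendency-dependent strategy is the Python sketch below. The concrete choices are assumptions, not details taken from this application: the median as the concentration feature, a split at the largest gap to find two cluster centers, and an empty result for random fluctuation. Any correction of the shift itself, for example against knowledge constraints, is outside this sketch.

```python
import statistics

def split_at_largest_gap(values):
    """Split sorted values into two clusters at the widest gap (needs >= 2 values)."""
    ordered = sorted(values)
    cut = max(range(1, len(ordered)), key=lambda k: ordered[k] - ordered[k - 1])
    return ordered[:cut], ordered[cut:]

def candidate_values(values, tendency):
    """Choose raw candidate values according to the identified tendency type."""
    if tendency == "overall shift":
        # A single concentration feature; here, the median.
        return [statistics.median(values)]
    if tendency == "polarization":
        # Keep one center per cluster for later selection.
        low, high = split_at_largest_gap(values)
        return [statistics.mean(low), statistics.mean(high)]
    # Random fluctuation: no candidate is proposed.
    return []

print(candidate_values([22.0, 22.5, 23.0, 28.0, 28.5, 29.0], "polarization"))
```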
[0072] To identify the type of tendency, in some embodiments, when different expressions include numerical differences, statistics are calculated based on the values of technical feature words in different AI-generated content variants to determine the numerical distribution pattern; the statistics include at least one of mean, median, variance, skewness, or kurtosis; when different expressions include terminological differences, the degree of differentiation of terminological expressions is determined based on the semantic distance of the expression forms, the semantic distance being calculated based on synonym mapping or vector space models; when different expressions include differences in limiting conditions, the change pattern of limiting conditions is determined based on the amount of change in the limiting conditions, the amount of change in conditions including the number of increases or decreases in limiting conditions or the expansion or contraction of the limiting range; the type of tendency is identified based on the numerical distribution pattern, degree of differentiation, or change pattern.
[0073] For numerical differences, the distribution pattern of the values can be determined by calculating statistics. For example, if the values are generally high and concentrated with a small variance, it indicates that the values of each AI-generated content variant have a consistent upward offset relative to the original values, which can be identified as an overall shift; if the values are split into two obvious clusters with a small absolute skewness and low kurtosis, it can be identified as polarization; if the values are randomly scattered with large variance and no obvious clustering characteristics, it can be identified as random fluctuation.
[0074] For example, as shown in Figure 3, cluster 1 contains 22%, 22.5%, and 23%, and cluster 2 contains 28%, 28.5%, and 29%. The statistics are: mean 25.5%, median 25.5%, population variance ≈ 9.2 (large), skewness ≈ 0 (symmetric), and kurtosis ≈ 1.1 (low). The large variance, near-zero skewness, and low kurtosis, together with the two visible peaks in the data, indicate that the tendency type is polarization.
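The statistics for the six values in this example can be checked with a short Python sketch. It uses population moments (dividing by n) and non-excess kurtosis; both are assumptions, and the exact figures shift somewhat under other conventions such as sample variance.

```python
import statistics

def distribution_stats(values):
    """Population moments of a list of numbers (assumes the values are not all equal)."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    m2 = sum(d ** 2 for d in deviations) / n   # population variance
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,              # non-excess; a normal distribution gives 3
    }

values = [22.0, 22.5, 23.0, 28.0, 28.5, 29.0]
stats = distribution_stats(values)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

A kurtosis well below 3 together with near-zero skewness is the signature of a symmetric two-peaked sample, which matches the polarization reading of the example.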
[0075] For terminological differences, the degree of differentiation in terminological expression can be determined by calculating semantic distance. Semantic distance can be calculated based on thesaurus mapping (e.g., calculating the semantic distance between terms based on a domain thesaurus) or vector space models (e.g., mapping terms to vectors based on pre-trained models such as BERT, and determining the semantic distance by calculating the cosine distance between vectors). If the semantic distance of all terms is small, pointing to the same semantic center, it can be identified as an overall shift; if terms cluster into multiple semantic centers, it can be identified as polarization; if there is no stable correspondence between terms and the semantic distances are chaotic, it can be identified as random fluctuations.
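The vector-space route can be illustrated with the Python sketch below, which computes cosine distance and greedily groups terms into semantic centers. The two-dimensional vectors are made-up stand-ins for embeddings from a pre-trained model, and the 0.2 threshold and greedy grouping rule are assumptions for this example.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def semantic_centers(vectors, threshold=0.2):
    """Greedy grouping: a term joins the first group whose representative is close enough."""
    representatives, groups = [], []
    for term, vec in vectors.items():
        for i, rep in enumerate(representatives):
            if cosine_distance(vec, rep) <= threshold:
                groups[i].append(term)
                break
        else:
            representatives.append(vec)
            groups.append([term])
    return groups

# Toy vectors, not produced by any real model.
toy_vectors = {
    "PCE": (0.9, 0.1),
    "efficiency": (0.8, 0.2),
    "photoelectric conversion efficiency": (0.85, 0.15),
    "annealing temperature": (0.1, 0.9),
}
groups = semantic_centers(toy_vectors)
print(groups)
```

A single resulting group corresponds to terms pointing at one semantic center, while several well-separated groups correspond to the multi-center case described above.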
[0076] For differences in limiting conditions, the pattern of change in the limiting conditions can be determined by analyzing the amount of change in the conditions. The amount of change in the conditions includes the number of additions or subtractions of the limiting conditions, for example, variant A adds a limiting condition; or the expansion or contraction of the limiting range, for example, the temperature range changes from "room temperature" to "60-80℃".
[0077] If the limiting conditions are strengthened or weakened as a whole, it can be identified as an overall shift; if some conditions are strengthened and others are weakened, showing at least two different directions of change, it can be identified as polarization; if the change is irregular, it can be identified as random fluctuation.
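The count-based part of the condition change can be sketched with set differences, as in the Python example below. The choice of a baseline condition set to compare against is an assumption (the original content is not known in advance), and the expansion or contraction of a numeric range is not modeled. Mixed directions are reported as "mixed" because this simple count cannot separate polarization from random fluctuation.

```python
def condition_change(baseline, variant):
    """Conditions added or removed by a variant relative to a baseline set."""
    added = variant - baseline
    removed = baseline - variant
    return {"added": added, "removed": removed, "net": len(added) - len(removed)}

def change_direction(net_changes):
    """Summarize the net changes of several variants."""
    nonzero = [n for n in net_changes if n != 0]
    if not nonzero:
        return "unchanged"
    if all(n > 0 for n in nonzero):
        return "strengthened"   # conditions added throughout
    if all(n < 0 for n in nonzero):
        return "weakened"       # conditions removed throughout
    return "mixed"

# Hypothetical baseline: the most complete condition set seen among the variants.
baseline = {"AM1.5G illumination", "25℃"}
variant_b = {"AM1.5G illumination"}
variant_c = set()

nets = [condition_change(baseline, v)["net"] for v in (variant_b, variant_c)]
print(nets, change_direction(nets))
```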
[0078] When different expressions may include at least two of the following: numerical differences, terminological differences, or differences in limiting conditions, in some embodiments, the tendency type is identified based on the distribution characteristics among the different expressions, including: if numerical differences exist, determining the concentration of the technical topic based on the distribution characteristics of the numerical differences; and / or, if terminological differences exist, determining the concentration of the technical topic based on the distribution characteristics of the terminological differences; if differences in limiting conditions exist, determining the trend of change in the technical scope based on the distribution characteristics of the differences in limiting conditions; and determining the tendency type of the technical feature words based on the concentration of the technical topic and / or the trend of change in the technical scope.
[0079] Understandably, in practical applications, the same technical feature term may exhibit multiple types of differences across different AI-generated content variants. For example, a particular technical feature term may simultaneously exhibit numerical differences and differences in limiting conditions. In such cases, it is necessary to comprehensively consider the distribution characteristics of multiple types of differences to determine the tendency type.
[0080] Specifically, the concentration of technical topics is first determined based on the distribution characteristics of numerical and terminological differences. The concentration of technical topics is used to characterize whether each variant points to the same technical concept. If the numerical differences show an overall shift and the terminological differences also point to the same semantic center, the concentration of technical topics is high, indicating that each variant describes the same technical solution. If the numerical differences show polarization or the terminological differences have multiple semantic centers, the concentration of technical topics is low, indicating that each variant may describe different technical solutions.
[0081] Secondly, based on the distribution characteristics of the differences in the limiting conditions, the changing trend of the technical scope is determined. The changing trend of the technical scope is used to characterize the direction of change of the applicable scope of the technical solution in each variant. If the limiting conditions are strengthened as a whole, for example, more test conditions are added, then the technical scope is narrowed; if the limiting conditions are weakened as a whole, for example, some limiting conditions are deleted, then the technical scope is expanded; if the changes in the limiting conditions are inconsistent, then the changing trend of the technical scope is uncertain.
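The two intermediate judgments described above can be sketched as two small functions. This is a minimal sketch under stated assumptions: the function names, the string labels, the encoding of each limiting-condition change as +1 (strengthened), -1 (weakened), or 0 (unchanged), and the "undetermined" and "unchanged" fallbacks are hypothetical and are not defined by this application.

```python
def topic_concentration(numeric_pattern=None, term_pattern=None):
    """Judge whether the variants point to the same technical concept.

    Each argument is a tendency label ("overall shift", "polarization",
    "random fluctuation") or None when that difference type is absent.
    """
    patterns = [p for p in (numeric_pattern, term_pattern) if p is not None]
    if not patterns:
        return None  # neither numerical nor terminological differences exist
    if "polarization" in patterns:
        return "low"   # values split, or terms have multiple semantic centers
    if all(p == "overall shift" for p in patterns):
        return "high"  # consistent shift and a single semantic center
    return "undetermined"  # assumption: this case is not covered by the text

def scope_trend(condition_changes):
    """Judge the direction of change of the technical scope.

    condition_changes holds one signed entry per variant:
    +1 strengthened, -1 weakened, 0 unchanged (hypothetical encoding).
    """
    changed = [c for c in condition_changes if c != 0]
    if not changed:
        return "unchanged"  # assumption: no variant altered the conditions
    if all(c > 0 for c in changed):
        return "narrowed"   # conditions strengthened as a whole
    if all(c < 0 for c in changed):
        return "expanded"   # conditions weakened as a whole
    return "uncertain"      # inconsistent changes
```

Keeping the two judgments separate mirrors the hierarchical order in the description: topic concentration is decided first, and the scope trend is decided independently from the limiting conditions.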
[0082] Finally, by considering the changing trends in the concentration of technical topics and the scope of technology, the tendency type of technical feature words is determined. This hierarchical comprehensive judgment mechanism can effectively handle complex variation patterns, such as situations where the technical topics are consistent but the limiting conditions are fragmented, thereby improving the accuracy of tendency type identification.
[0083] After identifying the tendency type, proceed to step S400.
[0084] S400: Generate at least one candidate restoration value for technical feature words based on the tendency type and knowledge constraints.
[0085] In the scenario described in this application, if the restored value is generated solely based on the tendency type, it may exceed the range of values within the domain. For example, the statistically calculated efficiency value may exceed the physical limit, or the material parameters may contradict industry consensus. In such cases, by introducing knowledge constraints for correction, comparing the statistical features with the preset range in the domain knowledge base, and adjusting or truncating values that exceed the range, the restoration result can be ensured to be technically reliable.
[0086] By combining tendency type and knowledge constraints, we can approximate the original value by utilizing the statistical regularity of the data itself, and eliminate abnormal interference by leveraging domain knowledge, so that the generated candidate restored value has both statistical rationality and domain reliability.
[0087] Here, the candidate restored value is the most likely original value of the technical feature word, inferred from statistical analysis and knowledge constraints. Combining the tendency type with knowledge constraints ensures that the restored result conforms to the statistical law of the data distribution and does not violate common sense in the domain.
[0088] In some embodiments, generating at least one candidate restored value for a technical feature word based on a tendency type and knowledge constraints includes: extracting statistical features from different expressions based on the tendency type; comparing the statistical features with knowledge constraints, outputting the comparison result, and generating at least one candidate restored value based on the comparison result, wherein the knowledge constraints include a preset range in a domain knowledge base.
[0089] Understandably, different tendency types correspond to different statistical feature extraction strategies. If the tendency type is an overall shift, then the central tendency measures of different expressions are extracted as statistical features, including at least one of the mean, median, or mode. In this case, since the expressions of each variant have a consistent shift direction relative to the original value, the shift effect can be offset by extracting central tendency measures (such as the mean, median, or mode), resulting in statistical features that are close to the original value.
[0090] If the tendency type is polarization, multiple cluster centers of different expressions are extracted as statistical features. At this time, the expression of each variant is split into multiple clusters, and each cluster may correspond to different propagation paths or rewriting strategies. Therefore, it is necessary to extract multiple cluster centers as statistical features to retain the representative values of each cluster.
[0091] If the tendency type is random fluctuation, extract the preset quantiles of different expressions as statistical features, or terminate the generation of candidate restored values. In this case, the expressions of each variant are randomly scattered, making it difficult to restore the original value through central tendency. Therefore, the preset quantiles can be extracted as statistical features, or the generation of candidate restored values can be terminated directly to avoid outputting unreliable restoration results.
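The three extraction strategies can be sketched with Python's standard `statistics` module. This is an illustrative sketch only: the function name, the dictionary layout, the choice of quartiles as the "preset quantiles", and the largest-gap split used to find two cluster centers in one-dimensional data are assumptions made here; the application does not prescribe a clustering algorithm or a number of clusters. The alternative of terminating generation for random fluctuation is not shown.

```python
import statistics

def extract_statistical_features(values, tendency, quantile_cuts=4):
    """Extract statistical features from numeric expressions by tendency type."""
    if len(values) < 2:
        raise ValueError("at least two values are required")
    if tendency == "overall shift":
        # Central tendency measures offset a consistent shift direction.
        return {
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
        }
    if tendency == "polarization":
        # Assumption: split sorted values at the largest gap into two clusters.
        ordered = sorted(values)
        gaps = [b - a for a, b in zip(ordered, ordered[1:])]
        split = gaps.index(max(gaps)) + 1
        low, high = ordered[:split], ordered[split:]
        return {"cluster_centers": [statistics.fmean(low), statistics.fmean(high)]}
    if tendency == "random fluctuation":
        # Assumption: quartiles stand in for the "preset quantiles".
        return {"quantiles": statistics.quantiles(values, n=quantile_cuts)}
    raise ValueError(f"unknown tendency type: {tendency!r}")
```

For instance, the values 10, 11, 30, and 31 under polarization are split at the gap between 11 and 30, giving cluster centers 10.5 and 30.5.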
[0092] As shown in Figure 4, the statistical features are then compared with the knowledge constraints, the comparison results are output, and at least one candidate restoration value is generated based on the comparison results. In some embodiments, the method includes: if the statistical feature is within a preset range, it is used as a candidate restoration value; if the statistical feature is higher than the upper limit of the preset range, the central tendency measure is recalculated based on the values below the upper limit in different expressions, and used as a candidate restoration value; if the statistical feature is lower than the lower limit of the preset range, the central tendency measure is recalculated based on the values above the lower limit in different expressions, and used as a candidate restoration value; if the values in different expressions all exceed the preset range, a candidate restoration value is generated based on the boundary of the preset range.
[0093] Understandably, knowledge constraints are pre-defined reasonable value ranges within a domain knowledge base. For example, the domain knowledge base could pre-determine the photoelectric conversion efficiency of perovskite solar cells as 5%-30%. By comparing statistical characteristics with knowledge constraints, outliers caused by alterations or exaggerations during the dissemination process can be filtered out.
[0094] Specifically, if a statistical characteristic (such as the mean) is within a preset range, it means that the value is within a reasonable range in the domain and can be directly used as a candidate restoration value.
[0095] If the statistical characteristic is higher than the upper limit of the preset range, it indicates that there may be an exaggeration effect. In this case, outliers higher than the upper limit can be excluded, and the central tendency measure can be recalculated from the remaining values below the upper limit and used as the candidate restoration value.
[0096] If the statistical characteristic is lower than the lower limit of the preset range, it indicates that there may be an underestimation effect. In this case, outliers below the lower limit can be excluded, and the central tendency measure can be recalculated from the remaining values above the lower limit and used as the candidate restoration value.
[0097] If all values exceed the preset range, it means that the current data quality is insufficient to support reliable restoration. In this case, candidate restoration values can be generated based on the boundary values of the preset range (such as the upper or lower limit) and marked as low confidence.
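The four cases above can be sketched as a single function. This is a minimal sketch, not the claimed implementation: the function name, the returned (value, low_confidence) pair, checking the "all values out of range" case first, and choosing the boundary nearest to the statistic are assumptions made here. The 5%-30% range in the usage note is the perovskite example from the description.

```python
import statistics

def constrain_candidate(values, lower, upper, stat=statistics.fmean):
    """Compare a statistical feature with a preset range [lower, upper].

    Returns (candidate_value, low_confidence_flag).
    """
    center = stat(values)
    in_range = [v for v in values if lower <= v <= upper]
    if not in_range:
        # All values exceed the preset range: fall back to a boundary value.
        # Assumption: pick the boundary closest to the statistic.
        nearer_upper = abs(center - upper) <= abs(center - lower)
        return (upper if nearer_upper else lower), True
    if center > upper:
        # Possible exaggeration: drop values above the upper limit.
        center = stat([v for v in values if v <= upper])
    elif center < lower:
        # Possible underestimation: drop values below the lower limit.
        center = stat([v for v in values if v >= lower])
    return center, False
```

With efficiency values of 24, 26, 28, and 60 and a preset range of 5 to 30, the mean of 34.5 exceeds the upper limit, so 60 is excluded and the recalculated mean of 26.0 becomes the candidate. The sketch follows the four cases literally and does not re-check the recalculated value against the opposite limit.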
[0098] Comparing statistical features with knowledge constraints prevents the restoration results from violating common sense in the domain, thereby improving the reliability of the restoration results.
[0099] After generating candidate restoration values, step S500 is executed.
[0100] S500: Evaluate the credibility of candidate restored values and output the restoration results, which include the target restored value and the corresponding credibility evaluation information.
[0101] Understandably, since the restoration results are based on statistical inference, their reliability may be affected by various factors such as data quality, the number of variants, and the credibility of propagation nodes. Therefore, evaluating the credibility of candidate restoration values can provide users with a reference for the reliability of the restoration results.
[0102] In some embodiments, the credibility of the candidate restored values is evaluated and the restoration result is output, including steps S510-S550.
[0103] S510: Filter target nodes, which are the propagation nodes where AI-generated content variants that participate in constituting candidate restoration values are located.
[0104] For example, if a candidate restoration value is calculated based on the values of variant A, variant B, and variant C, then the propagation nodes of these three AI-generated content variants are all target nodes.
[0105] As another example, if a candidate restored value is calculated from the value of variant B alone, then the propagation node where variant B is located is the target node.
[0106] S520: Determine the credibility weight of AI-generated content variants based on at least one of the target node type, publication time, or propagation path.
[0107] Understandably, different types of dissemination nodes have different levels of credibility. For example, official databases (such as the patent office database) are more credible than personal blogs; content published earlier is more likely to be close to the original content and has higher credibility; content with a shorter dissemination path (i.e., passing through fewer intermediate links) is more likely to maintain the integrity of the original information and has higher credibility.
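One way to turn these three qualitative factors into a numeric weight is sketched below. Every number and name in it is an assumption introduced for illustration: the application does not specify node-type scores, a time-decay formula, a path-length formula, or how the factors are combined. Only the ordering (official databases above personal blogs, earlier above later, shorter paths above longer) comes from the description.

```python
# Hypothetical node-type scores; only the ordering is taken from the description.
NODE_TYPE_SCORE = {
    "official_database": 1.0,
    "personal_blog": 0.3,
}
DEFAULT_NODE_SCORE = 0.3  # assumption: unknown node types get the lowest score

def credibility_weight(node_type, days_after_earliest, hops):
    """Combine node type, publication time, and propagation path length.

    days_after_earliest: days between this variant and the earliest known one.
    hops: number of intermediate links on the propagation path.
    """
    type_score = NODE_TYPE_SCORE.get(node_type, DEFAULT_NODE_SCORE)
    time_score = 1.0 / (1.0 + days_after_earliest / 30.0)  # earlier is higher
    path_score = 1.0 / (1.0 + hops)                        # shorter is higher
    # Assumption: an unweighted average of the three factors.
    return (type_score + time_score + path_score) / 3.0
```

Under these assumed formulas, a variant published first on an official database with no intermediate links receives the maximum weight of 1.0.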
[0108] S530: Determine the consistency coefficient based on the degree of consistency between the candidate restored value and the tendency type, where the tendency type is determined based on at least one of numerical differences, terminological differences, or differences in limiting conditions.
[0109] The generation process of candidate restored values depends on the judgment of the tendency type. If the candidate restored values are highly consistent with the tendency type determined based on the difference type, it indicates that the restoration process is logically self-consistent and the restoration result is more reliable.
[0110] For example, if the tendency type is an overall shift and the candidate restored values are indeed located in the concentrated area of each variant value, then the consistency coefficient is high.
[0111] S540: Calculate the comprehensive score of the candidate restored value based on the credibility weight and consistency coefficient.
[0112] The overall score can be calculated using weighted summation, weighted average, or other comprehensive evaluation methods.
[0113] S550: Sort multiple candidate restoration values according to the comprehensive score and output the sorting results, which include one or two restoration results.
[0114] The sorting results include one or two restoration results for the user to select and confirm. By outputting the sorting results, the user can select the most suitable restoration value based on the credibility assessment information, thereby enhancing the practicality of the solution.
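Steps S540 and S550 can be sketched as a weighted sum followed by a sort. This is an illustrative sketch under stated assumptions: the function name, the dictionary keys, averaging the credibility weights of the contributing variants, and the mixing factor alpha of 0.5 are hypothetical choices; the application only states that a weighted summation, weighted average, or other comprehensive evaluation method may be used.

```python
def rank_candidates(candidates, alpha=0.5, top_k=2):
    """Score candidate restored values and return the best one or two.

    Each candidate is a dict with:
      "value":       the candidate restored value
      "weights":     credibility weights of the variants that contributed to it
      "consistency": consistency coefficient in [0, 1]
    alpha is a hypothetical mixing factor between the two terms.
    """
    scored = []
    for cand in candidates:
        mean_weight = sum(cand["weights"]) / len(cand["weights"])
        score = alpha * mean_weight + (1.0 - alpha) * cand["consistency"]
        scored.append({"value": cand["value"], "score": score})
    # Highest comprehensive score first; keep at most top_k results.
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:top_k]
```

Returning the score alongside each value lets the output carry the credibility evaluation information together with the restored value, so a user can compare the top candidates before confirming one.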
[0115] Through the above steps, the data processing method provided in this application embodiment can reversely restore the original true information from multiple variants distributed at different propagation nodes by means of statistical regularities and domain knowledge constraints.
[0116] Specifically, screening candidate content that contains the same technical feature words and performing correlation analysis ensures that the obtained variants are comparable. Semantic alignment and extraction of target expressions unify the expression forms of the same technical meaning in different AI-generated content variants. Classifying target expressions into numerical differences, terminological differences, or differences in limiting conditions, and quantitatively analyzing each, makes the restoration process more consistent with the internal logic of technical documents. The hierarchical comprehensive judgment mechanism, which first judges the concentration of technical topics and then the changing trend of the technical scope, improves the accuracy of tendency type identification. Extracting statistical features and comparing them with knowledge constraints to generate candidate restoration values makes the restoration results conform to statistical laws without violating common sense in the domain. Comprehensively considering the characteristics of propagation nodes and cross-difference consistency when assessing credibility and ranking the output provides users with selectable restoration results and credibility references, enhancing the practicality of the solution. The entire restoration process is based on quantifiable statistical features and domain knowledge, so the output results are interpretable and can be used as a basis for subsequent decision-making.
[0117] Similar parts between the embodiments provided in this application can be referred to mutually. The specific implementation methods provided above are only a few examples under the overall concept of this application and do not constitute a limitation on the scope of protection of this application. For those skilled in the art, any other implementation methods extended from the solution of this application without creative effort shall fall within the scope of protection of this application.
Claims
1. A data processing method, characterized in that it includes: obtaining multiple AI-generated content variants distributed across different propagation nodes, wherein the multiple AI-generated content variants include the same technical feature words, and the technical feature words include one or a combination of performance index words, physical quantity words, material composition words, process step words, and limiting condition words; performing cross-node comparison on the multiple AI-generated content variants to identify different expressions of the technical feature words in different AI-generated content variants, wherein the different expressions include at least one of numerical differences, terminological differences, or differences in limiting conditions; identifying a tendency type based on the distribution characteristics among the different expressions, wherein the tendency type includes overall shift, polarization, or random fluctuation; generating at least one candidate restoration value for the technical feature words based on the tendency type and knowledge constraints; and evaluating the credibility of the candidate restored values and outputting the restoration result, wherein the restoration result includes the target restored value and the corresponding credibility evaluation information. The cross-node comparison of the multiple AI-generated content variants to identify different expressions of the technical feature words in different AI-generated content variants is performed as follows. 
This includes: performing natural language processing on the AI-generated content variants to locate the expression forms of the technical feature words in the AI-generated content variants; semantically aligning different expression forms representing the same technical meaning in different AI-generated content variants to obtain aligned feature words; extracting target expressions, where the target expressions are the values or modified expressions corresponding to the aligned feature words in the AI-generated content variants; and determining the different expressions of the technical feature words in different AI-generated content variants based on the target expressions. Based on the target expression, the different expressions of the technical feature words in different AI-generated content variants are determined, including: when the target expression is a value, the difference in the corresponding value in different AI-generated content variants is determined as a numerical difference; when the modified expression of the target expression is a different expression of the technical feature word with the same technical meaning, the difference in the corresponding expression in different AI-generated content variants is determined as a terminology difference; when the modified expression of the target expression is a conditional adverbial, scope limitation, or environmental parameter attached to the technical feature word, the difference in the conditional adverbial, scope limitation, or environmental parameter in different AI-generated content variants is determined as a limiting condition difference. 
Based on the distribution characteristics among the different expressions, the tendency type is identified, including: when the different expressions include numerical differences, calculating statistics based on the values of the technical feature words in different AI-generated content variants to determine the numerical distribution pattern; the statistics include at least one of mean, median, variance, skewness, or kurtosis; when the different expressions include terminological differences, determining the degree of differentiation of terminological expressions based on the semantic distance of the expression forms, the semantic distance being calculated based on synonym mapping or vector space model; when the different expressions include differences in limiting conditions, determining the change pattern of limiting conditions based on the change in the limiting conditions, the change in the limiting conditions including the increase or decrease in the number of limiting conditions or the expansion or contraction of the limiting range; and identifying the tendency type based on the numerical distribution pattern, degree of differentiation, or change pattern.
2. The data processing method according to claim 1, characterized in that, Obtain multiple AI-generated content variants distributed across different propagation nodes, including: Acquire candidate AI-generated content from multiple propagation nodes; Natural language processing is performed on the candidate AI-generated content to identify technical feature words; Candidate AI-generated content containing the same technical feature words is selected to obtain the first variant set; Feature extraction is performed on multiple candidate AI-generated content in the first variant set to obtain target feature information, wherein the target feature information includes at least one of content fingerprint, propagation chain information or metadata; Based on the target feature information, correlation analysis is performed on the candidate AI-generated content in the first variant set to construct a candidate variant set, which includes multiple AI-generated content variants.
3. The data processing method according to claim 1, characterized in that, Based on the distribution characteristics among the different expressions, the tendency type is identified, including: If the numerical differences and / or terminology differences exist, the concentration of the technical topics is determined based on the distribution characteristics of the numerical differences and / or terminology differences. If the aforementioned limiting conditions differ, the trend of change in the technical scope is determined based on the distribution characteristics of these differences. Based on the concentration of the technical topics and / or the changing trends of the technical scope, determine the tendency type of the technical feature words.
4. The data processing method according to claim 1, characterized in that, Based on the stated tendency type and knowledge constraints, at least one candidate restoration value for the technical feature words is generated, including: Based on the stated tendency type, statistical features are extracted from the different statements; The statistical features are compared with the knowledge constraints, the comparison results are output, and at least one candidate restoration value is generated based on the comparison results. The knowledge constraints include a preset range in the domain knowledge base.
5. The data processing method according to claim 4, characterized in that, Based on the stated tendency type, statistical features are extracted from the different statements, including: If the tendency type is an overall shift, the central tendency features of the different expressions are extracted as statistical features, and the central tendency features include at least one of the mean, median or mode; If the tendency type is polarization, extract multiple cluster centers of the different expressions as statistical features; If the tendency type is random fluctuation, extract the preset quantiles of the different expressions as statistical features, or terminate the generation of candidate restored values.
6. The data processing method according to claim 5, characterized in that, The statistical features are compared with the knowledge constraints, the comparison results are output, and at least one candidate restoration value is generated based on the comparison results, including: If the statistical feature is within the preset range, the statistical feature is used as a candidate restoration value; If the statistical feature is higher than the upper limit of the preset range, the central tendency feature is recalculated based on the values lower than the upper limit in the different expressions, and used as a candidate restoration value; If the statistical feature is lower than the lower limit of the preset range, the central tendency feature is recalculated based on the values higher than the lower limit in the different expressions, and used as a candidate restoration value; If the values in the different expressions all exceed the preset range, then candidate restoration values are generated based on the boundaries of the preset range.
7. The data processing method according to claim 1, characterized in that, The reliability of the candidate restored values is evaluated, and the restoration result is output, including: Filter target nodes, where the target nodes are the propagation nodes of AI-generated content variants that participate in constituting candidate restoration values; The credibility weight of the AI-generated content variant is determined based on at least one of the target node's type, publication time, or propagation path. A consistency coefficient is determined based on the degree of consistency between the candidate restored value and the tendency type determined based on at least one of the numerical difference, the terminological difference, or the limiting condition difference; Calculate the comprehensive score of the candidate restored value based on the credibility weight and the consistency coefficient; The candidate restoration values are sorted according to the comprehensive score, and the sorting result is output, which includes one or two restoration results.