A Method and System for Automated Processing of R&D Data Based on Keyword Extraction

The method extracts keywords, matches text against chemical drug terminology, and uses historical data ranges to identify anomalous data. This addresses the trade-off between accuracy and efficiency in chemical drug R&D data processing and enables efficient data organization and anomaly identification.

CN121996739B · Active · Publication Date: 2026-06-30 · ZHEJIANG FINGARD TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG FINGARD TECH CO LTD
Filing Date
2026-04-09
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies cannot balance accuracy and efficiency in chemical drug research and development data processing, and there are difficulties in identifying abnormal data.

Method used

Keywords are extracted with a text extraction algorithm and matched against chemical drug terminology. Abnormal data is then automatically identified and marked using location coordinates, historical data ranges, and conflict relationships, and a required result table is constructed.

Benefits of technology

It improves the efficiency and accuracy of data processing for chemical drug research and development, reduces the impact of abnormal data on subsequent analysis, and ensures the accuracy and consistency of data processing.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses a method and system for automated processing of R&D data based on keyword extraction, belonging to the field of automated data processing technology. The method includes: initially extracting text information from chemical drug R&D data using a text extraction algorithm to obtain keywords; further extracting keywords using chemical drug terminology to obtain a target text set; responding to user processing needs, retrieving target text matching the user's processing needs and its location coordinates from the target text set; constructing a required result table based on the target text; retrieving data information corresponding to the target text from the chemical drug R&D data based on the location coordinates and adding it to the required result table; and identifying and marking abnormal results in the data information of the required result table based on the data range and data conflict relationships in historical chemical drug-related data. The beneficial effects of this application are: improving the processing efficiency and accuracy of chemical drug R&D data.

Description

Technical Field

[0001] This application relates to the field of automated data processing technology, and in particular to a method and system for automated processing of R&D data based on keyword extraction.

Background Technology

[0002] In chemical drug research and development experiments, R&D companies regularly submit the drugs they develop to the corresponding authoritative monitoring laboratories. The laboratories then return the experimental monitoring data to the R&D companies in the form of reports. The R&D companies' staff must regularly organize the experimental data and convert it into structured data for daily operation and management.

[0003] However, chemical drug research and development data is characterized by numerous technical terms, complex data dimensions, inconsistent formats, and extremely high accuracy requirements. Related technologies rely on manual screening of key information from massive amounts of data, which is not only time-consuming and labor-intensive, but also prone to affecting subsequent research and development trials due to subjective human judgment or information omissions.

[0004] In some automated data extraction and data table integration technologies, only the data is extracted and organized. Whether the data is abnormal still needs to be judged manually. Furthermore, data anomalies are prone to occur due to retrieval errors, so the organized data table still requires a lot of time to filter out abnormal data.

[0005] The patent "Method, Device, Medium, and Program Product for Processing Tabular Data," publication number CN120407645A, published on August 1, 2025, specifically discloses a method including: inputting tabular data and associated processing requests from an interactive interface into a preset model; generating tool invocation requirements corresponding to the processing requests; sending the tool invocation requirements to a task processing engine; inputting tool information returned by the task processing engine into the preset model to obtain multiple target tools, including reading tools and processing tools; if it is determined that the task processing engine has successfully read the tabular data using the reading tools, inputting the reading results and functional information of the processing tools into the preset model to obtain a processing task corresponding to the processing requests; sending the processing task to the task processing engine; and displaying the execution results on the interactive interface in response to receiving the execution results from the task processing engine. While this solution automates data processing, it cannot identify abnormal data and is unsuitable for scenarios requiring high data accuracy, such as data processing in chemical pharmaceutical research and development.

Summary of the Invention

[0006] This application addresses the problem that existing data processing methods cannot simultaneously achieve both accuracy and efficiency when integrating and processing large amounts of chemical drug R&D data. It provides an automated R&D data processing method and system based on keyword extraction. By identifying and extracting keywords and matching location coordinates, the system achieves automatic processing and integration of chemical drug R&D data. Furthermore, by analyzing the data range and data conflict relationships in historical chemical drug-related data, the system identifies data anomalies and provides users with guidance on handling abnormal data, thereby improving the accuracy and efficiency of automated data processing.

[0007] To achieve the aforementioned technical objectives, this application provides the following technical solution: an automated processing method for R&D data based on keyword extraction, comprising the following steps: performing initial extraction on text information in chemical drug R&D data using a text extraction algorithm to obtain keywords; performing secondary extraction on the keywords using chemical drug proper nouns to obtain a target text set, wherein each target text in the target text set has corresponding location coordinates; in response to a user's processing needs, retrieving from the target text set the target texts matching those needs and their location coordinates; constructing a required result table based on the target texts; retrieving data information corresponding to the target texts from the chemical drug R&D data according to the location coordinates and adding it to the required result table; and identifying anomalies in the data information in the required result table based on the data range and data conflict relationships in historical chemical drug-related data, and marking abnormal results.

[0008] Furthermore, the step of identifying anomalies in the data information of the required results table based on the data range and data conflict relationships in the historical chemical drug-related data, and marking abnormal results, includes: obtaining a first type of conflict relationship based on the proportion of various chemical drug proper noun types coexisting in the historical chemical drug-related data; dividing the historical chemical drug-related data according to the chemical drug proper noun types to obtain a first category set; obtaining the data range corresponding to the chemical drug proper noun type using the first category set; wherein the data range includes at least the text range and the numerical range; obtaining a second category set using historical chemical drug-related data with the same combination of chemical drug proper noun types; constructing a second combination conflict relationship based on the difference in the data ranges of the first category set and the second category set; constructing a data conflict relationship using the first type of conflict relationship and the second combination conflict relationship; and determining abnormal results based on the data range and the data conflict relationship.

[0009] Furthermore, determining abnormal results based on data range and data conflict relationships includes: if the data information conforms to the data range, then performing secondary anomaly identification based on the first type of conflict relationship and the second combined conflict relationship; if the data information does not conform to the data range, then marking it as an abnormal result; if the data information is identified as having a conflict relationship based on the first type of conflict relationship and the second combined conflict relationship, then marking it as an abnormal result.

[0010] Furthermore, the step of constructing a second combination conflict relationship based on the data range differences between the first category set and the second category set includes: obtaining data range differences containing textual and numerical difference information based on the first category set and the second category set; and constructing the second combination conflict relationship based on how the data range differences change across different combinations of chemical drug proper noun types.

[0011] Furthermore, it also includes: using the Qwen3-Turbo model architecture to train based on historical chemical drug research and development tables and chemical drug terminology, learning chemical drug terminology and table construction methods, and constructing an experimental data processing model; wherein, the experimental data processing model responds to user processing needs.

[0012] Furthermore, it also includes: constructing terminology mapping relationships based on the synonyms and antonyms of chemical drug proper nouns; performing initial identification of chemical drug R&D data based on the antonyms of chemical drug proper nouns to obtain antonym-related data; performing unified terminology replacement on chemical drug R&D data based on the synonyms of chemical drug proper nouns, and replacing invalid antonym-related data to obtain preprocessed chemical drug R&D data.

[0013] Furthermore, it also includes: constructing a first correction relationship based on the physicochemical properties and uses of chemical drugs; traversing chemical drug R&D data to obtain data to be corrected based on the first correction relationship; obtaining correction information based on the source of the data to be corrected and the correction status of historical sources, and recording the correction information and the corresponding location coordinates; wherein, the correction information includes at least the correction ratio and correction type of the source of the data to be corrected; obtaining the identification data to be corrected based on the location coordinates of the abnormal results and the location coordinates corresponding to the correction information, performing anomaly identification on the identification data to be corrected based on the data range and data conflict relationship, and determining whether to display the identification data to be corrected.

[0014] Furthermore, the step of obtaining the identification data to be corrected based on the location coordinates of the abnormal result and the location coordinates corresponding to the correction information includes: if the location coordinates of the abnormal result are the same as the location coordinates of the correction information, then the correction information is retrieved, and the data information in the abnormal result is corrected according to the correction information to obtain the identification data to be corrected.

[0015] Furthermore, the step of performing anomaly identification on the identification data to be corrected based on the data range and data conflict relationship, and determining whether to display the identification data to be corrected, includes: determining, according to the correction ratio, the number of identification values to be corrected that are to be generated; generating that number of identification values to be corrected based on the correction type and the original data information; and performing anomaly identification on the identification values to be corrected based on the data range and data conflict relationship, filtering out those identified as abnormal results, and retaining and displaying those identified as non-abnormal results.

[0016] Another technical solution provided in this application is an automated R&D data processing system based on keyword extraction, used to implement the method described above, including: a configuration database for storing chemical drug terminology; a target text extraction module for initially extracting keywords from text information in chemical drug R&D data using a text extraction algorithm, and then using the chemical drug terminology in the configuration database to perform secondary extraction on the keywords to obtain a target text set; a table construction module for retrieving target text matching the user's processing requirements and their location coordinates from the target text set, constructing a required result table based on the target text, and retrieving data information corresponding to the target text from the chemical drug R&D data based on the location coordinates to the required result table; and an anomaly identification module for identifying anomalies in the data information in the required result table based on the data range and data conflict relationships in historical chemical drug related data, and marking abnormal results.

[0017] The beneficial effects of this application are as follows: 1. By matching text in chemical drug R&D data with chemical drug proper nouns, the application obtains the chemical drug proper nouns present in the chemical drug R&D data as target text and records their location coordinates. Then, based on the user's processing requirements, it matches the target text that needs to be processed, retrieves numerical data based on the location coordinates of the target text, and organizes it into a required result table containing numerical data and target text. Furthermore, based on the data range of historical chemical drug-related data and the data conflict relationships within a single drug and between different drugs, it identifies and marks abnormal results in the required result table. Keyword matching and coordinate-based retrieval improve the efficiency of chemical drug R&D data processing, while the marking lets users see at a glance potential anomalies in the current required result table. This improves the accuracy of the required result table and prevents abnormal data from affecting the accuracy of subsequent user analyses based on chemical drug R&D data.

[0018] 2. Verify the filled data by considering the data range of a single chemical drug proper noun type, the coexistence conflicts between chemical drug proper noun types, and the data range conflicts between combinations of chemical drug proper noun types. Utilize the specific characteristics of chemical drug data to verify the correctness of the filled data, so as to avoid the filling of incorrect data interfering with the user's subsequent experiments.

[0019] 3. Use the number of data points within the data range as the benchmark for obtaining the number of identification values to be corrected, ensuring that the number of values generated is compatible with the scale of historical normal data, avoiding too many or too few values, and ensuring that the generated identification values to be corrected are both representative and have a balanced generation efficiency.

Attached Figure Description

[0020] Figure 1 is a flowchart of the automated processing method for R&D data based on keyword extraction proposed in this application.

[0021] Figure 2 is a flowchart of the process of outputting the identification data to be corrected in the automated processing method for R&D data based on keyword extraction in this application.

Detailed Implementation

[0022] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description of this application is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely one preferred embodiment of this application and are only used to explain this application. They do not limit the scope of protection of this application. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0023] As shown in Figure 1, the automated processing method for R&D data based on keyword extraction includes the following steps:

[0024] Text extraction algorithms are used to initially extract text information from chemical drug research and development data to obtain keywords;

[0025] Keywords are extracted a second time using proper nouns for chemical drugs to obtain a target text set; each target text in the target text set has a corresponding location coordinate;

[0026] In response to the user's processing needs, retrieve the target text that matches the user's processing needs from the target text set, along with its location coordinates;

[0027] Build a required results table based on the target text;

[0028] Retrieve data information corresponding to the target text from chemical drug research and development data based on location coordinates and add it to the required results table;

[0029] Based on the data range and data conflict relationships in historical chemical drug-related data, anomalies are identified in the data information in the required results table, and abnormal results are marked.

[0030] In this embodiment, text in chemical drug R&D data is matched with chemical drug proper nouns to obtain the proper nouns present in the data as target text, and their location coordinates are recorded. Then, the target text to be processed is matched according to the user's processing requirements. Numerical data is retrieved based on the location coordinates of the target text and compiled into a required result table containing both numerical data and the target text. Anomalies in the required result table are identified and marked based on the data range of historical chemical drug-related data and the data conflict relationships within a single drug and between different drugs. Keyword extraction and matching, together with coordinate-based retrieval, automate large-scale data processing and improve the efficiency of chemical drug R&D data processing. At the same time, users can see at a glance potential anomalies in the current required result table, which improves the accuracy of the required result table and prevents abnormal data from affecting the accuracy of subsequent user analyses based on chemical drug R&D data.

[0031] The text extraction algorithm is the TextRank algorithm, which extracts keywords and summarizes documents from chemical drug research and development data. Keywords are extracted by utilizing the co-occurrence information (semantics) between words in the text information of chemical drug research and development data.

[0032] Specifically, the TextRank algorithm is as follows:

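[0033] $$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

[0034] where $WS(V_i)$ denotes the importance score of the $i$-th node $V_i$; $d$ denotes the damping coefficient; $In(V_i)$ denotes the set of predecessor nodes pointing to node $V_i$; $w_{ji}$ denotes the weight of the edge between nodes $V_j$ and $V_i$; $Out(V_j)$ denotes the set of successor nodes that the $j$-th node $V_j$ points to; $w_{jk}$ denotes the weight of the edge between nodes $V_j$ and $V_k$; and $WS(V_j)$ denotes the importance score of the $j$-th node $V_j$.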
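The iteration behind the TextRank score can be illustrated with a minimal pure-Python sketch. It assumes an undirected co-occurrence graph (so each node's neighbours serve as both predecessors and successors), a damping coefficient of 0.85, and a fixed number of iterations; the function names, the window size, and the sample tokens are illustrative assumptions, not part of this application.

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window=2):
    """Undirected weighted graph; edge weight = co-occurrence count within the window."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != word:
                graph[word][words[j]] += 1.0
                graph[words[j]][word] += 1.0
    return graph


def textrank(graph, d=0.85, iterations=50):
    """Iterate WS(V_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(V_j)."""
    scores = {node: 1.0 for node in graph}
    out_weight = {node: sum(graph[node].values()) for node in graph}
    for _ in range(iterations):
        new_scores = {}
        for vi in graph:
            rank = 0.0
            # Undirected graph: the neighbours of vi act as its predecessors.
            for vj, w_ji in graph[vi].items():
                rank += w_ji / out_weight[vj] * scores[vj]
            new_scores[vi] = (1 - d) + d * rank
        scores = new_scores
    return scores


# Hypothetical token sequence from a purity report.
tokens = ["aspirin", "purity", "test", "purity", "hplc", "purity"]
scores = textrank(build_cooccurrence_graph(tokens, window=1))
top_keyword = max(scores, key=scores.get)
```

In this sample, "purity" co-occurs with every other token, so it receives the highest score and would be kept as a keyword.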

[0035] In this embodiment, the jieba library for Python (a computer programming language) is used for text extraction, and the parts of speech to be extracted are set through the allowPOS (allowed part-of-speech tags) parameter. For example, the function is defined as "def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('nw','n','nt','v','t'), withFlag=False)", where def is the Python keyword that begins a function definition; textrank is the function implementing the TextRank algorithm; self is the conventional name of the instance parameter of a method; sentence is the text to be processed; topK is the number of top-ranked keywords to return; allowPOS=('nw','n','nt','v','t') restricts extraction to the part-of-speech tags network terms (nw), common nouns (n), organization names (nt), common verbs (v), and time (t); withWeight controls whether each keyword's weight is returned, and withWeight=False disables it; withFlag controls whether each keyword's part-of-speech tag is returned, and withFlag=False disables it. The part-of-speech tag library can be found in Table 1.

[0036] Table 1. Part-of-speech tag library:

[0037] .

[0038] A configuration database is pre-built based on knowledge in the field of chemical pharmaceuticals. This knowledge can be obtained through expert experience or common sense and stored in the configuration database. The database is then traversed to identify chemical pharmaceutical terms and their matching keywords, which are used as target text. The location coordinates of these target texts are then obtained, and a target text set is constructed. These location coordinates include at least row, column, and data block coordinates. The data block coordinates are used to locate the overall data unit within the chemical R&D data.
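The secondary extraction and coordinate recording described above can be sketched as follows. The sketch assumes that the R&D data has already been parsed into data blocks, each a list of rows of cell strings, and that a keyword counts as a target text only when it matches a configured term exactly; the `TargetText` class and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetText:
    text: str
    block: int  # index of the data block (for example, one table in a report)
    row: int
    col: int


def extract_target_texts(blocks, keywords, terminology):
    """Keep keywords that are configured chemical drug terms and record where they occur."""
    wanted = set(keywords) & set(terminology)
    targets = []
    for b, block in enumerate(blocks):
        for r, row in enumerate(block):
            for c, cell in enumerate(row):
                if cell in wanted:
                    targets.append(TargetText(cell, b, r, c))
    return targets


# Hypothetical parsed report with a single data block.
blocks = [[["drug name", "experimental purity"], ["aspirin", "99.2"]]]
keywords = ["drug name", "experimental purity", "report"]
terminology = {"drug name", "experimental purity", "purity index"}
target_text_set = extract_target_texts(blocks, keywords, terminology)
```

Here "report" is dropped because it is not a configured term, and each remaining target text carries its block, row, and column coordinates.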

[0039] Based on the user's processing requirements, retrieve the corresponding target text and its location coordinates from the target text set. Then, construct the initial result table based on the target text. For example, if the target text is the drug name, experimental indicators, and test items, use the drug name, experimental indicators, and test items as the column headers to construct the result table.

[0040] Then, based on the location coordinates of the target text, the specific values corresponding to the target text are retrieved from the original R&D data and filled into the required result table, forming a required result table that contains both the target text and the numerical data.
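A minimal sketch of the table construction and coordinate-based filling is given below. It assumes that each target text is a header cell and that its values sit directly below it in the same column of the same data block; this layout, the tuple format, and the function name are assumptions made for illustration, since the application does not fix how values are arranged relative to the target text.

```python
def build_result_table(blocks, targets):
    """targets: list of (text, block, row, col) tuples for the header cells the user asked for."""
    columns = {}
    for text, b, r, c in targets:
        # Assumption: the values for a header sit directly below it in the same column.
        columns[text] = [row[c] for row in blocks[b][r + 1:] if c < len(row)]
    header = list(columns)
    n_rows = max((len(values) for values in columns.values()), default=0)
    rows = [
        [columns[h][i] if i < len(columns[h]) else None for h in header]
        for i in range(n_rows)
    ]
    return header, rows


# Hypothetical source block; the "batch" column is not requested and is left out.
blocks = [[
    ["drug name", "batch", "experimental purity"],
    ["aspirin", "A1", "99.2"],
    ["aspirin", "A2", "98.7"],
]]
targets = [("drug name", 0, 0, 0), ("experimental purity", 0, 0, 2)]
header, rows = build_result_table(blocks, targets)
```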

[0041] After the required result table is constructed, historical chemical drug-related data is retrieved to obtain the data range and data conflict relationships corresponding to the target text. The data range and data conflict relationships are used to determine whether the data information is abnormal. If so, the data information is marked as an abnormal result so that users can see at a glance which data may be abnormal, meeting the data accuracy requirements of the chemical field.

[0042] In some cases, the target text needs to include the specific drug name. For example, if the user's processing requirement is "to organize the purity experimental data of aspirin," the target text must at least include "aspirin," "drug name," "purity index," and "experimental purity." In this case, based on historical drug-related data, the purity index and experimental purity data range from historical aspirin experiments are obtained. Furthermore, the data conflict relationship is identified based on the difference between the purity index and the experimental purity. That is, the experimental purity needs to be matched with the purity index. If the experimental purity value is a gas chromatograph reading, while the purity index is a high-performance liquid chromatography (HPLC) purity value, then there is a conflict between the experimental purity and the purity index, and one of them must be abnormal.

[0043] In other cases, anomalies are identified in the results table based on the data range and data conflicts in historical chemical drug data. Anomalies are marked as follows:

[0044] Based on the proportion of various chemical drug terminology types coexisting in historical chemical drug data, the first type of conflict relationship is obtained;

[0045] Based on the types of proper nouns for chemical drugs, historical data related to chemical drugs are divided into the first category set;

[0046] Obtain the data range corresponding to the proper noun type of chemical drugs from the first category set;

[0047] A second category set is obtained from historical chemical drug-related data that have the same combination of chemical drug proper noun types;

[0048] A second combination conflict relationship is constructed based on the differences in the data ranges of the first and second classification sets.

[0049] Data conflict relationships are constructed using the first type of conflict relationship and the second combined conflict relationship;

[0050] Identify abnormal results based on the data range and data conflict relationships.

[0051] All historical chemical drug-related data are divided by chemical drug proper noun type, yielding multiple first category sets. Each first category set contains only the historical data for the corresponding chemical drug proper noun type. Statistical algorithms are then applied to the data in each first category set to determine its data range.

[0052] In this case, the data range includes both numerical and textual ranges. When historical chemical data is categorized based on the type of chemical terminology, if the historical chemical data is text, keywords are extracted from the text content, and these keywords are used as the text range. If the historical chemical data is numerical, the numerical range is used as the data range. For example, if "experimental purity" includes "high performance liquid chromatography" and "98%", then high performance liquid chromatography is included as a keyword in the text range, and 98% is used to calculate the numerical range of experimental purity.
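The split into a text range and a numerical range can be sketched as below. The application does not specify the statistical algorithm, so this sketch simply assumes the numerical range is the minimum and maximum of the historical values and the text range is the set of distinct text entries; the function name and the returned dictionary layout are hypothetical.

```python
import re


def data_range(values):
    """Split one first category set into a numeric (min, max) range and a set of text keywords."""
    numbers, texts = [], set()
    for value in values:
        # Accept plain numbers and percentages such as "98%" or "97.8".
        match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*%?\s*", str(value))
        if match:
            numbers.append(float(match.group(1)))
        else:
            texts.add(str(value).strip().lower())
    numeric = (min(numbers), max(numbers)) if numbers else None
    return {"numeric": numeric, "text": texts}


# Hypothetical historical entries recorded under "experimental purity".
history = ["High performance liquid chromatography", "98%", "99.5%", "97.8"]
purity_range = data_range(history)
```

A more robust variant could replace the minimum and maximum with percentile bounds so that a single historical outlier does not widen the range.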

[0053] Furthermore, the first type of conflict relationship is obtained from the proportion in which the various chemical drug proper noun types coexist in the historical chemical drug-related data, that is, the proportion in which they appear together in the same data table. If two chemical drug proper noun types have never coexisted in the historical chemical drug-related data, they are considered to have a type coexistence conflict. For example, if a certain detection method is never used on a certain drug, the chemical drug proper noun type corresponding to the detection method and the one corresponding to the drug have a type coexistence conflict. In this case, the first type of conflict relationship between them is a coexistence probability of 0%.
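One way to derive the first type of conflict relationship is sketched below. It assumes each historical data table is reduced to the set of proper noun types it contains, and that the coexistence proportion of a pair is the share of tables containing both types; these definitions and the function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratio(tables):
    """tables: iterable of sets of term types, one set per historical data table."""
    tables = [set(t) for t in tables]
    pair_counts = Counter()
    for types in tables:
        pair_counts.update(frozenset(p) for p in combinations(sorted(types), 2))
    all_types = sorted(set().union(*tables)) if tables else []
    return {
        frozenset(p): pair_counts[frozenset(p)] / len(tables)
        for p in combinations(all_types, 2)
    }


def coexistence_conflicts(ratios):
    """Pairs of term types that never appear in the same historical table (ratio 0)."""
    return {pair for pair, ratio in ratios.items() if ratio == 0}


# Hypothetical history: aspirin was never tested by gas chromatography.
history = [
    {"aspirin", "HPLC purity"},
    {"aspirin", "melting point"},
    {"ethanol", "GC purity"},
]
conflicts = coexistence_conflicts(cooccurrence_ratio(history))
```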

[0054] At the same time, statistical analysis is performed on historical chemical drug-related data that share the same combination of chemical drug proper noun types. Specifically, a second category set consists of different data tables that all contain the same combination of chemical drug proper noun types, from which the data range corresponding to that combination is determined. By comparing the data range corresponding to a combination of chemical drug proper noun types with the data ranges corresponding to the individual chemical drug proper noun types, the influence of different type combinations on the data range is determined, and the second combination conflict relationship is constructed accordingly.

[0055] Therefore, the data filling is verified by considering the data range of a single chemical drug proper noun type, the coexistence conflicts between chemical drug proper noun types, and the data range conflicts between combinations of chemical drug proper noun types. The correctness of the filled data is verified by utilizing the data characteristics of chemical drugs, so as to avoid the filling of incorrect data from interfering with the user's subsequent experiments.

[0056] Specifically, identifying anomalous results based on data range and data conflict relationships includes:

[0057] If the data information conforms to the data range, then perform secondary anomaly identification based on the first type of conflict relationship and the second combination of conflict relationships;

[0058] If the data does not conform to the data range, it is marked as an abnormal result;

[0059] If conflicting relationships exist in the data information identified based on the first type of conflict relationship and the second combined conflict relationship, it is marked as an abnormal result.

[0060] In this embodiment, only abnormal results are marked so that users can intuitively obtain data that may be abnormal; the abnormal data is not replaced. First, the data information is identified based on its range to filter out data with obvious anomalies. Then, the remaining data information within the normal range is identified a second time to determine whether there are type conflicts or combination conflicts between the data information, further ensuring the accuracy of the required results table.
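The two-stage identification can be sketched as a single function that only marks cells and never replaces them. The sketch assumes numeric values, single-type ranges given as (low, high) pairs, type conflicts given as a set of unordered pairs, and combination ranges keyed by the set of types present in the row; all names and the reason strings are hypothetical.

```python
def mark_anomalies(row, ranges, type_conflicts, combo_ranges):
    """row: {term_type: numeric value}. Returns {term_type: reason} for abnormal cells."""
    flagged = {}
    # Stage 1: check each value against the data range of its own type.
    for term, value in row.items():
        if term in ranges:
            low, high = ranges[term]
            if not low <= value <= high:
                flagged[term] = "out of range"
    # Stage 2: only values that passed stage 1 are checked for conflicts.
    passed = [term for term in row if term not in flagged]
    for i, a in enumerate(passed):
        for b in passed[i + 1:]:
            if frozenset((a, b)) in type_conflicts:
                flagged[a] = flagged[b] = "type coexistence conflict"
    for term, (low, high) in combo_ranges.get(frozenset(row), {}).items():
        if term in passed and term not in flagged and not low <= row[term] <= high:
            flagged[term] = "combination range conflict"
    return flagged


# Hypothetical ranges: dissolution is 0-100 alone, but 40-90 when temperature is recorded.
ranges = {"temperature": (20, 100), "dissolution": (0, 100)}
combo_ranges = {frozenset({"temperature", "dissolution"}): {"dissolution": (40, 90)}}
marks = mark_anomalies({"temperature": 60, "dissolution": 95}, ranges, set(), combo_ranges)
```

In this sample the dissolution value passes the single-type check but is marked in the second stage because it falls outside the narrower combination range.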

[0061] Constructing the second combination conflict relationship based on the differences in data range between the first category set and the second category set includes:

[0062] Based on the first category set and the second category set, obtain the data range differences, which include textual difference information and numerical difference information;

[0063] Construct the second combination conflict relationship based on how the data range differences change across different combinations of chemical drug proper noun types.

[0064] By examining the variations in data range resulting from different combinations of chemical drug terminology types, we can identify the relationships between these types of terminology and further identify any anomalies in the data by utilizing the correlation between chemical drug terminology types within the same experiment. For example, in a water bath heating experiment, there should be a temperature and a corresponding change in the drug parameter. Therefore, we can further narrow down the data range of a combination that includes both temperature and drug parameter to avoid the data range of a single drug parameter ignoring the influence of temperature.
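The comparison between single-type ranges and combination ranges can be sketched as follows. For brevity the sketch covers numerical ranges only (the textual difference information is omitted), takes each range as the minimum and maximum of the historical values, and represents each historical table as a mapping from proper noun type to its recorded values; these choices and all names are assumptions.

```python
def numeric_range(values):
    return (min(values), max(values))


def combination_range_differences(tables):
    """tables: list of {term_type: [numeric values]}, one mapping per historical data table."""
    single, combined = {}, {}
    for table in tables:
        combo = frozenset(table)
        for term, values in table.items():
            single.setdefault(term, []).extend(values)
            combined.setdefault(combo, {}).setdefault(term, []).extend(values)
    result = {}
    for combo, per_term in combined.items():
        result[combo] = {}
        for term, values in per_term.items():
            s_low, s_high = numeric_range(single[term])
            c_low, c_high = numeric_range(values)
            result[combo][term] = {
                "range": (c_low, c_high),
                # How far the combination narrows the single-type range at each end.
                "narrowing": (c_low - s_low, s_high - c_high),
            }
    return result


# Hypothetical history: dissolution recorded with and without a water bath temperature.
history = [
    {"temperature": [37, 60], "dissolution": [40, 90]},
    {"dissolution": [5, 100]},
]
differences = combination_range_differences(history)
```

For the water bath example, the combination of temperature and dissolution yields a dissolution range of (40, 90), narrower than the single-type range of (5, 100), and that narrower range is what a combination-level check would apply.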

[0065] As a second embodiment of this application, the automated processing method for R&D data based on keyword extraction further includes:

[0066] The Qwen3-Turbo model architecture was used to train the model based on historical chemical drug research and development tables and chemical drug terminology. The model learned the chemical drug terminology and table construction methods, and then constructed an experimental data processing model.

[0067] At this point, the experimental data processing model interacts with the user to achieve the following steps:

[0068] In response to the user's processing needs, retrieve the target text that matches the user's processing needs from the target text set, along with its location coordinates;

[0069] Build a required results table based on the target text;

[0070] Retrieve data information corresponding to the target text from the chemical drug R&D data based on the location coordinates and add it to the required results table.

[0071] The Qwen3-Turbo model is trained on chemical pharmaceutical terminology, enabling the generalized model to automatically match and replace user descriptions with these terminologies. A configuration database of chemical pharmaceutical terminology can be pre-built, and the Qwen3-Turbo model can be trained using data from this database. These terminologies can be obtained from expert experience or domain knowledge and stored in the configuration database.

[0072] When chemical drug research and development data is obtained, keywords can first be extracted using the TextRank algorithm, and then further extracted using chemical drug terminology. The target text set is temporarily stored in a configuration database, ready to respond to user processing requests, thereby reducing the time users need to wait for initial processing. It is understood that the configuration database and other components in this embodiment can be set up on the user's intranet to avoid leakage of experimental data.
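The two-stage extraction in [0072] can be sketched as follows. This is a simplified, unweighted TextRank over a word co-occurrence graph, with no stop-word or part-of-speech filtering, written in plain Python; it stands in for whatever TextRank implementation is actually used. The function names, the (line, column) form of the location coordinate, and the sample terminology set are assumptions made for illustration.

```python
import re
from collections import defaultdict


def textrank_keywords(text, window=3, damping=0.85, iterations=30, top_k=10):
    """First stage: rank words with a simplified TextRank."""
    words = re.findall(r"[a-z]+", text.lower())
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each word receives a share of the score of every neighbor.
        score = {
            w: (1 - damping)
            + damping * sum(score[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)[:top_k]


def build_target_text_set(text, terminology, top_k=10):
    """Second stage: keep keywords that are known terms and record where they first occur."""
    keywords = set(textrank_keywords(text, top_k=top_k))
    targets = {}
    for line_no, line in enumerate(text.splitlines()):
        for match in re.finditer(r"[a-z]+", line.lower()):
            word = match.group()
            if word in keywords and word in terminology and word not in targets:
                targets[word] = (line_no, match.start())
    return targets


text = "Temperature 37 purity 99.1\nThe sample was heated and the purity was checked"
print(build_target_text_set(text, {"temperature", "purity"}))
```

The resulting dictionary is what would be cached in the configuration database so that later user requests can be answered without repeating the extraction.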

[0073] When a user inputs a data processing request, the Qwen3-Turbo model first uses the target text set to identify, from the user's description, the target text to be retrieved. It then constructs the corresponding result table according to the table construction methods found in historical chemical drug research and development tables. Specifically, the model can be trained on the table formats that correspond to different types of chemical drug terminology in those historical tables, so that the format of the final output table meets the current requirements.

[0074] As a third embodiment of this application, the automated R&D data processing method based on keyword extraction further includes:

[0075] Construct terminology mapping relationships based on the synonyms and antonyms of chemical drug proper nouns;

[0076] Data preprocessing of chemical drug research and development data is performed using terminology mapping relationships.

[0077] Before initially extracting textual information from chemical drug research and development data, data preprocessing is performed on the chemical drug research and development data through terminology mapping relationships.

[0078] The data preprocessing of chemical drug R&D data based on terminology mapping relationships includes:

[0079] Initial identification of chemical drug R&D data is performed based on the antonymous correlation of chemical drug proper nouns to obtain antonymous correlation data;

[0080] Based on the synonyms of chemical drug technical terms, the chemical drug R&D data is uniformly replaced, and invalid antonymous related data are replaced to obtain the preprocessed chemical drug R&D data.

[0081] Based on the synonymous descriptions of chemical drug technical terms, non-standard synonymous terms in the chemical drug R&D data are uniformly replaced with the same description. At the same time, based on the antonymous descriptions of chemical drug technical terms, logically conflicting statements in the chemical drug R&D data are identified, for example a record that describes a sub-zero experiment and also states a temperature of 25°C. In this case, if the chemical drug R&D data contains antonymous related terms that match the terminology mapping relationship, synonym replacement is treated as invalid for those terms, so that replaced descriptions do not affect later users' understanding of the data containing the antonymous related terms.
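A minimal Python sketch of this preprocessing is given below. It assumes that "invalid replacement" means a record flagged as logically conflicting is left unchanged, and it hard-codes a single conflict rule (the word "sub-zero" together with a temperature at or above 0°C). The synonym table, the rule, and the function names are hypothetical examples rather than the mapping used by the application.

```python
import re

# Hypothetical synonym mapping: non-standard description -> standard description.
SYNONYMS = {"acetylsalicylic acid": "aspirin", "ASA": "aspirin"}
SYNONYM_PATTERNS = [
    (re.compile(rf"\b{re.escape(variant)}\b", re.IGNORECASE), standard)
    for variant, standard in SYNONYMS.items()
]
# A temperature value that is not preceded by a minus sign.
NON_NEGATIVE_TEMP = re.compile(r"(?<![-\d.])\d+(?:\.\d+)?\s*°C")


def has_antonym_conflict(sentence):
    """Flag a record that says 'sub-zero' but states a non-negative temperature."""
    return "sub-zero" in sentence.lower() and NON_NEGATIVE_TEMP.search(sentence) is not None


def preprocess(sentences):
    """Return (preprocessed sentences, indexes of antonym-conflict records)."""
    cleaned, flagged = [], []
    for index, sentence in enumerate(sentences):
        if has_antonym_conflict(sentence):
            flagged.append(index)
            cleaned.append(sentence)  # assumption: conflicting records are not rewritten
            continue
        for pattern, standard in SYNONYM_PATTERNS:
            sentence = pattern.sub(standard, sentence)
        cleaned.append(sentence)
    return cleaned, flagged


records = [
    "Acetylsalicylic acid dissolved at 25°C",
    "Sub-zero experiment held at 25°C with ASA",
]
print(preprocess(records))
```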

[0082] As shown in Figure 2, the automated processing method for R&D data based on keyword extraction further includes:

[0083] The first correction relation is constructed based on the physicochemical properties and uses of chemical drugs;

[0084] The data to be corrected is obtained by traversing the chemical drug R&D data according to the first correction relation;

[0085] Obtain correction information based on the source of the data to be corrected and the correction status of historical sources, and record the correction information and the corresponding location coordinates;

[0086] The identification data to be corrected is obtained based on the location coordinates of the abnormal results and the location coordinates of the correction information. Anomalies are identified in the identification data to be corrected based on the data range and data conflict relationships, and it is determined whether to display the identification data to be corrected.

[0087] It should be noted that the chemical drug R&D data here refers to the preprocessed chemical drug R&D data. In this embodiment, only the data that is input to the preprocessing step is the initial chemical drug R&D data; every other reference to chemical drug R&D data means the preprocessed data.

[0088] First, a first correction relationship is established based on the physicochemical properties and uses of chemical drugs. For example, aspirin's physicochemical property is that it is readily soluble in ethanol, and its use is for antipyresis and analgesia. In this case, the data to be corrected is obtained by traversing the chemical drug research and development data according to the first correction relationship. That is, if there is a corresponding description of aspirin as poorly soluble in ethanol, it is taken as the data to be corrected. Physicochemical properties and uses can be obtained from the pharmacopoeia.
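The traversal in [0088] can be sketched in Python as a lookup against a small reference table. The record layout (location coordinate, drug, property, described value), the dictionary form of the first correction relationship, and the function name are assumptions made for illustration; only the aspirin example comes from the text above.

```python
# Hypothetical first correction relationship: (drug, property) -> reference description.
FIRST_CORRECTION = {
    ("aspirin", "solubility in ethanol"): "readily soluble",
}


def find_data_to_correct(records, relation):
    """Return records whose description contradicts the reference description.

    records: iterable of (coordinate, drug, property, described_value).
    """
    to_correct = []
    for coordinate, drug, prop, value in records:
        reference = relation.get((drug, prop))
        if reference is not None and value != reference:
            to_correct.append(
                {"coordinate": coordinate, "found": value, "reference": reference}
            )
    return to_correct


records = [
    ((3, 1), "aspirin", "solubility in ethanol", "poorly soluble"),
    ((4, 1), "aspirin", "solubility in ethanol", "readily soluble"),
]
print(find_data_to_correct(records, FIRST_CORRECTION))
```

Only the record at coordinate (3, 1) is returned, because it is the one that disagrees with the reference property.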

[0089] In this embodiment, constructing the first correction relationship based on the physicochemical properties and uses of the chemical drugs further includes:

[0090] The first correction text relationship is constructed based on the text descriptions of the physicochemical properties and uses of chemical drugs;

[0091] The first correction numerical relationship is constructed based on the numerical descriptions of the physicochemical properties and uses of chemical drugs;

[0092] Construct the first correction relationship by associating the first correction text relationship with the first correction numerical relationship.

[0093] The physicochemical properties and uses of chemical drugs are described in both text and numerical form. It can be understood that the numerical description of a use mainly reflects the dosage taken and the physiological values of the subject. Covering both forms makes it easier to screen chemical drug research and development data for abnormal text descriptions or abnormal numerical descriptions.

[0094] In this embodiment, obtaining correction information based on the source of the data to be corrected and historical source correction information includes:

[0095] Obtain the sources of historical chemical drug data and the corresponding historical source corrections, calculate the correction ratio for each source, and obtain the correction type for each chemical drug proper noun.

[0096] The correction ratio and correction type of the data source to be corrected are retrieved based on the matching results between the data source to be corrected and the historical sources.

[0097] Based on the sources of the chemical drug-related data, such as instrument testing records and manual records, the correction ratio of each source is obtained, for example the ratio of modified data to all data for instrument A. At the same time, the correction type corresponding to each chemical drug proper noun is obtained; for example, the correction type for experimental purity is a positive or negative numerical deviation. Then, based on the source of the current chemical drug-related data, the correction ratio and correction type corresponding to that source are retrieved. Because a new source of chemical drug-related data may appear, if no matching source exists, the average correction ratio of sources of the same source type is used as the correction ratio of the new source, and the most frequent correction type for the same chemical drug proper noun type is used as the correction type of the new source.
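The lookup with fallback described in [0097] can be sketched in Python. The record fields (`source`, `kind`, `term`, `correction_type`) and the function names are hypothetical; a `correction_type` of `None` is assumed to mean that the historical value was never corrected.

```python
from collections import Counter, defaultdict


def build_profiles(history):
    """Summarize historical corrections per source and per terminology type."""
    totals, corrected, kind_of = Counter(), Counter(), {}
    types_by_source_term = defaultdict(Counter)
    types_by_term = defaultdict(Counter)
    for rec in history:
        src = rec["source"]
        kind_of[src] = rec["kind"]
        totals[src] += 1
        if rec["correction_type"] is not None:
            corrected[src] += 1
            types_by_source_term[(src, rec["term"])][rec["correction_type"]] += 1
            types_by_term[rec["term"]][rec["correction_type"]] += 1
    ratios = {src: corrected[src] / totals[src] for src in totals}
    return ratios, kind_of, types_by_source_term, types_by_term


def lookup(source, kind, term, profiles):
    """Return (correction ratio, correction type), falling back for unseen sources."""
    ratios, kind_of, by_source_term, by_term = profiles
    if source in ratios:
        ratio = ratios[source]
    else:
        # New source: average ratio over known sources of the same source type.
        same_kind = [r for s, r in ratios.items() if kind_of[s] == kind]
        ratio = sum(same_kind) / len(same_kind) if same_kind else 0.0
    # Fall back to the most frequent correction type for the terminology type.
    counter = by_source_term.get((source, term)) or by_term.get(term)
    correction_type = counter.most_common(1)[0][0] if counter else None
    return ratio, correction_type


def _rec(source, correction_type):
    return {"source": source, "kind": "instrument", "term": "purity",
            "correction_type": correction_type}


history = (
    [_rec("instrument A", "negative deviation")] + [_rec("instrument A", None)] * 3
    + [_rec("instrument B", "positive deviation")] * 2 + [_rec("instrument B", None)] * 2
)
profiles = build_profiles(history)
print(lookup("instrument A", "instrument", "purity", profiles))
print(lookup("instrument C", "instrument", "purity", profiles))
```

With this sample history, the known source returns its own ratio and correction type, while the unseen "instrument C" receives the average ratio of the two known instruments and the most frequent correction type recorded for purity.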

[0098] Obtaining the identification data to be corrected based on the location coordinates of the abnormal results and the location coordinates corresponding to the correction information includes:

[0099] If the location coordinates of the abnormal result are the same as the location coordinates of the correction information, then the correction information is retrieved, and the data information in the abnormal result is corrected according to the correction information to obtain the identification data to be corrected.

[0100] The data to be corrected includes at least the correction type, correction ratio, and original data information.

[0101] Performing anomaly identification on the identification data to be corrected based on the data range and data conflict relationships, and determining whether to display the identification data to be corrected, includes:

[0102] Output the number of identification values to be corrected that are to be generated, based on the correction ratio and the amount of data in the data range;

[0103] Generate that number of identification values to be corrected based on the correction type and the original data information;

[0104] Based on the data range and data conflict relationships, anomaly identification is performed on the identification values to be corrected. Identification values to be corrected that are identified as abnormal results are filtered out, while identification values to be corrected that are identified as non-abnormal results are retained and displayed.

[0105] In this embodiment, the number of identification values to be corrected that are to be generated is output based on the correction ratio and the amount of data in the data range, so that sources with higher historical correction ratios generate more identification values to be corrected. It should be noted that at least one identification value to be corrected is generated. The amount of data in the data range serves as the benchmark for this number, which keeps the number generated in proportion to the scale of the historical normal data and avoids generating too many or too few values. The generated identification values to be corrected are thus representative, and the efficiency of generating them is preserved. For example, a data range containing only three text descriptions does not require a large number of identification values to be corrected. The reliability of a source is reflected by its proportion of corrected data, so more identification values to be corrected are generated for less reliable sources for the user's reference.
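One way to read [0102] to [0105] is sketched below in Python. The application does not give a formula, so the count rule (correction ratio times the amount of data in the range, rounded, with a floor of one), the evenly spaced candidate generation, and the `step` parameter are all assumptions; the filter checks only a numeric data range and leaves out the data conflict relationships.

```python
def candidate_count(correction_ratio, n_in_range):
    """Number of identification values to generate, never fewer than one (assumed rule)."""
    return max(1, round(correction_ratio * n_in_range))


def generate_candidates(original, correction_type, count, step):
    """Shift the original value against the direction of the recorded deviation.

    The fixed step size is a hypothetical choice for this sketch.
    """
    sign = -1 if correction_type == "positive deviation" else 1
    return [original + sign * step * k for k in range(1, count + 1)]


def keep_non_abnormal(candidates, data_range):
    """Retain only candidates that fall inside the historical data range."""
    low, high = data_range
    return [c for c in candidates if low <= c <= high]


count = candidate_count(correction_ratio=0.5, n_in_range=6)
candidates = generate_candidates(101.5, "positive deviation", count, step=1.0)
print(count, candidates, keep_non_abnormal(candidates, (95.0, 100.0)))
```

A source with a very low correction ratio still yields one candidate because of the floor, which matches the statement that the number generated is at least 1.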

[0106] In other embodiments, the correction information includes the identification value to be corrected, which is the identification value whose location coordinates match the location coordinates of the abnormal result. Obtaining the correction information based on the source of the data to be corrected and historical correction data includes:

[0107] Obtain the sources of historical chemical drug data and the corresponding historical source corrections;

[0108] Obtain the correction type influence relationship for each source, based on the difference between the correction value for each correction type in the historical source corrections and the initial chemical drug-related data;

[0109] Based on the source of the data to be corrected, retrieve the corresponding correction type influence relationship, and obtain the identification value to be corrected using the correction type influence relationship and the data to be corrected.

[0110] In this embodiment, the current data to be corrected is directly corrected based on the correction status corresponding to the historical correction type. For example, if the correction value in the historical correction is 125% of the initial chemical drug-related data, then the data to be corrected is corrected based on 125%. At this time, there is only one identification value to be corrected.
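The 125% example can be sketched in Python as a scaling factor learned from historical pairs of initial and corrected values. Averaging the per-record ratios is an assumption made for this sketch, as are the function names; the application only states that the correction follows the historical relationship.

```python
def influence_factor(pairs):
    """Average ratio of corrected value to initial value over historical pairs."""
    return sum(corrected / initial for initial, corrected in pairs) / len(pairs)


def identification_value(data_to_correct, pairs):
    """Apply the historical influence factor to produce a single corrected value."""
    return data_to_correct * influence_factor(pairs)


# Hypothetical history for one source and one correction type: (initial, corrected).
history_pairs = [(80.0, 100.0), (40.0, 50.0)]
print(influence_factor(history_pairs))            # 1.25, i.e. 125%
print(identification_value(60.0, history_pairs))  # 75.0
```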

[0111] As a fourth embodiment of this application, a research and development data automated processing system based on keyword extraction includes:

[0112] A configuration database, used to store chemical drug proper nouns;

[0113] The target text extraction module is used to initially extract keywords from the text information in the chemical drug research and development data using text extraction algorithms, and then call the chemical drug proper nouns in the configuration database to perform secondary extraction of the keywords to obtain the target text set.

[0114] The table building module is used to retrieve, from the target text set, the target text that matches the user's processing needs together with its location coordinates, build the required result table based on the target text, and retrieve the data information corresponding to the target text from the chemical drug research and development data based on the location coordinates and add it to the required result table.

[0115] The anomaly identification module is used to identify anomalies in the data information in the required results table based on the data range and data conflict relationships in historical chemical drug-related data, and to mark the abnormal results.

[0116] In this embodiment, the configuration database is connected to the target text extraction module, the target text extraction module is connected to the table building module, and the table building module is connected to the anomaly identification module.

[0117] Specifically, the table building module also executes:

[0118] A model is trained with the Qwen3-Turbo model architecture on historical chemical drug research and development tables and chemical drug terminology, so that it learns the chemical drug terminology and the table construction methods; the trained model serves as the experimental data processing model.

[0119] In response to user processing needs, the table building module calls the experimental data processing model to directly output the required result table.

[0120] In another embodiment, a preprocessing module is also included, in which the following is performed:

[0121] Construct terminology mapping relationships based on the synonyms and antonyms of chemical drug proper nouns;

[0122] Initial identification of chemical drug R&D data is performed based on the antonymous correlation of chemical drug proper nouns to obtain antonymous correlation data;

[0123] Based on the synonyms of chemical drug technical terms, the chemical drug R&D data is uniformly replaced, and invalid antonymous related data are replaced to obtain the preprocessed chemical drug R&D data.

[0124] The preprocessing module is connected to the chemical drug research and development data storage unit. After preprocessing the chemical drug research and development data, it is available for retrieval by the target text extraction module and the table construction module.

[0125] The specific embodiments described above are preferred embodiments of the R&D data automation processing method and system based on keyword extraction in this application, and are not intended to limit the specific implementation scope of this application. The scope of this application includes but is not limited to the specific embodiments described above. All equivalent changes made in accordance with the shape and structure of this application are within the protection scope of this application.

Claims

1. A method for automatic processing of R&D data based on keyword extraction, characterized in that: Includes the following steps: Text extraction algorithms are used to initially extract text information from chemical drug research and development data to obtain keywords; Keywords are extracted a second time using proper nouns for chemical drugs to obtain a target text set; each target text in the target text set has a corresponding location coordinate; In response to the user's processing needs, retrieve the target text and its location coordinates from the target text set that match the user's processing needs; Build a required results table based on the target text; Retrieve data information corresponding to the target text from chemical drug research and development data based on location coordinates and add it to the required results table; Based on the data range and data conflict relationships in historical chemical drug-related data, anomalies are identified and marked in the required results table; The process involves identifying anomalies in the data information of the required results table based on the data range and data conflict relationships in historical chemical drug-related data, and marking abnormal results as follows: Based on the proportion of various types of proper nouns for chemical drugs coexisting in historical data on chemical drugs, the first type of conflict relationship is obtained; Based on the types of proper nouns for chemical drugs, historical data related to chemical drugs are divided into the first category set; Obtain the data range corresponding to the proper noun type of chemical drugs from the first category set; wherein, the data range includes at least the text range and the numerical range; A second category set is obtained from historical chemical drug-related data that have the same combination of chemical drug proper noun types; A second combination conflict relationship is constructed based on the differences in the data ranges of the first and second classification sets. Data conflict relationships are constructed using the first type of conflict relationship and the second combined conflict relationship; Identify abnormal results based on the data range and data conflict relationships.

2. The automated R&D data processing method based on keyword extraction as described in claim 1, characterized in that: The determination of abnormal results based on data range and data conflict relationships includes: If the data information conforms to the data range, then perform secondary anomaly identification based on the first type of conflict relationship and the second combined conflict relationship; If the data does not conform to the data range, it is marked as an abnormal result; If conflicting relationships exist in the data information identified based on the first type of conflict relationship and the second combination of conflict relationships, it is marked as an abnormal result.

3. The automated R&D data processing method based on keyword extraction as described in claim 1 or 2, characterized in that: The step of constructing the second combination conflict relationship based on the data range differences between the first classification set and the second classification set includes: Based on the first classification set and the second classification set, obtain the data range differences that include textual difference information and numerical difference information; A second set of conflict relationships is constructed based on the differences in the data range of different combinations of chemical drug proper noun types.

4. The keyword extraction based R&D data automation processing method of claim 1, wherein: Also includes: The Qwen3-Turbo model architecture is used to train the model based on historical chemical drug research and development tables and chemical drug terminology. The model learns the chemical drug terminology and table construction methods to build an experimental data processing model. The experimental data processing model responds to user processing needs.

5. The keyword extraction based R&D data automation processing method of claim 1, wherein: Also includes: A terminology mapping relationship is constructed based on the synonyms and antonyms of chemical drug proper nouns; Initial identification of chemical drug R&D data is performed based on the antonymous correlation of chemical drug proper nouns to obtain antonymous correlation data; Based on the synonyms of chemical drug technical terms, the chemical drug R&D data is uniformly replaced, and invalid antonymous related data are replaced to obtain the preprocessed chemical drug R&D data.

6. The keyword extraction based R&D data automation processing method of claim 1, wherein: Also includes: The first correction relation is constructed based on the physicochemical properties and uses of chemical drugs; The data to be corrected is obtained by traversing the chemical drug R&D data according to the first correction relation; Obtain correction information based on the source of the data to be corrected and the correction status of historical sources, and record the correction information and the corresponding location coordinates; wherein, the correction information includes at least the correction ratio and correction type of the source of the data to be corrected; The identification data to be corrected is obtained based on the location coordinates of the abnormal results and the location coordinates of the correction information. Anomalies are identified in the identification data to be corrected based on the data range and data conflict relationship, and it is determined whether to display the identification data to be corrected.

7. The automated R&D data processing method based on keyword extraction as described in claim 6, characterized in that: The step of obtaining the identification data to be corrected based on the location coordinates of the abnormal result and the location coordinates corresponding to the correction information includes: If the location coordinates of the abnormal result are the same as the location coordinates of the correction information, then the correction information is retrieved, and the data information in the abnormal result is corrected according to the correction information to obtain the identification data to be corrected.

8. The automated R&D data processing method based on keyword extraction as described in claim 7, characterized in that: The step of identifying anomalies in the data to be corrected based on data range and data conflict relationships, and determining whether to display the data to be corrected, includes: Output the number of generated identification values to be corrected based on the correction ratio; Generate the number of identification values to be corrected based on the correction type and the original data information; Based on the data range and data conflict relationships, anomaly identification is performed on the identification values to be corrected. Identification values to be corrected that are identified as abnormal results are filtered out, while identification values to be corrected that are identified as non-abnormal results are retained and displayed.

9. A research and development data automated processing system based on keyword extraction, used to implement the method as described in any one of claims 1 to 8, characterized in that it includes: A configuration database, used to store chemical drug proper nouns; The target text extraction module is used to initially extract keywords from the text information in the chemical drug research and development data using text extraction algorithms, and then call the chemical drug proper nouns in the configuration database to perform secondary extraction of the keywords to obtain the target text set. The table building module is used to retrieve, from the target text set, the target text that matches the user's processing needs together with its location coordinates, build the required result table based on the target text, and retrieve the data information corresponding to the target text from the chemical drug research and development data based on the location coordinates and add it to the required result table. The anomaly identification module is used to identify anomalies in the data information in the required results table based on the data range and data conflict relationships in historical chemical drug-related data, and to mark the abnormal results.