An academic literature metadata database updating method, device, equipment and storage medium

By unifying the original metadata record format into a record to be normalized, and using the Dempster-Shafer evidence theory to determine the comprehensive trust level and conflict coefficient, the problem of low accuracy and reliability in the updating of academic literature metadata databases is solved, and an automated and precise updating process is achieved.

CN122240636A · Pending · Publication Date: 2026-06-19 · INST OF MEDICAL INFORMATION CHINESE ACAD OF MEDICAL SCI

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: INST OF MEDICAL INFORMATION CHINESE ACAD OF MEDICAL SCI
Filing Date: 2026-04-16
Publication Date: 2026-06-19

Application Information

Patent Timeline

16 Apr 2026: Application
19 Jun 2026: Publication (CN122240636A)

IPC: G06F16/23; G06F16/242; G06F16/2455; G06F16/2458

AI Tagging

Application Domain

Database updating; Special data processing applications

AI Technical Summary

Technical Problem

During the update process of existing academic literature metadata databases, the lack of uniformity in data standards from different data sources and interference from dirty data lead to a high rate of missed and incorrect data matching, which reduces the accuracy and reliability of the database.

Method used

The original metadata records to be added to the database are converted into records to be normalized in a unified format. The Dempster-Shafer evidence theory is used to determine the comprehensive trust level and conflict coefficient based on multiple evidence bodies. Through standardized preprocessing, accurate candidate recall, multi-dimensional evidence construction and intelligent adjudication, the document metadata database is updated automatically and accurately.

Benefits of technology

It significantly improves the accuracy and reliability of academic literature metadata databases, avoids omissions and errors in keyword matching, and achieves accurate identification and automated updates of literature entities.

Abstract

This application discloses a method, apparatus, device, and storage medium for updating an academic literature metadata database. The method involves converting the original metadata records to be imported into a standardized format; performing a recall process on the metadata database based on these standardized records to identify a candidate record set containing standardized candidate literature records matching the records to be imported; constructing evidence between each record to be imported and each candidate record in the candidate record set to identify multiple evidence bodies corresponding to each record; using Dempster-Shafer evidence theory based on these evidence bodies to determine the overall trust level and conflict coefficient between the record to be imported and each candidate record, thereby determining the ruling result corresponding to the record to be imported; and updating the metadata database based on the ruling result. This achieves the import and update of multi-source original metadata based on evidence theory, thus improving the accuracy of the academic literature metadata database.

Description

Technical Field

[0001] This application relates to the field of data processing technology, and in particular to a method, apparatus, device and storage medium for updating an academic literature metadata database.

Background Technology

[0002] An academic literature metadata database is a structured database used to store and manage the core feature information of academic literature. It can provide unified and reliable data support for downstream applications such as literature retrieval, association analysis, and knowledge graph construction, and is the core infrastructure of the academic information resource system.

[0003] In existing technologies, the updating of academic literature metadata databases usually involves keyword matching of the original metadata records to be added to the database, followed by manual review, to update the literature metadata obtained from different data sources to the academic literature metadata database.

[0004] However, due to inconsistent data standards from different data sources and interference from dirty data, the rate of missed and incorrect data matching is high, which reduces the accuracy of academic literature metadata databases.

Summary of the Invention

[0005] To address the aforementioned issues, this application provides a method, apparatus, device, and storage medium for updating an academic literature metadata database, with the aim of improving the accuracy of the academic literature metadata database.

[0006] The embodiments of this application disclose the following technical solutions:
Firstly, this application provides a method for updating an academic literature metadata database, including:
converting the original metadata records to be entered into the database into records to be normalized, where the records to be normalized are standardized records after format unification;
performing a recall process on a preset document metadata database based on the record to be normalized to determine a candidate record set, where the candidate record set includes candidate document normalization records that match the record to be normalized;
comparing the record to be normalized with each candidate record in the candidate record set to construct evidence, thereby determining multiple evidence bodies corresponding to the record to be normalized and each candidate record respectively;
based on the multiple evidence bodies, using the Dempster-Shafer evidence theory to determine the comprehensive trust level and conflict coefficient corresponding to the record to be normalized and each candidate record respectively;
based on the comprehensive trust level and the conflict coefficient, determining the adjudication result corresponding to the record to be normalized, and updating the document metadata database based on the adjudication result.

[0007] Optionally, in the method described above, the step of performing a recall process on a preset document metadata database based on the records to be normalized to determine a candidate record set includes: Based on the records to be normalized, hard identifier retrieval features, title semantic vector retrieval features, and physical topological fingerprint retrieval features are extracted to obtain the retrieval feature set; According to the retrieval dimensions corresponding to the feature set to be retrieved, a multi-channel parallel retrieval is performed on the preset document meta-database to obtain initial retrieval results; The initial search results of each channel are summarized and deduplicated to obtain a merged and deduplicated search result set; The merged and deduplicated search result set is sorted according to the preset recall channel priority to obtain the sorted search results; Based on a preset maximum candidate set threshold, a set of candidate records is determined from the sorted search results.

[0008] Optionally, in the method described above, the step of performing a multi-channel parallel search on a preset document metadata database according to the search dimensions corresponding to the feature set to be searched, to obtain initial search results, includes: Based on the hard identifier retrieval features, the document meta-database is subjected to precise hard identifier retrieval to obtain the first channel retrieval results; Based on the title semantic vector retrieval features, a broad title semantic retrieval is performed on the document meta-database to obtain the second channel retrieval results; Based on the physical topology fingerprint retrieval features, a physical topology fallback retrieval is performed on the document meta-database to obtain the third channel retrieval results; The search results from the first channel, the second channel, and the third channel are merged and deduplicated to obtain the initial search results.

[0009] Optionally, in the method described above, the record to be normalized includes a hard identifier, a title view set, a container identifier set, a bimodal field of publication spatiotemporal coordinates, and an author lexical set; The hard identifier is a standardized identifier obtained by standardizing the digital object identifier in the original metadata record. The title view set is a title text set formed by removing rich text and noise from the main title, subtitle, and translation in the original metadata record; The container identifier set is a complete set of identifiers for locking the journal that carries the document, including the International Standard Serial Number (ISSN), the electronic ISSN, the standard journal title, and the abbreviated journal title. The bimodal field of the publication spatiotemporal coordinates is a field that represents the publication year, volume, issue, and page location information of the document; the field includes two modalities: the extracted pure numerical form and the string form that retains the original format; the page location information includes virtual page numbers generated based on electronic locator mapping. The author lexical set is an unordered lexical set obtained by splitting the author's name in the original metadata record by spaces, removing punctuation, and converting it to lowercase.

[0010] Optionally, in the method described above, the step of constructing evidence by comparing the record to be normalized with each candidate record in the candidate record set, and determining multiple evidence bodies corresponding to the record to be normalized and each candidate record, includes: For each candidate record in the candidate record set, perform the following operations: The hard identifier in the record to be normalized is compared with the hard identifier corresponding to the candidate record to determine the hard identifier status evidence. The title view set in the record to be normalized is cross-compared with the title view set corresponding to the candidate record to construct an asymmetric similarity matrix, and the global maximum value of the asymmetric similarity matrix is taken as the title similarity to determine the title similarity evidence. Based on the bimodal field of the container identifier set and the publication spatiotemporal coordinates in the record to be normalized, a topological positioning comparison is performed on the container identifier set and the bimodal field of the publication spatiotemporal coordinates corresponding to the candidate record to determine the physical topological state evidence. The intersection of the author term set in the record to be normalized and the author term set corresponding to the candidate record is performed to determine the author risk control gate evidence; By integrating the hard identifier state evidence, the title similarity evidence, the physical topology state evidence, and the author risk control gating evidence, multiple evidence bodies corresponding to the record to be normalized and the candidate record are obtained.
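The evidence construction described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the source does not name the similarity function, so `difflib.SequenceMatcher` is used as a stand-in, and the function names and the author gate states (`overlap`, `disjoint`) are hypothetical. The `match` / `conflict` / `missing` states follow the definition of evidence given in the description.

```python
from difflib import SequenceMatcher


def hard_id_state(doi_a, doi_b):
    """Discrete hard identifier evidence: match, conflict, or missing."""
    if not doi_a or not doi_b:
        return "missing"
    return "match" if doi_a == doi_b else "conflict"


def title_similarity(views_a, views_b):
    """Cross-compare two title view sets and return the global maximum.

    The matrix has one row per title view of the incoming record and one
    column per title view of the candidate, so it need not be square.
    SequenceMatcher is an assumed stand-in for the similarity measure.
    """
    matrix = [
        [SequenceMatcher(None, a, b).ratio() for b in views_b]
        for a in views_a
    ]
    return max((score for row in matrix for score in row), default=0.0)


def author_gate(tokens_a, tokens_b):
    """Author risk control gate from the intersection of two token sets."""
    if not tokens_a or not tokens_b:
        return "missing"
    return "overlap" if set(tokens_a) & set(tokens_b) else "disjoint"
```

Taking the global maximum means that one well-matching pair of views (for example, a translated title against its counterpart) is enough to produce high title similarity, even if the other views differ.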

[0011] Optionally, in the method described above, the step of determining the comprehensive trust level and conflict coefficient corresponding to the record to be normalized and each candidate record respectively based on the multiple evidence bodies using Dempster-Shafer evidence theory includes:
performing basic probability assignment mapping on the multiple evidence bodies to obtain the basic probability assignment result corresponding to each evidence body;
selecting any one of the basic probability assignment results as the initial accumulated evidence, and initializing the global conflict coefficient to zero;
performing an unnormalized orthogonal synthesis operation on the current accumulated evidence and the evidence to be accumulated to obtain a cross-conflict term and a consistency result;
recursively updating the global conflict coefficient based on the cross-conflict term, and using the consistency result as the current accumulated evidence, where the evidence to be accumulated is any one of the remaining basic probability assignment results that have not yet participated in the synthesis operation, and where, before the first orthogonal synthesis operation, the current accumulated evidence is the initial accumulated evidence and the global conflict coefficient is zero;
iterating until all basic probability assignment results have been synthesized and the global conflict coefficient has been updated, to obtain the conflict coefficient and the target accumulated evidence;
performing a unified normalization based on the target accumulated evidence and the conflict coefficient to obtain the comprehensive trust level corresponding to the record to be normalized and each candidate record.
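The iterative synthesis above can be sketched in a few lines. This is an illustrative reading under stated assumptions: the frame of discernment is taken to be {T, F} ("same entity" or not), each basic probability assignment is a dict with masses for `T`, `F`, and `U` (the whole frame, i.e. uncertainty), and the global conflict coefficient is updated by adding each cross-conflict term. The function name and the example masses are hypothetical; the source does not give the mapping from evidence to masses.

```python
def combine_all(bpas):
    """Fuse basic probability assignments with deferred normalization.

    bpas: list of dicts with keys "T", "F", "U", each summing to 1.
    Returns (normalized masses, global conflict coefficient K).
    """
    acc = dict(bpas[0])   # initial accumulated evidence
    k_total = 0.0         # global conflict coefficient starts at zero
    for m in bpas[1:]:
        # Cross-conflict term: mass falling on the empty intersection T and F.
        conflict = acc["T"] * m["F"] + acc["F"] * m["T"]
        # Unnormalized orthogonal sum (the consistency result).
        acc = {
            "T": acc["T"] * m["T"] + acc["T"] * m["U"] + acc["U"] * m["T"],
            "F": acc["F"] * m["F"] + acc["F"] * m["U"] + acc["U"] * m["F"],
            "U": acc["U"] * m["U"],
        }
        k_total += conflict
    norm = 1.0 - k_total
    if norm <= 0.0:
        raise ValueError("evidence is totally conflicting; cannot normalize")
    # Single normalization at the end; masses["T"] is Bel(T) on this frame.
    return {key: value / norm for key, value in acc.items()}, k_total


# Hypothetical masses for two evidence bodies.
masses, k = combine_all([
    {"T": 0.6, "F": 0.1, "U": 0.3},
    {"T": 0.7, "F": 0.0, "U": 0.3},
])
```

Because the unnormalized masses always sum to 1 − K, dividing once at the end yields the same result as normalizing after every pairwise combination, while keeping K available as a separate conflict signal.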

[0012] Optionally, in the method described above, determining the adjudication result corresponding to the record to be normalized based on the comprehensive trust level and the conflict coefficient, and updating the document metadata database based on the adjudication result, includes:
comparing the conflict coefficient with a preset conflict circuit breaker threshold, and obtaining a circuit breaker determination result in combination with the author risk control gate evidence, where the circuit breaker determination result indicates whether the conflict circuit breaker mechanism is triggered;
if the circuit breaker determination result is that the conflict circuit breaker mechanism is triggered, generating a pending review result and sending it to the user;
if the circuit breaker determination result is that the conflict circuit breaker mechanism is not triggered, determining the target candidate record from the candidate record set;
comparing the comprehensive trust level corresponding to the target candidate record with a preset automatic merging threshold and a preset automatic addition threshold to obtain the adjudication result corresponding to the record to be normalized, where the automatic merging threshold is greater than the automatic addition threshold;
if the comprehensive trust level corresponding to the target candidate record is greater than the automatic merging threshold, merging the record to be normalized into the target candidate record and updating the document metadata database; or
if no candidate record in the candidate record set has a comprehensive trust level greater than the automatic addition threshold, adding the record to be normalized to the document metadata database as a new record; or
if the comprehensive trust level corresponding to the target candidate record lies between the automatic addition threshold and the automatic merging threshold, generating a pending review result and sending it to the user.
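The three-way adjudication can be sketched as below. Several points are assumptions rather than statements from the source: the target candidate is taken to be the one with the highest comprehensive trust level, the circuit breaker is checked on that candidate, and it trips when either the conflict coefficient exceeds its threshold or the author gate reports disjoint author sets. The source does not spell out how those two conditions are combined, and all names and threshold values here are hypothetical.

```python
def adjudicate(candidates, conflict_threshold, merge_threshold, add_threshold):
    """Return (decision, record_id) for one incoming record.

    candidates: list of (record_id, trust, conflict, author_gate) tuples.
    decision is "merge", "add", or "review".
    """
    if merge_threshold <= add_threshold:
        raise ValueError("merge threshold must exceed add threshold")
    if not candidates:
        return "add", None  # nothing recalled: treat as a new document
    # Assumed rule: the target is the highest-trust candidate.
    record_id, trust, conflict, author_gate = max(candidates, key=lambda c: c[1])
    # Assumed circuit breaker: high conflict or disjoint authors.
    if conflict > conflict_threshold or author_gate == "disjoint":
        return "review", record_id
    if trust > merge_threshold:
        return "merge", record_id
    if trust <= add_threshold:
        return "add", None  # no candidate exceeds the addition threshold
    return "review", record_id  # grey zone between the two thresholds
```

For example, with hypothetical thresholds of 0.5 (conflict), 0.9 (merge), and 0.3 (add), a candidate with trust 0.95 and conflict 0.05 is merged automatically, while the same trust with conflict 0.7 is routed to manual review.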

[0013] Secondly, this application provides an academic literature metadata database updating device, comprising: The standardization processing module is used to convert the original metadata records to be entered into the database into records to be normalized; the records to be normalized are standardized records after format unification. The recall processing module is used to perform recall processing on a preset document metadata database based on the record to be normalized, and determine a candidate record set; the candidate record set includes candidate document normalization records that match the record to be normalized. The evidence construction module is used to construct evidence by comparing the record to be normalized with each candidate record in the candidate record set, and to determine multiple evidence bodies corresponding to the record to be normalized and each candidate record respectively. The evidence processing module is used to determine the comprehensive trust level and conflict coefficient corresponding to the record to be normalized and each candidate record respectively based on the multiple evidence bodies and using the Dempster-Shafer evidence theory. The conflict resolution module is used to determine the resolution result corresponding to the record to be normalized based on the comprehensive trust level and the conflict coefficient, and to update the document metadata database based on the resolution result.

[0014] Thirdly, this application provides an electronic device, the device including: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory to implement the academic literature metadata database update method described in any of the above embodiments.

[0015] Fourthly, this application provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the academic literature metadata database update method described in any of the above embodiments.

[0016] Compared with the prior art, this application has the following beneficial effects: The method of this application involves converting the original metadata records to be entered into the database into records to be normalized; the records to be normalized are standardized records after format unification; based on the records to be normalized, a pre-set document metadata database is recalled to determine a set of candidate records; the set of candidate records contains candidate standardized records of documents that match the records to be normalized; evidence is constructed by comparing the records to be normalized with each candidate record in the set of candidate records to determine multiple evidence bodies corresponding to the records to be normalized and each candidate record; based on the multiple evidence bodies, the Dempster-Shafer evidence theory is used to determine the comprehensive trust level and conflict coefficient corresponding to the records to be normalized and each candidate record; based on the comprehensive trust level and the conflict coefficient, the adjudication result corresponding to the records to be normalized is determined, and the document metadata database is updated based on the adjudication result.

[0017] Converting the original metadata records to be entered into the database into records to be normalized eliminates the heterogeneity of multi-source metadata formats and lays a standardized data foundation for subsequent accurate matching. Performing recall on the preset document metadata database based on the records to be normalized and determining a candidate record set quickly narrows the matching range, avoids the inefficiency of traversing the full database, and improves the efficiency of the update process. Constructing evidence between the record to be normalized and each candidate record in the candidate record set and determining multiple evidence bodies allows matching criteria to be mined from multiple dimensions, including hard identifiers, title similarity, physical topology, and author information, providing comprehensive and reliable evidence for subsequent decisions. Based on these multiple evidence bodies, Dempster-Shafer evidence theory is then used to determine the comprehensive trust level and conflict coefficient, enabling the quantitative fusion of multi-dimensional evidence. At the same time, the conflict coefficient identifies contradictions between pieces of evidence, enhancing the credibility of the decision results. Based on the comprehensive trust level and conflict coefficient, the adjudication result is determined and the document metadata database is updated, enabling precise identification of document entities and effectively distinguishing records that require merging, creation, or review. Through a complete technical chain of standardized preprocessing, precise candidate recall, multi-dimensional evidence construction, quantitative evidence fusion, and intelligent adjudication, the solution of this application automates and simplifies the updating of the academic literature metadata database and avoids the omissions and errors of keyword matching found in existing technologies, thereby significantly improving the accuracy and reliability of the document metadata database.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is a flowchart of an academic literature metadata database update method provided in an embodiment of this application;
Figure 2 is a schematic diagram of the structure of an academic literature metadata database update device provided in an embodiment of this application;
Figure 3 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0020] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with specific embodiments and accompanying drawings. It should be particularly noted that the embodiments described in this application are only a part of the embodiments of this application, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.

[0021] It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of this application should have the ordinary meaning understood by one of ordinary skill in the art to which this application pertains. The terms "first," "second," and similar terms used in the embodiments of this application do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed after the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.

[0022] As described earlier, current methods for deduplicating and normalizing metadata from multiple academic literature sources for database storage typically employ rule matching, inverted string retrieval, vector semantic retrieval, linear weighted scoring, and strict serial verification. These methods first recall candidate records based on literal or semantic features, then determine whether they belong to the same document entity through fixed field alignment and weighted summation. However, these methods rely solely on surface textual features and simple linear calculations for matching, failing to effectively handle heterogeneous and dirty data scenarios such as title field drift, author name heterogeneity, mixed journal identifiers, and missing publication information. This leads to issues like missed or incorrect merging, resulting in low accuracy and poor reliability of the metadata database.

[0023] Through research, the inventors proposed a method, device, equipment, and storage medium for updating an academic literature metadata database. This method achieves automated deduplication, fusion, and database integration of massive amounts of academic literature while ensuring an extremely low false positive rate, thereby improving the accuracy and reliability of the metadata database.

[0024] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present application.

[0025] To ensure that the scope of protection of this invention covers various technical implementation paths, the key terms used herein are defined as follows:
Evidence: This refers to the unit of information extracted from the alignment results of metadata fields to support or refute the assumption that "two records are the same entity". Evidence can be discrete states (such as match / conflict / missing) or continuous similarity values.

[0026] Evidence fusion refers to the process of synthesizing multiple independent pieces of evidence according to certain rules to arrive at a final judgment. Fusion rules can be based on mathematical models of uncertainty reasoning (such as Dempster-Shafer evidence theory, Bayesian networks, fuzzy logic, etc.), or on heuristic rule sets based on expert knowledge or data-driven approaches.

[0027] Trust Measure: This refers to the overall support for the "same entity" hypothesis after fusion, and can be expressed as probability, confidence level, score, etc. In the preferred embodiment of this invention, Bel(T) is used.

[0028] Conflict Measure: This refers to the degree of contradiction or exclusivity between pieces of evidence. It can be quantified through methods such as the conflict coefficient K in evidence theory, the likelihood ratio difference in Bayesian models, or the number of mutually exclusive conditions triggered in a rule set.

[0029] Conflict Circuit-Breaking: This refers to a risk control mechanism whereby the system proactively terminates the automatic decision-making process and transfers the sample to a manual review queue when a conflict metric is detected to exceed a preset threshold or a preset exclusivity condition is met.

[0030] The aforementioned overarching concepts ensure that this invention is not limited to a specific algorithm implementation, but rather protects a class of technical solutions based on multi-evidence construction, fusion, and conflict detection.

[0031] Figure 1 is a flowchart of an academic literature metadata database update method provided in an embodiment of this application. As shown in Figure 1, the method includes:
S101: Convert the original metadata records to be imported into the database into records to be normalized.

[0032] Among them, the records to be normalized are the standardized records after the format has been unified.

[0033] In this embodiment, the original metadata records to be added to the database are cleaned and formatted in sequence to generate records to be normalized. An original metadata record R may come from any data source, and its fields may be missing or contain noise. The record R is therefore converted into a standardized, computable record to be normalized, R′, for subsequent candidate construction and evidence construction.

[0034] In one embodiment, the record to be normalized includes a hard identifier, a title view set, a container identifier set, a bimodal field of publication spatiotemporal coordinates, and an author lexical set. Specifically, the hard identifier is a standardized identifier obtained by standardizing the digital object identifier in the original metadata record; the title view set is a set of title texts formed by removing rich text and noise from the main title, subtitle, and translation in the original metadata record; the container identifier set is a complete set of identifiers for the journal that carries the document, including the International Standard Serial Number (ISSN), the electronic ISSN, the standard journal title, and the abbreviated journal title; the bimodal field of publication spatiotemporal coordinates is a field representing the publication year, volume, issue, and page location information of the document; the field includes two modalities: the extracted pure numerical form and the string form that retains the original format; the page location information includes virtual page numbers generated based on electronic locator mapping; and the author lexical set is an unordered lexical set obtained by splitting the author's name in the original metadata record by spaces, removing punctuation, and converting it to lowercase.

[0035] In this embodiment, the Digital Object Identifier (DOI) in the original metadata record is URL-decoded, stripped of prefixes, denoised, and lowercased to generate a hard identifier DOI_clean; if cleaning fails, it is set to null and a reason code (such as DOI_MISSING) is recorded. Specifically, decoding restores URL-encoded characters to their original form; prefix stripping removes prefixes such as protocol headers and domain names, retaining only the DOI body; and denoising removes leading and trailing whitespace and residual punctuation, after which the result is converted to lowercase.
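A minimal sketch of this DOI cleaning step is shown below, assuming Python's standard library. The prefix pattern, the validity check, and the `DOI_INVALID` reason code are illustrative assumptions; only `DOI_clean` and `DOI_MISSING` appear in the source.

```python
import re
from urllib.parse import unquote

# Assumed prefixes: protocol header plus resolver domain, or a "doi:" label.
_DOI_PREFIX = re.compile(r"^(?:https?://)?(?:dx\.)?doi\.org/|^doi:\s*", re.IGNORECASE)
# Assumed shape of a DOI body: "10.", a registrant code, "/", then a suffix.
_DOI_BODY = re.compile(r"^10\.\d{4,9}/\S+$")


def clean_doi(raw):
    """Return (DOI_clean, reason_code); DOI_clean is None when cleaning fails."""
    if raw is None or not raw.strip():
        return None, "DOI_MISSING"
    doi = unquote(raw.strip())              # decode URL-encoded characters
    doi = _DOI_PREFIX.sub("", doi)          # strip protocol header and domain
    doi = doi.strip().strip(".,;").lower()  # denoise, then lowercase
    if not _DOI_BODY.match(doi):
        return None, "DOI_INVALID"          # hypothetical reason code
    return doi, None
```

Returning the reason code alongside the value keeps the failure visible to later stages, which can then treat the hard identifier evidence as missing rather than conflicting.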

[0036] The main title, subtitle, and translation in the original metadata record are processed to remove rich text and noise, generating a title view collection Title_views; and a corresponding semantic vector Title_vec is generated for the selected title for candidate recall and reordering. If the vector generation fails or the title is missing, it is set to empty and the corresponding reason is recorded.

[0037] The print ISSN (pISSN), electronic ISSN (eISSN), and ISSN-Linking (ISSN-L) numbers in the original metadata records are merged into a container identifier set ISSN_set. The journal name is normalized to the standard journal name Journal_name_norm. The publication year in pure numeric form Year_digit is extracted from the publication date, while the original string form of the publication year Year_raw is retained. Bimodal fields in pure numeric and original string forms are generated for the volume, issue, and page number fields, respectively, i.e., bimodal fields of publication spatiotemporal coordinates. If an electronic locator Elocation_id exists but the page number field is missing, the Elocation_id is mapped to a virtual page number and added to the page number field for subsequent comparison.
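The bimodal fields and the virtual page fallback might look like the sketch below. The source does not specify how Elocation_id is mapped to a virtual page number, so reusing the digits of the locator is purely an assumption, and the function names are hypothetical.

```python
import re


def bimodal(raw):
    """Return (digit_form, raw_form): the first integer found, plus the original string."""
    if raw is None:
        return None, None
    match = re.search(r"\d+", raw)
    return (int(match.group()) if match else None), raw


def normalize_page(fpage_raw, elocation_id):
    """Build the page field, falling back to a virtual page from Elocation_id."""
    if fpage_raw and fpage_raw.strip():
        return bimodal(fpage_raw)
    if elocation_id:
        # Assumed mapping: the locator's digits become the virtual page number.
        digits = re.sub(r"\D", "", elocation_id)
        return (int(digits) if digits else None), elocation_id
    return None, None
```

Keeping the raw string next to the numeric form lets later comparison fall back to exact string matching when numeric extraction is lossy, as with a page written "S45-S52".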

[0038] The author's name is segmented to obtain an unordered set of author tokens, Author_token_set. If the author information is missing, it is set to empty and the corresponding reason is recorded. After the above processing, the record to be normalized is obtained.
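As a rough illustration of the author segmentation (split on spaces, remove punctuation, lowercase), assuming ASCII punctuation only; the function name is hypothetical:

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT = str.maketrans("", "", string.punctuation)


def author_token_set(names):
    """Return an unordered set of lowercase name tokens (Author_token_set)."""
    tokens = set()
    for name in names or []:
        for part in name.split():
            token = part.translate(_PUNCT).lower()
            if token:
                tokens.add(token)
    return frozenset(tokens)
```

Because the result is an unordered set, "Smith, John A." and "John A Smith" produce the same tokens, which is what makes the later intersection robust to name ordering.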

[0039] S102: Based on the records to be normalized, perform recall processing on the preset document metadata database to determine the candidate record set.

[0040] The candidate record set includes candidate document standardization records that match the records to be normalized.

[0041] In this embodiment, three types of retrieval features (hard identifiers, title semantic vectors, and physical topological fingerprints) are extracted from the records to be normalized to form a retrieval feature set. A three-channel parallel retrieval is then performed on a preset document metadata database. Specifically, a precise retrieval using hard identifiers yields the first channel results, a semantically broad retrieval using title semantic vectors yields the second channel results, and a fallback retrieval using physical topological fingerprints yields the third channel results. The results from the three channels are then aggregated and duplicate records are removed. The results are sorted according to a priority order: hard identifier channel first, semantic channel second, and physical topological channel last. The sorted list is then truncated at a preset maximum candidate set threshold so that only the top-ranked records are kept, resulting in a candidate record set containing standardized records of matching candidate documents.

[0042] The hard identifier retrieval feature is a standardized and cleaned digital object identifier (DOI_clean) used for precise key-value lookup; the title semantic vector retrieval feature is a title semantic vector (Title_vec) obtained by vectorizing standard titles selected from the title view set, used for approximate nearest neighbor semantic retrieval; the physical topology fingerprint retrieval feature is a structured retrieval key generated based on the International Standard Serial Number (ISSN_set), digital publication year (Year_digit), and digital page number (Fpage_digit), used for physical topology fallback retrieval in cases of missing DOIs or distorted titles. As one implementation, one or more physical topology retrieval keys can be generated based on ISSN, Year, and Fpage_Digit; when some fields are missing, candidate retrieval keys corresponding to a subset of fields can be generated according to preset degradation rules to improve candidate retrieval capability in missing scenarios. The digital page number (Fpage_digit) is the pure numerical form corresponding to the page number field when processing the bimodal field of publication spatiotemporal coordinates.
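The generation of physical topology retrieval keys, including degraded keys for missing fields, might look like the sketch below. The key layout (`issn|year|fpage` with `*` as a wildcard) and the specific degradation rules are assumptions; the text only says that keys for a subset of fields are produced according to preset rules.

```python
def topology_keys(issn_set, year_digit, fpage_digit):
    """Build retrieval keys from ISSN_set, Year_digit, and Fpage_digit."""
    keys = []
    issns = sorted(issn_set or [])
    if not issns:
        return keys                          # assumed rule: no container, no key
    for issn in issns:
        if year_digit is not None and fpage_digit is not None:
            keys.append(f"{issn}|{year_digit}|{fpage_digit}")   # full key
        elif year_digit is not None:
            keys.append(f"{issn}|{year_digit}|*")               # degraded: no page
        elif fpage_digit is not None:
            keys.append(f"{issn}|*|{fpage_digit}")              # degraded: no year
    return keys
```

A record with two ISSNs therefore produces two keys, so a match on either the print or the electronic ISSN can recall the candidate.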

[0043] S103: Construct evidence by comparing the record to be normalized with each candidate record in the candidate record set, and determine multiple evidence bodies corresponding to the record to be normalized and each candidate record respectively.

[0044] In this embodiment, the record to be normalized and each candidate record in the candidate record set are sequentially subjected to field alignment and consistency checks to construct multiple independent evidence bodies. Each evidence body includes at least hard identifier state evidence, title similarity evidence, physical topology state evidence, and author risk control gating evidence. Hard identifier state evidence characterizes whether the hard identifiers of the record to be normalized and the candidate record are in a matching, conflicting, or missing state. Title similarity evidence is the title similarity value obtained through asymmetric multi-view alignment. Physical topology state evidence first determines whether the record to be normalized and the candidate record belong to the same journal container, then performs a numerical priority comparison of their publication year, volume, issue, and page numbers, and combines the container determination result with the spatiotemporal coordinate comparison result to obtain one of three states: matching, conflicting, or missing. Author risk control gating evidence is a pass, rejection, or missing signal determined based on the overlap relationship between author lexical sets. Finally, the above four types of evidence are integrated to obtain multiple evidence bodies of the record to be normalized and the corresponding candidate record, along with specific attribution labels for the conflict dimension, thus forming comprehensive judgment evidence.

[0045] S104: Based on multiple evidence bodies, the Dempster-Shafer evidence theory is used to determine the comprehensive trust level and conflict coefficient corresponding to the record to be normalized and each candidate record.

[0046] In this embodiment, the evidence states and similarity values of hard identifier state evidence, title similarity evidence, physical topology state evidence, and author risk control gating evidence are mapped to their respective basic probability allocation functions. Then, according to the Dempster-Shafer evidence theory, multiple basic probability allocations are subjected to unnormalized iterative orthogonal synthesis, and the global conflict coefficient is recursively updated during the iteration process. After all evidence is synthesized, a unified normalization process is performed based on the target accumulated evidence and the global conflict coefficient. Finally, a comprehensive trust level is obtained to characterize that the record to be normalized and the candidate record belong to the same entity, and a conflict coefficient is obtained to characterize the degree of contradiction and mutual exclusion between multiple sets of evidence.

[0047] Dempster-Shafer evidence theory, also known as D-S evidence theory, is an uncertainty reasoning method. It obtains a comprehensive trust level and a conflict coefficient by assigning a basic probability to each of multiple independent pieces of evidence and orthogonally synthesizing the assignments. It can effectively quantify the degree of contradiction between pieces of evidence and is suitable for decision-making scenarios with conflicting and uncertain information.

[0048] S105: Based on the comprehensive trust level and conflict coefficient, determine the adjudication result corresponding to the record to be normalized, and update the document meta-database based on the adjudication result.

[0049] In this embodiment, after obtaining the comprehensive trust level and conflict coefficient for each candidate record in the candidate record set, the target candidate record is first determined in the candidate record set. Then, the comprehensive trust level and conflict coefficient corresponding to the target candidate record are used to make a judgment and obtain the corresponding judgment result. Finally, the record to be normalized is processed according to the determined judgment result to complete the update of the document meta-database.

[0050] In this embodiment, converting the original metadata records to be entered into the database into uniformly formatted records eliminates the heterogeneity of multi-source metadata formats, laying a standardized data foundation for subsequent accurate matching. Recalling the preset document metadata database based on the records to be normalized and determining a candidate record set quickly narrows the matching range, avoiding the inefficiency of full database traversal and improving the processing efficiency of the update process. Constructing evidence between the records to be normalized and each candidate record in the candidate record set and determining multiple evidence bodies allows matching criteria to be drawn from multiple dimensions, such as hard identifiers, title similarity, physical topology, and author information, providing comprehensive and reliable evidence for subsequent decision-making. Based on these multiple evidence bodies, Dempster-Shafer evidence theory is used to determine the comprehensive trust level and conflict coefficient, enabling the quantitative fusion of multi-dimensional evidence. At the same time, the conflict coefficient identifies contradictions between pieces of evidence, enhancing the credibility of the decision results. Based on the comprehensive trust level and conflict coefficient, the adjudication results are determined and the document meta-database is updated, enabling precise identification of document entities and effectively distinguishing between records that require merging, creation, and review.
This application's solution, through standardized preprocessing, precise candidate recall, multi-dimensional evidence construction, quantitative evidence fusion, and intelligent adjudication—a complete technical chain—automates and simplifies the updating of the academic document meta-database, avoiding the omissions and errors in keyword matching found in existing technologies, thereby significantly improving the accuracy and reliability of the document meta-database.

[0051] As an achievable approach, the specific implementation process of "recalling a pre-defined document metadata database based on the records to be normalized to determine a candidate record set" in S102 may include the following steps: S1021: Extract hard identifier retrieval features, title semantic vector retrieval features, and physical topological fingerprint retrieval features based on the records to be normalized to obtain the feature set to be retrieved.

[0052] In this embodiment, cleaned and standardized hard identifiers are extracted from the records to be normalized as hard identifier retrieval features, semantic vectors corresponding to the title view set are extracted as title semantic vector retrieval features, and physical fingerprints generated by combining the container identifier set with the bimodal field of publication spatiotemporal coordinates are extracted as physical topological fingerprint retrieval features. The above three types of features are combined to form the feature set to be retrieved.

[0053] S1022: Perform multi-channel parallel retrieval on the preset document metadata database according to the retrieval dimensions corresponding to the feature set to be retrieved, and obtain the initial retrieval results.

[0054] In this embodiment, a precise key-value search is performed on the document meta-database based on hard identifier retrieval features to obtain the first channel retrieval result; an approximate nearest neighbor search is performed on the document meta-database based on title semantic vector retrieval features to obtain the second channel retrieval result; and a structured fallback search is performed on the document meta-database based on physical topological fingerprint retrieval features to obtain the third channel retrieval result; each channel retrieval is performed independently, and a hit in any channel will yield the corresponding initial retrieval result.

[0055] S1023: Summarize and deduplicate the initial search results of each channel to obtain a merged and deduplicated search result set.

[0056] In this embodiment, the initial search results obtained from the hard identifier channel, semantic channel, and physical topology channel are merged, and duplicate records are eliminated by comparing unique identifiers to ensure that each candidate record in the merged search result set is unique, thus obtaining the merged and deduplicated search result set.

[0057] S1024: Sort the merged and deduplicated search result set according to the preset recall channel priority to obtain the sorted search results.

[0058] In this embodiment, for example, the preset recall channel priority order is sorted from high to low as hard identifier channel, semantic channel, and physical topology channel; based on this priority, the merged and deduplicated search result set is sorted hierarchically, with candidate records hit by hard identifier channel ranked first, followed by candidate records hit by semantic channel, and finally candidate records hit by physical topology channel; within the same priority, the results are sorted from high to low according to basic similarity score to obtain the sorted search results.

[0059] S1025: Based on the preset maximum candidate set threshold, determine the candidate record set from the sorted retrieval results.

[0060] In this embodiment, the sorted search results are truncated according to the preset maximum candidate set threshold, and the high-priority records with the highest ranking are retained. The retained high-priority candidate records are used as the candidate record set to match the records to be normalized.
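Steps S1023 to S1025 (merge, deduplicate, sort by channel priority, truncate) can be sketched as below. The tuple layout, the channel labels, and the rule of keeping the highest-priority hit for a duplicated record are assumptions made for this illustration.

```python
# Lower number = higher priority, following the order given in the text.
CHANNEL_PRIORITY = {"hard_id": 0, "semantic": 1, "topology": 2}

def build_candidate_set(hits, max_candidates):
    """hits: iterable of (record_id, channel, base_score) from all channels."""
    best = {}
    for record_id, channel, score in hits:
        rank = (CHANNEL_PRIORITY[channel], -score)
        # Deduplicate on the unique record id, keeping the best-ranked hit.
        if record_id not in best or rank < best[record_id][0]:
            best[record_id] = (rank, channel, score)
    # Sort by channel priority first, then by descending base similarity score.
    ordered = sorted(best.items(), key=lambda item: item[1][0])
    return [(rid, ch, sc) for rid, (_, ch, sc) in ordered[:max_candidates]]
```

With hits `[("A", "semantic", 0.9), ("B", "hard_id", 1.0), ("A", "topology", 1.0), ("C", "semantic", 0.95)]` and a threshold of 2, the function keeps B (hard identifier hit) and then C (best semantic hit).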

[0061] In this embodiment, hard identifier retrieval features, title semantic vector retrieval features, and physical topological fingerprint retrieval features are extracted based on the records to be normalized to obtain a set of features to be retrieved. According to the retrieval dimensions corresponding to the set of features to be retrieved, a multi-channel parallel retrieval is performed on a preset document metadata database to obtain initial retrieval results. The initial retrieval results from each channel are summarized and deduplicated to obtain a merged and deduplicated retrieval result set. The merged and deduplicated retrieval result set is sorted according to a preset recall channel priority to obtain sorted retrieval results. Based on a preset maximum candidate set threshold, a set of candidate records is determined from the sorted retrieval results, reducing subsequent computational overhead, enhancing adaptability to heterogeneous, missing, and noisy data, and providing a stable and reliable candidate foundation for document normalization.

[0062] As an achievable approach, the specific implementation process of "performing multi-channel parallel retrieval of the preset document meta-database according to the retrieval dimensions corresponding to the feature set to be retrieved, and obtaining initial retrieval results" in S1022 may include the following steps: Based on hard identifier retrieval features, a precise hard identifier retrieval is performed on the document meta-database to obtain the first channel retrieval results; based on title semantic vector retrieval features, a broad title semantic retrieval is performed on the document meta-database to obtain the second channel retrieval results; based on physical topology fingerprint retrieval features, a physical topology fallback retrieval is performed on the document meta-database to obtain the third channel retrieval results; the first channel retrieval results, the second channel retrieval results, and the third channel retrieval results are merged and deduplicated to obtain the initial retrieval results.

[0063] In this embodiment, the precise retrieval based on hard identifier retrieval features specifically involves: determining whether the cleaned hard identifier in the record to be normalized is non-empty; if it is non-empty, using the identifier as the retrieval key, performing an exact-match query in the document meta-database, and taking the successfully matched standardized document records as the first channel retrieval results. The semantically broad retrieval based on title semantic vector retrieval features specifically involves: determining whether the title semantic vector in the record to be normalized is non-empty; if it is non-empty, calculating the cosine similarity between the vector and the title semantic vectors of all records in the document meta-database, and keeping the records whose similarity is greater than a preset semantic threshold as the second channel retrieval results. The fallback retrieval based on physical topological fingerprint retrieval features specifically involves: determining whether the International Standard Serial Number set, digital publication year, and digital page number in the record to be normalized meet the preset retrieval conditions; if they do, using one or more physical topological retrieval keys generated by combining the above information as the retrieval basis, performing an exact-match query or a field-subset match query in the document meta-database according to preset degradation rules, and taking the successfully matched records as the third channel retrieval results. Finally, the retrieval results of the three channels are merged and deduplicated to obtain the initial retrieval results containing the records matched by any channel.
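The semantic channel can be illustrated with a brute-force cosine scan, which matches the "all records" wording above; a production system would more likely use an approximate nearest neighbor index. The in-memory `index` dictionary and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_channel(title_vec, index, threshold):
    """Return (record_id, similarity) pairs above the threshold, best first.

    index: dict mapping record_id -> stored title vector (assumed layout).
    """
    if not title_vec:
        return []                           # empty Title_vec: channel is skipped
    scored = ((rid, cosine(title_vec, vec)) for rid, vec in index.items())
    return sorted((p for p in scored if p[1] > threshold), key=lambda p: -p[1])
```

The empty-vector guard mirrors the non-empty check in the text, so a record with no usable title simply contributes nothing from this channel.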

[0064] In this embodiment, based on hard identifier retrieval features, a precise hard identifier retrieval is performed on the document meta-database to obtain the first channel retrieval results; based on title semantic vector retrieval features, a broad title semantic retrieval is performed on the document meta-database to obtain the second channel retrieval results; based on physical topology fingerprint retrieval features, a physical topology fallback retrieval is performed on the document meta-database to obtain the third channel retrieval results; the first channel retrieval results, the second channel retrieval results, and the third channel retrieval results are merged and deduplicated to obtain the initial retrieval results, which can comprehensively cover precision, semantics, and fallback matching, improving candidate recall and accuracy.

[0065] As an achievable method, the specific implementation process of "constructing evidence by comparing the record to be normalized with each candidate record in the candidate record set, and determining multiple evidence bodies corresponding to the record to be normalized and each candidate record" in S103 may include the following steps: For each candidate record in the candidate record set, perform the following operations: S1031: Compare the hard identifier field of the record to be normalized with the hard identifier of the candidate record to determine the hard identifier status evidence.

[0066] In this embodiment, the hard identifier in the record to be normalized is compared with the hard identifier corresponding to the candidate record. If either identifier is empty, the hard identifier status evidence is determined to be missing, and the reason code DOI_MISSING is marked. If both are not empty and their values are equal, the record is determined to be a match, and the reason code DOI_OK is marked. If both are not empty and their values are not equal, the record is determined to be a conflict, and the reason code DOI_CONFLICT is marked.
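The three-way decision above is small enough to state directly in code. The function name and the lowercase state labels are illustrative; the reason codes are those named in the text.

```python
def doi_evidence(doi_a, doi_b):
    """Return (state, reason_code) for two cleaned DOIs (None or '' when absent)."""
    if not doi_a or not doi_b:
        return "missing", "DOI_MISSING"     # either side empty
    if doi_a == doi_b:
        return "match", "DOI_OK"            # both present and equal
    return "conflict", "DOI_CONFLICT"       # both present but different
```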

[0067] S1032: Cross-compare the title view set in the record to be normalized with the title view set corresponding to the candidate record to construct an asymmetric similarity matrix, and take the global maximum value of the asymmetric similarity matrix as the title similarity to determine the title similarity evidence.

[0068] In this embodiment, every pair in the Cartesian product of the title view set of the record to be normalized and the title view set of the candidate record is compared. The string similarity of each pair of title views is calculated and an asymmetric similarity matrix is constructed, with one row per title view of the record to be normalized and one column per title view of the candidate record. The global maximum value in the matrix is extracted as the title similarity value SIM_Title of the pair of records, with a value range of 0 to 1. If the number of title views of either record is insufficient, the reason code TITLE_INSUFFICIENT is marked. The title similarity evidence is generated from these results.
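A minimal sketch of the cross-view comparison is given below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in string similarity, which is an assumption since the text does not name a metric; treating an empty view set as "insufficient" is likewise assumed.

```python
from difflib import SequenceMatcher

def title_similarity(views_a, views_b):
    """Return (SIM_Title, reason_code) as the maximum over all cross-view pairs."""
    if not views_a or not views_b:
        return 0.0, "TITLE_INSUFFICIENT"
    # Rows: views of the record to be normalized; columns: views of the candidate.
    matrix = [[SequenceMatcher(None, a, b).ratio() for b in views_b]
              for a in views_a]
    return max(max(row) for row in matrix), None
```

Taking the global maximum means a record that carries both an English title and a translated title still scores 1.0 against a candidate that holds only one of them.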

[0069] S1033: Based on the bimodal field of the container identifier set and the publication spatiotemporal coordinates in the record to be normalized, perform topological positioning comparison on the bimodal field of the container identifier set and the publication spatiotemporal coordinates corresponding to the candidate record to determine the physical topological state evidence.

[0070] In this embodiment, container consistency is first determined by comparing the container identifier set of the record to be normalized with that of the candidate record. If the sets intersect, the containers are considered consistent. If the container identifier sets are insufficient, the similarity of the standard journal names is compared. If the similarity exceeds a preset threshold, the containers are considered consistent. If both determination methods show inconsistency, i.e., the container identifier sets have no intersection and the similarity of the standard journal names does not reach the threshold, then a container conflict is determined. Next, the consistency of publication spatiotemporal coordinates is determined. The bimodal fields of publication spatiotemporal coordinates are compared sequentially, using the numerical form first and falling back to the original string form. The page number field includes virtual page numbers mapped from electronic locators. If any one or more of the following four cases exist (unequal years, unequal volume numbers, unequal issue numbers, or unequal starting pages), a critical coordinate conflict is indicated. When a container conflict or critical coordinate conflict is detected, the physical topology status evidence is determined to be in a conflict state, and the corresponding reason code is marked (e.g., JOURNAL_MISMATCH for container conflict, ERR_YEAR for year conflict, ERR_VOL for volume conflict, ERR_ISS for issue conflict, ERR_FPAGE for start page conflict). If there is no conflict and at least one valid match exists, it is determined to be in a matching state, and the reason code PHY_OK is marked; if there is no conflict but the valid information is insufficient, it is determined to be in a missing state, and the reason code PHY_INSUFFICIENT is marked.
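The container check followed by the coordinate check could be sketched as follows. Several details are assumptions for illustration: the dictionary layout, passing the journal-name similarity in as a precomputed `name_sim`, the default threshold of 0.9, and the reading that a "match" needs a consistent container plus at least one matching coordinate.

```python
CODES = {"year": "ERR_YEAR", "volume": "ERR_VOL",
         "issue": "ERR_ISS", "fpage": "ERR_FPAGE"}

def topology_evidence(a, b, name_sim=None, name_threshold=0.9):
    """a, b: dicts with 'issn_set' and (digit, raw) pairs per coordinate field."""
    issn_a = a.get("issn_set") or set()
    issn_b = b.get("issn_set") or set()
    if issn_a & issn_b:
        container = True                    # shared ISSN: same container
    elif name_sim is not None:
        container = name_sim >= name_threshold
    else:
        container = None                    # nothing to decide on
    if container is False:
        return "conflict", ["JOURNAL_MISMATCH"]
    conflicts, matches = [], 0
    for field, code in CODES.items():
        da, ra = a.get(field) or (None, None)
        db, rb = b.get(field) or (None, None)
        if da is not None and db is not None:
            left, right = da, db            # numeric form takes priority
        elif ra and rb:
            left, right = ra.strip().lower(), rb.strip().lower()  # raw fallback
        else:
            continue                        # field missing on one side
        if left == right:
            matches += 1
        else:
            conflicts.append(code)
    if conflicts:
        return "conflict", conflicts
    if container and matches:
        return "match", ["PHY_OK"]
    return "missing", ["PHY_INSUFFICIENT"]
```

Returning a list of reason codes keeps every conflicting dimension visible, which is what allows the later attribution labels to name the exact field that disagreed.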

[0071] S1034: Perform an intersection operation between the set of author terms in the record to be normalized and the set of author terms corresponding to the candidate record to determine the author risk control gating evidence.

[0072] In this embodiment, the intersection of the author lexicon set of the record to be normalized and the author lexicon set of the candidate record is calculated; if the intersection is not empty, the author risk control gate evidence is determined to be in the pass state, and the reason code AUTH_MATCH is marked; if both author lexicon sets are not empty and the intersection is empty, it is determined to be in the rejection state, and the reason code AUTH_MISMATCH is marked; if the author lexicon set of any record is empty, it is determined to be in the missing state, and the reason code AUTH_MISSING is marked.
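The author gate reduces to a set intersection with a missing-data guard. The function name and state labels below are illustrative; the reason codes are those given in the text.

```python
def author_gate(tokens_a, tokens_b):
    """Return (state, reason_code) from two author token sets."""
    if not tokens_a or not tokens_b:
        return "missing", "AUTH_MISSING"    # either set empty
    if set(tokens_a) & set(tokens_b):
        return "pass", "AUTH_MATCH"         # at least one shared token
    return "reject", "AUTH_MISMATCH"        # both non-empty, nothing shared
```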

[0073] S1035: Integrate hard identifier state evidence, title similarity evidence, physical topology state evidence, and author risk control gating evidence to obtain multiple evidence bodies corresponding to the records to be normalized and the candidate records.

[0074] In this embodiment, the hard identifier state evidence, title similarity evidence, physical topology state evidence, and author risk control gating evidence obtained separately are combined to form four independent evidence bodies that can be used for evidence fusion calculation.

[0075] In this embodiment, the hard identifiers in the record to be normalized are compared with the hard identifiers corresponding to the candidate records to determine the hard identifier status evidence; the title view set in the record to be normalized is cross-compared with the title view set corresponding to the candidate records to construct an asymmetric similarity matrix, and the global maximum value of the asymmetric similarity matrix is taken as the title similarity to determine the title similarity evidence; based on the bimodal field of the container identifier set and the publication spatiotemporal coordinates in the record to be normalized, the topological location comparison of the container identifier set and the publication spatiotemporal coordinates corresponding to the candidate records is performed to determine the physical topological status evidence. The method involves finding the intersection of the author lexical set in the record to be normalized with the author lexical set in the candidate record to determine the author risk control gating evidence. It integrates hard identifier state evidence, title similarity evidence, physical topology state evidence, and author risk control gating evidence to obtain multiple evidence bodies corresponding to the record to be normalized and the candidate record. Four independent types of evidence are constructed through multi-dimensional field comparison, comprehensively depicting the matching and conflict relationships between records. Asymmetric alignment and set intersection calculation are employed to improve adaptability to heterogeneous and noisy data. The evidence structure is clear and highly interpretable, providing a stable and reliable basis for subsequent evidence fusion and conflict resolution.

[0076] As an achievable approach, the specific implementation process of "determining the comprehensive trust level and conflict coefficient corresponding to the record to be normalized and each candidate record based on multiple evidence bodies and using Dempster-Shafer evidence theory" in S104 may include the following steps: S1041: Perform basic probability allocation mapping on multiple pieces of evidence to obtain the basic probability allocation results corresponding to each piece of evidence.

[0077] In this embodiment, for example, the multiple evidence bodies are hard identifier state evidence (E1), title similarity evidence (E2), physical topology state evidence (E3), and author risk control gating evidence (E4). Basic probability assignment (BPA) mapping is performed on these four types of evidence as follows. Hard identifier state evidence (E1) serves as a source of strong support / strong opposition evidence. The mapping rules are as follows: if E1 = Match (reason code DOI_OK), then the basic probability allocation results are m1(T) = w1, m1(F) = 0, m1(Θ) = 1 - w1; if E1 = Conflict (reason code DOI_CONFLICT), then m1(T) = 0, m1(F) = w2, m1(Θ) = 1 - w2; if E1 = Null (reason code DOI_MISSING), then m1(T) = 0, m1(F) = 0, m1(Θ) = 1, so that the evidence participates in the fusion in the form of uncertainty, which avoids misjudgments due to missing fields. Here, w1 and w2 are preset weights.

[0078] Title similarity evidence (E2) provides continuous support, discounted by a confidence coefficient α, where 0 ≤ α ≤ 1 so that the resulting masses remain within [0, 1]. The mapping rule is as follows: based on the title similarity SIM_Title ∈ [0, 1], calculate m2(T) = α × SIM_Title, m2(F) = 0, m2(Θ) = 1 - m2(T), where α is used to discount the confidence of the evidence and reduce the impact of non-critical similarity.

[0079] Physical topological state evidence (E3) serves as a source of "strong support / strong opposition" evidence, and its mapping rules are consistent with E1. If E3 = Match (reason code PHY_OK), then m3(T) = w3, m3(F) = 0, and m3(Θ) = 1 - w3. If E3 = Conflict (including container conflicts and key coordinate conflicts, corresponding to various reason codes), then m3(T) = 0, m3(F) = w4, and m3(Θ) = 1 - w4. If E3 = Null (reason code PHY_INSUFFICIENT), then m3(Θ) = 1.0, participating in the fusion in the form of uncertainty. Among them, w3 and w4 are preset weights.

[0080] The author's risk control gate evidence (E4) can provide either supporting or opposing information, and can also serve as a basis for risk control. The mapping rules are as follows: if E4 = Pass (reason code AUTH_MATCH), then m4(T) = w5, m4(F) = 0, m4(Θ) = 1 - w5; if E4 = Fail (reason code AUTH_MISMATCH), then m4(T) = 0, m4(F) = w6, m4(Θ) = 1 - w6; if E4 = Null (reason code AUTH_MISSING), then m4(T) = 0, m4(F) = 0, m4(Θ) = 1. Here, w5 and w6 are preset weights.

[0081] After the above four types of evidence are mapped, the corresponding basic probability allocation results (m1, m2, m3, m4) are obtained. Each basic probability allocation result contains three values: m(T) (the mass value supporting the hypothesis T that the records are the same entity), m(F) (the mass value supporting the hypothesis F that the records are different entities), and m(Θ) (the mass value of the uncertainty term Θ).
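The mapping rules for E1 to E4 can be collected into two small functions, as sketched below. The `BPA` tuple and the function names are illustrative, and any numeric weights passed in (for example 0.9 or 0.8) are placeholders rather than values from the application.

```python
from typing import NamedTuple

class BPA(NamedTuple):
    T: float       # mass on "same entity"
    F: float       # mass on "different entities"
    Theta: float   # mass on the uncertainty term

def state_bpa(state, w_support, w_oppose):
    """Map a three-state evidence (E1, E3, E4) to a BPA."""
    if state in ("match", "pass"):
        return BPA(w_support, 0.0, 1.0 - w_support)
    if state in ("conflict", "reject"):
        return BPA(0.0, w_oppose, 1.0 - w_oppose)
    return BPA(0.0, 0.0, 1.0)              # missing: all mass on Theta

def title_bpa(sim_title, alpha):
    """Map SIM_Title in [0, 1] to a BPA discounted by alpha (E2)."""
    support = alpha * sim_title
    return BPA(support, 0.0, 1.0 - support)
```

Each returned triple sums to 1, which is the property the later orthogonal synthesis relies on.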

[0082] S1042: Select any one of the multiple basic probability assignment results as the initial cumulative evidence.

[0083] In this embodiment, the multiple basic probability allocation results are m1 corresponding to E1, m2 corresponding to E2, m3 corresponding to E3, and m4 corresponding to E4. Any one of these is selected as the initial accumulated evidence. For example, the basic probability allocation result m1 corresponding to the hard identifier state evidence is selected as the initial accumulated evidence, i.e., the initial value of the accumulated evidence $M_{acc}$ is: $M_{acc}(T) = m_1(T)$, $M_{acc}(F) = m_1(F)$, $M_{acc}(\Theta) = m_1(\Theta)$. At the same time, the global conflict coefficient is initialized to $K = 0$.

[0084] S1043: Perform an unnormalized orthogonal synthesis operation on the current accumulated evidence and the evidence to be accumulated to obtain the cross-conflict term and the consistency result.

[0085] In this embodiment, the evidence to be accumulated is any one of the remaining basic probability assignment results that has not yet participated in the synthesis operation. Let the current accumulated evidence be $M_{acc}$ and the evidence to be accumulated be $m_x$ (where $x$ is one of the indices 2, 3, and 4 that has not yet participated in the synthesis). First, calculate the cross-conflict term $k$, which characterizes the degree of mutual exclusion between the two sets of evidence. The calculation formula is:

$$k = M_{acc}(T)\,m_x(F) + M_{acc}(F)\,m_x(T)$$

[0086] Then calculate the unnormalized consistency results of $M_{acc}$ and $m_x$, including the unnormalized cumulative mass $\tilde{M}_{acc}(T)$ supporting the same-entity hypothesis, the unnormalized cumulative mass $\tilde{M}_{acc}(F)$ supporting the different-entities hypothesis, and the unnormalized cumulative mass $\tilde{M}_{acc}(\Theta)$ of the uncertainty term. The calculation formulas are as follows:

[0087]

$$\tilde{M}_{acc}(T) = M_{acc}(T)\,m_x(T) + M_{acc}(T)\,m_x(\Theta) + M_{acc}(\Theta)\,m_x(T)$$

$$\tilde{M}_{acc}(F) = M_{acc}(F)\,m_x(F) + M_{acc}(F)\,m_x(\Theta) + M_{acc}(\Theta)\,m_x(F)$$

[0088]

$$\tilde{M}_{acc}(\Theta) = M_{acc}(\Theta)\,m_x(\Theta)$$

[0089] Here, the cross-conflict term $k$ is used for the subsequent update of the global conflict coefficient, and the unnormalized consistency results $\tilde{M}_{acc}(T)$, $\tilde{M}_{acc}(F)$, and $\tilde{M}_{acc}(\Theta)$ are used to update the current accumulated evidence.

[0090] Taking m1 as the initial accumulated evidence as an example, the current accumulated evidence $M_{acc}$ is the initial accumulated evidence $m_1$. If the evidence to be accumulated is $m_2$, then the cross-conflict term $k_{12}$ between $m_1$ and $m_2$ can be represented as:

$$k_{12} = m_1(T)\,m_2(F) + m_1(F)\,m_2(T)$$

[0091] The corresponding unnormalized consistency results are as follows:

[0092]

$$\tilde{M}_{acc}(T) = m_1(T)\,m_2(T) + m_1(T)\,m_2(\Theta) + m_1(\Theta)\,m_2(T)$$

$$\tilde{M}_{acc}(F) = m_1(F)\,m_2(F) + m_1(F)\,m_2(\Theta) + m_1(\Theta)\,m_2(F)$$

[0093]

$$\tilde{M}_{acc}(\Theta) = m_1(\Theta)\,m_2(\Theta)$$
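A single unnormalized orthogonal synthesis step on the frame {T, F} can be sketched as follows. This is the standard Dempster combination with the normalization deliberately left out; the function name and the plain-tuple representation are choices made for this illustration.

```python
def combine_unnormalized(acc, m):
    """Combine two (T, F, Theta) mass triples without normalizing.

    Returns (k, new_acc), where k is the cross-conflict term.
    """
    aT, aF, aU = acc
    mT, mF, mU = m
    k = aT * mF + aF * mT                   # products whose intersection is empty
    new_T = aT * mT + aT * mU + aU * mT     # products whose intersection is {T}
    new_F = aF * mF + aF * mU + aU * mF     # products whose intersection is {F}
    new_U = aU * mU                         # Theta with Theta
    return k, (new_T, new_F, new_U)
```

For inputs that each sum to 1, the three new masses plus `k` again sum to 1: combining (0.9, 0, 0.1) with (0, 0.6, 0.4) gives k = 0.54 and masses (0.36, 0.06, 0.04).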

[0094] S1044: Update the global conflict coefficient based on the cross-conflict terms and use the consistency result as the current cumulative evidence.

[0095] In this embodiment, before the first orthogonal synthesis operation, the current accumulated evidence is the initial accumulated evidence determined in step S1042, and the global conflict coefficient K is initialized to 0. After each orthogonal synthesis operation in step S1043, the global conflict coefficient is recursively updated based on the cross-conflict term k obtained in that calculation. Because the accumulated masses are kept unnormalized, the conflict mass of each step adds directly to the total, and the update formula is: $K_{new} = K_{old} + k$, where $K_{old}$ is the global conflict coefficient after the previous orthogonal synthesis operation and $K_{new}$ is the updated global conflict coefficient after the current orthogonal synthesis operation. At the same time, the consistency results obtained in step S1043 (the unnormalized cumulative masses for T, F, and Θ) are assigned to $M_{acc}$.

[0096] The updated current accumulated evidence is used in the next orthogonal synthesis operation, and the evidence to be accumulated is switched to one of the remaining basic probability assignment results, m3 and m4, that have not yet participated in the synthesis.

[0097] Taking the above embodiment as an example, given the cross-conflict term $k_{12}$ between m1 and m2 and the initial global conflict coefficient K = 0, the updated global conflict coefficient is $k_{12}$. In $M_{acc}$, the mass for T is updated from $m_1(T)$ to $m_1(T)\,m_2(T) + m_1(T)\,m_2(\Theta) + m_1(\Theta)\,m_2(T)$, the mass for F is updated from $m_1(F)$ to $m_1(F)\,m_2(F) + m_1(F)\,m_2(\Theta) + m_1(\Theta)\,m_2(F)$, and the mass for Θ is updated from $m_1(\Theta)$ to $m_1(\Theta)\,m_2(\Theta)$.

[0098] S1045: Iterate through the synthesis and accumulation of all basic probability assignment results and update the global conflict coefficient to obtain the conflict coefficient and the target accumulation evidence.

[0099] S1046: Based on the target accumulated evidence and the global conflict coefficient, perform unified normalization processing to obtain the comprehensive trust degree corresponding to the record to be normalized and each candidate record respectively.

[0100] In this embodiment, steps S1043 to S1044 are repeated: each remaining basic probability assignment result (those among m2, m3, and m4 that have not yet taken part in the synthesis) is used in turn as the evidence to be accumulated, and unnormalized orthogonal synthesis, cross-conflict term calculation, global conflict coefficient update, and accumulated evidence update are performed against the current accumulated evidence until the basic probability assignment results of all four types of evidence, m1, m2, m3, and m4, have been synthesized and accumulated. After the iteration, the global conflict coefficient K from the last update is the conflict coefficient, reflecting the degree of mutual exclusion among the four types of evidence. The final current accumulated evidence is the target accumulated evidence, which includes the unnormalized accumulated masses of the same-entity hypothesis and the different-entity hypothesis, together with the uncertainty term.

[0101] In this embodiment, the unnormalized accumulated mass that the target accumulated evidence assigns to the same-entity hypothesis T is uniformly normalized using the global conflict coefficient K, yielding the comprehensive trust level that the record to be normalized and the corresponding candidate record are the same entity. The calculation formula is as follows:

[0102] The overall trust level Bel(T) represents the final degree of trust the system places in the two records belonging to the same entity after integrating the information from all dimensions and eliminating the conflicting parts. The global conflict coefficient K represents the current risk level. When K is greater than the preset warning threshold, the evidence from the different dimensions is significantly contradictory. In that case the system ignores the value of Bel(T) and directly triggers a forced circuit breaker that blocks the merging of the records.

[0103] In this embodiment, basic probability allocation mapping is performed on multiple evidence bodies to obtain basic probability allocation results for each evidence body. Any one of the basic probability allocation results is selected as the initial accumulated evidence. The current accumulated evidence and the evidence to be accumulated are orthogonally synthesized to obtain cross-conflict terms and consistency results. The global conflict coefficient is updated, and the consistency result is used as the current accumulated evidence. The synthesis accumulation of all basic probability allocation results and the update of the global conflict coefficient are iteratively completed to obtain the conflict coefficient and the target accumulated evidence. Then, based on the target accumulated evidence and the conflict coefficient, a unified normalization is performed to obtain the comprehensive trust level. The method of this application adopts Dempster-Shafer evidence theory. By quantifying the four types of constructed evidence into basic probability allocations and then performing orthogonal synthesis iterative calculations, the support and conflict levels of multi-source evidence can be quantified. Missing information is incorporated into the fusion in the form of uncertainty to avoid misjudgment. Finally, a credible comprehensive trust level and global conflict coefficient are output, providing a stable, objective, and interpretable quantitative basis for subsequent adjudication.
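As a non-limiting illustration, the iterative fusion of steps S1042 to S1046 might be sketched as follows. The sketch assumes the standard conjunctive rule over a two-hypothesis frame, an additive update of the global conflict coefficient (K plus the new cross-conflict term), and normalization by 1 - K. These are assumptions chosen because they keep the accumulated masses summing to 1 - K; they are not formulas reproduced from this description. The names `Mass`, `combine_unnormalized`, and `fuse` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Mass:
    """Mass on T (same entity), F (different entity) and theta (uncertainty)."""
    t: float
    f: float
    theta: float


def combine_unnormalized(acc: Mass, nxt: Mass) -> Tuple[Mass, float]:
    """One orthogonal synthesis step without normalization.

    Returns the consistency result and the cross-conflict term k.
    """
    k = acc.t * nxt.f + acc.f * nxt.t
    t = acc.t * nxt.t + acc.t * nxt.theta + acc.theta * nxt.t
    f = acc.f * nxt.f + acc.f * nxt.theta + acc.theta * nxt.f
    theta = acc.theta * nxt.theta
    return Mass(t, f, theta), k


def fuse(bpas: Iterable[Mass]) -> Tuple[float, float, Mass]:
    """Fuse all BPAs; return (Bel(T), K, target accumulated evidence)."""
    items = list(bpas)
    if not items:
        raise ValueError("at least one basic probability assignment is required")
    acc, k_global = items[0], 0.0  # initial accumulated evidence, K = 0
    for nxt in items[1:]:
        acc, k = combine_unnormalized(acc, nxt)
        # ASSUMPTION: additive update. Because acc is left unnormalized,
        # its masses sum to 1 - k_global after every step.
        k_global += k
    if k_global >= 1.0:
        raise ValueError("total conflict: normalization is undefined")
    bel_t = acc.t / (1.0 - k_global)  # ASSUMPTION: Dempster-style normalization
    return bel_t, k_global, acc


if __name__ == "__main__":
    # Illustrative masses only; they do not come from this description.
    evidence = [Mass(0.6, 0.1, 0.3), Mass(0.7, 0.1, 0.2),
                Mass(0.5, 0.2, 0.3), Mass(0.0, 0.0, 1.0)]
    bel, k, _ = fuse(evidence)
    print(f"Bel(T)={bel:.4f}, K={k:.4f}")
```

In this sketch a piece of evidence with all its mass on the uncertainty term, such as the last entry in the usage example, leaves both the accumulated evidence and K unchanged, which is one way missing information can enter the fusion without biasing it.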

[0104] As an achievable approach, the specific implementation of "determining the adjudication result corresponding to the record to be normalized based on the comprehensive trust level and conflict coefficient, and updating the document metadata database based on the adjudication result" in S105 may include the following steps:

S1051: Compare the conflict coefficient with the preset conflict circuit breaker threshold, and obtain the circuit breaker determination result in combination with the author risk control gating evidence.

[0105] The circuit breaker determination result is used to determine whether the conflict circuit breaker mechanism has been triggered.

[0106] In this embodiment, a conflict circuit breaker threshold K_limit is preset. First, the conflict coefficient K is compared with K_limit; at the same time, the author risk control gating evidence E4 is checked. If K exceeds K_limit, or E4 indicates a rejection status with the reason code AUTH_MISMATCH, the circuit breaker determination result is that the conflict circuit breaker mechanism is triggered, and a pending review result is generated and sent to the user. If K does not exceed K_limit and E4 is in a pass or null state, the circuit breaker determination result is that the conflict circuit breaker mechanism is not triggered.
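As a non-limiting illustration, the circuit breaker determination of step S1051 might be sketched as follows. The sketch reads the two conditions as alternatives (either one triggers the breaker), which matches the scenario in which an author mismatch alone leads to a pending result. The status strings "pass" and "reject", the use of `None` for the null state, and the function name are hypothetical encodings; only the reason code AUTH_MISMATCH is taken from this description.

```python
from typing import Optional

AUTH_MISMATCH = "AUTH_MISMATCH"  # reason code named in the description


def breaker_triggered(k: float, k_limit: float,
                      e4_status: Optional[str],
                      e4_reason: Optional[str] = None) -> bool:
    """Return True when the conflict circuit breaker should fire.

    e4_status is "pass", "reject", or None (null state); these values
    are a hypothetical encoding of the author gating evidence E4.
    """
    conflict_too_high = k > k_limit
    author_rejected = e4_status == "reject" and e4_reason == AUTH_MISMATCH
    return conflict_too_high or author_rejected


if __name__ == "__main__":
    # Low conflict but mismatched authors: the breaker still fires.
    print(breaker_triggered(0.05, 0.5, "reject", AUTH_MISMATCH))
```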

[0107] S1052: When the circuit breaker determination result is that the conflict circuit breaker mechanism is not triggered, the target candidate record is determined from the candidate record set.

[0108] In this embodiment, after obtaining the overall trust level and conflict coefficient for each candidate record in the candidate record set, the candidate record with the highest overall trust level is selected as the target candidate record from the candidate records that have not triggered the conflict circuit breaker mechanism; if there are two or more candidate records with the same overall trust level, the candidate record with the lower conflict coefficient is selected as the target candidate record; if both the overall trust level and the conflict coefficient are the same, the target candidate record is determined according to the preset recall channel priority or the candidate record sorting result.

[0109] The preset recall channel priority can be consistent with the sorting rules of the candidate recall stage. For example, candidate records recalled by the hard identifier channel are selected first, followed by candidate records recalled by the title semantic channel, and finally candidate records recalled by the physical topology channel.

[0110] By using the above method, when multiple candidate records simultaneously meet the subsequent adjudication conditions, the system can stably output a unique target candidate record, thereby avoiding adjudication ambiguity caused by the coexistence of multiple candidates.
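As a non-limiting illustration, the tie-breaking order described for step S1052 (highest overall trust level, then lowest conflict coefficient, then recall channel priority) might be expressed as a single composite sort key. The class, field, and channel names are hypothetical, and trust levels are compared for exact equality, which a real implementation might relax with a tolerance.

```python
from dataclasses import dataclass
from typing import List, Optional

# Channel order follows the example priority in the description:
# hard identifier first, then title semantic, then physical topology.
CHANNEL_PRIORITY = {"hard_identifier": 0, "title_semantic": 1, "physical_topology": 2}


@dataclass
class ScoredCandidate:
    record_id: str
    bel_t: float             # overall trust level Bel(T)
    conflict_k: float        # conflict coefficient K
    channel: str             # recall channel that produced the candidate
    breaker_triggered: bool  # result of the circuit breaker determination


def select_target(cands: List[ScoredCandidate]) -> Optional[ScoredCandidate]:
    """Pick the unique target among candidates that did not trip the breaker."""
    eligible = [c for c in cands if not c.breaker_triggered]
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda c: (-c.bel_t, c.conflict_k, CHANNEL_PRIORITY[c.channel]),
    )
```

Because the key is a tuple, a later component is consulted only when all earlier components are equal, which mirrors the cascading rule in the text.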

[0111] S1053: Compare the comprehensive trust level, the preset automatic merging threshold, and the automatic addition threshold corresponding to the target candidate record to obtain the adjudication result corresponding to the record to be normalized.

[0112] Once the target candidate record is determined, the overall trust level corresponding to the target candidate record is compared with the preset automatic merging threshold and automatic addition threshold: If the overall trust level of the target candidate record is greater than the automatic merging threshold, the record to be normalized will be merged into the target candidate record and updated in the document meta-database; If there are no candidate records in the candidate record set with a comprehensive trust level greater than the automatic addition threshold, then the record to be normalized will be added to the document meta-database. If the overall trust level corresponding to the target candidate record is between the automatic merging threshold and the automatic addition threshold, a pending review result will be generated and sent to the user.

[0113] In this embodiment, two thresholds are preset: the automatic merging threshold and the automatic addition threshold. If the overall trust level Bel(T) of the target candidate record is greater than the automatic merging threshold, the record to be normalized and the corresponding candidate record are determined to be the same document entity. The metadata fields of the record to be normalized and those of the candidate record are integrated and completed, duplicate field information is removed, and a merged standardized document record is generated. This record then overwrites the original candidate record in the document metadata database, ensuring the uniqueness and integrity of the records in the database.

[0114] If the overall trust level Bel(T) of every candidate record in the candidate record set is less than the automatic addition threshold, the record to be normalized is determined to be a new document entity that has not yet been entered into the database. The record to be normalized is encapsulated in the metadata standardization format and added to the document metadata database as a new document entry. At the same time, a unique identifier is generated for the record, and the structured attribution information of the addition operation is recorded to facilitate subsequent auditing and traceability.

[0115] If the overall trust level Bel(T) of the target candidate record lies between the automatic addition threshold and the automatic merging threshold, inclusive of both, the attribution status of the record to be normalized cannot be confirmed automatically, and a pending review result is generated. The metadata of the record to be normalized, the candidate record information, the comprehensive trust value, the conflict coefficient value, and the status of each piece of evidence are packaged and pushed to the manual review port for manual review by the user. After the review is completed, the document metadata database is updated according to the review result, and the review opinions and results are recorded for auditing and later inspection.
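As a non-limiting illustration, the graded threshold adjudication of step S1053 might be sketched as follows. The function takes the Bel(T) of the target candidate record, which is the highest among the eligible candidates, so a value below the automatic addition threshold implies that every candidate is below it. The enum and parameter names are hypothetical, and no threshold values are implied by this description.

```python
from enum import Enum


class Verdict(Enum):
    MERGE = "merge"      # merge into the target candidate record
    ADD = "add"          # add as a new document entry
    PENDING = "pending"  # push to manual review


def adjudicate(bel_t: float, merge_threshold: float, add_threshold: float) -> Verdict:
    """Three-way decision on the target candidate's overall trust level."""
    if not add_threshold < merge_threshold:
        # The description requires the merging threshold to be the larger one.
        raise ValueError("merge_threshold must be greater than add_threshold")
    if bel_t > merge_threshold:
        return Verdict.MERGE
    if bel_t < add_threshold:
        return Verdict.ADD
    return Verdict.PENDING  # add_threshold <= bel_t <= merge_threshold
```

Both boundary values fall into the pending band, consistent with preferring manual review over an erroneous automatic decision.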

[0116] In this embodiment, the conflict coefficient is compared with a preset conflict circuit breaker threshold, and a circuit breaker determination result is obtained based on author risk control gating evidence. When the circuit breaker determination result is that the conflict circuit breaker mechanism is not triggered, the comprehensive trust level, the preset automatic merging threshold, and the automatic addition threshold are compared to obtain the adjudication result for the record to be normalized. If the comprehensive trust level is greater than the automatic merging threshold, the record to be normalized is merged into the candidate record and updated to the literature metadata database; or, if the comprehensive trust level is less than the automatic addition threshold, the record to be normalized is added to the literature metadata database; or, if the comprehensive trust level is within the range of the automatic merging threshold and the automatic addition threshold, a pending review result is generated and sent to the user. The above method can effectively avoid erroneous merging caused by evidence conflict and improve the reliability of normalization through conflict circuit breaking and graded threshold adjudication. At the same time, merging, creation, or manual review is automatically performed according to the trust level, preferring to enter the pending state rather than allow erroneous merging, thus balancing automation efficiency and risk control security, achieving accurate deduplication and compliant updates, and ensuring the quality of the literature metadata database.

[0117] For example, this embodiment provides two typical adjudication scenarios. In the conflict circuit breaker scenario, take the record to be normalized, A (DOI=10.1056/NEJMc1714503; title: Global Burden of Rheumatic Heart Disease; ISSN 1533-4406; publication year: 2018; volume 378; issue 1; starting page e2; author set {Sohrabi, Bahram}), and candidate record B (DOI=10.1056/NEJMc1714503; title: Global Burden of Rheumatic Heart Disease; ISSN 1533-4406; publication year: 2018; volume 378; issue 1; starting page e2; author set {Johannsen, Ronald}) as an example. The DOIs are completely identical, and the title, journal, year, volume, issue, and starting page all match, but the author sets have no overlap, so the evidence construction result is E1=Match, E2=1.0, E3=Match, and E4=Fail. Although E1, E2, and E3 support merging, E4 directly triggers the "author uniqueness exclusivity" hard circuit breaker rule. The conflict circuit breaker therefore fires and the final output is "Pending", with the structured attribution codes DOI_OK, SIM_Title=1.0, PHY_OK, and AUTH_MISMATCH, thus avoiding an erroneous merge.

[0118] In the automatic merging scenario, take record A (DOI=empty; title: Gastrointestinal: pyogenic liver abscess associated with a penetrating fish bone; journal: Journal of gastroenterology and hepatology; ISSN: 0815-9319; publication year: 2010; volume: 25; issue: 12; starting page: 1900; author set: {Hsu, Heng, Yen}) and candidate record B (DOI=empty; title: Education and Imaging: Gastrointestinal: pyogenic liver abscess associated with a penetrating fish bone; journal: Journal of gastroenterology and hepatology; ISSN: 1440-1746; publication year: 2010; volume: 25; issue: 12; starting page: 1900; author set: {Yen, HH}) as an example. Both are missing a DOI, but the titles are semantically highly similar, the journals match, and the year, volume, issue, and starting page are completely aligned. The author sets have a valid intersection, so the evidence construction results are consistent across multiple dimensions. After evidence fusion, the overall trust level Bel(T) is high and the conflict coefficient K is low. The final output is automatic merging, and the records are merged in the database.

[0119] As an optional engineering implementation approach, this method also includes the following system interaction, storage index, user interface, parallel processing, and high-performance approximate implementation schemes: This invention can be implemented by a system platform and a normalization calculation tool working together: the system platform sends a normalization request containing the record to be normalized and the necessary context; the normalization calculation tool executes the above normalization process and returns the decision result, target entity ID, confidence score, and structured attribution information; the system platform writes the returned results into the platform database and provides a result display page and a manual processing page for reviewing the assigned review task and writing back the results; wherein, the core algorithm calculation is executed on the normalization calculation tool side, and the platform side is responsible for scheduling, data storage, and result display.

[0120] This invention can optionally configure a database for persistent storage of entity data, including a relational database or a non-relational database, for storing normalized entity records, field completion results, normalization decision logs, and pending tasks. The candidate construction and evidence construction stages can be implemented through a unified standardized record library, which stores standardized records of documents already included in the library and provides title vector retrieval, hard identifier retrieval, and physical topological feature retrieval capabilities. The normalization calculation tool can obtain standardized fields of candidate records from this record library to complete the construction and fusion adjudication of four types of evidence. In addition, this invention can also configure a pending queue or pending task table to receive pending review samples and form a closed loop of manual review.

[0121] When the normalization calculation tool outputs pending review results, the platform system can provide visual decision support information to human reviewers to improve review efficiency. Optional interface elements include: displaying key fields of source and candidate records in table or side-by-side card format, and visually distinguishing consistent and conflicting fields; displaying structured attribution labels, which can be converted into human-readable prompt text; displaying the overall trust level value in the form of progress bars, percentages, or color blocks, as well as warning indicators indicating whether the conflict coefficient exceeds the threshold; providing quick operation buttons such as confirm merging, refuse merging, and mark difficult cases, which, when clicked, write the review results back to the platform database. The above interface visualizes the intermediate calculation results of the algorithm for human reviewers, forming a collaborative working mode that combines algorithm suggestions with human decision-making, improving the interpretability and credibility of the system. The interface layout, color scheme, and interaction details can be freely implemented according to the front-end technology stack, all of which fall within the technical scope of this invention for transforming algorithm output into user-perceptible information.

[0122] The three-channel candidate construction can be performed in parallel or pipelined manner, and multiple input records can be retrieved in batches according to a preset batch size to improve processing throughput; deduplication can be performed within the batch during batch processing to avoid duplicate candidates and duplicate calculations in the same batch; to ensure processing consistency, idempotent keys can be set for the same input record to avoid duplicate processing, and all decision outputs can record version numbers and attribution information for easy backtracking and auditing.
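As a non-limiting illustration, the idempotent key and in-batch deduplication described above might be sketched as follows. The sketch assumes that input records are JSON-serializable dictionaries and derives the key from a SHA-256 hash of a canonical serialization; the key scheme and the function names are assumptions, not details given in this description.

```python
import hashlib
import json
from typing import Dict, Iterable, Iterator, List, Set


def idempotency_key(record: Dict[str, object]) -> str:
    """Stable key for a record: hash of its canonical JSON form."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedup_batches(records: Iterable[Dict[str, object]], batch_size: int,
                  already_processed: Set[str]) -> Iterator[List[Dict[str, object]]]:
    """Yield batches of at most batch_size records.

    Records repeated within a batch, or whose key is in already_processed,
    are skipped. The caller is expected to add keys to already_processed
    once a record has been handled successfully.
    """
    batch: List[Dict[str, object]] = []
    seen: Set[str] = set()
    for rec in records:
        key = idempotency_key(rec)
        if key in seen or key in already_processed:
            continue
        seen.add(key)
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch, seen = [], set()
    if batch:
        yield batch
```

Sorting the keys before hashing makes the key independent of field order, so the same input record maps to the same key regardless of how its fields were serialized upstream.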

[0123] In implementation scenarios with extremely high real-time requirements or limited computing resources, the fusion process based on Dempster-Shafer evidence theory can be simplified into a set of deterministic logical rules to reduce computational complexity and improve throughput. The simplification principle is as follows: by analyzing historical labeled data or expert knowledge offline, the evidence mapping and evidence synthesis operations are pre-compiled into a set of conditional judgment rules, and the decision results are directly output based on the four types of evidence states and the title similarity threshold range, skipping real-time floating-point operations.

[0124] See Figure 2, which is a schematic diagram of the structure of an academic literature metadata database update device provided in an embodiment of this application. As shown in Figure 2, the device 20 includes a standardization processing module 21, a recall processing module 22, an evidence construction module 23, an evidence processing module 24, and a conflict adjudication module 25.

[0125] The modules function as follows. The standardization processing module 21 converts original metadata records to be entered into the database into records to be normalized, which are standardized records after format unification. The recall processing module 22 performs recall processing on a preset document metadata database based on the record to be normalized to determine a candidate record set, which includes standardized candidate document records that match the record to be normalized. The evidence construction module 23 performs evidence construction by comparing the record to be normalized with each candidate record in the candidate record set, determining the multiple evidence bodies corresponding to the record to be normalized and each candidate record. The evidence processing module 24 applies Dempster-Shafer evidence theory to these evidence bodies to determine the comprehensive trust level and conflict coefficient corresponding to the record to be normalized and each candidate record. The conflict adjudication module 25 determines the adjudication result corresponding to the record to be normalized based on the comprehensive trust level and conflict coefficient, and updates the document metadata database based on the adjudication result.

[0126] The academic literature metadata database update device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.

[0127] Furthermore, based on the above embodiments, the recall processing module 22 is specifically used to extract hard identifier retrieval features, title semantic vector retrieval features, and physical topological fingerprint retrieval features based on the records to be normalized, to obtain a set of features to be retrieved; to perform multi-channel parallel retrieval on the preset document meta-database according to the retrieval dimensions corresponding to the set of features to be retrieved, to obtain initial retrieval results; to summarize and deduplicate the initial retrieval results of each channel, to obtain a merged and deduplicated retrieval result set; to sort the merged and deduplicated retrieval result set according to the preset recall channel priority, to obtain sorted retrieval results; and to determine a set of candidate records from the sorted retrieval results based on the preset maximum candidate set threshold.
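As a non-limiting illustration, the summarizing, deduplication, priority sorting, and truncation performed on the per-channel retrieval results might be sketched as follows. The channel names and the function name are hypothetical. The priority order follows the example given in this description (hard identifier, then title semantic, then physical topology), and each record is attributed to the highest-priority channel that recalled it.

```python
from typing import Dict, List, Set, Tuple

CHANNEL_ORDER = ["hard_identifier", "title_semantic", "physical_topology"]


def build_candidate_set(channel_hits: Dict[str, List[str]],
                        max_candidates: int) -> List[Tuple[str, str]]:
    """Merge per-channel hits into an ordered, deduplicated candidate list.

    channel_hits maps a channel name to the record ids it recalled.
    Returns (record_id, channel) pairs, at most max_candidates of them.
    """
    merged: List[Tuple[str, str]] = []
    seen: Set[str] = set()
    # Visiting channels in priority order both sorts the output and
    # ensures a duplicate keeps its highest-priority channel.
    for channel in CHANNEL_ORDER:
        for record_id in channel_hits.get(channel, []):
            if record_id not in seen:
                seen.add(record_id)
                merged.append((record_id, channel))
    return merged[:max_candidates]
```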

[0128] The academic literature metadata database update device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.

[0129] Furthermore, based on the above embodiments, when performing multi-channel parallel retrieval on the preset document meta-database according to the retrieval dimensions corresponding to the feature set to be retrieved, and obtaining the initial retrieval results, the recall processing module 22 is specifically used to perform hard identifier precise retrieval on the document meta-database based on hard identifier retrieval features to obtain the first channel retrieval results; perform title semantic broad retrieval on the document meta-database based on title semantic vector retrieval features to obtain the second channel retrieval results; perform physical topology fallback retrieval on the document meta-database based on physical topology fingerprint retrieval features to obtain the third channel retrieval results; and merge and deduplicate the first channel retrieval results, the second channel retrieval results, and the third channel retrieval results to obtain the initial retrieval results.

[0130] The academic literature metadata database update device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.

[0131] Furthermore, based on the above embodiments, the record to be normalized obtained in the standardization processing module 21 includes a hard identifier, a title view set, a container identifier set, a bimodal field of publication spatiotemporal coordinates, and an author lexical set. The hard identifier is a standardized identifier obtained by standardizing the digital object identifier in the original metadata record. The title view set is a set of title texts formed by removing rich text and noise from the main title, subtitle, and translated title in the original metadata record. The container identifier set is the complete set of identifiers for the journal carrying the document, including the International Standard Serial Number (ISSN), the electronic ISSN, the standard journal title, and the abbreviated journal title. The bimodal field of publication spatiotemporal coordinates represents the document's publication year, volume, issue, and page location information, and includes two modalities: the extracted pure numerical form and the string form retaining the original format. The page location information includes virtual page numbers generated by electronic locator mapping. The author lexical set is an unordered lexical set obtained by splitting the author names in the original metadata record on spaces, removing punctuation, and converting to lowercase.
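As a non-limiting illustration, the construction of the author lexical set (split on spaces, remove punctuation, convert to lowercase) might be sketched as follows. The function name is hypothetical, and the sketch strips only ASCII punctuation via `string.punctuation`; handling of non-ASCII punctuation would need an extended table.

```python
import string
from typing import Iterable, Set

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def author_lexical_set(names: Iterable[str]) -> Set[str]:
    """Build an unordered set of lowercase name tokens from author names."""
    tokens: Set[str] = set()
    for name in names:
        for part in name.split():
            token = part.translate(_PUNCT_TABLE).lower()
            if token:  # skip parts that were punctuation only
                tokens.add(token)
    return tokens


if __name__ == "__main__":
    print(sorted(author_lexical_set(["Hsu, Heng", "Yen H.H."])))
```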

[0132] The academic literature metadata database update device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.

[0133] Furthermore, based on the above embodiments, the evidence construction module 23 is specifically used to perform the following operations for each candidate record in the candidate record set: compare the hard identifier in the record to be normalized with the hard identifier corresponding to the candidate record to determine the hard identifier status evidence; cross-compare the title view set in the record to be normalized with the title view set corresponding to the candidate record to construct an asymmetric similarity matrix, and take the global maximum value of the asymmetric similarity matrix as the title similarity to determine the title similarity evidence; based on the bimodal field of the container identifier set and the publication spatiotemporal coordinates in the record to be normalized, perform topological positioning comparison on the bimodal field of the container identifier set and the publication spatiotemporal coordinates corresponding to the candidate record to determine the physical topological status evidence; perform an intersection operation on the author lexical set in the record to be normalized and the author lexical set corresponding to the candidate record to determine the author risk control gating evidence; integrate the hard identifier status evidence, title similarity evidence, physical topological status evidence, and author risk control gating evidence to obtain multiple evidence bodies corresponding to the record to be normalized and the candidate record.
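As a non-limiting illustration, two of the evidence constructions above might be sketched as follows: the title similarity taken as the global maximum of a pairwise similarity matrix, and the author gating evidence derived from a set intersection. The character-level `difflib.SequenceMatcher` ratio is only a stand-in for the semantic similarity measure, which this description does not specify in code form. The "pass" and "reject" strings, and the use of `None` as the null state when either side is missing, are hypothetical encodings.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional, Sequence, Set


def _char_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the semantic measure of the method."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def title_similarity(views_a: Sequence[str], views_b: Sequence[str],
                     sim: Callable[[str, str], float] = _char_similarity
                     ) -> Optional[float]:
    """Global maximum over the |views_a| x |views_b| similarity matrix."""
    if not views_a or not views_b:
        return None  # assumption: missing titles yield no evidence
    matrix = [[sim(a, b) for b in views_b] for a in views_a]
    return max(max(row) for row in matrix)


def author_gate(authors_a: Set[str], authors_b: Set[str]) -> Optional[str]:
    """Author gating evidence from the intersection of two lexical sets."""
    if not authors_a or not authors_b:
        return None  # assumption: null state when an author set is missing
    return "pass" if authors_a & authors_b else "reject"
```

Passing the similarity function as a parameter keeps the matrix-and-maximum logic independent of whichever embedding or text measure is actually deployed.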

[0134] The academic literature metadata database update device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.

[0135] Further, based on the above embodiments, the evidence processing module 24 is specifically used to perform basic probability allocation mapping processing on multiple evidence bodies respectively to obtain the basic probability allocation results corresponding to each evidence body; select any one of the multiple basic probability allocation results as the initial accumulated evidence; perform unnormalized orthogonal synthesis operation on the current accumulated evidence and the evidence to be accumulated to obtain cross-conflict terms and consistency results; recursively update the global conflict coefficient according to the cross-conflict terms, and use the consistency result as the current accumulated evidence; the evidence to be accumulated is any one of the other basic probability allocation results that did not participate in the synthesis operation; before the first orthogonal synthesis operation, the current accumulated evidence is the initial accumulated evidence, and the global conflict coefficient is zero; iteratively complete the synthesis accumulation of all basic probability allocation results and the update of the global conflict coefficient to obtain the conflict coefficient and the target accumulated evidence; perform unified normalization processing based on the target accumulated evidence and the global conflict coefficient to obtain the comprehensive trust level.

[0136] The academic literature metadata database update device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.

[0137] Further, based on the above embodiments, the conflict adjudication module 25 is specifically used to compare the conflict coefficient with a preset conflict circuit breaker threshold, and obtain a circuit breaker judgment result based on the author risk control gating evidence; the circuit breaker judgment result is used to determine whether the conflict circuit breaker mechanism is triggered; if the circuit breaker judgment result is that the conflict circuit breaker mechanism is triggered, a pending review result is generated and sent to the user; if the circuit breaker judgment result is that the conflict circuit breaker mechanism is not triggered, a target candidate record is determined from the candidate record set; the comprehensive trust level corresponding to the target candidate record, the preset automatic merging threshold, and the automatic addition threshold are compared to obtain the adjudication result corresponding to the record to be normalized; the automatic merging threshold is greater than the automatic addition threshold; if the comprehensive trust level corresponding to the target candidate record is greater than the automatic merging threshold, the record to be normalized is merged into the target candidate record and updated to the document meta-database; or, if there is no candidate record in the candidate record set with a comprehensive trust level greater than the automatic addition threshold, the record to be normalized is added to the document meta-database; or, if the comprehensive trust level corresponding to the target candidate record is within the range of the automatic merging threshold and the automatic addition threshold, a pending review result is generated and sent to the user.

[0138] The academic literature metadata database update device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.

[0139] See Figure 3, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. The device includes: a memory 11, used to store a computer program; and a processor 12, configured to implement the steps of the academic literature metadata database update method described in any of the above method embodiments when executing the computer program.

[0140] In this embodiment, the device can be an in-vehicle computer, a PC (Personal Computer), or a terminal device such as a smartphone, tablet computer, handheld computer, or portable computer.

[0141] The device may include a memory 11, a processor 12, and a bus 13.

[0142] The memory 11 includes at least one type of readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the device, such as the hard disk of the device. In other embodiments, the memory 11 may be an external storage device of the device, such as a plug-in hard disk, SmartMedia Card (SMC), Secure Digital (SD) card, Flash Card, etc. Furthermore, the memory 11 may include both internal and external storage units of the device. The memory 11 can be used not only to store application software installed on the device and various types of data, such as the program code for executing the academic literature metadata database update method, but also to temporarily store data that has been output or will be output.

[0143] In some embodiments, processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip, used to run program code stored in memory 11 or process data, such as program code for executing an academic literature metadata database update method.

[0144] The bus 13 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is shown in Figure 3 as a single thick line, but this does not mean that there is only one bus or one type of bus.

[0145] Furthermore, the device may also include a network interface 14, which may optionally include a wired interface and / or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), typically used to establish communication connections between the device and other electronic devices.

[0146] Optionally, the device may further include a user interface 15, which may include a display, an input unit such as a keyboard, and optionally, a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an OLED (Organic Light-Emitting Diode) touchscreen, etc. The display may also be appropriately referred to as a screen or display unit, used to display information processed in the device and to display a visual user interface.

[0147] Figure 3 shows only a device with components 11-15; those skilled in the art will understand that the structure shown in Figure 3 does not constitute a limitation on the device, which may include fewer or more components than shown, combine certain components, or have a different component arrangement.

[0148] Based on the same inventive concept, corresponding to the methods of any of the above embodiments, this application also provides a computer-readable storage medium storing computer instructions for causing the computer to perform the methods described in any of the above embodiments.

[0149] The computer-readable media in this application embodiment include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

[0150] The computer instructions stored in the storage medium of the above embodiments are used to cause the computer to perform the methods described in any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0151] It should be noted that the various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to mutually. Each embodiment focuses on describing its differences from the other embodiments. In particular, since the apparatus, electronic device, and medium embodiments are basically similar to the method embodiments, their descriptions are relatively brief, and for the relevant parts reference can be made to the descriptions of the method embodiments. The methods, apparatuses, electronic devices, and media described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components indicated as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those skilled in the art can understand and implement this without creative effort.

[0152] The above description is merely one specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. An academic literature metadata database updating method, characterized in that it comprises: converting the original metadata records to be imported into the database into records to be normalized, the records to be normalized being standardized records after format unification; performing, based on the record to be normalized, recall processing on a preset document metadata database to determine a candidate record set, the candidate record set including candidate document normalization records that match the record to be normalized; comparing the record to be normalized with each candidate record in the candidate record set to construct evidence, thereby determining multiple evidence bodies corresponding to the record to be normalized and each candidate record respectively; determining, based on the multiple evidence bodies and using Dempster-Shafer evidence theory, the comprehensive trust level and conflict coefficient corresponding to the record to be normalized and each candidate record respectively; and determining, based on the comprehensive trust level and the conflict coefficient, the adjudication result corresponding to the record to be normalized, and updating the document metadata database based on the adjudication result.

2. The method of claim 1, wherein the step of performing recall processing on a preset document metadata database based on the record to be normalized to determine a candidate record set includes: extracting, based on the record to be normalized, hard identifier retrieval features, title semantic vector retrieval features, and physical topological fingerprint retrieval features to obtain a retrieval feature set; performing, according to the retrieval dimensions corresponding to the retrieval feature set, a multi-channel parallel retrieval on the preset document metadata database to obtain initial retrieval results; summarizing and deduplicating the initial retrieval results of each channel to obtain a merged and deduplicated retrieval result set; sorting the merged and deduplicated retrieval result set according to a preset recall channel priority to obtain sorted retrieval results; and determining, based on a preset maximum candidate set threshold, the candidate record set from the sorted retrieval results.
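As a non-limiting illustration, the merging, deduplication, priority sorting, and truncation steps recited above might be sketched as follows. The channel names, the priority values, the use of record IDs as plain strings, and the function name `merge_candidates` are all assumptions made for this sketch; the retrieval performed by each channel is not shown.

```python
from typing import Dict, List

# Hypothetical channel names and priorities (lower value = higher priority).
# The claim only states that a preset recall channel priority exists.
CHANNEL_PRIORITY: Dict[str, int] = {
    "hard_identifier": 0,
    "title_semantic": 1,
    "physical_topology": 2,
}


def merge_candidates(channel_results: Dict[str, List[str]], max_candidates: int) -> List[str]:
    """Merge per-channel record IDs, deduplicate, rank by channel priority, truncate."""
    best_priority: Dict[str, int] = {}
    for channel, record_ids in channel_results.items():
        priority = CHANNEL_PRIORITY[channel]
        for record_id in record_ids:
            # Deduplicate: keep the best (lowest) priority that recalled this record.
            if priority < best_priority.get(record_id, priority + 1):
                best_priority[record_id] = priority
    # sorted() is stable, so records with equal priority keep first-seen order.
    ranked = sorted(best_priority, key=best_priority.get)
    # Apply the maximum candidate set threshold.
    return ranked[:max_candidates]
```

In this sketch a record recalled by several channels is kept once and ranked under the highest-priority channel that returned it; other tie-breaking rules are equally compatible with the claim wording.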

3. The method of claim 2, wherein the step of performing a multi-channel parallel retrieval on the preset document metadata database according to the retrieval dimensions corresponding to the retrieval feature set to obtain initial retrieval results includes: performing, based on the hard identifier retrieval features, a precise hard identifier retrieval on the document metadata database to obtain first channel retrieval results; performing, based on the title semantic vector retrieval features, a broad title semantic retrieval on the document metadata database to obtain second channel retrieval results; performing, based on the physical topological fingerprint retrieval features, a physical topology fallback retrieval on the document metadata database to obtain third channel retrieval results; and merging and deduplicating the first channel retrieval results, the second channel retrieval results, and the third channel retrieval results to obtain the initial retrieval results.

4. The method of claim 1, wherein the record to be normalized includes a hard identifier, a title view set, a container identifier set, a bimodal field of publication spatiotemporal coordinates, and an author lexical set; the hard identifier is a standardized identifier obtained by standardizing the digital object identifier in the original metadata record; the title view set is a set of title texts formed by removing rich text and noise from the main title, subtitle, and translated title in the original metadata record; the container identifier set is a complete set of identifiers for pinpointing the journal that carries the document, including the International Standard Serial Number (ISSN), the electronic version of the ISSN, the standard journal title, and the abbreviated journal title; the bimodal field of publication spatiotemporal coordinates is a field that represents the publication year, volume, issue, and page location information of the document, the field including two modalities: the extracted pure numerical form and the string form that retains the original format, and the page location information including virtual page numbers generated based on electronic locator mapping; and the author lexical set is an unordered lexical set obtained by splitting the author's name in the original metadata record by spaces, removing punctuation, and converting it to lowercase.
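By way of example only, two of the fields recited above, the author lexical set and a bimodal field, might be built as in the sketch below. The function names are hypothetical, punctuation removal is limited to ASCII punctuation for brevity, and taking the first run of digits as the numerical form is an assumption of this sketch rather than something the claim specifies.

```python
import re
import string
from typing import FrozenSet, Optional, Tuple

# Translation table that deletes ASCII punctuation (an assumption; real
# records may also need Unicode punctuation handling).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def author_lexical_set(author_name: str) -> FrozenSet[str]:
    """Split on whitespace, remove punctuation, lowercase; return an unordered set."""
    tokens = set()
    for raw in author_name.split():
        token = raw.translate(_PUNCT_TABLE).lower()
        if token:  # drop tokens that consisted only of punctuation
            tokens.add(token)
    return frozenset(tokens)


def bimodal_field(raw_value: str) -> Tuple[Optional[int], str]:
    """Return (pure numerical form, original string form).

    The numerical form is the first run of digits, or None when the raw
    value contains no digits; the original string is kept unchanged.
    """
    match = re.search(r"\d+", raw_value)
    numeric = int(match.group()) if match else None
    return numeric, raw_value
```

Because the author terms are held in a set, "Smith, J. A." and "J. A. Smith" yield the same lexical set, which is what makes the later intersection step insensitive to name order.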

5. The method of claim 4, wherein the step of comparing the record to be normalized with each candidate record in the candidate record set to construct evidence, and determining multiple evidence bodies corresponding to the record to be normalized and each candidate record respectively, includes performing the following operations for each candidate record in the candidate record set: comparing the hard identifier in the record to be normalized with the hard identifier corresponding to the candidate record to determine hard identifier state evidence; cross-comparing the title view set in the record to be normalized with the title view set corresponding to the candidate record to construct an asymmetric similarity matrix, and taking the global maximum value of the asymmetric similarity matrix as the title similarity to determine title similarity evidence; performing, based on the container identifier set and the bimodal field of publication spatiotemporal coordinates in the record to be normalized, a topological positioning comparison against the container identifier set and the bimodal field of publication spatiotemporal coordinates corresponding to the candidate record to determine physical topological state evidence; intersecting the author lexical set in the record to be normalized with the author lexical set corresponding to the candidate record to determine author risk control gate evidence; and integrating the hard identifier state evidence, the title similarity evidence, the physical topological state evidence, and the author risk control gate evidence to obtain the multiple evidence bodies corresponding to the record to be normalized and the candidate record.
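For illustration, the title cross-comparison and the author intersection described above might look like the following sketch. The claim does not name the similarity function; the token containment measure used here is a stand-in chosen only because it is asymmetric, and reading an empty author intersection as a failed gate is likewise an assumption. All function names are hypothetical.

```python
from typing import List, Set


def containment(a: str, b: str) -> float:
    """Hypothetical asymmetric similarity: share of the tokens of `a` found in `b`."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a) if tokens_a else 0.0


def title_similarity(incoming_titles: List[str], candidate_titles: List[str]) -> float:
    """Cross-compare two title view sets and return the matrix's global maximum."""
    # Rows: titles of the record to be normalized; columns: candidate titles.
    matrix = [
        [containment(t_in, t_cand) for t_cand in candidate_titles]
        for t_in in incoming_titles
    ]
    return max((value for row in matrix for value in row), default=0.0)


def author_gate_passed(incoming_authors: Set[str], candidate_authors: Set[str]) -> bool:
    """Assumed gate rule: pass when the two author lexical sets share a term."""
    return bool(incoming_authors & candidate_authors)
```

Taking the global maximum means that a match between any pair of title views, for example a main title on one side and a translated title on the other, is enough to produce high title similarity.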

6. The method of claim 1, wherein determining, based on the multiple evidence bodies and using Dempster-Shafer evidence theory, the comprehensive trust level and conflict coefficient corresponding to the record to be normalized and each candidate record respectively includes: performing basic probability assignment mapping on the multiple evidence bodies to obtain the basic probability assignment result corresponding to each evidence body; selecting any one of the multiple basic probability assignment results as the initial accumulated evidence, and initializing the global conflict coefficient to zero; performing an unnormalized orthogonal synthesis operation on the current accumulated evidence and the evidence to be accumulated to obtain a cross-conflict term and a consistency result; recursively updating the global conflict coefficient based on the cross-conflict term, and using the consistency result as the current accumulated evidence, wherein the evidence to be accumulated is any one of the remaining basic probability assignment results that have not yet participated in the synthesis operation, and before the first orthogonal synthesis operation the current accumulated evidence is the initial accumulated evidence and the global conflict coefficient is zero; iteratively completing the synthesis and accumulation of all basic probability assignment results and the update of the global conflict coefficient to obtain the conflict coefficient and the target accumulated evidence; and performing, based on the target accumulated evidence and the global conflict coefficient, a unified normalization process to obtain the comprehensive trust level corresponding to the record to be normalized and each candidate record.
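The iteration recited above can be sketched, under stated assumptions, for a two-hypothesis frame. The sketch assumes that each basic probability assignment carries mass on "match", "nonmatch", and "unknown" (the whole frame), that the global conflict coefficient is updated by adding each cross-conflict term, and that the comprehensive trust level is the normalized mass on "match"; none of these details is fixed by the claim. With these assumptions, deferring normalization to the end gives the same masses as applying Dempster's rule of combination step by step.

```python
from typing import Dict, List, Tuple

Mass = Dict[str, float]  # keys: "match", "nonmatch", "unknown" (the whole frame)


def combine_unnormalized(acc: Mass, new: Mass) -> Tuple[Mass, float]:
    """One unnormalized orthogonal synthesis step.

    Returns (consistency result, cross-conflict term).
    """
    consistent = {
        "match": acc["match"] * new["match"]
        + acc["match"] * new["unknown"]
        + acc["unknown"] * new["match"],
        "nonmatch": acc["nonmatch"] * new["nonmatch"]
        + acc["nonmatch"] * new["unknown"]
        + acc["unknown"] * new["nonmatch"],
        "unknown": acc["unknown"] * new["unknown"],
    }
    # Mass that would fall on the empty set: "match" meets "nonmatch".
    conflict = acc["match"] * new["nonmatch"] + acc["nonmatch"] * new["match"]
    return consistent, conflict


def fuse(bpas: List[Mass]) -> Tuple[Mass, float]:
    """Accumulate all assignments, then normalize once at the end."""
    accumulated, global_conflict = dict(bpas[0]), 0.0
    for bpa in bpas[1:]:
        accumulated, cross_conflict = combine_unnormalized(accumulated, bpa)
        global_conflict += cross_conflict  # assumed additive recursive update
    if global_conflict >= 1.0:
        raise ValueError("total conflict: evidence cannot be normalized")
    scale = 1.0 - global_conflict
    return {key: value / scale for key, value in accumulated.items()}, global_conflict
```

Keeping the conflict mass out of the accumulated evidence until the final step is what lets the global conflict coefficient be reported alongside the trust level, so that a later stage can act on high conflict instead of having it hidden by per-step normalization.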

7. The method according to claim 1, characterized in that determining the adjudication result corresponding to the record to be normalized based on the comprehensive trust level and the conflict coefficient, and updating the document metadata database based on the adjudication result, includes: comparing the conflict coefficient with a preset conflict circuit breaker threshold, and obtaining a circuit breaker determination result in combination with the author risk control gate evidence, the circuit breaker determination result indicating whether the conflict circuit breaker mechanism is triggered; if the circuit breaker determination result is that the conflict circuit breaker mechanism is triggered, generating a pending review result and sending it to the user; if the circuit breaker determination result is that the conflict circuit breaker mechanism is not triggered, determining a target candidate record from the candidate record set; comparing the comprehensive trust level corresponding to the target candidate record with a preset automatic merging threshold and a preset automatic addition threshold to obtain the adjudication result corresponding to the record to be normalized, the automatic merging threshold being greater than the automatic addition threshold; if the comprehensive trust level corresponding to the target candidate record is greater than the automatic merging threshold, merging the record to be normalized into the target candidate record and updating the document metadata database accordingly; or, if no candidate record in the candidate record set has a comprehensive trust level greater than the automatic addition threshold, adding the record to be normalized to the document metadata database as a new record; or, if the comprehensive trust level corresponding to the target candidate record lies between the automatic addition threshold and the automatic merging threshold, generating a pending review result and sending it to the user.
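One possible reading of the adjudication logic above is sketched below. Several details are assumptions made only for this sketch: the target candidate is taken to be the candidate with the highest comprehensive trust level, the circuit breaker is evaluated on that candidate, and a failed author gate trips the breaker only when the trust level exceeds the automatic addition threshold. The class name `Scored`, the function name `adjudicate`, and the decision strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Scored:
    record_id: str
    trust: float              # comprehensive trust level
    conflict: float           # conflict coefficient
    author_gate_passed: bool  # author risk control gate evidence


def adjudicate(
    candidates: List[Scored],
    breaker_threshold: float,
    merge_threshold: float,
    add_threshold: float,
) -> Tuple[str, Optional[str]]:
    """Return (decision, target record ID); decision is "merge", "add", or "review"."""
    if merge_threshold <= add_threshold:
        raise ValueError("merge threshold must exceed the addition threshold")
    if not candidates:
        return "add", None  # nothing recalled: treat as a new record
    target = max(candidates, key=lambda c: c.trust)  # assumed target selection
    # Assumed circuit breaker rule: high conflict, or a failed author gate on a
    # candidate that would otherwise be merged or reviewed.
    tripped = target.conflict > breaker_threshold or (
        not target.author_gate_passed and target.trust > add_threshold
    )
    if tripped:
        return "review", target.record_id
    if target.trust > merge_threshold:
        return "merge", target.record_id
    if target.trust <= add_threshold:
        return "add", None  # no candidate exceeds the addition threshold
    return "review", target.record_id  # between the two thresholds
```

Under this reading a trust level exactly equal to the automatic merging threshold goes to review rather than merge, since the claim requires the trust level to be greater than that threshold for an automatic merge.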

8. An academic literature metadata database update device, characterized in that it comprises: a standardization processing module, used to convert the original metadata records to be imported into the database into records to be normalized, the records to be normalized being standardized records after format unification; a recall processing module, used to perform recall processing on a preset document metadata database based on the record to be normalized and determine a candidate record set, the candidate record set including candidate document normalization records that match the record to be normalized; an evidence construction module, used to construct evidence by comparing the record to be normalized with each candidate record in the candidate record set, and to determine multiple evidence bodies corresponding to the record to be normalized and each candidate record respectively; an evidence processing module, used to determine, based on the multiple evidence bodies and using Dempster-Shafer evidence theory, the comprehensive trust level and conflict coefficient corresponding to the record to be normalized and each candidate record respectively; and a conflict resolution module, used to determine the adjudication result corresponding to the record to be normalized based on the comprehensive trust level and the conflict coefficient, and to update the document metadata database based on the adjudication result.

9. An electronic device, characterized in that the device includes: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the method as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, are used to implement the method as described in any one of claims 1 to 7.