Cascade semantic mapping and name normalization method and system with environmental feature fusion

By fusing cascaded semantic mapping with citation environment features, a high-dimensional standard feature space is generated, reference records are parsed, semantic similarity and literal similarity are calculated, and combined with citation environment topological consistency verification, the problems of non-standard abbreviations and homonyms in journal name standardization are solved, achieving high-precision and high-robustness journal name standardization.

CN122197867A · Pending · Publication Date: 2026-06-12 · DOCUMENT & INFORMATION CENT OF CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DOCUMENT & INFORMATION CENT OF CHINESE ACAD OF SCI
Filing Date
2026-03-10
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing methods for standardizing journal names are ineffective in handling non-standard abbreviations of journal names and cannot resolve the issue of disambiguation of journals with the same name but different meanings, thus failing to meet the requirements for high precision and high robustness.

Method used

A method combining cascaded semantic mapping and citation environment feature fusion is adopted. By generating a high-dimensional standard feature space, the reference records are parsed into context triples, semantic similarity and literal similarity are calculated, and combined with the topological consistency verification of the citation environment, an adaptive gating coefficient is configured for score fusion to generate the name standardization result.

Benefits of technology

It achieves accurate and automatic standardization of journal names, improving the robustness and accuracy in handling scenarios involving non-standard abbreviations and journals with the same name but different meanings.



Abstract

The application discloses a name standardization method and system based on cascaded semantic mapping and environment feature fusion, and relates to the technical field of name standardization. The method comprises the following steps: generating a structured semantic description set for a standard journal library to form a high-dimensional standard feature space; obtaining a reference literature record to be standardized; generating a query vector; performing vector similarity retrieval to recall a preset number of candidate journals; calculating semantic similarity and literal similarity analysis; performing environment topology consistency verification on the candidate journals to establish a structural consistency score; performing score fusion to generate a name standardization result. The application solves the technical problems of the prior art, such as only focusing on text-level features, ignoring environment features, being difficult to effectively handle the problem of non-standard abbreviations and spelling of journal names, and being unable to solve the problem of homonymy journal disambiguation, and achieves the technical effects of realizing accurate automatic standardization of journal names and improving the robustness and accuracy in handling non-standard abbreviations and homonymy journal scenes.

Description

Technical Field

[0001] This invention relates to the field of name standardization technology, specifically to a name standardization method and system that integrates cascaded semantic mapping and referencing environment features.

Background Technology

[0002] In scientific literature data governance, citation index database construction, and scientific evaluation, journal name standardization is a core foundational step that directly impacts data quality and the accuracy of analysis results. Existing technologies for journal name standardization often have significant shortcomings. One type is traditional methods based on dictionaries, regular expressions, or string similarity, which struggle to cover various journal name variations arising from arbitrary author choices and inconsistent abbreviation rules, failing to effectively address issues of inconsistent abbreviations and spellings. Another type relies on sequence labeling models based on machine learning, which can only achieve structured segmentation of references but cannot achieve true journal name standardization. Furthermore, existing methods generally focus only on textual features, ignoring the implicit citation environment features within the citation network, making it difficult to effectively disambiguate journals with the same name and failing to meet the high precision and robustness requirements of large-scale citation data governance for journal name standardization.

[0003] Existing technologies focus only on textual features and ignore citation context features, making it difficult to effectively handle the problem of non-standard spelling of journal name abbreviations and unable to resolve the technical problem of disambiguating journals with the same name but different meanings.

Summary of the Invention

[0004] This application provides a naming standardization method and system that integrates cascaded semantic mapping and citation context features. This method addresses the technical problems in existing technologies that focus only on textual features and ignore citation context features, making it difficult to effectively handle non-standard spelling of journal name abbreviations and unable to resolve disambiguation of journals with the same name but different meanings.

[0005] In view of the above problems, this application provides a naming standardization method and system that integrates cascaded semantic mapping and referencing environment features.

[0006] The first aspect of this application provides a naming standardization method that fuses cascaded semantic mapping with referencing context features, the method comprising: For each journal entity in the standard journal database, a structured semantic description set containing standard names and their subject classification information is generated. A unified training encoder is used to map this structured semantic description set into standard journal vectors embedded with subject attribute distributions, forming a high-dimensional standard feature space. Reference records to be standardized are obtained and parsed into context triples, which include a core entity, semantic context, and structural context. These context triples are concatenated and input into the unified training encoder to generate a query vector. The query vector is then input into the high-dimensional standard feature space to perform vector similarity retrieval, recalling a preset number of candidate journals. Semantic similarity and literal similarity analysis are calculated for each candidate journal to establish a semantic constraint score. A citation environment topology consistency verification is performed on the candidate journals to establish a structural consistency score. Adaptive gating coefficients are configured for the semantic constraint score and structural consistency score, and score fusion is performed. Based on the score fusion result, a name standardization result is generated.

[0007] A second aspect of this application provides a name standardization system that integrates cascaded semantic mapping with citation environment features, the system comprising: a high-dimensional standard feature space generation module, used to generate a structured semantic description set containing standard names and subject classification information for each journal entity in the standard journal database, and to map the structured semantic description set into standard journal vectors embedded with subject attribute distributions using a unified training encoder, forming a high-dimensional standard feature space; a reference record parsing module, used to obtain the reference records to be standardized and to parse the reference records into context triples, which include core entities, semantic context, and structural context; a query vector generation module, used to concatenate the context triples and input them into the unified training encoder to generate a query vector; a candidate journal recall module, used to input the query vector into the high-dimensional standard feature space to perform vector similarity retrieval and recall a preset number of candidate journals; a semantic constraint score establishment module, used to calculate semantic similarity and literal similarity for each candidate journal and establish a semantic constraint score; a structural consistency score establishment module, used to perform citation environment topological consistency verification on the candidate journals and establish a structural consistency score; and a score fusion execution module, used to configure adaptive gating coefficients for the semantic constraint score and structural consistency score, perform score fusion, and generate a name standardization result based on the score fusion result.

[0008] One or more technical solutions provided in this application have at least the following technical effects or advantages: For each journal entity in the standard journal database, a structured semantic description set containing its standard name and subject classification information is generated. A unified training encoder is then used to map this structured semantic description set into standard journal vectors embedded with subject attribute distributions, forming a high-dimensional standard feature space. Reference records to be standardized are obtained and parsed into context triples. These are input into the unified training encoder to generate query vectors. Vector similarity retrieval is performed to recall a preset number of candidate journals. Semantic similarity and literal similarity analysis are calculated for each candidate journal to establish a semantic constraint score. Citation environment topological consistency verification is performed on the candidate journals to establish a structural consistency score. Adaptive gating coefficients are configured for the semantic constraint score and structural consistency score, and score fusion is performed. Based on the score fusion result, a name standardization result is generated. This achieves accurate and automatic standardization of journal names, improving the robustness and accuracy in handling scenarios involving non-standard abbreviations and journals with the same name but different meanings.

Attached Figure Description

[0009] Figure 1 is a schematic diagram of the name standardization method fusing cascaded semantic mapping and citation environment features provided in an embodiment of this application; Figure 2 is a schematic diagram of the structure of the name standardization system fusing cascaded semantic mapping and citation environment features provided in an embodiment of this application.

[0010] Figure labeling: 10 High-dimensional standard feature space generation module, 20 Reference record parsing module, 30 Query vector generation module, 40 Candidate journal recall module, 50 Semantic constraint score establishment module, 60 Structural consistency score establishment module, 70 Score fusion execution module.

Detailed Implementation

[0011] This application provides a naming standardization method and system that integrates cascaded semantic mapping and citation context features. This method addresses the technical problems in existing technologies that focus only on textual features and ignore citation context features, making it difficult to effectively handle non-standard spelling of journal name abbreviations and unable to resolve disambiguation of journals with the same name but different meanings.

[0012] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.

[0013] Example 1: As shown in Figure 1, this application provides a name standardization method that fuses cascaded semantic mapping with citation environment features. The method includes: Step S100: Generate a structured semantic description set containing the standard name and subject classification information for each journal entity in the standard journal database, and use a unified training encoder to map the structured semantic description set into standard journal vectors embedded with subject attribute distributions to form a high-dimensional standard feature space.

[0014] Specifically, in the offline phase, for each journal entity in the standard journal database, a structured semantic description set containing the journal's full standard name and subject classification label is constructed. The subject classification label is based on the subject to which the JCR journal belongs. If the journal does not have a JCR subject classification, the Dewey classification of Ulrich journals is used to supplement it. The structured semantic description set is standardized and constructed in the format of "Standard Name: Full standard name of journal; Category: Subject classification information". Then, a pre-trained unified encoder is used to perform vector mapping on this structured semantic description set to generate a high-dimensional standard journal vector with embedded journal subject attribute distribution features. The standard journal vectors corresponding to all journal entities are integrated to finally form a high-dimensional standard feature space containing subject metadata, providing a basic feature library for subsequent vector retrieval of journal names.
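
The description format above can be illustrated with a minimal sketch. The function name, argument names, and the sample journal entries are hypothetical; only the "Standard Name: ...; Category: ..." layout and the JCR-then-Dewey fallback come from the text, and the encoding step itself is not shown.

```python
from typing import Optional


def build_description(standard_name: str,
                      jcr_category: Optional[str],
                      dewey_category: Optional[str]) -> str:
    """Format one journal entity as a structured semantic description."""
    # Prefer the JCR subject; fall back to the Dewey class when JCR is missing.
    category = jcr_category or dewey_category or ""
    return f"Standard Name: {standard_name}; Category: {category}"


# Hypothetical journal entities, used only to show the two branches.
descriptions = [
    build_description("Hypothetical Journal A", "Physics, Applied", None),
    build_description("Hypothetical Journal B", None, "Dewey 540"),
]
```

Each resulting string would then be passed to the unified encoder to obtain the standard journal vector.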

[0015] Step S200: Obtain the reference records to be standardized, and parse the reference records into context triples, wherein the context triples include core entities, semantic context and structural context.

[0016] Specifically, the online phase receives unstructured reference records to be standardized as input data, performs structured parsing on these records, extracts and constructs context triples containing core entities, semantic context, and structural context. The core entity is the original noisy text of the journal name to be standardized in the citing reference. The semantic context contains semantic information at two levels: the title of the cited paper and the title of the citing paper. If the reference only indicates the author's name and publication year of the cited paper but lacks the title of the cited paper, the title of the citing paper is directly used to replace the title of the cited paper in the semantic context. The structural context is the set of all sibling reference entries in the reference list of the citing paper, excluding the journal name entry to be standardized.
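
As a rough illustration of the context triple, the sketch below assumes the reference has already been split into fields by some upstream parser, which is not shown. The class and function names are hypothetical. It demonstrates only the two rules stated above: the citing title stands in for a missing cited title, and the structural context excludes the entry being standardized.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ContextTriple:
    core_entity: str            # raw, noisy journal name as written in the reference
    cited_title: Optional[str]  # title of the cited paper; may be missing
    citing_title: str           # title of the citing paper
    siblings: List[str] = field(default_factory=list)  # structural context

    @property
    def semantic_context(self) -> Tuple[str, str]:
        # A missing cited title is replaced by the citing title.
        return (self.cited_title or self.citing_title, self.citing_title)


def make_triple(reference_list: List[str], index: int, journal_field: str,
                cited_title: Optional[str], citing_title: str) -> ContextTriple:
    # Structural context: every sibling entry except the one being standardized.
    siblings = [r for i, r in enumerate(reference_list) if i != index]
    return ContextTriple(journal_field, cited_title, citing_title, siblings)
```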

[0017] Step S300: Concatenate the context triples, then input them into the unified training encoder to generate a query vector.

[0018] Specifically, for the parsed context triples, the core entity and semantic context are selected for concatenation. According to feature priority, the core entity is placed at the beginning of the concatenated sequence as a first-level feature, the title of the cited paper is placed immediately after the core entity as a second-level semantic feature, and the title of the citing paper is placed at the end of the concatenated sequence as a third-level auxiliary semantic feature. At the same time, preset separators are inserted between the features at each level to maintain the independence of the semantic boundaries of each part. The completed concatenated text sequence is then input into the same unified training encoder used in step S100. Through feature extraction and vector mapping by the encoder, a high-dimensional query vector that integrates the journal name text to be standardized and semantic context information is generated.

[0019] Step S400: Input the query vector into the high-dimensional standard feature space to perform vector similarity retrieval and recall a preset number of candidate journals.

[0020] Specifically, the generated high-dimensional query vector, which integrates the journal name to be standardized and semantic context information, is input into a pre-constructed high-dimensional standard feature space containing subject metadata. By calculating the cosine similarity between the query vector and the standard journal vectors corresponding to all journal entities in the feature space, vector similarity retrieval is carried out. The journals are sorted according to the cosine similarity and the top K journal entities with the highest similarity are recalled from the high-dimensional standard feature space to form a set of candidate journals to be screened, providing a basis for subsequent semantic and structural dimension scoring and verification.
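
The recall step can be sketched as a brute-force cosine ranking. This is an illustrative sketch only: the three-dimensional vectors and journal keys are made up, and a real deployment with 768-dimensional vectors would likely use a vector index rather than a full scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_top_k(query: Sequence[float],
                 space: Dict[str, Sequence[float]],
                 k: int) -> List[Tuple[str, float]]:
    """Rank every standard journal vector against the query and keep the top K."""
    scored = [(name, cosine(query, vec)) for name, vec in space.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy feature space with made-up vectors.
space = {
    "journal_a": [1.0, 0.0, 0.0],
    "journal_b": [0.7, 0.7, 0.0],
    "journal_c": [0.0, 0.0, 1.0],
}
top = recall_top_k([0.9, 0.1, 0.0], space, k=2)
```

Here `top` holds `journal_a` followed by `journal_b`; `journal_c` is orthogonal to the query and is not recalled.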

[0021] Step S500: Calculate semantic similarity and literal similarity analysis for each candidate journal and establish semantic constraint scores.

[0022] Specifically, for each candidate journal in the recalled candidate journal set, similarity is calculated in both the semantic and literal dimensions, and a semantic constraint score is established by weighting the two. First, the cosine similarity between the generated query vector and the standard journal vector corresponding to the candidate journal is calculated and used as the semantic similarity. Then, the original text of the core entity, i.e., the name of the journal to be standardized, is aligned with the standard name of the candidate journal at the character level, and the normalized edit distance between the two is calculated as the literal similarity. Finally, the semantic similarity and literal similarity are weighted and combined according to the preset hyperparameter weights $w_1$ and $w_2$, and the semantic constraint score of each candidate journal is calculated through the formula $S_{sem}(c_i) = w_1 \cdot Sim_{cos}(q, v_i) + w_2 \cdot Sim_{edit}(e, n_i)$, forming a dual constraint of literal and semantic meaning that prevents semantic drift in generative models. Here, $S_{sem}(c_i)$ represents the semantic constraint score of the i-th candidate journal; $w_1$ and $w_2$ are the preset hyperparameter weights, corresponding to the weight ratios of semantic similarity and literal similarity, respectively; $Sim_{cos}(q, v_i)$ is the cosine similarity between the query vector $q$ and the standard journal vector $v_i$ corresponding to the i-th candidate journal; and $Sim_{edit}(e, n_i)$ is the normalized edit distance literal similarity between the original text $e$ of the journal name to be standardized and the standard name $n_i$ of the i-th candidate journal after character-level alignment.

[0023] Step S600: Perform citation environment topology consistency verification on the candidate journals and establish structural consistency scores.

[0024] Specifically, based on the co-citation principle, a topological consistency verification of the citation environment is performed on each candidate journal in the candidate journal set, thereby establishing a structural consistency score for each candidate journal. First, the parsed structural context, i.e., the list of sibling documents, is traversed to identify journals that have already been normalized. Based on the identification results, the subject distribution of these sibling journals is statistically analyzed, and the environmental subject vector $V_{env}$ of the citing paper is generated. Then, the inherent standard subject attribute vector $V_{class_i}$ of each candidate journal is extracted from the standard journal database. The matching degree between the environmental subject vector $V_{env}$ and the standard subject attribute vector $V_{class_i}$ is calculated using either the vector dot product or the inverse of the KL divergence, and this matching degree is the structural consistency score $S_{topo}(c_i)$ of the corresponding candidate journal. This achieves implicit disambiguation of candidate journals, ensuring that the candidate journals are consistent with the subject attributes of the current citation context.
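
The two matching options named above, dot product and inverse KL divergence, can be sketched as follows. Representing the vectors as subject-to-weight dictionaries, the smoothing constant `eps`, and the mapping 1 / (1 + KL) used to turn a divergence into a bounded score are all assumptions made for illustration; the text does not specify them.

```python
import math
from typing import Dict


def dot_match(v_env: Dict[str, float], v_class: Dict[str, float]) -> float:
    """Dot product over the subjects shared by both vectors."""
    return sum(p * v_class.get(subject, 0.0) for subject, p in v_env.items())


def inverse_kl_match(v_env: Dict[str, float], v_class: Dict[str, float],
                     eps: float = 1e-9) -> float:
    """Map KL(v_env || v_class) to (0, 1]; identical distributions give 1.0."""
    kl = 0.0
    for subject in set(v_env) | set(v_class):
        p = v_env.get(subject, 0.0) + eps    # smoothing avoids log(0)
        q = v_class.get(subject, 0.0) + eps
        kl += p * math.log(p / q)
    # Assumed monotone mapping from divergence to a matching score.
    return 1.0 / (1.0 + max(kl, 0.0))


# Made-up distributions: a citing paper whose siblings are mostly physics.
v_env = {"Physics": 0.75, "Chemistry": 0.25}
physics_journal = {"Physics": 1.0}
medical_journal = {"Medicine": 1.0}
```

With these toy inputs both measures rank `physics_journal` above `medical_journal`, which is the disambiguation effect the paragraph describes for same-name journals in different subjects.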

[0025] Step S700: Configure adaptive gating coefficients for the semantic constraint score and structural consistency score, perform score fusion, and generate name standardization results based on the score fusion results.

[0026] Specifically, two core features are first extracted from the data to be standardized in order to configure the adaptive gating coefficient: the semantic information entropy of the core entity (the text of the journal name to be standardized), which characterizes textual uncertainty, and the document structure density of the sibling document list within the structural context, which characterizes the topological completeness of the citation environment. The adaptive gating coefficient is then calculated from these two features by a formula that includes a text uncertainty amplification factor and a topological completeness amplification factor, and whose mapping function maps the calculation result to the (0, 1) interval, realizing dynamic weight adjustment of the semantic and structural channels. Next, the adaptive gating coefficient is used to perform weighted fusion of the semantic constraint score and the structural consistency score of each candidate journal, yielding the final fusion score of the i-th candidate journal from its semantic constraint score and its structural consistency score. The candidate journals are sorted from high to low by final fusion score, and the candidate journal with the highest score is selected as the preliminary standardization result. Finally, the confidence of the preliminary standardization result is checked. If its final fusion score exceeds the preset confidence threshold, the journal name is output directly as the final name standardization result. If it is lower than the preset confidence threshold, the "Unknown" label is output and the manual review process is triggered, completing the journal name standardization operation.
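
The fusion and confidence check can be sketched as below. The sketch takes the gating coefficient as an input and does not compute it, since this text does not give its formula. It also assumes a convex combination in which the gate weights the semantic channel and its complement weights the structural channel; that assignment is an assumption. All names and numbers are hypothetical.

```python
from typing import Dict, Tuple


def fuse(s_sem: float, s_topo: float, gate: float) -> float:
    """Assumed convex combination of the two channel scores, gate in (0, 1)."""
    return gate * s_sem + (1.0 - gate) * s_topo


def standardize(candidates: Dict[str, Tuple[float, float]],
                gate: float, threshold: float) -> str:
    """Return the best candidate, or "Unknown" when confidence is too low."""
    if not candidates:
        return "Unknown"
    fused = {name: fuse(s_sem, s_topo, gate)
             for name, (s_sem, s_topo) in candidates.items()}
    best = max(fused, key=fused.get)
    # Below the confidence threshold the record goes to manual review.
    return best if fused[best] > threshold else "Unknown"


# Hypothetical candidates: (semantic constraint score, structural consistency score).
candidates = {"journal_a": (0.90, 0.80), "journal_b": (0.95, 0.10)}
```

With `gate=0.5`, `journal_a` wins despite its lower semantic score because it fits the citation environment; raising `threshold` above its fused score yields "Unknown".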

[0027] In one possible implementation, step S300 further includes: Step S310: Concatenate the core entity and semantic context within the context triple according to priority, wherein the core entity is placed at the beginning of the concatenation sequence as a first-level feature, the title of the cited paper is placed after the core entity as a second-level semantic feature, and the title of the citing paper is placed at the end of the concatenation sequence as a third-level auxiliary semantic feature.

[0028] Step S320: Insert preset delimiters between features at each level to maintain semantic boundary independence.

[0029] Step S330: When the title of the cited paper is missing in the semantic context, the title of the citing paper is used in place of the title of the cited paper in the splicing process.

[0030] Step S340: Input the concatenated sequence into the unified training encoder to generate a query vector.

[0031] Specifically, for the parsed context triples, the core entity and semantic context are selected and concatenated in priority order. The core entity, i.e., the original noisy text of the journal name to be standardized in the reference, is placed at the beginning of the concatenated sequence as a first-level feature. The title of the cited paper, which is a strong semantic feature and can directly describe the content of the journal, is placed after the core entity as a second-level semantic feature. The title of the citing paper, which is an auxiliary semantic (background) feature and can describe the citation environment, is placed at the end of the concatenated sequence as a third-level auxiliary semantic feature. This completes the priority-order concatenation of the core entity and semantic context, laying the foundation for the subsequent generation of query vectors that integrate multi-dimensional semantics.

[0032] After prioritizing the core entities and semantic context, to avoid semantic information confusion and loss of feature independence among the features at different levels (first-level core entities, second-level cited paper titles, and third-level citing paper titles), a pre-defined separator is inserted between adjacent feature sequences at different levels. This separator clearly separates the semantic features at each level, effectively maintaining the independence of the semantic boundaries of each part. This ensures that the subsequent unified training encoder can accurately identify and distinguish semantic features of different priorities during feature extraction, thus guaranteeing the accuracy of cascaded semantic coding.

[0033] Before prioritizing the concatenation of core entities and semantic context, the completeness of the components of the semantic context is first checked. If it is found that the reference only indicates the author's name and publication year of the cited literature and does not include the title of the cited paper, according to the construction rules of semantic context, the title of the citing paper corresponding to the reference is directly used to replace the missing title of the cited paper. This title is then included in the concatenation sequence at the position of the original secondary semantic features to participate in the concatenation, ensuring the information completeness of the semantic context. This ensures that the query vector generated by subsequent encoding can integrate effective semantic information and avoid the problem of insufficient semantic features caused by the missing title of the cited paper.
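
Steps S310 to S330 can be sketched together as a small string-building function. The separator string " [SEP] " and the function name are assumptions; the text only says that a preset delimiter is inserted between levels.

```python
from typing import Optional

SEP = " [SEP] "  # hypothetical preset delimiter


def build_query_text(core_entity: str,
                     cited_title: Optional[str],
                     citing_title: str,
                     sep: str = SEP) -> str:
    """Concatenate the three feature levels in priority order."""
    # S330: a missing cited title is replaced by the citing title,
    # which then fills the second-level position.
    second_level = cited_title if cited_title else citing_title
    # S310 + S320: core entity first, then cited title, then citing title,
    # with the delimiter between adjacent levels.
    return sep.join([core_entity, second_level, citing_title])
```

For example, `build_query_text("J Hypoth Res", None, "A citing title")` repeats the citing title in the second and third positions, so the sequence always has three delimited levels.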

[0034] The complete concatenated sequence, after priority concatenation, delimiter insertion, and missing information completion, is input into the same unified training encoder that was used to construct the high-dimensional standard feature space. This unified training encoder is built on a multi-layer bidirectional Transformer encoder structure containing 12 Transformer encoding layers, each with 12 attention heads, and a hidden layer dimension of 768. It has been pre-trained using a structured semantic description dataset from a standard journal database. The pre-training process employs dual-task joint training of Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) to learn semantic representations and subject-related features of text such as journal names, subject classifications, and paper titles. In use, the concatenated sequence is first tokenized and given position encoding and segment encoding, and then input into the encoder. The self-attention mechanism of each Transformer layer models the semantic associations among the different levels of features in the sequence, namely the first-level core entity, the second-level semantic features, and the third-level auxiliary semantic features, and extracts the contextual dependencies and subject-related features of each feature. Finally, the encoder outputs a 768-dimensional vector at the [CLS] position. This vector is the query vector that integrates the text of the journal name to be standardized and the full semantic context information, providing an accurate semantic representation basis for subsequent vector similarity retrieval in the high-dimensional standard feature space.

[0035] In one possible implementation, step S500 further includes: Step S510: Calculate the cosine similarity between the query vector and the standard journal vector corresponding to the candidate journal as the semantic similarity.

[0036] Step S520: Perform character-level alignment processing on the core entity and the standard name of the candidate journal, and calculate the normalized edit distance as the literal similarity.

[0037] Step S530: The semantic similarity and literal similarity are weighted and combined to generate a semantic constraint score.

[0038] Specifically, for each candidate journal recalled from the high-dimensional standard feature space, the corresponding standard journal vector with embedded subject attribute distribution is first extracted from the feature space. Then, the standard journal vector and the generated query vector, which integrates the journal name to be standardized and the full semantic context information, are used to calculate the cosine similarity. By quantifying the cosine value of the angle between the two high-dimensional semantic vectors, the degree of association and matching between the two at the semantic level is characterized. Finally, the calculated cosine similarity result is directly used as the semantic similarity of the candidate journal. The higher the value, the stronger the matching degree between the candidate journal and the journal name to be standardized at the semantic and subject attribute levels.

[0039] For the core entity to be standardized, namely the original noisy text of the journal name in the cited references, a character-level matching analysis is performed with the standard full name of the candidate journal. First, the two are aligned at the character level, and the character differences and positional relationships between the texts are compared character by character. Then, the minimum number of character insertions, deletions, and replacements required to convert the original text of the core entity into the standard name of the candidate journal is calculated. Subsequently, the minimum number of edits is normalized to eliminate the influence of the difference in text length between different journal names on the calculation results. The normalized edit distance is used as the literal similarity between the two, thereby achieving accurate matching of the literal form of the journal name, forming a literal constraint, and preventing the problem of excessive semantic association with comprehensive journals in the subsequent semantic matching process.
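
The literal similarity can be sketched with a standard Levenshtein distance. Two choices here are assumptions rather than details from the text: the distance is normalized by the longer string's length, and it is converted to a similarity as 1 minus the normalized distance so that higher values mean a closer match. Case folding before comparison is also an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def literal_similarity(raw_name: str, standard_name: str) -> float:
    """1 - normalized edit distance, in [0, 1]; 1.0 means identical after case folding."""
    a, b = raw_name.casefold(), standard_name.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```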

[0040] The semantic similarity and the literal similarity obtained in the previous steps are used as dual features and linearly weighted and combined according to the preset hyperparameter weights $w_1$ and $w_2$, using the formula $S_{sem}(c_i) = w_1 \cdot Sim_{cos}(q, v_i) + w_2 \cdot Sim_{edit}(e, n_i)$. Here, $S_{sem}(c_i)$ represents the semantic constraint score of the i-th candidate journal; $w_1$ and $w_2$ are the preset hyperparameter weights corresponding to semantic similarity and literal similarity, respectively; $Sim_{cos}(q, v_i)$ is the semantic similarity calculated between the query vector $q$ and the standard journal vector $v_i$ corresponding to the i-th candidate journal; and $Sim_{edit}(e, n_i)$ is the literal similarity calculated by aligning the original text $e$ of the journal name to be standardized with the standard name $n_i$ of the i-th candidate journal at the character level. The hyperparameter weights can be optimized according to the actual journal standardization scenario to obtain the semantic constraint score for each candidate journal. The weighted combination constructs a dual constraint mechanism of literal and semantic matching, which retains the semantic association matching ability brought by high-dimensional vector mapping while avoiding the semantic drift problem of generative models through the constraint of literal similarity. At the same time, it can effectively suppress mismatches caused by excessive semantic association. For example, when the core entity to be standardized is "Nature", the extremely high literal similarity with the exact-name journal constrains the model and prevents it from mismatching sub-journals that are semantically similar but literally very different, thus ensuring the accuracy and rationality of the semantic constraint score.
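
The effect of the literal constraint can be shown with a small worked example. The weights 0.6 and 0.4 and all similarity values are invented for illustration; the text states only that the weights are preset hyperparameters.

```python
def semantic_constraint_score(sem_sim: float, lit_sim: float,
                              w1: float = 0.6, w2: float = 0.4) -> float:
    """Linear weighted combination of semantic and literal similarity."""
    return w1 * sem_sim + w2 * lit_sim


# Hypothetical candidates for a raw name that literally matches one journal:
# (semantic similarity, literal similarity).
candidates = {
    "exact-name journal": (0.90, 1.00),
    "sub-journal": (0.93, 0.40),
}
scores = {name: semantic_constraint_score(sem, lit)
          for name, (sem, lit) in candidates.items()}
best = max(scores, key=scores.get)
```

Although the sub-journal has the slightly higher semantic similarity, the literal term lifts the exact-name journal to the top, which is the behavior described for the "Nature" case.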

[0041] In one possible implementation, step S600 further includes: Step S610: Traverse the structural context, identify the normalized journals in the structural context, statistically analyze the subject distribution of sibling journals based on the traversal results, and generate the environmental subject vector of the citing paper.

[0042] Step S620: Extract the inherent standard subject attribute vector from the standard journal database for each candidate journal.

[0043] Step S630: Calculate the matching degree between the environmental subject vector and the standard subject attribute vector, and generate a structural consistency score.

[0044] Specifically, relying on the co-citation principle, a full traversal of the structural context in the context triple, i.e., the set of sibling document entries in the citing paper's reference list, is performed. Normalized journal entities in the list are identified one by one. Based on the identified normalized sibling journals, the number and proportion of journals corresponding to each subject category are statistically analyzed to form a quantitative subject distribution characteristic. This subject distribution characteristic is then transformed into a structured citing paper environment subject vector $V_{env}$. This vector characterizes the overall disciplinary attributes and distribution characteristics of the citation environment in which the journal name to be standardized is located.

[0045] For each candidate journal recalled in the high-dimensional standard feature space, the standard subject attribute vector inherent in the subject classification system of the journal is extracted from the pre-constructed standard journal database using the journal's standard name as an index. This vector has been standardized and encoded in the offline construction stage and directly embeds multi-dimensional subject tag features such as the first-level discipline, second-level discipline and subdivided research direction of the candidate journal. It can objectively and uniquely represent the subject attributes of the candidate journal itself, and provide a stable and unified standard reference for subsequent matching degree calculation with the subject vector of the citing paper environment.

[0046] The generated citing paper environment subject vector is matched with the extracted standard subject attribute vector of each candidate journal at the vector level. The consistency of the subject distribution between the two is quantified by the inverse of the KL divergence or by cosine similarity. The calculated matching degree is directly used as the structural consistency score of the candidate journal. The higher the score, the better the inherent subject attributes of the candidate journal fit the overall subject topology of the citation environment. This achieves implicit disambiguation based on the citation environment, effectively distinguishes journals with the same name but different subjects, and improves the accuracy and reliability of candidate journal selection.
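Steps S610 to S630 can be sketched as follows, using the cosine similarity option. The subject category list, the one-label-per-journal input, and the function names are hypothetical simplifications; the subject attribute vectors in this application are described as multi-level encodings, which this sketch does not attempt to reproduce.

```python
import math
from collections import Counter
from typing import List, Sequence

SUBJECTS = ["physics", "biology", "medicine"]  # hypothetical category list


def environment_subject_vector(sibling_subjects: Sequence[str],
                               subjects: Sequence[str] = SUBJECTS) -> List[float]:
    """Proportion of normalized sibling journals falling in each subject."""
    counts = Counter(sibling_subjects)
    total = sum(counts[s] for s in subjects)
    if total == 0:
        return [0.0] * len(subjects)
    return [counts[s] / total for s in subjects]


def structural_consistency(v_env: Sequence[float],
                           v_candidate: Sequence[float]) -> float:
    """Cosine similarity between environment and candidate subject vectors."""
    dot = sum(x * y for x, y in zip(v_env, v_candidate))
    n_env = math.sqrt(sum(x * x for x in v_env))
    n_cand = math.sqrt(sum(y * y for y in v_candidate))
    if n_env == 0.0 or n_cand == 0.0:
        return 0.0
    return dot / (n_env * n_cand)


if __name__ == "__main__":
    v_env = environment_subject_vector(["physics", "physics", "biology", "physics"])
    print(v_env)
    print(structural_consistency(v_env, [1.0, 0.0, 0.0]))  # physics journal
    print(structural_consistency(v_env, [0.0, 0.0, 1.0]))  # medical journal
```

In this toy reference list dominated by physics journals, a physics candidate scores far higher than a medical candidate with the same name, which is the disambiguation effect the paragraph describes.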

[0047] In one possible implementation, step S700 further includes: Step S710: Calculate the semantic information entropy of the text of the journal name to be standardized, which is used to characterize the text uncertainty.

[0048] Step S720: Calculate the document structure density of the structural context to characterize the topological completeness of the citation environment.

[0049] Step S730: Generate adaptive gating coefficients based on the text uncertainty and topological completeness.

[0050] Specifically, for the original noisy text $T_{target}$ of the core entity in the context triple, namely the journal name to be standardized in the citing paper's references, its semantic information entropy $H(T_{target})$ is calculated. This semantic information entropy is calculated by the classic information entropy formula $H(T_{target}) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$, where $x_i$ represents a character unit in the text of the journal name to be standardized, $p(x_i)$ represents the probability of that character unit appearing in the text, and $n$ is the total number of different character units in the text. This entropy value is specifically used to quantify the semantic ambiguity and text uncertainty of the text of the journal name to be standardized. The shorter the text of the journal name to be standardized, the higher the degree of abbreviation, and the more ambiguous the semantic direction, the higher its semantic information entropy value and the higher the corresponding text uncertainty. Conversely, the text is more certain. This entropy value provides the core quantitative feature basis for the subsequent calculation of the adaptive gating coefficient at the level of the text to be standardized.
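The entropy formula can be evaluated directly from character frequencies. The sketch below is a minimal illustration that assumes each character is one character unit and uses a base-2 logarithm; neither choice is specified in the text, and the function name is hypothetical.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy over character frequencies (base 2 is an assumption)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # p * log2(1/p) summed over distinct characters equals -sum(p * log2(p)).
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    for name in ["aaaa", "ab", "abcd"]:
        print(name, char_entropy(name))
```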

[0051] For the structural context $L_{siblings}$ in the context triple, namely the set of sibling references in the citing paper's bibliography, its document structure density $D(L_{siblings})$ is calculated. This density value is specifically used to quantify the topological completeness and structural richness of the citation environment in which the journal name to be standardized is located. The calculation is based on the set of sibling documents: the ratio of the number of journals that have completed the standardization process to the total number of sibling documents is counted and, combined with the uniformity of the distribution of effective subject tags of the standardized journals, a comprehensive quantification is performed. The higher the document structure density value, the richer the available subject topological features in the citation environment and the stronger the topological completeness of the citation environment. Conversely, the lower the density value, the sparser the topological features of the citation environment and the weaker the completeness. This density value provides the core quantitative feature basis for the subsequent calculation of the adaptive gating coefficient at the citation environment level.
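A partial sketch of the density calculation follows. It covers only the first component, the ratio of standardized sibling journals to all sibling documents. The text does not state how this ratio is combined with the uniformity of the subject tag distribution, so that term is deliberately left out rather than guessed. The function name and the use of None for unresolved entries are hypothetical.

```python
from typing import Optional, Sequence


def standardized_ratio(siblings: Sequence[Optional[str]]) -> float:
    """Share of sibling references whose journal is already standardized.

    Each entry is a standard journal name, or None if the sibling reference
    has not been standardized. This is only the ratio component of the
    density; the tag-uniformity component is not modeled here.
    """
    if not siblings:
        return 0.0
    resolved = sum(1 for name in siblings if name is not None)
    return resolved / len(siblings)


if __name__ == "__main__":
    print(standardized_ratio(["Nature", None, "Science", None]))
```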

[0052] The semantic information entropy $H(T_{target})$ of the journal name text to be standardized, which represents the text uncertainty, and the document structure density $D(L_{siblings})$ of the structural context, which characterizes the topological completeness of the citation environment, are taken as the core input features and substituted into the preset adaptive gating coefficient formula $\alpha = \mathrm{sigmoid}(k_1 \cdot H(T_{target}) + k_2 \cdot D(L_{siblings}))$ to generate the adaptive gating coefficient $\alpha$, where $k_1$ and $k_2$ are preset hyperparameters and sigmoid is the logistic function, which maps the calculation result to the range of 0 to 1. This coefficient serves as the dynamic fusion weight of the semantic and structural channels. Its value is adaptively adjusted according to changes in text uncertainty and topological completeness of the citation environment, thereby realizing intelligent weight allocation between the scores of the two channels and providing a core basis for the subsequent weighted fusion of semantic and structural scores.

[0053] In one possible implementation, step S730 further includes: The adaptive gating coefficient is calculated as $\alpha = \mathrm{sigmoid}(k_1 \cdot H(T_{target}) + k_2 \cdot D(L_{siblings}))$; where $\alpha$ denotes the adaptive gating coefficient, sigmoid is the logistic function used to map the result of the linear combination to the interval (0, 1), $H(T_{target})$ denotes the semantic information entropy, $D(L_{siblings})$ denotes the document structure density, $k_1$ is the text uncertainty amplification factor, and $k_2$ is the topological completeness amplification factor.

[0054] Specifically, this implementation further defines the calculation method for generating the adaptive gating coefficient based on text uncertainty and topological completeness. The adaptive gating coefficient is calculated using the formula $\alpha = \mathrm{sigmoid}(k_1 \cdot H(T_{target}) + k_2 \cdot D(L_{siblings}))$, where $\alpha$ is the final adaptive gating coefficient, used as the dynamic fusion weight for the semantic constraint score and the structural consistency score. The sigmoid function is a logistic activation function, which maps the result of the linear combination of text uncertainty and topological completeness to the (0, 1) interval, ensuring that the range of coefficient values meets the requirements of weighted fusion. $H(T_{target})$ is the semantic information entropy of the journal name text to be standardized, used to characterize the text uncertainty; $D(L_{siblings})$ is the document structure density of the structural context, used to characterize the topological completeness of the citation environment; $k_1$ is the text uncertainty amplification factor and $k_2$ is the topological completeness amplification factor. Both are preset hyperparameters that can be optimized according to the actual journal name standardization scenario and data characteristics. This enables flexible control of the influence weights of text uncertainty and topological completeness, allowing the adaptive gating coefficient to better meet the actual score fusion requirements.
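The gating calculation can be sketched as a logistic function applied to a linear combination of the two input features. The exact form and the signs of the terms are not recoverable from the text, so this sketch assumes a plain weighted sum in which any sign is carried by the hyperparameters k1 and k2. The function names and the parameter values in the usage lines are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def gating_coefficient(entropy: float, density: float,
                       k1: float, k2: float) -> float:
    """Assumed form: sigmoid(k1 * entropy + k2 * density)."""
    return sigmoid(k1 * entropy + k2 * density)


if __name__ == "__main__":
    # Hypothetical hyperparameters; a negative k2 makes a dense citation
    # environment shift weight away from the semantic channel.
    print(gating_coefficient(entropy=2.0, density=0.5, k1=0.5, k2=-1.0))
    print(gating_coefficient(entropy=2.0, density=1.0, k1=0.5, k2=-1.0))
```

Because the output always lies between 0 and 1, the coefficient and its complement can be used directly as the two fusion weights.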

[0055] In one possible implementation, step S700 further includes: Step S740: Determine whether the score fusion result is lower than a preset confidence threshold.

[0056] Step S750: If the score fusion result is lower than the preset confidence threshold, an unknown identifier is output and a manual review process is triggered.

[0057] Specifically, after the semantic constraint score and the structural consistency score are weighted and fused using the adaptive gating coefficient, the final score fusion result $S_{final}(c_i) = \alpha \cdot S_{sem}(c_i) + (1 - \alpha) \cdot S_{topo}(c_i)$ is obtained for each candidate journal corresponding to the journal name to be standardized, where $S_{final}(c_i)$ represents the final score fusion result of the i-th candidate journal, $\alpha$ represents the adaptive gating coefficient, $S_{sem}(c_i)$ represents the semantic constraint score of the i-th candidate journal, and $S_{topo}(c_i)$ represents the structural consistency score of the i-th candidate journal. Each score fusion result is compared with a confidence threshold preset by the system based on the required standardization accuracy of journal names and the actual data characteristics. Whether $S_{final}(c_i)$ is below the preset threshold is the key quantitative criterion for determining whether the journal name can be automatically standardized. This step provides the core quantitative basis for generating the standardization result or triggering the manual review process, and is crucial for ensuring the effectiveness of the automatic standardization of journal names.

[0058] If the threshold judgment shows that the final score fusion result $S_{final}$ of the candidate journals corresponding to the journal name to be standardized is below the system's preset confidence threshold, this indicates that the automatic machine matching result has not met the criteria for effective standardization. In this case, the system outputs an "Unknown" identifier for the original noisy text of the journal name to be standardized and automatically triggers a manual review process. The reference record to be standardized and the matching information of the related candidate journals are synchronized to the manual review stage, where professionals conduct manual verification and standardization judgment. This compensates for the limitations of automatic machine processing in complex scenarios such as extreme abbreviations and cross-disciplinary homonyms, and ensures the overall accuracy and robustness of the automatic standardization method for journal names based on the fusion of cascaded semantic mapping and citation environment topological features.
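The fusion and confidence check can be sketched as follows. The threshold value of 0.7, the candidate scores, and the function names are invented for illustration, and selecting the single highest-scoring candidate before applying the threshold is an assumption about how the per-candidate comparison is resolved.

```python
from typing import Dict, Tuple

UNKNOWN = "Unknown"


def fuse(alpha: float, s_sem: float, s_topo: float) -> float:
    """S_final = alpha * S_sem + (1 - alpha) * S_topo."""
    return alpha * s_sem + (1.0 - alpha) * s_topo


def standardize(candidates: Dict[str, Tuple[float, float]],
                alpha: float, threshold: float = 0.7) -> Tuple[str, float]:
    """Return (name, score) of the best candidate, or ("Unknown", score).

    `candidates` maps a standard journal name to (S_sem, S_topo).
    A result of "Unknown" is the signal to route the record to manual review.
    """
    if not candidates:
        return UNKNOWN, 0.0
    scored = ((name, fuse(alpha, s_sem, s_topo))
              for name, (s_sem, s_topo) in candidates.items())
    best_name, best_score = max(scored, key=lambda item: item[1])
    if best_score < threshold:
        return UNKNOWN, best_score
    return best_name, best_score


if __name__ == "__main__":
    print(standardize({"Nature": (0.95, 0.90), "Nature Physics": (0.74, 0.30)}, alpha=0.5))
    print(standardize({"Journal X": (0.40, 0.20)}, alpha=0.5))
```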

[0059] In one possible implementation, step S700 further includes: The structural consistency score is also used to constrain and correct the semantic constraint score. When the structural consistency score of any candidate journal is lower than the preset score threshold, a decay constraint factor is applied to the semantic constraint score of the corresponding candidate journal to weaken the ranking weight of candidate journals that are inconsistent with the subject distribution of the citation environment.

[0060] Specifically, in the automatic standardization process of journal names based on the fusion of cascaded semantic mapping and citation environment topological features, the structural consistency score $S_{topo}(c_i)$, in addition to participating in the final weighted fusion as the core score of the structural channel, is also used to apply a pre-constraint correction to the semantic constraint score $S_{sem}(c_i)$. The system presets a structural consistency score threshold and, for each recalled candidate journal $c_i$, first verifies whether its structural consistency score is lower than the preset score threshold. If the structural consistency score of a candidate journal does not reach the threshold, the inherent subject attributes of the candidate journal do not match the subject topology distribution of the citation environment in which the journal name to be standardized is located. In this case, a decay constraint factor is applied to the semantic constraint score of that candidate journal, reducing the weight of its semantic constraint score through numerical decay. This weakens the ranking competitiveness of candidate journals that are inconsistent with the subject distribution of the citation environment and prevents them from obtaining an improper ranking on the strength of a high semantic matching degree alone, further improving the accuracy of candidate journal selection and making the final matching result more consistent with the actual citation context of the journal name to be standardized.
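The pre-constraint correction can be illustrated with a short sketch. The text only states that a decay constraint factor is applied; treating it as a multiplicative factor between 0 and 1 is an assumption, and the threshold 0.3, the factor 0.5, and the function name are hypothetical.

```python
def apply_decay(s_sem: float, s_topo: float,
                topo_threshold: float = 0.3, decay: float = 0.5) -> float:
    """Attenuate S_sem when S_topo falls below the preset score threshold.

    Multiplicative decay is an assumed form; candidates at or above the
    threshold keep their semantic constraint score unchanged.
    """
    if s_topo < topo_threshold:
        return s_sem * decay
    return s_sem


if __name__ == "__main__":
    print(apply_decay(0.9, 0.2))  # inconsistent with the citation environment
    print(apply_decay(0.9, 0.5))  # consistent, left unchanged
```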

[0061] In one possible implementation, step S700 further includes: The semantic constraint score and structural consistency score are weighted and fused using the adaptive gating coefficient, and the fused score result is output.

[0062] Specifically, in the automatic standardization method for journal names based on the fusion of cascaded semantic mapping and citation environment topological features, the structural consistency score $S_{topo}(c_i)$ is obtained after the verification of citation environment topological consistency, and the semantic constraint score $S_{sem}(c_i)$ is obtained after the structural constraint correction. After an adaptive gating coefficient $\alpha$ adapted to the features of the text to be standardized and the citation environment is generated, $\alpha$ is used as the dynamic weight of the semantic constraint score and $(1 - \alpha)$ serves as the dynamic weight of the structural consistency score. According to the formula $S_{final}(c_i) = \alpha \cdot S_{sem}(c_i) + (1 - \alpha) \cdot S_{topo}(c_i)$, a weighted sum of the scores from the two channels is calculated, achieving a deep fusion of semantic features and citation environment topological features, and the fusion score result for each candidate journal is output. This result serves as the core quantitative indicator of the matching degree of candidate journals, providing a crucial basis for the subsequent confidence threshold judgment and the generation of journal name standardization results.

[0063] Example 2: Based on the same inventive concept as the naming standardization method that fuses cascaded semantic mapping and referencing environment features in the aforementioned example, and as shown in Figure 2, this application provides a name standardization system that integrates cascaded semantic mapping and referencing environment features. The system and method embodiments in this application are based on the same inventive concept. The system includes: The high-dimensional standard feature space generation module 10 is used to generate a set of structured semantic descriptions containing standard names and their subject classification information for each journal entity in the standard journal database, and to use a unified training encoder to map the set of structured semantic descriptions into standard journal vectors embedded with subject attribute distributions, thus forming a high-dimensional standard feature space.

[0064] The reference record parsing module 20 is used to obtain the reference record to be standardized and parse the reference record into a context triple, which includes a core entity, semantic context and structural context.

[0065] The query vector generation module 30 is used to concatenate the context triples and then input them into the unified training encoder to generate a query vector.

[0066] The candidate journal recall module 40 is used to input the query vector into the high-dimensional standard feature space to perform vector similarity retrieval and recall a preset number of candidate journals.

[0067] The semantic constraint score building module 50 is used to perform semantic similarity and literal similarity analysis for each candidate journal and build a semantic constraint score.

[0068] The structural consistency score establishment module 60 is used to perform citation environment topological consistency verification on the candidate journals and establish a structural consistency score.

[0069] The score fusion execution module 70 is used to configure adaptive gating coefficients for the semantic constraint score and structural consistency score, perform score fusion, and generate a name standardization result based on the score fusion result.
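Among the modules listed above, the vector similarity retrieval performed by the candidate journal recall module can be sketched as a brute-force top-k search. The two-dimensional vectors, the journal labels, and the function names are toy assumptions; the application does not specify the index structure, and a real feature space would hold encoder outputs of much higher dimension.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def recall_candidates(query: Sequence[float],
                      space: Dict[str, Sequence[float]],
                      k: int = 3) -> List[Tuple[str, float]]:
    """Return the k standard journals most similar to the query vector."""
    scored = [(name, cosine(query, vec)) for name, vec in space.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    toy_space = {"Journal A": [1.0, 0.0], "Journal B": [0.0, 1.0], "Journal C": [1.0, 1.0]}
    print(recall_candidates([1.0, 0.0], toy_space, k=2))
```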

[0070] Furthermore, the system is also used for the following functions: The core entity within the context triple is concatenated with the semantic context according to priority. The core entity is placed at the beginning of the concatenation sequence as a first-level feature, the title of the cited paper is placed after the core entity as a second-level semantic feature, and the title of the citing paper is placed at the end of the concatenation sequence as a third-level auxiliary semantic feature. Preset separators are inserted between each level of features to maintain the independence of semantic boundaries. When the title of the cited paper is missing in the semantic context, the title of the citing paper is used in place of the title of the cited paper in the concatenation. The concatenated concatenation sequence is input into the unified training encoder to generate a query vector.
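The priority-ordered concatenation described above can be sketched as follows. The separator string " [SEP] " and the function name are hypothetical. When the cited paper's title is missing, this sketch places the citing paper's title in the second slot and also keeps it in the third slot; whether the third slot is retained in that case is an assumption.

```python
from typing import Optional

SEPARATOR = " [SEP] "  # hypothetical preset separator


def build_query_text(core_entity: str,
                     cited_title: Optional[str],
                     citing_title: str,
                     sep: str = SEPARATOR) -> str:
    """Concatenate core entity, cited title and citing title in priority order."""
    # Fall back to the citing paper's title when the cited title is missing.
    second_level = cited_title if cited_title else citing_title
    return sep.join([core_entity, second_level, citing_title])


if __name__ == "__main__":
    print(build_query_text("Nat. Phys.", "Quantum spin liquids", "A survey of frustrated magnets"))
    print(build_query_text("Nat. Phys.", None, "A survey of frustrated magnets"))
```

The resulting string is what would be passed to the encoder to produce the query vector.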

[0071] Furthermore, the system is also used for the following functions: The cosine similarity between the query vector and the standard journal vector corresponding to the candidate journal is calculated as the semantic similarity; the core entity and the standard name of the candidate journal are aligned at the character level, and the normalized edit distance is calculated as the literal similarity; the semantic similarity and literal similarity are weighted and combined to generate a semantic constraint score.

[0072] Furthermore, the system is also used for the following functions: Traverse the structural context, identify normalized journals within the structural context, statistically analyze the subject distribution among sibling journals based on the traversal results, and generate environmental subject vectors for citation papers; extract inherent standard subject attribute vectors from the standard journal database for each candidate journal; calculate the matching degree between the environmental subject vector and the standard subject attribute vector, and generate a structural consistency score.

[0073] Furthermore, the system is also used for the following functions: Calculate the semantic information entropy of the journal name text to be standardized, which is used to characterize the text uncertainty; calculate the document structure density of the structural context, which is used to characterize the topological completeness of the citation environment; and generate adaptive gating coefficients based on the text uncertainty and topological completeness.

[0074] Furthermore, the system is also used for the following functions: The adaptive gating coefficient is calculated as $\alpha = \mathrm{sigmoid}(k_1 \cdot H(T_{target}) + k_2 \cdot D(L_{siblings}))$; where $\alpha$ denotes the adaptive gating coefficient, sigmoid is the logistic function used to map the result of the linear combination to the interval (0, 1), $H(T_{target})$ denotes the semantic information entropy, $D(L_{siblings})$ denotes the document structure density, $k_1$ is the text uncertainty amplification factor, and $k_2$ is the topological completeness amplification factor.

[0075] Furthermore, the system is also used for the following functions: Determine whether the score fusion result is lower than a preset confidence threshold; if the score fusion result is lower than the preset confidence threshold, output an unknown flag and trigger a manual review process.

[0076] Furthermore, the system is also used for the following functions: The structural consistency score is also used to constrain and correct the semantic constraint score. When the structural consistency score of any candidate journal is lower than the preset score threshold, a decay constraint factor is applied to the semantic constraint score of the corresponding candidate journal to weaken the ranking weight of candidate journals that are inconsistent with the subject distribution of the citation environment.

[0077] Furthermore, the system is also used for the following functions: The semantic constraint score and structural consistency score are weighted and fused using the adaptive gating coefficient, and the fused score result is output.

[0078] It should be noted that the order of the embodiments described above is merely for descriptive purposes and does not represent the superiority or inferiority of the embodiments. Furthermore, the above description focuses on specific embodiments of this specification. Additionally, the processes depicted in the accompanying drawings do not necessarily require a specific or sequential order to achieve the desired results. In some implementations, multitasking and parallel processing are possible or may be advantageous.

[0079] The above description is only a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

[0080] This specification and accompanying drawings are merely illustrative examples of this application and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of this application. Clearly, those skilled in the art can make various alterations and modifications to this application without departing from its scope. Therefore, if such modifications and variations fall within the scope of this application and its equivalents, this application intends to include such modifications and variations.

Claims

1. A naming standardization method that integrates cascaded semantic mapping and referencing environment features, characterized in that the method includes: for each journal entity in the standard journal database, generating a structured semantic description set containing the standard name and its subject classification information, and mapping the structured semantic description set to a standard journal vector with embedded subject attribute distribution using a unified training encoder to form a high-dimensional standard feature space; obtaining the reference record to be standardized, and parsing the reference record into a context triple, wherein the context triple includes a core entity, a semantic context and a structural context; concatenating the context triple and inputting it into the unified training encoder to generate a query vector; inputting the query vector into the high-dimensional standard feature space to perform vector similarity retrieval and recall a preset number of candidate journals; performing semantic similarity and literal similarity analysis for each candidate journal to establish a semantic constraint score; performing citation environment topological consistency verification on the candidate journals to establish a structural consistency score; and configuring an adaptive gating coefficient for the semantic constraint score and the structural consistency score, performing score fusion, and generating a name standardization result based on the score fusion result.

2. The naming standardization method that fuses cascaded semantic mapping and referencing environment features as described in claim 1, characterized in that concatenating the context triple and inputting it into the unified training encoder to generate a query vector includes: concatenating the core entity and the semantic context within the context triple according to priority, wherein the core entity is placed at the beginning of the concatenation sequence as a first-level feature, the title of the cited paper is placed after the core entity as a second-level semantic feature, and the title of the citing paper is placed at the end of the concatenation sequence as a third-level auxiliary semantic feature, and preset separators are inserted between the features at each level to maintain the independence of semantic boundaries; when the title of the cited paper is missing in the semantic context, using the title of the citing paper in place of the title of the cited paper in the concatenation; and inputting the concatenated sequence into the unified training encoder to generate a query vector.

3. The naming standardization method that fuses cascaded semantic mapping and referencing environment features as described in claim 1, characterized in that, For each candidate journal, semantic similarity and literal similarity analysis are performed to establish a semantic constraint score, including: The cosine similarity between the query vector and the standard journal vector corresponding to the candidate journal is calculated as semantic similarity; The core entity and the standard name of the candidate journal are aligned at the character level, and the normalized edit distance is calculated as the literal similarity. The semantic similarity and literal similarity are weighted and combined to generate a semantic constraint score.

4. The naming standardization method that fuses cascaded semantic mapping and referencing environment features as described in claim 1, characterized in that, Perform citation environment topological consistency verification on the candidate journals and establish structural consistency scores, including: Traverse the structural context, identify the normalized journals in the structural context, statistically analyze the subject distribution of sibling journals based on the traversal results, and generate the environmental subject vector of the citing paper. For each candidate journal, extract the inherent standard subject attribute vector from the standard journal database; Calculate the matching degree between the environmental subject vector and the standard subject attribute vector, and generate a structural consistency score.

5. The naming standardization method that fuses cascaded semantic mapping and referencing environment features as described in claim 1, characterized in that, Configure adaptive gating coefficients for the semantic constraint score and structural consistency score, and perform score fusion, including: Calculate the semantic information entropy of the journal name text to be standardized, which is used to characterize the text uncertainty; The document structure density of the structural context is calculated to characterize the topological completeness of the citation environment; Adaptive gating coefficients are generated based on the text uncertainty and topological completeness.

6. The naming standardization method that fuses cascaded semantic mapping and referencing environment features as described in claim 5, characterized in that generating adaptive gating coefficients based on the text uncertainty and topological completeness includes: the adaptive gating coefficient is calculated as $\alpha = \mathrm{sigmoid}(k_1 \cdot H(T_{target}) + k_2 \cdot D(L_{siblings}))$; where $\alpha$ denotes the adaptive gating coefficient, sigmoid is the logistic function used to map the result of the linear combination to the interval (0, 1), $H(T_{target})$ denotes the semantic information entropy, $D(L_{siblings})$ denotes the document structure density, $k_1$ is the text uncertainty amplification factor, and $k_2$ is the topological completeness amplification factor.

7. The naming standardization method that fuses cascaded semantic mapping and referencing environment features as described in claim 1, characterized in that generating a name standardization result based on the score fusion result further includes: determining whether the score fusion result is lower than a preset confidence threshold; and if the score fusion result is lower than the preset confidence threshold, outputting an unknown identifier and triggering a manual review process.

8. The naming standardization method that fuses cascaded semantic mapping and referencing environment features as described in claim 1, characterized in that, The structural consistency score is also used to constrain and correct the semantic constraint score. When the structural consistency score of any candidate journal is lower than the preset score threshold, a decay constraint factor is applied to the semantic constraint score of the corresponding candidate journal to weaken the ranking weight of candidate journals that are inconsistent with the subject distribution of the citation environment.

9. The naming standardization method that fuses cascaded semantic mapping and referencing environment features as described in claim 1, characterized in that, Configure adaptive gating coefficients for the semantic constraint score and structural consistency score, and perform score fusion, including: The semantic constraint score and structural consistency score are weighted and fused using the adaptive gating coefficient, and the fused score result is output.

10. A naming standardization system that integrates cascaded semantic mapping and referencing environment features, characterized in that the system is used to implement the naming standardization method that fuses cascaded semantic mapping and referencing environment features as described in any one of claims 1-9, the system comprising: a high-dimensional standard feature space generation module, used to generate a structured semantic description set containing the standard name and its subject classification information for each journal entity in the standard journal database, and to use a unified training encoder to map the structured semantic description set into standard journal vectors embedded with subject attribute distributions, thus forming a high-dimensional standard feature space; a reference record parsing module, used to obtain the reference record to be standardized and parse the reference record into a context triple, wherein the context triple includes a core entity, a semantic context and a structural context; a query vector generation module, used to concatenate the context triple and input it into the unified training encoder to generate a query vector; a candidate journal recall module, used to input the query vector into the high-dimensional standard feature space to perform vector similarity retrieval and recall a preset number of candidate journals; a semantic constraint score building module, used to perform semantic similarity and literal similarity analysis for each candidate journal and build a semantic constraint score; a structural consistency score establishment module, used to perform citation environment topological consistency verification on the candidate journals and establish a structural consistency score; and a score fusion execution module, used to configure adaptive gating coefficients for the semantic constraint score and the structural consistency score, perform score fusion, and generate a name standardization result based on the score fusion result.