A hybrid named entity recognition method for the ship polishing field based on entity structure differences
By employing a hybrid named entity recognition method that combines a rule-based routing engine and a deep learning model, the recognition difficulties caused by differences in entity type structure in ship polishing process texts are resolved. This approach achieves high-accuracy and robust entity recognition while reducing annotation costs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- JIANGSU UNIV OF SCI & TECH
- Filing Date
- 2026-04-01
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies struggle to account for differences in entity type structure in ship polishing process texts, resulting in a trade-off between recognition stability and accuracy. Furthermore, the high annotation costs and uneven category distribution negatively impact model adaptability.
A hybrid named entity recognition method is adopted, which identifies structurally stable entities through a multi-dimensional rule routing engine and identifies semantically dependent entities by combining a deep learning model. The results are fused and the advantages of rule recognition and deep learning are utilized to achieve accurate recognition of different types of entities.
It improves the accuracy and robustness of entity recognition in ship polishing process texts, reduces the reliance on large-scale manually labeled data, and is applicable to diverse ship polishing process texts.
Figure CN122311202A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the interdisciplinary field of natural language processing and intelligent ship manufacturing, specifically to a hybrid named entity recognition method based on differences in entity structural characteristics for ship grinding processes.
Background Technology
[0002] Named Entity Recognition (NER) is a crucial step in Natural Language Processing (NLP) technology. It aims to extract meaningful entities from unstructured text and determine their categories, serving as a core prerequisite for building knowledge graphs, enabling semantic search, and intelligent question-answering systems. In shipbuilding, the grinding and surface pretreatment processes before painting accumulate a vast amount of process specifications, work instructions, safety regulations, and technical documentation. These multi-source, heterogeneous texts contain key domain knowledge, including grinding equipment models, micron-level process parameters, surface defect morphologies, and stringent quality inspection standards (such as the PSPC standard). Accurate and automated extraction of these entities is fundamental to achieving intelligent management, quality traceability, and decision support in ship grinding processes.
[0003] However, the entity types in ship polishing process texts exhibit significant differences in their expressive structures: one type of entity has a fixed or semi-fixed format and minimal contextual dependence, such as standard numbers, quality grades, and parameter ranges; the other type of entity is flexible in expression, with boundaries changing with the context, such as polishing equipment, defects, work tasks, and work objects. If a uniform identification strategy is adopted without distinguishing these differences, it can easily lead to misjudgment or omission of standard parameter entities, and unstable boundaries for work description entities.
[0004] Currently, entity recognition technologies in this field mainly have the following limitations: 1. Uniform sequence labeling strategies struggle to accommodate entities with diverse structural characteristics. Existing technologies typically employ a single deep sequence labeling model (such as BiLSTM-CRF or BERT-CRF) to uniformly model all types of entities. This "one-size-fits-all" approach ignores the structural differences within entities themselves: for structurally stable entities, deep models often suffer from poor generalization in standard formats due to overfitting to specific contexts in the training data, and may even misinterpret fixed rules; while for semantically dependent entities, a single model struggles to capture their complex boundaries when sufficient contextual features are lacking.
[0005] 2. Vertical domain annotation is costly and suffers from uneven category distribution. Ship polishing is a vertically segmented industrial field lacking large-scale publicly available annotation corpora. Manual annotation requires specialized knowledge, is costly, and exhibits a long-tailed distribution of entity categories. Issues with sample size and distribution can lead to model overfitting and decreased performance in recognizing low-frequency entities, impacting engineering usability.
Summary of the Invention
[0006] Purpose of the invention: To address the challenges of balancing stability and accuracy in heterogeneous entity recognition scenarios and the insufficient model adaptability caused by scarce labeled samples and class imbalance in existing technologies, this invention proposes a hybrid named entity recognition method based on entity structural differences in the field of ship polishing. This method employs a hybrid approach combining rule-based recognition, deep learning recognition, and fusion resolution. By performing divide-and-conquer modeling based on entity structural differences and fusing the results, it achieves deterministic recognition of structurally stable entities and semantically generalized recognition of semantically dependent entities.
[0007] Technical solution: A hybrid named entity recognition method for the ship polishing field based on differences in entity structure, comprising the following steps:
A multi-dimensional rule routing engine is used to identify structurally stable entities in the text and to determine the category of each such entity and its position in the text; the multi-dimensional rule routing engine includes a dictionary matching module, a regular expression constraint module, and a logic verification module.
A deep learning-based named entity recognition model is used to identify semantically dependent entities in the text; this model is trained on a structured text corpus for the field of ship polishing.
The recognition results of the multi-dimensional rule routing engine and those of the deep learning-based named entity recognition model are fused to obtain the fused recognition result.
Entities are classified as structurally stable or semantically dependent according to their structural stability and semantic dependence in textual expression, combined with knowledge of the ship polishing field and the ship polishing process.
[0008] Furthermore, the multi-dimensional rule routing engine includes a dictionary matching module, a regular expression constraint module, and a logic verification module.
The dictionary matching module stores all pattern strings with fixed expressions from a pre-built standard database of ship polishing processes in a dictionary. The pattern strings are pre-stored entity term texts from the dictionary of ship polishing processes, and each pattern string corresponds to a specific expression of a structurally stable entity. A prefix tree is constructed from the pattern string set, and the fail pointer of each node in the prefix tree is calculated with a breadth-first search to construct an Aho-Corasick automaton. During entity recognition, the text stream to be processed is input into the Aho-Corasick automaton, which completes a full search for all entities in the dictionary in O(n) time, where n is the length of the text to be processed, and outputs the first candidate entity set.
The regular expression constraint module runs after the dictionary matching module outputs the first candidate entity set. For structurally stable entities that are not fully covered by the dictionary but have significant format regularity, it defines character sets, quantifiers, and boundary qualifiers to lock onto text fragments that conform to the industry standard format, extracts their text content and position information, and forms a second candidate entity set.
The logic verification module introduces a background verification mechanism: it merges the multi-source candidate entity sets and then performs unified verification and adjudication under rule constraints to ensure the accuracy and consistency of the recognition results for structurally stable entities. The verification mechanism mainly includes:
(1) Legality verification: based on preset domain rules, each candidate entity in the first and second candidate entity sets is verified one by one. The verification includes entity type verification, numerical range verification, unit consistency verification, and contextual semantic verification.
(2) Boundary correction: for candidate entities with truncated or redundant boundaries, the start and end positions are adjusted according to character distribution and rule constraints to obtain a complete entity representation.
(3) Conflict resolution: for candidate entities whose text positions overlap or contain one another, selection gives priority to dictionary matches and then to the longest span, retaining the candidates that satisfy the constraints; when the same text fragment is identified as different entity types, the entity type that satisfies the domain constraints is selected.
The final output is the recognition result for structurally stable entities.
[0009] The logic verification module is used to introduce a background logic verification mechanism to obtain the identification result for entities containing quantitative indicators after the initial boundary identification.
[0010] Furthermore, the deep learning-based named entity recognition model includes a pre-trained language model (BERT), a bidirectional long short-term memory network (BiLSTM), and a label constraint model (CRF). The pre-trained language model BERT performs contextual semantic encoding on the input text, transforming the text into dynamic word vectors that carry contextual information. A feature alignment mechanism is introduced for BERT's WordPiece segmentation: it generates the subword sequence together with a positional mapping from each original word to its first subword, thereby avoiding the label misalignment and boundary drift caused by fragmented segmentation.
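The first-subword mapping described above can be illustrated without loading any model. The following is a minimal sketch, not the patent's implementation: `toy_wordpiece` and `TOY_VOCAB` are hypothetical stand-ins for BERT's WordPiece tokenizer and vocabulary, and only the mapping logic in `align_first_subword` is the point.

```python
from typing import List, Tuple

# Hypothetical toy vocabulary; a real BERT vocabulary is far larger.
TOY_VOCAB = {"pneu", "##matic", "angle", "grind", "##er", "weld", "##s"}


def toy_wordpiece(word: str) -> List[str]:
    """Greedy longest-match-first split of one word, in the style of WordPiece."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in TOY_VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no split exists, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def align_first_subword(words: List[str]) -> Tuple[List[str], List[int]]:
    """Return the subword sequence and, per word, the index of its first subword."""
    subwords: List[str] = []
    first_index: List[int] = []
    for word in words:
        first_index.append(len(subwords))
        subwords.extend(toy_wordpiece(word))
    return subwords, first_index


words = ["pneumatic", "angle", "grinder", "welds"]
subwords, first_index = align_first_subword(words)
print(subwords)
print(first_index)
```

With such a mapping, one label per original word can be read from (or assigned to) the encoder output at `first_index`, so the label sequence keeps the same length as the word sequence regardless of how many pieces each word splits into.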
[0011] The bidirectional long short-term memory network (BiLSTM) takes the features output by the pre-trained language model BERT and captures forward constraints and backward logic to model the contextual dependencies of the text. The label constraint model uses a conditional random field (CRF) to learn the BIOES label transition matrix, constraining the recognition results to conform to the labeling specification.
[0012] Furthermore, the recognition results of the multi-dimensional rule routing engine and the recognition results of the deep learning-based named entity recognition model are fused to obtain the fused recognition results, including: When the recognition results of the multi-dimensional rule routing engine overlap or conflict with the recognition results of the deep learning-based named entity recognition model, selection and filtering are performed according to entity type priority, maximum coverage matching principle and model confidence threshold to obtain the fused recognition results.
[0013] Furthermore, the selection and filtering based on entity type priority, maximum coverage matching principle, and model confidence threshold to obtain the fused recognition result includes:
The recognition results of the multi-dimensional rule routing engine have higher priority than the recognition results of the deep learning-based named entity recognition model.
When the entity boundaries in the recognition results of the multi-dimensional rule routing engine partially overlap with those in the recognition results of the deep learning-based named entity recognition model, the recognition result with the longest coverage is selected.
If the confidence score of an entity output by the deep learning-based named entity recognition model is lower than a preset threshold, and an entity identified by the multi-dimensional rule routing engine exists within the text interval corresponding to that entity or within a range overlapping with that text interval, then the recognition result of the multi-dimensional rule routing engine replaces the model's recognition result.
Furthermore, the structured text corpus for the ship polishing field is obtained as follows: the instance set of each entity category is extracted from the annotated structured texts on ship polishing; while keeping the original semantic structure unchanged, entities in the text are replaced with other entity instances of the same category, and each replacement inherits the original entity's annotation label, generating new annotated pairs. The structured text corpus for ship polishing is constructed in this way.
[0014] Furthermore, in the process of training the deep learning-based named entity recognition model using a structured text corpus for the field of ship polishing, a category weighting mechanism is adopted. This category weighting mechanism includes setting weighting factors based on the reciprocal of the frequency of each category of entity in the structured text corpus for the field of ship polishing.
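As a minimal sketch of the reciprocal-frequency weighting described above: the normalization to a mean weight of 1 and the example category counts are assumptions added for illustration, and the text does not specify how the weights enter the training loss.

```python
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each category by 1 / frequency, then rescale so the mean weight is 1."""
    counts = Counter(labels)
    raw = {label: 1.0 / count for label, count in counts.items()}
    mean = sum(raw.values()) / len(raw)  # normalization is an assumption, not from the source
    return {label: weight / mean for label, weight in raw.items()}


# Hypothetical entity-category counts from an annotated corpus.
labels = ["Equipment"] * 8 + ["Defect"] * 4 + ["WorkObject"] * 1
weights = inverse_frequency_weights(labels)
for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}: {weight:.3f}")
```

The rarest category receives the largest weight, which is the behavior the paragraph describes for low-frequency entity categories.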
[0015] Beneficial Effects: This invention collects heterogeneous texts from multiple sources, including standard specifications, work instructions, and process descriptions, within the ship polishing process scenario. These texts are then cleaned, segmented, and standardized in terms of units and format to construct a domain corpus. Based on the structural stability and semantic dependence of entity representations, entities to be identified are categorized into structurally stable entities and semantically dependent entities. For structurally stable entities, rule matching based on a domain dictionary and regular expressions, combined with logical verification, is used to determine entity boundaries and categories. For semantically dependent entities, a sequence labeling model is constructed, including a pre-trained encoding layer, a bidirectional sequence modeling layer, and a conditional random field labeling layer, for prediction. During the training phase, entity-level replacement data augmentation and category weighting mechanisms can be employed to alleviate sample scarcity and class imbalance. The rule recognition results and model recognition results are fused and conflict-resolved according to entity type priority, maximum coverage matching, and confidence threshold, outputting the entity type, text position, and confidence level.
Through the above technical solutions, this invention has the following advantages compared to existing technologies:
1. Balancing recognition accuracy and robustness: This invention decouples entities by structural analysis, uses rule-based recognition for structurally stable entities, and makes full use of format rules and value constraints to reduce false positives and false negatives; for semantically dependent entities, it uses a deep sequence labeling model to improve the ability to recognize flexible expressions and complex boundaries.
2. Reducing labeling dependence and improving long-tail category performance: This invention achieves stable recognition performance even with limited labeled corpus and uneven category distribution by using rule-based result assistance, data augmentation, and category balancing strategies.
3. Strong industrial applicability: This invention effectively handles the overlap and conflict between rules and model outputs through a fusion resolution mechanism of priority, maximum matching and confidence filtering, and is suitable for ship polishing process text with high entity density and diverse expressions.
4. This invention improves the recognition accuracy and robustness of standard parameter entities and operation description entities in ship grinding process text by using a hybrid divide-and-conquer and fusion strategy, and reduces the dependence on large-scale manual annotation data.
Attached Figure Description
[0016] Figure 1 is a schematic diagram of the hybrid named entity recognition method in the ship polishing process scenario in an embodiment of the present invention; Figure 2 is a schematic diagram of the logic for fusing multi-source recognition results and resolving conflicts.
Detailed Implementation
[0017] The technical solution of this embodiment will now be further described in conjunction with the accompanying drawings and examples.
[0018] As shown in Figure 1, this invention proposes a hybrid named entity recognition method for the ship polishing field based on differences in entity structure. Entities with different structural characteristics are processed with differentiated recognition strategies: rule-based recognition and deep learning recognition are combined, and the multi-source recognition results are then fused and output. The specific steps include:
Step S1: Domain Text Collection and Multidimensional Preprocessing: Collect multi-source heterogeneous texts in the ship polishing process scenario, including standard specifications, work instructions and process description texts; perform noise cleaning, long sentence segmentation and standardization processing on the collected multi-source heterogeneous texts to build a structured text corpus for the ship polishing field.
[0019] As one implementation method, multi-source heterogeneous texts in the ship polishing scenario are first collected through web crawling, manual input, and other methods. The text sources include, but are not limited to, international and industry standards such as "ISO 8501-1 Surface Treatment of Steel Before Painting" and "CB / T 3474 Ship Painting Specification", classification society polishing standards and specifications, polishing-related literature and journals, and shipyards' publicly available operation standards (section polishing operation standards, external cargo hold polishing operation standards, superstructure polishing operation standards, etc.), which together form a domain corpus. The text in the corpus is then preprocessed: HTML tags, garbled characters, and redundant metadata unrelated to production are removed; paragraphs related to the polishing process are extracted and cleaned, which includes deduplication (filtering out completely duplicated or highly similar paragraphs) and removing meaningless punctuation, characters, and sequence markers from sentences, to form a preliminary corpus database; long sentences are truncated at periods, semicolons, or newlines, with a maximum sentence length of 128 characters to fit the input constraints of deep learning models; and the text encoding format is unified, in particular by converting units of measurement to a unified format. The result is a high-quality unlabeled text dataset.
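The cleaning and segmentation steps above could be sketched as follows. This is an illustrative outline under stated assumptions, not the patent's pipeline: the function names are hypothetical, only exact duplicates are dropped (near-duplicate filtering and unit conversion are omitted), and the ASCII period is deliberately not used as a split point so that decimals such as "Sa 2.5" stay intact.

```python
import html
import re
from typing import List

MAX_LEN = 128  # maximum sentence length in characters, as stated in the text


def split_sentences(text: str, max_len: int = MAX_LEN) -> List[str]:
    """Split on full-width periods, semicolons, or newlines; hard-wrap overlong pieces."""
    sentences: List[str] = []
    for piece in re.findall(r"[^。；;\n]+[。；;]?", text):
        piece = piece.strip()
        while len(piece) > max_len:
            sentences.append(piece[:max_len])
            piece = piece[max_len:]
        if piece:
            sentences.append(piece)
    return sentences


def preprocess(raw: str) -> List[str]:
    """Strip HTML tags and entities, split into sentences, and drop exact duplicates."""
    no_tags = html.unescape(re.sub(r"<[^>]+>", "", raw))
    seen = set()
    result: List[str] = []
    for sentence in split_sentences(no_tags):
        sentence = re.sub(r"[ \t]+", " ", sentence)  # collapse runs of spaces and tabs
        if sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
    return result


raw = (
    "<p>Blast the hull plate to Sa 2.5; remove all mill scale</p>\n"
    "<p>Blast the hull plate to Sa 2.5; remove all mill scale</p>\n"
    "Grind weld seams"
)
for line in preprocess(raw):
    print(line)
```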
[0020] Step S2: Entity classification feature mapping based on structural characteristics: Combining the entity system and text representation rules in the field of ship polishing, the entities to be identified are divided into two categories: structurally stable entities and semantically dependent entities. Among them, structurally stable entities refer to entities with fixed / semi-fixed expression forms, limited value ranges, and small context offsets; semantically dependent entities refer to entities with flexible and diverse expression forms and boundary determination that strongly depends on the semantic logic of the context.
[0021] As one implementation method, combining knowledge and processes in the field of ship polishing, and further based on the structural stability and semantic dependence of entities in textual representation, this embodiment classifies the entities to be identified into two decoupled categories. Structurally Stable Entities (SS-Entities): These entities typically have fixed or semi-fixed representations, limited value ranges, low dependence on contextual semantics, and strong pattern matching characteristics. Examples include grinding standards (such as "ISO8501-1", "PSPC standard"), quality specifications, and some parameter standards (such as "Sa 2.5 grade", "roughness").
[0022] Semantic-dependent entities (SD-Entities): These entities are flexible in their expression in text, with entity boundaries fluctuating with the context and strongly dependent on the semantics of the context for judgment. Examples include grinding equipment (such as "pneumatic angle grinder" and "vacuum sandblaster"), grinding defects (such as "oxide scale" and "rust"), work tasks (such as "roughening" and "rust removal"), and work objects (such as "ballast tank outer plating" and "section joining point").
[0023] The above-mentioned classification based on the structural characteristics of entities lays the foundation for adopting different identification strategies in the future.
[0024] Step S3: Rule-based and dictionary-based structurally stable entity recognition: For structurally stable entities, an entity recognition module based on rules and a vertical domain dictionary is constructed. The module scans the text stream using regular expressions, exact keyword matching, and numerical range constraint rules to determine the boundaries and types of entities, and outputs the rule recognition results.
[0025] As one implementation method, this embodiment constructs a multi-dimensional rule routing engine for the aforementioned SS-Entities (structurally stable entities) to efficiently identify entities with stable structural expressions and clear format characteristics. The multi-dimensional rule routing engine specifically includes: a dictionary matching module, a regular expression constraint module, and a logical verification module.
[0026] Specifically, the dictionary matching module stores all pattern strings with fixed expressions from a pre-built database of ship polishing process standards in a dictionary. The pattern strings are pre-stored entity term texts from the ship polishing process field dictionary, such as specific grade specification names and standard number prefixes, and each pattern string corresponds to a specific expression of a structurally stable entity. A Trie (prefix tree) is constructed from all pattern strings in the dictionary, and the fail pointer of each node is calculated with a breadth-first search to construct an Aho-Corasick automaton (AC automaton). During entity recognition, the text stream to be processed is input into the automaton, which completes a full search for all entities in the dictionary in O(n) time (where n is the length of the text to be tested) and outputs the first candidate entity set. Matching efficiency is essentially unaffected by growth in dictionary size, which ensures real-time performance and stability when processing large-scale standardized text.
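The construction described above (Trie, fail pointers computed breadth-first, then a single left-to-right scan) can be sketched in pure Python. The class name, the category labels, and the sample patterns are illustrative assumptions; the patent's actual dictionary contents are not given.

```python
from collections import deque
from typing import Dict, List, Tuple


class AhoCorasick:
    """Minimal Aho-Corasick automaton mapping pattern strings to entity categories."""

    def __init__(self, patterns: Dict[str, str]) -> None:
        self.category = dict(patterns)          # pattern string -> entity category
        self.goto: List[Dict[str, int]] = [{}]  # Trie edges per state; state 0 is the root
        self.fail: List[int] = [0]
        self.out: List[List[str]] = [[]]        # patterns that end at each state
        for pattern in patterns:
            self._insert(pattern)
        self._build_fail_pointers()

    def _insert(self, pattern: str) -> None:
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(pattern)

    def _build_fail_pointers(self) -> None:
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # inherit matches that end at the fail target (suffix patterns)
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def search(self, text: str) -> List[Tuple[int, int, str, str]]:
        """Return (start, end_exclusive, pattern, category) for every dictionary hit."""
        state = 0
        hits: List[Tuple[int, int, str, str]] = []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pattern in self.out[state]:
                hits.append((i - len(pattern) + 1, i + 1, pattern, self.category[pattern]))
        return hits


# Hypothetical dictionary entries for illustration only.
automaton = AhoCorasick({"ISO 8501-1": "Standard", "PSPC": "Standard", "Sa 2.5": "Grade"})
text = "Blast to Sa 2.5 per ISO 8501-1 and PSPC."
for hit in automaton.search(text):
    print(hit)
```

The scan visits each character once and follows fail pointers instead of restarting, which is why the running time depends on the text length (plus the number of reported matches) rather than on the number of dictionary entries.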
[0027] After the dictionary matching module outputs the first candidate entity set, the regular expression constraint module handles structurally stable entities that are not fully covered by the dictionary but exhibit significant format regularity, such as non-fixed value expressions for surface cleanliness levels and roughness ranges. It defines character sets, quantifiers, and boundary qualifiers to lock onto text fragments that conform to industry standard formats, and extracts their text content and position information to form a second candidate entity set. For example, for cleanliness levels, a pattern is set to match strings that start with "Sa" or "St" followed by digits and decimal places, achieving pattern-based recognition of non-fixed value entities. The second candidate entity set supplements the first, especially for structurally stable entities whose format is regular but whose values vary and cannot be listed exhaustively in a dictionary, thereby improving the overall recall of structurally stable entities.
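The cleanliness-level example above might be expressed as a regular expression like the one below. The exact pattern is an assumption: the text only says the string starts with "Sa" or "St" followed by digits and decimal places, so the optional space and the guard against a preceding letter are illustrative choices.

```python
import re
from typing import List, Tuple

# "Sa" or "St", an optional space, then an integer with an optional decimal part.
# The lookbehind keeps the prefix from matching in the middle of a longer word.
CLEANLINESS = re.compile(r"(?<![A-Za-z])(?:Sa|St)\s?\d+(?:\.\d+)?")


def find_cleanliness_grades(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, matched_text) for each cleanliness-grade mention."""
    return [(m.start(), m.end(), m.group()) for m in CLEANLINESS.finditer(text)]


text = "Blast to Sa 2.5, or power-tool clean to St3 on stainless fittings."
for start, end, grade in find_cleanliness_grades(text):
    print(start, end, grade)
```

Because the pattern is case-sensitive and requires a digit after the prefix, words such as "stainless" or "Standard" are not matched.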
[0028] The logical verification module, through the introduction of a background verification mechanism, merges the first and second candidate entity sets and then performs unified verification and adjudication under rule constraints to ensure the accuracy and consistency of the recognition results for structurally stable entities and eliminate false alarms caused by contextual ambiguity. This verification mechanism mainly includes: (1) Legality verification Based on preset domain rules, each candidate entity in the first and second candidate entity sets is verified one by one, including entity type verification, numerical range verification, unit consistency verification, and contextual semantic verification.
[0029] (2) Boundary Correction For candidate entities with truncated or redundant boundaries, their start and end positions are adjusted according to character distribution and rule constraints to obtain a complete entity representation.
[0030] (3) Conflict resolution For candidate entities whose text positions overlap or contain each other, the selection is based on the principle of prioritizing dictionary matching modules and the rule of prioritizing the longest length, retaining candidate entities that meet the constraints; when the same text fragment is identified as different entity types, the entity type that meets the domain constraints is selected.
[0031] The final output is the identification result of structurally stable entities.
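A minimal sketch of the legality check and conflict resolution steps above follows; boundary correction is omitted. The `Candidate` type, the category names, and the `VALID_GRADES` allow-list are hypothetical stand-ins for the preset domain rules, which the text does not enumerate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    start: int
    end: int       # exclusive
    text: str
    category: str
    source: str    # "dict" (first candidate set) or "regex" (second candidate set)


# Illustrative allow-list standing in for a numerical-range rule (assumption).
VALID_GRADES = {"Sa 1", "Sa 2", "Sa 2.5", "Sa 3", "St 2", "St 3"}


def is_legal(candidate: Candidate) -> bool:
    """Legality verification: reject grade candidates outside the allowed values."""
    if candidate.category == "Grade":
        return candidate.text in VALID_GRADES
    return True


def overlaps(a: Candidate, b: Candidate) -> bool:
    return a.start < b.end and b.start < a.end


def resolve(candidates: List[Candidate]) -> List[Candidate]:
    """Keep legal candidates; on overlap prefer dictionary hits, then longer spans."""
    legal = [c for c in candidates if is_legal(c)]
    ranked = sorted(legal, key=lambda c: (c.source != "dict", -(c.end - c.start), c.start))
    kept: List[Candidate] = []
    for candidate in ranked:
        if not any(overlaps(candidate, other) for other in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda c: c.start)


# Candidates for the hypothetical text "ISO 8501-1 Sa 2.5 and Sa 9".
merged = [
    Candidate(0, 10, "ISO 8501-1", "Standard", "dict"),
    Candidate(4, 10, "8501-1", "Parameter", "regex"),
    Candidate(11, 17, "Sa 2.5", "Grade", "regex"),
    Candidate(22, 26, "Sa 9", "Grade", "regex"),
]
for candidate in resolve(merged):
    print(candidate)
```

Here the regex hit nested inside the dictionary match is discarded by the dictionary-priority rule, and "Sa 9" is discarded by the legality check.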
[0032] Through the collaborative processing of dictionary matching, regular expression constraints, and logical verification, fragments in the text that satisfy preset rules are directly identified, and their corresponding structurally stable entity categories and their position information in the original text are determined. The rule recognition module outputs preliminary recognition results for structurally stable entities, providing basic input for subsequent semantically dependent entity recognition and fusion decisions.
[0033] Step S4: Semantic Dependency Entity Recognition Based on Deep Sequence Labeling Model: For semantically dependent entities, a deep learning-based named entity recognition model is constructed, comprising a pre-trained language model BERT, a bidirectional sequence modeling layer, and a conditional random field (CRF) labeling layer. The pre-trained language model BERT is used to obtain deep semantic vectors, the bidirectional sequence modeling layer captures long-distance dependencies, and the CRF layer is used to learn label transition probabilities, thereby achieving predictive labeling of complex semantic entities.
[0034] As one implementation method, the pre-trained language model BERT performs contextual semantic encoding on the input text, transforming the text into dynamic character vectors containing rich contextual information. A feature alignment mechanism is introduced for BERT's WordPiece segmentation: it generates the subword sequence together with a positional mapping from each original word to its first subword, thereby avoiding the label misalignment and boundary drift caused by fragmented segmentation. This first-subword alignment strategy ensures that each original character corresponds to exactly one output vector, eliminating offsets and reducing interference from redundant tokens on sequence labeling. The encoded features are input into a bidirectional sequence modeling network (BiLSTM), which captures the forward constraints and backward logic in the description of the polishing process and models the contextual dependencies of the text. A label constraint layer is introduced at the output: a CRF (Conditional Random Field) learns the BIOES label transition matrix and constrains the recognition results to conform to the labeling specification (e.g., an I-Equipment label must never directly follow a B-Object label), avoiding illegal label sequences.
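The labeling constraint mentioned above can be made concrete with a small validity check for BIOES sequences. This is a sketch of the hard rules that a trained CRF transition matrix is expected to approximate through learned scores; writing them as explicit boolean checks is an illustrative assumption, not the patent's mechanism.

```python
from typing import List


def is_valid_transition(prev: str, curr: str) -> bool:
    """Check one BIOES transition; tags look like 'B-Equipment', 'O', or 'S-Defect'."""
    prev_prefix, _, prev_type = prev.partition("-")
    curr_prefix, _, curr_type = curr.partition("-")
    if prev_prefix in ("B", "I"):
        # an open entity must continue with I- or E- of the same type
        return curr_prefix in ("I", "E") and curr_type == prev_type
    # after O, E-, or S-, only O, B-, or S- may follow
    return curr_prefix in ("O", "B", "S")


def is_valid_sequence(tags: List[str]) -> bool:
    """Check start, end, and every adjacent pair of a BIOES tag sequence."""
    if not tags:
        return True
    if tags[0].partition("-")[0] not in ("O", "B", "S"):
        return False
    if tags[-1].partition("-")[0] not in ("O", "E", "S"):
        return False
    return all(is_valid_transition(a, b) for a, b in zip(tags, tags[1:]))


print(is_valid_transition("B-Object", "I-Equipment"))
print(is_valid_sequence(["O", "B-Equipment", "I-Equipment", "E-Equipment", "S-Defect"]))
```

A check like this can be used to audit predicted sequences or, in some implementations, to mask illegal transitions before decoding.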
[0035] The above model is used to identify semantically dependent entities in text and output the model recognition results.
[0036] Step S5: Training Sample Augmentation and Class Balance: During the training phase, data augmentation methods such as entity-level replacement are used to expand the samples for low-frequency entity categories, and loss weights or sampling strategies are set according to the category frequency to alleviate the impact of class imbalance on model training.
[0037] As one implementation method, due to the scarcity of professionally labeled data in the field of ship polishing, this embodiment performs sample augmentation on the training data to improve the generalization ability of the semantically dependent entity recognition model in small-sample scenarios. Sample augmentation employs an entity-level replacement strategy, specifically: extracting instance sets of each entity category from the labeled text, replacing entities in the text with other entity instances of the same category while maintaining the original semantic structure, and simultaneously inheriting the original entity's label. For example, replacing "using a pneumatic angle grinder to process welds" with "using an electric polishing machine to process weld points" generates a new label pair. Simultaneously, a category weighting mechanism is introduced during model training, setting weight factors based on the reciprocal of the frequency of each entity category in the corpus, assigning higher weights to low-frequency entity categories to alleviate the problem of uneven entity category distribution.
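The entity-level replacement described above could look like the following sketch, which works on token lists with span annotations. The `Span` representation, the `pool` of instances, and the English example tokens are assumptions for illustration; the source corpus and its annotation format are not specified.

```python
import random
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start token index, end index exclusive, category)


def augment(
    tokens: List[str],
    spans: List[Span],
    pool: Dict[str, List[List[str]]],
    rng: random.Random,
) -> Tuple[List[str], List[Span]]:
    """Swap each entity for another instance of the same category, keeping its label."""
    new_tokens: List[str] = []
    new_spans: List[Span] = []
    cursor = 0
    for start, end, category in sorted(spans):
        new_tokens.extend(tokens[cursor:start])  # copy the unchanged context
        original = tokens[start:end]
        choices = [inst for inst in pool.get(category, []) if inst != original]
        replacement = rng.choice(choices) if choices else original
        new_start = len(new_tokens)
        new_tokens.extend(replacement)
        new_spans.append((new_start, len(new_tokens), category))  # label is inherited
        cursor = end
    new_tokens.extend(tokens[cursor:])
    return new_tokens, new_spans


tokens = "use the pneumatic angle grinder to process welds".split()
spans = [(2, 5, "Equipment"), (7, 8, "WorkObject")]
pool = {
    "Equipment": [["pneumatic", "angle", "grinder"], ["electric", "polishing", "machine"]],
    "WorkObject": [["welds"], ["weld", "points"]],
}
new_tokens, new_spans = augment(tokens, spans, pool, random.Random(0))
print(" ".join(new_tokens))
print(new_spans)
```

Span offsets are recomputed as the sentence is rebuilt, so replacements of a different length (here "welds" to "weld points") keep the annotations aligned.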
[0038] Step S6: Multi-source result fusion and conflict resolution: Logically align and fuse the rule matching results and model prediction results. When overlap or conflict occurs, select and filter according to entity type priority, maximum coverage matching principle and model confidence threshold to generate consistent final annotation results.
[0039] As one implementation method, when the rule branch and the model branch produce conflicting results, the following logic is executed: First, determine whether there is any overlap or conflict between the rule-based entity set K and the model-based entity set H. If there is no conflict, directly merge the two entity sets to generate the final entity recognition sequence.
[0040] If overlaps or conflicts exist, it is further determined whether the conflicting entities involve structurally stable entities. If the conflicting entities contain structurally stable entities, the rule recognition result is retained first, that is, the entity output by the multi-dimensional rule routing engine is used as the final result to ensure the reliability of the recognition of structurally stable entities.
[0041] If the conflicting entities do not involve structurally stable entities, the system further determines whether there is overlap or inclusion relationship between the conflicting entities in the text range. If there is overlap or inclusion relationship, the maximum coverage matching principle is adopted, that is, the entity recognition results with larger text spans are retained to avoid entity boundaries being truncated.
[0042] After performing maximum coverage matching, the confidence level of the model recognition result is further assessed. If the confidence level of the model-recognized entity is higher than a preset threshold, the model recognition result is retained; otherwise, it is determined whether a rule-recognized entity exists within the text interval corresponding to that entity. If a rule-recognized entity exists, the process reverts to the rule recognition result; if no rule-recognized entity exists, the model recognition result is retained.
[0043] In short, the mechanism rests on the following three principles: 1. Category priority principle: if the rule branch labels a span as "parameter standard" and the model labels it as "job object", the rule result is forced to take precedence, because parameter standards belong to SS-Entities; 2. Maximum matching principle: when the boundaries of the entities identified by the two branches partially overlap (e.g., the rule identifies "Sa 2.5" and the model identifies "Sa 2.5 level"), the result with the longest coverage is selected; 3. Confidence filtering: if the confidence of an entity output by the deep learning-based named entity recognition model is lower than a preset threshold (e.g., 0.85), and an entity identified by the multi-dimensional rule routing engine exists within the text interval corresponding to that entity or within a range that overlaps it, the recognition result of the multi-dimensional rule routing engine replaces the recognition result of the model.
[0044] Through the aforementioned conflict resolution mechanism, a unified set of entity recognition results is formed. The logic is shown in Figure 2.
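One possible reading of the three fusion principles is sketched below in Python. The `Entity` class, the `fuse` function, the tie-breaking in favor of the rule branch, and the sample sentence are assumptions introduced for illustration; only the 0.85 threshold and the "Sa 2.5" example come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int               # character offset, inclusive
    end: int                 # character offset, exclusive
    type: str
    source: str              # "rule" or "model"
    confidence: float = 1.0  # rule results are treated as fully confident


def overlaps(a, b):
    return a.start < b.end and b.start < a.end


def length(e):
    return e.end - e.start


def fuse(rule_entities, model_entities, ss_types, threshold=0.85):
    """Merge rule and model entities, resolving overlaps (assumed logic)."""
    rules = list(rule_entities)
    kept_models = []
    for m in model_entities:
        rivals = [r for r in rules if overlaps(r, m)]
        if not rivals:
            kept_models.append(m)  # no conflict: merge directly
            continue
        # 1. Category priority: a structurally stable rule type wins a
        #    type disagreement outright.
        if any(r.type != m.type and r.type in ss_types for r in rivals):
            continue
        # 2. Maximum coverage and 3. confidence filtering: the model result
        #    replaces the rule results only if it covers more text than each
        #    of them and its confidence reaches the threshold.
        model_is_longest = all(length(m) > length(r) for r in rivals)
        if model_is_longest and m.confidence >= threshold:
            rules = [r for r in rules if r not in rivals]
            kept_models.append(m)
        # Otherwise the rule results are retained and m is dropped.
    return sorted(rules + kept_models, key=lambda e: (e.start, e.end))


# Hypothetical sentence and spans for demonstration.
SS_TYPES = {"parameter standard"}
text = "blast to Sa 2.5 level before coating"
a = text.index("Sa 2.5")
rule = Entity(a, a + len("Sa 2.5"), "parameter standard", "rule")
wide = a + len("Sa 2.5 level")
confident = Entity(a, wide, "parameter standard", "model", 0.93)
unsure = Entity(a, wide, "parameter standard", "model", 0.60)
mistyped = Entity(a, wide, "job object", "model", 0.99)

for candidate in (confident, unsure, mistyped):
    winner = fuse([rule], [candidate], SS_TYPES)[0]
    print(candidate.type, candidate.confidence, "->", winner.source)
```

In this sketch the longer model span is kept only when it is confident and agrees with the rule branch on type. A production version would also need to handle model entities that overlap each other, which the description does not spell out.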
[0045] Step S7: Entity Knowledge Output and Structured Display: Summarize the fused recognition results and output the entity's offset, length, type, and confidence score in the original text for subsequent knowledge extraction, process knowledge base construction, and business system integration.
[0046] As one implementation method, based on the fused entity recognition results, the system converts the recognized entities and their attributes into structured data, stored in the following format: This leads to the final named entity recognition results in the field of ship polishing.
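The storage format itself is not reproduced in this text. As a purely hypothetical illustration of a record carrying the fields listed in step S7 (offset, length, type, and confidence score), the following Python sketch serializes fused entities to JSON; the key names, the extra `text` field, and the sample sentence are assumptions, not the format used in the embodiment.

```python
import json


def to_records(text, entities):
    """Turn (start, end, type, confidence) tuples into serializable records."""
    return [
        {
            "text": text[start:end],   # assumed convenience field
            "offset": start,           # position in the original text
            "length": end - start,
            "type": entity_type,
            "confidence": confidence,
        }
        for start, end, entity_type, confidence in entities
    ]


# Hypothetical fused result for one sentence.
text = "blast to Sa 2.5 level before coating"
start = text.index("Sa 2.5 level")
records = to_records(
    text, [(start, start + len("Sa 2.5 level"), "parameter standard", 0.93)]
)
print(json.dumps(records, indent=2))
```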
Claims
1. A hybrid named entity recognition method for ship polishing based on differences in entity structure, characterized in that it includes the following steps: a multi-dimensional rule routing engine is used to identify structurally stable entities in text and to determine their corresponding categories and their location information in the text; the multi-dimensional rule routing engine includes a dictionary matching module, a regular expression constraint module, and a logic verification module; a deep learning-based named entity recognition model is used to identify semantically dependent entities in text; the deep learning-based named entity recognition model is trained on a structured text corpus for the field of ship polishing; the recognition results of the multi-dimensional rule routing engine and the recognition results of the deep learning-based named entity recognition model are fused to obtain the fused recognition result; the structurally stable entities and semantically dependent entities are classified based on the structural stability and semantic dependence of entities in textual expression, combining knowledge of the ship polishing field and the ship polishing process.
2. The hybrid named entity recognition method for ship polishing based on differences in entity structure as described in claim 1, characterized in that: the multi-dimensional rule routing engine includes a dictionary matching module, a regular expression constraint module, and a logic verification module; the dictionary matching module is used to store all pattern strings with fixed expressions in the dictionary and to construct a prefix tree from all pattern strings in the dictionary, where the pattern strings are entity entry texts pre-stored in the dictionary of ship polishing technology and each pattern string corresponds to a specific expression of a structurally stable entity; the prefix tree is constructed from the set of pattern strings, and the failure pointer of each node in the prefix tree is calculated based on the breadth-first search algorithm to construct an Aho-Corasick automaton; during entity recognition, the text stream to be processed is input into the Aho-Corasick automaton, and a full search is performed on all pattern strings in the dictionary based on the breadth-first search algorithm to output the first candidate entity set; the regular expression constraint module is used to define character sets, quantifiers and boundary qualifiers for structurally stable entities that are not fully covered by the dictionary but have regular formats, so as to capture text fragments that conform to industry standard formats, extract their corresponding text content and position information, and form a second candidate entity set; the logic verification module is used to perform unified verification and adjudication on the candidate entities in the first candidate entity set and the second candidate entity set under rule constraints using a preset verification mechanism, so as to obtain the recognition result of the multi-dimensional rule routing engine.
3. The hybrid named entity recognition method for ship polishing based on differences in entity structure as described in claim 2, characterized in that the verification mechanism includes: based on preset domain rules, all candidate entities are verified one by one, where the verification includes entity type verification, numerical range verification, unit consistency verification and context semantic verification; for candidate entities with truncated or redundant boundaries, their start and end positions are adjusted according to character distribution and rule constraints to obtain a complete entity representation; for candidate entities whose text positions overlap or contain each other, selection gives priority to results of the dictionary matching module and to the longest match, and the candidate entities that meet these conditions are retained; when the same text fragment is identified as different entity types, the entity type that meets the domain constraints is selected.
4. The hybrid named entity recognition method for ship polishing based on differences in entity structure as described in claim 1, characterized in that: the deep learning-based named entity recognition model includes the pre-trained language model BERT, the bidirectional long short-term memory network BiLSTM, and the label constraint model CRF; the pre-trained language model BERT performs contextual semantic encoding on the input text, transforming the text into dynamic word vectors containing contextual information; the pre-trained language model BERT introduces an effective feature alignment mechanism, which, for the WordPiece segmentation mechanism of BERT, generates a sequence of sub-words together with an effective positional mapping from each original word to its first sub-word; the bidirectional long short-term memory network BiLSTM captures forward constraints and backward logic based on the features output by the pre-trained language model BERT, so as to model the contextual dependencies of the text; the label constraint model CRF uses a conditional random field to learn the BIOES label transition matrix and constrains the recognition results to conform to the labeling specification.
5. The hybrid named entity recognition method for ship polishing based on differences in entity structure as described in claim 1, characterized in that: The recognition results of the multi-dimensional rule routing engine and the recognition results of the deep learning-based named entity recognition model are fused to obtain the fused recognition results, including: When the recognition results of the multi-dimensional rule routing engine overlap or conflict with the recognition results of the deep learning-based named entity recognition model, selection and filtering are performed according to entity type priority, maximum coverage matching principle and model confidence threshold to obtain the fused recognition results.
6. The hybrid named entity recognition method for ship polishing based on differences in entity structure as described in claim 5, characterized in that the selection and filtering according to entity type priority, the maximum coverage matching principle and the model confidence threshold to obtain the fused recognition result includes: the recognition results of the multi-dimensional rule routing engine have higher priority than the recognition results of the deep learning-based named entity recognition model; when the entity boundaries in the recognition results of the multi-dimensional rule routing engine partially overlap with those in the recognition results of the deep learning-based named entity recognition model, the recognition result with the longest coverage is selected; if the confidence of an entity output by the deep learning-based named entity recognition model is lower than a preset threshold, and an entity identified by the multi-dimensional rule routing engine exists within the text interval corresponding to that entity or within a range that overlaps it, the recognition result of the multi-dimensional rule routing engine replaces the recognition result of the deep learning-based named entity recognition model.
7. The hybrid named entity recognition method for ship polishing based on differences in entity structure as described in claim 1, characterized in that the structured text corpus for the ship polishing field is obtained as follows: instance sets of each entity category are extracted from the annotated structured texts on ship polishing; while the original semantic structure is kept unchanged, entities in the text are replaced with other entity instances of the same category, and the original entity's annotation labels are inherited to generate new annotation pairs, thereby constructing the structured text corpus for the ship polishing field.
8. The hybrid named entity recognition method for ship polishing based on differences in entity structure as described in claim 1, characterized in that: In the process of training a deep learning-based named entity recognition model using a structured text corpus for the field of ship polishing, a category weighting mechanism is adopted. This category weighting mechanism includes setting weight factors based on the reciprocal of the frequency of each category of entity in the structured text corpus for the field of ship polishing.