Method for analyzing causes of shipbuilding accidents
By constructing an intelligent recognition framework based on a large language model and an LDA topic model, and combining it with a Levenshtein distance and multi-level accident causative factor extraction framework, the problems of data imbalance and classification difficulties in the causal analysis of ship repair and construction accidents were solved. This enabled automated and standardized extraction and coding of causal factors, improving the depth and breadth of the analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI MARITIME UNIVERSITY
- Filing Date
- 2026-05-09
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies for analyzing the causes of ship repair and construction accidents suffer from data imbalance, difficult classification, low efficiency due to reliance on manual analysis, and strong subjectivity. They struggle to support in-depth and broad analysis, and static analysis frameworks cannot identify emerging causal factors.
A large language model-based intelligent recognition framework is constructed. By combining Levenshtein distance to calculate text similarity and LDA topic model, a multi-level accident causative factor extraction framework is built. A collaborative qualitative coding process is adopted for the systematic extraction and coding of causative factors.
It enables automated and standardized extraction of accident causal event chains from unstructured text, improving the objectivity and consistency of causal classification, dynamically identifying new causal factors, generating structured datasets, and supporting efficient statistical analysis.
Figure CN122243218A_ABST (abstract drawing)
Description
Technical Field
[0001] This invention relates to the field of accident causation analysis technology, and in particular to a method for causation analysis of ship repair and construction accidents.
Background Technology
[0002] The shipbuilding and repair industry is a high-risk field, and in-depth analysis of accident causes is crucial for accident prevention and improving safety management. Currently, accident cause analysis in this field mainly relies on the collation and summarization of textual materials such as accident investigation reports. However, in practice and research, text-based accident cause analysis methods face a series of inherently interconnected problems, severely restricting the depth, breadth, and reliability of the analysis. First, the accident description texts themselves suffer from significant data imbalance. This is reflected in the vastly different number of records for different accident types and the huge differences in the level of detail in the text descriptions. Many accident records only contain brief process descriptions, lacking complete records of underlying causes and causal chains, resulting in sparse and unevenly distributed effective information for in-depth analysis. Second, the difficulty in classifying accident causes is extremely prominent. Due to the lack of unified and detailed standards for standardizing and mapping the causal factors described in the texts, analysts heavily rely on personal experience for judgment and classification, leading to highly subjective and inconsistent analysis results. This directly leads to a large number of causes being lumped into the "other" category, and the same or similar causes being classified into different levels of classification codes in different cases, causing confusion in the classification system. This confusion makes it difficult for statistical analysis based on classification results to truly reflect the distribution patterns and intrinsic relationships of causes.
[0003] At the methodological level, existing research mostly employs static analysis frameworks based on predefined classification systems. While such frameworks are systematic, their inherent static nature makes it difficult to proactively identify and incorporate emerging causal factors that have never appeared in the predefined classifications, resulting in lagging model updates. Furthermore, when faced with the large-scale, unstructured historical accident text data accumulated by shipbuilding and repair enterprises, traditional analytical methods relying on manual reading, understanding, and encoding suffer from bottlenecks such as low processing efficiency, high labor costs, and poor scalability. The limitations of manual processing also make it difficult to avoid errors caused by the analyst's subjective perspective or fatigue. These errors are further amplified when processing massive amounts of data, affecting the accuracy of the overall conclusions.
Summary of the Invention
[0004] Therefore, it is necessary to provide a method for analyzing the causes of ship repair and construction accidents that can systematically, automatically, accurately, and scalably identify, classify, and extract multi-level causal factors from large-scale, unstructured textual data of uneven quality on ship repair and construction accidents, and that can dynamically update the causal system.
[0005] This invention provides a method for analyzing the causes of ship repair and construction accidents, the method comprising:
obtaining accident description text data from the accident database of shipbuilding and repair enterprises, and performing desensitization and block processing on it;
constructing an intelligent recognition framework based on a large language model to identify the processed data blocks, so as to extract the causal event chain of each accident and to label the accident according to the preset chain accident judgment rules;
constructing an accident cause classification query mechanism, which calculates the similarity of text strings based on Levenshtein distance to realize the standardized mapping and classification query of accident causes;
constructing an LDA topic model using the labeled accident text data, determining the optimal number of topics by calculating the perplexity, and extracting the feature words under each topic;
based on a multi-level accident causation factor extraction framework, collaboratively performing text-based causation coding mapping, causation extraction based on existing qualitative reports, causation classification based on large language model event chains, and new causation discovery based on LDA topic feature words, and, based on the collaborative extraction results, constructing an accident causation classification model that includes six levels: social factors, organizational influence, unsafe supervision, preconditions for unsafe behavior, unsafe human behavior, and emergency response;
adopting a collaborative qualitative coding process to systematically extract and encode the causal factors from the accident description text based on the accident causation classification model, generating an accident causation dataset, and performing statistical analysis.
[0006] In one embodiment, the construction of the intelligent recognition framework based on a large language model includes: constructing an SRBE domain dictionary based on existing dictionaries in the fields of shipbuilding engineering, mechanical engineering, safety engineering, transportation, electrical engineering, management science and building decoration, together with a corpus constructed from detailed descriptions of SRBE accidents; removing special symbols, unit identifiers and irrelevant information from the corpus, and filtering function words and stop words using the Modern Chinese Function Word List and the Harbin Institute of Technology Stop Word List; comparing the verb segmentation frequency of the jieba, HanLP, THULAC, SnowNLP and LTP Chinese word segmentation tools, and selecting the tool with the best verb segmentation frequency as the benchmark model; and, by combining the verbs describing the manner of injury in the accident classification standards, compiling standard verbs and expanding industry-specific verbs to construct an SRBE accident injury manner verb mapping table.
[0007] In one embodiment, the preset chain accident determination rule includes: the accident description contains at least two related events; and the event at the end of the causal chain causes personal injury or death; the accident text that meets the rule is marked as 1, otherwise it is marked as 0.
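[0008] In one embodiment, the accident cause classification query mechanism calculates text string similarity using the Levenshtein distance, which is calculated as:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j), & \min(i,j)=0\\[4pt] \min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\end{cases}, & \text{otherwise}\end{cases}$$
In the formula, $\operatorname{lev}_{a,b}(i,j)$ indicates the minimum edit distance; $a_i$ and $b_j$ represent the $i$-th element of string $a$ and the $j$-th element of string $b$, respectively; $1_{(a_i\neq b_j)}$ is the indicator function, whose value is 1 when $a_i\neq b_j$ and 0 otherwise. Based on the minimum edit distance, the string similarity is then calculated using $|a|$ and $|b|$, the lengths of strings $a$ and $b$, respectively.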
[0009] In one embodiment, the accident cause classification query mechanism realizes rapid classification based on the risk factors in the accident description text, and standardized coding query based on existing accident causes; the rapid classification divides the direct causes into three levels: major category, minor category and sub-category, and performs standardized mapping of accident causes based on precise matching and fuzzy matching algorithms.
[0010] In one embodiment, the optimal number of topics is determined by calculating perplexity, which is calculated as:
$$\operatorname{perplexity}(D_{\text{test}})=\exp\left(-\frac{\sum_{d=1}^{M}\log p(w_d)}{\sum_{d=1}^{M}N_d}\right)$$
In the formula, $D_{\text{test}}$ represents the test set corpus, $p(w_d)$ is the probability of generating the terms of document $d$, $N_d$ is the total number of terms in document $d$, and $M$ is the total number of documents.
[0011] In one embodiment, the four methods for the collaborative execution of the multi-level accident causation factor extraction framework are: Grounded theory and HFACS model analysis based on detailed accident description texts; Extraction of existing causes based on the enterprise's accident characterization report; Event chain and cause classification and extraction based on large language model; New factor discovery based on LDA topic model and feature words.
[0012] In one embodiment, the hierarchical indicators of the accident cause classification model include:
The social factors level includes government oversight and the social environment.
The organizational influence level includes structure, policies, and culture under organizational atmosphere; human resources, information, funds, and equipment under resource management; incomplete systems, plans, and procedures, as well as loopholes in the implementation of systems, plans, and procedures, under organizational processes; changes or imperfections in process flow, job changes or temporary adjustments, and overlapping or changes in construction scope under change management; and lack of leadership and lack of effective communication and coordination under communication and coordination.
The unsafe supervision level includes a lack of effective supervision and training and a lack of effective on-site inspection and guidance under insufficient supervision; unreasonable labor organization and over-capacity production under inappropriate operational plans; inadequate rectification of hidden dangers and risks and failure to identify at-risk employees under failure to correct known problems; and illegal command and allowing unqualified personnel to enter under supervisory violations.
The preconditions for unsafe acts level includes the working environment, natural environment, and contractor environment under environmental factors; poor psychological and physiological state of workers; unsafe attitudes; low level of skills and knowledge; and, under technical equipment, lack or defects of protective devices, safety signals and similar devices, defects of equipment, facilities, tools and accessories, and lack or defects of PPE.
The unsafe human behavior level includes skill errors, decision-making errors and perceptual errors under errors; habitual violations under violations; and working in an emotionally unstable state and destructive acts under destructive behavior.
The emergency response level includes untimely handling and improper handling.
[0013] In one embodiment, the collaborative qualitative coding process includes:
Framework cognition and coding: coders are assigned to independently code randomly selected accident cases based on the accident causation classification model;
Inter-coder reliability test: the independent coding results are compared, the percentage of discrepant items per case out of the total number of coded items is computed, and the mean of these per-case discrepancy ratios is calculated;
Consensus mechanism construction: workshops are organized for the coded items with disagreements, and a consensus on the coding rules is reached through the Delphi method;
Independent coding of all samples after revising the coding rules: after consensus is reached, multiple coders independently code all accident cases, and the consistency of the coding results at each level is checked;
Standardized batch coding: if the consistency check passes, the causal factor coding work is performed independently on the remaining accident cases based on the consensus framework; if it fails, the above three steps are repeated.
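The inter-coder reliability test above can be illustrated with a small Python sketch. It assumes, purely for illustration, that each coder's result for a case is a list of 0/1 values over the same coded items; the source does not specify the data layout, and the function names `discrepancy_ratio` and `mean_discrepancy` are hypothetical.

```python
from typing import List, Sequence


def discrepancy_ratio(coder_a: Sequence[int], coder_b: Sequence[int]) -> float:
    """Share of coded items on which two coders disagree for one case."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must code the same non-empty item list")
    differing = sum(x != y for x, y in zip(coder_a, coder_b))
    return differing / len(coder_a)


def mean_discrepancy(cases_a: List[Sequence[int]],
                     cases_b: List[Sequence[int]]) -> float:
    """Mean of the per-case discrepancy ratios over all compared cases."""
    if len(cases_a) != len(cases_b) or len(cases_a) == 0:
        raise ValueError("both coders must code the same non-empty case list")
    ratios = [discrepancy_ratio(a, b) for a, b in zip(cases_a, cases_b)]
    return sum(ratios) / len(ratios)


# Made-up codes for two cases with four coded items each.
coder_1 = [[1, 0, 1, 0], [0, 1, 1, 0]]
coder_2 = [[1, 0, 0, 0], [0, 1, 1, 0]]
print(mean_discrepancy(coder_1, coder_2))  # 0.125
```

Items with a non-zero discrepancy would then be the ones taken to the consensus workshop.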
[0014] In one embodiment, the encoding of the causative factors follows the principles below: Each hazard factor mentioned in the accident description is mapped to a unique HFACS hierarchical category; Each HFACS subcategory can only be counted once per incident.
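The two coding principles can be sketched in Python as follows. This is a minimal illustration under assumptions: the category names and the factor-to-category mapping are invented examples rather than the indicator system of the invention, and `encode_incident` is a hypothetical helper.

```python
from typing import Dict, List


def encode_incident(factors: List[str],
                    factor_to_category: Dict[str, str],
                    categories: List[str]) -> List[int]:
    """Binary-code one incident over an ordered list of subcategories.

    A dict maps each hazard factor to exactly one category (principle 1),
    and collecting the hits in a set counts each subcategory at most once
    per incident (principle 2).
    """
    unknown = [f for f in factors if f not in factor_to_category]
    if unknown:
        raise KeyError(f"unmapped hazard factors: {unknown}")
    hit = {factor_to_category[f] for f in factors}
    return [1 if c in hit else 0 for c in categories]


# Hypothetical subcategories and mapping, for illustration only.
CATEGORIES = ["skill error", "habitual violation", "insufficient supervision"]
FACTOR_MAP = {
    "operated the wrong valve": "skill error",
    "dropped a hand tool": "skill error",
    "no on-site inspection": "insufficient supervision",
}

row = encode_incident(
    ["operated the wrong valve", "dropped a hand tool", "no on-site inspection"],
    FACTOR_MAP,
    CATEGORIES,
)
print(row)  # [1, 0, 1]: two skill-error factors still count once
```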
[0015] The aforementioned method for analyzing the causes of ship repair and construction accidents effectively addresses the issues of uneven detail and sparse information in the original text data of these accidents by combining the constructed intelligent identification framework with the LDA topic model. It systematically extracts causal event chains and potential topic feature words from unstructured text. By constructing a classification query mechanism based on Levenshtein distance, it achieves automated and standardized mapping and querying of causal factors described in the text, reducing subjectivity and inconsistencies in classification caused by reliance on human experience, and improving the objectivity and standardization of causal classification. Furthermore, through a collaboratively operating "multi-level accident causal factor extraction framework," it integrates the outputs of four paths: rule encoding, historical reports, event chain analysis, and topic discovery. The constructed six-level accident causal classification model not only covers the pre-defined causal system but also dynamically identifies and incorporates new causal factors through LDA topic feature words, thus overcoming the lag of static analysis frameworks. Ultimately, the collaborative qualitative coding process based on this classification model enables the efficient and systematic extraction and coding of causal factors from large-scale historical accident texts, generating a structured accident causation dataset. This provides a reliable foundation for subsequent statistical analysis and improves the overall efficiency, consistency, and scalability of accident causation analysis.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0017] Figure 1 is a flowchart of the ship repair and construction accident causation analysis method according to an embodiment of the present invention;
Figure 2 is a schematic diagram of the intelligent identification framework for SRBE chain accident injury data according to an embodiment of the present invention;
Figure 3 is a schematic diagram of the SRBE chain accident injury causation factor extraction framework according to an embodiment of the present invention;
Figure 4 is a schematic diagram of an accident injury causation index system based on the LDA-HFACS-SRBE architecture according to an embodiment of the present invention;
Figure 5 is a schematic diagram of the accident causation perplexity curves under different numbers of topics in an embodiment of the present invention;
Figure 6 is a flowchart of a method for analyzing the causes of ship repair and construction accidents according to another embodiment of the present invention;
Figure 7 is a schematic diagram of the hazard factor coding results in the accident description text according to an embodiment of the present invention;
Figure 8 is a schematic representation of the indicator markings corresponding to the accident causative factors in an embodiment of the present invention.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] The method, apparatus, and electronic equipment for analyzing the causes of ship repair and construction accidents according to the present invention are described below with reference to Figures 1-8.
[0020] As shown in Figure 1, in one embodiment, a method for analyzing the causes of ship repair and construction accidents includes the following steps:
Step S110: Obtain accident description text data from the accident database of shipbuilding and repair enterprises, and perform desensitization and block processing on it.
[0021] This embodiment uses the Shipbuilding and Repair Enterprise Accident Database (SRBAD) as the data source, selecting 1364 records from January 1, 2013 to May 18, 2022 as the research object. Since the original data contains a large amount of sensitive information, direct processing poses a privacy risk; therefore, anonymization is necessary. Anonymization rules include: deleting company information, hospital information, dock information, wharf information, work group information, etc.; retaining only the surname of personnel and masking the rest with an asterisk; and obfuscating the ship number and customized configuration information of ships under construction or repair. After anonymization, the data needs to be segmented to fit within the context window of the large language model. In the initial testing of this embodiment, the batch size was set to 100 cases, but this triggered an error due to model input limitations. After debugging, the data block size was reduced to 50 case texts per batch, which ensured the stability of subsequent processing.
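The block processing described above can be sketched in Python. The batch size of 50 and the record count of 1364 come from this embodiment; the function name `chunk_cases` and the placeholder case texts are assumptions made for illustration.

```python
from typing import Iterator, List


def chunk_cases(cases: List[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` case texts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(cases), batch_size):
        yield cases[start:start + batch_size]


# Placeholder texts standing in for the 1364 anonymized records.
cases = [f"case {i}" for i in range(1364)]
batches = list(chunk_cases(cases))
print(len(batches), len(batches[-1]))  # 28 14
```

With 50 cases per batch, the 1364 records split into 27 full batches and one final batch of 14.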
[0022] Step S120: Construct an intelligent recognition framework based on a large language model to identify the processed data blocks in order to extract the causal event chain of the accident and label the accident according to the preset chain accident judgment rules.
[0023] Leveraging the semantic understanding capabilities of large language models, structured causal event chains are automatically extracted from unstructured text. Referring to Figure 2, the construction of the intelligent recognition framework involves multi-model comparison and prompt optimization. This embodiment compares multiple models, including deepseek-r1-pro, gemini-2.0-pro-exp, grok-3, and chatgpt-o3-mini-high, and ultimately selects the model with the best performance in terms of event chain extraction completeness and annotation accuracy as the core engine. During the extraction process, the model performs binary labeling according to preset chain accident judgment rules: if the accident description contains at least two related events, and the event at the end of the causal chain causes personal injury or death, it is labeled as 1 (chain accident injury); otherwise, it is labeled as 0. Through this step, chain accident samples with analytical value can be quickly screened from massive amounts of data, significantly reducing the cost of manual screening.
[0024] Step S130: Construct an accident cause classification query mechanism. This mechanism calculates the similarity of text strings based on Levenshtein distance to achieve standardized mapping and classification query of accident causes.
[0025] To address the challenges of inconsistent and disorganized accident causation classification standards, this embodiment develops a classification query mechanism based on string similarity matching. This mechanism utilizes the Levenshtein distance algorithm to calculate the edit distance between the input text and the standard causation classification library, converting this distance into a similarity score to achieve rapid matching and standardized mapping of causative factors. This step not only resolves the issue of different classifications for the same causation but also provides a data cleaning tool for subsequently building a standardized causation classification model, ensuring the consistency of the analytical foundation.
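A minimal Python sketch of such a query mechanism is given below: exact matching is tried first, then fuzzy matching by edit-distance similarity. Several parts are assumptions made for illustration and are not taken from the source: the normalization `1 - distance / max(len(a), len(b))`, the 0.6 threshold, and the two-entry library with its codes.

```python
from typing import Dict, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def query_cause(text: str, library: Dict[str, str],
                threshold: float = 0.6) -> Optional[Tuple[str, str, float]]:
    """Return (code, standard term, score), or None if nothing is close enough."""
    if text in library:                       # precise matching
        return library[text], text, 1.0
    best = max(library, key=lambda term: similarity(text, term), default=None)
    if best is None:
        return None
    score = similarity(text, best)            # fuzzy matching
    return (library[best], best, score) if score >= threshold else None


# Hypothetical standard cause library: term -> classification code.
LIBRARY = {"lack of PPE": "P-07", "habitual violation": "A-03"}
print(query_cause("lack of PPEs", LIBRARY))
print(query_cause("xyz", LIBRARY))  # None
```

Returning the score alongside the code lets a reviewer audit borderline fuzzy matches instead of accepting them silently.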
[0026] Step S140: Using the labeled accident text data, construct an LDA topic model, determine the optimal number of topics by calculating the perplexity, and extract feature words under each topic.
[0027] After obtaining the labeled cascading accident dataset, this embodiment introduces the Latent Dirichlet Allocation (LDA) topic model for latent semantic mining. The LDA model represents each document as a random mixture of latent topics, thereby discovering implicit causal patterns behind the text. Determining the optimal number of topics is crucial for LDA modeling. This embodiment calculates perplexity values under different numbers of topics, plots perplexity curves, and determines the optimal number of topics based on the inflection point or lowest point of the curve. For example, for six typical accident types, such as object strikes and mechanical injuries, the optimal numbers of topics were determined to be 5, 5, 6, 3, 5, and 4, respectively. The topic feature words extracted in this step provide an important source of candidate words for the subsequent construction of causal classification model indicators, realizing the transformation from data to knowledge.
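The perplexity-based selection can be sketched in plain Python. The sketch assumes the per-document log-probabilities have already been obtained from a fitted topic model, and it implements only the lowest-point criterion, not inflection-point detection. The perplexity scores in the example are made-up values, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def perplexity(doc_log_probs: List[float], doc_lengths: List[int]) -> float:
    """exp(-sum_d log p(w_d) / sum_d N_d) over a held-out corpus."""
    if len(doc_log_probs) != len(doc_lengths):
        raise ValueError("one log-probability and one length per document")
    total_terms = sum(doc_lengths)
    if total_terms == 0:
        raise ValueError("corpus has no terms")
    return math.exp(-sum(doc_log_probs) / total_terms)


def best_topic_count(scores: Dict[int, float]) -> int:
    """Pick the number of topics with the lowest perplexity."""
    return min(scores, key=scores.get)


# Sanity check: a model that gives every term probability 1/8
# has perplexity 8, whatever the document lengths are.
uniform = perplexity([4 * math.log(1 / 8), 6 * math.log(1 / 8)], [4, 6])

# Made-up perplexity values for candidate topic counts.
scores = {3: 412.0, 4: 371.5, 5: 350.2, 6: 358.9}
print(best_topic_count(scores))  # 5
```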
[0028] Step S150: Based on the multi-level accident causation factor extraction framework, collaboratively execute text-based causation encoding mapping, causation extraction based on existing qualitative reports, causation classification based on large language model event chains, and new causation discovery based on LDA topic feature words. Based on the collaborative extraction results, construct an accident causation classification model that includes six levels: social factors, organizational influence, unsafe supervision, preconditions for unsafe behavior, unsafe human behavior, and emergency response.
[0029] Referring to Figure 3 and Figure 4, this embodiment constructs a multi-level dynamic framework for extracting accident causal factors. This framework is not a simple application of a single method, but rather a synergy of four core methods: deep encoding of accident texts using grounded theory, extraction of deterministic causes from existing enterprise reports, classification of large-scale event chains using a large language model, and discovery of potential new factors using an LDA model. Through this collaborative processing of multi-source heterogeneous data, an LDA-HFACS-SRBE accident injury causal index system comprising six levels and 40 subcategories is constructed. Compared to traditional static frameworks, this model not only covers the core levels of the classic HFACS model but also adds social factors and emergency response levels tailored to the characteristics of the shipbuilding and repair industry, thus enabling a more comprehensive capture of the underlying causes and end-stage handling of accidents.
[0030] Step S160: Using a collaborative qualitative coding process, the causal factors of the accident description text are systematically extracted and coded based on the accident causal classification model, generating an accident causal dataset and performing statistical analysis.
[0031] This embodiment designs a five-step collaborative qualitative coding process. This process includes framework cognition and coding, inter-coder reliability testing, consensus mechanism construction, independent coding of all samples after revising coding rules, and standardized batch coding. Through multi-person independent coding and consistency testing (such as calculating Krippendorff's Alpha coefficient), the subjective blind spots and fatigue errors of single-person analysis are effectively avoided. The resulting SRBE cascade accident injury causation dataset records the distribution of each accident across various causative indicators in binary code form, providing a structured data foundation for subsequent quantitative analysis such as frequency statistics and association rule mining.
[0032] This embodiment constructs an intelligent recognition framework based on a large language model, which can automatically extract and label causal event chains from unstructured accident description text, solving the problems of low efficiency and strong subjectivity in traditional manual analysis and significantly improving the automation level of data processing. By constructing an accident cause classification query mechanism based on Levenshtein distance, it realizes standardized mapping and rapid query of accident causes, effectively solving the problems of inconsistent cause classification standards and chaotic classification, and improving the accuracy and consistency of analysis. By combining the LDA topic model and the multi-level accident cause factor extraction framework, and synergistically utilizing the advantages of grounded theory, HFACS model and large language model, it can not only systematically extract known causes, but also actively discover potential new cause factors, constructing a complete accident cause classification model with six levels, overcoming the shortcomings of static analysis frameworks in updating. In addition, the adoption of a collaborative qualitative coding process and closed-loop optimization mechanism ensures the reliability of cause factor extraction and the adaptive optimization capability of the model, providing scientific and accurate data support for accident prevention and safety management in shipbuilding and repair enterprises.
[0033] In one embodiment, the key to building an intelligent recognition framework based on a large language model lies in constructing an SRBE domain dictionary. This SRBE domain dictionary is built upon existing dictionaries in the fields of shipbuilding engineering, mechanical engineering, safety engineering, transportation, electrical engineering, management science, and architectural decoration, as well as a corpus constructed from detailed descriptions of SRBE accidents.
[0034] Referring to Figure 2, the construction of the SRBE domain dictionary is a process of multi-source fusion and iterative optimization. The initial dictionary construction is primarily based on corpora and open-source resources, supplemented by expert knowledge and domain literature. For open-source resources, existing dictionaries from seven domains (shipbuilding engineering, mechanical engineering, safety engineering, transportation, electrical engineering, management science, and architectural decoration) were selected to ensure comprehensive terminology coverage. The corpus construction began with data preprocessing. Based on detailed descriptions of SRBE accidents, special symbols, unit identifiers, and irrelevant information were removed from the corpus. Then, function words and stop words were filtered using the Modern Chinese Function Word List and the Harbin Institute of Technology Stop Word List. The cleaning rules were iterated and optimized multiple times to ensure the purity of the corpus.
[0035] Furthermore, in order to quantitatively evaluate the performance differences of various word segmentation tools in verb feature extraction, this embodiment selects the tool with the best verb segmentation frequency as the benchmark model based on the comparison results of verb segmentation frequency of Chinese word segmentation tools such as jieba, HanLP, THULAC, SnowNLP and LTP.
[0036] Specifically, 20 detailed accident descriptions were randomly selected as the experimental corpus to construct a test set. Using verb-related parts of speech as the observation indicator, five mainstream Chinese word segmentation tools—jieba, HanLP, THULAC, SnowNLP, and LTP—were compared. By comparing the performance of each tool in verb segmentation frequency, the tool with the best segmentation effect was selected as the benchmark model for subsequent text processing, thereby improving the accuracy of word segmentation. Finally, combining the verbs describing the manner of injury in the accident classification criteria, standard verbs were compiled and industry-specific verbs were expanded to construct the SRBE accident injury manner verb mapping table, forming a multi-level mapping relationship, providing standardized support for subsequent semantic parsing.
[0037] The advantage of this series of operations is that by integrating multi-domain dictionaries and performing refined cleaning, the problems of complex professional terminology in the shipbuilding and repair field and low accuracy of general word segmentation tools are solved, providing a high-quality semantic foundation for the subsequent extraction of event chains in large models and significantly improving recognition accuracy.
[0038] After the dictionary is built, the intelligent recognition framework needs to label the accidents according to the preset chain accident determination rules. The preset chain accident determination rules include: the accident description contains at least two related events; and the event at the end of the causal chain causes personal injury or death; accident texts that meet the rules are labeled as 1, otherwise they are labeled as 0.
[0039] Specifically, the determination of chain-related injuries follows strict standards: First, the accident description must contain at least two related events, such as a chain of A→B→C or A→B→C→D; second, the terminal event in the causal chain must result in personal injury or death. Based on this, a binary labeling strategy is adopted: accident texts meeting the above standards are labeled as 1 (chain-related injury), otherwise labeled as 0 (non-chain-related injury). The labeling results are stored in the labeling field, and the extracted event chain is stored in the event chain field. For example, for an arc burn accident, its event chain can be extracted as "electric pen touching copper busbar (live operation) → copper busbar phase-to-phase short circuit (electrical fault) → arc generation (energy release) → burns to the face and back of the hand (direct injury)". This case contains four related events, and the terminal event results in injury, so it is labeled as 1. This rule effectively eliminates interference from accidents without personal injury or single-event accidents, ensuring that subsequent causal analysis focuses on accident samples with typical chain characteristics, thus improving the relevance and effectiveness of the analysis results.
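The binary labeling rule can be expressed as a short Python sketch. The record layout (`events`, `terminal_causes_harm`) is a hypothetical structure chosen for illustration; in the embodiment the event chain and the injury judgment come from the large language model, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AccidentRecord:
    events: List[str]           # ordered causal event chain
    terminal_causes_harm: bool  # terminal event causes injury or death


def label_chain_accident(record: AccidentRecord) -> int:
    """Return 1 for a chain accident injury, 0 otherwise."""
    has_chain = len(record.events) >= 2
    return 1 if has_chain and record.terminal_causes_harm else 0


# The arc burn example from the text: four related events ending in injury.
arc_burn = AccidentRecord(
    events=[
        "electric pen touching copper busbar",
        "copper busbar phase-to-phase short circuit",
        "arc generation",
        "burns to the face and back of the hand",
    ],
    terminal_causes_harm=True,
)
single_event = AccidentRecord(events=["worker slipped"], terminal_causes_harm=True)
no_injury = AccidentRecord(events=["cable frayed", "fuse tripped"], terminal_causes_harm=False)

print(label_chain_accident(arc_burn),
      label_chain_accident(single_event),
      label_chain_accident(no_injury))  # 1 0 0
```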
[0040] Furthermore, in the data preprocessing stage, this embodiment also specifies the detailed parameters for data anonymization and batch processing. To prevent the leakage of confidential information, identifying information such as enterprise information, hospital information, docks, wharves, and work teams is deleted; only the surnames of personnel are retained, with the rest of each name masked by asterisks; and the ship numbers and customized configuration information of ships under construction or under repair are obfuscated. Regarding batch processing, considering the limitations of the large model's context window, the initial batch size was set at 100 accident cases per batch. However, in actual testing, when 100 case texts were passed in a continuous dialogue mode by calling the large model API through a Python script, the output was interrupted during the second round of prompt optimization, and the third request triggered an error because the model's input limit was exceeded. After this verification, the batch size was reduced to 50 case texts per batch, which ensured the stability of subsequent processing. Adjusting the batch size based on actual test feedback ensures the continuity and reliability of large-scale text data processing.
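The batch splitting described above can be illustrated with a short sketch. Only the batch size of 50 comes from the text; the function name `iter_batches` is an assumption, and the call to the large model API is deliberately left out because its interface is not given.

```python
from typing import Iterator, Sequence

BATCH_SIZE = 50  # value reported in the text after testing with 100 failed


def iter_batches(cases: Sequence[str], batch_size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive batches of accident case texts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(cases), batch_size):
        yield list(cases[start:start + batch_size])


cases = [f"case {i}" for i in range(120)]
batch_sizes = [len(batch) for batch in iter_batches(cases)]
```

Each yielded batch would then be sent to the model in its own request, so that no single dialogue exceeds the input limit.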
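[0041] In one embodiment, the accident cause classification query mechanism calculates text string similarity using Levenshtein distance, the formula for which is:
$$\mathrm{lev}_{a,b}(i,j)=\begin{cases}\max(i,j), & \text{if } \min(i,j)=0,\\[4pt] \min\begin{cases}\mathrm{lev}_{a,b}(i-1,j)+1\\ \mathrm{lev}_{a,b}(i,j-1)+1\\ \mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\end{cases}, & \text{otherwise.}\end{cases}$$
In the formula, $\mathrm{lev}_{a,b}(i,j)$ indicates the minimum edit distance between the first $i$ characters of string $a$ and the first $j$ characters of string $b$; $a_i$ and $b_j$ represent the $i$-th element of string $a$ and the $j$-th element of string $b$, respectively; and $1_{(a_i\neq b_j)}$ is the characteristic function, whose value is 1 when $a_i\neq b_j$ and 0 otherwise.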
[0042] Specifically, Levenshtein distance (edit distance) refers to the minimum number of edits required to transform string A into target string B. Editing operations include insertion, deletion, and replacement. In accident causal analysis scenarios, accident description texts often contain non-standard expressions and misuse of synonyms, making direct keyword matching ineffective. This embodiment introduces an edit distance algorithm to quantify the degree of difference between two strings. For example, the accident cause description "missing protective equipment" and the standard term "damaged protective equipment," although not entirely identical in wording, can be identified as highly similar by calculating the edit distance. The recursive logic in the formula ensures that the algorithm can handle strings of arbitrary length, efficiently finding the optimal transformation path through dynamic programming, thus laying the foundation for subsequent similarity calculations.
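[0043] Furthermore, based on the minimum edit distance, the formula for calculating string similarity is:
$$\mathrm{Sim}(a,b)=1-\frac{\mathrm{lev}_{a,b}(|a|,|b|)}{\max(|a|,|b|)}$$
In the formula, $\mathrm{lev}_{a,b}(|a|,|b|)$ is the minimum edit distance between strings $a$ and $b$, and $|a|$ and $|b|$ are the lengths of strings $a$ and $b$, respectively.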
[0044] Specifically, this formula normalizes the similarity calculation by dividing the edit distance by the maximum of the two string lengths, eliminating the influence of text length differences on the similarity calculation and stabilizing the result between 0 and 1. A higher similarity value indicates a higher similarity between the two strings. A similarity of 1 indicates that the two strings are completely identical; a similarity close to 0 indicates that they are significantly different. This quantitative indicator provides a mathematical basis for the automated classification of accident causes, enabling the system to automatically determine whether to map non-standardized descriptions to standard classification codes based on a similarity threshold, effectively solving the problems of strong subjectivity and inconsistent standards in manual classification.
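The edit distance and the normalized similarity can be sketched in a few lines of standard-library Python. This is an illustrative implementation of the well-known dynamic-programming algorithm, not the software's actual code; the function names and the `k` parameter are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1  # characteristic function
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: one minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest


def top_matches(query: str, standard_terms: list[str], k: int = 5) -> list[tuple[float, str]]:
    """Rank standard terms by similarity to the query, highest first."""
    scored = [(similarity(query, term), term) for term in standard_terms]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Because "missing" and "damaged" have the same length, the two example phrases differ by at most seven substitutions out of 28 characters, so their similarity is at least 0.75. A threshold on this score could then decide whether a free-text description is mapped to a standard term.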
[0045] The accident cause classification query mechanism enables rapid classification of hazardous factors based on accident description text, as well as standardized coding query based on existing accident causes. Rapid classification divides direct causes into three levels: major category, minor category, and sub-category, and performs standardized mapping of accident causes based on precise matching and fuzzy matching algorithms.
[0046] Specifically, this embodiment develops accident cause classification query software based on national standards, aiming to provide efficient accident cause classification and query functions. In terms of technical implementation, the software relies on the Dash third-party library in Python to build its overall layout and uses the Bootstrap framework to style the web page, providing an intuitive and user-friendly interface. The software page adopts a left-right layout: the left side displays two sections, a drop-down menu for direct and indirect causes and a cause classification diagram, where the "cause classification diagram" lists the hierarchical structure of unsafe conditions and unsafe behaviors in tabular form; the right side contains the query input box and the query results. Regarding accident cause classification, following the provisions on accident cause classification codes in the accident classification standard, direct causes are divided into three levels: major category (e.g., 6.01), minor category (e.g., 6.01.1), and sub-category (e.g., 6.01.1.1). This hierarchical division conforms to the classification logic of safety management and makes it possible to locate causal factors step by step from the macro level to the micro level.
[0047] In terms of matching logic, the system combines precise matching and fuzzy matching algorithms. Precise matching allows users to search by the full name or code number of an accident cause. If the input is a minor category or sub-category, the query output shows all the higher-level categories to which it belongs. For example, inputting "6.01.1" automatically displays the major category "6.01" and the related descriptions. Fuzzy matching returns the top 5 matches for the input, ordered by string similarity score from highest to lowest. The output fields are: similarity, indirect cause, major category, minor category, and sub-category. Cells without data are left blank. If the input is the full name or code number of an accident cause and fuzzy matching yields a string similarity of 100 (the maximum score), the output is the same as the precise matching result. Through this dual matching mechanism, the system can handle standardized query requests and also cope with non-standardized, fuzzy descriptive inputs, which improves the efficiency and accuracy of accident cause analysis and achieves automated mapping from unstructured text descriptions to structured classification codes.
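The way a precise match on a code number can recover its parent categories is sketched below. It assumes that parent codes are dot-separated prefixes of child codes, as the example "6.01.1" and "6.01" suggests; the function name and the dictionary layout are hypothetical.

```python
LEVELS = ("major category", "minor category", "sub-category")


def classify_code(code: str) -> dict[str, str]:
    """Split a cause code into the codes of each level it belongs to.

    Levels below the input code are returned as blank strings, mirroring
    the blank cells in the query output.
    """
    parts = code.split(".")
    if not 2 <= len(parts) <= 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"unexpected code format: {code!r}")
    row = {}
    for depth, level in enumerate(LEVELS, start=2):
        # A major category has 2 parts, a minor category 3, a sub-category 4.
        row[level] = ".".join(parts[:depth]) if len(parts) >= depth else ""
    return row


row = classify_code("6.01.1")
```

Here `row` holds "6.01" as the major category, "6.01.1" as the minor category, and a blank sub-category.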
[0048] Furthermore, based on the accident cause classification query mechanism, a mapping relationship between accident causes and the HFACS framework is established. Referring to Table 1, the mapping relationship between indirect and root causes of accidents and the original HFACS framework is shown; referring to Table 2, the mapping relationship between direct causes of accidents and the HFACS framework is shown. These mapping relationships provide a basis for the standardized classification of accident causal factors, realizing automatic conversion from accident cause coding to HFACS levels.
[0049] Table 1. Mapping relationship between indirect and root causes of accidents and the original HFACS framework.
Table 2. Mapping relationship between direct causes of accidents and the HFACS framework.
In one embodiment, an LDA topic model is constructed using labeled accident text data, the optimal number of topics is determined by calculating perplexity, and feature words under each topic are extracted.
[0050] LDA (Latent Dirichlet Allocation) topic modeling is a topic mining method based on a probabilistic generative model, capable of efficiently uncovering latent topics in text corpora. When constructing the model, the choice of the number of topics directly determines the model's generalization ability and explanatory power. This embodiment introduces perplexity as an evaluation metric, reflecting the model's uncertainty regarding the distribution of document topics. The smaller the perplexity value, the stronger the model's predictive ability and the better its topic structure. The formula for calculating perplexity is:
$$\mathrm{perplexity}(D)=\exp\left(-\frac{\sum_{d=1}^{M}\log p(w_d)}{\sum_{d=1}^{M}N_d}\right)$$
In the formula, $D$ represents the test set corpus; $p(w_d)$ is the probability of generating the terms of document $d$; $N_d$ is the total number of terms in document $d$; and $M$ represents the total number of documents.
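As a small numerical illustration of the perplexity definition, the sketch below computes it from per-document log-likelihoods. It assumes those log-likelihoods have already been produced by a trained topic model; the function name and argument names are not from the original.

```python
import math


def perplexity(doc_log_probs: list[float], doc_lengths: list[int]) -> float:
    """exp of the negative total log-likelihood divided by the total term count."""
    if len(doc_log_probs) != len(doc_lengths):
        raise ValueError("one log-probability and one length per document")
    total_terms = sum(doc_lengths)
    if total_terms <= 0:
        raise ValueError("the test corpus must contain at least one term")
    return math.exp(-sum(doc_log_probs) / total_terms)


# Sanity check: a model that assigns probability 0.1 to every term
# (uniform over a 10-word vocabulary) should have perplexity 10.
uniform = perplexity([3 * math.log(0.1), 5 * math.log(0.1)], [3, 5])
```

Computing this value for each candidate number of topics gives the curve from which the topic count is chosen.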
[0051] Referring to Figure 5, this embodiment calculates the perplexity value for different numbers of topics and plots the curve of perplexity against the number of topics. Taking six typical accident types in the accident database of shipbuilding and repair enterprises as examples, the optimal number of topics for each accident type is determined by observing the inflection point or stable interval of the perplexity curve: 5 for object impact, 5 for mechanical injury, 6 for fall from height, 3 for other injuries, 5 for crane injury, and 4 for vehicle injury. This process captures the latent semantic structure of the accident text, provides a quantitative basis for the subsequent extraction of causal factors, and avoids the subjectivity and arbitrariness of setting the number of topics manually.
[0052] Furthermore, the accident causation themes and their corresponding keywords were extracted, and repetitive or similar themes within the same accident type were merged. Table 3 shows the themes and feature words extracted by the LDA model. These theme feature words provide an important source of candidate words for the subsequent construction of indicators for the causation classification model, realizing the transformation from data to knowledge.
[0053] Table 3. Topics and keywords extracted by the LDA model.
Furthermore, based on a multi-level accident causation factor extraction framework, the method collaboratively performs causation coding and mapping based on accident text, causation extraction based on existing qualitative reports, event chain and causation classification based on a large language model, and new causation discovery based on LDA topic feature words.
[0054] Specifically, referring to Figure 3, this framework is not a simple aggregation of single methods, but a combination of four core approaches with complementary strengths. First, causation coding based on grounded theory and the HFACS model analyzes detailed accident description texts, extracts the specific manifestations of accident causes within the HFACS model, and maps them to the classification hierarchy of the framework through systematic coding of accident texts and expert judgment. This method excels at uncovering deep, latent causes. Second, cause extraction based on existing enterprise accident qualitative reports directly extracts the identified accident causes from reports that have already undergone qualitative analysis. This method offers convenient data acquisition and can serve as a validation benchmark. Third, event chain and cause classification extraction based on a large language model can efficiently process large-scale data. Combined with grounded theory, it achieves more accurate factor extraction and causal relationship analysis, while improving the efficiency and accuracy of topic feature word extraction. Finally, new factor discovery based on the LDA topic model and feature words mines topics from accident texts and combines feature word analysis to identify potential new causal factors, overcoming the difficulty that pre-defined classification systems have in absorbing emerging causes.
[0055] These four methods form a closed-loop synergy within the framework: new factors discovered by the LDA model supplement the knowledge base of the large model; the event chains extracted by the large model provide coding material for grounded theory; the coding results of grounded theory correct the classification biases of existing reports; and the authority of existing reports, in turn, verifies the completeness of the results of the first three stages. This synergistic mechanism effectively solves the problems of low efficiency, narrow coverage, or strong subjectivity that exist when a single method is used to process large-scale unstructured data, achieving a balance between the comprehensiveness and accuracy of causal factor extraction.
[0056] In one embodiment, the accident causation classification model includes six levels: social factors, organizational influence, unsafe supervision, preconditions for unsafe behavior, unsafe human behavior, and emergency response.
[0057] Referring to Figure 4, the LDA-HFACS-SRBE accident injury causation model index system constructed in this embodiment goes beyond the traditional HFACS model, which includes only four levels: organizational influence, unsafe supervision, preconditions for unsafe behavior, and unsafe human behavior. Considering the characteristics of the shipbuilding and repair industry, such as high risk, cross-operation of multiple trades, and strong influence from the external environment, this embodiment extends the traditional model upward to the social factor level and downward to the emergency response level, forming a six-level analysis framework covering the entire accident chain. This hierarchical expansion is not a simple superposition. It is based on the multi-level transmission mechanism of SRBE chain accident injuries, and it can capture key causes such as lack of external supervision, social environmental pressure, and failure of emergency response after the accident.
[0058] Furthermore, the social factors level includes government oversight and the social environment.
[0059] Specifically, the social factors level is located at the top of the model, representing the macro-level background influencing the occurrence of accidents. Referring to Table 4, the government regulatory indicator (A11) corresponds to external pressures in the causes of accidents, specifically manifested in management regulations formulated by safety regulatory departments, the intensity of safety inspections, and the effectiveness of inspections. For example, if regulatory departments are lax in their review of the special operation qualifications of shipbuilding and repair enterprises, or if the penalties for violations are insufficient, these are all causes at this level. The social environment indicator (A12) has multi-dimensional pervasive characteristics, specifically manifested in economic downturns, pandemics, corporate restructuring, resource integration and optimization, and unreasonable transportation planning. For example, under economic downturn pressure, enterprises may reduce safety investment, thereby indirectly leading to accidents. This level allows the analytical method to transcend the internal perspective of enterprises and examine the root causes of accidents from a more macro-level social dimension.
[0060] Table 4. Top-level and bottom-level indicators and causal manifestations.
Table 5. Organizational influence level indicators and causal manifestations.
Furthermore, the organizational influence level includes: structure, policy, and culture under organizational atmosphere; human resources, information, funding, and equipment under resource management; incomplete systems, plans, and procedures, and loopholes in the implementation of systems, plans, and procedures, under organizational processes; changes or imperfections in process flow, job changes or temporary adjustments, and overlapping or changed construction scope under change management; and lack of leadership and lack of effective communication and coordination under communication and coordination.
[0061] Specifically, the organizational influence level is the deep-seated organizational root cause of accidents. Referring to Table 5, this level is divided into five aspects and 14 indicators. Regarding organizational atmosphere, structure (B11) refers to an unreasonable organizational structure, such as a lack of independence in the safety management department; policy (B12) refers to lagging or impractical safety policies; and culture (B13) refers to a weak corporate safety culture, such as management prioritizing efficiency over safety. Regarding resource management, human resources (B21) refers to insufficient safety management personnel; information (B22) refers to poor or missing safety information transmission; funding (B23) refers to insufficient safety investment; and equipment (B24) refers to lagging updates to safety equipment and facilities. Regarding organizational processes, incomplete systems, plans, and procedures (B31) refer to a lack of necessary operating procedures; and loopholes in the implementation of systems, plans, and procedures (B32) refer to non-compliance with existing rules. Regarding change management, changes or imperfections in the process flow (B41), changes or temporary adjustments to job positions (B42), and overlapping or altered construction scopes (B43) all indicate a loss of control in a dynamic operational environment. Regarding communication and coordination, a lack of leadership (B51) indicates a lack of on-site command, and a lack of effective communication and coordination (B52) indicates information barriers between departments. This detailed breakdown of indicators allows analysts to pinpoint weaknesses in organizational management.
[0062] Furthermore, unsafe supervision levels include insufficient supervision (lack of effective supervision and training, lack of effective on-site inspection and guidance), inappropriate operation plans (unreasonable labor organization, production exceeding capacity), failure to correct known problems (inadequate rectification of hidden dangers and risks, failure to identify risky employees), and violations of supervision (illegal command, allowing unqualified personnel to enter).
[0063] Specifically, the unsafe supervision level is a key link connecting organizational management and front-line operations. Referring to Table 6, this level includes four aspects and eight indicators. Regarding insufficient supervision, the lack of effective supervision and training (C11) corresponds to insufficient or absent supervision, lack of knowledge, and lack of training in the cause category mapping; the lack of effective on-site inspection and guidance (C12) corresponds to insufficient inspection or incorrect guidance of on-site work in the cause category mapping. Regarding inappropriate operational planning, unreasonable labor organization (C21) manifests as improper personnel allocation and improper work arrangement; overcapacity production organization (C22) manifests as insufficient rest, overloaded operation, and working through the night in the form of overtime work. Regarding uncorrected known problems, inadequate rectification of hidden dangers and risks (C31) corresponds to the absence or lack of serious implementation of accident prevention measures and ineffective rectification of accident hazards in the cause category mapping; failure to identify risky employees (C32) specifically refers to the lack of risk identification for new employees, transferred workers, and employees with negative emotions. Regarding violations of supervision, "illegal command" (C41) refers to managers forcing employees to engage in hazardous operations; "allowing unqualified personnel to enter" (C42) refers to untrained personnel entering the work area or allowing temporary personnel to operate equipment. This detailed classification helps to clarify specific gaps in responsibility at the supervisory and management level.
[0064] Table 6. Indicators of the unsafe supervision level and their causal manifestations.
Furthermore, the preconditions for unsafe acts include: the work environment, natural environment, and contractor environment under environmental factors; poor psychological and physiological state, unsafe attitudes, and low level of skills and knowledge under worker condition; and lack of or defects in protective safety signals and other devices, defects in equipment, facilities, tools, and accessories, and lack of or defects in PPE under technical equipment.
[0065] Specifically, the preconditions hierarchy of unsafe acts represents the direct background leading to unsafe behaviors. Referring to Table 7, this hierarchy comprises three aspects and nine indicators. Regarding environmental factors, the work environment (D11) refers to insufficient lighting, poor ventilation, etc.; the natural environment (D12) refers to the impact of severe weather; and the contractor environment (D13) refers to interference from related parties' operations, a typical characteristic of the shipbuilding and repair industry where multiple trades operate simultaneously. Regarding worker condition, poor psychological and physiological condition (D21) refers to working while fatigued or ill; unsafe attitudes (D22) refer to complacency and negligence; and low skill and knowledge levels (D23) refer to lack of operational skills and insufficient safety knowledge. Regarding technical equipment, lack of or defective protective safety signals and other devices (D31) refers to malfunctioning safety protection facilities; defective equipment, facilities, tools, and accessories (D32) refer to equipment operating with defects; and lack of or defective PPE (D33) refers to missing or ineffective personal protective equipment. Identifying these preconditions provides a basis for developing targeted preventative measures.
[0066] Table 7. Indicators of the preconditions level of unsafe behaviors and their causal manifestations.
Furthermore, the hierarchy of unsafe human behavior includes skill errors and decision-making and perceptual errors under errors, habitual violations under violations, and unstable emotional work and deliberate sabotage under destructive behavior.
[0067] Specifically, the hierarchy of unsafe human behaviors refers to the triggering actions that directly lead to accidents. Referring to Table 8, this hierarchy includes three aspects and five indicators. Regarding errors, skill errors (E11) correspond to multiple operational mistakes in the cause category mapping, such as misoperation and forgetting operational steps; decision-making and perceptual errors (E12) correspond to judgment errors and risk perception errors. Regarding violations, habitual violations (E21) correspond to multiple violations in the cause category mapping, such as not operating according to procedures and working without a license; this is the most common cause type in shipbuilding and repair sites. Regarding destructive behavior, unstable emotional work (E31) refers to working under negative emotions; intentional sabotage (E32) corresponds to incorrect motivation or lack of interest in the cause category mapping, such as deliberate concealment and intentional damage. This hierarchical classification helps distinguish between unintentional errors and intentional violations, thus allowing for different intervention strategies.
[0068] Table 8. Categories and mapping relationships of the unsafe behavior hierarchy.
Furthermore, the emergency response level includes untimely response and improper handling.
[0069] Specifically, the emergency response level is a newly added final level in this embodiment, tailored to the characteristics of ship repair and construction accidents, used to evaluate the effectiveness of the emergency response after an accident occurs. Referring to Table 4, untimely response (F11) corresponds to the time element in emergency preparedness and response, specifically manifested as rescue delays, failure to arrive at the scene in a timely manner, and failure to detect the accident promptly; improper handling (F12) corresponds to the operational element in emergency preparedness and response, specifically manifested as insufficient estimation of the accident situation and lack of professional knowledge among rescue personnel. The introduction of this level extends the timeline of accident analysis from "before the accident" to "after the accident," enabling causal tracing throughout the entire accident lifecycle. For example, if a fall from height accident results in aggravated injuries due to improper rescue, "improper handling (F12)" will be marked as a significant cause, thereby prompting companies to improve their emergency plans and rescue training. Through the detailed division of the above six levels, this embodiment constructs a complete classification system containing 40 subcategories, achieving comprehensive and thorough coverage of the causal factors of ship repair and construction accidents.
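The six-level indicator system can be held as a simple lookup structure. In the sketch below, the level names and indicator codes are those listed in the description; the variable name `INDICATOR_SYSTEM` and the helper `level_of` are assumptions made for illustration.

```python
# Indicator codes per level, as enumerated in the description.
INDICATOR_SYSTEM = {
    "social factors": ["A11", "A12"],
    "organizational influence": [
        "B11", "B12", "B13", "B21", "B22", "B23", "B24",
        "B31", "B32", "B41", "B42", "B43", "B51", "B52",
    ],
    "unsafe supervision": ["C11", "C12", "C21", "C22", "C31", "C32", "C41", "C42"],
    "preconditions for unsafe behavior": [
        "D11", "D12", "D13", "D21", "D22", "D23", "D31", "D32", "D33",
    ],
    "unsafe human behavior": ["E11", "E12", "E21", "E31", "E32"],
    "emergency response": ["F11", "F12"],
}


def level_of(code: str) -> str:
    """Return the level name that contains the given indicator code."""
    for level, codes in INDICATOR_SYSTEM.items():
        if code in codes:
            return level
    raise KeyError(f"unknown indicator code: {code}")


total_indicators = sum(len(codes) for codes in INDICATOR_SYSTEM.values())
```

Summing the codes across the six levels gives the 40 subcategories stated above.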
[0070] As shown in Figure 6, in one embodiment, a collaborative qualitative coding process is adopted to systematically extract and encode the causal factors from the accident description text based on the accident causation classification model.
[0071] Given the large volume of accident data, if a single researcher independently completes all data processing and causal analysis, the researcher's own knowledge background and analytical perspective can easily lead to a fixed mindset, resulting in subjective blind spots in the identification of certain potential causal factors and in low efficiency when dealing with large-scale accident datasets. Therefore, this embodiment designs a five-step collaborative qualitative coding process, which includes the following steps:
Step S610, framework cognition and coding: coders are assigned to independently code randomly selected accident cases based on the accident causation classification model.
[0072] Specifically, taking 96 cases of crane injury data as an example, after a detailed introduction to the LDA-HFACS-SRBE architecture and accident cause extraction method, three accident cause researchers were assigned to independently code 20 randomly selected accident cases, identifying and coding the causal factors in the descriptive text. This step aims to enable coders to fully understand and internalize the hierarchical structure and indicator definitions of the classification model, establishing a cognitive benchmark for subsequent large-scale coding.
[0073] Step S620, inter-coder reliability test: the independent coding results are compared, the proportion of discrepant items among all coded items is computed for each case and its percentage distribution is analyzed, and the mean discrepancy ratio of the codes is calculated.
[0074] Specifically, referring to Figure 7, the coding results of the three coders were compared, and the percentage distribution of the average number of discrepancies per case relative to the total number of coded items was analyzed. Experimental data showed that the average discrepancy ratio was 8.83%. This relatively low value indicates a high degree of consistency in the coders' understanding of the indicator system, but it also reveals misunderstandings of some items that require correction.
[0075] Step S630, consensus mechanism construction: workshops are organized for the coding entries with disagreements, and a consensus on coding rules is reached through the Delphi method.
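One way to compute a per-case discrepancy ratio and its mean is sketched below. The text does not give an exact formula, so this is only one possible reading: a coded item counts as a discrepancy when at least one coder assigned it but not all coders did, and the denominator is the number of distinct items assigned by any coder. The function names are hypothetical.

```python
def case_discrepancy_ratio(codings: list[set[str]]) -> float:
    """Share of coded items in one case on which the coders disagree.

    `codings` holds one set of indicator codes per coder (assumed layout).
    """
    coded_items = set().union(*codings)
    if not coded_items:
        return 0.0  # nothing was coded, so there is nothing to disagree on
    agreed = set.intersection(*codings)
    return len(coded_items - agreed) / len(coded_items)


def mean_discrepancy_ratio(cases: list[list[set[str]]]) -> float:
    """Average of the per-case discrepancy ratios."""
    if not cases:
        raise ValueError("at least one case is required")
    return sum(case_discrepancy_ratio(case) for case in cases) / len(cases)


cases = [
    [{"E21", "C12"}, {"E21", "C12"}, {"E21"}],  # coders disagree on C12
    [{"D33"}, {"D33"}, {"D33"}],                # full agreement
]
mean_ratio = mean_discrepancy_ratio(cases)
```

In this toy input the first case has one disputed item out of two and the second has none, so the mean ratio is 0.25.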
[0076] Specifically, dedicated workshops were organized for coded entries with differing opinions, and a consensus on coding rules was reached through the Delphi method. This process not only resolved the current disagreements but, more importantly, resulted in a standardized coding interpretation manual that clarified the boundaries of ambiguous concepts and effectively eliminated errors caused by subjective personal judgments.
[0077] Step S640: After revising the coding rules, all samples are coded independently. After reaching a consensus, multiple people independently code all accident cases, and the consistency of the coding results at each level is checked.
[0078] Specifically, after reaching a consensus, all 96 accident cases were independently coded by three individuals. After coding was completed, the consistency of the coding results at each level was checked. This embodiment uses Krippendorff's Alpha coefficient as the consistency evaluation index. The test results show that the evaluation results at all levels meet the accepted reliability standard (α ≥ 0.80), indicating that the constructed coding index system has high reliability and repeatability.
[0079] Step S650, standardized batch coding: after the consistency check passes, causal factor coding is performed independently on the remaining accident cases based on the consensus framework; if the check fails, steps S620 to S640 are repeated.
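For reference, Krippendorff's Alpha for nominal data can be computed from a coincidence matrix as sketched below. This is a generic standard-library illustration of the coefficient, not the embodiment's code; it assumes each unit is one (case, indicator) decision holding the 0/1 labels given by the coders.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units: list[list[int]]) -> float:
    """Krippendorff's Alpha for nominal labels; each unit lists the coders' values."""
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a unit coded by a single coder carries no agreement information
        for c, k in permutations(values, 2):  # ordered pairs of different coders
            coincidences[(c, k)] += 1 / (m - 1)
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        return float("nan")  # undefined when only one category was ever used
    return 1 - (n - 1) * observed / expected


perfect = krippendorff_alpha_nominal([[1, 1, 1], [0, 0, 0], [1, 1, 1]])
partial = krippendorff_alpha_nominal([[1, 1], [0, 0], [1, 0]])
```

The resulting value would be computed per level and compared with the 0.80 threshold.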
[0080] Specifically, based on the consensus framework, after the consistency check passes, the remaining incident cases are assigned to individual coders to independently perform causal factor coding. If the check fails, steps S620 to S640 are executed repeatedly until the consistency requirement is met. This closed-loop verification mechanism ensures that the coding quality does not decrease with the increase in data volume, guaranteeing the accuracy of the final dataset.
[0081] Furthermore, the coding of the causative factors follows these principles: each hazard factor mentioned in the accident description is mapped to a unique HFACS hierarchical category; each HFACS subcategory can be counted at most once in each accident.
[0082] Specifically, referring to Figure 8, through in-depth analysis of each crane-related accident case, the indicator system factors involved in the cases were systematically labeled: if an indicator appeared in the accident, it was labeled as 1; otherwise, it was labeled as 0. The first principle, "unique mapping", ensures the certainty of causative factor classification and avoids double counting or ambiguous attribution of the same factor at different levels. The second principle, "single-time inclusion", prevents a single high-frequency causative factor that occurs repeatedly within one accident from distorting the statistics, so that the statistical analysis reflects the true distribution of causes. Through the above process and principles, a structured SRBE chain accident injury causation dataset was generated, shown in Table 9. This dataset provides a high-quality data foundation for subsequent frequency statistics and association rule mining. Table 10 shows the frequency statistics of the SRBE chain accident injury causation indicators.
[0083] Table 9. SRBE chain accident injury causation dataset.
Table 10. Frequency statistics of injury causation indicators in SRBE chain accidents.
In one embodiment, the collaborative execution of the multi-level dynamic accident causation factor extraction framework includes the following time-ordered, closed-loop data steps.
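The 0/1 labeling and the "counted at most once" principle can be sketched as follows. The function names and the three-indicator column list are illustrative assumptions; a real dataset would use the full indicator system.

```python
def encode_case(found_codes: list[str], indicators: list[str]) -> list[int]:
    """Turn the indicator codes found in one accident into a 0/1 row.

    Repeated mentions of the same code collapse to a single 1.
    """
    present = set(found_codes)
    unknown = present - set(indicators)
    if unknown:
        raise ValueError(f"codes outside the indicator system: {sorted(unknown)}")
    return [1 if code in present else 0 for code in indicators]


def indicator_frequencies(rows: list[list[int]], indicators: list[str]) -> dict[str, int]:
    """Count in how many accidents each indicator appears."""
    return {code: sum(row[i] for row in rows) for i, code in enumerate(indicators)}


columns = ["C12", "D33", "E21"]  # shortened column list for the example
rows = [
    encode_case(["E21", "E21", "C12"], columns),  # E21 mentioned twice, counted once
    encode_case(["E21"], columns),
]
frequencies = indicator_frequencies(rows, columns)
```

The first row encodes to 1, 0, 1 even though E21 was mentioned twice, and the frequency table counts E21 in two accidents.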
[0084] Referring to Figure 4, this framework is not a simple parallel processing approach, but a four-stage progressive process that strictly follows temporal logic. The first stage extracts an initial set of potential causal topics and perplexity values based on the LDA topic model. This unsupervised learning process quickly locates potential points of interest in large volumes of text. The second stage uses the initial set of potential causal topics as knowledge injected into the large model's prompt and performs event chain and cause classification extraction. Injecting the LDA-mined topics into the prompt constrains the generation space of the large model and reduces hallucination. The third stage uses the event chain and cause classification results as the initial coding categories and performs grounded theory open coding. This stage uses the structured output of the large model to assist manual coding, which significantly lowers the difficulty of starting the coding process. The fourth stage extracts existing causes from the company's accident qualitative reports and verifies the completeness of the results of the first three stages, using the company's existing authoritative conclusions to check the extracted results in reverse. This progressive design refines the results layer by layer, from unsupervised mining to supervised verification, ensuring the depth and breadth of cause extraction.
[0085] Furthermore, when different methods extract causal factors from the same accident text and assign them to conflicting levels, an arbitration algorithm based on confidence weights is triggered.
[0086] Specifically, because the methods differ in underlying logic and data sources, the same causal factor may be classified as "organizational influence" by LDA but as "unsafe supervision" by grounded theory. In this case, the system introduces a confidence weight for arbitration. The confidence weight is determined dynamically by the confidence level of the data source corresponding to each method. The confidence levels comprise the output probability value of the large language model, the grounded theory coding consistency coefficient, and the LDA topic consistency index. Table 11 shows the consistency indices at each level of the LDA-HFACS-SRBE framework. For example, if the grounded theory coding consistency coefficient of a factor is 0.9 while the LDA topic consistency index is only 0.6, the system prioritizes the grounded theory classification. This arbitration mechanism resolves conflicts that arise when fusing multi-source heterogeneous data and ensures the logical consistency of the final dataset.
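The arbitration step can be illustrated with a minimal Python sketch. It assumes that each method's confidence has already been placed on a common 0 to 1 scale; the source does not state how the three confidence measures are normalized, and the function and field names here are hypothetical.

```python
def arbitrate(candidates):
    """Pick the classification backed by the highest confidence weight.

    candidates: list of (method, level, confidence) tuples.
    Assumption: confidences are comparable across methods (same 0..1 scale).
    On a tie, the first candidate listed wins (an arbitrary choice).
    """
    return max(candidates, key=lambda c: c[2])


# Values taken from the example in the text: grounded theory 0.9, LDA 0.6.
candidates = [
    ("grounded_theory", "unsafe supervision", 0.9),
    ("lda", "organizational influence", 0.6),
]
winner = arbitrate(candidates)
print(winner[0], "->", winner[1])
```

With the example values from the text, the grounded theory classification is kept, matching the described behavior.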
[0087] Table 11. Consistency indices at each level of the LDA-HFACS-SRBE framework
Furthermore, the final causal dataset verified in the fourth stage is fed back to the first stage as supervised training data for the LDA topic model. The hyperparameters α and β (the priors on the document-topic and topic-word distributions, respectively) are adjusted until the improvement rate of the topic consistency index between two adjacent iterations falls below a preset threshold, at which point the iteration is terminated.
[0088] Specifically, this embodiment constructs a closed-loop feedback mechanism. In traditional methods, the LDA model is fixed after training and struggles to adapt to new data. This embodiment feeds the subsequently validated data back into the LDA model. By adjusting the hyperparameters α (the prior on the document-topic distribution) and β (the prior on the topic-word distribution), the model identifies causal topics more accurately in the next iteration. When the improvement rate of the topic consistency index between two adjacent iterations falls below a preset threshold (e.g., 1%), the model is considered to have converged and the iteration is terminated. This mechanism lets the model evolve on its own and keep improving as data accumulates.
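The stopping rule can be sketched as follows. This is only an illustration: the text does not define the improvement rate, so the sketch assumes the relative change between adjacent iterations, and the score sequence is hypothetical.

```python
def stopping_iteration(scores, threshold=0.01):
    """Return the first iteration whose improvement rate is below threshold.

    scores: topic consistency index after each training iteration.
    Assumption: improvement rate = (current - previous) / |previous|.
    Returns None if the rate never drops below the threshold.
    """
    for t in range(1, len(scores)):
        rate = (scores[t] - scores[t - 1]) / abs(scores[t - 1])
        if rate < threshold:
            return t
    return None


# Hypothetical consistency scores over four feedback iterations.
scores = [0.40, 0.48, 0.52, 0.522]
stop_at = stopping_iteration(scores)  # rates: 20%, ~8.3%, ~0.4%
print(stop_at)
```

Dividing by the absolute value of the previous score keeps the sign of the rate meaningful for consistency measures that can be negative; a previous score of exactly zero would need separate handling.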
[0089] In another specific implementation, a bidirectional quantitative optimization mechanism is established between the second stage and the first stage. This mechanism aims to reconcile the parameter sensitivity of the LDA model with the uncertainty of prompt engineering for the large language model.
[0090] First, a dynamic optimization mechanism for the large language model prompt is constructed based on a perplexity threshold. In the first stage, a topic whose perplexity value exceeds a preset threshold is considered semantically ambiguous. This triggers a negative-constraint correction of the prompt template: exclusionary instructions are added to the prompt, such as "excluding descriptions unrelated to equipment malfunctions." A topic whose perplexity value is below the preset threshold is considered semantically clear. Its high-probability feature words are extracted and injected into the prior knowledge section of the prompt to help the large language model locate causes accurately. This strategy guides the prompt dynamically according to the quality of the LDA output.
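A minimal Python sketch of this prompt adjustment is shown below. The template wording, the function name, and the handling of a perplexity exactly equal to the threshold (treated here as "clear") are assumptions made for illustration.

```python
def build_prompt(base_prompt, topic_perplexity, threshold, top_words, exclusion):
    """Adjust an LLM prompt according to the perplexity of an LDA topic.

    Above the threshold: the topic is treated as ambiguous and an
    exclusionary instruction is appended (negative constraint).
    Otherwise: the topic's high-probability words are injected as prior
    knowledge. The line prefixes are hypothetical template text.
    """
    if topic_perplexity > threshold:
        return f"{base_prompt}\nConstraint: {exclusion}"
    return f"{base_prompt}\nTopic keywords: {', '.join(top_words)}"


base = "Extract the causal event chain from the accident description."
ambiguous = build_prompt(base, 950.0, 800.0, ["crane", "hook"],
                         "exclude descriptions unrelated to equipment malfunctions")
clear = build_prompt(base, 420.0, 800.0, ["crane", "hook"],
                     "exclude descriptions unrelated to equipment malfunctions")
print(ambiguous)
print(clear)
```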
[0091] Secondly, the preset number of topics is optimized based on an event chain complexity index. The average length L of the event chains extracted in the second stage and the causal correlation density D are computed. When L×D exceeds a preset threshold, the causal logic of the accidents is considered complex and the existing number of topics may be insufficient to cover it. The number of topics k is therefore increased for the first-stage LDA modeling, and the perplexity curve is recalculated to determine the new optimal number of topics. This strategy tunes the LDA model parameters in reverse, based on the extraction results of the large language model.
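The complexity check can be sketched in Python as follows. The text does not define how the causal correlation density D is computed, so the sketch takes it as an input; the increment of one topic per trigger and all example values are also assumptions.

```python
def average_chain_length(chains):
    """Mean number of events per extracted event chain (L in the text)."""
    return sum(len(chain) for chain in chains) / len(chains)


def propose_topic_count(chains, density, k, threshold, step=1):
    """Raise the preset topic count k when L x D exceeds the threshold.

    density: causal correlation density D, supplied by the caller.
    step: how much to increase k (hypothetical; not given in the source).
    """
    complexity = average_chain_length(chains) * density
    return k + step if complexity > threshold else k


chains = [
    ["vehicle movement", "personnel collision", "direct injury"],
    ["load swing", "personnel collision"],
]
new_k = propose_topic_count(chains, density=0.8, k=10, threshold=1.5)
print(new_k)  # L = 2.5, L x D = 2.0 > 1.5, so k is raised
```

After k is raised, the perplexity curve would be recomputed around the new value to pick the final number of topics, which this sketch does not cover.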
[0092] Finally, feature word fusion and conflict resolution are implemented. The Jaccard similarity coefficient between the first-stage LDA feature word set and the second-stage large language model keyword set is calculated. When the similarity is below a preset threshold, the intersection is used as the core feature words and the difference as the extended feature words, each assigned a different weight, to reconstruct the topic-word distribution. Through this bidirectional quantitative mechanism, the unsupervised topic model and the large language model are deeply coupled and complement each other.
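The fusion rule can be illustrated with the sketch below. Several points are assumptions: the "difference" is read as the symmetric difference of the two sets, the weights 1.0 and 0.5 are placeholders, and the word sets are invented for the example. Renormalizing the weights into a probability distribution is left out.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two word collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuse_feature_words(lda_words, llm_words, threshold,
                       core_weight=1.0, ext_weight=0.5):
    """Return word weights when the two sets disagree, else None.

    Core words (intersection) get core_weight; extended words
    (symmetric difference, an assumed reading) get ext_weight.
    """
    lda, llm = set(lda_words), set(llm_words)
    if jaccard(lda, llm) >= threshold:
        return None  # sets already agree; no reconstruction needed
    weights = {word: core_weight for word in lda & llm}
    weights.update({word: ext_weight for word in lda ^ llm})
    return weights


lda_words = ["crane", "hook", "sling", "signal"]
llm_words = ["crane", "sling", "blind spot"]
weights = fuse_feature_words(lda_words, llm_words, threshold=0.5)
print(jaccard(lda_words, llm_words), weights)
```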
[0093] In another specific implementation, the third stage includes an adaptive calibration step. Specifically, the grounded theory coding process often faces problems of strong subjectivity and standard drift, which this embodiment addresses through an adaptive calibration step.
[0094] First, the initial categories are corrected dynamically. Using the core feature word set, related sentences in the accident description text are selected by word vector similarity. Grounded theory open coding is applied only to these related sentences rather than to the entire text, which makes the coding considerably more targeted. When an initial category cannot be mapped to a preset HFACS subclass, the unmapped category is fed back to the first-stage LDA model as a new topic seed for local topic resampling, so that coding failures are handled in a closed loop.
[0095] Secondly, a process-based monitoring of inter-coder reliability is implemented. During the coding process, Krippendorff's Alpha coefficient is calculated in real time. When the Alpha coefficient of a predetermined number of consecutive incident cases (e.g., 5 consecutive cases) falls below a preset threshold (e.g., 0.8), a coding pause mechanism is automatically triggered. At this point, coders are organized to conduct a Delphi method discussion on the disputed coding entries. After supplementary coding rules are formed, the aforementioned cases are recoded retrospectively, and the grounded theory coding manual is updated. This mechanism transforms traditional post-event quality control into process monitoring, effectively preventing systematic biases in large-scale coding.
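The pause trigger can be sketched as below. The sketch assumes the per-case Krippendorff's Alpha values are computed elsewhere and passed in as a list; the function name and sample values are hypothetical, while the window of 5 and threshold of 0.8 come from the examples in the text.

```python
def should_pause(alpha_history, window=5, threshold=0.8):
    """True when the most recent `window` alpha values are all below threshold.

    alpha_history: Krippendorff's Alpha per coded case, oldest first.
    """
    if len(alpha_history) < window:
        return False
    return all(alpha < threshold for alpha in alpha_history[-window:])


# Hypothetical alpha values recorded while coding six consecutive cases.
history = [0.85, 0.79, 0.78, 0.75, 0.77, 0.76]
print(should_pause(history))  # the last five values are all below 0.8
```

A single value at or above the threshold inside the window resets the run, so isolated dips do not interrupt coding.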
[0096] Finally, the confidence calculation and resolution of HFACS hierarchical classification are established. For fuzzy causal factors that meet the definitions of two or more HFACS hierarchies, the posterior probability of their respective classification at each level is calculated, and the level corresponding to the maximum value is selected. For example, "insufficient training" may belong to "resource management" under organizational influence, or "inadequate supervision" under unsafe supervision. The system determines its optimal classification by calculating its conditional probability in historical data. When the difference between the maximum and second-largest posterior probabilities is less than a preset threshold, it indicates that the factor has cross-level attributes. In this case, a single classification is not forced; instead, the factor is marked as a cross-level factor, and a mapping link from this factor to multiple levels is established in the accident causation classification model, assigning different weight coefficients in the final dataset. This approach retains the rigor of classification while taking into account the complexity of accident causes, avoiding information loss.
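A minimal sketch of the margin test follows. The posterior values are invented, and two details are assumptions: a cross-level factor keeps its top two levels, and the weight coefficients are those two posteriors renormalized to sum to 1. The text only states that different weights are assigned.

```python
def classify_level(posteriors, margin):
    """Assign a factor to one HFACS level or mark it as cross-level.

    posteriors: dict mapping level name to posterior probability
    (at least two levels, as the text assumes for fuzzy factors).
    margin: minimum gap between the top two posteriors for a single label.
    """
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    (top, p1), (second, p2) = ranked[0], ranked[1]
    if p1 - p2 < margin:
        total = p1 + p2
        return {"cross_level": True,
                "weights": {top: p1 / total, second: p2 / total}}
    return {"cross_level": False, "weights": {top: 1.0}}


# Hypothetical posteriors for the "insufficient training" factor.
result = classify_level(
    {"organizational influence": 0.48,
     "unsafe supervision": 0.44,
     "preconditions for unsafe behavior": 0.08},
    margin=0.1,
)
print(result)
```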
[0097] To verify the practical application effect of the ship repair and construction accident causation analysis method provided by this invention, this embodiment selects a typical accident of the vehicle injury type for detailed explanation. This case demonstrates the entire process from obtaining the accident description text to generating the final causation dataset, verifying the applicability and accuracy of the LDA-HFACS-SRBE model in handling complex cascading accidents.
[0098] Specifically, the original description of the accident is as follows: "On the morning of January 17, 2017, Wang, a truck driver in the transportation section of the shipping workshop, was driving a Liaoning B-plated truck to the warehouse rented by the company to transport pipeline accessories according to the production plan. At around 1:05 p.m., Wang started the engine of the truck, which was parked on the left side of the road, and drove to the right, preparing to load goods at the warehouse. He knocked down Zhang, a distribution worker from the engine workshop, who was squatting on the ground to the right of the cab filling out a document, and Zhang's foot was crushed. After hearing the victim's shouts, Wang immediately stopped and reversed the truck. The people at the scene immediately called the police, and the victim was taken to the hospital by ambulance. The doctor's diagnosis was: laceration of the skin of the right foot, fracture of the second metatarsal bone of the right foot, incomplete fracture of the right medial malleolus, and fracture of the proximal phalanx of the middle toe of the left foot."
First, the accident description text was acquired and preprocessed. The system automatically identified and removed the specific license plate and workshop names, such as "Liaoning B truck," "shipping workshop," and "engine workshop," from the original text, and masked the names "Wang" and "Zhang" to anonymize the data. Then, the intelligent recognition framework based on a large language model was applied to the text. Based on the SRBE domain dictionary and the chain accident determination rules, the large model extracted the causal event chain: "Truck starts and moves to the right (vehicle movement) → knocks down the distribution worker (personnel collision) → foot is crushed (direct injury)." Because this chain contains three related events, satisfying the requirement of at least two, and its final event results in personal injury, the system labeled the accident as 1 (chain accident injury) and classified it as a vehicle injury type.
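The chain accident determination rule applied here can be written as a short Python sketch. The function name and data layout are hypothetical; the rule itself (at least two related events, with the final event causing injury or death) follows the rule stated in the claims.

```python
def label_chain_accident(events, ends_in_injury):
    """Return 1 for a chain accident injury, otherwise 0.

    events: ordered list of related events in the causal chain.
    ends_in_injury: whether the final event caused personal injury or death.
    """
    return 1 if len(events) >= 2 and ends_in_injury else 0


# The event chain extracted for the vehicle injury case above.
chain = ["vehicle movement", "personnel collision", "direct injury"]
label = label_chain_accident(chain, ends_in_injury=True)
print(label)
```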
[0099] Secondly, based on the LDA-HFACS-SRBE accident injury causation index system, the causative factors of this accident were explored in depth and mapped. The analysis followed the principles of "unique mapping" and "single-time inclusion," tracing back layer by layer from the direct trigger of the accident. At the level of unsafe human behavior, driver Wang failed to fully check the right blind spot when starting the vehicle, which constitutes a decision-making and perceptual error (E12). The distribution worker Zhang remained in the danger zone after hearing the vehicle start, demonstrating an unsafe attitude (D22). At the level of preconditions for unsafe behavior, in terms of technical equipment, the warehouse loading and unloading area had neither clear signs separating people and vehicles nor physical isolation measures, corresponding to lack or defects of protective safety signals and other devices (D31). In terms of personnel status, the driver was not focused after transitioning from rest to work and the distribution worker showed insufficient safety awareness, both of which were mapped to poor psychological and physiological state (D21). In terms of environment, the contractor's management of the work environment had loopholes, corresponding to the contractor environment (D13).
[0100] Tracing further back to the unsafe supervision level, although the shipping workshop organized relevant training, the training contained no specific clauses addressing this kind of accident, indicating inadequate training and a lack of enforcement supervision, corresponding to a lack of effective supervision and training (C11). On-site inspections and guidance were also lacking, corresponding to a lack of effective on-site inspections and guidance (C12). At the organizational influence level, the engine workshop auxiliary section failed to provide daily violation inspection records, corresponding to a failure to strictly implement information resource management requirements (B22). The lack of pre-shift safety briefings indicates inadequate communication and coordination among work teams, corresponding to a lack of effective communication and coordination (B52). The omission of key clauses from the operating standards corresponds to incomplete systems, plans, and procedures (B31). Daily inspections that were merely a formality show weak enforcement of safety management, corresponding to loopholes in the implementation of systems, plans, and procedures (B32). The management team's tendency to prioritize efficiency over safety indicates a deficiency in safety culture, corresponding to culture (B13). Since the accident investigation did not mention emergency-related content, no causal factors were classified at the emergency response level.
[0101] Table 12. Extraction results of injury causes for a typical vehicle accident
Finally, based on the above analysis, the system generates a causative factor coding record for the accident. Referring to Table 12, the causative factors of this accident at each level of the LDA-HFACS-SRBE model are labeled as follows: the organizational influence level is labeled B13, B22, B31, B32, and B52; the unsafe supervision level is labeled C11 and C12; the preconditions for unsafe behavior level is labeled D13, D21, D22, and D31; and the unsafe human behavior level is labeled E12. These labels are stored in the accident cause dataset in binary form, where "1" represents the presence of a causative factor and "0" its absence. This application example demonstrates that the method provided by this invention can systematically and accurately transform unstructured accident description text into structured cause data. It identifies not only direct unsafe human behavior but also the underlying causes at the organizational management level, verifying the effectiveness and practicality of the method for the cause analysis of ship repair and construction accidents.
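The binary storage of this record can be sketched in Python. The twelve labels are those listed above; the column order and the two extra columns (B11, E11) are hypothetical placeholders added only to show absent indicators, since the full index system is not reproduced in this passage.

```python
def encode_case(labels, columns):
    """Encode one accident's indicator codes as a 0/1 row.

    Using a set applies "single-time inclusion": a code repeated in the
    labels still yields a single 1. Unknown codes raise an error so that
    every factor maps to exactly one known column.
    """
    present = set(labels)
    unknown = present - set(columns)
    if unknown:
        raise ValueError(f"codes not in the index system: {sorted(unknown)}")
    return [1 if code in present else 0 for code in columns]


# Hypothetical column subset; B11 and E11 are illustrative absent indicators.
columns = ["B11", "B13", "B22", "B31", "B32", "B52",
           "C11", "C12", "D13", "D21", "D22", "D31", "E11", "E12"]
case_labels = ["B13", "B22", "B31", "B32", "B52",
               "C11", "C12", "D13", "D21", "D22", "D31", "E12"]
row = encode_case(case_labels, columns)
print(dict(zip(columns, row)))
```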
[0102] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0103] The above-described embodiments are merely illustrative of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the appended claims.
Claims
1. A method for analyzing the causes of ship repair and construction accidents, characterized in that the method comprises:
obtaining accident description text data from the accident database of a ship repair and construction enterprise, and performing desensitization and chunking on the data;
constructing an intelligent recognition framework based on a large language model to identify the processed data chunks, so as to extract the causal event chain of each accident and label the accident according to preset chain accident determination rules;
constructing an accident cause classification query mechanism, which calculates text string similarity based on the Levenshtein distance to realize standardized mapping and classification query of accident causes;
constructing an LDA topic model from the labeled accident text data, determining the optimal number of topics by calculating perplexity, and extracting the feature words under each topic;
based on a multi-level accident causation factor extraction framework, collaboratively performing text-based causation coding and mapping, cause extraction based on existing qualitative reports, cause classification based on the event chains of the large language model, and new cause discovery based on LDA topic feature words;
based on the collaborative extraction results, constructing an accident causation classification model comprising six levels: social factors, organizational influence, unsafe supervision, preconditions for unsafe behavior, unsafe human behavior, and emergency response; and
adopting a collaborative qualitative coding process to systematically extract and encode the causal factors from the accident description text based on the accident causation classification model, generating an accident causation dataset, and performing statistical analysis.
2. The method for analyzing the causes of ship repair and construction accidents according to claim 1, characterized in that the construction of the intelligent recognition framework based on a large language model includes:
constructing an SRBE domain dictionary based on existing dictionaries in the fields of shipbuilding engineering, mechanical engineering, safety engineering, transportation, electrical engineering, management science, and building decoration, together with a corpus built from detailed descriptions of SRBE accidents;
removing special symbols, unit identifiers, and irrelevant information from the corpus, and filtering function words and stop words by combining the Modern Chinese Function Word List and the Harbin Institute of Technology Stop Word List;
comparing the verb segmentation frequency of the jieba, HanLP, THULAC, SnowNLP, and LTP Chinese word segmentation tools, and selecting the tool with the best verb segmentation frequency as the benchmark model; and
compiling standard verbs from the verbs that describe the manner of injury in the accident classification standards, and expanding them with industry-specific verbs, to construct an SRBE accident injury manner verb mapping table.
3. The method for analyzing the causes of ship repair and construction accidents according to claim 1, characterized in that the preset chain accident determination rules include: the accident description contains at least two related events; and the event at the end of the causal chain causes personal injury or death; an accident text that meets these rules is labeled 1, otherwise it is labeled 0.
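4. The method for analyzing the causes of ship repair and construction accidents according to claim 1, characterized in that the accident cause classification query mechanism calculates text string similarity using the Levenshtein distance, which is calculated as follows:

$$\mathrm{lev}_{a,b}(i,j)=\begin{cases}\max(i,j), & \min(i,j)=0\\[4pt] \min\begin{cases}\mathrm{lev}_{a,b}(i-1,j)+1\\ \mathrm{lev}_{a,b}(i,j-1)+1\\ \mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\end{cases}, & \text{otherwise}\end{cases}$$

wherein $\mathrm{lev}_{a,b}(i,j)$ denotes the minimum edit distance; $a_i$ and $b_j$ denote the $i$-th and $j$-th elements of the strings $a$ and $b$; and $1_{(a_i\neq b_j)}$ is an indicator function that is 1 when $a_i\neq b_j$ and 0 otherwise. Based on the minimum edit distance, the string similarity is calculated as:

$$\mathrm{sim}(a,b)=1-\frac{\mathrm{lev}_{a,b}(|a|,|b|)}{\max(|a|,|b|)}$$

where $|a|$ and $|b|$ are the lengths of the strings $a$ and $b$, respectively.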
5. The method for analyzing the causes of ship repair and construction accidents according to claim 1, characterized in that the accident cause classification query mechanism enables rapid classification of risk factors based on accident description text, as well as standardized coding query based on existing accident causes; the rapid classification divides direct causes into three levels, namely major category, minor category, and sub-category, and performs standardized mapping of accident causes using exact matching and fuzzy matching algorithms.
6. The method for analyzing the causes of ship repair and construction accidents according to claim 1, characterized in that the optimal number of topics is determined by calculating perplexity, where perplexity is calculated as:

$$\mathrm{perplexity}(D_{\mathrm{test}})=\exp\left(-\frac{\sum_{d=1}^{M}\log p(w_d)}{\sum_{d=1}^{M}N_d}\right)$$

where $D_{\mathrm{test}}$ represents the test set corpus, $p(w_d)$ is the probability of generating the terms of document $d$, $N_d$ is the total number of terms in document $d$, and $M$ is the total number of documents.
7. The method for analyzing the causes of ship repair and construction accidents according to claim 1, characterized in that the four methods executed collaboratively in the multi-level accident causation factor extraction framework are:
grounded theory and HFACS model analysis based on detailed accident description texts;
extraction of existing causes based on the enterprise's accident qualitative reports;
event chain and cause classification extraction based on the large language model; and
new factor discovery based on the LDA topic model and feature words.
8. The method for analyzing the causes of ship repair and construction accidents according to claim 1, characterized in that the indicators at each level of the accident causation classification model include:
the social factors level includes government oversight and the social environment;
the organizational influence level includes structure, policies, and culture under organizational atmosphere; human resources, information, funds, and equipment under resource management; incomplete systems, plans, and procedures, and loopholes in the implementation of systems, plans, and procedures, under organizational processes; changes or imperfections in process flow, job changes or temporary adjustments, and overlapping or changes in construction scope under change management; and lack of leadership and lack of effective communication and coordination under communication and coordination;
the unsafe supervision level includes lack of effective supervision and training and lack of effective on-site inspections and guidance under insufficient supervision; unreasonable labor organization and overcapacity production under inappropriate operational plans; inadequate rectification of hidden dangers and risks and failure to identify risky employees under failure to correct known problems; and illegal command and allowing unqualified personnel to enter under supervisory violations;
the preconditions for unsafe behavior level includes the working environment, natural environment, and contractor environment under environmental factors; poor psychological and physiological state, unsafe attitudes, and low level of skills and knowledge under worker state; and lack or defects of protective safety signals and other devices, defects of equipment, facilities, tools, and accessories, and lack or defects of PPE under technical equipment;
the unsafe human behavior level includes skill errors and decision-making and perceptual errors under errors; habitual violations under violations; and working while emotionally unstable and deliberate sabotage under destructive behavior; and
the emergency response level includes failure to respond in a timely manner and failure to respond properly.
9. The method for analyzing the causes of ship repair and construction accidents according to claim 1, characterized in that the collaborative qualitative coding process includes:
framework cognition and coding: assigning coders to independently code randomly selected accident cases based on the accident causation classification model;
inter-coder reliability test: comparing the independent coding results, analyzing the percentage distribution of the average number of discrepancies per case relative to the total number of coded items, and calculating the mean of the average coding discrepancy ratio;
consensus mechanism construction: organizing workshops on the coded items with disagreements and reaching a consensus on coding rules through the Delphi method;
independent coding of all samples after revision of the coding rules: after consensus is reached, multiple coders independently code all sample accident cases, and the consistency of the coding results at each level is checked; and
standardized batch coding: if the consistency check passes, the causal factor coding of the remaining accident cases is performed independently based on the consensus framework; if it fails, the above three steps are repeated.
10. The method for analyzing the causes of ship repair and construction accidents according to claim 9, characterized in that the coding of the causative factors follows these principles: each hazard factor mentioned in the accident description is mapped to a unique HFACS hierarchical category; and each HFACS subcategory is counted at most once per accident.
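As an illustration of the two quantities recited in claims 4 and 6, the following Python sketch computes a Levenshtein-based string similarity and a corpus perplexity. It is a sketch under stated assumptions rather than the claimed implementation: normalizing the edit distance by the length of the longer string is an assumed choice, the function names are hypothetical, and the per-document log-likelihoods are taken as inputs instead of being produced by a trained LDA model.

```python
import math


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1   # indicator: 1 when characters differ
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]; assumed normalization by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum of log p(w_d)) / (sum of N_d)) over the test documents."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))


print(levenshtein("kitten", "sitting"), similarity("kitten", "sitting"))
# A model that spreads probability evenly over 4 words has perplexity 4.
print(perplexity([3 * math.log(0.25), 5 * math.log(0.25)], [3, 5]))
```

In a fuzzy-matching lookup, a cause description would be compared against each standard cause name and mapped to the best-scoring entry above some cutoff; the cutoff value is not specified in the text.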