Elevator picture text generation method, system, device and medium based on semantic recognition and image segmentation large model
By using a large model based on semantic recognition and image segmentation, automated segmentation and text generation of elevator components were achieved, solving the problems of low detection efficiency and query difficulties caused by manual operation in elevator inspection, and generating standardized and traceable elevator status documents.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SICHUAN SPECIAL EQUIP INSPECTION & RES INST
- Filing Date
- 2025-07-31
- Publication Date
- 2026-06-26
AI Technical Summary
In current elevator inspection, traditional visual data post-processing relies on manual intervention, resulting in low inspection timeliness, visual semantic understanding bias, and low database utilization, making it impossible to accurately query the status of elevator components.
A method based on semantic recognition and image segmentation large model is adopted. The elevator component segmentation is performed by pre-trained U-Net model, and feature fusion and text generation are performed by combining ResNet-50 and Q-Former models. The text description is optimized by multi-source calibration mechanism to generate standardized elevator status documents.
It achieves automated conversion of elevator images into standardized documents, improving detection efficiency and document generation accuracy, solving the problems of subjective bias in manual interpretation and unstructured data retrieval, and supporting efficient querying of elevator component status.
Figure CN120894636B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of elevator image processing technology, specifically to elevator image-to-text methods, systems, devices, and media based on a large model of semantic recognition and image segmentation.
Background Technology
[0002] As a crucial link in the special equipment safety supervision system, the real-time monitoring and full life-cycle management of elevators are directly related to public safety. To meet the regulatory requirements for traceability and objectivity in the testing process, the industry generally uses intelligent terminals or dedicated imaging equipment to visually collect data on key testing components. This provides an intuitive visual representation of elevator component status, testing operation trajectories, and other information, forming an important basis for equipment safety assessment.
[0003] In existing elevator inspection scenarios, the post-processing of traditional visual data still relies on manual intervention. The semantic gap between unstructured visual data and structured reports forces inspectors to extract component state features through frame-level parsing and manually transcribe them into standardized descriptions. First, this redundant process significantly reduces inspection timeliness. Second, differences in visual semantic understanding between inspectors result in a lack of objective consistency in the description of key abnormal states such as excessive door gaps and localized guide rail deformation. Finally, the disconnect between low-level visual features and high-level semantics means that the database only supports basic searches based on timestamps and device IDs, failing to achieve precise queries based on core semantic information such as component type and fault characteristics, leading to low database utilization.
Summary of the Invention
[0004] To achieve automatic semantic parsing and structured report generation of elevator images, this invention provides a method, system, device, and medium for elevator image-to-text generation based on a large-scale model of semantic recognition and image segmentation. The specific technical solution adopted is as follows:
[0005] The first aspect of the present invention provides an elevator image-to-text method based on a large model of semantic recognition and image segmentation, the method comprising:
[0006] Acquire the original image of the elevator inspection scene and perform preprocessing;
[0007] The pre-processed original image is processed based on a pre-trained semantic segmentation model to generate a segmentation mask image of the elevator components.
[0008] The original image and the segmentation mask image are respectively input into the image encoder for feature extraction, and the extracted features are fused to generate cross-modal fusion features;
[0009] Based on the cross-modal fusion features, an initial text description is generated by parsing the elevator component association information using a text generation model.
[0010] The initial text description is optimized using a multi-source calibration mechanism to generate a standardized elevator status document containing structured metadata.
[0011] Further, the pre-processed original image is processed based on a pre-trained semantic segmentation model to generate a segmentation mask map of the elevator components, including:
[0012] Construct a multi-category segmentation standard dataset for elevator components, and train a semantic segmentation model based on the standard dataset;
[0013] The encoder performs multi-level downsampling on the preprocessed original image to extract high-level semantic features;
[0014] The high-level semantic features are upsampled by the decoder and then concatenated with the features of the corresponding layer of the encoder.
[0015] Based on the category probability distribution of the output layer, a multi-class mask image is generated by threshold determination.
[0016] Further, the original image and the segmentation mask image are respectively input into an image encoder for feature extraction, and the extracted features are fused to generate cross-modal fusion features, including:
[0017] The first encoder is used to extract global environmental feature vectors from the original image;
[0018] The second encoder is used to extract the spatial positioning feature vectors of elevator components from the segmentation mask image;
[0019] The global environment feature vector is added to the spatial positioning feature vector of the elevator components to generate cross-modal fusion features.
[0020] Furthermore, based on the aforementioned cross-modal fusion features, an initial text description is generated by parsing the elevator component association information using a text generation model, including:
[0021] Based on cross-modal fusion features, spatial topological vectors are parsed using the self-attention mechanism of Transformer;
[0022] Based on spatial topology vectors and elevator component knowledge base, a text sequence containing component names, location descriptions, and anomaly markers is generated.
[0023] Furthermore, the initial text description is optimized using a multi-source calibration mechanism to generate a standardized elevator status document containing structured metadata, including:
[0024] Convert the segmentation mask image into a sequence of component location text;
[0025] Introduce predefined elevator component labels as calibration benchmarks;
[0026] Calculate the difference in edit distance between the component location text sequence and the predefined elevator component labels;
[0027] Calculate the semantic similarity between the component location text sequence and the initial text description;
[0028] The component location description is corrected based on the difference in edit distance and semantic similarity.
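The edit-distance part of this calibration can be illustrated with a minimal sketch. It assumes the standard Levenshtein distance over single characters and a hypothetical helper, `nearest_label`, that picks the closest predefined label for a text fragment; the semantic-similarity term and the actual correction rule are not specified here and are left out.

```python
def edit_distance(source, target):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `source` into `target` (Levenshtein)."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def nearest_label(fragment, labels):
    """Hypothetical helper: return the predefined label closest to
    `fragment`, together with its edit distance."""
    best = min(labels, key=lambda label: edit_distance(fragment, label))
    return best, edit_distance(fragment, best)
```

For instance, a misspelled fragment such as "car dor" would be matched to the label "car door" at distance 1, which could then serve as the corrected component name.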
[0029] Furthermore, the expression for the loss function of the multi-source calibration mechanism is:
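[0030] $$\mathcal{L}_{\mathrm{cal}} = \alpha\,\mathcal{L}_{\mathrm{sem}}(T_{m}, T_{g}) + \mathcal{L}_{\mathrm{sem}}(T_{l}, T_{g}) + \beta\,\mathcal{L}_{\mathrm{str}}(T_{m}, T_{l})$$
[0031] In the formula, $\mathcal{L}_{\mathrm{cal}}$ denotes the closed-loop calibration loss; $\alpha$ denotes the semantic alignment weight coefficient between the component location text sequence $T_{m}$ and the initial text description $T_{g}$; $\mathcal{L}_{\mathrm{sem}}(T_{m}, T_{g})$ denotes the semantic difference loss between the component location text sequence $T_{m}$ and the initial text description $T_{g}$; $\mathcal{L}_{\mathrm{sem}}(T_{l}, T_{g})$ denotes the semantic difference loss between the predefined elevator component labels $T_{l}$ and the initial text description $T_{g}$; $\beta$ denotes the structural alignment weight coefficient between the component location text sequence $T_{m}$ and the predefined elevator component labels $T_{l}$; and $\mathcal{L}_{\mathrm{str}}(T_{m}, T_{l})$ denotes the text structure difference loss between the component location text sequence $T_{m}$ and the predefined elevator component labels $T_{l}$.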
[0032] Furthermore, the standardized elevator status document includes:
[0033] Component status descriptions partitioned by car system, door system, and traction system;
[0034] Metadata including the elevator registration code, detection timestamp, and shaft location coordinates;
[0035] Storage in JSON format, with a component type index field established.
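One possible shape for such a document is sketched below. Every field name (`registration_code`, `detection_timestamp`, `systems`, `component_type_index`, and so on) is a hypothetical choice made for illustration, as is the decision to attach shaft coordinates to each component entry; the text only fixes the three ingredients (partitioned status descriptions, bound metadata, and a component type index stored as JSON).

```python
import json


def build_status_document(registration_code, timestamp, components):
    """Assemble a status document as a plain dict (hypothetical schema)."""
    document = {
        "metadata": {
            "registration_code": registration_code,
            "detection_timestamp": timestamp,
        },
        "systems": {},               # status descriptions partitioned by system
        "component_type_index": {},  # component type -> where its entries live
    }
    for item in components:
        entries = document["systems"].setdefault(item["system"], [])
        entries.append({
            "component_type": item["component_type"],
            "status": item["status"],
            "shaft_coordinates": item["shaft_coordinates"],
        })
        index = document["component_type_index"]
        index.setdefault(item["component_type"], []).append(
            {"system": item["system"], "position": len(entries) - 1}
        )
    return document


def to_json(document):
    """Serialize the document for storage."""
    return json.dumps(document, ensure_ascii=False, indent=2)
```

With an index of this kind, a query by component type can jump straight to the matching entries instead of scanning every system partition.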
[0036] The second aspect of the present invention provides an elevator image-to-text system based on a large-scale model of semantic recognition and image segmentation, employing the elevator image-to-text method based on a large-scale model of semantic recognition and image segmentation described in the first aspect of the present invention. The system includes:
[0037] The image acquisition module is configured to acquire the original image of the elevator detection scene and perform preprocessing.
[0038] The segmentation module is configured to process the preprocessed original image based on a pre-trained semantic segmentation model to generate a segmentation mask image of the elevator components.
[0039] The feature fusion module is configured to input the original image and the segmentation mask image into the image encoder for feature extraction, and then fuse the extracted features to generate cross-modal fusion features.
[0040] The text generation module is configured to generate an initial text description by parsing the elevator component association information based on the cross-modal fusion features and using a text generation model.
[0041] The calibration output module is configured to optimize the initial text description using a multi-source calibration mechanism to generate a standardized elevator status document containing structured metadata.
[0042] The third aspect of the present invention provides an electronic device, the electronic device comprising: a processor and a memory communicatively connected to the processor; wherein the memory stores instructions executable by the processor, the instructions being executed by the processor to enable the processor to perform the steps of the elevator image-to-text method based on a large model of semantic recognition and image segmentation as described in the first aspect of the present invention.
[0043] The fourth aspect of the present invention provides a computer-readable storage medium storing a program for implementing an elevator image-to-text method based on a large model of semantic recognition and image segmentation. The program for implementing the elevator image-to-text method based on a large model of semantic recognition and image segmentation is executed by a processor to implement the steps of the elevator image-to-text method based on a large model of semantic recognition and image segmentation as described in the first aspect of the present invention.
[0044] The present invention has the following beneficial effects:
[0045] The elevator image-to-text method provided by this invention, based on a large model of semantic recognition and image segmentation, achieves automated conversion of elevator images into standardized documents in the scenario of automatic generation of elevator status documents. It achieves accurate localization of key elevator components based on a pre-trained semantic segmentation model, and generates standardized descriptions by parsing component topological relationships through cross-modal feature fusion combined with the Transformer architecture. Furthermore, it utilizes a multi-source calibration mechanism to optimize the segmentation mask, predefined labels, and generated text through triangular constraints, solving the subjective bias of human interpretation and generating machine-parsable structured documents with semantic indexing capabilities, thus avoiding the problem of unstructured visual data being unable to be deeply retrieved.
Attached Figure Description
[0046] To more clearly illustrate the technical solutions and advantages in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0047] Figure 1 is a flowchart illustrating an elevator image-to-text method, system, device, and medium based on a large model of semantic recognition and image segmentation, provided in an embodiment of the present invention;
[0048] Figure 2 is a schematic diagram of the model architecture of the elevator image-to-text method based on a large model of semantic recognition and image segmentation provided in an embodiment of the present invention;
[0049] Figure 3 is a schematic diagram of the elevator image-to-text method, system, device, and medium based on a large model of semantic recognition and image segmentation provided in an embodiment of the present invention.
Detailed Implementation
[0050] To further illustrate the technical means and effects adopted by the present invention to achieve its intended purpose, the following, in conjunction with the accompanying drawings and preferred embodiments, details the specific implementation, structure, features, and effects of an elevator image-to-text method, system, device, and medium based on a large model of semantic recognition and image segmentation proposed according to the present invention. In the following description, different "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, specific features, structures, or characteristics in one or more embodiments can be combined in any suitable form.
[0051] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.
[0052] The following description, in conjunction with the accompanying drawings, details a specific scheme for an elevator image-to-text method, system, device, and medium based on a large model of semantic recognition and image segmentation provided by this invention.
[0053] Referring to Figure 1, which shows a flowchart of an elevator image-to-text method, system, device, and medium based on a large-scale model of semantic recognition and image segmentation according to an embodiment of the present invention, the method includes:
[0054] Step S100: Acquire and preprocess the original images of the elevator inspection scenario. Specifically, in the elevator status inspection scenario, image acquisition equipment, such as a fixed camera inside the elevator or a handheld shooting device, captures high-definition images of key areas such as the elevator car interior, shaft, and machine room at predetermined inspection time nodes. The acquired images are transmitted to the system server via a wired or wireless network. After receiving the images, the server first converts the various original image formats, such as JPEG and PNG, into a system-compatible standard format. The same preprocessing operations applied to the U-Net model training data are then performed, including: resizing, in which the image and its corresponding mask are uniformly scaled to 512×512 pixels using bicubic interpolation, and images whose aspect ratio does not match the target size are zero-padded at the edges so that the image is not deformed during scaling and its content remains intact; and normalization, in which the image pixel values are mapped to the [0,1] interval to reduce the differences in brightness and contrast between images, enabling the model to learn image features more stably and converge faster during training.
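The resizing geometry and normalization described in step S100 can be sketched as follows. This is only an illustration under stated assumptions: the padding is split evenly on both sides (the text only says zeros are padded at the edges), pixel values are assumed to be 8-bit, and the function names are hypothetical. The bicubic interpolation itself would be delegated to an imaging library and is not shown.

```python
def letterbox_geometry(width, height, target=512):
    """Return (new_width, new_height, pad_left, pad_top) for an
    aspect-preserving resize into a target x target canvas.

    The longer side is scaled to `target`; the shorter side is padded
    with zeros. Centering the image on the canvas is an assumption.
    """
    longest = max(width, height)
    new_width = max(1, round(width * target / longest))
    new_height = max(1, round(height * target / longest))
    pad_left = (target - new_width) // 2
    pad_top = (target - new_height) // 2
    return new_width, new_height, pad_left, pad_top


def normalize_pixels(values):
    """Map 8-bit pixel values (0..255, an assumption) to the [0, 1] interval."""
    return [value / 255.0 for value in values]
```

For a 1920×1080 frame this gives a 512×288 scaled image with 112 rows of zero padding above it, so the content keeps its original proportions.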
[0055] Step S200: Process the pre-processed original image based on the pre-trained semantic segmentation model to generate a segmentation mask map of the elevator components;
[0056] Step S200 specifically includes:
[0057] Step S210: Construct a multi-class segmentation standard dataset for elevator components, and train a semantic segmentation model based on the standard dataset. Specifically, acquire images and corresponding masks of different elevator models, and label key components such as the car and guide rails; then divide the dataset into training, validation, and test sets in an 8:1:1 ratio. Preprocess the images by uniformly scaling them to 512×512 pixels, normalizing them to the [0,1] interval, and enhancing data diversity through rotation and flipping. Referring to Figure 2, this embodiment uses the U-Net architecture as the segmentation model, which includes an encoder and a decoder. The encoder extracts semantic features through convolution and pooling; the decoder restores spatial resolution through deconvolution and concatenates the result with the corresponding layer features of the encoder; the output layer uses a 1×1 convolution to generate a component category probability map.
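The 8:1:1 division of the dataset can be sketched as below. The shuffling, the fixed seed, and the rule that any rounding remainder goes to the test set are assumptions made for reproducibility of the example; the text only specifies the ratio.

```python
import random


def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle `samples` and split them into (train, validation, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remainder goes to the test set
    return train, validation, test
```

Each sample here would typically be an (image, mask) pair, so that an image and its annotation always land in the same subset.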
[0058] Step S220: The preprocessed original image is downsampled at multiple levels by the encoder to extract high-level semantic features. Specifically, the preprocessed 512×512×3 image is input into the encoder of the pre-trained U-Net, and feature extraction is completed sequentially through four downsampling modules: each module first extracts features from the image through two 3×3 convolutional layers. The feature map after the convolution operation is normalized by the BatchNorm layer to reduce internal covariate shift, and then nonlinearity is introduced through the ReLU activation function to enhance the expressive power of the model. Subsequently, a 2×2 max pooling layer with a stride of 2 halves the feature map size, and the number of channels doubles from one module to the next, gradually compressing the spatial information of the image to extract high-level semantic features, including the global category and spatial association of elevator components.
[0059] Step S230: Upsample the high-level semantic features using the decoder and concatenate them with the features of the corresponding layer of the encoder; specifically, the 32×32×512 high-level features output by the encoder are fed into the decoder, and the spatial resolution is gradually restored through four upsampling modules:
[0060] The first upsampling module: The input features are upsampled through a 2×2 deconvolutional layer with 256 kernels and a stride of 2, outputting a 64×64×256 feature map; it is then concatenated with the 64×64×256 feature map output from the third downsampling module of the encoder via skip connections in the channel dimension, fusing high-level semantics and mid-level detailed features; the features are then refined through two 3×3 convolutional layers, outputting a 64×64×256 feature map;
[0061] Upsampling modules 2 through 4 have the same structure as module 1, but the number of deconvolution kernels is halved sequentially. After upsampling, their dimensions are 128×128×128, 256×256×64, and 512×512×32, respectively. These are concatenated and fused with features from layers 2, 1, and 0 of the encoder—that is, the features of the original image after initial convolution. Skip connections are used to pass low-level details such as edges and textures preserved by the encoder. Combined with semantic information from the decoder, this allows the model to accurately locate part contours while restoring spatial resolution.
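The feature-map sizes quoted in steps S220 and S230 can be cross-checked with a short shape trace. The sketch assumes the initial convolution (encoder layer 0) produces 32 channels, which is inferred from the 32×32×512 bottleneck and the 512×512×32 decoder output rather than stated directly; `unet_shapes` is a hypothetical helper that tracks shapes only and performs no convolution.

```python
def unet_shapes(size=512, base_channels=32, depth=4):
    """Trace (height, width, channels) through the encoder and decoder."""
    # Layer 0: the original image after the initial convolution.
    encoder = [(size, size, base_channels)]
    for _ in range(depth):
        h, w, c = encoder[-1]
        # Each downsampling module halves the size and doubles the channels.
        encoder.append((h // 2, w // 2, c * 2))

    decoder = []
    h, w, c = encoder[-1]  # bottleneck fed into the decoder
    for level in range(depth - 1, -1, -1):
        # Stride-2 deconvolution: double the size, halve the channels.
        h, w, c = h * 2, w * 2, c // 2
        # The skip connection comes from the encoder layer of the same size.
        assert encoder[level] == (h, w, c), "skip connection shape mismatch"
        # Shape after concatenation and the two refining 3x3 convolutions.
        decoder.append((h, w, c))
    return encoder, decoder
```

Under these assumptions the trace reproduces the dimensions in the text: a 32×32×512 bottleneck, then decoder outputs of 64×64×256, 128×128×128, 256×256×64, and 512×512×32.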
[0062] Step S240: Based on the category probability distribution of the output layer, a multi-class mask image is generated by threshold determination. Specifically, the decoder finally outputs a 512×512×32 feature map, which is mapped to the number of categories by a 1×1 convolutional layer to obtain a 512×512×32 category score map. The Softmax activation function is applied to the score map to calculate the probability of each pixel belonging to each category, and a 512×512×32 probability distribution map is output. In this embodiment, a probability threshold of 0.5 is set. The category with the highest probability is selected for each pixel. If its probability is ≥0.5, it is determined to be the category; otherwise, it is determined to be the background. Finally, a 512×512 multi-class mask image is generated to mark the outline and position of each elevator component.
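The threshold determination in step S240 can be sketched per pixel as follows. Two conventions are assumptions of this example: the background is encoded as 0, and component classes are numbered from 1 in channel order. The score map is represented as nested lists so that the block needs nothing beyond the standard library.

```python
import math

BACKGROUND = 0  # assumption: 0 marks background; classes are numbered from 1


def softmax(scores):
    """Numerically stable softmax over one pixel's category scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def classify_pixel(scores, threshold=0.5):
    """Pick the most probable class, or background if it is below threshold."""
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best + 1 if probabilities[best] >= threshold else BACKGROUND


def build_mask(score_map, threshold=0.5):
    """Convert an H x W x C score map into an H x W multi-class mask."""
    return [[classify_pixel(scores, threshold) for scores in row]
            for row in score_map]
```

A pixel whose scores are nearly uniform across categories has no class reaching 0.5 and is therefore assigned to the background, which is the behaviour the threshold is meant to enforce.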
[0063] This embodiment achieves accurate segmentation of elevator components by constructing a specialized dataset and optimizing the U-Net model architecture. The output multi-class mask image not only preserves the complete spatial outline of the components but also accurately labels the category information, providing a reliable basis for component localization for subsequent cross-modal feature fusion. At the same time, the quality control mechanism ensures the stability of the segmentation results, effectively solving problems such as component recognition confusion and edge localization ambiguity in complex scenarios, and significantly improving the low-level feature parsing accuracy of the image-to-text system.
[0064] Step S300: Input the original image and the segmentation mask image into the image encoder for feature extraction, and fuse the extracted features to generate cross-modal fusion features;
[0065] Step S300 specifically includes:
[0066] Step S310: Extract global environmental feature vectors from the original image using the first encoder; the encoding module uses a pre-trained ResNet-50 as the backbone network to extract features from the image: after the original image is input into ResNet-50, it passes through the convolutional layer, pooling layer and residual block in the network in sequence; through continuous convolution and pooling operations, the local features, global features and semantic information at different levels of the image are gradually extracted, and finally a high-dimensional feature vector is output. This vector contains the overall structure and semantic information of the original image, such as the interior layout of the car, the lighting conditions of the shaft, and the overall distribution of the equipment in the machine room.
[0067] Step S320: Extract spatial localization feature vectors of elevator components from the segmentation mask image using the second encoder; similarly, a pre-trained ResNet-50 is used as the second encoder. For the 512×512 segmentation mask image generated in step S200, which is single-channel and where pixel values represent component categories, feature extraction is performed:
[0068] The mask image is first converted to a 512×512×3 three-channel image through dimensionality expansion. Single-channel values are copied to all three channels to adapt to the ResNet-50 input format and input to the second encoder. It then undergoes the same network structure processing as in step S310: 7×7 convolution, 3×3 pooling, and four residual block groups are applied sequentially. Finally, global average pooling outputs a 1×1×2048 feature vector. This vector focuses on the spatial distribution features of elevator components, such as the positional relationships like "the car door is on the left side of the image" and "the traction machine is slightly above the center of the image," compensating for the lack of detailed component location information in the original image features. Since the segmentation mask image mainly reflects the position and contour information of the components, the encoded feature vector highlights the spatial distribution features of the components. The feature vectors obtained from the original image and the segmentation mask image are merged in subsequent steps to provide a more comprehensive image semantic representation for text generation.
[0069] Step S330: Add the global environment feature vector to the spatial localization feature vector of the elevator component to generate cross-modal fusion features; use Feature Addition: add the feature vector encoded by the mask image to the feature vector encoded by the original image according to the corresponding dimensions to achieve the fusion of the two features. This step integrates the fine local features from the target component in the mask image with the global environment features from the original image to generate a more comprehensive and representative fusion feature vector, providing rich information for subsequent Q-Former parsing of the relationships between components.
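The two data manipulations around the second encoder, replicating the single-channel mask into three channels and adding the two encoded vectors, can be sketched as below. The ResNet-50 encoders themselves are not reproduced; the function names are hypothetical, and the vectors are plain lists standing in for the 2048-dimensional outputs.

```python
def mask_to_three_channels(mask):
    """Copy each single-channel mask value into three identical channels."""
    return [[[value, value, value] for value in row] for row in mask]


def fuse_features(global_vector, spatial_vector):
    """Element-wise addition of two feature vectors of equal dimension."""
    if len(global_vector) != len(spatial_vector):
        raise ValueError("feature vectors must have the same dimension")
    return [g + s for g, s in zip(global_vector, spatial_vector)]
```

Addition keeps the fused vector at the same dimension as each input, unlike concatenation, which would double it; this is one plausible reason the downstream projection layer can be sized from a single encoder's output.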
[0070] This embodiment constructs a cross-modal feature representation that considers both global and local aspects through a dual-path encoding and feature fusion strategy: the global environment features extracted by the first encoder from the original image provide scene context, while the spatial localization features extracted by the second encoder from the mask image enhance component details. The additive fusion of the two achieves complementary enhancement of semantic information, avoiding the information bias of a single image feature or mask feature. The fused feature includes both the category and location information of the elevator component and its surrounding environment, laying a data foundation for the subsequent accurate parsing of image semantics and generation of text descriptions that conform to the actual scene by the Q-Former model. This effectively solves the technical limitations of traditional single-modal features in taking into account both "what kind of component" and "where it is located," and addresses the problem of one-sided information in single-modal features, significantly improving the semantic parsing capability of the image-to-text system for complex elevator scenes.
[0071] Step S400: Based on the cross-modal fusion features, use a text generation model to parse the elevator component association information and generate an initial text description;
[0072] Step S400 specifically includes:
[0073] Step S410: Based on cross-modal fusion features, the spatial topology vector is parsed using the Transformer's self-attention mechanism. The fused feature vector serves as the input to the Q-Former model. The Q-Former model, based on the Transformer architecture, utilizes a self-attention mechanism for deep analysis of the input image features: the fused feature vector is converted into a Q-Former input sequence via a linear projection layer, with the sequence length adapted to the feature dimension. Simultaneously, the model initializes a set of learnable query vectors to focus on key component features. Attention weights between the query vectors and the input sequence are calculated using a self-attention mechanism. The attention weight matrix reflects the correlation strength between different component features; for example, the spatial proximity of the car door and the guide rail corresponds to a higher weight. After processing by a multi-layer Transformer encoder, which includes a self-attention layer and a feedforward neural network, the output is a topology vector encoding the spatial relationships between components. This vector integrates global environment and local component information from the fused features, providing a quantitative basis for semantic association parsing.
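The attention weights between the query vectors and the input sequence can be illustrated with a bare scaled dot-product attention. This is a simplified sketch, not the Q-Former itself: the learned projections, multiple heads, feedforward layers, and stacking are all omitted, and the queries are passed in as fixed lists rather than learned.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention.

    Returns (outputs, weights); weights[i][j] is how strongly query i
    attends to input position j, and each row of weights sums to 1.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        peak = max(scores)
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        row = [value / total for value in exps]
        weights.append(row)
        outputs.append([
            sum(w * value[d] for w, value in zip(row, values))
            for d in range(len(values[0]))
        ])
    return outputs, weights
```

A query aligned with one input position receives a larger weight for that position, which is the mechanism by which strongly related component features end up with higher attention weights.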
[0074] Step S420: Generate a text sequence containing component names, location descriptions, and anomaly markers based on the spatial topology vector and the elevator component knowledge base. Elevator component knowledge base construction: information on 32 categories of components is stored according to elevator manufacturing and installation safety regulations, including standard names, typical location descriptions, abnormal state features, and associated components. The knowledge base uses a triplet structure for storage. Based on the spatial topology vector and the elevator component knowledge base, structured text is generated through a fully connected layer and an LLM decoder. The specific process is as follows: the fully connected layer receives the topology vector output by the Q-Former and adjusts the feature dimension through nonlinear transformation to adapt it to the feature space of the large language model (LLM), ensuring that the image features are consistent with the language feature distribution of the LLM and thus improving feature parsing efficiency. The adapted features are used as conditional inputs to the LLM, combined with the pre-built elevator component knowledge base containing structured data such as component names, attributes, and typical location descriptions. The LLM uses an autoregressive generation strategy, predicting the next token one at a time starting from the initial token.
[0075] The model first parses the component categories from the topology vectors and matches them with the standard names in the knowledge base;
[0076] Generate location descriptions based on spatial relationship information;
[0077] If an abnormal feature is detected, an abnormal region marker from the segmentation mask image in step S200 is added;
[0078] The final output is an initial text sequence containing component names, shaft coordinates, and abnormal states. Descriptive text is then generated, including component names (e.g., "car door," "traction machine") and locations (e.g., "car top," "shaft sidewall"). The generation process uses an autoregressive approach, starting with the initial label and progressively predicting the next word until a complete text sequence is generated. LLM leverages pre-trained, massive amounts of language knowledge and text generation capabilities, combined with input features and label guidance information, to generate the final output text. In the example, the output matches the input label "A drive motor." In real-world applications, more detailed and accurate text descriptions, such as motor model, appearance, and operating mode, will be generated based on image features and labels.
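The triplet-structured knowledge base and the name, location, and anomaly assembly can be illustrated with a toy lookup. This sketch replaces the LLM decoder with a fixed string template purely to show how triplets might be queried; the triplet contents, relation names, and output wording are all hypothetical.

```python
# Hypothetical triplets in (component, relation, value) form.
KNOWLEDGE_BASE = [
    ("car door", "typical_location", "front of the hoistway"),
    ("car door", "abnormal_feature", "excessive door gap"),
    ("car door", "associated_component", "door lock device"),
    ("guide rail", "typical_location", "shaft sidewall"),
    ("guide rail", "abnormal_feature", "localized deformation"),
]


def lookup(component, relation):
    """Return all values stored for (component, relation)."""
    return [value for subject, rel, value in KNOWLEDGE_BASE
            if subject == component and rel == relation]


def describe(component, location=None, abnormal=False):
    """Template stand-in for the decoder: name, location, anomaly marker."""
    known_locations = lookup(component, "typical_location")
    if not known_locations:
        raise KeyError(f"unknown component: {component}")
    place = location or known_locations[0]
    text = f"{component} located at {place}"
    if abnormal:
        features = lookup(component, "abnormal_feature")
        text += f"; anomaly: {features[0] if features else 'unspecified'}"
    return text
```

In the described system the location would come from the spatial topology vector and the wording from the language model; the template here only shows where knowledge-base entries would constrain the component name and anomaly vocabulary.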
[0079] This embodiment achieves precise mapping from image features to structured text by leveraging the spatial relationship parsing capabilities of Q-Former and the semantic generation capabilities of LLM. The self-attention mechanism effectively captures the topological relationships between components, solving the problem that single features are insufficient to express spatial relationships. Feature adaptation in the fully connected layer ensures effective transmission of cross-modal information, while the text generated by LLM combined with a professional knowledge base contains standardized component names and location information, and accurately describes abnormal states, laying a semantic foundation for subsequent document standardization and avoiding the subjectivity and non-standardization of manual descriptions.
[0080] Step S500: Optimize the initial text description using a multi-source calibration mechanism to generate a standardized elevator status document containing structured metadata; wherein the standardized elevator status document includes: component status descriptions partitioned by car system, door system, and traction system; metadata binding the elevator registration code, detection timestamp, and shaft location coordinates; and storage in JSON format with a component type index field established;
[0081] Step S500 specifically includes:
[0082] Step S510: Convert the segmentation mask image into a component location text sequence; based on the segmentation mask image generated in step S200, generate the text sequence through the following process:
[0083] Region parsing: For each connected region in the mask image, corresponding to a single elevator component, extract its minimum bounding rectangle coordinates and convert them into relative coordinates of the shaft, such as "1.5m from the bottom of the shaft, 0.3m to the left".
[0084] Text mapping: Combining preset component categories and location description templates, such as "[Component Name] is located at [Hoistway Location], coordinate range", each region is converted into a structured text fragment. For example, the text corresponding to the "Car Door" region in the mask image is "The car door is located at the front of the hoistway, coordinate range (50, 200) - (300, 400)".
[0085] Sequence integration: Sort the text fragments by component system (car system, door system, etc.) to generate a complete component location text sequence that contains all detected components and their locations;
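The three sub-steps of S510 (region parsing, text mapping, sequence integration) can be sketched as follows. This is a minimal illustration under stated assumptions: the class ids, `CLASS_NAMES`, and `LOCATIONS` are hypothetical, coordinates stay in pixels rather than being converted to relative shaft coordinates, and sorting by class id stands in for sorting by component system.

```python
from collections import deque

# Hypothetical class ids and hoistway locations; the source does not list them.
CLASS_NAMES = {1: "car door", 2: "guide rail"}
LOCATIONS = {1: "the front of the hoistway", 2: "the hoistway sidewall"}


def connected_regions(mask):
    """Yield (class_id, (x_min, y_min, x_max, y_max)) per 4-connected region."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            cls = mask[r][c]
            if cls == 0 or seen[r][c]:
                continue  # background or already assigned to a region
            seen[r][c] = True
            queue = deque([(r, c)])
            x_min = x_max = c
            y_min = y_max = r
            while queue:  # breadth-first flood fill over same-class pixels
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and mask[ny][nx] == cls:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            yield cls, (x_min, y_min, x_max, y_max)


def mask_to_text(mask):
    """Fill the location template for every region, sorted by class id."""
    fragments = []
    for cls, (x0, y0, x1, y1) in sorted(connected_regions(mask)):
        fragments.append(
            f"The {CLASS_NAMES[cls]} is located at {LOCATIONS[cls]}, "
            f"coordinate range ({x0}, {y0}) - ({x1}, {y1})"
        )
    return fragments


mask = [
    [0, 1, 1, 0, 2],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]
for line in mask_to_text(mask):
    print(line)
```

For the toy mask this yields one fragment for the "car door" region with range (1, 0) - (2, 1) and one for the "guide rail" region with range (4, 0) - (4, 2).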
[0086] Step S520: Introduce predefined elevator component labels as calibration benchmarks. Specifically, the predefined elevator component labels are built on elevator industry standards and include: component name labels such as "car door", "traction machine", "guide rail", etc., which correspond one-to-one with the categories in the segmentation mask image; location description labels such as "left side of the shaft", "center of the top of the car", "below the traction machine", etc., which standardize how locations are described; and relationship description labels such as "parallel to XX component" and "located above XX component", which define the standard spatial relationships between components. These labels are stored in the system label library and serve as the benchmark reference for subsequent text calibration.
[0087] Step S530: Calculate the edit distance difference between the component location text sequence and the predefined elevator component labels. The edit distance quantifies the structural difference between the component location text sequence (written here as $T_{seg}$) and the predefined elevator component labels (written here as $T_{label}$). It is calculated as follows: for each text fragment in $T_{seg}$ and the corresponding label in $T_{label}$, compute the minimum number of single-character editing operations required to convert the former into the latter. For example, if the component name in $T_{seg}$ must have one character replaced to match the "car door" label in $T_{label}$, the edit distance is 1; if "door lock device" is missing, it must be inserted, which increases the edit distance by 1. The structural alignment loss of the textified component segmentation can be expressed as:
[0088] $$L_{edit}(T_{seg}, T_{label}) = \mathrm{EditDistance}(T_{seg}, T_{label})$$
[0089] In the formula, the smaller the value of $L_{edit}(T_{seg}, T_{label})$, the higher the structural consistency between $T_{seg}$ and $T_{label}$;
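The edit distance used in step S530 can be computed with the standard Levenshtein dynamic program. The sketch below is illustrative only. It works on any sequence, so it can compare two strings character by character or two lists of component names item by item; treating a whole missing component as a single insertion is an assumption made here to mirror the "door lock device" example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x with y, or keep a match
            ))
        prev = curr
    return prev[-1]


# Character level: one missing letter.
print(edit_distance("car dor", "car door"))  # 1

# Component level (assumption: each component name counts as one unit).
detected = ["car door", "guide rail"]
standard = ["car door", "door lock device", "guide rail"]
print(edit_distance(detected, standard))  # 1
```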
[0090] Step S540: Calculate the semantic similarity between the component location text sequence and the initial text description. Cosine similarity is used to measure the semantic association between the component location text sequence (written here as $T_{seg}$) and the initial text description generated in step S400 (written here as $T_{init}$). The process is as follows: the Sentence-BERT model encodes $T_{seg}$ and $T_{init}$ separately to obtain the semantic vectors $v_{seg}$ and $v_{init}$, each of dimension 768, where $v_{seg}$ is the Sentence-BERT encoding of $T_{seg}$ and $v_{init}$ is the Sentence-BERT encoding of $T_{init}$;
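The cosine similarity in step S540 reduces to a normalized dot product of the two encoded vectors. The sketch below uses hand-written 3-dimensional toy vectors in place of the 768-dimensional Sentence-BERT encodings, so the encoder itself is not exercised; the variable names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Return cos(u, v); defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy stand-ins for the encoded component location text and initial description.
v_seg = [1.0, 0.0, 1.0]
v_init = [1.0, 1.0, 0.0]
print(round(cosine_similarity(v_seg, v_init), 3))  # 0.5
```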
[0091] Step S550: Jointly correct the component location description based on the edit distance difference and the semantic similarity. The loss function of the multi-source calibration mechanism is:
[0092] $$L_{cycle} = \lambda_1 L_{sem}(T_{seg}, T_{init}) + L_{sem}(T_{label}, T_{init}) + \lambda_2 L_{edit}(T_{seg}, T_{label})$$
[0093] In the formula, $L_{cycle}$ denotes the closed-loop calibration loss. $\lambda_1$ is the semantic alignment weight coefficient between the component location text sequence $T_{seg}$ and the initial text description $T_{init}$; it is set to 0.6 to prioritize semantic consistency between the two. $L_{sem}(T_{seg}, T_{init})$ is the semantic difference loss between $T_{seg}$ and $T_{init}$; it is calculated in the same way as $L_{sem}(T_{label}, T_{init})$, namely from the F1 score of the text pair computed with the pre-trained BERT model, and the smaller the value, the more semantically consistent the text pair. $L_{sem}(T_{label}, T_{init})$ is the semantic difference loss between the predefined elevator component labels $T_{label}$ and $T_{init}$. $\lambda_2$ is the structural alignment weight coefficient between $T_{seg}$ and $T_{label}$, with a value of 0.4. $L_{edit}(T_{seg}, T_{label})$ is the text structure difference loss between $T_{seg}$ and $T_{label}$, calculated directly from the edit distance value; a smaller value indicates a better structural match. The first term, weighted 0.6, strengthens the semantic consistency between $T_{seg}$ and $T_{init}$, for example by correcting conflicting descriptions of the "car door location". The second term ensures that $T_{init}$ complies with industry-standard label terminology, for example by correcting "door operator" to "landing door drive device". The third term constrains the structural consistency between $T_{seg}$ and $T_{label}$, for example by supplementing the missing description of the "safety clamp";
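As a worked illustration of the closed-loop calibration loss, the sketch below combines three already-computed term values using the stated weights of 0.6 (semantic alignment) and 0.4 (structural alignment). Giving the label-versus-initial-text term unit weight is an assumption, since the text states no coefficient for it; the function and argument names are hypothetical.

```python
LAMBDA_SEM = 0.6     # semantic alignment weight stated in the text
LAMBDA_STRUCT = 0.4  # structural alignment weight stated in the text


def closed_loop_loss(sem_seg_init, sem_label_init, edit_seg_label):
    """Weighted sum of the three calibration terms.

    sem_seg_init:   semantic difference loss, location text vs. initial text
    sem_label_init: semantic difference loss, labels vs. initial text
                    (unit weight is an assumption of this sketch)
    edit_seg_label: edit distance, location text vs. labels
    """
    return (LAMBDA_SEM * sem_seg_init
            + sem_label_init
            + LAMBDA_STRUCT * edit_seg_label)


loss = closed_loop_loss(sem_seg_init=0.2, sem_label_init=0.1, edit_seg_label=1)
print(round(loss, 2))  # 0.62
```

Because the edit distance is an unnormalized count while the semantic terms are bounded, a practical implementation would likely rescale the structural term; the source does not say whether it does.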
[0094] This embodiment implements text correction based on a step-by-step loss function and a total loss function. The total loss function is as follows:
[0095]
[0096] In the formula, the symbols denote the total loss function; the semantic alignment loss between the labels and the initial text; and the semantic collaboration loss between the component location text sequence and the initial text description, which is computed from the Sentence-BERT encoded vectors of the two texts. The total loss is minimized using the Adam optimizer, and the parameters of the text generation model, i.e., Q-Former and the LLM, are updated through backpropagation.
[0097] The joint correction process for component location descriptions includes:
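[0098] If the deviation lies in $L_{sem}(T_{seg}, T_{init})$, the semantic difference loss between the component location text sequence $T_{seg}$ and the initial text description $T_{init}$: this indicates a significant semantic discrepancy between $T_{seg}$ and $T_{init}$. The conflicting segments are located by comparing the semantic vectors of the two; for example, when one of the two texts states that "the traction machine is on the left side" while the other states "on the right side", the spatial positioning features of $T_{seg}$ are used as the benchmark for correction;
[0099] If the deviation lies in $L_{sem}(T_{label}, T_{init})$, the semantic difference loss between the predefined elevator component labels $T_{label}$ and $T_{init}$: this indicates that $T_{init}$ contains non-standard terms, which are replaced with reference to $T_{label}$, for example by uniformly changing "elevator box" to "car";
[0100] If the deviation lies in $L_{edit}(T_{seg}, T_{label})$, the text structure difference loss between $T_{seg}$ and $T_{label}$: this indicates a structural omission in $T_{seg}$, and the component description is supplemented according to $T_{label}$, for example by adding text related to the "speed limiter";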
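The three correction cases above can be approximated with a simple rule-based sketch. This is not the loss-driven procedure of the embodiment: it applies the corrections directly to small dictionaries instead of locating deviations through loss terms. The `STANDARD_TERMS` table, the dictionary layout, and the "location not detected" placeholder are assumptions for illustration; only the two term mappings come from the text.

```python
# Hypothetical synonym table; the two example mappings are taken from the text.
STANDARD_TERMS = {
    "elevator box": "car",
    "door operator": "landing door drive device",
}


def correct(init_items, seg_items, required):
    """Return a corrected {component: location} mapping.

    init_items: components and locations from the initial text description
    seg_items:  components and locations derived from the segmentation mask
    required:   standard component labels that must appear in the output
    """
    corrected = {}
    for name, location in init_items.items():
        name = STANDARD_TERMS.get(name, name)     # replace non-standard terms
        location = seg_items.get(name, location)  # segmentation wins conflicts
        corrected[name] = location
    for name in required:                         # fill structural omissions
        corrected.setdefault(name, seg_items.get(name, "location not detected"))
    return corrected


init_items = {"traction machine": "left side", "elevator box": "hoistway center"}
seg_items = {
    "traction machine": "right side",
    "car": "hoistway center",
    "speed limiter": "machine room",
}
required = ["car", "traction machine", "speed limiter"]
print(correct(init_items, seg_items, required))
```

In this toy run the traction machine location is corrected to "right side", "elevator box" becomes "car", and the missing "speed limiter" entry is added.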
[0101] The structured and corrected text is sorted by partition into "Car System - Door System - Traction System - Guiding System - Safety Protection System," and within each system, it is further sorted by component importance. Embedded metadata includes elevator registration code, inspection timestamp, and shaft location coordinates. This data is ultimately stored in JSON format, with component type index fields such as "car_system" corresponding to the car system text and "door_system" corresponding to the door system text, supporting quick retrieval by component type and system category.
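A minimal sketch of how such a JSON document might be assembled is given below. Only the `car_system` and `door_system` keys appear in the text; the remaining system keys, the `metadata` field names, and all sample values are hypothetical placeholders.

```python
import json

# Partition order from the text; keys beyond car_system and door_system
# are assumed names that follow the same pattern.
SYSTEM_ORDER = [
    "car_system",
    "door_system",
    "traction_system",
    "guiding_system",
    "safety_protection_system",
]


def build_document(registration_code, timestamp, shaft_coordinates, descriptions):
    """Bind metadata and add per-system text in the prescribed partition order."""
    doc = {
        "metadata": {
            "registration_code": registration_code,
            "inspection_timestamp": timestamp,
            "shaft_coordinates": shaft_coordinates,
        }
    }
    for key in SYSTEM_ORDER:
        if key in descriptions:
            doc[key] = descriptions[key]
    return doc


doc = build_document(
    registration_code="EXAMPLE-0001",      # placeholder, not a real code
    timestamp="2025-01-01T09:00:00Z",      # placeholder timestamp
    shaft_coordinates={"x_m": 0.3, "y_m": 1.5},
    descriptions={
        "door_system": ["The car door is located at the front of the hoistway."],
        "car_system": ["The car is located at the hoistway center."],
    },
)
print(json.dumps(doc, indent=2, ensure_ascii=False))
```

Using the system name as the top-level key is one way to realize the component type index: retrieving the door system text is then a single lookup, `doc["door_system"]`.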
[0102] This embodiment integrates structural and semantic constraints through a multi-source calibration mechanism. Based on the joint optimization of edit distance and semantic similarity, the initial text description is corrected. The generated standardized elevator status document not only conforms to industry standard terminology specifications but also accurately reflects the spatial relationships of components. At the same time, the structured metadata design enables efficient retrieval and management, effectively solving the problems of strong subjectivity and chaotic format of manual reports, and providing standardized and traceable textual evidence for the safety management of the entire elevator life cycle.
[0103] In summary, the elevator image-to-text method proposed in this application, based on a large-scale model of semantic recognition and image segmentation, utilizes U-Net image recognition and Q-Former text generation technology for automatic elevator component document generation. It constructs a closed-loop optimization system from image segmentation to text generation by introducing a multi-dimensional loss function. First, the elevator image is preprocessed and input into the U-Net model to achieve accurate segmentation of elevator components. Then, ResNet-50 is used to encode the original image and the segmented mask image. The encoded results are then fused, and the fused feature vector is input into the Q-Former model to generate descriptive text. The text generated by Q-Former and the text generated by U-Net are optimized and fused to generate an optimized output, which is then standardized into a document. For the U-Net model, a combination of generalized Dice Loss and Cross Entropy Loss is used to optimize component segmentation accuracy. BERTScore Loss and contrastive learning loss are used to enhance the semantic alignment between the Q-Former generated text and the input labels. Finally, a triangular closed-loop loss is used to achieve collaborative optimization of the U-Net segmentation results, input labels, and Q-Former generated text.
[0104] In practical applications, compared with traditional manual review and summarization methods, this method can significantly improve the efficiency of elevator status document generation, reducing the time from processing a single image to generating a document from an average of 10 minutes to just a few seconds. Through the collaborative optimization of U-Net and Q-Former, the matching accuracy between text descriptions and actual elevator component status is improved. At the same time, the closed-loop optimization system constructed by multi-dimensional loss functions endows the system with self-learning and adaptive capabilities, reducing manual intervention and effectively improving the automation and standardization level of elevator status inspection and document management.
[0105] Referring to Figure 3, which illustrates the structure of an elevator image-to-text system based on a large model of semantic recognition and image segmentation according to an embodiment of the present invention, the system includes:
[0106] The image acquisition module is configured to acquire the original image of the elevator detection scene and perform preprocessing.
[0107] The segmentation module is configured to process the preprocessed original image based on a pre-trained semantic segmentation model to generate a segmentation mask image of the elevator components.
[0108] The feature fusion module is configured to input the original image and the segmentation mask image into the image encoder for feature extraction, and then fuse the extracted features to generate cross-modal fusion features.
[0109] The text generation module is configured to generate an initial text description by parsing the elevator component association information based on the cross-modal fusion features and using a text generation model.
[0110] The calibration output module is configured to optimize the initial text description using a multi-source calibration mechanism to generate a standardized elevator status document containing structured metadata.
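The data flow between the five modules listed above can be sketched as a small pipeline class. Every callable passed in below is a trivial stub chosen so that the example runs; none of them is the U-Net, ResNet-50, Q-Former, or calibration logic of the embodiment, and the class and field names are hypothetical. The `fuse` stub adds the two inputs element-wise to echo the additive fusion recited in claim 1.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ImageToTextPipeline:
    """Chains the five modules; each field is a pluggable callable."""
    acquire: Callable[[Any], Any]          # image acquisition and preprocessing
    segment: Callable[[Any], Any]          # segmentation mask generation
    fuse: Callable[[Any, Any], Any]        # feature extraction and fusion
    generate: Callable[[Any], str]         # initial text description
    calibrate: Callable[[str, Any], dict]  # calibration and document output

    def run(self, raw):
        image = self.acquire(raw)
        mask = self.segment(image)
        features = self.fuse(image, mask)
        initial_text = self.generate(features)
        return self.calibrate(initial_text, mask)


# Stub modules for demonstration only.
pipeline = ImageToTextPipeline(
    acquire=lambda raw: raw,
    segment=lambda image: [[1 if px > 0 else 0 for px in row] for row in image],
    fuse=lambda image, mask: [
        a + b for row_i, row_m in zip(image, mask) for a, b in zip(row_i, row_m)
    ],
    generate=lambda feats: f"{sum(1 for f in feats if f > 0)} component pixels detected",
    calibrate=lambda text, mask: {"description": text, "mask_rows": len(mask)},
)

result = pipeline.run([[0, 5], [7, 0]])
print(result)
```

Keeping each module behind a callable makes it possible to swap a stub for a real model without changing the order in which the modules are invoked.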
[0111] The third aspect of the present invention provides an electronic device, the electronic device comprising: a processor and a memory communicatively connected to the processor; wherein the memory stores instructions executable by the processor, the instructions being executed by the processor to enable the processor to perform the steps of the elevator image-to-text method based on a large model of semantic recognition and image segmentation as described in the first aspect of the present invention.
[0112] The fourth aspect of the present invention provides a computer-readable storage medium storing a program for implementing an elevator image-to-text method based on a large model of semantic recognition and image segmentation. The program for implementing the elevator image-to-text method based on a large model of semantic recognition and image segmentation is executed by a processor to implement the steps of the elevator image-to-text method based on a large model of semantic recognition and image segmentation as described in the first aspect of the present invention.
[0113] It should be noted that the order of the above embodiments of the present invention is merely for descriptive purposes and does not represent the superiority or inferiority of the embodiments. The processes depicted in the accompanying drawings do not necessarily require a specific or sequential order to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
[0114] The various embodiments in this specification are described in a progressive manner. The same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on describing the differences from other embodiments.
Claims
1. An elevator image-to-text method based on a large-scale semantic recognition and image segmentation model, characterized in that, The method includes: Acquire the original image of the elevator inspection scene and perform preprocessing; The pre-processed original image is processed based on a pre-trained semantic segmentation model to generate a segmentation mask image of the elevator components. The original image and the segmentation mask image are respectively input into an image encoder for feature extraction, and the extracted features are fused to generate cross-modal fusion features; including: The first encoder is used to extract global environmental feature vectors from the original image; The second encoder is used to extract the spatial positioning feature vectors of elevator components from the segmentation mask image; The global environment feature vector is added to the spatial positioning feature vector of the elevator components to generate cross-modal fusion features; Based on the cross-modal fusion features, an initial text description is generated by parsing the elevator component association information using a text generation model. The initial text description is optimized using a multi-source calibration mechanism to generate a standardized elevator status document containing structured metadata; including: Convert the segmentation mask image into a sequence of component location text; Introduce predefined elevator component labels as calibration benchmarks; Calculate the difference in edit distance between the component location text sequence and the predefined elevator component labels; Calculate the semantic similarity between the component location text sequence and the initial text description; The component location description is corrected based on a combination of edit distance difference and semantic similarity. 
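The expression for the loss function of the multi-source calibration mechanism is: $L_{cycle} = \lambda_1 L_{sem}(T_{seg}, T_{init}) + L_{sem}(T_{label}, T_{init}) + \lambda_2 L_{edit}(T_{seg}, T_{label})$. In the formula, $L_{cycle}$ denotes the closed-loop calibration loss; $\lambda_1$ denotes the semantic alignment weight coefficient between the component location text sequence $T_{seg}$ and the initial text description $T_{init}$; $L_{sem}(T_{seg}, T_{init})$ denotes the semantic difference loss between the component location text sequence $T_{seg}$ and the initial text description $T_{init}$; $L_{sem}(T_{label}, T_{init})$ denotes the semantic difference loss between the predefined elevator component labels $T_{label}$ and the initial text description $T_{init}$; $\lambda_2$ denotes the structural alignment weight coefficient between the component location text sequence $T_{seg}$ and the predefined elevator component labels $T_{label}$; $L_{edit}(T_{seg}, T_{label})$ denotes the text structure difference loss between the component location text sequence $T_{seg}$ and the predefined elevator component labels $T_{label}$.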
2. The elevator image-to-text method as described in claim 1, characterized in that processing the pre-processed original image based on the pre-trained semantic segmentation model to generate a segmentation mask image of the elevator components includes: constructing a multi-category segmentation standard dataset for elevator components, and training the semantic segmentation model on the standard dataset; performing multi-level downsampling on the preprocessed original image with the encoder to extract high-level semantic features; upsampling the high-level semantic features with the decoder and concatenating them with the features of the corresponding encoder layer; and generating a multi-class mask image by threshold determination based on the category probability distribution of the output layer.
3. The elevator image-to-text method as described in claim 1, characterized in that generating an initial text description by parsing the elevator component association information using a text generation model, based on the cross-modal fusion features, includes: parsing spatial topological vectors from the cross-modal fusion features using the self-attention mechanism of the Transformer; and generating, based on the spatial topological vectors and an elevator component knowledge base, a text sequence containing component names, location descriptions, and anomaly markers.
4. The elevator image-to-text method as described in claim 1, characterized in that the standardized elevator status document includes: component status descriptions partitioned by car system, door system, and traction system; metadata including the elevator registration code, detection timestamp, and shaft location coordinates; and storage in JSON format with a component type index field.
5. An elevator image-to-text system based on a large-scale semantic recognition and image segmentation model, characterized in that it is used to implement the elevator image-to-text method based on a large model of semantic recognition and image segmentation as described in any one of claims 1 to 4, the system comprising: an image acquisition module configured to acquire and preprocess the original image of the elevator detection scene; a segmentation module configured to process the preprocessed original image based on a pre-trained semantic segmentation model to generate a segmentation mask image of the elevator components; a feature fusion module configured to input the original image and the segmentation mask image into an image encoder for feature extraction, and then fuse the extracted features to generate cross-modal fusion features, including: using the first encoder to extract global environmental feature vectors from the original image; using the second encoder to extract the spatial positioning feature vectors of elevator components from the segmentation mask image; and adding the global environment feature vector to the spatial positioning feature vector of the elevator components to generate the cross-modal fusion features; a text generation module configured to generate an initial text description by parsing the elevator component association information based on the cross-modal fusion features and using a text generation model;
The calibration output module is configured to optimize the initial text description using a multi-source calibration mechanism, generating a standardized elevator status document containing structured metadata, including: converting the segmentation mask image into a component location text sequence; introducing predefined elevator component labels as calibration benchmarks; calculating the edit distance difference between the component location text sequence and the predefined elevator component labels; calculating the semantic similarity between the component location text sequence and the initial text description; and jointly correcting the component location description based on the edit distance difference and the semantic similarity. The expression for the loss function of the multi-source calibration mechanism is: $L_{cycle} = \lambda_1 L_{sem}(T_{seg}, T_{init}) + L_{sem}(T_{label}, T_{init}) + \lambda_2 L_{edit}(T_{seg}, T_{label})$. In the formula, $L_{cycle}$ denotes the closed-loop calibration loss; $\lambda_1$ denotes the semantic alignment weight coefficient between the component location text sequence $T_{seg}$ and the initial text description $T_{init}$; $L_{sem}(T_{seg}, T_{init})$ denotes the semantic difference loss between the component location text sequence $T_{seg}$ and the initial text description $T_{init}$; $L_{sem}(T_{label}, T_{init})$ denotes the semantic difference loss between the predefined elevator component labels $T_{label}$ and the initial text description $T_{init}$; $\lambda_2$ denotes the structural alignment weight coefficient between the component location text sequence $T_{seg}$ and the predefined elevator component labels $T_{label}$; $L_{edit}(T_{seg}, T_{label})$ denotes the text structure difference loss between the component location text sequence $T_{seg}$ and the predefined elevator component labels $T_{label}$.
6. An electronic device, characterized in that, The electronic device includes: a processor and a memory communicatively connected to the processor; wherein the memory stores instructions executable by the processor, the instructions being executed by the processor to enable the processor to perform the steps of the elevator image-to-text method based on a large model of semantic recognition and image segmentation as described in any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a program for implementing an elevator image-to-text method based on a large model of semantic recognition and image segmentation. The program for implementing the elevator image-to-text method based on a large model of semantic recognition and image segmentation is executed by a processor to implement the steps of the elevator image-to-text method based on a large model of semantic recognition and image segmentation as described in any one of claims 1 to 4.