Adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships

By employing an adaptive alignment cross-modal visual-language interaction method, semantic representations are extracted using Faster R-CNN and RoBERTa networks, and KAN and self-attention mechanisms are combined to address the problem of insufficient visual-language information alignment in complex environments for intelligent ships, thereby improving operational efficiency and robustness.

CN119357897B · Active · Publication Date: 2026-06-16 · SHANGHAI MARITIME UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI MARITIME UNIVERSITY
Filing Date
2024-10-17
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing intelligent ships have shortcomings in human-computer interaction guided by visual and linguistic information, especially in terms of ease of operation and robustness in complex environments, making it difficult to achieve dynamic adaptive modulation and multimodal visual and linguistic information alignment.

Method used

An adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships is adopted. The semantic representations of visual and linguistic information are extracted through Faster R-CNN and RoBERTa networks. KAN and self-attention mechanism are combined to carry out intra- and extra-modal interaction, optimize information alignment and fusion, and design a loss function to obtain the target and its location.

Benefits of technology

The method improves the efficiency of intelligent ship operation in complex marine environments and the accuracy and robustness of human-computer interaction. When a single modality fails, the other modalities compensate for it, which improves the robustness of the model; the inference speed of the model is also improved.



Abstract

The embodiments of this application disclose an adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships. The method includes: extracting the semantic representation of the target image objects, and the position information corresponding to each object, from the visual image collected by the shipborne vision sensor; extracting the semantic representation of text tokens from the language instruction corresponding to the visual image, and calculating a text summary representation; performing intra-modal alignment on the extracted semantic representation of the text tokens; performing intra-modal and inter-modal interactive alignment on the extracted semantic representation of the image objects; compressing and integrating the fine-grained semantic representation of the text tokens, and fusing the text semantic representation with the cross-modal semantic representation of the image objects; and projecting the fused features and constructing a loss function to obtain the target related to the language instruction and its position. The application enables crew members to interact in real time with intelligently recognized scenes during cruising, improving the ship's intelligence and operating efficiency and supporting better subsequent intelligent decision-making.

Description

Technical Field

[0001] This invention relates to the field of human-computer interaction technology for ships, and specifically to an adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships.

Background Technology

[0002] Human-computer interaction (HCI) technology has been widely applied in various fields, evolving from early command-line interfaces to today's graphical user interfaces, touch technology, and voice interaction technology. Existing HCI methods have won user favor due to their intuitiveness and efficiency, especially in personal computing devices, smartphones, and smart homes, where users can interact with devices through visual elements and natural language, significantly lowering the barrier to entry. However, despite significant progress in ease of operation and user experience, existing technologies still have some shortcomings. On the one hand, most systems rely on a single-modal interaction method, performing poorly in complex scenarios. On the other hand, existing interaction systems lack sufficient intelligence, relying mainly on preset rules and simple pattern matching, making it difficult to cope with complex and ever-changing user needs.

[0003] With the continuous development of artificial intelligence technology, multimodal human-computer interaction (HCI) offers a more natural and flexible interaction method by combining information from multiple modalities. By fusing data from various sensors such as cameras, microphones, and touch sensors, multimodal interaction systems have been widely applied in fields such as intelligent assistants, augmented reality, virtual reality, and autonomous driving. Compared with traditional unimodal interaction methods, multimodal technology can more accurately capture user intent and provide personalized services through comprehensive analysis of multimodal data. This interaction method not only enhances the naturalness of the user experience but also strengthens the system's robustness in complex environments, enabling compensation through other modalities when information from a single channel is insufficient or fails. However, multimodal HCI technology still faces a series of challenges, including how to acquire fine-grained feature information and how to align and fuse data from different sources.

[0004] Driven by the national strategy of building a maritime power, the demand for marine development and utilization is constantly increasing, prompting the rapid development of ship technology towards intelligence. Intelligent ships, by integrating modern information and automation technologies, aim to achieve more efficient autonomous navigation, intelligent management, and optimized resource allocation, thereby enhancing China's competitiveness in international maritime affairs. With the gradual promotion of intelligent ships, research on ship intelligence has become a top priority, especially for complex sea conditions and long-distance ocean navigation. How to improve ship safety and operational efficiency through intelligent means has become crucial for the development of the industry. However, in-depth research has revealed shortcomings in the human-computer interaction guided by visual-language information in existing intelligent ships. There is an urgent need for a human-computer interaction system that provides dynamic adaptive modulation and multimodal visual-language information alignment, so as to improve operational convenience and provide greater robustness and safety for subsequent decision-making in complex environments.

Summary of the Invention

[0005] The purpose of this invention is to address the problems mentioned in the background section by providing an adaptive active learning method for ship multimodal data with limited labeling. This method can strategically select diverse and uncertain data as a subset based on a limited labeling budget when labeling massive amounts of ship data is limited, thereby optimizing the labeling process of unlabeled data and solving the problems of high cost and poor effectiveness in labeling massive amounts of unlabeled ship data.

[0006] To achieve the above objectives, the present invention specifically adopts the following technical solution:

[0007] An adaptive alignment-based cross-modal vision-language intelligent human-computer interaction method for ships includes:

[0008] S1. Extract the semantic representation of the target image objects, and the location information corresponding to each object, from the visual images acquired by the shipborne vision sensor;

[0009] S2. Extract the semantic representation of the text tokens from the language instruction corresponding to the visual image, and calculate the summary representation of the text;

[0010] S3. Perform intra-modal alignment on the extracted semantic representation of the text tokens to obtain a fine-grained semantic representation of the text tokens;

[0011] S4. Perform intra-modal and inter-modal interaction alignment on the extracted semantic representation of the image objects to obtain a fine-grained semantic representation of the image objects;

[0012] S5. Compress and integrate the fine-grained semantic representation of the text tokens to obtain a text semantic representation in vector form, and fuse it with the cross-modal semantic representation of the image objects to obtain fused features;

[0013] S6. Project the fused features and construct a loss function based on classification-ranking and regression strategies to obtain the target related to the language instruction and its location.

[0014] Beneficial effects:

[0015] By enabling cross-modal information alignment operations primarily based on visual and linguistic information, crew members can engage in real-time human-machine interaction with intelligent recognition scenarios during cruise operations, thereby improving the ship's intelligence and operational efficiency and facilitating better intelligent decision-making in the next stage.

[0016] This invention considers redundant information in text and image content, and designs an adaptive attention unit to filter out irrelevant information, further ensuring the accuracy of information during interaction. This allows the model to accurately capture key information in language commands and focus on image targets that match the key information in the language commands, significantly improving the accuracy of intelligent human-computer interaction. This advantage differs from some existing cross-modal research that fails to effectively address the interference caused by redundant noise information.

[0017] This invention introduces a recent technique, KAN, into the process of intra-modal interaction and cross-modal alignment between vision and text. KAN replaces the traditional multilayer perceptron operation, enhances the robustness of the model, and greatly reduces the number of model parameters in this part, helping the model meet the requirements on inference speed and update time in complex marine environments.

Description of the Drawings

[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is a model structure diagram of the adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships proposed in this invention.

[0020] Figure 2 is a flowchart of the language instruction preprocessing and feature extraction process in this invention.

[0021] Figure 3 is a structural diagram of the text encoding module in this invention.

[0022] Figure 4 is a structural diagram of the image encoding module in this invention.

[0023] Figure 5 is a flowchart of the method of the present invention.

Detailed Implementation

[0024] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.

[0025] As shown in Figure 1, this invention provides an adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships, comprising:

[0026] S1. Extract the semantic representation of the target image objects, and the location information corresponding to each object, from the visual images acquired by the shipborne vision sensor;

[0027] S2. Extract the semantic representation of the text tokens from the language instruction corresponding to the visual image, and calculate the summary representation of the text;

[0028] S3. Perform intra-modal alignment on the extracted semantic representation of the text tokens to obtain a fine-grained semantic representation of the text tokens;

[0029] S4. Perform intra-modal and inter-modal interaction alignment on the extracted semantic representation of the image objects to obtain a fine-grained semantic representation of the image objects;

[0030] S5. Compress and integrate the fine-grained semantic representation of the text tokens to obtain a text semantic representation in vector form, and fuse it with the cross-modal semantic representation of the image objects to obtain fused features;

[0031] S6. Project the fused features and construct a loss function based on classification-ranking and regression strategies to obtain the target related to the language instruction and its location.

[0032] The specific steps of step S1 are as follows:

[0033] S11. The Faster R-CNN network with ResNet152 as the backbone is used to extract the semantic representation of objects in the target image and the coordinates of each object.

[0034] S12. Based on the coordinate information, perform absolute and relative positioning on the image object, and obtain absolute positioning features and relative positioning features through coding design;

[0035] For step S11, an image Img acquired by the ship's vision sensor is used as input. Grid-type and patch-type features may be ineffective for locating and tracking objects in complex environments and for recognizing objects of different sizes, and they carry a lot of redundant information. This invention therefore uses a Faster R-CNN network with ResNet152 as the backbone to extract the semantic representation of the objects in the target image Img together with the coordinates Pos corresponding to those objects, so that the model can accurately capture objects of different scales.

[0036]

[0037] where the feature matrix denotes the extracted semantic features of the objects in the image; n, the number of objects in the image, is set to 100, and the feature dimension of each object is set to 768. $Pos = [pos_1, pos_2, \dots, pos_n] \in \mathbb{R}^{n \times 4}$ denotes the coordinate information of the corresponding objects: each $pos_i$ has 4 dimensions, namely the coordinates of the bottom-left and top-right corners of $obj_i$;

[0038] For step S12, since the directly obtained object coordinate information is not very helpful for the model's inference, this invention effectively processes and encodes the image objects based on the original coordinate information to obtain effective absolute and relative positioning features, thereby enhancing the model's ability to locate, recognize, and infer target objects. The specific implementation process is as follows:

[0039] Transform the coordinates of each object into $pos_i = \{xpos_i, ypos_i, wpos_i, hpos_i\}$:

[0040]

[0041] where $\{xpos_i, ypos_i\}$ are the center-point coordinates of $obj_i$, and $wpos_i$ and $hpos_i$ are the width and height of $obj_i$, respectively;
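The conversion described here can be illustrated with a minimal sketch. It assumes each box is given as bottom-left and top-right corners `(x1, y1, x2, y2)`, as stated for Pos; the function name `corners_to_center` and the sample boxes are hypothetical and do not come from the patent.

```python
def corners_to_center(box):
    """Convert (x1, y1, x2, y2) corners to (xpos, ypos, wpos, hpos).

    (x1, y1) is assumed to be the bottom-left corner and (x2, y2) the
    top-right corner, so the width and height are non-negative.
    """
    x1, y1, x2, y2 = box
    xpos = (x1 + x2) / 2.0  # center x
    ypos = (y1 + y2) / 2.0  # center y
    wpos = x2 - x1          # width
    hpos = y2 - y1          # height
    return (xpos, ypos, wpos, hpos)


# Hypothetical boxes, for illustration only.
boxes = [(10.0, 20.0, 50.0, 80.0), (0.0, 0.0, 4.0, 2.0)]
centers = [corners_to_center(b) for b in boxes]
```

The center/size form separates where an object is from how large it is, which is convenient for the scaling step that follows.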

[0042] Each object's position is scaled relative to the image Img, which enhances the learning ability and target localization ability of the model. The absolute coordinate features $Pos_{abs}$ of the objects are expressed as:

[0043]

[0044] where $Img_w$ and $Img_h$ are the width and height of the image Img, respectively, and Linear(·) is a linear transformation function;
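As a rough illustration of scaling by the image size, the sketch below divides the horizontal quantities by the image width and the vertical quantities by the image height. This particular normalization is an assumption, since the formula itself is not reproduced in the text, and the learned Linear(·) projection is deliberately left out. The name `normalize_position` is hypothetical.

```python
def normalize_position(pos, img_w, img_h):
    """Scale (xpos, ypos, wpos, hpos) into image-relative units.

    Assumption: x and width are divided by the image width, y and height
    by the image height. The learned Linear(.) layer that would follow
    is omitted from this sketch.
    """
    xpos, ypos, wpos, hpos = pos
    return (xpos / img_w, ypos / img_h, wpos / img_w, hpos / img_h)


# Hypothetical object position and image size.
scaled = normalize_position((30.0, 50.0, 40.0, 60.0), img_w=200.0, img_h=100.0)
```

After this step every component lies in [0, 1] for boxes inside the image, so objects from images of different resolutions become comparable.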

[0045] Construct relative coordinate features using the relative relationships between each pair of objects:

[0046]

[0047] s.t. $i, j \in [1, \dots, n]$

[0048] where the embedding function in the formula denotes a high-dimensional embedding algorithm.

[0049] As shown in Figure 2, the specific steps of step S2 are as follows:

[0050] S21. Clean, segment, and encode the text in the form of language instructions to obtain a list of text tokens;

[0051] S22. Use the pre-trained language model RoBERTa to extract semantic feature representations of the text token list;

[0052] S23. Calculate the global representation of the text context based on the semantic feature representation of the text tokens;

[0053] For step S21, the text-based language instruction Que associated with the image Img is used as input. First, Que is cleaned to remove unnecessary noise, including but not limited to stop-word removal, special-symbol removal, and case unification. Second, Que is segmented using a BPE tokenizer. To keep the sequence length uniform for training, the maximum length of the Que sequence is set to 30: shorter sequences are padded and longer ones are truncated. Finally, the segmented sequence is converted into an ID sequence and an attention mask is generated, giving the input tensor $Que_{ts}$;
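The cleaning, padding, truncation, and mask generation in step S21 can be sketched as follows. This is only an illustration: whitespace splitting stands in for the BPE tokenizer, stop-word removal is skipped, and the vocabulary, `PAD_ID`, and `UNK_ID` values are hypothetical.

```python
import re

MAX_LEN = 30  # maximum sequence length stated in the text
PAD_ID = 0    # hypothetical padding ID
UNK_ID = 1    # hypothetical ID for out-of-vocabulary tokens


def preprocess(instruction, vocab, max_len=MAX_LEN):
    """Return (ids, attention_mask), both of length max_len."""
    # Cleaning: unify case and replace special symbols with spaces.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", instruction.lower())
    # Stand-in for BPE segmentation: split on whitespace.
    tokens = cleaned.split()
    # Encode to IDs and truncate sequences that are too long.
    ids = [vocab.get(tok, UNK_ID) for tok in tokens][:max_len]
    mask = [1] * len(ids)
    # Pad sequences that are too short; padded positions get mask 0.
    pad = max_len - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad


# Hypothetical vocabulary and instruction.
vocab = {"find": 2, "the": 3, "red": 4, "buoy": 5}
ids, mask = preprocess("Find the RED buoy!", vocab)
```

The attention mask lets later layers ignore padded positions, so padding to a fixed length does not change what the model attends to.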

[0054] For step S22, once the text has been converted into the input format $Que_{ts}$, the pre-trained language model RoBERTa is used to extract features from it, yielding the semantic representation of the text tokens:

[0055]

[0056] where the output is the matrix of semantic features of the text tokens extracted by RoBERTa; m, the number of tokens in the text sequence, is set to 30, and the feature dimension of each token is set to 768.

[0057] For step S23, based on the obtained semantic feature representation of the text tokens, an average pooling method is used to calculate and obtain the text summary feature representation, which is then broadcast to the same dimension as the semantic features of the text tokens. The aim is to utilize this feature representation to optimize intra-module interactions and filter out irrelevant noise information in subsequent processes. The specific implementation process is as follows:

[0058]

[0059] where Avgpooling(·) denotes average pooling, whose output is the text summary representation; $\mathbf{1}_m$ is an all-ones vector of length m; the product in the formula is the outer product; and the result is the text summary feature after broadcasting.
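The pooling and broadcasting in step S23 can be sketched in plain Python. The sketch averages the token features over the token axis and then takes the outer product with an all-ones vector of length m, which repeats the summary once per token. The function names and the tiny 3 × 2 example are illustrative; the method itself uses m = 30 and a feature dimension of 768.

```python
def average_pool(tokens):
    """Average an (m x d) list of token features over the token axis."""
    m = len(tokens)
    d = len(tokens[0])
    return [sum(tok[k] for tok in tokens) / m for k in range(d)]


def broadcast_summary(summary, m):
    """Outer product of an all-ones vector of length m with the summary.

    The result is an (m x d) matrix whose rows all equal the summary.
    """
    ones = [1.0] * m
    return [[o * s for s in summary] for o in ones]


# Toy token features (m = 3 tokens, d = 2 dimensions).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
summary = average_pool(tokens)
broadcast = broadcast_summary(summary, len(tokens))
```

Broadcasting gives the summary the same shape as the token features, so the two can be combined elementwise in later filtering steps.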

[0060] The specific steps of step S3 are as follows:

[0061] S31. The self-attention mechanism in existing methods usually only computes the correlation between token pairs and lacks any operation for filtering redundant information from the original token sequence. This invention therefore designs a text query-key filtering self-attention (QKFSA) method to capture the long-distance dependencies of text tokens, as well as local and global context information, and to optimize the intra-modal interaction modeling of text token information.

[0062] S32. Introduce Kolmogorov-Arnold Networks (KAN) to obtain more complex dependencies within the text, achieving non-linear modeling patterns and interpretability;

[0063] For step S32, given that the feed-forward network (FFN) in the standard Transformer has problems such as high inference computation overhead, low parameter efficiency, and easy overfitting during model training, this invention introduces Kolmogorov-Arnold Networks (KAN) as a replacement for FFN to handle the complex nonlinear mapping and high-dimensional processing of semantic features of text tokens, obtain more complex dependencies within the text, realize nonlinear modeling mode and interpretability, and improve the model's advantages in terms of representation ability, parameter efficiency, and training complexity.

[0064] Taking the semantic features of the text tokens as input, the implementation process is as follows:

[0065]

[0066] where the output denotes the language features of the text tokens after KAN processing.
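To make the idea of a KAN layer concrete, here is a minimal sketch in which every input-output edge carries its own learnable univariate function and each output is the sum of those functions over the inputs. Published KAN implementations typically parameterize the edge functions with B-splines; this sketch swaps in a small Gaussian radial basis purely for brevity, and all names, centers, and coefficients are hypothetical.

```python
import math


def rbf_basis(x, centers, width=1.0):
    """Evaluate Gaussian basis functions at a scalar x (stand-in for B-splines)."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]


def kan_layer(x, coeffs, centers):
    """Apply one KAN-style layer.

    coeffs[j][i][k] is the weight of basis function k on the edge from
    input i to output j, so output_j = sum_i phi_ji(x_i).
    """
    out = []
    for edge_row in coeffs:
        total = 0.0
        for x_i, edge in zip(x, edge_row):
            basis = rbf_basis(x_i, centers)
            total += sum(w * b for w, b in zip(edge, basis))
        out.append(total)
    return out


# Two inputs, two outputs, three basis functions per edge (toy values).
centers = [-1.0, 0.0, 1.0]
coeffs = [
    [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
]
y = kan_layer([0.0, 0.0], coeffs, centers)
```

Unlike a feed-forward layer, which applies a fixed activation after a linear map, the nonlinearity here lives on the edges and is itself learned.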

[0067] S33. Design text encoding units based on text query-key filtering self-attention (QKFSA) and the Kolmogorov-Arnold Network (KAN). As shown in Figure 3, with the encoder part of the Transformer as a reference, multiple cascaded text encoding units are stacked to construct a text encoding module, realizing deep-level redundant information filtering and interaction modeling, so as to obtain a fine-grained semantic representation of the text tokens.

[0068] S34. Based on the global representation of the text context obtained from each layer of text encoding unit, xLSTM is introduced to obtain the global summary representation of the text context, which is used as the filtering benchmark for image objects in subsequent processes.

[0069] The implementation process of the self-attention method for text query-key filtering in step S31 is as follows:

[0070] Taking the semantic features of the text tokens as input, a tokens query matrix, a tokens key matrix, and a tokens value matrix of the same dimension are obtained. Taking the text summary semantic features as input, a summary query matrix and a summary key matrix of the same dimension are obtained:

[0071]

[0072] where $\mathrm{Linear}_3(\cdot)$ and $\mathrm{Linear}_2(\cdot)$ denote applying three and two different linear transformations, respectively, to the content within the parentheses;

[0073] The feature information in the original tokens query matrix and key matrix is filtered and modulated using the summary query matrix and summary key matrix, and a gating unit is constructed on this basis. According to the importance of each token within the entire text sequence, an optimized tokens query matrix and key matrix are obtained:

[0074]

[0075] A multi-head attention unit is used to compute the relevance scores between the optimized tokens query matrix and key matrix, and these scores are applied to the tokens value matrix to obtain the semantic features of the text tokens after intra-modal interaction:

[0076]

[0077] where $\mathrm{Linear}_i(\cdot)$ denotes the linear transformation layer of the i-th head; $d_{scale}$ denotes the scaling factor; and $i \in [1, 2, \dots, 8]$ indicates that the multi-head attention unit uses 8 heads.
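The overall flow of query-key filtering followed by scaled dot-product attention can be sketched as below. The exact gating formula is not reproduced in the text, so the gate used here (a sigmoid of the elementwise product with the summary projection, applied as an elementwise scale) is an assumption chosen only for illustration. The sketch uses a single head rather than 8, omits the learned linear projections, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gate(mat, summary_mat):
    """Hypothetical gate: scale each entry by sigmoid(entry * summary entry)."""
    return [[sigmoid(a * s) * a for a, s in zip(row, s_row)]
            for row, s_row in zip(mat, summary_mat)]


def filtered_attention(q, k, v, q_sum, k_sum):
    """Single-head attention over gated queries and keys."""
    q_f, k_f = gate(q, q_sum), gate(k, k_sum)
    scale = math.sqrt(len(q[0]))
    out = []
    for q_row in q_f:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale
                  for k_row in k_f]
        weights = softmax(scores)
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


# Toy example: two tokens with two-dimensional features.
q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
s = [[0.5, 0.5], [0.5, 0.5]]  # broadcast summary projections
out = filtered_attention(q, k, v, s, s)
```

The design point is that filtering happens before the relevance scores are computed, so tokens the summary marks as unimportant contribute less to every attention weight.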

[0078] Each layer of text encoding unit obtains a text summary feature representation from the semantic features of the text tokens output by that layer. Because text summary features from different layers emphasize different information, step S34 of this invention introduces an xLSTM network to capture and integrate the complex relationships between high-level and low-level text summary features, generating a joint representation of the text summary features. The aim is to make the key information complementary, so that the joint representation can serve as the filtering benchmark for image objects in subsequent processes:

[0079]

[0080] where the input is the joint matrix representation of the text summary features of all layers, and $h_t$ is the hidden state generated by xLSTM at time step t.

[0081] The specific implementation process of step S4 includes:

[0082] S41. Based on the semantic features of the image objects extracted by Faster R-CNN, a Visual Optimized Self-Attention (VOSA) method is designed for fine-grained optimization. It uses the jointly represented text summary features, the absolute positioning features $Pos_{abs}$, and the relative positioning features $Pos_{rel}$ to model the dependencies among image objects accurately and efficiently. The specific implementation process is as follows:

[0083] Taking the semantic features of the image objects as input, an objects query matrix, an objects key matrix, and an objects value matrix of the same dimension are obtained;

[0084] Using the jointly represented text summary features, a gated filter θ is designed to filter the information in the objects query matrix and key matrix, encouraging the model to assign more weight to key objects:

[0085]

[0086] The absolute positioning features $Pos_{abs}$ are fused in to perform a second round of information filtering and modulation on the filtered objects query matrix $Q_\theta$ and key matrix $K_\theta$, so that the model focuses more on locating and selecting key objects:

[0087]

[0088] where dot(·) denotes the dot product of a matrix and a vector;

[0089] Based on an improved multi-head attention unit, and using the relative positioning features $Pos_{rel}$ of the image objects as prior information, fine-grained feature representations of the image objects after intra-modal interaction are obtained:

[0091] S42. To obtain the correlation between image objects and language instruction tokens, a Visual-Textual Alignment Attention (VTAA) technique is designed to learn the alignment relationship between text tokens and image objects, further refining the semantic features of image objects. The specific implementation process is as follows:

[0092] A query matrix is obtained from the semantic features of the image objects after intra-modal interaction, and a key matrix and a value matrix are obtained from the fine-grained semantic features of the text tokens:

[0093]

[0094] The image object features are optimized using the absolute positioning features:

[0095]

[0096] Calculate the relevance score between image objects and text token pairs, and obtain the semantic features of the image objects after alignment and interaction based on this score.

[0097]

[0098] S43. Benefiting from KAN's powerful function approximation ability and its optimization capabilities in model performance and computational efficiency, this embodiment of the invention utilizes KAN to handle the complex nonlinear mapping and high-dimensional processing problems of semantic features of image objects, aiming to capture the complex dependencies of image objects more deeply:

[0099]

[0100] S44. As shown in Figure 4, with the decoder architecture of the Transformer as a reference, an image encoding unit is designed and an image encoding module is further constructed, realizing deep-level redundant information filtering and interactive alignment between image objects and text tokens, so as to obtain a fine-grained semantic representation of the image objects.

[0101] The specific implementation process of step S5 includes:

[0102] S51. Design an attention compression unit based on KAN to integrate the semantic feature information of the text tokens after optimized interaction modeling, obtaining text semantic features in vector form. The input is the semantic features of the text tokens output by the text encoding module. Because the text tokens contribute unequally, and in order to focus on the key tokens with higher scores, KAN is first used to calculate the attention weights of the text tokens:

[0103]

[0104] Secondly, a weighted summation strategy is used to calculate the text semantic features in vector form. It is worth noting that these features are highly representative after multiple layers of screening and filtering.

[0105]
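The two-stage compression in step S51 (score each token, then take a weighted sum) can be sketched as follows. The scoring function is left as a parameter: in the method it is a KAN, while the example below substitutes a trivial stand-in (the sum of the token's features). Normalizing the scores with a softmax is also an assumption, and all names are hypothetical.

```python
import math


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def compress_tokens(tokens, score_fn):
    """Collapse (m x d) token features into one d-dimensional vector.

    score_fn maps a token feature vector to a scalar score; here it is a
    placeholder for the KAN-based scorer described in the text.
    """
    weights = softmax([score_fn(tok) for tok in tokens])
    dim = len(tokens[0])
    vector = [sum(w * tok[k] for w, tok in zip(weights, tokens))
              for k in range(dim)]
    return weights, vector


# Toy token features and a stand-in scoring function.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, vector = compress_tokens(tokens, score_fn=lambda tok: sum(tok))
```

Tokens with higher scores dominate the resulting vector, which is how the compression keeps the key words of the instruction and suppresses the rest.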

[0106] S52. Furthermore, the text semantic features in vector form are expanded to the same dimension as the semantic features of the image objects. Meanwhile, to reduce the discrepancy between the text semantic features and the image object features and to strengthen their correlation, this invention uses a gating-mechanism-based method to fuse the two and obtain the fused features:

[0107]

[0108] where $\mathbf{1}_n$ is an all-ones vector of length n, and ρ is the designed gating mechanism.

[0109] The specific implementation process of step S6 includes:

[0110] S61. Using linear transformation operations, the fused features are projected into a vector and a 4-dimensional matrix. The former is used to rank and classify all image objects, while the latter is used to regress the coordinates of all image objects:

[0111]

[0112] S62. First, the true scores of the n image objects are obtained from the IoU scores between the bounding box of the unique ground-truth image object and the bounding boxes of all objects. Then, based on the ranking score vector $C_{score}$, a KL divergence loss is constructed to measure the KL divergence between the predicted and true scores of the n image objects:

[0113]

[0114] Then, the overall objective loss function is set based on the classification and regression strategies:

[0115]

[0116] Here, ξ is a balancing parameter, which is set to 0.5 in this embodiment. Further, backpropagation is performed by calculating the loss to update the model parameters, ultimately obtaining the object target and its coordinates related to the language instructions.
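The loss construction in step S62 can be sketched under several stated assumptions: the IoU scores are normalized into a target distribution, the predicted ranking scores are passed through a softmax, and the overall loss adds the regression loss weighted by ξ to the KL term. None of these specifics is given in the text, so the sketch shows only the general shape of the computation. The regression loss is passed in as a precomputed number, and all names are hypothetical.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def normalize(scores, eps=1e-8):
    """Turn non-negative scores into a distribution (eps avoids zeros)."""
    total = sum(scores) + eps * len(scores)
    return [(s + eps) / total for s in scores]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_loss(gt_box, boxes, logits, reg_loss, xi=0.5):
    """Assumed form: KL term plus xi times a precomputed regression loss."""
    target = normalize([iou(gt_box, b) for b in boxes])
    pred = softmax(logits)
    return kl_divergence(target, pred) + xi * reg_loss


# Toy example with three candidate boxes.
gt = (0.0, 0.0, 2.0, 2.0)
boxes = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0), (5.0, 5.0, 6.0, 6.0)]
loss = total_loss(gt, boxes, logits=[2.0, 0.0, -2.0], reg_loss=0.0)
```

Using IoU-derived soft targets rather than a one-hot label rewards the model for ranking near-miss boxes above boxes that do not overlap the ground truth at all.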

[0117] Of course, the present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, can implement the method described in the previous embodiment of the present invention.

[0118] Since the computer-readable storage medium described in Embodiment 2 of this invention is the one used to implement the adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships of Embodiment 1, those skilled in the art can understand its specific structure and variations from the method described in Embodiment 1, so these are not repeated here. All computer-readable storage media used in the method of Embodiment 1 of this invention fall within the scope of protection of this invention.

[0119] The present invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, is able to implement the method described in Embodiment 1.

[0120] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships, characterized in that it comprises:

S1. Extracting, from the visual image acquired by the shipborne vision sensor, the semantic representation of the image objects and the location information corresponding to each object;

S2. Extracting the semantic representation of the text tokens from the language instruction corresponding to the visual image, and calculating the summary representation of the text;

S3. Performing intra-modal alignment on the extracted semantic representation of the text tokens to obtain a fine-grained semantic representation of the text tokens, wherein step S3 specifically comprises:

S31. Designing a self-attention method with text query-key filtering to capture the long-distance dependencies of the text tokens as well as local and global context information, thereby achieving intra-modal interaction modeling;

S32. Introducing Kolmogorov-Arnold networks (KAN) to capture more complex dependencies within the text, achieving non-linear pattern modeling and interpretability;

S33. Designing a text encoding unit based on the query-key filtered self-attention and the Kolmogorov-Arnold network; with the encoder part of the Transformer as a reference, stacking multiple cascaded text encoding units to construct a text encoding module that performs deep redundant-information filtering and interaction modeling, so as to obtain the fine-grained semantic representation of the text tokens;

S34. Based on the global representation of the text context obtained by each text encoding unit, introducing an xLSTM network to obtain the global summary representation of the text context, which serves as the filtering benchmark for image objects in subsequent processing; the xLSTM network captures and integrates the complex relationships between high-level and low-level text summary features and generates a joint representation of the text summary features, where the input is the joint matrix representation of the summary features of all text layers and the xLSTM generates a hidden state variable at each time step;

S4. Performing intra-modal and inter-modal interaction alignment on the extracted semantic representation of the image objects to obtain a fine-grained semantic representation of the image objects, wherein step S4 specifically comprises:

S41. Based on the semantic features of the image objects extracted by Faster R-CNN, designing a visual-optimization self-attention method for fine-grained optimization, which uses the jointly represented text summary features together with the absolute positioning features and the relative positioning features to model the dependencies among the image objects, specifically: from the input semantic features of the image objects, obtaining the objects query matrix, the objects key matrix, and the objects value matrix of the same dimension; using the jointly represented text summary features, designing a gated filter that filters the information in the objects query matrix and the objects key matrix; fusing the absolute positioning features with the filtered objects query matrix and key matrix to perform a second round of information filtering and modulation, where the product involved is the dot product of a matrix and a vector; and, based on an improved multi-head attention unit that uses the relative positioning features of the image objects as prior information, obtaining the fine-grained feature representation of the image objects after intra-modal interaction;

S42. To obtain the correlation between the image objects and the language instruction tokens, designing a visual-text alignment attention learning method that learns the alignment relationship between the text tokens and the image objects and refines the semantic features of the image objects, specifically: obtaining the query matrix from the semantic features of the image objects after intra-modal interaction; obtaining the key matrix and the value matrix from the fine-grained semantic features of the text tokens; optimizing the image object features using the absolute positioning features; and calculating the relevance score between each image object and text token pair, and obtaining from this score the semantic features of the image objects after alignment and interaction;

S43. Using KAN to handle the complex non-linear mapping and high-dimensional processing of the semantic features of the image objects;

S44. Based on the decoder architecture of the Transformer, designing an image encoding unit and further building an image encoding module that performs deep redundant-information filtering and interaction alignment between the image objects and the text tokens, so as to obtain the fine-grained semantic representation of the image objects;

S5. Compressing and integrating the fine-grained semantic representation of the text tokens to obtain a text semantic representation in vector form, and fusing the text semantic representation with the cross-modal semantic representation of the image objects to obtain fused features, wherein step S5 specifically comprises:

S51. Integrating the semantic feature information of the text tokens after optimized interaction modeling to obtain the text semantic features in vector form: taking the semantic features of the text tokens output by the text encoding module, first using KAN to calculate the attention weight of each text token, and then using a weighted summation strategy to calculate the text semantic features in vector form;

S52. Extending the text semantic features in vector form to the same dimension as the semantic features of the image objects, and fusing the two with a gating-mechanism-based method to obtain the fused features, where the extension uses an all-ones vector and the fusion uses the designed gating mechanism;

S6. Projecting the fused features and constructing a loss function based on the classification-ranking and regression strategy to obtain the target related to the language instruction and its location.
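Steps S51 and S52 of claim 1 can be illustrated with a minimal sketch. The claim's formulas are not reproduced in this text, so everything specific below is an assumption: `score_fn` is a hypothetical stand-in for the KAN that scores each token, and `gate_fn` is a hypothetical sigmoid gate of the form g·object + (1 − g)·text. Only the overall flow (attention-weighted pooling, broadcast to every object, gated fusion) follows the claim.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pool_text(tokens, score_fn):
    """S51 sketch: score each token, normalize, and take the weighted sum."""
    weights = softmax([score_fn(t) for t in tokens])
    dim = len(tokens[0])
    pooled = [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights


def gated_fuse(objects, text_vec, gate_fn):
    """S52 sketch: pair the text vector with every object row and gate the mix."""
    fused = []
    for obj in objects:
        g = gate_fn(obj, text_vec)  # scalar gate in (0, 1), assumed form
        fused.append([g * o + (1.0 - g) * t for o, t in zip(obj, text_vec)])
    return fused


# Hypothetical scorer and gate; the patent uses a KAN and its own gate design.
def score_fn(token):
    return sum(token)


def gate_fn(obj, text_vec):
    return sigmoid(sum(a * b for a, b in zip(obj, text_vec)))


tokens = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 1.0], [0.0, 0.0]]
text_vec, weights = pool_text(tokens, score_fn)
fused = gated_fuse(objects, text_vec, gate_fn)
```

Pairing `text_vec` with each object row has the same effect as the broadcast by an all-ones vector described in S52.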

2. The adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships according to claim 1, characterized in that step S1 specifically comprises:

S11. Using a Faster R-CNN network with ResNet152 as the backbone to extract the semantic representation of the objects in the target image and the coordinates of each object;

S12. Based on the coordinate information, performing absolute and relative positioning on the image objects, and obtaining the absolute positioning features and the relative positioning features through an encoding design;

wherein, for step S11, the Faster R-CNN network with ResNet152 as the backbone extracts the semantic representation of the objects in the target image and the coordinates corresponding to the objects: the number of objects contained in the image is set to 100, the feature dimension of each object is set to 768, and the coordinate information of each object has 4 dimensions, namely the coordinates of the bottom-left and top-right corners of the object;

for step S12, the image objects are processed and encoded on the basis of the original coordinate information to obtain effective absolute and relative positioning features, thereby enhancing the model's ability to locate, recognize, and reason about the target object, specifically: transforming the corner coordinates of each object into the coordinates of its center point together with its width and height; scaling the absolute position of each object relative to the image, using the width and height of the image, and applying a linear transformation function to obtain the absolute coordinate features of the objects, which enhances the model's learning ability and target localization ability; and constructing the relative coordinate features from the relative relationship between each pair of objects by means of a high-dimensional embedding algorithm.
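The coordinate handling in step S12 can be sketched as follows. The corner-to-center conversion and the scaling by image width and height follow the claim; the pairwise relative feature used here (center offsets normalized by the first box's size, plus log size ratios) is an assumed, commonly used form, because the claim's own formula is not reproduced in this text. The linear transformation and the high-dimensional embedding are omitted.

```python
import math


def to_center_form(box):
    """Convert (x1, y1, x2, y2) corner coordinates to (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1)


def absolute_features(box, img_w, img_h):
    """Scale the center-form box by the image width and height."""
    cx, cy, w, h = to_center_form(box)
    return (cx / img_w, cy / img_h, w / img_w, h / img_h)


def relative_features(box_i, box_j):
    """Assumed pairwise form: normalized center offsets and log size ratios."""
    cxi, cyi, wi, hi = to_center_form(box_i)
    cxj, cyj, wj, hj = to_center_form(box_j)
    return (
        (cxj - cxi) / wi,
        (cyj - cyi) / hi,
        math.log(wj / wi),
        math.log(hj / hi),
    )


box_a = (10.0, 20.0, 50.0, 60.0)
box_b = (30.0, 20.0, 70.0, 60.0)
abs_a = absolute_features(box_a, img_w=100.0, img_h=200.0)
rel_ab = relative_features(box_a, box_b)
```

In the patent these raw features would then pass through the linear transformation (absolute) or the high-dimensional embedding (relative) before entering the attention layers.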

3. The adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships according to claim 1, characterized in that step S2 specifically comprises:

S21. Cleaning, segmenting, and encoding the text-form language instruction to obtain a list of text tokens;

S22. Using the pre-trained language model RoBERTa to extract the semantic feature representation of the text token list;

S23. Calculating the global representation of the text context based on the semantic feature representation of the text tokens;

wherein, for step S21, the text-form language instruction associated with the image is taken as input; the text is cleaned to remove noise interference; a BPE tokenizer performs word segmentation; the maximum sequence length is set to 30, with shorter sequences padded and longer sequences truncated; and the segmented sequence is converted into an ID sequence and an attention mask is generated, yielding the input tensor;

for step S22, once the text is in the required input format, the pre-trained language model RoBERTa extracts features from it to obtain the semantic representation of the text tokens, where the number of tokens contained in the text sequence is set to 30 and the feature dimension of each token is set to 768;

for step S23, based on the obtained semantic feature representation of the text tokens, the average pooling method is used to calculate the text summary feature representation, which is then broadcast to the same dimension as the semantic features of the text tokens, specifically: average pooling yields the summary representation of the text, and the outer product of an all-ones vector with this summary representation yields the broadcast text summary feature.
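The padding and truncation rule of step S21 and the pooling and broadcast of step S23 can be sketched in plain Python. The BPE tokenizer and RoBERTa are not called here; `pad_id=1` is an assumption chosen to match RoBERTa's usual padding ID, and the toy feature rows stand in for the 30 × 768 token features.

```python
def pad_or_truncate(ids, max_len=30, pad_id=1):
    """Return (ids, attention_mask), both of length max_len."""
    ids = list(ids)[:max_len]
    n_pad = max_len - len(ids)
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, mask


def mean_pool(features):
    """Average the token feature rows into one summary vector."""
    n = len(features)
    dim = len(features[0])
    return [sum(row[d] for row in features) / n for d in range(dim)]


def broadcast_summary(summary, n_tokens):
    """Outer product of an all-ones vector with the summary vector."""
    ones = [1.0] * n_tokens
    return [[o * s for s in summary] for o in ones]


short_ids, short_mask = pad_or_truncate([0, 5, 2])
long_ids, long_mask = pad_or_truncate(range(40))

toy_features = [[1.0, 3.0], [3.0, 5.0]]
summary = mean_pool(toy_features)
summary_rows = broadcast_summary(summary, n_tokens=len(toy_features))
```

This sketch averages over all positions, as the claim states; whether padded positions are excluded from the average is not specified in the text.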

4. The adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships according to claim 1, characterized in that the self-attention method with text query-key filtering in step S31 is implemented as follows: taking the semantic features of the text tokens as input, the tokens query matrix, tokens key matrix, and tokens value matrix of the same dimension are obtained; taking the text summary semantic features as input, the summary query matrix and summary key matrix of the same dimension are obtained, where three and two different linear transformations, respectively, are applied to the inputs; the summary query and key matrices are used to filter and modulate the feature information in the original tokens query and key matrices, and a gating unit is constructed on this basis, so that tokens are filtered and selected according to their importance to the entire text, yielding the optimized tokens query matrix and key matrix; a multi-head attention unit then computes the relevance score between the optimized query matrix and key matrix and applies it to the tokens value matrix to obtain the semantic features of the text tokens after intra-modal interaction, where each head has its own linear transformation layer, a scaling factor is applied to the scores, and the multi-head attention unit is set to 8 heads.
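The query-key filtered self-attention of claim 4 can be sketched in plain Python. The exact gating formula is not reproduced in this text, so the gate below (each tokens query/key entry multiplied by the sigmoid of the matching summary entry) is an assumed form, as are the output projection `w["o"]`, the scaling by the square root of the per-head dimension, and the random weights from the hypothetical helper `random_weights`.

```python
import math
import random


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(mat, summary_mat):
    """Assumed gating unit: modulate each entry by the summary's sigmoid."""
    return [[a * sigmoid(b) for a, b in zip(r, s)] for r, s in zip(mat, summary_mat)]


def filtered_self_attention(tokens, summary, w, heads=8):
    """tokens and summary are N x d lists; w maps names to d x d matrices."""
    q, k, v = (matmul(tokens, w[name]) for name in ("q", "k", "v"))
    qs, ks = (matmul(summary, w[name]) for name in ("qs", "ks"))
    q, k = gate(q, qs), gate(k, ks)  # query-key filtering
    dh = len(q[0]) // heads
    out = [[] for _ in tokens]
    for h in range(heads):
        lo, hi = h * dh, (h + 1) * dh
        for i, qi in enumerate(q):
            scores = [
                sum(a * b for a, b in zip(qi[lo:hi], kj[lo:hi])) / math.sqrt(dh)
                for kj in k
            ]
            attn = softmax(scores)
            out[i].extend(sum(p * vj[c] for p, vj in zip(attn, v)) for c in range(lo, hi))
    return matmul(out, w["o"])


def random_weights(dim, seed=0):
    """Hypothetical helper: small random d x d matrices for the sketch."""
    rng = random.Random(seed)
    names = ("q", "k", "v", "qs", "ks", "o")
    return {n: [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)] for n in names}
```

A real model would use 768-dimensional features and learned weights; the small dimension used in a quick check only needs to be divisible by the 8 heads.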

5. The adaptive alignment cross-modal vision-language intelligent human-computer interaction method for ships according to claim 1, characterized in that step S6 specifically comprises:

S61. Using linear transformation operations to project the fused features into a vector and a 4-dimensional coordinate matrix, the former being used to rank and classify all image objects and the latter being used to regress the coordinates of all image objects;

S62. First, obtaining the IoU score between the bounding box of the unique ground-truth image object and the bounding boxes of all objects as the true score of each image object, and, based on the ranking score vector, constructing a KL divergence loss that measures the KL divergence between the predicted scores and the true scores of the image objects; then, setting the overall objective loss function based on the classification and regression strategies, where the balancing parameter is set to 0.5 in this embodiment; the model parameters are updated by backpropagation of the calculated loss, and finally the target object related to the language instruction and its coordinate position are obtained.
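The loss construction of step S62 can be sketched as below. The IoU scoring, the KL divergence between predicted and true scores, and the balancing parameter of 0.5 come from the claim. Several details are assumptions, because the formulas are not reproduced in this text: the IoU scores are normalized to sum to 1 to form the target distribution, the predicted scores are passed through a softmax, the regression term is a smooth L1 loss, and the balancing parameter weights the regression term.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def target_distribution(gt_box, boxes):
    """Assumed normalization: IoU scores rescaled to sum to 1."""
    scores = [iou(gt_box, b) for b in boxes]
    total = sum(scores)
    if total == 0:
        return [1.0 / len(boxes)] * len(boxes)
    return [s / total for s in scores]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, logits):
    """KL(target || softmax(logits)); zero-probability targets contribute 0."""
    pred = softmax(logits)
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)


def smooth_l1(pred_box, gt_box):
    """Assumed regression loss: smooth L1 summed over the 4 coordinates."""
    total = 0.0
    for p, g in zip(pred_box, gt_box):
        d = abs(p - g)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def total_loss(kl, reg, lam=0.5):
    """Overall objective; lam=0.5 per the embodiment, its placement is assumed."""
    return kl + lam * reg


gt = (0.0, 0.0, 2.0, 2.0)
boxes = [(0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0), (5.0, 5.0, 6.0, 6.0)]
target = target_distribution(gt, boxes)
loss = total_loss(kl_divergence(target, [2.0, 1.0, 0.0]), smooth_l1(boxes[0], gt))
```

With these boxes, the perfectly matching object receives most of the target mass and the disjoint object receives none, so the KL term pushes the ranking scores toward the best-overlapping object.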