Image analysis method based on multi-modal large model and AI assistant system
By constructing a radio astronomy VQA dataset and optimizing a multimodal large-scale language model, the adaptability problem of image classification and visual question answering in radio astronomy was solved, achieving efficient image analysis and question answering capabilities, and improving scientific research efficiency and data interpretation capabilities.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUIZHOU UNIV
- Filing Date
- 2025-09-23
- Publication Date
- 2026-06-26
AI Technical Summary
In the existing technology, general-purpose multimodal large language models (MLLMs) lack adaptability in the field of radio astronomy, making it difficult to simultaneously complete image classification and visual question answering tasks. They also lack knowledge in the field of radio astronomy, and lack standardized multimodal datasets and complex analysis tools, resulting in low research efficiency.
We construct an image analysis method based on a multimodal large-scale language model. By preprocessing radio astronomy data, we generate a visual question answering (VQA) dataset, optimize the model using LoRA quantization technology, design multimodal prompt templates, realize multi-step logical deduction between images and text, and provide an AI assistant system for image analysis.
It improves the efficiency of radio astronomy image analysis, increases the accuracy of the model in identifying pulsar candidates, and can output question-and-answer results that conform to professional logic. It reduces the operating costs and technical thresholds for researchers and promotes the sharing and intelligent implementation of data resources.
[Figure: abstract drawing of CN121168660B]
Description
Technical Field
[0001] This invention belongs to the fields of artificial intelligence and industrial internet technology, specifically involving semantic understanding, knowledge graphs, machine question answering, reasoning models and other technologies, and in particular relates to an image analysis method and AI assistant system based on multimodal large language models (MLLMs).
Background Technology
[0002] In recent years, radio astronomy, a key discipline for exploring the universe, has relied on large-scale observation facilities such as the Square Kilometre Array (SKA) and the Five-hundred-meter Aperture Spherical radio Telescope (FAST) to acquire massive amounts of data, and these facilities have driven explosive growth in radio astronomy data. Faced with such a data stream, traditional manual analysis is inefficient and cannot meet the research need for real-time interpretation of celestial signals (such as pulsars, radio bursts, and galaxy morphology). Therefore, automated and intelligent image analysis technologies are urgently needed.
[0003] In radio astronomy data processing, machine learning and deep learning technologies have been widely applied. Traditional machine learning models (such as support vector machines and random forests) complete classification tasks by extracting manually designed statistical features (such as flux, periodicity, and dispersion), but they depend heavily on the experience of domain experts and generalize poorly. Deep learning models (such as convolutional neural networks and Transformers) can learn image features automatically and have made progress in tasks such as pulsar recognition and radio galaxy morphology classification, but each model is bound to a single task. For example, a model designed for pulsar classification cannot be directly adapted to radio burst detection, and such models ignore the supplementary value of textual context (such as observation reports) and numerical attributes, resulting in weak cross-task generalization.
[0004] With the development of artificial intelligence technology, multimodal large language models (MLLMs) have shown great potential in fields such as medicine and optical astronomy because they can integrate multi-source information such as text and images. They can output classification results and generate text explanations that follow human reasoning. However, the application of such models to radio astronomy is still in its infancy. General MLLMs have not been trained for radio astronomy and lack an understanding of domain-specific features such as pulsar dispersion measures and the lobe structure of radio galaxies, which leads to insufficient accuracy in analyzing radio astronomy images and even to misclassification of noise signals. At the same time, most existing astronomical datasets carry single-task labels (for example, classification labels only) and lack visual question answering (VQA) samples that associate images with text, so they cannot support the training and performance evaluation of MLLMs, which restricts the use of MLLMs in radio astronomy research.
[0005] Therefore, current technology faces three core problems. First, general-purpose MLLMs are poorly adapted to radio astronomy tasks, and because they lack domain-specific knowledge, they struggle to complete the two core tasks of image classification and visual question answering at the same time. Second, there is no standardized VQA dataset in the field of radio astronomy; existing data is scattered across different observation projects, with inconsistent formats and no multimodal annotation, and therefore cannot effectively support model training. Third, existing analysis tools are complex to operate, and there is a shortage of interdisciplinary talent with expertise in both astronomy and artificial intelligence. Researchers must therefore invest significant time in learning these tools, which reduces data interpretation efficiency and hinders knowledge discovery in radio astronomy.
Summary of the Invention
[0006] The main objective of this invention is to provide an image analysis method and AI assistant system based on a multimodal large model, filling the gap in dedicated datasets in the field of radio astronomy, while improving the model's ability to analyze radio astronomy images and its cross-task generalization ability.
[0007] Based on the first main aspect of the present invention, an image analysis method based on a multimodal large model is provided, comprising the following steps:
[0008] Input raw image analysis data into a computer system on which a multimodal large language model is deployed;
[0009] The computer system preprocesses the raw data, divides it into a training set and a test set according to the class balance principle, extracts feature images and unifies their format to obtain the labeled dataset for the classification task;
[0010] The labeled dataset for the classification task is transformed into visual question answering (VQA) prompts suitable for fine-tuning multimodal large language models, and a VQA dataset is constructed;
[0011] Based on the VQA dataset, knowledge injection and inference model optimization are performed on the multimodal large language model, including: constructing a fine-tuning framework using LoRA quantization technology, encoding images into visual embeddings and mapping them to the latent embedding space of the multimodal large language model, designing multimodal prompt templates that integrate visual tags, text queries and task identifiers, adapting training parameters according to the model parameter scale, enabling the multimodal large language model to learn and structurally store domain knowledge, and optimizing its inference mechanism based on multimodal input;
[0012] The system invokes a multimodal large-scale language model optimized by knowledge injection and reasoning to fuse the image to be analyzed with the corresponding prompt template into a multimodal input. The multimodal large-scale language model calls its learned domain knowledge and optimized reasoning model to perform multi-step logical deduction, complete the image analysis, and output the analysis results generated by the multimodal large-scale language model reasoning.
[0013] As a further preferred option, in the aforementioned method, the raw image analysis data includes four types of open-source radio astronomy raw images and associated annotation information, wherein the annotation information includes astrophysical parameters and observation feature descriptions;
[0014] The four types of open-source radio astronomy raw data include: pulsar candidate images and labels from the FAST dataset, pulsar candidate images and physical property parameters from the HTRU Medlat dataset, solar radio spectrum sample images and caption information from the Spectrumcls dataset, and radio galaxy source images and morphological annotations from the Radio Galaxy dataset.
[0015] As a further preferred embodiment, in the aforementioned method, the extraction of feature images during the preprocessing of the original data includes: extracting profile plots, phase plots, sub-band plots, and dispersion measure (DM) plots from FAST dataset samples; extracting three plots (phase, sub-band, and DM) from HTRU Medlat dataset samples; and uniformly adjusting the Spectrumcls dataset sample images to a preset resolution.
[0016] As a further preferred embodiment, in the aforementioned method, the step of converting the labeled dataset for the classification task into VQA prompts suitable for fine-tuning multimodal large language models includes:
[0017] The image input is designed in three modes: single, multiple, and combined. In single mode, only a single frame image is used as input. In multiple mode, all available images are input into the visual encoder one by one, and the features are stitched together to complete the projection. In combined mode, all available images are integrated into a single input image and fed into the visual encoder.
[0018] Numerical features and rules of thumb are embedded in the prompt template to determine the optimal feature set for input to a large multimodal language model and to integrate domain knowledge into task processing;
[0019] The process of constructing prompts is represented as follows:
[0020] $$P = T(F, R)$$
[0021] where $T$ is the prompt template, $F$ is the feature set, and $R$ is the set of constraint rules.
[0022] As a further preferred embodiment, in the aforementioned method, the LoRA quantization technique employs 4-bit quantization processing. When encoding the image into a visual embedding and mapping it to the latent embedding space of a multimodal large language model, layer normalization processing is performed through a linear projector to match the visual features with the embedding space distribution of the multimodal large language model.
[0023] As a further preferred embodiment, in the aforementioned method, the fine-tuning framework includes a visual transformer image encoder, a linear projector, and a frozen pre-trained model.
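The mapping of visual features into the language model's embedding space can be illustrated with a minimal NumPy sketch. The dimensions, the random weights, and the choice to apply layer normalization before the linear projection are assumptions made for illustration; the text states only that a linear projector with layer normalization matches the visual features to the embedding distribution.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def project_visual_features(patch_features, weight, bias):
    """Map encoder outputs (n_patches, d_vision) to the LLM space (n_patches, d_llm).

    Assumption: normalization is applied before the linear map.
    """
    return layer_norm(patch_features) @ weight + bias

rng = np.random.default_rng(0)
n_patches, d_vision, d_llm = 4, 8, 16  # hypothetical sizes
patch_features = rng.normal(size=(n_patches, d_vision))  # stand-in for ViT outputs
weight = rng.normal(scale=0.02, size=(d_vision, d_llm))  # trainable projector weight
bias = np.zeros(d_llm)

visual_embeddings = project_visual_features(patch_features, weight, bias)
print(visual_embeddings.shape)
```

In such a setup only the projector (and any adapter weights) would be trained, while the pre-trained language model stays frozen.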
[0024] Based on the second main aspect of the present invention, a method for constructing a visual question answering (VQA) dataset is provided, comprising the following steps:
[0025] Input the original image and associated annotation information into the computer system; preprocess the original image and associated annotation information, divide the training set and test set according to the class balance principle, extract the feature images of each dataset sample and unify the format to obtain the labeled dataset for the classification task;
[0026] The VQA dataset is constructed in two stages: the first stage transforms the labeled dataset into first-stage VQA samples; the second stage transforms the observation reports into second-stage VQA samples; and the VQA samples from the two stages are merged to obtain a VQA dataset containing multimodal prompt templates.
[0027] As a further preferred option, in the aforementioned method, the first-stage visual question answering (VQA) samples are described as follows:
[0028] $$D_1 = \{(x_i, q_i, a_i)\}$$
[0029] where $q_i$ denotes the instruction prompt associated with the image $x_i$, and $a_i$ is a structured description tuple that includes the visual representation, data source metadata, and semantic analysis derived from the original labels;
[0030] The second-stage visual question answering (VQA) samples are described as follows:
[0031] $$D_2 = \{(x_j, q_j, f(\mathrm{OCR}(x_j), m_j))\}$$
[0032] where $\mathrm{OCR}$ denotes the optical character recognition operator, $m_j$ denotes the manual semantic annotation, and $f$ is the corresponding predefined structured response function;
[0033] The mathematical expression for the VQA dataset containing multimodal prompt templates is $D_{\mathrm{VQA}} = D_1 \oplus D_2$, where the operator $\oplus$ represents the fusion of semantically related samples.
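As a rough illustration of merging the two stages, the sketch below groups question-answer pairs from both stages by the image they refer to. Treating "semantically related" as "sharing the same image" is an assumption, and the field names (`image`, `question`, `answer`, `conversations`) and sample values are hypothetical.

```python
from collections import defaultdict

def merge_vqa_stages(stage1, stage2):
    """Fuse samples from both stages, grouping question-answer pairs by image.

    Assumption: samples are related when they refer to the same image.
    """
    grouped = defaultdict(list)
    for sample in stage1 + stage2:
        grouped[sample["image"]].append(
            {"question": sample["question"], "answer": sample["answer"]}
        )
    return [
        {"image": image, "conversations": turns}
        for image, turns in sorted(grouped.items())
    ]

# Hypothetical samples: stage 1 comes from class labels, stage 2 from reports.
stage1 = [
    {"image": "a.png", "question": "Please analyze the image.", "answer": "Likely RFI."},
    {"image": "b.png", "question": "Please analyze the image.", "answer": "Likely a pulsar."},
]
stage2 = [
    {"image": "a.png", "question": "What DM is reported?", "answer": "See the report text."},
]

merged = merge_vqa_stages(stage1, stage2)
print(len(merged), len(merged[0]["conversations"]))
```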
[0034] Based on the third main aspect of the present invention, an AI assistant system for implementing an image analysis method based on a multimodal large model is provided, comprising a computer system on which a multimodal large language model is deployed, and:
[0035] The data input module is used to receive raw image analysis data, which includes the original image and associated annotation information.
[0036] The data preprocessing module is used to preprocess the raw data, divide the training set and test set according to the class balance principle, extract feature images and unify the format, and output the labeled dataset as a classification task.
[0037] The VQA prompt generation module is used to transform labeled datasets for classification tasks into visual question answering (VQA) prompts suitable for fine-tuning multimodal large language models. This includes generating basic question-answer pairs based on classification labels, generating in-depth question-answer pairs by combining physical parameters, and constructing structured prompt templates.
[0038] The dataset construction module is used to integrate VQA prompts into a VQA dataset, perform classification annotation, split the training set and test set, and perform quality verification;
[0039] The model fine-tuning module is used to perform knowledge injection and inference model optimization on the multimodal large language model based on the VQA dataset. This includes constructing a fine-tuning framework using LoRA quantization technology, encoding images into visual embeddings and mapping them to the latent embedding space of the multimodal large language model, designing multimodal prompt templates, and adapting training parameters according to the model parameter scale.
[0040] The reasoning and analysis module is used to call an optimized multimodal large-scale language model to fuse the image to be analyzed with the corresponding prompt template into a multimodal input, and complete the image analysis by performing multi-step logical deduction through the multimodal large-scale language model;
[0041] The results output module is used to output the analysis results generated by the multimodal large-scale language model inference, including image classification labels and visual question answering text answers.
[0042] Based on the fourth main aspect of the present invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed, implements the image analysis method based on a multimodal large model as described in any one of claims 1-6.
[0043] Compared with existing technologies, this invention relies on the cross-task generalization ability of multimodal large language models (MLLMs), combined with a dedicated VQA dataset and an optimized inference mechanism, to simultaneously complete the dual tasks of image classification and visual question answering. The analysis efficiency is more than 3 times higher than that of traditional methods, and there is no need to repeatedly develop models for a single task, thus reducing the operating costs for researchers.
[0044] In terms of analytical accuracy, this invention addresses the core issue that general MLLMs lack radio astronomy knowledge. By constructing a radio astronomy VQA dataset from four types of open-source radio astronomy data, it injects domain knowledge such as pulsar dispersion measures and radio galaxy morphology characteristics into the model. Combined with LoRA quantization fine-tuning and multimodal prompt template design, the model's accuracy in identifying pulsar candidates is significantly improved. It can output logically consistent question-and-answer results, such as "the frequency band corresponding to the peak radiation intensity" and "the correlation between spectral anomalies and solar activity," meeting the stringent accuracy requirements of radio astronomy research.
[0045] The radio astronomy VQA dataset constructed in this invention fills a gap in the field and provides a standardized foundation for subsequent technology development. Through a two-stage construction method of "classification data to VQA prompts," this invention forms a standardized dataset that contains image-text association pairs and covers multiple types of celestial objects. This dataset not only meets the training requirements of MLLMs but can also be used directly by other intelligent analysis tools for radio astronomy, promoting the sharing and reuse of data resources within the field and accelerating the adoption of intelligent technologies in radio astronomy.
[0046] From an application perspective, this invention lowers the technical threshold for interpreting radio astronomy data. The AI assistant platform of this invention, through structured prompt templates and natural language interaction design, allows researchers to obtain structured reports containing classification results, parameter analysis, and reasoning basis simply by inputting the image to be analyzed, without needing to deeply understand the model principles. This enables astronomers who are not AI specialists to complete data interpretation efficiently and helps more research teams explore the scientific value of radio astronomy data.
Brief Description of the Drawings
[0047] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, obtaining other drawings based on these drawings without creative effort still falls within the scope of the present invention.
[0048] Figure 1 is a flowchart illustrating the execution of a radio astronomy image analysis method based on MLLMs in one embodiment of the present invention.
Detailed Description
[0049] The preferred embodiments of the present invention will be described in detail below to provide a clearer understanding of the purpose, features, and advantages of the invention. It should be understood that the following embodiments are not intended to limit the scope of the invention, but are merely illustrative of the essential spirit of the technical solution of the invention.
[0050] In the following description, certain specific details are set forth for the purpose of illustrating various disclosed embodiments in order to provide a thorough understanding of the various disclosed embodiments. However, those skilled in the art will recognize that embodiments may be practiced without one or more of these specific details. In other instances, well-known techniques associated with the invention may not have been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments.
[0051] Throughout this specification, references to "one embodiment" or "an embodiment" indicate that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Therefore, the appearance of "in one embodiment" or "in an embodiment" in various places throughout the specification does not necessarily refer to the same embodiment. Furthermore, a particular feature, structure, or characteristic may be combined in any manner in one or more embodiments.
[0052] In the following embodiments, the technical terms that may be involved are explained as follows:
[0053] Radio astronomy images: Visual image data formed by using radio telescopes to receive the radio-band electromagnetic radiation emitted by celestial bodies (such as pulsars and radio galaxies) and then processing and converting the signals. They are the core observational data for studying the physical characteristics of celestial bodies.
[0054] MLLMs (Multimodal Large Language Models): Large-scale artificial intelligence models capable of processing multiple modalities of data such as text and images simultaneously. They can achieve cross-modal understanding, reasoning, and generation tasks by learning from massive amounts of data, providing core algorithmic support for this invention.
[0055] VQA (Visual Question Answering) is a technical task that integrates computer vision and natural language processing. It requires the model to output a logical natural language answer based on the input image and the corresponding natural language question. It is the key technical form for realizing image depth analysis in this invention.
[0056] LoRA quantization (Low-Rank Adaptation Quantization) is an efficient fine-tuning and compression technique for large language models. It reduces computation and memory usage by introducing low-rank matrices into the model parameters while ensuring fine-tuning performance, enabling MLLMs to inject knowledge from the field of radio astronomy in a conventional hardware environment.
[0057] Visual Embedding: A high-dimensional vector representation of image data transformed by a neural network model (such as an image encoder). It can transform the visual features of an image (such as celestial morphology, signal intensity distribution, etc.) into a numerical form that MLLMs can understand, thus achieving unified processing of image and text modalities.
[0058] The labeled dataset for classification tasks is a dataset formed by manually or automatically labeling raw radio astronomy images according to preset categories (such as "pulsar", "non-pulsar", "radio galaxy" etc.). It contains image data and corresponding category labels and is the basic data for subsequent conversion into VQA prompts and training models.
[0059] Multimodal prompt template: A structured input format that integrates visual tags (associating with image data), text queries (natural language problems), and task identifiers (distinguishing between task types such as classification and parameter parsing) to guide MLLMs to accurately focus on radio astronomy analysis tasks and improve the relevance and accuracy of model inference.
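The LoRA quantization entry in the glossary above can be made concrete with a small NumPy sketch: the pre-trained weight is stored in 4-bit form and kept frozen, and only two low-rank matrices are trained. The uniform absmax quantizer, the rank, the scaling factor `alpha`, and the class name `LoRALinear` are simplifying assumptions for illustration, not the quantization scheme or hyperparameters of the invention.

```python
import numpy as np

def quantize_4bit(w):
    """Uniform absmax quantization to signed 4-bit codes in [-7, 7] (a simplification)."""
    scale = float(np.abs(w).max()) / 7.0 or 1.0  # avoid a zero scale for all-zero weights
    codes = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

class LoRALinear:
    """Frozen 4-bit base weight plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.codes, self.scale = quantize_4bit(weight)  # frozen, never updated
        self.A = rng.normal(scale=0.01, size=(rank, d_in)).astype(np.float32)
        self.B = np.zeros((d_out, rank), dtype=np.float32)  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        base = x @ dequantize(self.codes, self.scale).T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scaling * update

    def trainable_parameters(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
weight = rng.normal(size=(32, 64)).astype(np.float32)  # hypothetical layer shape
layer = LoRALinear(weight)
x = rng.normal(size=(2, 64)).astype(np.float32)
out = layer.forward(x)
print(out.shape, layer.trainable_parameters(), weight.size)
```

For this hypothetical 32 × 64 layer, the adapter trains 384 parameters instead of 2,048, which is the source of the memory savings the glossary entry describes.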
[0060] As shown in Figure 1, in one embodiment of the present invention, a radio astronomy image analysis method based on a multimodal large model includes the following steps S110-S150:
[0061] S110: Input raw data for radio astronomy image analysis into a computer system on which multimodal large language models (MLLMs) are deployed;
[0062] S120: The computer system preprocesses the raw data, divides the training set and test set according to the class balance principle, extracts feature images and unifies the format to obtain the labeled dataset for the classification task;
[0063] S130: Transform the labeled dataset for the classification task into visual question answering (VQA) prompts suitable for fine-tuning MLLMs and construct a radio astronomy VQA dataset;
[0064] S140: Based on the radio astronomy VQA dataset, perform knowledge injection and inference model optimization on the MLLMs, including: constructing a fine-tuning framework using LoRA quantization technology, encoding images into visual embeddings and mapping them to the latent embedding space of the MLLMs, designing multimodal prompt templates that integrate visual tags, text queries and task identifiers, and adapting training parameters according to the model parameter scale, so that the MLLMs learn and structurally store knowledge in the field of radio astronomy and their inference mechanism based on multimodal input is optimized;
[0065] S150: Invoke the MLLMs optimized by knowledge injection and reasoning, fuse the radio astronomy image to be analyzed with the corresponding prompt template into a multimodal input, use the domain knowledge learned by the MLLMs and the optimized reasoning model to perform multi-step logical deduction, complete the image analysis, and output the analysis results generated by the reasoning of the MLLMs.
[0066] The dataset collection and preprocessing are described in one of the following possible implementations.
[0067] In this embodiment, the original dataset used to construct the RadioAstro VQA dataset and its preprocessing steps are described. Specifically, these datasets are derived from the following four open-source projects:
[0068] The FAST dataset comprises 15,482 pulsar candidates collected by the Five-hundred-meter Aperture Spherical radio Telescope (FAST). This dataset aims to address the challenge of distinguishing real pulsars from radio frequency interference (RFI). Each sample includes a diagnostic report generated by PRESTO, containing four plots (a profile plot, a phase plot, a sub-band plot, and a dispersion measure (DM) plot) along with relevant search information. Experts manually labeled the dataset into two categories, pulsar and RFI; it contains 1,163 known pulsars and 14,319 RFI instances. To address class imbalance, in this embodiment 837 known pulsars and 1,593 RFI instances were randomly selected as the training set, and the remaining samples were used as the test set.
[0069] The HTRU Medlat dataset is a collection of labeled pulsar candidates from the mid-Galactic-latitude region of the High Time Resolution Universe (HTRU) survey. This dataset was initially assembled to train the SPINN pulsar classifier and contains 1,196 known pulsar candidates from 521 different sources and 89,996 non-pulsar radio frequency interference (RFI) candidates. Each sample includes three plots: a phase plot, a sub-band plot, and a dispersion measure (DM) plot. In this embodiment, 837 known pulsars and 1,593 RFI samples were randomly selected as the training set, while the remaining 359 known pulsars and 8,406 RFI samples were used as the test set.
[0070] The Spectrumcls dataset contains 8,816 solar radio spectrum samples acquired by the Solar Broadband Radio Spectrometer (SBRS) at the National Astronomical Observatories of the Chinese Academy of Sciences. Each image in the dataset has a resolution of 120 × 120 pixels and is labeled as burst, calibration, or non-burst. In this embodiment, 7,052 samples were randomly selected as the training set, and the remaining 1,764 samples were used as the test set.
[0071] The Radio Galaxy dataset contains 2,158 radio galaxy sources from the FIRST survey. Each image in the dataset has a resolution of 300 × 300 pixels and is labeled as FRI, FRII, Compact, or Bent. In this embodiment, 1,758 samples were randomly selected as the training set, and the remaining 400 samples were used as the test set.
[0072] In addition to the labeled images, this embodiment uses the analysis report of intermediate processing results that PRESTO provides for each FAST sample. The HTRU Medlat dataset is analyzed using the physical characteristic parameters proposed in existing research (F. Zhao, Y. Li, Y. Wang, H. Li, M. Chen, P. Chen, N. Sun, C. Wang, and J. Liu. Pulsar candidate classification with multimodal large language models. https://openreview.net/pdf?id=8SKgWpZiDL, 2024), including best period, best dispersion measure (DM), best signal-to-noise ratio, and pulse width. For the Spectrumcls dataset, the caption information of all samples is used directly as the basis for analysis.
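The random selection of a fixed number of samples per class, as described for the FAST dataset above, might be sketched as follows. The function name, the sample identifiers, and the seed are hypothetical; only the class counts are taken from the text.

```python
import random
from collections import defaultdict

def class_balanced_split(samples, train_counts, seed=0):
    """Split (sample_id, label) pairs, drawing a fixed number per class for training."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample_id, label in samples:
        by_label[label].append(sample_id)

    train, test = [], []
    for label in sorted(by_label):
        ids = sorted(by_label[label])
        rng.shuffle(ids)  # random selection within the class
        n_train = train_counts[label]
        if n_train > len(ids):
            raise ValueError(f"not enough samples for class {label!r}")
        train += [(sample_id, label) for sample_id in ids[:n_train]]
        test += [(sample_id, label) for sample_id in ids[n_train:]]
    return train, test

# Class counts from the FAST description; the identifiers are placeholders.
samples = [(f"psr_{i}", "pulsar") for i in range(1163)]
samples += [(f"rfi_{i}", "rfi") for i in range(14319)]

train, test = class_balanced_split(samples, {"pulsar": 837, "rfi": 1593})
print(len(train), len(test))
```

Capping the RFI class at 1,593 training samples brings the pulsar-to-RFI ratio in the training set to roughly 1:2, compared with about 1:12 in the full dataset.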
[0073] The following possible implementation illustrates how to transform labeled datasets for classification tasks into visual question answering (VQA) prompts suitable for fine-tuning multimodal large language models (MLLMs).
[0074] Existing deep learning methods for radio astronomy tasks typically process only a single image or a single set of physical features. However, real-world datasets often contain multiple types of images and their corresponding physical features. Therefore, the generated prompts should be flexible enough to accept one or more images, as well as other numerical and categorical attributes, as input. In this way, this embodiment eliminates the need to redesign models for the various plot types provided by different projects.
[0075] As a general process capable of handling various input types, this embodiment designs three image-input modes: single, multiple, and combined. In single mode, only one image is used as input; this mode is enabled only when a single image is used.
[0076] In multiple mode, all available images are input into the visual encoder one by one, and the resulting features are concatenated before projection. In combined mode, all available images are integrated into a single input image and fed into the visual encoder. Furthermore, in this embodiment, numerical features and relevant rules of thumb are embedded in the prompt template. The feature extraction in this process effectively captures the physical characteristics of specific radio signals. However, not all features are equally important to prediction performance.
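The difference between the three image-input modes can be sketched with a toy encoder. `toy_encoder` is a stand-in that merely flattens image patches, and the side-by-side tiling with crude subsampling in combined mode is an assumed way to fit several plots into one encoder input; a real system would use a vision transformer and proper resizing.

```python
import numpy as np

def toy_encoder(image, patch=8):
    """Stand-in visual encoder: flatten non-overlapping patches into feature rows."""
    h, w = image.shape
    patches = [
        image[top:top + patch, left:left + patch].ravel()
        for top in range(0, h - patch + 1, patch)
        for left in range(0, w - patch + 1, patch)
    ]
    return np.stack(patches)

def build_visual_input(images, mode):
    if mode == "single":
        if len(images) != 1:
            raise ValueError("single mode expects exactly one image")
        return toy_encoder(images[0])
    if mode == "multiple":
        # Encode each image separately, then concatenate the features.
        return np.concatenate([toy_encoder(img) for img in images], axis=0)
    if mode == "combined":
        # Tile the images side by side, then subsample back to one image's width.
        canvas = np.concatenate(images, axis=1)
        return toy_encoder(canvas[:, ::len(images)])
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
images = [rng.random((16, 16)) for _ in range(4)]  # e.g. four diagnostic plots

single = build_visual_input(images[:1], "single")
multiple = build_visual_input(images, "multiple")
combined = build_visual_input(images, "combined")
print(single.shape, multiple.shape, combined.shape)
```

Under these assumptions, multiple mode produces four times as many visual features as single mode, while combined mode keeps the feature count of a single image at the cost of lower per-plot resolution.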
[0077] In most embodiments, mature feature selection techniques based on various machine learning algorithms, including logistic regression, ridge regression, random forest, support vector machine (SVM), Light Gradient Boosting Machine (LightGBM), and CatBoost, can be used to evaluate the importance of these features. The optimal feature set for the MLLM input is then determined through a majority voting mechanism.
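One plausible reading of the majority voting mechanism is that each model votes for its top-ranked features and a feature is kept when more than half of the models vote for it. The sketch below follows that reading; the importance scores, the feature names, and the `top_k` cutoff are hypothetical, and in practice the scores would come from the fitted models listed above.

```python
def majority_vote_features(importances_by_model, top_k):
    """Keep features that appear in the top_k ranking of more than half the models."""
    votes = {}
    for scores in importances_by_model.values():
        ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
        for name in ranked[:top_k]:
            votes[name] = votes.get(name, 0) + 1
    threshold = len(importances_by_model) / 2
    return sorted(name for name, count in votes.items() if count > threshold)

# Hypothetical importance scores from three fitted models.
importances_by_model = {
    "logistic_regression": {"period": 0.10, "dm": 0.40, "snr": 0.30, "pulse_width": 0.20},
    "random_forest": {"period": 0.15, "dm": 0.35, "snr": 0.40, "pulse_width": 0.10},
    "lightgbm": {"period": 0.30, "dm": 0.20, "snr": 0.35, "pulse_width": 0.15},
}

selected = majority_vote_features(importances_by_model, top_k=2)
print(selected)
```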
[0078] Furthermore, in some embodiments, rules of thumb based on expert experience are employed to effectively integrate domain knowledge into task processing. Overall, the process of constructing prompts can be represented as follows:
[0079] $$P = T(F, R)$$
[0080] where $T$ is the prompt template, $F$ is the feature set, and $R$ is the set of constraint rules.
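Constructing a prompt from a template, a feature set, and a set of rules can be sketched as a simple string-filling function. The template wording, the `<image>` placeholder, the feature values, and the example rule are all hypothetical and serve only to show how the three ingredients combine.

```python
def build_prompt(template, features, rules):
    """Fill a prompt template with numerical features and rules of thumb."""
    feature_lines = "\n".join(f"- {name}: {value}" for name, value in features.items())
    rule_lines = "\n".join(f"- {rule}" for rule in rules)
    return template.format(features=feature_lines, rules=rule_lines)

# Hypothetical template; "<image>" marks where visual embeddings would be inserted.
TEMPLATE = (
    "<image>\n"
    "Task: classify the pulsar candidate.\n"
    "Numerical features:\n{features}\n"
    "Rules of thumb:\n{rules}\n"
    "Answer with 'pulsar' or 'RFI' and a brief justification."
)

features = {"best_period_ms": 5.76, "best_dm": 71.0, "snr": 12.5}  # placeholder values
rules = ["A signal whose strength peaks at zero DM is more likely to be RFI."]

prompt = build_prompt(TEMPLATE, features, rules)
print(prompt)
```

Keeping the template separate from the features and rules means a different dataset only needs a new template and feature list, not a new model.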
[0081] The construction of the VQA dataset is described in one of the following possible implementations.
[0082] Unlike downstream classification tasks, visual question answering (VQA) tasks do not rely on a single data source but require a more comprehensive approach to build models that understand images. In radio astronomy, observation reports are often inconsistent in format and scattered across multiple important survey data sources. The limited availability of publicly labeled datasets poses a significant challenge to constructing VQA datasets.
[0083] Therefore, a two-stage strategy is adopted in this embodiment to construct the VQA dataset:
[0084] First, four publicly available labeled image datasets (FAST, HTRU Medlat, Spectrumcls, and Radio Galaxy) are integrated to ensure the diversity of data sources. Second, based on the FAST and Spectrumcls datasets, self-built reports containing key parameter information are added, obtained by manually analyzing result files or generating them with specialized software.
[0085] The following example will elaborate on the two stages of VQA dataset construction.
[0086] Stage 1: Transforming the labeled dataset into visual question answering (VQA) samples. Standard visual datasets typically contain sample instances $(x_i, p_i)$, where the image $x_i$ and the prompt $p_i$ are generated by the prompt construction process of the previous step, and each sample has a corresponding category label $y_i$. These samples collectively constitute the foundational data of the VQA dataset, denoted $D_0$. Formally, the basic data format can be represented as:
[0087] $$D_0 = \{(x_i, p_i, y_i) \mid i \in I\}$$
[0088] where the index $i$ uniquely identifies each data instance and $I$ is the sample index set.
[0089] In this embodiment, the basic information is expanded by integrating image descriptions, data sources, and a brief label-based analysis to construct the relevant prompts. Taking a sample from the FAST dataset as an example, this embodiment uses the prompt "Please analyze the following pulsar candidate image" to describe the task requirement. The MLLM then provides the answer <Image Analysis>, which must include: i) an image description (e.g., "This image is a diagnostic grayscale image of a pulsar candidate, possibly a time-phase or frequency-phase image."); ii) the data source (e.g., "Possibly generated by pulsar search software such as PRESTO."); iii) the classification label and a brief analysis (e.g., "The signal is disordered, possibly a radio frequency interference (RFI) signal."). Samples from the other labeled datasets are processed similarly. Detailed prompt templates for each dataset are shown in the table. In summary, after all samples from the four labeled datasets have been converted, the first-stage visual question answering dataset is described as follows:
[0090] $$\mathcal{D}_{1} = \{(x_i, q_i, a_i)\}_{i \in \mathcal{I}}$$
[0091] where $q_i$ is the instruction prompt associated with the image $x_i$, and $a_i$ is a structured descriptive tuple containing visual representations, data source metadata, and semantic analysis derived from the original labels. This structured transformation framework converts raw data into interactive question-and-answer pairs, improving semantic richness and practical application value.
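The Phase 1 conversion of a labeled sample into a question-and-answer pair might look like the following sketch. The question and the three answer parts reuse the example wording from the FAST sample above; the `VQASample` class, the `ANALYSIS_BY_LABEL` lookup, and the file path are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass
class VQASample:
    image_path: str
    question: str
    answer: str

# Hypothetical lookup from class label to a brief analysis sentence.
# The RFI wording follows the example given in the text.
ANALYSIS_BY_LABEL = {
    "RFI": "The signal is disordered, possibly a radio frequency interference (RFI) signal.",
}

def labeled_to_vqa(image_path, label, description, source):
    """Turn one labeled image into a VQA sample with a three-part answer."""
    question = "Please analyze the following pulsar candidate image"
    answer = " ".join([
        f"Image description: {description}",
        f"Data source: {source}",
        f"Classification and analysis: {ANALYSIS_BY_LABEL[label]}",
    ])
    return VQASample(image_path, question, answer)

sample = labeled_to_vqa(
    "fast/candidate_0001.png",  # hypothetical path
    "RFI",
    "This image is a diagnostic grayscale image of a pulsar candidate.",
    "Possibly generated by pulsar search software such as PRESTO.",
)
```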
[0092] Phase 2: Transforming observation reports into visual question answering (VQA) samples. The labeled data used in Phase 1 primarily serves downstream classification tasks, so its representation dimensions are limited. However, some original data (such as spectral classification reports) fully retain contextual semantics, reflecting the multimodal association between the observation text and the corresponding image regions.
[0093] The hybrid information extraction framework of this invention comprises two core steps: (1) extracting text from predefined image regions using the document intelligence tool Doc2X; and (2) capturing latent semantic attributes (such as the specific timestamp of a solar flare event) through manual annotation. The synthesized features are then converted into answers by a response template function.
[0094] Therefore, the VQA dataset for the second phase is constructed as follows:
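[0095] $$\mathcal{D}_{2} = \{(x_j, q_j, f_{\text{resp}}(\mathrm{OCR}(x_j), \mathrm{Ann}(x_j)))\}_{j \in \mathcal{J}}$$
[0096] where, for each report image $x_j$ with question $q_j$, $\mathrm{OCR}(\cdot)$ is the optical character recognition operator, $\mathrm{Ann}(\cdot)$ denotes manual semantic annotation, and $f_{\text{resp}}$ is the corresponding predefined structured response function;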
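[0097] The mathematical expression for the radio astronomy VQA dataset is $\mathcal{D}_{\text{VQA}} = \mathcal{D}_{1} \oplus \mathcal{D}_{2}$, where $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ are the Phase 1 and Phase 2 datasets and the operator $\oplus$ represents the fusion of semantically related samples.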
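One way to picture Phase 2 and the final fusion is sketched below. It assumes the OCR output and the manual annotations are already available as dictionaries; no Doc2X call is shown. Interpreting fusion as grouping question-and-answer pairs by image is an assumption of this sketch, and all field names and values are placeholders.

```python
def response_template(ocr_fields, manual_fields):
    """Combine OCR text fields and manual annotations into one answer string."""
    merged = {**ocr_fields, **manual_fields}  # manual values win on key clashes
    return "; ".join(f"{key}: {value}" for key, value in sorted(merged.items()))

def build_phase2_sample(image, question, ocr_fields, manual_fields):
    return {
        "image": image,
        "question": question,
        "answer": response_template(ocr_fields, manual_fields),
    }

def fuse(phase1, phase2):
    """Group samples from both phases by image so related pairs sit together."""
    fused = {}
    for sample in phase1 + phase2:
        fused.setdefault(sample["image"], []).append(
            {"question": sample["question"], "answer": sample["answer"]}
        )
    return fused

# Placeholder samples; none of these values come from a real report.
phase1 = [{"image": "spec_001.png", "question": "placeholder question",
           "answer": "placeholder label and analysis"}]
phase2 = [build_phase2_sample(
    "spec_001.png",
    "Summarize the observation report.",
    ocr_fields={"frequency_range": "placeholder OCR text"},
    manual_fields={"event_time": "placeholder timestamp"},
)]
dataset = fuse(phase1, phase2)
```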
[0098] The fine-tuning of MLLMs for radio astronomy tasks is described below in one possible implementation.
[0099] To optimize the performance of multimodal large language models (MLLMs) in radio astronomy tasks, this embodiment uses the RadioAstro VQA dataset to fine-tune pre-trained MLLMs for image classification and visual question answering (VQA) tasks.
[0100] To improve the computational efficiency of the fine-tuning process, this embodiment employs quantization together with LoRA, a parameter-efficient optimization scheme based on low-rank matrix factorization. The fine-tuning framework comprises three core modules: i) an image encoder based on a Vision Transformer (ViT) that generates visual embeddings; ii) a linear projector that maps the visual features into the fixed 4096-dimensional latent embedding space of the LLM; and iii) a frozen pre-trained LLM (optionally adapted using LoRA). Memory optimization is achieved by segmenting the serialized image features and concatenating them along the embedding dimension. The visual encoder parameters continue to be learned during domain adaptation.
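The low-rank idea behind LoRA can be sketched in a few lines of plain Python. This uses the common LoRA formulation, in which a frozen weight matrix W is augmented by a scaled product of two small trainable matrices B and A; the scaling factor `alpha`, the rank, and the matrix sizes below are assumptions for illustration, since the embodiment does not state them.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A; W stays frozen, only A and B train."""
    scale = alpha / rank
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

def trainable_fraction(d_out, d_in, rank):
    """Share of parameters trained by LoRA relative to the full matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)

# Tiny 2x2 example with rank 1 (all values hypothetical).
w_frozen = [[1.0, 0.0], [0.0, 1.0]]
a_small = [[0.5, 0.0]]    # r x d_in
b_small = [[1.0], [2.0]]  # d_out x r
w_adapted = lora_effective_weight(w_frozen, a_small, b_small, alpha=2.0, rank=1)
```

For a 4096 by 4096 projection with an assumed rank of 8, `trainable_fraction(4096, 4096, 8)` evaluates to 1/256, which shows why the approach reduces the number of trained parameters.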
[0101] This embodiment also includes an integrated multimodal prompt template that combines visual tokens (ImageFeature), text queries, and task identifiers (such as [VQA]). The format specification for this task is " ImageFeature[TaskID]". The dataset in this example contains input prompts for radio astronomy tasks and their corresponding standard output results (denoted I and O, respectively). The generation probability is calculated using the following formula:
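[0102] $$p(O \mid I) = \prod_{t=1}^{|O|} p(o_t \mid I, o_{<t})$$
[0103] where $p(o_t \mid I, o_{<t})$ denotes the autoregressive probability of the $t$-th step, $o_t$ is the $t$-th output token, and $o_{<t}$ denotes the tokens preceding it. The fine-tuning objective is to minimize the following loss function: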
[0104] $$\mathcal{L} = -\mathbb{E}_{(I, O) \sim \mathcal{D}} \left[ \log p(O \mid I) \right]$$
[0105] where $\mathcal{D}$ represents the dataset used for fine-tuning. The fine-tuning framework thus captures the relationship between input and output, laying a foundation for optimizing the performance of multimodal large language models in radio astronomy tasks.
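The relationship between the per-step autoregressive probabilities and the fine-tuning loss can be checked numerically with a small sketch. It assumes the per-step probabilities of the reference output tokens are already available as plain floats and that the loss is averaged over the samples; both function names are hypothetical.

```python
import math

def sequence_log_prob(step_probs):
    """Log of the product of per-step probabilities for one output sequence."""
    return sum(math.log(p) for p in step_probs)

def fine_tuning_loss(batch_step_probs):
    """Mean negative log-likelihood over a batch of output sequences."""
    total = sum(sequence_log_prob(probs) for probs in batch_step_probs)
    return -total / len(batch_step_probs)

# Two hypothetical outputs: one with two tokens, one with a single certain token.
batch = [[0.5, 0.5], [1.0]]
loss = fine_tuning_loss(batch)
```

Summing logarithms rather than multiplying probabilities avoids numerical underflow for long outputs, which is why the loss is written in terms of log-probabilities.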
[0106] During fine-tuning, this embodiment uses two open-source multimodal large language models (MLLMs) with different parameter scales as the base models: DeepSeek-VL-7B and InternVL2-40B. The default optimizer is AdamW, and training stability is ensured by clipping gradients to a maximum norm of 1. A cosine learning rate schedule with a 5% warm-up ratio is used to achieve smooth convergence.
[0107] The AI assistant system of the present invention is described below in one possible implementation.
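The schedule and clipping settings can be sketched in plain Python as follows. The sketch assumes a linear warm-up over the first 5% of steps and a cosine decay to zero afterwards; the linear warm-up shape, the decay floor of zero, and the function names are assumptions, since only the warm-up ratio, the cosine schedule, and the maximum gradient norm are stated.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.05):
    """Linear warm-up followed by cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]

# With 1000 total steps the warm-up covers the first 50 steps.
peak_lr = lr_at_step(49, total_steps=1000, base_lr=2e-4)
final_lr = lr_at_step(1000, total_steps=1000, base_lr=2e-4)
clipped = clip_by_global_norm([3.0, 4.0])  # norm 5 is rescaled to norm 1
```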
[0108] In this embodiment, the hardware foundation of the AI assistant system is a high-performance computer system that deploys multimodal large language models (MLLMs). The system is equipped with a GPU acceleration cluster to support model training and inference. The MLLMs can be pre-trained models with visual-language understanding capabilities, such as LLaVA and GPT-4V.
[0109] In this embodiment, the data input module receives four types of open-source radio astronomy data through a standardized interface, including raw survey images in FITS format (such as SKA simulation data and NVSS survey data), target source annotation files (including right ascension / declination coordinates), spectral energy distribution data, and astrophysical parameter tables (such as redshift and flux density), supporting batch uploading and real-time streaming data access.
[0110] The data preprocessing module processes the raw data as follows. First, sample balancing is performed based on celestial category labels (such as quasars, galaxy clusters, and supernova remnants); for under-represented categories, the SMOTE algorithm is used to generate synthetic samples. Second, the target source regions in the images are extracted using the Astropy library, cropped into 224×224 pixel feature images, and uniformly converted to PNG format. Finally, the processed images are associated with the corresponding annotation information (category labels, physical parameters) to generate a classification task dataset in CSV format, which is then randomly divided into training and test sets in a 7:3 ratio.
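Two of the preprocessing steps, the 224×224 crop window and the random 7:3 split, can be sketched without any astronomy library. The sketch only computes pixel bounds around an assumed target-source center and does not read FITS data or handle image borders; the function names and the fixed seed are illustrative assumptions.

```python
import random

def crop_box(center_x, center_y, size=224):
    """Return (left, top, right, bottom) of a size x size window around a center."""
    half = size // 2
    left, top = center_x - half, center_y - half
    return left, top, left + size, top + size

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

box = crop_box(500, 400)  # hypothetical source position in pixel coordinates
train_rows, test_rows = split_train_test(range(10))
```

A fixed seed keeps the split reproducible, so the same test set can be reused when models of different sizes are compared.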
[0111] In this embodiment, the VQA prompt generation module constructs a two-level question-answering system. Basic question-answering pairs are designed for image classification tasks, such as "What type of celestial body is in this image? Answer: Quasar." Deep question-answering pairs are designed in conjunction with physical parameters, such as "If the redshift value of this celestial body is 0.5, what is its approximate distance? Answer: 650 million light-years."
[0112] Simultaneously, a structured prompt template was designed, including an image description ("The following is an image of a celestial body taken in the radio band"), a question type ("Please determine the type of celestial body and calculate its physical parameters"), and output format constraints ("Return the category and parameter values in JSON format"). The dataset construction module integrates the above VQA prompts into a radio astronomy VQA dataset. Ambiguous samples are removed through manual verification, ultimately forming a standardized dataset containing 100,000 training samples and 30,000 test samples.
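A structured prompt with a JSON output constraint might be assembled and checked as in the sketch below. The three prompt parts reuse the example wording above; the JSON keys `category` and `parameters` and the sample reply are hypothetical, since the embodiment does not specify a schema.

```python
import json

def build_structured_prompt(image_description, question, output_constraint):
    """Join the three template parts into one prompt, one part per line."""
    return "\n".join([image_description, question, output_constraint])

def parse_model_reply(reply_text):
    """Parse a JSON reply and check for the two keys this sketch expects."""
    data = json.loads(reply_text)
    missing = {"category", "parameters"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

prompt = build_structured_prompt(
    "The following is an image of a celestial body taken in the radio band",
    "Please determine the type of celestial body and calculate its physical parameters",
    "Return the category and parameter values in JSON format",
)
# Hypothetical reply text, used only to exercise the parser.
reply = parse_model_reply('{"category": "quasar", "parameters": {"redshift": 0.3}}')
```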
[0113] In this embodiment, the model fine-tuning module uses LoRA quantization technology to construct a lightweight fine-tuning framework. While the main parameters of the MLLMs are kept frozen, only the low-rank adaptation matrices are trained, which reduces computational resource consumption. Specifically, the CLIP visual encoder first encodes the preprocessed image into a 768-dimensional visual embedding, which is then mapped to the word embedding space of the MLLMs (e.g., a 4096-dimensional space for the 7B parameter model) through a linear projection layer. The mapped visual embedding is then concatenated with the text prompt to form a multimodal input, and the model is optimized using a causal language modeling loss function. Training parameters are adapted to the model size; for example, the 7B model uses a batch size of 16 and a learning rate of 2e-4, while the 33B model uses a batch size of 8 and a learning rate of 1e-4. Both models are trained for 10 epochs.
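The size-dependent training parameters and the dimensions of the projection layer can be captured in a small configuration sketch. The batch sizes, learning rates, epoch count, and the 768 and 4096 dimensions are taken from the paragraph above; the dictionary layout, the function names, and the presence of a bias term in the projector are assumptions.

```python
# Training parameters per model scale, as listed in the text.
TRAINING_CONFIGS = {
    "7B": {"batch_size": 16, "learning_rate": 2e-4, "epochs": 10},
    "33B": {"batch_size": 8, "learning_rate": 1e-4, "epochs": 10},
}

def select_config(model_scale):
    """Look up the training parameters for a given model scale."""
    if model_scale not in TRAINING_CONFIGS:
        raise KeyError(f"no training config for model scale {model_scale!r}")
    return TRAINING_CONFIGS[model_scale]

def projector_parameter_count(vision_dim=768, llm_dim=4096, bias=True):
    """Parameters of a linear layer mapping vision_dim to llm_dim."""
    return vision_dim * llm_dim + (llm_dim if bias else 0)

config_7b = select_config("7B")
projector_params = projector_parameter_count()
```

With these dimensions the projector has about 3.1 million parameters, which is small compared with the frozen language model it feeds.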
[0114] After receiving the radio astronomy image to be analyzed, the inference and analysis module automatically calls the preprocessing flow to generate feature images. It then selects a matching prompt template based on the task type (classification / parameter calculation), forming a multimodal input of image embeddings and text prompts. The MLLM completes the analysis through multi-step logical deduction. For example, it first identifies morphological features in the image ("a double-lobed structure exists"), then infers the category based on training knowledge ("possibly a radio galaxy"), and finally calculates the derived quantity based on the physical parameter formula ("calculate luminosity based on flux density").
[0115] The results output module presents the analysis results in two forms: structured data (including classification labels, confidence scores, and physical parameter values) for subsequent data processing, and natural language text (such as "This celestial body is a quasar, with a redshift of 0.3 and a corresponding luminosity of 1e45 erg/s") for intuitive display. It supports export in formats such as JSON and PDF.
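The two output forms can be illustrated with a short sketch that renders the same result as a JSON string and as a sentence. The field names, the confidence value, and the sentence pattern are hypothetical; PDF export is not shown.

```python
import json

def format_result(label, confidence, parameters):
    """Return (structured JSON string, natural language sentence) for one result."""
    structured = json.dumps(
        {"label": label, "confidence": confidence, "parameters": parameters},
        sort_keys=True,
    )
    parts = ", ".join(f"a {name} of {value}" for name, value in parameters.items())
    text = f"This celestial body is a {label}"
    text += f", with {parts}." if parts else "."
    return structured, text

# Hypothetical analysis result.
structured, text = format_result("quasar", 0.93, {"redshift": 0.3})
```

Producing both forms from one result dictionary keeps the displayed sentence consistent with the structured values passed to downstream processing.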
[0116] The technical terms, principles, or means related to the technical solutions of the present invention mentioned in the above embodiments, which are not described in detail above, are all well-known technologies or common practices that are known to those skilled in the art.
[0117] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of this invention is defined by the appended claims and their equivalents.
Claims
1. An image analysis method based on a multimodal large model, characterized in that it includes the following steps: inputting raw image analysis data into a computer system that deploys a multimodal large language model; the computer system preprocesses the raw data, divides the training set and test set according to the class balance principle, extracts feature images and unifies the format to obtain the labeled dataset for the classification task; the labeled dataset for the classification task is transformed into visual question answering (VQA) prompts suitable for fine-tuning the multimodal large language model, and a VQA dataset is constructed; based on the VQA dataset, knowledge injection and inference model optimization are performed on the multimodal large language model, including: constructing a fine-tuning framework using LoRA quantization technology, encoding images into visual embeddings and mapping them to the latent embedding space of the multimodal large language model, designing multimodal prompt templates that integrate visual tokens, text queries and task identifiers, adapting training parameters according to the model parameter scale, enabling the multimodal large language model to learn and structurally store domain knowledge, and optimizing its inference mechanism based on multimodal input; the computer system calls the multimodal large language model optimized by knowledge injection and inference model optimization to fuse the image to be analyzed with the corresponding prompt template into a multimodal input, uses the learned domain knowledge and the optimized inference model to perform multi-step logical deduction, completes the image analysis, and outputs the analysis results generated by the inference of the multimodal large language model.
The process of converting the labeled dataset for the classification task into VQA prompts suitable for fine-tuning the multimodal large language model includes: designing the image input in three modes, namely single, multiple, and combined, where in single mode only a single image is used as input, in multiple mode all available images are input into the visual encoder one by one and their features are concatenated before projection, and in combined mode all available images are integrated into a single input image and fed into the visual encoder; embedding numerical features and rules of thumb in the prompt template to determine the optimal feature set for input to the multimodal large language model and to integrate domain knowledge into task processing; the process of constructing prompts is represented as $P = T(F, R)$, where $P$ is the constructed prompt, $T$ is the prompt template, $F$ is the feature set, and $R$ is the set of constraint rules.
2. The image analysis method based on a multimodal large model according to claim 1, characterized in that the raw data for image analysis includes four types of open-source radio astronomy raw images and associated annotation information, which includes astrophysical parameters and observational feature descriptions; the four types of open-source radio astronomy raw data include: pulsar candidate images and labels from the FAST dataset, pulsar candidate images and physical property parameters from the HTRU Medlat dataset, solar radio spectrum sample images and caption information from the Spectrumcls dataset, and radio galaxy source images and morphological annotations from the Radio Galaxy dataset.
3. The image analysis method based on a multimodal large model according to claim 1, characterized in that the preprocessing of the raw data includes extracting feature images: extracting profile images, phase images, sub-band images, and dispersion measure (DM) images from FAST dataset samples; extracting three phase images, sub-band images, and velocity-momentum images from HTRU Medlat dataset samples; and uniformly adjusting the images of Spectrumcls dataset samples to a preset resolution.
4. The image analysis method based on a multimodal large model according to claim 1, characterized in that the LoRA quantization technique employs 4-bit quantization processing; when encoding the image into a visual embedding and mapping it to the latent embedding space of the multimodal large language model, layer normalization is performed through a linear projector to match the visual features with the embedding space distribution of the multimodal large language model.
5. The image analysis method based on a multimodal large model according to claim 4, characterized in that the fine-tuning framework includes a Vision Transformer image encoder, a linear projector, and a frozen pre-trained model.
6. A method for constructing a visual question answering (VQA) dataset, characterized in that it includes the following steps: inputting the original image and associated annotation information into the computer system; preprocessing the original image and associated annotation information, dividing the training set and test set according to the class balance principle, extracting the feature images of each dataset sample and unifying the format to obtain the labeled dataset for the classification task; constructing the VQA dataset in two phases as follows: Phase 1 transforms the labeled dataset into VQA Sample 1; Phase 2 transforms the observation reports into VQA Sample 2; the VQA samples from both phases are then merged to obtain a VQA dataset containing multimodal prompt templates; VQA Sample 1 is described as $\mathcal{D}_{1} = \{(x_i, q_i, a_i)\}_{i \in \mathcal{I}}$, where $q_i$ is the instruction prompt associated with the image $x_i$, and $a_i$ is a structured description tuple that includes visual representations, data source metadata, and semantic analysis derived from the original labels; VQA Sample 2 is described as $\mathcal{D}_{2} = \{(x_j, q_j, f_{\text{resp}}(\mathrm{OCR}(x_j), \mathrm{Ann}(x_j)))\}_{j \in \mathcal{J}}$, where $\mathrm{OCR}(\cdot)$ is the optical character recognition operator, $\mathrm{Ann}(\cdot)$ denotes manual semantic annotation, and $f_{\text{resp}}$ is the corresponding predefined structured response function; the mathematical expression for the VQA dataset containing multimodal prompt templates is $\mathcal{D}_{\text{VQA}} = \mathcal{D}_{1} \oplus \mathcal{D}_{2}$, where the operator $\oplus$ represents the fusion of semantically related samples.
7. An AI assistant system implementing the image analysis method based on a multimodal large model as described in any one of claims 1-5, characterized in that it includes: a computer system that deploys a multimodal large language model, and: a data input module used to receive raw image analysis data, which includes the original image and associated annotation information; a data preprocessing module used to preprocess the raw data, divide the training set and test set according to the class balance principle, extract feature images and unify the format, and output the labeled dataset for the classification task; a VQA prompt generation module used to transform the labeled dataset for the classification task into visual question answering (VQA) prompts suitable for fine-tuning the multimodal large language model, including generating basic question-answer pairs based on classification labels, generating deep question-answer pairs by combining physical parameters, and constructing structured prompt templates; a dataset construction module used to integrate the VQA prompts into a VQA dataset, perform classification annotation, split the training set and test set, and perform quality verification; a model fine-tuning module used to perform knowledge injection and inference model optimization on the multimodal large language model based on the VQA dataset, including constructing a fine-tuning framework using LoRA quantization technology, encoding images into visual embeddings and mapping them to the latent embedding space of the multimodal large language model, designing multimodal prompt templates, and adapting training parameters according to the model parameter scale;
an inference and analysis module used to call the optimized multimodal large language model to fuse the image to be analyzed with the corresponding prompt template into a multimodal input, and to complete the image analysis through multi-step logical deduction by the multimodal large language model; and a results output module used to output the analysis results generated by the inference of the multimodal large language model, including image classification labels and visual question answering text answers.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed, it implements the image analysis method based on a multimodal large model as described in any one of claims 1-5.