Skin disease diagnosis method and device based on cross-modal hash retrieval and RAG
The method combines cross-modal hash retrieval with retrieval-augmented generation (RAG), extracts lesion features using a graph structure and Ricci curvature, and dynamically adjusts the hash tables. This addresses the subjectivity and multimodal information fusion problems of existing dermatology diagnostic systems, enabling efficient and accurate diagnosis, cross-hospital collaboration, and reliable diagnostic reports.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SOUTH CHINA NORMAL UNIV
- Filing Date
- 2026-03-09
- Publication Date
- 2026-06-19
AI Technical Summary
Existing dermatology diagnostic systems rely on the experience of dermatologists, resulting in high subjectivity, low efficiency, and a high risk of missed diagnoses or misdiagnoses. AI-assisted diagnostic systems struggle to adapt to the continuous growth of hospital case data, suffer from severe concept drift, cannot integrate multimodal information, and lack both automatic diagnostic report generation and data privacy protection mechanisms.
We employ cross-modal hash retrieval and the RAG method, extracting lesion features through graph structure and Ricci curvature, dynamically adjusting the hash table, combining multimodal data for retrieval, generating diagnostic reports, and achieving cross-hospital collaborative training through federated learning and privacy protection mechanisms.
It improves the accuracy and comprehensiveness of dermatology diagnosis, supports efficient retrieval in dynamic data environments, automatically generates professional and reliable diagnostic reports, and enables cross-hospital joint training while protecting patient privacy.
Figure CN122245705A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computers, and more particularly to a method and apparatus for diagnosing skin diseases based on cross-modal hash retrieval and RAG.
Background Technology
[0002] Currently, the diagnosis of skin diseases mainly relies on the experience of dermatologists combined with dermoscopy, which suffers from high subjectivity, low efficiency, and a high risk of missed or misdiagnosed cases. Existing AI-assisted diagnostic systems are mostly based on static image classification or retrieval models, which are difficult to adapt to the dynamic addition of new case data in hospitals, leading to concept drift. Moreover, they are mostly single-modal image retrieval systems, unable to integrate multimodal information such as patient history and symptoms, resulting in insufficient utilization of diagnostic information. In addition, existing systems lack the ability to automatically generate diagnostic reports, have weak data privacy protection mechanisms, and are difficult to achieve cross-hospital collaborative training.
[0003] Existing AI-assisted diagnostic systems suffer from the following shortcomings: traditional CNNs are weak at recognizing blurred boundaries and small lesions in dermoscopic images; static hash retrieval methods cannot adapt to the continuous growth of hospital data, and concept drift degrades model performance; existing retrieval systems are mostly unimodal and cannot comprehensively use multimodal information such as images, text, and speech; traditional large language models lack the support of a professional medical knowledge base, are prone to "hallucinations," and generate reports with low credibility; and data privacy and data silos prevent hospitals from sharing data, so model training data is limited and generalization ability is poor.
Summary of the Invention
[0004] The following is an overview of the topics described in detail in this article.
[0005] The purpose of this application is to at least partially solve one of the technical problems existing in the related technologies. The embodiments of this application provide a method and apparatus for diagnosing skin diseases based on cross-modal hash retrieval and RAG.
[0006] An embodiment of the first aspect of this application, a method for diagnosing skin diseases based on cross-modal hash retrieval and RAG, includes:
[0007] Acquire skin disease data and input the skin disease data into the diagnostic model;
[0008] When the skin disease data is unimodal image data, lesion features are extracted from the skin disease data using graph structure and Ricci curvature. The hash table is then optimized according to the lesion features based on retrieval performance. Dynamic hash retrieval is performed based on the optimized hash table to obtain the first retrieval result.
[0009] When the skin disease data is multimodal image-text pair data, the data of each modality is mapped to the hash code of each modality through the trained second hash function. Cross-modal hash retrieval is performed according to the hash code of each modality to obtain cross-modal retrieval results. Multiple cross-modal retrieval results are fused to obtain the second retrieval result. In the training process of the second hash function, the hard labels of labeled data are combined with the pseudo labels of unlabeled data to update the shared semantic matrix and each modality matrix, and the training model parameters of the second hash function are dynamically adjusted. The pseudo labels of the unlabeled data in the current round are generated by the training model in the previous round.
[0010] The target medical knowledge fragment is obtained by performing a hash retrieval based on the first or second search result. The target medical knowledge fragment is used as context and combined with the user query to form prompt words. A diagnostic report is generated based on the prompt words through a large language model.
[0011] According to certain embodiments of the first aspect of this application, the extraction of lesion features based on the skin disease data using graph structure and Ricci curvature includes:
[0012] Preliminary image segmentation of the lesion areas is performed on the skin disease data to obtain a mask;
[0013] A graph structure is constructed within the area of the mask, using pixels or superpixels as nodes and spatial and color relationships as edges.
[0014] The Ricci curvature is calculated on the graph structure, and the Ricci curvature is used to reflect abrupt changes, boundaries, and deformed regions of the local structure.
[0015] The Ricci curvature is used as an additional guiding channel or attention weight and fused with feature maps of different levels extracted from the skin disease data by the backbone network to obtain lesion features.
[0016] According to certain embodiments of the first aspect of this application, in the process of extracting lesion features from the skin disease data using graph structure and Ricci curvature, simplified curvature representations corresponding to feature maps at different levels of the backbone network are calculated, and the simplified curvature representations corresponding to feature maps at different levels are kept consistent by a loss function.
[0017] According to certain embodiments of the first aspect of this application, the step of optimizing the hash table based on the lesion features for retrieval performance includes:
[0018] The lesion features and erroneous hash samples are used as training data to train a new hash table. The new hash table is then added to a hash table set, which includes the new hash table and the original K hash tables. The erroneous hash samples are database samples with different labels in historical retrievals.
[0019] The weights are calculated based on the retrieval performance of the hash table set on the training data, and the hash tables with low weights are deleted to keep the number of hash tables in the hash table set at K. The weights are determined based on the cross-entropy between the similarity predicted by the hash codes and the similarity between the actual labels, as well as the cosine distance between the first hash functions.
[0020] According to certain embodiments of the first aspect of this application, obtaining the first search result by performing dynamic hash retrieval based on the optimized hash table includes:
[0021] While preserving the original feature vector of the query sample, the query sample is transformed into a query feature vector;
[0022] Calculate the asymmetric distance between the query feature vector and the hash code of the hash table;
[0023] The hyperplane distance from the query feature vector to the decision hyperplane of each first hash function is calculated, and the adaptive weight of each first hash function is adjusted based on the hyperplane distance.
[0024] The Hamming distance is calculated based on the asymmetric distance and the adaptive weights.
[0025] The database samples with the smallest Hamming distance are selected as the first search results.
[0026] According to certain embodiments of the first aspect of this application, the pseudo-labels for the unlabeled data are generated according to the following steps:
[0027] The second hash function corresponding to each modality in round t-1 is used to generate modal hash codes for each unlabeled modal data in round t-1. Based on the modal hash codes in round t-1, the database is retrieved to obtain cross-modal hash codes for round t-1. Based on the cross-modal hash codes in round t-1, pseudo-labels for each modality in round t-1 are generated. The pseudo-labels for each modality in round t-1 with a confidence level greater than a preset confidence threshold are used as the supervision signals for round t, until all rounds of training are completed.
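The pseudo-labelling round described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `pseudo_labels`, the k-nearest-neighbour majority vote, and the use of the vote fraction as the confidence level are all assumptions, since the source does not state how the confidence is computed.

```python
import numpy as np

def pseudo_labels(query_codes, db_codes, db_labels, n_classes, k=3, threshold=0.7):
    """Assign pseudo-labels to unlabeled samples from their retrieved neighbours.

    query_codes: (n, bits) 0/1 hash codes produced by the round t-1 hash function.
    db_codes, db_labels: codes and integer labels of the labeled database.
    Returns (labels, kept_indices, confidence); only samples whose confidence
    exceeds `threshold` are kept as supervision for round t.
    """
    # Hamming distance between every unlabeled sample and every database sample.
    hamming = (query_codes[:, None, :] != db_codes[None, :, :]).sum(axis=2)
    nearest = np.argsort(hamming, axis=1, kind="stable")[:, :k]

    votes = np.zeros((len(query_codes), n_classes))
    for i, idx in enumerate(nearest):
        votes[i] = np.bincount(db_labels[idx], minlength=n_classes)

    confidence = votes.max(axis=1) / k      # assumed confidence: vote fraction
    labels = votes.argmax(axis=1)
    keep = confidence > threshold
    return labels[keep], np.flatnonzero(keep), confidence
```

In a full training loop, the kept pseudo-labels would be merged with the hard labels before the next update of the shared semantic matrix and the modality matrices.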
[0028] According to certain embodiments of the first aspect of this application, in the objective function of the training model of the second hash function, the class parameter is constrained by regularization, and the constraint strength is adjusted so that the semi-supervised collaborative learning model extracts a small number of class-corresponding features during the learning process.
[0029] According to certain embodiments of the first aspect of this application, the step of obtaining the target medical knowledge fragment by performing a hash search based on the first search result or the second search result includes:
[0030] The key information from the first or second search results is converted into a query vector.
[0031] The query vector is hashed to obtain the query hash code;
[0032] The target hash bucket is determined from the knowledge base based on the query hash code;
[0033] Similarity calculations are performed on the target hash bucket to retrieve the target medical knowledge fragments;
[0034] The knowledge base is constructed according to the following steps: the knowledge base documents related to dermatology are segmented and vectorized to obtain document vectors, the document vectors are mapped to knowledge base hash codes, the knowledge base hash codes are stored in the index of the hash bucket, and the knowledge base documents corresponding to the same or similar hash codes are stored in the same hash bucket.
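The bucket-based knowledge base lookup above can be sketched as follows. This is a hedged illustration only: the class name `HashBucketIndex`, the random-hyperplane hash, and the exact-code bucket match are assumptions (the source also allows similar codes to share a bucket and does not specify the hash function), and the document vectors are assumed to come from some external embedding step.

```python
import numpy as np
from collections import defaultdict

class HashBucketIndex:
    """Toy knowledge base index: one bucket per hash code (assumed design)."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        # Random hyperplanes stand in for the learned hash function.
        self.planes = rng.standard_normal((dim, n_bits))
        self.buckets = defaultdict(list)

    def _code(self, vec):
        return tuple((vec @ self.planes > 0).astype(int))

    def add(self, fragment, vec):
        """Store a text fragment under the hash code of its document vector."""
        vec = np.asarray(vec, dtype=float)
        self.buckets[self._code(vec)].append((fragment, vec))

    def query(self, vec, top_k=2):
        """Find the target bucket, then rank its fragments by cosine similarity."""
        vec = np.asarray(vec, dtype=float)
        candidates = self.buckets.get(self._code(vec), [])
        scored = [
            (float(v @ vec / (np.linalg.norm(v) * np.linalg.norm(vec))), frag)
            for frag, v in candidates
        ]
        return [frag for _, frag in sorted(scored, reverse=True)[:top_k]]
```

Restricting the similarity calculation to one bucket is what keeps the lookup cheap; the trade-off is that relevant fragments hashed to a neighbouring bucket are missed unless similar codes are also probed.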
[0035] According to certain embodiments of the first aspect of this application, different diagnostic systems are deployed with different diagnostic models, and the different diagnostic systems are connected to a central server; the method further includes:
[0036] The local diagnostic model is trained using dermoscopy data from the diagnostic system.
[0037] The trained model gradients or model parameters are encrypted, and the encrypted model gradients or model parameters are uploaded to the central server.
[0038] The central server uses a weighted federated average algorithm to aggregate the model gradients or model parameters uploaded by all diagnostic systems to obtain aggregate parameters. Based on the aggregate parameters, a new global model is obtained and then distributed to each diagnostic system.
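The aggregation step on the central server can be sketched as below. This is a minimal example assuming that each diagnostic system's contribution is weighted by its local sample count, which is the usual weighting in federated averaging but is not stated in the source; encryption and decryption of the uploads are omitted, and the name `federated_average` is illustrative.

```python
import numpy as np

def federated_average(client_params, client_sizes):
    """Weighted average of per-client parameter dictionaries.

    client_params: list of {name: ndarray} dictionaries, one per diagnostic system.
    client_sizes: number of local training samples per system (assumed weights).
    """
    total = float(sum(client_sizes))
    names = client_params[0].keys()
    return {
        name: sum(params[name] * (n / total)
                  for params, n in zip(client_params, client_sizes))
        for name in names
    }
```

The returned dictionary plays the role of the aggregate parameters from which the new global model is built and redistributed.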
[0039] A second aspect of this application provides a dermatology diagnostic device based on cross-modal hash retrieval and RAG, which applies the dermatology diagnostic method based on cross-modal hash retrieval and RAG as described in the first aspect of this application.
[0040] The above-mentioned solution has at least the following beneficial effects: it provides an intelligent dermatology diagnostic system that integrates dynamic cross-modal hash retrieval, retrieval-augmented generation (RAG), and federated learning; it enhances the feature extraction capability for dermatology data and improves the accuracy of boundary and small-target recognition; it supports efficient retrieval in dynamic data environments and adapts to the continuous addition of new cases in hospitals; it integrates multimodal information for retrieval, improving the comprehensiveness and accuracy of diagnosis; it can automatically generate professional and reliable diagnostic reports; and it enables cross-hospital joint training and model optimization while protecting patient privacy.
Attached Figure Description
[0041] The accompanying drawings are used to provide a further understanding of the technical solutions of this application and constitute a part of the specification. They are used together with the embodiments of this application to explain the technical solutions of this application and do not constitute a limitation on the technical solutions of this application.
[0042] Figure 1 is a flowchart of the steps of the skin disease diagnosis method based on cross-modal hash retrieval and RAG;
[0043] Figure 2 is a diagram of the sub-steps of extracting lesion features from the skin disease data using graph structure and Ricci curvature;
[0044] Figure 3 is a diagram of the sub-steps of obtaining the first search result through dynamic hash retrieval based on the optimized hash table;
[0045] Figure 4 is a diagram of the sub-steps of obtaining the target medical knowledge fragment through hash retrieval based on the first or second search result;
[0046] Figure 5 is an architecture diagram of the dermoscopy diagnostic assistance system;
[0047] Figure 6 is a structural diagram of the residual block of the ResNeSt feature extraction module;
[0048] Figure 7 is another structural diagram of the residual block of the ResNeSt feature extraction module;
[0049] Figure 8 is a structural diagram of the ResNeSt feature extraction module;
[0050] Figure 9 is another structural diagram of the ResNeSt feature extraction module;
[0051] Figure 10 is another structural diagram of the ResNeSt feature extraction module;
[0052] Figure 11 is an architecture diagram of the diagnostic report module enhanced by hash retrieval.
Detailed Implementation
[0053] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0054] It should be noted that although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than the module division in the device or the order in the flowchart. The terms "first," "second," etc., in the specification, claims, or the aforementioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
[0055] Embodiments of this application provide a method and apparatus for diagnosing skin diseases based on cross-modal hash retrieval and RAG.
[0056] The embodiments of this application will be further described below with reference to the accompanying drawings.
[0057] The skin disease diagnostic device based on cross-modal hash retrieval and RAG applies the following skin disease diagnostic method based on cross-modal hash retrieval and RAG.
[0058] The dermatology diagnostic device based on cross-modal hash retrieval and RAG has a diagnostic model, which includes the following modules: a ResNeSt feature extraction module based on geometric flow enhancement, a dynamic hash retrieval module, a semi-supervised dynamic cross-modal hash retrieval module, a diagnostic report module based on hash retrieval enhancement, and a federated learning and privacy protection module.
[0059] Referring to Figure 5, the dermoscopy diagnostic assistance system corresponding to the dermatology diagnostic device employs a horizontal federated learning framework and comprises a user interaction layer, a core algorithm layer, and a data storage layer. The user interaction layer handles unimodal / multimodal data uploading, medical advice acquisition and doctor recommendations, diagnostic report queries, mobile- and web-compatible operation, batch / single data queries (unimodal / multimodal), knowledge graph and bar chart visualization (similar case analysis), diagnostic report review and follow-up questions, and historical case management and visualization. The core algorithm layer includes a dynamic feature extraction module, a dynamic hash retrieval module, and an enhancement generation module. Image features from the ResNet-50 geometric flow model in the dynamic feature extraction module are fed into the dynamic hash retrieval module for unimodal hashing and asymmetric distance optimization. Image features and text features aligned with OSSCMFH cross-modal features in the dynamic feature extraction module are then fed into the dynamic hash retrieval module for decision-level fusion. The data storage layer includes a MySQL database and a medical knowledge base; the MySQL database provides dynamic updates and multimodal data association indexing. Data privacy and security are protected within the dermoscopy diagnostic assistance system.
[0060] Referring to Figure 1, a skin disease diagnosis method based on cross-modal hash retrieval and RAG includes the following steps:
[0061] Acquire skin disease data and input the skin disease data into the diagnostic model;
[0062] When the skin disease data is unimodal image data, the lesion features are extracted from the skin disease data using graph structure and Ricci curvature. The hash table is then optimized according to the lesion features based on retrieval performance. Dynamic hash retrieval is then performed based on the optimized hash table to obtain the first retrieval result.
[0063] When the skin disease data is multimodal data, the data of each modality is mapped to the hash code of each modality through the trained second hash function. Cross-modal hash retrieval is performed according to the hash code of each modality to obtain cross-modal retrieval results. Multiple cross-modal retrieval results are fused to obtain the second retrieval result. In the training process of the second hash function, the hard labels of labeled data are combined with the pseudo labels of unlabeled data to update the shared semantic matrix and each modality matrix, and the training model parameters of the second hash function are dynamically adjusted. The pseudo labels of the unlabeled data in the current round are generated by the training model in the previous round.
[0064] The target medical knowledge fragment is obtained by hash retrieval based on the first or second search result. The target medical knowledge fragment is used as context and combined with the user query to form prompt words. A diagnostic report is generated based on the prompt words through a large language model.
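The prompt assembly in the last step can be sketched as follows. The template wording and the function name `build_prompt` are hypothetical; the source only states that the retrieved fragments serve as context and are combined with the user query. The call to the large language model is deliberately left out because the source does not name a model or API.

```python
def build_prompt(fragments, user_query):
    """Combine retrieved knowledge fragments and the user query into one prompt."""
    context = "\n".join(f"[{i}] {frag}" for i, frag in enumerate(fragments, start=1))
    return (
        "Use only the numbered context below to draft a diagnostic report.\n"
        f"Context:\n{context}\n\n"
        f"Query: {user_query}\n"
        "Report:"
    )
```

Numbering the fragments lets the generated report cite which knowledge fragment supports each statement, which is one way to make the report easier to review.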
[0065] In some embodiments, users (doctors or patients) upload skin disease data to the diagnostic system via a client as input to the diagnostic model of the system.
[0066] The diagnostic system automatically determines the input modality of skin disease data.
[0067] When the skin disease data is unimodal image data, that is, the skin disease data is dermoscopic images; the lesion features are extracted from the skin disease data using graph structure and Ricci curvature; the hash table is optimized according to the lesion features based on retrieval performance; and the first retrieval result is obtained by dynamic hash retrieval based on the optimized hash table.
[0068] Referring to Figure 2, the process of extracting lesion features from skin disease data using graph structure and Ricci curvature includes the following steps:
[0069] Preliminary image segmentation of the lesion area is performed on the dermoscopy image to obtain the mask;
[0070] A graph structure is constructed within the masked area, using pixels or superpixels as nodes and spatial and color relationships as edges.
[0071] Ricci curvature is calculated on the graph structure. Ricci curvature is used to reflect abrupt changes, boundaries, and deformed regions of the local structure.
[0072] The Ricci curvature is used as an additional guiding channel or attention weight and fused with feature maps from different levels extracted from dermoscopic images by the backbone network to obtain lesion features.
[0073] Lesion features are extracted using a ResNeSt feature extraction module enhanced with geometric flow (Ricci flow). In this ResNeSt feature extraction module, the backbone network uses ResNeSt as the main network and utilizes its Split-Attention mechanism to capture multi-scale semantic features.
[0074] Referring to Figure 6, in the residual block of the ResNeSt feature extraction module, the input is processed by a 3x3 convolutional layer, a batch normalization layer, a ReLU layer, another 3x3 convolutional layer, and a batch normalization layer; the result is fused with the original input and then processed by a ReLU layer. Alternatively, referring to Figure 7, the input is processed by a 3x3 convolutional layer, a batch normalization layer, a ReLU layer, a 3x3 convolutional layer, and a batch normalization layer; the result is fused with the original input after that input has itself passed through a 3x3 convolutional layer, and the fused result is then processed by a ReLU layer.
[0075] Referring to Figure 8, in one form of the ResNeSt feature extraction module, the input is fused with the result of passing it through a 1x1 convolutional layer, a 3x3 convolutional layer, a 1x1 convolutional layer, and a split-attention layer.
[0076] Referring to Figure 9, in another form of the ResNeSt feature extraction module, grouped convolution is used. The input passes through a 1x1 convolutional layer; the result is processed by a 3x3 convolutional layer (32 groups) and a 5x5 convolutional layer (32 groups), and their outputs are fed into the split-attention layer. The result then passes through a 1x1 convolutional layer, whose output is fused with the input.
[0077] Referring to Figure 10, in yet another form of the ResNeSt feature extraction module, the input is fed into K cardinal groups. Each cardinal group has r parallel convolutional paths, each consisting of multiple convolutional blocks. Within each cardinal group, the outputs of the convolutional paths pass through a split-attention layer. The outputs of the cardinal groups are then concatenated by a concatenation layer, and the final result is obtained by fusing the output of the subsequent convolutional layer with the input.
[0078] In the geometric structure modeling process, the ResNeSt feature extraction module first uses U-Net to perform preliminary lesion region segmentation on the input dermoscopy image, obtaining a mask. A graph structure is then constructed within the masked region, using pixels or superpixels as nodes and spatial and color relationships as edges. Ricci curvature is calculated on the graph structure; this curvature sensitively reflects abrupt changes, boundaries, and deformed regions of the local structure. For example, high curvature typically corresponds to edges or regions with complex textures.
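The graph construction and curvature step can be sketched as below. Several assumptions are made for illustration: the source does not say which discrete Ricci curvature is used, so the simple combinatorial Forman-Ricci curvature of an unweighted edge (4 minus the degrees of its two endpoints) is swapped in here; the graph uses pixels as nodes with 4-neighbour edges kept only when the intensity difference is within `color_tol`; and the mask is passed in directly rather than produced by U-Net. The function name is hypothetical.

```python
import numpy as np

def forman_curvature_map(image, mask, color_tol=0.2):
    """Per-pixel mean Forman-Ricci curvature on a pixel graph inside `mask`.

    image: (H, W) float intensities; mask: (H, W) boolean lesion mask.
    """
    h, w = mask.shape
    edges = []
    for r in range(h):
        for c in range(w):
            if not mask[r, c]:
                continue
            for dr, dc in ((0, 1), (1, 0)):          # right and down neighbours
                rr, cc = r + dr, c + dc
                if (rr < h and cc < w and mask[rr, cc]
                        and abs(image[r, c] - image[rr, cc]) <= color_tol):
                    edges.append(((r, c), (rr, cc)))

    degree = np.zeros((h, w), dtype=int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    curv_sum = np.zeros((h, w), dtype=float)
    for u, v in edges:
        f = 4 - degree[u] - degree[v]                # Forman curvature of edge (u, v)
        curv_sum[u] += f
        curv_sum[v] += f

    curvature = np.zeros((h, w), dtype=float)
    connected = degree > 0
    curvature[connected] = curv_sum[connected] / degree[connected]
    return curvature
```

Pixels whose neighbourhood is cut by a colour discontinuity lose edges and therefore receive a different curvature value from interior pixels, which is the property the fusion step relies on.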
[0079] Through structure-guided fusion, the calculated Ricci curvature map is used as an additional guiding channel or attention weight and fused with feature maps from different stages of ResNeSt. For example, the curvature map is multiplied element-wise with the feature map to enhance the network's attention to high curvature regions (i.e., lesion boundaries and key details).
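The element-wise fusion mentioned in the example can be sketched as follows. The min-max normalisation of the curvature magnitude and the residual form `1 + attention` (so that low-curvature regions are not zeroed out) are assumptions, as is the premise that the curvature map has already been resized to the feature map's spatial size.

```python
import numpy as np

def fuse_with_curvature(features, curvature, eps=1e-8):
    """Re-weight a (C, H, W) feature map with an (H, W) curvature map."""
    magnitude = np.abs(curvature)
    # Scale curvature magnitude to [0, 1] so it can act as an attention weight.
    attention = (magnitude - magnitude.min()) / (magnitude.max() - magnitude.min() + eps)
    # Broadcast the attention over channels; high-curvature pixels are amplified.
    return features * (1.0 + attention[None, :, :])
```

The same function can be applied to the feature maps of several backbone stages, each with its own resized curvature map.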
[0080] In the process of extracting lesion features from skin disease data using graph structure and Ricci curvature, the simplified curvature representations corresponding to the feature maps of different levels of the backbone network (ResNeSt) (such as shallow, medium and deep feature maps) are calculated. The simplified curvature representations corresponding to the feature maps of different levels are constrained to maintain consistency through loss function, thereby ensuring that the structural information from local details to global semantics remains stable during the transmission process and achieving cross-scale consistency constraint.
[0081] The ResNeSt feature extraction module enables the network to have explicit geometric structure perception capabilities, allowing for more accurate extraction of lesion features with clear boundaries and rich details, laying a high-quality foundation for subsequent retrieval. This solves the problem of inaccurate feature extraction caused by blurred lesion boundaries, fine structures of small targets, and diverse morphologies in dermoscopic images.
[0082] Hash retrieval is performed using the dynamic hash retrieval module.
[0083] Optimizing the hash table for retrieval performance according to the lesion features includes the following steps:
[0084] Using lesion features and erroneous hash samples as training data, a new hash table is trained. The new hash table is then added to the hash table set, which includes the new hash table and the original K hash tables. The erroneous hash samples are database samples with different labels in historical retrieval.
[0085] The weights are calculated based on the retrieval performance of the hash table set on the training data. Hash tables with low weights are deleted to keep the number of hash tables in the hash table set at K. The weights are determined based on the cross-entropy between the similarity predicted by the hash code and the similarity between the actual labels, as well as the cosine distance between the first hash functions.
[0086] Specifically, during the initialization phase, the core parameters of the module are set, including the total number of hash tables to be maintained (K), the hash code dimension, and the multi-objective weight ratio, where alpha=0.6 and beta=0.4 are the weights of similarity preservation ability and orthogonality, respectively. These parameters are chosen to suit the characteristics of the hospital data (such as disease feature dimension and sample count).
[0087] The first batch of dermoscopy images are preprocessed to obtain standardized feature vectors X_init and corresponding labels y_init (such as disease type, patient ID, etc.), and the first batch of raw feature data is stored as the initial database.
[0088] Based on the initial data, K independent hash tables are trained sequentially.
[0089] The training process for each hash table is as follows:
[0090] Projection matrix initialization: The Hadamard matrix initialization method of the MOHAD algorithm is adopted to generate a projection matrix W (with the dimension of "feature dimension × hash code dimension") with strong orthogonality and high discriminativeness. The orthogonality of the Hadamard matrix reduces hash code redundancy and solves the problem of low encoding quality caused by traditional random initialization.
[0091] Hash code generation: The first batch of feature vectors X_init are projected through the projection matrix W, and then binarized (converted by the sign function, which converts negative numbers to 0 and positive numbers to 1) to obtain the 0 / 1 hash code of the first batch of data, which is stored in the current hash table.
[0092] Error sample recording: The error hash sample identification logic adopts the CIHR algorithm to retrieve the nearest neighbor sample of each sample in the current hash table. If the label of the nearest neighbor sample does not match the label of the sample, it is marked as an "error hash sample" and stored in the current hash table to prepare for subsequent complementary training.
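The three training steps above can be sketched as below. This is an illustration under stated assumptions, not the MOHAD or CIHR algorithms themselves, whose details are not given in the source: the projection matrix is taken as the first columns of a scaled Sylvester-type Hadamard matrix (so the feature dimension must be a power of two), and erroneous samples are found with a brute-force nearest-neighbour search in Hamming space. All function names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def init_projection(feature_dim, n_bits):
    """Projection matrix W (feature_dim x n_bits) with orthonormal columns."""
    return (hadamard(feature_dim) / np.sqrt(feature_dim))[:, :n_bits]

def encode(X, W):
    """Project features and binarize: positive -> 1, otherwise 0."""
    return (X @ W > 0).astype(np.uint8)

def find_error_samples(codes, labels):
    """Indices of samples whose nearest Hamming neighbour has a different label."""
    dist = (codes[:, None, :] != codes[None, :, :]).sum(axis=2)
    np.fill_diagonal(dist, codes.shape[1] + 1)       # exclude self-matches
    nearest = dist.argmin(axis=1)
    return np.flatnonzero(labels[nearest] != labels)
```

For example, `encode(X_init, init_projection(X_init.shape[1], 32))` would produce 32-bit codes for the first batch, and `find_error_samples` would return the entries to record for later complementary training.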
[0093] The system maintains a set of K hash tables. When a new batch of data arrives, the (K+1)th hash table is trained using the new data.
[0094] The system continuously receives new incoming data (such as new patient data and updated image data) and applies the same preprocessing as in the initialization phase to obtain new feature vectors X_new and corresponding labels y_new.
[0095] To improve the error correction and generalization capabilities of the new hash table, a complementary training set is constructed. The new data X_new is used as the basic training data and is merged with the "erroneous hash samples" (i.e., samples with mismatched labels in historical retrievals) stored in all current hash tables. Merging the two forms a complementary training set X_complement with corresponding labels y_complement. The labels of the erroneous samples must be accurately associated with the original database to ensure label accuracy.
[0096] Each hash table's weight is recalculated based on its retrieval performance (such as similarity preservation and encoding discrimination) for new data (or a mixture of old and new data). The old tables with the lowest weights are discarded, maintaining a total of K tables. This ensures the system always consists of hash functions most efficient for the current data environment.
[0097] Specifically, the (K+1)th hash table is trained on the complementary training set. The training process is the same as the hash table training in the initialization phase: the projection matrix is initialized using the Hadamard matrix of MOHAD, and the projection matrix W of the new hash table, the hash codes of the new data, and the error sample record are generated. After training, the new hash table is added to the system's hash table set (at this point, the total number of hash tables becomes K+1).
[0098] For each hash table in the current set of K+1 hash tables, its weight is calculated. The calculation involves the following quantities: the current encoded sample; the projection hyperplane corresponding to the b-th encoded bit in the k-th hash table; the distance from the query sample to the hyperplane at the b-th bit of the k-th hash table; and the minimum and maximum distances to the hyperplane found across all encoded bits. Weight evaluation combines two main objectives, similarity preservation and orthogonality, ensuring that each hash table is both effective and low in redundancy. Similarity preservation is assessed by calculating the cross-entropy between the similarity predicted by the hash codes and the actual label similarity. Lower cross-entropy indicates stronger similarity preservation and a higher score. Orthogonality is assessed by calculating the cosine distance between the projection matrix of the current hash table and those of all other hash tables. Larger cosine distances indicate stronger orthogonality, lower redundancy, and a higher score.
[0099] The final weight = alpha × similarity preservation score + beta × orthogonality score, where alpha and beta are the weight parameters set during initialization (alpha=0.6 and beta=0.4) and can be adjusted according to the scenario. The larger the weight, the better the hash table is adapted to the current data environment.
[0100] Based on the weight evaluation results, the hash table with the lowest weight is selected. This lowest-weight hash table has the worst adaptability to the current data environment and can no longer effectively support retrieval, exhibiting concept drift failure. The hash table with the lowest weight is removed from the hash table set to ensure the system always maintains K hash tables for dynamic updates. The original feature vector X_new and label y_new of the new data are added to the system's original database to update the database content and provide the latest data support for subsequent queries and retrievals.
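The weight evaluation and eviction described in paragraphs [0098] to [0100] can be sketched as follows. This is an illustrative sketch only: the mapping from cross-entropy to a score (`exp(-ce)`), the use of the mean cosine distance over the other tables, the default values of the two weight parameters `a` and `b`, and all function names are assumptions that do not appear in the original description.

```python
import numpy as np

def similarity_score(codes, labels):
    """exp(-cross-entropy) between code-predicted and label similarity, in (0, 1]."""
    n_bits = codes.shape[1]
    # Predicted similarity: fraction of matching bits, from the +/-1 inner product.
    pred = (codes @ codes.T / n_bits + 1.0) / 2.0
    pred = np.clip(pred, 1e-6, 1.0 - 1e-6)
    true = (labels[:, None] == labels[None, :]).astype(float)
    ce = -np.mean(true * np.log(pred) + (1.0 - true) * np.log(1.0 - pred))
    return float(np.exp(-ce))

def orthogonality_score(k, projections):
    """Mean cosine distance between table k's projection matrix and all others."""
    w_k = projections[k].ravel()
    dists = []
    for j, W in enumerate(projections):
        if j == k:
            continue
        w_j = W.ravel()
        cos = w_k @ w_j / (np.linalg.norm(w_k) * np.linalg.norm(w_j))
        dists.append(1.0 - cos)
    return float(np.mean(dists))

def table_weights(projections, X, labels, a=0.5, b=0.5):
    """Final weight = a * similarity score + b * orthogonality score, per table."""
    weights = []
    for k, W in enumerate(projections):
        codes = np.where(X @ W >= 0, 1.0, -1.0)
        weights.append(a * similarity_score(codes, labels)
                       + b * orthogonality_score(k, projections))
    return np.array(weights)

def evict_lowest(projections, weights):
    """Drop the lowest-weight table so that K tables remain."""
    drop = int(np.argmin(weights))
    kept = [W for k, W in enumerate(projections) if k != drop]
    return kept, drop
```

Calling `table_weights` on the K+1 tables and then `evict_lowest` returns the K retained projection matrices together with the index of the removed table.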
[0101] Referring to Figure 3, the first search result is obtained by performing dynamic hash retrieval based on the optimized hash table, including the following steps:
[0102] While preserving the original feature vector of the query sample, the query sample is transformed into a query feature vector;
[0103] Calculate the asymmetric distance between the query feature vector and the hash code of the hash table;
[0104] Calculate the hyperplane distance from the query feature vector to the decision hyperplane of each first hash function, and adjust the adaptive weight of each first hash function based on that hyperplane distance;
[0105] Hamming distance is calculated based on asymmetric distance and adaptive weights;
[0106] Select the database samples with the smallest Hamming distance as the first search results.
[0107] Specifically, the diagnostic system receives external query requests and query samples, preprocesses the query samples to obtain a standardized query feature vector X_query. This eliminates the need for binarization of the query samples, preserving the original location information and avoiding information loss at the query end.
[0108] The query feature vector X_query is input into the K hash tables currently maintained by the system, and the retrieval calculation is performed in parallel. The retrieval process for each hash table is as follows:
[0109] During the asymmetric distance calculation, only the stored binary hash codes (0 / 1 converted to -1 / 1 for easy distance calculation) are used for the samples in the database. The original projection results of the query samples are retained (without binarization). The asymmetric distance between the projection results of the query samples and the hash codes of the database samples is calculated by dot product and vector normalization to avoid the loss of positional information caused by binarization at the query end. The smaller the distance, the more similar the query samples are to the database samples.
[0110] During the adaptive weight calculation process, the distance (i.e., hyperplane distance) from the feature vector of the query sample to the decision hyperplane of the current hash table projection matrix is calculated. The closer the distance to the hyperplane, the less reliable the hash table's encoding of the query sample is, the lower the credibility of the encoding bits, and the lower the weight assigned accordingly. The farther the distance to the hyperplane, the higher the credibility of the encoding, and the higher the weight assigned accordingly, thus realizing dynamic weight adjustment.
[0111] Multiply the asymmetric distance calculated using the current hash table by the adaptive weight to obtain the weighted distance between the query sample and each database sample under this hash table.
[0112] The weighted Hamming distance is expressed as: $D_w(q, x) = \sum_{k=1}^{K} w_k \sum_{i=1}^{B} \left( q_i^{k} \oplus x_i^{k} \right)$;
[0113] where $D_w(q, x)$ is the weighted Hamming distance from the query sample $q$ to the database sample $x$; $K$ is the total number of hash tables; $B$ is the dimension (code length) of each hash table; $w_k$ is the weight of the k-th hash table; $q_i^{k}$ is the code of the query sample at the i-th bit in the k-th hash table; and $x_i^{k}$ is the code of the database sample at the i-th bit in the k-th hash table.
[0114] The weighted distances calculated from the K hash tables are aggregated: for each database sample, its weighted distances under all hash tables are summed to obtain the total weighted distance between the query sample and that database sample. The samples are sorted in ascending order of total weighted distance, with smaller distances indicating higher similarity. The Top-K samples are then selected, where K here denotes the number of search results set by the user, not the number of hash tables.
[0115] The top-K sorted search results (including each database sample's features, its label, and its total weighted distance to the query sample) are output to the user as the first search result.
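The retrieval procedure of paragraphs [0107] to [0115] can be sketched as follows. The sketch assumes one adaptive weight per hash table, computed from the mean normalized hyperplane distance and squashed with `tanh`; that mapping, the cosine-style form of the asymmetric distance, and all function names are illustrative assumptions and are not specified in this description.

```python
import numpy as np

def asymmetric_distance(projection, db_codes):
    """Distance between a real-valued query projection and +/-1 database codes."""
    p = projection / (np.linalg.norm(projection) + 1e-12)
    c = db_codes / np.sqrt(db_codes.shape[1])  # every +/-1 code has norm sqrt(B)
    return 1.0 - c @ p  # in [0, 2]; smaller means more similar

def adaptive_weight(x_query, W):
    """Larger mean distance to the table's hyperplanes gives a larger weight."""
    margins = np.abs(x_query @ W) / (np.linalg.norm(W, axis=0) + 1e-12)
    return float(np.tanh(margins.mean()))

def retrieve(x_query, tables, top_n=3):
    """tables: list of (W, db_codes), W of shape (d, B), db_codes of shape (n, B)."""
    total = np.zeros(tables[0][1].shape[0])
    for W, db_codes in tables:
        # The query projection is kept real-valued; only database codes are binary.
        dist = asymmetric_distance(x_query @ W, db_codes)
        total += adaptive_weight(x_query, W) * dist
    order = np.argsort(total)[:top_n]
    return order, total[order]
```

Because the query is never binarized, a database sample whose code equals the sign pattern of the query projection attains the smallest possible distance in that table, which is the behaviour the asymmetric distance is intended to provide.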
[0116] It solves the problems of data distribution changes (concept drift) caused by continuous data inflow and the loss of location information caused by binarization in traditional hashing methods, enabling the system to continuously adapt to new data and combat concept drift. At the same time, through innovations such as asymmetric distance, it significantly improves retrieval accuracy while maintaining the high speed and low cost of hash retrieval.
[0117] When dermatological data is multimodal, using a fully supervised model, such as OSCMFH, presents challenges. Fully supervised models rely on a large amount of labeled data to train the hash model, but dermoscopy data annotation requires professional physicians, resulting in high annotation costs and long cycles. This is especially true for rare disease cases where labels are scarce, leading to insufficient labeled data in real-world scenarios. Furthermore, the sample size difference between common and rare dermatological diseases is significant (e.g., melanoma versus rare cutaneous lymphoma). OSCMFH's loss function lacks a class balancing mechanism, making the model prone to bias towards majority class features. Existing algorithms are extremely sensitive to long-tailed distributions, leading to a significant decrease in rare disease retrieval accuracy and impacting the comprehensiveness of clinical diagnosis. For example, the ratio of melanoma (a common disease) to cutaneous T-cell lymphoma (a rare disease) is as high as 1000:1. OSCMFH's retrieval accuracy drops by more than 40% in the rare disease category, potentially causing missed diagnoses in clinical practice. In dermoscopy diagnostic scenarios, existing cross-modal retrieval methods can only achieve one-way retrieval, such as "image-to-text" or "text-to-image" searches. Their core limitation lies in their failure to fully utilize the complementary information between text and images. Image features (such as skin lesion color and texture) and text features (such as medical history descriptions and clinical indicators) do not form an effective synergy, resulting in the retrieval process relying solely on a single-modal-dominated matching logic. 
This fails to fully utilize the feature information provided by the samples, leading to data waste, weakening the overall system performance, making it difficult to capture deep correlations between multimodal data, and easily causing missed detections or misjudgments in complex cases (such as similar skin lesions but large differences in medical history).
[0118] In this embodiment, when the skin disease data is multimodal data, the data of each modality is mapped to the hash code of each modality by the trained second hash function. Cross-modal hash retrieval is performed according to the hash code of each modality to obtain cross-modal retrieval results. Multiple cross-modal retrieval results are merged to obtain the second retrieval result.
[0119] The pseudo-labels for the unlabeled data are generated according to the following steps:
[0120] The second hash function corresponding to each modality in round t-1 is used to generate modal hash codes for each unlabeled modal data in round t-1. Based on the modal hash codes in round t-1, the database is retrieved to obtain cross-modal hash codes for round t-1. Based on the cross-modal hash codes in round t-1, pseudo-labels for each modality in round t-1 are generated. The pseudo-labels for each modality in round t-1 with a confidence level greater than a preset confidence threshold are used as the supervision signals for round t, until all rounds of training are completed.
[0121] Multimodal data can include image data, text data, and audio data, among others.
[0122] Specifically, the model is first trained under full supervision to acquire basic cross-modal retrieval capabilities, forming an initial hash model. In the semi-supervised training phase, for unlabeled text-image pairs, the model from the previous round (round t-1) is used to perform image-to-text path retrieval and text-to-image path retrieval respectively, generating probabilistic pseudo-labels (1×c dimensions, where c is the number of categories) for the corresponding modality. Low-reliability labels are filtered using a confidence threshold (e.g., labels with a maximum probability < adaptive threshold τ are deleted), forming a cross-modal mutual-guided soft supervision signal.
[0123] In the t-th round of training, the hard label constraints of labeled data and the cross-modal pseudo-label soft constraints of unlabeled data are combined and integrated into the matrix factorization framework of the original hash model. The corresponding pseudo-labels are introduced for the image branch and the text branch respectively. For example, the pseudo-labels generated by text search image are used for the image branch, and the pseudo-labels generated by image search text are used for the text branch.
[0124] The adaptive threshold τ is adjusted based on the current model's classification accuracy on the validation set, using an adjustment coefficient and the validation-set accuracy. The threshold is relaxed when accuracy is high, so that more pseudo-labels are used, and tightened when accuracy is low, to suppress noise.
[0125] The overall process is as follows:
[0126] Initial stage: Train the basic model using a small amount of labeled data, and construct the initial hash function and latent semantic space.
[0127] Online update phase (round t):
[0128] Pseudo-label generation step: for unlabeled data, the model from round t-1 is used to perform bidirectional retrieval (image-to-text search and text-to-image search) and generate probabilistic pseudo-labels. Specifically, for an unlabeled image, a hash code is generated using the current hash function, text hash codes are retrieved from the database, and the mean of the class distributions of the Top-K results is taken as the image pseudo-label. Text pseudo-labels are generated in the same way. For each sample i, if the image pseudo-label and the text pseudo-label agree on the category and both confidence levels are higher than τ, they are merged into the final pseudo-label;
[0129] Confidence filtering step: retain pseudo-labels with the highest probability above the threshold τ to form a reliable supervision signal;
[0130] Joint optimization steps: Combine the hard labels of labeled data with the pseudo labels of unlabeled data, update the shared semantic matrix and the specific semantic matrix of each modality, and dynamically adjust the model parameters.
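The pseudo-label generation and confidence filtering steps above can be sketched as follows. The sketch assumes plain Hamming-distance retrieval over +/-1 codes and merges two agreeing pseudo-labels by averaging them; the merge rule and the function names are assumptions, since the description does not give the merge formula.

```python
import numpy as np

def retrieve_class_distribution(query_code, db_codes, db_labels, n_classes, top_k=5):
    """Mean one-hot class distribution of the top_k nearest database codes."""
    hamming = np.count_nonzero(db_codes != query_code, axis=1)
    nearest = np.argsort(hamming, kind="stable")[:top_k]
    counts = np.bincount(db_labels[nearest], minlength=n_classes)
    return counts / len(nearest)

def merge_pseudo_labels(p_img, p_txt, tau):
    """Keep a pseudo-label only if both paths agree and both exceed tau."""
    same_class = np.argmax(p_img) == np.argmax(p_txt)
    confident = p_img.max() > tau and p_txt.max() > tau
    if same_class and confident:
        return (p_img + p_txt) / 2.0  # assumed merge rule: simple average
    return None  # filtered out; the sample stays unlabeled this round
```

Here `p_img` would come from retrieving text codes with an image query and `p_txt` from retrieving image codes with a text query, both using the round t-1 hash functions; samples for which `merge_pseudo_labels` returns `None` contribute no supervision signal in round t.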
[0131] During training, an online learning strategy is employed to dynamically update the latent semantic representation of old data, avoiding repeated access to historical data to control computational complexity. Utilizing the mutual teaching and learning between image visual features and textual clinical semantics effectively alleviates the label scarcity problem. Through iterative optimization and confidence screening of pseudo-labels, cross-modal semantic alignment is gradually enhanced, forming a virtuous cycle of "model generating pseudo-labels - pseudo-labels guiding training - training optimizing the model".
[0132] This enables the model to utilize unlabeled data more efficiently, strengthens the association modeling between visual features and clinical descriptions in complex cases, improves cross-modal retrieval accuracy, and maintains online adaptability to streaming new data, providing more robust technical support for dermoscopy-assisted diagnosis.
[0133] In the semi-supervised dynamic cross-modal hash retrieval module, the matrix factorization process is as follows.
[0134] Input the image feature matrix X1 (e.g., CNN features for medical images) and text feature matrix X2 (e.g., BERT encoding for electronic medical records) of the current data block, as well as a small amount of labeled data in the label matrix L (labeling disease types).
[0135] Image features and text features are mapped to their respective modality spaces. The image feature matrix X1 is multiplied by the modality-specific matrix U1 to obtain Z_x1 and U_x1, and the text feature matrix X2 is multiplied by the modality-specific matrix U2 to obtain Z_x2 and U_x2. U_x1 and U_x2 are then fused to yield Z_u.
[0136] For pseudo-label data, the optimization variables are updated according to the corresponding update rules.
[0137] With an alternating optimization strategy, each variable is updated while the other variables are held fixed, so that an analytical solution can be obtained.
[0138] The modality-specific mapping results Z_x1, Z_u, and Z_x2 are combined with the label matrix L, and a shared latent semantic representation is obtained through matrix factorization. The cross-modal shared semantics are then further refined using the shared semantic transformation matrix, finally yielding the latent semantic representation of round t.
[0139] In the objective function of a semi-supervised co-learning model, regularization is used to constrain the class parameters. By adjusting the strength of these constraints, the semi-supervised co-learning model extracts a small number of features corresponding to each class during the learning process. For example, independent L1 constraints are applied to the V parameters of each class c (especially rare diseases), forcing the model to focus on key features of rare diseases and avoid being overwhelmed by features of common diseases.
[0140] A pseudo-label soft constraint term is introduced into the objective. The objective combines the original supervision loss (the shared and specific semantic decomposition terms) with a pseudo-label constraint term scaled by a balancing coefficient. The balancing coefficient is adjusted dynamically, increasing as the confidence level of the pseudo-labels increases.
[0141] To achieve dynamic weight allocation, the sample weights of each category need to be dynamically calculated and updated in each round based on the data distribution of the samples in the current round, with categories that have fewer samples receiving larger weights.
[0142] The number of samples of class k in the current data block is counted, and the class weight is defined from this count.
[0143] The weight definition also uses the maximum number of samples of any category in the current block.
[0144] The column vectors of the modality-specific auxiliary matrix are selected as the target parameters for class-sensitive sparse regularization, and the class weights are introduced to control the strength of the L1 regularization. The column vectors of rare disease categories, which have few samples, are forced to become sparse to suppress noise, while common diseases retain more distinguishing details. This addresses the poor identification of rare samples caused by the class imbalance that rare diseases introduce. Dynamic optimization of the model requires only counting the number of samples per class and updating the weights when each round of the data stream arrives. The approach matches the two major characteristics of dermoscopy scenarios: long-tailed distribution (dynamically balancing class influence) and streaming learning (incremental updates that avoid forgetting historical knowledge), and it provides doctors with "white-box" retrieval results. Compared with the uninterpretable hash codes of the original OSCMFH, it has significant clinical advantages, achieving a deep coupling between the algorithm mechanism and medical needs.
[0145] For the sparse regularization term, L1 regularization is introduced on the modality-specific auxiliary matrix, and the optimization objective is adjusted accordingly. In the adjusted objective, the regularized quantity is the column vector corresponding to the k-th class, and the norm applied to it is the L1 norm.
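The class-weighted sparse regularization of paragraphs [0141] to [0145] can be sketched as follows. The weight formula is not reproduced in this text, so the sketch assumes the form w_k = n_max / n_k, which satisfies the stated property that classes with fewer samples receive larger weights. The penalty coefficient `lam`, the soft-thresholding (proximal) update, and all function names are likewise illustrative assumptions.

```python
import numpy as np

def class_weights(labels, n_classes):
    """Assumed form w_k = n_max / n_k, recomputed for every incoming data block."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    counts = np.maximum(counts, 1.0)  # guard against classes absent from the block
    return counts.max() / counts

def weighted_l1_penalty(V, weights, lam=0.1):
    """lam * sum_k w_k * ||v_k||_1 over the class columns v_k of V."""
    return lam * float(np.sum(weights * np.abs(V).sum(axis=0)))

def soft_threshold_columns(V, weights, step, lam=0.1):
    """Proximal step for the weighted L1 term: column k shrinks by step*lam*w_k."""
    thresh = step * lam * weights[None, :]
    return np.sign(V) * np.maximum(np.abs(V) - thresh, 0.0)
```

With 8 samples of class 0 and 2 samples of class 1, for example, this assumed rule gives weights of 1 and 4, so the rare-class column is shrunk four times as strongly in each proximal step.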
[0146] In the cross-modal image-text hash retrieval model, a pairwise mutual scoring fusion is implemented. The specific technical approach is as follows: For a given image-text query pair (Iq, Tq), the model first performs image-to-text search and text-to-image search respectively. In the image-to-text search process, the query image Iq is encoded into a hash vector, and the Hamming distance or semantic similarity is calculated with the hash vectors of all texts Ti in the database. The one-way matching score sim(Iq→Ti) from image to text is generated and sorted. In the text-to-image search process, the query text Tq is encoded into a hash vector, and the similarity is calculated with the hash vectors of all images Ii in the database. The one-way matching score sim(Tq→Ii) from text to image is generated and sorted.
[0147] Since each image Ii in the database uniquely corresponds to text Ti, the scores sim(Iq→Ti) in the image search text and sim(Tq→Ii) in the text search image can be extracted for each sample Pi=(Ii, Ti). The final matching score with bidirectional calibration is generated through the PRF fusion strategy and used for re-ranking of the search results.
[0148] A weighted average fusion strategy was adopted to fuse the one-way matching scores of image-to-text (I→T) and text-to-image (T→I) to improve the retrieval accuracy in complex scenarios.
[0149] Weighted average fusion introduces a weight parameter $\alpha$ and obtains the final matching score as a linear combination of the image modality score $S_I$ and the text modality score $S_T$: $S_{final} = \alpha S_I + (1 - \alpha) S_T$. In terms of complementary information synergy, dermoscopy diagnosis requires the visual features of images (such as color and texture) and the clinical semantics of text (such as medical history and medication records) to be considered together for collaborative decision-making. Weighted averaging allows the modality weights to be adjusted dynamically through $\alpha$: the image modality weight can be raised in "image-to-text" scenarios (e.g., $\alpha$ = 0.7), while text semantics are emphasized in "text-to-image" scenarios (e.g., $\alpha$ = 0.3), avoiding information bias caused by single-modality dominance. Regarding noise robustness, the quantization loss of hash coding may cause abnormal scores for a single modality; weighted averaging can correct such errors through the stable scores of the other modality. For example, a sample image's hash code might mismatch due to lighting noise, while its text "history of penicillin allergy" scores highly; weighting can then correct the retrieval ranking and reduce the risk of missed diagnoses. From the perspective of clinical needs, the degree to which doctors rely on images and text varies with case type; weighted averaging supports learning $\alpha$ from clinical data, which allows the model to adapt to different scenarios, such as increasing the image weight in melanoma screening and increasing the text weight in the diagnosis of drug-induced dermatitis.
[0150] In image and text retrieval tasks, weighted average fusion improves mAP by approximately 5%-10% compared to single-modality retrieval. Its computational complexity is O(N) (where N is the database sample size) and it involves only linear operations, making it suitable for real-time retrieval in medical big data (single query response time < 50ms). Regarding compatibility with the OSCMFH framework, weighted average fusion applies only to the scoring layer in the retrieval stage, requires no modification to the underlying matrix factorization framework of OSCMFH, and allows direct reuse of its generated hash codes and projection matrices. When new data arrives and the model is updated, the fusion weight $\alpha$ can be dynamically adjusted through incremental learning to stay synchronized with the hash model. Meanwhile, $\alpha$ has a clear physical meaning, supports doctors' participation in parameter tuning, and enhances algorithm transparency. Furthermore, regarding the sample size imbalance (e.g., 1000:1) between common and rare diseases in dermatology data, $\alpha$ can be adjusted to suppress the "majority class bias" of common diseases. For example, reducing the image modality weight in rare disease retrieval ($\alpha$ = 0.4) avoids the model's over-reliance on the visual features of common diseases and increases the retrieval weight of rare disease text semantics, which can improve the retrieval accuracy of rare disease categories by more than 30%.
[0151] Regarding the optimization and verification of weighted average fusion, the weight parameter $\alpha$ can be determined through data-driven optimization, that is, by maximizing the mAP metric on the validation set using grid search (…) to determine the optimal weight. For example, in the preliminary experiments on the dermoscopy dataset, the overall retrieval performance is optimal when $\alpha$ = 0.6. Future development could extend this to an adaptive weight that is adjusted dynamically according to the query type.
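The weighted average fusion and the grid search over the weight parameter can be sketched as follows. The sketch assumes a single weight `alpha` on the image-side score and `1 - alpha` on the text-side score, an 11-point grid on [0, 1], and a standard average-precision computation; the grid, the data layout of `queries`, and the function names are illustrative assumptions.

```python
import numpy as np

def fuse_scores(sim_i2t, sim_t2i, alpha):
    """Linear fusion of the two one-way matching scores for each database pair."""
    return alpha * sim_i2t + (1.0 - alpha) * sim_t2i

def average_precision(scores, relevant):
    """Average precision of a ranking by descending score."""
    order = np.argsort(-scores, kind="stable")
    hits = relevant[order].astype(float)
    if hits.sum() == 0:
        return 0.0
    precision_at_rank = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_rank * hits).sum() / hits.sum())

def grid_search_alpha(queries, alphas=None):
    """queries: list of (sim_i2t, sim_t2i, relevant) triples from a validation set."""
    if alphas is None:
        alphas = np.linspace(0.0, 1.0, 11)
    best_alpha, best_map = None, -1.0
    for alpha in alphas:
        mean_ap = np.mean([average_precision(fuse_scores(a, b, alpha), r)
                           for a, b, r in queries])
        if mean_ap > best_map:
            best_alpha, best_map = float(alpha), float(mean_ap)
    return best_alpha, best_map
```

Because fusion touches only the score arrays, the cost per query is linear in the number of database pairs, and the hash codes and projection matrices are left unchanged.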
[0152] Decision-level fusion operates only on the scoring output layer during the retrieval phase, requiring no modification to the underlying hash coding network. This ensures strong compatibility with existing models and leverages the natural pairing relationship between image and text samples to achieve direct mapping of bidirectional scores, resulting in manageable computational complexity. Cross-validation of bidirectional scores effectively mitigates the bias between visual features and textual semantics in unidirectional retrieval, particularly addressing the semantic information attenuation problem caused by quantization loss in hash coding. The improved model is expected to maintain the efficiency of hash retrieval while enhancing the bidirectional semantic constraints of images and text through PRF, improving the accuracy of cross-modal matching in complex scenarios. When there is ambiguity or complementary information between image visual features and textual descriptions, the bidirectional scoring mutual calibration mechanism can significantly optimize retrieval ranking, enabling the model to more accurately capture the potential semantic relationships between image and text pairs, providing more robust decision support for multimodal retrieval.
[0153] It achieves efficient cross-modal hash retrieval, such as supporting hybrid retrieval of "voice description + text description + image example", helping doctors quickly locate similar cases and shorten the diagnosis cycle. It solves the problems of insufficient utilization of information in single-modal retrieval and the scarcity and imbalance of labels in cross-modal scenarios. By comprehensively utilizing image and text information for retrieval, it significantly improves the accuracy of complex queries and performs robustly even with insufficient labels and the presence of rare diseases.
[0154] In the diagnostic report module based on hash-based retrieval-augmented generation (RAG), referring to Figure 4, the target medical knowledge fragment is obtained by performing a hash search based on the first or second search result, including the following steps:
[0155] The key information from the first or second search results is converted into a query vector.
[0156] Hash the query vector to obtain the query hash code;
[0157] Determine the target hash bucket from the knowledge base based on the query hash code;
[0158] Similarity calculations are performed on the target hash bucket to retrieve the target medical knowledge fragments;
[0159] Referring to Figure 11, the knowledge base is constructed according to the following steps: the knowledge base documents related to dermatology are segmented and vectorized to obtain document vectors, the document vectors are mapped to knowledge base hash codes, the knowledge base hash codes are stored in the index of the hash buckets, and the knowledge base documents corresponding to the same or similar hash codes are stored in the same hash bucket.
[0160] Specifically, professional dermatology medical literature, guidelines, drug instructions, and other knowledge base documents are segmented and vectorized. Then, using techniques such as Locality-Sensitive Hashing (LSH), the high-dimensional document vectors are mapped to hash codes and stored in a "hash bucket" index. Documents with the same or similar hash codes are stored in the same bucket, thereby constructing a hashed medical knowledge base.
[0161] When the system needs to generate a report for a retrieved case, it transforms the case's key information (such as predicted disease type and characteristics) into a query vector. This query vector is then hashed, directly locating a few relevant hash buckets. Precise similarity calculations are performed only within these buckets to quickly identify the most relevant medical knowledge fragments.
[0162] The retrieved authoritative medical knowledge fragments are used as context and combined with the user query (or case summary) to form a prompt, which is then input into a large language model. The model generates the report based on this solid external knowledge, thereby ensuring the professionalism, accuracy, and timeliness of the report content and significantly reducing "hallucinations."
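The hashed knowledge base and prompt assembly of paragraphs [0159] to [0162] can be sketched as follows. The sketch assumes random-hyperplane LSH, probing of buckets within a small Hamming radius of the query code, and cosine similarity inside the selected buckets. The class `HashedKnowledgeBase`, the function `build_prompt`, the prompt wording, and the way document vectors are produced are all hypothetical; the call to the large language model is omitted.

```python
import numpy as np
from collections import defaultdict

class HashedKnowledgeBase:
    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((dim, n_bits))  # random hyperplanes
        self.buckets = defaultdict(list)

    def _code(self, vec):
        return tuple((np.asarray(vec) @ self.planes >= 0).astype(int))

    def add(self, vec, text):
        vec = np.asarray(vec, dtype=float)
        self.buckets[self._code(vec)].append((vec, text))

    def query(self, vec, top_n=2, max_flips=1):
        vec = np.asarray(vec, dtype=float)
        code = self._code(vec)
        # Probe the exact bucket plus buckets whose code differs in few bits.
        candidates = [item
                      for key, items in self.buckets.items()
                      if sum(a != b for a, b in zip(key, code)) <= max_flips
                      for item in items]

        def cosine(v):
            return float(v @ vec / (np.linalg.norm(v) * np.linalg.norm(vec) + 1e-12))

        ranked = sorted(candidates, key=lambda item: -cosine(item[0]))
        return [text for _, text in ranked[:top_n]]

def build_prompt(case_summary, fragments):
    """Combine retrieved fragments (context) with the case summary into a prompt."""
    context = "\n".join(f"- {fragment}" for fragment in fragments)
    return (f"Context:\n{context}\n\nCase summary:\n{case_summary}\n\n"
            "Write a diagnostic report grounded only in the context above.")
```

For brevity this sketch scans the bucket keys to find nearby buckets; an implementation aiming at the efficiency described above would instead enumerate the neighbouring codes directly.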
[0163] This module enables the automatic, rapid, and reliable generation of diagnostic reports. The report content is verifiable, greatly improving the system's usability and user trust.
[0164] Different hospitals deploy their own diagnostic systems. These systems employ different diagnostic models and are all connected to a central server. Federated learning and privacy protection modules address the conflict between data silos between hospitals and patient privacy protection.
[0165] The diagnostic system uses local dermoscopy data to train a local diagnostic model; the trained model gradients or model parameters are encrypted, and then uploaded to the central server.
[0166] The central server uses a weighted federated average algorithm to aggregate model gradients or model parameters uploaded by all diagnostic systems to obtain aggregate parameters. During aggregation, algorithms such as K-means can be used to detect and filter abnormal updates (such as those from attacked nodes or nodes with extremely poor data quality). A new global model is obtained based on the aggregated parameters and then distributed to each diagnostic system.
[0167] Through multiple rounds of iterative training—"local training, secure upload, central aggregation, and global distribution"—the global model benefits from training data from all hospitals, while preventing any single hospital from accessing or reverse-engineering the raw data from other hospitals. Under the premise of strictly protecting patient privacy and hospital data sovereignty, cross-institutional collaborative model training has been achieved, breaking down data silos and significantly improving the generalization ability and robustness of the diagnostic model.
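The central aggregation step of paragraphs [0165] to [0167] can be sketched as follows. The description mentions K-means as one possible way to detect abnormal updates; for brevity this sketch substitutes a simpler median-distance filter, which is a different technique chosen only for illustration. Encryption of the uploads is omitted, and the function names and the `max_ratio` cutoff are assumptions.

```python
import numpy as np

def filter_updates(updates, max_ratio=3.0):
    """Flag updates whose distance to the coordinate-wise median is not an outlier."""
    stacked = np.stack(updates)
    median = np.median(stacked, axis=0)
    dists = np.linalg.norm(stacked - median, axis=1)
    cutoff = max_ratio * (np.median(dists) + 1e-12)
    return dists <= cutoff  # boolean mask, True = keep

def weighted_fedavg(updates, n_samples, keep=None):
    """Average the kept updates, weighting each hospital by its sample count."""
    stacked = np.stack(updates)
    n = np.asarray(n_samples, dtype=float)
    if keep is not None:
        stacked, n = stacked[keep], n[keep]
    return (n[:, None] * stacked).sum(axis=0) / n.sum()
```

In each round, the server would call `filter_updates` on the decrypted uploads, pass the resulting mask to `weighted_fedavg`, and distribute the aggregated parameters back to every diagnostic system.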
[0168] Through the collaborative work of the above modules, the entire chain of problems from feature extraction, dynamic retrieval, multimodal fusion, report generation to privacy protection has been systematically solved, and an efficient, accurate, reliable and evolvable intelligent diagnostic auxiliary system for skin diseases has been built.
[0169] The above is a detailed description of the preferred embodiments of this application, but this application is not limited to the embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of this application, and these equivalent modifications or substitutions are all included within the scope defined by the claims of this application.
Claims
1. A skin disease diagnosis method based on cross-modal hash retrieval and RAG, characterized in that the method includes: acquiring skin disease data and inputting the skin disease data into the diagnostic model; when the skin disease data is unimodal image data, lesion features are extracted from the skin disease data using graph structure and Ricci curvature, the hash table is then optimized according to the lesion features based on retrieval performance, and dynamic hash retrieval is performed based on the optimized hash table to obtain the first retrieval result; when the skin disease data is multimodal data, the data of each modality is mapped to the hash code of each modality through the trained second hash function, cross-modal hash retrieval is performed according to the hash code of each modality to obtain cross-modal retrieval results, and multiple cross-modal retrieval results are fused to obtain the second retrieval result; in the training process of the second hash function, the hard labels of labeled data are combined with the pseudo-labels of unlabeled data to update the shared semantic matrix and each modality matrix, and the training model parameters of the second hash function are dynamically adjusted, wherein the pseudo-labels of the unlabeled data in the current round are generated by the training model of the previous round; the target medical knowledge fragment is obtained by performing a hash retrieval based on the first or second retrieval result; the target medical knowledge fragment is used as context and combined with the user query to form a prompt; and a diagnostic report is generated based on the prompt through a large language model.
2. The skin disease diagnosis method based on cross-modal hash retrieval and RAG according to claim 1, characterized in that, The extraction of lesion features based on the skin disease data using graph structure and Ricci curvature includes: Preliminary image segmentation of the lesion areas was performed on the skin disease data to obtain a mask; A graph structure is constructed within the area of the mask, using pixels or superpixels as nodes and spatial and color relationships as edges. The Ricci curvature is calculated on the graph structure, and the Ricci curvature is used to reflect abrupt changes, boundaries, and deformed regions of the local structure. The Ricci curvature is used as an additional guiding channel or attention weight and fused with feature maps of different levels extracted from the skin disease data by the backbone network to obtain lesion features.
3. The skin disease diagnosis method based on cross-modal hash retrieval and RAG according to claim 2, characterized in that, In the process of extracting lesion features from the skin disease data using graph structure and Ricci curvature, the simplified curvature representations corresponding to the feature maps of different levels of the backbone network are calculated, and the simplified curvature representations corresponding to the feature maps of different levels are kept consistent by constraining them through a loss function.
4. The skin disease diagnosis method based on cross-modal hash retrieval and RAG according to claim 1, characterized in that, The step of optimizing the hash table based on the lesion characteristics according to retrieval performance includes: The lesion features and erroneous hash samples are used as training data to train a new hash table. The new hash table is then added to a hash table set, which includes the new hash table and the original K hash tables. The erroneous hash samples are database samples with different labels in historical retrievals. The weights are calculated based on the retrieval performance of the training data by the hash table set, and the hash tables with low weights are deleted to keep the number of hash tables in the hash table set at K. The weights are determined based on the cross-entropy between the similarity predicted by the hash code and the similarity between the actual labels and the cosine distance between the first hash functions.
5. The skin disease diagnosis method based on cross-modal hash retrieval and RAG according to claim 4, characterized in that, The step of obtaining the first search result by performing dynamic hash retrieval based on the optimized hash table includes: While preserving the original feature vector of the query sample, the query sample is transformed into a query feature vector; Calculate the asymmetric distance between the query feature vector and the hash code of the hash table; The hyperplane distance from the query feature vector to each first hash function decision hyperplane is used to adjust the adaptive weight of the first hash function based on the hyperplane distance. The Hamming distance is calculated based on the asymmetric distance and the adaptive weights. The database samples with the smallest Hamming distance are selected as the first search results.
6. The skin disease diagnosis method based on cross-modal hash retrieval and RAG according to claim 1, characterized in that the pseudo-labels for the unlabeled data are generated according to the following steps: using the second hash function corresponding to each modality in round t-1 to generate modal hash codes for the unlabeled data of each modality in round t-1; searching the database based on the modal hash codes of round t-1 to obtain cross-modal hash codes for round t-1; generating pseudo-labels for each modality in round t-1 based on the cross-modal hash codes of round t-1; and using the pseudo-labels of each modality in round t-1 whose confidence level is greater than a preset confidence threshold as the supervision signals for round t, until all rounds of training are completed.
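The confidence filtering at the end of claim 6 can be illustrated with a small sketch. The claim does not state how a pseudo-label or its confidence is derived from the retrieved codes, so this sketch assumes a majority vote over the labels of the retrieved database samples, with the vote share as the confidence. The function name and input layout are hypothetical.

```python
from collections import Counter

def select_pseudo_labels(retrieved_labels, threshold):
    """Keep pseudo-labels whose vote share exceeds the confidence threshold.

    retrieved_labels maps an unlabeled sample id to the labels of the
    database samples retrieved for it in round t-1 (assumed input layout).
    """
    kept = {}
    for sample_id, labels in retrieved_labels.items():
        if not labels:
            continue  # nothing retrieved, so no pseudo-label can be formed
        label, votes = Counter(labels).most_common(1)[0]
        confidence = votes / len(labels)
        if confidence > threshold:
            kept[sample_id] = (label, confidence)
    return kept
```

Only the samples returned by this filter would serve as supervision signals in round t; the remaining samples stay unlabeled and can be reconsidered in later rounds.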
7. The skin disease diagnosis method based on cross-modal hash retrieval and RAG according to claim 1, characterized in that, in the objective function of the training model of the second hash function, the class parameters are constrained by regularization, and the constraint strength is adjusted so that the semi-supervised collaborative learning model extracts only a small number of features corresponding to the class during the learning process.
8. The skin disease diagnosis method based on cross-modal hash retrieval and RAG according to claim 1, characterized in that the step of obtaining the target medical knowledge fragment by performing a hash search based on the first search result or the second search result includes: converting the key information from the first search result or the second search result into a query vector; hashing the query vector to obtain a query hash code; determining a target hash bucket from the knowledge base based on the query hash code; and performing similarity calculations within the target hash bucket to retrieve the target medical knowledge fragment; wherein the knowledge base is constructed according to the following steps: knowledge base documents related to dermatology are segmented and vectorized to obtain document vectors, the document vectors are mapped to knowledge base hash codes, the knowledge base hash codes are stored in the index of the hash buckets, and knowledge base documents corresponding to the same or similar hash codes are stored in the same hash bucket.
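The bucket-based knowledge retrieval of claim 8 can be sketched as follows. The claim does not name a hash family or similarity measure, so this sketch assumes random-hyperplane hashing and cosine similarity, and it takes document vectors as given rather than performing segmentation and vectorization. For simplicity only identical codes share a bucket. The class `HashBucketIndex` and its methods are hypothetical.

```python
import numpy as np
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity with a small guard against zero vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

class HashBucketIndex:
    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        # Random hyperplanes: one bit per hyperplane (assumed hash family).
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = defaultdict(list)

    def _code(self, vec):
        return tuple((self.planes @ vec > 0).tolist())

    def add(self, fragment, vec):
        """Store a knowledge fragment in the bucket of its hash code."""
        vec = np.asarray(vec, dtype=float)
        self.buckets[self._code(vec)].append((fragment, vec))

    def query(self, vec, top_n=1):
        """Rank the fragments in the query's bucket by cosine similarity."""
        vec = np.asarray(vec, dtype=float)
        candidates = self.buckets.get(self._code(vec), [])
        ranked = sorted(candidates, key=lambda item: -cosine(item[1], vec))
        return [fragment for fragment, _ in ranked[:top_n]]
```

Because the similarity calculation runs only inside one bucket, the query cost depends on the bucket size rather than on the size of the whole knowledge base.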
9. The skin disease diagnosis method based on cross-modal hash retrieval and RAG according to claim 1, characterized in that different diagnostic systems deploy different diagnostic models, and the diagnostic systems are connected to a central server; the method also includes: training the local diagnostic model using dermoscopy data from the diagnostic system; encrypting the trained model gradients or model parameters, and uploading the encrypted model gradients or model parameters to the central server; aggregating, by the central server using a weighted federated averaging algorithm, the model gradients or model parameters uploaded by all diagnostic systems to obtain aggregate parameters; and obtaining a new global model based on the aggregate parameters and distributing the new global model to each diagnostic system.
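The server-side aggregation of claim 9 can be sketched as follows. The claim does not state how the weights are chosen, so this sketch assumes the common choice of weighting each diagnostic system by its number of local training samples. Encryption and decryption of the uploads are omitted here, and the function name and parameter layout are hypothetical.

```python
import numpy as np

def federated_average(client_params, client_sizes):
    """Sample-count-weighted average of per-client parameter dictionaries.

    client_params: list of dicts mapping a parameter name to a NumPy array.
    client_sizes: number of local training samples per client (assumed weights).
    """
    total = float(sum(client_sizes))
    aggregated = {}
    for name in client_params[0]:
        aggregated[name] = sum(
            params[name] * (size / total)
            for params, size in zip(client_params, client_sizes)
        )
    return aggregated
```

For example, two clients holding 1 and 3 samples with parameters `[1.0, 3.0]` and `[3.0, 5.0]` aggregate to `[2.5, 4.5]`, so the larger client pulls the global model further toward its own parameters.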
10. A skin disease diagnostic device based on cross-modal hash retrieval and RAG, characterized in that the device applies the skin disease diagnosis method based on cross-modal hash retrieval and RAG as described in any one of claims 1 to 9.