A rural tourism digitalization evaluation method and system based on multi-modal data

By constructing a multimodal dataset and performing feature encoding, cross-modal fusion, and semantic alignment, the method addresses the lack of deep semantic association modeling for multimodal data in rural tourism evaluation. It enables dynamic, adaptive digital evaluation, improves evaluation accuracy and adaptability, and reduces deployment costs.

CN122243282A · Pending · Publication Date: 2026-06-19 · NORTHWEST NORMAL UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NORTHWEST NORMAL UNIVERSITY
Filing Date
2026-03-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing rural tourism evaluation methods lack the ability to model deep semantic associations of multimodal data. The evaluation index system is static and difficult to adapt to dynamic changes. The fragmented processing of multimodal data leads to a decline in the timeliness and accuracy of evaluation results.

Method used

By constructing a multimodal dataset, performing feature encoding and cross-modal fusion, introducing semantic alignment constraints, establishing a feature-evaluation mapping model and a weight determination model, and conducting end-to-end training through joint optimization of the objective function, deep fusion and dynamic adaptive evaluation of multimodal data are achieved.

Benefits of technology

It achieves deep integration of multimodal data, improves the accuracy and robustness of rural tourism evaluation, adapts to dynamically changing scenarios, and reduces the deployment cost and application threshold of the evaluation system.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a method and system for digital evaluation of rural tourism based on multimodal data. The method includes: collecting multimodal data from rural tourism scenarios to construct a multimodal dataset; encoding the features of each modality to obtain corresponding modal feature vectors; performing cross-modal fusion processing on the modal feature vectors to generate fused feature representations; applying semantic alignment constraints to the modal feature vectors; constructing a feature-evaluation mapping model and a weight determination model; jointly training the feature-evaluation mapping model and the weight determination model by jointly optimizing the objective function; based on the trained model, performing inference on the real-time collected multimodal data to obtain and output the comprehensive digital evaluation result of rural tourism; and simultaneously updating the model online based on newly accessed multimodal data. This invention can significantly improve the accuracy of digital evaluation of rural tourism.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and in particular to a digital evaluation method and system for rural tourism based on multimodal data.

Background Technology

[0002] Digital transformation has become a core driving force for the high-quality development of rural tourism. During this process, various platforms, including scenic area management systems, OTA platforms, social media platforms, IoT sensing devices, and geographic information service platforms, continuously generate massive amounts of multi-source, heterogeneous, and multimodal data. This data encompasses various types of information, such as tourist review texts, scenic area monitoring images, real-time visitor flow statistics, consumption behavior records, and geospatial information like AOI and POI. How to extract valuable knowledge from these rich big data resources to achieve an objective, dynamic, and accurate evaluation of the development status of rural tourism has become a key technical problem that urgently needs to be solved to promote the high-quality development of rural tourism.

[0003] Traditional rural tourism evaluation methods primarily rely on statistical reporting and questionnaire sampling. These methods suffer from limited data sources, long update cycles, and limited processing capabilities, making it difficult to fully utilize the aforementioned multimodal big data resources. In recent years, with the rapid development of big data analytics, some studies have attempted to incorporate multi-source data into evaluation systems, but the following technical limitations still exist:

[0004] First, existing methods typically analyze or simply piece together data from different modalities independently, lacking the ability to model deep semantic relationships between modalities. For example, while tourist reviews reflecting subjective feelings and surveillance images depicting objective scenes describe the same tourism event, existing technologies struggle to establish a semantic correspondence between the two. This results in data fusion remaining at a superficial feature-stitching level, failing to achieve true cross-modal information complementarity.

[0005] Second, traditional evaluation index systems rely excessively on expert experience, and once the index weights are determined, they remain fixed for a long period, failing to adapt to the dynamic changes in rural tourism scenarios. When new data characteristics emerge or tourist preferences shift, static evaluation models struggle to adjust in a timely manner, leading to a decline in the timeliness and accuracy of evaluation results.

[0006] Third, existing big data analysis methods often treat feature extraction, modality fusion, weight allocation, and result prediction in a fragmented manner when dealing with multimodal evaluation tasks. Each module is optimized independently, lacking a unified mathematical framework for end-to-end collaborative optimization, making it difficult to achieve optimal overall performance.

Summary of the Invention

[0007] The purpose of this section is to outline some aspects of embodiments of the present invention and to briefly describe some preferred embodiments. Simplifications or omissions may be made in this section, as well as in the abstract and title of this application, to avoid obscuring the purpose of these documents; however, such simplifications or omissions should not be construed as limiting the scope of the invention.

[0008] To address the aforementioned technical problems, this invention provides the following technical solution: a digital evaluation method for rural tourism based on multimodal data, comprising the following steps:

[0009] Multimodal data from rural tourism scenarios are collected, and the multimodal data is preprocessed to construct a multimodal dataset;

[0010] Each modality in the multimodal dataset is feature-encoded to obtain a corresponding modal feature vector; the modal feature vectors are then subjected to cross-modal fusion processing to generate a fused feature representation; simultaneously, semantic alignment constraints are applied to the modal feature vectors to make semantically related modal features close to each other in the feature space.

[0011] Construct a feature-evaluation mapping model and a weight determination model, wherein the feature-evaluation mapping model is used to map the fused feature representation to a preset evaluation result space to output the evaluation result, and the weight determination model is used to determine the weight of the evaluation dimension based on prior knowledge and data characteristics;

[0012] The feature-evaluation mapping model and the weight determination model are jointly trained by a joint optimization objective function, wherein the joint optimization objective function includes at least an evaluation result prediction loss term and a weight constraint term.

[0013] Based on the trained feature-evaluation mapping model and weight determination model, reasoning is performed on the real-time collected multimodal data to obtain and output the comprehensive evaluation results of rural tourism digitalization; at the same time, the model is updated online based on newly accessed multimodal data.

[0014] As a preferred embodiment of the multimodal data-based digital evaluation method for rural tourism described in this invention, the preprocessing includes:

[0015] The data is subjected to missing data compensation, noise filtering, and scale normalization.

[0016] By establishing spatiotemporal identifiers using unified timestamps and geographic coordinates, data from different modalities can be aligned within a unified spatiotemporal reference system.

[0017] The missing data compensation includes spatial interpolation based on geographic proximity or time-series prediction models based on historical data from the same period; the spatiotemporal alignment specifically involves mapping data to unified geographic grid units and time slices to form structured data units with spatiotemporal labels.

[0018] As a preferred embodiment of the multimodal data-based digital evaluation method for rural tourism described in this invention, the feature encoding specifically includes:

[0019] Natural language processing models are used to semantically encode text data, extract sentiment and topic features, and generate text feature vectors.

[0020] Convolutional neural networks are used to perform scene recognition on image data, extract landscape quality and facility integrity features, and generate image feature vectors.

[0021] Time series analysis methods are used to extract volatility and trend features from numerical statistical data;

[0022] Spatial relationship graphs are constructed from geospatial data using graph neural networks to extract spatial association and accessibility features.

[0023] As a preferred embodiment of the multimodal data-based digital evaluation method for rural tourism described in this invention, the cross-modal fusion processing includes:

[0024] The correlation weights between feature vectors of different modalities are calculated by cross-modal attention mechanism, and adaptive weighted fusion is performed based on the weights to generate a multimodal fusion feature matrix as the fusion feature representation.
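As a minimal illustrative sketch of the cross-modal attention described above, the following assumes that every modal feature vector already has the same length and uses plain scaled dot-product attention without learned query, key, or value projections. The function names and the toy 4-dimensional vectors are hypothetical and are not taken from the invention.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_fusion(features):
    """Fuse same-length modal vectors with scaled dot-product attention.

    features: dict mapping modality name -> list of floats (length D).
    Returns (fused, attention). fused[m] is the attention-weighted mix of all
    modal vectors as seen from modality m, so the rows together form an
    M x D fusion feature matrix.
    """
    names = list(features)
    dim = len(features[names[0]])
    fused, attention = {}, {}
    for query in names:
        # Correlation score between the query modality and every modality.
        scores = [
            sum(a * b for a, b in zip(features[query], features[key])) / math.sqrt(dim)
            for key in names
        ]
        weights = softmax(scores)
        attention[query] = dict(zip(names, weights))
        # Adaptive weighted sum of all modal vectors.
        fused[query] = [
            sum(w * features[key][i] for w, key in zip(weights, names))
            for i in range(dim)
        ]
    return fused, attention

# Toy 4-dimensional vectors standing in for the real modal feature vectors.
modal = {
    "text": [0.9, 0.1, 0.0, 0.2],
    "image": [0.8, 0.2, 0.1, 0.1],
    "numeric": [0.0, 0.7, 0.6, 0.0],
    "geo": [0.1, 0.0, 0.9, 0.3],
}
fused, attention = cross_modal_fusion(modal)
```

In this toy input the text vector attends more strongly to the image vector than to the numeric vector, because their dot product is larger. A trained model would normally learn separate projection matrices for queries, keys, and values.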

[0025] As a preferred embodiment of the multimodal data-based digital evaluation method for rural tourism described in this invention, the semantic alignment constraint is implemented through contrastive learning, specifically including:

[0026] Construct a shared cross-modal embedding space, and project each modal feature onto the shared cross-modal embedding space through its respective mapping network to obtain a unified-dimensional embedding vector;

[0027] In the shared cross-modal embedding space, a semantic consistency loss function is minimized so that semantically related embedding vectors of different modalities are drawn closer together and semantically unrelated embedding vectors are pushed farther apart.
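The text states that the alignment constraint is implemented through contrastive learning but does not fix a particular loss in this passage. As a sketch under that caveat, the following uses an InfoNCE-style loss, which is one common choice and is an assumption here. Row i of `anchors` and row i of `positives` are taken to describe the same scene, and all other rows act as negatives. The function names and the temperature value are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss over paired embedding lists.

    anchors[i] and positives[i] are embeddings of two modalities for the
    same scene; every positives[j] with j != i is treated as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    loss = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]  # -log softmax of the matching pair
    return loss / len(a)

text_emb = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce(text_emb, image_emb)
shuffled_loss = info_nce(text_emb, list(reversed(image_emb)))
```

The loss is small when matching pairs are close in the shared space and large when the pairing is shuffled, which is the behaviour the alignment constraint is meant to encourage.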

[0028] As a preferred embodiment of the multimodal data-based digital evaluation method for rural tourism described in this invention, the weight determination model includes:

[0029] The first weight calculation module is used to determine a first weight set based on expert experience using the analytic hierarchy process;

[0030] The second weight calculation module is used to calculate a second weight set based on the degree of dispersion of the sample data using the entropy weight method;

[0031] The dynamic adjustment coefficient generation module is used to calculate a dynamic adjustment coefficient from the standard deviation of the index data within the current time window, where the calculation includes a smoothing constant;

[0032] The weight fusion module is connected to the first weight calculation module, the second weight calculation module, and the dynamic adjustment coefficient generation module, respectively, and is used to receive the first weight set, the second weight set, and the dynamic adjustment coefficient, and to calculate and generate the final weights from them.
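The four modules above can be sketched as follows. The analytic hierarchy process is approximated with the row geometric-mean method, and the entropy weight method follows its usual textbook form. The formulas for the dynamic adjustment coefficient and for the weight fusion do not survive in this text, so the forms used here (`sigma / (sigma + smoothing)` and a convex combination of the two weight sets) are assumptions made purely for illustration, as are all function names and sample values.

```python
import math
import statistics

def ahp_weights(pairwise):
    """First weight set: row geometric-mean approximation of AHP priorities."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def entropy_weights(samples):
    """Second weight set: entropy weight method on positive sample values.

    samples: list of rows, one row per sample, one column per indicator.
    """
    m, n = len(samples), len(samples[0])
    raw = []
    for j in range(n):
        column = [row[j] for row in samples]
        total = sum(column)
        shares = [v / total for v in column]
        entropy = -sum(s * math.log(s) for s in shares if s > 0) / math.log(m)
        raw.append(1.0 - entropy)  # more dispersion gives a larger weight
    total = sum(raw)
    if total == 0:
        return [1.0 / n] * n  # no dispersion at all: fall back to equal weights
    return [r / total for r in raw]

def dynamic_coefficient(window, smoothing=1.0):
    """ASSUMED form: sigma / (sigma + smoothing), which lies in [0, 1)."""
    sigma = statistics.pstdev(window)
    return sigma / (sigma + smoothing)

def fuse_weights(w_expert, w_data, alpha):
    """ASSUMED fusion: convex combination, renormalised to sum to 1."""
    mixed = [(1.0 - alpha) * a + alpha * b for a, b in zip(w_expert, w_data)]
    total = sum(mixed)
    return [x / total for x in mixed]

# Two indicators: experts judge the first three times as important.
w_expert = ahp_weights([[1.0, 3.0], [1.0 / 3.0, 1.0]])
# Columns: daily visitor count, average rating (hypothetical values).
w_data = entropy_weights([[120.0, 4.2], [300.0, 4.4], [80.0, 4.1], [510.0, 4.3]])
alpha = dynamic_coefficient([100.0, 140.0, 90.0, 160.0], smoothing=30.0)
final = fuse_weights(w_expert, w_data, alpha)
```

Under these assumed forms, a more volatile time window raises `alpha` and shifts the final weights toward the data-driven set, while a calm window keeps them close to the expert set.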

[0033] As a preferred embodiment of the multimodal data-based digital evaluation method for rural tourism described in this invention, the joint optimization objective function is expressed as:

[0034]

[0035] where the objective combines three terms, namely the evaluation result prediction loss, the weight stability loss, and the semantic consistency loss, together with two balance coefficients. The evaluation result prediction loss is either a mean squared error loss or a cross-entropy loss, and the weight stability loss is an L2 regularization term on the weight coefficients.
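The formula itself does not survive in this text. One form that is consistent with the description (three loss terms and two balance coefficients) is given below. All symbol names are assumptions, as is the choice of which terms the balance coefficients multiply; the vector w denotes the final weights of the evaluation dimensions.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{pred}}
  + \lambda_{1}\,\mathcal{L}_{\text{weight}}
  + \lambda_{2}\,\mathcal{L}_{\text{sem}},
\qquad
\mathcal{L}_{\text{weight}} = \lVert \mathbf{w} \rVert_{2}^{2}
```

Here the prediction term would be a mean squared error for a continuous score range or a cross-entropy for preset scoring levels, matching the two options named in the text.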

[0036] As a preferred embodiment of the digital evaluation method for rural tourism based on multimodal data described in this invention, the evaluation result space is a preset scoring level or a continuous score range; the output layer of the feature-evaluation mapping model adopts a Softmax function or a linear activation function, corresponding to the probability distribution or specific score of different evaluation results.

[0037] As a preferred embodiment of the digital evaluation method for rural tourism based on multimodal data described in this invention, the online update is achieved through an incremental learning mechanism: when the cumulative amount of newly accessed multimodal data reaches a preset threshold, or when the model prediction error exceeds a preset tolerance, the online update of the feature-evaluation mapping model and the weight determination model is triggered. During the update, an elastic weight consolidation technique is adopted, in which the Fisher information matrix is introduced to constrain changes to important parameters.
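The trigger rule and the elastic weight consolidation penalty can be sketched as follows. The penalty uses the common diagonal approximation of the Fisher information matrix, which is an assumption here, and the threshold values, parameter values, and function names are hypothetical.

```python
def should_update(new_samples, recent_error, sample_threshold=500, error_tolerance=0.15):
    """Trigger rule: enough new data has accumulated OR the error is too high."""
    return new_samples >= sample_threshold or recent_error > error_tolerance

def ewc_penalty(params, anchor, fisher, strength=1.0):
    """Elastic weight consolidation penalty with a diagonal Fisher estimate.

    penalty = (strength / 2) * sum_i fisher[i] * (params[i] - anchor[i]) ** 2
    anchor holds the parameter values from before the update; a large
    fisher[i] marks parameter i as important, so moving it costs more.
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher)
    )

anchor = [1.0, -2.0]    # parameters learned on earlier data
fisher = [10.0, 0.1]    # first parameter is far more important
candidate = [1.1, -1.0] # parameters proposed during the online update
penalty = ewc_penalty(candidate, anchor, fisher)
trigger_by_volume = should_update(new_samples=620, recent_error=0.05)
trigger_by_error = should_update(new_samples=40, recent_error=0.30)
no_trigger = should_update(new_samples=40, recent_error=0.05)
```

In the example, a shift of 0.1 in the important parameter contributes as much to the penalty as a shift of 1.0 in the unimportant one. During an update the penalty would be added to the training loss on the new data.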

[0038] This invention also provides a rural tourism digital evaluation system based on multimodal data, applied to the above method, comprising:

[0039] A multimodal data acquisition unit is used to collect multimodal data in rural tourism scenarios;

[0040] The spatiotemporal preprocessing unit, connected to the multimodal data acquisition unit, is used to perform missing data compensation, noise filtering, and scale normalization on the acquired data. It also establishes spatiotemporal identifiers through unified timestamps and geographic coordinates, aligns different modal data in a unified spatiotemporal reference system, and outputs a structured multimodal dataset with spatiotemporal labels.

[0041] The feature extraction and alignment unit, connected to the spatiotemporal preprocessing unit, is used to encode features of each modality data to obtain modality feature vectors, perform cross-modality fusion processing on the modality feature vectors to generate fused feature representations, and map different modality features to a shared embedding space for semantic alignment through contrastive learning.

[0042] The model training unit, connected to the feature extraction and alignment unit, is used to construct a feature-evaluation mapping model and a weight determination model, and to perform end-to-end joint training of the two models by jointly optimizing the objective function.

[0043] The evaluation inference unit is connected to the model training unit. It is used to store the trained feature-evaluation mapping model and weight determination model, and to perform inference on the real-time collected multimodal data to output the comprehensive evaluation result of rural tourism digitalization.

[0044] The incremental update unit is connected to the feature extraction and alignment unit and the evaluation inference unit, respectively. It is used to monitor newly accessed multimodal data. When the accumulated data reaches a preset threshold or the model prediction error exceeds a preset tolerance, it triggers the online update of the feature-evaluation mapping model and the weight determination model. During the update process, the elastic weight consolidation technique is used to constrain the changes of important parameters.

[0045] The beneficial effects of this invention are:

[0046] 1. This invention achieves deep fusion of multimodal heterogeneous data, including text, visual, temporal, behavioral, and spatial geographic data, in rural tourism scenarios. It breaks through the information limitations of single data sources and solves the core problems of incomplete data dimensions and one-sided information in traditional evaluation methods. By constructing a joint optimization objective function that includes evaluation result prediction loss, weight stability loss, and semantic consistency loss, and by performing end-to-end joint training on the feature-evaluation mapping model and the weight determination model, multimodal semantic alignment, evaluation index weight allocation, and final evaluation result prediction are collaboratively optimized within a unified mathematical framework. This significantly improves the accuracy, robustness, and generalization ability of digital evaluation in rural tourism.

[0047] 2. This invention constructs a multi-dimensional dynamic digital evaluation system, abandoning the traditional static and subjective evaluation model. By constructing a weight determination model composed of the analytic hierarchy process, entropy weight method, and dynamic adjustment coefficients, it achieves dynamic adaptive integration of expert experience and data-driven characteristics. This allows the weights of evaluation indicators to be automatically adjusted according to the real-time fluctuations of rural tourism data. While maintaining the interpretability of evaluation results, it enhances the adaptability to dynamic changing scenarios, enabling real-time, accurate, and quantitative evaluation of the development status of rural tourism, and completing the digital upgrade of the rural tourism evaluation model.

[0048] 3. This invention introduces an elastic weight consolidation technique based on the Fisher information matrix, which applies differentiated constraints to the update magnitude of historically important parameters during the online model update process. This enables the model to effectively prevent catastrophic forgetting while absorbing features of newly accessed multimodal data, thus achieving long-term adaptive evolution and stability assurance of the evaluation model.

[0049] 4. This invention also provides a supporting digital evaluation system, forming a complete technical closed loop with the evaluation method. The system is compatible with various heterogeneous data sources in rural tourism scenarios, such as existing scenic area smart management platforms, IoT sensing terminals, social media platforms, and geographic information service systems. It can achieve fully automated execution of the entire process of automated collection, standardized preprocessing, fusion analysis, and evaluation output of multimodal data without large-scale modification of existing hardware facilities and management systems. This significantly reduces the deployment cost and application threshold of the rural tourism digital evaluation system and can be quickly adapted to rural tourism scenarios of different scales, resource types, and development stages.

Attached Figure Description

[0050] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. Wherein:

[0051] Figure 1 is a flowchart illustrating the overall process of the digital evaluation method for rural tourism based on multimodal data proposed in this invention.

[0052] Figure 2 is a flowchart illustrating the process of determining the final weights in the weight determination model of the digital evaluation method for rural tourism based on multimodal data described in this invention.

[0053] Figure 3 is a flowchart illustrating the joint training process of the digital evaluation method for rural tourism based on multimodal data described in this invention.

[0054] Figure 4 is a flowchart illustrating the online update and incremental learning process of the digital evaluation method for rural tourism based on multimodal data described in this invention.

Detailed Implementation

[0055] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0056] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0057] Furthermore, the term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor does it refer to a separate or alternative embodiment that is mutually exclusive with other embodiments.

[0058] In addition, the present invention is described in detail with reference to the schematic diagrams. When detailing the embodiments of the present invention, for ease of explanation, the cross-sectional views illustrating the device structure may be partially enlarged, not according to the usual scale. Furthermore, the schematic diagrams are merely examples and should not limit the scope of protection of the present invention. In addition, actual fabrication should include three-dimensional spatial dimensions of length, width, and depth.

[0059] Example 1

[0060] Referring to Figures 1-4, this is the first embodiment of the present invention, which provides a digital evaluation method for rural tourism based on multimodal data. The method can be applied to servers, cloud computing platforms, or edge computing nodes, and is implemented through software programs. It is used to objectively and dynamically quantify the development status of rural tourism scenic spots. Specifically, it includes the following steps:

[0061] S1: Collect multimodal data from rural tourism scenarios, preprocess the multimodal data, and construct a multimodal dataset.

[0062] This step aims to provide a unified and standardized data foundation for subsequent multimodal fusion.

[0063] Multimodal data includes at least text data, image data, numerical statistical data, and geospatial data. Text data is obtained through API interfaces from online travel platforms and social media platforms, including tourist reviews, travelogues, and service evaluations. Image data is obtained through fixed cameras deployed in the scenic area, drone aerial photography equipment, or by crawling photos shared by tourists on social media, including environmental images of the scenic area, tourist check-in photos, and surveillance video frames. Numerical statistical data is obtained through the scenic area management system, including visitor flow data, tourist dwell time data, spending data, and service response time data (such as complaint handling time). Geospatial data is obtained through a GIS (Geographic Information System), including spatial data such as the scenic area's location, road network, and point-of-interest distribution.

[0064] Specifically, preprocessing includes:

[0065] Data is processed by missing data compensation, noise filtering, and scale normalization.

[0066] By establishing spatiotemporal identifiers using unified timestamps and geographic coordinates, different modal data can be aligned within a unified spatiotemporal reference system.

[0067] Through the above preprocessing operations, the original unstructured multimodal data is transformed into a structured data set with a unified spatiotemporal dimension, which can improve data quality and lay a data foundation for subsequent multimodal feature extraction and fusion.

[0068] In one specific implementation, missing data compensation includes spatial interpolation based on geographic proximity or time-series prediction based on historical data from the same period, to ensure data continuity and integrity. Specifically, for time-series numerical data such as visitor flow and spending, if data for some time points is missing due to equipment failure or network interruption, a time-series prediction model based on historical data from the same period is used for completion. For example, a seasonal autoregressive integrated moving average (SARIMA) model takes the data from the same period over the past 7 days as input and predicts the value for the missing period, and the predicted value is used as the completion value. For spatially distributed data, such as visitor flow at the monitoring points within a scenic area, if data for a monitoring point is missing, spatial interpolation based on geographic proximity is used for completion. For example, with inverse distance weighted interpolation, the completion value is the weighted average of the measured data from the 3 to 5 monitoring points closest to the missing point, with the reciprocal of the distance as the weight.
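The inverse distance weighted completion described above can be sketched as follows, assuming that monitoring point positions are given as planar coordinates in metres and that distances are Euclidean. The function name, coordinates, and readings are hypothetical.

```python
import math

def idw_fill(missing_xy, neighbors, k=3):
    """Estimate a missing reading from the k nearest monitoring points.

    neighbors: list of ((x, y), value) pairs that have measured data.
    Each selected point is weighted by the reciprocal of its distance.
    """
    if not neighbors:
        raise ValueError("at least one measured monitoring point is required")
    nearest = sorted(neighbors, key=lambda n: math.dist(missing_xy, n[0]))[:k]
    weighted_sum = weight_total = 0.0
    for xy, value in nearest:
        distance = math.dist(missing_xy, xy)
        if distance == 0.0:
            return value  # a co-located point: use its reading directly
        weight = 1.0 / distance
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

points = [
    ((0.0, 100.0), 120.0),
    ((200.0, 0.0), 60.0),
    ((0.0, -400.0), 40.0),
    ((900.0, 900.0), 999.0),  # too far away to be among the 3 nearest
]
estimate = idw_fill((0.0, 0.0), points, k=3)
```

With distances of 100 m, 200 m, and 400 m, the weights are 0.01, 0.005, and 0.0025, so the estimate is 1.6 / 0.0175, or about 91.4 visitors.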

[0069] In one specific implementation, spatiotemporal alignment involves dividing the rural tourism scenic area into uniformly sized geographic grid units (e.g., 100m × 100m) based on the area and data processing accuracy requirements. The time axis is divided into equally spaced time slices (e.g., 15 minutes). Each piece of multimodal data is mapped to its corresponding grid unit and time slice, forming a structured data unit with spatiotemporal labels. The purpose of spatiotemporal alignment is to enable different modal data describing the same spatiotemporal region to undergo correlation analysis within a unified reference framework, providing a structured data foundation for subsequent cross-modal feature fusion.
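The mapping to grid units and time slices can be sketched as follows, using the example sizes from the text (100 m grid cells, 15-minute slices). The sketch assumes that coordinates have already been projected to metres in a local planar system. The record field names and sample records are hypothetical.

```python
from datetime import datetime, timezone

GRID_M = 100        # grid edge length in metres (example value from the text)
SLICE_S = 15 * 60   # time slice length in seconds (15 minutes)

def spacetime_key(x_m, y_m, timestamp):
    """Map a projected coordinate and a timestamp to (column, row, slice)."""
    column = int(x_m // GRID_M)
    row = int(y_m // GRID_M)
    slice_index = int(timestamp.timestamp() // SLICE_S)
    return column, row, slice_index

def align(records):
    """Group records of any modality by their shared spatiotemporal key."""
    units = {}
    for rec in records:
        key = spacetime_key(rec["x"], rec["y"], rec["time"])
        units.setdefault(key, {}).setdefault(rec["modality"], []).append(rec["payload"])
    return units

def at(hour, minute):
    """Helper for building UTC timestamps on one example day."""
    return datetime(2026, 5, 1, hour, minute, tzinfo=timezone.utc)

records = [
    {"modality": "text", "x": 130.0, "y": 250.0, "time": at(10, 3),
     "payload": "Quiet trail, clean rest area"},
    {"modality": "numeric", "x": 180.0, "y": 210.0, "time": at(10, 12), "payload": 42},
    {"modality": "numeric", "x": 180.0, "y": 210.0, "time": at(10, 16), "payload": 57},
]
units = align(records)
shared_key = spacetime_key(130.0, 250.0, at(10, 3))
```

The review posted at 10:03 and the visitor count recorded at 10:12 fall into the same grid cell and the same 15-minute slice, so they end up in one structured data unit. The count recorded at 10:16 belongs to the next slice.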

[0070] It should be noted that the spatiotemporally aligned structured data is stored in a unified format, and a spatiotemporal index is established to support efficient querying and incremental updates. This provides a standardized data foundation for the subsequent multimodal feature extraction and fusion steps in S2.

[0071] S2: Encode the features of each modality in the multimodal dataset to obtain the corresponding modal feature vectors; perform cross-modal fusion processing on the modal feature vectors to generate fused feature representations; at the same time, apply semantic alignment constraints to the modal feature vectors so that semantically related modal features are close to each other in the feature space.

[0072] This step takes the structured multimodal dataset with spatiotemporal labels output by S1 as input, extracts high-level semantic features of each modality through a deep neural network, performs cross-modal fusion through an attention mechanism, and introduces contrastive learning to semantically align the features to generate a fused feature representation for subsequent evaluation.

[0073] Specifically, feature encoding includes:

[0074] Natural language processing models are used to semantically encode text data, extract sentiment and topic features, and generate text feature vectors.

[0075] Convolutional neural networks are used to perform scene recognition on image data, extract landscape quality and service facility integrity features, and generate image feature vectors.

[0076] Time series analysis methods are used to extract volatility and trend features from numerical statistical data;

[0077] Spatial relationship graphs are constructed from geospatial data using graph neural networks to extract spatial association and accessibility features.

[0078] In one specific implementation, a natural language processing (NLP) model is used to semantically encode the text data, extract sentiment and topic features, and generate a text feature vector. Specifically, the original review text is tokenized, stop words are removed, and the result is truncated or padded to a fixed length (e.g., 128 tokens) to generate a token sequence. A pre-trained BERT-base Chinese model is used as the NLP model. The token sequence is input into BERT, and the last-layer output vector at the [CLS] position is taken as the basic text representation vector. On top of this vector, a sentiment classification head is attached (a two-layer fully connected network with a 256-dimensional hidden layer and a 3-dimensional output layer corresponding to the sentiment categories). It is fine-tuned on data labeled with sentiment tags, and the output of its first hidden layer is used as the sentiment tendency feature vector. In parallel, the basic text representation vector is fed into a topic classification head (a two-layer fully connected network with a 256-dimensional hidden layer and a 10-dimensional output corresponding to preset topics). It is fine-tuned on data labeled with topic tags, and the output of its first hidden layer is used as the topic feature vector. Concatenating the two yields the text feature vector.

[0079] In one specific implementation, a convolutional neural network is used to perform scene recognition on image data, extract landscape quality and facility integrity features, and generate image feature vectors. Specifically, the original image is resized to a uniform size (e.g., 224×224 pixels) and normalized. A pre-trained ResNet-50 is used as the convolutional neural network, and the 2048-dimensional features output from the global average pooling layer are taken as the basic image features. These basic features are fed into a landscape quality scoring head (a two-layer MLP with a 512-dimensional hidden layer and a 1-dimensional output representing the quality score). This head is fine-tuned with a regression task on data labeled with landscape scores, and its hidden layer output is used as the landscape quality feature vector. In parallel, the basic features are fed into a facility multi-label classification head (a two-layer MLP with a 256-dimensional hidden layer and a 10-dimensional output corresponding to preset facility categories). It is fine-tuned on data labeled with facility tags, and its hidden layer output is used as the facility integrity feature vector. Concatenating the two yields the image feature vector.

[0080] In one specific implementation, volatility and trend features of the numerical statistical data are extracted using time series analysis. Specifically, time series data such as visitor flow and spending are organized into time slices. A fixed-length sliding window is constructed for each slice, and the standard deviation, coefficient of variation, range, kurtosis, and skewness of the data within the window are calculated, forming a 5-dimensional volatility feature vector. Linear regression is performed on the data within the window, and the regression slope is taken as the trend strength; the autocorrelation coefficients of the data within the window are calculated, and the lag order corresponding to the maximum autocorrelation is extracted as the potential period length and used as an indicator of periodicity strength; the means of the first and second differences are calculated to reflect the rate of change. Together these form a 6-dimensional trend feature vector. Concatenating the two yields the numerical statistical feature vector.
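Part of this window-level feature extraction can be sketched as follows. The sketch covers the five volatility statistics, the regression slope, and the dominant autocorrelation lag; it uses population (not sample) moments and non-excess kurtosis, which are assumptions because the text does not specify the estimators. Function names and sample windows are hypothetical.

```python
import math

def volatility_features(window):
    """Return [std, coefficient of variation, range, kurtosis, skewness]."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    value_range = max(window) - min(window)
    if std == 0.0:
        return [0.0, 0.0, value_range, 0.0, 0.0]  # constant window
    cv = std / mean if mean else 0.0
    skewness = sum((x - mean) ** 3 for x in window) / (n * std ** 3)
    kurtosis = sum((x - mean) ** 4 for x in window) / (n * std ** 4)
    return [std, cv, value_range, kurtosis, skewness]

def trend_slope(window):
    """Least-squares slope of the values against their index (needs >= 2 points)."""
    n = len(window)
    t_mean = (n - 1) / 2
    y_mean = sum(window) / n
    numerator = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(window))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    return numerator / denominator

def dominant_lag(window, max_lag):
    """Return (lag, autocorrelation) for the lag with the highest autocorrelation."""
    n = len(window)
    mean = sum(window) / n
    denominator = sum((x - mean) ** 2 for x in window)
    best_lag, best_r = 0, float("-inf")
    for lag in range(1, max_lag + 1):
        r = sum((window[t] - mean) * (window[t + lag] - mean) for t in range(n - lag)) / denominator
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

rising = [10.0, 12.0, 14.0, 16.0, 18.0]
volatility = volatility_features(rising)
slope = trend_slope(rising)
lag, strength = dominant_lag([1.0, 5.0, 1.0, 5.0, 1.0, 5.0, 1.0, 5.0], max_lag=4)
```

For the steadily rising window the slope is 2 per time step and the skewness is 0, while the alternating series has its strongest autocorrelation at a lag of 2 steps.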

[0081] In one specific implementation, a spatial relationship graph is constructed from geospatial data, and a graph neural network (GNN) is used to extract spatial association and accessibility features. Specifically, points of interest within the scenic area are used as nodes in the graph, and edges are constructed based on road networks or spatial proximity (e.g., distance less than 200 meters). Each node uses its one-hot encoded node type as its initial feature. The graph is then processed by a two-layer graph convolutional network (GCN) with 128 dimensions in each hidden layer, which outputs 128-dimensional features for each node. Global average pooling is applied to the GCN outputs of all nodes in the graph to obtain the spatial association feature vector of the entire scenic area, which reflects the overall association pattern between nodes. Based on the graph structure, network metrics such as closeness centrality and betweenness centrality are calculated for each node. Closeness centrality is defined as the reciprocal of the sum of the shortest path distances from that node to all other nodes in the graph, while betweenness centrality is defined as the proportion of shortest paths passing through that node out of all shortest paths. The reachability feature vector is obtained by averaging the closeness centrality and betweenness centrality over all nodes. Concatenating the two yields the geospatial feature vector.
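As a non-limiting illustration, the two centrality metrics and the resulting reachability feature vector can be sketched as follows. The sketch assumes an unweighted, undirected graph given as an adjacency dictionary (hop counts stand in for road distances), uses Brandes' algorithm for betweenness without normalization, and uses hypothetical node names; none of these choices is fixed by the embodiment.

```python
from collections import deque

def closeness(adj, node):
    """Reciprocal of the sum of hop distances from node to all reachable nodes."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return 1.0 / total if total > 0 else 0.0

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

def reachability_features(adj):
    """[mean closeness, mean betweenness] over all nodes."""
    n = len(adj)
    between = betweenness(adj)
    mean_closeness = sum(closeness(adj, v) for v in adj) / n
    mean_betweenness = sum(between.values()) / n
    return [mean_closeness, mean_betweenness]

# Hypothetical scenic area: gate - square - lookout, connected in a line.
graph = {"gate": ["square"], "square": ["gate", "lookout"], "lookout": ["square"]}
features = reachability_features(graph)
```

In this three-node line, only the middle node lies on a shortest path between two other nodes, so it alone has non-zero betweenness.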

[0082] It should be noted that, since the modal feature vectors have different dimensions, each modal feature is mapped to the same dimension D=256 through a linear projection layer, yielding modal feature vectors of uniform dimension for subsequent fusion.

[0083] It should also be noted that, through the aforementioned feature encoding steps, high-level semantic feature vectors for each modality have been obtained. However, these feature vectors are still in mutually independent modal spaces, lacking interaction and correlation between different modal features, making it difficult to comprehensively reflect the overall state of rural tourism scenarios. Therefore, cross-modal fusion processing is required.

[0084] Specifically, cross-modal fusion processing includes:

[0085] The correlation weights between the feature vectors of different modalities are calculated by a cross-modal attention mechanism, and adaptive weighted fusion is performed based on these weights to generate a multimodal fusion feature matrix as the fused feature representation.

[0086] It should be noted that the cross-modal attention mechanism is used to interactively fuse features from different modalities. On the one hand, the attention mechanism can adaptively calculate the correlation weights between features of each modality, so that modalities that are highly relevant to the current evaluation task receive higher attention. On the other hand, the weighted fusion integrates features of each modality into a unified feature space to generate a fused feature matrix, thereby breaking through the information limitations of a single modality and realizing a comprehensive representation of tourist perception, environmental landscape, operational status and spatial structure, providing richer and more semantically comprehensive feature inputs for subsequent evaluation models.

[0087] In one specific implementation, the multimodal fusion feature matrix is generated as the fused feature representation through the following steps. First, the projected feature vectors of the modalities are combined into a feature sequence. Then, a multi-head self-attention mechanism is used for cross-modal fusion: the query $Q$, key $K$, and value $V$ are obtained from the feature sequence through linear transformations (all with feature dimension 256), and each is split into 8 heads with a dimension of 32 per head. Attention weights are calculated for each head $h$:

[0088] $A_h = \mathrm{softmax}\left(\dfrac{Q_h K_h^{\top}}{\sqrt{d_k}}\right), \quad d_k = 32$;

[0089] The output of each head is obtained by weighting the values with the attention weights, i.e., $A_h V_h$. The outputs of the 8 heads are concatenated, and a residual connection and layer normalization are applied. The result is then passed through a feedforward network (two linear layers with an intermediate dimension of 1024 and GELU activation), followed by a second residual connection and layer normalization, which outputs the fused feature matrix. This matrix is the desired multimodal fusion feature matrix, where each row corresponds to the enhanced feature of one modality after cross-modal interaction. To meet the input requirements of the subsequent evaluation model (S3), global average pooling can be applied to this matrix to obtain a 256-dimensional fused feature vector.
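As a non-limiting illustration, the attention computation of a single head and the final pooling step can be sketched in plain Python. The sketch assumes standard scaled dot-product attention and, for brevity, omits the learned linear projections, the multi-head split, the residual connections, layer normalization, and the feedforward network; the 2-dimensional toy features are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(queries, keys, values):
    """Scaled dot-product attention for one head; one row per modality."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights.append(softmax(scores))
    width = len(values[0])
    outputs = [[sum(w * v[c] for w, v in zip(row, values)) for c in range(width)]
               for row in weights]
    return weights, outputs

def mean_pool(matrix):
    """Global average pooling over the modality rows."""
    rows = len(matrix)
    return [sum(col) / rows for col in zip(*matrix)]

# Three hypothetical modality features (dimension reduced to 2 for readability).
modal_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, fused = attention_head(modal_features, modal_features, modal_features)
pooled = mean_pool(fused)
```

Each row of `weights` sums to 1, so every fused row is a convex combination of the modality features; `pooled` plays the role of the fused feature vector passed to S3.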

[0090] It should be noted that the attention fusion described above yields a fused feature matrix containing interaction information from each modality. However, the feature representation distance for the same semantics in different modalities is not explicitly constrained. To further eliminate the modal semantic gap and improve the consistency and discriminative power of the fused features, this embodiment introduces semantic alignment constraints.

[0091] Furthermore, semantic alignment constraints are achieved through contrastive learning, specifically including:

[0092] A shared cross-modal embedding space is constructed, and features from each modality are projected onto this space through their respective mapping networks to obtain embedding vectors of uniform dimension. The mapping networks for each modality are designed independently. As an example, a two-layer fully connected neural network can be used, with the first hidden layer having a dimension of 128 and the second output layer having a dimension of 64, using ReLU as the activation function. The output dimension of each modality mapping network is the same (e.g., all set to 64 dimensions) to ensure that embedding vectors from different modalities can be compared in the same space.

[0093] During training, the semantic consistency loss function is minimized, so that semantically related embedding vectors from different modalities are close to each other, while semantically unrelated embedding vectors are far apart. The criterion for determining semantic relevance is as follows: if multimodal data originates from the same scenic area and belongs to the same time slice, it is considered semantically related; otherwise, it is considered semantically unrelated. In actual training, for each training batch (each batch contains several samples), a positive sample pair set P and a negative sample pair set N are constructed from the samples within that batch according to the above rule. Alternatively, a cross-batch sample queuing technique can be used to increase the diversity of negative samples.

[0094] The semantic consistency loss function $L_{sem}$ is exemplarily expressed as follows:

[0095] $L_{sem} = \sum_{(i,j)\in P} \lVert z_i - z_j \rVert_2^2 + \sum_{(i,k)\in N} \max\left(0,\; m - \lVert z_i - z_k \rVert_2\right)^2$

[0096] where:

[0097] $i$, $j$, $k$ are sample indexes, with each index corresponding to a sample of a specific modality;

[0098] $z_i$, $z_j$, $z_k$ are the embedding vectors, in the shared cross-modal embedding space, of the samples with indexes $i$, $j$, $k$;

[0099] $P$ is the set of semantically related sample pairs (positive sample pairs), containing every pair of semantically related samples. The two samples of a pair can come from the same or different modalities;

[0100] $N$ is the set of semantically unrelated sample pairs (negative sample pairs), containing every pair of semantically unrelated samples. The two samples of a pair can come from the same or different modalities;

[0101] $m$ is the preset margin parameter;

[0102] The summation symbols $\sum$ denote summation over all sample pairs in the sets $P$ and $N$, respectively;

[0103] The first term of the semantic consistency loss function $L_{sem}$ brings positive sample pairs closer together, while the second term pushes negative sample pairs further apart: when the distance of a negative pair is less than $m$, a penalty is applied, forcing the two samples to move away from each other.

[0104] For ease of understanding, this embodiment uses a simplified training batch as an example to illustrate how the semantic consistency loss function is calculated:

[0105] Suppose the current training batch contains 3 samples from different modalities, but all belong to the same scenic area "A" and were collected within the same time slice "T1", therefore they are semantically related. Additionally, a sample from a different scenic area "B" is introduced as a semantically unrelated negative sample. After feature extraction and mapping, each sample yields an embedding vector in a shared embedding space (dimensionality simplified to 2D) as follows:

[0106] Sample 1 (text comment): ;

[0107] Sample 2 (Scenic Area Image): ;

[0108] Sample 3 (Passenger Flow Data): (Also belongs to scenic area A, time T1);

[0109] Sample 4 (Image of Scenic Spot B): ;

[0110] Set the margin parameter $m = 1.0$.

[0111] Based on the definition of semantic relevance (e.g., data from the same scenic area at the same time slice are semantically related, while data from different scenic areas or at different times are semantically unrelated), the following sample pairs are constructed:

[0112] The set of positive sample pairs P: all pairs of samples from the same semantic group (scenic spot A, time T1). Specifically:

[0113] Sample 1 and Sample 2: $(1, 2)$;

[0114] Sample 1 and Sample 3: $(1, 3)$;

[0115] Sample 2 and Sample 3: $(2, 3)$;

[0116] That is, $P = \{(1,2), (1,3), (2,3)\}$.

[0117] The negative sample pair set N: each sample from scenic area A is paired with Sample 4 from scenic area B. Specifically:

[0118] Sample 1 and Sample 4: $(1, 4)$;

[0119] Sample 2 and Sample 4: $(2, 4)$;

[0120] Sample 3 and Sample 4: $(3, 4)$; that is, $N = \{(1,4), (2,4), (3,4)\}$.

[0121] Calculate the loss for positive sample pairs: the loss for positive sample pairs is the sum of the squared Euclidean distances of the positive sample pairs, specifically:

[0122] For $(1,2)$: calculate the squared Euclidean distance $\lVert z_1 - z_2 \rVert_2^2$;

[0123] Loss term $\approx 0.02$.

[0124] Similarly, the calculation yields: for $(1,3)$, the loss term is approximately 0.005; for $(2,3)$, the loss term is approximately 0.005.

[0125] The sum of the losses for positive sample pairs is: $0.02 + 0.005 + 0.005 = 0.03$.

[0126] Then, calculate the loss for negative sample pairs: it is the sum of $(m - d)^2$ over the negative sample pairs whose distance $d$ is less than the margin $m$; if the distance of a pair is greater than or equal to $m$, the loss of that pair is 0.

[0127] For $(1,4)$: $d = \lVert z_1 - z_4 \rVert_2 \approx 0.99$;

[0128] Because 0.99 < 1.0, the loss term = $(1.0 - 0.99)^2 = 0.0001$.

[0129] Similarly, the calculation yields: for $(2,4)$, the loss term is approximately 0.02295; for $(3,4)$, the loss term is approximately 0.00653.

[0130] The total loss for the negative sample pairs is: 0.0001 + 0.02295 + 0.00653 = 0.02958.

[0131] Therefore, the value of the semantic consistency loss function $L_{sem}$ is: 0.03 + 0.02958 = 0.05958.

[0132] As can be seen from the above example, the semantic consistency loss function brings the embedding vectors of semantically related samples (same scenic area, same time slice) closer together while pushing away the embedding vectors of semantically unrelated samples (different scenic areas), thereby achieving semantic alignment of multimodal features.
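As a non-limiting illustration, the worked example can be reproduced with a short sketch of a margin-based contrastive loss (squared distance for positive pairs, squared hinge for negative pairs). The four 2-D embedding vectors below are hypothetical values chosen only so that the pairwise distances match the loss terms quoted in the example; they are not taken from the embodiment.

```python
import math

def semantic_consistency_loss(embeddings, positive_pairs, negative_pairs, margin):
    """Return (positive term, negative term, total) of the contrastive loss."""
    def dist(a, b):
        return math.dist(embeddings[a], embeddings[b])
    # Positive pairs: squared Euclidean distance.
    positive = sum(dist(i, j) ** 2 for i, j in positive_pairs)
    # Negative pairs: squared hinge, active only when the distance is below the margin.
    negative = sum(max(0.0, margin - dist(i, k)) ** 2 for i, k in negative_pairs)
    return positive, negative, positive + negative

# Hypothetical embeddings (assumed values, keyed by sample number).
z = {
    1: (0.80, 0.80),  # text comment, scenic area A, time T1
    2: (0.70, 0.70),  # image, scenic area A, time T1
    3: (0.75, 0.75),  # passenger flow, scenic area A, time T1
    4: (0.10, 0.10),  # image, scenic area B
}
P = [(1, 2), (1, 3), (2, 3)]
N = [(1, 4), (2, 4), (3, 4)]

pos, neg, total = semantic_consistency_loss(z, P, N, margin=1.0)
```

With these assumed vectors the positive term is about 0.03 and the negative term about 0.0296, in line with the totals of the example up to the rounding of the distance 0.99.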

[0133] It should be noted that, through the aforementioned semantic alignment constraint mechanism based on contrastive learning, the features of each modality are mapped to a shared cross-modal embedding space and the semantic consistency loss function $L_{sem}$ is minimized. This loss is used as an important component of the joint optimization objective function and is optimized end to end together with the evaluation result prediction loss and the weight stability loss of the subsequent steps, so that feature extraction, semantic alignment, evaluation mapping, and weight allocation promote each other within a unified optimization framework. Gradient backpropagation not only drives the alignment of the features of each modality in the embedding space, but also constrains the parameter update direction of the feature-evaluation mapping model and the weight determination model, thus providing semantically consistent and task-adaptive fused feature inputs for subsequent steps and improving the accuracy of the comprehensive evaluation and the rationality of the weight allocation.

[0134] S3: Construct a feature-evaluation mapping model and a weight determination model. The feature-evaluation mapping model is used to map the fused feature representation to a preset evaluation result space to output the evaluation result. The weight determination model is used to determine the weights of the evaluation dimensions based on prior knowledge and data characteristics.

[0135] This step aims to build two core models: a feature-evaluation mapping model and a weight determination model, and train them through joint optimization so that the models can accurately output the digital evaluation results of rural tourism from multimodal fusion features.

[0136] The feature-evaluation mapping model is a trainable neural network model that maps the fused feature representation output by S2 to a predefined evaluation result space, outputting the corresponding evaluation result. The input dimension of this model is consistent with the fused feature dimension output by S2, and the output layer is designed according to the form of the evaluation result space.

[0137] If the evaluation result is a continuous score range (e.g., 0-100 points), the output layer uses a single neuron with a linear activation function, and directly outputs the predicted score.

[0138] If the evaluation result is a discrete rating level (e.g., four levels: A, B, C, and D), the output layer uses the same number of neurons as the level, the activation function is Softmax, and the output layer outputs the probability distribution of each level.

[0139] The hidden layer structure of the feature-evaluation mapping model employs a multilayer perceptron, residual network, or other neural network architectures suitable for regression / classification tasks. The specific number of layers, neurons, and activation functions can be adjusted according to the actual application scenario and data scale. In this embodiment, as an optional implementation, a three-layer fully connected network is used, with 256, 128, and 64 neurons in the hidden layer, respectively. The ReLU activation function is used, and Dropout is introduced during training to prevent overfitting. However, it should be noted that this structure is only an exemplary implementation and does not constitute a limitation of the present invention. Those skilled in the art can choose other equivalent neural network structures to achieve the same mapping function as needed.

[0140] Explanation of the evaluation result prediction loss:

[0141] In the subsequent joint training process, it is necessary to define an evaluation result prediction loss, which measures the difference between the output of the feature-evaluation mapping model and the true evaluation label. Its specific form depends on the design of the evaluation result space: when the evaluation result is a continuous score range, the mean squared error loss is usually used; when the evaluation result is a discrete rating level, the cross-entropy loss is usually used. The detailed definition and calculation method are given in step S4.

[0142] Accordingly, the specific structure and working principle of the weight determination model are as follows:

[0143] Specifically, the weight determination model includes:

[0144] The first weight calculation module is used to determine the first weight set $W_{exp}$ based on expert experience using the analytic hierarchy process;

[0145] The second weight calculation module is used to calculate the second weight set $W_{obj}$ based on the degree of dispersion of the sample data using the entropy weight method;

[0146] The dynamic adjustment coefficient generation module is used to calculate the dynamic adjustment coefficient $\alpha$ from the standard deviation $\sigma$ of the indicator data within the current time window, where $\varepsilon$ is a smoothing constant used in that calculation; when $\alpha$ approaches 1, the objective weights dominate; when $\alpha$ approaches 0, the expert weights dominate.

[0147] The weight fusion module, connected to the first weight calculation module, the second weight calculation module, and the dynamic adjustment coefficient generation module, is used to receive the first weight set $W_{exp}$, the second weight set $W_{obj}$, and the dynamic adjustment coefficient $\alpha$, and to generate the final weights in accordance with $W = \alpha W_{obj} + (1-\alpha) W_{exp}$; these final weights serve as the calculation basis of the weight stability loss during joint training, which constrains the fluctuation range of the weight coefficients and prevents overfitting.
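As a non-limiting illustration, the fusion of expert weights and objective weights can be sketched as follows. The closed form of the dynamic adjustment coefficient is not recoverable from the text; the sketch assumes alpha = sigma / (sigma + epsilon), which is merely one form that approaches 1 for strongly fluctuating data and 0 for stable data. The function names and sample values are likewise hypothetical.

```python
import statistics

def dynamic_coefficient(window_values, epsilon=0.1):
    """ASSUMED form: alpha = sigma / (sigma + epsilon), with sigma the window std."""
    sigma = statistics.pstdev(window_values)
    return sigma / (sigma + epsilon)

def fuse_weights(expert_weights, objective_weights, alpha):
    """Convex combination: alpha * objective + (1 - alpha) * expert."""
    fused = [alpha * o + (1 - alpha) * e
             for e, o in zip(expert_weights, objective_weights)]
    total = sum(fused)
    # Renormalize defensively so the final weights sum to 1.
    return [w / total for w in fused]

expert = [0.5, 0.3, 0.2]      # hypothetical AHP weights
objective = [0.2, 0.3, 0.5]   # hypothetical entropy weights

alpha_stable = dynamic_coefficient([3.0, 3.0, 3.0])   # no fluctuation
alpha_mid = dynamic_coefficient([0.0, 0.2])           # sigma equals epsilon
final_stable = fuse_weights(expert, objective, alpha_stable)
final_mid = fuse_weights(expert, objective, alpha_mid)
```

Under the assumed form, a window without fluctuation yields alpha = 0 and the final weights coincide with the expert weights, while a window whose standard deviation equals the smoothing constant yields alpha = 0.5.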

[0148] In one specific implementation, the first weight set is determined by expert experience based on the analytic hierarchy process (AHP). The specific steps are as follows:

[0149] Construct a hierarchical structure: target layer (digital evaluation of rural tourism), criteria layer (various evaluation dimensions), and indicator layer (specific indicators).

[0150] Invite $E$ experts to conduct pairwise comparisons of the indicators at each level and construct judgment matrices using the 1-9 scale method.

[0151] Calculate the eigenvector corresponding to the largest eigenvalue of each judgment matrix, and normalize it to obtain the weight vector of that expert.

[0152] Take the arithmetic mean of the weight vectors of the $E$ experts to obtain the final first weight set $W_{exp} = (w^{exp}_1, \dots, w^{exp}_n)$, where $n$ is the number of evaluation dimensions.

[0153] Perform a consistency check: calculate the consistency ratio $CR = CI / RI$; if $CR < 0.1$, the judgment matrix is accepted; otherwise, the judgment matrix is adjusted.
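As a non-limiting illustration, the analytic hierarchy process steps can be sketched in plain Python. The sketch assumes power iteration to approximate the principal eigenvector, the conventional definition CI = (lambda_max - n) / (n - 1), and the commonly cited random index values of Saaty; the example judgment matrix and expert vectors are hypothetical.

```python
# Commonly cited random index (RI) values for matrix orders 1..9 (assumption).
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(matrix, iterations=100):
    """Return (normalized principal eigenvector, consistency ratio)."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        # Power iteration: multiply by the matrix, then renormalize to sum 1.
        aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(aw)
        w = [x / total for x in aw]
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1) if n > 1 else 0.0
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0
    return w, cr

def average_expert_weights(weight_vectors):
    """Arithmetic mean of the experts' weight vectors, dimension by dimension."""
    count = len(weight_vectors)
    return [sum(col) / count for col in zip(*weight_vectors)]

# Hypothetical, perfectly consistent judgment matrix for three dimensions.
judgment = [
    [1.0, 2.0, 6.0],
    [1 / 2, 1.0, 3.0],
    [1 / 6, 1 / 3, 1.0],
]
weights, cr = ahp_weights(judgment)
first_weight_set = average_expert_weights([weights, [0.4, 0.4, 0.2]])
```

A perfectly consistent matrix such as `judgment` gives a consistency ratio of essentially zero and therefore passes the check.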

[0154] In one specific implementation, the second weight set is calculated based on the degree of dispersion of the sample data using the entropy weight method. The specific steps are as follows:

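[0155] Assume there are $S$ samples and $n$ evaluation indicators, and construct the original data matrix $X = (x_{ij})_{S \times n}$.

[0156] The positive and negative indicators are standardized separately to obtain the standardized matrix $R = (r_{ij})_{S \times n}$, where $r_{ij}$ is the standardized value of the $i$-th sample on the $j$-th indicator.

[0157] Calculate the information entropy of the $j$-th indicator, $e_j = -\dfrac{1}{\ln S}\sum_{i=1}^{S} p_{ij} \ln p_{ij}$ with $p_{ij} = r_{ij} / \sum_{i=1}^{S} r_{ij}$ (it is stipulated that when $p_{ij} = 0$, $p_{ij} \ln p_{ij} = 0$).

[0158] Calculate the coefficient of difference of the $j$-th indicator, $d_j = 1 - e_j$. The larger the coefficient of difference, the greater the amount of information provided by the indicator.

[0159] Calculate the second weight $w^{obj}_j = d_j / \sum_{j=1}^{n} d_j$, which gives the second weight set $W_{obj} = (w^{obj}_1, \dots, w^{obj}_n)$.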
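As a non-limiting illustration, the entropy weight method can be sketched as follows. The sketch assumes min-max standardization for positive and negative indicators and treats a constant indicator as carrying no information; the function name and the sample matrix are hypothetical.

```python
import math

def entropy_weights(matrix, positive):
    """matrix: S sample rows x n indicator columns; positive[j] is True when larger is better."""
    s, n = len(matrix), len(matrix[0])
    differences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        low, high = min(column), max(column)
        span = high - low
        if span == 0:
            standardized = [1.0] * s          # constant indicator: uniform proportions
        elif positive[j]:
            standardized = [(x - low) / span for x in column]
        else:
            standardized = [(high - x) / span for x in column]
        total = sum(standardized)
        proportions = [v / total for v in standardized]
        # Convention: a zero proportion contributes 0 to the entropy.
        entropy = -sum(p * math.log(p) for p in proportions if p > 0) / math.log(s)
        differences.append(max(0.0, 1.0 - entropy))
    diff_total = sum(differences)
    if diff_total == 0:
        return [1.0 / n] * n
    return [d / diff_total for d in differences]

# Hypothetical data: indicator 0 varies across samples, indicator 1 is constant.
data = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
second_weight_set = entropy_weights(data, positive=[True, True])
```

Because the second indicator does not vary across samples, its entropy is maximal and practically all of the weight goes to the first indicator.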

[0160] In a preferred embodiment, the smoothing constant $\varepsilon$ is set to 0.1. This value serves as a benchmark for the expected fluctuation range of the data and can be adjusted based on experience.

[0161] It should be noted that the above weight determination model achieves a dynamic, adaptive fusion of expert experience and data characteristics and provides an adjustable basis for calculating the weight stability loss in the subsequent joint optimization. This improves the model's adaptability to dynamic changes in the data and its generalization performance while maintaining the interpretability of the evaluation results.

[0162] Based on the feature-evaluation mapping model and weight determination model constructed above, this invention performs end-to-end joint training on both by jointly optimizing the objective function, thereby improving evaluation accuracy and weight rationality within a unified optimization framework. Specifically:

[0163] S4: Jointly train the feature-evaluation mapping model and the weight determination model by jointly optimizing the objective function. The joint optimization objective function includes at least the evaluation result prediction loss term and the weight constraint term.

[0164] Specifically, the joint optimization objective function is expressed as:

[0165] $L = L_{eval} + \lambda_1 L_{weight} + \lambda_2 L_{sem}$

[0166] where $L_{eval}$ is the evaluation result prediction loss; $L_{weight}$ is the weight stability loss; $L_{sem}$ is the semantic consistency loss; $\lambda_1$ and $\lambda_2$ are balancing coefficients used to adjust the weight of each loss term in the joint optimization; the evaluation result prediction loss $L_{eval}$ is the mean squared error loss or the cross-entropy loss, and the weight stability loss $L_{weight}$ is the L2 regularization term of the weight coefficients.

[0167] In one specific implementation, the evaluation result prediction loss $L_{eval}$ is calculated according to the design of the evaluation result space, using one of the following two methods:

[0168] Scenario 1: When the evaluation result is a continuous score range (e.g., 0-100 points, or normalized to a fixed interval), the output layer of the feature-evaluation mapping model consists of a single neuron, which directly outputs the predicted score $\hat{y}$ using a linear activation function. In this case, $L_{eval}$ uses the mean squared error loss:

[0169] $L_{eval} = \dfrac{1}{B}\sum_{i=1}^{B}\left(y_i - \hat{y}_i\right)^2$;

[0170] where $B$ is the number of training samples, $y_i$ is the true evaluation score of the $i$-th sample, and $\hat{y}_i$ is the score predicted by the model for that sample.

[0171] Scenario 2: When the evaluation result is a discrete rating level (e.g., four levels: A, B, C, and D), let the number of levels be $C$ (in this embodiment, $C = 4$). The output layer of the feature-evaluation mapping model uses the Softmax activation function, and the output is the probability distribution of the $i$-th sample over the levels, $\hat{p}_i = (\hat{p}_{i1}, \dots, \hat{p}_{iC})$, satisfying $\sum_{c=1}^{C}\hat{p}_{ic} = 1$ and $\hat{p}_{ic} \ge 0$. The true label uses one-hot encoding, that is, if the true level of the $i$-th sample is the $c$-th level, then $y_{ic} = 1$; otherwise $y_{ic} = 0$. In this case, $L_{eval}$ uses the cross-entropy loss:

[0172] $L_{eval} = -\dfrac{1}{B}\sum_{i=1}^{B}\sum_{c=1}^{C} y_{ic}\,\log \hat{p}_{ic}$;

[0173] where $y_{ic}$ is the true label encoding of the $i$-th sample, and $\hat{p}_{ic}$ is the probability, predicted by the model, that the $i$-th sample belongs to the $c$-th level.
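As a non-limiting illustration, the two forms of the evaluation result prediction loss can be sketched as follows. The function names and sample values are hypothetical, and the small clipping constant in the cross-entropy is an assumption added only to avoid taking the logarithm of zero.

```python
import math

def mse_loss(true_scores, predicted_scores):
    """Mean squared error over the training samples (continuous scores)."""
    count = len(true_scores)
    return sum((t - p) ** 2 for t, p in zip(true_scores, predicted_scores)) / count

def cross_entropy_loss(true_levels, predicted_distributions, eps=1e-12):
    """Cross-entropy with one-hot labels: only the true-level probability contributes."""
    total = 0.0
    for level, distribution in zip(true_levels, predicted_distributions):
        total -= math.log(max(distribution[level], eps))
    return total / len(true_levels)

# Scenario 1: hypothetical scores on a 0-100 scale.
regression_loss = mse_loss([80.0, 90.0], [78.0, 94.0])

# Scenario 2: hypothetical levels A..D encoded as indexes 0..3.
classification_loss = cross_entropy_loss(
    [0, 2],
    [[0.5, 0.2, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]],
)
```

Because the true labels are one-hot encoded, the double sum of the cross-entropy reduces to the negative log-probability assigned to the true level of each sample, which is what the second function computes.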

[0174] In one specific implementation, the weight stability loss $L_{weight}$ is used to constrain the fluctuations of the weight coefficients and prevent overfitting. In this embodiment, it is calculated as the L2 regularization term of the weight coefficients:

[0175] $L_{weight} = \sum_{k=1}^{n} w_k^2$;

[0176] where $n$ is the total number of evaluation dimensions, and $w_k$ is the final weight of the $k$-th evaluation dimension, output by the weight fusion module in step S3; the summation symbol $\sum$ indicates that $k$ runs from 1 to $n$, summing over all dimensions. By penalizing excessively large weight values, this loss term encourages a more even distribution of weights and prevents the model from over-relying on a few evaluation dimensions, thus improving generalization ability.

[0177] In a preferred embodiment, the balancing coefficients $\lambda_1$ and $\lambda_2$ are determined as follows: each coefficient is preset as an empirical value; in this embodiment, $\lambda_1 = 0.01$ and $\lambda_2 = 0.1$. It should be noted that these values are only one specific implementation of the present invention, and those skilled in the art can adjust $\lambda_1$ and $\lambda_2$ according to actual application scenarios and data scale, for example by appropriately decreasing $\lambda_1$ and $\lambda_2$ when the evaluation result prediction loss is large, or by appropriately increasing the coefficient of the semantic consistency loss when that loss is large. These adjustments are all equivalent transformations of the present invention and fall within the protection scope of the present invention.
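As a non-limiting illustration, the weight stability loss and the combination of the three loss terms can be sketched as follows. The pairing of the coefficient 0.01 with the weight stability loss and of 0.1 with the semantic consistency loss is an assumption of this sketch, as are the function names and sample values.

```python
def weight_stability_loss(weights):
    """L2 regularization term: sum of squared final weights."""
    return sum(w * w for w in weights)

def joint_objective(eval_loss, weights, semantic_loss, lambda_1=0.01, lambda_2=0.1):
    """Weighted sum of the three loss terms.

    ASSUMPTION: lambda_1 scales the weight stability loss and
    lambda_2 scales the semantic consistency loss.
    """
    return (eval_loss
            + lambda_1 * weight_stability_loss(weights)
            + lambda_2 * semantic_loss)

even_weights = [0.25, 0.25, 0.25, 0.25]      # hypothetical final weights
skewed_weights = [0.70, 0.10, 0.10, 0.10]

even_penalty = weight_stability_loss(even_weights)
skewed_penalty = weight_stability_loss(skewed_weights)
objective = joint_objective(10.0, even_weights, 0.05958)
```

For weights that sum to 1, the L2 term is smallest when the weights are equal, which is why the even weight vector receives the lower penalty in this comparison.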

[0178] It should be noted that, through the above step S4, this invention constructs a joint optimization objective function comprising the evaluation result prediction loss, the weight stability loss, and the semantic consistency loss, and jointly trains the feature-evaluation mapping model and the weight determination model end to end. This enables the model to improve the quality of multimodal feature fusion, the accuracy of evaluation result prediction, and the rationality of weight allocation within a unified optimization framework, and lays the core algorithmic foundation for accurate, stable, and adaptive digital evaluation of rural tourism.

[0179] S5: Based on the trained feature-evaluation mapping model and weight determination model, perform inference on the multimodal data collected in real time to obtain and output the comprehensive digital evaluation result of rural tourism; at the same time, update the model online based on the newly accessed multimodal data.

[0180] This step is based on the feature-evaluation mapping model and weight determination model trained jointly by S4, which enables inference evaluation of real-time collected multimodal data, and updates the model online with new data through an incremental learning mechanism, so that the model can adapt to the dynamic changes of rural tourism scenarios.

[0181] Specifically, online updates are achieved through an incremental learning mechanism:

[0182] When the cumulative amount of newly added multimodal data reaches a preset threshold, or when the model prediction error exceeds a preset tolerance, online updates to the feature-evaluation mapping model and the weight determination model are triggered. During the update process, elastic weight consolidation technology is used, and the changes of important parameters are constrained by introducing the Fisher information matrix.

[0183] In one specific implementation, inference is performed on the real-time acquired multimodal data, and the inference process is as follows:

[0184] The system continuously receives real-time multimodal data streams from step S1, including but not limited to: tourist reviews, surveillance images, visitor statistics, consumption records, and geographic information. For each piece of real-time data, the same preprocessing operation as in S1 is immediately performed to generate structured data units with spatiotemporal labels. Subsequently, the preprocessed real-time structured data is sequentially input into the feature extraction network and the cross-modal attention fusion module of step S2 to obtain a 256-dimensional fused feature vector. The fused feature vector is then fed into the trained feature-evaluation mapping model for forward propagation. The model's structure is consistent with that described in step S3, consisting of a three-layer fully connected network (256→128→64). Based on the design of the evaluation result space, the output layer adopts one of two forms:

[0185] If the evaluation result is a continuous score range (0-100 points), the output layer is a single neuron with a linear activation function that directly outputs the predicted score.

[0186] If the evaluation result is a discrete rating level (e.g., four levels: A, B, C, and D), the output layer consists of four neurons using the Softmax activation function and outputs the probability distribution over the levels.

[0187] At the same time, the weight determination model calculates the dynamic adjustment coefficient $\alpha$ based on the indicator data within the current time window (e.g., the last 30 days) and uses it to fuse the expert weights $W_{exp}$ and the objective weights $W_{obj}$, obtaining the final weight of each evaluation dimension.

[0188] The evaluation results will then be output in multiple formats:

[0189] Data interface output: The evaluation score or rating is encapsulated in JSON format and made available to other systems via API;

[0190] Visual output: A radar chart is generated to show the scores of each evaluation dimension, and a trend chart is generated to show the changes in the evaluation results over time, intuitively reflecting the development status of rural tourism.

[0191] All evaluation results are stored in a historical database to provide data support for subsequent trend analysis and model optimization.
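As a non-limiting illustration, the JSON encapsulation of an evaluation result for the data interface can be sketched as follows. All field names, the dimension names, and the sample values are hypothetical; the embodiment does not prescribe a schema.

```python
import json

def build_evaluation_payload(area_id, time_slice, score, level, dimension_weights):
    """Serialize one evaluation result as a JSON string (hypothetical schema)."""
    payload = {
        "area_id": area_id,
        "time_slice": time_slice,
        "score": score,
        "level": level,
        "dimension_weights": dimension_weights,
    }
    # ensure_ascii=False keeps non-ASCII scenic area names readable.
    return json.dumps(payload, ensure_ascii=False)

message = build_evaluation_payload(
    "A", "T1", 82.5, "B", {"landscape": 0.4, "service": 0.6}
)
decoded = json.loads(message)
```

The same record can be written to the historical database, so that the trend chart and later model optimization read exactly what the API returned.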

[0192] In one specific implementation, the triggering conditions for online updates are monitored as follows: the system maintains two monitors, one of which is a counter $n_{new}$ for the cumulative amount of newly accessed data, and the other a sliding-window prediction error monitor. The counter $n_{new}$ is incremented by 1 after each successful processing of a valid real-time sample (i.e., completion of preprocessing, feature extraction, and inference). When $n_{new}$ reaches the preset threshold $N_{th}$, an incremental update is triggered immediately; in this embodiment, $N_{th} = 1000$. This value can be dynamically adjusted based on data collection frequency and scenario changes. For example, the threshold can be lowered during peak seasons to adapt quickly to new data, and raised during off-seasons to save computing resources. Simultaneously, the system maintains a sliding window of length M=100, storing the model prediction values $\hat{y}_i$ of the most recent 100 samples and their corresponding true values $y_i$. The method for obtaining the true value depends on the type of evaluation result: for continuous score outputs (such as satisfaction ratings), the true value can be obtained through subsequent tourist questionnaires, third-party assessments, or officially released rating data; for discrete level outputs (such as A / B / C / D levels), the true value can be obtained through expert review, official rating results, or user feedback tags.

[0193] Based on the samples within the sliding window, error indices are calculated according to the output type:

[0194] For continuous outputs, calculate the mean absolute error within the sliding window, $MAE = \dfrac{1}{M}\sum_{i=1}^{M}\lvert y_i - \hat{y}_i \rvert$. If MAE exceeds the preset tolerance $\tau_{MAE}$, an update is triggered; in this embodiment, $\tau_{MAE} = 5$ points (assuming a scoring range of 0-100 points).

[0195] For discrete outputs, calculate the classification accuracy within the sliding window, $Acc = \dfrac{1}{M}\sum_{i=1}^{M}\mathbb{1}(\hat{y}_i = y_i)$, where $\mathbb{1}(\cdot)$ is the indicator function, which takes the value 1 if the prediction is correct and 0 otherwise. If Acc falls below the preset tolerance $\tau_{Acc}$, an update is triggered. In this embodiment, $\tau_{Acc} = 0.85$ (i.e., 85%).
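As a non-limiting illustration, the two triggering conditions can be sketched as a small monitor class. The class and method names are hypothetical, and the sketch assumes that the error index is only evaluated once the sliding window is full, a detail the embodiment leaves open.

```python
from collections import deque

class UpdateMonitor:
    """Tracks the new-sample counter and the sliding-window prediction error."""

    def __init__(self, count_threshold=1000, window=100,
                 mae_tolerance=5.0, acc_tolerance=0.85, discrete=False):
        self.count_threshold = count_threshold
        self.mae_tolerance = mae_tolerance
        self.acc_tolerance = acc_tolerance
        self.discrete = discrete
        self.new_samples = 0
        self.pairs = deque(maxlen=window)   # (predicted, true) pairs

    def record_inference(self):
        """Call after each successfully processed real-time sample."""
        self.new_samples += 1

    def record_truth(self, predicted, actual):
        """Call when the true value of an earlier prediction becomes available."""
        self.pairs.append((predicted, actual))

    def window_metric(self):
        """MAE (continuous) or accuracy (discrete); None until the window is full."""
        if len(self.pairs) < self.pairs.maxlen:
            return None   # ASSUMPTION: wait for a full window
        if self.discrete:
            return sum(1 for p, a in self.pairs if p == a) / len(self.pairs)
        return sum(abs(p - a) for p, a in self.pairs) / len(self.pairs)

    def should_update(self):
        if self.new_samples >= self.count_threshold:
            return True
        metric = self.window_metric()
        if metric is None:
            return False
        if self.discrete:
            return metric < self.acc_tolerance
        return metric > self.mae_tolerance

    def reset(self):
        """Call after an incremental update has completed."""
        self.new_samples = 0
        self.pairs.clear()

# Small thresholds so the behaviour is easy to follow.
monitor = UpdateMonitor(count_threshold=3, window=2, mae_tolerance=5.0)
monitor.record_inference()
before = monitor.should_update()
monitor.record_truth(80.0, 70.0)
monitor.record_truth(60.0, 62.0)
after = monitor.should_update()
```

In this run the counter stays below its threshold, but the window MAE of 6 points exceeds the tolerance of 5 points, so the second check requests an update.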

[0196] When any of the triggering conditions is met, the system performs the following operations:

[0197] If the system is configured with a backup model (such as the model of the previous version), it will immediately switch to the backup model to continue providing real-time inference services, while the main model will enter the update process.

[0198] If there is no backup model, real-time inference is paused, and the real-time request is temporarily stored in the message queue. Processing will resume after the update is completed, or a degraded response (such as returning the most recent evaluation result) is returned.

[0199] After the update is complete, switch the updated model to the main model and restore normal real-time inference.

[0200] It should be noted that the above count threshold $N_{th} = 1000$, MAE tolerance $\tau_{MAE} = 5$ points, and accuracy tolerance $\tau_{Acc} = 0.85$ are exemplary values in this embodiment. In practical applications, they can be dynamically adjusted according to factors such as the data scale of rural tourism scenic spots, evaluation accuracy requirements, and business tolerance.

[0201] It should also be noted that if the cumulative amount of newly acquired multimodal data does not reach the preset threshold and the model prediction error does not exceed the preset tolerance, the system will continue to execute the real-time data acquisition and inference process without triggering online updates.

[0202] In one specific implementation, when the triggering condition is met, the system initiates an incremental update process. During the update, an elastic weight consolidation technique is employed, which uses a Fisher information matrix to constrain changes in important parameters and prevent catastrophic forgetting. The specific implementation process is as follows:

[0203] First, before triggering the incremental update, the system selects the most recent batch of historical samples from the historical database to calculate the Fisher information matrix. (Number of historical samples) The value ranges from 2000 to 10000, and can be adjusted according to the number of model parameters, computing resources, and data collection frequency. When the model is large or computing resources are sufficient, the number of samples can be increased to improve the accuracy of importance estimation; conversely, it can be decreased. In this embodiment, the value is... As an example value, experiments have verified that the Fisher information matrix can stably reflect the importance of parameters under this value, and the computational cost is within an acceptable range.

[0204] Subsequently, one forward propagation and one backward propagation are performed for each sample to calculate the importance of each model parameter to the historical data. It should be noted that, based on the design of the evaluation result space in step S3, the model output may be a continuous score (regression task) or a discrete rank (classification task), and the likelihood probability is calculated differently for the two tasks:

[0205] For classification tasks: the model output layer uses a softmax function to output the probability distribution over the levels, and the likelihood probability is the probability value that the softmax output assigns to the true class.

[0206] For regression tasks: Gaussian likelihood functions can be constructed and calculated based on prediction errors.
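A minimal sketch of the two likelihood calculations follows. The function names are illustrative, and the fixed standard deviation `sigma` of the Gaussian likelihood is an assumption, since the embodiment does not state how the variance is chosen.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_log_likelihood(logits, true_class):
    """Log of the softmax probability assigned to the true class."""
    return math.log(softmax(logits)[true_class])


def regression_log_likelihood(prediction, target, sigma=1.0):
    """Gaussian log-likelihood built from the prediction error.

    sigma is an assumed, fixed standard deviation.
    """
    err = target - prediction
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - err ** 2 / (2 * sigma ** 2)
```

With a fixed `sigma`, the gradient of the Gaussian log-likelihood is proportional to the gradient of the squared prediction error, so the regression case reduces to a scaled squared-error term.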

[0207] Based on the above definition, for each model parameter $\theta_i$, the diagonal element $F_i$ of its Fisher information matrix on the historical data is calculated:

[0208] $$F_i = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{\partial \log p(y_n \mid x_n; \theta^{*})}{\partial \theta_i}\right)^{2}$$

[0209] Here, $\theta^{*}$ denotes all the model parameters before the update; $N$ is the number of historical samples; $x_n$ is the input feature of the $n$-th historical sample (i.e., the fused feature representation output by S2); $y_n$ is the true evaluation label of the $n$-th historical sample; $p(y_n \mid x_n; \theta^{*})$ is the likelihood probability of the true label $y_n$ under the model output when $x_n$ is input with the old model parameters $\theta^{*}$; $\log p(y_n \mid x_n; \theta^{*})$ is the log-likelihood, whose gradient $\partial \log p / \partial \theta_i$ can be calculated through a single backpropagation; and $F_i$ is the $i$-th diagonal element of the Fisher information matrix. The larger $F_i$ is, the more important the parameter $\theta_i$ is to the historical data, and the stronger the constraint that should be placed on it in subsequent updates.

[0210] It should be noted that calculating the Fisher information matrix only requires one forward propagation and one backward propagation, and the model parameters remain fixed (not updated), so the computational cost is controllable.
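The diagonal Fisher estimate can be sketched as the mean of squared log-likelihood gradients over the historical samples, which is the form commonly used with elastic weight consolidation. The averaging convention is assumed here rather than confirmed by the embodiment. The logistic model is a stand-in chosen only because its gradient has a closed form; in the embodiment the gradient would come from backpropagation through the trained model.

```python
import math


def fisher_diagonal(grad_log_lik, params, samples):
    """Mean of squared log-likelihood gradients, one value per parameter.

    grad_log_lik(params, x, y) returns d log p(y | x; params) / d params.
    params stay fixed throughout; nothing is updated here.
    """
    fisher = [0.0] * len(params)
    for x, y in samples:
        grad = grad_log_lik(params, x, y)
        for i, g in enumerate(grad):
            fisher[i] += g * g
    n = len(samples)
    return [f / n for f in fisher]


def logistic_grad_log_lik(w, x, y):
    """Closed-form gradient for a stand-in model p(y=1 | x) = sigmoid(w . x)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(y - p) * xi for xi in x]
```

In the stand-in model, a parameter whose input feature is always zero receives a Fisher value of zero, so it would be left unconstrained in later updates.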

[0211] After the Fisher information matrix is obtained, it is used as a quantitative indicator of parameter importance and introduced into the incremental update loss function to construct an Elastic Weight Consolidation (EWC) regularization term:

[0212] $$L_{EWC} = \frac{\lambda}{2}\sum_{i} F_i\left(\theta_i - \theta_i^{*}\right)^{2}$$

[0213] Here, $\theta_i$ is the current value of the $i$-th parameter, $\theta_i^{*}$ is its value before the update, $F_i$ is the corresponding diagonal element of the Fisher information matrix, and $\lambda$ is the regularization strength. This regularization term imposes a larger penalty on changes to important parameters (larger $F_i$) and allows relatively large updates to less important parameters (smaller $F_i$). Finally, the total loss is obtained by weighted summation of the evaluation result prediction loss, the weight stability loss, the semantic consistency loss, and the EWC regularization term. Mini-batch gradient descent updates are performed with the Adam optimizer, so that historical knowledge is effectively preserved while new data features are absorbed, preventing catastrophic forgetting.
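The following sketch shows one common way to write an EWC penalty and combine it with the other loss terms. The quadratic form with a factor of one half, the strength `lam`, and the balance coefficients `alpha` and `beta` are assumptions for illustration; the embodiment states only that the terms are combined by weighted summation.

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Quadratic penalty on parameter drift, weighted by Fisher importance."""
    return 0.5 * lam * sum(
        f * (p - o) ** 2 for p, o, f in zip(params, old_params, fisher)
    )


def ewc_gradient(params, old_params, fisher, lam=1.0):
    """Gradient of the penalty: important parameters are pulled back harder."""
    return [lam * f * (p - o) for p, o, f in zip(params, old_params, fisher)]


def total_loss(pred_loss, weight_loss, sem_loss, ewc, alpha=1.0, beta=1.0):
    """Assumed weighted sum of the loss terms used during incremental updates."""
    return pred_loss + alpha * weight_loss + beta * sem_loss + ewc
```

In this sketch, a parameter with a Fisher value of 9 that has not moved contributes nothing to the penalty, while a parameter with a Fisher value of 4 that has moved by 1 contributes 4 when `lam` is 2.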

[0214] It should be noted that by using the Fisher information matrix as a quantitative indicator of parameter importance and introducing a weighted regularization term based on this matrix into the loss function of incremental updates, the change of key parameters that have made significant contributions to historical tasks during the update process is effectively constrained. This allows for the stable retention of learned knowledge while absorbing new data features, achieving a dynamic balance between adaptive model updates and the retention of historical knowledge.

[0215] In summary, this invention achieves deep fusion of multimodal heterogeneous data, including textual, visual, temporal, behavioral, and spatial geographic data, in rural tourism scenarios. It breaks through the information limitations of single data sources and solves the core problems of incomplete data dimensions and one-sided information in traditional evaluation methods. By constructing a joint optimization objective function that includes evaluation result prediction loss, weight stability loss, and semantic consistency loss, and by performing end-to-end joint training on the feature-evaluation mapping model and the weight determination model, multimodal semantic alignment, evaluation index weight allocation, and final evaluation result prediction are collaboratively optimized within a unified mathematical framework. This significantly improves the accuracy, robustness, and generalization ability of digital evaluation in rural tourism. This invention constructs a multi-dimensional dynamic digital evaluation system, abandoning the traditional static and subjective evaluation model. By building a weight determination model composed of the analytic hierarchy process (AHP), entropy weight method, and dynamic adjustment coefficients, it achieves a dynamic adaptive fusion of expert experience and data-driven features. This allows the weights of evaluation indicators to automatically adjust according to the real-time fluctuations in rural tourism data, enhancing adaptability to dynamically changing scenarios while maintaining the interpretability of evaluation results. It enables real-time, accurate, and quantitative evaluation of the development status of rural tourism, completing the digital upgrade of the rural tourism evaluation model. 
Furthermore, by introducing elastic weight consolidation technology based on the Fisher information matrix, this invention applies differentiated constraints to the update magnitude of historically important parameters during online model updates. This effectively prevents catastrophic forgetting while absorbing newly acquired multimodal data features, achieving long-term adaptive evolution and stability assurance of the evaluation model.
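To make the weight determination idea more concrete, the sketch below computes entropy-method weights and fuses them with expert (AHP) weights. The entropy weight calculation follows the standard textbook form. Two parts are assumptions for illustration only and are not taken from this document: the coefficient `sigma / (sigma + eps)`, and the convex combination that gives data-driven weights more influence when volatility is high.

```python
import math
import statistics


def entropy_weights(rows):
    """Entropy weight method for a samples-by-indicators table.

    Assumes non-negative values and a positive sum in every column.
    """
    n, m = len(rows), len(rows[0])
    divergences = []
    for j in range(m):
        col = [row[j] for row in rows]
        total = sum(col)
        entropy = 0.0
        for v in col:
            p = v / total
            if p > 0:
                entropy -= p * math.log(p)
        entropy /= math.log(n)
        divergences.append(max(0.0, 1.0 - entropy))
    s = sum(divergences)
    if s == 0:                      # all indicators equally uninformative
        return [1.0 / m] * m
    return [d / s for d in divergences]


def dynamic_coefficient(window, eps=1.0):
    """Assumed form: grows with the standard deviation in the current window."""
    sigma = statistics.pstdev(window)
    return sigma / (sigma + eps)    # eps plays the role of the smoothing constant


def fuse_weights(w_ahp, w_entropy, alpha):
    """Assumed convex combination of expert and data-driven weights."""
    fused = [(1 - alpha) * a + alpha * e for a, e in zip(w_ahp, w_entropy)]
    total = sum(fused)
    return [f / total for f in fused]
```

An indicator that takes the same value in every sample has maximal entropy and receives an entropy weight of zero, so its fused weight then comes entirely from the expert side.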

[0216] Embodiment 2 is the second embodiment of the present invention. This embodiment provides a digital evaluation system for rural tourism based on multimodal data, which applies the above method and includes:

[0217] A multimodal data acquisition unit is used to collect multimodal data in rural tourism scenarios;

[0218] The spatiotemporal preprocessing unit, connected to the multimodal data acquisition unit, is used to perform missing data compensation, noise filtering, and scale normalization on the acquired data. It also establishes spatiotemporal identifiers through unified timestamps and geographic coordinates, aligns different modal data in a unified spatiotemporal reference system, and outputs a structured multimodal dataset with spatiotemporal labels.

[0219] The feature extraction and alignment unit, connected to the spatiotemporal preprocessing unit, is used to encode features of each modality data to obtain modality feature vectors, perform cross-modality fusion processing on the modality feature vectors to generate fused feature representations, and map different modality features to a shared embedding space for semantic alignment through contrastive learning;

[0220] The model training unit, connected to the feature extraction and alignment unit, is used to build the feature-evaluation mapping model and the weight determination model, and to jointly train the two models end-to-end by jointly optimizing the objective function.

[0221] The evaluation inference unit, connected to the model training unit, is used to store the trained feature-evaluation mapping model and weight determination model, and to perform inference on the real-time collected multimodal data to output the comprehensive evaluation results of rural tourism digitalization.

[0222] The incremental update unit, connected to the feature extraction and alignment unit and the evaluation inference unit respectively, is used to monitor newly accessed multimodal data. When the accumulated data reaches a preset threshold or the model prediction error exceeds a preset tolerance, it triggers the online update of the feature-evaluation mapping model and the weight determination model. During the update process, the elastic weight consolidation technique is used to constrain the changes of important parameters.

[0223] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A digital evaluation method for rural tourism based on multimodal data, characterized in that it includes the following steps: collecting multimodal data from rural tourism scenarios and preprocessing the multimodal data to construct a multimodal dataset; performing feature encoding on each modality in the multimodal dataset to obtain a corresponding modal feature vector; performing cross-modal fusion processing on the modal feature vectors to generate a fused feature representation, while applying semantic alignment constraints to the modal feature vectors so that semantically related modal features are close to each other in the feature space; constructing a feature-evaluation mapping model and a weight determination model, wherein the feature-evaluation mapping model is used to map the fused feature representation to a preset evaluation result space to output the evaluation result, and the weight determination model is used to determine the weights of the evaluation dimensions based on prior knowledge and data characteristics; jointly training the feature-evaluation mapping model and the weight determination model with a joint optimization objective function, wherein the joint optimization objective function includes at least an evaluation result prediction loss term and a weight constraint term; and, based on the trained feature-evaluation mapping model and weight determination model, performing inference on the multimodal data collected in real time to obtain and output the comprehensive digital evaluation result of rural tourism, while updating the model online based on newly accessed multimodal data.

2. The method for digital evaluation of rural tourism based on multimodal data as described in claim 1, characterized in that the preprocessing includes: performing missing data compensation, noise filtering, and scale normalization on the data; and establishing spatiotemporal identifiers using unified timestamps and geographic coordinates, so that data from different modalities are aligned within a unified spatiotemporal reference system; wherein the missing data compensation includes spatial interpolation based on geographic proximity or a time-series prediction model based on historical data from the same period, and the spatiotemporal alignment specifically involves mapping data to unified geographic grid units and time slices to form structured data units with spatiotemporal labels.

3. The method for digital evaluation of rural tourism based on multimodal data as described in claim 1, characterized in that the feature encoding specifically includes: using natural language processing models to semantically encode text data, extract sentiment and topic features, and generate text feature vectors; using convolutional neural networks to perform scene recognition on image data, extract landscape quality and facility integrity features, and generate image feature vectors; using time series analysis methods to extract volatility and trend features from numerical statistical data; and using graph neural networks to construct spatial relationship graphs from geospatial data and extract spatial association and accessibility features.

4. The method for digital evaluation of rural tourism based on multimodal data as described in claim 1, characterized in that the cross-modal fusion processing includes: calculating the correlation weights between the feature vectors of different modalities through a cross-modal attention mechanism, and performing adaptive weighted fusion based on the weights to generate a multimodal fusion feature matrix as the fused feature representation.

5. The method for digital evaluation of rural tourism based on multimodal data as described in claim 1, characterized in that the semantic alignment constraint is achieved through contrastive learning, specifically including: constructing a shared cross-modal embedding space, and projecting each modal feature onto the shared cross-modal embedding space through its respective mapping network to obtain embedding vectors of unified dimension; and, in the shared cross-modal embedding space, minimizing the semantic consistency loss function so that semantically related embedding vectors of different modalities move closer together and semantically unrelated embedding vectors move farther apart.

6. The method for digital evaluation of rural tourism based on multimodal data as described in claim 1, characterized in that the weight determination model includes: a first weight calculation module, used to determine a first weight set based on expert experience using the analytic hierarchy process; a second weight calculation module, used to calculate a second weight set based on the degree of dispersion of the sample data using the entropy weight method; a dynamic adjustment coefficient generation module, used to calculate a dynamic adjustment coefficient from the standard deviation of the index data within the current time window together with a smoothing constant; and a weight fusion module, connected to the first weight calculation module, the second weight calculation module, and the dynamic adjustment coefficient generation module respectively, and used to receive the first weight set, the second weight set, and the dynamic adjustment coefficient and to calculate and generate the final weights from them.

7. The method for digital evaluation of rural tourism based on multimodal data as described in claim 1, characterized in that the joint optimization objective function is expressed as: $L = L_{pred} + \alpha L_{weight} + \beta L_{sem}$, where $L_{pred}$ is the evaluation result prediction loss, $L_{weight}$ is the weight stability loss, $L_{sem}$ is the semantic consistency loss, and $\alpha$ and $\beta$ are balance coefficients; the evaluation result prediction loss is a mean squared error loss or a cross-entropy loss, and the weight stability loss is an L2 regularization term on the weight coefficients.

8. The method for digital evaluation of rural tourism based on multimodal data as described in claim 1, characterized in that the evaluation result space is a preset set of rating levels or a continuous score range; the output layer of the feature-evaluation mapping model adopts a Softmax function or a linear activation function, outputting respectively the probability distribution over the rating levels or a specific score.

9. The method for digital evaluation of rural tourism based on multimodal data as described in claim 1, characterized in that the online update is achieved through an incremental learning mechanism: when the cumulative amount of newly accessed multimodal data reaches a preset threshold, or when the model prediction error exceeds a preset tolerance, the online update of the feature-evaluation mapping model and the weight determination model is triggered; during the update process, the elastic weight consolidation technique is adopted, and the changes of important parameters are constrained by introducing the Fisher information matrix.

10. A digital evaluation system for rural tourism based on multimodal data, applied to the digital evaluation method for rural tourism based on multimodal data as described in any one of claims 1-9, characterized in that it includes: a multimodal data acquisition unit, used to collect multimodal data in rural tourism scenarios; a spatiotemporal preprocessing unit, connected to the multimodal data acquisition unit and used to perform missing data compensation, noise filtering, and scale normalization on the acquired data, to establish spatiotemporal identifiers through unified timestamps and geographic coordinates, to align data of different modalities in a unified spatiotemporal reference system, and to output a structured multimodal dataset with spatiotemporal labels; a feature extraction and alignment unit, connected to the spatiotemporal preprocessing unit and used to encode the features of each modality's data to obtain modal feature vectors, to perform cross-modal fusion processing on the modal feature vectors to generate fused feature representations, and to map the features of different modalities to a shared embedding space for semantic alignment through contrastive learning; a model training unit, connected to the feature extraction and alignment unit and used to construct a feature-evaluation mapping model and a weight determination model and to jointly train the two models end-to-end with a joint optimization objective function; an evaluation inference unit, connected to the model training unit and used to store the trained feature-evaluation mapping model and weight determination model and to perform inference on the multimodal data collected in real time to output the comprehensive digital evaluation result of rural tourism; and an incremental update unit, connected to the feature extraction and alignment unit and the evaluation inference unit respectively and used to monitor newly accessed multimodal data, to trigger the online update of the feature-evaluation mapping model and the weight determination model when the accumulated data reaches a preset threshold or the model prediction error exceeds a preset tolerance, and to constrain the changes of important parameters during the update process using the elastic weight consolidation technique.