Flood-prone area assessment method and system based on multi-source data sample enhancement

The method combines social media and remote sensing data in a multi-source sample enhancement scheme to generate realistic inundated samples, uses a cluster-enhanced positive-unlabeled (PU) learning method to select non-inundated samples, and constructs a dual-branch assessment model. This addresses the limited assessment accuracy that existing technologies suffer from because they rely on a single data source and ignore the heterogeneity of the underlying surface.

CN122241271A · Pending · Publication Date: 2026-06-19 · SUN YAT SEN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SUN YAT SEN UNIV
Filing Date: 2026-03-24
Publication Date: 2026-06-19

Abstract

This invention proposes a method and system for flood susceptibility assessment based on multi-source data sample enhancement, belonging to the technical field of environmental monitoring. The method includes acquiring multi-source data related to flooding; preprocessing the multi-source data to obtain inundated samples; selecting non-inundated samples based on the inundated samples using a positive sample unlabeled learning method based on cluster enhancement; inputting the inundated samples and the non-inundated samples into a preset two-branch flood susceptibility assessment model; and outputting flood susceptibility assessment results from the two-branch flood susceptibility assessment model. This invention can obtain non-inundated samples that are both accurate and representative, improving the accuracy of flood susceptibility assessment.

Description

Technical Field

[0001] This invention relates to the technical field of environmental monitoring, and in particular to a method and system for assessing flood susceptibility based on multi-source data sample enhancement.

Background Technology

[0002] Flood susceptibility assessment provides decision support for disaster emergency response by scientifically identifying areas with potential flood risks, helping to reduce losses caused by floods. Machine learning can analyze the complex nonlinear relationship between input features and predicted values, and has the advantages of fast prediction speed and high accuracy, making it widely used in flood susceptibility assessment.

[0003] Existing machine-learning approaches to flood susceptibility assessment share a core technical bottleneck: high-quality training samples are difficult to obtain, which severely restricts assessment accuracy and model generalization. The main problems are as follows. First, existing methods mostly rely solely on satellite remote sensing data or ground monitoring data to obtain inundation information. Remote sensing data is easily affected by factors such as image acquisition time and terrain occlusion, performs poorly in urban areas, and fails to cover inundated areas comprehensively. A single data source therefore cannot quickly and completely capture the true inundation situation across the entire region, so the positive samples lack representativeness and realism. This undermines the reliability of model training and makes accurate flood susceptibility assessment difficult. Second, negative samples (i.e., non-inundated samples) in existing training data are often selected from the remaining unknown samples using random or subjective methods. Because inundation information is difficult to obtain and subjective judgment relies heavily on prior knowledge, negative samples selected in this way cannot balance accuracy and spatial representativeness, which biases model training and distorts the assessment results. Third, impervious and permeable surfaces within urban areas differ significantly in their hydrological cycle mechanisms and runoff patterns, so floods form and evolve very differently on them. Conventional assessment methods build a single machine learning model for the entire study area and ignore the differences in hydrological characteristics among underlying surface types. Such a model cannot accurately fit the flood-inducing patterns of each underlying surface, which limits the accuracy of flood susceptibility assessment and fails to meet the needs of refined urban disaster risk assessment.

Summary of the Invention

[0004] To address the problems of existing technologies, such as the reliance on a single data source, difficulty in obtaining accurate and representative non-submerged samples, and poor accuracy in flood susceptibility assessment, this invention proposes a flood susceptibility assessment method and system based on multi-source data sample enhancement. This method can obtain accurate and representative non-submerged samples, thereby improving the accuracy of flood susceptibility assessment.

[0005] To achieve the above technical effects, the technical solution of the present invention is as follows. A flood susceptibility assessment method based on multi-source data sample enhancement includes the following steps:
S1. Obtain multi-source data related to flooding;
S2. Preprocess the multi-source data to obtain inundated samples;
S3. Based on the inundated samples, select non-inundated samples using a cluster-enhanced positive-unlabeled (PU) learning method;
S4. Input the inundated samples and the non-inundated samples into a preset dual-branch flood susceptibility assessment model, which outputs the flood susceptibility assessment result.

[0006] Preferably, the multi-source data includes social media data and remote sensing data, and the flood samples include a first flood sample and a second flood sample.

[0007] Preferably, preprocessing the social media data in the multi-source data to obtain the first flood sample includes:
S211. Collect the text information of the social media data using web crawler technology;
S212. Deduplicate the text information to obtain deduplicated social media text;
S213. Input the deduplicated social media text into a preset deep learning model and output flood-related text;
S214. Extract the geographical locations mentioned in the flood-related text, geocode them to obtain inundation point data, and designate the grid cell in which each inundation point falls as a first flood sample.

[0008] Preferably, preprocessing the remote sensing data in the multi-source data to obtain the second flood sample includes:
S221. Perform median filtering on the remote sensing data to obtain filtered image data;
S222. Mosaic the filtered image data to obtain a pre-disaster radar map and a post-disaster radar map covering the entire study area;
S223. Extract features from the pre-disaster radar map and the post-disaster radar map to obtain feature data;
S224. Classify the pre-disaster radar map and the post-disaster radar map based on the feature data to obtain a pre-disaster water body image and a post-disaster water body image;
S225. Compare the pre-disaster and post-disaster water body images to extract the raster data of the flooded area;
S226. Resample the flooded area raster data, and randomly select from the resampled raster the same number of grid cells as there are inundation points as the second flood sample.

[0009] Preferably, the step of comparing the pre-disaster water body images and the post-disaster water body images to extract the raster data of the flooded area includes: For any given grid cell, if it is determined to be a non-water body in the pre-disaster water body image and a water body in the post-disaster water body image, then the grid cell is recorded as a flooded area grid cell, and all flooded area grid cells constitute the flooded area grid cell data.

[0010] Preferably, selecting non-inundated samples based on the inundated samples using the cluster-enhanced positive-unlabeled (PU) learning method includes:
S31. Based on the selected flood influencing factors, use the Gaussian mixture model algorithm to cluster the unlabeled grid cells of the permeable surface and of the impermeable surface, i.e., the cells that neither lie in the flooded area of the inundated samples nor contain inundation point data, to obtain several representative clusters for each surface type;
S32. Use the PU learning method to calculate the negative class score of each unlabeled grid cell of the permeable surface and of the impermeable surface;
S33. Divide the total number of positive samples by the total number of clusters to obtain the number of negative samples to draw from each cluster of the permeable surface and the impermeable surface;
S34. From each representative cluster of the permeable surface and the impermeable surface, select that number of negative samples in descending order of negative class score. The positive samples are the inundated samples, and the negative samples are the non-inundated samples.
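Steps S33 and S34 can be illustrated with a minimal sketch. It assumes the cluster labels (S31) and negative class scores (S32) have already been computed for one surface type; the function name `select_negatives` and the use of integer (floor) division for the per-cluster quota are illustrative assumptions, not details stated in the text.

```python
import numpy as np

def select_negatives(cluster_labels, negative_scores, n_positive):
    """Pick an equal quota of top-scoring unlabeled cells from every cluster.

    cluster_labels  : cluster id of each unlabeled grid cell (from S31)
    negative_scores : negative class score of each cell (from S32)
    n_positive      : total number of positive (inundated) samples
    """
    cluster_labels = np.asarray(cluster_labels)
    negative_scores = np.asarray(negative_scores, dtype=float)
    clusters = np.unique(cluster_labels)

    # S33: quota per cluster (floor division is an assumption of this sketch).
    per_cluster = n_positive // len(clusters)

    selected = []
    for c in clusters:
        members = np.flatnonzero(cluster_labels == c)
        # S34: sort the cluster's cells by negative score, highest first.
        order = members[np.argsort(-negative_scores[members], kind="stable")]
        selected.extend(order[:per_cluster].tolist())
    return sorted(selected)
```

Because every cluster contributes the same quota, the negatives cannot all come from one region of feature space, which is the representativeness property the method aims for.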

[0011] Preferably, the steps for obtaining the selected flood influencing factors are as follows:
S311. Determine the initial flood influencing factors;
S312. Calculate the Spearman rank correlation coefficient between the initial flood influencing factors, and remove strongly correlated factors according to this coefficient to obtain the initially screened flood influencing factors;
S313. Calculate the variance inflation factor or the tolerance of the initially screened flood influencing factors, and on that basis eliminate factors that exhibit multicollinearity to obtain the selected flood influencing factors.

[0012] Preferably, the dual-branch flood susceptibility assessment model includes an input layer, a dual-branch processing layer, and an output layer. The dual-branch processing layer includes an impermeable surface branch processing layer and a permeable surface branch processing layer. The output end of the input layer is connected to the input ends of the impermeable surface branch processing layer and the permeable surface branch processing layer, respectively. The output ends of the impermeable surface branch processing layer and the permeable surface branch processing layer are connected to the output layer.
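The routing implied by this structure can be sketched as follows. This is only an illustration under stated assumptions: each branch is represented by an arbitrary callable that maps a feature matrix to susceptibility scores, and the names `dual_branch_predict`, `impervious_model`, and `permeable_model` are hypothetical. The text does not specify which learner each branch uses.

```python
import numpy as np

def dual_branch_predict(features, is_impervious, impervious_model, permeable_model):
    """Route each grid cell to the branch that matches its surface type.

    features       : (n_cells, n_factors) array of flood influencing factors
    is_impervious  : boolean array, True where the cell is impermeable surface
    *_model        : callables returning one score per input row (assumed)
    """
    features = np.asarray(features, dtype=float)
    is_impervious = np.asarray(is_impervious, dtype=bool)
    if features.shape[0] != is_impervious.shape[0]:
        raise ValueError("features and is_impervious must have the same length")

    scores = np.empty(features.shape[0], dtype=float)
    if is_impervious.any():
        scores[is_impervious] = impervious_model(features[is_impervious])
    if (~is_impervious).any():
        scores[~is_impervious] = permeable_model(features[~is_impervious])
    # Output layer: both branches write into one susceptibility map.
    return scores
```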

[0013] This invention also proposes a flood susceptibility assessment system based on multi-source data sample enhancement, the system comprising:
an acquisition module, used to acquire multi-source data related to flooding;
a preprocessing module, used to preprocess the multi-source data to obtain inundated samples;
a non-inundated sample selection module, used to select non-inundated samples based on the inundated samples using a cluster-enhanced positive-unlabeled (PU) learning method;
an assessment module, used to input the inundated samples and the non-inundated samples into a preset dual-branch flood susceptibility assessment model, which outputs the flood susceptibility assessment results.
The present invention also proposes a computer device comprising a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus. The memory stores at least one executable instruction that causes the processor to perform the operations of the flood susceptibility assessment method based on multi-source data sample enhancement.

[0014] Compared with the prior art, the beneficial effects of the technical solution of the present invention are as follows. This invention proposes a flood susceptibility assessment method and system based on multi-source data sample enhancement. First, it acquires multi-source data related to flooding and preprocesses this data, which overcomes the incomplete coverage and information lag of a single data source and generates more realistic and comprehensive positive (inundated) samples. Second, based on these inundated samples, it uses a cluster-enhanced positive-unlabeled (PU) learning method to select non-inundated samples, which ensures the accuracy of the negative samples while clustering makes their distribution representative. Third, it inputs the inundated and non-inundated samples into a preset dual-branch flood susceptibility assessment model, which outputs the flood susceptibility assessment results. The dual-branch model fits the flood-inducing patterns of the different underlying surfaces separately, improving the accuracy of flood susceptibility assessment.

Brief Description of the Drawings

[0015] Figure 1 is a flowchart of the flood susceptibility assessment method based on multi-source data sample enhancement proposed in an embodiment of the present invention. Figure 2 is another flowchart of the same method. Figure 3 is a structural block diagram of the flood susceptibility assessment system based on multi-source data sample enhancement proposed in an embodiment of the present invention. Figure 4 is a structural block diagram of a computer device proposed in an embodiment of the present invention.

Detailed Implementation

[0016] The accompanying drawings are for illustrative purposes only and should not be construed as limiting the scope of this patent. Those skilled in the art will understand that some well-known details may be omitted from the accompanying drawings. To facilitate understanding of this embodiment, the relevant prior art is first introduced as follows. Flood susceptibility assessment provides decision support for disaster emergency response by scientifically identifying areas with potential flood risks, helping to reduce losses caused by floods. Machine learning can analyze the complex nonlinear relationship between input features and predicted values, and has the advantages of fast prediction speed and high accuracy, making it widely used in flood susceptibility assessment [1]. However, when conducting flood susceptibility assessments, it is difficult to obtain high-quality data to train machine learning models. Positive samples in the training data, i.e., inundated samples, originate from flood disaster information. Currently, flood disaster information is mainly acquired through ground monitoring, numerical simulation, remote sensing, and social sensing. Among these, ground monitoring, remote sensing, and social sensing can obtain near-real-time, accurate flood inundation data. Ground monitoring deploys observation instruments at specific locations to monitor water depth and flow velocity in real time and obtains highly accurate inundation data. However, its application over large areas or in complex terrain is significantly limited by the need for dense deployment and the cost of long-term maintenance. Synthetic aperture radar (SAR) satellites, such as Sentinel-1 and ALOS-2, are active microwave remote sensing systems that are unaffected by weather and lighting conditions and offer large-scale, all-weather imaging.
SAR satellites achieve Earth observation by actively transmitting microwaves towards targets and receiving their backscattered signals. This overcomes the problem of cloud interference hindering imaging by optical remote sensing satellites during severe weather accompanying floods, making them one of the most important remote sensing satellites for flood disaster monitoring. However, due to the complexity of urban environments, the double reflection effect of buildings, and limitations in spatial resolution, SAR satellites struggle to monitor flooding in urban areas. Furthermore, the temporal sparsity of current SAR satellite data leads to the loss of some temporal information regarding flood disasters. Social media data is a typical example of socially perceptual data, originating from every user on social media platforms. Each user can share and transmit information they observe during floods at any time on these platforms. In recent years, scholars have used social media data to study the spatiotemporal changes of flood disasters, risk assessment, and crisis communication, demonstrating the excellent performance of social media data in flood disaster applications. The number of users is one of the most significant factors affecting the amount of social media data. Urban areas are often densely populated, indicating that a large amount of social media data is generated in these areas during disasters. Therefore, social media data can effectively compensate for the poor performance of remote sensing satellite data in urban areas. By combining the advantages of remote sensing and social media data, more comprehensive information on flooding can be obtained.

[0017] Negative samples, i.e., non-inundated samples, in the training data are often selected randomly or subjectively from the remaining unknown samples. However, because the inundated samples are incomplete and subjective judgment relies heavily on prior knowledge, negative samples selected in this way are not entirely accurate. Since the collected inundated samples rarely cover every real flooding location, random selection may label cells that were actually inundated as non-inundated. Subjective selection relies too heavily on researchers' prior knowledge and may introduce geographical bias. Furthermore, machine learning is sensitive to data quality, and biases in negative sample selection can reduce the model's predictive ability. The positive-unlabeled (PU) learning method from semi-supervised learning can address these shortcomings. It trains a classifier using a small number of positive samples and a large number of unlabeled samples, ranks the unlabeled samples by negative class probability from high to low, and selects as many of them as there are known positive samples. This improves the accuracy of the negative samples in the training data; however, because the number of positive samples is limited, the selected negative samples may concentrate in areas with prominent features, which makes them insufficiently representative. Furthermore, many flood susceptibility studies treat the study area as a whole when training machine learning models. Urban underlying surfaces are highly heterogeneous, and impermeable and permeable surfaces in particular differ significantly in their hydrological cycle processes. Building a model directly on the entire region may therefore limit the achievable simulation accuracy.
Sample selection methods based on visual interpretation of satellite images require substantial manual and time costs, hindering rapid assessment of flood susceptibility. Random or subjective methods struggle to obtain non-inundated samples that are both accurate and representative. Random sampling methods may incorrectly identify unrecorded inundated samples as non-inundated points, leading to a more conservative assessment model and thus underestimating the flood susceptibility of these areas. Subjective sampling and traditional PUL methods are prone to oversampling in areas with distinct elevation and steep slope characteristics when selecting non-inundated samples. In such cases, the model may tend to classify uncertain areas with moderate elevation and gentle slopes as inundated zones, thereby overestimating flood susceptibility.

[0018] The purpose of this invention is to propose a machine learning framework based on sample augmentation, using social media and remote sensing data as data sources, to assess flood susceptibility. This framework enhances the training samples of the dual-branch flood susceptibility assessment model from two aspects: acquiring inundated samples and selecting non-inundated samples. Specifically, combining the temporal and spatial advantages of social media and remote sensing data aims to obtain more realistic and comprehensive flood inundation information; a cluster-enhanced PU learning method is used to select non-inundated samples that are both accurate and representative. The framework also considers the differences in hydrological processes across different underlying surfaces, constructing separate machine learning models for impermeable and permeable surfaces to improve model accuracy.

[0019] The technical solution of the present invention will be further described below with reference to the accompanying drawings and embodiments.

[0020] Embodiment 1. As shown in Figure 1 and Figure 2, this embodiment proposes a flood susceptibility assessment method based on multi-source data sample enhancement, including the following steps: S1. Obtain multi-source data related to flooding. The multi-source data includes social media data and remote sensing data, and the flood samples include a first flood sample and a second flood sample.

[0021] The social media data originates from social media platforms; for each post, the posting time, text, and user information are acquired. The remote sensing data comes from the Sentinel-1 satellite. Sentinel-1 has four imaging modes, of which the Interferometric Wide swath (IW) mode is the default over land. The data are acquired through Google Earth Engine (GEE): VV (co-polarized) and VH (cross-polarized) data are obtained from Sentinel-1 imagery that has undergone thermal noise removal, radiometric calibration, and terrain correction.

[0022] S2. Preprocess the multi-source data to obtain flooded samples; During floods, as people's normal lives are severely affected, information about flooding on social media platforms increases rapidly. This method selects social media data by retrieving Weibo posts during flood periods using flood-related keywords. The social media data from the multi-source data is preprocessed to obtain a first flood sample, including: S211. Collect text information from the social media data using web crawler technology; the text information includes the posting time and text. S212. Deduplicate the text information to obtain deduplicated social media text; S213. Input the deduplicated social media text into a preset deep learning model and output flood-related text; In order to filter out data unrelated to flood disasters, S213 uses the deep learning model BERT-RCNN to classify deduplicated social media texts, dividing them into flood-related and non-flood-related texts, and then obtaining flood-related texts, which are recorded as flood-related texts; the ratio of the training set, test set and validation set of the BERT-RCNN model is 8:1:1.

[0023] S214. Use the open-source Chinese natural language processing toolkit Han Language Processing (HanLP) to extract the geographical locations mentioned in flood-related texts, encode the geographical locations based on the Gaode Map API to obtain inundation point data, which are the latitude and longitude coordinates of the inundation points, and designate the grid where the inundation point data is located as the first flood sample.

[0024] The remote sensing data from the multi-source data is preprocessed to obtain a second flood sample, including: S221. Perform median filtering on the remote sensing data to obtain filtered image data. In S221, Sentinel-1 images that have undergone noise removal, radiometric calibration, and terrain correction are acquired from the Google Earth Engine platform. Median filtering with a circular window of radius 30 m is applied to each image in the remote sensing data to reduce errors caused by speckle.

[0025] S222. The filtered image data is mosaicked to obtain a pre-disaster radar map and a post-disaster radar map covering the entire study area; the pre-disaster radar map is a pre-disaster SAR image, and the post-disaster radar map is a post-disaster SAR image. S223. Extract features from the pre-disaster radar map and the post-disaster radar map to obtain feature data. Based on land use data, sample points are randomly generated in areas with different land use types, and feature data is extracted at these sample points to obtain water body samples and non-water body samples. The feature data includes co-polarization data (VV), cross-polarization data (VH), the polarization sum, the polarization ratio index (VHrVV), the normalized difference polarization index (NDPI), the normalized VV index (NVVI), and the normalized VH index (NVHI). The specific calculation formulas for VHrVV, NDPI, NVVI, and NVHI are shown below:

[0026] where $\sigma_{VV}$ is the backscattering coefficient in the VV polarization mode and $\sigma_{VH}$ is the backscattering coefficient in the VH polarization mode.
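The formulas themselves are not reproduced in this text. As a purely illustrative sketch, the block below computes the seven listed features using definitions that are common in the SAR literature; these definitions, the sign convention chosen for NDPI, the assumption that backscatter is given in linear power units rather than dB, and the function name `polarization_features` are all assumptions and may differ from the patent's own equations.

```python
import numpy as np

def polarization_features(vv, vh):
    """Per-pixel polarization features from VV and VH backscatter.

    vv, vh : backscattering coefficients in linear power units (assumed).
    The index definitions below are common conventions, not taken from
    the patent text, whose formulas are missing.
    """
    vv = np.asarray(vv, dtype=float)
    vh = np.asarray(vh, dtype=float)
    total = vv + vh
    return {
        "VV": vv,
        "VH": vh,
        "sum": total,                # polarization sum
        "VHrVV": vh / vv,            # polarization ratio index (assumed VH / VV)
        "NDPI": (vv - vh) / total,   # normalized difference polarization index
        "NVVI": vv / total,          # normalized VV index
        "NVHI": vh / total,          # normalized VH index
    }
```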

[0027] S224. Classify the pre-disaster radar map and the post-disaster radar map based on the feature data to obtain a pre-disaster water body image and a post-disaster water body image. In this embodiment, the classification is performed with the random forest algorithm.

[0028] S225. Compare the pre-disaster and post-disaster water body images to extract the raster data of the flooded area; S226. Resample the flooded area raster data, and randomly select from the resampled raster the same number of grid cells as there are inundation points as the second flood sample. Furthermore, to reduce the salt-and-pepper effect, regions with fewer than 20 connected pixels are removed.

[0029] In the above steps, water bodies are first classified in the remote sensing data, the classified pre-disaster and post-disaster images are then compared to determine the flooded area raster data, and the second flood sample is finally drawn from that raster data.

[0030] The step of comparing the pre-disaster and post-disaster water body images to extract raster data of the flooded area includes: For any given grid cell, if it is determined to be a non-water body in the pre-disaster water body image and a water body in the post-disaster water body image, then the grid cell is recorded as a flooded area grid cell, and all flooded area grid cells constitute the flooded area grid cell data.
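The comparison rule above is a per-cell logical test and can be sketched in a few lines. The function name `flooded_cells` and the boolean-mask representation of the water body images are assumptions of this sketch.

```python
import numpy as np

def flooded_cells(pre_water, post_water):
    """Mark cells that are non-water before the disaster and water after it.

    pre_water, post_water : arrays of the same shape, truthy where the
    classified image shows a water body.
    """
    pre = np.asarray(pre_water, dtype=bool)
    post = np.asarray(post_water, dtype=bool)
    if pre.shape != post.shape:
        raise ValueError("pre- and post-disaster images must share a grid")
    # Permanent water (water in both images) is excluded by the ~pre term.
    return ~pre & post
```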

[0031] Flood information acquired from remote sensing data is raster data of inundated areas, while flood information acquired from social media data is inundation point data, which is vector data. To combine the advantages of both data sources, the inundated area raster data from remote sensing is first resampled to a resolution of 30 m. At this resolution, each grid cell that contains an inundation point from social media is considered flooded and serves as a first flood sample. The same number of grid cells is then randomly selected within the remote-sensing flooded area as the second flood sample. The inundated samples, composed of the first and second flood samples, serve as the positive samples for the flood susceptibility assessment model. Step S2 thus integrates the raster and vector inundation data and generates more realistic and comprehensive inundated samples.
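A minimal sketch of this point-to-grid assignment and equal-size sampling follows. It assumes projected coordinates in metres on a north-up grid whose upper-left corner is `(x_min, y_max)`, and it takes "the same number" to mean the number of distinct first-sample cells; these choices and the function names are assumptions rather than details given in the text.

```python
import numpy as np

def point_to_cell(x, y, x_min, y_max, cell_size=30.0):
    """Return (row, col) of the grid cell that contains a projected point."""
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)
    return row, col

def build_positive_samples(points, flooded_mask, x_min, y_max,
                           cell_size=30.0, seed=0):
    """First flood sample from points, second from the remote-sensing mask."""
    flooded_mask = np.asarray(flooded_mask, dtype=bool)
    n_rows, n_cols = flooded_mask.shape

    # First flood sample: every cell hit by at least one inundation point.
    first = set()
    for x, y in points:
        row, col = point_to_cell(x, y, x_min, y_max, cell_size)
        if 0 <= row < n_rows and 0 <= col < n_cols:
            first.add((row, col))

    # Second flood sample: equally many cells drawn from the flooded raster.
    candidates = [(int(r), int(c)) for r, c in np.argwhere(flooded_mask)]
    k = min(len(first), len(candidates))
    if k == 0:
        return sorted(first), []
    rng = np.random.default_rng(seed)
    picked = rng.choice(len(candidates), size=k, replace=False)
    second = [candidates[i] for i in picked]
    return sorted(first), second
```

Two posts that geocode into the same 30 m cell yield a single first-sample cell in this sketch, so the positive set stays a set of grid cells rather than a list of posts.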

[0032] S3. Based on the submerged samples, select non-submerged samples using a positive sample unlabeled learning method based on clustering enhancement; The step of selecting non-submerged samples based on the submerged samples using a cluster-enhanced positive sample unlabeled learning method includes: S31. Using the Gaussian mixture model algorithm based on the selected flood influencing factors, cluster analysis is performed on the unlabeled raster data of the permeable surface and the impermeable surface that are neither located in the flooded area of ​​the flooded sample nor contain flooded point data, to obtain several representative clusters of the permeable surface and the impermeable surface. The steps for obtaining the selected flood influencing factors are as follows: S311. Determine the initial flood impact factors; This step uses 15 initial flood impact factors to assess flood susceptibility. Precipitation data is sourced from the Global Precipitation Observation Project, with temporal and spatial resolutions of 30 minutes and 0.1°×0.1°, respectively. This precipitation data is used for storm sequence identification and to calculate the maximum rainfall, duration, and intensity factor of storms. The storm sequence identification rules follow relevant regulations: hourly rainfall greater than 16 mm, 12-hour rainfall greater than 30 mm, and 24-hour rainfall greater than 50 mm. Digital elevation model (DEM) data uses the ASTER Global Digital Elevation Model V3 to calculate elevation, slope, aspect, plane curvature, profile curvature, coefficient of variation of elevation, and topographic humidity index. Road network density factor data is sourced from the China 1km grid road network density dataset published by Li et al. on the Scientific Data Bank. Soil erosion modulus factor data is sourced from the soil erosion dataset published by Yan et al. 
Normalized Difference Vegetation Index (NDVI) factor data is sourced from the China 30 m annual maximum NDVI dataset calculated on the GEE platform by Yang et al. The impervious area ratio factor was calculated using the ESA WorldCover 10m v100 dataset. The distance-to-river factor was calculated from the "A Chinese Mainland River Network Dataset For Hydrodynamic simulation" river network dataset from the Scientific Data Bank. To ensure data consistency and accuracy, all flood impact factors were resampled to 30 m.

[0033] S312. Based on the initial flood impact factors, calculate the Spearman rank correlation coefficient, and remove flood impact factors with strong correlation according to the Spearman rank correlation coefficient to obtain the flood impact factors after initial screening; S313. Calculate the variance inflation factor or tolerance of the flood impact factors after initial screening, and based on the variance inflation factor or the tolerance, eliminate flood impact factors with multicollinearity to obtain the selected flood impact factors.

[0034] In machine learning, high correlations (especially multicollinearity) among flood-affecting factors can lead to decreased model interpretability and increased sensitivity of prediction results to data perturbations. This invention first calculates the Spearman rank correlation coefficient to eliminate highly correlated flood-affecting factors. Then, it uses the variance inflation factor (VIF) and tolerance to detect multicollinearity among these factors and eliminates those exhibiting multicollinearity. The formulas for calculating the Spearman rank correlation coefficient and VIF are shown in Equations 6 and 7, respectively.

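[0035]
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)} \tag{6}$$

[0036] In the formula, $\rho$ is the Spearman rank correlation coefficient; $d_i$ is the difference between the ranks of the $i$-th pair of values of the two characteristic factors; $n$ is the number of samples.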

[0037] $$\mathrm{VIF}_i = \frac{1}{1 - R_i^{2}} \tag{7}$$

[0038] In the formula, $\mathrm{VIF}_i$ is the variance inflation factor of the $i$-th characteristic factor; $R_i^{2}$ is the goodness of fit obtained when the $i$-th characteristic factor is regressed as the dependent variable on the remaining characteristic factors as independent variables.

[0039] Two flood impact factors are considered strongly correlated when the absolute value of their Spearman rank correlation coefficient is greater than 0.7. Multicollinearity is considered to exist when the VIF is greater than 10 or the tolerance (TOL, the reciprocal of the VIF) is less than 0.1.
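The two-stage screening of S312 and S313 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the rule of dropping the later factor of a strongly correlated pair, and the iterative removal of the factor with the largest VIF are all hypothetical choices, since the text fixes only the thresholds (absolute Spearman coefficient above 0.7, VIF above 10, equivalently TOL below 0.1). Ranks are computed without tie correction, which assumes continuous-valued factors.

```python
import numpy as np


def spearman_matrix(X):
    """Spearman rank correlation between all columns of X (no tie correction)."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)


def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R_i^2)."""
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        y = X[:, i]
        # Regress factor i on the remaining factors plus an intercept.
        A = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - (y - A @ coef).var() / y.var()
        out[i] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out


def screen_factors(X, names, rho_max=0.7, vif_max=10.0):
    """Return the names of the factors kept after both screening stages."""
    rho = np.abs(spearman_matrix(X))
    dropped = set()
    # Stage 1 (assumed tie-break): keep the earlier factor of each strongly correlated pair.
    for i in range(X.shape[1]):
        if i in dropped:
            continue
        for j in range(i + 1, X.shape[1]):
            if j not in dropped and rho[i, j] > rho_max:
                dropped.add(j)
    keep = [k for k in range(X.shape[1]) if k not in dropped]
    # Stage 2 (assumed strategy): repeatedly drop the factor with the largest VIF above the threshold.
    while len(keep) > 1:
        v = vif(X[:, keep])
        worst = int(np.argmax(v))
        if v[worst] <= vif_max:
            break
        keep.pop(worst)
    return [names[k] for k in keep]
```

Here `X` is a samples-by-factors array of factor values extracted at the grid cells, and `names` lists the factor names in column order.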

[0040] The Gaussian Mixture Model (GMM) algorithm is a soft clustering algorithm based on probability distributions. Its core assumption is that the data are generated by a mixture of several clusters that each follow a Gaussian distribution, which lets it adapt flexibly to complex data distributions. For a dataset of samples $x$, the mathematical expression of the GMM is as follows.

[0041] $$p(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$ $$\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_k)^{\mathsf{T}}\Sigma_k^{-1}(x-\mu_k)\right)$$

[0042] In the formula, $p(x)$ is the probability density function of the GMM; $K$ is the number of Gaussian components; $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the probability density function of the $k$-th Gaussian component; $w_k$ is the weighting coefficient of the $k$-th Gaussian component; $\mu_k$ is the mean vector of the $k$-th Gaussian component; $\Sigma_k$ is the covariance matrix of the $k$-th Gaussian component; $d$ is the number of features in a sample.

[0043] After the GMM is constructed, the expectation-maximization (EM) algorithm is used to solve for its unknown parameters. To avoid the influence of a manually determined number of Gaussian components on the clustering results, this invention uses the Bayesian Information Criterion (BIC) to determine the optimal number of clusters. The BIC value is calculated using Equation 11. BIC accounts for both goodness of fit and model complexity, and the optimal model is selected by comparing the BIC values of candidate models; a smaller BIC value indicates better clustering performance.

[0044] $$\mathrm{BIC} = m \ln n - 2 \ln \hat{L} \tag{11}$$

[0045] In the formula, $m$ is the number of free parameters that need to be estimated for the GMM; $n$ is the number of samples; $\hat{L}$ is the maximized value of the likelihood function of the GMM.
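As an illustration of choosing the number of clusters by BIC, the sketch below fits GMMs with 1 to `k_max` components and keeps the one with the smallest BIC. It assumes scikit-learn's `GaussianMixture`, which fits by EM and exposes a `bic` method; the function name, the search range `k_max`, and the full covariance type are assumptions that the text does not specify.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_gmm_by_bic(X, k_max=10, seed=0):
    """Fit GMMs with 1..k_max components and return the one with the lowest BIC."""
    best_model, best_bic = None, np.inf
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed).fit(X)  # parameters solved by EM
        bic = gmm.bic(X)  # penalizes the number of free parameters
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model, best_bic
```

Calling `best_model.predict(X)` then gives the cluster label of each unlabeled grid cell; under the method described above this would be run once for the permeable surface and once for the impermeable surface.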

[0046] S32. Calculate the negative class score of each grid cell in the unlabeled grid cells of the permeable surface and the impermeable surface using the positive sample unlabeled learning method; Positive-unlabeled (PU) learning is a special type of semi-supervised algorithm that uses a small number of positive samples and a large number of unlabeled samples as training data. The goal of PU learning is consistent with traditional binary classification models, but it needs to distinguish between true negative samples and potential positive samples from unlabeled samples in scenarios where there are no clearly defined negative samples. Therefore, compared to random or subjective methods, PU learning can select more reliable negative samples. Positive-unlabeled learning methods include direct standard classification, positive-unlabeled sample ensemble (PU bagging), and two-step methods, which use different strategies to handle the interference of positive samples in the unlabeled samples. To reduce the uncertainty of a single method, this invention uses the above three methods to calculate the probability that each sample in the unlabeled samples is a negative sample and scores it accordingly; a higher score indicates a greater probability that the sample is a negative sample.

[0047] Direct standard classification directly treats the unlabeled samples as negative samples, trains a classifier on them jointly with the positive samples, and uses the classifier to predict the probability that each unlabeled sample belongs to the negative class.

[0048] The PU bagging method is based on the ensemble concept: over multiple rounds, samples drawn with replacement from the unlabeled samples are used as negative samples, base classifiers are trained on them together with the positive samples, and the results are aggregated. Specific steps: (1) randomly draw, with replacement, the same number of samples from the unlabeled samples as there are positive samples, and use them as negative samples to form a training set together with the positive samples; (2) use the training set to train a base classifier; (3) use the trained base classifier to predict the negative class probability of the unlabeled samples that were not drawn in the current round (i.e., the out-of-bag samples); (4) repeat the above three steps a preset number of times. The final negative class probability of each unlabeled sample is taken as the mean of the negative class probabilities predicted for it across the rounds.
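The four PU bagging steps can be sketched as follows. This is a minimal illustration assuming scikit-learn's `RandomForestClassifier` as the base classifier (the text states that random forests are used); the function name, the number of rounds `n_rounds`, and the forest size are hypothetical. The sketch also assumes there are more unlabeled samples than positive samples, so that every round leaves some out-of-bag samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def pu_bagging_negative_prob(X_pos, X_unl, n_rounds=20, seed=0):
    """Mean out-of-bag negative-class probability of each unlabeled sample."""
    rng = np.random.default_rng(seed)
    n_pos, n_unl = len(X_pos), len(X_unl)
    prob_sum = np.zeros(n_unl)
    counts = np.zeros(n_unl)
    for _ in range(n_rounds):
        # (1) Draw as many unlabeled samples as there are positives, with replacement.
        idx = rng.choice(n_unl, size=n_pos, replace=True)
        oob = np.setdiff1d(np.arange(n_unl), idx)
        if oob.size == 0:
            continue
        X_train = np.vstack([X_pos, X_unl[idx]])
        y_train = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_pos, dtype=int)])
        # (2) Train a base classifier on positives vs. the drawn pseudo-negatives.
        clf = RandomForestClassifier(n_estimators=50,
                                     random_state=int(rng.integers(2**31 - 1)))
        clf.fit(X_train, y_train)
        # (3) Score only the out-of-bag unlabeled samples; column 0 is class 0 (negative).
        prob_sum[oob] += clf.predict_proba(X_unl[oob])[:, 0]
        counts[oob] += 1
    # (4) Average over the rounds in which each sample was out of bag (NaN if never).
    return np.divide(prob_sum, counts, out=np.full(n_unl, np.nan), where=counts > 0)
```

A higher returned value indicates that the grid cell is more likely to be a true non-submerged sample.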

[0049] The two-step method is an iterative approach that selects negative samples by progressively identifying reliable negative samples. Specific steps: (1) In the first step, train a classifier on the dataset formed by the positive samples and the unlabeled samples, identify the unlabeled samples whose positive class probability is higher than, or lower than, that of all positive samples, label them as new positive samples and reliable negative samples respectively, and incorporate them into the positive samples and the reliable negative samples. (2) In the second step, construct an iterative classifier: use the positive samples and the reliable negative samples as the dataset to train a classifier, identify new positive samples and reliable negative samples in the same way from the unlabeled samples that have not yet been incorporated into either set, and update the positive samples and the reliable negative samples. Iterate until the numbers of newly identified positive samples and reliable negative samples are both 0 or the preset number of iterations is reached; the iteration then stops and the final positive samples and reliable negative samples are obtained. Finally, train a classifier with the final positive samples and reliable negative samples and use it to predict the unlabeled samples, obtaining the probability that each is a negative class. The classifiers used in the above methods are all random forests, and the final negative class score of each unlabeled sample is calculated as follows.

[0050]
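The two-step procedure of paragraph [0049] can be sketched as follows. This is a hedged illustration rather than the patented implementation: the function names and `max_iter` are hypothetical, and the use of out-of-bag probabilities for samples that are part of the training set is an assumption made here to avoid comparing against overfitted in-sample predictions. The thresholding rule (above or below the positive class probability of all positive samples) follows the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def _fit(X_p, X_n, seed):
    """Train a random forest on positives vs. negatives; also return out-of-bag P(positive)."""
    X = np.vstack([X_p, X_n])
    y = np.r_[np.ones(len(X_p), dtype=int), np.zeros(len(X_n), dtype=int)]
    clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                                 random_state=seed).fit(X, y)
    return clf, clf.oob_decision_function_[:, 1]


def two_step_negative_prob(X_pos, X_unl, max_iter=10, seed=0):
    """Return the negative-class probability of each unlabeled sample and its final state."""
    state = np.zeros(len(X_unl), dtype=int)  # 0: unlabeled, 1: new positive, -1: reliable negative

    # Step 1: positives vs. all unlabeled samples (temporarily treated as negative).
    _, oob = _fit(X_pos, X_unl, seed)
    p_pos, p_unl = oob[:len(X_pos)], oob[len(X_pos):]
    state[p_unl > np.nanmax(p_pos)] = 1
    state[p_unl < np.nanmin(p_pos)] = -1

    # Step 2: retrain on the current positives and reliable negatives, then relabel the rest.
    for _ in range(max_iter):
        rest = np.flatnonzero(state == 0)
        if rest.size == 0 or not (state == -1).any():
            break
        P = np.vstack([X_pos, X_unl[state == 1]])
        clf, oob = _fit(P, X_unl[state == -1], seed)
        p_pos = oob[:len(P)]
        p_rest = clf.predict_proba(X_unl[rest])[:, 1]
        new_pos = rest[p_rest > np.nanmax(p_pos)]
        new_neg = rest[p_rest < np.nanmin(p_pos)]
        if new_pos.size == 0 and new_neg.size == 0:
            break  # nothing new was identified
        state[new_pos], state[new_neg] = 1, -1

    if not (state == -1).any():
        raise ValueError("no reliable negative samples were identified")
    # Final classifier trained on the final positive and reliable negative sets.
    clf, _ = _fit(np.vstack([X_pos, X_unl[state == 1]]), X_unl[state == -1], seed)
    return clf.predict_proba(X_unl)[:, 0], state
```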

[0051] S33. Divide the total number of positive samples by the total number of clusters to obtain the number of negative samples per cluster for the permeable surface and the impermeable surface; S34. Based on the number of negative samples collected from each representative cluster of the permeable surface and the impermeable surface, select an equal number of negative samples from each cluster according to the negative class score from high to low; the positive samples are submerged samples, and the negative samples are non-submerged samples.
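Steps S33 and S34 amount to a per-cluster top-k selection, which can be sketched as below. The function name is hypothetical, and integer division is an assumption, since the text does not say how a remainder is handled when the number of positive samples is not divisible by the number of clusters. Under the method described above, the function would be applied once to the permeable surface and once to the impermeable surface.

```python
import numpy as np


def select_negatives(scores, cluster_ids, n_positive):
    """Pick the highest-scoring unlabeled samples, an equal number from every cluster."""
    scores = np.asarray(scores, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    clusters = np.unique(cluster_ids)
    per_cluster = n_positive // len(clusters)  # S33 (remainder handling is assumed)
    chosen = []
    for c in clusters:
        members = np.flatnonzero(cluster_ids == c)
        # S34: sort this cluster's members by negative class score, high to low.
        order = members[np.argsort(scores[members])[::-1]]
        chosen.extend(order[:per_cluster].tolist())
    return np.array(chosen, dtype=int)
```

The returned array holds the indices of the unlabeled grid cells chosen as non-submerged samples.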

[0052] The accuracy of machine learning models largely depends on data quality; therefore, the accuracy and representativeness of the selected negative samples affect the accuracy of model predictions. This invention proposes a negative sample selection method based on GMM-PU learning to balance data accuracy and representativeness. Since the flooding mechanisms of impervious and permeable surfaces differ significantly, this invention selects, on the impervious surface and on the permeable surface respectively, the same number of negative samples as positive samples. Through S3, accurate and representative non-submerged samples are extracted automatically and rapidly from the unlabeled samples, reducing labor and time costs and effectively avoiding the negative impact of existing non-submerged sample selection methods on flood susceptibility assessment results.

[0053] S4. Input the flooded samples and the non-flooded samples into a preset dual-branch flood susceptibility assessment model, and output the flood susceptibility assessment results from the dual-branch flood susceptibility assessment model. The dual-branch flood susceptibility assessment model includes an input layer, a dual-branch processing layer, and an output layer. The dual-branch processing layer includes an impermeable surface branch processing layer and a permeable surface branch processing layer. The output end of the input layer is connected to the input ends of the impermeable surface branch processing layer and the permeable surface branch processing layer, respectively. The output ends of the impermeable surface branch processing layer and the permeable surface branch processing layer are connected to the output layer.

[0054] Considering the differences in flood formation mechanisms between impervious and permeable surfaces, this invention employs a dual-branch structure. The input layer divides the data into impervious-surface data and permeable-surface data. Based on the characteristics of each branch, CatBoost (Categorical Boosting) models are constructed in the impervious surface branch processing layer and the permeable surface branch processing layer, respectively, to assess the flood susceptibility of the study area. CatBoost is a gradient boosting decision tree algorithm that can automatically identify and efficiently process categorical features without requiring complex data preprocessing or manual encoding.

Therefore, this invention extracts the flood impact factors of the corresponding grid cells based on the submerged and non-submerged samples, constructs training and testing sets in an 8:2 ratio, and then constructs binary classification CatBoost models for the impermeable and permeable surfaces in the dual-branch processing layer to assess flood susceptibility.
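The dual-branch routing can be sketched as follows. To keep the example dependent only on scikit-learn, `GradientBoostingClassifier` is used here as a stand-in for CatBoost; this substitution, the class and function names, and the boolean `is_impervious` flag are assumptions made for illustration. A CatBoost classifier could be supplied instead through the `make_classifier` argument.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split


def split_8_2(X, y, is_impervious, seed=0):
    """Split samples into training and testing sets in an 8:2 ratio."""
    return train_test_split(X, y, is_impervious, test_size=0.2, random_state=seed)


class DualBranchModel:
    """One binary classifier per surface type; the input layer routes samples by surface."""

    def __init__(self, make_classifier=lambda: GradientBoostingClassifier(random_state=0)):
        self.make_classifier = make_classifier
        self.branches = {}

    def fit(self, X, y, is_impervious):
        is_impervious = np.asarray(is_impervious, dtype=bool)
        for flag in (True, False):  # impervious branch, then permeable branch
            mask = is_impervious == flag
            self.branches[flag] = self.make_classifier().fit(X[mask], y[mask])
        return self

    def predict_proba(self, X, is_impervious):
        """Flood susceptibility (probability of the flooded class) for each sample."""
        is_impervious = np.asarray(is_impervious, dtype=bool)
        out = np.empty(len(X))
        for flag, clf in self.branches.items():
            mask = is_impervious == flag
            if mask.any():
                out[mask] = clf.predict_proba(X[mask])[:, 1]
        return out
```

Each branch sees only the samples of its own surface type, so the two classifiers are free to learn different relationships between the same factors and flooding.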

[0055] To analyze the predictive performance of the two-branch flood susceptibility assessment model, this study used accuracy, precision, recall, and F1-score to evaluate the model's performance. The calculation formulas for each indicator are as follows.

[0056] $$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$ $$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

[0057] In the formula, $TP$ is the number of samples that the model correctly predicted as flooded; $TN$ is the number of samples that the model correctly predicted as non-flooded; $FP$ is the number of samples that the model incorrectly predicted as flooded; $FN$ is the number of samples that the model incorrectly predicted as non-flooded. S4 uses a two-branch machine learning approach to model the impervious and permeable surfaces of the study area separately. This method considers the differences in hydrological processes between impervious and permeable surfaces, which helps improve the accuracy of flood susceptibility assessment.
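The four evaluation indicators can be computed directly from the confusion counts, as in the short sketch below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions; label 1 denotes flooded and label 0 non-flooded.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score for binary labels (1 = flooded)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted flooded
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted non-flooded
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # incorrectly predicted flooded
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # incorrectly predicted non-flooded
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```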

[0058] This invention integrates remote sensing and social media data to propose a flood susceptibility assessment method based on multi-source data sample enhancement, providing a machine learning framework for flood susceptibility assessment. The framework balances the representativeness and accuracy of negative samples by using a positive-unlabeled (PU) learning method to select them from different clusters. In particular, this invention combines the advantages of remote sensing and social media data to generate more realistic inundation scenarios, solving the problem that a single data source cannot quickly provide comprehensive and accurate flood disaster information. In existing methods, the negative samples (non-inundated samples) in the training data are often selected randomly or subjectively from the remaining unlabeled samples. However, because inundation information is difficult to obtain and subjective judgment relies heavily on prior knowledge, negative samples selected randomly or subjectively are not entirely accurate. This invention selects negative samples using a cluster-enhanced positive sample unlabeled learning method, solving the problem that existing random or subjective methods struggle to obtain non-inundated samples that are both accurate and representative. Furthermore, urban underlying surfaces are highly heterogeneous; in particular, the hydrological cycle processes of impermeable and permeable surfaces differ significantly. This invention uses a dual-branch structure to construct separate machine learning models for the impermeable and permeable surfaces of the study area to assess flood susceptibility, solving the problem that existing methods, which build a single model directly on the entire study area, may limit improvements in simulation accuracy.

[0059] Compared with existing technologies, this invention proposes a machine learning framework for flood susceptibility assessment based on sample augmentation. In this framework, the GMM clustering model and CatBoost model can be replaced with other similar models according to different research areas and needs, exhibiting strong scalability and flexibility. The main advantages of this framework are: 1) it combines the advantages of remote sensing and social media data to obtain inundation data that more closely resembles real-world conditions; 2) it uses a cluster-optimized PU learning method to extract non-inundated samples, achieving a balance between sample representativeness and accuracy; 3) it uses a dual-branch machine learning method to model impervious and permeable surfaces separately, improving the accuracy of flood susceptibility assessment. Due to the availability and low cost of the data, and its applicability to flood susceptibility assessment in different regions, this invention has broad applicability and significant application potential.

[0060] Example 2. See Figure 3. This embodiment proposes a flood susceptibility assessment system based on multi-source data sample enhancement, the system comprising: an acquisition module, used to acquire multi-source data related to flooding; a preprocessing module, used to preprocess the multi-source data to obtain flooded samples; a non-submerged sample selection module, used to select non-submerged samples based on the submerged samples using a positive sample unlabeled learning method based on clustering enhancement; and an evaluation module, used to input the flooded samples and the non-flooded samples into a preset dual-branch flood susceptibility evaluation model, which outputs the flood susceptibility evaluation results.

[0061] In this embodiment, firstly, multi-source data related to flooding is acquired and preprocessed to overcome the incomplete coverage and information lag of a single data source, generating more realistic and comprehensive positive (flooded) samples. Secondly, based on the flooded samples, a cluster-enhanced positive sample unlabeled learning method is used to select non-flooded samples, ensuring the accuracy of the negative samples while making their distribution representative through clustering. Furthermore, the flooded samples and the non-flooded samples are input into a preset two-branch flood susceptibility assessment model, which outputs the flood susceptibility assessment results. Because the two-branch model fits the flood formation patterns of the different underlying surfaces separately, the accuracy of flood susceptibility assessment is improved.

[0062] Example 3. See Figure 4. This embodiment proposes a computer device which, as shown in Figure 4, includes a processor 41, a memory 42, a communication interface 43 and a communication bus 44, wherein the processor 41, the memory 42 and the communication interface 43 communicate with each other through the communication bus 44. The communication interface 43 is used for network communication with other devices, such as clients or other servers. The processor 41 executes executable instructions 45, specifically performing the operations of the flood susceptibility assessment method based on multi-source data sample enhancement. Specifically, the executable instructions 45 may include program code. The processor 41 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The computer device includes one or more processors, which may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.

[0063] Memory 42 is used to store executable instructions 45. Memory 42 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk storage device.

[0064] Executable instruction 45 can be invoked by processor 41 to cause the computer device to perform the following operations: S1. Obtain multi-source data related to flooding; S2. Preprocess the multi-source data to obtain flooded samples; S3. Based on the submerged samples, select non-submerged samples using a positive sample unlabeled learning method based on clustering enhancement; S4. Input the submerged sample and the non-submerged sample into a preset two-branch flood susceptibility assessment model, and output the flood susceptibility assessment result from the two-branch flood susceptibility assessment model.

[0065] In this embodiment, firstly, multi-source data related to flooding is acquired and preprocessed to overcome the incomplete coverage and information lag of a single data source, generating more realistic and comprehensive positive (flooded) samples. Secondly, based on the flooded samples, a cluster-enhanced positive sample unlabeled learning method is used to select non-flooded samples, ensuring the accuracy of the negative samples while making their distribution representative through clustering. Furthermore, the flooded samples and the non-flooded samples are input into a preset two-branch flood susceptibility assessment model, which outputs the flood susceptibility assessment results. Because the two-branch model fits the flood formation patterns of the different underlying surfaces separately, the accuracy of flood susceptibility assessment is improved.

[0066] Obviously, the above embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the implementation of the present invention. Those skilled in the art can make other variations or modifications based on the above description. It is neither necessary nor possible to exhaustively describe all embodiments here. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the claims of the present invention.

Claims

1. A flood susceptibility assessment method based on multi-source data sample enhancement, characterized in that it includes the following steps: S1. Obtain multi-source data related to flooding; S2. Preprocess the multi-source data to obtain flooded samples; S3. Based on the submerged samples, select non-submerged samples using a positive sample unlabeled learning method based on clustering enhancement; S4. Input the submerged samples and the non-submerged samples into a preset two-branch flood susceptibility assessment model, and output the flood susceptibility assessment result from the two-branch flood susceptibility assessment model.

2. The flood susceptibility assessment method based on multi-source data sample enhancement according to claim 1, characterized in that the multi-source data includes social media data and remote sensing data, and the flood samples include a first flood sample and a second flood sample.

3. The flood susceptibility assessment method based on multi-source data sample enhancement according to claim 2, characterized in that preprocessing the social media data in the multi-source data to obtain the first flood sample includes: S211. Collect the text information of the social media data using web crawler technology; S212. Deduplicate the text information to obtain deduplicated social media text; S213. Input the deduplicated social media text into a preset deep learning model and output flood-related text; S214. Extract the geographical locations mentioned in the flood-related text, geocode the geographical locations to obtain inundation point data, and designate the grid cells where the inundation point data is located as the first flood sample.

4. The flood susceptibility assessment method based on multi-source data sample enhancement according to claim 3, characterized in that preprocessing the remote sensing data in the multi-source data to obtain the second flood sample includes: S221. Perform median filtering on the remote sensing data to obtain filtered image data; S222. Perform image mosaicking on the filtered image data to obtain a pre-disaster radar map and a post-disaster radar map covering the entire study area; S223. Extract features from the pre-disaster radar map and the post-disaster radar map to obtain feature data; S224. Classify the pre-disaster radar map and the post-disaster radar map based on the feature data to obtain a pre-disaster water body image and a post-disaster water body image; S225. Compare the pre-disaster water body image and the post-disaster water body image to extract the raster data of the flooded area; S226. Resample the flooded area raster data, and randomly select the same number of grid cells as the flood point data from the resampled flooded area raster data as the second flood sample.

5. The flood susceptibility assessment method based on multi-source data sample enhancement according to claim 4, characterized in that the step of comparing the pre-disaster and post-disaster water body images to extract raster data of the flooded area includes: for any given grid cell, if it is determined to be a non-water body in the pre-disaster water body image and a water body in the post-disaster water body image, then the grid cell is recorded as a flooded area grid cell, and all flooded area grid cells constitute the flooded area raster data.

6. The flood susceptibility assessment method based on multi-source data sample enhancement according to claim 4, characterized in that the step of selecting non-submerged samples based on the submerged samples using a cluster-enhanced positive sample unlabeled learning method includes: S31. Based on the selected flood influencing factors, use the Gaussian mixture model algorithm to perform cluster analysis on the unlabeled raster data of the permeable surface and the impermeable surface that are neither located in the flooded area of the flooded samples nor contain flooded point data, to obtain several representative clusters of the permeable surface and the impermeable surface; S32. Calculate the negative class score of each grid cell in the unlabeled grid cells of the permeable surface and the impermeable surface using the positive sample unlabeled learning method; S33. Divide the total number of positive samples by the total number of clusters to obtain the number of negative samples per cluster for the permeable surface and the impermeable surface; S34. Based on the number of negative samples to be collected from each representative cluster of the permeable surface and the impermeable surface, select an equal number of negative samples from each cluster according to the negative class score from high to low; the positive samples are submerged samples, and the negative samples are non-submerged samples.

7. The flood susceptibility assessment method based on multi-source data sample enhancement according to claim 6, characterized in that the steps for obtaining the selected flood influencing factors are as follows: S311. Determine the initial flood impact factors; S312. Based on the initial flood impact factors, calculate the Spearman rank correlation coefficient, and remove flood impact factors with strong correlation according to the Spearman rank correlation coefficient to obtain the flood impact factors after initial screening; S313. Calculate the variance inflation factor or tolerance of the flood impact factors after initial screening, and based on the variance inflation factor or the tolerance, eliminate flood impact factors with multicollinearity to obtain the selected flood impact factors.

8. The flood susceptibility assessment method based on multi-source data sample enhancement according to any one of claims 1-7, characterized in that the dual-branch flood susceptibility assessment model includes an input layer, a dual-branch processing layer, and an output layer; the dual-branch processing layer includes an impermeable surface branch processing layer and a permeable surface branch processing layer; the output end of the input layer is connected to the input ends of the impermeable surface branch processing layer and the permeable surface branch processing layer, respectively; and the output ends of the impermeable surface branch processing layer and the permeable surface branch processing layer are connected to the output layer.

9. A flood susceptibility assessment system based on multi-source data sample enhancement, characterized in that the system includes: an acquisition module, used to acquire multi-source data related to flooding; a preprocessing module, used to preprocess the multi-source data to obtain flooded samples; a non-submerged sample selection module, used to select non-submerged samples based on the submerged samples using a positive sample unlabeled learning method based on clustering enhancement; and an evaluation module, used to input the flooded samples and the non-flooded samples into a preset dual-branch flood susceptibility evaluation model, which outputs the flood susceptibility evaluation results.

10. A computer device, characterized in that it includes: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other via the communication bus; the memory is used to store at least one executable instruction that causes the processor to perform the operations of the flood susceptibility assessment method based on multi-source data sample enhancement as described in any one of claims 1-8.