Spatio-temporal fusion system and method for time-series integration of multimodal data
The spatiotemporal fusion system integrates numerical and image data using transformer models to address the challenge of analyzing multivariate data correlations, enhancing analysis accuracy by blending and correlating time-series numerical and image data.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- SIMPLATFORM CO LTD
- Filing Date
- 2025-11-12
- Publication Date
- 2026-07-02
AI Technical Summary
Existing technologies struggle to analyze real-world situations accurately because they extract features from each variable separately, without considering correlations between variables in multivariate data, and because they are limited to numerical data and cannot handle multimodal data in which time-series information is mixed with image information.
A spatiotemporal fusion system that integrates a numeric data receiving unit, image data receiving unit, sequence encoder, image encoder, time encoder, spatial encoder, and fusion unit to blend and analyze time-series numerical and image data using transformer models, generating fused feature vectors through weighted addition.
Enables accurate analysis of multimodal data by correlating numerical and image data, ensuring effective spatiotemporal relationship learning and maintaining orthogonality, thereby improving analysis accuracy.
Description
Spatiotemporal Fusion System and Method for Time Series Integration of Multimodal Data
[0001] The present invention relates to a spatiotemporal fusion system and method for time-series integration of multimodal data, and more specifically, to a system and method capable of fusion and analysis of multimodal data such as image data and sensor data.
[0002] With the advancement of artificial intelligence technology, technologies that analyze various incoming time-series monitoring data to analyze industrial environments and detect or predict abnormal situations are being widely used.
[0003] In particular, as the types of input time series data increase, technologies are being developed to enable the analysis of multivariate data.
[0004] The prior art disclosed in Korean Patent Publication No. 10-2023-0033312, "CNN-based multivariate data processing system and CNN-based multivariate data processing method," is a technology that enables monitoring and analyzing multivariate time series data using such a convolutional neural network. However, these prior technologies had a problem in that it was difficult to consider characteristics based on correlations between variables because they extracted and analyzed features from each individual multivariate time series variable. Furthermore, there was a problem in that it was difficult to accurately analyze real-world situations because they analyzed only numerical data.
[0005] Therefore, spatiotemporal fusion technology is required to derive analysis results by analyzing multimodal data in which time-series information, such as images and audio, is mixed with image information, rather than simply analyzing time-series numerical data.
[0006] The present invention aims to derive more accurate analysis results than when only a single modality is used, by simultaneously analyzing time-series and image multimodal data.
[0007] The present invention aims to derive accurate analysis results by fusing data that is difficult to fuse, such as time-series numerical data and image data.
[0008] The present invention aims to improve the accuracy of analysis through correlation analysis between numerical data and image data by ensuring that the analysis of numerical data and the analysis of image data are not performed separately but are fused and analyzed.
[0009] To achieve this objective, a spatiotemporal fusion system according to an embodiment of the present invention may be configured to include: a numeric data receiving unit for receiving time-series numeric data for a plurality of variables; an image data receiving unit for receiving image data captured in a time-series manner; a sequence encoder for encoding the time-series numeric data to generate a numeric token; an image encoder for encoding the image data to generate an image token; a time encoder for encoding data combining the time-series numeric data and the image token to generate a first intermediate token; a spatial encoder for encoding data combining the image data and the numeric token to generate a second intermediate token; and a fusion unit for deriving integrated characteristic data using the first intermediate token and the second intermediate token.
[0010] At this time, the fusion unit can generate a first final token by blending the first intermediate token and the numerical token, generate a second final token by blending the second intermediate token and the image token, and derive integrated characteristic data by combining the first final token and the second final token.
[0011] In addition, the fusion unit can blend the first intermediate token and the numeric token through weighted addition using a first weight, and blend the second intermediate token and the image token through weighted addition using a second weight.
[0012] At this time, the sequence encoder, the image encoder, the time encoder, and the spatial encoder may be encoders of a transformer model.
[0013] In addition, the numeric token, the image token, the first intermediate token, and the second intermediate token may all be vectors having the form of 1 x d (where d is a predetermined number).
[0014] The present invention can derive more accurate analysis results than when only a single modality is used, by simultaneously analyzing time-series and image multimodal data.
[0015] The present invention can achieve the effect of deriving accurate analysis results by fusing data that is difficult to fuse, such as time-series numerical data and image data.
[0016] The present invention enables the analysis of numerical data and image data to be fused rather than separated, thereby increasing the accuracy of the analysis through correlation analysis between numerical data and image data.
[0017] The present invention can ensure effective spatiotemporal relationship learning by providing temporal-spatial analysis for time-series image multimodal data, maintaining orthogonality between them, and creating fused feature vectors.
[0018] FIG. 1 is a configuration diagram illustrating the internal configuration of a spatiotemporal fusion system according to an embodiment of the present invention.
[0019] FIG. 2 is a diagram illustrating the operation of a spatiotemporal fusion system according to an embodiment of the present invention.
[0020] FIG. 3 is a flowchart showing the flow of a spatiotemporal fusion method according to an embodiment of the present invention.
[0021] To achieve this objective, a spatiotemporal fusion system according to an embodiment of the present invention may be configured to include: a numeric data receiving unit for receiving time-series numeric data for a plurality of variables; an image data receiving unit for receiving image data captured in a time-series manner; a sequence encoder for encoding the time-series numeric data to generate a numeric token; an image encoder for encoding the image data to generate an image token; a time encoder for encoding data combining the time-series numeric data and the image token to generate a first intermediate token; a spatial encoder for encoding data combining the image data and the numeric token to generate a second intermediate token; and a fusion unit for deriving integrated characteristic data using the first intermediate token and the second intermediate token.
[0022] At this time, the fusion unit can generate a first final token by blending the first intermediate token and the numerical token, generate a second final token by blending the second intermediate token and the image token, and derive integrated characteristic data by combining the first final token and the second final token.
[0023] In addition, the fusion unit can blend the first intermediate token and the numeric token through weighted addition using a first weight, and blend the second intermediate token and the image token through weighted addition using a second weight.
[0024] At this time, the sequence encoder, the image encoder, the time encoder, and the spatial encoder may be encoders of a transformer model.
[0025] In addition, the numeric token, the image token, the first intermediate token, and the second intermediate token may all be vectors having the form of 1 x d (where d is a predetermined number).
[0026] Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In describing the present invention, if it is determined that a detailed description of related known components or functions may obscure the essence of the present invention, such detailed description will be omitted. Furthermore, in describing the embodiments of the present invention, specific numerical values are merely examples and the scope of the invention is not limited thereby.
[0027] The spatiotemporal fusion system according to the present invention may be configured in the form of a server equipped with a central processing unit (CPU) and memory, capable of connecting to other terminals via a communication network such as the Internet. However, the present invention is not limited by the configuration of the central processing unit and memory. Furthermore, the spatiotemporal fusion system according to the present invention may be physically configured as a single device or implemented in a distributed form across multiple devices.
[0028] FIG. 1 is a configuration diagram illustrating the internal configuration of a spatiotemporal fusion system according to an embodiment of the present invention.
[0029] As illustrated in the drawing, a spatiotemporal fusion system (101) according to one embodiment of the present invention may be configured to include a numerical data receiving unit (110), an image data receiving unit (120), a sequence encoder (130), an image encoder (140), a time encoder (150), a spatial encoder (160), and a fusion unit (170).
[0030] Each component may be a software module operating within the same physically identical computer system, or may be configured such that two or more physically separated computer systems can operate in conjunction with each other, and various embodiments including the same function fall within the scope of the present invention.
[0031] The numerical data receiving unit (110) receives time-series numerical data for multiple variables. Time-series numerical data is data that is continuously generated and collected over time, and may include various information for monitoring industrial sites such as factories or equipment. For example, status information such as temperature and humidity of a specific space may be continuously collected, and information such as voltage and current of power supplied to specific equipment may also be time-series data.
[0032] The time series numeric data received by the numeric data receiving unit (110) can be data measured at multiple points in time. For example, when voltage and current information is measured and collected at 1-second intervals, it is time series numeric data at 1-second intervals.
[0033] Time series numeric data received by the numeric data receiving unit (110) can be time-synchronized for analysis. If the measurement times differ between types of data, synchronization can be achieved by extracting data measured at the same time point, or by correcting some of the data so that it takes the form it would have had if measured at the same time point.
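The second synchronization option, correcting data so that it appears to have been measured at a common time point, can be sketched as follows. This is a minimal illustration only: the choice of linear interpolation, the function names `interpolate_at` and `synchronize`, and the sample readings are assumptions, not details given in the description.

```python
from bisect import bisect_left


def interpolate_at(times, values, t):
    """Linearly interpolate a reading at time t; `times` must be sorted ascending."""
    if t <= times[0]:
        return values[0]            # clamp before the first sample
    if t >= times[-1]:
        return values[-1]           # clamp after the last sample
    i = bisect_left(times, t)       # first index with times[i] >= t
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def synchronize(streams, grid):
    """Resample every stream onto a shared time grid.

    streams: dict mapping a variable name to a (times, values) pair.
    Returns one row per grid time, with one column per variable (sorted by name).
    """
    names = sorted(streams)
    return [[interpolate_at(*streams[name], t) for name in names] for t in grid]


# Hypothetical readings: voltage sampled on whole seconds, current offset by 0.5 s.
streams = {
    "voltage": ([0.0, 1.0, 2.0], [220.0, 222.0, 224.0]),
    "current": ([0.5, 1.5, 2.5], [5.0, 6.0, 7.0]),
}
rows = synchronize(streams, grid=[1.0, 2.0])  # columns: current, voltage
```

Each row of `rows` then holds values for all variables at one shared time point, which is the form a multivariate encoder input would typically need.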
[0034] The image data receiving unit (120) receives image data captured in a time series. The image data may be data received from a camera module, such as a CCTV, installed to check the status of a place or equipment to be monitored, may be an image captured at a specific point in time from a video received in real time, or may be still images taken periodically.
[0035] The image data received by the image data receiving unit (120) can be synchronized with each point in time of the time series numeric data so that it can be analyzed in conjunction with the time series numeric data, and when the time series numeric data is extracted and analyzed in a certain time interval, one or more image data captured within that time interval can be used. Through this, the numerically derived information and the image information of the site can be analyzed simultaneously to derive accurate analysis results.
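As a rough sketch of how images might be matched to the numeric data, the helpers below select the images captured inside a time interval and the image closest to a given numeric sample time. The helper names, the half-open interval convention, and the timestamps are assumptions made for illustration.

```python
def images_in_window(image_times, start, end):
    """Indices of images whose capture time falls in the half-open window [start, end)."""
    return [i for i, t in enumerate(image_times) if start <= t < end]


def nearest_image(image_times, t):
    """Index of the image captured closest to time t (ties go to the earlier image)."""
    return min(range(len(image_times)), key=lambda i: abs(image_times[i] - t))


# Hypothetical capture times in seconds.
image_times = [0.2, 1.1, 1.9, 3.4]
window_hits = images_in_window(image_times, start=1.0, end=2.0)
closest = nearest_image(image_times, t=3.0)
```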
[0036] In addition to the embodiments described above, the present invention can be applied in various fields such as sports scene analysis and medical data analysis. In the case of sports scene analysis, the scene may be input as image data and numerical information such as the trajectory of a ball may be input as numerical data. In the case of medical data analysis, medical image data may be input as image data and biosignal data may be input as numerical data, and various other variations are also possible.
[0037] The "Encoders" described below all refer to encoders used in Transformer models. Transformer models are deep learning models first introduced in the paper "Attention Is All You Need" published by Google researchers in 2017. Unlike conventional Recurrent Neural Networks (RNNs), which process sequence data one step at a time, they process it in parallel, which addresses the problems of slow training and of learning long-range dependencies.
[0038] The Transformer model is designed based on an Encoder-Decoder structure, and the encoder of the Transformer model receives input data and generates a representation of a token. In this invention as well, time-series numerical data and image data are received as input data to generate tokens of a defined form. In this invention, the tokens take the form of vector data.
[0039] The sequence encoder (130) encodes the time series numeric data to generate numeric tokens. As previously described, the time series numeric data is numeric data collected according to the flow of the time series, and mainly represents sensor values measured at regular intervals.
[0040] The sequence encoder (130) generates numeric tokens by encoding such time series data as input data, and the numeric tokens become vector data having the form of 1 x d (where d is a predetermined number). At this time, d may vary depending on the characteristics of the data being analyzed, and the present invention is not limited by that number.
[0041] The image encoder (140) encodes the image data to generate an image token. In order to analyze the image data, the image data may be divided into a predetermined number of regions and features may be derived from each region, but the present invention is not limited by such division of data.
[0042] The image token generated by the image encoder (140) is 1 x d vector data, just like the numeric token described earlier. Because both tokens are vector data of the same size, they can be easily fused and analyzed.
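The division of an image into regions can be sketched as below. The description does not fix a division scheme, so the non-overlapping square grid, the grayscale list-of-rows image, and the name `split_into_regions` are all assumptions. In a full encoder each flattened region would then typically be projected to d dimensions, giving an n x d input.

```python
def split_into_regions(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch regions.

    Returns the regions in row-major order, each flattened to a list of pixel values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    regions = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            regions.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return regions


# Hypothetical 4 x 4 grayscale image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
regions = split_into_regions(image, patch=2)  # n = 4 regions of 4 pixels each
```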
[0043] The time encoder (150) generates a first intermediate token by encoding data that combines the time series numeric data and the image token. Since the image token is generated by encoding image data as previously described, the first intermediate token is generated by taking into account the characteristics of the image data in addition to the characteristics of the time series numeric data.
[0044] The time encoder (150) is an advanced encoder that receives not only time-series numerical data but also image tokens as part of the input, and combines the time-series numerical data and image tokens to capture the temporal structure along with the spatial context.
[0045] The first intermediate token generated by the time encoder (150) consists of 1 x d vector data, just like the other tokens described above.
[0046] The spatial encoder (160) generates a second intermediate token by encoding data that combines the image data and the numeric token. Since the numeric token is derived by analyzing time-series numeric data as previously described, the second intermediate token is generated by considering the characteristics of the time-series numeric data in addition to the characteristics of the image data.
[0047] The spatial encoder (160) is an advanced encoder that receives not only image data but also numeric tokens as part of the input, and combines the image data and numeric tokens to capture spatial information along with the sequence context.
[0048] The second intermediate token generated by the spatial encoder (160) consists of 1 x d vector data, just like the other tokens described above.
[0049] As such, since the numeric token, image token, first intermediate token, and second intermediate token are all composed of 1 x d vector data, it becomes possible to blend and integrate them.
[0050] The fusion unit (170) derives integrated characteristic data using the first intermediate token and the second intermediate token.
[0051] The fusion unit (170) blends the first intermediate token and the numeric token to generate a first final token, blends the second intermediate token and the image token to generate a second final token, and combines the first final token and the second final token to derive integrated characteristic data.
[0052] The fusion unit (170) blends the first intermediate token and the numeric token through weighted addition using the first weight, and blends the second intermediate token and the image token through weighted addition using the second weight.
[0053] In the fusion unit (170), blending the first intermediate token and the numeric token by weighted addition using the first weight is performed as shown in [Equation 1] below.
[0054] [Equation 1]
[0055]
[0056] In [Equation 1], the result is the first final token, the terms being blended are the numeric token and the first intermediate token, and β represents the first weight.
[0057] In addition, in the fusion unit (170), blending the second intermediate token and the image token by weighted addition using the second weight is performed as shown in [Equation 2] below.
[0058] [Equation 2]
[0059]
[0060] In [Equation 2], the result is the second final token, the terms being blended are the image token and the second intermediate token, and α represents the second weight.
[0061] By allowing the assignment of first and second weights in this manner, it becomes possible to finely control blending. In this case, using the squared terms of the first and second weights, as in [Equation 1] and [Equation 2], allows for more gradual adjustment of their respective contributions, thereby preventing global features from being excessively diluted as localized information accumulates across multiple layers. Furthermore, this approach stabilizes the gradient flow, ensuring smoother learning transitions as network depth increases.
[0062] Therefore, by adjusting the first and second weights, it is possible to enable more accurate fusion analysis of multimodal data.
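The equations themselves are not reproduced in this text, so the following is only a sketch of one weighted addition that is consistent with the description: a blend in which the squared weight sets the share of the intermediate token. The exact form `(1 - w^2) * base + w^2 * intermediate`, the function name `blend`, and the sample values are assumptions and may differ from the actual [Equation 1] and [Equation 2].

```python
def blend(base_token, intermediate_token, weight):
    """Blend two 1 x d tokens by weighted addition.

    ASSUMED form (not taken from the source): (1 - w^2) * base + w^2 * intermediate.
    """
    w2 = weight ** 2
    return [(1.0 - w2) * b + w2 * x
            for b, x in zip(base_token, intermediate_token)]


# Hypothetical tokens with d = 4.
numeric_token = [1.0, 2.0, 3.0, 4.0]
first_intermediate = [3.0, 2.0, 1.0, 0.0]
beta = 0.5  # first weight; the same function would be used with alpha for the image side
first_final = blend(numeric_token, first_intermediate, beta)
```

Under this assumed form, the derivative of w^2 with respect to w is 2w, so a small change in a weight near zero changes the contribution only slightly, which matches the "more gradual adjustment" described above.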
[0063] In the fusion unit (170), combining the first final token and the second final token can be implemented by simply concatenating the two 1 x d vectors to generate vector data having the form of 1 x 2d. This maintains the independence of each token while allowing the integrated token to draw on the complementary strengths of the two modalities.
[0064] The integrated feature data generated by the fusion unit (170) is analyzed through a classifier to perform an analysis of the situation at each point in time. At this time, various technologies may be applied to the classifier, and it may vary depending on the characteristics of the subject of analysis. As such, since the integrated feature data contains the results of analyzing the characteristics of both time-series numerical data and image data, it enables fusion analysis of multimodal data including time and space.
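Concatenation and classification can be sketched as follows. The description leaves the classifier open, so the linear layer with softmax, the two-class setup, and every weight value here are hypothetical placeholders rather than the classifier actually used.

```python
import math


def concatenate(first_final, second_final):
    """Join two 1 x d tokens into one 1 x 2d integrated feature vector."""
    return first_final + second_final


def classify(features, weights, biases):
    """Linear layer followed by softmax; returns one probability per class."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical final tokens with d = 2, so the fused vector has 2d = 4 entries.
first_final = [1.5, 2.0]
second_final = [0.5, -1.0]
fused = concatenate(first_final, second_final)

# Hypothetical parameters for two classes (for example, normal and abnormal).
weights = [[0.2, -0.1, 0.4, 0.3], [-0.3, 0.2, 0.1, -0.2]]
biases = [0.0, 0.1]
probs = classify(fused, weights, biases)
```

Because the two halves of `fused` are kept side by side rather than summed, the classifier can weight the numeric-side and image-side features independently.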
[0065] FIG. 2 is a diagram illustrating the operation of a spatiotemporal fusion system according to an embodiment of the present invention.
[0066] As illustrated in the drawing, when numeric data and image data are input as input data, the sequence encoder and image encoder respectively encode the numeric data and image data to generate numeric tokens and image tokens. As illustrated, the sequence encoder and image encoder are configured so that the numeric tokens and image tokens are derived in the form of vector data having a 1 x d structure.
[0067] The time encoder generates a first intermediate token by encoding the time-series numerical data together with the image token generated by the image encoder. Likewise, the spatial encoder generates a second intermediate token by encoding the image data together with the numeric token generated by the sequence encoder. Both the first and second intermediate tokens are 1 x d vector data, as illustrated in the drawing, so that they can be blended with the numeric token and the image token, respectively.
[0068] The m x d data input to the time encoder is vector data composed of the time-series numerical data, and the n x d data input to the spatial encoder may be vector data whose features are extracted by dividing the image data into multiple regions. The image token is added to the time encoder's input and the numeric token is added to the spatial encoder's input, so that the time encoder encodes input data of (m+1) x d structure and the spatial encoder encodes input data of (n+1) x d structure.
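The shapes described above can be illustrated with a heavily simplified stand-in for the time encoder: a single scaled dot-product self-attention pass with no learned projections, applied to the m x d numeric rows plus the appended image token. Appending the token as the last row and reading that row back as the 1 x d intermediate token are assumptions, as are all names and values; a real transformer encoder would add learned query, key, and value projections, multiple heads, and feed-forward layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(rows):
    """Scaled dot-product self-attention with identity Q/K/V projections (simplification)."""
    d = len(rows[0])
    attended = []
    for query in rows:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in rows]
        weights = softmax(scores)
        attended.append([sum(w * value[j] for w, value in zip(weights, rows))
                         for j in range(d)])
    return attended


def encode_with_token(sequence, token):
    """Append a 1 x d token to an m x d sequence, attend, and read the token row back."""
    combined = sequence + [token]        # (m + 1) x d input
    attended = self_attention(combined)
    return combined, attended[-1]        # 1 x d intermediate token (assumed read-out)


m, d = 3, 4
numeric_rows = [[0.1 * (i + j) for j in range(d)] for i in range(m)]  # m x d
image_token = [1.0, 0.0, -1.0, 0.5]                                   # 1 x d
combined, first_intermediate = encode_with_token(numeric_rows, image_token)
```

The spatial encoder would be wired the same way, with the n x d region features as the sequence and the numeric token as the appended row.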
[0069] Subsequently, a blending process is performed in which the numeric token derived from the sequence encoder and the first intermediate token derived from the time encoder are added using a weighted sum operation, and the image token derived from the image encoder and the second intermediate token derived from the spatial encoder are added using a weighted sum operation. This process is carried out using the equations [Equation 1] and [Equation 2] explained earlier.
[0070] The first and second final tokens generated through blending are concatenated to derive integrated feature data. Since the integrated feature data is structured to include both the characteristics of time-series numerical data and image data, accurate situational analysis by combining them becomes possible.
[0071] When integrated characteristic data is input into a classifier, analysis results regarding the situation at a given point in time are derived. For example, if time-series process sensor data and process image data are the input, the classifier can analyze whether there are any abnormalities in the process.
[0072] FIG. 3 is a flowchart showing the flow of a spatiotemporal fusion method according to an embodiment of the present invention.
[0073] A spatiotemporal fusion method according to one embodiment of the present invention operates in a spatiotemporal fusion system (101) equipped with a central processing unit (CPU) and memory, and the description of the spatiotemporal fusion system (101) given above applies as is. Therefore, everything described for the spatiotemporal fusion system (101) can be applied to implement the spatiotemporal fusion method, even where it is not separately explained below.
[0074] The numerical data reception step (S301) receives time-series numerical data for multiple variables. Time-series numerical data is data that is continuously generated and collected over time, and may include various information for monitoring industrial sites such as factories or equipment. For example, status information such as temperature and humidity of a specific space may be continuously collected, and information such as voltage and current of power supplied to specific equipment may also be time-series data.
[0075] The time series numeric data received in the numeric data reception step (S301) may be data measured at multiple points in time. For example, if voltage and current information is measured and collected at 1-second intervals, it is time series numeric data at 1-second intervals.
[0076] Time series numeric data received in the numeric data reception step (S301) can be time-synchronized for analysis. If the measurement times differ between types of data, synchronization can be achieved by extracting data measured at the same time, or by correcting some of the data so that it takes the form it would have had if measured at the same time.
[0077] The image data reception step (S302) receives image data captured in a time series. The image data may be data received from a camera module, such as a CCTV, installed to check the status of a location or equipment to be monitored, may be an image captured at a specific point in time from a video received in real time, or may be still images taken periodically.
[0078] In order to enable analysis in conjunction with the time-series numerical data, the image data received in the image data reception step (S302) can be synchronized with each point in time of the time-series numerical data, and when the time-series numerical data is extracted and analyzed in specific time intervals, one or more image data captured within those time intervals can be used. Through this, accurate analysis results can be derived by simultaneously analyzing numerical information and image information about the site.
[0079] In addition to the embodiments described above, the present invention can be applied in various fields such as sports scene analysis and medical data analysis. In the case of sports scene analysis, the scene may be input as image data and numerical information such as the trajectory of a ball may be input as numerical data. In the case of medical data analysis, medical image data may be input as image data and biosignal data may be input as numerical data, and various other variations are also possible.
[0080] The "Encoders" described below all refer to encoders used in Transformer models. Transformer models are deep learning models first introduced in the paper "Attention Is All You Need" published by Google researchers in 2017. Unlike conventional Recurrent Neural Networks (RNNs), which process sequence data one step at a time, they process it in parallel, which addresses the problems of slow training and of learning long-range dependencies.
[0081] The Transformer model is designed based on an Encoder-Decoder structure, and the encoder of the Transformer model receives input data and generates a representation of a token. In this invention as well, time-series numerical data and image data are received as input data to generate tokens of a defined form. In this invention, the tokens take the form of vector data.
[0082] The sequence encoding step (S303) generates numeric tokens by encoding the time-series numeric data with a sequence encoder. As previously described, the time-series numeric data is numeric data collected according to the flow of a time series, and mainly represents sensor values measured at regular intervals.
[0083] The sequence encoding step (S303) generates numeric tokens by encoding such time series data as input data, and the numeric tokens become vector data having the form of 1 x d (where d is a predetermined number). At this time, d may vary depending on the characteristics of the data being analyzed, and the present invention is not limited by that number.
[0084] The image encoding step (S304) generates an image token by encoding the image data with an image encoder. To analyze the image data, the image data may be divided into a predetermined number of regions and features may be derived from each region; however, the present invention is not limited by such division of data.
[0085] The image token generated in the image encoding step (S304) is 1 x d vector data, just like the numeric token described earlier. Because both tokens are vector data of the same size, they can be easily fused and analyzed.
[0086] The time encoding step (S305) generates a first intermediate token by encoding the data combining the time series numerical data and the image token using a time encoder. Since the image token is generated by encoding the image data as previously described, the first intermediate token is generated by considering the characteristics of the image data in addition to the characteristics of the time series numerical data.
[0087] The time encoding step (S305) uses an advanced encoder that receives not only time-series numerical data but also image tokens as part of the input, and combines the time-series numerical data and image tokens to capture the temporal structure along with the spatial context.
[0088] The first intermediate token generated in the time encoding step (S305) consists of 1 x d vector data, just like the other tokens described above.
[0089] The spatial encoding step (S306) generates a second intermediate token by encoding the data combining the image data and the numeric token using a spatial encoder. Since the numeric token is derived by analyzing time-series numeric data as previously described, the second intermediate token is generated by considering the characteristics of the time-series numeric data in addition to the characteristics of the image data.
[0090] The spatial encoding step (S306) uses an advanced encoder that receives not only image data but also numeric tokens as part of the input, and combines the image data and numeric tokens to capture spatial information along with the sequence context.
[0091] The second intermediate token generated in the spatial encoding step (S306) consists of 1 x d vector data, just like the other tokens described above.
[0092] As such, since the numeric token, image token, first intermediate token, and second intermediate token are all composed of 1 x d vector data, it becomes possible to blend and integrate them.
[0093] The fusion step (S307) derives integrated characteristic data using the first intermediate token and the second intermediate token.
[0094] The fusion step (S307) generates a first final token by blending the first intermediate token and the numeric token, generates a second final token by blending the second intermediate token and the image token, and derives integrated characteristic data by combining the first final token and the second final token.
[0095] The fusion step (S307) blends the first intermediate token and the numeric token through weighted addition using a first weight, and blends the second intermediate token and the image token through weighted addition using a second weight.
[0096] In the fusion step (S307), blending the first intermediate token and the numeric token by weighted addition using the first weight is performed as shown in [Equation 1] below.
[0097] [Equation 1]
[0098]
[0099] In [Equation 1], the result is the first final token, the terms being blended are the numeric token and the first intermediate token, and β represents the first weight.
[0100] In addition, in the fusion step (S307), blending the second intermediate token and the image token by weighted addition using the second weight is performed as shown in [Equation 2] below.
[0101] [Equation 2]
[0102]
[0103] In [Equation 2], the result is the second final token, the terms being blended are the image token and the second intermediate token, and α represents the second weight.
[0104] By allowing the assignment of first and second weights in this manner, it becomes possible to finely control blending. In this case, using the squared terms of the first and second weights, as in [Equation 1] and [Equation 2], allows for more gradual adjustment of their respective contributions, thereby preventing global features from being excessively diluted as localized information accumulates across multiple layers. Furthermore, this approach stabilizes the gradient flow, ensuring smoother learning transitions as network depth increases.
[0105] Therefore, by adjusting the first and second weights, it is possible to enable more accurate fusion analysis of multimodal data.
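The weighted addition described above can be illustrated with a minimal sketch. Because the bodies of [Equation 1] and [Equation 2] are not reproduced in this text, the exact formula is an assumption here: the sketch uses a convex combination in which the squared weight scales the intermediate token and the remainder scales the original token. The function name `blend`, the weight range of 0 to 1, and the sample values are hypothetical and are not taken from the specification.

```python
from typing import List

Vector = List[float]


def blend(base: Vector, intermediate: Vector, weight: float) -> Vector:
    """Blend two 1 x d tokens by weighted addition with a squared weight.

    ASSUMPTION: the form (1 - w^2) * base + w^2 * intermediate is only one
    plausible reading of "weighted addition" with "squared terms"; the
    specification's actual equations are not available in this text.
    """
    if len(base) != len(intermediate):
        raise ValueError("both tokens must have the same dimension d")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("this sketch assumes a weight in [0, 1]")
    w2 = weight * weight
    return [(1.0 - w2) * b + w2 * m for b, m in zip(base, intermediate)]


# Hypothetical 1 x 4 tokens (d = 4).
numeric_token = [1.0, 2.0, 3.0, 4.0]
first_intermediate = [0.0, 0.0, 0.0, 0.0]
beta = 0.5  # first weight; beta squared is 0.25

first_final = blend(numeric_token, first_intermediate, beta)
print(first_final)  # [0.75, 1.5, 2.25, 3.0]
```

Under this assumed form, a weight of 0.5 contributes only 25 percent of the intermediate token, which shows how squaring makes small weights act more gently than a linear weight would. The second final token would be obtained the same way from the image token, the second intermediate token, and α.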
[0106] In the fusion step (S307), combining the first final token and the second final token can be implemented by simply concatenating two vectors configured in the form of 1 x d to generate vector data having the form of 1 x 2d. This allows the independence of each token to be maintained while enabling the use of the mutually complementary advantages of the two modalities in the representation of the integrated token.
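The concatenation described in this paragraph can be sketched as follows. The function name `combine` and the sample token values are hypothetical; only the 1 x d to 1 x 2d shape relationship comes from the description.

```python
from typing import List


def combine(first_final: List[float], second_final: List[float]) -> List[float]:
    """Concatenate two 1 x d final tokens into one 1 x 2d vector."""
    if len(first_final) != len(second_final):
        raise ValueError("both final tokens must have the same dimension d")
    # The first d entries come from the numeric branch, the last d entries
    # from the image branch, so neither token overwrites the other.
    return list(first_final) + list(second_final)


# Hypothetical final tokens with d = 4.
first_final = [0.75, 1.5, 2.25, 3.0]
second_final = [0.1, 0.2, 0.3, 0.4]

integrated = combine(first_final, second_final)
print(len(integrated))  # 8
```

Because concatenation keeps each token in its own slice of the result, a downstream model can weight the two modalities separately, which is one way to read the statement that the independence of each token is maintained.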
[0107] The integrated characteristic data generated in the fusion step (S307) is passed through a classifier to analyze the situation at each point in time. Various techniques may be used for the classifier, and the choice may vary depending on the characteristics of the analysis target. Because the integrated characteristic data reflects the features of both the time-series numerical data and the image data, it enables fusion analysis of multimodal data spanning both time and space.
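As an illustration only, one of the many possible classifiers is a single linear layer followed by a softmax over the 1 x 2d integrated vector. The specification does not name a particular classifier, so everything in this sketch is an assumption: the function names, the two labels, and the hand-picked weights (which would normally be learned) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    integrated: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    labels: Sequence[str],
) -> Tuple[str, List[float]]:
    """Score one integrated vector with a linear layer, then pick the top label."""
    scores = [
        sum(w * x for w, x in zip(row, integrated)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical integrated vector for one time point (d = 2, so 2d = 4).
integrated = [0.5, -0.2, 0.1, 0.9]
# Hypothetical weights: one row per label, one column per vector entry.
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
biases = [0.0, 0.0]
labels = ["normal", "abnormal"]

label, probs = classify(integrated, weights, biases, labels)
print(label)
```

Running the same call on the integrated vector of each time point would yield a per-time-point result of the kind the paragraph describes.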
[0108] The spatiotemporal fusion method according to the present invention can be produced as a program to be executed by a computer and recorded on a computer-readable recording medium.
[0109] Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions such as ROM, RAM, and flash memory.
[0110] Examples of program instructions include machine code, such as that generated by a compiler, as well as high-level language code that can be executed by a computer using an interpreter, etc. The hardware device may be configured to operate as one or more software modules to perform processing according to the present invention, and vice versa.
[0111] Although the invention has been described above with reference to embodiments, those skilled in the art may make various modifications and changes to the invention without departing from the spirit and scope of the invention as described in the following claims.
[0112] The present invention relates to a spatiotemporal fusion system and method for time-series integration of multimodal data, and more specifically, to a system and method capable of fusion and analysis of multimodal data such as image data and sensor data.
[0113] 101: Spatiotemporal fusion system
[0114] 110: Numeric data receiving unit 120: Image data receiving unit
[0115] 130: Sequence encoder 140: Image encoder
[0116] 150: Time encoder 160: Spatial encoder
[0117] 170: Fusion unit
Claims
1. A spatiotemporal fusion system comprising: a numeric data receiving unit that receives time-series numeric data for multiple variables; an image data receiving unit that receives image data captured in a time series; a sequence encoder that generates a numeric token by encoding the time-series numeric data; an image encoder that generates an image token by encoding the image data; a time encoder that generates a first intermediate token by encoding data combining the time-series numeric data and the image token; a spatial encoder that generates a second intermediate token by encoding data combining the image data and the numeric token; and a fusion unit that derives integrated characteristic data using the first intermediate token and the second intermediate token.
2. The spatiotemporal fusion system of claim 1, wherein the fusion unit generates a first final token by blending the first intermediate token and the numeric token, generates a second final token by blending the second intermediate token and the image token, and derives the integrated characteristic data by combining the first final token and the second final token.
3. The spatiotemporal fusion system of claim 2, wherein the fusion unit blends the first intermediate token and the numeric token through weighted addition using a first weight, and blends the second intermediate token and the image token through weighted addition using a second weight.
4. The spatiotemporal fusion system of claim 3, wherein the sequence encoder, the image encoder, the time encoder, and the spatial encoder are encoders of a transformer model.
5. The spatiotemporal fusion system of claim 4, wherein the numeric token, the image token, the first intermediate token, and the second intermediate token are all vectors having the form 1 x d (where d is a predetermined number).
6. A spatiotemporal fusion method operating in a spatiotemporal fusion system equipped with a central processing unit and memory, the method comprising: a numeric data receiving step of receiving time-series numeric data for multiple variables; an image data receiving step of receiving image data captured in a time series; a sequence encoding step of generating a numeric token by encoding the time-series numeric data; an image encoding step of generating an image token by encoding the image data; a time encoding step of generating a first intermediate token by encoding data combining the time-series numeric data and the image token; a spatial encoding step of generating a second intermediate token by encoding data combining the image data and the numeric token; and a fusion step of deriving integrated characteristic data using the first intermediate token and the second intermediate token.
7. The spatiotemporal fusion method of claim 6, wherein the fusion step generates a first final token by blending the first intermediate token and the numeric token, generates a second final token by blending the second intermediate token and the image token, and derives the integrated characteristic data by combining the first final token and the second final token.
8. The spatiotemporal fusion method of claim 7, wherein the fusion step blends the first intermediate token and the numeric token through weighted addition using a first weight, and blends the second intermediate token and the image token through weighted addition using a second weight.
9. The spatiotemporal fusion method of claim 8, wherein the sequence encoding step, the image encoding step, the time encoding step, and the spatial encoding step each perform encoding using an encoder of a transformer model.
10. The spatiotemporal fusion method of claim 9, wherein the numeric token, the image token, the first intermediate token, and the second intermediate token are all vectors having the form 1 x d (where d is a predetermined number).
11. A computer-readable recording medium having recorded thereon a program for causing a computer to execute the method of any one of claims 6 to 10.