A store operation action recommendation method and system based on scene perception
By employing a Transformer field decoder, a conditional diffusion probability model, and related techniques, the invention addresses global situational awareness and action recommendation over multi-source data in store management systems. It enables globally optimized, interpretable recommendations of operational actions and thereby raises the level of intelligence in store operations.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN SHOWTOP TECH CO LTD
- Filing Date
- 2026-05-18
- Publication Date
- 2026-06-16
AI Technical Summary
Existing store management systems lack end-to-end intelligent transformation capabilities from multi-source perception data to specific executable action instructions. They cannot integrate multimodal operational data to form a global situational awareness, and they lack the ability to quantitatively extrapolate the consequences of action execution. This results in insufficient feasibility and credibility of recommendations, and an inability to make optimized decisions on the combination, coordination, and conflict relationships of multiple operational actions in their time sequence.
A Transformer field decoder is used for store situation modeling. Combined with a multilayer perceptron and cross-attention mechanism, a global situation representation vector is generated. A conditional diffusion probability model and a causal graph structure are constructed. Candidate action tuples are generated through a U-Net denoising network. Monte Carlo tree search is performed in a parallel digital twin sandbox to optimize action sequence recommendation.
It achieves global perception and causal inference over the store's operating status, generates action sequence recommendations with interpretable evidence, improves the perception accuracy and decision-making intelligence of store operations, and ensures the global optimality and feasibility of action execution.
Figure CN122222652A_ABST
Description
Technical Field
[0001] This invention relates to the field of smart catering store technology, and in particular to a method and system for recommending store operation actions based on scene perception.
Background Technology
[0002] As the catering and retail industries transform towards digital and refined operations, stores have generally deployed various systems such as queuing, ordering, kitchen order management, inventory monitoring, and equipment sensing, capable of generating massive amounts of data in real time, including customer flow, table turnover, food preparation, inventory, and equipment status. However, most current store management systems remain at the level of data dashboards and threshold alarms, merely presenting various indicators in the form of charts or dashboards. They rely on store managers' judgments based on personal experience to determine the appropriate operational actions, lacking the end-to-end intelligent transformation capability from multi-source perceived data to specific executable action instructions. The few systems with recommendation functions typically rely on static rule bases or collaborative filtering strategies, only prompting the opening of additional tables when customer flow exceeds the limit or restocking when inventory falls below the safe level. This type of rule-driven recommendation cannot perceive the coupling effects of multiple abnormal concurrency events. For example, when food preparation delays and queue surges overlap, whether to prioritize expediting orders or increasing staff is difficult for the rule system to optimize.
[0003] Furthermore, existing recommendation methods generally lack the ability to quantify and extrapolate the consequences of actions, failing to predict the specific impact of actions on core operational indicators such as queuing time, table turnover efficiency, and customer complaint risk before execution, resulting in insufficient feasibility and credibility of recommendations. Simultaneously, existing systems often make isolated recommendations for single actions, neglecting the temporal coordination and conflict relationships between multiple operational actions, leading to a lack of global optimality in continuous decision-making. Therefore, there is an urgent need for a method that can integrate multimodal operational data from stores to form a global situational awareness, automatically generate candidate actions adapted to the current operational status, perform causal extrapolation and temporal combination optimization of action execution consequences, and ultimately output an action sequence with interpretable evidence, thereby overcoming the limitations of existing systems in terms of perception depth, decision intelligence, and execution closed loop.
Summary of the Invention
[0004] This invention overcomes the shortcomings of the prior art and provides a method and system for recommending store operation actions based on scene awareness. Its main purpose is to improve the perception accuracy and globality of complex store operations.
[0005] To achieve the above objectives, the first aspect of the present invention provides a method for recommending store operation actions based on scene awareness, comprising:
Collecting multimodal operational data, store equipment operation data, and store inventory change data of the target store; performing time windowing and gridding on the collected asynchronous multi-source data and converting it into multi-channel situational slices with fixed time steps to generate a store situation sequence set.
Discretizing the physical space of the store into a three-dimensional grid extended with a time dimension, with a learnable coordinate code attached to each grid point; using a Transformer field decoder with a spatiotemporal attention mechanism over the store situation sequence set to model the implicit state field of the store and extract a global situation representation vector.
Representing operational actions as structured parameter tuples containing action type, target, and action amplitude; constructing a conditional diffusion probability model in which the forward process gradually adds Gaussian noise to real action samples and the reverse process, conditioned on the global situation representation vector, iteratively denoises random noise through a U-Net denoising network to generate multiple candidate action tuples.
Constructing a causal graph structure with store operation elements as nodes and fitting the causal path coefficients between nodes; setting the initial values of exogenous variables from the global situation representation vector; using the candidate actions as intervention parameters to perform counterfactual deduction of action impact and priority ranking, generating a candidate action ranking table with single-step causal deduction conclusions.
Constructing a parallel digital twin sandbox of the target store; generating a candidate pool to be combined from the candidate action ranking table; deducing the store's future operating trajectory in the sandbox through Monte Carlo tree search; selecting the optimal action sequence by evaluating cumulative returns and robustness; generating natural language explanations with a large language model and pushing them to the terminal.
[0006] In this solution, the collection of multimodal operational data, store equipment operation data, and store inventory change data of the target store, the time windowing and gridding processing of the collected asynchronous multi-source data, and the conversion into multi-channel situational slices with fixed time steps are specifically included in generating a store situational sequence set. Based on the heterogeneous monitoring array of the store, the customer flow trajectory in the front hall and the table opening and checkout events are obtained. The order flow middleware obtains the delivery timestamp of the dishes. The equipment programmable logic controller obtains the equipment load and abnormal alarms. The gravity shelf and smart freezer obtain the material inventory change data to form a multi-source raw data stream. The multi-source raw data stream is aligned with network time protocol timestamps and corrected for jumps. Based on the installation location of each data source sensor in the store, spatial coordinate labels are added to each record, transforming the multi-source raw data stream into a standardized asynchronous record stream with dual time and space labels. Based on the store floor plan, a spatial polyhedron grid is pre-constructed covering the front dining area, kitchen stalls, food delivery aisles, and storage shelving areas. Each unit in the grid is defined as an independent spatial anchor point. The standardized asynchronous recording stream is mapped to the corresponding spatial anchor point according to the spatial coordinate label. The corresponding spatial anchor point is filled with the estimated remaining dining time of the table, the backlog of orders at the stall, the equipment health scalar and the material consumption trend value to form a spatially aligned multi-channel time-series signal stream. Adaptive window length gating is performed on the multi-channel time-series signal stream at each spatial anchor point. 
Narrow-window Gaussian weighted statistics are applied to fast-changing channels, wide-window smoothing is applied to slow-changing channels, and the frequency and duration of discrete event channels are statistically analyzed. The gating mechanism is used to selectively filter the original values within the window. All channel feature values of all spatial anchor points within the same time step are arranged into multi-channel situation slices for that time step. The channels contained in each slice correspond to the customer flow density field, the average remaining dining time per table, the length of the stall order queue, the confidence level of equipment anomalies, and the material consumption trend, respectively. They are arranged in order of time steps to form a store situation sequence set.
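As a rough illustration of the slicing step described above, the following sketch bins timestamped, anchor-tagged records into fixed time steps and arranges them as a nested list indexed by time step, spatial anchor point, and channel. The record layout, the channel keys, the step length, and the use of a plain per-bin mean are all assumptions made for illustration; the specification does not fix any of them.

```python
# Hypothetical channel keys, one per channel named in the slice description.
CHANNELS = ["flow_density", "remaining_dining_time", "queue_length",
            "anomaly_confidence", "consumption_trend"]


def build_slices(records, n_anchors, t_start, step_s, n_steps):
    """Bin (timestamp, anchor, channel, value) records into a
    [n_steps][n_anchors][n_channels] structure of per-bin means."""
    n_ch = len(CHANNELS)
    sums = [[[0.0] * n_ch for _ in range(n_anchors)] for _ in range(n_steps)]
    counts = [[[0] * n_ch for _ in range(n_anchors)] for _ in range(n_steps)]
    for ts, anchor, channel, value in records:
        step = int((ts - t_start) // step_s)
        if 0 <= step < n_steps:  # drop records outside the covered period
            c = CHANNELS.index(channel)
            sums[step][anchor][c] += value
            counts[step][anchor][c] += 1
    # Empty bins are left at 0.0 here; a real pipeline might forward-fill instead.
    return [[[s / n if n else 0.0 for s, n in zip(s_row, n_row)]
             for s_row, n_row in zip(s_step, n_step)]
            for s_step, n_step in zip(sums, counts)]


records = [
    (0.5, 0, "queue_length", 4.0),
    (1.5, 0, "queue_length", 6.0),
    (2.5, 1, "flow_density", 0.8),
]
slices = build_slices(records, n_anchors=2, t_start=0.0, step_s=2.0, n_steps=2)
```

Each `slices[k]` plays the role of one multi-channel situational slice, and the outer list, ordered by time step, corresponds to the store situation sequence set.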
[0007] In this scheme, the physical space of the store is discretized into a three-dimensional grid with a time dimension and an appended learnable coordinate code. A Transformer field decoder is used, and the store situation sequence set is combined with a spatiotemporal attention mechanism to perform implicit state field modeling and global situation representation extraction, resulting in a global situation representation vector. Specifically, this includes: Based on the store floor plan and floor height parameters, the physical space of the target store is discretized into a three-dimensional spatial grid covering the front hall, back kitchen, food delivery area and storage area, and expanded into a four-dimensional spatiotemporal grid framework along the time axis. Each grid point is assigned a learnable coordinate encoding vector composed of spatial location encoding and temporal location encoding. The multilayer perceptron is used to compress the feature values of all spatial anchor points of each multi-channel situation slice in the store situation sequence set into a fixed-dimensional slice embedding vector. At the same time, each spatial anchor point is encoded at each time step to obtain the anchor point local state embedding. A Transformer field decoder composed of multiple layers of cross-attention blocks is introduced. The spatiotemporal coordinates to be queried and their learnable coordinate codes are used as query vectors. Temporal attention aggregation is performed on the slice embedding vectors in sequence to obtain the semantics of the overall store operation rhythm. Spatial attention aggregation is performed on the local state embedding of anchor points to retrieve the local detail signals of neighboring anchor points. After multi-layer cross-attention operation by the Transformer field decoder, the running pressure scalar and implicit state embedding vector of the input spatiotemporal coordinates are obtained. 
The running pressure comprehensively represents the queuing pressure, food delivery congestion, equipment malfunction risk and inventory depletion tendency at the coordinate location. Uniform sampling is performed on the four-dimensional spatiotemporal grid of the store. The spatiotemporal coordinates of each sampling point are sequentially input into the Transformer field decoder to obtain the pressure distribution field covering the entire store and the implicit state embedding. The operating pressure of all sampling points is normalized to form spatial attention weights. The implicit state embeddings are weighted and summed, and after being grouped according to preset functional areas, cross-area attention integration is performed to obtain the global situation representation vector.
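The final pooling step can be sketched as follows. The specification states only that the operating pressures are "normalized" into spatial attention weights and that the grouped results undergo cross-area attention integration. In this sketch a softmax is assumed for the normalization, and the cross-area attention is replaced by a much simpler stand-in, a second softmax over each area's weighted mean pressure. The function names and the sample data are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pool_global_vector(samples):
    """samples: list of (area, pressure, embedding), one per sampling point.
    Returns one pooled vector for the whole store."""
    by_area = {}
    for area, pressure, emb in samples:
        by_area.setdefault(area, []).append((pressure, emb))

    area_vecs, area_pressure = [], []
    for area in sorted(by_area):
        pressures = [p for p, _ in by_area[area]]
        embs = [e for _, e in by_area[area]]
        weights = softmax(pressures)  # assumed form of the normalization
        dim = len(embs[0])
        area_vecs.append([sum(w * e[d] for w, e in zip(weights, embs))
                          for d in range(dim)])
        area_pressure.append(sum(w * p for w, p in zip(weights, pressures)))

    # Stand-in for cross-area attention: weight areas by their pooled pressure.
    area_w = softmax(area_pressure)
    dim = len(area_vecs[0])
    return [sum(w * v[d] for w, v in zip(area_w, area_vecs)) for d in range(dim)]


samples = [
    ("front_hall", 0.2, [1.0, 0.0]),
    ("front_hall", 0.4, [0.5, 0.5]),
    ("kitchen", 2.0, [0.0, 1.0]),
]
global_vec = pool_global_vector(samples)
```

Because both stages are convex combinations, sampling points under high pressure pull the pooled vector toward their own embeddings, which is the intent of using pressure as the attention weight.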
[0008] In this scheme, the operational actions are represented as structured parameter tuples containing action type, target, and action amplitude. A conditional diffusion probability model is constructed. In the forward process, Gaussian noise is progressively added to the real action samples. In the reverse process, the global situational representation vector is used as a condition, and the random noise is iteratively denoised using a U-Net denoising network to generate multiple candidate action tuples. Specifically, this includes: Store operation actions are uniformly represented as structured parameter tuples containing action type, target, and action amplitude fields. After embedding table mapping and concatenation, they are normalized to a fixed interval to form an action parameter vector. The action type field covers personnel dispatch, order to expedite food preparation, inventory replenishment, equipment inspection, and reassurance and discounts. The target field identifies the location where the action is applied by spatial anchor point number, table number, or equipment number. The action amplitude field carries a continuous value that matches the action type. Extract executed and adopted operational action records from historical operational archives, encode the action type, target, and action range of each record into a real action sample vector according to the structured parameter tuple format, and associate it with the store status sequence snapshot at the time of action execution and the corresponding global status representation to form a real action sample set; A conditional diffusion probability model is constructed and the real action sample set is used as input for model training. 
During the forward noise addition process, for each action parameter vector in the real action sample set, the diffusion time step is randomly sampled and Gaussian noise of corresponding intensity is injected to finally obtain the action vector contaminated by noise. In the reverse denoising process, the U-Net denoising network with encoder-decoder architecture is used to downsample the input noise-contaminated action vector and extract multi-layer features. The decoder gradually restores the dimension through upsampling and transmits fine-grained information of each layer of the encoder through skip connections. Using the global situational representation vector as a conditional signal, the conditional signal is injected into the intermediate feature layer through a cross-attention mechanism at the feature scale of each downsampling and upsampling stage of the U-Net denoising network, so that the denoising network uses the current operating status of the store as a reference when predicting noise components. After the model training converges, a pure random noise vector with the same dimension as the action parameter vector is sampled from the standard Gaussian distribution. Using the current store's global situation representation vector as a condition, the U-Net denoising network, which has been trained and converged, gradually denoises from the maximum time step along the inverse diffusion time step. The denoising process is executed in parallel from multiple different initial noises to generate multiple candidate action parameter vectors, which are then decoded in the parameter space to restore them into candidate action tuples.
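To make the forward noise addition concrete, the sketch below uses the closed-form noising rule of a standard denoising diffusion model, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The linear beta schedule, the number of diffusion steps, and the example action vector are assumptions; the specification does not name a schedule. The U-Net itself is not shown. The helper `recover_x0` only demonstrates that an exact noise estimate would invert the forward step, which is the reason the network is trained to predict the noise component.

```python
import math
import random


def make_schedule(n_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * t / (n_steps - 1)
             for t in range(n_steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Forward process: jump directly from x0 to the noisy vector at step t."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bars[t])
    s = math.sqrt(1.0 - alpha_bars[t])
    return [a * x + s * e for x, e in zip(x0, eps)], eps


def recover_x0(xt, eps, t, alpha_bars):
    """Invert the forward step given the noise (what a perfect denoiser implies)."""
    a = math.sqrt(alpha_bars[t])
    s = math.sqrt(1.0 - alpha_bars[t])
    return [(x - s * e) / a for x, e in zip(xt, eps)]


rng = random.Random(0)
betas, alpha_bars = make_schedule(1000)
x0 = [0.2, -0.5, 0.9]  # hypothetical normalized action parameter vector
t = 500                # in training, t would be sampled at random per sample
xt, eps = add_noise(x0, t, alpha_bars, rng)
x0_hat = recover_x0(xt, eps, t, alpha_bars)
```

In training, the pair (`xt`, `t`) together with the global situation representation vector would be the network input, and `eps` the regression target.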
[0009] In this solution, constructing the causal graph structure with store operation elements as nodes, fitting the causal path coefficients between nodes, setting the initial values of exogenous variables from the global situation representation vector, using the candidate actions as intervention parameters for counterfactual deduction and priority ranking of action impact, and generating the candidate action ranking table with single-step causal deduction conclusions specifically includes:
Queuing time, table turnover rate, the food preparation congestion index of each stall, the consumption rate of key materials, the equipment failure risk index, and customer complaint tendency serve as endogenous nodes; in-store customer flow intensity, weather conditions, business district activity tags, and time attributes serve as exogenous nodes. Directed edges are established between nodes according to the causal relationships of store operation, forming the causal graph structure.
Using observation data from normal store operation together with historical intervention correction data, the path coefficient of each directed edge is fitted by maximum likelihood estimation with Bayesian regularization, yielding a causal path coefficient matrix that characterizes the correlation strength and response elasticity between nodes under natural and intervention conditions.
For the current moment, the initial value of each exogenous node is computed from the current global situation representation vector and assigned to that node. At the same time, the current observed value of each endogenous node is recorded as the baseline state for counterfactual inference.
For each candidate action tuple, the endogenous node on which the action directly acts is identified from its action type and target. All natural incoming edges of that target node are cut, and the action amplitude field, after mapping and transformation, is assigned to the target node as the intervention quantity. The intervention effect is then propagated forward node by node along the directed edges in topological order: every endogenous node reachable from the intervention node is visited in turn, and for each directed edge the path equation is applied to the intervention-modified value of the upstream node, using the causal path coefficient, to obtain the counterfactual predicted value of the downstream node.
The counterfactual predicted values of the endogenous nodes are compared with the baseline state to extract the estimated effect indicators: the decrease in queuing time, the increase in table turnover rate, the degree of relief of food preparation congestion, the proportion by which inventory consumption is delayed, and the decrease in customer complaint tendency. Based on the target store's current operating period and status preferences, a preset weight factor table is queried to obtain the indicator weight factors. The estimated effect indicators are weighted and summed with these factors, and the estimated execution cost is subtracted as a utility deduction, giving the comprehensive utility score of the candidate action tuple.
After counterfactual deduction and utility calculation are complete for all candidate action tuples, they are sorted in descending order of comprehensive utility score. The multi-indicator predicted effect of each action and the node sequence of its causal propagation path are recorded together as the single-step causal deduction conclusion, forming the candidate action ranking table.
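The intervention and propagation steps can be sketched for the special case of a linear path model, where the counterfactual value of each node equals its baseline plus the change propagated from upstream nodes. The node names, the coefficients, the weights, and the cost are invented for illustration. The specification does not state that the path equations are linear, so that is an assumption of this sketch.

```python
from graphlib import TopologicalSorter

# Hypothetical path coefficients: EDGES[(parent, child)] = coefficient.
EDGES = {
    ("prep_congestion", "queue_time"): 0.6,
    ("queue_time", "complaint_tendency"): 0.5,
    ("prep_congestion", "complaint_tendency"): 0.2,
}


def counterfactual(baseline, edges, target, new_value):
    """Set `target` to `new_value` (its incoming edges are ignored) and
    propagate the change downstream in topological order."""
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
        parents.setdefault(parent, set())
    order = list(TopologicalSorter(parents).static_order())
    delta = {node: 0.0 for node in order}
    delta[target] = new_value - baseline[target]
    for node in order:
        if node == target:
            continue  # incoming edges of the intervened node are cut
        delta[node] = sum(edges[(p, node)] * delta[p] for p in parents[node])
    return {node: baseline[node] + delta[node] for node in order}


def utility(baseline, predicted, weights, cost):
    """Weighted sum of reductions minus execution cost. For indicators where
    higher is better, a negative weight would be used instead."""
    gain = sum(w * (baseline[n] - predicted[n]) for n, w in weights.items())
    return gain - cost


baseline = {"prep_congestion": 0.8, "queue_time": 20.0, "complaint_tendency": 0.3}
predicted = counterfactual(baseline, EDGES, "prep_congestion", 0.5)
score = utility(baseline, predicted,
                weights={"queue_time": 1.0, "complaint_tendency": 10.0}, cost=0.5)
```

Running the same two functions for every candidate action tuple and sorting by `score` in descending order would give a list analogous to the candidate action ranking table.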
[0010] In this solution, constructing the parallel digital twin sandbox of the target store, generating the candidate pool to be combined from the candidate action ranking table, deducing the store's future operating trajectory in the sandbox through Monte Carlo tree search, selecting the optimal action sequence by evaluating cumulative returns and robustness, and generating natural language explanations with a large language model and pushing them to the terminal specifically includes:
A parallel digital twin sandbox is constructed from three components: a physical layout mirror, a real-time synchronization interface for operating status, and an operating logic simulation rule base. The physical layout mirror reuses the spatial polyhedral grid and the position of each spatial anchor point. The real-time synchronization interface receives the global situation representation vector at each fixed time step and resolves it into a state snapshot of each functional area. The simulation rule base is built from the causal path coefficient matrix and the random parameters of historical customer behavior.
The candidate action ranking table is taken as input, and the candidate action tuples whose comprehensive utility scores rank within a preset cutoff are extracted to form the candidate pool to be combined. Mutually exclusive combinations of actions that share the same target are removed from the pool.
Monte Carlo tree search is performed with the sandbox's current synchronized state as the root node and the action tuples in the candidate pool as directed edges. In the expansion phase, new actions are sampled to create child nodes. In the simulation phase, the store's operating trajectory over future periods is deduced: the simulation rule base propagates the intervention effect according to the causal path coefficients and generates random customer behavior from the distributions of customer inter-arrival time and dining duration. The same action sequence is simulated multiple times with random perturbations applied to the customer behavior parameters, and the mean and standard deviation of the cumulative return across simulations are recorded. In the backpropagation phase, the cumulative return and robustness indicators are updated at each ancestor node along the search path.
After the preset number of search rounds, an action is selected from the direct child nodes of the root whose mean cumulative return and standard deviation both satisfy the preset screening rules, as the first step of the optimal sequence. The selection is repeated with that child node as the new root until the optimal action sequence is obtained.
The optimal action sequence, the single-step causal deduction conclusion of each action in it, and a summary of the overall effect from the sandbox deduction are integrated into a structured prompt context, which is input into a large language model to generate a natural language explanation. Finally, the action sequence is pushed to the management terminal and the regional terminals in a differentiated manner according to action type, completing the recommendation of store operation actions.
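The evaluation part of this step, repeated stochastic simulation of one action sequence followed by screening on mean and standard deviation, can be sketched as below. This is not a full Monte Carlo tree search: tree expansion and the update of ancestor nodes are omitted, and the toy queue model stands in for the sandbox's simulation rule base. The arrival rate, the service capacity, the return definition, and the screening threshold are all assumptions.

```python
import random
import statistics


def rollout(action_seq, rng, horizon=6):
    """Toy sandbox: the queue grows with random arrivals; each action value is
    the extra service capacity added in that time step."""
    queue, total_return = 5.0, 0.0
    for step in range(horizon):
        arrivals = rng.expovariate(1.0 / 3.0)  # mean of 3 parties per step (assumed)
        extra = action_seq[step] if step < len(action_seq) else 0.0
        queue = max(0.0, queue + arrivals - (2.0 + extra))
        total_return -= queue  # a shorter queue means a higher return
    return total_return


def evaluate(action_seq, n_sims=200, seed=0):
    """Mean and standard deviation of the cumulative return over repeated runs."""
    rng = random.Random(seed)
    returns = [rollout(action_seq, rng) for _ in range(n_sims)]
    return statistics.mean(returns), statistics.stdev(returns)


def select(candidates, max_std):
    """Keep candidates whose std passes the screen, then take the best mean."""
    scored = {name: evaluate(seq) for name, seq in candidates.items()}
    passed = {n: ms for n, ms in scored.items() if ms[1] <= max_std}
    pool = passed or scored  # fall back to all candidates if none pass
    return max(pool, key=lambda n: pool[n][0]), scored


candidates = {
    "no_action": [],
    "add_staff": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
}
best, scored = select(candidates, max_std=50.0)
```

Using the same seed for every candidate means each one faces the same simulated customer arrivals, so differences in mean return reflect the actions rather than sampling luck.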
[0011] A second aspect of the present invention provides a scene-aware store operation recommendation system, the system comprising: a memory, a processor, and a communication interface, wherein the memory contains a scene-aware store operation recommendation method program, and when the scene-aware store operation recommendation method program is executed by the processor, it implements the scene-aware store operation recommendation method steps as described in any of the preceding claims.
Attached Figure Description
[0012] To more clearly illustrate the technical solutions in the embodiments or examples of the present invention, the drawings used in the embodiments or examples will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained according to these drawings without creative effort.
[0013] Figure 1 is a first method flowchart of a scene-aware store operation recommendation method provided in an embodiment of the present invention;
Figure 2 is a second method flowchart of a scene-aware store operation recommendation method provided in an embodiment of the present invention;
Figure 3 is a block diagram of a scene-aware store operation recommendation system provided in an embodiment of the present invention.
The realization of the objectives, functional features, and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0014] To better understand the above-mentioned objectives, features, and advantages of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in these embodiments can be combined with each other.
[0015] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and therefore the scope of protection of the invention is not limited to the specific embodiments disclosed below.
[0016] Figure 1 is a first method flowchart of a scene-aware store operation recommendation method provided in an embodiment of the present invention. As shown in Figure 1, the method includes:
S102: Collect multimodal operational data, store equipment operation data, and store inventory change data of the target store; perform time windowing and gridding on the collected asynchronous multi-source data and convert it into multi-channel situational slices with fixed time steps to generate a store situation sequence set.
S104: Discretize the physical space of the store into a three-dimensional grid extended with a time dimension, with a learnable coordinate code attached to each grid point; use a Transformer field decoder with a spatiotemporal attention mechanism over the store situation sequence set to model the implicit state field of the store and extract the global situation representation vector.
S106: Represent operational actions as structured parameter tuples containing action type, target, and action amplitude; construct a conditional diffusion probability model in which the forward process gradually adds Gaussian noise to real action samples and the reverse process, conditioned on the global situation representation vector, iteratively denoises random noise through the U-Net denoising network to generate multiple candidate action tuples.
S108: Construct a causal graph structure with store operation elements as nodes and fit the causal path coefficients between nodes; set the initial values of exogenous variables from the global situation representation vector; use the candidate actions as intervention parameters to perform counterfactual deduction of action impact and priority ranking, generating a candidate action ranking table with single-step causal deduction conclusions.
S110: Construct a parallel digital twin sandbox of the target store; generate a candidate pool to be combined from the candidate action ranking table; deduce the store's future operating trajectory in the sandbox through Monte Carlo tree search; select the optimal action sequence by evaluating cumulative returns and robustness; generate natural language explanations with a large language model and push them to the terminal.
[0017] Furthermore, in a preferred embodiment of the present invention, the process of collecting multimodal operational data, store equipment operation data, and store inventory change data of the target store, performing time windowing and gridding processing on the collected asynchronous multi-source data, and converting it into multi-channel situational slices with fixed time steps to generate a store situational sequence set specifically includes: Based on the heterogeneous monitoring array of the store, the customer flow trajectory in the front hall and the table opening and checkout events are obtained. The order flow middleware obtains the delivery timestamp of the dishes. The equipment programmable logic controller obtains the equipment load and abnormal alarms. The gravity shelf and smart freezer obtain the material inventory change data to form a multi-source raw data stream. The multi-source raw data stream is aligned with network time protocol timestamps and corrected for jumps. Based on the installation location of each data source sensor in the store, spatial coordinate labels are added to each record, transforming the multi-source raw data stream into a standardized asynchronous record stream with dual time and space labels. Based on the store floor plan, a spatial polyhedron grid is pre-constructed covering the front dining area, kitchen stalls, food delivery aisles, and storage shelving areas. Each unit in the grid is defined as an independent spatial anchor point. The standardized asynchronous recording stream is mapped to the corresponding spatial anchor point according to the spatial coordinate label. The corresponding spatial anchor point is filled with the estimated remaining dining time of the table, the backlog of orders at the stall, the equipment health scalar and the material consumption trend value to form a spatially aligned multi-channel time-series signal stream. 
Adaptive window length gating is performed on the multi-channel time-series signal stream at each spatial anchor point. Narrow-window Gaussian weighted statistics are applied to fast-changing channels, wide-window smoothing is applied to slow-changing channels, and the frequency and duration of discrete event channels are statistically analyzed. The gating mechanism is used to selectively filter the original values within the window. All channel feature values of all spatial anchor points within the same time step are arranged into multi-channel situation slices for that time step. The channels contained in each slice correspond to the customer flow density field, the average remaining dining time per table, the length of the stall order queue, the confidence level of equipment anomalies, and the material consumption trend, respectively. They are arranged in order of time steps to form a store situation sequence set.
[0018] It's important to note that existing store data collection typically records signals from each dimension independently and sends them to the analysis module separately, ignoring the inherent spatial and temporal connections between different data sources. This makes it difficult for subsequent sensing stages to capture cross-regional chain effects such as food backlogs spreading to queues and inventory shortages affecting food preparation. To address this issue, this step establishes a standardized processing pipeline with a unified spatiotemporal coordinate system at the data source. Specifically, the acquisition of front-of-house customer flow trajectories and table events is accomplished collaboratively by depth cameras and queuing terminals deployed at the entrance and dining area. This not only records the number of people entering the store but also uses multi-target tracking to bind each group of customers' complete behavioral chain from taking a number, sitting down, to checking out and leaving to a specific table. In the kitchen, the order flow middleware doesn't simply record order completion times but captures multiple timestamps for each dish, from order placement to the start of preparation at each station, completion, and pickup by the server, thus accurately depicting the food preparation process for each dish and changes in station load. The equipment's programmable logic controller (PLC) transmits parameters such as stove current, steamer temperature, and exhaust fan vibration amplitude at millisecond-level frequencies. After comparison with preset thresholds, it generates a scalar value for equipment anomaly confidence, rather than a simple Boolean alarm. On the inventory side, gravity-fed shelves and smart freezers use weight change pulses to infer material retrieval events and combine this with food production records to calculate the real-time consumption rate of materials, rather than just recording the remaining quantity.
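The inventory-side inference can be illustrated with a small sketch that treats each sufficiently large drop in a shelf's weight reading as a material retrieval event and converts the total removed weight into a consumption rate. The reading format, the jitter threshold, and the units are assumptions, and the cross-check against food production records mentioned above is left out.

```python
def consumption_rate(readings, min_drop=0.05):
    """readings: time-ordered list of (timestamp_s, weight_kg).
    Returns (retrieval_events, kg_per_hour)."""
    events = []
    for (t0, w0), (t1, w1) in zip(readings, readings[1:]):
        drop = w0 - w1
        # Small drops are treated as sensor jitter; increases are restocking.
        if drop >= min_drop:
            events.append((t1, drop))
    elapsed_h = (readings[-1][0] - readings[0][0]) / 3600.0
    total = sum(d for _, d in events)
    return events, (total / elapsed_h if elapsed_h > 0 else 0.0)


# Hypothetical readings from one gravity shelf over 30 minutes.
readings = [(0, 10.0), (600, 9.5), (1200, 9.49), (1800, 9.0)]
events, rate = consumption_rate(readings)
```

The resulting rate, rather than the remaining weight alone, is the kind of value that would feed the material consumption trend channel.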
[0019] After time synchronization and spatial coordinate tagging, the aforementioned multi-source data is mapped to individual anchor points of a pre-constructed spatial polyhedral grid based on the store's floor plan. It is worth noting that this grid is not composed of uniformly sized rectangular units, but is flexibly divided according to the physical boundaries and operational density of different functional areas. For example, the grid unit density in the hot food stall area of the back kitchen is higher to match the fine granularity of equipment and food flow, while the grid units in the dining area are aligned with the table arrangement. An adaptive window length mechanism is applied to the multi-channel signals converging at each anchor point during the gridding stage. The core of this mechanism is that the gating unit continuously monitors the signal variance within each channel's window: when the variance within the window of the stall order queue channel, which reflects food backlog, increases due to a sudden order-expediting event and exceeds a preset threshold, the window length automatically shrinks so that smoothing does not mask the abrupt change; when the variance of the customer flow density channel decreases during stable periods, the window length gradually returns to the default width to suppress noise. For discrete event channels such as equipment alarms, the instantaneous value is replaced by the event frequency and the proportion of the statistical window during which the event is active, so that the severity of the anomaly is preserved rather than only a single trigger state. Each resulting multi-channel situational slice records five channels at the same time step: customer flow density, average remaining dining time per table, stall order queue length, equipment anomaly confidence, and material consumption trend. Differences in channel values between adjacent slices directly reflect the rate and direction of change in each operational dimension.
Using these multi-channel situational slices as a unified input format for subsequent implicit state field modeling ensures that the transformation of store operational status from raw multi-source flows to a structured tensor available for deep learning model consumption is fully traceable and has a clear physical meaning, avoiding the interpretability loss inherent in traditional black-box feature engineering.
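The adaptive window length gating can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the halving and one-step recovery rules, the threshold value, and all function names are hypothetical, and a single Gaussian-weighted mean stands in for the per-channel statistics.

```python
import math
from statistics import pvariance

def gaussian_weighted_mean(values, sigma):
    """Weighted mean whose Gaussian weights favour the most recent samples."""
    n = len(values)
    weights = [math.exp(-((n - 1 - i) ** 2) / (2 * sigma ** 2)) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def adaptive_window_features(signal, default_len=8, min_len=2, var_threshold=4.0):
    """Return one (feature, window_length) pair per time step of a channel.

    Assumed gating rule: halve the window when the in-window variance
    exceeds the threshold, otherwise grow it back by one step per tick.
    """
    window_len = default_len
    features = []
    for t in range(len(signal)):
        probe = signal[max(0, t - window_len + 1): t + 1]
        variance = pvariance(probe) if len(probe) > 1 else 0.0
        if variance > var_threshold:
            window_len = max(min_len, window_len // 2)
        else:
            window_len = min(default_len, window_len + 1)
        window = signal[max(0, t - window_len + 1): t + 1]
        features.append((gaussian_weighted_mean(window, window_len / 2), window_len))
    return features

def event_channel_features(flags, window_len):
    """For a 0/1 event channel, return (event onsets, share of steps active)."""
    window = flags[-window_len:]
    onsets = sum(1 for i, f in enumerate(window) if f and (i == 0 or not window[i - 1]))
    return onsets, sum(window) / len(window)
```

With a step change in the input, the window contracts within two ticks so the feature tracks the new level, then widens again once the signal is stable.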
[0020] Furthermore, in a preferred embodiment of the present invention, the step of discretizing the physical space of the store into a three-dimensional grid with a time dimension and adding learnable coordinate encoding, employing a Transformer field decoder, and combining the store situation sequence set with a spatiotemporal attention mechanism to perform implicit state field modeling of the store and global situation representation extraction to obtain a global situation representation vector specifically includes: Based on the store floor plan and floor height parameters, the physical space of the target store is discretized into a three-dimensional spatial grid covering the front hall, back kitchen, food delivery area and storage area, and expanded into a four-dimensional spatiotemporal grid framework along the time axis. Each grid point is assigned a learnable coordinate encoding vector composed of spatial location encoding and temporal location encoding. The multilayer perceptron is used to compress the feature values of all spatial anchor points of each multi-channel situation slice in the store situation sequence set into a fixed-dimensional slice embedding vector. At the same time, each spatial anchor point is encoded at each time step to obtain the anchor point local state embedding. A Transformer field decoder composed of multiple layers of cross-attention blocks is introduced. The spatiotemporal coordinates to be queried and their learnable coordinate codes are used as query vectors. Temporal attention aggregation is performed on the slice embedding vectors in sequence to obtain the semantics of the overall store operation rhythm. Spatial attention aggregation is performed on the local state embedding of anchor points to retrieve the local detail signals of neighboring anchor points. 
After multi-layer cross-attention operation by the Transformer field decoder, the running pressure scalar and implicit state embedding vector of the input spatiotemporal coordinates are obtained. The running pressure comprehensively represents the queuing pressure, food delivery congestion, equipment malfunction risk and inventory depletion tendency at the coordinate location. Uniform sampling is performed on the four-dimensional spatiotemporal grid of the store. The spatiotemporal coordinates of each sampling point are sequentially input into the Transformer field decoder to obtain the pressure distribution field covering the entire store and the implicit state embedding. The operating pressure of all sampling points is normalized to form spatial attention weights. The implicit state embeddings are weighted and summed, and after being grouped according to preset functional areas, cross-area attention integration is performed to obtain the global situation representation vector.
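The alternation of temporal and spatial cross-attention for a single queried coordinate can be sketched as below. This is a simplified illustration: it uses one attention head, no learned projection matrices, and plain residual additions, all of which are assumptions; the names `cross_attention` and `field_decoder_layer` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query vector over a set of keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

def field_decoder_layer(coord_code, slice_embeddings, anchor_embeddings):
    """One decoder layer for a queried spatiotemporal coordinate.

    Step 1 attends over slice embeddings (store-wide rhythm over time).
    Step 2 attends over anchor local state embeddings (nearby detail).
    """
    temporal, _ = cross_attention(coord_code, slice_embeddings, slice_embeddings)
    query = [c + t for c, t in zip(coord_code, temporal)]  # residual connection
    spatial, anchor_weights = cross_attention(query, anchor_embeddings, anchor_embeddings)
    state = [q + s for q, s in zip(query, spatial)]
    return state, anchor_weights
```

In a full decoder, several such layers would be stacked and followed by output heads for the pressure scalar and the implicit state embedding.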
[0021] It's important to note that traditional store status assessment methods typically monitor operational indicators for each area as independent scalars, such as focusing only on whether queue length or food preparation time exceeds limits. This fragmented assessment fails to capture the cross-regional transmission effects of anomalies. For instance, when delays in hot food preparation at the kitchen counter overlap with concentrated ordering in the dining area, the queuing pressure in the front of house is not simply the sum of these two values, but rather generates a non-linear amplification of congestion. To address this issue, this step introduces the concept of an implicit state field, modeling the overall store operation as a continuous four-dimensional spatiotemporal function. This function accepts queries for arbitrary spatial locations and time coordinates, returning the comprehensive abnormal pressure value at that point.
[0022] Specifically, based on the store's floor plan and ceiling height parameters, the front-of-house, kitchen, food delivery area, and storage area are discretized into a unified three-dimensional spatial grid. The front-of-house grid units are aligned with the table distribution, while the kitchen grid units are densely divided around each food stall's equipment. A time axis is then extended along a fixed time step direction on top of the three-dimensional spatial grid, forming a four-dimensional spatiotemporal grid framework. Each grid point is assigned a set of learnable coordinate encoding vectors, which are composed of spatial and temporal location encodings. The spatial location encoding captures the grid point's functional area attributes and adjacency relationships within the store, while the temporal location encoding captures the relative position of that time step within the business cycle. After constructing slice embeddings and anchor point local state embeddings, the Transformer field decoder performs cross-attention queries. Taking the query of a specific spatiotemporal coordinate near a hot food stall in the kitchen as an example, the query first interacts with the slice embeddings of each time step to capture the macro-level operational rhythm of the store at the current moment, indicating whether the store is experiencing peak or off-peak customer traffic. Then, it interacts with the local state embeddings of spatially adjacent anchor points to focus on retrieving local details such as current order backlog, food preparation time, and the load of adjacent food delivery areas. After multiple layers of cross-attention are alternately superimposed, the field decoder outputs a scalar of operational pressure and an implicit state embedding for that point. 
The pressure value is a weighted composite of queuing pressure, food preparation congestion, equipment malfunction risk, and inventory depletion tendency, while the implicit state embedding is the encoding of the point's full-dimensional anomaly features. When the entire store space is uniformly sampled and integrated through pressure-weighted pooling and inter-regional interactions, the resulting global situational representation vector not only reflects the overall health of the store's current operation but also automatically highlights the spatial distribution of high-anomaly areas and their interrelationships. This provides a compact summary of the store's overall operational status, offering high-resolution spatial semantic guidance for subsequent action generation and enabling more precise targeting of actions.
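The pressure-weighted pooling and cross-area integration can be illustrated with a short sketch. It makes several simplifying assumptions: pressures are normalized with a softmax inside each functional area, and a second softmax over mean area pressure stands in for the cross-area attention step. The function names and the sample layout are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    """Element-wise weighted sum of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def global_situation_vector(samples):
    """samples: list of (functional_area, pressure, embedding) tuples.

    Returns the pooled vector and the weight given to each area.
    """
    by_area = {}
    for area, pressure, embedding in samples:
        by_area.setdefault(area, []).append((pressure, embedding))
    names, area_vectors, area_pressures = [], [], []
    for area, items in by_area.items():
        pressures = [p for p, _ in items]
        # High-pressure sampling points dominate their area's summary.
        area_vectors.append(weighted_sum(softmax(pressures), [e for _, e in items]))
        area_pressures.append(sum(pressures) / len(pressures))
        names.append(area)
    area_weights = softmax(area_pressures)  # stand-in for cross-area attention
    return weighted_sum(area_weights, area_vectors), dict(zip(names, area_weights))
```

Because the weights come from the pressure field itself, areas under stress contribute more to the pooled vector, which matches the stated goal of highlighting high-anomaly regions.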
[0023] Furthermore, in a preferred embodiment of the present invention, the step of representing the operational action as a structured parameter tuple containing action type, target, and action amplitude, constructing a conditional diffusion probability model, progressively adding Gaussian noise to the real action samples in the forward process, and using the global situation representation vector as a condition, iteratively denoising from random noise through a U-Net denoising network to generate multiple candidate action tuples specifically includes: Store operation actions are uniformly represented as structured parameter tuples containing action type, target, and action amplitude fields. The fields are mapped through embedding tables, concatenated, and normalized to a fixed interval to form an action parameter vector. The action type field covers personnel dispatch, order expediting, inventory replenishment, equipment inspection, and customer appeasement discounts. The target field identifies the location where the action is applied by spatial anchor point number, table number, or equipment number. The action amplitude field carries a continuous value that matches the action type. Operational action records that were executed and adopted are extracted from historical operational archives; the action type, target, and action amplitude of each record are encoded into a real action sample vector according to the structured parameter tuple format, and each vector is associated with the store situation sequence snapshot at the time of action execution and the corresponding global situation representation vector to form a real action sample set. A conditional diffusion probability model is constructed, and the real action sample set is used as input for model training.
During the forward noise addition process, for each action parameter vector in the real action sample set, the diffusion time step is randomly sampled and Gaussian noise of corresponding intensity is injected to finally obtain the action vector contaminated by noise. In the reverse denoising process, the U-Net denoising network with encoder-decoder architecture is used to downsample the input noise-contaminated action vector and extract multi-layer features. The decoder gradually restores the dimension through upsampling and transmits fine-grained information of each layer of the encoder through skip connections. Using the global situational representation vector as a conditional signal, the conditional signal is injected into the intermediate feature layer through a cross-attention mechanism at the feature scale of each downsampling and upsampling stage of the U-Net denoising network, so that the denoising network uses the current operating status of the store as a reference when predicting noise components. After the model training converges, a pure random noise vector with the same dimension as the action parameter vector is sampled from the standard Gaussian distribution. Using the current store's global situation representation vector as a condition, the U-Net denoising network, which has been trained and converged, gradually denoises from the maximum time step along the inverse diffusion time step. The denoising process is executed in parallel from multiple different initial noises to generate multiple candidate action parameter vectors, which are then decoded in the parameter space to restore them into candidate action tuples.
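The structured parameter tuple and its mapping to and from an action parameter vector can be sketched as follows. One-hot codes are used here as a stand-in for the learned embedding tables, and the identifiers (`ActionTuple`, `encode`, `decode`, the English type names) are hypothetical; amplitudes are assumed to be pre-scaled to [0, 1].

```python
from dataclasses import dataclass

ACTION_TYPES = [
    "personnel_dispatch",
    "order_expediting",
    "inventory_replenishment",
    "equipment_inspection",
    "appeasement_discount",
]

@dataclass(frozen=True)
class ActionTuple:
    action_type: str   # one of ACTION_TYPES
    target: str        # anchor, table, or equipment identifier
    amplitude: float   # continuous value, assumed scaled to [0, 1]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def encode(action, targets):
    """Concatenate type code, target code, and amplitude into one vector."""
    return (one_hot(ACTION_TYPES.index(action.action_type), len(ACTION_TYPES))
            + one_hot(targets.index(action.target), len(targets))
            + [action.amplitude])

def decode(vector, targets):
    """Recover a tuple from a (possibly noisy) vector by nearest code."""
    n = len(ACTION_TYPES)
    type_part = vector[:n]
    target_part = vector[n:n + len(targets)]
    type_index = max(range(n), key=type_part.__getitem__)
    target_index = max(range(len(targets)), key=target_part.__getitem__)
    amplitude = min(1.0, max(0.0, vector[-1]))
    return ActionTuple(ACTION_TYPES[type_index], targets[target_index], amplitude)
```

Decoding by nearest code is what allows a denoised continuous vector to be turned back into a discrete type and target plus a continuous amplitude.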
[0024] It's important to note that traditional store action recommendations typically rely on predefined rule bases or classification models, selecting suggestions from a fixed set of actions. Such methods are limited by the enumeration granularity of the action base, making it difficult to cover the flexible measures that naturally arise in complex business scenarios. For example, when kitchen equipment malfunctions coincide with peak customer traffic, the rule base may not contain combined actions that simultaneously address inspection instructions and adjustments to food delivery staff. To address this issue, this step introduces a conditional diffusion probability model, transforming operational actions from discrete category selection into a generation problem in a continuous parameter space. This allows the system to adaptively synthesize precise action tuples that fit the current store situation.
[0025] Specifically, the various operational actions that a store might perform are first abstracted into structured parameter tuples. The action type field covers five basic categories: personnel dispatch, order expediting, inventory replenishment, equipment inspection, and customer appeasement discounts. The target field uses spatial anchor numbers, table numbers, or equipment numbers to precisely identify the physical location where the action is applied. The action amplitude field carries a continuous value matched to the action type. After mapping through their respective embedding tables, the fields are concatenated into a continuous vector and normalized to the same numerical range, forming the action parameter vector. Training data comes from records, in the store's historical operational archives, of operational actions that were executed and adopted by the store manager. Each record contains not only the structured encoding of the action itself but also a snapshot of the store situation sequence at the time of action execution and the corresponding global situation representation vector, forming a state-action pair. After the conditional diffusion probability model is constructed, the forward process randomly samples a diffusion time step for each real action sample vector and injects Gaussian noise of the intensity given by the noise schedule parameters to obtain noise-contaminated action vectors. The reverse denoising process is handled by the U-Net denoising network. The encoder of this network progressively downsamples and extracts multi-layer features, while the decoder progressively recovers the original dimensions through upsampling. Fine-grained information from each layer of the encoder is passed to the decoder through skip connections.
The global situation representation vector serves as a conditional signal, injected into the intermediate feature layer at each feature scale of the U-Net via a cross-attention mechanism, so that the network continuously references the current operational status of the store when predicting noise components. After training converges, pure random noise vectors are sampled from a standard Gaussian distribution and progressively denoised by the U-Net along the reverse diffusion time steps, generating multiple candidate action parameter vectors in parallel. These are then decoded in the parameter space to reconstruct candidate action tuples. Replacing traditional discrete action enumeration with structured parameter tuples expands the search space for action suggestions from a finite set to a continuous parameter space. This enables the generation of context-appropriate composite actions not found in the preset action library, and conditioning on the store situation avoids the disconnect between actions and scene that arises with unconditioned random sampling.
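The forward noising and reverse denoising loops can be sketched with the standard denoising diffusion formulation. The linear noise schedule, the step count, and all names are assumptions, since the text does not specify them. The trained U-Net is replaced by a hypothetical `toy_noise_predictor` that simply pretends the clean action equals the condition vector, which is enough to exercise the sampling loop.

```python
import math
import random

def make_schedule(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption) and cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return betas, alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    """Closed-form forward process: x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def toy_noise_predictor(alpha_bars):
    """Stand-in for the conditioned U-Net; NOT a learned model."""
    def predict(xt, t, condition):
        a = alpha_bars[t]
        return [(x - math.sqrt(a) * c) / math.sqrt(1.0 - a)
                for x, c in zip(xt, condition)]
    return predict

def reverse_sample(predict_noise, condition, dim, betas, alpha_bars, rng):
    """Ancestral sampling from pure noise down to step 0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t, condition)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * e) / math.sqrt(1.0 - betas[t])
                for xi, e in zip(x, eps_hat)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x
```

Calling `reverse_sample` several times with differently seeded generators corresponds to denoising in parallel from multiple initial noises to obtain multiple candidate vectors.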
[0026] Furthermore, in a preferred embodiment of the present invention, the step of constructing a parallel digital twin sandbox of the target store, generating a candidate pool to be combined based on the candidate action ranking table, inferring the store's operational trajectory at future moments through Monte Carlo tree search in the parallel digital twin sandbox, selecting the optimal action sequence by evaluating cumulative returns and robustness, and generating natural language explanations using a large language model and pushing them to the terminal specifically includes: A parallel digital twin sandbox is constructed, consisting of three components: a physical layout mirror, a real-time operating status synchronization interface, and an operating logic simulation rule base. The candidate action ranking table is input, and the candidate action tuples whose comprehensive utility scores rank within a preset top range are extracted to form a candidate pool to be combined. Mutually exclusive action combinations with the same action target in the candidate pool are removed. The physical layout mirror reuses the spatial polyhedral grid and the position information of each spatial anchor point. The real-time operating status synchronization interface receives the global situation representation vector at each fixed time step and decodes it into a state snapshot of each functional area. The operating logic simulation rule base is built based on the causal path coefficient matrix and the random parameters of historical customer behavior. Using the current real-time synchronized state of the sandbox as the root node and the action tuples in the candidate pool to be combined as directed edges, a Monte Carlo tree search is performed. In the expansion phase, new actions are sampled to create child nodes. In the simulation phase, the store's operating trajectory in future periods is deduced.
In the backpropagation phase, the cumulative return and robustness indicators are updated to each ancestor node along the search path. During the simulation phase, the simulation rule base propagates the intervention effect based on the causal path coefficients and generates random customer behavior based on the distributions of customer arrival time intervals and dining duration. The same action sequence is simulated multiple times with random perturbations applied to the random parameters of customer behavior, and the mean and standard deviation of the cumulative return across the simulations are recorded. After the preset number of search rounds is completed, the action of the direct child node of the root node whose mean cumulative return and standard deviation both meet the preset screening rules is selected as the first step of the optimal sequence, and the selection is repeated with this child node as the new root node to finally obtain the optimal action sequence. The optimal action sequence, the single-step causal deduction conclusion of each corresponding action, and the summary of the comprehensive effect of the sandbox deduction are integrated into a structured prompt context. This context is then input into a large language model to generate a natural language explanation. Finally, the action sequence is sent to the management terminal and regional terminals in a differentiated manner according to the action type to realize the recommendation of store operation actions.
[0027] It's important to note that causal deduction of a single action can only assess the independent effect of that action. However, in actual store operations, multiple actions often need to be coordinated in sequence to achieve optimal control. For example, when peak customer traffic coincides with kitchen equipment malfunctions, simply adding more staff may be ineffective due to unresolved equipment bottlenecks, while simply conducting equipment inspections cannot alleviate the current queuing pressure. Only by first triggering equipment inspections to restore food production capacity, and then adding more food delivery staff to accelerate food delivery, can congestion be resolved synergistically from both the root cause and transmission levels. To automatically discover such optimal action combinations, this step constructs a parallel digital twin sandbox synchronized in real time with the target store, and performs accelerated deduction and search of temporal action sequences on it.
[0028] The simulation platform consists of three core components. The physical layout mirror directly reuses the constructed spatial polyhedral mesh and anchor point location information to recreate the complete physical topology of the front dining area, kitchen stalls, food delivery aisles, and storage areas. The real-time operational status synchronization interface receives a global situational representation vector at each fixed time step, calculating it into snapshots of customer flow density, table occupancy stages, stall order queue lengths, material consumption progress, and equipment health in each functional area, ensuring precise alignment between the simulation's starting point and the actual store's current state. The operational logic simulation rule base reuses the fitted causal path coefficient matrix as the basis for the propagation of intervention effects. It also introduces stochastic process parameters derived from historical operational data, such as customer arrival time interval distribution, order quantity distribution, and dining duration distribution, driving the probabilistic generation of customer behavior during the simulation. This ensures that the simulated future trajectory retains realistic randomness while conforming to the causal laws of store operations.
[0029] Monte Carlo tree search starts with the current state of the sandbox as the root node and gradually expands the search tree of action combinations through multiple iterations. In the selection phase of each iteration, the search descends along the already explored path, balancing cumulative reward against exploration. In the expansion phase, new actions are sampled from the candidate pool to create child nodes. In the simulation phase, the sandbox rapidly extrapolates the future store operation trajectory starting from this node. The extrapolation is repeated with controlled perturbations applied to the random parameters of customer behavior, and the mean and standard deviation of the cumulative reward across the repeated extrapolations are recorded to measure the action sequence's expected return and its robustness to random fluctuations. In the backpropagation phase, the evaluation results are updated to ancestor nodes along the path, guiding the search towards a more robust sequence. The output optimal action sequence is accompanied by its causal deduction conclusions, a quantitative effect summary from the sandbox extrapolation, and a natural language explanation, enabling store managers to execute coordinated actions efficiently while understanding the basis for the decision, which reduces trial-and-error costs and communication delays when multiple actions are concurrent.
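The four search phases (selection, expansion, simulation, and backing results up to ancestors) can be sketched on a toy sandbox. Everything specific here is an assumption made for illustration: the three action names, the pressure dynamics in `simulate`, the UCB constant, and the final screening rule (mean return minus a multiple of the standard deviation). The toy dynamics encode the ordering effect described in the text, where adding a runner helps much more once equipment has been inspected.

```python
import math
import random
from statistics import mean, pstdev

class Node:
    def __init__(self, actions, parent=None):
        self.actions = actions   # actions applied so far, in order
        self.parent = parent
        self.children = {}       # action name -> Node
        self.returns = []        # cumulative returns of rollouts through this node

    def ucb(self, c=1.4):
        explore = math.sqrt(math.log(len(self.parent.returns)) / len(self.returns))
        return mean(self.returns) + c * explore

def simulate(actions, rng):
    """Toy sandbox: return the negative pressure accumulated over the sequence."""
    pressure, restored, total = 10.0, False, 0.0
    for action in actions:
        if action == "inspect_equipment":
            restored, pressure = True, pressure - 2.0
        elif action == "add_runner":
            pressure -= 4.0 if restored else 0.5
        elif action == "discount":
            pressure -= 1.0
        pressure += rng.gauss(0.0, 0.3)   # perturbed customer behaviour
        total -= max(pressure, 0.0)
    return total

def search(pool, depth=2, iterations=400, risk_aversion=1.0, seed=0):
    rng = random.Random(seed)
    depth = min(depth, len(pool))
    root = Node(())
    for _ in range(iterations):
        node = root
        # Selection: descend while every available action already has a child.
        while (len(node.actions) < depth
               and len(node.children) == len(pool) - len(node.actions)):
            node = max(node.children.values(), key=Node.ucb)
        # Expansion: add one untried action below the selected node.
        if len(node.actions) < depth:
            untried = [a for a in pool
                       if a not in node.actions and a not in node.children]
            action = rng.choice(untried)
            node.children[action] = Node(node.actions + (action,), node)
            node = node.children[action]
        # Simulation: complete the sequence at random and run the sandbox.
        rest = [a for a in pool if a not in node.actions]
        rng.shuffle(rest)
        rollout = node.actions + tuple(rest[: depth - len(node.actions)])
        reward = simulate(rollout, rng)
        # Backup: record the return on the node and all of its ancestors.
        while node is not None:
            node.returns.append(reward)
            node = node.parent
    # Extraction: prefer a high mean return and a low spread at every level.
    sequence, node = [], root
    while node.children:
        action, node = max(
            node.children.items(),
            key=lambda kv: mean(kv[1].returns) - risk_aversion * pstdev(kv[1].returns))
        sequence.append(action)
    return sequence
```

In this sketch each visit contributes one perturbed rollout, so the repeated simulations of a sequence accumulate in the `returns` list of the corresponding node.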
[0030] Figure 2 is a second method flowchart of the scene-aware store operation action recommendation method provided in an embodiment of the present invention. As shown in Figure 2, the method includes:
S202. Queuing time, table turnover rate, food preparation congestion index of each stall, inventory consumption rate of key materials, equipment failure risk index, and customer complaint tendency are used as endogenous nodes, and in-store customer flow intensity, weather conditions, business district activity tags, and time attributes are used as exogenous nodes. Directed edges are established between nodes based on the causal relationships of store operation to form a causal graph structure.
S204. Using observation data from normal store operation and historical intervention correction data, the path coefficients of each causal directed edge in the causal graph structure are fitted by maximum likelihood estimation with Bayesian regularization to obtain the causal path coefficient matrix, which characterizes the correlation strength and response elasticity between nodes under natural and intervention conditions.
S206. For the current moment, the initial value of each exogenous node is calculated from the global situation representation vector at the current moment and assigned to the corresponding exogenous node. At the same time, the current actual observed values of each endogenous node are recorded as the baseline state for counterfactual inference.
S208. For each candidate action tuple, the endogenous node on which it directly acts in the causal graph structure is determined from its action type and target; all natural incoming edges of that target endogenous node are cut off, and the action amplitude field, after mapping and transformation, is assigned to the target endogenous node as the intervention value.
S210. The intervention effect is propagated forward node by node along the causal directed edges in topological order, traversing all endogenous nodes in the causal graph structure that can be reached from the intervention node through directed edges. For each directed edge, the causal path coefficient is used to perform the path equation propagation operation on the new value of the preceding node as modified by the intervention, yielding the counterfactual prediction value of each endogenous node.
S212. The counterfactual prediction values of each endogenous node are compared with the baseline state, and the decrease in queuing time, the increase in table turnover rate, the degree of relief of food preparation congestion, the proportion by which inventory consumption is delayed, and the decrease in customer complaint tendency are extracted as estimated effect indicators.
S214. Based on the target store's current operating period and status preference, the preset weight factor table is queried to obtain the indicator weight factors. The indicator weight factors are used to perform a weighted summation of the estimated effect indicators, and the estimated execution cost is introduced as a utility deduction item, finally giving the comprehensive utility score of the target candidate action tuple.
S216. After all candidate action tuples have completed counterfactual deduction and comprehensive utility calculation, they are sorted in descending order of comprehensive utility score, and the multi-indicator estimated effects of each action and the node sequence of its causal propagation path are recorded together as the single-step causal deduction conclusion, forming a candidate action ranking table.
[0031] It's important to note that traditional store recommendation systems are mostly based on association rules or classification models, learning only empirical mappings of "what actions store managers typically take when a certain metric is abnormal." Such methods cannot distinguish between causal relationships and statistical correlations. For example, historically, when food preparation was delayed, store managers often simultaneously increased staff and issued expedited order requests. However, association models cannot identify whether increasing staff actually alleviated the delay or whether the improvement was merely due to the accompanying expedited order requests. This confusion leads to potentially ineffective or even counterproductive over-interventionist recommendations. To address this issue, this step introduces causal structural equation modeling and counterfactual reasoning mechanisms, enabling the system to quantitatively answer the causal question: "If this action is performed, how will the operational metrics change?"
[0032] Specifically, a causal graph structure is first constructed based on the physical causal relationships of store operations. Endogenous nodes encompass queue waiting time, table turnover rate, food preparation congestion index for each stall, inventory consumption rate of key materials, equipment failure risk index, and customer complaint tendency. These nodes are connected by directed edges based on operational mechanisms. For example, the food preparation congestion index node points to the queue waiting time node, depicting the transmission effect of food preparation delays on the queuing experience; the equipment failure risk index node points to the food preparation congestion index node, depicting the constraint of equipment malfunctions on food preparation capacity. Exogenous nodes encompass in-store customer traffic intensity, weather conditions, commercial district activity tags, and time period attributes. These nodes are not affected by internal store factors and only serve as the starting drivers of the causal chain. The path coefficients of each directed edge are not set based on experience but are derived using long-term observation data of the store under normal operating conditions and changes recorded before and after the implementation of specific intervention measures. This is achieved through joint fitting using maximum likelihood estimation and Bayesian regularization, ensuring that the coefficient matrix reflects both the correlation strength between nodes under natural operating conditions and the responsiveness of nodes to external intervention under intervention conditions. In the counterfactual reasoning phase, for each candidate action tuple, the endogenous nodes directly affected are analyzed. All natural incoming edges that the node originally received influence from other nodes are severed, and the action magnitude is mapped and directly assigned to that node. 
Subsequently, the intervention effect is propagated forward node by node along the topological order of the causal graph. The change magnitude of each subsequent node is calculated using the fitted path coefficients and compared node by node with the uninterrupted baseline state to extract estimated effect indicators such as the decrease in queuing time and the increase in table turnover rate. Each indicator is weighted according to the preference weight of the current operating period and the estimated execution cost of the action is deducted to obtain a comprehensive utility score. After all candidate action tuples have completed the above reasoning, they are sorted in descending order of comprehensive utility score. The multi-indicator estimated effects of each action are retained along with the causal propagation path, forming a candidate action ranking table with complete reasoning. This provides store managers with a quantitative effect estimate for each candidate action, significantly enhancing the credibility and transparency of the recommendation. At the same time, the cost deduction mechanism avoids aggressive recommendations regardless of cost, ensuring the feasibility of the recommendations under actual operational constraints.
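The intervention, propagation, and scoring steps can be sketched for a linear causal graph. The sketch assumes linear path equations in which each downstream change equals the path coefficient times the change of its parent; the node names, coefficient values, weights, and function names are illustrative only, and all listed indicators are treated as "lower is better".

```python
def counterfactual(order, coeffs, baseline, intervention):
    """Propagate an intervention through a linear causal graph.

    order        : endogenous nodes in topological order
    coeffs       : {(parent, child): path coefficient}
    baseline     : observed value of every endogenous node
    intervention : {node: forced value}; incoming edges of these nodes are cut
    """
    delta, predicted = {}, {}
    for node in order:
        if node in intervention:
            # Severed from its parents: the value is set directly.
            delta[node] = intervention[node] - baseline[node]
        else:
            delta[node] = sum(c * delta.get(parent, 0.0)
                              for (parent, child), c in coeffs.items()
                              if child == node)
        predicted[node] = baseline[node] + delta[node]
    return predicted

def utility_score(baseline, predicted, weights, cost):
    """Weighted sum of improvements (baseline minus prediction) minus cost."""
    gain = sum(w * (baseline[k] - predicted[k]) for k, w in weights.items())
    return gain - cost

def rank_candidates(candidates, order, coeffs, baseline, weights):
    """candidates: list of (name, intervention, cost); best score first."""
    scored = []
    for name, intervention, cost in candidates:
        predicted = counterfactual(order, coeffs, baseline, intervention)
        scored.append((name, utility_score(baseline, predicted, weights, cost)))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

An indicator where an increase is desirable, such as table turnover rate, could be handled in this sketch by giving it a negative weight.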
[0033] Furthermore, the scene-aware store operation action recommendation method provided by the present invention also includes the following steps: After an action in the optimal action sequence is pushed and executed, the pressure distribution field output by the Transformer field decoder is continuously acquired at multiple fixed time steps, and the operating pressure of the spatial anchor point to which the action target belongs and of each node on the causal propagation link is monitored step by step. If the operating pressure does not show the expected decay trend, or rebounds, within the preset tolerance time window, it is determined that the action execution has a residual failure. After a residual failure is detected, the sensitivity gradient of the global situation representation vector relative to the operating pressure of the spatial anchor point is calculated, with the spatial anchor point of the residual failure as the target. At the same time, in the store's implicit state field, which is composed of the pressure distribution and the implicit state embeddings, the gradient is traced back along the spatial dimension to locate the upstream causal anchor point that contributes the most to the current residual failure as the bottleneck anchor point. Centered on the bottleneck anchor point and conditioned on the global situation representation vector, a compensation action parameter vector for the bottleneck anchor point is generated in the continuous action parameter space of the conditional diffusion probability model. The type of compensation action is limited to a top-priority forced order-expediting signal, a brief assistance instruction for adjacent idle workstations, or a small appeasement discount for affected customers. A single-step counterfactual deduction is performed on the generated compensation action parameter vector to verify its effect in eliminating the operating pressure at the bottleneck anchor point.
After the elimination effect is confirmed to meet expectations, the compensation action is run at accelerated speed over a short horizon in the parallel digital twin sandbox to verify that it will not introduce new chain anomalies. After successful verification, the compensation action tuple and the corresponding counterfactual deduction conclusion are pushed to the corresponding terminal with the highest priority and executed in parallel with the already pushed optimal action sequence, completing the closed loop from residual failure detection to compensation execution.
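By way of non-limiting illustration, the residual failure check over the tolerance time window may be sketched as follows. The function name, the `expected_drop` and `rebound_margin` thresholds, and the two returned labels are assumptions made for the example; the disclosure does not fix a specific decision rule.

```python
def detect_residual_failure(pressures, expected_drop, tolerance_steps, rebound_margin):
    """Classify the pressure trace of one monitored anchor point.

    pressures[0] is the operating pressure when the action was pushed;
    later entries are readings at successive fixed time steps.
    Returns "rebound", "no_decay", or None when the action took effect.
    """
    window = pressures[: tolerance_steps + 1]
    running_min = window[0]
    for p in window[1:]:
        # Rebound: pressure climbs back above the lowest value seen so far.
        if p - running_min > rebound_margin:
            return "rebound"
        running_min = min(running_min, p)
    # No decay: the total drop inside the window falls short of expectation.
    if window[0] - window[-1] < expected_drop:
        return "no_decay"
    return None
```

A trace such as `[1.0, 0.7, 0.9, 0.6]` would be flagged as a rebound under a margin of 0.05, whereas `[1.0, 0.8, 0.6, 0.5]` with an expected drop of 0.4 would pass.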
[0034] It should be noted that in real store operations, even after the theoretically optimal action sequence has been found and pushed for execution, two types of deviation may still occur at the execution end. First, the recipient of an action may fail to respond in time because of a temporary task: for example, an order-expediting instruction is sent to the kitchen display screen, but the chef at the target stall is handling a sudden additional order, so the instruction is shelved. Second, new external disturbances may intervene while the action is being executed, such as a sudden low-pressure alarm in an adjacent refrigeration unit during the execution of a replenishment instruction.
[0035] To address this issue, this step does not terminate the state awareness link after the optimal action sequence is pushed and executed. Instead, it continues to use the Transformer field decoder to output the pressure distribution field step by step over time, monitoring the operating pressure of the spatial anchor points affected by the pushed actions and of the nodes along the causal propagation link. If, within a preset tolerance time window, the operating pressure of the target anchor point does not decrease along the decay direction expected from the causal deduction, or rebounds instead of decreasing, the system determines that the action has a residual failure. At this point, with the failed anchor point as the target, the sensitivity gradient of the global situation representation vector with respect to the operating pressure at that point is calculated, and the direction of maximum gradient is traced backward along the spatial dimension in the store's implicit state field to locate the upstream bottleneck anchor point that contributes most to the failure. For example, if the food delivery congestion pressure does not decrease after the order-expediting instruction is executed, the gradient trace-back points to a specific chef's workstation anchor point, revealing that the real bottleneck is the temporary additional order at that workstation. Subsequently, centered on this bottleneck anchor point, targeted compensation actions are generated in the action parameter space of the conditional diffusion probability model. These actions include a top-priority forced order-expediting signal, brief assistance from adjacent idle workstations, or a small appeasement discount for affected customers. After single-step counterfactual deduction and short-horizon accelerated verification in the sandbox, these actions are pushed to the corresponding terminals for execution with the highest priority.
This enables the automatic detection and remediation of localized failures caused by delays at individual execution terminals in scenarios with multiple concurrent actions during peak hours. It prevents single-point blockages from spreading to the entire chain, significantly enhancing the store's resilience and robustness of the decision-making loop under complex operational pressures.
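By way of non-limiting illustration, the idea of locating a bottleneck anchor point by sensitivity can be sketched with finite differences. This is a simplification substituted for the example: the disclosure computes the gradient through the Transformer field decoder and the implicit state field, whereas here a hypothetical scalar function `delivery_pressure` stands in for the model, and the anchor names and coefficients are invented.

```python
def sensitivity(pressure_fn, state, anchor, eps=1e-4):
    """Finite-difference estimate of d(target pressure) / d(state[anchor])."""
    bumped = dict(state)
    bumped[anchor] += eps
    return (pressure_fn(bumped) - pressure_fn(state)) / eps

def locate_bottleneck(pressure_fn, state, upstream_anchors):
    """Return the upstream anchor with the largest absolute sensitivity."""
    grads = {a: sensitivity(pressure_fn, state, a) for a in upstream_anchors}
    return max(grads, key=lambda a: abs(grads[a])), grads

def delivery_pressure(state):
    # Hypothetical stand-in for the decoder output at the failed anchor point.
    return (0.2 * state["grill_station"]
            + 0.7 * state["wok_station"]
            + 0.1 * state["pass_window"])

state = {"grill_station": 0.4, "wok_station": 0.9, "pass_window": 0.3}
bottleneck, grads = locate_bottleneck(delivery_pressure, state, list(state))
```

In a trained model the same ranking would more likely be obtained by automatic differentiation; finite differences are used here only to keep the sketch dependency-free.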
[0036] As shown in Figure 3, an embodiment of the present invention provides a scene-aware store operation action recommendation system 3. The system includes a memory 301, a processor 302, and a communication interface 303. The memory 301 stores a scene-aware store operation action recommendation method program. When the program is executed by the processor 302, it implements the steps of the scene-aware store operation action recommendation method described above.
[0037] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A method for recommending store operation actions based on scene awareness, characterized in that it comprises: collecting multimodal operational data, store equipment operation data, and store inventory change data of a target store, performing time windowing and gridding processing on the collected asynchronous multi-source data, and converting it into multi-channel situation slices with fixed time steps to generate a store situation sequence set; discretizing the physical space of the store into a three-dimensional grid with a time dimension and adding learnable coordinate encoding, and using a Transformer field decoder, combined with the store situation sequence set, to perform implicit state field modeling of the store and global situation representation extraction through a spatiotemporal attention mechanism, to obtain a global situation representation vector; representing operational actions as structured parameter tuples containing action type, target, and action amplitude, and constructing a conditional diffusion probability model in which, in the forward process, Gaussian noise is gradually added to real action samples and, in the reverse process, with the global situation representation vector as the condition, random noise is iteratively denoised through a U-Net denoising network to generate multiple candidate action tuples; constructing a causal graph structure with store operation elements as nodes and fitting the causal path coefficients between nodes, setting the initial values of exogenous variables from the global situation representation vector, and using the candidate actions as intervention parameters to perform counterfactual deduction of action impact and priority ranking, to generate a candidate action ranking table with single-step causal deduction conclusions; and constructing a parallel digital twin sandbox of the target store, generating a candidate pool to be combined based on the candidate action ranking table, deducing the store's operation trajectory at future moments in the parallel digital twin sandbox through Monte Carlo tree search, selecting the optimal action sequence by evaluating cumulative returns and robustness, generating natural language explanations using a large language model, and pushing them to the terminal.
2. The method for recommending store operation actions based on scene awareness according to claim 1, characterized in that collecting the multimodal operational data, store equipment operation data, and store inventory change data of the target store, performing time windowing and gridding processing on the collected asynchronous multi-source data, and converting it into multi-channel situation slices with fixed time steps to generate the store situation sequence set specifically includes: The customer flow trajectories in the front hall and the table opening and checkout events are obtained from the store's heterogeneous monitoring array. The dish delivery timestamps are obtained from the order flow middleware. The equipment load and abnormality alarms are obtained from the equipment programmable logic controller. The material inventory change data are obtained from the gravity shelves and smart freezers. Together these form a multi-source raw data stream. The multi-source raw data stream is aligned to Network Time Protocol timestamps and corrected for timestamp jumps. Based on the installation location of each data source's sensor in the store, a spatial coordinate label is added to each record, transforming the multi-source raw data stream into a standardized asynchronous record stream carrying both time and space labels. Based on the store floor plan, a spatial polyhedral grid is pre-constructed covering the front dining area, kitchen stalls, food delivery aisles, and storage shelving areas. Each cell in the grid is defined as an independent spatial anchor point. The standardized asynchronous record stream is mapped to the corresponding spatial anchor points according to the spatial coordinate labels. Each spatial anchor point is filled with the estimated remaining dining time of the table, the order backlog at the stall, the equipment health scalar, and the material consumption trend value, forming a spatially aligned multi-channel time-series signal stream.
Adaptive window length gating is performed on the multi-channel time-series signal stream at each spatial anchor point. Narrow-window Gaussian weighted statistics are applied to fast-changing channels, wide-window smoothing is applied to slow-changing channels, and the frequency and duration of discrete event channels are statistically analyzed. The gating mechanism is used to selectively filter the original values within the window. All channel feature values of all spatial anchor points within the same time step are arranged into multi-channel situation slices for that time step. The channels contained in each slice correspond to the customer flow density field, the average remaining dining time per table, the length of the stall order queue, the confidence level of equipment anomalies, and the material consumption trend, respectively. They are arranged in order of time steps to form a store situation sequence set.
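By way of non-limiting illustration, the per-channel window statistics recited above may be sketched as follows. The window lengths, the Gaussian width `sigma`, the choice to center the Gaussian on the most recent sample, and the representation of events as `(start, end)` intervals are assumptions made for the example; the learned gating of raw values within the window is omitted.

```python
import math

def gaussian_weighted_mean(values, sigma):
    """Weighted mean in which the most recent sample carries the largest weight."""
    n = len(values)
    weights = [math.exp(-((n - 1 - i) ** 2) / (2 * sigma ** 2)) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def summarize_channel(samples, kind, narrow=3, wide=12, sigma=1.0):
    """Reduce one anchor point's channel to a feature for the current time step."""
    if kind == "fast":
        # Fast-changing channel: narrow window, Gaussian-weighted statistics.
        return gaussian_weighted_mean(samples[-narrow:], sigma)
    if kind == "slow":
        # Slow-changing channel: wide window, plain smoothing.
        window = samples[-wide:]
        return sum(window) / len(window)
    if kind == "event":
        # Discrete event channel: samples are (start, end) intervals.
        return {"count": len(samples), "duration": sum(end - start for start, end in samples)}
    raise ValueError(f"unknown channel kind: {kind}")
```

Stacking these features over all channels and all spatial anchor points for one time step would give one multi-channel situation slice.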
3. The method for recommending store operation actions based on scene awareness according to claim 1, characterized in that discretizing the physical space of the store into a three-dimensional grid with a time dimension and adding learnable coordinate encoding, and using the Transformer field decoder, combined with the store situation sequence set, to perform implicit state field modeling of the store and global situation representation extraction through the spatiotemporal attention mechanism to obtain the global situation representation vector specifically includes: Based on the store floor plan and floor height parameters, the physical space of the target store is discretized into a three-dimensional spatial grid covering the front hall, back kitchen, food delivery area, and storage area, and is extended along the time axis into a four-dimensional spatiotemporal grid framework. Each grid point is assigned a learnable coordinate encoding vector composed of a spatial position encoding and a temporal position encoding. A multilayer perceptron compresses the feature values of all spatial anchor points in each multi-channel situation slice of the store situation sequence set into a fixed-dimensional slice embedding vector. At the same time, each spatial anchor point is encoded at each time step to obtain an anchor point local state embedding. A Transformer field decoder composed of multiple layers of cross-attention blocks is introduced. The spatiotemporal coordinates to be queried, together with their learnable coordinate encodings, are used as query vectors. Temporal attention aggregation is performed over the sequence of slice embedding vectors to capture the semantics of the overall store operation rhythm. Spatial attention aggregation is performed over the anchor point local state embeddings to retrieve the local detail signals of neighboring anchor points.
After multi-layer cross-attention operation by the Transformer field decoder, the running pressure scalar and implicit state embedding vector of the input spatiotemporal coordinates are obtained. The running pressure comprehensively represents the queuing pressure, food delivery congestion, equipment malfunction risk and inventory depletion tendency at the coordinate location. Uniform sampling is performed on the four-dimensional spatiotemporal grid of the store. The spatiotemporal coordinates of each sampling point are sequentially input into the Transformer field decoder to obtain the pressure distribution field covering the entire store and the implicit state embedding. The operating pressure of all sampling points is normalized to form spatial attention weights. The implicit state embeddings are weighted and summed, and after being grouped according to preset functional areas, cross-area attention integration is performed to obtain the global situation representation vector.
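By way of non-limiting illustration, the pressure-weighted aggregation of implicit state embeddings may be sketched as follows. The sketch assumes that normalization means dividing each non-negative pressure by the sum over all sampling points; the claim does not specify the normalization, and the subsequent grouping by functional area and cross-area attention integration are omitted.

```python
def pool_embeddings(pressures, embeddings):
    """Weight each sampling point's embedding by its normalized operating pressure."""
    total = sum(pressures)  # assumed positive
    weights = [p / total for p in pressures]
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
```

With pressures `[1.0, 3.0]` and embeddings `[[1.0, 0.0], [0.0, 1.0]]`, the higher-pressure point dominates the pooled vector.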
4. The method for recommending store operation actions based on scene awareness according to claim 1, characterized in that representing the operational actions as structured parameter tuples containing action type, target, and action amplitude, constructing the conditional diffusion probability model, gradually adding Gaussian noise to the real action samples in the forward process, and, in the reverse process, using the global situation representation vector as the condition and iteratively denoising random noise through the U-Net denoising network to generate multiple candidate action tuples specifically includes: Store operation actions are uniformly represented as structured parameter tuples containing action type, target, and action amplitude fields. After embedding table mapping and concatenation, the fields are normalized to a fixed interval to form an action parameter vector. The action type field covers personnel dispatch, order expediting, inventory replenishment, equipment inspection, and appeasement discounts. The target field identifies where the action is applied by spatial anchor point number, table number, or equipment number. The action amplitude field carries a continuous value that matches the action type. Executed and adopted operational action records are extracted from historical operational archives; the action type, target, and action amplitude of each record are encoded into a real action sample vector according to the structured parameter tuple format and associated with the store situation sequence snapshot at the time of action execution and the corresponding global situation representation, forming a real action sample set. A conditional diffusion probability model is constructed, and the real action sample set is used as input for model training.
During the forward noise addition process, for each action parameter vector in the real action sample set, the diffusion time step is randomly sampled and Gaussian noise of corresponding intensity is injected to finally obtain the action vector contaminated by noise. In the reverse denoising process, the U-Net denoising network with encoder-decoder architecture is used to downsample the input noise-contaminated action vector and extract multi-layer features. The decoder gradually restores the dimension through upsampling and transmits fine-grained information of each layer of the encoder through skip connections. Using the global situational representation vector as a conditional signal, the conditional signal is injected into the intermediate feature layer through a cross-attention mechanism at the feature scale of each downsampling and upsampling stage of the U-Net denoising network, so that the denoising network uses the current operating status of the store as a reference when predicting noise components. After the model training converges, a pure random noise vector with the same dimension as the action parameter vector is sampled from the standard Gaussian distribution. Using the current store's global situation representation vector as a condition, the U-Net denoising network, which has been trained and converged, gradually denoises from the maximum time step along the inverse diffusion time step. The denoising process is executed in parallel from multiple different initial noises to generate multiple candidate action parameter vectors, which are then decoded in the parameter space to restore them into candidate action tuples.
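By way of non-limiting illustration, the forward noise addition step may be sketched as follows. The sketch assumes the widely used closed form x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps with a linear noise schedule; the schedule endpoints, the number of steps, and the function names are assumptions for the example, and the conditional U-Net denoiser itself is not shown.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) over the first t diffusion steps."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(x0, t, rng):
    """Return the noise-contaminated action vector at step t and the injected noise."""
    ab = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps

# Hypothetical normalized action parameter vector and a randomly sampled time step.
rng = random.Random(0)
x0 = [0.2, -0.5, 0.9]
t = rng.randint(1, 1000)
xt, eps = add_noise(x0, t, rng)
```

During training under this assumption, the denoising network would receive `xt`, `t`, and the global situation representation vector, and would be fitted to predict `eps`.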
5. The method for recommending store operation actions based on scene awareness according to claim 1, characterized in that constructing the causal graph structure with store operation elements as nodes and fitting the causal path coefficients between nodes, setting the initial values of exogenous variables from the global situation representation vector, and using the candidate actions as intervention parameters to perform counterfactual deduction of action impact and priority ranking, to generate the candidate action ranking table with single-step causal deduction conclusions specifically includes: Queuing time, table turnover rate, the food preparation congestion index of each stall, the inventory consumption rate of key materials, the equipment failure risk index, and customer complaint tendency are taken as endogenous nodes; in-store customer flow intensity, weather conditions, business district activity tags, and time attributes are taken as exogenous nodes; and directed edges are established between nodes based on the causal relationships of store operation, forming the causal graph structure. Using observation data from normal store operation and historical intervention correction data, the path coefficient of each directed causal edge in the causal graph structure is fitted by maximum likelihood estimation with Bayesian regularization to obtain the causal path coefficient matrix, which characterizes the strength of association and the response elasticity between nodes under natural and intervention conditions. For the current moment, the initial value of each exogenous variable node is calculated from the global situation representation vector at the current moment and assigned to the corresponding exogenous variable node. At the same time, the current actual observed values of the endogenous nodes are recorded as the baseline state for counterfactual deduction.
For each candidate action tuple, the endogenous node on which it directly acts in the causal graph structure is determined from its action type and target, all natural incoming edges of this target endogenous node are cut, and the action amplitude field, after mapping and transformation, is assigned to the target endogenous node as the intervention quantity. The intervention effect is propagated forward node by node along the directed causal edges in topological order. All endogenous nodes reachable from the intervention node through directed edges are traversed sequentially in the causal graph structure. For each directed edge, the path equation is applied, using the causal path coefficient, to the new value of the preceding node as modified by the intervention quantity, to obtain the counterfactual prediction value of the endogenous node. The counterfactual prediction values of the endogenous nodes are compared with the baseline state, and the following estimated effect indicators are extracted: the decrease in queuing time, the increase in table turnover rate, the degree of relief of food service congestion, the proportion of delayed inventory consumption, and the decrease in customer complaint tendency. Based on the target store's current operating period and status preferences, a preset weight factor table is queried to obtain the indicator weight factors. The indicator weight factors are used to compute a weighted sum of the estimated effect indicators, and the estimated execution cost is introduced as a utility deduction term, finally giving the comprehensive utility score of the target candidate action tuple. After all candidate action tuples have completed counterfactual deduction and comprehensive utility calculation, they are sorted in descending order of comprehensive utility score.
The multi-index predicted effect of each action and the causal propagation path node sequence are recorded together as the single-step causal deduction conclusion, forming a candidate action ranking table.
6. The method for recommending store operation actions based on scene awareness according to claim 1, characterized in that constructing the parallel digital twin sandbox of the target store, generating the candidate pool to be combined based on the candidate action ranking table, deducing the store's operation trajectory at future moments in the parallel digital twin sandbox through Monte Carlo tree search, selecting the optimal action sequence by evaluating cumulative returns and robustness, generating natural language explanations using a large language model, and pushing them to the terminal specifically includes: A parallel digital twin sandbox is constructed, consisting of three components: a physical layout mirror, a real-time operating status synchronization interface, and an operating logic simulation rule base. The candidate action ranking table is input, and the candidate action tuples whose comprehensive utility scores rank within a preset top range are extracted to form the candidate pool to be combined. Mutually exclusive action combinations that act on the same target in the candidate pool are removed. The physical layout mirror reuses the spatial polyhedral grid and the position information of each spatial anchor point. The real-time operating status synchronization interface receives the global situation representation vector at each fixed time step and decodes it into a state snapshot of each functional area. The operating logic simulation rule base is built from the causal path coefficient matrix and the historical random parameters of customer behavior. With the current real-time synchronized state of the sandbox as the root node and the action tuples in the candidate pool to be combined as directed edges, a Monte Carlo tree search is performed. In the expansion phase, new actions are sampled to create child nodes. In the simulation phase, the store's operating trajectory in future periods is deduced.
In the backpropagation phase, the cumulative return and robustness indicators are updated for each ancestor node along the search path. During the simulation phase, the simulation rule base propagates the intervention effect based on the causal path coefficients and generates random customer behavior from the distributions of customer inter-arrival time and dining duration. The same action sequence is simulated multiple times with random perturbations applied to the random parameters of customer behavior, and the mean and standard deviation of the cumulative returns across the simulations are recorded. After the preset number of search rounds is completed, an action is selected, from among the direct child nodes of the root node whose mean cumulative return and standard deviation both satisfy the preset screening rules, as the first step of the optimal sequence; the selection is then repeated with this child node as the new root node, finally giving the optimal action sequence. The optimal action sequence is integrated with the single-step causal deduction conclusion of each corresponding action and a summary of the overall effect from the sandbox deduction into a structured prompt context. This context is input into a large language model to generate a natural language explanation. Finally, the action sequence is pushed to the management terminal and the regional terminals in a differentiated manner according to action type, realizing the recommendation of store operation actions.
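By way of non-limiting illustration, the screening of root-level actions by mean return and standard deviation may be sketched as follows. Only the final selection step is shown, not the full tree search. The screening rule (a minimum mean and a maximum standard deviation, then the highest mean among the survivors), the toy simulator, and the action names are assumptions made for the example.

```python
import random
import statistics

def rollout_stats(simulate, action, runs, rng):
    """Simulate one action repeatedly under perturbed customer behavior."""
    returns = [simulate(action, rng) for _ in range(runs)]
    return statistics.mean(returns), statistics.pstdev(returns)

def pick_first_action(simulate, actions, runs, min_mean, max_std, seed=0):
    """Keep actions that pass both thresholds, then take the highest mean return."""
    rng = random.Random(seed)
    stats = {a: rollout_stats(simulate, a, runs, rng) for a in actions}
    passing = [a for a in actions if stats[a][0] >= min_mean and stats[a][1] <= max_std]
    best = max(passing, key=lambda a: stats[a][0]) if passing else None
    return best, stats

# Hypothetical sandbox stand-in: action -> (expected return, volatility).
TOY = {"add_staff": (3.0, 0.5), "deep_discount": (4.0, 3.0), "inspect_fryer": (1.0, 0.2)}

def toy_simulate(action, rng):
    mean, noise = TOY[action]
    return rng.gauss(mean, noise)

best, stats = pick_first_action(toy_simulate, list(TOY), runs=200, min_mean=2.0, max_std=1.0)
```

In this toy setting the discount action has the highest average return but is screened out for its volatility, which is the robustness trade-off the claim describes.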
7. A scene-aware store operation action recommendation system, characterized in that the system includes a memory, a processor, and a communication interface; the memory stores a scene-aware store operation action recommendation method program; and when the program is executed by the processor, it implements the steps of the method for recommending store operation actions based on scene awareness according to any one of claims 1-6.