Portfolio management method based on combination of multi-agent reinforcement learning and causal discovery
By employing multi-agent reinforcement learning and causal discovery methods, we have addressed the issues of insufficient structural representation and training stability in large-scale asset pool scenarios. This approach enables unified modeling of cross-asset structural relationships and long-term dependencies, thereby improving the decision-making performance and stability of the investment portfolio and adapting to real-world trading constraints.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- COMMUNICATION UNIVERSITY OF CHINA
- Filing Date
- 2026-05-12
- Publication Date
- 2026-06-19
AI Technical Summary
Existing portfolio investment management methods suffer from problems such as insufficient structural representation, excessively high state and action dimensions, insufficient training stability, weak adaptability to real transaction constraints, and insufficient interpretability of graph structures in large-scale asset pool scenarios.
We employ a multi-agent reinforcement learning and causal discovery approach. By dividing assets into local subsets, we construct a temporal portfolio graph (TPG) and a multi-agent actor-critic network. We combine graph convolutional networks and gated recurrent units to jointly represent cross-asset structural relationships and long-term temporal dependencies. We also introduce transaction costs and position constraints for joint training and post-action processing.
It significantly improves the scalability and stability of decision-making under large-scale asset pools, enhances the joint expression of cross-asset structural relationships and long-term dependencies, improves risk-adjusted returns and risk control under extreme market conditions, and has a high degree of implementation capability that aligns with real-world transaction constraints.
Figure CN122243649A_ABST
Description
Technical Field
[0001] This invention relates to the fields of intelligent finance and quantitative investment technology, specifically to a portfolio investment management method based on multi-agent reinforcement learning and causal discovery.
Background Technology
[0002] In recent years, machine learning and deep learning have been increasingly applied in the financial field. Especially since 2018, models such as deep neural networks, convolutional neural networks, recurrent neural networks, self-attention mechanisms, and Transformers have demonstrated significant advantages in processing high-dimensional time series, text data, and graph-structured data, and have been widely adopted in financial forecasting and intelligent decision-making. At the same time, the combination of reinforcement learning and deep learning has also developed rapidly, laying the foundation for building more intelligent trading and asset management systems.
[0003] As the scale of capital market data continues to expand and asset relationships become increasingly complex, how to leverage smarter modeling methods to improve investment decision-making efficiency, enhance risk control capabilities, and improve strategy stability under real-world trading constraints has become a crucial topic in intelligent finance research. However, existing methods still face several key challenges when dealing with large-scale asset pools: Firstly, with the increase in the number of assets, the complexity of state representation dimensions, action space, and the difficulty of modeling cross-asset dependencies rise rapidly, significantly limiting the scalability and training stability of traditional single-decision frameworks. Secondly, financial markets commonly exhibit complex structures such as industry linkages, pattern resonance, risk transmission, and temporal evolution, making it difficult to simultaneously characterize cross-asset relationships and long-term dynamic features using only conventional deep networks. Furthermore, relationship construction methods based on similarity or correlation are susceptible to noise, spurious correlations, and structural drift, reducing the robustness and interpretability of the model in non-stationary markets. At the same time, many studies do not adequately consider real-world trading conditions such as transaction costs, position constraints, and risk exposure, further limiting the practical application of these models.
[0004] Existing technologies related to portfolio investment management mainly fall into the following categories: The first category is reinforcement learning-based portfolio investment methods, which typically model portfolio management as a Markov decision process or a partially observable Markov decision process, achieving iterative strategy updates through state observation, action output, and reward feedback. The second category is end-to-end portfolio modeling methods based on deep learning, which directly learn feature representations from historical prices of multiple assets and output investment weights through convolutional networks, recurrent networks, or attention mechanisms. The third category is graph neural network-based portfolio management methods, which construct asset relationship graphs to model industry, supply chain, fundamental, or similarity relationships. The fourth category is multi-agent reinforcement learning methods, which alleviate the training difficulties caused by high-dimensional action spaces by decomposing large-scale decision-making tasks into multiple local sub-tasks. The fifth category is causal discovery techniques, which provide prior constraints for graph structure modeling by identifying more directional and interpretable dependencies from time series data.
[0005] Among existing end-to-end deep portfolio models, the representative EIIE improves training stability and scalability for large-scale asset portfolios through a parameter-sharing mechanism, while EI3 further introduces multi-scale convolutional structures to capture short-term, medium-term, and long-term market patterns simultaneously. Existing graph neural network portfolio schemes typically employ static heterogeneous graphs or dynamic graphs, encoding industry relationships, supply and demand connections, or similarity relationships between assets and combining them with reinforcement learning actors. Existing multi-agent schemes often adopt a centralized-training, decentralized-execution paradigm, allowing different agents to process local asset subsets and collaborate through shared parameters or centralized value estimation.
[0006] However, the aforementioned prior art has at least the following shortcomings: First, many deep portfolio models assume that assets are independent of each other, making it difficult to effectively express cross-asset structural dependencies such as industry linkages, risk transmission, and market resonance. Secondly, the single-agent solution faces problems such as high state dimension, large action space and unstable training when the number of assets increases, resulting in poor scalability. Third, existing graph construction methods mostly rely on empirical rules, static relationships, or similarity thresholds, which are easily affected by noise, spurious correlations, and structural drift, resulting in insufficient interpretability and robustness. Fourth, existing solutions often struggle to simultaneously address both spatial structure modeling and long-term dynamic modeling, resulting in limited adaptability to changes in market conditions. Fifth, some methods do not adequately consider real trading conditions such as transaction costs, position limits, risk exposure, and rebalancing execution, making them difficult to implement directly.
[0007] Specifically, for traditional deep learning models, on the one hand, many models assume that assets are independent of each other, ignoring cross-asset structural relationships such as industry linkages, supply chain transmission, and common volatility characteristics; on the other hand, as the number of assets increases, the model input dimensions increase rapidly, and the training difficulty and overfitting risk also rise accordingly. In addition, deep models often lack the ability to structurally handle risk measurement, drawdown control, and dynamic interactions among multiple assets, so their generalization ability and robustness remain limited under market state transitions and real-world trading frictions.
[0008] The application of multi-agent reinforcement learning in financial scenarios is still in its early stages. Most models rely on low-dimensional or simplified market features, making it difficult to scale to hundreds of assets. The credit allocation mechanism, collaboration logic, and interpretability among agents are also insufficient. Multi-agent frameworks combined with graph neural networks are still in the exploratory stage. Therefore, a unified framework that can simultaneously integrate graph structure modeling, temporal dependency characterization, and multi-agent decision decomposition is still needed to improve the structural expressiveness, scalability, and training stability in large-scale portfolio investment management.
[0009] Regarding graph models, although works such as GPM and TC-MAC have demonstrated the significant value of graph structures in financial decision-making, existing graph neural network portfolio models still face several challenges: graph construction often relies on expert experience or correlation rules and is therefore partly subjective; asset relationships in financial markets change dynamically over time, and static graphs struggle to fully reflect this structural evolution; and graph models themselves also present challenges in interpretability, training stability, and cross-market transferability. Therefore, how to construct more stable, natural graph structures capable of jointly modeling time-series dependencies in portfolio investment management remains a problem worthy of in-depth research.
[0010] Based on the above analysis, it can be seen that existing portfolio investment management methods face several common bottlenecks in large-scale asset pool scenarios: first, it is difficult to simultaneously characterize cross-asset structural dependencies and time dynamics; second, the single-agent decision-making paradigm lacks scalability in high-dimensional action spaces; third, graph construction methods based on correlation or empirical rules lack stability and interpretability; and fourth, many methods do not adequately consider real-world transaction constraints, resulting in limited strategy implementation capabilities. Therefore, exploring a portfolio investment method that can balance structural expressiveness, decision-making efficiency, training stability, and practical feasibility for large-scale asset pools and real-world transaction constraints has both significant theoretical research value and strong practical application significance.
Summary of the Invention
[0011] The technical problem to be solved by this invention is to provide a portfolio investment management method based on multi-agent reinforcement learning and causal discovery, which addresses the problems of insufficient structural expression, excessive state and action dimensions, insufficient training stability, weak adaptability to real transaction constraints, and insufficient interpretability of graph structures in existing portfolio investment management methods in large-scale asset pool scenarios.
[0012] The technical problems to be solved by this invention are: how to uniformly model cross-asset structural relationships and long-term time-series context information in multi-asset markets; how to achieve multi-agent collaborative decision-making while maintaining parameter efficiency; how to use time-series causal discovery to provide more stable prior constraints for cross-asset graph structures; and how to obtain portfolio allocation results that balance returns, stability and feasibility while considering transaction costs, position constraints, risk exposure and curriculum learning optimization objectives.
[0013] The technical solution adopted by this invention to solve this technical problem is a portfolio investment management method based on multi-agent reinforcement learning and causal discovery, comprising: acquiring multi-asset market data and constructing a portfolio investment environment from it, where the environment includes at least transaction cost constraints and position constraints, and the multi-asset market data includes at least the price features, return sequences, and transaction constraint information of multiple assets on consecutive trading days; dividing all assets into multiple non-overlapping local asset subsets and assigning one agent to each subset to form a multi-agent portfolio investment decision structure; constructing a temporal portfolio graph (TPG) to jointly represent the structural relationships and temporal dependencies among assets and to generate a global embedding representation that can be shared by all agents; constructing a multi-agent Actor-Critic network with a shared Actor backbone and agent-specific heads, so that each agent combines the global embedding representation with its local observations to output target portfolio weights; performing constraint-aware post-processing on the target portfolio weights to obtain allocation results that satisfy actual trading constraints; and jointly training the multi-agent Actor-Critic network and the TPG under a centralized-training, distributed-execution approach. In addition, a time-series causal discovery method extracts a causal prior graph from the return sequences of the training set, and the causal prior graph is fused with the dynamic similarity graph in the TPG to generate a global embedding representation containing causal structure constraints.
[0014] Specifically, the preprocessing of the multi-asset market data includes: imputing missing values, time alignment, and standardization of the original price data; extracting at least one price feature from the highest price, lowest price, and closing price; and constructing a return series, a benchmark return series, and historical weight memory information.
[0015] Specifically, the portfolio investment environment is modeled as a partially observable Markov decision process or a Markov decision process; the observations of each intelligent agent include at least the price window features corresponding to the local asset subset, the portfolio weights at the previous time step, and the global embedded representation of the TPG output; the actions of each intelligent agent include at least the target allocation weights of cash and the corresponding local asset subset.
[0016] Specifically, the construction of the TPG includes: constructing the input features of each asset node from its price features at the current moment, its action at the previous moment, its reward at the previous moment, and its historical hidden state; performing node embedding on these input features to obtain the node representation of each asset; and calculating the similarity between assets from the node representations and constructing a dynamic similarity graph accordingly.
[0017] Specifically, the dynamic similarity graph is constructed as follows: the heat kernel similarity is calculated based on the distance between asset node representations, and the similarity matrix is subjected to threshold filtering, symmetry processing and normalization processing to obtain a weighted adjacency matrix for graph convolution.
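The four steps named above (heat kernel similarity, threshold filtering, symmetrization, normalization) can be illustrated with a minimal pure-Python sketch. The function name `heat_kernel_adjacency`, the default `scale` and `threshold` values, the removal of self-loops, and the choice of row normalization are all assumptions made for illustration; the source does not fix them.

```python
import math

def heat_kernel_adjacency(embeddings, scale=1.0, threshold=0.1):
    """Build a weighted adjacency matrix from node embeddings.

    embeddings: list of equal-length float lists, one per asset node.
    scale, threshold: illustrative values, not taken from the patent.
    """
    n = len(embeddings)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: no self-loops at this stage
            dist_sq = sum((a - b) ** 2
                          for a, b in zip(embeddings[i], embeddings[j]))
            sim = math.exp(-dist_sq / scale)  # heat kernel similarity
            adj[i][j] = sim if sim >= threshold else 0.0  # threshold filter
    # Symmetrize by averaging with the transpose.
    sym = [[0.5 * (adj[i][j] + adj[j][i]) for j in range(n)]
           for i in range(n)]
    # Assumption: row normalization; isolated nodes keep an all-zero row.
    normalized = []
    for row in sym:
        total = sum(row)
        normalized.append([w / total if total > 0 else 0.0 for w in row])
    return normalized

if __name__ == "__main__":
    graph = heat_kernel_adjacency([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
    for row in graph:
        print(row)
```

In this toy input the first two nodes are close and end up connected, while the third is far from both and is cut off by the threshold.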
[0018] Specifically, the TPG jointly represents structural relationships and temporal dependencies as follows: a graph convolutional network spatially aggregates the dynamic similarity graph to obtain context embeddings that reflect cross-asset structural relationships; after the node embeddings have been spatially aggregated using the weighted adjacency matrix, the resulting sequence over time steps is fed into a gated recurrent unit (GRU) to learn the temporal evolution of asset correlations; and the global embedding representation is obtained by fusing the historical task embedding of each asset with the context embedding through an attention mechanism.
[0019] Specifically, the multi-agent Actor-Critic network with a shared Actor backbone and agent-specific heads comprises: a shared Actor backbone network, used to extract common policy features for all agents; agent-specific heads, each corresponding to one agent and used to map the common policy features to allocation weights on the corresponding local asset subset; and Critic networks, each corresponding to one agent and used to estimate the state-action value from joint observations, joint actions, training phase variables, and the global embedding representation.
[0020] Specifically, the constraint-aware action post-processing includes: clipping the target portfolio weights output by each agent at the position cap; reallocating the weight in excess of the position limit to other assets or cash according to preset rules; and calculating the transaction residual factor from the transaction cost rate to obtain the portfolio value and realized weights after deducting transaction frictions.
[0021] Specifically, the joint training includes: updating each Critic network based on the temporal-difference objective; updating the Actor network of each agent based on a graph-conditioned deterministic policy gradient; introducing a portfolio gradient regularization loss to enhance portfolio growth capability while taking turnover costs into account; and introducing a mutual information objective to update the TPG graph encoder, enhancing the consistency between the global context representation and the node representations.
[0022] Specifically, the time-series causal discovery method is the PCMCI+ method, and the construction of the causal prior graph includes: identifying significant dependencies between assets using the return sequences of the training set; generating a static causal adjacency matrix from the identification results; and weighting and fusing the causal embeddings corresponding to the static causal adjacency matrix with the similarity embeddings corresponding to the dynamic similarity graph to obtain a global embedding representation that includes causal structure constraints.
[0023] The beneficial effects of this invention are as follows. It significantly improves the scalability of decision-making under large-scale asset pools. Under conditions of large-scale asset pools and complex markets, the multi-agent decomposition architecture employed in this invention delivers significant performance and stability benefits. By partitioning assets and using a centralized-training, distributed-execution paradigm, it reduces the computational complexity of high-dimensional decision-making and enhances the system's adaptability as the asset universe expands.
[0024] This invention enhances the joint representation of cross-asset structural relationships and long-term temporal dependencies. It unifies temporal portfolio graphs and multi-agent decision-making into a scalable framework. By utilizing graph convolutional networks to extract cross-asset dependencies and gated recurrent units to model temporal evolution, it effectively overcomes the shortcomings of traditional deep models that neglect cross-asset linkages, significantly improving the ability of local strategies to perceive the dynamic structure of the global market.
[0025] It demonstrates superior performance in risk-adjusted return metrics. Compared to existing end-to-end deep learning portfolio management models, this invention exhibits significant advantages in core metrics such as cumulative return, Sharpe ratio, and stability. Even for portfolios of hundreds of stocks or more, and under real-world constraints such as strict position limits and high trading frictions, this invention maintains high return generation capabilities while effectively controlling maximum drawdown, making it well suited to the practical needs of real-world financial markets.
[0026] The invention effectively enhances risk control and downside stability under extreme market conditions. By introducing a prior structure based on time-series causal discovery, it significantly suppresses the extreme tails of the asset return distribution. This causal prior provides more economically interpretable network connections and, acting as a structural regularizer, effectively reduces over-adjustment and overfitting in extreme market situations, thus improving the portfolio's downside performance.
[0027] This invention offers highly practical implementation capabilities that closely align with real-world trading constraints. It explicitly integrates transaction costs, maximum position limits, cash flow considerations, and a multi-stage learning mechanism into the training environment. The synergistic effect of these modules not only enhances the model's training stability but also ensures that the portfolio weights output by the strategy fully meet the frictional conditions and risk control limitations of real-world trading, providing a complete, robust, and feasible technical solution for quantitative asset allocation.
Attached Figure Description
[0028] The present invention will be further described below with reference to the accompanying drawings and embodiments.
[0029] Figure 1 is a flowchart of the overall model of the portfolio investment management method based on multi-agent reinforcement learning and causal discovery described in this invention; Figure 2 is a diagram of the architecture of the temporal portfolio graph (TPG) described in this invention; Figure 3 is a comparison chart of the investment results of the model of this invention on the constituent stocks of the CSI 100 Index; Figure 4 shows the overall model architecture of this invention, illustrating the core modules and data flow of the portfolio investment management method based on multi-agent reinforcement learning and causal discovery.
Detailed Implementation
[0030] To make the technical means, creative features, objectives and effects of this invention easier to understand, the invention will be further described below in conjunction with specific embodiments.
[0031] As shown in Figures 1-4, the portfolio investment management method based on multi-agent reinforcement learning and causal discovery described in this invention achieves unified modeling of cross-asset structural relationships, long-term dynamic dependencies, and multi-agent scalable decision-making through a process of "asset partitioning, global structure encoding, local strategy decision-making, constraint enforcement, joint training, causal prior fusion." The method generally includes steps such as data preprocessing, portfolio investment environment construction, temporal portfolio graph construction, multi-agent Actor-Critic strategy learning, constraint-aware action post-processing, causal prior graph extraction and fusion, and aggregation and output of the multi-agent sub-portfolios, as follows:
(1) Data acquisition, preprocessing and asset partitioning for multi-asset markets. First, historical market data for multiple assets over consecutive trading days is acquired. This data includes one or more price features such as the highest price, lowest price, and closing price, as well as a return series calculated from the price series. Preprocessing for this investment environment includes performing missing value imputation, time alignment, and standardization on the raw data, and constructing historical weight memory information and benchmark return information. The full asset set $\mathcal{A}$ is then divided into $K$ non-overlapping local asset subsets $\mathcal{A}_1, \dots, \mathcal{A}_K$, each corresponding to one agent. The local asset partitioning satisfies: $\bigcup_{k=1}^{K} \mathcal{A}_k = \mathcal{A}$ and $\mathcal{A}_i \cap \mathcal{A}_j = \varnothing$ for $i \neq j$; the numbers of assets controlled by the agents must strictly satisfy $\sum_{k=1}^{K} N_k = N$, where $N$ is the total number of assets and $N_k$ is the number of assets the $k$-th agent is responsible for. This partitioning decomposes the original high-dimensional joint action space into several local action spaces, thereby reducing the learning complexity of a single policy network under a large-scale asset pool.
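The partitioning requirement (non-overlapping subsets whose sizes add up to the total number of assets) can be sketched as follows. The source does not say how assets are assigned to subsets, so the contiguous, near-equal split and the helper name `partition_assets` are assumptions made purely for illustration.

```python
def partition_assets(assets, num_agents):
    """Split assets into num_agents non-overlapping, contiguous subsets.

    Assumption: subsets are contiguous slices of near-equal size; the
    patent only requires that they do not overlap and cover every asset.
    """
    if not 1 <= num_agents <= len(assets):
        raise ValueError("need 1 <= num_agents <= number of assets")
    base, extra = divmod(len(assets), num_agents)
    subsets, start = [], 0
    for k in range(num_agents):
        size = base + (1 if k < extra else 0)  # spread the remainder
        subsets.append(assets[start:start + size])
        start += size
    return subsets

if __name__ == "__main__":
    for k, subset in enumerate(partition_assets(list("ABCDEFG"), 3)):
        print(f"agent {k}: {subset}")
```

A sector-based or clustering-based assignment would satisfy the same constraints; only the disjointness and coverage checks matter for the decomposition of the action space.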
[0032] (2) Portfolio investment environment modeling and observation construction. This invention models the portfolio investment process as a Markov decision process or a partially observable Markov decision process. For any agent $k$ at any time $t$, the local observation is denoted $o_t^k$, with dimensions $F \times N_k \times W$, where $F$ is the number of price features (this model selects three types of features: highest price, lowest price, and closing price, so $F = 3$), $N_k$ is the number of assets managed by the agent, and $W$ is the length of the observation window. The agent can therefore use the price evolution of its local asset set over multiple past trading days.
[0033] To form a global structural observation, the local observations of all agents are first concatenated to obtain a joint price observation $X_t$. Then the components of the joint action from the previous moment, $a_{t-1}$, and of the reward from the previous moment, $r_{t-1}$, that correspond to each asset are concatenated onto the price observation to form the input observation for the temporal portfolio graph: $\tilde{X}_t = [X_t, a_{t-1}, r_{t-1}]$; therefore, the observation ultimately used by the reinforcement learning policy network of agent $k$ can be written as: $s_t^k = (g_t, o_t^k)$; where $g_t$ is the global embedding obtained from the temporal portfolio graph encoding and $o_t^k$ is the local observation of the $k$-th agent. This design allows local strategies to both preserve the temporal information of their own sub-portfolios and utilize structural information across assets.
[0034] (3) Action space definition and constraint-aware post-processing. At the end of trading day $t$, the target allocation action to be applied before the open of trading day $t+1$ is generated from the observations. For agent $k$, the policy network outputs, after softmax normalization, a raw action vector $\tilde{a}_t^k \in \mathbb{R}^{N_k + 1}$, which gives the provisional portfolio weights on the cash dimension and on each of the $N_k$ assets managed by the agent. To meet actual investment constraints, the raw action is modified by a deterministic post-processing module before execution, namely clipping at the maximum position constraint.
[0035] The post-processing sets a maximum weight threshold $w_{\max}$ for a single asset. Any component that exceeds the threshold is clipped, and the excess is redistributed proportionally to the remaining assets or cash dimensions that have not reached the upper limit, until the threshold is met everywhere. This process curbs excessive concentration of positions and improves the feasibility and stability of the portfolio.
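The clip-and-redistribute step described above can be sketched as an iterative procedure. This is one possible reading, under stated assumptions: index 0 is the cash dimension and is left uncapped, the excess is shared in proportion to the receivers' current weights, and the names `cap_weights`, `max_weight` and `tol` are hypothetical.

```python
def cap_weights(weights, max_weight, tol=1e-12):
    """Clip asset weights at max_weight and redistribute the excess.

    weights: non-negative floats summing to 1; index 0 is cash.
    Assumption: cash is not capped and always acts as a receiver.
    """
    w = list(weights)
    n = len(w)
    for _ in range(n):  # each pass pins at least one more asset at the cap
        over = [i for i in range(1, n) if w[i] > max_weight + tol]
        if not over:
            break
        excess = sum(w[i] - max_weight for i in over)
        for i in over:
            w[i] = max_weight
        # Receivers: cash plus every asset still strictly below the cap.
        receivers = [0] + [i for i in range(1, n) if w[i] < max_weight - tol]
        total = sum(w[i] for i in receivers)
        for i in receivers:
            share = w[i] / total if total > 0 else 1.0 / len(receivers)
            w[i] += excess * share
    return w

if __name__ == "__main__":
    print(cap_weights([0.1, 0.5, 0.3, 0.1], max_weight=0.4))
```

Because the full excess is handed out on every pass, the weights keep summing to one, and an asset pushed over the cap by a redistribution is clipped on the next pass.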
[0036] To explicitly incorporate transaction costs into the environment update, a transaction residual factor model is used during action execution. Based on the rebalancing range between the previous actual weight and the current target weight, and combined with the transaction fee rate, a transaction friction correction term is calculated. This yields the portfolio value and realized weight after deducting transaction costs. While the specific fee function can be adjusted according to actual market conditions, its core calculation logic is: first, calculate the rebalancing cost from the old weight to the new weight, and then apply this cost to the portfolio value update to avoid the strategy ignoring the actual losses caused by frequent turnover during training.
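The cost logic in this paragraph (measure the rebalancing from old to new weights, then charge it against portfolio value) can be sketched with a first-order approximation. The source leaves the exact fee function open, so the linear form below, the exclusion of the cash leg from turnover, and the name `apply_rebalance` are assumptions rather than the patent's formula.

```python
def apply_rebalance(value, old_weights, target_weights, fee_rate):
    """Return (value after costs, transaction residual factor).

    Assumptions: index 0 is cash and is not charged; the cost is linear
    in the absolute weight change of each non-cash asset.
    """
    turnover = sum(abs(t - o)
                   for t, o in zip(target_weights[1:], old_weights[1:]))
    residual = 1.0 - fee_rate * turnover  # fraction of value that survives
    return value * residual, residual

if __name__ == "__main__":
    new_value, mu = apply_rebalance(
        100.0, [1.0, 0.0, 0.0], [0.5, 0.25, 0.25], fee_rate=0.001)
    print(new_value, mu)
```

Moving half of an all-cash portfolio into two assets at a 0.1% fee rate leaves a residual factor just below one; a strategy that churns its weights every day pays this factor repeatedly, which is the behaviour the training environment is meant to penalize.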
[0037] (4) Construction of the temporal portfolio graph (TPG). To uniformly characterize cross-asset structural relationships and time-dependent features, this invention constructs a temporal portfolio graph (TPG), the detailed structure of which is shown in Figure 2. For each asset node, the node input is first constructed from the current price features, the action of the previous time step, the reward of the previous time step, and the historical hidden state. The node representation $h_i$ is then obtained through a node embedding network. Subsequently, a dynamic similarity graph is constructed from the squared Euclidean distance between node embeddings: $d_{ij} = \lVert h_i - h_j \rVert_2^2$; $S_{ij} = \exp(-d_{ij} / \tau)$; where $\tau$ is the scale parameter of the heat kernel. To suppress weak-connection noise, a threshold filter with threshold $\epsilon$ is applied to the similarity matrix to obtain the filtered adjacency matrix: $\tilde{A}_{ij} = S_{ij}$ if $S_{ij} \ge \epsilon$, and $\tilde{A}_{ij} = 0$ otherwise; then, by symmetrization, the final weighted adjacency matrix is obtained: $A = \tfrac{1}{2}(\tilde{A} + \tilde{A}^{\top})$.
(5) Context embedding based on GCN and task embedding based on GRU. After obtaining the weighted adjacency matrix, a graph convolutional network is used to extract the spatial structural dependencies between assets. This model employs a two-layer graph convolution, the computation of which can be written as: $H^{(1)} = \sigma(\hat{A} X W^{(0)})$; $H^{(2)} = \sigma(\hat{A} H^{(1)} W^{(1)})$; where $X$ is the node feature matrix, $W^{(0)}$ and $W^{(1)}$ are trainable parameters, $\hat{A}$ is the normalized graph adjacency matrix, and $\sigma(\cdot)$ is a nonlinear activation function.
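A forward pass through the two-layer graph convolution described above can be sketched in plain Python. The normalization $D^{-1/2}(A + I)D^{-1/2}$ with added self-loops and the ReLU activation are common choices assumed here, not details confirmed by the source; all helper names are hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (assumed form)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # >= 1 because of the self-loop
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Dense matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def gcn_two_layer(adj, features, w0, w1):
    """Two rounds of neighbor aggregation, each followed by a linear map."""
    a_norm = normalize_adjacency(adj)
    hidden = relu(matmul(matmul(a_norm, features), w0))
    return relu(matmul(matmul(a_norm, hidden), w1))

if __name__ == "__main__":
    adj = [[0.0, 1.0], [1.0, 0.0]]
    x = [[1.0, 0.0], [0.0, 1.0]]
    w0 = [[1.0, 0.0], [0.0, 1.0]]
    w1 = [[1.0], [1.0]]
    print(gcn_two_layer(adj, x, w0, w1))
```

Each layer mixes a node's features with those of its neighbors before the learned projection, which is how information about related assets reaches a node after two hops.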
[0038] To extract the global context embedding from all node representations, an attention aggregation mechanism is introduced. For the $i$-th asset node with graph-convolved representation $z_i$, an attention score $e_i$ is computed; the weight of each node is obtained after softmax normalization, $\alpha_i = \exp(e_i) / \sum_j \exp(e_j)$, which yields the global context embedding: $c_t = \sum_i \alpha_i z_i$. This context embedding represents the overall cross-asset structural state of the market at the current moment. To further characterize the temporal evolution of asset correlations, this invention introduces a gated recurrent unit (GRU) after graph structure encoding to recursively update the sequence of graph representations from historical moments. Its calculation can be written as: $m_t = \mathrm{GRU}(u_t, m_{t-1})$; where $u_t$ is an intermediate representation obtained by aggregation based on the current graph structure and $m_{t-1}$ is the temporal context hidden state. Finally, the graph context embedding is fused with the task embedding generated by the GRU to obtain a global embedding representation that can be shared by all agents. This design enables the joint modeling of spatial structural information and temporal dynamic information.
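The attention aggregation step (score each node, softmax the scores, take the weighted sum) can be sketched as below. The scoring function is not recoverable from the source, so a dot product with a learnable vector `score_vector` is assumed purely for illustration.

```python
import math

def attention_pool(node_reprs, score_vector):
    """Softmax-weighted sum of node representations.

    Assumption: the score of a node is the dot product of its
    representation with score_vector; the patent's scoring function
    may differ.
    """
    scores = [sum(v * h for v, h in zip(score_vector, node))
              for node in node_reprs]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_reprs[0])
    context = [sum(w * node[d] for w, node in zip(weights, node_reprs))
               for d in range(dim)]
    return context, weights

if __name__ == "__main__":
    context, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
    print(context, weights)
```

With a zero score vector every node receives the same weight and the context reduces to the plain average; a trained score vector lets the pooled embedding emphasize the assets that matter most for the current market state.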
[0039] (6) Multi-agent strategy network with a shared Actor backbone and agent-specific heads. In the policy learning part, this invention adopts a multi-agent Actor-Critic structure that combines a shared Actor backbone with agent-specific heads. The shared backbone network is responsible for extracting reusable common policy features among different agents; the agent-specific heads output the target allocation weights on their respective local asset subsets. This structure improves parameter utilization while preserving the differences in asset volatility patterns and local features among different subsets.
[0040] The global graph embedding can also act on the shared backbone features through feature modulation: the backbone features of agent $k$ are modulated by two linear projection functions, $f_\gamma$ and $f_\beta$, of the global embedding, with a coefficient $\eta$ that sets the modulation strength of the global path. In this way, local strategies can dynamically perceive changes in the global market structure.
[0041] Correspondingly, during the centralized training phase, the Critic network of each agent estimates the state-action value from joint observations, joint actions, and global embeddings, and is trained on a temporal-difference objective. The Actor network of each agent is updated according to the deterministic policy gradient.
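One way to read this feature modulation is as a scale-and-shift of the backbone features driven by the global embedding. The exact expression is not recoverable from the source, so the form `z * (1 + eta * gamma) + eta * beta` and the helper names below are assumptions chosen so that `eta = 0` leaves the backbone features unchanged.

```python
def linear(x, weight, bias):
    """Affine map: one output per row of weight."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def modulate(backbone, global_emb, gamma_w, gamma_b, beta_w, beta_b, eta):
    """Scale and shift backbone features using the global embedding.

    Assumed form: z * (1 + eta * gamma(g)) + eta * beta(g), where gamma
    and beta are linear projections of the global embedding g.
    """
    gamma = linear(global_emb, gamma_w, gamma_b)
    beta = linear(global_emb, beta_w, beta_b)
    return [z * (1.0 + eta * g) + eta * b
            for z, g, b in zip(backbone, gamma, beta)]

if __name__ == "__main__":
    out = modulate(
        backbone=[1.0, 2.0], global_emb=[1.0],
        gamma_w=[[1.0], [1.0]], gamma_b=[0.0, 0.0],
        beta_w=[[2.0], [2.0]], beta_b=[0.0, 0.0],
        eta=0.5)
    print(out)
```

Under this form the modulation strength interpolates between a purely local policy and one whose features are strongly conditioned on the market-wide graph state.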
[0042] (7) Portfolio growth regularization, mutual information objectives and course learning To ensure the strategy simultaneously considers return growth, stability, and feasibility, this invention further introduces portfolio gradient regularization loss, graph encoding mutual information objective, and a phased learning mechanism. The overall training objective can be written as: ; in, This represents a regularization term geared towards portfolio growth, designed to enhance portfolio growth capabilities while taking into account turnover costs. The mutual information target of the graph encoder is used to enhance the consistency between node representations and global context representations; and These are the weighting coefficients.
[0043] This invention employs a curriculum learning mechanism that divides the training process into an absolute return optimization stage, a benchmark alignment stage, and a risk shaping stage. This allows the model to transition gradually from a "learn to grow first" training approach to one that "balances relative performance and risk control." This design is particularly suitable for real-world market environments where trading frictions, position limits, and multi-objective risk control coexist.
[0044] (8) Extraction and fusion of causal prior graphs. To further improve the stability and interpretability of graph structure modeling, this invention introduces into the TPG a static prior graph based on time-series causal discovery. Before training begins, the model selects data for a fixed proportion of consecutive trading days from the training set and calculates the logarithmic return of each asset from its closing price, forming a multivariate time series. The PCMCI+ algorithm is then run to obtain a causal strength matrix $C \in \mathbb{R}^{N \times N}$, where $N$ is the number of assets. After symmetrization, $C$ serves as a static causal prior that remains unchanged during subsequent training.
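As a non-limiting illustration of the steps around the causal discovery run, the sketch below computes logarithmic returns from closing prices and turns a lagged link-strength array into a symmetric static prior. The PCMCI+ run itself is not shown; the array `strength` stands in for its output, and both the collapse over lags (largest absolute value) and the symmetrization rule (element-wise maximum with the transpose) are assumptions, since the specification does not fix them.

```python
import numpy as np

def log_returns(close):
    """Daily log returns from a (T, N) array of closing prices."""
    close = np.asarray(close, dtype=float)
    return np.diff(np.log(close), axis=0)

def static_causal_prior(strength):
    """Collapse strength[i, j, lag] over lags and symmetrize.

    Assumptions: the strongest absolute link over all lags is kept,
    and symmetrization takes the element-wise maximum of C and C^T.
    """
    c = np.abs(np.asarray(strength, dtype=float)).max(axis=2)
    c = np.maximum(c, c.T)
    np.fill_diagonal(c, 0.0)   # drop self-links from the prior
    return c

close = np.array([[100.0, 50.0], [110.0, 55.0], [121.0, 55.0]])
rets = log_returns(close)              # multivariate series fed to causal discovery

strength = np.zeros((2, 2, 2))         # placeholder for a causal discovery result
strength[0, 1, 1] = -0.4               # asset 0 -> asset 1 at lag 1
strength[1, 0, 0] = 0.1                # asset 1 -> asset 0 at lag 0
prior = static_causal_prior(strength)
```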
[0045] In the fusion phase, the model fuses the similarity embedding obtained from the dynamic similarity matrix with the causal embedding obtained from the causal strength matrix, which can be expressed as: $E = (1 - \beta)\,E_{sim} + \beta\,E_{causal}$; where $E_{sim}$ is the similarity embedding, $E_{causal}$ is the causal embedding, and $\beta$ represents the fusion strength coefficient. The core objective of this fusion is to introduce the directional and interpretable constraints provided by the causal prior while preserving the dynamic graph's adaptability to shifts in market structure, thereby reducing the interference of spurious connections with the propagation of graph information.
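As a non-limiting illustration of the fusion, the sketch below propagates node features once over the dynamic similarity graph and once over the static causal prior, then mixes the two results with the fusion strength coefficient. The convex-combination form, the single propagation step, and the names `row_normalize` and `fused_embedding` are assumptions made for the example.

```python
import numpy as np

def row_normalize(m):
    """Divide each row by its sum; all-zero rows stay zero."""
    m = np.asarray(m, dtype=float)
    s = m.sum(axis=1, keepdims=True)
    return np.divide(m, s, out=np.zeros_like(m), where=s > 0)

def fused_embedding(x, a_sim, c_prior, beta=0.3):
    """Mix similarity-graph and causal-graph embeddings of node features x."""
    e_sim = row_normalize(a_sim) @ x         # one propagation step on the dynamic graph
    e_causal = row_normalize(c_prior) @ x    # one propagation step on the static prior
    return (1.0 - beta) * e_sim + beta * e_causal

x = np.eye(2)                                # toy node features
a_sim = np.array([[1.0, 0.0], [0.0, 1.0]])
c_prior = np.array([[0.0, 1.0], [1.0, 0.0]])
fused = fused_embedding(x, a_sim, c_prior, beta=0.3)
```

Setting `beta=0` recovers the purely similarity-based embedding, which matches the use of β=0 as the no-prior control.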
[0046] (9) Aggregation and final output of the multi-agent sub-portfolios. During the training and evaluation phases, the model assumes that all agents initially receive equal amounts of funds and independently manage their corresponding local sub-portfolios. Let the value of the sub-portfolio of the $i$-th agent at time $t$ be $V_t^{(i)}$, and let $M$ be the number of agents. Then the overall portfolio value can be defined as: $V_t = \frac{1}{M}\sum_{i=1}^{M} V_t^{(i)}$; the global return series is derived from the overall value trajectory: $r_t = \ln\big(V_t / V_{t-1}\big)$. Finally, based on the local allocation results of all agents and the overall portfolio aggregation rule, the system outputs the final portfolio weights that satisfy the actual trading constraints. This aggregation method avoids introducing an additional capital reallocation module while fairly reflecting the average decision-making ability of the multi-agent framework in a large-scale asset pool.
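As a non-limiting illustration of the aggregation, the sketch below averages the sub-portfolio values of equally funded agents and derives the global return series from the resulting value trajectory. Averaging (rather than summing) and the use of logarithmic returns are assumptions chosen to match the equal-funding description and the log-return convention used elsewhere in this specification.

```python
import numpy as np

def aggregate(sub_values):
    """Overall value and log-return series from a (T, M) array of sub-portfolio values."""
    v = np.asarray(sub_values, dtype=float)
    total = v.mean(axis=1)               # equal initial funding: plain average over agents
    returns = np.diff(np.log(total))     # global log returns from the value trajectory
    return total, returns

# Two agents over three time steps, both starting from one unit of capital.
total, returns = aggregate([[1.0, 1.0], [1.1, 0.9], [1.2, 1.0]])
```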
[0047] Example 1: Single-market stock portfolio investment management based on CSI 100 constituent stocks. This example uses the CSI 100 index constituent stocks as investment targets and employs the portfolio investment management method based on multi-agent reinforcement learning and causal discovery described in this invention for quantitative investment decision-making. The specific implementation steps are as follows: Step 1: Data Acquisition and Preprocessing. Collect daily market data for the CSI 100 Index constituent stocks from January 1, 2015 to December 31, 2023, including daily high, low, and closing prices. Preprocess the raw data: First, use linear interpolation to fill in missing values to ensure consistent time series lengths for all assets; then, perform z-score standardization on all price features to eliminate dimensional differences; next, calculate the daily logarithmic return series for each stock and construct the benchmark return series for the CSI 100 Index; simultaneously, initialize historical weight memory information, with all assets having an initial weight of 0 and cash having a weight of 1.
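As a non-limiting illustration of the preprocessing in Step 1, the sketch below fills a missing closing price by linear interpolation, standardizes the series, computes logarithmic returns, and initializes the weight memory with all capital in cash. The helper names and the toy prices are illustrative; computing the z-score over the whole series is a simplification, and in practice the statistics would typically be estimated on the training period only.

```python
import numpy as np

def fill_linear(x):
    """Fill NaN gaps in a 1-D price series by linear interpolation."""
    x = np.asarray(x, dtype=float).copy()
    idx = np.arange(len(x))
    known = ~np.isnan(x)
    x[~known] = np.interp(idx[~known], idx[known], x[known])
    return x

def zscore(x):
    """Standardize to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

close = fill_linear([10.0, np.nan, 12.0, 13.0])   # toy closing prices with one gap
close_z = zscore(close)                            # standardized price feature
log_ret = np.diff(np.log(close))                   # daily logarithmic returns

n_assets = 1
init_weights = np.zeros(n_assets + 1)              # last entry is cash
init_weights[-1] = 1.0                             # all capital starts in cash
```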
[0048] Step 2: Asset Partitioning and Environment Construction. The 100 constituent stocks of the CSI 100 Index are divided into 10 non-overlapping local asset subsets, each containing 10 stocks. An independent intelligent agent is configured for each subset. A portfolio investment environment is constructed and modeled as a partially observable Markov decision process. Transaction costs are set at 0.3% (two-way), the maximum position limit for a single asset is 10%, and rebalancing is performed daily. The observation window length for each agent is set to 30 trading days, meaning that each agent observes the price features of the 10 stocks it is responsible for over the past 30 trading days, the portfolio weights at the previous time step, and the global embedding representation output by the TPG.
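As a non-limiting illustration of Step 2, the sketch below splits 100 asset indices into 10 equal subsets and counts the entries of one agent's observation. It assumes a flattened observation, three price features per stock (high, low, close), a previous-weight vector that includes cash, and the 128-dimensional global embedding of this example; the contiguous split is also an assumption, since the partitioning rule is not specified.

```python
def partition(assets, n_subsets):
    """Split a list of assets into equally sized, non-overlapping subsets."""
    size, rem = divmod(len(assets), n_subsets)
    if rem:
        raise ValueError("asset count must divide evenly into the subsets")
    return [assets[k * size:(k + 1) * size] for k in range(n_subsets)]

def observation_dim(n_local, n_features, window, embed_dim):
    """Length of one agent's flattened observation vector."""
    price_window = n_local * n_features * window   # price features over the window
    prev_weights = n_local + 1                     # previous weights, including cash
    return price_window + prev_weights + embed_dim

subsets = partition(list(range(100)), 10)
obs_dim = observation_dim(n_local=10, n_features=3, window=30, embed_dim=128)
```

Under these assumptions each observation has 900 + 11 + 128 = 1039 entries.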
[0049] Step 3: Constructing the Temporal Portfolio Graph (TPG). Each asset is treated as a node. Node input features include the current highest price, lowest price, and closing price, the action weight of the previous time step, the reward of the previous time step, and the historical hidden state. A two-layer fully connected network is used as the node embedding network to map the input features of each node into a 64-dimensional node representation. The heat kernel similarity is calculated from the squared Euclidean distance between node representations, with the heat kernel scaling parameter λ set to 1.0. The similarity matrix is thresholded, retaining only connections with similarity greater than 0.1. Symmetrization and row normalization are then performed to obtain a weighted adjacency matrix for graph convolution.
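As a non-limiting illustration of the dynamic similarity graph in Step 3, the sketch below builds the weighted adjacency matrix from node representations. The kernel form exp(-d²/λ) is an assumption, since the specification only names the heat kernel, its scaling parameter λ, and the threshold; the function name and toy node representations are illustrative.

```python
import numpy as np

def heat_kernel_adjacency(z, lam=1.0, tau=0.1):
    """Weighted adjacency from node representations z of shape (N, D)."""
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)   # squared Euclidean distances
    s = np.exp(-sq / lam)                                      # heat kernel similarity
    s = np.where(s > tau, s, 0.0)                              # keep links above the threshold
    s = 0.5 * (s + s.T)                                        # symmetrize
    return s / s.sum(axis=1, keepdims=True)                    # row-normalize

z = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])            # toy node representations
adj = heat_kernel_adjacency(z, lam=1.0, tau=0.1)
```

Each node keeps its self-similarity of 1, so every row has a positive sum and the normalization is well defined. Row normalization comes last, so the final matrix is generally not symmetric.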
[0050] Step 4: Global Embedding Generation. A two-layer graph convolutional network is used to spatially aggregate the dynamic similarity graph. The first layer has an output dimension of 64, and the second layer has an output dimension of 32. A global context embedding is then calculated using an attention aggregation mechanism, implemented as a two-layer fully connected network. The context embeddings at each time step are input into a gated recurrent unit (GRU). The GRU has a hidden layer dimension of 64 and is used to learn the temporal evolution of asset correlations. Finally, the graph context embeddings are concatenated and fused with the task embeddings output by the GRU to obtain a 128-dimensional global embedding representation, which is shared by all intelligent agents.
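As a non-limiting illustration of the spatial part of Step 4, the sketch below applies two graph convolution layers (64 then 32 output dimensions) and pools the node states into one context vector with a small attention scorer. The ReLU and tanh activations, the 16-unit scorer, the random weights, and the uniform placeholder adjacency are assumptions; the GRU and the final concatenation are omitted.

```python
import numpy as np

def gcn_layer(a, h, w):
    """One graph convolution: aggregate neighbors with a, project with w, apply ReLU."""
    return np.maximum(a @ h @ w, 0.0)

def attention_pool(h, w1, w2):
    """Score each node with a two-layer scorer and return the weighted sum of node states."""
    scores = np.tanh(h @ w1) @ w2
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ h, alpha

rng = np.random.default_rng(0)
n = 100
a = np.full((n, n), 1.0 / n)                 # placeholder row-normalized adjacency
x = rng.normal(size=(n, 64))                 # 64-dimensional node representations
w_a = rng.normal(scale=0.1, size=(64, 64))   # first layer: 64 -> 64
w_b = rng.normal(scale=0.1, size=(64, 32))   # second layer: 64 -> 32
h = gcn_layer(a, gcn_layer(a, x, w_a), w_b)
context, alpha = attention_pool(h, rng.normal(scale=0.1, size=(32, 16)),
                                rng.normal(scale=0.1, size=16))
```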
[0051] Step 5: Construction and Training of the Multi-Agent Actor-Critic Network. A multi-agent Actor-Critic network is constructed. The shared Actor backbone uses a three-layer fully connected network with an input dimension equal to the sum of the local observation dimension and the global embedding dimension. The hidden layer dimensions are 256 and 128, respectively. Global graph embedding is applied to the shared backbone features using feature modulation. Each agent corresponds to an agent-specific head, using a one-layer fully connected network with an output dimension of 11 (10 stocks + cash). Each agent corresponds to an independent Critic network, using a three-layer fully connected network with an input dimension equal to the sum of the joint observation dimension, the joint action dimension, and the global embedding dimension. The hidden layer dimensions are 512, 256, and 128, respectively, and the output dimension is 1.
[0052] Joint training was conducted using a centralized training and decentralized execution paradigm, divided into three phases: The first phase, absolute return optimization, lasted 100 epochs, with the reward function containing only the portfolio's logarithmic return; the second phase, benchmark alignment, lasted another 100 epochs, adding a benchmark return alignment term to the logarithmic return in the reward function; and the third phase, risk shaping, lasted another 100 epochs, further incorporating maximum drawdown and turnover penalties into the reward function. During training, portfolio gradient regularization loss and mutual information objectives were introduced, with weight coefficients λpg set to 0.1 and λmi set to 0.05.
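As a non-limiting illustration of the three-phase schedule, the sketch below selects the training phase from the epoch index and assembles the reward accordingly. The phase length of 100 epochs follows this example; the form of the benchmark alignment term (excess log return over the benchmark) and the unit penalty weights `w_bench`, `w_dd`, and `w_to` are assumptions.

```python
def phase_of(epoch, epochs_per_phase=100):
    """0: absolute return, 1: benchmark alignment, 2: risk shaping."""
    return min(epoch // epochs_per_phase, 2)

def reward(log_ret, bench_log_ret, max_drawdown, turnover, epoch,
           w_bench=1.0, w_dd=1.0, w_to=1.0):
    """Phase-dependent reward; later phases add terms to the earlier ones."""
    phase = phase_of(epoch)
    r = log_ret                                            # phase 0: log return only
    if phase >= 1:
        r += w_bench * (log_ret - bench_log_ret)           # benchmark alignment term
    if phase >= 2:
        r -= w_dd * max_drawdown + w_to * turnover         # drawdown and turnover penalties
    return r

r0 = reward(0.01, 0.005, 0.1, 0.2, epoch=0)
r1 = reward(0.01, 0.005, 0.1, 0.2, epoch=150)
r2 = reward(0.01, 0.005, 0.1, 0.2, epoch=250)
```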
[0053] Step 6: Causal Prior Graph Extraction and Fusion. Before training begins, the daily logarithmic return series of each stock is calculated from the training data covering January 1, 2015 to December 31, 2020, forming a 100-dimensional multivariate time series. The PCMCI+ algorithm is run to perform time-series causal discovery, resulting in a 100×100 causal strength matrix, which is then symmetrized to serve as the static causal prior graph. During training, the similarity embeddings corresponding to the dynamic similarity graph and the causal embeddings corresponding to the causal prior graph are weighted and fused, with the fusion strength coefficient β set to 0.3.
[0054] Step 7: Action Post-Processing and Portfolio Output. At the end of each trading day, each agent outputs local target portfolio weights based on its current observations. Constraint-aware action post-processing is performed on the output weights. First, individual asset weights are clipped to the position limit, with any portion exceeding 10% being proportionally redistributed to the other stocks under the agent's responsibility that have not reached their limits, or to cash. Then, the rebalancing amount is calculated from the actual weights at the previous time step and the current target weights. A transaction residual factor is calculated using the 0.3% transaction fee rate to obtain the actual execution weights after deducting transaction costs. Finally, the local allocation results of all agents are aggregated to obtain the global portfolio weights.
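As a non-limiting illustration of the post-processing in Step 7, the sketch below caps asset weights at the position limit, redistributes the excess in proportion to the current weights of the uncapped entries, and computes a transaction residual factor. It assumes that the last entry of each weight vector is cash and that cash is never capped; the first-order factor 1 − fee × turnover is a simplification of the cost calculation, which the specification does not spell out.

```python
import numpy as np

def cap_weights(w, cap=0.10, max_iter=100):
    """Clip asset weights at cap and redistribute the excess; last entry is cash."""
    w = np.asarray(w, dtype=float).copy()
    is_asset = np.ones(len(w), dtype=bool)
    is_asset[-1] = False                              # cash is never capped
    for _ in range(max_iter):
        over = is_asset & (w > cap + 1e-12)
        if not over.any():
            break
        excess = (w[over] - cap).sum()
        w[over] = cap
        free = ~is_asset | (w < cap - 1e-12)          # cash plus assets still below the cap
        base = w[free].sum()
        if base > 0:
            w[free] += excess * w[free] / base        # proportional redistribution
        else:
            w[-1] += excess                           # nothing to scale: park excess in cash
    return w

def transaction_residual(w_prev, w_target, fee=0.003):
    """Fraction of portfolio value left after paying fees on the traded amount."""
    turnover = np.abs(np.asarray(w_target)[:-1] - np.asarray(w_prev)[:-1]).sum()
    return 1.0 - fee * turnover

w_prev = np.array([0.0, 0.0, 0.0, 1.0])               # previous weights: all cash
w_cap = cap_weights([0.30, 0.10, 0.05, 0.55], cap=0.10)
mu = transaction_residual(w_prev, w_cap, fee=0.003)
```

In the toy case the 0.20 of excess from the first asset is split between the third asset and cash in proportion to their weights, so the weights still sum to one after capping.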
[0055] Comparative Example 1: A comparative experiment was conducted using the single-agent PPO method. The investment targets, data time periods, transaction costs, and position constraints were set identically to Example 1. The single agent's action space had 101 dimensions (100 stocks + cash), and the observation window length was also 30 trading days. The policy network used a three-layer fully connected network with hidden layer dimensions of 512, 256, and 128. The training process was also divided into three stages, with a total of 300 epochs.
[0056] Comparative Example 2: A comparative experiment was conducted using the EIIE method, with all experimental settings consistent with Example 1 and Comparative Example 1. The EIIE model employs a parameter-shared convolutional neural network structure, with a 30×3×100 price tensor as input and a 101-dimensional portfolio weight vector as output. The training process also consists of three phases, with a total of 300 epochs.
[0057] Comparative Example 3: A comparative experiment was conducted using the EI3 method, with all experimental settings consistent with the preceding comparative examples. The EI3 model builds on EIIE by introducing a multi-scale convolutional structure to capture short-term, medium-term, and long-term market patterns. The training process also consists of three phases, with a total of 300 epochs.
[0058] Experimental results show that on the test set from January 1, 2021 to December 31, 2023, the cumulative return of Example 1 of this invention is significantly higher than that of the three comparative examples, the Sharpe ratio is about 20% higher than that of the best-performing comparative method, EI3, and the maximum drawdown is reduced by about 15%. In addition, this invention exhibits better training stability, with less fluctuation in the reward curve and faster convergence during training. This demonstrates that this invention effectively improves the performance and stability of investment decisions in large-scale asset pools through multi-agent decomposition, temporal graph structure encoding, and causal prior fusion.
[0059] Example 2: Investment Management of CSI 300 Component Stocks with Added Causal Prior Enhancement. This example uses the CSI 300 index component stocks as investment targets and focuses on verifying the performance improvement effect of causal prior fusion on the model. The specific implementation steps are as follows: Step 1: Data Acquisition and Preprocessing. Collect daily market data for the CSI 300 Index constituent stocks from January 1, 2015 to December 31, 2023, including daily high, low, and closing prices. Perform the same preprocessing operations as in Example 1 on the raw data, including missing value imputation, time alignment, and standardization, to construct return series, benchmark return series, and historical weight memory information.
[0060] Step 2: Asset Partitioning and Environment Construction. The CSI 300 constituent stocks are divided into 15 non-overlapping local asset subsets, each containing 20 stocks. An independent intelligent agent is configured for each subset. A portfolio investment environment is constructed, with transaction costs set at 0.3% (two-way), a maximum position limit of 5% for a single asset, and daily rebalancing. The observation window length for each agent is set to 60 trading days to capture longer-term market patterns.
[0061] Step 3: Construction of Temporal Portfolio Graph (TPG) and Generation of Global Embeddings. The construction of the Temporal Portfolio Graph (TPG) follows the same methods as in Example 1, using node input features, node embedding networks, and dynamic similarity graphs. The graph convolutional network employs a two-layer structure with output dimensions of 128 and 64, respectively. The attention aggregation mechanism and GRU settings are similar to those in Example 1, with the hidden layer dimension of the GRU set to 128, ultimately resulting in a 256-dimensional global embedding representation.
[0062] Step 4: Construction and Training of the Multi-Agent Actor-Critic Network. A multi-agent Actor-Critic network is constructed, sharing an Actor backbone with a four-layer fully connected network and hidden layer dimensions of 512, 256, 128, and 64. Each agent's specific head uses a one-layer fully connected network with an output dimension of 21 (20 stocks + cash). Each agent's Critic network uses a four-layer fully connected network with hidden layer dimensions of 1024, 512, 256, and 128. The training process is also divided into three stages, with a total of 400 epochs and approximately 133 epochs per stage. Portfolio gradient regularization loss and mutual information objective are introduced, with weight coefficients λpg set to 0.15 and λmi set to 0.08.
[0063] Step 5: Causal Prior Graph Extraction and Comparison of Different Fusion Strengths. Before training begins, the PCMCI+ algorithm is run on the training data from January 1, 2015 to December 31, 2020 to obtain a 300×300 causal strength matrix, which is then symmetrized and used as the static causal prior graph. To verify the effect of causal prior fusion, five sets of comparative experiments are run with the fusion strength coefficient β set to 0, 0.2, 0.4, 0.6, and 0.8. β=0 corresponds to the case without a causal prior and serves as the internal control in this embodiment.
[0064] Step 6: Action Post-Processing and Portfolio Output. The post-processing procedure is the same as in Example 1. A 5% cap is applied to the weights of individual assets, and any excess is redistributed proportionally. The transaction residual factor is calculated to obtain the actual execution weights. Finally, the local allocation results of all agents are aggregated to obtain the global portfolio weights.
[0065] Experimental results show that on the test set from January 1, 2021 to December 31, 2023, all experimental groups with causal priors outperformed the control group with β=0. The model achieved optimal performance when the fusion strength coefficient β=0.4, with cumulative returns approximately 18% higher, Sharpe ratios approximately 25% higher, and maximum drawdowns reduced by approximately 20% compared to β=0. Particularly during the two extreme market downturns in April 2022 and October 2023, the experimental groups with causal priors were able to adjust positions more promptly, effectively controlling the downside risk of the portfolio. This demonstrates that causal priors can provide more stable and interpretable constraints for graph structure modeling, significantly improving the model's robustness in non-stationary markets and extreme conditions.
[0066] Example 3: Multi-asset class mixed portfolio investment management. This example extends the method of the present invention to the scenario of multi-asset class mixed portfolio investment. The investment targets include three major asset classes: stocks, bonds, and commodities. The specific implementation steps are as follows: Step 1: Data Acquisition and Preprocessing. Collect multi-asset market data from January 1, 2015 to December 31, 2023. Equity assets include 100 highly liquid stocks from the CSI 300 Index constituents; bond assets include 10 major government bond and corporate bond indices; and commodity assets include 5 major commodity futures indices. Preprocess the daily high, low, and closing prices of all assets, following the same procedure as in Example 1.
[0067] Step 2: Asset Partitioning and Environment Construction. Assets are partitioned by category: the 100 stocks are divided into 10 subsets of 10 stocks each; the 10 bond indices are divided into 2 subsets of 5 bond indices each; and the 5 commodity futures indices form 1 subset of 5 commodities. A total of 13 intelligent agents are configured, each responsible for a different asset subset. A portfolio investment environment is constructed, with transaction costs set at 0.3% (two-way) for equities, 0.1% (two-way) for bonds, and 0.2% (two-way) for commodities. The maximum position limits for a single asset are set by asset class: 10% for equities, 20% for bonds, and 15% for commodities. Rebalancing is performed weekly to reduce transaction costs.
[0068] Step 3: Construction of the Temporal Portfolio Graph (TPG) and Global Embedding. A temporal portfolio graph (TPG) containing all assets is constructed, comprising 115 nodes (100 stocks + 10 bonds + 5 commodities). In addition to price features, historical actions, and historical rewards, asset class identifiers are added to the node input features to distinguish different asset types. The node embedding network employs a three-layer fully connected network with an output dimension of 64. The dynamic similarity graph is constructed in the same manner as in Example 1, with the heat kernel scaling parameter λ set to 0.8 and the similarity threshold τ set to 0.05.
[0069] A two-layer graph convolutional network is used for spatial aggregation, with output dimensions of 64 and 32, respectively. The attention aggregation mechanism adopts a category-aware attention mechanism, assigning different attention weights to assets of different categories. The hidden layer dimension of the GRU is set to 64, resulting in a 128-dimensional global embedding representation.
[0070] Step 4: Construction and Training of the Multi-Agent Actor-Critic Network. A multi-agent Actor-Critic network is constructed, sharing a common Actor backbone and employing a three-layer fully connected network with hidden layer dimensions of 256, 128, and 64 respectively. Different agent-specific heads are designed for different asset classes: the output dimension for stock agents is 11 (10 stocks + cash), for bond agents it is 6 (5 bonds + cash), and for commodity agents it is 6 (5 commodities + cash). Each agent's Critic network employs a three-layer fully connected network with hidden layer dimensions of 512, 256, and 128 respectively.
[0071] The training process is divided into three phases, with a total of 350 epochs. In addition to portfolio return, benchmark alignment, and risk penalty terms, the reward function also includes an asset class dispersion penalty term to prevent the portfolio from over-concentrating on a single asset class. Portfolio gradient regularization loss and mutual information objectives are introduced, with weight coefficients λpg set to 0.1 and λmi set to 0.05.
[0072] Step 5: Causal Prior Graph Extraction and Fusion. Before training begins, training data from January 1, 2015 to December 31, 2020 is used to discover causal relationships within and across the asset classes (stocks, bonds, and commodities). The PCMCI+ algorithm is run to obtain a 115×115 causal strength matrix, which is then symmetrized to serve as the static causal prior graph. The fusion strength coefficient β is set to 0.3.
[0073] Step 6: Action Post-Processing and Portfolio Output. Different position limits are applied to different asset classes: equity positions are capped at 10%, bonds at 20%, and commodities at 15%. Any excess is proportionally redistributed to other assets or cash within the corresponding asset class. Then, the transaction residual factor is calculated from the transaction fee rate of each class to obtain the actual execution weights after deducting transaction costs. Finally, the local allocation results of all agents are aggregated to obtain the global multi-asset mixed portfolio weights.
[0074] Comparative Example 4: A comparative experiment was conducted using a traditional mean-variance model. The investment targets, data periods, transaction costs, position constraints, and rebalancing frequency were set exactly the same as in Example 3. Each week, the expected returns and covariance matrix were estimated from the return data of the past 60 trading days. The mean-variance optimization problem was then solved to obtain the portfolio weights for the following week.
[0075] Experimental results show that, on the test set from January 1, 2021 to December 31, 2023, the cumulative return of Example 3 of this invention is approximately 35% higher than that of the traditional mean-variance model, the Sharpe ratio is approximately 40% higher, and the maximum drawdown is reduced by approximately 25%. Furthermore, the portfolio of this invention exhibits more stable performance under different market conditions, and can promptly increase the allocation ratio of bonds and commodities when the stock market declines, achieving effective risk diversification. This demonstrates that the method of this invention is not only applicable to single-market stock portfolios but can also be well extended to multi-asset class mixed portfolio investment scenarios, showing broad application prospects.
[0076] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of protection claimed by the present invention. The scope of protection of the present invention is defined by the appended claims and their equivalents.
Claims
1. A portfolio investment management method based on multi-agent reinforcement learning and causal discovery, characterized in that it includes the following steps: acquiring multi-asset market data and constructing a portfolio investment environment based on the multi-asset market data, wherein the portfolio investment environment includes at least transaction cost constraints and position constraints, and the multi-asset market data includes at least the price features, return sequences, and trading constraint information of multiple assets over consecutive trading days; dividing all assets into multiple non-overlapping local asset subsets and assigning an intelligent agent to each local asset subset to form a multi-agent portfolio investment decision structure; constructing a temporal portfolio graph (TPG) to jointly represent the structural relationships and temporal dependencies among assets, and generating a global embedding representation that can be shared by all intelligent agents; constructing a multi-agent Actor-Critic network that uses a shared Actor backbone and agent-specific heads, so that each intelligent agent combines the global embedding representation with its corresponding local observations to output target portfolio weights; performing constraint-aware action post-processing on the target portfolio weights to obtain allocation results that satisfy actual trading constraints; jointly training the multi-agent Actor-Critic network and the TPG using centralized training and decentralized execution; and extracting, by a time-series causal discovery method, a causal prior graph from the return sequences of the training set, and fusing the causal prior graph with the dynamic similarity graph in the TPG to generate a global embedding representation containing causal structure constraints.
2. The portfolio investment management method based on multi-agent reinforcement learning and causal discovery according to claim 1, characterized in that the preprocessing of the multi-asset market data includes: performing missing-value imputation, time alignment, and standardization on the original price data; extracting at least one price feature from among the highest price, lowest price, and closing price; and constructing a return series, a benchmark return series, and historical weight memory information.
3. The portfolio investment management method based on multi-agent reinforcement learning and causal discovery according to claim 1, characterized in that: the portfolio investment environment is modeled as a partially observable Markov decision process or a Markov decision process; the observations of each intelligent agent include at least the price window features of the corresponding local asset subset, the portfolio weights at the previous time step, and the global embedding representation output by the TPG; and the actions of each intelligent agent include at least the target allocation weights of cash and of the corresponding local asset subset.
4. The portfolio investment management method based on multi-agent reinforcement learning and causal discovery according to claim 1, characterized in that the construction of the TPG includes: constructing the input features of each asset node based on its price features at the current moment, its action at the previous moment, its reward at the previous moment, and its historical hidden state; performing node embedding on the input features of the asset nodes to obtain the node representation of each asset; and calculating the similarity between assets based on the node representations and constructing a dynamic similarity graph accordingly.
5. The portfolio investment management method based on multi-agent reinforcement learning and causal discovery according to claim 4, characterized in that the dynamic similarity graph is constructed as follows: the heat kernel similarity is calculated based on the distance between asset node representations, and the similarity matrix is subjected to threshold filtering, symmetrization, and normalization to obtain a weighted adjacency matrix for graph convolution.
6. The portfolio investment management method based on multi-agent reinforcement learning and causal discovery according to claim 1, characterized in that the joint representation of structural relationships and temporal dependencies by the TPG includes: spatially aggregating the dynamic similarity graph using a graph convolutional network to obtain context embeddings that reflect cross-asset structural relationships; after the spatial aggregation of node embeddings based on the weighted adjacency matrix, inputting the resulting sequence over the time steps into a gated recurrent unit (GRU) to learn the temporal evolution of asset correlations; and fusing the historical task embedding of each asset with the context embedding through an attention mechanism to obtain the global embedding representation.
7. The portfolio investment management method based on multi-agent reinforcement learning and causal discovery according to claim 1, characterized in that the multi-agent Actor-Critic network with a shared Actor backbone and agent-specific heads comprises: a shared Actor backbone network, used to extract common policy features for all intelligent agents; agent-specific heads, each corresponding to one intelligent agent and used to map the common policy features to allocation weights on the corresponding local asset subset; and Critic networks, each corresponding to one intelligent agent and used to estimate the state-action value based on joint observations, joint actions, training phase variables, and the global embedding representation.
8. The portfolio investment management method based on multi-agent reinforcement learning and causal discovery according to claim 1, characterized in that the constraint-aware action post-processing includes: clipping the target portfolio weights output by each intelligent agent at the position cap; reallocating the weights exceeding the position limit to other assets or cash according to preset rules; and calculating the transaction residual factor from the transaction cost rate to obtain the portfolio value and realized weights after deducting transaction frictions.
9. The portfolio investment management method based on multi-agent reinforcement learning and causal discovery according to claim 1, characterized in that the joint training includes: updating each Critic network based on the temporal difference objective; updating the Actor network of each intelligent agent based on a graph-conditioned deterministic policy gradient; introducing a portfolio gradient regularization loss to enhance portfolio growth capability while taking turnover costs into account; and introducing a mutual information objective to update the TPG graph encoder so as to enhance the consistency between the global context representation and the node representations.
10. The portfolio investment management method based on multi-agent reinforcement learning and causal discovery according to claim 1, characterized in that the time-series causal discovery method is the PCMCI+ method, and the construction of the causal prior graph includes: identifying significant dependencies between assets using the return sequences of the training set; generating a static causal adjacency matrix based on the identification results; and weighting and fusing the causal embeddings corresponding to the static causal adjacency matrix with the similarity embeddings corresponding to the dynamic similarity graph to obtain a global embedding representation that includes causal structure constraints.