A financial graph multi-modal understanding method and system based on a graph neural network
By constructing a graph structure and injecting Graph-Prompt, the problem of identifying the structural relationships between elements within financial charts in securities research reports is solved, enabling more accurate chart understanding and analysis, and improving the efficiency and quality of financial data analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TIANJIN POLYTECHNIC UNIV
- Filing Date
- 2026-04-03
- Publication Date
- 2026-06-23
Smart Images

Figure CN121963235B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of intelligent financial chart analysis technology, and in particular relates to a method and system for multimodal understanding of financial charts based on graph neural networks. Background Technology
[0002] Brokerage research reports often contain financial charts. Current financial chart analysis techniques primarily rely on manual reading of these reports, supplemented by tools such as image recognition and OCR. Common methods involve analysts directly observing the chart content, comparing trends, or interpreting the reasons for changes, or using OCR systems to identify chart titles, legends, axes, and annotations, then inputting the results into a natural language processing system for description generation or question answering. Some institutions also employ multimodal large-scale models to interpret financial charts; however, current technologies mainly rely on overall image feature representation combined with simple textual information to generate conclusions, making it difficult to form a true understanding of the chart's internal structure.
[0003] In practical applications, charts in brokerage research reports are often presented in the form of multiple sub-charts, multiple texts, and multiple annotations. Each chart contains various information areas, including titles, main chart areas, legends, scales, annotations, and data sources. Especially in complex topics such as industry comparisons, sector rotation, and macroeconomic indicators, the logical relationships between different areas within the chart are quite close, making it a highly structured visual document. Existing financial chart analysis equipment (including OCR systems and multimodal models) primarily relies on image visual features and text features to determine the meaning of charts, but it cannot identify the relationships between different elements within the chart. For example, the correspondence between legends and curves, the semantic relationship between titles and main charts, and the hierarchical relationship between data sources and annotations are all difficult to accurately model using existing methods.
[0004] Brokerage charts often contain a large amount of structured information presented in text format, such as "Data source: Wind", "Note: The index is based on xx", etc. Figure 1 "Industry prosperity comparison" and other similar rectangular areas have completely different semantic roles. Existing chart understanding models rely solely on pixel features or OCR text, making it impossible to distinguish the functions of different nodes. This can easily lead to errors such as misinterpreting legends as captions or recognizing titles as body text, thus interfering with normal analysis, reducing automation efficiency, and making chart understanding results unstable. Ultimately, this may mislead investment decisions.
[0005] Different areas of a chart exhibit clear structural relationships: the title is typically at the top, summarizing the overall image; the main chart area contains key numerical information such as line charts and bar charts; the legend explains the semantic meaning of curves; and annotations usually describe calculation methods, sample sources, or special cases. These relationships constitute a natural graph structure, but current chart understanding technologies lack explicit encoding methods for this structure. The common approach is to input all text areas into the model in positional order, but this linear concatenation fails to represent the true hierarchical relationships, making it difficult for the model to infer the chart's meaning based on its structure.
[0006] Because financial charts are highly structured, and existing multimodal models are insufficient in modeling structural information, a common industry practice is for developers to manually write prompts, adding titles, annotations, and legends as additional text to the prompt template. However, this manual rule-based approach is ill-suited for complex scenarios and fails to depict the relationships between nodes, lacking versatility and further degrading the model's ability to interpret complex charts. Especially in charts with multiple text blocks, subgraphs, or multi-level structures, the simple method of piecing together structural information is highly susceptible to interference from erroneous OCR outputs, leading to incorrect final results.
[0007] Therefore, under the current technological conditions, how to construct a method that can automatically, uniformly and accurately represent the structural relationships between elements within a chart, and effectively inject this structure into a multimodal large model so that the model can truly understand the internal organizational logic of financial charts, has become a major problem that urgently needs to be solved in the field of intelligent financial chart analysis technology. Summary of the Invention
[0008] This invention provides a method and system for multimodal understanding of financial charts based on graph neural networks. It can automatically, uniformly and accurately represent the structural relationships between elements within the chart and effectively inject this structure into a large multimodal model, enabling the model to truly understand the internal organizational logic of the financial chart.
[0009] To achieve the above objectives, the technical solution of the present invention is as follows:
[0010] A multimodal understanding method for financial charts based on graph neural networks includes:
[0011] S1: Perform structural element analysis on financial charts, dividing them into multiple semantic structural units. Each semantic structural unit contains structural type, spatial location, and text content.
[0012] S2. Using semantic structural units as nodes, edges are constructed based on spatial and semantic relationships between nodes, thereby building a graph structure;
[0013] S3. Perform vectorized encoding on the graph structure to generate full graph structure feature encoding;
[0014] S4. Construct the prompt text for each node and each edge, and integrate them to generate a structured prompt message Graph-Prompt;
[0015] S5. Inject Graph-Prompt into the multimodal large model;
[0016] S6, Multimodal Large Model performs graph understanding tasks based on Graph-Prompt.
[0017] Furthermore, in step S1, the structure type is used to identify the semantic category of the semantic structure unit, the spatial location is used to represent the area occupied by the semantic structure unit in normalized coordinates, and the text content is used to include the text content in the identified semantic structure unit.
[0018] Furthermore, the construction of the graph structure in step S2 includes:
[0019] S201. Construct spatial connection edges based on spatial adjacency relationships to reflect the relative relationships of semantic structural units of different structural types;
[0020] S202. Construct semantic prior edges based on semantic rules to make the graph structure conform to the organizational logic of real financial charts.
[0021] Furthermore, step S3 includes:
[0022] S301. Construct the initial features of the node by converting the node's structural type, spatial location, and text content into vectors and concatenating them.
[0023] S302. Calculate the structural embedding of nodes using a graph attention network;
[0024] S303. Embed the structure of all nodes together and compress to obtain the full graph structure feature encoding.
[0025] Furthermore, step S4 includes:
[0026] S401. Construct node prompt text, including the structural type, spatial location, and text content of each node; and the encoding of the full graph structural features;
[0027] S402. Construct edge hint text, including the edge type;
[0028] S403. Integrate node tooltip text and edge tooltip text to generate a structured tooltip message Graph-Prompt.
[0029] Furthermore, in step S5, the Graph-Prompt injection of the multimodal large model includes one or more of the following: prefix injection, system prompt injection, and graph-text concatenation input.
[0030] Furthermore, the chart understanding task described in step S6 includes chart question answering, chart description generation, trend analysis, and other extended tasks based on financial charts. Examples include data extraction, anomaly detection, cross-chart comparison, etc.
[0031] In another aspect, this invention proposes a multimodal understanding system for financial charts based on graph neural networks, comprising:
[0032] Parsing module: performs structural element parsing on financial charts, dividing them into multiple semantic structural units, each of which contains structural type, spatial location, and text content;
[0033] Graph structure module: Using semantic structural units as nodes, edges are constructed based on spatial and semantic relationships between nodes, thereby building a graph structure;
[0034] Vectorization encoding module: performs vectorization encoding on the graph structure to generate full graph structure feature encoding;
[0035] Prompt text module: Constructs prompt text for each node and each edge, and integrates it to generate structured prompt information Graph-Prompt;
[0036] Injection module: Injects Graph-Prompt into large multimodal models;
[0037] Task execution module: Multimodal large model performs graph understanding tasks based on Graph-Prompt.
[0038] In another aspect, the present invention provides a computer device, the device including a processor and a memory; the memory is used to store computer programs, and the processor is used to execute corresponding computer program code to implement the above-described method for multimodal understanding of financial charts based on graph neural networks.
[0039] In another aspect, the present invention provides a computer-readable storage medium carrying computer program code, which is invoked by a processor to implement the above-described method for multimodal understanding of financial charts based on graph neural networks.
[0040] Compared with the prior art, the beneficial effects and advantages of the present invention are as follows:
[0041] (1) It realizes explicit structural modeling of financial charts, which significantly improves the interpretability and structural understanding of the model.
[0042] Existing multimodal large-scale models primarily rely on image-level global visual features for reasoning, failing to explicitly utilize structural information such as graph titles, footnotes, legends, and axis ticks. This invention, through structural element detection, graph structure construction, and graph neural network embedding, transforms the explicit structural information within graphs into parsable graph-prompts, explicitly capturing the logical relationships and organizational structure within the graph. This explicit structural modeling significantly improves the interpretability of the model in financial scenarios, providing question-answering results with a structural basis rather than relying on black-box visual features.
[0043] (2) Constructing spatial and semantic rule edges to improve structural perception capabilities.
[0044] This invention generates edge sets through spatial adjacency relationships and semantic prior rules, enabling the model to perceive the form and semantic relationships of graphs. Compared to relying solely on image convolutional structures or pure visual Transformers, this invention uses explicit graph structure constraints to allow the model to acquire the unique organizational patterns of graphs, thereby improving logical judgment, trend analysis, and text association capabilities.
[0045] (3) Graph neural network node embedding enhances the ability to model local and global structures.
[0046] A graph attention network is used to aggregate node features, integrating node category, location information, and textual information to generate structured node embeddings. These embeddings capture the importance of local neighborhoods and global structural information, enabling the model to more effectively utilize the structural priors of graphs in downstream tasks, thereby improving graph understanding performance.
[0047] (4) The Graph-Prompt injection method is flexible and enhances the adaptability of multimodal large models.
[0048] This invention designs various graph-prompt injection strategies (prefix injection, system prompt injection, and graph-text concatenation input), which can be seamlessly integrated with existing multimodal large models without significant modifications to the model structure. It transforms the structural knowledge of graphs into an interpretable natural language form, making it easier for models to understand and utilize the semantic organization within the graph.
[0049] (5) It has strong versatility and can be extended to different types of financial charts and different model architectures, and has good engineering applicability.
[0050] The structure detection, graph structure construction, graph embedding, and prompt generation in this invention all adopt a modular design. No modification to the model structure is required; only the Graph-Prompt needs to be used as input. This gives the method engineering advantages of low cost, high adaptability, and high transferability.
[0051] (6) Improve the performance of downstream tasks and increase production efficiency.
[0052] Through the aforementioned improvements, this invention significantly enhances performance in tasks such as chart question answering, description generation, and trend analysis. Compared to traditional manual analysis or simple image / text feature fusion methods, this invention reduces human intervention, improves task processing speed and accuracy, thereby increasing the productivity and quality of financial data analysis, report interpretation, and other business operations. Attached Figure Description
[0053] Figure 1 This is a flowchart of Embodiment 1 of the present invention;
[0054] Figure 2 This is the original financial chart of Embodiment 2 of the present invention.
[0055] Figure 3 This is a schematic diagram of the structure of Embodiment 3 of the present invention. Detailed Implementation
[0056] It should be noted that, unless otherwise specified, the embodiments and features described in the present invention can be combined with each other.
[0057] To enable those skilled in the art to better understand the present invention, the technical solution of the present invention will be clearly and completely described below with reference to specific embodiments and accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0058] Example 1:
[0059] This embodiment provides a detailed explanation of the multimodal understanding method for financial charts based on graph neural networks, such as... Figure 1 As shown, it includes:
[0060] Step 1: Obtaining chart sample data and detecting structural elements.
[0061] The process involves acquiring a financial chart image and using a text-image detection model to parse its structural elements, dividing the chart into multiple semantic structural units. The text-image detection model can utilize conventional chart parsing models, such as the MinerU model or the paddleOCR-VL model. Each parsed semantic structural unit contains at least three types of information: structural type, spatial location, and text content.
[0062] Structure type : Used to identify the first iThe semantic categories of a semantic structural unit may include image caption, image subject, legend, axis title, tick, data source or footnote, etc.
[0063] Spatial location The area occupied by an element is represented by normalized coordinates, where:
[0064] Indicates the first i The bounding box of a semantic structural unit;
[0065] : The x-coordinate of the top-left corner of the element (normalized);
[0066] : The ordinate of the top-left corner of the element (normalized);
[0067] : The x-coordinate of the bottom right corner of the element (normalized);
[0068] : The ordinate of the bottom right corner of the element (normalized);
[0069] Text content The first one obtained by OCR model i The text content of a semantic structural unit.
[0070] This forms a set of semantic structural units:
[0071] I represents the number of semantic structural units obtained from the parsing.
[0072] The structural data generated in this step provides the basic input for subsequent graph structure construction.
[0073] In addition to the three types of information mentioned above, semantic structural units can also contain other information, such as rotation angle. Wait, rotation angle Used to represent the rotation angle of oblique semantic structural units (such as oblique text, slanted coordinate axes, etc.).
[0074] Step 2: Construct a graph structure based on spatial and semantic rules.
[0075] The semantic structural unit is used as a node, and the set of semantic structural units is used as a set of nodes.
[0076] Based on the node set obtained in step 1 Constructing a graph structure The set of edges Describe the spatial and semantic relationships between chart elements.
[0077] Step 2.1: Construct spatial connecting edges based on spatial adjacency relationships.
[0078] Utilizing the spatial location of nodes Based on the center point or the relative relationships of top, bottom, left, and right, construct spatial connecting edges, with rules including but not limited to:
[0079] If node i and nodes j satisfy: ;
[0080] Then construct the edges: .
[0081] in and These are the neighborhood thresholds for the horizontal and vertical directions, respectively. Depending on the chart structure, adaptive or empirically fixed values can be used. For example, statistical methods can be employed to calculate them. and The result set is used to select a percentile as a threshold, such as the minimum distance of 80%, which ensures that most neighboring nodes can be connected to the edge, while avoiding incorrect connections to nodes that are too far away.
[0082] Indicates from node i Pointing to node j The edge.
[0083] Spatial connection edges mainly reflect the relative relationships of each node (semantic structural unit), such as the vertical positional relationship between the figure title and the main image, the positional proximity between the main image and the footnote, and the adjacency relationship between the legend and the data area; thus providing a spatial organization basis for the graph structure.
[0084] Step 2.2: Construct semantic prior edges based on semantic rules.
[0085] In addition to spatial relationships, this embodiment further constructs semantic prior edges based on knowledge of the graph domain. For example:
[0086] (1) If the node structure type is graph caption, then edges are automatically created: ; The main subject of the image;
[0087] (2) If the node structure type is legend, then edges are automatically created: ;
[0088] (3) If the node structure type is axis title, then edges will be automatically created: ; For coordinate scale.
[0089] The semantic rule template can be summarized as follows: ;
[0090] Table 1: Semantic Rules Table (Partial)
[0091]
[0092] This rule system uses domain knowledge to make the graph structure more consistent with the organizational logic of real-world graphs, thereby improving the effectiveness of subsequent graph structure coding.
[0093] Step 3: Generate graph structure encoding (GraphEmbedding).
[0094] The graph structure constructed from the set of nodes and edges is vectorized and encoded. This embodiment employs a graph neural network, preferably a graph attention network, to enhance the modeling ability of local structures and important connections.
[0095] Step 3.1: Initial feature construction of nodes.
[0096] Initial features of each node It consists of three parts:
[0097] ;
[0098] in:
[0099] TypeEmbed is a trainable vector corresponding to the node structure category. For example, a trainable vector (e.g., with a dimension of 128) can be initialized for each structure category. During training, the model will automatically adjust the vector values according to the task, such as unit classification or relation reasoning. Trainable vectors can better capture the semantic relationships between structure categories.
[0100] BBoxEmbed maps the spatial coordinates of a node to a vector.
[0101] TextEmbed is a vector of node text content after being transformed by a text encoder (such as BERT or LLaMA-Tokenizer).
[0102] Step 3.2: Graph Attention Network Computation Node Structure Embedding.
[0103] (1) Initial characteristics of nodes Perform multi-level GAT aggregation:
[0104] ;
[0105] in:
[0106] Represents a node In the The feature vector of the layer;
[0107] Represents a node In the The feature vector of the layer;
[0108] Indicates the first The learnable weight matrix of the layer;
[0109] Indicates the first Layer nodes For nodes Attention coefficient;
[0110] Represents a node The set of neighboring nodes;
[0111] This represents a non-linear activation function, and commonly used functions such as ReLU, ELU, LeakyReLU, and Sigmoid can be used.
[0112] (2) Attention weights are calculated as follows:
[0113] ;
[0114] This represents the attention function (attention mechanism).
[0115] This indicates that softmax normalization is performed on all neighbors j of node i;
[0116] Represents a vector of learnable attention parameters;
[0117] Represents the learnable weight matrix;
[0118] , This represents the features of nodes i and j in the previous layer.
[0119] (3) Obtain the final structural embedding of each node through multi-layer propagation:
[0120] ;
[0121] The final structure embedding representation for node i;
[0122] Let i be the feature representation of node i at the Lth (last) layer.
[0123] This structure embedding simultaneously includes element semantics, layout position, and its structural role in the overall chart.
[0124] Step 3.3: Compress to obtain the structural features of the entire image .
[0125] The final structure of all nodes is embedded together and compressed to obtain the full graph structure features. .
[0126] Step 4: Generate graph-prompt.
[0127] Based on node structure embedding, type information, text content, and edge structure, this embodiment generates structural prompt text that can be used by the language model.
[0128] Step 4.1: Constructing node hint text.
[0129] Using node information and embedding vectors Generate a structural description in the following format, including all nodes and the structural features of the entire graph:
[0130] node :
[0131] Type = ,
[0132] Position = ,
[0133] Content = ;
[0134] ...;
[0135] Structural Feature Vector Summary = .
[0136] Step 4.2: Constructing the side prompt text.
[0137] Convert edge set to edge prompt text Edge information node To the node It can contain edge types, such as spatial connection edges, semantic prior edges, etc.
[0138] Step 4.3: Integrate to generate the final Graph-Prompt.
[0139] The Graph-Prompt takes the following form:
[0140] ;
[0141] Ultimately, structured, readable, and parsable prompts are generated, enabling the model to possess prior cognitive ability regarding chart structures.
[0142] Step 5: Graph-Prompt injection of multimodal large models.
[0143] The generated Graph-Prompt is injected into a multimodal large language model, such as Qwen2.5vl or internvl3. The injection methods include, but are not limited to, one or more of the following: prefix injection, system prompt injection, and graph-text concatenation input.
[0144] For example, the input format is:
[0145] ;
[0146] in For the input image, The question is for input.
[0147] Multimodal large language models extract graph structure, relationships, and semantic information from graphs, thereby enhancing graph understanding capabilities.
[0148] Step 6: The model performs a graph understanding task based on Graph-Prompt.
[0149] Multimodal large language models based on Graph-Prompt can perform graph understanding tasks including but not limited to: graph question answering, graph description generation, and trend analysis. By using structure awareness, model misjudgment can be significantly reduced.
[0150] The method described in this embodiment can automatically, uniformly and accurately represent the structural relationships between elements within a chart, and effectively inject this structure into a multimodal large model, enabling the model to truly understand the internal organizational logic of financial charts.
[0151] Example 2:
[0152] This embodiment provides an application example of the method described in Embodiment 1.
[0153] like Figure 2 The image shown is the original image of a financial chart. The application process of the method described in Example 1 includes the following steps:
[0154] Step 1: Figure 2 The chart in the image shows "PMI New Export Orders and Exports Year-on-Year". Using a graph analysis model to parse the chart's structural elements, the following set of nodes was obtained:
[0155] Table 2: Node Sets
[0156]
[0157] Each node It contains three types of information:
[0158] Structure type Node type, indicating the semantic role of the graph;
[0159] Spatial location : Normalized coordinates of nodes in the chart image;
[0160] Text content The text content recognized by OCR;
[0161] The node set is:
[0162] .
[0163] Step 2: Construct a graph structure based on spatial and semantic rules.
[0164] Construct a graph structure G=(V,E), where E contains spatial connection edges and semantic prior edges.
[0165] Step 2.1: Constructing spatial connection edges.
[0166] Determine whether elements are adjacent based on their spatial center point relationship.
[0167] For example:
[0168] If the diagram is... In the main body of the image Directly above, and satisfying Then construct Spatial connection edge.
[0169] Image subject With footnotes , Adjacent in the vertical direction, construct , Spatial connection edge.
[0170] Step 2.2: Construct semantic prior edges using semantic rules.
[0171] Automatically added based on prior knowledge in the field of financial charting:
[0172] The semantic prior edge between image caption and image subject (image_caption→image);
[0173] Footnotes serve as supplementary information for the image, with semantic prior edges added from image to image_footnote.
[0174] The final result is an edge set E containing spatial connection edges and semantic prior edges.
[0175] Step 3: Graph structure encoding.
[0176] Step 3.1: Initial feature construction of nodes.
[0177] The initial characteristics of each node consist of three parts:
[0178]
[0179] Step 3.2: GAT generates structural embeddings.
[0180] After L-layer graph attention calculation, the final node structure embedding is obtained:
[0181] ;
[0182] The embedding vector contains semantic, positional, and structural organization information.
[0183] Step 3.2: After compressing and embedding the structure of all nodes, we obtain... :
[0184] [22.1453, -3.9857, -48.8765, 9.2182, -13.1046, -5.7820, -2.9459,
[0185] 20.9271, 9.3375, 17.5624, 3.7189, -7.4217, -30.2648, 24.6696, -15.9343, 29.0557.
[0186] Step 4: Generate graph structure hints.
[0187] Based on node structure embedding, type information, text content, and edge structure, this embodiment generates structural prompt text that can be used by the language model.
[0188] Step 4.1: Constructing node hint text.
[0189] Using node information and full-graph structural feature encoding, a structural description in the following format is generated:
[0190] The node information is as follows:
[0191] Node 1:
[0192] type=image_caption,
[0193] Position = [0.03, 0.019, 0.69, 0.077],
[0194] Content = Figure 1 PMI new export orders and exports year-on-year;
[0195] Node 2:
[0196] type=image,
[0197] Position = [0.038, 0.097, 0.968, 0.803],
[0198] Content =;
[0199] Node 3:
[0200] type=image_footnote,
[0201] Position = [0.029, 0.862, 0.421, 0.91],
[0202] Note: Data is as of July 2020;
[0203] Node 4:
[0204] type=image_footnote,
[0205] Position = [0.033, 0.927, 0.277, 0.974],
[0206] Content = Source: Wind;
[0207] Structural Feature Vector Summary:
[0208] [22.1453, -3.9857, -48.8765, 9.2182, -13.1046, -5.7820, -2.9459,
[0209] 20.9271, 9.3375, 17.5624, 3.7189, -7.4217, -30.2648, 24.6696, -15.9343, 29.0557.
[0210] Step 4.2: Constructing the side prompt text.
[0211] Convert edge set to edge prompt text :
[0212] For example, the side prompt text format is:
[0213] Spatial connection edge from node 1 to node 2;
[0214] ...
[0215] Step 4.3: Integrate to generate the final Graph-Prompt.
[0216] The Graph-Prompt takes the following form: ;
[0217] The final result is obtained.
[0218] Step 5: Graph-Prompt injects multimodal models.
[0219] ;
[0220] in P This is the final result obtained in step 4;
[0221] Image is Figure 2 Original image shown;
[0222] UserQuery The question is "Describe the content of the chart in one sentence".
[0223] Step 6: Model execution chart understanding.
[0224] Multimodal large language models can realize graph question answering, graph description and trend analysis, etc.
[0225] For example, the comparison results obtained based on qwen2.5vl are as follows:
[0226] Original answer result:
[0227] This chart shows the year-on-year trends in PMI new export orders and exports from countries B and A (2014-2020).
[0228] The answer obtained using this method:
[0229] This chart shows the changing trends of China's PMI new export orders index and the year-on-year export growth rates of countries B and A from 2014 to 2020, with the data cutoff date being July 2020 and the data source being Wind.
[0230] The test results based on qwen2.5vl are shown in the table below:
[0231] Table 3: Test Results
[0232]
[0233] It is evident that, using this method, the multimodal large language model can truly understand the internal organizational logic of financial charts.
[0234] Example 3:
[0235] This embodiment proposes a multimodal understanding system for financial charts based on graph neural networks, including:
[0236] Parsing module: performs structural element parsing on financial charts, dividing them into multiple semantic structural units, each of which contains structural type, spatial location, and text content;
[0237] Graph structure module: Using semantic structural units as nodes, edges are constructed based on spatial and semantic relationships between nodes, thereby building a graph structure;
[0238] Vectorization encoding module: performs vectorization encoding on the graph structure to generate full graph structure feature encoding;
[0239] Prompt text module: Constructs prompt text for each node and each edge, and integrates it to generate structured prompt information Graph-Prompt;
[0240] Injection module: Injects Graph-Prompt into large multimodal models;
[0241] Task execution module: Multimodal large model performs graph understanding tasks based on Graph-Prompt.
[0242] In the parsing module, the structure type is used to identify the semantic category of the semantic structure unit, the spatial location is used to represent the area occupied by the semantic structure unit in normalized coordinates, and the text content is used to include the text content in the identified semantic structure unit.
[0243] The graph structure module includes:
[0244] Spatial connection edge construction unit: Spatial connection edges are constructed based on spatial adjacency relationships, reflecting the relative relationships of semantic structural units of different structural types;
[0245] Semantic prior edge construction unit: Construct semantic prior edges based on semantic rules to make the graph structure conform to the organizational logic of real financial charts.
[0246] The vectorization coding module includes:
[0247] Initial feature unit for a node: Construct the initial features of the node by converting the node's structural type, spatial location, and text content into vectors and concatenating them.
[0248] Node structure embedding unit: Calculates the structure embedding of nodes through a graph attention network;
[0249] Compression unit: The structure of all nodes is embedded together and compressed to obtain the full graph structure feature encoding.
[0250] The prompt text module includes:
[0251] Node prompt unit: Constructs node prompt text, including the structural type, spatial location, and text content of each node; and the encoding of the full graph structural features;
[0252] Edge hint unit: Constructs edge hint text, including the edge type;
[0253] Integration Unit: Integrates node tooltip text and edge tooltip text to generate structured tooltip information Graph-Prompt.
[0254] The Graph-Prompt injection module injects multimodal large models, including one or more of the following: prefix injection, system prompt injection, and graph-text concatenation input.
[0255] The chart understanding tasks described in the task execution module include chart question and answer, chart description generation, trend analysis, and other extended tasks based on financial charts.
[0256] The financial chart multimodal understanding system based on graph neural networks proposed in this embodiment can realize the financial chart multimodal understanding method based on graph neural networks described in Embodiments 1 and 2, and has the same technical effect as Embodiments 1 and 2.
[0257] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the appended claims.
Claims
1. A method for multimodal understanding of financial charts based on graph neural networks, characterized in that, include: S1: Perform structural element analysis on financial charts, dividing them into multiple semantic structural units. Each semantic structural unit contains structural type, spatial location, and text content. S2. Using semantic structural units as nodes, edges are constructed based on spatial and semantic relationships between nodes, thereby building a graph structure; including: S201. Construct spatial connection edges based on spatial adjacency relationships to reflect the relative relationships of semantic structural units of different structural types; S202. Construct semantic prior edges based on semantic rules to make the graph structure conform to the organizational logic of real financial charts; S3. Perform vectorized encoding on the graph structure to generate full graph structure feature encoding; S4. Construct the prompt text for each node and each edge, and integrate them to generate a structured prompt message, Graph-Prompt; including: S401. Construct node prompt text, including the structural type, spatial location, and text content of each node; and the encoding of the full graph structural features; S402. Construct edge prompt text, including the type of edge, wherein the type of edge includes spatial connection edge and semantic prior edge; S403. Integrate node prompt text and edge prompt text to generate structured prompt information Graph-Prompt; S5. Inject Graph-Prompt into the multimodal large model; S6, Multimodal Large Model performs graph understanding tasks based on Graph-Prompt.
2. The method for multimodal understanding of financial charts based on graph neural networks according to claim 1, characterized in that, In step S1, the structure type is used to identify the semantic category of the semantic structure unit, the spatial location is used to represent the area occupied by the semantic structure unit in normalized coordinates, and the text content is used to include the text content in the identified semantic structure unit.
3. The method for multimodal understanding of financial charts based on graph neural networks according to claim 1, characterized in that, Step S3 includes: S301. Construct the initial features of the node by converting the node's structural type, spatial location, and text content into vectors and concatenating them. S302. Calculate the structural embedding of nodes using a graph attention network; S303. Embed the structure of all nodes together and compress to obtain the full graph structure feature encoding.
4. The method for multimodal understanding of financial charts based on graph neural networks according to claim 1, characterized in that, In step S5, the Graph-Prompt injection of multimodal large models includes one or more of the following: prefix injection, system prompt injection, and graph-text concatenation input.
5. The method for multimodal understanding of financial charts based on graph neural networks according to claim 1, characterized in that, The chart understanding task described in step S6 includes chart question and answer, chart description generation, and trend analysis.
6. A multimodal understanding system for financial charts based on graph neural networks, characterized in that, include: Parsing module: performs structural element parsing on financial charts, dividing them into multiple semantic structural units, each of which contains structural type, spatial location, and text content; Graph Structure Module: Using semantic structural units as nodes, edges are constructed based on spatial and semantic relationships between nodes to build the graph structure; including: Spatial Connection Edge Construction Unit: Spatial connection edges are constructed based on spatial adjacency relationships, reflecting the relative relationships of semantic structural units of different structural types; Semantic Prior Edge Construction Unit: Semantic prior edges are constructed based on semantic rules, making the graph structure conform to the organizational logic of real financial charts; Vectorization encoding module: performs vectorization encoding on the graph structure to generate full graph structure feature encoding; The prompt text module constructs prompt text for each node and each edge, integrating them to generate a structured prompt information Graph-Prompt; it includes: a node prompt unit: constructing node prompt text, including the structure type, spatial location, and text content of each node; and the encoding of the overall graph structure features; an edge prompt unit: constructing edge prompt text, including the edge type, which includes spatial connection edges and semantic prior edges; and an integration unit: integrating the node prompt text and edge prompt text to generate the structured prompt information Graph-Prompt. Injection module: Injects Graph-Prompt into large multimodal models; Task execution module: Multimodal large model performs graph understanding tasks based on Graph-Prompt.
7. A computer device, the device comprising a processor and a memory; the memory being used to store computer programs, characterized in that, The processor is used to execute corresponding computer program code to implement the multimodal understanding method for financial charts based on graph neural networks as described in any one of claims 1-5.
8. A computer-readable storage medium, characterized in that, It contains computer program code, which is invoked by a processor to implement the multimodal understanding method for financial charts based on graph neural networks as described in any one of claims 1-5.