The invention relates to a deep video relation analysis method based on multi-modal feature fusion, built on a network that fuses visual, audio, and text features through video shot segmentation, scene recognition, and person recognition. The method comprises the following steps: firstly, dividing an input video into a plurality of shots according to scene, visual, and audio models, and extracting the corresponding audio and text features for each shot; secondly, identifying the positions at which scenes and persons appear in each shot according to input scene screenshots and person screenshots, extracting the corresponding entity visual features from the scenes and persons, and computing joint-region visual features for every entity pair; and then, for each entity pair, concatenating the shot features, the entity features, and the entity-pair features, predicting the relationship of each shot-level entity pair through few-shot learning combined with zero-shot learning, and constructing an entity relationship graph over the whole video by merging the entity relationships found on each shot. By utilizing this entity relationship graph, the method can answer three types of deep video analysis questions: knowledge graph completion, question answering, and entity relationship paths.
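The per-pair relation prediction and graph construction steps can be illustrated with a minimal sketch. The snippet below is not the patented implementation: it assumes PyTorch, uses hypothetical feature dimensions and field names, and replaces the few-shot/zero-shot learning component with an ordinary classifier for brevity. It shows only the structure described above: for each entity pair, the shot features, the two entity features, and the joint-region features are concatenated and classified, and the per-shot predictions are merged into a video-level entity relationship graph.

```python
# Minimal sketch of per-shot entity-pair relation prediction (hypothetical
# dimensions and field names; assumes PyTorch only).
import torch
import torch.nn as nn

SHOT_DIM, ENTITY_DIM, JOINT_DIM, NUM_RELATIONS = 512, 256, 256, 20

class PairRelationClassifier(nn.Module):
    """Concatenates shot, entity, and joint-region features for one
    entity pair and predicts a relation distribution. Stands in for the
    few-shot/zero-shot predictor of the described method."""
    def __init__(self):
        super().__init__()
        in_dim = SHOT_DIM + 2 * ENTITY_DIM + JOINT_DIM
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, NUM_RELATIONS),
        )

    def forward(self, shot_feat, ent_a, ent_b, joint_feat):
        # Connect shot features, both entity features, and the joint-region
        # (entity-pair) features, as in the step described above.
        x = torch.cat([shot_feat, ent_a, ent_b, joint_feat], dim=-1)
        return self.mlp(x)  # relation logits for this entity pair

def build_relation_graph(shots, classifier):
    """Merge per-shot pair predictions into a video-level entity
    relationship graph: (entity_a, entity_b) -> relation id."""
    graph = {}
    for shot in shots:
        for (a, b), joint_feat in shot["pairs"].items():
            logits = classifier(shot["shot_feat"],
                                shot["entities"][a],
                                shot["entities"][b],
                                joint_feat)
            graph[(a, b)] = int(logits.argmax(dim=-1))
    return graph

# Usage with random stand-in features for one shot containing one
# person entity and one scene entity:
shot = {
    "shot_feat": torch.randn(SHOT_DIM),
    "entities": {"person_1": torch.randn(ENTITY_DIM),
                 "scene_office": torch.randn(ENTITY_DIM)},
    "pairs": {("person_1", "scene_office"): torch.randn(JOINT_DIM)},
}
graph = build_relation_graph([shot], PairRelationClassifier())
```

The resulting graph, accumulated over all shots, is the structure the method queries for knowledge graph completion, question answering, and entity relationship paths.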