The invention discloses a video scene detection and labeling method and system. The method comprises the following steps: obtaining modal features of the video, audio and text through pre-trained models, according to the modal information sources embedded in the input video, audio and text; aligning and fusing the obtained video, audio and text modal features to form a window-level basic cross-modal representation; evolving the basic cross-modal representation of each window into an adaptive context-aware representation according to multi-temporal attention and the differences between adjacent windows; detecting scenes according to the obtained adaptive context-aware representation, wherein the attribute of each window is determined by a window attribute classifier and the precise position of a scene boundary within the window is obtained by a position offset regressor; and, based on the obtained scene boundaries, assigning a plurality of labels to each scene to realize scene labeling. The problems of error propagation and high computational cost are addressed through a unified network of cross-modal cues; scene detection is formulated as window attribute classification and position offset regression, and the multi-label labeling problem is solved through ensemble learning of two-stage classifiers.
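The way window attribute classification and position offset regression combine into scene boundaries can be illustrated with a minimal sketch. This is not the disclosed network; it only shows one plausible decoding step, under assumptions that the abstract does not state: windows have a fixed length, the window attribute is a binary "contains a boundary" probability, and the regressed offset is a fraction of the window length measured from the window start. The names `decode_scene_boundaries`, `boundaries_to_scenes`, `window_len` and `threshold` are hypothetical.

```python
from typing import List, Sequence, Tuple


def decode_scene_boundaries(
    boundary_probs: Sequence[float],
    offsets: Sequence[float],
    window_len: float,
    threshold: float = 0.5,
) -> List[float]:
    """Turn per-window classifier and regressor outputs into boundary times.

    Assumptions (not from the source): window i covers
    [i * window_len, (i + 1) * window_len), boundary_probs[i] is the
    classifier's probability that window i contains a scene boundary, and
    offsets[i] is the regressed boundary position as a fraction of the window.
    """
    if len(boundary_probs) != len(offsets):
        raise ValueError("boundary_probs and offsets must have the same length")

    boundaries = []
    for i, (prob, offset) in enumerate(zip(boundary_probs, offsets)):
        if prob >= threshold:
            # Clamp the regressed offset so the boundary stays inside its window.
            offset = min(max(offset, 0.0), 1.0)
            boundaries.append((i + offset) * window_len)
    return boundaries


def boundaries_to_scenes(
    boundaries: Sequence[float], video_len: float
) -> List[Tuple[float, float]]:
    """Split [0, video_len] into consecutive scenes at the given boundaries."""
    inner = sorted(b for b in boundaries if 0.0 < b < video_len)
    cuts = [0.0] + inner + [video_len]
    return list(zip(cuts[:-1], cuts[1:]))


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.2, 0.8]     # hypothetical classifier outputs
    offs = [0.0, 0.5, 0.0, 0.25]     # hypothetical regressor outputs
    cuts = decode_scene_boundaries(probs, offs, window_len=2.0)
    print(cuts)
    print(boundaries_to_scenes(cuts, video_len=8.0))
```

Each resulting scene interval would then be passed to the multi-label stage; the two-stage ensemble classifiers themselves are not sketched here because the abstract gives no detail on their structure.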