The invention discloses a method for football video event detection and semantic annotation based on multi-modal information fusion. The method comprises: detecting the event type of each descriptive sentence in match-report text published on the Internet, using latent semantic analysis; detecting mid-level semantic objects in the football video, segmenting the field area, analyzing attack-defense transitions, and determining the boundaries of video event segments; determining the match start time from center-circle and whistle detection results, and performing an initial semantic classification of the attack-defense segments with a Bayesian network; and, under the constraint of the coarse-grained time information in the text descriptions, synchronizing the text descriptions with the video events according to the semantics of the text and of the video segments, thereby semantically annotating the football video events. By fusing Internet text information with analysis of the video's inherent audio-visual features, the method improves the accuracy of detecting video events and their boundaries, provides rich semantic annotation of football video content, and lays a foundation for building a semantics-based video indexing mechanism.
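The final synchronization step can be illustrated with a minimal sketch. It is not the claimed implementation: the names `TextEvent`, `VideoSegment`, `align`, and the `tolerance_s` parameter are hypothetical. It assumes that each text event carries only a match minute, that each video segment already has an initial class label (for example from the Bayesian network stage), and that the kick-off offset in video time is known. Half-time and stoppage time are ignored for simplicity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextEvent:
    minute: int        # match minute reported in the text (coarse-grained)
    event_type: str    # event type detected from the report sentence


@dataclass
class VideoSegment:
    start_s: float     # segment start in video time, seconds
    end_s: float       # segment end in video time, seconds
    label: str         # initial semantic class of the segment (assumed given)


def align(
    text_events: List[TextEvent],
    segments: List[VideoSegment],
    kickoff_s: float,
    tolerance_s: float = 90.0,
) -> List[Tuple[TextEvent, Optional[VideoSegment]]]:
    """Pair each text event with at most one video segment.

    A segment is a candidate when its label equals the text event type and
    it overlaps the reported minute, widened by tolerance_s on both sides.
    Among candidates, the one whose midpoint is nearest the middle of the
    reported minute is chosen. Events with no candidate map to None.
    """
    pairs = []
    for ev in text_events:
        # Minute m is taken to cover game time [(m - 1) * 60, m * 60) seconds.
        window_lo = kickoff_s + (ev.minute - 1) * 60.0 - tolerance_s
        window_hi = kickoff_s + ev.minute * 60.0 + tolerance_s
        center = kickoff_s + (ev.minute - 0.5) * 60.0

        candidates = [
            s for s in segments
            if s.label == ev.event_type
            and s.end_s >= window_lo
            and s.start_s <= window_hi
        ]
        best = min(
            candidates,
            key=lambda s: abs((s.start_s + s.end_s) / 2.0 - center),
            default=None,
        )
        pairs.append((ev, best))
    return pairs
```

Called with the detected kick-off offset, `align` returns one pair per text event. The time window narrows the search to a few segments, and the label match then decides between them, which mirrors the idea of using coarse time as a constraint and semantics as the matching criterion.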