The invention provides a media segment-based speaking detection method and system. The method comprises the following steps: an input media signal is separated into an audio signal and a video signal, which are processed separately. For the audio signal, a hidden Markov model is used to calculate per-second conditional probabilities based on the likelihood ratio of harmonic frequencies, and the results are clustered. For the video signal, the face region, the lip portion, and the image energy of the lip region are extracted from each frame of the input media file; clustering is performed according to the image energy, a hidden Markov model is used to calculate per-second conditional probabilities, and these are clustered again to obtain two clusters. The audio clustering results are then matched against the video clustering results to obtain the final speaking detection result. Because the method and system perform speaking detection using both audio and video information, the detection rate can be improved.
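
The final matching step can be illustrated with a minimal Python sketch. It is not the claimed implementation. It assumes the per-second conditional probabilities have already been produced by the hidden Markov models, which are not reproduced here. The function names (`two_means_1d`, `match_clusters`, `to_segments`), the use of plain one-dimensional 2-means as the clustering step, and the rule that a second counts as speaking only when the audio and video clusters agree are all illustrative assumptions, since the abstract does not specify them.

```python
from typing import List, Tuple


def two_means_1d(values: List[float], iters: int = 20) -> List[int]:
    """Split 1-D values into two clusters: 0 = low group, 1 = high group.

    Assumption: plain 2-means stands in for the unspecified clustering step.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)  # initial centroids
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to the nearer centroid (ties go to the low group).
        labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        lows = [v for v, lab in zip(values, labels) if lab == 0]
        highs = [v for v, lab in zip(values, labels) if lab == 1]
        if lows:
            lo = sum(lows) / len(lows)
        if highs:
            hi = sum(highs) / len(highs)
    return labels


def match_clusters(audio: List[int], video: List[int]) -> List[int]:
    """Assumed matching rule: speaking only where both modalities agree."""
    return [a & v for a, v in zip(audio, video)]


def to_segments(labels: List[int]) -> List[Tuple[int, int]]:
    """Merge consecutive speaking seconds into (start, end) spans, end exclusive."""
    segments = []
    start = None
    for t, lab in enumerate(labels):
        if lab and start is None:
            start = t
        elif not lab and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    return segments


# Hypothetical per-second probabilities, made up for illustration only.
audio_prob = [0.1, 0.9, 0.8, 0.2, 0.85, 0.9]
lip_prob = [0.2, 0.8, 0.7, 0.9, 0.1, 0.95]

audio_labels = two_means_1d(audio_prob)
video_labels = two_means_1d(lip_prob)
speaking = match_clusters(audio_labels, video_labels)
print(to_segments(speaking))
```

In this sketch, second 3 shows lip motion without speech-like audio and second 4 shows speech-like audio without lip motion, so the agreement rule rejects both. This illustrates how combining the two modalities can suppress false detections that either one alone would produce.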