The invention provides a conference video marking system based on sound detection. The system comprises: a container block head, which stores the position information of the audio and video data belonging to each participating member in the conference; N container blocks, which store the audio and video data of each participating member in the conference; and a resolution scaling module, which scales the video resolution by decoding and re-encoding the video data. The position information of the audio and video data in the conference video is stored in the container block head, and the audio and video data of each participating member is stored in the corresponding container block. With this system, playback is simple to operate, and the picture of a speaking member to be viewed can be switched to rapidly according to the label; playback can be flexibly combined, so that the pictures of all speakers are extracted at the same time and the speaking content of all members can be rapidly compared; and a simple file format is defined in which the data is saved in blocks, so that data extraction is simple and flexible.
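The block-based layout described above can be illustrated with a minimal sketch. The abstract does not specify a byte layout, so everything here is an assumption made for illustration: the `CVMK` signature, the little-endian field widths, and the helper names `pack_container`, `read_head`, and `extract_member` are hypothetical. The sketch shows only the idea of a head that records where each member's block sits, followed by one block per member.

```python
import struct

MAGIC = b"CVMK"                # hypothetical 4-byte file signature
ENTRY = struct.Struct("<IQQ")  # assumed head entry: member id, block offset, block length


def pack_container(member_blocks):
    """Serialize {member_id: bytes} as a block head followed by N blocks."""
    members = sorted(member_blocks)
    head_size = len(MAGIC) + 4 + ENTRY.size * len(members)
    head = bytearray(MAGIC + struct.pack("<I", len(members)))
    body = bytearray()
    offset = head_size  # the first block starts right after the head
    for member_id in members:
        data = member_blocks[member_id]
        head += ENTRY.pack(member_id, offset, len(data))
        body += data
        offset += len(data)
    return bytes(head + body)


def read_head(blob):
    """Return {member_id: (offset, length)} parsed from the block head."""
    if blob[:4] != MAGIC:
        raise ValueError("not a container file")
    (count,) = struct.unpack_from("<I", blob, 4)
    index = {}
    for i in range(count):
        member_id, offset, length = ENTRY.unpack_from(blob, 8 + i * ENTRY.size)
        index[member_id] = (offset, length)
    return index


def extract_member(blob, member_id):
    """Slice out one member's block using only the head's position info."""
    offset, length = read_head(blob)[member_id]
    return blob[offset:offset + length]
```

In this sketch, extracting one speaker's data requires reading only the head and a single slice, which is the kind of simple, flexible extraction the block layout is meant to allow. A real file would hold encoded audio and video streams in each block rather than the placeholder bytes used in a test.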