A video summarization apparatus stores, in a memory, video data including video and audio, together with metadata items that correspond respectively to video segments included in the video data. Each metadata item includes a keyword and characteristic information on the content of the corresponding video segment. The apparatus selects, from the metadata items, those that include a specified keyword, to obtain selected metadata items. It then extracts, from the video data, the video segments corresponding to the selected metadata items and generates summarized video data by connecting the extracted video segments. The apparatus also detects audio breakpoints included in the video data, to obtain audio segments delimited by those breakpoints, and extracts from the video data the audio segments corresponding to the extracted video segments as audio narrations. Finally, it modifies the ending time of each video segment in the summarized video data so that the ending time coincides with or is later than the ending time of the corresponding extracted audio segment.
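The selection and end-time adjustment steps can be illustrated with a minimal Python sketch. It works on timestamps only and does not decode or concatenate actual media. Several details are assumptions not stated in the abstract: the names `Segment`, `MetadataItem`, `audio_segments`, `narration_for` and `summarize` are hypothetical; the audio breakpoints are taken as already detected (for example by some silence detector) and passed in as a list of times; and the audio segment "corresponding" to a video segment is assumed to be the one in which that video segment ends.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the beginning of the video data
    end: float    # seconds from the beginning of the video data


@dataclass(frozen=True)
class MetadataItem:
    segment: Segment            # video segment this item describes
    keywords: frozenset[str]    # keywords attached to the segment
    description: str = ""       # characteristic information (unused here)


def audio_segments(breakpoints: list[float], duration: float) -> list[Segment]:
    """Split [0, duration] into audio segments at the given breakpoints.

    Breakpoint detection itself is assumed to happen elsewhere.
    """
    inner = sorted(b for b in breakpoints if 0.0 < b < duration)
    cuts = [0.0] + inner + [duration]
    return [Segment(a, b) for a, b in zip(cuts, cuts[1:])]


def narration_for(video: Segment, audio: list[Segment]) -> Segment | None:
    """Assumed correspondence rule: the audio segment in which the video
    segment ends, i.e. start < video.end <= end."""
    for a in audio:
        if a.start < video.end <= a.end:
            return a
    return None


def summarize(items: list[MetadataItem], keyword: str,
              breakpoints: list[float], duration: float
              ) -> list[tuple[Segment, Segment | None]]:
    """Select segments by keyword and extend each one's end time so that it
    is not earlier than the end of its narration audio segment."""
    audio = audio_segments(breakpoints, duration)
    selected = [m for m in items if keyword in m.keywords]
    result = []
    # Keep the summary in source order before "connecting" the segments.
    for m in sorted(selected, key=lambda m: m.segment.start):
        seg = m.segment
        narration = narration_for(seg, audio)
        end = seg.end
        if narration is not None and narration.end > end:
            end = narration.end  # avoid cutting the narration off mid-phrase
        result.append((Segment(seg.start, end), narration))
    return result
```

For example, with audio breakpoints at 3, 6, 12, 15, 22 and 27 seconds in a 30-second video, a "goal" segment spanning 0 to 4 seconds ends inside the audio segment 3 to 6, so its end is moved to 6. A segment that already ends exactly on a breakpoint is left unchanged, which satisfies the "coincides with or is later than" condition.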