Method and system for extracting complex named entities from Web video pages

The technology, applied in the field of information extraction, addresses three problems: prior-art algorithms cannot be applied directly, they are not suited to discovering complex named entities, and named entities in Web video pages lack context information. It achieves the effect of improving extraction accuracy.

Active Publication Date: 2010-01-13
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

In Web video pages, named entities lack rich context information. At the same time, prior-art algorithms require large-scale data sets and long computation times, which makes them unsuitable for the timely discovery of emerging complex named entities.
Ordinary named entities and complex named entities differ greatly in concept and in form of expression, so prior-art algorithms cannot be applied directly to the recognition and extraction of complex named entities. Classification of...




Embodiment Construction

[0049] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0050] The method of the invention is shown in Figure 1.

[0051] Step S100: for each Web video page in the Web video page set, extract the valid text information from the page. The valid text information of a page forms a video text, and all video texts together form the training set.

[0052] Step S100 is implemented as follows.

[0053] Step S110: set an information extraction template for each site.

[0054] On the vast majority of video websites, most web pages are generated by scripts or programs that read data through an interface provided by the database and then produce HTML pages in a fixed format. Therefore, within the same website, web pages with the same or similar semantic content usually also have the same or similar HTML syntactic structure.
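
Because pages from one site share an HTML structure, a per-site extraction template can be as simple as a list of elements whose text is worth keeping. The following is a minimal sketch of that idea, not the implementation described in the patent. The site name, the class names in `SITE_TEMPLATES`, and the function names are all hypothetical, and the template format (tag and class pairs) is an assumption made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical templates: site name -> set of (tag, class) pairs whose text
# counts as valid text. Real sites would need their own selectors.
SITE_TEMPLATES = {
    "example-video-site": {
        ("h1", "video-title"),
        ("div", "video-desc"),
        ("span", "video-tag"),
    },
}


class TemplateExtractor(HTMLParser):
    """Collects text found inside elements that match a site template."""

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.capture_tag = None  # tag name of the matched element, if any
        self.depth = 0           # nesting depth of that tag name
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.capture_tag is None:
            classes = (dict(attrs).get("class") or "").split()
            if any((tag, c) in self.template for c in classes):
                self.capture_tag, self.depth = tag, 1
        elif tag == self.capture_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.capture_tag:
            self.depth -= 1
            if self.depth == 0:
                self.capture_tag = None

    def handle_data(self, data):
        if self.capture_tag is not None and data.strip():
            self.chunks.append(data.strip())


def extract_video_text(site, html):
    """Return the video text of one page using the template of its site."""
    parser = TemplateExtractor(SITE_TEMPLATES[site])
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_training_set(pages):
    """pages: iterable of (site, html). Returns the list of video texts."""
    return [extract_video_text(site, html) for site, html in pages]
```

Navigation bars, advertisements, and other elements that the template does not list are skipped, so only the title, description, and tag text of each page reaches the training set.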

[0055] Due to the particularity of the HTML webpage, the method of extracting the text of the webpage may ado...



Abstract

The invention relates to a method and a system for extracting complex named entities from Web video pages. The method comprises the following steps: step 1: extracting effective text information from each Web video page in a Web video page set, wherein the effective text information forms video texts, and all video texts form a training set; step 2: classifying the Web video pages, selecting classifications, setting guide words for each selected classification, and selecting words which are related to the guide words and are uniformly and intensively distributed among the classifications from the training set as characteristic words; and step 3: extracting the words which are related to the characteristic words from the training set as candidate complex named entities, and selecting corresponding complex named entities for each selected classification from the candidate named entities according to the correlation degree of the classification with the characteristic words related to the candidate named entities. The method and the system of the invention can be used for extracting complex named entities from Web video pages without carrying out model training for a long time.
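
Steps 2 and 3 of the abstract can be illustrated with a toy sketch. The abstract does not give the relatedness or correlation measures, so everything concrete here is an assumption: "related to a guide word" is taken to mean co-occurrence in the same video text, concentration in a classification is measured as the share of a word's texts that fall in that classification, "related to a characteristic word" is taken to mean adjacency, and a candidate's correlation with a classification is the number of times it sits next to that classification's characteristic words. The thresholds, function names, and sample data are hypothetical, and each entity is assumed to be already segmented into a single token.

```python
from collections import Counter, defaultdict


def select_feature_words(texts, guide_words, min_cooc=3, min_share=0.8):
    """texts: list of (category, tokens); guide_words: category -> set of words."""
    cooc = defaultdict(Counter)     # category -> word -> texts shared with a guide word
    per_cat = defaultdict(Counter)  # word -> category -> texts containing the word
    for cat, tokens in texts:
        words = set(tokens)
        for w in words:
            per_cat[w][cat] += 1
        guides = guide_words.get(cat, set())
        if words & guides:
            for w in words - guides:
                cooc[cat][w] += 1
    features = {}
    for cat, counts in cooc.items():
        features[cat] = {
            w for w, n in counts.items()
            if n >= min_cooc
            and per_cat[w][cat] / sum(per_cat[w].values()) >= min_share
        }
    return features


def extract_entities(texts, features, guide_words, min_score=2):
    """Score tokens adjacent to feature words; keep those reaching min_score."""
    excluded = set().union(*features.values(), *guide_words.values())
    scores = defaultdict(Counter)   # category -> candidate -> adjacency count
    for cat, tokens in texts:
        cat_features = features.get(cat, set())
        for i, tok in enumerate(tokens):
            if tok not in cat_features:
                continue
            for j in (i - 1, i + 1):
                if 0 <= j < len(tokens) and tokens[j] not in excluded:
                    scores[cat][tokens[j]] += 1
    return {
        cat: {c for c, s in counts.items() if s >= min_score}
        for cat, counts in scores.items()
    }


# Toy training set (hypothetical data): (classification, tokenized video text).
TEXTS = [
    ("movie", ["inception", "trailer", "hd", "movie"]),
    ("movie", ["avatar", "trailer", "movie"]),
    ("movie", ["inception", "trailer", "movie", "clip"]),
    ("music", ["yesterday", "mv", "music"]),
    ("music", ["imagine", "mv", "music", "hd"]),
    ("music", ["yesterday", "mv", "live", "music"]),
]
GUIDES = {"movie": {"movie"}, "music": {"music"}}

FEATURES = select_feature_words(TEXTS, GUIDES)
ENTITIES = extract_entities(TEXTS, FEATURES, GUIDES)
```

On this toy data, "trailer" and "mv" are selected as characteristic words, and the tokens that repeatedly appear next to them ("inception" and "yesterday") are returned as entities. No model is trained at any point, which is the property the abstract emphasizes.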

Description

Technical field

[0001] The invention relates to the field of information extraction, in particular to a method for extracting complex named entities of Web video pages.

Background technique

[0002] With the growth of network bandwidth and the application of Web 2.0 technology, YouTube, Youku, Tudou and other video sharing websites have developed rapidly at home and abroad, and the number of Internet videos and the scale of users have increased on a large scale. At present, there are more than 300 video sites on the domestic Internet, among which the number of videos on Youku, Tudou and other websites has exceeded 10 million. How to accurately and effectively extract text information from Web video pages has become an important issue in the field of information extraction. Extracting text from web pages is essentially a process of extracting information from semi-structured text.

[0003] Web video pages contain a large amount of text information, such as movie titles, TV dr...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 郑刚, 张勇东, 郭俊波
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI