Method and system for extracting complex named entities from Web video pages

The invention concerns named entity and video technology in the field of information extraction. It addresses the problems that prior-art algorithms cannot be applied directly, that they are not suited to discovering complex named entities, and that named entities lack context information, and it improves the accuracy of extraction.

Active Publication Date: 2012-07-04
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

However, in Web video pages, named entities lack rich context information. At the same time, prior-art algorithms require large-scale data sets and long computation times, which makes them unsuitable for the timely discovery of emerging complex named entities.
Ordinary named entities and complex named entities differ greatly in concept and in form of expression, so prior-art algorithms cannot be applied directly to the recognition and extraction of complex named entities. Classification of...

Embodiment Construction

[0049] The present invention will be further described in detail below in conjunction with the accompanying drawings.

[0050] The method of the present invention is shown in Figure 1.

[0051] Step S100: for each Web video page in the Web video page set, extract the valid text information from the page. The valid text information of a page constitutes its video text, and all video texts together constitute the training set.

[0052] The specific implementation of step S100 is as follows.

[0053] Step 110, setting an information extraction template for each site.

[0054] On most video websites, the majority of pages are generated by scripts or programs that read data through an interface provided by the database and then emit HTML pages in a fixed format. Therefore, within the same website, pages with the same or similar semantic content usually also have the same or similar HTML syntax structure.
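Because pages of one site share the same HTML structure, a per-site template can be as simple as a list of the elements that hold the useful text. The following is a minimal sketch of that idea, not the patent's implementation: the site name, the field names, the tag and class pairs in `SITE_TEMPLATES`, and the helper names `TemplateExtractor`, `extract_video_text` and `build_training_set` are all hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical per-site templates: field name -> (tag, CSS class) wrapping that field.
SITE_TEMPLATES = {
    "example-video-site": {
        "title": ("h1", "video-title"),
        "tags": ("div", "video-tags"),
    },
}


class TemplateExtractor(HTMLParser):
    """Collects the text found inside the elements named by a site template."""

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.fields = {name: [] for name in template}
        self._stack = []  # open elements as (tag, matched field name or None)

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        match = None
        for name, (wanted_tag, wanted_class) in self.template.items():
            if tag == wanted_tag and wanted_class in classes:
                match = name
        self._stack.append((tag, match))

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise pop back to the matching open tag,
        # which also discards unclosed void elements such as <br>.
        if tag not in [t for t, _ in self._stack]:
            return
        while self._stack.pop()[0] != tag:
            pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for _, name in self._stack:
            if name is not None:
                self.fields[name].append(text)


def extract_video_text(html, template):
    """Returns the valid text of one page as a single string (its video text)."""
    parser = TemplateExtractor(template)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parts) for parts in parser.fields.values() if parts)


def build_training_set(pages):
    """pages: iterable of (site name, html). Returns one video text per page."""
    return [extract_video_text(html, SITE_TEMPLATES[site]) for site, html in pages]
```

Text outside the templated elements, such as navigation bars or advertisements, is ignored, which is the point of using a template rather than stripping all tags.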

[0055] Due to the particularity of HTML web pages, a method of extracting the text of the web page may...

Abstract

The invention relates to a method and a system for extracting complex named entities from Web video pages. The method comprises the following steps. Step 1: extract effective text information from each Web video page in a Web video page set, wherein the effective text information forms video texts, and all video texts form a training set. Step 2: classify the Web video pages, select classifications, set guide words for each selected classification, and select from the training set, as characteristic words, the words which are related to the guide words and are uniformly and intensively distributed among the classifications. Step 3: extract from the training set the words which are related to the characteristic words as candidate complex named entities, and select the corresponding complex named entities for each selected classification from the candidate named entities according to the correlation degree between the classification and the characteristic words related to the candidate named entities. The method and the system of the invention can extract complex named entities from Web video pages without long-running model training.
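The data flow of steps 2 and 3 can be sketched in a few lines. The abstract does not define how relatedness, distribution among classifications, or correlation degree are measured, so everything below is an assumption made for illustration: relatedness is taken to be co-occurrence within one video text, a characteristic word must have most of its co-occurrences in a single classification (`min_share`), and a candidate is scored by how many characteristic words it co-occurs with. The function names, parameters and thresholds are hypothetical.

```python
from collections import Counter, defaultdict


def select_feature_words(training_set, guide_words, min_count=2, min_share=0.8):
    """training_set: list of (category, tokens); guide_words: category -> set of words.

    Assumed criterion: keep a word for a category if it shares at least
    min_count texts with that category's guide words and at least min_share
    of all its guide-word co-occurrences fall in that category.
    """
    cooc = defaultdict(Counter)  # word -> category -> texts shared with a guide word
    for category, tokens in training_set:
        guides = guide_words.get(category, set())
        words = set(tokens)
        if words & guides:
            for word in words - guides:
                cooc[word][category] += 1
    features = defaultdict(set)
    for word, per_category in cooc.items():
        category, count = per_category.most_common(1)[0]
        if count >= min_count and count / sum(per_category.values()) >= min_share:
            features[category].add(word)
    return dict(features)


def extract_entities(training_set, features, guide_words, top_k=3):
    """Scores words that co-occur with feature words and keeps the top_k per category."""
    excluded = set().union(*features.values(), *guide_words.values())
    scores = defaultdict(Counter)  # category -> candidate -> co-occurrence score
    for _, tokens in training_set:
        words = set(tokens)
        for category, feature_set in features.items():
            hits = len(words & feature_set)
            if hits:
                for candidate in words - excluded:
                    scores[category][candidate] += hits
    return {c: [w for w, _ in s.most_common(top_k)] for c, s in scores.items()}
```

The sketch assumes the video texts are already segmented into tokens, with each complex named entity (for example a film title) appearing as one token. Both passes are plain counting over the training set, which matches the abstract's claim that no long model training is needed.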

Description

Technical field

[0001] The invention relates to the field of information extraction, in particular to a method for extracting complex named entities of Web video pages.

Background technique

[0002] With the growth of network bandwidth and the application of Web 2.0 technology, video sharing websites such as YouTube, Youku, and Tudou have developed rapidly at home and abroad, and the number of Internet videos and the scale of users have increased on a large scale. At present, there are more than 300 video sites on the domestic Internet, among which the number of videos on Youku, Tudou and other sites has exceeded 10 million. How to accurately and effectively extract text information from Web video pages has become an important issue in the field of information extraction. Text extraction from web pages is essentially a process of extracting information from semi-structured text.

[0003] Web video pages contain a large amount of text information, such as movie names, TV ser...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 郑刚, 张勇东, 郭俊波
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI