Web page subject extraction system and method

An extraction system and method applied to web page theme extraction. It addresses the sparse vocabulary and unsatisfactory overall results of existing approaches, and aims to avoid both information loss and vocabulary sparsity.

Inactive Publication Date: 2007-11-28
TENCENT TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

However, because natural language is highly expressive, it is common for several words to share one meaning. Combined with the use of rhetoric, this makes vocabulary sparsity unavoidable, especially in short texts such as web pages, so the overall effect of this algorithm is not ideal.

Embodiment Construction

[0023] The present invention will be further elaborated below according to the drawings and specific embodiments.

[0024] As shown in FIG. 1, the webpage theme extraction system of the present invention includes a document parser 1, a word segmentation module 2, a word segmentation postprocessing module 3, a sememe processing module and a webpage theme output interface 7. The sememe processing module includes a sememe expansion module 4, a webpage theme sememe calculation module 5 and a sememe recovery keyword module 6. Modules of a website or other application system that interact with the system of the present invention may include a website webpage storage center 8, a website navigation tree generation system 9, a webpage category calculation module 10 and a webpage theme application module 11. In the present invention, the document parser 1 is an Html (HyperText Markup Language) document parser.
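The module order in paragraph [0024] can be pictured as a chain of stages, each consuming the output of the previous one. The following is only a minimal sketch: the function `run_pipeline`, the stage names and the stand-in lambdas in `demo_stages` are hypothetical illustrations, not the interfaces of the patented modules, and the sememe processing module (modules 4 to 6) is collapsed into a single placeholder stage.

```python
from typing import Any, Callable, List, Tuple

# A stage is a (name, function) pair; names mirror the modules in [0024].
Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], page_source: str) -> Any:
    """Feed the page source through each stage in order."""
    data: Any = page_source
    for _name, stage in stages:
        data = stage(data)
    return data


# Stand-in stages (assumptions for illustration only).
demo_stages: List[Stage] = [
    # Document parser 1: here just strips one known tag pair.
    ("document_parser", lambda src: src.replace("<b>", "").replace("</b>", "")),
    # Word segmentation module 2: whitespace split as a placeholder.
    ("word_segmentation", str.split),
    # Word segmentation postprocessing module 3: lowercase and drop stop words.
    ("segmentation_postprocessing",
     lambda words: [w.lower() for w in words if w.lower() not in {"the", "a"}]),
    # Sememe processing module (4, 5, 6): placeholder that deduplicates and sorts.
    ("sememe_processing", lambda words: sorted(set(words))),
]

theme_words = run_pipeline(demo_stages, "<b>The</b> car price")
```

The result of the last stage is what the webpage theme output interface 7 would hand to downstream consumers such as the webpage theme application module 11.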

[0025] Among them, the Html docume...

Abstract

The invention discloses a web page theme extraction system comprising a document parser, a word segmentation module and a sememe processing module. The document parser extracts the page title and the text with different highlighting attributes from the web page source file. The word segmentation module segments the page text, title and category information into words to obtain a first word list. The sememe processing module converts the words of the first word list into sememes, calculates the weight of each sememe, and then maps the sememes back to words to obtain the set of theme words. The invention also discloses a web page theme extraction method. The invention avoids the problems of vocabulary sparsity and information loss.
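The sememe step in the abstract can be illustrated with a small sketch. Everything specific here is an assumption: the toy mapping `SEMEME_DICT`, the rule that a sememe's weight is the plain sum of the weights of the words mapped to it, and the restoration rule that picks the highest-weighted source word per sememe. The source does not give the actual weighting formula or sememe dictionary.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical toy word -> sememe mapping (not taken from the patent).
SEMEME_DICT: Dict[str, List[str]] = {
    "car": ["vehicle"],
    "automobile": ["vehicle"],
    "sedan": ["vehicle"],
    "weather": ["climate"],
}


def sememes_of(word: str) -> List[str]:
    """Look up a word's sememes; unknown words act as their own sememe."""
    return SEMEME_DICT.get(word, [word])


def sememe_weights(word_weights: Dict[str, float]) -> Dict[str, float]:
    """Assumed rule: a sememe's weight is the sum of its words' weights."""
    totals: Dict[str, float] = defaultdict(float)
    for word, weight in word_weights.items():
        for sememe in sememes_of(word):
            totals[sememe] += weight
    return dict(totals)


def restore_keywords(word_weights: Dict[str, float],
                     totals: Dict[str, float],
                     top_n: int = 1) -> List[str]:
    """Map the top-weighted sememes back to words from the first word list."""
    top = sorted(totals, key=lambda s: (-totals[s], s))[:top_n]
    keywords = []
    for sememe in top:
        candidates = [w for w in word_weights if sememe in sememes_of(w)]
        # Pick the heaviest word; ties are broken alphabetically.
        keywords.append(min(candidates, key=lambda w: (-word_weights[w], w)))
    return keywords


# First word list with illustrative weights.
first_word_list = {"car": 1.5, "automobile": 1.0, "sedan": 1.0, "weather": 2.0}
totals = sememe_weights(first_word_list)
themes = restore_keywords(first_word_list, totals, top_n=1)
```

In this toy input, "weather" is the single heaviest word, but the three synonyms for a vehicle pool their weight under one sememe, so the restored theme word is "car". This is the kind of synonym scattering that the abstract refers to as vocabulary sparsity.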

Description

Technical field

[0001] The present invention relates to network communication technology, and more specifically, to a system and method for extracting webpage themes.

Background

[0002] A webpage theme is the abstract or keyword list of the event described in the text of a webpage; it indicates the subject matter and central idea of the webpage. There are mainly two existing webpage theme extraction methods. The first is title-based webpage theme extraction. An Html (HyperText Markup Language) document parser analyzes the html webpage according to the html protocol and builds an html syntax tree from the tags in the webpage source file in order to locate the title, body text and other content of the webpage. The value of the <title> tag in the page header is then taken as the main idea of the webpage. This method is an early and commonly used webpage topic ...
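As an illustration of the title-based approach described in the background, the sketch below uses Python's standard `html.parser` module to pull out the text of the first <title> element. The class `TitleExtractor` and the function `title_based_topic` are hypothetical names, and the sketch reads the tag stream directly rather than building a full syntax tree as the background describes.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def title_based_topic(html_source: str) -> str:
    """Return the page title with whitespace normalized, or '' if absent."""
    parser = TitleExtractor()
    parser.feed(html_source)
    parser.close()
    return " ".join("".join(parser.title_parts).split())


page = ("<html><head><title> Used  car prices </title></head>"
        "<body><p>Body text</p></body></html>")
topic = title_based_topic(page)
```

The sketch returns an empty string when a page has no title, which shows one obvious limit of relying on the title alone.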

Application Information

Patent Type & Authority Applications(China)
IPC IPC(8): G06F17/30
Inventor 丁江伟
Owner TENCENT TECH (SHENZHEN) CO LTD