Method for extracting webpage subject contents

A technique for extracting webpage subject content, applied in the field of webpage subject information extraction. It addresses three problems: a training set can hardly cover every situation, the generated template cannot accurately extract the subject content, and the text cannot be extracted separately from other fields.

Status: Inactive; Publication Date: 2011-09-21
SAMSUNG ELECTRONICS CHINA R&D CENT +1
Cites: 1; Cited by: 24
AI Technical Summary

Problems solved by technology

[0004] However, it is difficult for such a training set to cover every situation, so the generated template cannot accurately extract the subject content. In addition, the existing methods extract only a single block of subject content per webpage and cannot separately extract the text (description), title, category, and so on.
Moreover, training templates through machine learning cannot be performed on resource-limited devices such as mobile terminals.

Examples

Embodiment approach

[0024] The method for extracting webpage subject content in this embodiment involves RSS information, so the RSS information will be described first.

[0025] RSS is a format for describing and syndicating website content, and it is a relatively new means of publishing information. Many current web pages, such as blogs and news websites, are published together with RSS information. RSS information can be consumed directly by other sites, and because the data is in the standard Extensible Markup Language (XML) format, it can also be used by other terminals and services.

[0026] RSS is currently the most widely used XML application. A sub-channel of a portal website (for example, a technology channel), or the collected posts of a single blogger, has an RSS file that maintains the RSS information of its latest web pages. Generally, an RSS file contains the RSS information of only the most recently updated web pages, and it changes as new information is published.
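
As an illustration of the format described above, the following minimal Python sketch parses a small RSS 2.0 document with the standard library and picks the most recently published item. The sample feed, its URLs, and the helper name latest_item are hypothetical and are not taken from the patent.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed with two items (illustration only).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Tech Channel</title>
    <item>
      <title>Older post</title>
      <link>http://example.com/older</link>
      <description>Older body text.</description>
      <category>Tech</category>
      <pubDate>Mon, 01 Aug 2011 08:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Newest post</title>
      <link>http://example.com/newest</link>
      <description>Newest body text.</description>
      <category>Tech</category>
      <pubDate>Tue, 02 Aug 2011 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def latest_item(rss_text):
    """Return the fields of the most recently published <item> as a dict."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        # Map each child tag (title, link, description, ...) to its text.
        items.append({child.tag: (child.text or "").strip() for child in item})
    # pubDate uses the RFC 822 date format, which parsedate_to_datetime reads.
    return max(items, key=lambda f: parsedate_to_datetime(f["pubDate"]))


if __name__ == "__main__":
    newest = latest_item(SAMPLE_RSS)
    print(newest["title"], newest["link"])
```

Each item already separates the title, description, and category, which is what makes the RSS information usable as a reference when locating those fields in the corresponding webpage.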

[0...

Abstract

The invention discloses a method for extracting webpage subject content. The method comprises the following steps: selecting the latest RSS (Really Simple Syndication) information and its corresponding webpage from an RSS file; searching for the position of the RSS information in the DOM tree of the corresponding webpage, the information at that position being taken as a webpage template; and extracting the subject content of a plurality of target webpages by using the webpage template. The method further comprises the following step: after the subject content of a predetermined number of the target webpages has been extracted, regenerating the webpage template and continuing to extract the subject content of the remaining target webpages.
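
The steps in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented implementation: pages are assumed to be well-formed XHTML so that the standard library XML parser can build the tree, a "position" is represented as a list of child indices from the root, and all names (find_path, build_template, extract, extract_many) and sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_path(root, text):
    """Child-index path to the deepest element whose full text equals `text`."""
    def walk(node, path):
        for i, child in enumerate(node):
            found = walk(child, path + [i])
            if found is not None:
                return found
        if "".join(node.itertext()).strip() == text:
            return path
        return None
    return walk(root, [])


def build_template(rss_item, page_root):
    """Map each RSS field to its position in the page's tree, when found."""
    template = {}
    for field, value in rss_item.items():
        path = find_path(page_root, value)
        if path is not None:
            template[field] = path
    return template


def extract(template, page_root):
    """Follow each stored path in a target page and collect the text there."""
    result = {}
    for field, path in template.items():
        node = page_root
        for index in path:
            children = list(node)
            node = children[index] if index < len(children) else None
            if node is None:
                break
        result[field] = None if node is None else "".join(node.itertext()).strip()
    return result


def extract_many(target_pages, make_template, regenerate_every):
    """Extract from each page, rebuilding the template every N pages."""
    template = make_template()
    results = []
    for count, page in enumerate(target_pages, start=1):
        results.append(extract(template, ET.fromstring(page)))
        if count % regenerate_every == 0:
            template = make_template()  # e.g. from the newest RSS item
    return results


# Hypothetical pages sharing one layout (illustration only).
SOURCE_PAGE = """<html><body>
<div id="nav"><a>Home</a><a>News</a></div>
<div id="main"><h1>Newest post</h1><span>Tech</span><p>Newest body text.</p></div>
<div id="ads"><p>Buy now</p></div>
</body></html>"""

TARGET_PAGE = """<html><body>
<div id="nav"><a>Home</a><a>News</a></div>
<div id="main"><h1>Another post</h1><span>Science</span><p>Another body.</p></div>
<div id="ads"><p>Buy later</p></div>
</body></html>"""

RSS_ITEM = {"title": "Newest post", "category": "Tech",
            "description": "Newest body text."}

if __name__ == "__main__":
    template = build_template(RSS_ITEM, ET.fromstring(SOURCE_PAGE))
    print(extract(template, ET.fromstring(TARGET_PAGE)))
```

Because the template is derived from a page whose fields are already known from the RSS information, no training set is needed, and the title, category, and description are extracted as separate fields while navigation and advertisement blocks are skipped.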

Description

Technical field

[0001] The invention relates to the extraction of webpage subject information, and in particular to the extraction of webpage subject content.

Background technique

[0002] Web pages contain navigation links, script programs, related articles, advertisement links, copyright information, and other noise that is irrelevant to the subject content. Removing this noise and extracting the subject content of web pages is valuable in many applications, such as improving webpage classification in search engines, deduplicating webpages, and giving mobile terminals direct access to webpage subject content.

[0003] At present, the technologies for extracting webpage subject content fall into two main categories. One is mainly applied to structured webpages: by analyzing the characteristics of structured webpages, a template for extracting data is found and used to extract webpage data in batches. The other is to build a training set of web ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 沈文南, 酆晓杰, 王艳丽, 王进, 玄东俊
Owner: SAMSUNG ELECTRONICS CHINA R&D CENT