System and method for recognizing content posts of webpage

A webpage text recognition technology applied in the Internet field. It addresses problems such as the poor handling of multi-"floor" (multi-post) content and aims to give users a better reading experience.

Active Publication Date: 2015-03-25
北京鸿享技术服务有限公司
3 Cites, 7 Cited by

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to overcome two weaknesses of current text extraction technology: its dependence on the largest text segment and its poor handling of multi-"floor" content. When text is extracted from a webpage and rearranged, the invention can identify and extract not only the news text but also the comments on the news, and it can identify multi-"floor" content in forums.

Examples

Embodiment Construction

[0069] The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and embodiments. The following examples serve to illustrate the present invention and are not intended to limit its scope.

[0070] The structure of the system provided by the present invention is shown in Figure 1.

[0071] The webpage parsing and layout module 100 parses the webpage source code and performs layout calculation. An HTML parsing engine is used to parse and lay out the HTML source; a commonly used open-source engine such as WebKit is suitable. Parsing and layout are based on the tags in the page source, including but not limited to div tags. They produce the DOM tree of the webpage and compute the position and height at which each node is displayed when the page is rendered. The generated DOM tree is shown in Figure 3.
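As an illustration of the tree-building half of this step, the following is a minimal sketch that uses Python's standard `html.parser` module in place of a full engine such as WebKit. The names `Node`, `DomBuilder`, and `build_dom` are hypothetical and do not come from the patent. The sketch does not perform layout calculation: the position and height of each node require a rendering engine and are not reproduced here.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree (hypothetical structure)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element


class DomBuilder(HTMLParser):
    """Builds a Node tree from HTML source using a stack of open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self._stack[-1])
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].text += data.strip()


def build_dom(html_source):
    """Parse HTML source and return the root of the simplified DOM tree."""
    builder = DomBuilder()
    builder.feed(html_source)
    builder.close()
    return builder.root


sample = '<html><body><div id="post1"><p>Hello</p><br><p>World</p></div></body></html>'
dom = build_dom(sample)
```

The tree is built from every tag in the source, not only `div` tags, which matches the "including but not limited to div tags" wording above. A production system would rely on an engine that also computes layout.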

[0072] Since it is difficult...

Abstract

The invention discloses a system for recognizing content posts of webpages. The system comprises a webpage analysis and layout module, a node recognition module, a post dividing module, and a mobile terminal page generating module. The webpage analysis and layout module analyzes the source code of a webpage and performs layout calculation on the result to generate a document object model (DOM) tree of the webpage. The node recognition module distinguishes content nodes from spam-word nodes by traversing the DOM tree from its root node. The post dividing module divides the recognized content nodes according to the posts of the webpage. The mobile terminal page generating module generates a mobile terminal page. After recognizing and extracting the contents of traditional Internet webpages, the system and method can extract bulletin board system (BBS) contents, news contents, and comments, and can restore the "post-by-post" display of the original webpage. The displayed result keeps the original "multi-post" structure and gives users a better reading experience.
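The last three modules of this pipeline can be pictured with a small sketch. The abstract does not say how content nodes are told apart from other nodes or how post boundaries are found, so every rule below is an assumption made only for illustration: a node counts as content when its own text has at least `MIN_TEXT_LEN` characters, and each `article` element is treated as one post. The names `Node`, `collect_content`, `divide_posts`, and `render_mobile` are hypothetical.

```python
from dataclasses import dataclass, field
from html import escape

MIN_TEXT_LEN = 10  # hypothetical threshold, not taken from the patent


@dataclass
class Node:
    """Simplified DOM node: a tag, its own text, and child nodes."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def is_content(node):
    # Assumed rule: enough text of its own means the node is content.
    return len(node.text.strip()) >= MIN_TEXT_LEN


def collect_content(node):
    """Depth-first traversal returning content nodes in document order."""
    found = [node] if is_content(node) else []
    for child in node.children:
        found.extend(collect_content(child))
    return found


def divide_posts(root, post_tag="article"):
    """Group content text by the container assumed to hold one post."""
    posts = []

    def walk(node):
        if node.tag == post_tag:
            texts = [n.text.strip() for n in collect_content(node)]
            if texts:
                posts.append(texts)
            return
        for child in node.children:
            walk(child)

    walk(root)
    return posts


def render_mobile(posts):
    """Emit a bare mobile page with one section per post."""
    parts = ['<!DOCTYPE html>',
             '<meta name="viewport" content="width=device-width, initial-scale=1">']
    for i, texts in enumerate(posts, 1):
        body = "".join(f"<p>{escape(t)}</p>" for t in texts)
        parts.append(f'<section id="post-{i}">{body}</section>')
    return "\n".join(parts)


page = Node("body", children=[
    Node("nav", text="Home"),
    Node("article", children=[
        Node("p", text="First post: the original question text."),
        Node("span", text="Reply"),
    ]),
    Node("article", children=[
        Node("p", text="Second post: a reply to the question."),
    ]),
])
posts = divide_posts(page)
mobile_html = render_mobile(posts)
```

In this toy page the short navigation and button labels are dropped and the two posts stay separate in the output, which is the "post-by-post" behavior the abstract describes. The actual recognition and division criteria of the invention may differ.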

Description

[0001] The present patent application is a divisional application of a Chinese invention patent application with an application date of June 25, 2012, application number 201210214079.9, and the title "A System and Method for Identifying the Text Floor of a Web Page".

Technical field

[0002] The invention relates to the field of the Internet, and in particular to a method for identifying the text floor of a webpage.

Background

[0003] With the development and popularization of mobile terminals, more and more people use mobile terminals to browse webpages. However, most websites on the Internet do not adapt their display for mobile terminals, so most webpages are displayed in a distorted form on a mobile terminal, which gives the user an extremely poor reading experience.

[0004] The current method to improve the user's reading experience is to extract and rearrange the text of the web page, and then r...

Application Information

IPC(8): G06F17/30
CPC: G06F16/9577
Inventor: 陈营营
Owner: 北京鸿享技术服务有限公司