System and method for identifying text floor of webpage

A text-floor identification technology, applied in the Internet field, that addresses the poor handling of multi-"floor" content and gives users a better reading experience.

Inactive Publication Date: 2012-11-14
BEIJING QIHOO TECH CO LTD +1
2 Cites, 25 Cited by

AI Technical Summary

Problems solved by technology

[0004] The purpose of the present invention is to remove the dependence of current text extraction technology on the largest text segment and to fix its poor handling of multi-"floor" content, so that when the text of a webpage is extracted and rearranged, not only can the news text be identified and extracted, but the comments on a news article and the multi-"floor" content of forums can be identified as well.

Examples

Embodiment Construction

[0068] The specific implementation manners of the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. The following examples are suitable for illustrating the invention, but are not intended to limit the scope of the invention.

[0069] The structural diagram of the system provided by the invention is shown in Figure 1.

[0070] The webpage parsing and layout module 100 parses the source code of the webpage and calculates the layout. An HTML parsing engine, for example a commonly used open-source engine such as WebKit, is used to parse the HTML source code and compute the layout. Parsing and layout can be based on the tags in the source code of the webpage, including but not limited to div tags, to generate the DOM tree of the webpage and to calculate the position and height of each node when the webpage is displayed. An example of a generated DOM tree is shown in Figure 3.
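The DOM-tree generation step of paragraph [0070] can be illustrated with a minimal sketch. It uses Python's standard `html.parser` module rather than WebKit, and it does not compute layout: the position and height of each node require a real rendering engine. The names `Node`, `DomBuilder`, `build_dom` and `VOID_TAGS` are hypothetical and are not taken from the patent.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag (assumed subset of HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM node: tag name, attributes, direct text, and child nodes."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a tree of Node objects from the tags in the page source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, tolerating sloppy markup.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.text = (self.current.text + " " + text).strip()


def build_dom(html):
    """Parse an HTML string and return the root of the generated tree."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

For example, `build_dom("<div><p>Hello</p></div>")` returns a `#document` root whose only child is the `div` node. A production system would take the tree, together with layout coordinates, from the rendering engine itself.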

[0071] Since it is difficult to display the dynamic effec...

Abstract

The invention discloses a system for identifying the text floors of a webpage. The system comprises a webpage parsing and layout module, a node identifying module, a floor dividing module and a mobile terminal page generation module. The webpage parsing and layout module parses the source code of the webpage and performs layout calculation on the parsing result to generate a DOM (Document Object Model) tree. The node identifying module traverses the DOM tree from its root node to identify the text nodes and the junk-text nodes in the tree. The floor dividing module divides the identified text nodes according to the floors of the webpage. The mobile terminal page generation module generates a mobile terminal page. With the system and method, in addition to the conventional content of Internet webpages, BBS text, news text and comments can be effectively identified and extracted, and the floor structure that the text has in the original webpage can be restored, so that the rendered result keeps the original multi-floor presentation and gives users an excellent reading experience.
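The node-identification and floor-division steps named in the abstract can be sketched as follows. The excerpt does not state the criteria the patent actually uses, so everything here is an assumption made for illustration: the `JUNK_WORDS` list, the `MIN_TEXT_LEN` threshold, and the rule that a floor is one of several sibling nodes sharing the same tag and class. All function and class names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

JUNK_WORDS = ("advertisement", "copyright", "all rights reserved")  # assumed junk markers
MIN_TEXT_LEN = 20  # assumed minimum length for a node to count as body text


@dataclass
class Node:
    tag: str
    cls: str = ""
    text: str = ""
    children: list = field(default_factory=list)


def classify(node):
    """Label a single node as 'junk', 'text' or 'other' (assumed heuristic)."""
    lowered = node.text.lower()
    if any(word in lowered for word in JUNK_WORDS):
        return "junk"
    if len(node.text) >= MIN_TEXT_LEN:
        return "text"
    return "other"


def collect_text(node):
    """Depth-first traversal returning the text of every 'text' node under node."""
    found = [node.text] if classify(node) == "text" else []
    for child in node.children:
        found.extend(collect_text(child))
    return found


def find_floor_container(root):
    """Return the node with the most same-signature children that hold text.

    Returns None when no node has at least two such children.
    """
    best, best_count = None, 1
    stack = [root]
    while stack:
        node = stack.pop()
        counts = Counter((c.tag, c.cls) for c in node.children if collect_text(c))
        top = max(counts.values(), default=0)
        if top > best_count:
            best, best_count = node, top
        stack.extend(node.children)
    return best


def divide_floors(root):
    """Group the identified text into floors, one list of strings per floor."""
    container = find_floor_container(root)
    if container is None:
        texts = collect_text(root)
        return [texts] if texts else []
    # Every child of the container that holds text becomes one floor.
    return [texts for c in container.children if (texts := collect_text(c))]
```

On a forum-like tree with two sibling post containers, `divide_floors` returns two lists, one per post, with junk nodes left out. The sibling-signature rule is deliberately crude: it would also split the repeated paragraphs of a single news article into separate floors, which a real implementation would have to rule out using the criteria in the full description.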

Description

Technical field

[0001] The invention relates to the Internet field, in particular to a method for identifying the floors of webpage text.

Background technique

[0002] With the development and popularization of mobile terminals, people increasingly use mobile terminals to browse webpages. However, since most websites on the Internet do no special processing for display on mobile terminals, most webpages are displayed in distorted form on mobile terminals, resulting in an extremely poor reading experience for users.

[0003] The current method of improving the user's reading experience is to extract and rearrange the text of the webpage and then display it to the user again. This works well for news and information webpages with large sections of content, but user comments are discarded. For forums where the text is divided into multiple "floors", the effect is even worse: either only the text of a single floor is recognized, or no text is recognized at all. The spam inform...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F40/143
CPC: G06F17/227, G06F17/2247, G06F17/30, G06F17/3089, G06F16/958, G06F40/154, G06F40/143
Inventor: 陈营营
Owner: BEIJING QIHOO TECH CO LTD