Webpage purification system based on Render_DOM model and purification method thereof

A technology for webpage purification systems and DOM trees, applied in website content management, network data retrieval, other database retrieval, etc. It addresses problems such as DOM tree redundancy, excessive construction, and reliance on a single model algorithm, with the effect of eliminating webpage noise and ensuring safety.

Pending Publication Date: 2022-01-28
江苏省环科院环境科技有限责任公司

AI Technical Summary

Problems solved by technology

Existing methods are based either on the DOM tree model alone or on the CSSOM tree model alone, so each considers only a single-model algorithm.
Specifically, an algorithm based on the DOM tree model classifies content by the function of the HTML semantic tags and then extracts the corresponding DOM tree. All nodes are related to one another, so a tree structure can be built from the semantics of the HTML document that covers both the visible and invisible elements on the page and interprets the entire page structure and content. However, because it focuses only on semantic tags, it constructs many unnecessary DOM elements, such as tags that have little to do with the topic, which makes the DOM tree extremely redundant, and it does not account for layout changes caused by style elements. An algorithm based on the CSSOM tree model uses a nested box model to locate the position and display mode of the elements on the webpage: every element is positioned and laid out through the box model. Such an algorithm considers only the visual representation of the webpage and ignores the semantic content of the page.
Therefore, none of these methods unifies webpage semantics and visualization, which shows that a single model cannot divide a webpage fully and reasonably.



Examples


Embodiment Construction

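[0042] The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings of those embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art on the basis of the embodiments in this application, without creative effort, fall within the scope of protection of this application.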

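[0043] The present application provides a webpage purification system based on the Render_DOM model. The webpage purification system comprises an acquisition module, an analysis module, a rendering module, a segmentation module, and a view display module.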

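[0044] The acquisition module obtains the HTML information of the webpage. The analysis module parses the acquired HTML information, encodes the HTML into character strings, and generates a DOM tree; at the same time, it loads resources such as images, style sheets, and JS scripts and parses them to generate a CSSOM tree. The rendering module combines the generated DOM tree and CSSOM tree and renders them into a Render_DOM rendering tree, which is expressed in the form of a box model.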
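The combination of a DOM tree and a CSSOM tree described in [0044] can be illustrated with a minimal Python sketch. It is not the patent's implementation: the tuple-based DOM, the tag-keyed `cssom` dictionary, the `RenderNode` class, and the rule that `display: none` nodes are dropped are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RenderNode:
    """Hypothetical render-tree node: a DOM tag plus its computed style."""
    tag: str
    style: dict
    children: list = field(default_factory=list)


def build_render_tree(dom, cssom):
    """Combine a DOM node (tag, [children]) with a CSSOM lookup table.

    Assumption: nodes whose computed style has display == "none"
    are invisible and are left out of the render tree.
    """
    tag, children = dom
    style = cssom.get(tag, {})
    if style.get("display") == "none":
        return None
    node = RenderNode(tag, style)
    for child in children:
        rendered = build_render_tree(child, cssom)
        if rendered is not None:
            node.children.append(rendered)
    return node


def tags(node):
    """Flatten the render tree into a depth-first list of tag names."""
    return [node.tag] + [t for c in node.children for t in tags(c)]


# Hypothetical page: <aside> stands in for a noise element hidden by CSS.
dom = ("html", [
    ("head", [("title", [])]),
    ("body", [("div", [("p", [])]), ("aside", [])]),
])
cssom = {
    "head": {"display": "none"},
    "aside": {"display": "none"},
    "div": {"display": "block", "width": "600px"},
}

tree = build_render_tree(dom, cssom)
print(tags(tree))  # ['html', 'body', 'div', 'p']
```

Only visible nodes survive, and each surviving node carries the style that a box-model layout step would need.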

[0045] The optimization module comprises a Block block tree module, a merging module, and a segmentation module.

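[0046] The Block block tree module forms a Block block tree from the Block nodes of the Render_DOM rendering tree that correspond to the block-level elements in the HTML, using paths numbered in top-to-bottom, left-to-right order. The Block block tree comprises basic unit blocks, and the child nodes of a basic unit block are leaf nodes.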
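The numbering scheme in [0046] can be sketched as follows. The `Block` class, the `top`/`left` coordinates, and the dotted path format (`1.2.1`) are hypothetical choices for illustration; the source only states that paths are numbered top to bottom and left to right, and that the children of a basic unit block are leaf nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical Block node with the top-left corner of its box."""
    name: str
    top: int
    left: int
    children: list = field(default_factory=list)


def number_blocks(block, path="1", out=None):
    """Assign a numbered path to every block.

    Children are ordered top to bottom, then left to right.
    """
    if out is None:
        out = {}
    out[path] = block.name
    ordered = sorted(block.children, key=lambda b: (b.top, b.left))
    for i, child in enumerate(ordered, start=1):
        number_blocks(child, f"{path}.{i}", out)
    return out


def is_basic_unit(block):
    """A basic unit block has children, and all of them are leaves."""
    return bool(block.children) and all(not c.children for c in block.children)


header = Block("header", 0, 0, [Block("logo", 0, 0), Block("nav", 0, 200)])
sidebar = Block("sidebar", 100, 0, [Block("link", 100, 0)])
main = Block("main", 100, 300, [Block("p", 100, 300)])
# Children are deliberately listed out of visual order.
page = Block("body", 0, 0, [sidebar, header, main])

paths = number_blocks(page)
for path, name in paths.items():
    print(path, name)
```

In this example `header`, `sidebar`, and `main` are basic unit blocks, while `body` is not because its children have children of their own.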

[0047] The merging module merges basic unit blocks that reach a certain degree of similarity.
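A minimal sketch of the merging step in [0047] follows. The source does not say how similarity is measured, so this example assumes each basic unit block is summarized by a tag-path string and compares neighbours with `difflib.SequenceMatcher`; the `0.8` threshold and the function names are likewise illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Assumed similarity measure: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def merge_similar(blocks, threshold=0.8):
    """Group consecutive blocks whose similarity reaches the threshold.

    Each block is compared with the last member of the current group.
    """
    merged = []
    for block in blocks:
        if merged and similarity(merged[-1][-1], block) >= threshold:
            merged[-1].append(block)
        else:
            merged.append([block])
    return merged


# Hypothetical tag-path signatures of four basic unit blocks.
signatures = [
    "li>a>#text",
    "li>a>#text",
    "li>a>#text",
    "div>h2>#text>p>#text",
]
groups = merge_similar(signatures)
print([len(g) for g in groups])  # [3, 1]
```

The three repeated list items collapse into one group, which is the pattern typical of navigation bars and link lists.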

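[0048] The segmentation module arranges the character strings in the basic unit blocks according to the attributes in the box model to generate a string sequence, and splits off the repeated strings.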
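The segmentation step in [0048] can be sketched under stated assumptions: the box-model attributes used for ordering are taken to be the `top` and `left` coordinates of each string, and a string counts as repeated when its exact text has already appeared earlier in the sequence. `BoxedText` and `split_repeats` are hypothetical names.

```python
from typing import NamedTuple


class BoxedText(NamedTuple):
    """Hypothetical string with the position of its box."""
    text: str
    top: int
    left: int


def split_repeats(items):
    """Order strings by box position, then separate repeated strings.

    Returns (kept, repeated): first occurrences and later duplicates.
    """
    sequence = sorted(items, key=lambda it: (it.top, it.left))
    seen = set()
    kept, repeated = [], []
    for it in sequence:
        if it.text in seen:
            repeated.append(it)
        else:
            seen.add(it.text)
            kept.append(it)
    return kept, repeated


items = [
    BoxedText("Buy now", 300, 0),
    BoxedText("Article title", 0, 0),
    BoxedText("Buy now", 50, 400),
    BoxedText("Body text", 100, 0),
]
kept, repeated = split_repeats(items)
print([it.text for it in kept])      # ['Article title', 'Buy now', 'Body text']
print([it.text for it in repeated])  # ['Buy now']
```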

[0049] The view display module displays the webpage after it has been merged and segmented by the optimization module.

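[0050] Figure 1 shows an embodiment of the DOM tree model in the prior art. The HTML file code corresponding to the DOM tree model in Figure 1 is as follows: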

[0051]-[0065] (HTML listing for Figure 1; the markup did not survive extraction. Surviving text content: "DOM示例" (DOM Example), "A Example", "HelloWorld".)

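[0066] The HTML code above corresponds to the DOM tree model shown in Figure 1. As shown in Figure 1, HTML is the root node of the DOM tree model; HEAD and BODY are child nodes of the root node HTML, and HEAD and BODY in turn each have their own child nodes. In the DOM tree model, a child node that has lower-level child nodes is called an intermediate node, such as TITLE; a child node that has no lower-level child nodes and cannot be divided further is called a leaf node, such as BR.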
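The root, intermediate, and leaf distinction in [0066] can be illustrated with Python's standard `html.parser`. The sample markup below is a hypothetical stand-in, not the listing from the source: it reuses only the surviving strings and the tag names that [0066] mentions (HTML, HEAD, BODY, TITLE, BR), and the `h1` and `p` tags are assumptions. Text is counted as a `#text` child, which is why TITLE comes out as an intermediate node.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}  # never have children


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a small DOM-like tree; text becomes a '#text' child node."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        if self.stack:
            self.stack[-1].children.append(node)
        else:
            self.root = node
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip() and self.stack:
            self.stack[-1].children.append(Node("#text"))


def classify(node, is_root=True, out=None):
    """Label each node as root, intermediate (has children), or leaf."""
    if out is None:
        out = {"root": [], "intermediate": [], "leaf": []}
    kind = "root" if is_root else ("intermediate" if node.children else "leaf")
    out[kind].append(node.tag)
    for child in node.children:
        classify(child, False, out)
    return out


# Hypothetical sample page (assumed structure, see the note above).
SAMPLE = (
    "<html><head><title>DOM示例</title></head>"
    "<body><h1>A Example</h1><p>HelloWorld<br></p></body></html>"
)

builder = TreeBuilder()
builder.feed(SAMPLE)
kinds = classify(builder.root)
print(kinds)
```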

[0067] By using a parser to ...


Abstract

The invention provides a webpage purification system based on a Render_DOM model and a purification method thereof. The webpage purification system comprises an acquisition module, an analysis module, a rendering module, an optimization module, and a view display module. The analysis module is used for generating a Render_DOM rendering tree. Each tree node renderer of the Render_DOM rendering tree is embodied as a corresponding DOM visual node together with a rectangular frame of the CSS style rule calculated for that DOM visual node, and a block element of a tree node renderer is set as a Block node. The Block block tree module forms a Block block tree from the Block nodes of the Render_DOM rendering tree, using paths numbered from top to bottom and from left to right; the Block block tree comprises basic unit blocks. The merging module merges the nearest basic unit blocks on the Block block tree until the area of the rectangular frame corresponding to the basic unit blocks reaches a page block threshold value. The segmentation module deletes, as repeated content, the basic unit blocks whose similarity exceeds a similarity threshold value. The working efficiency is improved.

Description

Technical field

[0001] The present application relates to the technical field of webpage purification, and in particular to a system and method for purifying webpages based on the Render_DOM model.

Background technique

[0002] Nowadays, when people obtain information from web pages on the Internet, in addition to some high-quality web pages that contain only content related to the topic, they often encounter web pages with a lot of network noise pollution, that is, web pages that contain a lot of irrelevant content. This kind of webpage noise not only occupies normal webpage space, makes the webpage confusing, and blocks the normal document flow layout, but also visually interferes with people's acquisition of topic information. Therefore, webpages polluted by network noise need to be purified. In the prior art, webpage segmentation technology is generally used to purify webpages, and the noise-polluted parts are segmented. ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F16/9536, G06F16/958
CPC: G06F16/9536, G06F16/958
Inventor: 张佩佩
Owner: 江苏省环科院环境科技有限责任公司