Intelligent system for automatic fuzzy extraction of webpage content

A technology for webpage content extraction systems, applied in website content management, network data indexing, network data retrieval, and similar fields. It addresses the labor cost and inefficiency of template-based extraction and the difficulty of covering the needs of fast-growing websites, and it achieves high-performance webpage content extraction.

Active Publication Date: 2019-04-19
中科国力(镇江)智能技术有限公司

AI Technical Summary

Problems solved by technology

Because web pages take so many different forms, manually built templates consume a great deal of labor and still cannot keep up with the needs of fast-growing websites, so extraction methods based on manually preset templates are very inefficient.

Detailed Description of the Embodiments

[0045] To describe the present invention more clearly, the following terms are defined and explained:

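[0046] (1) HTML web page, DOM tree, DOM node, and DOM node attributes: An HTML web page, or web page for short, is the collective term for HTML and H5 pages on the Internet and the mobile Internet. Under the international standard, an HTML web page is made up of a DOM (Document Object Model) tree, each node of which is called a DOM node, an HTML node, or simply a node. Each DOM node consists of a pair of tags and the content text between them, of the form <tag attribute>content text</tag>. The content text is the content part of the DOM node, and an attribute specifies some characteristic of the content text. For example, in a DOM node whose opening tag carries style="display:none", that attribute means the node is not displayed; it is abbreviated as the display:none attribute. Likewise, the node whose content text is 易贷网 has the attribute href="http://bj.edai.com", abbreviated as the href attribute.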
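The terms in paragraph [0046] can be illustrated with a minimal sketch that walks an HTML fragment and records each DOM node's tag, attributes, and content text. It uses only Python's standard `html.parser`. The sample markup, including the tag names `div`, `a`, and `span`, is an assumption made for illustration; the source text does not state which tags carry the two example attributes.

```python
from html.parser import HTMLParser


class NodeCollector(HTMLParser):
    """Records tag, attributes, and content text for every element seen."""

    def __init__(self):
        super().__init__()
        self.nodes = []    # one record per element, in order of opening tags
        self._stack = []   # elements that are currently open

    def handle_starttag(self, tag, attrs):
        record = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.nodes.append(record)
        self._stack.append(record)

    def handle_endtag(self, tag):
        # Close the innermost open element if it matches the end tag.
        if self._stack and self._stack[-1]["tag"] == tag:
            self._stack.pop()

    def handle_data(self, data):
        # Text belongs to the innermost open element.
        if self._stack:
            self._stack[-1]["text"] += data.strip()


def collect_nodes(html):
    parser = NodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Hypothetical fragment: the tag names are illustrative assumptions.
SAMPLE = (
    '<div><a href="http://bj.edai.com">易贷网</a>'
    '<span style="display:none">内容文本</span></div>'
)

nodes = collect_nodes(SAMPLE)
# Nodes carrying the display:none attribute (hidden content).
hidden = [
    n for n in nodes
    if "display:none" in (n["attrs"].get("style") or "").replace(" ", "")
]
# Values of every href attribute.
links = [n["attrs"]["href"] for n in nodes if "href" in n["attrs"]]
```

Here `hidden` holds the nodes that would not be displayed and `links` holds the href values, which mirrors the two attribute examples in the paragraph.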

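[0047] (2) Business topic, business topic node, business topic value, and business topic value node: Where no confusion arises, a business topic is simply called a topic. The content of every web page carries certain business topics, each reflecting one aspect of the business. A business topic can be refined into smaller business topics, called business subtopics (subtopics for short). In an HTML web page, business topics generally appear on nodes of a DOM tree: some nodes represent business topics (business topic nodes), and others represent business topic values (business topic value nodes). For example, Figure 3(a) shows a node from a financial website: an opening tag and its closing tag form one node that contains two nodes, namely 年化利率 (annualized interest rate) and 9.8%. In the present invention, 年化利率 is called a business topic node because the annualized interest rate is a business topic in the financial field, and 9.8%, which is the value of the annualized interest rate, is called a business topic value node.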
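The relationship between a business topic node and its value node can be sketched as follows. This is an illustration under stated assumptions, not the patent's procedure: it assumes a well-formed fragment, assumes the value node is the sibling that immediately follows the topic node, and uses hypothetical tag names (`div`, `p`, `li`, `span`, `em`), since the markup of Figure 3(a) is not reproduced in the text.

```python
import xml.etree.ElementTree as ET


def pair_topics_with_values(fragment, known_topics):
    """Return {topic: value} for topic nodes followed directly by a sibling.

    Assumption for illustration: the value node is the next sibling of the
    topic node under the same parent.
    """
    root = ET.fromstring(fragment)
    pairs = {}
    for parent in root.iter():
        children = list(parent)
        # Look at each pair of adjacent siblings.
        for topic_node, value_node in zip(children, children[1:]):
            topic = (topic_node.text or "").strip()
            if topic in known_topics:
                pairs[topic] = (value_node.text or "").strip()
    return pairs


# Hypothetical markup; only the texts 年化利率 and 9.8% come from the source.
FRAGMENT = "<div><p>首页</p><li><span>年化利率</span><em>9.8%</em></li></div>"

result = pair_topics_with_values(FRAGMENT, {"年化利率"})
```

With this fragment, `result` maps the topic 年化利率 to the value 9.8%, and the unrelated `p` node is ignored because its text is not a known topic.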

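[0048] (3) Naming elements of a business topic: When naming a business topic, web designers tend to choose words with a clear meaning. For example, car-loan financial websites often show the business topic "贷款金额" (loan amount), which has two naming elements, "贷款" (loan) and "金额" (amount); together they denote the amount of funds to be raised from the public. Table 1 lists the naming elements of some common business topics.

[0049] Table 1: Naming elements of common business topics on car-loan financial websites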

[0050]

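[0051] As another example, the business topic "年化利率" (annualized interest rate) has two naming elements, "年化" (annualized) and "利率" (interest rate), which clearly tell the user the investment return, a piece of information users care about.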
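One way naming elements could support fuzzy recognition of a topic label is sketched below. The scoring rule (the fraction of a topic's naming elements that occur in the label, left to right), the threshold, and the function names are assumptions made for illustration; the text available here does not disclose the patent's actual matching rule. Only the two topics and their naming elements come from the description.

```python
# Naming elements taken from the two examples in the description.
NAMING_ELEMENTS = {
    "贷款金额": ["贷款", "金额"],
    "年化利率": ["年化", "利率"],
}


def match_score(label, elements):
    """Fraction of naming elements found in `label`, scanning left to right."""
    position = 0
    found = 0
    for element in elements:
        index = label.find(element, position)
        if index >= 0:
            found += 1
            position = index + len(element)  # later elements must come after
    return found / len(elements)


def best_topic(label, threshold=1.0):
    """Return the best-scoring topic, or None if it scores below threshold."""
    scored = [
        (match_score(label, elements), topic)
        for topic, elements in NAMING_ELEMENTS.items()
    ]
    score, topic = max(scored)
    return topic if score >= threshold else None
```

Under this sketch a label such as 预期年化利率 still resolves to the topic 年化利率 even though it is not an exact match, while a label that contains none of the naming elements resolves to no topic.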

[0052] To facilitate fast...

Abstract

The invention discloses an intelligent system for fuzzy extraction of webpage content. The system comprises a module A together with modules B through F. Module B automatically generates fast multiple indexes of the HTML webpage content; module C generates candidate business topics; module D performs fuzzy verification on the candidate business topics; module E associates the candidate business topics with their corresponding XPaths; and module F extracts the HTML webpage content. The method has two advantages: (1) it does not depend on webpage content extraction templates; and (2) it performs automatic fuzzy recognition of the business topics in a webpage and accurately judges their meanings. These two characteristics ensure the accuracy and recall of automatic webpage content extraction.
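The idea of associating a topic with an XPath and then extracting content through it can be sketched roughly as below. This is not the patented implementation, whose module internals are not given in the abstract. It assumes well-formed markup, builds simple positional paths of the kind Python's `xml.etree.ElementTree` can evaluate, and assumes the value sits in the sibling after the topic node. The function names and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def index_text_by_path(root):
    """Map each element's text to a positional path relative to `root`."""
    index = {}

    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            index[text] = path
        counts = {}  # per-tag position among this node's children
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, ".")
    return index


def extract_value(root, topic_path):
    """Return the text of the sibling that follows the node at `topic_path`."""
    node = root.find(topic_path)
    parent = root.find(topic_path + "/..")
    if node is None or parent is None:
        return None
    siblings = list(parent)
    position = siblings.index(node)
    if position + 1 < len(siblings):
        return (siblings[position + 1].text or "").strip()
    return None


# Hypothetical page fragment; tag names are illustrative assumptions.
PAGE = "<ul><li><span>年化利率</span><em>9.8%</em></li></ul>"

root = ET.fromstring(PAGE)
index = index_text_by_path(root)
topic_path = index["年化利率"]
value = extract_value(root, topic_path)
```

The point of the design is that the path is discovered from the page itself instead of being written by hand in a template, which is the property the abstract lists as its first advantage.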

Description

Technical field

[0001] The invention relates to the field of automatic analysis and extraction of webpage content, and in particular to an intelligent system and method for automatic fuzzy extraction of webpage content.

Background

[0002] HTML web page information extraction mainly refers to extracting the important content needed from web pages written in HTML (HyperText Markup Language) or HTML5 and transforming the extracted content into a preset form. This content is critical for major applications such as merchant analysis, product and service analysis, and government regulation.

[0003] With the rapid spread of the Internet and the mobile Internet, Internet-based applications keep developing and websites in many formats keep appearing. To attract users, merchants also design their web pages in unique and varied styles.

[0004] The diversity of business themes and page f...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/958, G06F16/951
Inventors: 符建辉, 张燎
Owner: 中科国力(镇江)智能技术有限公司