Web data collection method and system based on AJAX (Asynchronous JavaScript and XML)

A data acquisition system and data acquisition module, applied in the fields of electrical digital data processing, special data processing applications, and instruments, that achieves accurate and efficient Web data acquisition and secure data transmission.

Inactive Publication Date: 2013-09-04
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

A network splitter (TAP) can guarantee 100% data capture without packet loss, but it must be purchased at additional cost and can monitor only one link at a time.

Embodiment Construction

[0033] A non-limiting embodiment is given below, with reference to the accompanying drawings, to further illustrate the invention.

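[0034] Figure 1 shows the architecture of the Web data collection system based on AJAX technology. The system consists mainly of the following modules: a client listening module, a data collection module, a data transmission module, and a central database module.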

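[0035] The client listening module is the interface between the user and the system; its main function is to listen for client-side onclick events. JS probe code embedded in the target website monitors click events on the client. Each user and each target website to be collected is assigned a unique identifier, uid and web_id respectively; one user may apply for multiple web_uid values in order to deploy on different websites. The probe code takes the following form: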
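The probe code itself is not reproduced in this text. As a purely hypothetical illustration of the idea (not the patent's code), a click-listening probe could look like the sketch below. The function name `installProbe`, the `ids` object, and the `onClick` callback are assumptions introduced here; only the identifiers uid and web_id come from the description.

```javascript
// Hypothetical probe sketch; the source omits the actual probe code.
// `ids.uid` and `ids.webId` stand in for the uid and web_id of the text.
function installProbe(target, ids, onClick) {
  const handler = (event) => {
    // Report which user/site the click belongs to and when it happened.
    onClick({
      uid: ids.uid,
      web_id: ids.webId,
      type: event.type,
      timestamp: Date.now(),
    });
  };
  target.addEventListener('click', handler);
  // Return a function that detaches the probe again.
  return () => target.removeEventListener('click', handler);
}

// In a browser, `target` would typically be `document`, for example:
// installProbe(document, { uid: 'u-1', webId: 'w-1' }, startCollection);
// where `startCollection` is whatever triggers the data collection module.
```

Passing the event target in as a parameter keeps the sketch independent of a real page, so the same function can be exercised against any object that implements `addEventListener`.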

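[0036] The data collection module performs collection through three parts working together: an HTML parser, a filter, and a collector. When the client listening module detects a user click, it triggers the data collection module to collect data. The whole document (HTML) is treated as consisting of tag elements, attributes, and text. The HTML parser maps the tag elements in the document to a node tree made up of hierarchical nodes, where each node represents an HTML tag element.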

[0037] The HTML node tree is built using regular expressions in JavaScript. The pseudocode is as follows:

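[0038] while (input has not reached end of file) {
[0039]   get the next tag in the HTML document
[0040]   if (tag obtained successfully) { return the tag name and tag type }
[0041]   if (tag obtained successfully) {
[0042]     if (tag is a start tag) {
[0043]       if (root node is empty) {
[0044]         create the root node from the tag name
[0045]         point the current node at the root node
[0046]         continue }
[0047]       else {
[0048]         if (tag is one that has no end tag) {
[0049]           create a new node from the tag name and assign its value,
[0050]           point the current node at the current node's parent }
[0051]         else {
[0052]           create a new node from the tag; point the current node at the new node
[0053]         }}}
[0054]     else {
[0055]       if (current node's name differs from the end tag name) {
[0056]         for each ancestor node between the current node and the node matching the end tag,
[0057]         if that ancestor cannot be matched, delete it and adjust the HTML node tree }
[0058]       point the current node at the node matching the end tag
[0059]       if (current node is a leaf node) {
[0060]         set the current node's value to the content between the start tag and the end tag
[0061]       }}}}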

[0062] The generated HTML node tree is shown in Figure 3.
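The pseudocode above can be sketched in runnable JavaScript roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `buildNodeTree`, `makeNode`, and `VOID_TAGS` are invented here, the list of tags without end tags is partial, unmatched open nodes are closed implicitly rather than deleted, and after an end tag the current node is assumed to move back to the parent (the pseudocode does not state this step). A regular expression of this kind also cannot handle every HTML construct, such as `>` inside attribute values.

```javascript
// Tags that never take an end tag (a partial list, chosen for illustration).
const VOID_TAGS = new Set(['br', 'hr', 'img', 'input', 'meta', 'link']);

function makeNode(name, parent) {
  return { name, parent, children: [], text: '' };
}

function buildNodeTree(html) {
  // One start or end tag: group 1 is "/" for end tags, group 2 is the tag name.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let root = null;
  let current = null;
  let lastIndex = 0; // end offset of the previously matched tag
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const isEndTag = match[1] === '/';
    const name = match[2].toLowerCase();
    const textBefore = html.slice(lastIndex, match.index).trim();
    lastIndex = tagPattern.lastIndex;

    if (!isEndTag) {
      if (root === null) {
        root = makeNode(name, null);
        current = root;
        continue;
      }
      if (current === null) continue; // ignore tags after the root has closed
      const node = makeNode(name, current);
      current.children.push(node);
      // Tags without an end tag leave the current node on the parent.
      if (!VOID_TAGS.has(name)) current = node;
      continue;
    }

    // End tag: climb to the nearest open node with the same name.
    let target = current;
    while (target !== null && target.name !== name) target = target.parent;
    if (target === null) continue; // stray end tag, nothing to close
    // Only leaf nodes receive the text between their start and end tags.
    if (target.children.length === 0) target.text = textBefore;
    current = target.parent; // assumption: resume at the parent after closing
  }
  return root;
}
```

For example, `buildNodeTree('<div><p>Hello</p><br></div>')` returns a `div` node whose children are a `p` node with text `Hello` and an empty `br` node. Each node keeps a `parent` reference, so the tree is circular and would need those references stripped before being serialized as JSON.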

[0063]...

Abstract

The invention discloses a Web data collection system based on AJAX (Asynchronous JavaScript and XML) and relates to Internet data collection technology. JS probe code deployed on a target website monitors page click events. An HTML parser maps the tag elements of the whole (HTML) document into a node tree of hierarchical nodes. Once the node tree is generated, a filter removes redundant tags and unnecessary content from it, so that the final collected result contains as little irrelevant information as possible. A collector then traverses the tags in the HTML node tree and collects the corresponding text content at the Web front end. Finally, the collected data is packaged using AJAX and transmitted asynchronously to a database. The system collects user posts and comments promptly and accurately and achieves efficient Web data collection.
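The filter, collector, and packaging steps named in the abstract can be sketched as below. This is a hedged illustration only: the text does not say which tags count as redundant, what the payload looks like, or where it is sent, so `REDUNDANT_TAGS`, the JSON layout, the function names, and the `endpoint` parameter are all assumptions made for the example. Nodes are plain objects of the form `{ name, text, children }`.

```javascript
// Tags treated as redundant here; the source does not list which tags it filters.
const REDUNDANT_TAGS = new Set(['script', 'style', 'noscript']);

// Filter: return a copy of the tree without redundant subtrees.
// Returns null when the node itself is redundant.
function filterTree(node) {
  if (REDUNDANT_TAGS.has(node.name)) return null;
  const children = (node.children || [])
    .map(filterTree)
    .filter((child) => child !== null);
  return { name: node.name, text: node.text || '', children };
}

// Collector: depth-first traversal gathering non-empty text with its tag name.
function collectText(node, out = []) {
  if (node.text) out.push({ tag: node.name, text: node.text });
  for (const child of node.children) collectText(child, out);
  return out;
}

// Packaging: serialize the collected items together with the identifiers.
function buildPayload(ids, items) {
  return JSON.stringify({ uid: ids.uid, web_id: ids.webId, items });
}

// Transmission: `endpoint` is a placeholder URL supplied by the caller.
// This function only works where XMLHttpRequest exists (a browser).
function sendAsync(endpoint, payload) {
  const xhr = new XMLHttpRequest();
  xhr.open('POST', endpoint, true); // third argument true = asynchronous
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.send(payload);
}
```

A typical call chain would be `buildPayload(ids, collectText(filterTree(tree)))`, with the resulting string handed to `sendAsync`. Keeping packaging separate from transmission lets the first three steps be checked without a browser.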

Description

Technical field

[0001] The invention relates to Internet data collection technology, in particular to a Web data collection method and system based on AJAX technology.

Background

[0002] With the rapid development of Internet technology, the number of web pages and websites has grown explosively, making the Internet a huge, widely distributed data source. Web data collection supports a wide range of services and research, such as search engine retrieval, content security detection, user interest mining, and personalized information acquisition, and therefore has important application value and practical significance.

[0003] In recent years, Ajax (Asynchronous JavaScript and XML) has become a technical focus in Web development, and various Ajax frameworks have developed rapidly. At present, all major browser platforms support Ajax, and Ajax has become a mainstream development technology for Web applications. In this collection sy...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 唐红, 杨广, 徐川
Owner: CHONGQING UNIV OF POSTS & TELECOMM