News website general crawler design method based on GRU neural network

A neural-network-based crawler design method, applied in the computer field, that addresses the difficulty of real-time public opinion monitoring and the labor and time cost of customizing crawlers, with the effects of reducing noise, improving model accuracy, and reducing complexity.

Active Publication Date: 2019-12-03
XI AN JIAOTONG UNIV

AI Technical Summary

Problems solved by technology

[0002] Some public opinion systems have already been implemented, but these public opinion systems only monitor a few fixed news websites, and it is difficult to achieve real-time contr



Examples


Embodiment Construction

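[0049] The present invention provides a general crawler design method for news websites based on a GRU neural network. A GRU-based algorithm extracts the body text from HTML pages of differing layouts; a site-wide crawler is then built that crawls page content and extracts the body text with the designed neural network.

[0050] The general crawler design method for news websites based on a GRU neural network comprises the following steps:

[0051] S1. Preprocess the HTML page content by, in order, preprocessing the HTML data, constructing the target data and labeling characters, building a character dictionary, converting the HTML content into numeric vectors, and finally padding each batch;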

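[0052] HTML data preprocessing

[0053] Remove tags that are meaningless or likely to carry noise (the tag names given as examples are missing from the source text); remove the attributes of all tags; remove tag content that consists only of whitespace.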
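The three cleaning rules above can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the set `NOISY_TAGS` is hypothetical (the source's example tag names did not survive), and the names `HtmlCleaner` and `clean_html` are introduced here for the sketch.

```python
from html.parser import HTMLParser

# Assumption: the source's example tag names were lost, so this set is hypothetical.
NOISY_TAGS = {"script", "style"}


class HtmlCleaner(HTMLParser):
    """Rebuilds the page without noisy tags, attributes, or whitespace-only text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a noisy tag

    def handle_starttag(self, tag, attrs):
        if tag in NOISY_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(f"<{tag}>")  # attributes are dropped here

    def handle_endtag(self, tag):
        if tag in NOISY_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep text only if it is outside noisy tags and not pure whitespace.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data)


def clean_html(html: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)


raw = '<div class="a"><script>var x=1;</script><p id="t">Hello</p><span>   </span></div>'
cleaned = clean_html(raw)
```

A streaming parser is used here rather than regular expressions because it tolerates nested and malformed markup; any HTML library with tag-level access would serve equally well.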

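[0054] Constructing the target data and labeling characters

[0055] Build a string of the same length as the sample. Using the body text crawled for that page, set the character at each position of the HTML content that belongs to the body text to "1" and every other character to "2". This turns the extraction task into a three-class classification task at the level of single characters (the third class is the padding character introduced later).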
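A minimal sketch of this labeling step follows. It assumes, for simplicity, that the body text occurs as one contiguous substring of the HTML content; the source does not say how body text that is split across tags is aligned, so that case is not handled. The function name `build_target` is hypothetical.

```python
def build_target(html: str, body: str) -> str:
    """Return a label string as long as `html`: "1" for body characters, "2" otherwise.

    Assumption: `body` appears as a single contiguous substring of `html`.
    """
    labels = ["2"] * len(html)
    start = html.find(body)
    if body and start != -1:
        for i in range(start, start + len(body)):
            labels[i] = "1"
    return "".join(labels)


sample = "<p>Hi</p>"
target = build_target(sample, "Hi")
```

Because the label string has exactly one label per input character, the model can later be trained as a per-character classifier with no separate alignment step.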

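[0056] Building the character dictionary

[0057] Build a character-level dictionary from the characters of the training set, with each character's value assigned incrementally starting from 0. By default the dictionary contains four special symbols, "{~}", "{^}", "{$}" and "{#}", which stand for the padding symbol, the start symbol, the end symbol and the unknown-token symbol respectively. Inverting the key-value pairs then yields the reverse character dictionary.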
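The dictionary construction can be sketched as below. One assumption is made explicit: the four special symbols are given the first four values (0 to 3) in the order listed, which the source does not state. The name `build_char_dict` is hypothetical.

```python
# Padding, start, end, unknown; assigning them ids 0..3 in this order is an assumption.
SPECIALS = ["{~}", "{^}", "{$}", "{#}"]


def build_char_dict(samples):
    """Return (char_to_id, id_to_char) built from an iterable of training strings."""
    char_to_id = {sym: i for i, sym in enumerate(SPECIALS)}
    for text in samples:
        for ch in text:
            if ch not in char_to_id:
                char_to_id[ch] = len(char_to_id)  # next unused value
    # Reverse dictionary: swap keys and values.
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return char_to_id, id_to_char


char_to_id, id_to_char = build_char_dict(["ab", "ba c"])
```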

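[0058] Converting HTML content into numeric vectors

[0059] Using the character dictionary, convert every character and special symbol of each sample (that is, of each piece of HTML content) into a numeric vector.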
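As a hedged illustration, the conversion might look like the following. Wrapping each sample in the start and end symbols and mapping unseen characters to the unknown symbol are assumptions about how the special symbols are used; the tiny dictionary and the name `encode` are hypothetical.

```python
def encode(text, char_to_id):
    """Map a string to a list of ids, wrapped in start/end symbols (an assumption)."""
    unk = char_to_id["{#}"]
    ids = [char_to_id["{^}"]]
    ids.extend(char_to_id.get(ch, unk) for ch in text)  # unseen chars -> unknown id
    ids.append(char_to_id["{$}"])
    return ids


# Hypothetical dictionary for the example only.
demo_dict = {"{~}": 0, "{^}": 1, "{$}": 2, "{#}": 3, "<": 4, "p": 5, ">": 6}
vector = encode("<p>x", demo_dict)
```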

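[0060] Padding the batch

[0061] The data is fed into the neural network in mini-batches, and every sample has a different length. The length of the longest sample in the batch is therefore obtained first, every shorter sample in the batch is padded to that length with the padding symbol "{~}", and the samples in the batch are sorted by true length in descending order.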
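The padding and sorting step can be sketched in plain Python as follows. The padding id of 0 and the function name `pad_batch` are assumptions; returning the true lengths alongside the padded data is a convenience added here, not something the source specifies.

```python
PAD_ID = 0  # assumption: id of the padding symbol "{~}"


def pad_batch(batch):
    """Sort sequences by true length (descending) and pad them to the longest one."""
    ordered = sorted(batch, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]
    max_len = lengths[0] if lengths else 0
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in ordered]
    return padded, lengths


padded, lengths = pad_batch([[5, 6], [7, 8, 9, 4], [3]])
```

Descending-length order is, for example, what PyTorch's `pack_padded_sequence` expects by default; the source does not state which framework was used.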

[0062] S2. Build the GRU neural network, using Cross Entropy as its loss function; the Embedding layer uses pre-trained character vectors;
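To make the loss concrete, the sketch below computes the mean cross-entropy for a per-character three-class prediction in plain Python. It is an illustration of the loss only, not of the network; the class index order (0 for padding, 1 for body text, 2 for other) and the name `cross_entropy` are assumptions.

```python
import math


def cross_entropy(logits, targets):
    """Mean cross-entropy over character positions.

    logits:  list of per-character score lists, one score per class.
    targets: list of class indices (assumed 0 = padding, 1 = body, 2 = other).
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        m = max(scores)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum - scores[target]  # equals -log softmax(scores)[target]
    return total / len(targets)


uniform_loss = cross_entropy([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [1, 2])
confident_loss = cross_entropy([[10.0, 0.0, 0.0]], [0])
```

With uniform scores the loss equals log 3, the value expected from a three-class model that has learned nothing; confident correct predictions drive it toward zero.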

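[0063] Referring to Figure 2, the GRU neural network unit is as follows:

[0064] The GRU neural network is an improvement on the RNN. An RNN is a neural network for processing sequence data that can capture and record the dependencies between items within a sequence. An RNN passes earlier information forward through its hidden state:

[0065] $h_t = g(W x_t + U h_{t-1} + b)$

[0066] where $x_t$ is the input vector at time $t$ (assuming the vector has size m×1...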
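The plain RNN recurrence in [0065] can be sketched in a few lines of Python. This shows only the basic RNN update, not the GRU gates, which the available text does not reach. The choice of tanh for g, the zero initial hidden state, and the names `rnn_step` and `run_rnn` are assumptions.

```python
import math


def rnn_step(x_t, h_prev, W, U, b):
    """One step of h_t = g(W x_t + U h_{t-1} + b), with g = tanh (assumed)."""
    h_t = []
    for i in range(len(b)):
        z = b[i]
        z += sum(W[i][j] * x_t[j] for j in range(len(x_t)))
        z += sum(U[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(z))
    return h_t


def run_rnn(xs, W, U, b):
    """Apply rnn_step over a sequence, starting from a zero hidden state (assumed)."""
    h = [0.0] * len(b)
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W, U, b)
        states.append(h)
    return states


# Input size 2, hidden size 1; all values are illustrative.
states = run_rnn([[1.0, 2.0], [0.0, 0.0]], W=[[1.0, 0.0]], U=[[0.5]], b=[0.0])
```

The second state depends on the first only through the hidden state, which is how the recurrence carries earlier information forward.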


Abstract

The invention discloses a general crawler design method for news websites based on a GRU neural network. The HTML page content is preprocessed: the HTML data is cleaned, target data is constructed and characters are labeled, a character dictionary is built, the HTML content is converted into numeric vectors, and finally the batches are padded. A GRU neural network is then established, with Cross Entropy as the loss function and pre-trained character vectors in the Embedding layer, and the network is trained and used for prediction. A site-wide crawler is built on the Scrapy crawler framework. After the crawler fetches the HTML content of any news page, the content is passed to the model trained with the neural network algorithm designed by the invention, so that the news text is extracted automatically, saving the time and manpower otherwise spent on customization.

Description

Technical field

[0001] The invention belongs to the technical field of computers, and in particular relates to a general crawler design method for news websites based on a GRU neural network.

Background

[0002] Some public opinion systems have already been implemented, but they monitor only a few fixed news websites, which makes real-time control of public opinion difficult. Moreover, extending the monitoring scope of these systems requires customizing crawlers for each newly added website, which takes a lot of manpower and time.

Contents of the invention

[0003] The technical problem to be solved by the present invention is to provide a general crawler design method for news websites based on a GRU neural network, which can automatically extract the text content from web pages of different styles, effectively saving manpower and time.

[0004] The present invention adopts the following technical scheme:

[0005] A...


Application Information

IPC(8): G06F16/951, G06F16/958, G06F16/36, G06N3/04
CPC: G06F16/951, G06F16/972, G06F16/374, G06N3/045
Inventor: 范建存, 廖励坤
Owner XI AN JIAOTONG UNIV