A General Crawler Design Method for News Websites Based on GRU Neural Network

The invention applies a neural network and a crawler design method in the computer field. It addresses the labor and time cost of writing custom crawlers for individual websites and the difficulty of real-time monitoring, and it reduces text length and noise.

Active Publication Date: 2022-04-22
XI AN JIAOTONG UNIV

AI Technical Summary

Problems solved by technology

[0004] Although some public opinion systems have already been implemented, they monitor only a few fixed news websites, which makes real-time monitoring of public opinion difficult. Writing a customized crawler for each website costs a lot of manpower and time.


Examples


Detailed Description of Embodiments

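[0051] The present invention provides a general crawler design method for news websites based on a GRU neural network. A GRU-based algorithm extracts the body text from HTML pages of differing layouts. A full-site crawler is then built, which crawls web page content and uses the designed neural network to extract the body text.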

[0052] The general crawler design method for news websites based on a GRU neural network comprises the following steps:

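[0053] S1. Preprocess the HTML page content by, in order: preprocessing the HTML data, constructing the target data and labeling characters, building a character dictionary, converting the HTML content into numeric vectors, and finally padding each batch;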

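[0054] The HTML data preprocessing is specifically:

[0055] Remove tags that are meaningless or may introduce noise, such as [...]; remove the attributes of all tags; remove tag content that consists only of whitespace.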
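
The three cleaning operations in [0055] can be sketched with regular expressions from the Python standard library. This is a minimal illustration, not the patent's implementation: the function name preprocess_html is hypothetical, and the tag names in NOISE_TAGS are an assumption, since the example tags in the source text are not legible. A regex approach also assumes attribute values do not contain ">".

```python
import re

# Assumed noise tags; the tag names given as examples in the source are missing.
NOISE_TAGS = ("script", "style")


def preprocess_html(html: str) -> str:
    """Apply the three cleaning steps: drop noisy tags, strip attributes, drop blank content."""
    # 1. Remove noisy elements together with everything inside them.
    for tag in NOISE_TAGS:
        html = re.sub(
            rf"<{tag}\b[^>]*>.*?</{tag}\s*>",
            "",
            html,
            flags=re.DOTALL | re.IGNORECASE,
        )
    # 2. Strip attributes from opening tags, keeping the tag name and any "/" of a self-closing tag.
    html = re.sub(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>", r"<\1\2>", html)
    # 3. Remove whitespace-only runs between two tags.
    html = re.sub(r">\s+<", "><", html)
    return html


page = '<div class="main"><script>var a = 1;</script><p id="x">Hello</p><span>   </span></div>'
cleaned = preprocess_html(page)
```

Stripping attributes and noisy elements shortens the character sequence the network has to read, which matches the stated goal of reducing text length and noise.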

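[0056] Constructing the target data and labeling characters is specifically:

[0057] Construct a string of the same length as the sample. Based on the corresponding crawled body text, set the characters at the positions of the body text within the HTML content to "1" and all other characters to "2". This turns the whole extraction task into a three-class classification task at the level of individual characters (the third class is the padding character added later).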
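
The labeling in [0057] might look like the sketch below. The function name build_labels is hypothetical, and the sketch assumes the body text is supplied as a list of segments that each appear verbatim, in order, inside the HTML string; the source does not say how body positions are located.

```python
def build_labels(html: str, body_segments: list[str]) -> str:
    """Return a label string as long as `html`: "1" for body characters, "2" otherwise."""
    labels = ["2"] * len(html)
    cursor = 0  # search forward only, so repeated text is matched in document order
    for segment in body_segments:
        if not segment:
            continue
        pos = html.find(segment, cursor)
        if pos == -1:
            continue  # segment not found verbatim; leave those characters as "2"
        labels[pos:pos + len(segment)] = "1" * len(segment)
        cursor = pos + len(segment)
    return "".join(labels)


sample = "<p>Hi</p><p>Yo</p>"
target = build_labels(sample, ["Hi", "Yo"])
```

Because the label string has exactly one symbol per input character, the model's output can be mapped back to the extracted body text by keeping the characters predicted as "1".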

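[0058] Building the character dictionary is specifically:

[0059] Build a character-level dictionary from the characters in the training set, with each character's value increasing sequentially from 0. By default the dictionary contains four special symbols, "{~}", "{^}", "{$}" and "{#}", which denote the padding symbol, start symbol, end symbol and unknown-token symbol respectively. Reversing the key-value pairs then yields the reversed character dictionary.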
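
A minimal sketch of the dictionary construction in [0059] follows. The name build_char_dict is hypothetical, and placing the four special symbols at ids 0 to 3 in the listed order is an assumption; the source only states that values start at 0 and that the four symbols are included by default.

```python
# Special symbols from the source: padding, start, end, unknown.
SPECIALS = ["{~}", "{^}", "{$}", "{#}"]


def build_char_dict(samples):
    """Build a character -> id dictionary and its reverse from training strings."""
    char2id = {symbol: index for index, symbol in enumerate(SPECIALS)}
    for text in samples:
        for ch in text:
            if ch not in char2id:
                char2id[ch] = len(char2id)  # next unused id
    # Reverse the key-value pairs to map ids back to characters.
    id2char = {index: ch for ch, index in char2id.items()}
    return char2id, id2char


char2id, id2char = build_char_dict(["ab", "bc"])
```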

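[0060] Converting the HTML content into numeric vectors is specifically:

[0061] Using the character dictionary, convert every character and special character in each sample (that is, each piece of HTML content) into a numeric vector.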
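
The conversion in [0061] can be sketched as a dictionary lookup per character. The function name encode and the toy dictionary are hypothetical. Wrapping each sample in the start symbol "{^}" and end symbol "{$}" is an assumption about how the special characters are used; if they are added, the label string would need two extra entries to stay aligned.

```python
def encode(text, char2id):
    """Map a string to a list of ids, using the unknown symbol for unseen characters."""
    unknown = char2id["{#}"]
    ids = [char2id["{^}"]]  # assumed: start symbol prepended
    ids.extend(char2id.get(ch, unknown) for ch in text)
    ids.append(char2id["{$}"])  # assumed: end symbol appended
    return ids


# Toy dictionary for illustration only.
toy_dict = {"{~}": 0, "{^}": 1, "{$}": 2, "{#}": 3, "<": 4, "p": 5, ">": 6}
vector = encode("<p>x", toy_dict)
```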

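[0062] Padding the batch is specifically:

[0063] Because data is fed into the neural network in mini-batches and every sample has a different length, first obtain the length of the longest sample in the batch, then use the padding symbol "{~}" to pad every shorter sample in the batch up to that length, and sort the samples in the batch by true length in descending order.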
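
The padding and sorting in [0063] can be sketched as below. The name pad_batch is hypothetical, and pad_id=0 assumes the padding symbol "{~}" was assigned id 0 in the character dictionary.

```python
def pad_batch(batch, pad_id=0):
    """Sort sequences by true length (descending) and pad them to the longest one."""
    ordered = sorted(batch, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]  # true lengths, kept for later masking
    max_len = lengths[0] if lengths else 0
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in ordered]
    return padded, lengths


padded, lengths = pad_batch([[5, 6], [7, 8, 9, 4], [1]])
```

Returning the true lengths alongside the padded batch lets later stages tell real characters from padding. Descending-length order is also what some packed-sequence utilities expect by default, for example PyTorch's pack_padded_sequence.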

[0064] S2. Build the GRU neural network, using Cross Entropy as its loss function; the Embedding layer uses pre-trained character vectors;
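
To make the loss in S2 concrete, the sketch below computes the mean cross-entropy over per-character class scores in plain Python. It is an illustration under assumptions, not the patent's training code: the function name cross_entropy is hypothetical, and the class indices (0 for padding, 1 for body text, 2 for everything else) are an assumed mapping of the three classes described in S1.

```python
import math


def cross_entropy(logits, targets):
    """Mean cross-entropy; logits[i] holds one raw score per class for character i."""
    total = 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp of the scores.
        top = max(scores)
        log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of the correct class under softmax.
        total += log_sum - scores[target]
    return total / len(targets)


# Two characters, three classes each (assumed order: padding, body, non-body).
loss = cross_entropy([[0.0, 0.0, 0.0], [0.0, 5.0, 0.0]], [1, 1])
```

With uniform scores the per-character loss equals log(3), the value expected from a model that cannot yet tell the three classes apart.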

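[0065] Referring to Figure 2, the GRU neural network unit is specifically:

[0066] The GRU neural network is an improved variant of the RNN. An RNN is a neural network for processing sequence data that can capture and record the dependencies between elements within a sequence. An RNN passes earlier information forward through its hidden state:

[0067] h_t = g(W x_t + U h_{t-1} + b)

[0068] Here, x_t at time t...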
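
The recurrence in [0067] can be illustrated in plain Python. This sketch covers only the basic RNN update quoted above, not the GRU gating, which the available text does not show. The function names are hypothetical, and choosing tanh for the activation g is an assumption.

```python
import math


def rnn_step(x_t, h_prev, W, U, b):
    """One update h_t = g(W x_t + U h_{t-1} + b), with g assumed to be tanh."""
    h_t = []
    for i in range(len(b)):
        s = b[i]
        s += sum(W[i][j] * x_t[j] for j in range(len(x_t)))        # input term W x_t
        s += sum(U[i][k] * h_prev[k] for k in range(len(h_prev)))  # recurrent term U h_{t-1}
        h_t.append(math.tanh(s))
    return h_t


def run_rnn(inputs, W, U, b):
    """Run the recurrence over a sequence, starting from a zero hidden state."""
    h = [0.0] * len(b)
    for x_t in inputs:
        h = rnn_step(x_t, h, W, U, b)
    return h


W, U, b = [[1.0]], [[0.5]], [0.0]
final_state = run_rnn([[0.5], [0.0]], W, U, b)
```

The second input is zero, yet the final state is nonzero because the hidden state carries the first input forward through the U term.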


Abstract

The invention discloses a general crawler design method for news websites based on a GRU neural network. The HTML page content is preprocessed: the HTML data is cleaned, target data is constructed and characters are labeled, a character dictionary is built, the HTML content is converted into numeric vectors, and each batch is padded. A GRU neural network is then built with Cross Entropy as the loss function and an Embedding layer that uses pre-trained character vectors, and the network is trained and used for prediction. A full-site crawler is built on the Scrapy crawler framework. After the crawler fetches the HTML content of any news page, the content is passed to the model trained with the designed neural network algorithm, which automatically extracts the news text and saves the time and manpower of per-site customization.

Description

Technical field

[0001] The invention belongs to the technical field of computers, and in particular relates to a general crawler design method for a news website based on a GRU neural network.

Background technique

[0002] Public opinion, also known as social public opinion, refers to the collection of social and political attitudes, beliefs, values, and ideas expressed by the public on the occurrence, development, and change of specific events or phenomena in society within a certain period of time and scope. In layman's terms, public opinion is a concentrated reflection of the thoughts, psychology, emotions and needs of social groups, and represents the current social sentiment and public opinion. Traditional public opinion is not only disseminated through newspapers, radio, television and other carriers, but also contained in the discussions among the people in the streets and alleys. Therefore, it is necessary to obtain public opinion through social visits, public opinio...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/951, G06F16/958, G06F16/36, G06N3/04
CPC: G06F16/951, G06F16/972, G06F16/374, G06N3/045
Inventor: 范建存, 廖励坤
Owner: XI AN JIAOTONG UNIV