Open source data processing method and system based on adversarial samples

A data processing method and system based on adversarial samples, applied in the computer field. It addresses problems of existing anti-crawler strategies, such as poorly configured blacklists that accidentally block normal users and measures that make data harder to recognize, and it raises the difficulty and cost of cracking.

Inactive Publication Date: 2020-08-11
GUANGZHOU UNIVERSITY
3 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0004] Existing anti-crawler strategies are as follows: (1) Blocking by IP, User-Agent, or cookies: the website's operations staff analyze the logs to find IPs, User-Agents, or cookies with recent abnormal access and deny them access through a blacklist mechanism. (2) CAPTCHA verification: when a user makes too many requests, the request is automatically redirected to a verification code page, and the website can be accessed only after the correct code is entered. (3) JavaScript rendering or asynchronous AJAX transfer: web developers place important information in the page without writing it into HTML tags; the browser automatically runs the JS code in the script tags and displays the information, whereas a crawler cannot execute JS code and therefore cannot read the information produced by JS events. When the page is visited, the server returns the page skeleton to the client and, while interacting with the client, transfers data packets asynchronously via AJAX for display on the page; a crawler that fetches the page directly gets no information. (4) Raising the difficulty of data recognition: important data that is updated often can be presented in a different form, for example by splitting text into a series of pictures, so that even if a web crawler obtains the data, it is hard to analyze and process afterward.
[0005] Although the above anti-crawler strategies reduce the probability of crawlers grabbing data to some extent, they also have a considerable impact on normal users. For example, if the blacklist in the first strategy is poorly configured, normal users are easily blocked by mistake.
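
A minimal sketch of the blacklist check in strategy (1), and of how it can block a normal user. The function name, the addresses, and the User-Agent strings are hypothetical, and the shared-address scenario is an illustrative assumption, not something the patent text states.

```python
from typing import Set


def is_blocked(ip: str, user_agent: str,
               blocked_ips: Set[str], blocked_agents: Set[str]) -> bool:
    """Return True if the request matches either blacklist."""
    return ip in blocked_ips or user_agent in blocked_agents


# Hypothetical result of log analysis (addresses from documentation ranges).
blocked_ips = {"203.0.113.7"}
blocked_agents = {"example-crawler/1.0"}

# The crawler that triggered the rule is blocked.
crawler_blocked = is_blocked(
    "203.0.113.7", "example-crawler/1.0", blocked_ips, blocked_agents)
# A normal user who shares that address is blocked as well.
normal_user_blocked = is_blocked(
    "203.0.113.7", "Mozilla/5.0", blocked_ips, blocked_agents)
# A crawler that changes its address and User-Agent passes the check.
rotated_crawler_blocked = is_blocked(
    "198.51.100.9", "Mozilla/5.0", blocked_ips, blocked_agents)
```

In this toy setup the rule blocks the normal user while missing the crawler that rotated its identity, which is the trade-off paragraph [0005] describes.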


Examples

Embodiment Construction

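[0035] The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the scope of protection of the present invention.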

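[0036] Note that the step numbers in the text only make the specific embodiments easier to explain and do not limit the order in which the steps are executed. The method provided in this embodiment can be executed by a relevant server, and the description below takes the server as the executing entity.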

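[0037] Referring to Figure 1, Figure 1 is a schematic flowchart of one embodiment of the open source data processing method based on adversarial samples provided by the present invention. As shown in Figure 1, the method comprises steps 101 to 103, as follows: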

[0038] Step 101: Split the open source data information X to be processed into a number of indivisible minimum units, which form picture set A.

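[0039] In this embodiment, the open source data information X includes digits, English text, Chinese text, or pictures. When X consists of digits, the minimum unit is each digit; when X is English text, the minimum unit is each letter; when X is Chinese text, the minimum unit is each Chinese character; when X is a picture, the minimum unit is a picture of fixed size.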
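
The splitting rule in paragraph [0039] can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, pictures are modeled as nested lists of grayscale values, whitespace is skipped, and the picture size is assumed to be a multiple of the tile size.

```python
from typing import List

Image = List[List[float]]  # rows of grayscale pixel values (assumed layout)


def split_text(text: str) -> List[str]:
    """Split text into minimum units: one digit, letter, or Chinese character each.

    Whitespace is skipped here, which is an assumption of this sketch.
    """
    return [ch for ch in text if not ch.isspace()]


def split_image(image: Image, tile_h: int, tile_w: int) -> List[Image]:
    """Cut a picture into fixed-size tiles in row-major order."""
    height, width = len(image), len(image[0])
    if height % tile_h or width % tile_w:
        raise ValueError("picture size must be a multiple of the tile size")
    tiles = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            # Take tile_h rows and slice tile_w columns out of each row.
            tiles.append([row[left:left + tile_w]
                          for row in image[top:top + tile_h]])
    return tiles
```

For example, `split_text("A1 中文")` yields four units, and a 2x4 picture cut with 2x2 tiles yields two tiles. Each text unit would still need to be rendered as a picture before it can join picture set A; that rendering step is outside this sketch.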

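[0040] In this embodiment, each minimum unit obtained by splitting corresponds to one picture, and each picture can be given a fixed size to simplify later data processing. The pictures obtained by splitting form picture set A. For an ordinary user, the content can be recognized normally by eye whether the web front end displays X or A.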

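[0041] Step 102: Generate the adversarial sample picture set D from picture set A and recognition model set B, where recognition model set B contains the multiple recognition models that a web crawler uses to recognize the data it captures, and the difference between each picture d in D and the corresponding picture a in A is less than a preset threshold δ.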

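[0042] In this embodiment, step 102 is specifically: take picture set A and recognition model set B as input and add perturbation noise to the input to generate the adversarial sample picture set D, where each picture a∈A yields a corresponding adversarial sample picture d satisfying |d-a|≤δ.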
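
The visible text does not say how the perturbation noise is computed, so the sketch below swaps in a single sign-gradient step in the style of the fast gradient sign method. Everything here is an assumption made for illustration: pictures are flat lists of pixel values in [0, 1], each recognizer in B is stood in for by a hypothetical linear scorer, and |d-a| is read as the largest per-pixel difference.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-in for one recognizer in B: (weights, bias).
# A higher score means the model is more confident in the true label.
LinearModel = Tuple[Sequence[float], float]


def score(model: LinearModel, x: Sequence[float]) -> float:
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def sign(value: float) -> int:
    return (value > 0) - (value < 0)


def perturb(a: Sequence[float], models: Sequence[LinearModel],
            delta: float) -> List[float]:
    """Return d with max |d_i - a_i| <= delta that lowers the summed score."""
    # For linear scorers, the gradient of the summed score with respect to
    # pixel i is the sum of the models' weights for that pixel.
    grad = [sum(model[0][i] for model in models) for i in range(len(a))]
    # Step against the gradient, then clip back into the valid pixel range.
    # Clipping cannot increase the distance to a, since a is already in range.
    return [min(1.0, max(0.0, ai - delta * sign(g)))
            for ai, g in zip(a, grad)]


a = [0.5, 0.5, 0.5, 0.5]
models = [([1.0, -1.0, 0.5, 0.0], -0.1), ([0.8, -0.6, 0.2, 0.1], -0.05)]
d = perturb(a, models, delta=0.1)
```

Stepping against the summed gradient lowers the ensemble's total score, but it does not guarantee a lower score for every individual model, and a real recognizer would be nonlinear and need an iterative attack.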

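[0043] The recognition model set B in this embodiment contains multiple recognition models, which are the models a web crawler uses to recognize the data it captures. For a web crawler, whether X or A is displayed, the data it captures are the individual elements of picture set A; it must then restore picture set A with commonly used recognition models into data that the crawler can accept or recognize before any further processing. These models can recognize picture set A with high accuracy, and they are taken as model set B.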
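
One way to read the selection of model set B in paragraph [0043] is as an accuracy filter over candidate recognizers. The sketch below assumes that reading; the function names, the callable interface for a recognizer, and the 0.9 threshold are hypothetical and are not taken from the patent.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical recognizer interface: picture in, predicted label out.
Recognizer = Callable[[Sequence[float]], str]
Sample = Tuple[Sequence[float], str]  # (picture, true label)


def accuracy(model: Recognizer, samples: Sequence[Sample]) -> float:
    """Fraction of labeled pictures that the model recognizes correctly."""
    if not samples:
        raise ValueError("at least one labeled picture is required")
    correct = sum(1 for picture, label in samples if model(picture) == label)
    return correct / len(samples)


def select_model_set(candidates: Dict[str, Recognizer],
                     samples: Sequence[Sample],
                     min_accuracy: float = 0.9) -> List[str]:
    """Keep the names of candidates that recognize the pictures accurately."""
    return [name for name, model in candidates.items()
            if accuracy(model, samples) >= min_accuracy]
```

Only models that already read picture set A well are worth attacking, which is why the filter keeps the accurate ones.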

[0044] In this embodi...


Abstract

The invention discloses an open source data processing method and system based on adversarial samples. The method comprises the following steps: splitting open source data information X into a plurality of indivisible minimum units to form a picture set A; generating an adversarial sample picture set D from the picture set A and a recognition model set B, such that the difference between each picture d and the corresponding picture a is smaller than a preset threshold δ; and finally splicing the adversarial sample picture set D to generate open source data information X' for display on the web front end. With this technical scheme, ordinary users can still read and use the open source data normally, while a web crawler that captures the data can hardly parse the information in it correctly, which increases the difficulty and cost of cracking.
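
The final splicing step can be sketched as a left-to-right concatenation of the adversarial pictures. This assumes the pictures are nested lists of pixel rows with equal height and that X' is a single horizontal strip; the abstract does not specify the layout, and the function name is hypothetical.

```python
from typing import List

Image = List[List[float]]  # rows of grayscale pixel values (assumed layout)


def splice_horizontally(tiles: List[Image]) -> Image:
    """Join pictures of equal height side by side, in order."""
    if not tiles:
        raise ValueError("at least one picture is required")
    height = len(tiles[0])
    if any(len(tile) != height for tile in tiles):
        raise ValueError("all pictures must have the same height")
    # Row r of the result is row r of every picture, concatenated in order.
    return [[pixel for tile in tiles for pixel in tile[r]]
            for r in range(height)]
```

Because step 101 gives every picture a fixed size, the equal-height check should always pass for pictures produced by that step.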

Description

Technical field

[0001] The present invention relates to the field of computer technology, in particular to an open source data processing method and system based on adversarial samples.

Background technique

[0002] With the rapid development of information technology today, it is difficult for users to protect their data from being easily obtained and used by others. Companies and individuals generally display their data in an open source manner, such as through web pages, but due to the existence of web crawlers, attackers can easily use web crawlers to obtain open source data of companies and individuals.

[0003] A web crawler is a program or script that automatically grabs open source data on the World Wide Web according to certain rules. People can use web crawlers to obtain the data information they need in batches instead of manually obtaining it, which saves manpower and material resources. For some important data, such as data that is valuable to the company or ...

Application Information

IPC(8): G06F16/951, G06K9/34, G06K9/62
CPC: G06F16/951, G06V10/267, G06F18/24
Inventors: 顾钊铨, 廖续鑫, 方滨兴, 王乐, 王新刚, 张川京, 王玥天
Owner: GUANGZHOU UNIVERSITY