Self-adaption web crawler method based on machine learning

Adaptive web crawling with machine learning, applied in the computer field to improve the efficiency of information retrieval and reduce time costs

Active Publication Date: 2016-04-20

NANJING UNIV

Cites: 4; Cited by: 13

Problems solved by technology

At present, writing a web crawler usually requires professional software developers to inspect the page code, work out the corresponding rules, and write a different crawler for each website and page. No self-adaptive program exists that automatically guides crawlers to crawl isomorphic information.

At the same time, with the rise of major e-commerce companies, people are increasingly inclined to shop online, and this trend keeps growing. Helping users find the information they need on e-commerce websites that hold large amounts of data is a serious challenge.

Embodiment Construction

[0029] Some embodiments of the present invention are described in more detail below with reference to the accompanying drawings. In this example, the HTML code of webpages A and B is known, and the crawling pattern for webpage C is output adaptively.

[0030] As shown in Figure 1, the present invention builds on data mining and machine learning techniques. The specific implementation steps are as follows:

[0031] 1. Get data:

[0032] Obtain the full page code supplied by the browser plug-in, together with the position within the page of the code fragment to be crawled. The position is expressed as an array in which each number gives the ordinal position of a tag at the corresponding nesting level of the page code. For example, in the position array [1,2,6,3,1,2,1,3,2,2,1], the first number, 1, denotes the first outermost tag in the page code, namely html; the second number, 2, denotes the second tag under that html tag, namely body; the third number, 6, denotes the sixth tag unde...
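The position-array idea in [0032] can be illustrated with a minimal Python sketch. It assumes that each number is a 1-based index among the child tags at that nesting level, as the html and body example suggests. The names `Node`, `TreeBuilder`, `parse_tree`, `resolve`, `tag_path`, and `SAMPLE_HTML` are hypothetical and do not come from the patent, and the sample page is invented for illustration only.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag (assumption: standard void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One tag in the page tree; text content is ignored in this sketch."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree under a synthetic '#document' root."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into elements that can hold children

    def handle_endtag(self, tag):
        # Climb to the nearest open ancestor with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def parse_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def resolve(root, position):
    """Follow a position array of 1-based child indices down from the root."""
    node = root
    for index in position:
        node = node.children[index - 1]
    return node


def tag_path(root, position):
    """Return the tag name visited at each level of the position array."""
    names, node = [], root
    for index in position:
        node = node.children[index - 1]
        names.append(node.tag)
    return names


# Invented sample page, not taken from the patent.
SAMPLE_HTML = ("<html><head><title>t</title></head><body><div></div>"
               "<div><p>a</p><br><span>price</span></div></body></html>")

if __name__ == "__main__":
    print(tag_path(parse_tree(SAMPLE_HTML), [1, 2, 2, 3]))
    # ['html', 'body', 'div', 'span']
```

Here the array [1, 2, 2, 3] reads as: the first outermost tag (html), its second child (body), that tag's second child (the second div), and that tag's third child (span). An out-of-range index raises `IndexError`, which a real crawler would need to handle.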

Abstract

The invention provides a self-adaptive web crawler method based on machine learning. At present, software developers must inspect complex page code and work out its rules in order to write web crawler programs. No self-adaptive program exists that automatically guides crawlers in their work, and the field of e-commerce has not been addressed. The technical aim is therefore to mine large amounts of data with machine learning so as to locate content-related information on an unknown webpage. This aim is achieved through machine learning and data mining. The main technical steps are data acquisition, feature extraction, heterogeneous data normalization, training data construction, self-adaptive training, learning method verification, and self-adaptive pattern generation. The method performs position feature extraction and self-adaptive training on the page code of crawled websites, and achieves a certain effect for research on self-adaptive web crawlers in the field of e-commerce.

Description

Technical field

[0001] The invention relates to computer technology. It mainly uses data mining and machine learning methods to solve the self-adaptive matching problem of web crawlers in the field of e-commerce, and belongs to the intersection of computer technology, data mining, machine learning, and web crawler technology.

Background

[0002] With the explosive growth of Internet information and the continued rise of e-commerce websites, more and more people have begun to pay attention to how to find the products they want on e-commerce sites that hold large volumes of data. Software industry professionals, in turn, increasingly favor automated processing of large amounts of data to improve the efficiency of information retrieval and to deliver targeted recommendations. Crawlers, as the main way to obtain web page information, have also become widely known. However, th...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F16/951

Inventor: 汤恩义, 赵晨, 李宣东, 陈鑫, 张庆垒, 潘敏学, 赵祖威

Owner: NANJING UNIV
