Self-adaption web crawler method based on machine learning

An adaptive web crawler and machine learning technology, applied in the computer field, that improves the efficiency of information retrieval and reduces time costs.

Active Publication Date: 2016-04-20
NANJING UNIV
Cites: 4; Cited by: 13

AI Technical Summary

Problems solved by technology

However, writing a web crawler currently requires professional software developers to inspect the page code and study its rules, and a different crawler must be written for each website and page. No self-adaptive program exists that automatically guides crawlers to crawl isomorphic information.
At the same time, with the rise of major e-commerce companies, people increasingly prefer online shopping. How users can find the information they need on e-commerce websites that hold large amounts of data is a serious challenge.



Examples


Embodiment Construction

[0029] Some embodiments of the present invention are described in more detail below with reference to the accompanying drawings. In this example, the HTML code of web pages A and B is known, and the crawling pattern for web page C is output adaptively.

[0030] According to Figure 1, the present invention is built on data mining and machine learning technology. The specific implementation is as follows:

[0031] 1. Get data:

[0032] Obtain the full page code supplied by the browser plug-in, together with the position, within that page, of the code fragment to be crawled. The position is expressed as an array in which each number gives the ordinal of a tag at the corresponding nesting level of the page code. For example, in the position array [1,2,6,3,1,2,1,3,2,2,1], the first number, 1, denotes the outermost first-level tag of the page code, that is, html; the second number, 2, denotes the second tag, body, under the html tag; the third number, 6, denotes the sixth tag unde...
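
The position array described above can be read as a path through the tag tree of the page. The following Python sketch illustrates that reading only; it is not the patent's implementation. The names `Node`, `TreeBuilder`, `build_tree` and `tag_path` are hypothetical, and the sketch assumes 1-based indices that count element tags only (text and comments are ignored) and reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element in the simplified tag tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree of element tags; text nodes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_path(root, position):
    """Follow a 1-based position array and return the tag met at each level."""
    tags = []
    node = root
    for index in position:
        node = node.children[index - 1]  # assumption: indices start at 1
        tags.append(node.tag)
    return tags


page = ("<html><head><title>t</title></head>"
        "<body><div></div><div><span>x</span><p>price</p></div></body></html>")
print(tag_path(build_tree(page), [1, 2, 2, 2]))
```

On this toy page, the array [1, 2, 2, 2] selects html, then body (the second tag under html), then the second div, then the p element inside it. An index that exceeds the number of children at some level raises `IndexError`, which a real crawler would need to handle.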


Abstract

The invention provides a self-adaptive web crawler method based on machine learning. At present, software developers must inspect complex page code and study its rules in order to write web crawler programs, no self-adaptive program exists to guide crawlers automatically, and the e-commerce field has not been addressed. The technical aim is therefore to mine large amounts of data with machine learning so as to locate content-related information on an unknown web page. This aim is achieved through machine learning and data mining. The main technical steps are data acquisition, feature extraction, heterogeneous data normalization, training data construction, self-adaptive training, verification of the learning method, and self-adaptive pattern generation. The method performs position feature extraction and self-adaptive training on the page code of crawled websites, and contributes to research on self-adaptive web crawlers in the e-commerce field.

Description

Technical field

[0001] The invention relates to computer technology. It mainly uses data mining and machine learning methods to solve the self-adaptive matching problem of web crawlers in the e-commerce field, and belongs to the intersection of computer technology, data mining, machine learning and web crawler technology.

Background

[0002] With the explosive growth of Internet information and the continued rise of e-commerce websites, more and more people have begun to pay attention to how to find the products they want on e-commerce sites that hold large volumes of data. Software industry professionals increasingly favor automated processing of large amounts of data, in order to improve the efficiency of information retrieval and to deliver targeted recommendations. Crawlers, as the main way of obtaining web page information, have also become more widely known. However, th...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 汤恩义, 赵晨, 李宣东, 陈鑫, 张庆垒, 潘敏学, 赵祖威
Owner: NANJING UNIV