A URL cleaning system and method based on integrated learning

A URL cleaning technology based on integrated (ensemble) learning, applied in the field of network information processing. It addresses problems such as easily missed titles and the manpower consumed by manual labeling, and it improves accuracy, cleaning efficiency, and computing efficiency.

Active Publication Date: 2021-04-06
XIAMEN KUAISHANGTONG INFORMATION TECH CO LTD
Cites: 3; Cited by: 0

AI Technical Summary

Problems solved by technology

[0003] The title of a website can be said to be the face of the company. A traditional system easily misses unusual titles. For example, "Angel in White" is clearly recognizable to a person as the website of a hospital, but a traditional system is likely to filter this title out.
Therefore, the disadvantage of the traditional system is that it requires manually labeled data, which consumes a certain amount of manpower in the early stage.

Examples

Embodiment Construction

[0038] To make the technical problems to be solved, the technical solutions, and the beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

[0039] A URL cleaning system based on integrated learning, which includes:

[0040] A data crawling module, which crawls website URLs and their corresponding website titles;

[0041] A data labeling module, which judges whether the website title is consistent with the specified crawling theme; if so, it marks the website title as a class A title, otherwise it marks it as a class B title;

[0042] The primary prediction model 1, which segments the labeled class A and class B titles into words, calculates the weight values of the word segmentation results, and then use ...
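
The labeling and word-weighting steps in [0041] and [0042] can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the keyword set `THEME_KEYWORDS`, the regex-based `segment` function (a stand-in for a real word segmenter), and the choice of TF-IDF as the "weight value". The source does not say how theme consistency is judged or which weighting is used.

```python
import math
import re
from collections import Counter

# Hypothetical theme keywords; the source does not define how a theme is specified.
THEME_KEYWORDS = {"hospital", "clinic", "medical"}


def segment(title):
    """Split a title into lowercase word tokens (stand-in for a real word segmenter)."""
    return re.findall(r"[a-z0-9]+", title.lower())


def label_title(title, keywords=THEME_KEYWORDS):
    """Return "A" if the title shares a token with the theme keywords, else "B"."""
    return "A" if keywords & set(segment(title)) else "B"


def tfidf_weights(titles):
    """Compute one TF-IDF weight per token for each title (an assumed weighting)."""
    docs = [segment(t) for t in titles]
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (cnt / len(doc)) * math.log((1 + n_docs) / (1 + doc_freq[tok]))
            for tok, cnt in counts.items()
        })
    return weights


# Illustrative crawl results as (URL, title) pairs; the URLs are placeholders.
crawled = [
    ("http://example.com/a", "City Hospital Official Site"),
    ("http://example.com/b", "Cheap Flights and Hotels"),
    ("http://example.com/c", "Angel in White"),
]
labels = [label_title(title) for _, title in crawled]
weights = tfidf_weights([title for _, title in crawled])
```

Note that this keyword rule labels "Angel in White" as class B, which is the kind of missed title the background section describes; in the sketch, it only serves to show why a learned model is preferable to a fixed rule.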

Abstract

The invention discloses a URL cleaning system and method based on integrated learning. The method crawls website URLs and their corresponding website titles and judges whether each website title is consistent with the specified crawling theme; if so, the title is marked as a class A title, otherwise it is marked as a class B title. The marked class A and class B titles are segmented into words, and a naive Bayes algorithm is trained and used for prediction on the word segmentation results to construct the corresponding results, which are then fused with the Stacking algorithm to obtain a fusion result. Finally, a decision tree algorithm is trained and used for prediction on the fusion result to obtain a decision tree model, and the URLs are cleaned through this decision tree model. This greatly improves URL cleaning efficiency, saves a large amount of manual inspection time, and improves the accuracy of verifying the website title corresponding to each URL.
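
The pipeline in the abstract (naive Bayes base models, Stacking fusion, decision tree on top) can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the patented implementation: the two feature functions, the three-fold split, the toy titles, and the use of a one-level decision tree (a stump) as the meta-model are all illustrative choices that the source does not specify.

```python
import math
from collections import Counter


def word_features(title):
    return title.lower().split()


def char_bigram_features(title):
    text = title.lower().replace(" ", "")
    return [text[i:i + 2] for i in range(len(text) - 1)]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (1 = class A, 0 = class B)."""

    def __init__(self, featurize):
        self.featurize = featurize

    def fit(self, titles, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.class_totals = Counter(labels)
        for title, label in zip(titles, labels):
            self.counts[label].update(self.featurize(title))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def prob_a(self, title):
        """Return the estimated P(class A | title)."""
        n = sum(self.class_totals.values())
        log_scores = {}
        for c in (0, 1):
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log((self.class_totals[c] + 1) / (n + 2))
            for feat in self.featurize(title):
                score += math.log((self.counts[c][feat] + 1) / total)
            log_scores[c] = score
        peak = max(log_scores.values())
        exp = {c: math.exp(s - peak) for c, s in log_scores.items()}
        return exp[1] / (exp[0] + exp[1])


def out_of_fold_probs(makers, titles, labels, folds=3):
    """Stacking step: each row holds every base model's P(A) for a held-out title."""
    meta = [[0.0] * len(makers) for _ in titles]
    for k in range(folds):
        train = [i for i in range(len(titles)) if i % folds != k]
        held_out = [i for i in range(len(titles)) if i % folds == k]
        for j, make in enumerate(makers):
            model = make().fit([titles[i] for i in train], [labels[i] for i in train])
            for i in held_out:
                meta[i][j] = model.prob_a(titles[i])
    return meta


def fit_stump(meta, labels):
    """Fit a one-level decision tree: the (column, threshold) with the fewest errors."""
    best = None
    for j in range(len(meta[0])):
        for threshold in sorted({row[j] for row in meta}):
            errors = sum((row[j] >= threshold) != bool(y) for row, y in zip(meta, labels))
            if best is None or errors < best[0]:
                best = (errors, j, threshold)
    return best[1], best[2]


def clean_urls(records, models, stump):
    """Keep only URLs whose title the stump classifies as class A."""
    col, threshold = stump
    return [url for url, title in records if models[col].prob_a(title) >= threshold]


# Toy labeled titles (1 = class A, on theme; 0 = class B); purely illustrative.
titles = ["City Hospital Official Site", "Cheap Flights and Hotels",
          "Dental Clinic Appointments", "Used Cars for Sale",
          "Children Hospital Clinic", "Hotels and Cars Deals"]
labels = [1, 0, 1, 0, 1, 0]

makers = [lambda: NaiveBayes(word_features), lambda: NaiveBayes(char_bigram_features)]
meta = out_of_fold_probs(makers, titles, labels)
stump = fit_stump(meta, labels)
final_models = [make().fit(titles, labels) for make in makers]
records = [("http://example.com/1", "Central Hospital"),
           ("http://example.com/2", "Cheap Hotels")]
kept = clean_urls(records, final_models, stump)
```

The out-of-fold step matters because the meta-model should be trained on predictions for titles the base models did not see; otherwise the decision tree would learn from overconfident scores. A deeper tree that combines several base-model columns would be a natural extension of the stump used here.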

Description

Technical field

[0001] The invention relates to the technical field of network information processing, in particular to a URL cleaning system based on integrated learning and a method applying the system.

Background

[0002] A URL, also known as a web page address, is a standard resource address on the Internet that fully describes the location of web pages and other resources. Every web page on the Internet is identified by a unique URL. The address can point to a local disk, to a computer on a LAN, or, more often, to a site on the Internet. Simply put, a URL is a web address. Crawler engineers usually need to clean the data after crawling a website, and the more troublesome part is cleaning the URLs.

[0003] The title of the website can be said to be the face of the company. The traditional system is easy to miss...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F16/955, G06F16/9535
Inventor: 陈鑫, 肖龙源, 蔡振华, 李稀敏, 刘晓葳, 谭玉坤
Owner: XIAMEN KUAISHANGTONG INFORMATION TECH CO LTD