A method and system for identifying web pages with invalid addresses

A technique for identifying web pages with invalid addresses, applied in web data retrieval, website content management, and web data retrieval using information identifiers. It addresses the strong subjectivity and low efficiency of existing identification methods and improves objectivity.

Active Publication Date: 2020-12-08
CHANGCHUN UNIV OF SCI & TECH
13 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0004] The purpose of the present invention is to provide a method and system for identifying web pages with invalid addresses, so as to solve the low efficiency and strong subjectivity of prior-art methods for identifying such pages.



Examples


Embodiment Construction

[0048] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0049] The object of the present invention is to provide a method and system for identifying a webpage with an invalid address, so as to solve the problems of low efficiency and strong subjectivity of the method for identifying a webpage with an invalid address in the prior art.

[0050] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below in con...


Abstract

The system and method of the invention extract feature attributes from manually labeled invalid-address web pages and use a sparse hash mapping of those feature attributes as the input to a clustering algorithm, yielding multiple clusters of uniform resource locators (URLs). The longest text shared by the clustered URLs from the starting position is obtained by calculating the matching degree. The first attribute text, the second attribute text, and the invalid-address blacklist are determined based on the longest text. A first text and a second text are then determined from the web page to be identified. When the first attribute text column of the blacklist contains the first text, and the second text contains all the contents of the second attribute text corresponding to the first attribute text identical to the first text, the web page to be identified is determined to be an invalid-address web page. Because the blacklist is obtained through cluster analysis of invalid-address web pages and is then used to identify the web pages to be classified, the method or system improves objectivity and operating efficiency.
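The blacklist construction and matching rule described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: it skips the feature extraction, sparse hash mapping, and clustering steps and takes the URL clusters as given; it computes the shared text once per cluster; it treats the first text as the URL host and the second text as the path plus query string (the abstract does not define them); and it uses a plain character-wise common prefix in place of the matching-degree calculation. The function names `split_url`, `build_blacklist`, and `is_invalid` are hypothetical.

```python
from os.path import commonprefix
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into (first text, second text).

    Assumption: first text = host, second text = path plus query string.
    """
    parts = urlsplit(url)
    second = parts.path + ("?" + parts.query if parts.query else "")
    return parts.netloc.lower(), second


def build_blacklist(clusters):
    """Turn each cluster of invalid URLs into a blacklist row.

    The blacklist maps a first attribute text (host) to the set of
    second attribute texts (shared path prefixes) seen for that host.
    """
    blacklist = {}
    for urls in clusters:
        # Longest text shared by all URLs in the cluster, from the first character.
        prefix = commonprefix(urls)
        first, second = split_url(prefix)
        # Skip prefixes that end inside the host or carry no path information.
        if first and second.strip("/"):
            blacklist.setdefault(first, set()).add(second)
    return blacklist


def is_invalid(url, blacklist):
    """Apply the matching rule: the host must be listed, and the page's
    second text must contain a blacklisted second attribute text in full."""
    first, second = split_url(url)
    return any(pattern in second for pattern in blacklist.get(first, ()))


# Hypothetical clusters of manually labeled invalid-address URLs.
clusters = [
    ["http://example.com/error/404?id=1", "http://example.com/error/404?id=22"],
    ["http://shop.example.org/missing/a", "http://shop.example.org/missing/b"],
]
blacklist = build_blacklist(clusters)
```

With these sample clusters, a URL such as `http://example.com/error/404?id=7` is flagged, while `http://example.com/news/today` is not. Keying the blacklist by host keeps the lookup cheap: only the patterns recorded for the page's own host are compared.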

Description

Technical field

[0001] The invention relates to the technical field of natural language processing of network text, in particular to a method and system for identifying web pages with invalid addresses.

Background technique

[0002] Today, the Internet has brought many conveniences and shortcuts to daily life and production. Huge network information resources enable people to easily obtain the information they need through a browser. However, excessive or inappropriate access to network resources not only costs the enterprise productivity and network bandwidth, but also seriously threatens its network security architecture and information systems. At the same time, inappropriate or illegal content on the network greatly endangers the physical and mental health of enterprise employees and can even bring legal problems to the enterprise.

[0003] In the actual application of the Internet in China, when a user searches for an...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/95, G06F16/955, G06F16/958
Inventor: 周超然, 刘妍, 张昕, 张莹, 赵建平, 冯欣, 张剑飞, 杨宏伟, 孙庚
Owner: CHANGCHUN UNIV OF SCI & TECH