Internet forum page clustering method and device based on URL structure

A web page clustering method, applied to the clustering of web forum pages based on URL structure, that addresses the lack of an effective solution for classifying forum pages.

Active Publication Date: 2017-11-28
SHANDONG NORMAL UNIV

AI Technical Summary

Problems solved by technology

[0003] In summary, the prior art still lacks an effective way to classify web forum pages while improving the accuracy and efficiency of web page classification.


Examples


Embodiment 1

[0074] As described in the background, the present invention addresses the above problems by providing a web forum page clustering method and device based on URL structure. The invention constructs structure vectors from the URLs and calculates the dissimilarity between the structure vectors, so that the web pages can be classified by cluster analysis. This makes effective classification of web forum pages possible and improves the accuracy and efficiency of web page classification.

[0075] In order to achieve the above object, the present invention adopts the following technical scheme:

[0076] A network forum page clustering method based on URL structure, the method includes the following steps:

[0077] (1) Preliminary grouping of all webpages according to the domain names to which the webpages belong, sampling each group of webpages after the preliminary grouping to form a sample, and insertin...
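The preliminary grouping and sampling described in step (1) can be sketched as follows. This is only an illustration under stated assumptions: the function names `group_by_domain` and `sample_groups`, the sample size `k`, the `seed` parameter, and the example URLs are all hypothetical and do not come from the patent text.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(urls):
    """Preliminary grouping: map each domain name to the URLs it hosts."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).netloc.lower()].append(url)
    return dict(groups)


def sample_groups(groups, k, seed=0):
    """Draw up to k URLs from every domain group (k and seed are assumed parameters)."""
    rng = random.Random(seed)
    return {
        domain: rng.sample(members, min(k, len(members)))
        for domain, members in groups.items()
    }


# Hypothetical forum URLs, used only to exercise the sketch.
urls = [
    "http://bbs.example.com/forum-2-1.html",
    "http://bbs.example.com/thread-1001-1-1.html",
    "http://bbs.example.com/thread-1002-1-1.html",
    "http://forum.example.org/viewtopic.php?t=7",
]
groups = group_by_domain(urls)
samples = sample_groups(groups, k=2)
```

Sampling per domain keeps small sites represented even when a few large forums dominate the crawl, which is one plausible reason for grouping before sampling.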

Embodiment 2

[0150] The second object of the present invention is to provide a storage device for a network forum page clustering method based on the URL structure.

[0151] In order to achieve the above object, the present invention adopts the following technical scheme:

[0152] A storage device storing a plurality of instructions adapted to be loaded and executed by a processor to perform the following:

[0153] (1) Preliminary grouping of all webpages according to the domain names to which the webpages belong, sampling each group of webpages after the preliminary grouping to form a sample, and inserting marked webpages to be screened into the samples to form a sample webpage;

[0154] (2) Segment the URLs of the sample webpages, excluding the domain name, at the symbols they contain; number the category and content of each URL segment; and construct structural blocks;

[0155] (3) Arrange the structural blocks of the same website in order to form the structural vector of the website; calculate the dissimilarity o...
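Steps (2) and (3) describe splitting a URL into segments, numbering each segment's category and content, and arranging the resulting blocks in order. A minimal sketch of that idea follows. The category codes, the rule that all numeric segments share one content number, and the names `split_segments` and `structure_vector` are assumptions made for illustration; the excerpt does not specify the patent's actual numbering scheme.

```python
import re
from urllib.parse import urlsplit

# Assumed category codes; the source does not list its numbering scheme.
DIGITS, LETTERS, MIXED = 0, 1, 2


def split_segments(url):
    """Drop scheme and domain, then split the remainder on non-alphanumeric symbols."""
    parts = urlsplit(url)
    rest = parts.path + ("?" + parts.query if parts.query else "")
    return [seg for seg in re.split(r"[^0-9A-Za-z]+", rest) if seg]


def structure_vector(url, vocab):
    """Return the ordered list of (category, content_id) blocks for one URL."""
    blocks = []
    for seg in split_segments(url):
        if seg.isdigit():
            # Assumption: numeric segments are volatile IDs, so all share content id 0.
            blocks.append((DIGITS, 0))
        else:
            category = LETTERS if seg.isalpha() else MIXED
            # Number each distinct literal in order of first appearance, starting at 1.
            content_id = vocab.setdefault(seg.lower(), len(vocab) + 1)
            blocks.append((category, content_id))
    return blocks


# Hypothetical URLs; vocab is shared so equal literals get equal numbers.
vocab = {}
v1 = structure_vector("http://bbs.example.com/thread-1001-1-1.html", vocab)
v2 = structure_vector("http://bbs.example.com/thread-2002-3-1.html", vocab)
v3 = structure_vector("http://bbs.example.com/forum.php?mod=viewthread&tid=5", vocab)
```

Under these assumptions the two `thread-…` URLs map to the same structure vector, while the `forum.php` URL maps to a different one, which is the property a structure-based clustering needs.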

Embodiment 3

[0158] The third object of the present invention is to provide a terminal device for a network forum page clustering method based on the URL structure.

[0159] In order to achieve the above object, the present invention adopts the following technical scheme:

[0160] A terminal device comprising:

[0161] a processor adapted to implement the instructions; and

[0162] A storage device adapted to store a plurality of instructions adapted to be loaded and executed by a processor:

[0163] (1) Preliminary grouping of all webpages according to the domain names to which the webpages belong, sampling each group of webpages after the preliminary grouping to form a sample, and inserting marked webpages to be screened into the samples to form a sample webpage;

[0164] (2) Segment the URLs of the sample webpages, excluding the domain name, at the symbols they contain; number the category and content of each URL segment; and construct structural blocks;

[0165] (3) Arrange the structural bloc...
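Once structure vectors exist, the method relies on a dissimilarity between them. The distance function itself is not reproduced in this excerpt, so the sketch below substitutes a normalized weighted edit distance as a stand-in. It is not the patent's function: the weights in `block_cost`, the name `dissimilarity`, and the hand-written example vectors are all assumptions.

```python
def block_cost(a, b):
    """Substitution cost between two (category, content_id) blocks (assumed weights)."""
    if a == b:
        return 0.0
    # Same category but different content counts as a half mismatch.
    return 0.5 if a[0] == b[0] else 1.0


def dissimilarity(u, v):
    """Normalized weighted edit distance between two structure vectors, in [0, 1]."""
    if not u and not v:
        return 0.0
    prev = [float(j) for j in range(len(v) + 1)]
    for i, a in enumerate(u, start=1):
        curr = [float(i)]
        for j, b in enumerate(v, start=1):
            curr.append(min(
                prev[j] + 1.0,                   # delete a from u
                curr[j - 1] + 1.0,               # insert b into u
                prev[j - 1] + block_cost(a, b),  # substitute a with b
            ))
        prev = curr
    return prev[-1] / max(len(u), len(v))


# Hypothetical structure vectors: two thread pages and one board page.
thread_a = [(1, 1), (0, 0), (0, 0), (0, 0), (1, 2)]
thread_b = [(1, 1), (0, 0), (0, 0), (0, 0), (1, 2)]
board = [(1, 3), (0, 0), (0, 0), (1, 2)]

same_type = dissimilarity(thread_a, thread_b)
cross_type = dissimilarity(thread_a, board)
```

An edit distance is used here because structure vectors can differ in length; any function that is zero for identical vectors and grows with structural differences could take its place.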


Abstract

The invention relates to an Internet forum page clustering method and device based on URL structure. The method belongs to the field of data mining and addresses the problem of large-scale web page classification. It comprises the following steps: a sample is drawn from the full set of URLs; exploiting the highly structured nature of Internet forum URLs, each URL is divided into structural parts and a structure vector is constructed; the distance between structure vectors is evaluated with a distance function provided by the invention; the structure vectors of the sample are clustered by the density peak clustering method; the characteristic structure of each cluster is extracted and a resolver describing all sample URLs in the cluster is constructed; and the resolvers are used to parse and classify the remaining URLs in the full set. According to the inventors' experiments, the method achieves high accuracy and execution efficiency.
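The abstract names density peak clustering as the clustering step. The sketch below assumes this refers to the widely used density peaks procedure (local density, distance to the nearest denser point, then assignment along that chain) and works from a precomputed distance matrix. The function name `density_peaks`, the Gaussian density kernel, the parameters `d_c` and `n_clusters`, and the toy one-dimensional data are assumptions for illustration; in the method described here the matrix would hold pairwise dissimilarities between structure vectors.

```python
import math


def density_peaks(dist, d_c, n_clusters):
    """Cluster points from a symmetric distance matrix; returns one label per point."""
    n = len(dist)
    # Local density with a Gaussian kernel; d_c is the cutoff distance.
    rho = [
        sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
        for i in range(n)
    ]
    order = sorted(range(n), key=lambda i: -rho[i])  # densest first
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # densest point: farthest distance by convention
            continue
        # Nearest point among those ranked denser than i.
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j
    # Centers are the points with the largest rho * delta; sorting `order`
    # (stable) keeps the densest point first when scores tie.
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_clusters]
    labels = [-1] * n
    for label, c in enumerate(centers):
        labels[c] = label
    for i in order:  # parents are always denser, so they are labeled first
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


# Toy stand-in for a dissimilarity matrix: two tight groups on a line.
pts = [0.0, 0.05, 0.1, 0.9, 0.95, 1.0]
dist = [[abs(a - b) for b in pts] for a in pts]
labels = density_peaks(dist, d_c=0.2, n_clusters=2)
```

Because the procedure needs only pairwise distances, it fits a setting where items are variable-length structure vectors rather than points in a fixed coordinate space.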

Description

Technical field

[0001] The invention belongs to the technical field of network data mining, and in particular relates to a web forum page clustering method and device based on URL structure.

Background

[0002] A URL is the basic feature that uniquely identifies a web page. Page classification is of great significance to network data mining: it is the most important preparatory step before different types of pages are processed further. Existing methods classify web pages according to semantic structure; use a genetic algorithm with web page tags and attributes as classification features; use contextual features with support vector machines; or select preferred features with an ant colony algorithm, among others. In practice, however, forum pages share few significant common features, which makes feature extraction from web pages arbitrary; in addition, there are many pages...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/35, G06F16/958
Inventors: 王红, 刘锐
Owner: SHANDONG NORMAL UNIV