A web page feature extraction method and device

A web page feature extraction method, applied in the Internet field, that addresses problems such as excessive dependence on web page structure and unreasonable calculation of feature word weights, with the aim of improving the quality and ensuring the correctness of the extracted features.

Active Publication Date: 2022-04-15
CHINA MOBILE HANGZHOU INFORMATION TECH CO LTD +1
7 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0004] However, the TF_IDF algorithm calculates feature word weights unreasonably. HTML documents differ greatly in structure from ordinary documents: they are a semi-structured text form, and feature words appear at different positions in the document. A word's position reflects how well it represents the article, so words at different positions should be assigned different weight values. Applying the IDF calculation directly is therefore neither scientific nor comprehensive. TF_IDF also distinguishes poorly between classes: it can only capture the difference between a feature item in a text and the class of that text, and cannot express the difference between that feature item and other classes very well.
Extraction technology based on the DOM tree depends too heavily on web page structure. DOM technology extracts data from an HTML web page by using the page's tree-like hierarchical structure, and the feature words obtained this way have relatively high precision and recall. This technology requires several corresponding example web pages, so it can be applied to various knowledge fields; however, because it depends so heavily on structure, it copes poorly when the structure of a web page changes.
In short, the two basic methods above each have limitations: TF_IDF is insensitive to the position of feature words, and DOM tree extraction relies too heavily on web page structure.
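
The position insensitivity described above can be illustrated with a minimal sketch. The smoothed IDF formula used here is one common variant and is an assumption, as are the function name `tf_idf` and the toy pages; none of them come from the patent text. Two pages contain the same words, but in one the phrase sits in the title and in the other it sits in the body. Once the page is flattened into a token list, plain TF-IDF scores them identically.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Plain TF-IDF for one token list against a corpus of token lists.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is one common
    variant (an assumption; the patent does not specify a formula).
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        tf = count / total
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = tf * idf
    return scores


# Hypothetical pages: same words, different positions (title vs. body).
page_a = {"title": ["python", "tutorial"], "body": ["learn", "loops", "today"]}
page_b = {"title": ["learn", "loops", "today"], "body": ["python", "tutorial"]}

# Flattening discards the title/body distinction entirely.
tokens_a = page_a["title"] + page_a["body"]
tokens_b = page_b["title"] + page_b["body"]
corpus = [tokens_a, tokens_b, ["cooking", "recipes", "today"]]

scores_a = tf_idf(tokens_a, corpus)
scores_b = tf_idf(tokens_b, corpus)
```

Here `scores_a` and `scores_b` are equal, even though "python tutorial" is the title of one page and ordinary body text in the other.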

Examples

Embodiment Construction

[0043] The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the drawings in the embodiments of the present invention.

[0044] The web page feature extraction method provided by the present invention uses position weight when extracting web page features, combining the influence of two elements, position weight and frequency of occurrence, on the extraction of the web page feature vector. On the basis of extracting high-frequency words that distinguish the page from other web pages on the entire network, the target web page is divided into multiple document parts according to the basic position structure of web page information, and a different weight ratio value is assigned to each document part. The number of occurrences of the most frequent feature word in the web page is used as the base position weight value, and the product of the two is used to det...

Abstract

The embodiment of the present invention discloses a web page feature extraction method and device. The method divides the target web page into multiple document parts according to the location structure of the web page information; performs word segmentation on each document part and compiles statistics on the segmentation results to obtain multiple sets, one per document part; determines a base position weight value according to the occurrence counts of the feature words in the first set, where the first set is the set with the most data pairs among the multiple sets; determines the weight values of all sets other than the first set according to the base position weight value, the preset weight ratio values, and those sets; and integrates the weight values of all sets to obtain the feature vector of the target web page, so that feature analysis of the web page can be performed according to the feature vector.
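
The steps in the abstract can be sketched roughly as follows. This is an illustrative reading, not the patented implementation: the part names, the ratio values in `PRESET_RATIOS`, the regex tokenizer standing in for word segmentation, and the exact combination rule (count × ratio × base weight, summed per word) are all assumptions, since the listing does not give the formulas. The step of filtering for words that are distinctive across the whole network is omitted.

```python
import re
from collections import Counter

# Hypothetical preset weight ratios for parts other than the first set.
PRESET_RATIOS = {"title": 0.5, "headings": 0.3}


def segment(text):
    """Stand-in for word segmentation: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_features(parts, ratios=PRESET_RATIOS):
    """Build a position-weighted feature vector from named document parts."""
    # One set of (word, count) data pairs per document part.
    sets = {name: Counter(segment(text)) for name, text in parts.items()}

    # The "first set" is the one with the most data pairs (distinct words).
    first = max(sets, key=lambda name: len(sets[name]))

    # Base position weight: count of the most frequent word in the first set.
    base = max(sets[first].values(), default=0)

    # Assumed integration rule: raw counts for the first set, and
    # count * ratio * base for every other set, summed per word.
    vector = Counter()
    for word, count in sets[first].items():
        vector[word] += count
    for name, counts in sets.items():
        if name == first:
            continue
        ratio = ratios.get(name, 0.0)
        for word, count in counts.items():
            vector[word] += count * ratio * base
    return dict(vector)


# Hypothetical target page, already divided into parts.
page = {
    "title": "Python tutorial",
    "body": "This tutorial covers loops and functions. "
            "Loops repeat code; loops are common.",
}
features = extract_features(page)
```

In this toy page the body is the first set and "loops" (3 occurrences) fixes the base weight at 3. "python" appears only once, but because it is in the title it receives 1 × 0.5 × 3 = 1.5, which is more than any body word that also appears once.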

Description

Technical field

[0001] The invention relates to feature extraction technology in the Internet field, in particular to a web page feature extraction method and device.

Background

[0002] The extraction of web page features is one of the key technologies for data analysis of web page content, and it is also an important link in personalized analysis of Internet users and personalized service recommendation. The quality of web page feature extraction directly affects the quality of personalized analysis results for Internet users, and in turn the quality of the personalized services provided to them. The extraction process is very sensitive to the structure of the web page, the richness of the words in its content, and the synonymy of words; the aim is to find the feature words that best characterize the content of the web page.

[0003] In the prior art, the web page feature extraction algorithm is mainly conceived and optimized based on th...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/958, G06F40/289, G06F40/216
CPC: G06F40/289
Inventor: 吕颖韬, 冯宜安, 周璐, 张贝金
Owner: CHINA MOBILE HANGZHOU INFORMATION TECH CO LTD