Webpage detection method based on text analysis

A web page detection technology based on text analysis, applicable to text database clustering/classification, unstructured text data retrieval, file management systems, and similar areas. It addresses the low similarity recognition rate for short texts and the long time needed to generate and compute web page feature values, thereby improving user experience.

Active Publication Date: 2017-01-04
北京惠懂你科技有限公司

AI Technical Summary

Problems solved by technology

Existing recommendation methods based on text similarity have the following shortcomings. When the data scale is very large, generating and calculating web page feature values takes a long time.


Example Embodiment

[0025] A detailed description of one or more embodiments of the present invention is provided below, together with the accompanying drawings that illustrate the principle of the invention. The invention is described in conjunction with these embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications and equivalents. In the following description, many specific details are set forth to provide a thorough understanding of the invention. These details are provided as examples, and the invention can be implemented according to the claims without some or all of them.

[0026] One aspect of the present invention provides a webpage detection method based on text analysis. Figure 1 is a flowchart of a webpage detection method based on text analysis according to an embo...

Abstract

The invention provides a webpage detection method based on text analysis. The method comprises the following steps: a feature extraction strategy is defined on the basis of the crawled webpage data source; page preprocessing is performed, the content of each obtained webpage is determined, and entry attributes unrelated to the information to be extracted are discarded; the needed data items are obtained according to the extraction strategy and stored in an XML file; feature extraction is performed on the XML file to obtain a feature vector, and the feature vectors are clustered; the clustered files are stored in the corresponding database according to their class clusters. With this method, similar data can be found rapidly and efficiently in a large dataset, valuable information can be mined quickly, and the user experience of a search engine is improved.
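The steps in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which features or which clustering algorithm the invention uses, so everything specific here is an assumption made for illustration: the `STRATEGY` field list, the term-frequency feature vector, the cosine similarity measure, the single-pass clustering, the 0.5 threshold, and the sample pages are all hypothetical. Storing the clusters in a database is omitted.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical extraction strategy: the fields to keep from a crawled page.
STRATEGY = ("title", "body")


def page_to_xml(page: dict) -> str:
    """Keep only the fields named by the strategy and serialize them as XML."""
    root = ET.Element("page")
    for field in STRATEGY:
        ET.SubElement(root, field).text = page.get(field, "")
    return ET.tostring(root, encoding="unicode")


def feature_vector(xml_text: str) -> Counter:
    """Term-frequency vector over the XML text (an assumed feature choice)."""
    root = ET.fromstring(xml_text)
    text = " ".join(el.text or "" for el in root.iter())
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors: list, threshold: float = 0.5) -> list:
    """Single-pass clustering: join the first cluster whose leader is similar enough."""
    clusters = []  # each cluster is a list of indices; the first index is the leader
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Made-up sample pages; the "ad" entry is unrelated to the strategy and is discarded.
pages = [
    {"title": "Cheap flights to Paris", "body": "Book cheap flights to Paris today", "ad": "x"},
    {"title": "Paris flights", "body": "Cheap flights and hotels in Paris", "ad": "y"},
    {"title": "Python tutorial", "body": "Learn Python lists and loops", "ad": "z"},
]
vectors = [feature_vector(page_to_xml(p)) for p in pages]
clusters = cluster(vectors)
print(clusters)  # -> [[0, 1], [2]]
```

Single-pass clustering compares each page only with the first member of each existing cluster, which keeps the number of comparisons low on large datasets at the cost of depending on input order. Any other clustering method could be substituted without changing the rest of the pipeline.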

Description

Technical field

[0001] The invention relates to natural language processing, in particular to a web page detection method based on text analysis.

Background

[0002] With the rapid development of Internet technology and related industries, data is growing on an unprecedented scale. Big data brings momentum, but it also brings challenges. Finding valuable resources in massive Internet data and recommending similar content based on user searches is an important task of big data text processing. For web page similarity detection, the space and time complexity of the algorithm must be reduced as much as possible to meet user needs. Existing recommendation methods based on text similarity have the following shortcomings. When the data scale is very large, generating and calculating web page feature values takes a long time; for professional fields, too much reliance on basic corpora to calculate wor...

Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/35, G06F16/83, G06F16/93, G06F16/9535, G06F40/253, G06F40/284, G06F2216/03
Inventor: 张俤
Owner: 北京惠懂你科技有限公司