Method for automatically analyzing Internet web page

An automatic parsing technique for Internet web pages, applied in the field of web page parsing. It addresses three problems: search that cannot provide classification and filtering services, search that cannot judge what the user is looking for, and the narrow scope of vertical search.

Inactive Publication Date: 2015-01-07
INSPUR GROUP CO LTD

AI Technical Summary

Problems solved by technology

[0002] Most Internet users obtain information by running a web search and reading the results. General web search cannot classify and filter results according to industry characteristics, cannot display results by category, and cannot judge which content the user actually wants from a given search, so users spend a long time finding the information they need in the results.
Vertical search is a search service over the information on a single website, where the site's own users add information directly to each of its categories. Although the classification is clear, the search scope is narrow. Even websites of the same type differ greatly in organization and page structure, so extracting the required information from them is quite difficult, and users must also consult other search engines to get comprehensive information.


Examples


Embodiment 1

[0018] Taking a shopping website as an example, the user applies vertical search to the website. The specific steps are as follows:

[0019] ① Select a representative web page of a shopping website such as Taobao and search for men's shirts. With the industry word segmentation lexicon up to date, segment the representative web page and display the result to the user. In the most common case, the query is segmented into "men" and "shirts";

[0020] ② Based on the graphical display of the word segmentation results on the web page, provide regular expression matching items. Here the regular expressions substitute numbers for the segmented words, such as 222 for men and 444 for shirts;

[0021] ③ From the regular expression matching items, select the data to be extracted and set a name for each data item;

[0022] ④ From the regular expressions, automatically generate a program for extracting structured data, and establish a vertical search template. When you encounter a shopping website, search for men's shirts ...
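Steps ① and ② can be illustrated with a minimal Python sketch. It assumes the industry lexicon is a plain dictionary from terms to numeric placeholders; the codes 222 and 444 come from the embodiment, while the lexicon contents, the function names `segment` and `encode`, and the use of word-boundary matching in place of a real word segmenter are all assumptions made for illustration.

```python
import re

# Hypothetical industry lexicon mapping each term to a numeric placeholder.
# The codes 222 and 444 come from the embodiment; everything else is assumed.
LEXICON = {"men": "222", "shirts": "444"}


def _term_pattern(lexicon):
    # Longest terms first, so a longer lexicon entry wins over its prefix.
    terms = sorted(lexicon, key=len, reverse=True)
    alternatives = "|".join(map(re.escape, terms))
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def segment(text, lexicon=LEXICON):
    """Return the lexicon terms found in text, in order of appearance."""
    return [m.group(1).lower() for m in _term_pattern(lexicon).finditer(text)]


def encode(text, lexicon=LEXICON):
    """Substitute each lexicon term in text with its numeric placeholder."""
    pattern = _term_pattern(lexicon)
    return pattern.sub(lambda m: lexicon[m.group(1).lower()], text)
```

Under these assumptions, `segment("Men's shirts")` yields the two terms from the embodiment, and `encode` rewrites the same text with their numeric placeholders. Chinese text has no spaces between words, so a production system would likely need a dictionary-based segmenter rather than word boundaries.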

Embodiment 2

[0025] Taking an education website as an example, the user applies vertical search to the website. The specific steps are as follows:

[0026] ① Select a representative web page of an education website such as New Oriental and search for middle school English. With the industry word segmentation lexicon up to date, segment the representative web page and display the result to the user. In the most common case, the query is segmented into "middle school" and "English";

[0027] ② Based on the graphical display of the word segmentation results on the web page, provide regular expression matching items. Here the regular expressions use content replacement, such as replacing middle school with zx and English with yy;

[0028] ③ From the regular expression matching items, select the data to be extracted and set a name for each data item;

[0029] ④ From the regular expressions, automatically generate a program for extracting structured data, and establish a vertical search tem...
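Steps ③ and ④ (naming the data to extract, then generating an extraction program) could look roughly like the following sketch. The patent does not describe the generated program, so this is only one possible reading: each named field is located by a literal prefix chosen by the user, and a single regular expression with named groups is compiled from those choices. The function `build_extractor`, the sample markup in `PAGE`, and the field names are hypothetical.

```python
import re


def build_extractor(fields):
    """Generate an extraction function from named fields.

    fields: list of (data_name, literal_prefix) pairs in page order. Each
    value is taken to be the text between its prefix and the next "<".
    """
    parts = [re.escape(prefix) + f"(?P<{name}>[^<]+)" for name, prefix in fields]
    pattern = re.compile(".*?".join(parts), re.DOTALL)

    def extract(page):
        # One dict of named values per repeated item on the page.
        return [m.groupdict() for m in pattern.finditer(page)]

    return extract


# Hypothetical markup for a course listing page.
PAGE = (
    '<li><span class="title">Middle school English grammar</span>'
    '<span class="price">120</span></li>'
    '<li><span class="title">Middle school English listening</span>'
    '<span class="price">95</span></li>'
)

extract_courses = build_extractor([
    ("course", '<span class="title">'),
    ("price", '<span class="price">'),
])
records = extract_courses(PAGE)
```

Here `records` holds one dictionary per listed course, keyed by the data names set in step ③. Regular expressions are fragile against nested or irregular HTML, so this only suits pages with the regular, repeated structure the template was built from.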

Embodiment 3

[0032] Taking a travel website as an example, the user applies vertical search to the website. The specific steps are as follows:

[0033] ① Select a representative web page of a travel website such as CYTS and search for Huahai. With the industry word segmentation lexicon up to date, segment the representative web page and display the result to the user;

[0034] ② Based on the graphical display of the word segmentation results on the web page, provide regular expression matching items. Here the regular expressions delete specified content or delete spaces, so that variant spellings of Huahai are all reduced to Huahai;

[0035] ③ From the regular expression matching items, select the data to be extracted and set a name for each data item;

[0036] ④ From the regular expressions, automatically generate a program for extracting structured data, and establish a vertical search template. When you encounter a travel website, search for Huahai and use regular expressio...
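The deletion-based matching in step ② can be sketched as a small normalization function. The embodiment only says that specified content or spaces are deleted; which content is deleted, the function names `normalize` and `same_term`, and the case-insensitive comparison are assumptions made for illustration.

```python
import re


def normalize(text, delete=()):
    """Delete each specified substring, then strip all whitespace."""
    for fragment in delete:
        text = text.replace(fragment, "")
    return re.sub(r"\s+", "", text)


def same_term(a, b, delete=()):
    """Compare two strings after normalization, ignoring case."""
    return normalize(a, delete).lower() == normalize(b, delete).lower()
```

With this sketch, a page that writes the term as "Hua hai" or "Hua-hai" (passing `delete=("-",)` for the latter) still matches a search for Huahai.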


Abstract

The invention relates to a method for automatically analyzing Internet web pages and belongs to the field of web page analysis. In the method, a user uses vertical search to extract structured data from the web pages of an industry and builds a template from the extracted data. The specific steps are: (1) select a representative web page of the industry, segment it into words while the industry's word segmentation lexicon is up to date, and display the result to the user; (2) provide regular expression matching items based on the graphical display of the word segmentation result; (3) select the data to be extracted from the regular expression matching items and set data names; (4) automatically generate a structured data extraction program from the regular expressions and build a vertical search template; (5) label the vertical search template and automatically analyze all web pages of the industry according to it. The method combines word segmentation, regular expressions, and label analysis to obtain a vertical search engine and achieve intelligent web page analysis.
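Step (5), labeling a template and then applying it to every page of that industry, might be organized as in the sketch below. The abstract does not say how labels are stored or matched, so the `VerticalTemplate` class, the `analyze` function, and the idea that each page arrives already tagged with an industry label are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class VerticalTemplate:
    label: str           # industry label, e.g. "shopping" (assumed scheme)
    pattern: re.Pattern  # extraction regex with named groups


def analyze(pages, templates):
    """Apply each page's industry template and collect the records.

    pages: iterable of (industry_label, html) pairs.
    Returns a list of (industry_label, record_dict) pairs.
    """
    by_label = {t.label: t for t in templates}
    results = []
    for label, html in pages:
        template = by_label.get(label)
        if template is None:
            continue  # no template has been labeled for this industry
        for match in template.pattern.finditer(html):
            results.append((label, match.groupdict()))
    return results
```

Keying templates by label means a new industry is supported by registering one more template, without changing the analysis loop.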

Description

Technical field

[0001] The invention relates to a method for automatically analyzing Internet web pages and belongs to the field of web page analysis.

Background

[0002] Most Internet users obtain information by running a web search and reading the results. General web search cannot classify and filter results according to industry characteristics, cannot display results by category, and cannot judge which content the user actually wants from a given search, so users spend a long time finding the information they need in the results. Vertical search is a search service over the information on a single website, where the site's own users add information directly to each of its categories. Although the classification is clear, the search scope is narrow. Even websites of the same type differ greatly in organization and page structure, and it is quite diffi...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/957, G06F16/951
Inventor: 范莹, 于治楼, 梁华勇
Owner: INSPUR GROUP CO LTD