Automatic identification method for network literature directory type web pages

An automatic identification technology for catalog-type web pages, applied in special data processing applications, instruments, electrical digital data processing, etc., to achieve good identification results.

Status: Inactive
Publication Date: 2012-02-08
SHENGLE INFORMATION TECH SHANGHAI
Cites: 0; Cited by: 11

AI Technical Summary

Problems solved by technology

[0005] The technical problem to be solved by the present invention is to provide a method for automatically identifying catalog web pages of online literature. The method addresses the identification problems caused by the diversity of novel catalog pages across different types of sites, and can identify novel catalog pages well.

Examples

Embodiment Construction

[0017] Figure 1 shows a flow chart of the method of the present invention. The method for automatically identifying catalog web pages of online literature provided by an embodiment of the present invention includes the following steps:

[0018] Step 1: Obtain the data body of the current web page. The data body is the part between the <body> and </body> HTML tags in the HTML source file.
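
As an illustration of Step 1, the following is a minimal Python sketch of pulling the data body out of an HTML source string. The function name `get_data_body` and the use of a regular expression are assumptions made for this example; the source does not say how the body is located.

```python
import re

# Matches everything between an opening <body ...> tag and its closing </body>.
# Regex-based extraction is an assumption of this sketch, not the patent's stated technique.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def get_data_body(html_source: str) -> str:
    """Return the text between <body> and </body>, or "" if no body is found."""
    match = BODY_RE.search(html_source)
    return match.group(1) if match else ""


page = (
    "<html><head><title>t</title></head>"
    "<body class='x'><a href='/c1'>Chapter 1</a></body></html>"
)
data_body = get_data_body(page)
print(data_body)
```

A regular expression keeps the sketch short; a tolerant HTML parser may be a better fit for malformed pages.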

[0019] Step 2: Extract all the character strings corresponding to hyperlink tags that contain hyperlink addresses in the data body, and store the string corresponding to each such hyperlink tag as an array element in string array one. The hyperlink tag is the HTML tag <a>, and a hyperlink tag containing a hyperlink address is a hyperlink tag that contains the "href=" parameter. The method of extracting all the character strings corresponding to the hyperlink tags containing the hyperlink address in the data body is: judging whether the data body contains ...
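
Step 2 could be sketched as follows in Python. The source text is truncated before it finishes describing the extraction procedure, so the regular expression and the function name `extract_hyperlink_strings` below are assumptions for illustration only, not the patent's own scanning procedure.

```python
import re

# An <a ...> tag that carries an href attribute, up to its closing </a>.
# Anchors without href (for example <a name="top">) are skipped.
A_TAG_RE = re.compile(
    r"<a\b[^>]*\shref\s*=[^>]*>.*?</a\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_hyperlink_strings(data_body: str) -> list[str]:
    """Build "string array one": one element per hyperlink tag that has an href."""
    return A_TAG_RE.findall(data_body)


body = (
    '<a href="/c1">Chapter 1</a>'
    '<a name="top">Top</a>'
    '<A HREF="/c2"><img src="x.png"></A>'
)
array_one = extract_hyperlink_strings(body)
for element in array_one:
    print(element)
```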

Abstract

The invention discloses an automatic identification method for network literature directory type web pages. The method comprises the following steps: acquiring the data body of the current web page; extracting the character strings corresponding to hyperlink tags that contain hyperlink addresses and combining them into character string array I; removing the array elements of character string array I that contain image hyperlink tags to form character string array II; extracting the hyperlink text information of each array element of character string array II to form character string array III; judging whether each array element in character string array III is directory text information, and counting the array elements that are directory text information to obtain numerical value I; dividing numerical value I by the total number of array elements of character string array III to obtain a confirmation ratio; and, when the confirmation ratio is greater than 0.7 or numerical value I is greater than 15, determining that the current web page is a literature directory page. With this method, the differing novel directory pages of different sites can be identified well.
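
The decision rule in the abstract can be sketched in Python as below. The thresholds (ratio greater than 0.7, count greater than 15) come from the abstract. Everything else is an assumption of this sketch: the function name, the regular expressions, and in particular the `CHAPTER_RE` heuristic, since the available text does not say how an element is judged to be directory text information.

```python
import re

A_TAG_RE = re.compile(r"<a\b[^>]*\shref\s*=[^>]*>.*?</a\s*>", re.IGNORECASE | re.DOTALL)
IMG_RE = re.compile(r"<img\b", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
# ASSUMPTION: a hypothetical test for "directory text information".
# It accepts link text that starts like a chapter heading, e.g. "第十二章" or "Chapter 3".
CHAPTER_RE = re.compile(
    r"^\s*(第\s*[0-9零一二三四五六七八九十百千两]+\s*[章节回卷]|chapter\s+\d+)",
    re.IGNORECASE,
)


def is_literature_directory_page(data_body, ratio_threshold=0.7, count_threshold=15):
    # Array I: strings of hyperlink tags that carry an href.
    array_one = A_TAG_RE.findall(data_body)
    # Array II: drop elements that contain an image tag.
    array_two = [s for s in array_one if not IMG_RE.search(s)]
    # Array III: hyperlink text with all markup stripped.
    array_three = [TAG_RE.sub("", s).strip() for s in array_two]
    # Value I: number of elements judged to be directory text.
    value_one = sum(1 for text in array_three if CHAPTER_RE.search(text))
    total = len(array_three)
    ratio = value_one / total if total else 0.0
    return ratio > ratio_threshold or value_one > count_threshold


toc_body = "".join(f'<a href="/c{i}">Chapter {i}</a>' for i in range(1, 6))
toc_body += '<a href="/"><img src="logo.png"></a>'
home_body = '<a href="/">Home</a><a href="/about">About</a><a href="/c1">Chapter 1</a>'

print(is_literature_directory_page(toc_body))
print(is_literature_directory_page(home_body))
```

The two conditions complement each other: the ratio covers pages that are almost entirely chapter links, while the absolute count covers long catalogs that also carry many navigation links.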

Description

Technical field

[0001] The present invention relates to web page processing, and in particular to a method for automatically identifying catalog web pages of online literature.

Background technique

[0002] The online literature business is developing rapidly on the Internet, and Internet users increasingly rely on the Internet to read literary works. When reading literature pages on the Internet, the table of contents page is the most important page: it lists all the chapters of the work, so users can reach the chapters they need most conveniently.

[0003] In the prior art, a web page is an HTML (HyperText Markup Language) file. The structure of HTML consists of two parts: the Head, which is the data header of the web page, and the Body, which is the data body of the web page. The data header of a web page refers to the part between the <head> and </head> HTML tags, and the data body of a web page refers to the part between...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 陈运文, 马飞涛, 宋海涛
Owner: SHENGLE INFORMATION TECH SHANGHAI