Internet information publishing time extraction method based on page analysis

A technique for extracting the publication time of Internet information, applied in network data retrieval, special data processing applications, instruments, etc., that achieves high collection efficiency, low network resource usage, and fast, accurate extraction.

Status: Inactive; Publication Date: 2014-02-19

JIANGSU JINGE NETWORK TECH

Problems solved by technology

In the prior art, no Internet information publishing time extraction technology meets these requirements.

Examples

Embodiment 1

[0032] Embodiment 1, with reference to Figures 1-4: an Internet information publishing time extraction method based on page analysis. First, access the Internet, determine the website type from the URL, and load the target page to obtain the web page text source code set S. Then identify the title line L and split it at each symbol node to obtain the maximum title length. The specific steps are as follows:

[0033] A. Determine the website type from the URL and load the target page to obtain the text source code set S. The operation steps are as follows:

[0034] A1. Enter the web page address and determine the website type from common URL naming conventions;

[0035] A2. Enter the web page address and use HttpClient to obtain the original HTML source code set S;

[0036] B. For news websites, identify and mark the title line L in the set S, then match the time. The operation steps are as follows:

[0037] B1. Match the text source code set S according to the...

Embodiment 2

[0050] Embodiment 2, with reference to Figures 1-4: an operational experiment with the page-analysis method for extracting the publishing time of Internet information. The steps are as follows:

[0051] Step 101: determine the website type from the URL and load the target page to obtain the text source code set S. The details are as follows:

[0052] (1) Enter the web page address. The website type can be determined from common URL naming conventions. For example, if the URL contains keywords such as "bbs", "forum", or "club", the website can be judged to be a forum.

[0053] (2) Enter the web page address and use HttpClient to obtain the original HTML source code set S. For example, the original HTML source code set S obtained through the Internet is as follows:

[0054]

[0055]

[0056] title

[0057] time

[0058]

[0059] content

[0060]

[0061]
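
Sub-step (1) of Step 101 can be illustrated with a short Python sketch. It is not the patent's implementation: the function name `classify_site_type`, the fallback label "other", and the choice to look only at the host and path are assumptions made for illustration. Only the keywords "bbs", "forum", and "club" come from the text.

```python
from urllib.parse import urlparse

# Keywords the text names as signs of a forum URL.
FORUM_KEYWORDS = ("bbs", "forum", "club")


def classify_site_type(url: str) -> str:
    """Guess the website type from the URL alone.

    Returns "forum" when the host or path contains a forum keyword.
    The fallback label "other" is an assumption; the source does not
    say how non-forum URLs are labelled.
    """
    parsed = urlparse(url.lower())
    haystack = parsed.netloc + parsed.path
    if any(keyword in haystack for keyword in FORUM_KEYWORDS):
        return "forum"
    return "other"
```

A plain substring test like this can misfire on unrelated words that happen to contain a keyword, so a real collector would probably match whole host labels or path segments instead.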

[0062] Step 102, identify the website type, and obtain the source cod...

Abstract

The invention relates to an Internet information publishing time extraction method based on page analysis. The method first accesses the Internet and loads the target page according to the website type to obtain a web page text source code set S. It then recognizes and extracts the time in S using regular expressions built from the labels and keywords of each website type. For news websites, the information title is recognized in S with regular expressions, and because the time usually appears near the title, time regular expression matching is performed near the title. For forum information, combining keywords with time regular expressions achieves good accuracy, so the publishing time can be extracted rapidly and accurately. The method has high collection efficiency and occupies few network resources during collection.
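
The news-website idea in the abstract (find the title, then look for a time pattern close to it) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the names `TITLE_RE`, `TIME_RE`, and `extract_publish_time`, the use of an `<h1>` tag as the title marker, the specific date patterns, and the five-line search window are all hypothetical choices.

```python
import re
from typing import List, Optional

# Assumption: the title line is the first line containing an <h1> element.
TITLE_RE = re.compile(r"<h1[^>]*>.*?</h1>", re.IGNORECASE)

# Assumption: dates look like 2014-02-19, 2014/02/19 or 2014年2月19日,
# optionally followed by a clock time such as 10:30 or 10:30:00.
TIME_RE = re.compile(
    r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"
)


def extract_publish_time(source_lines: List[str], window: int = 5) -> Optional[str]:
    """Return the time string closest to the title line, or None."""
    title_idx = next(
        (i for i, line in enumerate(source_lines) if TITLE_RE.search(line)),
        None,
    )
    if title_idx is None:
        return None

    # Visit lines inside the window in order of distance from the title.
    start = max(0, title_idx - window)
    stop = min(len(source_lines), title_idx + window + 1)
    for i in sorted(range(start, stop), key=lambda i: abs(i - title_idx)):
        match = TIME_RE.search(source_lines[i])
        if match:
            return match.group(0)
    return None
```

Searching outward from the title means a date mentioned deep in the article body is ignored whenever a date sits nearer the headline, which is the property the abstract relies on.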

Description

Technical field

[0001] The invention belongs to the field of Internet information collection, and in particular relates to a method for extracting Internet information publication time based on page analysis.

Background

[0002] With the rapid development of social informatization, the Internet has become an important source of information. Network information is massive, complex, and unstructured, which makes it very difficult to acquire and to analyze or research once collected. Extensive practice has shown that information collection from the various information carriers on the Internet (news sites, blogs, forums, microblogs, etc.) can basically meet requirements, but going further and obtaining the publication time of that information still poses technical problems. Especially when it is desired to temporarily collect information for a specific target, high requirements are p...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F16/95

Inventors: 陈宗华, 陈永江, 葛恒虎, 刘永超, 乔磊

Owner: JIANGSU JINGE NETWORK TECH
