
Internet information publishing time extraction method based on page analysis

A technology for extracting the publication time of Internet information, applied in network data retrieval, special data processing applications, instruments, etc., that achieves high collection efficiency, low network resource usage, and fast, accurate extraction.

Status: Inactive
Publication Date: 2014-02-19
JIANGSU JINGE NETWORK TECH

AI Technical Summary

Problems solved by technology

The prior art offers no technique for extracting the publication time of Internet information that meets these requirements.


Examples


Embodiment 1

[0032] Embodiment 1, with reference to Figures 1-4: an Internet information publishing time extraction method based on page analysis. First, access the Internet, judge the website type according to the URL, and load the target page to obtain the web page text source code set S. Then identify and mark the title line L in the set S, and split the title line L at each symbol node to obtain the maximum title length. The specific steps are as follows:

[0033] A. Judge the website type according to the URL and load the target page to obtain the text source code set S. The operation steps are as follows:

[0034] A1. Input the web page address and judge the website type according to common URL naming conventions;

[0035] A2. Input the web page address and use HttpClient to obtain the original HTML source code set S;

[0036] B. For news websites, identify and mark the title line L in the set S and match the time. The operation steps are as follows:

[0037] B1. Match the text source code set S according to the...
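The title-splitting step mentioned in paragraph [0032] (dividing the title line L at each symbol node to obtain the maximum title length) can be illustrated with a minimal Python sketch. It rests on one reading of that step, namely that the title line is split at separator symbols and the longest segment is taken as the headline; the source does not spell this out. The names `TITLE_TAG`, `SEPARATORS`, and `longest_title_segment`, and the separator set itself, are hypothetical.

```python
import re

# Assumption: the title line L is the <title> element of the page.
TITLE_TAG = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# Hypothetical separator set ("symbol nodes"); the source does not list the symbols.
SEPARATORS = re.compile(r"[|_\-]")


def longest_title_segment(title_line: str) -> str:
    """Split the title line at separator symbols and return the longest segment."""
    match = TITLE_TAG.search(title_line)
    # Fall back to the raw line when no <title> element is present.
    text = match.group(1) if match else title_line
    segments = [segment.strip() for segment in SEPARATORS.split(text)]
    return max(segments, key=len)
```

Calling `len()` on the returned segment gives the maximum title length. A separator set this simple also splits hyphenated words, so a real implementation would likely need a stricter rule.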

Embodiment 2

[0050] Embodiment 2, with reference to Figures 1-4: an operational experiment of the page-analysis-based method for extracting the publishing time of Internet information. The steps are as follows:

[0051] Step 101: judge the website type according to the URL and load the target page to obtain the text source code set S, as follows:

[0052] (1) Input the web page address; the website type can be judged from common URL naming conventions. For example, if the URL contains keywords such as "bbs", "forum", or "club", the website can be judged to be a forum.

[0053] (2) Input the web page address and use HttpClient to obtain the original HTML source code set S. For example, the original HTML source code set S obtained from the Internet is as follows:

[0054]-[0061] (HTML listing; its markup was lost in extraction. The surviving text nodes, in order, are: title, time, content.)
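Step 101 can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent names HttpClient for the download, whereas the sketch substitutes the Python standard library's `urllib`. The forum keywords come from paragraph [0052]; the function names, the `"news"` default type, the fallback charsets, and the `User-Agent` value are hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse
from urllib.request import Request, urlopen

# Keywords quoted in paragraph [0052].
FORUM_KEYWORDS = ("bbs", "forum", "club")


def classify_site(url: str, default: str = "news") -> str:
    """Return "forum" if the host or path contains a forum keyword, else `default`."""
    parts = urlparse(url.lower())
    haystack = parts.netloc + parts.path
    if any(keyword in haystack for keyword in FORUM_KEYWORDS):
        return "forum"
    # Assumption: every URL not recognised as a forum falls back to `default`.
    return default


def decode_html(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode page bytes, trying the declared charset first, then common fallbacks."""
    for charset in (declared_charset, "utf-8", "gb18030"):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode with replacement characters rather than fail.
    return raw.decode("utf-8", errors="replace")


def fetch_source(url: str, timeout: float = 10.0) -> str:
    """Download `url` and return its HTML source as text (the set S)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        raw = response.read()
        declared = response.headers.get_content_charset()
    return decode_html(raw, declared)
```

A caller would run `classify_site(url)` to pick the extraction rules and `fetch_source(url)` to obtain S. Plain substring matching is simple but can misfire on unrelated words that happen to contain a keyword.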

[0062] Step 102, identify the website type, and obtain the source cod...


Abstract

The invention relates to a method for extracting the publishing time of Internet information based on page analysis. The method first accesses the Internet and loads the target page according to the website type to obtain a web page text source code set S. It then recognizes the time in the set S using regular expressions built from the tags and keywords of each website type, and extracts it. For news websites, the information title is recognized in the set S using regular expressions, and, because the time usually appears near the title, time regular-expression matching is performed near the title. For forum information, combining keywords with time regular expressions achieves good accuracy, so the publishing time can be extracted quickly and accurately. The method has high collection efficiency and uses few network resources during collection.
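The two matching strategies described in the abstract can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the source does not disclose its regular expressions, so the title patterns, the time pattern, the forum keywords, and the `window` and `gap` parameters are all hypothetical.

```python
import re
from typing import List, Optional

# Hypothetical patterns: the source does not list its actual regular expressions.
TITLE_PATTERNS = (
    re.compile(r"<h1\b", re.IGNORECASE),     # prefer an on-page headline
    re.compile(r"<title\b", re.IGNORECASE),  # fall back to the document title
)
TIME_RE = re.compile(
    r"\d{4}[-/.年]\d{1,2}[-/.月]\d{1,2}日?"   # date, e.g. 2014-02-19 or 2014年2月19日
    r"(?:\s*\d{1,2}:\d{2}(?::\d{2})?)?"       # optional clock time, e.g. 10:30:00
)
FORUM_TIME_KEYWORDS = ("发表于", "Posted on")  # hypothetical forum keywords


def find_title_line(lines: List[str]) -> Optional[int]:
    """Return the index of the title line L, or None if no title is found."""
    for pattern in TITLE_PATTERNS:
        for index, line in enumerate(lines):
            if pattern.search(line):
                return index
    return None


def extract_news_time(source: str, window: int = 5) -> Optional[str]:
    """News pages: return the time string closest to the title line."""
    lines = source.splitlines()
    title = find_title_line(lines)
    if title is None:
        return None
    # Visit lines in order of distance from the title: L, L+1, L-1, L+2, ...
    for distance in range(window + 1):
        for index in (title + distance, title - distance):
            if 0 <= index < len(lines):
                match = TIME_RE.search(lines[index])
                if match:
                    return match.group(0)
    return None


def extract_forum_time(source: str, gap: int = 60) -> Optional[str]:
    """Forum pages: return the first time string within `gap` characters after a keyword."""
    for keyword in FORUM_TIME_KEYWORDS:
        start = source.find(keyword)
        while start != -1:
            end = start + len(keyword)
            match = TIME_RE.search(source, end, end + gap)
            if match:
                return match.group(0)
            start = source.find(keyword, end)
    return None
```

Restricting the search to a window around the title, or to a short span after a keyword, is what keeps unrelated dates elsewhere on the page (archive links, copyright notices) from being picked up.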

Description

Technical field

[0001] The invention belongs to the field of Internet information collection, and in particular relates to a method for extracting the publication time of Internet information based on page analysis.

Background

[0002] With the rapid development of social informatization, the Internet has become an important source of information. Network information is massive, complex, and unstructured, which makes it very difficult to acquire and to analyze and research on the basis of collected data. Extensive practice has shown that collecting information from the various information carriers on the Internet (news sites, blogs, forums, microblogs, etc.) can basically meet requirements, but further extracting the publication time of that information still poses certain technical problems. Especially when it is desired to temporarily collect information for a specific target, high requirements are p...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/95
Inventors: 陈宗华, 陈永江, 葛恒虎, 刘永超, 乔磊
Owner: JIANGSU JINGE NETWORK TECH