Internet information publishing time extraction method based on page analysis
A method for extracting the publication time of Internet information, applied in network data retrieval, special data processing applications, instruments, etc., that achieves high collection efficiency, low network resource usage, and fast, accurate extraction
Examples
Embodiment 1
[0032] Embodiment 1, with reference to Figures 1-4: an Internet information publishing time extraction method based on page analysis. First, access the Internet, judge the website type from the URL, and load the target page to obtain the webpage text source code set S; then identify and mark the title line L, and split the title line L at each symbol node to obtain the maximum title length. The specific steps are as follows:
[0033] A. Determine the website type from the URL and load the target page to obtain the text source code set S; the operation steps are as follows:
[0034] A1. Input the web page address, and judge the website type according to common URL naming conventions;
[0035] A2. Input the web page address, and use HttpClient to obtain the original HTML source code set S;
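Step A2 can be illustrated with a minimal sketch. The patent names HttpClient; the code below substitutes Python's standard `urllib` as a stand-in, so it is not the patented implementation. The function name `fetch_source`, the `User-Agent` header, the timeout value, and the UTF-8 fallback are all assumptions made for illustration.

```python
from urllib.request import Request, urlopen


def fetch_source(url: str, timeout: float = 10.0) -> str:
    """Return the raw HTML source of the page at `url` (the set S).

    Hypothetical helper: the name, header, and defaults are illustrative only.
    """
    # Some servers reject requests without a User-Agent, so send a generic one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the response headers; assume UTF-8 otherwise.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of failing on a mislabeled page.
    return raw.decode(charset, errors="replace")
```

The returned string would then be passed to step B for title-line and time matching.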
[0036] B. For news websites, identify and mark the title line L in the set S, and match the time; the operation steps are as follows:
[0037] B1. Match the text source code set S according to the...
Embodiment 2
[0050] Embodiment 2, with reference to Figures 1-4: an operational experiment using the page-analysis-based method for extracting the publishing time of Internet information. The steps are as follows:
[0051] Step 101: judge the website type from the URL and load the target page to obtain the text source code set S; the details are as follows:
[0052] (1) Input the web page address; the website type can be judged from common URL naming conventions. For example, if the URL contains keywords such as "bbs", "forum", or "club", the website can be judged to be a forum.
[0053] (2) Input the web page address and use HttpClient to obtain the original HTML source code set S; for example, the original HTML source code set S obtained through the Internet is as follows:
[0054]
[0055]
[0056] title
[0057] time
[0058]
[0059] content
[0060]
[0061]
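The URL-based type check in item (1) of Step 101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `classify_site`, the `"unknown"` fallback label, and the choice to search only the host and path are assumptions. Only the three keywords come from the description.

```python
from urllib.parse import urlparse

# Keywords listed in the description as indicating a forum site.
FORUM_KEYWORDS = ("bbs", "forum", "club")


def classify_site(url: str) -> str:
    """Guess the website type from the URL text alone (hypothetical helper)."""
    parsed = urlparse(url.lower())
    # Search host and path only, so query strings do not trigger a match.
    haystack = parsed.netloc + parsed.path
    if any(keyword in haystack for keyword in FORUM_KEYWORDS):
        return "forum"
    # Assumption: anything else is left for later steps to classify.
    return "unknown"
```

Plain substring matching is deliberately simple and can misfire (for example, a news site whose host name happens to contain "club"), so a real system would likely need stricter rules.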
[0062] Step 102, identify the website type, and obtain the source cod...