Webpage data extraction method and device, terminal equipment and storage medium

By performing rule analysis on sample data, obtaining data change rules, and extracting web page data column headers, the problem of not being able to obtain web page data rendered by web programming languages in existing technologies is solved, and efficient web page data extraction is achieved.

CN116881603B · Active · Publication Date: 2026-06-23 · CHINA MERCHANTS BANK

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA MERCHANTS BANK
Filing Date
2023-07-31
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing RPA technologies cannot directly obtain web page data rendered by web programming languages, especially data that does not contain HTML tags.

Method used

By performing rule analysis based on pre-acquired sample data, data change rules are obtained. Data is then extracted from the webpage using these rules, including analyzing the changing patterns of data columns and rows, merging selector type sets, obtaining the webpage's Document object file, and extracting the data column headers, ultimately achieving the extraction of webpage data.

Benefits of technology

It solves the problem of being unable to obtain web page data rendered by programming languages, and improves the efficiency of web page data extraction.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a webpage data extraction method and device, a terminal equipment and a storage medium, and the method comprises the following steps: performing rule analysis according to sample data obtained in advance, and obtaining a data change rule; and performing data extraction on a preset webpage through the data change rule, and obtaining webpage data. The application solves the problem that webpage data rendered by a programming language cannot be obtained, and improves the webpage data extraction efficiency.

Description

Technical Field

[0001] This invention relates to the field of information technology, and in particular to a method, apparatus, terminal device, and storage medium for extracting web page data.

Background Technology

[0002] Existing RPA (Robotic Process Automation) technology can only directly obtain data that is defined by HTML tags and rendered on the webpage. It cannot directly obtain data that a web programming language generates and renders according to conditional rules; such data typically does not carry those HTML tags.

[0003] The above content is only used to help understand the technical solution of the present invention and does not represent an admission that the above content is prior art.

Summary of the Invention

[0004] The main objective of this invention is to provide a method, apparatus, terminal device, and storage medium for extracting web page data, aiming to solve the technical problem of being unable to obtain web page data rendered by programming languages.

[0005] To achieve the above objectives, the present invention provides a webpage data extraction method, the webpage data extraction method comprising:

[0006] Based on the pre-acquired sample data, rule analysis is performed to obtain the data change rules;

[0007] Data is extracted from preset web pages using the aforementioned data change rules to obtain web page data.

[0008] Optionally, the step of performing rule analysis based on pre-acquired sample data to obtain data change rules includes:

[0009] Analyze the sample data to obtain the data columns and data rows of the sample data;

[0010] Analyze the data columns and data rows of the sample data to obtain the changing patterns of the data columns and data rows;

[0011] By combining the changing patterns of the data columns and data rows, the data change rules can be obtained.

[0012] Optionally, the step of analyzing the data columns and rows of the sample data to obtain the changing patterns of the data columns and rows includes:

[0013] Based on the data columns and data rows of the sample data, obtain the first dataset and the second dataset of the data columns and data rows;

[0014] Based on the analysis of the first dataset and the second dataset, the first common selector type set of the first dataset and the second dataset is obtained;

[0015] Based on the first common selector type set, the data columns and data rows are analyzed using the first dataset and / or the second dataset to obtain the changing patterns.

[0016] Optionally, the step of merging the data columns and data rows to obtain the data change rules includes:

[0017] Based on the changing patterns of the data columns and data rows, obtain the second common selector type set;

[0018] The variation patterns of the data columns and data rows are merged to obtain a set of variation patterns;

[0019] Obtain the corresponding pattern data through the second common selector type set;

[0020] Based on the aforementioned pattern data, classification is performed using the set of change patterns to obtain classification results;

[0021] Based on the classification results, obtain the data change rules.

[0022] Optionally, the step of extracting data from a preset webpage using the data change rules to obtain webpage data includes:

[0023] Obtain the Document object file of the webpage;

[0024] Based on the Document object file, obtain the data column headers of the webpage;

[0025] Based on the data column headers of the webpage, the data is extracted according to the data change rules to obtain the webpage data.

[0026] Optionally, the step of extracting webpage data based on the data column headers of the webpage and the data change rules includes:

[0027] Based on the data column header of the webpage, obtain the webpage data columns and webpage data rows;

[0028] Based on the webpage data columns and rows, the webpage data is extracted using the data change rules.

[0029] Optionally, after the step of extracting webpage data based on the data column headers of the webpage using the data change rules, the method further includes:

[0030] In response to user actions, the system analyzes the Document object file and page number of the webpage to obtain the analysis results.

[0031] If the analysis result indicates that there is a next webpage, then extract the webpage data of the next webpage;

[0032] If the analysis result indicates that there is no next webpage, then the data extraction will end.
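The paging steps in [0030] to [0032] can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: the `Page` structure, the `load_page` callback, and the `FAKE_PAGES` data are hypothetical stand-ins for analyzing the Document object and page number of a real webpage, and the user action that triggers the analysis is omitted.

```python
from typing import Callable, Dict, List, TypedDict


class Page(TypedDict):
    """Hypothetical result of analyzing one webpage."""
    number: int      # current page number
    total: int       # total number of pages
    rows: List[str]  # data already extracted from this page


def extract_all_pages(load_page: Callable[[int], Page]) -> List[str]:
    """Extract page after page until the analysis shows no next page."""
    rows: List[str] = []
    number = 1
    while True:
        page = load_page(number)
        rows.extend(page["rows"])
        # No next webpage: end the data extraction.
        if page["number"] >= page["total"]:
            break
        # A next webpage exists: continue with it.
        number += 1
    return rows


# Hypothetical three-page data source used only for this sketch.
FAKE_PAGES: Dict[int, Page] = {
    1: {"number": 1, "total": 3, "rows": ["a", "b"]},
    2: {"number": 2, "total": 3, "rows": ["c"]},
    3: {"number": 3, "total": 3, "rows": ["d", "e"]},
}

all_rows = extract_all_pages(lambda n: FAKE_PAGES[n])
```

Comparing the page number with a total is only one possible end condition; a real page might instead expose a disabled "next" control.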

[0033] This invention also proposes a webpage data extraction device, which includes:

[0034] The acquisition module is used to perform rule analysis based on pre-acquired sample data to obtain data change rules;

[0035] The data extraction module is used to extract data from preset web pages according to the data change rules, and obtain web page data.

[0036] This invention also proposes a terminal device, which includes a memory, a processor, and a web page data extraction program stored in the memory and executable on the processor. When the web page data extraction program is executed by the processor, it implements the steps of the web page data extraction method described above.

[0037] This invention also proposes a computer-readable storage medium storing a web page data extraction program, which, when executed by a processor, implements the steps of the web page data extraction method described above.

[0038] This invention proposes a method, apparatus, and terminal device for extracting web page data. Rule analysis is performed on pre-acquired sample data to obtain data change rules, and these rules are then used to extract data from a preset web page. Because the rules are derived from sample data, they can be used to extract web page data rendered by a programming language, which solves the problem of being unable to obtain such data and improves the efficiency of web page data extraction.

Attached Figure Description

[0039] Figure 1 is a schematic diagram of the functional modules of the terminal device to which the webpage data extraction device of the present invention belongs;

[0040] Figure 2 is a flowchart of an exemplary embodiment of the webpage data extraction method of the present invention;

[0041] Figure 3 is a flowchart of another exemplary embodiment of the webpage data extraction method of the present invention;

[0042] Figure 4 is a schematic diagram of obtaining the data change rules in the webpage data extraction method of the present invention;

[0043] Figure 5 is a schematic diagram of the process of obtaining the change patterns of data columns and data rows in the webpage data extraction method of the present invention;

[0044] Figure 6 is a schematic diagram of analyzing data columns and data rows in the webpage data extraction method of the present invention;

[0045] Figure 7 is a flowchart of merging the change patterns of data columns and data rows in the webpage data extraction method of the present invention;

[0046] Figure 8 is a schematic diagram of merging the change patterns of data columns and data rows in the webpage data extraction method of the present invention;

[0047] Figure 9 is a flowchart of another exemplary embodiment of the webpage data extraction method of the present invention;

[0048] Figure 10 is a schematic diagram of acquiring webpage data in the webpage data extraction method of the present invention;

[0049] Figure 11 is a schematic diagram of the process of extracting webpage data according to the data change rules in the webpage data extraction method of the present invention;

[0050] Figure 12 is a schematic diagram of extracting webpage data in the webpage data extraction method of the present invention;

[0051] Figure 13 is a flowchart of another exemplary embodiment of the webpage data extraction method of the present invention;

[0052] Figure 14 is a schematic diagram of obtaining data from the next webpage in the webpage data extraction method of the present invention.

[0053] The realization of the objective, functional features and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Implementation

[0054] It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0055] The main solution of this invention is as follows. Analyze the sample data to obtain its data columns and data rows; analyze the data columns and data rows to obtain their change patterns; merge the change patterns of the data columns and data rows to obtain the data change rules.

To obtain the change patterns, obtain a first dataset and a second dataset from the data columns and data rows of the sample data; analyze the first dataset and the second dataset to obtain a first common selector type set; and, based on the first common selector type set, analyze the first dataset and / or the second dataset to obtain the change patterns of the data columns and data rows.

To obtain the data change rules, obtain a second common selector type set from the change patterns of the data columns and data rows; merge the change patterns of the data columns and data rows to obtain a change pattern set; obtain the corresponding pattern data through the second common selector type set; classify the pattern data through the change pattern set to obtain classification results; and obtain the data change rules from the classification results.

To extract the data, obtain the Document object file of the webpage; obtain the data column headers of the webpage from the Document object file; obtain the webpage data columns and webpage data rows from the data column headers; and extract the webpage data from the webpage data columns and rows using the data change rules.

In response to user actions, analyze the Document object file and page number of the webpage to obtain an analysis result. If the analysis result indicates that a next webpage exists, extract the webpage data of that next webpage; otherwise, end the data extraction.

This solves the problem of not being able to obtain webpage data rendered by programming languages, achieves webpage data extraction, and improves its efficiency. Starting from the problem that data generated and rendered on webpages by Web programming languages according to conditional rules cannot be obtained, this invention designs a webpage data extraction method, verifies its effectiveness during webpage data extraction, and significantly improves the efficiency of webpage data extraction.

[0056] Technical terms involved in the embodiments of this invention:

[0057] Data column headers: Data column headers are the first row in a data table or spreadsheet, typically used to identify the data content represented by each column. Data column headers usually have the following characteristics: Identifiers: Data column headers usually use text or words to identify the column content, such as "Name," "Age," "Gender," etc.; Uniqueness: Each column header should be unique to clearly distinguish different columns; Descriptiveness: Data column headers should be descriptive, accurately reflecting the data information contained in the column. Data column headers play a crucial role in data analysis and processing. They help understand the meaning of data, perform data filtering, sorting, and analysis, and provide appropriate data visualization. In spreadsheet software, the first row can be used as the data column header, with descriptive titles added to each column. In programming languages, data column headers can also be represented using relevant data structures or data frames, allowing for corresponding operations and processing. Whether in spreadsheets or programming, well-designed data column headers are essential for data processing and analysis.

[0058] Common selector types: In web development, several common selector types are used to select HTML elements for applying styles or performing DOM manipulations. Element selectors use the element's tag name as the selector and select all elements with that tag name; for example, the "p" selector selects all <p> elements. Class selectors use the value of the element's `class` attribute, written as "." followed by the class name; for example, the ".box" selector selects all elements with the class name "box". ID selectors use the value of the element's `id` attribute, written as "#" followed by the `id` value; for example, the "#header" selector selects the unique element whose `id` is "header". Descendant selectors separate multiple selectors with spaces and select all descendants of a given element; for example, the "ul li" selector selects all <li> elements under a <ul> element. Child selectors use ">" and select the direct children of an element; for example, the "ul>li" selector selects the <li> elements that are direct children of a <ul> element. Attribute selectors select elements by attribute name, attribute value, or part of the attribute value; for example, the "[href]" selector selects elements that have an href attribute. Pseudo-class selectors select elements that meet specific conditions, such as the state of a link or the position of an element; for example, the ":hover" selector matches an element while the mouse pointer hovers over it.
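As an illustration of the selector types listed above, the following sketch classifies a simple selector string by its syntax. The function name `selector_type` and the returned labels are assumptions made for this example; the sketch handles only simple selectors of the kinds named in [0058] and is not a full CSS parser.

```python
def selector_type(selector: str) -> str:
    """Classify a simple CSS selector string (illustrative only)."""
    s = selector.strip()
    if ">" in s:
        return "child"          # e.g. "ul>li"
    if s.startswith("[") and s.endswith("]"):
        return "attribute"      # e.g. "[href]"
    if any(ch.isspace() for ch in s):
        return "descendant"     # e.g. "ul li"
    if s.startswith("#"):
        return "id"             # e.g. "#header"
    if s.startswith("."):
        return "class"          # e.g. ".box"
    if ":" in s:
        return "pseudo-class"   # e.g. ":hover"
    return "element"            # e.g. "p"


examples = ["p", ".box", "#header", "ul li", "ul>li", "[href]", ":hover"]
kinds = [selector_type(s) for s in examples]
```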

[0059] The Document object: The Document object of a webpage is the part of JavaScript used to represent the entire webpage document. It is a built-in object provided by the browser, which can be used to access and manipulate various elements and attributes of the webpage through JavaScript. The Document object also provides other properties and methods for manipulating the webpage, such as modifying element styles, adding and removing DOM nodes, etc.

[0060] The embodiments of the present invention take into account that, when acquiring web page data, related technologies can only acquire data that has HTML tag definitions and is rendered on the web page. This method has the problem that it cannot acquire web page data rendered by programming languages.

[0061] Therefore, this invention addresses the problem that data generated and rendered on web pages by web programming languages according to conditional rules cannot be obtained. It designs a web page data extraction method and verifies its effectiveness during the extraction process, and the efficiency of web page data extraction using this method is significantly improved.

[0062] Specifically, referring to Figure 1, Figure 1 is a schematic diagram of the functional modules of the terminal device to which the webpage data extraction device of the present invention belongs. The webpage data extraction device can be a device capable of extracting webpage data that is independent of the terminal device, and it can be implemented on the terminal device in hardware or software form. The terminal device can be a smart mobile device with data processing capabilities, such as a mobile phone or tablet computer, or a fixed terminal device or server with data processing capabilities.

[0063] In this embodiment, the terminal device to which the web page data extraction device belongs includes at least an output module 110, a processor 120, a memory 130, and a communication module 140.

[0064] The memory 130 stores the operating system and a webpage data extraction program. The webpage data extraction device can perform rule analysis based on pre-acquired sample data to obtain data change rules; it then extracts data from preset webpages using these data change rules to obtain webpage data. The extraction results and other information obtained through the webpage data extraction program are stored in the memory 130. The output module 110 can be a display screen, etc. The communication module 140 can include a WIFI module, a mobile communication module, and a Bluetooth module, etc., and communicates with external devices or servers through the communication module 140.

[0065] When the web page data extraction program in memory 130 is executed by the processor, it performs the following steps:

[0066] Based on the pre-acquired sample data, rule analysis is performed to obtain the data change rules;

[0067] Data is extracted from preset web pages using the aforementioned data change rules to obtain web page data.

[0068] Furthermore, when the web page data extraction program in memory 130 is executed by the processor, it also performs the following steps:

[0069] Analyze the sample data to obtain the data columns and data rows of the sample data;

[0070] Analyze the data columns and data rows of the sample data to obtain the changing patterns of the data columns and data rows;

[0071] By combining the changing patterns of the data columns and data rows, the data change rules can be obtained.

[0072] Furthermore, when the web page data extraction program in memory 130 is executed by the processor, it also performs the following steps:

[0073] Based on the data columns and data rows of the sample data, obtain the first dataset and the second dataset of the data columns and data rows;

[0074] Based on the analysis of the first dataset and the second dataset, the first common selector type set of the first dataset and the second dataset is obtained;

[0075] Based on the first common selector type set, the data columns and data rows are analyzed using the first dataset and / or the second dataset to obtain the changing patterns.

[0076] Furthermore, when the web page data extraction program in memory 130 is executed by the processor, it also performs the following steps:

[0077] Based on the changing patterns of the data columns and data rows, obtain the second common selector type set;

[0078] The variation patterns of the data columns and data rows are merged to obtain a set of variation patterns;

[0079] Obtain the corresponding pattern data through the second common selector type set;

[0080] Based on the aforementioned pattern data, classification is performed using the set of change patterns to obtain classification results;

[0081] Based on the classification results, obtain the data change rules.

[0082] Furthermore, when the web page data extraction program in memory 130 is executed by the processor, it also performs the following steps:

[0083] Obtain the Document object file of the webpage;

[0084] Based on the Document object file, obtain the data column headers of the webpage;

[0085] Based on the data column headers of the webpage, the data is extracted according to the data change rules to obtain the webpage data.

[0086] Furthermore, when the web page data extraction program in memory 130 is executed by the processor, it also performs the following steps:

[0087] Based on the data column header of the webpage, obtain the webpage data columns and webpage data rows;

[0088] Based on the webpage data columns and rows, the webpage data is extracted using the data change rules.

[0089] Furthermore, when the web page data extraction program in memory 130 is executed by the processor, it also performs the following steps:

[0090] In response to user actions, the system analyzes the Document object file and page number of the webpage to obtain the analysis results.

[0091] If the analysis result indicates that there is a next webpage, then extract the webpage data of the next webpage;

[0092] If the analysis result indicates that there is no next webpage, then the data extraction will end.

[0093] Through the above scheme, this embodiment obtains data change rules by performing rule analysis on pre-acquired sample data, and extracts data from a preset webpage using these data change rules to obtain webpage data. Using rules derived from sample data to extract webpage data solves the problem of being unable to obtain webpage data rendered by programming languages. Starting from the problem that data generated and rendered on webpages by web programming languages according to conditional rules cannot be obtained, this invention designs a webpage data extraction method, verifies its effectiveness during webpage data extraction, and significantly improves the efficiency of webpage data extraction.

[0094] Based on, but not limited to, the above-described terminal device architecture, embodiments of the present invention are proposed.

[0095] Referring to Figure 2, Figure 2 is a flowchart illustrating an exemplary embodiment of the webpage data extraction method of the present invention. The webpage data extraction method includes:

[0096] Step S01: Perform rule analysis based on the pre-acquired sample data to obtain data change rules;

[0097] The execution subject of the method in this embodiment can be a web page data extraction device, a web page data extraction terminal device, or a server. This embodiment takes a web page data extraction device as an example, which can be integrated into a terminal device with data processing functions.

[0098] To obtain the data change rules, take the following steps:

[0099] First, in existing technologies, webpage data is generally acquired by users copying data from the page, and some methods also retrieve data through HTML tags. However, data generated and rendered on webpages by web programming languages according to conditional rules cannot be obtained through tags. Web programming refers to the process of creating, developing, and maintaining websites and web applications using programming languages and related technologies. Related technologies and concepts include front-end development, which involves user interface design and interaction and uses technologies such as HTML, CSS, and JavaScript to build the structure, style, and behavior of webpages, and back-end development, which uses back-end programming languages (such as Python, Java, and Ruby) and related frameworks (such as Django, Spring, and Ruby on Rails) to handle data storage, business logic, and interaction with the front end.

[0100] Then, sample data for generating data change rules is obtained, wherein, in this embodiment, sample data refers to data used to render web pages;

[0101] Finally, after analyzing the variation patterns of HTML elements in the sample data, variation rules for the sample data are obtained. Based on these variation rules, webpage data can be acquired.

[0102] Step S02: Extract data from the preset webpage using the data change rules to obtain webpage data.

[0103] After obtaining the data change rules, use them to retrieve webpage data:

[0104] First, obtain the Document object file of the webpage. In this embodiment, the file is obtained in the [.doc] format, but in other embodiments it can also be in the [.xlsx] format, etc.

[0105] Then, the column headers of the data in the webpage are obtained through the Document object file. The column header refers to the first row in a data table or spreadsheet, which is usually used to identify the data content represented by each column. In programming languages, data column headers can be represented by relevant data structures or data frames, and corresponding operations and processing can be performed. Whether in spreadsheets or programming, good data column header design is crucial for data processing and analysis.

[0106] Finally, based on the column headers of the data, the webpage data is extracted using the previously obtained data change rules.
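The three steps of S02 can be sketched with Python's standard `html.parser`. This is a minimal illustration under stated assumptions: the parsed markup stands in for the webpage's Document object, a "data change rule" is modelled as an `id` template such as `r{row}-c{col}`, and the names `IdTextCollector`, `extract_rows`, and `SAMPLE_HTML` are hypothetical. The sample markup uses only paired `div` and `span` elements; void elements such as `br` are not handled.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class IdTextCollector(HTMLParser):
    """Collect the text of every element that carries an id attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.text_by_id: Dict[str, str] = {}
        self._open_ids: List[Optional[str]] = []  # id of each open element

    def handle_starttag(self, tag, attrs):
        self._open_ids.append(dict(attrs).get("id"))

    def handle_endtag(self, tag):
        if self._open_ids:
            self._open_ids.pop()

    def handle_data(self, data):
        # Credit the text to the nearest enclosing element that has an id.
        for elem_id in reversed(self._open_ids):
            if elem_id is not None:
                previous = self.text_by_id.get(elem_id, "")
                self.text_by_id[elem_id] = previous + data.strip()
                break


def extract_rows(html: str, header_rule: str, cell_rule: str) -> List[Dict[str, str]]:
    """Read column headers first, then extract rows with the cell rule."""
    collector = IdTextCollector()
    collector.feed(html)
    collector.close()
    text = collector.text_by_id

    headers: List[str] = []
    col = 1
    while header_rule.format(col=col) in text:
        headers.append(text[header_rule.format(col=col)])
        col += 1

    rows: List[Dict[str, str]] = []
    row = 1
    while cell_rule.format(row=row, col=1) in text:
        rows.append({
            header: text.get(cell_rule.format(row=row, col=c), "")
            for c, header in enumerate(headers, start=1)
        })
        row += 1
    return rows


# Hypothetical script-rendered grid without table markup.
SAMPLE_HTML = """
<div id="grid">
  <span id="h-1">Name</span><span id="h-2">Age</span>
  <span id="r1-c1">Ann</span><span id="r1-c2">31</span>
  <span id="r2-c1">Bob</span><span id="r2-c2">27</span>
</div>
"""

rows = extract_rows(SAMPLE_HTML, "h-{col}", "r{row}-c{col}")
```

Keying each extracted row by its column header mirrors the role the text gives to headers: they identify what each column represents.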

[0107] This embodiment, through the above-described scheme, specifically obtains data change rules by performing rule analysis on pre-acquired sample data; and then extracts data from a preset webpage using these data change rules to obtain webpage data. Therefore, by using pre-acquired data change rules to obtain webpage data, the problem of being unable to obtain webpage data rendered by programming languages is solved, thus improving the efficiency of webpage data extraction.

[0108] Referring to Figure 3, Figure 3 is a flowchart illustrating another exemplary embodiment of the webpage data extraction method of the present invention.

[0109] Based on the embodiment shown in Figure 2 above, step S01, which involves performing rule analysis based on pre-acquired sample data to obtain data change rules, includes:

[0110] Step S011: Analyze the sample data to obtain the data columns and data rows of the sample data;

[0111] Step S012: Analyze the data columns and data rows of the sample data to obtain the changing patterns of the data columns and data rows;

[0112] Step S013: Combine the change patterns of the data columns and data rows to obtain the data change rules.

[0113] Specifically, the steps for obtaining data change rules using sample data are as follows:

[0114] First, the sample data is analyzed to obtain the data columns and data rows. During the analysis of the sample data, if the column headers of the sample data are obtained, the data change rules can also be obtained by analyzing the column headers.

[0115] Then, by analyzing the data columns and data rows of the sample data, the changing patterns of the data rows and data columns are obtained. The changing patterns include, but are not limited to, linear changes, exponential changes, periodic changes, fluctuating changes, and gradual changes.

[0116] Finally, the patterns of change in data columns and data rows are merged to obtain the data change rules.
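To make the change patterns named in [0115] concrete, the sketch below classifies a numeric sequence as constant, linear, exponential, or periodic. It is an illustrative assumption about how such patterns might be detected: the function name `classify_pattern`, the tolerance parameter, and the labels are not from the source, and the fluctuating and gradual patterns are not covered.

```python
from typing import Sequence


def classify_pattern(values: Sequence[float], tol: float = 1e-9) -> str:
    """Return a coarse label for how a numeric sequence changes."""
    n = len(values)
    if n < 3:
        return "unknown"  # too few samples to judge

    # Linear: consecutive differences are all equal.
    diffs = [b - a for a, b in zip(values, values[1:])]
    if all(abs(d - diffs[0]) <= tol for d in diffs):
        return "constant" if abs(diffs[0]) <= tol else "linear"

    # Exponential: consecutive ratios are all equal (no zero values).
    if all(abs(v) > tol for v in values):
        ratios = [b / a for a, b in zip(values, values[1:])]
        if all(abs(r - ratios[0]) <= tol for r in ratios):
            return "exponential"

    # Periodic: the sequence repeats with some period p >= 2.
    for p in range(2, n // 2 + 1):
        if all(abs(values[i] - values[i + p]) <= tol for i in range(n - p)):
            return "periodic"

    return "irregular"


labels = [
    classify_pattern([1, 2, 3, 4]),
    classify_pattern([2, 4, 8, 16]),
    classify_pattern([1, 2, 1, 2, 1, 2]),
    classify_pattern([5, 5, 5]),
    classify_pattern([1, 5, 2, 9]),
]
```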

[0117] More specifically, as shown in Figure 4, Figure 4 is a schematic diagram illustrating the data change rules involved in the webpage data extraction method of the present invention.

[0118] First, sample data is acquired and grouped. The purpose of grouping is that different data will produce different rendering effects. Grouping the sample data can make the acquired variation rules more accurate, and can ensure that multiple types of data are acquired according to corresponding rules when acquiring web page data.

[0119] Then, the sample data is analyzed to determine whether column headers exist. If column headers exist, the pattern of column header changes can be analyzed. If column headers do not exist, the data columns of the sample data can be analyzed.

[0120] Then, if the column header is being analyzed, we can analyze whether there is a pattern of change in the column header. If there is, we can continue to analyze the pattern of change in the data column and the data row.

[0121] Then, since the column headers of the data serve to identify the data content represented by each column, the column headers are retrieved first;

[0122] Then, the patterns of change in the acquired data columns and rows are merged, and the patterns of change are analyzed. If they exist, the patterns of change in the data are saved.

[0123] Finally, the sample data from different groups were analyzed to obtain the patterns of change in all the data.

[0124] Furthermore, when analyzing data columns and data rows, it should be understood that if there is no pattern of change in the column header, the acquisition of the pattern of change can be stopped, but the acquisition of the pattern of change in the data columns and data rows can continue.

[0125] This embodiment, through the above-described scheme, specifically analyzes the sample data to obtain the data columns and rows of the sample data; analyzes the data columns and rows of the sample data to obtain the variation patterns of the data columns and rows; and merges the variation patterns of the data columns and rows to obtain data variation rules. Thus, by merging the variation patterns of data columns and rows to obtain data variation rules, the problem of not having corresponding data variation rules to obtain web page data is solved, improving the efficiency of web page data extraction.
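One way to picture the merge in step S013 is shown below. The sketch assumes, purely for illustration, that each sample cell is identified by an `id` string such as `r1-c1`; it finds the numeric part that changes down a column and the part that changes along a row, then merges both into one template. The names `tokens`, `varying_slot`, and `merge_rule` are hypothetical, and the resulting template does not record the start value or step.

```python
import re
from typing import List, Optional


def tokens(identifier: str) -> List[str]:
    """Split an identifier into alternating digit and non-digit parts."""
    return re.findall(r"\d+|\D+", identifier)


def varying_slot(samples: List[str]) -> Optional[int]:
    """Index of the single numeric token that changes with a constant step."""
    split = [tokens(s) for s in samples]
    if len({len(t) for t in split}) != 1:
        return None
    slots = [i for i in range(len(split[0])) if len({t[i] for t in split}) > 1]
    if len(slots) != 1:
        return None
    slot = slots[0]
    if not all(t[slot].isdigit() for t in split):
        return None
    numbers = [int(t[slot]) for t in split]
    steps = {b - a for a, b in zip(numbers, numbers[1:])}
    return slot if len(steps) == 1 else None


def merge_rule(grid: List[List[str]]) -> Optional[str]:
    """Merge the row pattern and the column pattern into one template."""
    first_column = [row[0] for row in grid]  # row index varies
    first_row = grid[0]                      # column index varies
    row_slot = varying_slot(first_column)
    col_slot = varying_slot(first_row)
    if row_slot is None or col_slot is None or row_slot == col_slot:
        return None
    parts = tokens(grid[0][0])
    parts[row_slot] = "{row}"
    parts[col_slot] = "{col}"
    return "".join(parts)


sample_ids = [
    ["r1-c1", "r1-c2", "r1-c3"],
    ["r2-c1", "r2-c2", "r2-c3"],
]
rule = merge_rule(sample_ids)
```

With the sample grid above, `rule.format(row=2, col=3)` yields the identifier of the cell in the second row and third column.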

[0126] Referring to Figure 5, Figure 5 is a schematic diagram illustrating the process of obtaining the change patterns of data columns and data rows in the web page data extraction method of the present invention.

[0127] Based on the embodiment shown in Figure 3 above, step S012, which involves analyzing the data columns and rows of the sample data to obtain the changing patterns of the data columns and rows, includes:

[0128] Step S0121: Based on the data columns and data rows of the sample data, obtain the first dataset and the second dataset of the data columns and data rows;

[0129] Step S0122: Analyze the first dataset and the second dataset to obtain the first common selector type set of the first dataset and the second dataset;

[0130] Step S0123: Based on the first common selector type set, analyze the data columns and data rows by using the first dataset and / or the second dataset to obtain the changing patterns.

[0131] Specifically, to obtain the changing patterns of data columns and data rows, the following steps are performed:

[0132] First, a set of sample data in a column and a set of sample data in a row are obtained by using the data columns and data rows of the sample data. In this embodiment, these are represented by the first dataset and the second dataset.

[0133] Then, obtain the common selector type set from the first dataset and the second dataset. In web development, the common selector type refers to the CSS selector used to select and manipulate elements in an HTML document. In this embodiment, the common selector type includes, but is not limited to, element selector, class selector, ID selector, descendant selector, and attribute selector, etc.

[0134] Finally, using the common selector types, the corresponding data is retrieved from the first and second datasets and analyzed to obtain the changing patterns of the data columns and rows.
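
As an illustration of how a common selector type set might be computed, the following is a minimal sketch and not the patented implementation. It assumes, hypothetically, that each sampled element is described by a dictionary mapping a selector type name to the selector string that locates it; the names `Element`, `common_selector_types`, and the sample selectors are invented for this example.

```python
from typing import Dict, List, Set

# Hypothetical representation: each sampled element lists the selectors that
# can locate it, keyed by selector type ("id", "class", "element", ...).
Element = Dict[str, str]


def common_selector_types(first: List[Element], second: List[Element]) -> Set[str]:
    """Return the selector types available for every element in both datasets."""
    elements = first + second
    if not elements:
        return set()
    common = set(elements[0])
    for element in elements[1:]:
        common &= set(element)  # keep only the types this element also offers
    return common


# First dataset: samples taken down one column (assumed sample values).
first_dataset = [
    {"id": "#cell-1-1", "class": ".cell", "element": "td"},
    {"id": "#cell-2-1", "class": ".cell", "element": "td"},
]
# Second dataset: samples taken along one row (assumed sample values).
second_dataset = [
    {"id": "#cell-1-1", "element": "td"},
    {"id": "#cell-1-2", "element": "td"},
]

print(sorted(common_selector_types(first_dataset, second_dataset)))
```

Only selector types shared by every sample survive the intersection, so later analysis can compare like with like across both datasets.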

[0135] More specifically, as shown in Figure 6, Figure 6 is a schematic diagram illustrating the analysis of data columns and data rows in the web page data extraction method of the present invention.

[0136] First, the analysis focuses on the changing patterns of columns or rows, resulting in the acquisition of a row or column of sample data. In this embodiment, both are analyzed, thus obtaining a first dataset and a second dataset.

[0137] Then, obtain the set of common selector types from this data;

[0138] Then, if there is at least one selector type, the first dataset and the second dataset in the selector type are extracted;

[0139] Then, analyze these data. If the values in a column are identical, record them as a value type. If the values differ, analyze them by their prefix and by the step size between successive values. When the prefix is shared and the step size is constant, record them as a column index type or a row index type.

[0140] Finally, after one column or row has been analyzed, another may remain. Increment the row or column counter by 1 and compare it with the number of selector types. Once the counter is greater than the number of selector types, the analysis is considered complete.
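
The value-type versus index-type decision described above can be sketched as follows. This is an illustrative assumption about how "prefix" and "step size" might be evaluated, not the method fixed by this embodiment: each value is assumed to end in a decimal number, and the function name `classify` and the sample selectors are hypothetical.

```python
import re
from typing import List, Optional


def classify(values: List[str]) -> Optional[str]:
    """Classify sampled selector values as 'value', 'index', or None (no pattern)."""
    if not values:
        return None
    if len(set(values)) == 1:
        return "value"  # identical data: recorded as a value type
    # Split each value into a textual prefix and a trailing number.
    parts = [re.fullmatch(r"(.*?)(\d+)", v) for v in values]
    if any(p is None for p in parts):
        return None  # some value has no trailing number to compare
    prefixes = {p.group(1) for p in parts}
    numbers = [int(p.group(2)) for p in parts]
    steps = {b - a for a, b in zip(numbers, numbers[1:])}
    # Same prefix and one constant, non-zero step: recorded as an index type.
    if len(prefixes) == 1 and len(steps) == 1 and 0 not in steps:
        return "index"
    return None


print(classify(["td", "td", "td"]))
print(classify(["#row-1", "#row-2", "#row-3"]))
print(classify(["#row-1", "#row-2", "#row-4"]))
```

Whether an index type is a row index or a column index depends on which dataset the values came from: values sampled down a column vary with the row, and values sampled along a row vary with the column.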

[0141] This embodiment, through the above-described scheme, specifically obtains a first dataset and a second dataset of the data columns and rows based on the data columns and rows of the sample data; analyzes the first dataset and the second dataset to obtain a first common selector type set of the first dataset and the second dataset; and analyzes the first dataset and / or the second dataset based on the first common selector type set to obtain the variation rules of the data columns and rows. Thus, it achieves the acquisition of the variation rules of data columns and rows, solving the problem of not having corresponding variation rules of data columns and rows when acquiring data variation rules, and improving the efficiency of web page data extraction.

[0142] Referring to Figure 7, Figure 7 is a flowchart illustrating the process of merging the changing patterns of data columns and data rows in the web page data extraction method of this invention.

[0143] Based on the embodiment shown in Figure 3 above, step S013, which involves merging the change patterns of the data columns and data rows to obtain the data change rules, includes:

[0144] Step S0131: Obtain the second common selector type set based on the changing patterns of the data columns and data rows;

[0145] Step S0132: Merge the change patterns of the data columns and data rows to obtain a set of change patterns;

[0146] Step S0133: Obtain the corresponding pattern data through the second common selector type set;

[0147] Step S0134: Based on the pattern data, classify the data using the set of change patterns to obtain classification results;

[0148] Step S0135: Obtain data change rules based on the classification results.

[0149] Specifically, in order to obtain the data change rules, it is necessary to merge the change patterns of data columns and data rows:

[0150] First, by analyzing the changing patterns of data columns and data rows, a set of corresponding common selector types is obtained. In this embodiment, the common selector type is used to determine data columns and data rows that exhibit changing patterns.

[0151] Then, the patterns of change in the data columns and data rows are merged to obtain a set of patterns of change;

[0152] Then, obtain the pattern data corresponding to the change patterns of the data columns and data rows through the common selector type set;

[0153] Then, based on the pattern data, the data is classified using the set of change patterns to obtain the classification results. The classification results include, but are not limited to, data of value type, data of column index type, and data of row index type.

[0154] Finally, based on the classification results, the data change rules are obtained.

[0155] More specifically, as shown in Figure 8, Figure 8 is a schematic diagram illustrating the merging of the changing patterns of data columns and data rows in the web page data extraction method of the present invention.

[0156] First, obtain the set of common selector types in the column and row variation rules;

[0157] Then, if the selector type set is not empty, the data corresponding to any one selector type in the set is obtained from the column and row variation patterns as the pattern data. In this embodiment, the first selector type is selected.

[0158] Then, the obtained pattern data are merged to obtain a data matrix, which is represented as a pattern dataset in this embodiment;

[0159] Then, the data is classified by column or row. If the data is of value type, it is recorded as a value type; if the data is of column index type, it is recorded as a column index type; if the data is of row index type, it is recorded as a row index type.

[0160] Then, once a type has been recorded, or if no corresponding type exists, the data in the next column is classified, until all selector types in the common selector type set have been classified.

[0161] Finally, the rules governing data variation are derived from the classification results.
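
One way to picture the merged data change rule is as a selector template in which row-dependent and column-dependent parts are replaced by placeholders. The sketch below is a simplified assumption, not the procedure of this embodiment: it splits selectors into digit and non-digit runs, and the `{row}` and `{col}` placeholder syntax, the function names, and the sample selectors are all hypothetical.

```python
import re
from typing import List, Set


def tokenize(selector: str) -> List[str]:
    """Split a selector into alternating non-digit and digit runs."""
    return re.findall(r"\d+|\D+", selector)


def varying_positions(selectors: List[str]) -> Set[int]:
    """Return the token positions whose text differs across the samples."""
    rows = [tokenize(s) for s in selectors]
    if len({len(r) for r in rows}) != 1:
        raise ValueError("samples do not share one structure")
    return {k for k in range(len(rows[0])) if len({r[k] for r in rows}) > 1}


def merge_rules(column_samples: List[str], row_samples: List[str]) -> str:
    """Merge column and row patterns into one template (hypothetical format)."""
    row_positions = varying_positions(column_samples)  # down a column, the row changes
    col_positions = varying_positions(row_samples)     # along a row, the column changes
    merged = []
    for k, token in enumerate(tokenize(column_samples[0])):
        if k in row_positions:
            merged.append("{row}")   # row index type
        elif k in col_positions:
            merged.append("{col}")   # column index type
        else:
            merged.append(token)     # value type: copied unchanged
    return "".join(merged)


column_samples = [
    "#grid > div:nth-child(1) > span:nth-child(2)",
    "#grid > div:nth-child(2) > span:nth-child(2)",
    "#grid > div:nth-child(3) > span:nth-child(2)",
]
row_samples = [
    "#grid > div:nth-child(1) > span:nth-child(1)",
    "#grid > div:nth-child(1) > span:nth-child(2)",
]
print(merge_rules(column_samples, row_samples))
```

The three branches in the loop correspond to the three classification results named in the text: row index type, column index type, and value type.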

[0162] This embodiment, through the above scheme, specifically obtains a second common selector type set based on the changing patterns of the data columns and data rows; merges the changing patterns of the data columns and data rows to obtain a changing pattern set; obtains the corresponding pattern data through the second common selector type set; classifies the pattern data through the changing pattern set to obtain a classification result; and obtains data change rules based on the classification result.

[0163] Referring to Figure 9, Figure 9 is a flowchart illustrating another exemplary embodiment of the webpage data extraction method of the present invention.

[0164] Based on the embodiment shown in Figure 2 above, step S02, which involves extracting data from a preset webpage using the data change rules to obtain webpage data, includes:

[0165] Step S021: Obtain the Document object file of the webpage;

[0166] Step S022: Obtain the data column headers of the webpage based on the Document object file;

[0167] Step S023: Extract webpage data based on the data column headers of the webpage and the data change rules.

[0168] Specifically, data extraction from web pages using data change rules is achieved through the following steps:

[0169] First, obtain the Document object file of the webpage, denoted [doc] in this embodiment. The Document object is a built-in object provided by the browser that represents the entire webpage document. Through JavaScript, it can be used to access and manipulate the elements and attributes of the webpage. It also provides other properties and methods for manipulating the webpage, such as modifying element styles and adding or removing DOM nodes.

[0170] Then, based on the obtained Document object file, retrieve the data column headers of the webpage;

[0171] Finally, using the data column headers of the webpage data, the data is extracted according to the data change rules to obtain the webpage data.

[0172] More specifically, as shown in Figure 10, Figure 10 is a schematic diagram illustrating the acquisition of webpage data in the webpage data extraction method of the present invention.

[0173] First, obtain the Document object file of the current webpage, in the format [doc].

[0174] Then, the data column headers of the webpage are obtained through the Document object file;

[0175] Then, based on the data column headers of the webpage, the data is extracted according to the data change rules to obtain the webpage data.

[0176] This embodiment, through the above-described scheme, specifically obtains the Document object file of the webpage; obtains the data column headers of the webpage based on the Document object file; and extracts the webpage data based on the data column headers according to the data change rules. Thus, it achieves the acquisition of webpage data, solves the problem of being unable to obtain webpage data rendered by programming languages, and improves the efficiency of webpage data extraction.

[0177] Referring to Figure 11, Figure 11 is a schematic diagram illustrating the process of extracting web page data based on data change rules in the web page data extraction method of the present invention.

[0178] Based on the embodiment shown in Figure 9 above, step S023, which involves extracting webpage data based on the data column headers of the webpage using the data change rules, includes:

[0179] Step S0231: Obtain the webpage data columns and webpage data rows based on the data column header of the webpage;

[0180] Step S0232: Based on the webpage data columns and webpage data rows, extract the data according to the data change rules to obtain the webpage data.

[0181] Specifically, to obtain webpage data, the following steps are performed:

[0182] First, based on the data column headers of the webpage, obtain the data columns and data rows corresponding to the webpage data;

[0183] Then, based on the data rows and columns corresponding to the webpage data, the data is extracted using the data change rules to obtain the webpage data. Here, webpages include, but are not limited to, company official websites and news article websites.

[0184] More specifically, as shown in Figure 12, Figure 12 is a schematic diagram illustrating the extraction of webpage data in the webpage data extraction method of the present invention.

[0185] First, when the webpage data columns and webpage data rows are obtained, the maximum values of the webpage columns and webpage rows are obtained. The maximum value of the column is represented by maxColumnIndex, and the maximum value of the row is represented by maxRowIndex.

[0186] Then, determine whether the current row (represented by i) and the current column (represented by j) have reached their maximum values. If they have not, substitute i for the row index in the data change rule to obtain rule 1.

[0187] Then, substitute j for the column index in rule 1 to obtain rule 2;

[0188] Then, the corresponding element is extracted from the Document object file according to rule 2;

[0189] Then, if the corresponding element cannot be extracted, the values of i and j are compared with their maximum values, and extraction moves on to the next row or column until the value of i is greater than the maximum value of the row.

[0190] Then, if the corresponding element is found, the text attribute value of the element is obtained;

[0191] Then, store the text attribute value in the corresponding i-th row and j-th column of the table result data;

[0192] Finally, repeat the extraction process until the value of i is greater than the maximum value of the row.
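
The row and column loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the Document object file is stood in for by a plain dictionary from selector string to text, indices are assumed to start at 1, and the `{row}` and `{col}` placeholders in the rule are a hypothetical notation. Only maxRowIndex and maxColumnIndex come from the description.

```python
from typing import Dict, List, Optional


def extract_table(document: Dict[str, str], rule: str,
                  max_row_index: int, max_column_index: int) -> List[List[Optional[str]]]:
    """Fill a result table by substituting row and column indices into the rule."""
    table: List[List[Optional[str]]] = []
    for i in range(1, max_row_index + 1):          # current row i
        row: List[Optional[str]] = []
        for j in range(1, max_column_index + 1):   # current column j
            rule1 = rule.replace("{row}", str(i))  # rule 1: row index filled in
            rule2 = rule1.replace("{col}", str(j))  # rule 2: column index filled in
            # Look the element up; None marks a cell with no matching element.
            row.append(document.get(rule2))
        table.append(row)
    return table


# Stand-in for the Document object file (assumed sample content).
doc = {
    "#r1c1": "Alice", "#r1c2": "30",
    "#r2c1": "Bob",   # "#r2c2" is deliberately missing
}
result = extract_table(doc, "#r{row}c{col}", max_row_index=2, max_column_index=2)
print(result)
```

In a browser-based implementation the dictionary lookup would presumably be replaced by a selector query against the real Document object, with the element's text attribute stored in the table.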

[0193] This embodiment, through the above-described scheme, specifically obtains the webpage data columns and rows based on the webpage's data column header; and extracts the webpage data based on the data change rules using the webpage data columns and rows. This achieves the acquisition of webpage data, solves the problem of being unable to obtain webpage data rendered by programming languages, and improves the efficiency of webpage data extraction.

[0194] Referring to Figure 13, Figure 13 is a flowchart illustrating another exemplary embodiment of the webpage data extraction method of the present invention.

[0195] Based on the embodiment shown in Figure 9 above, after step S023, which involves extracting webpage data based on the data column headers of the webpage using the data change rules, the method further includes:

[0196] Step S024: In response to the user's operation, analyze the Document object file and page number of the webpage to obtain the analysis results;

[0197] Step S025: If the analysis result indicates that there is a next webpage, then extract the webpage data of the next webpage.

[0198] Step S026: If the analysis result is that there is no next webpage, then the data extraction ends.

[0199] Specifically, after obtaining the webpage data for the current webpage, it may also be necessary to extract data from the next webpage:

[0200] First, under normal circumstances, a webpage may have more than one page, and users can pre-select to extract data from the next page;

[0201] Then, the user makes a selection. If the web page data of the next web page is to be obtained, the web page's Document object file and page number are analyzed to obtain the analysis results. The analysis includes, but is not limited to, comparing the current page number with the maximum page number, and checking whether the Document object file of the next web page is a new file.

[0202] Finally, if both conditions are met, the data of the next webpage is extracted.

[0203] More specifically, as shown in Figure 14, Figure 14 is a schematic diagram illustrating the acquisition of data from the next webpage in the webpage data extraction method of the present invention.

[0204] First, after the data of the current webpage has been obtained, the analysis proceeds if the user has chosen to obtain the data of the next page automatically and has set the maximum number of pages expected. A maximum of 0 means that paging continues automatically until the last page.

[0205] Then, the analysis is performed using the current page number and the maximum page number. If the current page number is less than the maximum page number minus one, the page is turned. If the current page number is greater than or equal to the maximum page number minus one, the maximum page number is checked to see if it is 0. If the maximum page number is 0, the page is turned automatically. If the maximum page number is not 0, the process ends.

[0206] Then, the Document object of the page after pagination is retrieved to obtain a new Document object file;

[0207] Then, the new Document object file is compared with the Document object file of the previous page;

[0208] Finally, if the two files are identical, the data extraction ends; if the two files are different, the webpage data is extracted.
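
The page-turning decision and the stop condition can be sketched as below. This is an illustration only: page numbers are assumed to start at 0, the Document object file is stood in for by a string, and `should_turn_page`, `crawl`, and `fetch_page` are hypothetical names, with `fetch_page` representing whatever mechanism turns the page and returns the new Document.

```python
from typing import Callable, List


def should_turn_page(current_page: int, max_pages: int) -> bool:
    """Decide whether to turn the page, following the comparison in the text."""
    if current_page < max_pages - 1:
        return True
    # At or past the limit: keep paging only when no limit was set (0).
    return max_pages == 0


def crawl(fetch_page: Callable[[int], str], max_pages: int) -> List[str]:
    """Collect page documents until the limit is hit or the Document stops changing."""
    documents = [fetch_page(0)]
    current = 0
    while should_turn_page(current, max_pages):
        new_document = fetch_page(current + 1)
        if new_document == documents[-1]:
            break  # identical Document: paging had no effect, so extraction ends
        documents.append(new_document)
        current += 1
    return documents


# Hypothetical site with three pages; turning past the end returns the last page again.
pages = ["page-0", "page-1", "page-2"]


def fetch(n: int) -> str:
    return pages[min(n, len(pages) - 1)]


print(crawl(fetch, 0))  # no limit set: stops when the Document repeats
print(crawl(fetch, 2))  # limit of two pages
```

Comparing the new Document with the previous one is what lets a maximum of 0 terminate: on the last page the page-turn action changes nothing, so the two Documents are identical.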

[0209] This embodiment, through the above-described scheme, specifically analyzes the Document object file and page number of the webpage in response to user operations, and obtains the analysis results. If the analysis result indicates that there is a next webpage, the webpage data of the next webpage is extracted; if the analysis result indicates that there is no next webpage, the data extraction ends. Thus, the extraction of next webpage data is achieved, solving the problem of not being able to automatically obtain next webpage data when acquiring webpage data, and improving the efficiency of webpage data extraction.

[0210] Furthermore, embodiments of the present invention also propose a webpage data extraction device, the webpage data extraction device comprising:

[0211] The acquisition module is used to perform rule analysis based on pre-acquired sample data to obtain data change rules;

[0212] The data extraction module is used to extract data from preset web pages according to the data change rules, and obtain web page data.

[0213] Furthermore, this embodiment of the invention also proposes a terminal device, which includes a memory, a processor, and a web page data extraction program stored in the memory and executable on the processor. When the web page data extraction program is executed by the processor, it implements the steps of the web page data extraction method described above.

[0214] Since the web page data extraction program, when executed by the processor, employs all the technical solutions of all the aforementioned embodiments, it possesses at least all the beneficial effects brought about by those technical solutions, which will not be elaborated upon here.

[0215] Furthermore, embodiments of the present invention also propose a computer-readable storage medium storing a web page data extraction program, which, when executed by a processor, implements the steps of the web page data extraction method described above.

[0216] Since the web page data extraction program, when executed by the processor, employs all the technical solutions of all the aforementioned embodiments, it possesses at least all the beneficial effects brought about by those technical solutions, which will not be elaborated upon here.

[0217] Compared to existing technologies, the webpage data extraction method, apparatus, terminal device, and storage medium proposed in this invention perform rule analysis based on pre-acquired sample data to obtain data change rules, and extract data from a preset webpage using these data change rules to obtain webpage data. This solves the problem of being unable to obtain webpage data rendered by programming languages, achieving webpage data extraction and improving its efficiency. The webpage data extraction method of this invention is designed to address the inability to obtain data that Web programming languages generate and render on webpages according to conditional rules. The effectiveness of this method is verified during webpage data extraction, and the efficiency of webpage data extraction using this method is significantly improved.

[0218] Compared with existing technologies, the solutions of the embodiments of the present invention have the following advantages:

[0219] 1. A method for analyzing the rules of HTML elements that are generated and rendered on web pages by web programming languages according to conditional rules;

[0220] 2. Methods for obtaining webpage data according to the analyzed rules;

[0221] 3. Methods for automatic page turning and automatic stopping of data retrieval when data acquisition ends.

[0222] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or system that includes that element.

[0223] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0224] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) as described above, and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, controlled terminal, or network device, etc.) to execute the methods of each embodiment of the present invention.

[0225] The above are merely preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structural or procedural transformations made based on the content of the present invention's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of the present invention.

Claims

1. A method for extracting webpage data, characterized in that the webpage data extraction method includes the following steps: performing rule analysis based on pre-acquired sample data to obtain data change rules; wherein the step of performing rule analysis based on pre-acquired sample data to obtain data change rules includes: analyzing the sample data to obtain the data columns and data rows of the sample data; analyzing the data columns and data rows of the sample data to obtain the changing patterns of the data columns and data rows; wherein the step of analyzing the data columns and data rows of the sample data to obtain the changing patterns of the data columns and data rows includes: obtaining, based on the data columns and data rows of the sample data, the first dataset and the second dataset of the data columns and data rows; analyzing the first dataset and the second dataset to obtain the first common selector type set of the first dataset and the second dataset; and analyzing, based on the first common selector type set, the data columns and data rows through the first dataset and/or the second dataset to obtain the changing patterns of the data columns and data rows; merging the changing patterns of the data columns and data rows to obtain the data change rules; wherein the step of merging the changing patterns of the data columns and data rows to obtain the data change rules includes: obtaining the second common selector type set based on the changing patterns of the data columns and data rows; merging the changing patterns of the data columns and data rows to obtain a set of change patterns; obtaining the corresponding pattern data through the second common selector type set; classifying, based on the pattern data, using the set of change patterns to obtain classification results; and obtaining the data change rules based on the classification results; and extracting data from a preset webpage using the data change rules to obtain webpage data.

2. The webpage data extraction method according to claim 1, characterized in that the step of extracting data from a preset webpage using the data change rules to obtain webpage data includes: obtaining the Document object file of the webpage; obtaining the data column headers of the webpage based on the Document object file; and extracting, based on the data column headers of the webpage, the data according to the data change rules to obtain the webpage data.

3. The webpage data extraction method according to claim 2, characterized in that the step of extracting webpage data based on the data column headers of the webpage and the data change rules includes: obtaining the webpage data columns and webpage data rows based on the data column header of the webpage; and extracting, based on the webpage data columns and webpage data rows, the webpage data using the data change rules.

4. The webpage data extraction method according to claim 2, characterized in that, after the step of extracting webpage data based on the data column headers of the webpage and the data change rules, the method further includes: in response to a user operation, analyzing the Document object file and page number of the webpage to obtain the analysis results; if the analysis result indicates that there is a next webpage, extracting the webpage data of the next webpage; and if the analysis result indicates that there is no next webpage, ending the data extraction.

5. A webpage data extraction device, characterized in that the webpage data extraction device includes: an acquisition module used to perform rule analysis based on pre-acquired sample data to obtain data change rules; wherein the acquisition module is further configured to: analyze the sample data to obtain the data columns and data rows of the sample data; analyze the data columns and data rows of the sample data to obtain the changing patterns of the data columns and data rows; obtain, based on the data columns and data rows of the sample data, the first dataset and the second dataset of the data columns and data rows; analyze the first dataset and the second dataset to obtain the first common selector type set of the first dataset and the second dataset; analyze, based on the first common selector type set, the data columns and data rows through the first dataset and/or the second dataset to obtain the changing patterns of the data columns and data rows; merge the changing patterns of the data columns and data rows to obtain the data change rules; obtain the second common selector type set based on the changing patterns of the data columns and data rows; merge the changing patterns of the data columns and data rows to obtain a set of change patterns; obtain the corresponding pattern data through the second common selector type set; classify, based on the pattern data, using the set of change patterns to obtain classification results; and obtain the data change rules based on the classification results; and a data extraction module used to extract data from a preset webpage according to the data change rules to obtain webpage data.

6. A terminal device, characterized in that the terminal device includes a memory, a processor, and a web page data extraction program stored in the memory and executable on the processor, wherein the web page data extraction program, when executed by the processor, implements the steps of the web page data extraction method as described in any one of claims 1-4.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a web page data extraction program, which, when executed by a processor, implements the steps of the web page data extraction method as described in any one of claims 1-4.