Page search analysis method, apparatus, device, and medium

By dividing a page into multiple regions, assigning a weight to each region, and combining the per-region results to calculate the matching degree, the disclosure addresses the low search accuracy of existing technologies, yielding better-matched search results and an improved user experience.

CN114428894B · Active · Publication Date: 2026-06-26 · BEIJING BAIDU NETCOM SCI & TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING BAIDU NETCOM SCI & TECH CO LTD
Filing Date
2022-01-28
Publication Date
2026-06-26

Abstract

The present disclosure provides a page search method, device, equipment and medium, relates to the field of computers, in particular to computer network technology, search engine technology and software application technology. The method comprises the following steps: determining a candidate page based on a query request; determining at least one candidate page area in the candidate page; determining the weight of each of the at least one candidate page area based on a preset rule for the candidate page; calculating the matching degree between the query request and each of the at least one candidate page area; and determining the matching degree between the query request and the candidate page based on at least the matching degree between the query request and each of the at least one candidate page area and the weight of each of the at least one candidate page area.

Description

Technical Field

[0001] This disclosure relates to the field of computers, specifically to computer network technology, search engine technology, and software application technology, and particularly to a page search method, apparatus, electronic device, computer-readable storage medium, and computer program product.

Background Technology

[0002] Search engines crawl a large number of web pages, filter these pages, and then include the filtered pages in their index. After a user sends a query to the search engine, the search engine selects relevant pages based on the query, ranks these pages using various methods, and then displays all or part of the relevant pages to the user based on the ranking results.

[0003] The methods described in this section are not necessarily methods that had been previously conceived or adopted. Unless otherwise specified, no method described in this section should be assumed to be prior art simply because it is included in this section. Similarly, unless otherwise specified, the issues mentioned in this section should not be considered to be accepted in any prior art.

Summary of the Invention

[0004] This disclosure provides a page search method, apparatus, electronic device, computer-readable storage medium, and computer program product.

[0005] According to one aspect of this disclosure, a page search method is provided. The page search method includes: determining candidate pages based on a query request; determining at least one candidate page region among the candidate pages; determining the weight of each of the at least one candidate page region based on preset rules for the candidate pages; calculating the matching degree between the query request and each of the at least one candidate page region; and determining the matching degree between the query request and the candidate pages based at least on the matching degree between the query request and each of the at least one candidate page region and the weight of each of the at least one candidate page region.

[0006] According to another aspect of this disclosure, a page search apparatus is provided. The page search apparatus includes: a first determining unit configured to determine candidate pages based on a query request; a second determining unit configured to determine at least one candidate page region among the candidate pages; a third determining unit configured to determine the weights of each of the at least one candidate page region based on preset rules for the candidate pages; a calculating unit configured to calculate the matching degree between the query request and each of the at least one candidate page region; and a fourth determining unit configured to determine the matching degree between the query request and the candidate pages based at least on the matching degree between the query request and each of the at least one candidate page region and the weights of each of the at least one candidate page region.

[0007] According to another aspect of this disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the page search method described above.

[0008] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause a computer to perform the above-described page search method.

[0009] According to another aspect of this disclosure, a computer program product is provided, including a computer program, wherein the computer program implements the above-described page search method when executed by a processor.

[0010] According to one or more embodiments of this disclosure, candidate pages determined based on a query request are divided into multiple page regions, a weight is assigned to each page region, and the matching degrees between the individual page regions and the query request are then merged according to those weights to determine the matching degree between the query request and the candidate pages. This improves the distinguishability of different regions in the page, makes use of information on the importance of different regions in the page, and improves the matching degree between search results and queries.

[0011] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Description of the Drawings

[0012] The accompanying drawings exemplify embodiments and form part of the specification, serving together with the textual description to explain exemplary implementations of the embodiments. The illustrated embodiments are for illustrative purposes only and do not limit the scope of the claims. Throughout the drawings, the same reference numerals refer to similar but not necessarily identical elements.

[0013] Figure 1 shows a flowchart of a page search method according to an exemplary embodiment of the present disclosure;

[0014] Figure 2 shows a flowchart of a page search method according to an exemplary embodiment of the present disclosure;

[0015] Figure 3 shows a structural block diagram of a page search apparatus according to an exemplary embodiment of the present disclosure; and

[0016] Figure 4 shows a structural block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

[0017] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0018] In this disclosure, unless otherwise stated, the use of terms such as "first," "second," etc., to describe various elements is not intended to limit the positional, temporal, or importance relationships of these elements; such terms are merely used to distinguish one element from another. In some examples, the first element and the second element may refer to the same instance of that element, while in other cases, based on the context, they may refer to different instances.

[0019] The terminology used in the description of the various examples described in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context explicitly indicates otherwise, an element may be one or more unless the number of elements is specifically limited. Furthermore, the term "and / or" as used in this disclosure covers any one of the listed items and all possible combinations thereof.

[0020] In related technologies, existing page search methods are usually based on the matching degree between the query request and the title, and the matching degree between the query request and the overall page data. Such search methods have a single dimension and low accuracy.

[0021] To address the aforementioned issues, this disclosure divides candidate pages determined based on query requests into multiple page regions, assigns a weight to each page region, and then merges the matching degrees between the individual page regions and the query request according to the weights to determine the matching degree between the query request and the candidate pages. This improves the differentiation of different regions on the page, makes use of information on the importance of different regions on the page, and enhances the matching degree between search results and queries.

[0022] According to one aspect of this disclosure, a page search method is provided. As shown in Figure 1, the page search method includes: step S101, determining candidate pages based on the query request; step S102, determining at least one candidate page region among the candidate pages; step S103, determining the weight of each of the at least one candidate page region based on preset rules for the candidate pages; step S104, calculating the matching degree between the query request and each of the at least one candidate page region; and step S105, determining the matching degree between the query request and the candidate pages based at least on the matching degree between the query request and each of the at least one candidate page region and the weight of each of the at least one candidate page region.

[0023] Therefore, by dividing the candidate pages corresponding to the query into multiple target regions and assigning a weight to each target region, and then merging the matching degree between the query and each target region based on the weight, the matching degree between the query and the candidate pages is determined. This improves the differentiation of different regions on the page, realizes the utilization of information on the importance of different regions on the page, and enhances the matching degree between the search results and the query.

[0024] A query request may be, for example, text entered by the user for the query. In step S101, candidate pages may be determined in the search engine's index based on the query request. Candidate pages may be pages obtained from initial screening in the index, and therefore may include one or more pages. Before presenting the search results to the user, the candidate pages may be further filtered and sorted, thereby enabling pages more relevant to the query request to be presented to the user earlier or in a more prominent position. It is understood that those skilled in the art can choose appropriate methods to determine candidate pages, and no limitations are imposed here.

[0025] After identifying the candidate pages relevant to the query request, at least one candidate page region can be identified from among the candidate pages.

[0026] According to some embodiments, step S102, determining at least one candidate page region in the candidate page, may include: determining at least one node among the multiple page nodes in the candidate page as at least one candidate page region based on the node information of each page node. Therefore, by using the node information of the page nodes, the page can be divided according to the page structure, resulting in a more reasonable division.

[0027] The page structure of the candidate page can be, for example, a DOM tree structure, from which the multiple page nodes in the candidate page and their corresponding node information can be determined.

[0028] According to some embodiments, page node information may include, for example, at least one of: node position, node size, node type, and relationship with other nodes. Node position may further include the node's horizontal and vertical coordinates; node size may further include the node's height and width; and node type may further include text, image, video, etc. Based on the node's coordinate information, alignment, and the density between nodes, the page can be divided into multiple regions.
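To illustrate how node position and size could drive such a division, the following is a minimal sketch only. The `PageNode` class, the `assign_region` function, and the 15% / 20% band sizes are hypothetical choices made for this example and are not specified by the disclosure, which also mentions alignment and node density as possible signals.

```python
from dataclasses import dataclass


@dataclass
class PageNode:
    """Hypothetical container for the node information listed above."""
    x: float         # horizontal coordinate of the node's top-left corner
    y: float         # vertical coordinate of the node's top-left corner
    width: float
    height: float
    node_type: str   # e.g. "text", "image", "video"


def assign_region(node: PageNode, page_width: float, page_height: float) -> str:
    """Assign a node to a coarse region based on its centre point.

    The band sizes (top/bottom 15%, left/right 20%) are illustrative
    assumptions, not values taken from the disclosure.
    """
    cx = node.x + node.width / 2
    cy = node.y + node.height / 2
    if cy < 0.15 * page_height:
        return "header"
    if cy > 0.85 * page_height:
        return "bottom"
    if cx < 0.20 * page_width:
        return "left"
    if cx > 0.80 * page_width:
        return "right"
    return "middle"


# Usage: a full-width banner at the top of a 1000 x 2000 page.
banner = PageNode(x=0, y=0, width=1000, height=50, node_type="text")
print(assign_region(banner, page_width=1000, page_height=2000))
```

A real implementation would likely group neighbouring nodes rather than classify each node in isolation; the sketch only shows how coordinates can separate peripheral areas from the middle area.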

[0029] In one embodiment, at least one candidate page region may include at least one of the following: a header region, a middle region, a bottom region, a left region, and a right region. This method of dividing the candidate page region can be, for example, the division method used for PC web pages. In another embodiment, for a Wise page, at least one candidate page region may include a header region, a middle region, and a bottom region. It is understood that those skilled in the art can divide the page in other ways, and this is not limited thereto.

[0030] In addition to the areas mentioned above, non-visible information on the page can also be filtered out and retained as information of lowest importance.

[0031] According to some embodiments, the central area can be further divided to obtain a main title area and a body text area. The body text area can also be further divided to obtain at least one body text paragraph area. In some embodiments, different body text paragraphs in the body text area may correspond to different page nodes, so the body text area can be divided according to the page node information. In other embodiments, the main content in the body text area may all be in the same page node, so the body text area can be split into paragraphs based on information such as line breaks, indentation, and white space to obtain at least one body text paragraph area.
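For the case where the whole body text sits in a single page node, paragraph splitting on line breaks, indentation, and white space might look like the sketch below. The function name `split_paragraphs` and the exact splitting rule (blank lines, or a line break followed by an indented line) are assumptions made for illustration.

```python
import re


def split_paragraphs(body_text: str) -> list[str]:
    """Split body text into paragraphs.

    A paragraph boundary is assumed to be either one or more blank lines,
    or a line break followed by an indented line.
    """
    parts = re.split(r"\n\s*\n|\n(?=[ \t]+\S)", body_text)
    # Drop empty fragments and surrounding white space.
    return [p.strip() for p in parts if p.strip()]


text = (
    "First paragraph, line one.\n"
    "still the first paragraph.\n"
    "\n"
    "Second paragraph.\n"
    "    Third paragraph, marked only by indentation."
)
for paragraph in split_paragraphs(text):
    print(repr(paragraph))
```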

[0032] According to some embodiments, the main text area may further include at least one subheading area. In some embodiments, subheadings may have separate page nodes, and the subheading area can be divided according to the page node information. In other embodiments, subheadings and main text paragraphs may be mixed together, and the subheadings do not have a special style, so they cannot be identified by format. In such embodiments, step S102, determining at least one candidate page area in the candidate pages, further includes: for each of the plurality of consecutive paragraphs included in the main text area, in response to determining that the similarity between the paragraph and at least one paragraph adjacent to the paragraph in the plurality of consecutive paragraphs meets a preset condition, the paragraph is determined as a subheading area.

[0033] Therefore, by calculating the similarity between a paragraph and its adjacent paragraphs in the context, it is possible to determine whether a paragraph is a title paragraph based on semantic information.

[0034] In some embodiments, text features of multiple paragraphs can be determined, and then the similarity between the text features can be calculated. It is understood that those skilled in the art can use other methods to calculate the similarity between two paragraphs, and no limitation is made herein.

[0035] In one exemplary embodiment, a paragraph that is not similar to the preceding paragraph but is similar to the following paragraph can be identified as a subheading.
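The rule in the exemplary embodiment above can be sketched as follows. The disclosure leaves the similarity measure open, so this example substitutes a simple token-overlap (Jaccard) similarity; the function names and the `low` / `high` thresholds are hypothetical, and treating the first paragraph as having zero similarity to a non-existent predecessor is likewise an assumption.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, used here as a stand-in for text features."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def find_subheadings(paragraphs: list[str], low: float = 0.1, high: float = 0.15) -> list[int]:
    """Return indices of paragraphs that are dissimilar to the preceding
    paragraph (similarity <= low) but similar to the following one
    (similarity >= high). Both thresholds are illustrative assumptions."""
    indices = []
    # The last paragraph has no following paragraph, so it is never a subheading.
    for i in range(len(paragraphs) - 1):
        sim_prev = jaccard(paragraphs[i - 1], paragraphs[i]) if i > 0 else 0.0
        sim_next = jaccard(paragraphs[i], paragraphs[i + 1])
        if sim_prev <= low and sim_next >= high:
            indices.append(i)
    return indices


paragraphs = [
    "installing the package",
    "run pip to install the package on your machine",
    "configuring logging",
    "logging is configured in the settings file by configuring handlers",
]
print(find_subheadings(paragraphs))
```

In practice, semantic embeddings would probably separate headings from body text more reliably than token overlap; the structure of the comparison stays the same.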

[0036] In step S103, after determining at least one candidate page region among the candidate pages, the weight of each of the at least one candidate page region can be determined based on the preset rules used for the candidate page.

[0037] According to some embodiments, the preset rules can specify that the following page areas have weights in descending order: the main title area, at least one subheading area arranged in reading order, at least one body paragraph area arranged in reading order, the header area, the left area, the right area, and the footer area. Typically, the main title summarizes the page's content and therefore has the highest weight. Each subheading summarizes the content of its respective paragraphs and therefore has the second highest weight; because the importance of content usually decreases along the page, the weights of the subheadings decrease along the reading order. Body paragraphs contain rich content information and are more relevant to the page's theme than the peripheral areas, and thus rank third. Similarly, the weights of the body paragraphs decrease along the reading order. The header, left, right, and footer areas also typically contain page-related content, and their weights likewise decrease sequentially along the reading order. In addition to the above areas, the lowest weight can be assigned to the least important, non-visible information mentioned above to further enrich the information used when matching search queries with candidate pages.

[0038] Therefore, this setting method enables the utilization of information on the importance of different areas on the page, thereby improving the differentiation of different areas on the page.

[0039] According to some embodiments, preset rules can indicate the weight ranking of at least one candidate page region. Step S103, determining the weight of each of the at least one candidate page region includes: determining the weight of each of the at least one candidate page region based on the weight ranking of the at least one candidate page region. Since the number of subheading regions and the number of body paragraph regions in different pages may be different, the preset rules can first determine the weight ranking of these regions, and then determine the weight of each region according to the ranking result.

[0040] In one embodiment, after determining that the candidate page includes six areas—a main title area, two subheading areas, two body paragraph areas, and a footer area—these areas can be sorted first, and then a weight can be assigned to each area based on the sorting result. In another embodiment, a weight can be directly assigned to each area.
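The sort-then-assign approach for the six-area example can be sketched as below. The category order follows the descending order described in the preset rules; the tuple representation of a region, the function names, and the geometric decay used to turn ranks into weights are assumptions for illustration, since the disclosure does not fix a mapping from rank to weight.

```python
# Category order from highest to lowest weight, following the preset rules above.
CATEGORY_ORDER = [
    "main_title", "subheading", "body_paragraph",
    "header", "left", "right", "footer", "non_visible",
]


def rank_regions(regions: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Sort (category, reading_index) pairs by category, then by reading order."""
    return sorted(regions, key=lambda r: (CATEGORY_ORDER.index(r[0]), r[1]))


def weights_from_ranking(ranked: list[tuple[str, int]], decay: float = 0.8) -> dict:
    """Map each ranked region to a weight. Geometric decay is an assumption;
    any strictly decreasing mapping would respect the ranking."""
    return {region: decay ** position for position, region in enumerate(ranked)}


# The six areas from the example: one main title, two subheadings,
# two body paragraphs, and a footer (given here in arbitrary order).
regions = [
    ("footer", 0), ("body_paragraph", 1), ("subheading", 0),
    ("main_title", 0), ("body_paragraph", 0), ("subheading", 1),
]
ranked = rank_regions(regions)
for region, weight in weights_from_ranking(ranked).items():
    print(region, round(weight, 3))
```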

[0041] After obtaining the weight of each region, the matching degree between each region and the query request can be calculated. It is understood that those skilled in the art can use various methods to calculate the matching degree, which will not be elaborated here.

[0042] According to some embodiments, after obtaining the matching degree between each region and the query request, the matching degree between the query request and the candidate pages can be calculated based on the following formula:

[0043] data_value=sigmoid(∑w*match_info)

[0044] Where data_value is the matching degree between the query request and the candidate page, the sigmoid function is used to make the result between 0 and 1, match_info is the matching degree between each region and the query request, and w is the weight corresponding to that region.
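As a worked illustration of the formula, the sketch below computes the weighted sum over regions and passes it through the sigmoid function. The function names and the sample weights and matching degrees are made up for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def page_match(weights: list[float], match_info: list[float]) -> float:
    """data_value = sigmoid(sum of w * match_info over all regions)."""
    if len(weights) != len(match_info):
        raise ValueError("each region needs exactly one weight and one matching degree")
    return sigmoid(sum(w * m for w, m in zip(weights, match_info)))


# Illustrative values: main title, one subheading, one body paragraph.
weights = [1.0, 0.8, 0.5]
match_info = [0.9, 0.4, 0.7]
print(page_match(weights, match_info))
```

Note that when all weights and matching degrees are non-negative, the weighted sum is non-negative and the result is never below sigmoid(0) = 0.5, so any cut-off value applied to data_value only discriminates if it is above 0.5.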

[0045] According to some embodiments, candidate pages with a match degree greater than a specific value can be returned to the user. In some embodiments, the match degree between each region and the query request can be calculated in parallel, and their weighted sum can be calculated. In other embodiments, if the system performance is poor, the match degree between each region and the query request can be calculated serially from highest to lowest weight, summed successively, and processed using the sigmoid function, and then compared successively with the aforementioned specific value. During this process, if the current calculation result is greater than the specific value, the candidate page can be returned to the user, and further calculation can stop; otherwise, the match degree between regions and the query request with lower weights can continue to be calculated.
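The serial variant with early stopping might be sketched as follows. The function name `serial_match`, the `(weight, region)` pair representation, and the `match_fn` callback are hypothetical; the sketch also assumes non-negative weights and matching degrees, so that the running sum never decreases and stopping early cannot change the outcome.

```python
import math
from typing import Callable, Sequence, Tuple


def serial_match(
    weighted_regions: Sequence[Tuple[float, str]],
    match_fn: Callable[[str], float],
    threshold: float,
) -> Tuple[bool, float, int]:
    """Evaluate regions from highest to lowest weight and stop as soon as the
    sigmoid of the running weighted sum exceeds the threshold.

    Returns (page_accepted, last_score, number_of_regions_evaluated).
    """
    total = 0.0
    score = 0.5  # sigmoid(0), the score before any region is evaluated
    evaluated = 0
    for weight, region in sorted(weighted_regions, key=lambda p: p[0], reverse=True):
        total += weight * match_fn(region)
        evaluated += 1
        score = 1.0 / (1.0 + math.exp(-total))
        if score > threshold:
            return True, score, evaluated  # remaining regions are skipped
    return False, score, evaluated


# Illustrative matching degrees per region.
degrees = {"title": 1.0, "body": 1.0, "footer": 1.0}
regions = [(0.5, "body"), (1.0, "title"), (0.2, "footer")]
print(serial_match(regions, degrees.__getitem__, threshold=0.7))
```

With these sample values the title alone already pushes the score above 0.7, so the body and footer matching degrees are never computed, which is the saving the serial mode is meant to provide.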

[0046] According to some embodiments, preset rules can be adjusted based on user behavior data. As shown in Figure 2, the page search method may further include: step S206, obtaining user behavior data of at least one user on historical pages; step S207, determining at least one historical page region in the historical pages; and step S208, in response to the determination that the user behavior data indicates that the interaction between at least one user and the first historical page region in the at least one historical page region meets the preset conditions, updating the preset rules to adjust the weight of the first historical page region.

[0047] Therefore, by adjusting the preset rules based on user behavior data, paragraphs that users pay more attention to, or from which users can obtain more information related to the search query, can be given higher weight, thereby raising the matching degree of candidate pages that are more relevant to the search query and optimizing the user experience of the search process.

[0048] According to some embodiments, user behavior data may include, for example, user click behavior data (e.g., opening a page, closing a page) and browsing behavior data (e.g., dwell time, browsing time).

[0049] According to some embodiments, the preset conditions may include one of the following: the average dwell time of the at least one user in the first historical page area is greater than a first threshold; and the number of users, among the at least one user, whose dwell time in the first historical page area is greater than the first threshold is greater than a second threshold.
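The two dwell-time conditions can be expressed as a simple check, sketched below. The function name, the list-of-seconds input format, and the decision to accept a region when either condition holds are assumptions; the disclosure only states that the preset conditions may include one of the two.

```python
def dwell_condition_met(
    dwell_times: list[float],
    first_threshold: float,
    second_threshold: int,
) -> bool:
    """Check the dwell-time conditions for one historical page area.

    dwell_times holds one dwell time (e.g. in seconds) per user.
    Condition A: the average dwell time exceeds first_threshold.
    Condition B: more than second_threshold users individually exceed
    first_threshold.
    """
    if not dwell_times:
        return False
    average = sum(dwell_times) / len(dwell_times)
    long_dwellers = sum(1 for t in dwell_times if t > first_threshold)
    return average > first_threshold or long_dwellers > second_threshold


# Average is exactly 30 s (condition A fails), but two users exceed 30 s.
print(dwell_condition_met([5, 40, 45], first_threshold=30, second_threshold=1))
```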

[0050] Step S207, determining at least one historical page region in the historical page, is similar to the aforementioned operation of determining at least one candidate page region in the candidate page, and will not be described in detail here.

[0051] In some embodiments, step S208, updating the preset rules to adjust the weight of the first historical page area, may include, for example, increasing the weight of the first historical page area.

[0052] Therefore, by increasing the weight of page areas where users spend more time, the information from those areas can be taken into account more when calculating the match between the page and the search query, thereby improving the quality of search results.

[0053] It is understood that those skilled in the art can set a first threshold corresponding to the user's dwell time and a second threshold corresponding to the number of users who spend a longer time in that segment, as needed, without any limitations here.

[0054] According to some embodiments, at least one historical page area may include at least one subheading area and at least one body paragraph area, and the first historical page area may be the body paragraph area. The adjusted weight of the first historical page area may be less than the weight of each of the at least one subheading area.

[0055] Because the main heading is more important than other subheadings and body paragraphs, when increasing the weight of subheading and body paragraph areas, the adjusted weight can still be less than the weight of the main heading. Similarly, subheadings are generally more important than body paragraphs, so the adjusted weight of body paragraphs can be less than the weight of subheading areas.

[0056] In some scenarios, analyzing user behavior data can reveal that users close the page after browsing a specific paragraph, or even end their current search query. In such cases, it can be determined that the information contained in that specific paragraph matches the user's needs.

[0057] According to some embodiments, the preset condition may further include: the number of users, among the at least one user, who ended their query request after browsing the first historical page area is greater than a third threshold. In other words, if many users end their query request after browsing the first historical page area, the weight of that area can be raised above that of some subheadings.

[0058] According to some embodiments, at least one historical page area may include at least one subheading area and at least one body paragraph area, and the first historical page area may be the body paragraph area. The adjusted weight of the first historical page area may be greater than the weight of at least a portion of the subheading areas within the at least one subheading area.

[0059] Therefore, by doing so, the weight of paragraphs that match the user's needs can be greatly increased, allowing the user to find the information they need in the search results earlier and improving the user experience.

[0060] It is understood that those skilled in the art can set the third threshold as needed. In some embodiments, when the above weight adjustment is more stringent, a higher third threshold, such as 90%, can be set; while when the weight adjustment is more lenient, a lower third threshold, such as 30%, can be set.
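The two adjustment regimes for a body paragraph area (staying below every subheading by default, but allowed to overtake some subheadings when enough users end their query there) could be combined as in the sketch below. The function name, the additive `boost`, the 0.99 safety margin, and expressing the third threshold as a proportion of users (in line with the 90% and 30% examples) are all assumptions, and positive weights are assumed.

```python
def adjust_body_weight(
    current: float,
    boost: float,
    subheading_weights: list[float],
    ended_query_ratio: float,
    third_threshold: float,
) -> float:
    """Return the adjusted weight of a body paragraph area.

    ended_query_ratio is the share of users who ended their query after
    browsing the area. Above third_threshold, the boosted weight may exceed
    some subheading weights; otherwise it is capped just below the smallest
    subheading weight.
    """
    boosted = current + boost
    if ended_query_ratio > third_threshold:
        return boosted
    cap = min(subheading_weights) * 0.99  # illustrative margin below subheadings
    return min(boosted, cap)


subheadings = [0.8, 0.6]
# Stringent setting (90%): only 20% of users ended their query, so the cap applies.
print(adjust_body_weight(0.4, 0.3, subheadings, ended_query_ratio=0.2, third_threshold=0.9))
# 95% of users ended their query, so the paragraph may overtake a subheading.
print(adjust_body_weight(0.4, 0.3, subheadings, ended_query_ratio=0.95, third_threshold=0.9))
```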

[0061] In addition to the adjustment methods mentioned above, the weight can also be adjusted based on user click behavior. In one embodiment, for example, if it is determined that multiple users clicked on links in the bottom area of the page, the weight of the bottom area can be increased accordingly.

[0062] According to some embodiments, pages within the index can be categorized, for example, by vertical category, by URL, or by other methods. Accordingly, preset rules can be set for each category, and the preset rules for that category can be adjusted based on analysis of user behavior data of historical pages for that category.

[0063] According to another aspect of this disclosure, a page search device is also provided. As shown in Figure 3, the page search device 300 includes: a first determining unit 310 configured to determine candidate pages based on a query request; a second determining unit 320 configured to determine at least one candidate page region among the candidate pages; a third determining unit 330 configured to determine the weight of each of the at least one candidate page region based on preset rules for the candidate pages; a calculation unit 340 configured to calculate the matching degree between the query request and each of the at least one candidate page region; and a fourth determining unit 350 configured to determine the matching degree between the query request and the candidate pages based at least on the matching degree between the query request and each of the at least one candidate page region and the weight of each of the at least one candidate page region.

[0064] The operation of units 310-350 of the page search device 300 is similar to the operation of steps S101-S105 of the page search method described above, and will not be repeated here.

[0065] According to embodiments of this disclosure, an electronic device, a readable storage medium, and a computer program product are also provided.

[0066] Referring to Figure 4, a structural block diagram of an electronic device 400 that can serve as a server or client of the present disclosure will now be described; it is an example of a hardware device that can be applied to various aspects of the present disclosure. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of the present disclosure described and / or claimed herein.

[0067] As shown in Figure 4, device 400 includes a computing unit 401, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 402 or a computer program loaded from storage unit 408 into random access memory (RAM) 403. RAM 403 may also store various programs and data required for the operation of device 400. The computing unit 401, ROM 402, and RAM 403 are interconnected via bus 404. Input / output (I / O) interface 405 is also connected to bus 404.

[0068] Multiple components in device 400 are connected to I / O interface 405, including: input unit 406, output unit 407, storage unit 408, and communication unit 409. Input unit 406 can be any type of device capable of inputting information to device 400. Input unit 406 can receive input numerical or character information and generate key signal inputs related to user settings and / or function control of the electronic device, and may include, but is not limited to, a mouse, keyboard, touchscreen, trackpad, trackball, joystick, microphone, and / or remote control. Output unit 407 can be any type of device capable of presenting information, and may include, but is not limited to, a monitor, speaker, video / audio output terminal, vibrator, and / or printer. Storage unit 408 may include, but is not limited to, a hard disk and an optical disk. Communication unit 409 allows device 400 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and / or chipsets, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and / or the like.

[0069] The computing unit 401 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the various methods and processes described above, such as the page search method. For example, in some embodiments, the page search method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 408. In some embodiments, part or all of the computer program may be loaded and / or installed on device 400 via ROM 402 and / or communication unit 409. When the computer program is loaded into RAM 403 and executed by the computing unit 401, one or more steps of the page search method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the page search method by any other suitable means (e.g., by means of firmware).

[0070] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0071] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0072] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0073] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0074] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0075] Computer systems can include clients and servers. Clients and servers are generally remote from each other and typically interact via a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. A server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product in the cloud computing service ecosystem that addresses the shortcomings of traditional physical hosts and virtual private server (VPS) services, such as management difficulty and weak business scalability. A server can also be a server of a distributed system or a server incorporating blockchain technology.

[0076] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of this disclosure can be achieved; no limitation is imposed herein.

[0077] While embodiments or examples of this disclosure have been described with reference to the accompanying drawings, it should be understood that the methods, systems, and devices described above are merely exemplary embodiments or examples, and the scope of the invention is not limited by these embodiments or examples, but only by the granted claims and their equivalents. Various elements in the embodiments or examples may be omitted or replaced by their equivalents. Furthermore, the steps may be performed in a different order than that described in this disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as the technology evolves, many elements described herein can be replaced by equivalents that appear after this disclosure.

Claims

1. A page search method, comprising: determining a candidate page based on a query request, wherein the candidate page has a DOM-tree page structure; determining at least one candidate page region in the candidate page, including: determining, based on the DOM-tree page structure, a plurality of page nodes in the candidate page and node information for each page node, the node information including the position of the node and its relationship with other nodes; and determining, based on the node information of each page node, at least one node among the plurality of page nodes as the at least one candidate page region; determining a weight of each of the at least one candidate page region based on a preset rule; calculating, serially and in descending order of the respective weights, the matching degree between the query request and each of the at least one candidate page region; after calculating the matching degree for each candidate page region, computing a weighted sum of the calculated matching degrees using their corresponding weights, and processing the weighted sum with a sigmoid function to obtain a current matching degree; and in response to determining that the current matching degree is greater than a specific value, determining the candidate page as a page that matches the query request, and stopping the calculation of the matching degree for the remaining candidate page regions.
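As an illustrative sketch only, and not part of the claim language, the serial, weight-ordered scoring with early stopping recited in claim 1 might look as follows in Python. The names `token_overlap` and `page_matches`, the token-overlap match function, and the threshold value are assumptions introduced for illustration; the disclosure does not prescribe them. Because the sigmoid of 0 is 0.5, the sketch assumes a threshold above 0.5.

```python
import math
from typing import Callable, List, Tuple


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def token_overlap(query: str, text: str) -> float:
    """Hypothetical match function: fraction of query tokens found in the text."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens)


def page_matches(
    query: str,
    regions: List[Tuple[str, float]],  # (region text, region weight)
    match_fn: Callable[[str, str], float],
    threshold: float,
) -> Tuple[bool, float, int]:
    """Score regions one at a time, highest weight first.

    Returns (matched, current matching degree, number of regions evaluated).
    """
    ordered = sorted(regions, key=lambda region: region[1], reverse=True)
    weighted_sum = 0.0
    current = 0.0
    for evaluated, (text, weight) in enumerate(ordered, start=1):
        # Accumulate the weighted sum of the matching degrees computed so far.
        weighted_sum += weight * match_fn(query, text)
        current = sigmoid(weighted_sum)
        if current > threshold:
            # Early stop: the remaining regions are never scored.
            return True, current, evaluated
    return False, current, len(ordered)
```

For example, with a threshold of 0.9, a title region of weight 3.0 whose text contains every query token yields sigmoid(3.0), roughly 0.953, so the loop stops after scoring a single region.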

2. The method of claim 1, wherein the at least one candidate page region includes at least one of the following: a header region, a middle region, a bottom region, a left region, and a right region.

3. The method of claim 2, wherein the at least one candidate page region includes the middle region, the middle region includes a main title region and a body text region, and the body text region includes at least one body paragraph region.

4. The method of claim 3, wherein the body text region further includes at least one subheading region, and determining the at least one candidate page region in the candidate page further includes: for each paragraph of a plurality of consecutive paragraphs included in the body text region, in response to determining that the similarity between the paragraph and at least one adjacent paragraph of the plurality of consecutive paragraphs meets a preset condition, identifying the paragraph as a subheading region.

5. The method of claim 4, wherein the preset rule indicates that the following page regions have weights in descending order: the main title region, the at least one subheading region arranged in reading order, the at least one body paragraph region arranged in reading order, the header region, the left region, the right region, and the bottom region.

6. The method of claim 1, further comprising: obtaining user behavior data of at least one user on a historical page; determining at least one historical page region in the historical page; and in response to determining that the user behavior data indicates that the interaction between the at least one user and a first historical page region of the at least one historical page region meets a preset condition, updating the preset rule to adjust the weight of the first historical page region.

7. The method of claim 6, wherein the preset condition includes one of the following: the average dwell time of the at least one user in the first historical page region is greater than a first threshold; and the number of users whose dwell time in the first historical page region is greater than the first threshold is greater than a second threshold; and wherein updating the preset rule to adjust the weight of the first historical page region includes: increasing the weight of the first historical page region.
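As an illustrative sketch only, and not part of the claim language, the dwell-time-based weight update of claims 6 and 7 might be expressed as follows in Python. The names `meets_dwell_condition` and `update_weights`, the multiplicative `boost` factor, and the choice to accept either of the two conditions are assumptions made for illustration; the disclosure states only that the weight is increased when a preset condition is met.

```python
from statistics import mean
from typing import Dict, List


def meets_dwell_condition(
    dwell_seconds: List[float],
    first_threshold: float,
    second_threshold: int,
) -> bool:
    """Check the two dwell-time conditions of claim 7 for one region.

    dwell_seconds holds one dwell time per user for that region.
    """
    if not dwell_seconds:
        return False
    # Condition A: average dwell time exceeds the first threshold.
    average_ok = mean(dwell_seconds) > first_threshold
    # Condition B: enough users individually exceed the first threshold.
    long_stays = sum(1 for d in dwell_seconds if d > first_threshold)
    return average_ok or long_stays > second_threshold


def update_weights(
    weights: Dict[str, float],
    dwell_by_region: Dict[str, List[float]],
    first_threshold: float,
    second_threshold: int,
    boost: float = 1.1,  # hypothetical increase factor
) -> Dict[str, float]:
    """Return a copy of the weights with qualifying regions increased."""
    updated = dict(weights)
    for region, dwell in dwell_by_region.items():
        if region in updated and meets_dwell_condition(
            dwell, first_threshold, second_threshold
        ):
            updated[region] *= boost
    return updated
```

A constraint such as the one in claim 8, where an adjusted body paragraph weight stays below the subheading weights, could be added by capping the boosted value; that cap is omitted here to keep the sketch short.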

8. The method of claim 7, wherein the at least one historical page region includes at least one subheading region and at least one body paragraph region, the first historical page region is a body paragraph region, and the adjusted weight of the first historical page region is less than the respective weights of the at least one subheading region.

9. The method of claim 6, wherein the preset condition includes: the number of users who ended the query request after browsing the first historical page region is greater than a third threshold.

10. The method of claim 9, wherein the at least one historical page region includes at least one subheading region and at least one body paragraph region, the first historical page region is a body paragraph region, and the adjusted weight of the first historical page region is greater than the weight of at least some of the at least one subheading region.

11. The method of claim 1, wherein the preset rule indicates a weight ranking of the at least one candidate page region, and determining the weight of each of the at least one candidate page region includes: determining the weight of each of the at least one candidate page region based on the weight ranking.
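As an illustrative sketch only, and not part of the claim language, weights could be derived from a weight ranking such as the one in claim 11 by mapping each rank to a decreasing value. The function name `weights_from_ranking` and the reciprocal-rank mapping are assumptions chosen for illustration; the disclosure does not specify how a ranking is converted into numeric weights.

```python
from typing import Dict, List


def weights_from_ranking(ranked_regions: List[str]) -> Dict[str, float]:
    """Assign weight 1/rank to each region, with rank 1 being most important.

    The reciprocal-rank scheme is a hypothetical choice; any strictly
    decreasing mapping would preserve the ranking.
    """
    return {
        name: 1.0 / rank
        for rank, name in enumerate(ranked_regions, start=1)
    }
```

Passing the regions in the order given by the preset rule, for example main title first and bottom region last, produces weights that decrease in that same order.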

12. A page search device, comprising: a first determining unit configured to determine a candidate page based on a query request, wherein the candidate page has a DOM-tree page structure; a second determining unit configured to determine at least one candidate page region in the candidate page, including: determining, based on the DOM-tree page structure, a plurality of page nodes in the candidate page and node information for each page node, the node information including the position of the node and its relationship with other nodes; and determining, based on the node information of each page node, at least one node among the plurality of page nodes as the at least one candidate page region; a third determining unit configured to determine a weight of each of the at least one candidate page region based on a preset rule for the candidate page; a calculation unit configured to calculate, serially and in descending order of the respective weights, the matching degree between the query request and each of the at least one candidate page region; a processing unit configured to, after the matching degree for each candidate page region is calculated, compute a weighted sum of the calculated matching degrees using their corresponding weights, and process the weighted sum with a sigmoid function to obtain a current matching degree; and a fourth determining unit configured to, in response to determining that the current matching degree is greater than a specific value, determine the candidate page as a page that matches the query request, and stop calculating the matching degree for the remaining candidate page regions.

13. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method of any one of claims 1-11.

14. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method of any one of claims 1-11.

15. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-11.