Report generation apparatus and method

The report generation device addresses the issue of duplicate articles and unreliable summaries in newspaper recommendations by removing duplicates and categorizing articles, resulting in relevant and comprehensive reports.

WO2026135163A1 · PCT designated stage · Publication Date: 2026-06-25 · POSCO HLDG INC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
POSCO HLDG INC
Filing Date
2025-12-16
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Conventional newspaper article recommendation systems suffer from excessive duplication of articles with similar content and unreliable summaries that fail to capture important parts, leading to inefficiencies and reduced user satisfaction.

Method used

A report generation device and method that utilizes an article data acquisition unit to collect articles, an article data refinement unit to remove duplicates using similarity analysis, and an issue report determination unit to classify and cluster articles by category, generating reports with summaries and details using AI models for enhanced relevance.

Benefits of technology

This approach effectively reduces duplicate articles, ensures relevance to user interests, and provides comprehensive reports by categorizing and clustering articles, improving the reliability and efficiency of the recommendation system.

✦ Generated by Eureka AI based on patent content.


Abstract

A report generation apparatus and method may: obtain a plurality of first news articles related to at least one category from a preset database; extract a plurality of second news articles from the plurality of first news articles by removing duplicate articles on the basis of a first similarity between the plurality of first news articles; classify the plurality of second news articles on the basis of the at least one category; and determine an issue report to be output by clustering the plurality of second news articles for each category.

Description

Report generation device and method

[0001] The present disclosure relates to a technology for generating a report based on recommended articles.

[0002] Recently, many services have been launched that recommend newspaper articles preferred by users and summarize them by extracting keywords. There is a wide variety of methods for recommending articles, extracting keywords, and summarizing article content using the extracted keywords.

[0003] However, conventional newspaper article recommendation systems have a problem in that the amount of recommended newspaper articles becomes unnecessarily excessive because they repeatedly expose newspaper articles containing the same or similar content without removing them.

[0004] In addition, since the summary is generated and provided based solely on extracted keywords, important parts of the newspaper article may be omitted, which raises concerns about the reliability of the recommendation system's performance.

[0005] The present disclosure aims to provide a technology that efficiently recommends newspaper articles and generates reports based on recommended newspaper articles, in order to solve the problems of repeated exposure of newspaper articles with identical or similar content and the recommendation of newspaper articles irrelevant to the user's interests by prioritizing the removal of duplicate articles among collected articles and classifying newspaper articles by category.

[0006] In one aspect, the present embodiments provide a report generating device for generating a report including recommended newspaper articles, comprising: an article data acquisition unit that acquires a plurality of first newspaper articles related to at least one category from a preset database; an article data refinement unit that extracts a plurality of second newspaper articles from a plurality of first newspaper articles with duplicate articles removed based on a first similarity between a plurality of first newspaper articles; and an issue report determination unit that classifies a plurality of second newspaper articles based on at least one category and performs clustering on a plurality of second newspaper articles for each category to determine an issue report to be output.

[0007] In another aspect, the present embodiments provide a method for generating a report including recommended newspaper articles, wherein a plurality of first newspaper articles related to at least one category are obtained from a preset database, a plurality of second newspaper articles are extracted by removing duplicate articles from the plurality of first newspaper articles based on a first similarity between the plurality of first newspaper articles, the plurality of second newspaper articles are classified based on the at least one category, and clustering is performed on the plurality of second newspaper articles for each category to determine an issue report to be output.

[0008] The present disclosure may provide a technology for generating a report based on recommended articles.

[0009] FIG. 1 is a drawing for explaining the configuration of a device for generating a report according to one embodiment.

[0010] FIG. 2 is a flowchart for schematically explaining the process of generating a report according to one embodiment.

[0011] FIGS. 3a and 3b are drawings for explaining a method of determining duplicate newspaper articles according to one embodiment.

[0012] FIG. 4 is a drawing for explaining a method of classifying newspaper articles according to one embodiment.

[0013] FIG. 5 is an example drawing for explaining an issue report output according to a newspaper article recommendation method according to one embodiment.

[0014] FIGS. 6a to 6d are other example drawings for explaining an issue report output according to a newspaper article recommendation method according to one embodiment.

[0015] FIG. 7 is a flowchart for explaining a method for generating a report according to one embodiment.

[0016] Hereinafter, some embodiments of the present disclosure will be described in detail with reference to the exemplary drawings. In assigning reference numerals to the components of each drawing, the same components may have the same reference numeral as much as possible, even if they are shown in different drawings. Furthermore, in describing the embodiments, if it is determined that a detailed description of related known components or functions may obscure the essence of the technical concept, such detailed description may be omitted. Where terms such as "comprising," "having," or "consisting of" are used in this specification, other parts may be added unless "only" is used. Where a component is expressed in the singular, it may include a plural unless otherwise specified.

[0017] Additionally, terms such as first, second, A, B, (a), (b), etc., may be used to describe the components of the present disclosure. These terms are used merely to distinguish the components from other components, and the nature, order, sequence, or number of the components are not limited by such terms.

[0018] In describing the positional relationship of components, where it is stated that two or more components are "connected," "combined," or "joined," it should be understood that while the two or more components may be directly "connected," "combined," or "joined," they may also be "connected," "combined," or "joined" with other components "intervened." Here, the other components may be included in one or more of the two or more components that are "connected," "combined," or "joined" with one another.

[0019] In describing the temporal flow relationship regarding components, methods of operation, or methods of production, for example, when the temporal or sequential relationship is described using "after," "following," "next," or "before," it may include cases where the relationship is not continuous unless "immediately" or "directly" is used.

[0020] Meanwhile, where numerical values or corresponding information regarding a component (e.g., levels, etc.) are mentioned, even without separate explicit notation, the numerical values or corresponding information may be interpreted as including a range of error that may occur due to various factors (e.g., process factors, internal or external shocks, noise, etc.).

[0021] The embodiments are described in detail below with reference to the drawings.

[0022] FIG. 1 is a drawing for explaining the configuration of a device for generating a report according to one embodiment.

[0023] Referring to FIG. 1, the report generation device (100) of the present disclosure includes an article data acquisition unit (110) that acquires a plurality of first newspaper articles related to a preset field from a preset database.

[0024] The report generating device (100) of the present disclosure can determine a newspaper article corresponding to the user's interests and provide an issue report that combines a title or summary, etc.

[0025] The aforementioned preset fields can be configured in various ways as needed, regardless of type or scope. For example, secondary batteries may be set as a field to be crawled, or autonomous vehicles may be set as a single field. Additionally, the fields need not be limited to one; two or more fields may be configured.

[0026] For example, the article data acquisition unit (110) of the present disclosure may collect newspaper articles published on a website through a crawling technique during a pre-set period and store the collected newspaper articles in a pre-set database. The aforementioned pre-set period is a period during which new issues are periodically generated and may be set in units of one day, one week, and one month. However, this is merely an example and may be set in various ways as needed.

[0027] The report generation device (100) of the present disclosure sets a URL (Uniform Resource Locator) that serves as a starting point for crawling, uses a preset library to check whether data in a preset field exists at the URL, and, if it does, obtains the title or body content of the newspaper article as text, extracts the content included in the text, and stores it in a database. The aforementioned preset library may include an HTTP library, but is not limited thereto, and various libraries may be used as needed.
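The extraction step described in this paragraph can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the page has already been fetched as an HTML string by some HTTP library, that the title sits in a <title> element and the body in <p> elements, and that a preset field is matched by simple keyword lookup. The names ArticleParser, extract_article, and field_keywords are hypothetical.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data


def extract_article(html, field_keywords):
    """Return (title, body) if the page mentions a preset field, else None."""
    parser = ArticleParser()
    parser.feed(html)
    title = parser.title.strip()
    body = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    text = (title + " " + body).lower()
    # Assumption: a field is "present" when any of its keywords appears in the text.
    if any(keyword.lower() in text for keyword in field_keywords):
        return title, body
    return None


SAMPLE_HTML = """<html><head><title>Secondary battery output rises</title></head>
<body><p>Makers expanded cell production.</p><p>Demand from EVs grew.</p></body></html>"""

result = extract_article(SAMPLE_HTML, ["secondary battery"])
```

The returned title and body are kept apart so that they can be stored separately afterwards.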

[0028] The report generation device (100) of the present disclosure can separate the title and body of the newspaper article obtained by the crawling method described above and store them in a separate database, and can encrypt at least one of the title and body and store it in the database described above.
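One possible way to keep titles and bodies apart is sketched below. It is an assumption-laden illustration: the disclosure speaks of separate databases, while this sketch uses two tables in a single SQLite file for brevity, and the encryption step is left as an optional caller-supplied hook (`encrypt`, hypothetical) because no cipher is named in the text.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create one table for title metadata and one for body text."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article_title ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, published TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article_body ("
        "article_id INTEGER PRIMARY KEY REFERENCES article_title(id), "
        "body BLOB NOT NULL)"
    )
    return conn


def save_article(conn, title, published, body, encrypt=None):
    """Store title and body separately; `encrypt` is an optional bytes -> bytes hook."""
    payload = body.encode("utf-8")
    if encrypt is not None:
        payload = encrypt(payload)  # assumption: caller supplies the cipher
    cur = conn.execute(
        "INSERT INTO article_title (title, published) VALUES (?, ?)",
        (title, published),
    )
    conn.execute(
        "INSERT INTO article_body (article_id, body) VALUES (?, ?)",
        (cur.lastrowid, payload),
    )
    conn.commit()
    return cur.lastrowid


conn = open_store()
article_id = save_article(conn, "Secondary battery output rises", "2025-01-01", "Body text.")
```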

[0029] The report generating device (100) of the present disclosure includes an article data refinement unit (120) that extracts a plurality of second newspaper articles from a plurality of first newspaper articles with duplicate articles removed based on a first similarity between a plurality of first newspaper articles.

[0030] The article data refinement unit (120) of the present disclosure can calculate the similarity between newspaper articles in a preset field obtained through crawling and remove some of the newspaper articles with high similarity.

[0031] For example, the article data refinement unit (120) of the present disclosure may perform labeling for a plurality of first newspaper articles and input the plurality of first newspaper articles into an artificial intelligence model to perform learning. The aforementioned artificial intelligence model may be a model that incorporates prompt engineering techniques to precisely design the model so that a result matching the user's intention is output.

[0032] For example, the article data refinement unit (120) of the present disclosure may label the title of each first newspaper article with content regarding which category each first newspaper article may be included in.

[0033] In addition, the first newspaper article to be labeled may be determined based on user feedback regarding multiple first newspaper articles. The aforementioned user feedback may include information regarding the number of user recommendations for each newspaper article. Accordingly, labeling may be performed on newspaper articles with a high number of recommendations, and input into an artificial intelligence model to perform learning.
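As a rough sketch of this selection step, the articles to be labeled could be picked by recommendation count. The record layout (a dictionary with a "recommendations" key) and the cutoff value are assumptions made only for illustration.

```python
def select_for_labeling(articles, min_recommendations=10):
    """Return articles meeting the cutoff, most recommended first."""
    chosen = [a for a in articles if a["recommendations"] >= min_recommendations]
    return sorted(chosen, key=lambda a: a["recommendations"], reverse=True)


# Hypothetical feedback data.
articles = [
    {"title": "Lidar costs fall", "recommendations": 4},
    {"title": "Battery plant opens", "recommendations": 25},
    {"title": "Charging network grows", "recommendations": 12},
]
to_label = select_for_labeling(articles)
```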

[0034] The training data input to the artificial intelligence model may be the title of the first newspaper article, the body text, or the full text including both the title and body text. The trained artificial intelligence model can be used to classify the category of newspaper articles from which duplicate articles have been removed among multiple first newspaper articles.

[0035] As another example, the article data refinement unit (120) of the present disclosure may convert at least one word included in each of a plurality of first newspaper articles into a vector based on an embedding technique, and calculate a first similarity between two newspaper articles based on a vector of each of two newspaper articles included in the plurality of first newspaper articles and a cosine similarity judgment technique.

[0036] As another example, the article data refinement unit (120) of the present disclosure may extract a plurality of second newspaper articles by deleting one of two first newspaper articles when the first similarity between them is greater than or equal to a preset threshold. The preset threshold may be set in various ways as needed.

[0037] The aforementioned embedding technique is a method that converts each piece of data into a high-dimensional vector containing numbers while preserving the meaning of the original data. For example, the word 'secondary battery' can be converted into a high-dimensional vector such as [0.52, -0.04, 0.16, … 0.27] through the embedding technique, and the word 'car' can be converted into a high-dimensional vector such as [0.42, -0.17, 0.18, … 0.25] through the embedding technique. Furthermore, the embedding technique can convert not only a single word into a vector but also an entire sentence or text into a single vector. For example, a specific newspaper article can be converted into a vector of [0.2, -0.01, 0.36, … 0.87].

[0038] The article data refinement unit (120) of the present disclosure can calculate the similarity between two newspaper articles through comparison between vectors converted based on an embedding technique.

[0039] The present disclosure proposes using a cosine similarity judgment technique for the aforementioned similarity calculation. The cosine similarity judgment technique measures the degree of similarity between two pieces of data by comparing the directions of two vectors using the cosine function. The first similarity can be calculated by Equation 1, where θ is the angle between the vectors, X and Y are the two vectors to be compared, X · Y is the dot product of the two vectors, and |X| and |Y| are the magnitudes of the respective vectors.

[0040] [Equation 1]

$$\cos\theta = \frac{X \cdot Y}{|X|\,|Y|}$$

[0041] For example, if X is the vector [2, 1] and Y is the vector [1, 2], X · Y is 2*1 + 1*2 = 4, and |X| and |Y| are each obtained as $\sqrt{2^2 + 1^2} = \sqrt{5}$. Therefore, the first similarity, cos θ, can be calculated as 4 / (√5 · √5) = 0.8.
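The worked example with the vectors [2, 1] and [1, 2] can be checked with a few lines of code. This is a minimal sketch; the function name cosine_similarity is chosen here for illustration, and the error handling for zero vectors is an added assumption.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Vectors from the example: dot product 4, both magnitudes sqrt(5).
similarity = cosine_similarity([2, 1], [1, 2])
```

Up to floating-point rounding, `similarity` equals 0.8, matching the value stated in the example.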

[0042] As another example, the article data refinement unit (120) of the present disclosure may extract multiple second newspaper articles by deleting one of two newspaper articles when the similarity between them is greater than or equal to a preset threshold. The preset threshold may be set as a real number and may vary as needed.

[0043] The report generation device (100) of the present disclosure can obtain a plurality of first newspaper articles using a crawling technique and determine or extract a plurality of second newspaper articles from which duplicate articles are removed among the plurality of first newspaper articles based on an embedding technique and a cosine similarity judgment technique.

[0044]

[0045] The report generation device (100) of the present disclosure includes an issue report determination unit (130) that classifies a plurality of second newspaper articles into at least one category and performs clustering for each category to determine the content of the issue report to be output.

[0046] For example, the issue report determination unit (130) of the present disclosure may classify a plurality of second newspaper articles, from which duplicate articles have been removed from a plurality of first newspaper articles, into pre-set categories in order to effectively recommend newspaper articles preferred by the user, and the classification of the plurality of second newspaper articles may be performed through the aforementioned learned artificial intelligence model.

[0047] The issue report determination unit (130) of the present disclosure can classify each second newspaper article by category using an artificial intelligence model trained with a plurality of first newspaper articles as training data. The aforementioned pre-set categories may refer to sub-concepts of the aforementioned pre-set fields. For example, if the pre-set field is automobiles, the pre-set categories may include categories for autonomous vehicles and categories for electric vehicles. The content or number of the aforementioned pre-set fields and categories are merely examples for convenience of explanation and may be set in various ways as needed.

[0048] As another example, the issue report determination unit (130) of the present disclosure determines at least one cluster according to a predetermined criterion based on clustering performed on a plurality of second newspaper articles classified by a predetermined category, and determines and generates content to be included in an issue report based on a plurality of second newspaper articles included in each cluster, wherein the predetermined criterion may include determining the cluster based on the number of first newspaper articles determined as duplicate articles based on a first similarity.

[0049] The aforementioned clustering can be performed using at least one of the title and body of a plurality of second newspaper articles.

[0050] The issue report determination unit (130) of the present disclosure can group the second newspaper articles classified by category into clusters within each category through clustering. In addition, it can select a predetermined number of clusters in descending order of the combined number of second newspaper articles included in each cluster and first newspaper articles that were determined to be duplicates and deleted, and generate an issue report based on the second newspaper articles included in the selected clusters.
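The cluster selection described here might look like the sketch below. It assumes each cluster is a list of second-article identifiers and that a mapping from each kept article to the number of duplicates removed in its favor is available; both data shapes and the name select_issue_clusters are hypothetical.

```python
def select_issue_clusters(clusters, duplicate_counts, top_n):
    """Rank clusters by kept articles plus the duplicates removed in their favor."""

    def weight(article_ids):
        return len(article_ids) + sum(duplicate_counts.get(a, 0) for a in article_ids)

    ranked = sorted(clusters.items(), key=lambda item: weight(item[1]), reverse=True)
    return [cluster_id for cluster_id, _ in ranked[:top_n]]


# Hypothetical clusters within one category.
clusters = {"c0": ["a1", "a2"], "c1": ["a3"], "c2": ["a4", "a5", "a6"]}
# Hypothetical counts of deleted duplicates per kept article.
duplicate_counts = {"a1": 0, "a2": 1, "a3": 7, "a4": 1, "a5": 0, "a6": 0}

top = select_issue_clusters(clusters, duplicate_counts, top_n=2)
```

In this toy data, cluster c1 holds a single article but ranks first because seven duplicates of it were removed, which reflects the idea that heavily duplicated articles indicate an issue.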

[0051] As another example, the issue report includes a first issue report containing a summary and details of a second newspaper article included in each cluster, wherein the summary of the first issue report is determined based on the Chain of Thought (CoT) technique, and the details of the first issue report can be determined through prompt engineering.

[0052] The report generation device (100) of the present disclosure can generate at least one report among a first issue report and a second issue report for each cluster.

[0053] The first issue report contains information on the summary and details of the second newspaper article included in each cluster, the summary is determined based on the Chain of Thought (CoT) technique, and the details can be determined through an artificial intelligence model that incorporates prompt engineering. The artificial intelligence model that incorporates prompt engineering may be the same model as the artificial intelligence model used to classify the aforementioned second newspaper article by category, or it may be a separate artificial intelligence model trained using different training data.

[0054] The aforementioned summary can be determined through a Chain of Thought (CoT) technique, which extracts key information from the content of at least one second newspaper article included in each cluster, organizes the extracted key information into a logical flow, and then writes the summary. Additionally, the aforementioned summary may include specific numerical information contained in the newspaper article.
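A prompt that walks a model through these three steps could be assembled as in the following sketch. The wording of the prompt and the function name build_cot_summary_prompt are assumptions for illustration; the disclosure does not give a prompt text, and the call to an actual model is omitted.

```python
def build_cot_summary_prompt(article_bodies):
    """Assemble a step-by-step summarization prompt for one cluster."""
    numbered = "\n\n".join(
        f"[Article {i}]\n{body.strip()}"
        for i, body in enumerate(article_bodies, start=1)
    )
    return (
        "You are preparing an issue report from the news articles below.\n"
        "Work through the following steps in order:\n"
        "1. Extract the key information from each article, keeping specific figures.\n"
        "2. Organize the extracted information into a logical flow.\n"
        "3. Write a concise summary based on that flow.\n\n"
        f"{numbered}\n\nSummary:"
    )


prompt = build_cot_summary_prompt(
    ["Cathode output rose 5% in March.", "Two makers announced new plants."]
)
```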

[0055] The aforementioned details can be determined through a Large Language Model (LLM), which is one type of artificial intelligence model. The LLM is a type of AI model trained on a vast amount of text data. The LLM can generate consistent responses to input commands or questions. Additionally, the LLM can translate languages, generate text that meets specific conditions, and summarize text.

[0056] Prompt engineering refers to carefully designing the input given to a model so that the output matches the user's intent. For example, instructions, context, input form, and output form can be set as a single condition, and at least one of these elements can be extracted from the text entered by the user and fed into the AI model. A representative LLM to which prompt engineering is applied is ChatGPT (Generative Pre-trained Transformer), developed by OpenAI.
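The four elements named in this paragraph can be held in a small structure and rendered into one prompt, as in this sketch. The class name Prompt, the section labels, and the sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Prompt:
    instruction: str
    context: str
    input_form: str
    output_form: str

    def render(self):
        """Join the non-empty elements into labeled sections."""
        sections = [
            ("Instruction", self.instruction),
            ("Context", self.context),
            ("Input", self.input_form),
            ("Output format", self.output_form),
        ]
        return "\n".join(f"{label}: {text}" for label, text in sections if text)


details_prompt = Prompt(
    instruction="Write the details section of an issue report.",
    context="Cluster: secondary batteries",
    input_form="Article bodies separated by blank lines",
    output_form="Three bullet points",
)
rendered = details_prompt.render()
```

Elements left empty are skipped, which matches the statement that at least one of the elements may be supplied.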

[0057] The report generation device (100) of the present disclosure can determine the content to be included in the detailed content based on the content of at least one second newspaper article included in each cluster through an LLM learned through learning data including the content of the main body of a newspaper article, and can generate a first issue report including the aforementioned summary content and detailed content for each cluster.

[0058] As another example, the aforementioned issue report includes a second issue report in which the titles of the second newspaper articles included in each cluster are output in order, wherein the order in which the titles of the second newspaper articles are output can be determined based on the second similarity between the summary content of the first issue report corresponding to each cluster and the titles of the second newspaper articles included in each cluster, calculated through a cosine similarity judgment technique.

[0059] The report generation device (100) of the present disclosure can calculate a second similarity between the summary content of a first issue report generated for each cluster and the title of each second newspaper article, and set the titles of the second newspaper articles to be included in the second issue report in order of highest second similarity. The embedding technique and cosine similarity judgment technique used in the first similarity calculation may also be used in the calculation of the second similarity.
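The ordering by second similarity can be sketched as follows. Note the substitution: a plain word-count vector stands in for the embedding technique, since the disclosure does not name an embedding model. The function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def order_titles(summary, titles):
    """Sort titles by descending second similarity to the cluster summary."""
    summary_vec = bag_of_words(summary)
    return sorted(
        titles,
        key=lambda t: cosine_similarity(summary_vec, bag_of_words(t)),
        reverse=True,
    )


summary = "Battery makers expand cathode production as electric vehicle demand grows"
titles = [
    "Steel prices steady this quarter",
    "Battery makers expand cathode production",
    "Electric vehicle demand lifts battery orders",
]
ordered = order_titles(summary, titles)
```

With these sample texts, the title sharing the most words with the summary is listed first and the unrelated title last.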

[0060] The report generation device (100) of the present disclosure may be implemented by a computing device comprising at least some of a processor, memory, user input device, and presentation device. The computing device may include various devices such as a smartphone, tablet, laptop, desktop, server, and client. The computing device may be a single stand-alone device, or it may include multiple computing devices operating in a distributed environment composed of multiple computing devices that cooperate with each other through a communication network.

[0061] Meanwhile, a computing device may be a quantum computing device rather than a classical computing device. Quantum computing devices perform operations in units of qubits rather than bits. A qubit can be in a superposition of 0 and 1 at the same time, and M qubits can represent 2^M states simultaneously.

[0062] A quantum computing device can use various types of quantum gates (e.g., Pauli, rotation, Hadamard, CNOT, SWAP, Toffoli) that receive one or more qubits and perform specified quantum operations, and can combine quantum gates to form a quantum circuit with a particular function.

[0063] Quantum computing devices can use quantum artificial neural networks (e.g., QCNN, QGRNN) that can perform functions of conventional artificial neural networks (e.g., CNN, RNN) at a faster speed while using fewer parameters.

[0064] Additionally, the aforementioned artificial intelligence model may be stored in the memory of the report generation device (100) or may be stored in a specific server outside the report generation device (100). Accordingly, the report generation device (100) of the present disclosure may be connected to an external server via a wired or wireless connection through a network.

[0065] The present disclosure has the advantage of preventing duplication of article content included in issue reports and enabling newspaper articles suitable for field-specific issues to be recommended and provided to users by generating issue reports based on embedding techniques and cosine similarity judgment techniques, by first removing duplicate articles, performing clustering by category to determine at least one cluster for each category, and determining the content to be included in the issue report from each determined cluster.

[0066] Below, the overall process of generating an issue report is explained in more detail with reference to the diagram.

[0067] FIG. 2 is a flowchart for schematically explaining the process of generating a report according to one embodiment.

[0068] Referring to FIG. 2, the report generating device of the present disclosure can remove duplicate articles among newspaper articles collected for a specific field, classify them by category, and determine the content to be included in the issue report for each cluster determined through clustering.

[0069] Specifically, the report generating device of the present disclosure collects a plurality of first newspaper articles for a preset field (S200).

[0070] The report generating device of the present disclosure can collect a plurality of first newspaper articles from each website through crawling according to a preset field. Each first newspaper article can be separated into basic information, such as title and date of creation, and body text; each part can be stored in a separate database, and the body text can be stored in an encrypted state.

[0071] Multiple first newspaper articles can be collected regardless of the type of language. Since collecting newspaper articles through all existing websites may be unreasonable in terms of time and cost, the report generating device of the present disclosure may determine and collect newspaper articles related to a predetermined field as the first newspaper article from among newspaper articles published through the websites of major domestic and foreign media outlets that are predetermined.

[0072] A time limit may be set for the collection of the first newspaper article. However, the length of the period may be varied as needed.

[0073] When a plurality of first newspaper articles are collected, the report generating device of the present disclosure performs labeling on some of the plurality of first newspaper articles and performs learning of a preset artificial intelligence model based on the labeled first newspaper articles (S210).

[0074] The report generation device of the present disclosure can perform labeling on a plurality of collected first newspaper articles and perform training of a preset artificial intelligence model using the labeled first newspaper articles. As described above, labeling can be performed on the titles of the first newspaper articles, and the first newspaper articles to be labeled can be selected based on user feedback.

[0075] The training of the aforementioned artificial intelligence model may be performed using only the title or body of the first newspaper article, or using both the title and body. The aforementioned artificial intelligence model may be a model that incorporates prompt engineering techniques to design the model in a sophisticated manner so that a result matching the user's intent is output.

[0076] When a newspaper article is input into a trained artificial intelligence model, a result indicating which category the article belongs to can be output.
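The input and output of such a classifier can be illustrated with a deliberately simple stand-in. The sketch below uses a multinomial naive Bayes model over title words, which is not the artificial intelligence model of the disclosure (that model is not specified); the class name TitleClassifier and the training titles are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class TitleClassifier:
    """Multinomial naive Bayes with add-one smoothing (a stand-in model)."""

    def fit(self, titles, labels):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for title, label in zip(titles, labels):
            self.word_counts[label].update(tokenize(title))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, title):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(docs / total_docs)  # class prior
            for word in tokenize(title):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled titles for two categories of the automobile field.
train_titles = [
    "Self-driving taxi pilot expands",
    "Lidar sensor for autonomous driving",
    "Electric vehicle battery range improves",
    "New electric vehicle charging network",
]
train_labels = ["autonomous vehicle"] * 2 + ["electric vehicle"] * 2

model = TitleClassifier().fit(train_titles, train_labels)
category = model.predict("Autonomous driving software update")
```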

[0077] When the learning of a preset artificial intelligence model is performed, the report generating device of the present disclosure removes duplicate first newspaper articles among a plurality of collected first newspaper articles (S220).

[0078] Because duplicate articles may exist when all of the multiple first newspaper articles are used, the report generation device of the present disclosure may identify and remove duplicate articles from the multiple first newspaper articles. However, since the presence of many duplicate articles implies that the topic was an issue during a specific period, the duplicate articles may not simply be removed but may instead be reflected as a corresponding weight. Furthermore, the report generation device of the present disclosure does not require an exact match when removing duplicate articles; articles that are merely similar may also be removed if they can be considered semantically identical.

[0079] The report generating device of the present disclosure may compare the titles of each of a plurality of first newspaper articles, compare the body texts of the plurality of first newspaper articles, or compare both the titles and the body texts in a manner that determines duplicate articles among a plurality of first newspaper articles. This is not limited to any one method and can be configured in various ways as needed.

[0080] For example, assuming that the report generating device of the present disclosure determines duplicate articles by calculating a first similarity between the titles of a plurality of first newspaper articles, the report generating device of the present disclosure may obtain the title of each newspaper article from a database in which the titles of a plurality of first newspaper articles are stored, extract words included in the title according to an embedding technique, and convert each word into a high-dimensional vector containing numbers.

[0081] When the title of each first newspaper article is converted into a high-dimensional vector, the report generation device of the present disclosure can calculate the similarity for each newspaper article. The present disclosure proposes utilizing the aforementioned cosine similarity judgment technique as a method for calculating similarity. When similarity is calculated, if each similarity is greater than or equal to a preset threshold, the newspaper articles are determined to be duplicate articles, and one of the two newspaper articles can be deleted.

[0082] For example, assuming there are three first newspaper articles, such as Article L, Article M, and Article N, the report generating device of the present disclosure calculates a high-dimensional vector for the titles of Article L, Article M, and Article N, respectively, calculates the similarity between Article L and Article M, the similarity between Article L and Article N, and the similarity between Article M and Article N, respectively, and compares each similarity with a preset threshold; if there is a case where the similarity is greater than or equal to the preset threshold, one of the two articles may be deleted. The report generating device of the present disclosure may delete either Article L or Article M if the similarity between Article L and Article M is greater than or equal to the preset threshold.
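The pairwise comparison for Article L, Article M, and Article N can be sketched as a single greedy pass. The title vectors and the threshold of 0.95 below are made-up values, and keeping the first article of a matching pair is an assumption, since the text allows either article to be deleted. The sketch also records how many duplicates each kept article absorbed, so that the count can later serve as a weight.

```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


def remove_duplicates(vectors, threshold):
    """Keep an article unless it matches an already kept one.

    `vectors` maps article id -> title vector. Returns the kept ids and, for
    each kept id, the number of removed articles that matched it.
    """
    kept = []
    duplicate_counts = {}
    for article_id, vector in vectors.items():
        match = next(
            (k for k in kept if cosine_similarity(vectors[k], vector) >= threshold),
            None,
        )
        if match is None:
            kept.append(article_id)
            duplicate_counts[article_id] = 0
        else:
            duplicate_counts[match] += 1
    return kept, duplicate_counts


# Hypothetical title vectors: L and M point in nearly the same direction.
title_vectors = {
    "L": [0.9, 0.1, 0.0],
    "M": [0.8, 0.2, 0.0],
    "N": [0.0, 0.1, 0.9],
}
second_articles, absorbed = remove_duplicates(title_vectors, threshold=0.95)
```

Here the similarity between L and M is about 0.99, so M is dropped and counted against L, while N is kept because it is far from both.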

[0083] In this disclosure, the remaining newspaper articles from which duplicate articles have been removed from a plurality of first newspaper articles are referred to as a plurality of second newspaper articles.

[0084] When duplicate articles are removed from a plurality of first newspaper articles, the report generating device of the present disclosure matches a plurality of second newspaper articles from which duplicate articles have been removed with a preset category (S230).

[0085] When duplicate articles are removed, the report generating device of the present disclosure can input the title or body of each second newspaper article into an artificial intelligence model trained on a plurality of second newspaper articles, which are the results of removing duplicate articles, and receive an output of which category each second newspaper article belongs to among the pre-set categories.

[0086] When the plurality of second newspaper articles are matched with the preset categories, the report generation device of the present disclosure determines, through clustering, the clusters for which an issue report will be generated (S240).

[0087] Depending on how they are set, the aforementioned categories may be broad or narrow. As a result, issue reports generated from the plurality of second newspaper articles included in each category may lack consistency, or user satisfaction with the results may be low.

[0088] Accordingly, the present disclosure proposes a method for performing clustering by category when a plurality of second newspaper articles are matched with pre-set categories in order to solve the aforementioned problem. In particular, the present disclosure proposes K-means clustering among clustering methods.

[0089] In K-means clustering, K represents the number of clusters or groups. Through clustering performed for each category, articles containing common issues can be grouped to form at least one cluster.

[0090] For example, for multiple second newspaper articles included in one category, clustering can be performed such that K is 2, that is, two clusters are formed. Or, clustering can be performed such that K is 3, 4, 5 or more.
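As a rough illustration of clustering with a chosen K, the following is a small pure-Python version of Lloyd's K-means algorithm. The one-dimensional data points, the function name, and the choice of the first K points as initial centroids are assumptions made for this sketch; the disclosure does not specify how articles are represented as points or how centroids are initialized.

```python
def k_means(points, k, iterations=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Initial centroids are the first k points (an assumption chosen for
    determinism; practical systems usually use randomized seeding).
    Returns (assignments, centroids).
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical one-dimensional data points for four articles in one category.
points = [(1.0,), (2.0,), (3.0,), (6.0,)]
assignments, centroids = k_means(points, k=2)
```

Here the first three points end up in one cluster with centroid 2.0 and the fourth point forms its own cluster.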

[0091] For clusters generated through clustering performed with different K values, the report generation device of the present disclosure can calculate an evaluation metric (Sum of squared error, SSE) for each clustering. The aforementioned evaluation metric is an indicator of how close a data point of each cluster is to a center point, and as K increases, the evaluation metric decreases.

[0092] For example, if the user preferences for newspaper articles A, B, C, and D are 1, 2, 3, and 6, respectively, then 1, 2, 3, and 6 are the data points, and the centroid can be the average of the data points, which is 3. In this case, the evaluation metric (Sum of squared error, SSE) can be calculated by changing the value of K.
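The effect of changing K on this example can be checked with a few lines of Python. The sketch assumes the usual definition of SSE as the sum of squared distances between each data point and the mean of its cluster; the split into two clusters used for K = 2 is a hypothetical grouping chosen for illustration.

```python
def sse(clusters):
    """Sum of squared errors: squared distance of every point to its cluster mean."""
    total = 0.0
    for points in clusters:
        centroid = sum(points) / len(points)
        total += sum((x - centroid) ** 2 for x in points)
    return total

preferences = [1, 2, 3, 6]        # Articles A, B, C, and D
sse_k1 = sse([preferences])       # K = 1: one cluster with centroid 3
sse_k2 = sse([[1, 2, 3], [6]])    # K = 2: hypothetical split into two clusters
```

For K = 1 the SSE is 4 + 1 + 0 + 9 = 14, and for this K = 2 split it falls to 2, which illustrates how the metric decreases as K increases.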

[0093] The aforementioned evaluation metric can be determined based on mathematical formula 2, which takes as factors n, the number of data points (that is, the number of second newspaper articles included in a specific category); x_i, the data point for each second newspaper article; and y_i, the average of the data points.

[0094] $$\mathrm{SSE} = \sum_{i=1}^{n} (x_i - y_i)^2$$ (mathematical formula 2)

[0095] When the evaluation metric is calculated, the report generation device of the present disclosure can determine the section in which the decrease of the evaluation metric slows down as K increases, treat the K corresponding to that section as the optimal K, and perform clustering so that the optimal K number of clusters is generated within the corresponding category.
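One possible way to locate the section where the decrease slows down is sketched below. The heuristic (pick the K where the drop in SSE before K most exceeds the drop after K), the function name, and the SSE values are assumptions for illustration; the disclosure does not state how the section is detected.

```python
def choose_k(sse_by_k):
    """Pick the K at which the SSE curve bends most sharply.

    sse_by_k maps K (consecutive integers) to the SSE obtained with that K.
    Heuristic (an assumption, not taken from the disclosure): maximize the
    difference between the SSE drop into K and the SSE drop out of K.
    """
    ks = sorted(sse_by_k)
    if len(ks) < 3:
        return ks[0]
    best_k, best_bend = ks[1], float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_in = sse_by_k[prev_k] - sse_by_k[k]
        drop_out = sse_by_k[k] - sse_by_k[next_k]
        bend = drop_in - drop_out
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Hypothetical SSE values for K = 1..5 within one category.
sse_by_k = {1: 120.0, 2: 70.0, 3: 25.0, 4: 20.0, 5: 17.0}
optimal_k = choose_k(sse_by_k)
```

With these values the SSE falls steeply up to K = 3 and only slightly afterwards, so 3 is selected.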

[0096] When clustering is performed, the report generating device of the present disclosure may determine a predetermined number of clusters for each category, and the criteria for determining the clusters may prioritize clusters with a large number of the aforementioned duplicate articles. Alternatively, the report generating device of the present disclosure may determine the priority of clusters based on an indicator representing user preference for each second newspaper article, or may determine the priority of clusters by considering both the number of duplicate articles and user preference for each second newspaper article. Alternatively, the report generating device of the present disclosure may determine the priority of clusters by assigning a predetermined weight to each of the aforementioned number of duplicate articles and user preference for each second newspaper article. The aforementioned indicator representing user preference may refer to the number of recommendations for each second newspaper article, which may be collected together during the process of collecting the first newspaper article through crawling and stored in a database.
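The weighted prioritization described above could look like the following sketch. The equal default weights, the data layout, the function name, and the use of raw (unnormalized) counts are all assumptions; the disclosure only states that a predetermined weight may be assigned to each factor.

```python
def rank_clusters(clusters, duplicate_weight=0.5, preference_weight=0.5, top_n=2):
    """Return the names of the top_n clusters by weighted score.

    clusters maps a cluster name to a list of
    (duplicate_count, recommendation_count) pairs, one per second article.
    """
    scores = {}
    for name, articles in clusters.items():
        duplicates = sum(d for d, _ in articles)
        recommendations = sum(r for _, r in articles)
        scores[name] = duplicate_weight * duplicates + preference_weight * recommendations
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]

# Hypothetical clusters within one category.
clusters = {
    "c1": [(3, 10), (1, 4)],
    "c2": [(0, 2)],
    "c3": [(5, 1), (2, 0)],
}
selected = rank_clusters(clusters)
```

Setting one of the weights to zero reduces this to prioritizing by duplicate count alone or by user preference alone, which matches the other alternatives mentioned in the paragraph.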

[0097] The method of performing clustering and the method of determining the priority of clusters are not limited to the aforementioned methods and can be configured in various ways as needed.

[0098] When a cluster is determined, the report generating device of the present disclosure determines the contents to be included in the issue report (S250).

[0099] When a predetermined number of clusters is determined for each category, the report generating device of the present disclosure determines the contents to be included in the issue report. As described above, a first issue report including summary content and detailed content for each cluster may be generated, or a second similarity between the summary content of the first issue report and a second newspaper article may be calculated, and a second issue report including the title of the second newspaper article may be generated based on the calculation result.

[0100] For the first issue report, the summary can be generated based on the Chain of Thought (CoT) method, a recent state-of-the-art (SOTA) technique, and the detailed content can be generated through a model that incorporates prompt engineering techniques. The model that incorporates prompt engineering techniques may be the same model as the artificial intelligence model used to classify the aforementioned plurality of second newspaper articles by category, or it may be an artificial intelligence model trained separately.

[0101] In addition, the aforementioned summary can be configured to contain specific details including numerical values.

[0102] The second issue report can be generated so that the title of a newspaper article is displayed for each prompt. The priority of the displayed newspaper articles can be assigned in order of the second similarity between the summary of the first issue report generated for a specific prompt and each second newspaper article. The aforementioned second similarity can be calculated based on an embedding technique and a cosine similarity judgment technique.

[0103] FIGS. 3A and 3B are drawings for explaining a method of determining duplicate newspaper articles according to one embodiment.

[0104] Referring to FIGS. 3A and 3B, the report generation device of the present disclosure collects a plurality of first newspaper articles related to a preset field through crawling and can determine duplicate articles based on the aforementioned embedding technique and cosine similarity determination technique.

[0105] FIGS. 3A and 3B show the titles of collected first newspaper articles, and each article title can be converted into a high-dimensional vector containing numbers through an embedding technique.

[0106] For example, Fig. 3A discloses the title “POSCO Future M’s 52-week high broken, No. 1 in cathode material market share. Sufficient momentum for expansion of anode material business.” The report generation device of the present disclosure can generate a high-dimensional vector by extracting each word excluding particles from the article title and converting each word into a number.

[0107] The report generating device of the present disclosure can extract {'POSCO Future M', 52 weeks, new high, new high, cathode material, M / S, No. 1, anode material, business, expansion, momentum, sufficient} from the article title. In addition, each word can be converted into a number, such as 'POSCO Future M' as -0.008 and 52 weeks as -0.03.

[0108] Similarly, in Fig. 3B, words can be extracted and each word converted into a number to generate a high-dimensional vector.

[0109] Similarity can be calculated for the generated high-dimensional vector through Equation 1, which incorporates the aforementioned cosine similarity judgment technique, and whether it is a duplicate article can be determined by comparing the calculated similarity with a preset threshold.

[0110] FIG. 4 is a drawing for explaining a method of classifying newspaper articles according to one embodiment.

[0111] Referring to FIG. 4, the report generating device of the present disclosure can classify a plurality of second newspaper articles from which duplicate articles have been removed according to preset categories, and can perform clustering on the classified categories.

[0112] Each of the multiple second newspaper articles classified into one category can be converted into data and represented in a graph. According to the graph (400) at the top of FIG. 4, the report generating device of the present disclosure can perform clustering so that newspaper articles with similar characteristics within a category can form a group together.

[0113] As described above, regarding the clustering method, the present disclosure proposes a K-means clustering method. According to the graph (400) at the bottom of FIG. 4, it can be seen that as a result of performing clustering, the optimal K was determined to be 5 and 5 clusters (400) were formed. However, the clustering method and the number of optimal clusters are not limited to those described above and can be set in various ways as needed.

[0114] FIG. 5 is an example drawing for explaining an issue report output according to a newspaper article recommendation method according to one embodiment.

[0115] Referring to FIG. 5, the report generation device of the present disclosure can generate a first issue report for each determined cluster.

[0116] The first issue report contains information on the summary and details of the second newspaper article included in each cluster, the summary is determined based on the Chain of Thought (CoT) technique, and the details can be determined through an artificial intelligence model that incorporates prompt engineering.

[0117] The aforementioned summary can be determined through a Chain of Thought (CoT) technique, which extracts key information from the content of at least one second newspaper article included in each cluster, organizes the extracted key information into a logical flow, and then writes the summary. Additionally, the aforementioned summary may include specific numerical information contained in the newspaper article.
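The three stages described above (extract key information, organize it into a logical flow, write the summary) can be expressed as a prompt template. This is only a sketch: the wording of the steps, the function name, and the idea of passing the result to a language model as a single text prompt are assumptions, since the disclosure does not give the actual prompt.

```python
COT_STEPS = (
    "1. List the key facts in the articles, including any numerical values.",
    "2. Arrange those facts in a logical order.",
    "3. Write a short summary that follows that order.",
)

def build_cot_summary_prompt(article_bodies):
    """Build one prompt that asks a model to summarize a cluster step by step."""
    articles = "\n\n".join(
        f"Article {i}:\n{body}" for i, body in enumerate(article_bodies, start=1)
    )
    steps = "\n".join(COT_STEPS)
    return f"Work through the following steps in order.\n{steps}\n\n{articles}"

# Hypothetical article bodies belonging to one cluster.
prompt = build_cot_summary_prompt(["Body of the first article.", "Body of the second article."])
```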

[0118] The aforementioned details can be determined through a large language model (LLM), which is one of the artificial intelligence models, and the details may include part of the body content of the second newspaper article or the URL (uniform resource locator) of the second newspaper article.

[0119] Through an LLM trained on training data that includes the body content of newspaper articles, the report generation device of the present disclosure can determine what to include in the detailed content based on the content of at least one second newspaper article included in each cluster, and can generate a first issue report including the aforementioned summary content and detailed content for each cluster.

[0120] According to FIG. 5, "market trends and strategies" represents a preset category, and the heading of Issue 1 (the increase in LG Energy Solutions' orders for Tesla and the rise in battery subscriptions) represents a cluster. Additionally, the content following Issue 1_Summary represents the summary content, and the content at the bottom represents the details.

[0121] Additionally, Issue2 at the bottom refers to a different cluster, and the content following it refers to the details regarding Issue2.

[0122] FIGS. 6A to 6D are other example drawings for explaining an issue report output according to a newspaper article recommendation method according to one embodiment.

[0123] Referring to FIGS. 6A to 6D, the report generating device of the present disclosure can generate at least one issue report among a first issue report and a second issue report for each determined cluster.

[0124] The second issue report may include the titles of the second newspaper articles included in each cluster in order of priority. The priority of outputting the titles of the second newspaper articles may be determined based on the second similarity between the summary content of the first issue report corresponding to each cluster, calculated through a cosine similarity judgment technique, and the titles of the second newspaper articles included in each cluster.

[0125] The report generation device of the present disclosure can calculate a second similarity between the summary content of a first issue report generated for each cluster and the title of each second newspaper article, and set the titles of the second newspaper articles to be included in the second issue report in order of highest second similarity. The embedding technique and cosine similarity judgment technique used in the first similarity calculation may also be used in the calculation of the second similarity.
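The ordering of titles by second similarity can be sketched as follows. The two-dimensional vectors and the function names are hypothetical; in practice the summary and each title would first be converted into vectors by the embedding technique.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_x * norm_y)

def order_titles(summary_vector, title_vectors):
    """Return titles sorted from highest to lowest similarity to the summary."""
    return sorted(
        title_vectors,
        key=lambda title: cosine_similarity(summary_vector, title_vectors[title]),
        reverse=True,
    )

# Hypothetical embedding of a cluster summary and of three article titles.
summary_vector = [1.0, 0.0]
title_vectors = {
    "Title A": [0.0, 1.0],
    "Title B": [1.0, 0.1],
    "Title C": [0.5, 0.5],
}
ordered = order_titles(summary_vector, title_vectors)
```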

[0126] According to FIG. 6A, the report generating device of the present disclosure can list the second newspaper articles included in a cluster of the market trends and strategies category in descending order of similarity, based on the result of calculating the similarity.

[0127] According to FIGS. 6B, 6C, and 6D, the report generating device of the present disclosure lists the titles of the second newspaper articles in descending order of similarity, based on the issue topic of each cluster and the result of calculating the similarity.

[0128] FIG. 7 is a flowchart for explaining a method for generating a report according to one embodiment.

[0129] Referring to FIG. 7, the report generation method of the present disclosure includes an article data acquisition step of acquiring a plurality of first newspaper articles related to a preset field from a preset database (S700).

[0130] The report generating device of the present disclosure can determine a newspaper article corresponding to the user's interests and provide an issue report that combines a title or summary, etc.

[0131] The aforementioned pre-set fields can be configured in various ways as needed, regardless of type or scope. Additionally, the aforementioned fields do not need to be configured as a single field, and two or more fields may be configured.

[0132] For example, the report generation device of the present disclosure may collect newspaper articles published on a website through a crawling technique during a preset period and store the collected newspaper articles in a preset database.

[0133] The report generation device of the present disclosure sets a starting URL for crawling, obtains the title or body content of a newspaper article as text through a pre-configured library when data in a pre-configured field exists at the said URL, and extracts the content contained in the said text and stores it in a database. The aforementioned pre-configured library may include an HTTP library, but is not limited thereto, and various libraries may be used as needed.
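A minimal sketch of the fetch-and-extract step using only the Python standard library is shown below. The choice of `urllib.request` as the HTTP library, the assumption that the title is in the `<title>` element and the body in `<p>` elements, and all class and function names are illustrative; real news sites usually need site-specific selectors, and storing the result in a database is omitted here.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class ArticleParser(HTMLParser):
    """Collects the <title> text and the text of <p> elements."""

    def __init__(self):
        super().__init__()
        self._open_tag = None
        self.title = ""
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._open_tag = tag

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._open_tag == "title":
            self.title += text
        elif self._open_tag == "p":
            self.body_parts.append(text)

def parse_article(html):
    """Split an HTML page into a title and a body string."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "body": " ".join(parser.body_parts)}

def fetch_article(url):
    """Download one page from the starting URL and parse it (needs network access)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return parse_article(html)

# Hypothetical page content, parsed without any network access.
sample = (
    "<html><head><title>Sample headline</title></head>"
    "<body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)
record = parse_article(sample)
```

Keeping the title and body as separate fields mirrors the separate storage of title and body described in the following paragraph.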

[0134] The report generation device of the present disclosure can separate the title and body of the content of a newspaper article obtained by the crawling method described above and store them in a separate database, and can encrypt at least one of the title and body and store it in the database described above.

[0135] The report generation method of the present disclosure includes an article data refinement step of extracting a plurality of second newspaper articles from a plurality of first newspaper articles with duplicate articles removed based on a first similarity between a plurality of first newspaper articles (S710).

[0136] The report generation device of the present disclosure can calculate the similarity between newspaper articles in a preset field obtained through crawling and remove some of the newspaper articles with high similarity.

[0137] For example, the report generation device of the present disclosure may perform labeling for a plurality of first newspaper articles and input the plurality of first newspaper articles into an artificial intelligence model to perform learning. The aforementioned artificial intelligence model may be a model that incorporates prompt engineering techniques to precisely design the model so that a result matching the user's intention is output.

[0138] For example, the report generating device of the present disclosure can label the title of each first newspaper article with content regarding which category each first newspaper article may be included in.

[0139] In addition, the first newspaper article to be labeled may be determined based on user feedback regarding multiple first newspaper articles. The aforementioned user feedback may include information regarding the number of user recommendations for each newspaper article. Accordingly, labeling may be performed on newspaper articles with a high number of recommendations, and input into an artificial intelligence model to perform learning.

[0140] The training data input to the artificial intelligence model may be the title of the first newspaper article, the body text, or the full text including both the title and body text. The trained artificial intelligence model can be used to classify the category of newspaper articles from which duplicate articles have been removed among multiple first newspaper articles.

[0141] As another example, the report generating device of the present disclosure can convert at least one word included in each of a plurality of first newspaper articles into a vector based on an embedding technique, and calculate a first similarity between two newspaper articles based on the vector of each of two newspaper articles included in the plurality of first newspaper articles and a cosine similarity determination technique.

[0142] As another example, the report generating device of the present disclosure may extract a plurality of second newspaper articles by deleting one of the two first newspaper articles when there are two first newspaper articles where the first similarity is less than a preset threshold.

[0143] The aforementioned embedding technique is a method that transforms each data point into a high-dimensional vector containing numbers while preserving the meaning of the original data.

[0144] The report generation device of the present disclosure can calculate the similarity between two newspaper articles through comparison between vectors converted based on an embedding technique.

[0145] The present disclosure proposes using a cosine similarity judgment technique for the aforementioned similarity calculation. The cosine similarity judgment technique measures the degree of similarity between two pieces of data using a cosine function by comparing the directions of two vectors. The similarity can be calculated by the aforementioned mathematical formula 1, which takes as factors θ, the angle between the two vectors; X and Y, the two vectors to be compared; X·Y, the dot product of the two vectors; and |X| and |Y|, the magnitudes of the vectors.
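Written out from the factors named above, the standard cosine similarity has the following form. This is a reconstruction from the description, since the formula itself does not appear in this part of the text:

```latex
\cos\theta = \frac{X \cdot Y}{|X|\,|Y|}
```

The value is 1 when the two vectors point in the same direction and 0 when they are orthogonal, which is why a higher value indicates more similar titles.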

[0146] As another example, the report generating device of the present disclosure may extract multiple second newspaper articles by deleting one of the two newspaper articles when there are two newspaper articles with a similarity level below a preset threshold. The preset threshold may be set as a real number and may vary as needed.

[0147] The report generation device of the present disclosure can acquire a plurality of first newspaper articles using a crawling technique and determine or extract a plurality of second newspaper articles from which duplicate articles are removed among the plurality of first newspaper articles based on an embedding technique and a cosine similarity determination technique.

[0148] The report generation method of the present disclosure includes an issue report determination step of classifying a plurality of second newspaper articles into at least one category and performing clustering for each category to determine the content of the issue report to be output (S720).

[0149] For example, the report generating device of the present disclosure may classify a plurality of second newspaper articles, from which duplicate articles have been removed from a plurality of first newspaper articles, into pre-set categories in order to effectively recommend newspaper articles preferred by a user, and the classification of the plurality of second newspaper articles may be performed through the aforementioned learned artificial intelligence model.

[0150] The report generation device of the present disclosure can classify each second newspaper article by category through an artificial intelligence model trained using a plurality of first newspaper articles as training data. The aforementioned pre-set category may mean a sub-concept of the aforementioned pre-set field.

[0151] As another example, the report generating device of the present disclosure determines at least one cluster according to a preset criterion based on clustering performed on a plurality of second newspaper articles classified by a preset category, and determines and generates content to be included in an issue report based on a plurality of second newspaper articles included in each cluster, wherein the preset criterion may include determining the cluster based on the number of first newspaper articles determined as duplicate articles based on a first similarity.

[0152] The report generation device of the present disclosure can group the second newspaper articles classified by category into clusters within each category through clustering. In addition, it can select a predetermined number of clusters, giving priority to clusters for which the second newspaper articles included in the cluster and the first newspaper articles deleted as duplicate articles are numerous, and generate an issue report based on the second newspaper articles included in the selected clusters.

[0153] As another example, the issue report includes a first issue report containing a summary and details of a second newspaper article included in each cluster, wherein the summary of the first issue report is determined based on the CoT method, and the details of the first issue report can be determined through prompt engineering.

[0154] The report generation device of the present disclosure can generate at least one report among a first issue report and a second issue report for each cluster.

[0155] The first issue report contains information on the summary and details of the second newspaper article included in each cluster, the summary is determined based on the CoT technique, and the details can be determined through an artificial intelligence model that incorporates prompt engineering. The artificial intelligence model that incorporates prompt engineering may be the same model as the artificial intelligence model used to classify the aforementioned second newspaper article by category, or it may be a separate artificial intelligence model trained using different training data.

[0156] The aforementioned summary can be determined through the CoT technique, which extracts key information from the content of at least one second newspaper article included in each cluster, organizes the extracted key information into a logical flow, and then creates the summary. Additionally, the aforementioned summary may include specific numerical information contained in the newspaper article.

[0157] The aforementioned details can be determined through LLM, which is one of the artificial intelligence models. LLM is a type of AI model trained on a vast amount of text data. LLM can generate consistent responses to input commands or questions. Additionally, LLM can translate language, generate text that meets specific conditions, and summarize text.

[0158] Prompt engineering refers to the sophisticated design of a model to ensure that results match the user's intent. For example, instructions, context, input form, and output form can be set as a single condition, and at least one of these can be extracted from the text entered by the user and input into the artificial intelligence model.
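The four conditions named above (instructions, context, input form, output form) can be assembled into a single prompt as in the sketch below. The section labels, the function name, and the rule of skipping empty conditions are assumptions; the disclosure does not specify a prompt layout.

```python
def build_prompt(instruction, context, input_text, output_format):
    """Join the non-empty conditions into one labeled prompt string."""
    sections = [
        ("Instruction", instruction),
        ("Context", context),
        ("Input", input_text),
        ("Output format", output_format),
    ]
    return "\n\n".join(f"{label}:\n{text}" for label, text in sections if text)

# Hypothetical conditions for generating the detailed content of one cluster.
prompt = build_prompt(
    instruction="Describe the issue covered by the articles in detail.",
    context="",
    input_text="Body text of the articles in the cluster.",
    output_format="Two short paragraphs followed by the article URLs.",
)
```

Because at least one of the conditions may be absent, empty conditions are simply left out of the prompt.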

[0159] Through an LLM trained on training data that includes the body content of newspaper articles, the report generation device of the present disclosure can determine what to include in the detailed content based on the content of at least one second newspaper article included in each cluster, and can generate a first issue report including the aforementioned summary content and detailed content for each cluster.

[0160] As another example, the aforementioned issue report includes a second issue report in which the titles of the second newspaper articles included in each cluster are output in order, wherein the order in which the titles of the second newspaper articles are output can be determined based on the second similarity between the summary content of the first issue report corresponding to each cluster, calculated through a cosine similarity judgment technique, and the titles of the second newspaper articles included in each cluster.

[0161] The report generation device of the present disclosure can calculate a second similarity between the summary content of a first issue report generated for each cluster and the title of each second newspaper article, and set the titles of the second newspaper articles to be included in the second issue report in order of highest second similarity. The embedding technique and cosine similarity judgment technique used in the first similarity calculation may also be used in the calculation of the second similarity.

[0162] Through the operation of the aforementioned configurations, it is possible to prevent duplicate articles from being displayed and ensure that important content is not omitted when recommending articles on topics preferred by the user.

[0163] The foregoing description is merely an illustrative explanation of the technical concept of the present disclosure, and those skilled in the art to which the present disclosure pertains may make various modifications and variations within the scope of the essential characteristics of the technical concept. Furthermore, since these embodiments are intended to explain, not limit, the scope of the technical concept is not limited by these embodiments. The scope of protection of the present disclosure shall be interpreted by the claims below, and all technical concepts within an equivalent scope shall be interpreted as being included within the scope of rights of the present disclosure.

[0164]

[0165] CROSS-REFERENCE TO RELATED APPLICATION

[0166] This patent application claims priority pursuant to Section 119(a) of the U.S. Patent Act (35 USC § 119(a)) to Korean Patent Application No. 10-2024-0191880 filed on December 19, 2024, all of which are incorporated by reference into this patent application. Additionally, this patent application claims priority in countries other than the United States for the same reasons as above, all of which are incorporated by reference into this patent application.

Claims

1. A report generation device comprising: an article data acquisition unit that acquires a plurality of first newspaper articles related to a preset field from a preset database; an article data refinement unit that extracts a plurality of second newspaper articles, from which duplicate articles have been removed, from the plurality of first newspaper articles based on a first similarity between the plurality of first newspaper articles; and an issue report determination unit that classifies the plurality of second newspaper articles based on at least one category and performs clustering of the plurality of second newspaper articles for each category to determine an issue report to be output.

2. The report generation device of claim 1, wherein the article data acquisition unit collects newspaper articles published on a website through a crawling technique during a preset period, and stores the collected newspaper articles in the preset database.

3. The report generation device of claim 1, wherein the article data refinement unit performs labeling on the plurality of first newspaper articles, and inputs the plurality of first newspaper articles into an artificial intelligence model to perform learning.

4. The report generation device of claim 1, wherein the article data refinement unit converts at least one word included in each of the plurality of first newspaper articles into a vector based on an embedding technique, and calculates the first similarity between two newspaper articles included in the plurality of first newspaper articles based on the vector of each of the two newspaper articles and a cosine similarity judgment technique.

5. The report generation device of claim 4, wherein the article data refinement unit extracts the plurality of second newspaper articles by deleting one of the two newspaper articles when there are two newspaper articles for which the first similarity is less than a preset threshold.

6. The report generation device of claim 3, wherein the issue report determination unit inputs the plurality of second newspaper articles into the artificial intelligence model and classifies them according to the preset categories.

7. The report generation device of claim 6, wherein the issue report determination unit determines at least one cluster according to a preset criterion based on the clustering performed on the plurality of second newspaper articles classified by the preset categories, and determines the issue report based on the plurality of second newspaper articles included in each cluster, and wherein the preset criterion includes determining the cluster based on the number of first newspaper articles determined to be duplicate articles based on the first similarity.

8. The report generation device of claim 7, wherein the issue report includes a first issue report containing a summary and details of the second newspaper articles included in the cluster, the summary of the first issue report is determined based on a Chain of Thought (CoT) technique, and the details of the first issue report are determined through prompt engineering.

9. The report generation device of claim 8, wherein the issue report includes a second issue report in which the titles of the second newspaper articles included in each cluster are output in order, and the order in which the titles of the second newspaper articles are output is determined based on a second similarity, calculated through the cosine similarity judgment technique, between the summary of the first issue report corresponding to each cluster and the title of each second newspaper article included in that cluster.

10. A report generation method comprising: an article data acquisition step of acquiring a plurality of first newspaper articles related to a preset field from a preset database; an article data refinement step of extracting a plurality of second newspaper articles, from which duplicate articles have been removed, from the plurality of first newspaper articles based on a first similarity between the plurality of first newspaper articles; and an issue report determination step of classifying the plurality of second newspaper articles based on at least one category and performing clustering on the plurality of second newspaper articles for each category to determine an issue report to be output.

11. The report generation method of claim 10, wherein the article data acquisition step comprises collecting, through a crawling technique, newspaper articles published on a website during a preset period, and storing the collected newspaper articles in the preset database.

12. The report generation method of claim 10, wherein the article data refinement step comprises performing labeling on the plurality of first newspaper articles, and inputting the plurality of first newspaper articles into an artificial intelligence model to perform training.

13. The report generation method of claim 10, wherein the article data refinement step comprises converting at least one word included in each of the plurality of first newspaper articles into a vector based on an embedding technique, and, for each pair of newspaper articles among the plurality of first newspaper articles, calculating the first similarity between the two newspaper articles based on their vectors and a cosine similarity determination technique.

14. The report generation method of claim 13, wherein the article data refinement step comprises, when there are two newspaper articles for which the first similarity is less than a preset threshold, deleting one of the two newspaper articles to extract the plurality of second newspaper articles.

15. The report generation method of claim 12, wherein the issue report determination step comprises inputting the plurality of second newspaper articles into the artificial intelligence model and classifying them according to the preset categories.

16. The report generation method of claim 15, wherein the issue report determination step comprises determining at least one cluster according to a preset criterion, based on clustering performed on the plurality of second newspaper articles classified by the preset categories, and determining the issue report based on the second newspaper articles included in each cluster, and wherein the preset criterion includes determining the cluster based on the number of first newspaper articles determined to be duplicate articles based on the first similarity.

17. The report generation method of claim 16, wherein the issue report includes a first issue report containing a summary and details of the second newspaper articles included in the cluster, the summary of the first issue report is determined based on a Chain of Thought (CoT) technique, and the details of the first issue report are determined through prompt engineering.

18. The report generation method of claim 17, wherein the issue report includes a second issue report in which the titles of the second newspaper articles included in each cluster are output in order, and the order in which the titles are output is determined based on a second similarity, calculated through the cosine similarity determination technique, between the summary of the first issue report corresponding to each cluster and the title of each second newspaper article included in that cluster.
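
For illustration only, the similarity computations recited in claims 13 and 18 might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names (`embed`, `cosine_similarity`, `pairwise_first_similarity`, `order_titles`) are hypothetical, a bag-of-words count vector stands in for the unspecified embedding technique, and titles are assumed to be ordered by descending second similarity, which the claims do not state.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for the embedding technique (assumption):
    a bag-of-words count vector over lowercase whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def pairwise_first_similarity(articles):
    """First similarity for every pair of articles, keyed by index pair (i, j), i < j."""
    vectors = [embed(a) for a in articles]
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    }


def order_titles(summary, titles):
    """Order titles by second similarity to the cluster summary.
    Descending order is an assumption made for this sketch."""
    s = embed(summary)
    return sorted(titles, key=lambda t: cosine_similarity(s, embed(t)), reverse=True)


if __name__ == "__main__":
    articles = ["steel prices rise", "steel prices rise", "new battery plant opens"]
    print(pairwise_first_similarity(articles))
    print(order_titles(
        "steel prices rise on strong demand",
        ["battery plant opens", "steel prices rise again", "steel demand"],
    ))
```

In a real system the vectors would come from a trained embedding model rather than word counts; the pairwise scores would then feed the threshold test of claim 14, and the per-cluster title ordering would feed the second issue report.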