Issue reporting method, computer program therefor, and computer-readable storage medium storing computer program

The issue reporting method addresses the limitations of conventional news summaries by filtering, clustering, and generating quantitatively rich reports tailored to user preferences, enhancing the relevance and depth of news analysis.

WO2026135284A1 · PCT designated stage · Publication Date: 2026-06-25 · POSCO HLDG INC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
POSCO HLDG INC
Filing Date
2025-12-18
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Conventional news summary services struggle with duplicate content inclusion and omission of important information, especially when processing large volumes of news data, and lack the ability to generate summaries with specific quantitative insights.

Method used

An issue reporting method that filters articles based on user preferences, removes duplicates by calculating embedding vector similarity, performs clustering at multiple levels, and generates insightful summaries with quantitative information using generative AI models.

Benefits of technology

Effectively filters and clusters news articles to provide meaningful, quantitatively rich summaries that reflect user preferences and highlight significant issues, improving information retrieval efficiency.

✦ Generated by Eureka AI based on patent content.

Figure KR2025022109_25062026_PF_FP_ABST

Abstract

This issue reporting method comprises: a data collection step of collecting and storing articles published on a plurality of dates; a recommendation modeling step of filtering the collected articles by using a recommendation model trained via user feedback-based labeling; a duplicate removal step of removing duplicate articles by converting the filtered articles into first embedding vectors by using a first embedding model and calculating a similarity between the first embedding vectors; a first clustering step of determining a plurality of first clusters by performing first clustering on articles from which duplicates have been removed, and selecting a predetermined number of first clusters from among the plurality of first clusters; a first report generation step of generating first issue reports from the selected first clusters; a second clustering step of determining a plurality of second clusters by performing second clustering on the first issue reports; and a second report generation step of generating a second issue report from the second clusters.

Description

Issue reporting method, computer program for this purpose, and computer-readable storage medium for storing the computer program

[0001] The present invention relates to an issue reporting method, a computer program for the same, and a computer-readable storage medium for storing the computer program.

[0002] With the recent surge in the volume of news articles, the importance of article summary services is increasing to allow users to efficiently obtain the information they want.

[0003] Conventional news summary services use a method of simply extracting keywords and generating summaries based on them.

[0004] However, this method has drawbacks when processing large amounts of news data, such as the inclusion of duplicate content in summaries or the omission of important information. Additionally, it has limitations in generating summaries that include specific quantitative information or insights.

[0005] The present invention provides an issue reporting method that can filter articles by reflecting user preferences.

[0006] The present invention provides an issue reporting method that can effectively remove duplicate articles while reflecting importance.

[0007] The present invention provides an issue reporting method that efficiently classifies major issues through clustering.

[0008] The present invention provides an issue reporting method that generates an insightful summary containing specific quantitative information.

[0009] The technical problems to be solved in this document are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art to which this invention belongs from the description below.

[0010] An issue reporting method according to an embodiment of the present invention may include: a data collection step of collecting and storing articles published on a plurality of dates; a recommendation modeling step of filtering the collected articles using a recommendation model learned through user feedback-based labeling; a duplicate removal step of converting the filtered articles into first embedding vectors using a first embedding model and calculating similarity between the first embedding vectors to remove duplicate articles; a first clustering step of performing first clustering on the articles from which duplicates have been removed to determine a plurality of first clusters and selecting a predetermined number of first clusters among the plurality of first clusters; a first report generation step of generating first issue reports from the selected first clusters; a second clustering step of performing second clustering on the first issue reports to determine a plurality of second clusters; and a second report generation step of generating second issue reports from the second clusters.

[0011] The above duplicate removal step may include: a step of converting the filtered articles into the first embedding vectors; and a step of determining articles in which the similarity between the converted first embedding vectors exceeds a predetermined threshold as the duplicate articles.

[0012] The second clustering step comprises: a step of converting the first issue reports into second embedding vectors using a second embedding model; and a step of performing the second clustering using the second embedding vectors; wherein the second embedding model may be different from the first embedding model.

[0013] The first embedding model includes a first language model trained to extract features of a first unit, and the second embedding model may include a second language model trained to extract features of a second unit that is wider than the first unit.

[0014] The first embedding model and the second embedding model may have different embedding dimensions, be trained with different training data, and use different tokenization methods.

[0015] Both the first report generation step and the second report generation step can be performed using a generative artificial intelligence model through prompt engineering with the Chain of Thought (CoT) technique applied.

[0016] The prompt engineering of the first report generation step is configured to identify major events of each of the first clusters, and the prompt engineering of the second report generation step may be configured to identify different detailed topics included in each of the second clusters and to identify major events of each of the detailed topics.

[0017] The first report generation step may include a step of generating a summary related to a major event of each of the first clusters, and the second report generation step may include a step of generating a summary related to a major event of each of the different detailed topics included in each of the second clusters.

[0018] The first report generation step is performed using a generative artificial intelligence model through first prompt engineering, and the second report generation step can be performed using the generative artificial intelligence model through second prompt engineering that is different from the first prompt engineering.

[0019] The first prompt engineering is configured to identify major events of each of the first clusters, and the second prompt engineering may be configured to identify detailed topics of each of the second clusters and identify major events of each of the detailed topics.

[0020] The first clustering step may include a step of calculating a first k value, which is a clustering parameter that determines the number of clusters, using a cluster evaluation index, and determining the plurality of first clusters by performing the first clustering according to the first k value.

[0021] The second clustering step may include a step of calculating a second k value, which is a clustering parameter that determines the number of clusters and is different from the first k value, using the cluster evaluation index, and determining the plurality of second clusters by performing the second clustering according to the second k value.

[0022] The above second k value may be smaller than the above first k value.

[0023] A computer-readable storage medium according to one embodiment of the present invention can record a computer program for executing the above methods on a computer.

[0024] A computer program according to one embodiment of the present invention may perform the above methods when executed by one or more processors of a computer.

[0025] According to one embodiment of the present invention, by using a user feedback-based recommendation model, articles reflecting the user's preferences are filtered preferentially from among numerous articles, thereby allowing for the selection of only articles meaningful to the user while reducing data throughput.

[0026] According to one embodiment of the present invention, duplicate articles are removed by calculating the similarity between embedding vectors, and by assigning weights based on the number of duplicate articles, the importance of issues that received attention during the relevant period can be reflected.

[0027] According to one embodiment of the present invention, by generating an issue report that includes specific quantitative information such as the number of contracts, export volume, investment amount, market share, and growth rate, it is possible to provide an insightful issue report rather than a simple news summary.

[0028] According to one embodiment of the present invention, key information necessary for the user can be provided quickly and effectively from a vast amount of articles.

[0029] FIG. 1 is a flowchart for explaining an issue reporting method according to one embodiment.

[0030] FIG. 2 is a diagram illustrating a data collection step according to one embodiment.

[0031] FIG. 3 is a diagram illustrating a recommendation modeling step according to one embodiment.

[0032] FIG. 4 is a diagram illustrating a duplicate removal step according to one embodiment.

[0033] FIG. 5 is a diagram illustrating a first clustering step according to one embodiment.

[0034] FIG. 6 is a diagram illustrating ranking in the first clustering step according to one embodiment.

[0035] FIG. 7 is a diagram illustrating a first report generation step according to one embodiment.

[0036] FIG. 8 illustrates an example of a first issue report generated by an issue reporting method according to one embodiment.

[0037] FIG. 9 is a diagram illustrating a second clustering step according to one embodiment.

[0038] FIG. 10 is a diagram illustrating a second report generation step according to one embodiment.

[0039] FIG. 11 illustrates an example of a second issue report generated by an issue reporting method according to one embodiment.

[0040] The embodiments described in this document and the configurations illustrated in the drawings are merely preferred examples of the disclosed invention, and various modifications that may replace the embodiments and drawings of this specification may exist at the time of filing this application.

[0041] The terms used in this document are for describing the embodiments and are not intended to limit or restrict the disclosed invention.

[0042] For example, in this specification, singular expressions may include plural expressions unless the context clearly indicates otherwise.

[0043] In this document, each of the phrases such as "A or B", "at least one of A and B", "at least one of A or B", "A, B or C", "at least one of A, B and C", and "at least one of A, B, or C" may include any one of the items listed together in the corresponding phrase, or all possible combinations thereof.

[0044] The term "and/or" includes a combination of multiple related described components or any of the multiple related described components. For example, "A and/or B" may include only "A," only "B," or both "A and B."

[0045] Additionally, terms such as “include” or “have” are intended to express the existence of the features, numbers, steps, actions, components, parts, or combinations thereof described in the specification, and do not exclude the additional existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.

[0046] When it is said that a component is "connected," "combined," "supported," or "in contact" with another component, this includes not only cases where the components are directly connected, combined, supported, or in contact, but also cases where they are indirectly connected, combined, supported, or in contact through a third component.

[0047] When it is said that a component is located "on" another component, this includes not only cases where one component is in contact with the other, but also cases where another component exists between the two components.

[0048] Meanwhile, terms such as "front," "rear," "left," "right," "top," and "bottom" used in the following description are defined based on the drawings; however, the shape and position of each component are not limited by these terms. For example, the front side may be defined as the +X side and the rear side as the -X side. For example, based on the drawings, the right side may be defined as the +Y side and the left side as the -Y side. For example, based on the drawings, the top side may be defined as the +Z side and the bottom side as the -Z side.

[0049] In addition, terms including ordinal numbers, such as "first," "second," etc., are used to distinguish one component from another and do not limit the components.

[0050] In addition, terms such as "~part," "~unit," "~block," and "~module" may refer to a unit that processes at least one function or operation. For example, the terms may refer to at least one piece of hardware such as an FPGA (field-programmable gate array) or ASIC (application-specific integrated circuit), at least one piece of software stored in memory, or at least one process processed by a processor.

[0051] Methods and functions according to one embodiment of the present invention may be realized in the form of hardware, software, or a combination thereof. When implemented in software, a program for performing the methods of the present invention may be stored on a computer-readable storage medium (or recording medium). The recording medium may include program instructions, data files, data structures, etc., either alone or in combination.

[0052] The program instructions stored on the above-mentioned recording medium may be those specifically designed and configured for the present invention, or those known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Additionally, examples of program instructions include machine code, such as that generated by a compiler, as well as high-level language code that can be executed by a computer using an interpreter, etc.

[0053] The above-mentioned hardware device (e.g., computer) may be configured to operate as at least one software module to perform the operation of the present invention, and vice versa.

[0054] An embodiment of the disclosed invention is described in detail below with reference to the attached drawings. Identical reference numbers or symbols in the attached drawings may indicate parts or components that perform substantially the same function.

[0055] The operating principle and embodiments of the present invention will be described below with reference to the attached drawings.

[0056] In the present invention, 'issue reporting' may be referred to as 'issue briefing', 'news reporting', 'news briefing', 'article reporting', 'article briefing', etc.

[0057] FIG. 1 is a flowchart for explaining an issue reporting method according to one embodiment.

[0058] Referring to FIG. 1, an issue reporting method (100) according to one embodiment of the present invention may include a data collection step (110), a recommendation modeling step (120), a duplicate removal step (130), a first clustering step (140), a first report generation step (150), a second clustering step (160), and a second report generation step (170).

[0059] In the data collection step (110), a large number of articles can be collected and stored. The large number of articles may include multilingual articles from major domestic and international media outlets.

[0060] In the recommendation modeling step (120), collected articles can be filtered using a recommendation model trained through user feedback-based labeling. At this time, the recommendation model is a model that reflects the actual user's preferences, and can perform primary filtering by calculating a recommendation score for each article.

[0061] In the duplicate removal step (130), the filtered articles are converted into first embedding vectors, and duplicate articles can be removed by calculating the similarity between the first embedding vectors (e.g., cosine similarity). Instead of simply removing identical articles, articles that are semantically similar can be detected and processed. At this time, some of the duplicate articles may be removed, and weights may be assigned to the remaining ones, taking into account that the article may be an important issue as it has been covered by multiple media outlets.

[0062] In the duplicate removal step (130), the filtered articles can be converted into first embedding vectors using the first embedding model. As will be explained later, the first embedding model may be different from the second embedding model used in the second clustering step (160).

[0063] In the first clustering step (140), clustering is performed on articles from which duplicates have been removed to determine multiple clusters, and a predetermined number of clusters can be selected from among them. An optimal first k value can be calculated using a cluster evaluation index, and clustering can be performed accordingly. When selecting clusters, a ranking that considers both the recommendation score of each article and the weight based on duplicates can be applied, so that clusters of high importance can be selected.

[0064] The k value (e.g., the first k value, the second k value) is a clustering parameter that determines the number of clusters and can be used to determine the boundaries of the clusters.

[0065] For example, the k value determines how many clusters the data will be divided into, and if k is 3, the entire data can be classified into 3 groups.

[0066] In each of the first and second clustering steps (140, 160), a silhouette analysis method may be used to find the optimal first and second k values.
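The source names silhouette analysis but does not write out the index. As a reference sketch, assuming the standard definition of the silhouette coefficient (the symbols a(i), b(i), s(i), N, and k* are introduced here for illustration and do not appear in the source):

```latex
% a(i): mean distance from item i to the other items in its own cluster
% b(i): smallest mean distance from item i to the items of any other cluster
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1

% The k value is then chosen as the candidate with the highest mean silhouette
% over all N items, where s_k(i) is the silhouette of item i under k clusters.
k^{*} = \arg\max_{k} \; \frac{1}{N} \sum_{i=1}^{N} s_k(i)
```

A mean value close to 1 indicates compact, well-separated clusters, which is why the candidate k with the highest mean is treated as the optimum.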

[0067] For convenience of explanation, the clusters clustered in the first clustering step (140) may be referred to as the first clusters.

[0068] In the first report generation step (150), an issue report can be generated using a generative AI model through prompt engineering with a Chain of Thought (CoT) technique applied to each selected first cluster. The generated summary can include quantitative information (specific figures such as the number of contracts, export volume, investment amount, market share, or growth rate), thereby generating a report that provides practical insights rather than a simple news summary.
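The source does not disclose the actual prompt text, so the following is only a hypothetical illustration of how a step-by-step (Chain of Thought style) prompt for one first cluster might be assembled. The function name `build_cot_prompt` and the wording of every step are assumptions, and no generative model is called.

```python
def build_cot_prompt(sentences):
    """Assemble a step-by-step prompt for one cluster (hypothetical wording)."""
    # Number the representative sentences so the model can refer to them.
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, start=1))
    steps = [
        "Step 1: Identify the major event that the sentences describe.",
        "Step 2: List the quantitative figures they mention, such as number of "
        "contracts, export volume, investment amount, market share, or growth rate.",
        "Step 3: Write a short issue report that states the event and cites those figures.",
    ]
    return (
        "Work through the following steps in order and show your reasoning.\n"
        + "\n".join(steps)
        + "\n\nSentences:\n"
        + numbered
    )


prompt = build_cot_prompt([
    "Company X signed 3 supply contracts.",
    "Exports rose 12% year over year.",
])
print(prompt)
```

Asking for the figures in a separate step before the summary is one way to steer the output toward the quantitative content the paragraph above describes.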

[0069] Through this series of steps, the present invention can automatically generate issue reports that are substantially meaningful to the user from a vast amount of multilingual articles, and in particular, enable effective identification of the latest trends in the field of future materials.

[0070] In the second clustering step (160), a plurality of second clusters can be determined by performing second clustering on the first issue reports generated in the first report generation step (150). At this time, a second embedding model different from the first embedding model may be used, and the second embedding model may include a second language model trained to extract features (e.g., contextual features) of a second unit (e.g., contextual unit).

[0071] To this end, the embedding dimensions of the first embedding model and the second embedding model (or the length of the vector generated by the first embedding model and the second embedding model) may be different.

[0072] In one embodiment, the first embedding model can generate an n-dimensional vector, and the second embedding model can generate an m-dimensional vector, where n may be smaller than m. That is, the embedding dimension of the second embedding model may be larger than that of the first embedding model. For example, the first embedding model may generate a 128-dimensional vector, and the second embedding model may generate a 512-dimensional vector. However, the embedding dimensions are not limited to these values, and any dimensions that satisfy the relationship described above may be adopted for the first and second embedding models.

[0073] In the second report generation step (170), a second issue report can be generated from the second clusters. When generating the second issue report, the different detailed topics included in each of the second clusters can be identified, and a summary related to the major event of each detailed topic can be generated.

[0074] Similar to the first report generation step (150), in the second report generation step (170), an issue report can be generated using a generative AI model through prompt engineering with a Chain of Thought (CoT) technique applied for each selected second cluster. The generated summary can include quantitative information (specific figures such as the number of contracts, export volume, investment amount, market share, or growth rate), thereby generating a report that provides practical insights rather than a simple news summary.

[0075] The specific processing steps for each stage are explained in detail with reference to the drawings described below.

[0076] FIG. 2 is a diagram illustrating a data collection step according to one embodiment.

[0077] Referring to FIG. 2, in the data collection step (110), the computer (1) can collect a plurality of articles (10). The articles (10) may include multilingual articles published by major domestic and foreign media outlets. For example, articles written in Korean, English, and Chinese may be collected.

[0078] The computer (1) can separate and process each collected article (10). Specifically, the body of each article can be encrypted and stored in a database, and metadata such as the title and date of creation can be stored directly in the database without encryption.

[0079] The encryption of the text can be performed using various encryption algorithms. This encryption process protects copyrighted content while allowing it to be decrypted and utilized in subsequent processing steps if necessary.

[0080] On the other hand, metadata such as titles and publication dates may be information frequently used for searching, filtering, and sorting articles, so it can be stored in a database without encryption to enable quick access.

[0081] The articles used to implement the issue reporting method (100) according to one embodiment of the present invention may be a plurality of articles that are classified by their metadata under the same date and stored.

[0082] When the issue reporting method (100) is performed using a plurality of articles classified under the same date, a daily issue report can be generated.

[0083] When the issue reporting method (100) is performed using a plurality of articles classified under dates within a predetermined period (e.g., one week, one month, one year), an issue report corresponding to that period (e.g., a weekly, monthly, or annual issue report) can be generated.
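A minimal sketch of selecting articles by their publication-date metadata for a daily or periodic report. The dictionary layout, the field names `title` and `published`, and the function `select_by_period` are assumptions made for illustration; the source does not specify a storage schema.

```python
from datetime import date, timedelta


def select_by_period(articles, end, days=1):
    """Return articles published in the `days`-day window ending on `end` (inclusive)."""
    start = end - timedelta(days=days - 1)
    return [a for a in articles if start <= a["published"] <= end]


articles = [
    {"title": "Battery plant investment announced", "published": date(2025, 12, 18)},
    {"title": "Nickel prices edge higher", "published": date(2025, 12, 17)},
    {"title": "Quarterly export figures released", "published": date(2025, 12, 10)},
]

daily = select_by_period(articles, end=date(2025, 12, 18), days=1)
weekly = select_by_period(articles, end=date(2025, 12, 18), days=7)
print(len(daily), len(weekly))
```

Because only the unencrypted metadata is consulted, the selection can run without decrypting any article body.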

[0084] This method of processing article body and metadata separately enables efficient system operation while maintaining data security. In particular, since rapid access to metadata is possible, processing speed can be improved in subsequent steps such as recommendation modeling or clustering.

[0085] Meanwhile, articles stored in the database may be automatically deleted or moved to a separate storage location after a specified period has elapsed. This can enable efficient management of the database.

[0086] FIG. 3 is a diagram illustrating a recommendation modeling step according to one embodiment.

[0087] Referring to FIG. 3, in the recommendation modeling step (120), articles (10) collected in the data collection step (110) can be filtered using a recommendation model (ML1) that reflects user preferences. For example, the recommendation model (ML1) can be trained through user feedback-based labeling data.

[0088] Inputting the articles (10) collected in the data collection step (110) into the recommendation model (ML1) may include inputting various features, such as the title, at least part of the body, or metadata of each article collected on the same date, into the recommendation model (ML1).

[0089] To train a recommendation model (ML1), user feedback on a predetermined number of articles (e.g., about 10,000) may be collected. This feedback may include user evaluations of the relevance, importance, or usefulness of the articles. The collected feedback may be converted into labeled data and used to train the recommendation model (ML1).

[0090] The trained recommendation model (ML1) can calculate a recommendation score for an article when a new article is input. The recommendation score may have a value within a predetermined range (e.g., between 0 and 1), and a high score may indicate that the article is more relevant or important to the user.

[0091] The recommendation model (ML1) can use various features as input, such as the title, body text, or metadata of each article. These features can be converted into an appropriate form through natural language processing techniques and input into the recommendation model (ML1).

[0092] Based on the calculated recommendation scores, articles with scores lower than a predetermined threshold can be filtered out and removed. This allows subsequent processing steps to handle only articles that are practically meaningful to the user.

[0093] Articles with scores lower than a predetermined threshold are filtered out and removed, and filtered articles (11) can be obtained.
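The threshold filtering described above can be sketched as follows. The scores are hard-coded stand-ins for the output of the recommendation model (ML1), and the class `ScoredArticle`, the function `filter_by_score`, and the threshold of 0.5 are illustrative assumptions rather than values from the source.

```python
from dataclasses import dataclass


@dataclass
class ScoredArticle:
    title: str
    score: float  # recommendation score in [0, 1]; stand-in for the model output


def filter_by_score(articles, threshold=0.5):
    """Drop articles whose recommendation score is lower than the threshold."""
    return [a for a in articles if a.score >= threshold]


articles = [
    ScoredArticle("Lithium supply deal signed", 0.91),
    ScoredArticle("Celebrity gossip roundup", 0.12),
    ScoredArticle("Steel export volume rises", 0.67),
]
kept = filter_by_score(articles, threshold=0.5)
print([a.title for a in kept])
```

The score stays attached to each surviving article, so it remains available when cluster importance is evaluated later.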

[0094] In one embodiment, the calculated recommendation score may also be used to evaluate the importance of the first cluster in the subsequent first clustering step (140). This may contribute to improving the quality of the issue report that is finally generated.

[0095] Meanwhile, the recommendation model (ML1) can be retrained periodically by incorporating new user feedback. This allows it to continuously reflect user preferences that change over time.

[0096] FIG. 4 is a diagram illustrating a duplicate removal step according to one embodiment.

[0097] Referring to FIG. 4, in the duplicate removal step (130), articles (11) filtered through the recommendation modeling step (120) can be converted into embedding vectors (12) through the first embedding model. The embedding vector conversion can be performed using natural language processing technology and can map the text information of each article into a high-dimensional vector space.

[0098] When embedding vectors (12) are generated, similarity between them can be calculated to identify duplicate articles. For example, cosine similarity between embedding vectors (12) can be calculated, and if the calculated cosine similarity exceeds a predetermined threshold, the articles can be determined to be duplicates.

[0099] In this case, not only are identical articles deemed duplicates, but articles dealing with semantically similar content may also be identified as duplicates. For example, articles from different media outlets that describe the same event differently may also be treated as duplicates.

[0100] Articles identified as duplicates are not all removed; instead, some may be removed while weights are assigned to the remainder. This takes into account that articles covered simultaneously by multiple media outlets are likely to be significant issues at that particular time.

[0101] Through this process, articles (13) with duplicates removed can be finally obtained. The articles (13) with duplicates removed and the weight information assigned to them can be utilized in the subsequent first clustering step (140).
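A minimal sketch of the duplicate removal described above, assuming a simple greedy pass and a weight equal to the number of articles merged into each kept article. The source does not state how the weight is computed or which duplicate is kept, so both choices, the toy three-dimensional vectors, and the threshold of 0.9 are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def deduplicate(vectors, threshold=0.9):
    """Greedy pass over the vectors; returns (kept_indices, weights).

    weights[i] counts the articles represented by kept article i, itself included.
    """
    kept = []
    weights = {}
    for i, vec in enumerate(vectors):
        for j in kept:
            if cosine_similarity(vec, vectors[j]) > threshold:
                weights[j] += 1  # duplicate of an article already kept
                break
        else:
            kept.append(i)  # no sufficiently similar article kept so far
            weights[i] = 1
    return kept, weights


vectors = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of article 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],   # near-duplicate of article 2
    [0.0, 0.0, 1.0],
]
kept, weights = deduplicate(vectors, threshold=0.9)
print(kept, weights)
```

Keeping a count instead of discarding the duplicates outright preserves the signal that several outlets covered the same event.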

[0102] Meanwhile, the first embedding model used when transforming the embedding vector in the duplicate removal step (130) can be selected from various natural language processing models and may be additionally trained to suit a specific domain as needed.

[0103] The first embedding model used when transforming the embedding vector in the duplicate removal step (130) may be different from the second embedding model used in the second clustering step (160), as will be described later.

[0104] FIG. 5 is a diagram illustrating a first clustering step according to one embodiment.

[0105] Referring to FIG. 5, in the first clustering step (140), the first clustering can be performed on the articles (13) from which duplicates have been removed, obtained in the duplicate removal step (130).

[0106] Since the articles (13) from which duplicates have been removed remain in the form of first embedding vectors, performing the first clustering on these articles (13) may include performing the first clustering on the first embedding vectors.

[0107] The first clustering step (140) may include a step of mapping the embedding vectors of the articles (13) from which duplicates have been removed into a two-dimensional space through a dimensionality reduction technique.

[0108] Dimensionality reduction can be used to visualize high-dimensional embedding vectors on a two-dimensional plane while preserving their relationships. This allows articles covering similar content to be located close to each other in two-dimensional space.

[0109] To perform the first clustering, the optimal first k value can first be calculated using the silhouette coefficient. For example, the first k value can be selected from values between 5 and 25, but is not limited thereto. As the parameter that determines the number of first clusters, the first k value can be determined in view of the distribution characteristics of the data.
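A minimal sketch of choosing k by the mean silhouette coefficient, assuming Euclidean distance and candidate labelings that some clustering algorithm has already produced. The clustering algorithm itself is not shown, and the function names, the toy points, and the candidate values k = 2 and k = 3 are illustrative assumptions (the source suggests a range such as 5 to 25 for real data).

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0.0:
            total += (b - a) / denom
    return total / len(points)


def pick_k(points, candidates):
    """candidates maps each k to a labeling; return the k with the best score."""
    return max(candidates, key=lambda k: silhouette_score(points, candidates[k]))


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
candidates = {
    2: [0, 0, 1, 1],  # two tight groups
    3: [0, 1, 2, 2],  # the left group split into singletons
}
best_k = pick_k(points, candidates)
print(best_k)
```

In this toy case the two-cluster labeling scores higher because splitting a tight group into singletons lowers the mean silhouette.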

[0110] When a first clustering is performed according to the calculated first k value, a plurality of first clusters (A, B, C, D, E) can be formed as shown in FIG. 5. A center value can be calculated for each first cluster, and a similarity (e.g., cosine similarity) between this center value and each data within the first cluster can be calculated.

[0111] Data within each first cluster whose cosine similarity with the center value exceeds a predetermined threshold may be selected. Additionally, the top sentences with the highest cosine similarity for each first cluster may be selected. This may be intended to identify sentences that best express the core content of the corresponding first cluster.
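As an illustration of the selection described above, the following minimal sketch computes a cluster's center value as the mean of its embedding vectors, keeps the items whose cosine similarity with that center exceeds a threshold, and returns the most similar ones first. The threshold 0.8, the top_n value, and the function names are assumptions; the text does not give concrete values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def representative_items(vectors, threshold=0.8, top_n=3):
    """Return (index, similarity) pairs for the items closest to the cluster center."""
    center = [sum(col) / len(vectors) for col in zip(*vectors)]
    scored = [(i, cosine(v, center)) for i, v in enumerate(vectors)]
    kept = [(i, s) for i, s in scored if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_n]
```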

[0112] Among the formed first clusters, a predetermined number (e.g., 4) of first clusters with relatively high importance may be finally selected. At this time, the importance of each first cluster may be evaluated by considering the number of data included in the first cluster, the recommendation scores of the articles included in the first cluster, and the weights based on duplication.

[0113] To this end, the first clustering step (140) may include a step of selecting only a predetermined number of first clusters with relatively high importance among the formed first clusters, and this step will be described in detail later with reference to FIG. 6.

[0114] For the convenience of explanation below, it is assumed that the predetermined number is 4, but it goes without saying that this predetermined number may change depending on the user's settings.

[0115] The selected top four first clusters can be used to organize the main content of the issue report during the subsequent report generation stage. In particular, the sentences with the highest cosine similarity selected for each first cluster can be used to summarize the main content of the corresponding first cluster.

[0116] Meanwhile, clustering algorithms can be selected from various algorithms depending on the characteristics of the data, and hyperparameters can be adjusted as needed.

[0117] FIG. 6 is a diagram illustrating ranking in the first clustering step according to one embodiment.

[0118] Referring to FIG. 6, the ranking process in the first clustering step (140) according to an embodiment of the present invention can be seen.

[0119] The first clustering step (140) may include a step of calculating a score for ranking a plurality of first clusters.

[0120] In the first clustering step (140), ranking scores for a plurality of first clusters (A, B, C, D, E) can be calculated. The ranking scores can be calculated by comprehensively considering the recommendation scores and weights of the articles included in each first cluster, together with the number of articles (number of data items) in that cluster.

[0121] For example, the step of calculating ranking scores for a plurality of first clusters (A, B, C, D, E) may include the step of calculating a ranking score for each of the plurality of first clusters based on the recommendation scores of articles included in each of the plurality of first clusters.

[0122] These recommendation scores may be assigned to each article in the recommendation modeling step (120).

[0123] As another example, the step of calculating ranking scores for multiple first clusters (A, B, C, D, E) may include the step of calculating a ranking score for each of the multiple first clusters based on the weights of the articles included in each of the multiple first clusters. These weights may be assigned to each article in the duplicate removal step (130).

[0124] In one embodiment, the step of calculating the score may include the step of calculating the ranking score of each of the plurality of first clusters such that the ranking score of a given first cluster becomes higher as the number of articles in that cluster to which weights have been assigned increases.

[0125] As another example, the step of calculating ranking scores for a plurality of first clusters (A, B, C, D, E) may include the step of calculating a ranking score for each of the plurality of first clusters based on the number of articles included in each of the plurality of first clusters.

[0126] As illustrated in FIG. 6, the ranking results can be presented in the form of a table. The table may include a recommendation score, a weight, and the number of articles for each first cluster.

[0127] Specifically, the first cluster B, which ranked 1st, may have a recommendation score b1, a weight b2, and a number of articles b3. The first cluster A, which ranked 2nd, may have a recommendation score a1, a weight a2, and a number of articles a3. The first cluster D, which ranked 3rd, may have a recommendation score d1, a weight d2, and a number of articles d3. The first cluster C, which ranked 4th, may have a recommendation score c1, a weight c2, and a number of articles c3, and the first cluster E, which ranked 5th, may have a recommendation score e1, a weight e2, and a number of articles e3.

[0128] In this case, the ranking score of each first cluster can be calculated as a combination of the average recommendation score, average weight, and total number of articles included in the corresponding first cluster. For example, the ranking score can be calculated by assigning a predetermined weight to each of these three factors.

[0129] For example, the ranking score of each first cluster can be determined based on the following [Formula 1].

[0130] [Formula 1]

[0131] Ranking score =

[0132] Here, K can mean the number of articles included in the corresponding first cluster, R_n can mean the recommendation score of the n-th article, and W_n can mean the weight of the n-th article.

[0133] However, the method of calculating the ranking scores of each first cluster is not limited to this.

[0134] In the first clustering step (140), a predetermined number (e.g., 4) of first clusters (e.g., B, A, D, and C) may be selected based on the ranking scores calculated as above, and the selected first clusters may be utilized in the subsequent report generation step (150).

[0135] Meanwhile, the weights of each element for calculating the ranking score can be adjusted according to user settings or data characteristics.
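The ranking and selection described above can be sketched as follows. The sketch follows the verbal description in paragraph [0128] (a weighted combination of the average recommendation score, the average weight, and the number of articles); it is not a reconstruction of Formula 1, whose expression is not reproduced in this text. The coefficients alpha, beta, and gamma and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Article:
    recommendation: float  # R_n, assigned in the recommendation modeling step (120)
    weight: float          # W_n, assigned in the duplicate removal step (130)

def ranking_score(articles, alpha=1.0, beta=1.0, gamma=0.1):
    """Assumed form: alpha * mean(R_n) + beta * mean(W_n) + gamma * K."""
    k = len(articles)
    if k == 0:
        return 0.0
    mean_r = sum(a.recommendation for a in articles) / k
    mean_w = sum(a.weight for a in articles) / k
    return alpha * mean_r + beta * mean_w + gamma * k

def top_clusters(clusters, count=4):
    """Return the names of the `count` clusters with the highest ranking scores."""
    ranked = sorted(clusters, key=lambda name: ranking_score(clusters[name]), reverse=True)
    return ranked[:count]
```

Because the per-factor coefficients are plain parameters, they can be tuned to user settings or data characteristics without changing the selection logic.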

[0136] FIG. 7 is a diagram illustrating a report generation step according to one embodiment.

[0137] Referring to FIG. 7, in the first report generation step (150), an issue report can be generated using a generative artificial intelligence model for each selected first cluster. At this time, prompt engineering can be performed by applying a specific persona, which can enable the report to be generated with a consistent perspective and style.

[0138] Step-by-step prompts (PT) can be input into the generative artificial intelligence model (ML2). These prompts utilize the Chain of Thought (CoT) technique, enabling the generation of a summary containing specific information through a sequential thought process.

[0139] In the first step, articles from the selected first cluster can be input into a generative artificial intelligence model (ML2). This may be a step that provides basic data for the generative artificial intelligence model (ML2) to process.

[0140] Inputting articles of the selected first cluster into the generative artificial intelligence model (ML2) may include inputting, into the generative artificial intelligence model (ML2), the data whose cosine similarity with the center value calculated for the corresponding first cluster exceeds a predetermined threshold.

[0141] In the second step, a request can be made to identify key events within the input first cluster. Through this, the generative artificial intelligence model (ML2) can identify key events or issues being addressed in that first cluster.

[0142] In the third step, a request may be made to extract specific numerical information included in each identified event. This may be intended to generate a summary containing numerical information (or quantitative information), such as the number of contracts, investment amount, market share, growth rate, etc.

[0143] A prompt according to one embodiment of the present invention may be engineered to include a prompt requesting the extraction of specific numerical information included in each identified event, following a prompt requesting the identification of a major event within an input first cluster.

[0144] Through this stepwise prompt processing, the generative artificial intelligence model (ML2) can generate specific and insightful summaries. The generated summaries can include not only qualitative descriptions but also quantitative numerical information, thereby providing more practical information.

[0145] Meanwhile, the specific details of the prompt (PT) may be adjusted according to the operational purpose of the system or the characteristics of the data, and additional steps may be included as necessary.
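The three-step prompt described above can be sketched as a simple string builder. This is only an illustration: the wording of each step, the persona text, and the function name build_first_prompt are assumptions, and the call to the generative model itself is deliberately left out.

```python
def build_first_prompt(articles, persona="an industry analyst"):
    """Assemble a step-by-step (CoT-style) prompt for one selected first cluster."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(articles, start=1))
    steps = [
        # Step 1: supply the articles of the selected first cluster.
        "Step 1. Read the following articles from one cluster:\n" + numbered,
        # Step 2: ask for the key events in the cluster.
        "Step 2. Identify the key events covered by these articles.",
        # Step 3: ask for the numerical information tied to each event.
        "Step 3. For each identified event, extract the specific numerical "
        "information it contains (e.g., number of contracts, investment "
        "amount, market share, growth rate).",
        "Finally, write a concise summary of each event that includes those figures.",
    ]
    return f"You are {persona}.\n\n" + "\n\n".join(steps)
```

Placing the extraction request after the event-identification request keeps the order described in paragraph [0143].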

[0146] When generating an issue for each first cluster, only articles deemed to have high importance can be selected and extracted as relevant articles, rather than including all articles within the first cluster. In this case, importance can be determined by comprehensively considering the recommendation score, weight, and other factors of the article.

[0147] The association between generated issues and related articles can be verified through cosine similarity. Specifically, the cosine similarity between the final generated issue and each related article can be calculated, and articles with a similarity below a predetermined threshold can be excluded from the list of related articles.

[0148] Subsequently, related articles can be sorted in order of highest cosine similarity to the issue. This allows the final issue report to sequentially reference the most relevant articles first.
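The verification and ordering of related articles described above can be sketched as follows, assuming that the generated issue and each candidate article have already been converted into embedding vectors. The threshold value 0.5 and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_articles(issue_vec, article_vecs, threshold=0.5):
    """Drop articles below the threshold, then sort the rest by descending similarity."""
    scored = [(aid, cosine(issue_vec, vec)) for aid, vec in article_vecs.items()]
    kept = [(aid, s) for aid, s in scored if s >= threshold]
    return [aid for aid, _ in sorted(kept, key=lambda pair: pair[1], reverse=True)]
```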

[0149] Meanwhile, it goes without saying that the descriptions regarding the first report generation step (150) can be applied to the second report generation step (170).

[0150] However, actions that are added only in the second report generation step (170) will be described later.

[0151] FIG. 8 illustrates an example of an issue report generated by an issue reporting method according to one embodiment.

[0152] Referring to FIG. 8, the issue report (RT1) may include multiple summaries (CR1, CR2) generated for each first cluster. The summaries (CR1, CR2) may include key contents extracted from each first cluster.

[0153] For example, the summary (CR1) covers news regarding the financing for the expansion of a specific company (Company L)'s lithium production facility, and may include specific numerical information (g1) of $2.26 billion.

[0154] The summary (CR2) covers news of the discovery of lithium reserves in a specific country (Country S) and may include specific numerical information (g2) that the discovered reserves can meet more than nine times the global demand for lithium.

[0155] In this way, by providing the key contents of each first cluster along with specific numerical information (g1, g2), the issue report (RT1) allows the scale or impact of each issue to be grasped quantitatively.

[0156] Meanwhile, these specific figures can be extracted through prompt engineering applying the CoT technique described earlier and included in the summary.

[0157] In one embodiment, the issue reporting method may be provided with various categories distinguished according to various metadata.

[0158] For example, issue reporting methods may include methods for reporting issues regarding articles published domestically, methods for reporting issues regarding articles published overseas (e.g., the United States), and methods for reporting issues regarding articles published in specific groups of countries (e.g., all countries, Asia, Europe, etc.).

[0159] In one embodiment, steps 110 to 150 described above may be a daily issue reporting method performed based on articles collected during the day.

[0160] In one embodiment, the issue reporting method of the present invention may further include an issue reporting method corresponding to a predetermined period (e.g., a weekly issue reporting method, a monthly issue reporting method, an annual issue reporting method) performed based on articles collected during a predetermined period.

[0161] In one embodiment, the first issue report generated by steps 110 to 150 described above may be a report reflecting issues for a shorter period than the second issue report described later.

[0162] FIG. 9 is a diagram illustrating a second clustering step according to one embodiment.

[0163] Referring to FIG. 9, in the second clustering step (160), second clustering can be performed on the first issue reports (RT1) generated in the first report generation step (150).

[0164] The second clustering step (160) may include a step of converting the first issue reports (RT1) into second embedding vectors using the second embedding model.

[0165] For example, the first embedding model may include a first language model trained to extract features (e.g., semantic features) of a first unit (e.g., sentence unit), and the second embedding model may include a second language model trained to extract features (e.g., contextual features) of a second unit (e.g., paragraph unit) that is broader than the first unit.

[0166] As another example, the first embedding model and the second embedding model may have different embedding dimensions, be trained with different training data, and use different tokenization methods.

[0167] For example, the embedding dimension of the second embedding model can be larger than the embedding dimension of the first embedding model.

[0168] The cluster centroids obtained with the first and second embedding models can differ in character. In the case of the first embedding model, since semantic features are extracted at the sentence level, the centroid of each cluster can have a vector value close to the core sentences or keywords of the articles belonging to that cluster. For example, specific sentence-level centroids such as "POSCO Future M's investment plan" or "expansion of lithium production" can be formed.

[0169] On the other hand, since the second embedding model extracts contextual features at the paragraph level, the centroids of each cluster can have vector values representing more abstract and comprehensive topics. For example, the centroids of clusters F, G, and H can represent broader topics such as "the battery materials industry as a whole," "global production facility investment trends," and "market size and growth trends."

[0170] These differences in centroid values can also affect the cosine similarity calculation. In the first clustering step (140), specific sentence-level similarity is calculated to enable fine distinction, whereas in the second clustering step (160), overall context-level similarity is calculated to enable more comprehensive grouping.

[0171] Due to these differences, the optimal k value at each clustering step (140, 160) may also differ, and this can be determined through cluster evaluation metrics.

[0172] In the second clustering step (160), analysis at a different level from the individual article unit processing in the duplicate removal step (130) or the primary classification in the first clustering step (140) can be performed.

[0173] For example, unlike the first embedding model which extracts semantic features at the level of individual sentences such as "POSCO Future M breaks 52-week high," the second embedding model can grasp the overall context and flow of the first issue reports composed of multiple sentences. For example, it can identify the relationships between various first issue reports under the broad theme of "Global growth of the battery materials industry."

[0174] As illustrated in FIG. 9, the second embedding vectors can be mapped onto a two-dimensional space through a dimensionality reduction technique. In this process, it can be seen that first issue reports with similar contexts are positioned close to each other, naturally forming three large clusters (F, G, H). For example, reports related to the first topic (e.g., battery materials) can be clustered in cluster F, reports related to the second topic (e.g., investment in battery production facilities) can be clustered in cluster G, and reports related to the third topic (e.g., battery market trends) can be clustered in cluster H.

[0175] This second clustering can have meaning beyond simple classification by topic. Whereas the duplicate removal step (130) focuses on filtering out similar articles and the first clustering step (140) focuses on classifying articles by detailed topics, the second clustering step (160) can focus on finding deeper connections between already organized content.

[0176] Additionally, the second k value used in the second clustering step (160) may be set to a value smaller than the first k value used in the first clustering step (140). For example, if a first k value between 5 and 25 is used in the first clustering step (140), a smaller second k value between 5 and 10 may be used in the second clustering step (160). This may be because the second clustering step (160) is intended for classification into larger categories. Through this multi-stage clustering structure, the present invention can effectively identify issue trends or industry trends that are difficult to discover at the individual article level.

[0177] Meanwhile, the ranking process described in the first clustering step (140) can also be applied in the second clustering step (160).

[0178] In the second clustering step (160), ranking scores for multiple second clusters (F, G, H) can be calculated. The ranking scores can be calculated by comprehensively considering the recommendation scores, weights, and number of issue reports included in each second cluster.

[0179] A score calculated by comprehensively considering the recommendation score, weight, and number of issue reports can be assigned to each issue report when issue reports are generated. For example, in FIG. 6, an issue report corresponding to cluster B may have a higher score than an issue report corresponding to cluster A.

[0180] The step of calculating ranking scores for a plurality of second clusters (F, G, H) may include the step of calculating a ranking score for each of the plurality of second clusters based on the scores of issue reports included in each of the plurality of second clusters.
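As a minimal sketch of the second-stage ranking described above, each second cluster can be scored from the scores already attached to its first issue reports. Taking the mean of the report scores is an assumption; the text states only that the cluster score is based on the report scores. All names are hypothetical.

```python
def rank_second_clusters(report_scores, assignment):
    """Rank second clusters by the mean score of the first issue reports they contain.

    report_scores: report id -> score assigned when the report was generated
    assignment:    report id -> second cluster label (e.g., "F", "G", "H")
    """
    totals, counts = {}, {}
    for report_id, cluster in assignment.items():
        totals[cluster] = totals.get(cluster, 0.0) + report_scores[report_id]
        counts[cluster] = counts.get(cluster, 0) + 1
    means = {c: totals[c] / counts[c] for c in totals}
    return sorted(means, key=means.get, reverse=True)
```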

[0181] FIG. 10 is a diagram illustrating a second report generation step according to one embodiment.

[0182] Referring to FIG. 10, in the second report generation step (170), an issue report can be generated using a generative artificial intelligence model for each selected second cluster.

[0183] At this time, step-by-step prompts (PT) can be input into the generative artificial intelligence model (ML2). Unlike the first report generation step (150), the prompts of the second report generation step (170) can be configured to perform more detailed step-by-step analysis.

[0184] In the first step, a selected second cluster can be input. This may be a step that provides basic data to be processed by a generative artificial intelligence model (ML2).

[0185] In the second step, a request may be made to identify distinct subtopics within the input second cluster. This may be intended to separate and analyze different perspectives or aspects even within the same cluster.

[0186] In the third step, a request may be made to identify major events for each identified subtopic. This may be intended to identify specific events or changes related to each subtopic.

[0187] In the fourth step, a request may be made to extract quantitative information contained in each identified event. This may be intended to generate a summary containing specific numerical information, such as the number of contracts, investment amount, market share, and growth rate.

[0188] That is, the first report generation step (150) and the second report generation step (170) are both performed using a generative artificial intelligence model through prompt engineering to which the Chain of Thought (CoT) technique is applied. However, the first report generation step (150) is performed through first prompt engineering, whereas the second report generation step (170) can be performed through second prompt engineering that is different from the first prompt engineering.

[0189] The first prompt engineering may be configured to identify the major events of each of the first clusters, and the second prompt engineering may be configured to identify the detailed topics of each of the second clusters and to identify the major events of each of the topics.

[0190] That is, the prompt engineering applied in the first report generation step (150) and the second report generation step (170) may be different from each other, and the second prompts by the second prompt engineering may include additional prompts compared to the first prompts by the first prompt engineering.

[0191] The prompt configuration of this second report generation step (170) can enable more systematic and hierarchical analysis compared to the first report generation step (150). Through this, contents that are distinct from one another within the same cluster can be clearly organized and included in the final second issue report.

[0192] Accordingly, the second issue report may include multiple detailed topics for a single main topic.
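The four-step prompt of the second report generation step can be sketched in the same way. Compared with a first-stage prompt, it adds a subtopic-identification step before the event and figure extraction. The step wording, the persona text, and the function name build_second_prompt are assumptions, and no model call is shown.

```python
def build_second_prompt(first_reports, persona="an industry analyst"):
    """Assemble a four-step (CoT-style) prompt for one selected second cluster."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(first_reports, start=1))
    steps = [
        # Step 1: supply the first issue reports of the selected second cluster.
        "Step 1. Read the following first issue reports, which belong to one cluster:\n"
        + numbered,
        # Step 2: the additional step that is specific to the second stage.
        "Step 2. Identify the distinct subtopics within this cluster.",
        # Step 3: events are identified per subtopic rather than per cluster.
        "Step 3. For each subtopic, identify the major events.",
        # Step 4: quantitative information per event.
        "Step 4. For each event, extract the quantitative information it contains "
        "(e.g., number of contracts, investment amount, market share, growth rate).",
        "Finally, write a summary organized by main event, subtopic, and figures.",
    ]
    return f"You are {persona}.\n\n" + "\n\n".join(steps)
```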

[0193] FIG. 11 illustrates an example of a second issue report generated by an issue reporting method according to one embodiment.

[0194] Referring to FIG. 11, the second issue report (RT2) may include multiple summaries (SJ1, SJ2) separated by main events. Each summary may be hierarchically organized to include one main event (MJ1, MJ2) and multiple subtopics belonging thereto.

[0195] Summary 1 (SJ1) covers the main event (MJ1) related to Company S's loan and workforce reduction. This summary may include two different subtopics (AJ1). The first subtopic concerns Company S's restructuring and may include specific numerical information such as a workforce reduction of approximately 1,000 people. The second subtopic concerns Company S's factory expansion plan and may include quantitative information such as a loan of $1 billion.

[0196] Summary 2 (SJ2) covers the main event (MJ2) related to lithium battery safety issues. This summary may also include two different subtopics (AJ2). The first subtopic concerns Company S's safety investments and may include a specific figure of an investment of $100 million. The second subtopic concerns the status of lithium battery accidents and may include a statistical figure of 1,000 accidents occurring every month.

[0197] As such, the second issue report (RT2) can have a hierarchical structure in which main events, subtopics, and specific numerical information are organically connected. This structure can simultaneously provide the overall context and detailed information regarding each issue.
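The hierarchical structure described above (main event, subtopics, numerical information) can be illustrated with a small data model. The class and field names are hypothetical and the rendering format is an assumption; the sample values in the usage note are taken from the FIG. 11 example.

```python
from dataclasses import dataclass, field

@dataclass
class Subtopic:
    title: str
    figures: list[str] = field(default_factory=list)  # quantitative information

@dataclass
class Summary:
    main_event: str
    subtopics: list[Subtopic] = field(default_factory=list)

@dataclass
class SecondIssueReport:
    summaries: list[Summary] = field(default_factory=list)

    def render(self) -> str:
        """Render the report as indented plain text, one line per level."""
        lines = []
        for summary in self.summaries:
            lines.append(summary.main_event)
            for sub in summary.subtopics:
                lines.append(f"  - {sub.title}: {', '.join(sub.figures)}")
        return "\n".join(lines)
```

For instance, a summary for the main event "Company S loan and workforce reduction" could hold one subtopic for the restructuring (about 1,000 people) and one for the factory expansion plan (a $1 billion loan).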

[0198] According to the present invention, by using a first embedding model that extracts features of a first unit in the duplicate removal step (130) and the first clustering step (140), and using a second embedding model that extracts features of a second unit wider than the first unit in the second clustering step (160), features of different levels can be effectively extracted.

[0199] According to the present invention, by using different k values in the first clustering step (140) and the second clustering step (160), detailed subject classification can be performed first, and classification in a larger category can be performed second. Through this, hierarchical and systematic issue analysis can be performed.

[0200] According to the present invention, by applying different prompt engineering in the first report generation step (150) and the second report generation step (170), a summary of individual issues can be performed primarily, and an in-depth analysis distinguishing main events and subtopics can be performed secondarily.

[0201] According to the present invention, the second issue report finally generated has a hierarchical structure in which a main event, a subtopic, and specific numerical information are organically connected, thereby enabling the simultaneous provision of the overall context and detailed information regarding each issue.

[0202] Meanwhile, the disclosed embodiments may be implemented in the form of a recording medium that stores instructions executable by a computer. The instructions may be stored in the form of program code and, when executed by a processor, may generate a program module to perform the operation of the disclosed embodiments. The recording medium may be implemented as a computer-readable recording medium.

[0203] Computer-readable recording media include all types of recording media that store instructions that can be decoded by a computer. Examples include ROM (read-only memory), RAM (random access memory), magnetic tape, magnetic disk, flash memory, optical data storage devices, etc.

[0204] Additionally, computer-readable recording media may be provided in the form of non-transitory storage media. Here, 'non-transitory storage media' simply means that it is a tangible device and does not contain a signal (e.g., electromagnetic waves), and this term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily. For example, 'non-transitory storage media' may include a buffer in which data is stored temporarily.

[0205] According to one embodiment, the method according to the various embodiments disclosed herein may be provided as included in a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable recording medium (e.g., compact disc read-only memory (CD-ROM)), or distributed online (e.g., download or upload) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product (e.g., downloadable app) may be temporarily stored or temporarily created on a device-readable recording medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.

[0206] As described above, the disclosed embodiments have been explained with reference to the attached drawings. Those skilled in the art will understand that the present invention may be practiced in forms different from the disclosed embodiments without changing the technical spirit or essential features of the invention. The disclosed embodiments are illustrative and should not be interpreted restrictively.

Claims

1. An issue reporting method comprising: a data collection step of collecting and storing articles published on a plurality of dates; a recommendation modeling step of filtering the collected articles by using a recommendation model trained via user feedback-based labeling; a duplicate removal step of removing duplicate articles by converting the filtered articles into first embedding vectors by using a first embedding model and calculating a similarity between the first embedding vectors; a first clustering step of determining a plurality of first clusters by performing first clustering on the articles from which duplicates have been removed, and selecting a predetermined number of first clusters from among the plurality of first clusters; a first report generation step of generating first issue reports from the selected first clusters; a second clustering step of determining a plurality of second clusters by performing second clustering on the first issue reports; and a second report generation step of generating a second issue report from the second clusters.

2. The issue reporting method of claim 1, wherein the duplicate removal step comprises a step of determining, as duplicate articles, articles for which the similarity between the first embedding vectors exceeds a predetermined threshold.

3. The issue reporting method of claim 1, wherein the second clustering step comprises: a step of converting the first issue reports into second embedding vectors by using a second embedding model; and a step of performing the second clustering by using the second embedding vectors, wherein the second embedding model is different from the first embedding model.

4. The issue reporting method of claim 3, wherein the first embedding model includes a first language model trained to extract features of a first unit, and the second embedding model includes a second language model trained to extract features of a second unit wider than the first unit.

5. The issue reporting method of claim 3, wherein the first embedding model and the second embedding model have different embedding dimensions, are trained with different training data, and use different tokenization schemes.

6. The issue reporting method of claim 1, wherein both the first report generation step and the second report generation step are performed by using a generative artificial intelligence model through prompt engineering to which a Chain of Thought (CoT) technique is applied.

7. The issue reporting method of claim 6, wherein the prompt engineering of the first report generation step is configured to identify major events of each of the first clusters, and the prompt engineering of the second report generation step is configured to identify different detailed topics included in each of the second clusters and to identify major events for each of the detailed topics.

8. The issue reporting method of claim 1, wherein the first report generation step comprises a step of generating a summary related to major events of each of the first clusters, and the second report generation step comprises a step of generating a summary related to major events of each of the different detailed topics included in each of the second clusters.

9. The issue reporting method of claim 1, wherein the first report generation step is performed by using a generative artificial intelligence model through first prompt engineering, and the second report generation step is performed by using the generative artificial intelligence model through second prompt engineering different from the first prompt engineering.

10. The issue reporting method of claim 9, wherein the first prompt engineering is configured to identify major events of each of the first clusters, and the second prompt engineering is configured to identify detailed topics for each of the second clusters and to identify major events for each of the detailed topics.

11. The issue reporting method of claim 1, wherein the first clustering step comprises a step of calculating, by using a cluster evaluation index, a first k value, which is a clustering parameter that determines the number of clusters, and determining the plurality of first clusters by performing the first clustering according to the first k value, and the second clustering step comprises a step of calculating, by using the cluster evaluation index, a second k value, which is a clustering parameter that determines the number of clusters and is different from the first k value, and determining the plurality of second clusters by performing the second clustering according to the second k value.

12. The issue reporting method of claim 11, wherein the second k value is smaller than the first k value.

13. A computer-readable storage medium having recorded thereon a program for executing, on a computer, the method of any one of claims 1 to 12.

14. A computer program stored on a computer-readable storage medium, wherein the computer program, when executed on one or more processors of a computer, performs steps for reporting an issue, the steps comprising: a data collection step of collecting and storing articles published on a plurality of dates; a recommendation modeling step of filtering the collected articles by using a recommendation model trained via user feedback-based labeling; a duplicate removal step of removing duplicate articles by calculating a similarity between first embedding vectors of the filtered articles; a first clustering step of determining a plurality of first clusters by performing first clustering on the articles from which duplicates have been removed, and selecting a predetermined number of first clusters from among the plurality of first clusters; a first report generation step of generating first issue reports from the selected first clusters; a second clustering step of determining a plurality of second clusters by performing second clustering on the first issue reports; and a second report generation step of generating a second issue report from the second clusters.