Cognitive aggregation and authoring methods and systems

The system combines inputs from multiple NLP pipelines using a MKP algorithm and transformer models to generate personalized text content, overcoming the limitations of existing NLP technologies in creating unified and user-tailored summaries.

JP7880944B2 · Inactive · Publication Date: 2026-06-26 · INTERNATIONAL BUSINESS MACHINE CORPORATION

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Patents
Current Assignee / Owner
INTERNATIONAL BUSINESS MACHINE CORPORATION
Filing Date
2022-07-08
Publication Date
2026-06-26
Estimated Expiration
Not applicable · inactive patent

AI Technical Summary

Technical Problem

Existing natural language processing (NLP) technologies struggle to combine diverse outputs from multiple pipelines into a unified summary or document, and lack the ability to personalize generated text for user preferences.

Method used

A system that uses a Multiple Knapsack Problem (MKP) algorithm and masking techniques to combine inputs from multiple NLP pipelines, applying transformer models and decision optimization algorithms to generate personalized machine-generated text content based on user preferences and sentiment analysis.

Benefits of technology

Enables the creation of unified, personalized machine-generated text content by optimizing the joint probability of relevant and diverse information, addressing the limitations of existing NLP technologies in combining and personalizing text generation.

✦ Generated by Eureka AI based on patent content.


Abstract

A cognitive aggregation and authoring method and system is provided that uses natural language processing techniques to identify a set of candidate text items among a plurality of digital content datasets based on their relevance to a specified subtopic, group the candidate text items into a predetermined number of groups using the relevance scores and feature vectors, train a pre-trained encoder-decoder model using the specified group of selected text items, where the pre-trained encoder-decoder model is pre-trained to generate text content according to a particular writing style, generate machine-generated text content in the particular writing style using the pre-trained encoder-decoder model, resulting in articles on the specified subtopic based on the specified group of selected text items, and finally transmit the articles to a remote web server as updates for the website.

Description

[Technical Field]

[0001] This invention relates generally to the field of computer data distribution over communication networks, and more specifically to context-aware recombination of natural language artifacts.

[Background technology]

[0002] Natural Language Processing (NLP) is a field of computer science and artificial intelligence (AI) that encompasses linguistics involving some form of processing of natural language input. Natural language input is typically in the form of unstructured data. Unstructured data refers to information that does not have a predefined data model or is not organized in a predefined manner. Unstructured data is often primarily text in some form, such as written or spoken form. Broadly speaking, NLP typically involves converting unstructured data into structured data.

[0003] Examples of NLP include Natural Language Generation (NLG) and Natural Language Understanding (NLU). NLU is a branch of NLP that primarily involves parsing text and extracting metadata from unstructured content such as concepts, entities, keywords, categories, emotions, feelings, relationships, and semantic roles. NLU typically uses deep learning algorithms to parse and extract such information from unstructured text. For example, NLU can be used to analyze customer feedback by performing semantic analysis of customer comments and determining whether the comments are positive or negative.

[0004] NLG is a branch of NLP that primarily focuses on creating machine-generated content. For example, NLG can be used for extractive summarization, which involves analyzing a large document to identify key terms and phrases, and then using that information to create a summary of the document.

[0005] The above and other forms of NLP are available through their respective platforms and services for performing tasks such as data mining or extractive summarization. These techniques can be used to set up NLP pipelines for analyzing a body or corpus of information, and the pipelines return different results depending on the techniques implemented in each one. This can be useful in situations where different forms of information are required. For example, one could set up an NLP pipeline to search for statistical information in a particular area of interest, and another NLP pipeline to search for article reviews in the same area of interest. These two NLP pipelines require different forms of NLP because they require different types of information. However, because the two pipelines differ, the statistical information and the article reviews may relate to the same area of interest yet focus on different aspects of it. As a result, it may be difficult or impossible to combine the outputs of multiple NLP pipelines into a single, unified summary, article, or document.

[Summary of the Invention]

[0006] The exemplary embodiments provide context-aware recombination of natural language artifacts. One embodiment includes loading multiple digital content datasets into memory as part of content extraction from a corpus, wherein the multiple digital content datasets satisfy a query statement and the query statement includes a content topic to which the multiple digital content datasets relate. This embodiment also includes identifying a set of candidate text items from the multiple digital content datasets based on the relevance of each candidate text item to a subtopic, using a computed relevance score for each candidate text item. The computed relevance score is determined by parsing the text content of each candidate text item using one or more natural language processing techniques, and the parsing yields a feature vector for each candidate text item, each feature vector including a relevance value and a quality value. This embodiment also includes grouping, as a result of executing a set of instructions in a processor, the candidate text items of the set into a predetermined number of groups of candidate text items using the computed relevance score and feature vector. This embodiment also includes training a first pre-trained encoder-decoder model using a first designated group of candidate text items from the groups of candidate text items, the first pre-trained encoder-decoder model being pre-trained to generate text content according to a first style. This embodiment also includes using the first pre-trained encoder-decoder model to generate machine-generated text content in the first style, which results in a first article on a designated subtopic based on the first designated group of candidate text items. This embodiment also includes training a second pre-trained encoder-decoder model using a second designated group of candidate text items from the groups of candidate text items, the second pre-trained encoder-decoder model being pre-trained to generate text content according to a second style. This embodiment also includes using the second pre-trained encoder-decoder model to generate machine-generated text content in the second style, which results in a second article on a designated subtopic based on the second designated group of candidate text items. This embodiment also includes sending the first and second articles to a remote web server as updates for a website hosted by the remote web server. Other embodiments of this aspect include corresponding computer systems, devices, and computer programs recorded on one or more computer storage devices, each configured to perform the operations of this embodiment.

[0007] Some such embodiments further include loading user-generated content into memory and identifying selected user-generated content based on its relevance to a topic. Loading relevant user-generated content advantageously allows for embodiments that further include analyzing the selected user-generated content to determine its polarity as part of determining the user's sentiment toward the topic. Determining the user's sentiment toward the topic allows for embodiments that further include generating a weight vector based on that sentiment, where grouping the candidate text items into groups further includes using the weight vector to adjust the values of the feature vector. Adjusting the feature vector with such a weight vector advantageously allows the first and second articles to be tailored to the user, so that the first and second articles can be sent as updates for a custom webpage that is customized for the user.

[0008] According to another aspect of the present disclosure, one embodiment includes performing a query process to search for content related to a specified topic in multiple corpora. This embodiment also includes extracting a set of candidate text items from the search results received from the query process based on the relevance of each candidate text item to a specified subtopic, using a computed relevance score for each candidate text item, the computed relevance score being determined by analyzing the text content of each candidate text item using one or more natural language processing techniques. This embodiment also includes obtaining a feature vector for each candidate text item as a result of analyzing the text content of the candidate text items, each feature vector including a respective relevance value and a respective quality value. This embodiment also includes loading user-generated content into memory. This embodiment also includes analyzing the user-generated content to determine its polarity as part of determining the user's sentiment toward the specified subtopic. This embodiment also includes generating a weight vector based on the user's sentiment toward the specified subtopic. This embodiment also includes grouping, as a result of executing a set of instructions in the processor, the candidate text items of the set into a predetermined number of groups of candidate text items using the computed relevance scores, the weight vector, and the feature vectors. This embodiment also includes training a pre-trained encoder-decoder model using a designated group of candidate text items from among the groups of candidate text items. This embodiment also includes using the pre-trained encoder-decoder model to generate machine-generated text content that results in an article on a designated subtopic based on the designated group of candidate text items. This embodiment also includes sending the article to a remote web server as an update for displaying personalized content for the user based on the user's sentiment. Other embodiments of this aspect include corresponding computer systems, devices, and computer programs recorded on one or more computer storage devices, each configured to perform the operations of this embodiment.

[0009] One embodiment includes a computer-readable program product. The computer-readable program product includes a computer-readable storage medium and program instructions stored in the storage medium.

[0010] One embodiment includes a computer system. The computer system includes a processor, computer-readable memory, a computer-readable storage medium, and a program stored in the storage medium for execution by the processor via memory.

[0011] Novel features that are considered characteristic of the present invention are described in the appended claims. However, the present invention itself, preferred uses, further purposes and advantages will best be understood by reading and referring to the following detailed description of exemplary embodiments together with the accompanying drawings.

[Brief Description of the Drawings]

[0012]
[Figure 1] This is a block diagram showing a data processing system capable of implementing the exemplary embodiment.
[Figure 2] This block diagram shows an exemplary configuration according to one embodiment of the example.
[Figure 3A] This block diagram shows an exemplary configuration according to one embodiment of the example.
[Figure 3B] This block diagram shows another example configuration according to one embodiment of the example.
[Figure 4] This block diagram shows another example configuration according to one embodiment of the example.
[Figure 5] This is a block diagram illustrating an exemplary CAA system according to one exemplary embodiment.
[Figure 6] This is a block diagram illustrating an exemplary CAA system according to one exemplary embodiment.
[Figure 7] This flowchart illustrates the process of cognitive aggregation and authoring according to one exemplary embodiment.
[Figure 8] This flowchart illustrates the process of cognitive aggregation and authoring according to one exemplary embodiment.
[Figure 9] This figure shows a cloud computing environment according to one embodiment of the present invention.
[Figure 10] This figure shows an abstraction model layer according to one embodiment of the present invention.

[Modes for carrying out the invention]

[0013] The amount of information available for consumption continues to grow at an unprecedented pace. For example, in recent years, natural language text data, including web pages, news articles, scientific literature, emails, corporate documents, and social media (blog posts, forum posts, product reviews, and tweets, etc.), has increased dramatically. This information is consumed by a growing audience of internet users around the world. Recent statistics show that more than 4 billion people now have access to the internet in some form, which is more than half of the world's population.

[0014] A vast number of users who can access a large amount of information are generating an ever-increasing demand for high-performance software tools to assist people in effectively and efficiently managing and analyzing large amounts of information. Applications that address this demand include personalized news aggregators that collect in one place distributable web content such as online newspapers, blogs, podcasts, and video blogs (vlogs) to make it easier to view. Such applications have been recognized as being able to be improved by using NLP techniques to generate summaries of each of the aggregated articles and provide these summaries to users, thereby enabling users to more quickly receive an overview of the aggregated content.

[0015] NLP techniques have been demonstrated to be useful for many such applications that assist users in gaining insights from the large amount of information available. Many different types of NLP techniques and algorithms are known for producing text content for consumption by users. The delivery, content, and style of text content vary depending on the NLP techniques used to collect and produce it. Typically, multiple NLP techniques are combined to form an NLP pipeline that performs a combination of NLP processes. The use of multiple different NLP pipelines for collecting and processing information may be desirable to realize the diversity of styles and content of the information accumulated. However, this diversity becomes a problem when attempting to assemble a unified composition such as an article, document, or web page using the content provided by different NLP pipelines.

[0016] As an example, an application or web page dedicated to a large event such as a sports event, exhibition or gathering may collect content using multiple different NLP pipelines that use different types of NLP techniques. One NLP pipeline searches for articles related to a topic from a large amount of information, uses extractive summarization to summarize those articles, ranks the summaries based on relevance to the topic, quality of information and other factors, and outputs some of the top-ranked summaries to provide factoids. As used herein, "factoid" refers to a summary of a news article related to a topic or interesting but little-known (i.e., trivial) information related to the topic. Another NLP pipeline queries statistical quantities related to a topic provided as structured data in a statistical database, uses natural language generation to write new sentences based on the statistical quantities related to the topic, ranks the statistical sentences based on relevance to the topic, quality of information and other factors, and outputs some of the top-ranked statistical sentences to provide statistical sentences, also referred to herein as "findings".

[0017] The resulting summaries and statistical sentences may relate to a particular topic such as a particular sport, music genre or technical field, but may still be targeting different aspects of that topic. For example, if the topic is a particular sport, the summaries may be targeting the player's home life, new stadiums, training techniques, while the statistical sentences may be targeting the player's performance under specific weather conditions, team records, and league records. Although presented separately, these are all relevant information about the given topic that the user may find useful and interesting. However, there is a technical problem because these results contain information about aspects of the topic that are too unrelated to be combined in a consistent manner as a single unified article about that topic.

[0018] Currently, AI technology also includes NLP technologies such as natural language generation, which generate machine-generated content such as machine-generated stories, content, summaries, and novel texts. These technologies generate text that is closely related to the input data, which limits them to using input data from a single NLP pipeline to generate new text content. Therefore, these technologies lack the ability to overcome the aforementioned technical problems when attempting to generate consistent content from diverse NLP pipelines.

[0019] Other technical problems include the inability of these technologies to personalize the generated text for each user. User preferences can be identified, for example, based on user input or activity, to determine things like a user's sentiment towards a particular topic, or their preferred style of writing, or both. However, existing text generation technologies lack the ability to generate personalized text content whose characteristics, such as style or topic sentiment, match the user's preferences.

[0020] The exemplary embodiments address these technical problems by providing a means for combining inputs from multiple NLP pipelines and using the combined inputs as a basis for generating new machine-generated text content. The disclosed embodiments create an NLP pipeline that combines inputs from multiple NLP pipelines using a Multiple Knapsack Problem (MKP) algorithm. The disclosed embodiments apply a masking technique that detects user preferences and adapts the text generation process to generate text having one or more characteristics selected based on the user preferences.

[0021] The problem of generating machine-generated text content from multiple different NLP pipelines can be addressed by identifying content items from multiple relevant NLP pipelines. As a non-limiting example provided to aid understanding of this disclosure, one embodiment includes a first NLP pipeline that provides input data for a factoid and a second NLP pipeline that provides input data for insights. In some embodiments, the first and second NLP pipelines query one or more corpora for digital content datasets that satisfy a query statement. In some such embodiments, the query statement includes a reference to a particular topic of interest (e.g., basketball in the example above).

[0022] In some such embodiments, the first and second NLP pipelines provide input by extracting digital content datasets that satisfy the query (e.g., factoids by the first NLP pipeline and insights by the second NLP pipeline) and loading them into memory. The digital content items may be text items and may include words, sentences, or other blocks of text. In some such embodiments, the input is used to create content for an application running on a user device that displays content related to a particular topic. This display content, which in this embodiment is called current relevant content, is updated periodically so that older created content is sometimes replaced by newer created content.

[0023] The problem of identifying content items from multiple related and different NLP pipelines can be rephrased as optimizing the joint probability of having factoids, insights, and current relevant content together within the same application display. As an optimization problem, this problem can be modeled as equation (1) below.

[0024] [Equation (1): not reproduced in this text]

[0025] [Equations (2) to (4): not reproduced in this text]

[0026] In equations (1) to (4):

P(R_c) = The probability of having current relevant content, which depends on whether input data related to the current topic can be retrieved. For example, if the current topic is an event the user is participating in, e.g., a basketball tournament, this is the probability that the NLP pipeline will provide input related to basketball or a basketball tournament.

P(Factoids|Insights) = The probability of obtaining factoids related to insights. Decision optimization algorithms are used to create groups of packages of factoids and insights that are highly relevant and correlated.

P(Insights) = The probability of generating high-quality and diverse insights from source data that is highly precise structured information about the topic (i.e., basketball in this embodiment). Embodiments use natural language generation, transformer models, and decision optimization algorithms to improve the likelihood of generating high-quality and diverse insights.

[0027] Focusing further on the optimization problem modeled by equations (1) to (4), the model can be optimized by maximizing equation (4), which can be maximized by maximizing two terms shown as equations (5) and (6) below.

[0028]
P(Factoids|R_c) (5)
P(Insights|R_c) (6)

[0029] Therefore, given the content a user is currently viewing, the NLP pipeline can focus on optimizing factoid and insight extraction. The disclosed embodiment attempts to address this optimization by using a novel multi-head attention transformer to focus on the most relevant information based on customization or page editing.

[0030] In some exemplary embodiments, the optimization problem is modeled as MKP. There are many known algorithms for solving MKP, any of which can be used as decision optimization algorithms to create groups of packages. The MKP algorithm attempts to group a digital content dataset (e.g., factoid sentences received from a first NLP pipeline and insight sentences received from a second NLP pipeline).

[0031] In some such embodiments, certain candidate items are selected from a dataset to reduce the amount of processing required by the MKP algorithm before the content in the NLP pipeline is processed by the MKP algorithm. In some such embodiments, one or more sets of candidate text items (corresponding to their respective subtopics) are identified from multiple digital content datasets. In some such embodiments, the candidate text items within each set are identified based on their relevance to the subtopic associated with the set of candidate items.

[0032] Following the above embodiment where the topic is basketball, three example subtopics, provided for illustrative purposes only, may include opposing team pairings, player profiles, and injury reports. In some embodiments, candidate items are analyzed for relevance to each of the subtopics. In some such embodiments, candidate items may also be evaluated for other factors such as quality (e.g., number of grammatical or spelling errors, offensive content, etc.), sentiment (e.g., intensity of expressed opinion, side of argument supported, etc.), length, or other metrics. In some such embodiments, this analysis yields a computed relevance score for each candidate text item, which is determined by analyzing the text content of each candidate text item using one or more natural language processing techniques. This information is used to generate feature vectors for each factoid and insight candidate item, where each “feature” is a pre-selected subtopic, and the feature value indicates how similar the factoid or insight is to its respective subtopic. In some embodiments, the feature vectors may contain more or fewer features than in this embodiment. Therefore, the feature vector includes numerical values representing the degree to which the associated factoid or insight is relevant to each subtopic, a quality score, and values for other desired elements.
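As an illustration only, the feature vector described above can be sketched in Python as follows. The bag-of-words cosine similarity, the keyword strings for each subtopic, and the externally supplied quality score are all assumptions made for this sketch; the disclosure does not specify which NLP techniques compute the relevance values.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased unigram counts; a stand-in for a real NLP parser."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two unigram count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def feature_vector(item, subtopic_keywords, quality):
    """One relevance value per subtopic, followed by a quality value."""
    words = bag_of_words(item)
    relevance = [cosine(words, bag_of_words(k)) for k in subtopic_keywords]
    return relevance + [quality]


# Hypothetical keyword lists for the three example subtopics.
SUBTOPICS = [
    "team matchup opponent",   # opposing team pairings
    "player profile career",   # player profiles
    "injury report status",    # injury reports
]

vec = feature_vector("Injury report lists the guard as doubtful", SUBTOPICS, quality=0.9)
print(vec)
```

In this sketch the last entry carries the quality score and the preceding entries carry per-subtopic relevance, mirroring the layout the paragraph describes.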

[0033] In the exemplary embodiment, the MKP algorithm receives candidate items and feature vectors for each candidate factoid and each insight. As a result of executing a set of instructions in the processor, the MKP algorithm uses the computed relevance score and feature vectors to group the candidate text items of the set of candidate text items into a predetermined number of groups of candidate text items.
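The grouping step can be sketched as a greedy heuristic for a multiple knapsack problem. This is not the algorithm of the disclosure, which leaves the choice of MKP solver open; it is a minimal sketch that assumes each group is a knapsack with a word-count capacity, each item's value is its relevance score, and each item prefers the group matching its strongest feature. The dictionary keys and the function name greedy_mkp are hypothetical.

```python
def greedy_mkp(items, num_groups, capacity):
    """Greedy multiple-knapsack heuristic (approximate, not an exact solver).

    items: list of dicts with "text", "score", and "features" keys.
    Each group holds at most `capacity` words.
    """
    groups = [[] for _ in range(num_groups)]
    remaining = [capacity] * num_groups

    def density(item):
        # Value per unit of weight: relevance score per word.
        return item["score"] / max(len(item["text"].split()), 1)

    for item in sorted(items, key=density, reverse=True):
        weight = len(item["text"].split())
        feats = item["features"]
        # Prefer the group aligned with the item's dominant feature.
        preferred = max(range(len(feats)), key=feats.__getitem__) % num_groups
        order = [preferred] + [g for g in range(num_groups) if g != preferred]
        for g in order:
            if weight <= remaining[g]:
                groups[g].append(item["text"])
                remaining[g] -= weight
                break  # items that fit nowhere are simply left out
    return groups


candidates = [
    {"text": "The guard averages 24 points per game", "score": 0.9, "features": [0.1, 0.8]},
    {"text": "Star guard grew up in Ohio", "score": 0.7, "features": [0.2, 0.9]},
    {"text": "The teams last met in the 2019 final", "score": 0.8, "features": [0.9, 0.1]},
    {"text": "Home team has won five straight meetings", "score": 0.6, "features": [0.7, 0.2]},
]
groups = greedy_mkp(candidates, num_groups=2, capacity=15)
print(groups)
```

A greedy pass is fast but can miss the optimal packing; an exact or solver-based MKP method could be substituted without changing the inputs or outputs.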

[0034] Some exemplary embodiments involve the use of extractive summarization by a cross-entropy quality measure to select a subset of sentences from one of a group of packages. In some embodiments, a subset of sentences can be selected using a method such as cross-entropy summarization (CES) to select the most "promising" subset of sentences. As background, the cross-entropy (CE) method provides a general Monte Carlo optimization framework for solving hard combinatorial problems. For this purpose, CE takes as input, for example: [expressions not reproduced in this text]

[0035] For a given sentence s ∈ D, ψ(s) indicates the likelihood that the sentence is included in summary S. Starting with the selection policy with the highest entropy (i.e., ψ_0(s) = 0.5), the CE method is [expression not reproduced in this text]

[0036] For this purpose, ψ*(·) is incrementally learned using importance sampling techniques. At each iteration t = 1, 2, ..., a sample of N sentence subsets S_j is generated according to the selection policy ψ_{t-1}(·) learned in the previous iteration t - 1. The likelihood of selecting a sentence s ∈ D at iteration t is estimated (by cross-entropy minimization) by the following equation (7).

[0037] [Equation] where δ[·] denotes the Kronecker delta (indicator) function and γ_t denotes the (1 - ρ) quantile (ρ ∈ (0, 1)) of the sample performance [Equation]. Thus, the likelihood of selecting a sentence s ∈ D is higher if that sentence is included in more (subset) samples with performance exceeding the current minimum required quality target value γ_t. In some embodiments, ψ_t(·) can be further smoothed as follows: ψ_t(·)' = αψ_{t-1}(·) + (1 - α)ψ_t(·), where α ∈ [0, 1].

[0038] At the end, the CE method is expected to converge to the globally optimal selection policy ψ*(·). Based on this convergence, a single summary S* ~ ψ*(·) can then be generated. To ensure that only admissible summaries are generated, whenever the length of a sampled summary S_j exceeds the L-word limit, [Equation] can always be set. Alternatively, the maximum length constraint may be enforced directly during sampling.
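The iteration described above can be sketched in Python. Because equation (7) is not reproduced in this text, the update below is an assumption based on the surrounding description: the new selection likelihood of a sentence is the fraction of elite samples (those scoring at least γ_t) that contain it. Assigning negative infinity to over-length samples, the toy quality function, and all parameter defaults are likewise assumptions made for illustration.

```python
import random


def ce_select(sentences, quality, max_words, n_samples=200, rho=0.1,
              alpha=0.5, iters=20, seed=0):
    """Cross-entropy method sketch for sentence subset selection."""
    rng = random.Random(seed)
    psi = [0.5] * len(sentences)              # highest-entropy starting policy
    lengths = [len(s.split()) for s in sentences]
    for _ in range(iters):
        samples = []
        for _ in range(n_samples):
            pick = [rng.random() < p for p in psi]
            words = sum(n for n, keep in zip(lengths, pick) if keep)
            chosen = [s for s, keep in zip(sentences, pick) if keep]
            # Assumption: over-length samples score -inf so they are never elite.
            score = quality(chosen) if words <= max_words else float("-inf")
            samples.append((score, pick))
        ranked = sorted(sc for sc, _ in samples)
        gamma = ranked[int((1 - rho) * (n_samples - 1))]   # (1 - rho) quantile
        elite = [pick for sc, pick in samples
                 if sc >= gamma and sc != float("-inf")]
        if not elite:
            continue
        # Fraction of elite samples that contain each sentence.
        new_psi = [sum(pick[i] for pick in elite) / len(elite)
                   for i in range(len(sentences))]
        # Smoothing: alpha * previous policy + (1 - alpha) * new estimate.
        psi = [alpha * old + (1 - alpha) * new for old, new in zip(psi, new_psi)]
    return psi


QUERY = {"guard", "points", "team", "final"}   # hypothetical query terms


def toy_quality(summary):
    """Toy stand-in for a quality predictor: number of query terms covered."""
    words = {w for s in summary for w in s.split()}
    return len(words & QUERY)


sentences = [
    "guard scores thirty points",
    "team wins the final game",
    "weather was mild and sunny today",
]
policy = ce_select(sentences, toy_quality, max_words=9)
print(policy)
```

A final summary could then be drawn by sampling each sentence with its learned likelihood, or by keeping sentences whose likelihood exceeds a threshold; either choice goes beyond what the text specifies.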

[0039] In some embodiments, an unsupervised setting is assumed, and therefore there are no actual reference summaries available for training. Similarly, it is not possible to directly optimize the actual quality target Q(S|q,D). Instead, Q(S|q,D) can be "substituted" by several summary quality predictors, such as the following:

[0040] [Equation not reproduced in this text]

[0041] Such quality "predictors" are used to estimate the saliency or focus level of a given candidate summary S. [Equation not reproduced in this text]

[0042] [Equation not reproduced in this text]

[0043] In some embodiments, this summarizer may employ several different predictors, for example, five different predictors. As a non-limiting example, in one embodiment, the first two predictors use unigram language models constructed from a factoid corpus and a statistical corpus, respectively. These employ known techniques to measure how much of the information in a sentence covers its query and how much of the sentence is dedicated to that query. A third predictor determines how much of its package set the summary covers. A fourth predictor measures entropy to realize diversity in the sentence. The last two predictors bias selection toward sentences that are longer and that are described by predicate-argument structures. In some embodiments, this summarizer may employ more or fewer predictors.
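Two of the predictor types mentioned above can be sketched as follows. These are illustrative stand-ins, not the predictors of the disclosure: the entropy predictor here is the Shannon entropy of the unigram distribution of a candidate summary, and the coverage predictor is the fraction of query terms that appear in it. The function names and whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def unigram_entropy(sentences):
    """Shannon entropy (in bits) of the word distribution; higher means more diverse."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def query_coverage(sentences, query_terms):
    """Fraction of query terms that appear in the candidate summary."""
    if not query_terms:
        return 0.0
    words = {w for s in sentences for w in s.lower().split()}
    return len(words & set(query_terms)) / len(query_terms)


summary = ["guard scores thirty points", "team wins the final game"]
print(unigram_entropy(summary))
print(query_coverage(summary, ["guard", "points", "injury", "final"]))
```

How the individual predictor scores are combined into a single objective is not recoverable from this text, so the sketch leaves them separate.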

[0044] Therefore, the optimization formula shown above as equation (7) provides a model for finding the best combination of sentences that conform to the original constraints (equations (5) and (6) above). This combination of sentences is then input into an algorithm that generates machine-generated text content, such as a T5 transformer, by rewriting the sentences into unified units of text.

[0045] In some embodiments, the content is personalized for the end user. In some such embodiments, the process for personalizing the content loads user-generated content into memory. For example, in some embodiments, the process sends the user a feedback request, which includes a request for feedback on an opinion expressed in an editorial article related to a subtopic. The user may be asked to comment on that opinion or simply indicate whether they agree or disagree with it. Thus, the user-generated content would include the feedback received from the user in response to the feedback request. In some embodiments, the user-generated content may include one or more comments posted by the user in response to a post or article on the subtopic on, for example, a news website or a social media website. In such embodiments, the user has provided in advance a list of such websites in which they actively participate and has opted in to allow the process to access the comments they previously posted there.

[0046] In some embodiments, the process analyzes user-generated content to determine its polarity, and the process uses this to determine the user's sentiment towards a given subtopic. In some such embodiments, the process generates a weight vector based on the user's sentiment towards the given subtopic. This weight vector is input to the MKP algorithm along with the candidate items and feature vectors of the candidate factoids and insights. The weight vector has a value for each value of the feature vector, and these values act as a mask to emphasize or de-emphasize certain features of the feature vector. For example, if a user is interested in the first and third subtopics but not the second subtopic, the weight values may be set to a first value (e.g., 1 or 100) for the subtopics of interest to the user and to a second value (e.g., zero) for the subtopics not of interest to the user. As a result, the MKP algorithm groups the candidate text items into a predetermined number of groups based on the calculated relevance score, the weight vector, and the feature vector.
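The masking step can be sketched as an element-wise product of the feature vector and the weight vector. The mapping from polarity to interest (positive polarity means interested), the threshold, and the choice to leave the quality feature unmasked are assumptions made for this sketch; the names sentiment_weights and apply_mask are hypothetical.

```python
def sentiment_weights(polarity_by_subtopic, on=1.0, off=0.0, threshold=0.0):
    """Map per-subtopic polarity to mask weights (assumed: positive means interested)."""
    return [on if p > threshold else off for p in polarity_by_subtopic]


def apply_mask(feature_vector, weights):
    """Element-wise product: emphasizes or de-emphasizes each feature."""
    return [f * w for f, w in zip(feature_vector, weights)]


# User is interested in the first and third subtopics but not the second.
polarity = [0.6, -0.4, 0.8]
weights = sentiment_weights(polarity) + [1.0]   # keep the quality feature unmasked
features = [0.2, 0.9, 0.5, 0.8]                 # three subtopic relevances + quality
masked = apply_mask(features, weights)
print(masked)
```

With a zero weight, an item that is relevant only to the second subtopic contributes nothing on that feature when the grouping algorithm scores it, which is the emphasis effect the paragraph describes.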

[0047] In some such embodiments, the process trains a pre-trained encoder-decoder model using a designated group of candidate text items from a group of candidate text items. The process then uses the pre-trained encoder-decoder model to generate machine-generated text content, resulting in articles on designated subtopics based on the designated group of candidate text items.

[0048] The exemplary embodiments can be implemented with respect to any type of data, data source, or access to a data source via a data network. Any type of data storage device can, within the scope of the invention, provide data to embodiments of the invention locally in a data processing system or via a data network. If an embodiment is described using a mobile device, any type of data storage device suitable for use with that mobile device can, within the scope of the exemplary embodiment, provide data to such embodiment locally in the mobile device or via a data network.

[0049] The exemplary embodiments are described using specific code, designs, architectures, protocols, layouts, diagrams, and tools only as examples, and are not limiting to the exemplary embodiments. Furthermore, in some cases, specific software, tools, and data processing environments are used only as examples to facilitate the explanation of the exemplary embodiments. The exemplary embodiments can be used with other equivalent or similar structures, systems, applications, or architectures for similar purposes. For example, other equivalent mobile devices, structures, systems, applications, or architectures therefor may be used with such embodiments of the present invention within the scope of the present invention. An exemplary embodiment can be implemented in hardware, software, or a combination thereof.

[0050] The examples in this disclosure are used for illustrative purposes only and are not limiting to the exemplary embodiments. Additional data, behaviors, actions, tasks, activities, and operations can be conceived from this disclosure and are intended to be within the scope of the exemplary embodiments.

[0051] Any advantages listed herein are merely examples and are not intended to limit the embodiments described herein. Additional or different advantages may also be realized by specific exemplary embodiments. Furthermore, one specific exemplary embodiment may have some or all of the advantages listed above, or none of them.

[0052] While this disclosure includes a detailed description of cloud computing environments, it should be understood that implementations of the teachings described herein are not limited to cloud computing environments. Rather, embodiments of the present invention can be implemented with any other type of computing environment that is currently known or may be developed in the future.

[0053] Cloud computing is a service distribution model that enables convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal administrative effort or interaction with service providers. This cloud model may include at least five features, at least three service models, and at least four deployment models.

[0054] The features are as follows:

[0055] On-demand self-service: Cloud consumers can unilaterally and automatically provision computing functions such as server time and network storage as needed, without requiring human intervention with service providers.

[0056] Broad network access: Functionality is available over the network and accessed through standard mechanisms that facilitate use by heterogeneous thin-client or thick-client platforms (e.g., mobile phones, laptops, and personal digital assistants (PDAs)).

[0057] Resource Pooling: To accommodate multiple consumers using a multi-tenant model, a provider's computing resources are pooled, and different physical and virtual resources are dynamically allocated and reallocated as needed. Consumers generally have no control over or knowledge of the exact location of the resources provided, but they may be able to specify a higher level of abstraction for location (e.g., country, state, or data center), thus creating a sense of location independence.

[0058] Rapid elasticity: Functionality can be provisioned rapidly and elastically, in some cases automatically, to scale out quickly, and can be released rapidly to scale in quickly. To consumers, the functionality available for provisioning often appears unlimited and can be purchased in any quantity at any time.

[0059] Measured service: Cloud systems automatically control and optimize resource utilization by leveraging a metering capability at a level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency to both providers and consumers regarding the services used.

[0060] The service model is as follows:

[0061] Software as a Service (SaaS): The functionality provided to consumers is the use of a provider's applications running on cloud infrastructure. These applications are accessible from various client devices via thin-client interfaces such as web browsers (e.g., web-based email). Consumers do not manage or control the underlying cloud infrastructure, including the network, servers, operating system, storage, or individual application functions, with the possible exception of limited user-specific application configuration settings.

[0062] Platform as a Service (PaaS): The functionality offered to consumers is the ability to deploy consumer-created or acquired applications, written using programming languages and tools supported by the provider, on a cloud infrastructure. Consumers do not manage or control the underlying cloud infrastructure, including the network, servers, operating system, or storage, but they can control the deployed applications and, in some cases, the application hosting environment configuration.

[0063] Infrastructure as a Service (IaaS): The functionality provided to consumers is the provisioning of processing, storage, networking, and other basic computing resources, allowing consumers to deploy and run any software, including operating systems and applications. Consumers do not manage or control the underlying cloud infrastructure, but they can control the operating system, storage, and deployed applications, and in some cases, have limited control over selected network components (e.g., host firewalls).

[0064] The deployment model is as follows:

[0065] Private Cloud: This cloud infrastructure operates solely for the organization. It can be managed by the organization or a third party and can reside on-premises or off-premises.

[0066] Community Cloud: This cloud infrastructure is shared by several organizations to support specific communities with common interests (e.g., missions, security requirements, policies, and compliance matters). It can be managed by an organization or a third party and can reside on-premises or off-premises.

[0067] Public Cloud: This cloud infrastructure is available to the public or large industry groups and is owned by the organization that sells the cloud services.

[0068] Hybrid Cloud: This cloud infrastructure is a combination of two or more clouds (private, community, or public) that remain separate entities but are connected by standardized or proprietary technologies (e.g., cloud bursting for load balancing between clouds) that enable data and application portability.

[0069] Cloud computing environments are service-oriented, focusing on statelessness, loose coupling, modularity, and semantic interoperability. At the heart of cloud computing is the infrastructure, including a network of interconnected nodes.

[0070] Referring to the drawings, particularly Figures 1 and 2, these drawings illustrate an exemplary data processing environment in which an exemplary embodiment can be implemented. Figures 1 and 2 are merely examples and are not intended to claim or imply any limitation to environments in which different embodiments can be implemented. Many modifications can be made to the illustrated environment based on the following description for a particular implementation.

[0071] Figure 1 shows a block diagram of a network of a data processing system capable of implementing an exemplary embodiment. The data processing environment 100 is a network of computers capable of implementing an exemplary embodiment. The data processing environment 100 includes network 102. Network 102 is a medium used to provide communication links between various devices and computers connected to each other within the data processing environment 100. Network 102 may include connections such as wires, wireless communication links, or fiber optic cables.

[0072] The client or server is merely an example of the role of a particular data processing system connected to network 102, and is not intended to exclude other configurations or roles of these data processing systems. Servers 104 and 106 are connected to network 102 along with storage unit 108. Software applications can be run on any computer within the data processing environment 100. Clients 110, 112, and 114 are also connected to network 102. A data processing system such as server 104 or 106, or client 110, 112, or 114, may contain data and may have software applications or software tools running on it. In one embodiment, data processing system 104 includes memory 124, which includes application 105A that can be configured to implement one or more of the data processor functions described herein, according to one or more embodiments.

[0073] Server 106 is connected to network 102 together with storage unit 108. Storage unit 108 includes a database 109 configured to store data such as image data and attribute data, as described herein in relation to various embodiments. Server 106 is a conventional data processing system. In one embodiment, server 106 includes a neural network application 105B, which can be configured to implement one or more of the processor functions described herein, according to one or more embodiments.

[0074] Clients 110, 112, and 114 are also connected to network 102. A conventional data processing system, such as server 106 or client 110, 112, or 114, may contain data and may have software applications or software tools running on it that perform conventional computing processes.

[0075] Figure 1 shows specific components that can be used in an exemplary implementation of one embodiment, without implying any limitation to such architecture, and is merely an example. For example, servers 104 and 106 and clients 110, 112, and 114 are shown as servers and clients only as examples and do not imply any limitation to a client-server architecture. As another example, one embodiment can be distributed across several data processing systems and data networks as shown, while another embodiment can be implemented on a single data processing system within the scope of the exemplary embodiment. Data processing systems 104, 106, 110, 112, and 114 also represent exemplary nodes, partitions, and other configurations in a cluster suitable for implementing one embodiment.

[0076] Device 132 is an example of a device described herein. For example, Device 132 can take the form of a smartphone, tablet computer, laptop computer, client 110 in fixed or portable form, wearable computing device, or any other suitable device. Any software application described as running in another data processing system in Figure 1 can be configured to run in Device 132 in a similar manner. Any data or information stored or generated in another data processing system in Figure 1 can be configured to be stored or generated in Device 132 in a similar manner.

[0077] Applications 105A / 105B implement one embodiment described herein. Applications 105A / B run on any of the servers 104 and 106, clients 110, 112 and 114, and device 132.

[0078] Servers 104 and 106, storage unit 108, clients 110, 112, and 114, and device 132 can be connected to network 102 using a wired connection, wireless communication protocol, or other suitable data connection. Clients 110, 112, and 114 may be, for example, personal computers or network computers.

[0079] In the illustrated embodiment, server 104 can provide data such as boot files, operating system images, and applications to clients 110, 112, and 114. Clients 110, 112, and 114 can be clients to server 104 in this embodiment. Clients 110, 112, 114, or any combination thereof, may include their own data, boot files, operating system images, and applications. The data processing environment 100 may include additional servers, clients, and other devices not shown.

[0080] In the illustrated embodiment, memory 124 can provide data such as boot files, operating system images, and applications to processor 122. Processor 122 may include its own data, boot files, operating system images, and applications. The data processing environment 100 may include additional memory, processors, and other devices not shown.

[0081] In the illustrated embodiment, the data processing environment 100 may be the Internet. Network 102 may represent a collection of networks and gateways that use the Transmission Control Protocol / Internet Protocol (TCP / IP) and other protocols to communicate with one another. At the heart of the Internet is a backbone of data communication links between major nodes or host computers, including thousands of commercial, government, educational and other computer systems that route data and messages. Naturally, the data processing environment 100 can also be implemented as several different types of networks, such as an intranet, a local area network (LAN), or a wide area network (WAN). Figure 1 is intended as an example and is not intended to limit the architecture of different illustrative embodiments.

[0082] Among its many applications, the data processing environment 100 can be used to implement a client-server environment capable of implementing the exemplary embodiment. The client-server environment allows software applications and data to be distributed across a network so that applications function using interconnections between client and server data processing systems. The data processing environment 100 may also employ a service-oriented architecture that allows interoperable software components distributed across the network to be packaged together as a consistent business application. The data processing environment 100 may take the form of a cloud and may employ a cloud computing model of service delivery to enable convenient on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal administrative effort or interaction with service providers.

[0083] Referring to Figure 2, this figure shows a block diagram of a data processing system capable of implementing an exemplary embodiment. The data processing system 200 is an example of a computer, such as servers 104 and 106 or clients 110, 112 and 114 in Figure 1, or another type of device in which computer-readable program code or instructions for implementing the process can be placed for the exemplary embodiment.

[0084] The data processing system 200 also represents a data processing system or a configuration within a data processing system, such as the data processing system 132 in Figure 1, on which computer-usable program code or instructions that implement the processes of the exemplary embodiment can be placed. While the data processing system 200 is described as a computer only as an example, it is not limited to a computer. Implementations in the form of other devices, such as the device 132 in Figure 1, may modify the data processing system 200, for example, by adding a touch interface, and certain illustrated components may be omitted from the data processing system 200 without departing from the general description of the operation and function of the data processing system 200 given herein.

[0085] In the illustrated embodiment, the data processing system 200 employs a hub architecture including a northbridge and memory controller hub (NB / MCH) 202 and a southbridge and input / output (I / O) controller hub (SB / ICH) 204. A processing unit 206, main memory 208, and graphics processor 210 are coupled to the NB / MCH 202. The processing unit 206 may include one or more processors and may be implemented using one or more heterogeneous processor systems. The processing unit 206 may be a multicore processor. In certain implementations, the graphics processor 210 may be coupled to the NB / MCH 202 via an accelerated graphics port (AGP).

[0086] In the illustrated embodiment, a LAN adapter 212 is coupled to the SB / ICH 204. An audio adapter 216, a keyboard and mouse adapter 220, a modem 222, a read-only memory (ROM) 224, a Universal Serial Bus (USB) and other ports 232, and a PCI / PCIe device 234 are coupled to the SB / ICH 204 via bus 238. A hard disk drive (HDD) or solid state drive (SSD) 226 and a compact disk read-only memory (CD-ROM) 230 are coupled to the SB / ICH 204 via bus 240. The PCI / PCIe device 234 may include, for example, Ethernet(R) adapters, add-in cards, and PC cards for notebook computers. PCI uses a card bus controller, while PCIe does not. The ROM 224 may be, for example, a flash binary input / output system (BIOS). The HDD 226 and CD-ROM 230 may use, for example, an Integrated Drive Electronics (IDE) interface, a Serial Advanced Technology Attachment (SATA) interface, or a variation thereof such as External SATA (eSATA) or Micro SATA (mSATA). A Super I / O (SIO) device 236 may be coupled to the SB / ICH 204 via the bus 238.

[0087] Memories such as main memory 208, ROM 224, or flash memory (not shown) are some examples of computer-usable storage devices. The HDD or SSD 226, the CD-ROM 230, and other similarly usable devices are further examples of computer-usable storage devices that include a computer-usable storage medium.

[0088] An operating system runs on the processing unit 206. The operating system coordinates and controls the various components within the data processing system 200 shown in Figure 2. The operating system may be a commercially available operating system for any type of computing platform, including but not limited to server systems, personal computers, and mobile devices. An object-oriented or other type of programming system can work with the operating system and make calls to the operating system from programs or applications running on the data processing system 200.

[0089] Instructions for an operating system, an object-oriented programming system, and an application or program such as application 105 in Figure 1 are located on a storage device, such as in the form of code 226A on a hard disk drive 226, and are loadable into at least one of one or more memories, such as main memory 208, for execution by the processing unit 206. The process of the exemplary embodiment is executable by the processing unit 206 using computer implementation instructions that may be located in memory such as main memory 208, read-only memory 224, or one or more peripheral devices.

[0090] In one example, code 226A may be downloaded via network 201A from a remote system 201B where a similar code 201C is stored in storage device 201D. In another example, code 226A may be downloaded via network 201A to a remote system 201B where the downloaded code 201C is stored in storage device 201D.

[0091] The hardware in Figures 1 and 2 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash memory, equivalent non-volatile memory, or optical disc drives, may be used in addition to or instead of the hardware shown in Figures 1 and 2. Furthermore, the processes of the exemplary embodiments may be applied to multiprocessor data processing systems.

[0092] In some exemplary embodiments, the data processing system 200 may be a PDA generally configured with flash memory to provide non-volatile memory for storing operating system files or user-generated data, or both. The bus system may include one or more buses, such as a system bus, an I / O bus, and a PCI bus. Naturally, the bus system can be implemented using any kind of communication fabric or architecture that enables the transfer of data between different components or devices connected to the fabric or architecture.

[0093] The communication unit may include one or more devices used to send and receive data, such as a modem or network adapter. Memory may be, for example, main memory 208, or a cache such as the cache located in NB / MCH 202. The processing unit may include one or more processors or CPUs.

[0094] The embodiments illustrated in Figures 1 and 2 and the embodiments described above are not intended to imply any limitation of architecture. For example, the data processing system 200 may take the form of a mobile or wearable device, as well as a tablet computer, laptop computer, or telephone device.

[0095] When a computer or data processing system is described as a virtual machine, virtual device, or virtual component, that virtual machine, virtual device, or virtual component operates in the manner of the data processing system 200 using virtualized embodiments of some or all of the components illustrated in the data processing system 200. For example, in a virtual machine, virtual device, or virtual component, processing unit 206 is embodied as a virtualized instance of all or some of the number of hardware processing units 206 available in the host data processing system, main memory 208 is embodied as a virtualized instance of all or some of the main memory 208 that may be available in the host data processing system, and disk 226 is embodied as a virtualized instance of all or some of the disk 226 that may be available in the host data processing system. In such a case, the host data processing system is represented by the data processing system 200.

[0096] Referring to Figures 3A and 3B, these figures show block diagrams of exemplary configurations 300A and 300B according to exemplary embodiments. Each exemplary embodiment includes a cognitive aggregation and authoring (CAA) system 302. In some embodiments, the CAA system 302 is an example of application 105A / 105B in Figure 1.

[0097] In the exemplary embodiments, a user device 310, such as a personal computer, is used to send a request for information. For example, the user device 310 may request to receive news articles or other forms of digital content, or both, related to a specific topic, such as a topic related to current major news or an event the user is participating in, or another topic of interest to the user. The user device 310 makes a request to the CAA system 302 via the network 308. As detailed below, the CAA system 302 receives information from multiple data sources, such as databases 304-306, or other sources of information available via the Internet. The CAA system 302 uses this information to generate machine-generated content. In some embodiments, the CAA system 302 generates content for a specific topic by identifying information related to the topic and organizing that topic-related information into groups. In some embodiments, the CAA system 302 uses a multiple knapsack algorithm to optimize the groups to maximize the relevance of the grouped information and other quality metrics as desired. The CAA system 302 then selects a group and uses the information within that group to generate machine-generated content.

[0098] In some embodiments, such as configuration 300A shown in Figure 3A, the user device 310 directly requests information from a service on the CAA system 302 via the network 308. In alternative embodiments, such as configuration 300B shown in Figure 3B, the user device 310 makes requests indirectly to the CAA system 302. For example, in the illustrated embodiment of configuration 300B, the user device 310 makes a request to a third-party service 312, which then communicates with the CAA system 302. In some such embodiments, the third-party service 312 operates a news website or mobile application, such as an online newspaper, digital magazine, or news aggregator, for which the third-party service 312 receives content from the CAA system 302.

[0099] Referring to Figure 4, this figure shows a block diagram of an exemplary configuration 400 according to one exemplary embodiment. The exemplary embodiment includes a CAA system 418. In some embodiments, the CAA system 418 is an example of the CAA system 302 in Figures 3A and 3B and application 105A / 105B in Figure 1.

[0100] In the exemplary embodiment, a user device 422, such as a smartphone, tablet computer, or other computing device, runs an application 424 that sends a request for information. For example, the user device 422 requests to receive news articles or other forms of digital content, or both, related to a specific topic, such as current major news, an event the user is participating in, or another topic of interest to the user. The user device 422 sends this request (directly or indirectly) to the CAA system 418 via the network 420. As detailed below, the CAA system 418 receives information from multiple data sources, such as corpora 414-416, or other sources of information available via the Internet. The CAA system 418 uses this information to generate machine-generated content. In some embodiments, the CAA system 418 generates content for a specific topic by identifying topic-related information and organizing that topic-related information into groups. In some embodiments, the CAA system 418 optimizes the groups using a multiple knapsack algorithm to maximize the relevance of the grouped information and other quality metrics as desired. The CAA system 418 then selects a group and uses the information within that group to generate machine-generated content.

[0101] In some embodiments, such as the configuration 400 shown in Figure 4, corpora 414-416 are generated by their respective independent NLP pipelines 410-412. NLP pipeline 410 constructs corpus 414 using data from data source 402, which NLP pipeline 410 accesses via network 406. NLP pipeline 411 constructs corpus 415 using data from data source 403, which NLP pipeline 411 accesses via network 407. NLP pipeline 412 constructs corpus 416 using data from data source 404, which NLP pipeline 412 accesses via network 408.

[0102] Three NLP pipelines 410-412 are shown, but alternative embodiments may include any number of NLP pipelines. Examples of NLP pipelines that can function as sources of information for the CAA system 418 include text analysis systems that may include information retrieval, lexical analysis for examining word frequency distributions, pattern recognition, tagging / annotation, information extraction, data mining techniques including link and relevance analysis, visualization and predictive analytics.

[0103] In some embodiments, one or more NLP pipelines among NLP pipelines 410-412 include processing to generate insights using text content generated from structured data. For example, in some embodiments, the NLP processing includes data mining of large data sources, for example, by querying tens or hundreds of thousands of data sources for information related to a particular topic and retrieving gigabytes of data from the query results. In some such embodiments, the NLP processing further includes the use of natural language generation and transformer models and decision optimization to generate insightful summaries that describe causal relationships within a particular context or scenario, for example, by identifying relationships and behaviors that contribute to or aid in understanding causal relationships.

[0104] In some embodiments, one or more of the NLP pipelines 410-412 include processing for generating factoids as text content from established data sources. As used herein, “factoid” refers to information that is little known (i.e., trivia) but interesting. In some such embodiments, the NLP processing includes data mining of large data sources by, for example, querying tens or hundreds of thousands of data sources for information related to a particular topic and retrieving several gigabytes of data in the query results. The processing then includes applying extractive summarization to the articles within the query results to find sentences that summarize the articles or parts of the articles. In some embodiments, one or more quality metrics are used to rank the results based on, for example, how closely the results relate to a particular topic of interest, the quality of the text (e.g., whether the results contain grammatical errors, misspellings or offensive language), or other desired criteria.
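As a non-limiting illustration, the extractive summarization and topic-based ranking described in paragraph [0104] can be sketched as follows. The word-frequency scoring, the small stop-word list, and the term-overlap ranking are assumptions chosen to keep the sketch short; the disclosure does not name a particular summarization technique or quality metric, and a practical pipeline would likely use a trained model.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, deliberately tiny stop-word list for the sketch.
STOP = {"the", "a", "an", "of", "in", "on", "and", "was", "is"}


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> List[str]:
    """Lowercase words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP]


def extractive_summary(text: str, k: int = 1) -> List[str]:
    """Return the k sentences whose words are most frequent in the text."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(s: str) -> float:
        toks = tokenize(s)
        return sum(freq[w] for w in toks) / len(toks) if toks else 0.0

    return sorted(sentences, key=score, reverse=True)[:k]


def rank_by_topic(sentences: List[str], topic_terms: List[str]) -> List[str]:
    """Order sentences by the share of their words that are topic terms."""
    topic = {t.lower() for t in topic_terms}

    def overlap(s: str) -> float:
        toks = tokenize(s)
        return sum(1 for w in toks if w in topic) / len(toks) if toks else 0.0

    return sorted(sentences, key=overlap, reverse=True)


article = ("The champion won the final in straight sets. "
           "The final was played on clay. Rain delayed the start.")
summary = extractive_summary(article, k=2)
ranked = rank_by_topic(split_sentences(article), ["clay"])
print(summary, ranked[0])
```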

[0105] Referring to Figure 5, this figure shows a block diagram of an exemplary CAA system 500 according to one exemplary embodiment. In a particular embodiment, the CAA system is an example of application 105A / 105B in Figure 1, CAA system 302 in Figures 3A and 3B, and CAA system 418 in Figure 4.

[0106] In some embodiments, the CAA system 500 includes a loading module 502, a candidate selection module 504, a grouping module 506, a training module 508, an article generation module 510, an article publishing module 512, a management interface 514, memory 516, and a processor 518. In alternative embodiments, the CAA system 500 may include some or all of the functions described herein, but grouped in different ways into one or more modules. In some embodiments, the functions described herein are distributed across multiple systems, which may include software-based systems, hardware-based systems, or a combination of both, such as application-specific integrated circuits (ASICs), computer programs, or smartphone applications. In some embodiments, modules 502-512 and the management interface 514 are software modules containing program instructions executable by the processor 518 to cause the processor 518 to perform the operations described herein.

[0107] In this exemplary embodiment, the loading module 502 loads multiple digital content datasets into memory 516 as part of extracting content from one or more corpora 520. The multiple digital content datasets satisfy query statements that include the content topics to which the multiple digital content datasets relate.

[0108] The candidate selection module 504 identifies one or more sets of candidate text items from a plurality of digital content datasets based on the relevance of each candidate text item to one or more respective subtopics, using a calculated relevance score for each candidate text item. In some embodiments, an item is a sentence or phrase. In some embodiments, the candidate selection module 504 calculates the relevance by analyzing the text content of each candidate text item using one or more natural language processing techniques. In some embodiments, the candidate selection module 504 generates a feature vector for each candidate text item based on the analysis of the candidate text items. In some embodiments, each feature vector includes one or more relevance values and optionally one or more quality values. In some embodiments, a set of candidate text items includes a factoid from a first source and statistics from a second source. In some embodiments, a set of candidate text items includes a first candidate text item and a second candidate text item, each written in a different style.
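As a non-limiting illustration, a feature vector of the kind described in paragraph [0108], holding one relevance value per subtopic followed by one quality value, can be sketched as follows. The bag-of-words cosine similarity, the blocked-word quality check, and all names are assumptions made for this sketch; the disclosure leaves the natural language processing technique open.

```python
import math
import re
from collections import Counter
from typing import FrozenSet, List


def bag(text: str) -> Counter:
    """Bag-of-words counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def feature_vector(item: str,
                   subtopics: List[str],
                   blocked: FrozenSet[str] = frozenset()) -> List[float]:
    """Relevance to each subtopic, then a quality value.

    Assumption: quality is 1.0 unless the item contains a blocked word.
    """
    item_bag = bag(item)
    relevance = [cosine(item_bag, bag(s)) for s in subtopics]
    quality = 0.0 if any(w in blocked for w in item_bag) else 1.0
    return relevance + [quality]


fv = feature_vector("The final was played on clay",
                    ["clay court final", "player injury"])
relevance_score = max(fv[:-1])  # one possible scalar relevance score
print(fv, relevance_score)
```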

[0109] The grouping module 506 uses the computed relevance score and feature vector to group candidate text items from a set of candidate text items into a predetermined number of groups of candidate text items. In some embodiments, the grouping module 506 groups candidate text items by finding a solution to the MKP that results in a predetermined number of groups of candidate text items.
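As a non-limiting illustration, grouping candidate text items into a predetermined number of groups as a multiple knapsack problem can be sketched as follows. The disclosure does not specify a solver, so this sketch substitutes a simple greedy heuristic, which is not guaranteed to be optimal. Treating each group as a knapsack with a word budget (`capacity`) and each item's length as its size is also an assumption, as are all names and values.

```python
from typing import List, Tuple

# (text, relevance score, size); size is a hypothetical length in words.
Item = Tuple[str, float, int]


def greedy_mkp(items: List[Item], num_groups: int, capacity: int) -> List[List[str]]:
    """Greedy heuristic for a multiple knapsack grouping.

    Items are considered in order of relevance per unit of size, and each
    is placed in the group with the most remaining capacity if it fits.
    Items that fit nowhere are left out.
    """
    groups: List[List[str]] = [[] for _ in range(num_groups)]
    remaining = [capacity] * num_groups
    order = sorted(items, key=lambda it: it[1] / it[2], reverse=True)
    for text, _relevance, size in order:
        best = max(range(num_groups), key=lambda g: remaining[g])
        if remaining[best] >= size:
            groups[best].append(text)
            remaining[best] -= size
    return groups


candidates: List[Item] = [("a", 0.9, 3), ("b", 0.8, 2), ("c", 0.4, 4), ("d", 0.5, 1)]
grouped = greedy_mkp(candidates, num_groups=2, capacity=5)
print(grouped)
```

An exact result would require an integer programming or dynamic programming solver; the greedy version is shown only because it makes the knapsack framing easy to follow.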

[0110] The training module 508 trains a pre-trained encoder-decoder model using a specified group of candidate text items from a group of candidate text items. The pre-trained encoder-decoder model is pre-trained to generate text content according to a specific writing style. The article generation module 510 uses the pre-trained encoder-decoder model to generate machine-generated text content in that specific writing style, resulting in an article on a specified subtopic based on the specified group of candidate text items.

[0111] In some embodiments, the training module 508 trains multiple pre-trained encoder-decoder models, each using a respective one of several designated groups of candidate text items from the groups of candidate text items. Each of the multiple pre-trained encoder-decoder models is pre-trained to generate text content according to its own distinct style. The article generation module 510 uses the multiple pre-trained encoder-decoder models to generate machine-generated text content in each style, resulting in multiple articles on a designated subtopic based on the designated groups of candidate text items.

[0112] The article publishing module 512 sends the article to the remote web server 522 as an update for a website hosted by the remote web server 522. In some such embodiments, the article publishing module 512 sends the article to the remote web server 522 as an update for a custom web page, which is customized for the user. In some such embodiments, memory 516 stores user-generated content that indicates the user's sentiment towards a topic or subtopic. In some embodiments, the user inputs at least some of the user-generated content by operating the computer device 524 via the management interface 514, for example, by answering survey questions, filling out a user profile, or through other processes. In some such embodiments, the candidate selection module 504 generates weight vectors based on the user's sentiment towards a topic or subtopic, and the grouping module 506 uses the weight vectors to adjust the values of the feature vectors when grouping candidate text items into groups of candidate text items.

[0113] Referring to Figure 6, this figure shows a block diagram of an exemplary CAA system 600 according to one exemplary embodiment. In a particular embodiment, the CAA system 600 is an example of application 105A / 105B in Figure 1, CAA system 302 in Figures 3A and 3B, and CAA system 418 in Figure 4.

[0114] In some embodiments, the CAA system 600 includes a loading module 602, a candidate selection module 604, a grouping module 606, a training module 608, an article generation module 610, an article publishing module 612, a management interface 614, a memory 616, a processor 618, a sentiment analysis module 626, a masking module 634, a polarity detection module 636, and a user feedback module 638. In alternative embodiments, the CAA system 600 may include some or all of the functions described herein, but grouped in different ways into one or more modules. In some embodiments, the functions described herein are distributed across multiple systems, which may include software-based systems, hardware-based systems, or a combination of both, such as ASICs, computer programs, or smartphone applications. In some embodiments, modules 602-612, 626, and 634-638 and the management interface 614 are software modules containing program instructions executable by the processor 618 to cause the processor 618 to perform the operations described herein.

[0115] In this exemplary embodiment, the loading module 602 loads multiple digital content datasets into memory 616 as part of extracting content from one or more corpora 620. The multiple digital content datasets satisfy query statements that include the content topics to which the multiple digital content datasets relate.

[0116] The candidate selection module 604 identifies one or more sets of candidate text items from the plurality of digital content datasets based on the relevance of each candidate text item to one or more respective subtopics, using a calculated relevance score for each candidate text item. In some embodiments, items are sentences or phrases. In some embodiments, the candidate selection module 604 calculates the relevance score by analyzing the text content of each candidate text item using one or more natural language processing techniques. In some embodiments, the candidate selection module 604 generates a feature vector for each candidate text item based on the analysis of the candidate text items. In some embodiments, the polarity detection module 636 uses NLP to determine the polarity of each candidate text item, yielding a polarity score that indicates the degree to which the item is opinionated and that may also indicate which side of an issue the item supports. In some embodiments, each feature vector includes one or more relevance values and optionally one or more quality values and polarity scores. In some embodiments, a set of candidate text items includes a factoid from a first source and statistics from a second source. In some embodiments, a set of candidate text items includes a first candidate text item and a second candidate text item, each written in a different style.

[0117] The user feedback module 638 loads user-generated content into memory 616. The sentiment analysis module 626 analyzes the user-generated content as part of determining the user's sentiment towards the subtopic, and as a result determines the polarity of the user-generated content. The masking module 634 generates a weight vector based on the user's sentiment towards the subtopic.

[0118] The grouping module 606 uses the computed relevance score, feature vector, and weight vector to group candidate text items from a set of candidate text items into a predetermined number of groups of candidate text items. In some embodiments, the grouping module 606 groups candidate text items by finding a solution to the MKP that results in the predetermined number of groups of candidate text items.
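By way of illustration only, one way a weight vector could adjust the feature vectors before grouping is sketched below. The specification does not define the layout of the weight vector, so this sketch assumes feature vectors of the form [relevance values..., quality, polarity], a user sentiment expressed as a polarity in [-1, 1], and a weight vector that leaves relevance and quality untouched while scaling the polarity element by the user's polarity. Under that assumption, items that agree with the user contribute positively and items that disagree contribute negatively. All names are hypothetical.

```python
def make_weight_vector(user_polarity, num_relevance):
    """Weights for feature vectors laid out as [relevance..., quality, polarity]."""
    if not -1.0 <= user_polarity <= 1.0:
        raise ValueError("user_polarity must be in [-1, 1]")
    # Relevance and quality keep weight 1.0; the polarity slot is scaled
    # by the user's own polarity (an assumption of this sketch).
    return [1.0] * num_relevance + [1.0, user_polarity]


def adjust(feature_vector, weights):
    """Element-wise product of a feature vector and the weight vector."""
    if len(feature_vector) != len(weights):
        raise ValueError("feature vector and weight vector lengths differ")
    return [f * w for f, w in zip(feature_vector, weights)]


def item_value(feature_vector, weights):
    """Scalar value usable as the item's value when solving the grouping."""
    return sum(adjust(feature_vector, weights))
```

With a user polarity of -0.5 and two relevance values, an item whose polarity score is -1.0 receives a higher value than an otherwise identical item whose polarity score is +1.0, so the grouping step favors content aligned with the user's sentiment.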

[0119] The training module 608 trains a pre-trained encoder-decoder model using a specified group of candidate text items from a group of candidate text items. The pre-trained encoder-decoder model is pre-trained to generate text content according to a specific writing style. The article generation module 610 uses the pre-trained encoder-decoder model to generate machine-generated text content in that specific writing style, resulting in an article on a specified subtopic based on the specified group of candidate text items.

[0120] In some embodiments, the training module 608 trains multiple pre-trained encoder-decoder models, each using a respective one of several designated groups of candidate text items from the groups of candidate text items. Each of the multiple pre-trained encoder-decoder models is pre-trained to generate text content according to a style that may differ from the styles of the others. The article generation module 610 uses the multiple pre-trained encoder-decoder models to generate machine-generated text content in each style, resulting in multiple articles on a designated subtopic based on the designated groups of candidate text items.

[0121] The article publishing module 612 transmits the article to a remote web server 622 via a network 628, which may include the Internet, as an updated article 632 for a website hosted by the remote web server 622 or for an application running on a computing device 630, such as a smartphone or tablet. In some such embodiments, the article publishing module 612 transmits the article to the remote web server 622 as an update for a custom web page, which is customized for the user. In some such embodiments, memory 616 stores user-generated content that indicates the user's sentiment towards a topic or subtopic. In some embodiments, the user inputs at least some of the user-generated content by operating a computer device 624 via the management interface 614, for example, by answering survey questions, filling out a user profile, or through other processes. In some such embodiments, the candidate selection module 604 generates a weight vector based on the user's sentiment towards a topic or subtopic, and the grouping module 606 uses the weight vector to adjust the values of the feature vector when grouping the candidate text items into groups of candidate text items.

[0122] Referring to Figure 7, this figure shows a flowchart of an exemplary process 700 for cognitive aggregation and authoring according to one exemplary embodiment. In some embodiments, CAA system 302, CAA system 418, CAA system 500, or CAA system 600 performs process 700.

[0123] In one embodiment, in block 702, the process loads multiple digital content datasets into memory as part of content extraction from a corpus. The multiple digital content datasets satisfy query statements that include the content topics to which the multiple digital content datasets relate.

[0124] Next, in block 704, the process uses a computed relevance score for each candidate text item to identify a set of candidate text items from the multiple digital content datasets based on the relevance of each candidate text item to a subtopic. In some embodiments, an item is a sentence or phrase. In some embodiments, the computed relevance score is determined by analyzing the text content of each candidate text item using one or more natural language processing techniques. In some embodiments, the analysis of the text content of each candidate text item yields a feature vector for each candidate text item. In some embodiments, each feature vector includes a relevance value and a quality value. In some embodiments, the set of candidate text items includes a factoid from a first source and statistics from a second source. In some embodiments, the set of candidate text items includes a first candidate text item and a second candidate text item, each written in a different style.

[0125] Next, in block 706, the process uses the computed relevance scores and feature vectors, as a result of executing a set of instructions in the processor, to group candidate text items from the set of candidate text items into a predetermined number of groups of candidate text items. In some embodiments, the grouping of candidate text items includes finding a solution to the MKP that results in the predetermined number of groups of candidate text items. Next, in block 708, the process trains a first pre-trained encoder-decoder model using a first designated group of candidate text items from the groups of candidate text items. The first pre-trained encoder-decoder model is pre-trained to generate text content conforming to a first style. Next, in block 710, the process uses the first pre-trained encoder-decoder model to generate machine-generated text content in the first style, resulting in a first article on a designated subtopic based on the first designated group of candidate text items. Next, in block 712, the process trains a second pre-trained encoder-decoder model using a second designated group of candidate text items from the groups of candidate text items. The second pre-trained encoder-decoder model is pre-trained to generate text content in a second style. Then, in block 714, the process uses the second pre-trained encoder-decoder model to generate machine-generated text content in the second style, resulting in a second article on the designated subtopic based on the second designated group of candidate text items.
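By way of illustration only, the control flow of blocks 708 through 714 (one model per designated group, one article per style) can be sketched as below. The sketch does not train or run an encoder-decoder model. The `TemplateStyleModel` class is a hypothetical toy stand-in that merely records the group it was given and joins the items with a style-dependent connector, and the style names "concise" and "formal" are assumptions.

```python
class TemplateStyleModel:
    """Toy stand-in for a pre-trained encoder-decoder model (assumption)."""

    def __init__(self, style):
        self.style = style
        self.items = []

    def fine_tune(self, items):
        # A real system would update model weights here; this stub only
        # remembers the designated group of candidate text items.
        self.items = list(items)

    def generate(self, subtopic):
        # The connector is the only "style" this stub knows about.
        joiner = " " if self.style == "concise" else " Furthermore, "
        return f"[{self.style}] {subtopic}: " + joiner.join(self.items)


def author_articles(groups, styles, subtopic):
    """Pair each designated group with a style and produce one article each."""
    articles = []
    for group, style in zip(groups, styles):
        model = TemplateStyleModel(style)
        model.fine_tune(group)                      # blocks 708 and 712
        articles.append(model.generate(subtopic))   # blocks 710 and 714
    return articles
```

Calling `author_articles([["A.", "B."], ["C.", "D."]], ["concise", "formal"], "topic")` yields two articles, one per group and style, which a later step could send to the remote web server.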

[0126] Next, in block 716, the process sends the first and second articles to a remote web server as updates for a website hosted by the remote web server. In some such embodiments, sending the first and second articles to the remote web server as website updates includes sending the first and second articles as updates for a custom web page, the custom web page being customized for a user. In some such embodiments, the process includes loading user-generated content into memory, identifying selected user-generated content from that user-generated content based on its relevance to the topic, and analyzing the selected user-generated content to determine its polarity as part of determining the user's sentiment towards the topic. In some such embodiments, the process includes generating a weight vector based on the user's sentiment towards the topic, and grouping the candidate text items into groups of candidate text items includes using the weight vector to adjust the values of the feature vector.

[0127] Referring to Figure 8, this figure shows a flowchart of an exemplary process 800 for cognitive aggregation and authoring according to one exemplary embodiment. In some embodiments, CAA system 302, CAA system 418, CAA system 500, or CAA system 600 performs process 800.

[0128] In one embodiment, in block 802, the process executes a query process to search for content related to a specified topic in multiple corpora.

[0129] Next, in block 804, the process extracts a set of candidate text items from the search results received from the query process based on the relevance of each candidate text item to a specified subtopic, using the computed relevance score for each candidate text item. The computed relevance score is determined by parsing the text content of the candidate text items using one or more natural language processing techniques. The parsing of the text content of the candidate text items yields a feature vector for each candidate text item. Each feature vector contains its respective relevance value and its respective quality value.

[0130] Next, in block 806, the process loads user-generated content into memory. For example, in some embodiments, the process sends a feedback request to the user, which includes a request for feedback on opinions expressed in an editorial article related to a subtopic. The user may be asked to comment on the opinions or simply to indicate whether they agree or disagree with them. The user-generated content then includes the feedback received from the user in response to the feedback request. In some embodiments, the user-generated content may include one or more comments posted by the user in response to a post or article on the subtopic on, for example, a news website or a social media website. In such embodiments, the user has provided in advance a list of such websites in which the user actively participates and has opted in to allow the process to access the comments the user previously posted there.

[0131] Next, in block 808, the process analyzes user-generated content to determine the polarity of the user-generated content, which the process uses to determine the user's sentiment toward a given subtopic.
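By way of illustration only, the polarity determination of block 808 can be sketched with a small word-list approach. The specification does not prescribe how polarity is computed, so the lexicon-based scoring, the word lists, and the averaging across comments below are all assumptions of this sketch. A practical system would more likely use a trained sentiment classifier.

```python
import re

# Tiny illustrative lexicons (hypothetical; not taken from the specification).
POSITIVE = {"agree", "good", "great", "support", "love"}
NEGATIVE = {"disagree", "bad", "poor", "oppose", "hate"}


def polarity(text):
    """Polarity in [-1, 1]: (positive - negative) / (positive + negative)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def user_sentiment(comments):
    """Mean polarity across a user's comments; 0.0 when there are none."""
    scores = [polarity(c) for c in comments]
    return sum(scores) / len(scores) if scores else 0.0
```

The resulting value could serve as the user's sentiment toward the subtopic when the weight vector is generated in the following block.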

[0132] Next, in block 810, the process generates a weight vector based on the user's sentiment towards a specified subtopic. Then, in block 812, the process groups candidate text items into a predetermined number of groups based on the calculated relevance score, weight vector, and feature vector.

[0133] Next, in block 814, the process trains a pre-trained encoder-decoder model using a specified group of candidate text items from a group of candidate text items. Then, in block 816, the process uses the pre-trained encoder-decoder model to generate machine-generated text content, resulting in an article on a specified subtopic based on the specified group of candidate text items. Finally, in block 818, the process sends the article to a remote web server as an update for displaying personalized content for the user based on the user's sentiment.
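By way of illustration only, the transmission in block 818 can be sketched as the construction of an HTTP POST request carrying the article. The specification does not define the web server's interface, so the `/articles` path, the JSON field names, and the use of JSON at all are assumptions. The sketch only builds the request object and does not contact any server.

```python
import json
import urllib.request


def build_update_request(base_url, user_id, subtopic, article_text):
    """Build (but do not send) a POST request carrying a personalized article."""
    # Field names are hypothetical; a real server defines its own schema.
    payload = {"user_id": user_id, "subtopic": subtopic, "article": article_text}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/articles",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request would be a separate step, for example by passing the returned object to `urllib.request.urlopen`, together with whatever authentication the remote web server requires.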

[0134] Referring to Figure 9, this figure shows a cloud computing environment 950. As illustrated, the cloud computing environment 950 includes one or more cloud computing nodes 910 with which local computing devices used by cloud consumers, such as a personal digital assistant (PDA) or mobile phone 954A, a desktop computer 954B, a laptop computer 954C, or an automotive computer system 954N, or a combination thereof, can communicate. The nodes 910 can communicate with each other. The nodes 910 may be grouped physically or virtually (not shown) in one or more networks, such as a private cloud, community cloud, public cloud, or hybrid cloud, or a combination thereof. This allows the cloud computing environment 950 to offer infrastructure, platforms, or software, or a combination thereof, as a service, so that cloud consumers need not maintain those resources on their local computing devices. The types of computing devices 954A through 954N shown in Figure 9 are for illustrative purposes only, and it is understood that the computing nodes 910 and the cloud computing environment 950 can communicate with any type of computerized device over any type of network or network-addressable connection, or a combination thereof (e.g., using a web browser).

[0135] Referring to Figure 10, this figure illustrates a set of functional abstraction layers provided by the cloud computing environment 950 (Figure 9). It should be understood that the components, layers, and functions shown in Figure 10 are for illustrative purposes only, and embodiments of the present invention are not limited thereto. As illustrated, the following layers and corresponding functions are provided:

[0136] The hardware and software layer 1060 includes hardware components and software components. Examples of hardware components include a mainframe 1061, a reduced instruction set computer (RISC) architecture-based server 1062, a server 1063, a blade server 1064, a storage device 1065, and a network and networking component 1066. In some embodiments, the software components include network application server software 1067 and database software 1068.

[0137] The virtualization layer 1070 provides an abstraction layer from which the following examples of virtual entities may be provided: a virtual server 1071, virtual storage 1072, a virtual network 1073 including a virtual private network, a virtual application and operating system 1074, and a virtual client 1075.

[0138] In one embodiment, the management layer 1080 may provide the following functions: Resource provisioning 1081 dynamically procures computing resources and other resources used to perform tasks within the cloud computing environment. Metering and pricing 1082 tracks the cost of using resources within the cloud computing environment and processes billing or invoicing for the consumption of these resources. In one embodiment, these resources may include application software licenses. Security verifies the identity of cloud consumers and tasks and protects data and other resources. User portal 1083 provides consumers and system administrators with access to the cloud computing environment. Service level management 1084 allocates and manages cloud computing resources to ensure that the required service levels are met. Service Level Agreement (SLA) planning and execution 1085 pre-arranges and procures cloud computing resources for which future demands are anticipated in accordance with the SLA.

[0139] The workload layer 1090 provides examples of capabilities that can leverage a cloud computing environment. Examples of workloads and capabilities that can be provided from this layer include mapping and navigation 1091, software development and lifecycle management 1092, virtual classroom education delivery 1093, data analysis processing 1094, transaction processing 1095, and contextually-aware natural language artifact recombination 1096.

[0140] For the purposes of interpreting the claims and this specification, the following definitions and abbreviations shall be used. The terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” “contains,” or “containing,” as used herein, are intended to cover a non-exclusive inclusion. For example, a composition, mixture, process, method, object, or apparatus that contains a list of elements is not necessarily limited to those elements alone and may include other elements not expressly listed or that are inherent to such composition, mixture, process, method, object, or apparatus.

[0141] Furthermore, the term “exemplary” as used herein means “serving as an example, instance, or illustration.” No embodiment or design described herein as “exemplary” should necessarily be construed as preferable or advantageous over any other embodiment or design. The terms “at least one” and “one or more” are understood to include any integer greater than or equal to one, i.e., 1, 2, 3, 4, etc. The term “multiple” is understood to include any integer greater than or equal to two, i.e., 2, 3, 4, 5, etc. The term “connection” may include both indirect “connections” and direct “connections.”

[0142] Where the terms "one embodiment," "a particular embodiment," or "an exemplary embodiment" are used herein, they indicate that the described embodiment may include a particular feature, structure, or characteristic, but every embodiment may or may not include that particular feature, structure, or characteristic. Furthermore, such terms do not necessarily refer to the same embodiment. In addition, where a particular feature, structure, or characteristic is described in relation to one embodiment, it is considered within the knowledge of those skilled in the art to apply such a feature, structure, or characteristic to other embodiments, whether or not explicitly stated.

[0143] The terms “about,” “substantially,” and “approximately,” and their variations, are intended to include the degree of error associated with measuring a particular quantity based on the equipment available at the time of filing of this application. For example, “approximately” may include a range of ±8%, ±5%, or ±2% of a given value.

[0144] While descriptions of various embodiments of the present invention have been provided for illustrative purposes, they are not intended to be exhaustive or to limit the invention to the embodiments disclosed. Many changes and modifications will be apparent to those skilled in the art without departing from the scope and spirit of the embodiments described. The terminology used herein has been selected to best describe the principles of the embodiments, their practical applications, or the technical improvements to the art found in the market, or to enable those skilled in the art to understand the embodiments described herein.

[0146] Accordingly, the exemplary embodiments provide a computer-implemented method, system or apparatus, and computer program product for managing a CAA system environment, as well as other related features, functions, or operations. Where an embodiment or part thereof is described in relation to a certain type of device, the computer-implemented method, system or apparatus, computer program product, or part thereof is adapted or configured for use with a suitable equivalent embodiment of that type of device.

[0147] Where an embodiment is described as being implemented in an application, the delivery of the application in a Software as a Service (SaaS) model is intended to be within the scope of the exemplary embodiment. In a SaaS model, the functionality of an application implementing an embodiment is provided to the user by running the application on a cloud infrastructure. Users can access the application from various client devices via a thin client interface such as a web browser (e.g., web-based email) or other lightweight client applications. Users do not manage or control the underlying cloud infrastructure, including the network, servers, operating system, or storage of the cloud infrastructure. In some cases, users may not even manage or control the functionality of the SaaS application. In some other cases, the SaaS implementation of an application may permit limited user-specific application configuration settings as a possible exception.

[0148] The present invention may be a system, a method, or a computer program product, or a combination thereof, at any possible technical detail level of integration. The computer program product may include a computer-readable storage medium (or media) storing computer-readable program instructions for causing a processor to perform aspects of the present invention.

[0149] A computer-readable storage medium can be a tangible device capable of holding and storing instructions for use by an instruction execution device. A computer-readable storage medium may be, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof. A non-exhaustive list of more specific examples of computer-readable storage media includes portable computer diskettes, hard disks, random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random-access memory (SRAM), portable CD-ROMs, digital versatile disks (DVDs), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove on which instructions are recorded, and any suitable combination thereof. A computer-readable storage medium as used herein should not be interpreted as being a transitory signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses passing through optical fiber cables), or electrical signals transmitted through wires.

[0150] The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to each computing / processing device, or to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, or a wireless network, or a combination thereof. The network may include copper transmission cables, optical transmission fibers, wireless transmissions, routers, firewalls, switches, gateway computers, or edge servers, or a combination thereof. A network adapter card or network interface in each computing / processing device receives computer-readable program instructions from the network and transfers those computer-readable program instructions for storage in a computer-readable storage medium within the respective computing / processing device.

[0151] The computer-readable program instructions for performing the operations of the present invention may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, integrated circuit configuration data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk(R) and C++, and procedural programming languages such as the C programming language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer via any type of network, including a LAN or WAN, or the connection may be made to an external computer (for example, via the Internet using an Internet Service Provider). In some embodiments, to carry out aspects of the present invention, an electronic circuit including, for example, a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA) can execute the computer-readable program instructions by using state information of the computer-readable program instructions to personalize the electronic circuit.

[0152] Aspects of the present invention are described herein with reference to flowcharts or block diagrams, or both, illustrating methods, apparatus (systems), and computer program products according to embodiments of the present invention. It should be understood that each block in the flowcharts or block diagrams, or both, and combinations of blocks in the flowcharts or block diagrams, or both, can be implemented using computer-readable program instructions.

[0153] These computer-readable program instructions may be supplied to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing device, create means for implementing the functions / operations specified in one or more blocks of a flowchart or block diagram, or both. These computer-readable program instructions may also be stored on a computer-readable storage medium that can direct a computer, a programmable data processing device, or other device, or a combination thereof, to function in a particular manner, such that the computer-readable storage medium storing the instructions constitutes an article of manufacture including instructions that implement aspects of the functions / operations specified in one or more blocks of a flowchart or block diagram, or both.

[0154] Furthermore, the computer-readable program instructions may be loaded onto a computer, other programmable data processing device, or other device to cause a series of operational steps to be executed on the computer, other programmable device, or other device, producing a computer-implemented process such that the instructions executed on the computer, other programmable device, or other device implement the functions / operations specified in one or more blocks of a flowchart or block diagram, or both.

[0155] The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions described in a block may be performed in an order different from the order shown in the figure. For example, two blocks shown in succession may actually be executed substantially in parallel, or the blocks may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, or both, and combinations of blocks in a block diagram or flowchart, or both, can be implemented by a dedicated hardware-based system that performs the specified functions or operations or that carries out combinations of dedicated hardware and computer instructions.

[0156] Embodiments of the present invention can also be delivered as part of service contracts with client companies, non-profit organizations, government agencies, internal organizational structures, etc. Aspects of these embodiments may include configuring computer systems to perform some or all of the methods described herein, and deploying software, hardware, and web services that implement some or all of the methods described herein. Aspects of these embodiments may also include analyzing client operations, creating recommendations based on that analysis, building systems that implement some of the recommendations, integrating those systems into existing processes and infrastructure, metering the use of those systems, allocating costs to users of those systems, and billing for the use of those systems. While each of the above embodiments of the present invention has been described in terms of its individual advantages, the present invention is not limited to any particular combination thereof. Rather, such embodiments can be combined in any way and in any number according to the intended deployment of the present invention without losing their beneficial effects.

Claims

1. A method of computer information processing, the method comprising:
loading multiple digital content datasets into memory as part of content extraction from a corpus, wherein the multiple digital content datasets satisfy a query statement and contain at least one text content relating to a content topic specified in the query statement, and the query statement stipulates that the multiple digital content datasets contain text content relating to the content topic;
identifying a set of candidate text items from among the multiple digital content datasets based on the relevance of each candidate text item to a subtopic, using a calculated relevance score for each candidate text item, wherein each candidate text item is a text item (a sentence, phrase, or text block) included in the multiple digital content datasets, the subtopic is one of a plurality of subtopics relating to the content topic, the calculated relevance score represents, for each subtopic, the similarity between the text content of the candidate text item and that subtopic and is determined by analyzing the text content of each candidate text item using one or more natural language processing techniques, and the analysis of the text content yields a feature vector for each of the candidate text items, each feature vector being a vector whose elements are the relevance value for each subtopic and the quality value of the candidate text item;
grouping, as a result of executing a set of instructions in a processor, the candidate text items from the set of candidate text items into a predetermined number of groups of candidate text items using the calculated relevance score and the feature vector;
training a first pre-trained encoder-decoder model using a first designated group of candidate text items from the groups of candidate text items, wherein the first pre-trained encoder-decoder model is pre-trained to generate text content according to a first writing style;
generating, using the first pre-trained encoder-decoder model, machine-generated text content in the first writing style, resulting in a first article on the subtopic based on the first designated group of candidate text items;
training a second pre-trained encoder-decoder model using a second designated group of candidate text items from the groups of candidate text items, wherein the second pre-trained encoder-decoder model is pre-trained to generate text content according to a second writing style;
generating, using the second pre-trained encoder-decoder model, machine-generated text content in the second writing style, resulting in a second article on the subtopic based on the second designated group of candidate text items; and
sending the first and second articles to a remote web server as updates for a website hosted by the remote web server.

2. A computer program that causes a computer to perform the method described in Claim 1.

3. A computer-readable storage medium on which the computer program described in Claim 2 is recorded.

4. A computer system comprising a processor, one or more computer-readable storage media, and program instructions collectively stored in the one or more computer-readable storage media, wherein the program instructions are executable by the processor to cause the processor to perform operations comprising:
loading multiple digital content datasets into memory as part of content extraction from a corpus, wherein the multiple digital content datasets satisfy a query statement and contain at least one text content relating to a content topic specified in the query statement, and the query statement stipulates that the multiple digital content datasets contain text content relating to the content topic;
identifying a set of candidate text items from among the multiple digital content datasets based on the relevance of each candidate text item to a subtopic, using a calculated relevance score for each candidate text item, wherein each candidate text item is a text item (a sentence, phrase, or text block) included in the multiple digital content datasets, the subtopic is one of a plurality of subtopics relating to the content topic, the calculated relevance score represents, for each subtopic, the similarity between the text content of the candidate text item and that subtopic and is determined by analyzing the text content of each candidate text item using one or more natural language processing techniques, and the analysis of the text content yields a feature vector for each of the candidate text items, each feature vector being a vector whose elements are the relevance value for each subtopic and the quality value of the candidate text item;
grouping, as a result of executing a set of instructions in the processor, the candidate text items from the set of candidate text items into a predetermined number of groups of candidate text items using the calculated relevance score and the feature vector;
training a first pre-trained encoder-decoder model using a first designated group of candidate text items from the groups of candidate text items, wherein the first pre-trained encoder-decoder model is pre-trained to generate text content according to a first writing style;
generating, using the first pre-trained encoder-decoder model, machine-generated text content in the first writing style, resulting in a first article on the subtopic based on the first designated group of candidate text items;
training a second pre-trained encoder-decoder model using a second designated group of candidate text items from the groups of candidate text items, wherein the second pre-trained encoder-decoder model is pre-trained to generate text content according to a second writing style;
generating, using the second pre-trained encoder-decoder model, machine-generated text content in the second writing style, resulting in a second article on the subtopic based on the second designated group of candidate text items; and
sending the first and second articles to a remote web server as updates for a website hosted by the remote web server.
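
As a non-limiting illustration, the identifying and grouping steps recited in the claims above might be sketched in Python as follows. Every name here (`tokenize`, `cosine`, `quality`, `feature_vector`, `group_items`, and the sample data) is hypothetical and does not appear in the specification. The sketch makes several assumptions: bag-of-words cosine similarity stands in for the "one or more natural language processing techniques", the quality value is a simple distinct-token ratio, and a greedy capacity-limited assignment stands in for the grouping step rather than reproducing any particular optimization algorithm. Training the encoder-decoder models and sending the articles to the web server are omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def quality(text):
    """Hypothetical quality value: share of distinct tokens in the text."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def feature_vector(text, subtopics):
    """One relevance value per subtopic, followed by the quality value."""
    bag = Counter(tokenize(text))
    relevance = [cosine(bag, Counter(tokenize(s))) for s in subtopics]
    return relevance + [quality(text)]


def group_items(items, subtopics, capacity):
    """Greedy stand-in for the grouping step (an assumption of this sketch).

    One group is created per subtopic, each holding at most `capacity`
    items. Placements are tried in descending order of
    relevance * quality; items with zero value are left ungrouped.
    """
    vectors = {text: feature_vector(text, subtopics) for text in items}
    groups = [[] for _ in subtopics]
    placements = sorted(
        (
            (vec[g] * vec[-1], text, g)
            for text, vec in vectors.items()
            for g in range(len(subtopics))
        ),
        key=lambda p: p[0],
        reverse=True,
    )
    assigned = set()
    for value, text, g in placements:
        if value > 0 and text not in assigned and len(groups[g]) < capacity:
            groups[g].append(text)
            assigned.add(text)
    return groups, vectors


# Illustrative data only; not taken from the specification.
candidate_items = [
    "The serve speed record was broken in the opening match",
    "Ticket sales for the final rose sharply this year",
    "A powerful serve decided the match",
    "Weather delayed play on the outer courts",
]
subtopics = ["match serve", "ticket sales"]

groups, vectors = group_items(candidate_items, subtopics, capacity=2)
for subtopic, group in zip(subtopics, groups):
    print(subtopic, "->", group)
```

In this sketch each group would then serve as the designated group used to further train one style-specific encoder-decoder model. The capacity limit plays the role of the "predetermined number" constraint only loosely, and a real implementation could replace the greedy loop with a decision optimization solver.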