598 results about "Content extraction" patented technology

Content extraction is the task of separating boilerplate, such as comments, navigation bars, social media links and ads, from the main body text of an article formatted as HTML. The main content typically accounts for only a small portion of a page’s source code.

Automated world wide web navigation and content extraction

Active | US7725875B2 | Quick search | Information wide | Digital data processing details | Multiple digital computer combinations | Web site | Web navigation

Storage mediums and a computer-implemented method for automating web navigation and content extraction are provided. In particular, a storage medium with program components which are executable through a common application program interface and are utilizable by a developer to write programming instructions is provided. In some cases, the storage medium may include a program component for adaptively navigating through one or more websites and another program component for extracting scripted content from the one or more websites. In addition or alternatively, the storage medium may include a program component for standardizing content on a web page. In some cases, the storage medium may be configured to allow a user to include XPath query language in program instructions written from the storage medium. A storage medium comprising program instructions executable using a processor for performing such functions and a computer-implemented method employing such processes are also provided herein.

Owner:ACTIAN CORP

Method, apparatus and system for capturing and analyzing interaction based content

Inactive | US20110206198A1 | Digital data processing details | Special service for subscribers | Data information | Multi segment

An apparatus and methods for capturing and analyzing customer interactions are provided. The apparatus comprises interaction information units, interaction meta-data information units associated with each of the interaction information units, a rule-based analysis engine component for receiving the interaction information, an adaptive database, an interaction capture and storage component for capturing interaction information, a multi-segment interaction capture device, an initial set-up and calibration device, and a pre-processing and content extraction device.

Owner:NICE SYSTEMS

Meta-content analysis and annotation of email and other electronic documents

Inactive | US7178099B2 | Biological models | Office automation | Email address | Content analytics

Methods are provided for performing meta-content analysis and annotation on the body of email documents and other electronic documents, and for creating a displayable index of the instances of meta-content, sorted and annotated by type. In addition, the electronic document is enhanced by providing links from the semantic foci to external documents containing related information. An electronic document adapted for delivery to one or more recipients, the electronic document including a header and a body, is processed by: performing meta-content extraction of semantic foci within said header and said body, the semantic foci comprising a plurality of types of information including one or more of email addresses, URLs, dates, currency values, organization names, names of people, names of places, and phone numbers; creating a meta-content index for the document based upon said extracted semantic foci; arranging the meta-content index according to said plurality of types; combining said meta-content index with said header and said body to provide an enhanced document; and sending said enhanced document to said one or more recipients via a communication network. The process includes converting the electronic mail document to a markup language format, wherein said meta-content index comprises one or more objects expressed in said markup language and adapted for presentation with the body in said enhanced document.

Owner:SAP AMERICA
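
The extract, index and combine steps in the abstract above can be illustrated with a minimal Python sketch. It is not the patented method: the regular expressions, the `PATTERNS` table and the `build_index` and `enhance` helpers are assumptions made for illustration, and only four of the listed types (email addresses, URLs, currency values, ISO-style dates) are covered. Names of people, places and organizations would need a named-entity recognizer, and the markup conversion and sending steps are left out.

```python
import re

# Hypothetical patterns for a few of the "semantic foci" types named in the abstract.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "currency": re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d+)?"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def build_index(header, body):
    """Extract meta-content from header and body and group it by type."""
    text = header + "\n" + body
    index = {}
    for kind, pattern in PATTERNS.items():
        found = sorted(set(pattern.findall(text)))  # deduplicate and sort each type
        if found:
            index[kind] = found
    return index


def enhance(header, body):
    """Combine the meta-content index with the header and body into one document."""
    index = build_index(header, body)
    lines = ["--- Meta-content index ---"]
    for kind in sorted(index):
        lines.append(kind + ": " + ", ".join(index[kind]))
    return header + "\n\n" + body + "\n\n" + "\n".join(lines)


if __name__ == "__main__":
    header = "From: alice@example.com\nTo: bob@example.org"
    body = "Invoice for $1,200.50 is due 2024-03-01. Details: https://example.com/inv/42"
    print(enhance(header, body))
```

Keeping the index as a plain mapping from type to values makes the "arrange by type" step a simple sorted loop; a real system would render the index in the same markup language as the converted message.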

Systems and methods for content extraction

Active | US20070050708A1 | Maintain information | Maintain usability | Web data indexing | Natural language data processing | Automatic control | World Wide Web

Systems and methods are presented for content extraction from markup language text. The content extraction process may parse markup language text into a hierarchical data model and then apply one or more filters. Output filters may be used to make the process more versatile. The operation of the content extraction process and the one or more filters may be controlled by one or more settings set by a user, or automatically by a classifier. The classifier may automatically enter settings by classifying markup language text and entering settings based on this classification. Automatic classification may be performed by clustering unclassified markup language texts with previously classified markup language texts.

Owner:THE TRUSTEES OF COLUMBIA UNIV IN THE CITY OF NEW YORK
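
The parse-then-filter pipeline described in the abstract above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the patented system: the `Node` and `TreeBuilder` classes, the `drop_tags` filter factory and the `settings` dictionary are hypothetical names, and the classifier that would choose the settings automatically is not shown.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element of the hierarchical data model."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    """Parses markup text into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def drop_tags(tags):
    """Build a filter that removes every element whose tag is in `tags`."""
    def apply(node):
        node.children = [c for c in node.children
                         if isinstance(c, str) or c.tag not in tags]
        for child in node.children:
            if isinstance(child, Node):
                apply(child)
    return apply


def to_text(node):
    """Output step: flatten the remaining tree into plain text."""
    parts = [c if isinstance(c, str) else to_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract(html, settings):
    """Parse the markup, run each configured filter, and emit text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    for tree_filter in settings["filters"]:
        tree_filter(builder.root)
    return to_text(builder.root)


if __name__ == "__main__":
    page = ("<html><body><nav><a href='/'>Home</a></nav>"
            "<script>var x = 1;</script>"
            "<div><h1>Title</h1><p>Main <b>body</b> text.</p></div></body></html>")
    settings = {"filters": [drop_tags({"script", "style"}), drop_tags({"nav", "footer"})]}
    print(extract(page, settings))
```

Because each filter is just a function over the tree, the list in `settings` can be reordered or extended without touching the parser, which mirrors the abstract's idea of filters controlled by user or classifier settings.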

Method for extracting and processing network information and its system

Inactive | CN1536483A | Automatic summarization | Program loading/initiating | Special data processing applications | Text retrieval | Automatic summarization

The invention relates to a method for extracting and processing network information. Using artificial intelligence and natural language processing techniques, it automatically downloads the latest daily news and information from specified websites, performs content extraction, classification, automatic summarization and full-text condensing, then stores the full text and indexes it for efficient full-text retrieval later.

Owner:陈文中

Linguistic extraction of temporal and location information for a recommender system

Pending | US20090187467A1 | Lack presence | Digital data information retrieval | Selective content distribution | Temporal information | Voice transformation

One embodiment of the present invention provides a system that recommends activities. During operation, the system receives a piece of content obtained from text or converted to text from speech. The system then analyzes the received content to identify any activity type, indication of willingness to participate in any type of activities, and at least one piece of temporal information, which can be implicitly and/or explicitly stated in the content, and/or one piece of location information associated with the activity type. The system further recommends one or more activities, venues, and/or services that afford or support activities for a user based on the information extracted from the content.

Owner:XEROX CORP

Method, apparatus and system for capturing and analyzing interaction based content

Inactive | US8204884B2 | Digital data processing details | Special service for subscribers | Data information | Multi segment

An apparatus and methods for capturing and analyzing customer interactions are provided. The apparatus comprises interaction information units, interaction meta-data information units associated with each of the interaction information units, a rule-based analysis engine component for receiving the interaction information, an adaptive database, an interaction capture and storage component for capturing interaction information, a multi-segment interaction capture device, an initial set-up and calibration device, and a pre-processing and content extraction device.

Owner:NICE SYSTEMS

Processing method and device for instant communication information including hyperlink

Active | CN101102255A | Convenient instant messaging operation | Improve interactivity | Digital data information retrieval | Store-and-forward switching systems | Hyperlink | Computer terminal

The method comprises: a) extracting the hyperlink from the instant communication message; b) getting the webpage corresponding to the extracted hyperlink; c) extracting the content abstract information from said webpage; d) displaying said content abstract information. The corresponding apparatus comprises a hyperlink extraction module, a webpage acquisition module and a display module. With the invention, users of an instant communication tool can see the webpage content behind a hyperlink without starting a browser.

Owner:TENCENT TECH (SHENZHEN) CO LTD
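
Steps a) to d) of the abstract above map naturally onto a small link-preview routine. The Python sketch below is one possible reading under stated assumptions: the URL pattern, the `SummaryParser` class and the choice of the page title plus the `description` meta tag as "content abstract information" are all illustrative, and the page download is passed in as a hypothetical `fetch` callable so the example stays free of network code.

```python
import re
from html.parser import HTMLParser

URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_links(message):
    """Step a): pull hyperlinks out of an instant message."""
    # Trailing punctuation is usually part of the sentence, not the link.
    return [url.rstrip(".,;:!?)") for url in URL_RE.findall(message)]


class SummaryParser(HTMLParser):
    """Step c): collect the page title and the meta description."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "description":
                self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html):
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "description": parser.description.strip()}


def preview_message(message, fetch):
    """Steps a) to c): return (url, summary) pairs for the client to display in step d)."""
    return [(url, summarize(fetch(url))) for url in extract_links(message)]


if __name__ == "__main__":
    pages = {"http://example.com/a":
             "<html><head><title>Example page</title>"
             "<meta name='description' content='A short summary.'></head></html>"}
    for url, summary in preview_message("Look at http://example.com/a, it is neat", pages.__getitem__):
        print(url, summary)
```

In a real client, `fetch` would wrap an HTTP request with a timeout and size limit, since the message sender controls the URL.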

Customer service information providing method and device, electronic equipment and storage medium

Active | CN107679234A | Soothe emotions | Improve accuracy | Customer relationship | Semantic analysis | Conditional random field | Text entry

The invention provides a customer service information providing method and device, electronic equipment and a storage medium. The method comprises the steps of receiving a Chinese text input by a user; inputting the input Chinese text into a Chinese customer service question-answering model based on a Bi-LSTM (Bidirectional Long Short-Term Memory) model and a CNN (Convolutional Neural Network) model to acquire an answering statement; inputting the input Chinese text into a content extraction and intention classification model based on a Bi-LSTM-CRF (Conditional Random Field) model and an LSTM classifier to acquire a customer intention classification and key information; determining the service recommended to the user according to the customer intention classification and the key information; inputting the input Chinese text into a Chinese text emotion analysis model based on the CNN model to acquire a user emotion classification; adjusting the answering statement according to the user emotion classification; and, in combination with the adjusted answering statement and the determined service, providing customer service information to the user. According to the method and device optimization model provided by the invention, automatic customer service answering is realized.

Owner:上海携程国际旅行社有限公司

Document image information management apparatus and document image information management program

Inactive | US20060085442A1 | Improve convenience | Easy to manage | Special data processing applications | Metadata based other databases retrieval | Paper document | Document preparation

Metadata of document images can be universally handled by dealing with the document images in units of individual regions according to their contents, thereby making it possible to improve convenience for management, search, operation thereof and so on. In order to manage metadata of contents and contexts related to the document images, prescribed image regions are analyzed as image objects based on image contents of the document images, and attribute information is extracted based on contents of the image objects thus analyzed, so that the metadata of the contents thus extracted is managed in association with the document images and the image objects. Also, attribute information is extracted based on a situation of the documents of the document images, so that the metadata of the contexts extracted is managed in association with the document images and the image objects.

Owner:KK TOSHIBA +1

Systems and methods for indexing and searching digital video content

Active | US20070253678A1 | Television system details | Digital data information retrieval | Digital video | Multimedia

The present invention relates to systems and methods for indexing digital video content maintained on a storage media item. The method of the present invention comprises extracting caption and subtitle content from one or more video object (“VOB”) files maintained on the storage media item. The extracted caption and subtitle content are segmented into one or more segments and video and audio content corresponding to the one or more segments are extracted. Descriptions of the video and audio content corresponding to the segmented caption and subtitle content are generated. The captions, subtitles, descriptions, and corresponding video and audio content associated with the one or more segments of the one or more VOB files are indexed.

Owner:VERIZON PATENT & LICENSING INC

Knowledge management tool

Inactive | US20070244867A1 | Promote collaboration | Easy to create | Digital data information retrieval | Digital data processing details | Management tool | Document preparation

A document processor for use with an indexing application comprising: a content extractor proxy that implements a pre-defined programmatic interface for content extractors; a data store; and an extended document metadata processor; wherein: the content extractor proxy receives a signal from the indexing application identifying a target document; and the document metadata processor creates from the target document extended document metadata for storage in the data store.

Owner:BA INSIGHT

Method for extracting, analyzing and searching network flow and content

Active | CN103281213A | Solve the repeatability | Solve problems such as serial number reset to zero | Data switching networks | Special data processing applications | Data information | Original data

The invention discloses a method for extracting, analyzing and searching network traffic and content. The method comprises the following steps: splitting the raw traffic into n data processing queues; having each data processing queue independently process its raw data packets, performing protocol identification and filtering on the packets and reassembling the TCP (Transmission Control Protocol) sessions they carry; parsing and decoding the protocol of each reassembled TCP session and extracting the structured data it contains; and, for key information specified by requirements, searching and labeling the data content extracted by the content parsing and extraction module using a multi-pattern matching algorithm or search engine technology, and submitting the labeling results to a search labeling information database, thereby providing search labeling results to multiple kinds of applications. The method addresses problems such as duplicate packets and sequence numbers resetting to zero during TCP session reassembly, labels the characteristics of the raw traffic, and lets users obtain useful information conveniently.

Owner:XI AN JIAOTONG UNIV
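
The two problems the abstract above singles out, duplicate packets and sequence numbers resetting to zero, can be illustrated with a minimal Python sketch of single-direction TCP payload reassembly. The abstract does not give the algorithm, so everything here is an assumption: `queue_for` is a hypothetical way of assigning a flow to one of n queues, and `reassemble` assumes the initial sequence number is known and the stream is shorter than 4 GiB.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around to zero


def queue_for(src, dst, n):
    """Assign a flow to one of n processing queues.

    Sorting the endpoints sends both directions of a connection to the same queue.
    """
    first, second = sorted([src, dst])
    return hash((first, second)) % n


def reassemble(segments, isn):
    """Rebuild one direction of a TCP stream from (seq, payload) pairs.

    `isn` is the sequence number of the first payload byte. Offsets are taken
    modulo 2**32, so a stream that wraps past the maximum sequence number is
    still ordered correctly.
    """
    by_offset = {}
    for seq, payload in segments:
        offset = (seq - isn) % SEQ_MOD
        if payload and offset not in by_offset:  # drop exact retransmissions
            by_offset[offset] = payload

    stream = bytearray()
    for offset in sorted(by_offset):
        if offset > len(stream):
            break  # gap: stop at the first missing byte
        payload = by_offset[offset]
        stream += payload[len(stream) - offset:]  # skip bytes already covered by overlap
    return bytes(stream)


if __name__ == "__main__":
    isn = SEQ_MOD - 3  # the stream starts three bytes before the wrap point
    captured = [(3, b"ORLD"), (isn, b"HEL"), (0, b"LOW"), (0, b"LOW")]
    print(reassemble(captured, isn))
```

Working with offsets relative to the initial sequence number, rather than raw sequence numbers, is what keeps the ordering stable across the wrap; a production reassembler would also need timeouts, per-direction state and handling of conflicting overlaps.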

Method for picking-up, and aggregating micro content of web page, and automatic updating system

Inactive | CN1959679A | Support divergence | Easy to introduce | Transmission | Special data processing applications | Personalization | Relationship - Father

A method for extracting and aggregating web page micro-content includes: entering a web page address at the user end; transmitting the valid content to the web page micro-content analysis subsystem at the server end, which labels the different micro-content blocks or columns according to hyperlink groups; transmitting the labeled HTML text content back to the user end; and selecting the original micro-content or its parent node and adding it to the micro-content desktop subsystem at the user end to finalize the desktop arrangement.

Owner:北京中搜云商网络技术有限公司

Display control apparatus, recording media, display control method, and display control program

Active | US20080018625A1 | Novel configuration | Suitable display | Broadcast components for monitoring/identification/recognition | Cathode-ray tube indicators | Feature extraction | Execution control

A display control apparatus is provided for controlling displaying of a play list specifying a reproduction sequence of a plurality of pieces of content. The display control apparatus has a play list feature extraction block configured to extract a feature of a play list on the basis of a plurality of content belonging to the play list, a display pattern selection block configured to select a display pattern for displaying the play list on the basis of the feature of the play list, the feature being extracted by the feature extraction block, and a control block configured to execute control such that the play list is displayed on a display block on the basis of the display pattern selected by the display pattern selection block. The novel configuration provides new ways of enjoying content.

Owner:SONY CORP

Digital media content extraction and natural language processing system

Inactive | US20170213469A1 | Natural language data processing | Speech recognition | Part of speech | Named entity

An automated lesson generation learning system extracts text-based content from a digital programming file. The system parses the extracted content to identify one or more topics, parts of speech, named entities and/or other material in the content. The system then automatically generates and outputs a lesson containing content that is relevant to the content that was extracted from the digital programming file.

Owner:WESPEKE

Apparatus and method of delivering content between applications

Inactive | US20100175011A1 | Web data retrieval | Interprogram communication | Web application | User input

Disclosed are an apparatus and a method of delivering content between applications. Content which is to be delivered from a source application to a target application may be extracted according to a user input signal, a content type describing object and a content extraction scheme corresponding to a content type. For example, the source application may be a web application including information received through a network, and the target application may be a local application which is executed using information stored in the apparatus, and vice versa.

Owner:SAMSUNG ELECTRONICS CO LTD

Knowledge management tool

Inactive | US20120059822A1 | Result set | High popularity | Digital data information retrieval | Digital data processing details | Management tool | Data memory

A document processor for use with an indexing application comprising: a pre-defined programmatic interface for content extractors; a data store; and an extended document metadata processor; wherein: the content extractor proxy receives a signal from the indexing application identifying a target document; and the document metadata processor creates from the target document extended document metadata for storage in the data store.

Owner:BA INSIGHT

Listed-company announcement classification and abstract generation method based on deep learning

Inactive | CN107403375A | Save time on text processing | Finance | Special data processing applications | Model testing | Classification rule

The invention discloses a listed-company announcement classification and abstract generation method based on deep learning. The method comprises the following steps: step 1, acquiring the original announcement text data, extracting text, picture and form information, and establishing structured documents; step 2, establishing a classification rule word library for the different announcement types, based on industry knowledge of the announcement fields and on the keyword differences between the various company operation change events, and carrying out statistical judgment on announcement classes; and step 3, for the announcements of the different classes, extracting the announcement document contents, combining the rule word library of the corresponding class keywords to train an announcement content classification model, and automatically generating the document abstract contents, wherein content extraction, training set selection, keyword model optimization, model training, model testing, result analysis and content generation are included. The method addresses the technical problems of automatically classifying the large amount of announcement information generated each day, automatically extracting key information according to the classification, and generating the abstract contents.

Owner:北京文因互联科技有限公司

Method and System for a Speech Synthesis and Advertising Service

Active | US20080059189A1 | Reduce the need for computing resources | Reduce deployment | Automatic call-answering/message-recording/conversation-recording | Multiple digital computer combinations | Text to speech synthesis | Speech sound

Methods and systems for providing a network-accessible text-to-speech synthesis service are provided. The service accepts content as input. After extracting textual content from the input content, the service transforms the content into a format suitable for high-quality speech synthesis. Additionally, the service produces audible advertisements, which are combined with the synthesized speech. The audible advertisements themselves can be generated from textual advertisement content.

Owner:CHEMTRON RES

Contents extraction method, contents extraction apparatus, contents information display method and apparatus

Inactive | US20050177846A1 | Television system details | Specific information broadcast systems | Data mining | Extraction methods

A set age of a user is updated on the basis of information of the birth date of the user and date information updated with time. Taste information of the user is learned on the basis of operation history information of the user and contents related information corresponding to the operation. Contents extraction conditions are reset in accordance with the taste information of the user and the updated set age, and contents which can be an object to be watched by the user are extracted from a plurality of contents depending on the reset extraction conditions.

Owner:CANON KK

Webpage text content extracting method and device

Active | CN102541874A | Improve accuracy | Special data processing applications | Information retrieval | Content extraction

The invention discloses a webpage text content extracting method and device. The method comprises the following steps: acquiring two webpages that belong to directories at the same level of the same site; and, for each acquired webpage, dividing the webpage into content blocks, determining the label density and/or link density of each content block, selecting the content blocks whose label density and/or link density meets the corresponding preset conditions, extracting those selected content blocks whose text content differs from the text content of the content blocks selected from the other webpage, and determining the extracted content blocks as the text content of the webpage. The technical scheme of the invention addresses the low accuracy of webpage text content extraction in the prior art.

Owner:CHINA MOBILE COMM GRP CO LTD
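
The density-and-comparison procedure in the abstract above can be sketched in Python as follows. The abstract does not define the densities or the preset conditions, so this sketch makes its own assumptions: link density is the share of a block's characters that sit inside `<a>` elements, label (tag) density is the number of inline tags per text character, the thresholds are arbitrary, and blocks are split at a fixed set of block-level tags. Input is assumed to contain no `<script>` or `<style>` elements.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "p", "ul", "ol", "table", "section", "article", "nav", "header", "footer"}


class BlockSplitter(HTMLParser):
    """Splits a page into flat content blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._anchor_depth = 0
        self._reset()

    def _reset(self):
        self._text, self._link_chars, self._tags = [], 0, 0

    def flush(self):
        chars = sum(len(t) for t in self._text)
        if chars:
            self.blocks.append({
                "text": " ".join(self._text),
                "link_density": self._link_chars / chars,
                "tag_density": self._tags / chars,
            })
        self._reset()

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        else:
            self._tags += 1  # inline tags count toward tag density
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1

    def handle_data(self, data):
        data = data.strip()
        if data:
            self._text.append(data)
            if self._anchor_depth:
                self._link_chars += len(data)


def select_blocks(html, max_link_density=0.5, max_tag_density=0.1):
    """Keep the blocks whose densities meet the (assumed) preset conditions."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()
    return [b["text"] for b in splitter.blocks
            if b["link_density"] <= max_link_density and b["tag_density"] <= max_tag_density]


def main_text(page_html, sibling_html):
    """Drop selected blocks that also appear on a sibling page of the same site."""
    shared = set(select_blocks(sibling_html))
    return [text for text in select_blocks(page_html) if text not in shared]


if __name__ == "__main__":
    chrome = ("<div><a href='/'>Home</a> <a href='/news'>News</a></div>"
              "<p>Copyright Example Corp, all rights reserved.</p>")
    page_a = chrome + "<p>Article A discusses content extraction in detail.</p>"
    page_b = chrome + "<p>Article B covers a different topic entirely.</p>"
    print(main_text(page_a, page_b))
```

The density filter removes link-heavy navigation, while the comparison with a sibling page removes low-density boilerplate such as copyright notices that the density test alone would keep.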

Webpage content extraction forwarding system for mobile communication terminal and application method thereof

Inactive | CN101674374A | Solve the technical problem of not being able to send to by SMS | Easy to share | Substation equipment | Special data processing applications | Hyperlink | Text message

The invention relates to the field of mobile communication terminals, in particular to a browsing system for a mobile communication terminal and an application method thereof. The invention provides a browsing system for a mobile communication terminal that comprises a browsing module, a short message converting module, a shortening module, an identifying module and a skipping module. The browsing module is arranged in the mobile communication terminal and is used for browsing a page; the short message converting module is arranged in the mobile communication terminal and is used for sending a hyperlink by short message; the shortening module is arranged in the mobile communication terminal and replaces the hyperlink with a short link; and the identifying module is arranged on a transfer server and is used for transmitting the short link. The browsing system sends hyperlinks to users and friends by short message, lets users share network resources conveniently, solves the technical problem that an overlong hyperlink cannot be sent in a short message, and lets users send various hyperlinks by short message.

Owner:UCWEB

System, method and program for extracting web page core content based on web page layout

Inactive | CN1786947A | Avoid confusion | High precision | Special data processing applications | Document preparation | Computer science

The invention provides a system and method for extracting the core content of a web page. The system receives HTML documents (web pages) and extracts their core content. It comprises a text block analyzer, which uses HTML tags as delimiters to divide the text fragments in each usable basic structure of the input HTML document into one or more independent text blocks and outputs all the text blocks concatenated in order, where the usable basic structures contain the core content of the web page; and a text block checker, which removes the text blocks that do not contain core content and outputs the rest as the core content of the web page. The invention determines whether each text block contains advertisements or navigation information, and can therefore determine the core content accurately while also improving processing efficiency.

Owner:IBM CN

Method and system for extracting news webpage content using webpage label clustering

Inactive | CN102298638A | Special data processing applications | Content extraction | Information retrieval

The invention provides a method and system for extracting news webpage content by using webpage tag clustering. The method includes: preprocessing the webpage content, including parsing the webpage content into a DOM tree and collecting statistics on each node of the DOM tree; deleting nodes of the DOM tree heuristically; deleting nodes of the DOM tree according to rules; and clustering and deleting nodes of the DOM tree based on the tag structure, thereby generating a final DOM tree for output.

Owner:BEIJING ZHONGSOU NETWORK TECH

Apparatus and method for sharing social media content

Inactive | US20110179062A1 | Improve reliability | Search results are accurate | Digital data information retrieval | Digital data processing details | Social media | User input

An apparatus for sharing social media content includes a content management unit for, when sharable content is input by a user, extracting profile information about the sharable content by analyzing the sharable content, and generating social media content by associating the extracted profile information about the sharable content with profile information about a user to store the generated social media content in a database. Further, the apparatus for sharing the social media content includes a content searching unit for extracting an initial sample by searching the database based on keywords requested to be searched for in response to a search request of a user, and searching for the sample by comparing profile information about each piece of content included in the initial sample with one of the keywords.

Owner:ELECTRONICS & TELECOMM RES INST

Apparatus and method for displaying multimedia contents

Inactive | US20070174791A1 | Input/output for user-computer interaction | Television system details | Remote control | Telephony

Provided are an apparatus and method of displaying multimedia contents, more particularly, an apparatus and method of displaying stored multimedia contents to accommodate a user's preference using limited buttons of a remote control device or a cellular phone. The apparatus for displaying multimedia contents includes an alignment condition determination unit determining an alignment condition corresponding to a first user command signal among a plurality of alignment conditions, a detailed condition determination unit determining a detailed condition corresponding to a second user command signal among detailed conditions included in the alignment condition, a contents extraction unit extracting first multimedia contents according to the determined detailed condition, and a display unit displaying the determined alignment condition in a first region, the determined detailed condition in a second region, and a second multimedia contents selected by a user among the extracted first multimedia contents in a third region of a screen.

Owner:SAMSUNG ELECTRONICS CO LTD

Inference method and device of MicroBlog user interests

Inactive | CN105740366A | Natural language data processing | Special data processing applications | Microblogging | Relationship extraction

The invention provides a method for establishing a MicroBlog user interest inference model. The method comprises an interest label calculation model, an interest model used for MicroBlog text content extraction and a blogger interest point model used for blogger social relationship extraction, and the three models are fused through a model fusion strategy to obtain the final MicroBlog user interest inference model. The method combines personal information, MicroBlog contents and social relationships. To address the sparsity of MicroBlog contents, it adopts a USER strategy in which all MicroBlog contents of the same blogger are merged, mines the implicit themes of the MicroBlog with an LPA (label propagation algorithm), puts forward a social label propagation algorithm based on the network formed by blogger attention, and calculates the influence of the various interest labels on the blogger. The method exhibits good identification and information filtering capability, and filters false information to identify false bloggers before recommendation is carried out, so that the quality and accuracy of a recommendation system can be improved and a better experience is provided for the blogger.

Owner:HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL

Systems and methods for content extraction

Active | US8468445B2 | Maintain information | Maintain usability | Web data indexing | Natural language data processing | World Wide Web | Human language

A content extraction process may parse markup language text into a hierarchical data model and then apply one or more filters. Output filters may be used to make the process more versatile. The operation of the content extraction process and the one or more filters may be controlled by one or more settings set by a user, or automatically by a classifier. The classifier may automatically enter settings by classifying markup language text and entering settings based on this classification. Automatic classification may be performed by clustering unclassified markup language texts with previously classified markup language texts.

Owner:THE TRUSTEES OF COLUMBIA UNIV IN THE CITY OF NEW YORK
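
The parse-then-filter flow described in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the filter names (strip_scripts, strip_navigation), the settings dictionary, and the helper functions are hypothetical, and the input is assumed to be well-formed XHTML so that Python's standard xml.etree.ElementTree can build the hierarchical model. The classifier that sets the options automatically is not modelled here.

```python
import xml.etree.ElementTree as ET


def remove_tags(root, tags):
    """Filter helper: drop every element whose tag is in `tags`.

    Note: text that trails a removed element (its `tail`) is dropped too,
    which is acceptable for this sketch.
    """
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in tags:
                parent.remove(child)


# Hypothetical filter registry; each filter edits the tree in place.
FILTERS = {
    "strip_scripts": lambda root: remove_tags(root, {"script", "style"}),
    "strip_navigation": lambda root: remove_tags(root, {"nav", "footer"}),
}


def extract(markup, settings):
    """Parse markup into a tree, apply the enabled filters, return the text."""
    root = ET.fromstring(markup)  # hierarchical data model
    for name, enabled in settings.items():
        if enabled:
            FILTERS[name](root)
    pieces = (t.strip() for t in root.itertext())
    return " ".join(p for p in pieces if p)
```

Calling extract(markup, {"strip_scripts": True, "strip_navigation": True}) on a page with a nav bar, one paragraph and a script returns only the paragraph text. Real-world HTML is rarely well-formed XML, so a tolerant HTML parser would be needed in place of ElementTree.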

Method and device for extracting webpage text content

Active | CN102810097A | Improve accuracy | Special data processing applications | Content extraction

The invention discloses a method and a device for extracting webpage text content. The method includes: dividing a webpage whose text content is to be extracted into content blocks; for each content block, determining its link text length and non-link text length; determining the link text density of the block from these two lengths; and determining that the block is text content of the webpage when its link text density is not higher than a first specified threshold value. The method and the device address the low accuracy of webpage text content extraction in the prior art.

Owner:ALIBABA (CHINA) CO LTD
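
The link-text-density test in the abstract above can be illustrated with a short sketch. Several details are assumptions, since the abstract does not specify them: the density is taken to be link length divided by total length, lengths are counted in non-whitespace characters, blocks are delimited by a small hypothetical set of block-level tags (BLOCK_TAGS), and the default threshold of 0.3 is arbitrary.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit content blocks.
BLOCK_TAGS = {"p", "div", "li", "td", "article", "section"}


class BlockCollector(HTMLParser):
    """Split a page into flat blocks and track link vs. non-link text length."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_len = 0
        self._plain_len = 0
        self._a_depth = 0

    def flush(self):
        """Close the current block and store it if it contains any text."""
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_len, self._plain_len))
        self._text, self._link_len, self._plain_len = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._a_depth:
            self._a_depth -= 1

    def handle_data(self, data):
        self._text.append(data)
        n = len("".join(data.split()))  # count non-whitespace characters
        if self._a_depth:
            self._link_len += n
        else:
            self._plain_len += n


def extract_main_text(html, threshold=0.3):
    """Keep blocks whose link text density does not exceed `threshold`."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_len, plain_len in parser.blocks:
        density = link_len / (link_len + plain_len)  # assumed definition
        if density <= threshold:
            kept.append(text)
    return kept
```

With this definition a navigation bar made entirely of links has density 1.0 and is discarded, while a paragraph with a single trailing link stays well under the threshold and is kept.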
