Method and system for automatic detection of anatomical structures in medical images

By employing a hierarchical anatomical recognition scheme, two neural networks are used to detect anatomical structures of different sizes, solving the accuracy problem of automated anatomical structure recognition in medical images and enabling early identification of abnormalities such as ectopic pregnancy in early pregnancy.

CN115004223B · Active · Publication Date: 2026-06-16 · KONINKLIJKE PHILIPS NV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: KONINKLIJKE PHILIPS NV
Filing Date: 2021-01-12
Publication Date: 2026-06-16

Application Information

Patent Timeline

12 Jan 2021: Application
16 Jun 2026: Publication (CN115004223B)

IPC: G06T7/00; G06N3/045; G06T7/10; G06V10/82; G06V10/25; G06V10/26; G06V10/774

CPC: G06T7/10; G06T7/0012; G06T2207/20081; G06T2207/10132; G06T2207/20084; G06N3/084; G06V10/25; G06V10/267

AI Tagging

Application Domain

Image enhancement; Image analysis

Figure CN115004223B_ABST

Abstract

The invention relates to a computer-implemented method for automatically detecting anatomical structures (3) in a medical image (1) of a subject, the method comprising applying a target detector function (4) to the medical image, wherein the target detector function performs the following steps: (A) applying a first neural network (40) to the medical image, wherein the first neural network is trained to detect a first plurality of types of larger-sized anatomical structures (3a), thereby generating as output coordinates of at least one first bounding box (51) and a confidence score that the at least one first bounding box contains a larger-sized anatomical structure; (B) cropping (42) the medical image to the first bounding box, thereby generating a cropped image (11) containing the image content within the first bounding box (51); and (C) applying a second neural network (44) to the cropped medical image, wherein the second neural network is trained to detect at least one second type of smaller-sized anatomical structure (3b), thereby generating as output coordinates of at least one second bounding box (54) and a confidence score that the at least one second bounding box contains a smaller-sized anatomical structure.

Description

Technical Field

[0001] The present invention relates to a computer-implemented method for automatically detecting anatomical structures in medical images of a subject, a method for training a target detector function useful for detecting multiple types of anatomical structures in medical images, and related computer programs and systems.

Background Technology

[0002] Medical imaging modalities such as X-rays, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and ultrasound imaging modalities have become widely available and are frequently used for diagnostic and other medical purposes. However, the interpretation of medical images remains challenging, for example, due to various artifacts, noise, and other sources of image inaccuracy. In particular, the interpretation of ultrasound images is known to have high intra-user and inter-user variability, even among specialists such as radiologists and certified ultrasound physicians. This is exacerbated because ultrasound is often performed by interns or emergency room physicians in the context of emergency medical events. Therefore, computer-assisted methods are needed in the detection and identification of anatomical structures and / or in determining the probability of predefined medical conditions based on medical images.

[0003] One example where such computer assistance is desirable is prenatal ultrasound screening, which is recommended for every pregnant woman worldwide. The primary purpose of ultrasound scans in early pregnancy is to assess the viability of the pregnancy and to date it, count the number of fetuses, and rule out abnormal early pregnancies such as ectopic pregnancies and miscarriages. Pregnancy loss is common in this early stage, and ectopic pregnancy is a crucial but often undetected clinical abnormality that remains a significant source of maternal mortality globally. Ultrasound imaging plays a critical role in the early identification of these clinical abnormalities in early pregnancy. However, even for experts, inter- and intra-observer variability significantly limits the diagnostic value of ultrasound images.

[0004] Over the past few years, deep learning technology has made significant progress in pattern recognition, object detection, image classification, and semantic segmentation. First attempts have been made to apply artificial neural networks to locate anatomical structures in medical images.

[0005] The paper “ConvNet-Based Localization of Anatomical Structures in 3D Medical Images” by Bob D. de Vos, Jelmer M. Wolterink, Pim A. de Jong, Tim Leiner, Max A. Viergever, and Ivana Isgum (IEEE Transactions on Medical Imaging, PP, DOI: 10.1109/TMI.2017.2673121, April 19, 2017) proposes a method for automatically localizing one or more anatomical structures in 3D medical images by detecting their presence in 2D image slices using a convolutional neural network (ConvNet). A single ConvNet is trained to detect the presence of anatomical structures of interest in axial, coronal, and sagittal slices extracted from a 3D image. Spatial pyramid pooling is applied to allow the ConvNet to analyze slices of different sizes. After detection, 3D bounding boxes are created by combining the outputs of the ConvNet from all slices. The output feature map of the spatial pyramid pooling layer is connected to a sequence of two fully connected layers, which are then connected to an output layer with 2N terminal nodes, where N indicates the number of target anatomical structures. Spatial pyramid pooling allows for the analysis of images with variable input sizes.

[0006] WO 2017 / 1242221 A1 discloses a method for object detection, the method comprising: grouping object types to be detected into multiple object clusters constituting a hierarchical tree structure; obtaining an image and at least one bounding box for the obtained image; evaluating objects in each bounding box from the root cluster to the leaf clusters of the hierarchical tree structure by means of a convolutional neural network trained for each cluster in the hierarchical tree structure respectively, to determine the deepest leaf cluster of the object; and outputting an object type label at the determined deepest leaf cluster as a predicted object type label of the object.

[0007] The article "A hierarchical model for automatic nuchal translucency detection from ultrasound images" by Y. Deng, Y. Wang, P. Chen, and J. Yu (Computers in Biology and Medicine 42, 2012, pp. 706-713) proposes an algorithm for automated detection of the nuchal translucency (NT) region. When given ultrasound images, the entire fetal body is first identified and located. Then, based on knowledge of the body, the NT region and head of the fetus can be inferred from the images. The established graphical model appropriately represents this causal relationship between the target NT region, the head, and the body.

[0008] The paper "Hierarchical part detection with deep neural networks" by CERVANTES ESTEVE et al. (2016 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), IEEE, September 25, 2016, pp. 1933-1937, XP033016808) discloses an end-to-end hierarchical target and part detection framework. Accordingly, a single convolutional neural network is used to determine bounding boxes in the input image data for target detection. Furthermore, a single proposal for each part of the target is determined within the bounding box.

[0009] EP 2911111 A2 discloses an apparatus and method for lesion detection. The lesion detection method involves: detecting lesion candidates from a medical image; detecting anatomical targets from the medical image; verifying each lesion candidate based on anatomical background information including information about the positional relationship between the lesion candidate and the anatomical target; and removing one or more false-positive lesion candidates from the detected lesion candidates based on the verification result.

Summary of the Invention

[0010] The object of this invention is to provide a reliable computer-aided method for detecting and identifying anatomical structures in medical images, preferably operating at near real-time speeds on commercial hardware. Another object of this invention is to provide a computer-aided solution to identify certain anatomical structures with accuracy equal to or exceeding that of human performance. For example, it is desirable to have robust computer-aided solutions for identifying intrauterine pregnancy (IUP) characteristics and their abnormal counterparts, particularly pregnancy loss and ectopic pregnancy, thereby improving the OBGYN workflow (diagnosis of IUP and its gestational age).

[0011] Any features, advantages, or alternative embodiments of the claimed method described herein are also applicable to other types of claims and aspects of the invention, particularly training methods, claimed systems, and computer programs, and vice versa. In particular, the claimed method can provide or improve target detector functions and neural networks. Furthermore, the input / output data for the target detector function can include advantageous features and embodiments of the input / output training data, and vice versa.

[0012] According to a first aspect, the present invention provides a computer-implemented method for automatically detecting anatomical structures in a medical image of a subject, the method comprising the following steps:

[0013] a) Receiving at least one medical image of a field of view of the subject;

[0014] b) Applying a target detector function to the medical image, wherein the target detector function is trained to detect multiple types of anatomical structures, thereby generating coordinates of multiple bounding boxes and a confidence score for each bounding box as output, the confidence score giving the probability that the bounding box contains an anatomical structure belonging to one of the multiple types;

[0015] The target detector function is characterized by performing the following steps:

[0016] Applying a first neural network to the medical image, wherein the first neural network is trained to detect a first plurality of types of larger anatomical structures, thereby generating as output the coordinates of at least one first bounding box and a confidence score that the at least one first bounding box contains a larger anatomical structure;

[0017] Cropping the medical image to the first bounding box, thereby generating a cropped image containing the image content within the first bounding box;

[0018] Applying a second neural network to the cropped medical image, wherein the second neural network is trained to detect at least one second type of smaller anatomical structure, thereby generating as output the coordinates of at least one second bounding box and a confidence score that the at least one second bounding box contains a smaller anatomical structure.

[0019] Therefore, this invention provides a hierarchical anatomical recognition scheme for the automated detection of anatomical structures of different sizes or scales of detail in medical images. Given that certain anatomical structures of smaller size or lower scale of detail (smaller-sized anatomical structures) are expected to be found within another anatomical structure of larger size or higher scale of detail (larger-sized anatomical structures), this invention advantageously crops the input medical image to a bounding box containing the larger-sized anatomical structure and uses the cropped image to search for the smaller-sized anatomical structure. In this way, the neural network used to detect anatomical structures at each hierarchical level (e.g., at the larger-sized level and the smaller-sized level) requires only a very simple architecture, can be trained quickly, and is more robust, i.e., has higher average accuracy. In other words, independent and separate neural networks can be implemented at each hierarchical level, and each network can thus be trained specifically for the detection task of its level. For example, in early pregnancy ultrasound (US) images, the yolk sac is expected to be found within the gestational sac. However, it has been found that the yolk sac (YS) is a very delicate structure that cannot be trained well together with relatively large anatomical structures. To achieve better detection, a dedicated second neural network can be trained on cropped images, for example images of the gestational sac (GS) cropped from the original input medical image. This successively reduces the search area, thereby improving training and subsequent detection.
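The crop-then-detect flow described above can be illustrated with a minimal Python sketch. The `Box` type, the function names, and the two stub detectors are hypothetical stand-ins introduced only for illustration; they return fixed boxes instead of running trained networks, and the choice to crop only to boxes labelled "GS" is an assumption taken from the early pregnancy example.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]  # row-major 2D grayscale image


@dataclass(frozen=True)
class Box:
    x0: int  # left edge (inclusive), in pixels
    y0: int  # top edge (inclusive)
    x1: int  # right edge (exclusive)
    y1: int  # bottom edge (exclusive)
    label: str
    score: float  # confidence score in [0, 1]


Detector = Callable[[Image], List[Box]]


def crop(image: Image, box: Box) -> Image:
    """Return the image content inside the bounding box."""
    return [row[box.x0:box.x1] for row in image[box.y0:box.y1]]


def detect_hierarchical(image: Image, first_nn: Detector, second_nn: Detector,
                        crop_label: str = "GS") -> List[Box]:
    """Run the first detector on the full image, then the second on each crop."""
    large = first_nn(image)
    results = list(large)
    for parent in large:
        if parent.label != crop_label:
            continue
        for child in second_nn(crop(image, parent)):
            # Shift crop-relative coordinates back into full-image coordinates.
            results.append(Box(child.x0 + parent.x0, child.y0 + parent.y0,
                               child.x1 + parent.x0, child.y1 + parent.y0,
                               child.label, child.score))
    return results


# Hypothetical stand-ins for the trained networks (fixed outputs).
def stub_first_nn(image: Image) -> List[Box]:
    return [Box(0, 0, 8, 8, "U", 0.9), Box(2, 2, 6, 6, "GS", 0.8)]


def stub_second_nn(cropped: Image) -> List[Box]:
    return [Box(1, 1, 3, 3, "YS", 0.7)]


image = [[0.0] * 8 for _ in range(8)]
detections = detect_hierarchical(image, stub_first_nn, stub_second_nn)
```

Mapping the second-level boxes back into full-image coordinates is a design choice of this sketch; it makes the later comparison of bounding boxes across levels straightforward.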

[0020] In embodiments, the computer-implemented method of the present invention can be implemented on any commercial hardware, such as a conventional PC, laptop, tablet, cloud computer, server, and particularly on an ultrasound system for performing ultrasound scans. The method can also be executed on a remote computer, i.e., images can be acquired via ultrasound scan, uploaded to a remote computer or server, for example, via the Internet or a cellular connection, and processed according to the present invention. The results of the present invention (e.g., the coordinates of at least one first bounding box and one second bounding box, and confidence scores for these bounding boxes containing a certain anatomical structure) and generally also the type of said anatomical structure can be transmitted back to the ultrasound scanner or any other hardware device via the Internet or a cellular connection, at which the results of the present invention can be used to evaluate the images.

[0021] This invention utilizes trained artificial neural networks (NNs), specifically a first neural network and a second neural network used one after the other. That is, the first and second neural networks can have different inputs and / or different outputs. In this way, a hierarchical anatomical recognition scheme for predetermined anatomical structures has been implemented, and this scheme has shown excellent results in reliably detecting predefined anatomical structures. Therefore, it enables a more systematic diagnostic approach. In the case of early pregnancy, this invention can ultimately improve the clinical outcomes of abnormal early pregnancy, particularly due to the increased detection rate of ectopic pregnancies. However, this invention is not limited to cases of early pregnancy but can also be applied to many different medical conditions and clinical applications.

[0022] The medical image input to the target detector function is preferably a 2D image. This is because the preferred first and second neural networks are best suited for 2D image processing. However, this can be extended to 3D. The input medical image can be generated by any medical imaging modality (e.g., X-ray, CT, MRI, PET, or ultrasound (e.g., B-mode ultrasound, color Doppler, shear wave elastography, etc.)).

[0023] In embodiments, the method is applied to a series of medical images, which can be a temporal series of medical images, such as a sequence of medical images acquired from a moving target structure (e.g., the heart). This series of medical images can also be a series of medical images covering various fields of view, for example, a series of medical images acquired during an ultrasound scan by sweeping the probe across the region of interest. To process numerous medical images, in embodiments, the method presents its results in real-time or near real-time (e.g., at a frame rate of 10-60 FPS, preferably 20-40 FPS, where FPS means frames, i.e., images, per second).

[0024] The field of view of the input medical image can cover any region of interest within the human or animal body, such as the head or brain, limbs, parts of limbs, or any organ or group of organs within the chest, trunk, or abdomen, such as the heart, lungs, breast, liver, kidneys, reproductive organs, intestines, etc. An "anatomical structure" can be any anatomical feature identifiable within such a field of view, such as the aforementioned organs or parts of organs (e.g., uterus, gestational sac, embryo, fluid in the rectouterine pouch, ovary, ovarian sac, specific bone, blood vessel, heart valve) or abnormal structures (e.g., tumor, cyst, lesion, aneurysm) or implanted structures (e.g., screw, knee or shoulder implant, implanted heart valve), etc. In an embodiment, the target detector function is trained to detect multiple predefined anatomical structures, where each category of anatomical structure corresponds to a "type" (e.g., "uterus" could be one type, "ovary" could be another). In an embodiment, each predefined category / type of anatomical structure can be delimited from other organs on the medical image, so that a radiologist could draw a bounding box that completely encompasses the anatomical structure. In an embodiment, the bounding box generated by the target detector function is rectangular and preferably axis-aligned, that is, the four sides of the rectangle are aligned with the four edges of the medical image and / or with the sides of other bounding boxes.

[0025] In an embodiment, each input medical image is a square 2D image, i.e., it has the same number of pixels in both the width and height directions. For the purposes of this invention, it is advantageous if the target detector function is trained for a specified image size (e.g., 416 × 416 pixels). Therefore, the method of this invention may include an optional step in which the medical images (one or more) received from the imaging modality are rescaled to the specified input size for the target detector function, for example, using known interpolation techniques.
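As an illustration of the optional rescaling step, the following minimal Python sketch resizes a 2D image to a fixed square input size using nearest-neighbour interpolation. Nearest-neighbour is only one of the "known interpolation techniques" the text alludes to and is chosen here for brevity; the function name and the list-of-lists image representation are assumptions of this sketch.

```python
from typing import List

Image = List[List[float]]  # row-major 2D grayscale image


def resize_nearest(image: Image, out_h: int = 416, out_w: int = 416) -> Image:
    """Rescale an image to out_h x out_w by nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        # Each output pixel copies the source pixel at the proportional position.
        [image[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


small = [[1.0, 2.0],
         [3.0, 4.0]]
upscaled = resize_nearest(small, 4, 4)        # 2x2 -> 4x4
network_input = resize_nearest(small)         # 2x2 -> 416x416 (default size)
```

In practice a smoother scheme such as bilinear interpolation would likely be preferred for ultrasound images; the sketch only shows where the rescaling sits in the flow.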

[0026] The target detector function includes a first neural network and a second neural network, the first neural network being trained to detect larger-sized anatomical structures, and the second neural network being trained to detect smaller-sized anatomical structures. "Larger-sized" and "smaller-sized" mean that a smaller-sized anatomical structure is generally (i.e., in most subjects) smaller, or has more and finer detail, than the smallest or the average of the larger-sized anatomical structures. In embodiments, at least one smaller-sized anatomical structure is generally contained within, or is part of, a larger-sized anatomical structure. In terms of absolute size, the average diameter of (one or more) smaller-sized anatomical structures may generally be less than 40 mm, preferably less than 30 mm, most preferably less than 20 mm, and optionally, the minimum size is between 0.2 and 5 mm. The average diameter of larger-sized anatomical structures may generally be greater than 10 mm, preferably greater than 20 mm, and most preferably greater than 30 mm, and optionally, the maximum size is between 50 and 400 mm. Examples of matching sets for larger-sized structures and smaller-sized anatomical structures can be found in the table below:

[0027]

[0028] The target detector function is based on a hierarchical relationship between larger and smaller anatomical structures, wherein at least one of the first plurality of types of larger anatomical structures is expected (at least in some medical conditions) to contain one or more types of smaller anatomical structures. For example, in early pregnancy screening, the types of larger anatomical structures may include the uterus (U), the gestational sac (GS), and optionally the embryo (E). The types of smaller anatomical structures may include the yolk sac (YS) and optionally the embryo (E). Therefore, the method of the present invention allows for the automated, deep learning-based detection of IUP (intrauterine pregnancy) based on the relationships between the bounding boxes locating the uterus, GS, yolk sac, and embryo. A first level of automation (i.e., the first neural network) determines whether a GS is present within the uterus or elsewhere, since both the uterus and the GS belong to the first plurality of types of larger anatomical structures. In a next step, the medical image is cropped to a first bounding box, in this example the bounding box containing the gestational sac, thereby generating a cropped image containing the image content within the first bounding box (i.e., a smaller image that contains essentially only the GS). The second level of automation (using the second neural network) identifies the presence or absence of an embryo bounding box and a yolk sac bounding box within the cropped image (i.e., within the gestational sac). This hierarchical relationship of the bounding boxes provides for the automatic identification of IUP cases and non-IUP cases (e.g., ectopic pregnancy).

[0029] The output of the method of the present invention is bounding boxes and a confidence score for each bounding box containing a certain type of anatomical structure. The output can be visualized by outputting a medical image and / or a cropped image and displaying the bounding boxes with the highest confidence scores (e.g., in a contrasting color scheme, or overlaid on an image).

[0030] The advantages of this invention are that it allows for robust computer-aided identification of anatomical structures, recognizing both larger anatomical structures and at least one (usually multiple) types of smaller anatomical structures, which may or may not be included within a larger anatomical structure. By using this two-step method, the detection and identification of smaller anatomical structures becomes significantly more accurate and robust. When the field of view covers the uterus of a woman in early pregnancy, this invention can be used to more systematically diagnose normal or abnormal early pregnancies, particularly ectopic pregnancies, hydatidiform moles, or adnexal features.

[0031] In an embodiment, the method includes the following additional steps:

[0032] c) Using an inference scheme to determine the probability of a predefined medical condition of the subject based on the presence or absence of one or more types of anatomical structures and / or based on the relative spatial location of the detected bounding boxes containing the anatomical structures.

[0033] A predefined medical condition can be any clinical finding that can be inferred from the presence or absence of a bounding box containing anatomical structures. It is generally not a diagnosis as such, but rather a probability of a certain medical condition, such as the probability of the presence or absence of a tumor or other local abnormality (which may be a smaller anatomical structure) in a predefined organ (which may be a larger anatomical structure). In early pregnancy screening, the predefined medical condition may be, for example, one of the following:

[0034] Intrauterine pregnancy;

[0035] Ectopic pregnancy;

[0036] Pregnancy loss (e.g., when no embryo or yolk sac is found in the gestational sac).

[0037] The probability of the medical condition can be 1 or 0, or it can take a value in between, and it can depend on the confidence scores of the bounding boxes used to determine it. For example, the inference scheme can include a number of IF ELSE commands, which are implemented in the algorithm in any computational language. In the case of early pregnancy, the inference scheme could be, for example, as follows: Let the bounding box of the uterus be denoted B_U and the bounding box of the GS be denoted B_GS. If B_GS is a subset of B_U, a normal IUP is inferred; otherwise, an abnormal pregnancy is inferred. In the second level of automation using the second neural network, the yolk sac (YS) and the embryo are detected and located. Let the bounding boxes of the yolk sac and the embryo be denoted B_YS and B_E. If B_YS and / or B_E is a subset of B_GS, a normal pregnancy is inferred. If no YS and no embryo are detected in the GS, the chance of an abnormal pregnancy increases. For example, an "increased" probability of a medical condition can mean that the probability is higher than a predefined threshold, such as a value in the range of 50-95%, preferably in the range of 60-90%.
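The IF ELSE inference scheme for early pregnancy can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: boxes are assumed to be `(x0, y0, x1, y1)` tuples in full-image coordinates, `None` stands for "not detected", the returned text labels are hypothetical, and the handling of a missing gestational sac is an assumption added so that the function is total.

```python
from typing import Optional, Tuple

BoxT = Tuple[float, float, float, float]  # (x0, y0, x1, y1), full-image coordinates


def is_subset(inner: BoxT, outer: BoxT) -> bool:
    """True if the inner box lies completely inside the outer box."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def infer_pregnancy(b_u: Optional[BoxT], b_gs: Optional[BoxT],
                    b_ys: Optional[BoxT], b_e: Optional[BoxT]) -> str:
    """Rule-based inference over the uterus, GS, yolk sac and embryo boxes."""
    if b_gs is None:
        # Assumption of this sketch: without a GS no statement is made.
        return "no gestational sac detected"
    if b_u is None or not is_subset(b_gs, b_u):
        return "abnormal pregnancy suspected"
    # Second level: yolk sac and / or embryo expected inside the GS.
    found_inside = any(b is not None and is_subset(b, b_gs) for b in (b_ys, b_e))
    if found_inside:
        return "normal IUP"
    return "IUP, increased chance of abnormal pregnancy"


uterus = (0, 0, 100, 100)
gs = (20, 20, 60, 60)
ys = (30, 30, 40, 40)
normal = infer_pregnancy(uterus, gs, ys, None)
empty_sac = infer_pregnancy(uterus, gs, None, None)
outside = infer_pregnancy(uterus, (90, 90, 130, 130), None, None)
```

A probabilistic variant could weight each branch by the confidence scores of the boxes involved instead of returning fixed labels.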

[0038] Therefore, in some embodiments, the probability of a predefined medical condition is increased if a first detected bounding box containing a first type of anatomical structure (e.g., the uterus) covers a second detected bounding box containing a second type of anatomical structure (e.g., the GS). The first bounding box may completely cover (i.e., completely contain) the second bounding box, or the algorithm may allow a predefined amount of overlap. For example, a predefined percentage (e.g., at least 80%) of the second bounding box must be inside the first bounding box to increase the probability.
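The relaxed containment test with a predefined percentage can be sketched as below. The function names and the tuple layout `(x0, y0, x1, y1)` are assumptions of this sketch; the 80% default mirrors the example value in the text.

```python
from typing import Tuple

BoxT = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def fraction_inside(inner: BoxT, outer: BoxT) -> float:
    """Fraction of the inner box's area that lies within the outer box."""
    ix0, iy0 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix1, iy1 = min(inner[2], outer[2]), min(inner[3], outer[3])
    intersection = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return intersection / area if area > 0 else 0.0


def is_covered(inner: BoxT, outer: BoxT, min_fraction: float = 0.8) -> bool:
    """True if at least min_fraction of the inner box is inside the outer box."""
    return fraction_inside(inner, outer) >= min_fraction


first_box = (0, 0, 10, 10)
mostly_inside = (1, 1, 11, 9)    # 72 of 80 area units inside
half_inside = (5, 5, 15, 10)     # 25 of 50 area units inside
```

Note that this measure is normalised by the area of the second (inner) box only, which differs from the symmetric intersection-over-union measure.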

[0039] In other embodiments, the relative spatial positions of the detected bounding boxes may require not only that one bounding box is a subset of another bounding box, but also that the bounding boxes of two or more types of anatomical structures have a predetermined range of size ratios and / or a predetermined amount of overlap and / or a predetermined spatial relationship.

[0040] According to an embodiment, the method is performed iteratively on multiple two-dimensional medical images with different fields of view acquired during the same examination session of the subject, and the confidence scores of the detected bounding boxes are used to determine the one or more medical images or fields of view most suitable for further evaluation. In this case, further evaluation may mean that the medical image whose bounding box has the highest confidence score is reviewed by a skilled user or undergoes further automated image analysis, such as segmentation, modeling, feature detection, distance measurement, feature tracking, etc. Further evaluation may also refer to acquiring further images from the identified field of view, which may be done using other imaging techniques, such as Doppler ultrasound when the images used so far are B-mode US images.
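Selecting the most suitable images by confidence score could look like the following sketch. It assumes each frame's detections are available as `(label, score)` pairs and ranks frames by the highest score for a given structure type; the function name and data layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Detection = Tuple[str, float]  # (structure type, confidence score)


def best_frames(detections_per_frame: Sequence[Sequence[Detection]],
                label: str, top_k: int = 1) -> List[int]:
    """Return indices of the top_k frames with the highest score for `label`."""
    scored = []
    for index, detections in enumerate(detections_per_frame):
        scores = [score for (name, score) in detections if name == label]
        if scores:
            scored.append((max(scores), index))
    # Highest score first; ties are resolved in favour of the earlier frame.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [index for _, index in scored[:top_k]]


sweep = [
    [("GS", 0.40)],
    [("GS", 0.90), ("U", 0.80)],
    [("U", 0.95)],
    [("GS", 0.70)],
]
top_two = best_frames(sweep, "GS", top_k=2)
```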

[0041] In another embodiment, the orientation of additional 2D planes for acquiring further medical images is also determined based on the detected bounding boxes. Thus, the detected bounding boxes and corresponding input images can be used to calculate optimal planes, meaning, for example, those planes with the highest confidence scores. For example, a weighted average of the bounding boxes and the confidence scores can be used to establish confidence intervals, which are then used to calculate the optimal planes. For example, medical images that capture the largest cross-section of a certain anatomical structure are used for further evaluation. These planes / images and their corresponding bounding boxes can then be used for the automated calculation of key parameters, such as, in the example of early pregnancy screening, yolk sac diameter, average gestational sac diameter, crown-rump length of the embryo, ectopic mass, and fluid volume in the rectouterine pouch. Therefore, the invention can also be used to automatically identify standard planes / fields of view or good planes / fields of view in which further measurements of anatomical structures can be performed.

[0042] The target detector function according to the invention can be provided as a software program, but can also be implemented as hardware. In an embodiment, each detection step of the target detector function is performed by a single neural network. In a first step, a single first NN is applied to the complete medical image, and in a second step, a single second NN is applied to the cropped image. The corresponding NN can divide the input image into multiple regions and predict bounding boxes and probabilities for each region. These bounding boxes can be weighted by the predicted probabilities. In an embodiment, only bounding boxes whose confidence score exceeds a threshold in the range of 20-40% (e.g., 25% or higher) are displayed and / or considered.
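Discarding low-confidence boxes can be sketched as a simple filter. The `(label, score, box)` tuple layout and the function name are assumptions of this sketch; the default threshold of 0.25 corresponds to the 25% example in the text.

```python
from typing import List, Tuple

# (structure type, confidence score, (x0, y0, x1, y1))
Detection = Tuple[str, float, Tuple[int, int, int, int]]


def filter_detections(detections: List[Detection],
                      threshold: float = 0.25) -> List[Detection]:
    """Keep boxes at or above the threshold, highest confidence first."""
    kept = [d for d in detections if d[1] >= threshold]
    return sorted(kept, key=lambda d: d[1], reverse=True)


raw = [
    ("GS", 0.10, (5, 5, 20, 20)),
    ("U", 0.62, (0, 0, 90, 90)),
    ("GS", 0.81, (30, 30, 60, 60)),
    ("GS", 0.25, (31, 29, 61, 58)),
]
shown = filter_detections(raw)
```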

[0043] In embodiments, the first artificial neural network (NN) and the second artificial neural network (NN) forming the target detector function have similar or identical architectures, but are trained to detect different types of anatomical structures, and possibly a different number of types. The second NN is more specialized for detecting specific, smaller anatomical structures that are difficult to detect in the complete medical image that is input to the first NN. The first NN and the second NN can also be of different kinds. In the following text, whenever the term "neural network" is used, it means the first NN and / or the second NN, preferably both, but not necessarily both.

[0044] Artificial neural networks (NNs) are based on a collection of connected artificial neurons (also called nodes), where each connection (also called an edge) is capable of transmitting a signal from one node to another. Each receiving artificial neuron can process the signal and transfer it to other artificial neurons connected to it. In useful embodiments, the artificial neurons of a first NN and a second NN are arranged in layers. The input signal (i.e., the pixel values of a medical image) travels from the first layer (also called the input layer) to the last layer (output layer). In embodiments, the first NN and / or the second NN are feedforward networks. The first NN and the second NN preferably include several layers (including hidden layers), and are therefore deep neural networks. In embodiments, the first NN and the second NN are trained based on machine learning techniques (particularly deep learning, e.g., backpropagation). Alternatively, the first NN and the second NN can be provided as software functions, which are not necessarily structured in exactly the same way as the trained neural networks. For example, if some connections or edges have weights of 0 after training, such connections can be omitted when providing the target detector function.

[0045] According to embodiments, the first neural network and / or the second neural network does not include fully connected layers, i.e., layers where each node can be connected to every node in the subsequent layer. In embodiments, the first NN and / or the second NN includes at least one convolutional layer. In embodiments, the first neural network and / or the second neural network (preferably both) are fully convolutional neural networks (CNNs). A fully convolutional NN can be defined as a convolutional NN without fully connected layers. A convolutional layer applies relatively small filter kernels across the entire layer, such that each neuron within that layer is connected only to a small region of the preceding layer. This architecture ensures that the learned filter kernels produce the strongest response to spatially local input patterns. In embodiments of the invention, the parameters of the convolutional layer include a set of learnable filter kernels that have small receptive fields but extend through the full depth of the layer volume. During forward propagation through the convolutional layer, each filter kernel is convolved across the width and height of the input layer, the dot product between the entries of the filter kernel and the input layer is computed, and an output map associated with that filter kernel is produced. Stacking the output maps for all filter kernels along the depth dimension forms the full output volume of the convolutional layer (also referred to herein as a feature map).

[0046] Convolutional layers are typically defined by the size (dimension) and the stride of their filter kernels. A single convolutional layer can include several filter kernels, each producing a different output map. The output maps of all filter kernels are stacked together along the depth dimension, forming the output volume or feature map. The filter kernels typically extend through the full depth of the input volume. Therefore, if the input dimension for a convolutional layer is 416×416×3 and the size of the filter kernel is 3×3, the dimension of the convolutional filter kernel is effectively 3×3×3. Each such filter kernel produces a single output map. The stride of the filter kernel is the number of pixels by which the filter kernel shifts as it moves across the input layer / volume during convolution. Therefore, a filter kernel with a stride of 2 reduces the output layer dimension to half that of the input layer.
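The relationship between input size, kernel size, and stride described in paragraph [0046] can be sketched as follows. This is only a minimal illustration: the function names `conv_output_size` and `kernel_shape` and the padding value of 1 are assumptions introduced here and are not part of the disclosure.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one axis (standard formula)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def kernel_shape(kernel_size: int, input_depth: int) -> tuple:
    """A filter kernel extends through the full depth of its input volume."""
    return (kernel_size, kernel_size, input_depth)


# A 416x416x3 input and a 3x3 kernel: the kernel is effectively 3x3x3.
shape = kernel_shape(3, 3)

# Padding of 1 is an assumption (size-preserving padding for a 3x3 kernel).
same_size = conv_output_size(416, kernel_size=3, stride=1, padding=1)
halved = conv_output_size(416, kernel_size=3, stride=2, padding=1)
print(shape, same_size, halved)
```

Under these assumptions, a stride of 1 preserves the 416-pixel dimension, while a stride of 2 halves it to 208.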

[0047] In an embodiment, the first neural network (NN) and / or the second NN comprises 3 to 14 layer blocks, preferably 4 to 10 layer blocks. Each block includes a convolutional layer employing a plurality of filters, each filter having a filter kernel of size 3×3 and a stride of 1, followed by a max-pooling layer of size 2×2 and a stride of 2. Thus, each such block reduces the layer dimension by half. In an embodiment, the convolutional layers of the first NN and / or the second NN downsample the input medical image by a factor of 2^3 to 2^7, preferably 2^4 to 2^6, for example by 2^5 = 32. In this example, using an input image of size 416×416, the output feature map can have a dimension of 13×13.

[0048] By using a combination of convolutional and max-pooling layers, the dimension of the image is reduced as the image travels through the first NN and / or the second NN, resulting in a 3D tensor that encodes the coordinates of several bounding boxes, the confidence score for each bounding box, and type predictions (e.g., the probability that a detected target belongs to one of the predefined types of anatomical structure).

[0049] In embodiments, the first neural network and / or the second neural network are adaptations of the YOLOv3 network, particularly adaptations of the YOLOv3 micronetwork (the tiny variant of YOLOv3). YOLOv3 is described in J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement" (arXiv preprint arXiv:1804.02767, 2018), which is publicly available at https://arxiv.org/abs/1804.02767. By training on a miniature version of YOLOv3, this detection can run comfortably without requiring any additional hardware support.

[0050] Output of the first NN / second NN: In this embodiment, for each possible bounding box, the first NN and / or the second NN predicts a confidence score ("objectness") and type probabilities. The confidence score gives the probability that the bounding box contains an anatomical structure belonging to any of the plurality of types, and the type probabilities are the probabilities that the object (anatomical structure) in the bounding box belongs to each of the trained types. If there are ten different types of objects / anatomical structures (on which the network has been trained), the network will predict ten probability values for each bounding box. Only bounding boxes whose confidence scores and type probabilities exceed a certain predefined threshold are considered.
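The thresholding described in paragraph [0050] can be sketched as follows. This is a minimal illustration under stated assumptions: the `Detection` container, the function `filter_detections`, and the threshold values of 0.5 are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    box: tuple                # (x_min, y_min, x_max, y_max)
    objectness: float         # probability that the box contains any trained structure
    type_probs: List[float]   # one probability per trained type


def filter_detections(dets, obj_thresh=0.5, type_thresh=0.5):
    """Keep boxes whose objectness and best type probability both exceed thresholds."""
    return [d for d in dets
            if d.objectness > obj_thresh and max(d.type_probs) > type_thresh]


# Hypothetical candidates for three trained types.
candidates = [
    Detection((10, 10, 200, 180), 0.92, [0.85, 0.10, 0.05]),
    Detection((40, 50, 120, 110), 0.30, [0.20, 0.70, 0.10]),  # low objectness
    Detection((60, 70, 100, 95), 0.80, [0.30, 0.35, 0.35]),   # no confident type
]
kept = filter_detections(candidates)
```

Only the first candidate passes both thresholds in this example.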

[0051] In an embodiment, the final output of the first NN and / or the second NN of the present invention is generated by applying a 1×1 detection kernel to the (final) downsampled feature map (also referred to as the output grid). In an embodiment of the present invention, the final output of the first NN and / or the second NN is generated by applying a 1×1 detection kernel to two feature maps of different sizes at two different locations in the network. The shape of the detection kernel is 1×1×(B*(5+C)). Here, B is the number of bounding boxes that a grid cell on the feature map can predict, "5" refers to the four bounding box attributes (offsets in the x and y directions and width / height offsets from the anchor boxes, as explained below) plus one object confidence score (also referred to as "objectness"), and C is the number of types. Therefore, if the dimension of the final feature map is N×N (e.g., 9×9, 13×13, or 26×26), then for B anchor boxes, 4 bounding box attributes, 1 objectness prediction, and C type predictions, the size of the 3D output tensor is N×N×[B*(4+1+C)]. If B=3, each grid cell in the last layer can predict up to three bounding boxes, corresponding to three anchor boxes. For each anchor box, the tensor includes a confidence score (i.e., the probability that the box contains a target), four numbers representing the bounding box coordinates relative to the anchor box, and a probability vector containing the probability that the target belongs to each of the predefined types. From the probabilities of the different types, logistic regression is used to predict a score for each type, and a threshold is used to predict one or more annotations for each detected anatomical structure. Types whose probability is above the threshold are assigned to the bounding box.
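The tensor dimensions given in paragraph [0051] can be checked with a few lines of arithmetic. The sketch below assumes B = 3 anchor boxes and C = 3 types, with grids of 13×13 and 26×26; the function names are introduced here for illustration only.

```python
def output_tensor_shape(grid_size: int, num_anchors: int, num_types: int) -> tuple:
    """Shape N x N x [B * (4 + 1 + C)] of the detection output for one scale."""
    return (grid_size, grid_size, num_anchors * (4 + 1 + num_types))


def max_boxes(grid_sizes, num_anchors: int) -> int:
    """Upper bound on the number of predicted boxes over all scales."""
    return sum(n * n for n in grid_sizes) * num_anchors


B, C = 3, 3  # assumed: three anchor boxes per cell, three trained types
shape_scale1 = output_tensor_shape(13, B, C)
shape_scale2 = output_tensor_shape(26, B, C)
total = max_boxes([13, 26], B)
print(shape_scale1, shape_scale2, total)
```

With these assumed values, each grid cell carries 3 × (4 + 1 + 3) = 24 numbers, and the two grids together yield at most (169 + 676) × 3 = 2535 candidate boxes.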

[0052] Anchor boxes (priors) / bounding boxes: In this embodiment, as in YOLOv3, the neural network predicts the location of bounding boxes in terms of xy offsets relative to specific cells in the output grid (e.g., a 13×13 grid). Once the image is divided into the final grid, for each target (anatomical structure), the grid cell containing the target's center is identified, and this grid cell is then "responsible" for predicting the target. Therefore, the center of each bounding box is described by its offset from the responsible cell. Furthermore, instead of directly predicting the width and height of the bounding boxes, the neural network in this embodiment (like YOLOv3) predicts width and height offsets relative to prior boxes (also referred to herein as anchor boxes). Thus, during training, the network is trained to predict offsets from a pre-determined set of anchor boxes having specific height-to-width ratios, which are determined by clustering the training data. The coordinates of the annotation boxes in the training data are clustered into the required number of anchor boxes, for example, 6 anchor boxes in the YOLOv3 micro-network. Typically, the k-means clustering algorithm is used to generate the set of anchor boxes. The intersection over union (IoU) between the ground-truth boxes and the anchor boxes is typically used as the distance metric for k-means clustering. In embodiments of the first and / or second neural network of the present invention, each grid cell in the output tensor predicts three bounding boxes in terms of their height and width offsets from the three anchor boxes. In other words, with an output grid size of 13×13, a maximum of 13×13×3 bounding boxes can be detected.
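The anchor box clustering mentioned in paragraph [0052] can be sketched as k-means over (width, height) pairs with 1 − IoU as the distance. This is a simplified illustration, not the procedure used by the applicant: the deterministic initialisation from the first k boxes, the function names, and the sample box sizes are all assumptions.

```python
def iou_wh(box, anchor):
    """IoU of two boxes given as (width, height), aligned at a common corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=50):
    """Cluster (width, height) pairs into k anchors using 1 - IoU as distance."""
    anchors = list(boxes[:k])  # simplistic, deterministic initialisation (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Highest IoU corresponds to the smallest 1 - IoU distance.
            best = max(range(k), key=lambda i: iou_wh(b, anchors[i]))
            clusters[best].append(b)
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:  # keep the anchor of an empty cluster unchanged
                new_anchors.append(anchors[i])
                continue
            w = sum(m[0] for m in members) / len(members)
            h = sum(m[1] for m in members) / len(members)
            new_anchors.append((w, h))
        if new_anchors == anchors:
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Hypothetical annotation box sizes in pixels: small and large structures.
boxes = [(20, 18), (300, 220), (22, 20), (310, 240), (18, 16), (290, 230)]
anchors = kmeans_anchors(boxes, k=2)
```

In practice, a random or k-means++ style initialisation and a larger k (e.g., 6) would typically be used; the fixed initialisation here only keeps the example reproducible.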

[0053] Cross-scale detection: According to embodiments of the invention, the first neural network (NN) and / or the second NN detects anatomical structures at two to three (preferably two) different scales, each scale being given by a predetermined downsampling of the medical image. This concept was termed "cross-scale prediction" by the authors of YOLOv3. To this end, concatenation is used to merge earlier layers in the NN with later layers (which are first upsampled). This is done because smaller targets are easier to detect in earlier, higher-resolution layers, while they are less easily detected in later, lower-resolution layers with significant downsampling; the later layers, however, contain semantically strong features. By merging the earlier, higher-resolution layers with the upsampled later feature maps, this method obtains more meaningful semantic information from the upsampled feature maps and more fine-grained information from the earlier feature maps. Thus, a second output tensor can be predicted, the size of which is twice the size of the prediction at scale 1. In YOLOv3, this is done at three scales, while the network of the present invention preferably makes predictions only at two different scales, where one scale is in the most downsampled layer and the other is in a layer whose dimension is 2^1 to 2^2 times (e.g., twice) the dimension of the last layer. The neural network of this invention typically predicts (N×N + 2N×2N)×3 bounding boxes for two different scales. Bounding boxes are filtered out using thresholds based on confidence scores and type probabilities, and according to an embodiment, another filter, known as non-maximum suppression, is applied as a function of the intersection over union (IoU) between two bounding boxes. The key steps of the non-maximum suppression filter are as follows:

[0054] a) Select a bounding box (that meets the thresholds for type probability and confidence score);

[0055] b) Calculate its overlap with all other boxes that meet the thresholds, and remove boxes whose overlap is greater than a predetermined IoU threshold;

[0056] c) Return to step a) and iterate until there is no box with a lower confidence score than the currently selected box.

[0057] More information about non-maximum suppression can be found at https://arxiv.org/pdf/1704.04503.pdf.

[0058] This ensures that the optimal bounding box (especially the optimal bounding box between two scales) is still the output.
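The non-maximum suppression steps in paragraphs [0054] to [0056] can be sketched as a greedy loop. This is a minimal illustration assuming that boxes are processed in order of decreasing confidence score and that all boxes belong to the same type; the function names and the IoU threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(dets, iou_thresh=0.5):
    """dets: list of (box, confidence) that already passed the score thresholds."""
    remaining = sorted(dets, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # select the highest-scoring remaining box
        kept.append(best)
        # Remove boxes that overlap the selected box by more than the threshold.
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_thresh]
        # The loop then continues with the next lower-scoring box, if any.
    return kept


dets = [((10, 10, 110, 110), 0.9),
        ((12, 12, 112, 112), 0.8),     # near-duplicate of the first box
        ((200, 200, 260, 260), 0.7)]
result = non_max_suppression(dets)
```

In this example, the near-duplicate box is suppressed and the two non-overlapping boxes are kept. When several types are detected, the same loop would typically be run separately per type.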

[0059] The greatest advantage of using this network architecture is that both the training and the robustness of the network are greatly improved, especially when the network is trained on only a limited number of types, as can be the case for clinical applications.

[0060] The size of the input medical image is preferably an odd multiple of 2^Z, where Z can be an integer between 4 and 10, so that after several (e.g., Z) downsampling steps, the final grid will have an odd dimension, such as 7×7, 9×9, 11×11, or 13×13. Therefore, there will be a central grid cell, which is advantageous because larger targets / anatomical structures are often found in the center of the image, so it is advantageous that a single grid cell is responsible for detecting the largest targets.
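The size constraint in paragraph [0060] can be illustrated with a short check. The function name `final_grid_size` is hypothetical; the example sizes 416 = 13 × 2^5 and 288 = 9 × 2^5 are chosen to match the grid sizes mentioned in the text.

```python
def final_grid_size(image_size: int, z: int) -> int:
    """Grid size after z halvings; raises if image_size is not an odd multiple of 2**z."""
    factor = 2 ** z
    if image_size % factor != 0:
        raise ValueError(f"{image_size} is not a multiple of {factor}")
    grid = image_size // factor
    if grid % 2 == 0:
        raise ValueError(f"{image_size} is an even multiple of {factor}")
    return grid


grid = final_grid_size(416, z=5)      # 416 = 13 * 2**5
center_cell = (grid // 2, grid // 2)  # the single central cell of an odd grid
print(grid, center_cell)
```

An input size such as 512 would be rejected by this check, because 512 / 32 = 16 is even and the resulting grid has no single central cell.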

[0061] According to a preferred embodiment, the target detector function is trained to detect 2 to 12 (preferably 3 to 6) types of anatomical structures. For example, the first neural network can be trained to detect 2-10 (preferably 3-6) types. For example, in the case of first-trimester screening, the types are gestational sac (GS), uterus, and embryo. The second neural network can be trained to detect even fewer types (e.g., 1-4 types, or only one type, such as the yolk sac). By reducing the number of types in this way, the first and second neural networks can be made very small, with only 9-16 (preferably 13) convolutional layers each, thus being very fast and robust.

[0062] According to an embodiment, method step a) includes receiving a video stream of medical images acquired during an ultrasound scan of a subject. Therefore, aspects of the invention can be applied to time-series medical images at a frame rate of 20-100 frames per second. Furthermore, if the ultrasound images are encoded in the video stream, they will not be grayscale images, but rather color images with three channels. Typically, all three channels have the same value because ultrasound images are grayscale images. Therefore, the input to the target detector function can be a medical image in a three-channel format (e.g., RGB). This has the advantage that the same neural network architecture used for photographic images can be adapted for medical image processing.
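The three-channel input format described in paragraph [0062] can be illustrated by replicating a grayscale frame into three identical channels. This is a minimal sketch using nested lists; the function name and the sample frame are hypothetical.

```python
def gray_to_three_channel(gray):
    """Replicate each grayscale pixel value into three identical channels."""
    return [[(v, v, v) for v in row] for row in gray]


frame = [[0, 128], [255, 64]]  # hypothetical 2x2 grayscale frame
rgb = gray_to_three_channel(frame)
```

A frame decoded from a video stream would already arrive in this form, so no conversion would be needed in that case.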

[0063] According to a useful application, the medical images are acquired during an ultrasound scan in early pregnancy, and the various types of anatomical structures include the uterus, gestational sac, embryo, and / or yolk sac. Most preferably, the first neural network (NN) is trained to detect a first plurality of types of larger anatomical structures (including the uterus, gestational sac, and embryo). The second NN can then be trained to detect the yolk sac and possibly other smaller anatomical structures.

[0064] According to an embodiment of the inference scheme for the application of "early pregnancy screening," the probability of a "normal pregnancy" is increased if the detected bounding box of the uterus includes the detected bounding box of the gestational sac, and the detected bounding box of the gestational sac includes the detected bounding box of the embryo and / or yolk sac. The hierarchical inference scheme of the bounding boxes provides automatic identification of normal intrauterine pregnancy (IUP) cases and non-IUP cases (such as ectopic pregnancies). Therefore, it enables a more systematic diagnostic approach and allows for the detection of abnormal pregnancies using a simple scheme.
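The nesting test underlying paragraph [0064] reduces to checking whether one axis-aligned box lies completely inside another. The sketch below assumes boxes in the form (x_min, y_min, x_max, y_max) in common image coordinates; the function name and the coordinates are illustrative only.

```python
def contains(outer, inner):
    """True if box `inner` lies completely inside box `outer`.

    Boxes are (x_min, y_min, x_max, y_max) in the same image coordinates.
    """
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


# Hypothetical detections from one image.
uterus = (50, 40, 400, 350)
gestational_sac = (150, 120, 300, 260)
embryo = (200, 170, 250, 220)

nested = contains(uterus, gestational_sac) and contains(gestational_sac, embryo)
```

A practical implementation might tolerate a small margin, since detected boxes rarely align exactly with anatomical boundaries; that tolerance is not modelled here.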

[0065] According to another aspect, the present invention provides a method for training a target detector function useful in detecting multiple types of anatomical structures in medical images, the target detector function comprising a first neural network, the method comprising:

[0066] (a) Receive input training data, i.e., at least one medical image of a field of view of a subject;

[0067] (b) Receive output training data, i.e., tensors, the tensors including the coordinates of at least one first bounding box containing a larger anatomical structure within the medical image and numbers indicating the type of the larger anatomical structure, the larger anatomical structure belonging to one of a first plurality of types of larger anatomical structures;

[0068] (c) The first neural network is trained by using the input training data and the output training data.

[0069] The target detector function described herein can be trained using this training method, wherein the first neural network being trained is preferably constructed as described herein. Furthermore, the input training data is medical images of the field of view described herein, such as B-mode ultrasound images of the human body, particularly medical images of reproductive organs in early pregnancy. To generate output training data, the medical images constituting the input training data can be manually annotated with axis-aligned bounding boxes, each bounding box completely covering an anatomical structure. The data generated therefrom is a tensor, which includes the coordinates of at least one first bounding box containing the larger-sized anatomical structure and a number indicating the type of the anatomical structure, wherein the different types used are, for example, uterus, GS, and embryo. In an embodiment, the target detector function can be trained using a dataset derived from 5-15 subjects, each subject having approximately 100-1000 images. Bounding boxes are drawn for all possible anatomical structures present in these images. The training steps can be performed similarly to those of the Darknet framework developed at https://pjreddie.com/darknet/yolo/. However, it is preferable to adjust the configuration parameters to achieve better performance in terms of training speed and training loss. In particular, it is preferable to adjust the learning rate and batch size. In the embodiments of the training method, the following parameters have been used:

[0070] Batch size during training = 64

[0071] Number of subdivisions during training = 16

[0072] Maximum number of batches = 500200

[0073] Number of filters in the final layer = (number of types + 5) × 3

[0074] Anchor boxes depend on the image resolution, i.e., the dimensions of the bounding box.

[0075] Number of steps = 24000, 27000

[0076] During the training phase of the first and / or second neural network, a batch of images is typically read and fed as input training data. Each batch is divided into mini-batches. Let the batch size be N and the number of mini-batches (subdivisions) be n. Then, (N / n) images are fed into the network at a time. The choice of n depends on the available GPU resources. By using fewer subdivisions, the mini-batch size used to compute gradients is increased, and calculating gradients based on larger mini-batches yields better optimization. According to deep learning conventions, a batch can also be considered as an epoch.
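The arithmetic behind the training parameters above can be written out as follows. The batch size and number of subdivisions are the values stated in the text; the number of types (3) is an assumption chosen for illustration.

```python
batch = 64            # images per batch (value given in the text)
subdivisions = 16     # number of mini-batches per batch (value given in the text)
num_types = 3         # assumption: e.g. uterus, GS, embryo
anchors_per_cell = 3  # three anchor boxes per grid cell, as described above

mini_batch = batch // subdivisions                  # images fed to the network at once
final_filters = (num_types + 5) * anchors_per_cell  # filters in the final layer

print(mini_batch, final_filters)  # prints "4 24"
```

With these values, four images are processed per forward pass, and a network trained on three types needs 24 filters in its final layer.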

[0077] The training method further includes a step of training the second NN, comprising the following additional steps:

[0078] (d) Receive input training data, i.e., cropped images, the cropped images including the image content of a first bounding box containing a larger anatomical structure;

[0079] (e) Receive output training data, i.e., tensors, the tensors including the coordinates of at least one second bounding box containing a smaller anatomical structure within the cropped image, the smaller anatomical structure belonging to at least one second type of smaller anatomical structure;

[0080] (f) The second neural network is trained by using the input training data and the output training data.

[0081] Therefore, the same input training data as used for training the first neural network can be used, but in this case, cropped images containing the bounding boxes of larger anatomical structures are used as the input training data. The output training data is the bounding box coordinates of the smaller anatomical structures.

[0082] The training method described herein can be used to provide a target detector function for detecting anatomical structures, i.e., to initially train a first neural network (NN) and / or a second NN. It can also be used to recalibrate already trained networks. The training of the first and second NNs can be performed via backpropagation. In this method, the input training data is propagated through the corresponding NN using pre-determined filter kernels. The resulting output is compared with the output training data using an error function or cost function, and the error is backpropagated through the NN, thereby calculating gradients to find the filter kernels, and possibly other parameters (e.g., biases), that produce the minimum error. This can be done by adjusting the weights of the filter kernels and following the negative gradient of the cost function.

[0083] This invention also relates to a computer program comprising instructions that, when executed by a computing unit, cause the computing unit to perform the method of the invention. This applies to the method for automatically detecting anatomical structures in medical images, as well as to the training methods (particularly for training a first neural network and a second neural network). The computer program can be implemented using Darknet. This is a framework developed for training neural networks; it is open source, written in C / CUDA, and serves as the basis for YOLO. The repository and wiki are available at https://pjreddie.com/darknet/. The computer program can be delivered as a computer program product.

[0084] The computing unit capable of running the methods of the present invention can be any processing unit, such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit). The computing unit can be a part of a computer, cloud, server, or mobile device (e.g., a laptop, tablet, mobile phone, etc.). In particular, the computing unit can be a part of an ultrasound imaging system. The ultrasound imaging system may also include a display, such as a computer screen.

[0085] The present invention also relates to a computer-readable medium comprising instructions that, when executed by a computing unit, cause the computing unit to perform a method according to the invention, particularly a method or training method for automatically detecting anatomical structures in medical images. Such a computer-readable medium can be any digital storage medium (e.g., hard disk, server, cloud, or computer, as well as optical or magnetic digital storage media, CD-ROM, SSD card, SD card, DVD, or USB or other storage sticks). Computer programs can be stored on the computer-readable medium.

[0086] In one embodiment, the method further includes the step of displaying the medical image together with at least one first bounding box and at least one second bounding box displayed in contrasting colors, allowing the user to check the accuracy of the prediction.

[0087] According to another aspect, the present invention relates to a system for automatically detecting anatomical structures in medical images of an object, the system comprising:

[0088] a) A first interface configured to receive at least one medical image of a field of view of the subject;

[0089] b) A computing unit configured to apply a target detector function to the medical image, wherein the target detector function is trained to detect multiple types of anatomical structures, thereby generating coordinates of multiple bounding boxes and a confidence score for each bounding box as output, the confidence score giving the probability that the bounding box contains an anatomical structure belonging to one of the multiple types, wherein the computing unit is configured to perform the following steps:

[0090] The first neural network is applied to the medical image, wherein the first neural network is trained to detect a first plurality of larger anatomical structures, thereby generating the coordinates of at least one first bounding box and a confidence score of the at least one first bounding box containing a larger anatomical structure as output.

[0091] The medical image is cropped to the first bounding box, thereby generating a cropped image containing the image content within the first bounding box;

[0092] The second neural network is applied to the cropped medical image, wherein the second neural network is trained to detect at least one smaller anatomical structure of a second type, thereby generating the coordinates of at least one second bounding box and a confidence score of the at least one second bounding box containing the smaller anatomical structure as output.

[0093] The system is preferably configured to run the method of the present invention for automatically detecting anatomical structures in medical images. The computing unit can be any processing unit as described above in connection with the computing unit running the computer program. The system can be implemented on an ultrasound imaging system, particularly on one of its processing units (e.g., a GPU). However, medical images can also be transferred from the imaging system to another computing unit, either locally or remotely, for example via the Internet, and the coordinates of the bounding boxes, or even the probabilities of predefined medical conditions, can be transmitted back to the imaging system from the local or remote computing unit and displayed or otherwise output to the user. In embodiments, the system may include a second interface for outputting the bounding box coordinates (particularly for outputting medical images in which the first and second bounding boxes are drawn). Therefore, the second interface can be connected to a display device, such as a computer screen, touchscreen, etc.

[0094] In addition, the present invention relates to a system for training a target detector function, and in particular a system for training a first NN and / or a second NN by means of the training method described herein.

[0095] According to another aspect, the present invention relates to an ultrasound imaging system comprising an ultrasound transducer and a computing unit, the ultrasound transducer being configured to transmit and receive ultrasound signals, and the computing unit being configured to apply a target detector function to the medical image as described herein, the ultrasound imaging system comprising the system according to the invention. Because the method of the present invention has low computational cost, it can operate on existing ultrasound imaging systems.

Description of the Drawings

[0096] Useful embodiments of the invention will now be described with reference to the accompanying drawings. Similar elements or features are referred to by the same reference numerals. These drawings depict the following:

[0097] Figure 1: Medical image of a fetal ultrasound scan of a subject with a gestational age of 8 weeks and 4 days, with annotated bounding boxes;

[0098] Figure 2: A flowchart of an embodiment of the detection method according to the present invention;

[0099] Figure 3: A flowchart of another embodiment of the detection method of the present invention;

[0100] Figure 4: A flowchart of the inference scheme according to an embodiment of the present invention;

[0101] Figure 5: An example of the localization of anatomical structures that can be achieved using embodiments of the present invention, wherein (a) shows bounding boxes around the uterus, gestational sac (GS), and embryo, and (b) shows a bounding box for the yolk sac (YS) in a cropped GS image;

[0102] Figure 6: A schematic diagram of the first neural network (NN) and / or the second NN;

[0103] Figure 7: A flowchart of a training method according to an embodiment of the present invention;

[0104] Figure 8: A schematic diagram of a system according to an embodiment of the present invention.

[0105] List of reference numerals

[0106]

[0107] Detailed Implementation

[0108] Figure 1 shows a possible training image for training the target detector function, namely a 2D B-mode medical ultrasound image acquired during an early pregnancy scan at 8 weeks and 4 days gestation. Bounding boxes have been drawn and annotated by a human to generate the output training data. The largest bounding box is drawn around the uterus (U), and another bounding box is drawn around the gestational sac (GS). Both the embryo (E) and the yolk sac (YS) are visible within the gestational sac, increasing the probability of a normal pregnancy, as opposed to the case where the embryo is not visible within the GS.

[0109] Figure 2 illustrates an embodiment of a method for detecting anatomical structures in medical images 1 (e.g., a series of 2D ultrasound images 1a, 1b, 1c). Each of these images covers a slightly different field of view 2, and organs or anatomical structures 3 can be distinguished on the image 1. These images are passed one after another through a target detector function 4, which is described in detail below. As described herein, the target detector function 4 preferably includes at least two neural networks 40, 44. The output of the target detector function 4 is at least one bounding box 5, or its coordinates, and a confidence score that the at least one bounding box 5 contains a specific anatomical structure. The confidence score can be the objectness, i.e., the probability that box 5 contains a target / anatomical structure, and / or the probability that the target is of a specific type.

[0110] In a useful embodiment, in step 8, the input image 1 is displayed with the bounding boxes 5 that have a sufficiently high confidence score drawn in, for example, on a display device (e.g., a screen) that can be connected to the image acquisition unit. Then, in step 6, the probability of a predefined medical condition of the subject (e.g., a normal / abnormal condition, such as an IUP or a non-IUP pregnancy) can be determined based on the detected anatomical structures 3, their spatial locations, and / or their relationship to each other. Therefore, the inference scheme 6 uses the bounding boxes 5 calculated by the target detector function 4 and may include an algorithm capable of calculating, for example, whether a bounding box 5 of a particular type is completely included in a bounding box of another type, and whether a certain type of anatomical structure 3 is present or absent. Moreover, the relative spatial locations of the bounding boxes 5 can be calculated and used to derive an appropriate probability 7 for the medical condition.

[0111] Figure 3 illustrates the target detector function 4 in more detail: the input is again one or more medical images 1 of the field of view 2, wherein at least some of the images depict organs or anatomical structures 3. In an embodiment, the received medical images can have any dimension and pixel size, whereas the first NN 40 works best on square images of size M·2^Z × M·2^Z, where M is an odd number. Therefore, in step 39, the medical image 1 is optionally upsampled or downsampled to fit the expected input dimension of the first neural network 40. The output of the first NN 40 is at least the coordinates (typically also the confidence scores) of the bounding boxes 50, 51, and 52. If one of the detected bounding boxes has a type belonging to a pre-determined larger anatomical structure 3a, the detected bounding box 50 is used in the cropping step 42 to crop the medical image 1 to the first bounding box 50, thereby generating a cropped image 11. "Cropping" means, for example, the operation performed by a cropping tool in photo processing, i.e., cutting a smaller image 11 from the larger image 1, with the cut edges running along the edges of the bounding box 50. Therefore, the cropped image 11 is not necessarily a square image. It is therefore preferable to subject the image to a downsampling or upsampling step 45, such that the cropped image 11 has a predefined dimension (e.g., a square 2D image, as for the first NN), and then feed it to the second NN 44. The output of the second neural network is then at least one second bounding box 54 containing a smaller anatomical structure 3b. This smaller anatomical structure 3b is typically very small or has very fine detail relative to the field of view 2 or the overall organ or structure being imaged, making it difficult to train the first neural network 40 to detect it. 
However, if the expected location of such a structure is known, it is possible to first crop the image 1 around the bounding box 50 and then train the second neural network 44 (possibly exclusively) to detect this second type of smaller anatomical structure 3b without difficulty.
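The two-stage flow of Figure 3 (first NN, cropping step 42, second NN) can be sketched as follows. The detectors are stand-in callables used only to exercise the control flow; `crop`, `to_full_image`, and `detect_hierarchically` are hypothetical names, and the optional resampling steps 39 and 45 are omitted for brevity.

```python
def crop(image, box):
    """Cut the region (x_min, y_min, x_max, y_max) out of a row-major 2D image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def to_full_image(box_in_crop, crop_box):
    """Map a box found in the cropped image back to full-image coordinates."""
    dx, dy = crop_box[0], crop_box[1]
    x0, y0, x1, y1 = box_in_crop
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def detect_hierarchically(image, first_nn, second_nn):
    """first_nn / second_nn are hypothetical callables returning a list of boxes."""
    results = []
    for large_box in first_nn(image):
        cropped = crop(image, large_box)
        for small_box in second_nn(cropped):
            results.append((large_box, to_full_image(small_box, large_box)))
    return results


# Stand-in detectors: fixed outputs instead of trained networks.
image = [[0] * 100 for _ in range(80)]
fake_first = lambda img: [(20, 10, 70, 60)]
fake_second = lambda img: [(5, 5, 15, 15)] if len(img) == 50 and len(img[0]) == 50 else []
pairs = detect_hierarchically(image, fake_first, fake_second)
```

If the cropped image is resampled before being fed to the second network, the coordinates returned by the second network would additionally have to be scaled back by the resampling factor before the offset is applied.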

[0112] Figure 4 is a schematic diagram of an embodiment of the inference scheme 6, which uses the bounding boxes calculated by the first NN and the second NN. In an example of an early pregnancy ultrasound scan, inference step 60 may determine whether a bounding box for the gestational sac (GS) exists. If yes, the method proceeds to step 61. If no, the chance of it being an abnormal pregnancy (medical condition 7c) increases. In step 61, the algorithm determines whether the bounding box for the GS is a subset of the bounding box for the uterus. If yes, the probability of it being a normal IUP (condition 7a) increases, and the method proceeds to step 62. If no, i.e., the GS exists but is not in the uterus, the probability of medical condition 7b, "ectopic pregnancy," increases. These steps can be performed after the image has passed through the first NN 40 and before the cropping step 42 and the second NN 44 are applied. Then, in the second stage of inference, the yolk sac (YS) and embryo are detected and located. Step 62 determines whether the bounding box for the yolk sac and / or embryo is a subset of the bounding box for the GS. If yes, the probability of it being a normal pregnancy (7a) increases. If the YS and embryo are not detected in the GS, the chance of an abnormal pregnancy (7c) increases.
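The decision flow of Figure 4 can be sketched as a small function. This is a simplified illustration: it returns a label for the condition whose probability the scheme would increase rather than computing probabilities, and treating a missing uterus box as "GS not in uterus" is an assumption made here. The function names and dictionary keys are hypothetical.

```python
def contains(outer, inner):
    """True if box `inner` (x_min, y_min, x_max, y_max) lies inside box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def infer_condition(boxes):
    """boxes: dict mapping a type name to its bounding box; undetected types are absent."""
    gs = boxes.get("GS")
    if gs is None:                                  # step 60: is there a GS box?
        return "abnormal pregnancy (7c)"
    uterus = boxes.get("uterus")
    if uterus is None or not contains(uterus, gs):  # step 61: GS inside uterus?
        return "ectopic pregnancy (7b)"
    inner = [boxes[k] for k in ("YS", "embryo") if k in boxes]
    if any(contains(gs, b) for b in inner):         # step 62: YS / embryo inside GS?
        return "normal IUP (7a)"
    return "abnormal pregnancy (7c)"


example = infer_condition({
    "uterus": (0, 0, 100, 100),
    "GS": (20, 20, 80, 80),
    "YS": (30, 30, 40, 40),
})
```

In the scheme described above, steps 60 and 61 can run on the output of the first NN alone, while step 62 requires the output of the second NN.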

[0113] Figure 5 illustrates possible results of the hierarchical target detector function: Figure 5a depicts bounding boxes that have been identified around the uterus, GS, and embryo. According to an embodiment of the invention, the image has been cropped around the GS bounding box, as shown in Figure 5b. The second neural network has been trained to detect the YS within the GS, and the resulting bounding box is drawn in the figure.

[0114] Figure 6 shows a schematic diagram of the first NN and / or the second NN, preferably an adapted version of the YOLOv3 micronetwork. In this representation, each image input or feature map 20, 22 is annotated at the top with its dimensions (square image) and at the bottom with the number of channels. Thus, the input dataset 20 is a square 2D image of size 416×416 pixels with three channels, for example, a color image (e.g., RGB). In grayscale images, each channel typically has the same value. In contrast, the layer immediately preceding the output layer 32a has a dimension of only 13×13 and a depth of 512 channels.

[0115] The input layer 20 is fed to a 3×3 convolutional filter 24 with a stride of 1, followed by a 2×2 max-pooling filter 28 with a stride of 2. More specifically, 16 such convolutional filters 24 are used in this layer, each with a depth of 3, resulting in a feature map 22a with a depth of 16 and a dimension of 208, which is half the size of the input layer 20. The feature map 22a is then convolved with another 3×3 convolutional filter with a stride of 1, followed by a 2×2 max-pooling filter 28 with a stride of 2, resulting in feature map 22b. This block of layers (i.e., the 3×3 convolutional filter 24 with a stride of 1, followed by the 2×2 max-pooling filter 28 with a stride of 2) is repeated three more times, resulting in a total of 5 convolutional layers 24, each followed by a pooling layer 28 that halves the dimension. Then, feature map 22e is submitted to a convolutional filter 24 again, but this time followed by a max-pooling filter 29 with a size of 2×2 and a stride of 1, which causes no further reduction in dimension, so that the next feature map 22f has a depth of 512 and a dimension of 13×13. This layer is followed by another convolutional filter 24 with a dimension of 3×3 and a stride of 1, resulting in output volume 22g. Output volume 22g is submitted to a convolutional filter 25 with a dimension of 1×1 and a stride of 1, which is used to reduce the depth from 1024 to 256. Convolutional filter 25 may therefore be called a feature map pooling or projection layer: it reduces the number of feature maps (number of channels) while preserving salient features. The output 22h of the projection layer is submitted to another convolutional filter 24 with a dimension of 3×3 and a stride of 1, resulting in an output volume 22i, which is finally followed by a 1×1 convolutional filter 26 with a stride of 1 and k output channels, where k = (C + 5) × B, C is the number of types, and B is the number of anchor boxes, which is 3 in the preferred example. This yields an output layer 32a, which can be referred to as the YOLO inference at scale 1 and can have the output format explained above, i.e., for each cell of a 13×13 grid it contains data for up to B (preferably 3) bounding boxes, each bounding box including four box coordinates, an objectness score, and the individual type probabilities. Bounding boxes are filtered out using a threshold on the objectness and/or type scores.
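The dimension bookkeeping in this paragraph can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes that every convolution is padded so that it preserves the spatial size, which is what the stated dimensions imply. The value C = 3 in the usage note is chosen only for illustration (for example uterus, GS, and embryo).

```python
from typing import List, Tuple


def spatial_sizes(input_size: int = 416, halving_blocks: int = 5) -> List[int]:
    """Trace the spatial size from input layer 20 through the pooling stages."""
    sizes = [input_size]
    for _ in range(halving_blocks):
        # A 3x3 convolution with stride 1 keeps the size (assuming padding);
        # the 2x2 max pooling with stride 2 then halves it.
        sizes.append(sizes[-1] // 2)
    # The last 2x2 max pooling uses stride 1 and leaves the size unchanged.
    sizes.append(sizes[-1])
    return sizes


def detection_channels(num_types: int, num_anchors: int = 3) -> int:
    """k = (C + 5) * B: four coordinates, one objectness score, and C type scores per anchor."""
    return (num_types + 5) * num_anchors


def output_shape(grid: int, num_types: int, num_anchors: int = 3) -> Tuple[int, int, int]:
    """Shape of a YOLO-style output volume on a grid x grid lattice of cells."""
    return (grid, grid, detection_channels(num_types, num_anchors))
```

Under these assumptions, `spatial_sizes()` gives 416, 208, 104, 52, 26, 13, 13, and `output_shape(13, 3)` gives a 13×13×24 volume for three types and three anchor boxes.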

[0116] To perform detection at scale 2, the earlier feature map 22d is submitted to a convolutional filter 24 to obtain feature map 22j. Additionally, feature map 22h is submitted to a convolutional filter 25, followed by an upsampling of size 2 and stride 1, to obtain feature map 22l. This is concatenated with feature map 22j to obtain feature map 22m, which is submitted to another 3×3 convolutional filter 24 to obtain feature map 22n. This feature map is in turn submitted to a convolutional filter 26 to obtain output volume (3D tensor) 32b. Output volume 32b therefore contains the coordinates and probabilities of B bounding boxes for each cell of a higher-resolution 26×26 grid. The bounding box predictions at scale 1 and scale 2 can be combined as described above.
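The upsample-and-concatenate step that feeds scale 2 can be sketched in plain Python on small nested lists. The function names are hypothetical, feature maps are assumed to be stored as `[channel][row][column]`, and nearest-neighbour interpolation is an assumption, since the text specifies only the upsampling factor.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def upsample2x(fmap: FeatureMap) -> FeatureMap:
    """Double the spatial size of every channel by repeating each value 2x2 times."""
    upsampled = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [value for value in row for _ in range(2)]  # repeat columns
            rows.append(wide)
            rows.append(list(wide))  # repeat the widened row
        upsampled.append(rows)
    return upsampled


def concat_channels(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Concatenate two feature maps of equal spatial size along the channel axis."""
    if len(a[0]) != len(b[0]) or len(a[0][0]) != len(b[0][0]):
        raise ValueError("feature maps must have the same spatial size")
    return a + b
```

Applied to a 13×13 map, `upsample2x` produces a 26×26 map that can be concatenated with an earlier 26×26 map, and the channel count of the result is the sum of the two inputs.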

[0117] Figure 7 illustrates a training method. Training images 12 are provided. For the example described herein, ultrasound scans of fetuses of less than 11 weeks' gestation were collected for algorithm development. Each image frame from the ultrasound scans was then manually annotated with axis-aligned bounding boxes covering the entire anatomical structures (U, GS, E, YS; examples are shown in Figure 1). The data distribution was ensured to be uniform, with equal weight given to all possible gestational ages. For example, over 500 and up to 5000 images 12 are annotated by drawing bounding boxes for the GS, U, and embryo; in Figure 7, the corresponding annotations are denoted 70, 71, and 72. The gestational sac is cropped using annotation 70 (step 73). On the cropped image, the yolk sac is annotated, and this annotation is saved as 75.
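The text does not say how equal weight across gestational ages is achieved. One possible way, shown here purely as an assumption, is to give each training image a sampling weight inversely proportional to the number of images that share its gestational week. The function name `balanced_weights` and the use of whole weeks as bins are hypothetical.

```python
from collections import Counter
from typing import List


def balanced_weights(gestational_weeks: List[int]) -> List[float]:
    """Return one weight per image so that every gestational week contributes equally."""
    counts = Counter(gestational_weeks)
    num_weeks = len(counts)
    # Each week receives a total weight of 1 / num_weeks, shared by its images.
    return [1.0 / (num_weeks * counts[week]) for week in gestational_weeks]
```

With three images from week 6 and one from week 8, the week-8 image receives weight 0.5 and the three week-6 images share the remaining 0.5.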

[0118] Accordingly, the training images 12 are used as input training data, and the GS annotation 70, uterus annotation 71, and embryo annotation 72 are used as output training data to train the first NN in step 76. Likewise, the images cropped around the GS (step 73) and the yolk sac annotations 75 are used to train the second NN in step 78.
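To make the notion of output training data more concrete, the following Python sketch writes annotated bounding boxes into a target tensor of size N×N×[B×(4+1+C)], the format named in claim 12. It is a deliberately simplified illustration: the function name `encode_targets` and the type indices are hypothetical, each box is always written into the first anchor slot of the grid cell containing its centre, and coordinates are stored as plain fractions of the image size. An actual YOLOv3 training pipeline selects the best-matching anchor and encodes offsets relative to the cell and anchor, which this sketch omits.

```python
from typing import List, Tuple

Annotation = Tuple[float, float, float, float, int]  # x_min, y_min, x_max, y_max, type index


def encode_targets(boxes: List[Annotation], image_size: int, grid: int = 13,
                   num_anchors: int = 3, num_types: int = 3) -> List[List[List[float]]]:
    """Build a grid x grid x [num_anchors * (4 + 1 + num_types)] target as nested lists."""
    per_anchor = 4 + 1 + num_types
    target = [[[0.0] * (num_anchors * per_anchor) for _ in range(grid)]
              for _ in range(grid)]
    for x_min, y_min, x_max, y_max, type_index in boxes:
        # Box centre and size as fractions of the (square) image size.
        cx = (x_min + x_max) / 2 / image_size
        cy = (y_min + y_max) / 2 / image_size
        width = (x_max - x_min) / image_size
        height = (y_max - y_min) / image_size
        # Grid cell that contains the box centre.
        col = min(int(cx * grid), grid - 1)
        row = min(int(cy * grid), grid - 1)
        cell = target[row][col]
        # Simplification: always use anchor slot 0 of this cell.
        cell[0:4] = [cx, cy, width, height]
        cell[4] = 1.0                 # objectness
        cell[5 + type_index] = 1.0    # one-hot type indicator
    return target
```

For a 416×416 image with three types and three anchor boxes, each cell of the 13×13 grid holds 24 values, and a box centred in the image is written into the cell at row 6, column 6.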

[0119] Figure 8 is a schematic diagram of an ultrasound system 100 configured to perform the method of the invention according to an embodiment. The ultrasound system 100 includes a typical ultrasound hardware unit 102, which includes a CPU 104, a GPU 106, and a digital storage medium 108 (e.g., a hard disk or a solid-state disk). Computer programs can be loaded into the hardware unit from a CD-ROM 110 or via the Internet 112. The hardware unit 102 is connected to a user interface 114, which includes a keyboard 116 and an optional touchpad 118. The touchpad 118 can also function as a display device for displaying imaging parameters. The hardware unit 102 is connected to an ultrasound probe 120, which includes an array of ultrasound transducers 122 that allows B-mode ultrasound images to be acquired from a subject or patient (not shown), preferably in real time. The B-mode image 124 acquired using the ultrasound probe 120 and the bounding boxes 5 generated by the method of the present invention, executed by the CPU 104 and/or GPU 106, are displayed on a screen 126, which can be any commercially available display unit, such as a screen, television, flat panel screen, projector, etc. Additionally, the hardware unit 102 can be connected to a remote computer or server 128, for example, via the Internet 112. The method according to the invention can be executed by the CPU 104 or GPU 106 of the hardware unit 102, but can also be executed by the processor 15 of the remote server 128.

[0120] The foregoing discussion is intended to illustrate the system only and should not be construed as limiting the claims to any particular embodiment or group of embodiments. Therefore, while the system has been described in particular and in detail with reference to exemplary embodiments, it should be understood that those skilled in the art can devise many modifications and alternative embodiments without departing from the broader and contemplated scope of the invention as set forth in the claims. Consequently, the specification and drawings should be considered illustrative and not intended to limit the scope of the claims.

Claims

1. A computer-implemented method for automatically detecting anatomical structures (3) in a medical image of a subject, the method comprising the steps of: a) receiving at least one medical image of a field of view (2) of the subject; b) applying a target detector function (4) to the medical image, wherein the target detector function is trained to detect multiple types of anatomical structures (3), thereby generating as output the coordinates of multiple bounding boxes (5) and a confidence score for each bounding box, the confidence score giving the probability that the bounding box contains an anatomical structure belonging to one of the multiple types; characterized in that the target detector function performs the following steps: applying a first neural network (40) to the medical image, wherein the first neural network is trained to detect a first plurality of larger anatomical structures (3a), thereby generating as output the coordinates of at least one first bounding box (51) and a confidence score that the at least one first bounding box contains a larger anatomical structure; cropping the medical image to the first bounding box, thereby generating a cropped image (11) containing the image content within the first bounding box (51); and applying a second neural network (44) to the cropped medical image, wherein the second neural network is trained to detect at least one second type of smaller anatomical structure (3b), thereby generating as output the coordinates of at least one second bounding box (54) and a confidence score that the at least one second bounding box contains a smaller anatomical structure; wherein the target detector function is based on a hierarchical relationship between larger and smaller anatomical structures, wherein at least one of the first plurality of larger anatomical structures is expected to contain one or more types of smaller anatomical structures.

2. The method of claim 1, further comprising the following step: c) determining the probability of a predefined medical condition (7a, 7b, 7c) of the subject, wherein the probability of the predefined medical condition is determined using an inference scheme (6) based on the presence or absence of one or more types of anatomical structures and/or based on the relative spatial location of the detected bounding boxes containing the anatomical structures.

3. The method according to claim 1 or 2, wherein, if a first detected bounding box (51) containing a first type of anatomical structure covers a second detected bounding box (54) containing a second type of anatomical structure, the probability of the predefined medical condition increases.

4. The method according to claim 1 or 2, wherein the method is performed iteratively on multiple two-dimensional medical images (1a, 1b, 1c) with different fields of view acquired during the same examination period of the subject, and the confidence scores for the detected bounding boxes (51, 52, 53, 54) are used to calculate the most suitable medical image or one or more fields of view for further evaluation.

5. The method according to claim 1 or 2, wherein the first neural network (40) and/or the second neural network (44) are fully convolutional neural networks.

6. The method according to claim 1 or 2, wherein the first neural network (40) and/or the second neural network (44) comprise the detection of anatomical structures (3a, 3b) at two different scales, each scale being given by a predetermined downsampling of the medical image.

7. The method according to claim 1 or 2, wherein the first neural network (40) and/or the second neural network (44) are YOLOv3 fully convolutional neural networks.

8. The method according to claim 1 or 2, wherein the target detector function (4) is trained to detect 2 to 12 types of anatomical structures (3a, 3b).

9. The method according to claim 1 or 2, wherein the medical images are acquired during a prenatal ultrasound scan in early pregnancy, and the multiple types of anatomical structures include the uterus, gestational sac, embryo, and/or yolk sac.

10. The method according to claim 2, wherein the probability of a "normal pregnancy" is increased if the detected bounding box of the uterus includes the detected bounding box of the gestational sac and the detected bounding box of the gestational sac includes the detected bounding box of the embryo and/or yolk sac.

11. A method for training a target detector function (4) for detecting multiple types of anatomical structures (3) in medical images, the target detector function comprising a first neural network (40) and a second neural network (44), the method comprising: (a) receiving input training data, i.e., at least one medical image of a field of view of a subject; (b) receiving output training data (70, 71, 72), i.e., tensors, the tensors including the coordinates of at least one first bounding box containing a larger anatomical structure (3a) within the medical image and numbers indicating the type of the larger anatomical structure, the larger anatomical structure belonging to one of a first plurality of larger anatomical structures; (c) training the first neural network (40) by using the input training data and the output training data; (d) receiving input training data (73), namely a cropped image (11), the cropped image including the image content of a first bounding box containing a larger anatomical structure (3a); (e) receiving output training data (75), i.e., tensors, the tensors including the coordinates of at least one second bounding box containing a smaller anatomical structure (3b) within the cropped image, the smaller anatomical structure belonging to at least one second type of smaller anatomical structure; (f) training the second neural network (44) by using the input training data and the output training data; wherein the target detector function is based on a hierarchical relationship between larger and smaller anatomical structures, wherein at least one of the first plurality of larger anatomical structures is expected to contain one or more types of smaller anatomical structures.

12. The training method according to claim 11, wherein the output training data includes tensors of size N×N×[B*(4+1+C)], where N×N is the dimension of the final feature map, B is the number of anchor boxes, and C is the number of types.

13. The training method according to claim 11 or 12, wherein the output training data is generated by applying a 1×1 detection kernel to the downsampled feature map, wherein the shape of the detection kernel is 1×1×(B*(5+C)), where B is the number of anchor boxes and C is the number of types.

14. A computer program product comprising instructions which, when run by a computing unit (106), cause the computing unit to perform the method according to any one of claims 1 to 13.

15. A system (100) for automatically detecting anatomical structures in medical images of a subject, the system comprising: a) a first interface configured to receive at least one medical image of a field of view of the subject; b) a computing unit (106) configured to apply a target detector function (4) to the medical image, wherein the target detector function is trained to detect multiple types of anatomical structures (3), thereby generating as output the coordinates of multiple bounding boxes (5) and a confidence score for each bounding box, the confidence score giving the probability that the bounding box (5) contains an anatomical structure (3) belonging to one of the multiple types, wherein the computing unit is configured to perform the following steps: applying a first neural network (40) to the medical image, wherein the first neural network is trained to detect a first plurality of larger anatomical structures (3a), thereby generating as output the coordinates of at least one first bounding box (51) and a confidence score that the at least one first bounding box contains a larger anatomical structure; cropping the medical image to the first bounding box, thereby generating a cropped image (11) containing the image content within the first bounding box (51); and applying a second neural network (44) to the cropped medical image, wherein the second neural network is trained to detect at least one second type of smaller anatomical structure (3b), thereby generating as output the coordinates of at least one second bounding box (54) and a confidence score that the at least one second bounding box contains a smaller anatomical structure, wherein the target detector function is based on a hierarchical relationship between larger and smaller anatomical structures, wherein at least one of the first plurality of larger anatomical structures is expected to contain one or more types of smaller anatomical structures.