Publications
A list of my publications.
2024
- TCD Dream-in-Style: Text-to-3D Generation using Stylized Score Distillation. Hubert Kompanowski and Binh-Son Hua. 2024
2023
- EMBRYOAID P-225 Trustworthy AI algorithm for embryo ranking. S Deluga-Białowarczuk, P Wygocki, P Pawlik, and 9 more authors. Human Reproduction, Jun 2023
Study question: Deep-learning algorithms are known to be non-robust: can the variability and inconsistency of AI algorithms be reduced in embryo selection?

Summary answer: We reduced the variability of the algorithms (measured on different tasks such as rotations and brightness changes) by 86% while preserving their quality.

What is known already: Deep-learning methods are generally known to be non-robust, i.e., decisions change with even slight modifications of the input data. Current solutions for embryo scoring are not robust; for example, rotating the input image changes the score in most solutions on the market. Despite this fact, and the concerns expressed by embryologists, there are no other publications focusing on the problem of variance in AI solutions used in IVF. Most publications measure accuracy, sensitivity, specificity, and ROC AUC; there are no variance metrics.

Study design, size, duration: The data set was collected in multiple clinics using various devices. It contains 34,821 embryos (4,510 of which were transferred with known pregnancy results), represented by time-lapse videos or images, giving 3,290,481 frames of embryos at various maturity levels.

Participants/materials, setting, methods: From the data set, 925 randomly selected embryos were chosen as a test set. The frames were modified by methods that should not change the output of the algorithm, and we measured the variability of the scores given by our algorithm. We considered seven modifications of images that should not influence embryo scoring:
- Rotations (10 different angles);
- Brightness and contrast modifications;
- Substitutions of frames (frames from time-lapse monitoring taken within a 2-hour interval);
- Blur (generalised normal filter);
- Gaussian noise;
- Gaussian blur;
- Sharpening.
We used several techniques to reduce the variance of our deep neural network model (an architecture commonly used for embryo selection); a sketch of one of them follows this abstract:
- Ensemble (of different models in cross-validation);
- Test-time augmentation (TTA);
- Robust training.

Main results and the role of chance: To measure the variance we used the following method. First, the scores are stretched to the standard uniform distribution; in other words, we look at which percentile a score lies in. This normalises the range of the scores, so variances can be compared (see the second sketch after this abstract). Second, we train the EMBRYOAID model on augmented data that includes all the above modifications. Third, we compute the variance of the normalised scores on the test set. The mean variance dropped by 86% (0.0055 to 0.0008) across all measured input modifications. The individual drops per input modification were: rotations: 77% (0.009 -> 0.002); brightness and contrast: 81% (0.0036 -> 0.0007); substitution of frames: 76% (0.0076 -> 0.0019); blur: 94% (0.012 -> 0.0008); Gaussian noise: 96% (0.0049 -> 0.0002); Gaussian blur: 95% (0.0052 -> 0.0003); sharpening: 77% (0.0015 -> 0.0003). Significance was tested with the Wilcoxon rank-sum test, giving p < 0.01 for all input modifications. Finally, we stress that these results were obtained without any loss in the ROC AUC metric: we tested both models on the original and the modified test sets, and both achieved an ROC AUC of 0.66 (CI 0.63-0.69).

Limitations, reasons for caution: Further work is needed to extend the set of possible data augmentations.

Wider implications of the findings: Increased reliability of AI scoring algorithms for embryo selection. It is possible to obtain consistent results over a wide range of data modifications.

Trial registration number: not applicable
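As an illustration of the TTA technique listed above, here is a minimal sketch. The `score_frame` function, the transform, and the angle subset are hypothetical stand-ins for illustration, not the EMBRYOAID implementation:

```python
# Minimal sketch of test-time augmentation (TTA) for stabilizing scores.
import numpy as np

def rotate(image: np.ndarray, angle_deg: float) -> np.ndarray:
    """Placeholder label-preserving transform; a real pipeline would use an
    image library (e.g. scipy.ndimage.rotate) for arbitrary angles."""
    k = int(angle_deg // 90) % 4
    return np.rot90(image, k)

def tta_score(score_frame, image: np.ndarray) -> float:
    """Average the model score over transformed copies of the input.

    Averaging over label-preserving transforms does not change what the
    score should be, but it shrinks the variance caused by the model
    reacting to nuisance factors such as orientation.
    """
    angles = [0, 90, 180, 270]  # illustrative subset of the 10 rotations
    scores = [score_frame(rotate(image, a)) for a in angles]
    return float(np.mean(scores))
```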
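The variance metric itself can be sketched as follows: scores are mapped to percentiles of a reference score distribution (the standard-uniform normalisation described above), per-embryo variances are computed across the modified inputs, and the baseline and robust models are compared with a Wilcoxon rank-sum test. Array names and shapes are assumptions for illustration:

```python
# Sketch of the variance metric: percentile-normalise scores so that
# variances are comparable across models, then compare distributions.
import numpy as np
from scipy.stats import ranksums

def normalize_to_uniform(scores: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Replace each score by the fraction of reference scores below it,
    i.e. its percentile, mapping scores onto the standard uniform."""
    reference = np.sort(reference)
    return np.searchsorted(reference, scores) / len(reference)

def per_embryo_variance(raw: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """raw: (n_embryos, n_modifications) scores for one modification type.
    Returns the variance of the normalised scores for each embryo."""
    norm = normalize_to_uniform(raw.ravel(), reference).reshape(raw.shape)
    return norm.var(axis=1)

# var_base, var_robust: per-embryo variances for the two models.
# stat, p = ranksums(var_base, var_robust)  # p < 0.01 reported in the study
```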
- EMBRYOAID P-214 The size of the embryo can be used as a single predictor for selecting embryo for transfer. I Martynowicz, P Wygocki, T Gilewicz, and 9 more authors. Human Reproduction, Jun 2023
Study question: What is the performance of the blastocyst-area feature compared to state-of-the-art AI algorithms in the embryo ranking problem?

Summary answer: We study a simple embryo selection rule: the larger the better. This rule gives answers of the same quality as other methods, including AI.

What is known already: There are many existing methods for scoring embryos, ranging from human-executed rules to AI grading systems. Previous studies indicate systematic problems with comparing these methods, caused by the lack of a gold-standard algorithm, i.e., a repeatable, easy, and objective test that is not susceptible to human interpretation. Common benchmarks typically lead to speed-ups in the development of ML methods. We show that the embryo area, a single, interpretable feature, can be used as such a gold standard. It delivers results of the same quality (ROC AUC = 0.66) as existing state-of-the-art AI algorithms.

Study design, size, duration: The data set for this retrospective study was collected from a single center, Kriobank, Bialystok. It includes 1,550 time-lapse videos of embryos cultured in an incubator for up to 140 hours post-fertilization with no hatching. All of these embryos were transferred with a known implantation outcome (beta-hCG). All the videos were recorded using the same optical magnification setting.

Participants/materials, setting, methods: Two methods of embryo ranking are compared (a sketch of the first follows this abstract):
- In the first method, the score equals the embryo area on the last frame of the time-lapse. The area was computed using an AI segmentation algorithm trained on hundreds of manually annotated images/frames.
- In the second method, state-of-the-art AI algorithms were designed and trained to grade embryos directly from image data.
ROC AUC was used to compare ranking performance.

Main results and the role of chance: The proposed gold standard, the embryo area, requires no training process. It induces a total ranking of all embryos, which should correlate with the treatment outcome, and ROC AUC is the way to test this. For the whole available data set (1,550 cases), ROC AUC = 0.659 (CI 0.630-0.688), a relatively high value, comparable to those reported for state-of-the-art AI algorithms based on deep learning. We also compared the area-based algorithm with state-of-the-art machine learning methods used in commercial solutions. In this test we split the data set into training (1,236 cases) and test (314 cases) subsets. The area-based algorithm achieved ROC AUC = 0.657 (CI 0.597-0.717); the deep-learning algorithm achieved ROC AUC = 0.659 (CI 0.599-0.719). We note that several previous studies have shown relatively high disagreement between embryologists and AI algorithms, revealing a need for standardization in this area of study and for the development of common grounds for tests. This study strongly suggests that embryo area can serve as a gold standard for further comparison of developed algorithms, and clear improvement over such a gold standard should be expected.

Limitations, reasons for caution:
- Currently, the application of the embryo-area rule is limited to devices with a fixed optical magnification.
- Further work is needed to compare the segmentation algorithm with manually annotated images.
- Despite the initial confirmation of its viability, the algorithm should be verified and tested on a large, multi-centric data set.

Wider implications of the findings:
- While AI-based tools have the highest potential for increasing the efficacy of embryo selection, there are simple methods that can support embryologists with very simple software or none at all.
- There is a strong need for gold standards that can then be improved upon.

Trial registration number: not applicable
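A minimal sketch of the first (area-based) method and its evaluation, assuming segmentation masks and implantation outcomes are available as inputs; `area_score`, `evaluate`, and the variable names are illustrative, not the study's code:

```python
# Rank embryos by segmented area on the last frame and score that
# ranking with ROC AUC against the implantation outcome.
import numpy as np
from sklearn.metrics import roc_auc_score

def area_score(mask: np.ndarray) -> float:
    """Score an embryo by its segmented area (pixel count). With a fixed
    optical magnification, pixel area is proportional to physical area."""
    return float(mask.sum())

def evaluate(masks, outcomes) -> float:
    """masks: boolean segmentation masks, one per transferred embryo.
    outcomes: 0/1 implantation results (beta-hCG)."""
    scores = [area_score(m) for m in masks]
    return roc_auc_score(outcomes, scores)  # ~0.66 reported in the study
```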
- EMBRYOAID P-289 Evaluation of AI-based, non-invasive and annotation free EMBRYOAID software with embryologists: time and prediction. P Wygocki, M Siennicki, P Pawlik, and 12 more authors. Human Reproduction, Jun 2023
Study question: Is an AI model as good as an experienced embryologist?

Summary answer: Properly trained AI models can perform as well as embryologists with respect to accuracy, while at the same time improving decisiveness.

What is known already: There are many attempts to solve the embryo selection problem. One way is to determine specific features of the embryo and calculate the final score on that basis. The non-algorithmic approach uses the professional knowledge of the embryologist, who performs a visual analysis and scores the embryos. The most promising attempts use deep learning, specifically CNNs, to predict pregnancy probabilities directly from a given image or set of images. Although these tools deliver high-quality answers, they have rather low intra-embryologist agreement.

Study design, size, duration: Comparing the scoring of AI algorithms with embryologists is a challenge, as they lack a common scale, e.g., a total ranking of embryos. To overcome this problem we designed a test containing 150 pairs of day-5 embryo time-lapses. In each pair, only one embryo led to a pregnancy (implantation based on beta-hCG). We compared our algorithm with the decisions of 10 embryologists with 10 years of experience on average.

Participants/materials, setting, methods: We created a web questionnaire for the test. It displayed the time-lapses for a pair of embryos and allowed the embryologist to choose the more promising one. We invited doctors from several clinics to take part in the study. The AI model was tested on the same data, i.e., its goal was to choose between the two transferred embryos. After data collection, the effectiveness of the embryologists and the model were compared (a sketch of this evaluation follows the abstract).

Main results and the role of chance: The accuracy of predicting the embryo that led to the pregnancy was:
- 66.9% (CI 63.1-70.7) for our model,
- 63.8% (CI 62.6-65.0) on average for the embryologists.
The decisions taken by the algorithm are slightly better; however, this holds with rather low statistical significance. Some decisions taken by the doctors have high variance; e.g., there were cases where 5 out of 10 decisions indicated one embryo. To understand these variances better, we divided the test set into two parts: (a) 57 cases where all doctors agreed on the decision, and (b) 93 cases where there were some differences. On set (a), the decisions of the algorithm agreed with the experts in 95% of cases, while on set (b) the correlation of either the expert decisions or the algorithm with the ground truth was rather weak, i.e., a p-score of approximately 0.1. The last aspect of this study was decision time: the experts took 54 seconds per decision on average, while our algorithm made decisions in 2 seconds on average.

Limitations, reasons for caution: The experiment shows high agreement between the algorithm and the experts in the cases where the experts agree. However, the difference between the average accuracy scores has low statistical significance.

Wider implications of the findings: The model returns the result of its analysis almost immediately, so it can speed up the process of selecting the most promising embryos. The model agrees with the experts in the cases where the experts agree.

Trial registration number: not applicable
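A minimal sketch of the pairwise evaluation protocol described above, including the (a)/(b) agreement split; the array layouts and names are assumptions for illustration, not the study's code:

```python
# Pairwise evaluation: for each pair of transferred day-5 embryos exactly
# one implanted, and a decision is correct when it picks that embryo.
import numpy as np

def pairwise_accuracy(model_scores: np.ndarray) -> float:
    """model_scores: (n_pairs, 2) array where column 0 holds the score of
    the embryo that implanted and column 1 the one that did not.
    A pair counts as correct when the implanted embryo scores higher."""
    correct = model_scores[:, 0] > model_scores[:, 1]
    return float(correct.mean())

def expert_agreement_split(votes: np.ndarray):
    """votes: (n_pairs, n_experts) array of 0/1 choices. Returns masks for
    the unanimous pairs (all experts agree) and the contested ones,
    mirroring the (a)/(b) split used in the analysis above."""
    unanimous = votes.min(axis=1) == votes.max(axis=1)
    return unanimous, ~unanimous
```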