Performance Estimation of multi-class 2D Segmentation

Consider to download this Jupyter Notebook and run locally, or test it with Colab.

In this notebook, we will show how to evaluate the performance of multi-class 2D segmentation tasks.
We provide the model predicted 2D segmentation results (network logits) for this tutorial, which will be download automatically. We also provide the model training code in https://github.com/ZerojumpLine/Robust-Medical-Segmentation.
More specifically, we show an example of estimating the performance under domain shifts on Cardiac MRI segmentation (into 4 classes including background, left ventricle (LV), myocardium(MYO) and right ventricle (RV)) based on a 3D U-Net. We will utilize the calculated logits on test dataset acquired with a different scanner.
We will calculated model confidence with different confidence scores and varied calibration methods.

[1]:

!pip install moval
!pip install statannotations
!pip install pandas
!pip install tqdm
!pip install matplotlib
!pip install nibabel
!pip install seaborn==0.12 # because statannotations not support the latest

Requirement already satisfied: moval in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (0.3.16)
Requirement already satisfied: scikit-learn>=1.3.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from moval) (1.3.0)
Requirement already satisfied: scipy>=1.8.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from moval) (1.10.1)
Requirement already satisfied: pytest in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from moval) (7.4.3)
Requirement already satisfied: gdown in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from moval) (4.7.1)
Requirement already satisfied: pandas in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from moval) (1.5.3)
Requirement already satisfied: nibabel in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from moval) (5.1.0)
Requirement already satisfied: numpy>=1.17.3 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from scikit-learn>=1.3.0->moval) (1.24.4)
Requirement already satisfied: joblib>=1.1.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from scikit-learn>=1.3.0->moval) (1.3.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from scikit-learn>=1.3.0->moval) (3.1.0)
Requirement already satisfied: filelock in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from gdown->moval) (3.13.1)
Requirement already satisfied: requests[socks] in /Users/zejuli/.local/lib/python3.8/site-packages (from gdown->moval) (2.31.0)
Requirement already satisfied: six in /Users/zejuli/.local/lib/python3.8/site-packages (from gdown->moval) (1.16.0)
Requirement already satisfied: tqdm in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from gdown->moval) (4.65.0)
Requirement already satisfied: beautifulsoup4 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from gdown->moval) (4.12.2)
Requirement already satisfied: importlib-resources>=1.3 in /Users/zejuli/.local/lib/python3.8/site-packages (from nibabel->moval) (5.12.0)
Requirement already satisfied: packaging>=17 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from nibabel->moval) (23.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from pandas->moval) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from pandas->moval) (2023.3.post1)
Requirement already satisfied: iniconfig in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from pytest->moval) (2.0.0)
Requirement already satisfied: pluggy<2.0,>=0.12 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from pytest->moval) (1.3.0)
Requirement already satisfied: exceptiongroup>=1.0.0rc8 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from pytest->moval) (1.1.3)
Requirement already satisfied: tomli>=1.0.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from pytest->moval) (2.0.1)
Requirement already satisfied: zipp>=3.1.0 in /Users/zejuli/.local/lib/python3.8/site-packages (from importlib-resources>=1.3->nibabel->moval) (3.15.0)
Requirement already satisfied: soupsieve>1.2 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from beautifulsoup4->gdown->moval) (2.5)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/zejuli/.local/lib/python3.8/site-packages (from requests[socks]->gdown->moval) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /Users/zejuli/.local/lib/python3.8/site-packages (from requests[socks]->gdown->moval) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/zejuli/.local/lib/python3.8/site-packages (from requests[socks]->gdown->moval) (2.0.3)
Requirement already satisfied: certifi>=2017.4.17 in /Users/zejuli/.local/lib/python3.8/site-packages (from requests[socks]->gdown->moval) (2023.5.7)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from requests[socks]->gdown->moval) (1.7.1)
Requirement already satisfied: statannotations in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (0.6.0)
Requirement already satisfied: numpy>=1.12.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from statannotations) (1.24.4)
Requirement already satisfied: seaborn<0.12,>=0.9.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from statannotations) (0.11.2)
Requirement already satisfied: matplotlib>=2.2.2 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from statannotations) (3.7.4)
Requirement already satisfied: pandas<2.0.0,>=0.23.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from statannotations) (1.5.3)
Requirement already satisfied: scipy>=1.1.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from statannotations) (1.10.1)
Requirement already satisfied: contourpy>=1.0.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=2.2.2->statannotations) (1.1.1)
Requirement already satisfied: cycler>=0.10 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=2.2.2->statannotations) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=2.2.2->statannotations) (4.46.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=2.2.2->statannotations) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=2.2.2->statannotations) (23.1)
Requirement already satisfied: pillow>=6.2.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=2.2.2->statannotations) (10.1.0)
Requirement already satisfied: pyparsing>=2.3.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=2.2.2->statannotations) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=2.2.2->statannotations) (2.8.2)
Requirement already satisfied: importlib-resources>=3.2.0 in /Users/zejuli/.local/lib/python3.8/site-packages (from matplotlib>=2.2.2->statannotations) (5.12.0)
Requirement already satisfied: pytz>=2020.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from pandas<2.0.0,>=0.23.0->statannotations) (2023.3.post1)
Requirement already satisfied: zipp>=3.1.0 in /Users/zejuli/.local/lib/python3.8/site-packages (from importlib-resources>=3.2.0->matplotlib>=2.2.2->statannotations) (3.15.0)
Requirement already satisfied: six>=1.5 in /Users/zejuli/.local/lib/python3.8/site-packages (from python-dateutil>=2.7->matplotlib>=2.2.2->statannotations) (1.16.0)
Requirement already satisfied: pandas in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: numpy>=1.20.3 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from pandas) (1.24.4)
Requirement already satisfied: six>=1.5 in /Users/zejuli/.local/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Requirement already satisfied: tqdm in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (4.65.0)
Requirement already satisfied: matplotlib in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (3.7.4)
Requirement already satisfied: contourpy>=1.0.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib) (1.1.1)
Requirement already satisfied: cycler>=0.10 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib) (4.46.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib) (1.4.5)
Requirement already satisfied: numpy<2,>=1.20 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib) (1.24.4)
Requirement already satisfied: packaging>=20.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib) (23.1)
Requirement already satisfied: pillow>=6.2.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib) (10.1.0)
Requirement already satisfied: pyparsing>=2.3.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: importlib-resources>=3.2.0 in /Users/zejuli/.local/lib/python3.8/site-packages (from matplotlib) (5.12.0)
Requirement already satisfied: zipp>=3.1.0 in /Users/zejuli/.local/lib/python3.8/site-packages (from importlib-resources>=3.2.0->matplotlib) (3.15.0)
Requirement already satisfied: six>=1.5 in /Users/zejuli/.local/lib/python3.8/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Requirement already satisfied: nibabel in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (5.1.0)
Requirement already satisfied: importlib-resources>=1.3 in /Users/zejuli/.local/lib/python3.8/site-packages (from nibabel) (5.12.0)
Requirement already satisfied: numpy>=1.19 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from nibabel) (1.24.4)
Requirement already satisfied: packaging>=17 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from nibabel) (23.1)
Requirement already satisfied: zipp>=3.1.0 in /Users/zejuli/.local/lib/python3.8/site-packages (from importlib-resources>=1.3->nibabel) (3.15.0)
WARNING: Error parsing requirements for seaborn: [Errno 2] No such file or directory: '/Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages/seaborn-0.12.0.dist-info/METADATA'
Collecting seaborn==0.12
  Using cached seaborn-0.12.0-py3-none-any.whl (285 kB)
Requirement already satisfied: numpy>=1.17 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from seaborn==0.12) (1.24.4)
Requirement already satisfied: pandas>=0.25 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from seaborn==0.12) (1.5.3)
Requirement already satisfied: matplotlib>=3.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from seaborn==0.12) (3.7.4)
Requirement already satisfied: contourpy>=1.0.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=3.1->seaborn==0.12) (1.1.1)
Requirement already satisfied: cycler>=0.10 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=3.1->seaborn==0.12) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=3.1->seaborn==0.12) (4.46.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=3.1->seaborn==0.12) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=3.1->seaborn==0.12) (23.1)
Requirement already satisfied: pillow>=6.2.0 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=3.1->seaborn==0.12) (10.1.0)
Requirement already satisfied: pyparsing>=2.3.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=3.1->seaborn==0.12) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from matplotlib>=3.1->seaborn==0.12) (2.8.2)
Requirement already satisfied: importlib-resources>=3.2.0 in /Users/zejuli/.local/lib/python3.8/site-packages (from matplotlib>=3.1->seaborn==0.12) (5.12.0)
Requirement already satisfied: pytz>=2020.1 in /Users/zejuli/opt/anaconda3/envs/moval/lib/python3.8/site-packages (from pandas>=0.25->seaborn==0.12) (2023.3.post1)
Requirement already satisfied: zipp>=3.1.0 in /Users/zejuli/.local/lib/python3.8/site-packages (from importlib-resources>=3.2.0->matplotlib>=3.1->seaborn==0.12) (3.15.0)
Requirement already satisfied: six>=1.5 in /Users/zejuli/.local/lib/python3.8/site-packages (from python-dateutil>=2.7->matplotlib>=3.1->seaborn==0.12) (1.16.0)
Installing collected packages: seaborn
  Attempting uninstall: seaborn
    Found existing installation: seaborn 0.11.2
    Uninstalling seaborn-0.11.2:
      Successfully uninstalled seaborn-0.11.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
statannotations 0.6.0 requires seaborn<0.12,>=0.9.0, but you have seaborn 0.12.0 which is incompatible.
Successfully installed seaborn-0.12.0

[2]:

import os
import gdown
import itertools
import zipfile
import pandas as pd
import numpy as np
import nibabel as nib
import moval
from moval.solvers.utils import ComputMetric
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt

[3]:

print(f"The installed MOVAL verision is {moval.__version__}")
print(f"The installed seaborn verision is {sns.__version__}")

The installed MOVAL verision is 0.3.16
The installed seaborn verision is 0.12.0

Load the data

[4]:

# download the data of cardiac

output = "data_moval_supp.zip"
if not os.path.exists(output):
    url = "https://drive.google.com/u/0/uc?id=1ZlC66MGmPlf05aYYCKBaRT2q5uod8GFk&export=download"
    output = "data_moval_supp.zip"
    gdown.download(url, output, quiet=False)

directory_data = "data_moval_supp"
if not os.path.exists(directory_data):
    with zipfile.ZipFile(output, 'r') as zip_ref:
        zip_ref.extractall(directory_data)

[5]:

ls

analysis_cls.ipynb    data_moval_supp.zip   img_cifar/
analysis_seg2d.ipynb  estim_cls.ipynb       img_cifar.zip
analysis_seg3d.ipynb  estim_seg2d.ipynb     img_prostate/
data_moval/           estim_seg3d.ipynb     img_prostate.zip
data_moval.zip        img_cardiac/
data_moval_supp/      img_cardiac.zip

[6]:

# now I am playing with cardiac segmentation

Datafile_eval = "data_moval_supp/Cardiacresults/seg-eval.txt"
Imglist_eval = open(Datafile_eval)
Imglist_eval_read = Imglist_eval.read().splitlines()

logits = []
gt = []
for Imgname_eval in Imglist_eval_read:
    #
    caseID = Imgname_eval.split("/")[-2]
    #
    GT_file = f"data_moval_supp/Cardiacresults/GT/1/{caseID}/seg.nii.gz"
    #
    logit_cls0_file = "data_moval_supp/Cardiacresults/cardiacval/results/pred_" + caseID + "cls0_prob.nii.gz"
    logit_cls1_file = "data_moval_supp/Cardiacresults/cardiacval/results/pred_" + caseID + "cls1_prob.nii.gz"
    logit_cls2_file = "data_moval_supp/Cardiacresults/cardiacval/results/pred_" + caseID + "cls2_prob.nii.gz"
    logit_cls3_file = "data_moval_supp/Cardiacresults/cardiacval/results/pred_" + caseID + "cls3_prob.nii.gz"
    #
    logit_cls0_read = nib.load(logit_cls0_file)
    logit_cls1_read = nib.load(logit_cls1_file)
    logit_cls2_read = nib.load(logit_cls2_file)
    logit_cls3_read = nib.load(logit_cls3_file)
    #
    logit_cls0      = logit_cls0_read.get_fdata()   # ``(H, W, D)``
    logit_cls1      = logit_cls1_read.get_fdata()
    logit_cls2      = logit_cls2_read.get_fdata()
    logit_cls3      = logit_cls3_read.get_fdata()
    #
    GT_read         = nib.load(GT_file)
    GTimg           = GT_read.get_fdata()           # ``(H, W, D)``
    #
    logit_cls = np.stack((logit_cls0, logit_cls1, logit_cls2, logit_cls3))  # ``(d, H, W, D)``
    # only including the slices that contains labels
    for dslice in range(GTimg.shape[2]):
        if np.sum(GTimg[:, :, dslice]) > 0:
            logits.append(logit_cls[:, :, :, dslice])
            gt.append(GTimg[:, :, dslice])

# logits is a list of length ``n``,  each element has ``(d, H, W)``.
# gt is a list of length ``n``,  each element has ``(H, W)``.
# H and W could differ for different cases.

Datafile_test = "data_moval_supp/Cardiacresults/seg-testA.txt"
Imglist_test = open(Datafile_test)
Imglist_test_read = Imglist_test.read().splitlines()

logits_test = []
gt_test = []
for Imgname_eval in Imglist_test_read:
    caseID = Imgname_eval.split("/")[-2]
    #
    GT_file = f"data_moval_supp/Cardiacresults/GT/2/{caseID}/seg.nii.gz"
    #
    logit_cls0_file = "data_moval_supp/Cardiacresults/cardiactest_2/results/pred_" + caseID + "cls0_prob.nii.gz"
    logit_cls1_file = "data_moval_supp/Cardiacresults/cardiactest_2/results/pred_" + caseID + "cls1_prob.nii.gz"
    logit_cls2_file = "data_moval_supp/Cardiacresults/cardiactest_2/results/pred_" + caseID + "cls2_prob.nii.gz"
    logit_cls3_file = "data_moval_supp/Cardiacresults/cardiactest_2/results/pred_" + caseID + "cls3_prob.nii.gz"
    #
    logit_cls0_read = nib.load(logit_cls0_file)
    logit_cls1_read = nib.load(logit_cls1_file)
    logit_cls2_read = nib.load(logit_cls2_file)
    logit_cls3_read = nib.load(logit_cls3_file)
    #
    logit_cls0      = logit_cls0_read.get_fdata()   # ``(H, W, D)``
    logit_cls1      = logit_cls1_read.get_fdata()
    logit_cls2      = logit_cls2_read.get_fdata()
    logit_cls3      = logit_cls3_read.get_fdata()
    #
    GT_read         = nib.load(GT_file)
    GTimg           = GT_read.get_fdata()           # ``(H, W, D)``
    logit_cls = np.stack((logit_cls0, logit_cls1, logit_cls2, logit_cls3))  # ``(d, H, W, D)``
    # only including the slices that contains labels
    for dslice in range(GTimg.shape[2]):
        if np.sum(GTimg[:, :, dslice]) > 0:
            logits_test.append(logit_cls[:, :, :, dslice])
            gt_test.append(GTimg[:, :, dslice])

# logits_test is a list of length ``n``,  each element has ``(d, H, W)``.
# gt_test is a list of length ``n``,  each element has ``(H, W)``.
# H and W could differ for different cases.

[7]:

print(f"The validation predictions, ``logits`` are a list of length {len(logits)} each element has approximately {logits[0].shape}")
print(f"The validation labels, ``gt`` are a list of length {len(gt)}, each element has approximately {gt[0].shape}\n")
print(f"The test predictions, ``logits_test`` are a list of length {len(logits_test)} each element has approximately {logits_test[0].shape}")
print(f"The test labels, ``gt_test`` are a list of length {len(gt_test)}, each element has approximately {gt_test[0].shape}")

The validation predictions, ``logits`` are a list of length 156 each element has approximately (4, 210, 257)
The validation labels, ``gt`` are a list of length 156, each element has approximately (210, 257)

The test predictions, ``logits_test`` are a list of length 74 each element has approximately (4, 303, 303)
The test labels, ``gt_test`` are a list of length 74, each element has approximately (303, 303)

[8]:

import random
random.seed(79)
test_inds = list(range(len(logits)))
random.shuffle(test_inds)
test_inds = test_inds[:100]
#
_logits = []
_gt = []
for test_ind in test_inds:
    _logits.append(logits[test_ind])
    _gt.append(gt[test_ind])
logits = _logits
gt = _gt
#
print(f"The validation predictions, ``logits`` are a list of length {len(logits)} each element has approximately {logits[0].shape}")
print(f"The validation labels, ``gt`` are a list of length {len(gt)}, each element has approximately {gt[0].shape}")

The validation predictions, ``logits`` are a list of length 100 each element has approximately (4, 223, 272)
The validation labels, ``gt`` are a list of length 100, each element has approximately (223, 272)

MOVAL estimation

[9]:

moval_options = list(itertools.product(moval.models.get_estim_options(),
                               ["segmentation"],
                               moval.models.get_conf_options(),
                               [False, True]))

[10]:

# ac-model does not need class-speicfic variants
for moval_option in moval_options:
    if moval_option[0] == 'ac-model' and moval_option[-1] == True:
        moval_options.remove(moval_option)

[11]:

print(f"The number of moval options is {len(moval_options)}")

The number of moval options is 36

[12]:

def test_cls(estim_algorithm, mode, confidence_scores, class_specific, logits, gt, logits_test, gt_test):
    """Test MOVAL with different conditions for segmentation tasks

    Args:
        mode (str): The given task to estimate model performance.
        confidence_scores (str):
            The method to calculate the confidence scores. We provide a list of confidence score calculation methods which
            can be displayed by running :py:func:`moval.models.get_conf_options`.
        estim_algorithm (str):
            The algorithm to estimate model performance. We also provide a list of estimation algorithm which can be displayed by
            running :py:func:`moval.models.get_estim_options`.
        class_specific (bool):
            If ``True``, the calculation will match class-wise confidence to class-wise accuracy.
        logits: The network output (logits) of a list of n ``(d, H, W, (D))`` for segmentation.
        gt: The cooresponding annotation of a list of n ``(H, W, (D))`` for segmentation.
        logits_test:  The network testing output (logits) of a list of n' ``(d, H', W', (D'))`` for segmentation.
        gt_test: The cooresponding testing annotation of a list of n' ``(H', W', (D'))`` for segmentation.

    Returns:
        err_test (float): testing error.
        moval_model: Optimized moval model.

    """

    moval_model = moval.MOVAL(
                mode = mode,
                metric = "f1score",
                confidence_scores = confidence_scores,
                estim_algorithm = estim_algorithm,
                class_specific = class_specific,
                approximate = True,
                approximate_boundary = 10
                )

    #
    moval_model.fit(logits, gt)

    # save the test err in the result files.
    # the gt_guide for test data is optional
    gt_guide_test = []
    for n_case in range(len(logits_test)):
        gt_case_test     = gt_test[n_case]
        gt_exist_test = []
        for k_cls in range(logit_cls[0].shape[0]):
            gt_exist_test.append(np.sum(gt_case_test == k_cls) > 0)
        gt_guide_test.append(gt_exist_test)

    estim_dsc_test = moval_model.estimate(logits_test, gt_guide = gt_guide_test)

    DSC_list_test = []
    for n_case in range(len(logits_test)):
        pred_case   = np.argmax(logits_test[n_case], axis = 0) # ``(H', W', (D'))``
        gt_case     = gt_test[n_case] # ``(H', W', (D'))``

        dsc_case = np.zeros(logits_test[n_case].shape[0])
        for kcls in range(1, logits_test[n_case].shape[0]):
            if np.sum(gt_case == kcls) == 0:
                dsc_case[kcls] = -1
            else:
                dsc_cal, _, _ = ComputMetric(pred_case == kcls, gt_case == kcls)
                dsc_case[kcls] = dsc_cal
        DSC_list_test.append(dsc_case)

    # only aggregate the ones which are not -1
    DSC_list_test = np.array(DSC_list_test) # ``(n, d)``
    m_DSC_test = []
    m_DSC_test.append(0.)
    for kcls in range(1, logits_test[n_case].shape[0]):
        m_DSC_test.append(DSC_list_test[:, kcls][DSC_list_test[:,kcls] >= 0].mean())

    m_DSC_test = np.array(m_DSC_test)
    err_test = np.abs( m_DSC_test[1:] - estim_dsc_test[1:] )

    return err_test, moval_model

[13]:

err_test_list = []
moval_parameters = []
moval_parameters_ = []

[14]:

for k_cond in tqdm(range(len(moval_options))):

    err_test, moval_model = test_cls(
        estim_algorithm = moval_options[k_cond][0],
        mode = moval_options[k_cond][1],
        confidence_scores = moval_options[k_cond][2],
        class_specific = moval_options[k_cond][3],
        logits = logits,
        gt = gt,
        logits_test = logits_test,
        gt_test = gt_test
    )
    err_test_list.append(err_test)
    moval_parameters.append(moval_model.model_.param)
    if moval_model.model_.extend_param:
        moval_parameters_.append(moval_model.model_.param_ext)
    else:
        moval_parameters_.append(0.)

  0%|                                                                                                                                                                                | 0/36 [00:00<?, ?it/s]

Starting optimizing for model ac-model with confidence max_class_probability-conf based on metric f1score, class specific is False.
Calculating and saving the fitted case-wise performance...

  3%|████▋                                                                                                                                                                   | 1/36 [00:03<02:04,  3.57s/it]

Starting optimizing for model ac-model with confidence energy-conf based on metric f1score, class specific is False.
Calculating and saving the fitted case-wise performance...

  6%|█████████▎                                                                                                                                                              | 2/36 [00:06<01:47,  3.18s/it]

Starting optimizing for model ac-model with confidence entropy-conf based on metric f1score, class specific is False.
Calculating and saving the fitted case-wise performance...

  8%|██████████████                                                                                                                                                          | 3/36 [00:09<01:48,  3.30s/it]

Starting optimizing for model ac-model with confidence doctor-conf based on metric f1score, class specific is False.
Calculating and saving the fitted case-wise performance...

 11%|██████████████████▋                                                                                                                                                     | 4/36 [00:13<01:46,  3.32s/it]

Starting optimizing for model ts-model with confidence max_class_probability-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 14%|███████████████████████▎                                                                                                                                                | 5/36 [00:25<03:19,  6.45s/it]

Starting optimizing for model ts-model with confidence max_class_probability-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 17%|████████████████████████████                                                                                                                                            | 6/36 [01:07<09:13, 18.46s/it]

Starting optimizing for model ts-model with confidence energy-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 19%|████████████████████████████████▋                                                                                                                                       | 7/36 [01:16<07:34, 15.67s/it]

Starting optimizing for model ts-model with confidence energy-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 22%|█████████████████████████████████████▎                                                                                                                                  | 8/36 [01:50<10:00, 21.44s/it]

Starting optimizing for model ts-model with confidence entropy-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 25%|██████████████████████████████████████████                                                                                                                              | 9/36 [02:02<08:18, 18.46s/it]

Starting optimizing for model ts-model with confidence entropy-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 28%|██████████████████████████████████████████████▍                                                                                                                        | 10/36 [02:44<11:06, 25.64s/it]

Starting optimizing for model ts-model with confidence doctor-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 31%|███████████████████████████████████████████████████                                                                                                                    | 11/36 [02:55<08:50, 21.23s/it]

Starting optimizing for model ts-model with confidence doctor-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 33%|███████████████████████████████████████████████████████▋                                                                                                               | 12/36 [03:34<10:39, 26.63s/it]

Starting optimizing for model doc-model with confidence max_class_probability-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 36%|████████████████████████████████████████████████████████████▎                                                                                                          | 13/36 [03:44<08:13, 21.45s/it]

Starting optimizing for model doc-model with confidence max_class_probability-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 39%|████████████████████████████████████████████████████████████████▉                                                                                                      | 14/36 [04:15<08:57, 24.45s/it]

Starting optimizing for model doc-model with confidence energy-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 42%|█████████████████████████████████████████████████████████████████████▌                                                                                                 | 15/36 [04:23<06:49, 19.51s/it]

Starting optimizing for model doc-model with confidence energy-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 44%|██████████████████████████████████████████████████████████████████████████▏                                                                                            | 16/36 [04:47<06:59, 20.96s/it]

Starting optimizing for model doc-model with confidence entropy-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 47%|██████████████████████████████████████████████████████████████████████████████▊                                                                                        | 17/36 [04:58<05:37, 17.74s/it]

Starting optimizing for model doc-model with confidence entropy-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 50%|███████████████████████████████████████████████████████████████████████████████████▌                                                                                   | 18/36 [05:31<06:43, 22.42s/it]

Starting optimizing for model doc-model with confidence doctor-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 53%|████████████████████████████████████████████████████████████████████████████████████████▏                                                                              | 19/36 [05:40<05:14, 18.48s/it]

Starting optimizing for model doc-model with confidence doctor-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 56%|████████████████████████████████████████████████████████████████████████████████████████████▊                                                                          | 20/36 [06:11<05:55, 22.20s/it]

Starting optimizing for model atc-model with confidence max_class_probability-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 58%|█████████████████████████████████████████████████████████████████████████████████████████████████▍                                                                     | 21/36 [06:21<04:38, 18.59s/it]

Starting optimizing for model atc-model with confidence max_class_probability-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 61%|██████████████████████████████████████████████████████████████████████████████████████████████████████                                                                 | 22/36 [06:56<05:28, 23.47s/it]

Starting optimizing for model atc-model with confidence energy-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Not satisfied with initial optimization results of param, trying more initial states...
Tried 1/2 times.
Tried 2/2 times.
Starting from [array([0.19312145]), array([0.29048203])]
Optimization results are [(0.08045459362085416, array([1.e-06])), (0.0, array([0.14002602])), (2.7885757628576258e-06, array([0.14002142]))]
Calculating and saving the fitted case-wise performance...

 64%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                            | 23/36 [07:10<04:28, 20.64s/it]

Starting optimizing for model atc-model with confidence energy-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Not satisfied with initial optimization results of param for class 1, trying more initial states...
Tried 1/2 times.
Tried 2/2 times.
Starting from [array([0.177334]), array([0.22173539])]
Optimization results are [(0.8559184201221959, 1.0), (1.6680447718631086e-05, 0.173428843706127), (1.6680447718631086e-05, 0.17342836399719505)]
Not satisfied with initial optimization results of param for class 2, trying more initial states...
Tried 1/2 times.
Tried 2/2 times.
Starting from [array([0.14649727]), array([0.2286929])]
Optimization results are [(0.7733349333970854, 1.0), (1.533523151908689e-06, 0.16663349482341716), (1.533523151908689e-06, 0.1666342666892477)]
Not satisfied with initial optimization results of param for class 3, trying more initial states...
Tried 1/2 times.
Tried 2/2 times.
Starting from [array([0.15105374]), array([0.21764884])]
Optimization results are [(0.6414503204804388, 1.0), (1.0778631821528606e-06, 0.1667851199540572), (1.0778631821528606e-06, 0.1667855084976424)]
Calculating and saving the fitted case-wise performance...

 67%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                                       | 24/36 [08:20<07:03, 35.30s/it]

Starting optimizing for model atc-model with confidence entropy-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 69%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                   | 25/36 [08:31<05:08, 28.02s/it]

Starting optimizing for model atc-model with confidence entropy-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 72%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                              | 26/36 [09:06<05:00, 30.09s/it]

Starting optimizing for model atc-model with confidence doctor-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 75%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                         | 27/36 [09:16<03:36, 24.09s/it]

Starting optimizing for model atc-model with confidence doctor-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 78%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                     | 28/36 [09:49<03:34, 26.86s/it]

Starting optimizing for model ts-atc-model with confidence max_class_probability-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 81%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                | 29/36 [10:08<02:51, 24.43s/it]

Starting optimizing for model ts-atc-model with confidence max_class_probability-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Not satisfied with initial optimization results of param_ext for class 3, trying more initial states...
Tried 1/2 times.
Tried 2/2 times.
Starting from [array([0.53913427]), array([0.63207521])]
Optimization results are [(0.6414503204804388, 1.0), (3.218282718098209e-08, 0.5458471227260648), (3.218282718098209e-08, 0.5458477120983736)]
Calculating and saving the fitted case-wise performance...

 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                           | 30/36 [11:47<04:41, 46.92s/it]

Starting optimizing for model ts-atc-model with confidence energy-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                       | 31/36 [12:02<03:06, 37.26s/it]

Starting optimizing for model ts-atc-model with confidence energy-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Not satisfied with initial optimization results of param_ext for class 1, trying more initial states...
Tried 1/2 times.
Tried 2/2 times.
Starting from [array([0.76880795]), array([0.77759651])]
Optimization results are [(0.8559184201221959, 1.0), (5.128821463085131e-07, 0.7697880787367792), (5.128821463085131e-07, 0.7697880389497944)]
Not satisfied with initial optimization results of param_ext for class 2, trying more initial states...
Tried 1/2 times.
Tried 2/2 times.
Starting from [array([0.64530392]), array([0.65915408])]
Optimization results are [(0.7733349333970854, 1.0), (1.770843187420823e-06, 0.6485174057313767), (1.770843187420823e-06, 0.6485173585157611)]
Not satisfied with initial optimization results of param_ext for class 3, trying more initial states...
Tried 1/2 times.
Tried 2/2 times.
Starting from [array([0.58497467]), array([0.59798012])]
Optimization results are [(0.6414503204804388, 1.0), (0.0003873978043705817, 0.5920989567019319), (0.0003873978043705817, 0.5920989517050825)]
Calculating and saving the fitted case-wise performance...

 89%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                  | 32/36 [13:52<03:56, 59.21s/it]

Starting optimizing for model ts-atc-model with confidence entropy-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 92%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████              | 33/36 [14:12<02:22, 47.46s/it]

Starting optimizing for model ts-atc-model with confidence entropy-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 94%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋         | 34/36 [15:35<01:55, 57.99s/it]

Starting optimizing for model ts-atc-model with confidence doctor-conf based on metric f1score, class specific is False.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎    | 35/36 [15:54<00:46, 46.19s/it]

Starting optimizing for model ts-atc-model with confidence doctor-conf based on metric f1score, class specific is True.
Opitimizing with 100 samples...
Be patient, it should take a while...
Calculating and saving the fitted case-wise performance...

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [17:09<00:00, 28.61s/it]

Compare estimation results

[15]:

estim = []
conf = []
err = []
err_mean = []
novel = []
k_option = 0
# attach the class-wise segmentation error
for moval_option in moval_options:
    for k_cond in range(len(err_test_list[k_option])):
        #
        if moval_option[3] == True:
            estim_cs = 'CS '
        else:
            estim_cs = ''
        #
        if moval_option[0] == 'ac-model':
            estim.append(estim_cs + 'AC')
        elif moval_option[0] == 'ts-model':
            estim.append(estim_cs + 'TS')
        elif moval_option[0] == 'doc-model':
            estim.append(estim_cs + 'DoC')
        elif moval_option[0] == 'atc-model':
            estim.append(estim_cs + 'ATC')
        else:
            estim.append(estim_cs + 'TS-ATC')
        #
        if moval_option[2] == 'max_class_probability-conf':
            conf.append('MCP')
        elif moval_option[2] == 'energy-conf':
            conf.append('Energy')
        elif moval_option[2] == 'entropy-conf':
            conf.append('Entropy')
        else:
            conf.append('Doctor')
        #
        if moval_option[2] == 'max_class_probability-conf' and moval_option[3] == False:
            novel.append('Existing Methods')
        else:
            novel.append('Provided by MOVAL')
        #
        err.append(err_test_list[k_option][k_cond])
        err_mean.append(np.mean(err_test_list[k_option]))
    k_option += 1

[16]:

d = {'Estimation Algorithm': estim, 'Confidence Score': conf, 'MAE': err_mean, 'MAE ': err, 'Category': novel}
df = pd.DataFrame(data=d)
#
custom_order = ['AC', 'TS', 'DoC', 'ATC', 'TS-ATC', 'CS TS', 'CS DoC', 'CS ATC', 'CS TS-ATC']
df['Estimation Algorithm'] = pd.Categorical(df['Estimation Algorithm'], categories=custom_order, ordered=True)
df = df.sort_values(by='Estimation Algorithm')
#
custom_order = ['MCP', 'Doctor', 'Entropy', 'Energy']
df['Confidence Score'] = pd.Categorical(df['Confidence Score'], categories=custom_order, ordered=True)
df = df.sort_values(by='Confidence Score')

[17]:

df.head()

[17]:

	Estimation Algorithm	Confidence Score	MAE	MAE	Category
0	AC	MCP	0.180960	0.160946	Existing Methods
38	DoC	MCP	0.147394	0.149014	Existing Methods
61	ATC	MCP	0.084708	0.096688	Existing Methods
62	ATC	MCP	0.084708	0.045736	Existing Methods
60	ATC	MCP	0.084708	0.111699	Existing Methods

[18]:

sns.set(rc={'figure.figsize':(6,3)})
sns.set_style("darkgrid")
category_palette = {'Existing Methods': 'grey', 'Provided by MOVAL': '#1f77b4'}
ax = sns.scatterplot(
    data=df, x="Estimation Algorithm", y="Confidence Score", hue="Category", size="MAE",
    sizes=(40, 1000), palette=category_palette
)
ax.set(ylim=(3.5, -0.5))
ax.tick_params(axis='x', rotation=15)
#
# Get the handles and labels from the legend
handles, labels = ax.get_legend_handles_labels()

# Create a custom legend with only desired categories
desired_labels = ['Category', 'Existing Methods', 'Provided by MOVAL', 'MAE', '0.06', '0.12']
desired_handles = [h for h, l in zip(handles, labels) if l in desired_labels]

legend = plt.legend(handles=desired_handles, labels=desired_labels, bbox_to_anchor=(1.2, 1), labelspacing=1)

# Increase the line spacing by adjusting position

[19]:

from statannotations.Annotator import Annotator
sns.set(rc={'figure.figsize':(6,2)})
sns.set_style("white")
ax = sns.barplot(df, x="Estimation Algorithm", y="MAE", color = '#1f77b4')
ax.tick_params(axis='x', rotation=15)
#
ax.spines['top'].set_color('none')
ax.spines['right'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
#
pairs=[("TS", "CS TS"), ("DoC", "CS DoC"), ("ATC", "CS ATC"), ("TS-ATC", "CS TS-ATC")]

annotator = Annotator(ax, pairs, data=df, x="Estimation Algorithm", y="MAE")
annotator.configure(test='Mann-Whitney', text_format='star', loc='inside')
annotator.apply_and_annotate()

p-value annotation legend:
      ns: 5.00e-02 < p <= 1.00e+00
       *: 1.00e-02 < p <= 5.00e-02
      **: 1.00e-03 < p <= 1.00e-02
     ***: 1.00e-04 < p <= 1.00e-03
    ****: p <= 1.00e-04

TS vs. CS TS: Mann-Whitney-Wilcoxon test two-sided, P_val:3.223e-05 U_stat=1.440e+02
DoC vs. CS DoC: Mann-Whitney-Wilcoxon test two-sided, P_val:3.223e-05 U_stat=1.440e+02
ATC vs. CS ATC: Mann-Whitney-Wilcoxon test two-sided, P_val:3.223e-05 U_stat=1.440e+02
TS-ATC vs. CS TS-ATC: Mann-Whitney-Wilcoxon test two-sided, P_val:9.674e-03 U_stat=1.170e+02

[19]:

(<Axes: xlabel='Estimation Algorithm', ylabel='MAE'>,
 [<statannotations.Annotation.Annotation at 0x7ff3baa3fe20>,
  <statannotations.Annotation.Annotation at 0x7ff37bea33a0>,
  <statannotations.Annotation.Annotation at 0x7ff37bea33d0>,
  <statannotations.Annotation.Annotation at 0x7ff37bea3430>])

[20]:

sns.set(rc={'figure.figsize':(3,2)})
sns.set_style("white")
ax = sns.barplot(df, x="Confidence Score", y="MAE", color = '#1f77b4')
ax.tick_params(axis='x', rotation=15)
#
ax.spines['top'].set_color('none')
ax.spines['right'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
#
pairs=[("MCP", "Doctor"), ("MCP", "Entropy"), ("MCP", "Energy")]

annotator = Annotator(ax, pairs, data=df, x="Confidence Score", y="MAE")
annotator.configure(test='Mann-Whitney', text_format='star', loc='inside')
annotator.apply_and_annotate()

p-value annotation legend:
      ns: 5.00e-02 < p <= 1.00e+00
       *: 1.00e-02 < p <= 5.00e-02
      **: 1.00e-03 < p <= 1.00e-02
     ***: 1.00e-04 < p <= 1.00e-03
    ****: p <= 1.00e-04

MCP vs. Doctor: Mann-Whitney-Wilcoxon test two-sided, P_val:8.218e-01 U_stat=3.780e+02
MCP vs. Entropy: Mann-Whitney-Wilcoxon test two-sided, P_val:8.218e-01 U_stat=3.780e+02
MCP vs. Energy: Mann-Whitney-Wilcoxon test two-sided, P_val:7.437e-02 U_stat=2.610e+02

[20]:

(<Axes: xlabel='Confidence Score', ylabel='MAE'>,
 [<statannotations.Annotation.Annotation at 0x7ff3baa3f730>,
  <statannotations.Annotation.Annotation at 0x7ff3baa3ffd0>,
  <statannotations.Annotation.Annotation at 0x7ff3baa3f9a0>])

[21]:

sns.set(rc={'figure.figsize':(12,3)})
category_palette = {'MCP': '#e5f0f8',
                    'Doctor': '#99c6e4',
                    'Entropy': '#4c9cd0',
                    'Energy': '#0072bd'
                   }
ax = sns.boxplot(df, x="Estimation Algorithm", y="MAE ", hue="Confidence Score", palette=category_palette)

sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

pairs=[
    [("ATC", "MCP"), ("CS DoC", "Doctor")]
]



annotator = Annotator(ax, pairs, data=df, x="Estimation Algorithm", y="MAE", hue="Confidence Score")
annotator.configure(test='Mann-Whitney', text_format='star', loc='inside')
annotator.apply_and_annotate()

ax.set(ylim=(-0.02, 0.25))

p-value annotation legend:
      ns: 5.00e-02 < p <= 1.00e+00
       *: 1.00e-02 < p <= 5.00e-02
      **: 1.00e-03 < p <= 1.00e-02
     ***: 1.00e-04 < p <= 1.00e-03
    ****: p <= 1.00e-04

ATC_MCP vs. CS DoC_Doctor: Mann-Whitney-Wilcoxon test two-sided, P_val:4.685e-02 U_stat=9.000e+00

[21]:

[(-0.02, 0.25)]

[ ]:

[ ]:

[ ]: