Evaluation of results

This example shows how to use the pymia.evaluation package to evaluate predicted segmentations against reference ground truths. Common metrics in medical image segmentation are the Dice coefficient, an overlap-based metric, and the Hausdorff distance, a distance-based metric. In addition, we evaluate the volume similarity, a metric that does not consider spatial overlap. The evaluation results are logged to the console and saved to a CSV file. Further, statistics (mean and standard deviation) are calculated over all evaluated segmentations and are also logged to the console and saved to a CSV file. The CSV files can be loaded into any statistical software for further analysis and visualization.
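To build some intuition for these metrics before using pymia, here is a minimal NumPy sketch (an illustration only, not pymia's implementation) of the Dice coefficient and the volume similarity for two binary masks:

import numpy as np

reference = np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]], dtype=bool)   # toy reference mask, 4 voxels
prediction = np.array([[0, 1, 0], [0, 1, 1], [0, 0, 0]], dtype=bool)  # toy prediction mask, 3 voxels

intersection = np.logical_and(reference, prediction).sum()
# Dice coefficient: overlap-based, 2 |A ∩ B| / (|A| + |B|)
dice = 2 * intersection / (reference.sum() + prediction.sum())
# volume similarity: compares the segmented volumes only, 1 - ||A| - |B|| / (|A| + |B|)
volume_similarity = 1 - abs(int(reference.sum()) - int(prediction.sum())) / (reference.sum() + prediction.sum())
print(dice, volume_similarity)  # both 0.857... for this toy example

Note that whenever the prediction is a subset of the reference, as is the case for the eroded predictions used below, the two metrics coincide, which is why the DICE and VOLSMTY columns in the results further down are identical.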

Tip

This example is available as a Jupyter notebook at ./examples/evaluation/basic.ipynb and as a Python script at ./examples/evaluation/basic.py.

Note

To be able to run this example:

- Get the example data (this example expects it at ../example-data).
- Install pandas (e.g., pip install pandas), which is used at the end of this example to inspect the CSV files.

Import the required modules.

[1]:
import glob
import os

import numpy as np
import pymia.evaluation.metric as metric
import pymia.evaluation.evaluator as eval_
import pymia.evaluation.writer as writer
import SimpleITK as sitk

Define the paths to the data and the result CSV files.

[2]:
data_dir = '../example-data'

result_file = '../example-data/results.csv'
result_summary_file = '../example-data/results_summary.csv'

Let us create a list with the three metrics: the Dice coefficient, the Hausdorff distance, and the volume similarity. Note that we are interested in the outlier-robust Hausdorff distance at the 95th percentile, and therefore pass the percentile as an argument and adapt the metric's name accordingly.

[3]:
metrics = [metric.DiceCoefficient(),
           metric.HausdorffDistance(percentile=95, metric='HDRFDST95'),
           metric.VolumeSimilarity()]
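As a side note, the idea behind the percentile Hausdorff distance can be sketched for two point sets (a simplified illustration using the NumPy import from above, not pymia's surface-based implementation):

def hausdorff_95(points_a: np.ndarray, points_b: np.ndarray) -> float:
    # pairwise Euclidean distances between all points of A and B (n x m matrix)
    distances = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    # directed distances: for each point, the distance to the closest point of the other set
    d_ab = distances.min(axis=1)  # A to B
    d_ba = distances.min(axis=0)  # B to A
    # taking the 95th percentile instead of the maximum makes the metric robust to outliers
    return max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))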

Now, we need to define the labels we want to evaluate. In the provided example data, we have five labels for different brain structures. Here, we are only interested in three of them: white matter, grey matter, and the thalamus.

[4]:
labels = {1: 'WHITEMATTER',
          2: 'GREYMATTER',
          5: 'THALAMUS'
          }
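If you do not know the label values of your data beforehand, you could, for instance, inspect one of the ground truth images (using Subject_1 here, assuming the data layout of this example):

gt = sitk.ReadImage(os.path.join(data_dir, 'Subject_1', 'Subject_1_GT.mha'))
print(np.unique(sitk.GetArrayViewFromImage(gt)))  # e.g., [0 1 2 3 4 5] (background plus five labels)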

Finally, we can initialize an evaluator with the metrics and labels.

[5]:
evaluator = eval_.SegmentationEvaluator(metrics, labels)

We can now loop over the subjects of the example data. We load the ground truth image as the reference and create an artificial segmentation (the prediction) by eroding the ground truth. Both images and the subject identifier are then passed to the evaluator.

[6]:
# get subjects to evaluate
subject_dirs = [subject for subject in glob.glob(os.path.join(data_dir, '*'))
                if os.path.isdir(subject) and os.path.basename(subject).startswith('Subject')]

for subject_dir in subject_dirs:
    subject_id = os.path.basename(subject_dir)
    print(f'Evaluating {subject_id}...')

    # load the ground truth image and create an artificial prediction by eroding it
    ground_truth = sitk.ReadImage(os.path.join(subject_dir, f'{subject_id}_GT.mha'))
    prediction = ground_truth
    for label_val in labels.keys():
        # erode each label we are going to evaluate; BinaryErode returns a new image,
        # so the loaded ground truth itself remains unmodified
        prediction = sitk.BinaryErode(prediction, [1] * prediction.GetDimension(), sitk.sitkBall, 0, label_val)

    # evaluate the "prediction" against the ground truth
    evaluator.evaluate(prediction, ground_truth, subject_id)
Evaluating Subject_2...
Evaluating Subject_4...
Evaluating Subject_3...
Evaluating Subject_1...

After evaluating all subjects, we can use a CSV writer to write the evaluation results to a CSV file.

[7]:
writer.CSVWriter(result_file).write(evaluator.results)

Further, we can use a console writer to display the results in the console.

[8]:
print('\nSubject-wise results...')
writer.ConsoleWriter().write(evaluator.results)

Subject-wise results...
SUBJECT    LABEL        DICE   HDRFDST95  VOLSMTY
Subject_1  GREYMATTER   0.313  9.165      0.313
Subject_1  THALAMUS     0.752  2.000      0.752
Subject_1  WHITEMATTER  0.642  6.708      0.642
Subject_2  GREYMATTER   0.298  10.863     0.298
Subject_2  THALAMUS     0.768  2.000      0.768
Subject_2  WHITEMATTER  0.654  6.000      0.654
Subject_3  GREYMATTER   0.287  8.718      0.287
Subject_3  THALAMUS     0.761  2.000      0.761
Subject_3  WHITEMATTER  0.641  6.164      0.641
Subject_4  GREYMATTER   0.259  8.660      0.259
Subject_4  THALAMUS     0.781  2.000      0.781
Subject_4  WHITEMATTER  0.649  6.000      0.649

We can also report statistics such as the mean and standard deviation among all subjects using dedicated statistics writers. Note that you can pass any function that takes a list of floats and returns a scalar value to the writers. Again, we write a CSV file and display the results in the console.

[9]:
functions = {'MEAN': np.mean, 'STD': np.std}
writer.CSVStatisticsWriter(result_summary_file, functions=functions).write(evaluator.results)
print('\nAggregated statistic results...')
writer.ConsoleStatisticsWriter(functions=functions).write(evaluator.results)

Aggregated statistic results...
LABEL        METRIC     STATISTIC  VALUE
GREYMATTER   DICE       MEAN       0.289
GREYMATTER   DICE       STD        0.020
GREYMATTER   HDRFDST95  MEAN       9.351
GREYMATTER   HDRFDST95  STD        0.894
GREYMATTER   VOLSMTY    MEAN       0.289
GREYMATTER   VOLSMTY    STD        0.020
THALAMUS     DICE       MEAN       0.766
THALAMUS     DICE       STD        0.010
THALAMUS     HDRFDST95  MEAN       2.000
THALAMUS     HDRFDST95  STD        0.000
THALAMUS     VOLSMTY    MEAN       0.766
THALAMUS     VOLSMTY    STD        0.010
WHITEMATTER  DICE       MEAN       0.647
WHITEMATTER  DICE       STD        0.005
WHITEMATTER  HDRFDST95  MEAN       6.218
WHITEMATTER  HDRFDST95  STD        0.291
WHITEMATTER  VOLSMTY    MEAN       0.647
WHITEMATTER  VOLSMTY    STD        0.005
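Since the writers accept arbitrary aggregation functions, you could also report, for example, the median or a percentile (a sketch; any callable mapping a list of floats to a scalar works):

functions = {'MEAN': np.mean,
             'STD': np.std,
             'MEDIAN': np.median,
             'P05': lambda values: np.percentile(values, 5)}
writer.ConsoleStatisticsWriter(functions=functions).write(evaluator.results)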

Finally, we clear the results in the evaluator so that it is ready for the next evaluation.

[10]:
evaluator.clear()
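Clearing the results allows reusing the same evaluator instance, for example to monitor segmentation quality after each training epoch (a hypothetical sketch; num_epochs and validation_results are placeholders, not part of this example):

for epoch in range(num_epochs):  # num_epochs: hypothetical number of training epochs
    for subject_id, prediction, reference in validation_results:  # hypothetical iterable of SimpleITK images
        evaluator.evaluate(prediction, reference, subject_id)
    writer.CSVWriter(f'results_epoch{epoch}.csv').write(evaluator.results)
    evaluator.clear()  # reset for the next epoch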

Now, let us have a look at the saved result CSV file.

[11]:
import pandas as pd

pd.read_csv(result_file, sep=';')
[11]:
    SUBJECT    LABEL        DICE      HDRFDST95  VOLSMTY
0   Subject_1  GREYMATTER   0.313373   9.165151  0.313373
1   Subject_1  THALAMUS     0.752252   2.000000  0.752252
2   Subject_1  WHITEMATTER  0.642021   6.708204  0.642021
3   Subject_2  GREYMATTER   0.298358  10.862780  0.298358
4   Subject_2  THALAMUS     0.768488   2.000000  0.768488
5   Subject_2  WHITEMATTER  0.654239   6.000000  0.654239
6   Subject_3  GREYMATTER   0.287460   8.717798  0.287460
7   Subject_3  THALAMUS     0.760978   2.000000  0.760978
8   Subject_3  WHITEMATTER  0.641251   6.164414  0.641251
9   Subject_4  GREYMATTER   0.258504   8.660254  0.258504
10  Subject_4  THALAMUS     0.780754   2.000000  0.780754
11  Subject_4  WHITEMATTER  0.649203   6.000000  0.649203
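As mentioned at the beginning, such a CSV file can directly be analyzed further, e.g., aggregating the metrics per label in pandas (a sketch that reproduces the statistics computed by the writers above):

df = pd.read_csv(result_file, sep=';')
df.groupby('LABEL')[['DICE', 'HDRFDST95', 'VOLSMTY']].agg(['mean', 'std'])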

And also at the saved statistics CSV file.

[12]:
pd.read_csv(result_summary_file, sep=';')

[12]:
    LABEL        METRIC     STATISTIC     VALUE
0   GREYMATTER   DICE       MEAN       0.289424
1   GREYMATTER   DICE       STD        0.020083
2   GREYMATTER   HDRFDST95  MEAN       9.351496
3   GREYMATTER   HDRFDST95  STD        0.894161
4   GREYMATTER   VOLSMTY    MEAN       0.289424
5   GREYMATTER   VOLSMTY    STD        0.020083
6   THALAMUS     DICE       MEAN       0.765618
7   THALAMUS     DICE       STD        0.010458
8   THALAMUS     HDRFDST95  MEAN       2.000000
9   THALAMUS     HDRFDST95  STD        0.000000
10  THALAMUS     VOLSMTY    MEAN       0.765618
11  THALAMUS     VOLSMTY    STD        0.010458
12  WHITEMATTER  DICE       MEAN       0.646678
13  WHITEMATTER  DICE       STD        0.005355
14  WHITEMATTER  HDRFDST95  MEAN       6.218154
15  WHITEMATTER  HDRFDST95  STD        0.290783
16  WHITEMATTER  VOLSMTY    MEAN       0.646678
17  WHITEMATTER  VOLSMTY    STD        0.005355