Evaluation of results

This example shows how to use the pymia.evaluation package to evaluate predicted segmentations against reference ground truths. Common metrics in medical image segmentation are the Dice coefficient, an overlap-based metric, and the Hausdorff distance, a distance-based metric. In addition, we evaluate the volume similarity, a metric that does not consider spatial overlap. The evaluation results are logged to the console and saved to a CSV file. Further, statistics (mean and standard deviation) are calculated over all evaluated segmentations and are also logged to the console and saved to a CSV file. The CSV files can be loaded into any statistical software for further analysis and visualization.
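To build some intuition for these metrics before using pymia, here is a minimal NumPy sketch (an illustration only, not pymia's implementation) of the Dice coefficient and the volume similarity for two binary masks:

import numpy as np

reference = np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]], dtype=bool)   # toy reference mask, 4 voxels
prediction = np.array([[0, 1, 0], [0, 1, 1], [0, 0, 0]], dtype=bool)  # toy prediction mask, 3 voxels

intersection = np.logical_and(reference, prediction).sum()
# Dice coefficient: overlap-based, 2 |A ∩ B| / (|A| + |B|)
dice = 2 * intersection / (reference.sum() + prediction.sum())
# volume similarity: compares the segmented volumes only, 1 - ||A| - |B|| / (|A| + |B|)
volume_similarity = 1 - abs(int(reference.sum()) - int(prediction.sum())) / (reference.sum() + prediction.sum())
print(dice, volume_similarity)  # both 0.857... for this toy example

Note that whenever the prediction is a subset of the reference, as is the case for the eroded predictions used below, the two metrics coincide, which is why the DICE and VOLSMTY columns in the results further down are identical.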

Tip

This example is available as a Jupyter notebook at ./examples/evaluation/basic.ipynb and as a Python script at ./examples/evaluation/basic.py.

Note

To be able to run this example:

- Get the example data (this example expects it at ../example-data).
- Install pandas (e.g., pip install pandas), which is used at the end of this example to inspect the CSV files.

Import the required modules.

[1]:
import glob
import os

import numpy as np
import pymia.evaluation.metric as metric
import pymia.evaluation.evaluator as eval_
import pymia.evaluation.writer as writer
import SimpleITK as sitk

Define the paths to the data and the result CSV files.

[2]:
data_dir = '../example-data'

result_file = '../example-data/results.csv'
result_summary_file = '../example-data/results_summary.csv'

Let us create a list with the three metrics: the Dice coefficient, the Hausdorff distance, and the volume similarity. Note that we are interested in the outlier-robust Hausdorff distance at the 95th percentile, and therefore pass the percentile as an argument and adapt the metric's name accordingly.

[3]:
metrics = [metric.DiceCoefficient(),
           metric.HausdorffDistance(percentile=95, metric='HDRFDST95'),
           metric.VolumeSimilarity()]
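As a side note, the idea behind the percentile Hausdorff distance can be sketched for two point sets (a simplified illustration using the NumPy import from above, not pymia's surface-based implementation):

def hausdorff_95(points_a: np.ndarray, points_b: np.ndarray) -> float:
    # pairwise Euclidean distances between all points of A and B (n x m matrix)
    distances = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    # directed distances: for each point, the distance to the closest point of the other set
    d_ab = distances.min(axis=1)  # A to B
    d_ba = distances.min(axis=0)  # B to A
    # taking the 95th percentile instead of the maximum makes the metric robust to outliers
    return max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))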

Now, we need to define the labels we want to evaluate. In the provided example data, we have five labels for different brain structures. Here, we are only interested in three of them: white matter, grey matter, and the thalamus.

[4]:
labels = {1: 'WHITEMATTER',
          2: 'GREYMATTER',
          5: 'THALAMUS'
          }
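If you do not know the label values of your data beforehand, you could, for instance, inspect one of the ground truth images (using Subject_1 here, assuming the data layout of this example):

gt = sitk.ReadImage(os.path.join(data_dir, 'Subject_1', 'Subject_1_GT.mha'))
print(np.unique(sitk.GetArrayViewFromImage(gt)))  # e.g., [0 1 2 3 4 5] (background plus five labels)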

Finally, we can initialize an evaluator with the metrics and labels.

[5]:
evaluator = eval_.SegmentationEvaluator(metrics, labels)

We can now loop over the subjects of the example data. We load the ground truth image as the reference and create an artificial segmentation (the prediction) by eroding the ground truth. Both images and the subject identifier are then passed to the evaluator.

[6]:
# get subjects to evaluate
subject_dirs = [subject for subject in glob.glob(os.path.join(data_dir, '*'))
                if os.path.isdir(subject) and os.path.basename(subject).startswith('Subject')]

for subject_dir in subject_dirs:
    subject_id = os.path.basename(subject_dir)
    print(f'Evaluating {subject_id}...')

    # load the ground truth image and create an artificial prediction by eroding it
    ground_truth = sitk.ReadImage(os.path.join(subject_dir, f'{subject_id}_GT.mha'))
    prediction = ground_truth
    for label_val in labels.keys():
        # erode each label we are going to evaluate; BinaryErode returns a new image,
        # so the loaded ground truth itself remains unmodified
        prediction = sitk.BinaryErode(prediction, [1] * prediction.GetDimension(), sitk.sitkBall, 0, label_val)

    # evaluate the "prediction" against the ground truth
    evaluator.evaluate(prediction, ground_truth, subject_id)
Evaluating Subject_2...
Evaluating Subject_4...
Evaluating Subject_3...
Evaluating Subject_1...

After evaluating all subjects, we can use a CSV writer to write the evaluation results to a CSV file.

[7]:
writer.CSVWriter(result_file).write(evaluator.results)

Further, we can use a console writer to display the results in the console.

[8]:
print('\nSubject-wise results...')
writer.ConsoleWriter().write(evaluator.results)

Subject-wise results...
SUBJECT    LABEL        DICE   HDRFDST95  VOLSMTY
Subject_1  GREYMATTER   0.313  9.165      0.313
Subject_1  THALAMUS     0.752  2.000      0.752
Subject_1  WHITEMATTER  0.642  6.708      0.642
Subject_2  GREYMATTER   0.298  10.863     0.298
Subject_2  THALAMUS     0.768  2.000      0.768
Subject_2  WHITEMATTER  0.654  6.000      0.654
Subject_3  GREYMATTER   0.287  8.718      0.287
Subject_3  THALAMUS     0.761  2.000      0.761
Subject_3  WHITEMATTER  0.641  6.164      0.641
Subject_4  GREYMATTER   0.259  8.660      0.259
Subject_4  THALAMUS     0.781  2.000      0.781
Subject_4  WHITEMATTER  0.649  6.000      0.649

We can also report statistics such as the mean and standard deviation among all subjects using dedicated statistics writers. Note that you can pass any function that takes a list of floats and returns a scalar value to the writers. Again, we write a CSV file and display the results in the console.

[9]:
functions = {'MEAN': np.mean, 'STD': np.std}
writer.CSVStatisticsWriter(result_summary_file, functions=functions).write(evaluator.results)
print('\nAggregated statistic results...')
writer.ConsoleStatisticsWriter(functions=functions).write(evaluator.results)

Aggregated statistic results...
LABEL        METRIC     STATISTIC  VALUE
GREYMATTER   DICE       MEAN       0.289
GREYMATTER   DICE       STD        0.020
GREYMATTER   HDRFDST95  MEAN       9.351
GREYMATTER   HDRFDST95  STD        0.894
GREYMATTER   VOLSMTY    MEAN       0.289
GREYMATTER   VOLSMTY    STD        0.020
THALAMUS     DICE       MEAN       0.766
THALAMUS     DICE       STD        0.010
THALAMUS     HDRFDST95  MEAN       2.000
THALAMUS     HDRFDST95  STD        0.000
THALAMUS     VOLSMTY    MEAN       0.766
THALAMUS     VOLSMTY    STD        0.010
WHITEMATTER  DICE       MEAN       0.647
WHITEMATTER  DICE       STD        0.005
WHITEMATTER  HDRFDST95  MEAN       6.218
WHITEMATTER  HDRFDST95  STD        0.291
WHITEMATTER  VOLSMTY    MEAN       0.647
WHITEMATTER  VOLSMTY    STD        0.005
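Since the writers accept arbitrary aggregation functions, you could also report, for example, the median or a percentile (a sketch; any callable mapping a list of floats to a scalar works):

functions = {'MEAN': np.mean,
             'STD': np.std,
             'MEDIAN': np.median,
             'P05': lambda values: np.percentile(values, 5)}
writer.ConsoleStatisticsWriter(functions=functions).write(evaluator.results)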

Finally, we clear the results in the evaluator so that it is ready for the next evaluation.

[10]:
evaluator.clear()
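Clearing the results allows reusing the same evaluator instance, for example to monitor segmentation quality after each training epoch (a hypothetical sketch; num_epochs and validation_results are placeholders, not part of this example):

for epoch in range(num_epochs):  # num_epochs: hypothetical number of training epochs
    for subject_id, prediction, reference in validation_results:  # hypothetical iterable of SimpleITK images
        evaluator.evaluate(prediction, reference, subject_id)
    writer.CSVWriter(f'results_epoch{epoch}.csv').write(evaluator.results)
    evaluator.clear()  # reset for the next epoch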

Now, let us have a look at the saved result CSV file.

[11]:
import pandas as pd

pd.read_csv(result_file, sep=';')
[11]:
    SUBJECT    LABEL        DICE      HDRFDST95  VOLSMTY
0   Subject_1  GREYMATTER   0.313373   9.165151  0.313373
1   Subject_1  THALAMUS     0.752252   2.000000  0.752252
2   Subject_1  WHITEMATTER  0.642021   6.708204  0.642021
3   Subject_2  GREYMATTER   0.298358  10.862780  0.298358
4   Subject_2  THALAMUS     0.768488   2.000000  0.768488
5   Subject_2  WHITEMATTER  0.654239   6.000000  0.654239
6   Subject_3  GREYMATTER   0.287460   8.717798  0.287460
7   Subject_3  THALAMUS     0.760978   2.000000  0.760978
8   Subject_3  WHITEMATTER  0.641251   6.164414  0.641251
9   Subject_4  GREYMATTER   0.258504   8.660254  0.258504
10  Subject_4  THALAMUS     0.780754   2.000000  0.780754
11  Subject_4  WHITEMATTER  0.649203   6.000000  0.649203
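As mentioned at the beginning, such a CSV file can directly be analyzed further, e.g., aggregating the metrics per label in pandas (a sketch that reproduces the statistics computed by the writers above):

df = pd.read_csv(result_file, sep=';')
df.groupby('LABEL')[['DICE', 'HDRFDST95', 'VOLSMTY']].agg(['mean', 'std'])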

And also at the saved statistics CSV file.

[12]:
pd.read_csv(result_summary_file, sep=';')

[12]:
    LABEL        METRIC     STATISTIC     VALUE
0   GREYMATTER   DICE       MEAN       0.289424
1   GREYMATTER   DICE       STD        0.020083
2   GREYMATTER   HDRFDST95  MEAN       9.351496
3   GREYMATTER   HDRFDST95  STD        0.894161
4   GREYMATTER   VOLSMTY    MEAN       0.289424
5   GREYMATTER   VOLSMTY    STD        0.020083
6   THALAMUS     DICE       MEAN       0.765618
7   THALAMUS     DICE       STD        0.010458
8   THALAMUS     HDRFDST95  MEAN       2.000000
9   THALAMUS     HDRFDST95  STD        0.000000
10  THALAMUS     VOLSMTY    MEAN       0.765618
11  THALAMUS     VOLSMTY    STD        0.010458
12  WHITEMATTER  DICE       MEAN       0.646678
13  WHITEMATTER  DICE       STD        0.005355
14  WHITEMATTER  HDRFDST95  MEAN       6.218154
15  WHITEMATTER  HDRFDST95  STD        0.290783
16  WHITEMATTER  VOLSMTY    MEAN       0.646678
17  WHITEMATTER  VOLSMTY    STD        0.005355