Evaluation of results¶
This example shows how to use the pymia.evaluation
package to evaluate predicted segmentations against reference ground truths. Common metrics in medical image segmentation are the Dice coefficient, an overlap-based metric, and the Hausdorff distance, a distance-based metric. In addition, we evaluate the volume similarity, a metric that does not consider the spatial overlap. The evaluation results are logged to the console and saved to a CSV file. Further, statistics (mean and standard deviation) are calculated over all evaluated segmentations, which are again logged to the console and saved to a CSV file. The CSV files can be loaded into any statistical software for further analysis and visualization.
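To make the metrics concrete, here is a small NumPy sketch (purely illustrative, not part of pymia) of how the Dice coefficient and the volume similarity can be computed for two binary masks:

```python
import numpy as np

# two toy binary masks (1 = structure, 0 = background)
reference = np.array([[1, 1, 0],
                      [1, 1, 0],
                      [0, 0, 0]])
prediction = np.array([[1, 0, 0],
                       [1, 1, 0],
                       [0, 0, 0]])

# Dice = 2 |A ∩ B| / (|A| + |B|): overlap-based
intersection = np.logical_and(reference, prediction).sum()  # 3 overlapping voxels
dice = 2 * intersection / (reference.sum() + prediction.sum())  # 2*3 / (4+3)

# volume similarity = 1 - |V_pred - V_ref| / (V_pred + V_ref): ignores spatial overlap
vs = 1 - abs(int(prediction.sum()) - int(reference.sum())) / (prediction.sum() + reference.sum())

print(f'Dice: {dice:.3f}, Volume similarity: {vs:.3f}')
```

Note that the volume similarity only compares the segmented volumes, so a prediction shifted away from the reference can still score well on it while the Dice coefficient drops.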
Tip
This example is available as Jupyter notebook at ./examples/evaluation/basic.ipynb and Python script at ./examples/evaluation/basic.py.
Note
To be able to run this example:
Get the example data by executing ./examples/example-data/pull_example_data.py.
Install pandas (pip install pandas).
Import the required modules.
[1]:
import glob
import os
import numpy as np
import pymia.evaluation.metric as metric
import pymia.evaluation.evaluator as eval_
import pymia.evaluation.writer as writer
import SimpleITK as sitk
Define the paths to the data and the result CSV files.
[2]:
data_dir = '../example-data'
result_file = '../example-data/results.csv'
result_summary_file = '../example-data/results_summary.csv'
Let us create a list with the three metrics: the Dice coefficient, the Hausdorff distance, and the volume similarity. Note that we are interested in the outlier-robust 95th-percentile Hausdorff distance and, therefore, pass the percentile as an argument and adapt the metric’s name.
[3]:
metrics = [metric.DiceCoefficient(), metric.HausdorffDistance(percentile=95, metric='HDRFDST95'), metric.VolumeSimilarity()]
Now, we need to define the labels we want to evaluate. In the provided example data, we have five labels for different brain structures. Here, we are only interested in three of them: white matter, grey matter, and the thalamus.
[4]:
labels = {1: 'WHITEMATTER',
          2: 'GREYMATTER',
          5: 'THALAMUS'
          }
Finally, we can initialize an evaluator with the metrics and labels.
[5]:
evaluator = eval_.SegmentationEvaluator(metrics, labels)
We can now loop over the subjects of the example data. We load the ground truth image as the reference. An artificial segmentation (prediction) is created by eroding the ground truth. Both images and the subject identifier are passed to the evaluator.
[6]:
# get subjects to evaluate
subject_dirs = [subject for subject in glob.glob(os.path.join(data_dir, '*')) if os.path.isdir(subject) and os.path.basename(subject).startswith('Subject')]

for subject_dir in subject_dirs:
    subject_id = os.path.basename(subject_dir)
    print(f'Evaluating {subject_id}...')

    # load ground truth image and create artificial prediction by erosion
    ground_truth = sitk.ReadImage(os.path.join(subject_dir, f'{subject_id}_GT.mha'))
    prediction = ground_truth
    for label_val in labels.keys():
        # erode each label we are going to evaluate
        prediction = sitk.BinaryErode(prediction, [1] * prediction.GetDimension(), sitk.sitkBall, 0, label_val)

    # evaluate the "prediction" against the ground truth
    evaluator.evaluate(prediction, ground_truth, subject_id)
Evaluating Subject_2...
Evaluating Subject_4...
Evaluating Subject_3...
Evaluating Subject_1...
After evaluating all subjects, we can use a CSV writer to write the evaluation results to a CSV file.
[7]:
writer.CSVWriter(result_file).write(evaluator.results)
Further, we can use a console writer to display the results in the console.
[8]:
print('\nSubject-wise results...')
writer.ConsoleWriter().write(evaluator.results)
Subject-wise results...
SUBJECT LABEL DICE HDRFDST95 VOLSMTY
Subject_1 GREYMATTER 0.313 9.165 0.313
Subject_1 THALAMUS 0.752 2.000 0.752
Subject_1 WHITEMATTER 0.642 6.708 0.642
Subject_2 GREYMATTER 0.298 10.863 0.298
Subject_2 THALAMUS 0.768 2.000 0.768
Subject_2 WHITEMATTER 0.654 6.000 0.654
Subject_3 GREYMATTER 0.287 8.718 0.287
Subject_3 THALAMUS 0.761 2.000 0.761
Subject_3 WHITEMATTER 0.641 6.164 0.641
Subject_4 GREYMATTER 0.259 8.660 0.259
Subject_4 THALAMUS 0.781 2.000 0.781
Subject_4 WHITEMATTER 0.649 6.000 0.649
We can also report statistics such as the mean and standard deviation among all subjects using dedicated statistics writers. Note that you can pass any functions that take a list of floats and return a scalar value to the writers. Again, we will write a CSV file and display the results in the console.
[9]:
functions = {'MEAN': np.mean, 'STD': np.std}
writer.CSVStatisticsWriter(result_summary_file, functions=functions).write(evaluator.results)
print('\nAggregated statistic results...')
writer.ConsoleStatisticsWriter(functions=functions).write(evaluator.results)
Aggregated statistic results...
LABEL METRIC STATISTIC VALUE
GREYMATTER DICE MEAN 0.289
GREYMATTER DICE STD 0.020
GREYMATTER HDRFDST95 MEAN 9.351
GREYMATTER HDRFDST95 STD 0.894
GREYMATTER VOLSMTY MEAN 0.289
GREYMATTER VOLSMTY STD 0.020
THALAMUS DICE MEAN 0.766
THALAMUS DICE STD 0.010
THALAMUS HDRFDST95 MEAN 2.000
THALAMUS HDRFDST95 STD 0.000
THALAMUS VOLSMTY MEAN 0.766
THALAMUS VOLSMTY STD 0.010
WHITEMATTER DICE MEAN 0.647
WHITEMATTER DICE STD 0.005
WHITEMATTER HDRFDST95 MEAN 6.218
WHITEMATTER HDRFDST95 STD 0.291
WHITEMATTER VOLSMTY MEAN 0.647
WHITEMATTER VOLSMTY STD 0.005
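As noted above, the statistics writers accept any callable that maps a list of floats to a scalar. As an illustrative sketch (the `iqr` helper below is not part of pymia), more robust statistics such as the median and the interquartile range could be added like this:

```python
import numpy as np

def iqr(values):
    """Interquartile range (75th minus 25th percentile) of a list of values."""
    q75, q25 = np.percentile(values, [75, 25])
    return q75 - q25

# such a dictionary could then be passed to the statistics writers, e.g.:
# writer.ConsoleStatisticsWriter(functions=functions).write(evaluator.results)
functions = {'MEDIAN': np.median, 'IQR': iqr}

print(iqr([1.0, 2.0, 3.0, 4.0]))
```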
Finally, we clear the results in the evaluator so that it is ready for the next evaluation.
[10]:
evaluator.clear()
Now, let us have a look at the saved result CSV file.
[11]:
import pandas as pd
pd.read_csv(result_file, sep=';')
[11]:
|    | SUBJECT   | LABEL       | DICE     | HDRFDST95 | VOLSMTY  |
|----|-----------|-------------|----------|-----------|----------|
| 0  | Subject_1 | GREYMATTER  | 0.313373 | 9.165151  | 0.313373 |
| 1  | Subject_1 | THALAMUS    | 0.752252 | 2.000000  | 0.752252 |
| 2  | Subject_1 | WHITEMATTER | 0.642021 | 6.708204  | 0.642021 |
| 3  | Subject_2 | GREYMATTER  | 0.298358 | 10.862780 | 0.298358 |
| 4  | Subject_2 | THALAMUS    | 0.768488 | 2.000000  | 0.768488 |
| 5  | Subject_2 | WHITEMATTER | 0.654239 | 6.000000  | 0.654239 |
| 6  | Subject_3 | GREYMATTER  | 0.287460 | 8.717798  | 0.287460 |
| 7  | Subject_3 | THALAMUS    | 0.760978 | 2.000000  | 0.760978 |
| 8  | Subject_3 | WHITEMATTER | 0.641251 | 6.164414  | 0.641251 |
| 9  | Subject_4 | GREYMATTER  | 0.258504 | 8.660254  | 0.258504 |
| 10 | Subject_4 | THALAMUS    | 0.780754 | 2.000000  | 0.780754 |
| 11 | Subject_4 | WHITEMATTER | 0.649203 | 6.000000  | 0.649203 |
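As mentioned at the beginning, the CSV files lend themselves to further analysis. A small pandas sketch (with a few rows hard-coded in the format of the result file above, so it runs standalone) aggregating the Dice coefficient per label:

```python
import pandas as pd

# a few rows in the format of results.csv (values from the output above);
# in practice one would use df = pd.read_csv(result_file, sep=';')
df = pd.DataFrame({
    'SUBJECT': ['Subject_1', 'Subject_1', 'Subject_2', 'Subject_2'],
    'LABEL': ['THALAMUS', 'WHITEMATTER', 'THALAMUS', 'WHITEMATTER'],
    'DICE': [0.752252, 0.642021, 0.768488, 0.654239],
})

# mean Dice coefficient per label across subjects
print(df.groupby('LABEL')['DICE'].mean())
```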
And also at the saved statistics CSV file.
[12]:
pd.read_csv(result_summary_file, sep=';')
[12]:
|    | LABEL       | METRIC    | STATISTIC | VALUE    |
|----|-------------|-----------|-----------|----------|
| 0  | GREYMATTER  | DICE      | MEAN      | 0.289424 |
| 1  | GREYMATTER  | DICE      | STD       | 0.020083 |
| 2  | GREYMATTER  | HDRFDST95 | MEAN      | 9.351496 |
| 3  | GREYMATTER  | HDRFDST95 | STD       | 0.894161 |
| 4  | GREYMATTER  | VOLSMTY   | MEAN      | 0.289424 |
| 5  | GREYMATTER  | VOLSMTY   | STD       | 0.020083 |
| 6  | THALAMUS    | DICE      | MEAN      | 0.765618 |
| 7  | THALAMUS    | DICE      | STD       | 0.010458 |
| 8  | THALAMUS    | HDRFDST95 | MEAN      | 2.000000 |
| 9  | THALAMUS    | HDRFDST95 | STD       | 0.000000 |
| 10 | THALAMUS    | VOLSMTY   | MEAN      | 0.765618 |
| 11 | THALAMUS    | VOLSMTY   | STD       | 0.010458 |
| 12 | WHITEMATTER | DICE      | MEAN      | 0.646678 |
| 13 | WHITEMATTER | DICE      | STD       | 0.005355 |
| 14 | WHITEMATTER | HDRFDST95 | MEAN      | 6.218154 |
| 15 | WHITEMATTER | HDRFDST95 | STD       | 0.290783 |
| 16 | WHITEMATTER | VOLSMTY   | MEAN      | 0.646678 |
| 17 | WHITEMATTER | VOLSMTY   | STD       | 0.005355 |