{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {
    "pycharm": {
     "name": "#%% raw\n"
    },
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _example-data1:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Creation of a dataset\n",
    "=====================\n",
    "\n",
    "This example shows how to use the `pymia.data` package to create a HDF5 (hierarchical data format version 5) dataset. All examples follow the use case of medical image segmentation of brain tissues, see [Examples](examples.rst) for an introduction into the data.\n",
    "Therefore, we create a dataset with the four subjects and their data: a T1-weighted MR image, a T2-weighted  MR image, a label image (ground truth, GT), and a mask image, as well as demographic information age, grade point average (GPA), and gender.\n",
    "\n",
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "Tip\n",
    "\n",
    "This example is available as Jupyter notebook at [./examples/data/creation.ipynb](https://github.com/rundherum/pymia/blob/master/examples/data/creation.ipynb) and Python script at [./examples/data/creation.py](https://github.com/rundherum/pymia/blob/master/examples/data/creation.py).\n",
    "\n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "Note\n",
    "\n",
    "To be able to run this example:\n",
    "\n",
    "- Get the example data by executing [./examples/example-data/pull_example_data.py](https://github.com/rundherum/pymia/blob/master/examples/example-data/pull_example_data.py).\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the required modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import enum\n",
    "import glob\n",
    "import os\n",
    "import typing\n",
    "\n",
    "import SimpleITK as sitk\n",
    "import numpy as np\n",
    "\n",
    "import pymia.data as data\n",
    "import pymia.data.conversion as conv\n",
    "import pymia.data.definition as defs\n",
    "import pymia.data.creation as crt\n",
    "import pymia.data.transformation as tfm\n",
    "import pymia.data.creation.fileloader as file_load"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us first define an enumeration with the data we will write to the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "class FileTypes(enum.Enum):\n",
    "    T1 = 1  # The T1-weighted MR image\n",
    "    T2 = 2  # The T2-weighted MR image\n",
    "    GT = 3  # The label (ground truth) image\n",
    "    MASK = 4  # The foreground mask\n",
    "    AGE = 5  # The age\n",
    "    GPA = 6  # The GPA\n",
    "    GENDER = 7  # The gender"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we define a subject. Each subject will have two structural MR images (T1w, T2w), one label image (ground truth), a mask, two numericals (age and GPA), and the gender (a character \"m\" or \"w\")."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "class Subject(data.SubjectFile):\n",
    "\n",
    "    def __init__(self, subject: str, files: dict):\n",
    "        super().__init__(subject,\n",
    "                         images={FileTypes.T1.name: files[FileTypes.T1], FileTypes.T2.name: files[FileTypes.T2]},\n",
    "                         labels={FileTypes.GT.name: files[FileTypes.GT]},\n",
    "                         mask={FileTypes.MASK.name: files[FileTypes.MASK]},\n",
    "                         numerical={FileTypes.AGE.name: files[FileTypes.AGE], FileTypes.GPA.name: files[FileTypes.GPA]},\n",
    "                         gender={FileTypes.GENDER.name: files[FileTypes.GENDER]})\n",
    "        self.subject_path = files.get(subject, '')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now collect the subjects, and initialize a `Subject` holding paths to each of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "data_dir = '../example-data'\n",
    "\n",
    "# get subjects\n",
    "subject_dirs = [subject_dir for subject_dir in glob.glob(os.path.join(data_dir, '*')) if os.path.isdir(subject_dir) and os.path.basename(subject_dir).startswith('Subject')]\n",
    "sorted(subject_dirs)\n",
    "\n",
    "# the keys of the data to write to the dataset\n",
    "keys = [FileTypes.T1, FileTypes.T2, FileTypes.GT, FileTypes.MASK, FileTypes.AGE, FileTypes.GPA, FileTypes.GENDER]\n",
    "\n",
    "subjects = []\n",
    "# for each subject on file system, initialize a Subject object\n",
    "for subject_dir in subject_dirs:\n",
    "    id_ = os.path.basename(subject_dir)\n",
    "\n",
    "    file_dict = {id_: subject_dir}  # init dict with id_ pointing to the path of the subject\n",
    "    for file_key in keys:\n",
    "        if file_key == FileTypes.T1:\n",
    "            file_name = f'{id_}_T1.mha'\n",
    "        elif file_key == FileTypes.T2:\n",
    "            file_name = f'{id_}_T2.mha'\n",
    "        elif file_key == FileTypes.GT:\n",
    "            file_name = f'{id_}_GT.mha'\n",
    "        elif file_key == FileTypes.MASK:\n",
    "            file_name = f'{id_}_MASK.nii.gz'\n",
    "        elif file_key == FileTypes.AGE or file_key == FileTypes.GPA or file_key == FileTypes.GENDER:\n",
    "            file_name = f'{id_}_demographic.txt'\n",
    "        else:\n",
    "            raise ValueError('Unknown key')\n",
    "\n",
    "        file_dict[file_key] = os.path.join(subject_dir, file_name)\n",
    "\n",
    "    subjects.append(Subject(id_, file_dict))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we define a `LoadData` class. We load the structural MR images (T1w and T2w) as float and the other images as int. The age, GPA, and gender are loaded from the text file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "class LoadData(file_load.Load):\n",
    "\n",
    "    def __call__(self, file_name: str, id_: str, category: str, subject_id: str) -> \\\n",
    "            typing.Tuple[np.ndarray, typing.Union[conv.ImageProperties, None]]:\n",
    "        if id_ == FileTypes.AGE.name:\n",
    "            with open(file_name, 'r') as f:\n",
    "                value = np.asarray([int(f.readline().split(':')[1].strip())])\n",
    "                return value, None\n",
    "        if id_ == FileTypes.GPA.name:\n",
    "            with open(file_name, 'r') as f:\n",
    "                value = np.asarray([float(f.readlines()[1].split(':')[1].strip())])\n",
    "                return value, None\n",
    "        if id_ == FileTypes.GENDER.name:\n",
    "            with open(file_name, 'r') as f:\n",
    "                value = np.array(f.readlines()[2].split(':')[1].strip())\n",
    "                return value, None\n",
    "\n",
    "        if category == defs.KEY_IMAGES:\n",
    "            img = sitk.ReadImage(file_name, sitk.sitkFloat32)\n",
    "        else:\n",
    "            # this is the ground truth (defs.KEY_LABELS) and mask, which will be loaded as unsigned integer\n",
    "            img = sitk.ReadImage(file_name, sitk.sitkUInt8)\n",
    "\n",
    "        # return both the image intensities as np.ndarray and the properties of the image\n",
    "        return sitk.GetArrayFromImage(img), conv.ImageProperties(img)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can use a writer to create the HDF5 dataset by passing the list of `Subject`s and the `LoadData` to a `Traverser`. For the structural MR images, we also apply an intensity normalization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start dataset creation\n",
      "[1/4] Subject_1\n",
      "[2/4] Subject_2\n",
      "[3/4] Subject_3\n",
      "[4/4] Subject_4\n",
      "dataset creation finished\n"
     ]
    }
   ],
   "source": [
    "hdf_file = '../example-data/example-dataset.h5'\n",
    "\n",
    "# remove the \"old\" dataset if it exists\n",
    "if os.path.exists(hdf_file):\n",
    "    os.remove(hdf_file)\n",
    "\n",
    "with crt.get_writer(hdf_file) as writer:\n",
    "    # initialize the callbacks that will actually write the data to the dataset file\n",
    "    callbacks = crt.get_default_callbacks(writer)\n",
    "\n",
    "    # add a transform to normalize the structural MR images\n",
    "    transform = tfm.IntensityNormalization(loop_axis=3, entries=(defs.KEY_IMAGES, ))\n",
    "\n",
    "    # run through the subject files (loads them, applies transformations, and calls the callback for writing them)\n",
    "    traverser = crt.Traverser()\n",
    "    traverser.traverse(subjects, callback=callbacks, load=LoadData(), transform=transform)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "This should now have created a `example-dataset.h5` in the directory `./examples/example-data`. By using a HDF5 viewer like [HDF Compass](https://support.hdfgroup.org/projects/compass/) or [HDFView](https://www.hdfgroup.org/downloads/hdfview/), we can inspect the dataset. It should look similar to the figure below.\n",
    "\n",
    "![The dataset structure.](./images/fig-dataset.png)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}