The overlap metric is a stricter way of comparing two segmentations than comparing their volumes, since two segmentations can report the same volume for a class while labeling different voxels. For a given class, it is defined as the number of voxels assigned to that class by both segmentations divided by the number of voxels assigned to that class by either segmentation. This is the same as the Tanimoto coefficient (see Pattern Classification and Scene Analysis by Duda and Hart, 1973, p. 216).
This metric approaches a value of 1.0 for results that are very similar and is 0.0 when they share no similarly classified voxels.
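As an illustration of the definition above, here is a minimal Python sketch that computes the overlap for one class. It is not the code used to produce the results on this page. The function name overlap, the integer label codes, and the 8-voxel toy arrays are all assumptions made for the example, and each segmentation is assumed to be a flattened list holding one class label per voxel.

```python
from typing import Hashable, Sequence


def overlap(seg_a: Sequence[Hashable], seg_b: Sequence[Hashable], label: Hashable) -> float:
    """Tanimoto overlap for one class label between two flattened segmentations."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segmentations must cover the same voxels")
    both = 0    # voxels given `label` by both segmentations
    either = 0  # voxels given `label` by at least one segmentation
    for a, b in zip(seg_a, seg_b):
        in_a = a == label
        in_b = b == label
        if in_a and in_b:
            both += 1
        if in_a or in_b:
            either += 1
    if either == 0:
        # Neither segmentation uses this label, so the ratio is undefined.
        raise ValueError("label absent from both segmentations")
    return both / either


# Toy 8-voxel example with made-up labels (not IBSR data).
GRAY, WHITE, OTHER = 1, 2, 0
manual = [GRAY, GRAY, GRAY, WHITE, WHITE, OTHER, OTHER, OTHER]
auto = [GRAY, GRAY, WHITE, WHITE, OTHER, OTHER, OTHER, GRAY]

gray_overlap = overlap(manual, auto, GRAY)    # 2 shared / 4 in either = 0.5
white_overlap = overlap(manual, auto, WHITE)  # 1 shared / 3 in either
```

In the toy example both segmentations label three voxels as gray, so a volume comparison would report perfect agreement, while the overlap is only 0.5 because one of those voxels differs.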
The following results are from work done by Jagath C. Rajapakse and are partially based on the method described in: Rajapakse JC and Kruggel F, Segmentation of MR Images with Intensity Inhomogeneities, Image and Vision Computing, 1998, In press. The data sets used were the 20 normal subjects (brain-only MR data files), which are available from the IBSR along with their manual segmentations.
Average overlap between manually-guided segmentations and various methods for 20 brain scans

gray    white   method
-----   -----   ------------------------------------
0.564   0.567   adaptive MAP
0.558   0.562   biased MAP
0.473   0.567   fuzzy c-means
0.550   0.554   Maximum Aposteriori Probability (MAP)
0.535   0.551   Maximum-Likelihood
0.477   0.571   tree-structure k-means
0.876   0.832   Manual (4 brains averaged over 2 experts)
More Details
Also available are the average overlap numbers
for the background, CSF, gray and white regions separately for each method and
each scan.
The graphs above show the overlap scores for each of the 20 brains. Scores have been multiplied by 1000. The brain scans have been roughly ordered by how difficult they are to segment. The line labeled "expert" is the average overlap between two expert operators who segmented the same four brain scans for a study using different data. It is included to give a sense of the overlap level that has been found acceptable for volumetric studies.
The 20 coronal brain scans used to generate these results were chosen because they have been used in published volumetric studies in the past and because they have various levels of difficulty. The worst ones have low contrast and relatively large intensity gradients. More recently acquired (i.e. better quality) data should result in far better overlap scores for the automated methods.