Identifying Seafloor Environments through Texture Recognition

Nick Loomis (MIT/WHOI)
Fall 2008, MAS622j Final Project

Introduction and Motivation

The ocean remains one of our closest mysteries: we know more about the lunar surface than we do about the ocean floor. Some methods, such as sonar, can operate over large areas from far away, providing rough depth estimates from surface ships. However, while depth maps can be useful for large-scale geological formations, they rarely contain the fine-scale biological information necessary to distinguish various types of habitats or estimate species populations.

Biological information is immediately useful for ocean biologists, chemists, and ecologists. It also has secondary uses in fisheries, since many crustaceans, mollusks, and certain commercially valuable fish use the seafloor as protection during the larval stages of their lives -- if not their entire lives. Finally, information about reef and coral ecology, along with the species inhabiting these areas, can provide sensitive information about the changing temperature and chemical balances in different parts of the ocean, a metric of global health.

The greatest hurdle has been creating instruments which can map out large areas of the seafloor with geo-referenced positions while withstanding immense pressure, carrying their own power and propulsion, and overcoming deep-level currents. As mentioned earlier, sonar is often a rough measure, and can have difficulty distinguishing certain environments; its resolution is limited by the wavelength of the sound, and thus it also cannot measure the fine details of seafloor-dwelling animals. Imaging solutions have the ability to return high resolution maps, and additional views can even return 3D maps over large swaths [1].

Imaging methods have their own associated challenges: an instrument must carry its own light source, color becomes highly attenuated in seawater and intelligent post-processing is required to correct intensity and color, and instruments return with vast amounts of data. The first of these problems has been solved with engineering power in the SeaBED instrument, an autonomous underwater vehicle (AUV) [2,3]. The SeaBED dives to depth, then cruises within a few meters of the seafloor, capturing several thousand images to an internal storage device. The images are retrieved when the AUV resurfaces. Human experts then analyze the images, searching out animals and interesting features, a process which takes several weeks or months per dive.

The goal of this project is to assist the human experts by automatically classifying the environments in the SeaBED images. Quantitative data on environments has been limited due to the extreme amount of work required to mask and label different image sections, a task which should be much easier for a computer. Previous work in this field is also limited by the methods used. For example, in 2007, an automated clustering method was used to identify textures, which were then post-associated by a human oracle with their apparent environments [4]. The results show several clusters describing the same type of environment, and some clusters describing mixes of different environments, suggesting that the clusters were of marginal descriptive utility. It seems reasonable, especially given the state of pattern recognition, that we should be able to do better: classify useful and biologically interesting environments using labeled data. The end goal, then, is to start with a collection of labeled environments and create a classifier which can be used to estimate environmental/habitat data for the remaining SeaBED imagery.

Dataset description

The dataset for this project comes from Hanumant Singh's research group at the Woods Hole Oceanographic Institution: around 10 GB of images acquired during a sequence of 13 dives of a SeaBED AUV. From these images, 830 sample sub-images of consistent texture blocks were extracted. After an initial test period, these were reduced to 631 sub-images of textures which could be classified into recognizable categories with minimal confusion (from the human oracle's perspective). Examples of the six distinct categories are shown in Figure 1. Of these, there is still a certain amount of acceptable confusion: boulders and reefs differ minimally, some sand contains enough small rocks that it could be considered rubble-like, and the distinction between a rubble field and a field of small boulders is a matter of opinion. Misclassifying these categories in the real data is less problematic than classifying "Mud" as "Rocks", as these provide distinctly different types of protection and only support certain animals.

Figure 1: Sample textures and classes used for training classifiers.

The textures in the dataset have a significant range of sizes and orientations, though there is a bias for upward-looking shadows due to the fixed-position strobe light on the AUV. The textures can also be mixed: sand sometimes coats the tops of large rocks or fills between rubble, for example. There is also a fine line between rocks and rubble, and between "reefs" and rocks.

Unlabeled Clustering Methods

Earlier work used unlabeled clustering methods to estimate seafloor habitat coverage. In [4], the authors mention that they used filterbanks similar to those of Varma and Zisserman (in [5]) to produce a set of features for each pixel. They then used some form of EM clustering (not specified in their paper) to find six categories. The image areas corresponding to each cluster were then analyzed by a human expert to pick out the salient environmental type, associating a label with each cluster. As mentioned earlier, these labels appear to be somewhat less than useful: there were three different clusters of hard corals, for example, out of six suggested clusters. While the authors considered this clustering to be helpful, I wanted to estimate how useful it would actually be for the SeaBED images -- whether clustering matched the labels chosen based on a priori biological information.

Using the labeled dataset, I used filtered features computed for the VZ method (discussed below) with various clustering methods: Gaussian mixtures [C1], mean-shift clustering [C2], k-means [C3], k-centers [C4], hierarchical trees [C5], and affinity propagation [6,C6]. (Ref [7] gives another viewpoint of mean-shift which is more intuitive, and may be useful for relating mean-shift to trees.) Each method was used to find clusters within the filtered data, and each cluster was given a label based on the greatest human-labeled-class membership it contained. (The number of samples for each class was equal, so the class priors were equal in this case.) Any ties were broken by arbitrarily choosing the lowest-numbered class. A confusion matrix and classification rates for each class and over all the samples were computed. The number of clusters returned for each method was varied so as to both under- and over-fit the data.

The similarity measure for GMM, k-means, hierarchical trees, and mean-shift used the Euclidean distance between filtered points. Affinity propagation and k-centers used a negative squared Euclidean distance. (I limited the amount of experimentation with various distance metrics, feature-space rescalings, and feature-space decompositions. While these could be useful, I doubt they are worth the effort for this dataset, and they are not a major educational goal for this project.) For the GMM, the appropriate cluster was selected as the maximum a posteriori estimate of mixture membership.

The overall recognition rate was computed as the number of true positives divided by the number of total samples. Each class was given an equal number of samples for these experiments, chosen randomly from the available pre-computed filtered responses.
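
As a concrete sketch, the labeling-by-majority evaluation looks roughly like the following Matlab fragment. The variable names (X for the filtered features, y for the human labels) and the cluster count are placeholders, and kmeans stands in for any of the clustering codes listed above:

    % Assign each cluster the majority human label, then score the result.
    % X holds one filtered feature vector per row; y holds labels 1..6.
    nClusters = 12;                        % varied to under- and over-fit
    idx = kmeans(X, nClusters);            % cluster index for each sample
    clusterLabel = zeros(nClusters, 1);
    for c = 1:nClusters
        clusterLabel(c) = mode(y(idx == c));   % majority label; mode()
    end                                        % breaks ties at the lowest class
    yhat = clusterLabel(idx);              % cluster membership -> class label
    overallRate = mean(yhat == y);         % true positives / total samples
    C = confusionmat(y, yhat);             % per-class confusion matrix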

The cluster-to-classification results are summarized in Figure 2. The GMMs performed reasonably well at representing smaller chunks of data which could be associated with labels, improving in performance as the number of clusters went above the number of human-labeled classes. They plateaued at around 70%, not strongly overfitting as the number of clusters increased from 10 to 30 ("high-order GMMs"). Mean-shift clustering gave nearly perfect results, but only due to strong overfitting: it generated 2314 clusters for the smallest tested bandwidth, nearly one cluster per sample point. Larger mean-shift bandwidths returned fewer clusters and poorer label-clustering results. Both k-means and k-centers gave similar results, not surprisingly, and tended to be poor at representing the entire filter space. This could be an indication that the space is highly anisotropic, making these isotropic methods ineffective. The hierarchical trees have nearly the same response for all high-level clustering divisions, indicating that the highest-level branches all split off to different parts of the same subspace; a theoretical example of how this could occur is shown in Figure 3. Affinity propagation, the slowest of the methods (when run in Matlab with 3000 samples), provided results similar to k-means and k-centers.

Figure 2: Classification using cluster memberships found using unlabeled clustering methods.

Figure 3: One explanation for why the hierarchical trees returned the same class label for the first branches, and thus poor recognition rates: the first sub-branches could have been dominated by one particular class. Here, three of four branches are "Class 3", suggesting a poor similarity within the class.

Good fits from the GMMs suggest that the data can be fitted; poor responses from k-means, mean-shift, affinity propagation (with a Euclidean similarity metric), and the other isotropic methods suggest that the filtered responses are distributed anisotropically through the filter subspace: a highly skewed dataset is harder to represent with Euclidean spheres than with the GMM's deformable Gaussian ellipsoids. Given enough time trying different transformations on the filter space, these rates could probably be improved slightly. However, I don't see this as a good method for distinguishing environment types, especially when the class margin is narrow.

Another important point is that these methods classify each individual pixel. Thus, the filterbank responses centered at each pixel should be similar throughout a class. This requires, in general, large filters and space-invariant filtering -- and textures which are self-similar across the scale of the filter or smaller. With the SeaBED textures, it is unlikely that any particular filter will have a consistent response across the dataset of a single class due to natural variations in size and texture coverage. It would intuitively be better to group together several responses from nearby pixels to "average out" some of the natural variation. This can be done by aggregating certain filter responses or by looking at the distribution of filter responses, something which the next method seeks to accomplish.

Varma-Zisserman Texture Recognition

The original Varma-Zisserman method (VZ) takes a slightly different approach: instead of looking at the filter responses centered at each pixel, it looks at the statistical distribution of different types of responses [5,8]. During the first step of the training phase, a series of filter responses are computed on the training images (Figure 4). For each class, k representative filter responses are chosen, using k-means or another clustering method. These are known as "textons", texture descriptors which describe each particular sample reasonably well on their own. The texton clusters are recorded and aggregated across all the labeled classes to create what Varma and Zisserman term a texton dictionary. In the second training step, the filterbank response for each pixel is compared against the entire texton dictionary and the nearest texton (considered the best descriptor for that pixel's filter response) is recorded (Figure 5). The frequency with which each texton is chosen is recorded for each class, resulting in a histogram (or an estimated PDF) of that texton's use in describing the class.

Figure 4: The first stage of the VZ method involves finding representative textons for each class.

Figure 5: The second training stage involves using the textons found in the first stage to calculate how often each texton appears in images of each class, building a single model per class.
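
A minimal Matlab sketch of the two training stages (for the single-model variant discussed later) is below. It assumes F{i} holds the per-pixel filter responses of training image i as rows, labels(i) gives its class, and k is the number of textons per class; pdist2 is from the Statistics Toolbox:

    % Stage 1: find k textons per class, then aggregate the dictionary.
    nClasses = 6;
    textons = [];
    for c = 1:nClasses
        Fc = cat(1, F{labels == c});       % all filter responses of class c
        [~, centers] = kmeans(Fc, k);      % k representative responses
        textons = [textons; centers];      % append to the texton dictionary
    end
    nTextons = size(textons, 1);           % = k * nClasses

    % Stage 2: count how often each texton is the nearest descriptor.
    model = zeros(nClasses, nTextons);
    for i = 1:numel(F)
        [~, nearest] = min(pdist2(F{i}, textons), [], 2);
        h = histc(nearest, 1:nTextons)';   % texton usage for this image
        model(labels(i), :) = model(labels(i), :) + h;
    end
    model = bsxfun(@rdivide, model, sum(model, 2));  % class texton PDFs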

The testing phase of VZ is similar to the second stage: a test image is passed through the filterbank (Figure 6). At each pixel, the nearest texton is recorded, and a histogram of texton use is created for the entire image. The texton distribution is then compared against the class texton distributions (using a Chi-squared histogram-distance measure), and the nearest class histogram is selected as the label.

Figure 6: Classification with VZ uses the textons and models from the training stages. A histogram of textons is computed for the new image and compared against the known class models.
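
Reusing the textons and model arrays from the training sketch above, the classification step reduces to a few lines (Ftest is the filter response matrix of the new image):

    % Build the test image's texton histogram.
    [~, nearest] = min(pdist2(Ftest, textons), [], 2);
    h = histc(nearest, 1:size(textons, 1))';
    h = h / sum(h);
    % Chi-squared distance to each class model; the nearest model wins.
    chi2 = zeros(size(model, 1), 1);
    for c = 1:size(model, 1)
        num = (h - model(c, :)).^2;
        den = h + model(c, :);
        chi2(c) = 0.5 * sum(num(den > 0) ./ den(den > 0));
    end
    [~, label] = min(chi2);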

The benefit of VZ is that it looks at the distribution of responses throughout the image, and is thus more tolerant to noise. It requires a reasonable choice of filterbank and a good selection of training samples, similar to almost any other supervised machine learning method. Excellent results have been obtained using VZ on large databases, and it provides a strong starting point.

Several variants of VZ have been proposed. The original VZ was designed for grayscale images, though with careful filters and processing, it can be extended to represent color images as well [9]. (Much of Ref [9] is dedicated to balancing color detail with color invariance, and it may not be necessary if the images have had proper color adjustments already applied.) Other variants, proposed by Varma and Zisserman themselves, include using direct samples of the images as features and incorporating a Markov random field (MRF) representation into the feature space [8]. These last two variants will be discussed in following sections, after reviewing results using the original filterbank concept.

VZ with Filters

Varma and Zisserman first developed their algorithm to work with filtered images, yielding a higher-dimensional feature vector for each pixel [5]. They tried several different types of filters, finding the best recognition rates for the MR8 filter set. (MR8 consists of an edge filter and a bar filter, each at three size scales and six orientations, along with a Gaussian and a Laplacian of Gaussian filter. The Maximum Response across the different orientations is selected as the appropriate filtered value for each scale, yielding eight responses in total.)

I started by looking at a variety of filterbanks, including MR8, and a series of different image filters. MR8 [C7] and Gabor filters [C8] (taking the maximum response of the Gabor magnitudes) gave only moderately discriminable responses, possibly because of the size mismatch between filter and texture. Grayscale moments and Hu invariants (personal code for sliding-window moments) also gave limited discriminability, especially since different orders had vastly different magnitude scales, even after log transforms. In the end, a set of statistical filters describing the central distribution moments, the local color, and the local entropy was selected for the SeaBED images. Local entropy was scaled down by a factor of 10.0 to have a magnitude similar to the other feature values.
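
The exact window sizes and moment orders are not reproduced here, but the flavor of the statistical filterbank can be sketched in Matlab as follows; the 15x15 window is an assumption for illustration, and stdfilt and entropyfilt are Image Processing Toolbox functions:

    % Local statistics per pixel: mean, standard deviation, entropy
    % (rescaled), and local mean color.
    rgb   = im2double(im);
    gray  = rgb2gray(rgb);
    nhood = true(15);                          % 15x15 neighborhood
    box   = ones(15) / 15^2;                   % averaging kernel
    mu  = imfilter(gray, box, 'symmetric');    % local mean
    sig = stdfilt(gray, nhood);                % local standard deviation
    ent = entropyfilt(gray, nhood) / 10.0;     % local entropy, rescaled
    col = imfilter(rgb, box, 'symmetric');     % local mean color (3 channels)
    feats = cat(3, mu, sig, ent, col);         % six features per pixel
    X = reshape(feats, [], size(feats, 3));    % rows = pixels, cols = features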

When choosing textons, two different clustering methods were tried: a standard k-means (with Euclidean distance) and affinity propagation (with similarity defined by the negative squared Euclidean distance). While the selected textons were slightly different, the end classification was nearly identical. This suggests that there are a larger number of reasonable choices for textons, so long as they represent the feature space "well". For the sake of computational speed, k-means was used throughout the rest of the experiments.

Two approaches for creating the texton distribution model for each class were tried and are compared throughout the rest of the experiments. In the first, termed the "Single Model" approach, the class texton distribution was estimated from the aggregated texton usage on twenty training images from each class. This averages together the texton distributions from each training image, and thus assumes that the distributions are nearly the same for every image in the class. While examining the images, however, there seemed to be several different types of "Rubble", for example, which should have the same label despite their disparate appearances. To accommodate multiple models per class, the texton histograms were computed for each of the same twenty training images, and the Chi-squared distance between pairs was used with affinity propagation to select between two and four exemplar distributions per category label. When classifying new textures, the image texton distribution is created, compared against all the exemplar distributions, and assigned the label of the nearest exemplar. This is termed the "Multi-model" approach. The number of exemplars was controlled by the affinity propagation code, which attempts to automatically choose an appropriate number of clusters to account for the observed variation. Its use also assumes that there is at least one true exemplar texton distribution (or, two to four in my case) within the twenty training texton distributions per class which would do an excellent job representing the entire class.
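
A sketch of the exemplar selection for one class follows, assuming H is the 20-by-nTextons matrix of training texton histograms and apcluster.m is the affinity propagation code from [C6]; the median-of-similarities preference is a common default, not necessarily the value used in the actual runs:

    % Pairwise negative chi-squared similarities between histograms.
    n = size(H, 1);
    S = zeros(n);
    for i = 1:n
        for j = 1:n
            num = (H(i,:) - H(j,:)).^2;
            den = H(i,:) + H(j,:);
            S(i,j) = -0.5 * sum(num(den > 0) ./ den(den > 0));
        end
    end
    pref = median(S(:));              % preference controls the cluster count
    idx = apcluster(S, pref);         % exemplar assigned to each histogram
    exemplars = H(unique(idx), :);    % the class's texton distribution models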

The overall true positive rate, defined as the total number of true positives divided by the number of images, is presented in Table 1 for the filtered VZ. The rates are reasonable. The classifier actually had trouble labeling boulders (part of the "Rocks" category), potentially because the nominal boulder size was larger than the local filters used in creating filter responses. Alternately, it could be that boulders share many similar characteristics with reefs and rubble (the two greatest classes of confusion during the experiments), suggesting that the lower-dimensional filter space used was insufficient to separate this category. Finally, for this set of experiments, the multi-model approach slightly improved recognition when small numbers of textons were available, and appeared to hurt the results when a larger texton dictionary was used. At this point, it is hard to generalize about single model versus multi-model; the question will be addressed later.

                         Single model   Multi-model
k = 10 textons/class         85.9%         86.7%
k = 30 textons/class         89.1%         84.6%

Table 1: Overall true positive rates for classification of test images.

VZ with Image Patches

One of the ideas that Varma and Zisserman present in [8] is using image patches directly as features -- which is essentially computing the matched filter responses. Small patches, ranging from 3x3 pixels and up, are pulled from around a central pixel and used for finding representative textons using k-means. In my implementation, since the color had been carefully corrected by the SeaBED team, I included the average RGB color values for the patch as three additional features.
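
A rough Matlab sketch of the patch feature extraction (p and the color-corrected image im are placeholders):

    % Pull p-by-p grayscale patches around every pixel and append the
    % patch's mean R, G, B values as three extra features.
    p = 3;
    rgb  = im2double(im);
    gray = rgb2gray(rgb);
    P = im2col(gray, [p p], 'sliding')';           % one patch per row
    meanRGB = zeros(size(P, 1), 3);
    for c = 1:3
        M = im2col(rgb(:, :, c), [p p], 'sliding');
        meanRGB(:, c) = mean(M, 1)';               % average color per patch
    end
    X = [P, meanRGB];                              % p*p + 3 features per pixel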

Recognition rates for square image patches of size p = {3,5,7,9,11} pixels per side and k = {10,30} textons per class were computed for both single and multiple model experiments and are listed in Table 2. In general, the additional textons per class improved recognition by fitting the images better. (For comparison, I had a maximum texton dictionary of 180 elements, while Varma and Zisserman had 610 elements.) As p increased, the patches became more specific: one 11x11 texton incorporated the image of a small starfish. (Sample textons are shown in Figure 8.) The best overall true positive rates came from smaller, more general patches.

                 Single model   Multi-model
p = 3,  k = 10       85.3%         88.0%
p = 3,  k = 30       87.5%         90.2% (best)
p = 5,  k = 10       85.4%         84.5%
p = 5,  k = 30       88.1%         83.8%
p = 7,  k = 10       84.6%         83.4%
p = 7,  k = 30       85.9%         82.4%
p = 9,  k = 10       79.7%         85.9%
p = 9,  k = 30       82.6%         85.1%
p = 11, k = 10       83.4%         78.1% (worst)
p = 11, k = 30       86.1%         83.7%

Table 2: Overall true positive rates for image patches. Patches are p x p pixels and k textons are used per class. The best and worst results are marked.

Varma and Zisserman suggest that image patches work as well as filters because they operate in higher-dimensional space, similar to kernel methods, and because they contain enough information to estimate gradients (and thus the frequency content). Their argument is that if one-dimensional gradients can be estimated using three- or five-point formulas with higher-order accuracy, then patches of even three and five pixels should contain the same information, albeit in a slightly different form. Thus, even small patches have information which extrapolates roughly to the entire image, and they can describe textures much larger than the patch.

Another way to view the same argument is that, if there are large features, we would expect a low spatial frequency content. The gradients would then be changing slowly, and patches with only very slight changes across their faces will appear much more frequently than patches with extremely high changes in intensity. The distribution of "gradient" patches allows us to estimate the spatial frequency content in the image.
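
The gradient half of the argument is easy to check numerically: the three-point central-difference formula f'(x) ~ (f(x+h) - f(x-h)) / (2h) recovers a derivative from 3-pixel neighborhoods with second-order accuracy.

    % Three pixels suffice to estimate a first derivative. (conv flips the
    % kernel, so [1 0 -1] applies the difference weights [-1 0 1]/(2h).)
    h = 0.1;
    x = 0:h:2*pi;
    f = sin(x);
    df = conv(f, [1 0 -1] / (2 * h), 'valid');     % central difference
    err = max(abs(df - cos(x(2:end-1))))           % ~2e-3, O(h^2) accuracy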

I think it is also instructive to look at the textons selected for various patch scales. Varma and Zisserman used 610 textons in their dictionary, many of which describe edges and bars of different orientations and positions within the patch, shown in Figure 7. Sample texton dictionaries for this project are shown in Figure 8. Smaller patches show some of the same edge and bar type patterns, along with several flatter patches for matching smaller gradients. Figure 9 shows a slightly different set of features: those selected by a boosting algorithm designed to locate man-made objects in images [10]. The boosting scheme is special in that it shares features common to multiple categories -- and it also tends towards bar- and edge-like patches. (Ref [10] also shows the difference between shared features and class-specific features, which contain fewer generic edge- and bar-like filters.) These are all selecting generic features as some of the best to describe images and objects. It is still important to note that, in both the VZ and the SeaBED textons, there are a handful of patches which are specific to the task and which would not have appeared in either the MR8 or my filters. This may be the crux of why patches perform better than filters: the patches find the optimum matched filters to describe each texture.

Figure 7: The patch-based texton dictionary used in [8].

Figure 8: The patch-based textons selected by k-means for SeaBED images.

Figure 9: Patch-based filters selected by a shared-feature boosting algorithm. Note the prevalence of bar- and edge-type generic filters. Filter images were ruthlessly copied from Ref [10].

It may help to note that filtering is a dot product operator between sub-sections of images and particular filters. Selecting the maximum filter response is equivalent to selecting the filter vector which best aligns with the image vector, and if the filters in a bank have the correct scaling, this gives the filter with the minimum Euclidean distance to the image vector. This is the same as finding the closest patch texton vector to the image vector, making filtering with maximum response selection an equivalent process to patch matching, further justifying the view of patches as special filters.
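
The equivalence is easy to demonstrate with random stand-ins for the patch and filters:

    % For equal-norm filters, the maximum dot-product response and the
    % minimum Euclidean distance select the same filter, since
    %   ||v - w||^2 = ||v||^2 + ||w||^2 - 2*(v . w).
    v = randn(9, 1);                               % image patch as a vector
    W = randn(5, 9);                               % five candidate filters
    W = bsxfun(@rdivide, W, sqrt(sum(W.^2, 2)));   % normalize to unit length
    [~, iMax] = max(W * v);                        % maximum filter response
    [~, iMin] = min(sum(bsxfun(@minus, W, v').^2, 2));  % nearest patch texton
    isequal(iMax, iMin)                            % true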

The major difference when using patches is that each response is selected independently; in VZ with standard filtering, the feature vectors are the linked responses of several different filters evaluated at the same point. Selecting a particular filtered texton then requires that the particular filter responses always appear together. Patches, on the other hand, are independent, and thus add flexibility. The other way to think about the process is that the filtered VZ method is finding and modeling the joint probability of several filters together, while the patches are finding marginal probabilities, and thus spanning a larger distribution subspace.

Finally, it is worth noting that several patches with slightly different shifts can be combined to represent a single filter. Small patches allow only a few possible shifts, and thus, only a limited number are needed to create an equivalent filter. Larger patches have many more possible shifts, and a larger texton dictionary should be necessary to represent the same filters.

VZ with MRF + Patches

The final variant of VZ is to include the joint texton-intensity distribution, with the notion that the full 2D joint distribution has more discriminative power than the 1D texton distribution alone. The idea is related to Markov random fields (MRF) in that the goal is to estimate the distribution of textons around the pixel (nodal) intensity distribution.

Implementing the MRF method is fairly straightforward, as depicted in Figure 10. Again, a texton dictionary is created and the nearest textons are found for each pixel. Then, for each texton, the pixel intensity values are binned and counted, estimating the intensity distribution for each type of texton. The resulting 2D joint texton-intensity distribution is finally converted into a large feature vector and used for finding class distribution models and classifications.


Figure 10: Additional step for calculating the MRF representation. A joint distribution of textons and pixel intensity values is used for each class model.
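
In Matlab, the extra step amounts to a 2D accumulation. This sketch assumes nearest holds the per-pixel texton indices from the patch matching, gray holds the corresponding pixel intensities in [0, 1], and 20 intensity bins are used:

    % Joint texton-intensity distribution, flattened into a feature vector.
    nBins = 20;
    bin = min(floor(gray(:) * nBins) + 1, nBins);  % intensity bin per pixel
    joint = accumarray([nearest(:), bin], 1, [nTextons, nBins]);
    joint = joint / sum(joint(:));                 % normalize to a joint PDF
    featureVec = joint(:)';                        % input to the class models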

Results are summarized in Table 3. Again, small patches performed marginally better than large patches, and the multi-model method was slightly worse than the single model for nearly all cases. The exception is p = 3 and k = 30, the best performing classifier of the MRF set, which was also the best performing patch-based classifier -- upon which the MRF was built.

                 Single model   Multi-model
p = 3,  k = 10       84.3%         84.2%
p = 3,  k = 30       85.9%         86.7% (best)
p = 5,  k = 10       84.0%         83.0%
p = 5,  k = 30       85.9%         81.8%
p = 7,  k = 10       84.5%         83.0%
p = 7,  k = 30       85.4%         83.4%
p = 9,  k = 10       84.0%         80.4%
p = 9,  k = 30       85.4%         83.9%
p = 11, k = 10       83.4%         79.7%
p = 11, k = 30       85.3%         78.9% (worst)

Table 3: Overall true positive rates for image patches combined with intensity as MRFs. Patches are p x p pixels and k textons are used per class. The best and worst results are marked.

One challenge for the MRF method is that, because the dimensionality of the feature vector is increased, a larger sample of features is necessary. It is also unclear a priori how many intensity bins should be used; Ref [8] uses 90 bins. I used a more conservative 20 bins. Because of this, the intensity bins were slightly better sampled, but had less specificity. I also tended to see similar intensity distributions across many textons, or at least the textons which were frequent enough to generate a good sampling of the intensity space, which might be an indication that not enough textons were used to describe the image: the same textons appeared over too wide a range of intensities, leading to a loss of specificity. I should also mention that, when I experimented with MRFs, I used the patch-based textons as opposed to the filter-based textons. The patches retain some of their information about mean intensity already, whereas the mean is destroyed in the filterbank I used. Future experiments might benefit from using patch textons with the central pixel removed (the one used for the intensity bins in the first place). Alternately, the patches might be individually renormalized to remove any residual mean-value information.

Results Comparisons

Four parameters were varied during the experiments: the VZ variant, the number of textons per class (i.e., the total size of the texton dictionary), the patch size, and the number of texton distribution models per class.

Filtered VZ gave fairly good recognition rates. Patch-based VZ gave similar or slightly better results than filtered VZ for the same number of textons and number of class models. Patch VZ has the benefit of being able to tune the patch size to optimize the results; however, not all patch sizes result in better recognition rates. (In other words, patches are not always better than filters unless the "right" parameters are chosen.) The MRF VZ results tended to be worse than the patch results, which is the opposite of Ref [8]. This may be because too few intensity bins were used, limiting the specificity of the MRF representations, or because not enough samples were used to give a good sampling over the entire 2D texton-intensity space.

Additional tests were performed to examine the Patch VZ parameter dependencies. For these, k ranged from 10 to 60 textons per class (in steps of 10), p was 3x3, 5x5, or 7x7 patches, and both single and multi-model approaches were tried. The optimal performance, a 91.8% overall true positive recognition rate, was achieved with p = 3x3 patches, k = 50 textons per class, and the multi-model approach; the corresponding confusion matrix is shown in Table 4. Similarly high recognition rates, 89.9% and above, were achieved for k = 30 or larger for 3x3 patches and the multi-model method.

                          Estimated Classes
               Mud   Sand   Reefs   Rocks   Rubble   Cam err   TP rate
True   Mud      96      1       0       0        0         0     99.0%
Class  Sand     14    261       0       0        2         1     93.9%
       Reefs     0      0      21       7        2         0     70.0%
       Rocks     0      0       5     104        8         0     88.9%
       Rubble    0      1       0      11       76         0     86.4%
       Cam err   0      0       0       0        0        21    100.0%

Table 4: Confusion matrix corresponding to the best overall recognition rate (out of all parameters and methods tried in this project).

The Patch VZ parameter search also showed that the recognition rates could be influenced by the model method. In the single model case (one texton distribution model per class), the recognition rates varied but all remained around 85-88%, shown in Figure 11. Due to the averaging of the texton histograms over the training sample, the variation is less likely due to outliers and more likely due to the random sampling of patches used to create the texton space and the inability of the single-model approach to predict textures further from the class average. The multi-model approach, also shown in Figure 11, shows a stronger dependence on the size of the patches. Here, I suspect the larger patches are allowed too much flexibility: the 7x7 patches, for example, describe a 52-dimensional (7x7 + 3 color features) space, which is sparsely sampled even with a 360-element (60x6) texton dictionary. Thus, the resulting histograms show more variation, and exemplar histograms are less representative. This results in poorer exemplar models. The 3x3 patches, on the other hand, have a much smaller feature space, which is normally well sampled, even with smaller numbers of textons. Better exemplars can be found from the low-noise histograms.

Figure 11: Recognition rates as the patch size and number of textons per class was varied. On the left, using a single texton distribution model per class. On the right, the multi-model approach, where each class had between two and four representative models.

Ideas for Future Improvements

Filters. The choice of particular filters came early in this project, and a workable set was chosen. In retrospect, and especially after having seen the patch textons of Figure 8, I would be interested in retrying some different filter sets with small support -- with the goal of finding a filterbank which is faster to compute than re-arranging image patches into filters.

Dr. Rosalind Picard also brings up a great idea: using the knowledge that the image patches are acting like optimized filters, I could use vector quantization methods to find optimal filters to describe the observed textures. She has further suggested applying PCA to the texton dictionary to find prototypical eigen-textons. To induce translational invariance, the PCA could also be applied to the magnitude of the Fourier transform coefficients [11].

MRF over adjacent images. The SeaBED instrument captures adjacent images of the seafloor which can then be stitched together into a large-area swath image. This also means that there is continuous information and continuous distributions between the different environment types. A Markov random field model could combine the continuity information and the estimated classes to potentially generate a better guess.

Classification by clustering. In all fairness, I should redo the clustering with an improved filter space, one which can capitalize on the isotropic methods as well. It might also be worth trying clustering with 3x3 and 5x5 patches as the features, for the sake of comparison. I don't expect phenomenal or highly useful results from this method, however, as noted earlier.

Improved multi-model. When finding multiple models, I forced the algorithm to have at least two (i.e., multiple) models per class. In some cases, a single model may have been more appropriate; similarly, allowing up to four models per class may have been a poor choice, allowing too much flexibility in the description. Using more models could also warrant a larger training set to make sure that the texton distribution space is adequately sampled -- that enough examples are available that the algorithm can make an intelligent choice. Finally, the exemplar histograms were chosen using affinity propagation. It might be interesting to modify a k-means algorithm to use the Chi-squared distance, resulting in histogram models averaged over each cluster. This could potentially reduce some sensitivity to noise and provide more samples per texton.

Color spaces. The SeaBED images tend to have some useful color information after color corrections have been applied, and the mean color was used as three additional features in the patch descriptors. RGB was used for simplicity in this project. Other color spaces might add discriminability or increase independence between the color channels (via an independent components analysis).

Conclusions and Next Steps

The next steps involve implementing some of the above ideas and checking the true variability of the performance using cross-validation on the best models. The imagery also needs to be stitched together, possibly using SIFT features with RANSAC, to determine which pixels overlap between images (and thus only need to be classified once, or which could be used as a second vote on the classification). Finally, the most robust algorithm can be applied to the available data, a major boon to fisheries and environmental monitoring.

References
[1] O. Pizarro, R. Eustice, and H. Singh, Large area 3D reconstructions from underwater surveys, Proc. Oceans 2004.
[2] O. Pizarro, R. Eustice, and H. Singh, Relative Pose Estimation for Instrumented, Calibrated Imaging Platforms, Proc. VII Digital Image Computing: Techniques and Applications, Sydney, 2003.
[3] H. Singh, R. Armstrong, F. Gilbes, R. Eustice, C. Roman, O. Pizarro, J. Torres, Imaging Coral 1: Imaging Coral Habitats with the SeaBED AUV, Subsurface Sensing Technologies and Applications 5, 25-42, 2004.
[4] R. Camilli, O. Pizarro, and L. Camilli, Rapid Swath Mapping of Reef Ecology and Associated Water Column Chemistry in the Gulf of Chiriqui, Panama, Proc. IEEE Oceans, 2007.
[5] M. Varma and A. Zisserman, Classifying images of materials: achieving viewpoint and illumination independence, European Conference on Computer Vision, Copenhagen, 255-271, 2002.
[6] B. Frey and D. Dueck, Clustering by Passing Messages Between Data Points, Science 315, 972-976, 2007.
[7] S. Paris and F. Durand, A Topological Approach to Hierarchical Segmentation using Mean Shift, Proc. IEEE Computer Vision and Pattern Recognition, 2007.
[8] M. Varma and A. Zisserman, A Statistical Approach to Material Classification Using Image Patch Exemplars, IEEE Transactions on Pattern Analysis and Machine Intelligence, accepted for future publication; currently available from http://research.microsoft.com/~manik/.
[9] G. Burghouts and J.M. Geusebroek, Color Textons for Texture Recognition, British Machine Vision Conference 3, 1099-1108, 2006.
[10] A. Torralba, K. Murphy, and W. Freeman, Sharing visual features for multiclass and multiview object detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 854-869, 2007.
[11] Personal communication with Dr. Rosalind Picard, Dec 08 2008.

Code Sources
[C1] C. Bouman, Cluster: An unsupervised algorithm for modeling Gaussian mixtures, available from http://www.ece.purdue.edu/~bouman/.
[C2] B. Finkston, Mean Shift Clustering, Matlab Central, available at http://www.mathworks.com/matlabcentral/fileexchange/10161
[C3] Matlab, Statistics Toolbox, kmeans algorithm.
[C4] B. Frey, k-centers Matlab function, available from http://www.psi.toronto.edu/affinitypropagation/software/other.zip
[C5] Matlab, Statistics Toolbox, hierarchical tree utilities.
[C6] B. Frey and D. Dueck, affinity propagation Matlab functions, available from http://www.psi.toronto.edu/affinitypropagation/.
[C7] Visual Geometry Group, Oxford, Matlab scripts for generating RFS (MR8) filterbanks, available from http://www.robots.ox.ac.uk/~vgg/research/texclass/filters.html.
[C8] P. Kovesi, Matlab and Octave Functions for Computer Vision and Image Processing, University of Western Australia, available at http://www.csse.uwa.edu.au/~pk/research/matlabfns/.
All other codes were generated in Matlab by N. Loomis for this project.

Additional Links
SeaBED AUV, developed by Hanumant Singh at Woods Hole Oceanographic Institution.
MAS622J/1.126J 2008 Homepage: Pattern Recognition and Analysis, taught by Dr. Rosalind Picard at MIT.
MIT Sea Grant College Program, research sponsor.
NOAA's National Marine Fisheries Service, research sponsor.
MIT-WHOI Joint Program in Oceanography/Applied Ocean Science and Engineering, the grad student program which allows me to do work like this on a regular basis.
3D Optical Systems Group, my home at MIT.
nloomis at mit dot edu to e-mail the author.

Last edited: Dec 10, 2008