Scene Classification
https://www.kaggle.com/datasets/nitishabharathi/scene-classification
A total of 16,797 labelled images (150 x 150 pixels, 3 RGB channels) are available for training.
They are split into 6 different categories of approximately equal size.
Our baseline accuracy for each class is therefore ~17%.
Building
Sea
Street
Forest
Glacier
Mountain
A set of images was created to visualize the "average" of each category: for each class, the mean RGB value of every pixel was calculated across all images in that class.
The forest representation is the most distinct.
The mountain image shows a reasonably strong contrast gradient.
There is little to differentiate between the other 4 classes.
(Figure: mean images for each class: Building, Sea, Street, Forest, Glacier and Mountain)
We can similarly evaluate the amount of variation in each image class by plotting the variance per pixel for each red, green and blue layer in each image. Dark areas represent regions of low variance with light areas showing high variance.
(Figure: per-pixel variance of the red, green and blue channels for each class: Building, Sea, Street, Forest, Glacier and Mountain)
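The mean and variance images described above can be sketched with NumPy. The `images` argument is a hypothetical list of (150, 150, 3) RGB arrays belonging to one class:

```python
import numpy as np

def class_statistics(images):
    """Per-pixel mean and variance for one class of images.

    `images` is a list of (150, 150, 3) uint8 RGB arrays.
    """
    stack = np.stack(images).astype(np.float64)  # shape (n, 150, 150, 3)
    mean_img = stack.mean(axis=0)  # the "average" image for the class
    var_img = stack.var(axis=0)    # per-pixel, per-channel variance
    return mean_img, var_img
```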
Data from each image were flattened into a single row with 67,500 (150 x 150 x 3) columns. The result is a matrix with 16,797 rows and 67,500 columns, i.e. many more features than samples.
Once flattened, each image can be thought of as a point in high dimensional space.
Principal component analysis (https://en.wikipedia.org/wiki/Principal_component_analysis) allows us to map these higher dimensions down onto a lower dimensional space.
The chart on the left plots the cumulative variance explained by the first 1679 principal components (1/10 of the total number of samples).
The chart on the right zooms in on only the first 100 principal components.
The first 10 principal components explain approximately 50% of the total variance of the dataset; this number was chosen, somewhat arbitrarily, as the embedding dimension for all models.
To ease visualization, only the first three principal components have been plotted below.
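A minimal sketch of the flatten-and-reduce step with scikit-learn. The small random array below is a stand-in for the real (16,797 x 67,500) matrix, used purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Small random stand-in; the real flattened matrix is (16797, 67500).
X_images = rng.random((100, 30, 30, 3))
X = X_images.reshape(len(X_images), -1)  # flatten: one row per image

pca = PCA(n_components=10)               # keep the first 10 components
X_pca = pca.fit_transform(X)             # reduced embedding, shape (100, 10)
cumulative = np.cumsum(pca.explained_variance_ratio_)
```

Plotting `cumulative` against the component index reproduces the cumulative-variance charts described above.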
Dimensionality reduction makes the classification problem more computationally tractable.
Logistic regression, random forest and support vector machine (SVM) classifiers were all attempted on the embedded data.
A train/test split was applied to the data, with a random 20% of the images held back for testing.
Instead of simply looking at the average testing accuracy of the classifier, the accuracy was calculated separately for each class.
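A sketch of this evaluation with scikit-learn, using per-class recall as the per-class accuracy. The random arrays are stand-ins for the 10-dimensional PCA embedding and its labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.random((600, 10))          # stand-in for the 10-D embedding
y = rng.integers(0, 6, size=600)   # 6 scene classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(),
}
per_class_accuracy = {}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    # Recall per label is exactly the per-class accuracy described above.
    per_class_accuracy[name] = recall_score(
        y_te, preds, labels=list(range(6)), average=None
    )
```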
UMAP (Uniform Manifold Approximation and Projection, https://arxiv.org/abs/1802.03426) is a dimensionality reduction technique that, instead of re-projecting the data onto a new orthonormal basis, assumes there is a manifold that preserves a sense of local structure. The following image is taken from https://umap-learn.readthedocs.io/en/latest/interactive_viz.html.
It is important to note that UMAP is generally non-deterministic, so each run of the embedding will produce slightly different results. Hyper-parameters were tuned manually to produce a sensible plot.
The image dataset was projected using UMAP onto a three dimensional manifold. We can again see relatively good separation between the embeddings for images of forests and mountains, with the addition of glaciers as a more distinct group. Embeddings for images of streets, seas and buildings continue to occur throughout the space and are not obviously linearly separable.
A similar model-training exercise was carried out on the 10-dimensional UMAP embeddings. Accuracy scores are notably lower in this case.
t-SNE (t-distributed stochastic neighbour embedding) is a statistical embedding technique that attempts to preserve a notion of local distance similarity. Visualizations produced with t-SNE are generally less affected by outliers.
An excellent tutorial and interactive website on the technique is available here: https://distill.pub/2016/misread-tsne/.
The implementation in the rapids.ai API only allows for reduction to two dimensions.
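The report used the rapids.ai (cuML) implementation; scikit-learn's CPU version, sketched here on a random stand-in for the PCA-reduced data, exposes a similar interface:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((100, 10))   # stand-in for the PCA-reduced data

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10.0, random_state=0)
X_2d = tsne.fit_transform(X)  # one 2-D point per sample
```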
Modelling on the t-SNE embedding shows a similar pattern to the PCA and UMAP results.
The confusion matrix allows us to visualize where and how misclassifications are taking place. This can be displayed in tabular or graphical form.
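The tabular form can be produced with scikit-learn; the labels and predictions below are illustrative stand-ins:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["building", "sea", "street", "forest", "glacier", "mountain"]
y_true = np.array([0, 0, 1, 1, 2, 3, 4, 5, 5, 1])  # illustrative labels
y_pred = np.array([0, 2, 4, 5, 2, 3, 4, 5, 4, 1])  # illustrative predictions

# Row i, column j counts images of true class i predicted as class j.
cm = confusion_matrix(y_true, y_pred, labels=list(range(6)))
# Normalising each row by its sum puts per-class accuracy on the diagonal.
row_normalised = cm / cm.sum(axis=1, keepdims=True)
```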
For example, images labelled as sea are predominantly predicted to be either glaciers or mountains. This is unsurprising given our earlier characterization of the example images, and the misclassification between mountains and glaciers is also in line with expectations.
Interestingly, images of buildings are generally just poorly classified.
We can also show the prediction plots for k-NN (200 nearest neighbours), logistic regression, random forest and support vector machine models. The classifications from the k-NN and support vector machine models are very similar in form, in contrast to the logistic regression and random forest outputs.