yildirimozew/dimensionality_reduction
In this three-part work:
For the first part:
1- I implemented three distance/similarity metrics (Cosine, Minkowski, Mahalanobis).
2- I implemented a KNN class (a sketch of the metrics and the prediction step follows this list).
3- I performed stratified 10-fold cross-validation for hyperparameter tuning via grid search.
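
As a rough illustration of how items 1 and 2 fit together, here is a minimal NumPy sketch of the three metrics and the KNN prediction step. It is not the repository's actual code: the function names (`cosine_distance`, `knn_predict`, ...) and the majority-vote scheme are my own assumptions, and the cross-validation / grid-search machinery is omitted.

```python
import numpy as np

def cosine_distance(x, y):
    # 1 - cosine similarity
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def minkowski_distance(x, y, p=2):
    # p=1 gives Manhattan distance, p=2 gives Euclidean distance
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis_distance(x, y, inv_cov):
    # inv_cov is the inverse covariance matrix of the training data
    diff = x - y
    return np.sqrt(diff @ inv_cov @ diff)

def knn_predict(X_train, y_train, x, k, metric):
    # distance from the query point to every training point
    dists = np.array([metric(x, xi) for xi in X_train])
    nearest = np.argsort(dists)[:k]
    # majority vote among the k nearest labels
    labels, counts = np.unique(np.asarray(y_train)[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```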

For the second part:
1- I visualized the elbow method for kmeans and kmedoids, separately for the loss and the silhouette scores.
2- I determined the best k value using these visualizations.
3- I implemented PCA and Autoencoder classes from scratch (a PCA sketch follows this list).
4- I employed PCA, the autoencoder, t-SNE and UMAP to visualize a high-dimensional dataset in 2-D.
5- I derived the worst-case run time complexity of kmeans.
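
Of the two from-scratch models in item 3, PCA is the easier one to show compactly. The sketch below is a plausible eigendecomposition-based implementation, not necessarily the one in this repository (class and attribute names are illustrative), and the autoencoder is omitted.

```python
import numpy as np

class PCA:
    """Project data onto the top principal components (eigenvectors of the covariance matrix)."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        # center the data
        self.mean_ = X.mean(axis=0)
        Xc = X - self.mean_
        # eigendecomposition of the covariance matrix
        cov = np.cov(Xc, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        # eigh returns eigenvalues in ascending order; keep the largest ones
        order = np.argsort(eigvals)[::-1][:self.n_components]
        self.components_ = eigvecs[:, order]
        return self

    def transform(self, X):
        return (X - self.mean_) @ self.components_
```

For example, `PCA(n_components=2).fit(X).transform(X)` would give a 2-D embedding suitable for plotting.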

For the third part:
1- I visualized the clusterings produced by the HAC and DBSCAN algorithms.
2- I performed a hyperparameter grid search for HAC and DBSCAN (a sketch follows this list).
3- I performed silhouette analysis for the HAC and DBSCAN hyperparameter configurations.
4- I employed PCA, the autoencoder, t-SNE and UMAP to visualize another high-dimensional dataset in 2-D.
5- I derived the worst-case run time complexity of HAC.
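
For items 2 and 3, the grid search combined with silhouette scoring can be sketched with scikit-learn's `AgglomerativeClustering` and `DBSCAN`. The use of scikit-learn and the parameter grids below are assumptions for illustration, not necessarily what the report actually does.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

def grid_search_hac(X, n_clusters_values, linkages=("ward", "complete", "average", "single")):
    # n_clusters_values should start at 2, since the silhouette score needs >= 2 clusters
    results = []
    for k in n_clusters_values:
        for linkage in linkages:
            labels = AgglomerativeClustering(n_clusters=k, linkage=linkage).fit_predict(X)
            results.append((k, linkage, silhouette_score(X, labels)))
    return max(results, key=lambda r: r[-1])  # configuration with the best silhouette

def grid_search_dbscan(X, eps_values, min_samples_values):
    best = None
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            # drop noise points (label -1); the silhouette score needs >= 2 clusters
            mask = labels != -1
            if len(set(labels[mask])) < 2:
                continue
            score = silhouette_score(X[mask], labels[mask])
            if best is None or score > best[-1]:
                best = (eps, min_samples, score)
    return best
```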

Results for Part 1:
(Part 1 results figure)
Results for Part 2:

Resulting plots for kmeans with confidence intervals:

(figures: dataset1-kmeans-loss, dataset1-kmeans-silhouette)
As can be seen, k=6 appears to be the best value for this dataset according to both the loss and the silhouette scores.
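
For reference, curves like these can be produced by re-running the clustering several times per k and plotting the mean with a confidence band. The sketch below uses scikit-learn's `KMeans` and `silhouette_score` for brevity; the repository's own kmeans implementation and the exact number of runs are assumptions on my part.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def elbow_and_silhouette(X, k_values, n_runs=10):
    # repeat each k several times so the plots can show a mean with a confidence interval
    losses, silhouettes = [], []
    for k in k_values:  # k_values should start at 2 (silhouette needs >= 2 clusters)
        run_loss, run_sil = [], []
        for seed in range(n_runs):
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
            run_loss.append(km.inertia_)  # sum of squared distances to the nearest centroid
            run_sil.append(silhouette_score(X, km.labels_))
        losses.append(run_loss)
        silhouettes.append(run_sil)
    return np.array(losses), np.array(silhouettes)

# example: plot the mean loss with an approximate 95% confidence band over the runs
# import matplotlib.pyplot as plt
# k_values = range(2, 11)
# losses, sils = elbow_and_silhouette(X, k_values)
# mean, err = losses.mean(axis=1), 1.96 * losses.std(axis=1) / np.sqrt(losses.shape[1])
# plt.errorbar(list(k_values), mean, yerr=err); plt.xlabel("k"); plt.ylabel("loss"); plt.show()
```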

Resulting plots for kmedoids with confidence intervals:
(figures: dataset1-kmedoids-loss, dataset1-kmedoids-silhouette)
The elbow point is at k=6 and the silhouette score also peaks there, so k=6 is the optimal choice for kmedoids as well.

Here are the plots of the dimensionality reduction methods I used for Dataset1:
(figures: dataset1-autoencoder, dataset1-pca, dataset1-tsne, dataset1-umap)
Every method gives nearly the same picture, with 6 distinct clusters, which confirms that the k chosen via the elbow method was accurate.
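
A sketch of how such 2-D projections can be generated with off-the-shelf libraries (scikit-learn for PCA and t-SNE, the umap-learn package for UMAP); the from-scratch PCA and autoencoder from this part would slot in the same way, and the plotting details are illustrative rather than the report's exact code.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

def plot_2d_projections(X, labels):
    # three off-the-shelf 2-D embeddings side by side, colored by cluster label
    embeddings = {
        "PCA": PCA(n_components=2).fit_transform(X),
        "t-SNE": TSNE(n_components=2).fit_transform(X),
        "UMAP": umap.UMAP(n_components=2).fit_transform(X),
    }
    fig, axes = plt.subplots(1, len(embeddings), figsize=(15, 4))
    for ax, (name, emb) in zip(axes, embeddings.items()):
        ax.scatter(emb[:, 0], emb[:, 1], c=labels, s=5)
        ax.set_title(name)
    plt.show()
```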

Worst-case run time analysis:
For kmeans, in every iteration each of the N points is compared with each of the K cluster centroids to find the nearest one, and computing each distance costs O(d) for d-dimensional vectors. This is repeated for I iterations, so the assignment step is O(K * N * I * d); updating the centroids has the same complexity. For kmedoids, assigning points to centers works as in kmeans, but updating a medoid requires evaluating the cost of every candidate point in the cluster instead of computing a single mean, which makes the update step O(N^2). The final complexity therefore becomes O(K * N^2 * I * d).
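
To make the O(K * N * I * d) term concrete, here is the assignment step written as explicit loops (a vectorized implementation changes the constants, not the asymptotics); the function name and structure are illustrative only.

```python
import numpy as np

def kmeans_assign(X, centroids):
    # N points x K centroids, each comparison costing O(d): O(N * K * d) per iteration
    N, K = len(X), len(centroids)
    labels = np.empty(N, dtype=int)
    for i in range(N):                                 # N points
        best, best_dist = 0, np.inf
        for j in range(K):                             # K centroids
            dist = np.sum((X[i] - centroids[j]) ** 2)  # d-dimensional distance
            if dist < best_dist:
                best, best_dist = j, dist
        labels[i] = best
    return labels

# Repeating assignment plus centroid update for I iterations gives O(K * N * I * d) in total.
```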

Results for Part 3:
Since there are a lot of plots in this part, I'm only including a sample of them here. You can see all the results and plots in report.pdf.
(figures: HAC-1-2, HAC-1, DBSCAN1)

Worst-case run time analysis:
HAC starts by computing the distances between all pairs of data points, which takes O(N^2 * d). It then repeatedly merges the two closest clusters, and finding the closest pair by scanning the distance matrix takes O(N^2). This is done N - 1 times, so the merging phase is O(N^3). If the algorithm uses a heap, the merging can be optimized to O(N^2 log N), so the total complexity becomes O(N^2 * d + N^2 log N).
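
A naive single-linkage HAC sketch that makes the O(N^3) bound visible: the O(N^2 * d) distance matrix is built once, and each of the N - 1 merges rescans the whole matrix in O(N^2). The heap-based optimization mentioned above is not shown, and this is an illustrative implementation rather than the repository's.

```python
import numpy as np

def naive_single_link_hac(X):
    # pairwise distance matrix: O(N^2 * d)
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    clusters = {i: [i] for i in range(N)}
    merges = []
    while len(clusters) > 1:
        # scanning the matrix for the closest pair is O(N^2); done N - 1 times -> O(N^3)
        i, j = divmod(int(np.argmin(D)), N)
        merges.append((i, j, D[i, j]))
        # single-linkage update: distance to the merged cluster is the minimum of the two rows
        D[i, :] = np.minimum(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        D[j, :] = np.inf
        D[:, j] = np.inf
        clusters[i].extend(clusters.pop(j))
    return merges
```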
