Data Science portfolio

Artificial Neural Networks

The goal of this project is to build an Artificial Neural Network that recognizes objects in images captured with a webcam.

Throughout this project I implemented a feed-forward neural network with backpropagation from scratch on the make_moons dataset, trained a Convolutional Neural Network with Keras on the MNIST dataset, and finally classified the webcam images with pre-trained networks (VGG16, MobileNetV2).
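For illustration, here is a minimal sketch of the last step with Keras, assuming an ImageNet-pretrained MobileNetV2 and a saved webcam capture (the file name is a placeholder, not the notebook's actual code):

```python
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = MobileNetV2(weights="imagenet")            # ImageNet-pretrained weights

# "webcam_capture.jpg" is a placeholder for an image saved from the webcam
img = image.load_img("webcam_capture.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])         # top-3 (class, label, score)
```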

To Do:

  • Update and clean the notebooks
  • Explore further the hyperparameters of the networks

Markov Chain Monte Carlo (MCMC) Simulation

In this project, I teamed up with my colleague Moritz von Ketelhodt to write a program that simulates and predicts customer behaviour between the departments/aisles of a supermarket, applying Markov Chain modelling and Monte Carlo simulation.

The project involved the following tasks:

Data Wrangling

See the notebook/supermarket_data_wrangling.ipynb.

Data Analysis and Exploration

See the notebook/supermarket_EDA.ipynb.

Calculating Transition Probabilities between the aisles (5x5 crosstab)

See the notebook/customer_transition_matrix.ipynb.
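A minimal sketch of this step with pandas, using toy data in place of the wrangled supermarket table (the column and aisle names are illustrative, not the notebook's):

```python
import pandas as pd

# Toy data standing in for the wrangled supermarket data (illustrative only)
df = pd.DataFrame({
    "customer_no": [1, 1, 1, 2, 2],
    "location":    ["dairy", "spices", "checkout", "fruit", "checkout"],
})

# For each customer, the aisle visited in the next time step
df["next_location"] = df.groupby("customer_no")["location"].shift(-1)

# Row-normalised crosstab: P(next aisle | current aisle);
# on the full data this yields the 5x5 transition matrix
transition_matrix = pd.crosstab(df["location"], df["next_location"],
                                normalize="index")
print(transition_matrix.round(2))
```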

Creating a Customer Class

See the notebook/customer_class.ipynb.
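A minimal sketch of such a class, driven by the row of the transition matrix for the current aisle (attribute names and the initial state are assumptions, not the project's exact implementation):

```python
import numpy as np

class Customer:
    """One supermarket customer moving between aisles via a Markov Chain."""

    def __init__(self, customer_id, transition_matrix, state="fruit"):
        self.customer_id = customer_id
        self.transition_matrix = transition_matrix  # DataFrame, rows sum to 1
        self.state = state                          # current aisle

    def is_active(self):
        return self.state != "checkout"

    def next_state(self):
        """Draw the next aisle from the row of the current state."""
        probs = self.transition_matrix.loc[self.state]
        self.state = np.random.choice(probs.index, p=probs.values)
        return self.state
```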

Running an MCMC (Markov Chain Monte Carlo) simulation for a single customer of the Customer class

See the simulation/customer_class_one_customer_simulation_ES.py.

Extending the simulation to multiple customers

See the simulation/one_script.py.
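A minimal sketch of the multi-customer loop, reusing the Customer class and transition matrix sketched above (the number of customers and time steps are arbitrary, not the script's actual values):

```python
# Several customers walking through the supermarket in parallel
customers = [Customer(i, transition_matrix) for i in range(5)]

for minute in range(15):                      # simulate 15 time steps
    for customer in customers:
        if customer.is_active():              # stop once a customer checks out
            customer.next_state()
    print(f"t={minute:02d}", [c.state for c in customers])
```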

To Do:

  • Visualization of the supermarket layout and the simulation of the customer behaviour based on the transition probabilities
  • Displaying the avatars at the exit location
  • Displaying the paths of the avatars' movement between the locations

Supervised Machine Learning: Classification - Kaggle's Titanic Challenge

This project approaches a classic Machine Learning problem: building a classification model to predict the survival of Titanic passengers based on the features in the dataset of Kaggle's Titanic - Machine Learning from Disaster.

Based on the Exploratory Data Analysis (plotting missing values and the correlation between survival and the different data categories), I selected the most significant features and dropped the ones that do not contribute to an accurate prediction.

Scikit-learn's LogisticRegression and RandomForestClassifier models were trained on the data.
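A minimal sketch of this modelling step, assuming a preprocessed feature matrix X and target vector y from the EDA above (the split, hyperparameters and accuracy print are illustrative, not the notebook's actual code):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# X: selected features, y: survival labels (assumed to be prepared beforehand)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))  # test accuracy
```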

Data source: Kaggle: Titanic - Machine Learning from Disaster.

Supervised Machine Learning: Regression - Bicycle Rental Forecast

The goal of this project is to build a regression model that predicts the total number of rented bicycles per hour from time and weather features, optimizing the model for the RMSLE metric, using Kaggle's "Bike Sharing Demand" dataset, which provides hourly rental data spanning two years.

After extracting datetime features, highly correlated variables were dropped via feature selection (correlation analysis, Variance Inflation Factor) to avoid multicollinearity. I then compared several regression models (PoissonRegressor, PolynomialFeatures with linear regression, Lasso, Ridge, RandomForestRegressor) based on their R2 and RMSLE scores.
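A minimal sketch of scoring one of these models on R2 and RMSLE with scikit-learn, assuming train/test splits of the engineered features already exist (variable names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error, r2_score

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

y_pred = np.clip(model.predict(X_test), 0, None)   # rental counts cannot be negative

rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))
print(f"R2:    {r2_score(y_test, y_pred):.3f}")
print(f"RMSLE: {rmsle:.3f}")
```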

Data source: Kaggle: Bike Sharing Demand.

Natural Language Processing (NLP): Text Classification

The main goal of this project was to build a text classification model on song lyrics that predicts the artist from a piece of text, and additionally to make user input (artists, lyrics) possible via a CLI.

Through web scraping with BeautifulSoup, the song lyrics of selected artists are extracted from lyrics.com. I built two functions for handling the scraped data: one extracts the song lyrics directly from the HTML, the other downloads the page of every song-lyrics URL and saves the lyrics locally as .txt files. In either case, all lyrics are then loaded from the .txt files to create the corpus.
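A minimal sketch of the download-and-save variant with requests and BeautifulSoup; the tag used to locate the lyrics body is an assumption about the page markup, not the project's actual scraper:

```python
import requests
from bs4 import BeautifulSoup

def save_lyrics(url, filename):
    """Download one song-lyrics page and save the lyrics text to a .txt file."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("pre")                     # assumed tag holding the lyrics
    if body is not None:
        with open(filename, "w", encoding="utf-8") as f:
            f.write(body.get_text())
```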

In the model pipeline, TfidfVectorizer (TF-IDF) transforms the words of the corpus into a matrix, count-vectorizing and normalizing them in a single step by default. For classification, the multinomial Naive Bayes classifier MultinomialNB() was used, which is suited to discrete features such as word counts in text classification.
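A minimal sketch of that pipeline with scikit-learn, assuming `corpus` (a list of lyrics strings) and `artists` (the matching list of labels) have been loaded from the .txt files; the hyperparameters are illustrative:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# TF-IDF features feeding a multinomial Naive Bayes classifier
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"),
                         MultinomialNB(alpha=0.1))
pipeline.fit(corpus, artists)

print(pipeline.predict(["here comes the sun"]))  # predicted artist for a snippet
```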

To Do:

  • Text pre-processing with the word tokenizer and lemmatizer of the Natural Language Toolkit (NLTK) in order to "clean" the extracted texts
  • Debug CLI

Time Series Analysis: Temperature Forecast

For this project, I applied an ARIMA model for a short-term temperature forecast. After visualizing the trend, the seasonality and the remainder of the time series data, I inspected the lags of the Autocorrelation (ACF) and Partial Autocorrelation (PACF) plots to determine the parameters (p, d, q) of the ARIMA model, and ran tests such as ADF and KPSS to check stationarity (time dependence).
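A minimal sketch of the stationarity tests and the ARIMA fit with statsmodels, assuming `temperature` is a pandas Series indexed by date; the order (1, 1, 1) is only a placeholder for the values read off the ACF/PACF plots:

```python
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.arima.model import ARIMA

print("ADF p-value: ", adfuller(temperature)[1])  # H0: unit root (non-stationary)
print("KPSS p-value:", kpss(temperature)[1])      # H0: series is stationary

model = ARIMA(temperature, order=(1, 1, 1)).fit() # (p, d, q) from ACF/PACF
forecast = model.forecast(steps=14)               # short-term (14-step) forecast
print(forecast)
```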

Data source: European Climate Assessment Dataset.

Unsupervised Learning: Recommender Systems

This project is a movie recommender with a web interface. The recommender is based on the NMF (Non-negative Matrix Factorization) approach: from the existing ratings it predicts ratings for movies a new, similar user has not seen yet and recommends the movies that user would most likely appreciate. It is trained on the 'small' MovieLens dataset.
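A minimal sketch of the NMF step with scikit-learn, assuming `ratings` is the user-movie rating matrix built from the MovieLens data with missing ratings already imputed (e.g. with movie means); the number of components is illustrative:

```python
from sklearn.decomposition import NMF

nmf = NMF(n_components=20, max_iter=500)
P = nmf.fit_transform(ratings)      # user-feature matrix
Q = nmf.components_                 # feature-movie matrix

R_hat = P @ Q                       # reconstructed (predicted) ratings
# For a new user, transform their filled rating vector with nmf.transform()
# and rank the movies they have not seen by the predicted ratings.
```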

To Do:

  • Finish and clean the code for the Flask app
  • Use Streamlit to re-create the app

All projects were developed within the scope of the Data Science Bootcamp of Spiced Academy.
