Data Science Projects

Automated Malaria Detection Using Convolutional Neural Networks

This project implements an automated malaria detection system using a CNN. It leverages a pre-trained MobileNetV2 model with custom layers to classify single-cell images as either parasitized or uninfected. Data augmentation is applied to enhance training, and the model’s performance is evaluated using accuracy and F1 score before generating predictions for a Kaggle submission.
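As a rough illustration of the evaluation step, here is a minimal sketch of how accuracy and F1 score can be computed for a binary classifier. It uses only the Python standard library; the function name `accuracy_and_f1`, the toy labels, and the choice of 1 for the parasitized class are assumptions made for this example, not details taken from the project.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels.

    `positive` is the label treated as the positive class
    (assumed here to mean "parasitized").
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

Reporting F1 alongside accuracy is useful when the two classes are not equally common, because accuracy alone can look high even if the positive class is often missed.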

Full Repository

Protein Family Classification with Deep Learning

In this project, I use a one-dimensional convolutional neural network (CNN) built with Keras to classify protein sequences, reaching an accuracy of 0.99855. The preprocessing tokenizes protein sequences at the character level, pads them to a uniform length, and encodes family IDs as integers. The model architecture includes an embedding layer, a 1D convolutional layer, global max pooling, and fully connected layers with dropout for regularization. I also use UMAP and t-SNE to visualize the separation between protein families in the embedding space. This work shows that deep learning methods can be applied effectively to protein family classification.
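The preprocessing steps described above (character-level tokenization, padding, and label encoding) can be sketched in plain Python as follows. This is an illustrative stand-in rather than the project's Keras code: the helper names, the toy sequences, the family IDs `FAM_A` and `FAM_B`, and the choice to pad on the right with 0 are all assumptions.

```python
def build_char_vocab(sequences):
    """Map each character to an integer id starting at 1; 0 is reserved for padding."""
    chars = sorted({ch for seq in sequences for ch in seq})
    return {ch: i for i, ch in enumerate(chars, start=1)}


def encode_and_pad(sequences, vocab, max_len):
    """Turn sequences into id lists, truncated or right-padded with 0 to max_len."""
    encoded = []
    for seq in sequences:
        # Characters missing from the vocabulary fall back to 0 in this sketch.
        ids = [vocab.get(ch, 0) for ch in seq[:max_len]]
        ids += [0] * (max_len - len(ids))
        encoded.append(ids)
    return encoded


def encode_labels(family_ids):
    """Map family IDs to consecutive integers; also return the class order."""
    classes = sorted(set(family_ids))
    index = {c: i for i, c in enumerate(classes)}
    return [index[f] for f in family_ids], classes


# Toy data, invented for illustration only.
sequences = ["MKV", "MKVLA", "GA"]
families = ["FAM_A", "FAM_B", "FAM_A"]

vocab = build_char_vocab(sequences)
X = encode_and_pad(sequences, vocab, max_len=4)
y, classes = encode_labels(families)
print(X)
print(y, classes)
```

The integer matrix `X` is the kind of input an embedding layer expects, and `y` holds one class index per sequence.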

Full Repository

Breast Cancer Diagnosis Prediction

This project focuses on developing and optimizing machine learning models to accurately classify breast cancer diagnoses as malignant (M) or benign (B) using the Breast Cancer Wisconsin dataset. The workflow begins with data cleaning and preprocessing, including removing irrelevant columns and mapping diagnosis labels to binary values for machine learning compatibility. Features are standardized using StandardScaler to ensure uniformity across variables. To enhance predictive performance, the project employs feature selection with SelectKBest using ANOVA F-statistics to identify the 10 most significant features.
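To make the preprocessing and feature selection steps concrete, here is a standard-library sketch of what standardization and ANOVA F-based ranking compute. It is a simplified stand-in for `StandardScaler` and `SelectKBest`, not the project's code: the helper names, the two toy feature columns, and their values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to zero mean and unit variance (population std)."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0] * len(column)
    return [(x - mu) / sigma for x in column]


def anova_f(column, labels):
    """One-way ANOVA F-statistic of a single feature across the label groups."""
    groups = {}
    for x, label in zip(column, labels):
        groups.setdefault(label, []).append(x)
    n, k = len(column), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand = mean(column)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(features, labels, k):
    """Return the names of the k features with the largest F-statistic."""
    scores = {name: anova_f(col, labels) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data, invented for illustration only.
diagnosis = ["M", "B", "M", "B", "M", "B"]
y = [1 if d == "M" else 0 for d in diagnosis]  # malignant -> 1, benign -> 0
raw = {
    "radius_mean": [20, 12, 19, 11, 21, 13],
    "texture_mean": [15, 16, 14, 15, 16, 14],
}
scaled = {name: standardize(col) for name, col in raw.items()}
top = select_k_best(scaled, y, k=1)
print(top)
```

In this toy data only the first column differs between the two classes, so it receives the larger F-statistic and is the one kept.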

Full Repository

Sentiment Analysis of My Breakup Texts

My dating pool, unfortunately, consists of 22-year-old men. I wanted to measure the subjectivity and polarity of my exes’ text messages sent immediately before or immediately after one of us decided to end things. As you go through this code, I hope that you find my dating history entertaining rather than relatable.

Polarity is a sentiment score between -1 and +1, where -1 indicates negative sentiment and +1 indicates positive sentiment.

Subjectivity is a score between 0 and 1 that reflects how much a text expresses personal opinion and judgment rather than fact; 0 is fully objective and 1 is fully subjective.
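The text above does not say which tool produced the scores, so the following is only a toy sketch of one way such scores can be derived: look up each word in a small lexicon and average the matches. The lexicon entries, their values, and the function name `score_text` are all invented for illustration.

```python
import re

# Hypothetical lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
TOY_LEXICON = {
    "great": (0.75, 0.75),
    "happy": (0.5, 1.0),
    "sorry": (-0.5, 1.0),
    "terrible": (-1.0, 1.0),
}


def score_text(text, lexicon=TOY_LEXICON):
    """Return (polarity, subjectivity) as the average over words found in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    if not hits:
        # No opinion words found: treat the text as neutral and objective.
        return 0.0, 0.0
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity


print(score_text("I am sorry, this is terrible"))
print(score_text("See you at 5"))
```

A message with no lexicon words scores as neutral and objective here, which is worth keeping in mind when reading scores for very short texts.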

All identities have been anonymized; each individual is identified only by a letter (C, A, J, or G).

Full Repository

Exploration and Analysis of Gendered Insults

In this project, I used a dataset of Wikipedia comments provided by Jigsaw, the Google subsidiary that developed the Perspective API. Each record contains a unique comment ID, the text of the comment, and binary labels indicating whether human raters deemed the comment “toxic,” “severe_toxic,” “obscene,” “threatening,” “insulting,” or “identity_hate.” I added a “score” column containing the toxicity score assigned by the live version of the Perspective API.

I decided to examine whether scores differ between insults that are generally deemed misogynistic and targeted at self-identifying women (I chose the terms ‘bitch’, ‘whore’, ‘slutty’, ‘skank’, and ‘tramp’) and insults that are generally deemed to be targeted primarily at self-identifying men (I chose the terms ‘dickhead’, ‘bastard’, ‘jackass’, ‘asshole’, and ‘prick’).
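One simple way to run this comparison is to average the “score” column over the comments that contain a term from each list. The sketch below assumes each row is a dict with `comment_text` and `score` keys; the function name, the toy rows, and the placeholder terms (`alpha`, `beta`, `gamma`, standing in for the two term lists above) are assumptions for illustration, not the project's code or data.

```python
import re
from statistics import mean


def mean_score_by_group(rows, term_groups):
    """Average `score` over comments containing any whole-word term from each group."""
    result = {}
    for name, terms in term_groups.items():
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
            re.IGNORECASE,
        )
        scores = [row["score"] for row in rows if pattern.search(row["comment_text"])]
        # A comment matching both groups counts toward both averages.
        result[name] = mean(scores) if scores else None
    return result


# Toy rows with placeholder terms and made-up scores.
rows = [
    {"comment_text": "you are such an alpha", "score": 0.9},
    {"comment_text": "Beta move, honestly", "score": 0.7},
    {"comment_text": "total gamma", "score": 0.6},
    {"comment_text": "nice edit, thanks", "score": 0.1},
    {"comment_text": "alphabet soup", "score": 0.2},  # not a whole-word match
]
term_groups = {"group_a": ["alpha", "beta"], "group_b": ["gamma"]}

averages = mean_score_by_group(rows, term_groups)
print(averages)
```

Matching on word boundaries keeps a term from being counted when it only appears inside a longer, unrelated word.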

Full Repository