Kyocera

Project Overview

  • Duration: September 2018 - December 2018
  • Team: Data science consulting team composed of 2 Project Managers and 6 Data Analysts
  • My role: Data Analyst
  • Tools & Frameworks: Python, Jupyter, scikit-learn, matplotlib

In the fall of 2018, I had the opportunity to work with Kyocera Document Solutions to propose a classification model that would help Kyocera better understand their printers. The work was broken up into three stages: feature creation, model exploration, and model selection.

Feature Creation

My team began with data cleaning and preliminary data analysis to better understand the dataset. We cleaned the data with standard techniques such as normalization and reformatting data types, which made it easier to join and group columns across the given datasets. We also developed new features, including averages, variances, and ratios of the given features, so that our models would better represent the business problem statement.
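To give a flavor of this kind of feature creation, here is a minimal sketch using pandas. It is not the project code: the table, the column names (printer_id, pages_printed, color_pages), the values, and the build_features helper are all made up for illustration, and pandas itself is an assumption on top of the tools listed above.

```python
import pandas as pd

# Hypothetical usage log: one row per printer per month.
# All column names and values are invented for illustration.
usage = pd.DataFrame({
    "printer_id": ["A", "A", "A", "B", "B", "B"],
    "pages_printed": [100, 120, 80, 1000, 900, 1100],
    "color_pages": [10, 30, 20, 500, 450, 550],
})

def build_features(df):
    """Aggregate raw rows into one feature row per printer."""
    grouped = df.groupby("printer_id")
    # Average and (sample) variance of monthly page counts per printer.
    features = grouped["pages_printed"].agg(avg_pages="mean", var_pages="var")
    # Ratio feature: share of all printed pages that were color pages.
    totals = grouped[["pages_printed", "color_pages"]].sum()
    features["color_ratio"] = totals["color_pages"] / totals["pages_printed"]
    return features

features = build_features(usage)
print(features)
```

Aggregating to one row per printer is only one possible design; the right grouping key depends on the question being asked of the data.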

Model Exploration and Selection

My team then explored various clustering models, focusing on those that would give intuitive, interpretable results based on the created features. We initially narrowed our choices to spectral clustering, mean shift clustering, and hierarchical clustering, since they had the best results and were also relatively interpretable. For our final model and recommendation, we used mean shift clustering, a centroid-based algorithm that works by updating candidates for the centroids to be the mean of the points within their associated region. This can be thought of as estimating the underlying data distribution through Kernel Density Estimation (KDE). Then, given the contours of the estimated distribution, the algorithm shifts each point toward its nearest local maximum until convergence.
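For readers unfamiliar with mean shift, the idea can be sketched with scikit-learn's MeanShift class. This is an illustrative example on synthetic data, not the model we delivered: the two blobs, the random seed, and the bandwidth value of 2.0 are assumptions chosen so the example is easy to follow.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Synthetic stand-in for the real features: two well-separated 2D blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2)),
])

# The bandwidth sets the size of the region averaged around each candidate
# centroid. The value 2.0 is an assumption picked for this toy data.
model = MeanShift(bandwidth=2.0)
labels = model.fit_predict(X)

# Each center is a local maximum of the estimated density.
centers = model.cluster_centers_
print(len(centers), "clusters found")
print(centers)
```

Unlike k-means, the number of clusters is not specified up front; it falls out of the bandwidth. On real data the bandwidth usually needs tuning, and scikit-learn provides an estimate_bandwidth helper as a starting point.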

I am unable to publicly share much of the technical work I did, but feel free to reach out if you have any questions!

Reflections

This was my first data science project outside of the classroom, and it was a great experience to apply some of the skills and techniques I had learned to a real business problem. I learned that while a classroom project usually starts from well-cleaned data and focuses on implementing a model from scratch, an industry project spends most of its time on cleaning the data and relies on pre-existing packages for the modeling. I also learned the importance of quality data: we often had trouble using traditional methods in the modeling phase, perhaps due to lackluster features or a lack of overall predictive power in the dataset.