OSEMN for Newbies

Jari El
5 min read · Jun 26, 2020

As this is one of my first data science projects, as well as my first time modeling, I thought it would be nice to write this blog post for other fresh data scientists. I can definitely see how standing at the base of a new project, when you are expected to put your new models to use, can feel daunting from start to finish. So, I will take you through one of the best workflows that was bestowed upon me: the OSEMN process.

In 2010, Hilary Mason (data scientist, founder of the technology startup Fast Forward Labs, and Data Scientist in Residence at Accel Partners) and Chris Wiggins (an associate professor of applied mathematics at Columbia University) articulated the OSEMN process. The OSEMN process is a taxonomy of tasks that all data scientists, new and old, should feel comfortable using (Brownlee, 2014). Throughout this blog, I will walk through each step and give some insight into it.

OSEMN stands for Obtain, Scrub, Explore, Model, and iNterpret. Simple enough, huh? It really is, and that is one of the main reasons it has been so widely accepted and used by inexperienced and experienced data scientists alike. Dr. Jason Brownlee, a machine learning expert, states, “It is a list of tasks a data scientist should be familiar and comfortable working on. Although, the authors point out that no data scientist will be an expert at all of them” (2014).

First, with anything related to a data science project, you must obtain the data, the “O” in our process. Mason and Dr. Wiggins make a strong point that manually generating data is neither scalable nor efficient. In this age of big data, when models need a plethora of data to give accurate responses, it is pertinent for data scientists to learn how to “automatically” obtain the data they need for a given problem (Brownlee, 2014). By “manually generating data,” I mean things like pointing and clicking with a mouse, copying and pasting data from documents, or taking photographs and labeling them. Mason and Dr. Wiggins suggest that you adopt a range of tools for acquiring the necessary data and choose the correct method for the job (Brownlee, 2014). They point to Unix command-line tools, SQL for databases, web scraping, and scripting in Python (Brownlee, 2014). Finally, Mason and Dr. Wiggins explain the importance of using APIs to access data efficiently. API stands for application programming interface; it is a way to interact with other servers, and one of the things you can do through an API is acquire data.
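To make that concrete, here is a minimal sketch of pulling data from an API with Python’s requests library. The endpoint URL and its response shape are assumptions for illustration, not a real service:

```python
import requests
import pandas as pd

# Hypothetical endpoint; swap in the API you are actually working with.
URL = "https://api.example.com/v1/observations"

response = requests.get(URL, params={"limit": 100}, timeout=10)
response.raise_for_status()  # fail loudly if the request did not succeed

# Many APIs return JSON; assuming this one returns a list of records,
# pandas can load it straight into a DataFrame.
df = pd.DataFrame(response.json())
print(df.head())
```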

Next in the process is the “S”, scrubbing data. Real data is inherently “messy”, due to several things like human error and randomness. There could be missing values, or random symbols that your model will have trouble dealing with, or your data could be of the wrong type; for instance, a value could be a number but stored as a string. Any of these can produce inaccurate results when modeling, or simply stop your model from working at all. Mason and Dr. Wiggins point out that data cleaning is the least sexy part of working on data problems (Brownlee, 2014). It will also be one of the most time-consuming parts of the project, but good clean data pays off when you feed it to your model, which in turn can lead to more accurate results. Python is a very commonly used tool for data science, and it has several great packages, such as pandas, to help import, clean, and manipulate data.
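As a taste of what scrubbing looks like in pandas, here is a minimal sketch; the file name and column are hypothetical stand-ins for your own data:

```python
import pandas as pd

# Hypothetical raw file and column, made up for illustration.
df = pd.read_csv("raw_data.csv")

# Drop exact duplicate rows.
df = df.drop_duplicates()

# A number stored as text ("1,200") will not behave like a number:
# strip the stray symbols first, then convert the type.
df["price"] = df["price"].astype(str).str.replace(",", "", regex=False)
df["price"] = pd.to_numeric(df["price"], errors="coerce")  # bad values become NaN

# Fill the remaining missing values with the column median.
df["price"] = df["price"].fillna(df["price"].median())
```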

Next in the process is “E”, explore data, also called exploratory data analysis (EDA). This happens before any hypothesis is tested or any predictions are made and evaluated. Data exploration is useful for getting to know your data (Brownlee, 2014). You want to get to know your data in order to build an idea of its structure, to get ideas for transforming it, and even to shortlist predictive models to use later down the line. Mason and Dr. Wiggins list several methods that are helpful in exploring your data; here are a few, with a small sketch after the list:

  • Histograms to summarize the distribution of individual data attributes.
  • Pairwise histograms to plot attributes against each other and highlight relationships and outliers.
  • Dimensionality reduction methods for creating lower-dimensional plots and models of the data.
  • Clustering to expose natural groupings in the data (Brownlee, 2014).
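In Python, the first two items take only a couple of lines. This sketch assumes a cleaned DataFrame like the one from the scrubbing step; the file name is again a placeholder:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assume this is the cleaned DataFrame from the scrubbing step.
df = pd.read_csv("clean_data.csv")

# Histograms: summarize the distribution of each individual attribute.
df.hist(figsize=(10, 8))

# Pairwise plots: attributes against each other, useful for spotting
# relationships and outliers.
pd.plotting.scatter_matrix(df.select_dtypes("number"), figsize=(10, 8))

plt.show()
```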

Next in the process is “M”, modeling your data. Brownlee states, “Model accuracy is often the ultimate goal for a given data problem. This means that the most predictive model is the filter by which a model is chosen” (2014). Generally, a model serves two purposes: prediction and interpretation. Prediction can be evaluated quantitatively, whereas interpretation is “softer” and qualitative. A model’s accuracy is judged by how well it predicts and performs on unseen data. It can be estimated using several methods, but one of the most widely known and used is cross-validation.
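Cross-validation splits the data into folds, holds one fold out as “unseen” data, trains on the rest, and rotates until every fold has been held out once. A minimal sketch with scikit-learn, using a toy dataset in place of your own features and labels:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A built-in toy dataset stands in for your own X and y.
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, score on the held-out fold,
# and repeat so every fold is used as unseen data once.
scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```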

Lastly, “N”: interpreting results. Brownlee states, “The predictive power of a model is determined by its ability to generalize” (2014). Mason and Dr. Wiggins suggest that the interpretive power of a model is its ability to suggest the most interesting experiments to perform next; it should give insights into the problem as well as the domain. They point to three main concerns when choosing a model that balances predictive power and interpretability (a short sketch of that trade-off follows the list):

  • Choose good features, the attributes of the data that you select to model.
  • Choose a good hypothesis space, constrained by the models and data transforms you select.
  • Choose a good representation, the form of the data that you obtain, most data is messy.
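One way to feel this trade-off is to weigh a linear model with readable coefficients against a more flexible but opaque one on the same cross-validation footing. This sketch again uses a scikit-learn toy dataset as a stand-in for your own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# An interpretable model: coefficients map directly onto features.
linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# A more flexible model: often more predictive, but harder to read.
forest = RandomForestClassifier(n_estimators=200, random_state=0)

for name, model in [("logistic", linear), ("forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```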

While data science is a complex web of math, computing, and modeling, this process really provides structure to those who may be lost, as I once was. I hope this blog post inspires and helps many people, young and old, new and experienced. I wish you all the best in your studies.

References:

Brownlee, J., 2014. How To Work Through A Problem Like A Data Scientist. [online] Machine Learning Mastery. Available at: <https://machinelearningmastery.com/how-to-work-through-a-problem-like-a-data-scientist/> [Accessed 26 June 2020].
