A step-by-step guide that helps data scientists with no shell experience set up a Python development environment for sharing reproducible notebooks.

https://xkcd.com/1987/

Update 12/12/2020: Using pyenv virtualenv to automatically activate and deactivate the virtual environment, as suggested by Joe Klemmer.

Recently, I treated myself to a new MacBook Pro and started setting up my Python development environment. This step is essential: if you get it wrong, plenty of issues arise and distract you from the actual development work. So I decided to write a comprehensive guide listing the steps to help others set up their Python development environment from scratch, so that you spend less time handling errors. Let’s get started!

Who is this article for?

  1. For data scientists who…


Lessons from Kaggle: Compare ensemble algorithms in terms of model accuracy, robustness, and generalization. Implementation included!

Ensemble methods (image by the author)

Introduction

We previously discussed some common ways to leverage the predictive power of machine learning (ML) models. These methods are mainly used to improve model generalizability by splitting the data according to particular schemes.


How I built a great UI for my PDF-to-audiobook converter

PDF to Audiobook
Final interface hosted at Gradio hub!

Introduction

One significant challenge for data scientists, data analysts, and machine learning engineers is showcasing and demoing their models to non-technical people. That often demands additional skills, including frontend development, backend development, and sometimes even DevOps. Even if you are skilled in these areas, it takes a tremendous amount of time to get the job done. Some libraries can handle this job for you and let you focus on model development. Gradio is a great example; it’s a Python library that generates an easy-to-use UI for a machine learning model.

This tutorial…


Lessons I learned from Kaggle for improving final model performance, including the pros and cons of each evaluation strategy and when to use each one.

My performance compared to other kagglers 🤔 — image by the author

Introduction

It is an excellent trait to learn from others and start from where they stopped. I also believe that the way to solve any problem is to understand it well and then evaluate different solutions until you arrive at a stable final one.

Since I started my account on Kaggle two years ago, my data science and machine learning skills have advanced dramatically. That improvement came from either competing or reading other Kagglers’ notebooks. However, whenever I made a submission, I never climbed the public leaderboard. I started to question myself. Why is that? …


This article aims to automate ML projects hosted on GitHub and reduce code breakage introduced via pull requests.

https://github.com/features/actions

Introduction

In my previous article, I walked you through a step-by-step tutorial on correctly setting up a Python 3 environment using pyenv on your local machine. However, issues can still arise from pull requests that use different Python versions or are missing modules.

This tutorial will demonstrate the steps to build a standard workflow for any project using GitHub Actions. Configuring GitHub Actions in the project CI folder protects the codebase from pull requests that don’t meet the code standards for formatting, syntax errors, and version conflicts. …


This article discusses proper ways to serialize objects for multiprocessing tasks, and how the more robust serialization package dill improves on the pickle package.

https://xkcd.com/2054/

Introduction

As a data scientist, you may sometimes need to send complex object hierarchies over a network or save your objects’ internal state to a disk or database for later use. To achieve this, you can use a process called serialization, which is fully supported by the standard library. This article will discuss the following:

  1. How to serialize an object using both the pickle and dill packages.
  2. When to use dill over pickle.
  3. How to initialize a class correctly so that it can be pickled.
  4. What multiprocessing is and how to use it in the context of serialization.
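
As a minimal sketch of the first two points, the snippet below round-trips a nested object with the standard-library `pickle` module and shows one case that `pickle` refuses: an anonymous lambda. The `experiment` dictionary and its contents are made up for illustration. `dill` is a third-party package that exposes the same `dumps`/`loads` interface and is designed to cover cases like the lambda; it is left out here so the snippet runs with the standard library alone.

```python
import pickle

# Made-up nested object standing in for a "complex object hierarchy".
original = {
    "name": "baseline",
    "weights": [0.1, 0.2, 0.3],
    "params": {"depth": 3, "tags": ("a", "b")},
}

# Round trip: serialize to bytes, then rebuild an equal but separate copy.
payload = pickle.dumps(original)
restored = pickle.loads(payload)
print(type(payload).__name__, restored == original)

# pickle stores functions by name, so an anonymous lambda cannot be
# looked up again and serialization fails.
square = lambda x: x * x
try:
    pickle.dumps(square)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as exc:
    lambda_pickled = False
    print("pickle could not serialize the lambda:", type(exc).__name__)
```

The restored object is a copy, not the same object, which is what makes it safe to hand to another process or to reload later.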

Pre-requisites

  1. Python 3 installed in your computer — check…


A comparison of the categorical encoding strategies commonly used in machine learning preprocessing pipelines. This guide aims to help you select the right strategy for your application.

Photo by Dan Gold on Unsplash

Introduction

This post discusses different strategies for encoding categorical variables, a preprocessing step necessary for developing reliable machine learning models. Encoding categorical variables is one of several steps that can improve model performance when applied appropriately. There are many encoding types; in this article, I picked the most widely used encoders and discuss, as far as I can, their pros and cons and when each one is appropriate. So, let’s get started.

Disclaimer: I attached an executable notebook, hosted on Kaggle, for each encoding strategy. …
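
To make the idea concrete, here is a minimal pure-Python sketch of two common strategies, ordinal (label) encoding and one-hot encoding; these may or may not be among the encoders the article goes on to cover. The `colors` column and the helper function names are made up for illustration, and in practice a library would usually handle this step.

```python
def ordinal_encode(values):
    """Map each category to an integer, ordered alphabetically."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping


def one_hot_encode(values):
    """Give each category its own 0/1 column, ordered alphabetically."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return rows, categories


colors = ["red", "green", "blue", "green"]  # made-up example column

codes, mapping = ordinal_encode(colors)
rows, columns = one_hot_encode(colors)

print(codes)    # [2, 1, 0, 1]
print(columns)  # ['blue', 'green', 'red']
print(rows)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Ordinal encoding keeps a single column but implies an order between categories, while one-hot encoding avoids that implied order at the cost of one column per category.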


Deploy a machine learning API to Heroku using a CircleCI pipeline with fewer commands

Photo by timJ on Unsplash

In the previous post, I gave a high-level overview of the steps required to develop a machine learning pipeline according to industry best practices. In this part, I will continue by building accessible REST API endpoints and deploying them to the Heroku platform using a CI/CD pipeline.

Most of the code in this post is already explained in part 1.

The project repo is on GitHub in case you want to follow along.

Table of contents

  1. Develop the REST API package
  2. Define the required CircleCI configuration
  3. Deploy to the Heroku platform
  4. Test out one endpoint

Let’s break it down one by one to understand the structure.

What is a REST API?

REST…


Develop customized ML pipelines from describing the business problem to deployment. This extensive guide includes Docker and packaging configurations.

Photo by Jason Leung on Unsplash

Machine learning models deliver tremendous value only when they reach end users. The end user could be a recommender system on a real-estate platform that suggests properties to renters or investors, such as Zillow.

However, machine learning projects can only be successful when a model is deployed and its predictions are being served.

I was surprised that machine learning deployment is rarely discussed online; it is a skill you need to learn in practice.

I tried googling this topic, but mostly found blog posts about setting up Flask APIs for machine learning models. …


Build a strong argument for picking one classification algorithm over another based on the statistical significance of the performance difference

Photo by Alexis Fauvet on Unsplash

Introduction

Picking the right machine learning algorithm is a decisive choice because it determines the model’s performance. The dominant factor in choosing a model is its performance, typically estimated with k-fold cross-validation so that each evaluation is independent of the data used for training.

The chosen model is usually the one with the higher mean performance. Sometimes, however, that difference arises from a statistical fluke. To address this concern, there are several statistical hypothesis tests for evaluating the difference in mean performance across the cross-validation folds. If the `p-value` is below the significance level, we can reject the null hypothesis that the two algorithms perform the same and conclude that the difference is significant.
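
As a rough sketch of how such a comparison can work, the snippet below runs an exact paired permutation (sign-flip) test on per-fold scores using only the standard library. This is one possible test, not necessarily the one the article uses. The fold scores, the function name, and the 0.05 significance level are all made up for illustration.

```python
from itertools import product

# Made-up accuracy scores for two models on the same 10 folds.
scores_a = [0.81, 0.84, 0.79, 0.86, 0.83, 0.80, 0.85, 0.82, 0.84, 0.81]
scores_b = [0.78, 0.80, 0.79, 0.82, 0.80, 0.79, 0.81, 0.80, 0.80, 0.79]


def paired_permutation_p_value(a, b):
    """Two-sided exact sign-flip test on the per-fold differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis the sign of each difference is arbitrary,
    # so enumerate every one of the 2**n sign assignments.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point rounding.
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


alpha = 0.05  # significance level, assumed here and fixed in advance
p_value = paired_permutation_p_value(scores_a, scores_b)
reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, reject null: {reject_null}")
```

A p-value below `alpha` means the observed gap would be unlikely if the two models truly performed the same, so the null hypothesis is rejected.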

Salma Elshahawy, MSc.

Data science graduate student at CUNY with a software engineering background. I build things that live on the internet http://linkedin.com/in/salma-elshahawy
