A step-by-step guide that helps data scientists with no shell experience set up a Python development environment for sharing reproducible notebooks.


Update 12/12/2020: Using pyenv virtualenv to automatically activate and deactivate the virtual environment (suggested by Joe Klemmer).

Recently, I treated myself to a new MacBook Pro and started setting up my development environment for Python. This step is essential: if you do it the wrong way, plenty of issues will arise and distract you from the actual development work. So, I’ve decided to write a comprehensive guide listing the steps to help others set up their Python development environment from scratch, so that you spend less time handling errors. Let’s get started!

Who is this article for?

  1. For data scientists who…

A step-by-step guide to automating data extraction jobs from an EC2 instance in 5 minutes

Photo by Debby Hudson on Unsplash


According to Wikipedia, cron is a software utility that provides time-based job scheduling on Unix-like operating systems. Cron runs in the background, and the jobs it schedules are known as “cron jobs”.

This guide provides a high-level overview of how to schedule jobs using cron’s proper syntax. It further goes over some ways to make job schedules more straightforward to write and understand.
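As a rough illustration of the syntax mentioned above: a standard crontab entry has five space-separated schedule fields (minute, hour, day of month, month, day of week) followed by the command to run. The sketch below only labels those fields; the helper name `describe_cron` and the script path are hypothetical and not taken from the guide.

```python
# Hypothetical helper: label the five schedule fields of a crontab entry.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(entry: str) -> dict:
    """Split a crontab line into its five schedule fields and the command."""
    parts = entry.split(None, 5)  # at most 6 pieces: 5 fields + the command
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    schedule["command"] = parts[5]
    return schedule


# Example entry (the script path is made up): run every day at 02:30.
job = describe_cron("30 2 * * * python3 /home/ubuntu/extract.py")
print(job)
```

An asterisk in a field means "every value", so the entry above fires at minute 30 of hour 2 on every day of every month.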


To complete this guide, you’ll need:

1. Access to an AWS EC2 instance (this tutorial is based on Ubuntu 18.04)

2. EC2 credentials, stored in a .pem file

3. A secure file transfer application such as WinSCP for…

How I built a great UI for a complicated sketch recognition model

Sketch prediction application — image by the author
the final deployed application — video by the author


Data scientists excel at creating models that represent and predict real-world data, but actually putting machine learning models into use is more of an art than a science. Deployment requires skills more commonly found in software engineering and DevOps. VentureBeat states that 87% of data science projects never make it to production, while Redapt puts the figure at 90%. Both highlight that collaboration and iteration as a team are a significant factor in determining success or failure.

A machine learning model’s goal is to solve a problem, and it can only do that when consumed through…

Lessons from Kaggle: Compare ensemble algorithms in terms of model accuracy, robustness, and generalization. Implementation included!

Ensembles methods — image by the author


We previously discussed some of the common ways to leverage the predictive power of Machine Learning (ML) models. These methods are mainly used to improve model generalizability by splitting the data according to particular schemes.

How I built a great UI for my PDF-to-audiobook converter

PDF to Audiobook
Final interface hosted at Gradio hub!


One significant challenge for data scientists, data analysts, and machine learning engineers is showcasing and demoing their models to non-technical audiences. That often demands additional skills, including frontend development, backend development, and sometimes even DevOps. Even if you are skilled in these areas, it takes a tremendous amount of time to get the job done. Some libraries can handle this job for you and let you focus more on model development tasks. Gradio is a great example; it’s a Python library that generates an easy-to-use UI for a machine learning model.

This tutorial…

Lessons I learned from Kaggle to enhance final model performance, including the pros and cons of each evaluation strategy and when to use it.

My performance compared to other kagglers 🤔 — image by the author


It is an excellent trait to learn from others and start from where they stopped. I also believe that solving any problem starts with understanding it well and then weighing different solutions until a stable final one emerges.

Since I started my account on Kaggle two years ago, my data science and machine learning skills have advanced dramatically. That improvement came from either competing or reading other Kagglers’ notebooks. However, whenever I made a submission, I never climbed the public leaderboard. I started to question myself. Why is that? …

This article aims to automate ML projects hosted on GitHub and reduce code breakage introduced via pull requests.



In my previous article, I walked you through a step-by-step tutorial on correctly setting up a Python 3 environment using pyenv on your local machine. However, issues may still arise when pull requests use different Python versions or rely on missing modules.

This tutorial will demonstrate the steps to build a standard workflow for any project using GitHub Actions plugins. Configuring GitHub Actions in the project CI folder protects the codebase from pull requests that don’t meet the code standards for formatting, syntax errors, and version conflicts. …

This article discusses the proper ways to serialize objects when working with multiprocessing tasks. The more robust serialization package dill improves on the pickle package here.



As a data scientist, you may sometimes need to send complex object hierarchies over a network or save your objects’ internal state to a disk or database for later use. To achieve this, you can use a process called serialization, which is fully supported by the standard library. This article will discuss the following:

  1. How to serialize an object using both pickle and dill packages.
  2. When to use dill over pickle
  3. The right way to initialize a class that will be pickled
  4. What is multiprocessing and how to use it in the context of serialization?
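To make the first two points concrete, here is a minimal sketch that uses only the standard library `pickle` module. The sample data is made up for illustration, and dill itself is not shown because it is a third-party package; the lambda check only hints at the kind of object that plain pickle rejects.

```python
import pickle

# A nested object hierarchy made of built-in types (illustrative data only).
state = {"model": "baseline", "params": {"depth": 3, "rates": [0.1, 0.01]}}

# Serialize to bytes, e.g. to send over a network or write to disk.
payload = pickle.dumps(state)

# Deserialize: the result is an equal but distinct copy of the original.
restored = pickle.loads(payload)
print(restored == state, restored is state)

# The standard pickle module cannot serialize lambdas, one of the gaps
# that third-party packages such as dill aim to cover.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_ok = True
except (pickle.PicklingError, AttributeError):
    lambda_ok = False
print("lambda picklable with pickle:", lambda_ok)
```

The same limitation matters for multiprocessing, because objects passed between processes are pickled behind the scenes.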


  1. Python 3 installed in your computer — check…

A comparison of the categorical encoding strategies commonly used in the preprocessing pipelines of machine learning models. This guide aims to help you select the right strategy for your application.

Photo by Dan Gold on Unsplash


This post will discuss the different strategies for encoding categorical variables, a preprocessing step necessary to develop reliable machine learning models. Encoding categorical variables is one of several actions that can enhance model performance if applied appropriately. There are different encoding types; in this article, I picked the most widely used encoders and discuss, as far as I can, their pros and cons and when each is appropriate. So, let’s get started.
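As a small illustration of what encoding a categorical variable means, the sketch below implements one-hot encoding in plain Python. One-hot encoding is one common strategy; whether it matches the encoders covered in the article is an assumption, and the function name `one_hot_encode` and the sample column are hypothetical.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column matching this value
        rows.append(row)
    return categories, rows


colors = ["red", "green", "blue", "green"]  # made-up sample column
categories, encoded = one_hot_encode(colors)
print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each distinct category becomes its own column, which avoids implying an order between categories but widens the dataset as the number of categories grows.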

Disclaimer: I attached an executable notebook for each encoding strategy, hosted on Kaggle. …

Deploy the Machine Learning API to Heroku using the CircleCI pipeline with fewer commands

Photo by timJ on Unsplash

In the previous post, I gave a high-level overview of the steps required to develop a machine learning pipeline according to industry best practices. In this part, I will continue by building accessible REST API endpoints and deploying them to the Heroku platform using a CI/CD pipeline.

Most of the code in this post is already explained in part 1.

The project repo is on GitHub in case you want to follow along.

Table of contents

  1. Develop the REST API package
  2. Define the required CircleCI configuration
  3. Deploy to the Heroku platform
  4. Test one endpoint

Let’s break it down one by one to understand the structure.

What is a REST API?


Salma El Shahawy

Data Engineer at Fortune Magazine. https://www.linkedin.com/in/salma-elshahawy/
