Scalable Machine Learning for Small Teams¶

List of Chapters¶

  1. Introduction & Overview
  2. Data Science & Python
  3. Self-Hosted Services & Serverless Functions
  4. Containers & Batch/Streaming Pipelines
  5. Building Batch Model Pipelines: Training and Predicting batch processes (Ongoing)

Introduction¶

Nowadays, data scientists are expected to build distributed systems that are both scalable and robust: systems that run programs in parallel yet remain resilient enough to recover from failures. In this project, I build such a scalable system with proven tools such as PySpark, which let a data scientist build end-to-end programs more efficiently and quickly.

Small teams¶

For small teams (start-ups, small companies, or projects with a limited budget and resources), we want to take full advantage of existing tools, such as managed cloud environments and established ML libraries.

Google Cloud Platform provides many solid, managed environments and tools. Take Kafka as an example: if we host Kafka services ourselves, there are many tasks to handle, including managing the servers, adding worker nodes, troubleshooting problems, and applying updates and bug fixes, which means we need more data engineers. Google Cloud Dataflow or Pub/Sub can provide the same functionality with far less overhead for managing and maintaining servers, so the team can focus on building the model with little concern for servers, environments, and dependencies.
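To illustrate how thin the client side of a managed service can be, here is a minimal sketch that publishes a user event to Pub/Sub with the official Python client. The project and topic names are hypothetical, and both must already exist.

```python
from google.cloud import pubsub_v1

# Hypothetical project and topic names for illustration only.
project_id = "my-gcp-project"
topic_id = "user-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

# Publish one user event; Pub/Sub payloads are bytestrings.
future = publisher.publish(topic_path, b'{"user_id": 42, "event": "watch"}')
print(future.result())  # blocks until acknowledged, returns the message id
```

Compare this with operating a Kafka cluster: there are no brokers to size, patch, or monitor, which is exactly the trade-off a small team wants.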

In this project, I mainly use services from GCP.

System Design¶

Problem¶

We have many users who do various activities (shopping, playing games, watching movies), and we want to make recommendations based on each user's history and profile. However, we want to update the model frequently, because users constantly watch or play new items and their tastes change (from romantic to thriller movies, or from buying men's products to women's products).

To make life easier, we must do this smartly, automating as much of the work as the machine can handle.

Solution¶

For example, we log user behavior and the results of previous recommendations to a Data Lake or Warehouse; then a Scheduler calls the Model Trainer every day (say) to train the model and save the new model to Model Storage.

Once we have a trained model, we can make predictions in 2 scenarios:

  1. Batch Modeling: For example, when a user opens an app or website, no new interaction has happened yet, so we make a prediction based on previous behavior. We can do this in advance by running the Batch Recommender. The benefit is low latency when showing the list to the user, because the list was already generated.

  2. Streaming Modeling: Whenever a user interacts with another user or item, we can suggest the next users or items that the user might interact with by using the On-Demand Recommender. For example, if a user watched a thriller movie, we can show some other thriller movies. Because prediction takes some time (depending on the algorithm used), we can run the predicting service in the background right after the user starts interacting. The right solution depends on the situation, and I am happy to discuss it in the next project.

System-Design1.jpg

Training Model¶

The main code of the program is written in Python, wrapped in Docker, and deployed on Google Kubernetes Engine (GKE). The first step is to push the image to a Docker registry that can interface with the orchestration system, in this case Google Container Registry.

The data flow:

  1. Logging data from users: Google Cloud Pub/Sub saves data into Google BigQuery, either directly or via an intermediate service.

  2. A Scheduler built on Apache Airflow runs the Model Trainer. Airflow can be deployed either as a self-hosted solution (on Kubernetes or a private server) or via Google Cloud Composer.

  3. The Model Trainer is deployed on Google Kubernetes Engine and uses PySpark to scale distributed machine learning to massive data volumes.

  4. The Model Trainer can use Spark's built-in machine learning library, MLlib, which fully supports distributed ML. We can also use external frameworks like Scikit-Learn or TensorFlow, but we must choose algorithms and levels of abstraction carefully so that training can be distributed. For example, TensorFlowOnSpark or spark-tensorflow-distributor could be a great choice for running TensorFlow on Spark.

If the data is not large, we can convert the Spark DataFrame to a pandas DataFrame and train the TensorFlow model directly.

  5. After training successfully, we save the model to storage. To make the trained model highly accessible and reliable, we should use persistent storage such as Google Cloud Storage, which handles general-purpose data well. (A minimal training sketch follows this list.)
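To make the training flow concrete, here is a minimal sketch of a Model Trainer built on MLlib's ALS recommender. The BigQuery table, column names, and Cloud Storage path are hypothetical assumptions, and reading from BigQuery assumes the spark-bigquery connector is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("model-trainer").getOrCreate()

# Hypothetical interaction table (user_id, item_id, rating) in BigQuery;
# reading it requires the spark-bigquery connector.
ratings = (
    spark.read.format("bigquery")
    .option("table", "my-project.logs.user_ratings")
    .load()
)

# ALS is MLlib's distributed collaborative-filtering recommender.
als = ALS(
    userCol="user_id",
    itemCol="item_id",
    ratingCol="rating",
    coldStartStrategy="drop",
)
model = als.fit(ratings)

# For small datasets we could instead pull everything to the driver
# (ratings.toPandas()) and train a single-node TensorFlow model.

# Persist the trained model to a hypothetical Cloud Storage path.
model.write().overwrite().save("gs://my-models/als/latest")
```

ALS is a natural fit here because collaborative filtering matches the recommendation problem described above and trains in a fully distributed fashion across the Spark cluster.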

System-Design2.jpg

Predicting Model¶

Batch Recommender¶

The predicted results are saved to a database so that the app can serve recommendations quickly when a user opens it on a phone or the web. We make a prediction for each user. If there are too many users, we can limit the workload by cutting off users who are not online frequently or have not logged in for a long time, as sketched below.
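As a small sketch of that cutoff, assuming a hypothetical users table with a last_login column:

```python
from datetime import date
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("user-filter").getOrCreate()

# Hypothetical users table; real data would come from BigQuery.
users = spark.createDataFrame(
    [(1, date(2024, 6, 1)), (2, date(2021, 1, 15))],
    ["user_id", "last_login"],
)

# Only batch-predict for users who logged in within the last 30 days.
active = users.filter(F.col("last_login") >= F.date_sub(F.current_date(), 30))
active.show()
```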

The data flow:

  1. A Scheduler built on Apache Airflow triggers the Predicting Service to start batch predictions (see the DAG sketch after this list).

  2. The Predicting Service pulls user data from the Data Lake or Warehouse (such as Google BigQuery) and trained models from Model Storage, uses batch prediction to generate a recommended list for each user, and saves this list to the App database, which could be Google Cloud Datastore (a NoSQL store). A mobile or web app later pulls this list to serve that particular user.
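Here is a hedged sketch of step 1 as an Airflow DAG that launches the containerized Predicting Service as a Kubernetes pod. The DAG id, image name, namespace, and schedule are all hypothetical, and the operator assumes the cncf.kubernetes provider package is installed.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# A daily DAG that runs the containerized Predicting Service on the
# cluster Airflow is configured against; all names are placeholders.
with DAG(
    dag_id="batch_recommender",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_batch_predictions = KubernetesPodOperator(
        task_id="run_batch_predictions",
        name="batch-recommender",
        namespace="default",
        image="gcr.io/my-project/predicting-service:latest",
        cmds=["python", "predict.py"],
    )
```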

The Predicting Service is containerized, deployed on Google Kubernetes Engine, and runs Apache Beam on Google Cloud Dataflow to scale to massive batch prediction workloads.
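The following is a minimal, runnable Beam sketch of that batch-scoring idea. The in-memory user ids and the stand-in recommend function are hypothetical placeholders for a BigQuery read, real model inference, and a Datastore write; switching from the local DirectRunner to Dataflow is a matter of pipeline options.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def recommend(user_id):
    # Stand-in for real model inference; returns three fake item ids.
    return {"user_id": user_id, "items": [user_id * 10 + i for i in range(3)]}


# Runs locally by default; pass --runner=DataflowRunner plus project,
# region, and staging options to execute on Google Cloud Dataflow.
options = PipelineOptions()

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Users" >> beam.Create([1, 2, 3])   # stand-in for a BigQuery read
        | "Score" >> beam.Map(recommend)
        | "Print" >> beam.Map(print)          # stand-in for a Datastore write
    )
```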

System-Design3_predicting-service

On-Demand Recommender¶

This scenario applies when people are online and interacting with items and other users. Depending on the workload, model latency, and availability requirements, we can deploy this endpoint service on a Serverless Function platform (AWS Lambda or Google Cloud Functions) or in a container environment that fully supports our dependencies and prerequisites (Google Kubernetes Engine).

However, we need an API Gateway for authentication, logging, statistics, and rate limiting. There are many benefits to using an API Gateway from a cloud provider, such as Google API Gateway, which eases the process because the service is fine-tuned and well managed.

The data flow:

  1. An app sends a request to an Endpoint Predictor for a prediction or recommendation. This could be a REST API call that goes through the API Gateway, which validates requests and limits their number if we do not want to expand computing resources.

  2. The Predictor service loads the trained model from Model Storage. It can reload the model every day or hour to pick up the latest version (see the sketch after this list).
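Here is a minimal sketch of such an endpoint as a Python Cloud Function using the Functions Framework. The stand-in model, reload interval, and item ids are hypothetical; a real service would deserialize the model from Cloud Storage inside _get_model.

```python
import json
import time

import functions_framework

# Hypothetical stand-in for a model held in memory between requests.
_MODEL = {"loaded_at": 0.0, "top_items": {}}
_RELOAD_SECONDS = 3600  # refresh the model at most once an hour


def _get_model():
    """Lazily (re)load the model so newly trained versions are picked up."""
    if time.time() - _MODEL["loaded_at"] > _RELOAD_SECONDS:
        # Placeholder load: in practice, fetch and deserialize the latest
        # model from Model Storage (e.g. a Cloud Storage bucket).
        _MODEL["top_items"] = {"thriller": [101, 202, 303]}
        _MODEL["loaded_at"] = time.time()
    return _MODEL


@functions_framework.http
def recommend(request):
    """HTTP endpoint, e.g. GET /?genre=thriller -> JSON list of item ids."""
    genre = request.args.get("genre", "thriller")
    items = _get_model()["top_items"].get(genre, [])
    return (json.dumps({"items": items}), 200,
            {"Content-Type": "application/json"})
```

Because serverless instances are recycled, this reload-on-expiry pattern keeps cold starts cheap while still bounding how stale the served model can be.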

System-Design3.jpg

What's Next?¶

  1. Step 1: We briefly review Data Science, Python and its packages, and how to interact with Google Cloud.
  2. Step 2: We build a simple model on a server and deploy it on Serverless Function services.
  3. Step 3: After building the model in development, we need to bring it to a production environment identical to the one used during development. Docker provides a container that reproduces the same environment in isolation. We also explore batch model pipelines with PySpark, Dataflow, Workflow, and other tools.
  4. Step 4: Building Batch Model Pipelines: Training and Predicting batch processes (Ongoing)

