Recommendation systems & Tools for Data Imputation¶

Overview¶

The program is divided into several sections, each presented in its own notebook.

Note: GitHub sometimes fails to render notebook files and shows empty outputs. Please download the notebooks to run them locally, or visit the links on my other sites.

Introduction¶

Welcome to my personal project on popular tools that ease the process of imputing data. The work is demonstrated on recommendation systems built from Amazon reviews.

Recommendation systems are now one of the keys to the success of e-commerce platforms such as Amazon and other online retailers, because:

They help users find the right product.
They point users to other interesting products, which can increase user engagement. For example, there are 40% more clicks on Google News due to recommendations.
They help item providers deliver items to the right users. On Amazon, 35% of products are sold due to recommendations.
They personalize content based on what users like. On Netflix, most of the rented movies come from recommendations.

Missing data is a real-world problem that can affect every type of data. It has many causes, such as inconsistent data collection, data damaged by storage crashes or during transfer, human errors and biases, and conflicts between datasets. It is problematic because accurate analysis needs complete data.

Simply deleting records with missing values does not work well in many cases, especially when the missing data makes up a large proportion of the dataset or when the dataset is small. Hence, data imputation is very important in machine learning. It helps preserve data and extract more meaning from incomplete records. Data engineers or scientists can check the data quality manually to find missing data, then clean it by filling empty values with simple summary statistics such as the mean, median, or mode.
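The simple filling strategies mentioned above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the `reviews` table and its column names are hypothetical, and the choice of statistic per column is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy review table; the column names are illustrative only.
reviews = pd.DataFrame({
    "rating": [5.0, np.nan, 3.0, 4.0, np.nan],
    "price": [10.0, 12.5, np.nan, 11.0, 300.0],
    "category": ["book", None, "toy", "book", "book"],
})

imputed = reviews.copy()
# Mean: a common default for roughly symmetric numeric columns.
imputed["rating"] = imputed["rating"].fillna(imputed["rating"].mean())
# Median: more robust when outliers are present (note the 300.0 price).
imputed["price"] = imputed["price"].fillna(imputed["price"].median())
# Mode: the most frequent value, a usual choice for categorical columns.
imputed["category"] = imputed["category"].fillna(imputed["category"].mode()[0])

print(imputed)
```

These fills keep every row but shrink the variance of the column, which is one reason model-based imputation is often preferred when many values are missing.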

Besides, some machine learning algorithms such as KNN (K-Nearest Neighbours), SVM (Support Vector Machines), or Random Forest can also be used to deal with missing data. KNN estimates a missing value from the K records that are nearest on the other fields, so it is generally effective for imputation. Random Forest copes with various types of missing data because it captures interactions and nonlinearity and scales to high dimensions. Read more here and other algorithms here.
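As a rough sketch of KNN-based imputation, scikit-learn provides `KNNImputer`, which fills each missing entry with the average of that feature over the nearest rows. The small matrix below is made-up data, and `n_neighbors=2` is an arbitrary choice for the example rather than a recommended setting.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up feature matrix with two missing entries (np.nan).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that column over the
# 2 nearest rows, with distance computed on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Because distances are computed on raw feature values, scaling the columns first is usually worth considering when features have very different ranges.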

Types of Recommendation Systems¶

There are three main types of recommendation systems:

Content-based recommendations. These rely on information about the content of the items (such as the price of a product or its color) rather than on user opinions. The main idea is that if a user likes an item, he or she will also like other similar items.
Collaborative Filtering. It is based on the assumption that people like things similar to other things they liked, and things that are liked by other people with similar tastes. It comes in three main variants: a) user-user; b) item-item; and c) a mix of user-based and item-based.
Hybrid Approaches. These combine collaborative filtering, content-based filtering, and other approaches.
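To make the item-item variant of collaborative filtering concrete, here is a minimal NumPy sketch. It assumes a tiny, made-up rating matrix where 0 means "not rated", uses cosine similarity between item columns, and scores unseen items by a similarity-weighted average of the user's own ratings. The function names are hypothetical and this is only one of several common formulations.

```python
import numpy as np

# Made-up user-item ratings: rows are users, columns are items, 0 = not rated.
ratings = np.array([
    [5, 5, 0, 1, 0],
    [4, 5, 5, 0, 1],
    [5, 4, 4, 1, 0],
    [1, 0, 1, 5, 5],
    [0, 1, 0, 4, 5],
], dtype=float)

def item_similarity(r):
    """Cosine similarity between item columns."""
    norms = np.linalg.norm(r, axis=0)
    norms[norms == 0] = 1.0  # guard against items nobody rated
    unit = r / norms
    return unit.T @ unit

def predict_scores(r, sim):
    """Similarity-weighted average of each user's existing ratings."""
    rated = (r > 0).astype(float)
    weights = rated @ np.abs(sim)
    weights[weights == 0] = 1.0  # guard against users with no ratings
    return (r @ sim) / weights

def recommend(user, r, scores, top_n=1):
    """Return the top_n unrated items for a user, best first."""
    unseen = np.where(r[user] == 0)[0]
    ranked = unseen[np.argsort(scores[user, unseen])[::-1]]
    return ranked[:top_n].tolist()

sim = item_similarity(ratings)
scores = predict_scores(ratings, sim)
print(recommend(0, ratings, scores, top_n=2))
```

User 0 rated items 0 and 1 highly, and item 2 is rated similarly to those by other users, so it ranks above item 4 in this toy example.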

Other approaches:

Popularity-based systems. These recommend the items that most people have viewed, purchased, and rated highly. The recommendation is not personalized.
Classification model-based. These use the features of the user and apply a classification algorithm to decide whether the user is interested in the product or not.
Association rule mining. Association rules capture the relationships between items based on their patterns of co-occurrence across transactions in large databases. The goal is to identify strong rules using measures of interestingness.
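The measures of interestingness used in association rule mining are typically support, confidence, and lift. The sketch below computes them for item pairs in pure Python. The transactions, the thresholds, and the helper names are all assumptions for illustration; a real project would more likely use a dedicated library and consider larger itemsets.

```python
from itertools import combinations

# Made-up shopping baskets, one set of items per transaction.
transactions = [
    {"phone", "case"},
    {"phone", "case", "charger"},
    {"phone", "charger"},
    {"case"},
    {"phone", "case"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def pair_rules(transactions, min_support=0.3, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence, lift) for rules lhs -> rhs."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # the pair is too rare to form a reliable rule
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs}, transactions)
            lift = confidence / support({rhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence, lift))
    return rules

rules = pair_rules(transactions)
for lhs, rhs, sup, conf, lift in rules:
    print(f"{lhs} -> {rhs}: support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than independence would predict, while a lift below 1 suggests the opposite even when confidence looks high.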

Setup¶

Clone or download the repo¶

First get local copies of the program:

$ git clone https://github.com/linhhlp/Recommendation-system-Amazon-review.git

Or download from: https://github.com/linhhlp/Recommendation-system-Amazon-review/archive/main.zip

Install the dependencies¶

This program has been developed and tested on:

python 3.9.10
pandas 1.4.1
notebook 6.4.8
numpy 1.22.2
tensorflow 2.6.0
scikit-learn 1.0.2
matplotlib 3.5.1
seaborn 0.11.2
statsmodels 0.13.2
Kaggle learntools

The quickest, easiest way to install is to use Anaconda:

Installing with anaconda¶

Install Anaconda.

Then use the command line to create an environment and install the packages:

$ conda env create
$ conda activate new_env

Install the remaining dependencies with:

$ conda install tensorflow scikit-learn seaborn
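After installing, one way to compare the environment against the versions listed above is a short check with the standard library's `importlib.metadata`. This is only a convenience sketch: the `TESTED` mapping and the `installed_versions` helper are hypothetical names, and the lookup assumes the packages expose standard distribution metadata under their PyPI names.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names and the versions this README lists as tested.
TESTED = {
    "pandas": "1.4.1",
    "notebook": "6.4.8",
    "numpy": "1.22.2",
    "tensorflow": "2.6.0",
    "scikit-learn": "1.0.2",
    "matplotlib": "3.5.1",
    "seaborn": "0.11.2",
    "statsmodels": "0.13.2",
}

def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, installed in installed_versions(TESTED).items():
    status = installed or "not installed"
    print(f"{name}: installed {status}, tested with {TESTED[name]}")
```

A different installed version does not necessarily mean the notebooks will fail; the listed versions are simply the ones the program was tested on.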

References¶

Databases¶

The datasets come from Kaggle and from the original source.

Citations:

R. He, J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016.

J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015.