{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Neural Networks for Collaborative Filtering\n",
"> Using neural networks to learn the embeddings and how to combine them\n",
"\n",
"- toc: true \n",
"- badges: true\n",
"- comments: true\n",
"- author: Nipun Batra\n",
"- categories: [ML]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recently, I had a chance to read an interesting WWW 2017 paper entitled: [Neural Collaborative Filtering](https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf). The first paragraph of the abstract reads as follows:\n",
"\n",
">In recent years, deep neural networks have yielded immense success on speech recognition, computer vision and natural language processing. However, the exploration of deep neural networks on recommender systems has received relatively less scrutiny. In this work, we strive to develop techniques based on neural networks to tackle the key problem in recommendation — collaborative filtering — on the basis of implicit feedback.\n",
"\n",
"I'd recently written a [blog post](../recommend-keras.html) on using Keras (a deep learning library) to implement traditional matrix factorization based collaborative filtering, so I thought I'd get my hands dirty and build a prototype of the paper mentioned above. The authors have already provided their [code on GitHub](https://github.com/hexiangnan/neural_collaborative_filtering), which should serve as the reference implementation for the paper; my post is purely educational!\n",
"\n",
"\n",
"Here's how the proposed network architecture looks in the paper:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a few terms that we need to understand:\n",
" \n",
"1. User (u) and Item (i) indices are used to look up low-dimensional embeddings for the user and the item.\n",
"2. Generalized Matrix Factorisation (GMF) combines the two embeddings via their element-wise product; summing that product gives the dot product used in our regular matrix factorisation.\n",
"3. The multi-layer perceptron (MLP) branch learns its own embeddings for users and items. However, instead of taking a dot product of these to obtain the rating, we concatenate them to create a feature vector, which is passed on to further layers.\n",
"4. Neural MF (NeuMF) then combines the outputs of the MLP and GMF branches to obtain the final prediction."
]
},
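{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the four terms above concrete, here is a minimal, untrained sketch of one forward pass in plain Python. It is not the authors' implementation: the embedding size `dim`, the single hidden layer, and the helper names `random_matrix`, `dense` and `neumf_forward` are all assumptions made for illustration.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"random.seed(0)\n",
"n_users, n_items, dim = 943, 1682, 4  # MovieLens-100k counts; dim is an arbitrary choice\n",
"\n",
"def random_matrix(n_rows, n_cols):\n",
"    # Small random values standing in for learned parameters\n",
"    return [[random.gauss(0.0, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]\n",
"\n",
"def dense(x, weights, bias, relu=True):\n",
"    # Fully connected layer: one output per weight row, optionally followed by ReLU\n",
"    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]\n",
"    return [max(0.0, o) for o in out] if relu else out\n",
"\n",
"# Separate embedding tables for the GMF and MLP branches\n",
"gmf_user, gmf_item = random_matrix(n_users, dim), random_matrix(n_items, dim)\n",
"mlp_user, mlp_item = random_matrix(n_users, dim), random_matrix(n_items, dim)\n",
"\n",
"# Hypothetical layer sizes: one hidden MLP layer, then an output layer over both branches\n",
"w_hidden, b_hidden = random_matrix(dim, 2 * dim), [0.0] * dim\n",
"w_out, b_out = random_matrix(1, 2 * dim), [0.0]\n",
"\n",
"def neumf_forward(u, i):\n",
"    gmf = [p * q for p, q in zip(gmf_user[u], gmf_item[i])]     # element-wise product\n",
"    mlp = dense(mlp_user[u] + mlp_item[i], w_hidden, b_hidden)  # concatenate, then hidden layer\n",
"    return dense(gmf + mlp, w_out, b_out, relu=False)[0]        # combine both branches\n",
"\n",
"score = neumf_forward(195, 241)\n",
"```\n",
"\n",
"Training would adjust the embedding tables and layer weights so that `score` approaches the observed rating; here they are only random placeholders."
]
},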
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As done in my previous post, I'll use the MovieLens-100k dataset for illustration. Please refer to my [previous post](../recommend-keras.html) for more details."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Peek into the dataset"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"dataset = pd.read_csv(\"/Users/nipun/Downloads/ml-100k/u.data\",sep='\\t',names=\"user_id,item_id,rating,timestamp\".split(\",\"))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
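"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>user_id</th>\n",
"      <th>item_id</th>\n",
"      <th>rating</th>\n",
"      <th>timestamp</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>196</td>\n",
"      <td>242</td>\n",
"      <td>3</td>\n",
"      <td>881250949</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>186</td>\n",
"      <td>302</td>\n",
"      <td>3</td>\n",
"      <td>891717742</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>22</td>\n",
"      <td>377</td>\n",
"      <td>1</td>\n",
"      <td>878887116</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>244</td>\n",
"      <td>51</td>\n",
"      <td>2</td>\n",
"      <td>880606923</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>166</td>\n",
"      <td>346</td>\n",
"      <td>1</td>\n",
"      <td>886397596</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"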
],
"text/plain": [
" user_id item_id rating timestamp\n",
"0 196 242 3 881250949\n",
"1 186 302 3 891717742\n",
"2 22 377 1 878887116\n",
"3 244 51 2 880606923\n",
"4 166 346 1 886397596"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, each record (row) shows the rating for a (user, item) pair, where each item is a movie. I use item and movie interchangeably in this post."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(943, 1682)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(dataset.user_id.unique()), len(dataset.item_id.unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We assign each user a unique integer in the range [0, #users) and do the same for movies."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"dataset.user_id = dataset.user_id.astype('category').cat.codes.values\n",
"dataset.item_id = dataset.item_id.astype('category').cat.codes.values"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
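"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>user_id</th>\n",
"      <th>item_id</th>\n",
"      <th>rating</th>\n",
"      <th>timestamp</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>195</td>\n",
"      <td>241</td>\n",
"      <td>3</td>\n",
"      <td>881250949</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>185</td>\n",
"      <td>301</td>\n",
"      <td>3</td>\n",
"      <td>891717742</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>21</td>\n",
"      <td>376</td>\n",
"      <td>1</td>\n",
"      <td>878887116</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>243</td>\n",
"      <td>50</td>\n",
"      <td>2</td>\n",
"      <td>880606923</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>165</td>\n",
"      <td>345</td>\n",
"      <td>1</td>\n",
"      <td>886397596</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"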
],
"text/plain": [
" user_id item_id rating timestamp\n",
"0 195 241 3 881250949\n",
"1 185 301 3 891717742\n",
"2 21 376 1 878887116\n",
"3 243 50 2 880606923\n",
"4 165 345 1 886397596"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train test split\n",
"\n",
"We'll now split our dataset of 100k ratings into train (containing 80k ratings) and test (containing 20k ratings). Given the train set, we'd like to accurately estimate the ratings in the test set."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"train, test = train_test_split(dataset, test_size=0.2)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"