Out-of-matrix non-negative matrix factorisation#
What if we want to predict for entries not within the matrix?!
toc: true
badges: true
comments: true
author: Nipun Batra
categories: [ML]
I have written a bunch of posts on this blog about non-negative matrix factorisation (NNMF). However, all of them required the test user to be a part of the matrix that we factorise to learn the latent factors. Is that always the case? Read on to find out more!
Standard Problem#
Our goal is, given a matrix A, to decompose it into two non-negative factors, as follows:
\( A_{M \times N} \approx W_{M \times K} \times H_{K \times N} \), such that \( W_{M \times K} \ge 0\) and \( H_{K \times N} \ge 0\)
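As a quick illustration of this standard, fully observed setting, here is a minimal sketch using scikit-learn's NMF. This snippet is an addition for illustration only; the rest of the post handles missing entries by hand rather than via scikit-learn.

```python
import numpy as np
from sklearn.decomposition import NMF

A = np.abs(np.random.uniform(size=(20, 10)))    # a fully observed non-negative matrix
model = NMF(n_components=4, init='random', random_state=0, max_iter=500)
W = model.fit_transform(A)                      # shape (20, 4), non-negative
H = model.components_                           # shape (4, 10), non-negative
print(np.linalg.norm(A - np.dot(W, H), 'fro'))  # reconstruction error
```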
Our Problem: Out-of-matrix factorisation#
Imagine that we have trained the model for \(M-1\) users on N movies. Now, the \(M^{th}\) user has rated some movies. Do we retrain the model from scratch to accommodate the \(M^{th}\) user? That can be a very expensive operation!
Instead, as shown in the figure above, we will learn the user factor for the \(M^{th}\) user alone. We can do this because the shared movie factor (H) has already been learnt.
We can formulate this as follows:

\( A[M, :]_{1 \times N} \approx W[M, :]_{1 \times K} \times H_{K \times N} \)

Taking the transpose on both sides,

\( A[M, :]^T_{N \times 1} \approx H^T_{N \times K} \times W[M, :]^T_{K \times 1} \)

However, \(A[M,:]^T\) will have missing entries. Thus, we mask those entries out of the calculation, as shown below.

\( A[M, :]^T[mask] \approx H^T[mask, :] \times W[M, :]^T \)

Thus, we can write

\( W[M, :]^T = \mathrm{LeastSquares}\left( H^T[mask, :],\ A[M, :]^T[mask] \right) \)

If we instead want the factors to be non-negative, we can use non-negative least squares (NNLS) in place of ordinary least squares for this estimation.
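Before the full walkthrough below, here is a minimal, self-contained sketch of this single-user estimation step. The random H and the variable names here are purely illustrative (in the post, H comes from the factorisation of the other users' ratings).

```python
import numpy as np
from scipy.optimize import nnls

K, N = 4, 10
np.random.seed(1)
H = np.abs(np.random.uniform(size=(K, N)))   # pretend this movie factor is already learnt
a_new = np.abs(np.random.uniform(size=N))    # the new user's ratings
a_new[[2, 7]] = np.nan                       # movies the new user has not rated

mask = ~np.isnan(a_new)                      # use only the observed entries
# Ordinary least squares on the observed entries (may give negative factors)
w_ls = np.linalg.lstsq(H.T[mask], a_new[mask], rcond=None)[0]
# Non-negative least squares on the observed entries
w_nnls = nnls(H.T[mask], a_new[mask])[0]
print(w_ls, w_nnls)
```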
Code example#
I’ll now present a simple code example to illustrate the procedure.
Defining matrix A#
import numpy as np
import pandas as pd
M, N = 20, 10
np.random.seed(0)
A_orig = np.abs(np.random.uniform(low=0.0, high=1.0, size=(M,N)))
pd.DataFrame(A_orig).head()
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.548814 | 0.715189 | 0.602763 | 0.544883 | 0.423655 | 0.645894 | 0.437587 | 0.891773 | 0.963663 | 0.383442 |
1 | 0.791725 | 0.528895 | 0.568045 | 0.925597 | 0.071036 | 0.087129 | 0.020218 | 0.832620 | 0.778157 | 0.870012 |
2 | 0.978618 | 0.799159 | 0.461479 | 0.780529 | 0.118274 | 0.639921 | 0.143353 | 0.944669 | 0.521848 | 0.414662 |
3 | 0.264556 | 0.774234 | 0.456150 | 0.568434 | 0.018790 | 0.617635 | 0.612096 | 0.616934 | 0.943748 | 0.681820 |
4 | 0.359508 | 0.437032 | 0.697631 | 0.060225 | 0.666767 | 0.670638 | 0.210383 | 0.128926 | 0.315428 | 0.363711 |
Masking a few entries#
A = A_orig.copy()
A[0, 0] = np.NAN
A[3, 1] = np.NAN
A[6, 3] = np.NAN
# Masking for last user.
A[19, 2] = np.NAN
A[19, 7] = np.NAN
We will use the matrix A2 (the first 19 users) to learn the movie factors (H) and the user factors (W) for those 19 users.
A2 = A[:-1,:]
A2.shape
(19, 10)
A_df = pd.DataFrame(A)
A_df.head()
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | 0.715189 | 0.602763 | 0.544883 | 0.423655 | 0.645894 | 0.437587 | 0.891773 | 0.963663 | 0.383442 |
1 | 0.791725 | 0.528895 | 0.568045 | 0.925597 | 0.071036 | 0.087129 | 0.020218 | 0.832620 | 0.778157 | 0.870012 |
2 | 0.978618 | 0.799159 | 0.461479 | 0.780529 | 0.118274 | 0.639921 | 0.143353 | 0.944669 | 0.521848 | 0.414662 |
3 | 0.264556 | NaN | 0.456150 | 0.568434 | 0.018790 | 0.617635 | 0.612096 | 0.616934 | 0.943748 | 0.681820 |
4 | 0.359508 | 0.437032 | 0.697631 | 0.060225 | 0.666767 | 0.670638 | 0.210383 | 0.128926 | 0.315428 | 0.363711 |
Defining matrices W and H (learning on M-1 users and N movies)#
K = 4
W = np.abs(np.random.uniform(low=0, high=1, size=(M-1, K)))
H = np.abs(np.random.uniform(low=0, high=1, size=(K, N)))
W = np.divide(W, K*W.max())
H = np.divide(H, K*H.max())
pd.DataFrame(W).head()
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.078709 | 0.175784 | 0.095359 | 0.045339 |
1 | 0.006230 | 0.016976 | 0.171505 | 0.114531 |
2 | 0.135453 | 0.226355 | 0.250000 | 0.054753 |
3 | 0.167387 | 0.066473 | 0.005213 | 0.191444 |
4 | 0.080785 | 0.096801 | 0.148514 | 0.209789 |
pd.DataFrame(H).head()
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.239700 | 0.203498 | 0.160529 | 0.222617 | 0.074611 | 0.216164 | 0.157328 | 0.003370 | 0.088415 | 0.037721 |
1 | 0.250000 | 0.121806 | 0.126649 | 0.162827 | 0.093851 | 0.034858 | 0.209333 | 0.048340 | 0.130195 | 0.057117 |
2 | 0.024914 | 0.219537 | 0.247731 | 0.244654 | 0.230833 | 0.197093 | 0.084828 | 0.020651 | 0.103694 | 0.059133 |
3 | 0.033735 | 0.013604 | 0.184756 | 0.002910 | 0.196210 | 0.037417 | 0.020248 | 0.022815 | 0.171121 | 0.062477 |
Defining the cost that we want to minimise#
def cost(A, W, H):
    from numpy import linalg
    WH = np.dot(W, H)
    A_WH = A - WH
    return linalg.norm(A_WH, 'fro')
However, since A has missing entries, we have to define the cost in terms of the entries that are present in A.
def cost(A, W, H):
    from numpy import linalg
    mask = pd.DataFrame(A).notnull().values
    WH = np.dot(W, H)
    WH_mask = WH[mask]
    A_mask = A[mask]
    A_WH_mask = A_mask - WH_mask
    # Since A_WH_mask is now a vector, we use the L2 norm instead of the Frobenius (matrix) norm
    return linalg.norm(A_WH_mask, 2)
Let us check the cost for the initial values of W and H that we randomly assigned. Notice that we pass A2!
cost(A2, W, H)
7.2333001567031294
Alternating NNLS procedure#
num_iter = 1000
num_display_cost = max(int(num_iter/10), 1)
from scipy.optimize import nnls
for i in range(num_iter):
    if i % 2 == 0:
        # Learn H, given A and W
        for j in range(N):
            mask_rows = pd.Series(A2[:, j]).notnull()
            H[:, j] = nnls(W[mask_rows], A2[:, j][mask_rows])[0]
    else:
        # Learn W, given A and H
        for j in range(M - 1):
            mask_rows = pd.Series(A2[j, :]).notnull()
            W[j, :] = nnls(H.transpose()[mask_rows], A2[j, :][mask_rows])[0]
    WH = np.dot(W, H)
    c = cost(A2, W, H)
    if i % num_display_cost == 0:
        print(i, c)
0 3.74162948918
100 2.25416363991
200 2.25258698617
300 2.25229707846
400 2.25131714233
500 2.24968386447
600 2.24967129897
700 2.24965023589
800 2.24961410381
900 2.24955008837
A_pred = pd.DataFrame(np.dot(W, H))
A_pred.head()
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.590301 | 0.653038 | 0.531940 | 0.623272 | 0.584763 | 0.630835 | 0.574041 | 0.700139 | 0.841706 | 0.565808 |
1 | 0.802724 | 0.532299 | 0.482430 | 1.017968 | 0.149923 | 0.449312 | 0.097775 | 0.708727 | 0.506595 | 0.846219 |
2 | 0.764296 | 0.563711 | 0.527292 | 0.905236 | 0.306275 | 0.505674 | 0.223192 | 0.705882 | 0.604356 | 0.757878 |
3 | 0.373539 | 0.745239 | 0.334948 | 0.663219 | 0.132686 | 0.551844 | 0.760420 | 0.598546 | 0.808108 | 0.627732 |
4 | 0.467623 | 0.331457 | 0.617263 | 0.239449 | 0.634455 | 0.370041 | 0.294412 | 0.288539 | 0.484822 | 0.126945 |
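As a quick sanity check (an addition to the original walkthrough), we can compare the imputed values against the entries of the first 19 users that we masked out earlier:

```python
# Entries of A2 that were masked (rows 0, 3 and 6): imputed vs. original values
mask_missing = np.isnan(A2)
print(A_pred.values[mask_missing])    # the model's imputed values
print(A_orig[:-1, :][mask_missing])   # the true values before masking
```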
Learning user factors for the \(M^{th}\) user#
A_m = A[-1,:]
A_m_transpose = A_m.T
mask = ~np.isnan(A_m_transpose)
W_m = nnls(H.T[mask], A_m_transpose[mask])[0]
W_m
array([ 0.12248095, 0.20778687, 0.15185613, 0. ])
Predicting for the \(M^{th}\) user#
ratings_m_user = np.dot(H.T, W_m)
ratings_m_user[~mask]
array([ 0.4245947 , 0.57447552])
A_orig[-1,:][~mask]
array([ 0.18619301, 0.25435648])
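To put a rough number on the gap (again, a small addition rather than part of the original post), we can compute the RMSE over the two held-out ratings:

```python
# RMSE between predicted and true ratings for the movies the new user has not rated
errors = ratings_m_user[~mask] - A_orig[-1, :][~mask]
print(np.sqrt(np.mean(errors ** 2)))
```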
There you go: we are able to get ratings for the \(M^{th}\) user for the movies that they have not seen, even though we only trained the model on the other users! Of course, these numbers might not look very impressive. However, this was just a toy example based on random data; on real data, we could expect better results.