{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Out of matrix non-negative matrix factorisation\n", "> What if we want to predict entries that are not within the matrix?\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Nipun Batra\n", "- categories: [ML]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I have written a bunch of posts on this blog about non-negative matrix factorisation (NNMF). However, all of them required the test user to be a part of the matrix that we factorise to learn the latent factors. Is that always the case? Read on to find out!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Standard Problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal is, given a matrix A, to decompose it into two non-negative factors, as follows:\n", "\n", "$ A_{M \times N} \approx W_{M \times K} \times H_{K \times N} $, such that $ W_{M \times K} \ge 0$ and $ H_{K \times N} \ge 0$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://nipunbatra.github.io/blog/images/o_matrix.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Our Problem - Out of matrix factorisation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine that we have trained the model for M-1 users on N movies. Now, the $M^{th}$ user has rated some movies. Do we retrain the model from scratch to accommodate the $M^{th}$ user? That can be a very expensive operation!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://nipunbatra.github.io/blog/images/o_matrix_estimation.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead, as shown in the figure above, we will learn only the user factor for the $M^{th}$ user. We can do this because the shared movie factor (H) has already been learnt. 
\n", "\n", "We can formulate this as follows:\n", "\n", "$$\n", "A[M,:] = W[M,:]H\n", "$$\n", "\n", "Taking the transpose of both sides,\n", "\n", "$$\n", "A[M,:]^T = H^T W[M,:]^T\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, $A[M,:]^T$ will have missing entries. Thus, we mask those entries out of the calculation, as shown below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://nipunbatra.github.io/blog/images/o_matrix_estimation_ls.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thus, we can write\n", "\n", "$$\n", "W[M,:]^T = \mathrm{LeastSquares} (H^T[Mask], A[M,:]^T[Mask])\n", "$$\n", "\n", "If we want the factors to be non-negative, we can use non-negative least squares instead of the usual least squares for this estimation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Code example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'll now present a simple code example to illustrate the procedure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Defining matrix A" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n
" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "0 0.548814 0.715189 0.602763 0.544883 0.423655 0.645894 0.437587 \n", "1 0.791725 0.528895 0.568045 0.925597 0.071036 0.087129 0.020218 \n", "2 0.978618 0.799159 0.461479 0.780529 0.118274 0.639921 0.143353 \n", "3 0.264556 0.774234 0.456150 0.568434 0.018790 0.617635 0.612096 \n", "4 0.359508 0.437032 0.697631 0.060225 0.666767 0.670638 0.210383 \n", "\n", " 7 8 9 \n", "0 0.891773 0.963663 0.383442 \n", "1 0.832620 0.778157 0.870012 \n", "2 0.944669 0.521848 0.414662 \n", "3 0.616934 0.943748 0.681820 \n", "4 0.128926 0.315428 0.363711 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "M, N = 20, 10\n", "\n", "np.random.seed(0)\n", "A_orig = np.abs(np.random.uniform(low=0.0, high=1.0, size=(M, N)))\n", "pd.DataFrame(A_orig).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Masking a few entries" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "A = A_orig.copy()\n", "A[0, 0] = np.nan\n", "A[3, 1] = np.nan\n", "A[6, 3] = np.nan\n", "\n", "# Mask entries for the last user\n", "A[19, 2] = np.nan\n", "A[19, 7] = np.nan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the matrix A2 (the first 19 users) to learn the movie factors and the user factors for those 19 users." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(19, 10)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A2 = A[:-1, :]\n", "A2.shape" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n
" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "0 NaN 0.715189 0.602763 0.544883 0.423655 0.645894 0.437587 \n", "1 0.791725 0.528895 0.568045 0.925597 0.071036 0.087129 0.020218 \n", "2 0.978618 0.799159 0.461479 0.780529 0.118274 0.639921 0.143353 \n", "3 0.264556 NaN 0.456150 0.568434 0.018790 0.617635 0.612096 \n", "4 0.359508 0.437032 0.697631 0.060225 0.666767 0.670638 0.210383 \n", "\n", " 7 8 9 \n", "0 0.891773 0.963663 0.383442 \n", "1 0.832620 0.778157 0.870012 \n", "2 0.944669 0.521848 0.414662 \n", "3 0.616934 0.943748 0.681820 \n", "4 0.128926 0.315428 0.363711 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A_df = pd.DataFrame(A)\n", "A_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Defining matrices W and H (learning on M-1 users and N movies)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "K = 4\n", "W = np.abs(np.random.uniform(low=0, high=1, size=(M-1, K)))\n", "H = np.abs(np.random.uniform(low=0, high=1, size=(K, N)))\n", "W = np.divide(W, K*W.max())\n", "H = np.divide(H, K*H.max())" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n
" ], "text/plain": [ " 0 1 2 3\n", "0 0.078709 0.175784 0.095359 0.045339\n", "1 0.006230 0.016976 0.171505 0.114531\n", "2 0.135453 0.226355 0.250000 0.054753\n", "3 0.167387 0.066473 0.005213 0.191444\n", "4 0.080785 0.096801 0.148514 0.209789" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(W).head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n
" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "0 0.239700 0.203498 0.160529 0.222617 0.074611 0.216164 0.157328 \n", "1 0.250000 0.121806 0.126649 0.162827 0.093851 0.034858 0.209333 \n", "2 0.024914 0.219537 0.247731 0.244654 0.230833 0.197093 0.084828 \n", "3 0.033735 0.013604 0.184756 0.002910 0.196210 0.037417 0.020248 \n", "\n", " 7 8 9 \n", "0 0.003370 0.088415 0.037721 \n", "1 0.048340 0.130195 0.057117 \n", "2 0.020651 0.103694 0.059133 \n", "3 0.022815 0.171121 0.062477 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(H).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Defining the cost that we want to minimise" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def cost(A, W, H):\n", "    from numpy import linalg\n", "    WH = np.dot(W, H)\n", "    A_WH = A - WH\n", "    return linalg.norm(A_WH, 'fro')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, since A has missing entries, we have to define the cost in terms of the entries actually present in A." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def cost(A, W, H):\n", "    from numpy import linalg\n", "    mask = pd.DataFrame(A).notnull().values\n", "    WH = np.dot(W, H)\n", "    WH_mask = WH[mask]\n", "    A_mask = A[mask]\n", "    A_WH_mask = A_mask - WH_mask\n", "    # A_WH_mask is now a vector, so we use the L2 norm instead of the matrix Frobenius norm\n", "    return linalg.norm(A_WH_mask, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us compute the cost for the initial, randomly assigned W and H. Notice that we pass A2!"
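As a quick sanity check (a standalone sketch with made-up toy values, not the notebook's A2), the masked cost simply reduces to an ordinary L2 norm over the observed entries:

```python
import numpy as np
import pandas as pd

def masked_cost(A, W, H):
    # Keep only the observed (non-NaN) entries of A
    mask = pd.DataFrame(A).notnull().values
    diff = A[mask] - np.dot(W, H)[mask]
    return np.linalg.norm(diff, 2)

# Toy 2x2 example with one missing entry
A_toy = np.array([[1.0, np.nan],
                  [2.0, 2.0]])
W_toy = np.ones((2, 1))
H_toy = np.ones((1, 2))
# Observed residuals are (1-1), (2-1), (2-1), so the cost is sqrt(2)
print(masked_cost(A_toy, W_toy, H_toy))
```

The missing entry contributes nothing to the cost, which is exactly the behaviour we want when fitting on partially observed data.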
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7.2333001567031294" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cost(A2, W, H)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Alternating NNLS procedure" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 3.74162948918\n", "100 2.25416363991\n", "200 2.25258698617\n", "300 2.25229707846\n", "400 2.25131714233\n", "500 2.24968386447\n", "600 2.24967129897\n", "700 2.24965023589\n", "800 2.24961410381\n", "900 2.24955008837\n" ] } ], "source": [ "num_iter = 1000\n", "num_display_cost = max(int(num_iter/10), 1)\n", "from scipy.optimize import nnls\n", "\n", "for i in range(num_iter):\n", "    if i % 2 == 0:\n", "        # Learn H, given A and W\n", "        for j in range(N):\n", "            mask_rows = pd.Series(A2[:, j]).notnull().values\n", "            H[:, j] = nnls(W[mask_rows], A2[:, j][mask_rows])[0]\n", "    else:\n", "        # Learn W, given A and H\n", "        for j in range(M-1):\n", "            mask_rows = pd.Series(A2[j, :]).notnull().values\n", "            W[j, :] = nnls(H.transpose()[mask_rows], A2[j, :][mask_rows])[0]\n", "    c = cost(A2, W, H)\n", "    if i % num_display_cost == 0:\n", "        print(i, c)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n
" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "0 0.590301 0.653038 0.531940 0.623272 0.584763 0.630835 0.574041 \n", "1 0.802724 0.532299 0.482430 1.017968 0.149923 0.449312 0.097775 \n", "2 0.764296 0.563711 0.527292 0.905236 0.306275 0.505674 0.223192 \n", "3 0.373539 0.745239 0.334948 0.663219 0.132686 0.551844 0.760420 \n", "4 0.467623 0.331457 0.617263 0.239449 0.634455 0.370041 0.294412 \n", "\n", " 7 8 9 \n", "0 0.700139 0.841706 0.565808 \n", "1 0.708727 0.506595 0.846219 \n", "2 0.705882 0.604356 0.757878 \n", "3 0.598546 0.808108 0.627732 \n", "4 0.288539 0.484822 0.126945 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A_pred = pd.DataFrame(np.dot(W, H))\n", "A_pred.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Learning user factors for the $M^{th}$ user" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "A_m = A[-1, :]\n", "mask = ~np.isnan(A_m)\n", "W_m = nnls(H.T[mask], A_m[mask])[0]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0.12248095, 0.20778687, 0.15185613, 0. 
])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "W_m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predicting for the $M^{th}$ user" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "ratings_m_user = np.dot(H.T, W_m)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0.4245947 , 0.57447552])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratings_m_user[~mask]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0.18619301, 0.25435648])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A_orig[-1, :][~mask]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There you go: we obtained ratings for the $M^{th}$ user for the movies that they have not seen, even though we only trained the model on the other users! Of course, these numbers may not look impressive, but this was just a toy example on random data. On real data, we could expect better results." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 1 }
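The out-of-matrix step can be condensed into a small helper. The following is a self-contained sketch (random data and a hypothetical `fold_in` function, not this notebook's variables):

```python
import numpy as np
from scipy.optimize import nnls

def fold_in(H, ratings):
    # Estimate non-negative user factors for a new user, given the
    # learnt movie factors H (K x N) and a partially observed ratings
    # vector (NaN marks unrated movies).
    mask = ~np.isnan(ratings)
    w, _ = nnls(H.T[mask], ratings[mask])
    return w

rng = np.random.RandomState(0)
H = rng.rand(4, 10)             # pretend these movie factors were already learnt
ratings = rng.rand(10)
ratings[[2, 7]] = np.nan        # the new user has not rated movies 2 and 7
w_new = fold_in(H, ratings)     # user factors, length K = 4
predicted = np.dot(H.T, w_new)  # predicted ratings for all 10 movies
print(predicted[[2, 7]])        # predictions for the unseen movies
```

Because only one small NNLS problem is solved, folding in a new user is far cheaper than refactorising the whole matrix.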