{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "J0qXcTp-FofG" }, "source": [ "\n", "# Rank Regression\n", "[](https://rehline-python.readthedocs.io/en/latest/)\n", "\n", "Suppose we have a dataset $(y_i,\\mathbf{x}_j)_{i=1}^n$, where $y_i \\in \\mathbb{R}$ is the response and $\\mathbf{x}_i \\in \\mathbb{R}^{d}$ is the independent variable. Assume the data is generated from the following linear regression model,\n", "\n", "$$\n", " y_i = \\boldsymbol{\\beta}^{\\intercal}\\mathbf{x}_i + \\epsilon_i, \\quad i=1,2,\\cdots n,\n", "$$\n", "where $\\boldsymbol{\\beta} \\in \\mathbb{R}^d$ is the unknown parameter to be estimated and $\\{\\epsilon_i\\}_{i=1}^n$ are iid random errors. \n", "\n", "When the random errors $\\{\\epsilon_i\\}_{i=1}^n$ follow non-Gaussian distributions, particularly those with heavy tails, the performance of OLS maybe poor. Rank regression is one of the robust regression methods which can deal with the case. It suffices to solve the following optimization problem (Jaeckel, 1972):\n", "\n", "$$\n", " \\min_{\\boldsymbol{\\beta} \\in \\mathbb{R}^d} \\sum\\limits_{i=1}^{n}R_i^c(\\boldsymbol{\\beta})(y_i - \\boldsymbol{\\beta}^{\\intercal}\\mathbf{x}_i),\n", "$$\n", "where $R_i^c(\\boldsymbol{\\beta})=R_i(\\boldsymbol{\\beta}) - \\frac{n+1}{2}$, $R_i(\\boldsymbol{\\beta})$ is the rank of ${y_i-\\boldsymbol{\\beta}^{\\intercal}\\mathbf{x}_i}$ among all $\\{ y_i-\\boldsymbol{\\beta}^{\\intercal}\\mathbf{x}_i\\}_{i=1}^n$. Canonically, we estimate $\\boldsymbol{\\beta}$ by solving the following optimization problem,\n", "\n", "$$\n", " \\min_{\\boldsymbol{\\beta} \\in \\mathbb{R}^d} \\frac{2}{n(n-1)}\\sum\\limits_{1 \\leq i < j \\leq n} \\big| y_{ij} - \\boldsymbol{\\beta}^{\\intercal} \\mathbf{x}_{ij} \\big| \\tag{1}\n", "$$\n", "where $\\mathbf{x}_{ij} = \\mathbf{x}_i - \\mathbf{x}_j$, $y_{ij}= y_i - y_j$. In the remaining part, I will show two methods to solve $(1)$ with implementations via `rehline`.\n", "\n", "In `ReHLine`, we have two options to solve rank regression: (i) use [Manual ReHLine Formulation](https://rehline-python.readthedocs.io/en/latest/tutorials/ReHLine_manual.html) based on `rehline.ReHLine`; (ii) use the form of [Empirical Risk Minimization](https://rehline-python.readthedocs.io/en/latest/tutorials/ReHLine_ERM.html) with the check loss based on `rehline.plqERM_Ridge`.\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "u3gEkxOBdUft" }, "source": [ "## Numerical example" ] }, { "cell_type": "markdown", "metadata": { "id": "CDuMki43tnuL" }, "source": [ "### Preparation \n", "**Generate the data**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "r3VoUcR4FceK" }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.linear_model import LinearRegression\n", "\n", "from rehline import ReHLine, plqERM_Ridge\n", "\n", "np.random.seed(0)\n", "\n", "# Generate the data\n", "n, d = 3000, 3\n", "X = np.random.randn(n, d)\n", "beta_true = np.random.randn(d)\n", "y = X.dot(beta_true) + np.random.standard_t(df=10, size=n)" ] }, { "cell_type": "markdown", "metadata": { "id": "8B62NsX7t2SS" }, "source": [ "**Pairwise differenced dataset**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "OcdNUtvrn_Hy" }, "outputs": [], "source": [ "# Differencing\n", "i, j = np.triu_indices(n, k=1)\n", "X_diff = (X[:, np.newaxis, :] - X[np.newaxis, :, :])[i, j]\n", "y_diff = (y[:, np.newaxis] - y[np.newaxis, :])[i, j]\n", "n_diff = y_diff.shape[0]" ] }, { "cell_type": "markdown", "metadata": { "id": "VtuYCtgruBdK" }, "source": [ "### Method 1: Solve Rank Regression via `rehline.plqERM_Ridge`\n", "\n", "Since `rehline.plqERM_Ridge` support for check loss, we can directly use it to solve the problem.\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "pNLvCXgO2KcT", "outputId": "7f205dd8-3daf-4cd6-ce8b-5e43c7a4ab57" }, "outputs": [], "source": [ "# solve the problem use quantitle regression model with $\\tau = 0.5$ by reline\n", "C = 20 / n_diff\n", "RankReg = plqERM_Ridge(loss={\"name\": \"QR\", \"qt\": np.array([0.5])}, C=C, max_iter=10000)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 793 ms, sys: 132 ms, total: 925 ms\n", "Wall time: 925 ms\n" ] } ], "source": [ "%%time\n", "RankReg.fit(X_diff, y_diff)" ] }, { "cell_type": "markdown", "metadata": { "id": "_aE8U8k8t4Uq" }, "source": [ "### Method 2: Solve Rank Regression by `rehline.ReHLine`\n", "Since the rank regression is a plq formulation, we can also use `rehline.RehLine` to solve it by manually specifying the ReLU-ReHU parameters.\n", "\n", "As shown in table 2 of (Dai and Qiu, 2024), when $L_i = C_i \\big| y_i - \\boldsymbol{\\beta}^{\\intercal} \\mathbf{x}_i \\big|$, ReLU parameters are as followed: $u_{1i}=C_i$, $v_{1i}=-C_i y_i$, $u_{2i}=-C_i$, $v_{2i}=C_i y_i$." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9BIbKUma39CH", "outputId": "1b663080-64df-4dc4-cee4-4acc5e4ef819" }, "outputs": [], "source": [ "# solve the problem use the least absolute deviation model (LAD)\n", "C = 20 / n_diff / 2\n", "U = np.zeros(shape=(2, n_diff))\n", "V = np.zeros(shape=(2, n_diff))\n", "U[0, :], U[1, :] = C, -C\n", "V[0, :], V[1, :] = -C * y_diff, C * y_diff\n", "RankReg = ReHLine(C=C, max_iter=10000)\n", "RankReg.U, RankReg.V = U, V" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 737 ms, sys: 95 ms, total: 832 ms\n", "Wall time: 830 ms\n" ] } ], "source": [ "%%time\n", "RankReg.fit(X_diff)" ] }, { "cell_type": "markdown", "metadata": { "id": "pwPOUiEyeQfF" }, "source": [ "### Results\n", "\n", "We compare the results of rank regression with OLS methods." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Nvjfr8YbCsUM", "outputId": "46dd7432-4c08-4083-8672-fe3fae7e397d" }, "outputs": [ { "data": { "text/html": [ "
LinearRegression(fit_intercept=False)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression(fit_intercept=False)