{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "4l2AItnCizvk" }, "source": [ "# FairSVM\n", "\n", "[rehline documentation](https://rehline-python.readthedocs.io/en/latest//)\n", "\n", "FairSVM solves the following optimization problem:\n", "\n", "$$\n", "\\begin{aligned}\n", "\\min_{\\beta \\in \\mathbb{R}^d}\\quad &\n", "\\frac{C}{n}\\sum_{i=1}^n (1-y_i\\beta^\\top x_i)_+ + \\frac{1}{2}\\|\\beta\\|_2^2, \\\\[1ex]\n", "\\text{subject to}\\quad &\n", "\\frac{1}{n}\\sum_{i=1}^n z_i\\,\\beta^\\top x_i \\le \\rho,\\quad\n", "\\frac{1}{n}\\sum_{i=1}^n z_i\\,\\beta^\\top x_i \\ge -\\rho.\n", "\\end{aligned}\n", "$$\n", "\n", "where the inequalities hold componentwise and:\n", "\n", "- $x_i \\in \\mathbb{R}^d$ is a feature vector\n", "- $y_i \\in \\{-1,1\\}$ is a binary label\n", "- $z_i \\in \\mathbb{R}^{d_0}$ is a vector of $d_0$ **centered sensitive features**, such as gender and/or race, where each feature $j = 1, \\dots, d_0$ satisfies\n", " $$\n", " \\sum_{i=1}^n z_{ij}=0\n", " $$\n", "- $\\rho \\in \\mathbb{R}_+^{d_0}$ is a vector of constants that trade off predictive accuracy and fairness\n", "\n", "Because the sensitive features are centered, each constraint bounds the absolute empirical covariance between one sensitive feature and the decision function $\\beta^\\top x$ by the corresponding entry of $\\rho$. This limits how strongly the predictions depend on the sensitive features.\n", "\n", "> **Note.** Since the hinge loss is a piecewise linear-quadratic (PLQ) function and the fairness constraints are linear, we can optimize this model using `rehline.plq_Ridge_Classifier`."
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "CdpIoLwYNrOE" }, "outputs": [], "source": [ "## install rehline\n", "%pip install rehline -q" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "FcaI-p84K6m4" }, "outputs": [], "source": [ "## simulate data\n", "import numpy as np\n", "from sklearn.datasets import make_classification\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "\n", "n, d = 10000, 5\n", "X, y = make_classification(n_samples=n, n_features=d, n_redundant=2, random_state=42)\n", "y = 2 * y - 1\n", "X = scaler.fit_transform(X)\n", "\n", "## we take the first column of X as sensitive features, and tol is 0.1\n", "sen_idx = [0]\n", "tol_sen = 0.1" ] }, { "cell_type": "markdown", "metadata": { "id": "VEQKzCdrM3ii" }, "source": [ "## SVM as baseline" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "uUMv2d0ZM1X5", "outputId": "26300c19-6800-4537-ee51-455d2415c54d" }, "outputs": [ { "data": { "text/html": [ "
plq_Ridge_Classifier(loss={'name': 'svm'}, max_iter=50000)
plq_Ridge_Classifier(constraint=[{'name': 'fair', 'sen_idx': [0], 'tol_sen': 0.1}], loss={'name': 'svm'}, max_iter=50000)