{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "xSWYNZ1zvzCA" }, "source": [ "# ElasticNet Compatible Estimators\n", "\n", "[![Slides](https://img.shields.io/badge/🦌-ReHLine-blueviolet)](https://rehline-python.readthedocs.io/en/latest/)\n", "\n", "The core class `plqERM_ElasticNet` is the base implementation for both classification and regression tasks. Its subclasses, `plq_ElasticNet_Classifier` and `plq_ElasticNet_Regressor`, extend the Ridge-based variants with an additional `l1_ratio` parameter that controls the mix between L1 and L2 regularization. These estimators work with scikit-learn utilities such as `Pipeline`, `cross_val_score`, and `GridSearchCV`." ] }, { "cell_type": "markdown", "metadata": { "id": "HDGBmNUmxZtn" }, "source": [ "The ElasticNet estimators solve the following optimization problem:\n", "\n", "$$\n", "\\min_{\\beta \\in \\mathbb{R}^d} \\; C \\sum_{i=1}^{n} \\text{PLQ}(y_i, \\mathbf{x}_i^T \\beta) + \\ell_1\\text{ratio} \\|\\beta\\|_1 + \\frac{1}{2}(1 - \\ell_1\\text{ratio})\\|\\beta\\|_2^2, \\quad \\text{s.t.} \\quad \\mathbf{A}\\beta + \\mathbf{b} \\geq \\mathbf{0},\n", "$$\n", "\n", "where\n", "\n", "- $\\text{PLQ}(\\cdot, \\cdot)$ is a piecewise linear-quadratic loss function (e.g., SVM hinge, quantile, Huber),\n", "- $\\mathbf{x}_i \\in \\mathbb{R}^d$ is a feature vector,\n", "- $y_i$ is the response variable (class label or continuous value),\n", "- $C > 0$ weights the loss term and therefore acts as an inverse regularization strength (larger $C$ means less regularization),\n", "- $\\ell_1\\text{ratio} \\in [0, 1]$ is the mixing parameter: $\\ell_1\\text{ratio} = 1$ gives the Lasso (pure L1) penalty, $\\ell_1\\text{ratio} = 0$ gives the Ridge (pure L2) penalty,\n", "- $\\mathbf{A}\\beta + \\mathbf{b} \\geq \\mathbf{0}$ represents optional linear constraints on $\\beta$."
] }, { "cell_type": "markdown", "metadata": { "id": "L_j1q7cFEBxy" }, "source": [ "#### Classification Example with GridSearchCV and Pipeline\n", "\n", "Here we show a classification example using `Pipeline`, `cross_val_score`, and `GridSearchCV`. Compared to the Ridge classifier, the key difference is the additional `l1_ratio` parameter in `param_grid`.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "39IeObaaHBDz" }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.datasets import make_classification\n", "from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "Rc33Ym8ZHB6a" }, "outputs": [], "source": [ "# generate the dataset\n", "X, y = make_classification(\n", " n_samples=2000,\n", " n_features=20,\n", " n_informative=8,\n", " n_redundant=4,\n", " n_repeated=0,\n", " n_classes=2,\n", " weights=[0.7, 0.3],\n", " class_sep=1.2,\n", " flip_y=0.01,\n", " random_state=42,\n", ")\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.25, stratify=y, random_state=42\n", ")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "Q54w-eLSHDlq" }, "outputs": [], "source": [ "from rehline import plq_ElasticNet_Classifier\n", "\n", "# set the pipeline\n", "pipe = Pipeline([\n", " (\"scaler\", StandardScaler()),\n", " (\"clf\", plq_ElasticNet_Classifier(loss={\"name\": \"svm\"})),\n", "])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "c8hG_-p5HFRk" }, "outputs": [], "source": [ "# set the parameter grid\n", "param_grid = {\n", " \"clf__loss\": [{\"name\": \"svm\"}, {\"name\": \"sSVM\"}],\n", " \"clf__C\": [0.1, 1.0, 3.0],\n", " \"clf__l1_ratio\": [0.0, 0.3, 0.5, 0.8],\n", " 
\"clf__fit_intercept\": [True, False],\n", " \"clf__intercept_scaling\": [0.5, 1.0, 2.0],\n", " \"clf__max_iter\": [5000, 10000],\n", " \"clf__class_weight\": [None, \"balanced\", {0: 1.0, 1: 2.0}],\n", " \"clf__constraint\": [\n", " [],\n", " [{\"name\": \"nonnegative\"}],\n", " [{\"name\": \"fair\", \"sen_idx\": [0], \"tol_sen\": 0.1}],\n", " ],\n", "}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "TrRcyQP8HILw", "outputId": "3d4e7e02-2a0a-4f36-b71e-5283f59d8f2f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CV scores: [0.79666667 0.82 0.82666667 0.81 0.81 ]\n" ] } ], "source": [ "# cross_val_score\n", "cv_scores = cross_val_score(\n", " pipe,\n", " X_train, y_train,\n", " cv=5,\n", " scoring=\"accuracy\",\n", " n_jobs=-1,\n", ")\n", "print(\"CV scores:\", cv_scores)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 207 }, "id": "hMvSW0ifHJnZ", "outputId": "eaeb9c6f-e206-401c-fd61-9e6de3f41499" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 2592 candidates, totalling 12960 fits\n" ] }, { "data": { "text/html": [ "
GridSearchCV(cv=5,\n",
              "             estimator=Pipeline(steps=[('scaler', StandardScaler()),\n",
              "                                       ('clf',\n",
              "                                        plq_ElasticNet_Classifier(loss={'name': 'svm'}))]),\n",
              "             n_jobs=-1,\n",
              "             param_grid={'clf__C': [0.1, 1.0, 3.0],\n",
              "                         'clf__class_weight': [None, 'balanced',\n",
              "                                               {0: 1.0, 1: 2.0}],\n",
              "                         'clf__constraint': [[], [{'name': 'nonnegative'}],\n",
              "                                             [{'name': 'fair', 'sen_idx': [0],\n",
              "                                               'tol_sen': 0.1}]],\n",
              "                         'clf__fit_intercept': [True, False],\n",
              "                         'clf__intercept_scaling': [0.5, 1.0, 2.0],\n",
              "                         'clf__l1_ratio': [0.0, 0.3, 0.5, 0.8],\n",
              "                         'clf__loss': [{'name': 'svm'}, {'name': 'sSVM'}],\n",
              "                         'clf__max_iter': [5000, 10000]},\n",
              "             scoring='accuracy', verbose=1)
" ], "text/plain": [ "GridSearchCV(cv=5,\n", " estimator=Pipeline(steps=[('scaler', StandardScaler()),\n", " ('clf',\n", " plq_ElasticNet_Classifier(loss={'name': 'svm'}))]),\n", " n_jobs=-1,\n", " param_grid={'clf__C': [0.1, 1.0, 3.0],\n", " 'clf__class_weight': [None, 'balanced',\n", " {0: 1.0, 1: 2.0}],\n", " 'clf__constraint': [[], [{'name': 'nonnegative'}],\n", " [{'name': 'fair', 'sen_idx': [0],\n", " 'tol_sen': 0.1}]],\n", " 'clf__fit_intercept': [True, False],\n", " 'clf__intercept_scaling': [0.5, 1.0, 2.0],\n", " 'clf__l1_ratio': [0.0, 0.3, 0.5, 0.8],\n", " 'clf__loss': [{'name': 'svm'}, {'name': 'sSVM'}],\n", " 'clf__max_iter': [5000, 10000]},\n", " scoring='accuracy', verbose=1)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# GridSearchCV\n", "grid = GridSearchCV(\n", " estimator=pipe,\n", " param_grid=param_grid,\n", " scoring=\"accuracy\",\n", " cv=5,\n", " n_jobs=-1,\n", " refit=True,\n", " verbose=1,\n", ")\n", "\n", "grid.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AXyFwRedHKWh", "outputId": "8d515a49-2532-4bc0-9f5c-962e49687823" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'clf__C': 0.1, 'clf__class_weight': None, 'clf__constraint': [{'name': 'fair', 'sen_idx': [0], 'tol_sen': 0.1}], 'clf__fit_intercept': True, 'clf__intercept_scaling': 1.0, 'clf__l1_ratio': 0.0, 'clf__loss': {'name': 'sSVM'}, 'clf__max_iter': 5000}\n", "Best CV accuracy: 0.8133333333333332\n" ] } ], "source": [ "print(\"Best params:\", grid.best_params_)\n", "print(\"Best CV accuracy:\", grid.best_score_)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Aj-AMD1THMFu", "outputId": "a47b6f0b-3daa-4514-df47-1337ebef31bc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test accuracy: 0.808\n", "\n", 
"Classification report:\n", " precision recall f1-score support\n", "\n", " 0 0.8155 0.9370 0.8720 349\n", " 1 0.7778 0.5099 0.6160 151\n", "\n", " accuracy 0.8080 500\n", " macro avg 0.7966 0.7234 0.7440 500\n", "weighted avg 0.8041 0.8080 0.7947 500\n", "\n", "Confusion matrix:\n", " [[327 22]\n", " [ 74 77]]\n" ] } ], "source": [ "best_model = grid.best_estimator_\n", "y_pred = best_model.predict(X_test)\n", "test_acc = accuracy_score(y_test, y_pred)\n", "\n", "print(\"Test accuracy:\", test_acc)\n", "print(\"\\nClassification report:\\n\", classification_report(y_test, y_pred, digits=4))\n", "print(\"Confusion matrix:\\n\", confusion_matrix(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": { "id": "BHtMgQH3EN_R" }, "source": [ "#### Regression Example\n", "\n", "Here we show a regression example using `Pipeline`, `cross_val_score`, and `GridSearchCV`. The `l1_ratio` controls the balance between lasso and ridge penalty.\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "OyLbApGNHN1M" }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.datasets import make_regression\n", "from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.metrics import mean_squared_error, r2_score" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "vBKg2LAHHPT-" }, "outputs": [], "source": [ "# generate the data\n", "X, y = make_regression(\n", " n_samples=1500,\n", " n_features=15,\n", " n_informative=10,\n", " noise=10.0,\n", " random_state=42\n", ")\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.25, random_state=42\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "DDFvFnM0HQoX" }, "outputs": [], "source": [ "from rehline import plq_ElasticNet_Regressor\n", "\n", "# set the pipeline\n", "pipe = 
Pipeline([\n", " (\"scaler\", StandardScaler()),\n", " (\"reg\", plq_ElasticNet_Regressor(loss={\"name\": \"QR\", \"qt\": 0.5})),\n", "])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "8XU49fbKHUbE" }, "outputs": [], "source": [ "# set the param_grid\n", "param_grid = {\n", " \"reg__loss\": [\n", " {\"name\": \"QR\", \"qt\": 0.5},\n", " {\"name\": \"huber\", \"tau\": 1.0},\n", " {\"name\": \"SVR\", \"epsilon\": 0.1},\n", " ],\n", " \"reg__C\": [0.1, 1.0, 10.0],\n", " \"reg__l1_ratio\": [0.0, 0.3, 0.5, 0.8],\n", " \"reg__fit_intercept\": [True, False],\n", " \"reg__intercept_scaling\": [0.5, 1.0],\n", " \"reg__max_iter\": [5000, 8000],\n", " \"reg__constraint\": [\n", " [],\n", " [{\"name\": \"nonnegative\"}],\n", " [{\"name\": \"fair\", \"sen_idx\": [0], \"tol_sen\": 0.1}],\n", " ],\n", "}" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1MQDxCkoHWI9", "outputId": "47dc0b50-9b3a-496c-a520-18a7480dbc79" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CV R^2 scores: [0.99668483 0.99654706 0.99704323 0.99627612 0.99609029]\n", "Mean CV R^2: 0.9965283057432174\n" ] } ], "source": [ "# cross_val_score\n", "cv_scores = cross_val_score(\n", " pipe,\n", " X_train, y_train,\n", " cv=5,\n", " scoring=\"r2\",\n", " n_jobs=-1,\n", ")\n", "print(\"CV R^2 scores:\", cv_scores)\n", "print(\"Mean CV R^2:\", np.mean(cv_scores))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 207 }, "id": "Wh8IfBb3HX5v", "outputId": "0f8f28b1-c8aa-4f7c-ef2d-c8a0ffdfe7dc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 864 candidates, totalling 4320 fits\n" ] }, { "data": { "text/html": [ "
GridSearchCV(cv=5,\n",
              "             estimator=Pipeline(steps=[('scaler', StandardScaler()),\n",
              "                                       ('reg', plq_ElasticNet_Regressor())]),\n",
              "             n_jobs=-1,\n",
              "             param_grid={'reg__C': [0.1, 1.0, 10.0],\n",
              "                         'reg__constraint': [[], [{'name': 'nonnegative'}],\n",
              "                                             [{'name': 'fair', 'sen_idx': [0],\n",
              "                                               'tol_sen': 0.1}]],\n",
              "                         'reg__fit_intercept': [True, False],\n",
              "                         'reg__intercept_scaling': [0.5, 1.0],\n",
              "                         'reg__l1_ratio': [0.0, 0.3, 0.5, 0.8],\n",
              "                         'reg__loss': [{'name': 'QR', 'qt': 0.5},\n",
              "                                       {'name': 'huber', 'tau': 1.0},\n",
              "                                       {'epsilon': 0.1, 'name': 'SVR'}],\n",
              "                         'reg__max_iter': [5000, 8000]},\n",
              "             scoring='r2', verbose=1)
" ], "text/plain": [ "GridSearchCV(cv=5,\n", " estimator=Pipeline(steps=[('scaler', StandardScaler()),\n", " ('reg', plq_ElasticNet_Regressor())]),\n", " n_jobs=-1,\n", " param_grid={'reg__C': [0.1, 1.0, 10.0],\n", " 'reg__constraint': [[], [{'name': 'nonnegative'}],\n", " [{'name': 'fair', 'sen_idx': [0],\n", " 'tol_sen': 0.1}]],\n", " 'reg__fit_intercept': [True, False],\n", " 'reg__intercept_scaling': [0.5, 1.0],\n", " 'reg__l1_ratio': [0.0, 0.3, 0.5, 0.8],\n", " 'reg__loss': [{'name': 'QR', 'qt': 0.5},\n", " {'name': 'huber', 'tau': 1.0},\n", " {'epsilon': 0.1, 'name': 'SVR'}],\n", " 'reg__max_iter': [5000, 8000]},\n", " scoring='r2', verbose=1)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# GridSearchCV\n", "grid = GridSearchCV(\n", " estimator=pipe,\n", " param_grid=param_grid,\n", " scoring=\"r2\",\n", " cv=5,\n", " n_jobs=-1,\n", " refit=True,\n", " verbose=1,\n", ")\n", "\n", "grid.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AM_OSqTZHaCL", "outputId": "7d1cf617-7c65-411a-9ba1-564ee822676c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'reg__C': 0.1, 'reg__constraint': [{'name': 'nonnegative'}], 'reg__fit_intercept': True, 'reg__intercept_scaling': 1.0, 'reg__l1_ratio': 0.0, 'reg__loss': {'name': 'huber', 'tau': 1.0}, 'reg__max_iter': 5000}\n", "Best CV R^2: 0.9967196763855011\n" ] } ], "source": [ "print(\"Best params:\", grid.best_params_)\n", "print(\"Best CV R^2:\", grid.best_score_)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "FQAcMLBsHanr", "outputId": "41a9afa4-ad3d-43ed-a6e1-0b90ac9fbabb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test R^2: 0.9967743380626125\n", "Test MSE: 104.74629973212267\n" ] } ], "source": [ "best_model = grid.best_estimator_\n", 
"y_pred = best_model.predict(X_test)\n", "\n", "print(\"Test R^2:\", r2_score(y_test, y_pred))\n", "print(\"Test MSE:\", mean_squared_error(y_test, y_pred))" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }