{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Vv6noLRoaqh0" }, "source": [ "# Scikit-learn Compatible Estimators" ] }, { "cell_type": "markdown", "metadata": { "id": "Y9pEv59la5CV" }, "source": [ "[![Slides](https://img.shields.io/badge/🦌-ReHLine-blueviolet)](https://rehline-python.readthedocs.io/en/latest/)\n", "\n", "The core class `plqERM_Ridge` serves as the base implementation for both classification and regression tasks. Its subclasses, `plq_Ridge_Classifier` and `plq_Ridge_Regressor`, add task-specific functionality and work with scikit-learn utilities such as `Pipeline`, `cross_val_score`, and `GridSearchCV`. These models also support the usual scikit-learn evaluation workflow, so users can compute metrics such as accuracy for classification or R² for regression." ] }, { "cell_type": "markdown", "metadata": { "id": "chXqSvec7yqI" }, "source": [ "#### Classification Example with GridSearchCV and Pipeline\n", "\n", "Here we show a classification example that uses `Pipeline`, `cross_val_score`, and `GridSearchCV`." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "umXH0TZG9Zsl" }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.datasets import make_classification\n", "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n", "from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "Qh3tOux-9gZ9" }, "outputs": [], "source": [ "# generate the dataset\n", "X, y = make_classification(\n", " n_samples=2000,\n", " n_features=20,\n", " n_informative=8,\n", " n_redundant=4,\n", " n_repeated=0,\n", " n_classes=2,\n", " weights=[0.7, 0.3], # imbalance\n", " class_sep=1.2,\n", " flip_y=0.01,\n", " random_state=42,\n", ")\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "_MR1vTRc93xM" }, "outputs": [], "source": [ "from rehline import plq_Ridge_Classifier\n", "\n", "# set the pipeline\n", "pipe = Pipeline(\n", " [\n", " (\"scaler\", StandardScaler()),\n", " (\"clf\", plq_Ridge_Classifier(loss={\"name\": \"svm\"})),\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "tfGUG7auABMG" }, "outputs": [], "source": [ "# set the parameter grid\n", "param_grid = {\n", " \"clf__loss\": [{\"name\": \"svm\"}, {\"name\": \"sSVM\"}],\n", " \"clf__C\": [0.1, 1.0, 3.0],\n", " \"clf__fit_intercept\": [True, False],\n", " \"clf__intercept_scaling\": [0.5, 1.0, 2.0],\n", " \"clf__max_iter\": [5000, 10000],\n", " \"clf__class_weight\": [None, \"balanced\", {0: 1.0, 1: 2.0}],\n", " \"clf__constraint\": [\n", " [], # no constraint\n", " [{\"name\": \"nonnegative\"}],\n", " [{\"name\": \"fair\", \"sen_idx\": [0], \"tol_sen\": 0.1}],\n", " ],\n", "}" ] }, { "cell_type": "code", 
"execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LBsUuv6bBW00", "outputId": "fbd50af6-a23a-4eb9-c0dc-0c1e538c9574" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CV scores: [0.79333333 0.82 0.82333333 0.81 0.80666667]\n" ] } ], "source": [ "# cross_val_score function\n", "cv_scores = cross_val_score(\n", " pipe,\n", " X_train,\n", " y_train,\n", " cv=5,\n", " scoring=\"accuracy\",\n", " n_jobs=-1,\n", ")\n", "print(\"CV scores:\", cv_scores)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 207 }, "id": "s0Ur4GgIAGET", "outputId": "f2fee1ba-d348-472f-9cb2-8c5d62a776ba" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 648 candidates, totalling 3240 fits\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "GridSearchCV(cv=5,\n", " estimator=Pipeline(steps=[('scaler', StandardScaler()),\n", " ('clf',\n", " plq_Ridge_Classifier(loss={'name': 'svm'}))]),\n", " n_jobs=-1,\n", " param_grid={'clf__C': [0.1, 1.0, 3.0],\n", " 'clf__class_weight': [None, 'balanced',\n", " {0: 1.0, 1: 2.0}],\n", " 'clf__constraint': [[], [{'name': 'nonnegative'}],\n", " [{'name': 'fair', 'sen_idx': [0],\n", " 'tol_sen': 0.1}]],\n", " 'clf__fit_intercept': [True, False],\n", " 'clf__intercept_scaling': [0.5, 1.0, 2.0],\n", " 'clf__loss': [{'name': 'svm'}, {'name': 'sSVM'}],\n", " 'clf__max_iter': [5000, 10000]},\n", " scoring='accuracy', verbose=1)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# perform GridSearchCV to tune the hyperparameter\n", "grid = GridSearchCV(\n", " estimator=pipe,\n", " param_grid=param_grid,\n", " scoring=\"accuracy\",\n", " cv=5,\n", " n_jobs=-1,\n", " refit=True,\n", " verbose=1,\n", ")\n", "\n", "grid.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "pD_uySLxBcKx", "outputId": "3cbff9f2-8db1-487e-db5e-1ec4c66508c5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'clf__C': 0.1, 'clf__class_weight': None, 'clf__constraint': [{'name': 'fair', 'sen_idx': [0], 'tol_sen': 0.1}], 'clf__fit_intercept': True, 'clf__intercept_scaling': 1.0, 'clf__loss': {'name': 'svm'}, 'clf__max_iter': 5000}\n", "Best CV accuracy: 0.8146666666666667\n" ] } ], "source": [ "print(\"Best params:\", grid.best_params_)\n", "print(\"Best CV accuracy:\", grid.best_score_)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nbddbX5cBkCa", "outputId": "3859879e-d801-43b7-84c5-3327e589aa4c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test accuracy: 0.802\n", "\n", "Classification report:\n", " 
precision recall f1-score support\n", "\n", " 0 0.8094 0.9370 0.8685 349\n", " 1 0.7708 0.4901 0.5992 151\n", "\n", " accuracy 0.8020 500\n", " macro avg 0.7901 0.7135 0.7339 500\n", "weighted avg 0.7978 0.8020 0.7872 500\n", "\n", "Confusion matrix:\n", " [[327 22]\n", " [ 77 74]]\n" ] } ], "source": [ "# use the refitted best estimator to predict on the test set\n", "best_model = grid.best_estimator_\n", "y_pred = best_model.predict(X_test)\n", "test_acc = accuracy_score(y_test, y_pred)\n", "\n", "print(\"Test accuracy:\", test_acc)\n", "print(\"\\nClassification report:\\n\", classification_report(y_test, y_pred, digits=4))\n", "print(\"Confusion matrix:\\n\", confusion_matrix(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": { "id": "8tYkW_az8-ra" }, "source": [ "#### Regression Example\n", "\n", "Here we show a regression example that uses `Pipeline`, `cross_val_score`, and `GridSearchCV`.\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "6oNtcqzXHdof" }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.datasets import make_regression\n", "from sklearn.metrics import mean_squared_error, r2_score\n", "from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "jtqyQTx8HeJb" }, "outputs": [], "source": [ "# generate the data\n", "X, y = make_regression(n_samples=1500, n_features=15, n_informative=10, noise=10.0, random_state=42)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "W0YrpRYUHgpd" }, "outputs": [], "source": [ "from rehline import plq_Ridge_Regressor\n", "\n", "# set the pipeline\n", "pipe = Pipeline(\n", " [\n", " (\"scaler\", StandardScaler()),\n", " (\"reg\", plq_Ridge_Regressor(loss={\"name\": \"QR\", \"qt\": 
0.5})),\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "UK3mHaLnHjBf" }, "outputs": [], "source": [ "# set the param_grid\n", "param_grid = {\n", " \"reg__loss\": [\n", " {\"name\": \"QR\", \"qt\": 0.5},\n", " {\"name\": \"huber\", \"tau\": 1.0}, # Huber needs tau\n", " {\"name\": \"SVR\", \"epsilon\": 0.1}, # SVR needs epsilon\n", " ],\n", " \"reg__C\": [0.1, 1.0, 10.0],\n", " \"reg__fit_intercept\": [True, False],\n", " \"reg__intercept_scaling\": [0.5, 1.0],\n", " \"reg__max_iter\": [5000, 8000],\n", " \"reg__constraint\": [\n", " [], # no constraint\n", " [{\"name\": \"nonnegative\"}],\n", " [{\"name\": \"fair\", \"sen_idx\": [0], \"tol_sen\": 0.1}],\n", " ],\n", "}" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Ulsm5DtPHluB", "outputId": "fa5e76b3-f7bb-4a85-82cd-ce7d9b624915" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CV R^2 scores: [0.99578266 0.99573973 0.99608371 0.99525645 0.9949942 ]\n", "Mean CV R^2: 0.9955713512215377\n" ] } ], "source": [ "# cross_val_score function\n", "\n", "cv_scores = cross_val_score(\n", " pipe,\n", " X_train,\n", " y_train,\n", " cv=5,\n", " scoring=\"r2\",\n", " n_jobs=-1,\n", ")\n", "print(\"CV R^2 scores:\", cv_scores)\n", "print(\"Mean CV R^2:\", np.mean(cv_scores))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 243 }, "id": "19c8A6KqHn4u", "outputId": "c26d6e76-9316-458f-ff80-ef9dd70ef473" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 216 candidates, totalling 1080 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.12/dist-packages/rehline/_class.py:419: ConvergenceWarning: ReHLine failed to converge, increase the number of iterations: `max_iter`.\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "GridSearchCV(cv=5,\n", " estimator=Pipeline(steps=[('scaler', StandardScaler()),\n", " ('reg', plq_Ridge_Regressor())]),\n", " n_jobs=-1,\n", " param_grid={'reg__C': [0.1, 1.0, 10.0],\n", " 'reg__constraint': [[], [{'name': 'nonnegative'}],\n", " [{'name': 'fair', 'sen_idx': [0],\n", " 'tol_sen': 0.1}]],\n", " 'reg__fit_intercept': [True, False],\n", " 'reg__intercept_scaling': [0.5, 1.0],\n", " 'reg__loss': [{'name': 'QR', 'qt': 0.5},\n", " {'name': 'huber', 'tau': 1.0},\n", " {'epsilon': 0.1, 'name': 'SVR'}],\n", " 'reg__max_iter': [5000, 8000]},\n", " scoring='r2', verbose=1)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use GridSearchCV to tune the hyperparameters\n", "\n", "grid = GridSearchCV(\n", " estimator=pipe,\n", " param_grid=param_grid,\n", " scoring=\"r2\",\n", " cv=5,\n", " n_jobs=-1,\n", " refit=True,\n", " verbose=1,\n", ")\n", "\n", "grid.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_JhyOqu4Hp6G", "outputId": "e70ab254-015b-4327-82a6-fe235a333f6d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'reg__C': 10.0, 'reg__constraint': [{'name': 'nonnegative'}], 'reg__fit_intercept': True, 'reg__intercept_scaling': 1.0, 'reg__loss': {'name': 'SVR', 'epsilon': 0.1}, 'reg__max_iter': 8000}\n", "Best CV R^2: 0.9967851378070526\n" ] } ], "source": [ "# print the best parameters and the best CV R^2 score\n", "print(\"Best params:\", grid.best_params_)\n", "print(\"Best CV R^2:\", grid.best_score_)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "s_Hp2L0IHtMD", "outputId": "be85ae0b-079f-49e9-823f-8184ca93c838" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test R^2: 0.9968147697852413\n", "Test MSE: 103.43336817904354\n" ] } ], "source": [ "# use the 
refitted best estimator to predict on the test set\n", "best_model = grid.best_estimator_\n", "y_pred = best_model.predict(X_test)\n", "\n", "print(\"Test R^2:\", r2_score(y_test, y_pred))\n", "print(\"Test MSE:\", mean_squared_error(y_test, y_pred))" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }