MAE¶

Regularized MAE (mean absolute error) regression solves the following optimization problem:

\[\min_{\mathbf{\beta} \in \mathbb{R}^d} C \sum_{i=1}^n |y_i - \mathbf{x}_i^\top \mathbf{\beta}| + \frac{1}{2}\|\mathbf{\beta}\|_2^2,\]

where \(\mathbf{x}_i \in \mathbb{R}^d\) is a feature vector, \(y_i \in \mathbb{R}\) is the response variable, and \(C > 0\) controls the trade-off between the absolute loss and the ridge penalty.
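To make the objective concrete, the following minimal sketch evaluates it with NumPy for a given coefficient vector. The function name `regularized_mae_objective` and the toy data are illustrative assumptions and are not part of rehline.

```python
import numpy as np


def regularized_mae_objective(beta, X, y, C=1.0):
    """Evaluate C * sum_i |y_i - x_i^T beta| + 0.5 * ||beta||_2^2."""
    residuals = y - X @ beta
    return C * np.sum(np.abs(residuals)) + 0.5 * np.dot(beta, beta)


# Tiny hand-checkable example (values chosen for illustration only).
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_toy = np.array([1.0, 2.0, 2.0])
beta_toy = np.array([1.0, 1.0])

# Residuals are [0, 1, 0], so the loss term is 1 and the ridge term is 0.5 * 2 = 1.
objective_value = regularized_mae_objective(beta_toy, X_toy, y_toy, C=1.0)
print(objective_value)
```

In the formula as written, the loss is summed rather than averaged over the samples, so for a fixed C the loss term gains weight relative to the penalty as n grows.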

Note. Since the absolute loss is a piecewise linear-quadratic (PLQ) function, this problem can be solved with rehline.plq_Ridge_Regressor. This class wraps plqERM_Ridge as a regressor that is compatible with the scikit-learn API.
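One way to see why the absolute loss is piecewise linear is that it can be written as a sum of two ReLU terms, \(|r| = \max(r, 0) + \max(-r, 0)\). The sketch below only checks this identity numerically; it is not meant to describe how rehline represents the loss internally, and the helper `relu` is defined here for illustration.

```python
import numpy as np


def relu(z):
    """Elementwise max(z, 0)."""
    return np.maximum(z, 0.0)


# Residuals on a grid that includes negative values, zero, and positive values.
residuals = np.linspace(-3.0, 3.0, 13)

# |r| written as the sum of two linear pieces, one active on each side of zero.
abs_via_relu = relu(residuals) + relu(-residuals)

matches = np.allclose(abs_via_relu, np.abs(residuals))
print(matches)
```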

[ ]:

## install rehline
%pip install rehline -q

[2]:

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

[3]:

# Simulate data
np.random.seed(42)
scaler_mae = StandardScaler()

n, d = 10000, 5
X, y = make_regression(n_samples=n, n_features=d, noise=1.0)
X = scaler_mae.fit_transform(X)
y = y / y.std()

[4]:

## solve MAE Regression via `plq_Ridge_Regressor`
from rehline import plq_Ridge_Regressor

warnings.filterwarnings("ignore")

clf = plq_Ridge_Regressor(loss={"name": "mae"}, C=1.0)
clf.fit(X=X, y=y)

[4]:

plq_Ridge_Regressor(loss={'name': 'mae'})


[5]:

## plot MAE results
warnings.filterwarnings("ignore", "is_categorical_dtype")

n_sample = 200
X_sample, y_sample = X[:n_sample], y[:n_sample]
mae_sample = clf.predict(X_sample)

df = pd.DataFrame({"x0": X_sample[:, 0], "real_y": y_sample, "mae": mae_sample})
df = df.melt(id_vars="x0")

sns.scatterplot(data=df, x="x0", y="value", hue="variable")
plt.show()

With Pipeline¶

plq_Ridge_Regressor can be used as a step in a scikit-learn Pipeline, so that preprocessing such as feature scaling is fitted and applied together with the regressor.

[6]:

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

[7]:

# Simulate data
np.random.seed(42)

n, d = 10000, 5
X, y = make_regression(n_samples=n, n_features=d, noise=1.0)
y = y / y.std()

[8]:

## solve MAE Regression via `plq_Ridge_Regressor`
from rehline import plq_Ridge_Regressor

pipe = Pipeline([("scaler", StandardScaler()), ("reg", plq_Ridge_Regressor(loss={"name": "mae"}, C=1.0))])
pipe.fit(X=X, y=y)

[8]:

Pipeline(steps=[('scaler', StandardScaler()),
                ('reg', plq_Ridge_Regressor(loss={'name': 'mae'}))])

[9]:

## plot MAE results
warnings.filterwarnings("ignore", "is_categorical_dtype")

n_sample = 200
X_sample, y_sample = X[:n_sample], y[:n_sample]
mae_sample = pipe.predict(X_sample)

df = pd.DataFrame({"x0": X_sample[:, 0], "real_y": y_sample, "mae": mae_sample})
df = df.melt(id_vars="x0")

sns.scatterplot(data=df, x="x0", y="value", hue="variable")
plt.show()
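The effect of placing the scaler inside a Pipeline can be checked directly: fitting the pipeline should give the same predictions as fitting the scaler and the regressor by hand. The sketch below uses scikit-learn's `Ridge` purely as a stand-in regressor, so that it runs without rehline; the data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Pipeline: the scaler and the regressor are fitted in a single call.
demo_pipe = Pipeline([("scaler", StandardScaler()), ("reg", Ridge(alpha=1.0))])
demo_pipe.fit(X_demo, y_demo)

# Manual equivalent: fit the scaler, transform the features, then fit the regressor.
manual_scaler = StandardScaler().fit(X_demo)
manual_reg = Ridge(alpha=1.0).fit(manual_scaler.transform(X_demo), y_demo)

same_predictions = np.allclose(
    demo_pipe.predict(X_demo),
    manual_reg.predict(manual_scaler.transform(X_demo)),
)
print(same_predictions)
```

The practical difference appears under cross-validation: a pipeline refits the scaler on each training fold, whereas scaling the full dataset up front lets statistics from the held-out fold influence the transform.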

Hyperparameter Tuning with GridSearchCV¶

Because plq_Ridge_Regressor follows the scikit-learn API, GridSearchCV can be used to tune its hyperparameters. Here the search runs over the regularization parameter C of the pipeline's "reg" step, which is why the grid key is reg__C.

[10]:

import warnings

from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import GridSearchCV

warnings.filterwarnings("ignore")

# Define the parameter grid to search
param_grid = {"reg__C": [0.1, 1.0, 10.0]}

# Score with MAE; greater_is_better=False makes the scorer return -MAE
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

# Create the GridSearchCV object
grid_mae = GridSearchCV(pipe, param_grid, cv=5, scoring=mae_scorer)

grid_mae.fit(X, y)

# Print the best parameters and scores
print(f"Best params:{grid_mae.best_params_}")
print(f"Best CV Score: {-grid_mae.best_score_:.4f}")

Best params:{'reg__C': 0.1}
Best CV Score: 0.0059
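The minus sign in front of `grid_mae.best_score_` follows from the scorer's sign convention: with `greater_is_better=False`, `make_scorer` returns the negated metric so that larger values are always better. A small sketch with scikit-learn's `DummyRegressor` (used here only as a stand-in model, with made-up data) illustrates this:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_absolute_error

# Features are irrelevant for a mean-predicting dummy model.
X_demo = np.zeros((4, 1))
y_demo = np.array([1.0, 2.0, 3.0, 6.0])

# The model always predicts mean(y) = 3, so the MAE is (2 + 1 + 0 + 3) / 4 = 1.5.
model = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
mae_scorer_demo = make_scorer(mean_absolute_error, greater_is_better=False)

raw_mae = mean_absolute_error(y_demo, model.predict(X_demo))
scorer_value = mae_scorer_demo(model, X_demo, y_demo)  # negated MAE
print(raw_mae, scorer_value)
```

scikit-learn also provides the built-in scoring string `"neg_mean_absolute_error"`, which follows the same convention.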

[11]:

## plot MAE results
n_sample = 200
X_sample, y_sample = X[:n_sample], y[:n_sample]
mae_sample = grid_mae.predict(X_sample)

df = pd.DataFrame({"x0": X_sample[:, 0], "real_y": y_sample, "mae": mae_sample})
df = df.melt(id_vars="x0")

sns.scatterplot(data=df, x="x0", y="value", hue="variable").set_title("MAE(C=0.1)")
plt.show()