# Import data
import pandas as pd
recipes = pd.read_csv('recipe_site_traffic_2212.csv')
recipes.head(5)
| | recipe | calories | carbohydrate | sugar | protein | category | servings | high_traffic |
|---|---|---|---|---|---|---|---|---|
0 | 1 | NaN | NaN | NaN | NaN | Pork | 6 | High |
1 | 2 | 35.48 | 38.56 | 0.66 | 0.92 | Potato | 4 | High |
2 | 3 | 914.28 | 42.68 | 3.09 | 2.88 | Breakfast | 1 | NaN |
3 | 4 | 97.03 | 30.56 | 38.63 | 0.02 | Beverages | 4 | High |
4 | 5 | 27.05 | 1.85 | 0.80 | 0.53 | Beverages | 4 | NaN |
Running recipes.isna().sum() will output a series that shows the number of missing values in each column of the recipes dataframe. The output shows that the columns 'calories', 'carbohydrate', 'sugar', 'protein', and 'high_traffic' have null values.
# Check null values
recipes.isna().sum()
recipe            0
calories         52
carbohydrate     52
sugar            52
protein          52
category          0
servings          0
high_traffic    373
dtype: int64
recipes.dtypes
recipe            int64
calories        float64
carbohydrate    float64
sugar           float64
protein         float64
category         object
servings         object
high_traffic     object
dtype: object
This code uses scikit-learn's SimpleImputer class to impute missing values in four numerical columns: calories, carbohydrate, sugar, and protein. It first creates an instance of SimpleImputer with strategy='mean', which means that missing values in each column will be replaced by the mean of the non-missing values in that column.
Then, the code applies the imputation to each of the four columns by using the fit_transform() method of the imputer object on a subset of the recipes dataframe containing only the column of interest. The fit_transform() method fits the imputer to the column data and transforms the column by imputing missing values.
Finally, the code checks the number of missing values in each of the four columns after the imputation by using the isna().sum() method on a subset of the recipes dataframe containing only these four columns.
# Impute missing data to numerical columns
from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy='mean')
# Calories
recipes[['calories']] = num_imputer.fit_transform(recipes[['calories']])
# Carbohydrate
recipes[['carbohydrate']] = num_imputer.fit_transform(recipes[['carbohydrate']])
# Sugar
recipes[['sugar']] = num_imputer.fit_transform(recipes[['sugar']])
# Protein
recipes[['protein']] = num_imputer.fit_transform(recipes[['protein']])
# Check imputation
recipes[['calories', 'carbohydrate', 'sugar', 'protein']].isna().sum()
calories        0
carbohydrate    0
sugar           0
protein         0
dtype: int64
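Since SimpleImputer computes a separate mean for every column it is given, the four calls above could be collapsed into a single one. A minimal sketch of that alternative (equivalent in effect, not run here):
# Impute all four numerical columns in one call; the imputer still learns one mean per column
num_cols = ['calories', 'carbohydrate', 'sugar', 'protein']
recipes[num_cols] = num_imputer.fit_transform(recipes[num_cols])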
# Check consistency of column category
recipes.category.unique()
array(['Pork', 'Potato', 'Breakfast', 'Beverages', 'One Dish Meal', 'Chicken Breast', 'Lunch/Snacks', 'Chicken', 'Vegetable', 'Meat', 'Dessert'], dtype=object)
Here we clean the servings column by normalizing string values such as '4 as a snack' to their numeric part and converting the column to integers. This ensures that servings contains only numeric values and avoids errors that could arise from mixed data types in the column.
# Servings
recipes.servings.unique()
recipes.servings = recipes.servings.replace({'4 as a snack': '4', '6 as a snack': '6'})
recipes.servings = recipes.servings.apply(int)
recipes.servings.unique()
array([6, 4, 1, 2])
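The explicit replace() above works because only these two free-text variants appear in the dataset. If other variants showed up in future data, a more general approach would be to extract the leading digits instead; a sketch of that alternative (it applies to the raw string column, so it is not needed after the conversion above):
# Pull the leading integer out of values such as '4 as a snack' (hypothetical alternative)
recipes.servings = recipes.servings.astype(str).str.extract(r'(\d+)')[0].astype(int)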
We perform a similar operation on the high_traffic column: missing values are filled with 'Low', and the 'High'/'Low' labels are mapped to 1/0, producing a binary target. This simplifies the high_traffic column for modeling and visualization and ensures that it contains only numeric values.
# High traffic
recipes.high_traffic.unique()
recipes['high_traffic'] = recipes['high_traffic'].fillna('Low')
recipes.high_traffic = recipes.high_traffic.replace({'High': 1, 'Low': 0})
recipes.high_traffic.unique()
array([1, 0])
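Since accuracy is used later to compare models, it is also worth a quick look at how balanced this binary target is (a simple check; output not shown here):
# Share of high- vs low-traffic recipes in the cleaned target
recipes.high_traffic.value_counts(normalize=True)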
We can notice relatively strong relationships between 'high_traffic' and 'calories', 'carbohydrate', and 'sugar'. However, these predictors are also highly correlated with each other. This suggests adopting non-linear models to limit overfitting.
# Create a heatmap of the correlation matrix
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10,10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Correlations of the numeric columns only, excluding the recipe id
sns.heatmap(recipes.select_dtypes('number').drop(columns='recipe').corr(), annot=True, vmax=1, vmin=-1, center=0, cmap=cmap)
plt.show()
For the exploratory analysis we define a function plot_hist() that takes a numerical column col and a number of bins as input parameters. The function then uses the distplot() function from the Seaborn library to create a histogram of the values in col. The bins parameter specifies the number of bins to use in the histogram.
Then, we call the function, creating a histogram plot of the calories column in the recipes dataframe. The resulting plot can be useful for visualizing the distribution of calorie values in the dataset and identifying any potential outliers or patterns in the data.
Here, we can notice that the distribution is right skewed, resembling a chi-square distribution.
sns.set_style("darkgrid")
def plot_hist(col, bins=20):
    sns.distplot(col, bins=bins)
    plt.show()
plot_hist(recipes.calories)
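Note that distplot() is deprecated in recent seaborn releases (0.11 and later). On newer versions, an equivalent helper could use histplot() instead; a sketch with a hypothetical name:
def plot_hist_v2(col, bins=20):
    # histplot with kde=True reproduces the histogram-plus-density look of distplot
    sns.histplot(col, bins=bins, kde=True)
    plt.show()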
This code defines a function plot_boxplot() that takes a numerical column col, a showfliers boolean parameter (default False), and a whis parameter specifying the whisker range (default [5, 95]) as input parameters. The function then uses the boxplot() function from the Seaborn library to create a box plot of the values in col.
The showfliers parameter controls whether or not to show outliers in the plot. The whis parameter controls the range of the whiskers in the plot, with the default values indicating the 5th and 95th percentiles.
The function then adds annotations to the plot showing the values of the 5th, 25th, 50th (median), 75th, and 95th percentiles of the data using the quantile() function from the pandas library.
The plt.show() function displays the box plot.
The last line calls plot_boxplot() on the calories column of the recipes dataframe, creating a box plot with whiskers extending from the 5th to the 95th percentile, outliers hidden by default, and annotations marking the 5th, 25th, 50th, 75th, and 95th percentiles.
This plot is useful for visualizing the distribution of calorie values, identifying potential outliers, and summarizing the spread of the data through its percentiles.
def plot_boxplot(col, showfliers=False, whis=[5, 95]):
    sns.boxplot(col, showfliers=showfliers, whis=whis)
    ax = plt.gca()
    # Annotate the 5th, 25th, 50th (median), 75th and 95th percentiles
    q_05 = col.quantile(0.05)
    q_25 = col.quantile(0.25)
    q_50 = col.median()
    q_75 = col.quantile(0.75)
    q_95 = col.quantile(0.95)
    labels = ['P5', 'Q1', 'Median', 'Q3', 'P95']
    for i, (label, quantile) in enumerate(zip(labels, [q_05, q_25, q_50, q_75, q_95])):
        ax.text(0.45 + i * 0.12 * 0.5, quantile, f'{label}\n{quantile:.3f}', ha='center', va='center', fontweight='bold')
    plt.show()
plot_boxplot(recipes.calories)
This function takes in a pandas dataframe (data), a column name (col), and a new column name (new_col_name) to store the binned values. It also takes in two optional arguments q_05 and q_25 to determine whether or not to use the 5th and 25th percentile as bin edges. If q_05 is True, then it uses the 5th percentile as the first bin edge. If q_25 is True, then it uses the 25th percentile as the first bin edge. If both are False, then it uses the median as the first bin edge.
The function then calculates the 75th and 95th percentile if needed and creates bin edges accordingly. The data is then binned using pd.cut() with the specified bin edges and labels. Finally, a seaborn distplot is used to plot the distribution of the newly binned column.
In this specific example, the function is called with the recipes dataframe, the calories column, and a new column name of binned_calories. By default, both q_05 and q_25 are True, which means that the function will use both the 5th and 25th percentiles as bin edges. The resulting plot shows the distribution of the binned calorie values.
By binning, we create more interpretable categories and shrink the effect of outliers on the model. We will perform this operation on all the numerical columns.
# Binning into categories
import numpy as np
def bin_plot_col(data, col, new_col_name, q_05=True, q_25=True):
    if q_05 is False and q_25 is False:
        q_50 = data[col].median()
        q_75 = data[col].quantile(0.75)
        q_95 = data[col].quantile(0.95)
        bins = [0, q_50, q_75, q_95, np.inf]
        data[new_col_name] = pd.cut(data[col], bins=bins, labels=[1, 2, 3, 4])
    elif q_05 is True and q_25 is False:
        q_05 = data[col].quantile(0.05)
        q_50 = data[col].median()
        q_75 = data[col].quantile(0.75)
        q_95 = data[col].quantile(0.95)
        bins = [0, q_05, q_50, q_75, q_95, np.inf]
        data[new_col_name] = pd.cut(data[col], bins=bins, labels=[1, 2, 3, 4, 5])
    elif q_05 is False and q_25 is True:
        q_25 = data[col].quantile(0.25)
        q_50 = data[col].median()
        q_75 = data[col].quantile(0.75)
        q_95 = data[col].quantile(0.95)
        bins = [0, q_25, q_50, q_75, q_95, np.inf]
        data[new_col_name] = pd.cut(data[col], bins=bins, labels=[1, 2, 3, 4, 5])
    else:
        q_05 = data[col].quantile(0.05)
        q_25 = data[col].quantile(0.25)
        q_50 = data[col].median()
        q_75 = data[col].quantile(0.75)
        q_95 = data[col].quantile(0.95)
        bins = [0, q_05, q_25, q_50, q_75, q_95, np.inf]
        data[new_col_name] = pd.cut(data[col], bins=bins, labels=[1, 2, 3, 4, 5, 6])
    sns.distplot(data[new_col_name])
    plt.show()
bin_plot_col(recipes, 'calories', 'binned_calories')
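The four branches above differ only in which quantiles are used as inner bin edges, so the same behavior can be expressed more compactly by building the quantile list programmatically. A sketch under that assumption (hypothetical helper name):
def bin_plot_col_compact(data, col, new_col_name, q_05=True, q_25=True):
    # Optional 5th/25th percentiles, then always the 50th, 75th and 95th
    qs = ([0.05] if q_05 else []) + ([0.25] if q_25 else []) + [0.5, 0.75, 0.95]
    bins = [0] + [data[col].quantile(q) for q in qs] + [np.inf]
    data[new_col_name] = pd.cut(data[col], bins=bins, labels=list(range(1, len(bins))))
    sns.distplot(data[new_col_name])
    plt.show()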
The column 'carbohydrate' is right skewed.
plot_hist(recipes.carbohydrate)
plot_boxplot(recipes.carbohydrate)
# Binning into categories
bin_plot_col(recipes, 'carbohydrate', 'binned_carbohydrate')
plot_hist(recipes.sugar, bins=10)
plot_boxplot(recipes.sugar)
bin_plot_col(recipes, 'sugar', 'binned_sugar', q_05=False)
plot_hist(recipes.protein)
plot_boxplot(recipes.protein)
bin_plot_col(recipes, 'protein', 'binned_protein', q_05=False)
The bar plot shows the number of recipes in each category. The most common categories are Breakfast, Chicken Breast, and Beverages.
recipes.category.value_counts().plot(kind='barh')
The bar plot of the 'servings' column shows that the most common value is 4 servings, with almost twice as many recipes as the second most common (6 servings).
recipes.servings.value_counts().plot(kind='bar')
Now we can go deeper with feature engineering, creating new columns derived from existing ones. This is done to surface potential hidden patterns in the data, which may improve the accuracy of the model.
The first calculation is to calculate the amount of fat in each recipe based on the calories, carbohydrates, and protein. This calculation uses the formula fat = (calories - carbohydrate * 4 - protein * 4) / 9, which assumes that fat provides 9 calories per gram, while carbohydrates and protein provide 4 calories per gram. The resulting values are then clipped at 0, so that negative values are set to 0.
The second calculation is to calculate the calories per serving of each recipe, which is done by dividing the total calories by the number of servings.
The third to sixth calculations compute the share of each macronutrient (carbohydrate, sugar, protein, and fat) in each recipe relative to the total grams of macronutrients, as macronutrient / (carbohydrate + sugar + protein + fat).
Finally, the seventh to tenth calculations estimate the share of calories coming from each macronutrient. These are computed as (macronutrient * 0.4) / calories for carbohydrate, sugar, and protein, and (fat * 0.9) / calories for fat; the factors mirror the usual 4 kcal/g and 9 kcal/g conversions scaled down by a factor of 10, and since all four columns use the same scaling, comparisons between recipes are unaffected. The resulting values are stored in new columns of the recipes dataframe.
recipes['fat'] = (recipes.calories - recipes.carbohydrate * 4 - recipes.protein * 4) / 9
recipes.fat = recipes.fat.apply(lambda x: 0 if x < 0 else x)
recipes['calories_per_serving'] = recipes.calories / recipes.servings
recipes['carb_percent'] = recipes.carbohydrate / (recipes.carbohydrate + recipes.sugar + recipes.protein + recipes.fat)
recipes['sugar_percent'] = recipes.sugar / (recipes.carbohydrate + recipes.sugar + recipes.protein + recipes.fat)
recipes['protein_percent'] = recipes.protein / (recipes.carbohydrate + recipes.sugar + recipes.protein + recipes.fat)
recipes['fat_percent'] = recipes.fat / (recipes.carbohydrate + recipes.sugar + recipes.protein + recipes.fat)
recipes['carb_calories_percent'] = recipes.carbohydrate*0.4 / recipes.calories
recipes['sugar_calories_percent'] = recipes.sugar*0.4 / recipes.calories
recipes['protein_calories_percent'] = recipes.protein*0.4 / recipes.calories
recipes['fat_calories_percent'] = recipes.fat*0.9 / recipes.calories
recipes[['fat','calories_per_serving', 'carb_percent', 'sugar_percent', 'protein_percent', 'fat_percent', 'carb_calories_percent', 'sugar_calories_percent', 'protein_calories_percent', 'fat_calories_percent']].head()
| | fat | calories_per_serving | carb_percent | sugar_percent | protein_percent | fat_percent | carb_calories_percent | sugar_calories_percent | protein_calories_percent | fat_calories_percent |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.118145 | 72.656533 | 0.388009 | 0.100091 | 0.267187 | 0.244714 | 0.032179 | 0.008301 | 0.022158 | 0.045663 |
1 | 0.000000 | 8.870000 | 0.960638 | 0.016442 | 0.022920 | 0.000000 | 0.434724 | 0.007441 | 0.010372 | 0.000000 |
2 | 81.337778 | 914.280000 | 0.328339 | 0.023771 | 0.022156 | 0.625734 | 0.018673 | 0.001352 | 0.001260 | 0.080067 |
3 | 0.000000 | 24.257500 | 0.441555 | 0.558156 | 0.000289 | 0.000000 | 0.125982 | 0.159250 | 0.000082 | 0.000000 |
4 | 1.947778 | 6.762500 | 0.360780 | 0.156013 | 0.103359 | 0.379848 | 0.027357 | 0.011830 | 0.007837 | 0.064806 |
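As a quick sanity check of the fat formula against this table, rows 1 and 2 of the earlier head() output can be reproduced by hand (a verification sketch using those figures):
# Row 2: (914.28 - 4*42.68 - 4*2.88) / 9 = 732.04 / 9 ≈ 81.34 grams of fat
print((914.28 - 4 * 42.68 - 4 * 2.88) / 9)
# Row 1: (35.48 - 4*38.56 - 4*0.92) / 9 is negative, so it is clipped to 0
print(max((35.48 - 4 * 38.56 - 4 * 0.92) / 9, 0))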
The new column fat is right skewed.
plot_hist(recipes.fat)
25% of the values are equal to 0, therefore the first bin will be from 0 to the median.
plot_boxplot(recipes.fat)
After binning, the previously right-skewed distribution more closely resembles a normal distribution.
bin_plot_col(recipes, 'fat', 'binned_fat', q_05=False, q_25=False)
## Calories Per Serving
plot_hist(recipes.calories_per_serving)
plot_boxplot(recipes.calories_per_serving)
bin_plot_col(recipes, 'calories_per_serving', 'binned_calories_per_serving')
plot_hist(recipes.carb_percent, bins=15)
plot_boxplot(recipes.carb_percent)
bin_plot_col(recipes, 'carb_percent', 'binned_carb_percent')
plot_hist(recipes.sugar_percent)
plot_boxplot(recipes.sugar_percent)
bin_plot_col(recipes, 'sugar_percent', 'binned_sugar_percent', q_05=False, q_25=False)
plot_hist(recipes.protein_percent)
plot_boxplot(recipes.protein_percent)
bin_plot_col(recipes, 'protein_percent', 'binned_protein_percent', q_05=False, q_25=False)
plot_hist(recipes.fat_percent)
plot_boxplot(recipes.fat_percent)
bin_plot_col(recipes, 'fat_percent', 'binned_fat_percent', q_05=False, q_25=False)
plot_hist(recipes.carb_calories_percent, bins=40)
plot_boxplot(recipes.carb_calories_percent)
bin_plot_col(recipes, 'carb_calories_percent', 'binned_carb_calories_percent', q_05=False, q_25=False)
plot_hist(recipes.sugar_calories_percent)
plot_boxplot(recipes.sugar_calories_percent)
bin_plot_col(recipes, 'sugar_calories_percent', 'binned_sugar_calories_percent', q_05=False, q_25=False)
plot_hist(recipes.protein_calories_percent, bins=30)
plot_boxplot(recipes.protein_calories_percent)
bin_plot_col(recipes, 'protein_calories_percent', 'binned_protein_calories_percent', q_05=False, q_25=False)
plot_hist(recipes.fat_calories_percent)
plot_boxplot(recipes.fat_calories_percent)
bin_plot_col(recipes, 'fat_calories_percent', 'binned_fat_calories_percent', q_05=False, q_25=False)
Most of these distributions are right skewed, showing that most recipes have low values for these features.
Then, we create a new DataFrame data by applying one-hot encoding to the categorical features in the recipes DataFrame, as well as the binning features that were added earlier using the bin_plot_col function.
The pd.get_dummies function is used to perform one-hot encoding on the specified columns, and the drop_first=True argument is used to drop the first column of each set of dummy variables to avoid multicollinearity issues.
After creating data, several columns are dropped using the drop function, including all the original numerical features that were used to derive the binned features, as well as the binned version of calories_per_serving, which was likely dropped because it was highly correlated with the calories feature.
Finally, the columns.to_list() method is used to display a list of all the remaining columns in data.
data = pd.get_dummies(recipes, columns=['category', 'servings', 'binned_calories', 'binned_carbohydrate',
'binned_sugar', 'binned_protein', 'binned_fat',
'binned_carb_percent', 'binned_sugar_percent', 'binned_protein_percent',
'binned_fat_percent', 'binned_carb_calories_percent',
'binned_sugar_calories_percent', 'binned_protein_calories_percent',
'binned_fat_calories_percent'], drop_first=True)
data.drop(['recipe','calories', 'carbohydrate', 'sugar', 'protein', 'fat',
'calories_per_serving', 'carb_percent', 'sugar_percent',
'protein_percent', 'fat_percent', 'carb_calories_percent',
'sugar_calories_percent', 'protein_calories_percent', 'fat_calories_percent',
'binned_calories_per_serving'], axis=1, inplace=True)
data.columns.to_list()
['high_traffic', 'category_Breakfast', 'category_Chicken', 'category_Chicken Breast', 'category_Dessert', 'category_Lunch/Snacks', 'category_Meat', 'category_One Dish Meal', 'category_Pork', 'category_Potato', 'category_Vegetable', 'servings_2', 'servings_4', 'servings_6', 'binned_calories_2', 'binned_calories_3', 'binned_calories_4', 'binned_calories_5', 'binned_calories_6', 'binned_carbohydrate_2', 'binned_carbohydrate_3', 'binned_carbohydrate_4', 'binned_carbohydrate_5', 'binned_carbohydrate_6', 'binned_sugar_2', 'binned_sugar_3', 'binned_sugar_4', 'binned_sugar_5', 'binned_protein_2', 'binned_protein_3', 'binned_protein_4', 'binned_protein_5', 'binned_fat_2', 'binned_fat_3', 'binned_fat_4', 'binned_carb_percent_2', 'binned_carb_percent_3', 'binned_carb_percent_4', 'binned_carb_percent_5', 'binned_carb_percent_6', 'binned_sugar_percent_2', 'binned_sugar_percent_3', 'binned_sugar_percent_4', 'binned_protein_percent_2', 'binned_protein_percent_3', 'binned_protein_percent_4', 'binned_fat_percent_2', 'binned_fat_percent_3', 'binned_fat_percent_4', 'binned_carb_calories_percent_2', 'binned_carb_calories_percent_3', 'binned_carb_calories_percent_4', 'binned_sugar_calories_percent_2', 'binned_sugar_calories_percent_3', 'binned_sugar_calories_percent_4', 'binned_protein_calories_percent_2', 'binned_protein_calories_percent_3', 'binned_protein_calories_percent_4', 'binned_fat_calories_percent_2', 'binned_fat_calories_percent_3', 'binned_fat_calories_percent_4']
We split the data into training and testing sets using the train_test_split() function from Scikit-learn. The data DataFrame is first separated into the predictor variables (X) and the target variable (y). The high_traffic column is used as the target variable, while all other columns are used as predictors.
# Train-test split
from sklearn.model_selection import train_test_split
X = data.drop(['high_traffic'], axis=1)
y = data['high_traffic']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)
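As a quick sanity check (output not shown), stratify=y should keep the share of high-traffic recipes roughly equal across the two splits:
# Proportion of high-traffic recipes in the train and test splits
print(y_train.mean(), y_test.mean())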
Now we can run a genetic programming search using the TPOTClassifier from the TPOT library.
The TPOTClassifier is a machine learning pipeline optimizer that uses genetic programming to search for the best machine learning pipeline for a given problem. It searches through a large space of possible pipelines using a combination of genetic operators, such as mutation, crossover, and selection.
In this case, the TPOTClassifier is being used to optimize a classification model for predicting whether a recipe will generate high traffic on a cooking website. The input data is the X_train and y_train datasets, which were split from the original dataset using a 75/25 train-test split. The generations, population_size, offspring_size, and cv parameters control the size and complexity of the search space, the size of the population, and the number of cross-validation folds used during training.
After fitting the TPOTClassifier on the training data, the score() method is used to evaluate the accuracy of the resulting pipeline on the test data (X_test and y_test). The accuracy score is printed to the console.
Here, the best pipeline found by the search stacks an SGDClassifier with an ExtraTreesClassifier, reaching roughly 79% accuracy on the test data.
# Genetic programming search with TPOT
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=10, population_size=5,
verbosity=2, offspring_size=5,
scoring='accuracy', cv=5)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
Generation 1 - Current best internal CV score: 0.6943661971830986
Generation 2 - Current best internal CV score: 0.6943661971830986
Generation 3 - Current best internal CV score: 0.7352112676056338
Generation 4 - Current best internal CV score: 0.7352112676056338
Generation 5 - Current best internal CV score: 0.7352112676056338
Generation 6 - Current best internal CV score: 0.7352112676056338
Generation 7 - Current best internal CV score: 0.7352112676056338
Generation 8 - Current best internal CV score: 0.7352112676056338
Generation 9 - Current best internal CV score: 0.7352112676056338
Generation 10 - Current best internal CV score: 0.7352112676056338
Best pipeline: ExtraTreesClassifier(SGDClassifier(input_matrix, alpha=0.01, eta0=0.01, fit_intercept=True, l1_ratio=0.5, learning_rate=constant, loss=hinge, penalty=elasticnet, power_t=0.0), bootstrap=True, criterion=entropy, max_features=0.2, min_samples_leaf=7, min_samples_split=2, n_estimators=100)
0.7890295358649789
tpot.fitted_pipeline_
Pipeline(steps=[('stackingestimator', StackingEstimator(estimator=SGDClassifier(alpha=0.01, eta0=0.01, l1_ratio=0.5, learning_rate='constant', penalty='elasticnet', power_t=0.0))), ('extratreesclassifier', ExtraTreesClassifier(bootstrap=True, criterion='entropy', max_features=0.2, min_samples_leaf=7))])
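TPOT can also export the best pipeline it found as a standalone Python script, which makes it easier to reproduce outside the notebook (the filename below is just an example):
# Write the winning pipeline to a script file
tpot.export('tpot_best_pipeline.py')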
Following the pipeline suggested by the genetic search, we build a StackingClassifier, an ensemble learning method that combines one or more base classifiers with a final meta-classifier to improve the overall performance of the model.
First, it defines a base estimator, which is an instance of SGDClassifier, a linear classifier that uses stochastic gradient descent for optimization. The base estimator is then passed as input to the StackingClassifier, along with a final estimator, which is an ExtraTreesClassifier, an ensemble classifier that uses decision trees and random subsets of features to improve performance.
The StackingClassifier is then trained on the training data using the .fit() method, and predictions are made on the test data using the .predict() method. The accuracy of the model on the test data is computed using the accuracy_score function from scikit-learn, and printed to the console.
The accuracy for this model is 82.28%
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.metrics import accuracy_score
base_estimator = SGDClassifier(
alpha=0.01,
eta0=0.01,
fit_intercept=True,
l1_ratio=0.5,
learning_rate='constant',
loss='hinge',
penalty='elasticnet',
power_t=0.0
)
estimators = [('base_estimator', base_estimator)]
clf = StackingClassifier(
estimators=estimators,
final_estimator=ExtraTreesClassifier(
bootstrap=True,
criterion='entropy',
max_features=0.2,
min_samples_leaf=7,
min_samples_split=2,
n_estimators=100,
verbose=1
),
verbose=1
)
# Fit the classifier on the training data
clf.fit(X_train, y_train)
# Make predictions on the test and training data
y_pred_sgd = clf.predict(X_test)
y_train_sgd = clf.predict(X_train)
accuracy_sgd = accuracy_score(y_test, y_pred_sgd)
print("Accuracy on test data: {}%".format(accuracy_sgd.round(4)*100))
Accuracy on test data: 82.28%
Now, we can try to tune the SGDClassifier using GridSearch.
First, the base estimator and the estimators for the stacking classifier are defined. Then, a stacking classifier is created with the base estimator and an ExtraTreesClassifier as the final estimator.
Next, a parameter grid is defined for the hyperparameter search space. The GridSearchCV object is created with the stacking classifier, the parameter grid, 5-fold cross-validation, 'accuracy' as the scoring metric, and n_jobs set to -1 to use all available cores.
The GridSearchCV object is fit to the training data, and the best classifier and its parameters are obtained. The accuracy score of the best classifier on the test data is also computed.
The accuracy score is 81.86%
# Grid Search with SGDClassifier
from sklearn.model_selection import GridSearchCV
# Define the base estimator
base_estimator = SGDClassifier(
alpha=0.01,
eta0=0.01,
fit_intercept=True,
l1_ratio=0.5,
learning_rate='constant',
loss='hinge',
penalty='elasticnet',
power_t=0.0
)
# Define the estimators for the stacking classifier
estimators = [('base_estimator', base_estimator)]
# Define the stacking classifier
clf = StackingClassifier(
estimators=estimators,
final_estimator=ExtraTreesClassifier(
bootstrap=True,
criterion='entropy',
max_features=0.2,
min_samples_leaf=7,
min_samples_split=2,
n_estimators=100,
verbose=1
),
verbose=1
)
# Define the hyperparameter search space
param_grid = {
'base_estimator__alpha': [0.005, 0.01, 0.1],
'base_estimator__eta0': [0.005, 0.01, 0.1],
'base_estimator__penalty': ['l1', 'l2', 'elasticnet'],
'final_estimator__n_estimators': [50, 100, 150],
'final_estimator__max_features': [0.15, 0.2, 0.25],
'final_estimator__min_samples_leaf': [6, 7, 8]
}
# Define the GridSearchCV object
grid_search = GridSearchCV(
estimator=clf,
param_grid=param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)
# Get the best classifier and its parameters
best_clf = grid_search.best_estimator_
best_params = grid_search.best_params_
y_pred_sgd_cv = best_clf.predict(X_test)
y_train_sgd_cv = best_clf.predict(X_train)
accuracy_sgd_cv = accuracy_score(y_test, y_pred_sgd_cv)
# Print the best parameters and the score of the best classifier on the test data
print('Best parameters:', best_params)
print('Accuracy score: {}%'.format(round(accuracy_sgd_cv, 4)*100))
Even though the genetic search recommended an SGDClassifier-based pipeline, it is good practice to check other models as well. Here we train an XGBoost classifier.
xgb.XGBClassifier is instantiated with a binary logistic objective and eval_metric='error' (classification error). Note that optimizer='sgd' is not a recognized XGBoost parameter: XGBoost fits gradient-boosted trees, and the warning in the output below confirms that the parameter is simply ignored. The classifier is trained on the training data with .fit(X_train, y_train), labels for the test data are predicted with .predict(X_test), and the accuracy on the test data is computed with accuracy_score(y_test, y_pred_xgb) and printed.
# Train an XGBoost classifier
import xgboost as xgb
from sklearn.metrics import accuracy_score
# Instantiate the XGBoost classifier (note: 'optimizer' is not a recognized XGBoost parameter and is ignored)
clf = xgb.XGBClassifier(objective='binary:logistic', optimizer='sgd', eval_metric='error')
# Train classifier on the training data
clf.fit(X_train, y_train)
# Predict labels for the test data
y_pred_xgb = clf.predict(X_test)
y_train_xgb = clf.predict(X_train)
# Calculate accuracy score on the test data
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print("Accuracy on test data: {}%".format(round(accuracy_xgb, 4)*100))
[20:08:58] WARNING: ../src/learner.cc:627: Parameters: { "optimizer" } might not be used. This could be a false alarm, with some parameters getting used by language bindings but then being mistakenly passed down to XGBoost core, or some parameter actually being used but getting flagged wrongly here. Please open an issue if you find any such cases.
Accuracy on test data: 70.46%
Now, we use the same model, but with cross-validation using early stopping to prevent overfitting. Here is a breakdown of what each part of the code does:
recipes_dmatrix = xgb.DMatrix(data=X_train, label=y_train) creates a DMatrix object from the training data, which is a data structure used by XGBoost to optimize memory usage and speed up training.
params is a dictionary containing hyperparameters for the XGBoost classifier, including the objective function to optimize, the maximum depth of each tree, the learning rate, gamma (a regularization parameter), and reg_lambda (L2 regularization).
xgb.cv() is used to perform cross-validation. The parameters passed to this function include the DMatrix object created from the training data, the hyperparameters defined in the params dictionary, the number of boosting rounds to perform, and the number of folds for cross-validation.
early_stopping_rounds specifies the number of rounds of boosting that can occur without any improvement in the validation error. If the validation error does not improve after this many rounds, then the training stops early. metrics specifies the metric to use for evaluation. In this case, it is the error rate.
as_pandas=True returns the results as a Pandas DataFrame. best_accuracy = 1 - cv_results['test-error-mean'].min() calculates the best accuracy score based on the minimum test error rate from the cross-validation results.
best_rounds = cv_results['test-error-mean'].idxmin() + 1 calculates the number of boosting rounds required to achieve the best accuracy score. Finally, the best accuracy score and number of boosting rounds are printed using formatted strings.
#Cross validation in XGBoost
from sklearn.metrics import accuracy_score
# Create DMatrix object for training data
recipes_dmatrix = xgb.DMatrix(data=X_train, label=y_train)
# Set hyperparameters
params = {
'objective': 'binary:logistic',
'max_depth': 4,
'learning_rate': 0.1,
'gamma': 0.1,
'reg_lambda': 1
}
# Perform cross-validation with early stopping
cv_results = xgb.cv(
dtrain=recipes_dmatrix,
params=params,
num_boost_round=1000,
early_stopping_rounds=100,
nfold=4,
metrics='error',
as_pandas=True,
seed=1
)
# Print best accuracy score and number of boosting rounds
best_accuracy = 1 - cv_results['test-error-mean'].min()
best_rounds = cv_results['test-error-mean'].idxmin() + 1
print(f"Best score: {best_accuracy:.4f} (at round {best_rounds})")
Best score: 0.7253 (at round 97)
# Train XGBoost model on full training data using best number of boosting rounds
clf = xgb.train(params=params, dtrain=recipes_dmatrix, num_boost_round=best_rounds)
# Predict labels for test data and calculate accuracy score
y_pred_xgb_cv = clf.predict(xgb.DMatrix(X_test))
y_pred_binary_xgb_cv = [1 if p >= 0.5 else 0 for p in y_pred_xgb_cv]
accuracy_xgb_cv = accuracy_score(y_test, y_pred_binary_xgb_cv)
y_train_xgb_cv = clf.predict(xgb.DMatrix(X_train))
y_train_binary_xgb_cv = [1 if p >= 0.5 else 0 for p in y_train_xgb_cv]
print(f"Accuracy on test data: {round(accuracy_xgb_cv*100, 2)}%")
Accuracy on test data: 75.95%
Now, we build a Random Forest classifier with cross-validation and hyperparameter tuning using GridSearchCV.
# Random Forest Classifier with CV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
# Define a RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
# Define a grid of hyperparameters to search over
param_grid = {
'n_estimators': [100, 200, 500],
'max_depth': [3, 5, None],
'max_features': ['sqrt', 'log2']
}
# Perform grid search cross-validation with 4 folds
grid_search = GridSearchCV(
estimator=rf,
param_grid=param_grid,
cv=4,
scoring='accuracy',
verbose=1
)
# Fit the grid search to the training data
grid_search.fit(X_train, y_train)
# Print best hyperparameters and accuracy score
best_params = grid_search.best_params_
best_accuracy = grid_search.best_score_
print(f"Best hyperparameters: {best_params}")
print(f"Accuracy on validation data: {best_accuracy:.4f}")
Fitting 4 folds for each of 18 candidates, totalling 72 fits
Best hyperparameters: {'max_depth': None, 'max_features': 'log2', 'n_estimators': 500}
Accuracy on validation data: 0.6635
# Train RandomForestClassifier on full training data using best hyperparameters
rf = RandomForestClassifier(random_state=42, **best_params)
rf.fit(X_train, y_train)
# Predict labels for test data and calculate accuracy score
y_pred_rf_cv = rf.predict(X_test)
y_train_rf_cv = rf.predict(X_train)
accuracy_rf_cv = accuracy_score(y_test, y_pred_rf_cv)
print(f"Accuracy on test data: {round(accuracy_rf_cv*100, 2)}%")
Accuracy on test data: 70.89%
We create stacked features for the training and test data by combining the predictions of multiple models. Specifically, we build one DataFrame for the training data (X_train_stacked) and one for the test data (X_test_stacked), where each row represents an observation and each column represents a model.
The columns in X_train_stacked are:
sgd_pred: the predicted labels for the training data from the stacked SGDClassifier/ExtraTrees model. sgd_cv_pred: the predicted labels from the grid-searched version of that model. xgb_pred: the predicted probabilities from the cross-validated XGBoost model. rf_cv_pred: the predicted labels from the cross-validated Random Forest. The columns in X_test_stacked are the same, but the predictions are made on the test data instead of the training data.
By stacking the predictions from multiple models, we can create new features that may help to improve the overall predictive performance of our final model. These stacked features are then used as input to a second-level model, which makes the final predictions.
# Model Stacking
X_train_stacked = pd.DataFrame({
'sgd_pred': y_train_sgd,
'sgd_cv_pred': y_train_sgd_cv,
'xgb_pred': y_train_xgb_cv,
'rf_cv_pred': y_train_rf_cv
})
X_test_stacked = pd.DataFrame({
'sgd_pred': y_pred_sgd,
'sgd_cv_pred': y_pred_sgd_cv,
'xgb_pred': y_pred_xgb_cv,
'rf_cv_pred': y_pred_rf_cv
})
This code performs logistic regression on the stacked predictions of the four models: the two SGD-based stacking classifiers, XGBoost, and Random Forest.
The previous cell created the two data frames X_train_stacked and X_test_stacked, which hold the predictions made by each model on the training and test sets, respectively.
A logistic regression model is then instantiated and trained on X_train_stacked and y_train, the target variable. The model makes predictions on X_test_stacked, and the accuracy score is calculated by comparing the predicted labels to the true labels in y_test.
The accuracy score is 71%.
# Logistic Regression w/ Model Stacking
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train_stacked, y_train)
y_pred_lr = lr.predict(X_test_stacked)
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(accuracy_lr)
0.7088607594936709
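One way to see how much weight the meta-model gives to each base model is to inspect the logistic regression coefficients (a quick sketch; output not shown here):
# Weight assigned by the meta-model to each base model's predictions
print(dict(zip(X_train_stacked.columns, lr.coef_[0])))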
Lastly, we generate a horizontal bar chart using Matplotlib to display the accuracy of different machine learning models. The chart is created from a Pandas DataFrame called results, which has two columns: "Model" and "Accuracy".
The first column contains the names of the models, and the second column contains the accuracy scores for each model. The horizontal bar chart is created using the plt.barh() method, with the 'Model' column used as the y-axis and the 'Accuracy' column used as the width of the bars.
The sort_values() method is used to sort the DataFrame by the 'Accuracy' column, so that the bars are displayed in order of decreasing accuracy.
# Table of Accuracy results
results = pd.DataFrame({
'Model': ['SGDClassifier', 'SGDClassifier w/ CV', 'XGBoostClassifier', 'XGBoostClassifier w/ CV', 'RandomForestClassifier w/ CV',
'Logistic RegressionStacked'],
'Accuracy': [accuracy_sgd, accuracy_sgd_cv, accuracy_xgb,
accuracy_xgb_cv, accuracy_rf_cv, accuracy_lr]
})
plt.barh(data=results.sort_values(by='Accuracy'), y='Model', width='Accuracy')
The stacking pipeline built around the SGDClassifier reported the highest accuracy on the test data among all the models evaluated, making it the best-performing model at classifying whether or not a recipe will generate high traffic. Its accuracy was 82.28%, meaning it correctly classified about 82% of the recipes in the test dataset. The SGDClassifier base learner is a linear classifier that works well with high-dimensional data such as the one-hot encoded features used here, making it a good choice for this type of classification problem.