Want a quick but thorough overview of the 42 most commonly used Machine Learning Algorithms?
(Attention: book promotion!)
Structure of the Book
As an example: XGBoost
Taxonomy
Definition: XGBoost is an extension of gradient-boosted decision trees (GBM), specifically designed to improve speed and performance, and it uses regularization methods to combat overfitting (see the parameter sketch after this taxonomy).
Main Domain: Classic Data Science
Data Type: Structured data
Data Environment: Supervised Learning
Learning Paradigm: Classification, Regression
Explainability: Explainable
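To make the regularization point from the definition above concrete, here is a minimal, purely illustrative parameter dictionary; reg_alpha, reg_lambda, gamma, and max_depth are real XGBoost parameters, but the specific values are assumptions chosen only for illustration:
# Illustrative regularization-related XGBoost parameters (values are arbitrary examples)
params = {
    'objective': 'binary:logistic',
    'max_depth': 3,        # shallower trees reduce overfitting
    'reg_alpha': 0.1,      # L1 regularization on leaf weights
    'reg_lambda': 1.0,     # L2 regularization on leaf weights
    'gamma': 0.5,          # minimum loss reduction required to make a further split
}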
Description:
XGBoost is an open-source software library for gradient boosting on decision trees. It stands for “Extreme Gradient Boosting” and is a powerful tool for building machine learning models, particularly for structured data in classification and regression problems.
The library provides an efficient implementation of the gradient boosting algorithm, which is a method that combines multiple weak models to form a strong model. It uses decision trees as base models and iteratively adds new trees to correct the mistakes made by previous trees. XGBoost is known for its performance and computational efficiency and is widely used in machine-learning competitions and real-world applications.
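To illustrate the idea that each new tree corrects the mistakes of the previous ones, here is a minimal, hypothetical sketch of gradient boosting for squared-error regression, using scikit-learn decision trees as the weak learners; this is a simplified illustration of the principle, not XGBoost's actual implementation:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_predict(X, trees, learning_rate, base):
    # The ensemble prediction is a constant base value plus the scaled output of every tree
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

def fit_gradient_boosting(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    base = y.mean()                      # start from a constant prediction
    trees = []
    for _ in range(n_trees):
        current = boosted_predict(X, trees, learning_rate, base)
        residuals = y - current          # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        trees.append(tree)               # each new tree fits what the ensemble still gets wrong
    return trees, base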
It offers particularly useful features such as support for parallel and distributed computing, efficient handling of missing values, and built-in handling of categorical variables. It also provides a rich set of parameters that can be used to fine-tune the model’s performance.
In summary, XGBoost is an optimized, distributed gradient boosting library designed to be highly efficient and scalable, and it works well with structured data.
An example of how to use XGBoost for a binary classification problem in Python:
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create an XGBoost data matrix from the training data
dtrain = xgb.DMatrix(X_train, label=y_train)
# Create an XGBoost data matrix from the test data
dtest = xgb.DMatrix(X_test, label=y_test)
# Define the parameter dictionary for the model
params = {'objective': 'binary:logistic', 'max_depth': 2}
# Train the model using the training data
model = xgb.train(params, dtrain)
# Make predictions on the test data
y_pred = model.predict(dtest)
# Convert the predicted probabilities to binary class labels
y_pred_class = (y_pred > 0.5).astype(int)
# Compare the predicted class labels to the true class labels
accuracy = (y_pred_class == y_test).mean()
print("Accuracy:", accuracy)
This code first creates a synthetic dataset using the make_classification function from scikit-learn and then splits the data into training and test sets. It then converts the training and test sets into XGBoost's DMatrix format, which is more efficient for training and prediction. Next, it defines the parameters for the model: the objective is set to binary:logistic, and max_depth is set to 2. It then trains the model on the training data and makes predictions on the test data. Finally, it converts the predicted probabilities to binary class labels, compares them to the true class labels, and calculates the model's accuracy. Please keep in mind that this is just a simple example; in real-world scenarios you should use cross-validation, hyperparameter tuning, and many other techniques to improve model performance.
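As a starting point for the cross-validation mentioned above, here is a minimal sketch using XGBoost's built-in xgb.cv helper with the dtrain matrix and params dictionary from the example; the number of boosting rounds, the number of folds, and the AUC metric are illustrative choices:
# Minimal sketch: 5-fold cross-validation with XGBoost's built-in cv helper
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=50,   # illustrative number of boosting rounds
    nfold=5,              # 5-fold cross-validation
    metrics="auc",        # evaluate with the area under the ROC curve
    seed=42,
)
# cv_results is a pandas DataFrame; pick the round with the best mean test AUC
best_round = cv_results["test-auc-mean"].idxmax()
print("Best round:", best_round)
print("Best mean test AUC:", cv_results["test-auc-mean"].max())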