Foundational Concepts for Building Machine Learning Models Using Traditional Methods
In this first section, I will delve into various fundamental concepts, principles, and ideas related to modeling, particularly when working with tabular data. Even if you are already well-versed in these topics, the discussion may provide you with new insights and approaches to strengthen your understanding. These concepts are essential prerequisites for diving into neural networks and deep learning.
Modeling
Modeling involves creating simplified representations of complex phenomena in various fields, including data science. Scientific models make useful predictions despite not capturing all nuances. Automated modeling, or autonomous modeling, refers to creating models with minimal human intervention or supervision, using a computational, mathematical, or symbolic approach to understand and generalize observations. Learning is the process of developing representations of a concept or phenomenon without extensive human involvement, using quantitative data.
Modes of Learning
Supervised Learning is a type of machine learning where the algorithm is trained on a labeled dataset. This means that the dataset contains both input data (features) and output data (labels). The goal is for the algorithm to learn the relationship between the input data and the output data, so that it can predict the output for new, unseen input data. In other words, given some inputs x, the model attempts to predict the corresponding output y. The label set contains the true output values for each row or item in the feature set. The label set is also known as the “ground truth” because it represents the true or correct output values. The feature set and label set must be correspondingly organized so that each row or item in the feature set is associated with a label in the label set.
The model must learn how the input features are related to the output, and this can be a simple or complex relationship, depending on the model and the complexity of the data. Simple relationships can be represented by a weighted linear sum of relevant features, while more complex relationships may involve duplications and concatenations of features to better model the relevancy of the input features with respect to the output. The complexity of the input-output associations ultimately depends on the complexity of the dataset or problem being modeled, as well as the capabilities of the model being used.
Unsupervised Learning, in contrast to supervised learning, is a method of developing a model without a labeled dataset. Instead of predicting labels, the goal of unsupervised learning is to find patterns and relationships within the data itself. Unsupervised models are given only the input features (x), without any corresponding labels (y), and must identify any underlying structure or organization within the data.
To put it differently, while supervised learning aims to predict an output based on a labeled dataset, unsupervised learning focuses on discovering patterns and relationships within an unlabeled dataset. In other words, unsupervised learning algorithms try to find the underlying structure or organization of the data without prior knowledge of the labels. The interactions between the features are learned with respect to the feature set itself, and the resulting models can be simple or complex depending on the complexity of the dataset and the chosen algorithm.
In this blog, I focus mainly on supervised learning.
Quantitative Representations of Data: Regression and Classification
To use automated learning models, data needs to be in a quantitative form. Some data is naturally quantitative, like age and measurements, but categorical data like animal type or country is not. Categorical data is characterized by discrete elements with no intrinsic ranking, unlike numerical data. When a categorical feature has only two unique values, we can represent it in a binary format by assigning one value to 0 and the other to 1. For example, in a dataset with a feature “color” containing only “red” and “blue”, we can assign 0 to “red” and 1 to “blue” to transform the feature into a binary format. However, when there are more than two unique values in a feature or label, things become more complicated. A challenge we will encounter and explore throughout the article is that of converting features of all forms into a more readable and informative representation for effective modeling.
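The binary encoding described above can be sketched with pandas. The “color” column and the 0/1 assignment follow the example in the text; the small DataFrame itself is made up for illustration.

```python
import pandas as pd

# Toy dataset (made up for illustration) with a two-valued categorical feature.
df = pd.DataFrame({"color": ["red", "blue", "blue", "red"]})

# Assign 0 to "red" and 1 to "blue", as in the example above.
color_to_binary = {"red": 0, "blue": 1}
df["color_binary"] = df["color"].map(color_to_binary)

print(df)
```

Using an explicit mapping, rather than letting a library pick the codes, keeps it obvious which value became 0 and which became 1.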
In supervised learning, problems are categorized as either regression or classification depending on the form of the desired output/label. If the output is continuous, it is a regression task, and if the output is binned, it is a classification task. For ordinal outputs, the separation between regression and classification depends on the approach taken rather than being attached to the problem itself. Many classification algorithms output a probability that an input sample falls into a certain class rather than automatically assigning a label to an input. It is essential to understand whether a problem and approach perform a classification or a regression task when building modeling systems.
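As a rough illustration of a classifier that outputs probabilities rather than labels, the sketch below fits scikit-learn’s LogisticRegression on a tiny made-up dataset. The data and the two query points are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature dataset (made up): small values are class 0, large values are class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# predict_proba returns one probability per class for each sample;
# predict turns those probabilities into a hard label.
queries = np.array([[0.5], [4.5]])
proba = clf.predict_proba(queries)
labels = clf.predict(queries)

print(proba)
print(labels)
```

Each row of the probability output sums to 1, and the predicted label is simply the class with the larger probability.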
The Data Cycle of Machine Learning: Training, Validation, and Test Sets
The performance evaluation of a model is important to determine the appropriate next steps. If the model performs poorly, further experimentation is required before deploying it. If the model performs reasonably well in the domain, it can be deployed in an application. However, if the model performs too well, it may be too good to be true, and additional investigation may be needed. Failing to evaluate a model’s true performance can lead to serious consequences, such as a medical model that incorrectly diagnoses real people due to an apparent perfect performance. The purpose of modeling is to represent a phenomenon or concept by modeling data derived from it. We must ensure that the approximation and the end goal are aligned with each other.
For instance, suppose a teacher gives a child who is learning addition a set of flashcards to practice with before an upcoming quiz that will assess their mastery of addition.
In order to more accurately assess the students’ knowledge, it would be better for the teacher to use a different set of problems that were not given explicitly to the students, but were of the same difficulty and concept. This is because using the same questions on the flash cards given to the students on the quiz would not provide an accurate assessment of the students’ understanding, as they would simply be regurgitating memorized answers rather than demonstrating a true understanding of the concept. By using different but equivalent questions, the teacher can ensure that the students have a genuine grasp of the concept and are not simply relying on memorization.
If the teacher gave Quiz 1, a student could score perfectly by simply memorizing six associations between a prompt “x + y” and the answer “z.” In the given scenario, the quiz questions were based on the same flash cards that were used in the lesson, which means that the students were able to memorize the answers rather than mastering the concept of addition. Therefore, the quiz only measured the students’ ability to memorize specific associations between numbers rather than assessing their understanding of the addition concept. This mismatch between the approximation (quiz questions based on flash cards) and the goal (assessing mastery of addition) is an example of approximation-goal misalignment.
On the other hand, a student who merely memorized the six associations given to them on the flash cards would perform poorly on Quiz 2, because they are presented with questions that are different in explicit form, despite being very similar in concept.
Quiz 2 avoids this misalignment by presenting problems that were not explicitly given to the students, but were of the same difficulty and concept, allowing the teacher to more genuinely evaluate the students’ learning. Similarly, in supervised machine learning a model should be evaluated on data it did not see during training. We often have a labeled dataset and an unlabeled dataset: the labeled dataset is used for training and for held-out evaluation, and the unlabeled dataset is the one on which the trained model makes predictions.
The training set is the dataset used to train the model, while the validation set is used to tune the model’s structure. The term test set is used in two ways: it can mean the data on which the model is asked to produce predictions, or a labeled held-out set reserved for a final, objective evaluation.
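One possible way to obtain training, validation, and test sets is to call scikit-learn’s train_test_split twice. The sketch below uses synthetic data, and the 60/20/20 proportions are an assumption for illustration, not a recommendation from this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (made up): 100 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# First hold out 20% as a test set for the final, objective evaluation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then split the remainder into training (75%) and validation (25%),
# which gives 60% / 20% / 20% of the original data overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```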
Real-World Example:
As an example, consider a machine learning model built on the “Predict students’ dropout and academic success” dataset. Most machine learning model implementations do not accept string inputs, and hence we have to transform them into numeric values.
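import pandas as pd
import numpy as np


def is_float(string: str) -> bool:
    """Return True if the string can be parsed as a float."""
    try:
        float(string)
        return True
    except ValueError:
        return False


def create_df_from_csv(df: pd.DataFrame) -> pd.DataFrame:
    """Split a single-column, semicolon-separated frame into proper columns."""
    rows = []
    # The only column name holds all the real column names, separated by ";".
    columns_ls = df.columns[0].split(";")
    for i_row in range(df.shape[0]):
        # Positional access; numeric fields become floats, the rest stay strings.
        fields = df.iloc[i_row, 0].split(";")
        rows.append([float(x) if is_float(x) else x for x in fields])
    return pd.DataFrame(rows, columns=columns_ls)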
After a small correction of the data set we get:
dataset = pd.read_csv("/data.csv")
create_df_from_csv(dataset).head()
data = create_df_from_csv(dataset)
features = data.iloc[:, :-1].values
label = data.iloc[:, -1].values
from sklearn.model_selection import train_test_split as tts
X_train, X_val, y_train, y_val = tts(features, label, train_size = 0.6)
It’s important to divide our data into two parts: training data and validation data. The training data is used to teach our model how to predict the right answer, and the validation data is used to test how well our model works on new data it hasn’t seen before.
When splitting the data into these two parts, it’s important to make sure they contain similar types of data, so our model can learn to make predictions on a representative sample. Scikit-learn, a popular machine learning library, randomizes the order and split of the dataset to ensure this similarity. If the original dataset is already randomized, then randomization doesn’t hurt. However, if the original dataset is ordered, randomization is helpful to make sure the training data is representative of the validation data.
In this example, the student data is randomly split into a 60% training set and a 40% validation set. The model is trained on the training set and evaluated on the validation set. If the model’s performance is not satisfactory, we can make adjustments to the modeling system.
Once a model obtains satisfactory validation performance, it can be deployed on new student data to identify promising students and to tailor whatever support they need to have a high probability of graduating, based on previous students’ data.
During the process of dividing a dataset into a training set and a validation set, it’s important to avoid data leakage, which occurs when some of the training data appears in the validation data. This leads to overly optimistic metrics that misrepresent the model’s real performance. One safeguard is random seeding: fixing the random seed makes the split reproducible, so the same rows land in the training and validation sets on every run and validation rows do not slip into training between runs. Another method is k-fold evaluation, where the dataset is randomly split into k folds and the model is trained on k-1 folds while being evaluated on the remaining fold. This process is repeated for each of the k folds, and the validation performance is averaged, which ensures that the model is evaluated on the entire dataset. In both cases, the training set is usually larger than the validation set to enable the model to learn from the majority of the dataset during training.
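A small sanity check along these lines might look as follows. It is only a sketch on a hypothetical 50-row dataset: it verifies that a fixed random_state reproduces the same split, and that no row index ends up in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row indices of a hypothetical 50-row dataset.
indices = np.arange(50)

# The same random_state reproduces exactly the same split on every run.
train_a, val_a = train_test_split(indices, train_size=0.6, random_state=42)
train_b, val_b = train_test_split(indices, train_size=0.6, random_state=42)
same_split = np.array_equal(train_a, train_b) and np.array_equal(val_a, val_b)

# No row should appear in both the training and the validation set.
overlap = np.intersect1d(train_a, val_a)

print(same_split, len(overlap))
```

Note that this only catches rows that are literally shared between the two sets; duplicated records stored under different indices would need a separate check.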
Data leakage can mask overfitting, because a model that has memorized leaked samples still scores well on validation. This can be addressed through careful separation of training and validation data and the use of techniques like random seeding and k-fold evaluation.
There is a tradeoff between the amount of data used for training the model and the amount of data set aside for evaluation of the model. This tradeoff needs to be carefully considered when choosing the value of k (in k-fold evaluation) and the size of the train-validation split. Furthermore, evaluating a model is not a completely objective process and involves making different choices based on the circumstances. Therefore, any metric or indicator that is used to evaluate a model is only showing one aspect of a complex model and should be interpreted carefully.
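To make the effect of k concrete, the sketch below prints how many rows go to training and validation in each fold for a few values of k. The 1000-row placeholder array is an assumption; only its length matters.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: only the number of rows (1000) matters here.
X = np.zeros((1000, 1))

fold_sizes = {}
for k in (2, 5, 10):
    # Take the first fold produced by the splitter; all folds have equal size here.
    train_idx, val_idx = next(KFold(n_splits=k).split(X))
    fold_sizes[k] = (len(train_idx), len(val_idx))
    print(f"k={k}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows")
```

A larger k leaves more data for training in each round, at the cost of smaller validation folds and more models to train.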
Back to Our Example
The target variable in this scenario is a categorical feature with three possible values. The objective is to determine whether a student will graduate or not, with no consideration for whether they drop out. To achieve this, the values of “Graduate” are encoded as “1”, while all other values are encoded as “0”.
data = create_df_from_csv(dataset)
features = data.iloc[:, :-1].values
label = data.iloc[:, -1].values
label = np.array([1 if cat == 'Graduate' else 0 for cat in label])
from sklearn.model_selection import train_test_split as tts
X_train, X_val, y_train, y_val = tts(features, label, train_size = 0.6)
Now we use the k-fold concept. As an example, I use the logistic regression algorithm, with the accuracy score as the evaluation metric.
n = 5
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

folds = KFold(n_splits=n)
each_performance = []
overall_performance = 0
for train_indices, valid_indices in folds.split(features):
    # Fold-specific names, so the earlier X_train/X_val split is not overwritten
    X_fold_train = features[train_indices]
    X_fold_valid = features[valid_indices]
    y_fold_train = label[train_indices]
    y_fold_valid = label[valid_indices]
    # Initialize model
    model = LogisticRegression(solver="liblinear")
    # Train
    model.fit(X_fold_train, y_fold_train)
    # Evaluate (accuracy_score expects y_true first, then y_pred)
    y_pred = model.predict(X_fold_valid)
    score = accuracy_score(y_fold_valid, y_pred)
    each_performance.append(score)
    overall_performance += score
# Average the per-fold scores
overall_performance /= n
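The manual loop above can also be written more compactly with scikit-learn’s cross_val_score helper. The sketch below runs on synthetic data from make_classification rather than on the student dataset, so the numbers it prints say nothing about the example in this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the student dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = LogisticRegression(solver="liblinear")
folds = KFold(n_splits=5)

# One accuracy score per fold; the mean plays the role of overall_performance.
scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
print(scores, scores.mean())
```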
Bias-Variance Trade-Off
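The bias-variance trade-off can help compare the behavior of different models in modeling data. Bias describes how far a model is from capturing the general trends and ideas in the dataset: a high-bias model is too simple and misses them. Variance describes how strongly a model adapts to the particular data points it was trained on: a high-variance model fits them with high precision, including their noise.
For example, how would you draw a smooth curve to model a dataset of scattered points? Think about it.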
When creating models to represent complex data, we can’t just draw curves like we might when connecting dots on a graph. We need to use mathematical or computational models that are more sophisticated. The bias-variance trade-off is a concept that can be used to compare different models and their ability to effectively model the data. By understanding the balance between bias and variance, we can create models that accurately capture important patterns and trends while minimizing errors and overfitting.
In machine learning, bias is the error that comes from a model failing to capture the general trends and patterns in the data, while variance is the error that comes from a model fitting its particular training points too precisely, so that its predictions change substantially when the training data changes. Both bias and variance contribute to the overall error of a model, so lower values of both are preferred. The relationship between bias and variance is often represented as a bullseye target, where the ideal model is one that hits the center of the target with a tight cluster of points, indicating low bias and low variance.
The bias-variance relationship helps to understand the concept of underfitting vs. overfitting in models. Underfitting occurs when a model has high bias and low variance, leading to poor modeling even of the training dataset. Overfitting happens when a model has low bias and high variance: it passes through nearly every training point, performing well on the training set but poorly on the validation dataset. Generally, as the complexity of the model increases, bias error decreases and variance error increases, so the two must be balanced against each other. The ideal model for a problem is the one that minimizes the overall error by balancing bias and variance errors. The degree of a polynomial is one factor that affects bias-variance trade-off behavior, with higher-degree polynomials associated with higher-variance overfitting behavior and lower-degree polynomials with higher-bias underfitting behavior.
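The role of the polynomial degree can be explored with a short experiment. This is a sketch under stated assumptions: the noisy sine data, the noise level, and the degrees 1, 3, and 9 are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): a noisy sine curve.
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_val = np.sort(rng.uniform(0, 1, 20))
y_val = np.sin(2 * np.pi * x_val) + rng.normal(0, 0.2, 20)


def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_pred) ** 2))


errors = {}
for degree in (1, 3, 9):
    # Fit a polynomial of the given degree to the training points only.
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),
        mse(y_val, np.polyval(coeffs, x_val)),
    )
    print(
        f"degree={degree}: train MSE={errors[degree][0]:.3f}, "
        f"val MSE={errors[degree][1]:.3f}"
    )
```

The training error can only go down as the degree grows, since each higher-degree polynomial contains the lower-degree ones as special cases. The validation error is the number to watch: it typically stops improving, or gets worse, once the model starts fitting noise.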
Curse of Dimensionality, Feature Space
As we delve deeper into mathematical modeling, it’s helpful to view a dataset not just as a set of rows and columns, but rather as a set of points in a feature space, which allows models to find spatial relationships between these points.
- For example, binary classification models can be viewed as drawing a hyperplane to separate points in space and assigning a “0” or “1” label to each point.
High-dimensional spaces can behave very differently from low-dimensional ones, and can often contradict our intuition.
- For example, the volume of a hypercube behaves counterintuitively as dimensionality grows. A hypercube is a generalization of a three-dimensional cube to any number of dimensions, and by drawing a slightly smaller hypercube within it we can compare the two volumes and see how the share taken up by the inner one changes with dimension.
The Curse of Dimensionality refers to the phenomenon where the performance of many machine learning algorithms degrades as the number of features (dimensions) increases. In high-dimensional spaces, distances between points become less meaningful as they tend to become equidistant. This can lead to overfitting or underfitting, and make the models harder to train, interpret and generalize.
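The hypercube comparison can be worked out numerically. The sketch below assumes a unit hypercube with an inner hypercube of side 0.9; both numbers are arbitrary choices for illustration.

```python
# Unit hypercube (side 1) with a smaller hypercube of side 0.9 inside it.
inner_side = 0.9

inner_volume_share = {}
for d in (1, 2, 3, 10, 100):
    # The volume of a d-dimensional hypercube is side ** d.
    inner_volume_share[d] = inner_side ** d
    print(f"d={d}: inner cube holds {inner_volume_share[d]:.4%} of the volume")
```

In low dimensions the inner cube holds most of the volume, but by 100 dimensions almost all of the volume sits in the thin shell between the two cubes.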
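The tendency of points to become nearly equidistant can be observed with random data. This is a sketch on uniformly sampled synthetic points; the sample size of 200 and the listed dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

relative_spread = {}
for d in (2, 10, 100, 1000):
    # 200 random points in the d-dimensional unit hypercube (synthetic data).
    points = rng.uniform(size=(200, d))
    # Euclidean distances from the first point to all the others.
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    # Gap between the farthest and nearest neighbor, relative to the nearest.
    relative_spread[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d}: (max - min) / min distance = {relative_spread[d]:.2f}")
```

When this ratio shrinks toward zero, the nearest and farthest neighbors are almost equally far away, which weakens methods that rely on distance.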
In a low-dimensional space, we can easily visualize and understand relationships between features and data points. We can identify patterns and make predictions based on a few key features. However, in a high-dimensional space, we quickly lose the ability to visualize and interpret the data.
As the number of dimensions increases, the number of possible combinations of feature values increases exponentially, and we quickly run into the problem of the curse of dimensionality. In other words, as the number of dimensions increases, the amount of data required to accurately capture the underlying patterns increases exponentially as well.
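The exponential growth can be made concrete with simple arithmetic. The sketch assumes each feature is divided into 10 bins and a fixed budget of one million samples; both numbers are hypothetical.

```python
# Hypothetical setup: each feature is divided into 10 bins,
# and we have a fixed budget of 1,000,000 samples.
bins_per_feature = 10
n_samples = 1_000_000

samples_per_cell = {}
for d in (1, 2, 3, 6, 10):
    # Number of distinct combinations of binned feature values.
    n_cells = bins_per_feature ** d
    samples_per_cell[d] = n_samples / n_cells
    print(f"d={d}: {n_cells} cells, {samples_per_cell[d]:g} samples per cell on average")
```

With six features there is on average only one sample per cell, and with ten features most cells must be empty, no matter how the samples are distributed.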
Furthermore, high-dimensional spaces can be sparse, meaning that data points are scattered throughout the space, making it difficult to identify patterns and relationships. This sparsity can also lead to overfitting, where a model captures noise instead of the underlying patterns.
To mitigate the curse of dimensionality, feature selection or dimensionality reduction techniques can be applied to reduce the number of dimensions while preserving the most important information. These techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-distributed Stochastic Neighbor Embedding (t-SNE).
In summary, the curse of dimensionality is a challenge in machine learning that arises when the number of features or dimensions increases. It can lead to overfitting, underfitting, and difficulty in visualizing and interpreting the data. However, with appropriate feature selection or dimensionality reduction techniques, we can mitigate these challenges and build accurate models.
Example of Principal Component Analysis for dimensionality reduction:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_val = sc.transform(X_val)
Then:
dim = 5
from sklearn.decomposition import PCA
pca = PCA(n_components = dim)
X_train = pca.fit_transform(X_train)
X_val = pca.transform(X_val)
We can then continue to the training and evaluation…
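One way to chain scaling, PCA, and a classifier is a scikit-learn pipeline. The sketch below is an assumption about how the remaining steps could look, and it uses synthetic data from make_classification instead of the student dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the student dataset).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.6, random_state=0
)

# Scaling and PCA are fitted on the training data only, then reused on validation data.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(solver="liblinear"),
)
pipeline.fit(X_train, y_train)

val_accuracy = pipeline.score(X_val, y_val)
explained = pipeline.named_steps["pca"].explained_variance_ratio_.sum()
print(f"validation accuracy: {val_accuracy:.3f}")
print(f"variance kept by 5 components: {explained:.3f}")
```

Bundling the steps this way makes it harder to accidentally fit the scaler or PCA on validation data, which would be a form of data leakage.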
Conclusion:
In classical machine learning, there are several fundamental principles that govern the construction and evaluation of models. These principles apply to all types of models, from simple linear regression models to complex neural networks.
- Data Preprocessing: Data preprocessing is the first step in any machine learning project. This involves cleaning and transforming raw data into a format that can be used by the model. Preprocessing may include tasks such as data cleaning, feature scaling, and feature engineering.
- Model Selection: Model selection involves choosing the most appropriate model for a given problem. This requires understanding the strengths and weaknesses of different types of models and selecting the one that is best suited to the problem at hand.
- Hyperparameter Tuning: Hyperparameters are the settings that govern the behavior of the model. Tuning these hyperparameters involves selecting the optimal values to achieve the best performance. Hyperparameter tuning is typically done through trial and error or using automated methods such as grid search or random search.
- Training and Validation: Models are trained on a portion of the data and then validated on another portion to assess their performance. This involves splitting the data into training and validation sets, training the model on the training set, and then evaluating the model on the validation set.
- Model Evaluation: Model evaluation involves assessing the performance of the model on unseen data. This is typically done using metrics such as accuracy, precision, recall, and F1 score. Model evaluation can also involve assessing the model’s ability to generalize to new data.
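The tuning and evaluation steps listed above can be sketched together with scikit-learn’s GridSearchCV. The synthetic data and the candidate values for the regularization strength C are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the student dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.6, random_state=0
)

# Hyperparameter tuning: try several regularization strengths C (arbitrary values)
# with 5-fold cross-validation on the training set.
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Model evaluation: score the best model on data it has not seen.
y_pred = search.predict(X_val)
metrics = {
    "accuracy": accuracy_score(y_val, y_pred),
    "precision": precision_score(y_val, y_pred),
    "recall": recall_score(y_val, y_pred),
    "f1": f1_score(y_val, y_pred),
}
print(search.best_params_, metrics)
```

The validation set is kept out of the grid search entirely, so the reported metrics describe data that played no part in choosing C.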
By following these principles, machine learning practitioners can develop effective models that can make accurate predictions and help solve complex problems. While these principles may seem straightforward, they require a deep understanding of machine learning algorithms, statistics, and mathematics.
That concludes the first part of our discussion. In the upcoming sections, we will delve into topics such as Optimization and Gradient Descent, Metrics and Evaluation, and classical machine learning algorithms that continue to hold immense relevance in modern industries. Stay tuned for more insightful discussions!