# Machine Learning 101: Decision Tree Algorithm For Classification

This article, Machine Learning 101: Decision Tree Algorithm For Classification, was updated in November 2023 on the website Moimoishop.com.

This article was published as a part of the Data Science Blogathon.

Overview

Learn about the decision tree algorithm in machine learning for classification problems.

This article covers entropy, information gain, and Gini impurity.

Decision Tree Algorithm

Decision trees are among the most popular machine learning algorithms. They can be used for both classification and regression problems.

The goal of this algorithm is to create a model that predicts the value of a target variable. To do so, the decision tree uses a tree representation in which each internal node tests an attribute and each leaf node corresponds to a class label.

Let’s take a sample data set to move further.

Suppose we have a data set of 14 patients, and we have to predict which drug, A or B, to suggest to a patient.

Let’s say we pick cholesterol as the first attribute to split the data.

It splits our data into two branches, High and Normal, based on cholesterol, as you can see in the above figure.

Suppose our new patient has high cholesterol. With the above split, we cannot say whether Drug A or Drug B is suitable for the patient.

Likewise, if the patient’s cholesterol is normal, we still do not have enough information to determine whether Drug A or Drug B is suitable.

Let us take another attribute, age. Age has three categories: young, middle-aged, and senior. Let’s try to split on it.

From the above figure, we can now easily predict which drug to give a patient based on his or her reports.

Assumptions that we make while using a decision tree:

– In the beginning, we consider the whole training set as the root.

– Feature values are preferred to be categorical. If the values are continuous, they are converted to discrete values before building the model.

– Records are distributed recursively based on attribute values.

– We use a statistical method to order attributes as the root node or as internal nodes.

Mathematics behind the decision tree algorithm: Before going to information gain, we first have to understand entropy.

Entropy: Entropy is a measure of impurity, disorder, or uncertainty in a set of examples.

Purpose of Entropy:

Entropy controls how a decision tree decides to split the data. It affects how a decision tree draws its boundaries.

For a two-class problem, entropy values range from 0 to 1. The lower the entropy, the purer the node and the more reliable its prediction.

Suppose we have features F1, F2, and F3, and we select F1 as our root node.

F1 contains 9 yes labels and 5 no labels. After splitting F1, we get F2, which has 6 yes and 2 no, and F3, which has 3 yes and 3 no.

Now let us calculate the entropy of F2 using the entropy formula:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Putting the values in the formula:

Entropy(F2) = -(6/8) log2(6/8) - (2/8) log2(2/8) ≈ 0.811

Here, 6 is the number of yes labels (taken as the positive class), 2 is the number of no labels, and 8 is the total number of rows in F2.

Similarly, the entropy of F3 is 1 bit, the maximum for two classes, because F3 contains 50% yes and 50% no.
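The entropy values for F2 and F3 can be checked with a few lines of Python. This is only a minimal sketch: the function name `entropy` and the list-of-counts input are our own choices for illustration, not part of any library.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its class counts."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count == 0:
            continue  # by convention, 0 * log2(0) is treated as 0
        p = count / total
        result -= p * math.log2(p)
    return result

entropy_f2 = entropy([6, 2])  # F2: 6 yes / 2 no
entropy_f3 = entropy([3, 3])  # F3: 3 yes / 3 no
print(round(entropy_f2, 3))   # 0.811
print(round(entropy_f3, 3))   # 1.0
```

A node that contains only one class, for example `entropy([8, 0])`, gives 0, which is the lowest possible value.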

This splitting continues until we get a pure subset.

What is a pure subset?

A pure subset is one in which all labels are the same: either all yes or all no.

We have computed entropy for a single node. After splitting F2, we may need further attributes to reach a leaf node, and we then have to compute the entropy of each resulting node and combine those values. For that, we have the concept of information gain.

Information Gain: Information gain is used to decide which feature to split on at each step in building the tree. Simplicity is best, so we want to keep our tree small. To do so, at each step we should choose the split that results in the purest daughter nodes. A commonly used measure of purity is called information.

For each node of the tree, the information value measures how much information a feature gives us about the class. The split with the highest information gain is taken as the first split, and the process continues until all child nodes are pure or until the information gain is 0.

The algorithm calculates the information gain for each candidate split and selects the split with the highest value.

Information gain is the entropy of the parent node minus the weighted average of the entropies of the subsets produced by a specific split:

Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) × Entropy(Sv)

Sv = number of samples in subset v after the split; for example, F2 has 6 + 2 = 8

S = total number of samples, as in F1 = 9 + 5 = 14

Now calculating the information gain:

Entropy(F1) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940

Gain = 0.940 - (8/14) × 0.811 - (6/14) × 1 ≈ 0.048

The algorithm repeats this calculation for every candidate split and uses the split with the highest information gain to construct the decision tree.

The higher the information gain of a split, the higher its chance of being selected.
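To make the calculation concrete, here is a small sketch that computes the information gain of splitting F1 (9 yes / 5 no) into F2 (6 yes / 2 no) and F3 (3 yes / 3 no). The helper names `entropy` and `information_gain` are hypothetical and chosen only for this example.

```python
import math

def entropy(counts):
    """Entropy in bits of a node described by its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

gain = information_gain([9, 5], [[6, 2], [3, 3]])
print(round(gain, 3))  # 0.048
```

A gain this close to 0 means the split barely reduces the uncertainty, so a tree builder would prefer another attribute if one offers a higher gain.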

Gini Impurity:

Gini impurity is a measure used when building decision trees to determine how the features of a data set should split nodes to form the tree. More precisely, the Gini impurity of a data set is a number that, for a two-class problem, lies between 0 and 0.5. It indicates the likelihood that a new, randomly chosen data point would be misclassified if it were given a random class label according to the class distribution in the data set.
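As a quick illustration, Gini impurity can be computed from class counts as one minus the sum of squared class proportions. The function name `gini_impurity` below is our own, and the counts are example values.

```python
def gini_impurity(counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini_impurity([3, 3]))  # 0.5   (50% yes / 50% no, the two-class maximum)
print(gini_impurity([8, 0]))  # 0.0   (pure node)
print(gini_impurity([6, 2]))  # 0.375
```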

Entropy vs Gini Impurity

For a two-class problem, the maximum value of entropy is 1, whereas the maximum value of Gini impurity is 0.5.

As the Gini Impurit

In this article, we have covered the decision tree in detail: how it works, the maths behind it, and attribute selection measures such as entropy, information gain, and Gini impurity, along with their formulas.

By now, I hope you have got an idea of the decision tree, one of the best machine learning algorithms for solving classification problems.

The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.


## Beginner’s Guide To Decision Tree Classification Using Python

This article was published as a part of the Data Science Blogathon

Overview

What is the decision tree classification algorithm

How to build a decision tree from scratch

Terminologies related to decision tree

Difference between random forest and decision tree

Python Code Implementation of decision trees

There are various algorithms in machine learning for both regression and classification problems. Choosing the best and most efficient algorithm for the given dataset is the main task when developing a good machine learning model.

One such algorithm, suitable for both classification and regression problems, is the decision tree.

Decision trees mimic the way humans think when making a decision, so they are easy to understand.

The logic behind a decision tree is easy to follow because it has a flowchart-like tree structure, which makes it easy to visualize and to extract information about the underlying process.

What Is a Decision Tree

Elements of Decision Trees

How to Build a Decision Tree from Scratch

How Does the Decision Tree Algorithm Work

Acquaintance With EDA( Exploratory Data Analysis)

Decision Trees and Random Forests

Python Code Implementation

1. What is a Decision Tree?

A decision tree is a supervised machine learning algorithm. It is used for both classification and regression problems. It is structured like a tree with nodes, where the branches depend on a number of factors. It keeps splitting the data into branches until it reaches a threshold (stopping) value. A decision tree consists of a root node, child nodes, and leaf nodes.

Let’s understand the decision tree method through a real-life scenario.

Imagine that you play football every Sunday and you always invite your friend to come and play with you. Sometimes your friend comes and sometimes he doesn’t.

Whether or not he comes depends on numerous factors, like weather, temperature, wind, and fatigue. We start to take all of these features into consideration and track them alongside your friend’s decision to come and play or not.

You can use this data to predict whether or not your friend will come to play football. The technique you could use is a decision tree. Here’s what the decision tree would look like after implementation:

2. Elements Of a Decision Tree

Every decision tree consists of the following elements:

a Node

b Edges

c Root

d Leaves

a) Nodes: A node is a point where the tree splits according to the value of some attribute/feature of the dataset.

b) Edges: An edge directs the outcome of a split to the next node. We can see in the figure above that there are nodes for features like outlook, humidity, and windy. There is an edge for each potential value of each of those attributes/features.

c) Root: This is the node where the first split takes place.

d) Leaves: These are the terminal nodes, which predict the outcome of the decision tree.

3. How to Build Decision Trees from Scratch?

While building a decision tree, the main task is to select the best attribute from the dataset’s feature list for the root node as well as for the sub-nodes. This selection is done with a technique known as the attribute selection measure (ASM).

With the help of an ASM, we can easily select the best features for the respective nodes of the decision tree.

There are two techniques for ASM:

a) Information Gain

b) Gini Index

a) Information Gain:

1 Information gain is the change in entropy after a dataset is split on an attribute.

2 It tells us how much information a feature/attribute provides.

3 Nodes are split and the decision tree is built according to the value of the information gain.

4 A decision tree always tries to maximize information gain, so the node/attribute with the highest information gain is split first. Information gain can be calculated using the formula below:

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy: Entropy signifies the randomness in the dataset. It is a metric for measuring impurity. Entropy can be calculated as:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Where,

S = the set of all samples

P(yes)= probability of yes

P(no)= probability of no.

b) Gini Index:

The Gini index is a measure of impurity/purity used when creating a decision tree with the CART (Classification and Regression Tree) algorithm.

An attribute with a low Gini index should be preferred over one with a high Gini index.

The CART algorithm creates only binary splits, and it uses the Gini index to choose them.

The Gini index can be calculated using the formula below:

Gini Index = 1 - Σj (Pj)^2

4. How Does the Decision Tree Algorithm Work?

The basic idea behind any decision tree algorithm is as follows:

1. Select the best feature using an attribute selection measure (ASM) to split the records.

2. Make that attribute/feature a decision node and break the dataset into smaller subsets.

3. Build the tree by repeating this process recursively for each child until one of the following conditions is met:

a) All tuples belong to the same class.

b) There are no more attributes remaining.

c) There are no more instances remaining.
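The steps above can be sketched in plain Python. This is a simplified, ID3-style illustration that uses information gain as the ASM and handles categorical features only. The function names (`build_tree`, `predict`) and the toy football data are assumptions made for this example, not code from a library.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no attributes remain: predict the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: select the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Steps 2 and 3: make it a decision node and recurse on each subset.
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree

def predict(tree, row):
    """Follow the branches matching the row until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][row[attribute]]
    return tree

# Hypothetical toy data: does the friend come to play football?
rows = [
    {"weather": "sunny", "windy": "no", "play": "yes"},
    {"weather": "sunny", "windy": "yes", "play": "yes"},
    {"weather": "sunny", "windy": "yes", "play": "yes"},
    {"weather": "rainy", "windy": "no", "play": "yes"},
    {"weather": "rainy", "windy": "yes", "play": "no"},
    {"weather": "rainy", "windy": "yes", "play": "no"},
]
tree = build_tree(rows, ["weather", "windy"], "play")
print(predict(tree, {"weather": "rainy", "windy": "yes"}))  # no
```

Because each subset is built only from attribute values that actually occur in the data, the "no more instances" condition never triggers in this sketch. A fuller implementation would also need to handle attribute values at prediction time that were never seen during training.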

5. Decision Trees and Random Forests

Decision trees and random forests are both tree-based methods used in machine learning.

Decision trees are machine learning models that make predictions by testing the features in the data set one by one.

Random forests, on the other hand, are collections of decision trees that are grouped and trained together, each using a random selection of the features in the given data set.

Instead of relying on just one decision tree, the random forest takes the prediction from every tree and gives the final output based on the majority vote. In other words, a random forest can be defined as a collection of multiple decision trees.
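The voting step can be illustrated with a tiny sketch. It shows only how the individual tree predictions are combined, not how the trees are trained. The function name `majority_vote` and the example predictions are made up for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one sample.
tree_predictions = ["come", "come", "stay home", "come", "stay home"]
forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)  # come
```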

6. Advantages of the Decision Tree

1 It is simple to implement, and its flowchart-like structure resembles human decision making.

2 It proves to be very useful for decision-related problems.

3 It helps to find all of the possible outcomes for a given problem.

4 There is very little need for data cleaning in decision trees compared to other machine learning algorithms.

5 It handles both numerical and categorical values.

Disadvantages of the Decision Tree

1 A decision tree with too many layers can become extremely complex.

2 It may overfit (which can be mitigated by using the random forest algorithm).

3 As the number of class labels grows, the computational complexity of the decision tree increases.

﻿

7. Python Code Implementation

    # raw_data is assumed to be a pandas DataFrame that has already been loaded
    # and that contains a 'Kyphosis' target column.
    import seaborn as sns

    # Exploratory data analysis
    raw_data.info()
    sns.pairplot(raw_data, hue='Kyphosis')

    # Split the data set into training data and test data
    from sklearn.model_selection import train_test_split
    x = raw_data.drop('Kyphosis', axis=1)
    y = raw_data['Kyphosis']
    x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size=0.3)

    # Train the decision tree model
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier()
    model.fit(x_training_data, y_training_data)
    predictions = model.predict(x_test_data)

    # Measure the performance of the decision tree model
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    print(classification_report(y_test_data, predictions))
    print(confusion_matrix(y_test_data, predictions))

My name is Pranshu Sharma and I am a Data Science Enthusiast


The media shown in this article on Analytics Vidhya are not owned by Analytics Vidhya and is used at the Author’s discretion.


## Machine Learning With Limited Data

This article was published as a part of the Data Science Blogathon.

Introduction

In machine learning, the amount and quality of the data are crucial to model training and performance. The amount of data strongly affects machine learning and deep learning algorithms, and the behavior of most algorithms changes as the amount of data increases or decreases. When data is limited, machine learning algorithms must be handled carefully to get accurate models. Deep learning algorithms in particular are data-hungry and require a large amount of data for good accuracy.

In this article, we will discuss how the amount and quality of data relate to machine learning and deep learning algorithms, the problems with limited data, and ways of dealing with them. Knowing these key concepts will help you understand the algorithm vs. data scenario and deal with limited data efficiently.

The “Amount of Data vs. Performance” Graph

In machine learning, a question may come to mind: how much data is required to train a good machine learning or deep learning model? There is no fixed answer, as every dataset is different and has different features and patterns. Still, there are threshold levels after which the performance of machine learning or deep learning algorithms tends to become constant.

Most of the time, machine learning and deep learning models perform better as the amount of data fed to them increases, but after some amount of data, the behavior of the models becomes constant, and they stop learning from the data.

The above picture shows the performance of some well-known machine learning and deep learning architectures against the amount of data fed to them. Traditional machine learning algorithms learn a lot from the data at first, while the amount of data is increasing, but once a threshold is reached, their performance becomes constant. If you provide more data beyond that point, the algorithm will not learn anything more, and its performance will neither increase nor decrease.

The diagram contains three types of deep learning architectures. The shallow type is the smallest in terms of depth, meaning that it has few hidden layers and neurons. In deep neural networks, the number of hidden layers and neurons is very high.

All three architectures perform differently as the amount of data fed to them increases. Shallow neural networks tend to behave like traditional machine learning algorithms, where performance becomes constant after some threshold amount of data. Deep neural networks, by contrast, keep learning as new data is fed.

From the diagram, we can conclude that,

“THE DEEP NEURAL NETWORKS ARE DATA HUNGRY”

What Problems Arise with Limited Data?

Several problems occur with limited data, and a model may perform poorly if it is trained on too little data. The common issues that arise with limited data are listed below:

1. Classification:

In classification, if too little data is fed, the model will classify observations wrongly, meaning that it will not give the correct output class for the given inputs.

2. Regression:

A regression model predicts a number. If it is trained on limited data, its accuracy will be low, and its predictions may be very far from the actual output.

3. Clustering:

In clustering problems, a model trained with limited data can assign points to the wrong clusters.

4. Time Series:

In time series analysis, we forecast data for the future. A time series model with low accuracy can give poor forecasts, with many time-related errors.

5. Object Detection:

If an object detection model is trained on limited data, it might not detect the object correctly, or it might classify the object incorrectly.

How to Deal With Problems of Limited Data?

There is no single exact or fixed method for dealing with limited data. Every machine learning problem is different, and so is the way of solving it. But some standard techniques are helpful in many cases.

1. Data Augmentation

Data augmentation is a technique in which the existing data is used to generate new data. The newly generated data looks like the old data, but some of its values and parameters are different.

This approach increases the amount of data, and there is a high likelihood that it improves the model’s performance.

Data augmentation is preferred in most deep learning problems where there is limited image data.
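As a rough illustration of the idea for image data, the sketch below creates flipped copies of a tiny "image" stored as a nested list of pixel values. The function names and the 2×3 example image are assumptions made for this example. Real projects usually rely on an image library and a richer set of transformations (rotations, crops, noise).

```python
def horizontal_flip(image):
    """Mirror each row of the image left to right."""
    return [list(reversed(row)) for row in image]

def vertical_flip(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(image)]

def augment(image):
    """Return the original image plus its flipped variants."""
    return [image, horizontal_flip(image), vertical_flip(image)]

image = [
    [0, 1, 2],
    [3, 4, 5],
]
augmented = augment(image)
print(len(augmented))  # 3
print(augmented[1])    # [[2, 1, 0], [5, 4, 3]]
print(augmented[2])    # [[3, 4, 5], [0, 1, 2]]
```

One labeled image becomes three training examples here. The flipped copies keep the original label, which is only valid when flipping does not change the meaning of the image.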

2. Don’t Drop and Impute:

In some datasets, a large fraction of the data is invalid or empty. Such data is often dropped to keep the process simple, but this decreases the amount of data, and several problems can occur. With limited data, it is better to impute the missing values than to drop them.

3. Custom Approach:

If there is a case of limited data, one could search for the data on the internet and find similar data. Once this type of data is obtained, it can be used to generate more data or be merged with the existing data.

Conclusion

In this article, we discussed limited data, how the performance of several machine learning and deep learning algorithms changes as the amount of data increases or decreases, the types of problems that can occur due to limited data, and the common ways to deal with it. This article should help you understand limited data, its effects on performance, and how to handle it.

1. Traditional machine learning algorithms and shallow neural networks stop improving with more data after some threshold level.

2. Deep neural networks are data-hungry algorithms that keep learning as more data is provided.

3. Limited data can cause problems in every field of machine learning applications, e.g., classification, regression, time series, image processing, etc.

4. We can apply Data augmentation, imputation, and some other custom approaches based on domain knowledge to handle the limited data.


## Data Preprocessing In Machine Learning

Introduction to Data Preprocessing in Machine Learning

The following article provides an outline for data preprocessing in machine learning. Data pre-processing, also known as data wrangling, is the technique of transforming raw data (data that is incomplete, inconsistent, full of errors, or lacking certain behavior) into an understandable format. It proceeds through several steps: importing libraries, importing data, checking for missing values, handling categorical data, feature scaling, and validation. The aim is that proper interpretations can be made from the data and negative results avoided, since the quality of a machine learning model depends heavily on the quality of the data we train it on.


Data for training the model is collected from various sources. The collected data is generally in its raw format, i.e. it can contain noise such as missing values, irrelevant information, or numbers stored as strings, or it can be unstructured. Data pre-processing increases the efficiency and accuracy of machine learning models, as it helps remove this noise from the dataset and gives meaning to the dataset.

Six Different Steps Involved in Machine Learning

Following are six different steps involved in machine learning to perform data pre-processing:

Step 1: Import libraries

Step 2: Import data

Step 3: Checking for missing values

Step 4: Checking for categorical data

Step 5: Feature scaling

Step 6: Splitting data into training, validation, and evaluation sets

1. Import Libraries

The very first step is to import a few of the important libraries required in data pre-processing. A library is a collection of modules that can be called and used. Python has many libraries that are helpful in data pre-processing.

A few of the important Python libraries are:

Numpy: The most widely used library for the mathematical computation involved in machine learning. It is useful for performing operations on multidimensional arrays.

Pandas: An open-source library that provides high-performance, easy-to-use data structures and data analysis tools in Python. It is designed to make working with relational and labeled data easy and intuitive.

Matplotlib: A Python visualization library for 2D plots of arrays. It is built on NumPy arrays and designed to work with the broader SciPy stack. Visualizing datasets is helpful when a large amount of data is available. Plots available in Matplotlib include line, bar, scatter, histogram, etc.

Seaborn: Also a Python visualization library. It provides a high-level interface for drawing attractive and informative statistical graphs.

2. Import Dataset

Once the libraries are imported, our next step is to load the collected data. The Pandas library is used to import these datasets. Datasets are mostly available in CSV format, as such files are small, which makes them fast to process. To load a CSV file, use the read_csv function of the pandas library. Datasets can also come in various other formats.

Once the dataset is loaded, we have to inspect it and look for any noise. To do so, we create a feature matrix X and an observation vector Y with respect to X.
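To show what a feature matrix X and an observation vector y look like, here is a minimal sketch. It deliberately uses Python's built-in `csv` module on an in-memory string instead of pandas so that it runs without extra dependencies. The column names and values are made up for illustration.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file on disk.
csv_text = """age,salary,purchased
25,40000,no
32,52000,yes
47,,yes
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
feature_names = ["age", "salary"]

# Feature matrix X: one row per observation, one column per feature.
X = [[row[name] for name in feature_names] for row in rows]
# Observation vector y: the value we want to predict for each row.
y = [row["purchased"] for row in rows]

print(X[2])  # ['47', ''] -> the empty string marks a missing salary
print(y)     # ['no', 'yes', 'yes']
```

Even this tiny example exposes two kinds of noise discussed in this article: the numbers arrive as strings, and one salary is missing.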

3. Checking for Missing Values

Once you create the feature matrix, you might find that there are some missing values. If we don’t handle them, they may cause problems at training time. There are two common ways to handle them:

Remove the entire row that contains the missing value. You may end up losing some vital information this way, so this approach is best suited to large datasets.

If a numerical column has a missing value, estimate it from the column’s mean, median, or mode.
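The second approach can be sketched with Python's standard `statistics` module. The function name `impute` and the example ages are assumptions for illustration, and missing values are represented as `None`.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [22, None, 30, 22, 46]
filled_mean = impute(ages, "mean")      # missing age replaced by the mean, 30
filled_median = impute(ages, "median")  # missing age replaced by the median, 26
filled_mode = impute(ages, "mode")      # missing age replaced by the mode, 22
print(filled_mean, filled_median, filled_mode)
```

Which statistic to use depends on the column: the median is less sensitive to outliers than the mean, and the mode also works for categorical columns.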

4. Checking for Categorical Data

Data in the dataset has to be in numerical form so that computation can be performed on it. Since machine learning models involve complex mathematical computation, we can’t feed them non-numerical values. So, it is important to convert all text values into numerical values. The LabelEncoder() class of sklearn is used to convert these categorical values into numerical values.
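To make the idea of label encoding concrete without depending on sklearn, the sketch below maps each distinct category to an integer in sorted order. The function name `fit_label_encoding` and the colour values are made up for this example.

```python
def fit_label_encoding(values):
    """Map each distinct category to an integer, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}

colors = ["red", "green", "blue", "green"]
mapping = fit_label_encoding(colors)
encoded = [mapping[c] for c in colors]
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)  # [2, 1, 0, 1]
```

Note that such integer codes imply an ordering (blue < green < red) that the categories may not really have, so this kind of encoding is safest for target labels or for models that do not interpret the codes as magnitudes.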

5. Feature Scaling

The values in the raw data can vary over very different ranges, which may bias the training of the model or increase the computational cost. So it is important to normalize them. Feature scaling is a technique used to bring the data values into a smaller, comparable range.

Methods used for feature scaling are:

Rescaling (min-max normalization)

Mean normalization

Standardization (Z-score Normalization)

Scaling to unit length
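Two of the methods listed above, min-max rescaling and standardization, can be sketched as follows. The function names and salary values are assumptions for illustration, and both functions assume the column is not constant (otherwise they would divide by zero).

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

salaries = [30000, 45000, 60000]
scaled = min_max_scale(salaries)
z_scores = standardize(salaries)
print(scaled)  # [0.0, 0.5, 1.0]
print([round(z, 3) for z in z_scores])  # [-1.225, 0.0, 1.225]
```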

6. Splitting Data into Training, Validation and Evaluation Sets

Finally, we need to split our data into three different sets: a training set to train the model, a validation set to validate the accuracy of our model, and a test set to test the performance of our model on generic data. Before splitting the dataset, it is important to shuffle it to avoid any biases. A commonly used proportion is 60:20:20, i.e. 60% as the training set and 20% each as the validation and test sets. To split the dataset, use train_test_split of sklearn.model_selection twice: once to split the dataset into a training set and a validation set, and then again to split the remaining training data into training and test sets.
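The shuffle-then-split idea can be sketched with the standard library alone. This is not how train_test_split is implemented. It is a simplified stand-in, with a hypothetical function name and a fixed seed so that the result is reproducible.

```python
import random

def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the rows, then cut them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle first to avoid ordering bias
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Every row ends up in exactly one of the three sets, which is what keeps the validation and test scores honest.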

Conclusion – Data Preprocessing in Machine Learning

Data preprocessing is something that requires practice. It is not like a simple data structure that you learn once and apply directly to solve a problem. To get good at cleaning or visualizing a dataset, you need to work with different datasets. The more you use these techniques, the better you will understand them. This was a general overview of how data preprocessing plays an important role in machine learning. Along with that, we have also seen the steps needed for data pre-processing. So the next time you train a model on collected data, be sure to apply data pre-processing first.


## 5 Challenges Of Machine Learning!

Introduction :

In this post, we will go through some of the major challenges that you might face while developing your machine learning model. We assume that you know what machine learning is about, why people use it, what its different categories are, and how the overall development workflow takes place.


What can possibly go wrong during the development and prevent you from getting accurate predictions?

So let’s get started. During the development phase, our focus is to select a learning algorithm and train it on some data. The two things that might go wrong are a bad algorithm, bad data, or perhaps both.

Table of Content :

Not enough training data.

Poor Quality of data.

Irrelevant features.

Nonrepresentative training data.

Overfitting and Underfitting.

1. Not enough training data :

To teach a child what an apple is, all it takes is for you to point to an apple and say “apple” repeatedly. After that, the child can recognize all sorts of apples.

2. Poor Quality of data:

Obviously, if your training data has lots of errors, outliers, and noise, your machine learning model will struggle to detect the proper underlying pattern. Hence, it will not perform well.

So put every ounce of effort into cleaning up your training data. No matter how good you are at selecting and tuning the model, this part plays a major role in building an accurate machine learning model.

“Most Data Scientists spend a significant part of their time in cleaning data”.

Here are a couple of cases in which you’d want to clean up the data:

If some of the instances are clear outliers, just discard them or fix them manually.

If some of the instances are missing a feature (e.g., 2% of users did not specify their age), you can either ignore these instances, fill in the missing values with the median age, or train one model with the feature and one without it and compare them.

3. Irrelevant Features:

“Garbage in, garbage out (GIGO).”


In the above image, we can see that even if our model is “AWESOME”, feeding it garbage data produces garbage output. Our training data must contain mostly relevant features and few to no irrelevant ones.

Much of the credit for a successful machine learning project goes to coming up with a good set of features to train on (often referred to as feature engineering). This includes feature selection, feature extraction, and creating new features, which are interesting topics to be covered in upcoming blogs.

4. Nonrepresentative training data:

To make sure that our model generalizes well, our training data must be representative of the new cases that we want to generalize to.

If we train our model on a nonrepresentative training set, its predictions won’t be accurate, and it will be biased against one class or group.

For example, let us say you are trying to build a model that recognizes the genre of music. One way to build your training set is to search on YouTube and use the resulting data. Here we assume that YouTube’s search engine provides representative data, but in reality the search will be biased towards popular artists, and maybe even towards artists that are popular in your location (if you live in India, you will get the music of Arijit Singh, Sonu Nigam, etc.).

So use representative data during training, so that your model won’t be biased towards one or two classes when it works on testing data.

5. Overfitting and Underfitting :

What is overfitting?


Let’s start with an example. Say one day you are walking down a street to buy something, and a dog comes out of nowhere. You offer him something to eat, but instead of eating, he starts barking and chasing you. Somehow you get away safely. After this particular incident, you might think that no dog is worth treating nicely.

This overgeneralization is what we humans do most of the time, and unfortunately a machine learning model does the same if we do not pay attention. In machine learning, we call this overfitting, i.e. the model performs well on training data but fails to generalize well.

Overfitting happens when our model is too complex.

Things which we can do to overcome this problem:

Simplify the model by selecting one with fewer parameters.

By reducing the number of attributes in training data.

Constraining the model.

Gather more training data.

Reduce the noise.

What is underfitting?


Yes, you guessed it right: underfitting is the opposite of overfitting. It happens when our model is too simple to learn the underlying structure of the data. For example, if you use a linear model on data with a strongly nonlinear relationship, it is likely to underfit, and its predictions are bound to be inaccurate even on the training set.

Things which we can do to overcome this problem:

Train on better and relevant features.

Reduce the constraints.
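The difference between the two failure modes can be seen by comparing training and test error. The sketch below uses two deliberately extreme stand-ins, which are our own choices for illustration: a constant predictor as the "too simple" model and a 1-nearest-neighbour predictor, which memorizes the training set, as the "too complex" model. The data is synthetic (a noisy straight line generated with a fixed seed).

```python
import random

def mse(predict, points):
    """Mean squared error of a prediction function on (x, y) points."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

def fit_constant(train):
    """Too simple: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_nearest_neighbor(train):
    """Too flexible: return the target of the closest training point (memorization)."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

rng = random.Random(0)

def sample(xs):
    """Synthetic data: y = 2x plus Gaussian noise."""
    return [(x, 2 * x + rng.gauss(0, 3)) for x in xs]

train = sample(range(0, 40, 2))  # even x values
test = sample(range(1, 40, 2))   # odd x values, unseen during training

results = {}
for name, fit in [("constant", fit_constant), ("nearest_neighbor", fit_nearest_neighbor)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    print(name, "train MSE:", round(results[name][0], 2), "test MSE:", round(results[name][1], 2))
```

The constant model has a large error even on the training set (underfitting), while the nearest-neighbour model has zero training error but a nonzero test error, since it has memorized the noise (overfitting).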

Conclusion :

Machine learning is all about making machines better by using data so that we don’t need to code them explicitly. The model will not perform well if the training data is too small, is noisy with errors and outliers, is not representative (which results in bias), or contains irrelevant features (garbage in, garbage out). Lastly, the model should be neither too simple (which results in underfitting) nor too complex (which results in overfitting). Even after you have trained a model with the above points in mind, don’t expect it to simply generalize well to new cases. You may need to evaluate and fine-tune it. How to do that? Stay tuned: this topic will be covered in upcoming blogs.

Thank you,