Machine Learning 101: Decision Tree Algorithm For Classification
Updated in November 2023 on Moimoishop.com.
This article was published as a part of the Data Science Blogathon.
Overview
Learn about the decision tree algorithm in machine learning for classification problems.
Here we cover entropy, information gain, and Gini impurity.
Decision Tree Algorithm
The decision tree is a supervised learning algorithm. It can be used for both classification and regression problems.
The goal of this algorithm is to create a model that predicts the value of a target variable. To do so, the decision tree uses a tree representation in which each internal node tests an attribute and each leaf node corresponds to a class label.
Let’s take a sample data set to move further.
Suppose we have a sample data set of 14 patients, and we have to predict which drug, A or B, to suggest to a patient.
Let’s say we pick cholesterol as the first attribute to split the data.
This splits our data into two branches, High and Normal, based on cholesterol, as shown in the figure above.
Suppose our new patient has high cholesterol. With this split alone, we cannot say whether Drug A or Drug B will be suitable for the patient.
Likewise, if the patient’s cholesterol is normal, we still do not have enough information to determine whether Drug A or Drug B is suitable.
Let us take another attribute, age. Age has three categories: young, middle-aged, and senior. Let’s try to split on it.
From the figure above, we can now easily predict which drug to give to a patient based on his or her reports.
Assumptions that we make while using a decision tree:
– In the beginning, we consider the whole training set as the root.
– Feature values are preferred to be categorical; if the values are continuous, they are converted to discrete values before building the model.
– Records are distributed recursively based on attribute values.
– We use a statistical method to order attributes as the root node or as internal nodes.
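As an illustration of the second assumption, a continuous feature such as age can be turned into categories by binning. The sketch below is only one possible approach; the bin edges (30 and 60) and the label names are hypothetical values chosen for the example, not taken from the patient data set.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels, chosen only for illustration.
AGE_EDGES = [30, 60]                              # boundaries between categories
AGE_LABELS = ["young", "middle-aged", "senior"]   # one more label than edges

def discretize(value, edges=AGE_EDGES, labels=AGE_LABELS):
    """Map a continuous value to the label of the bin it falls into."""
    # bisect_right counts how many edges are <= value, which is the bin index.
    return labels[bisect_right(edges, value)]

ages = [22, 30, 45, 61, 78]
categories = [discretize(age) for age in ages]
print(categories)
```

A value equal to an edge falls into the upper bin here; that is a design choice of this sketch, and a real project would pick edges from domain knowledge or from the data.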
Mathematics behind the decision tree algorithm: Before going to information gain, we first have to understand entropy.
Entropy: Entropy is a measure of impurity, disorder, or uncertainty in a set of examples.
Purpose of Entropy:
Entropy controls how a Decision Tree decides to split the data. It affects how a Decision Tree draws its boundaries.
For a two-class problem, entropy values range from 0 to 1. The lower the entropy, the purer the node and the more reliable its prediction.
Suppose we have features F1, F2, and F3, and we select F1 as our root node.
F1 contains 9 yes labels and 5 no labels. After splitting on F1 we get F2, which has 6 yes and 2 no, and F3, which has 3 yes and 3 no.
Now let us calculate the entropy of F2 using the entropy formula:
Entropy = -P(yes) log2 P(yes) - P(no) log2 P(no)
Putting the values in the formula:
Entropy(F2) = -(6/8) log2(6/8) - (2/8) log2(2/8) ≈ 0.81
Here, 6 is the number of yes labels (taken as the positive class) and 8 is the total number of rows in F2, so 6/8 is the probability of yes.
Similarly, if we compute the entropy for F3 we get 1 bit, which is the maximum for two classes, since F3 contains 50% yes and 50% no.
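The two entropy values can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `entropy` and the count-list input format are choices made for this example.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class counts."""
    total = sum(counts)
    # p * log2(1/p) is the same as -p * log2(p); empty classes contribute 0.
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

entropy_f2 = entropy([6, 2])   # F2: 6 yes, 2 no
entropy_f3 = entropy([3, 3])   # F3: 3 yes, 3 no
print(round(entropy_f2, 3), round(entropy_f3, 3))
```

F2 comes out at roughly 0.811 bits and F3 at exactly 1 bit, matching the hand calculation.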
This splitting continues until we get a pure subset.
What is a pure subset?
A pure subset is one in which all labels are the same: either all yes or all no in this case.
We have done this for a single node. After splitting F2, we may need further attributes to reach a leaf node, and we then have to compute the entropy of those nodes as well and combine all of these entropy values in a summation. For that we have the concept of information gain.
Information Gain: Information gain is used to decide which feature to split on at each step in building the tree. Simplicity is best, so we want to keep our tree small. To do so, at each step we should choose the split that results in the purest daughter nodes. A commonly used measure of purity is called information.
For each node of the tree, the information value measures how much information a feature gives us about the class. The split with the highest information gain will be taken as the first split and the process will continue until all children nodes are pure, or until the information gain is 0.
The algorithm calculates the information gain for each split and the split which is giving the highest value of information gain is selected.
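We can say that in information gain we compute the weighted average of the entropies of the subsets produced by a specific split and subtract it from the entropy of the parent node:
Gain(S, A) = Entropy(S) - ∑v (|Sv| / |S|) * Entropy(Sv)
Sv = total samples in a subset after the split; for example, F2 has 6 yes + 2 no = 8
S = total samples before the split; for F1, 9 + 5 = 14
Now calculating the information gain:
Gain = 0.940 - (8/14) * 0.811 - (6/14) * 1.000 ≈ 0.048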
The algorithm repeats this calculation for every candidate split and uses the split with the highest information gain to construct the decision tree.
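The information gain for the F1 split can be computed as follows. This is a small sketch under the counts used in this article (9 yes / 5 no in the parent, 6/2 and 3/3 in the children); the function names are illustrative.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class counts."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Parent F1: 9 yes / 5 no; children F2: 6 yes / 2 no and F3: 3 yes / 3 no.
gain = information_gain([9, 5], [[6, 2], [3, 3]])
print(round(gain, 3))
```

The gain is small (about 0.048 bits) because F3 is still a 50/50 mix, so this split removes little uncertainty.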
The higher the information gain of a split, the higher the chance of it being selected for that node.
Gini Impurity:
Gini impurity is a measure used when building decision trees to determine how the features of a data set should split nodes to form the tree. More precisely, the Gini impurity of a data set is a number (between 0 and 0.5 for two classes) that indicates the likelihood of new, random data being misclassified if it were given a random class label according to the class distribution in the data set.
Entropy vs Gini Impurity
For a two-class problem, the maximum value for entropy is 1, whereas the maximum value for Gini impurity is 0.5.
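The two measures can be compared side by side on a few two-class nodes. The sketch below assumes the usual definition of Gini impurity, one minus the sum of squared class probabilities; the node counts are made up for illustration.

```python
from math import log2

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class counts."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

# A pure node, a mostly pure node, and a perfectly mixed node.
for counts in ([8, 0], [6, 2], [4, 4]):
    print(counts, round(gini(counts), 3), round(entropy(counts), 3))
```

Both measures are 0 for the pure node and reach their two-class maximum (0.5 and 1 respectively) for the 50/50 node.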
As the Gini Impurit
In this article, we have covered a lot of details about the decision tree: how it works, the maths behind it, and attribute selection measures such as entropy, information gain, and Gini impurity, with their formulas and how the algorithm uses them.
By now I hope you have got an idea about the decision tree, one of the best machine learning algorithms for solving classification problems.
The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.
This article was published as a part of the Data Science Blogathon.
Overview
What Is Decision Classification Tree Algorithm
How to build a decision tree from scratch
Terminologies related to decision tree
Difference between random forest and decision tree
Python Code Implementation of decision trees
There are various algorithms in machine learning for both regression and classification problems, but choosing the best and most efficient algorithm for the given dataset is the main task when developing a good machine learning model.
One such algorithm, suitable for both classification and regression problems, is the decision tree.
Decision trees mimic the way humans think when making a decision, so they are easy to understand.
The logic behind a decision tree can be easily understood because it has a flowchart-like, tree-like structure, which makes it easy to visualize and to extract information from the underlying process.
Table of Contents
What Is a Decision Tree
Elements of Decision Trees
How to Build a Decision Tree from Scratch
How Does the Decision Tree Algorithm Work
Acquaintance With EDA (Exploratory Data Analysis)
Decision Trees and Random Forests
Advantages of the Decision Tree
Python Code Implementation
1. What is a Decision Tree?
A decision tree is a supervised machine learning algorithm. It is used for both classification and regression problems. The decision tree is like a tree with nodes. The branches depend on a number of factors. It keeps splitting the data into branches until it reaches a stopping threshold. A decision tree consists of a root node, child nodes, and leaf nodes.
Let’s understand decision trees by taking a real-life scenario.
Imagine that you play football every Sunday and you always invite your friend to come and play with you. Sometimes your friend actually comes and sometimes he doesn’t.
Whether or not he comes depends on numerous factors, like weather, temperature, wind, and fatigue. We start to take all of these features into consideration and begin tracking them alongside your friend’s decision whether to come and play or not.
You can use this data to predict whether or not your friend will come to play football. The technique you could use is a decision tree. Here’s what the decision tree would look like after implementation:
2. Elements Of a Decision Tree
Every decision tree consists of the following elements:
a) Nodes: The points where the tree splits according to the value of some attribute/feature of the dataset.
b) Edges: They direct the outcome of a split to the next node. We can see in the figure above that there are nodes for features like outlook, humidity, and windy. There is an edge for each potential value of each of those attributes/features.
c) Root: This is the node where the first split takes place.
3. How to Build Decision Trees from Scratch?
While building a decision tree, the main thing is to select the best attribute from the dataset’s feature list for the root node as well as for the sub-nodes. The selection of the best attributes is achieved with the help of a technique known as the attribute selection measure (ASM).
With the help of ASM, we can easily select the best features for the respective nodes of the decision tree.
There are two techniques for ASM:
a) Information Gain
b) Gini Index
a) Information Gain:
1 Information gain is the measurement of the change in entropy after the dataset is split on an attribute.
2 It tells how much information a feature/attribute provides us.
3 Nodes are split and the decision tree is built according to the value of the information gain.
4 A decision tree always tries to maximize the information gain, and the node/attribute having the highest information gain is split first. Information gain can be calculated using the formula below:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy signifies the randomness in the dataset. It is defined as a metric to measure impurity. Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
S = total number of samples
P(yes) = probability of yes
P(no) = probability of no
b) Gini Index:
The Gini index is a measure of impurity/purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over one with a high Gini index.
The CART algorithm creates only binary splits, and it uses the Gini index to choose them.
The Gini index can be calculated using the formula below:
Gini Index = 1 - ∑j Pj^2
4. How Does the Decision Tree Algorithm Work?
The basic idea behind any decision tree algorithm is as follows:
1. Select the best Feature using Attribute Selection Measures(ASM) to split the records.
2. Make that attribute/feature a decision node and break the dataset into smaller subsets.
3. Build the tree by repeating this process recursively for each child until one of the following conditions is met:
a) All tuples belong to the same class.
b) There are no more attributes remaining.
c) There are no more instances remaining.
5. Decision Trees and Random Forests
Decision trees and random forests are both tree-based methods used in machine learning.
Decision trees are machine learning models that make predictions by testing the features of the data set one by one.
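To make a single decision tree concrete, the recursive procedure from section 4 can be sketched in plain Python. This is a simplified ID3-style illustration that uses information gain on categorical features only; it is not the CART implementation found in libraries, and the tiny football data set below is invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(n / total * log2(total / n) for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Return the attribute whose split gives the highest information gain."""
    def gain(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - weighted
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Recursively build a tree; leaves are class labels, inner nodes are dicts."""
    if len(set(labels)) == 1:            # stop: all tuples share one class
        return labels[0]
    if not attributes:                   # stop: no attributes left, use majority
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    children = {}
    # Only values present in the data are visited, so no subset is ever empty.
    for value in {row[attr] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows, sub_labels = zip(*pairs)
        children[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return {"attribute": attr, "children": children}

def predict(tree, row):
    """Follow the edges matching the row's values until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree

# Invented toy data: does the friend come to play football?
rows = [
    {"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"}, {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"}, {"outlook": "overcast", "windy": "yes"},
]
labels = ["yes", "no", "no", "no", "yes", "yes"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree["attribute"], predict(tree, {"outlook": "sunny", "windy": "yes"}))
```

Note that `predict` raises a `KeyError` for an attribute value that never appeared in training; a fuller implementation would fall back to a majority class at that node.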
Random forests, on the other hand, are collections of decision trees, each trained on a random sample of the data and considering random subsets of the features.
Instead of relying on just one decision tree, the random forest takes the prediction from every tree and gives the final output based on the majority vote of those predictions. In other words, a random forest can be defined as a collection of multiple decision trees.
6. Advantages of the Decision Tree
1 It is simple to implement and follows a flowchart-like structure that resembles human decision making.
2 It proves to be very useful for decision-related problems.
3 It helps to find all of the possible outcomes for a given problem.
4 There is very little need for data cleaning in decision trees compared to other machine learning algorithms.
5 It handles both numerical and categorical values.
7. Disadvantages of the Decision Tree
1 Too many layers can make a decision tree extremely complex.
2 It may result in overfitting (which can be mitigated using the random forest algorithm).
3 As the number of class labels grows, the computational complexity of the decision tree increases.
8. Python Code Implementation
# Numerical computing libraries and loading data
import pandas as pd
import seaborn as sns

# The original loading step is missing here; 'kyphosis.csv' is an assumed file name
raw_data = pd.read_csv('kyphosis.csv')

# Exploratory data analysis
raw_data.info()
sns.pairplot(raw_data, hue='Kyphosis')

# Split the data set into training data and test data
from sklearn.model_selection import train_test_split
x = raw_data.drop('Kyphosis', axis=1)
y = raw_data['Kyphosis']
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size=0.3)

# Train the decision tree model
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

# Measure the performance of the decision tree model
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(classification_report(y_test_data, predictions))
print(confusion_matrix(y_test_data, predictions))
My name is Pranshu Sharma and I am a Data Science Enthusiast
This article was published as a part of the Data Science Blogathon.
Introduction
In machine learning, the amount and quality of the data are essential to model training and performance. The amount of data strongly affects machine learning and deep learning algorithms, and the behavior of most algorithms changes as the amount of data increases or decreases. When data is limited, machine learning algorithms must be handled carefully to get better results and accurate models. Deep learning algorithms are also data-hungry, requiring a large amount of data for better accuracy.
In this article, we will discuss the relationship between the amount and quality of data and machine learning and deep learning algorithms, the problems that arise with limited data, and ways of dealing with it. Knowledge of these key concepts will help one understand the algorithm vs. data scenario and deal with limited data efficiently.
The “Amount of Data vs. Performance” Graph
In machine learning, a question may come to mind: how much data is required to train a good machine learning or deep learning model? There is no fixed threshold or answer to this, as every dataset is different and has different features and patterns. Still, there are some threshold levels after which the performance of machine learning or deep learning algorithms tends to become constant.
Most of the time, machine learning and deep learning models tend to perform better as the amount of data fed to them increases, but beyond a certain amount of data, the performance of the models becomes constant, and they stop learning from additional data.
The picture above shows the performance of some well-known machine learning and deep learning architectures against the amount of data fed to the algorithms. Here we can see that traditional machine learning algorithms learn a lot from the data in the initial period, while the amount of data fed is increasing, but once a threshold level is reached, the performance becomes constant. If you then provide more data to the algorithm, it will not learn anything more, and the performance will neither increase nor decrease.
In the case of deep learning algorithms, there are three types of deep learning architectures in the diagram. The shallow type of deep learning architecture is the smallest in terms of depth, meaning that it has few hidden layers and neurons. In deep neural networks, the number of hidden layers and neurons is very high.
From the diagram, we can see that all three deep learning architectures perform differently as the amount of data fed increases. Shallow neural networks tend to behave like traditional machine learning algorithms, where the performance becomes constant after some threshold amount of data. Deep neural networks, in contrast, keep learning from the data as new data is fed.
From the diagram, we can conclude that,
“THE DEEP NEURAL NETWORKS ARE DATA HUNGRY”
What Problems Arise with Limited Data?
Several problems occur with limited data, and a model trained on limited data may perform poorly. The common issues that arise with limited data are listed below:
1. Classification:
In classification, if too little data is fed, the model will classify observations wrongly, meaning that it will not give the accurate output class for the given inputs.
2. Regression:
In a regression problem, the model is expected to predict a number. If it is trained on limited data, its accuracy will be low, and its predictions may be very far from the actual output.
3. Clustering:
In clustering problems, a model trained on limited data can assign points to the wrong clusters.
4. Time Series:
In time series analysis, we forecast some data for the future. A low-accuracy time series model can give us inferior forecast results, and there may be a lot of errors related to time.
5. Object Detection:
If an object detection model is trained on limited data, it might not detect the object correctly, or it might classify the object incorrectly.
How to Deal With Problems of Limited Data?
There is no accurate or fixed method for dealing with limited data. Every machine learning problem is different, and the way of solving each particular problem is different. But some standard techniques are helpful in many cases.
1. Data Augmentation
Data augmentation is a technique in which existing data is used to generate new data. The newly generated data will look like the old data, but some of the values and parameters will be different.
This approach can increase the amount of data, and there is a high likelihood of improving the model’s performance.
Data augmentation is preferred in most deep learning problems where image data is limited.
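A minimal sketch of the idea, assuming a grayscale image stored as a nested list of pixel values in [0, 1]: each augmented copy is a horizontal flip of the original with a little random noise added. The transformations and the noise level are illustrative choices, not a recommendation for any particular task.

```python
import random

def augment(image, rng, noise=0.05):
    """Return a new image: a horizontal flip of `image` with small pixel noise."""
    flipped = [row[::-1] for row in image]
    # Add uniform noise to every pixel and clamp the result back into [0, 1].
    return [[min(1.0, max(0.0, px + rng.uniform(-noise, noise))) for px in row]
            for row in flipped]

rng = random.Random(0)                 # seeded so the sketch is reproducible
image = [[0.0, 0.5, 1.0],
         [0.2, 0.4, 0.6]]
augmented = [augment(image, rng) for _ in range(3)]
dataset = [image] + augmented          # one original plus three variants
print(len(dataset))
```

In practice, image libraries offer many more transformations (rotation, cropping, color shifts); the point here is only that each new sample resembles the original while differing in its values.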
2. Don’t Drop and Impute:
In some datasets, a high fraction of the data is invalid or empty. Because of that, some of the data is dropped to keep the process simple, but doing so decreases the amount of data, and several problems can occur. Imputing the missing values instead keeps those rows available for training.
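As a small illustration of imputing instead of dropping, the sketch below fills missing entries (represented here as `None`) with the median of the observed values. Median imputation is just one common choice, and the age values are made up.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 47, None, 38]
imputed = impute_median(ages)
print(imputed)
```

All six rows remain usable after imputation, whereas dropping rows with missing values would have left only four.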
3. Custom Approach:
If data is limited, one could search the internet for similar data. Once such data is obtained, it can be used to generate more data or be merged with the existing data.
Conclusion
In this article, we discussed limited data, how the performance of several machine learning and deep learning algorithms changes as the amount of data increases or decreases, the types of problems that can occur due to limited data, and the common ways to deal with limited data. This article will help one understand limited data, its effects on performance, and how to handle it.
Some Key Takeaways from this article are:
1. Traditional machine learning algorithms and shallow neural networks are not affected by the amount of data beyond some threshold level.
2. Deep neural networks are data-hungry algorithms that keep learning as more data is fed to them.
3. Limited data can cause problems in every field of machine learning applications, e.g., classification, regression, time series, image processing, etc.
4. We can apply Data augmentation, imputation, and some other custom approaches based on domain knowledge to handle the limited data.
Want to Contact the Author?
Follow Parth Shukla @AnalyticsVidhya, LinkedIn, Twitter, and Medium for more content.
Introduction to Data Preprocessing in Machine Learning
The following article provides an outline for data preprocessing in machine learning. Data preprocessing, also known as data wrangling, is the technique of transforming raw data (data that is incomplete, inconsistent, full of errors, or lacking certain behavior) into an understandable format. It proceeds through a series of steps, from importing libraries and data to checking for missing values and categorical data, followed by feature scaling and validation, so that proper interpretations can be made from the data and negative results can be avoided. The quality of a machine learning model depends heavily on the quality of the data we train it on.
Data collected for training the model comes from various sources. This collected data is generally in its raw format, i.e., it can contain noise such as missing values, irrelevant information, or numbers in string format, or it can be unstructured. Data preprocessing increases the efficiency and accuracy of machine learning models, as it helps remove this noise from the dataset and give meaning to the dataset.
Six Different Steps Involved in Machine Learning
Following are six different steps involved in machine learning to perform data pre-processing:
Step 1: Import libraries
Step 2: Import data
Step 3: Checking for missing values
Step 4: Checking for categorical data
Step 5: Feature scaling
Step 6: Splitting data into training, validation, and evaluation sets
1. Import Libraries
The very first step is to import a few of the important libraries required in data preprocessing. A library is a collection of modules that can be called and used. In Python, we have a lot of libraries that are helpful in data preprocessing.
A few of the important libraries in Python are:
NumPy: The library most commonly used for implementing the complicated mathematical computations of machine learning. It is useful for performing operations on multidimensional arrays.
Pandas: An open-source library that provides high-performance, easy-to-use data structures and data analysis tools in Python. It is designed to make working with relational and labeled data easy and intuitive.
Matplotlib: A visualization library for Python for 2D plots of arrays. It is built on NumPy arrays and designed to work with the broader SciPy stack. Visualization of datasets is helpful when a large amount of data is available. Plots available in Matplotlib include line, bar, scatter, histogram, etc.
Seaborn: Also a visualization library for Python. It provides a high-level interface for drawing attractive and informative statistical graphs.
2. Import Dataset
Once the libraries are imported, our next step is to load the collected data. The Pandas library is used to import these datasets. Datasets are mostly available in CSV format, as such files are small, which makes them fast to process. To load a CSV file, use the read_csv function of the pandas library. Various other formats of the dataset that can be seen are
Once the dataset is loaded, we have to inspect it and look for any noise. To do so, we create a feature matrix X and an observation vector Y with respect to X.
3. Checking for Missing Values
Once you create the feature matrix, you might find that there are some missing values. If we don’t handle them, they may cause a problem at the time of training.
One option is to remove the entire row that contains the missing value, but you may end up losing some vital information. This can be a good approach if the dataset is large.
Alternatively, if a numerical column has a missing value, you can estimate the value by taking the mean, median, mode, etc.
4. Checking for Categorical Data
Data in the dataset has to be in numerical form so that computation can be performed on it. Since machine learning models involve complex mathematical computation, we can’t feed them non-numerical values. So, it is important to convert all text values into numerical values. The LabelEncoder() class of sklearn is used to convert these categorical values into numerical values.
5. Feature Scaling
The values in raw data can vary widely in range, which may result in biased training of the model or increase the computational cost. So it is important to normalize them. Feature scaling is a technique used to bring the data values into a smaller range.
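As a sketch of what bringing values into a smaller range can look like, the snippet below rescales one numeric column in two common ways: min-max rescaling to [0, 1] and z-score standardization. It uses only the standard library, assumes the column is not constant, and the income figures are invented.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [20_000, 35_000, 50_000, 80_000, 110_000]   # hypothetical column
scaled = min_max_scale(incomes)
z_scores = standardize(incomes)
print([round(s, 3) for s in scaled])
print([round(z, 3) for z in z_scores])
```

When a model is trained on scaled features, the same minimum, maximum, mean, and standard deviation computed on the training data should be reused for validation and test data.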
Methods used for feature scaling are:
Rescaling (min-max normalization)
Standardization (Z-score Normalization)
Scaling to unit length
6. Splitting Data into Training, Validation and Evaluation Sets
Finally, we need to split our data into three different sets: a training set to train the model, a validation set to validate the accuracy of our model, and a test set to test the performance of our model on unseen data. Before splitting the dataset, it is important to shuffle it to avoid any biases. A common proportion for dividing the dataset is 60:20:20, i.e., 60% as the training set and 20% each as the validation and test sets. To split the dataset, use train_test_split of sklearn.model_selection twice: once to split the dataset into a train and a validation set, and then to split the remaining train data into train and test sets.
Conclusion – Data Preprocessing in Machine Learning
Data preprocessing is something that requires practice. It is not like a simple data structure that you learn once and apply directly to solve a problem. To get good knowledge of how to clean a dataset or how to visualize your dataset, you need to work with different datasets. The more you use these techniques, the better you will understand them. This was a general idea of how data preprocessing plays an important role in machine learning. Along with that, we have also seen the steps needed for data preprocessing. So next time, before training a model on the collected data, be sure to apply data preprocessing.
This article was published as part of the Data Science Blogathon.
Introduction :
In this post, we will go through some of the major challenges that you might face while developing your machine learning model. We assume that you know what machine learning is really about, why people use it, what the different categories of machine learning are, and how the overall development workflow takes place.
What can possibly go wrong during the development and prevent you from getting accurate predictions?
So let’s get started. During the development phase, our focus is to select a learning algorithm and train it on some data, so the two things that might be a problem are a bad algorithm, bad data, or perhaps both.
Table of Content :
Not enough training data.
Poor Quality of data.
Irrelevant Features.
Nonrepresentative training data.
Overfitting and Underfitting.
1. Not enough training data :
Consider a child: to teach him what an apple is, all it takes is for you to point to an apple and say “apple” repeatedly. Now the child can recognize all sorts of apples. Machine learning is not there yet; most algorithms need a lot of data to work properly.
2. Poor Quality of data:
Obviously, if your training data has lots of errors, outliers, and noise, it will make it impossible for your machine learning model to detect a proper underlying pattern. Hence, it will not perform well.
So put every ounce of effort into cleaning up your training data. No matter how good you are at selecting the model and tuning its hyperparameters, this part plays a major role in helping us build an accurate machine learning model.
“Most Data Scientists spend a significant part of their time in cleaning data”.
There are a couple of cases in which you’d want to clean up the data:
If some instances are clear outliers, just discard them or fix them manually.
If some instances are missing a feature (e.g., 2% of users did not specify their age), you can ignore these instances, fill in the missing values with the median age, or train one model with the feature and one without it and compare them.
3. Irrelevant Features:
“Garbage in, garbage out (GIGO).”
In the above image, we can see that even if our model is “AWESOME”, when we feed it garbage data the output will also be garbage. Our training data must always contain more relevant features and few to no irrelevant ones.
Much of the credit for a successful machine learning project goes to coming up with a good set of features to train on (often referred to as feature engineering), which includes feature selection, feature extraction, and creating new features; these are other interesting topics to be covered in upcoming blogs.
4. Nonrepresentative training data:
To make sure that our model generalizes well, our training data must be representative of the new cases that we want to generalize to.
If we train our model on a nonrepresentative training set, its predictions won’t be accurate; it will be biased against one class or group.
For example, let us say you are trying to build a model that recognizes the genre of music. One way to build your training set is to search for music on YouTube and use the resulting data. Here we assume that YouTube’s search engine provides representative data, but in reality, the search will be biased towards popular artists, and maybe even the artists that are popular in your location (if you live in India, you will get the music of Arijit Singh, Sonu Nigam, etc.).
So use representative data during training, so that your model won’t be biased towards one or two classes when it works on testing data.
5. Overfitting and Underfitting :
What is overfitting?
Let’s start with an example. Say one day you are walking down a street to buy something, and a dog comes out of nowhere. You offer him something to eat, but instead of eating he starts barking and chasing you, though somehow you get away safely. After this particular incident, you might think that no dog is worth treating nicely.
This overgeneralization is what we humans do most of the time, and unfortunately a machine learning model does the same if we don’t pay attention. In machine learning, we call this overfitting, i.e., the model performs well on training data but fails to generalize well.
Overfitting happens when our model is too complex.
Things which we can do to overcome this problem:
Simplify the model by selecting one with fewer parameters.
By reducing the number of attributes in training data.
Constraining the model.
Gather more training data.
Reduce the noise.
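The gap between training and test performance can be shown with a deliberately extreme toy example. In the sketch below, a "memorizer" (a lookup table with one entry per training point) stands in for an overly complex model and a single learned cut-off stands in for a simpler one; the data, including one noisy label at x = 30, is invented for illustration.

```python
from collections import Counter

def fit_memorizer(xs, ys):
    """Overly complex model: a lookup table with one entry per training point."""
    table = dict(zip(xs, ys))
    fallback = Counter(ys).most_common(1)[0][0]   # majority class for unseen inputs
    return lambda x: table.get(x, fallback)

def fit_threshold(xs, ys):
    """Simple model: predict 1 when x is at or above the best training cut-off."""
    def score(t):
        return sum((x >= t) == bool(y) for x, y in zip(xs, ys))
    best = max(sorted(set(xs)), key=score)
    return lambda x: int(x >= best)

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

# The underlying rule is "1 when x >= 50"; the label at x = 30 is noise.
train_x = [10, 20, 30, 40, 45, 55, 60, 70, 80, 90]
train_y = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
test_x = [5, 15, 25, 35, 65, 75, 85, 95]
test_y = [0, 0, 0, 0, 1, 1, 1, 1]

memorizer = fit_memorizer(train_x, train_y)
simple = fit_threshold(train_x, train_y)
print(accuracy(memorizer, train_x, train_y), accuracy(memorizer, test_x, test_y))
print(accuracy(simple, train_x, train_y), accuracy(simple, test_x, test_y))
```

The memorizer is perfect on the training set, noise included, but drops to chance level on unseen points, while the simpler model misses the noisy training label and still generalizes. Comparing training and held-out accuracy in this way is a practical check for overfitting.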
What is underfitting?
Yes, you guessed it right: underfitting is the opposite of overfitting. It happens when our model is too simple to learn something from the data. For example, if you use a linear model on data with a strongly non-linear pattern, it will likely underfit, and the predictions are bound to be inaccurate even on the training set.
Things which we can do to overcome this problem:
Train on better and relevant features.
Reduce the constraints on the model.
Conclusion
Machine learning is all about making machines better by using data, so that we don’t need to code them explicitly. A model will not perform well if the training data is too small, is noisy with errors and outliers, is not representative (which results in bias), or consists of irrelevant features (garbage in, garbage out). Lastly, the model itself should be neither too simple (which results in underfitting) nor too complex (which results in overfitting). Even after you have trained a model with the above points in mind, don’t expect it to simply generalize well to new cases: you may need to evaluate and fine-tune it. How to do that? Stay tuned, this topic will be covered in upcoming blogs.
Karan Amal Pradhan.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
How can unidentified COVID-19 cases be tracked?
Researchers and provider organisations have increasingly embraced artificial intelligence (AI) and machine learning (ML) tools to track and reduce the spread of COVID-19 and to improve their surveillance efforts. Big data analytics systems have helped health experts stay ahead of the pandemic, from predicting patient outcomes to anticipating future hotspots, resulting in more efficient care delivery. However, the pandemic preparedness of healthcare organisations is only as good as the data available to them. Although the industry is well aware of its data issues, the COVID-19 pandemic has brought a host of unique challenges to the forefront of care delivery. The nature of SARS-CoV-2 has led to significant gaps and inconsistencies in COVID-19 data, leaving officials uncertain about the effectiveness of public health interventions.
“Asymptomatic infections are a common phenomenon in the spread of coronavirus,” said Lucy Li, PhD, a data scientist at the Chan Zuckerberg Biohub. “And it’s very important to understand that phenomenon because depending on how many asymptomatic infections there are, public health interventions might be different.”
Researchers at the Chan Zuckerberg Biohub are working to address this problem. Using machine learning and cloud computing, Li estimated the number of undetected infections at 12 locations, including sites in Asia, Europe, and the U.S., over the course of the pandemic. The results showed that a large share of infections remained undetected in these parts of the world, with the rate of unidentified cases exceeding 90% in Shanghai. Additionally, when the virus first arrived in these 12 locations, more than 98% of cases went unreported during the first few weeks of the outbreak. This indicates that the pandemic was already well underway by the time intensive testing began. Such findings have crucial implications for public health policy and provider organisations, Li noted.
“For disease outbreaks where you can identify every single infection, rapid testing and a tiny amount of contact tracing is enough to get the epidemic under control,” stated Li. “But for coronavirus, there are so many asymptomatic cases out there and testing alone will not help control the pandemic.”
“It is because usually when you do testing, you are testing only symptomatic patients which are a subset of the total number of infections out there,” explains Li. “You’re missing a lot of people who are spreading the infection without their knowledge, hence they are not quarantined. Being able to get a sense of what that number might be is helpful for allocating resources.”
Li’s research was backed by the AWS Diagnostic Development Initiative, a global effort to stimulate diagnostic research and innovation during the coronavirus pandemic and to mitigate future disease outbreaks.
The data Li uses are viral genomes, the genetic material of the virus. She elaborates, “As the viral genomes spread through the population, they accumulate mutations. These mutations are generally not good or bad; they’re just changes in the genome.” She added, “Every time the virus infects a new individual, it could accumulate new mutations. So, if we know how fast the virus mutates, we can infer how many missing transmission links there were in between the observed genomes.”
Li said, “Many different scenarios could explain what we see in the viral genomes. I have to leverage machine learning and cloud computing to test all of those hypotheses and to see which one can explain the observed changes in viral genomes.” She pointed out that these data analytics methods are well suited to the challenges brought by COVID-19. ML tools allow researchers to explore different explanations of the data they see, so that they can test many hypotheses. With ML and cloud computing technologies, it becomes possible to streamline a previously time-consuming task.
With access to more computational resources in the cloud, analysis time can be reduced from months to days, because the additional memory and compute capacity allow the analysis to be parallelised more effectively. This research may help health officials monitor the rate of under-reporting in real time, which could indicate how well current surveillance systems are operating. Given the data available on the COVID-19 pandemic, analytics tools are essential for bringing new insights and potential solutions.