Let’s Solve Overfitting! Quick Guide To Cost Complexity Pruning Of Decision Trees


This article was published as a part of the Data Science Blogathon.

Understanding the problem of Overfitting in Decision Trees and solving it by Minimal Cost-Complexity Pruning using Scikit-Learn in Python

Decision Tree is one of the most intuitive and effective tools present in a Data Scientist’s toolkit. It has an inverted tree-like structure that was once used only in Decision Analysis but is now a brilliant Machine Learning Algorithm as well, especially when we have a Classification problem on our hands.

These decision trees are well known for their ability to capture the patterns in the data. But excess of anything is harmful, right? Decision Trees are infamous for clinging too closely to the data they are trained on, i.e., overfitting.

Hence, our tree gives poor results on deployment because it cannot generalize to a new set of values.

But, don’t worry! Just like a skilled mechanic has wrenches of all sizes readily available in his toolbox, a skilled Data Scientist also has his set of techniques to deal with any kind of problem. And that’s what we’ll explore in this article.

The Role of Pruning in Decision Trees

Pruning is one of the techniques used to overcome our problem of Overfitting. Pruning, in its literal sense, is the practice of selectively removing certain parts of a tree (or plant), such as branches, buds, or roots, to improve the tree’s structure and promote healthy growth. This is exactly what Pruning does to our Decision Trees as well: it makes them versatile so that they can adapt to any new kind of data we feed them, thereby fixing the problem of overfitting.

It reduces the size of a Decision Tree which might slightly increase your training error but drastically decrease your testing error, hence making it more adaptable.

Minimal Cost-Complexity Pruning is one of the types of Pruning of Decision Trees.

This algorithm is parameterized by α(≥0) known as the complexity parameter.

In version 0.22, Scikit-learn introduced a parameter called ccp_alpha (yes, it’s short for Cost Complexity Pruning – Alpha) for Decision Trees, which can be used to perform exactly this pruning.

Building the Decision Tree in Python

We will use the Iris dataset to fit the Decision Tree on. You can download the dataset here.

First, let us import the basic libraries required and the dataset:

Python Code:
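The embedded code cell did not survive extraction, so here is a minimal sketch, assuming the built-in Iris data with only the sepal measurements the article uses (the variable names are illustrative):

import pandas as pd
from sklearn.datasets import load_iris

# Load Iris as a DataFrame; the article uses only sepal length and width.
iris = load_iris(as_frame=True)
df = iris.frame
X = df[['sepal length (cm)', 'sepal width (cm)']]
y = df['target']   # species, encoded as 0, 1, 2
df.head()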



The Dataset looks like this:

Our aim is to predict the Species of a flower based on its Sepal Length and Width.

We will split the dataset into two parts – Train and Test. We’re doing this so that we can see how our model performs on unseen data as well. We shall use the train_test_split function from sklearn.model_selection to split the dataset.
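A sketch of that split (the test size and random seed are assumptions, since the article does not state them):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; the seed is fixed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)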

Now, let’s fit a Decision Tree to the train part and predict on both test and train. We will use DecisionTreeClassifier from sklearn.tree for this purpose.

By default, the Decision Tree function doesn’t perform any pruning and allows the tree to grow as much as it can. We get an accuracy score of 0.95 and 0.63 on the train and test part respectively as shown below. We can say that our model is Overfitting i.e. memorizing the train part but is not able to perform equally well on the test part.
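A sketch of the fit-and-score step (exact scores depend on your split; the article reports roughly 0.95 train and 0.63 test):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

clf = DecisionTreeClassifier(random_state=0)   # no pruning by default (ccp_alpha=0.0)
clf.fit(X_train, y_train)
print('Train accuracy:', accuracy_score(y_train, clf.predict(X_train)))
print('Test accuracy:', accuracy_score(y_test, clf.predict(X_test)))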

DecisionTreeClassifier in sklearn has a method called cost_complexity_pruning_path, which gives the effective alphas of the subtrees during pruning along with the corresponding impurities. In other words, we can use these values of alpha to prune our decision tree:
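For example, against the classifier fitted above:

# Effective alphas and the corresponding total leaf impurities of the pruned subtrees.
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
print(ccp_alphas)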

We will take these values of alpha and pass them to the ccp_alpha parameter of our DecisionTreeClassifier. By looping over the alphas array, we will find the accuracy on both the Train and Test parts of our dataset, as shown below.
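A sketch of that loop and the accuracy plot:

import matplotlib.pyplot as plt

train_scores, test_scores = [], []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)  # prune at this alpha
    tree.fit(X_train, y_train)
    train_scores.append(tree.score(X_train, y_train))
    test_scores.append(tree.score(X_test, y_test))

plt.plot(ccp_alphas, train_scores, marker='o', label='train accuracy')
plt.plot(ccp_alphas, test_scores, marker='o', label='test accuracy')
plt.xlabel('ccp_alpha')
plt.ylabel('accuracy')
plt.legend()
plt.show()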

From the above plot, we can see that between alpha=0.01 and 0.02, we get the maximum test accuracy. Although our train accuracy has decreased to 0.8, our model is now more generalized and it will perform better on unseen data.

End Notes

You can find the notebook on my GitHub and take a closer look at what I have done. Also, connect with me on LinkedIn, and let’s discuss Data!



Beginner’s Guide To Decision Tree Classification Using Python

This article was published as a part of the Data Science Blogathon.

Overview

What is the Decision Tree Classification Algorithm

How to build a decision tree from scratch

Terminologies related to decision tree

Difference between random forest and decision tree

Python Code Implementation of decision trees

There are various Machine Learning algorithms for both regression and classification problems, but choosing the best and most efficient algorithm for the given dataset is the key step in developing a good Machine Learning model.

One such algorithm, good for both classification (categorical) and regression problems, is the Decision Tree.

Decision Trees closely mimic human decision-making, so they are easy to understand.

The logic behind a decision tree is easy to follow because it has a flow-chart-like, tree-like structure, which makes it easy to visualize and to extract information about the underlying process.

Table of Contents

What Is a Decision Tree

Elements of Decision Trees

How to build a decision tree from scratch

How Does the Decision Tree Algorithm Work

Acquaintance With EDA( Exploratory Data Analysis)

Decision Trees and Random Forests

Advantages of the Decision Tree

Python Code Implementation

1. What is a Decision Tree?

A Decision Tree is a supervised Machine Learning algorithm. It is used in both classification and regression problems. The decision tree is like a tree with nodes, whose branches depend on a number of factors. It splits data into branches like these until it achieves a threshold value. A decision tree consists of a root node, child nodes, and leaf nodes.

Let’s Understand the decision tree methods by Taking one Real-life Scenario

Imagine that you play football every Sunday and you always invite your friend to come to play with you. Sometimes your friend actually comes and sometimes he doesn’t.

Whether or not your friend comes depends on numerous factors, like weather, temperature, wind, and fatigue. You start to take all of these features into consideration and begin tracking them alongside your friend’s decision whether or not to come to play.

You can use this data to predict whether your friend will come to play football. The technique you could use is a decision tree. Here’s what the decision tree would look like after implementation:

2. Elements Of a Decision Tree

Every decision tree consists of the following elements:

a) Nodes

b) Edges

c) Root

d) Leaves

a) Nodes: The points where the tree splits according to the value of some attribute/feature of the dataset.

b) Edges: Edges direct the outcome of a split to the next node. In the figure above, there are nodes for features like outlook, humidity, and windy, and there is an edge for each potential value of each of those attributes/features.

c) Root: The node where the first split takes place.

d) Leaves: The terminal nodes of the tree; they hold the final output (class label) and are not split further.

3. How to Build Decision Trees from Scratch?

While building a decision tree, the main task is to select the best attribute from the dataset’s feature list for the root node as well as for the sub-nodes. The selection of the best attribute is achieved with the help of a technique known as the Attribute Selection Measure (ASM).

With the help of ASM, we can easily select the best features for the respective nodes of the decision tree.

There are two techniques for ASM:

a) Information Gain

b) Gini Index

a) Information Gain:

1. Information gain measures the change in entropy after the dataset is split on an attribute.

2. It tells us how much information a feature/attribute provides.

3. The splitting of nodes and the building of the decision tree are driven by the value of the information gain.

4. The decision tree always tries to maximize the information gain, and the node/attribute with the highest information gain is split first. Information gain can be calculated using the formula below:

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy: Entropy signifies the randomness in the dataset. It is defined as a metric to measure impurity. Entropy can be calculated as:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Where,

S = the total number of samples

P(yes) = probability of yes

P(no) = probability of no
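As a quick illustration, the formulas above can be computed directly (the function names and toy labels are illustrative):

import numpy as np

def entropy(labels):
    # Entropy(S) = -P(class) * log2 P(class), summed over the classes present.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # IG = Entropy(S) - weighted average of the children's entropies.
    weighted = sum(len(c) / len(parent) * entropy(c) for c in children)
    return entropy(parent) - weighted

labels = ['yes', 'yes', 'yes', 'no', 'no']
split = [['yes', 'yes'], ['yes', 'no', 'no']]   # one candidate split
print(entropy(labels), information_gain(labels, split))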

b) Gini Index:

The Gini index is a measure of impurity (or purity) used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.

An attribute with a low Gini index value should be preferred over one with a high Gini index value.

The CART algorithm uses the Gini index to create binary splits only.

Gini index can be calculated using the below formula:

Gini Index = 1 - Σj (Pj)²
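The Gini index can be computed the same way (a small illustrative sketch):

import numpy as np

def gini_index(labels):
    # Gini = 1 - sum of the squared class proportions Pj.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini_index(['yes', 'yes', 'no']))   # 1 - (2/3)**2 - (1/3)**2 = 0.444...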

4. How Does the Decision Tree Algorithm Work?

The basic idea behind any decision tree algorithm is as follows:

1. Select the best Feature using Attribute Selection Measures(ASM) to split the records.

2. Make that attribute/feature a decision node and break the dataset into smaller subsets.

3. Start the tree-building process by repeating this procedure recursively for each child until one of the following conditions is met (a minimal sketch follows this list):

a) All tuples belong to the same attribute (class) value.

b) There are no attributes remaining.

c) There are no instances remaining.
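To make the recursion concrete, here is a minimal, illustrative sketch (not the article’s code; the helper names and toy data are assumptions):

from collections import Counter
import numpy as np

def entropy_of(labels):
    # Entropy(S) = -sum(p * log2(p)) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_feature(rows, labels, features):
    # ASM step: pick the feature whose split maximizes information gain.
    def gain(f):
        groups = {}
        for r, l in zip(rows, labels):
            groups.setdefault(r[f], []).append(l)
        weighted = sum(len(g) / len(labels) * entropy_of(g) for g in groups.values())
        return entropy_of(labels) - weighted
    return max(features, key=gain)

def build_tree(rows, labels, features):
    # Stop conditions a) and b): one class left, or no attributes remaining.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    f = best_feature(rows, labels, features)          # step 1: select via ASM
    node = {f: {}}                                    # step 2: decision node
    for v in set(r[f] for r in rows):                 # step 3: recurse per child
        sub = [(r, l) for r, l in zip(rows, labels) if r[f] == v]
        srows, slabels = map(list, zip(*sub))
        node[f][v] = build_tree(srows, slabels, [x for x in features if x != f])
    return node

data = [{'outlook': 'sunny', 'windy': 'no'}, {'outlook': 'rain', 'windy': 'yes'},
        {'outlook': 'sunny', 'windy': 'yes'}, {'outlook': 'rain', 'windy': 'no'}]
target = ['play', 'stay', 'play', 'play']
print(build_tree(data, target, ['outlook', 'windy']))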

5. Decision Trees and Random Forests

Decision trees and random forests are both tree-based methods used in Machine Learning.

Decision trees are Machine Learning models that make predictions by going through the features in the dataset one by one.

Random forests, on the other hand, are a collection of decision trees grouped and trained together, each built on a random subset of the features (and of the training data).

Instead of relying on just one decision tree, the random forest takes the prediction from every tree and, based on the majority vote of those predictions, gives the final output. In other words, a random forest can be defined as a collection of multiple decision trees.

6. Advantages of the Decision Tree

1. It is simple to implement, and it follows a flow-chart-like structure that resembles human decision-making.

2. It proves to be very useful for decision-related problems.

3. It helps to find all of the possible outcomes for a given problem.

4. There is very little need for data cleaning compared to other Machine Learning algorithms.

5. It handles both numerical and categorical values.

7. Disadvantages of the Decision Tree

1. Too many layers can make a decision tree extremely complex.

2. It may result in overfitting (which can be resolved using the Random Forest algorithm).

3. As the number of class labels grows, the computational complexity of the decision tree increases.

8. Python Code Implementation
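The imports-and-loading cell did not survive extraction; below is a minimal reconstruction, assuming the standard Kyphosis dataset used in the code that follows (the CSV filename is an assumption):

#Numerical computing libraries and Loading Data (reconstructed sketch)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

raw_data = pd.read_csv('kyphosis.csv')   # assumed filename for the Kyphosis dataset
raw_data.head()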



#Exploratory data analysis
raw_data.info()
sns.pairplot(raw_data, hue = 'Kyphosis')

#Split the data set into training data and test data
from sklearn.model_selection import train_test_split
x = raw_data.drop('Kyphosis', axis = 1)
y = raw_data['Kyphosis']
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)

#Train the decision tree model
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

#Measure the performance of the decision tree model
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(classification_report(y_test_data, predictions))
print(confusion_matrix(y_test_data, predictions))

My name is Pranshu Sharma and I am a Data Science Enthusiast

Email: [email protected]

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Fixed Cost Vs Variable Cost

Difference between Fixed Cost vs Variable Cost

The following article provides an outline of Fixed Cost vs Variable Cost. The major difference between these two costs is that the variable cost depends on the output of production, while the fixed cost is independent of the output.


What is Fixed Cost?

Fixed cost is defined as a cost that does not change its value with any change (increase or decrease) in the goods produced or services sold. Changes in activity levels do not affect fixed costs. This does not mean the cost will remain fixed forever; it means it will be constant for a particular period of time, e.g., the interest charged on a loan is fixed for the period until the loan is renewed. Fixed cost and variable cost are the two main pillars of any industry’s production and service line. There are two types of fixed costs: committed fixed cost and discretionary fixed cost. Fixed cost can be considered a sunk cost.

What is Variable Cost?

Variable cost is defined as a cost that changes in proportion to the goods produced or services sold: it rises when production rises and falls when production falls. E.g., if the labor charge is Rs 5 per unit, producing 100 units costs 5*100 = Rs 500, 200 units cost 5*200 = Rs 1000, and 300 units cost 5*300 = Rs 1500.

Head to Head Comparison Between Fixed Cost vs Variable Cost (Infographics)

Below are the top 8 differences between Fixed cost and Variable Cost:

Key Differences between Fixed Cost vs Variable Cost

Examples of variable costs are raw materials, labor, packaging, freight, and commission. As the volume increases, these costs increase, because each extra item produced requires more materials, labor, etc. Hence these costs are directly proportional to the volume of items produced.

Examples of fixed costs are rental payments, depreciation, insurance, interest payments, etc. These items do not change even if you increase the volume of production; e.g., even if you produce one extra item, the rental payment stays the same, so it is a fixed cost.

Variable cost varies with variation in the production volume. Fixed cost has no relation to the output capacity.

Fixed cost does not change with the volume and remains constant for a given period of time, e.g., until a new lease contract is signed, the lease payment remains fixed. Variable cost changes with the production volume.

Example of calculating the fixed cost: Suppose the total fixed cost is Rs 1000 and the total units produced are 10. Therefore, the fixed cost per unit is Rs 1000/10 = Rs 100. The variable cost of labor is Rs 5 per unit of production; therefore, making 10 units costs 10*5 = Rs 50. The total cost of production is the sum of the total variable cost and the total fixed cost.
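The same arithmetic as a quick script (values taken from the example above):

total_fixed_cost = 1000        # Rs
units_produced = 10
labor_cost_per_unit = 5        # Rs, the only variable cost considered here

fixed_cost_per_unit = total_fixed_cost / units_produced          # Rs 100
total_variable_cost = labor_cost_per_unit * units_produced       # Rs 50
total_cost = total_fixed_cost + total_variable_cost              # Rs 1050
print(fixed_cost_per_unit, total_variable_cost, total_cost)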

Here the only variable considered is labor; we need to add the variable cost of all the other items to the fixed cost to get the total cost. Fixed cost per unit changes: as the number of units increases, the fixed cost per unit decreases. Variable cost per unit remains constant, so total variable cost is directly proportional to the change in production.

If production increases, i.e., if the number of units produced increases, the fixed cost per unit produced drops significantly, increasing the possibility of a greater profit margin and achieving economies of scale.

As mentioned above, to achieve economies of scale, production needs to be increased to decrease the per-unit fixed cost. So the risk associated with fixed cost is higher than with variable cost.

Variable cost is not incurred unless production takes place, but fixed cost occurs even if there is no production. E.g., even if no laptops are produced in a laptop factory, the rental charges still need to be paid (the fixed cost), while the labor charges are not paid when there is no production (the variable cost). Fixed cost cannot be controlled and has to be paid; variable cost can be controlled through the production level.

Comparison Table between Fixed Cost vs Variable Cost

Let’s discuss the top comparison between Fixed Cost vs Variable Cost:

Basis of Comparison | Fixed Cost | Variable Cost

Definition | The cost remains fixed over a given period. | The cost varies with output.

Dependency | Independent of the company’s production volume. | Dependent on the company’s production volume.

Behavior | Remains constant for a given time; time-related. | Changes with the output level; volume-related.

Formula | Total fixed cost divided by the number of units produced (per-unit basis). | Variable cost of one item multiplied by the number of items produced.

Economies of scale | The greater the fixed cost, the more sales the company must target to reach the break-even point. | Variable cost per unit remains flat.

Risk associated | Riskier, as the cost is incurred regardless of the production level. | Less risky, as the cost depends on the amount produced.

Occurs when | Incurred whether or not anything is produced; cannot be controlled. | Incurred only when production starts, directly with the number of units produced; can be controlled.

Examples | Salary, tax, depreciation, insurance, etc. | Raw materials, direct labor, packaging, freight, commission.

Conclusion

Variable and fixed costs are completely contradictory to each other but serve a major role in financial analysis. Higher units of production increase profitability as the total fixed cost decreases, while variable cost helps in the contribution margin; therefore, both have unique importance in their ways.

Recommended Articles

This is a guide to Fixed cost vs Variable cost. Here we discussed the Fixed vs Variable cost key differences with infographics and a comparison table. You can also go through our other suggested articles to learn more –

FTC’s New Rules For Bloggers: A Quick Guide

As of December 1, the Federal Trade Commission is going to require bloggers, and prominent tweeters and Facebook types to disclose any paid endorsements to their followers, online friends and readers. These new rules have the potential to change everyone’s online habits. Here’s what you need to know:

Blogs

A common practice among some bloggers is to receive a small fee in exchange for reviewing a particular product or writing a blog post about it. Under the FTC’s new rules, all bloggers engaging in this practice will have to disclose that they are receiving a fee for their blog post. Bloggers will also have to disclose any gifts they receive, such as a free gadget, book, or toothpaste, since the free merchandise counts as compensation.

The strange thing about this new rule is that, in my experience, many bloggers already disclose when they are being paid for reviews. I’ve also seen disclosure on those rare occasions I’ve come across a PayPerPost model, when a blogger is basically working a product endorsement into their writing. Of course, even if a pay-per-post blogger didn’t disclose what they were doing, it is often painfully obvious they’ve been paid to insert something about ‘Super Wowee Shampoo’ into their blog.

Social Networking

Now this is where things really get interesting. Prominent users of social networks and Twitter will also be covered by the FTC’s new regulations. CNET’s Caroline McCarthy uses an interesting scenario to illustrate this: a celebrity receives a bunch of free nights from a hotel, and then becomes a fan of that hotel on Facebook. There would have to be disclosure by the celebrity on Facebook that they have received a gift from that hotel.

That sounds reasonable enough, but what about the rest of us? Say, for example, you work for Microsoft and become a fan of the company on Facebook or tweet about how much you love Windows 7. Now, what if you have not made it clear on your Facebook and Twitter profiles that you work for Microsoft? Some of your Facebook friends or Twitter followers might see your posts and, knowing that you’re an expert in technology but not necessarily that you work for Microsoft, take your Windows 7 endorsement at face value. You still might love Windows 7, but you haven’t made it clear that you’re receiving financial compensation as a Microsoft employee. Under the new FTC guidelines, you may have just crossed the line. True, it’s unlikely the FTC will be interested in you, but if you have a large number of Twitter followers or Facebook friends, it might be a good idea to disclose your corporate affiliations.

Bottom Line: If you are going to tweet about how awesome your employer is, make sure everybody knows you work there.

The FTC

While these new rules may seem confusing and perhaps even excessive, the FTC says it is not that interested in hitting individual bloggers or prominent social network users with heavy fines. Repeat offenders may end up being punished, but the new regulations are really about keeping corporations in line.

Cost Of Post 9/11 Wars: $4.6 Trillion

Bill for each taxpayer is $23,386

Neta C. Crawford, a BU College of Arts & Sciences professor of political science, codirects the Costs of War Project at Brown University’s Watson Institute for International and Public Affairs. She includes the costs of healthcare and disability compensation in her accounting. Photo by Cydney Scott

How do you count the costs of war?

In the Pentagon’s most recent accounting, the total authorized US spending on the wars in Afghanistan, Iraq, and Syria between fiscal years 2001 and 2023 is $1.52 trillion. The bill for each individual taxpayer amounts to $7,740, according to the Pentagon.

That sounds like a lot of taxpayer money.

But Neta C. Crawford, codirector of the Costs of War Project at Brown University’s Watson Institute for International and Public Affairs, says the true cost is higher—about three times higher. Crawford, a Boston University College of Arts & Sciences professor of political science, calculates the total price tag of the post-9/11 wars at $4.6 trillion, with the bill for each individual taxpayer totaling $23,386. The numbers come from Crawford’s latest Costs of War report, released in November 2023.

The Costs of War Project has a team of some 40 scholars, legal experts, human rights practitioners, and physicians that documents the economic, political, and human toll of the post-9/11 wars and related violence in the so-called war on terror.

Speaking at a Congressional briefing by US Senator Jack Reed (D-R.I.), the ranking Democrat on the Armed Services Committee, Crawford said that while the Pentagon’s numbers focus on “direct war spending” by the Defense and State Departments, her calculations also include war-caused and war-related spending by the Pentagon, the State Department, the Veterans Affairs Department (VA), and Homeland Security. That spending includes the VA’s healthcare costs for about two million soldiers and disability compensation for another million soldiers.

Crawford, an expert on the ethics of war and international relations theory, said her accounting also includes the paid interest on the money the United States has borrowed for the wars. “This is not a pay-as-you-go war,” she said at the briefing. “It is rather more like taking out a home equity line of credit.”

BU Research talked with Crawford about how she became involved with the Costs of War Project, why she thinks it’s important to talk about the costs of war, and how one Vietnam veteran’s broken life got her interested in the subject.

BU Research: What do you mean when you say the costs of war don’t end when the wars end?

Crawford: Project contributor Linda Bilmes, Daniel Patrick Moynihan Senior Lecturer in Public Policy at the Harvard Kennedy School, says these wars have been paid for with a credit card. It’s like getting a mortgage. We’ve moved in, we’ve moved out, and we’ve given the debt to our kids.

There are two million men and women who’ve served and once they leave the service they become eligible for medical care and disability through the VA. An unfortunate fact of these wars in Afghanistan and Syria is that many of these veterans are sicker than in past wars. That’s partly because of exposure and just mistakes that were made. Remember those open burn pits they used to get rid of waste at military sites in Iraq and Afghanistan? Those caused respiratory illnesses. Some of it is also the very dusty environment soldiers are operating in. Often they’re carrying very heavy packs, which can cause injuries to their musculature. Deployments in the military are longer now, and people have served multiple deployments.

As a result, health insurance costs have gone up for people after duty and for veterans. Over the long term, there is the care of these veterans, who have higher rates of heart disease and lung problems, not to mention the more than 1,600 people who have lost limbs. Medical and disability costs for those people are going to peak in about 40 years. So I have an estimate that’s perhaps low for the cost of healthcare over the next 40 years.

Can you tell us more about the VA’s portion of the costs of war?

Just last year 80,000 more veterans entered the queue for disability compensation. The VA is serving about a million people for disability. The costs for those people will go up as they get older. I’m not even talking about the spiritual pain and distress for families.

What are some of the other unknowns?

We don’t know how long the wars will last. We don’t know how many veterans will be entering the system who have been serving in these wars. We don’t know the ultimate cost. I try to be conservative with my numbers.

What do you and the other Costs of War scholars hope to accomplish with these annual reports?

Our aim is to help people understand what we’ve been doing in these wars. If we can get a handle on the costs, then we can talk about the risks and benefits. Then we can have an informed public policy discussion. We also need to understand whether we’ve prevented incidents or whether it’s a wash or whether what we’ve been doing is counterproductive in the war zone.

I think it’s the job of academics and people in Congress to have these discussions. We’re trying to create the space to have an informed conversation, one that’s not driven by fear, but by evidence. Many people don’t know how to talk about war and to question what we’re doing for the last 16 years, because they’re afraid of seeming as if they don’t support the troops.

Everyone will say we want to give these men and women what they need to do their jobs. But we also need to question the job: how well is it going? Are we achieving our policy objectives? You have to recognize that these objectives are moving goalposts and the metrics for success are elusive.

The second thing I think we need to question is whether there is no other way to achieve the objective of security. If the idea is that you want to prevent future terrorist attacks, are there other means besides killing as many potential or actual militant terrorists as there are over there? Can we not think about how to reduce the cause for people to become terrorists?

Since imperial Rome, you said at the briefing, great powers have “almost always believed that their wars would be short, effective, and inexpensive compared to the gains.” Does that apply to these wars?

The American people have been told that these wars would be short, low in cost in terms of lives, and low in cost in terms of funding, that the war in Iraq might even pay for itself. We were told that we’d leave the place better than it was when we came. That’s always the case when you enter wars. You have these optimistic assessments and estimates and promises about the duration, the likelihood of success, and the costs of conflict. Generally, they’re proven to be overly optimistic. We’re trying to point out that this is a pattern.

How did you get interested in the costs of war in the first place?

I’ve been working on research related to war for decades. I grew up in Milwaukee. My father was a public school teacher. He had a side job, renovating houses to sell or rent. He believed in a strong work ethic, and I was helping him by the time I was 9 or 10. I worked alongside a man named Calvin, who was a Vietnam veteran. My father said Calvin was not the same person when he came back from the war. I kept track of Calvin through my father. He was not mentally stable. He was a kind, gentle person, but he was broken by his experiences in Vietnam. The contact with Calvin was formative for me.

The moral injury, the psychic harm, the fear of terrorism people have been living with for a long time—those are all things we can’t quantify.

Explore Related Topics:

Cyclomatic Complexity In Software Testing (Example)

What is McCabe’s Cyclomatic Complexity?

Cyclomatic Complexity in Software Testing is a testing metric used for measuring the complexity of a software program. It is a quantitative measure of independent paths in the source code of a software program. Cyclomatic complexity can be calculated by using control flow graphs or with respect to functions, modules, methods or classes within a software program.

Independent path is defined as a path that has at least one edge which has not been traversed before in any other paths.

This metric was developed by Thomas J. McCabe in 1976 and it is based on a control flow representation of the program. Control flow depicts a program as a graph which consists of Nodes and Edges.

In the graph, Nodes represent processing tasks while edges represent control flow between the nodes.

Flow graph notation for a program:

Flow Graph notation for a program defines several nodes connected through the edges. Below are Flow diagrams for statements like if-else, While, until and normal sequence of flow.

How to Calculate Cyclomatic Complexity

Mathematical representation:

Mathematically, cyclomatic complexity is the number of linearly independent paths through the graph. The code complexity of the program can be defined using the formula:

V(G) = E - N + 2

Where,

E – Number of edges

N – Number of Nodes

V (G) = P + 1

Where P = Number of predicate nodes (node that contains condition)

Example –

i = 0;
n = 4;
while (i < n-1) do
    j = i + 1;
    while (j < n) do
        if A[i] < A[j] then
            swap(A[i], A[j]);
        end do;
    j = j + 1;
end do;

Flow graph for this program will be

Computing mathematically,

V(G) = 9 – 7 + 2 = 4

V(G) = 3 + 1 = 4 (Condition nodes are 1,2 and 3 nodes)
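Both formulas can be checked with a quick calculation (the counts come from the example flow graph above):

# E = edges, N = nodes, P = predicate (condition) nodes in the flow graph above.
E, N, P = 9, 7, 3
print(E - N + 2)   # V(G) = 4
print(P + 1)       # V(G) = 4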

Basis Set – A set of possible execution paths of a program:

1, 7

1, 2, 6, 1, 7

1, 2, 3, 4, 5, 2, 6, 1, 7

1, 2, 3, 5, 2, 6, 1, 7

Properties of Cyclomatic complexity:

Following are the properties of Cyclomatic complexity:

V (G) is the maximum number of independent paths in the graph

G will have one path if V (G) = 1

Minimize complexity to 10

How is this metric useful for software testing?

Basis path testing is one of the White Box techniques, and it guarantees that at least one statement is executed during testing. It checks each linearly independent path through the program, which means the number of test cases will be equivalent to the cyclomatic complexity of the program.

This metric is useful because of properties of Cyclomatic complexity (M) –

M can be number of test cases to achieve branch coverage (Upper Bound)

M can be number of paths through the graphs. (Lower Bound)

Consider this example –

If (Condition 1)
    Statement 1
Else
    Statement 2
If (Condition 2)
    Statement 3
Else
    Statement 4

Cyclomatic Complexity for this program will be 8-7+2=3.

As the complexity has been calculated as 3, three test cases are necessary for complete path coverage of the above example.
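The same example rewritten as runnable code, with one test call per basis path (the condition values are illustrative):

def example(condition1, condition2):
    if condition1:
        statement = 1      # path through Statement 1
    else:
        statement = 2      # path through Statement 2
    if condition2:
        result = (statement, 3)
    else:
        result = (statement, 4)
    return result

# V(G) = 3, so three test cases suffice to cover the basis paths:
print(example(True, True))
print(example(False, True))
print(example(True, False))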

Steps to be followed:

The following steps should be followed for computing Cyclomatic complexity and test cases design.

Step 1 – Construction of graph with nodes and edges from the code

Step 2 – Identification of independent paths

Step 3 – Cyclomatic Complexity Calculation

Step 4 – Design of Test Cases

Once the basis set is formed, TEST CASES should be written to execute all the paths.

More on V (G):

Cyclomatic complexity can be calculated manually if the program is small. Automated tools need to be used if the program is very complex, as this involves more flow graphs. Based on the complexity number, the team can decide on the actions that need to be taken.

The following table gives an overview of the complexity number and the corresponding meaning of V(G):

Complexity Number | Meaning

1-10 | High testability; cost and effort are low

10-20 | Medium testability; cost and effort are medium

20-40 | Low testability; cost and effort are high

>40 | Not at all testable; cost and effort are very high

Tools for Cyclomatic Complexity calculation:

Many tools are available for determining the complexity of the application. Some complexity calculation tools are used for specific technologies. Complexity can be found by the number of decision points in a program. The decision points are if, for, for-each, while, do, catch, case statements in a source code.

Examples of tools are

OCLint – Static code analyzer for C and Related Languages

Reflector Add In – Code metrics for .NET assemblies

GMetrics – Find metrics in Java related applications
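For Python code specifically (not one of the tools listed above), the radon package computes the same metric; a minimal sketch, assuming radon is installed:

from radon.complexity import cc_visit

source = '''
def f(a, b):
    if a:
        return 1
    if b:
        return 2
    return 3
'''
# Two decision points give cyclomatic complexity 3 for f.
for block in cc_visit(source):
    print(block.name, block.complexity)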

Uses of Cyclomatic Complexity:

Cyclomatic Complexity can prove to be very helpful in

Helps developers and testers to determine independent path executions

Developers can assure that all the paths have been tested at least once

Helps us to focus more on the uncovered paths

Improve code coverage in Software Engineering

Evaluate the risk associated with the application or program

Using these metrics early in the cycle reduces more risk of the program

Conclusion:

Cyclomatic Complexity is a software metric useful for structured or White Box Testing. It is mainly used to evaluate the complexity of a program: the more decision points, the more complex the program. If a program has a high complexity number, the probability of error is high, along with increased time for maintenance and troubleshooting.
