12 Powerful Tips To Ace Data Science And Machine Learning Hackathons

Overview
Data science hackathons can be a tough nut to crack, especially for beginners
Here are 12 powerful tips to crack your next data science hackathon!

Introduction
Like any discipline, data science also has a lot of “folk wisdom”. This folk wisdom is hard to teach formally or in a structured manner but it’s still crucial for success, both in the industry as well as in data science hackathons.
Newcomers in data science often form the impression that knowing all machine learning algorithms would be a panacea for all machine learning problems. They tend to believe that once they know the most common algorithms (Gradient Boosting, Extreme Gradient Boosting, deep learning architectures), they will be able to perform well in their roles/organizations or top the leaderboards in competitions.
Sadly, that does not happen!
If you’re reading this, there’s a high chance you’ve participated in a data science hackathon (or several of them). I’ve personally struggled to improve my model’s performance in my initial hackathon days and it was quite a frustrating experience. I know a lot of newcomers who’ve faced the same obstacle.
So I decided to put together 12 powerful hacks that have helped me climb to the top echelons of hackathon leaderboards. Some of these hacks are straightforward and a few you’ll need to practice to master.
If you are a beginner in the world of data science hackathons or someone who wants to master the art of competing in hackathons, you should definitely check out the third edition of HackLive – a guided community hackathon led by top hackers at Analytics Vidhya.

The 12 Tips to Ace Data Science Hackathons

1. Understand the Problem Statement
2. Build your Hypothesis Set
3. Team Up!
4. Create a Generic Codebase
5. Feature Engineering is the Key
6. Ensemble (Almost) Always Wins
7. Discuss! Collaborate!
8. Trust Local Validation
9. Keep Evolving
10. Use Hindsight to build your Foresight
11. Refactor your code
12. Improve iteratively

Data Science Hackathon Tip #1: Understand the Problem Statement
Seems too simple to be true? And yet, understanding the problem statement is the very first step to acing any data science hackathon:
Without understanding the problem statement, the data, and the evaluation metric, most of your work is fruitless. Spend time reading as much as possible about them and gain some functional domain knowledge if possible
Re-read all the available information. It will help you figure out an approach/direction before writing a single line of code. Only once you are very clear about the objective should you proceed to the data exploration stage
Let me show you an example of a problem statement from a data science hackathon we conducted. Here’s the Problem Statement of the BigMart Sales Prediction problem:
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.
Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.
The idea is to find the properties of a product and store which impact the sales of a product. You can think of factors, based on your understanding, that can make an impact on the sales and come up with some hypotheses without looking at the data.

Data Science Hackathon Tip #2: Build your Hypothesis Set
Next, you should build a comprehensive list of hypotheses. Please note that I am asking you to build this set of hypotheses before looking at the data. This ensures you are not biased by what you see in the data
It also gives you time to plan your workflow better. If you are able to think of hundreds of features, you can prioritize which ones you would create first
Read more about hypothesis generation here
I encourage you to go through the hypothesis generation stage for the BigMart Sales problem in this article: Approach and Solution to break in Top 20 of Big Mart Sales prediction. We have divided the hypotheses into store-level and product-level ones. Let me illustrate a few examples here.
City type: Stores located in urban or Tier 1 cities should have higher sales because of the higher income levels of people there
Population Density: Stores located in densely populated areas should have higher sales because of more demand
Store Capacity: Stores that are very big in size should have higher sales as they act like one-stop-shops and people would prefer getting everything from one place
Ambiance: Stores that are well-maintained and managed by polite and humble people are expected to have higher footfall and thus higher sales
Brand: Branded products should have higher sales because of higher customer trust
Packaging: Products with good packaging can attract customers and sell more
Utility: Daily-use products should have a higher tendency to sell compared to niche products
Promotional Offers: Products accompanied by attractive offers and discounts will sell more

Data Science Hackathon Tip #3: Team Up!
Build a team and brainstorm together. Try and find a person with a complementary skillset in your team. If you have been a coder all your life, go and team up with a person who has been on the business side of things
This would help you get a more diverse set of hypotheses and would increase your chances of winning the hackathon. One caveat: it helps if both of you prefer the same tool/language stack
It will save you a lot of time and you will be able to parallelly experiment with several ideas and climb to the top of the leaderboard
Getting a good score early in the competition also helps in teaming up with higher-ranked people
Many hackathons have been won by teams rather than solo participants.

Data Science Hackathon Tip #4: Create a Generic Codebase
Save valuable time when you participate in your next hackathon by creating a reusable generic code base & functions for your favorite models which can be used in all your hackathons, like:
Create a variety of time-based features if the dataset has a time feature
You can write a function that will return different types of encoding schemes
You can write functions that will return your results on a variety of different models so that you can choose your baseline model wisely and choose your strategy accordingly
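For instance, here is a hypothetical helper (not from the original article) that derives common time-based features with pandas; the function and column names are illustrative:

```python
import pandas as pd

def add_time_features(df, col):
    # Derive common calendar features from a datetime column (hypothetical helper)
    out = df.copy()
    ts = pd.to_datetime(out[col])
    out[col + "_year"] = ts.dt.year
    out[col + "_month"] = ts.dt.month
    out[col + "_dayofweek"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out[col + "_is_weekend"] = ts.dt.dayofweek >= 5
    return out

# Toy usage on a tiny sales table
sales = pd.DataFrame({"date": ["2013-01-05", "2013-01-07"], "units": [10, 12]})
sales = add_time_features(sales, "date")
```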
I generally keep a helper function handy to encode the train, test, and validation sets: I just pass a dictionary specifying which columns get which encoding scheme. I would not recommend using exactly the same code, but I do suggest keeping such functions ready so that you can spend more time on brainstorming and experimenting.
For example, I simply provide a dictionary where the keys are the types of encoding I want and the values are the names of the columns to encode.
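The original gist is not reproduced here, so the following is a minimal sketch of such a helper, assuming pandas DataFrames. The `encode_columns` name, the two supported schemes (label and frequency encoding), and the dictionary format are illustrative, not the author's exact code:

```python
import pandas as pd

def encode_columns(train, test, scheme_to_cols):
    # Hypothetical helper: apply an encoding scheme to the listed columns.
    # Categories and frequencies are learned on train only, to avoid leakage.
    train, test = train.copy(), test.copy()
    for scheme, cols in scheme_to_cols.items():
        for col in cols:
            if scheme == "label":
                cats = {v: i for i, v in enumerate(sorted(train[col].dropna().unique()))}
                train[col] = train[col].map(cats)
                test[col] = test[col].map(cats)  # unseen categories become NaN
            elif scheme == "frequency":
                freq = train[col].value_counts(normalize=True)
                train[col] = train[col].map(freq)
                test[col] = test[col].map(freq)
    return train, test

# Usage: keys are encoding schemes, values are the columns to encode
train = pd.DataFrame({"Outlet_Size": ["Small", "Medium", "Small"],
                      "Item_Type": ["Dairy", "Soft Drinks", "Dairy"]})
test = pd.DataFrame({"Outlet_Size": ["Medium"], "Item_Type": ["Dairy"]})
train_enc, test_enc = encode_columns(train, test,
                                     {"label": ["Outlet_Size"], "frequency": ["Item_Type"]})
```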
You can also use libraries like pandas-profiling to get a quick overview of the dataset as soon as you read it in.
Data Science Hackathon Tip #5: Feature Engineering is Key
“More data beats clever algorithms, but better data beats more data.”
– Peter Norvig
Feature engineering! This is one of my favorite parts of a data science hackathon. I get to tap into my creative juices when it comes to feature engineering – and which data scientist doesn’t like that?

Data Science Hackathon Tip #6: Ensemble (Almost) Always Wins
95% of winners have used ensemble models in their final submission on DataHack hackathons
Ensemble modeling is a powerful way to improve the performance of your model. It is the art of combining the diverse results of individual models to improve the stability and predictive power of the final model
You will not find any data science hackathon that has top finishing solutions without ensemble models
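As a minimal illustration of the idea (toy numbers, not from the article), two common ways of blending the predictions of two diverse models:

```python
import numpy as np

# Hypothetical out-of-fold predictions from two diverse models
pred_gbm = np.array([0.2, 0.7, 0.4, 0.9])
pred_nn = np.array([0.3, 0.6, 0.5, 0.8])

# Weighted average: the weights are normally tuned on local validation
blend = 0.6 * pred_gbm + 0.4 * pred_nn

def rank_average(*preds):
    # Rank averaging is more robust when the models' score scales differ
    ranks = [p.argsort().argsort() / (len(p) - 1) for p in preds]
    return np.mean(ranks, axis=0)

rank_blend = rank_average(pred_gbm, pred_nn)
```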
You can learn more about the different ensemble techniques from the following articles:
Basics of Ensemble Learning
A Comprehensive Guide to Ensemble Learning

Data Science Hackathon Tip #7: Discuss! Collaborate!
Stay up to date with forum discussions to make sure that you are not missing out on any obvious detail regarding the problem
Do not hesitate to ask people questions in the forums or via messages.

Data Science Hackathon Tip #8: Trust Local Validation
Do not jump into building models by dumping data into the algorithms. While it is useful to get a sense of basic benchmarks, you need to take a step back and build a robust validation framework
Without validation, you are just shooting in the dark. You will be at the mercy of overfitting, leakage and other possible evaluation issues
By replicating the evaluation mechanism, you can make faster and better improvements by measuring your validation results along with making sure your model is robust enough to perform well on various subsets of the train/test data
Have a robust local validation set and avoid relying too much on the public leaderboard as this might lead to overfitting and can drop your private rank by a lot
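A minimal sketch of such a framework in plain Python (the toy "model" that predicts the training mean and the MAE metric are illustrative assumptions): replicate the competition metric and score it across K folds so that every row is validated exactly once:

```python
import random

def kfold_indices(n, k=5, seed=42):
    # Every row lands in exactly one validation fold
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

def mae(y_true, y_pred):
    # Replicate the competition's evaluation metric locally
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy "model": always predict the training-fold mean
y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
fold_scores = []
for train, valid in kfold_indices(len(y), k=4):
    pred = sum(y[i] for i in train) / len(train)
    fold_scores.append(mae([y[i] for i in valid], [pred] * len(valid)))
cv_score = sum(fold_scores) / len(fold_scores)
```

Trusting the average of the fold scores (and their spread) rather than a single public-leaderboard number is what keeps you safe from the kind of private-leaderboard collapse described below.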
In the Restaurant Revenue Prediction contest, a team that was ranked first on the public leaderboard slipped down to rank 1962 on the private leaderboard
“The first we used to determine which rows are part of the public leaderboard score, while the second is used to determine the correct predictions. Along the way, we encountered much interesting mathematics, computer science, and statistics challenges.”
Source: Kaggle: BAYZ Team

Data Science Hackathon Tip #9 – Keep Evolving
It is not the strongest or the most intelligent who will survive but those who can best manage change. –Charles Darwin
If you are planning to enter the elite class of data science hackers then one thing is clear – you can’t win with traditional techniques and knowledge.
Employing logistic regression or KNN in Hackathons can be a great starting point but as you move ahead of the curve, these won’t land you in the top 100.
Let’s take a simple example – in the early days of NLP hackathons, participants used TF-IDF, then Word2vec came around. Fast-forward to nowadays, there are state-of-the-art Transformers. The same goes for computer vision and so on.
Keep yourself up-to-date with the latest research papers and projects on Github. Although this will require a bit of extra effort, it will be worth it.

Data Science Hackathon Tip #10 – Use Hindsight to build your Foresight
Has it ever happened that after the competition is over, you sit back, relax, maybe think about the things you could have done, and then move on to the next competition? Well, this is the best time to learn!
Do not stop learning after the competition is over. Read winning solutions of the competition, analyze where you went wrong. After all, learning from mistakes can be very impactful!
Try to improve your solutions and make notes about them. Share them with your friends and colleagues and take their feedback.
This will give you a solid head-start for your next competition, and this time you’ll be much better equipped to tackle the problem statement. DataHack provides a really cool late-submission feature: you can change your code even after the hackathon is over, submit the solution, and check its score!

Data Science Hackathon Tip #11 – Refactor your code
Just imagine living in a room where everything is messy: clothes lying all around, shoes on the shelves, and food on the floor. It is nasty, isn’t it? The same goes for your code.

When we get started with a competition, we are excited and we probably write rough code, copy-pasting from our earlier notebooks and from Stack Overflow. Continuing this for the whole notebook will make it messy. Understanding your own code will then consume the majority of your time and make it harder to make changes.
The solution is to refactor your code from time to time and keep maintaining it at regular intervals.
This will also help you team up with other participants and communicate much better.

Data Science Hackathon Tip #12 – Improve iteratively
Many of us follow the linear approach of model building, going through the same process – Data Cleaning, EDA, feature engineering, model building, evaluation. The trick is to understand that it is a circular and iterative process.
For example, say we are building a sales prediction model and we get a poor MAE, so we decide to analyze the individual samples. It turns out that our model gives spurious results for female buyers. We can then take a step back, focus on the Gender feature, do EDA on it, and check how to improve from there. Here we are going back and forth and improving step by step.
We can also look at some of the strong and important features and combine some of them and check their results. We may or may not see an improvement but we can only figure it out by moving iteratively.
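The error analysis above can be sketched as a simple group-wise metric breakdown (the data and field names here are made up for illustration):

```python
def mae(pairs):
    # Mean absolute error over (actual, predicted) pairs
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

# Hypothetical validation predictions: (actual, predicted, gender)
rows = [
    (100, 110, "M"), (150, 145, "M"), (120, 118, "M"),
    (200, 150, "F"), (180, 240, "F"), (160, 120, "F"),
]

by_group = {}
for actual, pred, gender in rows:
    by_group.setdefault(gender, []).append((actual, pred))

group_mae = {g: mae(v) for g, v in by_group.items()}
# A much higher error for one group shows where to focus the next iteration
```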
It is very important to be a dynamic learner. Following an iterative process can take you to double or even single-digit ranks.

Final Thoughts
These 12 hacks have held me in good stead regardless of the hackathon I’m participating in. Sure, a few tweaks here and there are necessary but having a solid framework and structure in place will take you a long way towards achieving success in data science hackathons.
Do you want more such hacks, tips, and tricks? HackLive is the way to go from zero to hero and master the art of participating in a data science competition. Don’t forget to check out the third edition of HackLive.
1. Introduction

In many real-world applications where machine learning models have been deployed in production, the data evolve over time, and models built to analyze such data quickly become obsolete. It becomes essential for data scientists to monitor model performance: is the deployed machine learning model sustainable and performing consistently? Usually the model does not simply stop performing well; rather, it can no longer capture the right variability of the data, be it in the dependent or independent variables. The reason has less to do with the machine learning model itself than with the data distributions. A shift occurs at the data level: the distribution of the data used to train the model, called the source distribution, differs from the distribution of the newly available data, the target distribution. As a result, the relationships between input and output data can change over time, meaning there are changes to the unknown underlying mapping function. This gap is where the concept of data drift comes in.

2. What is Data Drift?
Over time, a machine learning model starts to lose its predictive power, a phenomenon known as model drift. What is generally called data drift is a change in the distribution of the data used in a predictive task. There are different types of data drift:

Feature Drift (or covariate drift):
It happens when some previously infrequent or even unseen feature vectors become more frequent, and vice versa; however, the relationship between the feature and the target stays the same. Covariate drift is the case where Pt(X) ≠ Pu(X) while Pt(Y|X) = Pu(Y|X), for two time periods t and u. Example: temperature readings changing from degrees Fahrenheit to degrees Celsius.
Figure 1. Changes in the distribution of the feature “period” over time lead to covariate shift

Concept Drift:
It is the phenomenon where the statistical properties of the class variable — in other words, the target we want to predict — change over time; this drift is focused on the change in the target variable. Concept drift is the case where Pt(Y|X) ≠ Pu(Y|X) while Pt(X) = Pu(X). Example: inventory changes over time.
Figure 2. Concept drift can be detected by the divergence of model & data decision boundaries and the concomitant loss of predictive ability

Dual Drift:
This applies to the situation where feature drift and concept drift occur together. Dual drift is the case where both Pt(X) ≠ Pu(X) and Pt(Y|X) ≠ Pu(Y|X).

3. Methods for detection of Data Drift
The following techniques have been explored in Python using the scikit-multiflow library, with simulations showing how each works on streaming data.

Adaptive Windowing (ADWIN):
ADWIN (ADaptive WINdowing) is an adaptive sliding-window algorithm for detecting change and keeping updated statistics about a data stream. ADWIN allows algorithms not adapted for drifting data to be resistant to this phenomenon. The general idea is to keep statistics from a window of variable size while detecting concept drift. The scikit-multiflow library provides this functionality for simulating and detecting concept drift.
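As an illustration of the core idea (a simplified pure-Python sketch, not the scikit-multiflow implementation): keep a window of recent values and, whenever the means of an older and a newer sub-window differ by more than a Hoeffding-style bound, report a change and drop the stale data:

```python
import math
import random

def adwin_like(stream, delta=1e-5, min_sub=10):
    # Simplified ADWIN-style detector: at each step, test every split of the
    # current window into an "old" and a "recent" part; if their means differ
    # by more than a Hoeffding-style bound, report a change and keep only the
    # recent part. (Values are assumed to lie in [0, 1].)
    window, changes = [], []
    for t, x in enumerate(stream):
        window.append(x)
        n = len(window)
        prefix = [0.0]
        for v in window:
            prefix.append(prefix[-1] + v)
        for cut in range(min_sub, n - min_sub):
            n0, n1 = cut, n - cut
            mean0 = prefix[cut] / n0
            mean1 = (prefix[n] - prefix[cut]) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)          # harmonic mean of sizes
            eps = math.sqrt(math.log(4.0 / delta) / (2.0 * m))
            if abs(mean0 - mean1) > eps:
                changes.append(t)
                window = window[cut:]                 # drop the stale data
                break
    return changes

# Simulated stream: a stable regime followed by an abrupt drift at index 300
rng = random.Random(1)
stream = [1 if rng.random() < 0.2 else 0 for _ in range(300)]
stream += [1 if rng.random() < 0.8 else 0 for _ in range(300)]
changes = adwin_like(stream)
```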
Drift Detection Method (DDM):
DDM (Drift Detection Method) is a concept change detection method based on the PAC learning model premise that the learner’s error rate will decrease as the number of analysed samples increases, as long as the data distribution is stationary. If the algorithm detects an increase in the error rate that surpasses a calculated threshold, either change is detected or the algorithm warns the user that change may occur in the near future (the warning zone).

Early Drift Detection Method (EDDM):
EDDM (Early Drift Detection Method) aims to improve the detection rate of gradual concept drift in DDM, while keeping good performance against abrupt concept drift. It works by keeping track of the average distance between two errors instead of only the error rate, which requires tracking the running average distance and running standard deviation, as well as the maximum distance and maximum standard deviation.

Drift Detection Method based on Hoeffding’s bounds with moving average-test (HDDM_A):
HDDM_A is a drift detection method based on Hoeffding’s inequality, using the average as estimator. It receives as input a stream of real values and returns the estimated status of the stream: STABLE, WARNING, or DRIFT.

Drift Detection Method based on Hoeffding’s bounds with moving weighted average-test (HDDM_W):
HDDM_W is an online drift detection method based on McDiarmid’s bounds, using the EWMA statistic as estimator. It receives as input a stream of real predictions and returns the estimated status of the stream: STABLE, WARNING, or DRIFT.

Kolmogorov-Smirnov Windowing method for concept drift detection (KSWIN):
KSWIN (Kolmogorov-Smirnov Windowing) is a concept change detection method based on the Kolmogorov-Smirnov (KS) statistical test, which makes no assumption about the underlying data distribution. KSWIN can monitor data or performance distributions. Note that the detector accepts one-dimensional input as an array.

Page-Hinkley method for concept drift detection (PageHinkley):
This change detection method works by computing the observed values and their mean up to the current moment. Page-Hinkley does not output warning-zone warnings, only change detections. The method works by means of the Page-Hinkley test: in general lines, it detects a concept drift if the observed mean at some instant is greater than a threshold value lambda.

4. Metrics for measuring Data Drift
Different methods and metrics are used depending on the type of drift that needs to be monitored.

Measuring Concept Drift:
A key proposed quantitative measure of concept drift is the drift magnitude, which measures the distance between two concepts Pt(X,Y) and Pu(X,Y), for example using the Hellinger distance: H(Pt, Pu) = sqrt( ½ · Σz ( sqrt(Pt(z)) − sqrt(Pu(z)) )² ), where z ranges over the values of the vector Z of random variables.

Measuring Covariate Drift:
For the conditional drifts it is necessary to deal with multiple distributions, one for each value of the conditioning attributes. We address this by weighted averaging, as described below.
For a given subset of the covariate attributes there will be a conditional probability distribution over the possible values of the covariate attributes for each specific class, y. The conditional marginal covariate drift is the weighted sum of the distances between each of these probability distributions from time period t to u, where the weights are the average probability of the class over the two time periods.
For each subset of the covariate attributes there will be a probability distribution over the class labels for each combination of values x̄ of those attributes at each time period. The conditional class drift can therefore be calculated as the weighted sum of the distances between these probability distributions, where the weights are the average probability over the two periods of the specific value for the covariate attribute subset.

5. Model pipeline for overcoming data drift:
The following figure reflects the way data drift can be monitored and dealt with in a machine learning model at scale in production. As part of the pipeline, the most robust way to overcome drift at the production level is a system that periodically retrains the model after some time t, or once it detects a drift using one of the methods mentioned above.

Figure 3. ML model solution flow at production with data drift being monitored

6. Conclusions and Future Scope
This article brings to the forefront the concept of data drift, an often-missed dimension in setting up an ML workflow in production systems. The concept is demonstrated with simulated data, and we discuss in detail some key metrics for measuring it. A high-level ML pipeline for incorporating drift-related error correction in any ML workflow is also described. With this work we aim to highlight that drift modelling and correction should be part and parcel of any automated ML pipeline in production systems.

7. References
Webb, G. I., Lee, L. K., Goethals, B., & Petitjean, F. (2018). Analyzing concept drift and shift from sample data. Data Mining and Knowledge Discovery, 32(5), 1179-1199.
Bifet, Albert, and Ricard Gavalda. “Learning from time-changing data with adaptive windowing.” In Proceedings of the 2007 SIAM international conference on data mining, pp. 443-448. Society for Industrial and Applied Mathematics, 2007.
João Gama, Pedro Medas, Gladys Castillo, Pedro Pereira Rodrigues: Learning with Drift Detection. SBIA 2004: 286-295
Early Drift Detection Method. Manuel Baena-Garcia, Jose Del Campo-Avila, Raúl Fidalgo, Albert Bifet, Ricard Gavalda, Rafael Morales-Bueno. In Fourth International Workshop on Knowledge Discovery from Data Streams, 2006.
Frías-Blanco I, del Campo-Ávila J, Ramos-Jimenez G, et al. Online and non-parametric drift detection methods based on Hoeffding’s bounds. IEEE Transactions on Knowledge and Data Engineering, 2014, 27(3): 810-823.
Christoph Raab, Moritz Heusinger, Frank-Michael Schleif. Reactive Soft Prototype Computing for Concept Drift Streams. Neurocomputing, 2020.
E. S. Page. 1954. Continuous Inspection Schemes. Biometrika 41, 1/2 (1954), 100–115.

Authors:
Dr. Anish Roy Chowdhury is currently an industry data science leader at a leading digital services organization. In previous roles he was with AB InBev as a data science research lead, working in areas such as assortment optimization and reinforcement learning, and he led several machine learning projects in credit risk, logistics, and sales forecasting. In his stint with HP Supply Chain Analytics he developed data quality solutions for logistics projects and built statistical models to predict spare-part demand for large-format printers. Prior to HP, he had 6 years of work experience in the IT sector as a database programmer, during which he worked on credit card fraud detection among other analytics projects. He has a PhD in Mechanical Engineering (IISc Bangalore) and an MS in Mechanical Engineering from Louisiana State University, USA. He did his undergraduate studies at NIT Durgapur, with published research in GA-fuzzy logic applications to medical diagnostics. Dr. Anish is also a highly acclaimed public speaker, with numerous best presentation awards from national and international conferences, and has conducted several workshops in academic institutes on R programming and MATLAB. He has several academic publications to his credit and is a chapter co-author for a Springer publication and an Oxford University Press best-selling publication on MATLAB.

Paulami Das is a seasoned analytics leader with 14 years’ experience across industries. She is passionate about helping businesses tackle complex problems through machine learning. Over her career, Paulami has worked on several large and complex machine-learning-centric projects around the globe.
This article was published as a part of the Data Science Blogathon.

Introduction
In machine learning, the amount and quality of data are critical to model training and performance. The amount of data strongly affects machine learning and deep learning algorithms: most algorithms’ behavior changes as the amount of data increases or decreases. With limited data, it is necessary to handle machine learning algorithms effectively to get accurate models and better results. Deep learning algorithms are especially data-hungry, requiring a large amount of data for good accuracy.
In this article, we will discuss the relationship between the amount and quality of data and machine learning and deep learning algorithms, the problems caused by limited data, and ways of dealing with it. Knowledge of these key concepts will help one understand the algorithm-vs-data scenario and deal with limited data efficiently.

The “Amount of Data vs. Performance” Graph
In machine learning, a question may come to mind: how much data is required to train a good machine learning or deep learning model? There is no fixed threshold or single answer, as every dataset is different, with different features and patterns. Still, there is some threshold after which the performance of machine learning or deep learning algorithms tends to become constant.
Most of the time, machine learning and deep learning models tend to perform better as the amount of data fed to them increases, but after some point the behavior of the models becomes constant and they stop learning from data.
The above figures show the performance of some famous machine learning and deep learning architectures against the amount of data fed to the algorithms. We can see that traditional machine learning algorithms learn a lot from the data in the preliminary period, as the amount of data increases, but after some threshold the performance becomes constant. Providing more data beyond this point does not make the algorithm learn anything new, and the performance will not increase or decrease.
In the case of deep learning, there are three types of deep learning architectures in the diagram. A shallow deep learning architecture is a minor architecture in terms of depth, meaning it has few hidden layers and neurons. In deep neural networks, the number of hidden layers and neurons is very high and the network is designed very deeply.
From the diagram, we can see three deep learning architectures, all performing differently as the amount of data fed to them increases. Shallow neural networks tend to behave like traditional machine learning algorithms, with performance becoming constant after some threshold amount of data, while deep neural networks keep learning as new data is fed.
From the diagram, we can conclude that:

“DEEP NEURAL NETWORKS ARE DATA-HUNGRY”

What Problems Arise with Limited Data?
Several problems occur with limited data, and a model trained on it may perform poorly. The common issues that arise with limited data are listed below:

1. Classification: If a small amount of data is fed, the model will classify observations wrongly, meaning it will not give the accurate output class for given observations.

2. Regression: With limited data, the model may predict values far from the actual output.

3. Clustering: The model can assign points to the wrong clusters if trained with limited data.

4. Time Series: In time series analysis we forecast future values, but a low-accuracy time series model trained on limited data can give inferior forecasts.

5. Object Detection: If an object detection model is trained on limited data, it might not detect the object correctly, or it can classify the object incorrectly.

How to Deal With Problems of Limited Data?
There is no single fixed method for dealing with limited data. Every machine learning problem is different, and so is the way of solving it. But some standard techniques are helpful in many cases.
1. Data Augmentation
Data augmentation is a technique in which existing data is used to generate new data. The newly generated data resembles the old data, but some values and parameters are different.
This approach can increase the amount of data, and there is a high likelihood of improving the model’s performance.
Data augmentation is preferred in most deep-learning problems, where there is limited data with images.
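A minimal sketch of image-style augmentation with NumPy (a toy 4x4 "image"; real pipelines would use richer transforms):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a 4x4 grayscale array
image = np.arange(16, dtype=float).reshape(4, 4)

flipped = np.fliplr(image)                         # horizontal flip
noisy = image + rng.normal(0, 0.1, image.shape)    # small Gaussian noise
shifted = np.roll(image, shift=1, axis=1)          # 1-pixel translation

# Three new training samples derived from one original
augmented = [flipped, noisy, shifted]
```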
2. Don’t Drop, Impute:
Some datasets have a high fraction of invalid or empty values. To keep the process simple, that data is often dropped, but this decreases the amount of data further and several problems can occur. Instead, impute the missing values.
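A minimal sketch of mean imputation in plain Python (the helper name is illustrative):

```python
def impute_mean(values):
    # Replace missing entries (None) with the mean of the observed values
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 40, 31, None]
print(impute_mean(ages))  # missing ages become the mean of 25, 40 and 31, i.e. 32.0
```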
3. Custom Approach:
If there is a case of limited data, one can search the internet for similar data. Once such data is obtained, it can be used to generate more data or be merged with the existing data.

Conclusion
In this article, we discussed limited data, the performance of several machine learning and deep learning algorithms as the amount of data increases or decreases, the problems that can occur due to limited data, and common ways to deal with it. This article will help one understand the effects of limited data on performance and how to handle them.
Some Key Takeaways from this article are:
1. Classical machine learning algorithms and shallow neural networks stop improving with more data beyond some threshold level.
2. Deep neural networks are data-hungry algorithms that never stop learning from data.
3. Limited data can cause problems in every field of machine learning applications, e.g., classification, regression, time series, image processing, etc.
4. We can apply Data augmentation, imputation, and some other custom approaches based on domain knowledge to handle the limited data.
Want to Contact the Author?
Follow Parth Shukla @AnalyticsVidhya, LinkedIn, Twitter, and Medium for more content.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Introduction to Data Preprocessing in Machine Learning
The following article provides an outline of data preprocessing in machine learning. Data preprocessing, also known as data wrangling, is the technique of transforming raw data (incomplete, inconsistent, error-ridden data that lacks certain behaviors) into an understandable format through a careful sequence of steps: importing libraries and data, checking for missing values and categorical data, feature scaling, and splitting. Proper preprocessing allows correct interpretations to be made and negative results to be avoided, since the quality of a machine learning model depends heavily on the quality of the data it is trained on.
Data collected for training the model comes from various sources. It is generally in raw format: it can contain noise such as missing values, irrelevant information, or numbers stored as strings, or it can be unstructured. Data preprocessing increases the efficiency and accuracy of machine learning models by removing this noise and giving meaning to the dataset.
Six Different Steps Involved in Machine Learning
Following are six different steps involved in machine learning to perform data pre-processing:
Step 1: Import libraries
Step 2: Import data
Step 3: Checking for missing values
Step 4: Checking for categorical data
Step 5: Feature scaling
Step 6: Splitting data into training, validation and evaluation sets
1. Import Libraries
The very first step is to import a few of the important libraries required in data pre-processing. A library is a collection of modules that can be called and used. In python, we have a lot of libraries that are helpful in data pre-processing.
A few of the following important libraries in python are:
Numpy: The most widely used library for the complicated mathematical computations of machine learning. It is useful for performing operations on multidimensional arrays.
Pandas: An open-source library that provides high-performance, easy-to-use data structures and data analysis tools in Python. It is designed to make working with relational and labeled data easy and intuitive.
Matplotlib: A visualization library for 2D plots of arrays. It is built on NumPy arrays and designed to work with the broader SciPy stack. Visualization helps in scenarios where large amounts of data are available. Plots available in Matplotlib include line, bar, scatter, histogram, etc.
Seaborn: Another Python visualization library. It provides a high-level interface for drawing attractive and informative statistical graphics.
2. Import Dataset
Once the libraries are imported, our next step is to load the collected data. The pandas library is used to import these datasets. Datasets are mostly available in CSV format, which is compact and fast to process, so we load a CSV file using the read_csv function of pandas. Pandas can read datasets in various other formats as well.
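A minimal sketch of this loading step, using a small inline CSV in place of a file on disk (the column names here are hypothetical; with a real file you would pass its path to read_csv):

```python
import io
import pandas as pd

# Inline CSV standing in for a file such as "data.csv".
csv_data = io.StringIO(
    "age,salary,purchased\n"
    "25,50000,0\n"
    "40,80000,1\n"
    "30,60000,1\n"
)
dataset = pd.read_csv(csv_data)

X = dataset.iloc[:, :-1].values   # feature matrix: all columns but the last
y = dataset.iloc[:, -1].values    # observation vector: the last column

print(X.shape, y.shape)  # (3, 2) (3,)
```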
Once the dataset is loaded, we have to inspect it and look for any noise. To do so, we create a feature matrix X and an observation vector Y with respect to X.
3. Checking for Missing Values
Once you create the feature matrix, you might find that there are some missing values. If we don't handle them, they may cause problems at training time.
One option is to remove every row that contains a missing value, though you may end up losing vital information. This can be a good approach only when the dataset is large.
If a numerical column has a missing value, you can estimate it by taking the mean, median, or mode of the column.
4. Checking for Categorical Data
Data in the dataset has to be in numerical form so that we can perform computations on it. Since machine learning models involve complex mathematical computation, we can't feed them non-numerical values, so it is important to convert all the text values into numerical values. The LabelEncoder() class of scikit-learn is used to convert these categorical values into numerical ones.
5. Feature Scaling
The values in raw data can vary extremely, which may result in biased training of the model or may increase the computational cost, so it is important to normalize them. Feature scaling is a technique used to bring data values into a smaller range.
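Two of the methods listed next, min-max rescaling and standardization, can be sketched with their scikit-learn counterparts on a single toy feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[1.0], [5.0], [9.0]])  # one toy feature column

minmax = MinMaxScaler().fit_transform(values)       # rescale to [0, 1]
standard = StandardScaler().fit_transform(values)   # zero mean, unit variance

print(minmax.ravel())  # smallest value maps to 0, largest to 1
```

Both scalers learn their parameters (min/max, or mean/std) from the data passed to fit, so in practice they should be fit on the training set only and then applied to the test set.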
Methods used for feature scaling are:
Rescaling (min-max normalization)
Standardization (Z-score Normalization)
Scaling to unit length
6. Splitting Data into Training, Validation and Evaluation Sets
Finally, we need to split our data into three different sets: a training set to train the model, a validation set to validate its accuracy, and a test set to measure its performance on unseen data. Before splitting, it is important to shuffle the dataset to avoid any biases. A common proportion is 60:20:20, i.e., 60% for training and 20% each for validation and testing. To split the dataset, use train_test_split from sklearn.model_selection twice: once to split off one set, and again to split the remainder into the other two.
Conclusion – Data Preprocessing in Machine Learning
Data preprocessing is something that requires practice. It is not like a simple data structure that you learn once and apply directly to a problem. To get good at cleaning and visualizing datasets, you need to work with many different ones; the more you use these techniques, the better you will understand them. This was a general idea of how data preprocessing plays an important role in machine learning, along with the steps it involves. So the next time you train a model on collected data, be sure to apply data preprocessing first.
Recommended Articles
This is a guide to Data Preprocessing in Machine Learning. Here we discuss the introduction and six different steps involved in machine learning. You can also go through our other suggested articles to learn more –
Before creation, God did just pure mathematics. Then He thought it would be a pleasant change to do some applied.
-John Edensor Littlewood
Mathematics & statistics are the founding steps for data science and machine learning. Most of the successful data scientists I know come from one of these areas: computer science, applied mathematics & statistics, or economics. If you wish to excel in data science, you must have a good understanding of basic algebra and statistics.
However, learning maths can be intimidating for people without a mathematics background. First, you have to identify what to study and what to skip. The list can include linear algebra, calculus, probability, statistics, discrete mathematics, regression, optimization and many more topics. What do you do? How deep do you want to go in each of these topics? It is very difficult to navigate this by yourself.
If you have faced this situation before – don’t worry! You are at the right place now. I have done the hard work for you. Here is a list of popular open courses on Maths for Data science from Coursera, edX, Udemy and Udacity. The list has been carefully curated to give you a structured path to teach you the required concepts of mathematics used in data science.
Get started now to learn & explore mathematics for data science.Which course is suitable for you?
A few courses may require you to finish a preceding course for better understanding, so make sure that you either know the subject or have taken those courses.
Read on to find out the right course for you!
Table of Content
Beginners Mathematics / Statistics
Data Science Maths Skills
Intro to Descriptive Statistics
Intro to Inferential Statistics
Introduction to Probability and Data
Math is Everywhere: Applications of Finite Math
Probability: Basic Concepts & Discrete Random Variables
Mathematical Biostatistics Boot Camp 1
Applications of Linear Algebra Part 1
Introduction to Mathematical Thinking
Intermediate Mathematics / Statistics
Bayesian Statistics: From Concept to Data Analysis
Game Theory 1
Game Theory II: Advanced Applications
Advanced Linear Models for Data Science 1: Least Squares
Advanced Linear Models for Data Science 2: Statistical Linear Models
Introduction to Linear Models and Matrix Algebra
Maths in Sports
Advanced Mathematics / Statistics
Statistics for Genomic Data Science
Biostatistics for Big Data Applications
Beginners Mathematics & Statistics
Duration: 4 weeks
Led by: Duke University (Coursera)
If you are a beginner with very minimal knowledge of mathematics, then this course is for you. In this course, you will learn about concepts of algebra like set theory, inequalities, functions, coordinate geometry, logarithms, probability theory and many more.
This course will take you through all the basic maths skills required for data science and would provide a strong foundation.
The course starts from 9 Jan 2023 and is led by professors from Duke University.
Prerequisites: Basic maths skills
Duration: 8 weeks
Led by: Udacity
This course by Udacity is an excellent beginners guide for learning statistics. It is fun, practical and filled with examples. The Descriptive Statistics course will first make you familiar with different terms of statistics and their definition. Then you will learn about statistics concepts like central tendency, variability, standard normal distribution and sampling distribution.
This course doesn’t require any prior knowledge of statistics and is open for enrollment.
Duration: 8 weeks
Led by: Udacity
After you have gone through the Descriptive Statistics course, it is time for Inferential statistics. The same practical approach to the subject continues in this course.
In this course, you will learn concepts of statistics like estimation, hypothesis testing, t-test, chi-square test, one-way Anova, two-way Anova, correlation, and regression.
There are problem set and quiz questions after each topic. You will also be able to test your learning on a real-life dataset at the end of the course. The course is open for enrollment.
Prerequisites: Complete understanding of Descriptive Statistics (the course mentioned above)
Alternate Course: You can also look at Statistics: Unlocking the World of Data, a 6-week course run by the University of Edinburgh (edX).
Duration: 5 weeks
Led by: Duke University (Coursera)
It will provide you hands on experience in data visualization and numeric statistics using R and RStudio.
The course will first take you through basics of probability and data exploration to give a basic understanding to get started. Then, it will individually explain various concepts under each topic in detail. At the end, you will be tested on a data analysis project using a real-world dataset.
The course is led by a Professor in Statistics at Duke University and is also a prerequisite for Statistics in R specialization. If you are looking forward to learn R for data science, then you must take this course. The course is open for enrollment.
Prerequisites: Basic Statistics and knowledge of R
Duration: 1 week
Led by: Davidson College (Udemy)
As the name suggests, this course tells you how maths is being used everywhere from Angry birds to Google. It is a fun approach to applied mathematical concepts.
In this course, you will learn how equation of lines is used to create computer fonts, how graph theory plays a vital role in angry birds, linear systems model the performance of a sports team and how Google uses probability and simulation to lead the race in search engines.
The course is led by the mathematics professor at Davidson College and is open for enrollment.
Prerequisites: Understanding of linear algebra and programming
Duration: 6 weeks
Led by: Purdue University (edX)
This course is designed for anyone looking for a career in data science & information science. It covers essentials of mathematical probabilities.
In this course, you will learn the basic concepts of probability, random variables, distributions, Bayes Theorem, probability mass functions and CDFs, joint distributions and expected values.
After taking this course you will have a thorough understanding of how probability is used in everyday life. The course is open for enrollment.
Prerequisite: Basics Statistics
Duration: 4 weeks
Led by: Johns Hopkins University (Coursera)
Honestly, the “Bio” in “Biostatistics” is misleading. This course is all about fundamental probability and statistics techniques for data analysis.
The course covers topics on probability, expectations, conditional probabilities, distributions, confidence intervals, bootstrapping, binomial proportions, and logs.
A well-paced course with a complete introduction to mathematical statistics.
Prerequisites: Basic Linear algebra, calculus and programming useful but not mandatory
Duration: 5 weeks
Led by: Davidson College (edX)
This is an interesting course on applications of linear algebra in data science.
The course will first take you through fundamentals of linear algebra. Then, it will introduce you to applications of linear algebra for recognizing handwritten numbers, ranking of sports team along with online codes.
The course is open for enrollment.
Prerequisite: Basic linear algebra
Duration: 8 weeks
Led by: Stanford University (Coursera)
In this mathematical thinking course from Stanford, you will learn how to develop analytical thinking skills. The course teaches you interesting ways to develop out-of-the-box thinking and helps you remain ahead of the competitive curve.
In this course, you will learn about analysis of a language, quantifiers, brief introduction to number theory and real analysis. To make the most of this course one must have familiarity with algebra, number system and elementary set theory.
The course starts from 9 Jan 2023 and is led by professors at Stanford. It is open for enrollment.
Prerequisites: Basic algebra, number system and elementary set theory.
Intermediate Mathematics & Statistics
By this time, you know all the basic concepts a data scientist needs to know. This is the time to take your mathematical knowledge to the next level.
Duration: 4 weeks
Led by: University of California (Coursera)
Bayesian Statistics is an important topic in data science. For some reason, it does not get as much attention.
In this course, the first section covers basic topics like probability like conditional probability, probability distribution and Bayes Theorem. Then you will learn about statistical inference for both Frequentist and Bayesian approach, methods for selecting prior distributions and models for discrete data and Bayesian analysis for continuous data.
Prior knowledge of statistics concepts is required to take this course. The course starts from 16 Jan 2023.
Prerequisite: Basic & Advanced Statistics
Duration: 8 weeks
Led by: Stanford University and University of British Columbia (Coursera)
Game theory is an important component of data science. In this course, you will learn the basics of game theory and its applications. If you are looking to master reinforcement learning this year, this course is a must for you.
The course provides basic understanding of representing games and strategies, the extensive form (which computer scientists call game trees), Bayesian games (modeling things like auctions), repeated and stochastic games. Each concept has been explained with the help of examples and applications.
The course is led by professors from the Stanford University and The University of British Columbia. The course is open for enrollment.
Prerequisite: Basic probability and mathematical thinking
Duration: 5 weeks
Led by: Stanford University and The University of British Columbia (Coursera)
You will learn about how to design interactions between agents in order to achieve good social outcomes. The three main topics covered are social choice theory, mechanism design, and auctions. The course starts from 30 Jan 2023 and is led by professors from Stanford University & The University of British Columbia.
The course is open for enrollment.
Prerequisite: Basics of Game Theory
Duration: 4 weeks
Led By: Harvard University (edX)
Matrix algebra is used in various tools for experimental design and analysis of high-dimensional data.
For easy understanding, the course has been divided into seven parts to provide you a step by step approach. You will learn about matrix algebra notation & operations, application of matrix algebra to data analysis, linear models and QR decomposition.
The language used throughout the course is R. Feel free to choose which part of the course caters more to your interest and take the course accordingly.
The course is conducted by biostatistics professors at Harvard University and is open for enrolment now.
Prerequisite: Basic Linear algebra and knowledge of R
Duration: 6 weeks
Led by: Johns Hopkins University (Coursera)
In this course, you will learn about one & two parameter regression, linear regression, general least square, least square examples, bases & residuals.
Before you proceed, let me be clear: to take this course you need a basic understanding of linear algebra & multivariate calculus, statistics & regression models, familiarity with proof-based mathematics, and a working knowledge of R. The course starts from 23 Jan 2023.
Prerequisite: Linear Algebra, calculus, statistics and knowledge of R
Duration: 6 weeks
Led by: Johns Hopkins University
In this course, you will learn about basics of statistical modeling multivariate normal distribution, distributional results, and residuals.
Before you proceed, let me be clear: to take this course you need a basic understanding of linear algebra & multivariate calculus, statistics & regression models, familiarity with proof-based mathematics, and a working knowledge of R. The course starts from 23 Jan 2023.
Prerequisite: Linear Algebra, calculus, statistics and knowledge of R
Duration: 8 weeks
Led by: University of Notre Dame (edX)
I am someone who is very curious to know how mathematics can be used to drive deeper insights in sports and everyday life.
I came across this course, which shows how your favorite sport uses mathematics to analyze data and know the trends, performance of players and their fellow teams.
In this course, you will learn how inductive reasoning is used in mathematical analysis, how probability is used to evaluate data, assess the risk and outcomes of any event.
All the major team sports, athletic sports, and even extreme sports like mountain climbing are covered in the course. The course is led by professors of the University of Notre Dame and is currently open for enrolment.
Prerequisite: Statistics & Linear Algebra
Advanced Mathematics & Statistics
Bravo, by now – you would be on your own. You would have developed a knack for mathematics & statistics and would feel confident about continuous learning – way to go!
Duration: 8 weeks
Led by: University of Melbourne (Coursera)
Every industry & company makes use of optimization. Airlines use optimization to ensure fixed turnaround times, and e-commerce companies like Amazon use it for on-time delivery of products. Macro-level applications of optimization include deploying electricity to millions of people, paving the way for new medical drug discoveries, and many more.
The prerequisites to take this course are good programming skills, knowledge of fundamental algorithms, and linear algebra. The course starts from 16 Jan 2023 and is conducted by professors at Melbourne University.
Prerequisite: Programming, algorithms and linear algebra
Duration: 4 weeks
Led by: Johns Hopkins University
If you aspire to become a generation sequencing data scientist then you must take this course.
In this course, you will learn about exploratory analysis, linear modeling, hypothesis testing & multi-hypothesis testing, different types of process like RNA-seq, GWAS, ChIP-Seq, and DNA Methylation studies. This course is part of Genomic Data Scientist specialization from Johns Hopkins. The course starts from 16 Jan 2023.
Prerequisite: Advanced Statistics and algorithms
Duration: 8 weeks
Led by: utmb Health (edX)
This course is an introduction to data analysis using biomedical big data.
In this course, you will learn about fundamental components of biostatistical methods. Working with biomedical big data can pose various challenges for someone not familiar with statistics.
Learn how basic statistics is used in biomedical data types. You will learn about basics of R programming, how to create & interpret graphical summaries of data and inferential statistics for parametric & non-parametric methods. It will provide you hands on experience in R with biomedical problem types.
The course is open for enrolment.
Prerequisite: Advanced statistics and knowledge of R
End Notes
I hope you found this article useful. By now, you would have identified the learning areas for yourself. If you are from mathematics background, you can choose the right courses for yourself. On the other hand, if you do not have a mathematics background, then start from the beginners sections and move ahead.
This article was published as part of the Data Science Blogathon.
Introduction :
In this post, we will go through some of the major challenges you might face while developing your machine learning model. We assume you already know what machine learning is really about, why people use it, what its different categories are, and how the overall development workflow takes place.
What can possibly go wrong during the development and prevent you from getting accurate predictions?
So let's get started. During the development phase, our focus is to select a learning algorithm and train it on some data. The two things that might be a problem are a bad algorithm or bad data, or perhaps both of them.
Table of Content :
Not enough training data.
Poor Quality of data.
Irrelevant Features.
Nonrepresentative training data.
Overfitting and Underfitting.
1. Not enough training data :
To teach a child what an apple is, all it takes is pointing to an apple and saying "apple" a few times; soon the child can recognize apples of all sorts. Machine learning is not quite there yet: most algorithms need a lot of data to learn properly, even for simple problems.
2. Poor Quality of data:
Obviously, if your training data has lots of errors, outliers, and noise, it will make it impossible for your machine learning model to detect a proper underlying pattern. Hence, it will not perform well.
So put every ounce of effort into cleaning up your training data. No matter how good you are at selecting and tuning the model, this part plays a major role in building an accurate machine learning model.
“Most Data Scientists spend a significant part of their time in cleaning data”.
There are a couple of examples when you’d want to clean up the data :
If you see some of the instances are clear outliers just discard them or fix them manually.
If some instances are missing a feature (e.g., 2% of users did not specify their age), you can ignore those instances, fill in the missing values with the median age, or train one model with the feature and one without it and compare.
3. Irrelevant Features:
“Garbage in, garbage out (GIGO).”
Even if our model is excellent, feeding it garbage data will produce garbage output. Our training data must always contain mostly relevant features and few to no irrelevant ones.
The credit for a successful machine learning project goes to coming up with a good set of features to train on (often referred to as feature engineering), which includes feature selection, feature extraction, and creating new features: interesting topics to be covered in upcoming blogs.
To make sure that our model generalizes well, we have to make sure that our training data should be representative of the new cases that we want to generalize to.
If we train our model on a nonrepresentative training set, its predictions won't be accurate; it will be biased toward one class or group.
For example, suppose you are trying to build a model that recognizes the genre of music. One way to build your training set is to search on YouTube and use the resulting data. Here we assume YouTube's search engine provides representative data, but in reality the results will be biased toward popular artists, and perhaps toward artists popular in your location (if you live in India, you will mostly get the music of Arijit Singh, Sonu Nigam, and so on).
So use representative data during training, so your model won't be biased toward one or two classes when it works on testing data.
5. Overfitting and Underfitting :
What is overfitting?
Let's start with an example. One day, while you are walking down a street, a dog appears out of nowhere. You offer it something to eat, but instead of eating it starts barking and chasing you, though you escape unharmed. After this incident, you might conclude that no dog is worth treating nicely.
This kind of overgeneralization is something we humans do all too often, and unfortunately machine learning models do the same if we are not careful. In machine learning we call this overfitting: the model performs well on the training data but fails to generalize.
Overfitting happens when our model is too complex.
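A small sketch of overfitting in action, using an unconstrained decision tree on toy noisy data (the data and model choice here are illustrative, not from the post):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (200, 1))                    # one input feature
y = np.sin(X).ravel() + rng.normal(0, 0.3, 200)     # true signal plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize every training point, noise included.
deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
print(deep.score(X_tr, y_tr))   # 1.0: a perfect fit on the training data
print(deep.score(X_te, y_te))   # noticeably lower on unseen data
```

The gap between the two scores is the signature of overfitting: the model learned the noise, not just the signal.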
Things which we can do to overcome this problem:
Simplify the model by selecting one with fewer parameters.
By reducing the number of attributes in training data.
Constraining the model.
Gather more training data.
Reduce the noise.
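To illustrate the "constraining the model" remedy above, the same kind of tree can be limited in depth (max_depth=3 is an arbitrary illustrative choice); it then gives up the perfect training fit in exchange for a simpler model:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Limiting depth constrains the model: it can no longer memorize the noise.
pruned = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(pruned.score(X_tr, y_tr))   # below 1.0: no longer a perfect training fit
```

In practice the right amount of constraint (here, the depth) is tuned on a validation set.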
What is underfitting?
Yes, you guessed it: underfitting is the opposite of overfitting. It happens when our model is too simple to learn the underlying structure of the data. For example, if you use a linear model on data with a strongly nonlinear relationship, it is bound to underfit, and its predictions will be inaccurate even on the training set.
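A minimal sketch of this, on toy quadratic data: a straight line underfits, while adding squared features gives the model the capacity it needs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.2, 100)   # curved relationship, not linear

# Too simple: a straight line cannot capture the curve.
line = LinearRegression().fit(X, y)
print(line.score(X, y))   # close to 0, even on the training set

# Richer features fix the underfit without changing the model class.
X2 = PolynomialFeatures(degree=2).fit_transform(X)
curve = LinearRegression().fit(X2, y)
print(curve.score(X2, y))   # close to 1
```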
Things which we can do to overcome this problem:
Train on better and relevant features.
Reduce the constraints.
Conclusion :
Machine learning is all about making machines better by using data, so that we don't need to code them explicitly. The model will not perform well if the training data is too small, noisy (full of errors and outliers), nonrepresentative (which results in bias), or full of irrelevant features (garbage in, garbage out), or if the model is either too simple (which results in underfitting) or too complex (which results in overfitting). Even after training a model with these points in mind, don't expect it to simply generalize well to new cases; you may need to evaluate and fine-tune it. How to do that? Stay tuned, this topic will be covered in upcoming blogs.
Karan Amal Pradhan.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.