From Petroleum Engineering To Data Science: Jaiyesh Chahar’s Journey
Updated in December 2023 on Moimoishop.com.
Introduction
Let’s get into this thrilling and compelling conversation with Jaiyesh.
Interview Excerpts with Jaiyesh Chahar
AV: Please introduce yourself and share your educational journey with us.
Jaiyesh: Hi, I am Jaiyesh Chahar, a Petroleum Engineer turned Data Scientist. I did my bachelor’s in petroleum engineering at the University of Petroleum and Energy Studies, Dehradun. After that, I pursued my master’s at IIT ISM Dhanbad, where I decided to take Machine Learning as my minor. From there, my journey of becoming a Data Scientist started. After that, I worked as a Petroleum Data Scientist for an oil and gas startup, and later I joined Siemens as a Data Scientist.
AV: What made you decide to become a Data Scientist? You have a specialization in Petroleum; what made you switch? What steps did you take to gain the necessary skills and knowledge to succeed in the field of Data Science?
Jaiyesh: I started by using Data Science as a tool for solving problems in the petroleum industry. So, I was a Petroleum Engineer who knew Data Science. As for the steps, the first was learning to code in Python, followed by useful Python libraries like NumPy, pandas, and Matplotlib, then statistics, machine learning, and finally deep learning. At each step, a lot of practice is required.
AV: What is the biggest challenge you have faced in your career as a Data Scientist? How did you overcome it?
Jaiyesh: My biggest challenge was getting a job as a fresher without any experience, but here my petroleum background helped me because very few people have knowledge of oil and gas as well as data science. That mixture helped me get my first job.
AV: You are one of the co-founders of “Petroleum From Scratch.” What was the inspiration behind it?
Jaiyesh: We started Petroleum From Scratch during the Covid period in 2020. A lot of organizations started during that time, and they were charging hefty amounts for training petroleum students and engineers. The oil and gas market was also at its lowest point, as crude prices went below zero, so a lot of layoffs happened in the oil and gas industry. To help professionals and students, we came up with Petroleum From Scratch, where we share knowledge free of cost.
AV: After working at Siemens for over a year, can you describe a recent project you have worked on, and what were some key insights or takeaways?
Jaiyesh: So, one of my recent projects is for a giant automotive company, where we built a complete pipeline for the detection of faulty parts in the manufacturing unit. In this project, we delivered not only the Data Science part but also the software piece. So, this project showed me the importance of knowing software pipeline development, even as a data scientist.
Tips for Data Scientist Enthusiasts
AV: What are habits that you swear by which have led you to be successful?
Jaiyesh: Consistency and showing up daily are key habits that can help in achieving success in any area of life. When we consistently show up and put in the effort, we are more likely to make progress and see results over time.
This is especially true when it comes to learning new skills or developing new habits. By consistently practicing or working on something every day, we can build momentum and make steady progress towards our goals. This can help us stay motivated and avoid getting discouraged or giving up too soon.
In addition to consistency, other habits that can contribute to success include setting clear goals, prioritizing tasks, staying organized, and maintaining a positive attitude. By combining these habits with consistency and daily effort, we can create a powerful formula for achieving success in any area of life.
Conclusion
It is crucial to understand the goals of the users or participants in a data science project. However, this does not guarantee success. Data science teams must adhere to best practices when executing a project in order to deliver on a clearly defined brief. The following ten points explain what that means.
1. Understanding the Problem
Knowing the problem you are trying to solve is the most important part of solving it. You must understand what you are trying to predict, all constraints, and the end goal of the project.
2. Know Your Data
Knowing what your data means will help you understand which models are most effective and which features to use. The nature of the data determines which model is most successful, and the computational time will affect the project’s cost.
You can improve on or mimic human decision-making by creating and using meaningful features. It is crucial to understand the meaning of each field, especially in regulated industries where data may be anonymized and unclear. If you’re unsure what something means, consult a domain expert.
3. Split your data
What will your model do with unseen data? If your model can’t adapt to new data, it doesn’t matter how well it does with the data it is given.
You can validate its performance on unknown data by not letting the model see any of it while training. This is essential in order to choose the right model architecture and tuning parameters for the best performance.
Splitting your data into multiple parts is necessary for supervised learning. The training data is the data the model learns from. It typically consists of 75-80% of the original data, chosen randomly.
The remaining data is called the testing data and is used to evaluate your model. Depending on the type of model you are creating, you may need another set of data, called the validation set, which is used to compare different supervised learning models that were tuned using the test data.
In that case, separate the non-training data into validation and testing data sets. You can then compare different iterations of the same model with the test data, and the final versions with the validation data.
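The three-way split described above can be sketched with NumPy. This is a minimal illustration, not a prescribed recipe: the function name `three_way_split`, the 80/10/10 proportions, and the fixed seed are assumptions made for the example, and the set names follow the usage in this section (test data for tuning, validation data for the final comparison).

```python
import numpy as np

def three_way_split(n_rows, train_frac=0.8, test_frac=0.1, seed=0):
    """Return disjoint index arrays for training, testing, and validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)          # random order of row indices
    n_train = int(round(train_frac * n_rows))
    n_test = int(round(test_frac * n_rows))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]   # used while tuning a model
    validation = shuffled[n_train + n_test:]    # used for the final comparison
    return train, test, validation

train_idx, test_idx, validation_idx = three_way_split(1000)
print(len(train_idx), len(test_idx), len(validation_idx))
```

Because the indices come from a single shuffled permutation, no row can appear in more than one part.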
4. Don’t Leak Test Data
It is important to not feed any test data into your model. This could be as simple as training on the entire data set, or as subtle as performing transformations (such as scaling) before splitting.
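To make the scaling case concrete, here is a small sketch with synthetic numbers (the data, the 80/20 cut, and the planted extreme value are all assumptions for illustration). It contrasts min-max scaling computed on all rows with scaling computed on the training rows only.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=(100, 1))
values[-1, 0] = 50.0  # an extreme value that ends up in the held-out rows

train, test = values[:80], values[80:]

# Leaky: min/max are computed on ALL rows, so the held-out extreme
# value changes how the training rows are scaled.
leaky_min, leaky_max = values.min(axis=0), values.max(axis=0)
train_leaky = (train - leaky_min) / (leaky_max - leaky_min)

# Safer: min/max are computed on the training rows only and then
# reused, unchanged, for the held-out rows.
train_min, train_max = train.min(axis=0), train.max(axis=0)
train_scaled = (train - train_min) / (train_max - train_min)
test_scaled = (test - train_min) / (train_max - train_min)

print(train_leaky.max(), train_scaled.max(), test_scaled.max())
```

In the second variant, held-out values can fall outside the 0 to 1 range. That is expected: it reflects that the model has genuinely never seen them.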
If you normalize your data before splitting, the model will gain information about the test set, since the global minimum and maximum might be in the held-out data.
5. Use the Right Evaluation Metrics
Every problem is unique, so the evaluation method must be based on its context. Accuracy is the most naive classification metric, and it can be dangerous. Take the example of cancer detection.
If only about 1 percent of patients have cancer, a model that always says “not cancer” is correct 99 percent of the time, yet it never detects a single case. High accuracy alone does not make such a model reliable.
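The cancer example can be checked with a few lines of plain Python. The labels below are hypothetical (10 positive cases out of 1000), and recall is used here as just one example of a metric that exposes the problem.

```python
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    # Share of the actual positive cases that the model found.
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# Hypothetical screening data: 1 = cancer, 0 = not cancer.
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # a "model" that always says "not cancer"

print(accuracy(y_true, always_negative))  # 0.99
print(recall(y_true, always_negative))    # 0.0
```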
6. Keep it simple
It is important to select the best solution for your problem, not the most complex one. Management, customers, and even you might want to use the “latest-and-greatest.” Instead, use the simplest model that meets your needs, a principle known as Occam’s Razor.
This not only makes the model easier to understand and reduces training time, but can also improve performance. You shouldn’t shoot a fly with a bazooka, or try to kill Godzilla with a flyswatter.
7. Do not overfit or underfit your model
Overfitting, associated with high variance, leads to poor performance on data the model has not seen. The model simply memorizes the training data.
Underfitting, associated with high bias, occurs when the model captures too little detail to represent the problem accurately. The tension between the two is known as the “bias-variance trade-off”, and each problem requires a different balance.
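One way to see both failure modes is to fit polynomials of increasing degree to the same noisy data. The sketch below uses a synthetic sine curve, and the degrees 1, 3, and 9 are arbitrary choices for illustration. The training error can only go down as the model becomes more flexible, so it is the held-out error that tells you when flexibility stops helping.

```python
import numpy as np

rng = np.random.default_rng(0)

def truth(x):
    # Hypothetical "true" relationship the models try to recover.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 15)
x_test = np.linspace(0.03, 0.97, 15)
y_train = truth(x_train) + rng.normal(scale=0.2, size=x_train.size)
y_test = truth(x_test) + rng.normal(scale=0.2, size=x_test.size)

def mse(y, y_pred):
    return float(np.mean((y - y_pred) ** 2))

errors = {}
for degree in (1, 3, 9):  # too simple, moderate, very flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),  # training error
        mse(y_test, np.polyval(coeffs, x_test)),    # held-out error
    )
    print(degree, errors[degree])
```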
Let’s use a simple image classification tool as an example. It is responsible for identifying whether a dog is present in an image.
8. Try Different Model Architectures
It is often beneficial to try different models for a particular problem. An architecture that works well for one problem may not work well for another.
You can mix simple and complex algorithms. If you are creating a classification model, for example, try models as simple as random forests and as complex as neural networks.
Interestingly, extreme gradient boosting often outperforms a neural network classifier. Simple problems are often easier to solve with simple models.
9. Tune Your Hyperparameters
Hyperparameters are settings of the model that are chosen before training rather than learned from the data. One example of a hyperparameter in a decision tree is its depth.
This is how many questions the tree will ask before it decides on an answer. A model’s default hyperparameters are chosen to perform well on average, so tuning them for your specific data can improve performance.
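Tuning usually means trying several candidate values and keeping the one that scores best on held-out data. The sketch below uses a tiny hand-written k-nearest-neighbours regressor as a stand-in for the decision tree mentioned above, with `k` playing the role that depth plays for a tree. The data, the candidate values, and the helper `knn_predict` are assumptions for illustration.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict each query point as the mean target of its k nearest training points."""
    dists = np.abs(x_query[:, None] - x_train[None, :])
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Rows used to fit the model versus rows held out for tuning.
x_fit, y_fit = x[:140], y[:140]
x_tune, y_tune = x[140:], y[140:]

scores = {}
for k in (1, 3, 5, 10, 25, 50):  # candidate hyperparameter values
    pred = knn_predict(x_fit, y_fit, x_tune, k)
    scores[k] = float(np.mean((y_tune - pred) ** 2))

best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```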
10. Comparing Models Correctly
Machine learning has the ultimate goal of creating a model that is generalizable. It is important to select the most accurate model by comparing and choosing it correctly.
You will need a different holdout than the one you used to train your hyperparameters. You will also need to use statistical tests that are appropriate to evaluate the results.
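As one simple option for such a statistical test, the sketch below computes a paired t statistic on per-fold scores of two models evaluated on the same folds. The scores are hypothetical, and the critical value 2.262 is the two-sided 5% value of Student's t for 9 degrees of freedom. Scores from cross-validation folds are not fully independent, so treat the result as a rough guide rather than a definitive verdict.

```python
import math
import statistics

# Hypothetical accuracy of two models on the same 10 held-out folds.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
model_b = [0.80, 0.78, 0.80, 0.81, 0.80, 0.77, 0.81, 0.78, 0.82, 0.79]

# Pair the scores fold by fold and test whether the mean difference is zero.
diffs = [a - b for a, b in zip(model_a, model_b)]
mean_diff = statistics.mean(diffs)
std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
t_stat = mean_diff / std_err

T_CRITICAL = 2.262  # two-sided, 5% level, 9 degrees of freedom
print(round(t_stat, 2), abs(t_stat) > T_CRITICAL)
```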
Analytics vendors and non-technical employees are democratizing data science. Organizations are looking at converting non-technical employees into data scientists so that they can combine their domain expertise with data science technology to solve business problems.
What does citizen data scientist mean?
In short, they are non-technical employees who can use data science tools to solve business problems.
Citizen data scientists can provide business and industry domain expertise that many data science experts lack. Their business experience and awareness of business priorities enable them to effectively integrate data science and machine learning output into business processes.
Why are citizen data scientists important now?
Interest in citizen data science almost tripled between 2012 and 2023.
Reasons for this growing interest are:
Though there is an increasing need for analytics due to increased popularity of data-driven decision making, data science talent is in short supply. As of 2023, there are three times more data science job postings than job searches.
As with any short supply product in the market, data science talent is expensive. According to the U.S. Bureau of Labor Statistics, the average data science salary is $101k.
Analytics tools are easier to use now, which reduces the reliance on data scientists.
Most industry analysts are also highlighting the increased role of citizen data scientists in organizations:
IDC big data analytics and AI research director Chwee Kan Chua mentions in an interview: “Lowering the barriers to allow even non-technical business users to be ‘data scientists’ is a great approach.”
Gartner defined the term and is heavily promoting it.
Various solutions help businesses to democratize AI and analytics:
Citizen data scientists first need to understand business data and access it from various systems. Metadata management solutions like data catalogs or self-service data reporting tools can help citizen data scientists with this.
Automated Machine Learning (AutoML): AutoML solutions can automate manual and repetitive machine learning tasks to empower citizen data scientists. ML tasks that AutoML tools can automate include:
Algorithm selection & hyperparameter optimization
Augmented analytics /AI-driven analytics: ML-led analytics, where tools extract insights from data in two forms:
Search-driven: Software returns with results in various formats (reports, dashboards, etc.) to answer citizen data scientists’ queries.
Auto-generated: ML algorithms identify patterns to automate insight generation.
No/low-code and RPA solutions minimize coding with drag-and-drop interfaces which helps citizen developers place the models they prepare in production.
BotX’s no-code AI platform can empower citizen data scientists to build solutions faster while reducing development costs. BotX solutions allow developers and data scientists to launch apps and set infrastructure and IT systems through:
What are best practices for citizen data science projects?
Create a workspace where citizen data scientists and data science experts can work collaboratively
Most citizen data scientists are not trained in the foundations of data science. They rely on tools to generate reports, analyze data, and create dashboards or models. To maximize citizen data scientists’ value, you should have teams that can support them, including data engineers and expert data scientists.
Train citizen data scientists
Use of BI/AutoML tools for maximum efficiency
Data security training to maintain data compliance
Detecting AI biases and creating standards for model trust and transparency so that citizen data scientists can establish explainable AI (XAI) systems
Classify datasets based on accessibility
Due to data compliance issues, not all data types should be accessible to all employees. Classifying data sets that require limited access can help overcome this issue.
Create a sandbox for testing
Sandboxes are software testing environments that include synthetic data and are not connected to production environments. They help citizen data scientists quickly test their models before rolling them out to production.
Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Did you know that ‘Data Engineer’ is the fastest-growing role in the industry?
Currently, most data science aspirants are still focused on landing the coveted role of a data scientist. That’s understandable – all the hype in the media and the community glorifies the role of a data scientist. But it’s the data engineer that’s emerged as the dark horse.
Which isn’t really surprising, is it? Data science professionals spend close to 60-70% of their time gathering, cleaning, and processing data, and that’s right up a data engineer’s alley!
Tech behemoths like Netflix, Facebook, Amazon, Uber, etc. are collecting data at an unprecedented pace – and they’re hiring data engineers like never before. There hasn’t been a better time to get into this field!
Unfortunately, there is no coherent path designed to become a data engineer. Most data science aspirants haven’t even heard of the role – they tend to learn about it on the job.
I’ve put together a list of data engineering books to help you get started with this thriving field and make sure you’re acquainted with the various terms, skills, and other nuances required.
And why books?
Books are a vital way of absorbing information on Data Engineering. So let’s begin!
1. The Data Engineering Cookbook by Andreas Kretz
There is a lot of confusion about how to become a data engineer. I’ve met a lot of data science aspirants who didn’t even know this role existed!
Here is an ebook by Andreas Kretz that has elaborate case studies, code, podcasts, interviews, and more. I consider this to be a complete package to enable anyone to become a data engineer.
And the icing on the cake? This ebook is free! Yes, you can instantly get started with it. Learn, practice, and prepare for your data engineering role now!
2. DW 2.0 – The Architecture for the Next Generation of Data Warehousing by The Father of Data Warehousing W.H. Inmon
This book describes the future of data warehousing that is technologically possible today, at both an architectural level as well as a technology level.
I really like how the book is neatly structured. It covers most of the topics related to data architecture and its underlying challenges, how you can use the existing system and build a data warehouse around it, and the best practices to justify the expenses in a very practical manner.
This book is designed for:
Anyone who aspires to become a data engineering professional
Organizations that want to induct this capability into their systems
System designers, and
Data warehouse professionals
DW 2.0 is written by the “father of the data warehouse”, Bill Inmon, a columnist and newsletter editor of The Bill Inmon Channel on the Business Intelligence Network.
This one is not to be missed!
3. Agile Data Warehouse Design: Collaborative Dimensional Modeling, from Whiteboard to Star Schema by Lawrence Corr
This is a great book. Lawrence Corr provides a comprehensive, step-by-step guide to capturing data warehousing and business intelligence requirements and converting them into high-performance models using a technique called model storming (model + brainstorming).
Additionally, you’ll come across a concept called BEAM, an agile approach to dimensional modeling for improving communication between data warehouse designers and business intelligence stakeholders.
What do you wish for as a data scientist?
How about getting data that’s clean and reliable? With all the business value captured and presented well in the data, you would definitely wish for accurate and robust data models, high application agility and well-designed models as the final outcome.
How would you feel if someone just granted you these wishes and made your dream of becoming a champion data engineer come true? Why wait for that ‘someone’ when you can chalk out your own path and grant these wishes yourself, simply by reading this book?
The third edition of this book is a complete library of updated dimensional modeling techniques, the most comprehensive collection ever. It covers new and enhanced star schema dimensional modeling patterns, adds two new chapters on ETL techniques, includes new and expanded business matrices for 12 case studies, and more.
Data is being generated in huge volumes today, a scale we can only imagine. So much data plays a vital role in increasing the complexity of operations and that has sparked new developments in the field of data engineering.
This cracking book by Holden Karau offers a valuable reference guide for all graduate students, researchers, and scientists interested in exploring the potential of Big Data applications.
Dive into the world of innovations in the way you acquire and massage data. The ultimate goal is to get the best, most well-organized data for your machine learning model, and Spark is one of the most widely used data processing frameworks in enterprises today.
Spark: The Definitive Guide: Big Data Processing Made Simple by
Data Engineering is a multi-disciplinary field with applications in control, decision theory, and the emerging hot area of bioinformatics. There are no books on the market that make the subject accessible to non-experts.
So, if you are just starting off and need a good book to learn everything about data engineering, then Spark, a fast cluster computing framework that is used for processing, querying and analyzing big data, is the tool that you should learn and this is your book to read.
All the theory and practical concepts are explained in a user-friendly manner and easy to understand language.
It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they’re built.
So, if you are the CEO/CXO of an organization and want to introduce the Data Engineering practice into your organization, then you should grab this book and access the data engineering pattern of your business.
The concepts of this book revolve around the task of collecting data and distilling useful information from that data. Five discrete sections covered in this book are:
Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data.
Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R, and Data Visualization
So, if you want to start learning about data engineering tools, then this book is a must-read. It holistically covers all the tools that help you work with data and craft strategies to gain a competitive edge.
End Notes
Becoming a data engineer is not an easy task. It requires a deep understanding of tools, processes, and techniques to be able to extract the best out of any structured/ unstructured data.
You can sketch out a data engineering path for yourself by reading this exhaustive article – Want to Become a Data Engineer? Here’s a Comprehensive List of Resources to get started.
This article was published as a part of the Data Science Blogathon.
Introduction
Data science interviews consist of questions from statistics and probability, Linear Algebra, Vector, Calculus, Machine Learning/Deep learning mathematics, Python, OOPs concepts, and Numpy/Tensor operations. Apart from these, an interviewer asks you about your projects and their objective. In short, interviewers focus on basic concepts and projects.
This article is part 1 of the data science interview series and will cover some basic data science interview questions. We will discuss the interview questions with their answers:
What is OLS? Why, and Where do we use it?
OLS (Ordinary Least Squares) is a linear regression technique that estimates the unknown parameters (coefficients) relating the inputs to the output. The method relies on minimizing a loss function: the sum of squared residuals, where a residual is the difference between an actual target value and the predicted value. The objective is:
Minimize ∑(yi – ŷi)^2
Where ŷi is the predicted value, and yi is the actual value.
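The minimization above can be sketched with NumPy's least-squares solver. The data are synthetic, and the "true" coefficients 3, 2, and -1 are assumptions chosen so that the estimate can be compared against something known.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))  # two input features
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Add a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve for the coefficients that minimize the sum of squared residuals.
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ coeffs
print(coeffs, float(np.sum(residuals ** 2)))
```

The estimated coefficients should land close to the values used to generate the data, and any other coefficient vector gives a larger sum of squared residuals on the same rows.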
OLS applies with a single input, and it extends naturally to more than one input: the approach treats the data as a matrix and estimates the optimal coefficients using linear algebra operations.
What is Regularization? Where do we use it?
Regularization is a technique that reduces overfitting of the trained model. It is used when the model is overfitting the data.
Overfitting occurs when the model performs well on the training set but not on the test set: the error is minimal on the training set but high on the test set.
Hence, regularization adds a penalty to the loss function to obtain a model that fits well without overfitting.
What is the Difference between L1 and L2 Regularization?
L1 Regularization is also known as Lasso (Least Absolute Shrinkage and Selection Operator) Regression. This method penalizes the loss function by adding the absolute value of the coefficient magnitudes as a penalty term.
Lasso works well when we have a lot of features. This technique works well for model selection since it reduces the features by shrinking the coefficients to zero for less significant variables.
Thus it removes some features that have less importance and selects some significant features.
L2 Regularization (or Ridge Regression) penalizes the model as its complexity increases. The regularization parameter (lambda) penalizes all the parameters except the intercept so that the model generalizes and does not overfit.
Ridge regression adds the squared magnitude of the coefficients as a penalty term to the loss function. When lambda is zero, it reduces to OLS. When lambda is very large, the penalty dominates and leads to under-fitting.
Moreover, Ridge regression pushes the coefficients towards smaller values while keeping them non-zero, so the solution is not sparse. The squared term in the loss function inflates the residuals of outliers, which makes the fit sensitive to them; the penalty term partly counteracts this by shrinking the weights.
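The difference between the two penalties is easiest to see in a simplified setting. The sketch below assumes orthonormal input features and a loss written as half the sum of squared residuals plus `lam` times the L1 penalty (Lasso) or `lam / 2` times the squared L2 penalty (Ridge). Under those assumptions each penalized coefficient is a simple function of the OLS coefficient. The function names and the example coefficients are hypothetical.

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    # L2 penalty: every coefficient is scaled down but stays non-zero.
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    # L1 penalty: soft-thresholding sets small coefficients exactly to zero.
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([3.0, -1.5, 0.4, -0.1])  # hypothetical OLS coefficients
lam = 0.5

ridge = ridge_shrink(beta_ols, lam)
lasso = lasso_shrink(beta_ols, lam)
print(ridge)
print(lasso)
```

With these numbers Lasso removes the two smallest coefficients entirely, while Ridge keeps all four at reduced size, which matches the feature-selection behaviour described above.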
Ridge regression performs better when all the input features influence the output with weights roughly equal in size. Besides, Ridge regression can also learn complex data patterns.
What is R Square?
R Square is a statistical measure that shows how close the data points are to the fitted regression line. It gives the percentage of the variation in the response variable that is explained by the linear model.
The value of R-Square lies between 0% and 100%, where 0% means the model explains none of the variation of the response around its mean, and 100% means the model explains all of it.
In short, the higher the R-Square value, the better the model fits the data.
The R-square measure has some drawbacks that we will address here too.
The problem is that whenever we add an independent variable to our model, whether junk or significant, the R-Squared value never decreases. Hence we need another measure, equivalent to R-Square, that penalizes the model for any junk independent variable.
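This behaviour can be reproduced on synthetic data. The sketch below fits a linear model with and without an unrelated "junk" feature and reports R-Square alongside one common correction, adjusted R-Square, computed as 1 - (1 - R²)(n - 1)/(n - p - 1) for n samples and p features. The data and helper names are assumptions for illustration.

```python
import numpy as np

def r_squared(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)     # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2) # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

def adjusted_r_squared(r2, n_samples, n_features):
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

def fit_predict(X, y):
    # Ordinary least squares with an intercept column.
    design = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coeffs

rng = np.random.default_rng(0)
n = 40
x_useful = rng.normal(size=n)
x_junk = rng.normal(size=n)  # unrelated to y by construction
y = 2.0 * x_useful + rng.normal(scale=1.0, size=n)

r2_small = r_squared(y, fit_predict(x_useful.reshape(-1, 1), y))
r2_big = r_squared(y, fit_predict(np.column_stack([x_useful, x_junk]), y))

print(r2_small, adjusted_r_squared(r2_small, n, 1))
print(r2_big, adjusted_r_squared(r2_big, n, 2))
```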
So, we calculate the Adjusted R-Square, which adjusts the generic R-square formula for the number of independent variables.
What is Mean Square Error?
Mean square error tells us how close the regression line is to a set of data points. It takes the distances from the data points to the regression line, squares them, and averages the squares. These distances are the errors of the model: the differences between predicted and actual values.
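As a small worked example, the snippet below computes the mean square error of one candidate line on four hypothetical points. The slope 2 and intercept 1 are arbitrary choices for illustration.

```python
def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between actual and predicted values.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.5, 9.0]

slope, intercept = 2.0, 1.0  # hypothetical candidate line
predictions = [slope * xi + intercept for xi in x]

print(mean_squared_error(y, predictions))  # 0.0625
```

Only the third point misses the line, by 0.5, so the result is 0.5 squared divided by 4.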
The line equation is given as y = Mx + C
M is the slope, and C is the intercept. The objective is to find the values of M and C that best fit the data and minimize the error.
Why Support Vector Regression? Difference between SVR and a simple regression model?
The objective of a simple regression model is to minimize the error, while SVR tries to keep the error within a certain threshold.
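The contrast can be illustrated by comparing the squared loss with the epsilon-insensitive loss that is commonly associated with SVR. This is a sketch of the loss functions only, not of a full SVR solver, and the threshold of 0.5 is an arbitrary value chosen for the example.

```python
def squared_loss(residual):
    # Every deviation is penalized, however small.
    return residual ** 2

def epsilon_insensitive_loss(residual, epsilon=0.5):
    # Deviations inside the threshold cost nothing;
    # beyond it, the cost grows linearly.
    return max(0.0, abs(residual) - epsilon)

for residual in (0.2, -0.5, 1.5):
    print(residual, squared_loss(residual), epsilon_insensitive_loss(residual))
```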
In SVR, the best-fit line is the hyperplane that has the maximum number of points within that threshold. SVR places boundary lines at a distance ‘e’ on either side of the base hyperplane so that as many data points as possible lie between them; the points on or outside the boundaries act as the support vectors.
Conclusion
The ordinary least squares technique estimates the unknown coefficients by minimizing the sum of squared residuals.
L1 and L2 Regularization penalize the loss function with the absolute value and the square of the coefficients, respectively.
The R-square value indicates how much of the variation of the response around its mean the model explains.
R-square has some drawbacks, and to overcome these drawbacks, we use adjusted R-Square.
Mean square error averages the squared distances between the data points and the regression line.
SVR fits the error within a certain threshold instead of minimizing it.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
This article was published as a part of the Data Science Blogathon
Overview
Step by Step approach to Perform EDA
Resources Like Blogs, MOOCS for getting familiar with EDA
Getting familiar with various Data Visualization techniques, charts, plots
Demonstration of some steps with Python Code Snippet
What is that one thing that differentiates one data science professional from the other?
Not Machine Learning, not Deep Learning, not SQL. It’s Exploratory Data Analysis (EDA). How good one is at identifying hidden patterns/trends in the data, and how valuable the extracted insights are, is what differentiates data professionals.
1. What Is Exploratory Data Analysis
EDA assists Data science professionals in various ways:-
3. Getting a better understanding of the problem statement
[Note: this blog uses the iris dataset.]
2. Checking Introductory Details About Data
The first step of any data analysis, after loading the data file, should be checking a few introductory details: the number of columns, the number of rows, the types of features (categorical or numerical), and the data types of the column entries.
Python Code Snippet
data.head()  # display the first five rows
data.tail()  # display the last five rows
3. Statistical Insight
This step gives details about summary statistics such as the mean, standard deviation, median, maximum value, and minimum value.
Python Code Snippet
data.describe()
4. Data cleaning
This is the most important step in EDA. It involves removing duplicate rows/columns, filling empty entries with values like the mean/median of the data, dropping various values, and removing null entries.
Checking Null entries
data.isnull().sum()  # number of missing values for each variable
Removing Null Entries
data.dropna(axis=0, inplace=True)  # drop rows that contain null entries
Filling values in place of Null Entries (If Numerical feature)
Values can be the mean, the median, or any integer.
Checking Duplicates
data.duplicated().sum()  # total number of duplicate entries
Removing Duplicates
data.drop_duplicates(inplace=True)
5. Data Visualization
Data visualization is the method of converting raw data into a visual form, such as a map or graph, to make data easier for us to understand and extract useful insights.
The main goal of data visualization is to put large datasets into a visual representation. It is one of the important and simple steps in data science.
You can refer to the blog below for more details about Data Visualization.
Various types of visualization analysis:
a. Uni Variate analysis:
This shows every observation/distribution in the data on a single data variable. It can be shown with the help of various plots like scatter plots, line plots, histograms (summary plots), box plots, violin plots, etc.
b. Bi-Variate analysis:
c. Multi-Variate analysis:
Scatterplots, histograms, box plots, and violin plots can be used for Multivariate Analysis.
Various Plots
Below are some of the plots that can be used for Univariate, Bivariate, and Multivariate analysis.
a. Scatter Plot
Python Code Snippet
sns.scatterplot(x=data['sepal_length'], y=data['sepal_width'], hue=data['species'], s=50)
For multivariate analysis
Python Code Snippet
sns.pairplot(data, hue="species", height=4)
b. Box Plot
Python Code Snippet
plt.show()
c. Violin Plot
More informative than a box plot, and shows the full distribution of the data.
Python Code Snippet
It can be used for visualizing the probability density function (PDF).
Python Code Snippet
You can refer to the blog mentioned below to get familiar with Exploratory Data Analysis
Exploratory Data Analysis: Iris Dataset