Visualizing Covid Data With Plotly
This article was published as a part of the Data Science Blogathon.
Introduction
The graphical or pictorial representation of data and information is called Data Visualization. Using tools like graphs, charts, and maps, data visualization provides a very effective and efficient way of finding trends, outliers, and patterns in data that might otherwise go unnoticed.
Data visualization tools and technologies are highly essential in the world of Big Data, to access and analyze massive amounts of information and make data-driven decisions.
Quickens the Decision-making process
Easily identify hidden patterns
Getting business insights
Finding errors in beliefs
Storytelling about the data is more engaging
Helps non-technical background people understand the data better
Identify new trends
Data visualization can be described as another form of art, that grabs our eyes and attention, and keeps us focused on the underlying message. While viewing a chart we can easily and quickly see upcoming or ongoing trends, outliers, etc. And this visual representation helps us digest the facts faster.
You know how much more effective data visualization can be if you’ve ever stared at a massive Excel sheet and couldn’t make head or tail of it.
Today we will do Data Visualization of covid datasets across the world. This dataset can be found on Kaggle, linked here.
Image Source: Times of India
Plotly
We will use Plotly for this. It is an open-source graphing library for Python that produces interactive, publication-quality graphs. The company behind it, also called Plotly, is headquartered in Montreal, Quebec, and develops online data analytics and visualization tools.
They provide online graph creation, analytics, and statistical tools for individuals as well as corporations, along with scientific graphing libraries for Python, R, MATLAB, Perl, Julia, Arduino, and REST.
Image Source: Wikipedia
Importing Libraries
First, we install chart-studio, the package for interfacing with Plotly’s Chart Studio services (both Chart Studio Cloud and Chart Studio On-Prem).

    !pip install chart_studio

Next, we import the necessary modules and libraries:

    import pandas as pd
    import numpy as np
    import chart_studio.plotly as py
    import cufflinks as cf
    import seaborn as sns
    import plotly.express as px
    %matplotlib inline

    # Make Plotly work in your Jupyter Notebook
    from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
    init_notebook_mode(connected=True)

    # Use Plotly locally
    cf.go_offline()

Loading the Country wise Dataset
Let’s take a look at the dataset first:

    country_wise = pd.read_csv('/kaggle/input/corona-virus-report/country_wise_latest.csv')
    print("Country Wise Data shape =", country_wise.shape)
    country_wise.head()

The last column is named “WHO Region”. Due to a technical glitch, it was not visible in the screenshot.

    country_wise.info()

Histogram Plot
Let us visualize total deaths from all the countries. Because there are so many countries, I have divided them into several plots.
A) Deaths in first 50 countries

    import plotly.graph_objects as go

    # Display deaths due to covid for various countries
    fig = px.bar(country_wise.head(50), y='Deaths', x='Country/Region', text='Deaths', color='Country/Region')
    # fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
    # Set the minimum label font size (adding uniformtext_mode='hide' would hide labels that don't fit)
    fig.update_layout(uniformtext_minsize=8)
    # Rotate labels 45 degrees
    fig.update_layout(xaxis_tickangle=-45)
    fig

B) Deaths in the next 50 countries

    fig1 = px.bar(country_wise[50:101], y='Deaths', x='Country/Region', text='Deaths', color='Country/Region')
    # Put the bar total above each bar with 2 significant digits
    fig1.update_traces(texttemplate='%{text:.2s}', textposition='outside')
    # Set the minimum label font size (adding uniformtext_mode='hide' would hide labels that don't fit)
    fig1.update_layout(uniformtext_minsize=8)
    # Rotate labels 45 degrees
    fig1.update_layout(xaxis_tickangle=-45)
    fig1

C) Deaths in the next 50 countries

    fig1 = px.bar(country_wise[101:151], y='Deaths', x='Country/Region', text='Deaths', color='Country/Region')
    # Put the bar total above each bar with 2 significant digits
    fig1.update_traces(texttemplate='%{text:.2s}', textposition='outside')
    # Set the minimum label font size (adding uniformtext_mode='hide' would hide labels that don't fit)
    fig1.update_layout(uniformtext_minsize=8)
    # Rotate labels 45 degrees
    fig1.update_layout(xaxis_tickangle=-45)
    fig1

D) Deaths in the rest of the countries

    fig1 = px.bar(country_wise[151:], y='Deaths', x='Country/Region', text='Deaths', color='Country/Region')
    # Put the bar total above each bar with 2 significant digits
    fig1.update_traces(texttemplate='%{text:.2s}', textposition='outside')
    # Set the minimum label font size (adding uniformtext_mode='hide' would hide labels that don't fit)
    fig1.update_layout(uniformtext_minsize=8)
    # Rotate labels 45 degrees
    fig1.update_layout(xaxis_tickangle=-45)
    fig1

E) Pie chart for total cases in all the Asian countries

    worldometer = pd.read_csv('/kaggle/input/corona-virus-report/worldometer_data.csv')
    worldometer_asia = worldometer[worldometer['Continent'] == 'Asia']
    px.pie(worldometer_asia, values='TotalCases', names='Country/Region',
           title='Total cases in Asian countries',
           color_discrete_sequence=px.colors.sequential.RdBu)

F) Code for the animated transition of confirmed cases from 22 Jan 2020 to July 2020
Note: The animation could not be added to this article, but if you run the code, it will play seamlessly.

    full_grouped = pd.read_csv('/kaggle/input/corona-virus-report/full_grouped.csv')
    india = full_grouped[full_grouped['Country/Region'] == 'India']
    us = full_grouped[full_grouped['Country/Region'] == 'US']
    russia = full_grouped[full_grouped['Country/Region'] == 'Russia']
    china = full_grouped[full_grouped['Country/Region'] == 'China']
    df = pd.concat([india, us, russia, china], axis=0)

    # Watch the bars change as confirmed cases grow over time
    fig = px.bar(df, x="Country/Region", y="Confirmed", color="Country/Region",
                 animation_frame="Date", animation_group="Country/Region",
                 range_y=[0, df['Confirmed'].max() + 100000])
    # Shorten the time spent on each frame when the play button is pressed
    fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 1
    fig

The end result of the animation
Now we plot a histogram for deaths across all the Asian Countries.

    # nbins sets the maximum number of bars to make
    # We can define the x label, color, and title
    # marginal adds another plot above the histogram (violin, box, rug)
    fig = px.histogram(worldometer_asia, x='TotalDeaths', nbins=20,
                       labels={'value': 'Total Deaths'},
                       title='Death Distribution of Asia Continent',
                       marginal='violin', color='Country/Region')
    fig.update_layout(
        xaxis_title_text='Total Deaths',
        showlegend=True
    )

So as you can see, India had the highest number of deaths, around 40-45k, which is really sad.
G) A box plot to represent total cases distribution across Asia and Europe

    # A box plot allows you to compare different variables
    # The box shows the quartiles of the data. The bar in the middle is the median
    # The whiskers extend to all the other data aside from the points that are considered
    # to be outliers
    # Complex Styling
    fig = go.Figure()
    # Show all points, spread them so they don't overlap and change whisker width
    fig.add_trace(go.Box(y=worldometer_asia['TotalCases'], boxpoints='all', name='Asia',
                         fillcolor='blue', jitter=0.5, whiskerwidth=0.2))
    fig.add_trace(go.Box(y=worldometer[worldometer['Continent'] == 'Europe']['TotalCases'],
                         boxpoints='all', name='Europe',
                         fillcolor='red', jitter=0.5, whiskerwidth=0.2))
    # Change background / grid colors
    fig.update_layout(title='Asia vs Europe total cases distribution',
                      yaxis=dict(gridcolor='rgb(255, 255, 255)', gridwidth=3),
                      paper_bgcolor='rgb(243, 243, 243)',
                      plot_bgcolor='rgb(243, 243, 243)')

Bonus: Creating an interactive globe map
This is one of my favourite features of Plotly, used together with another module called Pycountry. We can create an interactive global map that displays the total Coronavirus cases in different regions. I strongly urge you to run this code and see how this map works.

    import pycountry

    # Map Worldometer's short names to names that pycountry can resolve
    name_fixes = {
        'USA': 'United States',
        'UAE': 'United Arab Emirates',
        'Ivory Coast': "Côte d'Ivoire",
        'S. Korea': 'Korea',
        'N. Korea': 'Korea',
        # DRC is the Democratic Republic of the Congo, not the Republic of the Congo
        'DRC': 'Congo, The Democratic Republic of the',
        'Channel Islands': 'Jersey',
    }
    worldometer['Country/Region'] = worldometer['Country/Region'].replace(name_fixes)

    exceptions = []

    def get_alpha_3_code(cou):
        try:
            return pycountry.countries.search_fuzzy(cou)[0].alpha_3
        except LookupError:
            # search_fuzzy raises LookupError when nothing matches
            exceptions.append(cou)
            return None

    worldometer['iso_alpha'] = worldometer['Country/Region'].apply(get_alpha_3_code)

    # Remove the countries that could not be matched
    worldometer = worldometer[~worldometer['Country/Region'].isin(exceptions)]

    fig = px.scatter_geo(worldometer, locations="iso_alpha",
                         color="Continent",           # which column to use to set the color of markers
                         hover_name="Country/Region", # column added to hover information
                         size="TotalCases",           # size of markers
                         projection="orthographic")
    fig

You can rotate the globe using your cursor and view the figures for every country. A very tidy and neat visualization in my opinion.
End Notes
Plotly is one of my favorite go-to libraries for visualization, alongside Matplotlib and Seaborn. I would like to write a blog about it someday as well. If you like what you see and want to check out more of my writings, you can do so here:
I hope you had a good time reading this article. Thank you for reading, Cheers!!
The media shown in this article on visualizing covid data in plotly are not owned by Analytics Vidhya and are used at the Author’s discretion.
Stock Market Analysis With Pandas – Datareader And Plotly For Beginners
This article was published as a part of the Data Science Blogathon
Introduction
You must have come across news articles about stocks rallying, stocks falling, and so on. Stock markets are volatile, and stock prices go up and down daily. Keeping track of such changes and trends can be tedious for a data professional. In this article, we will perform a stock market analysis of a few popular internet tech companies.
Analysing stock prices demands a dataset that is continuously updated. For such scenarios, pandas has a companion library called pandas-datareader, which helps us import data from the internet. Do check out more about the pandas-datareader library from here.
We will use Yahoo Finance to import stock market data for our analysis. We’ll study the stocks of 5 popular tech companies. The list of stocks for the analysis is as below:
Google
Amazon
Microsoft
Apple
Facebook
Let’s take a look at the data from the last 5 years to have an understanding of how stocks have behaved. First, we will search for tickers of the above companies on Yahoo. Ticker is a unique stock symbol with a series of letters assigned to a particular stock for trading purposes.

    tickers = ['GOOG', 'AMZN', 'MSFT', 'AAPL', 'FB']

Now, we will import pandas-datareader and the necessary libraries. If you haven’t installed it, install it in your notebook with the command below.

    !pip install pandas-datareader

Then, we import the necessary libraries.

    import pandas_datareader as data
    import pandas as pd

Getting the Data
We get our data from Yahoo for the last 5 years. For each stock, we import the data separately. We then concatenate all the stock data into a single dataframe for our analysis.
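The download code itself is not shown in this article. As a minimal sketch of the concatenation step only, the example below builds one small frame per ticker from synthetic prices (the `fake_prices` helper, the dates, and the values are all hypothetical stand-ins for the Yahoo data) and combines them so that each ticker becomes the outer level of a two-level column index.

```python
import numpy as np
import pandas as pd

tickers = ['GOOG', 'AMZN', 'MSFT', 'AAPL', 'FB']

# Synthetic stand-in for downloaded prices (hypothetical values, no network access)
dates = pd.date_range('2021-01-04', periods=5, freq='B')
rng = np.random.default_rng(0)

def fake_prices(base):
    """Return a small OHLC-style frame for one ticker."""
    close = base + rng.normal(0, 1, len(dates)).cumsum()
    return pd.DataFrame({'Open': close - 0.5, 'High': close + 1.0,
                         'Low': close - 1.0, 'Close': close}, index=dates)

frames = [fake_prices(100 + 10 * i) for i in range(len(tickers))]

# keys= adds the ticker as the outer column level, giving a two-level column index
df = pd.concat(frames, axis=1, keys=tickers)
df.columns.names = ['Stock Ticker', 'Stock Info']

goog = df.xs(key='GOOG', axis=1, level='Stock Ticker')    # every field for one ticker
closes = df.xs(key='Close', axis=1, level='Stock Info')   # one field for every ticker
```

With real data, each entry of `frames` would be the frame returned by the data reader for one ticker; the `pd.concat(..., keys=tickers)` call stays the same.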
We give names to our columns for better data interpretation.

    df.columns.names = ['Stock Ticker', 'Stock Info']
    df.head()

Our data is in multi-index format. We have to take a cross-section of the data for analysis. Read more about dealing with multilevel indexing here.

    df.xs(key='GOOG', axis=1, level='Stock Ticker')

Data Visualisation
We will use the cross-section to pull stock data and visualise stock movements over the last 5 years. We will use charts to understand the stock movements.
For visualisations, we will use a library called Plotly. Plotly is a graphing library popular for creating interactive charts. With Plotly, we can understand the stock movement in real-time.
Let’s import the necessary visualisation libraries below:

    import matplotlib.pyplot as plt
    import plotly.express as px
    import plotly.graph_objects as go
    %matplotlib inline

We analyse Google’s closing price over the last 5 years with the code below.

    px.line(df.xs(key='GOOG', axis=1, level='Stock Ticker')['Close'])

Let’s see the stock movement for a specific time period and analyse price fluctuations in that period.

    px.line(df.xs(key='GOOG', axis=1, level='Stock Ticker')['Close'], range_x=['2023-01-01','2023-12-31'])

Comparing two stocks helps us understand which one is performing better. We compare Google and Amazon stocks below.

    px.line(df.xs(key='Close', axis=1, level='Stock Info')[['GOOG', 'AMZN']])

After analysing each stock, we can analyse all of them together by taking a cross-section of their closing or opening prices.

    c = df.xs(key='Close', axis=1, level='Stock Info')
    c.head()

Let’s look at the chart containing the movement of all stocks for the given time period.

    fig = px.line(c)
    fig.show()

The chart below shows each stock’s behaviour separately. This helps us differentiate between underperforming stocks and better-performing ones.

    fig = px.area(c, facet_col='Stock Ticker', facet_col_wrap=3)
    fig.show()

Amazon and Google are clearly outperforming other stocks. There is consistent growth in both stocks over the last 5 years.
Specify the time period for which you want to see the stock performance. During the Covid-19 outbreak, we can see stocks crashing and then recovering after a certain time.

    fig = px.line(c, range_x=['2020-01-01','2020-12-31'])
    fig.show()

Candlestick Charts:
Candlestick charts are popular in stock market analysis. They help us understand past stock movements and give us insight into a stock’s open, close, high, and low prices. Green candlesticks show a positive movement and red ones show a decline in the stock.
A candlestick has a body in the middle and sticks at its ends. The body showcases the opening and closing price of the stock. Two ends which are called shadows represent the high and low values of the day respectively for a particular stock.
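To make the anatomy concrete, here is a small sketch that splits one day’s prices into the candle’s colour, body, and shadows. The function name `describe_candle` and the sample prices are hypothetical, and treating a day whose close equals its open as green is an assumption made for this example.

```python
def describe_candle(open_, high, low, close):
    """Split one day's prices into candle colour, body, and shadows."""
    body_top, body_bottom = max(open_, close), min(open_, close)
    return {
        # Assumption: a flat day (close == open) is drawn as green
        'colour': 'green' if close >= open_ else 'red',
        'body': body_top - body_bottom,        # distance between open and close
        'upper_shadow': high - body_top,       # wick above the body, up to the high
        'lower_shadow': body_bottom - low,     # wick below the body, down to the low
    }

# Hypothetical prices for a rising day and a falling day
up_day = describe_candle(open_=100.0, high=108.0, low=98.0, close=105.0)
down_day = describe_candle(open_=105.0, high=106.0, low=99.0, close=101.0)
```

Here `up_day` is a green candle with a body of 5 and shadows of 3 above and 2 below, while `down_day` is a red candle with a body of 4.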
Let’s look at the code for creating a candlestick chart for Google:

    fig = go.Figure(data=[go.Candlestick(x=df.index,
                                         open=df['GOOG']['Open'],
                                         high=df['GOOG']['High'],
                                         low=df['GOOG']['Low'],
                                         close=df['GOOG']['Close'])])
    fig.update_layout(xaxis_rangeslider_visible=False)
    fig.show()

For a more visually appealing candlestick chart, we can use Cufflinks. Cufflinks is a library that connects Plotly to pandas for better visuals. Let’s import Cufflinks and create a candlestick chart for the year 2023.

    import cufflinks as cf
    cf.go_offline()
    google = df['GOOG'][['Open', 'High', 'Low', 'Close']].loc['2023-01-01':'2023-11-30']
    google.iplot(kind='candle')

Conclusion:
And just like that, we completed our stock market analysis. First, we imported the pandas-datareader library. We specified the source of our data, the time period, and the list of stocks for analysis. We imported the data and combined it into a single dataframe. The data has multilevel indexing, so we took a cross-section of the dataframe to analyse each stock and its movement.
We created interactive visualisation charts with Plotly and compared a few stocks for a specific time period. Candlestick charts are important for getting a better understanding of past stock movements. Plotly and Cufflinks can be used to create visually appealing candlestick graphs. I hope this article has equipped you to do stock analysis on your own. Good luck!
Image credit: Photo by Maxim Hopman on Unsplash
Creating Interactive Visualizations Using Plotly In Python
Introduction
In today’s world, every second the data keeps on getting bigger and bigger. In order to understand the data quickly and to draw insights, data visualization becomes necessary.
For e.g. consider a case where you are asked to illustrate crucial sales aspects (like sales performance, target, revenue, acquisition cost, etc.) from huge amounts of sales data, which one would you prefer:
Exploring the data using different types of sales graphs and charts
Obviously, you would prefer graphs and charts. So data visualization plays a key role in data exploration and data analysis.
Data Visualization is the technique to represent the data/information in a pictorial or graphical format. It enables the stakeholders and decision-makers to analyze and explore the data visually and uncover deep insights.
“Visualization gives you answers to questions you didn’t know you had.” – Ben Shneiderman
Benefits of Data Visualization
Helps in data analysis, data exploration and makes the data more understandable.
Summarises the complex quantitative information in a small space.
Helps in discovering the latest trends, hidden patterns in the data.
Identifies the relationships/correlations between the variables.
Helps in examining the areas that need attention or improvement.
Why Plotly?
There are several libraries available in Python, like Matplotlib and Seaborn, for data visualization. But they render only static images of the charts/plots, and because of this, many crucial things get lost in the visualization. Wouldn’t it be amazing if we could interact with the charts by hovering over or zooming in? Plotly allows us to do exactly that.
Plotly is an open-source data visualization library to create interactive and publication-quality charts/graphs.
Plotly offers implementation of many different graph types/objects like line plot, scatter plot, area plot, histogram, box plot, bar plot, etc.
Plotly supports interactive plotting in commonly used programming languages like Python, R, MATLAB, Javascript, etc.
In this post, we will cover the most commonly used graph types using Plotly. So let’s get started using the Cars93 dataset available on Kaggle.
The dataset contains 27 car parameters (like manufacturer, make, price, horsepower, engine size, weight, cylinders, airbags, passengers, etc.) of 93 different cars.
The dataset looks like this:
Installing Plotly
In order to install Plotly, use the following command in the terminal.

    pip install plotly

Plotly comes with a few modules for creating visualizations, giving us a choice of how to use it.
express: A high-level interface for creating quick visualizations. It’s a wrapper around Plotly graph_objects module.
graph_objects: A low-level interface to figures, traces, and layouts. It’s highly customizable in general for different graphs/charts.
figure_factory: Figure Factories are dedicated functions for creating very specific types of plots. The module predates Plotly Express, and many of its functions are now considered “legacy”.
Having known and installed Plotly, now let’s plot different graphs/charts using it.
1. Box Plot
A box plot (or box-and-whisker plot) is a standardized way to display the distribution of quantitative data based on a Five-Point summary (minimum, first quartile(Q1), median(Q2), third quartile(Q3), and maximum).
The box extends from the Q1 to Q3 quartile values, whereas the whiskers extend from the edges of the box to the most extreme data points that lie within 1.5 × IQR of the box, where IQR = Q3 – Q1.
The best thing about this visualization is that we can start interacting with it by hovering in to see the quantiles values.
Similarly, we can customize it as required. For example, we can draw a box plot of Price for each AirBags type.
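The whisker rule described above can be checked numerically. The sketch below uses only the standard library and a made-up list of prices (not the real Cars93 values) to compute the quartiles, the IQR, and the two fences. Quartile conventions differ between tools, so the numbers a plotting library reports may differ slightly from these.

```python
import statistics

# Hypothetical prices (in thousands of dollars), not the real Cars93 column
prices = [8, 10, 11, 12, 14, 15, 16, 18, 20, 22, 61]

# Quartiles using the 'inclusive' convention; other tools may use a different one
q1, q2, q3 = statistics.quantiles(prices, n=4, method='inclusive')
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations still inside the fences
whisker_low = min(p for p in prices if p >= lower_fence)
whisker_high = max(p for p in prices if p <= upper_fence)

# Anything beyond the fences is drawn as an individual outlier point
outliers = [p for p in prices if p < lower_fence or p > upper_fence]
```

For this sample the box spans 11.5 to 19.0, the whiskers reach 8 and 22, and the value 61 is flagged as an outlier.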
2. Histogram
A histogram is an approximate representation of the distribution of numerical data.
To construct a histogram, follow these steps −
Bin (or bucket) the range of values – Divide the entire range of values into a series of intervals.
Count how many values fall into each interval.
Let’s draw a histogram for cars’ Horsepower feature.
Here, X-axis is about bin ranges of Horsepower whereas Y-axis talks about frequency/count in each bin.
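The two steps above (bin, then count) can be sketched in plain Python. The horsepower values and the bin width of 50 are hypothetical choices for illustration, not taken from the Cars93 data.

```python
# Hypothetical horsepower readings, not the real Cars93 column
horsepower = [55, 63, 70, 74, 85, 90, 92, 100, 110, 130, 140, 155, 170, 200, 300]

bin_width = 50
start = (min(horsepower) // bin_width) * bin_width   # left edge of the first bin

# Step 1 (bin): map each value to the left edge of its interval [edge, edge + width)
# Step 2 (count): tally how many values land in each interval
counts = {}
for hp in horsepower:
    edge = start + ((hp - start) // bin_width) * bin_width
    counts[edge] = counts.get(edge, 0) + 1

# A text rendering of the histogram: one '#' per value in the bin
for edge in sorted(counts):
    print(f"[{edge}, {edge + bin_width}): {'#' * counts[edge]}")
```

Empty intervals (here 250 to 300) simply do not appear in `counts`; a plotting library would draw them as gaps.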
3. Density Plot
The density plot is a variation of a histogram, where instead of representing the frequency on the Y-axis, it represents the PDF (Probability Density Function) values.
It’s helpful in determining the Skewness of the variable visually.
Also, useful in assessing the importance of a continuous variable for a classification problem.
The density plot of Horsepower based on AirBags type is as shown below.
4. Bar Chart
A bar chart represents categorical data with rectangular bars with heights proportional to the values that they represent.
A bar plot shows comparisons among discrete categories.
The bar chart of the Type feature is as shown below.
Similarly, we can customize it to display MPG.city mean on the Y-axis, instead of displaying count.
5. Pie Chart
Pie Chart is used to represent the numerical proportion of the data in a circular graph.
The whole area of the chart represents 100% of the data, the arc length of each slice represents the relative percentage part of the whole.
The pie chart of the Type feature is as shown below.

6. Scatter Plot
A scatter plot uses dots to represent values for two different numeric variables.
It is really helpful in observing the relationship between two numeric variables.
Let’s draw a scatter plot, in order to assess the relationship between Horsepower and MPG.city.
From this plot, we can observe that as the Horsepower increases, MPG in the city decreases.
Plotly also provides a way to draw 3D scatter plots. Let’s draw the same using Horsepower, MPG.city, and Price features.
Similarly, we can draw a scatter plot matrix (a grid/matrix of scatter plots) to assess pairwise relationships for each combination of variables.
7. Line Chart
A line chart is a type of chart that displays information as a series of data points called ‘markers’ connected by straight line segments.
It is similar to a scatter plot except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments.
Line graphs are usually used to find relationships between two numeric variables or to visualize a trend in time series data.
Let’s draw a line chart, in order to assess the relationship between Horsepower and MPG.city.
8. Heatmap
A heatmap is a two-dimensional graphical representation of data where matrix values are represented in different shades of colors.
A heatmap aims to provide a color-coded visual summary of data/information.
Seaborn allows annotated heatmaps as well.
Let’s draw a heatmap to represent the correlation matrix of cars93 data.
9. Violin Plot
Violin plots are similar to box plots, except that they also show the probability density of the data at different values. In other words, the violin plot is a combination of a box plot and density plot.
Broader sections of the violin plot indicate higher probability, whereas the narrow sections indicate lower probability.
The Violin plot of the Price feature is shown below.
Similarly, we can customize it using Plotly to display the box and all the data points.
10. Word Cloud
Word Cloud is a visualization technique to represent the frequency of words within a given text segment.
The size of a word indicates how frequently it occurs in the text. The bigger the size, the greater the importance(frequency), whereas the smaller the size, the lesser the importance(frequency).
Word clouds are often used for representing the frequency of words within text documents, reports, website data, public speeches, etc.
Word cloud of a chosen text document is as shown below.
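Rendering the cloud itself needs a third-party package, but the idea behind it (word frequency mapped to font size) can be sketched with the standard library. The text, the stop-word list, and the 12 to 48 point size range below are all hypothetical choices.

```python
import re
from collections import Counter

# Hypothetical text and stop-word list chosen for illustration
text = ("Data visualization turns data into charts. Charts reveal trends, "
        "and trends in data guide decisions.")
stopwords = {'into', 'and', 'in', 'the', 'a', 'of'}

# Lower-case the text, split it into words, and drop the stop words
words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]
freq = Counter(words)

# Scale font sizes linearly between a minimum and a maximum point size
min_size, max_size = 12, 48
top = max(freq.values())
sizes = {w: min_size + (max_size - min_size) * (n - 1) // max(top - 1, 1)
         for w, n in freq.items()}
```

Here "data" appears three times and gets the largest size, "charts" and "trends" appear twice and land in the middle, and every word that appears once gets the minimum size.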
Machine Learning With Limited Data
This article was published as a part of the Data Science Blogathon.
Introduction
In machine learning, the amount and quality of the data are essential to model training and performance. The amount of data strongly affects machine learning and deep learning algorithms, and most algorithms behave differently as the amount of data increases or decreases. When data is limited, machine learning algorithms must be handled carefully to get accurate models and better results. Deep learning algorithms in particular are data-hungry, requiring a large amount of data for good accuracy.
In this article, we will discuss how the amount and quality of data relate to machine learning and deep learning algorithms, the problems caused by limited data, and ways of dealing with it. Knowing these key concepts will help you understand the algorithm vs. data scenario and deal with limited data efficiently.
The “Amount of Data vs. Performance” Graph
In machine learning, a question may come to mind: how much data is required to train a good machine learning or deep learning model? There is no fixed answer or universal threshold, as every dataset is different and has different features and patterns. Still, there is usually some threshold after which the performance of a machine learning or deep learning algorithm tends to become constant.
Most of the time, machine learning and deep learning models perform better as the amount of data fed to them increases, but beyond some amount of data the performance of the model levels off and it stops learning from additional data.
The above picture shows the performance of some well-known machine learning and deep learning architectures against the amount of data fed to the algorithms. Here we can see that traditional machine learning algorithms learn a lot from the data in the initial period, while the amount of data fed is increasing, but once a threshold is reached, the performance becomes constant. If you now provide more data to the algorithm, it will not learn anything new, and its performance will neither increase nor decrease.
In the case of deep learning algorithms, there are three types of deep learning architectures in the diagram. The shallow type of deep learning architecture is a small architecture in terms of depth, meaning that it has fewer hidden layers and neurons than the other deep learning architectures. In the case of deep neural networks, the number of hidden layers and neurons is very high, and the network is very deep.
From the diagram, we can see that the three deep learning architectures perform differently as the amount of data fed increases. Shallow neural networks tend to behave like traditional machine learning algorithms, where the performance becomes constant after some threshold amount of data. Deep neural networks, by contrast, keep learning from the data when new data is fed.
From the diagram, we can conclude that,
” THE DEEP NEURAL NETWORKS ARE DATA HUNGRY “
What Problems Arise with Limited Data?
Several problems occur with limited data, and a model may perform poorly if trained on it. The common issues that arise with limited data are listed below:
1. Classification:
In classification, if too little data is fed, the model will classify observations wrongly, meaning that it will not give the correct output class for the given inputs.
2. Regression:
In a regression problem, the model predicts a number. If it is trained on limited data, its accuracy may be low, and its predictions may be far from the actual output.
3. Clustering:
In clustering problems, a model trained with limited data can assign points to the wrong clusters.
4. Time Series:
In time series analysis, we forecast data for the future. A time series model with low accuracy can give poor forecasts, with many time-related errors.
5. Object Detection:
If an object detection model is trained on limited data, it might not detect the object correctly, or it might classify the object incorrectly.
How to Deal With Problems of Limited Data?
There is no single accurate or fixed method for dealing with limited data. Every machine learning problem is different, and so is the way of solving it. But some standard techniques are helpful in many cases.
1. Data Augmentation
Data augmentation is the technique in which the existing data is used to generate new data. The newly generated data looks like the old data, but some of its values and parameters are different.
This approach can increase the amount of data, and there is a high likelihood of improving the model’s performance.
Data augmentation is preferred in most deep-learning problems, where there is limited data with images.
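As a toy illustration of the idea, the sketch below augments a tiny grayscale image (a list of pixel rows) with a horizontal flip and a small random perturbation. The helper names, the image, and the noise scale are hypothetical; real projects would normally rely on an image library rather than nested lists.

```python
import random

def horizontal_flip(image):
    """Mirror a 2-D image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def add_noise(image, scale, rng):
    """Shift every pixel by a small random amount, clipped to the 0-255 range."""
    return [[min(255, max(0, pixel + rng.randint(-scale, scale))) for pixel in row]
            for row in image]

# Hypothetical 2 x 3 grayscale image
image = [[0, 50, 100],
         [150, 200, 250]]
rng = random.Random(42)   # fixed seed so the example is repeatable

# One original sample becomes three training samples
augmented = [image, horizontal_flip(image), add_noise(image, scale=5, rng=rng)]
```

The new samples resemble the original but are not identical to it, which is the property the paragraph above describes.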
2. Don’t Drop and Impute:
In some datasets, a high fraction of the data is invalid or empty. Such rows are often dropped to keep the process simple, but doing so decreases the amount of data, and several problems can occur. Instead of dropping these observations, we can impute the missing values and keep them.
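The difference between dropping and imputing can be seen on a tiny example. The records and the `impute_mean` helper below are hypothetical, and mean imputation is only one of several possible strategies.

```python
import statistics

# Hypothetical records; None marks a missing value
rows = [
    {'age': 25, 'income': 40},
    {'age': None, 'income': 50},
    {'age': 35, 'income': None},
    {'age': 30, 'income': 45},
]

def impute_mean(records, column):
    """Replace missing values in one column with the mean of the observed values."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = statistics.mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]

# Dropping incomplete rows keeps only half of the data
dropped = [r for r in rows if None not in r.values()]

# Imputing keeps every row
imputed = impute_mean(impute_mean(rows, 'age'), 'income')
```

Dropping leaves 2 of the 4 rows, while imputing keeps all 4, with the missing age filled as 30 and the missing income as 45.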
3. Custom Approach:
If there is a case of limited data, one could search for the data on the internet and find similar data. Once this type of data is obtained, it can be used to generate more data or be merged with the existing data.
Conclusion
In this article, we discussed limited data, how the performance of several machine learning and deep learning algorithms changes as the amount of data increases or decreases, the types of problems that can occur due to limited data, and the common ways to deal with it. This article will help you understand limited data, its effects on performance, and how to handle it.
Some Key Takeaways from this article are:
1. Traditional machine learning algorithms and shallow neural networks stop improving with more data once a threshold level is reached.
2. Deep neural networks are data-hungry algorithms that keep learning as more data is fed.
3. Limited data can cause problems in every field of machine learning applications, e.g., classification, regression, time series, image processing, etc.
4. We can apply Data augmentation, imputation, and some other custom approaches based on domain knowledge to handle the limited data.
Want to Contact the Author?
Follow Parth Shukla @AnalyticsVidhya, LinkedIn, Twitter, and Medium for more content.
Can Machine Learning Calculate Unreported Covid
How can unidentified COVID-19 cases be tracked?
Researchers and provider organisations have increasingly embraced artificial intelligence (AI) and machine learning (ML) tools to reduce and track the spread of COVID-19 and to improve their surveillance efforts. Big data analytics systems have helped health experts to stay ahead of the pandemic from predicting patient outcomes to anticipating future hotspots, resulting in more efficient care delivery. However, the level of pandemic preparation by healthcare organisations is only as good as the data available to them. Although the industry is well aware of the data issues, the COVID-19 pandemic has brought a host of unique challenges to the forefront of care delivery. Nature of the SARS-CoV-2 has led to significant gaps in COVID-19 data with inconsistencies in information, leaving officials uncertain of the effectiveness of public health interventions. “Asymptomatic infections are a common phenomenon in the spread of coronavirus”, said Lucy Li, PhD, a data scientist at the Chan Zuckerberg Biohub. “And it’s very important to understand that phenomenon because depending on how many asymptomatic infections there are, public health interventions might be different.” Chan Zuckerberg Biohub’s researchers are working to cope up with this situation. Li estimated the number of undetected infections using machine learning and cloud computing at 12 locations including Asia, Europe, and the U.S over the course of the pandemic. The results showed that a vast range of infections remained undetected in these parts of the world with the rate of unidentified cases as high as over 90% in Shanghai. Additionally, when the virus was first contracted in these 12 locations, more than 98% of cases were not reported during the first few weeks of the outbreak. This indicates that the pandemic was already well underway by the time intensive testing began. Such findings have crucial implications on public health policy and provider organisations, Lucy Li noted. 
“For disease outbreaks where you can identify every single infection, rapid testing and a tiny amount of contact tracing is enough to get the epidemic under control,” stated Li. “But for coronavirus, there are so many asymptomatic cases out there and testing alone will not help control the pandemic.” “It is because usually when you do testing, you are testing only symptomatic patients, which are a subset of the total number of infections out there,” explains Li. “You’re missing a lot of people who are spreading the infection without their knowledge, hence they are not quarantined. Being able to get a sense of what that number might be is helpful for allocating resources.” Li’s research was backed by the AWS Diagnostic Development Initiative, a global effort to stimulate diagnostic research and innovation during the coronavirus pandemic and to mitigate future disease outbreaks. The data Li is using is viral genomes, the genetic material of the virus. She elaborates, “As the viral genomes spread through the population, they accumulate mutations. These mutations are generally not good or bad; they’re just changes in the genome.” She added, “Every time the virus infects a new individual, it could accumulate new mutations. So, if we know how fast the virus mutates, we can infer how many missing transmission links there were in between the observed genomes.” Li said, “Many different scenarios could explain what we see in the viral genomes. I have to leverage machine learning and cloud computing to test all of those hypotheses and to see which one can explain the observed changes in viral genomes.” She pointed out that these data analytics are well-suited to meet the challenges brought by COVID-19. ML tools allow the researchers to explore different explanations of the data they see so that they can test many hypotheses. With ML and cloud computing technologies, it becomes possible to streamline a previously time-consuming task.
With access to more computational resources in the cloud, the time required can be cut from months to days, because the additional memory makes it easier to parallelise the analysis. This research may help health officials monitor the rate of under-reporting in real time, which could indicate how well current surveillance systems are operating. Given the COVID-19 data now available, analytics tools are essential for bringing new insights and potential solutions.
Analyze Covid Vaccination Progress Using Python
This article was published as a part of the Data Science Blogathon
Introduction
Hello Readers!!
Covid-19 has affected our lives in many respects: economically, mentally, and more. In this blog, we are going to explore how the vaccination drive is going around the world. For the past year, we have been hoping for vaccines so that we can enjoy life as we did before.
We hope this vaccination drive will help and save millions of people. We are going to first read the dataset, then clean it, and then draw some visuals.
Dataset
IMPORT LIBRARIES
For analyzing data, we need some libraries. In this section, we import all the libraries required for data analysis: pandas, NumPy, matplotlib, plotly, seaborn, and wordcloud. Check the code below to import them.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.patches as mpatches
from plotly.subplots import make_subplots
from wordcloud import WordCloud
import seaborn as sns
sns.set(color_codes = True)
sns.set(style="whitegrid")
import plotly.figure_factory as ff
from plotly.colors import n_colors
READ DATA AND BASIC INFORMATION
Read the CSV file using the pandas read_csv() function and show the output using the head() function.
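The code for this reading step is not shown in the article. As a minimal sketch, the snippet below reads a tiny in-memory CSV so that it runs without the Kaggle download. The sample rows, the placeholder country names, and the file name mentioned in the comment are assumptions made for illustration, not values from the real dataset.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the Kaggle CSV (made-up rows, illustration only).
# With the real download you would pass its path instead, for example
# pd.read_csv("country_vaccinations.csv")  # hypothetical file name
sample_csv = io.StringIO(
    "country,iso_code,date,total_vaccinations,people_vaccinated\n"
    "Country A,AAA,2021-01-01,100,100\n"
    "Country A,AAA,2021-01-02,,150\n"
    "Country B,BBB,2021-01-01,40,\n"
)

df = pd.read_csv(sample_csv)

# head() returns the first rows (five by default) for a quick look at the data.
print(df.head())
```

Empty fields in the CSV are read as NaN, which is why null values show up as soon as the data is loaded.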
Observation:
The dataset has columns such as country, iso_code, date, total_vaccinations, people_vaccinated, people_fully_vaccinated, etc. An initial look at the above table shows that the data has null values too. We will deal with null values later.
The info() function gives an overview of the data, such as the data type of each feature and the number of non-null values in each column.
df.info()
Observation:
The above picture shows that there are many null values in our dataset. We will deal with these null values later in this blog. There are two data types in the table: object (which means string) and float.
The code below returns the total count of null values in each feature.
df.isnull().sum()
The below picture shows that columns such as country, date, vaccines, and source_name have 0 null values. people_fully_vaccinated has the most, with 2866 null values.
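Raw null counts are easier to compare when expressed as a share of all rows. The following is a small sketch on a made-up frame (the values are illustrative, only the column names follow the article) that ranks columns by the percentage of missing entries.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the article's dataset.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "total_vaccinations": [10.0, np.nan, 5.0, 8.0],
    "people_fully_vaccinated": [np.nan, np.nan, np.nan, 2.0],
})

# isnull() gives a boolean frame; the mean of booleans is the share of True values.
null_share = (df.isnull().mean() * 100).sort_values(ascending=False)
print(null_share)
```

Columns at the top of this ranking are the ones where filling or dropping nulls will change the analysis the most.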
DATA CLEANING
The dataset has many null values, as we have seen before. To get rid of them, we need to clean the data first; after cleaning, we will perform our further analysis. Cleaning the dataset takes several steps, some of which are shown below:
Handling and Filling null values
Change the data type of features
Handling strings like splitting.
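Before the full cleaning code, the three steps above can be sketched on a toy frame. This is an illustrative variant, not the article's exact code: the values are made up, the casts are done in one call through a dictionary, and the date parts come from the pandas `.dt` accessor rather than from splitting the string.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the article's dataset.
df = pd.DataFrame({
    "country": ["A", "A", "B"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-02"],
    "total_vaccinations": [100.0, np.nan, 40.0],
    "daily_vaccinations": [np.nan, 25.0, 40.0],
})

# 1. Fill null values with 0, as the article does.
df = df.fillna(0)

# 2. Cast all count columns to int in a single astype() call.
count_cols = ["total_vaccinations", "daily_vaccinations"]
df = df.astype({col: int for col in count_cols})

# 3. Parse the date once, then read year, month and day from the .dt accessor
#    (an alternative to splitting the date string on "-").
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day

print(df.dtypes)
```

One caveat with filling by 0: for cumulative columns such as total_vaccinations, a day with no report then looks like a drop to zero, which is worth keeping in mind when reading the later plots.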
Check the below code for all the data cleaning that we are performing here:
df.fillna(value = 0, inplace = True)
df.total_vaccinations = df.total_vaccinations.astype(int)
df.people_vaccinated = df.people_vaccinated.astype(int)
df.people_fully_vaccinated = df.people_fully_vaccinated.astype(int)
df.daily_vaccinations_raw = df.daily_vaccinations_raw.astype(int)
df.daily_vaccinations = df.daily_vaccinations.astype(int)
df.total_vaccinations_per_hundred = df.total_vaccinations_per_hundred.astype(int)
df.people_fully_vaccinated_per_hundred = df.people_fully_vaccinated_per_hundred.astype(int)
df.daily_vaccinations_per_million = df.daily_vaccinations_per_million.astype(int)
df.people_vaccinated_per_hundred = df.people_vaccinated_per_hundred.astype(int)
date = df.date.str.split('-', expand =True)
date
df['year'] = date[0]
df['month'] = date[1]
df['day'] = date[2]
df.year = pd.to_numeric(df.year)
df.month = pd.to_numeric(df.month)
df.day = pd.to_numeric(df.day)
df.date = pd.to_datetime(df.date)
df.head()
SOME FEATURES
Let’s get some details about our features using the code below.
print('Data point starts from ',df.date.min(),'\n')
print('Data point ends at ',df.date.max(),'\n')
print('Total no of countries in the data set ',len(df.country.unique()),'\n')
print('Total no of unique vaccines in the data set ',len(df.vaccines.unique()),'\n')
Observation
Data points start from 2020-12-08
Data points end at 2021-02-28
Total Number of countries in the data set = 117
Total Number of Unique Vaccines in the data set = 22
df.info()
DATA VISUALIZATION
In this section, we are going to draw some visuals to get insights from our dataset. So let’s get started.
The describe() function in pandas is used to get summary statistics for each numeric feature in our dataset, including count, max, min, standard deviation, and median.
df.describe()
The unique() function in pandas returns the unique values present in a feature.
df.country.unique()
def size(m,n):
    fig = plt.gcf()
    fig.set_size_inches(m,n)
Word Art of Countries
A word cloud is a unique way to get information from our dataset. The words are shown in the form of art, where the size of each word depends on how often it is repeated in the dataset. This is made using the WordCloud library. Check the code below to see how to draw a word cloud.
wordCloud = WordCloud(background_color='white', max_font_size = 50).generate(' '.join(df.country))
plt.figure(figsize=(15,7))
plt.axis('off')
plt.imshow(wordCloud)
plt.show()
Total Vaccinated Till Date
In this section, we are going to see how many vaccine doses have been administered in total in each country. Check the code below for more information. The data shows that the United States has administered the most vaccines in the world, followed by China, the United Kingdom, England, and India. At the other end, some countries, including Saint Helena and San Marino, show 0 vaccinations.
country_wise_total_vaccinated = {}
for country in df.country.unique() :
    vaccinated = 0
    for i in range(len(df)) :
        if df.country[i] == country :
            vaccinated += df.daily_vaccinations[i]
    country_wise_total_vaccinated[country] = vaccinated
# made a separate dict from the df
country_wise_total_vaccinated_df = pd.DataFrame.from_dict(country_wise_total_vaccinated, orient='index', columns = ['total_vaccinted_till_date'])
# converted dict to df
country_wise_total_vaccinated_df.sort_values(by = 'total_vaccinted_till_date', ascending = False, inplace = True)
country_wise_total_vaccinated_df
fig = px.bar(country_wise_total_vaccinated_df, y = 'total_vaccinted_till_date', x = country_wise_total_vaccinated_df.index, color = 'total_vaccinted_till_date', color_discrete_sequence= px.colors.sequential.Viridis_r)
fig.update_layout(
    title={ 'text' : "Vaccination till date in various countries", 'y':0.95, 'x':0.5 },
    xaxis_title="Countries",
    yaxis_title="Total vaccinated",
    legend_title="Total vaccinated"
)
fig.show()
Observation
The United States has administered the most vaccines in the world, followed by China, the United Kingdom, England, and India.
Some countries, including Saint Helena and San Marino, show 0 vaccinations.
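The nested loop used for the country totals scans the whole frame once per country, which gets slow on larger data. A possible alternative, sketched here on made-up values, is a single pandas groupby. The output column keeps the article's spelling (`total_vaccinted_till_date`) so that it could be passed to the same bar chart call.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the article's dataset.
df = pd.DataFrame({
    "country": ["A", "B", "A", "B", "C"],
    "daily_vaccinations": [10, 5, 20, 7, 0],
})

# Sum the daily vaccinations per country in one vectorised pass,
# then sort from the largest total to the smallest.
country_wise_total_vaccinated_df = (
    df.groupby("country")["daily_vaccinations"]
    .sum()
    .sort_values(ascending=False)
    .rename("total_vaccinted_till_date")  # spelling kept from the article
    .to_frame()
)
print(country_wise_total_vaccinated_df)
```

The result is indexed by country, just like the frame built from the dictionary in the article.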
Country Wise Daily Vaccination
To see the vaccination trend in each country, check the code below. We draw a line plot where the x-axis is the date, the y-axis is the count of daily vaccinations, and the colour is set by country.
fig = px.line(df, x = 'date', y ='daily_vaccinations', color = 'country')
fig.update_layout(
    title={ 'text' : "Daily vaccination trend", 'y':0.95, 'x':0.5 },
    xaxis_title="Date",
    yaxis_title="Daily Vaccinations"
)
fig.show()
Observation:
Plot Till Date Function
# helper function
def plot_till_date(value1, value2, title, color1, color2) :
    so_far_dict = {}
    for dates in df.date.unique() :
        so_far_dict[dates], value1_count, value2_count = [], 0, 0
        for i in range(len(df)) :
            if df.date[i] == dates :
                value1_count += df[value1][i]
                value2_count += df[value2][i]
        # if dates not in so_far_dict.keys() :
        so_far_dict[dates].append(value1_count)
        so_far_dict[dates].append(value2_count)
    so_far_df = pd.DataFrame.from_dict(so_far_dict, orient = 'index', columns=[value1, value2])
    so_far_df.reset_index(inplace = True)
    # return so_far_df
    so_far_df.sort_values(by='index', inplace = True)
    plot = go.Figure(data=[
        go.Scatter(x = so_far_df['index'], y = so_far_df[value1], stackgroup='one', name = value1, marker_color= color1),
        go.Scatter(x = so_far_df['index'], y = so_far_df[value2], stackgroup='one', name = value2, marker_color= color2)
    ])
    plot.update_layout(
        title={ 'text' : title, 'y':0.95, 'x':0.5 },
        xaxis_title="Date"
    )
    return plot.show()
People vaccinated vs people fully vaccinated in the world
In this section, let’s analyze how many people are vaccinated versus fully vaccinated in the world. We draw stacked curves where the x-axis is the date and the y-axis is the count of people vaccinated and fully vaccinated in the world.
plot_till_date('people_fully_vaccinated', 'people_vaccinated','People vaccinated vs Fully vaccinated till date', '#c4eb28', '#35eb28')
Observation
The number of people fully vaccinated in the world is around 50 million
The number of people vaccinated in the world is around 50 million
People vaccinated vs people fully vaccinated per hundred in the world
In this section, let’s analyze how many people are vaccinated versus fully vaccinated per hundred in the world. We draw stacked curves where the x-axis is the date and the y-axis is the count of people vaccinated and fully vaccinated per hundred in the world.
plot_till_date('people_fully_vaccinated_per_hundred', 'people_vaccinated_per_hundred', 'People vaccinated vs Fully vaccinated per hundred till date', '#0938e3','#7127cc')
Observation
The number of people fully vaccinated in the world per hundred is around 2
The number of people vaccinated in the world per hundred is around 7
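The aggregation inside `plot_till_date` loops over every row for every date. As a sketch of the same per-date sums with a pandas groupby, the hypothetical helper `totals_by_date` below works on a made-up frame. Unlike the article's helper, the date column in the result is called `date` rather than `index`.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the article's dataset.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"]
    ),
    "country": ["A", "B", "A", "B"],
    "people_vaccinated": [100, 50, 150, 80],
    "people_fully_vaccinated": [0, 10, 20, 30],
})


def totals_by_date(frame, value1, value2):
    """Sum two columns over all countries for each date, oldest date first."""
    return (
        frame.groupby("date")[[value1, value2]]
        .sum()
        .sort_index()
        .reset_index()
    )


so_far_df = totals_by_date(df, "people_fully_vaccinated", "people_vaccinated")
print(so_far_df)
```

The resulting frame has one row per date and could feed the two `go.Scatter` traces in the same way as the dictionary-based version. Note that summing the per-hundred columns across countries adds up national rates, so that total is not a true worldwide rate.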
Pie-Plot
In this section, we are going to draw pie plots. For more details, check the code below:
def plot_pie(value, title, color) :
    new_dict = {}
    for v in df[value].unique() :
        value_count = 0
        for i in range(len(df)) :
            if df[value][i] == v :
                value_count += 1
        new_dict[v] = value_count
    # print(new_dict)
    new_df = pd.DataFrame.from_dict(new_dict, orient = 'index', columns = ['Total'])
    if color == 'plasma' :
        fig = px.pie(new_df, values= 'Total', names = new_df.index, title = title, color_discrete_sequence=px.colors.sequential.Plasma)
    elif color == 'rainbow' :
        fig = px.pie(new_df, values= 'Total', names = new_df.index, title = title, color_discrete_sequence=px.colors.sequential.Rainbow)
    else :
        fig = px.pie(new_df, values= 'Total', names = new_df.index, title = title)
    fig.update_layout(
        title={ 'y':0.95, 'x':0.5 },
        legend_title = value
    )
    return fig.show()
plot_pie('vaccines', 'Various vaccines and their uses', 'plasma')
Most Used Vaccine
Let’s see which vaccines are used in different parts of the world:
df.vaccines.unique()
Word art of Vaccines
A word cloud is a unique way to get information from our dataset. The words are shown in the form of art, where the size of each word depends on how often it is repeated in the dataset. This is made using the WordCloud library. Check the code below to see how to draw a word cloud.
wordCloud = WordCloud(background_color='white', max_font_size = 50).generate(' '.join(df.vaccines))
plt.figure(figsize=(12,5))
plt.axis('off')
plt.imshow(wordCloud)
plt.show()
Daily vaccination trend per million
In this section, we will see the trend of vaccination per million. We are going to draw a line plot where the x-axis is the date and the y-axis is daily vaccinations per million. Check the code below for more information:
fig = px.line(df, x = 'date', y ='daily_vaccinations_per_million', color = 'country')
fig.update_layout(
    title={ 'text' : "Daily vaccination trend per million", 'y':0.95, 'x':0.5 },
    xaxis_title="Date",
    yaxis_title="Daily Vaccinations per million"
)
fig.show()
Observation
Seychelles and Israel have the highest number of vaccinations per million
On 10th Jan, Gibraltar had the highest vaccinations per million
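Observations like these are read off the interactive plot, but the leading country on each date can also be checked directly in pandas. This is a minimal sketch on made-up values with placeholder country names; it is not the article's own code.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the article's dataset.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-09", "2021-01-09", "2021-01-10", "2021-01-10"]
    ),
    "country": ["A", "B", "A", "B"],
    "daily_vaccinations_per_million": [900, 1200, 1500, 1100],
})

# idxmax() returns the row label of the largest value within each date group;
# .loc then pulls those rows out of the original frame.
leader_rows = df.loc[df.groupby("date")["daily_vaccinations_per_million"].idxmax()]
leaders = leader_rows.set_index("date")["country"]
print(leaders)
```

Looking up a date in `leaders` then gives the country with the highest daily vaccinations per million on that day.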
Total vaccinated – India vs the USA
In this section, we will see the trend of vaccination in two great countries, India and the USA. We are going to draw a line plot where the x-axis is the date and the y-axis is total vaccinations. Check the code below for more information:
india_usa = [df[df.country == 'United States'], df[df.country == 'India']]
result = pd.concat(india_usa)
fig = px.line(result, x = 'date', y ='total_vaccinations', color = 'country')
fig.update_layout(
    title={ 'text' : "Total vaccinated - India vs USA", 'y':0.95, 'x':0.5 },
    xaxis_title="Date",
    yaxis_title="Total Vaccinations"
)
fig.show()
Observation
The USA started its vaccination drive very early
India is moving quite steadily even though it started vaccination late.
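Which country started earlier can also be computed rather than read from the plot. The sketch below uses made-up values and placeholder country names. It filters out zero totals first, because null values were filled with 0 during cleaning and would otherwise count as a start date.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the article's dataset.
df = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B", "Country B"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-15", "2021-01-01", "2021-01-15"]
    ),
    "total_vaccinations": [500, 9000, 0, 200],
})

# Keep rows with at least one recorded dose, then take the earliest date per country.
first_dose_date = (
    df[df["total_vaccinations"] > 0]
    .groupby("country")["date"]
    .min()
    .sort_values()
)
print(first_dose_date)
```

Sorting the result lists the countries in the order in which their first doses appear in the data.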
MAPS
In this section, we are going to see how vaccinations are going in different countries using maps. The colour signifies how many people have been vaccinated. Check the below maps for more details.
Most vaccinated country
plot_map('total_vaccinations','Most vaccinated country', None)
Vaccines Used in Different countries
plot_map('vaccines','Vaccines Used in Different countries', None)
People fully vaccinated in Different countries
plot_map('people_fully_vaccinated','People fully vaccinated in Different countries', 'haline')
Key Observations:
Sputnik V is mostly used in Asia, Africa, and South America
Most of the countries are not fully vaccinated
Moderna and Pfizer are mostly used in North America and Europe
Pfizer/BioNTech is the most used vaccine in the world, at around 47.6%
Covishield and Covaxin are in the 10th position
China was the first to start mass vaccination
Daily vaccination is highest in the USA, though the USA started vaccination late compared to China
End Notes
In this article, we had a detailed discussion of Covid vaccination progress. I hope you learned something from this blog and that it will help you in the future. Thanks for reading and for your patience. Good luck!
The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.