This is the start of a series of posts in which we will look at statistical analysis, predominantly using Python.

**Statistical analysis** involves collecting and scrutinizing every data sample in a set of items from which samples can be drawn.
Statistical analysis can be broken down into the following steps:

· Explore the relation of the data to the underlying population (EDA).

· Create a model to summarize understanding of how the data relates to the underlying population.

· Prove (or disprove) the validity of the model.

· Employ predictive analytics to run scenarios that will help guide future actions.

The goal of statistical analysis is to identify trends. A retail business, for example, might use statistical analysis to find patterns in unstructured and semi-structured customer data that can be used to create a more positive customer experience and increase sales.

In this post we will go through Exploratory Data Analysis, which is the first step towards most data analysis work.

**Exploratory data analysis** (**EDA**) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task. In short, the process of organizing, plotting, and summarizing a data set is known as EDA. It often involves converting tabular data into graphical form, and a well-made graphical representation allows for more rapid interpretation of the data.

In this post we will look at a few ways to visualize data with the intention of gaining useful insight from it, using Python with its workhorse plotting library matplotlib, and also seaborn. The latter is built on top of matplotlib and offers a simple API for advanced visualizations and better default styling of plots.

1. Plotting a Histogram:

· A histogram is essentially a plot of the frequency distribution of data grouped into bins. Suppose we have carefully measured the anatomical properties of samples of three different species of iris: *Iris setosa*, *Iris versicolor*, and *Iris virginica*. This is the popular iris dataset commonly used in data science. Here, we will work with the petal length measurements.
We have three NumPy arrays, one per species, each holding that species' petal lengths.
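The code in this post assumes the three arrays already exist. As a minimal sketch of how they could be built, suppose the measurements and the species labels sit in two parallel NumPy arrays; boolean masks then pick out one species at a time. The names `species` and `petal_length` and all the values below are made-up placeholders for illustration, not the real iris measurements.

```python
import numpy as np

# Hypothetical placeholder data: NOT the real iris measurements.
species = np.array(['setosa', 'setosa', 'versicolor',
                    'versicolor', 'virginica', 'virginica'])
petal_length = np.array([1.4, 1.5, 4.5, 4.0, 6.0, 5.1])

# A boolean mask keeps only the entries that belong to one species.
setosa_petal_length = petal_length[species == 'setosa']
versicolor_petal_length = petal_length[species == 'versicolor']
virginica_petal_length = petal_length[species == 'virginica']
```

With the full dataset, each of the three arrays would hold 50 values instead of 2.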

The following code plots a histogram of the versicolor petal lengths:

# Import plotting modules

import matplotlib.pyplot as plt

import seaborn as sns

# Set default Seaborn style

sns.set()

# Plot histogram of versicolor petal lengths

plt.hist(versicolor_petal_length)

# Label axes

plt.xlabel('petal length (cm)')

plt.ylabel('count')

# Show histogram

plt.show()

Running this code produces the following histogram.

The histogram shows that the petal lengths range from roughly 3.0 to 5.0 cm, and that the majority of the 50 samples are longer than 3.5 cm.

We can also plot the histogram with a different number of bins, which can give a better idea of the data. The default number of bins is 10, but we can set it explicitly with the bins argument of plt.hist(), as shown below.

# Plot the histogram with 7 bins

plt.hist(versicolor_petal_length, bins = 7)

# Label axes

plt.xlabel('petal length (cm)')

plt.ylabel('count')

# Show histogram

plt.show()

Running this code produces the following histogram.

The biggest disadvantage of a histogram is that the same data may be interpreted differently depending on the choice of bins, an effect known as binning bias.
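To see binning bias in numbers rather than pictures, the sketch below counts the same measurements with different bin settings using `np.histogram`, which computes bar heights and bin edges without drawing anything. The data values are made-up placeholders, and the square-root rule used at the end is a common rule of thumb for picking a bin count, not something this post prescribes.

```python
import numpy as np

# Hypothetical placeholder measurements (cm), not the real iris data.
data = np.array([3.0, 3.3, 3.5, 3.9, 4.0, 4.0, 4.2,
                 4.4, 4.5, 4.5, 4.7, 4.9, 5.0])

# Same data, two different bin counts: the bar heights differ,
# so the apparent shape of the distribution differs too.
counts_4, edges_4 = np.histogram(data, bins=4)
counts_7, edges_7 = np.histogram(data, bins=7)

# Square-root rule of thumb (an assumption, see the text above):
# number of bins = square root of the number of samples, rounded down.
n_bins = int(np.sqrt(len(data)))
counts_sqrt, edges_sqrt = np.histogram(data, bins=n_bins)
```

Whatever the bin count, the counts always add up to the number of samples; only how they are split across the bars changes.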

2. Plotting a Bee Swarm:

- Let's make a bee swarm plot of the iris petal lengths. The x-axis should show each of the three species, and the y-axis the petal lengths. The data is in a data frame called df with the columns [sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), species].

# Create bee swarm plot with Seaborn's default settings

sns.swarmplot(x='species',y='petal length (cm)',data=df)

# Label the axes

plt.xlabel('species')

plt.ylabel('petal length (cm)')

# Show the plot

plt.show()

Running this code produces the following bee swarm plot.

We can clearly see from the plot that *virginica* petals tend to be the longest, and *setosa* petals tend to be the shortest of the three species.
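An ordering read off a swarm plot can also be checked with a quick per-species summary. The sketch below compares median petal lengths; the dictionary `petal_lengths` and its values are made-up placeholders rather than the real iris measurements, and the median is just one reasonable choice of summary.

```python
import numpy as np

# Hypothetical placeholder values (cm), not the real iris measurements.
petal_lengths = {
    'setosa': np.array([1.3, 1.4, 1.5, 1.6]),
    'versicolor': np.array([3.9, 4.2, 4.5, 4.7]),
    'virginica': np.array([5.1, 5.5, 5.8, 6.1]),
}

# Median petal length for each species.
medians = {name: float(np.median(values))
           for name, values in petal_lengths.items()}

# Species names ordered from shortest to longest median petal length.
ranking = sorted(medians, key=medians.get)
```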
Suppose we now want to find what percentage of versicolor flowers have a petal length of less than 4 cm.

3. Plotting an ECDF (Empirical Cumulative Distribution Function):

The empirical distribution function is an estimate of the cumulative distribution function that generated the points in the sample.

- Let's define an ecdf() function. By reusing it, we can draw ECDF plots over and over again.

import numpy as np

def ecdf(data):

    """Compute ECDF for a one-dimensional array of measurements."""

    # Number of data points: n

    n = len(data)

    # x-data for the ECDF: x

    x = np.sort(data)

    # y-data for the ECDF: y

    y = np.arange(1, n+1) / n

    return x, y
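A tiny worked example may help show what the function returns. The sketch below repeats the definition so it runs on its own and applies it to four made-up, deliberately unsorted numbers: the x-values come back sorted, and each y-value is the fraction of points at or below the matching x.

```python
import numpy as np

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)                   # number of data points
    x = np.sort(data)               # sorted measurements
    y = np.arange(1, n + 1) / n     # fractions 1/n, 2/n, ..., 1
    return x, y

# Four made-up measurements, deliberately unsorted.
x, y = ecdf(np.array([3.0, 1.0, 4.0, 2.0]))
# x -> [1.0, 2.0, 3.0, 4.0]
# y -> [0.25, 0.5, 0.75, 1.0]
```

Reading the pairs together: 25% of the points are at or below 1.0, 50% at or below 2.0, and so on up to 100% at the largest value.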

We will now use our `ecdf()` function to compute the ECDF for the petal lengths of *versicolor* flowers.

# Compute ECDF for versicolor data: x_vers, y_vers

x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot using the x_vers, y_vers computed by ecdf() above

plt.plot(x_vers, y_vers, marker='.', linestyle='none')

# Make the margins nice

plt.margins(0.02)

# Label the axes

plt.xlabel('versicolor petal length')

plt.ylabel('ECDF')

# Display the plot

plt.show()

Running this code produces the following ECDF plot.

Here we can say that around 30% of versicolor petal lengths are less than 4 cm.

We can plot the ECDFs of the other species on the same plot for better comparison and understanding.

# Compute ECDFs

x_set, y_set = ecdf(setosa_petal_length)

x_vers, y_vers = ecdf(versicolor_petal_length)

x_virg, y_virg = ecdf(virginica_petal_length)

# Plot all ECDFs on the same plot

plt.plot(x_set, y_set, marker='.', linestyle='none')

plt.plot(x_vers, y_vers, marker='.', linestyle='none')

plt.plot(x_virg, y_virg, marker='.', linestyle='none')

# Make nice margins

plt.margins(0.02)

# Annotate the plot

plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')

plt.xlabel('petal length (cm)')

plt.ylabel('ECDF')

# Display the plot

plt.show()

Running this code produces the following ECDF plot.

We can say that around 40% of setosa, versicolor, and virginica petal lengths are less than 1.5 cm, 4.5 cm, and 5.5 cm respectively.
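Values read off an ECDF plot by eye can be cross-checked numerically. The sketch below does this in both directions: the ECDF height at a given length, and the length at a given ECDF height. The array holds made-up placeholder values, not the real iris measurements, and `np.percentile` interpolates linearly between data points by default, so its answer may fall between two measured values.

```python
import numpy as np

# Hypothetical placeholder values (cm), not the real iris measurements.
versicolor_petal_length = np.array([3.0, 3.5, 3.9, 4.0, 4.2,
                                    4.4, 4.5, 4.7, 4.9, 5.1])

# Height of the ECDF at 4 cm: the share of measurements at or below 4 cm.
fraction_at_most_4 = np.mean(versicolor_petal_length <= 4.0)

# The reverse reading: the 40th percentile, i.e. the petal length
# at which the ECDF reaches roughly 0.4 (linear interpolation).
length_at_40_percent = np.percentile(versicolor_petal_length, 40)
```

Multiplying `fraction_at_most_4` by 100 gives the percentage, which is the same kind of figure as the 30% read from the versicolor plot.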
