Wednesday, October 4, 2017

Exploratory data analysis (EDA)

This is the start of a series of posts in which we will look at statistical analysis, predominantly using Python. Statistical analysis involves collecting and scrutinizing data samples drawn from an underlying population.
Statistical analysis can be broken down into the following steps:
· Explore the relation of the data to the underlying population (EDA).
· Create a model to summarize understanding of how the data relates to the underlying population.
· Prove (or disprove) the validity of the model.
· Employ predictive analytics to run scenarios that will help guide future actions.
The goal of statistical analysis is to identify trends. A retail business, for example, might use statistical analysis to find patterns in unstructured and semi-structured customer data that can be used to create a more positive customer experience and increase sales.

In this post we will go through exploratory data analysis, which is the first step in most data analysis work.
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond formal modeling or hypothesis testing. In short, the process of organizing, plotting, and summarizing a data set is known as EDA. It often involves converting tabular data into graphical form; done well, a graphical representation allows for more rapid interpretation of the data.
In this post we will look at a few ways to visualize data with the intention of gaining useful insight from it, using Python with its workhorse plotting library matplotlib, and also seaborn. The latter is built on top of matplotlib and offers a simple API for advanced visualizations and better default styling of plots.
1. Plotting a Histogram:
    ·  A histogram is essentially a plot of the frequency distribution of data grouped into bins. Suppose we have carefully measured the anatomical properties of samples of three different species of iris: Iris setosa, Iris versicolor, and Iris virginica. This is the popular iris dataset commonly used in data science. Here, we will work with the measurements of petal length.
    We have three NumPy arrays, one per species, each containing the petal lengths for that species.
    The following code plots a histogram of the versicolor petal lengths:

    # Import plotting modules

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Set default Seaborn style

    sns.set()

    # Plot histogram of versicolor petal lengths

    plt.hist(versicolor_petal_length)

    # Label axes

    plt.xlabel('petal length (cm)')
    plt.ylabel('count')

    # Show histogram

    plt.show()

    Running this code produces the following histogram.

    The histogram shows that the petal lengths range from about 3.0 to 5.0 cm, and that the majority of the 50 samples are longer than 3.5 cm.

    We can also plot the histogram with a different number of bins, which can give a better idea of the data. The default number of bins is 10, but we can set it explicitly with the bins argument of plt.hist(), as shown below.

    # Plot the histogram with 7 bins
    plt.hist(versicolor_petal_length, bins = 7)

    # Label axes

    plt.xlabel('petal length (cm)')
    plt.ylabel('count')

    # Show histogram

    plt.show()

    Running this code produces the following histogram.

    The biggest disadvantage of a histogram is that the same data may be interpreted differently depending on the choice of bins, an effect known as binning bias.
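To make the effect of the bin choice concrete, the sketch below counts the same sample with two different bin settings using np.histogram, which produces the same kind of per-bin counts that plt.hist() draws. The data is a synthetic stand-in generated with NumPy, not the real versicolor measurements, and the names petal_length and n_bins_sqrt are chosen only for illustration. It also shows the square-root rule, a common rule of thumb (an assumption here, not something this post relies on) that sets the number of bins to the square root of the number of samples.

```python
import numpy as np

# Synthetic stand-in for versicolor_petal_length (NOT the real iris data)
rng = np.random.default_rng(0)
petal_length = rng.normal(loc=4.3, scale=0.5, size=50)

# Same data, two different bin counts: the per-bin counts (and so the
# apparent shape of the histogram) differ between the two calls
counts_10, edges_10 = np.histogram(petal_length, bins=10)
counts_7, edges_7 = np.histogram(petal_length, bins=7)

# Square-root rule of thumb: number of bins = sqrt(number of samples)
n_bins_sqrt = int(np.sqrt(len(petal_length)))

print("10 bins:", counts_10)
print(" 7 bins:", counts_7)
print("square-root rule suggests", n_bins_sqrt, "bins")
```

Both calls account for every sample; only the grouping changes, which is why the two histograms can suggest different stories about the same data.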
    2. Plotting a Bee Swarm:
    • Let's make a bee swarm plot of the iris petal lengths. The x-axis should contain each of the three species, and the y-axis the petal lengths. The data is in a data frame df with the columns [sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), species].

      # Create bee swarm plot with Seaborn's default settings
      sns.swarmplot(x='species',y='petal length (cm)',data=df)

      # Label the axes
      plt.xlabel('species')
      plt.ylabel('petal length (cm)')

      # Show the plot
      plt.show()
    Running this code produces the following bee swarm plot.
    We can clearly see from the plot that virginica petals tend to be the longest, and setosa petals tend to be the shortest of the three species.
    Now suppose we want to find what percentage of versicolor flowers have a petal length of less than 4 cm. This is hard to read from a histogram or a bee swarm plot, which is where the ECDF comes in.
    3. Empirical Cumulative Distribution Function (ECDF):
    The empirical cumulative distribution function is an estimate of the cumulative distribution function that generated the points in the sample. For any value x, it gives the fraction of data points that are less than or equal to x.
    • Let's define an ecdf() function; we can reuse it whenever we want to plot an ECDF.
      import numpy as np

      def ecdf(data):
          """Compute ECDF for a one-dimensional array of measurements."""

          # Number of data points: n
          n = len(data)

          # x-data for the ECDF: x
          x = np.sort(data)

          # y-data for the ECDF: y
          y = np.arange(1, n+1) / n

          return x, y
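As a quick sanity check, here is a small worked example of what ecdf() returns. The function is repeated so the snippet runs on its own, and the input toy is a made-up four-element array chosen only so the result is easy to verify by hand.

```python
import numpy as np

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)            # measurements in ascending order
    y = np.arange(1, n + 1) / n  # cumulative fraction: 1/n, 2/n, ..., 1
    return x, y

# Tiny made-up input, chosen only so the result is easy to check by hand
toy = np.array([3.0, 1.0, 2.0, 2.0])
x_toy, y_toy = ecdf(toy)

print(x_toy)  # sorted values: 1, 2, 2, 3
print(y_toy)  # cumulative fractions: 0.25, 0.5, 0.75, 1.0
```

Each y value is the fraction of data points that are less than or equal to the matching x value, so the last point always sits at 1.0.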
      We will now use our ecdf() function to compute the ECDF for the petal lengths of versicolor flowers.
      # Compute ECDF for versicolor data: x_vers, y_vers

      x_vers, y_vers = ecdf(versicolor_petal_length)



      # Generate plot using above x_vers, y_vers which we found by ecdf() function.
      plt.plot(x_vers,y_vers,marker='.',linestyle = 'none')

      # Make the margins nice
      plt.margins(0.02)

      # Label the axes
      plt.xlabel('versicolor petal length')
      plt.ylabel('ECDF')


      # Display the plot
      plt.show()
      Running this code produces the following ECDF plot.

      From the plot we can read that around 30% of versicolor petals are shorter than 4 cm.
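A value read off the plot like this can also be computed directly. The sketch below shows two equivalent ways to get the fraction of samples below a threshold. It uses a synthetic stand-in array rather than the real versicolor measurements, so the percentage it prints will not match the plot; the names petal_length and threshold are illustrative.

```python
import numpy as np

# Synthetic stand-in for versicolor_petal_length (NOT the real iris data)
rng = np.random.default_rng(42)
petal_length = rng.normal(loc=4.3, scale=0.5, size=50)

threshold = 4.0  # cm

# Direct count: fraction of samples strictly below the threshold
frac_direct = np.mean(petal_length < threshold)

# Same number from the sorted data: searchsorted with side='left' returns
# how many sorted values are strictly less than the threshold
x_sorted = np.sort(petal_length)
frac_sorted = np.searchsorted(x_sorted, threshold, side='left') / len(x_sorted)

print(f"{100 * frac_direct:.0f}% of samples are below {threshold} cm")
```

Swapping in the real versicolor_petal_length array for the synthetic one would give the exact figure that the plot only lets us estimate by eye.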
      We can also plot the ECDFs of all three species on a single plot for easier comparison.

      # Compute ECDFs
      x_set, y_set = ecdf(setosa_petal_length)
      x_vers, y_vers = ecdf(versicolor_petal_length)
      x_virg, y_virg = ecdf(virginica_petal_length)

      # Plot all ECDFs on the same plot
      plt.plot(x_set, y_set,marker='.',linestyle='none')
      plt.plot(x_vers, y_vers,marker='.',linestyle='none')
      plt.plot(x_virg, y_virg,marker='.',linestyle='none')

      # Make nice margins
      plt.margins(0.02)

      # Annotate the plot
      plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
      plt.xlabel('petal length (cm)')
      plt.ylabel('ECDF')

      # Display the plot
      plt.show()

      Running this code produces the following ECDF plot.

      From the plot we can read that about 40% of setosa, versicolor, and virginica petals are shorter than roughly 1.5 cm, 4.5 cm, and 5.5 cm, respectively.
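Reading "the length below which 40% of the petals fall" is the same as asking for the 40th percentile, which NumPy can compute with np.percentile. The sketch below is a minimal illustration on a synthetic stand-in array, not the real iris measurements, so its numbers will differ from the plot; petal_length and p40 are illustrative names.

```python
import numpy as np

# Synthetic stand-in for one species' petal lengths (NOT the real iris data)
rng = np.random.default_rng(7)
petal_length = rng.normal(loc=4.3, scale=0.5, size=50)

# 40th percentile: the length below which about 40% of the samples fall
p40 = np.percentile(petal_length, 40)

# Check by evaluating the ECDF at that value: the fraction of samples
# at or below p40 should be close to 0.4
frac_at_or_below = np.mean(petal_length <= p40)

print(f"40th percentile: {p40:.2f} cm")
print(f"fraction of samples at or below it: {frac_at_or_below:.2f}")
```

Running the same two lines on each of the three species arrays would give the per-species values that the combined ECDF plot shows side by side.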



