Showing posts with label Data Science. Show all posts

Wednesday, October 4, 2017

Exploratory data analysis (EDA)

This is the start of a series of posts in which we will look at statistical analysis, predominantly using Python. Statistical analysis involves collecting and scrutinizing every data sample in a set of items from which samples can be drawn.
Statistical analysis can be broken down into the following steps:
· Explore the relation of the data to the underlying population (EDA).
· Create a model to summarize understanding of how the data relates to the underlying population.
· Prove (or disprove) the validity of the model.
· Employ predictive analytics to run scenarios that will help guide future actions.
The goal of statistical analysis is to identify trends. A retail business, for example, might use statistical analysis to find patterns in unstructured and semi-structured customer data that can be used to create a more positive customer experience and increase sales.

In this post we will go through exploratory data analysis, which is the first step in most data analysis work.
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. In short, the process of organizing, plotting, and summarizing a data set is known as EDA. It often involves converting tabular data into graphical form; done well, a graphical representation allows for more rapid interpretation of the data.
In this post we will look at a couple of ways to visualize data with the intention of gaining some useful insight from it, using Python with its workhorse plotting library, matplotlib, and also seaborn. The latter is built on top of matplotlib and offers a simple API for advanced visualizations and better default styling of plots.
1. Plotting a Histogram:
    ·  A histogram is essentially a plot of the frequency distribution of data grouped into bins. Suppose we have carefully measured the anatomical properties of samples of three different species of iris: Iris setosa, Iris versicolor, and Iris virginica. This is the popular iris dataset commonly used in data science. Here, we will work with the measurements of petal length.
    We have three NumPy arrays, one per species, each containing the petal lengths for that species.
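The post does not show how these arrays were created. One possible way, assuming scikit-learn and pandas are installed, is to load the copy of the iris data bundled with scikit-learn. The variable names below match the ones used in the rest of this post; the loading approach itself is an assumption, not necessarily how the data was originally prepared.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris data bundled with scikit-learn (assumed source of the data)
iris = load_iris()

# Data frame with the four measurement columns plus a 'species' column
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

# One NumPy array of petal lengths per species
petal = 'petal length (cm)'
setosa_petal_length = df.loc[df['species'] == 'setosa', petal].to_numpy()
versicolor_petal_length = df.loc[df['species'] == 'versicolor', petal].to_numpy()
virginica_petal_length = df.loc[df['species'] == 'virginica', petal].to_numpy()
```

The same df can also serve for plots that take a data frame rather than separate arrays.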
    The following code plots a histogram of the versicolor petal lengths:

    # Import plotting modules

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Set default Seaborn style

    sns.set()

    # Plot histogram of versicolor petal lengths

    plt.hist(versicolor_petal_length)

    # Label axes

    plt.xlabel('petal length (cm)')
    plt.ylabel('count')

    # Show histogram

    plt.show()

    After executing these lines of code, we get the following histogram.

    The histogram shows that the petal lengths range from roughly 3.0 to 5.0 cm, and that the majority of the 50 samples are longer than 3.5 cm.

    We can also choose the number of bins ourselves, which can give a better idea of the data. The default number of bins is 10, but we can set it explicitly with the bins argument of plt.hist(), as shown below.

    # Plot the histogram with 7 bins
    plt.hist(versicolor_petal_length, bins = 7)

    # Label axes

    plt.xlabel('petal length (cm)')
    plt.ylabel('count')

    # Show histogram

    plt.show()

    After executing these lines of code, we get the following histogram.

    The biggest disadvantage of a histogram is that the same data may be interpreted differently depending on the choice of bins, an effect known as bin bias.
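To make bin bias concrete, here is a minimal sketch that counts the same sample with two different bin settings. The sample is synthetic stand-in data, not the iris measurements, and the square-root rule shown at the end is only one common heuristic for picking a bin count.

```python
import numpy as np

# Synthetic stand-in data (NOT the iris measurements), for illustration only
rng = np.random.default_rng(0)
sample = rng.normal(loc=4.3, scale=0.5, size=50)

# Same data, two different bin counts: the counts per bin change,
# and so can the visual impression of the distribution
counts_5, edges_5 = np.histogram(sample, bins=5)
counts_10, edges_10 = np.histogram(sample, bins=10)

# Square-root rule: a common heuristic for choosing the number of bins
n_bins = int(np.sqrt(len(sample)))

print(counts_5)
print(counts_10)
print(n_bins)
```

For a sample of 50 points the square-root rule gives 7 bins.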
    2. Plotting a Bee Swarm:
    • Let's make a bee swarm plot of the iris petal lengths. The x-axis should contain each of the three species, and the y-axis the petal lengths. The data is held in a data frame, df, with the columns [sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), species].

      # Create bee swarm plot with Seaborn's default settings
      sns.swarmplot(x='species',y='petal length (cm)',data=df)

      # Label the axes
      plt.xlabel('species')
      plt.ylabel('petal length (cm)')

      # Show the plot
      plt.show()
    After executing these lines of code, we get the following bee swarm plot.
    We can clearly see from the plot that virginica petals tend to be the longest, and setosa petals tend to be the shortest of the three species.
    Suppose we want to find what percentage of versicolor flowers have a petal length of less than 4 cm. That is hard to read off a bee swarm plot, which is where the ECDF comes in.
    3. Plotting an ECDF (empirical cumulative distribution function):
    The empirical distribution function is an estimate of the cumulative distribution function that generated the points in the sample.
    • Let's define an ecdf() function that we can reuse whenever we want to plot an ECDF.
      import numpy as np

      def ecdf(data):
          """Compute ECDF for a one-dimensional array of measurements."""

          # Number of data points: n
          n = len(data)

          # x-data for the ECDF: x
          x = np.sort(data)

          # y-data for the ECDF: y
          y = np.arange(1, n+1) / n

          return x, y
      We will now use our ecdf() function to compute the ECDF for the petal lengths of versicolor flowers.
      # Compute ECDF for versicolor data: x_vers, y_vers

      x_vers, y_vers = ecdf(versicolor_petal_length)



      # Generate plot using above x_vers, y_vers which we found by ecdf() function.
      plt.plot(x_vers,y_vers,marker='.',linestyle = 'none')

      # Make the margins nice
      plt.margins(0.02)

      # Label the axes
      plt.xlabel('versicolor petal length')
      plt.ylabel('ECDF')


      # Display the plot
      plt.show()
      After executing these lines of code, we get the following ECDF plot.


      Reading off the plot, we can say that around 30% of versicolor petal lengths are less than 4 cm.
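A percentage like this can also be computed directly instead of being read off the plot. The sketch below uses a hypothetical helper, fraction_at_or_below(), and a small made-up array rather than the iris data. Note that the ECDF at a value x counts the points less than or equal to x.

```python
import numpy as np

def fraction_at_or_below(data, threshold):
    """Return the ECDF value at `threshold`: the share of points <= threshold."""
    data = np.asarray(data)
    return np.mean(data <= threshold)

# Hypothetical toy measurements, not the iris data
toy = np.array([3.0, 3.5, 3.9, 4.0, 4.2, 4.5, 4.7, 4.9, 5.0, 5.1])

# 4 of the 10 toy values are <= 4.0
share = fraction_at_or_below(toy, 4.0)
print(share)
```

Calling fraction_at_or_below(versicolor_petal_length, 4.0) would give the corresponding figure for the versicolor array.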
      We can plot the ECDFs of all three species on a single plot for better comparison and understanding.

      # Compute ECDFs
      x_set, y_set = ecdf(setosa_petal_length)
      x_vers, y_vers = ecdf(versicolor_petal_length)
      x_virg, y_virg = ecdf(virginica_petal_length)

      # Plot all ECDFs on the same plot
      plt.plot(x_set, y_set,marker='.',linestyle='none')
      plt.plot(x_vers, y_vers,marker='.',linestyle='none')
      plt.plot(x_virg, y_virg,marker='.',linestyle='none')

      # Make nice margins
      plt.margins(0.02)

      # Annotate the plot
      plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
      plt.xlabel('petal length (cm)')
      plt.ylabel('ECDF')

      # Display the plot
      plt.show()

      After executing these lines of code, we get the following ECDF plot.

      Reading off the plot, roughly 40% of setosa, versicolor, and virginica petal lengths are less than about 1.5 cm, 4.5 cm, and 5.5 cm, respectively.
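Going the other way, from a percentage to a measurement, is a percentile calculation. As a minimal sketch, NumPy's np.percentile can do this; value_at_percent() is a hypothetical wrapper name, and the array is made-up toy data rather than the iris measurements.

```python
import numpy as np

def value_at_percent(data, percent):
    """Return the value below which roughly `percent` % of the data falls."""
    return np.percentile(data, percent)

# Hypothetical toy measurements, not the iris data
toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# The 50th percentile (median) of the toy data
median = value_at_percent(toy, 50)
print(median)
```

Applied to each species array with percent=40, this would give the petal lengths to compare against the values read off the plot.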




    Thursday, August 10, 2017

    Basic Data Analysis in Excel: Charts and Tables

    This post covers the following:
    •  Introduction to Reporting in Excel
    •  Excel Tables
    •  Basic Pivot Tables and Charts
    •  Dashboards
    Introduction to Reporting in Excel:

    We often record data in Excel without knowing how to put that data to use or how to represent it graphically. The screenshots below show the steps to follow to represent data graphically.
    Consider the sales data for a bicycle company over the years for different countries.

    Select the data in Excel, then go to the Insert tab and choose a chart type: Bar, Line, Pie, etc.


    Excel Tables: Excel tables are formatted tables that are more user-friendly and more functional for calculations and formulas.

    To create an Excel table, select the data and then select Table from the Insert tab.
    There is a Total Row check box that adds a final row to the data, where you can show results such as Sum, Max, or Min.




    Basic Pivot Tables and Charts: Pivot tables are one of Excel's most powerful features. A pivot table allows you to extract the significance from a large, detailed data set.

    Steps to create a pivot table:

    1) Select the PivotTable option from the Insert tab.
    2) The Create PivotTable pop-up appears, as shown below.


    3) After you select "OK", a pivot table appears on a new sheet, with all columns listed on the left-hand side.
    4) Select the Column Labels, Row Labels, and Values sections according to the report you want from the pivot table.




    5) You can select multiple columns, rows, and report filters, as shown below.
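For readers following the Python posts in this series, the row label / column label / value idea from the steps above can be sketched with pandas. This is only an analogy to what Excel does, assuming pandas is installed; the column names and sales figures are made up for illustration.

```python
import pandas as pd

# Hypothetical bicycle sales records (made-up numbers for illustration)
sales = pd.DataFrame({
    'country': ['France', 'France', 'Germany', 'Germany', 'France'],
    'year':    [2015, 2016, 2015, 2016, 2016],
    'revenue': [100, 150, 200, 250, 50],
})

# Rows = country, columns = year, values = summed revenue
pivot = sales.pivot_table(index='country', columns='year',
                          values='revenue', aggfunc='sum')
print(pivot)
```

Here index plays the role of the Row Labels, columns the Column Labels, and values with aggfunc the Values section.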



    Pivot Charts:
    Steps to create a pivot chart:
    1) Select the pivot table data.
    2) Select PivotChart from the Options tab.
    3) The Insert Chart pop-up appears.
    4) Select the chart type and template, then select "OK".
    5) The pivot chart appears, reflecting the pivot table with all applied filters.