Exploratory Data Analysis

Mahima Jain
3 min read · Jan 26, 2022

In this blog, I’ll give an introduction to exploratory data analysis (EDA). It covers the following points:

  • About EDA
  • Objectives of EDA
  • Plots involved in EDA

Let’s first look at the process that leads up to EDA. Before EDA comes IDA (Initial Data Analysis). As the name suggests, IDA is the first step in analyzing data. It tells us about the nature of the data, and how and from where the data was collected.

The steps involved in the analysis of data are:

  1. Evaluating and understanding the data
  2. Cleaning the data and removing redundancies
  3. Summarizing the data
  4. Analyzing the relationships between the variables

The first three steps of the analysis come under IDA. IDA focuses on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables. The last step belongs to EDA.

So, the main question is: What is EDA?

EDA stands for Exploratory Data Analysis. Its meaning is as simple as its name: it is a data exploration technique used to understand the various aspects of the data.

Before using EDA, the data must be clean: it should have no redundancy and no missing or null values. In other words, initial data analysis must already be done.

After that, identify the important variables in the dataset and remove unnecessary noise (unneeded data or columns) so that it does not affect the accuracy of the model we are going to build.
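The cleaning described above can be sketched with pandas as follows. This is only a minimal illustration: the toy data, the column names, and the choice to drop incomplete rows (rather than fill them in) are assumptions made for the example, not part of any particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicated row, one missing value,
# and a free-text "notes" column that we treat as noise.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "age": [25, 32, 32, np.nan, 41],
    "salary": [30000, 52000, 52000, 47000, 61000],
    "notes": ["a", "b", "b", "c", "d"],
})

# Count fully duplicated rows before removing them
n_duplicates = df.duplicated().sum()

clean = (
    df.drop_duplicates()        # remove redundant rows
      .dropna()                 # drop rows that contain missing values
      .drop(columns=["notes"])  # remove a column we do not need
)

print(n_duplicates)
print(clean)
```

Whether to drop or fill missing values depends on the dataset; dropping is used here only because it is the shortest option to show.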

In this way, we can understand the relationships between variables using EDA. The insights gathered about the data let us draw conclusions that guide the more complicated steps of data preprocessing.

The objectives of analyzing the dataset are:

  1. After completing IDA, the dataset is free from redundancies and null values.
  2. It helps us find faulty data points such as outliers; once found, they can be removed to clean the data.
  3. By summarizing the data, we can understand the dataset, its rows and columns, and so on (non-graphical analysis).
  4. By plotting different graphs, we can visualize the data for a better understanding (graphical analysis).
  5. It helps us understand the relationships between the variables, which gives a wider perspective on the data.
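As an illustration of the outlier objective, here is a minimal sketch that flags outliers with the 1.5 × IQR rule. The rule itself and the toy values are assumptions chosen for the example; the text above does not prescribe a specific method.

```python
import pandas as pd

# Hypothetical scores with one obviously faulty value (95)
values = pd.Series([12, 14, 15, 13, 16, 14, 95, 15, 13, 14], name="score")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points further than 1.5 * IQR from the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
cleaned = values[(values >= lower) & (values <= upper)]

print(outliers)
```

The same bounds are what a box plot uses by default to decide which points to draw individually.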

Many plots are used in EDA to visualize and explore the data. In what follows, ‘df’ stands for a DataFrame, the tabular data structure provided by the pandas library, and ‘sns’ is the usual alias for the seaborn library.

Non-graphical Analysis: This is for understanding the distribution of the data without plotting a graph. The three commands below come under this:

  • df.info() - Prints a concise summary of the data frame: column names, non-null counts, data types, and memory usage.
  • df.describe() - Returns summary statistics for the numeric columns: count, mean, standard deviation, minimum, the 25%, 50%, and 75% percentiles, and maximum.
  • df.isnull() - Returns a Boolean data frame of the same shape, with True where a value is missing and False otherwise. Use df.isnull().sum() to count the missing values in each column.
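A short sketch of these three commands on a small data frame is shown here. The data and column names are hypothetical and exist only to make the output easy to read.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "city": ["Pune", "Delhi", "Delhi", None, "Pune"],
    "salary": [30000.0, 52000.0, 47000.0, 61000.0, np.nan],
})

df.info()                    # prints dtypes and non-null counts per column

summary = df.describe()      # statistics for the numeric columns only
missing_mask = df.isnull()   # True where a value is missing
missing_per_column = missing_mask.sum()  # missing values per column

print(summary)
print(missing_per_column)
```

Note that df.describe() skips the text column by default, and that summing the Boolean mask works because True counts as 1.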

Graphical Analysis: To understand the data using graphs and plots, we use graphical analysis. Some common plots used in EDA are:

  • Univariate

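Numerical: df[column].plot(kind="hist") - displays a histogram of the values in the column.

Categorical: df[column].value_counts().plot(kind="bar") - displays a bar plot of how often each category occurs. The counts must be computed first, because a non-numeric column cannot be plotted directly.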
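Both univariate plots can be sketched as follows. The data, the column names, and the number of bins are assumptions for the example. For the categorical column, the categories are counted with value_counts() before drawing the bars.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "age": [25, 32, 38, 41, 29, 35, 32, 27],
    "city": ["Pune", "Delhi", "Delhi", "Mumbai",
             "Pune", "Delhi", "Mumbai", "Pune"],
})

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Numerical column: histogram of the values
df["age"].plot(kind="hist", bins=5, ax=ax_hist, title="age")

# Categorical column: count each category, then draw one bar per category
city_counts = df["city"].value_counts()
city_counts.plot(kind="bar", ax=ax_bar, title="city")

fig.tight_layout()
```

In a script, call plt.show() afterwards to display the figure; in a notebook it is rendered automatically.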

  • Multivariate

Numerical vs Numerical:

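  1. sns.pairplot() - plots the pairwise relationships in the data. For every pair of numeric columns it draws a scatter plot, with the distribution of each column on the diagonal; the parameters passed to pairplot() control what is drawn.
  2. sns.heatmap() - draws a matrix as a grid of colored cells so that its values can be compared at a glance. In EDA it is typically applied to the correlation matrix, df.corr().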

Categorical vs Categorical:

  1. sns.countplot(hue = ..) - as the name suggests, it shows the count of observations in each category using bars. With hue set to a second categorical column, each bar is split into one bar per value of that column, so a countplot behaves like a group of bar plots.
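As a rough sketch, a countplot of one categorical column split by another could look like this. The data and column names are hypothetical; pd.crosstab is included only to show the numbers behind the bars.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data with two categorical columns
df = pd.DataFrame({
    "department": ["HR", "IT", "IT", "Sales", "HR", "IT", "Sales", "IT"],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
})

# One group of bars per department, one bar per gender within each group
ax = sns.countplot(data=df, x="department", hue="gender")

# The same counts as a table
counts = pd.crosstab(df["department"], df["gender"])
print(counts)
```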

Categorical vs Numerical:

  1. sns.boxplot() - draws one box per category to show the distribution of a numeric variable: the median, the quartiles, and any outliers. There are many parameters that we can pass to the method depending on our needs.
  2. sns.pairplot(hue = ..) - the same pairplot as described above, with the points colored by a categorical column.

For reference and better understanding, an image showing all of the plots mentioned above is attached.
