Exploratory data analysis
The most enjoyable part of the job
⌛  ~50 min · 🗿  Beginner
28.09.2022
#16

This post is part of the Working with data educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while the order in Research can be arbitrary.

I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!


Exploratory data analysis (EDA) is a crucial step in any data-driven project, as it helps you develop an intuitive understanding of the dataset before diving into modeling or sophisticated algorithms. The term was popularized by John Tukey in the 1970s, who emphasized the importance of "letting the data speak for itself" rather than imposing strict statistical hypotheses too early. In modern machine learning and data science workflows, EDA remains equally essential, because it provides insights into a dataset's structure, highlights possible anomalies, and guides the choice of subsequent techniques.

Common objectives of EDA include:

  • Identifying potential data quality issues such as missing values, outliers, duplicates, and incorrect data types.
  • Uncovering patterns, trends, and relationships between variables that can inform feature engineering and modeling strategies.
  • Gaining a broad statistical overview (e.g., means, medians, standard deviations, correlation coefficients) to assess distribution properties.
  • Visualizing the data to detect clusters or groupings, anomalies, or interesting structures that might not be obvious from summary statistics alone.

Several powerful tools exist for EDA in the Python ecosystem. Most workflows begin with pandas for data loading and cleaning, then utilize plotting libraries such as matplotlib for fundamental charting and seaborn for statistical visualizations. Finally, for interactive and shareable dashboards, plotly is an increasingly popular choice. This article walks through a typical EDA workflow, exploring the practical uses of these libraries, the theory behind the techniques, and helpful tips for ensuring your analysis remains robust and insightful.

[Image: "An overview of the EDA workflow"]
Caption: "Exploratory Data Analysis often begins with data loading and cleaning, followed by descriptive statistics and various forms of visualization."

Data loading and cleaning with pandas

Perhaps the most fundamental step in your EDA journey involves importing, merging, and cleaning datasets. Before generating any plots or running statistical tests, it is vital to ensure your data is properly formatted, consistent, and free of major errors. This step is often the most time-consuming but is indispensable for a trustworthy analysis.

Importing data from various sources (CSV, Excel, SQL, etc.)

The pandas library simplifies data ingestion from multiple formats and sources:


import pandas as pd

# Load data from a CSV file
df_csv = pd.read_csv("data.csv")

# Load data from an Excel file
df_excel = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Load data from a SQL database
import sqlite3
conn = sqlite3.connect("database.db")
df_sql = pd.read_sql_query("SELECT * FROM table_name", conn)

Most real-world projects require combining multiple files or tables into a single unified dataset. pandas.merge() and pandas.concat() are frequently used for table joins or vertical concatenations, respectively. These operations help create a consolidated dataframe suitable for analysis.
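As a minimal sketch of both operations (the join key customer_id here is hypothetical), reusing the dataframes loaded above:


# Join two tables on a shared key (hypothetical column name)
df_merged = pd.merge(df_csv, df_excel, on="customer_id", how="left")

# Stack two dataframes with the same columns vertically
df_combined = pd.concat([df_csv, df_excel], ignore_index=True)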

Handling missing values and outliers

Missing data appears in almost every dataset and can significantly impact the results of an analysis. Common causes of missing values include incomplete data collection, user input errors, and merges that introduce mismatched records. Handling these is context-dependent but often involves one of the following:

  • Dropping missing rows: Useful when only a small percentage of entries are missing, or when the missingness appears random and does not bias the dataset (see the short sketch after this list).
  • Imputing missing values: Replacing NaN entries with some approximation, such as the mean, median, mode, or a prediction model specifically trained for imputation (imputation models might include kNN, regression, or MICE, i.e. Multiple Imputation by Chained Equations).
  • Leaving them as is: Sometimes missing values carry semantic meaning, e.g., "no response" for certain types of survey data.
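For the dropping option, a minimal sketch looks like this (dropna with no arguments removes any row containing at least one missing value; pass subset= to target specific columns):


# Count missing values per column
print(df_csv.isna().sum())

# Drop rows that contain any missing value
df_dropped = df_csv.dropna()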

An example of applying simple imputation with the mean:


# Mean imputation for a specific column
# (assigning back avoids the chained-assignment pitfalls of inplace=True)
df_csv["column_of_interest"] = df_csv["column_of_interest"].fillna(
    df_csv["column_of_interest"].mean()
)

Outliers — extreme or inconsistent points — can skew your distributions and lead to incorrect conclusions. One way to detect outliers is via the Interquartile Range (IQR) method:

Q_1 = \text{the first quartile}, \quad Q_3 = \text{the third quartile}, \quad IQR = Q_3 - Q_1

Any data point x that satisfies:

x < Q_1 - 1.5 \times IQR \quad \text{or} \quad x > Q_3 + 1.5 \times IQR

is often treated as a potential outlier (though domain knowledge should guide any decision to remove or cap these values). Modern robust techniques such as DBSCAN (for outlier detection) or robust scalers (which reduce the influence of outliers) might also be used depending on the context (Wu et al., NeurIPS 2023, introduced a semi-supervised approach for automatic outlier detection in high-dimensional data, illustrating advanced strategies for handling complex scenarios).
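In pandas, the IQR rule can be sketched as follows (the column name is hypothetical):


# Compute quartiles and the IQR for a numeric column
q1 = df_csv["column_of_interest"].quantile(0.25)
q3 = df_csv["column_of_interest"].quantile(0.75)
iqr = q3 - q1

# Flag values outside the 1.5 * IQR fences as potential outliers
outlier_mask = (
    (df_csv["column_of_interest"] < q1 - 1.5 * iqr)
    | (df_csv["column_of_interest"] > q3 + 1.5 * iqr)
)
print(f"Potential outliers: {outlier_mask.sum()}")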

Data transformations and feature engineering

Transforming variables into more suitable representations can illuminate hidden patterns and make modeling more effective. These transformations might include:

  • Log transformations for highly skewed distributions.
  • Scaling (standardization or min-max normalization) for variables on vastly different numeric ranges.
  • Combining categories to reduce dimensionality or group rare classes.
  • Feature extraction (e.g., extracting day of week from a timestamp or text-based features from string fields).

In practice, a thorough EDA might reveal that certain features follow a power-law distribution, prompting a log or Box-Cox transformation. Or you might discover unexpected category duplications (like "Hot", "HOT", "Hot\n") caused by inconsistent data entry. Resolving these issues ensures more consistent and accurate downstream analysis.
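A minimal sketch of these two fixes (the column names are hypothetical):


import numpy as np

# Log-transform a highly skewed, non-negative feature
df_csv["log_feature"] = np.log1p(df_csv["skewed_feature"])

# Normalize inconsistent category labels such as "Hot", "HOT", "Hot\n"
df_csv["category_col"] = df_csv["category_col"].str.strip().str.capitalize()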

Summarizing data using descriptive statistics

Pandas offers convenient methods for generating quick summaries:


# Displays basic descriptive statistics for each column
df_csv.describe()

# Lists column names, data types, and memory usage
df_csv.info()

These commands highlight the most common data types, shape, range of values, and central tendencies (mean, median, etc.), which usually serve as the first indicators of potential data anomalies or relationships worth exploring further. You might also consider data profiling libraries (e.g., pandas-profiling or ydata-profiling) for automated generation of summary reports, histograms, correlation matrices, and more.
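If you go the profiling route, a minimal sketch with ydata-profiling (assuming the package is installed) might look like this:


from ydata_profiling import ProfileReport

# Generate an HTML report with summaries, histograms, and correlations
profile = ProfileReport(df_csv, title="EDA Report")
profile.to_file("eda_report.html")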

Visualizing data with matplotlib

matplotlib is the most widely used base library for creating static plots in Python. While other tools (like seaborn and plotly) build upon matplotlib's functionalities, understanding its core principles gives you granular control over plot aesthetics and layout.

Creating basic plots (line, bar, scatter)

A handful of basic plot types can cover a surprising range of scenarios. For instance:


import matplotlib.pyplot as plt

# Line plot
plt.plot(df_csv["time"], df_csv["sensor_reading"])
plt.title("Sensor Reading Over Time")
plt.xlabel("Time")
plt.ylabel("Reading")
plt.show()

# Bar plot
categories = df_csv["category_col"].value_counts()
plt.bar(categories.index, categories.values)
plt.title("Category Distribution")
plt.xlabel("Category")
plt.ylabel("Count")
plt.show()

# Scatter plot
plt.scatter(df_csv["feature1"], df_csv["feature2"])
plt.title("Relationship between Feature1 and Feature2")
plt.xlabel("Feature1")
plt.ylabel("Feature2")
plt.show()

The line plot is useful for time-series analysis or continuous signals, the bar plot is common for categorical data counts, and the scatter plot reveals pairwise relationships or clusters in the data.

Customizing plots (titles, labels, legends)

Beyond these basic operations, matplotlib provides a highly flexible architecture for customizing every element, including plot titles, axes labels, legends, and annotations. For example:


plt.figure(figsize=(8, 6))
plt.scatter(df_csv["x"], df_csv["y"], color="green", marker="x", alpha=0.7)
plt.title("Customized Scatter Plot")
plt.xlabel("X Values")
plt.ylabel("Y Values")
plt.legend(["Data Points"], loc="upper left")
plt.grid(True)
plt.show()

This snippet demonstrates how to configure figure size, marker style, transparency (alpha), and a legend. These tweaks often increase the clarity of your plots and emphasize the key insight you want to convey.

Handling multiple plots and subplots

When comparing variables side-by-side or illustrating multiple features at once, you can employ subplots:


fig, axs = plt.subplots(1, 2, figsize=(12, 5))
axs[0].hist(df_csv["featureA"], bins=20, color="blue")
axs[0].set_title("Distribution of Feature A")
axs[1].boxplot(df_csv["featureB"].dropna())
axs[1].set_title("Boxplot of Feature B")
plt.tight_layout()
plt.show()

Subplots help present different distributions or relationships together, which is especially valuable in EDA where you often compare multiple aspects of a dataset at a glance.

Saving and exporting visualizations

In a collaborative environment, you typically share plots as images or embed them into reports:


# Save current figure as a PNG image
plt.savefig("my_plot.png", dpi=300)
plt.close()

This approach is quite helpful for versioning your visualizations or including them in notebooks and presentations.

Statistical plots with seaborn

seaborn is a Python data visualization library built on top of matplotlib. It offers a high-level interface for drawing attractive and informative statistical plots, making it well-suited for quickly exploring relationships in your data.

Distribution plots (histogram, KDE, boxplot)

Seaborn's hallmark is simplifying the creation of distribution-oriented visualizations that reveal the underlying statistical patterns. For instance:


import seaborn as sns

# Histogram
sns.histplot(data=df_csv, x="featureC", bins=30, kde=False)
plt.title("Histogram of Feature C")
plt.show()

# Kernel density estimate (KDE)
sns.kdeplot(data=df_csv, x="featureC", fill=True)
plt.title("KDE of Feature C")
plt.show()

# Boxplot
sns.boxplot(data=df_csv, x="category_col", y="numerical_col")
plt.title("Boxplot grouped by Category")
plt.show()

  • Histogram: Displays frequency distribution by dividing the data into bins.
  • KDE: Provides a smooth curve representing the continuous probability density function of a variable.
  • Boxplot: Emphasizes medians, quartiles, and potential outliers. It is a staple of EDA for quickly spotting distribution asymmetry or extreme values.

Relational plots (scatter, line)

Seaborn enhances basic relational plots with built-in regression lines, confidence intervals, or additional grouping:


# Scatter plot with regression line
sns.regplot(data=df_csv, x="feature1", y="feature2", scatter_kws={"alpha":0.5})
plt.title("Scatter + Regression Line")
plt.show()

# Line plot with grouping
sns.lineplot(data=df_csv, x="time", y="value", hue="group_col")
plt.title("Line Plot by Group")
plt.show()

Here, regplot automatically fits a simple linear regression line to help you see the correlation between two variables. Similarly, lineplot can group lines by a categorical variable, making it effortless to compare multiple subgroups.

Categorical plots (countplot, barplot, violinplot)

When analyzing categorical variables, seaborn offers specialized plots:


import numpy as np

# Countplot
sns.countplot(data=df_csv, x="category_col")
plt.title("Count of Each Category")
plt.show()

# Barplot (aggregates a numerical value by category)
sns.barplot(data=df_csv, x="category_col", y="numerical_col", estimator=np.mean, errorbar="sd")
plt.title("Average Value per Category with Std. Deviation")
plt.show()

# Violinplot (combines boxplot + KDE)
sns.violinplot(data=df_csv, x="category_col", y="numerical_col")
plt.title("Violin Plot")
plt.show()

A violinplot merges the concept of a boxplot with a KDE, showing both the quartiles and the probability distribution shape of the data, which can be more informative than a simple boxplot alone.

Advanced customization and styling

You can style all seaborn plots globally using sns.set_theme() or switch among different built-in themes:


sns.set_theme(style="whitegrid", palette="muted")
sns.boxplot(data=df_csv, x="category_col", y="value_col")
plt.title("Styled Boxplot")
plt.show()

Additionally, advanced developers often mix seaborn's high-level syntax with the fine-tuned capabilities of matplotlib for more specialized customizations.
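A minimal sketch of this mix (the column names are hypothetical): seaborn draws onto a matplotlib Axes, which you can then adjust directly.


fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=df_csv, x="category_col", y="value_col", ax=ax)

# Fine-tune with plain matplotlib calls
ax.set_title("Styled Boxplot with Matplotlib Tweaks")
ax.set_ylabel("Value")
ax.tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()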

Interactive visualizations with plotly

plotly is another powerful visualization library that generates interactive charts. It is particularly helpful for dashboards or web-based demos, enabling users to hover, zoom, and filter data in real-time.

Setting up and using plotly in Python

To begin:


import plotly.express as px

# Quick example with plotly.express
fig = px.scatter(df_csv, x="feature1", y="feature2", color="category_col")
fig.show()

This snippet creates an interactive scatter plot with color encoding for categories. By default, you can hover over points to see underlying values, and you can pan or zoom within the chart area.

Creating interactive charts and dashboards

Plotly supports many chart types (line, scatter, bar, pie, choropleth, 3D scatter, etc.) and can integrate with Dash for building full-featured web apps and dashboards. For example, a quick interactive bar chart:


fig = px.bar(df_csv, x="category_col", y="numerical_col", title="Interactive Bar Chart")
fig.update_layout(barmode='group')
fig.show()

You can further customize each plot's layout, color scheme, and interactive tooltips.
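As for the Dash integration, a minimal sketch of embedding the figure above in a tiny app might look like this (assuming Dash 2.x is installed):


from dash import Dash, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    html.H3("EDA Dashboard"),
    dcc.Graph(figure=fig),  # the plotly figure created above
])

if __name__ == "__main__":
    app.run(debug=True)  # use app.run_server() on older Dash versions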

Plotly express vs. plotly graph_objects

plotly.express offers a simple, concise syntax for generating many standard figures quickly. If you require more granular control, the graph_objects API exposes each plot element directly:


import plotly.graph_objects as go

trace = go.Scatter(
    x=df_csv["feature1"], 
    y=df_csv["feature2"],
    mode="markers",
    marker=dict(size=8, color="blue", opacity=0.6),
    name="Data Points"
)

layout = go.Layout(
    title="Customized Scatter Plot",
    xaxis=dict(title="Feature 1"),
    yaxis=dict(title="Feature 2")
)

fig = go.Figure(data=[trace], layout=layout)
fig.show()

The trade-off is that graph_objects requires more verbose code but grants greater flexibility in customizing each chart.

Exporting interactive plots for sharing

Plotly figures can be exported as static images (PNG, SVG) or HTML files:


# Export to an HTML file
fig.write_html("my_interactive_plot.html")

This approach preserves the interactivity, making it easy to distribute or embed the plot in internal documentation, wikis, or Jupyter notebooks.
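If you also need a static snapshot rather than an HTML file, plotly's write_image can produce one, though it requires the kaleido package to be installed:


# Export to a static PNG (requires the kaleido package)
fig.write_image("my_interactive_plot.png")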

[Image: "Plotly interactive histogram example"]
Caption: "Interactive plots allow panning, zooming, and tooltips for deeper insights."

Combining multiple libraries for EDA

Each library — pandas, matplotlib, seaborn, and plotly — has its unique strengths, but a robust EDA typically integrates them in a complementary manner. For instance:

  1. Data loading and cleaning with pandas: Quickly load various data sources, merge them, handle missing values, and create derived features.
  2. Preliminary investigations and basic plotting with matplotlib: Acquire a first impression of distributions and relationships.
  3. In-depth statistical visualization with seaborn: Uncover deeper insights into the data's structure and relationships, e.g., correlation heatmaps, advanced boxplots, or regression lines.
  4. Interactive dashboards with plotly: Allow colleagues, stakeholders, or your future self to interact with the data and discover patterns beyond static visuals.

A comprehensive EDA workflow might look like this:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# 1. Load and clean data
df = pd.read_csv("combined_data.csv")
df.dropna(subset=["critical_column"], inplace=True)
df["log_feature"] = df["skewed_feature"].apply(lambda x: np.log1p(x))

# 2. Preliminary summary
print(df.describe())
df.info()  # info() prints its summary directly, no print() needed

# 3. Quick histograms/boxplots (matplotlib)
plt.hist(df["log_feature"], bins=30)
plt.title("Distribution of Log-Transformed Feature")
plt.show()

# 4. Statistical scatter plot (seaborn)
sns.scatterplot(data=df, x="feature1", y="feature2", hue="category_col")
plt.title("Seaborn Scatter Plot with Category Hue")
plt.show()

# 5. Interactive analysis (plotly)
fig = px.scatter(df, x="feature1", y="feature2", color="category_col")
fig.update_layout(title="Interactive Scatter Plot")
fig.show()

By taking advantage of each library's unique capabilities, you create a more holistic picture of your dataset. This synergy not only reveals interesting aspects of the data but also helps build trust in your eventual models, providing confidence that no critical pattern or anomaly was overlooked in the exploratory phase.


EDA sets the stage for all subsequent analytical or modeling steps. From verifying data integrity to highlighting subtle relationships, it empowers you to confidently decide on feature engineering, model selection, and hyperparameter tuning. By combining efficient data handling with powerful visualizations and statistical methods, your EDA workflow can become a natural extension of the scientific process — continually testing hypotheses, refining insights, and surfacing new questions. The techniques and tools discussed here form a foundation that you will repeatedly refine and adapt as you tackle increasingly complex datasets in your data science and machine learning journey.
