🧪 Using Python with NumPy, Pandas, Matplotlib, and Seaborn for Data Analysis, Data Science & Pre-Machine Learning Analysis

Before any machine learning model is built, the real work lies in understanding, cleaning, transforming, and visualizing the data. This crucial phase is known as pre-machine learning analysis or exploratory data analysis (EDA).

In this post, we’ll cover how to use the most powerful Python libraries—NumPy, Pandas, Matplotlib, and Seaborn—for data analysis and pre-ML preparation.

Whether you're new to data science or sharpening your skills, this guide walks you through practical techniques to wrangle and understand your data before diving into algorithms.

🔧 The Essential Python Libraries for Data Analysis

Let’s briefly introduce the four core libraries:

NumPy – The foundation for numerical computing in Python. It’s great for array operations, math, and basic statistics.
Pandas – The go-to library for working with structured data (like CSV files, databases, spreadsheets).
Matplotlib – A flexible plotting library to create static charts and graphs.
Seaborn – Built on top of Matplotlib, it provides a high-level interface for beautiful and informative statistical plots.

🟢 Step 1: Import Libraries

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: set a default plot style
sns.set_theme(style='whitegrid')
# Jupyter-only magic; remove this line when running as a plain .py script
%matplotlib inline

📥 Step 2: Load and Inspect the Data

Let’s use the Titanic dataset as a running example (any tabular dataset, such as a marketing dataset, works the same way):

python
df = pd.read_csv('titanic.csv')
print(df.head())
df.info()  # prints its summary directly and returns None, so no print() is needed

Checklist:

Understand data types (int, float, object)
Check for missing values
Look at overall shape and sample rows
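
The checklist above can be sketched in a few lines. This is a minimal illustration on a hypothetical four-row frame called `sample`, with made-up values standing in for the real `titanic.csv`, so it runs without the file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the first lines of titanic.csv
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # data type of each column

missing = sample.isnull().sum()   # number of missing values per column
print(missing)
```

On the real dataset, the same three calls on `df` tell you which columns need type conversion or missing-value handling in Step 4.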

🧮 Step 3: Numeric Operations with NumPy

While Pandas handles most data tasks, NumPy shines in fast, vectorized operations.

python
# Example: Convert column to NumPy array
ages = df['Age'].values

# Get basic stats; the nan-aware versions skip missing ages (np.mean would return nan)
mean_age = np.nanmean(ages)
std_age = np.nanstd(ages)

NumPy Use Cases:

Matrix operations
Mathematical functions (e.g., np.log(), np.exp())
Random number generation (np.random)
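
Here is a small sketch of each use case listed above. The arrays and fare values are made up for illustration, and the seeded generator from `np.random.default_rng` is just one reasonable way to get reproducible random numbers:

```python
import numpy as np

# Matrix operations: multiply a 2x2 matrix by a vector
A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, 1.0])
product = A @ v   # each entry is the sum of one row of A

# Mathematical functions: np.exp undoes np.log
fares = np.array([7.25, 71.28, 8.05])   # made-up fare values
log_fares = np.log(fares)
restored = np.exp(log_fares)

# Random number generation with a seeded generator for reproducibility
rng = np.random.default_rng(42)
noise = rng.normal(loc=0.0, scale=1.0, size=5)
```

All three operations work on whole arrays at once, which is what "vectorized" means in practice: no explicit Python loop is needed.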

🧹 Step 4: Data Cleaning with Pandas

Preprocessing is key before any modeling begins.

Missing Values

python
# Find missing values
print(df.isnull().sum())

# Fill missing Age with the median (assign back rather than using inplace on a single column)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Drop rows with missing 'Embarked'
df.dropna(subset=['Embarked'], inplace=True)
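
To see what these two steps do, here is a minimal sketch on a hypothetical four-row frame called `toy` (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "Q"],
})

# median() skips NaN, so this is the median of 22, 30, and 40
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

# Drop only the rows where Embarked is missing
toy = toy.dropna(subset=["Embarked"])
print(toy)
```

Filling keeps every row, while dropping shrinks the dataset. That is why filling suits a column with many gaps like Age, and dropping is fine when only a few rows are affected.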

Encoding Categorical Variables

python
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
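
A quick way to check what the encoding produces is to run it on a tiny frame. The three rows below are hypothetical, chosen so that each port (C, Q, S) appears once:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Binary column: map each label to 0 or 1
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# Multi-class column: one indicator column per category, minus the first (C)
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)
print(encoded.columns.tolist())
```

With `drop_first=True`, the first category (C) gets no column of its own. A row from port C is represented by both remaining indicator columns being off.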

Feature Engineering

python
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
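
As a worked example with made-up passengers: `SibSp` counts siblings and spouses aboard, `Parch` counts parents and children aboard, and the `+ 1` adds the passenger themselves.

```python
import pandas as pd

toy = pd.DataFrame({
    "SibSp": [1, 0, 3],   # siblings/spouses aboard
    "Parch": [0, 0, 2],   # parents/children aboard
})

# +1 counts the passenger, so someone travelling alone has FamilySize 1
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
print(toy)
```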

Summary Statistics

python
print(df.describe())

📊 Step 5: Visualization with Matplotlib & Seaborn

Visualization reveals patterns and relationships that summary statistics alone can hide.

Univariate Analysis

Histogram of Age:

python
plt.hist(df['Age'], bins=30, edgecolor='black')
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

Seaborn Alternative:

python
sns.histplot(df['Age'], kde=True)

Categorical Data

Survival by Gender:

python
sns.countplot(x='Survived', hue='Sex', data=df)

Bivariate Relationships

Age vs Fare Scatterplot:

python
sns.scatterplot(x='Age', y='Fare', hue='Survived', data=df)

Boxplot:

python
sns.boxplot(x='Pclass', y='Age', data=df)

Correlation Heatmap

python
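plt.figure(figsize=(10, 6))
# numeric_only=True skips text columns such as Name, which pandas 2.0+ would otherwise reject
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()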

⚙️ Step 6: Feature Selection & Pre-Modeling Prep

At this point, you’re almost ready to start ML. But first:

Check Feature Relationships

python
print(df.corr(numeric_only=True)['Survived'].sort_values(ascending=False))
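
To make the ranking easier to read, you can restrict the frame to numeric columns and drop the target's correlation with itself. This is a sketch on a hypothetical six-row frame with invented values. It uses `select_dtypes` as one way to skip text columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1, 0],
    "Fare": [7.0, 8.0, 70.0, 50.0, 30.0, 10.0],
    "Pclass": [3, 3, 1, 1, 2, 3],
    "Name": ["A", "B", "C", "D", "E", "F"],   # text column, excluded below
})

corr_with_target = (
    toy.select_dtypes(include="number")   # keep numeric columns only
    .corr()["Survived"]                   # correlation of each column with the target
    .drop("Survived")                     # the target always correlates 1.0 with itself
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Keep in mind that this measures linear association only. A feature with a weak correlation can still be useful to a model.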

Drop Irrelevant Features

python
df.drop(columns=['Name', 'Ticket', 'Cabin'], inplace=True)

Normalize or Scale (if needed)

python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

Split Data for Modeling

python
from sklearn.model_selection import train_test_split

X = df.drop('Survived', axis=1)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now your data is clean, visualized, and split—ready for machine learning!
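
One caveat on ordering: if you scale features, a common practice is to split first and fit the scaler on the training rows only, so that no test-set statistics leak into training. The sketch below shows that pattern on randomly generated stand-in values (the `toy` frame is not the real Titanic data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-ins for the Age and Fare columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=50),
    "Fare": rng.uniform(5, 500, size=50),
})

# Split first: 40 training rows, 10 test rows
train, test = train_test_split(toy, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learn mean and std from training rows only
test_scaled = scaler.transform(test)         # reuse those statistics on the test rows
```

The training columns end up with mean 0 and standard deviation 1. The test columns will be close to that but not exact, which is expected.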

🧠 Bonus: Automating EDA with Pandas Profiling or Sweetviz

For quick exploration:

python
# pip install ydata-profiling  (the package formerly published as pandas-profiling)
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Titanic Report")
profile.to_file("titanic_report.html")

📌 Summary: What You Learned

Step	Description
1.	Import key Python libraries
2.	Load and inspect data with Pandas
3.	Perform math/stats operations using NumPy
4.	Clean and engineer features in Pandas
5.	Visualize data using Matplotlib and Seaborn
6.	Prepare data for machine learning

🎯 Why This Is Critical Before ML

Many beginners jump straight into machine learning algorithms without understanding their data. But in real-world data science:

Data preparation often takes the largest share of project time (70-80% is a commonly cited estimate)
Visualization guides feature selection
Cleaning prevents garbage-in, garbage-out
Understanding your data builds better models

📚 Resources to Go Deeper

Books:
- Python for Data Analysis by Wes McKinney
- Data Science from Scratch by Joel Grus
Courses:
- Kaggle Microcourses
- freeCodeCamp Data Analysis in Python
Practice Datasets:
- Kaggle
- UCI Machine Learning Repository

🚀 Final Thoughts

Mastering NumPy, Pandas, Matplotlib, and Seaborn gives you the foundation to analyze any dataset, spot key trends, and prepare your data for accurate machine learning. Before you feed your model, feed your brain with insights from your data.

Don't skip the analysis. It's where the real magic happens.

Korshub
