🧪 Using Python with NumPy, Pandas, Matplotlib, and Seaborn for Data Analysis, Data Science & Pre-Machine Learning Analysis

 Before any machine learning model is built, the real work lies in understanding, cleaning, transforming, and visualizing the data. This crucial phase is known as pre-machine learning analysis or exploratory data analysis (EDA).

In this post, we’ll cover how to use the most powerful Python libraries—NumPy, Pandas, Matplotlib, and Seaborn—for data analysis and pre-ML preparation.

Whether you're new to data science or sharpening your skills, this guide walks you through practical techniques to wrangle and understand your data before diving into algorithms.


🔧 The Essential Python Libraries for Data Analysis

Let’s briefly introduce the four core libraries:

  • NumPy – The foundation for numerical computing in Python. It’s great for array operations, math, and basic statistics.

  • Pandas – The go-to library for working with structured data (like CSV files, databases, spreadsheets).

  • Matplotlib – A flexible plotting library to create static charts and graphs.

  • Seaborn – Built on top of Matplotlib, it provides a high-level interface for beautiful and informative statistical plots.


🟢 Step 1: Import Libraries

python
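import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Optional: set a plot style
sns.set_theme(style='whitegrid')
# Jupyter notebooks only: render plots inline (remove this line in a plain .py script)
%matplotlib inline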

📥 Step 2: Load and Inspect the Data

Let’s use a sample dataset (e.g., Titanic or a marketing dataset):

python
df = pd.read_csv('titanic.csv')
print(df.head())  # first five rows
df.info()         # prints dtypes and non-null counts itself; wrapping it in print() adds a stray "None"

Checklist:

  • Understand data types (int, float, object)

  • Check for missing values

  • Look at overall shape and sample rows
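
The checklist above can be sketched on a tiny stand-in table. The four rows below are hypothetical values, not taken from the real Titanic file; only the column names follow the dataset used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first rows of a Titanic-style CSV.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

n_rows, n_cols = df.shape      # overall shape
dtypes = df.dtypes             # one dtype per column
missing = df.isna().sum()      # missing-value count per column

print(f"{n_rows} rows x {n_cols} columns")
print(dtypes)
print(missing[missing > 0])    # only the columns that have gaps
```

Filtering the counts with `missing > 0` keeps the output short on wide tables, where most columns are usually complete.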


🧮 Step 3: Numeric Operations with NumPy

While Pandas handles most data tasks, NumPy shines in fast, vectorized operations.

python
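# Example: convert a column to a NumPy array
ages = df['Age'].to_numpy()
# Age still contains missing values at this point, so use the NaN-aware functions
# (np.mean and np.std would both return nan)
mean_age = np.nanmean(ages)
std_age = np.nanstd(ages)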

NumPy Use Cases:

  • Matrix operations

  • Mathematical functions (e.g., np.log(), np.exp())

  • Random number generation (np.random)
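
Here is a minimal sketch of the three use cases listed above. The fare values are made up for illustration, and the seeded `default_rng` generator is one reasonable way to use `np.random`, not the only one.

```python
import numpy as np

# Matrix operations: multiply a 2x2 matrix by a vector.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, 1.0])
Av = A @ v                              # row sums: [1 + 2, 3 + 4]

# Mathematical functions apply element-wise; exp undoes log.
fares = np.array([7.25, 71.28, 8.05])   # hypothetical fare values
log_fares = np.log(fares)
roundtrip = np.exp(log_fares)

# Random number generation with a seeded Generator for reproducibility.
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=30.0, scale=10.0, size=5)

print(Av)
print(log_fares)
print(sample)
```

A log transform like this is a common way to tame a right-skewed column such as a price or fare before plotting or modeling.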


🧹 Step 4: Data Cleaning with Pandas

Preprocessing is key before any modeling begins.

Missing Values

python
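# Count missing values per column
print(df.isnull().sum())
# Fill missing Age with the median (assign the result back; calling
# fillna(..., inplace=True) on a selected column triggers warnings in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Drop rows with missing 'Embarked'
df.dropna(subset=['Embarked'], inplace=True)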
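
To see what the two strategies do, here is a small sketch on a hypothetical four-row frame (the values are invented). It assigns results back to `df` rather than modifying in place, which is an assumption about style, not a requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column.
df = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Embarked": ["S", "C", None, "Q"],
})

median_age = df["Age"].median()             # NaN is skipped: median of 20, 30, 40
df["Age"] = df["Age"].fillna(median_age)    # impute the missing age
df = df.dropna(subset=["Embarked"])         # drop the row with no port

print(median_age)
print(df)
```

Imputing keeps the row and fills in a plausible value, while dropping removes it entirely. Dropping is usually reasonable only when few rows are affected, as with `Embarked` here.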

Encoding Categorical Variables

python
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
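
A quick sketch of what these two encodings produce, using three made-up rows. The `dtype=int` argument is an addition for readability, so the dummy columns show 0/1 instead of True/False.

```python
import pandas as pd

# Hypothetical rows covering all three ports.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Label mapping: any value missing from the dict would become NaN.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# One-hot encoding: drop_first removes the first category ("C"),
# which is implied when the remaining dummy columns are all 0.
df = pd.get_dummies(df, columns=["Embarked"], drop_first=True, dtype=int)

print(df.columns.tolist())
print(df)
```

Dropping one dummy column avoids perfectly redundant features, which matters for linear models more than for tree-based ones.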

Feature Engineering

python
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

Summary Statistics

python
print(df.describe())

📊 Step 5: Visualization with Matplotlib & Seaborn

Data visualization helps you discover patterns and relationships in the data.

Univariate Analysis

Histogram of Age:

python
plt.hist(df['Age'], bins=30, edgecolor='black')
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

Seaborn Alternative:

python
sns.histplot(df['Age'], kde=True)

Categorical Data

Survival by Gender:

python
sns.countplot(x='Survived', hue='Sex', data=df)
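
If you want the numbers behind a count plot like this one, a cross-tabulation gives the same counts as a table. The sketch below uses six invented passengers and keeps `Sex` as text labels for readability.

```python
import pandas as pd

# Hypothetical passengers; the values are made up for illustration.
df = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1, 0],
    "Sex": ["male", "male", "female", "female", "male", "female"],
})

# Same counts the bars would show: rows = Survived, columns = Sex.
counts = pd.crosstab(df["Survived"], df["Sex"])

# Because Survived is 0/1, its mean per group is the survival rate.
rates = df.groupby("Sex")["Survived"].mean()

print(counts)
print(rates)
```

Rates are often easier to compare than raw counts when the groups differ in size.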

Bivariate Relationships

Age vs Fare Scatterplot:

python
sns.scatterplot(x='Age', y='Fare', hue='Survived', data=df)

Boxplot:

python
sns.boxplot(x='Pclass', y='Age', data=df)
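
A boxplot draws the median and quartiles of each group, so it can help to compute them directly. This is a rough sketch on invented ages for two classes; the `IQR` column name is just a label chosen here.

```python
import pandas as pd

# Hypothetical ages for two passenger classes.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [30.0, 40.0, 50.0, 10.0, 20.0, 30.0],
})

# Quartiles per class: one row per Pclass, one column per quantile.
quartiles = (
    df.groupby("Pclass")["Age"]
      .quantile([0.25, 0.5, 0.75])
      .unstack()
)

# Interquartile range = height of each box in the plot.
quartiles["IQR"] = quartiles[0.75] - quartiles[0.25]

print(quartiles)
```

The 0.5 column is the line inside each box, and the 0.25 and 0.75 columns are the box edges.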

Correlation Heatmap

python
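plt.figure(figsize=(10, 6))
# numeric_only=True skips text columns such as Name, which df.corr() rejects in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()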

⚙️ Step 6: Feature Selection & Pre-Modeling Prep

At this point, you’re almost ready to start ML. But first:

Check Feature Relationships

python
print(df.corr(numeric_only=True)['Survived'].sort_values(ascending=False))
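
Strong negative correlations matter as much as strong positive ones, so it can be useful to rank by absolute value. The sketch below does that on four invented rows; it assumes pandas 1.5 or newer for the `numeric_only` argument.

```python
import pandas as pd

# Hypothetical rows; values are made up for illustration.
df = pd.DataFrame({
    "Survived": [0, 0, 1, 1],
    "Fare": [5.0, 10.0, 50.0, 80.0],
    "Pclass": [3, 3, 1, 1],
    "Name": ["a", "b", "c", "d"],   # text column, excluded by numeric_only
})

corr_with_target = (
    df.corr(numeric_only=True)["Survived"]
      .drop("Survived")                       # the target always correlates 1.0 with itself
      .sort_values(key=abs, ascending=False)  # rank by strength, keep the sign
)

print(corr_with_target)
```

Keep in mind that correlation only captures linear relationships, so a low value here does not prove a feature is useless.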

Drop Irrelevant Features

python
df.drop(columns=['Name', 'Ticket', 'Cabin'], inplace=True)

Normalize or Scale (if needed)

python
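from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Note: fitting on the full dataset lets test rows influence the scaling statistics;
# when evaluating a model, fit the scaler on the training split only
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])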

Split Data for Modeling

python
from sklearn.model_selection import train_test_split
X = df.drop('Survived', axis=1)
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
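
One common way to combine scaling and splitting is to split first and then fit the scaler on the training rows only, so nothing about the test rows leaks into preprocessing. The sketch below shows that order on randomly generated stand-in data; the 50 rows and their value ranges are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 50 random passengers.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=50),
    "Fare": rng.uniform(5, 500, size=50),
    "Survived": rng.integers(0, 2, size=50),
})

X = df.drop(columns="Survived")
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_test_scaled.mean(axis=0))   # generally not exactly 0, and that is expected
```

The test columns will not have a mean of exactly zero after scaling, because the statistics came from the training rows.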

Now your data is clean, visualized, and split—ready for machine learning!


🧠 Bonus: Automating EDA with ydata-profiling (formerly Pandas Profiling) or Sweetviz

For quick exploration:

python
# pip install ydata-profiling  (the pandas-profiling package was renamed)
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Titanic Report")
profile.to_file("titanic_report.html")

📌 Summary: What You Learned

Step  Description
1.    Import key Python libraries
2.    Load and inspect data with Pandas
3.    Perform math/stats operations using NumPy
4.    Clean and engineer features in Pandas
5.    Visualize data using Matplotlib and Seaborn
6.    Prepare data for machine learning

🎯 Why This Is Critical Before ML

Most beginners jump straight into machine learning algorithms without understanding their data. But in real-world data science:

  • A commonly cited rule of thumb puts data preparation at 70-80% of project time

  • Visualization guides feature selection

  • Cleaning prevents garbage-in, garbage-out

  • Understanding your data builds better models


🚀 Final Thoughts

Mastering NumPy, Pandas, Matplotlib, and Seaborn gives you the foundation to analyze any dataset, spot key trends, and prepare your data for accurate machine learning. Before you feed your model, feed your brain with insights from your data.

Don't skip the analysis. It's where the real magic happens.
