NumPy Data Structure: ndarray
The ndarray (N-dimensional array) is the core object in NumPy.
It can represent:
- Scalars (0D)
- Vectors (1D)
- Matrices (2D)
- Tensors (3D or higher)
Example 1: Creating Arrays
import numpy as np
# 1D Array
arr1 = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr1)
# 2D Array (Matrix)
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", arr2)
# 3D Array
arr3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print("3D Array:\n", arr3)
Array Attributes
Every array has useful attributes that describe it:
print("Shape:", arr2.shape) # (2, 3)
print("Dimensions:", arr2.ndim) # 2
print("Data type:", arr2.dtype) # int64
print("Size:", arr2.size) # 6 elements
print("Item size:", arr2.itemsize, "bytes")
Array Creation Functions
NumPy provides many built-in functions for creating arrays easily:
np.zeros((2,3)) # 2x3 matrix of zeros
np.ones((3,2)) # 3x2 matrix of ones
np.eye(3) # 3x3 Identity matrix
np.arange(0,10,2) # [0, 2, 4, 6, 8]
np.linspace(0, 1, 5) # [0. , 0.25, 0.5 , 0.75, 1.]
np.random.rand(2,3) # Random 2x3 array
Array Indexing and Slicing
Example:
arr = np.array([[10, 20, 30], [40, 50, 60]])
print("First row:", arr[0])
print("Element at (1,2):", arr[1, 2])
print("Slicing (rows 0–1, columns 0–1):\n", arr[0:2, 0:2])
You can also modify values directly:
arr[0, 0] = 99
print("Modified array:\n", arr)
Array Operations
NumPy performs element-wise operations efficiently.
Arithmetic Operations:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("Addition:", a + b)
print("Subtraction:", a - b)
print("Multiplication:", a * b)
print("Division:", a / b)
Mathematical Functions:
x = np.array([1, 4, 9, 16, 25])
print("Square root:", np.sqrt(x))
print("Exponent:", np.exp(x))
print("Log:", np.log(x))
print("Sum:", np.sum(x))
print("Mean:", np.mean(x))
print("Standard Deviation:", np.std(x))
Matrix Operations
NumPy makes linear algebra easy.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print("Matrix addition:\n", A + B)
print("Matrix multiplication:\n", np.dot(A, B)) # equivalently: A @ B
print("Transpose:\n", A.T)
print("Determinant:", np.linalg.det(A))
print("Inverse:\n", np.linalg.inv(A))
Broadcasting
Broadcasting lets NumPy perform arithmetic on arrays of different shapes.
Example:
a = np.array([1, 2, 3])
b = 2
print("Add scalar:", a + b)
# 2D array with 1D
A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([10, 20, 30])
print("Broadcasting addition:\n", A + B)
Boolean Indexing and Filtering
You can filter elements that satisfy a condition.
arr = np.array([10, 20, 30, 40, 50])
print("Elements > 25:", arr[arr > 25])
Reshaping and Flattening Arrays
arr = np.arange(1, 10)
arr2 = arr.reshape(3, 3)
print("Reshaped 3x3:\n", arr2)
print("Flattened:", arr2.flatten())
Aggregation Functions
NumPy offers many functions for summarizing data:
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Sum:", np.sum(arr))
print("Column-wise Sum:", np.sum(arr, axis=0))
print("Row-wise Mean:", np.mean(arr, axis=1))
Random Module in NumPy
NumPy’s random module is useful for generating random numbers — essential for simulations and ML model initialization.
from numpy import random
print("Random integer:", random.randint(10))
print("Random array:\n", random.randint(1, 100, size=(3,3)))
print("Random float array:\n", random.rand(2,2))
print("Random choice:", random.choice([10, 20, 30, 40]))
NumPy in Machine Learning
NumPy is used for:
- Feature scaling & normalization
- Matrix operations in neural networks
- Data preprocessing
- Implementing algorithms from scratch
Example: Simple Linear Regression (Manual using NumPy)
import numpy as np
# Training data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
# Mean
X_mean, y_mean = np.mean(X), np.mean(y)
# Calculate slope (m) and intercept (c)
m = np.sum((X - X_mean)*(y - y_mean)) / np.sum((X - X_mean)**2)
c = y_mean - m * X_mean
print("Equation: y =", round(m, 2), "x +", round(c, 2))
print("Prediction for x=6:", m*6 + c)
Summary
- NumPy is the backbone of data handling in Python.
- It provides speed, power, and flexibility for numerical operations.
- Used in almost every area of data science, AI, and ML.
NumPy Complete Demonstration Program
import numpy as np
print("========== NUMPY COMPLETE DEMONSTRATION ==========\n")
# 1️⃣ ARRAY CREATION METHODS
arr1 = np.array([1, 2, 3])
arr2 = np.array([[1, 2], [3, 4]])
zeros_arr = np.zeros((2, 3))
ones_arr = np.ones((3, 3))
empty_arr = np.empty((2, 2)) # uninitialized memory; values are arbitrary
arange_arr = np.arange(1, 11, 2)
lin_arr = np.linspace(0, 1, 5)
print("Array 1 =", arr1)
print("Array 2 =\n", arr2)
print("Zeros Array =\n", zeros_arr)
print("Ones Array =\n", ones_arr)
print("Empty Array =\n", empty_arr)
print("Arange =", arange_arr)
print("Linspace =", lin_arr, "\n")
# 2️⃣ ARRAY PROPERTIES
print("Shape:", arr2.shape)
print("Size:", arr2.size)
print("Data Type:", arr2.dtype)
print("Dimension:", arr2.ndim, "\n")
# 3️⃣ INDEXING & SLICING
print("arr2[0][1] =", arr2[0][1])
print("Slice arr2[0,:] =", arr2[0, :])
print("Slice arr2[:,1] =", arr2[:, 1], "\n")
# 4️⃣ MATHEMATICAL OPERATIONS
a = np.array([10, 20, 30])
b = np.array([1, 2, 3])
print("a + b =", a + b)
print("a - b =", a - b)
print("a * b =", a * b)
print("a / b =", a / b)
print("a ** 2 =", a ** 2)
print("Sin(a) =", np.sin(a), "\n")
# 5️⃣ BROADCASTING
mat = np.array([[1, 2, 3], [4, 5, 6]])
print("Matrix + 10 =\n", mat + 10, "\n")
# 6️⃣ RESHAPING
r = np.arange(1, 13)
reshaped = r.reshape(3, 4)
print("Reshaped 1-12 into 3x4:\n", reshaped, "\n")
# 7️⃣ FLATTEN & RAVEL
print("Flatten:", reshaped.flatten())
print("Ravel:", reshaped.ravel(), "\n")
# 8️⃣ STACKING ARRAYS
vstack_arr = np.vstack((arr1, b))
hstack_arr = np.hstack((arr1, b))
print("Vertical Stack:\n", vstack_arr)
print("Horizontal Stack:", hstack_arr, "\n")
# 9️⃣ SPLITTING ARRAYS
split_arr = np.array([10, 20, 30, 40, 50, 60])
print("Split:", np.split(split_arr, 3), "\n")
# 🔟 STATISTICAL FUNCTIONS
stats = np.array([10, 20, 30, 40, 50])
print("Mean =", np.mean(stats))
print("Median =", np.median(stats))
print("Standard Deviation =", np.std(stats))
print("Variance =", np.var(stats))
print("Sum =", np.sum(stats))
print("Min =", np.min(stats))
print("Max =", np.max(stats), "\n")
# 1️⃣1️⃣ RANDOM MODULE
rand_arr = np.random.rand(3, 3)
rand_int_arr = np.random.randint(1, 50, size=5)
normal_arr = np.random.normal(0, 1, 5)
print("Random (0-1):\n", rand_arr)
print("Random Integers:", rand_int_arr)
print("Normal Distribution:", normal_arr, "\n")
# 1️⃣2️⃣ SORTING
unsorted = np.array([40, 10, 50, 20, 30])
print("Sorted:", np.sort(unsorted), "\n")
# 1️⃣3️⃣ LOGICAL OPERATIONS
logic_arr = np.array([10, 20, 30, 40, 50])
print("logic_arr > 25:", logic_arr > 25)
print("Elements > 25:", logic_arr[logic_arr > 25], "\n")
# 1️⃣4️⃣ COPY vs VIEW
original = np.array([1, 2, 3, 4])
view_arr = original.view()
copy_arr = original.copy()
original[0] = 99
print("Original:", original)
print("View (changes with original):", view_arr)
print("Copy (independent):", copy_arr, "\n")
# 1️⃣5️⃣ ITERATING ARRAYS
print("Iterating:")
for i in reshaped:
    print(i)
print()
# 1️⃣6️⃣ LINEAR ALGEBRA
matrix_A = np.array([[1, 2], [3, 4]])
matrix_B = np.array([[5, 6], [7, 8]])
print("Matrix Multiplication:\n", np.dot(matrix_A, matrix_B))
print("Transpose:\n", np.transpose(matrix_A))
print("Determinant:", np.linalg.det(matrix_A))
print("Inverse:\n", np.linalg.inv(matrix_A), "\n")
# 1️⃣7️⃣ UFUNCS (Universal Functions)
arr = np.array([1, 4, 9, 16])
print("Square Root =", np.sqrt(arr))
print("Log =", np.log(arr))
print("Exp =", np.exp(arr), "\n")
# 1️⃣8️⃣ FILE OPERATIONS
np.savetxt("numbers.csv", stats, delimiter=",")
loaded = np.loadtxt("numbers.csv", delimiter=",")
print("Saved and Loaded Array:", loaded)
# --------------------------------------------------------------
print("\n========= END OF NUMPY DEMONSTRATION =========")
Pandas (Python Data Analysis Library)
1. Introduction to Pandas
Pandas is the Python Data Analysis Library (its name derives from "panel data"). It is built on top of NumPy and provides high-level data structures and data analysis tools. It helps to clean, transform, analyze, and visualize data easily, especially tabular data (rows and columns) just like in Excel or SQL.
2. Features of Pandas
- Data Structures: provides Series (1D) and DataFrame (2D) for data manipulation
- Data Handling: handles missing, duplicate, and inconsistent data efficiently
- File Operations: supports reading/writing data from CSV, Excel, JSON, SQL, etc.
- Data Alignment: automatic data alignment and indexing
- Fast Operations: built on NumPy, hence highly optimized for performance
- Data Analysis: supports grouping, merging, joining, and pivoting operations
3. Pandas Data Structures
(a) Series
A 1-dimensional labeled array (like a column in a table).
Can hold data of any type (integer, string, float, Python objects).
Example:
import pandas as pd
# Creating a Series
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(s)
print("Access element with label 'c':", s['c'])
Output:
a 10
b 20
c 30
d 40
e 50
dtype: int64
Access element with label 'c': 30
(b) DataFrame
A 2-dimensional labeled data structure with rows and columns (like an Excel sheet or SQL table). Each column can be a Series with a different data type.
Example:
import pandas as pd
# Creating a DataFrame using a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'City': ['Delhi', 'Mumbai', 'Chennai']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 Delhi
1 Bob 30 Mumbai
2 Charlie 22 Chennai
4. Reading and Writing Files
Pandas can read and write many file formats directly.
- CSV: read_csv(), to_csv()
- Excel: read_excel(), to_excel()
- JSON: read_json(), to_json()
- SQL: read_sql(), to_sql()
Example (Read/Write CSV):
# Read a CSV file
df = pd.read_csv('data.csv')
# Write DataFrame to CSV
df.to_csv('output.csv', index=False)
5. Common DataFrame Operations
(a) Viewing Data
print(df.head()) # First 5 rows
print(df.tail(2)) # Last 2 rows
print(df.info()) # Summary (columns, types, memory)
print(df.describe()) # Statistical summary of numeric columns
(b) Selecting Columns and Rows
print(df['Name']) # Select single column
print(df[['Name', 'City']]) # Select multiple columns
print(df.iloc[0]) # Select row by index
print(df.loc[1, 'City']) # Select specific cell (row=1, col='City')
(c) Filtering Data
# Select rows where Age > 25
print(df[df['Age'] > 25])
(d) Adding or Modifying Columns
df['Country'] = 'India' # Add a new column
df['Age+5'] = df['Age'] + 5 # Add computed column
(e) Handling Missing Data
df.dropna(inplace=True) # Remove rows with missing values
df.fillna(0, inplace=True) # Alternative: replace missing values with 0 instead of dropping
(f) Sorting Data
df.sort_values(by='Age', ascending=False, inplace=True)
(g) Grouping Data
data = {'City': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
'Sales': [200, 150, 400, 300]}
df = pd.DataFrame(data)
grouped = df.groupby('City')['Sales'].sum()
print(grouped)
Output:
City
Delhi 600
Mumbai 450
Name: Sales, dtype: int64
6. Combining DataFrames
Concatenation
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2])
print(result)
Merging (like SQL Join)
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Marks': [90, 80, 70]})
merged = pd.merge(df1, df2, on='ID')
print(merged)
7. Data Cleaning Example
data = {
'Name': ['Alice', 'Bob', None, 'David'],
'Age': [25, None, 22, 30],
'City': ['Delhi', 'Mumbai', 'Chennai', None]
}
df = pd.DataFrame(data)
print("Before Cleaning:\n", df)
df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean(), 'City': 'Unknown'}, inplace=True)
print("\nAfter Cleaning:\n", df)
8. Visualization with Pandas
Pandas integrates directly with Matplotlib.
import matplotlib.pyplot as plt
df = pd.DataFrame({'Year': [2018, 2019, 2020, 2021],
'Sales': [250, 300, 350, 400]})
df.plot(x='Year', y='Sales', kind='bar', title='Yearly Sales')
plt.show()
9. Advantages of Pandas
- Easy to use and understand
- Highly efficient and fast
- Handles large datasets easily
- Excellent integration with other Python libraries (NumPy, Matplotlib, Scikit-learn)
- Built-in tools for cleaning, transforming, merging, and analyzing data
Pandas is the heart of data analysis in Python — it makes working with structured data fast, flexible, and powerful, enabling easy preparation of data for Machine Learning models.
First Install All Required Libraries
You can install all of them at once:
pip install pandas numpy matplotlib seaborn
Or install them one by one:
pip install pandas
pip install numpy
pip install matplotlib
pip install seaborn
COMPLETE PANDAS PROGRAM
# Step 1: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# ----------------------------------------------
# Step 2: Create DataFrame
# ----------------------------------------------
data = {
"Name": ["Shree", "Satya", "Jyoti", "Ashis", "Saraswati", "Ganesh", "Kajal", "Priya"],
"Gender": ["F", "M", "F", "M", "F", "M", "F", "F"],
"Age": [25, 30, 22, 28, 26, np.nan, 24, 27],
"Department": ["CS", "IT", "CS", "EC", "IT", "CS", "EC", "IT"],
"Marks1": [85, 78, 92, 70, 88, 80, 90, 95],
"Marks2": [80, 85, 88, 75, np.nan, 70, 95, 98],
"Marks3": [82, 80, 91, 72, 90, 68, 88, 96]
}
df = pd.DataFrame(data)
print("\n===== Original Data =====")
print(df)
# ----------------------------------------------
# Step 3: Basic Information and Statistics
# ----------------------------------------------
print("\n===== Basic Information =====")
print(df.info())
print("\n===== Statistical Summary =====")
print(df.describe())
# ----------------------------------------------
# Step 4: Data Selection
# ----------------------------------------------
print("\n===== Column Selection =====")
print(df["Name"]) # Single column
print(df[["Name", "Marks1"]]) # Multiple columns
print("\n===== Row Selection =====")
print(df.iloc[0]) # By position
print(df.loc[2, "Marks1"]) # By label
# ----------------------------------------------
# Step 5: Adding and Modifying Columns
# ----------------------------------------------
df["Total"] = df["Marks1"] + df["Marks2"] + df["Marks3"]
df["Average"] = df["Total"] / 3
print("\n===== After Adding Columns =====")
print(df.head())
# Modify a column
df["Department"] = df["Department"].replace({"CS": "Computer", "IT": "InformationTech", "EC": "Electronics"})
print("\n===== After Modifying Department Names =====")
print(df)
# ----------------------------------------------
# Step 6: Filtering Data
# ----------------------------------------------
print("\n===== Students with Average > 85 =====")
print(df[df["Average"] > 85])
# ----------------------------------------------
# Step 7: Sorting Data
# ----------------------------------------------
print("\n===== Sorting by Average (Descending) =====")
print(df.sort_values(by="Average", ascending=False))
# ----------------------------------------------
# Step 8: Handling Missing Data
# ----------------------------------------------
print("\n===== Missing Values =====")
print(df.isnull().sum())
# Assign back instead of using inplace=True on a single column
# (chained-assignment fillna is deprecated in pandas 2.x)
df["Marks2"] = df["Marks2"].fillna(df["Marks2"].mean())
df["Age"] = df["Age"].fillna(df["Age"].mean())
print("\n===== After Filling Missing Values =====")
print(df)
# ----------------------------------------------
# Step 9: Grouping and Aggregation
# ----------------------------------------------
grouped = df.groupby("Department")[["Marks1", "Marks2", "Marks3", "Average"]].mean()
print("\n===== Average Marks by Department =====")
print(grouped)
# ----------------------------------------------
# Step 10: Concatenation and Merging
# ----------------------------------------------
df_extra = pd.DataFrame({
"Name": ["Shree", "Satya", "Jyoti", "Ashis", "Saraswati", "Ganesh", "Kajal", "Priya"],
"Attendance (%)": [95, 88, 92, 80, 97, 85, 90, 99]
})
merged = pd.merge(df, df_extra, on="Name")
print("\n===== After Merging Attendance Data =====")
print(merged)
# ----------------------------------------------
# Step 11: Visualization with Pandas + Matplotlib + Seaborn
# ----------------------------------------------
# Bar plot of Average Marks by Department
grouped["Average"].plot(kind="bar", color="skyblue", title="Average Marks by Department")
plt.ylabel("Average Marks")
plt.show()
# Distribution of Marks
sns.histplot(df["Average"], bins=10, kde=True)
plt.title("Distribution of Student Averages")
plt.show()
# Box plot
sns.boxplot(x="Department", y="Average", data=df)
plt.title("Average Scores by Department")
plt.show()
# ----------------------------------------------
# Step 12: Export Data
# ----------------------------------------------
df.to_csv("students_full_processed.csv", index=False)
print("\n✅ Data exported to 'students_full_processed.csv' successfully!")
COMPLETE PANDAS PROGRAM USING CSV FILE
# Step 1: Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Step 2: Load CSV file
# Make sure "students_full_processed.csv" is in the same directory as this script
df = pd.read_csv("students_full_processed.csv")
print("===== Original Data =====")
print(df)
print("\n")
# Step 3: Display basic info
print("===== Basic Information =====")
print(df.info())
print("\n")
# Step 4: Display statistics
print("===== Statistical Summary =====")
print(df.describe())
print("\n")
# Step 5: Show first and last few rows
print("===== First 5 Rows =====")
print(df.head())
print("\n")
print("===== Last 5 Rows =====")
print(df.tail())
print("\n")
# Step 6: Access specific columns and rows
print("===== Names of All Students =====")
print(df["Name"])
print("\n")
print("===== Marks of First 3 Students =====")
print(df[["Name", "Marks1", "Marks2", "Marks3"]].head(3))
print("\n")
# Step 7: Conditional filtering
print("===== Students with Average > 85 =====")
high_achievers = df[df["Average"] > 85]
print(high_achievers)
print("\n")
# Step 8: Sorting
print("===== Sorting by Average (Descending) =====")
sorted_df = df.sort_values(by="Average", ascending=False)
print(sorted_df)
print("\n")
# Step 9: Grouping and Aggregation
print("===== Average Marks by Department =====")
dept_avg = df.groupby("Department")[["Marks1", "Marks2", "Marks3", "Average"]].mean()
print(dept_avg)
print("\n")
# Step 10: Handle missing data (if any)
print("===== Check Missing Values =====")
print(df.isnull().sum())
print("\n")
# Fill missing numeric values (if found)
df.fillna(df.mean(numeric_only=True), inplace=True)
# Step 11: Add new derived column (Performance Grade)
def grade(avg):
    if avg >= 90:
        return "A+"
    elif avg >= 80:
        return "A"
    elif avg >= 70:
        return "B"
    else:
        return "C"
df["Grade"] = df["Average"].apply(grade)
print("===== After Adding Grade Column =====")
print(df[["Name", "Average", "Grade"]])
print("\n")
# Step 12: Data Visualization
print("===== Data Visualization =====")
# Bar plot for average marks by department
dept_avg["Average"].plot(kind="bar", color="skyblue", title="Average Marks by Department")
plt.ylabel("Average Marks")
plt.show()
# Distribution of average marks
sns.histplot(df["Average"], bins=8, kde=True, color="green")
plt.title("Distribution of Average Marks")
plt.show()
# Box plot of Average by Department
sns.boxplot(x="Department", y="Average", data=df, hue="Department", palette="pastel", legend=False)  # hue + legend=False avoids the seaborn 0.13 palette deprecation warning
plt.title("Average Marks by Department")
plt.show()
# Step 13: Export modified data
df.to_csv("students_analysis_output.csv", index=False)
print("✅ Processed data saved as 'students_analysis_output.csv'")
Data Visualization Libraries
1. Introduction to Matplotlib
Matplotlib is a comprehensive 2D and 3D plotting library in Python that allows users to create high-quality static, animated, and interactive visualizations. It was developed by John D. Hunter in 2003, originally designed to provide MATLAB-like plotting features in Python.
2. Why Use Matplotlib?
Matplotlib is widely used because it is:
- Flexible – You can customize every element of a plot.
- Powerful – Supports hundreds of plot types (line, bar, scatter, histogram, pie, etc.).
- Compatible – Works well with other libraries such as NumPy, Pandas, Seaborn, and Scikit-learn.
- Cross-platform – Works on Windows, macOS, Linux, and in Jupyter notebooks.
3. Installation
To install Matplotlib, open your terminal or command prompt and type:
pip install matplotlib
Complete Matplotlib Demonstration Program
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Step 1: Create Sample Dataset
data = {
"Name": ["Shree", "Satya", "Jyoti", "Ashis", "Saraswati", "Ganesh", "Kajal", "Priya"],
"Marks1": [85, 78, 90, 88, 75, 92, 81, 89],
"Marks2": [80, 74, 94, 84, 70, 95, 77, 90],
"Marks3": [82, 79, 88, 91, 73, 89, 80, 87],
"Department": ["CSE", "ECE", "CSE", "EEE", "MECH", "CSE", "ECE", "CIVIL"]
}
df = pd.DataFrame(data)
df["Total"] = df["Marks1"] + df["Marks2"] + df["Marks3"]
df["Average"] = df["Total"] / 3
print("===== STUDENTS DATA =====")
print(df, "\n")
# Step 2: Basic Line Plot
plt.figure(figsize=(7, 4))
plt.plot(df["Name"], df["Marks1"], color='blue', marker='o', linestyle='-', label='Marks1')
plt.plot(df["Name"], df["Marks2"], color='red', marker='x', linestyle='--', label='Marks2')
plt.plot(df["Name"], df["Marks3"], color='green', marker='s', linestyle='-.', label='Marks3')
plt.title("Line Plot: Students' Marks Comparison")
plt.xlabel("Student Name")
plt.ylabel("Marks")
plt.legend()
plt.grid(True)
plt.show()
# Step 3: Bar Chart
plt.figure(figsize=(7, 4))
plt.bar(df["Name"], df["Average"], color='purple', alpha=0.6)
plt.title("Bar Chart: Students' Average Marks")
plt.xlabel("Name")
plt.ylabel("Average Marks")
plt.grid(axis='y', linestyle='--')
plt.show()
# Step 4: Horizontal Bar Chart
plt.figure(figsize=(7, 4))
plt.barh(df["Name"], df["Total"], color='orange')
plt.title("Horizontal Bar Chart: Total Marks")
plt.xlabel("Total Marks")
plt.ylabel("Student Name")
plt.show()
# Step 5: Scatter Plot
plt.figure(figsize=(7, 4))
plt.scatter(df["Marks1"], df["Marks2"], color='teal', s=100, alpha=0.7)
plt.title("Scatter Plot: Marks1 vs Marks2")
plt.xlabel("Marks1")
plt.ylabel("Marks2")
plt.grid(True)
plt.show()
# Step 6: Histogram
plt.figure(figsize=(6, 4))
plt.hist(df["Average"], bins=5, color='coral', edgecolor='black')
plt.title("Histogram: Distribution of Average Marks")
plt.xlabel("Average Marks")
plt.ylabel("Number of Students")
plt.show()
# Step 7: Pie Chart
plt.figure(figsize=(6, 6))
plt.pie(df["Average"], labels=df["Name"], autopct='%1.1f%%', startangle=90, shadow=True)
plt.title("Pie Chart: Share of Average Marks")
plt.show()
# Step 8: Multiple Subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 7))
# 1️⃣ Line Plot
axes[0, 0].plot(df["Name"], df["Marks1"], color='blue', marker='o')
axes[0, 0].set_title("Marks1 Line Plot")
axes[0, 0].set_xlabel("Name")
axes[0, 0].set_ylabel("Marks1")
# 2️⃣ Bar Plot
axes[0, 1].bar(df["Name"], df["Marks2"], color='red')
axes[0, 1].set_title("Marks2 Bar Plot")
# 3️⃣ Scatter Plot
axes[1, 0].scatter(df["Marks1"], df["Marks3"], color='green', s=80)
axes[1, 0].set_title("Marks1 vs Marks3 Scatter")
# 4️⃣ Histogram
axes[1, 1].hist(df["Total"], color='purple', bins=4)
axes[1, 1].set_title("Histogram of Total Marks")
plt.suptitle("Students’ Performance Subplots", fontsize=14)
plt.tight_layout()
plt.show()
# Step 9: Customization and Annotation
plt.figure(figsize=(7, 4))
plt.plot(df["Name"], df["Average"], color='brown', marker='D', label="Average")
plt.title("Customized Plot with Annotations")
plt.xlabel("Student")
plt.ylabel("Average Marks")
plt.grid(True)
plt.legend()
# Annotate highest performer
max_avg = df["Average"].max()
max_name = df.loc[df["Average"].idxmax(), "Name"]
plt.annotate(f"Topper: {max_name} ({max_avg:.2f})",
xy=(max_name, max_avg),
xytext=(max_name, max_avg + 2),
arrowprops=dict(facecolor='black', arrowstyle="->"))
plt.show()
# Step 10: Styling with Built-in Styles
plt.style.use('seaborn-v0_8-darkgrid')
plt.figure(figsize=(7, 4))
plt.plot(df["Name"], df["Total"], color='magenta', marker='o')
plt.title("Styled Plot using Seaborn Darkgrid")
plt.xlabel("Name")
plt.ylabel("Total Marks")
plt.show()
# Step 11: Object-Oriented API Example
fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(df["Department"], df["Average"], color='skyblue')
ax.set_title("Department-wise Average Marks")
ax.set_xlabel("Department")
ax.set_ylabel("Average Marks")
ax.grid(True)
plt.show()
# Step 12: Save the Plot
plt.figure(figsize=(6, 4))
plt.bar(df["Name"], df["Total"], color='darkgreen')
plt.title("Saving Plot Example")
plt.xlabel("Name")
plt.ylabel("Total Marks")
plt.savefig("students_performance_plot.png")
print("Plot saved successfully as 'students_performance_plot.png' ✅")
Python Program: Mean, Median, Mode, Variance & Standard Deviation
import numpy as np
import statistics as stats
from collections import Counter
# Sample List
data = [12, 15, 12, 18, 20, 22, 12, 15, 30]
# Using statistics Library
mean_stats = stats.mean(data)
median_stats = stats.median(data)
# On Python < 3.8, statistics.mode() raises StatisticsError if there is no
# unique mode; on 3.8+ it returns the first mode encountered
try:
    mode_stats = stats.mode(data)
except stats.StatisticsError:
    mode_stats = "No unique mode"
variance_stats = stats.variance(data) # Sample variance
std_stats = stats.stdev(data) # Sample standard deviation
# Using NumPy
mean_np = np.mean(data)
median_np = np.median(data)
variance_np = np.var(data) # Population variance
std_np = np.std(data) # Population standard deviation
# Mode with Counter
counter_mode = Counter(data).most_common(1)[0][0]
# Output
print("===== Using statistics Module =====")
print("Mean:", mean_stats)
print("Median:", median_stats)
print("Mode:", mode_stats)
print("Variance:", variance_stats)
print("Standard Deviation:", std_stats)
print("\n===== Using NumPy Library =====")
print("Mean:", mean_np)
print("Median:", median_np)
print("Mode (Counter):", counter_mode)
print("Variance:", variance_np)
print("Standard Deviation:", std_np)
Overview of Scikit-Learn, TensorFlow & PyTorch
These are the core machine learning libraries used in industry, research, and academics.
Scikit-Learn (sklearn)
- Best for: Traditional Machine Learning Algorithms
- Level: Beginner → Intermediate
- Built on: NumPy, SciPy, Matplotlib
What Scikit-Learn Is
Scikit-Learn is a Python library that provides ready-made implementations of classical ML algorithms like:
- Linear Regression
- Logistic Regression
- Decision Trees
- Random Forest
- Naive Bayes
- SVM (Support Vector Machines)
- Clustering (K-Means, DBSCAN)
- Dimensionality Reduction (PCA)
It is simple, clean, and widely used in data science.
Key Features
- Easy to use (simple API)
- Fast (built on optimized C/C++ libraries)
- Includes preprocessing tools (e.g., StandardScaler, OneHotEncoder)
- Excellent for beginners and for training ML models quickly
- Great for small- and medium-size datasets
Why is Scikit-Learn Important?
It contains almost every traditional machine learning algorithm.
Provides ready-made functions for:
- Data preprocessing
- Feature selection
- Model building
- Model evaluation
- Model tuning
- Highly reliable and optimized.
Scikit-Learn is the standard library for most ML beginners and data scientists.
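What makes Scikit-Learn easy to use is its uniform estimator interface: every model is built with the same fit/predict/score calls. A minimal sketch on the built-in iris dataset (the dataset choice here is purely illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Every scikit-learn estimator follows the same pattern: fit -> predict -> score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)      # learn from the training data
preds = model.predict(X_test)    # predict labels for unseen data
print("Test accuracy:", model.score(X_test, y_test))
```

Swapping `LogisticRegression` for any other classifier (SVC, KNeighborsClassifier, DecisionTreeClassifier) leaves the rest of the code unchanged.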
Core Areas of Scikit-Learn
Scikit-Learn provides support for:
1. Supervised Learning
Algorithms where the model learns from labeled data:
- Regression
- Linear Regression
- Polynomial Regression
- Ridge, Lasso Regression
- Classification
- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines
- K-Nearest Neighbors (KNN)
- Naive Bayes
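The classification algorithms above are exercised in the full program later in these notes; regression follows the same pattern. A minimal sketch, using toy data generated from y = 3x + 1 (the data and the Ridge alpha value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Toy data following y = 3x + 1 with a little noise
rng = np.random.default_rng(0)
X = np.arange(10).reshape(-1, 1)   # features must be 2D: (n_samples, n_features)
y = 3 * X.ravel() + 1 + rng.normal(0, 0.1, 10)

lin = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2-regularized variant of the same model

print("Linear slope:", lin.coef_[0], "intercept:", lin.intercept_)
print("Ridge slope:", ridge.coef_[0])
```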
2. Unsupervised Learning
Algorithms that work with unlabeled data:
- Clustering
- K-Means
- DBSCAN
- Agglomerative Clustering
- Dimensionality Reduction
- PCA (Principal Component Analysis)
- LDA (Linear Discriminant Analysis)
- t-SNE (via other packages)
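A short sketch of the two unsupervised tasks above, clustering with K-Means and dimensionality reduction with PCA, run on synthetic data (the blob parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two well-separated blobs of 50 points each in 4 dimensions
rng = np.random.default_rng(42)
a = rng.normal(0, 0.5, (50, 4))
b = rng.normal(5, 0.5, (50, 4))
X = np.vstack([a, b])

# K-Means recovers the two groups without any labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", np.bincount(km.labels_))

# PCA projects the 4D data down to 2D, e.g. for plotting
X2 = PCA(n_components=2).fit_transform(X)
print("Reduced shape:", X2.shape)
```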
3. Semi-Supervised Learning
Uses a mix of labeled and unlabeled data.
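Scikit-Learn implements this idea in SelfTrainingClassifier, where unlabeled samples are marked with -1. A minimal sketch on iris, hiding roughly 70% of the labels (the hidden fraction and the choice of base classifier are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Pretend most labels are unknown: unlabeled samples are marked with -1
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.7] = -1

# The base classifier must support predict_proba (hence probability=True)
model = SelfTrainingClassifier(SVC(probability=True, random_state=0))
model.fit(X, y_partial)   # trains on labeled points, then pseudo-labels the rest
print("Accuracy on all samples:", model.score(X, y))
```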
4. Model Selection & Evaluation
Tools for testing model accuracy:
- train_test_split
- Cross Validation (K-Fold)
- Grid Search (GridSearchCV)
- Randomized Search
- Accuracy, Precision, Recall, F1-score
- Confusion Matrix
5. Preprocessing Tools
For preparing raw data:
- Standardization (StandardScaler)
- Normalization (MinMaxScaler)
- Encoding (OneHotEncoder, LabelEncoder)
- Imputation (SimpleImputer)
- Binarization
- Polynomial features
FULL MACHINE LEARNING PROGRAM WITH AUTO BEST MODEL SELECTION
# Demonstrates all major ML workflow steps
# Dataset: Handwritten Digits (0–9)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
accuracy_score, confusion_matrix, classification_report,
precision_score, recall_score, f1_score
)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import joblib
# 1. LOAD DATASET
digits = datasets.load_digits()
X, y = digits.data, digits.target
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
# 2. TRAIN-TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 3. DEFINE MODELS AND PIPELINES
models = {
"Logistic Regression": LogisticRegression(max_iter=2000),
"KNN": KNeighborsClassifier(),
"SVM": SVC(),
"Decision Tree": DecisionTreeClassifier()
}
pipelines = {}
for name, model in models.items():
    if name in ["Logistic Regression", "SVM"]:
        pipelines[name] = Pipeline([
            ("scaler", StandardScaler()),
            ("pca", PCA(n_components=30)),
            ("model", model)
        ])
    else:
        pipelines[name] = Pipeline([
            ("scaler", StandardScaler()),
            ("model", model)
        ])
# 4. TRAIN MODELS, EVALUATE, AND SELECT BEST
best_accuracy = 0
best_model_name = None
best_model_pipeline = None
results = []
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average="macro")
    rec = recall_score(y_test, y_pred, average="macro")
    f1 = f1_score(y_test, y_pred, average="macro")
    results.append([name, acc, prec, rec, f1])
    print(f"\n===== {name} =====")
    print("Accuracy:", acc)
    print("Precision:", prec)
    print("Recall:", rec)
    print("F1 Score:", f1)
    # Update best model
    if acc > best_accuracy:
        best_accuracy = acc
        best_model_name = name
        best_model_pipeline = pipe
    # Confusion matrix heatmap
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f"Confusion Matrix: {name}")
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
# Summary DataFrame
results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1"])
print("\n===== SUMMARY OF MODEL PERFORMANCE =====")
print(results_df)
print(f"\nBest Model: {best_model_name} with Accuracy = {best_accuracy:.4f}")
# 5. CROSS-VALIDATION OF BEST MODEL
scores = cross_val_score(best_model_pipeline, X, y, cv=5)
print(f"\nCross-validation Scores for {best_model_name}: {scores}")
print(f"Mean CV Accuracy: {scores.mean():.4f}")
# 6. HYPERPARAMETER TUNING FOR BEST MODEL
if best_model_name == "Logistic Regression":
    param_grid = {
        'model__C': [0.01, 0.1, 1, 10],
        'model__solver': ['liblinear', 'lbfgs']
    }
elif best_model_name == "SVM":
    param_grid = {
        'model__C': [0.1, 1, 10],
        'model__kernel': ['linear', 'rbf']
    }
elif best_model_name == "KNN":
    param_grid = {
        'model__n_neighbors': [3, 5, 7, 9],
        'model__weights': ['uniform', 'distance']
    }
elif best_model_name == "Decision Tree":
    param_grid = {
        'model__max_depth': [None, 5, 10, 20],
        'model__min_samples_split': [2, 5, 10]
    }
grid = GridSearchCV(best_model_pipeline, param_grid, cv=3, scoring='accuracy', verbose=1)
grid.fit(X_train, y_train)
best_model_final = grid.best_estimator_
print(f"\n===== GRID SEARCH RESULTS FOR {best_model_name} =====")
print("Best Parameters:", grid.best_params_)
print("Best CV Score:", grid.best_score_)
# 7. SAVE FINAL BEST MODEL
joblib.dump(best_model_final, "best_digit_model_final.pkl")
print(f"\nFinal best model ({best_model_name}) saved as best_digit_model_final.pkl")
# 8. LOAD MODEL & PREDICT NEW SAMPLE
loaded_model = joblib.load("best_digit_model_final.pkl")
sample = X_test[0].reshape(1, -1)
predicted_digit = loaded_model.predict(sample)
print("\nSample Prediction:")
print("Actual value:", y_test[0])
print("Predicted value:", predicted_digit[0])
# END OF PROGRAM
TensorFlow
- Best for: Deep Learning, Neural Networks, Large-Scale ML
- Level: Intermediate → Advanced
- Developed by: Google
- Has Keras API (easy high-level NN building)
What TensorFlow Is
TensorFlow is a powerful open-source platform used for:
- Deep Neural Networks
- Convolutional Neural Networks (CNNs)
- Recurrent Neural Networks (RNNs)
- Natural Language Processing (NLP)
- Computer Vision
- Large-scale ML training
TensorFlow uses computational graphs and can run on:
- CPU
- GPU
- TPU (Tensor Processing Units)
Key Features
- High-performance deep learning
- Supports distributed training
- Industry standard for production models
- Keras makes model-building simple
- Used in Google products (Search, Photos, YouTube)
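As a quick taste of why Keras makes model-building simple, here is a minimal sketch of a fully connected digit classifier; the layer sizes are illustrative only and are unrelated to the CNN project below:

```python
import tensorflow as tf

# A complete classifier definition in a handful of lines
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),             # a flattened 28x28 image
    tf.keras.layers.Dense(64, activation='relu'),    # one hidden layer
    tf.keras.layers.Dense(10, activation='softmax')  # probabilities for 10 digits
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```

Everything else (training loops, gradients, device placement) is handled by `model.fit`, which is why Keras is the usual entry point into TensorFlow.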
COMPLETE TENSORFLOW PROJECT (ALL MAJOR METHODS INCLUDED)
# MNIST HANDWRITTEN DIGIT RECOGNITION
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (
Input, Conv2D, SeparableConv2D, MaxPooling2D,
GlobalAveragePooling2D, Dense, Dropout, BatchNormalization, Activation
)
from tensorflow.keras.callbacks import (
EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau, LearningRateScheduler
)
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
import datetime
# 1. CHECK GPU
print("Available GPUs:", tf.config.list_physical_devices('GPU'))
# 2. LOAD MNIST DATASET
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print("Train shape:", x_train.shape, "Test shape:", x_test.shape)
# 3. PREPROCESSING
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
x_train = x_train.reshape(-1,28,28,1)
x_test = x_test.reshape(-1,28,28,1)
y_train_cat = to_categorical(y_train, 10)
y_test_cat = to_categorical(y_test, 10)
# 4. SPLIT TRAIN/VALIDATION MANUALLY
x_train_new, x_val, y_train_new, y_val = train_test_split(
x_train, y_train_cat, test_size=0.2, random_state=42
)
# 5. DATA AUGMENTATION
datagen = ImageDataGenerator(
    rotation_range=10,
    zoom_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1
)
datagen.fit(x_train_new)
# 6. BUILD MODEL
inputs = Input(shape=(28,28,1))
x = Conv2D(32, (3,3), padding='same')(inputs)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling2D((2,2))(x)
x = SeparableConv2D(64, (3,3), padding='same')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling2D((2,2))(x)
x = GlobalAveragePooling2D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.4)(x)
outputs = Dense(10, activation='softmax')(x)
model = Model(inputs, outputs)
model.summary()
# 7. COMPILE MODEL
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()]
)
# 8. CALLBACKS
log_dir = "logs/advanced_mnist_" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
def lr_schedule(epoch, lr):
    # Halve the learning rate every 5 epochs
    if epoch > 0 and epoch % 5 == 0:
        return lr * 0.5
    return lr
callbacks = [
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    ModelCheckpoint('best_advanced_mnist.h5', save_best_only=True),
    TensorBoard(log_dir=log_dir),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1),
    LearningRateScheduler(lr_schedule)
]
# 9. TRAIN MODEL
history = model.fit(
    datagen.flow(x_train_new, y_train_new, batch_size=128),
    epochs=20,
    validation_data=(x_val, y_val),
    steps_per_epoch=len(x_train_new)//128,
    callbacks=callbacks
)
# 10. EVALUATE MODEL
loss, acc, precision, recall = model.evaluate(x_test, y_test_cat, verbose=0)
print(f"Test Loss: {loss:.4f}, Accuracy: {acc:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}")
# 11. PREDICTIONS
pred = model.predict(x_test)
pred_classes = np.argmax(pred, axis=1)
print("\nClassification Report:")
print(classification_report(y_test, pred_classes))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, pred_classes))
# 12. PLOT ACCURACY, LOSS
plt.figure(figsize=(18,5))
plt.subplot(1,3,1)
plt.plot(history.history['accuracy'], label='Train Acc')
plt.plot(history.history['val_accuracy'], label='Val Acc')
plt.title('Accuracy')
plt.legend()
plt.subplot(1,3,2)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.title('Loss')
plt.legend()
plt.subplot(1,3,3)
if 'lr' in history.history:
    plt.plot(history.history['lr'], label='Learning Rate')
    plt.title('Learning Rate')
    plt.legend()
plt.show()
# 13. SAVE AND LOAD MODEL
model.save('final_advanced_mnist_model.keras')  # Keras 3 requires a .keras (or .h5) extension
loaded_model = tf.keras.models.load_model('final_advanced_mnist_model.keras')
# 14. SINGLE SAMPLE PREDICTION
sample_idx = 0
sample = x_test[sample_idx].reshape(1,28,28,1)
prediction = loaded_model.predict(sample)
print(f"\nActual: {y_test[sample_idx]}, Predicted: {np.argmax(prediction)}")
# 15. VISUALIZE MISCLASSIFIED IMAGES
misclassified_idx = np.where(pred_classes != y_test)[0]
plt.figure(figsize=(12,6))
for i, idx in enumerate(misclassified_idx[:9]):
    plt.subplot(3, 3, i+1)
    plt.imshow(x_test[idx].reshape(28, 28), cmap='gray')
    plt.title(f"Actual: {y_test[idx]}, Predicted: {pred_classes[idx]}")
    plt.axis('off')
plt.show()
# END OF PROGRAM
PyTorch
PyTorch is an open-source deep learning and scientific computing framework developed by Facebook AI Research (FAIR).
- Best for: Research, Deep Learning, AI Models
- Level: Intermediate → Advanced
- Developed by: Facebook (Meta)
It is known for:
- Flexibility
- Dynamic computation graphs
- Faster debugging
- Research-friendliness
It is widely used in:
- NLP (Transformers, BERT, GPT)
- Computer Vision
- Deep Learning Research
- Reinforcement Learning
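The dynamic computation graph mentioned above means PyTorch records operations as ordinary Python runs, so gradients can be inspected immediately. A minimal autograd sketch:

```python
import torch

# The graph for y = x^2 + 2x is built on the fly as this line executes
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x
y.backward()   # backpropagate through the recorded graph
print(x.grad)  # dy/dx = 2x + 2 = 8 at x = 3
```

This eager, define-by-run style is what makes debugging PyTorch models feel like debugging plain Python.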
PyTorch NLP Sentiment Analysis using LSTM
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from collections import Counter
import re
# 1. TEXT CLEANING + TOKENIZER
def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s']", "", text)
    return text.split()
# 2. TRAINING DATA (40 SENTENCES)
positive_sentences = [
"I love this movie",
"This film is amazing",
"I really like this product",
"The acting was fantastic",
"This is a wonderful experience",
"I enjoyed this movie a lot",
"The product quality is very good",
"This is awesome",
"Absolutely loved it",
"What a great film",
"This is brilliant",
"The storyline is excellent",
"The actors did a great job",
"I highly recommend this movie",
"This product works very well",
"This made me very happy",
"It was a delightful experience",
"I am satisfied with this item",
"The movie was very enjoyable",
"I like this a lot"
]
negative_sentences = [
"I hate this movie",
"This film is terrible",
"I dislike this product",
"The acting was awful",
"This is a bad experience",
"I did not enjoy the movie",
"The product quality is very poor",
"This is horrible",
"Absolutely hated it",
"What a waste of time",
"This is boring",
"The storyline is terrible",
"The actors did a bad job",
"I don't recommend this movie",
"This product does not work",
"This made me very sad",
"It was a disappointing experience",
"I am frustrated with this item",
"The movie was not enjoyable",
"I don't like this at all"
]
texts = positive_sentences + negative_sentences
labels = [1]*20 + [0]*20 # 1 = Positive, 0 = Negative
# 3. BUILD VOCABULARY
tokenized = [clean_text(t) for t in texts]
word_counts = Counter(word for sent in tokenized for word in sent)
vocab = {"<PAD>": 0, "<UNK>": 1}
for word, _ in word_counts.items():
    vocab[word] = len(vocab)
vocab_size = len(vocab)
print("Vocabulary Size:", vocab_size)
# 4. ENCODING FUNCTION
def encode(sentence):
    return torch.tensor([vocab.get(word, 1) for word in clean_text(sentence)])
# 5. DATASET CLASS
class SentimentDataset(Dataset):
    def __init__(self, texts, labels):
        self.data = [encode(t) for t in texts]
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]
# Padding function
def collate_fn(batch):
    text, labels = zip(*batch)
    text = pad_sequence(text, batch_first=True, padding_value=0)
    labels = torch.stack(labels)  # stack the 0-d label tensors into one batch tensor
    return text, labels
dataset = SentimentDataset(texts, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
# 6. BI-DIRECTIONAL LSTM MODEL (UPGRADED)
class BiLSTMSentiment(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 64)
        self.lstm = nn.LSTM(64, 128, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(128 * 2, 2)  # Bi-LSTM → *2

    def forward(self, x):
        emb = self.embedding(x)
        output, (hidden, cell) = self.lstm(emb)
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)  # Combine directions
        hidden = self.dropout(hidden)
        return self.fc(hidden)
model = BiLSTMSentiment(vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# 7. TRAINING LOOP
for epoch in range(15):
    total_loss = 0
    for x, y in loader:
        optimizer.zero_grad()
        preds = model(x)
        loss = criterion(preds, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}/15, Loss = {total_loss:.4f}")
# 8. PREDICTION FUNCTION
def predict(text):
    model.eval()
    with torch.no_grad():
        encoded = encode(text).unsqueeze(0)  # Add batch dimension
        output = model(encoded)
        label = torch.argmax(output, dim=1).item()
    return "Positive 😀" if label == 1 else "Negative 😡"
# 9. TESTING THE MODEL
print("\n--- SENTIMENT PREDICTIONS ---")
print("This movie is awesome! ->", predict("This movie is awesome!"))
print("I don't like this product. ->", predict("I don't like this product."))
print("The acting was fantastic! ->", predict("The acting was fantastic!"))
print("This is the worst thing ever ->", predict("This is the worst thing ever"))
CNN-Based Face Recognition: Student Manual
This project demonstrates how to perform face recognition using a Convolutional Neural Network (CNN) in Python with TensorFlow/Keras.
It supports both:
- Binary face recognition (e.g., one person vs another)
- Multi-class face recognition (e.g., multiple people)
Components:
Training Script (train.py) – Train the CNN on your dataset.
Recognition Script (recognize.py) – Real-time face recognition using webcam.
Dataset folder – Contains images of people, organized by subfolders.
Prerequisites
Software Required
Python 3.11+
Pip package manager
Required Python Libraries
Install using pip:
pip install tensorflow opencv-python numpy scikit-learn pillow
- tensorflow – For building and training CNN
- opencv-python – For webcam capture and face detection
- numpy – For array operations
- scikit-learn – For label encoding & train-test split
- pillow – For image loading & preprocessing
Folder Structure
Create the following structure:
face_recognition/
    dataset/
        person1/
            img1.jpg
            img2.jpg
        person2/
            img1.jpg
            img2.jpg
    model/
        (will store cnn_model.keras and labels.npy)
    train.py
    recognize.py
Dataset Preparation
- Each person’s images go in their own subfolder inside dataset/.
- Recommended 20–50 images per person.
- Images can be .jpg or .png.
- Keep face visible and centered.
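Before training, it is worth sanity-checking the layout. This small helper (a convenience sketch, not part of the two scripts below) counts usable images per person under the dataset folder:

```python
import os

def count_images(dataset_dir):
    """Return {person_name: number_of_images} for each subfolder."""
    counts = {}
    for person in sorted(os.listdir(dataset_dir)):
        person_dir = os.path.join(dataset_dir, person)
        if os.path.isdir(person_dir):
            images = [f for f in os.listdir(person_dir)
                      if f.lower().endswith((".jpg", ".png"))]
            counts[person] = len(images)
    return counts

# count_images("dataset/") returns a dict like {"person1": <count>, ...};
# check that every person is in the recommended 20-50 image range.
```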
train.py
import os
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import load_img, img_to_array
# ------------------------
# Config
# ------------------------
DATASET_DIR = "dataset/"
IMG_SIZE = 128 # Resize images
# ------------------------
# Load dataset
# ------------------------
X, y = [], []
for person_name in os.listdir(DATASET_DIR):
    person_dir = os.path.join(DATASET_DIR, person_name)
    if not os.path.isdir(person_dir):
        continue
    for img_name in os.listdir(person_dir):
        img_path = os.path.join(person_dir, img_name)
        try:
            img = load_img(img_path, target_size=(IMG_SIZE, IMG_SIZE))
            img_array = img_to_array(img) / 255.0
            X.append(img_array)
            y.append(person_name)
        except Exception:
            # Skip files that cannot be loaded as images
            continue
X = np.array(X)
y = np.array(y)
# ------------------------
# Encode labels
# ------------------------
le = LabelEncoder()
y_encoded = le.fit_transform(y)
num_classes = len(np.unique(y_encoded))
os.makedirs("model", exist_ok=True)
np.save("model/labels.npy", le.classes_)
# ------------------------
# Prepare labels
# ------------------------
if num_classes == 2:
    y_final = y_encoded  # Binary classification
else:
    y_final = to_categorical(y_encoded, num_classes=num_classes)  # Multi-class
# ------------------------
# Train-test split
# ------------------------
X_train, X_test, y_train, y_test = train_test_split(
X, y_final, test_size=0.2, random_state=42
)
# ------------------------
# Build CNN model
# ------------------------
model = Sequential([
Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
Conv2D(32, (3,3), activation="relu"),
MaxPooling2D(2,2),
Conv2D(64, (3,3), activation="relu"),
MaxPooling2D(2,2),
Conv2D(128, (3,3), activation="relu"),
MaxPooling2D(2,2),
Flatten(),
Dense(128, activation="relu"),
Dropout(0.5),
Dense(1 if num_classes==2 else num_classes,
activation="sigmoid" if num_classes==2 else "softmax")
])
# ------------------------
# Compile model
# ------------------------
model.compile(
optimizer="adam",
loss="binary_crossentropy" if num_classes==2 else "categorical_crossentropy",
metrics=["accuracy"]
)
model.summary()
# ------------------------
# Train model
# ------------------------
model.fit(X_train, y_train,
validation_data=(X_test, y_test),
batch_size=16,
epochs=15)
# ------------------------
# Save model
# ------------------------
model.save("model/cnn_model.keras")
print("✅ Model trained and saved successfully!")
recognize.py
import cv2
import numpy as np
from tensorflow.keras.models import load_model
IMG_SIZE = 128
# ------------------------
# Load model and labels
# ------------------------
model = load_model("model/cnn_model.keras")
labels = np.load("model/labels.npy")
num_classes = len(labels)
# ------------------------
# Load face detector
# ------------------------
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
# ------------------------
# Start webcam
# ------------------------
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break  # camera frame not available
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    for (x, y, w, h) in faces:
        face_img = frame[y:y+h, x:x+w]
        face_img = cv2.resize(face_img, (IMG_SIZE, IMG_SIZE))
        face_img = face_img.astype("float32") / 255.0
        face_img = np.expand_dims(face_img, axis=0)
        # Predict
        preds = model.predict(face_img)[0]
        if num_classes == 2:
            # Binary classification: single sigmoid output
            class_id = int(preds[0] > 0.5)
            confidence = float(preds[0]) if class_id == 1 else 1 - float(preds[0])
        else:
            # Multi-class: softmax outputs
            class_id = int(np.argmax(preds))
            confidence = float(preds[class_id])
        name = labels[class_id]
        # Draw rectangle and label
        cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 2)
        cv2.putText(frame, f"{name} ({confidence*100:.1f}%)",
                    (x, y-10),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.8, (0, 255, 0), 2)
    cv2.imshow("CNN Face Recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
Implement the linear regression algorithm.
# Linear Regression using Scikit-learn
# Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Step 2: Create Dataset (Study Hours vs Marks)
data = {
"Hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"Marks": [2, 4, 5, 4, 5, 7, 8, 9, 10, 12]
}
df = pd.DataFrame(data)
# Step 3: Define Features (X) and Target (y)
X = df[["Hours"]] # 2D array required
y = df["Marks"]
# Step 4: Split Dataset into Training and Testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Step 5: Create Linear Regression Model
model = LinearRegression()
# Step 6: Train Model
model.fit(X_train, y_train)
# Step 7: Model Parameters
print("Slope (Coefficient):", model.coef_[0])
print("Intercept:", model.intercept_)
# Step 8: Make Predictions
y_pred = model.predict(X_test)
print("\nActual Marks:", list(y_test))
print("Predicted Marks:", y_pred)
# Step 9: Model Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("\nMean Squared Error:", mse)
print("R2 Score:", r2)
# Step 10: Plot Graph
plt.scatter(X, y, color='blue', label="Actual Data")
plt.plot(X, model.predict(X), color='red', label="Regression Line")
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Linear Regression using Scikit-learn")
plt.legend()
plt.show()
# Step 11: Predict for New Value
new_hours = np.array([[12]])
predicted_marks = model.predict(new_hours)
print("\nPredicted marks for 12 study hours:", predicted_marks[0])
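As a cross-check, the slope and intercept that `LinearRegression` estimates come from the closed-form least-squares formulas. The sketch below applies them to the full 10-point dataset, so the numbers differ slightly from the train-split fit above:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
marks = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12], dtype=float)

# Ordinary least squares: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
x_mean, y_mean = hours.mean(), marks.mean()
slope = np.sum((hours - x_mean) * (marks - y_mean)) / np.sum((hours - x_mean) ** 2)
intercept = y_mean - slope * x_mean
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```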
Implement the logistic regression algorithm.
# MULTIPLE FEATURE LOGISTIC REGRESSION - LOAN APPROVAL
# Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
# Step 2: Create Realistic Dataset
data = {
"Income": [15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 55000, 60000],
"Credit_Score": [500, 520, 580, 600, 650, 700, 720, 750, 780, 800],
"Loan_Status": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
print("Dataset:\n")
print(df)
# Step 3: Define Features and Target
X = df[["Income", "Credit_Score"]]
y = df["Loan_Status"]
# Step 4: Split Dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Step 5: Feature Scaling (Important for Logistic Regression)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Step 6: Train Logistic Regression Model
model = LogisticRegression()
model.fit(X_train, y_train)
# Step 7: Make Predictions
y_pred = model.predict(X_test)
print("\nActual Values:", list(y_test))
print("Predicted Values:", list(y_pred))
# Step 8: Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("\nModel Evaluation:")
print("Accuracy:", accuracy)
print("\nConfusion Matrix:\n", cm)
print("\nClassification Report:\n", report)
# Step 9: Predict New Applicant (No Warning Version)
new_applicant = pd.DataFrame({
"Income": [42000],
"Credit_Score": [690]
})
# Apply same scaling
new_applicant_scaled = scaler.transform(new_applicant)
prediction = model.predict(new_applicant_scaled)
if prediction[0] == 1:
    print("\nLoan Approved for new applicant")
else:
    print("\nLoan Rejected for new applicant")
# Step 10: Display Model Coefficients
print("\nModel Coefficients:")
print("Income Coefficient:", model.coef_[0][0])
print("Credit Score Coefficient:", model.coef_[0][1])
print("Intercept:", model.intercept_[0])
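Under the hood, logistic regression passes the weighted sum of the (scaled) features through the sigmoid function to get a probability. A minimal sketch; the weights here are made up for illustration and are not the fitted coefficients printed above:

```python
import numpy as np

def sigmoid(z):
    """Squash a linear score into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for the scaled [Income, Credit_Score] features
w = np.array([1.2, 0.8])
b = -0.1
x_scaled = np.array([0.5, 0.6])  # a hypothetical scaled applicant

z = w @ x_scaled + b             # linear score
prob = sigmoid(z)                # P(Loan_Status = 1)
label = int(prob >= 0.5)         # threshold at 0.5, as predict() does
print(f"P(approved) = {prob:.3f} -> label {label}")
```

A positive score (z > 0) maps to a probability above 0.5, which is why the sign of the linear score decides the predicted class.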
Implement the K-nearest neighbor algorithm.
# KNN - HEART DISEASE PREDICTION
# Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Step 2: Create Realistic Medical Dataset
data = {
"Age": [25, 35, 45, 50, 55, 60, 65, 70, 40, 48],
"Blood_Pressure": [120, 130, 140, 150, 160, 170, 180, 190, 135, 145],
"Cholesterol": [180, 190, 210, 220, 240, 260, 280, 300, 200, 215],
"Heart_Disease": [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]
}
df = pd.DataFrame(data)
print("Dataset:\n")
print(df)
# Step 3: Define Features and Target
X = df[["Age", "Blood_Pressure", "Cholesterol"]]
y = df["Heart_Disease"]
# Step 4: Split Dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Step 5: Feature Scaling (Important for KNN)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Step 6: Train KNN Model
k = 3
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train, y_train)
# Step 7: Make Predictions
y_pred = model.predict(X_test)
print("\nActual Values:", list(y_test))
print("Predicted Values:", list(y_pred))
# Step 8: Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("\nModel Evaluation:")
print("Accuracy:", accuracy)
print("\nConfusion Matrix:\n", cm)
print("\nClassification Report:\n", report)
# Step 9: Predict New Patient
new_patient = pd.DataFrame({
"Age": [52],
"Blood_Pressure": [155],
"Cholesterol": [230]
})
# Apply scaling
new_patient_scaled = scaler.transform(new_patient)
prediction = model.predict(new_patient_scaled)
if prediction[0] == 1:
    print("\nPatient likely has Heart Disease")
else:
    print("\nPatient likely does NOT have Heart Disease")
# Step 10: 2D Visualization (Age vs Cholesterol)
plt.scatter(df["Age"], df["Cholesterol"], c=df["Heart_Disease"])
plt.xlabel("Age")
plt.ylabel("Cholesterol")
plt.title("Heart Disease Classification (KNN)")
plt.show()
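At its core, KNN is just distance computation plus a majority vote. The sketch below re-implements that idea on a tiny made-up [Age, Cholesterol] sample; note that it skips the scaling step, which is exactly why the full example above standardizes features first (otherwise the larger-valued cholesterol column dominates the distance):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: [Age, Cholesterol], label 1 = heart disease
X = np.array([[25, 180], [35, 190], [60, 260], [65, 280]], dtype=float)
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([62.0, 270.0]), k=3))  # nearest neighbors are mostly class 1
```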
Implement the decision tree algorithm.
# LOAN APPROVAL PREDICTION USING DECISION TREE
# Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn import tree
# Step 2: Create Realistic Dataset
data = {
"Income": [15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 55000, 60000],
"Credit_Score": [500, 520, 580, 600, 650, 700, 720, 750, 780, 800],
"Age": [22, 25, 28, 30, 35, 40, 45, 50, 55, 60],
"Loan_Status": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
print("Dataset:\n")
print(df)
# Step 3: Define Features and Target
X = df[["Income", "Credit_Score", "Age"]]
y = df["Loan_Status"]
# Step 4: Split Dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Step 5: Train Decision Tree Model
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
model.fit(X_train, y_train)
# Step 6: Make Predictions
y_pred = model.predict(X_test)
print("\nActual Values:", list(y_test))
print("Predicted Values:", list(y_pred))
# Step 7: Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("\nModel Evaluation:")
print("Accuracy:", accuracy)
print("\nConfusion Matrix:\n", cm)
print("\nClassification Report:\n", report)
# Step 8: Predict New Applicant
new_applicant = pd.DataFrame({
"Income": [42000],
"Credit_Score": [690],
"Age": [38]
})
prediction = model.predict(new_applicant)
if prediction[0] == 1:
    print("\nLoan Approved for new applicant")
else:
    print("\nLoan Rejected for new applicant")
# Step 9: Visualize Decision Tree
plt.figure(figsize=(12, 8))
tree.plot_tree(
model,
feature_names=["Income", "Credit_Score", "Age"],
class_names=["Rejected", "Approved"],
filled=True
)
plt.title("Decision Tree - Loan Approval")
plt.show()
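The `criterion="gini"` setting above refers to Gini impurity, the score the tree minimizes when choosing splits: 1 minus the sum of squared class proportions. A quick worked sketch on the loan labels:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# The 10 loan labels above: 4 rejected (0), 6 approved (1)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
print(gini(labels))  # 1 - (0.4^2 + 0.6^2) = 0.48 before any split

# Splitting this sorted data after the 4th row yields two pure child nodes
left, right = labels[:4], labels[4:]
print(gini(left), gini(right))  # both 0.0: a perfect split
```

A pure node has impurity 0, so the tree stops splitting there; the `max_depth=3` cap above simply forces it to stop earlier.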