
Learn NumPy from Scratch – Step by Step! (Part 2)

Getting Started with Data Science in Python

Working with Data: Filtering, Sorting, and Aggregation

Now that we've covered the core concepts, it's time to apply NumPy to real-world data operations. In this section, we'll explore essential techniques used in data science, including filtering, sorting, and aggregation. These operations allow us to efficiently manipulate and analyze datasets, making them a crucial part of any data-driven workflow.

Indexing in NumPy

Indexing in NumPy follows rules similar to those of standard Python lists: you can access elements with positive and negative indices, use a colon (:) to select ranges or an entire array, and add a second colon with a step (::2, for example) to skip elements. NumPy arrays introduce a key difference, however: multi-dimensional indexing uses commas to separate axes, enabling efficient access to specific rows, columns, and subarrays.
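
Here is a quick sketch of these rules on small throwaway arrays (vector and matrix are just placeholder names) before we move to a richer example:

import numpy as np

vector = np.arange(10)  # [0 1 2 3 4 5 6 7 8 9]

print(vector[3])    # 3, a single element
print(vector[-1])   # 9, negative indices count from the end
print(vector[2:6])  # [2 3 4 5], a range selection
print(vector[::2])  # [0 2 4 6 8], every second element

matrix = np.arange(12).reshape(3, 4)

print(matrix[1, 2])  # 6, row 1 and column 2 separated by a comma
print(matrix[:, 0])  # [0 4 8], the entire first column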

Let's demonstrate this with a fascinating example: Dürer's Magic Square, a 4×4 grid in which each row, each column, both diagonals, and each 2×2 quadrant sum to 34.

import numpy as np  

# Creating a 4×4 magic square  
square = np.array([
    [16, 3, 2, 13],  
    [5, 10, 11, 8],  
    [9, 6, 7, 12],  
    [4, 15, 14, 1]  
])  

# Verifying that each row and column sums to 34  
for i in range(4):  
    assert square[i, :].sum() == 34  # Row-wise sum  
    assert square[:, i].sum() == 34  # Column-wise sum  

# Checking sum of quadrants  
assert square[:2, :2].sum() == 34  # Top-left quadrant  
assert square[2:, :2].sum() == 34  # Bottom-left quadrant  
assert square[:2, 2:].sum() == 34  # Top-right quadrant  
assert square[2:, 2:].sum() == 34  # Bottom-right quadrant  

Here, we use row-wise (i, :) and column-wise (:, i) indexing inside a loop to confirm that each sum equals 34. We also extract quadrants using slicing ([:2, :2], etc.) to verify that they meet the same property.
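
Dürer's square has more structure still: both diagonals share the magic sum. A short check, using np.diag for the main diagonal and np.fliplr to reach the anti-diagonal:

# Checking the two diagonals
assert np.diag(square).sum() == 34             # Main diagonal: 16 + 10 + 7 + 1
assert np.diag(np.fliplr(square)).sum() == 34  # Anti-diagonal: 13 + 11 + 6 + 4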

Another powerful feature of NumPy is the ability to compute global and axis-wise sums efficiently:

print(square.sum())        # 136, the sum of all elements
print(square.sum(axis=0))  # [34 34 34 34], the sum of each column
print(square.sum(axis=1))  # [34 34 34 34], the sum of each row
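
NumPy also supports integer (often called "fancy") indexing, where you pass explicit lists of row and column indices. As a small aside, the four corners of this particular square sum to 34 as well:

# Picking the four corners with integer ("fancy") indexing
corners = square[[0, 0, 3, 3], [0, 3, 0, 3]]

print(corners)        # [16 13  4  1]
print(corners.sum())  # 34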

This indexing flexibility makes NumPy an excellent tool for handling structured numerical data with ease!

Masking and Filtering in NumPy

Index-based selection is useful, but what if you need to extract elements based on complex conditions? This is where masking comes in.

What is a Mask?

A mask is a NumPy array of the same shape as your data but filled with Boolean values (True or False). It helps extract specific elements that meet a condition.

Example: Filtering Multiples of 4

import numpy as np  

# Creating a 4×6 array of integers
numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)  

# Creating a mask where numbers are divisible by 4  
mask = numbers % 4 == 0  

# Using the mask to filter values  
filtered_values = numbers[mask]  

print(filtered_values)  # Extracted numbers  

Understanding the Process

  1. Creating the Array:

    • np.linspace(5, 50, 24, dtype=int) generates 24 evenly spaced values between 5 and 50, truncated to integers by dtype=int.

    • .reshape(4, -1) restructures the result into a 4-row array; the -1 tells NumPy to compute the number of columns automatically (here, 24 / 4 = 6).

  2. Creating the Mask:

    • The condition numbers % 4 == 0 creates a Boolean array.

    • True where numbers are multiples of 4, False elsewhere.

  3. Applying the Mask:

    • Using numbers[mask] extracts elements where mask is True.

    • The result is a one-dimensional array.
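
Beyond extraction, a mask can also count matches or drive conditional replacement. A brief sketch, reusing the numbers array and mask from above:

# Counting matches: True counts as 1 and False as 0
print(mask.sum())

# Replacing every non-multiple of 4 with 0, keeping the array's shape
replaced = np.where(mask, numbers, 0)
print(replaced)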

Filtering in a Normal Distribution

You can also use masks to filter elements based on statistical properties, such as values within two standard deviations of the mean.

from numpy.random import default_rng  

rng = default_rng()  
values = rng.standard_normal(10000)  # Generating random values  
std_dev = values.std()  

# Filtering values within ±2 standard deviations  
filtered = values[(values > -2 * std_dev) & (values < 2 * std_dev)]  

print(len(filtered) / len(values))  # ≈ 0.9545, i.e. about 95.45% of the values

Key Takeaways

  • Vectorized Boolean Operations: Use & (and) and | (or) instead of Python's and/or; see the sketch after this list.

  • Efficient Filtering: NumPy processes conditions across entire arrays without loops.

  • Flexible Selection: Extract specific ranges, statistical properties, or patterns.
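
To make the first takeaway concrete, here is a minimal sketch combining conditions with & (and), | (or), and ~ (not), reusing the values and std_dev from the previous example. Note the parentheses around each comparison: they are required because & and | bind more tightly than the comparison operators.

in_band  = values[(values > -std_dev) & (values < std_dev)]             # Within ±1 std dev
outliers = values[(values < -2 * std_dev) | (values > 2 * std_dev)]     # Beyond ±2 std devs
the_rest = values[~((values < -2 * std_dev) | (values > 2 * std_dev))]  # Everything else

print(len(in_band) / len(values))   # Roughly 0.68 for normally distributed data
print(len(outliers) / len(values))  # Roughly 0.05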

Transposing, Sorting, and Concatenating

When working with NumPy arrays, sometimes you need to rearrange data for better analysis. Let's look at three powerful techniques:

1️⃣ Transposing an Array

Transposing is the process of swapping rows and columns. If you have a 2D array, transposing flips it so that the first row becomes the first column, the second row becomes the second column, and so on.

import numpy as np

arr = np.array([
    [1, 2],
    [3, 4],
    [5, 6]
])

print(arr.T)  # Transpose using .T
print(arr.transpose())  # Another way to transpose

💡 Tip: .T is a shortcut for .transpose().
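
One caveat worth knowing: transposing a one-dimensional array does nothing, because there is no second axis to swap. To turn a flat array into a column vector, reshape it instead:

flat = np.array([1, 2, 3])

print(flat.T)                 # [1 2 3], unchanged: a 1-D array has no second axis
print(flat.reshape(-1, 1))    # A 3×1 column vector
print(flat.reshape(-1, 1).T)  # Now 2-D, so .T produces a 1×3 row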

2️⃣ Sorting in NumPy

Sorting helps organize your data in ascending order. Note that np.sort() takes no argument for descending order; the standard trick is to reverse the sorted result with slicing, as shown in the sketch after the tip below.

data = np.array([
    [7, 1, 4],
    [8, 6, 5],
    [1, 2, 3]
])

print(np.sort(data))  # Sorts each row individually
print(np.sort(data, axis=0))  # Sorts each column individually
print(np.sort(data, axis=None))  # Flattens and sorts entire array

💡 Tip: axis=0 sorts columns, axis=1 sorts rows, and axis=None sorts the entire array as a single list.
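
Since there is no descending-order argument, the usual pattern is to sort and then reverse with slicing. np.argsort is also handy: it returns the indices that would sort the array. A brief sketch, reusing data from above:

print(np.sort(data)[:, ::-1])          # Each row sorted in descending order
print(np.sort(data, axis=None)[::-1])  # The whole array flattened, descending

indices = np.argsort(data[0])  # Indices that would sort the first row [7, 1, 4]
print(indices)                 # [1 2 0]
print(data[0][indices])        # [1 4 7]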

3️⃣ Concatenating Arrays

Concatenation means joining multiple arrays together. You can do it horizontally (side by side) or vertically (stacking on top).

a = np.array([
    [4, 8],
    [6, 1]
])

b = np.array([
    [3, 5],
    [7, 2]
])

print(np.hstack((a, b)))  # Horizontal stacking
print(np.vstack((a, b)))  # Vertical stacking
print(np.concatenate((a, b)))  # Default is vertical stacking
print(np.concatenate((a, b), axis=None))  # Flattens and joins both arrays

💡 Tip: hstack() joins arrays side by side, vstack() joins them top to bottom, and concatenate() gives you more flexibility.
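
To see that flexibility, note that np.concatenate with an explicit axis reproduces both stacking styles, using the same a and b arrays:

print(np.concatenate((a, b), axis=0))  # Rows stacked, same result as np.vstack((a, b))
print(np.concatenate((a, b), axis=1))  # Columns joined, same result as np.hstack((a, b))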

Aggregation in NumPy

Aggregation helps summarize data using functions like sum(), max(), mean(), and std().

import numpy as np

arr = np.array([
    [3, 6, 9],
    [2, 4, 8],
    [1, 5, 7]
])

print(np.sum(arr))   # 45, the total sum
print(np.max(arr))   # 9, the max value
print(np.mean(arr))  # 5.0, the mean value
print(np.std(arr))   # ≈ 2.58, the standard deviation

Aggregation Along Axes:

print(np.sum(arr, axis=0))   # Column-wise sum: [6 15 24]
print(np.mean(arr, axis=1))  # Row-wise mean: ≈ [6. 4.67 4.33]

  • axis=0 → Columns

  • axis=1 → Rows

NumPy also includes further aggregation functions, such as np.min(), np.median(), np.percentile(), and np.cumsum(), for deeper analysis.
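
A quick sketch of a few of them on the same arr:

print(np.median(arr))          # 5.0, the middle value of the flattened array
print(np.percentile(arr, 75))  # 7.0, the 75th percentile
print(np.cumsum(arr, axis=1))  # Running totals along each row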

Practical Example: Standardizing a Dataset using NumPy

Problem:

In Data Science and Machine Learning, datasets often contain features with different scales. Some may have values ranging from 0 to 1, while others might range from 1,000 to 10,000. This can negatively affect models like k-NN, Linear Regression, and Neural Networks. Standardization (also called Z-score normalization) helps by transforming data to have a mean of 0 and standard deviation of 1.

Solution Using NumPy:

import numpy as np

# Sample dataset (rows = samples, columns = features)
data = np.array([
    [50, 2000, 3.5],
    [20, 1500, 2.1],
    [30, 1800, 4.3],
    [40, 2100, 3.9]
])

# Compute mean and standard deviation
mean = np.mean(data, axis=0)
std_dev = np.std(data, axis=0)

# Standardization formula: (X - mean) / std_dev
standardized_data = (data - mean) / std_dev

print("Original Data:\n", data)
print("\nStandardized Data:\n", standardized_data)

Explanation:

  1. Compute mean and standard deviation for each feature (column).

  2. Apply the Z-score formula: X_new = (X - μ) / σ

  3. Now all features are on the same scale, with a mean of 0 and a standard deviation of 1, making them suitable for Machine Learning models; the quick check below confirms this.
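
As a quick sanity check, each standardized column should now have a mean of approximately 0 and a standard deviation of approximately 1; np.allclose tolerates the tiny floating-point error:

# Verifying the standardization column by column
print(np.allclose(standardized_data.mean(axis=0), 0))  # True
print(np.allclose(standardized_data.std(axis=0), 1))   # True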

Final Words

NumPy is an incredibly powerful library that simplifies complex data operations with ease and efficiency. From indexing and filtering to transposing, sorting, concatenating, and aggregating, it provides a robust set of tools for handling large datasets.

With its optimized mathematical operations, NumPy significantly speeds up computations, making it a must-have for data manipulation, scientific computing, and machine learning workflows.

Mastering these fundamental techniques will not only enhance your coding efficiency but also help you tackle real-world data challenges with confidence!