Learn NumPy from Scratch – Step by Step! (Part 2)
Getting Started with Data Science in Python

Working with Data: Filtering, Sorting, and Aggregation
Now that we've covered the core concepts, it's time to apply NumPy to real-world data operations. In this section, we'll explore essential techniques used in data science, including filtering, sorting, and aggregation. These operations allow us to efficiently manipulate and analyze datasets, making them a crucial part of any data-driven workflow.
Indexing in NumPy
Indexing in NumPy follows similar rules to standard Python lists, allowing you to access elements using positive and negative indices. You can use a colon (:) to select ranges or the entire array, and a double colon (::) to skip elements. However, NumPy arrays introduce a key difference: multi-dimensional indexing uses commas to separate axes, enabling efficient access to specific rows, columns, and subarrays.
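To make these rules concrete, here is a small illustrative sketch (the array and its values are made up for this example) showing positive and negative indices, slices, and comma-separated axes:
import numpy as np
arr = np.arange(1, 13).reshape(3, 4)  # 3×4 array containing 1 through 12
print(arr[0, 2])      # single element: row 0, column 2 -> 3
print(arr[-1, -1])    # negative indices count from the end -> 12
print(arr[1, :])      # entire second row -> [5 6 7 8]
print(arr[:, 1])      # entire second column -> [2 6 10]
print(arr[::2, ::2])  # every other row and every other column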
Let’s demonstrate this with a fascinating example—Dürer’s Magic Square, a 4×4 grid where the sum of each row, column, and certain subsets always equals 34.
import numpy as np
# Creating a 4×4 magic square
square = np.array([
    [16, 3, 2, 13],
    [5, 10, 11, 8],
    [9, 6, 7, 12],
    [4, 15, 14, 1]
])
# Verifying that each row and column sums to 34
for i in range(4):
    assert square[i, :].sum() == 34  # Row-wise sum
    assert square[:, i].sum() == 34  # Column-wise sum
# Checking sum of quadrants
assert square[:2, :2].sum() == 34 # Top-left quadrant
assert square[2:, :2].sum() == 34 # Bottom-left quadrant
assert square[:2, 2:].sum() == 34 # Top-right quadrant
assert square[2:, 2:].sum() == 34 # Bottom-right quadrant
Here, we use row-wise (i, :) and column-wise (:, i) indexing inside a loop to confirm that each sum equals 34. We also extract quadrants using slicing ([:2, :2], etc.) to verify that they satisfy the same property.
Another powerful feature of NumPy is the ability to compute global and axis-wise sums efficiently:
print(square.sum()) # Sum of all elements
print(square.sum(axis=0)) # Sum of each column
print(square.sum(axis=1)) # Sum of each row
This indexing flexibility makes NumPy an excellent tool for handling structured numerical data with ease!
Masking and Filtering in NumPy
Index-based selection is useful, but what if you need to extract elements based on complex conditions? This is where masking comes in.
What is a Mask?
A mask is a NumPy array of the same shape as your data but filled with Boolean values (True or False). It helps extract specific elements that meet a condition.
Example: Filtering Multiples of 4
import numpy as np
# Creating a 4×6 array of integers spaced between 5 and 50
numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)
# Creating a mask where numbers are divisible by 4
mask = numbers % 4 == 0
# Using the mask to filter values
filtered_values = numbers[mask]
print(filtered_values) # Extracted numbers
Understanding the Process
Creating the array: np.linspace(5, 50, 24, dtype=int) generates 24 evenly spaced values between 5 and 50, truncated to integers, and .reshape(4, -1) restructures them into a 4-row array, automatically calculating the number of columns.
Creating the mask: the condition numbers % 4 == 0 creates a Boolean array that is True where numbers are multiples of 4 and False elsewhere.
Applying the mask: numbers[mask] extracts the elements where mask is True; the result is a one-dimensional array.
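If you want to inspect the intermediate pieces, here is a quick sketch building on the code above (the exact values depend on the integer truncation noted earlier):
import numpy as np
numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)
mask = numbers % 4 == 0
print(mask.shape)      # (4, 6) -- same shape as numbers
print(mask.dtype)      # bool
print(numbers[mask])   # one-dimensional array of the multiples of 4
print(np.where(mask))  # row and column indices where the mask is True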
Filtering in a Normal Distribution
You can also use masks to filter elements based on statistical properties, such as values within two standard deviations of the mean.
from numpy.random import default_rng
rng = default_rng()
values = rng.standard_normal(10000) # Generating random values
std_dev = values.std()
# Filtering values within ±2 standard deviations
filtered = values[(values > -2 * std_dev) & (values < 2 * std_dev)]
print(len(filtered) / len(values)) # Should be around 95.45%
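The sample above is drawn from a standard normal distribution, so its mean is approximately zero and comparing directly against ±2 * std_dev works. For data with an arbitrary mean, one option (a small sketch, not part of the original example) is to center on the mean first:
import numpy as np
from numpy.random import default_rng
rng = default_rng()
values = rng.normal(loc=10.0, scale=2.0, size=10_000)  # non-zero mean this time
mean, std_dev = values.mean(), values.std()
# Center on the mean before comparing against two standard deviations
within_2_sigma = values[np.abs(values - mean) < 2 * std_dev]
print(len(within_2_sigma) / len(values))  # still roughly 95% for normal data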
Key Takeaways
Vectorized Boolean operations: use & (and) and | (or) instead of Python's and/or, as shown in the sketch below.
Efficient filtering: NumPy evaluates conditions across entire arrays without explicit loops.
Flexible selection: extract specific ranges, statistical properties, or patterns.
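One detail worth calling out: because & and | bind more tightly than comparisons, each condition needs its own parentheses. A minimal sketch (the array values are arbitrary):
import numpy as np
data = np.array([3, 8, 12, 5, 20, 7, 16])
# Keep values that are greater than 5 AND even
print(data[(data > 5) & (data % 2 == 0)])   # [ 8 12 20 16]
# Keep values that are less than 5 OR multiples of 10
print(data[(data < 5) | (data % 10 == 0)])  # [ 3 20]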
Transposing, Sorting, and Concatenating
When working with NumPy arrays, sometimes you need to rearrange data for better analysis. Let's look at three powerful techniques:
1️⃣ Transposing an Array
Transposing is the process of swapping rows and columns. If you have a 2D array, transposing flips it so that the first row becomes the first column, the second row becomes the second column, and so on.
import numpy as np
arr = np.array([
    [1, 2],
    [3, 4],
    [5, 6]
])
print(arr.T) # Transpose using .T
print(arr.transpose()) # Another way to transpose
💡 Tip: .T is a shortcut for .transpose().
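Two behaviours that often surprise newcomers (a quick sketch, not part of the original example): transposing swaps the shape, and .T on a one-dimensional array returns it unchanged.
import numpy as np
arr = np.array([[1, 2], [3, 4], [5, 6]])
print(arr.shape)    # (3, 2)
print(arr.T.shape)  # (2, 3)
vec = np.array([1, 2, 3])
print(vec.T)        # [1 2 3] -- transposing a 1-D array has no effect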
2️⃣ Sorting in NumPy
Sorting helps organize your data in ascending order (the default); NumPy's sort has no built-in descending option, so for descending order you reverse the sorted result.
data = np.array([
    [7, 1, 4],
    [8, 6, 5],
    [1, 2, 3]
])
print(np.sort(data)) # Sorts each row individually
print(np.sort(data, axis=0)) # Sorts each column individually
print(np.sort(data, axis=None)) # Flattens and sorts entire array
💡 Tip: axis=0 sorts down each column, axis=1 sorts across each row, and axis=None flattens the array and sorts it as a single list.
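Since there is no descending flag, one common approach (sketched here; other approaches work too) is to sort and then reverse along the relevant axis:
import numpy as np
data = np.array([7, 1, 4, 8, 6, 5])
descending = np.sort(data)[::-1]       # sort, then reverse
print(descending)                      # [8 7 6 5 4 1]
# For a 2-D array, reverse along the axis you sorted
grid = np.array([[7, 1, 4], [8, 6, 5]])
print(np.sort(grid, axis=1)[:, ::-1])  # each row in descending order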
3️⃣ Concatenating Arrays
Concatenation means joining multiple arrays together. You can do it horizontally (side by side) or vertically (stacking on top).
a = np.array([
    [4, 8],
    [6, 1]
])
b = np.array([
    [3, 5],
    [7, 2]
])
print(np.hstack((a, b))) # Horizontal stacking
print(np.vstack((a, b))) # Vertical stacking
print(np.concatenate((a, b))) # Default is vertical stacking
print(np.concatenate((a, b), axis=None)) # Flattens and joins both arrays
💡 Tip: hstack() joins arrays side by side, vstack() stacks them top to bottom, and concatenate() gives you more flexibility through its axis argument.
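For example, concatenate() can reproduce both behaviours by choosing the axis (a small sketch reusing the same a and b as above):
import numpy as np
a = np.array([[4, 8], [6, 1]])
b = np.array([[3, 5], [7, 2]])
print(np.concatenate((a, b), axis=0))  # same result as vstack: stack rows
print(np.concatenate((a, b), axis=1))  # same result as hstack: join side by side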
Aggregation in NumPy
Aggregation helps summarize data using functions like sum(), max(), mean(), and std().
import numpy as np
arr = np.array([
    [3, 6, 9],
    [2, 4, 8],
    [1, 5, 7]
])
print(np.sum(arr)) # Total sum
print(np.max(arr)) # Max value
print(np.mean(arr)) # Mean value
print(np.std(arr)) # Standard deviation
Aggregation Along Axes:
print(np.sum(arr, axis=0)) # Column-wise sum
print(np.mean(arr, axis=1)) # Row-wise mean
axis=0 → columns, axis=1 → rows.
NumPy also includes further aggregation functions, such as medians, percentiles, and cumulative sums, for deeper analysis.
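For instance (a quick sketch using the same arr as above; these functions are just a sample of what's available):
print(np.median(arr))          # middle value of all elements -> 5.0
print(np.percentile(arr, 75))  # 75th percentile of all elements
print(np.cumsum(arr, axis=1))  # running sum across each row
print(np.min(arr, axis=0))     # column-wise minimum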
Practical Example: Standardizing a Dataset using NumPy
Problem:
In Data Science and Machine Learning, datasets often contain features with different scales. Some may have values ranging from 0 to 1, while others might range from 1,000 to 10,000. This can negatively affect models like k-NN, Linear Regression, and Neural Networks. Standardization (also called Z-score normalization) helps by transforming data to have a mean of 0 and standard deviation of 1.
Solution Using NumPy:
import numpy as np
# Sample dataset (rows = samples, columns = features)
data = np.array([
    [50, 2000, 3.5],
    [20, 1500, 2.1],
    [30, 1800, 4.3],
    [40, 2100, 3.9]
])
# Compute mean and standard deviation
mean = np.mean(data, axis=0)
std_dev = np.std(data, axis=0)
# Standardization formula: (X - mean) / std_dev
standardized_data = (data - mean) / std_dev
print("Original Data:\n", data)
print("\nStandardized Data:\n", standardized_data)
Explanation:
Compute mean and standard deviation for each feature (column).
Apply the Z-score formula: X_new = (X - μ) / σ
Now, all features have the same scale, making them suitable for Machine Learning models.
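As a quick sanity check (continuing from the snippet above), each column of the standardized data should now have a mean of roughly 0 and a standard deviation of roughly 1:
print(standardized_data.mean(axis=0))  # approximately [0. 0. 0.]
print(standardized_data.std(axis=0))   # approximately [1. 1. 1.]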
Final Words
NumPy is an incredibly powerful library that makes complex data operations simple and efficient. From indexing and filtering to transposing, sorting, concatenating, and aggregating, it provides a robust set of tools for working with large datasets.
With its optimized mathematical operations, NumPy significantly speeds up computations, making it a must-have for data manipulation, scientific computing, and machine learning workflows.
Mastering these fundamental techniques will not only enhance your coding efficiency but also help you tackle real-world data challenges with confidence!