`pandas`

Introduction

Info

At the time of writing, pandas version 2.2.3 was used. Keep in mind, that pandas is actively developed and some functionalities might change in the future. However, as always, we try to keep the content up-to-date.

This section is heavily based on the excellent 10 minutes to pandas guide.

The data set

We will use a custom Spotify data set, containing the current¹ top 50 songs in Austria. You can find the corresponding playlist here.

Info

If you're interested in the creation of the data set, you can find the code below. Note, pandas was the only package needed, and we will cover some of the used functionalities in this section.

Create Spotify data set

# Create a Spotify data set, containing the top 50 tracks in Austria

# Original data (Top Spotify Songs in 73 Countries (Daily Updated)) from:
# https://www.kaggle.com/datasets/asaniczka/top-spotify-songs-in-73-countries-daily-updated/data
from pathlib import Path

import pandas as pd

# read initial data set
data = pd.read_csv(Path("data/universal_top_spotify_songs.csv"))

# only Austrian chart topping songs
data = data[data["country"] == "AT"]

# subset by latest snapshot date
latest_snapshot = data["snapshot_date"].max()
data = data[data["snapshot_date"] == latest_snapshot]
# sort by daily_rank
data = data.sort_values(by="daily_rank").reset_index(drop=True)

# write data to csv
data.to_csv(Path("data/spotify-top50.csv"), index=False)

# excerpt of the data set for inclusion in markdown
with Path("data/spotify-top50.md").open("w", encoding="UTF-8") as f:
    prefix = (
        f"Excerpt of the data set (snapshot date: **{latest_snapshot}**):\n\n"
    )
    _data = data[
        [
            "daily_rank",
            "name",
            "artists",
            "popularity",
            "is_explicit",
            "energy",
        ]
    ]
    # only top 5 songs
    _data = _data.head(5)
    markdown = _data.to_markdown(index=False)
    f.write(prefix + markdown)

Excerpt of the data set (snapshot date: 2024-09-25):

daily_rank	name	artists	popularity	is_explicit	energy
1	The Emptiness Machine	Linkin Park	93	True	0.872
2	Rote Flaggen	Berq	76	True	0.336
3	Bauch Beine Po	Shirin David	80	True	0.746
4	Die With A Smile	Lady Gaga, Bruno Mars	100	False	0.592
5	BIRDS OF A FEATHER	Billie Eilish	99	False	0.507

Download the whole data set to follow this section:

Spotify Top 50 Austria

Prerequisites

For this section, we recommend, to make a new project folder with a Jupyter notebook. Additionally, create a new virtual environment and activate it. Please refer to the previous section on packages and virtual environments if you're having trouble. Lastly, install pandas.

You should end up with a project structure similar to the following:

📁 pandas-course/
├── 📁 .venv/
├── 📄 spotify-top50.csv
└── 📄 pandas-course.ipynb

Tabular data

Before we dive into pandas, let's briefly discuss tabular data. At its simplest, tabular data consists of rows and columns. Looking at the Spotify table above; each row contains information about a specific track (e.g., name, artists), while each column represents a specific attribute (e.g., popularity, energy).

Tabular data has a clear structure which makes it easy to work with. On the other hand sources for tabular data can be manifold. However, one of the most common format is the XLSX ( - Excel) or CSV ( - Comma Separated Values) format which is the one we are working with in this chapter. Nevertheless, tabular data is also present in various other text based formats like TXT, TSV or even in databases (e.g. MySQL, PostgreSQL).

No matter the source, pandas is the go-to tool to work with tabular data.

Getting started

Let's explore some of pandas functionalities on the example of the Spotify data set. First, we need to import the package.

import pandas as pd

The as statement is used to create an alias for the package in order to quickly reference it within our next code snippets. An alias simply reduces the amount of characters you have to type. Moreover, the alias pd is commonly used for pandas. Therefore, you can more easily employ code snippets you find online.

Reading files

With the package imported, we can already read the data set (given as .csv).

data = pd.read_csv("spotify-top50.csv")

The above code snippet assumes, that both data set and notebook are located at the same directory level. Else, you have to adjust the path "spotify-top50.csv" accordingly.

Besides .csv files, pandas supports reading from various other file types like Excel, text files or a SQL database. The pandas documentation provides a comprehensive overview of different file types and their corresponding function. Have a look, to get an idea which file formats are supported not only for reading but also for writing.

Displaying data

With a data set at hand, we will most likely want to view it. To view the first rows of our data frame use the head() method.

print(data.head())

>>> Output

               spotify_id                   name  ...    tempo  time_signature
0  2PnlsTsOTLE5jnBnNe2K0A  The Emptiness Machine  ...  184.115               4
1  7bkUa9kDFGxgCC7d36dzFI           Rote Flaggen  ...  109.940               3
2  64f3yNXsi2Vk76odcHCOnw         Bauch Beine Po  ...  123.969               4
3  2plbrEY59IikOBgBGLjaoe       Die With A Smile  ...  157.969               3
4  6dOtVTDdiauQNBQEDOtlAB     BIRDS OF A FEATHER  ...  104.978               4

To display the last rows of the data frame, use the tail() method.

print(data.tail())

>>> Output

                spotify_id  ... time_signature
45  6leQi7NakJQS1vHRtZsroe  ...              4
46  5E4jBLx4P0UBji68bBThSw  ...              4
47  6qzetQfgRVyAGEg8QhqzYD  ...              4
48  3WOhcATHxK2SLNeP5W3v1v  ...              4
49  7xLbQTeLpeqlxxTPLSiM20  ...              4

Columns can be viewed with:

print(data.columns)

>>> Output

Index(['spotify_id', 'name', 'artists', 'daily_rank', 'daily_movement',
       'weekly_movement', 'country', 'snapshot_date', 'popularity',
       'is_explicit', 'duration_ms', 'album_name', 'album_release_date',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'time_signature'],
      dtype='object')

Similarly, we can print the row indices.

print(data.index)

>>> Output

RangeIndex(start=0, stop=50, step=1)

This data set is consecutively indexed from 0 to 49. If you recall, a range does not include its stop value (50).

By default, (if not otherwise specified) pandas will assign a range index to a data set in order to label the rows.

The data set dimensions are accessed with the shape attribute.

print(data.shape)

>>> Output

(50, 25)

The data set has 50 rows and 25 columns.

Data structures

pandas has two main data structures: Series and DataFrame. As you would expect, a DataFrame is a two-dimensional data structure such as our whole Spotify data set assigned to the variable data:

print(type(data))

>>> Output

<class 'pandas.core.frame.DataFrame'>

Whereas, a single column of a DataFrame is referred to as a Series. Generally, selections of the DataFrame can be accessed with square brackets ([]). To get a column, you can simply use its name.

print(data["artists"])

print(type(data["artists"]))

>>> Output

0                                       Linkin Park
1                                              Berq
2                                      Shirin David
3                             Lady Gaga, Bruno Mars
...                                             ...

<class 'pandas.core.series.Series'>

A DataFrame is composed of at least one Series.

Detour: `Series` and `DataFrame` from scratch

It's not always the case that you have a data set (in form of a file) at hand. Sometimes you have to create a Series or DataFrame yourself.

A Series can be easily created from a list.

austrian_artists = ["Bibiza", "Wanda", "Bilderbuch"]
austrian_artists = pd.Series(austrian_artists)

To initiate a DataFrame, you can use a dictionary (among others).

austrian_artists = {
    "name": ["Bibiza", "Wanda", "Bilderbuch"],
    "album": ["bis einer weint", "Amore", "mea culpa"],
    "release_year": [2024, 2014, 2019]
}
austrian_artists = pd.DataFrame(austrian_artists)

Dictionary keys are used as column names and the corresponding values as the column values.

Info

Apart from a dict, a DataFrame can be created from multiple other data structures like a list or tuple. For an extensive guide, visit the pandas documentation on Intro to data structures (specifically the section DataFrame).

Selecting data

Let's dive deeper into selecting data. To access specific rows, you can use a slice (just like with lists).

# rows 5 and 6
print(data[5:7])

>>> Output

               spotify_id        name  ...    tempo  time_signature
5  0io16MKpbeDIdYzmGpQaES  Embrace It  ...  114.933               4
6  3aJT51ya8amzpT3TKDVipL         FTW  ...   91.937               4

Select multiple columns by passing a list of column names.

print(data[["name", "artists"]])

>>> Output

                                  name    artists
0                The Emptiness Machine    Linkin Park
1                         Rote Flaggen    Berq
2                       Bauch Beine Po    Shirin David
...

Boolean indexing

Most of the time, we want to filter the data based on criterias. For example, we can select the tracks with a tempo higher than 120 beats per minute (BPM).

high_tempo = data[data["tempo"] > 120]

Let's break the example down:

First, we select the column tempo from the data set with data["tempo"].
Next, we expand our expression to data["tempo"] > 120. This will return a Series of boolean values.
Lastly, we wrap the expression in another set of square brackets to filter the whole data set based on our boolean values.

We end up with 27 tracks that meet the criteria. high_tempo is a new DataFrame containing entries that exceed 120 BPM.

Danceable tracks

We assume that tracks with a danceability score higher than 0.8 are danceable.

How many of the tracks are danceable?

Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

-- Spotify for Developers

Mathematical operations

pandas supports mathematical operations on both Series and DataFrame. For instance, we weigh the popularity of a track by its energy level.

Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.

-- Spotify for Developers

weighted_popularity = data["popularity"].mul(data["energy"])
print(type(weighted_popularity))

# assign the Series to the DataFrame
data["weighted_popularity"] = weighted_popularity

The mul() method is used to multiply the popularity and energy columns. The resulting Series is assigned to data as a new column.

>>> Output

<class 'pandas.core.series.Series'>

Track length

Since songs are getting shorter and shorter, we want to know how long the tracks in our data set are. To do so, calculate the length in minutes.

Explore the given data set to find the appropriate column (which holds information on the song length).
Calculate the length in minutes (hint: use the pandas documentation or Google.)
Assign the result to the data frame.
Use boolean indexing, to check if there are any tracks longer than 4 minutes.
Lastly, calculate the average track length in minutes.

Basic statistics

pandas provides a variety of methods to calculate basic statistics. For instance, min(), max(), mean() can be easily retrieved for a numeric Series in the data set.

print(data["tempo"].min())

>>> Output

80.969

Conveniently, statistics can be calculated for each column at once using the DataFrame. In this example, we calculate the standard deviation.

print(data.std())

If you execute the above snippet, a TypeError is raised.

Fix the error

Try to determine, why the error was raised in the first place. Now, circumvent/fix the error.

Hint: Look at the documentation of the std() method and its parameters.

If you want to calculate multiple statistics, you can call the describe() method.

stats = data.describe()
print(stats)

>>> Output

       daily_rank  daily_movement  ...       tempo  time_signature
count    50.00000        50.00000  ...   50.000000       50.000000
mean     25.50000         1.04000  ...  125.087260        3.920000
std      14.57738         8.14902  ...   26.751323        0.340468
min       1.00000       -22.00000  ...   80.969000        3.000000
25%      13.25000        -3.00000  ...  104.990750        4.000000
50%      25.50000         1.00000  ...  123.981500        4.000000
75%      37.75000         3.00000  ...  137.487250        4.000000
max      50.00000        29.00000  ...  184.115000        5.000000

describe() provides descriptive statistics for each column. The result of describe() is a DataFrame itself.

Other functionalities

pandas offers a plethora of functionalities. There's simply too much to cover in a brief introductory section. Still, there are some common DataFrame methods/properties that are worth mentioning:

sort_values(): Sort the data frame by a specific column.
groupby(): Group the data frame by a column.
merge(): Merge two data frames.
T: Transpose of the data frame.
drop_duplicates(): Remove duplicates.
dropna(): Remove missing values.

All methods are linked to its corresponding documentation with examples that help you get started.

Recap

We covered pandas and some selected functionalities which should provide you with a solid foundation to work with tabular data sets. Moreover, you should be able to follow the code portions in the upcoming courses more easily.

The full data set is available on Kaggle and contains the most streamed songs for multiple different countries. For our purpose, the data was subset. ↩

pandas

Introduction

The data set

Prerequisites

Tabular data

Getting started

Reading files

Displaying data

Data structures

Detour: Series and DataFrame from scratch

Selecting data

Boolean indexing

Mathematical operations

Basic statistics

Other functionalities

Recap

`pandas`

Detour: `Series` and `DataFrame` from scratch