Skip to content

pandas

Introduction

Spotify Top 50 Austria
Info

At the time of writing, pandas version 2.2.3 was used. Keep in mind, that pandas is actively developed and some functionalities might change in the future. However, as always, we try to keep the content up-to-date.

This section is heavily based on the excellent 10 minutes to pandas guide.


The data set

Spotify Top 50 Austria

We will use a custom Spotify data set, containing the current1 top 50 songs in Austria. You can find the corresponding playlist here.

Info

If you're interested in the creation of the data set, you can find the code below. Note, pandas was the only package needed, and we will cover some of the used functionalities in this section.

Create Spotify data set
# Create a Spotify data set, containing the top 50 tracks in Austria

# Original data (Top Spotify Songs in 73 Countries (Daily Updated)) from:
# https://www.kaggle.com/datasets/asaniczka/top-spotify-songs-in-73-countries-daily-updated/data
from pathlib import Path

import pandas as pd

# read initial data set
data = pd.read_csv(Path("data/universal_top_spotify_songs.csv"))

# only Austrian chart topping songs
data = data[data["country"] == "AT"]

# subset by latest snapshot date
latest_snapshot = data["snapshot_date"].max()
data = data[data["snapshot_date"] == latest_snapshot]
# sort by daily_rank
data = data.sort_values(by="daily_rank").reset_index(drop=True)

# write data to csv
data.to_csv(Path("data/spotify-top50.csv"), index=False)

# excerpt of the data set for inclusion in markdown
with Path("data/spotify-top50.md").open("w", encoding="UTF-8") as f:
    prefix = (
        f"Excerpt of the data set (snapshot date: **{latest_snapshot}**):\n\n"
    )
    _data = data[
        [
            "daily_rank",
            "name",
            "artists",
            "popularity",
            "is_explicit",
            "energy",
        ]
    ]
    # only top 5 songs
    _data = _data.head(5)
    markdown = _data.to_markdown(index=False)
    f.write(prefix + markdown)

Excerpt of the data set (snapshot date: 2024-09-25):

daily_rank name artists popularity is_explicit energy
1 The Emptiness Machine Linkin Park 93 True 0.872
2 Rote Flaggen Berq 76 True 0.336
3 Bauch Beine Po Shirin David 80 True 0.746
4 Die With A Smile Lady Gaga, Bruno Mars 100 False 0.592
5 BIRDS OF A FEATHER Billie Eilish 99 False 0.507

Download the whole data set to follow this section:

Prerequisites

For this section, we recommend, to make a new project folder with a Jupyter notebook. Additionally, create a new virtual environment and activate it. Please refer to the previous section on packages and virtual environments if you're having trouble. Lastly, install pandas.

You should end up with a project structure similar to the following:

📁 pandas-course/
├── 📁 .venv/
├── 📄 spotify-top50.csv
└── 📄 pandas-course.ipynb

Tabular data

Before we dive into pandas, let's briefly discuss tabular data. At its simplest, tabular data consists of rows and columns. Looking at the Spotify table above; each row contains information about a specific track (e.g., name, artists), while each column represents a specific attribute (e.g., popularity, energy).

Tabular data has a clear structure which makes it easy to work with. On the other hand sources for tabular data can be manifold. However, one of the most common format is the XLSX ( - Excel) or CSV ( - Comma Separated Values) format which is the one we are working with in this chapter. Nevertheless, tabular data is also present in various other text based formats like TXT, TSV or even in databases (e.g. MySQL, PostgreSQL).

No matter the source, pandas is the go-to tool to work with tabular data.

Getting started

Let's explore some of pandas functionalities on the example of the Spotify data set. First, we need to import the package.

import pandas as pd

The as statement is used to create an alias for the package in order to quickly reference it within our next code snippets. An alias simply reduces the amount of characters you have to type. Moreover, the alias pd is commonly used for pandas. Therefore, you can more easily employ code snippets you find online.

Reading files

With the package imported, we can already read the data set (given as .csv).

data = pd.read_csv("spotify-top50.csv")

The above code snippet assumes, that both data set and notebook are located at the same directory level. Else, you have to adjust the path "spotify-top50.csv" accordingly.

Besides .csv files, pandas supports reading from various other file types like Excel, text files or a SQL database. The pandas documentation provides a comprehensive overview of different file types and their corresponding function. Have a look, to get an idea which file formats are supported not only for reading but also for writing.

Displaying data

With a data set at hand, we will most likely want to view it. To view the first rows of our data frame use the head() method.

print(data.head())
>>> Output
               spotify_id                   name  ...    tempo  time_signature
0  2PnlsTsOTLE5jnBnNe2K0A  The Emptiness Machine  ...  184.115               4
1  7bkUa9kDFGxgCC7d36dzFI           Rote Flaggen  ...  109.940               3
2  64f3yNXsi2Vk76odcHCOnw         Bauch Beine Po  ...  123.969               4
3  2plbrEY59IikOBgBGLjaoe       Die With A Smile  ...  157.969               3
4  6dOtVTDdiauQNBQEDOtlAB     BIRDS OF A FEATHER  ...  104.978               4

To display the last rows of the data frame, use the tail() method.

print(data.tail())
>>> Output
                spotify_id  ... time_signature
45  6leQi7NakJQS1vHRtZsroe  ...              4
46  5E4jBLx4P0UBji68bBThSw  ...              4
47  6qzetQfgRVyAGEg8QhqzYD  ...              4
48  3WOhcATHxK2SLNeP5W3v1v  ...              4
49  7xLbQTeLpeqlxxTPLSiM20  ...              4

Columns can be viewed with:

print(data.columns)
>>> Output
Index(['spotify_id', 'name', 'artists', 'daily_rank', 'daily_movement',
       'weekly_movement', 'country', 'snapshot_date', 'popularity',
       'is_explicit', 'duration_ms', 'album_name', 'album_release_date',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'time_signature'],
      dtype='object')

Similarly, we can print the row indices.

print(data.index)

>>> Output
RangeIndex(start=0, stop=50, step=1)
This data set is consecutively indexed from 0 to 49. If you recall, a range does not include its stop value (50).

By default, (if not otherwise specified) pandas will assign a range index to a data set in order to label the rows.

The data set dimensions are accessed with the shape attribute.

print(data.shape)
>>> Output
(50, 25)

The data set has 50 rows and 25 columns.

Data structures

pandas has two main data structures: Series and DataFrame. As you would expect, a DataFrame is a two-dimensional data structure such as our whole Spotify data set assigned to the variable data:

print(type(data))
>>> Output
<class 'pandas.core.frame.DataFrame'>

Whereas, a single column of a DataFrame is referred to as a Series. Generally, selections of the DataFrame can be accessed with square brackets ([]). To get a column, you can simply use its name.

print(data["artists"])

print(type(data["artists"]))
>>> Output
0                                       Linkin Park
1                                              Berq
2                                      Shirin David
3                             Lady Gaga, Bruno Mars
...                                             ...

<class 'pandas.core.series.Series'>

A DataFrame is composed of at least one Series.

Selecting data

Let's dive deeper into selecting data. To access specific rows, you can use a slice (just like with lists).

# rows 5 and 6
print(data[5:7])
>>> Output
               spotify_id        name  ...    tempo  time_signature
5  0io16MKpbeDIdYzmGpQaES  Embrace It  ...  114.933               4
6  3aJT51ya8amzpT3TKDVipL         FTW  ...   91.937               4

Select multiple columns by passing a list of column names.

print(data[["name", "artists"]])
>>> Output
                                  name    artists
0                The Emptiness Machine    Linkin Park
1                         Rote Flaggen    Berq
2                       Bauch Beine Po    Shirin David
...

Boolean indexing

Most of the time, we want to filter the data based on criterias. For example, we can select the tracks with a tempo higher than 120 beats per minute (BPM).

high_tempo = data[data["tempo"] > 120]

Let's break the example down:

  • First, we select the column tempo from the data set with data["tempo"].
  • Next, we expand our expression to data["tempo"] > 120. This will return a Series of boolean values.
  • Lastly, we wrap the expression in another set of square brackets to filter the whole data set based on our boolean values.

We end up with 27 tracks that meet the criteria. high_tempo is a new DataFrame containing entries that exceed 120 BPM.

Danceable tracks

Saturday Night Fever
Saturday Night Fever

We assume that tracks with a danceability score higher than 0.8 are danceable.

How many of the tracks are danceable?

Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

-- Spotify for Developers

Mathematical operations

pandas supports mathematical operations on both Series and DataFrame. For instance, we weigh the popularity of a track by its energy level.

Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.

-- Spotify for Developers

weighted_popularity = data["popularity"].mul(data["energy"])
print(type(weighted_popularity))

# assign the Series to the DataFrame
data["weighted_popularity"] = weighted_popularity

The mul() method is used to multiply the popularity and energy columns. The resulting Series is assigned to data as a new column.

>>> Output
<class 'pandas.core.series.Series'>
Track length

Spotify App

Since songs are getting shorter and shorter, we want to know how long the tracks in our data set are. To do so, calculate the length in minutes.

  • Explore the given data set to find the appropriate column (which holds information on the song length).
  • Calculate the length in minutes (hint: use the pandas documentation or Google.)
  • Assign the result to the data frame.
  • Use boolean indexing, to check if there are any tracks longer than 4 minutes.
  • Lastly, calculate the average track length in minutes.

Basic statistics

pandas provides a variety of methods to calculate basic statistics. For instance, min(), max(), mean() can be easily retrieved for a numeric Series in the data set.

print(data["tempo"].min())
>>> Output
80.969

Conveniently, statistics can be calculated for each column at once using the DataFrame. In this example, we calculate the standard deviation.

print(data.std())

If you execute the above snippet, a TypeError is raised.

Fix the error

Try to determine, why the error was raised in the first place. Now, circumvent/fix the error.

Hint: Look at the documentation of the std() method and its parameters.

If you want to calculate multiple statistics, you can call the describe() method.

stats = data.describe()
print(stats)
>>> Output
       daily_rank  daily_movement  ...       tempo  time_signature
count    50.00000        50.00000  ...   50.000000       50.000000
mean     25.50000         1.04000  ...  125.087260        3.920000
std      14.57738         8.14902  ...   26.751323        0.340468
min       1.00000       -22.00000  ...   80.969000        3.000000
25%      13.25000        -3.00000  ...  104.990750        4.000000
50%      25.50000         1.00000  ...  123.981500        4.000000
75%      37.75000         3.00000  ...  137.487250        4.000000
max      50.00000        29.00000  ...  184.115000        5.000000

describe() provides descriptive statistics for each column. The result of describe() is a DataFrame itself.

Other functionalities

pandas offers a plethora of functionalities. There's simply too much to cover in a brief introductory section. Still, there are some common DataFrame methods/properties that are worth mentioning:

All methods are linked to its corresponding documentation with examples that help you get started.

Recap

We covered pandas and some selected functionalities which should provide you with a solid foundation to work with tabular data sets. Moreover, you should be able to follow the code portions in the upcoming courses more easily.


  1. The full data set is available on Kaggle and contains the most streamed songs for multiple different countries. For our purpose, the data was subset.