3.1. Tabular Data in Pandas#

Tabular data is one of the most common forms of data you’ll encounter in data analysis. It consists of rows and columns, much like a table in a spreadsheet or database. This structured format allows data to be easily organized and manipulated. Some key characteristics of tabular data include:

  • Rows: Represent individual records or observations.

  • Columns: Represent variables or features that describe the data.

The most common way to represent tabular data in Python, comes from files called Comma Separated Values (CSV). CSVs are of plain text format where each row corresponds to a record, and the columns are separated by commas.

3.1.1. Gotta Catch ‘Em All!#

Let’s consider some examples with a Pokémon dataset.

The Pokémon franchise (short for “Pocket Monsters”) was created by Satoshi Tajiri and Ken Sugimori, and first introduced by Nintendo, Game Freak, and Creatures in 1996. It has since become one of the most popular and enduring multimedia franchises in the world, centering around fictional creatures that humans, known as Pokémon Trainers, catch and train to battle each other.

The core of the franchise revolves around video games, where players explore various regions, capture wild Pokémon, and aim to complete a comprehensive database known as the Pokédex.

Let’s consider the following dataset, which we can load using the pd.read_csv() function:

import pandas as pd

pokedex = pd.read_csv("pokedex.csv")
pokedex
Name Type 1 Type 2 Total HP Attack Defense SP. Atk. SP. Def Speed
0 Bulbasaur Grass Poison 318 45 49 49 65 65 45
1 Ivysaur Grass Poison 405 60 62 63 80 80 60
2 Venusaur Grass Poison 525 80 82 83 100 100 80
3 Venusaur Mega Venusaur Grass Poison 625 80 100 123 122 120 80
4 Charmander Fire NaN 309 39 52 43 60 50 65
... ... ... ... ... ... ... ... ... ... ...
1210 Iron Crown Steel Psychic 590 90 72 100 122 108 98
1211 Terapagos Normal Form Normal NaN 450 90 65 85 65 85 60
1212 Terapagos Terastal Form Normal NaN 600 95 95 110 105 110 85
1213 Terapagos Stellar Form Normal NaN 700 160 105 110 130 110 85
1214 Pecharunt Poison Ghost 600 88 88 160 88 88 88

1215 rows × 10 columns

A dataset in this form is referred to as a DataFrame. This dataset includes information about each Pokémon, from the first generation to the ninth generation

Each row represents one Pokémon. The columns are:

Column Label

Description

Name

Pokémon’s name

Type 1

The primary type of the Pokémon, which defines its elemental characteristic.

Type 2

The secondary type of the Pokémon, if applicable. Some Pokémon have two types

Total

The sum of all base stats, indicating overall power.

HP

Hit Points, representing how much damage a Pokémon can take.

Attack

Physical attack power, affecting damage from physical moves.

Defense

Ability to resist physical attacks.

SP. Atk.

Special Attack power, for non-physical moves (e.g., energy-based)

SP. Def.

Special Defense, resistance to special (non-physical) moves.

Speed

Determines attack order in battle; higher Speed means attacking first.

We can see that the first Pokémon in the dataset is Bulbasaur, who is a Grass/Poison-type Pokémon with a total stat of 318, including 45 HP, 49 Attack, 49 Defense, 65 Special Attack, 65 Special Defense, and 45 Speed.

3.1.1.1. The first n rows#

We can see that our pokedex DataFrame has 1215 rows, and 10 unique columns. If we only want to see a specific amount of rows, we can do so using df.head(), where df corresponds to a generic dataframe.

pokedex.head() #Show the first 5 rows
Name Type 1 Type 2 Total HP Attack Defense SP. Atk. SP. Def Speed
0 Bulbasaur Grass Poison 318 45 49 49 65 65 45
1 Ivysaur Grass Poison 405 60 62 63 80 80 60
2 Venusaur Grass Poison 525 80 82 83 100 100 80
3 Venusaur Mega Venusaur Grass Poison 625 80 100 123 122 120 80
4 Charmander Fire NaN 309 39 52 43 60 50 65

By default, if we don’t specify a specific value in .head(), the default is 5.

Let’s try looking at the first 8 rows, corresponding to the first 8 Pokémon:

pokedex.head(8)
Name Type 1 Type 2 Total HP Attack Defense SP. Atk. SP. Def Speed
0 Bulbasaur Grass Poison 318 45 49 49 65 65 45
1 Ivysaur Grass Poison 405 60 62 63 80 80 60
2 Venusaur Grass Poison 525 80 82 83 100 100 80
3 Venusaur Mega Venusaur Grass Poison 625 80 100 123 122 120 80
4 Charmander Fire NaN 309 39 52 43 60 50 65
5 Charmeleon Fire NaN 405 58 64 58 80 65 80
6 Charizard Fire Flying 534 78 84 78 109 85 100
7 Charizard Mega Charizard X Fire Dragon 634 78 130 111 130 85 100

Notice as well, that dataframe rows are considered zero-indexed, where the first row is at index 0, the second at index 1, and so on and so forth

3.1.1.2. The last n rows#

We can view the last few rows using df.tail(). Again, if we don’t specify an argument, the default number of rows is 5

pokedex.tail() #last 5 rows
Name Type 1 Type 2 Total HP Attack Defense SP. Atk. SP. Def Speed
1210 Iron Crown Steel Psychic 590 90 72 100 122 108 98
1211 Terapagos Normal Form Normal NaN 450 90 65 85 65 85 60
1212 Terapagos Terastal Form Normal NaN 600 95 95 110 105 110 85
1213 Terapagos Stellar Form Normal NaN 700 160 105 110 130 110 85
1214 Pecharunt Poison Ghost 600 88 88 160 88 88 88

If we want to see the last 11 rows, perhaps, we can do:

pokedex.tail(11)
Name Type 1 Type 2 Total HP Attack Defense SP. Atk. SP. Def Speed
1204 Ogerpon Cornerstone Mask Grass Rock 550 80 120 84 60 96 110
1205 Archaludon Steel Dragon 600 90 105 130 125 65 85
1206 Hydrapple Grass Dragon 540 106 80 110 120 80 44
1207 Gouging Fire Fire Dragon 590 105 115 121 65 93 91
1208 Raging Bolt Electric Dragon 590 125 73 91 137 89 75
1209 Iron Boulder Rock Psychic 590 90 120 80 68 108 124
1210 Iron Crown Steel Psychic 590 90 72 100 122 108 98
1211 Terapagos Normal Form Normal NaN 450 90 65 85 65 85 60
1212 Terapagos Terastal Form Normal NaN 600 95 95 110 105 110 85
1213 Terapagos Stellar Form Normal NaN 700 160 105 110 130 110 85
1214 Pecharunt Poison Ghost 600 88 88 160 88 88 88

3.1.1.3. Structure#

We can also use df.info() to give an overview of the dataset, including the number of entries, column names, data types, and non-null counts.

pokedex.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1215 entries, 0 to 1214
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      1215 non-null   object
 1   Type 1    1215 non-null   object
 2   Type 2    669 non-null    object
 3   Total     1215 non-null   int64 
 4   HP        1215 non-null   int64 
 5   Attack    1215 non-null   int64 
 6   Defense   1215 non-null   int64 
 7   SP. Atk.  1215 non-null   int64 
 8   SP. Def   1215 non-null   int64 
 9   Speed     1215 non-null   int64 
dtypes: int64(7), object(3)
memory usage: 95.1+ KB

We can see that each of the columns, except the Type 2 column, have 0 null (empty) entries. We do see though, that only 669 of the Pokémon have an entry in the Type 2 column. This seems to indicate that around half of the Pokémon only have a single type associated with them

3.1.1.4. Summary Statistics#

We can use df.describe() to get some summary statistics for each of our columns in our dataset

pokedex.describe()
Total HP Attack Defense SP. Atk. SP. Def Speed
count 1215.000000 1215.000000 1215.000000 1215.000000 1215.000000 1215.000000 1215.000000
mean 443.104527 71.244444 81.152263 75.007407 73.224691 72.441152 70.034568
std 121.193406 26.927819 32.037134 30.740999 32.757152 27.578188 30.161298
min 175.000000 1.000000 5.000000 5.000000 10.000000 20.000000 5.000000
25% 332.000000 52.000000 57.000000 52.000000 50.000000 51.000000 45.000000
50% 465.000000 70.000000 80.000000 70.000000 65.000000 70.000000 68.000000
75% 521.000000 85.000000 100.000000 91.000000 95.000000 90.000000 91.000000
max 1125.000000 255.000000 190.000000 250.000000 194.000000 250.000000 200.000000

Note that these summary statistics are only calculate for columns which have numeric values, where a categorical column like "Name" is not present

3.1.1.5. Listing the columns#

Suppose we had a dataset which had many many columns. If we wanted to see the different column names, we could do so using df.columns

pokedex.columns
Index(['Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'SP. Atk.', 'SP. Def', 'Speed'],
      dtype='object')

3.1.1.6. Sorting#

Sorting is a fundamental operation in data analysis that allows you to organize your data based on specific criteria. In Pandas, the .sort_values() function is used to sort a DataFrame by one or more columns.

As an example, let’s sort the pokedex dataset by values in the Total column. In order to ensure that the row with the highest "Total" is at the top, we have to specify the ascending = False argument. Otherwise, we would have the Pokémon with the lowest "Total" at the top

pokedex_sorted = pokedex.sort_values(by='Total', ascending=False) #highest total at the top
pokedex_sorted.head()
Name Type 1 Type 2 Total HP Attack Defense SP. Atk. SP. Def Speed
1058 Eternatus Eternamax Poison Dragon 1125 255 115 250 125 250 130
475 Rayquaza Mega Rayquaza Dragon Flying 780 105 180 100 180 100 115
201 Mewtwo Mega Mewtwo X Psychic Fighting 780 106 190 100 154 100 130
202 Mewtwo Mega Mewtwo Y Psychic NaN 780 106 150 70 194 120 140
473 Groudon Primal Groudon Ground Fire 770 100 180 160 150 90 90
pokedex_sorted = pokedex.sort_values(by='Total') #lowest totals at the top
pokedex_sorted.head()
Name Type 1 Type 2 Total HP Attack Defense SP. Atk. SP. Def Speed
902 Wishiwashi Solo Form Water NaN 175 45 20 20 25 25 40
245 Sunkern Grass NaN 180 30 30 30 30 30 30
985 Blipbug Bug NaN 180 25 20 20 25 45 45
1034 Snom Ice Bug 185 30 25 35 45 30 20
368 Azurill Normal Fairy 190 50 20 40 20 40 20

If we want to sort the Pokémon alphabetically by their names, simply sort by the "Name" column:

pokedex_sorted = pokedex.sort_values(by='Name')
pokedex_sorted.head()
Name Type 1 Type 2 Total HP Attack Defense SP. Atk. SP. Def Speed
561 Abomasnow Grass Ice 494 90 92 75 92 85 60
562 Abomasnow Mega Abomasnow Grass Ice 594 90 132 105 132 105 30
83 Abra Psychic NaN 310 25 20 15 105 55 90
441 Absol Dark NaN 465 65 130 60 75 60 75
442 Absol Mega Absol Dark NaN 565 65 150 60 115 60 115

Again, notice that when we use .sort_values without the ascending = False argument, our "Name" column is sorted by ascending order, where pokemon names that come earliest in the alphabet are at the top of the Dataframe

pokedex_sorted = pokedex.sort_values(by='Name', ascending=False)
pokedex_sorted.head()
Name Type 1 Type 2 Total HP Attack Defense SP. Atk. SP. Def Speed
865 Zygarde Complete Forme Dragon Ground 708 216 100 121 91 95 85
863 Zygarde 50% Forme Dragon Ground 600 108 100 121 81 95 95
864 Zygarde 10% Forme Dragon Ground 486 54 100 71 61 85 115
760 Zweilous Dark Dragon 420 72 85 70 65 70 58
54 Zubat Poison Flying 245 40 45 35 30 40 55

Now, we can actually see the Pokémon which have names that come latest in the alphabet