Tabular Data in Pandas

3.1. Tabular Data in Pandas#

Tabular data is one of the most common forms of data you’ll encounter in data analysis. It consists of rows and columns, much like a table in a spreadsheet or database. This structured format allows data to be easily organized and manipulated. Some key characteristics of tabular data include:

Rows: Represent individual records or observations.
Columns: Represent variables or features that describe the data.

The most common way to represent tabular data in Python, comes from files called Comma Separated Values (CSV). CSVs are of plain text format where each row corresponds to a record, and the columns are separated by commas.

3.1.1. Gotta Catch ‘Em All!#

Let’s consider some examples with a Pokémon dataset.

The Pokémon franchise (short for “Pocket Monsters”) was created by Satoshi Tajiri and Ken Sugimori, and first introduced by Nintendo, Game Freak, and Creatures in 1996. It has since become one of the most popular and enduring multimedia franchises in the world, centering around fictional creatures that humans, known as Pokémon Trainers, catch and train to battle each other.

The core of the franchise revolves around video games, where players explore various regions, capture wild Pokémon, and aim to complete a comprehensive database known as the Pokédex.

Let’s consider the following dataset, which we can load using the pd.read_csv() function:

import pandas as pd

pokedex = pd.read_csv("pokedex.csv")
pokedex

	Name	Type 1	Type 2	Total	HP	Attack	Defense	SP. Atk.	SP. Def	Speed
0	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45
1	Ivysaur	Grass	Poison	405	60	62	63	80	80	60
2	Venusaur	Grass	Poison	525	80	82	83	100	100	80
3	Venusaur Mega Venusaur	Grass	Poison	625	80	100	123	122	120	80
4	Charmander	Fire	NaN	309	39	52	43	60	50	65
...	...	...	...	...	...	...	...	...	...	...
1210	Iron Crown	Steel	Psychic	590	90	72	100	122	108	98
1211	Terapagos Normal Form	Normal	NaN	450	90	65	85	65	85	60
1212	Terapagos Terastal Form	Normal	NaN	600	95	95	110	105	110	85
1213	Terapagos Stellar Form	Normal	NaN	700	160	105	110	130	110	85
1214	Pecharunt	Poison	Ghost	600	88	88	160	88	88	88

1215 rows × 10 columns

A dataset in this form is referred to as a DataFrame. This dataset includes information about each Pokémon, from the first generation to the ninth generation

Each row represents one Pokémon. The columns are:

Column Label	Description
`Name`	Pokémon’s name
`Type 1`	The primary type of the Pokémon, which defines its elemental characteristic.
`Type 2`	The secondary type of the Pokémon, if applicable. Some Pokémon have two types
`Total`	The sum of all base stats, indicating overall power.
`HP`	Hit Points, representing how much damage a Pokémon can take.
`Attack`	Physical attack power, affecting damage from physical moves.
`Defense`	Ability to resist physical attacks.
`SP. Atk.`	Special Attack power, for non-physical moves (e.g., energy-based)
`SP. Def.`	Special Defense, resistance to special (non-physical) moves.
`Speed`	Determines attack order in battle; higher Speed means attacking first.

We can see that the first Pokémon in the dataset is Bulbasaur, who is a Grass/Poison-type Pokémon with a total stat of 318, including 45 HP, 49 Attack, 49 Defense, 65 Special Attack, 65 Special Defense, and 45 Speed.

3.1.1.1. The first n rows#

We can see that our pokedex DataFrame has 1215 rows, and 10 unique columns. If we only want to see a specific amount of rows, we can do so using df.head(), where df corresponds to a generic dataframe.

pokedex.head() #Show the first 5 rows

	Name	Type 1	Type 2	Total	HP	Attack	Defense	SP. Atk.	SP. Def	Speed
0	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45
1	Ivysaur	Grass	Poison	405	60	62	63	80	80	60
2	Venusaur	Grass	Poison	525	80	82	83	100	100	80
3	Venusaur Mega Venusaur	Grass	Poison	625	80	100	123	122	120	80
4	Charmander	Fire	NaN	309	39	52	43	60	50	65

By default, if we don’t specify a specific value in .head(), the default is 5.

Let’s try looking at the first 8 rows, corresponding to the first 8 Pokémon:

pokedex.head(8)

	Name	Type 1	Type 2	Total	HP	Attack	Defense	SP. Atk.	SP. Def	Speed
0	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45
1	Ivysaur	Grass	Poison	405	60	62	63	80	80	60
2	Venusaur	Grass	Poison	525	80	82	83	100	100	80
3	Venusaur Mega Venusaur	Grass	Poison	625	80	100	123	122	120	80
4	Charmander	Fire	NaN	309	39	52	43	60	50	65
5	Charmeleon	Fire	NaN	405	58	64	58	80	65	80
6	Charizard	Fire	Flying	534	78	84	78	109	85	100
7	Charizard Mega Charizard X	Fire	Dragon	634	78	130	111	130	85	100

Notice as well, that dataframe rows are considered zero-indexed, where the first row is at index 0, the second at index 1, and so on and so forth

3.1.1.2. The last n rows#

We can view the last few rows using df.tail(). Again, if we don’t specify an argument, the default number of rows is 5

pokedex.tail() #last 5 rows

	Name	Type 1	Type 2	Total	HP	Attack	Defense	SP. Atk.	SP. Def	Speed
1210	Iron Crown	Steel	Psychic	590	90	72	100	122	108	98
1211	Terapagos Normal Form	Normal	NaN	450	90	65	85	65	85	60
1212	Terapagos Terastal Form	Normal	NaN	600	95	95	110	105	110	85
1213	Terapagos Stellar Form	Normal	NaN	700	160	105	110	130	110	85
1214	Pecharunt	Poison	Ghost	600	88	88	160	88	88	88

If we want to see the last 11 rows, perhaps, we can do:

pokedex.tail(11)

	Name	Type 1	Type 2	Total	HP	Attack	Defense	SP. Atk.	SP. Def	Speed
1204	Ogerpon Cornerstone Mask	Grass	Rock	550	80	120	84	60	96	110
1205	Archaludon	Steel	Dragon	600	90	105	130	125	65	85
1206	Hydrapple	Grass	Dragon	540	106	80	110	120	80	44
1207	Gouging Fire	Fire	Dragon	590	105	115	121	65	93	91
1208	Raging Bolt	Electric	Dragon	590	125	73	91	137	89	75
1209	Iron Boulder	Rock	Psychic	590	90	120	80	68	108	124
1210	Iron Crown	Steel	Psychic	590	90	72	100	122	108	98
1211	Terapagos Normal Form	Normal	NaN	450	90	65	85	65	85	60
1212	Terapagos Terastal Form	Normal	NaN	600	95	95	110	105	110	85
1213	Terapagos Stellar Form	Normal	NaN	700	160	105	110	130	110	85
1214	Pecharunt	Poison	Ghost	600	88	88	160	88	88	88

3.1.1.3. Structure#

We can also use df.info() to give an overview of the dataset, including the number of entries, column names, data types, and non-null counts.

pokedex.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1215 entries, 0 to 1214
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      1215 non-null   object
 1   Type 1    1215 non-null   object
 2   Type 2    669 non-null    object
 3   Total     1215 non-null   int64 
 4   HP        1215 non-null   int64 
 5   Attack    1215 non-null   int64 
 6   Defense   1215 non-null   int64 
 7   SP. Atk.  1215 non-null   int64 
 8   SP. Def   1215 non-null   int64 
 9   Speed     1215 non-null   int64 
dtypes: int64(7), object(3)
memory usage: 95.1+ KB

We can see that each of the columns, except the Type 2 column, have 0 null (empty) entries. We do see though, that only 669 of the Pokémon have an entry in the Type 2 column. This seems to indicate that around half of the Pokémon only have a single type associated with them

3.1.1.4. Summary Statistics#

We can use df.describe() to get some summary statistics for each of our columns in our dataset

pokedex.describe()

	Total	HP	Attack	Defense	SP. Atk.	SP. Def	Speed
count	1215.000000	1215.000000	1215.000000	1215.000000	1215.000000	1215.000000	1215.000000
mean	443.104527	71.244444	81.152263	75.007407	73.224691	72.441152	70.034568
std	121.193406	26.927819	32.037134	30.740999	32.757152	27.578188	30.161298
min	175.000000	1.000000	5.000000	5.000000	10.000000	20.000000	5.000000
25%	332.000000	52.000000	57.000000	52.000000	50.000000	51.000000	45.000000
50%	465.000000	70.000000	80.000000	70.000000	65.000000	70.000000	68.000000
75%	521.000000	85.000000	100.000000	91.000000	95.000000	90.000000	91.000000
max	1125.000000	255.000000	190.000000	250.000000	194.000000	250.000000	200.000000

Note that these summary statistics are only calculate for columns which have numeric values, where a categorical column like "Name" is not present

3.1.1.5. Listing the columns#

Suppose we had a dataset which had many many columns. If we wanted to see the different column names, we could do so using df.columns

pokedex.columns

Index(['Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'SP. Atk.', 'SP. Def', 'Speed'],
      dtype='object')

3.1.1.6. Sorting#

Sorting is a fundamental operation in data analysis that allows you to organize your data based on specific criteria. In Pandas, the .sort_values() function is used to sort a DataFrame by one or more columns.

As an example, let’s sort the pokedex dataset by values in the Total column. In order to ensure that the row with the highest "Total" is at the top, we have to specify the ascending = False argument. Otherwise, we would have the Pokémon with the lowest "Total" at the top

pokedex_sorted = pokedex.sort_values(by='Total', ascending=False) #highest total at the top
pokedex_sorted.head()

	Name	Type 1	Type 2	Total	HP	Attack	Defense	SP. Atk.	SP. Def	Speed
1058	Eternatus Eternamax	Poison	Dragon	1125	255	115	250	125	250	130
475	Rayquaza Mega Rayquaza	Dragon	Flying	780	105	180	100	180	100	115
201	Mewtwo Mega Mewtwo X	Psychic	Fighting	780	106	190	100	154	100	130
202	Mewtwo Mega Mewtwo Y	Psychic	NaN	780	106	150	70	194	120	140
473	Groudon Primal Groudon	Ground	Fire	770	100	180	160	150	90	90

pokedex_sorted = pokedex.sort_values(by='Total') #lowest totals at the top
pokedex_sorted.head()

	Name	Type 1	Type 2	Total	HP	Attack	Defense	SP. Atk.	SP. Def	Speed
902	Wishiwashi Solo Form	Water	NaN	175	45	20	20	25	25	40
245	Sunkern	Grass	NaN	180	30	30	30	30	30	30
985	Blipbug	Bug	NaN	180	25	20	20	25	45	45
1034	Snom	Ice	Bug	185	30	25	35	45	30	20
368	Azurill	Normal	Fairy	190	50	20	40	20	40	20

If we want to sort the Pokémon alphabetically by their names, simply sort by the "Name" column:

pokedex_sorted = pokedex.sort_values(by='Name')
pokedex_sorted.head()

	Name	Type 1	Type 2	Total	HP	Attack	Defense	SP. Atk.	SP. Def	Speed
561	Abomasnow	Grass	Ice	494	90	92	75	92	85	60
562	Abomasnow Mega Abomasnow	Grass	Ice	594	90	132	105	132	105	30
83	Abra	Psychic	NaN	310	25	20	15	105	55	90
441	Absol	Dark	NaN	465	65	130	60	75	60	75
442	Absol Mega Absol	Dark	NaN	565	65	150	60	115	60	115

Again, notice that when we use .sort_values without the ascending = False argument, our "Name" column is sorted by ascending order, where pokemon names that come earliest in the alphabet are at the top of the Dataframe

pokedex_sorted = pokedex.sort_values(by='Name', ascending=False)
pokedex_sorted.head()

	Name	Type 1	Type 2	Total	HP	Attack	Defense	SP. Atk.	SP. Def	Speed
865	Zygarde Complete Forme	Dragon	Ground	708	216	100	121	91	95	85
863	Zygarde 50% Forme	Dragon	Ground	600	108	100	121	81	95	95
864	Zygarde 10% Forme	Dragon	Ground	486	54	100	71	61	85	115
760	Zweilous	Dark	Dragon	420	72	85	70	65	70	58
54	Zubat	Poison	Flying	245	40	45	35	30	40	55

Now, we can actually see the Pokémon which have names that come latest in the alphabet