Pandas Reference

Created by Brandon Concepcion

Table of Contents

  1. Pandas Reference
    1. What is Pandas?
    2. Pandas
    3. GroupBy Aggregation Functions

What is Pandas?

Pandas is a powerful data manipulation and analysis library in Python. It provides two main data structures:

  • DataFrame: A 2-dimensional tabular data structure with labeled axes (rows and columns), similar to a spreadsheet in Excel or Google Sheets
  • Series: A 1-dimensional labeled array capable of holding any data type. You can think of these as the columns of a DataFrame
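
As a minimal sketch of how the two structures relate, the snippet below builds a small DataFrame and pulls one column out as a Series. The column names and values are made up for illustration.

```python
import pandas as pd

# A DataFrame built from a dict: each key becomes a column name.
# These names and values are hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "age": [36, 45, 41],
})

# Selecting a single column returns a Series.
ages = df["age"]

print(type(df).__name__)    # DataFrame
print(type(ages).__name__)  # Series
print(df.shape)             # (3, 2)
```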

Pandas

In the Name column of the table below, pd refers to the Pandas library (conventionally imported with import pandas as pd) and df refers to a generic DataFrame.

| Name | Description | Input | Output |
| --- | --- | --- | --- |
| pd.read_csv() | Reads a CSV file into a DataFrame. | string: path to the CSV file | DataFrame: contents of the CSV file |
| df.head(n) | Returns the first n rows of the DataFrame. Defaults to 5 if n is not provided. | (Optional) int: number of rows to return | DataFrame: first n rows |
| df.tail(n) | Returns the last n rows of the DataFrame. Defaults to 5 if n is not provided. | (Optional) int: number of rows to return | DataFrame: last n rows |
| df.info() | Prints a summary of the DataFrame, including column data types and non-null counts. | None | None: the summary is printed rather than returned |
| df.describe() | Generates descriptive statistics such as mean, min, and max for numerical columns. | None | DataFrame: summary statistics |
| df.shape | Attribute holding the number of rows and columns in the DataFrame. | None (attribute, no parentheses) | tuple: (rows, columns) |
| df.columns | Attribute holding the column names of the DataFrame. | None (attribute, no parentheses) | Index: column names |
| df.dtypes | Attribute holding the data type of each column in the DataFrame. | None (attribute, no parentheses) | Series: data type of each column |
| df.isnull() | Returns a DataFrame of Boolean values marking missing data. | None | DataFrame: True where a value is missing, False otherwise |
| df.dropna() | Removes rows that contain missing values. | None | DataFrame: rows with missing values removed |
| df.fillna(value) | Fills missing values with a specified value. | value: what to fill in for missing values | DataFrame: missing values replaced |
| df.sort_values(by) | Sorts the DataFrame by the values in one or more columns. | string or list: column name(s); (Optional) ascending=True/False | DataFrame: sorted DataFrame |
| df.groupby(by) | Groups the DataFrame by one or more columns so that aggregations can be applied per group. | string or list: column name(s) to group by | DataFrameGroupBy: object for aggregation |
| df.drop(labels) | Drops the specified rows or columns from the DataFrame. Rows are dropped by default; pass axis=1 or use columns=... to drop columns. | label or list of labels; (Optional) axis=0 for rows, axis=1 for columns | DataFrame: with rows/columns dropped |
| df.iloc[] | Selects data by integer position (row and column positions). | int, slice, or list of ints: row/column positions | DataFrame/Series: selected data (a single value if one row and one column are given) |
| df.loc[] | Selects data by label (row/column names). | label, slice, or list of labels | DataFrame/Series: selected data (a single value if one row and one column are given) |
| df.rename(columns={'old':'new'}) | Renames columns (or, with index=..., row labels) of the DataFrame. | dict: mapping of old names to new names | DataFrame: with renamed labels |
| df.apply(function) | Applies a function along an axis of the DataFrame: to each column by default, or to each row with axis=1. | function: the function to apply; (Optional) axis=0 or 1 | Series or DataFrame: with results |
| df.merge(df2, on=key) | Merges two DataFrames based on a common column (key). | DataFrame: another DataFrame; string: key column | DataFrame: merged DataFrame |
| pd.concat([df1, df2]) | Concatenates two or more DataFrames along a particular axis (rows or columns). | list: list of DataFrames; (Optional) axis=0 or 1 | DataFrame: concatenated DataFrame |
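
The following is a small sketch of how several of the functions above might be combined in practice. The CSV contents, column names, and the locations table are all hypothetical, and io.StringIO stands in for a real file so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical CSV contents; io.StringIO stands in for a file on disk.
# Grace's salary is deliberately left blank to produce a missing value.
csv_text = """name,dept,salary
Ada,Eng,120
Grace,Eng,
Alan,Ops,95
Edsger,Ops,105
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)           # (4, 3)
print(df.isnull().sum())  # number of missing values per column

# Two ways to handle the missing salary; both return new DataFrames.
dropped = df.dropna()              # removes the row with the missing value
filled = df.fillna({"salary": 0})  # keeps the row and fills salary with 0

# Sort by a column, largest value first.
by_salary = dropped.sort_values("salary", ascending=False)

# Label-based vs. position-based selection of the same cell.
name_by_label = df.loc[0, "name"]   # row label 0, column "name"
name_by_position = df.iloc[0, 0]    # row position 0, column position 0

# Rename one column and drop another.
renamed = df.rename(columns={"dept": "department"})
without_salary = df.drop(columns=["salary"])

# Merge with a second (made-up) DataFrame on the shared "dept" column.
locations = pd.DataFrame({"dept": ["Eng", "Ops"], "city": ["Boston", "Austin"]})
merged = df.merge(locations, on="dept")

# Stack two DataFrames on top of each other (row-wise).
stacked = pd.concat([df, df], ignore_index=True)
```

None of these calls modify df in place; each one returns a new object, which is why the results are assigned to new names.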

GroupBy Aggregation Functions

These aggregation functions are most commonly applied after the .groupby() method, in the form df.groupby(agg_col).agg_func(). The .groupby() method splits the data into groups based on one or more columns; the aggregation function is then applied to each group to compute the desired statistic. The Output column describes the value computed for each group; the overall result is a Series or DataFrame with one entry per group.
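
A minimal sketch of this pattern, using made-up department and salary data (the column names are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "dept": ["Eng", "Eng", "Ops", "Ops", "Ops"],
    "salary": [120, 100, 95, 105, 100],
})

# Group rows by "dept", select the "salary" column, then aggregate.
mean_salary = df.groupby("dept")["salary"].mean()
print(mean_salary)
# The result is a Series indexed by group label: Eng -> 110.0, Ops -> 100.0

# count() reports how many non-null salaries each group has.
headcount = df.groupby("dept")["salary"].count()
```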

| Name | Description | Output |
| --- | --- | --- |
| mean() | Calculates the mean (average) of values in a column or group. | float: mean of values |
| sum() | Sums all the values in a column or group. | float/int: sum of values |
| min() | Returns the minimum value in a column or group. | float/int: minimum value |
| max() | Returns the maximum value in a column or group. | float/int: maximum value |
| count() | Counts the number of non-null entries in a column or group. | int: count of non-null rows |
| median() | Calculates the median of values in a column or group. | float/int: median value |
| std() | Calculates the standard deviation of values in a column or group. | float: standard deviation |
| agg(func) | Applies an aggregation to each group. func can be a custom function (one the user would define), the name of a built-in aggregation, or a list of these. | Series or DataFrame: result of aggregation |
| nunique() | Counts the number of unique values in a column or group. | int: count of unique values |
| first() | Returns the first value in each group. | Series: first values |
| last() | Returns the last value in each group. | Series: last values |
| mode() | Returns the mode (most frequent value) in a column. This is a Series/DataFrame method rather than a built-in GroupBy method, so for groups it is applied through agg(), e.g. agg(pd.Series.mode). | Series: mode value(s) |
| prod() | Returns the product of values in a column or group. | float/int: product of values |
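
To illustrate agg(), here is a short sketch with a user-defined aggregation, a list of built-in aggregations, and a per-group mode. The data and the value_range helper are hypothetical. Because GroupBy objects do not provide a mode() method of their own in current pandas releases, the last line computes the mode by passing a small function to agg(), keeping only the first value when there is a tie.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "dept": ["Eng", "Eng", "Eng", "Ops", "Ops", "Ops"],
    "salary": [120, 100, 100, 95, 105, 95],
})

grouped = df.groupby("dept")["salary"]

# A user-defined aggregation (hypothetical helper): largest minus smallest value.
def value_range(values):
    return values.max() - values.min()

spread = grouped.agg(value_range)  # Eng -> 20, Ops -> 10

# Several built-in aggregations at once; the result is a DataFrame
# with one column per aggregation and one row per group.
summary = grouped.agg(["min", "max", "nunique"])

# Most frequent value per group. Series.mode() can return several values
# on a tie, so .iloc[0] keeps only the first (smallest) of them.
most_common = grouped.agg(lambda values: values.mode().iloc[0])
```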