Bias in CelebA ¶
Author - Vuong NGUYEN
- The CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each annotated with 40 binary attributes.
- Working within the fairness framework, we tackle the Male/Female binary classification problem and explore bias with respect to celebrity hair color. As we will demonstrate later, one of these hair-color groups is discriminated against in the sense of error equality: the classifier's error rates differ across groups. We compute some statistics about the dataset in the next section.
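Error equality means a classifier's error rate should be roughly the same across subgroups. As a minimal sketch of how this is checked, using synthetic labels and predictions (the group names and noise levels below are hypothetical, chosen only to illustrate a subgroup with an inflated error rate):

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: random labels over four hair-color groups
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "hair_color": rng.choice(["Black", "Blond", "Brown", "Gray"], size=n),
    "y_true": rng.integers(0, 2, size=n),
})

# Simulate a classifier that errs more often on one subgroup (30% vs 10% noise)
noise = np.where(df["hair_color"] == "Blond", 0.3, 0.1)
flip = rng.random(n) < noise
df["y_pred"] = np.where(flip, 1 - df["y_true"], df["y_true"])

# Per-group error rate: error equality holds if these values are close
error_rates = (df["y_pred"] != df["y_true"]).groupby(df["hair_color"]).mean()
print(error_rates)
```

Here the "Blond" group shows a clearly higher error rate by construction; on real CelebA models the same per-group comparison reveals whether a hair-color group is disadvantaged.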
Imports ¶
We import the necessary libraries and tools.
import numpy as np
import pandas as pd
from PIL import Image as PilImage
from IPython.display import HTML
from matplotlib import pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from utils_plot import plot_res_celeba_global, plot_best_res_celeba, plot_res_celeba_minority
Exploring CelebA ¶
Raw dataset ¶
Let's take a look at the distribution of men and women across the different hair colors present in the raw dataset.
df_attr = pd.read_csv('/home/vuong.nguyen/vuong/augmentare/experiments/Bias in CelebA/dataset/list_attr_celeba.csv')
df_attr.replace(-1, 0, inplace=True)  # attributes are encoded as -1/1; map to 0/1

# Keep only the sex attribute and the four hair-color attributes
data = df_attr[['image_id', 'Male', 'Black_Hair', 'Blond_Hair', 'Brown_Hair', 'Gray_Hair']]
data_male = data[data['Male'] == 1]
data_female = data[data['Male'] == 0]

# Count images per hair color within each group
male_counts = data_male.iloc[:, 2:].sum(axis=0).to_frame('Male')
female_counts = data_female.iloc[:, 2:].sum(axis=0).to_frame('Female')
result = pd.concat([male_counts, female_counts], axis=1)
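The imbalance is easier to read as the share of women within each hair-color group. A sketch on a frame shaped like `result` above (the counts below are hypothetical placeholders; the real values come from `list_attr_celeba.csv`):

```python
import pandas as pd

# Hypothetical counts with the same shape as the `result` frame built above
result = pd.DataFrame(
    {"Male": [70000, 8000, 35000, 7000], "Female": [50000, 60000, 50000, 1500]},
    index=["Black_Hair", "Blond_Hair", "Brown_Hair", "Gray_Hair"],
)

# Share of women within each hair-color group
share_female = (result["Female"] / result.sum(axis=1)).round(3)
print(share_female)
```

A share far from the dataset-wide proportion of women signals that hair color and sex are correlated, which is exactly the kind of imbalance a classifier can exploit.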
hair_colors = ['Black_Hair', 'Blond_Hair', 'Brown_Hair', 'Gray_Hair']
fig = go.Figure()
fig.add_trace(go.Bar(name='Male', x=hair_colors, y=result.Male))
fig.add_trace(go.Bar(name='Female', x=hair_colors, y=result.Female))
fig.update_layout(
    title="Number of men and women by hair color in the raw dataset",
    yaxis_title="Number",
    barmode="group",
)
#fig.show()
HTML(fig.to_html())