import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Read in the data
df = pd.read_csv('../resources/creativeclass200711.csv')
df.head()
df.count()
df[df['TotEmpEst'] == '.']
Remove rows with missing data
# Row labels of the rows where TotEmpEst is '.', as listed by the filter above
df = df.drop([87, 250, 331, 1654, 2922, 2950])
df['TotEmpEst'] = pd.to_numeric(df['TotEmpEst'])
df['CCEst'] = pd.to_numeric(df['CCEst'])
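The hard-coded row labels used for the drop only fit this exact file. A minimal sketch of a more general approach, assuming '.' (or any other unparseable string) is the only missing-value marker: coerce the columns to numeric and drop the rows that become NaN. The helper name `to_numeric_drop_missing` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def to_numeric_drop_missing(frame, columns):
    """Convert columns to numeric and drop rows where any of them fails to parse."""
    out = frame.copy()
    for col in columns:
        # errors='coerce' turns unparseable entries such as '.' into NaN
        out[col] = pd.to_numeric(out[col], errors='coerce')
    return out.dropna(subset=columns)

# Toy data standing in for the real file (placeholder values)
toy = pd.DataFrame({'TotEmpEst': ['100', '.', '300'],
                    'CCEst': ['10', '20', '.']})
clean = to_numeric_drop_missing(toy, ['TotEmpEst', 'CCEst'])
print(clean)
```

On the real data this would be called as `to_numeric_drop_missing(df, ['TotEmpEst', 'CCEst'])`. It is worth comparing the number of dropped rows against the filter output first, since coercion silently discards anything non-numeric.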
Total Employment ~ Creative Class Employment
fig = plt.figure(figsize=(10, 8))
axs = fig.add_subplot(111)
axs.scatter(df['TotEmpEst'], df['CCEst'], c = df['metro03'])
axs.set_xlabel('Total Employment')
axs.set_ylabel('Creative Class Employment')
Total Employment ~ CC Employment, zoomed in to exclude outliers
fig = plt.figure(figsize=(10, 8))
axs = fig.add_subplot(111)
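sc = axs.scatter(df['TotEmpEst'], df['CCEst'], c = df['metro03'], alpha = .3)
axs.set_xlabel('Total Employment')
axs.set_ylabel('Creative Class Employment')
axs.set_xlim([0, 2000000])
axs.set_ylim([0, 600000])
# A bare axs.legend() finds no labelled artists, so build the entries from the colour values
axs.legend(*sc.legend_elements(), title = 'metro03')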
Which are the outlier metros with really high employment? NYC, SF, Chicago?
df[['State', 'County', 'TotEmpEst']].sort_values(by = ['TotEmpEst'], ascending = False).head()
Where is NYC? Does the borough system split up employment more than I would have guessed?
df.loc[df['State'] == 'New York', ['State', 'County', 'TotEmpEst']].sort_values(by = ['TotEmpEst'], ascending = False).head()
They are not far from the top. Together, Brooklyn, Queens and Manhattan would be above Chicago.
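The combined-borough comparison can be checked by summing employment over the relevant counties. This is a minimal sketch: the helper `combined_employment` is hypothetical, the numbers are placeholders rather than values from the dataset, and the county labels ('Kings' for Brooklyn, 'New York' for Manhattan, 'Cook' for Chicago) are an assumption about how the file names them.

```python
import pandas as pd

def combined_employment(frame, state, counties):
    """Sum TotEmpEst over the named counties within one state."""
    mask = (frame['State'] == state) & frame['County'].isin(counties)
    return frame.loc[mask, 'TotEmpEst'].sum()

# Placeholder numbers, NOT taken from the dataset
toy = pd.DataFrame({
    'State': ['New York', 'New York', 'New York', 'Illinois'],
    'County': ['Kings', 'Queens', 'New York', 'Cook'],
    'TotEmpEst': [10, 20, 30, 50],
})
nyc_total = combined_employment(toy, 'New York', ['Kings', 'Queens', 'New York'])
chicago_total = combined_employment(toy, 'Illinois', ['Cook'])
print(nyc_total, chicago_total)
```

Filtering on state as well as county matters here because county names repeat across states.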
Total Employment ~ CC Share
df['CCShare'] = pd.to_numeric(df['CCShare'])
fig = plt.figure(figsize=(10, 10))
axs = fig.add_subplot(111)
axs.scatter(df['TotEmpEst'], df['CCShare'], c = df['metro03'])
axs.set_xlabel('Total Employment')
axs.set_ylabel('Creative Class Share')
Zoomed in to exclude outliers
fig = plt.figure(figsize=(10, 10))
axs = fig.add_subplot(111)
axs.scatter(df['TotEmpEst'], df['CCShare'], c = df['metro03'], alpha = .3)
axs.set_xlabel('Total Employment')
axs.set_ylabel('Creative Class Share')
axs.set_xlim([0, 200000])
It looks like the places with the lowest CC shares are low-employment rural areas.
There are a couple of low-employment rural areas with very high CC shares, and a number of rural areas with above-average CC shares.
There is a clear grouping of rural areas with low employment and a CC share between 0.10 and 0.18.
Metro areas have a wide range of total employment figures, but on average they also appear to have a higher CC share than rural areas.
df[['CCShare', 'metro03']].groupby('metro03').describe()
Metro areas have the higher mean CCShare and greater variation.
The lower quartile for metro areas is about 4 percentage points higher than for rural areas.
Both metro and rural areas have a higher mean than median, pointing to a long upper tail in CCShare.
The upper quartile for metro areas is about 8 percentage points higher than for rural areas, which is greater than the difference in medians and double the difference in lower quartiles.
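The quartile comparisons above can be computed directly instead of being read off the `describe()` table. This is a minimal sketch on toy data: the helper `quartile_gaps` is hypothetical, the values are placeholders, and it assumes `metro03` is coded 1 for metro and 0 for rural.

```python
import pandas as pd

def quartile_gaps(frame, value_col, group_col, high=1, low=0):
    """Return per-group quartiles and the gap (high group minus low group)."""
    quartiles = (frame.groupby(group_col)[value_col]
                      .quantile([0.25, 0.5, 0.75])
                      .unstack())  # rows: groups, columns: 0.25 / 0.5 / 0.75
    gaps = quartiles.loc[high] - quartiles.loc[low]
    return quartiles, gaps

# Placeholder values, NOT taken from the dataset
toy = pd.DataFrame({
    'metro03': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    'CCShare': [0.10, 0.12, 0.14, 0.16, 0.18, 0.15, 0.20, 0.25, 0.30, 0.35],
})
quartiles, gaps = quartile_gaps(toy, 'CCShare', 'metro03')
print(quartiles)
print(gaps)
```

On the notebook's data the call would be `quartile_gaps(df, 'CCShare', 'metro03')`. Multiplying the gaps by 100 gives percentage points if CCShare is stored as a fraction.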
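# .copy() makes df_keep independent of df, avoiding SettingWithCopyWarning when totEmpTh is added
df_keep = df[['FIPS', 'State', 'State abr.', 'County', 'metro03', 'TotEmpEst', 'CCShare']].copy()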
df_keep.columns = ['FIPS', 'state', 'stateAbr', 'county', 'metro', 'totEmp', 'ccShare']
df_keep['totEmp'].describe()
df_keep['totEmpTh'] = df_keep['totEmp'] / 1000
df_keep['totEmpTh'].describe()
df_keep.to_csv('../resources/ccShare.csv', index=False)
df.columns