A TALE OF TWO TESTS
An analysis of the SAT and ACT with a focus on socioeconomic inequality.
Project 1: SAT & ACT Analysis¶
Executive Summary¶
In the United States, the SAT and ACT are entrance exams used by colleges and universities to make admission decisions. These exams help provide colleges with a common data point that can be used to compare all applicants. While many colleges do not prefer one test over the other, and while both tests share a number of similarities in terms of subjects tested, participation rates for each test tend vary wildly across the country.
In 2016, the College Board implemented major changes to the SAT, in an attempt to counter the growing popularity of the ACT. By looking at data from the subsequent years of 2017 and 2018, we can begin to measure the impact of these changes, while coming up with new strategies that may help the College Board increase their participation rates moving forward.
Problem Statement: which state should the College Board target to further raise SAT participation rates?
Libraries¶
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
2017 Data Import and Cleaning¶
# Load sat_2017 & act_2017 datasets
sat_2017 = pd.read_csv('./data/sat_2017.csv')
act_2017 = pd.read_csv('./data/act_2017.csv')
sat_2017.head()
act_2017.head()
sat_2017.describe()
act_2017.describe()
The SAT dataset contains the average SAT scores of students by state. All 51 states (Washington DC became a state in 2020) of the US are represented in the data, excluding non-state territories such as Puerto Rico and Guam. This particular dataset includes the two compulsory sections of the SAT, which are Math section and the Evidence Based Reading and Writing section. The dataset also has a record of average total scores by state, and participation rate by state.
The ACT dataset similarly consists of the average ACT scores of students by state, includes the average English, Math, Reading, Science and Composite of students. Unlike the SAT dataset, the ACT dataset includes a row with national averages. There are four columns indicating English, Math, Reading and Science scores. The ACT data also includes participation rates by state.
It's worth noting that the average total scores for both tests are derived from subject scores listed within the dataset, and do not include additional subjects outside the dataset.
Overall, the data looks complete, given that the number of rows for the ACT and SAT match the number of states in the US (the ACT dataset has an extra row that we will easily do away with later). While the data may be complete, the multiple outliers in the data (i.e. abnormally low minimum scores) suggest that further cleaning is needed.
The minimum possible value for each SAT subject is 200, with a maximum of 800. Correspondingly, SAT total scores can range from 400 to 1600.
Each ACT subject has a minimum score of 1, and a maximum score of 36. The total score for a student is calculated by adding these subjects together, and dividing by the total number of subjects (4).
With this in mind, there are a few areas that we need to look into --
SAT 2017 dataset:
- Abnormally low average Math Score of 52 when minimum is 200
- The datatype of the participation rates column is a string
object
instead of anint
orfloat
ACT 2017 dataset:
- Abnormally low average score of 2.3
- Extra row within data
- Composite score column is a string
object
when it should be afloat
-- this is due to the '20.2x' string in Wyoming column
# Fixing composite score for Wyoming
act_2017 = act_2017.replace('20.2x', 20.2)
# Fixing SAT Math score for Maryland
sat_2017.loc[20, 'Math'] = 524
# Fixing ACT Science score for Maryland
act_2017.loc[21, 'Science'] = 23.2
# Checking data types for ACT 2017 dataset
act_2017.dtypes
# Checking data types for SAT 2017 dataset
sat_2017.dtypes
Composite score is still an object
-- this means that we need to convert the data in the columns to int
or float
numbers.
# We'll write a function to change our participation column to integers
def participation_to_int(df):
for col in df.columns:
if 'percentage' in col.lower() or 'participation' in col.lower():
df[col] = df[col].replace('%','', regex=True).astype('int')
participation_to_int(act_2017)
participation_to_int(sat_2017)
# Replace object values in composite column with float values
act_2017['Composite'] = list(map(lambda x: float(x), act_2017['Composite']))
# Checking column types
type(act_2017['Composite'].iloc[0]), type(act_2017['Participation'].iloc[0]), type(sat_2017['Participation'].iloc[0])
We'll also want to rename our columns to make things a bit easier for us later on.
def update_cols(df,prefix):
# Rename columns using list comprehension
df.columns = [x.lower().replace('evidence-based reading and writing','ebrw').replace('participation','rate') \
.replace('average', '').replace('score', '').replace('composite', 'total') \
.replace(' ','').replace('percentage', 'rate') for x in df.columns]
# Add prefix
df.columns = [prefix + x if x !='state' else x for x in df.columns]
return df
update_cols(act_2017, 'act_2017_');
update_cols(sat_2017, 'sat_2017_');
# Checking column names
act_2017.columns
# Checking column names
sat_2017.columns
Data Dictionary¶
Feature | Type | Dataset | Description |
---|---|---|---|
state | string | SAT 2017 | State from which ACT and SAT data was collected |
sat_2017_rate | integer | SAT 2017 | rate rate as a percentage out of 100 |
sat_2017_ebrw | integer | SAT 2017 | Average score for reading and writing (Min: 200, Max: 800) |
sat_2017_math | integer | SAT 2017 | Average math score (Min: 200, Max: 800) |
sat_2017_total | integer | SAT 2017 | Average total score (Min: 400, Max: 1600) |
act_2017_rate | integer | ACT 2017 | rate rate as a percentage out of 100 |
act_2017_english | float | ACT 2017 | Average english score (Min: 1, Max:36) |
act_2017_math | float | ACT 2017 | Average math score (Min: 1, Max:36) |
act_2017_reading | float | ACT 2017 | Average reading score (Min: 1, Max:36) |
act_2017_science | float | ACT 2017 | Average science score (Min: 1, Max:36) |
act_2017_total | float | ACT 2017 | Average total score derived as a composite of Math, Reading, English and Science scores (Min: 1, Max:36). |
sat_2018_rate | integer | SAT 2018 | rate rate as a percentage out of 100 |
sat_2018_ebrw | integer | SAT 2018 | Average score for reading and writing (Min: 200, Max: 800) |
sat_2018_math | integer | SAT 2018 | Average math score (Min: 200, Max: 800) |
sat_2018_total | integer | SAT 2018 | Average total score (Min: 400, Max: 1600) |
act_2018_rate | integer | ACT 2018 | rate rate as a percentage out of 100 |
act_2018_english | float | ACT 2018 | Average english score (Min: 1, Max:36) |
act_2018_math | float | ACT 2018 | Average math score (Min: 1, Max:36) |
act_2018_reading | float | ACT 2018 | Average reading score (Min: 1, Max:36) |
act_2018_science | float | ACT 2018 | Average science score (Min: 1, Max:36) |
act_2018_total | float | ACT 2018 | Average total score derived as a composite of Math, Reading, English and Science scores (Min: 1, Max:36). |
gdp_2017 | integer | US BEA | Real estimates of Gross domestic product (GDP) by state measured in chained 2009 dollars |
gdp_2018 | integer | US BEA | Real estimates of Gross domestic product (GDP) by state measured in chained 2009 dollars |
cat_gdp_2017 | object | US BEA | Real estimates of Gross domestic product (GDP) by state measured in chained 2009 dollars, sorted by category (<40000, 40000-50000, 50000-60000, >60000) |
cat_gdp_2018 | object | US BEA | Real estimates of Gross domestic product (GDP) by state measured in chained 2009 dollars, sorted by category (<40000, 40000-50000, 50000-60000, >60000) |
classification | object | Gallup | Political affliation by state based on annual state averages of party affiliation from Gallup Daily tracking |
The ACT dataset has an extra row for national averages which we need to remove.
# Dropping national row
act_2017.drop(act_2017[act_2017['state'] == 'National'].index, inplace=True)
act_2017.reset_index(drop=True, inplace=True)
Merge dataframes¶
# Set index as state in preparation for merging
act_2017.set_index('state', inplace=True)
sat_2017.set_index('state', inplace=True);
# Joining ACT and SAT 2017 datasets
combined_2017 = act_2017.join(sat_2017)
combined_2017.head();
combined_2017.to_csv('./data/combined_2017.csv', index=False)
2018 Data Import and Cleaning¶
# Import 2018 SAT and ACT data as dataframes
sat_2018 = pd.read_csv('./data/sat_2018.csv')
act_2018 = pd.read_csv('./data/act_2018_updated.csv')
# Set index as state to facillitate merging and access to data
act_2018.set_index('State', inplace=True)
sat_2018.set_index('State', inplace=True)
Problems with the Raw Data¶
SAT 2018 Dataset Errors:
- Florida, Indiana (participation rate and scores)
- Hawaii, Montana, South Dakota, Utah, Washington, Wyoming (scores)
ACT 2018 Dataset Errors:
- Delaware (total score)
- Georgia (reading / science score)
# Function to help us rectify errors in SAT dataset
def sat_set_state(df, state, participation, ebrw, math, total):
if participation is not None:
df.at[state,'Participation'] = participation
if ebrw is not None:
df.at[state,'Evidence-Based Reading and Writing'] = ebrw
if math is not None:
df.at[state,'Math'] = math
if total is not None:
df.at[state,'Total'] = total
return df
# Fixing SAT values
sat_2018 = sat_set_state(sat_2018, 'Florida', 97, 522, 493, 1014)
sat_2018 = sat_set_state(sat_2018, 'Hawaii', None, 550, 549, 1099)
sat_2018 = sat_set_state(sat_2018, 'Indiana', 67, 546, 539, 1086)
sat_2018 = sat_set_state(sat_2018, 'Montana', None, None, None, 1229)
sat_2018 = sat_set_state(sat_2018, 'South Dakota', None, None, None, 1241)
sat_2018 = sat_set_state(sat_2018, 'Utah', None, 618, 612, 1230)
sat_2018 = sat_set_state(sat_2018, 'Washington', None, None, None, 1082)
sat_2018 = sat_set_state(sat_2018, 'Wyoming', None, None, 635, None)
# Fixing ACT values
act_2018.at['Delaware', 'Average Composite Score'] = 23.8
act_2018.at['Georgia', 'Average Reading Score'] = 22.1
act_2018.at['Georgia', 'Average Science Score'] = 21.2
# Changing participation rates to int
participation_to_int(act_2018)
participation_to_int(sat_2018)
update_cols(act_2018, 'act_2018_')
update_cols(sat_2018, 'sat_2018_');
# Standardizing column name
act_2018.rename(columns={'act_2018_rateofstudentstested': 'act_2018_rate'}, inplace=True)
# Combine 2018 data into a dataframe
combined_2018 = act_2018.join(sat_2018)
combined_2018.head()
# Checking our dtypes
combined_2018.dtypes
We'll combine our 2017 and 2018 data into a single dataframe here by joining on state names.
We'll be using this combined dataframe for the remainder of the project.
final = combined_2017.join(combined_2018)
final.info()
final.to_csv('./data/final.csv')
Exploratory Data Analysis¶
Summary Statistics¶
final.describe().transpose()
Exposing Selection Bias¶
# Function to display top 5 and bottom 5 of dataframe
def display_n(df,n):
with pd.option_context('display.max_rows',n*2):
display(df)
# Source: https://stackoverflow.com/questions/42504984/python-pandas-select-both-head-and-tail/53311747#53311747
# Lowest and highest SAT scores for 2017
display_n(final[['sat_2017_total', 'sat_2017_rate']].sort_values('sat_2017_total'),5)
# Lowest and highest SAT scores for 2018
display_n(final[['sat_2018_total', 'sat_2018_rate']].sort_values('sat_2018_total'),5)
# Lowest and highest ACT scores in 2017
display_n(final[['act_2017_total', 'act_2017_rate']].sort_values('act_2017_total'),5)
# Lowest and highest ACT scores in 2018
display_n(final[['act_2018_total', 'act_2018_rate']].sort_values('act_2018_total'), 5)
# 4 states with SAT participation rates of 100%
sat_2017[sat_2017['sat_2017_rate'] == 100].index
# 17 states with SAT participation rates of 100%
act_2017[act_2017['act_2017_rate'] == 100].index, (len(act_2017[act_2017['act_2017_rate'] == 100].index))
When looking at the SAT, we can see that only 4 states had participation rates of 100% in 2017. This increased to 5 states in 2018. In comparison, the ACT had 17 states with participation rates of 100% in 2017, with this number remaining unchanged in 2018. We can infer from this that the ACT has a larger baseline following than the SAT.
Delving further into the data, we can also observe a noticeable trend where mid-western/rural states like Kansas and Iowa seem to do better on the SAT as compared to urban states like Washington DC or coastal states like Delware. The opposite pattern seems true, where urban states like New York and DC do better than rural states like South Carolina on the ACT. However, if we look closer, this is actually a case of selection bias.
Students taking the SAT in states with extremely ACT dominant states (like Kansas or Iowa) are likely planning to apply to out of state universities that require the SAT. The ability to afford both the SAT and relocate to a geographically distant location suggests that these students are likely to be of an above average socioeconomic status, which makes them unrepresentative of the average population of students taking the SAT.
# Create function to determine total percentage change over years
def abs_change(row):
row = row.fillna(0)
row = np.abs(row)
row['total_pct_change'] = row.sum()
return row
# Checking for rate change between years for SAT
sat_2017_rate = final[final['sat_2017_rate'] >= 99][['sat_2017_rate', 'sat_2018_rate']]
sat_2018_rate = final[final['sat_2018_rate'] >= 99][['sat_2017_rate', 'sat_2018_rate']]
# Concatenate dataframes with SAT participation percentages by state
sat_rate_df = pd.concat([sat_2017_rate, sat_2018_rate]).drop_duplicates()
# Create new column with percentage changes for SAT
sat_rate_df['total_pct_change'] = sat_rate_df.diff(axis=1).apply(abs_change, axis=1)['total_pct_change']
sat_rate_df[sat_rate_df['total_pct_change'] != 0.0].sort_values('total_pct_change', ascending=False)
Colorado had a massive jump in their SAT participation rate from 11% in 2017 to 100% in 2018. Illinois also had a similar jump, from 9% in 2017 to 99% in 2018. This was due to changes in state educational policy, which made the SAT compulsory in both Colorado and Illinois.
DC's SAT participation rate dropped from 100% to 92% in 2018, while Idaho's SAT participation rate increased from 93% to 100% in 2018.
# Checking for rate change between years for ACT
act_2017_rate = final[final['act_2017_rate'] >= 99][['act_2017_rate', 'act_2018_rate']]
act_2018_rate = final[final['act_2018_rate'] >= 99][['act_2017_rate', 'act_2018_rate']]
# Concatenate dataframes with ACT participation percentages by state
act_rate_df = pd.concat([act_2017_rate, act_2018_rate]).drop_duplicates()
# Create new column with percentage changes for ACT
act_rate_df['total_pct_change'] = act_rate_df.diff(axis=1).apply(abs_change, axis=1)['total_pct_change']
act_rate_df[act_rate_df['total_pct_change'] != 0.0].sort_values('total_pct_change', ascending=False)
Colorado's ACT participation rate dropped massively from 2017 to 2018, while Nebraska and Ohio both increased their ACT participation rates from 2017 to 2018. In Nebraska, the ACT was made mandatory, leading to an increase in participation rate. In Ohio, standardized testing through the SAT or ACT also became mandatory -- despite being given a choice of which test to adopt, the majority of state districts (95%) chose the ACT over the SAT.
# Checking for cases where participation rates is greater than 50% for both tests in 2017
final[(final['sat_2017_rate'] > 50) & (final['act_2017_rate'] > 50)] \
[['sat_2017_rate', 'act_2017_rate']] \
.sort_values('sat_2017_rate')
# Checking for cases where participation rates is greater than 50% for both tests in 2018
final[(final['sat_2018_rate'] > 50) & (final['act_2018_rate'] > 50)] \
[['sat_2018_rate', 'act_2018_rate']] \
.sort_values(by='sat_2018_rate')
Flordia, Hawaii and Georgia had greater than 50% participation rates for both tests in 2017 and 2018. Hawaii is interesting as the state is rather unique in terms of its location and demographics. The majority of Hawaiians live in urban areas, which seems to be contrary to the trend of urban/coastal states favoring the SAT test.
Both North and South Carolina both have more than 50% participation for the ACT and SAT tests. Interestingly, North Carolina, South Carolina, Florida and Georgia are in close geographic proximity, and are all officially recognised as part of the Southern US region. These are generally considered to be highly conservative, republican states. We'll take a look later to see if political affiliation truly has anything to do with participation rates.
Data Visualization¶
# Create mask to prevent duplication of variables
mask = np.zeros(final.corr().shape, dtype=bool)
mask[np.triu_indices(len(mask))] = True
plt.figure(figsize=(16,12))
plt.title('Heatmap between numeric features of ACT and SAT in 2017 and 2018', fontsize=14)
sns.heatmap(final.corr(), cmap='RdBu', annot=True, mask=mask)
There are several patterns of interest here --
Firstly, we can see that ACT participation has a strong negative correlation with ACT subject scores and overall ACT scores. This means that states with higher ACT participation tend to have lower ACT scores, and vice versa for states with low ACT participation rates. This is mirrored in SAT participation rates, where there is an even stronger negative correlation between participation rates and SAT scores.
Secondly, subject and test scores from 2017 are strongly correlated with the same subject and test scores in 2018. This means that states that did well in the previous year, are likely to do well in the following year. This is unsurprising as states are unlikely to dramatically go up or down in test scores over a single year, due to policies and institutions (e.g. state education departments) that work to maintain consistent educational results year-on-year. Barring dramatic policy changes like switching from one test to another, states are likely to produce to the same results.
The test scores / participation rates for each are also negatively correlated with the other test. This shows that it's pretty rare for a student to take both tests.
def subplot_histograms(dataframe, list_of_columns, list_of_titles, list_of_xlabels, sharex=False, kde=False, figsize=(14,7)):
nrows = int(np.ceil(len(list_of_columns)/2)) # Makes sure you have enough rows
fig, ax = plt.subplots(nrows=nrows, ncols=2, figsize=figsize, sharex=sharex, sharey=True)
ax = ax.ravel() # Ravel turns a matrix into a vector, which is easier to iterate
for i, column in enumerate(list_of_columns):
# Calculate mean
mean = np.mean(dataframe[list_of_columns[i]])
ax[i].axvline(mean, ls='--', color='crimson')
# Set color of graph according to test type
if 'act_' in column:
color = '#E1812C'
elif 'sat_' in column:
color = '#3274A1'
sns.histplot(data=dataframe, x=column, color=color, ax=ax[i], kde=kde, bins=8)
ax[i].set_title(list_of_titles[i] + f' (Mean: {mean:.1f})')
ax[i].set_xlabel(list_of_xlabels[i])
# To prevent label duplication
if i % 2 == 0:
ax[i].set_ylabel('Number of States')
fig.tight_layout(h_pad = 1.5, w_pad = 1.5)
subplot_histograms(final, ['sat_2017_rate', 'sat_2018_rate', 'act_2017_rate', 'act_2018_rate'],
['SAT 2017 Participation', 'SAT 2018 Participation', 'ACT 2017 Participation', 'ACT 2018 Participation'],
['Participation (%)', 'Participation (%)',
'Participation (%)', 'Participation (%)'], sharex=True, kde=True)
Overall, we can see that the ACT has a larger baseline following than the SAT. The SAT has a large number of states with an extremely low participation rate, while the ACT has a large number of states with a extremely high participation rate.
In 2018, SAT participation increased by 5.9%, while ACT participation decreased by 3.7% on average. This suggests that some states are beginning to adopt the SAT test over the ACT test, with some states moving away from standardized testing altogether. The number of states with a 90% - 100% participation for the SAT increased in 2018, while ACT participation rates for states in the same range remained consistent.
subplot_histograms(final, ['sat_2017_math', 'sat_2018_math', 'act_2017_math', 'act_2018_math'],
['SAT 2017 Math Scores', 'SAT 2018 Math Scores', 'ACT 2017 Math Scores', 'ACT 2018 Math Scores'],
['Average Score (min: 400, max: 800)', 'Average Score (min: 400, max: 800)',
'Average Score (min: 1, max: 36)', 'Average Score (min: 1, max: 36)'], kde=True)
Both tests have Math scores that do not appear to be normally distributed. In particular, we can observe that most states tend to have SAT Math scores in the range of 525-550 points. The ACT seems to less bimodal and slightly closer to having a normal distribution.
subplot_histograms(final,
['sat_2017_ebrw', 'sat_2018_ebrw',
'act_2017_english', 'act_2017_reading', 'act_2018_english', 'act_2018_reading'],
['SAT 2017 Reading/Writing Scores', 'SAT 2018 Reading/Writing Scores',
'ACT 2017 English Scores', 'ACT 2017 Reading Scores',
'ACT 2018 English Scores', 'ACT 2018 Reading Scores'],
['Average Score (min: 400, max: 800)', 'Average Score (min: 400, max: 800)',
'Average Score (min: 1, max: 36)', 'Average Score (min: 1, max: 36)',
'Average Score (min: 1, max: 36)', 'Average Score (min: 1, max: 36)'], kde=True)
SAT average scores for Reading/Writing dropped from 2017 to 2018 by 1.8 points, while ACT English scores increased by 0.1 points in 2018.
ACT reading scores from 2017 and 2018 are relatively seem to be more normally distributed than SAT Reading/Writing scores. ACT English scores from 2017 and 2018 seem to be more bimodal in comparison to the rest of the ACT subjects.
subplot_histograms(final,
['act_2017_science', 'act_2018_science'],
['ACT 2017 Science Scores', 'ACT 2018 Science Scores'],
['Average Score (min: 1, max: 36)', 'Average Score (min: 1, max: 36)'], kde=True, figsize=(14,4))
Similar to the other ACT subjects, the distribution seems to be somewhat ambigious -- it seems to be somewhere in between normal and bimodal.
subplot_histograms(final,
['sat_2017_total', 'sat_2018_total', 'act_2017_total', 'act_2018_total'],
['SAT 2017 Total Scores', 'SAT 2018 Total Scores',
'ACT 2017 Total Scores', 'ACT 2018 Total Scores'],
['Average Score (min: 800, max: 1600)', 'Average Score (min: 800, max: 1600)',
'Average Score (min: 1, max: 36)', 'Average Score (min: 1, max: 36)'], kde=True)
The average total score for the SAT dropped by 0.8 points in 2018. On the other hand, ACT English and Reading scores increased by 0.9 and 1.0 respectively.
We can also easily observe that SAT total scores appear to be more bimodally distributed, and can be divided into two distinct groups of states i.e. high-performing and low-performing states. This pattern is also apparent within the ACT data, but to a lesser degree.
# Function to plot multiple scatter plots
def multi_plot_scatter(df, list_of_x, list_of_y, list_of_titles, xlabels, ylabels, sharex = False, sharey = False):
if len(list_of_x) == 1:
ncols = 1
else:
ncols = 2
nrows = int(np.ceil(len(list_of_x)/2)) # Makes sure you have enough rows
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(14,7), sharex = sharex, sharey = sharey)
ax = ax.ravel() # Ravel turns a matrix into a vector, which is easier to iterate
for i, column in enumerate(list_of_x):
sns_fig = sns.regplot(data=df, x=list_of_x[i], y=list_of_y[i], ax=ax[i]).set_title(list_of_titles[i])
ax[i].set_xlabel(xlabels[i])
ax[i].set_ylabel(ylabels[i])
plt.tight_layout()
# Function for plotting single scatter plots
def plot_scatter(df, x, y, title):
fig, ax = plt.subplots(figsize=(10,5))
plt.title(title)
sns.regplot(data=df, x=x, y=y)
plt.tight_layout()
plot_scatter(final, 'sat_2017_math', 'act_2017_math', 'SAT vs ACT Math Average Score for 2017')
plt.xlabel('SAT 2017 Math Average Score')
plt.ylabel('ACT 2017 Math Average Score');
multi_plot_scatter(final, ['sat_2017_ebrw', 'sat_2017_ebrw'],
['act_2017_reading', 'act_2017_english'],
['SAT Reading/Writing vs ACT Reading Average Score (2017)',
'SAT Reading/Writing vs ACT English Average Score (2017)'],
['SAT 2017 Reading and Writing Score', 'SAT 2017 Reading and Writing Score'],
['ACT 2017 Reading Score', 'ACT 2017 English Score'],
sharex = True, sharey = True)
plot_scatter(final, 'sat_2017_total', 'act_2017_total', 'SAT vs ACT Total Average Score for 2017')
plt.xlabel('SAT 2017 Total Average Score')
plt.ylabel('ACT 2017 Total Average Score');
Generally, ACT and SAT scores are negatively correlated to each other. This suggests that states tend to focus on one test over the other, meaning that states are likely to have either high SAT/ low ACT or low SAT/ high SAT average scores. In other words, states with low SAT scores tend to have high ACT scores and vice versa.
multi_plot_scatter(final, ['sat_2017_total', 'act_2017_total'], ['sat_2018_total', 'act_2018_total'],
['SAT Total Average Score in 2017 vs 2018', 'ACT Total Average Score in 2017 vs 2018'],
['SAT 2017 Total Average Score', 'ACT 2017 Composite Average Score'],
['SAT 2018 Total Average Score', 'ACT 2018 Composite Average Score'])
ACT and SAT test results year by year have a very strong correlation. This means that states that did well in the last year are likely to do well in the next year. This correlation is much stronger for the ACT test, meaning that ACT scores tend to remain static from one year to the next. This could actually be an incentive for high-scoring states to stay with the ACT, given that scores seem to less from one year to the next as compared to the SAT.
Plot and interpret boxplots¶
def plot_box(dataframe, list_of_columns, title, list_of_yticklabels, xlabel, test_type=False, palette='tab10', orient='h'):
plt.figure(figsize=(14,7))
plt.title(title)
ax = sns.boxplot(data=dataframe[list_of_columns],palette=palette, orient=orient)
#For setting colors according to standardize colors according to test type (SAT = blue, ACT = orange)
for i, col in enumerate(list_of_columns):
if test_type:
box = ax.artists[i]
if 'act_' in col:
box.set_facecolor('#E8A36C')
elif 'sat_' in col:
box.set_facecolor('#9DC2D5')
ax.set_yticklabels(list_of_yticklabels)
ax.set_xlabel(xlabel)
plot_box(final, ['sat_2017_rate', 'act_2017_rate', 'sat_2018_rate', 'act_2018_rate'],
'SAT and ACT Participation Rates in 2017 and 2018',
['SAT 2017 Participation', 'ACT 2017 Participation', 'SAT 2018 Participation', 'ACT 2018 Participation'],
'Participation Rate', test_type=True)
The ACT test has a higher median than the SAT test, indicating that the ACT participation rates are generally higher than SAT participation rates. The SAT participation rates are also skewed right, suggesting that there is strong support for the ACT test throughout the US. SAT participation rates are generally skewed left.
However, from 2017 to 2018, we can observe that the median for SAT participation increased, while the median for ACT participation dropped slightly. Variance (as measured by the IQR) also seems to have risen, leading to a decrease in skewness. This could suggest that support for the SAT test is beginning to rise throughout the US.
plot_box(final, ['sat_2017_math', 'sat_2018_math', 'sat_2017_ebrw', 'sat_2018_ebrw'],
'SAT Scores in 2017 and 2018',
['SAT 2017 Math', 'SAT 2018 Math', 'SAT 2017 Reading and Writing', 'SAT 2018 Reading and Writing'],
'Average Score', palette='Blues')
The median for SAT test scores decreased for both Math and Reading and Writing in 2018. This could be due to increased SAT participation rates as previously mentioned. Another contributing factor to this could be states that switched to the SAT from the ACT, like Colorado and Illinois which had below total average scores of 1025 and 1029 respectively.
plot_box(final, ['sat_2017_total', 'sat_2018_total'],
'SAT Total Scores in 2017 and 2018',
['SAT 2017 Average Total Score', 'SAT 2018 Average Total Score'], 'Average Score', palette='Blues')
The median for SAT average total scores also decreased from 1107 to 1099 -- this could be due to both increased participation rates and reports of more students than ever taking the SAT, with many of these students from low-income families or from families “with no history of college attendance". The introduction of SAT test fee waivers for low income students also occurred in 2018.
plot_box(final, ['act_2017_math', 'act_2018_math', 'act_2017_reading',
'act_2018_reading', 'act_2017_english', 'act_2018_english'],
'ACT Subject Scores in 2017 and 2018',
['ACT 2017 Math', 'ACT 2018 Math', 'ACT 2017 Reading',
'ACT 2018 Reading', 'ACT 2017 English', 'ACT 2018 English'], 'Average Score', palette='Oranges')
plot_box(final, ['act_2017_total', 'act_2018_total'],
'ACT Composite Scores in 2017 and 2018', ['ACT 2017 Total', 'ACT 2018 Total'], 'Average Score', palette='Oranges')
The median for all ACT subjects decreased in 2018. While we saw earlier that the mean for ACT scores increased the across the board, this could mean that the ACT is beginning to become increasingly bimodally distributed.
Statistical Equity vs Socioeconomic Equity¶
To further understand the underlying differences between high-performing and low-performing states, I decided to look at additional categories including political affiliation by state and GDP per Capita by state.
I used multiple datasets for this, including:
- 2017 US Party Affiliation by State (Gallup)
- 2018 US Party Affilation by State (Gallup)
- 2017 & 2018 GDP per Capita (US Bureau of Economic Analysis)
The real estimates of Gross domestic product (GDP) by state were measured in chained 2009 dollars. Real GDP by state is an inflation-adjusted measure of each state's gross product that is based on national prices for the goods and services produced within the state.
# Load datasets for GDP and political classification
gallup_2017 = pd.read_csv('./data/Gallup_2017.csv', index_col='state')
gallup_2018 = pd.read_csv('./data/Gallup_2018.csv', index_col='state')
us_gdp = pd.read_csv('./data/US_GDP.csv', index_col='state')
us_gdp.replace(',', '', regex=True, inplace=True)
# Simplify classification to either Democrat, Republican or Swing
gallup_2017 = gallup_2017.replace(['Lean Republican', 'Solid Republican'], 'Republican').replace \
(['Lean Democrat', 'Solid Democrat'], 'Democrat').replace('Competitive', 'Swing')
gallup_2018 = gallup_2018.replace(['Lean Republican', 'Strong Republican'], 'Republican').replace \
(['Lean Democratic', 'Strong Democratic'], 'Democrat').replace('Competitive', 'Swing')
# Change GDP numbers to numeric
us_gdp['gdp_2017'] = pd.to_numeric(us_gdp['gdp_2017'])
us_gdp['gdp_2018'] = pd.to_numeric(us_gdp['gdp_2018'])
# Combine GDP and political classification with 2017 dataset
combined_2017 = combined_2017.join(gallup_2017['classification'])
combined_2017 = combined_2017.join(us_gdp['gdp_2017'])
# Combine GDP and political classification with 2018 dataset
combined_2018 = combined_2018.join(gallup_2018['classification'])
combined_2018 = combined_2018.join(us_gdp['gdp_2018'])
categories = ['<$40,000', '\$40,000-$50,000', '\$50,000-$60,000', '>$60,000']
# Create categories for US GDP
for col in us_gdp:
cat = cat_gdp_2017 = cat_gdp_2018 = []
for value in us_gdp[col]:
if value < 40000:
cat.append('<$40,000')
elif value >= 40000 and value <= 50000:
cat.append('\$40,000-$50,000')
elif value >= 50000 and value <= 60000:
cat.append('\$50,000-$60,000')
elif value > 60000:
cat.append('>$60,000')
if '2017' in col:
cat = cat_gdp_2017
elif '2018' in col:
cat = cat_gdp_2018
# Convert GDP data to categorical data
us_gdp['cat_gdp_2017'] = pd.Categorical(cat_gdp_2017, categories)
us_gdp['cat_gdp_2018'] = pd.Categorical(cat_gdp_2018, categories)
# Add in categories for US GDP
combined_2017 = combined_2017.join(us_gdp['cat_gdp_2017'])
combined_2018 = combined_2018.join(us_gdp['cat_gdp_2018'])
We can now revisit some of our scatterplot graphs made earlier, and separate our datapoints according to GDP. In the plots below, I separated the graph into quadrants using the mean for SAT and ACT total scores.
# Function for plotting custom seaborn scatter plots
def sns_scatter(dataframe, list_of_x, list_of_y, list_of_titles, list_of_xlabels, list_of_ylabels, hue, sharex=False):
nrows = int(np.ceil(len(list_of_x)/2))
fig, ax = plt.subplots(nrows=nrows, ncols=2, figsize=(14,6), sharex=sharex, sharey=True)
ax = ax.ravel()
for i, _ in enumerate(list_of_x):
if '_2017' in list_of_x[i]:
dataframe = combined_2017
elif '_2018' in list_of_x[i]:
dataframe = combined_2018
sns.scatterplot(data=dataframe, x=list_of_x[i], y=list_of_y[i], hue=hue[i], ax=ax[i], palette='magma', s=100)
if i != len(list_of_x) - 1:
ax[i].get_legend().remove()
ax[i].set_title(list_of_titles[i], fontsize=15)
ax[i].set_xlabel(list_of_xlabels[i] + f' (Mean: {dataframe[list_of_x[i]].mean():.1f})')
ax[i].set_ylabel(list_of_ylabels[i] + f' (Mean: {dataframe[list_of_y[i]].mean():.1f})')
# Mean for ACT test in 2017 and 2018
ax[i].axvline(dataframe[list_of_x[i]].mean(), color='black')
# Mean for SAT test in 2017 and 2018
ax[i].axhline(dataframe[list_of_y[i]].mean(), color='black')
plt.legend(title='State GDP per Capita \n (Chained 2009 USD)')
plt.tight_layout()
sns_scatter(final, ['act_2017_total', 'act_2018_total'], ['sat_2017_total', 'sat_2018_total'],
['ACT vs SAT Total Average Score in 2017', 'ACT vs SAT Total Average Score in 2018'],
['ACT 2017 Scores', 'ACT 2018 Scores'],
['SAT 2017 Scores', 'SAT 2017 Scores'],
['cat_gdp_2017', 'cat_gdp_2018'])
In the bottom left quadrant, we can obverve that state with below average ACT and SAT total scores are mostly poor states, with a GDP of below \$50,000. Conversely, we can see that states with a GDP of more than \\$50,000 tend to exist in the bottom right or top left quadrant.
This suggests that states with a low GDP per capita tend to do badly on both the ACT or SAT, while states with a high GDP per capita tend to do better either on ACT or SAT. For this reason, the upper right quadrant is almost empty, with only a few fringe states.
We can also observe that richer states tend to cluster closer together versus lower income states, which are more spread out.
sns_scatter(final, ['act_2017_rate', 'act_2018_rate'], ['sat_2017_rate', 'sat_2018_rate'],
['ACT vs SAT Participation in 2017', 'ACT vs SAT Participation in 2018'], ['ACT Participation', 'ACT Participation'],
['SAT Participation', 'SAT Participation'],
['cat_gdp_2017', 'cat_gdp_2018'])
In terms of participation rates, we can see that states of a similiar GDP per capita tend to cluster together, especially for states of a high GDP per capita.
plt.figure(figsize=(8,8))
plt.suptitle('Total Score vs GDP', fontsize=20)
ax1=sns.boxplot(data=combined_2017, x='sat_2017_total', y='cat_gdp_2017', palette='Blues')
ax1.set_title('SAT Total Score vs GDP (2017)', fontsize=15)
ax1.set_xlabel('SAT 2017 Total Score')
ax1.set_ylabel('State GDP Per Capita')
plt.figure(figsize=(14,7))
ax2=sns.boxplot(data=combined_2017, x='act_2017_total', y='cat_gdp_2017', palette='Oranges')
ax2.set_title('ACT Total Score vs GDP (2017)', fontsize=15)
ax2.set_xlabel('ACT 2017 Total Score')
ax2.set_ylabel('State GDP Per Capita')