A TALE OF TWO TESTS

An analysis of the SAT and ACT with a focus on socioeconomic inequality.

Project 1: SAT & ACT Analysis

Executive Summary

In the United States, the SAT and ACT are entrance exams used by colleges and universities to make admission decisions. These exams help provide colleges with a common data point that can be used to compare all applicants. While many colleges do not prefer one test over the other, and while both tests share a number of similarities in terms of subjects tested, participation rates for each test tend vary wildly across the country.

In 2016, the College Board implemented major changes to the SAT, in an attempt to counter the growing popularity of the ACT. By looking at data from the subsequent years of 2017 and 2018, we can begin to measure the impact of these changes, while coming up with new strategies that may help the College Board increase their participation rates moving forward.

Problem Statement: which state should the College Board target to further raise SAT participation rates?

Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

2017 Data Import and Cleaning

In [2]:
# Load sat_2017 & act_2017 datasets
sat_2017 = pd.read_csv('./data/sat_2017.csv')
act_2017 = pd.read_csv('./data/act_2017.csv')
In [3]:
sat_2017.head()
Out[3]:
State Participation Evidence-Based Reading and Writing Math Total
0 Alabama 5% 593 572 1165
1 Alaska 38% 547 533 1080
2 Arizona 30% 563 553 1116
3 Arkansas 3% 614 594 1208
4 California 53% 531 524 1055
In [4]:
act_2017.head()
Out[4]:
State Participation English Math Reading Science Composite
0 National 60% 20.3 20.7 21.4 21.0 21.0
1 Alabama 100% 18.9 18.4 19.7 19.4 19.2
2 Alaska 65% 18.7 19.8 20.4 19.9 19.8
3 Arizona 62% 18.6 19.8 20.1 19.8 19.7
4 Arkansas 100% 18.9 19.0 19.7 19.5 19.4
In [5]:
sat_2017.describe()
Out[5]:
Evidence-Based Reading and Writing Math Total
count 51.000000 51.000000 51.000000
mean 569.117647 547.627451 1126.098039
std 45.666901 84.909119 92.494812
min 482.000000 52.000000 950.000000
25% 533.500000 522.000000 1055.500000
50% 559.000000 548.000000 1107.000000
75% 613.000000 599.000000 1212.000000
max 644.000000 651.000000 1295.000000
In [6]:
act_2017.describe()
Out[6]:
English Math Reading Science
count 52.000000 52.000000 52.000000 52.000000
mean 20.919231 21.173077 22.001923 21.040385
std 2.332132 1.963602 2.048672 3.151113
min 16.300000 18.000000 18.100000 2.300000
25% 19.000000 19.400000 20.475000 19.900000
50% 20.550000 20.900000 21.700000 21.150000
75% 23.300000 23.100000 24.125000 22.525000
max 25.500000 25.300000 26.000000 24.900000

The SAT dataset contains the average SAT scores of students by state. All 51 states (Washington DC became a state in 2020) of the US are represented in the data, excluding non-state territories such as Puerto Rico and Guam. This particular dataset includes the two compulsory sections of the SAT, which are Math section and the Evidence Based Reading and Writing section. The dataset also has a record of average total scores by state, and participation rate by state.

The ACT dataset similarly consists of the average ACT scores of students by state, includes the average English, Math, Reading, Science and Composite of students. Unlike the SAT dataset, the ACT dataset includes a row with national averages. There are four columns indicating English, Math, Reading and Science scores. The ACT data also includes participation rates by state.

It's worth noting that the average total scores for both tests are derived from subject scores listed within the dataset, and do not include additional subjects outside the dataset.

Overall, the data looks complete, given that the number of rows for the ACT and SAT match the number of states in the US (the ACT dataset has an extra row that we will easily do away with later). While the data may be complete, the multiple outliers in the data (i.e. abnormally low minimum scores) suggest that further cleaning is needed.

The minimum possible value for each SAT subject is 200, with a maximum of 800. Correspondingly, SAT total scores can range from 400 to 1600.

Each ACT subject has a minimum score of 1, and a maximum score of 36. The total score for a student is calculated by adding these subjects together, and dividing by the total number of subjects (4).

With this in mind, there are a few areas that we need to look into --

SAT 2017 dataset:

  • Abnormally low average Math Score of 52 when minimum is 200
  • The datatype of the participation rates column is a string object instead of an int or float

ACT 2017 dataset:

  • Abnormally low average score of 2.3
  • Extra row within data
  • Composite score column is a string object when it should be a float -- this is due to the '20.2x' string in Wyoming column
In [7]:
# Fixing composite score for Wyoming
act_2017 = act_2017.replace('20.2x', 20.2)
In [8]:
# Fixing SAT Math score for Maryland
sat_2017.loc[20, 'Math'] = 524 
In [9]:
# Fixing ACT Science score for Maryland
act_2017.loc[21, 'Science'] = 23.2
In [10]:
# Checking data types for ACT 2017 dataset
act_2017.dtypes
Out[10]:
State             object
Participation     object
English          float64
Math             float64
Reading          float64
Science          float64
Composite         object
dtype: object
In [11]:
# Checking data types for SAT 2017 dataset
sat_2017.dtypes
Out[11]:
State                                 object
Participation                         object
Evidence-Based Reading and Writing     int64
Math                                   int64
Total                                  int64
dtype: object

Composite score is still an object -- this means that we need to convert the data in the columns to int or float numbers.

In [12]:
# We'll write a function to change our participation column to integers
def participation_to_int(df):
    for col in df.columns:
        if 'percentage' in col.lower() or 'participation' in col.lower():
            df[col] = df[col].replace('%','', regex=True).astype('int')
            
participation_to_int(act_2017)
participation_to_int(sat_2017)
In [13]:
# Replace object values in composite column with float values 
act_2017['Composite'] = list(map(lambda x: float(x), act_2017['Composite']))
In [14]:
# Checking column types
type(act_2017['Composite'].iloc[0]), type(act_2017['Participation'].iloc[0]), type(sat_2017['Participation'].iloc[0])
Out[14]:
(numpy.float64, numpy.int32, numpy.int32)

We'll also want to rename our columns to make things a bit easier for us later on.

In [15]:
def update_cols(df,prefix):
    # Rename columns using list comprehension
    df.columns = [x.lower().replace('evidence-based reading and writing','ebrw').replace('participation','rate') \
                           .replace('average', '').replace('score', '').replace('composite', 'total') \
                           .replace(' ','').replace('percentage', 'rate') for x in df.columns]
    
    # Add prefix
    df.columns = [prefix + x if x !='state' else x for x in df.columns]
    return df
In [16]:
update_cols(act_2017, 'act_2017_');
In [17]:
update_cols(sat_2017, 'sat_2017_');
In [18]:
# Checking column names
act_2017.columns
Out[18]:
Index(['state', 'act_2017_rate', 'act_2017_english', 'act_2017_math',
       'act_2017_reading', 'act_2017_science', 'act_2017_total'],
      dtype='object')
In [19]:
# Checking column names
sat_2017.columns
Out[19]:
Index(['state', 'sat_2017_rate', 'sat_2017_ebrw', 'sat_2017_math',
       'sat_2017_total'],
      dtype='object')

Data Dictionary

Feature Type Dataset Description
state string SAT 2017 State from which ACT and SAT data was collected
sat_2017_rate integer SAT 2017 rate rate as a percentage out of 100
sat_2017_ebrw integer SAT 2017 Average score for reading and writing (Min: 200, Max: 800)
sat_2017_math integer SAT 2017 Average math score (Min: 200, Max: 800)
sat_2017_total integer SAT 2017 Average total score (Min: 400, Max: 1600)
act_2017_rate integer ACT 2017 rate rate as a percentage out of 100
act_2017_english float ACT 2017 Average english score (Min: 1, Max:36)
act_2017_math float ACT 2017 Average math score (Min: 1, Max:36)
act_2017_reading float ACT 2017 Average reading score (Min: 1, Max:36)
act_2017_science float ACT 2017 Average science score (Min: 1, Max:36)
act_2017_total float ACT 2017 Average total score derived as a composite of Math, Reading, English and Science scores (Min: 1, Max:36).
sat_2018_rate integer SAT 2018 rate rate as a percentage out of 100
sat_2018_ebrw integer SAT 2018 Average score for reading and writing (Min: 200, Max: 800)
sat_2018_math integer SAT 2018 Average math score (Min: 200, Max: 800)
sat_2018_total integer SAT 2018 Average total score (Min: 400, Max: 1600)
act_2018_rate integer ACT 2018 rate rate as a percentage out of 100
act_2018_english float ACT 2018 Average english score (Min: 1, Max:36)
act_2018_math float ACT 2018 Average math score (Min: 1, Max:36)
act_2018_reading float ACT 2018 Average reading score (Min: 1, Max:36)
act_2018_science float ACT 2018 Average science score (Min: 1, Max:36)
act_2018_total float ACT 2018 Average total score derived as a composite of Math, Reading, English and Science scores (Min: 1, Max:36).
gdp_2017 integer US BEA Real estimates of Gross domestic product (GDP) by state measured in chained 2009 dollars
gdp_2018 integer US BEA Real estimates of Gross domestic product (GDP) by state measured in chained 2009 dollars
cat_gdp_2017 object US BEA Real estimates of Gross domestic product (GDP) by state measured in chained 2009 dollars, sorted by category (<40000, 40000-50000, 50000-60000, >60000)
cat_gdp_2018 object US BEA Real estimates of Gross domestic product (GDP) by state measured in chained 2009 dollars, sorted by category (<40000, 40000-50000, 50000-60000, >60000)
classification object Gallup Political affliation by state based on annual state averages of party affiliation from Gallup Daily tracking

The ACT dataset has an extra row for national averages which we need to remove.

In [20]:
# Dropping national row
act_2017.drop(act_2017[act_2017['state'] == 'National'].index, inplace=True)
act_2017.reset_index(drop=True, inplace=True)

Merge dataframes

In [21]:
# Set index as state in preparation for merging
act_2017.set_index('state', inplace=True)
sat_2017.set_index('state', inplace=True);
In [22]:
# Joining ACT and SAT 2017 datasets
combined_2017 = act_2017.join(sat_2017)
combined_2017.head();
In [23]:
combined_2017.to_csv('./data/combined_2017.csv', index=False)

2018 Data Import and Cleaning

In [24]:
# Import 2018 SAT and ACT data as dataframes
sat_2018 = pd.read_csv('./data/sat_2018.csv')
act_2018 = pd.read_csv('./data/act_2018_updated.csv')
In [25]:
# Set index as state to facillitate merging and access to data
act_2018.set_index('State', inplace=True)
sat_2018.set_index('State', inplace=True)

Problems with the Raw Data

SAT 2018 Dataset Errors:

  • Florida, Indiana (participation rate and scores)
  • Hawaii, Montana, South Dakota, Utah, Washington, Wyoming (scores)

ACT 2018 Dataset Errors:

  • Delaware (total score)
  • Georgia (reading / science score)
In [26]:
# Function to help us rectify errors in SAT dataset
def sat_set_state(df, state, participation, ebrw, math, total):
    
    if participation is not None:
        df.at[state,'Participation'] = participation
        
    if ebrw is not None:
        df.at[state,'Evidence-Based Reading and Writing'] = ebrw
        
    if math is not None:
        df.at[state,'Math'] = math
    
    if total is not None:
        df.at[state,'Total'] = total
        
    return df
In [27]:
# Fixing SAT values
sat_2018 = sat_set_state(sat_2018, 'Florida', 97, 522, 493, 1014)
sat_2018 = sat_set_state(sat_2018, 'Hawaii', None, 550, 549, 1099)
sat_2018 = sat_set_state(sat_2018, 'Indiana', 67, 546, 539, 1086)
sat_2018 = sat_set_state(sat_2018, 'Montana', None, None, None, 1229)
sat_2018 = sat_set_state(sat_2018, 'South Dakota', None, None, None, 1241)
sat_2018 = sat_set_state(sat_2018, 'Utah', None, 618, 612, 1230)
sat_2018 = sat_set_state(sat_2018, 'Washington', None, None, None, 1082)
sat_2018 = sat_set_state(sat_2018, 'Wyoming', None, None, 635, None)
In [28]:
# Fixing ACT values
act_2018.at['Delaware', 'Average Composite Score'] = 23.8
act_2018.at['Georgia', 'Average Reading Score'] = 22.1
act_2018.at['Georgia', 'Average Science Score'] = 21.2
In [29]:
# Changing participation rates to int
participation_to_int(act_2018)
participation_to_int(sat_2018)
In [30]:
update_cols(act_2018, 'act_2018_')
update_cols(sat_2018, 'sat_2018_');
In [31]:
# Standardizing column name
act_2018.rename(columns={'act_2018_rateofstudentstested': 'act_2018_rate'}, inplace=True)
In [32]:
# Combine 2018 data into a dataframe
combined_2018 = act_2018.join(sat_2018)
combined_2018.head()
Out[32]:
act_2018_rate act_2018_total act_2018_english act_2018_math act_2018_reading act_2018_science sat_2018_rate sat_2018_ebrw sat_2018_math sat_2018_total
State
Alabama 100 19.1 18.9 18.3 19.6 19.0 6 595 571 1166
Alaska 33 20.8 19.8 20.6 21.6 20.7 43 562 544 1106
Arizona 66 19.2 18.2 19.4 19.5 19.2 29 577 572 1149
Arkansas 100 19.4 19.1 18.9 19.7 19.4 5 592 576 1169
California 27 22.7 22.5 22.5 23.0 22.1 60 540 536 1076
In [33]:
# Checking our dtypes
combined_2018.dtypes
Out[33]:
act_2018_rate         int32
act_2018_total      float64
act_2018_english    float64
act_2018_math       float64
act_2018_reading    float64
act_2018_science    float64
sat_2018_rate         int32
sat_2018_ebrw         int64
sat_2018_math         int64
sat_2018_total        int64
dtype: object

We'll combine our 2017 and 2018 data into a single dataframe here by joining on state names.

We'll be using this combined dataframe for the remainder of the project.

In [34]:
final = combined_2017.join(combined_2018)
In [35]:
final.info()
<class 'pandas.core.frame.DataFrame'>
Index: 51 entries, Alabama to Wyoming
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   act_2017_rate     51 non-null     int32  
 1   act_2017_english  51 non-null     float64
 2   act_2017_math     51 non-null     float64
 3   act_2017_reading  51 non-null     float64
 4   act_2017_science  51 non-null     float64
 5   act_2017_total    51 non-null     float64
 6   sat_2017_rate     51 non-null     int32  
 7   sat_2017_ebrw     51 non-null     int64  
 8   sat_2017_math     51 non-null     int64  
 9   sat_2017_total    51 non-null     int64  
 10  act_2018_rate     51 non-null     int32  
 11  act_2018_total    51 non-null     float64
 12  act_2018_english  51 non-null     float64
 13  act_2018_math     51 non-null     float64
 14  act_2018_reading  51 non-null     float64
 15  act_2018_science  51 non-null     float64
 16  sat_2018_rate     51 non-null     int32  
 17  sat_2018_ebrw     51 non-null     int64  
 18  sat_2018_math     51 non-null     int64  
 19  sat_2018_total    51 non-null     int64  
dtypes: float64(10), int32(4), int64(6)
memory usage: 10.1+ KB
In [36]:
final.to_csv('./data/final.csv')

Exploratory Data Analysis

Summary Statistics

In [37]:
final.describe().transpose()
Out[37]:
count mean std min 25% 50% 75% max
act_2017_rate 51.0 65.254902 32.140842 8.0 31.00 69.0 100.00 100.0
act_2017_english 51.0 20.931373 2.353677 16.3 19.00 20.7 23.30 25.5
act_2017_math 51.0 21.182353 1.981989 18.0 19.40 20.9 23.10 25.3
act_2017_reading 51.0 22.013725 2.067271 18.1 20.45 21.8 24.15 26.0
act_2017_science 51.0 21.450980 1.739353 18.2 19.95 21.3 23.20 24.9
act_2017_total 51.0 21.519608 2.020695 17.8 19.80 21.4 23.60 25.5
sat_2017_rate 51.0 39.803922 35.276632 2.0 4.00 38.0 66.00 100.0
sat_2017_ebrw 51.0 569.117647 45.666901 482.0 533.50 559.0 613.00 644.0
sat_2017_math 51.0 556.882353 47.121395 468.0 523.50 548.0 599.00 651.0
sat_2017_total 51.0 1126.098039 92.494812 950.0 1055.50 1107.0 1212.00 1295.0
act_2018_rate 51.0 61.647059 34.080976 7.0 28.50 66.0 100.00 100.0
act_2018_total 51.0 21.498039 2.117686 17.7 19.95 21.3 23.65 25.6
act_2018_english 51.0 20.988235 2.446356 16.6 19.10 20.2 23.70 26.0
act_2018_math 51.0 21.125490 2.035765 17.8 19.40 20.7 23.15 25.2
act_2018_reading 51.0 22.033333 2.164132 18.0 20.45 21.7 24.10 26.1
act_2018_science 51.0 21.341176 1.870206 17.9 19.85 21.1 23.05 24.9
sat_2018_rate 51.0 46.627451 38.014187 2.0 4.50 52.0 79.50 100.0
sat_2018_ebrw 51.0 567.294118 45.317676 497.0 535.00 552.0 616.50 643.0
sat_2018_math 51.0 557.450980 49.183865 480.0 521.50 547.0 600.50 655.0
sat_2018_total 51.0 1125.294118 94.442214 977.0 1062.50 1099.0 1229.50 1298.0

Exposing Selection Bias

In [38]:
# Function to display top 5 and bottom 5 of dataframe
def display_n(df,n): 
    with pd.option_context('display.max_rows',n*2):
        display(df)
    # Source: https://stackoverflow.com/questions/42504984/python-pandas-select-both-head-and-tail/53311747#53311747
In [39]:
# Lowest and highest SAT scores for 2017
display_n(final[['sat_2017_total', 'sat_2017_rate']].sort_values('sat_2017_total'),5)

# Lowest and highest SAT scores for 2018
display_n(final[['sat_2018_total', 'sat_2018_rate']].sort_values('sat_2018_total'),5)

# Lowest and highest ACT scores in 2017
display_n(final[['act_2017_total', 'act_2017_rate']].sort_values('act_2017_total'),5)

# Lowest and highest ACT scores in 2018
display_n(final[['act_2018_total', 'act_2018_rate']].sort_values('act_2018_total'), 5)
sat_2017_total sat_2017_rate
state
District of Columbia 950 100
Delaware 996 100
Idaho 1005 93
Michigan 1005 100
Maine 1012 95
... ... ...
Kansas 1260 4
Missouri 1271 3
Iowa 1275 2
Wisconsin 1291 3
Minnesota 1295 3

51 rows × 2 columns

sat_2018_total sat_2018_rate
state
District of Columbia 977 92
Delaware 998 100
West Virginia 999 28
Idaho 1001 100
Michigan 1011 100
... ... ...
Kansas 1265 4
Iowa 1265 3
North Dakota 1283 2
Wisconsin 1294 3
Minnesota 1298 4

51 rows × 2 columns

act_2017_total act_2017_rate
state
Nevada 17.8 100
Mississippi 18.6 100
South Carolina 18.7 100
Hawaii 19.0 90
North Carolina 19.1 100
... ... ...
District of Columbia 24.2 32
Maine 24.3 8
Connecticut 25.2 31
Massachusetts 25.4 29
New Hampshire 25.5 18

51 rows × 2 columns

act_2018_total act_2018_rate
state
Nevada 17.7 100
South Carolina 18.3 100
Mississippi 18.6 100
Hawaii 18.9 89
Alabama 19.1 100
... ... ...
Michigan 24.4 22
New York 24.5 27
New Hampshire 25.1 16
Massachusetts 25.5 25
Connecticut 25.6 26

51 rows × 2 columns

In [40]:
# 4 states with SAT participation rates of 100%
sat_2017[sat_2017['sat_2017_rate'] == 100].index
Out[40]:
Index(['Connecticut', 'Delaware', 'District of Columbia', 'Michigan'], dtype='object', name='state')
In [41]:
# 17 states with SAT participation rates of 100%
act_2017[act_2017['act_2017_rate'] == 100].index, (len(act_2017[act_2017['act_2017_rate'] == 100].index))
Out[41]:
(Index(['Alabama', 'Arkansas', 'Colorado', 'Kentucky', 'Louisiana', 'Minnesota',
        'Mississippi', 'Missouri', 'Montana', 'Nevada', 'North Carolina',
        'Oklahoma', 'South Carolina', 'Tennessee', 'Utah', 'Wisconsin',
        'Wyoming'],
       dtype='object', name='state'),
 17)

When looking at the SAT, we can see that only 4 states had participation rates of 100% in 2017. This increased to 5 states in 2018. In comparison, the ACT had 17 states with participation rates of 100% in 2017, with this number remaining unchanged in 2018. We can infer from this that the ACT has a larger baseline following than the SAT.

Delving further into the data, we can also observe a noticeable trend where mid-western/rural states like Kansas and Iowa seem to do better on the SAT as compared to urban states like Washington DC or coastal states like Delware. The opposite pattern seems true, where urban states like New York and DC do better than rural states like South Carolina on the ACT. However, if we look closer, this is actually a case of selection bias.

Students taking the SAT in states with extremely ACT dominant states (like Kansas or Iowa) are likely planning to apply to out of state universities that require the SAT. The ability to afford both the SAT and relocate to a geographically distant location suggests that these students are likely to be of an above average socioeconomic status, which makes them unrepresentative of the average population of students taking the SAT.

In [42]:
# Create function to determine total percentage change over years
def abs_change(row):
    row = row.fillna(0)
    row = np.abs(row)
    row['total_pct_change'] = row.sum()
    return row
In [43]:
# Checking for rate change between years for SAT
sat_2017_rate = final[final['sat_2017_rate'] >= 99][['sat_2017_rate', 'sat_2018_rate']]
sat_2018_rate = final[final['sat_2018_rate'] >= 99][['sat_2017_rate', 'sat_2018_rate']]

# Concatenate dataframes with SAT participation percentages by state
sat_rate_df = pd.concat([sat_2017_rate, sat_2018_rate]).drop_duplicates()

# Create new column with percentage changes for SAT
sat_rate_df['total_pct_change'] = sat_rate_df.diff(axis=1).apply(abs_change, axis=1)['total_pct_change']
sat_rate_df[sat_rate_df['total_pct_change'] != 0.0].sort_values('total_pct_change', ascending=False)
Out[43]:
sat_2017_rate sat_2018_rate total_pct_change
state
Illinois 9 99 90.0
Colorado 11 100 89.0
District of Columbia 100 92 8.0
Idaho 93 100 7.0
Maine 95 99 4.0

Colorado had a massive jump in their SAT participation rate from 11% in 2017 to 100% in 2018. Illinois also had a similar jump, from 9% in 2017 to 99% in 2018. This was due to changes in state educational policy, which made the SAT compulsory in both Colorado and Illinois.

DC's SAT participation rate dropped from 100% to 92% in 2018, while Idaho's SAT participation rate increased from 93% to 100% in 2018.

In [44]:
# Checking for rate change between years for ACT
act_2017_rate = final[final['act_2017_rate'] >= 99][['act_2017_rate', 'act_2018_rate']]
act_2018_rate = final[final['act_2018_rate'] >= 99][['act_2017_rate', 'act_2018_rate']]

# Concatenate dataframes with ACT participation percentages by state
act_rate_df = pd.concat([act_2017_rate, act_2018_rate]).drop_duplicates()

# Create new column with percentage changes for ACT
act_rate_df['total_pct_change'] = act_rate_df.diff(axis=1).apply(abs_change, axis=1)['total_pct_change']
act_rate_df[act_rate_df['total_pct_change'] != 0.0].sort_values('total_pct_change', ascending=False)
Out[44]:
act_2017_rate act_2018_rate total_pct_change
state
Colorado 100 30 70.0
Ohio 75 100 25.0
Nebraska 84 100 16.0
Minnesota 100 99 1.0

Colorado's ACT participation rate dropped massively from 2017 to 2018, while Nebraska and Ohio both increased their ACT participation rates from 2017 to 2018. In Nebraska, the ACT was made mandatory, leading to an increase in participation rate. In Ohio, standardized testing through the SAT or ACT also became mandatory -- despite being given a choice of which test to adopt, the majority of state districts (95%) chose the ACT over the SAT.

In [45]:
# Checking for cases where participation rates is greater than 50% for both tests in 2017
final[(final['sat_2017_rate'] > 50) & (final['act_2017_rate'] > 50)] \
        [['sat_2017_rate', 'act_2017_rate']] \
        .sort_values('sat_2017_rate')
Out[45]:
sat_2017_rate act_2017_rate
state
Hawaii 55 90
Georgia 61 55
Florida 83 73
In [46]:
# Checking for cases where participation rates is greater than 50% for both tests in 2018
final[(final['sat_2018_rate'] > 50) & (final['act_2018_rate'] > 50)] \
        [['sat_2018_rate', 'act_2018_rate']] \
        .sort_values(by='sat_2018_rate')
Out[46]:
sat_2018_rate act_2018_rate
state
North Carolina 52 100
South Carolina 55 100
Hawaii 56 89
Georgia 70 53
Florida 97 66

Flordia, Hawaii and Georgia had greater than 50% participation rates for both tests in 2017 and 2018. Hawaii is interesting as the state is rather unique in terms of its location and demographics. The majority of Hawaiians live in urban areas, which seems to be contrary to the trend of urban/coastal states favoring the SAT test.

Both North and South Carolina both have more than 50% participation for the ACT and SAT tests. Interestingly, North Carolina, South Carolina, Florida and Georgia are in close geographic proximity, and are all officially recognised as part of the Southern US region. These are generally considered to be highly conservative, republican states. We'll take a look later to see if political affiliation truly has anything to do with participation rates.

Data Visualization

In [47]:
# Create mask to prevent duplication of variables
mask = np.zeros(final.corr().shape, dtype=bool)
mask[np.triu_indices(len(mask))] = True
In [48]:
plt.figure(figsize=(16,12))
plt.title('Heatmap between numeric features of ACT and SAT in 2017 and 2018', fontsize=14)
sns.heatmap(final.corr(), cmap='RdBu', annot=True, mask=mask)
Out[48]:
<AxesSubplot:title={'center':'Heatmap between numeric features of ACT and SAT in 2017 and 2018'}>

There are several patterns of interest here --

Firstly, we can see that ACT participation has a strong negative correlation with ACT subject scores and overall ACT scores. This means that states with higher ACT participation tend to have lower ACT scores, and vice versa for states with low ACT participation rates. This is mirrored in SAT participation rates, where there is an even stronger negative correlation between participation rates and SAT scores.

Secondly, subject and test scores from 2017 are strongly correlated with the same subject and test scores in 2018. This means that states that did well in the previous year, are likely to do well in the following year. This is unsurprising as states are unlikely to dramatically go up or down in test scores over a single year, due to policies and institutions (e.g. state education departments) that work to maintain consistent educational results year-on-year. Barring dramatic policy changes like switching from one test to another, states are likely to produce to the same results.

The test scores / participation rates for each are also negatively correlated with the other test. This shows that it's pretty rare for a student to take both tests.

In [49]:
def subplot_histograms(dataframe, list_of_columns, list_of_titles, list_of_xlabels, sharex=False, kde=False, figsize=(14,7)):
    nrows = int(np.ceil(len(list_of_columns)/2)) # Makes sure you have enough rows
    fig, ax = plt.subplots(nrows=nrows, ncols=2, figsize=figsize, sharex=sharex, sharey=True)
    ax = ax.ravel() # Ravel turns a matrix into a vector, which is easier to iterate
    
    for i, column in enumerate(list_of_columns):
        # Calculate mean
        mean = np.mean(dataframe[list_of_columns[i]])
        ax[i].axvline(mean, ls='--', color='crimson')
        
        # Set color of graph according to test type
        if 'act_' in column:
            color = '#E1812C'
        elif 'sat_' in column:
            color = '#3274A1'
        
        sns.histplot(data=dataframe, x=column, color=color, ax=ax[i], kde=kde, bins=8)
        ax[i].set_title(list_of_titles[i] + f' (Mean: {mean:.1f})')
        ax[i].set_xlabel(list_of_xlabels[i])
        # To prevent label duplication
        if i % 2 == 0:
            ax[i].set_ylabel('Number of States')
    
    fig.tight_layout(h_pad = 1.5, w_pad = 1.5)
In [50]:
subplot_histograms(final, ['sat_2017_rate', 'sat_2018_rate', 'act_2017_rate', 'act_2018_rate'], 
                   ['SAT 2017 Participation', 'SAT 2018 Participation', 'ACT 2017 Participation', 'ACT 2018 Participation'], 
                   ['Participation (%)', 'Participation (%)', 
                    'Participation (%)', 'Participation (%)'], sharex=True, kde=True)

Overall, we can see that the ACT has a larger baseline following than the SAT. The SAT has a large number of states with an extremely low participation rate, while the ACT has a large number of states with a extremely high participation rate.

In 2018, SAT participation increased by 5.9%, while ACT participation decreased by 3.7% on average. This suggests that some states are beginning to adopt the SAT test over the ACT test, with some states moving away from standardized testing altogether. The number of states with a 90% - 100% participation for the SAT increased in 2018, while ACT participation rates for states in the same range remained consistent.

In [51]:
subplot_histograms(final, ['sat_2017_math', 'sat_2018_math', 'act_2017_math', 'act_2018_math'], 
                   ['SAT 2017 Math Scores', 'SAT 2018 Math Scores', 'ACT 2017 Math Scores', 'ACT 2018 Math Scores'], 
                   ['Average Score (min: 400, max: 800)', 'Average Score (min: 400, max: 800)', 
                    'Average Score (min: 1, max: 36)', 'Average Score (min: 1, max: 36)'], kde=True)

Both tests have Math scores that do not appear to be normally distributed. In particular, we can observe that most states tend to have SAT Math scores in the range of 525-550 points. The ACT seems to less bimodal and slightly closer to having a normal distribution.

In [52]:
subplot_histograms(final, 
                   ['sat_2017_ebrw', 'sat_2018_ebrw', 
                    'act_2017_english', 'act_2017_reading', 'act_2018_english', 'act_2018_reading'],
                   ['SAT 2017 Reading/Writing Scores', 'SAT 2018 Reading/Writing Scores',
                    'ACT 2017 English Scores', 'ACT 2017 Reading Scores', 
                    'ACT 2018 English Scores', 'ACT 2018 Reading Scores'],
                   ['Average Score (min: 400, max: 800)', 'Average Score (min: 400, max: 800)',
                    'Average Score (min: 1, max: 36)', 'Average Score (min: 1, max: 36)', 
                    'Average Score (min: 1, max: 36)', 'Average Score (min: 1, max: 36)'], kde=True)

SAT average scores for Reading/Writing dropped from 2017 to 2018 by 1.8 points, while ACT English scores increased by 0.1 points in 2018.

ACT reading scores from 2017 and 2018 are relatively seem to be more normally distributed than SAT Reading/Writing scores. ACT English scores from 2017 and 2018 seem to be more bimodal in comparison to the rest of the ACT subjects.

In [53]:
subplot_histograms(final, 
                   ['act_2017_science', 'act_2018_science'],
                   ['ACT 2017 Science Scores', 'ACT 2018 Science Scores'],
                   ['Average Score (min: 1, max: 36)', 'Average Score (min: 1, max: 36)'], kde=True, figsize=(14,4))

Similar to the other ACT subjects, the distribution seems to be somewhat ambigious -- it seems to be somewhere in between normal and bimodal.

In [54]:
subplot_histograms(final, 
                   ['sat_2017_total', 'sat_2018_total', 'act_2017_total', 'act_2018_total'],
                   ['SAT 2017 Total Scores', 'SAT 2018 Total Scores', 
                    'ACT 2017 Total Scores', 'ACT 2018 Total Scores'],
                   ['Average Score (min: 800, max: 1600)', 'Average Score (min: 800, max: 1600)',
                    'Average Score (min: 1, max: 36)', 'Average Score (min: 1, max: 36)'], kde=True)

The average total score for the SAT dropped by 0.8 points in 2018. On the other hand, ACT English and Reading scores increased by 0.9 and 1.0 respectively.

We can also easily observe that SAT total scores appear to be more bimodally distributed, and can be divided into two distinct groups of states i.e. high-performing and low-performing states. This pattern is also apparent within the ACT data, but to a lesser degree.

In [55]:
# Function to plot multiple scatter plots
def multi_plot_scatter(df, list_of_x, list_of_y, list_of_titles, xlabels, ylabels, sharex = False, sharey = False):
    if len(list_of_x) == 1:
        ncols = 1
    else:
        ncols = 2
    nrows = int(np.ceil(len(list_of_x)/2)) # Makes sure you have enough rows
    fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(14,7), sharex = sharex, sharey = sharey)
    ax = ax.ravel() # Ravel turns a matrix into a vector, which is easier to iterate
    
    for i, column in enumerate(list_of_x):
        sns_fig = sns.regplot(data=df, x=list_of_x[i], y=list_of_y[i], ax=ax[i]).set_title(list_of_titles[i])
        ax[i].set_xlabel(xlabels[i])
        ax[i].set_ylabel(ylabels[i])
        plt.tight_layout()
        
# Function for plotting single scatter plots
def plot_scatter(df, x, y, title):
    fig, ax = plt.subplots(figsize=(10,5))
    plt.title(title)
    sns.regplot(data=df, x=x, y=y)
    plt.tight_layout()
In [56]:
plot_scatter(final, 'sat_2017_math', 'act_2017_math', 'SAT vs ACT Math Average Score for 2017')
plt.xlabel('SAT 2017 Math Average Score')
plt.ylabel('ACT 2017 Math Average Score');
In [57]:
multi_plot_scatter(final, ['sat_2017_ebrw', 'sat_2017_ebrw'], 
                    ['act_2017_reading', 'act_2017_english'], 
                     ['SAT Reading/Writing vs ACT Reading Average Score (2017)',
                      'SAT Reading/Writing vs ACT English Average Score (2017)'],
                     ['SAT 2017 Reading and Writing Score', 'SAT 2017 Reading and Writing Score'],
                     ['ACT 2017 Reading Score', 'ACT 2017 English Score'],
                    sharex = True, sharey = True)
In [58]:
plot_scatter(final, 'sat_2017_total', 'act_2017_total', 'SAT vs ACT Total Average Score for 2017')
plt.xlabel('SAT 2017 Total Average Score')
plt.ylabel('ACT 2017 Total Average Score');

Generally, ACT and SAT scores are negatively correlated to each other. This suggests that states tend to focus on one test over the other, meaning that states are likely to have either high SAT/ low ACT or low SAT/ high SAT average scores. In other words, states with low SAT scores tend to have high ACT scores and vice versa.

In [59]:
multi_plot_scatter(final, ['sat_2017_total', 'act_2017_total'], ['sat_2018_total', 'act_2018_total'], 
                   ['SAT Total Average Score in 2017 vs 2018', 'ACT Total Average Score in 2017 vs 2018'], 
                   ['SAT 2017 Total Average Score', 'ACT 2017 Composite Average Score'],
                   ['SAT 2018 Total Average Score', 'ACT 2018 Composite Average Score'])

ACT and SAT test results year by year have a very strong correlation. This means that states that did well in the last year are likely to do well in the next year. This correlation is much stronger for the ACT test, meaning that ACT scores tend to remain static from one year to the next. This could actually be an incentive for high-scoring states to stay with the ACT, given that scores seem to less from one year to the next as compared to the SAT.

Plot and interpret boxplots

In [60]:
def plot_box(dataframe, list_of_columns, title, list_of_yticklabels, xlabel, test_type=False, palette='tab10', orient='h'):
    plt.figure(figsize=(14,7))
    plt.title(title)
    
    ax = sns.boxplot(data=dataframe[list_of_columns],palette=palette, orient=orient)
    
    #For setting colors according to standardize colors according to test type (SAT = blue, ACT = orange)
    for i, col in enumerate(list_of_columns):
        if test_type:
                box = ax.artists[i]
                if 'act_' in col:
                        box.set_facecolor('#E8A36C')

                elif 'sat_' in col:
                    box.set_facecolor('#9DC2D5')
        
    ax.set_yticklabels(list_of_yticklabels)
    ax.set_xlabel(xlabel)
In [61]:
plot_box(final, ['sat_2017_rate', 'act_2017_rate', 'sat_2018_rate', 'act_2018_rate'],
         'SAT and ACT Participation Rates in 2017 and 2018', 
         ['SAT 2017 Participation', 'ACT 2017 Participation', 'SAT 2018 Participation', 'ACT 2018 Participation'], 
         'Participation Rate', test_type=True)

The ACT test has a higher median than the SAT test, indicating that the ACT participation rates are generally higher than SAT participation rates. The SAT participation rates are also skewed right, suggesting that there is strong support for the ACT test throughout the US. SAT participation rates are generally skewed left.

However, from 2017 to 2018, we can observe that the median for SAT participation increased, while the median for ACT participation dropped slightly. Variance (as measured by the IQR) also seems to have risen, leading to a decrease in skewness. This could suggest that support for the SAT test is beginning to rise throughout the US.

In [62]:
plot_box(final, ['sat_2017_math', 'sat_2018_math', 'sat_2017_ebrw', 'sat_2018_ebrw'], 
         'SAT Scores in 2017 and 2018',
         ['SAT 2017 Math', 'SAT 2018 Math', 'SAT 2017 Reading and Writing', 'SAT 2018 Reading and Writing'],
         'Average Score', palette='Blues')
    

The median for SAT test scores decreased for both Math and Reading and Writing in 2018. This could be due to increased SAT participation rates as previously mentioned. Another contributing factor to this could be states that switched to the SAT from the ACT, like Colorado and Illinois which had below total average scores of 1025 and 1029 respectively.

In [63]:
plot_box(final, ['sat_2017_total', 'sat_2018_total'], 
         'SAT Total Scores in 2017 and 2018', 
         ['SAT 2017 Average Total Score', 'SAT 2018 Average Total Score'], 'Average Score', palette='Blues')

The median for SAT average total scores also decreased from 1107 to 1099 -- this could be due to both increased participation rates and reports of more students than ever taking the SAT, with many of these students from low-income families or from families “with no history of college attendance". The introduction of SAT test fee waivers for low income students also occurred in 2018.

In [64]:
plot_box(final, ['act_2017_math', 'act_2018_math', 'act_2017_reading', 
                 'act_2018_reading', 'act_2017_english', 'act_2018_english'], 
         'ACT Subject Scores in 2017 and 2018', 
         ['ACT 2017 Math', 'ACT 2018 Math', 'ACT 2017 Reading',
          'ACT 2018 Reading', 'ACT 2017 English', 'ACT 2018 English'], 'Average Score', palette='Oranges')
In [65]:
plot_box(final, ['act_2017_total', 'act_2018_total'], 
         'ACT Composite Scores in 2017 and 2018', ['ACT 2017 Total', 'ACT 2018 Total'], 'Average Score', palette='Oranges')

The median for all ACT subjects decreased in 2018. While we saw earlier that the mean for ACT scores increased the across the board, this could mean that the ACT is beginning to become increasingly bimodally distributed.

Statistical Equity vs Socioeconomic Equity

To further understand the underlying differences between high-performing and low-performing states, I decided to look at additional categories including political affiliation by state and GDP per Capita by state.

I used multiple datasets for this, including:

The real estimates of Gross domestic product (GDP) by state were measured in chained 2009 dollars. Real GDP by state is an inflation-adjusted measure of each state's gross product that is based on national prices for the goods and services produced within the state.

In [66]:
# Load datasets for GDP and political classification
gallup_2017 = pd.read_csv('./data/Gallup_2017.csv', index_col='state')
gallup_2018 = pd.read_csv('./data/Gallup_2018.csv', index_col='state')
us_gdp = pd.read_csv('./data/US_GDP.csv', index_col='state')
us_gdp.replace(',', '', regex=True, inplace=True)
In [67]:
# Simplify classification to either Democrat, Republican or Swing
gallup_2017 = gallup_2017.replace(['Lean Republican', 'Solid Republican'], 'Republican').replace \
(['Lean Democrat', 'Solid Democrat'], 'Democrat').replace('Competitive', 'Swing')

gallup_2018 = gallup_2018.replace(['Lean Republican', 'Strong Republican'], 'Republican').replace \
(['Lean Democratic', 'Strong Democratic'], 'Democrat').replace('Competitive', 'Swing')
In [68]:
# Change GDP numbers to numeric
us_gdp['gdp_2017'] = pd.to_numeric(us_gdp['gdp_2017'])
us_gdp['gdp_2018'] = pd.to_numeric(us_gdp['gdp_2018'])
In [69]:
# Combine GDP and political classification with 2017 dataset
combined_2017 = combined_2017.join(gallup_2017['classification'])
combined_2017 = combined_2017.join(us_gdp['gdp_2017'])

# Combine GDP and political classification with 2018 dataset
combined_2018 = combined_2018.join(gallup_2018['classification'])
combined_2018 = combined_2018.join(us_gdp['gdp_2018'])
In [70]:
categories = ['<$40,000', '\$40,000-$50,000', '\$50,000-$60,000', '>$60,000']

# Create categories for US GDP
for col in us_gdp:
    cat = cat_gdp_2017 = cat_gdp_2018 = []
    for value in us_gdp[col]:
        if value < 40000:
            cat.append('<$40,000')
        
        elif value >= 40000 and value <= 50000:
            cat.append('\$40,000-$50,000')
            
        elif value >= 50000 and value <= 60000:
            cat.append('\$50,000-$60,000')
        
        elif value > 60000:
            cat.append('>$60,000')
            
    if '2017' in col:
        cat = cat_gdp_2017
        
    elif '2018' in col:
        cat = cat_gdp_2018
In [71]:
# Convert GDP data to categorical data
us_gdp['cat_gdp_2017'] = pd.Categorical(cat_gdp_2017, categories)
us_gdp['cat_gdp_2018'] = pd.Categorical(cat_gdp_2018, categories)     

# Add in categories for US GDP
combined_2017 = combined_2017.join(us_gdp['cat_gdp_2017'])
combined_2018 = combined_2018.join(us_gdp['cat_gdp_2018'])

We can now revisit some of our scatterplot graphs made earlier, and separate our datapoints according to GDP. In the plots below, I separated the graph into quadrants using the mean for SAT and ACT total scores.

In [72]:
# Function for plotting custom seaborn scatter plots
def sns_scatter(dataframe, list_of_x, list_of_y, list_of_titles, list_of_xlabels, list_of_ylabels, hue, sharex=False):
    nrows = int(np.ceil(len(list_of_x)/2))
    fig, ax = plt.subplots(nrows=nrows, ncols=2, figsize=(14,6), sharex=sharex, sharey=True)
    ax = ax.ravel()
    
    for i, _ in enumerate(list_of_x):
        if '_2017' in list_of_x[i]:
            dataframe = combined_2017
        elif '_2018' in list_of_x[i]:
            dataframe = combined_2018   
        sns.scatterplot(data=dataframe, x=list_of_x[i], y=list_of_y[i], hue=hue[i], ax=ax[i], palette='magma', s=100)
        if i != len(list_of_x) - 1:
            ax[i].get_legend().remove()
        ax[i].set_title(list_of_titles[i], fontsize=15)
        
        ax[i].set_xlabel(list_of_xlabels[i] + f' (Mean: {dataframe[list_of_x[i]].mean():.1f})')
        ax[i].set_ylabel(list_of_ylabels[i] + f' (Mean: {dataframe[list_of_y[i]].mean():.1f})')
        
        # Mean for ACT test in 2017 and 2018
        ax[i].axvline(dataframe[list_of_x[i]].mean(), color='black')
        
        # Mean for SAT test in 2017 and 2018
        ax[i].axhline(dataframe[list_of_y[i]].mean(), color='black')
        
    
    plt.legend(title='State GDP per Capita \n (Chained 2009 USD)')
    plt.tight_layout()
In [73]:
sns_scatter(final, ['act_2017_total', 'act_2018_total'], ['sat_2017_total', 'sat_2018_total'], 
            ['ACT vs SAT Total Average Score in 2017', 'ACT vs SAT Total Average Score in 2018'],
            ['ACT 2017 Scores', 'ACT 2018 Scores'],
            ['SAT 2017 Scores', 'SAT 2017 Scores'],
            ['cat_gdp_2017', 'cat_gdp_2018'])

In the bottom left quadrant, we can obverve that state with below average ACT and SAT total scores are mostly poor states, with a GDP of below \$50,000. Conversely, we can see that states with a GDP of more than \\$50,000 tend to exist in the bottom right or top left quadrant.

This suggests that states with a low GDP per capita tend to do badly on both the ACT or SAT, while states with a high GDP per capita tend to do better either on ACT or SAT. For this reason, the upper right quadrant is almost empty, with only a few fringe states.

We can also observe that richer states tend to cluster closer together versus lower income states, which are more spread out.

In [74]:
sns_scatter(final, ['act_2017_rate', 'act_2018_rate'], ['sat_2017_rate', 'sat_2018_rate'], 
            ['ACT vs SAT Participation in 2017', 'ACT vs SAT Participation in 2018'], ['ACT Participation', 'ACT Participation'], 
            ['SAT Participation', 'SAT Participation'],
            ['cat_gdp_2017', 'cat_gdp_2018'])

In terms of participation rates, we can see that states of a similiar GDP per capita tend to cluster together, especially for states of a high GDP per capita.

In [75]:
plt.figure(figsize=(8,8))
plt.suptitle('Total Score vs GDP', fontsize=20)
ax1=sns.boxplot(data=combined_2017, x='sat_2017_total', y='cat_gdp_2017', palette='Blues')
ax1.set_title('SAT Total Score vs GDP (2017)', fontsize=15)
ax1.set_xlabel('SAT 2017 Total Score')
ax1.set_ylabel('State GDP Per Capita')
plt.figure(figsize=(14,7))
ax2=sns.boxplot(data=combined_2017, x='act_2017_total', y='cat_gdp_2017', palette='Oranges')
ax2.set_title('ACT Total Score vs GDP (2017)', fontsize=15)
ax2.set_xlabel('ACT 2017 Total Score')
ax2.set_ylabel('State GDP Per Capita')
Out[75]:
Text(0, 0.5, 'State GDP Per Capita')