Rights Versus Liberation

Classifying Subreddit Posts with Natural Language Processing

Project 3: Subreddit Classification with NLP

Introduction & Problem Statement

Machine learning has come a long way, especially in the area of Natural Language Processing (NLP), the branch of artificial intelligence that helps computers understand, interpret and manipulate human language. With the amount of unstructured data growing every day, automation is becoming vital for analyzing text and speech efficiently.

This is relevant to social media websites like Reddit - a collection of interest-based communities known as subreddits - which is growing in popularity but faces moderation bottlenecks due to the limited availability of human moderators. Reddit can also be overwhelming to new users, who may struggle to find the right community, since subreddit names can be highly misleading. This project aims to address this problem through a subreddit post classifier.

Problem statement: How can NLP be best used to classify posts from two different subreddits based on their title and text content?

For this project, I selected two subreddits that are ostensibly similar -- r/MensLib and r/MensRights. Both subreddits focus mainly on men's issues such as suicide or homelessness, and both give men a space to discuss their beliefs around gender rights and issues. The main difference is that r/MensLib has a broad definition of masculinity and supports feminism, while r/MensRights generally has a narrower definition of masculinity and is driven by the idea that feminism and feminists are actively harming men. Its members also believe that serious discrimination against men is inherent in western societies.

While the moderation teams of both subreddits could benefit from some kind of classification tool, r/MensLib would likely benefit much more from an NLP classifier, given its strict moderation policies. I also personally believe that the positions of both these subreddits are worth deeper examination. r/MensLib is generally believed to have a more 'positive' community than r/MensRights, but is this really the case? Are the topics they discuss similar, or different?

The full code for this project is available here.

Data Scraping

To get at the posts we need, we have to decide on a way to scrape them. There are various methods for scraping data, but the most straightforward is to look for a site's Application Programming Interface (API) and extract JavaScript Object Notation (JSON), if available. This JSON can then be read into Python as a dictionary using the json module.
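As a quick illustration (a minimal sketch -- the User-Agent string is a placeholder), fetching and parsing a subreddit's JSON feed looks something like this:

import json
import requests

# Reddit rejects requests with default user agents, so supply a custom one
headers = {'User-Agent': 'subreddit-classifier-demo'}
response = requests.get('https://www.reddit.com/r/MensLib.json', headers=headers)

data = json.loads(response.text)  # the JSON parsed into a Python dictionary
titles = [post['data']['title'] for post in data['data']['children']]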

Luckily, Reddit exposes a JSON feed for each subreddit that gives us access to its top-ranked 1,000 posts. Off the bat, I knew that I wanted to track the volume of posts in each subreddit, which meant finding another method to scrape that much data. Additionally, I wanted more information, such as the number of upvotes and comments on each post.

To accomplish this, I took a dual approach. I used the Python Reddit API Wrapper (PRAW) - a third-party tool - to extract the 1,000 most popular posts from each subreddit, and the Pushshift API wrapper (PSAW) to extract all posts between Aug 1 and Nov 28. A rough sketch of this setup follows the list below.

This allowed me to capture the following information:

  • All posts from each subreddit between Aug 1 - Nov 28
  • 1000 top posts from each subreddit including:
    • Date of post
    • Title of post
    • Text within post
    • Number of upvotes
    • Upvote ratio
    • Number of comments
    • Type of post (moderator/non-moderator post)
    • Permalink
    • Author of post
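For reference, a minimal sketch of the dual scraping approach (assuming PRAW and PSAW are installed; the API credentials below are placeholders):

import datetime as dt
import praw
from psaw import PushshiftAPI

reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
                     client_secret='YOUR_CLIENT_SECRET',
                     user_agent='subreddit-classifier')

# PRAW: the 1,000 most popular posts, with upvotes and comment counts
top_posts = [(p.created_utc, p.title, p.selftext, p.score, p.num_comments)
             for p in reddit.subreddit('MensLib').top(limit=1000)]

# PSAW: every post between Aug 1 and Nov 28
api = PushshiftAPI()
start = int(dt.datetime(2020, 8, 1).timestamp())
end = int(dt.datetime(2020, 11, 28).timestamp())
all_posts = list(api.search_submissions(after=start, before=end,
                                        subreddit='MensLib',
                                        filter=['title', 'selftext', 'num_comments']))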

Data Cleaning

Now that we have our data, the key question is how to make it interpretable to our computers. Because the text we just gathered is unstructured, we need to make sure it is clean and free of null values and duplicated entries. This also entails removing URL links and non-alphanumeric characters that won't be useful in a classification model.

In [47]:
# 75 mod posts vs 0 posts
menslib['distinguished'].isin(['moderator']).sum(), mensrights['distinguished'].isin(['moderator']).sum()
Out[47]:
(75, 0)
In [48]:
# Deal with null values
menslib['selftext'] = menslib['selftext'].fillna('')
mensrights['selftext'] = mensrights['selftext'].fillna('')
In [49]:
# Check for duplicates
duplicates = menslib[(menslib.duplicated(subset=['selftext'])) & (menslib['selftext'] != '')] \
                .sort_values(ascending=False, by='selftext') \
                ['title'].value_counts()

duplicates.head(4)
Out[49]:
Weekly Free Talk Friday thread!                       19
Tuesday Check In: How's Everybody's Mental Health?    17
Weekly Free Talk Friday Thread!                       14
Tuesday Check In: How's Everyone's Mental Health?     11
Name: title, dtype: int64
In [50]:
# No duplicates!
mensrights[(mensrights.duplicated(subset=['selftext'])) & (mensrights['selftext'] != '')] \
                .sort_values(ascending=False, by='selftext')['title'].value_counts()
Out[50]:
Series([], Name: title, dtype: int64)
In [51]:
menslib = menslib.drop_duplicates(subset=['title'])
In [52]:
# Check for any more duplicates
duplicates = menslib[(menslib.duplicated(subset=['selftext'])) & (menslib['selftext'] != '')] \
                .sort_values(ascending=False, by='selftext') \
                ['title'].value_counts()

duplicates
Out[52]:
Tuesday Check In: How's Everyone's Mental Health?    1
Tuesday Check-In: How's everyone's mental health?    1
Name: title, dtype: int64
In [53]:
# Dropping weekly threads that are mod posts
menslib = menslib.drop(menslib[menslib['title'].str.contains('Weekly Free Talk')].index)
menslib = menslib.drop(menslib[menslib['title'].str.contains('Tuesday Check')].index)
menslib = menslib.drop(menslib[menslib['title'].str.contains('NEW Resources of The Week Highlight')].index)

To conduct further analysis in areas such as word and ngram frequency, we'll first have to clean the data by removing http links and irrelevant non-alpha-numeric characters.

In [54]:
def drop_links(text):
    words = text.split(' ')
    words_to_sub = [w for w in words if 'http' in w]
    
    if words_to_sub:
        for w in words_to_sub:
            new_word = re.sub('http.*', '', w)
            # re.escape stops regex metacharacters common in URLs (e.g. '?', '+') from breaking the substitution
            text = re.sub(re.escape(w), new_word, text)
        return text
        
    # If no links in post, return text unchanged
    return text

After data scraping, I found that a few posts had HTML entities and emoticons that needed slightly more specialized treatment. For example, &#x200b -- the HTML entity for a zero-width space -- can't be dealt with simply by stripping non-letter characters: we'd still end up with a stray 'xb', which is not what we want. Here, we'll use RegEx to remove the unwanted HTML first.

In [55]:
def preprocessing(text):
            
    # Remove new lines
    text = text.replace('\n',' ').lower()
    
    # Remove emoticons
    text = re.sub(':d', '', str(text)).strip()
    text = re.sub(':p', '', str(text)).strip()
    
    # Remove HTML markers and punctuation
    text = re.sub('xa0', '', str(text)).strip()
    text = re.sub('x200b', '', str(text)).strip()
    text = re.sub(r'[^a-zA-Z\s]', '', str(text)).strip()  # raw string avoids invalid escape warnings
    
    return text
In [56]:
# Removing new lines and punctuation for menslib
menslib['cleaned_title'] = menslib['title'].apply(preprocessing)
menslib['cleaned_selftext'] = menslib['selftext'].apply(preprocessing)

# Removing new lines and punctuation for mensrights
mensrights['cleaned_title'] = mensrights['title'].apply(preprocessing)
mensrights['cleaned_selftext'] = mensrights['selftext'].apply(preprocessing)

# Dropping links
menslib['cleaned_selftext'] = menslib['cleaned_selftext'].apply(drop_links)
mensrights['cleaned_selftext'] = mensrights['cleaned_selftext'].apply(drop_links)

# Combining title and selftext
menslib['combi_text'] = menslib['cleaned_title'] + ' ' + menslib['cleaned_selftext']
mensrights['combi_text'] = mensrights['cleaned_title'] + ' ' + mensrights['cleaned_selftext']

Lemmatization

It's worth noting here that I also tried out lemmatization, which reduces inflectional forms and derivationally related forms of a word to a common base form. I ultimately didn't use it, however, as it led to decreased accuracy during model testing. Existing research has also generally shown that lemmatization is unnecessary, especially for languages with a simple morphology like English. Given that we're trying to classify posts, even small differences in language could make a large difference to the accuracy of our model.
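For context, a minimal sketch of the lemmatization that was tested (assuming NLTK with its wordnet and stopwords corpora downloaded; not used in the final model):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def lemmatize_text(text):
    # Drop stop words, then reduce each remaining token to its base form
    return ' '.join(lemmatizer.lemmatize(w) for w in text.split()
                    if w not in stop_words)

lemmatize_text('boys talking about their lives')  # -> 'boy talking life'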

In [57]:
# Comparing lemmatized vs regular text 
# We can see that a lot of nuance within the post has been lost after lemmatization and removal of stop words
print(menslib['combi_text'][0], '\n')
print(menslib['lem_combi_text'][0])
happy international mens day from menslib officially the theme is better health for men and boys however us at the mod team feel thats a bit corporate and sanitised in how it sounds in our opinion the real theme has to be the issue that we have all been thinking about every single day since march of this year the current global pandemic  so lets all kick off our shoes let our hair down a bit and talk about this year how its affected our lives our mental health and what were doing to cope remember that we all need to be looking after each other out there 

happy international men day menslib officially theme better health men boy mod team feel thats bit corporate sanitised sound opinion real theme issue thinking single day march year current global pandemic  let kick shoe let hair bit talk year affected life mental health doing cope remember need looking
In [41]:
def plotter_1(column):
    fig, ax = plt.subplots(1, 2, figsize=(12,6), sharey=True)
    ax = ax.ravel()
    g1 = sns.histplot(data=menslib, x=column, ax=ax[0], bins=20)
    mean_1 = menslib[column].mean()
    g1.set_title(f'r/MensLib (Mean: {round(mean_1)} {column})')
    ax[0].axvline(mean_1, ls='--', color='black')
    g2 = sns.histplot(data=mensrights, x=column, ax=ax[1], bins=20, color='darkorange')
    mean_2 = mensrights[column].mean()
    g2.set_title(f'r/MensRights (Mean: {round(mean_2)} {column})')
    ax[1].axvline(mean_2, ls='--', color='black')
    plt.suptitle(f'{column.capitalize()}', fontsize=20)
    plt.tight_layout()
In [79]:
plotter_1('upvotes')

While both subreddits have a large number of posts with fewer than 2,000 upvotes, the majority of posts from r/MensRights have fewer than 1,000. Posts in r/MensLib are more evenly distributed, though the distribution is still right-skewed. r/MensRights has an outlier post that reached over 20,000 upvotes.

In [60]:
mensrights[mensrights['upvotes'] > 20_000]
Out[60]:
date title selftext is_self upvotes upvote_ratio n_comments distinguished permalink author cleaned_title cleaned_selftext combi_text lem_title lem_selftext lem_combi_text
673 7/11/2020 22:23 Petition to have Amber Heard removed as the L'... False 23012 0.91 619 NaN /r/MensRights/comments/jprmy5/petition_to_have... sj20442 petition to have amber heard removed as the lo... petition to have amber heard removed as the lo... petition amber heard removed loreal spokesperson petition amber heard removed loreal spokesperson

This post is likely to have made it to the front page of Reddit and resonated with readers there. Public sentiment of Johnny Depp has been mostly positive since the trial, with many redditors upset over Johnny Depp's treatment by the media and companies like Disney. This is a rallying issue for r/MensRights, which focuses heavily on false accusations of rape or sexual assault made against men.

In [44]:
plotter_1('n_comments')
plt.suptitle('Number of Comments', fontsize=18)

The community of r/MensLib seems to be a lot more active than that of r/MensRights, despite the higher subscriber count of the latter.

Summary of Differences

'Hot' or popular posts from r/MensRights have an average of 233 upvotes, with around 25 comments per post. In comparison, posts from r/MensLib have an average of 595 upvotes, and 91 comments per post. Based on this, it seems that the r/MensLib community is a lot more active.

However, it's worth noting that r/MensRights has 286,000 subscribers compared to the 152,000 subscribers of r/MensLib, so we can't base popularity off these statistics alone.

Exploratory Data Analysis

In [62]:
fig, ax = plt.subplots(1, 2, figsize=(12,6), sharey=True)
ax = ax.ravel()
g1 = sns.countplot(data=menslib, x=menslib['is_self'].astype(int), ax = ax[0], palette='Blues')
g1.set_xticklabels(['Image Post', 'Text Post'])
g1.set_title('r/MensLib', fontsize=16)
g1.set_xlabel('')
g2 = sns.countplot(data=mensrights, x=mensrights['is_self'].astype(int), ax = ax[1], palette='Oranges')
g2.set_xticklabels(['Image Post', 'Text Post'])
g2.set_title('r/MensRights', fontsize=16)
g2.set_ylabel('')
g2.set_xlabel('')
plt.suptitle('Type of Post', fontsize=20)
plt.tight_layout()

A large number of posts in r/MensRights are not text posts, and r/MensLib also has a substantial number of non-text posts. Since we created a feature that combines title text and post text (combi_text), this shouldn't pose much of an issue for our model later on.

In [63]:
def subplot_histograms(col, graph_title):
    fig, ax = plt.subplots(1, 2, figsize=(12,6), sharey=True)
    ax = ax.ravel()
    
    # Plot first df   
    g1 = sns.histplot(data=menslib, x=menslib[col].str.len(), ax = ax[0], bins=25)
    mean_1 = menslib[col].str.len().mean()
    ax[0].axvline(mean_1, ls='--', color='black')
    g1.set_title(f'r/MensLib (Mean: {round(mean_1)} characters)')  # str.len() counts characters, not words
    g1.set_xlabel(f'Length of {col.capitalize()}')
    
    # Plot second df
    g2 = sns.histplot(data=mensrights, x=mensrights[col].str.len(), ax = ax[1], bins=25, color='darkorange')
    mean_2 = mensrights[col].str.len().mean()
    ax[1].axvline(mean_2, ls='--', color='black')
    g2.set_title(f'r/MensRights (Mean: {round(mean_2)} characters)')
    g2.set_xlabel(f'Length of {col.capitalize()}')
    plt.suptitle(graph_title, fontsize=20)
    plt.tight_layout()
In [64]:
subplot_histograms('title', 'Average Title Length')

r/MensLib generally has slightly shorter title lengths than r/MensRights, but the average post length in r/MensLib is much greater than the average post length in r/MensRights. The differences here could make these features useful for our classifier model.

Analyzing Ngrams

To analyze the language within each subreddit, I took a bag-of-words approach, where text is broken into word-level units (tokenized) and each word is counted according to how often it appears in a post.

This approach is called a 'bag' of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document. Within this process, we'll also look to remove frequent words that don’t contain much information, like 'a,' 'of,' etc. These are known as stop words.

To do this, we'll be using CountVectorizer(), a scikit-learn tool that tokenizes a collection of text documents, builds a vocabulary of known words, and encodes new documents using that vocabulary. We'll also use it to exclude stop words from our results.
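As a small illustration on a hypothetical two-document corpus:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['men need a space to talk', 'a space to talk about mens issues']
cvec = CountVectorizer(stop_words='english')
bag = cvec.fit_transform(docs)

print(cvec.get_feature_names())  # the learned vocabulary
# ['issues', 'men', 'mens', 'need', 'space', 'talk']
print(bag.toarray())             # word counts per document
# [[0 1 0 1 1 1]
#  [1 0 1 0 1 1]]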

In [65]:
# Create function to get top words
def plot_top_words(df, col, n, n_gram_range, title, palette='tab10'):
    def get_top_n_words(corpus, n=n, k=n_gram_range):     
        vec = CountVectorizer(ngram_range=(k,k), stop_words='english').fit(corpus)     
        bag_of_words = vec.transform(corpus)     
        sum_words = bag_of_words.sum(axis=0)      
        words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]    
        words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True) 
        return words_freq[:n]
    temp_df = pd.DataFrame(data=get_top_n_words(df[col], n), columns=['word','freq'])
    plt.figure(figsize=(10,6))
    sns.barplot(data=temp_df, y='word', x='freq', palette=palette)
    plt.ylabel('')
    plt.xlabel('')
    plt.title(title, fontsize=18)
In [66]:
plot_top_words(menslib,'cleaned_title', 10, 1, 'r/menslib Top 10 Words', 'Blues_r')
In [37]:
stop_words = set(CountVectorizer(stop_words = 'english').get_stop_words())

In [38]:
plot_top_words(mensrights,'cleaned_title', 10, 1, 'r/mensrights Top 10 Words', 'Reds_r')

As expected, there's a large amount of overlap between the top words in r/MensRights and r/MensLib -- namely, a focus on various male nouns, e.g. men, mens, male, man. However, there's a much stronger focus on masculinity within r/MensLib. The subreddit also encompasses children's issues, and focuses on issues affecting the physical and emotional health of men.

Within r/MensRights, the focus is much more topical. Topics that are common in the sub include the celebration of International Men's Day and the Johnny Depp & Amber Heard court case.

Top Bigrams

In [39]:
plot_top_words(menslib,'cleaned_selftext', 10, 2, 'r/menslib Top Bigrams', palette='Blues_r')
In [40]:
plot_top_words(mensrights,'cleaned_selftext', 10, 2, 'r/mensrights Top Bigrams', palette='Reds_r')

The relationship between men and women is an important topic in each subreddit, though we can now see more of the key differences based on the bigrams. The bigrams used within r/MensLib (e.g. don't know, don't think, don't want) along with their frequency (>40) are a likely indicator that the community is somewhat introspective.

For r/MensRights, key topical issues include men paying child support and sex discrimination (Title IX is a US federal law that prohibits discrimination based on sex). Men's rights activists believe that Title IX is an unfair law that denies proper due process and is used indiscriminately against men.

Top Trigrams

In [41]:
plot_top_words(menslib, 'cleaned_selftext', 10, 3, 'r/menslib Top Trigrams', 'Blues_r')
In [42]:
plot_top_words(mensrights,'cleaned_selftext', 10, 3, 'r/mensrights Top Trigrams', palette='Reds_r')

The focus on the emotional health of men is quite clear within r/MensLib. The frequent use of pronouns within the trigrams also suggests that personal issues and viewpoints are commonly discussed within the subreddit. In comparison, the trigrams for r/MensRights are factual and once again topical.

International Men's Day also seems to be a huge topic for r/MensRights. Within r/MensLib, the moderation team created a single post that fulfilled the purpose of celebrating International Men's Day, which is why we don't see mentions of it within our bigrams or trigrams.

Feminism as a movement also seems to be viewed quite negatively within the subreddit. This alludes to differences in sentiment between the two subreddits that we'll explore further down.

Analyzing Volume of Posts

In [50]:
fig, ax = plt.subplots(1, 1, figsize=(16,8))
# sort_index() puts the dates in chronological order before plotting
menslib_vdf['date'].value_counts().sort_index().plot()
mensrights_vdf['date'].value_counts().sort_index().plot()
plt.xlabel('Date')
plt.ylabel('Number of Posts')
plt.title('Overall Post Volume (Aug 2020 - Dec 2020)', fontsize=18)
legend = plt.legend(title='Subreddit', loc='best', labels=['r/MensLib', 'r/MensRights'], fontsize=12)
legend.get_title().set_fontsize('12')

The difference between the number of posts within each subreddit is quite obvious here. While arguably r/MensLib has higher 'quality' posts (as indicated by higher upvotes and number of comments), r/MensRights has people posting much more frequently. This could be due to the strict moderation policy of r/MensLib which discourages users from making low-quality posts.

In [51]:
# There was a noticeable spike in posts from Nov 19 - Nov 21 -- this was likely due to International Men's Day on 19 Nov
mensrights_vdf['date'].value_counts().head()
Out[51]:
2020-11-20    114
2020-11-19     95
2020-08-14     86
2020-09-12     85
2020-09-19     85
Name: date, dtype: int64

Deleted Posts

In [53]:
fig, ax = plt.subplots(1, 1, figsize=(16,8))
menslib_vdf['date'].value_counts().sort_index().plot(color='dodgerblue')
menslib_vdf[(menslib_vdf['selftext'] == '[removed]') & (menslib_vdf['n_comments'] < 3)] \
            ['date'].value_counts().sort_index().plot(color='lightgrey')
plt.title('r/MensLib: Deleted Posts versus Actual Posts', fontsize=18)
plt.ylabel('Post Frequency')
plt.ylim(0, 60)
legend = plt.legend(title='Post Type', loc='best', labels=['Regular (r/MensLib)', 'Deleted'], fontsize=12)
legend.get_title().set_fontsize('12')
In [54]:
fig, ax = plt.subplots(1, 1, figsize=(16,8))
mensrights_vdf['date'].value_counts().sort_index().plot(color='darkorange')
mensrights_vdf[(mensrights_vdf['selftext'] == '[removed]') & (mensrights_vdf['n_comments'] < 3)] \
            ['date'].value_counts().sort_index().plot(color='lightgrey')
plt.title('r/MensRights: Deleted Posts versus Actual Posts', fontsize=18)
plt.ylabel('Post Frequency')
legend = plt.legend(title='Post Type', loc='best', labels=['Regular (r/MensRights)', 'Deleted'], fontsize=12)
legend.get_title().set_fontsize('12')

A large number of the r/MensLib posts we scraped with PSAW have been deleted or removed. This is due to the subreddit's strict automoderation policy, as well as active efforts on the part of moderators to prevent the subreddit from being flooded with irrelevant content. In comparison, r/MensRights has far fewer removed or deleted posts, which could be due to more lenient moderation.

Preparation for Sentiment Analysis

In [56]:
# Setting up target column
menslib['is_menslib'] = 1
mensrights['is_menslib'] = 0
In [58]:
combined_df['og_text'] = combined_df['title'] + '\n\n' + combined_df['selftext']
In [59]:
# Final check for null values
combined_df.isnull().sum()[combined_df.isnull().sum() > 0]
Out[59]:
distinguished    1655
author             41
dtype: int64
In [60]:
combined_df.to_csv('./datasets/combined_df.csv', index=False)

Sentiment Analysis

There seems to be a clear difference in sentiment between the two subreddits, but how can we prove this? In the NLP field, there are various tools that try to gauge sentiment. The tool I'm using is the NLTK sentiment analyzer VADER (Valence Aware Dictionary and sEntiment Reasoner), which pairs a lexicon (a dictionary of sentiments, in this case) with a simple rule-based model for general sentiment analysis.

VADER gives us positivity and negativity scores that are normalized to a range of -1 to 1. It can also pick up sentiment from emoticons (e.g. :-D), sentiment-related acronyms (e.g. LoL) and slang (e.g. meh), where other algorithms typically struggle. This makes it a good fit for the casual language used across both subreddits.

In [61]:
sentiment_df = combined_df[['combi_text', 'og_text', 'is_menslib']].reset_index(drop=True)

By calling SentimentIntensityAnalyzer() on each post, we can retrieve the proportion of text that falls into the negative, neutral or positive category. From this, we can also retrieve the compound score of each post: the sum of all the lexicon ratings, normalized to between -1 and 1. In short, compound scores are the sum of valence computed from VADER's internal heuristics and sentiment lexicon.

Based on the compound score, we can then create a categorical feature - comp_score - that reflects whether a post is positive or negative overall.

In [62]:
# Initialise VADER
sid = SentimentIntensityAnalyzer()

# Analyzing sentiment with VADER
sentiment_df['scores'] = sentiment_df['combi_text'].apply(lambda x: sid.polarity_scores(x))
sentiment_df['compound']  = sentiment_df['scores'].apply(lambda score_dict: score_dict['compound'])
sentiment_df['comp_score'] = sentiment_df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')
In [63]:
sentiment_df['subreddit'] = sentiment_df['is_menslib'].apply(lambda x: 'r/MensLib' if x == 1 else 'r/MensRights')
In [64]:
sentiment_df.head()
Out[64]:
combi_text og_text is_menslib scores compound comp_score subreddit
0 happy international mens day from menslib offi... Happy International Men's Day from MensLib\n\n... 1 {'neg': 0.014, 'neu': 0.926, 'pos': 0.059, 'co... 0.7184 pos r/MensLib
1 red hot chilli piper women grope me under my k... Red Hot Chilli Piper: Women grope me under my ... 1 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... 0.0000 pos r/MensLib
2 black male teachers can have a profound impact... Black male teachers can have a profound impact... 1 {'neg': 0.167, 'neu': 0.833, 'pos': 0.0, 'comp... -0.3400 neg r/MensLib
3 musing on masculinity and coopetition the topi... Musing on masculinity and coopetition.\n\nThe ... 1 {'neg': 0.087, 'neu': 0.694, 'pos': 0.219, 'co... 0.9993 pos r/MensLib
4 trans man loses uk legal battle to register as... Trans man loses UK legal battle to register as... 1 {'neg': 0.318, 'neu': 0.584, 'pos': 0.097, 'co... -0.5267 neg r/MensLib
In [65]:
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
ax = ax.ravel()
sns.histplot(sentiment_df[sentiment_df['is_menslib'] == 1]['compound'].values, ax = ax[0])
sns.histplot(sentiment_df[sentiment_df['is_menslib'] == 0]['compound'].values, color='darkorange', ax = ax[1])
mean_1 = sentiment_df[sentiment_df['is_menslib'] == 1]['compound'].mean()
mean_2 = sentiment_df[sentiment_df['is_menslib'] == 0]['compound'].mean()
ax[0].set_title(f'r/MensLib (Mean: {(mean_1):.2f})', fontsize=18)
ax[1].set_title(f'r/MensRights (Mean: {(mean_2):.2f})', fontsize=18)
ax[0].axvline(mean_1, ls='--', color='black')
ax[1].axvline(mean_2, ls='--', color='black')
ax[0].set_xlabel('Sentiment')
ax[0].set_ylabel('Post Frequency')  # the histogram counts posts, not words
ax[1].set_xlabel('Sentiment')
ax[1].set_ylabel('')
plt.suptitle('Sentiment Analysis', fontsize=20);

Our initial hypothesis seems to have been largely correct -- the sentiment within r/MensLib seems to be much more positive. Most posts tend towards the right side of the graph, with a large number of posts having a high positive sentiment score between 0.8 and 1.0. In comparison, r/MensRights has a large number of posts that are strongly negative, and relatively fewer posts that are strongly positive.

In [66]:
# We can also see that the average sentiment within r/MensLib is much higher.
sentiment_df[sentiment_df['is_menslib']==1]['compound'].mean(), sentiment_df[sentiment_df['is_menslib']==0]['compound'].mean()
Out[66]:
(0.24337977528089885, -0.10875671834625329)
In [67]:
# Example of a strongly negative post within r/MensLib.
# As a whole, this post isn't actually 'negative' despite the negative sentiment that VADER is detecting.
temp_df = sentiment_df[(sentiment_df['subreddit'] == 'r/MensLib') 
                       & (sentiment_df['compound'] < -0.9)]['og_text'].reset_index()
print(temp_df['og_text'][6])
As a man, I cry in front of people whenever I feel like it. It's refreshing

I was fired yesterday and cried in front of my ex-boss.

If I feel like crying (i.e. when I'm suffering) I bloody well will do so.

Crying is a natural defense mechanism against suffering. Denying men the ability to cry in front of people, in this day and age, just completely obliterates their ability to process their emotions. I have a theory that if we don't cry, we then must put that bad energy elsewhere, like into aggression and latent resentment, which will inevitably hurt others.

There are two ways to cry:

* You can cry involuntarily, which leads to looking like you're constantly holding it back, ashamed and hyper-aware of what everyone thinks of the crying, which will just make things even more awkward
* You can cry completely unashamedly, showing that you do not care on a meta-level what anyone thinks of your crying or how you look.

This isn't to advocate crying at spilt milk, or at every damn thing. But if your girlfriend breaks up with you, feel free to damn well cry in front of her. If you accidentally chop off your finger, feel free to cry out in pain and shed a freaking tear. If something makes you upset, do not hide that emotion. It isn't weakness. I'm amazed at the number of people in the world who think crying = undue helplessness or weakness. It is a RESPONSE. If anything it is the thing that signals the beginning of the fix to the problem.

Cry strongly, and completely, without a care of what others think of your tears. It will make you stronger.
In [68]:
# Example of a strongly negative or 'angry' post within r/MensRights.
temp_df2 = sentiment_df[(sentiment_df['subreddit'] == 'r/MensRights') 
                        & (sentiment_df['compound'] < -0.9)]['og_text'].reset_index()
print(temp_df2['og_text'][17])
Men's problems don't matter apparently

People (even other men) use "be a man" to shut up men who talk about their own problems, this makes them unable to get anything off their chest and can lead to suicide, self harm or violence.

More men die of COVID-19 but it affects women more.

Millions of men die in wars, have to watch their friend die in front of them and kill children in battle but women are the real survivors.

Men are expected to provide for the whole family, keep a job, fight for promotions, work on someone else's schedule and pay for everything but women do all the real work.

Women are valued for being alive while men have to work extremely hard to be valued but women have such a hard life and men have it easy.

My point is that men are expected to always deal with their own problems because "men never need help". **Men *do* have problems, you will figure out that every man has problems if you let him speak.**

Sentiment Visualization with Scattertext

To get a better grasp of the words and the sentiment behind them across both subreddits, we'll use a tool called Scattertext, which helps visualize which words and phrases are more characteristic of a category than others. Below, we'll visualize sentiment across both subreddits and see which words are the most 'positive' and 'negative'.

Scattertext uses scaled f-score (the harmonic mean between precision and recall), which takes into account category-specific precision and term frequency when ranking words. A detailed explanation of the formula behind scaled f-score can be found here.
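For reference, the standard F-score is the harmonic mean of a precision $P$ and recall $R$:

$$F = \frac{2PR}{P + R}$$

In Scattertext's scaled variant, $P$ and $R$ are (roughly speaking) replaced by normalized versions of a term's category-specific precision and its frequency within the category.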

In this case, we'll be looking to compare positive and negative sentiment posts based on our sentiment scores created through VADER. While a word may appear frequently in either the negative or positive category, Scattertext can use scaled f-score to detect whether a particular term is more characteristic of a particular category.

In [69]:
nlp = spacy.load('en')
In [71]:
# Build corpus for Scattertext sentiment analysis
corpus = st.CorpusFromPandas(sentiment_df, category_col='comp_score', text_col='combi_text', nlp=nlp).build()
In [72]:
html = st.produce_scattertext_explorer(corpus, 
                                       category='pos', 
                                       category_name='Positive', 
                                       not_category_name='Negative',
                                       width_in_pixels=1000, 
                                       metadata=sentiment_df['subreddit'],
                                       save_svg_button=True)
In [73]:
html_file_name = "Project_3_Sentiment_Analysis.html"
open(html_file_name, 'wb').write(html.encode('utf-8'))
Out[73]:
2767384

The blue dots represent words that have been ranked as positive, while the red dots represent words that have been ranked as negative. As a general observation, the words in blue seem to be used more frequently within r/MensLib, while the words in red are used more frequently within r/MensRights.

As an example of how Scattertext works, take the word 'feminist'. While the word is frequent across both the 'negative' and 'positive' categories, Scattertext concludes that 'feminist' belongs in the 'negative' category due to its scaled f-score of -0.1. This means that the word generally appears more in negative posts than positive ones.

The interactive version of this chart can be viewed within your browser here.

We can think of this particular map as a sentiment analysis chart of gender issues facing men. While this chart isn't perfect, it reveals some interesting insights in terms of what both subreddits talk about. Words like fatherhood, male friendship and role models appear on the positive side of the spectrum, while words like sexual assault, victim and domestic violence appear on the negative end of the spectrum.

In [70]:
sentiment_df['is_menslib'].corr(sentiment_df['compound'])
Out[70]:
0.2567633340564021

A moderate positive correlation does exist between sentiment and whether a post comes from r/MensLib or r/MensRights, supporting our initial hypothesis that r/MensLib is a more 'positive' community - as is_menslib increases from 0 (r/MensRights) to 1 (r/MensLib), sentiment scores tend to increase.

We've also seen that VADER is limited: it struggles to distinguish posts that merely contain negative words from posts whose overall message is actually negative.

Model Selection

The next step that we'll cover below is selecting and tuning a model that can help predict where each post comes from. This is, in effect, a binary classification problem. To find the best model to use here, we'll carry out the following steps:

  1. Run a Train-Test-Split on our data
  2. Transform data using a vectorizer
  3. Fit model to training data
  4. Generate predictions using test data
  5. Evaluate model based on various evaluation metrics (accuracy, precision, recall, ROC-AUC)
  6. Select the best model and tune hyper-parameters

Besides CountVectorizer(), we'll also be using TfidfVectorizer(). TfidfVectorizer() is similar to CountVectorizer(), except that it also accounts for how widespread each word is across our data: words that appear in many posts are downweighted, while rarer words are upweighted.
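A tiny illustration on a hypothetical mini-corpus, where 'men' appears in every post:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['men and feminism', 'men and depp', 'men and masculinity']

# Raw counts treat every word equally
print(CountVectorizer(stop_words='english').fit_transform(docs).toarray())

# TF-IDF downweights the ubiquitous 'men' (~0.51 per row) relative to
# each post's unique word (~0.86)
print(TfidfVectorizer(stop_words='english').fit_transform(docs).toarray().round(2))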

We'll look to test a range of classification techniques including Logistic Regression, Random Forest, Boosting, Multinomial Naive Bayes classification and Support Vector Machine (SVM) classification.

Accuracy and F-score will be our main metrics here, given that we're not particularly concerned with minimizing either false negatives or false positives specifically -- ideally we'd like to minimize both as far as possible.

Baseline Model

To have something to compare our model against, we can use the normalized value counts of y, i.e. the share of each class within our target. This represents the simplest possible model, where always predicting the majority class (r/MensLib) gives us a 53.5% chance of classifying a post correctly.

In [6]:
# Baseline
y = combined_df['is_menslib']
y.value_counts(normalize=True)
Out[6]:
1    0.534856
0    0.465144
Name: is_menslib, dtype: float64

Train Test Split

I chose not to drop particular words like 'men', 'rights', 'lib/liberal' or 'menslib' as stop words here, as I felt both subreddits were likely to discuss these topics independently. My examination of the data also didn't reveal any of these labels as a dead giveaway.

In [ ]:
X = combined_df['combi_text']
y = combined_df['is_menslib']
In [ ]:
# Split our data into train and test data. We're stratifying here to ensure that the train and test sets 
# have approximately the same percentage of samples in order to avoid imbalanced classes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

Model Preparation

Here, I instantiated a range of different vectorizers and models. To simplify my workflow, I created a function around sklearn's Pipeline tool, which allows for easy fitting and transformation of data.

In [10]:
# Instantiate vectorizers
vectorizers = {'cvec': CountVectorizer(),
               'tvec': TfidfVectorizer(),
               'hv': HashingVectorizer()}
In [11]:
# Instantiate models
models = {'lr': LogisticRegression(max_iter=1_000, random_state=42),
          'rf': RandomForestClassifier(random_state=42),
          'gb': GradientBoostingClassifier(random_state=42),
          'et': ExtraTreesClassifier(random_state=42),
          'ada': AdaBoostClassifier(random_state=42),
          'nb': MultinomialNB(),
          'svc': SVC(random_state=42)}
In [12]:
# Function to run model -- input vectorizer and model
def run_model(vec, mod, vec_params={}, mod_params={}, grid_search=False):
    
    results = {}
    
    pipe = Pipeline([
            (vec, vectorizers[vec]),
            (mod, models[mod])
            ])
    
    if grid_search:
        gs = GridSearchCV(pipe, param_grid = {**vec_params, **mod_params}, cv=5, verbose=1, n_jobs=-1)
        gs.fit(X_train, y_train)
        pipe = gs
        
    else:
        pipe.fit(X_train, y_train)
    
    # Retrieve metrics
    results['model'] = mod
    results['vectorizer'] = vec
    results['train'] = pipe.score(X_train, y_train)
    results['test'] = pipe.score(X_test, y_test)
    predictions = pipe.predict(X_test)
    results['roc'] = roc_auc_score(y_test, predictions)
    results['precision'] = precision_score(y_test, predictions)
    results['recall'] = recall_score(y_test, predictions)
    results['f_score'] = f1_score(y_test, predictions)
    
    if grid_search:
        tuning_list.append(results)
        print('### BEST PARAMS ###')
        display(pipe.best_params_)
        
    else:
        eval_list.append(results)
    
    print('### METRICS ###')
    display(results)
    
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    print(f"True Negatives: {tn}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Positives: {tp}")
    
    return pipe
In [68]:
# Create list to store model testing results
eval_list = []

Numeric Classifier

As there were significant differences in the numerical features of the posts extracted from both subreddits (e.g. title/post length and number of comments), I wanted to test a model built only on these features. Surprisingly, this model performed pretty well and achieved an accuracy of 66.8% on the test data.

When sentiment information (the compound score from VADER) was included, accuracy increased substantially to 71.2% on the test data. Clearly, sentiment is an important numerical feature for classifying our particular subreddit posts.

In [14]:
# Character length of each post's title and selftext
# (len() on a whole Series returns the row count, not per-post lengths)
combined_df['title_len'] = combined_df['title'].str.len()
combined_df['selftext_len'] = combined_df['selftext'].str.len()
combined_df['compound'] = sentiment_df['compound']
In [15]:
X_bm = combined_df[['upvotes', 'n_comments', 'title_len', 'selftext_len', 'compound']]
y_bm = combined_df['is_menslib']
In [16]:
# Split our data into train and test data
X_bm_train, X_bm_test, y_train, y_test = train_test_split(X_bm, y_bm, test_size=0.3, stratify=y_bm, random_state=42)
In [17]:
logreg = LogisticRegression()
logreg.fit(X_bm_train, y_train)
Out[17]:
LogisticRegression()
In [18]:
logreg.score(X_bm_train, y_train)
Out[18]:
0.7199312714776632
In [19]:
logreg.score(X_bm_test, y_test)
Out[19]:
0.714
In [20]:
bm1_pred = logreg.predict(X_bm_test)
In [21]:
tn, fp, fn, tp = confusion_matrix(y_test, bm1_pred).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
True Negatives: 162
False Positives: 71
False Negatives: 72
True Positives: 195

Here, True Negatives are r/MensRights posts that were correctly classified by our model. True Positives are r/MensLib posts that were correctly classified by our model.

With sentiment scores excluded, our model did particularly badly in predicting r/MensRights posts, with 116 posts from r/MensRights classified as r/MensLib posts. With sentiment scores included, the overall f-score increased from 0.67 to 0.71.

Final Model Selection

Due to length limitations, I'll avoid going too far into my model selection and hyperparameter tuning process. Some of the models I tried included random forest and extra trees classifiers, AdaBoost and gradient boosting classifiers, a support vector machine classifier, and finally a multinomial naive Bayes classifier. I also tried out various vectorizers, including CountVectorizer, TfidfVectorizer and HashingVectorizer.

In [48]:
tuning_df = pd.DataFrame(tuning_list)
In [49]:
tuning_df.sort_values(by=['test', 'roc'], ascending=False).reset_index(drop=True)
Out[49]:
model vectorizer train test roc precision recall f_score
0 lr tvec 0.993133 0.834 0.830086 0.817241 0.887640 0.850987
1 svc tvec 0.972532 0.826 0.821503 0.806122 0.887640 0.844920
2 nb cvec 0.893562 0.824 0.824275 0.845560 0.820225 0.832700
3 lr cvec 0.987983 0.824 0.823456 0.837736 0.831461 0.834586

After trying out various model and vectorizer combinations, logistic regression with TfidfVectorizer returned the highest test accuracy -- in short, our model accurately predicts 83.4% of the test data based on our text features. The model also has the best ROC-AUC score, which we can interpret as this model being the best at distinguishing between the two classes. It does particularly well in terms of recall, with only 30 false negatives (posts predicted as r/MensRights but actually from r/MensLib).

While the Multinomial Naive Bayes model with CountVectorizer is better at minimizing false positives (predicted r/MensLib but actually r/MensRights posts), the Logistic Regression model with TfidfVectorizer still wins overall in terms of test accuracy and f-score.

To summarize, our final model:

  • uses Tfidf Vectorization with no max feature limit
  • includes only words or n-grams that appear in at least 4 posts
  • excludes stop words and ignores terms that appear in more than 20% of posts
  • uses Logistic Regression with Ridge regularization ($\alpha$ = 0.1 | C = 10)

Our model is still overfitting quite a bit as indicated by the large gap between training and test scores, but this seems to be the limit to which we can push our model.

AUC-ROC Curve

In [50]:
fig, ax = plt.subplots(1, 1, figsize=(12,10))
plot_roc_curve(cvec_lr_gs, X_test, y_test, ax=ax, name='LogisticRegression-CVEC(GS)', color='lightgrey')
plot_roc_curve(cvec_nb_gs, X_test, y_test, ax=ax, name='MultinomialNB-CVEC(GS)', color='lightgrey')
plot_roc_curve(tvec_svc_gs, X_test, y_test, ax=ax, name='SupportVectorClassifier-TVEC(GS)', color='lightgrey')
plot_roc_curve(tvec_lr_gs, X_test, y_test, ax=ax, name='LogisticRegression-TVEC(GS)', color='blue')
plt.plot([0, 1], [0, 1], color='black', lw=2, linestyle='--', label='Random Guess')
plt.legend()
Out[50]:
<matplotlib.legend.Legend at 0x1934e4785b0>

We can see that our chosen model, Logistic Regression with TfidfVectorizer, does well, as indicated by the ROC curve. The other classifiers all come pretty close, but our chosen model generally outperforms them at most decision thresholds, apart from the start of the curve. It also offers the best trade-off, reaching a true positive rate of ~0.85 at a false positive rate of ~0.2.

Model Insights

Each post has a probability assigned to it that determines whether our model classifies it as an r/MensLib or r/MensRights post. What we can do here is look at these probabilities and identify the top posts that the model believes to be most indicative of a particular subreddit.

In [51]:
tvec_lr_gs.best_estimator_
Out[51]:
Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=0.2, min_df=4, ngram_range=(1, 2),
                                 stop_words='english')),
                ('lr',
                 LogisticRegression(C=10, max_iter=1000, random_state=42))])
In [52]:
coefs = pd.DataFrame(tvec_lr_gs.best_estimator_.steps[1][1].coef_).T
coefs.columns = ['coef']
coefs['ngram'] = tvec_lr_gs.best_estimator_.steps[0][1].get_feature_names()
coefs = coefs[['ngram','coef']]
coefs = coefs.sort_values('coef', ascending=True)
In [53]:
top_mensrights_coefs = coefs.head(15).reset_index(drop=True)
top_menslib_coefs = coefs.tail(15).sort_values(by='coef', ascending=False).reset_index(drop=True)

By looking at the coefficients from our model, we can determine the words or n-grams most related to r/MensLib and r/MensRights. As we assigned r/MensRights posts to 0 and r/MensLib posts to 1, the words or n-grams with the lowest (most negative) coefficients are most predictive of r/MensRights, and those with the highest coefficients are most predictive of r/MensLib. In effect, our model has searched for words and n-grams that appear frequently in one subreddit relative to the other.

In [91]:
plt.figure(figsize=(10,10))
sns.barplot(data=top_mensrights_coefs, x=-top_mensrights_coefs['coef'], y='ngram', palette='Reds_r')
plt.ylabel('')
plt.title('Top 15 Ngrams Correlated with r/MensRights', fontsize=20);

There are extremely strong correlations around the terms feminists, feminism, female and feminist. This makes sense, given that the men's rights movement largely believes that feminism harms men by concealing discrimination and promoting gynocentrism. The top few words from the sub are therefore focused on the loss of rights and a push towards 'equality'. For r/MensRights, the case of Johnny Depp and Amber Heard represents a perceived 'imbalance of power' between the sexes, and serves to reinforce their belief that men are victims.

We also see other areas where r/MensRights believes that men are disadvantaged -- in court cases and on the issue of male circumcision. The International Conference on Men's Issues is also a point of interest here; it represents men coming together to take on the so-called 'specter' of feminism.

In [89]:
plt.figure(figsize=(10,10))
sns.barplot(data=top_menslib_coefs, x=top_menslib_coefs['coef'], y='ngram', palette='Blues_r')
plt.xticks(np.arange(0, 6, step=1))
plt.ylabel('')
plt.title('Top 15 Ngrams Correlated with r/MensLib', fontsize=20);

r/MensLib is much more focused on masculinity than on feminism. Mentions of feminism aren't central here, which makes sense given that r/MensLib is a pro-feminism sub that focuses on how masculinity, particularly toxic masculinity, can harm both men and women. The spectrum these words cover seems to be quite broad, spanning boys and men, including trans and gay men.

We can also see that, relative to r/MensRights, r/MensLib is much more focused on interpersonal relationships and personal issues.

Predictive Wordclouds

In [95]:
def plot_cloud(wordcloud):
    plt.figure(figsize=(20, 14))
    plt.imshow(wordcloud)
    plt.axis("off");
In [96]:
menslib_posts = list(combined_df[combined_df['is_menslib'] == 1]['combi_text'])
In [97]:
mensrights_posts = list(combined_df[combined_df['is_menslib'] == 0]['combi_text'])
In [100]:
menslib_wordcloud = WordCloud(
            width = 2000, height = 2000, random_state = 42,
            background_color = 'white',colormap='Blues',
            max_font_size = 900).generate(' '.join(menslib_posts))

plot_cloud(menslib_wordcloud)
plt.title('r/MensLib', fontsize=30)
#plt.savefig('MensLib', dpi=300)
In [99]:
mensrights_wordcloud = WordCloud(
            width = 2000, height = 2000, random_state = 42,
            background_color = 'white',colormap='Reds',
            max_font_size = 900).generate(' '.join(mensrights_posts))

plot_cloud(mensrights_wordcloud)
plt.title('r/MensRights', fontsize=30)
Out[99]:
Text(0.5, 1.0, 'r/MensRights')
In [62]:
predictions = pd.DataFrame(tvec_lr_gs.predict_proba(X))
predictions['combi_text'] = combined_df['combi_text']
In [63]:
predictions.columns = ['mensrights', 'menslib', 'combi_text']

Top Predicted Posts

In [64]:
# Top predicted r/mensrights posts
predictions.sort_values('mensrights', ascending=False)[:15]
Out[64]:
mensrights menslib combi_text
1208 0.998705 0.001295 sjw feminists are stealing international mens ...
1233 0.997787 0.002213 feminist hypocrisy and international mens day ...
1578 0.997508 0.002492 petition justice for johnny depp against domes...
1126 0.997295 0.002705 how amber heards abuse and accusations destroy...
1022 0.996742 0.003258 johnny depp is being blackmailed by amber heard
1608 0.996470 0.003530 petition to remove amber heard after she admit...
1485 0.995503 0.004497 complete silence on amber heards admission of ...
1537 0.994381 0.005619 johnny depp good amber heard bad
1185 0.994271 0.005729 the un celebrates international mens day by ma...
1545 0.994238 0.005762 johnny depp loses to amber heard where is the ...
1158 0.993843 0.006157 sent flowers on international mens day since i...
1239 0.993243 0.006757 go check google nothing for international mens...
1109 0.993216 0.006784 that johnny depp petition we all hopefully sig...
1177 0.992873 0.007127 happy mens day from a mens rights supporting g...
1297 0.992749 0.007251 what was the evidence cited in the court rulin...

We can see that our model has a pretty good grasp of typical r/MensRights posts here. Posts that mention Johnny Depp or Amber Heard are strongly predicted as r/MensRights posts, along with anything mentioning feminism or International Men's Day.

In [65]:
# Top predicted r/menslib posts
predictions.sort_values('menslib', ascending=False)[:15]
Out[65]:
mensrights menslib combi_text
642 0.000818 0.999182 im a trans man who struggles with being mascul...
113 0.000975 0.999025 i feel like theres two different pressures act...
866 0.001494 0.998506 on sex between trans and cis men questions po...
57 0.001683 0.998317 what my life as a trans woman has taught me ab...
256 0.002108 0.997892 toward positive masculinity there have been a ...
207 0.002198 0.997802 dating somebody new is showing me how rare it ...
510 0.002201 0.997799 men should not be shamed for liking feminine a...
50 0.003339 0.996661 sex negativity is not the answer to male liber...
552 0.003468 0.996532 a reflection during international mens health ...
68 0.003636 0.996364 to lgbtq men andor men who feel different beca...
773 0.003756 0.996244 mental health and coronavirus like many many o...
305 0.003892 0.996108 whats your opinion of narrow male sexual roles...
38 0.003933 0.996067 resources for raising boys i am a new father t...
533 0.004208 0.995792 book recommendations on vulnerability communic...
86 0.004538 0.995462 we cant have positive masculinity without affi...

We see that strongly predicted r/MensLib posts tend to mention the LGBTQ community -- this suggests that discussion of LGBTQ men's issues is largely absent from r/MensRights. We can probably assume that r/MensRights doesn't have a particularly intersectional community.

In [70]:
# False positives -- posts that model wrongly classified as r/menslib
wrong_predictions[(wrong_predictions['menslib'] > 0.85) & (wrong_predictions['pred'] == 1)]
Out[70]:
combined_text pred actual mensrights menslib
1130 creative men support group i have a weekly men... 1 0 0.133108 0.866892
1131 my year old dad opened up for the first time ... 1 0 0.091817 0.908183
1308 could hour shifts improve mens mental health ... 1 0 0.072995 0.927005
1482 what were men not allowed to do years ago for... 1 0 0.063357 0.936643
1527 must watch solid masculine life game masculini... 1 0 0.052861 0.947139
1550 do russian people support same sex marriage w... 1 0 0.132117 0.867883
In [71]:
for i in combined_df['og_text']:
    print_post('hour shifts', i)
Could 6 hour shifts improve men's mental health ?

[https://youtu.be/0abk5crGk2c](https://youtu.be/0abk5crGk2c)

Simple : Finland and Sweden are slowly moving towards a 6 hour shift workday while maintaining the 8 hour pay. Do you think this could help men's mental health ? Keep in mind companies see an increase in productivity and they could hire 30% more workers for the extra shift and employees are less stressed ! Don't believe me ... the company's owner own words ! [https://brath.com/why-we-started-with-6-hour-work-days/](https://brath.com/why-we-started-with-6-hour-work-days/)
-----
In [72]:
for i in combined_df['og_text']:
    print_post('old dad opened', i)
My 54 year old dad opened up for the first time in his life.

I am currently a college student, and I moved back home because of covid shutting down campus about 1.5 months ago. My dad has never been the type of person to discuss emotions, and we have always had a strained relationship because we dont quite understand each other's struggles. A few days ago, I went to the kitchen to get some water and saw he was awake watching TV at midnight. I swallowed my pride and asked "dad, how are you holding up."

I won't get into detail, but it is the first time in about 6 years that I have hugged my father, and we have developed a better understanding of each other. He had a lot to talk about, and I urge every man to ask their fathers (in an isolated environment) about their feelings. There is so much to be learned, and so much to be fixed.
-----
In [73]:
# False negatives -- posts that model wrongly classified as r/mensrights
wrong_predictions[wrong_predictions['mensrights'] > 0.85]
Out[73]:
combined_text pred actual mensrights menslib
144 the equal rights amendment i just watched a vi... 0 1 0.940003 0.059997
328 refutations of the alleged benefits of infant ... 0 1 0.915383 0.084617
393 the trial of johnny depp and amber heard hi al... 0 1 0.943406 0.056594
572 supreme court says federal law protects lgbtq ... 0 1 0.918251 0.081749
672 how concerning are the trump administrations n... 0 1 0.895248 0.104752
803 should men discuss feminism paul morrin 0 1 0.854151 0.145849
In [74]:
for i in combined_df['og_text']:
    print_post('the trial of johnny depp', i)
The Trial of Johnny Depp and Amber Heard

Hi all, first off, I wanted to say I love this group. Everyone here seems so level headed, engaging and takes time to read posts carefully. What a fabulous lot.

I have been thinking about the current domestic abuse trial between Amber Heard and Johnny Depp.
It is awful, hearing what allegedly happened between them, it is so toxic.

However, I have been wondering, I personally think this could be a large step in the right direction for male victims of domestic abuse. There is still a huge stigma attached to men who suffer from domestic violence, so as a result, so many stay silent. Its horrible that these poor men can speak out of fear of not being believed.

I have been thinking that perhaps seeing a man who is seen as so strong and the 'ideal man' can suffer from such atrocities, maybe more men will come forward about their abuse? 'If it can happen to Johnny Depp, it could happen to me.'

What do you think? I'd love to hear people's views.
I also am very aware that the trial is still on-going, and that no final decisions have been made yet.
-----

Model Limitations

As demonstrated by the false predictions above, our model does have some limitations, especially when it comes to predicting r/MensRights posts. Mentions of mental health throw off our model, even though mental health is also an issue pertinent to r/MensRights. When it comes down to it, both groups care deeply about similar issues facing men (e.g. male suicide, male parenting, male genital cutting), even if their approaches and beliefs are fundamentally different. Similarly, mentions of Amber Heard and Johnny Depp are disproportionately weighted and can easily throw the model off.

I also wouldn't be surprised if r/MensLib is sometimes critical of certain branches of feminism or feminists that espouse hate towards men as an entire group.

In today's day and age, gender issues continue to be extremely complicated, and even a well-trained model will still struggle to differentiate men and their beliefs that exist across a wide spectrum.

To further improve model accuracy, we'd ideally need to train our model to recognize slightly more abstract concepts such as the level of introspection or sentiment within a post. This would allow our model to deal with issues that are topical to r/MensRights (e.g. Johnny Depp and Amber Heard), classifying posts not just by the mention of a name, but also by syntactic patterns that suggest introspection, such as 'I have been thinking' or 'I'd love to hear people's views'. One possible direction is sketched below.
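As a rough sketch (not part of the original project -- the phrase list and class name are hypothetical), handcrafted introspection features could be appended to the TF-IDF matrix with sklearn's FeatureUnion:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Hypothetical list of introspective phrases, for illustration only
INTROSPECTIVE = ['i think', 'i feel', 'i have been thinking', 'id love to hear']

class IntrospectionCounter(BaseEstimator, TransformerMixin):
    """Counts introspective phrases in each (lowercased, cleaned) post."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[sum(text.count(p) for p in INTROSPECTIVE)]
                         for text in X])

pipe = Pipeline([
    ('features', FeatureUnion([
        ('tvec', TfidfVectorizer(stop_words='english')),
        ('introspection', IntrospectionCounter()),
    ])),
    ('lr', LogisticRegression(max_iter=1_000, random_state=42)),
])
# pipe.fit(X_train, y_train) would then learn from both feature sets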

Recommendations

Beyond easing the burden of moderators by giving them the ability to classify posts from two different subreddits based on their title and selftext, there are several other possible applications for this model.

By looking at the probabilities associated with each post, moderators can also understand the overall direction of their subreddit. It's often hard to trace the evolution of subreddits over time; however, by looking at the posts that have an extremely high classification probability (>0.99), moderators can see the language and topics that have become characteristic or emblematic of their community.

Depending on their objectives, moderators can then try to increase the diversity of topics within their subreddit, or attempt to refocus conversations that are straying from the subreddit's vision and objectives. A minimal sketch of this probability-based review follows.
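A minimal sketch of surfacing these 'emblematic' posts (reusing the fitted tvec_lr_gs pipeline and combined_df from earlier; the column names are illustrative):

import pandas as pd

probs = pd.DataFrame(tvec_lr_gs.predict_proba(combined_df['combi_text']),
                     columns=['p_mensrights', 'p_menslib'])
probs['title'] = combined_df['title'].values

# Posts the model is near-certain are characteristic of r/MensLib
emblematic = probs[probs['p_menslib'] > 0.99].sort_values('p_menslib', ascending=False)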

The sentiment analysis we implemented could also be useful to moderators here. For example, the moderators of r/MensLib are focused on maintaining positive and constructive discourse. Armed with the ability to monitor changes in sentiment over time, the moderation team can determine the overall 'mood' of their community and proactively work to address points of potential conflict within the community. A rough sketch of such monitoring closes out this writeup.
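For instance (a sketch under the assumption that post dates are available alongside the VADER scores computed earlier):

import pandas as pd

mood_df = sentiment_df.copy()
mood_df['date'] = pd.to_datetime(combined_df['date'].values)

# Average compound sentiment per week -- sustained dips may flag brewing conflict
weekly_mood = mood_df.groupby(pd.Grouper(key='date', freq='W'))['compound'].mean()
weekly_mood.plot(title='Weekly Average Sentiment')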