Using Machine Learning to predict University Confession Page post like count (Negative Binomial Regression)

Exefess

University confession pages have exploded in popularity in recent years. 19,526 people follow the Durham University page “Durfess”, which is 104% of Durham’s student population (2018/19).

Exeter University had the original “ExeHonestly” page shut down, and clones appeared and gained popularity within hours.

These pages are useful not just for students, who can post their frustrations, crushes and weird ideas; they could also be a goldmine for advertisers. If a brand is named in a popular post on one of these pages, nearly the whole student Facebook population is likely to see it (pages like this grow through friends tagging each other). Best of all, thanks to the anonymity afforded to everyone who posts to the page, it could be completely free for a marketing team to make a post that innocuously names their brand and propels it into a student population’s consciousness. It would therefore make sense that they would want to know how to make a successful post.

In this article, I will be using “Likes” as the measurement of success. There is probably a better way to combine reactions, comments and shares into a single measure of success, but that is beyond the scope of this article.


Collecting the data

To broaden our dataset and aid the analysis, I am going to analyse both “Exefess” and “Durfess” together. Anecdotally, the demographics (other than geography) of these pages are very similar, and any differences will be accounted for in the regression with an indicator variable.

I am using a Python module named “facebook-scraper” to scrape posts from these pages.

First, I am going to scrape 50,000 pages of each (which should be more than enough to capture every post), and mark each post with an indicator for which page it came from.

from facebook_scraper import get_posts
import pandas as pd

# Scrape each page and flag which page the posts came from
df_durfess = pd.DataFrame(get_posts("Durfess", pages=50000))
df_durfess['durfess'] = 1
df_exefess = pd.DataFrame(get_posts("ExeFess-110674570384545", pages=50000))
df_exefess['exefess'] = 1

Cleaning Data

Merging the datasets, removing newlines and stripping out hashtags (such as the ubiquitous #Durfess and #Exefess) from the post text. Other punctuation will be removed later.

import numpy as np

# Combine the two pages into one dataframe
df = pd.merge(df_durfess, df_exefess, how='outer')
# Strip leading/trailing whitespace and replace newlines with spaces
df = df.replace({r'\s+$': '', r'^\s+': ''},
                regex=True).replace(r'\n',  ' ', regex=True)
# Remove stray escape sequences and hashtags (e.g. #Durfess, #Exefess)
df.replace(r'\\s', '', regex=True, inplace=True)
df.replace(r'#.+?\b', "", regex=True, inplace=True)

We are going to remove any shared posts, because on these confession pages shared posts are adverts for t-shirts or shares of the submission link. They are not relevant to our analysis, so it makes sense to remove them all.

# Empty shared_text means the post is not a share; fill the missing indicator
# columns from the merge with 0 and keep only non-shared posts
df['shared_text'].replace('', np.nan, inplace=True)
df.fillna({'durfess': 0, 'exefess': 0, 'shared_text': 0}, inplace=True)
df = df[df.shared_text == 0]

Next, I am going to add indicator variables for whether the post contains a link or an image. We will then remove the images and links themselves; a future version of this analysis could analyse their contents.

# A missing value (None/NaN) means no link or image was attached
df['haslink'] = df['link'].apply(lambda x: 0 if pd.isna(x) else 1)
df['haspicture'] = df['image'].apply(lambda x: 0 if pd.isna(x) else 1)
Our dataset after cleaning and removing unneeded columns

Exploratory Data Analysis

The mean number of likes is about 58, but the max is 18,405. This, along with the 75th percentile, suggests some extreme values. (Comments are even more extreme, so we'll have to bear that in mind.)
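These summary statistics can be read straight off the dataframe; a quick sketch, assuming the cleaned dataframe df from above:

# Summary statistics for likes and comments to check for extreme values
print(df['likes'].describe())
print(df['comments'].describe())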

fig_bp, (ax1_bp, ax2_bp) = plt.subplots(1, 2, sharey=True, figsize=(15,8))
durfessdata = df[df['durfess'] == 1]
exefessdata = df[df['exefess'] == 1]
ax1_bp.boxplot(durfessdata['likes'])
ax2_bp.boxplot(exefessdata['likes'])
ax1_bp.set_title('Durfess')
ax2_bp.set_title('Exefess')
plt.show()

Boxplots

As you can see, the outlier is a big one, and comes from the Durfess data. Let’s look at these individual observations and see if there is anything special about them.

df[df['likes'] > 2500]

Obviously, the best way to amass a lot of likes is to post a picture of a cat in a magical hat during exam season. I am going to class this as an outlier and remove it.

df = df[df['likes'] < 3000].copy()

Extracting Latent Topics from Text

I am going to use an unsupervised LDA model to cluster the latent topics from the text, and see if there is any benefit in posting about a certain topic to increase the number of likes.

Text Cleaning

First, the text needs to be cleaned. There are many articles that go into more depth on this, so I will only gloss over it here.

from gensim.utils import simple_preprocess
import gensim
import re
data = df['post_text'].values.tolist()
data = [re.sub(r'#\S+ *', '', sent) for sent in data]  # remove any remaining hashtags
data = [re.sub(r'\s+', ' ', sent) for sent in data]    # collapse whitespace to single spaces
data = [re.sub(r"\'", "", sent) for sent in data]      # remove apostrophes
def prepare_data(data):
    for sentence in data:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data = list(prepare_data(data))

Send data to list, remove punctuation, and use gensim simple preprocess to process.

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 'post', 'page',
                   'use', 'tag', 'durfess', 'submit', 'exefess'])

Remove stopwords: these are very common words that are not relevant to the topic. The extra words added above would only be there because the posts are on a university confession page.

import spacy

# Remove stopwords from the tokenised posts
data = [[word for word in simple_preprocess(
    str(doc)) if word not in stop_words] for doc in data]

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])


def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    # Keep only the allowed parts of speech and reduce each token to its lemma
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in [
                         '-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out


data_lemmatized = lemmatization(
    data, allowed_postags=["NOUN", "VERB"])  # keep only nouns and verbs
df['post_text2'] = data_lemmatized

# Tokenise the lemmatized text, then build the dictionary and bag-of-words corpus
data_ready = [gensim.utils.simple_preprocess(doc) for doc in data_lemmatized]
dictionary = gensim.corpora.Dictionary(data_ready)
bow_corpus = [dictionary.doc2bow(doc) for doc in data_ready]

Lemmatize (reduce each word to its base form) and convert to a bag of words and dictionary.

Calculate the ideal number of topics through coherence.

from gensim.models import CoherenceModel


def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.LdaMulticore(
            corpus, num_topics=num_topics, id2word=dictionary, passes=10, workers=2)
        model_list.append(model)
        coherencemodel = CoherenceModel(
            model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values


model_list, coherence_values = compute_coherence_values(
    dictionary=dictionary, corpus=bow_corpus, texts=data_ready, start=5, limit=100, step=2)

Test the LDA model for coherence with a different number of topics each time. (This takes a while).

Selecting the best model.

limit = 100
start = 5
step = 2
x = range(start, limit, step)
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
    
fig_ch, ax_ch=plt.subplots()
ax_ch.plot(x, coherence_values)

The optimal number of topics seems to be 7.

optimal_model_number = 1  # index into model_list; with start=5 and step=2, x[1] = 7 topics
optimal_model = model_list[optimal_model_number]
num_topics = x[optimal_model_number]
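To sanity-check this choice, we can print the top keywords of each topic in the chosen model; a minimal sketch using gensim's print_topics:

# Top keywords for each topic in the chosen LDA model
for topic_id, keywords in optimal_model.print_topics(num_words=8):
    print(topic_id, keywords)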

This next bit (as well as a lot of the above) is taken from this great blog post. It is a very elegant way of turning our LDA model results back into a dataframe that we can use for our regression analysis.

def format_topics_sentences(ldamodel=optimal_model, corpus=bow_corpus, texts=data_ready):
    # Init output
    sent_topics_df = pd.DataFrame()
    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the dominant topic, percentage contribution and keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series(
                    [int(topic_num), round(prop_topic, 4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic',
                              'Perc_Contribution', 'Topic_Keywords']
    return sent_topics_df


df_topic_sents_keywords = format_topics_sentences(
    ldamodel=optimal_model, corpus=bow_corpus, texts=data_ready)
# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No',
                             'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords']
# Show
df_dominant_topic.head(10)
df_dominant_topic (output of above)

Merging this dataframe with our original.

df.reset_index(inplace=True)
df = pd.merge(df, df_dominant_topic, left_on='index', right_on='Document_No')

Final Data Prep

I am going to drop any observations that have no text left (after our cleaning) and no images or links, because they will add nothing to our analysis.

# Mark posts whose cleaned text is empty, then drop those with no text, link or picture
df['post_text2'].replace('', 0, inplace=True)
df = df.drop(df[(df['post_text2'] == 0) & (
    df['haslink'] + df['haspicture'] == 0)].index)

Giving any post that has no text, but does have a link or an image, its own topic.

df['Dominant_Topic'] = df.apply(
    lambda row: num_topics+1 if row['post_text2'] == 0 else row['Dominant_Topic'], axis=1)

Sentiment Analysis

How positive a post is could be related to how many likes it gets. Let's account for this by using TextBlob to analyse the sentiment of each post.

from textblob import TextBlob as tb
def sentiment(text):
    post = tb(text)
    post_sentiment = post.sentiment.polarity
    return post_sentiment
df['post_text'] = df['post_text'].apply(
    lambda text: str(text))  # convert text into string
df['t_sentiment'] = df['post_text'].apply(lambda post: sentiment(post))

Adding Time

Adding the time will allow us to measure whether the day of the week, the day of the month or the month itself has any effect on the number of likes a post gets.

df['Month'] = df.time.dt.month
df['Day_Of_Week'] = df.time.dt.dayofweek
df['Day'] = df.time.dt.day

Implementing a Negative Binomial Regression

Regression Goal: Predict the number of likes a post will receive, based on factors included in the post.

I am using a negative binomial regression for two reasons. The first is that we have count data: likes are always non-negative integers, not a continuous variable, and that, together with the extreme positive skew, rules out a linear regression. The second is overdispersion, which rules out the more common Poisson model, as we will see below.

A Poisson regression model is usually recommended for data of this type, but Poisson regression requires that the mean equals the variance. In our data:

print("Skewness: %f" % df['likes'].skew())
print("Kurtosis: %f" % df['likes'].kurt())
print("Mean: ", df['likes'].mean())
print("Variance: ", df['likes'].var())

Skewness: 7.585360
Kurtosis: 124.327838
Mean:  54.166710770044986
Variance:  12317.058489204572

Negative binomial regression, however, does not require the mean and variance to be equal, and is used when the conditional variance exceeds the conditional mean. To eventually test a negative binomial regression, though, we have to run a Poisson regression first.

import statsmodels.api as sm
from patsy import dmatrices

# Random 90/10 train/test split
mask = np.random.rand(len(df)) < 0.9
df_train = df[mask].copy()
df_test = df[~mask].copy()

explanatory_var = """likes ~ comments + durfess + haslink + haspicture + Dominant_Topic + Month + Day_Of_Week + Day + t_sentiment"""

y_train, X_train = dmatrices(explanatory_var, df_train, return_type='dataframe')
y_test, X_test = dmatrices(explanatory_var, df_test, return_type='dataframe')

poisson_reg = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()

print(poisson_reg.summary())
Output of above.

This is obviously not a statistically sound regression. The deviance and Pearson chi² are massive, so we almost certainly do not have a well-fitted model. The critical chi² value at the 5% significance level with 1000+ degrees of freedom is 1074.679, about 21 times smaller than our deviance. Therefore this model is not a good representation of the data.
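That comparison can be reproduced with scipy; a minimal sketch, using the residual degrees of freedom and deviance reported by the fitted Poisson model:

from scipy import stats

# Chi-squared critical value at the 5% significance level vs. the model deviance
chi2_crit = stats.chi2.ppf(0.95, poisson_reg.df_resid)
print(chi2_crit, poisson_reg.deviance, poisson_reg.deviance / chi2_crit)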

A negative binomial regression, however, does not require that the variance equal the mean. The most common form, NB2, assumes that the variance is the mean plus alpha times the mean squared:

\sigma_i^2 = \mu_i + \alpha\mu_i^2

We want to find what value of alpha is right for our dataset.

Auxiliary OLS Regression without a constant to find alpha

We can use the “mu” or “lambda” values obtained from the Poisson regression to fit an auxiliary OLS regression. These values are the fitted rates for each observation, and we can use them in the formula below.

\alpha\lambda_i = \frac{(y_i - \lambda_i)^2 - y_i}{\lambda_i}

On the left-hand side of the above equation, λ_i plays the role of the independent variable in an OLS regression (with no constant term) and α is its unknown coefficient; the right-hand side, which we can compute from the Poisson fit, is the dependent variable. We can then use ordinary least squares to estimate alpha.

Let's add the λ_i values to the dataframe, and then implement the above equation.

# Fitted Poisson rates (lambda_i) for each training observation
df_train['likes_lambda'] = poisson_reg.mu

# Dependent variable of the auxiliary regression: ((y_i - lambda_i)^2 - y_i) / lambda_i
df_train['y'] = df_train.apply(lambda row: ((row['likes'] - row['likes_lambda'])**2 - row['likes']) / row['likes_lambda'], axis=1)

And run the OLS regression

import statsmodels.formula.api as smf

# OLS with no intercept ("- 1"): y = alpha * likes_lambda
ols_vars = """y ~ likes_lambda - 1"""
aux_regression = smf.ols(ols_vars, df_train).fit()

Our alpha value (the Beta_1 from the auxiliary regression) equals 0.993958.

The t value for this coefficient is 10.503339. For our degrees of freedom and significance level (5%), the critical t value is 1.645302. 10.5 is obviously higher than this, so we can reject the null hypothesis that alpha = 0.
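That check can be reproduced directly from the fitted auxiliary regression object; a minimal sketch, assuming aux_regression from above:

from scipy import stats

# One-sided 5% critical t value for the auxiliary regression's residual degrees of freedom
t_value = aux_regression.tvalues['likes_lambda']
t_crit = stats.t.ppf(0.95, aux_regression.df_resid)
print(t_value, t_crit, t_value > t_crit)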

This means that our NB2 variance assumption fits the data better than the Poisson assumption that the variance equals the mean (because alpha is significantly different from 0).

Using this Alpha value in our NB2 model.

We can then run a negative binomial model, passing the alpha calculated from the auxiliary regression as the “alpha” parameter that statsmodels’ NegativeBinomial family requires.

alpha=aux_regression.params[0]

nb2_regression = sm.GLM(y_train,X_train,family=sm.families.NegativeBinomial(alpha=alpha)).fit()

print(nb2_regression.summary())
Our results

First, looking at statistical significance: at the 5% significance level, “haslink” (whether the post has a link or not), Dominant_Topic, Day_Of_Week and t_sentiment are not statistically significantly different from 0.

This may suggest that the content of the post (whether it has a link, what topic it relates to, and how positive/negative it is) has no effect on the number of likes.

Starting with dominant topic, I find this extremely hard to believe; I would have thought it would be the driving force. This is probably down to our method of analysing the data. If we had gone through the posts manually and labelled their topics, the variable would probably have more of a bearing. Instead, we have used unsupervised methods that probably do not fit the dataset very well, which means the topics are so broad that they have no statistically significant impact on the number of likes a post gets.

Similarly, the sentiment analysis is based on fairly blunt, unsupervised methods. However, the sentiment of a post is probably not a good indicator of the number of likes anyway; anecdotally, negative posts and positive posts receive about equal numbers of likes.

Whether a post has a link or not is probably irrelevant simply because of the small number of posts that actually contain one, relative to the size of the dataset.

df["haslink"].value_counts()
  
0    3756
1      23

Let's remove the statistically insignificant variables (I am going to keep t_sentiment, as it is significant at the 10% level, and that is probably within the limits of tenuousness for this analysis).
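The refit itself is not shown here in code; below is a minimal sketch of what it could look like, reusing the train/test split and alpha from above (the names reduced_var, nb2_reduced, X_train2 and y_train2 are hypothetical):

# Reduced formula with the statistically insignificant variables dropped
reduced_var = """likes ~ comments + durfess + haspicture + Month + Day + t_sentiment"""

y_train2, X_train2 = dmatrices(reduced_var, df_train, return_type='dataframe')
y_test2, X_test2 = dmatrices(reduced_var, df_test, return_type='dataframe')

nb2_reduced = sm.GLM(y_train2, X_train2,
                     family=sm.families.NegativeBinomial(alpha=alpha)).fit()
print(nb2_reduced.summary())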

With the statistically insignificant variables removed:

Our results with statistically insignificant independent variables removed

Making predictions using our test data:

nb2_test_predictions= nb2_regression.get_prediction(X_test)

Plotting predicted vs actual counts

predictions = nb2_test_predictions.summary_frame()
predicted_counts=predictions['mean']
actual_counts = y_test['likes']
fig, ax = plt.subplots(figsize=(20,15))
predicted, = plt.plot(X_test.index, predicted_counts, label='Predicted counts',color="#5B85AA")
actual, = plt.plot(X_test.index, actual_counts, label='Actual counts', color="#F46036")
plt.legend(handles=[predicted, actual])
ax.set_title("Likes: Predicted vs Actual Counts")
ax.set_ylim(0,900)
fig.tight_layout()
plt.show()

Our model does alright. The actual counts are a lot more volatile, but that is to be expected. Our model follows the trend pretty well. Let’s go back to our results and analyse them in a bit more detail.


Statistically significant coefficients

To work out the IRR (Incidence Rate Ratio), and therefore the multiplicative change in expected likes for a one-unit increase in each independent variable, we use the formula:

e^{\beta_k}
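In statsmodels this is just the exponential of the fitted coefficients; a minimal sketch, applied to the re-fitted model (nb2_reduced in the hypothetical sketch above):

# Incidence rate ratios for every estimated coefficient
print(np.exp(nb2_reduced.params))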

Results after implementing that formula for all estimated coefficients:

Intercept      9.842956
comments       1.003344
durfess        5.888823
haspicture     2.819429
month          0.950591
day            0.992481
t_sentiment    1.120381
dtype: float64

Comments

Each additional comment on a post is associated with roughly a 0.3% increase in the expected number of likes (IRR ≈ 1.003), holding all else equal.

Day and Month

Day and Month both have negative (statistically significant) coefficients, i.e. IRRs below 1. This suggests that the later in the year, and the later in the month, a post is made, the fewer likes it will get.

A reason for this could be that the Durfess page often gets more likes than Exefess, and Exefess posts only start from about September onwards. (This should be accounted for by our “durfess” variable, though.)

It may also be because of the skew over summer, when students are not at university: there are 6 months in the year before summer, but only 3 after.

Durfess

A post being on the Durfess page instead of Exefess is expected to multiply its likes by 5.89, holding all else constant. Pretty self-explanatory: if you want more likes, post to Durfess, probably because the page itself has more likes.

Has Picture

Having a picture is expected to multiply the number of likes by 2.82, holding all else constant. I suspect this would be higher had we left the mammoth cat post in the data.

Overall Takeaways

This analysis could definitely be improved by better topic recognition, better sentiment analysis and probably more data. Overall, if you want a lot of likes on your post: post to Durfess, with a picture, and make sure people comment on it. Post early in the month and early in the year; the day of the week does not seem to matter.

This analysis was not as successful as I had hoped. The topic modelling is probably the most interesting part, but it turned out to be statistically insignificant. I am going to try to engineer a better way to model the topics and come back to this analysis.

Thanks for reading!