By Jing Han
Introduction
This pilot project is inspired by the “short squeeze” event on GameStop stock initiated by retail investors in January 2021. A short squeeze is an unusual condition in which the price of a heavily shorted stock or other tradeable financial instrument rises rapidly, forcing short sellers to buy back shares and pushing the price up further. The January 2021 event was followed by rapidly rising prices of GameStop stock and other “meme stocks”. A meme stock is a phenomenon involving memetic embodiment in two modes: semantic texts and financial instruments. The price of a meme stock (financial instrument) is connected to the community’s narratives and conversations (semantic texts) around the stock. The rapid increase in prices in January 2021 was accompanied by a rapid increase in community participation.
Community members who identify as retail investors, as opposed to institutional investors, have organized their communication primarily on Reddit since the beginning of the short squeeze event. The continuous community participation after the initial event and the constant contestation between retail investors and institutional investors can be studied as a type of social movement (what I call “the GameStop movement”). Indeed, retail investors regard their actions as contributing to a collective goal, colloquially named the “Mother of All Short Squeezes” (MOASS). MOASS is an imagined pinnacle of this social movement, exemplified by a populist intent to facilitate a global wealth transfer.
For my Cultural Analytics Practicum pilot project, I scraped 2 months of Reddit posts from the subreddit r/wallstreetbets using the Pushshift Multithread API Wrapper, which lets me scrape Reddit data programmatically in Python (data and code for this project are hosted in a GitHub repository). For the analysis, I selected the most relevant metadata from each original Reddit post: author, creation time, score, number of comments, title, post body, and post flair.
I am interested in exploring the distribution of topics in Reddit posts; in particular, my research asks how these topics change over time. I am also interested in the information channels that are connected to r/wallstreetbets. In other words, what are the information channels that Redditors on r/wallstreetbets often cite? Do Redditors in r/wallstreetbets often cite other subreddits or external sources? More importantly, what are the community characteristics of r/wallstreetbets?
Exploratory Analysis
For my exploratory analysis, I collected 200 000 Reddit posts and formatted the selected post metadata in a human-readable way.
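The metadata selection and formatting step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names on the left follow Reddit's submission JSON as served by Pushshift, while the friendlier column names, the `select_metadata` helper, and the sample record are assumptions made up for this example.

```python
from datetime import datetime, timezone

# Left: field names in Reddit's submission JSON.
# Right: friendlier column names (an assumption for this sketch).
FIELDS = {
    "author": "author",
    "created_utc": "created",
    "score": "score",
    "num_comments": "num_comments",
    "title": "title",
    "selftext": "body",
    "link_flair_text": "flair",
}


def select_metadata(raw_post):
    """Keep only the fields of interest and make the timestamp readable."""
    row = {new: raw_post.get(old) for old, new in FIELDS.items()}
    if row["created"] is not None:
        # created_utc is a Unix epoch expressed in seconds.
        row["created"] = datetime.fromtimestamp(
            row["created"], tz=timezone.utc
        ).strftime("%Y-%m-%d %H:%M:%S")
    return row


# Invented example record, not real data from the project.
sample = {
    "author": "example_user",
    "created_utc": 1655251200,
    "score": 12,
    "num_comments": 3,
    "title": "Example title",
    "selftext": "",
    "link_flair_text": "Discussion",
    "id": "abc123",
}

print(select_metadata(sample))
```

Fields outside the mapping, such as `id` here, are dropped, and a missing field becomes `None` rather than raising an error.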
The exploratory analysis reveals that the overall posting volume ebbs and flows. Posting volume peaks for 5 consecutive days in mid-June. Most posts have a low score and a small number of comments. There are 11514 unique authors. The most prolific author is u/Guysmarket, followed by unknown deleted accounts. Investigating the top 20 posts made by u/Guysmarket reveals that most have a low score, few comments, and the post flair Chart. Most posts by u/Guysmarket don’t contain textual content in the post body, and those that do rely heavily on URL links. This observation suggests that the community filters out posts that appear to be spammy by giving them little engagement.
Posting Volume
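The counts reported above (daily posting volume, unique authors, most prolific authors, highest-scoring post) could be computed along these lines. This is a sketch using only the standard library; the row layout and all values are invented for illustration and do not come from the project's data.

```python
from collections import Counter

# Hypothetical rows; every value here is made up.
posts = [
    {"author": "user_a", "created": "2022-06-14 09:00:00", "score": 1, "num_comments": 0},
    {"author": "user_a", "created": "2022-06-14 17:30:00", "score": 2, "num_comments": 1},
    {"author": "user_b", "created": "2022-06-15 08:15:00", "score": 40, "num_comments": 12},
    {"author": "[deleted]", "created": "2022-06-15 11:45:00", "score": 1, "num_comments": 0},
]

# Posting volume per calendar day (the first 10 characters are YYYY-MM-DD).
daily_volume = Counter(p["created"][:10] for p in posts)

# Posts per author, most prolific first.
author_counts = Counter(p["author"] for p in posts)
top_authors = author_counts.most_common(20)

# Highest-scoring and most-commented posts.
top_scored = max(posts, key=lambda p: p["score"])
most_commented = max(posts, key=lambda p: p["num_comments"])

print(sorted(daily_volume.items()))
print(len(author_counts), "unique authors")
print(top_authors[0])
```

Note that deleted accounts collapse into a single `[deleted]` author in a count like this, so they appear as one entry in the ranking even though they may represent many people.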
The author of the highest-scoring post is u/predictony007. The author labeled the post with the meme flair. However, the post content is a truthful statement that might read as unbelievable: “Wikipedia has updated the definition of a recession 22 times in the last 24 hours! The line ‘There is no global consensus on the definition of a recession’ was added on July 27. Updating the definition doesn’t change the fact that we’re in a technical recession.” This observation suggests that the community rewards authors and posts that voice an alternative truth to the one the mainstream media promotes.
The author whose posts attract the highest number of comments is u/AutoModerator, followed by u/OPINION_IS_UNPOPULAR, a moderator. The posts made by these two authors are ritualistic daily discussion threads. This observation suggests that the community cultivates itself through daily discussions, and that community members adhere to this practice by commenting on the discussion posts.
Top 20 Authors
Topic Modeling and URL Analysis on Reddit Posts
In this deeper analysis, I investigate the part-of-speech patterns in the post titles before conducting topic modeling on the post titles over time. Since most post bodies don’t contain valid text, and URL links are common in those that do, I analyze URL citations in post bodies instead.
For this project, I pay attention to company names, nouns, and emojis. Unsurprisingly, the meme stock companies are the most frequently mentioned. BBBY is mentioned 1019 times, followed by GME, which is mentioned 544 times. The most mentioned noun is market (660 times), followed by inflation (259 times). The most used emoji is 🚀, followed by 💎, which reflect communication patterns unique to the community (“to the moon” and “diamond hands” respectively).
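A simplified version of the ticker and emoji counts can be sketched with the standard library. This is a stand-in for illustration rather than the part-of-speech pipeline itself: counting nouns would need a tagger from an NLP library, which is omitted here. The `TICKERS` watchlist, the helper names, and the sample titles are assumptions.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical watchlist built from the two tickers named in the text.
TICKERS = {"BBBY", "GME"}


def count_tickers(titles):
    """Count upper-case tokens in titles that match the watchlist."""
    counts = Counter()
    for title in titles:
        for token in re.findall(r"\b[A-Z]{2,5}\b", title):
            if token in TICKERS:
                counts[token] += 1
    return counts


def count_emoji(titles):
    """Count characters in Unicode category 'So' (Symbol, other).

    This is a rough proxy: it catches most pictographic emoji but also
    other symbols such as the copyright sign.
    """
    counts = Counter()
    for title in titles:
        for ch in title:
            if unicodedata.category(ch) == "So":
                counts[ch] += 1
    return counts


# Invented titles; \U0001F680 is the rocket, \U0001F48E the gem stone.
titles = [
    "BBBY to the moon \U0001F680\U0001F680",
    "Holding GME and BBBY \U0001F48E",
    "Is inflation priced in?",
]

print(count_tickers(titles).most_common())
print(count_emoji(titles).most_common())
```

Matching against a fixed watchlist avoids counting ordinary capitalized words as tickers, at the cost of missing any company that is not on the list.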
To model topics in the post titles, I use BERTopic, a Python package that leverages transformers and class-based TF-IDF. Transformers are a type of neural network architecture designed to process input sequences of variable length. TF-IDF, short for term frequency-inverse document frequency, is a statistical measure intended to reflect how important a word is to a document in a collection or corpus. It serves as a weighting factor in topic modeling. Class-based TF-IDF is a variant of TF-IDF that pools all documents belonging to the same class (here, a cluster of posts) into a single document before calculating the importance of each term, so the resulting weights highlight the terms that distinguish one cluster from the others. Specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF procedure. The result coincides with the observations in the part-of-speech analysis: BBBY is the most discussed topic. The following graph shows the top 10 topics. Topic -1 refers to outliers and can be ignored.
Top 10 Topic Frequency
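The class-based TF-IDF weighting described above can be illustrated with a small standard-library sketch. It follows the formula given in the BERTopic documentation, weight(t, c) = tf(t, c) × log(1 + A / f(t)), but skips the term-frequency normalization, uses a naive whitespace tokenizer, and takes cluster labels as given instead of deriving them from embeddings. The function names and the toy titles are assumptions for illustration, not the package's internals.

```python
import math
from collections import Counter


def class_tfidf(docs, labels):
    """Return {label: {term: weight}} using class-based TF-IDF.

    All documents sharing a label are pooled into one "class document".
    weight(t, c) = tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the
    count of term t in class c, f(t) is its count across all classes,
    and A is the average number of tokens per class.
    """
    class_counts = {}
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()  # naive whitespace tokenizer
        class_counts.setdefault(label, Counter()).update(tokens)

    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    avg_tokens = sum(total_counts.values()) / len(class_counts)

    return {
        label: {
            term: tf * math.log(1 + avg_tokens / total_counts[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


def top_terms(weights, n=3):
    """Return the n highest-weighted terms per class (ties alphabetical)."""
    return {
        label: [t for t, _ in sorted(w.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for label, w in weights.items()
    }


# Toy titles with hand-assigned cluster labels (invented for illustration).
docs = [
    "bbby squeeze incoming",
    "bbby earnings squeeze",
    "recession fears inflation",
    "inflation data recession odds",
]
labels = [0, 0, 1, 1]

weights = class_tfidf(docs, labels)
print(top_terms(weights, n=2))
```

A term that appears in several clusters has a larger f(t) and therefore a smaller weight in each of them, which is why cluster-specific words rise to the top of each topic representation.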
The following graph displays the distribution of topics over time. Most topics are evenly distributed across the observation period and reflect the community’s attention to both macro (less time-sensitive) and micro (time-sensitive) events. The most pronounced trend overall belongs to a time-sensitive topic, BBBY, and is preceded by discussion of a less time-sensitive topic, recession. Interestingly, the period with the highest posting volume in the exploratory analysis (mid-June) does not stand out in this graph. This period roughly corresponds to a peak of interest in discussions about inflation.
Topics Over Time
After removing posts whose post body does not contain valid textual input, only 4793 out of 20 000 posts remain. There are 2464 unique URLs. The following image shows the top 20 most frequently used URLs in post bodies.
URLs in Post Bodies
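The URL analysis can be approximated with a regular expression and a counter. This is a minimal sketch under stated assumptions: the `[removed]` and `[deleted]` placeholders are treated as bodies without valid text, the pattern is a deliberately simple one that will miss some edge cases, and the helper names and sample bodies are invented for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simple pattern: stop at whitespace and at characters that usually
# close a Markdown link or an angle-bracketed URL.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def extract_urls(body):
    """Return URLs found in a post body, trimming trailing punctuation."""
    return [u.rstrip(".,;:!?") for u in URL_PATTERN.findall(body or "")]


def is_valid_body(body):
    """Treat empty bodies and Reddit's placeholders as having no text."""
    return bool(body) and body.strip() not in {"", "[removed]", "[deleted]"}


# Invented post bodies, not taken from the collected data.
bodies = [
    "See [the wiki](https://www.reddit.com/r/wallstreetbets/wiki/index) first.",
    "[removed]",
    "",
    "Source: https://example.com/report. Also https://www.reddit.com/r/wallstreetbets/wiki/index",
]

valid = [b for b in bodies if is_valid_body(b)]
url_counts = Counter(u for b in valid for u in extract_urls(b))
domain_counts = Counter(urlparse(u).netloc for u in url_counts.elements())

print(len(valid), "posts with text,", len(url_counts), "unique URLs")
print(url_counts.most_common(20))
print(domain_counts.most_common())
```

Counting by domain alongside exact URLs makes it easier to see whether citations point back to Reddit, to the community's other platforms, or to external sources.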
The result of the URL analysis shows that the r/wallstreetbets community pays attention to community outreach by interlinking its multi-platform online presence. In addition to the moderators’ presence revealed earlier, the frequent citation of the community guidelines suggests that the community structure might play an essential role in member participation. The result also hints that the circulation and reproduction of community knowledge might help bind members together. The community’s tendency to reward alternative truth can be further observed in the lack of mainstream media URLs.
Simply put, r/wallstreetbets appears to be a community space that gathers members who seek and cultivate a truth alternative to that of mainstream media, and this space is expanding.
Next Steps
Translating the insights from the pilot project into hypotheses will be the first step when I scale up the project of studying the GameStop movement on Reddit. Since community members on r/wallstreetbets seek and cultivate a truth alternative to that of mainstream media, I am interested in analyzing the discourse about meme stocks on Reddit in parallel with the discourse about meme stocks in mainstream media. I will also collect Reddit post data dating back to the beginning of the “short squeeze” event to look at the full picture of the GameStop movement.