

{"id":6076,"date":"2019-03-12T09:25:31","date_gmt":"2019-03-12T13:25:31","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=6076"},"modified":"2023-01-24T14:42:03","modified_gmt":"2023-01-24T18:42:03","slug":"unsupervised-learning-of-textual-data-iii-principal-component-analysis","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2019\/03\/12\/unsupervised-learning-of-textual-data-iii-principal-component-analysis\/","title":{"rendered":"Principal Component Analysis: Unsupervised Learning of Textual Data Part III"},"content":{"rendered":"<p>By Luling Huang<\/p>\n<p><!--more--><\/p>\n<h2>I. Intro<\/h2>\n<p>In this post, I continue to explore unsupervised learning based on <a href=\"https:\/\/sites.temple.edu\/tudsc\/2018\/04\/18\/exploring-hierarchical-clustering-r-grouping-ideologies\/\">my previous post on hierarchical clustering<\/a>\u00a0and <a href=\"https:\/\/sites.temple.edu\/tudsc\/2017\/11\/09\/use-wordfish-for-ideological-scaling\/\">another post on <em>Wordfish<\/em><\/a>. As explained in my post on hierarchical clustering, my goal has been to see which of the 17 ideological labels (for discussion board commenters) are similar enough lexically so that we can group them together\u2014a basic clustering task.<\/p>\n<p>By mining the same textual data, I&#8217;ve done some preliminary work on <a href=\"http:\/\/setosa.io\/ev\/principal-component-analysis\/\">Principal Component Analysis<\/a> (PCA). Section II explains why I chose PCA.<\/p>\n<p>If you need some R code for doing PCA, feel free to go to Section III.<\/p>\n<h2>II. Why Principal Component Analysis?<\/h2>\n<p>The interpretation of unsupervised leaning&#8217;s results is necessarily data-, algorithm-, and model-dependent (<a href=\"https:\/\/www.cambridge.org\/core\/journals\/political-analysis\/article\/text-as-data-the-promise-and-pitfalls-of-automatic-content-analysis-methods-for-political-texts\/F7AAC8B2909441603FEB25C156448F20\">Grimmer &amp; Stewart, 2013<\/a>). 
The burden of validation is now on the analyst. The key question is: how do we interpret and evaluate the results in a systematic way?<\/p>\n<p>For example, if we look at a dendrogram (a tree diagram for taxonomy), we have to decide where to cut the branches; that is, how many clusters do we want to see in the data? Based on the dendrogram alone, that decision is often fairly arbitrary. If we use <a href=\"https:\/\/en.wikipedia.org\/wiki\/K-means_clustering\">k-means clustering<\/a>, then we must specify the number of clusters in advance. Is there a systematic (automated) way to decide the number of clusters based on data-generating assumptions?<\/p>\n<p>Yes. For example, <a href=\"https:\/\/academic.oup.com\/comjnl\/article-abstract\/41\/8\/578\/360856\">Fraley and Raftery (1998)<\/a> proposed using the Gaussian mixture model (GMM) to compare models with different parameters (including the number of clusters) by looking at the Bayesian Information Criterion (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Bayesian_information_criterion\">BIC<\/a>); in their formulation, the greater a model&#8217;s BIC, the stronger the evidence for that model. The GMM itself is straightforward to understand: a data point is assumed to belong to the <em>j<\/em>th of <em>K<\/em> clusters (according to a set of mixing probabilities, one per cluster), and its value is then drawn from <a href=\"https:\/\/en.wikipedia.org\/wiki\/Multivariate_normal_distribution\">the multivariate normal distribution<\/a> for the <em>j<\/em>th cluster.<\/p>\n<p>This all sounds good. BUT we are working with textual data, which makes the multivariate normality assumption highly questionable: a typical document-term matrix contains many zeros, and within a cluster a term&#8217;s frequency distribution tends to be skewed to the right. 
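The two-step generative story above can be sketched in a few lines of code. The post's own analysis is in R; the following is an illustrative Python sketch (not from the post) that simplifies each cluster's covariance to a diagonal one, with made-up mixing weights, means, and standard deviations.

```python
import random

def sample_gmm(weights, means, sds, rng):
    """Draw one point from a Gaussian mixture (diagonal covariance).

    Step 1: pick cluster j according to the mixing weights.
    Step 2: draw each coordinate from that cluster's normal distribution.
    """
    j = rng.choices(range(len(weights)), weights=weights)[0]
    point = [rng.gauss(mu, sd) for mu, sd in zip(means[j], sds[j])]
    return j, point

# Hypothetical 2-cluster mixture in two dimensions.
rng = random.Random(42)
weights = [0.7, 0.3]
means = [[0.0, 0.0], [5.0, 5.0]]
sds = [[1.0, 1.0], [1.0, 1.0]]

draws = [sample_gmm(weights, means, sds, rng) for _ in range(10_000)]
share_cluster_0 = sum(1 for j, _ in draws if j == 0) / len(draws)
```

With enough draws, the share of points from cluster 0 approaches its mixing weight, and each cluster's points concentrate around its mean vector; model-based clustering runs this story in reverse.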
Here is <a href=\"http:\/\/hameddaily.blogspot.com\/2015\/03\/when-not-to-use-gaussian-mixtures-model.html\">a great post on why we should avoid GMM when the distribution assumption is not met<\/a>. In other words, GMMs don&#8217;t seem ideal for textual data, which tend to have unpredictable distribution patterns.<\/p>\n<p>LDA (Latent Dirichlet Allocation) topic modeling (<a href=\"http:\/\/www.cs.columbia.edu\/~blei\/papers\/BleiLafferty2009.pdf\">Blei &amp; Lafferty, 2009<\/a>) has more plausible assumptions about how words are generated (though it still requires setting parameters ahead of time). As argued in Grimmer and Stewart&#8217;s (2013) critique, the way we use language is so complex that automated textual analysis may be inherently premised on more or less <i>incorrect<\/i> data-generating models. Therefore, I decided to set aside the model-based unsupervised learning methods for a moment and to use a method that suits my textual data better.<\/p>\n<p>PCA is, first and foremost, a dimensionality reduction technique <a href=\"https:\/\/www.stat.cmu.edu\/~cshalizi\/uADA\/12\/lectures\/ch18.pdf\">that does not rely on any explicit distribution assumptions (see Section 18.1.4 in this link<\/a>). All PCA does is reduce the number of features (terms\/variables) while minimizing information loss [<a href=\"https:\/\/towardsdatascience.com\/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c\">check out this article for a non-technical explanation<\/a>]. This is extremely helpful in my case because I have a sparse matrix with over 10,000 columns (words). Also, after dimensionality reduction, I can still run a clustering analysis on the reduced data.<\/p>\n<h2>III. PCA in R<\/h2>\n<p><strong>1. Preprocessing:<\/strong><\/p>\n<p>This part has been addressed in my previous posts [I just used the standard &#8216;tm&#8217; package in R]. 
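To picture what the preprocessing step hands us, here is a minimal pure-Python sketch (the post itself uses R's 'tm' package) of how a document-term matrix arises from a corpus; the three mini-documents and their labels below are made up for illustration.

```python
from collections import Counter

# Toy corpus standing in for the discussion-board data (made-up documents,
# already stemmed and stop-word-filtered as the preprocessing step would do).
docs = {
    "liberal":      "trump poll win poll hillari",
    "conservative": "trump win parti republican",
    "libertarian":  "parti poll trump",
}

# The vocabulary supplies the columns; each document becomes a row of counts.
vocab = sorted({term for text in docs.values() for term in text.split()})
dtm = {
    label: [Counter(text.split())[term] for term in vocab]
    for label, text in docs.items()
}
```

The real matrix has 17 rows (one per ideological label) and over 10,000 columns, most of them zero, which is exactly the sparsity discussed above.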
Assume that we now have a document-term matrix called &#8220;dtm.&#8221;<\/p>\n<p><strong>2. Scaling\/normalization:<\/strong><\/p>\n<p>PCA results rely heavily on the scales of the features. Here, every feature is a word frequency (discrete count data), so the features are already on comparable scales. Still, we need to account for the fact that some documents are longer than others. I normalized my document-term matrix by dividing the frequency of each term in a document by the Euclidean norm of that document&#8217;s vector:<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #f8f8f8;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\"><span style=\"color: #888888\">dtm_norm &lt;- t(apply(dtm, 1, function(x) x\/sqrt(sum(x^2)))) ## divide each document (row) by its Euclidean norm<\/span>\n<\/pre>\n<\/div>\n<p><strong>3. The <a href=\"https:\/\/cran.r-project.org\/web\/packages\/irlba\/irlba.pdf\">&#8216;irlba&#8217;<\/a> package in R:<\/strong><\/p>\n<p>Because we are dealing with a high-dimensional sparse matrix, we use &#8216;irlba,&#8217; which computes only the leading components via a truncated decomposition and is therefore much more efficient.<\/p>\n<p><strong>4. How many principal components?<\/strong><\/p>\n<p>In general, the maximum number of principal components we can extract is the smaller of (the number of rows minus 1) and the number of columns. 
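Why "rows minus 1"? prcomp (and prcomp_irlba) center each column by default, and once the columns are centered, the rows of the matrix sum to the zero vector, so at most n - 1 of them are linearly independent. A small pure-Python check on a made-up 3 x 4 matrix:

```python
# A small 3x4 "document-term" matrix (made-up counts).
X = [
    [2, 0, 1, 3],
    [0, 4, 1, 1],
    [1, 2, 2, 0],
]
n, p = len(X), len(X[0])

# Column-center, as prcomp/prcomp_irlba do by default.
col_means = [sum(row[j] for row in X) / n for j in range(p)]
Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]

# Each centered column sums to zero, so the rows sum to the zero vector:
# any one row equals minus the sum of the others, and the rank (hence the
# number of meaningful components) is at most n - 1.
col_sums = [sum(row[j] for row in Xc) for j in range(p)]
last_row_from_others = [-(Xc[0][j] + Xc[1][j]) for j in range(p)]
```

With 3 centered rows there are only 2 independent directions, which generalizes to the 16 components available from 17 documents.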
I have 17 documents and more than 10,000 words, so the maximum number of components is 16.<\/p>\n<p>We can first run a PCA with 16 components and check the graph of cumulative explained variance by component:<br \/>\n<!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #f8f8f8;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\"><span style=\"color: #888888\">library(irlba)<\/span>\n<span style=\"color: #888888\">p_dtm_norm_16 &lt;- prcomp_irlba(dtm_norm, n=16)<\/span>\n<span style=\"color: #888888\">percent_variation_16 &lt;- p_dtm_norm_16$sdev^2 \/ sum(p_dtm_norm_16$sdev^2) ## sdev is the standard deviation of each component<\/span>\n<span style=\"color: #888888\">plot(cumsum(percent_variation_16), xlab = \"Principal Component\",<\/span>\n<span style=\"color: #888888\">     ylab = \"Cumulative Proportion of Variance Explained\",<\/span>\n<span style=\"color: #888888\">     type = \"b\")<\/span>\n<\/pre>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6087\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/CumProp_ExplainedVar_PCA.png\" alt=\"\" width=\"459\" height=\"286\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/CumProp_ExplainedVar_PCA.png 562w, https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/CumProp_ExplainedVar_PCA-300x187.png 300w\" sizes=\"auto, (max-width: 459px) 100vw, 459px\" \/><\/p>\n<p>Here is the graph showing the proportion of explained variance by component:<img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6088\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/Prop_ExplainedVar_PCA.png\" alt=\"\" width=\"475\" height=\"296\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/Prop_ExplainedVar_PCA.png 562w, https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/Prop_ExplainedVar_PCA-300x187.png 300w\" sizes=\"auto, (max-width: 475px) 100vw, 475px\" \/><br \/>\nThe first graph above does not look 
like a straight line, which is good news: the earliest components capture much more variance than the later ones. The second graph shows that the first 4 components explain more variance than the others. After the 10th component, there is not much difference in explained variance. Let&#8217;s just set the number of components to 10 for now.<br \/>\n<!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #f8f8f8;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\"><span style=\"color: #888888\">p_dtm_norm_10 &lt;- prcomp_irlba(dtm_norm, n=10)<\/span><\/pre>\n<\/div>\n<p><strong>5. What are the important words in a component?<\/strong><\/p>\n<p>The code in this section was adapted from <a href=\"https:\/\/juliasilge.com\/blog\/stack-overflow-pca\/\">Silge&#8217;s (2018) wonderful blog post<\/a>. The additional required R packages are &#8216;tidyverse,&#8217; &#8216;broom,&#8217; and &#8216;scales.&#8217; Now, we would like to know which words have the greatest absolute loadings on the first component, PC1.<br \/>\n<!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #f8f8f8;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\"><span style=\"color: #888888\">tidied_pca_10 &lt;- bind_cols(Tag = colnames(dtm_norm),<\/span>\n<span style=\"color: #888888\">                        tidy(p_dtm_norm_10$rotation)) %&gt;%<\/span>\n<span style=\"color: #888888\">  gather(PC, Contribution, PC1:PC10)<\/span>\n\n<span style=\"color: #888888\">tidied_pca_10 %&gt;%<\/span>\n<span style=\"color: #888888\">  filter(PC == \"PC1\") %&gt;%<\/span>\n<span style=\"color: #888888\">  top_n(40, abs(Contribution)) %&gt;%<\/span>\n<span style=\"color: #888888\">  mutate(Tag = reorder(Tag, Contribution)) %&gt;%<\/span>\n<span style=\"color: #888888\">  ggplot(aes(Tag, Contribution, fill = Tag)) +<\/span>\n<span style=\"color: #888888\">  geom_col(show.legend 
= FALSE, alpha = 0.8) +<\/span>\n<span style=\"color: #888888\">  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), <\/span>\n<span style=\"color: #888888\">        axis.ticks.x = element_blank()) + <\/span>\n<span style=\"color: #888888\">  labs(x = \"Terms\",<\/span>\n<span style=\"color: #888888\">       y = \"Relative importance in principal component\")<\/span>\n<\/pre>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-6091\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/words_pc1_10PC-1.png\" alt=\"\" width=\"562\" height=\"350\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/words_pc1_10PC-1.png 562w, https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/words_pc1_10PC-1-300x187.png 300w\" sizes=\"auto, (max-width: 562px) 100vw, 562px\" \/><\/p>\n<p>On PC1, if we look at the words with large negative loadings and the ones with large positive loadings, we may interpret this component as a contrast between different topical aspects of the 2016 Presidential Election. 
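For intuition about what those loadings are: each column of prcomp's rotation matrix is an eigenvector of the centered data's covariance matrix, and the "important words" are simply the entries with the largest absolute values. Below is a pure-Python sketch (the post itself uses 'irlba' and tidyverse tooling) that recovers PC1's loadings by power iteration on a tiny made-up matrix; words whose frequencies never vary get zero loading.

```python
# Find the first principal component of a tiny document-term matrix by
# power iteration, then rank words by absolute loading. Toy data only.
words = ["trump", "poll", "parti", "white"]
X = [
    [9.0, 8.0, 1.0, 1.0],
    [1.0, 2.0, 1.0, 1.0],
    [5.0, 4.0, 1.0, 1.0],
]

n, p = len(X), len(X[0])
means = [sum(r[j] for r in X) / n for j in range(p)]
Xc = [[r[j] - means[j] for j in range(p)] for r in X]

# Covariance matrix C = Xc' Xc / (n - 1).
C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / (n - 1)
      for b in range(p)] for a in range(p)]

# Power iteration: repeatedly apply C and renormalize; this converges to
# the dominant eigenvector, i.e. the loadings of PC1.
v = [1.0] * p
for _ in range(200):
    w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
    norm = sum(x * x for x in w) ** 0.5
    v = [x / norm for x in w]

top_word = max(zip(words, v), key=lambda t: abs(t[1]))[0]
```

In this toy matrix only "trump" and "poll" vary across documents, so PC1 loads on them and the constant words drop out, mirroring how the bar charts above surface the distinguishing terms.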
On one hand, with keywords like &#8220;hillari,&#8221; &#8220;trump,&#8221; &#8220;win,&#8221; and &#8220;poll,&#8221; people are focusing on the &#8220;horse race&#8221; aspect of the election; on the other, with words like &#8220;parti,&#8221; &#8220;republican,&#8221; &#8220;white,&#8221; &#8220;christian,&#8221; and &#8220;black,&#8221; people are talking about the candidates&#8217; and\/or voters&#8217; demographics.<\/p>\n<p>What about PC2?<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-6092\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/words_pc2_10PC-1.png\" alt=\"\" width=\"562\" height=\"350\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/words_pc2_10PC-1.png 562w, https:\/\/sites.temple.edu\/tudsc\/files\/2019\/02\/words_pc2_10PC-1-300x187.png 300w\" sizes=\"auto, (max-width: 562px) 100vw, 562px\" \/><br \/>\nPC2 seems to capture a contrast between the topics and figures associated with the two major U.S. political parties.<\/p>\n<p>We could make one such graph for each of the 10 components, but I&#8217;m not showing them all here.<\/p>\n<p><strong>6. A 3-D projection<\/strong><\/p>\n<div>I used &#8216;plotly&#8217; to generate the 3-D projection below. The point labels look cluttered. 
Feel free to click the image if you want to zoom in, zoom out, or rotate the graph.<\/div>\n<div><a style=\"text-align: center\" title=\"A 3-D Projection\" href=\"https:\/\/plot.ly\/~manuyee\/3\/?share_key=K3qZIVQCf5OotIUFA2ri7K\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" style=\"max-width: 100%;width: 600px\" src=\"https:\/\/plot.ly\/~manuyee\/3.png?share_key=K3qZIVQCf5OotIUFA2ri7K\" alt=\"A 3-D Projection\" width=\"600\" \/><\/a><\/div>
n":[{"id":9197,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/6076\/revisions\/9197"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media\/6097"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=6076"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=6076"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=6076"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}