Tweets - "ChileDesperto" & "RenunciaPiñera"

Note! The interactive version of the word cloud can be found at the end of this post.

Background

Chile, the country often praised as Latin America’s great economic success story, has shocked the world with an unexpected wave of protests. The anger has its roots in the dictatorship’s legacy of inequality. Perhaps the only people not shocked are Chileans themselves.

Goal

  • To do some basic text mining and create a simple word cloud using Twitter data.

The goal is not a political discussion, though feel free to contact me if you are interested in one. Nevertheless, I have always been, and always will be, on the students’ and workers’ side. They have my support!

Twitter

I had personally never tried to extract tweets from Twitter before, and I was surprised to realise how easy it is. The [rtweet](https://rtweet.info/index.html) package in R extracts tweets based on a keyword, a hashtag, or a specific account (a short sketch of all three follows below). But before you can do anything, you need two things:

  • A Twitter account
  • A Twitter app, created through Twitter’s developer portal, which provides the access keys used below
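Once the package is loaded and a token is created (both shown below), the three access patterns look like this. This is just an orientation sketch; the account name is a placeholder, not one I actually queried:

> ## Three ways to fetch tweets with rtweet
> by_keyword <- search_tweets("chile", n = 100)           # free-text keyword
> by_hashtag <- search_tweets("#chiledesperto", n = 100)  # hashtag
> by_account <- get_timeline("rstudio", n = 100)          # a specific account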

Loading packages

> ## Load rtweet
> library(rtweet)
> 
> ## Loading other libraries I will use
> library(tidyverse)
> library(ggplot2) # plotting 
> library(wordcloud2) # plot an interactive word cloud 
> library(ggThemeAssist) # RStudio add-in for editing ggplot2 themes
> 
> library(tidytext) # text mining library
> library(spacyr) # lemmatization via spaCy
> library(stringi) # string handling, e.g. removing accents
> library(tm) # Spanish stopwords
> library(kableExtra) # table formatting

Access keys

> # The name you assigned to your created app
> appname <- "Add-your-app-name"
> 
> ## API key (You will receive this when you create your Twitter app)
> key <- "Add-your-API-key-here"
> 
> ## API secret (You will receive this when you create your Twitter app)
> secret <- "Add-your-secret-API-key-here"

Access Token

> ## Create a token that validates your access to tweets
> twitter_token <- create_token(
+   app = appname,
+   consumer_key = key,
+   consumer_secret = secret)  
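
As a quick sanity check (not part of the original workflow), you can ask rtweet which token it will use for requests:

> ## Confirm that the token was created and is picked up by rtweet
> get_token()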

Search Twitter for Tweets

The idea was to extract 500,000 tweets with the hashtags #chiledesperto and #renunciapiñera. In case the ñ was typed as a plain n, I also searched for #renunciapinera. There are two things to consider:

  • You can only get 18,000 tweets every 15 minutes.
  • The search API searches against a sampling of recent tweets published in the past seven days.

In order to extract more than 18,000 tweets, use retryonratelimit = TRUE in the search_tweets function.

After approximately 4 hours, the search stopped, giving me 407,976 tweets. I decided to keep all the tweets in Spanish by filtering on the lang variable but, as you will notice, the variable doesn’t seem to be entirely accurate (a quick way to inspect it is shown after the code block below). It is also possible that people mix languages as they tweet.

Observe! The first chunk of code only shows what I did with the file; it is not executed here. The original data set can be downloaded from GitHub.

> ## Searching for 500,000 tweets containing: 
> ## renunciapiñera, renunciapinera or chiledesperto
> 
> rt <- search_tweets(
+   "renunciapiñera OR renunciapinera OR chiledesperto",
+   n = 500000,
+   retryonratelimit = TRUE
+ )
> 
> ## Saving all the collected tweets
> saveRDS(rt, file = "rt_original.rds")
> 
> ## Load all the tweets into R
> rt_original <- readRDS(file = "rt_original.rds")
> 
> ## Original file
> dim(rt_original)
> 
> ## Number of unique usernames
> length(unique(rt_original$screen_name))
> 
> ## Preview tweets data
> head(rt_original)
> 
> ## Preview users data
> users_data(rt_original)
> 
> ## Extracting all the unique tweets written in spanish
> ## This is the file I am planning to share 
> rt_es <- unique(rt_original) %>%
+   filter(lang == "es") %>%
+   select(created_at, text, hashtags, country)
> 
> dim(rt_es)
> 
> ## Saved the tweets to an .rds file
> saveRDS(rt_es, file = "rt_es.rds")
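
Because the lang variable is only approximately accurate, it is worth tabulating it before filtering. A quick check, again not executed here, assuming rt_original is still in memory:

> ## How are the tweets distributed across detected languages?
> rt_original %>% 
+   count(lang, sort = TRUE) %>% 
+   head(10)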

Here is a basic plot that can be created using the ts_plot function in the rtweet package:

> ## Plot time series of tweets frequency
> rt_es %>% ts_plot(by = "hours") +
+   ggplot2::labs(title = "Tweets during the first week of November 2019")

Cleaning the tweets

Here are some components I will use to clean the tweets. In formal Spanish, any of the five vowels (a, e, i, o, u) can carry an accent. The rules for when and where to place accents can be challenging to understand, and in practice the accents are not always written.

> ## Stopwords
> ## From tm package - 20 spanish stopwords
> head(tm::stopwords(kind = "spanish"), 20)
 [1] "de"   "la"   "que"  "el"   "en"   "y"    "a"    "los"  "del"  "se"  
[11] "las"  "por"  "un"   "para" "con"  "no"   "una"  "su"   "al"   "lo"  
> 
> ## From stringi package - To remove accents from a string
> some_words <- c("Millésime", "boulangère", "üéâäàåçêëèïîì")
> stringi::stri_trans_general(some_words, id = "Latin-ASCII")
[1] "Millesime"     "boulangere"    "ueaaaaceeeiii"

Regular expressions are something I struggle with. If you have a better suggestion, please comment.

> ## unnest_tokens() - we unnest using the specialized "tweets" tokenizer that is built into the tokenizers package. 
> 
> tidy_tweets_original <- rt_es %>%
+   mutate(text = stri_trans_general(str = text, id = "Latin-ASCII")) %>% # removing accents
+   mutate(text = str_remove_all(text, "#")) %>% # removing hashtag symbols
+   mutate(text = gsub("p[[:lower:]]+a[[:punct:]]*\\sitalia", "plazaitalia", tolower(text))) %>% # pattern for "Plaza Italia"
+   unnest_tokens(word, text, token = "tweets") %>%
+   filter(!word %in% stopwords(kind = "spanish"),
+          !word %in% str_remove_all(stopwords(kind = "spanish"), "'"),
+          !str_detect(word, "^http"), # removing links
+          !str_detect(word, "^@") # removing @UserName
+          ) %>%
+   mutate(word = gsub("[[:punct:][:space:]]+", "", word)) %>% # removing punctuation
+   mutate(word = gsub("\\w*[0-9]+\\w*\\s*", "", word)) %>% # removing all words containing numbers
+   mutate(word = iconv(word, "latin1", "ASCII", sub = "")) %>% # removing odd characters
+   mutate(word = trimws(word)) %>% # removing white space
+   filter(nchar(word) > 1) %>% # removing empty strings and one-letter words
+   filter(!word %in% c("pa", "la", "pal", "pala")) # removing some problematic strings
> 
> 
> # a copy
> tidy_tweets <- tidy_tweets_original
> 
> # a little sample
> head(tidy_tweets$word, 50)
 [1] "pinera"              "facista"             "vos"                
 [4] "sos"                 "terrorista"          "canta"              
 [7] "buenos"              "aires"               "chiledesperto"      
[10] "saque"               "cuenta"              "encuentro"          
[13] "foto"                "merece"              "ser"                
[16] "compartida"          "forma"               "publica"            
[19] "felicitaciones"      "fotografo"           "si"                 
[22] "llega"               "tuit"                "favor"              
[25] "identificate"        "pueblo"              "ancestros"          
[28] "chiledesperto"       "chileprotests"       "chileresiste"       
[31] "oigan"               "vecinos"             "chatos"             
[34] "manifestaciones"     "chatos"              "lacrimogenas"       
[37] "gas"                 "pimienta"            "carabineros"        
[40] "andan"               "tirando"             "indiscriminadamente"
[43] "incluso"             "delante"             "ninos"              
[46] "abuelitos"           "chiledesperto"       "resulta"            
[49] "conmebol"            "demuestra"          

There are still some issues with the words in the tidy_tweets data set. Some words could have been removed, but at this point their impact will not be significant.
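
One way to eyeball the remaining noise is to look at the least frequent tokens, which is where most leftover junk lives. A quick, optional check:

> ## Inspect the rarest tokens for leftover noise
> tidy_tweets %>% 
+   count(word, sort = TRUE) %>% 
+   tail(10)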

Lemmatization

Lemmatization is an advanced technique that uses a dictionary to replace words with their morphological root form. However, lemmatization in R requires external software modules. spacyr is an R wrapper around spaCy, the industrial-strength natural language processing library for Python. Visit the spacyr page for information about the installation and usage of the package. Below you can see how I used the spacy_parse function to attach a lemma to each word in the tidy_tweets data set.

> # Install the spacyr R package
> install.packages("spacyr")
> library(spacyr)
> 
> # Install spaCy in a conda environment
> spacy_install()
> 
> # Download language support 
> spacy_download_langmodel("es")

> # Open a connection by being initialized  
> spacy_initialize(model = "es")
> 
> # keeping unique words
> unique_words <- unique(tidy_tweets$word)
> length(unique_words)
[1] 57632
> 
> # running spacy_parse() to get word lemmas
> # You can directly apply spacy_parse to your data set
> # Since my file is so large I apply it on a vector of unique words
> word_lemmas_original <- spacy_parse(unique_words, 
+                                     tag = FALSE, 
+                                     entity = FALSE, 
+                                     pos = FALSE, 
+                                     lemma = TRUE,
+                                     output = "data.frame") %>% 
+   select(doc_id, token, lemma)
> 
> word_lemmas <- word_lemmas_original
> dim(word_lemmas)
[1] 57632     3
> 
> # word_lemmas doesn't contain the original word, just the doc_id to the original file
> # joining back word_lemmas to the original vector of unique words
> # just an extra careful step - may not be needed!!!
> word_lemmas <- tibble(word = unique_words) %>%
+   mutate(doc_id = paste0("text", row_number())) %>%
+   left_join(word_lemmas, by = "doc_id") %>%
+   select(word, lemma)

> # Merging tidy_tweets to the df with word lemmas
> tidy_tweets <- tidy_tweets %>% 
+   left_join(word_lemmas, 
+             by = "word") 

Frequency of each word

> ## Counting the frequency of each word
> frequency_all <- tidy_tweets %>% 
+   count(lemma, sort = TRUE) %>% 
+   mutate(total = sum(n)) %>% # total number of words, not the number of distinct lemmas
+   mutate(freq = n/total)
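
Since total is the sum of all counts, the relative frequencies should add up to one; a quick check:

> ## The freq column should sum to 1
> sum(frequency_all$freq)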

ggThemeAssist

ggThemeAssist is an RStudio add-in. I used it to create the theme for the plot below. The result is not super pretty, but it is symbolic (the colours of the Chilean flag). Try it out!
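
For reference, the add-in can also be launched from code on any saved ggplot object; a minimal sketch using a built-in data set:

> ## Launch the interactive theme editor on an existing plot
> p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
> ggThemeAssistGadget(p)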

> frequency_all %>%
+   filter(n < n[3]) %>%
+   arrange(n) %>%
+   top_n(20, n) %>%
+   mutate(lemma = factor(lemma, unique(lemma))) %>%
+   ggplot(aes(reorder(lemma, freq), n)) +
+   geom_col(show.legend = FALSE, 
+            fill = "#0039a6", 
+            colour = "#d52b1e") +
+   coord_flip() +
+   labs(x = NULL) + 
+   theme(plot.subtitle = element_text(vjust = 1), 
+     plot.caption = element_text(size = 8,  
+                                 colour = "#0039a6", 
+                                 vjust = 1), 
+     axis.line = element_line(colour = "#d52b1e"), 
+     axis.ticks = element_line(colour = "white", 
+                               linetype = "blank"), 
+     panel.grid.major = element_line(linetype = "blank"), 
+     panel.grid.minor = element_line(linetype = "blank"), 
+     axis.title = element_text(size = 10, 
+                               colour = "white"), 
+     axis.text = element_text(colour = "white"), 
+     axis.text.x = element_text(size = 8), 
+     axis.text.y = element_text(vjust = 0.3, 
+                                hjust = 0,
+                                size = 12, 
+                                margin = margin(r = -158)
+                                ), 
+     plot.title = element_text(size = 10, 
+                               colour = "white"), 
+     panel.background = element_rect(fill = "#d52b1e"), 
+     plot.background = element_rect(fill = "#d52b1e", 
+                                    colour = "#d52b1e", 
+                                    linetype = "solid")) +
+   labs(y = "# of tweets containing each word", 
+        caption = "@leynu")

Wordcloud

There are actually two packages that can create word clouds, wordcloud and wordcloud2. I used the wordcloud2 package, which provides an HTML5 interface for interactive word clouds.

> ## colors
> pal <- c("#0039a6", "#d52b1e")
> 
> ## Cheating by reducing the count for chiledesperto 
> ## in order to adjust the text sizes in the wordcloud
> frequency_cheat <- frequency_all %>% 
+   mutate(n = ifelse(lemma == "chiledesperto", 110000, n)) %>% 
+   mutate(word = lemma) %>% 
+   top_n(250, n)
> 
> wordcloud2(frequency_cheat, color = rep(pal, 15),  backgroundColor = "white", minSize = 10) 

Note! Since ChileDesperto and RenunciaPiñera were the search words used to extract the tweets, they are not presented at their true scale in the word cloud.
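
An alternative to capping the count would be to drop the search terms from the data entirely before plotting; a sketch (after the accent removal the ñ form should no longer occur, but it is included to be safe):

> ## Alternative: exclude the search hashtags instead of capping them
> frequency_no_search <- frequency_all %>% 
+   filter(!lemma %in% c("chiledesperto", "renunciapinera", "renunciapiñera")) %>% 
+   mutate(word = lemma) %>% 
+   top_n(250, n)
> 
> wordcloud2(frequency_no_search, color = rep(pal, 15), backgroundColor = "white", minSize = 10)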

Final words

The whole data set can be found on GitHub. Tweets come with many more variables that can be used for text mining purposes; I hardly used two or three of those.
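
For instance, every tweet returned by search_tweets carries dozens of additional variables (the client used, retweet and favourite counts, and so on) that I never touched; a small peek:

> ## A few of the many other variables available per tweet
> rt_original %>% 
+   select(screen_name, source, is_retweet, retweet_count, favorite_count) %>% 
+   head()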

Reproducibility

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.3 (2019-03-11)
 os       macOS Mojave 10.14.6        
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Europe/Stockholm            
 date     2019-11-10                  

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package       * version    date       lib source                         
 assertthat      0.2.1      2019-03-21 [1] CRAN (R 3.5.2)                 
 backports       1.1.5      2019-10-02 [1] CRAN (R 3.5.2)                 
 blogdown        0.16       2019-10-01 [1] CRAN (R 3.5.2)                 
 bookdown        0.14       2019-10-01 [1] CRAN (R 3.5.2)                 
 broom           0.5.2      2019-04-07 [1] CRAN (R 3.5.2)                 
 callr           3.3.2      2019-09-22 [1] CRAN (R 3.5.2)                 
 cellranger      1.1.0      2016-07-27 [1] CRAN (R 3.5.0)                 
 cli             1.1.0      2019-03-19 [1] CRAN (R 3.5.2)                 
 colorspace      1.4-1      2019-03-18 [1] CRAN (R 3.5.2)                 
 crayon          1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                 
 data.table      1.12.2     2019-04-07 [1] CRAN (R 3.5.2)                 
 desc            1.2.0      2018-05-01 [1] CRAN (R 3.5.0)                 
 devtools      * 2.2.1      2019-09-24 [1] CRAN (R 3.5.2)                 
 digest          0.6.21     2019-09-20 [1] CRAN (R 3.5.2)                 
 dplyr         * 0.8.3      2019-07-04 [1] CRAN (R 3.5.2)                 
 ellipsis        0.3.0      2019-09-20 [1] CRAN (R 3.5.2)                 
 evaluate        0.14       2019-05-28 [1] CRAN (R 3.5.2)                 
 fastmap         1.0.1      2019-10-08 [1] CRAN (R 3.5.3)                 
 forcats       * 0.4.0      2019-02-17 [1] CRAN (R 3.5.2)                 
 formatR         1.7        2019-06-11 [1] CRAN (R 3.5.2)                 
 fs              1.3.1      2019-05-06 [1] CRAN (R 3.5.2)                 
 generics        0.0.2      2018-11-29 [1] CRAN (R 3.5.0)                 
 ggplot2       * 3.2.1      2019-08-10 [1] CRAN (R 3.5.2)                 
 ggThemeAssist * 0.1.5      2016-08-13 [1] CRAN (R 3.5.0)                 
 glue            1.3.1.9000 2019-10-12 [1] Github (tidyverse/glue@71eeddf)
 gtable          0.3.0      2019-03-25 [1] CRAN (R 3.5.2)                 
 haven           2.1.1      2019-07-04 [1] CRAN (R 3.5.2)                 
 highr           0.8        2019-03-20 [1] CRAN (R 3.5.2)                 
 hms             0.5.1      2019-08-23 [1] CRAN (R 3.5.2)                 
 htmltools       0.4.0      2019-10-04 [1] CRAN (R 3.5.2)                 
 htmlwidgets     1.5.1      2019-10-08 [1] CRAN (R 3.5.2)                 
 httpuv          1.5.2      2019-09-11 [1] CRAN (R 3.5.2)                 
 httr            1.4.1      2019-08-05 [1] CRAN (R 3.5.2)                 
 janeaustenr     0.1.5      2017-06-10 [1] CRAN (R 3.5.0)                 
 jsonlite        1.6        2018-12-07 [1] CRAN (R 3.5.0)                 
 kableExtra    * 1.1.0      2019-03-16 [1] CRAN (R 3.5.2)                 
 knitr         * 1.25       2019-09-18 [1] CRAN (R 3.5.2)                 
 labeling        0.3        2014-08-23 [1] CRAN (R 3.5.0)                 
 later           1.0.0      2019-10-04 [1] CRAN (R 3.5.2)                 
 lattice         0.20-38    2018-11-04 [1] CRAN (R 3.5.3)                 
 lazyeval        0.2.2      2019-03-15 [1] CRAN (R 3.5.2)                 
 lifecycle       0.1.0      2019-08-01 [1] CRAN (R 3.5.2)                 
 lubridate       1.7.4      2018-04-11 [1] CRAN (R 3.5.0)                 
 magrittr        1.5        2014-11-22 [1] CRAN (R 3.5.0)                 
 Matrix          1.2-17     2019-03-22 [1] CRAN (R 3.5.2)                 
 memoise         1.1.0      2017-04-21 [1] CRAN (R 3.5.0)                 
 mime            0.7        2019-06-11 [1] CRAN (R 3.5.2)                 
 miniUI          0.1.1.1    2018-05-18 [1] CRAN (R 3.5.0)                 
 modelr          0.1.5      2019-08-08 [1] CRAN (R 3.5.2)                 
 munsell         0.5.0      2018-06-12 [1] CRAN (R 3.5.0)                 
 nlme            3.1-141    2019-08-01 [1] CRAN (R 3.5.2)                 
 NLP           * 0.2-0      2018-10-18 [1] CRAN (R 3.5.0)                 
 pillar          1.4.2      2019-06-29 [1] CRAN (R 3.5.2)                 
 pkgbuild        1.0.6      2019-10-09 [1] CRAN (R 3.5.2)                 
 pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 3.5.2)                 
 pkgload         1.0.2      2018-10-29 [1] CRAN (R 3.5.0)                 
 prettyunits     1.0.2      2015-07-13 [1] CRAN (R 3.5.0)                 
 processx        3.4.1      2019-07-18 [1] CRAN (R 3.5.2)                 
 promises        1.1.0      2019-10-04 [1] CRAN (R 3.5.2)                 
 ps              1.3.0      2018-12-21 [1] CRAN (R 3.5.0)                 
 purrr         * 0.3.2      2019-03-15 [1] CRAN (R 3.5.2)                 
 R6              2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                 
 Rcpp            1.0.2      2019-07-25 [1] CRAN (R 3.5.2)                 
 readr         * 1.3.1      2018-12-21 [1] CRAN (R 3.5.0)                 
 readxl          1.3.1      2019-03-13 [1] CRAN (R 3.5.2)                 
 remotes         2.1.0      2019-06-24 [1] CRAN (R 3.5.2)                 
 reticulate      1.13       2019-07-24 [1] CRAN (R 3.5.2)                 
 rlang           0.4.0      2019-06-25 [1] CRAN (R 3.5.2)                 
 rmarkdown       1.16       2019-10-01 [1] CRAN (R 3.5.2)                 
 rprojroot       1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                 
 rstudioapi      0.10       2019-03-19 [1] CRAN (R 3.5.2)                 
 rtweet        * 0.6.9      2019-05-19 [1] CRAN (R 3.5.2)                 
 rvest           0.3.4      2019-05-15 [1] CRAN (R 3.5.2)                 
 scales          1.0.0      2018-08-09 [1] CRAN (R 3.5.0)                 
 sessioninfo     1.1.1      2018-11-05 [1] CRAN (R 3.5.0)                 
 shiny           1.4.0      2019-10-10 [1] CRAN (R 3.5.2)                 
 slam            0.1-45     2019-02-26 [1] CRAN (R 3.5.2)                 
 SnowballC       0.6.0      2019-01-15 [1] CRAN (R 3.5.2)                 
 spacyr        * 1.2        2019-07-04 [1] CRAN (R 3.5.2)                 
 stringi       * 1.4.3      2019-03-12 [1] CRAN (R 3.5.2)                 
 stringr       * 1.4.0      2019-02-10 [1] CRAN (R 3.5.2)                 
 testthat        2.2.1      2019-07-25 [1] CRAN (R 3.5.2)                 
 tibble        * 2.1.3      2019-06-06 [1] CRAN (R 3.5.2)                 
 tidyr         * 1.0.0      2019-09-11 [1] CRAN (R 3.5.2)                 
 tidyselect      0.2.5      2018-10-11 [1] CRAN (R 3.5.0)                 
 tidytext      * 0.2.2      2019-07-29 [1] CRAN (R 3.5.2)                 
 tidyverse     * 1.2.1      2017-11-14 [1] CRAN (R 3.5.0)                 
 tm            * 0.7-6      2018-12-21 [1] CRAN (R 3.5.0)                 
 tokenizers      0.2.1      2018-03-29 [1] CRAN (R 3.5.0)                 
 usethis       * 1.5.1      2019-07-04 [1] CRAN (R 3.5.2)                 
 vctrs           0.2.0      2019-07-05 [1] CRAN (R 3.5.2)                 
 viridisLite     0.3.0      2018-02-01 [1] CRAN (R 3.5.0)                 
 webshot         0.5.1      2018-09-28 [1] CRAN (R 3.5.0)                 
 withr           2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                 
 wordcloud2    * 0.2.1      2018-01-03 [1] CRAN (R 3.5.0)                 
 xfun            0.10       2019-10-01 [1] CRAN (R 3.5.2)                 
 xml2            1.2.2      2019-08-09 [1] CRAN (R 3.5.2)                 
 xtable          1.8-4      2019-04-21 [1] CRAN (R 3.5.2)                 
 yaml            2.2.0      2018-07-25 [1] CRAN (R 3.5.0)                 
 zeallot         0.1.0      2018-01-28 [1] CRAN (R 3.5.0)                 

[1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library
