This R code defines a function load.package() that checks whether a specified package is available by attempting to load it with require(). If the package is not installed, it installs it using the install.packages() function and then loads it into the current session using the library() function.
The function takes a single argument, package, which is a character string specifying the name of the package to be loaded.
The code then calls the load.package() function for each of the packages required for the analysis: dplyr, tidyr, tm, udpipe, SnowballC, vader, ggplot2, ggthemes, stringr, qdapTools, quanteda, quanteda.textstats, quantmod, tibble, keras, and tensorflow.
By using this function, the code ensures that all the required packages are installed and loaded, without having to manually check each package and install it if necessary.
# Check if a package is installed and load it, otherwise install and load it
load.package <- function(package) {
if (!require(package, character.only = TRUE)) {
install.packages(package)
library(package, character.only = TRUE)
} else {
library(package, character.only = TRUE)
}
}
# Load required packages
load.package("dplyr")
load.package("tidyr")
load.package("tm")
load.package("udpipe")
load.package("SnowballC")
load.package("vader")
load.package("ggplot2")
load.package("ggthemes")
load.package("stringr")
load.package("qdapTools")
load.package("quanteda")
load.package("quanteda.textstats")
load.package("quantmod")
load.package("tibble")
load.package("keras")
load.package("tensorflow")
Data source: https://www.kaggle.com/datasets/vidyapb/elon-musk-tweets-2015-to-2020
tesla = read.csv("C:/Users/Goldleverse/OneDrive/Desktop/Datasets/elonmusk.csv")
head(tesla, 5)
## id conversation_id created_at date time timezone
## 1 1.282940e+18 1.282933e+18 1.594712e+12 2020-07-14 07:28:03 UTC
## 2 1.282845e+18 1.282802e+18 1.594689e+12 2020-07-14 01:10:26 UTC
## 3 1.282806e+18 1.282759e+18 1.594680e+12 2020-07-13 22:34:13 UTC
## 4 1.282800e+18 1.282672e+18 1.594678e+12 2020-07-13 22:12:52 UTC
## 5 1.282800e+18 1.282739e+18 1.594678e+12 2020-07-13 22:12:26 UTC
## user_id username name place
## 1 44196397 elonmusk Elon Musk NA
## 2 44196397 elonmusk Elon Musk NA
## 3 44196397 elonmusk Elon Musk NA
## 4 44196397 elonmusk Elon Musk NA
## 5 44196397 elonmusk Elon Musk NA
## tweet
## 1 Cute
## 2 Wow
## 3 Reusability is essential. A rocket that is single use is just as absurd as a single use airplane. F9 engines already fire 3 times per flight.
## 4 Wild times!
## 5 We’re being extra paranoid. Maximizing probability of successful launch is paramount.
## mentions
## 1 ['teslarati']
## 2 ['davidtayar5', 'teslalisa', 'stevehamel16', 'vincent13031925', 's_padival', 'barkmsmeagol', 'annerajb', 'pjhornak', 'wpipperger', 'ecoheliguy']
## 3 ['katlinegrey']
## 4 ['vincent13031925']
## 5 ['spaceflightnow']
## urls photos replies_count retweets_count likes_count hashtags cashtags
## 1 [] [] 222 176 7883 [] []
## 2 [] [] 346 180 4241 [] []
## 3 [] [] 102 126 1551 [] []
## 4 [] [] 438 415 13421 [] []
## 5 [] [] 128 155 4238 [] []
## link retweet quote_url
## 1 https://twitter.com/elonmusk/status/1282939902531796993 False
## 2 https://twitter.com/elonmusk/status/1282844872571904000 False
## 3 https://twitter.com/elonmusk/status/1282805559834492929 False
## 4 https://twitter.com/elonmusk/status/1282800187308572672 False
## 5 https://twitter.com/elonmusk/status/1282800078000803840 False
## video near geo source user_rt_id user_rt retweet_id
## 1 0 NA NA NA NA NA NA
## 2 0 NA NA NA NA NA NA
## 3 0 NA NA NA NA NA NA
## 4 0 NA NA NA NA NA NA
## 5 0 NA NA NA NA NA NA
## reply_to
## 1 [{'user_id': '44196397', 'username': 'elonmusk'}, {'user_id': '1308211178', 'username': 'Teslarati'}]
## 2 [{'user_id': '44196397', 'username': 'elonmusk'}, {'user_id': '275295731', 'username': 'davidtayar5'}, {'user_id': '1046173159580479490', 'username': 'TeslaLisa'}, {'user_id': '999632398525530114', 'username': 'SteveHamel16'}, {'user_id': '1689516060', 'username': 'vincent13031925'}, {'user_id': '717042249624854529', 'username': 'S_Padival'}, {'user_id': '980554784133427200', 'username': 'BarkMSmeagol'}, {'user_id': '23771947', 'username': 'annerajb'}, {'user_id': '12153682', 'username': 'PJHORNAK'}, {'user_id': '2638809573', 'username': 'WPipperger'}, {'user_id': '316068880', 'username': 'EcoHeliGuy'}]
## 3 [{'user_id': '44196397', 'username': 'elonmusk'}, {'user_id': '319128454', 'username': 'katlinegrey'}]
## 4 [{'user_id': '44196397', 'username': 'elonmusk'}, {'user_id': '1689516060', 'username': 'vincent13031925'}]
## 5 [{'user_id': '44196397', 'username': 'elonmusk'}, {'user_id': '17217640', 'username': 'SpaceflightNow'}]
## retweet_date translate trans_src trans_dest
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
nrow(tesla)
## [1] 9286
This code uses the dplyr package to perform data manipulation operations on the tesla data frame.
tesla %>% arrange(date) sorts the rows of the tesla data frame chronologically by date, using the %>% operator to pipe the data frame into the arrange() function. (Note that dplyr's arrange() has no ascending argument; descending order would be written arrange(desc(date)).)
%>% group_by(date) groups the rows in the tesla data frame by date, so that the following summary is computed once per unique date.
%>% summarize(tweet = paste(tweet, collapse = " . ")) collapses the tweet column within each group (i.e., each unique date) into a single string, with the individual tweets separated by " . ".
tesla = tesla %>% arrange(date) %>%
group_by(date) %>%
summarize(tweet = paste(tweet, collapse = " . "))
tail(tesla, 5)
## # A tibble: 5 × 2
## date tweet
## <chr> <chr>
## 1 2020-07-10 "I spoke with Korolev’s family today.\nHe was one of the very best…
## 2 2020-07-11 "Haha . Maybe he should design flag of Mars . 48 65 78 20 74 6f 20…
## 3 2020-07-12 "Thanks :) . That is the near-term danger of AI . I didn’t mind DA…
## 4 2020-07-13 "Reusability is essential. A rocket that is single use is just as …
## 5 2020-07-14 "Cute . Wow"
This R code creates a corpus object, which is a collection of documents (in this case, tweets about Tesla) that will be used for text analysis.
The Corpus() function is from the tm package, and is used to create a corpus object from a source text (in this case, VectorSource(tesla$tweet)). The VectorSource() function is used to create a source vector from the tweet column of the tesla data frame, which contains the text of the tweets.
The corpus[[1]][1] line prints the content (i.e., the full text) of the first document in the corpus.
corpus = Corpus(VectorSource(tesla$tweet))
corpus[[1]][1]
## $content
## [1] "If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE "
This R code is using the tm_map() function from the tm package to transform the text in a corpus to lowercase.
# Convert to lower case
corpus = tm_map(corpus, FUN = content_transformer(tolower))
corpus[[1]][1]
## $content
## [1] "if you are curious about the p85d, you can schedule a test drive here: http://ts.la/de "
The content_transformer() function is used to apply a regular expression to each document in the corpus. The pattern 'http[[:alnum:][:punct:]]*' matches any sequence of characters that starts with 'http', followed by any combination of alphanumeric and punctuation characters. Each match is replaced with a space character using gsub(), a built-in R function for replacing text in a string.
The overall effect of this code is to remove any URLs or web links from the text documents in the corpus, replacing them with a space character.
# Remove URLs
corpus = tm_map(corpus,
FUN = content_transformer(FUN = function(x)gsub(pattern = 'http[[:alnum:][:punct:]]*',
replacement = ' ',x = x)))
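The nested FUN = arguments above work but are hard to read; a common tm idiom is to wrap the substitution once in a reusable transformer. A sketch of that equivalent alternative (toSpace is a name introduced here, not part of tm):
# Reusable transformer that replaces a regex pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "http[[:alnum:][:punct:]]*")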
removePunctuation is a built-in function in the tm package that removes all punctuation marks from the text.
# Remove punctuation
corpus = tm_map(corpus,FUN = removePunctuation)
corpus[[1]][1]
## $content
## [1] "if you are curious about the p85d you can schedule a test drive here "
The removeWords() function is applied to corpus using tm_map(). removeWords() is a text preprocessing function that removes specified words from a corpus. Here, stopwords('english') generates a list of English stop words, words that are common but carry little meaning on their own (such as "the", "and", "a"). These stop words are then removed from the text in corpus.
The resulting corpus object will have the English stop words removed from its text.
corpus = tm_map(corpus,FUN = removeWords,c(stopwords('english')))
corpus[[1]][1]
## $content
## [1] " curious p85d can schedule test drive "
The stripWhitespace function is passed as an argument to tm_map(). This function, built into the tm package, removes extra whitespace (i.e., repeated spaces, tabs, line breaks) from the text in the corpus.
By applying this function to the corpus variable, any extra whitespace will be removed from each document in the corpus.
corpus = tm_map(corpus,FUN = stripWhitespace)
corpus[[1]][1]
## $content
## [1] " curious p85d can schedule test drive "
Lemmatization is the process of reducing a word to its base or dictionary form (for example, "landed" and "landing" both reduce to "land"), which can help to standardize text and improve the accuracy of text analysis.
The code first downloads a pre-trained language model for English using the udpipe_download_model function, and then loads the model using udpipe_load_model.
Next, the code defines a function called lemmatize_document which takes a single document as input and returns a vector of lemmas. The function tokenizes the document using the udpipe_annotate function, extracts the lemma column from the resulting data frame using as.data.frame, and removes numbers, punctuation, and other special characters from the lemmas using the gsub function.
Finally, the code applies the lemmatize_document function to each document in the corpus using the lapply function and stores the result in lemmatized_corpus. The resulting corpus is then printed using head.
# download the model to perform the lemmatization
model <- udpipe_download_model(language = "english")
# load the model
ud_model <- udpipe_load_model(file = model$file_model)
# create a function to lemmatize a single document
lemmatize_document <- function(document) {
# tokenize the document
tokens <- udpipe_annotate(ud_model, x = document)
# extract the lemma column from the tokens
lemmas <- as.data.frame(tokens)$lemma
# remove numbers, punctuation and other special characters
lemmas <- gsub("[^[:alpha:][:space:]]*", "", lemmas)
# return the lemmatized document
return(lemmas)
}
# apply lemmatization to each document in the corpus
lemmatized_corpus <- lapply(corpus, lemmatize_document)
# print the lemmatized corpus
head(lemmatized_corpus, 5)
## [[1]]
## [1] "curious" "p" "would" "can" "schedule" "test" "drive"
##
## [[2]]
## [1] "prob" "good" "though" "give" "we"
## [6] "time" "replace" "st" "stage" "video"
## [11] "transmitter" "need" "launch" "nice" "air"
## [16] "force" "track" "radar" "go" "launch"
## [21] "postponed" "time" "tomorrow" "rocket" "reentry"
## [26] "much" "tougher" "time" "around" "due"
## [31] "deep" "space" "mission" "almost" ""
## [36] "x" "force" "" "x" "heat"
## [41] "plenty" "hydraulic" "fluid" "tho" "launch"
## [46] "st" "deep" "space" "mission" "today"
## [51] "head" "earthsun" "l" "gravity" "null"
## [56] "point" "" "m" "mile" ""
## [61] "x" "moon"
##
## [[3]]
## [1] "launch" "postpone" "tomorrow" "due" "high"
## [6] "wind" "cape" "dragon" "still" "inbound"
## [11] "orbit" "" "min" "extreme" "wind"
## [16] "share" "cape" "canaveral" "feel" "like"
## [21] "sledgehammer" "supersonic" "vertical" "hoping" "change"
## [26] "" "civilian" "application" ""
##
## [[4]]
## [1] "rocket" "soft" "land"
## [4] "ocean" "withing" ""
## [7] "m" "target" "nicely"
## [10] "vertical" "high" "probability"
## [13] "good" "droneship" "landing"
## [16] "nonstormy" "weather" "primary"
## [19] "mission" "target" "spacecraft"
## [22] "head" "towards" "sun"
## [25] "good" "daniellockyer" "actually"
## [28] "thatmaybe" "plan" "significant"
## [31] "upgrade" "droneship" "future"
## [34] "mission" "handle" "literally"
## [37] "anything" "maybe" "give"
## [40] "merlin" "good" "measure"
## [43] "cant" "delay" "longer"
## [46] "must" "proceed" "primary"
## [49] "mission" "launch" "deep"
## [52] "space" "climate" "observatory"
## [55] "spacecraft" "mega" "storm"
## [58] "prevent" "droneship" "remain"
## [61] "station" "rocket" "try"
## [64] "land" "water" "survival"
## [67] "probability" "" "come"
## [70] "home" "pictwittercomfmrmysr" "v"
## [73] "dragon" "splashdown" "california"
## [76] "coast" "pictwittercombvfmeii"
##
## [[5]]
## [1] "land" "stormy" "sea"
## [4] "pictwittercomeygiu"
This code removes non-alphabetic characters from the previously lemmatized corpus using the gsub function in R. Because lemmatize_document already applied the same pattern, this pass is mainly a safeguard to catch any remaining characters that might interfere with subsequent text analysis.
The code first defines a function called clean_document which takes a single document as input and returns a cleaned version of the document. The function uses gsub with the regular expression "[^[:alpha:][:space:]]*", which matches any character that is not a letter or whitespace, and removes every match from the document.
Next, the code applies the clean_document function to each document in the previously stemmed corpus using the lapply function and stores the result in cleaned_corpus.
# Removing non-alpha characters
# create a function to remove non-alpha characters from a document
clean_document <- function(document) {
# remove all characters that are not alphabets or white spaces
clean_doc <- gsub("[^[:alpha:][:space:]]*", "", document)
# return the cleaned document
return(clean_doc)
}
# apply cleaning to each document in the lemmatized or stemmed corpus
cleaned_corpus <- lapply(lemmatized_corpus, clean_document)
# print the cleaned corpus
head(cleaned_corpus, 5)
## [[1]]
## [1] "curious" "p" "would" "can" "schedule" "test" "drive"
##
## [[2]]
## [1] "prob" "good" "though" "give" "we"
## [6] "time" "replace" "st" "stage" "video"
## [11] "transmitter" "need" "launch" "nice" "air"
## [16] "force" "track" "radar" "go" "launch"
## [21] "postponed" "time" "tomorrow" "rocket" "reentry"
## [26] "much" "tougher" "time" "around" "due"
## [31] "deep" "space" "mission" "almost" ""
## [36] "x" "force" "" "x" "heat"
## [41] "plenty" "hydraulic" "fluid" "tho" "launch"
## [46] "st" "deep" "space" "mission" "today"
## [51] "head" "earthsun" "l" "gravity" "null"
## [56] "point" "" "m" "mile" ""
## [61] "x" "moon"
##
## [[3]]
## [1] "launch" "postpone" "tomorrow" "due" "high"
## [6] "wind" "cape" "dragon" "still" "inbound"
## [11] "orbit" "" "min" "extreme" "wind"
## [16] "share" "cape" "canaveral" "feel" "like"
## [21] "sledgehammer" "supersonic" "vertical" "hoping" "change"
## [26] "" "civilian" "application" ""
##
## [[4]]
## [1] "rocket" "soft" "land"
## [4] "ocean" "withing" ""
## [7] "m" "target" "nicely"
## [10] "vertical" "high" "probability"
## [13] "good" "droneship" "landing"
## [16] "nonstormy" "weather" "primary"
## [19] "mission" "target" "spacecraft"
## [22] "head" "towards" "sun"
## [25] "good" "daniellockyer" "actually"
## [28] "thatmaybe" "plan" "significant"
## [31] "upgrade" "droneship" "future"
## [34] "mission" "handle" "literally"
## [37] "anything" "maybe" "give"
## [40] "merlin" "good" "measure"
## [43] "cant" "delay" "longer"
## [46] "must" "proceed" "primary"
## [49] "mission" "launch" "deep"
## [52] "space" "climate" "observatory"
## [55] "spacecraft" "mega" "storm"
## [58] "prevent" "droneship" "remain"
## [61] "station" "rocket" "try"
## [64] "land" "water" "survival"
## [67] "probability" "" "come"
## [70] "home" "pictwittercomfmrmysr" "v"
## [73] "dragon" "splashdown" "california"
## [76] "coast" "pictwittercombvfmeii"
##
## [[5]]
## [1] "land" "stormy" "sea"
## [4] "pictwittercomeygiu"
This step builds a document-term matrix from the cleaned corpus. TF-IDF (term frequency-inverse document frequency) is a widely used weighting scheme in text mining that weights each term by its frequency within a document and its rarity across all documents. Note, however, that normalize = TRUE is not a recognized control option of DocumentTermMatrix, so the matrix produced below actually holds raw term counts; the sketch after this paragraph shows how a true TF-IDF weighting would be requested.
The code first creates a document-term matrix (DTM) using the DocumentTermMatrix function from the tm package, with the x parameter set to the cleaned corpus.
Next, the code removes sparse terms from the DTM using the removeSparseTerms function, which removes terms that appear in fewer than 5% of the documents. The resulting DTM is converted to a matrix and then to a data frame using the as.matrix and as.data.frame functions, respectively.
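For reference, a genuine TF-IDF weighting in tm is requested through the weighting control option rather than normalize; a sketch, assuming the cleaned_corpus2 rebuilt in the snippet above:
# Request TF-IDF weights explicitly via tm's weighting control
dtm_true_tfidf = DocumentTermMatrix(cleaned_corpus2,
                                    control = list(weighting = weightTfIdf))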
# TF-IDF
dtm_tfidf = DocumentTermMatrix(x=cleaned_corpus,
control = list(normalize=TRUE))
xdtm_tfidf = removeSparseTerms(dtm_tfidf,sparse = 0.95)
xdtm_tfidf = as.data.frame(as.matrix(xdtm_tfidf))
sort(colSums(xdtm_tfidf),decreasing = T)
## "", "tesla", "s", "not", "car",
## 4172 802 615 601 452
## "be", "do", "just", "yes", "make",
## 405 391 377 376 365
## "") "good", "need", "can", "like",
## 356 331 322 315 307
## "year", "model", "work", "go", "get",
## 301 299 298 292 287
## "time", "great", "much", "one", "re",
## 257 252 238 231 227
## "come", "use", "next", "spacex", "high",
## 220 212 204 201 197
## "also", "people", "true", "now", "soon",
## 188 186 183 174 174
## "rocket", "even", "team", "take", "better",
## 170 169 169 165 164
## "yeah", "think", "look", "production", "right",
## 163 162 159 159 158
## "sure", "first", "probably", "month", "engine",
## 158 157 157 152 152
## "exactly", "many", "we", "new", "say",
## 152 151 149 148 148
## "starship", "week", "day", "launch", "thanks",
## 148 147 146 145 143
## "way", "actually", "test", "maybe", "see",
## 142 139 138 137 132
## "love", "point", "lot", "falcon", "know",
## 132 130 130 130 127
## "company", "still", "change", "design", "thing",
## 122 121 118 117 115
## "almost", "x", "really", "best", "want",
## 114 114 114 112 112
## "drive", "cost", "mar", "super", "hard",
## 110 108 107 105 104
## "software", "build", "autopilot", "start", "order",
## 104 103 103 102 100
## "full", "heavy", "system", "mean", "power",
## 99 98 96 95 95
## "m", "back", "flight", "earth", "part",
## 94 94 93 93 93
## "try", "ok", "haha", "landing", "cool",
## 92 92 91 89 89
## "sound", "long", "end", "give", "future",
## 88 87 87 86 85
## "add", "less", "hopefully", "big", "real",
## 85 84 83 83 82
## "well", "definitely", "improve", "pretty", "today",
## 80 80 80 80 75
head(xdtm_tfidf, 10)
## "can", "test", "", "almost", "give", "go", "good", "launch", "m", "much",
## 1 1 1 0 0 0 0 0 0 0 0
## 2 0 0 4 1 1 1 1 3 1 1
## 3 0 0 2 0 0 0 0 0 0 0
## 4 0 0 2 0 1 0 3 1 1 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 0 2 0 0 0 0 1 0 0
## 9 0 0 1 0 0 0 0 0 0 0
## 10 1 0 0 0 0 0 0 0 0 0
## "need", "point", "rocket", "time", "today", "we", "x", "") "change", "high",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 1 1 1 3 1 1 3 0 0 0
## 3 0 0 0 0 0 0 0 1 1 1
## 4 0 0 1 0 0 0 0 0 0 1
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0 0
## 7 0 0 0 0 0 0 0 1 0 0
## 8 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 1 0 0
## 10 0 0 0 0 0 0 0 0 0 0
## "like", "still", "actually", "come", "future", "landing", "maybe", "try",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 1 1 0 0 0 0 0 0
## 4 0 0 1 1 1 1 1 1
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 2 0 0
## 9 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0
## "best", "car", "model", "s", "year", "flight", "next", "now", "one", "work",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 1 1 1 1 1 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 1 1 2 1 1
## 9 0 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0
## "full", "back", "be", "build", "do", "get", "mar", "not", "really", "see",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0
## 9 1 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0
## "lot", "soon", "new", "system", "company", "even", "long", "people",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0
## "think", "drive", "hopefully", "tesla", "well", "just", "spacex", "take",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0
## "look", "falcon", "want", "mean", "say", "design", "power", "week", "thing",
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0
## "many", "true", "end", "less", "way", "hard", "big", "ok", "month", "day",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0
## "haha", "real", "thanks", "cool", "know", "love", "also", "team", "great",
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0
## "first", "right", "yeah", "probably", "better", "re", "software", "start",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0
## "yes", "cost", "definitely", "production", "heavy", "use", "super", "add",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0
## "autopilot", "sure", "make", "improve", "sound", "engine", "earth", "order",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0
## "exactly", "part", "pretty", "starship",
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## 10 0 0 0 0
tesla_tfidf = cbind(tesla, xdtm_tfidf)
head(tesla_tfidf, 3)
## date
## 1 2015-01-30
## 2 2015-02-08
## 3 2015-02-10
## tweet
## 1 If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE
## 2 Prob good though. Will give us time to replace 1st stage video transmitter (not needed for launch, but nice to have). . Air Force tracking radar went down. Launch postponed to same time tomorrow. . Rocket reentry will be much tougher this time around due to deep space mission. Almost 2X force and 4X heat. Plenty of hydraulic fluid tho. . Launching our 1st deep space mission today. Headed to Earth-Sun L1 gravity null point at 1M miles, 4X further than moon.
## 3 Launch postponed to tomorrow due to high winds at the Cape, but Dragon still inbound from orbit in 90 mins . Extreme wind shear over Cape Canaveral. Feels like a sledgehammer when supersonic in the vertical. Hoping it changes … . "What Are The Civilian Applications?" https://m.youtube.com/watch?v=M8YjvHYbZ9w …
## "can", "test", "", "almost", "give", "go", "good", "launch", "m", "much",
## 1 1 1 0 0 0 0 0 0 0 0
## 2 0 0 4 1 1 1 1 3 1 1
## 3 0 0 2 0 0 0 0 0 0 0
## "need", "point", "rocket", "time", "today", "we", "x", "") "change", "high",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 1 1 1 3 1 1 3 0 0 0
## 3 0 0 0 0 0 0 0 1 1 1
## "like", "still", "actually", "come", "future", "landing", "maybe", "try",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 1 1 0 0 0 0 0 0
## "best", "car", "model", "s", "year", "flight", "next", "now", "one", "work",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## "full", "back", "be", "build", "do", "get", "mar", "not", "really", "see",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## "lot", "soon", "new", "system", "company", "even", "long", "people", "think",
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## "drive", "hopefully", "tesla", "well", "just", "spacex", "take", "look",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## "falcon", "want", "mean", "say", "design", "power", "week", "thing", "many",
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## "true", "end", "less", "way", "hard", "big", "ok", "month", "day", "haha",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## "real", "thanks", "cool", "know", "love", "also", "team", "great", "first",
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## "right", "yeah", "probably", "better", "re", "software", "start", "yes",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## "cost", "definitely", "production", "heavy", "use", "super", "add",
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## "autopilot", "sure", "make", "improve", "sound", "engine", "earth", "order",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## "exactly", "part", "pretty", "starship",
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
This code performs sentiment analysis on a collection of tweets related to Tesla using the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis tool. The code loops through each tweet in the “tesla” data frame and applies the get_vader function to extract the sentiment scores.
The sentiment scores are then appended to empty lists for each score type, including “word_scores”, “compound”, “pos”, “neu”, “neg”, and “but_count”. The length() function is used to determine the index of the next available spot in each list.
After processing all tweets, the code combines the lists into a data frame named vader_df using the data.frame() function. The unlist() function is used to flatten the lists into a single vector for each score type. The resulting data frame contains one row per tweet and columns for each score type.
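As an aside, the vader package also ships a vectorized helper, vader_df(), which scores a whole vector of texts and returns the results as a data frame in one call; assuming it accepts the same options as get_vader(), the loop below could be replaced with a sketch like:
# Vectorized alternative: one row of scores per tweet
vader_scores = vader_df(tesla$tweet, incl_nt = T, neu_set = T, rm_qm = T)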
# Create empty lists
word_scores_list = list()
compound_list = list()
pos_list = list()
neu_list = list()
neg_list = list()
but_count_list = list()
# Loop through tweets
for (row in tesla$tweet) {
# Get sentiment scores
sentiment = get_vader(row, incl_nt = T, neu_set = T, rm_qm = T)
# Append scores to lists
word_scores_list[[length(word_scores_list) + 1]] = sentiment["word_scores"]
compound_list[[length(compound_list) + 1]] = sentiment["compound"]
pos_list[[length(pos_list) + 1]] = sentiment["pos"]
neu_list[[length(neu_list) + 1]] = sentiment["neu"]
neg_list[[length(neg_list) + 1]] = sentiment["neg"]
but_count_list[[length(but_count_list) + 1]] = sentiment["but_count"]
}
# Combine lists into data frame
vader_df = data.frame(
word_scores = unlist(word_scores_list),
compound = unlist(compound_list),
pos = unlist(pos_list),
neu = unlist(neu_list),
neg = unlist(neg_list),
but_count = unlist(but_count_list)
)
colnames(vader_df)
## [1] "word_scores" "compound" "pos" "neu" "neg"
## [6] "but_count"
tesla_sentiment = cbind(tesla_tfidf, vader_df)
print(nrow(tesla_sentiment))
## [1] 1360
head(tesla_sentiment, 5)
## date
## 1 2015-01-30
## 2 2015-02-08
## 3 2015-02-10
## 4 2015-02-11
## 5 2015-02-12
## tweet
## 1 If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE
## 2 Prob good though. Will give us time to replace 1st stage video transmitter (not needed for launch, but nice to have). . Air Force tracking radar went down. Launch postponed to same time tomorrow. . Rocket reentry will be much tougher this time around due to deep space mission. Almost 2X force and 4X heat. Plenty of hydraulic fluid tho. . Launching our 1st deep space mission today. Headed to Earth-Sun L1 gravity null point at 1M miles, 4X further than moon.
## 3 Launch postponed to tomorrow due to high winds at the Cape, but Dragon still inbound from orbit in 90 mins . Extreme wind shear over Cape Canaveral. Feels like a sledgehammer when supersonic in the vertical. Hoping it changes … . "What Are The Civilian Applications?" https://m.youtube.com/watch?v=M8YjvHYbZ9w …
## 4 Rocket soft landed in the ocean within 10m of target & nicely vertical! High probability of good droneship landing in non-stormy weather. . Primary mission on target. Spacecraft head towards the sun! All good there. . @DanielLockyer We could actually do that...maybe we should . Planning a significant upgrade of the droneship for future missions to handle literally anything. Maybe give it a Merlin for good measure :) . Can't delay any longer. Must proceed with primary mission to launch the Deep Space Climate Observatory spacecraft. . Mega storm preventing droneship from remaining on station, so rocket will try to land on water. Survival probability <1%. . Coming home pic.twitter.com/FmrmYs6R6V . Dragon splashdown off the California coast pic.twitter.com/4Bvfmei8I3
## 5 Landing on a stormy sea pic.twitter.com/7EY25g3IU5
## "can", "test", "", "almost", "give", "go", "good", "launch", "m", "much",
## 1 1 1 0 0 0 0 0 0 0 0
## 2 0 0 4 1 1 1 1 3 1 1
## 3 0 0 2 0 0 0 0 0 0 0
## 4 0 0 2 0 1 0 3 1 1 0
## 5 0 0 0 0 0 0 0 0 0 0
## "need", "point", "rocket", "time", "today", "we", "x", "") "change", "high",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 1 1 1 3 1 1 3 0 0 0
## 3 0 0 0 0 0 0 0 1 1 1
## 4 0 0 1 0 0 0 0 0 0 1
## 5 0 0 0 0 0 0 0 0 0 0
## "like", "still", "actually", "come", "future", "landing", "maybe", "try",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 1 1 0 0 0 0 0 0
## 4 0 0 1 1 1 1 1 1
## 5 0 0 0 0 0 0 0 0
## "best", "car", "model", "s", "year", "flight", "next", "now", "one", "work",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## "full", "back", "be", "build", "do", "get", "mar", "not", "really", "see",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## "lot", "soon", "new", "system", "company", "even", "long", "people", "think",
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## "drive", "hopefully", "tesla", "well", "just", "spacex", "take", "look",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## "falcon", "want", "mean", "say", "design", "power", "week", "thing", "many",
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## "true", "end", "less", "way", "hard", "big", "ok", "month", "day", "haha",
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## "real", "thanks", "cool", "know", "love", "also", "team", "great", "first",
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## "right", "yeah", "probably", "better", "re", "software", "start", "yes",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## "cost", "definitely", "production", "heavy", "use", "super", "add",
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## "autopilot", "sure", "make", "improve", "sound", "engine", "earth", "order",
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## "exactly", "part", "pretty", "starship",
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## word_scores
## 1 {0, 0, 0, 1.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 2 {0, 0.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 3 {0, -0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.25, 0, 0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 4 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.9, 0, 0, 0, 0, 1.9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.9, 0, 2, 0, 0, 0.962, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 5 {0, 0, 0, 0, 0, 0}
## compound pos neu neg but_count
## 1 0.318 0.141 0.859 0 0
## 2 0.67 0.088 0.887 0.025 1
## 3 0.785 0.139 0.835 0.027 1
## 4 0.95 0.146 0.846 0.008 0
## 5 0 0 1 0 0
write.csv(tesla_sentiment, 'pre_finance.csv')
This code is for downloading stock price data for Tesla from Yahoo Finance within a specified date range.
First, the stock symbol “TSLA” is defined and the start and end dates are specified using as.Date().
getSymbols() is then used to download the historical stock price data for TSLA from Yahoo Finance, with the data source specified as “yahoo” and the date range specified using the from and to arguments.
Next, Ad() is used to extract the adjusted close price data from the downloaded data. as.data.frame() is used to convert the data into a dataframe format, and rownames_to_column() is used to move the date information from the row names to a separate “date” column in the dataframe. The resulting dataframe is assigned to tsla_prices.
# define the stock symbol and data range
symbol <- "TSLA"
start_date <- as.Date("2015-01-29")
end_date <- as.Date("2020-07-15")
# get the stock price data from Yahoo Finance
getSymbols(symbol, src = "yahoo", from = start_date, to = end_date)
## [1] "TSLA"
# extract the adjusted close price data
tsla_prices <- Ad(get(symbol))
tsla_prices <- as.data.frame(tsla_prices) %>%
rownames_to_column(var = "date")
# view the last 6 prices
tail(tsla_prices)
## date TSLA.Adjusted
## 1369 2020-07-07 92.65733
## 1370 2020-07-08 91.05867
## 1371 2020-07-09 92.95200
## 1372 2020-07-10 102.97667
## 1373 2020-07-13 99.80400
## 1374 2020-07-14 101.12000
This code converts the “date” column in the tesla_sentiment and tsla_prices dataframes from character or factor type to Date type using as.Date().
This is useful when combining or merging dataframes on the basis of the “date” column, as it ensures that the dates are in the same format and can be compared properly.
tesla_sentiment$date = as.Date(tesla_sentiment$date)
tsla_prices$date = as.Date(tsla_prices$date)
tail(tesla_sentiment, 3)
## date
## 1358 2020-07-12
## 1359 2020-07-13
## 1360 2020-07-14
## tweet
## 1358 Thanks :) . That is the near-term danger of AI . I didn’t mind DA2. DA1 was awesome. Mass Effect 2 (talking about sequels) is amazing.
## 1359 Reusability is essential. A rocket that is single use is just as absurd as a single use airplane. F9 engines already fire 3 times per flight. . Wild times! . We’re being extra paranoid. Maximizing probability of successful launch is paramount. . Welcome anytime . Well, I do care very much about sustainability of civilization, but there is some truth to the irony part haha . Yes, in plan. Superchargers and public high power wall connectors will keep growing exponentially every year. . 👀 . I think so . Doing range testing now. Number will be significantly higher than 300. Extremely good for any EV, especially an SUV. . We have reduced pricing on Model Y LR dual motor & will offer a LR single motor Y in a few months, which improves affordability, while still keeping the product excellent . Yes . It may be able to reach 250kW at low states of charge . No, as range would be unacceptably low (< 250 mile EPA) . We had to increase some wire thicknesses in S/X to reduce resistive heating. Technically, won’t be quite 1000 mph charging, as X especially is much bigger than 3. . Hell of a ride!
## 1360 Cute . Wow
## "can", "test", "", "almost", "give", "go", "good", "launch", "m", "much",
## 1358 0 0 1 0 0 0 0 0 0 0
## 1359 0 1 7 0 0 0 1 1 0 2
## 1360 0 0 0 0 0 0 0 0 0 0
## "need", "point", "rocket", "time", "today", "we", "x", "") "change",
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 1 2 0 0 1 0 0
## 1360 0 0 0 0 0 0 0 0 0
## "high", "like", "still", "actually", "come", "future", "landing", "maybe",
## 1358 0 0 0 0 0 0 0 0
## 1359 1 0 1 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0
## "try", "best", "car", "model", "s", "year", "flight", "next", "now",
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 0 1 0 1 1 0 1
## 1360 0 0 0 0 0 0 0 0 0
## "one", "work", "full", "back", "be", "build", "do", "get", "mar", "not",
## 1358 0 0 0 0 1 0 1 0 0 1
## 1359 0 0 0 0 0 0 0 0 0 1
## 1360 0 0 0 0 0 0 0 0 0 0
## "really", "see", "lot", "soon", "new", "system", "company", "even",
## 1358 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0
## "long", "people", "think", "drive", "hopefully", "tesla", "well", "just",
## 1358 0 0 0 0 0 0 0 0
## 1359 0 0 1 0 0 0 1 1
## 1360 0 0 0 0 0 0 0 0
## "spacex", "take", "look", "falcon", "want", "mean", "say", "design",
## 1358 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0
## "power", "week", "thing", "many", "true", "end", "less", "way", "hard",
## 1358 0 0 0 0 0 0 0 0 0
## 1359 1 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## "big", "ok", "month", "day", "haha", "real", "thanks", "cool", "know",
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 1 0 1 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## "love", "also", "team", "great", "first", "right", "yeah", "probably",
## 1358 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0
## "better", "re", "software", "start", "yes", "cost", "definitely",
## 1358 0 0 0 0 0 0 0
## 1359 0 1 0 0 2 0 0
## 1360 0 0 0 0 0 0 0
## "production", "heavy", "use", "super", "add", "autopilot", "sure", "make",
## 1358 0 0 0 0 0 0 0 0
## 1359 0 0 2 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0
## "improve", "sound", "engine", "earth", "order", "exactly", "part",
## 1358 0 0 0 0 0 0 0
## 1359 1 0 1 0 0 0 1
## 1360 0 0 0 0 0 0 0
## "pretty", "starship",
## 1358 0 0
## 1359 0 0
## 1360 0 0
## word_scores
## 1358 {1.9, 2, 0, 0, 0, 0, 0, -2.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3.1, 0, 0, 0, 0, 0, 0, 0, 2.8}
## 1359 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.5, 0, 0, 0, 1.4, 0, 0, 0, 0, 1, 0, 0, 0.55, 0, 0, 1.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, -0.3, 0, 3, 0, 2.55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.45, 0, 0, 0, 0, 0, 0, 0, 3.2895, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 4.05, 0, 2.55, 0, 0, 0, 0, 0, 0, 0.15, 0, 0, -1.65, 0, 0, 0, 0, -1.8, 0, 0, 0, 0, 0, -1.65, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -5.4, 0, 0, 0}
## 1360 {2, 0, 2.8}
## compound pos neu neg but_count
## 1358 0.886 0.361 0.55 0.089 0
## 1359 0.973 0.18 0.742 0.079 1
## 1360 0.778 0.872 0.128 0 0
colnames(tesla_sentiment)
## [1] "date" "tweet" "\"can\"," "\"test\","
## [5] "\"\"," "\"almost\"," "\"give\"," "\"go\","
## [9] "\"good\"," "\"launch\"," "\"m\"," "\"much\","
## [13] "\"need\"," "\"point\"," "\"rocket\"," "\"time\","
## [17] "\"today\"," "\"we\"," "\"x\"," "\"\")"
## [21] "\"change\"," "\"high\"," "\"like\"," "\"still\","
## [25] "\"actually\"," "\"come\"," "\"future\"," "\"landing\","
## [29] "\"maybe\"," "\"try\"," "\"best\"," "\"car\","
## [33] "\"model\"," "\"s\"," "\"year\"," "\"flight\","
## [37] "\"next\"," "\"now\"," "\"one\"," "\"work\","
## [41] "\"full\"," "\"back\"," "\"be\"," "\"build\","
## [45] "\"do\"," "\"get\"," "\"mar\"," "\"not\","
## [49] "\"really\"," "\"see\"," "\"lot\"," "\"soon\","
## [53] "\"new\"," "\"system\"," "\"company\"," "\"even\","
## [57] "\"long\"," "\"people\"," "\"think\"," "\"drive\","
## [61] "\"hopefully\"," "\"tesla\"," "\"well\"," "\"just\","
## [65] "\"spacex\"," "\"take\"," "\"look\"," "\"falcon\","
## [69] "\"want\"," "\"mean\"," "\"say\"," "\"design\","
## [73] "\"power\"," "\"week\"," "\"thing\"," "\"many\","
## [77] "\"true\"," "\"end\"," "\"less\"," "\"way\","
## [81] "\"hard\"," "\"big\"," "\"ok\"," "\"month\","
## [85] "\"day\"," "\"haha\"," "\"real\"," "\"thanks\","
## [89] "\"cool\"," "\"know\"," "\"love\"," "\"also\","
## [93] "\"team\"," "\"great\"," "\"first\"," "\"right\","
## [97] "\"yeah\"," "\"probably\"," "\"better\"," "\"re\","
## [101] "\"software\"," "\"start\"," "\"yes\"," "\"cost\","
## [105] "\"definitely\"," "\"production\"," "\"heavy\"," "\"use\","
## [109] "\"super\"," "\"add\"," "\"autopilot\"," "\"sure\","
## [113] "\"make\"," "\"improve\"," "\"sound\"," "\"engine\","
## [117] "\"earth\"," "\"order\"," "\"exactly\"," "\"part\","
## [121] "\"pretty\"," "\"starship\"," "word_scores" "compound"
## [125] "pos" "neu" "neg" "but_count"
This code performs the following tasks:
It merges two data frames, tesla_sentiment and tsla_prices, on the "date" column using the merge() function. The argument all.y = TRUE ensures that all rows from the tsla_prices data frame are kept in the merged data frame, even for dates with no tweets.
It adds a new column called "increase" containing binary values: 1 if TSLA.Adjusted rose relative to the previous trading day, 0 otherwise (the first row, which has no prior day, is set to 0). This will be our target variable.
It then sets NA values to 0. Note the indexing subtlety: because : binds tighter than -, the expression 3:ncol(merged_df)-1 evaluates to 2:(ncol(merged_df)-1), so the replacement touches every column except "date" and "increase"; this is why dates without tweets end up with the literal value 0 in their tweet column.
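For reference, the same target variable can be built with dplyr's lag(), which makes the previous-trading-day comparison explicit; an equivalent sketch:
# 1 if the adjusted close rose versus the previous row, else 0;
# the first row has no prior day, so coalesce() maps its NA to 0
merged_df = merged_df %>%
  mutate(increase = coalesce(as.integer(TSLA.Adjusted > lag(TSLA.Adjusted)), 0L))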
merged_df = merge(tesla_sentiment, tsla_prices, by="date", all.y = TRUE)
# Add increase column
merged_df$increase <- c(0, ifelse(diff(merged_df$TSLA.Adjusted) > 0, 1, 0))
merged_df[,3:ncol(merged_df)-1][is.na(merged_df[, 3:ncol(merged_df)-1])] = 0
tail(merged_df, 2)
## date
## 1373 2020-07-13
## 1374 2020-07-14
## tweet
## 1373 Reusability is essential. A rocket that is single use is just as absurd as a single use airplane. F9 engines already fire 3 times per flight. . Wild times! . We’re being extra paranoid. Maximizing probability of successful launch is paramount. . Welcome anytime . Well, I do care very much about sustainability of civilization, but there is some truth to the irony part haha . Yes, in plan. Superchargers and public high power wall connectors will keep growing exponentially every year. . 👀 . I think so . Doing range testing now. Number will be significantly higher than 300. Extremely good for any EV, especially an SUV. . We have reduced pricing on Model Y LR dual motor & will offer a LR single motor Y in a few months, which improves affordability, while still keeping the product excellent . Yes . It may be able to reach 250kW at low states of charge . No, as range would be unacceptably low (< 250 mile EPA) . We had to increase some wire thicknesses in S/X to reduce resistive heating. Technically, won’t be quite 1000 mph charging, as X especially is much bigger than 3. . Hell of a ride!
## 1374 Cute . Wow
## "can", "test", "", "almost", "give", "go", "good", "launch", "m", "much",
## 1373 0 1 7 0 0 0 1 1 0 2
## 1374 0 0 0 0 0 0 0 0 0 0
## "need", "point", "rocket", "time", "today", "we", "x", "") "change",
## 1373 0 0 1 2 0 0 1 0 0
## 1374 0 0 0 0 0 0 0 0 0
## "high", "like", "still", "actually", "come", "future", "landing", "maybe",
## 1373 1 0 1 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0
## "try", "best", "car", "model", "s", "year", "flight", "next", "now",
## 1373 0 0 0 1 0 1 1 0 1
## 1374 0 0 0 0 0 0 0 0 0
## "one", "work", "full", "back", "be", "build", "do", "get", "mar", "not",
## 1373 0 0 0 0 0 0 0 0 0 1
## 1374 0 0 0 0 0 0 0 0 0 0
## "really", "see", "lot", "soon", "new", "system", "company", "even",
## 1373 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0
## "long", "people", "think", "drive", "hopefully", "tesla", "well", "just",
## 1373 0 0 1 0 0 0 1 1
## 1374 0 0 0 0 0 0 0 0
## "spacex", "take", "look", "falcon", "want", "mean", "say", "design",
## 1373 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0
## "power", "week", "thing", "many", "true", "end", "less", "way", "hard",
## 1373 1 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## "big", "ok", "month", "day", "haha", "real", "thanks", "cool", "know",
## 1373 0 0 1 0 1 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## "love", "also", "team", "great", "first", "right", "yeah", "probably",
## 1373 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0
## "better", "re", "software", "start", "yes", "cost", "definitely",
## 1373 0 1 0 0 2 0 0
## 1374 0 0 0 0 0 0 0
## "production", "heavy", "use", "super", "add", "autopilot", "sure", "make",
## 1373 0 0 2 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0
## "improve", "sound", "engine", "earth", "order", "exactly", "part",
## 1373 1 0 1 0 0 0 1
## 1374 0 0 0 0 0 0 0
## "pretty", "starship",
## 1373 0 0
## 1374 0 0
## word_scores
## 1373 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.5, 0, 0, 0, 1.4, 0, 0, 0, 0, 1, 0, 0, 0.55, 0, 0, 1.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, -0.3, 0, 3, 0, 2.55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.45, 0, 0, 0, 0, 0, 0, 0, 3.2895, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 4.05, 0, 2.55, 0, 0, 0, 0, 0, 0, 0.15, 0, 0, -1.65, 0, 0, 0, 0, -1.8, 0, 0, 0, 0, 0, -1.65, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -5.4, 0, 0, 0}
## 1374 {2, 0, 2.8}
## compound pos neu neg but_count TSLA.Adjusted increase
## 1373 0.973 0.18 0.742 0.079 1 99.804 0
## 1374 0.778 0.872 0.128 0 0 101.120 1
write.csv(merged_df, file="tesla_sentiment.csv", row.names = F)
tesla_sentiment = read.csv('tesla_sentiment.csv')
head(tesla_sentiment, 2)
## date
## 1 2015-01-29
## 2 2015-01-30
## tweet
## 1 0
## 2 If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE
## X.can.. X.test.. X... X.almost.. X.give.. X.go.. X.good.. X.launch.. X.m..
## 1 0 0 0 0 0 0 0 0 0
## 2 1 1 0 0 0 0 0 0 0
## X.much.. X.need.. X.point.. X.rocket.. X.time.. X.today.. X.we.. X.x.. X....1
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## X.change.. X.high.. X.like.. X.still.. X.actually.. X.come.. X.future..
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## X.landing.. X.maybe.. X.try.. X.best.. X.car.. X.model.. X.s.. X.year..
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## X.flight.. X.next.. X.now.. X.one.. X.work.. X.full.. X.back.. X.be..
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## X.build.. X.do.. X.get.. X.mar.. X.not.. X.really.. X.see.. X.lot.. X.soon..
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## X.new.. X.system.. X.company.. X.even.. X.long.. X.people.. X.think..
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## X.drive.. X.hopefully.. X.tesla.. X.well.. X.just.. X.spacex.. X.take..
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## X.look.. X.falcon.. X.want.. X.mean.. X.say.. X.design.. X.power.. X.week..
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## X.thing.. X.many.. X.true.. X.end.. X.less.. X.way.. X.hard.. X.big.. X.ok..
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## X.month.. X.day.. X.haha.. X.real.. X.thanks.. X.cool.. X.know.. X.love..
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## X.also.. X.team.. X.great.. X.first.. X.right.. X.yeah.. X.probably..
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## X.better.. X.re.. X.software.. X.start.. X.yes.. X.cost.. X.definitely..
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## X.production.. X.heavy.. X.use.. X.super.. X.add.. X.autopilot.. X.sure..
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## X.make.. X.improve.. X.sound.. X.engine.. X.earth.. X.order.. X.exactly..
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## X.part.. X.pretty.. X.starship..
## 1 0 0 0
## 2 0 0 0
## word_scores compound pos neu neg
## 1 0 0.000 0.000 0.000 0
## 2 {0, 0, 0, 1.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} 0.318 0.141 0.859 0
## but_count TSLA.Adjusted increase
## 1 0 13.68000 0
## 2 0 13.57333 0
This code block cleans up the column names of the tesla_sentiment data frame.
The first line uses a regular expression to strip the 'X.' prefixes and '..' suffixes that read.csv's name sanitization produced from the quoted term names, storing the result in new_col_names. The second line assigns the new column names to tesla_sentiment using the colnames function. The third line keeps all columns whose names are not '', 'm', 'x', '.1', or 's', using the negation operator ! with the %in% function, and assigns the resulting data frame to final_data_new. The head function then displays the first five rows of the cleaned data frame.
# Remove 'X.' and '..' from column names
new_col_names = gsub('X\\.|\\.\\.', '', colnames(tesla_sentiment))
# Assign new column names to the data frame
colnames(tesla_sentiment) <- new_col_names
# Drop columns "", "m", "x", ".1", and "s"
final_data_new <- tesla_sentiment[, !names(tesla_sentiment) %in% c('', 'm', 'x', '.1', 's')]
head(final_data_new, 5)
## date
## 1 2015-01-29
## 2 2015-01-30
## 3 2015-02-02
## 4 2015-02-03
## 5 2015-02-04
## tweet
## 1 0
## 2 If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE
## 3 0
## 4 0
## 5 0
## can test almost give go good launch much need point rocket time today we
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## change high like still actually come future landing maybe try best car model
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## year flight next now one work full back be build do get mar not really see
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## lot soon new system company even long people think drive hopefully tesla well
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## just spacex take look falcon want mean say design power week thing many true
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## end less way hard big ok month day haha real thanks cool know love also team
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## great first right yeah probably better re software start yes cost definitely
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## production heavy use super add autopilot sure make improve sound engine earth
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## order exactly part pretty starship
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## word_scores compound pos neu neg
## 1 0 0.000 0.000 0.000 0
## 2 {0, 0, 0, 1.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} 0.318 0.141 0.859 0
## 3 0 0.000 0.000 0.000 0
## 4 0 0.000 0.000 0.000 0
## 5 0 0.000 0.000 0.000 0
## but_count TSLA.Adjusted increase
## 1 0 13.68000 0
## 2 0 13.57333 0
## 3 0 14.06267 1
## 4 0 14.55733 1
## 5 0 14.57000 1
This code creates a bar chart of the "increase" column in the final_data_new data frame. The "increase" column contains binary values (0 or 1) representing whether the stock price rose on a particular day.
The code uses the ggplot2 library. The data parameter specifies the data frame to use, and the aes function maps the x-axis to the increase column.
The geom_bar function draws one bar per class, with the fill parameter setting the fill colors of the bars.
The theme_bw function sets the theme of the plot to a black and white style.
The xlab function sets the label for the x-axis.
Finally, the coord_flip function flips the x and y axes so that the bars run from left to right rather than from bottom to top.
#visualizing increase vs decrease
ggplot(data=final_data_new,aes(x=increase))+
geom_bar(fill=c('orange', 'brown'))+
theme_bw()+
xlab('Stock Decrease (0) or Increase (1)') +
coord_flip()
final_data_new_visual <- final_data_new[, !names(final_data_new) %in% c('word_scores', 'tweet','date','compound', 'pos', 'neu', 'neg', 'but_count', 'TSLA.Adjusted', 'increase')]
head(final_data_new_visual, 3)
## can test almost give go good launch much need point rocket time today we
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## change high like still actually come future landing maybe try best car model
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## year flight next now one work full back be build do get mar not really see
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## lot soon new system company even long people think drive hopefully tesla well
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## just spacex take look falcon want mean say design power week thing many true
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## end less way hard big ok month day haha real thanks cool know love also team
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## great first right yeah probably better re software start yes cost definitely
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## production heavy use super add autopilot sure make improve sound engine earth
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## order exactly part pretty starship
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
This code visualizes the top 25 words appearing in Elon Musk’s tweets.
col_sums calculates the sum of each column in final_data_new_visual, which contains the frequency of each word used in the tweets.
final_data_new_visual_sums creates a new data frame containing two columns: word and value, where word is the name of the word and value is the sum of its frequency in the tweets.
final_data_new_visual_sums_top25 creates a subset of final_data_new_visual_sums containing only the top 25 words sorted by their frequency in descending order.
Finally, ggplot is used to create a bar plot with the words on one axis and the sum of their frequencies on the other. The reorder function orders the words by their value so that, after coord_flip switches the axes, the most frequent words appear at the top. The theme function rotates the axis labels by 90 degrees, and the bars are filled with the color "sienna3".
#visualizing top 25 words appearing in Elon's Tweet
col_sums <- colSums(final_data_new_visual)
final_data_new_visual_sums <- data.frame(
word = names(col_sums),
value = col_sums
)
final_data_new_visual_sums_top25 <- final_data_new_visual_sums[order(final_data_new_visual_sums$value, decreasing = TRUE),][1:25,]
# visualize the results with a bar plot
ggplot(data = final_data_new_visual_sums_top25, aes(x = reorder(word, value), y = value)) +
geom_bar(stat = "identity", fill = "sienna3") +
labs(x = "Word", y = "Sum of Values") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()
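An equivalent dplyr pipeline for the top-25 subset, shown as a sketch (it should reproduce the base-R ordering above):
final_data_new_visual_sums %>%
arrange(desc(value)) %>%
slice_head(n = 25)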
This code performs some date manipulations and summarizes the data on a yearly basis.
The lubridate library is loaded, which provides some useful functions for working with dates. Then, the as.Date() function is used to convert the date column of the final_data_new data frame to a date format. The quarter() and year() functions from lubridate are then used to create new columns quarter and year, which extract the quarter and year information from the date column.
Next, the %>% operator from the dplyr package is used to pipe the final_data_new data frame to a series of operations. The mutate() function creates a new column year, which extracts the year information from the date column. The data is then grouped by year using the group_by() function, and the summarise() function is used to calculate the mean values of the pos, neu, compound, neg, and TSLA.Adjusted columns for each year. The resulting data frame df_yearly contains the summarized data.
library(lubridate)
final_data_new$date <- as.Date(final_data_new$date, format = "%Y-%m-%d")
final_data_new$quarter <- quarter(final_data_new$date)
final_data_new$year <- year(final_data_new$date)
df_yearly <- final_data_new %>%
mutate(year = year(date)) %>%
group_by(year) %>%
summarise(pos = mean(pos), neu = mean(neu), compound = mean(compound), neg = mean(neg), TSLA.Adjusted = mean(TSLA.Adjusted))
head(final_data_new, 5)
## date
## 1 2015-01-29
## 2 2015-01-30
## 3 2015-02-02
## 4 2015-02-03
## 5 2015-02-04
## tweet
## 1 0
## 2 If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE
## 3 0
## 4 0
## 5 0
## can test almost give go good launch much need point rocket time today we
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## change high like still actually come future landing maybe try best car model
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## year flight next now one work full back be build do get mar not really see
## 1 2015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 2015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 2015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 2015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 2015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## lot soon new system company even long people think drive hopefully tesla well
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## just spacex take look falcon want mean say design power week thing many true
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## end less way hard big ok month day haha real thanks cool know love also team
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## great first right yeah probably better re software start yes cost definitely
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## production heavy use super add autopilot sure make improve sound engine earth
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## order exactly part pretty starship
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## word_scores compound pos neu neg
## 1 0 0.000 0.000 0.000 0
## 2 {0, 0, 0, 1.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} 0.318 0.141 0.859 0
## 3 0 0.000 0.000 0.000 0
## 4 0 0.000 0.000 0.000 0
## 5 0 0.000 0.000 0.000 0
## but_count TSLA.Adjusted increase quarter
## 1 0 13.68000 0 1
## 2 0 13.57333 0 1
## 3 0 14.06267 1 1
## 4 0 14.55733 1 1
## 5 0 14.57000 1 1
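Since a quarter column was created as well, the same summary can be produced at quarterly granularity; a sketch (df_quarterly is a hypothetical name, not used further below):
df_quarterly <- final_data_new %>%
group_by(year, quarter) %>%
summarise(compound = mean(compound), TSLA.Adjusted = mean(TSLA.Adjusted), .groups = "drop")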
This code creates two plots side by side, showing the sentiment scores over time and the TSLA adjusted stock prices over time.
The first plot is created using ggplot and shows four lines, each representing a different sentiment score (positive, neutral, compound, and negative) over the years in the dataset. The x-axis represents the year, and the y-axis represents the sentiment score. The color of each line is manually set using scale_color_manual, and the plot is given a title and a y-axis label.
The second plot is also created using ggplot and shows a single line representing the TSLA adjusted stock price over the years in the dataset. The x-axis and y-axis represent the year and the adjusted stock price, respectively. The plot is given a title and a y-axis label.
The two plots are combined side by side using the grid.arrange() function from the gridExtra package. The ncol argument specifies the number of columns in the resulting grid, and the widths argument specifies the relative widths of the columns.
#Sentiment Score over time vs Stock Price
# Create the first plot for sentiment scores
plot1 <- ggplot(data = df_yearly, aes(x = year)) +
geom_line(aes(y = pos, color = "Positive")) +
geom_line(aes(y = neu, color = "Neutral")) +
geom_line(aes(y = compound, color = "Compound")) +
geom_line(aes(y = neg, color = "Negative")) +
ylab("Sentiment Score") +
ggtitle("Sentiment Scores Over Time") +
scale_color_manual(values = c("Positive" = "blue", "Neutral" = "grey",
"Compound" = "black", "Negative" = "red"))
# Create the second plot for TSLA adjusted stock prices
plot2 <- ggplot(data = df_yearly, aes(x = year)) +
geom_line(aes(y = TSLA.Adjusted)) +
ylab("Adjusted Stock Price") +
ggtitle("TSLA Adjusted Stock Prices Over Time")
# Combine the two plots side by side using grid.arrange() from the gridExtra package
library(gridExtra)
grid.arrange(plot1, plot2, ncol = 2,widths=c(2/3, 1/3))
This code visualizes the top positive and negative words in the final_data_new_visual_sums data frame by joining it with the Bing sentiment lexicon (read in from bing-2.csv) using the inner_join() function from dplyr, so that each word carries a positive or negative label. The resulting dataset is grouped by sentiment.
After that, the code filters each sentiment, orders the rows in descending order by value, selects the top 25 rows for both positive and negative words, and creates a bar plot for each using the ggplot2 package. Finally, the two plots are combined side by side using the grid.arrange() function from the gridExtra package.
#Visualizing top positive words and top negative words
library(tidytext)
bing <- read.csv("C:/Users/Goldleverse/OneDrive/Desktop/Datasets/bing-2.csv")
bing_sentiment <- final_data_new_visual_sums%>%
inner_join(bing)%>%
group_by(sentiment)
# filter to only include positive and negative sentiment
bing_sentiment_pos <- bing_sentiment[bing_sentiment$sentiment == "positive", ]
bing_sentiment_neg <- bing_sentiment[bing_sentiment$sentiment == "negative", ]
# arrange in descending order by value
bing_sentiment_pos <- bing_sentiment_pos[order(-bing_sentiment_pos$value), ]
bing_sentiment_neg <- bing_sentiment_neg[order(-bing_sentiment_neg$value), ]
# select top 25 rows
bing_sentiment_pos_top25 <- head(bing_sentiment_pos, 25)
bing_sentiment_neg_top25 <- head(bing_sentiment_neg, 25)
# Define color palette
my_colors <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999")
# Create bar plot for positive words with custom colors
plot_pos <- ggplot(bing_sentiment_pos_top25, aes(x = reorder(word, value), y = value, fill = word)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = my_colors[1:length(unique(bing_sentiment_pos_top25$word))]) +
labs(title = "Top 25 Positive Words", x = "Word", y = "Value") +
coord_flip()
# Create bar plot for negative words with custom colors
plot_neg <- ggplot(bing_sentiment_neg_top25, aes(x = reorder(word, value), y = value, fill = word)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = my_colors[1:length(unique(bing_sentiment_neg_top25$word))]) +
labs(title = "Top 25 Negative Words", x = "Word", y = "Value")
# Combine the two plots side by side using grid.arrange() from the gridExtra package
library(gridExtra)
grid.arrange(plot_pos, plot_neg, ncol = 2, widths=c(1/2, 1/2))
This code creates a new column in the final_data_new data frame called emojis. The str_extract_all() function from the stringr package is used to extract all the emojis from the tweet column; since a tweet can contain several emojis, the result is a list column holding one character vector per tweet.
final_data_new$emojis <- str_extract_all(final_data_new$tweet, "[\U0001F300-\U0001F64F]|\U0001F680|\U0001F6A5|\U0001F6B2|\U0001F30D|\U0001F30E|\U0001F31E|\U0001F363|\U0001F377|\U0001F37B|\U0001F41F|\U0001F42C|\U0001F355|\U0001F419|[\u2600-\u26FF]")
tail(final_data_new, 2)
## date
## 1373 2020-07-13
## 1374 2020-07-14
## tweet
## 1373 Reusability is essential. A rocket that is single use is just as absurd as a single use airplane. F9 engines already fire 3 times per flight. . Wild times! . We’re being extra paranoid. Maximizing probability of successful launch is paramount. . Welcome anytime . Well, I do care very much about sustainability of civilization, but there is some truth to the irony part haha . Yes, in plan. Superchargers and public high power wall connectors will keep growing exponentially every year. . 👀 . I think so . Doing range testing now. Number will be significantly higher than 300. Extremely good for any EV, especially an SUV. . We have reduced pricing on Model Y LR dual motor & will offer a LR single motor Y in a few months, which improves affordability, while still keeping the product excellent . Yes . It may be able to reach 250kW at low states of charge . No, as range would be unacceptably low (< 250 mile EPA) . We had to increase some wire thicknesses in S/X to reduce resistive heating. Technically, won’t be quite 1000 mph charging, as X especially is much bigger than 3. . Hell of a ride!
## 1374 Cute . Wow
## can test almost give go good launch much need point rocket time today we
## 1373 0 1 0 0 0 1 1 2 0 0 1 2 0 0
## 1374 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## change high like still actually come future landing maybe try best car
## 1373 0 1 0 1 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0 0 0 0
## model year flight next now one work full back be build do get mar not
## 1373 1 2020 1 0 1 0 0 0 0 0 0 0 0 0 1
## 1374 0 2020 0 0 0 0 0 0 0 0 0 0 0 0 0
## really see lot soon new system company even long people think drive
## 1373 0 0 0 0 0 0 0 0 0 0 1 0
## 1374 0 0 0 0 0 0 0 0 0 0 0 0
## hopefully tesla well just spacex take look falcon want mean say design
## 1373 0 0 1 1 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0 0 0 0
## power week thing many true end less way hard big ok month day haha real
## 1373 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0
## 1374 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## thanks cool know love also team great first right yeah probably better re
## 1373 0 0 0 0 0 0 0 0 0 0 0 0 1
## 1374 0 0 0 0 0 0 0 0 0 0 0 0 0
## software start yes cost definitely production heavy use super add
## 1373 0 0 2 0 0 0 0 2 0 0
## 1374 0 0 0 0 0 0 0 0 0 0
## autopilot sure make improve sound engine earth order exactly part pretty
## 1373 0 0 0 1 0 1 0 0 0 1 0
## 1374 0 0 0 0 0 0 0 0 0 0 0
## starship
## 1373 0
## 1374 0
## word_scores
## 1373 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.5, 0, 0, 0, 1.4, 0, 0, 0, 0, 1, 0, 0, 0.55, 0, 0, 1.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, -0.3, 0, 3, 0, 2.55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.45, 0, 0, 0, 0, 0, 0, 0, 3.2895, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 4.05, 0, 2.55, 0, 0, 0, 0, 0, 0, 0.15, 0, 0, -1.65, 0, 0, 0, 0, -1.8, 0, 0, 0, 0, 0, -1.65, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -5.4, 0, 0, 0}
## 1374 {2, 0, 2.8}
## compound pos neu neg but_count TSLA.Adjusted increase quarter emojis
## 1373 0.973 0.180 0.742 0.079 1 99.804 0 3 👀
## 1374 0.778 0.872 0.128 0.000 0 101.120 1 3
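To make the emojis list column concrete, here is a minimal illustration of how str_extract_all() behaves on two made-up strings (the simplified pattern covers a single emoji block and is for demonstration only):
str_extract_all(c("Launch day 🚀🔥", "no emojis here"), "[\U0001F300-\U0001F6FF]")
# returns a list: c("🚀", "🔥") for the first string, character(0) for the second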
This code adds a new column called n_emojis to the final_data_new data frame. It does so by using sapply() to apply a helper function, len_emojis(), to each element of the emojis list column.
The len_emojis() function takes one element of that list (the character vector of emojis extracted from a single tweet) and returns its length, i.e. the number of emojis in that tweet.
Note that the replace_na() step from the tidyr package is left commented out: str_extract_all() returns an empty character vector (of length 0) rather than NA when a tweet contains no emojis, so no replacement is needed before counting.
#final_data_new$emojis = replace_na(final_data_new$emojis, 'None')
len_emojis = function(col){
n_emojis = length(col)
return(n_emojis)
}
final_data_new$n_emojis = sapply(final_data_new$emojis, FUN=len_emojis)
tail(final_data_new)
## date
## 1369 2020-07-07
## 1370 2020-07-08
## 1371 2020-07-09
## 1372 2020-07-10
## 1373 2020-07-13
## 1374 2020-07-14
## tweet
## 1369 If you get past Mars, the asteroids, moons of Jupiter & Saturn, inevitably you reach Uranus! . 🖤✨Carl Sagan ✨🖤 . Essentially. Long-term purpose of my Tesla stock is to help make life multiplanetary to ensure it’s continuance. The massive capital needs are in 10 to 20 years. By then, if we’re fortunate, Tesla’s goal of accelerating sustainable energy & autonomy will be mostly accomplished.
## 1370 Tesla China team is awesome! . Words are a very lossy compression of thought
## 1371 There’s some of that too . True, it sounds so surreal, but the negative propaganda is still all out there & easy to find in social media & press interviews, so it’s not just our imagination! . Make sure to read ur terms & conditions before clicking accept! . Samwise Gamgee . Altho Dumb and Dumber is 🔥🔥 . Progress update August 28 . Sure . If you can’t beat em, join em\nNeuralink mission statement
## 1372 I spoke with Korolev’s family today.\nHe was one of the very best.\nКорольов / Королёв. . In general, we need to improve how podcasts play . SPQR https://m.youtube.com/watch?v=wjOfQfxmTLQ … . True haha . Death is the loss of information . Earning power post augmentation would easily pay for itself (if that’s even necessary). This is the best thing I can think of to ensure that collective human will decides the future. . Absolutely . 👀 . Probably a good one to design & engineer in Germany . Even more . Ok . 👀 . Berlin Model Y is the one to watch. That is a revolution in automotive body engineering (finally). . True . Wow, IHOP & GitHub are close . Best use of the term “Full Stack”? . For sure. This is both great & terrifying. Everything we’ve ever sensed or thought has been electrical signals. The early universe was just a soup of quarks & leptons. How did a very small piece of the Universe start to think of itself as sentient? . Yes . Sorry, should hopefully be soon! . Not actually a payout, just a vesting of stock options. It may never pay out, as the stock can’t be sold for 5 years after exercise. The stock must be bought & income taxes paid, then hold value for 5 years. . No easy way to answer this in a tweet, but helping with dire brain injuries is our first priority. Details Aug 28. . AI symbiosis while u wait
## 1373 Reusability is essential. A rocket that is single use is just as absurd as a single use airplane. F9 engines already fire 3 times per flight. . Wild times! . We’re being extra paranoid. Maximizing probability of successful launch is paramount. . Welcome anytime . Well, I do care very much about sustainability of civilization, but there is some truth to the irony part haha . Yes, in plan. Superchargers and public high power wall connectors will keep growing exponentially every year. . 👀 . I think so . Doing range testing now. Number will be significantly higher than 300. Extremely good for any EV, especially an SUV. . We have reduced pricing on Model Y LR dual motor & will offer a LR single motor Y in a few months, which improves affordability, while still keeping the product excellent . Yes . It may be able to reach 250kW at low states of charge . No, as range would be unacceptably low (< 250 mile EPA) . We had to increase some wire thicknesses in S/X to reduce resistive heating. Technically, won’t be quite 1000 mph charging, as X especially is much bigger than 3. . Hell of a ride!
## 1374 Cute . Wow
## can test almost give go good launch much need point rocket time today we
## 1369 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1371 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1372 2 0 0 0 0 1 0 0 1 0 0 0 1 0
## 1373 0 1 0 0 0 1 1 2 0 0 1 2 0 0
## 1374 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## change high like still actually come future landing maybe try best car
## 1369 0 0 0 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 1 0 0 0 0 0 0 0 0
## 1372 0 0 0 0 1 0 1 0 0 0 3 0
## 1373 0 1 0 1 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0 0 0 0
## model year flight next now one work full back be build do get mar not
## 1369 0 2020 0 0 0 0 0 0 0 0 0 0 0 1 0
## 1370 0 2020 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1371 0 2020 0 0 0 0 0 0 0 0 0 0 0 0 1
## 1372 1 2020 0 0 0 3 0 1 0 1 0 0 0 0 1
## 1373 1 2020 1 0 1 0 0 0 0 0 0 0 0 0 1
## 1374 0 2020 0 0 0 0 0 0 0 0 0 0 0 0 0
## really see lot soon new system company even long people think drive
## 1369 0 0 0 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0 0 0 0
## 1372 0 0 0 1 0 0 0 2 0 0 3 0
## 1373 0 0 0 0 0 0 0 0 0 0 1 0
## 1374 0 0 0 0 0 0 0 0 0 0 0 0
## hopefully tesla well just spacex take look falcon want mean say design
## 1369 0 2 0 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 1 0 0 0 0 0 0 0 0
## 1372 1 0 0 2 0 0 0 0 0 0 0 1
## 1373 0 0 1 1 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0 0 0 0
## power week thing many true end less way hard big ok month day haha real
## 1369 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## 1372 1 0 1 0 2 0 0 1 0 0 1 0 0 1 0
## 1373 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0
## 1374 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## thanks cool know love also team great first right yeah probably better re
## 1369 0 0 0 0 0 0 0 0 0 0 0 0 1
## 1370 0 0 0 0 0 1 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1372 0 0 0 0 0 0 1 1 0 0 1 0 0
## 1373 0 0 0 0 0 0 0 0 0 0 0 0 1
## 1374 0 0 0 0 0 0 0 0 0 0 0 0 0
## software start yes cost definitely production heavy use super add
## 1369 0 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0 0
## 1372 0 1 1 0 0 0 0 1 0 0
## 1373 0 0 2 0 0 0 0 2 0 0
## 1374 0 0 0 0 0 0 0 0 0 0
## autopilot sure make improve sound engine earth order exactly part pretty
## 1369 0 0 1 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0 0 0
## 1371 0 2 1 0 1 0 0 0 0 0 0
## 1372 0 1 0 1 0 0 0 0 0 0 0
## 1373 0 0 0 1 0 1 0 0 0 1 0
## 1374 0 0 0 0 0 0 0 0 0 0 0
## starship
## 1369 0
## 1370 0
## 1371 0
## 1372 0
## 1373 0
## 1374 0
## word_scores
## 1369 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.7, 0, 0, 0, 0, 1.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.9, 0, 0, 0, 0, 0, 1.1, 0, 0, 0, 0, 0, 1.9}
## 1370 {0, 0, 0, 0, 3.1, 0, 0, 0, 0, 0, -1.493, 0, 0, 0}
## 1371 {0, 0, 0, 0, 0, 0, 0.9, 0, 0, 0, 0, 0, 0, -4.05, -1.5, 0, 0, 0, 0, 0, 0, 2.85, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, 0, 0, 0, 0, 0, 0, 2.4, 0, 0, 0, 0, 0, -3.45, 0, -2.25, 0, 0, 0, 2.7, 0, 0, 0, 0, 1.95, 0, 0, 0, 0, 0, 0, 1.8, 0, 0, 0, 0}
## 1372 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.7465, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.95, 0, 0, 0.7, 0, 0, 0, 0, 0.9, 1, 0, -1.45, 0, 0, -0.65, 0, 0, 0, 0, 0, 0, 0, 0, 0.7, -0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.6, 0, 0, 0, 0, 0, 0, 0.8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.839175, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.9, 0, 1.4, 0, 0, 0, 0, 0, 0, 1.6, 0, 0, 0, 0, 0, 0, 0, 0, 0.65, 0, 0, 0, 1.55, 0, -1.35, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.85, 0, -0.15, 0, 0.85, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.148, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.7, 0, 0, 0, 0, 0, -0.703, 0, 0, 0, 0, 0, 0, 0, 0, 1.8, 0, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 1373 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.5, 0, 0, 0, 1.4, 0, 0, 0, 0, 1, 0, 0, 0.55, 0, 0, 1.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, -0.3, 0, 3, 0, 2.55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.45, 0, 0, 0, 0, 0, 0, 0, 3.2895, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 4.05, 0, 2.55, 0, 0, 0, 0, 0, 0, 0.15, 0, 0, -1.65, 0, 0, 0, 0, -1.8, 0, 0, 0, 0, 0, -1.65, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -5.4, 0, 0, 0}
## 1374 {2, 0, 2.8}
## compound pos neu neg but_count TSLA.Adjusted increase quarter emojis
## 1369 0.912 0.198 0.802 0.000 0 92.65733 1 3 🖤, 🖤
## 1370 0.440 0.233 0.635 0.132 0 91.05867 0 3
## 1371 0.708 0.216 0.635 0.149 1 92.95200 1 3 🔥, 🔥
## 1372 0.964 0.147 0.802 0.051 1 102.97667 1 3 👀, 👀
## 1373 0.973 0.180 0.742 0.079 1 99.80400 0 3 👀
## 1374 0.778 0.872 0.128 0.000 0 101.12000 1 3
## n_emojis
## 1369 2
## 1370 0
## 1371 2
## 1372 2
## 1373 1
## 1374 0
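As a side note, base R's lengths() computes the same per-element counts in one vectorized call; a one-line sketch equivalent to the sapply() above:
final_data_new$n_emojis <- lengths(final_data_new$emojis)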
# Correlation between number of emojis used and stock price
cor(final_data_new$TSLA.Adjusted, final_data_new$n_emojis)
## [1] 0.1178625
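The correlation is weak (about 0.12). To gauge whether it is statistically distinguishable from zero, a cor.test() call could be added; a sketch (output not shown):
cor.test(final_data_new$TSLA.Adjusted, final_data_new$n_emojis)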
This code creates a table of counts for each emoji present in final_data_new using the mtabulate() function from the qdapTools package.
The resulting table is assigned to emoji_table, and the column names are modified to add the prefix “emoji_” to each column name using the paste0() function.
Finally, d is created by binding emoji_table and the TSLA.Adjusted column of final_data_new together using the cbind() function. This new dataset d contains the count of each emoji and the corresponding TSLA stock price.
emoji_table <- mtabulate(final_data_new$emojis)
colnames(emoji_table) <- paste0("emoji_", colnames(emoji_table))
d = cbind(emoji_table, final_data_new$TSLA.Adjusted)
tail(d, 20)
## emoji_🌃 emoji_🌈 emoji_🌌 emoji_🌏 emoji_🌪 emoji_🌸 emoji_🍀 emoji_🍁 emoji_🍂
## 1355 0 0 0 0 0 0 0 0 0
## 1356 0 0 0 0 0 0 0 0 0
## 1357 0 0 0 0 0 0 0 0 0
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## 1361 0 0 0 0 0 0 0 0 0
## 1362 0 0 0 0 0 0 0 0 0
## 1363 0 0 0 0 0 0 0 0 0
## 1364 0 0 0 0 0 0 0 0 0
## 1365 0 0 0 0 0 0 0 0 0
## 1366 0 0 0 0 0 0 0 0 0
## 1367 0 0 0 0 0 0 0 0 0
## 1368 0 0 0 0 0 0 0 0 0
## 1369 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0
## 1372 0 0 0 0 0 0 0 0 0
## 1373 0 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## emoji_🍃 emoji_🍒 emoji_🍜 emoji_🍩 emoji_🍷 emoji_🍻 emoji_🎁 emoji_🎂 emoji_🎃
## 1355 0 0 0 0 0 0 0 0 0
## 1356 0 0 0 0 0 0 0 0 0
## 1357 0 0 0 0 0 0 0 0 0
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## 1361 0 0 0 0 0 0 0 0 0
## 1362 0 0 0 0 0 0 0 0 0
## 1363 0 0 0 0 0 0 0 0 0
## 1364 0 0 0 0 0 0 0 0 0
## 1365 0 0 0 0 0 0 0 0 0
## 1366 0 0 0 0 0 0 0 0 0
## 1367 0 0 0 0 0 0 0 0 0
## 1368 0 0 0 0 0 0 0 0 0
## 1369 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0
## 1372 0 0 0 0 0 0 0 0 0
## 1373 0 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## emoji_🎄 emoji_🎉 emoji_🎤 emoji_🎥 emoji_🎩 emoji_🎶 emoji_🎼 emoji_🏴 emoji_🏻
## 1355 0 0 0 0 0 0 0 0 0
## 1356 0 0 0 0 0 0 0 0 0
## 1357 0 0 0 0 0 0 0 0 0
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## 1361 0 0 0 0 0 0 0 0 0
## 1362 0 0 0 0 0 0 0 0 0
## 1363 0 0 0 0 0 0 0 0 0
## 1364 0 0 0 0 0 0 0 0 0
## 1365 0 0 0 0 0 0 0 0 0
## 1366 0 0 0 0 0 0 0 0 0
## 1367 0 0 0 0 0 0 0 0 0
## 1368 0 0 0 0 0 0 0 0 0
## 1369 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0
## 1372 0 0 0 0 0 0 0 0 0
## 1373 0 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## emoji_🐇 emoji_🐈 emoji_🐉 emoji_🐌 emoji_🐏 emoji_🐐 emoji_🐜 emoji_🐣 emoji_🐶
## 1355 0 0 0 0 0 0 0 0 0
## 1356 0 0 0 0 0 0 0 0 0
## 1357 0 0 0 0 0 0 0 0 0
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## 1361 0 0 0 0 0 0 0 0 0
## 1362 0 0 0 0 0 0 0 0 0
## 1363 0 1 0 0 0 0 0 0 0
## 1364 0 0 0 0 0 0 0 0 0
## 1365 0 0 0 0 0 0 0 0 0
## 1366 0 0 0 0 0 0 0 0 0
## 1367 0 0 0 0 0 0 0 0 0
## 1368 0 0 0 0 0 0 0 0 0
## 1369 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0
## 1372 0 0 0 0 0 0 0 0 0
## 1373 0 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## emoji_🐻 emoji_🐿 emoji_👀 emoji_👁 emoji_👆 emoji_👌 emoji_👍 emoji_👟 emoji_👨
## 1355 0 0 0 0 0 0 0 0 0
## 1356 0 0 0 0 0 0 0 0 0
## 1357 0 0 0 0 0 0 0 0 0
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## 1361 0 0 0 0 0 0 0 0 0
## 1362 0 0 0 0 0 0 0 0 0
## 1363 0 0 0 0 0 0 0 0 0
## 1364 0 0 0 0 0 0 0 0 0
## 1365 0 0 0 0 0 0 0 0 0
## 1366 0 0 0 0 0 0 0 0 0
## 1367 0 0 0 0 0 0 0 0 0
## 1368 0 0 0 0 0 0 0 0 0
## 1369 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0
## 1372 0 0 2 0 0 0 0 0 0
## 1373 0 0 1 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## emoji_👩 emoji_👶 emoji_👸 emoji_👻 emoji_👽 emoji_💕 emoji_💖 emoji_💘 emoji_💙
## 1355 0 0 0 0 0 0 0 0 0
## 1356 0 0 0 0 0 0 0 0 0
## 1357 0 0 0 0 0 0 0 0 0
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## 1361 0 0 0 0 0 0 0 0 0
## 1362 0 0 0 0 0 0 0 0 0
## 1363 0 0 0 0 0 0 0 0 0
## 1364 0 0 0 0 0 0 0 0 0
## 1365 0 0 0 0 0 0 0 0 0
## 1366 0 0 0 0 0 0 0 0 0
## 1367 0 0 0 0 0 0 0 0 0
## 1368 0 0 0 0 0 0 0 0 0
## 1369 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0
## 1372 0 0 0 0 0 0 0 0 0
## 1373 0 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## emoji_💚 emoji_💛 emoji_💜 emoji_💡 emoji_💣 emoji_💦 emoji_💨 emoji_💩 emoji_💫
## 1355 0 0 0 0 0 0 0 0 0
## 1356 0 0 0 0 0 0 0 0 0
## 1357 0 0 0 0 0 0 0 0 0
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## 1361 0 0 0 0 0 0 0 0 0
## 1362 0 0 0 0 0 0 0 0 0
## 1363 0 0 0 0 0 0 0 0 0
## 1364 0 0 0 0 0 0 0 0 0
## 1365 0 0 0 0 0 0 0 0 0
## 1366 0 0 0 0 0 0 0 0 0
## 1367 0 0 0 0 0 0 0 0 0
## 1368 0 0 0 0 0 0 0 0 0
## 1369 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0
## 1372 0 0 0 0 0 0 0 0 0
## 1373 0 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## emoji_💯 emoji_💰 emoji_🔥 emoji_🔬 emoji_🔭 emoji_🕳 emoji_🕺 emoji_🖤 emoji_😀
## 1355 0 0 0 0 0 0 0 0 0
## 1356 0 0 0 0 0 0 0 0 0
## 1357 0 0 0 0 0 0 0 0 0
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## 1361 0 0 0 0 0 0 0 0 0
## 1362 0 0 0 0 0 0 0 0 0
## 1363 0 0 0 0 0 0 0 0 0
## 1364 0 0 0 0 0 0 0 0 0
## 1365 0 0 0 0 0 0 0 0 0
## 1366 0 0 0 0 0 0 0 0 0
## 1367 0 0 0 0 0 0 0 0 0
## 1368 0 0 0 0 0 0 0 0 0
## 1369 0 0 0 0 0 0 0 2 0
## 1370 0 0 0 0 0 0 0 0 0
## 1371 0 0 2 0 0 0 0 0 0
## 1372 0 0 0 0 0 0 0 0 0
## 1373 0 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## emoji_😅 emoji_😇 emoji_😉 emoji_😊 emoji_😋 emoji_😍 emoji_😎 emoji_😐 emoji_😔
## 1355 0 0 0 0 0 0 0 0 0
## 1356 0 0 0 0 0 0 0 0 0
## 1357 0 0 0 0 0 0 0 0 0
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## 1361 0 0 0 0 0 0 0 0 0
## 1362 0 0 0 0 0 0 0 0 0
## 1363 1 0 0 0 0 0 0 0 0
## 1364 0 0 0 0 0 0 0 0 0
## 1365 0 0 0 0 0 0 0 0 0
## 1366 0 0 0 0 0 0 0 0 0
## 1367 0 0 0 0 0 0 0 0 0
## 1368 0 0 0 0 0 0 0 0 0
## 1369 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0
## 1372 0 0 0 0 0 0 0 0 0
## 1373 0 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## emoji_😘 emoji_😛 emoji_😜 emoji_😢 emoji_😮 emoji_😲 emoji_😴 emoji_🙏 emoji_🚀
## 1355 0 0 0 0 0 0 0 0 0
## 1356 0 0 0 0 0 0 0 0 0
## 1357 0 0 0 0 0 0 0 0 0
## 1358 0 0 0 0 0 0 0 0 0
## 1359 0 0 0 0 0 0 0 0 0
## 1360 0 0 0 0 0 0 0 0 0
## 1361 0 0 0 0 0 0 0 0 0
## 1362 0 0 0 0 0 0 0 0 0
## 1363 0 0 0 0 0 0 0 0 0
## 1364 0 0 0 0 0 0 0 0 0
## 1365 0 0 0 0 0 0 0 0 0
## 1366 0 0 0 0 0 0 0 0 0
## 1367 0 0 0 1 0 0 0 0 0
## 1368 0 0 0 0 0 0 0 0 0
## 1369 0 0 0 0 0 0 0 0 0
## 1370 0 0 0 0 0 0 0 0 0
## 1371 0 0 0 0 0 0 0 0 0
## 1372 0 0 0 0 0 0 0 0 0
## 1373 0 0 0 0 0 0 0 0 0
## 1374 0 0 0 0 0 0 0 0 0
## final_data_new$TSLA.Adjusted
## 1355 65.47533
## 1356 66.11933
## 1357 66.93066
## 1358 66.72667
## 1359 66.28800
## 1360 66.78533
## 1361 64.05666
## 1362 65.73200
## 1363 63.98267
## 1364 67.29000
## 1365 71.98734
## 1366 74.64200
## 1367 80.57733
## 1368 91.43867
## 1369 92.65733
## 1370 91.05867
## 1371 92.95200
## 1372 102.97667
## 1373 99.80400
## 1374 101.12000
# select first and last column
df_new <- d %>% select(col1=colnames(d)[1], price='final_data_new$TSLA.Adjusted') %>%
filter(col1 != 0) %>%
summarise(avg_price = mean(price))
df_new[1,1]
## [1] 23.50333
This code creates an empty list called price_by_emoji and initializes a vector col_names with the column names of the emoji_table data frame. It then loops through each column name, builds a data frame df_new by selecting that emoji's column from d together with the TSLA.Adjusted column, filters out the rows where the emoji does not appear (a count of 0), and takes the mean of the remaining TSLA.Adjusted values. The result is assigned to an element of price_by_emoji named after the column. In short, the loop computes the average stock price on the days each emoji was used and stores it in price_by_emoji.
price_by_emoji = list()
col_names = colnames(emoji_table)
for (col_name in col_names) {
df_new <- d %>% select(col1=col_name, price='final_data_new$TSLA.Adjusted') %>%
filter(col1 != 0) %>%
summarise(avg_price = mean(price))
price_by_emoji[[col_name]] = df_new[1,1]
}
df <- data.frame(emoji = names(price_by_emoji), avg_stock_price = unlist(price_by_emoji))
df %>% arrange(desc(avg_stock_price)) %>% select(-1)
## avg_stock_price
## emoji_👀 84.11845
## emoji_👸 52.17200
## emoji_🔭 47.96400
## emoji_👆 46.75467
## emoji_💯 42.75367
## emoji_🕺 42.72067
## emoji_🏻 41.62100
## emoji_😅 40.66311
## emoji_😮 38.20000
## emoji_💕 37.68836
## emoji_💘 36.74433
## emoji_🍻 36.48000
## emoji_🔥 34.06458
## emoji_🐈 33.29817
## emoji_😢 33.23467
## emoji_👶 32.08933
## emoji_🎶 31.43717
## emoji_😎 29.15167
## emoji_🎄 28.72933
## emoji_👻 28.43200
## emoji_👍 28.11459
## emoji_🎥 27.88867
## emoji_🖤 24.82805
## emoji_🌏 24.75355
## emoji_🎁 24.66133
## emoji_🍜 24.38067
## emoji_😀 23.72745
## emoji_😉 23.62442
## emoji_🌃 23.50333
## emoji_🍒 23.32867
## emoji_🌈 22.80000
## emoji_💚 22.80000
## emoji_💜 22.80000
## emoji_😴 22.73800
## emoji_🐻 22.52111
## emoji_😘 22.34233
## emoji_💖 21.83378
## emoji_💛 21.45217
## emoji_💡 21.33300
## emoji_😍 21.32533
## emoji_🍩 21.26400
## emoji_😊 21.26400
## emoji_💩 21.12783
## emoji_🌌 21.11400
## emoji_🎃 20.88733
## emoji_🐇 20.73400
## emoji_💨 20.70967
## emoji_🎂 20.59333
## emoji_🌪 20.54467
## emoji_💫 20.30844
## emoji_💙 20.23133
## emoji_🌸 20.07667
## emoji_🎩 20.07067
## emoji_😛 20.05600
## emoji_👟 19.91800
## emoji_🔬 19.87600
## emoji_👨 19.83933
## emoji_👩 19.80267
## emoji_🎉 19.77967
## emoji_🐜 19.59833
## emoji_🐿 19.45400
## emoji_🎼 19.42800
## emoji_🐉 19.25767
## emoji_👌 19.23480
## emoji_🍃 18.89433
## emoji_🙏 18.84578
## emoji_🚀 18.60696
## emoji_😜 18.44567
## emoji_👽 18.43967
## emoji_🐌 18.16400
## emoji_😲 17.86378
## emoji_🐶 17.54933
## emoji_😇 17.52533
## emoji_😔 17.44333
## emoji_🍁 17.24400
## emoji_🍂 17.24400
## emoji_🐏 17.04533
## emoji_🐐 17.00200
## emoji_🍷 16.90267
## emoji_👁 16.44000
## emoji_🎤 16.30200
## emoji_🕳 16.30200
## emoji_💣 16.23267
## emoji_💦 16.23267
## emoji_🏴 15.46333
## emoji_😋 15.18667
## emoji_🍀 15.16900
## emoji_💰 14.79067
## emoji_😐 14.06867
## emoji_🐣 12.34400
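For reference, the explicit loop above can be collapsed into a single sapply() over the columns of emoji_table; a sketch over the objects defined earlier that should reproduce the same averages (avg_price_by_emoji is a hypothetical name):
avg_price_by_emoji <- sapply(emoji_table, function(col) mean(final_data_new$TSLA.Adjusted[col != 0]))
head(sort(avg_price_by_emoji, decreasing = TRUE), 5)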
This code is used to create a bar chart of the top 40 most frequent 2-word combinations (2-grams) in a corpus of text data.
First, the text data is processed using the tokens() function from the quanteda package to create a Tokens object from a stemmed corpus. Then, the tokens_ngrams() function is used to create an object that contains all 2-grams in the Tokens object.
Next, the dfm() function from the quanteda package is used to create a Document Feature Matrix (DFM) from the 2-gram Tokens object. This DFM is then used to calculate the frequency of each 2-gram using the textstat_frequency() function, which returns a data frame of the 2-gram features and their frequencies.
Finally, the top 40 most frequent 2-grams are plotted using the ggplot2 package. The reorder() function is used to order the 2-grams by frequency, and the geom_col() function is used to create a bar chart.
# 2-word frequency (2 gram)
# create Tokens object from stemmed corpus
tokens_stemmed <- tokens(cleaned_corpus)
dfm2 = dfm(tokens_ngrams(tokens_stemmed,n=2))
dfFreq2 = textstat_frequency(dfm2)
library(ggplot2)
ggplot(dfFreq2[1:40,], aes(x=reorder(feature, frequency), y=frequency)) +
geom_col() +
coord_flip() +
scale_x_discrete(name = "2 gram") +
theme(text=element_text(size=12))
#2-grams features to data frame
dtm2 <- convert(dfm2, to = "tm")
data_dtm2 = as.data.frame(as.matrix(dtm2))
head(data_dtm2)
## curious_p p_would would_can can_schedule schedule_test test_drive
## text1 1 1 1 1 1 1
## prob_good good_though though_give give_we we_time time_replace replace_st
## text1 0 0 0 0 0 0 0
## st_stage stage_video video_transmitter transmitter_need need_launch
## text1 0 0 0 0 0
## launch_nice nice_air air_force force_track track_radar radar_go go_launch
## text1 0 0 0 0 0 0 0
## launch_postponed postponed_time time_tomorrow tomorrow_rocket
## text1 0 0 0 0
## rocket_reentry reentry_much much_tougher tougher_time time_around
## text1 0 0 0 0 0
## around_due due_deep deep_space space_mission mission_almost almost_x
## text1 0 0 0 0 0 0
## x_force force_x x_heat heat_plenty plenty_hydraulic hydraulic_fluid
## text1 0 0 0 0 0 0
## fluid_tho tho_launch launch_st st_deep mission_today today_head
## text1 0 0 0 0 0 0
## head_earthsun earthsun_l l_gravity gravity_null null_point point_m m_mile
## text1 0 0 0 0 0 0 0
## mile_x x_moon launch_postpone postpone_tomorrow tomorrow_due due_high
## text1 0 0 0 0 0 0
## high_wind wind_cape cape_dragon dragon_still still_inbound inbound_orbit
## text1 0 0 0 0 0 0
## orbit_min min_extreme extreme_wind wind_share share_cape cape_canaveral
## text1 0 0 0 0 0 0
## canaveral_feel feel_like like_sledgehammer sledgehammer_supersonic
## text1 0 0 0 0
## supersonic_vertical vertical_hoping hoping_change change_civilian
## text1 0 0 0 0
## civilian_application rocket_soft soft_land land_ocean ocean_withing
## text1 0 0 0 0 0
## withing_m m_target target_nicely nicely_vertical vertical_high
## text1 0 0 0 0 0
## high_probability probability_good good_droneship droneship_landing
This code is creating a frequency distribution of 3-grams in the text data.
The tokens_ngrams function from the quanteda package is used to create a Tokens object from the stemmed corpus with n-grams of length 3.
The dfm function is then used to create a document-feature matrix from the Tokens object.
The textstat_frequency function is used to create a frequency table of the 3-grams, which is stored in dfFreq3.
Finally, the ggplot2 package is used to create a bar plot of the top 40 most frequent 3-grams.
# 3-word frequency (3 gram)
# create Tokens object from stemmed corpus
dfm3 = dfm(tokens_ngrams(tokens_stemmed,n=3))
dfFreq3 = textstat_frequency(dfm3)
ggplot(dfFreq3[1:40,], aes(x=reorder(feature, frequency), y=frequency)) +
geom_col() +
coord_flip() +
scale_x_discrete(name = "3 gram") +
theme(text=element_text(size=12))
#3-grams features to data frame
dtm3 <- convert(dfm3, to = "tm")
data_dtm3 = as.data.frame(as.matrix(dtm3))
head(data_dtm3)
## curious_p_would p_would_can would_can_schedule can_schedule_test
## text1 1 1 1 1
## schedule_test_drive prob_good_though good_though_give though_give_we
## text1 1 0 0 0
## give_we_time we_time_replace time_replace_st replace_st_stage
## text1 0 0 0 0
## st_stage_video stage_video_transmitter video_transmitter_need
## text1 0 0 0
## transmitter_need_launch need_launch_nice launch_nice_air nice_air_force
## text1 0 0 0 0
## air_force_track force_track_radar track_radar_go radar_go_launch
## text1 0 0 0 0
## go_launch_postponed launch_postponed_time postponed_time_tomorrow
## text1 0 0 0
## time_tomorrow_rocket tomorrow_rocket_reentry rocket_reentry_much
## text1 0 0 0
## reentry_much_tougher much_tougher_time tougher_time_around
## text1 0 0 0
## time_around_due around_due_deep due_deep_space deep_space_mission
## text1 0 0 0 0
## space_mission_almost mission_almost_x almost_x_force x_force_x
# Merging datasets of ngrams
tesla_sentiment = read.csv("C:/Users/Goldleverse/OneDrive/Desktop/pre_finance.csv")
ngrams_data = cbind(tesla_sentiment, data_dtm2, data_dtm3)
head(ngrams_data, 2)
## X date tweet X.can.. X.test.. X... X.almost.. X.give.. X.go.. X.good..
## X.launch.. X.m.. X.much.. X.need.. X.point.. X.rocket.. X.time.. X.today..
## X.we.. X.x.. X....1 X.change.. X.high.. X.like.. X.still.. X.actually..
## X.come.. X.future.. X.landing.. X.maybe.. X.try.. X.best.. X.car..
## X.model.. X.s.. X.year.. X.flight.. X.next.. X.now.. X.one.. X.work..
## X.full.. X.back.. X.be.. X.build.. X.do.. X.get.. X.mar.. X.not..
## X.really.. X.see.. X.lot.. X.soon.. X.new.. X.system.. X.company..
## X.even.. X.long.. X.people.. X.think.. X.drive.. X.hopefully.. X.tesla..
## X.well.. X.just.. X.spacex.. X.take.. X.look.. X.falcon.. X.want..
## X.mean.. X.say.. X.design.. X.power.. X.week.. X.thing.. X.many.. X.true..
## X.end.. X.less.. X.way.. X.hard.. X.big.. X.ok.. X.month.. X.day..
## X.haha.. X.real.. X.thanks.. X.cool.. X.know.. X.love.. X.also.. X.team..
## X.great.. X.first.. X.right.. X.yeah.. X.probably.. X.better.. X.re..
## X.software.. X.start.. X.yes.. X.cost.. X.definitely.. X.production..
## X.heavy.. X.use.. X.super.. X.add.. X.autopilot.. X.sure.. X.make..
## X.improve.. X.sound.. X.engine.. X.earth.. X.order.. X.exactly.. X.part..
## X.pretty.. X.starship.. word_scores compound pos neu neg but_count
## curious_p p_would would_can can_schedule schedule_test test_drive
## prob_good good_though though_give give_we we_time time_replace replace_st
## st_stage stage_video video_transmitter transmitter_need need_launch
## launch_nice nice_air air_force force_track track_radar radar_go go_launch
merged_df = merge(ngrams_data, tsla_prices, by="date", all.y = TRUE)
# Add increase column
merged_df$increase <- c(0, ifelse(diff(merged_df$TSLA.Adjusted) > 0, 1, 0))
# Replace NA feature values with 0 (parentheses matter: 3:ncol(x)-1 would select columns 2 to ncol-1)
merged_df[, 3:(ncol(merged_df) - 1)][is.na(merged_df[, 3:(ncol(merged_df) - 1)])] = 0
tail(merged_df, 2)
## date X tweet X.can.. X.test.. X... X.almost.. X.give.. X.go.. X.good..
## X.launch.. X.m.. X.much.. X.need.. X.point.. X.rocket.. X.time.. X.today..
## X.we.. X.x.. X....1 X.change.. X.high.. X.like.. X.still.. X.actually..
## X.come.. X.future.. X.landing.. X.maybe.. X.try.. X.best.. X.car..
## X.model.. X.s.. X.year.. X.flight.. X.next.. X.now.. X.one.. X.work..
## X.full.. X.back.. X.be.. X.build.. X.do.. X.get.. X.mar.. X.not..
## X.really.. X.see.. X.lot.. X.soon.. X.new.. X.system.. X.company..
## X.even.. X.long.. X.people.. X.think.. X.drive.. X.hopefully.. X.tesla..
## X.well.. X.just.. X.spacex.. X.take.. X.look.. X.falcon.. X.want..
## X.mean.. X.say.. X.design.. X.power.. X.week.. X.thing.. X.many.. X.true..
## X.end.. X.less.. X.way.. X.hard.. X.big.. X.ok.. X.month.. X.day..
## X.haha.. X.real.. X.thanks.. X.cool.. X.know.. X.love.. X.also.. X.team..
## X.great.. X.first.. X.right.. X.yeah.. X.probably.. X.better.. X.re..
## X.software.. X.start.. X.yes.. X.cost.. X.definitely.. X.production..
## X.heavy.. X.use.. X.super.. X.add.. X.autopilot.. X.sure.. X.make..
## X.improve.. X.sound.. X.engine.. X.earth.. X.order.. X.exactly.. X.part..
## X.pretty.. X.starship.. word_scores compound pos neu neg but_count
## curious_p p_would would_can can_schedule schedule_test test_drive
## prob_good good_though though_give give_we we_time time_replace replace_st
## st_stage stage_video video_transmitter transmitter_need need_launch
## launch_nice nice_air air_force force_track track_radar radar_go go_launch
write.csv(merged_df, file="ngrams_data.csv", row.names = F)
This code uses the fread() function to import the data. fread() is implemented in C, which speeds up file importation considerably compared to read.csv(). In this case we do not use the last data frame created during the feature-engineering part because of limits on computational power: the resulting data frame would be a matrix roughly 250 GB in size.
To overcome this issue we will use the dataset generated before the exploratory analysis, which contains the date, the TF-IDF matrix, the sentiment analysis scores, and the Yahoo Finance data.
library(data.table)
merged_df = fread("C:/Users/Goldleverse/OneDrive/Desktop/Datasets/tesla_sentiment.csv")
head(merged_df, 2)
## date
## 1: 2015-01-29
## 2: 2015-01-30
## tweet
## 1: 0
## 2: If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE
## ""can"", ""test"", """", ""almost"", ""give"", ""go"", ""good"", ""launch"",
## 1: 0 0 0 0 0 0 0 0
## 2: 1 1 0 0 0 0 0 0
## ""m"", ""much"", ""need"", ""point"", ""rocket"", ""time"", ""today"",
## 1: 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0
## ""we"", ""x"", """") ""change"", ""high"", ""like"", ""still"",
## 1: 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0
## ""actually"", ""come"", ""future"", ""landing"", ""maybe"", ""try"",
## 1: 0 0 0 0 0 0
## 2: 0 0 0 0 0 0
## ""best"", ""car"", ""model"", ""s"", ""year"", ""flight"", ""next"",
## 1: 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0
## ""now"", ""one"", ""work"", ""full"", ""back"", ""be"", ""build"", ""do"",
## 1: 0 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0 0
## ""get"", ""mar"", ""not"", ""really"", ""see"", ""lot"", ""soon"", ""new"",
## 1: 0 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0 0
## ""system"", ""company"", ""even"", ""long"", ""people"", ""think"",
## 1: 0 0 0 0 0 0
## 2: 0 0 0 0 0 0
## ""drive"", ""hopefully"", ""tesla"", ""well"", ""just"", ""spacex"",
## 1: 0 0 0 0 0 0
## 2: 0 0 0 0 0 0
## ""take"", ""look"", ""falcon"", ""want"", ""mean"", ""say"", ""design"",
## 1: 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0
## ""power"", ""week"", ""thing"", ""many"", ""true"", ""end"", ""less"",
## 1: 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0
## ""way"", ""hard"", ""big"", ""ok"", ""month"", ""day"", ""haha"", ""real"",
## 1: 0 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0 0
## ""thanks"", ""cool"", ""know"", ""love"", ""also"", ""team"", ""great"",
## 1: 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0
## ""first"", ""right"", ""yeah"", ""probably"", ""better"", ""re"",
## 1: 0 0 0 0 0 0
## 2: 0 0 0 0 0 0
## ""software"", ""start"", ""yes"", ""cost"", ""definitely"", ""production"",
## 1: 0 0 0 0 0 0
## 2: 0 0 0 0 0 0
## ""heavy"", ""use"", ""super"", ""add"", ""autopilot"", ""sure"", ""make"",
## 1: 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0
## ""improve"", ""sound"", ""engine"", ""earth"", ""order"", ""exactly"",
## 1: 0 0 0 0 0 0
## 2: 0 0 0 0 0 0
## ""part"", ""pretty"", ""starship"",
## 1: 0 0 0
## 2: 0 0 0
## word_scores compound pos neu neg
## 1: 0 0.000 0.000 0.000 0
## 2: {0, 0, 0, 1.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} 0.318 0.141 0.859 0
## but_count TSLA.Adjusted increase
## 1: 0 13.68000 0
## 2: 0 13.57333 0
This code converts the “date” column in merged_df from a character string in the format “YYYY-MM-DD” to a numeric value using the as.Date() and as.numeric() functions from base R.
as.Date() parses the character strings into Date objects using the specified format string (“%Y-%m-%d” in this case) and assigns the results to the “date” column.
Then as.numeric() is used to convert the dates into numeric values which represent the number of days since January 1, 1970 (also known as the Unix epoch). The resulting numeric values are stored back into the “date” column of merged_df.
merged_df$date <- as.Date(merged_df$date, format="%Y-%m-%d")
merged_df$date <- as.numeric(merged_df$date)
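As a quick sanity check of the conversion (2015-01-29 is the first date in the data):
as.numeric(as.Date("2015-01-29")) # 16464 days after 1970-01-01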
This code performs some data preprocessing for a time series analysis and classification task.
The two lines in the previous chunk converted the “date” column of the “merged_df” data frame into a numeric value, which is necessary for later use in modeling.
The first two lines of this chunk select only the “TSLA.Adjusted” column from “merged_df” and standardize it using the “scale” function from the “stats” package, which makes the values easier to work with in subsequent analyses.
The next two lines select all columns except “date”, “TSLA.Adjusted”, “increase”, “word_scores”, and “tweet” from “merged_df” and scale them as well, standardizing these predictors for use in a classification model.
Finally, the “labels” variable is created by selecting the “increase” column from “merged_df”. This column contains binary values indicating whether the stock price increased (1) or decreased (0) on a given day. This will be the target variable in the classification model.
time_series = merged_df %>% select(TSLA.Adjusted)
time_series = scale(time_series)
nlp_data = merged_df %>% select(-date, -TSLA.Adjusted, -increase, -word_scores, -tweet)
nlp_data = scale(nlp_data)
labels = merged_df$increase
ncol(time_series)
## [1] 1
nrow(time_series)
## [1] 1374
ncol(nlp_data)
## [1] 125
nrow(nlp_data)
## [1] 1374
This code converts the time series data frame into a matrix, which is the input format required by the LSTM model.
time_series = as.matrix(time_series)
This code trains a deep learning model using Keras to make predictions based on two types of data: time series and NLP (natural language processing) data. The model architecture includes two LSTM layers and a dense layer, and it is trained using binary cross-entropy loss and the Adam optimizer. The model is also evaluated using accuracy as a metric.
The steps in the model are:
Determine the number of time steps and features in the time series data, and the number of words and dimensions in the NLP data. Set the number of LSTM units and dense units for the model.
Reshape the time series and NLP data to match the input shape of the model. This is done using the array() function.
Define the input shapes for the time series and NLP data.
Define the model architecture using the layer_input(), layer_lstm(), layer_global_max_pooling_1d(), layer_dense(), and layer_concatenate() functions. This creates a model with two inputs (one for time series data and one for NLP data), two LSTM layers, a dense layer, and a concatenation layer to combine the outputs from the LSTM layers.
Compile the model using the compile() function. This sets the loss function, optimizer, and evaluation metric for the model.
Define a callback to save the best model based on validation accuracy. This is done using the callback_model_checkpoint() function.
Train the model using the fit() function. This trains the model on the input data and labels, with options for the number of epochs, batch size, validation split, and verbose level. It also includes the callbacks defined earlier for early stopping and saving the best model.
Evaluate the model using the evaluate() function. This computes the loss and accuracy of the model on the test data.
max_time_steps <- nrow(time_series)
num_features <- ncol(time_series)
max_words <- nrow(nlp_data)
embedding_dim <- ncol(nlp_data)
lstm_units <- 32
dense_units <- 64
# Reshape time series data to match the input shape of the model
time_series <- array(data = time_series, dim = c(nrow(time_series), max_time_steps, num_features))
# Reshape NLP data to match the input shape of the model
nlp_data <- array(data = nlp_data, dim = c(nrow(nlp_data), max_words, embedding_dim))
# Define input shapes for time series and NLP data
ts_input_shape <- c(max_time_steps, num_features)
nlp_input_shape <- c(max_words, embedding_dim)
# Define LSTM-based model architecture
ts_input <- layer_input(shape = ts_input_shape, name = "ts_input")
ts_lstm <- layer_lstm(units = lstm_units, return_sequences = TRUE)(ts_input)
ts_output <- layer_global_max_pooling_1d()(ts_lstm)
nlp_input <- layer_input(shape = nlp_input_shape, name = "nlp_input")
nlp_lstm <- layer_lstm(units = lstm_units)(nlp_input)
nlp_output <- layer_dense(units = dense_units, activation = "relu")(nlp_lstm)
combined <- layer_concatenate(inputs = list(ts_output, nlp_output), name = "combined")
output <- layer_dense(units = 1, activation = "sigmoid")(combined)
model <- keras_model(inputs = list(ts_input, nlp_input), outputs = output)
# Compile the model
model %>% compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(),
metrics = c("accuracy")
)
# Define a callback to save the model with the best accuracy
best_model_callback <- callback_model_checkpoint(
filepath = "best_model.h5",
save_best_only = TRUE,
monitor = "val_accuracy",
mode = "max"
)
# Train the model with early stopping and the best model callback
history <- model %>% fit(
x = list(time_series, nlp_data),
y = labels,
epochs = 2000,
batch_size = 32,
validation_split = 0.2,
verbose = 2,
callbacks = list(
callback_early_stopping(patience = 100),
best_model_callback
)
)
# Evaluate the model
scores <- model %>% evaluate(
x = list(time_series, nlp_data),
y = labels,
batch_size = 32
)
# Print the evaluation metrics
cat("Test loss:", scores[[1]], "\n")
## Test loss: 0.9218721
cat("Test accuracy:", scores[[2]], "\n")
## Test accuracy: 0.7489083
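Since fit() returns a history object, the learning curves from the run above can also be inspected directly; keras provides a plot() method for it:
plot(history)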
This code is splitting a merged dataset into training and testing sets, and creating corresponding labels for each set.
The split_idx variable is defined as 80% of the number of rows in the merged_df data frame, so that the first 80% of observations form the training period and the remainder the test period. The select() function from dplyr (already loaded) is then reused to rebuild time_series and nlp_data.
The time_series_train variable is created by selecting the first split_idx rows of the time_series matrix, and converting it to a matrix using the as.matrix() function. The time_series_test variable is created by selecting the remaining rows of the time_series matrix, and also converting it to a matrix using the as.matrix() function. Similarly, nlp_data_train and nlp_data_test are created by selecting the corresponding rows from nlp_data.
The train_labels variable is created by selecting the increase column from the first split_idx rows of the merged_df dataframe. Similarly, test_labels are created by selecting the increase column from the remaining rows of the merged_df dataframe.
split_idx <- round(nrow(merged_df) * 0.8)
time_series = merged_df %>% select(TSLA.Adjusted)
time_series = scale(time_series)
nlp_data = merged_df %>% select(-date, -TSLA.Adjusted, -increase, -word_scores, -tweet)
nlp_data = scale(nlp_data)
labels = merged_df$increase
time_series_train <- time_series[1:split_idx, ]
time_series_train <- as.matrix(time_series_train)
time_series_test <- time_series[(split_idx + 1):nrow(time_series), ]
time_series_test <- as.matrix(time_series_test)
nlp_data_train <- nlp_data[1:split_idx, ]
nlp_data_test <- nlp_data[(split_idx + 1):nrow(nlp_data), ]
train_labels = merged_df[1:split_idx, ]$increase
test_labels <- merged_df[(split_idx + 1):nrow(merged_df), ]$increase
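A quick sanity check, as a sketch, reports the share of up days in each period; roughly similar shares indicate the chronological split did not skew the label balance:
mean(train_labels) # share of days labeled as an increase in the training period
mean(test_labels) # share in the held-out test period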
This code defines a LSTM-based neural network model for a combined time series and NLP (natural language processing) dataset. Here is a brief summary of the steps:
Define the necessary input shapes, model architecture, and hyperparameters such as LSTM units, dense units, dropout rate, and learning rate.
Reshape the input data to match the expected input shape of the model.
Define the Nadam optimizer with a specific learning rate and early stopping based on validation accuracy.
Compile the model with binary cross-entropy loss, Nadam optimizer, and accuracy metric.
Train the model with the defined callbacks to save the best model and stop training if there is no improvement in validation accuracy. The model is trained with a batch size of 32 and for 2000 epochs.
The model architecture consists of two LSTM layers, one for time series data and one for NLP data, followed by dropout layers to prevent overfitting, dense layers with rectified linear unit (ReLU) activation function, and concatenation layer to merge the outputs from the two LSTMs. Finally, a dense layer with sigmoid activation function is used to obtain the binary classification output.
max_time_steps <- ncol(time_series_train)
num_features <- nrow(time_series_train)
max_words <- ncol(nlp_data_train)
embedding_dim <- nrow(nlp_data_train)
lstm_units <- 32
dense_units <- 64
dropout_rate <- 0.1
learning_rate <- 0.01
# Reshape time series data to match the input shape of the model
time_series_train <- array(data = time_series_train, dim = c(nrow(time_series_train), max_time_steps, num_features))
time_series_test <- array(data = time_series_test, dim = c(nrow(time_series_test), max_time_steps, num_features))
# Reshape NLP data to match the input shape of the model
nlp_data_train <- array(data = nlp_data_train, dim = c(nrow(nlp_data_train), max_words, embedding_dim))
nlp_data_test <- array(data = nlp_data_test, dim = c(nrow(nlp_data_test), max_words, embedding_dim))
# Define input shapes for time series and NLP data
ts_input_shape <- c(max_time_steps, num_features)
nlp_input_shape <- c(max_words, embedding_dim)
# Define LSTM-based model architecture
ts_input <- layer_input(shape = ts_input_shape, name = "ts_input")
ts_lstm <- layer_lstm(units = lstm_units, return_sequences = TRUE)(ts_input)
ts_dropout <- layer_dropout(rate = dropout_rate)(ts_lstm)
ts_output <- layer_global_max_pooling_1d()(ts_dropout)
nlp_input <- layer_input(shape = nlp_input_shape, name = "nlp_input")
nlp_lstm <- layer_lstm(units = lstm_units)(nlp_input)
nlp_dropout <- layer_dropout(rate = dropout_rate)(nlp_lstm)
nlp_output <- layer_dense(units = dense_units, activation = "relu")(nlp_dropout)
combined <- layer_concatenate(inputs = list(ts_output, nlp_output), name = "combined")
output <- layer_dense(units = 1, activation = "sigmoid")(combined)
model <- keras_model(inputs = list(ts_input, nlp_input), outputs = output)
# Define the Nadam optimizer with a specific learning rate
optimizer <- optimizer_nadam(learning_rate = learning_rate)
# Define early stopping based on validation accuracy
early_stopping <- callback_early_stopping(
patience = 100,
monitor = "val_accuracy",
mode = "max"
)
# Compile the model
model %>% compile(
loss = "binary_crossentropy",
optimizer = optimizer,
metrics = c("accuracy")
)
# Train the model
history <- model %>% fit(
x = list(time_series_train, nlp_data_train),
y = train_labels,
epochs = 2000,
batch_size = 32,
validation_data = list(list(time_series_test, nlp_data_test), test_labels),
callbacks = list(
callback_model_checkpoint("best_model_advanced.h5", save_best_only = TRUE, verbose = 1),
early_stopping
)
)
# The checkpoint callback above has already written the best model to
# "best_model_advanced.h5"; saving the final-epoch weights to the same path
# would overwrite it, so they are kept under a separate file instead.
save_model_hdf5(model, "final_epoch_model.h5")
This code loads a saved Keras model from an HDF5 file, evaluates the model’s performance on the test data, and prints the test loss and accuracy.
First, the load_model_hdf5() function from the Keras package is used to load the saved model from the file “best_model_advanced.h5” into the model object.
Next, the evaluate() function is used to evaluate the model’s performance on the test data. The input data and labels are provided as lists (list(time_series_test, nlp_data_test) and test_labels, respectively), and the batch size is set to 32.
Finally, the test loss and accuracy are printed using cat() and paste0() functions. The test loss is obtained from the first element of the test_loss_and_metrics object, and the test accuracy is obtained from the second element.
# Load the best saved model
library('keras')
model <- load_model_hdf5("best_model_advanced.h5")
# Evaluate the model on the test data
test_loss_and_metrics <- model %>% evaluate(
x = list(time_series_test, nlp_data_test),
y = test_labels,
batch_size = 32
)
# Print the test loss and accuracy
cat(paste0("Test loss: ", test_loss_and_metrics[1], "\n"))
## Test loss: 0.704934179782867
cat(paste0("Test accuracy: ", test_loss_and_metrics[2], "\n"))
## Test accuracy: 0.538181841373444
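Accuracy alone hides the error structure. A confusion matrix on the test period would show how the model trades off false positives and false negatives; a sketch using the loaded model (the 0.5 threshold is an assumption):
pred_probs <- model %>% predict(list(time_series_test, nlp_data_test))
table(actual = test_labels, predicted = as.integer(pred_probs > 0.5))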
As we can observe from the two accuracies, the first and simpler model performs substantially better than the more complex one: roughly 75% versus 54%. The more complex model adds layers (pooling and dropout) and splits the dataset into train and test sets using an index as a slicer, so that the split respects the time ordering of the data; the first model lacks these features, and its validation data is sampled randomly from the dataset. Note also that the first model's score is computed over the full dataset, including the observations it was trained on, so its 75% is an optimistic estimate compared with the second model's held-out test accuracy. This suggests two conclusions: 1. The relationship between tweets and stock fluctuations appears only weakly affected by the length of the time horizon considered. 2. The more advanced model may need a more comprehensive dataset, such as the one produced in the feature-engineering part; that could raise its accuracy, hopefully above 75%.