Data collection is one of the most important stages in the Big Data cycle. The Internet provides an almost unlimited supply of data on a wide variety of topics. How much this matters depends on the type of business, but even traditional industries can acquire diverse external data and combine it with their transactional data.
For example, let’s assume we would like to build a system that recommends restaurants. The first step would be to gather data, in this case reviews of restaurants from different websites, and store them in a database. As we are interested in the raw text and would use it for analytics, it is not that relevant where the data for developing the model is stored. This may sound contradictory to the main big data technologies, but in order to implement a big data application, we simply need to make it work in real time.
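To illustrate the point that the storage location matters little at this stage, here is a minimal sketch in base R that keeps raw review text in a local file rather than a database. The restaurant names, site labels, and review texts are hypothetical sample values, and the file name `reviews.rds` is an assumption made for the example.

```r
# Hypothetical sample reviews; in practice these would be gathered from websites.
reviews <- data.frame(
  restaurant = c("Cafe Uno", "Cafe Uno", "Bistro Dos"),
  source     = c("site_a", "site_b", "site_a"),
  text       = c("Great pasta.", "Slow service.", "Lovely terrace."),
  stringsAsFactors = FALSE
)

# Store the raw text in a local file (assumed name) instead of a database.
path <- file.path(tempdir(), "reviews.rds")
saveRDS(reviews, path)

# Read the data back, as a later modelling step would.
restored <- readRDS(path)
nrow(restored)
```

Swapping the file for a database table later would only change the save and load steps; the analytics code working on the `text` column would stay the same.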
Once the problem is defined, the next stage is to collect the data. The following mini-project collects data from the web and structures it for use in a machine learning model. We will collect some tweets from the Twitter REST API using the R programming language.
First of all, create a Twitter account, and then follow the instructions in the twitteR package vignette to create a Twitter developer account. This is a summary of those instructions:
Go to https://twitter.com/apps/new and log in.
After filling in the basic info, go to the "Settings" tab and select "Read, Write and Access direct messages".
Make sure to click the save button after doing this.
In the "Details" tab, take note of your consumer key and consumer secret.
In your R session, you’ll be using the API key and API secret values.
Finally, run the following script, which installs the twitteR package from its repository on GitHub.
install.packages(c("devtools", "rjson", "bit64", "httr"))

# Make sure to restart your R session at this point
library(devtools)
install_github("geoffjentry/twitteR")
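The keys noted in the steps above have to reach the R session somehow. One option, sketched here as an assumption rather than something the twitteR vignette prescribes, is to read them from environment variables so they never appear in the script itself. The variable name `TWITTER_CONSUMER_KEY` and the helper `read_credential` are hypothetical.

```r
# Hypothetical helper: fetch a credential from an environment variable
# and fail early with a clear message if it is missing.
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable not set: ", name)
  }
  value
}

# Demonstration only: set a placeholder so the example runs end to end.
# In real use the variable would be set outside the script, e.g. in ~/.Renviron.
Sys.setenv(TWITTER_CONSUMER_KEY = "xxxxxxxxxxxxxxxxxxxx")

consumer_key <- read_credential("TWITTER_CONSUMER_KEY")
```

The same helper could be called once for each of the four values that the authentication step needs.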
We are interested in tweets that contain the string "big mac" and in finding out which topics stand out in them. The first step is to collect the data from Twitter. Below is our R script for collecting the required data. This code is also available in the bda/part1/collect_data/collect_data_twitter.R file.
rm(list = ls(all = TRUE)); gc() # Clears the global environment
library(twitteR)
Sys.setlocale(category = "LC_ALL", locale = "C")

### Uncomment these lines and replace the xxx's with the values you got
### from the previous instructions
# consumer_key = "xxxxxxxxxxxxxxxxxxxx"
# consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# access_token = "xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# access_token_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Connect to the Twitter REST API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_token_secret)

# Get tweets related to big mac
tweets <- searchTwitter('big mac', n = 200, lang = 'en')
df <- twListToDF(tweets)

# Take a look at the data
head(df)

# Check which device is most used
sources <- sapply(tweets, function(x) x$getStatusSource())
# Each source is an HTML link; drop the closing tag, then keep the link text
sources <- gsub("</a>", "", sources)
sources <- strsplit(sources, ">")
sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
source_table <- table(sources)
# Keep only the clients that appear more than once
source_table <- source_table[source_table > 1]
freq <- source_table[order(source_table, decreasing = TRUE)]
as.data.frame(freq)

#                        Frequency
# Twitter for iPhone            71
# Twitter for Android           29
# Twitter Web Client            25
# recognia                      20
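The device-counting step at the end of the script can be tried without a Twitter connection. The sketch below applies the same idea (strip the closing tag, split on `>`, keep the link text) to a few hypothetical source strings written in the HTML form that the script's clean-up code expects. The sample strings and the helper name `extract_client` are assumptions made for illustration.

```r
# Hypothetical source strings, shaped like the HTML links the script cleans up.
raw_sources <- c(
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  'web'
)

# Hypothetical helper: return the visible client name from each source string.
extract_client <- function(x) {
  x <- gsub("</a>", "", x, fixed = TRUE)      # drop the closing tag
  parts <- strsplit(x, ">", fixed = TRUE)     # separate the opening tag from the text
  # Use the text after the opening tag when present, else the string itself
  vapply(parts, function(p) if (length(p) > 1) p[2] else p[1], character(1))
}

clients <- extract_client(raw_sources)

# Count how often each client occurs, most frequent first.
client_counts <- sort(table(clients), decreasing = TRUE)
as.data.frame(client_counts, stringsAsFactors = FALSE)
```

Using `vapply` with `character(1)` instead of `sapply` is a design choice of this sketch: it guarantees a character vector even when the input is empty.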