Big Data Analytics - Cleansing Data

Twitter Mini Project

In the present case we need to first clean the unstructured data and then convert it to a data matrix in order to apply topics modelling on it. In general, when getting data from twitter, there are several characters we are not interested in using, at least in the first stage of the data cleansing process.

For example, after getting the tweets we get these strange characters: "<ed><U+00A0><U+00BD><ed><U+00B8><U+008B>". These are probably emoticons, so in order to clean the data, we will just remove them using the following script. This code is also available in bda/part1/collect_data/cleaning_data.R file.

rm(list = ls(all = TRUE)); gc() # Clears the global environment
source('collect_data_twitter.R')
# Some tweets
head(df$text)

[1] "I’m not a big fan of turkey but baked Mac &
cheese <ed><U+00A0><U+00BD><ed><U+00B8><U+008B>"
[2] "@Jayoh30 Like no special sauce on a big mac. HOW"
### We are interested in the text - Let’s clean it!

# We first convert the encoding of the text from latin1 to ASCII
df$text <- sapply(df$text,function(row) iconv(row, "latin1", "ASCII", sub = ""))

# Create a function to clean tweets
clean.text <- function(tx) {
  tx <- gsub("htt.{1,20}", " ", tx, ignore.case = TRUE)
  tx = gsub("[^#[:^punct:]]|@|RT", " ", tx, perl = TRUE, ignore.case = TRUE)
  tx = gsub("[[:digit:]]", " ", tx, ignore.case = TRUE)
  tx = gsub(" {1,}", " ", tx, ignore.case = TRUE)
  tx = gsub("^\\s+|\\s+$", " ", tx, ignore.case = TRUE)
  return(tx)
}  

clean_tweets <- lapply(df$text, clean.text)

# Cleaned tweets
head(clean_tweets)
[1] " WeNeedFeminlsm MAC s new make up line features men woc and big girls "
[1] " TravelsPhoto What Happens To Your Body One Hour After A Big Mac "

The final step of the data cleansing mini project is to have cleaned text we can convert to a matrix and apply an algorithm to. From the text stored in the clean_tweets vector we can easily convert it to a bag of words matrix and apply an unsupervised learning algorithm.

Previous Page Print Page