Version 1.0
Computing with Text Activity
Text can be analyzed in many different ways. Research areas like "stylometrics" attempt to say something
quantitative about an author's work, for example by computing the average number of words per
sentence or the average number of letters per word. We will start by looking at
word counts in picture captions from Flickr.
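As a small illustration of the kind of measurement stylometrics uses, the sketch below computes two averages with base R. It treats each caption as the unit (instead of a sentence) and uses a few captions copied from this activity's sample output; the variable names are made up for this example.

```r
# Toy data: a few captions taken from the sample output in this activity.
caps <- c("power lunch", "Me and Brooke", "Frosty Morning", "Lily")

# Split each caption into words wherever there is whitespace.
words_per_caption <- strsplit(caps, "\\s+")

# Average number of words per caption.
avg_words <- mean(lengths(words_per_caption))

# Average number of letters per word, across all words in all captions.
all_words <- unlist(words_per_caption)
avg_letters <- mean(nchar(all_words))

print(avg_words)    # 2
print(avg_letters)  # 4.75
```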
Part I—Basic Analytics
In this first part, we will address the basics of counting words in a file and creating word clouds based on
those counts.
Directions: Enter the R code, note the results, answer the questions, and follow any other directions
provided by your teacher.
R Code captions
R Code head(captions)
Description A glance at the captions
Output [1] "unhealthy obsessions. - 22/04/2010" "power lunch"
[3] "Me and Brooke" "Frosty Morning"
[5] "Chill Day." "Lily"
>
R Code sort(captions)
Description Display them sorted
Output [1] "...and I would walk 500 miles..."
[2] "...the heart of Spring..."
[3] "a band?? no, just friends!"
Exploring Computer Science—Unit 6: Participatory Urban Sensing “R” Supplement Page 1
[4] "A Chill in the Air"
[5] "a chill out drink after a fun night"
[6] "A chill still clings upon this place. While the moon still shows it's might face. The
growth seems stunted in the earth. While Spring awaits it's glorious rebirth. But Winter
hasn't had it's fill. And we must wait for the sunlight's thrill."
[7] "a cute monkey chilling after a full stomach of garbage at Angkor Wat"
[8] "A Fable"
[9] "A ghostly chill"
[10] "A glimpse of the magical"
…
[2159] "Young & Innocent"
[2160] "Young lion chilling"
[2161] "YOUR MTV (23.04.2010)"
[2162] "Your Tent Smells"
[2163] "Yulia Chilling"
[2164] "Yulia Chilling B&W"
[2165] "Yulia Chilling Close up"
[2166] "Yulia Chilling Close up B&W"
[2167] "Yum Yum . . ."
[2168] "Yun Zi chilling in his hammock"
[2169] "Zach"
[2170] "zara chills"
[2171] "Zev and Kai [2]"
[2172] "Zion & Ivan"
R Code head(sort(captions),n=10)
Description The first 10 captions after sorting.
Output [1] "...and I would walk 500 miles..."
[2] "...the heart of Spring..."
[3] "a band?? no, just friends!"
[4] "A Chill in the Air"
[5] "a chill out drink after a fun night"
[6] "A chill still clings upon this place. While the moon still shows it's might face.
The growth seems stunted in the earth. While Spring awaits it's glorious rebirth.
But Winter hasn't had it's fill. And we must wait for the sunlight's thrill."
[7] "a cute monkey chilling after a full stomach of garbage at Angkor Wat"
[8] "A Fable"
[9] "A ghostly chill"
[10] "A glimpse of the magical"
>
**Text mining – analyzing word counts
In order to do text mining, we need to first load the text mining library.
R Code library(tm)
Description Load text mining library.
Output >
Next, we are going to turn our vector of captions into a "corpus". A corpus is the term used to describe
a collection of writings. We need to do this so we can do some more sophisticated things with the captions.
R Code capcorp
R Code capcorp
Description
Output A corpus with 2172 text documents
R Code dtm = DocumentTermMatrix(capcorp)
Description Separates out each word and counts how many times it shows up in all the captions.
Output >
R Code dtm = DocumentTermMatrix(capcorp,control=list(weighting=weightBin))
Description Recreates the Document Term Matrix, but only counts a word once per caption. In other
words, if a caption is “Awesome. Awesome! Simply Awesome!” Awesome is only
counted as showing up 1 time.
Output >
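To see the difference between plain counts and the "once per caption" weighting, here is a minimal base R sketch. It is not how the tm library works internally; it lowercases and strips punctuation first (a simplification made for this example), and the helper name tokenize is made up.

```r
caps <- c("Awesome. Awesome! Simply Awesome!", "Simply chill")

# Lowercase, strip punctuation, then split on whitespace (simplifications for this sketch).
tokenize <- function(x) strsplit(gsub("[[:punct:]]", "", tolower(x)), "\\s+")[[1]]
tokens <- lapply(caps, tokenize)

# Vocabulary: every distinct word, sorted.
vocab <- sort(unique(unlist(tokens)))

# One row per caption, one column per word; each cell is a raw count.
counts <- t(sapply(tokens, function(tk) tabulate(match(tk, vocab), nbins = length(vocab))))
colnames(counts) <- vocab

# Binary weighting: 1 if the word appears in the caption at all, otherwise 0.
binary <- (counts > 0) * 1

print(counts)  # "awesome" is 3 in the first row
print(binary)  # "awesome" is 1 in the first row
```

Summing a column of the binary matrix tells you how many captions contain that word, not how many times the word was typed.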
**Word Clouds
A Word Cloud is a graphical representation of the word counts. The higher the count for the word, the
bigger the word is in the Word Cloud.
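The idea of "higher count, bigger word" can be sketched with a little arithmetic. The example below is not the cloud function from the snippets library; it is a base R sketch with made-up counts that drops rare words and then scales the remaining counts to text sizes between 1 and 4.

```r
# Hypothetical total counts for a few words (made-up numbers, for illustration only).
word_counts <- c(chill = 40, day = 12, spring = 9, lily = 2)

# Keep only words that occur at least min_count times.
min_count <- 9
kept <- word_counts[word_counts >= min_count]

# Scale linearly: the most frequent word gets size 4, the least frequent gets size 1.
sizes <- 1 + 3 * (kept - min(kept)) / (max(kept) - min(kept))

print(round(sizes, 2))
```

Sizes like these could be passed as the cex argument of the base graphics function text() to draw each word at its own size.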
R Code library(snippets)
Description Load the library for making Word Clouds.
Output Attaching package: 'snippets'
The following object(s) are masked from package:lattice :
Cloud
R Code make_cloud = removeLessThan]
if (zoom)
words
R Code make_cloud(dtm)
Description Create a Word Cloud. It will look different if your graphics window is a different size.
Output
1. What words appear to have the highest count?
R Code make_cloud(dtm, removeLessThan=9, zoom=TRUE)
Description Create a Word Cloud with only the most frequently occurring words (9 times or more). It also
“zooms” in so you can read all the words. (For your final project, you may want to try a few
different settings and see which looks better.)
Output
2. Notice that there is a “chill” and a “chill,” (with a comma). Do you think we should include those
counts together or keep them separate? Why?
R Code head(sort(apply(dtm,2,sum),decreasing=TRUE),n=50)
Description A sorted view of the 50 words that occur most. Does this match your Word Cloud?
Output
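The call head(sort(apply(dtm,2,sum),decreasing=TRUE),n=50) does three things in one line: it sums each column, sorts the sums, and keeps the first few. The sketch below walks through the same steps on a tiny made-up matrix, where rows stand for captions and columns stand for words.

```r
# Toy document-term matrix: 3 captions (rows) by 3 words (columns), binary weights.
m <- matrix(c(1, 0, 1,
              1, 1, 1,
              1, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("chill", "day", "spring")))

# apply(m, 2, sum) sums each column: the number of captions containing each word.
totals <- apply(m, 2, sum)

# Sort from largest to smallest and keep the top 2.
top <- head(sort(totals, decreasing = TRUE), n = 2)

print(totals)  # chill 3, day 1, spring 2
print(top)     # chill 3, spring 2
```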
Part II—Focusing on the Words
**Removing case
Make “Chill” and “chill” be the same thing by making everything lowercase.
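On a plain character vector, lowercasing can be done with the base R function tolower. This is a small sketch on made-up sample captions, separate from the corpus commands used in this activity.

```r
caps <- c("Chill Day.", "Me and Brooke", "chill day.")

# Convert every letter to lower case.
lower <- tolower(caps)
print(lower)

# Before lowercasing there are 3 distinct captions; afterwards only 2.
print(length(unique(caps)))
print(length(unique(lower)))
```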
R Code inspect(capcorp[1:10])
Description Captions with upper and lower case.
Output A corpus with 10 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
unhealthy obsessions. - 22/04/2010
[[2]]
power lunch
[[3]]
Me and Brooke
[[4]]
Frosty Morning
[[5]]
Chill Day.
[[6]]
Lily
[[7]]
Spring has returned // The earth is like a child // that knows poems
[[8]]
in search of the light
[[9]]
onehundredandfourteen
[[10]]
Mr Lee
R Code capcorp
R Code inspect(capcorp[1:10])
Description
Output A corpus with 10 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
unhealthy obsessions. - 22/04/2010
[[2]]
power lunch
[[3]]
me and brooke
[[4]]
frosty morning
[[5]]
chill day.
[[6]]
lily
[[7]]
spring has returned // the earth is like a child // that knows poems
[[8]]
in search of the light
[[9]]
onehundredandfourteen
[[10]]
mr lee
R Code dtm = DocumentTermMatrix(capcorp,control=list(weighting=weightBin))
Description Document Term Matrix of all lower case words.
Output >
R Code make_cloud(dtm, removeLessThan=9, zoom=TRUE)
Description Word Cloud based on just lower case words. In this case, the cloud does not
change (most of the captions are written in lower case).
Output 3. Paste your new Word Cloud in a document.
**Removing "stop" words
Some words like “a” and “the” are probably always going to show up big in our word clouds because
they are common parts of speech. We can remove those words to emphasize the other less common
words.
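Removing stop words amounts to dropping every word that appears on a list. Here is a minimal base R sketch of that idea; the short stop_list and the helper remove_stops are made up for this example and are not the functions this activity uses.

```r
# A tiny hand-picked stop list for this sketch; a real stop word list is much longer.
stop_list <- c("a", "and", "in", "of", "the", "me")

remove_stops <- function(caption, stops) {
  words <- strsplit(caption, "\\s+")[[1]]            # split the caption into words
  paste(words[!(words %in% stops)], collapse = " ")  # keep words not on the list
}

print(remove_stops("in search of the light", stop_list))  # "search light"
print(remove_stops("me and brooke", stop_list))           # "brooke"
```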
R Code stopwords()
Description List of the words we will filter out.
Output [1] "a" "about" "above" "across" "after"
[6] "again" "against" "all" "almost" "alone"
[11] "along" "already" "also" "although" "always"
[16] "am" "among" "an" "and" "another"
…
[466] "working" "works" "would" "wouldn't" "x"
[471] "y" "year" "years" "yes" "yet"
[476] "you" "you'd" "you'll" "young" "younger"
[481] "youngest" "your" "you're" "yours" "yourself"
[486] "yourselves" "you've" "z"
R Code library(Snowball)
Description This library will help us remove the stop words.
Output >
R Code capcorp
R Code dtm = DocumentTermMatrix(capcorp,control=list(weighting=weightBin))
Description Document Term Matrix with the stop words removed.
Output >
R Code make_cloud(dtm, removeLessThan=9, zoom=TRUE)
Description Word Cloud with the stop words removed.
Output 4. Paste your new Word Cloud in a document.
5. What are some of the words that disappeared from your Word Cloud?
6. Why might it be useful to get rid of these stop words?
**Deleting punctuation
Notice that many of our captions include symbols other than numbers and letters. We can remove them
as follows.
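For a plain character vector, punctuation can be deleted with a regular expression. The sketch below uses base R only; the helper name strip_punct is made up, and it also squeezes the leftover double spaces, which is an extra step chosen for this example.

```r
strip_punct <- function(x) {
  no_punct <- gsub("[[:punct:]]", "", x)   # delete punctuation characters
  squeezed <- gsub("\\s+", " ", no_punct)  # collapse runs of whitespace into one space
  trimws(squeezed)                         # drop leading and trailing spaces
}

print(strip_punct("unhealthy obsessions. - 22/04/2010"))       # "unhealthy obsessions 22042010"
print(strip_punct("spring returned // earth child // poems"))  # "spring returned earth child poems"
```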
R Code inspect(capcorp[1:10])
Description First 10 captions. Notice the characters that aren’t letters, like “/” and “.”
Output A corpus with 10 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
unhealthy obsessions. - 22/04/2010
[[2]]
power lunch
[[3]]
brooke
[[4]]
frosty morning
[[5]]
chill day.
[[6]]
lily
[[7]]
spring returned // earth child // poems
[[8]]
search light
[[9]]
onehundredandfourteen
[[10]]
lee
R Code capcorp
R Code inspect(capcorp[1:10])
Description No more “/” and “.”
Output A corpus with 10 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
unhealthy obsessions 22042010
[[2]]
power lunch
[[3]]
brooke
[[4]]
frosty morning
[[5]]
chill day
[[6]]
lily
[[7]]
spring returned earth child poems
[[8]]
search light
[[9]]
onehundredandfourteen
[[10]]
lee
7. Which captions changed?
R Code dtm = DocumentTermMatrix(capcorp,control=list(weighting=weightBin))
Description Create our new Document Term Matrix without the punctuation
Output >
R Code make_cloud(dtm, removeLessThan=9, zoom=TRUE)
Description
Output 8. Paste your new Word Cloud in a document.
9. What happened to “chill,”?
10. Did “chill” change size? Why?
Part III—Advanced Analytics
**Stemming
We might want to ignore word endings like “s”, “ing”, etc. In other words, we’ll turn words like
“boats” and “boating” into just “boat”. This is called stemming.
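To get a feel for what stemming does, here is a toy stemmer in base R that only strips a trailing “ing” or “s”. It is a deliberately crude sketch, not the algorithm a real stemming library uses, and the name toy_stem is made up.

```r
# Toy stemmer: remove a trailing "ing" or "s". Real stemmers handle many more cases.
toy_stem <- function(words) sub("(ing|s)$", "", words)

stems <- toy_stem(c("boats", "boating", "boat", "chilling", "chills"))
print(stems)  # "boat" "boat" "boat" "chill" "chill"
```

A rule this simple also makes mistakes: it would turn “spring” into “spr”. Real stemmers add extra rules to avoid cutting words like that.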
R Code capcorp
R Code dtm = DocumentTermMatrix(capcorp,control=list(weighting=weightBin))
Description Document Term Matrix of stemmed captions
Output >
R Code make_cloud(dtm, removeLessThan=9, zoom=TRUE)
Description Word Cloud of stemmed captions.
Output 11. Paste your new Word Cloud in a document.
12. How did the Word Cloud change? Why?
**How to subset based on a word
You can group your captions by certain words.
R Code chill_caps
R Code length(captions)
Description 2172 captions in total
Output [1] 2172
R Code length(chill_caps)
Description 835 captions contain the word “chill”
Output [1] 835
R Code nochill_caps
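One possible way to split captions by a word is with the base R function grepl, which returns TRUE for each caption that matches a pattern. This is only a sketch on a few sample captions; the variable names are made up, and it assumes that any caption containing the letters “chill” (so also “chilling”), in upper or lower case, should count.

```r
caps <- c("A Chill in the Air", "Frosty Morning", "Young lion chilling", "Lily")

# TRUE for each caption that contains "chill", ignoring upper/lower case.
has_chill <- grepl("chill", caps, ignore.case = TRUE)

with_chill <- caps[has_chill]      # captions that mention chill
without_chill <- caps[!has_chill]  # all the other captions

print(with_chill)
print(length(with_chill))     # 2
print(length(without_chill))  # 2
```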
Summary on how to make a Word Cloud from your data:
1. access the appropriate libraries
2. create a corpus
3. (optional) filter out what you don't want (punctuation, stop words, capitalization, word endings)
4. create a Document Term Matrix
5. copy the make_cloud function (you only need to do this once)
6. call make_cloud (options to remove infrequent words or zoom)