
Supplement


Shared by: qinmei liao
posted:
11/16/2011
Version 1.0





Computing with Text Activity



Text can be analyzed many different ways. Research areas like "stylometrics" attempt to say something

quantitative about an author's work, for example, by computing the average number of words per

sentence or the average number of letters per word written by an author. We will start by looking at

word counts in picture captions from Flickr.
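To make the two stylometric measures concrete, here is a minimal sketch in base R. It uses three of the sample captions shown later in this activity and treats each caption as one "sentence", which is an assumption made only for this example.

```r
# Three captions copied from the sample output in this activity.
captions_demo <- c("power lunch", "Me and Brooke", "Frosty Morning")

# Split each caption into words wherever there is whitespace.
words_per_caption <- strsplit(captions_demo, "\\s+")

# Average number of words per caption.
avg_words <- mean(lengths(words_per_caption))

# Average number of letters per word, taken over all the words.
all_words <- unlist(words_per_caption)
avg_letters <- mean(nchar(all_words))

print(avg_words)
print(avg_letters)
```

The name captions_demo is made up for this sketch so that it does not overwrite the captions vector used in the rest of the activity.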



Part I—Basic Analytics



In this first part, we will address the basics of counting words in a file and creating word clouds based on

those counts.



Directions: Enter the R code, note the results, answer the questions, and follow any other directions

provided by your teacher.



R Code captions







R Code head(captions)



Description A glance at the captions



Output [1] "unhealthy obsessions. - 22/04/2010" "power lunch"



[3] "Me and Brooke" "Frosty Morning"



[5] "Chill Day." "Lily"



>







R Code sort(captions)



Description Display them sorted



Output [1] "...and I would walk 500 miles..."



[2] "...the heart of Spring..."



[3] "a band?? no, just friends!"





Exploring Computer Science—Unit 6: Participatory Urban Sensing “R” Supplement Page 1






[4] "A Chill in the Air"



[5] "a chill out drink after a fun night"



[6] "A chill still clings upon this place. While the moon still shows it's might face. The

growth seems stunted in the earth. While Spring awaits it's glorious rebirth. But Winter

hasn't had it's fill. And we must wait for the sunlight's thrill."



[7] "a cute monkey chilling after a full stomach of garbage at Angkor Wat"



[8] "A Fable"



[9] "A ghostly chill"



[10] "A glimpse of the magical"





[2159] "Young & Innocent"



[2160] "Young lion chilling"



[2161] "YOUR MTV (23.04.2010)"



[2162] "Your Tent Smells"



[2163] "Yulia Chilling"



[2164] "Yulia Chilling B&W"



[2165] "Yulia Chilling Close up"



[2166] "Yulia Chilling Close up B&W"



[2167] "Yum Yum . . ."



[2168] "Yun Zi chilling in his hammock"



[2169] "Zach"



[2170] "zara chills"



[2171] "Zev and Kai [2]"



[2172] "Zion & Ivan"



R Code head(sort(captions),n=10)












Description First 10 of everything sorted.



Output [1] "...and I would walk 500 miles..."



[2] "...the heart of Spring..."



[3] "a band?? no, just friends!"



[4] "A Chill in the Air"



[5] "a chill out drink after a fun night"



[6] "A chill still clings upon this place. While the moon still shows it's might face.

The growth seems stunted in the earth. While Spring awaits it's glorious rebirth.

But Winter hasn't had it's fill. And we must wait for the sunlight's thrill."



[7] "a cute monkey chilling after a full stomach of garbage at Angkor Wat"



[8] "A Fable"



[9] "A ghostly chill"



[10] "A glimpse of the magical"



>














**Text mining – analyzing word counts



In order to do text mining, we need to first load the text mining library.



R Code library(tm)



Description Load text mining library.



Output >







Next, we are going to turn our vector of captions into a "corpus". A corpus is the term used to describe

a collection of writings. We need a corpus before we can do more sophisticated things to the captions.



R Code capcorp







R Code capcorp



Description



Output A corpus with 2172 text documents







R Code dtm = DocumentTermMatrix(capcorp)



Description Separates out each word and counts how many times it shows up in all the captions.



Output >
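To see what a document term matrix holds, the sketch below builds a small one by hand in base R. It is not how the tm library does it internally; it splits only on spaces and uses three made-up captions, but the shape is the same idea: one row per caption, one column per word, and a count in each cell.

```r
# Made-up captions for illustration.
captions_demo <- c("chill day", "frosty morning", "chill chill night")

# One list element per caption, each holding that caption's words.
tokens <- strsplit(captions_demo, "\\s+")

# Every distinct word becomes a column.
vocab <- sort(unique(unlist(tokens)))

# Start with a matrix of zeros: rows are captions, columns are words.
counts <- matrix(0L, nrow = length(captions_demo), ncol = length(vocab),
                 dimnames = list(NULL, vocab))

# Add one to the matching cell for every word in every caption.
for (i in seq_along(tokens)) {
  for (w in tokens[[i]]) {
    counts[i, w] <- counts[i, w] + 1L
  }
}

print(counts)
```

In the printed matrix, the third row has a 2 under "chill" because that caption uses the word twice.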







R Code dtm = DocumentTermMatrix(capcorp,control=list(weighting=weightBin))



Description Recreates the Document Term Matrix, but only counts a word once per caption. In other

words, if a caption is “Awesome. Awesome! Simply Awesome!”, the word “Awesome” is

counted only once.

Output >
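The difference between plain counting and counting a word once per caption can be sketched in base R as follows. To keep the focus on the weighting, the example caption is already lower case with its punctuation taken out; that simplification is an assumption of this sketch, not something weightBin does.

```r
# The words of one caption, already lower case and without punctuation.
words <- c("awesome", "awesome", "simply", "awesome")

# Plain counting: every occurrence adds one.
plain <- table(words)

# Once-per-caption counting: drop the repeats first, then count.
binary <- table(unique(words))

print(plain)
print(binary)
```

With plain counting "awesome" scores 3; with once-per-caption counting it scores 1, the same as "simply".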



**Word Clouds








A Word Cloud is a graphical representation of the word counts. The higher the count for the word, the

bigger the word is in the Word Cloud.
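One simple way to turn counts into word sizes is a straight-line rescale, sketched below in base R. The counts are invented, and the size range of 1 to 4 is an arbitrary choice; the Word Cloud library used in this activity may scale sizes differently.

```r
# Invented word counts (not taken from the Flickr data).
counts <- c(chill = 40, spring = 20, morning = 10)

# Smallest and largest sizes we want on screen (arbitrary choices).
min_size <- 1
max_size <- 4

# Rescale so the rarest word gets min_size and the most common gets max_size.
sizes <- min_size + (max_size - min_size) *
  (counts - min(counts)) / (max(counts) - min(counts))

print(sizes)
```

Sizes like these could be handed to the cex argument of base R's text() function to draw each word at its own size.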





R Code library(snippets)



Description Load the library for making Word Clouds.



Output Attaching package: 'snippets'



The following object(s) are masked from package:lattice :



Cloud







R Code make_cloud = removeLessThan]



if (zoom)



words



R Code make_cloud(dtm)



Description Create a Word Cloud. It will look different if your graphics window is a different size.














Output









1. What words appear to have the highest count?









R Code make_cloud(dtm, removeLessThan=9, zoom=TRUE)










Description Create a Word Cloud with only the most frequently occurring words (9 times or more). It also

“zooms” in so you can read all the words. (For your final project, you may want to try a few

different ways and see which looks better.)

Output
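The idea behind removeLessThan=9 (keep only words that occur 9 times or more) can be sketched in base R with a simple filter. The counts below are invented, and this is only an illustration of the idea, not the code inside make_cloud.

```r
# Invented word counts (not taken from the Flickr data).
counts <- c(chill = 50, the = 30, spring = 9, fable = 8, monkey = 1)

# Words must reach this count to stay in the cloud.
threshold <- 9

# Keep words seen `threshold` times or more; drop the rest.
frequent <- counts[counts >= threshold]

print(names(frequent))
```

Notice that "spring" stays because 9 is not less than 9, while "fable" with 8 is dropped.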









2. Notice that there is a “chill” and a “chill,” (with a comma). Do you think we should include those

counts together or keep them separate? Why?







R Code head(sort(apply(dtm,2,sum),decreasing=TRUE),n=50)



Description A sorted view of the 50 words that occur most. Does this match your Word Cloud?








Output
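To unpack what head(sort(apply(...))) is doing, here is a small sketch in base R. It uses a tiny hand-made matrix in place of the real document term matrix, so the numbers are invented.

```r
# A tiny stand-in for a document term matrix:
# 3 captions (rows) by 3 words (columns).
counts <- matrix(c(1, 0, 1,
                   1, 1, 1,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("chill", "day", "spring")))

# apply(m, 2, sum) adds up each column: one total per word.
totals <- apply(counts, 2, sum)

# Put the largest totals first, then keep only the top 2.
top <- head(sort(totals, decreasing = TRUE), n = 2)

print(top)
```

The 2 in apply() means "work column by column"; a 1 would add up each row instead, giving one total per caption.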














Part II—Focusing on the Words



**Removing case



Make “Chill” and “chill” be the same thing by making everything lowercase.
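The effect of lowercasing on the counts can be seen in a short base R sketch. It uses tolower() directly on a few captions from this activity instead of working on the corpus, so it shows the idea only.

```r
# Three captions copied from the sample output in this activity.
captions_demo <- c("Chill Day.", "A ghostly chill", "A Chill in the Air")

# Count "chill" before lowercasing: "Chill" does not match.
words_before <- unlist(strsplit(captions_demo, "\\s+"))
before <- sum(words_before == "chill")

# Lowercase everything, then count again.
words_after <- unlist(strsplit(tolower(captions_demo), "\\s+"))
after <- sum(words_after == "chill")

print(before)
print(after)
```

Before lowercasing only one caption counts toward "chill"; afterwards all three do.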



R Code inspect(capcorp[1:10])



Description Captions with upper and lower case.



Output A corpus with 10 text documents



The metadata consists of 2 tag-value pairs and a data frame

Available tags are:

create_date creator

Available variables in the data frame are:

MetaID



[[1]]

unhealthy obsessions. - 22/04/2010



[[2]]

power lunch



[[3]]

Me and Brooke



[[4]]

Frosty Morning



[[5]]

Chill Day.



[[6]]

Lily



[[7]]

Spring has returned // The earth is like a child // that knows poems



[[8]]

in search of the light










[[9]]

onehundredandfourteen



[[10]]

Mr Lee









R Code capcorp









R Code inspect(capcorp[1:10])



Description



Output A corpus with 10 text documents







The metadata consists of 2 tag-value pairs and a data frame



Available tags are:



create_date creator



Available variables in the data frame are:



MetaID







[[1]]



unhealthy obsessions. - 22/04/2010



[[2]]



power lunch










[[3]]



me and brooke



[[4]]



frosty morning



[[5]]



chill day.



[[6]]



lily



[[7]]



spring has returned // the earth is like a child // that knows poems



[[8]]



in search of the light







[[9]]



onehundredandfourteen



[[10]]



mr lee

R Code dtm = DocumentTermMatrix(capcorp,control=list(weighting=weightBin))



Description Document Term Matrix of all lower case words.



Output >







R Code make_cloud(dtm, removeLessThan=9, zoom=TRUE)



Description Word Cloud based on just lower case words. In this case, the cloud does not

change (most of the captions are written in lower case).



Output 3. Paste your new Word Cloud in a document.














**Removing "stop" words



Some words like “a” and “the” are probably always going to show up big in our word clouds because

they are common parts of speech. We can remove those words to emphasize the other less common

words.
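Stop word removal is, at heart, a filter against a list. The base R sketch below uses a tiny hand-written stop list (an assumption for this example; the real list has several hundred words) on one of the captions from this activity.

```r
# A tiny hand-written stop list, for illustration only.
stop_list <- c("a", "the", "of", "in", "and")

# One caption from the sample output in this activity.
caption <- "in search of the light"

# Split into words, then keep only words that are NOT in the stop list.
words <- strsplit(caption, "\\s+")[[1]]
kept <- words[!(words %in% stop_list)]

print(paste(kept, collapse = " "))
```

This prints "search light", which is what caption [[8]] looks like once the stop words are gone.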



R Code stopwords()



Description List of the words we will filter out.



Output [1] "a" "about" "above" "across" "after"



[6] "again" "against" "all" "almost" "alone"



[11] "along" "already" "also" "although" "always"



[16] "am" "among" "an" "and" "another"















[466] "working" "works" "would" "wouldn't" "x"



[471] "y" "year" "years" "yes" "yet"



[476] "you" "you'd" "you'll" "young" "younger"



[481] "youngest" "your" "you're" "yours" "yourself"



[486] "yourselves" "you've" "z"














R Code library(Snowball)



Description This library will help us remove the stop words.



Output >









R Code capcorp







R Code dtm = DocumentTermMatrix(capcorp,control=list(weighting=weightBin))



Description Document Term Matrix with the stop words removed.



Output >







R Code make_cloud(dtm, removeLessThan=9, zoom=TRUE)



Description Word Cloud with the stop words removed.



Output 4. Paste your new Word Cloud in a document.









5. What are some of the words that disappeared from your Word Cloud?







6. Why might it be useful to get rid of these stop words?



















**Deleting punctuation



Notice that many of our captions include symbols other than numbers and letters. We can remove them

as follows.
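For a feel of what punctuation removal does to a string, here is a minimal base R sketch using gsub() on two captions from this activity. It works on plain strings, not on the corpus, and the extra step that squeezes doubled spaces is a choice made for this sketch.

```r
# Two captions from the sample output in this activity.
captions_demo <- c("chill day.", "spring returned // earth child // poems")

# [[:punct:]] matches any punctuation character; replace each with nothing.
no_punct <- gsub("[[:punct:]]", "", captions_demo)

# Removing symbols can leave doubled spaces, so squeeze them to one.
clean <- gsub("\\s+", " ", no_punct)

print(clean)
```

After this step "chill day." becomes "chill day", so it is counted together with every other "chill".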



R Code inspect(capcorp[1:10])



Description First 10 captions. Notice the characters that aren’t letters, like “/” and “.”



Output A corpus with 10 text documents



The metadata consists of 2 tag-value pairs and a data frame

Available tags are:

create_date creator

Available variables in the data frame are:

MetaID



[[1]]

unhealthy obsessions. - 22/04/2010



[[2]]

power lunch



[[3]]

brooke



[[4]]

frosty morning



[[5]]

chill day.



[[6]]

lily



[[7]]

spring returned // earth child // poems



[[8]]

search light












[[9]]

onehundredandfourteen



[[10]]

lee









R Code capcorp







R Code inspect(capcorp[1:10])



Description No more “/” and “.”



Output A corpus with 10 text documents



The metadata consists of 2 tag-value pairs and a data frame

Available tags are:

create_date creator

Available variables in the data frame are:

MetaID



[[1]]

unhealthy obsessions 22042010



[[2]]

power lunch



[[3]]

brooke



[[4]]

frosty morning



[[5]]








chill day



[[6]]

lily



[[7]]

spring returned earth child poems



[[8]]

search light



[[9]]

onehundredandfourteen



[[10]]

lee







7. Which captions changed?









R Code dtm = DocumentTermMatrix(capcorp,control=list(weighting=weightBin))



Description Create our new Document Term Matrix without the punctuation



Output >









R Code make_cloud(dtm, removeLessThan=9, zoom=TRUE)



Description



Output 8. Paste your new Word Cloud in a document.














9. What happened to “chill,”?



10. Did “chill” change size? Why?














Part III—Advanced Analytics



**Stemming



We might want to ignore word endings like “s”, “ing”, etc. In other words, we’ll change words like

“boats” and “boating” to just “boat”. This is called stemming.
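To get a feel for stemming, here is a deliberately naive sketch in base R that just chops a final "ing" or "s" off each word. This is a toy rule written for this example, not the stemming algorithm the library uses, and it makes mistakes a real stemmer avoids.

```r
# Words to stem; the last one shows where the toy rule goes wrong.
words <- c("boats", "boating", "boat", "chilling", "chills", "spring")

# Remove "ing" or "s" when it appears at the very end of a word.
stems <- sub("(ing|s)$", "", words)

print(stems)
```

The first five words come out as "boat" or "chill", as hoped, but "spring" is wrongly cut down to "spr". Real stemmers use more careful rules to avoid errors like that.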



R Code capcorp







R Code dtm = DocumentTermMatrix(capcorp,control=list(weighting=weightBin))



Description Document Term Matrix of stemmed captions



Output >







R Code make_cloud(dtm, removeLessThan=9, zoom=TRUE)



Description Word Cloud of stemmed captions.



Output 11. Paste your new Word Cloud in a document.









12. How did the Word Cloud change? Why?














**How to subset based on a word



You can group your captions by certain words.
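One possible way to split captions by a word in base R is grepl(), which returns TRUE or FALSE for each caption. The sketch below is an assumption about how such a split could be done, using four captions from this activity; the variable names ending in _demo are made up so they do not clash with the names used in the worksheet.

```r
# Four captions from the sample output in this activity.
captions_demo <- c("Chill Day.", "power lunch", "Young lion chilling", "Zach")

# TRUE where the caption contains "chill", ignoring upper/lower case.
has_chill <- grepl("chill", captions_demo, ignore.case = TRUE)

# Use the TRUE/FALSE vector to pick out each group.
chill_demo <- captions_demo[has_chill]
nochill_demo <- captions_demo[!has_chill]

print(chill_demo)
print(nochill_demo)
```

Because grepl() looks for the letters anywhere in the caption, "Young lion chilling" lands in the chill group too.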





R Code chill_caps







R Code length(captions)



Description 2172 captions in total



Output [1] 2172







R Code length(chill_caps)



Description 835 contain the word chill



Output [1] 835







R Code nochill_caps







Summary on how to make a Word Cloud from your data:



1. access the appropriate libraries

2. create a corpus

3. (optional) filter out what you don't want (punctuation, stop words, capital letters, word endings)

4. create a Document Term Matrix

5. copy cloud function (only need to do once)

6. call make_cloud (options to remove infrequent words or zoom)
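The counting steps above can also be strung together in one small base R function, sketched below. The function name word_counts and its stop_list argument are made up for this example; it stands in for the corpus, filtering and matrix steps only and does not draw a cloud.

```r
# Hypothetical helper: cleans captions and returns word counts,
# counting each word at most once per caption.
word_counts <- function(captions, stop_list = character(0)) {
  text <- tolower(captions)                          # ignore case
  text <- gsub("[[:punct:]]", "", text)              # drop punctuation
  tokens <- lapply(strsplit(text, "\\s+"), unique)   # once per caption
  words <- unlist(tokens)
  # Drop empty strings and anything on the stop list.
  words <- words[nzchar(words) & !(words %in% stop_list)]
  sort(table(words), decreasing = TRUE)              # biggest count first
}

captions_demo <- c("Chill Day.", "A ghostly chill", "chill, chill, chill")
result <- word_counts(captions_demo, stop_list = c("a", "the"))

print(result)
```

Here "chill" ends with a count of 3, one for each caption, even though the last caption repeats it three times.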








