Instructions for the Twitter Stream Analysis Lab

Part 1: Create the Project

The archive was generated using the #Tag application for Google Sheets.


Retrieve the latest Twitter archive from the course Materials.


Remove every column except the from_user, text, time and entities_str fields (All > Edit Columns > Reorder Columns).
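
For comparison, the same column selection can be sketched outside OpenRefine. This is only an illustration: it assumes the archive has been exported as CSV with a header row, and the sample line and the helper name keep_columns are made up for the example.

```python
import csv
import io

# The four fields the lab asks us to keep.
KEEP = ["from_user", "text", "time", "entities_str"]

def keep_columns(csv_text, keep=KEEP):
    """Read CSV text and return one dict per row with only the wanted fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: row.get(name, "") for name in keep} for row in reader]

# Hypothetical one-row export, not taken from the course archive.
sample_csv = "id_str,from_user,text,time,entities_str\n1,alice,hello,01/09/2015,{}\n"
rows = keep_columns(sample_csv)
print(rows)
```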


Part 2: Search for the most used word


Now that we have created our project, let’s look for the words most often associated with OpenRefine. Split the text field into multi-valued cells, using a space as the separator.

What are the top five words?
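
To see what this split-and-count step does, here is a minimal sketch in Python. It assumes the tweets are available as a plain list of strings; the sample tweets and the function name top_words are invented for illustration.

```python
from collections import Counter

def top_words(tweets, n=5):
    """Count space-separated tokens across all tweets and return the n most common."""
    counts = Counter()
    for text in tweets:
        # Split on a single space, as in the lab; skip empty tokens left by repeated spaces.
        counts.update(token for token in text.split(" ") if token)
    return counts.most_common(n)

# Hypothetical sample tweets, not taken from the course archive.
sample = [
    "Cleaning data with OpenRefine",
    "OpenRefine is great for messy data",
    "openrefine clusters messy data",
]
print(top_words(sample, 3))
```

Note that OpenRefine and openrefine are counted separately here, which is one reason the raw counts are not very accurate.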


Not very accurate, right? Let’s use the cluster function to clean up some near-identical entries. Can you tell which clusters return pertinent results and which do not?

Now that we are done with clustering, what are the top five words?
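
The idea behind key-based clustering can be sketched as follows. This is a simplified key function written for this handout (lowercase, strip ASCII punctuation, sort the unique tokens), not OpenRefine's exact implementation, and the word list is invented.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Simplified clustering key: lowercase, remove ASCII punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Group values that share a key; keep only groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

# Hypothetical words, not taken from the course archive.
words = ["OpenRefine", "openrefine", "#OpenRefine", "OpenRefine!", "data", "Data.", "refine"]
clusters = cluster(words)
print(clusters)
```

In this sketch the hashtag #OpenRefine lands in the same cluster as the plain word. Whether that merge is pertinent depends on the question you are asking, which is why each cluster should be reviewed before merging.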

We still have a lot of irrelevant words like to or in. A facet by text length will help to discard them. Below what number of characters should we ignore words to get interesting results? What are the top five words now?
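
The effect of a length threshold can be explored with a short sketch. The sentence below and the function name top_words_min_length are made up for illustration; the right threshold for the real archive is left for you to find.

```python
from collections import Counter

def top_words_min_length(words, min_length, n=5):
    """Count only the words with at least min_length characters."""
    counts = Counter(word for word in words if len(word) >= min_length)
    return counts.most_common(n)

# Hypothetical word list, not taken from the course archive.
words = "we want to clean data in OpenRefine and to share data".split(" ")
for threshold in (1, 3, 4):
    print(threshold, top_words_min_length(words, threshold, 3))
```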


If you want to go further, select show as: records at the top of the screen. Now select a word you are interested in from the facet to see the full tweet.


Part 3: Extracting URLs

Using the history, go back to step 1 or create a new project.


Let’s now select all the tweets that contain a URL and extract the URL into a new column.


Tip: you will need the star and flag options in the All column menu to bookmark rows, and the text search function.


The entities_str field contains JSON in which the label "expanded_url": tags the full URL in the tweet. Using Split into several columns, extract the first URL of each tweet into a single column.
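
As a cross-check for the column split, the same extraction can be sketched by parsing the JSON directly. The sketch assumes entities_str follows the usual Twitter entities layout, with a "urls" list whose items carry an "expanded_url" key; the sample value and the function name first_expanded_url are invented.

```python
import json

def first_expanded_url(entities_str):
    """Return the first expanded_url found in the entities JSON, or None if there is none."""
    try:
        entities = json.loads(entities_str)
    except (TypeError, ValueError):
        return None  # empty or malformed cell
    if not isinstance(entities, dict):
        return None
    for entry in entities.get("urls") or []:
        if isinstance(entry, dict) and entry.get("expanded_url"):
            return entry["expanded_url"]
    return None

# Hypothetical cell value, not taken from the course archive.
cell = (
    '{"hashtags": [], "urls": ['
    '{"url": "http://t.co/abc", "expanded_url": "http://example.org/first"}, '
    '{"url": "http://t.co/def", "expanded_url": "http://example.org/second"}]}'
)
print(first_expanded_url(cell))
```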


Look at the URLs that are the most shared of all time, and in the last month. Any interesting articles to read?
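
The two rankings can be sketched as a count with an optional date cut-off. The sketch assumes the time field has already been parsed into datetime values and that "last month" means the 30 days before a chosen reference date; the rows, the reference date and the function name most_shared are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

def most_shared(rows, n=3, since=None):
    """Rank URLs by number of tweets, optionally keeping only tweets on or after `since`."""
    counts = Counter(
        url for when, url in rows
        if url and (since is None or when >= since)
    )
    return counts.most_common(n)

# Hypothetical (time, url) pairs, not taken from the course archive.
rows = [
    (datetime(2015, 6, 1), "http://example.org/a"),
    (datetime(2015, 6, 2), "http://example.org/a"),
    (datetime(2015, 9, 1), "http://example.org/b"),
    (datetime(2015, 9, 10), "http://example.org/b"),
    (datetime(2015, 9, 12), "http://example.org/a"),
    (datetime(2015, 9, 15), None),  # tweet without a URL
]
reference = datetime(2015, 9, 18)  # assumed "today" for the example
print(most_shared(rows))
print(most_shared(rows, since=reference - timedelta(days=30)))
```

The two rankings can differ: in this made-up data one URL leads overall while another leads in the last 30 days.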

Last modified: Friday, 18 September 2015, 5:59 PM