Geotagged Tweet Heat

So I’ve worked through the codecademy.com python course and have been looking for where to go next.

I figured learning to use the Twitter API and the Tweepy Library would be a way to gather some really interesting real world spatial data.

My first idea was to scrape all the tweets directed at @qanda that had a coordinate location attached. Q&A is a very popular Australian panel discussion show that attracts 20,000–40,000 tweets an episode (Source). My thinking was that there should be enough tweets with coordinate data to get an idea of the distribution of @qanda tweets within the major Australian cities. I was mistaken.

During the episode in which I ran the code, only 48 tweets were returned, and around 40 of those appeared to come from the same household.

So, still interested in Twitter’s potential for gathering coordinate data, I decided to cast the net wider.

My next aim was to find places in the world where there are a lot of tweets being made that DO have coordinate data attached. This would allow me to develop an idea of places I could carry out fine scale spatial analyses of tweets.

The main challenges I faced in devising the code were:

- Ensuring that the code would only save tweets with a specific point coordinate, excluding tweets with other location data such as a city or country name, or a bounding box of coordinates.

- Maintaining a relatively stable number of tweets saved per second, to limit the overall database size.

- Ensuring that the code did not stop running due to connection issues.
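
The first of those checks can be sketched on its own. Field names follow Twitter's (legacy) streaming payload, where the deprecated "geo" field holds an exact point as [lat, lon], while city- and country-level locations (including bounding boxes) arrive under "place" instead; the sample tweets below are made up:

```python
import json

def exact_point(tweet_json):
    """Return (lon, lat) only for tweets carrying an exact point coordinate.

    Tweets tagged only with a 'place' (city/country name plus a bounding
    box) have geo == None and are rejected.
    """
    geo = tweet_json.get("geo")
    if geo is None or geo.get("type") != "Point":
        return None
    lat, lon = geo["coordinates"]  # the 'geo' field stores [lat, lon]
    return lon, lat

# A tweet tagged only with a city-level place has geo == None:
place_only = json.loads('{"geo": null, "place": {"name": "Sydney"}}')
point = json.loads('{"geo": {"type": "Point", "coordinates": [-33.87, 151.21]}}')
print(exact_point(place_only))  # None
print(exact_point(point))       # (151.21, -33.87)
```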

Here is the code I ended up running:

from __future__ import absolute_import, print_function

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from time import gmtime, strftime
import time
import json

consumer_key = ""
consumer_secret = ""

access_token = ""
access_token_secret = ""

MAX_TWEETS = 500000   # cap on the total number of tweets saved
MAX_RATE = 5          # tweets saved per second, to limit database size

out_file = open('GLOBALBIG.csv', 'a')
out_file.write("TIME&X&Y\n")
count = 0
time_start = time.time()
print("THIS IS START TIME", time_start)


class StdOutListener(StreamListener):

    def on_data(self, data):
        global count

        if count > MAX_TWEETS:
            out_file.close()
            return False  # returning False disconnects the stream

        json_data = json.loads(data)
        coords = json_data.get("geo")
        if coords is None:
            # No exact point coordinate (city/country names and
            # bounding boxes arrive under "place", not "geo"); skip.
            return True

        # Throttle: only save while the running average stays under MAX_RATE
        since_start = time.time() - time_start
        if count / since_start < MAX_RATE:
            try:
                # The "geo" field stores coordinates as [lat, lon]
                lat = coords["coordinates"][0]
                lon = coords["coordinates"][1]
            except KeyError:
                print("Fail")
                return True
            print(data)
            ptime = strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime())
            out_file.write(str(ptime) + "&")
            out_file.write(str(lon) + "&")
            out_file.write(str(lat) + "\n")
            count += 1
        return True

    def on_error(self, status):
        print(status)


if __name__ == '__main__':
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    while True:
        try:
            stream = Stream(auth, l)
            stream.filter(locations=[-179, -89, 179, 89])
        except Exception:
            # Restart the stream after any connection error
            continue
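
Once the stream has been running, the '&'-delimited file it writes (the unusual delimiter presumably avoids clashing with the commas inside the timestamps) can be read back with the csv module. A minimal sketch, with made-up sample rows:

```python
import csv
import io

# Sample rows in the format the stream listener writes (header TIME&X&Y).
sample = (
    "TIME&X&Y\n"
    "Fri, 08 Apr 2016 06:00:12 +0000&151.21&-33.87\n"
    "Fri, 08 Apr 2016 06:00:15 +0000&-73.99&40.73\n"
)

# In practice, replace io.StringIO(sample) with open('GLOBALBIG.csv').
reader = csv.DictReader(io.StringIO(sample), delimiter="&")
points = [(float(r["X"]), float(r["Y"])) for r in reader]
print(points)  # [(151.21, -33.87), (-73.99, 40.73)]
```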

ianbrod’s post here helped form the basis of my code and Eugene Yan’s post here helped fix network connection issues. I also had some great help from friends who are much more experienced and clever than I am, who I won’t embarrass by naming but will say thanks to here.

The map below represents the 341,029 tweets gathered between 0600 GMT 08 April and 0600 GMT 09 April, 2016.

[Map: global distribution of the geotagged tweets gathered]

This tells you a fair bit with the naked eye alone; for me, the biggest surprise packages were Indonesia (particularly Java) and Turkey.

I have given some summary data below for the ten countries with the highest number of tweets gathered.

Country          No. of tweets gathered   Tweets/km² x 100   Tweets/capita x 10,000
United States    65709                    0.67               2.06
Indonesia        34460                    1.80               1.38
Japan            25415                    6.72               2.00
Brazil           22547                    0.26               1.13
Malaysia         17030                    5.16               5.73
Turkey           17021                    2.17               2.27
Philippines      10744                    3.58               1.09
Thailand         10645                    2.07               1.59
Argentina         9176                    0.33               2.21
United Kingdom    9153                    3.77               1.43
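
The two normalised columns are straightforward to reproduce. The tweet count below is from the table; the area and population figures are approximate assumptions of my own (roughly 2016 values), not from the original data:

```python
# Reproduce the table's normalised columns for the United States.
us_tweets = 65709           # from the table above
us_area_km2 = 9833517       # assumed total area, km²
us_population = 319000000   # assumed 2016 population

tweets_per_km2_x100 = us_tweets / us_area_km2 * 100
tweets_per_capita_x10000 = us_tweets / us_population * 10000

print(round(tweets_per_km2_x100, 2))       # 0.67
print(round(tweets_per_capita_x10000, 2))  # 2.06
```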

This data provides a valuable foundation and will form the basis of my site selection for analysing fine-scale spatial patterns in tweets in the future.

I also realised that the data could be used to display temporal patterns, and took this as a chance to try out a different visual style. The QGIS heatmap rendering style provides a simpler, clearer representation of the areas of concentration. The resulting animation displays the concentrated areas of geotagged tweets and the drop-off in Twitter traffic overnight.
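
As a sketch of how the temporal binning works: the TIME column written by the script can be parsed back with the same strftime format string and grouped by hour. The timestamps below are made up for illustration; in practice they come from GLOBALBIG.csv:

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps in the exact format the script writes.
stamps = [
    "Fri, 08 Apr 2016 06:00:12 +0000",
    "Fri, 08 Apr 2016 06:59:01 +0000",
    "Fri, 08 Apr 2016 23:15:40 +0000",
]

# Count tweets per GMT hour; plotting these counts per hour (or
# rendering each hour's points as a heatmap frame) gives the animation.
per_hour = Counter(
    datetime.strptime(s, "%a, %d %b %Y %H:%M:%S +0000").hour for s in stamps
)
print(per_hour[6])   # 2 tweets in the 06:00 GMT hour
print(per_hour[23])  # 1 tweet in the 23:00 GMT hour
```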

[Animation: heatmap of geotagged tweets over the 24-hour collection period]

Again, thanks to http://thematicmapping.org/ for the world map shapefile.
