So I’ve worked through the codecademy.com Python course and have been looking for where to go next.
I figured that learning to use the Twitter API and the Tweepy library would be a good way to gather some really interesting real-world spatial data.
My first idea was to scrape all the tweets directed at @qanda that have a coordinate location attached. Q&A is a very popular Australian panel discussion show that attracts 20,000-40,000 tweets an episode (Source). My thinking was that there should be enough tweets with coordinate data to get an idea of the distribution of @qanda tweets within the major Australian cities. I was mistaken.
During the episode in which I ran the code, only 48 tweets were returned, and around 40 of those looked like they came from the same household.
So, still interested in Twitter’s potential for gathering coordinate data, I decided to cast the net wider.
My next aim was to find places in the world where there are a lot of tweets being made that DO have coordinate data attached. This would allow me to develop an idea of places I could carry out fine scale spatial analyses of tweets.
The main challenges I faced in devising the code were:
- Ensuring that the code would only save tweets with a specific point coordinate, excluding tweets with other location data such as a city or country name, or a bounding box of coordinates.
- Maintaining a relatively stable number of tweets saved per second, to limit the overall database size.
- Ensuring that the code did not stop running due to connection issues.
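The first challenge comes down to checking the right field in the raw tweet JSON: a tweet with an exact point carries a populated "geo" object, while place-tagged tweets only carry a "place" object. A minimal sketch of that filter, using hypothetical sample payloads (the field names follow Twitter's v1.1 streaming format, but these particular tweets are invented for illustration):

```python
import json

# Hypothetical sample payloads for the three location cases a tweet can have.
point_tweet = json.dumps({"geo": {"type": "Point", "coordinates": [-33.87, 151.21]}})
place_tweet = json.dumps({"geo": None, "place": {"full_name": "Sydney, New South Wales"}})
plain_tweet = json.dumps({"geo": None})

def has_point_coords(raw):
    """Keep only tweets whose 'geo' field holds an exact point coordinate."""
    data = json.loads(raw)
    geo = data.get("geo")
    return geo is not None and geo.get("type") == "Point"

print([has_point_coords(t) for t in (point_tweet, place_tweet, plain_tweet)])
# -> [True, False, False]
```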
Here is the code I ended up running:
```python
from __future__ import absolute_import, print_function

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from time import gmtime, strftime
import time
import json

consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

file = open('GLOBALBIG.csv', 'a')
file.write("TIME&X&Y\n")

data_list = []
count = 0
time_start = time.time()
print("THIS IS START TIME", time_start)


class StdOutListener(StreamListener):
    def on_data(self, data):
        global count
        if count <= 500000:
            json_data = json.loads(data)
            try:
                coords = json_data["geo"]
                if coords is not None:
                    since_start = time.time() - time_start
                    # Cap the save rate at roughly five tweets per second
                    if count / since_start < 5:
                        print(data)
                        ptime = strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime())
                        # The "geo" field stores [latitude, longitude]
                        lat = coords["coordinates"][0]
                        lon = coords["coordinates"][1]
                        data_list.append(json_data)
                        file.write(str(ptime) + "&")
                        file.write(str(lon) + "&")
                        file.write(str(lat) + "\n")
                        count += 1
            except KeyError:
                print("Fail")
        else:
            file.close()
        return

    #def on_error(self, status):
    #    print(status)


if __name__ == '__main__':
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    while True:
        try:
            stream = Stream(auth, l)
            stream.filter(locations=[-179, -89, 179, 89])
        except Exception:
            continue
```
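The bare `while True: ... continue` loop does keep the stream alive, but it reconnects immediately after every failure. A common refinement (not part of my original script) is to back off between retries, doubling the pause after each consecutive failure. A sketch of that idea, with `connect` standing in for a call like `stream.filter(...)` and the sleep function injectable so the logic can be exercised without real delays:

```python
import time

def run_with_backoff(connect, sleep=time.sleep, max_wait=320):
    """Retry connect() after a network error, doubling the pause each time.
    connect is a placeholder for a streaming call such as stream.filter(...)."""
    wait = 5
    while True:
        try:
            connect()
            return  # stream ended cleanly
        except Exception:
            sleep(wait)
            wait = min(wait * 2, max_wait)

# Quick dry run: a connection that fails three times, then succeeds.
waits, attempts = [], {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 4:
        raise IOError("connection dropped")

run_with_backoff(flaky, sleep=waits.append)
print(waits)  # -> [5, 10, 20]
```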
ianbrod’s post here helped form the basis of my code and Eugene Yan’s post here helped fix network connection issues. I also had some great help from friends who are much more experienced and clever than I am, who I won’t embarrass by naming but will say thanks to here.
The map below represents the 341,029 tweets gathered between 0600 GMT 08 April and 0600 GMT 09 April, 2016.
The map tells you a fair bit just from the naked eye; for me, the biggest surprise packages were Indonesia (particularly Java) and Turkey.
I have given some summary data below for the ten countries with the highest number of tweets gathered.
| Country | No. of tweets gathered | Tweets/km² × 100 | Tweets/capita × 10,000 |
This data provides a valuable foundation and will form the basis of my site selection for analysing fine-scale spatial patterns in tweets in the future.
I also realised that the data could be used to display temporal patterns, so I took the chance to try out a different visual style. The QGIS heatmap renderer gives a simpler, clearer representation of the areas of concentration. The resulting animation shows where coordinate-tagged tweets concentrate and the drop-off in Twitter traffic overnight.
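Pulling the temporal pattern out of the CSV is straightforward, since every row starts with a GMT timestamp in the `strftime` format the collector writes. A small sketch that bins rows by hour of day (the rows here are invented examples in the `TIME&X&Y` format):

```python
from collections import Counter
import time

# Hypothetical rows in the GLOBALBIG.csv format: TIME&X&Y (lon, lat)
rows = [
    "Fri, 08 Apr 2016 06:15:02 +0000&151.21&-33.87",
    "Fri, 08 Apr 2016 06:47:10 +0000&112.75&-7.25",
    "Fri, 08 Apr 2016 23:05:44 +0000&28.98&41.01",
]

def tweets_per_hour(lines):
    """Count tweets falling in each GMT hour of the day."""
    counts = Counter()
    for line in lines:
        stamp = line.split("&")[0]
        t = time.strptime(stamp, "%a, %d %b %Y %H:%M:%S +0000")
        counts[t.tm_hour] += 1
    return counts

print(tweets_per_hour(rows))  # -> Counter({6: 2, 23: 1})
```

The resulting per-hour counts are what drives an animation frame per hour, with the overnight trough showing up as near-empty frames.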
Again, thanks to http://thematicmapping.org/ for the world map shapefile.