We need to crawl 10M geotagged records from Flickr / Instagram / Twitter to build a data visualization on a map, achieving something like
[url removed, login to view]
The freelancer will need to deliver:
1. Registered Flickr / Instagram / Twitter developer accounts.
2. Research into their APIs to write a crawler that grabs data within a geofenced bounding box, e.g. the San Francisco bounding box: [url removed, login to view], 37.6040, [url removed, login to view], 37.8324.
3. Three daemon/service-like Python programs that crawl geotagged data from Flickr / Instagram / Twitter and store it in the NoSQL database MongoDB.
4. The programs should be stable enough to crawl data 24/7.
5. They should collect 1 million geotagged records per week even given the rate limits of the APIs.
6. The programs must be scalable and support multithreaded workers via a queue library, e.g. Celery in Python.
GEOTAG is a must! We don't need data without GPS information.
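As a sketch of the geofence requirement above, the crawler can drop records that lack GPS data and keep only those inside the bounding box. The field names (`lat`, `lon`) are illustrative placeholders (each platform's API nests coordinates differently), and since the brief's longitudes were stripped by the site filter, the longitude values below are made-up examples, not the real ones:

```python
def in_bbox(record, bbox):
    """Return True if the record has GPS coordinates inside bbox.

    bbox = (min_lon, min_lat, max_lon, max_lat). Records without a
    geotag are rejected outright, per the GEOTAG requirement.
    """
    lat = record.get("lat")
    lon = record.get("lon")
    if lat is None or lon is None:
        return False
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

# Illustrative bounding box: the real longitudes were removed from the
# brief above, so the -123.0 / -122.0 values here are placeholders.
# The latitudes 37.6040..37.8324 are from the posting.
SF_BBOX = (-123.0, 37.6040, -122.0, 37.8324)

records = [
    {"id": 1, "lat": 37.77, "lon": -122.42},  # inside the box
    {"id": 2, "lat": 40.71, "lon": -74.00},   # outside (New York)
    {"id": 3},                                # no geotag -> dropped
]
geotagged = [r for r in records if in_bbox(r, SF_BBOX)]
```

In the real programs, each record that passes this filter would be inserted into MongoDB; the filter itself is the same regardless of platform.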
Qualities needed to be successful:
Python experience writing service/daemon-like programs
MongoDB, Redis, Celery
Twitter / Instagram / Flickr API experience
Other Skills: Data Science, Data Scraping, MongoDB, Python, Redis, Web Crawler
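The queue-plus-multithreaded-workers pattern asked for above can be illustrated with the standard library as a stand-in (Celery itself needs a running broker such as Redis, so this sketch uses `queue` and `threading` instead; the fetch is stubbed where the real crawler would call a platform API and write to MongoDB):

```python
import queue
import threading

def crawl_worker(task_q, results, lock):
    """Worker thread: pull page descriptors off the queue and 'fetch'
    them. The fetch is a stub; a real worker would call the platform
    API and insert the documents into MongoDB."""
    while True:
        task = task_q.get()
        if task is None:              # sentinel: shut this worker down
            task_q.task_done()
            break
        data = f"fetched page {task}"  # stub for an API call
        with lock:
            results.append(data)
        task_q.task_done()

def run_crawl(pages, n_workers=4):
    """Fan the page list out to n_workers threads via a shared queue."""
    task_q = queue.Queue()
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=crawl_worker, args=(task_q, results, lock))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for p in pages:
        task_q.put(p)
    for _ in workers:
        task_q.put(None)              # one shutdown sentinel per worker
    task_q.join()
    for w in workers:
        w.join()
    return results
```

With Celery the same shape holds: each `crawl_worker` becomes a Celery task, and Redis plays the role of `task_q`, which is what makes the design scalable across processes and machines.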
You will be asked to answer the following questions when submitting a proposal:
(1) Have you written a Python crawler using the Twitter / Instagram / Flickr APIs before?
(2) Have you used a queue library (e.g. Celery) with multithreaded workers in Python to write a daemon/service-like program?
(3) Have you used a NoSQL database such as MongoDB to store data?
(4) How much time do you estimate you will need for the whole project?
(5) We want to set up a small interview milestone as a test: simply use your API access to grab 10+ raw JSON records with GEOTAG (latitude and longitude) from each of Instagram, Flickr, and Twitter.
(6) How will you deal with rate limits while crawling? Multiple IPs / accounts?
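On question (6): besides rotating accounts or IPs, crawlers usually back off when the API signals a rate limit (e.g. an HTTP 429 response). A minimal sketch of exponential backoff with a cap; the names here are illustrative, not from any platform SDK, and the 900-second cap reflects the 15-minute windows many Twitter endpoints use:

```python
import time

class RateLimitError(Exception):
    """Raised by `fetch` when the API returns a rate-limit response."""

def backoff_delays(base=1.0, cap=900.0, retries=6):
    """Yield exponentially growing sleep intervals, capped at `cap`
    seconds (900 s matches a Twitter-style 15-minute window)."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= 2

def fetch_with_backoff(fetch, *, sleep=time.sleep):
    """Call `fetch()` until it succeeds, sleeping longer after each
    rate-limited attempt. `fetch` must raise RateLimitError when the
    API says to slow down."""
    for delay in backoff_delays():
        try:
            return fetch()
        except RateLimitError:
            sleep(delay)
    return fetch()  # final attempt; let any error propagate
```

In the real crawlers, `fetch` would wrap one API page request per platform, and the per-account rate-limit state could be kept in Redis so workers sharing an account coordinate their backoff.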