This project must be completed by 5/6/2013, 3:30 pm EST.
Pick any one data set that you can download from an online repository. The bigger the data set, the better. Please check with PM for approval.
- Process this data set using Hive in the cloud, e.g., on Amazon’s EC2. (You should not use your own server.)
- Use your knowledge of Hive to execute at least 4 queries on the data set
- Prepare a brief PowerPoint presentation displaying your results
- The slides should clearly describe your data set, query execution and results
- Prepare brief speaker notes to accompany the slides
- Include technical references.
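
As a rough illustration of the Hive step above, the sketch below assembles a HiveQL script containing one table definition and four queries. The table name, column names, and S3 location are hypothetical placeholders, not part of the brief; they would need to be replaced with the schema of whichever data set is approved.

```python
# Sketch only: builds a HiveQL script as text. The table, columns, and
# storage location below are hypothetical and must match the real data set.

TABLE = "page_views"                           # hypothetical table name
LOCATION = "s3://example-bucket/page_views/"   # hypothetical data location

# External table over tab-delimited text files already uploaded to storage.
CREATE_TABLE = f"""
CREATE EXTERNAL TABLE IF NOT EXISTS {TABLE} (
  view_date STRING,
  country   STRING,
  page      STRING,
  views     BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
STORED AS TEXTFILE
LOCATION '{LOCATION}'
""".strip()

# Four example queries: a row count, two aggregations, and a top-N ranking.
QUERIES = {
    "row_count": f"SELECT COUNT(*) FROM {TABLE}",
    "views_by_country": (
        f"SELECT country, SUM(views) AS total_views "
        f"FROM {TABLE} GROUP BY country"
    ),
    "top_pages": (
        f"SELECT page, SUM(views) AS total_views "
        f"FROM {TABLE} GROUP BY page "
        f"ORDER BY total_views DESC LIMIT 10"
    ),
    "daily_average": (
        f"SELECT view_date, AVG(views) AS avg_views "
        f"FROM {TABLE} GROUP BY view_date"
    ),
}


def build_script():
    """Join the table definition and the queries into one HiveQL script."""
    statements = [CREATE_TABLE] + list(QUERIES.values())
    return ";\n\n".join(statements) + ";\n"


if __name__ == "__main__":
    print(build_script())
```

Saved to a file (for example `queries.hql`, a name chosen here for illustration), the script could be run on the cluster with `hive -f queries.hql`. Keeping each query under a descriptive key also gives a natural title for its slide.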
There should probably be 1-2 slides describing the data set (origin, size, purpose, characteristics, visual representation).
There should be 1 slide for each of the four queries.
There should be 1 summary slide.
Reference topics (links not included): Hadoop, Pig, Hive, video tutorials, Amazon Elastic MapReduce