The Data Challenge

Have you ever wanted to assess or explore a new BI, analytics, data visualization, or data science technology, but struggled to find a data set to use for it? Have you grown tired of analyzing the “typical” publicly available datasets (think census, stock market, employment, and inflation rate data)? Are you seeking data relevant to a particular concept, but aren’t sure how to collect it? Do you want to generate world-changing insights on a fresh, rarely analyzed concept?

In today’s data-driven world, obsessed with data science and analytics, obtaining robust data sets for each of these use cases is fortunately not the monumental task it was just a few years ago. From static historical government data to real-time social media streams to private company data released for public consumption, the internet is now your data oyster.

Start Here

Below, I’ve compiled a list of great resources providing free, robust data sets for your consumption, as well as hints for finding other data sets that might be of interest to you. While many of these represent traditional datasets, there are lots of hidden gems in the list.

  • World Bank Open Data – Provides free and open access to global development data
  • DATA.GOV – “The home of the U.S. Government’s open data”
  • Canada Open Data – wondering how many immigration applications Canada receives after each U.S. Presidential election? There’s a dataset for that.
  • The CIA World Factbook – download the whole thing (you’ll need 240MB for the zipped files, 500MB for the unzipped) or individual components
  • State/Local Data Portals – many state and local governments across the U.S. have data portals where you can download datasets related to health, agriculture, public safety, resident requests, building permits, recreation, etc. specific to a given geographic area. To find out whether your area of interest has a data portal, run a simple web search for “<Area> Data Portal”. For example, a search for “Houston Data Portal” returns just that, the Houston Data Portal. I’ve listed some other state and big city data portals below.
  • OpenData by Socrata – provides over 8K data sets covering all topics, from the traditional U.S. annual Report to Congress on White House Staff to unclaimed bank accounts to songs you should hear before you die, and more. Want to contribute? You can upload your own datasets as well.
  • Kaggle – This is the leading platform for data prediction competitions. Datasets are made available by the companies hosting the competition (past companies have included GE, Bosch, and Allstate). In addition, Kaggle provides open datasets, as well as an interface for adding your own datasets to their library.
  • Wikipedia Database Download – “Wikipedia offers free copies of all available content to interested users.” Want to get into text analytics? This is an immensely comprehensive data set that could keep you analyzing for years. You’ll also need an immense amount of storage, however, if you want to take it offline.
  • Public Datasets on AWS – “AWS hosts a variety of public datasets that anyone can access for free.” These datasets are available in Amazon Elastic Block Store (Amazon EBS) snapshots and/or Amazon Simple Storage Service (Amazon S3) buckets. They provide consolidated datasets around concepts like the 1000 Genomes Project, Google Books Ngrams, satellite imagery, and more.  Amazon states, “Previously, large datasets such as the mapping of the Human Genome required hours or days to locate, download, customize, and analyze. Now, anyone can access these datasets via the AWS centralized data repository and analyze them using Amazon EC2 instances or Amazon EMR (Hosted Hadoop) clusters. By hosting this important data where it can be quickly and easily processed with elastic computing resources, AWS hopes to enable more innovation, more quickly.”
  • Common Crawl – An open repository of web crawl data.  Collected over the last 7 years, it contains “raw web page data, extracted metadata and text extractions.”
  • Quandl – Provides millions of financial and economic time-series datasets.  Many datasets are free of charge, but there are others you will have to pay for.
  • KONECT – Provides several hundred network datasets of varying types that cover “social networks, hyperlink networks, authorship networks, physical networks, interaction networks, and communication networks.”
  • SNAP, Stanford Large Network Dataset Collection – Provides datasets covering social networks, online reviews, product links and commonly co-purchased products, and more.
  • Million Song Dataset – “A freely-available collection of audio features and metadata for a million contemporary popular music tracks.” The site also provides complementary datasets around concepts like genre, cover songs, lyrics, etc. that have been contributed by the community.
  • aiHitdata – “aiHitdata is a massive, artificial intelligence/machine learning, automated system that has been trained to build and update company information from the web…aiHitdata doesn’t just extract data, it monitors and understands the changes that occur on company websites and records these as time series transactions.”
  • Best Buy APIs – REST-based interfaces provided by Best Buy to various Best Buy data such as product catalog, buying options, recommendations, locations, etc.
  • Walmart Stores Sales Data – Provides historical sales data for 45 Walmart stores located in different physical regions.
  • Yelp Dataset Challenge – Contains over 3.3M reviews and tips by ~700K users for ~85K businesses, with corresponding business attributes, social network information, check-ins, and member-contributed images. A great dataset for text analytics, especially sentiment analysis.
  • OpenSensors.io – Provides sensor data contributed by community members and organizations. Their goal is to provide data that helps people “understand the world better”.

There are also many lists of publicly available datasets compiled by others that you can use as a springboard to finding the unique data set you are seeking.  I’ve listed several below, but a Google search will produce even more results if you still haven’t found quite what you’re looking for.

Go Anywhere

With such a breadth of data publicly available for free today, the only limits on assessment and exploration in BI, analytics, data visualization, and data science are the time and attention a person dedicates to the actual technologies.

So pick a dataset, or two or three, and start playing. Use the datasets individually or mash them up; they’re free for you to do just that. And while each differs greatly in content, the spirit in which they have been made available is resoundingly consistent: to further knowledge and advancement in data-driven technology, insights, and decision making. With such diversity in data available and diversity in approaches to exploring it, the worlds of BI, analytics, data visualization, and data science only become more exciting and intriguing with each newly available dataset.
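
As a rough illustration of what mashing up two datasets can look like, here is a minimal Python sketch that joins two small CSV tables on a shared column. Everything in it is an assumption made for the example: the column names (zip_code, permits_issued, service_requests), the helper functions read_rows and inner_join, and the numbers themselves, which are invented placeholders rather than values from any real portal. In practice you would read the files you downloaded instead of the inline strings.

```python
import csv
import io

# Hypothetical stand-ins for two downloaded CSV files.
# All values below are invented purely for illustration.
PERMITS_CSV = """zip_code,permits_issued
77002,120
77003,85
77004,60
"""

REQUESTS_CSV = """zip_code,service_requests
77002,340
77004,150
77005,90
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def inner_join(left, right, key):
    """Join two lists of dicts on `key`, keeping only keys present in both."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the two rows into one combined record.
            joined.append({**row, **match})
    return joined


combined = inner_join(read_rows(PERMITS_CSV), read_rows(REQUESTS_CSV), "zip_code")
for row in combined:
    print(row["zip_code"], row["permits_issued"], row["service_requests"])
```

Only the rows whose key appears in both tables survive the join, so it is worth checking how many rows you lose when the two sources cover different areas or time periods. For larger files, a data-frame library would likely be a more comfortable fit than plain dictionaries.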
