How to find data sources for data visualization

So you want to make a chart? Do you have data? I’m assuming that’s a “No,” if you’re reading this article. Well let me tell you how to go about this business then. Before you start, make sure you check your expectations.

It is going to be a very rare occurrence that you find a reputable and large dataset that is perfectly suitable to the visualizations you intend to make. Often you will need to adapt what you were planning to suit what you find. If you absolutely must make certain visualizations you will either need to pay for the data someone else collected, or you will need to embark on collecting the data yourself.

Where does it all come from?

First let’s talk about where data comes from. To explore all our options for accessing data, we need to know what data exists, and the easiest way to find out is to start at the source. Data is collected by various bodies, agents and organisations. Each has a purpose for collecting the data, and that purpose determines whether you can get your hands on it and how.

User generated

Organisations that are collecting data on their customers while their customers use a service such as Facebook or Fitbit should do so for the benefit of the customer while the customer is using the service. It is a closed loop, or should be a closed loop. The customer generates data, and the company uses this data to provide a better service, or the original agreed service, to the customer. Imagine you subscribe to a peanut butter membership where you get new flavours to taste every month. Your data rating the different flavours is used to send you more and more flavours you might like. When this goes wrong customers get upset. If a company other than Peanut Butter Lovers, say Butter Nutters are Doomed, gets some of your data and uses this to send you more and more news articles on the dangers of peanut allergies, customers get angry and feel misused. You obviously don’t have a peanut butter allergy if you have subscribed to a peanut butter membership, but now you are full of doubts! You will not be able to get private data like this in an approved and ethical manner. Don’t go after it.

Open source

Organisations that collect data with a bigger mandate than closed loop customer services could be the right place to start looking. These are often organisations with a mandate to do public good such as the World Bank and the World Health Organization. It could also be a local or national government looking to make better decisions using the data they collect on their constituents or programs. These data are made open source to allow anyone looking for a data project to help make the world a better place. Open-source data is data that can be accessed, used and shared by anyone. If you get hold of the World Health Organization data on peanut butter allergies and you show changes in allergies and their geographic location as a time-series map, that could show some interesting changes. This could lead to a better understanding and help drop the incidence of peanut allergies. That’s why they make it easy for you to get this data. Definitely go after this data.

Market research

Other organisations that make data available are market research companies who are going to try to sell you their data. This is the whole point of their business. They do research so they can sell the data to the companies inside the relevant industry and they make the big bucks doing so. A report from a marketing company on one particular topic, say peanut butter sales, could cost you many thousands of dollars.

Pure luck…

Sometimes you could just get lucky. Running a Google search for your topic and adding the words “data” or “data set” could return the exact results you need. Occasionally this has worked. Sometimes I get the exact data for which I was hoping. More often I get something a little different, but I can find a way to make it work. I have this great idea for my peanut butter lovers: an international competition where we track the peanut butter consumption per country. The country that consumes the most peanut butter each month does not get billed for their membership the next month. Sounds fun right? Well, it turns out I can’t find peanut butter consumption data that is updated monthly. It is only updated annually. Well, that’s not a hard adjustment for me to make. It’s just not as fun as I thought it would be. Maybe I keep looking for something else or I ask my members to submit their peanut butter consumption numbers and I collect my own data.

Data repositories

Another way to find data is with open-access data repositories. You might have found an interesting academic article that was based on original research. Some prominent journals require researchers to store the research data related to the article in an open-access data repository. This might be exactly what you are looking for, but this might not be helpful to you at all. Academic research data is collected for very specific scientific purposes and the academics involved will make sure they squeeze everything out of their data. You won’t find anything new that they have not already published. But if you are looking for large datasets to practise with, say a patient group studied for 5 years from birth to see if they develop peanut butter allergies, this could be a great place to look. I would start with PLOS Journals’ article on acceptable open-access data repositories (https://journals.plos.org/plosone/s/recommended-repositories).

Just ask!

Have you thought about just asking for data? There are acts in most countries (e.g. the USA has HIPAA for health data, all countries in the EU have GDPR, and even South Africa has POPIA) that govern access to information. Mostly we think these are only for protecting your private data from misuse. What is fantastic about these acts is that they also govern the legitimate access to data for purposes that will bring benefit. This means that there are channels you can use to apply for access to data in order to carry out analysis (or build visualizations) to do some good. Don’t try to access this information to make a visualization that could make money for someone; you have to intend to find out something that benefits people in general. Depending on the country in which the company holding the data is registered, you will have a different process to follow according to their data protection and access act. Read up on HIPAA (https://www.hhs.gov/hipaa/index.html) and GDPR (https://gdpr.eu/). I might approach a retailer of peanut butter to request data on peanut butter sales per store to see if sales increase in poorer neighbourhoods because it is a cheap source of protein.

Data scraping

Data scraping is growing in popularity. This is a method of getting data from various websites for a central comparison. Do you know of a comparison website for multiple companies’ offerings? This probably uses scraped data. To scrape data, you need a specially written piece of code that crawls the internet and brings back the data you are looking for. It does this at any time interval you determine. So we could set up a web scraper for my peanut butter membership business. The scraper needs to look at twelve different retail websites and bring back the data on the price of each type of peanut butter. The output of my scraper can be data added to a spreadsheet. Now I can make an everviz dashboard showing the prices as they rise and fall. My members can get a good idea of where to buy their peanut butter and whether prices are rising or falling as a trend. A second win could also be finding a new product they didn’t know of before.
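As a rough sketch of the scraping step described above, the Python snippet below pulls product names and prices out of a page and can append them to a CSV file that any spreadsheet will open. Everything specific here is an assumption made for illustration: the HTML layout (spans with the classes "name" and "price"), the product names and the prices are all invented, and a real retailer's page will need its own parsing rules. It uses only the standard library and parses a hard-coded sample page instead of fetching a live site. Before scraping a real site, check its terms of use.

```python
import csv
from datetime import date
from html.parser import HTMLParser

# Hypothetical page snippet; real retail pages will be structured differently.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Crunchy 400g</span><span class="price">3.49</span></li>
  <li class="product"><span class="name">Smooth 400g</span><span class="price">3.29</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects (name, price) pairs from the assumed 'name' and 'price' spans."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (name, price) pairs
        self._field = None  # which span we are currently inside, if any
        self._name = None   # last product name seen

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field == "name":
            self._name = data.strip()
        elif self._field == "price":
            self.rows.append((self._name, float(data)))

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_prices(html):
    """Return a list of (product name, price) tuples found in the HTML text."""
    parser = PriceParser()
    parser.feed(html)
    return parser.rows


def append_to_csv(path, retailer, rows):
    """Append one dated line per product so prices can be tracked over time."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for name, price in rows:
            writer.writerow([date.today().isoformat(), retailer, name, price])


prices = scrape_prices(SAMPLE_HTML)
print(prices)
```

Calling append_to_csv("prices.csv", "Example Retailer", prices) on a schedule would build up the price history; both the file name and the retailer name are placeholders.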

Search

Have you tried just searching for data? Some people find all their data sets this way! It’s super simple. Open up your favourite internet browser and type in the topic of data you are looking for. It is recommended that you then add one of the two words “data” or “dataset” after your topic, and get searching. Now you might not find the best data source but you will definitely find some data. The data might not be from a reputable organisation but that shouldn’t stop you from playing around with it. Just don’t think you’re going to change the world with your findings if we can’t trust where the data came from. If I find a data set from a search and I put together a dashboard showing strokes linked to peanut butter consumption I should not call up WHO to let them know. There are many questions about the data set that need to be answered and if the data does not come from a reputable source we really can’t trust what we see. It’s worth using these data sets anyway for the practice. I can’t advise on what file types you will find, so just stick with it and get yourself some .xlsx files or some .csv files so you can bring them into everviz. If you want some serious data-finding tips, check out this Field Guide: https://datajournalism.com/read/handbook/one/getting-data/a-five-minute-field-guide
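Before you build anything on a data set you found through a search, a quick look at its shape can answer a few of those questions. The snippet below is a minimal sketch using Python's standard csv module. The column names and values are invented for illustration; in practice you would open your downloaded .csv file instead of the sample text.

```python
import csv
import io

# Invented sample standing in for a downloaded .csv file.
SAMPLE_CSV = """country,year,consumption_kg
Canada,2020,3.1
Netherlands,2020,
Canada,2021,3.0
"""


def summarise(csv_file):
    """Return the row count and, per column, how many cells are empty."""
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in reader.fieldnames}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in reader.fieldnames:
            # Short rows give None for absent cells, so treat None as empty too.
            if not (row[name] or "").strip():
                missing[name] += 1
    return row_count, missing


row_count, missing = summarise(io.StringIO(SAMPLE_CSV))
print(row_count, missing)
```

For a real file, pass an open file object instead, for example one opened with open("my_download.csv", newline="", encoding="utf-8"), where the file name is a placeholder.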

Making up data

And finally, have you thought about just making up data? I’m not talking about falsifying data or making up garbage, but I am talking about synthetic data. Synthetic data is created by following a series of sensible rules, so that the result is in line with what the data would be in real life if you had to collect it all. You must know what rules govern this data to make a legitimate data source. A person must be at least 16 or 18 years old, depending on the country, to hold a driver’s licence. Only women can have a pregnancy. Logical things like this are easy but you need to really map out everything before you go about making synthetic data. A group at MIT have made an incredible tool, a synthetic data generator, which can take into account all of your rules and logic (https://sdv.dev/). You can find out more about them on their website. If you’re not keen on using their tool you are going to have a very difficult time building up a synthetic data source all on your own, line by line, but it can be done.
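To give a feel for what rule-based generation looks like when done by hand, here is a small Python sketch. It is not how the SDV tool works; it only uses the standard random module, and the fields, value ranges, probabilities and the minimum driving age of 18 are all assumptions chosen for illustration. The two rules from the paragraph above are built into the generator so that no row can break them.

```python
import random

MIN_DRIVING_AGE = 18  # assumed for this sketch; the legal age varies by country


def make_person(rng):
    """Generate one synthetic record that obeys the rules by construction."""
    age = rng.randint(0, 90)
    sex = rng.choice(["female", "male"])
    # Rule 1: only people at or above the minimum age may hold a licence.
    has_licence = age >= MIN_DRIVING_AGE and rng.random() < 0.7
    # Rule 2: only women can be pregnant (the 15 to 49 age range is also an assumption).
    is_pregnant = sex == "female" and 15 <= age <= 49 and rng.random() < 0.05
    return {
        "age": age,
        "sex": sex,
        "has_licence": has_licence,
        "is_pregnant": is_pregnant,
    }


def make_dataset(n, seed=0):
    """Build n synthetic records; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    return [make_person(rng) for _ in range(n)]


people = make_dataset(1000)
print(people[:3])
```

Even in this toy version, every extra field adds rules that interact with the existing ones, which is why mapping everything out first matters.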

The short and sweet source list

I don’t want this article to be all about a list with 100 data sources that feels overwhelming and doesn’t even offer that much value. No one ever opens that many resources so I’m keeping it short and sweet. Here is a list of five data sources. These sources are mostly collective sources that scan other huge data repositories for your topic. Have a peek.

1. Google Public Data Explorer

https://www.google.com/publicdata/directory

This is the biggest time-saving search for datasets. This search will cover the most important datasets coming from World Bank Open Data, WHO Open Data Repository, UN Data, UNICEF Data Warehouse, World Economic Forum, European Union Open Data Portal, Centers for Disease Control and Prevention, and Data.gov, amongst others. The topics you can expect to find here include population data, labour markets, education, trade, and public health. There is a wealth of data here all central to doing good and improving the world. Definitely start here if you want to do something worthwhile.

To access these datasets you can search for a topic in the main search bar. Now you have two options. You can either view the data online within Google’s platform, or you can click on the data source name to be redirected to the original site hosting the data source and download it from there. The download formats are dictated by the hosting website but in general the data sources are a spreadsheet for Microsoft Excel (.xlsx) or comma-separated values (.csv). You may need to go on a bit of a chase to get to the download. It is common for the hosting website to take you to pdf reports which you can download. Once you have these reports, search them for charts which use the data that you want. Then follow the source of the chart back to the source website to download the data that was used to build the chart in the pdf. Once you have a download of either type of file, you can easily upload this into everviz for visualizing.

2. Amazon Registry of Open Data

https://registry.opendata.aws/

This is a repository of large datasets relating to biology, chemistry, economics, and physiology, including the Human Genome Project. These are quite focussed on basic sciences rather than the human sciences of the first data source. Similarly, this is a search engine searching through data sets, so you won’t find the actual data set in the results. You will need to identify the data set you want from the search results and then visit the website hosting the data source in order to download the data. At the time of writing there are 219 data sets including web crawl data, satellite imagery, and dictionaries for natural language processing. The download formats are the same as mentioned above and again, you might need to go on quite a journey to get to the data source to download it.

3. Kaggle

https://www.kaggle.com/datasets

Kaggle is full of amazing data sets. It is a crowdsourced platform offering training and data sets to sharpen your data science skills. These data sets are incredibly varied and cover topics such as business, public health, and a range of more obscure yet somehow interesting topics (e.g. characteristics of the survivors and those who died on the Titanic). A big part of Kaggle is the competitions. These are fun but the prizes are cash so definitely take them seriously. This platform is fantastic about inclusion and support so try to get to know the community a little better.

You must register with Kaggle to access their data sets. Don’t worry; it’s free. Downloads are usually zipped folders containing comma-separated values (.csv) files which are good for everviz. Nice!
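If you would rather not unzip a download by hand, a few lines of Python can read the .csv files straight out of the archive. This is a minimal sketch using the standard zipfile and csv modules. The file names are placeholders, the files are assumed to be UTF-8 encoded, and the demo builds a tiny in-memory archive so that it runs without a real download.

```python
import csv
import io
import zipfile


def read_csvs(zip_source):
    """Return {file name: list of rows} for every .csv file inside a zip archive.

    zip_source can be a path such as "archive.zip" or a file-like object.
    """
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                with archive.open(name) as raw:
                    # Assumes UTF-8 text; other encodings need a different value here.
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    tables[name] = list(csv.reader(text))
    return tables


# Demo with a tiny in-memory archive standing in for a real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("passengers.csv", "name,survived\nA,1\nB,0\n")
    demo.writestr("notes.txt", "not a csv")
buffer.seek(0)

tables = read_csvs(buffer)
print(tables)
```

With a real download you would call read_csvs with the path to the zip file you saved.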

4. Reddit

https://www.reddit.com/r/datasets/

Reddit is a site for posts and the conversations that follow in the comments. In the Datasets subreddit, people attach or link datasets to their posts. You can find some incredibly random data sets here. They are not 100% fact-checkable, because Reddit is not a reliable body for supplying data, but go for it to have fun. The most common data sets cover current topics such as the latest events and global happenings. There are also a lot of posts looking for very particular data sets, and often someone will find something relevant and put it in the comments, so definitely check the comments for data sources too.

Here you will most often get a link to a page that is hosting the data. Be careful about what you are opening and check the security of the page: read the URL in the post and make sure it is not a spoofed address that imitates one you already trust, check that your browser shows a secure connection, and make sure you don’t get any warnings about expired SSL certificates.
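Part of that check can be expressed in code. The sketch below uses Python's standard urllib.parse to confirm that a link uses HTTPS and that its host is on a list of sites you already trust. The trusted list is purely an example. This does not replace your browser's certificate warnings; it only helps catch lookalike addresses.

```python
from urllib.parse import urlparse

# Example list only; fill in the hosts you actually trust.
TRUSTED_HOSTS = {"kaggle.com", "data.gov", "registry.opendata.aws"}


def looks_safe(url, trusted_hosts=TRUSTED_HOSTS):
    """True if the URL uses HTTPS and its host is a trusted host or a subdomain of one."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    on_trusted_host = any(
        host == trusted or host.endswith("." + trusted) for trusted in trusted_hosts
    )
    return parts.scheme == "https" and on_trusted_host


print(looks_safe("https://www.kaggle.com/datasets"))
print(looks_safe("https://kaggle.com.example.net/datasets"))
```

The second address starts with a trusted name but actually belongs to a different domain, which is the kind of lookalike the check is meant to reject.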

5. Google Finance and Google Trends

https://www.google.com/finance & https://trends.google.com

There isn’t too much to say about these sites, because they are simple and we all know Google already. These sites are fantastic for up to date data.

Google Finance brings you market and share trading data which refreshes at an amazing pace for an open data source. It is informative just to be on the landing page! The site is straightforward and uncluttered. To get to the financial data you first need to add the shares you are interested in to a “portfolio” in your account. To get your data just log in, head to your portfolios, and download the data from there.

Google Trends is similarly clean and basic. You can look up any search term you can dream of and Google Trends will bring you the data on it. This data is already summarised in charts for you to manipulate. As soon as you are happy with the data, there is a nifty little download button that will give you a comma-separated values (.csv) file. How fantastic!

We are everviz. We believe that telling stories with beautiful interactive visualizations makes the message easier to understand and more engaging. Our mission is to make it easy for anyone to create and publish stunning visualizations to tell compelling stories.