24 Public Sources of Data Sets for Machine Learning

Data is the new oil. Are you ready to drink an insane amount of milkshake? These resources are free and will get you started with the machine learning project you've been thinking about. Maybe YOU'll learn something as well.

 

1. Data.gov

Hoarding Since: 2013
Genres: Agriculture, Climate, Consumer, Ecosystems, Education, Energy, Finance, Health, Local Government, Manufacturing, Maritime, Ocean, Public Safety, Science & Research
Inventory: As of June 2017, the approximately 200,000 datasets listed on Data.gov represent about 10 million data resources.

Screenshot of Data.gov website. Click to follow link to https://www.data.gov/.

This is the home of the U.S. Government’s open data. The site contained more than 190,000 datasets at the time of publishing. They cover climate, education, energy, finance and many other areas.
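
For readers who prefer searching the catalog from code, here is a minimal sketch. It assumes the Data.gov catalog exposes the standard CKAN action API at the URL below; that endpoint, the helper names, and the sample payload are illustrative assumptions, not something this article confirms.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Data.gov catalog serves the standard CKAN action API here.
CKAN_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Build a CKAN package_search URL for a free-text query."""
    return CKAN_SEARCH_URL + "?" + urlencode({"q": query, "rows": rows})


def summarize_results(payload):
    """Turn a CKAN package_search response into (title, formats) pairs."""
    datasets = payload.get("result", {}).get("results", [])
    summary = []
    for dataset in datasets:
        # Collect the distinct file formats offered for this dataset.
        formats = sorted(
            {r["format"] for r in dataset.get("resources", []) if r.get("format")}
        )
        summary.append((dataset.get("title", ""), formats))
    return summary


def search_datasets(query, rows=5):
    """Fetch and summarize matching datasets (needs network access)."""
    with urlopen(build_search_url(query, rows), timeout=30) as response:
        return summarize_results(json.load(response))


# Hand-written payload in the CKAN response shape, so this runs offline.
sample_payload = {
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {
                "title": "Example climate dataset",
                "resources": [{"format": "CSV"}, {"format": "JSON"}],
            }
        ],
    },
}
print(build_search_url("climate"))
print(summarize_results(sample_payload))
```

Calling `search_datasets("climate")` would run the same summary against the live catalog, provided the endpoint assumption holds.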


2. DataSF

Hoarding Since: 2009
Genres: Economy & Community, City Management & Ethics, Transportation, Public Safety, Health & Social Services, Geographic Locations & Boundaries, Energy & Environment, Housing & Buildings, City Infrastructure, Culture and Recreation
Inventory: 400+ data sets

Screenshot of DataSF website. Click to follow link to https://datasf.org/.

Their mission is to empower use of data. At DataSF, they seek to transform the way the City works through the use of data. They believe use of data and evidence can improve their operations and the services they provide. This ultimately leads to increased quality of life and work for San Francisco residents, employers, employees and visitors.


3. US Census Bureau 

Hoarding Since: 1902
Genres: Population, Economy, Business, Education, Emergency Preparedness, Employment, Families & Living Arrangements, Health, Housing, Income & Poverty, International Trade, Public Sector
Inventory: 1000+ data sets

Screenshot of US Census Bureau website. Click to follow link to https://www.census.gov/.

The Census Bureau publishes a wealth of information on the lives of US citizens, covering population data, geographic data and education.


4. Socrata 

Hoarding Since: 2007
Genres: Financial Insights, Open Data & Citizen Engagement, Performance Improvement & Accountability, Federal Government
Inventory: Hundreds of portals

Screenshot of Socrata website. Click to follow link to https://socrata.com/.

Socrata is a mission-driven software company that helps governments and public sector institutions use data more strategically in the design and delivery of their programs and missions.


5. European Union Open Data Portal

Hoarding Since: 2012
Genres: Social questions, Science, Environment, Employment and working conditions, Economics, Finance, Trade, Production, technology and research, Industry
Inventory: 12238 datasets available

Screenshot of European Union Open Data Portal website. Click to follow link to http://data.europa.eu/.

This portal provides access to freely available EU data, which can be reused in databases, reports or projects. The data comes in a variety of digital formats from the EU institutions and other EU bodies.


6. Data.gov.uk

Hoarding Since: 2009
Genres: Business and economy, Environment, Mapping, Crime and justice, Government, Society, Defence, Government spending, Towns and cities, Education, Health, Transport
Inventory: As of February 2015 it contained over 19,343 datasets, rising to over 40,000 in 2017

Screenshot of Data.gov.uk website. Click to follow link to https://data.gov.uk/.

Data from the UK Government, including the British National Bibliography – metadata on all UK books and publications since 1950.


7. Canada Open Data

Hoarding Since: 2011
Genres: Jobs, Immigration, Travel, Business, Benefits, Health, Taxes
Inventory: As of October 2017, there are 2361 datasets available to the public, including datasets connected to GeoDiscover Alberta.

Screenshot of Canada Open Data website. Click to follow link to https://open.canada.ca/en.

Canada Open Data is a pilot project with many government and geospatial datasets.


8. Open Government Data

Hoarding Since: 2009
Genres: Legal, social and technical aspects of open data.
Inventory: 548 Portals

Screenshot of Open Government Data website. Click to follow link to https://opengovernmentdata.org/.

It offers open government data from the US, the EU, Canada, CKAN, and more.


9. The CIA World Factbook 

Hoarding Since: 1947
Genres: History, population, economy, government, infrastructure and military information
Inventory: 100+

Screenshot of The CIA World Factbook website. Click to follow link to https://www.cia.gov/library/publications/the-world-factbook/.

The World Factbook provides information on the history, people, government, economy, energy, geography, communications, transportation, military, and transnational issues for 267 world entities. Their Reference tab includes: maps of the major world regions, as well as Flags of the World, a Physical Map of the World, a Political Map of the World, a World Oceans map, and a Standard Time Zones of the World map.


10. Healthdata.gov

Hoarding Since: 2011
Genres: Community, Health, Quality, Medicare, Hospital, Inpatient, National, State
Inventory: 2723 datasets are available

Screenshot of Healthdata.gov website. Click to follow link to https://www.healthdata.gov/.

125 years of US healthcare data including claim-level Medicare data, epidemiology and population statistics.


11. NHS Digital

Hoarding Since: 2015
Genres: Health and social care
Inventory: 2745 datasets are available

Screenshot of NHS Digital website. Click to follow link to https://digital.nhs.uk/.

NHS Digital (previously HSCIC) provides national information, data and IT systems for health and care services. They exist to help patients, clinicians, commissioners, analysts and researchers. Their goal is to improve health and social care in England by making better use of technology, data and information.


12. UNICEF

Hoarding Since: 1946
Genres: Poverty and violence
Inventory: Hundreds of reports

Screenshot of UNICEF website. Click to follow link to https://www.unicef.org/reports.

UNICEF offers statistics on the situation of women and children worldwide. UNICEF works in 190 countries and territories to protect the rights of every child. UNICEF has spent 70 years working to improve the lives of children and their families.


13. World Health Organization

Hoarding Since: 1948
Genres: World hunger, health, and disease statistics.
Inventory: 100+

Screenshot of World Health Organization website. Click to follow link to http://www.who.int/en/.

WHO’s priority in the area of health systems is moving towards universal health coverage. WHO works together with policy-makers, global health partners, civil society, academia and the private sector to support countries to develop, implement and monitor solid national health plans. In addition, WHO supports countries to assure the availability of equitable integrated people-centred health services at an affordable price; facilitate access to affordable, safe and effective health technologies; and to strengthen health information systems and evidence-based policy-making.


14. Amazon Web Services public datasets 

Hoarding Since: 2018
Genres: Enron email dataset, Google Books n-grams, NASA NEX datasets, Million Songs dataset and many more
Inventory: 54 matching datasets

Screenshot of Amazon Web Services public datasets website. Click to follow link to https://registry.opendata.aws/.

Amazon provides a few big datasets, which can be used on their platform or on local computers. One can also analyze the data in the cloud using EC2 and Hadoop via EMR. Popular datasets on Amazon include the full Enron email dataset, Google Books n-grams, NASA NEX datasets, the Million Songs dataset and many more.


15. Data Market

Hoarding Since: 2008
Genres: Economics, healthcare, food and agriculture, and the automotive industry.
Inventory: 100 million time series from the most important data providers, such as the UN, World Bank and Eurostat

Screenshot of Data Market website. Click to follow link to https://datamarket.com/.

Data Market is a place to check out data related to economics, healthcare, food and agriculture, and the automotive industry.


16. World Bank 

Hoarding Since: 1945
Genres: World development indices, education indices
Inventory: 18,708 Time Series type data, 2,370 Microdata type data, 356 Geospatial type data

Screenshot of World Bank website. Click to follow link to https://datacatalog.worldbank.org/.

This is the open data from the World Bank. The platform provides several tools, such as the Open Data Catalog, world development indices and education indices.


17. Google datasets

Hoarding Since: 2016
Genres: USA Names, GitHub Activity, Historical Weather, all stories & comments from Hacker News, etc.
Inventory: 100+

Screenshot of Google datasets website. Click to follow link to https://cloud.google.com/bigquery/public-data/.

Much like Amazon, Google also has a cloud hosting service, called Google Cloud Platform. With GCP, one can use a tool called BigQuery to explore large data sets. Google lists all of the data sets on a page. This includes baby names, data from GitHub public repositories, all stories & comments from Hacker News etc.
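
Exploring a public data set with BigQuery might look like the sketch below. It assumes the baby names data lives in a public table called `bigquery-public-data.usa_names.usa_1910_current` with `name` and `number` columns, and that the third-party `google-cloud-bigquery` package and GCP credentials are set up. The helper names are hypothetical.

```python
# Assumption: this public table (U.S. baby names) and its `name` and
# `number` columns exist as written; swap in the data set you want to explore.
USA_NAMES_TABLE = "bigquery-public-data.usa_names.usa_1910_current"


def top_names_sql(limit=10):
    """Build a standard-SQL query for the most common names in the table."""
    return (
        "SELECT name, SUM(number) AS total\n"
        f"FROM `{USA_NAMES_TABLE}`\n"
        "GROUP BY name\n"
        "ORDER BY total DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_top_names(limit=10):
    """Run the query; needs google-cloud-bigquery and GCP credentials."""
    from google.cloud import bigquery  # third-party, imported lazily

    client = bigquery.Client()
    rows = client.query(top_names_sql(limit)).result()
    return [(row["name"], row["total"]) for row in rows]


# Printing the SQL works without any credentials.
print(top_names_sql(3))
```

Keeping the SQL in its own function makes it easy to paste the same query into the BigQuery web console before running it from code.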


18. Socrata OpenData

Hoarding Since: 2007
Genres: US government-related data
Inventory: Hundreds of portals

Screenshot of Socrata OpenData website. Click to follow link to https://opendata.socrata.com/.

Socrata OpenData is a portal that contains multiple clean data sets that can be explored in the browser or downloaded to visualize. A significant portion of the data is from US government sources, and many of the data sets are outdated.


19. Quandl

Hoarding Since: 2011
Genres: Economic and financial data
Inventory: 100+

Screenshot of Quandl website. Click to follow link to https://www.quandl.com/.

Quandl is a repository of economic and financial data. Some of this information is free, but many data sets require purchase. Quandl is useful for building models to predict economic indicators or stock prices. Due to the large number of available data sets, it's possible to build a complex model that uses many data sets to predict values in another.
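
The idea of using one data set to predict another can be sketched in a few lines. This is a deliberately simplified illustration with a single predictor: the series names and numbers are made up, nothing here calls Quandl itself, and a real model would use more features and proper validation.

```python
def align_series(series_by_name):
    """Keep only dates present in every series; return (dates, columns)."""
    common = set.intersection(*(set(s) for s in series_by_name.values()))
    dates = sorted(common)
    columns = {name: [s[d] for d in dates] for name, s in series_by_name.items()}
    return dates, columns


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy, made-up numbers purely for illustration (not real Quandl data).
series = {
    "indicator": {"2017-01": 1.0, "2017-02": 2.0, "2017-03": 3.0, "2017-04": 4.0},
    "target": {"2017-02": 5.0, "2017-03": 7.0, "2017-04": 9.0, "2017-05": 11.0},
}

# Time series rarely cover the same dates, so align them before fitting.
dates, cols = align_series(series)
slope, intercept = fit_line(cols["indicator"], cols["target"])
print(dates)
print(slope, intercept)
```

Aligning on shared dates first matters because series from different providers usually start, end, and skip on different days.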


20. Wikipedia

Hoarding Since: 2001
Genres: General reference, Culture and the arts, Geography and places, Health and fitness, History and events, Mathematics and logic, Natural and physical sciences, People and self, Philosophy and thinking, Religion and belief systems, Society and social sciences, Technology and applied sciences, etc.
Inventory: More than 5,620,000 English articles

Screenshot of Wikipedia website. Click to follow link to https://en.wikipedia.org/.

Wikipedia is a free, online, community-edited encyclopedia. It contains an astonishing breadth of knowledge, with pages on everything from the Ottoman-Habsburg Wars to Leonard Nimoy. As part of Wikipedia's commitment to advancing knowledge, they offer all of their content for free, and regularly generate dumps of all the articles on the site. Additionally, Wikipedia offers edit history and activity, so you can track how a page on a topic evolves over time, and who contributes to it.
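
Article dumps are large XML files, so it helps to stream them rather than load them whole. The sketch below assumes the dump follows the general MediaWiki export layout (`<page>` elements containing a `<title>` and a revision `<text>`); the sample XML is hand-written for illustration and is not taken from a real dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_file):
    """Yield (title, text) per <page>, streaming to keep memory use low."""
    title, text = None, None
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            # With several revisions per page, the last one seen wins.
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # free the finished page's subtree


# Tiny hand-written sample in the general shape of a MediaWiki export.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Leonard Nimoy</title>
    <revision><text>Leonard Nimoy was an American actor.</text></revision>
  </page>
  <page>
    <title>Ottoman-Habsburg wars</title>
    <revision><text>A series of conflicts.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, page_text in pages:
    print(page_title, "->", page_text)
```

For a real dump, pass an open file object (for example one returned by `bz2.open` for a compressed download) instead of the in-memory sample.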


21. FiveThirtyEight

Hoarding Since: 2008
Genres: Politics, Sports, Science & Health, Economics, Culture
Inventory: 100+

Screenshot of FiveThirtyEight website. Click to follow link to http://fivethirtyeight.com/.

FiveThirtyEight, sometimes referred to as 538, is an incredibly popular interactive news and sports site started by Nate Silver. It focuses on opinion poll analysis, politics, economics, and sports blogging. The website takes its name from the number of electors in the United States electoral college.


22. Junar

Hoarding Since: 2006
Genres: Economy
Inventory: Hundreds of portals

Screenshot of Junar website. Click to follow link to http://www.junar.com/.

Junar offers a cloud-based open data platform that lets businesses, governments, and organizations free their data to drive new opportunities, collaboration, and transparency. Junar describes its platform as the easiest-to-use way to power the Data Economy and, for innovative organizations, the fastest way to publish data. Some of the world's leading companies trust Junar with their most valuable assets: their data and the end users who view and use it.


23. National Climatic Data Center 

Hoarding Since: 1934
Genres: Environmental, meteorological and climate data sets
Inventory: 1000+

Screenshot of National Climatic Data Center website. Click to follow link to https://www.ncdc.noaa.gov/data-access/.

NOAA's National Climatic Data Center (NCDC) is responsible for preserving, monitoring, assessing, and providing public access to the Nation's treasure of climate and historical weather data and information. It holds a huge collection of environmental, meteorological and climate data sets and is the world’s largest archive of weather data.


24. Pew Research Center

Hoarding Since: 2004
Genres: U.S. politics and policy; journalism and media; internet, science and technology; religion and public life; Hispanic trends; global attitudes and trends; and U.S. social and demographic trends.
Inventory: 1000+

Screenshot of Pew Research Center website. Click to follow link to http://www.pewinternet.org/datasets/.

The Pew Research Center’s Internet Project is pleased to offer scholars access to raw data sets from their research. All uses of this data should reference the Pew Research Center as the source of the data and acknowledge that the Pew Research Center bears no responsibility for interpretations presented or conclusions reached based on analysis of the data.