Thursday, November 10, 2022

Open Datasets for Machine Learning and Deep Learning Projects

 

There are many important websites that direct you to open datasets. Below, I summarize the ones that I have used so far. I will keep this list updated as I find new useful links.

Please let me know if you have more open dataset resources.

Last Updated on 20 Feb. 2022.


Photo by Radu Marcusu on Unsplash



1. You can, of course, search for the dataset you need with Google Dataset Search:
https://datasetsearch.research.google.com/

2. Another must-check resource is Kaggle. Here you will find not only datasets but also shared solutions for the tasks. If you are a beginner, I strongly suggest starting with Kaggle.

https://www.kaggle.com/datasets
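
If you prefer to script your downloads, here is a minimal sketch that turns a Kaggle dataset page URL into its `owner/dataset` slug and builds the matching call to the `kaggle` command-line tool. It assumes the `kaggle` package is installed and an API token is configured. The helper names (`dataset_slug`, `download_command`, `download`) and the `data` folder are my own choices for illustration, not part of Kaggle itself.

```python
import subprocess
from urllib.parse import urlparse


def dataset_slug(url: str) -> str:
    """Extract 'owner/dataset' from a Kaggle dataset page URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /datasets/<owner>/<dataset>
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def download_command(slug: str, dest: str = "data") -> list[str]:
    """Build the kaggle CLI call that downloads and unzips a dataset."""
    return ["kaggle", "datasets", "download", "-d", slug, "-p", dest, "--unzip"]


def download(url: str, dest: str = "data") -> None:
    """Run the download (hypothetical helper; needs the kaggle CLI and a token)."""
    subprocess.run(download_command(dataset_slug(url), dest), check=True)
```

Calling `download(...)` with the URL of a dataset page should place the unzipped files under `data/`. Keeping the command builder separate makes it easy to print and check the command before running it.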

3. Another searchable dataset resource is the Papers with Code website, which currently lists more than 3000 datasets.

https://paperswithcode.com/datasets

4. The Registry of Open Data on AWS helps people discover and share datasets that are available via AWS resources.

https://registry.opendata.aws/

5. For multi-label classification and multi-target regression datasets, you can try the Mulan project website.

http://mulan.sourceforge.net/datasets.html

6. OpenML hosts a large, searchable collection of datasets.

https://www.openml.org/search?type=data
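
One convenient way to pull an OpenML dataset into Python is scikit-learn's `fetch_openml` function. The sketch below assumes scikit-learn and pandas are installed and that you have a network connection. The helper names `load_openml` and `class_counts` are my own, and `"iris"` is used in the usage note only as a familiar example.

```python
from collections import Counter


def load_openml(name: str, version: int = 1):
    """Fetch a dataset from OpenML by name and return (features, target)."""
    # Imported lazily so this file still imports without scikit-learn.
    from sklearn.datasets import fetch_openml

    bunch = fetch_openml(name=name, version=version, as_frame=True)
    return bunch.data, bunch.target


def class_counts(target) -> dict:
    """Count how often each label occurs, a quick check for class imbalance."""
    return dict(Counter(target))
```

For example, `X, y = load_openml("iris")` followed by `class_counts(y)` gives a quick look at the label distribution before you start modeling.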

7. For neural machine translation (NMT) datasets, you can visit the Neural Machine Translation project page of the Stanford NLP group.

https://nlp.stanford.edu/projects/nmt/

8. If you are looking for text classification datasets, here are 10 of them:

https://analyticsindiamag.com/10-open-source-datasets-for-text-classification/

9. If you are looking for datasets in Turkish or about Turkey, you can check the DataTurk web page.

https://dataturkey.com.tr/

10. Here is another blog post that lists sources for datasets.

https://www.mygreatlearning.com/blog/sources-for-analytics-and-machine-learning-datasets/

11. Here is the GitHub page for Turkish-language NMT datasets.

https://github.com/deeplearningturkiye/turkce-yapay-zeka-kaynaklari#ver%C4%B0setler%C4%B0

12. If you need visual datasets, the visualdata.io website is one of the best places to look.

https://www.visualdata.io/discovery

13. If you are using TensorFlow, you can download many datasets from TensorFlow Datasets, a collection of ready-to-use datasets.

https://www.tensorflow.org/datasets
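
As a rough sketch of what "ready-to-use" means in practice, the snippet below loads MNIST through `tfds.load` and cuts its `train` split into training and validation parts with the TFDS slicing syntax. It assumes the `tensorflow` and `tensorflow-datasets` packages are installed. The helper names `split_specs` and `load_mnist` and the default 80/20 split are my own choices for illustration.

```python
def split_specs(train_percent: int) -> tuple[str, str]:
    """Return TFDS slicing strings that cut 'train' into two parts."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 1 and 99")
    return f"train[:{train_percent}%]", f"train[{train_percent}%:]"


def load_mnist(train_percent: int = 80):
    """Load MNIST as a [train, validation] list of tf.data.Dataset objects."""
    # Imported lazily so this file still imports without TensorFlow.
    import tensorflow_datasets as tfds

    train_spec, val_spec = split_specs(train_percent)
    # as_supervised=True yields (image, label) pairs instead of dictionaries.
    return tfds.load("mnist", split=[train_spec, val_spec], as_supervised=True)
```

With `train_ds, val_ds = load_mnist()`, the first call downloads and caches the data, and later calls read it from the local cache.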

14. Visual Genome is a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to text. The Visual Genome dataset is among the first to provide detailed labeling of object interactions and attributes, grounding visual concepts in language.

Visual Genome

15. If you are interested in biomedical image datasets, you can use the Open-i service of the National Library of Medicine, which enables search and retrieval of abstracts and images (including charts, graphs, clinical images, etc.) from the open-source literature and from biomedical image collections.

https://openi.nlm.nih.gov/

16. Academic Torrents was founded to address the needs of science in the era of big data. It is a scalable platform that uses BitTorrent to distribute the cost of hosting data, in order to prevent the rise and fall of dataset hosting providers and the erasure of the data they host. Researchers are empowered to mirror the data they are working with and to share large datasets without the large costs typically associated with commercial providers.

https://academictorrents.com/

17. DrivenData works on projects at the intersection of data science and social impact, in areas like international development, health, education, research and conservation, and public services. They want to give more organizations access to the capabilities of data science and engage more data scientists with social challenges where their skills can make a difference. They’ve worked with more than 55 organizations across 100+ projects, many of these made possible by the amazing efforts of the DrivenData community. You can participate in the competitions and use the provided datasets.

https://www.drivendata.org/competitions/

18. This is a list of almost all available solutions and ideas shared by top performers in past Kaggle competitions. The list is updated as soon as a new competition finishes, so you can access many datasets along with some of their best use cases.

List of Kaggle Datasets, Solutions, and Ideas

19. The İTÜ Natural Language Processing (nlp@İTÜ) research team provides Language Resources and Tools for Turkic Languages on their official website.

http://ddi.itu.edu.tr/en/toolsandresources

20. The UC Irvine Machine Learning Repository currently maintains 596 datasets as a service to the machine learning community.

https://archive-beta.ics.uci.edu/

21. to be continued…

If you would like to learn about Deep Learning with practical coding examples, please subscribe to my Murat Karakaya Akademi YouTube Channel or follow my blog on Blogger.

If you would like to add any new resources please comment below.

Thank you for reading.

https://www.youtube.com/channel/UCrCxCxTFL2ytaDrDYrN4_eA

You can follow me on these social networks: YouTube, Facebook, Instagram, LinkedIn, GitHub, Kaggle, Blogger