5. Common Datasets

Author: 愿风去了 | Published 2018-11-06 15:33

The Iris Dataset (R. Fisher / Scikit-Learn)

One of the most frequently used ML datasets is the Iris flower dataset. We will load it with the datasets module from scikit-learn. You can read more about it here: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris
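As a minimal sketch of that import, assuming scikit-learn is installed, the dataset can be loaded and inspected like this (the variable names are only illustrative):

```python
from sklearn.datasets import load_iris

# load_iris() reads the copy of the dataset that ships with scikit-learn.
iris = load_iris()

features = iris.data    # one row per flower, one column per measurement
labels = iris.target    # integer class index for each flower

print(features.shape)        # number of samples and number of features
print(iris.feature_names)    # names of the measured quantities
print(iris.target_names)     # names of the species
```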

Low Birth Weight Dataset (Hosted on GitHub)

The 'Low Birthrate Dataset' (strictly speaking, it records infant birth weight rather than birth rate) comes from a well-known 1989 study by Hosmer and Lemeshow, the "Low Infant Birth Weight Risk Factor Study". It is a very common academic dataset, used mostly for logistic regression. The data is hosted in a public GitHub repository here: https://github.com/nfmcclure/tensorflow_cookbook/raw/master/01_Introduction/07_Working_with_Data_Sources/birthweight_data/birthweight.dat
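The following is a rough sketch of how the file could be fetched and parsed using only the standard library. It assumes the file is tab-delimited with a single header row, which should be checked against the actual download. The function names parse_birth_data and download_text are hypothetical helpers, not part of any library.

```python
import urllib.request

BIRTH_URL = ("https://github.com/nfmcclure/tensorflow_cookbook/raw/master/"
             "01_Introduction/07_Working_with_Data_Sources/"
             "birthweight_data/birthweight.dat")


def download_text(url):
    """Fetch a URL and return its body decoded as UTF-8 text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_birth_data(text):
    """Split tab-delimited text into a header list and rows of floats.

    Assumes the first non-empty line holds the column names.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    rows = [[float(v) for v in line.split("\t") if v.strip()]
            for line in lines[1:]]
    return header, rows
```

With network access, calling parse_birth_data(download_text(BIRTH_URL)) would return the column names and the numeric rows.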

Housing Price Dataset (UCI)

We will also use a housing price dataset from the University of California, Irvine (UCI) Machine Learning Repository. It works well for regression tasks. You can read more about it here: https://archive.ics.uci.edu/ml/datasets/Housing
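A possible way to parse the raw data file is sketched below. It assumes the file is whitespace-delimited with 14 numeric columns per row and no header, and that the last column (MEDV, the median home value) is the regression target. The column names are taken from the dataset's description file as best recalled, so treat them as an assumption. parse_housing is a hypothetical helper.

```python
# Assumed column order; verify against the description file shipped with the data.
HOUSING_COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                   "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]


def parse_housing(text):
    """Parse whitespace-delimited rows into (features, targets) lists."""
    features, targets = [], []
    for line in text.splitlines():
        values = [float(v) for v in line.split()]
        if len(values) != len(HOUSING_COLUMNS):
            continue  # skip blank lines and rows with the wrong column count
        features.append(values[:-1])   # first 13 columns are predictors
        targets.append(values[-1])     # last column (MEDV) is the target
    return features, targets
```

The text argument would be the contents of the data file, read from disk or downloaded beforehand.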

MNIST Handwriting Dataset (Yann LeCun)

The MNIST handwritten digit image dataset is the "Hello World" of image recognition. The researcher Yann LeCun hosts it on his webpage here: http://yann.lecun.com/exdb/mnist/ . Because it is so commonly used, many libraries, including TensorFlow, provide built-in loaders for it. We will use TensorFlow to access this data.
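One way to access the data through TensorFlow is the tf.keras.datasets.mnist loader, sketched here. This assumes TensorFlow 2.x is installed and that the first call can download the data. load_mnist and scale_pixels are hypothetical helper names, and the pixel scaling is a common preprocessing step rather than something the loader requires.

```python
import numpy as np


def load_mnist():
    """Return ((x_train, y_train), (x_test, y_test)) as NumPy arrays.

    TensorFlow is imported lazily so that scale_pixels below can be
    used without it; the first call downloads and caches the data.
    """
    import tensorflow as tf
    return tf.keras.datasets.mnist.load_data()


def scale_pixels(images):
    """Convert uint8 pixel values in [0, 255] to float32 values in [0, 1]."""
    return images.astype(np.float32) / 255.0
```

A typical call would be (x_train, y_train), (x_test, y_test) = load_mnist(), followed by x_train = scale_pixels(x_train).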

CIFAR-10 Data

The CIFAR-10 data ( https://www.cs.toronto.edu/~kriz/cifar.html ) contains 60,000 32x32 color images in 10 classes, collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Alex Krizhevsky maintains the page referenced here. The dataset is so common that TensorFlow has built-in functions to access it (through its Keras wrapper). Note that the Keras wrapper automatically splits the images into a 50,000-image training set and a 10,000-image test set.
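The Keras wrapper mentioned above is tf.keras.datasets.cifar10. The sketch below assumes TensorFlow 2.x and network access on the first call. load_cifar10 and label_names are hypothetical helpers. The class-name list follows the order documented on the CIFAR-10 page, because the loader itself returns only integer labels.

```python
import numpy as np

# Class names in label order (0 through 9), as listed on the CIFAR-10 page.
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]


def load_cifar10():
    """Return ((x_train, y_train), (x_test, y_test)) from the Keras wrapper.

    TensorFlow is imported lazily; the first call downloads the data.
    """
    import tensorflow as tf
    return tf.keras.datasets.cifar10.load_data()


def label_names(labels):
    """Map an (N, 1) or (N,) array of integer labels to class-name strings."""
    return [CIFAR10_CLASSES[int(i)] for i in np.ravel(labels)]
```

After (x_train, y_train), (x_test, y_test) = load_cifar10(), the call label_names(y_train[:5]) would give readable names for the first five training images.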

Ham/Spam Texts Dataset (UCI)

We will use another UCI ML Repository dataset called the SMS Spam Collection. You can read about it here: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection . As a side note on terminology: when predicting whether a data point is 'spam' (an unwanted message or advertisement), the alternative class is called 'ham' (useful information).
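A small sketch of turning the raw text into messages and binary labels follows. It assumes each line has the form label, tab, message, with the label being either ham or spam, which should be confirmed against the downloaded file. parse_sms is a hypothetical helper name.

```python
def parse_sms(text):
    """Return (messages, labels), with label 1 for spam and 0 for ham."""
    messages, labels = [], []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if "\t" not in line:
            continue  # skip blank lines or lines without a label column
        # Split only on the first tab so tabs inside the message survive.
        label, message = line.split("\t", 1)
        messages.append(message)
        labels.append(1 if label == "spam" else 0)
    return messages, labels
```

The text argument would be the contents of the extracted data file, for example read with open(path, encoding="utf-8").read(), where path points at the local copy.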

Movie Review Data (Cornell)

The Movie Review database, collected by Bo Pang and Lillian Lee (researchers at Cornell), serves as a great dataset for predicting a numerical value from textual inputs.

The Complete Works of William Shakespeare (Project Gutenberg)

To train a TensorFlow model to generate text, we will use the complete works of William Shakespeare. The text is available through Project Gutenberg, whose volunteers make many public-domain books accessible for free.

English-German Sentence Translation Database (Manythings/Tatoeba)

The Tatoeba Project is also run by volunteers and aims to collect as many bilingual sentence translations as possible across many different languages. Manythings.org compiles the data and makes it accessible.

http://www.manythings.org/corpus/about.html#info