Tuesday, April 21, 2020

Several ways to use the MNIST dataset

Introduction

The MNIST handwritten digit database is one of the most famous datasets in machine learning.

Several well-known libraries provide it, but because there are several ways to load it and more than one kind of dataset, I was confused about whether they differ.

Since I guess other people are in the same situation, I wrote this article to sort out the confusing information.

I already wrote this article in Japanese and cited my references there, so here I keep the references to a minimum.

Assumption

I assume that you have already installed sklearn, tensorflow and pytorch. (I installed them with Anaconda.)

I use macOS.
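
Before going on, it may help to confirm which of these packages can actually be found in your environment. The following is a minimal sketch using only the standard library. The helper name `is_installed` is my own, and the list uses the usual import names (scikit-learn is imported as `sklearn` and pytorch as `torch`).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Import names of the libraries used in this article.
for name in ["sklearn", "tensorflow", "torch", "torchvision", "matplotlib"]:
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name:12s} {status}")
```

Anything reported as MISSING has to be installed before the corresponding section will work.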

Notation

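You will come across two kinds of handwritten digit dataset referred to as MNIST.

The first is the one bundled with sklearn. Strictly speaking it is a separate, smaller dataset (a copy of the UCI handwritten digits test set) rather than MNIST itself, but it is often introduced alongside it.

The second is everything else.

The first consists of 8×8 pixel images.

The second consists of 28×28 pixel images.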
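
These two sizes also show up as the length of each flattened image, 64 and 784, in the `data.shape` outputs later in this article. As a small sketch (the function name `side_length` is my own), you can tell which kind of dataset you are holding from that length:

```python
def side_length(n_pixels):
    """Return the side of a square image stored as a flat vector of n_pixels."""
    side = int(round(n_pixels ** 0.5))
    if side * side != n_pixels:
        raise ValueError(f"{n_pixels} is not a square number")
    return side


print(side_length(64))   # 8  -> the digits bundled with sklearn
print(side_length(784))  # 28 -> MNIST
```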

The data bundled with sklearn (8×8 pixels)

Where the data is

The dataset bundled with sklearn is in the following directory.
/(depends on your environment)/lib/python3.7/site-packages/sklearn/datasets
The following is my case. (I use Anaconda.)
$ ls /Users/hiroshi/opt/anaconda3/lib/python3.7/site-packages/sklearn/

__check_build   dummy.py   model_selection
__init__.py   ensemble   multiclass.py
__pycache__   exceptions.py   multioutput.py
_build_utils   experimental   naive_bayes.py
_config.py   externals   neighbors
_distributor_init.py  feature_extraction  neural_network
_isotonic.cpython-37m-darwin.so feature_selection  pipeline.py
base.py    gaussian_process  preprocessing
calibration.py   impute    random_projection.py
cluster    inspection   semi_supervised
compose    isotonic.py   setup.py
conftest.py   kernel_approximation.py  svm
covariance   kernel_ridge.py   tests
cross_decomposition  linear_model   tree
datasets   manifold   utils
decomposition   metrics
discriminant_analysis.py mixture
Looking inside the datasets directory, you find the following.
$ ls /Users/hiroshi/opt/anaconda3/lib/python3.7/site-packages/sklearn/datasets

__init__.py     california_housing.py
__pycache__     covtype.py
_base.py     data
_california_housing.py    descr
_covtype.py     images
_kddcup99.py     kddcup99.py
_lfw.py      lfw.py
_olivetti_faces.py    olivetti_faces.py
_openml.py     openml.py
_rcv1.py     rcv1.py
_samples_generator.py    samples_generator.py
_species_distributions.py   setup.py
_svmlight_format_fast.cpython-37m-darwin.so species_distributions.py
_svmlight_format_io.py    svmlight_format.py
_twenty_newsgroups.py    tests
base.py      twenty_newsgroups.py
Here you can see other datasets besides MNIST.

Going one level deeper, into the data directory, you find the following.
$ ls /Users/hiroshi/opt/anaconda3/lib/python3.7/site-packages/sklearn/datasets/data
boston_house_prices.csv  diabetes_target.csv.gz  linnerud_exercise.csv
breast_cancer.csv  digits.csv.gz   linnerud_physiological.csv
diabetes_data.csv.gz  iris.csv   wine_data.csv
Here there are datasets such as the iris dataset and the boston_house_prices dataset, which are often used in articles about sklearn. The 8×8 digits are stored in digits.csv.gz.
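
Instead of typing the path by hand, the same directory can be located from Python. The following is a minimal sketch using only the standard library. The helper name `package_dir` is my own, and the `data` subdirectory is an assumption based on the listing above, so it may differ in other sklearn versions.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory of an installed package, or None if it is absent."""
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # Raised when the parent package (e.g. sklearn) cannot be imported.
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


datasets_dir = package_dir("sklearn.datasets")
if datasets_dir is None:
    print("sklearn.datasets was not found in this environment")
else:
    print("datasets directory:", datasets_dir)
    data_dir = datasets_dir / "data"
    # The bundled files may live elsewhere in other sklearn versions.
    if data_dir.is_dir():
        for path in sorted(data_dir.iterdir()):
            print("  ", path.name)
```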

How to import the dataset

It is the same as on the official sklearn page.

The following steps are run in a Python session launched from the terminal.
>>> from sklearn.datasets import load_digits
>>> import matplotlib.pyplot as plt
>>> digit=load_digits()
>>> digit.data.shape
(1797, 64)     

>>> plt.gray()
>>> digit.images[0]
array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])
>>> plt.matshow(digit.images[0])
>>> plt.show()

A window showing the 8×8 digit appears.
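
If no plot window is available, the same digit can be checked as text. The sketch below copies the values of `digit.images[0]` from the session above and prints them as characters. The names `SHADES` and `to_ascii` are my own, and the maximum value of 16 is the pixel range of this bundled dataset.

```python
# Pixel values of digit.images[0], copied from the session above
# (0 = background, 16 = strongest stroke).
image = [
    [0, 0, 5, 13, 9, 1, 0, 0],
    [0, 0, 13, 15, 10, 15, 5, 0],
    [0, 3, 15, 2, 0, 11, 8, 0],
    [0, 4, 12, 0, 0, 8, 8, 0],
    [0, 5, 8, 0, 0, 9, 8, 0],
    [0, 4, 11, 0, 1, 12, 7, 0],
    [0, 2, 14, 5, 10, 12, 0, 0],
    [0, 0, 6, 13, 10, 0, 0, 0],
]

SHADES = " .:*#"  # from empty to full ink


def to_ascii(img, max_value=16):
    """Map each pixel to one of the SHADES characters, row by row."""
    lines = []
    for row in img:
        chars = [SHADES[int(v) * (len(SHADES) - 1) // max_value] for v in row]
        lines.append("".join(chars))
    return lines


for line in to_ascii(image):
    print(line)

# Flattening row by row gives the 64-element vector stored in digit.data[0].
flat = [v for row in image for v in row]
print(len(flat))
```

The printed shape is the outline of a zero, and the flattened list explains why `digit.data.shape` is (1797, 64) while each image is 8×8.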



Download the original dataset (28×28 pixels)

The original MNIST dataset is on this page.

But the data you get there is binary and cannot be used as it is.

So you need to process it before use.

But as you will see, MNIST is so famous that many tools make it available immediately.

You can of course process the files yourself, but I could not work it out and did not want to spend much time looking for the way, so I do not cover it in detail.
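
For readers who want to try anyway, the following is a minimal sketch and nothing else in this article depends on it. It assumes the IDX layout described on the original MNIST page: two zero bytes, one byte for the data type, one byte for the number of dimensions, then each dimension as a 4-byte big-endian integer, followed by the raw values. The function name `parse_idx` is my own, and only the unsigned-byte type is handled.

```python
import struct


def parse_idx(raw):
    """Split raw IDX bytes into (dimensions, value bytes).

    Only the unsigned-byte type (code 0x08) is handled in this sketch.
    """
    zeros, type_code, n_dims = struct.unpack_from(">HBB", raw, 0)
    if zeros != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    dims = struct.unpack_from(">" + "I" * n_dims, raw, 4)
    offset = 4 + 4 * n_dims
    count = 1
    for d in dims:
        count *= d
    payload = raw[offset:offset + count]
    if len(payload) != count:
        raise ValueError("file is shorter than its header claims")
    return dims, payload


# Tiny synthetic example: two 2x3 "images" with pixel values 0..11.
fake = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 3) + bytes(range(12))
dims, pixels = parse_idx(fake)
print(dims)
print(list(pixels[:6]))  # first image, flattened row by row
```

For a real file you would first read the decompressed bytes (for example with `gzip.open(path, "rb").read()`) and pass them to `parse_idx`.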

Download via sklearn (28×28 pixels)

Searching the internet, I found the following way in some old articles.
from sklearn.datasets import fetch_mldata
But it now raises an error: the website it tries to access (mldata.org) is no longer available, and recent sklearn versions no longer include the function at all.

So nowadays it seems that we have to use fetch_openml as follows.
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import fetch_openml
>>> # as_frame=False keeps the data as a NumPy array (newer sklearn may otherwise return a DataFrame)
>>> digits = fetch_openml(name='mnist_784', version=1, as_frame=False)
>>> digits.data.shape
(70000, 784)
>>> plt.imshow(digits.data[0].reshape(28,28), cmap=plt.cm.gray_r)

>>> plt.show()
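
Each row returned by fetch_openml is a flattened image, so it has to be reshaped before display. The sketch below does this for the whole table at once and also converts the labels, which OpenML delivers as strings such as '5'. It uses random stand-in data so that it runs without downloading anything. The name `to_images` is my own.

```python
import numpy as np


def to_images(data, side=28):
    """Convert (n_samples, side*side) rows into (n_samples, side, side) images."""
    arr = np.asarray(data)  # also accepts a pandas DataFrame
    if arr.ndim != 2 or arr.shape[1] != side * side:
        raise ValueError(f"expected shape (n, {side * side}), got {arr.shape}")
    return arr.reshape(-1, side, side)


# Stand-in for digits.data: 5 random rows with MNIST-like values in 0..255.
rng = np.random.default_rng(0)
fake_data = rng.integers(0, 256, size=(5, 784))
images = to_images(fake_data)
print(images.shape)  # (5, 28, 28)

# Stand-in for digits.target: string labels converted to integers.
fake_target = np.asarray(["5", "0", "4", "1", "9"])
labels = fake_target.astype(int)
print(labels)
```

With the real dataset, `to_images(digits.data)[0]` is the same array that the session above passes to `plt.imshow`.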



tensorflow (28×28 pixels)

This is the way that uses the tensorflow tutorials.
>>> from tensorflow.examples.tutorials.mnist import input_data
This import is supposed to make MNIST available, but in my case it failed with the error below.

It seems that the directory containing the tutorials is not always installed together with tensorflow.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'tensorflow.examples.tutorials'
I checked the contents of the directory. This is the result.
$ ls /Users/hiroshi/opt/anaconda3/lib/python3.7/site-packages/tensorflow_core/examples/
__init__.py __pycache__ saved_model
I worked around it with the following steps, which I found on other pages.

First, open the GitHub page of TensorFlow, download the zip file to any location and extract it.



You will find a directory named "tensorflow-master", and inside tensorflow-master/tensorflow/examples/ there is a directory named "tutorials".

Copy the "tutorials" directory into "/Users/hiroshi/opt/anaconda3/lib/python3.7/site-packages/tensorflow_core/examples/".

Then run the following.
>>> import matplotlib.pyplot as plt   
>>> from tensorflow.examples.tutorials.mnist import input_data
>>> mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
>>> im = mnist.train.images[1]
>>> im = im.reshape(-1, 28)
>>> plt.imshow(im)

>>> plt.show()

You will then see an MNIST image.
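
The session above passes `one_hot=True`, which means each label is a vector of ten numbers with a single 1 at the position of the digit rather than the digit itself. The following sketch shows the encoding and how to get the digit back. The function names `one_hot` and `from_one_hot` are my own and are not part of tensorflow.

```python
def one_hot(label, n_classes=10):
    """Encode a digit 0..9 as a list with a single 1.0 at index `label`."""
    if not 0 <= label < n_classes:
        raise ValueError(f"label must be in 0..{n_classes - 1}")
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def from_one_hot(vector):
    """Recover the digit as the index of the largest entry (argmax)."""
    return max(range(len(vector)), key=lambda i: vector[i])


encoded = one_hot(3)
print(encoded)
print(from_one_hot(encoded))  # 3
```

With the loaded data, `from_one_hot(list(mnist.train.labels[1]))` should give the digit drawn in the image shown above.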

keras (28×28 pixels)

>>> import matplotlib.pyplot as plt   
>>> import tensorflow as tf
>>> mnist = tf.keras.datasets.mnist
>>> mnist
>>> mnist_data = mnist.load_data()
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
>>> type(mnist_data[0])
<class 'tuple'>
>>> len(mnist_data[0])
2
>>> len(mnist_data[0][0])
60000
>>> len(mnist_data[0][0][1])
28
>>> mnist_data[0][0][1].shape
(28, 28)

>>> plt.imshow(mnist_data[0][0][1],cmap=plt.cm.gray_r)

>>> plt.show()
I do not show the image I got, but if the procedure succeeds you will see it.
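
As the lengths printed above suggest, `load_data()` returns two pairs, (training images, training labels) and (test images, test labels). Naming them once is easier to read than indexing `mnist_data[0][0][1]`. The sketch below uses a small stand-in with the same nesting so that it runs without tensorflow. The variable names are a common convention, not something required by keras.

```python
import numpy as np

# Stand-in with the same nesting as mnist.load_data(), but only a few samples.
# With tensorflow installed, replace this with: mnist_data = mnist.load_data()
mnist_data = (
    (np.zeros((6, 28, 28), dtype=np.uint8), np.arange(6, dtype=np.uint8)),
    (np.zeros((2, 28, 28), dtype=np.uint8), np.arange(2, dtype=np.uint8)),
)

# Name the four pieces once instead of indexing mnist_data[0][0][1] everywhere.
(x_train, y_train), (x_test, y_test) = mnist_data

print(x_train.shape, y_train.shape)  # (6, 28, 28) (6,)
print(x_test.shape, y_test.shape)    # (2, 28, 28) (2,)

# Pixels are stored as integers 0..255; scaling to 0..1 is a common next step.
x_train_scaled = x_train.astype("float32") / 255.0
print(x_train_scaled.dtype)
```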

pytorch (28×28 pixels)

It seems that you cannot go further unless the following command runs.
>>> from torchvision.datasets import MNIST
I got an error.

It seems that torchvision was not installed.

In my case, when installing pytorch I merely ran the following, which seems to be the reason.
$ conda install pytorch
To install torchvision as well, run the following.
$ conda install pytorch torchvision -c pytorch
When asked to choose y or n, choose y.

After doing that (if needed), run the following commands.
>>> import matplotlib.pyplot as plt
>>> import torchvision.transforms as transforms
>>> from torch.utils.data import DataLoader
>>> from torchvision.datasets import MNIST
>>> mnist_data = MNIST('~/tmp/mnist', train=True, download=True, transform=transforms.ToTensor())
>>> data_loader = DataLoader(mnist_data, batch_size=4, shuffle=False)
>>> data_iter = iter(data_loader)
>>> # the built-in next() works on any iterator; data_iter.next() fails on newer PyTorch
>>> images, labels = next(data_iter)
>>> npimg = images[0].numpy()
>>> npimg = npimg.reshape((28, 28))
>>> plt.imshow(npimg, cmap='gray')

>>> plt.show()
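
The loader in the session above hands out the images in groups of `batch_size`, and the first group is taken with the built-in `next()`. The following pure-Python sketch imitates that behaviour with plain lists, so it runs without pytorch. The generator name `batches` is my own and is only a rough stand-in for DataLoader with shuffle=False.

```python
def batches(items, batch_size):
    """Yield consecutive chunks of `items`, in their original order."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


samples = list(range(10))          # stand-in for the training images
data_iter = iter(batches(samples, batch_size=4))

first = next(data_iter)            # the built-in next() works on any iterator
print(first)                       # [0, 1, 2, 3]
print(hasattr(data_iter, "next"))  # False: Python 3 iterators define __next__
```

With the real loader, the first batch holds 4 images, each with an extra channel dimension added by ToTensor, which is why the session reshapes `images[0]` to (28, 28) before plotting.
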
The sources I referred to

The original dataset of mnist

sklearn

Tensorflow

The others