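I built a GPU machine learning environment for about 20 yen an hour

Machine learning study notes, part 7.
This is a cheap and simple procedure for setting up a GPU environment.

Cost: roughly 15 to 20 yen per hour
Build time: 10 minutes (7 to 8 of which are waiting)
Platform: AWS
Supported libraries: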
MXNet : v0.9.3 tag
TensorFlow : v1.0.1 tag (updated to a newer version)
Theano : rel-0.8.2 tag
Caffe : rc5 tag
Caffe2(Experimental) : v0.6.0
CNTK : v2.0beta12.0 tag
Torch : master branch
Keras : 1.2.2 tag
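At 15 to 20 yen, it is easy on the wallet.

I don't think there is an easier way at the moment.
(If you know a method that is both easier and cheaper, please tell me.)

Background
On Ubuntu you normally need the following steps,
which is a hassle:
1. Install Ubuntu 14.04
2. Install the NVIDIA driver
3. Install CUDA and cuDNN
4. Install TensorFlow
5. Run a TensorFlow test
References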
qiita.com
qiita.com

Setting up environments always takes me a long time,
and I used to dread it,
but the following slides made that fear go away.

speakerdeck.com
Page 44 of this deck was a ray of light!
If you use the AMI that Amazon provides, you don't need to install anything!
Super easy!

Here is the actual AMI:
AWS Marketplace: Deep Learning AMI with Source Code (CUDA 8, Amazon Linux)

There also seems to be an Ubuntu version (added 2017/04/11)
AWS Marketplace: Deep Learning AMI with Source Code (CUDA 8, Ubuntu)

Procedure
1. Log in to the AWS console
2. Change the region to N. Virginia
  The Tokyo region does not support GPU (as of this writing)
  Latency is not a concern here, so N. Virginia is fine
  Apparently DeNA uses the same region
  (they said they would also run on-premises, so I don't know about now)

3. Go to the EC2 dashboard and select Spot Requests
  The list of spot requests is displayed;
  click the "Request Spot Instances" button

4. The spot request wizard appears
Select the following:
  Request type: Request
  Target capacity: 1
  AMI: ami-e7c96af1
To enter the AMI ID, click "Use custom AMI"


  Instance type: g2.2xlarge
  Check the price shown for the selected g2.2xlarge

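  Allocation strategy: Lowest Price
  Network: the network created last time is fine
  Availability Zone: no preference (since this is a spot request)
  Maximum price: the price shown when you selected the instance type, plus about 0.03 USD
  It was 0.129 this time, so I set 0.15
  Apart from the security group and key file, leave everything else at the defaults
  Click Review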
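
As a rough check on the "15 to 20 yen" figure, the prices above can be converted to yen. This is only a minimal sketch: the rate of 110 yen per US dollar is an assumption for illustration (it is not stated anywhere above), and `hourly_cost_yen` is a hypothetical helper name.

```python
# Rough hourly cost of a spot instance in yen.
# ASSUMPTION: 110 yen per US dollar is an illustrative rate, not a quoted one.
JPY_PER_USD = 110.0


def hourly_cost_yen(price_usd_per_hour, jpy_per_usd=JPY_PER_USD):
    """Convert an hourly price in US dollars to yen."""
    return price_usd_per_hour * jpy_per_usd


# 0.129 is the spot price shown in the wizard; 0.15 is the maximum price set above.
for label, usd in [("spot price", 0.129), ("max price", 0.15)]:
    print("{}: {:.3f} USD/h is about {:.1f} yen/h".format(
        label, usd, hourly_cost_yen(usd)))
```

With that assumed rate, both values come out at roughly 14 to 17 yen per hour, which is in line with the estimate at the top. A different exchange rate shifts the result proportionally.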

A confirmation screen appears; click the "Create" button
A success message is displayed

Once the request is fulfilled, it shows up in the spot request list

In the instance list,
the requested instance
shows the status "Initializing"


It becomes usable about 7 to 8 minutes from here

Once the status shows "2/2 checks passed", connect from a terminal

ssh -i "terraform-us-east-1.pem" ec2-user@ec2-54-145-247-36.compute-1.amazonaws.com
(Note that the host name changes with every request)
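
The "2/2 checks passed" test and the changing host name could also be handled in a small script. The following is a sketch under assumptions: the dictionary shape mirrors what boto3's EC2 `describe_instance_status` returns as far as I know, fetching that response is not shown, and `checks_passed` and `ssh_command` are hypothetical helper names.

```python
def checks_passed(status_response):
    """Return True when every listed instance reports both status checks as 'ok'.

    ASSUMPTION: status_response has the shape of an EC2
    describe_instance_status response (a dict with 'InstanceStatuses').
    """
    statuses = status_response.get("InstanceStatuses", [])
    if not statuses:
        # Nothing reported yet, so the instance is not ready.
        return False
    return all(
        s["SystemStatus"]["Status"] == "ok"
        and s["InstanceStatus"]["Status"] == "ok"
        for s in statuses
    )


def ssh_command(key_file, public_dns, user="ec2-user"):
    """Build the ssh command line for the current public DNS name."""
    return 'ssh -i "{}" {}@{}'.format(key_file, user, public_dns)


# Hand-written sample responses, for illustration only.
initializing = {"InstanceStatuses": [{
    "SystemStatus": {"Status": "initializing"},
    "InstanceStatus": {"Status": "initializing"},
}]}
ready = {"InstanceStatuses": [{
    "SystemStatus": {"Status": "ok"},
    "InstanceStatus": {"Status": "ok"},
}]}

print(checks_passed(initializing))
print(checks_passed(ready))
print(ssh_command("terraform-us-east-1.pem",
                  "ec2-54-145-247-36.compute-1.amazonaws.com"))
```

Because the host name is a parameter, the same helper works for each new spot request without editing the command by hand.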

Upload the source code
(git would be fine)

Upload the source below (it was a hassle, so I just copied and pasted it)
vi test.py

# coding: utf-8

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

import tensorflow as tf
sess = tf.InteractiveSession()

x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])

W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))

sess.run(tf.initialize_all_variables())

y = tf.nn.softmax(tf.matmul(x,W) + b)

cross_entropy = -tf.reduce_sum(y_*tf.log(y))

train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

for i in range(1000):
  batch = mnist.train.next_batch(50)
  train_step.run(feed_dict={x: batch[0], y_: batch[1]})

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

print(accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

x_image = tf.reshape(x, [-1,28,28,1])

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
sess.run(tf.initialize_all_variables())
for i in range(20000):
  batch = mnist.train.next_batch(50)
  if i%100 == 0:
    train_accuracy = accuracy.eval(feed_dict={
        x:batch[0], y_: batch[1], keep_prob: 1.0})
    print("step %d, training accuracy %g"%(i, train_accuracy))
  train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

print("test accuracy %g"%accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

Output from an actual run

[ec2-user@ip-172-31-23-68 ~]$ python test.py
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
WARNING:tensorflow:From test.py:15: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
0.9092
WARNING:tensorflow:From test.py:80: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
step 0, training accuracy 0.04
step 100, training accuracy 0.72
step 200, training accuracy 0.92
step 300, training accuracy 0.96
step 400, training accuracy 0.94
step 500, training accuracy 0.94
step 600, training accuracy 0.94
step 700, training accuracy 0.96
　・
　・
　・
step 19900, training accuracy 1
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.90GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
test accuracy 0.992

It finishes in about 300 seconds
I plan to test the other libraries as well
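
To confirm from a saved log that the run really used the GPU, one option is to search for the "Creating TensorFlow device" line shown in the output above. This is a minimal sketch: the pattern is written against the log format of this particular run, so it is an assumption that other TensorFlow versions print the same line, and `find_gpu_devices` is a hypothetical helper name.

```python
import re

# Pattern for the device line as it appears in the log above.
# ASSUMPTION: other TensorFlow versions may word this line differently.
GPU_LINE = re.compile(
    r"Creating TensorFlow device \((?P<tf_device>/gpu:\d+)\) -> "
    r"\(device: (?P<index>\d+), name: (?P<name>[^,]+), "
    r"pci bus id: (?P<pci>[^)]+)\)"
)


def find_gpu_devices(log_text):
    """Return one dict per GPU device line found in the log text."""
    return [m.groupdict() for m in GPU_LINE.finditer(log_text)]


# The device line copied from the run above.
sample = (
    "I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] "
    "Creating TensorFlow device (/gpu:0) -> "
    "(device: 0, name: GRID K520, pci bus id: 0000:00:03.0)"
)
devices = find_gpu_devices(sample)
print(devices)
```

An empty list would suggest the script fell back to the CPU, which is worth checking before paying for GPU time.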

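By the way, I tried making the spot request with Terraform,
but so far it fails and never gets as far as launching the instance.
I could not find any sample code.
I think the resource to use is probably [AWS_SPOT_FLEET_REQUEST],
but I could not get it working and have not yet pinned down what is wrong.
The remaining task is to turn what I did in the GUI
into Terraform
(or at least into the AWS CLI).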
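
As one possible step toward scripting the GUI procedure, the same settings could be sent through boto3. This is a sketch under assumptions, not a tested replacement for the wizard: it uses the simpler `request_spot_instances` call rather than a spot fleet request, the key name is guessed from the .pem file name used above, the security group ID is a placeholder, and `build_spot_request` and `submit_spot_request` are hypothetical helper names.

```python
def build_spot_request(ami_id, instance_type, max_price_usd,
                       key_name, security_group_ids):
    """Build keyword arguments for the EC2 client's request_spot_instances."""
    return {
        "SpotPrice": "{:.3f}".format(max_price_usd),  # the API takes a string
        "InstanceCount": 1,
        "Type": "one-time",
        "LaunchSpecification": {
            "ImageId": ami_id,
            "InstanceType": instance_type,
            "KeyName": key_name,
            "SecurityGroupIds": list(security_group_ids),
        },
    }


def submit_spot_request(params, region_name="us-east-1"):
    """Send the request. Needs boto3 and AWS credentials; not called here."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    ec2 = boto3.client("ec2", region_name=region_name)
    return ec2.request_spot_instances(**params)


params = build_spot_request(
    ami_id="ami-e7c96af1",
    instance_type="g2.2xlarge",
    max_price_usd=0.15,
    key_name="terraform-us-east-1",      # ASSUMPTION: guessed from the .pem name
    security_group_ids=["sg-xxxxxxxx"],  # HYPOTHETICAL placeholder
)
print(params)
```

Calling `submit_spot_request(params)` would place the request in N. Virginia (us-east-1). Building the parameters separately keeps them easy to inspect before any money is spent.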