Trying out Amazon SageMaker

f:id:masalib:20190622225433j:plain

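I want to build a machine learning web app, but I got stuck on the last piece, the endpoint. I read an article saying endpoints are easy to create with it, so I decided to study a service called Amazon SageMaker.

"Trying out" here means my understanding is still far from complete.
This post follows the article below almost step for step.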

dev.classmethod.jp

The only real difference is that the screens look slightly different.

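What is Amazon SageMaker
Creating an S3 bucket
Creating an Amazon SageMaker notebook instance
Training and deploying a model with a built-in algorithm
Training the model
- Creating the training job
Deploying the model
Cleanup
Impressions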

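What is Amazon SageMaker

Amazon SageMaker gives every developer and data scientist a way to build, train, and deploy machine learning models.

You can deploy a model and get an endpoint for it. SageMaker automatically configures and optimizes TensorFlow, Apache MXNet, PyTorch, Chainer, Scikit-learn, SparkML, Horovod, Keras, and Gluon, and it ships with commonly used machine learning algorithms built in.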

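Creating an S3 bucket

Create a place to store the training data and other files in advance.

The bucket name needs to contain the string "sagemaker". This is because, when SageMaker creates a notebook instance, it creates an IAM role that allows access to S3 buckets whose names contain the string "sagemaker".

I created one named sagemaker-test-mnst.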
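The naming rule above is easy to get wrong, so here is a minimal sketch of a check you could run before creating the bucket. The function name `check_bucket_name` is made up for this example, and the regular expression is a simplified version of the general S3 naming rules, not the full specification.

```python
import re
from typing import List

# Assumption: the auto-created SageMaker role only covers buckets whose
# names contain "sagemaker", as described above.
REQUIRED_SUBSTRING = "sagemaker"

# Simplified S3 naming rules: 3-63 characters, lowercase letters, digits,
# hyphens and dots, starting and ending with a letter or digit.
_S3_NAME = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def check_bucket_name(name: str) -> List[str]:
    """Return a list of problems with the name (empty means it looks fine)."""
    problems = []
    if not _S3_NAME.match(name):
        problems.append("not a valid S3 bucket name")
    if REQUIRED_SUBSTRING not in name:
        problems.append('does not contain "sagemaker"')
    return problems


print(check_bucket_name("sagemaker-test-mnst"))  # []
print(check_bucket_name("my-mnist-data"))
```

Bucket names are also globally unique, which a local check like this cannot verify.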

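Creating an Amazon SageMaker notebook instance

https://console.aws.amazon.com/sagemaker/

Go to the URL above and click the Create notebook instance button.

The first time, you see the screen below.

f:id:masalib:20190622220432j:plain

From the second time on, you see the screen below.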

f:id:masalib:20190622220510j:plain

On the settings screen, enter sagemaker-test-mnst as the name.

f:id:masalib:20190622220700j:plain

Scroll down and you will find the IAM role setting under Permissions and encryption.

f:id:masalib:20190622221313j:plain

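Choosing to create a new role opens the IAM role creation dialog. Set the S3 bucket option to None and create the role. Buckets with "sagemaker" in the name are still accessible, so no extra bucket needs to be specified.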

f:id:masalib:20190622221405j:plain

Once the role is created, its name is displayed. Click the Create notebook instance button.

f:id:masalib:20190622221504j:plain

Right after creation the Status is Pending. When creation completes, it changes to InService.
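If you would rather wait in a script than watch the console, the Pending to InService transition can be polled. This is only a sketch: `wait_until_in_service` is a hypothetical helper, and it takes the status source as a callable so it can be tried without touching AWS.

```python
import time


def wait_until_in_service(get_status, poll_seconds=15, timeout_seconds=900,
                          sleep=time.sleep):
    """Poll get_status() until it returns 'InService'.

    get_status is any zero-argument callable returning the status string.
    Returns the seconds waited. Raises RuntimeError on 'Failed' and
    TimeoutError once timeout_seconds has passed.
    """
    waited = 0
    while True:
        status = get_status()
        if status == "InService":
            return waited
        if status == "Failed":
            raise RuntimeError("notebook instance creation failed")
        if waited >= timeout_seconds:
            raise TimeoutError("still {} after {}s".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with a fake status source instead of a real AWS call.
fake_statuses = iter(["Pending", "Pending", "InService"])
elapsed = wait_until_in_service(lambda: next(fake_statuses), sleep=lambda s: None)
print(elapsed)  # 30
```

With boto3, `get_status` could wrap `describe_notebook_instance(NotebookInstanceName=...)` on the `sagemaker` client and return its `NotebookInstanceStatus` field. The instance name you pass is whatever you entered on the settings screen.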

f:id:masalib:20190622221844j:plain

Training and deploying a model with a built-in algorithm

Click the Open Jupyter link in the list.

f:id:masalib:20190622223512j:plain

Jupyter Notebook opens. Choose conda_python3 from the New menu.

f:id:masalib:20190622223633j:plain

Another Jupyter Notebook tab opens. Give the notebook any name you like.

Preparing the S3 bucket

from sagemaker import get_execution_role
 
role = get_execution_role()
bucket='sagemaker-test-mnst' # replace with the name of the S3 bucket you created earlier

Downloading the training data

%%time
import pickle, gzip, numpy, urllib.request, json
 
# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

Inspecting the data

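%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (2,10)
 
 
def show_digit(img, caption='', subplot=None):
    # Create a new axes only when the caller did not pass one in
    if subplot is None:
        _, subplot = plt.subplots(1,1)
    # Each image is a flat 784-value vector; reshape it to 28x28 pixels
    imgr = img.reshape((28,28))
    subplot.axis('off')
    subplot.imshow(imgr, cmap='gray')
    # Set the title on this axes rather than on whichever axes is current
    subplot.set_title(caption)
 
show_digit(train_set[0][30], 'This is a {}'.format(train_set[1][30]))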

f:id:masalib:20190622223838j:plain

The 31st image of the MNIST dataset is displayed together with its label (3).
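Each image in this dataset arrives as one flat row of 784 values, which is why the plotting code has to turn it back into 28 rows of 28 pixels. Here is a small sketch of that idea using only plain lists. The helper names and the fake image are invented for illustration.

```python
def to_rows(flat_pixels, width=28):
    """Split a flat list of pixel values into rows of `width` pixels."""
    if len(flat_pixels) % width != 0:
        raise ValueError("length must be a multiple of width")
    return [flat_pixels[i:i + width] for i in range(0, len(flat_pixels), width)]


def to_ascii(rows, threshold=0.5):
    """Render bright pixels as '#' and dark ones as '.'."""
    return "\n".join(
        "".join("#" if p >= threshold else "." for p in row) for row in rows
    )


# Made-up 784-pixel image: a vertical bar in columns 13 and 14.
fake_image = [1.0 if (i % 28) in (13, 14) else 0.0 for i in range(28 * 28)]
rows = to_rows(fake_image)
print(len(rows), len(rows[0]))  # 28 28
print(to_ascii(rows[:3]))
```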

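Training the model

In machine learning you normally need an evaluation process to find an algorithm that suits the problem. This time we have already decided to use k-means, one of SageMaker's built-in algorithms, so the evaluation process can be skipped.

I don't understand the evaluation process yet, so I will look into it another time...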
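To get a feel for what k-means does with a setting like k=10, here is a textbook version (Lloyd's algorithm) in plain Python on made-up 2-D points. It is only an illustration of the idea. SageMaker's built-in k-means is a more scalable variant and differs in its details.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Two obvious blobs of made-up points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

The cluster numbers are arbitrary labels. Nothing ties cluster 3 to the digit 3, which matters when reading the prediction results later.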

Creating the training job

from sagemaker import KMeans
 
data_location = 's3://{}/kmeans_highlevel_example/data'.format(bucket)
output_location = 's3://{}/kmeans_example/output'.format(bucket)
 
print('training data will be uploaded to: {}'.format(data_location))
print('training artifacts will be uploaded to: {}'.format(output_location))
 
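# Note: train_instance_count / train_instance_type are the SageMaker Python SDK v1
# names. SDK v2 renamed them to instance_count / instance_type.
kmeans = KMeans(role=role,
                train_instance_count=2,               # number of training instances
                train_instance_type='ml.c4.8xlarge',  # instance type used for training
                output_path=output_location,          # where the model artifacts go
                k=10,                                 # number of clusters (one per digit)
                data_location=data_location)          # where the training data is uploaded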

Running the training

%%time
 
kmeans.fit(kmeans.record_set(train_set[0]))

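This runs the model training. Paste the Python code above into the fifth cell and click the Run button. The site I followed said it takes about 15 minutes, but for me it finished in 3. The red text is apparently just warnings. Please don't print warnings in red!!

f:id:masalib:20190622230002j:plain

Once it finished, various files had been saved to S3.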

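Deploying the model

Deploy the model to Amazon SageMaker hosting services.
Doing this creates an endpoint and an endpoint configuration.
It takes about 5 minutes.

With the high-level Python library, a single method called deploy does all of this. Paste the following Python code into the sixth cell and click the Run button.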

%%time
 
kmeans_predictor = kmeans.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')
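Since the end goal is a web app, the endpoint will eventually be called from outside the notebook. Below is a rough sketch of how that might look with boto3's `sagemaker-runtime` client. The endpoint name is a placeholder, and the response shape with `predictions`, `closest_cluster` and `distance_to_cluster` is my assumption about the JSON output of the built-in k-means, so check it against a real response. The boto3 call is wrapped in a function and not executed here.

```python
import json


def to_csv_payload(images):
    """Serialize rows of pixel values as CSV, one image per line."""
    return "\n".join(",".join(str(float(p)) for p in image) for image in images)


def parse_predictions(body_text):
    """Extract (cluster, distance) pairs from a JSON response body.

    Assumes a body shaped like
    {"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 5.5}]}.
    """
    predictions = json.loads(body_text)["predictions"]
    return [(int(p["closest_cluster"]), p["distance_to_cluster"]) for p in predictions]


def predict_clusters(endpoint_name, images):
    """Call the endpoint. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the helpers above work without it

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,  # placeholder: use the name shown in the console
        ContentType="text/csv",
        Accept="application/json",
        Body=to_csv_payload(images),
    )
    return parse_predictions(response["Body"].read().decode("utf-8"))


# Offline check of the two helpers with made-up data.
print(to_csv_payload([[0, 0.5], [1, 0]]))
sample = '{"predictions": [{"closest_cluster": 2.0, "distance_to_cluster": 5.5}]}'
print(parse_predictions(sample))  # [(2, 5.5)]
```

Keeping the serialization and parsing separate from the network call makes both halves easy to test without a running endpoint.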

Validating the model

result = kmeans_predictor.predict(valid_set[0][30:31])
print(result)

[label {
  key: "closest_cluster"
  value {
    float32_tensor {
      values: 2.0
    }
  }
}
label {
  key: "distance_to_cluster"
  value {
    float32_tensor {
      values: 5.5268049240112305
    }
  }
}
]
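The two fields in this output are easier to read with the k-means idea in mind: `closest_cluster` is the index of the nearest cluster center, and `distance_to_cluster` is how far the image is from it. Here is a small sketch with made-up 2-D centers. I am assuming Euclidean distance, and I have not verified that this is exactly what SageMaker reports.

```python
import math


def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [
        math.sqrt(sum((x - y) ** 2 for x, y in zip(point, c))) for c in centroids
    ]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


# Made-up cluster centers and query point.
centroids = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
print(nearest_centroid((3.0, 0.0), centroids))  # (0, 3.0)
```

Note that a `closest_cluster` of 2.0 does not mean the image shows the digit 2. Cluster numbers are arbitrary, so you have to look at which digits end up in each cluster.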

Next, get the inference results for the first 100 items of the valid_set dataset.
Paste the following Python code into the eighth and ninth cells and click the Run button on each in order.

%%time 
 
result = kmeans_predictor.predict(valid_set[0][0:100])
clusters = [r.label['closest_cluster'].float32_tensor.values[0] for r in result]

for cluster in range(10):
    print('\n\n\nCluster {}:'.format(int(cluster)))
    # Collect the images that were assigned to this cluster
    digits = [ img for l, img in zip(clusters, valid_set[0]) if int(l) == cluster ]
    # Rows needed at 5 images per row; keep at least 1 row so that
    # plt.subplots does not fail when a cluster has no images
    height = max(1, ((len(digits)-1)//5) + 1)
    width = 5
    plt.rcParams["figure.figsize"] = (width,height)
    _, subplots = plt.subplots(height, width)
    subplots = numpy.ndarray.flatten(subplots)
    for subplot, image in zip(subplots, digits):
        show_digit(image, subplot=subplot)
    # Hide the unused cells in the last row
    for subplot in subplots[len(digits):]:
        subplot.axis('off')
 
    plt.show()
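The loop above mixes two small pieces of logic: grouping samples by predicted cluster, and working out how many rows of five a cluster needs. Here they are on their own, as a sketch with invented helper names and made-up predictions. The `max(1, ...)` guard covers a cluster that received no images.

```python
from collections import defaultdict


def group_by_cluster(cluster_ids):
    """Map each cluster id to the list of sample indices assigned to it."""
    groups = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        groups[int(cluster)].append(index)
    return dict(groups)


def grid_rows(count, width=5):
    """Rows needed to show `count` images, `width` per row (at least one)."""
    return max(1, -(-count // width))  # ceiling division


clusters = [2.0, 0.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0]  # made-up predictions
groups = group_by_cluster(clusters)
print(groups)  # {2: [0, 2, 4, 5, 6, 7], 0: [1], 1: [3]}
print({c: grid_rows(len(ix)) for c, ix in groups.items()})  # {2: 2, 0: 1, 1: 1}
```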