Trying Out struct2depth (Depth Estimation) - A Hardware Engineer's Skill-Up Diary

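For the first time in a while, I ran a deep learning network published on GitHub. This time it is struct2depth, which lives in the TensorFlow models repository. It is probably trivial if you are used to this kind of thing, but I stumbled on my first attempt, so I am recording the steps here.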

What is struct2depth?

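struct2depth is a method developed by Google Brain that estimates depth and ego-motion (the motion of the camera itself) from a monocular camera. Its distinguishing feature is that the depth estimator can be trained without ground-truth distance data, which is hard to obtain.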

Project site

https://sites.google.com/view/struct2depth

Environment

　●　OS : Windows 10 Home (64bit)

　●　Python 3.5

　●　Anaconda 4.2.0

　●　Tensorflow 1.12.0

Procedure

① Clone the repository from GitHub

https://github.com/tensorflow/models
struct2depth is located at research/struct2depth in the repository tree.

② Create folders named input, output, and model inside the struct2depth folder

f:id:masashi_k:20190925230058p:plain
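
If you prefer to script this step rather than create the folders by hand, a minimal sketch could look like the following. The helper name make_work_dirs is hypothetical and not part of struct2depth; it only assumes that you pass the path of your local struct2depth folder.

```python
import os


def make_work_dirs(base_dir, names=("input", "output", "model")):
    """Create the working folders under base_dir and return their paths.

    make_work_dirs is an illustrative helper, not part of struct2depth.
    """
    created = []
    for name in names:
        path = os.path.join(base_dir, name)
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Assumes the script is run from inside research/struct2depth
    for path in make_work_dirs("."):
        print(path)
```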

③ Download the pretrained models (ckpt files) and store them in the model folder

Model trained on KITTI
Model trained on Cityscapes

Note: KITTI and Cityscapes are datasets that contain images captured by vehicle-mounted cameras together with distance information.

④ Put the images you want to run inference on into the input folder

As a test, I run inference on a subset of the KITTI image data.
The KITTI dataset can be downloaded from the KITTI dataset website.
This time I use 2011_09_26_drive_0002.
Copy the images in Image_03 into the input folder.

f:id:masashi_k:20190925231114p:plain
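
The copy step can also be scripted. The sketch below copies every file with a given extension from a source folder into the input folder. The function name copy_images and the example paths are assumptions made for illustration; adjust them to wherever you unpacked the KITTI data.

```python
import shutil
from pathlib import Path


def copy_images(src_dir, dst_dir, extension="png"):
    """Copy all *.extension files from src_dir into dst_dir.

    Returns the copied file names in sorted order.
    copy_images is an illustrative helper, not part of struct2depth.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(src_dir).glob("*." + extension)):
        # str() keeps this working on Python 3.5, where shutil
        # does not yet accept Path objects
        shutil.copy2(str(src), str(dst / src.name))
        copied.append(src.name)
    return copied


if __name__ == "__main__":
    # Hypothetical source path; replace it with your own KITTI location
    names = copy_images("kitti/Image_03", "input")
    print("copied %d images" % len(names))
```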

⑤ Modify util.py

If I run inference as-is, the following error occurs in my environment while loading the image files.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

Modify util.py as follows.

def load_image(img_file, resize=None, interpolation='linear'):
  """Load image from disk. Output value range: [0,1]."""
  # im_data = np.fromstring(gfile.Open(img_file).read(), np.uint8)  # commented out
  # im = cv2.imdecode(im_data, cv2.IMREAD_COLOR)  # commented out
  im = cv2.imread(img_file)  # added
  im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
  if resize and resize != im.shape[:2]:
    ip = cv2.INTER_LINEAR if interpolation == 'linear' else cv2.INTER_NEAREST
    im = cv2.resize(im, resize, interpolation=ip)
  return np.array(im, dtype=np.float32) / 255.0
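
As for why the error appears: 0x89 is the first byte of the PNG file signature, so the message suggests that the PNG file is being read as UTF-8 text rather than as raw bytes. My assumption is that gfile.Open opens the file in text mode by default under Python 3. The sketch below reproduces the same symptom with Python's built-in open only; the helper names are hypothetical and it does not call TensorFlow at all.

```python
import os
import tempfile

# The 8-byte signature that every PNG file starts with
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_as_text(path):
    """Read a file in text mode, decoding its contents as UTF-8."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def read_as_bytes(path):
    """Read a file in binary mode, with no decoding."""
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".png")
    with os.fdopen(fd, "wb") as f:
        f.write(PNG_SIGNATURE)
    try:
        read_as_text(path)
    except UnicodeDecodeError as err:
        print("text mode failed:", err)
    print("binary mode read", len(read_as_bytes(path)), "bytes")
    os.remove(path)
```

If that assumption is right, passing a binary mode to gfile.Open might be an alternative to switching to cv2.imread, but I have not verified this in the environment above.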

⑥ Launch Anaconda

⑦ Run inference with the following command

python inference.py --logtostderr --file_extension png --depth --egomotion true --input_dir input --output_dir output --model_ckpt model/KITTI/model-199160

The inference results (depth images and egomotion) are written to the output folder.

Below is an example of the results.
You can see that the person riding a bicycle on the right and the utility pole are estimated correctly.

f:id:masashi_k:20190925233210p:plain

Inference from camera input

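Up to this point everything is nearly the same as the information on GitHub, but I also tried real-time depth estimation using images captured with a USB camera.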

Modify the function _run_inference in inference.py as shown below and save the result as a new Python script named inference_webcam.py.

import cv2  # added

##############
# (omitted)
##############

  with sv.managed_session() as sess:
    saver.restore(sess, model_ckpt)
    if not gfile.Exists(output_dir):
      gfile.MakeDirs(output_dir)
    logging.info('Predictions will be saved in %s.', output_dir)

    #input camera image
    video_capture = cv2.VideoCapture(0)
    fourcc = cv2.VideoWriter_fourcc(*'MP4V')
    out = cv2.VideoWriter(output_dir + '/' + 'webcam.mp4', fourcc, 25.0, (416, 128))


    # Run depth prediction network.
    while True:
      if depth:
        im_batch = []

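        ret, im = video_capture.read()
        if not ret:  # stop if no frame could be read from the camera
          break
        im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
        # Crop the VGA frame so it matches the 416x128 aspect ratio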
        ymin, ymax, xmin, xmax = [142, 339, 0, 640]
        im = im[ymin:ymax, xmin:xmax]
        im = cv2.resize(im, (416,128))
        im = np.array(im, dtype=np.float32) / 255.0

        im_batch.append(im)

        #im_batch.append(np.zeros(shape=(img_height, img_width, 3), dtype=np.float32))
        im_batch = np.stack(im_batch, axis=0)
        est_depth = inference_model.inference_depth(im_batch, sess)

        if flip_for_depth:
          est_depth = np.flip(est_depth, axis=2)
          im_batch = np.flip(im_batch, axis=2)

        color_map = util.normalize_depth_for_display(np.squeeze(est_depth))
        color_map = (color_map * 255.0).astype(np.uint8)
        color_map = cv2.cvtColor(color_map, cv2.COLOR_RGB2BGR)

        out.write(color_map)
        cv2.imshow('video', color_map)
        im_batch = []

      if cv2.waitKey(1) & 0xFF == ord('q'):
        break
    video_capture.release()
    out.release()
  cv2.destroyAllWindows()

The source code preprocesses the input image.
Because the network's input resolution is 416x128, the VGA image from the camera is cropped and then resized to produce a 416x128 image.
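
The crop bounds in the script are hard-coded for a 640x480 frame. For cameras with other resolutions, one possible way to derive them is sketched below: take the largest centered region that has the same aspect ratio as the network input. The function center_crop_box is a hypothetical helper, not part of struct2depth.

```python
def center_crop_box(frame_w, frame_h, target_w=416, target_h=128):
    """Return (ymin, ymax, xmin, xmax) of the largest centered crop
    whose aspect ratio matches target_w:target_h.

    center_crop_box is an illustrative helper, not part of struct2depth.
    """
    target_aspect = target_w / target_h
    if frame_w / frame_h >= target_aspect:
        # Frame is relatively wider than the target: keep the full
        # height and trim the width
        crop_h = frame_h
        crop_w = min(frame_w, round(frame_h * target_aspect))
    else:
        # Frame is relatively taller than the target: keep the full
        # width and trim the height
        crop_w = frame_w
        crop_h = min(frame_h, round(frame_w / target_aspect))
    xmin = (frame_w - crop_w) // 2
    ymin = (frame_h - crop_h) // 2
    return ymin, ymin + crop_h, xmin, xmin + crop_w


if __name__ == "__main__":
    ymin, ymax, xmin, xmax = center_crop_box(640, 480)
    print(ymin, ymax, xmin, xmax)
```

For a 640x480 frame this gives a 197-row band across the full width, the same height as the hard-coded 142:339 slice and shifted up by one pixel. The returned values can be used in the same way as the script does, as in im[ymin:ymax, xmin:xmax], before resizing to 416x128.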

Run inference with the following command.

python inference_webcam.py --logtostderr --file_extension png --depth --egomotion false --output_dir output --model_ckpt model/KITTI/model-199160

The image below shows the result when the input is video from the camera built into my laptop.

f:id:masashi_k:20190925235756p:plain

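I am holding my hand over the area outlined in red, but the result there is hard to make sense of. Also, there is nothing in the yellowish area on the right, yet that area is estimated to be close.
Video from a vehicle-mounted camera and images from a laptop camera are probably too different for the trained model to be usable as-is. Retraining for the intended use appears to be necessary.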

Summary

I have summarized how to run struct2depth, a network developed by Google, and how to run inference on video captured with a USB camera. I hope it is useful to someone.

Reference site

[DL Reading Group] Depth Prediction Without the Sensors: Leveraging Structure for…