TensorFlow Object Detection API transfer learning causes OOM

Asked 1 year ago, Updated 1 year ago, 47 views

I'm a beginner in machine learning.
I would like to use the TensorFlow Object Detection API to do transfer learning on my own data.


For my own dataset, I took about 100 800x600 images with a camera and annotated them with "labelImg".

I converted the dataset to the TFRecord format that TensorFlow uses, which produced:
For training: test_train.record (9MB)
For validation: test_val.record (3.7MB)
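
For reference, below is a minimal sketch of what that conversion involves in TF 1.x: serializing one annotated image into a TFRecord. The helper name create_tf_example, the file names, and the box/label values are hypothetical, but the feature keys are the ones the Object Detection API's TFRecord decoder expects:

import tensorflow as tf  # TF 1.x, matching the run environment below

def create_tf_example(encoded_jpeg, width, height, boxes, labels, class_texts):
    """boxes: (xmin, ymin, xmax, ymax) in pixels; labels: 1-based class ids."""
    def bytes_list(v):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=v))
    def float_list(v):
        return tf.train.Feature(float_list=tf.train.FloatList(value=v))
    def int64_list(v):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=v))

    feature = {
        'image/encoded': bytes_list([encoded_jpeg]),
        'image/format': bytes_list([b'jpeg']),
        'image/width': int64_list([width]),
        'image/height': int64_list([height]),
        # The API expects box coordinates normalized to [0, 1].
        'image/object/bbox/xmin': float_list([b[0] / width for b in boxes]),
        'image/object/bbox/ymin': float_list([b[1] / height for b in boxes]),
        'image/object/bbox/xmax': float_list([b[2] / width for b in boxes]),
        'image/object/bbox/ymax': float_list([b[3] / height for b in boxes]),
        'image/object/class/text': bytes_list(class_texts),
        'image/object/class/label': int64_list(labels),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Hypothetical usage with one annotated 800x600 image:
with tf.python_io.TFRecordWriter('test_train.record') as writer:
    with open('img_0001.jpg', 'rb') as f:
        example = create_tf_example(f.read(), 800, 600,
                                    boxes=[(120, 80, 340, 260)],
                                    labels=[1], class_texts=[b'class_a'])
    writer.write(example.SerializeToString())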

For the checkpoint to start training from, I downloaded the pre-trained "ssd_resnet_50_fpn_coco" model from the Tensorflow detection model zoo and used it.

I started training with the following command:

PIPELINE_CONFIG_PATH=foo/test/ssd_resnet50_v1_fpn_shared_box_predictor_640_coco14_sync.config
MODEL_DIR=/foo/test
NUM_TRAIN_STEPS=30000
NUM_EVAL_STEPS=2000
time python3 object_detection/model_main.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --num_train_steps=${NUM_TRAIN_STEPS} \
    --num_eval_steps=${NUM_EVAL_STEPS} \
    --alsologtostderr

Training then failed to start, with the following error message printed:

2019-02-01 15:11:58.709085: W tensorflow/core/common_runtime/bfc_allocator.cc:279] **********************************************************************__**_____***********____
2019-02-01 15:11:58.709106: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:386 : Resource exhausted: OOM when allocating tensor with shape[64,256,160,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,256,160,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node:FeatureExtractor/resnet_v1_50/resnet_v1_50/block1/unit_1/bottleneck_v1/shortcut/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/resnet_v1_50/resnet_v1_50/pool1/MaxPool, FeatureExtractor/resnet_v1_50/block1/unit_1/bottleneck_v1/shortcut/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: total_loss/_4689 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_22359_total_loss", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
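
On the error's own hint: model_main.py does not expose a flag for report_tensor_allocations_upon_oom, but in a hand-written TF 1.x session it could be enabled like this (a minimal sketch, with a dummy op standing in for the real training step):

import tensorflow as tf  # TF 1.x

# Ask the runtime to list live tensor allocations if an OOM occurs,
# as the "Hint:" lines in the error message suggest.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Dummy op standing in for a real training step.
x = tf.random_normal([4, 256, 160, 160])
y = tf.reduce_sum(x)

with tf.Session() as sess:
    print(sess.run(y, options=run_options))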

The config file used at run time is as follows:

# SSD with Resnet 50 v1 FPN feature extractor, shared box predictor and focal
# loss (a.k.a. RetinaNet).

# See Lin et al., https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from an Imagenet classification checkpoint

# Achieves 35.2 mAP on the COCO14 minival dataset. Doubling the number of
# training steps to 50k gets 36.9 mAP

# This config is TPU compatible

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
#    num_classes: 90
    num_classes: 2  # Changed the number of classes
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 2
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 640
        width: 640
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 256
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.0004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true,
            decay: 0.997,
            epsilon: 0.001,
          }
        }
        num_layers_before_predictor: 4
        kernel_size: 3
      }
    }
    feature_extractor {
      type: 'ssd_resnet50_v1_fpn'
      fpn {
        min_level: 3
        max_level: 7
      }
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.0004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
#  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  fine_tune_checkpoint: "/foo/test/ssd_resnet50_v1_fpn_shared_box_predictor_640_coco14_sync_2018_07_03/model.ckpt"  # Changed checkpoint
  batch_size: 64
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 25000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .04
          total_steps: 25000
          warmup_learning_rate: .013333
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  tf_record_input_reader {
#    input_path: "PATH_TO_BE_CONFIGURED/mscoco_train.record-00000-of-00100"
    input_path: "/foo/test/test_train.record"  # Changed to my own training data
  }
#  label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
  label_map_path: "/foo/test/test_label_map.pbtxt"  # Changed to my own classes (2 classes)
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_examples: 8000
}

eval_input_reader: {
  tf_record_input_reader {
#    input_path: "PATH_TO_BE_CONFIGURED/mscoco_val.record-00000-of-00010"
    input_path: "/foo/test/test_val.record"  # Changed to my own validation data
  }
#  label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
  label_map_path: "/foo/test/test_label_map.pbtxt"  # Changed to my own classes (2 classes)
  shuffle: false
  num_readers: 1
}
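
For reference, a two-class test_label_map.pbtxt would look something like the following (the class names here are placeholders; the question does not state the real ones):

item {
  id: 1
  name: 'class_a'
}
item {
  id: 2
  name: 'class_b'
}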

"Reduce batch size"
in a similar OOM question. It was pointed out that
by referring to it.
The "batch_size" in the config file above is like 64→32→16→... Now learning starts when you reduce it to .

train_config: {
#  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  fine_tune_checkpoint: "/foo/test/ssd_resnet50_v1_fpn_shared_box_predictor_640_coco14_sync_2018_07_03/model.ckpt"  # Changed checkpoint
#  batch_size: 64
  batch_size: 4  # Changed the size to 4
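
Rough arithmetic on the failing allocation shows why this helps: activation memory scales linearly with batch size. The shape [64,256,160,160] in the error is a single float32 activation, and a GTX 1070 Ti has 8 GB of VRAM:

# Size of the one activation tensor that failed to allocate (float32 = 4 bytes).
def activation_gib(batch):
    return batch * 256 * 160 * 160 * 4 / 2**30

print(activation_gib(64))  # ~1.56 GiB -- one of many such tensors held during training
print(activation_gib(4))   # ~0.10 GiB -- the same tensor at batch_size 4 is 16x smaller

The reference config's batch_size: 64, together with sync_replicas: true and replicas_to_aggregate: 8, appears to assume training synchronized across multiple accelerators, which would be consistent with batch 64 not fitting on a single 8 GB GPU.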

Is it impossible to train this pre-trained model in my current environment without changing these settings?

I would like to run it with "batch_size: 64" if at all possible.
Would adding more GPUs make that feasible?
If you have any knowledge about this, could you please let me know?
Thank you in advance.

Run Environment
- OS: Ubuntu 18.04.1 LTS
- Memory: 31.3 GiB
- Processor: Intel® Core™ i7-8700 CPU @ 3.20GHz × 12
- GPU: GeForce GTX 1070 Ti/PCIe/SSE2

python tensorflow

2022-09-29 22:22

1 Answer

By writing a value to oom_score_adj, it seems that certain processes can be excluded from the Linux OOM killer. (Note that this protects the process from the kernel killing it over host RAM; it does not increase the GPU memory that the error above ran out of.)

TIPS: Exclude certain processes from the OOM Killer

 echo -1000 > /proc/<Process ID>/oom_score_adj
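
If you prefer to do this from Python inside the training script, the same write could look like this (a sketch; writing -1000 requires root privileges):

import os

# Lower this process's badness score so the kernel OOM killer avoids it.
# This guards host RAM only, not GPU memory.
with open('/proc/%d/oom_score_adj' % os.getpid(), 'w') as f:
    f.write('-1000')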


2022-09-29 22:22
