I'm a beginner in machine learning.
I would like to use the TensorFlow Object Detection API to do transfer learning on my own data.
As my own dataset, I took about 100 images (800x600) with a camera and annotated them with "labelImg".
I then converted the dataset to the TFRecord format used by TensorFlow:
For training: test_train.record (9 MB)
For validation: test_val.record (3.7 MB)
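(For illustration only: the conversion followed the usual create_tf_record pattern, roughly as sketched below. This is not my actual script; the feature keys are the Object Detection API's standard ones, and the function name and values are placeholders.)
# Simplified sketch: one labelImg-annotated image as a tf.train.Example (TF 1.x).
# The labelImg XML parsing is omitted; boxes are pixel coordinates from the XML.
import tensorflow as tf

def make_example(jpeg_bytes, width, height, boxes, class_names, class_ids):
    xmins = [b[0] / width for b in boxes]    # normalize box corners to [0, 1]
    ymins = [b[1] / height for b in boxes]
    xmaxs = [b[2] / width for b in boxes]
    ymaxs = [b[3] / height for b in boxes]
    feature = {
        'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
        'image/format': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'jpeg'])),
        'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        'image/object/bbox/xmin': tf.train.Feature(float_list=tf.train.FloatList(value=xmins)),
        'image/object/bbox/ymin': tf.train.Feature(float_list=tf.train.FloatList(value=ymins)),
        'image/object/bbox/xmax': tf.train.Feature(float_list=tf.train.FloatList(value=xmaxs)),
        'image/object/bbox/ymax': tf.train.Feature(float_list=tf.train.FloatList(value=ymaxs)),
        'image/object/class/text': tf.train.Feature(bytes_list=tf.train.BytesList(value=class_names)),
        'image/object/class/label': tf.train.Feature(int64_list=tf.train.Int64List(value=class_ids)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# writer = tf.python_io.TFRecordWriter('/foo/test/test_train.record')
# writer.write(make_example(...).SerializeToString())
# writer.close()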
For the checkpoint to fine-tune from, I downloaded the pretrained "ssd_resnet_50_fpn_coco" model from the TensorFlow detection model zoo.
I started training with the following command:
PIPELINE_CONFIG_PATH=/foo/test/ssd_resnet50_v1_fpn_shared_box_predictor_640_coco14_sync.config
MODEL_DIR=/foo/test
NUM_TRAIN_STEPS=30000
NUM_EVAL_STEPS=2000
time python3 object_detection/model_main.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --num_train_steps=${NUM_TRAIN_STEPS} \
    --num_eval_steps=${NUM_EVAL_STEPS} \
    --alsologtostderr
After that, the following error message was printed and training could not start:
2019-02-01 15:11:58.709085: W tensorflow/core/common_runtime/bfc_allocator.cc:279] **********************************************************************__**_____***********____
2019-02-01 15:11:58.709106: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:386 : Resource exhausted: OOM when allocating tensor with shape[64,256,160,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,256,160,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node:FeatureExtractor/resnet_v1_50/resnet_v1_50/block1/unit_1/bottleneck_v1/shortcut/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/resnet_v1_50/resnet_v1_50/pool1/MaxPool, FeatureExtractor/resnet_v1_50/block1/unit_1/bottleneck_v1/shortcut/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: total_loss/_4689 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_22359_total_loss", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
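(Back-of-the-envelope arithmetic on my part, not from the log itself: a single float32 tensor with the reported shape [64, 256, 160, 160] already needs roughly 1.6 GB, so at batch size 64 the activations alone quickly exhaust a single consumer GPU.)
# Size of the tensor reported in the OOM message:
# shape [64, 256, 160, 160], dtype float32 (4 bytes per element).
batch, channels, height, width = 64, 256, 160, 160
tensor_bytes = batch * channels * height * width * 4
print(tensor_bytes / 2**30, "GiB")  # ~1.56 GiB for this one activation tensor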
The config file used for this run is as follows:
# SSD with Resnet 50 v1 FPN feature extractor, shared box predictor and focal
# loss (a.k.a Retinanet).
# See Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from Imagenet classification checkpoint
# Achieves 35.2 mAP on COCO 14 minival dataset. Doubling the number of training
# steps to 50k gets 36.9 mAP
# This config is TPU compatible
model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    # num_classes: 90
    num_classes: 2  # changed the number of classes to 2
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 2
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 640
        width: 640
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 256
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.0004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true,
            decay: 0.997,
            epsilon: 0.001,
          }
        }
        num_layers_before_predictor: 4
        kernel_size: 3
      }
    }
    feature_extractor {
      type: 'ssd_resnet50_v1_fpn'
      fpn {
        min_level: 3
        max_level: 7
      }
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.0004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}
train_config: {
  # fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  fine_tune_checkpoint: "/foo/test/ssd_resnet50_v1_fpn_shared_box_predictor_640_coco14_sync_2018_07_03/model.ckpt"  # changed the checkpoint
  batch_size: 64
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 25000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .04
          total_steps: 25000
          warmup_learning_rate: .013333
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}
train_input_reader: {
  tf_record_input_reader {
    # input_path: "PATH_TO_BE_CONFIGURED/mscoco_train.record-00000-of-00100"
    input_path: "/foo/test/test_train.record"  # changed to my own training data
  }
  # label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
  label_map_path: "/foo/test/test_label_map.pbtxt"  # changed to my own label map (2 classes)
}
eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_examples: 8000
}
eval_input_reader: {
  tf_record_input_reader {
    # input_path: "PATH_TO_BE_CONFIGURED/mscoco_val.record-00000-of-00010"
    input_path: "/foo/test/test_val.record"  # changed to my own validation data
  }
  # label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
  label_map_path: "/foo/test/test_label_map.pbtxt"  # changed to my own label map (2 classes)
  shuffle: false
  num_readers: 1
}
"Reduce batch size"
in a similar OOM question.
It was pointed out that
by referring to it.
The "batch_size" in the config file above is like 64→32→16→...
Now learning starts when you reduce it to .
train_config: {
  # fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  fine_tune_checkpoint: "/foo/test/ssd_resnet50_v1_fpn_shared_box_predictor_640_coco14_sync_2018_07_03/model.ckpt"  # changed the checkpoint
  # batch_size: 64
  batch_size: 4  # changed the batch size to 4
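(My own rough estimate, ignoring weights, gradients and the other activations: activation memory grows roughly linearly with the batch size, which is why shrinking batch_size lets training start.)
# How the OOM'd tensor alone scales with the batch dimension (float32, 4 bytes).
bytes_per_sample = 256 * 160 * 160 * 4  # one sample's slice of that tensor

for batch_size in (64, 32, 16, 8, 4):
    gib = batch_size * bytes_per_sample / 2**30
    print(f"batch_size={batch_size:2d}: ~{gib:.2f} GiB for this tensor alone")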
Is it impossible to train this pretrained model in my current environment without changing these settings?
I would like to keep "batch_size: 64" if at all possible.
Would adding more GPUs make that feasible?
If anyone has experience with this, please let me know.
Thank you in advance.
Run Environment
- OS:Ubuntu 18.04.1 LTS
- Memory: 31.3 GiB
- Processor: Intel® Core™ i7-8700 CPU @ 3.20 GHz × 12
- GPU: GeForce GTX 1070 Ti/PCIe/SSE2
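(For completeness: how much GPU memory TensorFlow actually sees can be checked from the local device list; a minimal TF 1.x sketch.)
# Lists local devices, including each GPU's memory_limit in bytes (TF 1.x).
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    if device.device_type == "GPU":
        print(device.name, device.physical_device_desc,
              "memory_limit=%.2f GiB" % (device.memory_limit / 2**30))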