[Pytorch] "Loss is nan, stopping training" error during object detection with torchvision

HibisCircus 2023. 6. 19. 18:57

Situation

While training a model with torchvision's object detection, I ran into the following error.

Error

Loss is nan, stopping training
{'loss_classifier': tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>), 'loss_box_reg': tensor(nan, device='cuda:0', grad_fn=<DivBackward0>), 'loss_objectness': tensor(nan, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'loss_rpn_box_reg': tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)}
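
This message appears to match the finite-loss check in the training loop of torchvision's detection reference scripts (`references/detection/engine.py`), which sums the loss components and exits when the sum is not finite. Below is a minimal sketch of that kind of check. It uses plain floats instead of tensors, and the function names and sample values are made up for illustration.

```python
import math


def non_finite_losses(loss_dict):
    """Return the names of loss components that are nan or inf."""
    return [name for name, value in loss_dict.items() if not math.isfinite(value)]


def should_stop(loss_dict):
    """Mimic the guard: stop when the summed loss is not finite."""
    total = sum(loss_dict.values())
    return not math.isfinite(total)


# Hypothetical values; in real training each entry is a tensor,
# so it would be converted with .item() first.
example_losses = {
    "loss_classifier": float("nan"),
    "loss_box_reg": 0.12,
    "loss_objectness": 0.03,
    "loss_rpn_box_reg": float("nan"),
}

bad_components = non_finite_losses(example_losses)
stop_training = should_stop(example_losses)
print(bad_components, stop_training)
```

Logging which components turn non-finite first, one step before the crash, can help narrow down where the values start to blow up.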

Cause

My guess is that samples without any bounding boxes produce a large loss, and that with a large learning rate the values blow up as they pass through the loss function and end up as nan.
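
One way to test this guess is to count how many training samples have no bounding boxes at all. The sketch below assumes each target is a dict with a "boxes" entry, as in torchvision detection datasets. The helper name and the sample data are hypothetical.

```python
def find_empty_targets(targets):
    """Return the indices of targets that contain no bounding boxes."""
    empty = []
    for index, target in enumerate(targets):
        boxes = target["boxes"]
        # Tensors expose .tolist(); plain lists pass through unchanged.
        if hasattr(boxes, "tolist"):
            boxes = boxes.tolist()
        if len(boxes) == 0:
            empty.append(index)
    return empty


# Hypothetical targets in [x1, y1, x2, y2] format.
sample_targets = [
    {"boxes": [[10.0, 20.0, 50.0, 80.0]]},
    {"boxes": []},
    {"boxes": [[5.0, 5.0, 40.0, 60.0], [15.0, 25.0, 30.0, 45.0]]},
    {"boxes": []},
]

empty_indices = find_empty_targets(sample_targets)
print(empty_indices)
```

If a large share of the dataset turns up here, that would be consistent with the explanation above, though it does not prove it.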

Solution

Lower the learning rate. In my case, dropping it from 1e-4 to 1e-5 solved the problem.
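
Why a smaller learning rate can help is easy to see on a toy problem. The sketch below is not the detection model: it runs plain gradient descent on a steep one-dimensional quadratic. The curvature is a made-up number, chosen only so that 1e-4 diverges and 1e-5 converges.

```python
import math

CURVATURE = 1.5e4  # hypothetical steepness, chosen only for this illustration


def final_loss(lr, steps=2000, w=1.0):
    """Run gradient descent on loss(w) = CURVATURE * w * w and return the last loss."""
    for _ in range(steps):
        grad = 2.0 * CURVATURE * w
        # With lr = 1e-4 each step roughly multiplies w by -2, so |w| keeps growing.
        # With lr = 1e-5 each step multiplies w by about 0.7, so w shrinks toward 0.
        w = w - lr * grad
    return CURVATURE * w * w


loss_high_lr = final_loss(1e-4)
loss_low_lr = final_loss(1e-5)
print(math.isfinite(loss_high_lr), math.isfinite(loss_low_lr))
```

In a real training script the corresponding change is the `lr` argument passed to the optimizer.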
