A error when compose H with LG #730

Closed
Jarvan-Wang opened this issue Apr 28, 2021 · 17 comments

@Jarvan-Wang
Contributor

My k2 is at commit 1acec6f.

The error is:

[F] /search/odin/wangjiawen/k2/k2/csrc/tensor.cu:147:k2::Tensor::Tensor(k2::Dtype, const k2::Shape&, k2::RegionPtr, int32_t) Check failed: int64_t(impl_->byte_offset) + begin_elem * element_size >= 0 (-1803461652 vs. 0)

The stack trace is roughly as follows:


k2/csrc/tensor.cu
    K2_CHECK_GE(int64_t(impl_->byte_offset) + begin_elem * element_size, 0);
k2/csrc/array.h
    return Tensor(dtype_, shape, region_, byte_offset_ + (ElementSize() * i));
k2/csrc/ragged_ops.cu
    Array1<int32_t> tot_sizes_out = Array1<int32_t>(new_offsets.Col(ans_dim0)).To(GetCpuContext());
k2/csrc/ragged_ops.cu
    return IndexAxis0(src, indexes, elem_indexes);
k2/python/csrc/torch/ragged_ops.cu
    out_fsa.aux_labels = index(b_fsa.aux_labels, b_arc_map)
my_own_code.py
    # I have checked that H.arcs_as_tensor()[:,2].max() == LG.arcs_as_tensor()[:,2].max()
    HLG = k2.compose(H, LG, inner_labels='phones')

Any idea?

@Jarvan-Wang
Contributor Author

It was caused by running out of memory.
The output of dmesg:

[24467282.876208] Out of memory: Kill process 31744 (python3) score 275 or sacrifice child
[24467282.884565] Killed process 31744 (python3) total-vm:536841788kB, anon-rss:72284336kB, file-rss:0kB, shmem-rss:234548kB

But why does no high-level code raise an out-of-memory error, instead of the low-level error above?

@danpovey
Collaborator

Why did you close the issue?
It could possibly be a bug; you could run it in gdb and find some of the relevant variables' values. E.g. what is impl_->byte_offset, begin_elem, element_size?

@Jarvan-Wang
Contributor Author

I hit this problem again when the composed graph is large (the number of arcs is close to 1 billion):

k2/k2/csrc/ragged_ops.cu:427
GetOldAndNewOffsets(src, new2old, &old_offsets, &new_offsets);
//*(int32_t *)(new_offsets.region_._M_ptr->data+1856)==-1074111714
Array1<int32_t> tot_sizes_out =
Array1<int32_t>(new_offsets.Col(ans_dim0)).To(GetCpuContext());
//tot_sizes_out == {154, 706314, -1074111714}
if (elem_indexes) *elem_indexes = Array1<int32_t>(c, tot_sizes_out.Back());

Some int32_t entries of new_offsets have overflowed after GetOldAndNewOffsets:

(gdb) parray2 new_offsets
$141 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
  37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
  73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106,
  107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135,
  136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154}
$142 = {0, 20, 9166, 9206, 18352, 18408, 27554, 27582, 36728, 36756, 45902, 45930, 55076, 55104, 64250, 64278, 73424, 73444, 82590, 82622, 91768,
  91796, 100942, 100966, 110112, 110128, 119274, 119318, 128464, 128484, 137630, 137670, 146816, 146848, 155994, 156026, 165172, 165192, 174338,
  174370, 183516, 183532, 192678, 192714, 201860, 201880, 211026, 211062, 220208, 220232, 229378, 229406, 238552, 238576, 247722, 247746, 256892,
  256924, 266070, 266106, 275252, 275292, 284438, 284454, 293600, 293624, 302770, 302794, 311940, 311956, 321102, 321126, 330272, 330288, 339434,
  339474, 348620, 348640, 357786, 357802, 366948, 366964, 376110, 376162, 385308, 385324, 394470, 394506, 403652, 403668, 412814, 412830, 421976,
  421992, 431138, 431158, 440304, 440336, 449482, 449502, 458648, 458664, 467810, 467838, 476984, 477012, 486158, 486186, 495332, 495364, 504510,
  504526, 513672, 513696, 522842, 522866, 532012, 532036, 541182, 541202, 550348, 550384, 559530, 559574, 568720, 568740, 577886, 577930, 587076,
  587116, 596262, 596286, 605432, 605452, 614598, 614626, 623772, 623792, 632938, 632954, 642100, 642136, 651282, 651310, 660456, 660476, 669622,
  669658, 678804, 678824, 687970, 687998, 697144, 697168, 706314}
$143 = {0, 47, 41829276, 41829373, 83658602, 83658739, 125487968, 125488035, 167317264, 167317331, 209146560, 209146627, 250975856, 250975923,
  292805152, 292805219, 334634448, 334634495, 376463724, 376463801, 418293030, 418293097, 460122326, 460122383, 501951612, 501951649, 543780878,
  543780985, 585610214, 585610261, 627439490, 627439587, 669268816, 669268893, 711098122, 711098199, 752927428, 752927475, 794756704, 794756781,
  836586010, 836586047, 878415276, 878415363, 920244592, 920244639, 962073868, 962073955, 1003903184, 1003903241, 1045732470, 1045732537,
  1087561766, 1087561823, 1129391052, 1129391109, 1171220338, 1171220415, 1213049644, 1213049731, 1254878960, 1254879057, 1296708286, 1296708323,
  1338537552, 1338537609, 1380366838, 1380366895, 1422196124, 1422196161, 1464025390, 1464025447, 1505854676, 1505854713, 1547683942, 1547684039,
  1589513268, 1589513315, 1631342544, 1631342581, 1673171810, 1673171847, 1715001076, 1715001203, 1756830432, 1756830469, 1798659698, 1798659785,
  1840489014, 1840489051, 1882318280, 1882318317, 1924147546, 1924147583, 1965976812, 1965976859, 2007806088, 2007806165, 2049635394, 2049635441,
  2091464670, 2091464707, 2133293936, 2133294003, -2119844064, -2119843997, -2078014768, -2078014701, -2036185472, -2036185395, -1994356166,
  -1994356129, -1952526900, -1952526843, -1910697614, -1910697557, -1868868328, -1868868271, -1827039042, -1827038995, -1785209766, -1785209679,
  -1743380450, -1743380343, -1701551114, -1701551067, -1659721838, -1659721731, -1617892502, -1617892405, -1576063176, -1576063119, -1534233890,
  -1534233843, -1492404614, -1492404547, -1450575318, -1450575271, -1408746042, -1408746005, -1366916776, -1366916689, -1325087460, -1325087393,
  -1283258164, -1283258117, -1241428888, -1241428801, -1199599572, -1199599525, -1157770296, -1157770229, -1115941000, -1115940943, -1074111714}
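The first negative entry in the last row of the dump can be reproduced with plain arithmetic. The sketch below emulates an int32_t running sum in Python; `wrap_int32` is a hypothetical helper written for this illustration, not a k2 function, and the numbers are copied from the dump above.

```python
INT32_MAX = 2**31 - 1

def wrap_int32(x: int) -> int:
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    return (x + 2**31) % 2**32 - 2**31

# Values copied from the third row of the gdb dump of new_offsets.
last_good = 2133294003        # last entry that still fits in int32_t
step = 41829276 - 47          # size of the large item (entry 2 minus entry 1)

true_next = last_good + step             # what a 64-bit running sum would hold
wrapped_next = wrap_int32(true_next)     # what a 32-bit running sum ends up with

print(true_next, true_next > INT32_MAX)
print(wrapped_next)
```

The wrapped value matches the entry that follows 2133294003 in the dump, which suggests the running sum itself exceeded INT_MAX rather than being corrupted elsewhere.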

It also happens at

k2/csrc/array.h
// some code in this header file calculates byte_offset, e.g.
return Tensor(dtype_, shape, region_, byte_offset_ + (ElementSize() * i));
// i is bigger than (INT_MAX / 4),
// but this can be solved by casting i to int64_t before the multiplication

I think some int32_t values should be widened to int64_t or size_t,
but when I tried this, I found that some moderngpu APIs support int32_t only.
Any ideas?
@danpovey

@Jarvan-Wang Jarvan-Wang reopened this May 12, 2021
@danpovey
Collaborator

danpovey commented May 12, 2021 via email

@Jarvan-Wang
Contributor Author

Jarvan-Wang commented May 12, 2021

Phone-level model, when decoding:

[F] /search/odin/wangjiawen/k2/k2/csrc/tensor.cu:147:k2::Tensor::Tensor(k2::Dtype, const k2::Shape&, k2::RegionPtr, int32_t) Check failed: int64_t(impl_->byte_offset) + begin_elem * element_size >= 0 (-1803461652 vs. 0)


[ Stack-Trace: ]
/search/odin/wangjiawen/k2/build_1acec6f/lib/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x7fa478c10804]
/search/odin/wangjiawen/k2/build_1acec6f/lib/libk2context.so(k2::Tensor::Tensor(k2::Dtype, k2::Shape const&, std::shared_ptr<k2::Region>, int)+0x7e2) [0x7fa47937ccf2]
/search/odin/wangjiawen/k2/build_1acec6f/lib/libk2context.so(k2::Array2<int>::Col(int)+0x146) [0x7fa479330e76]
/search/odin/wangjiawen/k2/build_1acec6f/lib/libk2context.so(+0x2194c9) [0x7fa4793234c9]
/search/odin/wangjiawen/k2/build_1acec6f/lib/libk2context.so(k2::Index(k2::RaggedShape&, int, k2::Array1<int> const&, k2::Array1<int>*)+0x353) [0x7fa479325ff3]
/search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/lib/_k2.cpython-37m-x86_64-linux-gnu.so(+0xa16a9) [0x7fa47dc056a9]
/search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/lib/_k2.cpython-37m-x86_64-linux-gnu.so(+0x978f5) [0x7fa47dbfb8f5]
/search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/lib/_k2.cpython-37m-x86_64-linux-gnu.so(+0x28995) [0x7fa47db8c995]


Traceback (most recent call last):
  File "/search/speech/wangjiawen/anaconda2/envs/python37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/search/speech/wangjiawen/anaconda2/envs/python37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/espnet2/bin/mmi_asr_inference.py", line 385, in <module>
    main()
  File "/search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/espnet2/bin/mmi_asr_inference.py", line 381, in main
    inference(**kwargs)
  File "/search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/espnet2/bin/mmi_asr_inference.py", line 211, in inference
    aux_labels_disambig_id_start=first_word_disambig_id)
  File "/search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/snowfall/decoding/graph.py", line 72, in compile_HLG
    HLG = k2.compose(H, LG, inner_labels='phones')
  File "/search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/k2/fsa_algo.py", line 369, in compose
    out_fsa.aux_labels = index(b_fsa.aux_labels, b_arc_map)
  File "/search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/k2/ops.py", line 309, in index
    return index_ragged(src, indexes)
  File "/search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/k2/ops.py", line 255, in index_ragged
    ans, _ = ragged_index(src, indexes)
  File "/search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/k2/ragged/ops.py", line 55, in index
    need_value_indexes=need_value_indexes)
RuntimeError: Some bad things happed.

Char-level model, when training:

Traceback (most recent call last):
  File "/search/odin/wangjiawen/anaconda2/envs/python37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[Thread 0x7fc9be6fc700 (LWP 33114) exited]
    "__main__", mod_spec)
  File "/search/odin/wangjiawen/anaconda2/envs/python37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/espnet2/bin/charlevel_mmi_asr_train.py", line 23, in <module>
    main()
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/espnet2/bin/charlevel_mmi_asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/espnet2/tasks/abs_task.py", line 1011, in main
    cls.main_worker(args)
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/espnet2/tasks/abs_task.py", line 1333, in main_worker
    keep_all_models=args.keep_all_models,
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/espnet2/train/trainer.py", line 218, in run
    options=trainer_options,
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/espnet2/train/trainer.py", line 396, in train_one_epoch
    loss, stats, weight = model(**batch)
  File "/search/odin/wangjiawen/anaconda2/envs/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/espnet2/asr/espnet_model_mmi.py", line 194, in forward
    encoder_out, encoder_out_lens, text, text_lengths
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/espnet2/asr/espnet_model_mmi.py", line 355, in _calc_mmi_loss
    loss_mmi = self.mmi(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens)
  File "/search/odin/wangjiawen/anaconda2/envs/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/espnet2/asr/charlevel_mmi.py", line 132, in forward
    tot_score, tot_frames, all_frames = self.loss_fn(ys_hat if self.device is "auto" else ys_hat.to(torch.device(self.device)), texts, supervision_segments)
  File "/search/odin/wangjiawen/anaconda2/envs/python37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/snowfall/objectives/mmi.py", line 73, in forward
    num_den_reordered_graphs = k2.index(num_den_graphs, num_den_graphs_indexes)
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/k2/ops.py", line 304, in index
    return index_fsa(src, indexes)
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/k2/ops.py", line 211, in index_fsa
    need_value_indexes=True)
  File "/search/odin/wangjiawen/espnet2_500h_charlevel_mmi/k2/ragged/ops.py", line 55, in index
    need_value_indexes=need_value_indexes)
RuntimeError: [enforce fail at CPUAllocator.cpp:48] ((ptrdiff_t)nbytes) >= 0. alloc_cpu() seems to have been called with negative number: 18446744069413104760
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7fca17f6f6a7 in /search/odin/wangjiawen/anaconda2/envs/python37/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::alloc_cpu(unsigned long) + 0x487 (0x7fca17f40b97 in /search/odin/wangjiawen/anaconda2/envs/python37/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x247b6 (0x7fca17f427b6 in /search/odin/wangjiawen/anaconda2/envs/python37/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: c10::Allocator::raw_allocate(unsigned long) + 0x32 (0x7fca100dd60a in /search/odin/wangjiawen/k2/build_1acec6f_bugfix_debug/lib/libk2context.so)
frame #4: k2::PytorchCpuContext::Allocate(unsigned long, void**) + 0x2b (0x7fca100def93 in /search/odin/wangjiawen/k2/build_1acec6f_bugfix_debug/lib/libk2context.so)
frame #5: k2::NewRegion(std::shared_ptr<k2::Context>, unsigned long) + 0xa1 (0x7fca0fe5cd09 in /search/odin/wangjiawen/k2/build_1acec6f_bugfix_debug/lib/libk2context.so)
frame #6: k2::Array1<int>::Init(std::shared_ptr<k2::Context>, int, k2::Dtype) + 0xea (0x7fca0fe33706 in /search/odin/wangjiawen/k2/build_1acec6f_bugfix_debug/lib/libk2context.so)
frame #7: k2::Array1<int>::Array1(std::shared_ptr<k2::Context>, int, k2::Dtype) + 0x50 (0x7fca0fe314ac in /search/odin/wangjiawen/k2/build_1acec6f_bugfix_debug/lib/libk2context.so)
frame #8: <unknown function> + 0x334b22 (0x7fca0ffc7b22 in /search/odin/wangjiawen/k2/build_1acec6f_bugfix_debug/lib/libk2context.so)
frame #9: k2::Index(k2::RaggedShape&, int, k2::Array1<int> const&, k2::Array1<int>*) + 0x195 (0x7fca0ffc872a in /search/odin/wangjiawen/k2/build_1acec6f_bugfix_debug/lib/libk2context.so)
frame #10: <unknown function> + 0x11a950 (0x7fca14c86950 in /search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/lib/_k2.cpython-37m-x86_64-linux-gnu.so)
frame #11: <unknown function> + 0xfc8f5 (0x7fca14c688f5 in /search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/lib/_k2.cpython-37m-x86_64-linux-gnu.so)
frame #12: <unknown function> + 0x1138aa (0x7fca14c7f8aa in /search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/lib/_k2.cpython-37m-x86_64-linux-gnu.so)
frame #13: <unknown function> + 0x110faa (0x7fca14c7cfaa in /search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/lib/_k2.cpython-37m-x86_64-linux-gnu.so)
frame #14: <unknown function> + 0x10afe3 (0x7fca14c76fe3 in /search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/lib/_k2.cpython-37m-x86_64-linux-gnu.so)
frame #15: <unknown function> + 0x10b093 (0x7fca14c77093 in /search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/lib/_k2.cpython-37m-x86_64-linux-gnu.so)
frame #16: <unknown function> + 0x522af (0x7fca14bbe2af in /search/odin/wangjiawen/espnet2_500h_phonelevel_mmi/lib/_k2.cpython-37m-x86_64-linux-gnu.so)
frame #17: _PyMethodDef_RawFastCallKeywords + 0x316 (0x55f53d8569b6 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #18: _PyCFunction_FastCallKeywords + 0x21 (0x55f53d856a31 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x53e3 (0x55f53d8c3483 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #20: _PyEval_EvalCodeWithName + 0x2f9 (0x55f53d805829 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #21: _PyFunction_FastCallKeywords + 0x387 (0x55f53d856107 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x14e5 (0x55f53d8bf585 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #23: _PyFunction_FastCallKeywords + 0xfb (0x55f53d855e7b in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x416 (0x55f53d8be4b6 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #25: _PyFunction_FastCallKeywords + 0xfb (0x55f53d855e7b in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x4a89 (0x55f53d8c2b29 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #27: _PyFunction_FastCallDict + 0x10b (0x55f53d80685b in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #28: _PyObject_Call_Prepend + 0x63 (0x55f53d8254d3 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #29: PyObject_Call + 0x6e (0x55f53d817ffe in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x1e4a (0x55f53d8bfeea in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #31: _PyEval_EvalCodeWithName + 0x2f9 (0x55f53d805829 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #32: _PyFunction_FastCallDict + 0x1d5 (0x55f53d806925 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #33: _PyObject_Call_Prepend + 0x63 (0x55f53d8254d3 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #34: <unknown function> + 0x16be1a (0x55f53d85ce1a in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #35: _PyObject_FastCallKeywords + 0x48b (0x55f53d85dccb in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #36: _PyEval_EvalFrameDefault + 0x52fe (0x55f53d8c339e in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #37: _PyEval_EvalCodeWithName + 0xc30 (0x55f53d806160 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #38: _PyFunction_FastCallDict + 0x1d5 (0x55f53d806925 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #39: _PyObject_Call_Prepend + 0x63 (0x55f53d8254d3 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #40: PyObject_Call + 0x6e (0x55f53d817ffe in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #41: _PyEval_EvalFrameDefault + 0x1e4a (0x55f53d8bfeea in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #42: _PyEval_EvalCodeWithName + 0x2f9 (0x55f53d805829 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #43: _PyFunction_FastCallDict + 0x1d5 (0x55f53d806925 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #44: _PyObject_Call_Prepend + 0x63 (0x55f53d8254d3 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #45: <unknown function> + 0x16be1a (0x55f53d85ce1a in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #46: _PyObject_FastCallKeywords + 0x48b (0x55f53d85dccb in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #47: _PyEval_EvalFrameDefault + 0x52fe (0x55f53d8c339e in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #48: _PyFunction_FastCallKeywords + 0xfb (0x55f53d855e7b in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #49: _PyEval_EvalFrameDefault + 0x4a89 (0x55f53d8c2b29 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #50: _PyEval_EvalCodeWithName + 0x2f9 (0x55f53d805829 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #51: _PyFunction_FastCallDict + 0x400 (0x55f53d806b50 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #52: _PyObject_Call_Prepend + 0x63 (0x55f53d8254d3 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #53: PyObject_Call + 0x6e (0x55f53d817ffe in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #54: _PyEval_EvalFrameDefault + 0x1e4a (0x55f53d8bfeea in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #55: _PyEval_EvalCodeWithName + 0x2f9 (0x55f53d805829 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #56: _PyFunction_FastCallDict + 0x400 (0x55f53d806b50 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #57: _PyObject_Call_Prepend + 0x63 (0x55f53d8254d3 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #58: <unknown function> + 0x16be1a (0x55f53d85ce1a in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #59: PyObject_Call + 0x6e (0x55f53d817ffe in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #60: _PyEval_EvalFrameDefault + 0x1e4a (0x55f53d8bfeea in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #61: _PyEval_EvalCodeWithName + 0x2f9 (0x55f53d805829 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #62: _PyFunction_FastCallKeywords + 0x387 (0x55f53d856107 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)
frame #63: _PyEval_EvalFrameDefault + 0x14e5 (0x55f53d8bf585 in /search/odin/wangjiawen/anaconda2/envs/python37/bin/python3)

My colleague hit a similar problem in snowfall too;
it reported that it needed about a billion GB of CUDA memory.

@danpovey
Collaborator

danpovey commented May 12, 2021 via email

@Jarvan-Wang
Contributor Author

Yes, the decoding issue seems easy to fix; I'll try it first.

However, the setup is based on an in-lab dataset that cannot be shared.
I guess it can be reproduced by using chars as the modeling unit in the AISHELL recipe (composing H with G, without L).

Let me show the training issue more clearly.
Python-level code:

# btw, I don't know why the new snowfall code concatenates the num and den graphs and reorders them
num_den_reordered_graphs = k2.index(num_den_graphs, num_den_graphs_indexes)

C++ code:

static RaggedShape IndexAxis0(RaggedShape &src, const Array1<int32_t> &new2old,
                              Array1<int32_t> *elem_indexes /*=nullptr*/) {
  //...
  Array2<int32_t> old_offsets,  // num_axes by ans_dim0
      new_offsets;              // num_axes by (ans_dim0 + 1).
  //src.NumElements() == 41834178
  //new2old == {0, 77, 1, 77, ..., 75, 77, 76, 77}
  //old_offsets == [
  //                            {0, 77, 1, 77, ..., 77, 74, 77, 75, 77, 76, 77}, 
  //                            {0, 2072, 20, 2072, ..., 2020, 2072, 2048, 2072},
  //                            {0, 4949, 47, 4949, ..., 4825, 4949, 4892, 4949}  ]
  //new_offsets is shown above; some of its elements have overflowed
  //btw, I don't know what old_offsets and new_offsets represent
  GetOldAndNewOffsets(src, new2old, &old_offsets, &new_offsets);
  
  //now, tot_sizes_out == {154, 706314, -1074111714}
  Array1<int32_t> tot_sizes_out =
      Array1<int32_t>(new_offsets.Col(ans_dim0)).To(GetCpuContext());
  //tot_sizes_out.Back() == -1074111714
  if (elem_indexes) *elem_indexes = Array1<int32_t>(c, tot_sizes_out.Back());
  //...
}

k2/k2/csrc/array.h:107
106       Array1(ContextPtr ctx, int32_t size, Dtype dtype = DtypeOf<T>::dtype) {
107         Init(ctx, size, dtype);
108       }

k2/k2/csrc/array.h:446
443       void Init(ContextPtr context, int32_t size, Dtype dtype) {
444         K2_CHECK(K2_TYPE_IS_ANY(T) || dtype == DtypeOf<T>::dtype);
445         dtype_ = dtype;
// now the num_bytes passed to NewRegion == 18446744069413104760
446         region_ = NewRegion(context, static_cast<size_t>(size) * ElementSize());
447         dim_ = size;
448         byte_offset_ = 0;
449       }
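The two suspicious numbers in this comment, the negative total and the huge allocation size, are consistent with each other. A minimal sketch, assuming (from the dumps above) that arcs [0, 4949) of src belong to the 77 num graphs and that the den graph at index 77 is selected 77 times; `wrap_int32` is a hypothetical helper, not part of k2.

```python
def wrap_int32(x: int) -> int:
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    return (x + 2**31) % 2**32 - 2**31

# Figures quoted in this thread.
num_arcs_in_src = 41834178   # src.NumElements()
den_start = 4949             # arc offset of the den graph in src (from old_offsets)
num_copies = 77              # assumption: the den graph is indexed once per num graph

den_arcs = num_arcs_in_src - den_start
true_total = den_start + num_copies * den_arcs   # arcs in the indexed result

size_int32 = wrap_int32(true_total)              # value stored in tot_sizes_out.Back()
# static_cast<size_t>(size) * ElementSize(), with a 64-bit size_t and 4-byte elements
num_bytes = ((size_int32 % 2**64) * 4) % 2**64

print(true_total, size_int32, num_bytes)
```

Under these assumptions the indexed result would need about 3.2 billion arcs, which cannot be represented as an int32_t element count regardless of how the byte size is computed.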

@csukuangfj
Collaborator

@Jarvan-Wang
Looks like you are not using the latest k2. Could you update your code to the latest master?

@Jarvan-Wang
Contributor Author

> @Jarvan-Wang
> Looks like you are not using the latest k2. Could you update your code to the latest master?

My k2 is at 1acec6f.
The fixes in recent commits are not related to the overflow issue:
https://github.com/k2-fsa/k2/blob/master/k2/csrc/tensor.h#167

@csukuangfj
Collaborator

csukuangfj commented May 12, 2021

> btw, I don't know why the new snowfall code cat the num and den together and reorder it

It is reordered because of
https://github.com/k2-fsa/snowfall/blob/949226f35b29c629cb03cae36fa43da5993d27a3/snowfall/objectives/mmi.py#L79

        num_den_lats = k2.intersect_dense(num_den_reordered_graphs,
                                          dense_fsa_vec,
                                          output_beam=10.0,
                                          a_to_b_map=a_to_b_map)

a_to_b_map must be monotonically increasing. If you don't reorder the graphs, then a_to_b_map is not monotonically increasing.

Also note that den_graph is replicated so that the number of den_graphs equals that of num_graphs.
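The effect of the reordering can be sketched with plain lists. This is an illustration only: the interleaved index pattern follows the new2old dump earlier in the thread ({0, 77, 1, 77, ...}), while the way a_to_b_map assigns graphs to utterances is an assumption made for this example, not snowfall's actual code.

```python
num_graphs = 3            # small stand-in for the 77 num graphs in the dump
den_index = num_graphs    # the single den graph sits right after the num graphs

def is_monotonic(xs):
    """True if the sequence never decreases."""
    return all(a <= b for a, b in zip(xs, xs[1:]))

# Interleaved order: num graph n is immediately followed by a copy of the den graph.
interleaved_indexes = [i for n in range(num_graphs) for i in (n, den_index)]
a_to_b_interleaved = [n for n in range(num_graphs) for _ in range(2)]

# Concatenated order: all num graphs first, then the den copies.
concatenated_indexes = list(range(num_graphs)) + [den_index] * num_graphs
a_to_b_concatenated = list(range(num_graphs)) * 2

print(interleaved_indexes, a_to_b_interleaved, is_monotonic(a_to_b_interleaved))
print(concatenated_indexes, a_to_b_concatenated, is_monotonic(a_to_b_concatenated))
```

With the interleaved order, graph 2n and graph 2n+1 both map to utterance n, so the map never decreases; with the concatenated order it restarts from 0 halfway through.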

@Jarvan-Wang
Contributor Author

> > btw, I don't know why the new snowfall code cat the num and den together and reorder it
>
> It is reordered because of
> https://github.com/k2-fsa/snowfall/blob/949226f35b29c629cb03cae36fa43da5993d27a3/snowfall/objectives/mmi.py#L79
>
>         num_den_lats = k2.intersect_dense(num_den_reordered_graphs,
>                                           dense_fsa_vec,
>                                           output_beam=10.0,
>                                           a_to_b_map=a_to_b_map)
>
> a_to_b_map must be monotonically increasing. If you don't reorder the graphs, then a_to_b_map is not monotonically increasing.
>
> Also note that den_graph is replicated so that the number of den_graphs equals that of num_graphs.

Gotcha. Another question.
Take the example below:

std::vector<int32_t> index_ = {0};
Array1<int32_t> index(GetCudaContext(), index_);
Ragged<int32_t> src("[ [ [ 1 2 ] [ 5 ] ] [ [ 7 8 9 ] ] ]");
Array2<int32_t> old_offsets, new_offsets;
GetOldAndNewOffsets(src.shape, index, &old_offsets, &new_offsets);

What are the resulting old_offsets and new_offsets,
and what do they mean?
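One reading that is consistent with the gdb dumps earlier in this thread: old_offsets(axis, j) is where item new2old[j] begins on that axis of src, and new_offsets(axis, j) is where the j-th selected item begins on that axis of the result, with one extra trailing column holding the totals. The sketch below implements that reading for a 3-axis ragged array stored as nested Python lists. It is an illustration under that assumption, not k2's implementation, and `old_and_new_offsets` is a hypothetical name; the authoritative definition is in k2/csrc/ragged_ops.cu.

```python
from itertools import accumulate

def sizes_per_axis(item):
    """Elements that one axis-0 item contributes on axes 0, 1 and 2."""
    return [1, len(item), sum(len(sub) for sub in item)]

def old_and_new_offsets(src, new2old):
    """Offsets of the selected items in src (old) and in the result (new)."""
    num_axes = 3
    sizes = [sizes_per_axis(item) for item in src]
    # starts[axis][i]: where item i begins on `axis` of src.
    starts = [[0] + list(accumulate(s[axis] for s in sizes))
              for axis in range(num_axes)]
    old_offsets = [[starts[axis][i] for i in new2old]
                   for axis in range(num_axes)]
    # Exclusive prefix sum over the sizes of the selected items, plus the total.
    new_offsets = [[0] + list(accumulate(sizes[i][axis] for i in new2old))
                   for axis in range(num_axes)]
    return old_offsets, new_offsets

src = [[[1, 2], [5]], [[7, 8, 9]]]
old_offsets, new_offsets = old_and_new_offsets(src, [0])
print(old_offsets)
print(new_offsets)

# Selecting both items in reverse order shows non-trivial old offsets.
old_rev, new_rev = old_and_new_offsets(src, [1, 0])
print(old_rev)
print(new_rev)
```

Under this reading, the last column of new_offsets holds the total size of each axis of the result, which is the column read into tot_sizes_out in IndexAxis0.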

@danpovey
Collaborator

danpovey commented May 12, 2021 via email

@Jarvan-Wang
Contributor Author

Jarvan-Wang commented May 12, 2021

After simply modifying the Tensor constructor and rerunning the decoding, I got:

[W] /search/odin/wangjiawen/k2/k2/csrc/ragged.cu:283:bool k2::RaggedShape::Validate(bool) const Ragged shape validation failed, row_splits.Back()=710089159 vs. cached-tot-size=58123011

py-bt:

  File "/search/odin/wangjiawen/data/espnet2_500h_phonelevel_mmi/k2/fsa_algo.py", line 369, in compose
    out_fsa.aux_labels = index(b_fsa.aux_labels, b_arc_map)
  File "/search/odin/wangjiawen/data/espnet2_500h_phonelevel_mmi/snowfall/decoding/graph.py", line 72, in compile_HLG
    HLG = k2.compose(H, LG, inner_labels='phones')

bt:

#0  k2::operator<< <int> (stream=..., array=...) at /search/odin/wangjiawen/k2/k2/csrc/array_inl.h:55
#1  0x00007f14e5e3a5ea in k2::operator<< (stream=..., shape=...) at /search/odin/wangjiawen/k2/k2/csrc/ragged.cu:86
#2  0x00007f14e5d7cad5 in k2::internal::Logger::operator<< <k2::RaggedShape> (this=0x7ffe9b29c290, t=...)
    at /search/odin/wangjiawen/k2/k2/csrc/log.h:209
#3  0x00007f14e5d65add in k2::RaggedShape::Check (this=0x7ffe9b29c3f0) at /search/odin/wangjiawen/k2/k2/csrc/ragged.h:207
#4  0x00007f14e5e4b36b in k2::IndexAxis0 (src=..., new2old=..., elem_indexes=0x7ffe9b29c970)
    at /search/odin/wangjiawen/k2/k2/csrc/ragged_ops.cu:558
#5  0x00007f14e5e4b774 in k2::Index (src=..., axis=0, indexes=..., elem_indexes=0x7ffe9b29c970)
    at /search/odin/wangjiawen/k2/k2/csrc/ragged_ops.cu:570
#6  0x00007f14eab0a966 in k2::Index<int> (src=..., axis=0, indexes=..., value_indexes_out=0x7ffe9b29ca20)
    at /search/odin/wangjiawen/k2/k2/csrc/ragged_ops.h:1186
#7  0x00007f14eaaecf8d in k2::<lambda(PyClass&, int32_t, at::Tensor, bool)>::operator()(k2::<lambda(const PyClass&)>::PyClass &, int32_t, at::Tensor, bool) const (this=0x55b27c406718, src=..., axis=0, indexes=..., need_value_indexes=true)

It cannot pass ans.Check():

k2/k2/csrc/ragged_ops.cu:558
406     static RaggedShape IndexAxis0(RaggedShape &src, const Array1<int32_t> &new2old,
407                                   Array1<int32_t> *elem_indexes /*=nullptr*/) {
...
557     #if !defined(NDEBUG)
558       ans.Check();
559     #endif
560       return ans;
561     }

I found the reason:

(gdb) pvector ans.layers_
elem[0]: $9 = {
  row_splits = {
    dim_ = 891915492,
    dtype_ = k2::Dtype::kInt32Dtype,
    byte_offset_ = 18446744072982246288,
    region_ = {
      <std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {
        <std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>},
        members of std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>:
        _M_ptr = 0x55b27d7c4ff0,
        _M_refcount = {
          _M_pi = 0x55b27d7c4fe0
        }
      }, <No data fields>}
  },
  row_ids = {
    dim_ = 58123011,
    dtype_ = k2::Dtype::kInt32Dtype,
    byte_offset_ = 0,
    region_ = {
      <std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {
        <std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>},
        members of std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>:
        _M_ptr = 0x55b27d7c6ac0,
        _M_refcount = {
          _M_pi = 0x55b27d7c6ab0
        }
      }, <No data fields>}
  },
  cached_tot_size = 58123011
}
Vector size = 1
Vector capacity = 1
Element type = std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >::pointer
(gdb) p ans.layers_[0].row_splits[ans.layers_[0].row_splits.dim_-1] == ans.layers_[0].row_ids.dim_
$21 = false
(gdb) p ans.layers_[0].row_splits[ans.layers_[0].row_splits.dim_-1]
$22 = 710089159

ans.layers_[0].row_splits is definitely wrong,
and ans.layers_[0].row_splits.byte_offset_ has overflowed.
Any ideas?

By the way,
k2::RaggedShapeLayer::row_splits is of type Array1<int32_t>.
Is that what you meant by offsets into multi-dimensional arrays, or are these just individual indexes?
@danpovey

@Jarvan-Wang
Contributor Author

After modifying the byte-offset argument of the k2::Tensor constructor and rerunning, I got a segfault:

[W] /search/odin/wangjiawen/k2/k2/csrc/ragged.cu:283:bool k2::RaggedShape::Validate(bool) const Ragged shape validation failed, row_splits.Back()=710089159 vs. cached-tot-size=58123011

Program received signal SIGSEGV, Segmentation fault.

The backtrace is:

#0  k2::operator<< <int> (stream=..., array=...) at /search/odin/wangjiawen/k2/k2/csrc/array_inl.h:55
#1  0x00007fedf7ae85ea in k2::operator<< (stream=..., shape=...) at /search/odin/wangjiawen/k2/k2/csrc/ragged.cu:86
#2  0x00007fedf7a2aad5 in k2::internal::Logger::operator<< <k2::RaggedShape> (this=0x7ffef6bf58e0, t=...)
    at /search/odin/wangjiawen/k2/k2/csrc/log.h:209
#3  0x00007fedf7a13add in k2::RaggedShape::Check (this=0x7ffef6bf5a40) at /search/odin/wangjiawen/k2/k2/csrc/ragged.h:207
#4  0x00007fedf7af936b in k2::IndexAxis0 (src=..., new2old=..., elem_indexes=0x7ffef6bf5fc0)
    at /search/odin/wangjiawen/k2/k2/csrc/ragged_ops.cu:558
#5  0x00007fedf7af9774 in k2::Index (src=..., axis=0, indexes=..., elem_indexes=0x7ffef6bf5fc0)
    at /search/odin/wangjiawen/k2/k2/csrc/ragged_ops.cu:570
#6  0x00007fedfc7b8966 in k2::Index<int> (src=..., axis=0, indexes=..., value_indexes_out=0x7ffef6bf6070)

frame 6

(gdb) p src.NumAxes()
$29 = 2
(gdb) whatis src
type = k2::Ragged<int> &
(gdb) call src.shape.layers_.size()
$22 = 1

frame 4

(gdb) p new2old.dim_
$24 = 891915491
(gdb) p new_offsets.Row(1)
$42 = {
  dim_ = 891915492,
  dtype_ = k2::Dtype::kInt32Dtype,
  byte_offset_ = 18446744072982246288,
  region_ = {
    <std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {
      <std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>},
      members of std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>:
      _M_ptr = 0x558f9cd95620,
      _M_refcount = {
        _M_pi = 0x558f9cd95610
      }
    }, <No data fields>}
}
(gdb) p (size_t)(4*891915492)
$48 = 18446744072982246288

That is, when src.NumAxes() > 1 and indexes.dim_ > INT_MAX/4, the byte_offset_ of the last row of new_offsets overflows.

@danpovey
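The byte_offset_ value printed by gdb can be checked by hand. A minimal sketch, assuming 4-byte elements and a 64-bit size_t; `wrap_int32` is a hypothetical helper, not part of k2.

```python
def wrap_int32(x: int) -> int:
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    return (x + 2**31) % 2**32 - 2**31

ELEMENT_SIZE = 4     # sizeof(int32_t)
dim = 891915492      # new_offsets.Row(1).dim_ from the gdb session

# Product evaluated in 32-bit arithmetic, then converted to a 64-bit size_t.
offset_int32 = wrap_int32(ELEMENT_SIZE * dim)
offset_as_size_t = offset_int32 % 2**64

# Product evaluated after widening dim to 64 bits first.
offset_int64 = ELEMENT_SIZE * dim

print(offset_int32, offset_as_size_t, offset_int64)
```

The converted value equals the byte_offset_ shown by gdb, while the widened product stays positive, which supports casting to int64_t before multiplying.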

@Jarvan-Wang
Contributor Author

I guess you have never hit this issue because your experiments use open datasets, whose decoding grammars are small.
But the G I used for decoding is already a pruned version, with only 35341356 arcs; the full version has 77197771 arcs.

I'll retry the decoding with a G trained on data/train/text.

@danpovey
Collaborator

danpovey commented Jun 17, 2021 via email

@danpovey
Collaborator

danpovey commented Jun 17, 2021 via email
