Unaligned memmove is > 20% of encode time at -s0 #2646

Closed
negge opened this issue Jan 22, 2021 · 8 comments

Comments

@negge
Collaborator

negge commented Jan 22, 2021

$ perf record target/release/rav1e Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m -o /dev/null --tiles 4 --threads 4 --limit 64 -s0
Couldn't synthesize bpf events.
>  Using y4m decoder: 1920x1080p @ 30/1 fps, 4:2:0, 8-bit
>  Encoding settings: keyint_min=12 keyint_max=240 quantizer=100 bitrate=0 min_quantizer=0 low_latency=false tune=Psychovisual rdo_lookahead_frames=40 min_block_size=4x4 max_block_size=64x64 multiref=true fast_deblock=false reduced_tx_set=false tx_domain_distortion=false tx_domain_rate=false encode_bottomup=true rdo_tx_decision=true prediction_modes=Complex-All include_near_mvs=true no_scene_detection=false cdef=true use_satd_subpel=true non_square_partition=true enable_timing_info=false fine_directional_intra=true
>  CPU Feature Level: AVX2
>  Using 4 tiles (2x2)
>  encoded 64/64 frames, 0.100 fps, 2342.27 Kb/s, est. size: 0.60 MB, est. time: 0s,  elap. time: 10m 38s                        
>  ----------
>  Key frame:             1 | avg QP:  79.00 | avg size:  104170 B
>  Inter frame:          63 | avg QP: 138.00 | avg size:    8260 B
>  Intra only frame:      0 | avg QP:   0.00 | avg size:       0 B
>  Switching frame:       0 | avg QP:   0.00 | avg size:       0 B
>  ----
[ perf record: Woken up 1229 times to write data ]
[ perf record: Captured and wrote 308.871 MB perf.data (8096427 samples) ]

$ perf report
# Overhead  Command  Shared Object       Symbol                                                                                                                                                                   
# ........  .......  ..................  .........................................................................................................................................................................
#
    20.33%  rav1e    libc-2.30.so        [.] __memmove_avx_unaligned_erms
    11.39%  rav1e    rav1e               [.] rav1e::encoder::encode_tx_block
     9.44%  rav1e    rav1e               [.] <rav1e::ec::WriterBase<S> as rav1e::ec::Writer>::symbol_with_update
     7.59%  rav1e    rav1e               [.] rav1e::rdo::cdef_dist_wxh_8x8
     5.29%  rav1e    rav1e               [.] rav1e::context::block_unit::<impl rav1e::context::cdf_context::ContextWriter>::write_coeffs_lv_map
     3.72%  rav1e    rav1e               [.] rav1e::context::transform_unit::<impl rav1e::context::cdf_context::ContextWriter>::get_nz_map_contexts
     3.60%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::forward_transform_avx2
     2.68%  rav1e    rav1e               [.] rav1e::encoder::encode_block_post_cdef
     2.22%  rav1e    rav1e               [.] rav1e::rdo::compute_distortion
     1.85%  rav1e    rav1e               [.] <rav1e::ec::WriterBase<S> as rav1e::ec::Writer>::bit
     1.77%  rav1e    rav1e               [.] __rust_probestack
     1.43%  rav1e    rav1e               [.] rav1e::rdo::luma_chroma_mode_rdo::_$u7b$$u7b$closure$u7d$$u7d$::hfd13f0685c9817e4
     1.21%  rav1e    rav1e               [.] rav1e::quantize::QuantizationContext::update
     1.03%  rav1e    rav1e               [.] rav1e::asm::x86::quantize::dequantize_avx2
     0.94%  rav1e    rav1e               [.] rav1e::partition::get_intra_edges
     0.82%  rav1e    rav1e               [.] rav1e::context::block_unit::BlockContext::get_txb_ctx
     0.78%  rav1e    rav1e               [.] rav1e::predict::PredictionMode::predict_inter_single
     0.77%  rav1e    rav1e               [.] rav1e::encoder::write_tx_blocks
     0.76%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::daala_fdct32
     0.72%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::daala_fdct_ii_16
     0.64%  rav1e    rav1e               [.] rav1e::transform::forward_shared::Txfm2DFlipCfg::fwd
     0.62%  rav1e    rav1e               [.] rav1e::rdo::rdo_tx_size_type
     0.61%  rav1e    rav1e               [.] core::cmp::PartialOrd::le
     0.61%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::daala_fdct8
     0.59%  rav1e    rav1e               [.] rav1e::context::block_unit::<impl rav1e::context::cdf_context::ContextWriter>::fill_neighbours_ref_counts
     0.55%  rav1e    rav1e               [.] rav1e::encoder::motion_compensate
     0.54%  rav1e    rav1e               [.] rav1e::asm::x86::transform::inverse::inverse_transform_add
     0.51%  rav1e    libc-2.30.so        [.] __memset_avx2_unaligned_erms
     0.50%  rav1e    rav1e               [.] rav1e::predict::PredictionMode::predict_inter
@lu-zero
Collaborator

lu-zero commented Jan 22, 2021

According to callgrind:

--------------------------------------------------------------------------------
         Ir  file:function
--------------------------------------------------------------------------------
298,088,019  ???:__memcpy_avx_unaligned_erms [/lib64/libc-2.32.so]
253,325,238  /home/lu_zero/Sources/rav1e/src/me.rs:rav1e::me::full_pixel_me [/home/lu_zero/Sources/rav1e/target/release/rav1e]
243,043,970  .//home/lu_zero/Sources/rav1e/src/x86/sad_sse2.asm:0x00000000002b4d80 [/home/lu_zero/Sources/rav1e/target/release/rav1e]
123,545,730  /home/lu_zero/Sources/rav1e/src/asm/x86/dist/mod.rs:rav1e::me::full_pixel_me
122,318,416  /home/lu_zero/Sources/rav1e/src/rdo.rs:rav1e::rdo::cdef_dist_wxh_8x8 [/home/lu_zero/Sources/rav1e/target/release/rav1e]
 80,727,559  /home/lu_zero/Sources/rav1e/src/tiling/plane_region.rs:rav1e::me::full_pixel_me
 75,656,947  /home/lu_zero/Sources/rav1e/src/context/transform_unit.rs:rav1e::context::transform_unit::<impl rav1e::context::cdf_context::ContextWriter>::get_nz_map_contexts [/home/lu_zero/Sources/rav1e/target/release/rav1e]

Digging further:

image

And further:

image

So we have BlockContext::rollback() and ContextWriter::rollback().

Ideally we could instrument the CdfContext struct further. It is 17774 bytes.
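To illustrate why a by-value checkpoint and rollback of a struct this large can show up as memmove/memcpy time, here is a minimal sketch. `FakeCdfContext` and `FakeWriter` are hypothetical stand-ins, not rav1e's actual types; only the 17774-byte size is taken from the comment above, and the field layout is invented to match it.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a large probability-context struct.
/// The layout is invented; only the total size matches the thread.
#[derive(Clone, Copy)]
struct FakeCdfContext {
    cdfs: [u16; 8887], // 8887 * 2 bytes = 17774 bytes
}

/// Hypothetical writer that owns one context.
struct FakeWriter {
    fc: FakeCdfContext,
}

impl FakeWriter {
    /// Snapshot by value: copies the whole struct.
    fn checkpoint(&self) -> FakeCdfContext {
        self.fc
    }

    /// Restore by value: copies the whole struct back.
    fn rollback(&mut self, saved: &FakeCdfContext) {
        self.fc = *saved;
    }
}

fn main() {
    println!("context size: {} bytes", size_of::<FakeCdfContext>());

    let mut w = FakeWriter { fc: FakeCdfContext { cdfs: [0; 8887] } };
    let saved = w.checkpoint();
    w.fc.cdfs[42] = 1234; // a trial encode may touch only a few entries
    w.rollback(&saved); // yet the rollback copies every byte
    assert_eq!(w.fc.cdfs[42], 0);
}
```

If the encoder tries many candidate modes per block, each trial pays for two full copies regardless of how few entries changed, which is one plausible reading of the profile above.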

@lu-zero lu-zero removed their assignment Jan 22, 2021
@negge
Collaborator Author

negge commented Feb 4, 2021

Nice, with f9c026f and 351f7c8 I am now seeing a ~15% speed-up with the same configuration: elapsed time drops from 10m 48s to 9m 26s. Unaligned memmove falls from 20.70% to 4.50% of samples, so most of those calls are gone.
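As a quick check on the figure, the two elapsed times can be compared in two ways: as a reduction in wall-clock time, or as a gain in throughput. The helper names below are made up for this sketch; the only inputs are the 10m 48s and 9m 26s timings quoted above.

```rust
/// Convert minutes and seconds into seconds.
fn secs(m: u32, s: u32) -> f64 {
    f64::from(m * 60 + s)
}

/// Fraction of wall-clock time saved.
fn time_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before
}

/// Relative increase in frames per second for the same workload.
fn throughput_gain(before: f64, after: f64) -> f64 {
    before / after - 1.0
}

fn main() {
    let before = secs(10, 48); // 648 s
    let after = secs(9, 26); // 566 s
    // prints "time reduction: 12.7%"
    println!("time reduction: {:.1}%", 100.0 * time_reduction(before, after));
    // prints "throughput gain: 14.5%"
    println!("throughput gain: {:.1}%", 100.0 * throughput_gain(before, after));
}
```

The ~15% figure corresponds to the throughput view, which also agrees with the reported change from 0.099 fps to 0.113 fps.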

Using git hash 61133e3 (before log)

before $ perf record release/rav1e Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m -o /dev/null --tiles 4 --threads 4 --limit 64 -s0
Couldn't synthesize bpf events.
>  Using y4m decoder: 1920x1080p @ 30/1 fps, 4:2:0, 8-bit
>  Encoding settings: keyint_min=12 keyint_max=240 quantizer=100 bitrate=0 min_quantizer=0 low_latency=false tune=Psychovisual rdo_lookahead_frames=40 min_block_size=4x4 max_block_size=64x64 multiref=true fast_deblock=false reduced_tx_set=false tx_domain_distortion=false tx_domain_rate=false encode_bottomup=true rdo_tx_decision=true prediction_modes=Complex-All include_near_mvs=true no_scene_detection=false cdef=true use_satd_subpel=true non_square_partition=true enable_timing_info=false fine_directional_intra=true
>  CPU Feature Level: AVX2
>  Using 4 tiles (2x2)
>  encoded 64/64 frames, 0.099 fps, 2342.27 Kb/s, est. size: 0.60 MB, est. time: 0s,  elap. time: 10m 48s                        
>  ----------
>  Key frame:             1 | avg QP:  79.00 | avg size:  104170 B
>  Inter frame:          63 | avg QP: 138.00 | avg size:    8260 B
>  Intra only frame:      0 | avg QP:   0.00 | avg size:       0 B
>  Switching frame:       0 | avg QP:   0.00 | avg size:       0 B
>  ----
[ perf record: Woken up 1242 times to write data ]
[ perf record: Captured and wrote 313.005 MB perf.data (8204764 samples) ]

before $ perf report
# Overhead  Command  Shared Object       Symbol                                                                                                                                                                                                 
# ........  .......  ..................  .......................................................................................................................................................................................................
#
    20.70%  rav1e    libc-2.30.so        [.] __memmove_avx_unaligned_erms
    11.11%  rav1e    rav1e               [.] rav1e::encoder::encode_tx_block
     9.52%  rav1e    rav1e               [.] <rav1e::ec::WriterBase<S> as rav1e::ec::Writer>::symbol_with_update
     7.61%  rav1e    rav1e               [.] rav1e::rdo::cdef_dist_wxh_8x8
     5.38%  rav1e    rav1e               [.] rav1e::context::block_unit::<impl rav1e::context::cdf_context::ContextWriter>::write_coeffs_lv_map
     3.68%  rav1e    rav1e               [.] rav1e::context::transform_unit::<impl rav1e::context::cdf_context::ContextWriter>::get_nz_map_contexts
     3.58%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::forward_transform_avx2
     2.44%  rav1e    rav1e               [.] rav1e::encoder::encode_block_post_cdef
     2.21%  rav1e    rav1e               [.] rav1e::rdo::compute_distortion
     1.86%  rav1e    rav1e               [.] <rav1e::ec::WriterBase<S> as rav1e::ec::Writer>::bit
     1.77%  rav1e    rav1e               [.] __rust_probestack
     1.38%  rav1e    rav1e               [.] rav1e::rdo::luma_chroma_mode_rdo::_$u7b$$u7b$closure$u7d$$u7d$::hb407f086d5946c11
     1.19%  rav1e    rav1e               [.] rav1e::quantize::QuantizationContext::update
     1.08%  rav1e    rav1e               [.] rav1e::asm::x86::quantize::dequantize_avx2
     0.94%  rav1e    rav1e               [.] rav1e::partition::get_intra_edges
     0.83%  rav1e    rav1e               [.] rav1e::context::block_unit::BlockContext::get_txb_ctx
     0.77%  rav1e    rav1e               [.] rav1e::predict::PredictionMode::predict_inter_single
     0.77%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::daala_fdct32
     0.73%  rav1e    rav1e               [.] rav1e::encoder::write_tx_blocks
     0.72%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::daala_fdct_ii_16
     0.66%  rav1e    rav1e               [.] rav1e::context::block_unit::<impl rav1e::context::cdf_context::ContextWriter>::fill_neighbours_ref_counts
     0.65%  rav1e    rav1e               [.] rav1e::rdo::rdo_tx_size_type
     0.60%  rav1e    rav1e               [.] core::cmp::PartialOrd::le
     0.60%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::daala_fdct8
     0.57%  rav1e    rav1e               [.] rav1e::encoder::motion_compensate
     0.55%  rav1e    rav1e               [.] rav1e::transform::forward_shared::Txfm2DFlipCfg::fwd
     0.54%  rav1e    rav1e               [.] rav1e::asm::x86::transform::inverse::inverse_transform_add
     0.53%  rav1e    libc-2.30.so        [.] __memset_avx2_unaligned_erms
     0.49%  rav1e    rav1e               [.] rav1e::predict::PredictionMode::predict_inter
     0.47%  rav1e    rav1e               [.] rav1e::encoder::write_tx_tree
     0.47%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::daala_fdct64
     0.46%  rav1e    rav1e               [.] rav1e::encoder::encode_block_pre_cdef
     0.45%  rav1e    rav1e               [.] rav1e::asm::x86::lrf::sgrproj_box_ab_r1_avx2

Using git hash 351f7c8 (after log)

after $ perf record release/rav1e Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m -o /dev/null --tiles 4 --threads 4 --limit 64 -s0
Couldn't synthesize bpf events.
>  Using y4m decoder: 1920x1080p @ 30/1 fps, 4:2:0, 8-bit
>  Encoding settings: keyint_min=12 keyint_max=240 quantizer=100 bitrate=0 min_quantizer=0 low_latency=false tune=Psychovisual rdo_lookahead_frames=40 min_block_size=4x4 max_block_size=64x64 multiref=true fast_deblock=false reduced_tx_set=false tx_domain_distortion=false tx_domain_rate=false encode_bottomup=true rdo_tx_decision=true prediction_modes=Complex-All include_near_mvs=true no_scene_detection=false cdef=true use_satd_subpel=true non_square_partition=true enable_timing_info=false fine_directional_intra=true
>  CPU Feature Level: AVX2
>  Using 4 tiles (2x2)
>  encoded 64/64 frames, 0.113 fps, 2342.27 Kb/s, est. size: 0.60 MB, est. time: 0s,  elap. time: 9m 26s                         
>  ----------
>  Key frame:             1 | avg QP:  79.00 | avg size:  104170 B
>  Inter frame:          63 | avg QP: 138.00 | avg size:    8260 B
>  Intra only frame:      0 | avg QP:   0.00 | avg size:       0 B
>  Switching frame:       0 | avg QP:   0.00 | avg size:       0 B
>  ----
[ perf record: Woken up 1084 times to write data ]
[ perf record: Captured and wrote 273.064 MB perf.data (7157790 samples) ]

after $ perf report

# Overhead  Command  Shared Object       Symbol                                                                                                                                                                                                                                               
# ........  .......  ..................  .....................................................................................................................................................................................................................................................
#
    13.75%  rav1e    rav1e               [.] rav1e::encoder::encode_tx_block
    12.39%  rav1e    rav1e               [.] <rav1e::ec::WriterBase<S> as rav1e::ec::Writer>::symbol_with_update
     8.73%  rav1e    rav1e               [.] rav1e::rdo::cdef_dist_wxh_8x8
     6.73%  rav1e    rav1e               [.] rav1e::context::block_unit::<impl rav1e::context::cdf_context::ContextWriter>::write_coeffs_lv_map
     4.50%  rav1e    libc-2.30.so        [.] __memmove_avx_unaligned_erms
     4.07%  rav1e    rav1e               [.] rav1e::context::transform_unit::<impl rav1e::context::cdf_context::ContextWriter>::get_nz_map_contexts
     4.01%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::forward_transform_avx2
     2.77%  rav1e    rav1e               [.] rav1e::encoder::encode_block_post_cdef
     2.54%  rav1e    rav1e               [.] rav1e::rdo::compute_distortion
     2.43%  rav1e    rav1e               [.] <rav1e::ec::WriterBase<S> as rav1e::ec::Writer>::bit
     2.09%  rav1e    rav1e               [.] rav1e::rdo::luma_chroma_mode_rdo::_$u7b$$u7b$closure$u7d$$u7d$::h07a0f88a7eb12675
     1.73%  rav1e    rav1e               [.] __rust_probestack
     1.46%  rav1e    rav1e               [.] rav1e::quantize::QuantizationContext::update
     1.22%  rav1e    rav1e               [.] rav1e::asm::x86::quantize::dequantize_avx2
     1.07%  rav1e    rav1e               [.] rav1e::partition::get_intra_edges
     1.05%  rav1e    rav1e               [.] rav1e::rdo::rdo_tx_size_type
     1.00%  rav1e    rav1e               [.] rav1e::context::transform_unit::<impl rav1e::context::cdf_context::ContextWriter>::write_tx_type
     0.94%  rav1e    rav1e               [.] rav1e::context::block_unit::BlockContext::get_txb_ctx
     0.87%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::daala_fdct32
     0.85%  rav1e    rav1e               [.] rav1e::predict::PredictionMode::predict_inter_single
     0.84%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::daala_fdct_ii_16
     0.74%  rav1e    rav1e               [.] rav1e::context::block_unit::<impl rav1e::context::cdf_context::ContextWriter>::fill_neighbours_ref_counts
     0.70%  rav1e    rav1e               [.] rav1e::asm::x86::transform::forward::daala_fdct8
     0.69%  rav1e    rav1e               [.] core::cmp::PartialOrd::le
     0.62%  rav1e    libc-2.30.so        [.] __memset_avx2_unaligned_erms
     0.60%  rav1e    rav1e               [.] rav1e::transform::forward_shared::Txfm2DFlipCfg::fwd
     0.60%  rav1e    rav1e               [.] alloc::raw_vec::RawVec<T,A>::reserve
     0.58%  rav1e    rav1e               [.] rav1e::encoder::write_tx_blocks
     0.58%  rav1e    rav1e               [.] rav1e::encoder::motion_compensate
     0.57%  rav1e    rav1e               [.] rav1e::asm::x86::transform::inverse::inverse_transform_add
     0.55%  rav1e    rav1e               [.] rav1e::encoder::write_tx_tree

@lu-zero
Collaborator

lu-zero commented Feb 4, 2021

On Broadwell, this is what I get, averaged over 5 runs:

rav1e --threads 16 --tiles 16 -l 20 -s 0 -o ~/Encoded/Bosphorus_1920x1080_120fps_420_8bit_YUV-rav1e-0.4.0-0-l20.ivf ~/Samples/{sample} -y

image

@lu-zero
Collaborator

lu-zero commented Feb 4, 2021

Same settings on the honeycomb-lx2k:

image

@barrbrain
Collaborator

 4.50%  rav1e    libc-2.30.so        [.] __memmove_avx_unaligned_erms

How much of this is from BlockContext::rollback(BlockContextCheckpoint)?
Are there more opportunities to recover some cycles here?

@lu-zero
Collaborator

lu-zero commented Feb 8, 2021

The BlockContext is another candidate but is much smaller.

@barrbrain
Collaborator

barrbrain commented Feb 10, 2021

PR #2667 should resolve the final significant group of calls to memmove. My testing with linux-perf showed 0.4% remained in memmove after applying those changes.
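One general way to avoid whole-struct copies on rollback is an undo log: record each overwritten value as it changes, and replay the log backwards to restore. The sketch below illustrates that idea only; it is not confirmed to be what PR #2667 or the commits mentioned in this thread do, and `Context` with its fields is a hypothetical type invented for the example.

```rust
/// Hypothetical context: a flat table standing in for encoder state.
struct Context {
    table: Vec<u16>,
    /// Undo log of (index, previous value) pairs, appended on every write.
    log: Vec<(usize, u16)>,
}

impl Context {
    fn new(len: usize) -> Self {
        Context { table: vec![0; len], log: Vec::new() }
    }

    /// A checkpoint is just the current log length; nothing is copied.
    fn checkpoint(&self) -> usize {
        self.log.len()
    }

    /// Record the old value before overwriting it.
    fn set(&mut self, idx: usize, value: u16) {
        self.log.push((idx, self.table[idx]));
        self.table[idx] = value;
    }

    /// Undo writes newest-first until the log is back at the checkpoint.
    fn rollback(&mut self, checkpoint: usize) {
        while self.log.len() > checkpoint {
            let (idx, old) = self.log.pop().unwrap();
            self.table[idx] = old;
        }
    }
}

fn main() {
    let mut ctx = Context::new(8887);
    ctx.set(3, 7);
    let cp = ctx.checkpoint();
    ctx.set(3, 9); // trial writes after the checkpoint
    ctx.set(100, 1);
    ctx.rollback(cp); // cost is proportional to the number of trial writes
    assert_eq!(ctx.table[3], 7);
    assert_eq!(ctx.table[100], 0);
}
```

The trade-off is that every write pays a small logging cost, so this only wins when a trial touches far fewer entries than the struct holds.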

@negge
Collaborator Author

negge commented Feb 18, 2021

Closing. Thanks to everyone who worked on this.

@negge negge closed this as completed Feb 18, 2021