Performance

Under construction - currently mostly brainstorming

General

  • Performance is usually not composable! Functions contain more branches, and there is more risk of confusing the optimizer.
  • Good explanation of what (premature) optimization means: https://youtu.be/1DuMvpwWHH4?t=777
  • If you are serious about performance, performance tests can fail your build (in CI, or nightly at least).
  • Division operations are expensive (up to 92 cycles on 64-bit x86) and therefore should not be done in microbenchmarks [1]
  • Benefits of low allocation rates, higher cache utilization: https://youtu.be/vZngvuXk7PM?t=758
  • Average Latency * Throughput: https://t.co/DR7Od7IrRb at 28:14
  • Big lie: "Normal" Distributions [1]
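One common way to keep a division off a hot path is to size buffers as a power of two and replace the modulo with a bit mask. The sketch below is illustrative only: the class name `MaskedIndex` is hypothetical, and it assumes a power-of-two capacity and a non-negative sequence number.

```java
// Sketch: replacing a modulo (which divides) with a bit mask.
// Assumptions: capacity is a power of two, sequence is non-negative.
final class MaskedIndex {
    private final int mask;

    MaskedIndex(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        this.mask = capacity - 1;
    }

    /** Same result as sequence % capacity for non-negative input, without dividing. */
    int indexOf(long sequence) {
        return (int) (sequence & mask);
    }

    public static void main(String[] args) {
        MaskedIndex index = new MaskedIndex(8);
        for (long sequence = 0; sequence < 20; sequence++) {
            System.out.println(sequence + " -> slot " + index.indexOf(sequence));
        }
    }
}
```

Whether this pays off in a given program still has to be measured; the JIT may already turn a modulo by a constant power of two into cheaper instructions.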

Exceptions: This might be surprising to some people, since undoubtedly everyone has heard that “exceptions are slow.” It turns out that they don’t have to be. And, when done right, they get error handling code and data off hot paths which increases I-cache and TLB performance.

Performance metrics

  • Throughput
  • Latency
  • Service Time
  • Response Time
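Latency and response time are usually better summarized by percentiles than by a mean, because the samples are rarely normally distributed. The following is a minimal sketch using made-up sample values; the class name `LatencySummary` and the nearest-rank percentile definition are choices made for this illustration, not something prescribed by the talks linked above.

```java
import java.util.Arrays;

// Sketch: mean versus percentiles for a skewed set of response times.
final class LatencySummary {

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Hypothetical response times in microseconds: 98 fast requests and 2 slow ones.
        long[] micros = new long[100];
        Arrays.fill(micros, 100);
        micros[98] = 20_000;
        micros[99] = 50_000;

        System.out.println("mean = " + mean(micros) + " us");
        System.out.println("p50  = " + percentile(micros, 50) + " us");
        System.out.println("p99  = " + percentile(micros, 99) + " us");
        System.out.println("max  = " + percentile(micros, 100) + " us");
    }
}
```

With these samples the mean sits far above what a typical request sees and far below what the slowest requests see, which is why a single average tells you little about either.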

Design

Typical performance numbers

https://www.youtube.com/watch?v=vZngvuXk7PM

| Data passing | Latency | Light over a fibre | Throughput |
|---|---|---|---|
| Method call | inlined: 0, real: 50 ns | 10 m | 20,000,000/sec |
| Shared memory | 200 ns | 40 m | 5,000,000/sec |
| SysV shared memory | 2 µs | 400 m | 500,000/sec |
| Low latency network | 8 µs | 1.6 km | 125,000/sec |
| Typical LAN | 30 µs | 6 km | 30,000/sec |
| Typical data grid | 800 µs | 160 km | 1,250/sec |
| 4G request latency | 55 ms | 11,000 km | 18/sec |
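The last two columns can be derived from the latency column. The sketch below assumes that light covers about 200,000 km per second in optical fibre (a value not stated in the source, but consistent with the rows above) and that throughput means strictly serial round trips, i.e. 1 / latency. The class and method names are hypothetical.

```java
// Sketch: deriving "light over a fibre" and serial throughput from a latency.
final class LatencyBudget {
    // Assumption: roughly two thirds of the speed of light in vacuum.
    static final double FIBRE_METRES_PER_SECOND = 2.0e8;

    /** Distance light travels through fibre during the given latency. */
    static double fibreMetres(double latencySeconds) {
        return latencySeconds * FIBRE_METRES_PER_SECOND;
    }

    /** Upper bound on operations per second when each one waits for the previous one. */
    static double serialOpsPerSecond(double latencySeconds) {
        return 1.0 / latencySeconds;
    }

    public static void main(String[] args) {
        String[] names = {"Method call (real)", "Shared memory", "Low latency network", "4G request"};
        double[] latencies = {50e-9, 200e-9, 8e-6, 55e-3};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-20s %14.1f m %16.0f ops/sec%n",
                    names[i], fibreMetres(latencies[i]), serialOpsPerSecond(latencies[i]));
        }
    }
}
```

Some figures in the table appear to be rounded: 1 / 30 µs is closer to 33,000 per second than to 30,000.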

Methodology

https://youtu.be/Zw_z7pjis7k?list=WL&t=3214

  1. What prevents the app from going faster: Monitoring
  2. Where it resides: Profiling
  3. How to change it so that it stops messing with performance

Response Time IPC

Deployment factors affecting response time: https://youtu.be/gsJztZkhQUQ?list=WL&t=2160

Key takeaway is: dedicated resource assignment and affinity

HW configuration:

  • with multiple network cards, latency can be improved by affinitizing traffic to a particular card and directing it to specific processors
  • connection via InfiniBand or fiber optic instead of Ethernet

Native OS:

  • within single OS image: take advantage of loopback traffic optimizations
  • across multiple OS images: be careful about traffic routing

Virtual OS:

  • good to affinitize VM to cores instead of free floating
  • be careful about IO across virtual OS images

Process level:

  • affinitized to a socket delivers more consistent response time
  1. Reduce communication as much as possible
  2. Send data in big chunks
  3. Combine messages in batches
  4. If you care about throughput, use an async model
  5. If you care about consistency, you have to use sync commit protocols
  6. Re-use resources (connection channels, threads, etc.); do not create new ones for each message
  7. Check and log errors
  8. Use frameworks when appropriate (gives consistency)
  9. Profile and tune
  10. Don't believe it when someone says "it will never happen"
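Points 2 and 3 in the list above (big chunks, batches) can be sketched as a small batcher that hands messages to a sink only once a batch is full. This is a minimal, single-threaded illustration: `MessageBatcher` and its `sink` callback are hypothetical names, and a real sender would probably also flush on a timer so that a half-full batch does not wait forever.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch: combine individual messages into batches before sending them.
// Not thread-safe; intended to be owned by a single sending thread.
final class MessageBatcher<T> {
    private final int maxBatchSize;
    private final Consumer<List<T>> sink;
    private List<T> pending;

    MessageBatcher(int maxBatchSize, Consumer<List<T>> sink) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be at least 1");
        }
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
        this.pending = new ArrayList<>(maxBatchSize);
    }

    /** Queues a message and sends the batch once it is full. */
    void add(T message) {
        pending.add(message);
        if (pending.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Sends whatever is pending, for example on shutdown or on a timer. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>(maxBatchSize);
        sink.accept(batch);
    }

    public static void main(String[] args) {
        MessageBatcher<String> batcher =
                new MessageBatcher<>(3, batch -> System.out.println("sending " + batch));
        for (int i = 0; i < 7; i++) {
            batcher.add("msg-" + i);
        }
        batcher.flush(); // send the remaining partial batch
    }
}
```

The batch size trades latency for throughput: larger batches amortize the per-send cost, but the first message in a batch waits longer.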

Pauses

https://youtu.be/vZngvuXk7PM?t=795

Causes for pauses

  • IO delays (seeks and writes)
  • OS interrupts (5ms is not uncommon, most of the pauses are from the OS)
  • Lock contention
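Pauses of this kind can be made visible without touching the application: a thread that does nothing but sleep for a short, fixed interval should wake up on time, so any large overshoot points at the OS, the JVM or the hardware rather than at application code. The sketch below is an assumption-laden illustration of that idea (the class name `PauseDetector` and the 1 ms interval are arbitrary), similar in spirit to tools such as jHiccup.

```java
import java.util.concurrent.locks.LockSupport;

// Sketch: measure how late a sleeping thread wakes up.
final class PauseDetector {

    /** Parks repeatedly for intervalNanos and returns the worst observed overshoot in nanoseconds. */
    static long worstOvershootNanos(long intervalNanos, int iterations) {
        long worst = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            LockSupport.parkNanos(intervalNanos);
            long overshoot = System.nanoTime() - start - intervalNanos;
            // parkNanos may return early, so negative overshoots are ignored.
            worst = Math.max(worst, overshoot);
        }
        return worst;
    }

    public static void main(String[] args) {
        long worst = worstOvershootNanos(1_000_000L, 1_000); // 1 ms interval, about 1 s in total
        System.out.println("worst overshoot: " + worst / 1_000 + " us");
    }
}
```

Running it on a loaded machine and on an idle, pinned core gives a rough feeling for how much jitter the platform contributes by itself.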

Flame Graphs

Profilers

  1. System profilers: like Linux perf, which shows system code paths (e.g. JVM GC, syscalls, TCP), but not Java methods
  2. JVM profilers: like hprof, LJP, and commercial profilers. These show Java methods, but usually not system code paths

Hardware

  • Power control
  • Caching, ICache, DCache
  • TLB
  • QPI (Intel QuickPath Interconnect)
  • Cache subsystem: L1, L2, L3, QPI, bandwidth; >90% of the chip is cache

10 Gbit/s Ethernet (1.25 GByte/s) vs QPI at 20 GByte/s? Is Ethernet only 16 times slower?

Tooling

People to learn from

Code to learn from

  • JCTools
  • Disruptor
  • Fast Flow
  • Aeron
  • Akka

Books, Literature, etc

| Name | Author | Platform | Rating | Description |
|---|---|---|---|---|
| http://www.lighterra.com/papers/modernmicroprocessors/ | Patterson | HW | 10 | Best intro to microprocessors I know |
| Bad Concurrency | Barker | Java/OS/HW | 9 | Michael Barker's blog |
| Nanotrusting the Nanotime | Shipilev | Java- | 10 | Best description of time in Java and on modern hardware |
| https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks | David Holmes | Java- | 8 | - |
| https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.a3ax4ucvz | Mike Hearn | Java/GO | 8 | A critique of the claims about Go's GC plus a good intro to modern GC |
| Your Load Generator Is Probably Lying To You - Take The Red Pill And Find Out Why | High Scalability | - | - | Talks about Coordinated Omission |
| https://lwn.net/Articles/252125/ | - | - | - | Easy to get but very long explanation about CPU caches |
| Not all operations are equal | - | - | - | |

https://groups.google.com/forum/#!forum/mechanical-sympathy

Videos

| Name | Recorded | Speaker | Platform | Rating | Description |
|---|---|---|---|---|---|
| How NOT to Measure Latency | - | Tene (Azul) | - | 10 | Must watch! |
| The Art of Java Benchmarking | Oredev 2013 | Shipilev | Java | 10 | Must watch! For everyone benchmarking Java |
| JMH the lesser of two evils | Devoxx 2013 | Shipilev | Java | 9 | Similar to The Art of Java Benchmarking |
| Performance Methodology I | Devoxx 2012 | Shipilev & Pepperdine | Java+ | 10 | Very good |
| Performance Methodology II | Devoxx 2012 | Shipilev & Pepperdine | Java+ | 10 | Very good |
| LMAX - How to Do 100K TPS at Less than 1ms Latency | QCon SF 2010 | Barker & Thompson (LMAX) | Java | 9 | Classic one about the Disruptor |
| Java at the Cutting Edge: Stuff I Learned about Performance | JVM Users Auckland | Barker (LMAX) | Java | - | Watched long ago, but think it was good |
| Benchmarking: You're Doing It Wrong | Strangeloop 2014 | Greenberg (Google) | - | 8 | - |
| Taming the 9s | Strangeloop 2014 | Weisberg (VoltDB) | Java/C++ | - | - |
| When the OS gets in the way | Strangeloop 2015 | Price (LMAX) | Java/OS/HW | 9 | Explains importance of thread pinning etc. |
| Mythbusting Modern Hardware to Gain 'Mechanical Sympathy' - https://www.youtube.com/watch?v=MC1EKLQ2Wmg&t=4s | Goto 2012 | Thompson | - | 8 | Clears up some common misconceptions about HW |
| JVM Profiling pitfalls | - | Nitsan Wakart | Java | 9 | Getting deep into profiling |
| Performance Testing Java Applications | Devoxx UK 2013 | Thompson | Java | 8 | One thing to note: don't write your own benchmark harness, use JMH! |
| A Crash Course in Modern Hardware | Devoxx | Cliff Click | HW,OS,JVM,Java | 8 | Really a crash course but still quite good |
| CPU caches and why you care | code::dive conference 2014 | Scott Meyers | HW,C++ | 9 | Classic one about caches, must watch |
| Deep Dive Performance | ? | Nitsan Wakart | Java | 6 | 3 talks - some good, some not so good |
| The Illusion of Execution | JFokus 2015 | Nitsan Wakart | Java | 10 | Nice deep dive |
| CON1517 An Introduction to JVM Performance | JavaOne 2015 | Rafael Winterhalter | Java | 9 | Very good and practical |
| Caching in: understand, measure, and use your CPU Cache more effectively | JavaOne 2015 | Richard Warburton | HW | 9 | Easy intro |
| Data Oriented Design | CppCon 2014 | Mike Acton | Cpp | 8 | Designing code based on its data, very low level |
| Life of a Twitter JVM engineer | Devoxx 2016 | Tony Printezis | JVM | 8+ | Mostly GC problems |
| Writing Fast Code I | Code::dive 2015 | Andrei Alexandrescu | C++/HW | 9 | Low level |
| Writing Fast Code II | Code::dive 2015 | Andrei Alexandrescu | C++/HW | 9 | Low level |
| Optimization Tips - Mo' Hustle Mo' Problems | CppCon 2014 | Andrei Alexandrescu | - | 8 | Very low level |