Under construction - currently mostly brainstorming
- Performance is usually not composable! Functions contain more branches, and there is more risk of confusing the optimizer.
- Good explanation of what (premature) optimization means: https://youtu.be/1DuMvpwWHH4?t=777
- If you are serious about performance, performance tests can fail your build (in CI, or nightly at least).
- Division operations are expensive (up to 92 cycles on 64-bit x86) and therefore should not be done in microbenchmarks.
- Benefits of low allocation rates, higher cache utilization: https://youtu.be/vZngvuXk7PM?t=758
- Average Latency * Throughput: https://t.co/DR7Od7IrRb at 28:14
- Big lie: "Normal" distributions
Exceptions: This might be surprising to some people, since undoubtedly everyone has heard that “exceptions are slow.” It turns out that they don’t have to be. And, when done right, they get error handling code and data off hot paths which increases I-cache and TLB performance.
- Throughput
- Latency
- Service Time
- Response Time
- Amdahl's law
- Batching
- Little's law
- Law of diminishing returns
- Numbers every programmer should know
- Register access
- L1 access
- L2 access
- L3 access
- RAM access
- Sequential SSD
- Sequential HDD
- Network access
- Mutex lock/unlock
- CPU pipeline stall
- Roundtrip 10Gbit kernel bypass
- Contention cost
- QPI (Intel QuickPath Interconnect)
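
Two of the laws in the list above reduce to one-line formulas. The following is a minimal sketch, not taken from any of the linked talks; the function names and the sample numbers (95% parallel work, 10,000 requests/sec, 5 ms latency) are assumptions chosen for illustration. Amdahl's law also shows the law of diminishing returns: adding workers helps less and less once the serial part dominates.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: overall speedup when only a fraction of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def littles_law_in_flight(throughput_per_sec: float, avg_latency_sec: float) -> float:
    """Little's law: average number of requests in the system = throughput * average latency."""
    return throughput_per_sec * avg_latency_sec


if __name__ == "__main__":
    # Illustrative numbers only: with 5% serial work the speedup can never exceed 20x.
    for workers in (2, 8, 64, 1024):
        print(f"{workers:>4} workers -> speedup {amdahl_speedup(0.95, workers):.2f}x")
    # 10,000 requests/sec at 5 ms average latency.
    print("requests in flight:", littles_law_in_flight(10_000, 0.005))
```
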
https://www.youtube.com/watch?v=vZngvuXk7PM
Data passing | Latency | Distance light travels in a fibre | Throughput |
---|---|---|---|
Method call | inlined: 0, real: 50 ns | 10 m | 20,000,000/sec |
Shared memory | 200 ns | 40 m | 5,000,000/sec |
SysV shared memory | 2 µs | 400 m | 500,000/sec |
Low latency network | 8 µs | 1.6 km | 125,000/sec |
Typical LAN | 30 µs | 6 km | 30,000/sec |
Typical data grid | 800 µs | 160 km | 1,250/sec |
4G request latency | 55 ms | 11,000 km | 18/sec |
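
The distance and throughput columns appear to follow directly from the latency column. The sketch below reproduces them under two assumptions that are not stated in the source: light travels at roughly 200,000 km/s in a fibre, and throughput means strictly sequential requests (one request in flight at a time, so throughput = 1 / latency). The names are made up for this example.

```python
# Assumption: light in a fibre travels at about 2/3 of its vacuum speed.
FIBRE_LIGHT_SPEED_M_PER_S = 2.0e8

# Latencies in seconds, copied from the table above.
LATENCIES_S = {
    "Method call (real)": 50e-9,
    "Shared memory": 200e-9,
    "SysV shared memory": 2e-6,
    "Low latency network": 8e-6,
    "Typical LAN": 30e-6,
    "Typical data grid": 800e-6,
    "4G request latency": 55e-3,
}


def fibre_distance_m(latency_s: float) -> float:
    """How far light gets in a fibre while one request is being served."""
    return latency_s * FIBRE_LIGHT_SPEED_M_PER_S


def sequential_throughput_per_s(latency_s: float) -> float:
    """Requests per second when each request waits for the previous one."""
    return 1.0 / latency_s


if __name__ == "__main__":
    for name, latency in LATENCIES_S.items():
        print(f"{name:<22} {fibre_distance_m(latency):>14,.1f} m "
              f"{sequential_throughput_per_s(latency):>14,.0f}/sec")
```

Under these assumptions the "Typical LAN" row comes out at about 33,333/sec, so the table seems to round that value down to 30,000/sec.
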
https://youtu.be/Zw_z7pjis7k?list=WL&t=3214
- What prevents the app from going faster: Monitoring
- Where the problem resides: Profiling
- How to change it so that it stops messing with perf
Deployment factors affecting response time: https://youtu.be/gsJztZkhQUQ?list=WL&t=2160
Key takeaway is: dedicated resource assignment and affinity
HW configuration:
- multiple network cards: latency can be improved by affinitizing traffic to a particular card and directing it to specific processors
- connection via InfiniBand or fiber optics instead of Ethernet
Native OS:
- within single OS image: take advantage of loopback traffic optimizations
- across multiple OS images: careful about traffic routing
Virtual OS:
- good to affinitize VM to cores instead of free floating
- be careful about IO across virtual OS images
Process level:
- affinitizing a process to a socket delivers more consistent response times
- Reduce communication as much as possible
- Send data in big chunks
- Combine messages in batches
- If you care about throughput, use an async model
- If you care about consistency, you have to use sync commit protocols
- Re-use resources (connection channels, threads, etc.); do not create new ones for each message
- Check and log errors
- Use frameworks when appropriate (gives consistency)
- Profile and tune
- Don't believe it when someone says "it will never happen"
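
The advice above to combine messages in batches and to re-use one channel can be sketched as follows. This is only an illustration: `CountingChannel` is a hypothetical stand-in for a real connection that does nothing but count calls, and `batches` and `send_all` are names invented for this example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(messages: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size messages."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(messages)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class CountingChannel:
    """Hypothetical channel: records how many sends happened and how many messages arrived."""

    def __init__(self) -> None:
        self.sends = 0
        self.delivered = 0

    def send(self, payload: List[bytes]) -> None:
        self.sends += 1
        self.delivered += len(payload)


def send_all(channel: CountingChannel, messages: Iterable[bytes], batch_size: int) -> None:
    """Send every message over the same channel, batch_size messages per send call."""
    for batch in batches(messages, batch_size):
        channel.send(batch)


if __name__ == "__main__":
    channel = CountingChannel()
    send_all(channel, [b"msg"] * 1000, batch_size=64)
    print(f"{channel.delivered} messages in {channel.sends} sends")
```

With a real transport each send carries a fixed per-call cost (syscall, framing, round trip), so fewer, larger sends usually raise throughput at the price of some added latency for the first message in each batch.
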
https://youtu.be/vZngvuXk7PM?t=795
Causes for pauses
- IO delays (seeks and writes)
- OS interrupts (5ms is not uncommon, most of the pauses are from the OS)
- Lock contention
- http://techblog.netflix.com/2015/07/java-in-flames.html
- http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
- http://www.slideshare.net/brendangregg/java-performance-analysis-on-linux-with-flame-graphs
- Containerized: http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
- My first flame graph
- System profilers: like Linux perf, which shows system code paths (eg, JVM GC, syscalls, TCP), but not Java methods
- JVM profilers: like hprof, LJP, and commercial profilers. These show Java methods, but usually not system code paths
- Power control
- Caching, ICache, DCache
- TLB
- QPI (Intel QuickPath Interconnect)
- Cache subsystem: L1, L2, L3, QPI, bandwidth; >90% of the chip is cache
10 Gbit/s Ethernet vs. QPI at 20 GByte/s? Is Ethernet only 16 times slower?
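
The factor of 16 in the question above is plain unit conversion, sketched below. The 20 GByte/s QPI figure is taken from the note as is and has not been checked against Intel's specifications; the variable names are made up for this example.

```python
BITS_PER_BYTE = 8


def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a link speed from Gbit/s to GByte/s."""
    return gbit_per_s / BITS_PER_BYTE


ethernet_gbyte_per_s = gbit_to_gbyte_per_s(10.0)  # 10 Gbit/s Ethernet
qpi_gbyte_per_s = 20.0  # figure from the note above, not verified
bandwidth_ratio = qpi_gbyte_per_s / ethernet_gbyte_per_s

if __name__ == "__main__":
    print(f"Ethernet: {ethernet_gbyte_per_s} GByte/s, QPI is {bandwidth_ratio}x faster")
```

This compares raw bandwidth only; it says nothing about latency, where the gap between an on-board interconnect and a network link is likely much larger.
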
- JMH
- Gatling
- wrk2
- tcpkali
- jrt-socket-response-tool
- top: CPU should be split into instructions retired and memory stalls
- http://highscalability.com/blog/2015/5/27/a-toolkit-to-measure-basic-system-performance-and-os-jitter.html
- HdrHistogram
- Gil Tene, Nitsan Wakart, Cliff Click, basically everyone from Azul Systems
- Martin Thompson
- Doug Lea
- Aleksey Shipilev
- Norman Maurer
- Todd L. Montgomery
- Jonas Boner, Viktor Klang, Konrad Malawski, people from Typesafe/Lightbend
- Andrei Alexandrescu
- Herb Sutter
- Scott Meyers
- sustrik (ZeroMQ inventor) (http://250bpm.com/)
- Brendan D. Gregg (http://www.brendangregg.com)
- Richard Warburton (http://www.insightfullogic.com/)
- https://blog.codecentric.de/
- JCTools
- Disruptor
- Fast Flow
- Aeron
- Akka
Name | Author | Platform | Rating | Description |
---|---|---|---|---|
http://www.lighterra.com/papers/modernmicroprocessors/ | Patterson | HW | 10 | Best intro to microprocessors I know |
Bad Concurrency | Barker | Java/OS/HW | 9 | Michael Barker's blog |
Nanotrusting the Nanotime | Shipilev | Java- | 10 | Best description of time in java and on modern hardware |
https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks | David Holmes | Java- | 8 | - |
https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.a3ax4ucvz | Mike Hearn | Java/Go | 8 | A critique of the claims about Go's GC plus a good intro to modern GC |
Your Load Generator Is Probably Lying To You - Take The Red Pill And Find Out Why | High Scalability | - | - | Talks about Coordinated Omission |
https://lwn.net/Articles/252125/ | - | - | - | Easy to follow but very long explanation of CPU caches |
Not all operations are equal | - | - | - | - |
https://groups.google.com/forum/#!forum/mechanical-sympathy
Name | Recorded | Speaker | Platform | Rating | Description |
---|---|---|---|---|---|
How NOT to Measure Latency | - | Tene (Azul) | - | 10 | Must watch! |
The Art of Java Benchmarking | Øredev 2013 | Shipilev | Java | 10 | Must watch for everyone benchmarking Java! |
JMH the lesser of two evils | Devoxx 2013 | Shipilev | Java | 9 | Similar to The Art of Java Benchmarking |
Performance Methodology I | Devoxx 2012 | Shipilev & Pepperdine | Java+ | 10 | Very good |
Performance Methodology II | Devoxx 2012 | Shipilev & Pepperdine | Java+ | 10 | Very good |
LMAX - How to Do 100K TPS at Less than 1ms Latency | QCon SF 2010 | Barker & Thompson (LMAX) | Java | 9 | Classic one about the Disruptor |
Java at the Cutting Edge: Stuff I Learned about Performance | JVM Users Auckland | Barker (LMAX) | Java | - | Watched long ago, but I think it was good |
Benchmarking: You're Doing It Wrong | Strangeloop 2014 | Greenberg (Google) | - | 8 | - |
Taming the 9s | Strangeloop 2014 | Weisberg (VoltDB) | Java/C++ | - | - |
When the OS gets in the way | Strangeloop 2015 | Price (LMAX) | Java/OS/HW | 9 | Explains the importance of thread pinning etc. |
Mythbusting Modern Hardware to Gain 'Mechanical Sympathy' - https://www.youtube.com/watch?v=MC1EKLQ2Wmg&t=4s | Goto 2012 | Thompson | - | 8 | Clears up some common misconceptions about HW |
JVM Profiling pitfalls | - | Nitsan Wakart | Java | 9 | Getting deep into profiling |
Performance Testing Java Applications | Devoxx UK 2013 | Thompson | Java | 8 | One thing to note: Don't write your own benchmark harness, use JMH! |
A Crash Course in Modern Hardware | Devoxx | Cliff Click | HW,OS,JVM,Java | 8 | Really a crash course but still quite good |
CPU caches and why you care | code::dive conference 2014 | Scott Meyers | HW,C++ | 9 | Classic one about caches, must watch |
Deep Dive Performance | ? | Nitsan Wakart | Java | 6 | 3 talks - some good some not so good |
The Illusion of Execution | JFokus 2015 | Nitsan Wakart | Java | 10 | Nice deep dive |
CON1517 An Introduction to JVM Performance | JavaOne 2015 | Rafael Winterhalter | Java | 9 | Very good and practical |
Caching in: understand, measure, and use your CPU Cache more effectively | JavaOne 2015 | Richard Warburton | HW | 9 | Easy intro |
Data Oriented Design | CppCon 2014 | Mike Acton | C++ | 8 | Designing code based on its data, very low level |
Life of a Twitter JVM engineer | Devoxx 2016 | Tony Printezis | JVM | 8+ | Mostly GC problems |
Writing Fast Code I | Code::dive 2015 | Andrei Alexandrescu | C++/HW | 9 | Low level |
Writing Fast Code II | Code::dive 2015 | Andrei Alexandrescu | C++/HW | 9 | Low level |
Optimization Tips - Mo' Hustle Mo' Problems | CppCon 2014 | Andrei Alexandrescu | - | 8 | Very low level |