Copyright (c) 2013, Nokia Siemens Networks
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Nokia Siemens Networks nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
====================================================================================================
Table of Contents
====================================================================================================
 1. Open Event Machine
 2. Dependencies
 3. Installation
 4. Compile & Run
 5. Examples / Test Cases
 6. Changes
 7. Open Issues
 8. Packet I/O
 9. Timers
10. Notes
====================================================================================================
1. Open Event Machine (OpenEM or EM)
====================================================================================================
License:
- OpenEM - See the beginning of this file or the event_machine/LICENSE file.
- Intel DPDK - See the DPDK source code.
Note: Read the Open Event Machine API, especially the event_machine.h file, for a more
thorough description of the functionality.
Open Event Machine is a lightweight, dynamically load balancing, run-to-completion,
event processing runtime targeting multicore SoC data plane environments.
This release of the Open Event Machine contains the OpenEM API as well as a SW implementation
for Intel multicore CPUs.
The Event Machine, developed by NSN, is a multicore optimized, high performance,
data plane processing concept based on asynchronous queues and event scheduling.
Applications are built from Execution Objects (EO), events and event queues.
The EO is a run-to-completion object that gets called at each event receive.
Communication is built around events and event queues. The EM scheduler selects
an event from a queue based on priority, and calls the EO's receive function
for event processing. The EO might send events to other EOs or HW-accelerators
for further processing or output. The next event for the core is selected only
after the EO has returned. The selected queue type, queue priority and queue
group affect the scheduling. Dynamic load balancing of work between the cores
is the default behaviour. Note that the application (or EO) is not tied to a
particular core; instead, the event processing for the EO is performed on any core
that is available at that time.
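As a rough illustration of the model described above, an EO receive function might look like
the sketch below. The names my_data_t, next_queue and the payload handling are hypothetical;
the receive-function prototype follows the common EM convention but should be verified against
event_machine.h.

    #include "event_machine.h"

    /* Hypothetical application payload carried in the event */
    typedef struct {
        int seq;
    } my_data_t;

    /* Hypothetical destination queue, created elsewhere with em_queue_create() */
    static em_queue_t next_queue;

    /* Sketch of a run-to-completion EO receive function (prototype per the
     * common EM receive-function convention - verify against event_machine.h) */
    static void
    my_eo_receive(void* eo_ctx, em_event_t event, em_event_type_t type,
                  em_queue_t queue, void* q_ctx)
    {
        my_data_t* data = em_event_pointer(event); /* access the event payload */

        data->seq++;                               /* ...process the event...   */

        /* Pass the event on to the next EO's queue; free it if the send fails */
        if (em_send(event, next_queue) != EM_OK)
            em_free(event);

        /* Returning hands control back to the dispatcher/scheduler */
        (void)eo_ctx; (void)type; (void)queue; (void)q_ctx;
    }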
The implementation of OpenEM for Intel CPUs in this package should NOT be considered a
"Reference OpenEM implementation", but rather an "Example of a SW-implementation of OpenEM
for Intel CPUs". Most of the functionality described in event_machine.h is supported by the Intel version.
Strict per-event load balanced and priority based scheduling has been relaxed in favor of performance:
e.g. events are scheduled in bursts rather than on an event-by-event basis, and priority scheduling is
performed separately for each queue group & type instead of always taking into account all queues/events
on the whole device. Only four (4) queue priority levels are used to speed up the scheduling.
OpenEM for Intel runs in Linux user space and supports two modes of operation:
- Each EM-core runs in a separate (single threaded) process = EM process-per-core mode.
- Each EM-core runs in a separate thread under a single process = EM thread-per-core mode.
The mode is selected at application startup with command line options:
-p for EM process-per-core mode
-t for EM thread-per-core mode
Note that the selected mode affects the application mainly in how shared memory can be used,
i.e. normal Linux pthread vs. process rules for data sharing & protection apply.
The provided example applications work in both modes.
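As an illustration, an application might pick the mode from the options remaining after the
'--' separator roughly as below (the bundled examples contain the real parsing; the helper
function here is hypothetical):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: select the EM mode from the APPL&EM options that
     * follow the '--' separator. Exactly one of -p or -t must be given. */
    static int parse_em_mode(int argc, char* argv[], int* proc_per_core)
    {
        int i, p = 0, t = 0;

        for (i = 0; i < argc; i++) {
            if (!strcmp(argv[i], "-p") || !strcmp(argv[i], "--process-per-core"))
                p = 1;
            else if (!strcmp(argv[i], "-t") || !strcmp(argv[i], "--thread-per-core"))
                t = 1;
        }

        if (p == t) { /* neither or both given */
            fprintf(stderr, "Select EITHER -p OR -t, but not both!\n");
            return -1;
        }

        *proc_per_core = p;
        return 0;
    }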
Pointers can be shared/passed between EM-cores, even in the multi-process mode, because DPDK and the
forking of each process allow all processes to see the same virt<->phys memory mappings.
====================================================================================================
2. Dependencies
====================================================================================================
To be able to build and run the Open Event Machine for Intel, you need the following
environment, tools and SW packages:
Environment:
- x86_64 Intel multicore CPU, preferably a CPU that supports 1GB huge pages
/proc/cpuinfo - flags contain 'pdpe1gb'. 2MB huge page support also works but the
DPDK config might have to be tweaked a bit (more memory segments etc).
- Linux with Address Space Layout Randomization (ASLR) preferably turned off (see DPDK docs).
(set /proc/sys/kernel/randomize_va_space to '0')
- 1GE or 10GE NICs supported by the Intel DPDK package (if using packet I/O)
Tools:
- Compilers: GCC or ICC
- GNU Make
SW packages:
- Intel DPDK 1.3.1-7 (DPDK source code and documents: http://www.intel.com/go/dpdk) - DPDK
docs provide a more thorough description about the environment requirements.
Note: The setup used in testing:
- Intel Sandy Bridge based Xeon E5 multicore processor
- Intel 82599 10 Gigabit Ethernet Controller
- Ubuntu 11.04, kernel 2.6.38-11-generic (kernel options added):
default_hugepagesz=1G hugepagesz=1G hugepages=4
- GCC 4.5.2 or
- ICC 12.1.3 20120212, 12.1.6 20120821, 13.1.1 20130313
- GNU Make 3.81
- Intel DPDK 1.3.1-7
====================================================================================================
3. Installation
====================================================================================================
First install the needed tools and SW packages. Install the Intel DPDK software
according to the DPDK installation instructions.
The Open Event Machine package consists of source code and is installed simply
by extracting the source from the archive file.
Note: Linux 1GB huge pages are recommended. 2MB huge pages work but the number of pages needs
tweaking to provide enough memory. DPDK also likely needs considerably more memory segments
(set in the DPDK .config) to be able to map all the 2MB pages into DPDK memsegments.
====================================================================================================
4. Compile & Run
====================================================================================================
Changes to the DPDK standard config for OpenEM:
(#=original value, followed by used value)
Note: Especially the memory-related config values could be smaller, depending on the application.
#CONFIG_RTE_MAX_MEMZONE=2560
## 1M memzones
#CONFIG_RTE_MAX_MEMZONE=1048576
## 128k memzones
CONFIG_RTE_MAX_MEMZONE=131072
#CONFIG_RTE_LIBEAL_USE_HPET=y
CONFIG_RTE_LIBEAL_USE_HPET=n
#CONFIG_RTE_LIBRTE_IGB_PMD=y
CONFIG_RTE_LIBRTE_IGB_PMD=n
#CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE=512
#CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE=1024
CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE=2048
#CONFIG_RTE_MBUF_SCATTER_GATHER=y
#CONFIG_RTE_MBUF_REFCNT_ATOMIC=y
CONFIG_RTE_MBUF_SCATTER_GATHER=n
CONFIG_RTE_MBUF_REFCNT_ATOMIC=n
Recompile DPDK with the changes above before compiling and running the Open Event Machine
tests (first install and config DPDK as described in the manuals, then recompile). E.g.:
> cd {DPDK 1.3.1-7 DIR}/x86_64-default-linuxapp-gcc/
> make clean
> make
OpenEM application Make commands:
> cd {OPEN EVENT MACHINE DIR}/event_test/example/intel
or
> cd {OPEN EVENT MACHINE DIR}/event_test/packet_io/intel
> make real_clean && make em_clean && make clean && make
> make real_clean - Cleans the application and build directories
> make em_clean - Cleans the EM-library
> make clean - Cleans the application (leaves build dirs)
> make - Compile & link the application
-------------------------------------------
A) Compile & run the example test cases
-------------------------------------------
1. Change into the OpenEM-intel example directory
> cd {OPEN EVENT MACHINE DIR}/event_test/example/intel
2. Build the test applications: 'hello', 'perf', 'event_group' and 'error'
(> make real_clean && make em_clean)
> make
3. Run one of the compiled examples, e.g.:
> sudo ./build/hello -c 0xfe -n 4 -- -p (Run 'hello' on 7 cores using EM process-per-core mode(-p))
> sudo ./build/hello -c 0xfcfc -n 4 -- -t (Run 'hello' on 12 cores using EM thread-per-core mode(-t))
> sudo ./build/perf -c 0xffff -n 4 -- -p (Run 'perf' on 16 cores using EM process-per-core mode(-p))
> sudo ./build/perf -c 0xfffe -n 4 -- -t (Run 'perf' on 15 cores using EM thread-per-core mode(-t))
> sudo ./build/event_group -c 0x0c0c -n 4 -- -p (Run 'event_group' on 4 cores using EM process-per-core mode(-p))
> sudo ./build/event_group -c 0x00f0 -n 4 -- -t (Run 'event_group' on 4 cores using EM thread-per-core mode(-t))
> sudo ./build/error -c 0x3 -n 4 -- -p (Run 'error' on 2 cores using EM process-per-core mode(-p))
> sudo ./build/error -c 0x2 -n 4 -- -t (Run 'error' on 1 core using EM thread-per-core mode(-t))
Help on the command line options is available through the -h, --help options, e.g.:
> sudo ./build/hello --help
Usage: hello DPDK-OPTIONS -- APPL&EM-OPTIONS
E.g. hello -c 0xfe -n 4 -- -p
Open Event Machine example application.
Mandatory DPDK-OPTIONS options:
-c Coremask (hex) - select the cores that the application&EM will run on.
-n DPDK-option: Number of memory channels in the system
Optional [DPDK-OPTIONS]
(see DPDK manuals for an extensive list)
Mandatory APPL&EM-OPTIONS:
-p, --process-per-core Running OpenEM with one process per core.
-t, --thread-per-core Running OpenEM with one thread per core.
Select EITHER -p OR -t, but not both!
Optional [APPL&EM-OPTIONS]
-h, --help Display help and exit.
-------------------------------------------
B) Compile & run the packet I/O test cases
-------------------------------------------
1. Change into the OpenEM-intel packet-io directory
> cd {OPEN EVENT MACHINE DIR}/event_test/packet_io/intel
2. Build the test applications: 'packet_loopback' and 'packet_multi_stage'
(> make real_clean && make em_clean)
> make
3. Run one of the compiled examples, e.g.:
> sudo ./build/packet_loopback -c 0xfe -n 4 -- -p (Run 'packet_loopback' on 7 cores using EM process-per-core mode(-p))
> sudo ./build/packet_loopback -c 0xfffe -n 4 -- -t (Run 'packet_loopback' on 15 cores using EM thread-per-core mode(-t))
> sudo ./build/packet_multi_stage -c 0xffff -n 4 -- -p (Run 'packet_multi_stage' on 16 cores using EM process-per-core mode(-p))
> sudo ./build/packet_multi_stage -c 0xfefe -n 4 -- -t (Run 'packet_multi_stage' on 14 cores using EM thread-per-core mode(-t))
Help on the command line options is available through the -h, --help options, e.g.:
> sudo ./build/packet_loopback --help
Usage: packet_loopback DPDK-OPTIONS -- APPL&EM-OPTIONS
E.g. packet_loopback -c 0xfe -n 4 -- -p
Open Event Machine example application.
Mandatory DPDK-OPTIONS options:
-c Coremask (hex) - select the cores that the application&EM will run on.
-n DPDK-option: Number of memory channels in the system
Optional [DPDK-OPTIONS]
(see DPDK manuals for an extensive list)
Mandatory APPL&EM-OPTIONS:
-p, --process-per-core Running OpenEM with one process per core.
-t, --thread-per-core Running OpenEM with one thread per core.
Select EITHER -p OR -t, but not both!
Optional [APPL&EM-OPTIONS]
-h, --help Display help and exit.
====================================================================================================
5. Examples / Test Cases
====================================================================================================
The package contains a set of examples / test cases.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A) Basic standalone Examples - they do not require any external input or I/O,
just compile and run to get output/results.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Sources in directory: {OPEN EVENT MACHINE DIR}/event_test/example/
---------------------------------------------------------
A1) test_appl_hello.c (executable: 'hello')
---------------------------------------------------------
Simple "Hello World" example the Event Machine way.
Creates two Execution Objects (EOs), each with a dedicated queue for incoming events,
and ping-pongs an event between the EOs while printing "Hello world" each time the
event is received. Note the changing core number in the printout below (the event is handled
by any core that is ready to take on new work):
Run hello on 8 cores
> sudo ./build/hello -c 0xff -n 4 -- -p
or
> sudo ./build/hello -c 0xff -n 4 -- -t
...
Hello world started EO A. I'm EO 0. My queue is 320.
Hello world started EO B. I'm EO 1. My queue is 321.
Entering event dispatch loop() on EM-core 0
Hello world from EO A! My queue is 320. I'm on core 02. Event seq is 0.
Hello world from EO B! My queue is 321. I'm on core 04. Event seq is 1.
Hello world from EO A! My queue is 320. I'm on core 02. Event seq is 2.
Hello world from EO B! My queue is 321. I'm on core 05. Event seq is 3.
Hello world from EO A! My queue is 320. I'm on core 01. Event seq is 4.
Hello world from EO B! My queue is 321. I'm on core 04. Event seq is 5.
Hello world from EO A! My queue is 320. I'm on core 05. Event seq is 6.
Hello world from EO B! My queue is 321. I'm on core 06. Event seq is 7.
Hello world from EO A! My queue is 320. I'm on core 04. Event seq is 8.
Hello world from EO B! My queue is 321. I'm on core 07. Event seq is 9.
Hello world from EO A! My queue is 320. I'm on core 06. Event seq is 10.
...
---------------------------------------------------------
A2) test_appl_perf.c (executable: 'perf')
---------------------------------------------------------
Measures and prints the average cycles consumed in an event send - sched - receive loop.
Note that the cycles per event increase with a larger core count (try e.g. 4 vs. 8 cores).
Also note that the way EM-cores are mapped to HW threads matters - using HW threads on the
same CPU-core differs from using HW threads on separate cores.
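Conceptually, the cycles-per-event figure can be obtained by timestamping a batch of events when
it is started and averaging the elapsed cycles when the whole batch has been received. A rough
sketch follows (the actual test's bookkeeping differs; rte_rdtsc() from DPDK is used here only
as an illustrative cycle counter and the sample size is arbitrary):

    #include <stdio.h>
    #include <stdint.h>
    #include <rte_cycles.h>

    #define EVENTS_PER_SAMPLE  4096  /* illustrative sample size */

    static uint64_t sample_start_cycles;
    static uint64_t events_received;

    static void sample_start(void)
    {
        events_received     = 0;
        sample_start_cycles = rte_rdtsc();
    }

    /* Called from the EO receive function for every received event */
    static void sample_event(void)
    {
        if (++events_received == EVENTS_PER_SAMPLE) {
            uint64_t diff = rte_rdtsc() - sample_start_cycles;

            printf("cycles per event %.2f\n",
                   (double)diff / (double)EVENTS_PER_SAMPLE);
            sample_start(); /* start the next sample */
        }
    }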
Sample output from a run on a CPU with 8 cores / 16 HW threads (max core mask -c 0xffff).
Run on 4 EM-cores:
> sudo ./build/perf -c 0xf -n 4 -- -p (4 EM-cores (=processes) mapped to 4 HW threads on separate cores)
> sudo ./build/perf -c 0x0303 -n 4 -- -p (4 EM-cores (=processes) mapped to 4 HW threads on 2 cores, 2 HW threads per core)
or
> sudo ./build/perf -c 0xf -n 4 -- -t (4 EM-cores (=pthreads) mapped to 4 HW threads on separate cores)
> sudo ./build/perf -c 0x0303 -n 4 -- -t (4 EM-cores (=pthreads) mapped to 4 HW threads on 2 cores, 2 HW threads per core)
> sudo ./build/perf -c 0xf -n 4 -- -p
...
cycles per event 263.99 @2693.43 MHz (core-00 2)
cycles per event 264.10 @2693.43 MHz (core-02 2)
cycles per event 264.20 @2693.43 MHz (core-01 2)
cycles per event 265.37 @2693.43 MHz (core-03 2)
cycles per event 263.76 @2693.43 MHz (core-00 3)
cycles per event 263.98 @2693.43 MHz (core-02 3)
cycles per event 264.27 @2693.43 MHz (core-01 3)
cycles per event 264.99 @2693.43 MHz (core-03 3)
...
Run on 8 EM-cores:
> sudo ./build/perf -c 0xff -n 4 -- -p (8 EM-cores (=processes) mapped to 8 HW threads on separate cores)
> sudo ./build/perf -c 0x0f0f -n 4 -- -p (8 EM-cores (=processes) mapped to 8 HW threads on 4 cores, 2 HW threads per core)
or
> sudo ./build/perf -c 0xff -n 4 -- -t (8 EM-cores (=pthreads) mapped to 8 HW threads on separate cores)
> sudo ./build/perf -c 0x0f0f -n 4 -- -t (8 EM-cores (=pthreads) mapped to 8 HW threads on 4 cores, 2 HW threads per core)
> sudo ./build/perf -c 0xff -n 4 -- -p
...
cycles per event 308.63 @2693.43 MHz (core-07 2)
cycles per event 309.91 @2693.43 MHz (core-00 2)
cycles per event 309.79 @2693.43 MHz (core-06 2)
cycles per event 310.68 @2693.43 MHz (core-01 2)
cycles per event 311.75 @2693.43 MHz (core-05 2)
cycles per event 311.72 @2693.43 MHz (core-02 2)
cycles per event 312.15 @2693.43 MHz (core-03 2)
cycles per event 312.53 @2693.43 MHz (core-04 2)
cycles per event 308.73 @2693.43 MHz (core-07 3)
cycles per event 309.91 @2693.43 MHz (core-00 3)
cycles per event 310.49 @2693.43 MHz (core-06 3)
cycles per event 310.76 @2693.43 MHz (core-01 3)
cycles per event 311.63 @2693.43 MHz (core-05 3)
cycles per event 311.55 @2693.43 MHz (core-02 3)
cycles per event 311.82 @2693.43 MHz (core-03 3)
cycles per event 312.82 @2693.43 MHz (core-04 3)
...
---------------------------------------------------------
A3) test_appl_error.c (executable: 'error')
---------------------------------------------------------
Demonstrates and tests the Event Machine error handling functionality.
Three application EOs are created, each with a dedicated queue. An application specific
global error handler is registered (thus replacing the EM default). Additionally EO A
will register an EO specific error handler.
When the EOs receive events (error_receive) they will generate errors by explicit calls to
em_error() and by calling EM-API functions with invalid arguments.
The registered error handlers simply print the error information on screen.
Note that we continue executing after a fatal error since these are only test-errors...
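For reference, an application error handler along these lines might look roughly as below. The
handler prototype and the registration calls follow the EM error-handling convention referred to
above, but the exact em_error_handler_t type should be verified against event_machine.h; the
printout format and the integer casts of the handles are purely illustrative.

    #include <stdarg.h>
    #include <stdio.h>
    #include "event_machine.h"

    /* Sketch of an application global / EO-specific error handler
     * (verify the exact handler type against event_machine.h) */
    static em_status_t
    appl_error_handler(em_eo_t eo, em_status_t error, em_escope_t escope, va_list args)
    {
        (void)args; /* optional error-specific args passed via em_error() */

        printf("Appl error handler: EO %u error 0x%08X escope 0x%X\n",
               (unsigned)eo, (unsigned)error, (unsigned)escope);

        return error; /* let the caller decide how to proceed */
    }

    /* Registration, e.g. at startup / in the EO start function:
     *   em_register_error_handler(appl_error_handler);         - global handler
     *   em_eo_register_error_handler(eo, appl_error_handler);  - EO specific handler
     */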
> sudo ./build/error -c 0xf -n 4 -- -p
or
> sudo ./build/error -c 0xf -n 4 -- -t
...
Error log from EO C [0] on core 1!
THIS IS A FATAL ERROR!!
Appl Global error handler : EO 2 error 0x8000DEAD escope 0x0
Return from fatal.
Error log from EO A [0] on core 2!
Appl EO specific error handler: EO 0 error 0x00001111 escope 0x1
Appl EO specific error handler: EO 0 error 0x00002222 escope 0x2 ARGS: Second error
Appl EO specific error handler: EO 0 error 0x00003333 escope 0x3 ARGS: Third error 320
Appl EO specific error handler: EO 0 error 0x00004444 escope 0x4 ARGS: Fourth error 320 0
Appl EO specific error handler: EO 0 error 0x0000000A escope 0xFF000402 - EM info:
EM ERROR:0x0000000A ESCOPE:0xFF000402 EO:0-"EO A" core:02 ecount:5(4) em_free(L:2243) em_intel.c event ptr NULL!
...
---------------------------------------------------------
A4) test_appl_event_group.c (executable: 'event_group')
---------------------------------------------------------
Tests and measures the event group feature for fork-join type of operations using events.
See the event_machine_group.h file for the event group API calls.
An EO allocates and sends a number of data events to itself (using an event group)
to trigger a notification event to be sent when the configured event count
has been received. The cycles consumed until the notification is received are measured
and printed.
Note: To keep things simple this test case uses only a single queue into which all events,
including the notification events, are received. The event group fork-join mechanism does not
care about the queues used, however; it is basically a counter of events sent using a certain
event group id. In a more complex example each data event could be sent from different
EOs to different queues and the final notification event sent to yet another queue.
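A condensed sketch of the fork phase described above is shown below. The calls follow
event_machine_group.h; the em_pool_t pool-handle type name, the em_alloc() argument order and
the payload sizes are assumptions made for illustration.

    #include "event_machine.h"
    #include "event_machine_group.h"

    #define DATA_EVENTS  256  /* number of forked data events */

    /* Sketch: apply an event group with a target count and one notification,
     * then send the data events using that group (the fork). The notification
     * event is delivered to notif_queue once all DATA_EVENTS events have been
     * received (the join). Verify argument types against event_machine_group.h. */
    static void
    fork_events(em_queue_t data_queue, em_queue_t notif_queue,
                em_pool_t pool /* pool handle type name is an assumption */)
    {
        em_event_group_t egrp = em_event_group_create();
        em_notif_t notif;
        int i;

        /* em_alloc() arguments shown schematically (size, type, pool) */
        notif.event = em_alloc(sizeof(int), EM_EVENT_TYPE_SW, pool);
        notif.queue = notif_queue;

        em_event_group_apply(egrp, DATA_EVENTS, 1, &notif);

        for (i = 0; i < DATA_EVENTS; i++) {
            em_event_t ev = em_alloc(sizeof(int), EM_EVENT_TYPE_SW, pool);

            em_send_group(ev, data_queue, egrp); /* counted against the group */
        }
    }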
> sudo ./build/event_group -c 0xf -n 4 -- -p
or
> sudo ./build/event_group -c 0xf -n 4 -- -t
...
Start.
Done. Notification event received after 256 data events. Cycles curr:149944, ave:149944
Start.
Done. Notification event received after 256 data events. Cycles curr:142339, ave:147409
Start.
Done. Notification event received after 256 data events. Cycles curr:141217, ave:145861
...
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
B) Simple Packet I/O examples - used together with an external traffic generator.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
These tests assume that the test system is equipped with at least one NIC of the following type:
- Intel 82599 10 Gigabit Ethernet Controller.
Sources in directory: {OPEN EVENT MACHINE DIR}/event_test/packet_io/
Packet I/O is enabled by setting em_conf.pkt_io = 1 before calling em_init(&em_conf).
---------------------------------------------------------
B1) packet_loopback.c (executable: 'packet_loopback')
---------------------------------------------------------
Simple Dynamically Load Balanced Packet-I/O loopback test application.
Receives UDP/IP packets, swaps the addresses/ports and sends the packets back to where they came from.
The test expects the traffic generator to send data using 4096 UDP-flows:
- 4 IP dst addresses each with 1024 different UDP dst ports (=flows).
Alternatively, setting "#define QUEUE_PER_FLOW 0" will accept any packets, but then only a single
default EM queue is used, thus limiting performance.
Use the traffic generator to find the max sustainable throughput for loopback traffic.
The throughput should increase near-linearly with increasing core counts (set by -c 0xcoremask).
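The per-packet work essentially boils down to swapping the L2/L3/L4 addresses in place before the
packet is sent back out. A rough plain-header sketch is shown below; the real example operates on
EM events backed by DPDK mbufs, so this is illustrative only:

    #include <stdint.h>
    #include <string.h>
    #include <net/ethernet.h>
    #include <netinet/ip.h>
    #include <netinet/udp.h>

    /* Illustrative in-place swap of Ethernet MACs, IPv4 addresses and UDP ports
     * of a received frame, so that the packet is returned to its sender. */
    static void swap_addrs(void* frame)
    {
        struct ether_header* eth = (struct ether_header*)frame;
        struct iphdr*        ip  = (struct iphdr*)(eth + 1);
        struct udphdr*       udp = (struct udphdr*)((uint8_t*)ip + ip->ihl * 4);
        uint8_t  mac[ETH_ALEN];
        uint32_t addr;
        uint16_t port;

        memcpy(mac, eth->ether_dhost, ETH_ALEN);
        memcpy(eth->ether_dhost, eth->ether_shost, ETH_ALEN);
        memcpy(eth->ether_shost, mac, ETH_ALEN);

        addr = ip->daddr;  ip->daddr = ip->saddr;    ip->saddr   = addr;
        port = udp->dest;  udp->dest = udp->source;  udp->source = port;
    }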
> sudo ./build/packet_loopback -c 0xffff -n 4 -- -p
or
> sudo ./build/packet_loopback -c 0xffff -n 4 -- -t
---------------------------------------------------------
B2) packet_multi_stage.c (executable: 'packet_multi_stage')
---------------------------------------------------------
Similar packet-I/O example as packet_loopback, except that each UDP-flow is handled in
three (3) stages before being sent back out. The three stages (3 EOs) cause each packet to be
enqueued, scheduled and received multiple times on the multicore CPU, thus stressing the
EM scheduler.
Additionally this test uses EM queues of different priority and type.
The test expects the traffic generator to send data using 1024 UDP-flows:
- 4 IP dst addresses each with 256 different UDP dst ports (=flows).
Use the traffic generator to find the max sustainable throughput for multi-stage traffic.
The throughput should increase near-linearly with increasing core counts (set by -c 0xcoremask).
> sudo ./build/packet_multi_stage -c 0xffff -n 4 -- -p
or
> sudo ./build/packet_multi_stage -c 0xffff -n 4 -- -t
====================================================================================================
6. Changes
====================================================================================================
---------------------------------------
Open Event Machine 1.2:
---------------------------------------
- Requires Intel DPDK 1.3.1-7 available at www.intel.com/go/dpdk (OpenEM 1.1 and older used DPDK 1.2.3-3)
- Support for both EM process-per-core and EM thread-per-core modes, selected at application startup with the
command line options -p (process) or -t (thread).
- Usage of 'ENV_SHARED' to create global shared data does not work with the multi-process model - thus 'ENV_SHARED'
has been removed from all applications and is no longer provided.
- Shared memory should be created with env_shared_reserve() or env_shared_malloc() (instead of ENV_SHARED).
Both aforementioned functions return a huge page backed buffer that can be shared between EM-cores. Also
pointers to these buffers can be passed between EM-cores.
env_shared_reserve("name") creates a named memory region that other cores can look up using the name.
Buffers reserved this way cannot be freed.
env_shared_malloc() provides malloc-like functionality and returns shareable buffers that can be freed
with env_shared_free().
See the example applications for details on shared memory creation; a usage sketch also follows after
this change list. Applications using env_shared_reserve() or env_shared_malloc() work in both
process-per-core and thread-per-core modes. Events are allocated with em_alloc() as before.
The env_*() functions are not part of the OpenEM API.
- Application initialization code changed to better work in both thread-per-core and process-per-core modes.
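A rough usage sketch of the env_* shared-memory calls described above. The prototypes below,
including the size argument of env_shared_reserve(), are assumptions made for illustration; see
the example applications and the env headers in the package for the real ones.

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed prototypes - check the env headers in this package for the real ones */
    extern void* env_shared_reserve(const char* name, size_t size); /* named region, cannot be freed */
    extern void* env_shared_malloc(size_t size);                    /* malloc-like, shareable        */
    extern void  env_shared_free(void* buf);

    /* Hypothetical application shared-memory layout */
    typedef struct {
        uint64_t pkt_count; /* e.g. counters updated by all EM-cores */
    } appl_shm_t;

    static appl_shm_t* appl_shm;

    static void appl_shared_mem_init(void)
    {
        /* Named, huge-page backed region visible to every EM-core
         * (both process-per-core and thread-per-core modes) */
        appl_shm = env_shared_reserve("appl-shm", sizeof(appl_shm_t));

        /* malloc-style shared buffers can also be used and later released */
        void* tmp = env_shared_malloc(1024);
        env_shared_free(tmp);
    }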
---------------------------------------
Open Event Machine 1.1:
---------------------------------------
- The Event Machine is compiled into a static library (libem.a) that the test applications link against.
- Building the example applications: 'make' builds all examples in the directory instead of just one.
Use 'make clean' to clean the test applications/examples, 'make real_clean' for a more thorough cleanup
and 'make em_clean' for a cleanup of the EM-library.
'make real_clean && make em_clean' would clean everything.
- main() moved from EM into the application.
- Command-line arg parsing should be handled by the application for both EM and application.
Note that DPDK requires some CL args and it parses them at startup, see rte_eal_init(argc, argv) - the rest
of the arguments, if any, should be parsed by the application. See the examples for details.
- The Event Machine config is set by filling an em_conf_t struct and calling em_init(&em_conf) at startup.
The contents of em_conf_t are environment specific and can be extended/modified as needed.
Note: em_conf_t.em_instance_id is added for future use - currently only one (1) running instance of EM is supported.
- Startup and initialization sequence changed:
- main in application: initializes the HW-specific environment (DPDK - rte_eal_init()), initializes EM (em_init())
and launches the application main-loop on each EM-core. The application is responsible for creating the
needed processes or threads for each EM-core to use. Note that the DPDK rte_eal_init() does this for you
in the Intel case.
- The application main-loop on each EM-core must call em_init_core() before using any further EM-API calls.
An EM-core is ready to start creating EOs & queues, sending events etc. once the call to em_init_core()
has returned successfully.
- The application is responsible for calling em_dispatch() in its main-loop to (schedule and) dispatch
EM events for processing. This provides more flexibility, as some applications might decide to perform
tasks outside of EM once returning from em_dispatch() and then at a later time enter EM again by
another em_dispatch() call. Calling em_dispatch(0 /*rounds=FOREVER*/) with argument 'rounds=0' will never return,
thus dispatching events until program execution is halted (this is the normal behaviour). On the other hand,
using a >=1 value for 'rounds' can be useful to perform some actions outside of EM every once in a while:
      /* Application main-loop on each EM-core */
      for (;;)
      {
          em_dispatch(1000);    // Perform 1000 EM dispatch rounds before returning
          do_something_else();  // Perform some non-EM work.
      }
- Summary: New functions: em_init(), em_init_core(), em_dispatch() - in event_machine_hw_specific.h
(and event_machine_hw_specific.h.template)
NOTE! These functions were not added to the "official" EM-API (event_machine.h remains unchanged),
instead a HW specific file (and .template) was updated. The startup and initialization needs can
vary greatly between systems and environments, thus it's better to allow some freedom in this area.
Examples of the usage of these functions along with the changed initialization can be studied in the
event_test/example and event_test/packet_io directories; a condensed startup sketch also follows below.
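A condensed sketch of the startup sequence described above (error checking and the per-core launch
mechanics are omitted; the em_conf_t fields shown are the ones mentioned in this README):

    #include <string.h>
    #include <rte_eal.h>
    #include "event_machine.h"

    /* Per EM-core main loop - run on every EM-core after launch */
    static void em_core_main_loop(void)
    {
        em_init_core();        /* per-core EM init - must precede other EM calls */

        for (;;) {
            em_dispatch(1000); /* 1000 dispatch rounds, then return */
            /* ... optional non-EM work here ... */
        }
    }

    int main(int argc, char* argv[])
    {
        em_conf_t em_conf;
        int ret;

        /* 1. HW/environment init: DPDK parses its own options (up to '--')
         *    and prepares the per-core processes/threads per the -c coremask */
        ret = rte_eal_init(argc, argv);
        argc -= ret;
        argv += ret;
        /* ... the application parses the remaining -p/-t etc. options here ... */

        /* 2. EM global init */
        memset(&em_conf, 0, sizeof(em_conf));
        em_conf.pkt_io    = 0; /* packet I/O:  0=disable, 1=enable */
        em_conf.evt_timer = 0; /* event timer: 0=disable, 1=enable */
        em_init(&em_conf);

        /* 3. Launch the main loop on each EM-core (the bundled examples show
         *    the actual per-core launch code for both -p and -t modes) */
        em_core_main_loop();

        return 0;
    }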
---------------------------------------
Open Event Machine 1.0:
---------------------------------------
- Created.
====================================================================================================
7. Open Issues
====================================================================================================
- em_atomic_processing_end() not implemented - it currently just returns. The solution must take event
dequeue-bursts into consideration.
====================================================================================================
8. Packet I/O
====================================================================================================
Note: The OpenEM API does not contain an API for packet I/O; however, simple packet I/O functionality
has been added to demonstrate a way to connect packet I/O to OpenEM. The packet-I/O API
used here is by no means standard or mandatory and could be modified/replaced to better suit
the processing environment used in your project.
See the Packet-I/O examples for more details.
Enable packet I/O by setting em_conf.pkt_io = 1 before calling em_init(&em_conf):
...
em_conf.pkt_io = 1; // Packet-I/O: disable=0, enable=1
...
em_init(&em_conf);
====================================================================================================
9. Timers - event based notification
====================================================================================================
Note: The event timer is not (yet?) a part of OpenEM. The current timer API, functionality and
implementation may change later on. The version for Intel is an experimental feature.
Enable the event timer by setting em_conf.evt_timer = 1 before calling em_init(&em_conf):
...
em_conf.evt_timer = 1; // Event-Timer: disable=0, enable=1
...
em_init(&em_conf);
Timer usage with OpenEM:
The OpenEM package contains the event_timer/ directory with source code for setting up and using
an event based timeout mechanism. 'Event based' means that upon timeout a given event is sent to a
given queue:
evt_request_timeout(timeout, event, queue, cancel);
timeout - timeout/tick value specifying when to send 'event' to 'queue'
event - user allocated & filled input event that gets sent at 'timeout'
queue - destination queue into which 'event' is sent
cancel - timeout cancel handle
OpenEM sets up the event timer (if enabled, see above) at startup and calls manage() during runtime. The application
can immediately start using timeouts through the API calls evt_request_timeout(), evt_cancel_timeout() etc.
No example test application is currently included - TBD.
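For illustration only, a timeout request along the lines described above could look roughly as
follows. The argument types, the tick unit, the pool-handle type name and the NULL cancel handle
are all assumptions, not the documented API; see the event_timer/ sources for the real prototypes.

    #include "event_machine.h"

    /* Illustration only - verify evt_request_timeout() and the event allocation
     * details against the event_timer/ sources and event_machine.h */
    static void
    request_example_timeout(em_queue_t dst_queue,
                            em_pool_t pool /* pool handle type name is an assumption */)
    {
        /* User-allocated and filled event, delivered to dst_queue at the timeout */
        em_event_t tmo_event = em_alloc(sizeof(int), EM_EVENT_TYPE_SW, pool);

        evt_request_timeout(1000 /* timeout/ticks */, tmo_event, dst_queue,
                            NULL /* cancel handle - assumed optional here */);
    }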
====================================================================================================
10. Notes
====================================================================================================
Miscellaneous notes regarding OpenEM usage and performance.
10.1 OpenEM Configuration parameters:
- RX_DIRECT_DISPATCH - 0(disabled=default) or 1(enabled) in event_machine/intel/em_intel_packet.h
/**
* Eth Rx Direct Dispatch: if enabled (=1) will try to dispatch the input Eth Rx event
* for processing as fast as possible by bypassing the event scheduler queues.
* Direct dispatch will, however, preserve packet order for each flow and for atomic
* flows/queues also the atomic context is maintained.
* Directly dispatching an event reduces the number of enqueue+dequeue operations and keeps the
* event processing on the same core as it was received on, thus giving better performance.
* Event priority handling is weakened by enabling direct dispatch as newer events
* can be dispatched before older events (of another flow) that are enqueued in the scheduling
* queues - this is the reason why it has been set to '0' by default. An application that does
* not care about strict priority could significantly benefit from enabling this feature.
*/
#define RX_DIRECT_DISPATCH (0) // 0=Off(lower performance, better priority handling)
// 1=On (better performance, weaker priority)
10.2 Event bursting:
To improve performance & throughput, events are dequeued from the scheduling queues and eth-ports
in bursts rather than one-by-one. The best event load balancing and priority handling over all cores
would be obtained by dequeuing/enqueueing one event at a time, but since this severely decreases throughput,
it has been decided to allow bursting to favor performance.
The defines MAX_Q_BULK_ATOMIC & MAX_E_BULK_ATOMIC, MAX_E_BULK_PARALLEL and MAX_E_BULK_PARALLEL_ORD can be tuned
to change the dequeue burst sizes in the scheduler (em_intel_sched.c).
Eth Rx and Tx burst sizes can be changed through the MAX_RX_PKT_BURST and MAX_TX_PKT_BURST defines (em_intel_packet.c).