use osrm-datastore for testing, keep osrm-routed runnning #889

Merged
merged 8 commits into from
Oct 15, 2014

Conversation

emiltin
Contributor

@emiltin emiltin commented Jan 24, 2014

This branch uses osrm-datastore to load data during cucumber testing, resulting in a speed up of more than 3x on cached tests. (First run is about the same, since data must be converted with osmosis).

Before each scenario is run, osrm-datastore is used to load the new data into shared memory. osrm-routed is then launched if it's not already running. As cucumber exits, osrm-routed is shut down.
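
The flow described above (load with osrm-datastore before each scenario, launch osrm-routed only if it is not yet running, shut it down when cucumber exits) could be sketched roughly as follows. This is a minimal illustration, not the code from this branch: `RoutedLifecycle`, `RecordingRunner`, and their method names are hypothetical, and the runner only records calls instead of starting real processes.

```ruby
# Hypothetical sketch of the test lifecycle; not the PR's actual code.
# The runner is injected so the flow can be exercised without OSRM binaries.
class RoutedLifecycle
  def initialize(runner)
    @runner = runner   # responds to load(path), launch, shutdown (assumed names)
    @running = false
  end

  # Would be called before each scenario.
  def before_scenario(osrm_file)
    @runner.load(osrm_file)   # a real runner would invoke osrm-datastore here
    return if @running
    @runner.launch            # a real runner would start osrm-routed here
    @running = true
  end

  # Would be called once as cucumber exits.
  def on_exit
    return unless @running
    @runner.shutdown
    @running = false
  end
end

# Stand-in runner that only records what it was asked to do.
class RecordingRunner
  attr_reader :calls

  def initialize
    @calls = []
  end

  def load(path)
    @calls << [:load, path]
  end

  def launch
    @calls << [:launch]
  end

  def shutdown
    @calls << [:shutdown]
  end
end

runner = RecordingRunner.new
lifecycle = RoutedLifecycle.new(runner)
lifecycle.before_scenario("a.osrm")
lifecycle.before_scenario("b.osrm")
lifecycle.on_exit
p runner.calls
```

In this sketch the server is launched once, while data is reloaded for every scenario, which is where the speed-up on cached runs would come from.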

We might want to add some testing of the good ol' way of loading data directly with osrm-routed, since with the current version of this branch, osrm-routed never loads data directly.

I did experience some weird behaviour when launching osrm-routed manually from the command line and then running osrm-datastore from the cuke scripts, including errors in datastore indicating failure to free data, osrm-routed not returning correct routes, and osrm-routed throwing exceptions. This should perhaps be investigated. But when launching both routed and datastore from the cuke scripts it seems to work fine.

@emiltin
Contributor Author

emiltin commented Jan 24, 2014

I only tried the branch on Mac so far.

@emiltin
Contributor Author

emiltin commented Jan 24, 2014

There's a ton of failed cucumber tests on Travis, but they seem related to osmosis. However the build is still reported as passing - is that because the build itself succeeds, and the cucumber result is not considered?

@DennisOSRM
Collaborator

Travis appears to have changed the environment. No idea why osmosis is broken there ATM. Nevertheless, the Jenkins server runs the tests too.

@alex85k
Contributor

alex85k commented Jan 25, 2014

I have tried this branch on my home Ubuntu 12.04 machine and got multiple errors (manual loading and serving seems to work).

On the first run (when files are generated and delays are big) the first ~20 tests seem to pass, but on later tests and after restarting cucumber there are many failing tests followed by "osrm-routed is not running" errors.

(I have 3Gb RAM machine with no swap partition, this may not be enough, but old-way tests pass without any problems)

There should be some way to increase reliability of osrm-datastore and routed on frequent reloading...
Maybe datastore test runner can become an optional way to run tests, not default?

@alex85k
Contributor

alex85k commented Jan 25, 2014

On Windows (waitpid-modified script from this branch and sources from #880) there are also correct results->then some incorrect results->then crashes of osrm-routed.

@alex85k alex85k mentioned this pull request Jan 26, 2014
@emiltin
Contributor Author

emiltin commented Jan 28, 2014

doesn't really seem to work on Ubuntu (i'm running v 13).

  • both osrm-routed and osrm-datastore report [warn] "could not request RAM lock" on every launch, even when i run them manually.
  • routing results are sometimes incorrect, perhaps because the new data is not fully loaded?
  • osrm-routed not responding: *** osrm-routed is not running. (RuntimeError)

@alex85k
Contributor

alex85k commented Jan 28, 2014

This does not depend on the test-running configuration, actually. A huge timeout is not a way to solve this problem :)
There is something in the code or system configuration (shared mem, named mutexes) that makes the behavior unpredictable. However, it is hard to imagine that data is still loading after osrm-datastore has already finished.

When the next request comes, the reloading process in osrm-routed is initialized, name of fileIndex file is correct... Errors may be the result of reading 1) previous data, 2) still changing data or 3) incorrectly loaded data.
I do not know how to determine exact reason...

@DennisOSRM
Collaborator

@alex85k I haven't looked at the code, but this sounds like race conditions to me. The swapping of the data in memory should be safeguarded by mutexes.

@emiltin
Contributor Author

emiltin commented Jan 28, 2014

@DennisOSRM, can you get the experimental/cuke_datastore branch to run tests successfully on ubuntu?

@DennisOSRM
Collaborator

@emiltin will do tomorrow Morning.

@DennisOSRM
Collaborator

Sorry. Got delayed. Will get to that asap

@DennisOSRM
Collaborator

Tests run fine on my Ubuntu dev machine. First run takes 3m37s while the second (cached) run takes only 0m15s. The only downside is that the following warning is produced for every test:

[warn] Process ../build/osrm-datastore could not request RAM lock

I am not yet sure what the reason is.

@DennisOSRM
Collaborator

So, after digging a bit deeper I found why it is warning. The OS is not allowing to lock the data into RAM as it is hitting a limit. To view the limit try

$ ulimit -l

On my system it says 64 which means you can only lock at most 64kb of data into RAM by default. The setting can be tweaked though:

$ sudo vi /etc/security/limits.conf 

and then add the following two lines at the bottom, where <user> is your user name:

<user>       hard    memlock     unlimited
<user>       soft    memlock     68719476736

Log out and back in (or even reboot) and the warning should be gone. While the message is certainly nagging, it can safely be ignored during tests.
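
To confirm from within a script whether the new limit is in effect, something like the following sketch could be used. It relies on Ruby's `Process.getrlimit`, which reports the soft and hard limits in bytes, whereas `ulimit -l` prints kilobytes. The helper names `memlock_limits` and `describe_limit` are made up for this example.

```ruby
# Minimal sketch: read the locked-memory limits of the current process.
# Process.getrlimit returns [soft, hard] in bytes.
def memlock_limits
  soft, hard = Process.getrlimit(Process::RLIMIT_MEMLOCK)
  { soft: soft, hard: hard }
end

# Format a limit the way `ulimit -l` would show it (kilobytes or "unlimited").
def describe_limit(bytes)
  bytes == Process::RLIM_INFINITY ? "unlimited" : "#{bytes / 1024} kB"
end

limits = memlock_limits
puts "memlock soft: #{describe_limit(limits[:soft])}"
puts "memlock hard: #{describe_limit(limits[:hard])}"
```

A soft limit that still shows 64 kB after editing limits.conf would suggest the entry did not match the logged-in user or that the session has not been restarted yet.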

@emiltin
Contributor Author

emiltin commented Jan 30, 2014

interesting, because they don't run at all on my ubuntu machine.
you're using the experimental/cuke_datastore branch?
what version of ubuntu are you running?
what settings are you using for shmall, shmmax, and what's your total ram?

@DennisOSRM
Collaborator

I am running the code from this pull request on Ubuntu 13.10.

What kind of error do you get?

@emiltin
Contributor Author

emiltin commented Jan 30, 2014

here you can see the errors:
https://gist.github.com/emiltin/8705155

editing /etc/security/limits.conf did not seem to make a difference, i'm still getting the warning, and ulimit -l still reports 64.

emil@emil-OptiPlex-7010:~/code/Project-OSRM$ git branch
  develop
* experimental/cuke_datastore
  master
emil@emil-OptiPlex-7010:~/code/Project-OSRM$ git log -n 1
commit 02f631e3c6d5b580263aa74cfe0711d6746d98fc
Author: Emil Tin <emil@tin.dk>
Date:   Fri Jan 24 21:14:38 2014 +0100

    use osrm-datastore for testing, keep osrm-routed runnning
emil@emil-OptiPlex-7010:~/code/Project-OSRM$ ulimit -l
64
emil@emil-OptiPlex-7010:~/code/Project-OSRM$ tail /etc/security/limits.conf
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#ftp             -       chroot          /ftp
#@student        -       maxlogins       4

# End of file

<user>       hard    memlock     unlimited
<user>       soft    memlock     68719476736
emil@emil-OptiPlex-7010:~/code/Project-OSRM$ sysctl -a | grep shmmax
sysctl: permission denied on key 'fs.protected_hardlinks'
sysctl: permission denied on key 'fs.protected_symlinks'
sysctl: permission denied on key 'kernel.cad_pid'
sysctl: permission denied on key 'kernel.usermodehelper.bset'
sysctl: permission denied on key 'kernel.usermodehelper.inheritable'
kernel.shmmax = 134217728
sysctl: permission denied on key 'net.ipv4.tcp_fastopen_key'
emil@emil-OptiPlex-7010:~/code/Project-OSRM$ sysctl -a | grep shmall
sysctl: permission denied on key 'fs.protected_hardlinks'
sysctl: permission denied on key 'fs.protected_symlinks'
sysctl: permission denied on key 'kernel.cad_pid'
sysctl: permission denied on key 'kernel.usermodehelper.bset'
sysctl: permission denied on key 'kernel.usermodehelper.inheritable'
kernel.shmall = 262144
sysctl: permission denied on key 'net.ipv4.tcp_fastopen_key'
emil@emil-OptiPlex-7010:~/code/Project-OSRM$ free -m -h
             total       used       free     shared    buffers     cached
Mem:          3.8G       1.2G       2.5G         0B        69M       699M
-/+ buffers/cache:       503M       3.3G
Swap:         3.9G         0B       3.9G
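
For reading the two sysctl values above: kernel.shmmax is the largest single shared memory segment in bytes, while kernel.shmall is the system-wide total counted in pages. The small sketch below converts both to MiB. The 4096-byte page size is an assumption (typical on x86; `getconf PAGE_SIZE` shows the real value), and the thread does not establish whether these limits were related to the failures.

```ruby
# Assumption: 4096-byte pages; verify with `getconf PAGE_SIZE`.
PAGE_SIZE = 4096

def mib(bytes)
  bytes / (1024 * 1024)
end

shmmax_bytes = 134_217_728           # kernel.shmmax from the output above (bytes)
shmall_bytes = 262_144 * PAGE_SIZE   # kernel.shmall from the output above (pages)

puts "largest single segment: #{mib(shmmax_bytes)} MiB"   # prints 128 MiB
puts "total shared memory:    #{mib(shmall_bytes)} MiB"   # prints 1024 MiB
```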

@DennisOSRM
Collaborator

You need to replace <user> with your actual user name, ie. emil

@emiltin
Contributor Author

emiltin commented Jan 30, 2014

oh.. i see!

@emiltin
Contributor Author

emiltin commented Jan 30, 2014

i got rid of the warning by modifying /etc/security/limits.conf.

but cucumber still reports tons of errors. if i run "cucumber -t @basic" (consisting of 11 scenarios), i will get anything from 2-9 failed scenarios, either because the routing is incorrect, or osrm-routed doesn't respond.

sometimes osrm-datastore seems to hang for 10-30 seconds, making the whole machine unresponsive, as if huge amounts of memory are being allocated.

@emiltin
Contributor Author

emiltin commented Jan 30, 2014

i added some debug info, so you can see the order in which datastore is called, and routed is launched/shutdown.

@emiltin
Contributor Author

emiltin commented Jan 30, 2014

the develop branch runs all tests without errors on my machine

@emiltin
Contributor Author

emiltin commented Jan 30, 2014

rebased on latest develop. (still same errors)

@emiltin
Contributor Author

emiltin commented Jan 30, 2014

when i run the cucumber tests, and then use 'rake pid' to monitor the osrm-routed process, i can see that at some point it changes from mode S to mode Z (defunct "zombie" process, terminated but not reaped by its parent). from then on, cucumber reports '*** osrm-routed is not running'

so it seems that reloading data with osrm-datastore somehow causes osrm-routed to die?
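
A zombie means the child has already exited but its parent has not collected the exit status yet. A test harness that launched the process itself could poll for this with a non-blocking wait and learn how the process ended. The following is a rough sketch only: `poll_child` is a hypothetical helper, and a short-lived Ruby child stands in for osrm-routed.

```ruby
require "rbconfig"

# Returns :running while the child is alive, or its Process::Status once it
# has terminated. Only works for children of the calling process.
def poll_child(pid)
  reaped, status = Process.waitpid2(pid, Process::WNOHANG)
  reaped.nil? ? :running : status
end

# Stand-in for the server: a child that exits with status 3.
pid = Process.spawn(RbConfig.ruby, "-e", "exit 3")

result = :running
50.times do
  result = poll_child(pid)
  break unless result == :running
  sleep 0.1
end

if result == :running
  puts "child still running"
elsif result.signaled?
  puts "child killed by signal #{result.termsig}"
else
  puts "child exited with status #{result.exitstatus}"
end
```

`Process::Status#signaled?` and `#termsig` would distinguish a segfault (SIGSEGV) from a normal exit, which is the kind of information that is missing when all the harness reports is that the process is not running.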

it would be nice if osrm-routed would output something to the log when new data has been loaded

@DennisOSRM
Collaborator

The point of the entire data store thingy is that osrm-routed is not terminated. Not sure what is happening there.

@emiltin
Contributor Author

emiltin commented Jan 30, 2014

yes it's odd

@alex85k
Contributor

alex85k commented Jan 30, 2014

I have seen the Defunct osrm-routed too when running those tests (I guess it was after getting segmentation faults)...

@alex85k
Contributor

alex85k commented Jan 31, 2014

I tried to compile and run the tests (cuke_datastore branch) on a FreeBSD 10 virtual machine with Clang 3.3. All 251 tests passed without any shmem configuration. The second run took 1m17.311s (on a VM, Core2Duo, 2Gb RAM) and did not show any errors (first time in my experience).

This is extremely strange.
Maybe there is a problem with newer or older libraries like Boost or even system libs?
(my unsuccessful tests were on Boost 1.55)

@DennisOSRM
Collaborator

Don't think this is related to boost. The testing code is in ruby and should not interfere with boost (as linked into the OSRM binaries).

@emiltin Is the routed process dying from a segfault or is it because of some other exception?

@alex85k
Contributor

alex85k commented Jan 31, 2014

Errors are caused by routed faults, not testing environment...
I had explicit segmentation faults on RedHat nonstandard system with old glibc (2007), on Windows routing daemon dies without any message. Did not notice the segfault messages on Ubuntu 12.04, but maybe there were some (routed also died periodically).

Now rebuilding with custom boost 1.55 on FreeBSD to check my hypothesis :)

@DennisOSRM
Collaborator

Cancelled the builds for 14eac50 to get results for the latest commit earlier.

@emiltin
Contributor Author

emiltin commented Oct 13, 2014

all green on travis. but we should still add a test for direct data load.

@DennisOSRM
Collaborator

AppVeyor is looking good, too, while it is halfway through. once we have a test for direct loading, this is looking really good to merge. We are close.

@alex85k
Contributor

alex85k commented Oct 13, 2014

I have tested this branch after rebase (on 6-core Xeon E5-1650 with SSD):

  • no more Ruby errors
  • only Routing on a oneway roundabout test still fails (not related to this PR)
  • first run - 3min , second run - 51s (Release build). Wow. :)

Thank you!

@emiltin
Contributor Author

emiltin commented Oct 14, 2014

added test of direct data load. this required some changes to the test infrastructure. you can now use

    Given the data is loaded directly

or

    Given the data is loaded with datastore

to specify for each scenario how data is loaded. To minimize the risk of hard-to-debug problems, only one instance of osrm-routed will be launched at the same time.

the default is to load data with datastore and keep osrm-routed running for all tests. but osrm-routed will be relaunched when needed, i.e. every time a scenario uses direct data load, or you go from direct to datastore.
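
The relaunch rule described above can be written down as a small predicate. This is only a sketch of the rule as stated in this comment, not the actual test infrastructure code: `relaunch_needed?` and the `:datastore` / `:directly` symbols are hypothetical names.

```ruby
# Hypothetical helper: must osrm-routed be (re)launched before this scenario?
# previous/current are :datastore or :directly; previous is nil when no
# instance is running yet.
def relaunch_needed?(previous, current)
  return true if previous.nil?          # nothing running yet
  return true if current == :directly   # direct load always needs a fresh launch
  previous == :directly                 # switching from direct back to datastore
end

sequence = [:datastore, :datastore, :directly, :directly, :datastore, :datastore]
previous = nil
launches = sequence.map do |current|
  needed = relaunch_needed?(previous, current)
  previous = current
  needed
end
p launches  # prints [true, false, true, true, true, false]
```

Consecutive datastore scenarios reuse the running instance, so a suite that mostly uses the default pays the launch cost rarely.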

Direct data is tested with these scenarios:
https://github.com/Project-OSRM/osrm-backend/blob/71b967d24308be1939e8290d4597847e526566c3/features/testbot/load.feature

@DennisOSRM
Collaborator

Cool!

@emiltin
Contributor Author

emiltin commented Oct 14, 2014

@alex85k does the latest commit work on windows?

@emiltin
Contributor Author

emiltin commented Oct 14, 2014

21.5s on my linux box for 353 scenarios / 1467 steps :-)

@alex85k
Contributor

alex85k commented Oct 14, 2014

> @alex85k does the latest commit work on windows?

It should work, I'll check tomorrow. But why is a 0.1 timeout for shutdown too big? There is only one shutdown in the testing process, if I understand correctly.

@emiltin
Contributor Author

emiltin commented Oct 14, 2014

yes only used very seldom. but it's a retry delay, not a timeout.
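
The distinction matters because a retry delay only sets how often a condition is re-checked, while a timeout bounds the total wait. A generic polling helper, sketched below, makes the two roles explicit. `wait_until` is a hypothetical name, the 0.1 is assumed to be in seconds, and the block stands in for a check such as "has the process exited yet".

```ruby
# Poll the given block until it returns true.
# retry_delay: pause between attempts; timeout: upper bound on the total wait.
# Returns the number of attempts on success, or nil if the timeout expired.
def wait_until(timeout:, retry_delay:)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  attempts = 0
  loop do
    attempts += 1
    return attempts if yield
    return nil if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep retry_delay
  end
end

calls = 0
attempts = wait_until(timeout: 2.0, retry_delay: 0.1) { (calls += 1) >= 3 }
puts "condition met after #{attempts} attempts"
```

With a short retry delay the wait ends soon after the condition becomes true, whatever the timeout is.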

@alex85k
Contributor

alex85k commented Oct 14, 2014

Seems to work fine on Windows with the latest commit (partial run, but a full run should be the same)

@emiltin
Contributor Author

emiltin commented Oct 14, 2014

actually you need to be sure to include features/testbot/load.feature as well as other tests, to make sure you cover loading data both with datastore and directly. but appveyor seems happy.

@emiltin
Contributor Author

emiltin commented Oct 14, 2014

uhm guess appveyor doesn't run the cucumber tests?

@DennisOSRM
Collaborator

> uhm guess appveyor doesn't run the cucumber tests?

not yet

@alex85k
Contributor

alex85k commented Oct 14, 2014

I had a prototype of a testing environment for Appveyor, maybe now it can fit within the time limit (at least some tests).
@DennisOSRM : they now have the 100Mb cache you asked for: http://www.appveyor.com/docs/build-cache, it should be enough to store dependencies and a stripped Ruby+Gems folder.

@DennisOSRM
Collaborator

Yay! for caching

@DennisOSRM
Collaborator

@alex85k could you provide the output of cucumber features\testbot\oneway.feature:7 on Windows?

@alex85k
Contributor

alex85k commented Oct 15, 2014

This time no test failures, of course :) (first run 27 min on Core2Duo, Debug). The previous error was a non-existent path.

@DennisOSRM: are you sure that the error will not show up on some circular isolated road or so on?

@emiltin
Contributor Author

emiltin commented Oct 15, 2014

appveyor debug build failed due to 30 min timeout

@emiltin
Contributor Author

emiltin commented Oct 15, 2014

what's left to do?

@DennisOSRM
Collaborator

I think we are good to merge. Great job, everyone.

DennisOSRM added a commit that referenced this pull request Oct 15, 2014
use osrm-datastore for testing, keep osrm-routed runnning
@DennisOSRM DennisOSRM merged commit dfc81f6 into develop Oct 15, 2014
@DennisOSRM DennisOSRM deleted the experimental/cuke_datastore branch October 15, 2014 13:42
@alex85k
Contributor

alex85k commented Oct 15, 2014

Thank you!

@alex85k alex85k mentioned this pull request Oct 16, 2014