Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests over weekend 2021Dec17 a lot of timeouts on 5 and 7secs #542

Closed
neilh10 opened this issue Dec 20, 2021 · 13 comments
Closed

Tests over weekend 2021Dec17 a lot of timeouts on 5 and 7secs #542

neilh10 opened this issue Dec 20, 2021 · 13 comments

Comments

@neilh10
Copy link

neilh10 commented Dec 20, 2021

For tests running over the wekend starting 2021Dec17 ModularSensors/Mayflys experienced a lot of timeouts, no response from the server.

ModularSensors, waits for a defined timeout period from the server, if there is no response then it defines it locally as an HTTP 504 .
With the transition from servers this may be expected, and I will change the code this week to also list what IP its getting when it looks up ,Host: data.enviroDIY.org, however in a lot of cases it seems like there is a pulse of POSTs, all on the same tcp/ip link, and then a timeout.

test08 has been through a number of powerdowns, and subjected to a a 5minute powerdown on Dec19 19:24PST for more overnight testing.

For test06/Verizon - the timeout was set to 7 seconds and between POST set to 3seconds.
TIMER_POST_TOUT_MS=7000; Gateway Timeout (ms)
TIMER_POST_PACE_MS=3000; Pacing between POSTs
POST_MAX_NUM =100 ;On POSTing MAX NUM after which defered next connection

The response code seemed pretty good at 2.5Seconds, unless there was a timeout
[2021-12-17 16:43:36] -- Response Code -- 201 waited 2531 mS Timeout 7000
though it slowed down later
[2021-12-19 18:19:55] -- Response Code -- 201 waited 4700 mS Timeout 7000

[2021-12-17 14:03:06] test06 was reset started with a backlog of QUE of about 5862 records, and 100 record limit, and immediately upload 100 records successfully.
[2021-12-17 14:17:16] seQCF Que for next pass unsent records 5762
then next attempts just w few before a timeout 504 of 7seconds no response
[2021-12-17 14:31:25] seQCF Que for next pass unsent records 5736
[2021-12-17 14:51:26] seQCF Que for next pass unsent records 5667
[2021-12-17 15:11:23] seQCF Que for next pass unsent records 5613
then 100 uploaded
[2021-12-17 15:35:11] seQCF Que for next pass unsent records 5516
then 504's
[2021-12-17 16:16:26] seQCF Que for next pass unsent records 5457
[2021-12-17 16:31:23] seQCF Que for next pass unsent records 5432
..
[2021-12-19 16:12:53] seQCF Que for next pass unsent records 4409
[2021-12-19 16:36:15] seQCF Que for next pass unsent records 4404
[2021-12-19 18:21:12] seQCF Que for next pass unsent records 4434
[2021-12-19 18:46:13] seQCF Que for next pass unsent records 4419

For test08/WiFi - the timeout was set to 5seconds, and between POST set to 3seconds
TIMER_POST_TOUT_MS=5000; Gateway Timeout (ms)
TIMER_POST_PACE_MS=3000; Pacing between POSTs
POST_MAX_NUM =100 ;On POSTing MAX NUM after which defered next connection

For test08, from 2021-12-17 11:50 PST there was 1619 samples generated, and through reliable delivery they all where found when they where downloaded in the CSV file.
The average time to respond seemed very slow from 3.9secs to
[2021-12-19 12:06:29.003] -- Response Code -- 201 waited 3.951 mS Timeout 5000

[2021-12-19 18:54:11.547] Connected Internet
[2021-12-19 18:54:11.547]
[2021-12-19 18:54:11.547] pubDQTR Sending data to [ 0 ] data.envirodiy.org
[2021-12-19 18:54:12.000] POST /api/data-stream/ HTTP/1.1
[2021-12-19 18:54:12.000] Host: data.envirodiy.org
[2021-12-19 18:54:12.000] TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
[2021-12-19 18:54:12.000] Content-Length: 461
[2021-12-19 18:54:12.000] Content-Type: application/json
[2021-12-19 18:54:12.000]
[2021-12-19 18:54:12.000] {"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-19T18:54:00-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":1608,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":1914.238,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":-0.670,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.1626,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":4.427,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":0.04317,"ff4d732d-88d8-4a1b-b499-16417603edfe":15.86,"7182846e-46e0-4a10-b110-9bc32de4aca9":0}
[2021-12-19 18:54:12.000]
[2021-12-19 18:54:17.818] -- Response Code -- 201 waited 4528 mS Timeout 5000
[2021-12-19 18:54:20.852] Sent 1 readings in 9.23 sec. Queued readings= 11
[2021-12-19 18:54:24.215] POST /api/data-stream/ HTTP/1.1
[2021-12-19 18:54:24.215] Host: data.envirodiy.org
[2021-12-19 18:54:24.215] TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
[2021-12-19 18:54:24.215] Content-Length: 463
[2021-12-19 18:54:24.215] Content-Type: application/json
[2021-12-19 18:54:24.215]
[2021-12-19 18:54:24.215] {"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-19T18:28:00-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":1595,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":1924.512,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":-2.680,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.1626,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":4.473,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":0.04304,"ff4d732d-88d8-4a1b-b499-16417603edfe":15.84,"7182846e-46e0-4a10-b110-9bc32de4aca9":-28}
[2021-12-19 18:54:24.215]
[2021-12-19 18:54:26.082] WATCHDOG ISR barksUntilReset 51 <--WatchDogAVR
[2021-12-19 18:54:29.786] -- Response Code -- 201 waited 4326 mS Timeout 5000
[2021-12-19 18:54:29.786] ... Watchdog low barksUntilReset 51 expected 52
[2021-12-19 18:54:33.157] POST /api/data-stream/ HTTP/1.1
[2021-12-19 18:54:33.157] Host: data.envirodiy.org
[2021-12-19 18:54:33.157] TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
[2021-12-19 18:54:33.157] Content-Length: 463
[2021-12-19 18:54:33.157] Content-Type: application/json
[2021-12-19 18:54:33.157]
[2021-12-19 18:54:33.157] {"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-19T18:32:00-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":1597,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":1922.948,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":-1.117,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.1651,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":4.412,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":0.04268,"ff4d732d-88d8-4a1b-b499-16417603edfe":15.84,"7182846e-46e0-4a10-b110-9bc32de4aca9":-22}
[2021-12-19 18:54:33.157]
[2021-12-19 18:54:38.059] WATCHDOG ISR barksUntilReset 51 <--WatchDogAVR
[2021-12-19 18:54:38.726] -- Response Code -- 201 waited 4338 mS Timeout 5000
[2021-12-19 18:54:38.726] ... Watchdog low barksUntilReset 51 expected 52
[2021-12-19 18:54:42.108] POST /api/data-stream/ HTTP/1.1
[2021-12-19 18:54:42.108] Host: data.envirodiy.org
[2021-12-19 18:54:42.108] TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
[2021-12-19 18:54:42.108] Content-Length: 461
[2021-12-19 18:54:42.108] Content-Type: application/json
[2021-12-19 18:54:42.108]
[2021-12-19 18:54:42.108] {"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-19T18:34:00-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":1598,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":1922.278,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":-0.670,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.1626,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":4.351,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":0.04393,"ff4d732d-88d8-4a1b-b499-16417603edfe":15.84,"7182846e-46e0-4a10-b110-9bc32de4aca9":0}
[2021-12-19 18:54:42.108]
[2021-12-19 18:54:47.009] WATCHDOG ISR barksUntilReset 51 <--WatchDogAVR
[2021-12-19 18:54:47.700] -- Response Code -- 201 waited 4360 mS Timeout 5000
[2021-12-19 18:54:47.700] ... Watchdog low barksUntilReset 51 expected 52
[2021-12-19 18:54:51.090] POST /api/data-stream/ HTTP/1.1
[2021-12-19 18:54:51.090] Host: data.envirodiy.org
[2021-12-19 18:54:51.090] TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
[2021-12-19 18:54:51.090] Content-Length: 461
[2021-12-19 18:54:51.090] Content-Type: application/json
[2021-12-19 18:54:51.090]
[2021-12-19 18:54:51.090] {"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-19T18:36:00-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":1599,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":1921.608,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":-0.670,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.1651,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":4.215,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":0.04326,"ff4d732d-88d8-4a1b-b499-16417603edfe":15.84,"7182846e-46e0-4a10-b110-9bc32de4aca9":0}
[2021-12-19 18:54:51.090]
[2021-12-19 18:54:55.991] WATCHDOG ISR barksUntilReset 51 <--WatchDogAVR
[2021-12-19 18:54:56.663] -- Response Code -- 201 waited 4338 mS Timeout 5000
[2021-12-19 18:54:56.663] ... Watchdog low barksUntilReset 51 expected 52
[2021-12-19 18:55:00.046] POST /api/data-stream/ HTTP/1.1
[2021-12-19 18:55:00.046] Host: data.envirodiy.org
[2021-12-19 18:55:00.046] TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
[2021-12-19 18:55:00.046] Content-Length: 461
[2021-12-19 18:55:00.046] Content-Type: application/json
[2021-12-19 18:55:00.046]
[2021-12-19 18:55:00.046] {"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-19T18:38:00-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":1600,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":1920.715,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":-0.893,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.1651,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":4.366,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":0.04330,"ff4d732d-88d8-4a1b-b499-16417603edfe":15.84,"7182846e-46e0-4a10-b110-9bc32de4aca9":0}
[2021-12-19 18:55:00.046]
[2021-12-19 18:55:04.948] WATCHDOG ISR barksUntilReset 51 <--WatchDogAVR
[2021-12-19 18:55:05.635] -- Response Code -- 201 waited 4360 mS Timeout 5000
[2021-12-19 18:55:05.635] ... Watchdog low barksUntilReset 51 expected 52
[2021-12-19 18:55:08.963] POST /api/data-stream/ HTTP/1.1
[2021-12-19 18:55:08.963] Host: data.envirodiy.org
[2021-12-19 18:55:08.994] TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
[2021-12-19 18:55:08.994] Content-Length: 461
[2021-12-19 18:55:08.994] Content-Type: application/json
[2021-12-19 18:55:08.994]
[2021-12-19 18:55:08.994] {"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-19T18:40:00-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":1601,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":1920.045,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":-0.670,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.1626,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":4.230,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":0.04345,"ff4d732d-88d8-4a1b-b499-16417603edfe":15.84,"7182846e-46e0-4a10-b110-9bc32de4aca9":0}
[2021-12-19 18:55:08.994]
[2021-12-19 18:55:13.917] WATCHDOG ISR barksUntilReset 51 <--WatchDogAVR
[2021-12-19 18:55:14.631] -- Response Code -- 201 waited 4372 mS Timeout 5000
[2021-12-19 18:55:14.631] ... Watchdog low barksUntilReset 51 expected 52
[2021-12-19 18:55:18.018] POST /api/data-stream/ HTTP/1.1
[2021-12-19 18:55:18.018] Host: data.envirodiy.org
[2021-12-19 18:55:18.018] TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
[2021-12-19 18:55:18.018] Content-Length: 461
[2021-12-19 18:55:18.018] Content-Type: application/json
[2021-12-19 18:55:18.018]
[2021-12-19 18:55:18.018] {"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-19T18:42:00-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":1602,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":1919.152,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":-0.893,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.1602,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":4.412,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":0.04257,"ff4d732d-88d8-4a1b-b499-16417603edfe":15.85,"7182846e-46e0-4a10-b110-9bc32de4aca9":0}
[2021-12-19 18:55:18.018]
[2021-12-19 18:55:22.920] WATCHDOG ISR barksUntilReset 51 <--WatchDogAVR
[2021-12-19 18:55:25.043] -- Response Code -- 504 waited 5011 mS Timeout 5000
[2021-12-19 18:55:25.043] ... Watchdog low barksUntilReset 51 expected 52
[2021-12-19 18:55:25.434] seQCF Que for next pass unsent records 4

@aufdenkampe
Copy link
Member

@neilh10, if you have the old IP address anywhere in your code, then that will be problematic, as that will bypass our new DNS routing.

LimnoTech servers were getting shutdown this weekend for maintenance, so does seem like that test device might be pointing to our old IP. As long as a device is doing that, it will have an error anytime LimnoTech servers are shut down, even if our AWS servers have 99.999% uptime.

Also, the main host URL that you should be using on devices is monitormywatershed.org. For a couple of years now, data.enviroDIY.org is just a DNS alias that will get redirected to monitormywatershed.org, even for POST requests from devices. I'm not sure if that is causing any of the issues you are experiencing or not.

@neilh10
Copy link
Author

neilh10 commented Dec 20, 2021

Ok thanks for the info. All lookups are via a URL, I don't have any hard coded IP (except maybe NTP)
Part of the stating my testing, and trying to make it visible, is if I'm doing something wrong or pushing the envelope, then that can be indicated.
I've enabled some debug, but not enough to catch the lookup and resultant Ip. (I've been out all morning)

[PROVIDER_MMW]
CLOUD_ID=monitormywatershed.org

pubDQTR Sending data to [ 0 ] monitormywatershed.org
EnvrDIYPub gQS 1 <--EnviroDIYPublisher
getTPP(mS) 3000 <--EnviroDIYPublisher
Current TX Buffer Size: 0 <--dataPublisherBase
Connecting client. Timer (mS) <--EnviroDIYPublisher
Client connected after 361 ms <--EnviroDIYPublisher

POST /api/data-stream/ HTTP/1.1
Host: monitormywatershed.org
TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
Content-Length: 461
Content-Type: application/json

{"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-20T13:02:00-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":77,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":1922.502,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":-0.670,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.0577,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":4.245,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":0.04310,"ff4d732d-88d8-4a1b-b499-16417603edfe":14.96,"7182846e-46e0-4a10-b110-9bc32de4aca9":-36}

Dumping the TX Buffer <--dataPublisherBase
[[Client waited 5012 mS for 0 bytes. <--EnviroDIYPublisher
Client stopped after 350 ms <--EnviroDIYPublisher
Rsp:' ' <--EnviroDIYPublisher
-- Response Code -- 504 waited 5012 mS Timeout 5000

@neilh10
Copy link
Author

neilh10 commented Dec 20, 2021

The IP lookup is deep inside the tinyGsm library, as modem specific.
I get 3.14.188.180 when looking it up over WiFi -
and also on ping

ping monitormywatershed.org
Pinging monitormywatershed.org [3.14.188.180] with 32 bytes of data:
Reply from 3.14.188.180: bytes=32 time=83ms TTL=12
ubDQTR Sending data to [ 0 ] monitormywatershed.org

with -DSTREAMDEBUGGER_DBG in platformio.ini

EnvrDIYPub gQS 1 <--EnviroDIYPublisher
getTPP(mS) 3000 <--EnviroDIYPublisher
deszLine Reading sz 47 : 1,2000.000,0.000,4.0577,4.594,0.04277,15.01,-24
: <--LoggerBase
Current TX Buffer Size: 0 <--dataPublisherBase
Connecting client. Timer (mS) <--EnviroDIYPublisher
+++OK
ATLAmonitormywatershed.org
3.14.188.180
ATIP
1
ATDL
0.0.0.0
ATDL3.14.188.180
OK
ATDL
3.14.188.180
ATDE
25
ATDE50
OK
ATDE
50
ATCN
OK
Client connected after 459 ms <--EnviroDIYPublisher
Filled from SD QUE in 18 ms <--EnviroDIYPublisher
POST /api/data-stream/ HTTP/1.1
Host: monitormywatershed.org
TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
Content-Length: 459
Content-Type: application/json

{"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-20T13:30:27-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":1,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":2000.000,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":0.000,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.0577,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":4.594,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":0.04277,"ff4d732d-88d8-4a1b-b499-16417603edfe":15.01,"7182846e-46e0-4a10-b110-9bc32de4aca9":-24}

POST /api/data-stream/ HTTP/1.1
Host: monitormywatershed.org
TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
Content-Length: 459
Content-Type: application/json

{"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-20T13:30:27-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":1,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":2000.000,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":0.000,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.0577,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":4.594,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":0.04277,"ff4d732d-88d8-4a1b-b499-16417603edfe":15.01,"7182846e-46e0-4a10-b110-9bc32de4aca9":-24}
Dumping the TX Buffer <--dataPublisherBase
[[Client waited 5012 mS for 0 bytes. <--EnviroDIYPublisher
+++OK
ATCN
OK
Client stopped after 350 ms <--EnviroDIYPublisher
Rsp:' ' <--EnviroDIYPublisher
-- Response Code -- 504 waited 5012 mS Timeout 5000
getTPT(mS) 5000 <--EnviroDIYPublisher
Sent 1 readings in 7.62 sec. Queued readings= 1
pubDQTR drop retrys. rspCode 504 <--LoggerBase
SyncTimeCheck 0 modulo_sec 1827 Time 1640007027
Updating modem metadata... <--LoggerBase

@neilh10
Copy link
Author

neilh10 commented Dec 21, 2021

So interesting, for a verizon with
with -DSTREAMDEBUGGER_DBG in platformio.ini
it does the lookup gets 3.14.188.180 - and sets it once in the LTE chip on power up.
Thus possibly the requirement to at least reset a system, and more exactly power cycling just in case there is lingering cache.
However deep inside the tinyGsmLib it caches the IP, and then uses it as part of the waiting for the POST response. pretty weird.
For the POSTs, the reading appears to be received by https://monitormywatershed.org/sites/tu_rc_test06/
however the Mayfly never receives a 201 handshake.

With AT stream debug turned on, the Mayfly only detect timeouts. Possibly because of processing delays imposed by some weird tcp/ip closing and opening, that is exagerated by the debug stream.
(the monitoring below is from a minicom terminal program, and doesn't add extra CRLF to make it look better.)

[2021-12-20 17:26:15] pubDQTR Sending data to [ 0 ] monitormywatershed.org
[2021-12-20 17:26:15]  : <--LoggerBase
[2021-12-20 17:26:15] +++OKATLAmonitormywatershed.orgERRORATLAmonitormywatershed.org WATCHDOG ISR barksUntilReset 51 <--WatchDogAVR
[2021-12-20 17:26:28] ERRORATLAmonitormywatershed.org WATCHDOG ISR barksUntilReset 50 <--WatchDogAVR
[2021-12-20 17:26:36] ERRORATLAmonitormywatershed.orgERROR WATCHDOG ISR barksUntilReset 49 <--WatchDogAVR
[2021-12-20 17:26:46] ATLAmonitormywatershed.orgERROR WATCHDOG ISR barksUntilReset 48 <--WatchDogAVR
[2021-12-20 17:26:53] ATLAmonitormywatershed.orgERRORATLAmonitormywatershed.org WATCHDOG ISR barksUntilReset 47 <--WatchDogAVR
[2021-12-20 17:27:06] ERRORATLAmonitormywatershed.orgERRORATLAmonitormywatershed.orgERROR WATCHDOG ISR barksUntilReset 46 <--WatchDogAVR
[2021-12-20 17:27:13] ATLAmonitormywatershed.orgERRORATLAmonitormywatershed.orgERRORATLAmonitormywatershed.org3.14.188.180ATIP1ATDL132.163.96.4ATDL3.14.188.180OKATDL3.14.188.180ATDE25ATDE50OKATDE50ATWROKATACOKATCIFFATCNOKPOST /api/data-stream/ HTTP/1.1
[2021-12-20 17:27:19] Host: monitormywatershed.org
[2021-12-20 17:27:19] TOKEN: e89538ae-7ff0-4897-a396-9b10c0fadfc3
[2021-12-20 17:27:19] Content-Length: 458
[2021-12-20 17:27:19] Content-Type: application/json
[2021-12-20 17:27:19] 
[2021-12-20 17:27:19] {"sampling_feature":"31461753-cd5d-4ed0-99f7-17713c5e1e0e","timestamp":"2021-12-20T17:16:00-08:00","b404d204-f127-44d4-869c-50c058412999":2012,"32b6fb1f-4688-4fb0-ada7-323753536c08":3.7893,"be84dc61-86dd-4ff9-9c44-f1d25bd78f6b":3.775,"a4361b13-a4b4-4d53-b1c8-1398d73ceeec":0.07820,"b40cdca7-22c5-4d6e-91db-4eb77ebf2179":9.24,"e5a6b171-ae06-4fc0-bc3f-68820f9e448c":-0.049,"9a848614-1dc2-4372-890b-68d2b48d9407":7.72,"83fb7c9a-9af5-4bff-874a-d780045c19f0":-81}
[2021-12-20 17:27:19] 
[2021-12-20 17:27:19] POST /api/data-stream/ HTTP/1.1
[2021-12-20 17:27:19] Host: monitormywatershed.org
[2021-12-20 17:27:19] TOKEN: e89538ae-7ff0-4897-a396-9b10c0fadfc3
[2021-12-20 17:27:19] Content-Length: 458
[2021-12-20 17:27:19] Content-Type: application/json
[2021-12-20 17:27:20] 
[2021-12-20 17:27:20] {"sampling_feature":"31461753-cd5d-4ed0-99f7-17713c5e1e0e","timestamp":"2021-12-20T17:16:00-08:00","b404d204-f127-44d4-869c-50c058412999":2012,"32b6fb1f-4688-4fb0-ada7-323753536c08":3.7893,"be84dc61-86dd-4ff9-9c44-f1d25bd78f6b":3.775,"a4361b13-a4b4-4d53-b1c8-1398d73ceeec":0.07820,"b40cdca7-22c5-4d6e-91db-4eb77ebf2179":9.24,"e5a6b171-ae06-4fc0-bc3f-68820f9e448c":-0.049,"9a848614-1dc2-4372-890b-68d2b48d9407":7.72,"83fb7c9a-9af5-4bff-874a-d780045c19f0":-81}+++OKATCIFFATOD0.0.0.0ATCNOK
[2021-12-20 17:27:20] +++OKATCI0ATOD3.14.188.180ATCN WATCHDOG ISR barksUntilReset 45 <--WatchDogAVR
[2021-12-20 17:27:21] OK+++OKATTM64ATTM64OKATCNOK-- Response Code -- 504 waited  7000 mS Timeout 7000

@aufdenkampe
Copy link
Member

@neilh10, thanks for all this info. Our team looked at it all carefully yesterday, and @ptomasula has dug deeply into possible causes and has even started trying potential solutions.

I see that you reached out to the community with additional questions via https://www.envirodiy.org/topic/mmw-timeouts/.
Sorry to not keep you updated over the last couple of days.

The short story is that because of the volume of requests that are coming in, most of them are taking >7 seconds to complete, because of the way they are getting queued up. This means that you are getting a 504 because ModularSensors stops waiting. However, the request completes afterwards, but your devices are no longer listening when the 201 is sent.

All of this is happening because @ptomasula has actually improved error handling, by making it more accurate.

However, the result on your end is that you are seeing what appears to be more errors, even though there are actually fewer errors and our system is more performant.

Compounding it all is that your specific devices now think they failed, so they queue up the same set of data again, along with the current row of data, for another post, even though the data are already in the database. This has resulted in a huge increase in traffic for us over the last week, which is why @ptomasula recently switched back to sending a 201.

Perhaps you and @SRGDamia1 can increase the device timeout period from 7 seconds to 25 seconds, at least for any systems that resend the data.

We're working hard on getting this right, but are still massively handicapped with a convoluted codebase that was mostly written in Python 2 and that is still in very dire need for major refactoring.

@neilh10
Copy link
Author

neilh10 commented Dec 22, 2021

@aufdenkampe thanks - I'll try the 25seconds.
sorry for the nudge on that, I was wanted to check as both my systems where timing out, and trying to figure out if I needed to do something.
Ouch, yes more accurate now, so I'm guessing its tracking the received message better, and therefore more visibility when something goes wrong and isn't able to respond with the 201 ..... I guess its the "tech debt" for lack of initial testing coverage.

So setting my test08 to

[PROVIDER_MMW]
CLOUD_ID=monitormywatershed.org
TIMER_POST_TOUT_MS=25000; Gateway Timeout (ms)
TIMER_POST_PACE_MS=3000; Pacing between POSTs

Yeah - it responded in 8.8sec

Connected Internet

pubDQTR Sending data to [ 0 ] monitormywatershed.org
POST /api/data-stream/ HTTP/1.1
Host: monitormywatershed.org
TOKEN: 0cf7c40a-232e-457d-87d6-cea5c0757fec
Content-Length: 466
Content-Type: application/json

{"sampling_feature":"236c674b-69b9-43af-b0d6-33d67b870ecc","timestamp":"2021-12-22T12:08:02-08:00","8c57835f-a32f-4d62-82dc-0ba09f04cf52":1,"1f2c9e91-3aa6-44d7-9312-160d04fbf877":2000.000,"65e0a9e5-cc8a-4ed6-8d28-127b5ec5e8e9":0.223,"a0e41a66-875a-44fc-9e2f-02c6e25f6063":4.0236,"3bebd4a3-8b54-4f92-ba55-5fd2fd021358":3.775,"43bcda9b-2973-4639-af2c-f0b6bb3fa44b":-9999.00000,"ff4d732d-88d8-4a1b-b499-16417603edfe":-9999.00,"7182846e-46e0-4a10-b110-9bc32de4aca9":-25}

 WATCHDOG ISR barksUntilReset 51 <--WatchDogAVR
-- Response Code -- 201 waited  8759 mS Timeout 25000

@neilh10
Copy link
Author

neilh10 commented Dec 22, 2021

For my test06 LTE/Verizon ~~ yeah ~~ some 201 - averaging 10secs
and still some timeouts at 25secs

[2021-12-22 12:16:03] Connecting to the Internet with Digi XBee3 Cellular LTE-M
[2021-12-22 12:16:03] Connected Internet
[2021-12-22 12:17:16] POST /api/data-stream/ HTTP/1.1
[2021-12-22 12:17:16] Host: monitormywatershed.org
[2021-12-22 12:17:16] TOKEN: e89538ae-7ff0-4897-a396-9b10c0fadfc3
[2021-12-22 12:17:16] Content-Length: 456
[2021-12-22 12:17:16] Content-Type: application/json
[2021-12-22 12:17:16] 
[2021-12-22 12:17:16] {"sampling_feature":"31461753-cd5d-4ed0-99f7-17713c5e1e0e","timestamp":"2021-12-22T12:10:00-08:00","b404d204-f127-44d4-869c-50c058412999":77,"32b6fb1f-4688-4fb0-ada7-323753536c08":4.0626,"be84dc61-86dd-4ff9-9c44-f1d25bd78f6b":4.048,"a4361b13-a4b4-4d53-b1c8-1398d73ceeec":0.10520,"b40cdca7-22c5-4d6e-91db-4eb77ebf2179":9.23,"e5a6b171-ae06-4fc0-bc3f-68820f9e448c":-0.049,"9a848614-1dc2-4372-890b-68d2b48d9407":7.52,"83fb7c9a-9af5-4bff-874a-d780045c19f0":-93}
[2021-12-22 12:17:16] 
[2021-12-22 12:17:18]  WATCHDOG ISR barksUntilReset 44 <--WatchDogAVR
[2021-12-22 12:17:27]  WATCHDOG ISR barksUntilReset 43 <--WatchDogAVR
[2021-12-22 12:17:36]  WATCHDOG ISR barksUntilReset 42 <--WatchDogAVR
[2021-12-22 12:17:43] -- Response Code -- 504 waited  25010 mS Timeout 25000
[2021-12-22 12:17:43] ... Watchdog low barksUntilReset 42  expected 52
[2021-12-22 12:17:44] POST /api/data-stream/ HTTP/1.1
[2021-12-22 12:17:44] Host: monitormywatershed.org
[2021-12-22 12:17:44] TOKEN: e89538ae-7ff0-4897-a396-9b10c0fadfc3
[2021-12-22 12:17:44] Content-Length: 456
[2021-12-22 12:17:44] Content-Type: application/json
[2021-12-22 12:17:44] 
[2021-12-22 12:17:44] {"sampling_feature":"31461753-cd5d-4ed0-99f7-17713c5e1e0e","timestamp":"2021-12-22T12:12:00-08:00","b404d204-f127-44d4-869c-50c058412999":78,"32b6fb1f-4688-4fb0-ada7-323753536c08":4.0821,"be84dc61-86dd-4ff9-9c44-f1d25bd78f6b":4.124,"a4361b13-a4b4-4d53-b1c8-1398d73ceeec":0.10484,"b40cdca7-22c5-4d6e-91db-4eb77ebf2179":9.26,"e5a6b171-ae06-4fc0-bc3f-68820f9e448c":-0.049,"9a848614-1dc2-4372-890b-68d2b48d9407":7.51,"83fb7c9a-9af5-4bff-874a-d780045c19f0":-93}
[2021-12-22 12:17:44] 
[2021-12-22 12:17:53]  WATCHDOG ISR barksUntilReset 51 <--WatchDogAVR
[2021-12-22 12:17:56] -- Response Code -- 201 waited  10240 mS Timeout 25000
[2021-12-22 12:18:11] -- Response Code -- 201 waited  10023 mS Timeout 25000

[2021-12-22 12:18:28] -- Response Code -- 201 waited  9145 mS Timeout 25000

[2021-12-22 12:18:43] -- Response Code -- 201 waited  9697 mS Timeout 25000

[2021-12-22 12:18:58] -- Response Code -- 201 waited  9855 mS Timeout 25000

[2021-12-22 12:19:29] -- Response Code -- 504 waited  25000 mS Timeout 25000



[2021-12-22 12:40:52] -- Response Code -- 504 waited  25010 mS Timeout 25000

[2021-12-22 12:41:20] -- Response Code -- 504 waited  25010 mS Timeout 25000

[2021-12-22 12:41:32] -- Response Code -- 201 waited  10275 mS Timeout 25000

[2021-12-22 12:41:47] -- Response Code -- 201 waited  10144 mS Timeout 25000

[2021-12-22 12:42:02] -- Response Code -- 201 waited  10000 mS Timeout 25000

[2021-12-22 12:42:19] -- Response Code -- 201 waited  8913 mS Timeout 25000

[2021-12-22 12:42:50] -- Response Code -- 504 waited  25010 mS Timeout 25000

@aufdenkampe
Copy link
Member

@neilh10, thanks for extending the timeout time and posting your results. That's very helpful to see how long it takes for each of your posts, and the exact timing of each. I'm assuming that all the above posts the same size payload. Is that right?

The exact it takes to complete a POST is as much about how many other POSTs are ahead in the queue.

We'll be working on other hot fixes and will keep you informed.

@aufdenkampe
Copy link
Member

aufdenkampe commented Dec 22, 2021

By the way @neilh10, I've been editing your posts to add code snippet characters around what I see as code blocks, like this: https://www.markdownguide.org/extended-syntax/#fenced-code-blocks

and sometimes with syntax highlighting like this: https://www.markdownguide.org/extended-syntax/#syntax-highlighting.
using these language names or aliases: https://github.com/github/linguist/blob/master/lib/linguist/languages.yml

It helps me separate your text from your pasted code and makes it easier for me to read.

@neilh10
Copy link
Author

neilh10 commented Dec 22, 2021

@aufdenkampe Great the responses times are useful, I believe that response time is a characterization of the system.
The servers processing isn't an infinite resource, so in being able to figure out if something has changed is to monitor that spread of response times?

thanks for the edits - I must admit I'm filtering for the debug to try and bring out the core issues.
The mark up has got me flumexed at times - and conflicts with normal text ~~ humour ~~ so am trying to keep it simple text - and then elsewhere on enviroDIY there was strong discouragement for editing, locking messages, - but hey I think whatever make it more viewable, thanks for the links - will try to use them, or at least the Preview. But even the Preview seemed to be challenged.

@neilh10
Copy link
Author

neilh10 commented Dec 28, 2021

Oops reading through the previous posts - yes all the POSTs are exactly the same size. A standard Mayfly POST. If any differences then its got lost on its way through the processing.

Just for the record, looking at my tests that have been running over the holiday weekend, my two test systems which are set at timeout of 25sec, are only getting timeouts.
The last time my test06 over Verizon got a 201
[2021-12-24 10:43:30] -- Response Code -- 201 waited 9988 mS Timeout 25000

@neilh10
Copy link
Author

neilh10 commented Jan 3, 2022

I put the test systems at normal updating every hour, 4 records, to test some Modularsensor's changes, and it started getting 201's
test06/Verizon

[2022-01-02 05:49:22] -- Response Code -- 201 waited  3025 mS Timeout 25000

response times variable, some snapshots..

[2022-01-02 11:49:16] -- Response Code -- 201 waited  1806 mS Timeout 25000

..

[2022-01-02 17:50:15] -- Response Code -- 201 waited  9228 mS Timeout 25000
[2022-01-02 18:49:16] -- Response Code -- 201 waited  1651 mS Timeout 25000
[2022-01-02 21:49:35] -- Response Code -- 201 waited  7963 mS Timeout 25000

On test08/WiFi some fast responses

[2022-01-02 11:04:23.003] -- Response Code -- 201 waited  999 mS Timeout 15000
[2022-01-02 13:04:12.813] -- Response Code -- 201 waited  637 mS Timeout 15000

@aufdenkampe
Copy link
Member

Closing with the successful hotfix deployed yesterday and described in:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants