Discussion:
[Twisted-Python] buildbot.twistedmatrix.com is down a lot
Craig Rodrigues
2016-07-17 05:11:27 UTC
Permalink
In the past few days, buildbot.twistedmatrix.com seems to be down all the
time, and requires manual restarts. As I write this, it is down right now.

Is there something wrong with the hardware involved with
buildbot.twistedmatrix.com?

--
Craig
Adi Roiban
2016-07-17 06:18:14 UTC
Permalink
Post by Craig Rodrigues
In the past few days, buildbot.twistedmatrix.com seems to be down all the
time, and requires manual restarts. As I write this, it is down right now.
Is there something wrong with the hardware involved with
buildbot.twistedmatrix.com?
The hardware is fine.
For some unknown reason the buildmaster process is terminated.

I have restarted it again.
--
Adi Roiban
Amber Brown
2016-07-17 06:21:51 UTC
Permalink
It's OOMing -- I think the upgrade to Eight trunk introduced some sort of
memory usage regression or we've done something wrong -- I've unfortunately
not had time to investigate.

We could size up the RAM in the meantime I guess?

-Amber
Post by Adi Roiban
Post by Craig Rodrigues
In the past few days, buildbot.twistedmatrix.com seems to be down all the
time, and requires manual restarts. As I write this, it is down right now.
Is there something wrong with the hardware involved with
buildbot.twistedmatrix.com?
The hardware is fine.
For some unknown reason the buildmaster process is terminated.
I have restarted it again.
--
Adi Roiban
Adi Roiban
2016-07-17 06:36:23 UTC
Permalink
Post by Amber Brown
It's OOMing -- I think the upgrade to Eight trunk introduced some sort of
memory usage regression or we've done something wrong -- I've unfortunately
not had time to investigate.
We could size up the RAM in the meantime I guess?
-Amber
I can try to revert the github webhooks + github status send and see if we
still get these errors.

I also don't have too much time to investigate, but I can revert things if
it helps.
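
For reference, these are roughly the master.cfg pieces involved -- a sketch
from memory of the stock Eight-era hooks, not necessarily our exact config,
and the token/repo values are placeholders. Reverting means commenting these
out:

    from buildbot.status import html
    from buildbot.status.github import GitHubStatus

    # web status with the GitHub change hook enabled (the webhook receiver)
    c['status'].append(html.WebStatus(
        http_port=8010,
        change_hook_dialects={'github': True},
    ))

    # pushes build results back to GitHub as commit statuses
    c['status'].append(GitHubStatus(
        token='<github api token>',   # placeholder
        repoOwner='twisted',
        repoName='twisted',
    ))
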
--
Adi Roiban
Amber Brown
2016-07-17 06:38:38 UTC
Permalink
Yeah, that's a good idea - disable them for now, and we'll see if the OOMs
happen. Then we can investigate them closer if it stops.
Post by Adi Roiban
Post by Amber Brown
It's OOMing -- I think the upgrade to Eight trunk introduced some sort of
memory usage regression or we've done something wrong -- I've unfortunately
not had time to investigate.
We could size up the RAM in the meantime I guess?
-Amber
I can try to revert the github webhooks + github status send and see if we
still get these errors.
I also don't have too much time to investigate, but I can revert things if
it helps.
--
Adi Roiban
Adi Roiban
2016-07-17 06:47:40 UTC
Permalink
Post by Amber Brown
Yeah, that's a good idea - disable them for now, and we'll see if the OOMs
happen. Then we can investigate them closer if it stops.
Post by Adi Roiban
Post by Amber Brown
It's OOMing -- I think the upgrade to Eight trunk introduced some sort
of memory usage regression or we've done something wrong -- I've
unfortunately not had time to investigate.
We could size up the RAM in the meantime I guess?
-Amber
I can try to revert the github webhooks + github status send and see if
we still get these errors.
I also don't have too much time to investigate, but I can revert things
if it helps.
There is this ticket https://github.com/twisted-infra/braid/issues/216 to
track the progress and changes.
--
Adi Roiban
James Broadhead
2016-07-18 18:04:12 UTC
Permalink
It's OOMing (...)
Have you considered something like monit[1] to detect & restart in cases
like this?

[1] https://mmonit.com/monit/
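
Something along these lines is what I have in mind -- only a rough sketch;
the pidfile path, start/stop commands, port, and memory threshold below are
guesses, not the actual twistedmatrix.com setup:

    check process buildmaster with pidfile /srv/buildmaster/twistd.pid
        start program = "/usr/bin/buildbot start /srv/buildmaster"
        stop program  = "/usr/bin/buildbot stop /srv/buildmaster"
        # restart if the master stays above ~1.5 GB for three cycles
        if totalmem > 1500 MB for 3 cycles then restart
        # alert (without restarting) if the web UI stops answering
        if failed port 8080 protocol http for 2 cycles then alert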
Adi Roiban
2016-07-20 13:31:24 UTC
Permalink
Post by James Broadhead
It's OOMing (...)
Have you considered something like monit[1] to detect & restart in cases
like this?
This might help, but it will not help us understand what we are doing wrong :)

After disabling the GitHub webhooks, the buildbot looks stable... so we
might have a clue about what is going wrong.

Right now I don't have time to look into this issue, so the GitHub hooks
are disabled for now from the GitHub UI.
--
Adi Roiban
Glyph Lefkowitz
2016-07-20 16:51:59 UTC
Permalink
It's OOMing (...)
Have you considered something like monit[1] to detect & restart in cases like this?
This might help, but it will not help us understand what we are doing wrong :)
After disabling the GitHub webhooks, the buildbot looks stable... so we might have a clue about what is going wrong.
Right now I don't have time to look into this issue, so the GitHub hooks are disabled for now from the GitHub UI.
Can someone who's had a direct look at the OOMing process (adi? amber?) report this upstream? It's a real pity that we won't get github statuses for buildbot builds any more; that was a huge step in the right direction.

-glyph
Adi Roiban
2016-07-20 18:01:44 UTC
Permalink
Post by Glyph Lefkowitz
It's OOMing (...)
Have you considered something like monit[1] to detect & restart in cases like this?
This might help, but it will not help us understand what we are doing wrong :)
After disabling the GitHub webhooks, the buildbot looks stable... so we
might have a clue about what is going wrong.
Right now I don't have time to look into this issue, so the GitHub hooks
are disabled for now from the GitHub UI.
Can someone who's had a direct look at the OOMing process (adi? amber?)
report this upstream? It's a real pity that we won't get github statuses
for buildbot builds any more; that was a huge step in the right direction.
I don't know how to approach this.
By the time I observed the issue, the buildbot process was already dead.

I have recently discovered the Rackspace monitoring capabilities for VMs...
and set up a memory notification... I am not sure who will receive the alerts.

I have re-enabled the GitHub hooks and will start taking a closer look at the
buildmaster process... but maybe 2GB is just not enough for a buildmaster.

I have triggered the creation of an image for the current buildbot machine
and will consider upgrading the buildbot to 4GB of memory to see if we
still hit the ceiling.
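
As part of taking that closer look, I might also log the master's own memory
use from master.cfg -- just a sketch, and the 10 minute interval and use of
the twistd log are arbitrary choices on my side:

    import resource

    from twisted.internet import task
    from twisted.python import log

    def log_memory():
        # ru_maxrss is reported in kilobytes on Linux
        peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
        log.msg("buildmaster peak RSS: %.1f MB" % peak_mb)

    # record the peak memory use every 10 minutes for the life of the process
    task.LoopingCall(log_memory).start(600, now=True)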

For my project I have a similar buildmaster in terms of the number of builders
and slaves (without GitHub hooks and without linter factories), and after 2
weeks of uptime its virtual memory usage is 1.5GB... so maybe 2GB is just not
enough for buildbot.
--
Adi Roiban
Glyph Lefkowitz
2016-07-20 21:31:54 UTC
Permalink
Post by Adi Roiban
It's OOMing (...)
Have you considered something like monit[1] to detect & restart in cases like this?
This might help, but it will not help us understand what we are doing wrong :)
After disabling the GitHub webhooks, the buildbot looks stable... so we might have a clue about what is going wrong.
Right now I don't have time to look into this issue, so the GitHub hooks are disabled for now from the GitHub UI.
Can someone who's had a direct look at the OOMing process (adi? amber?) report this upstream? It's a real pity that we won't get github statuses for buildbot builds any more; that was a huge step in the right direction.
I don't know how to approach this.
By the time I observed the issue, the buildbot process was already dead.
Yeah, these types of issues are tricky to debug. Thanks for looking into it nonetheless; I was hoping you knew more, but if you don't, nothing to be done.
Post by Adi Roiban
I have recently discovered the Rackspace monitoring capabilities for VMs... and set up a memory notification... I am not sure who will receive the alerts.
I'll make sure that the relevant people are on the monitoring list.
Post by Adi Roiban
I have re-enabled the GitHub hooks and will start taking a closer look at the buildmaster process... but maybe 2GB is just not enough for a buildmaster.
Thanks.
Post by Adi Roiban
I have triggered the creation of an image for the current buildbot machine and will consider upgrading the buildbot to 4GB of memory to see if we still hit the ceiling.
For my project I have a similar buildmaster in terms of the number of builders and slaves (without GitHub hooks and without linter factories), and after 2 weeks of uptime its virtual memory usage is 1.5GB... so maybe 2GB is just not enough for buildbot.
Bummer. It does seem like that's quite likely.

-glyph
Glyph Lefkowitz
2016-07-20 23:58:33 UTC
Permalink
Post by Glyph Lefkowitz
Post by Adi Roiban
I have recently discovered the Rackspace monitoring capabilities for VMs... and set up a memory notification... I am not sure who will receive the alerts.
I'll make sure that the relevant people are on the monitoring list.
I created 'technical contact' users for you and Amber, with current email addresses, which you can use (and even log in as!) if you edit yourselves under 'user management'. I apparently had one already. You should both have a bogus alert about a MySQL server (since we don't run mysql it seemed a reasonable thing to test). Make sure that's not flagged as spam and we should all be set up to receive alerts :).

I also added some basic HTTPS monitoring to it as well, so we should see if it goes down for reasons unrelated to memory.

-glyph
Adi Roiban
2016-07-21 12:49:28 UTC
Permalink
Post by Glyph Lefkowitz
Post by Adi Roiban
I have recently discovered the Rackspace monitoring capabilities for VMs...
and set up a memory notification... I am not sure who will receive the alerts.
I'll make sure that the relevant people are on the monitoring list.
I created 'technical contact' users for you and Amber, with current email
addresses, which you can use (and even log in as!) if you edit yourselves
under 'user management'. I apparently had one already. You should both
have a bogus alert about a MySQL server (since we don't run mysql it seemed
a reasonable thing to test). Make sure that's not flagged as spam and we
should all be set up to receive alerts :).
I also added some basic HTTPS monitoring to it as well, so we should see
if it goes down for reasons unrelated to memory.
OK, I have received the MySQL alert.

I can see that when we get more builds there is a significant increase
in memory usage... but it recovers once the buildmaster goes idle.

For now the VM still has 2GB... and GitHub webhooks are still enabled.

Regards
--
Adi Roiban
Glyph Lefkowitz
2016-07-21 17:40:27 UTC
Permalink
I can see that when we get more builds there is a significant increase in memory usage... but it recovers once the buildmaster goes idle.
Cool. Is there something we can do to limit the global concurrency of the builds to preserve resources on the buildmaster, then?

Or: perhaps we could move the buildbot to Carina, which has 4G of RAM and won't impact our hosting budget?
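
If we go the concurrency route, Buildbot's locks should be able to cap it
globally -- a sketch for master.cfg, with an arbitrary limit and made-up
builder/slave names:

    from buildbot import locks
    from buildbot.config import BuilderConfig

    # allow at most 4 builds to run anywhere on this master at one time
    build_slot = locks.MasterLock("global-build-slot", maxCount=4)

    c['builders'] = [
        BuilderConfig(
            name="ubuntu64-py2.7",            # made-up example builder
            slavenames=["slave1", "slave2"],  # made-up slave names
            factory=build_factory,            # assumed defined elsewhere
            locks=[build_slot.access('counting')],
        ),
        # ... the same locks=[...] argument on every other builder ...
    ]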

-glyph
