Discussion:
[Twisted-Python] clientfactory cleanup slow-down (after many http requests)
Randomcoder
2016-08-06 10:48:23 UTC
Permalink
Hello,

I've been working on a small Twisted program.
The program makes HTTP requests to a large number of feeds.
Twisted is used to speed up the entire process.
After the feeds are fetched, they're parsed. Finally they should be
written to a database (to simplify the code, that part is left out).

Feeds are fetched in parallel using gatherResults, and a batch is
built. All the batches are then gathered together, and a DeferredList
is built out of them. One semaphore controls the batch-level list of
deferreds, and another semaphore controls the deferred for the entire
batch list.
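
Roughly, the structure looks like the sketch below (simplified;
fetch_one stands in for the real HTTP fetch and parsing, and the
numbers are just illustrative):

    from twisted.internet import defer

    BATCH_SIZE = 10

    def fetch_all(feed_urls, fetch_one):
        feed_sem = defer.DeferredSemaphore(BATCH_SIZE)  # throttles feeds within a batch
        batch_sem = defer.DeferredSemaphore(4)          # throttles whole batches

        def fetch_batch(urls):
            # Each batch gathers its feed deferreds into one Deferred.
            return defer.gatherResults(
                [feed_sem.run(fetch_one, url) for url in urls])

        batches = [feed_urls[i:i + BATCH_SIZE]
                   for i in range(0, len(feed_urls), BATCH_SIZE)]
        # All the batch Deferreds are collected into a single DeferredList.
        batch_ds = [batch_sem.run(fetch_batch, urls) for urls in batches]
        return defer.DeferredList(batch_ds, consumeErrors=True)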

Currently, the program works ok on 100-150 feeds, and BATCH_SIZE between
5 and 20.

However, I notice that the program starts to hang for a long time when
the number of feeds goes over 150-200.

To be more precise, at the end of a run, messages like these are
printed, but the program does not seem to be doing much:

Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f0b7d5f3908>

It seems like this is the cleanup phase.

I've read what I could find on the topic. I wasn't able to make progress
on it, so I'm posting to the mailing list to ask if someone has encountered this
before. Maybe it's a common pitfall or issue that other people have also
bumped into.

Thanks
Glyph Lefkowitz
2016-08-06 22:51:39 UTC
Permalink
Post by Randomcoder
Hello,
I've been working on a small Twisted program.
Cool, thanks for using Twisted.
Post by Randomcoder
The program makes HTTP requests to a large number of feeds.
Twisted is used to speed up the entire process.
After the feeds are fetched, they're parsed. Finally they should be
written to a database (to simplify the code, that part is left out).
Thanks for including examples, so we know exactly what you're talking about! :)
Post by Randomcoder
Feeds are fetched in parallel using gatherResults, and a batch is
built. All the batches are then gathered together, and a DeferredList
is built out of them. One semaphore controls the batch-level list of
deferreds, and another semaphore controls the deferred for the entire
batch list.
Currently, the program works ok on 100-150 feeds, and BATCH_SIZE between
5 and 20.
This all seems pretty reasonable and following best practices and such...
Post by Randomcoder
However, I notice that the program starts to hang for a long time when
the number of feeds goes over 150-200.
Two key questions: what do you mean by "hang" and what is "a long time"? Do you mean it's totally unresponsive, or do you mean it's just failing to make progress on downloading more feeds?
Post by Randomcoder
To be more precise, at the end of a run, messages like these are
printed, but the program does not seem to be doing much:
Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f0b7d5f3908>
It seems like this is the cleanup phase.
This just means that it is finished making connections. We have to do some clean-up around the usefulness of these log messages, sorry :-\.
Post by Randomcoder
I've read what I could find on the topic. I wasn't able to make progress
on it, so I'm posting to the mailing list to ask if someone has encountered this
before. Maybe it's a common pitfall or issue that other people have also
bumped into.
Right now, my guess is this: some of the sites you're contacting have very slow proxies, or for some other reason let you connect to them, but then hang when sent requests. If you're simultaneously requesting stuff from a very large number of different sites, this is sort of inevitably bound to happen, either based on network problems, or issues with the sites themselves. I suspect you thought that the connectTimeout argument to Agent would save you from this, but that timeout is just for making the initial underlying TCP connection, not receiving a full response. What you actually want to do is cancel the Deferred returned by Agent.request.
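
If you wanted to do that by hand, a minimal sketch (with made-up names;
treq below saves you from writing this yourself) looks something like:

    from twisted.internet import reactor

    def request_with_timeout(agent, url, timeout=30):
        # `url` must be bytes for Agent.request.
        d = agent.request(b'GET', url)
        # Cancel the request Deferred if it hasn't fired within `timeout` seconds.
        delayed_cancel = reactor.callLater(timeout, d.cancel)
        def stop_cancel(passthrough):
            # The request finished (or failed) in time; drop the pending cancel.
            if delayed_cancel.active():
                delayed_cancel.cancel()
            return passthrough
        d.addBoth(stop_cancel)
        return d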

Luckily, https://treq.readthedocs.io/en/latest/ already implements this high-level timeout functionality for you, in the form of the 'timeout=' argument it accepts. If you give that a try, do you see more connections timing out as it runs, rather than "hanging" the process for long periods of time?
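
For example (the URL handling and the 30-second value are just placeholders):

    import treq
    from twisted.internet import defer

    @defer.inlineCallbacks
    def fetch_feed(url):
        # treq cancels the whole request (not just the TCP connect) if it
        # takes longer than `timeout` seconds; the cancellation shows up
        # as an errback on the returned Deferred.
        response = yield treq.get(url, timeout=30)
        body = yield treq.content(response)
        defer.returnValue(body)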

As long as I'm looking at your code, as a way of thanking you for providing such a nice specific runnable example, I have a few other random thoughts which may be useful to you:

- I see you're importing psycopg. Do you know about https://txpostgres.readthedocs.io/en/latest/ ? You can talk to postgres asynchronously with Twisted.
- d.addCallback(lambda out: out).addCallback(lambda resp: client.readBody(resp)) can be much more briefly spelled "d.addCallback(client.readBody)". d.addErrback(lambda err: err) does nothing and can just be removed.
- BrowserLikePolicyForHTTPS() is the default, so you don't need to pass that.
- clean_up_and_exit will only be called if batchesDef doesn't fail, and if it does fail, it will swallow the exception message. Rather than manually calling `reactor.stop`, you probably want to use react() <https://twistedmatrix.com/documents/16.3.0/api/twisted.internet.task.html#react>. This way your function is an API that anyone who wants to use it can call - it just returns a Deferred when it's done - but your __main__ block calls react(), which will both start and stop the reactor, as well as report errors if there's a problem while still shutting down (a small sketch follows).
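
Here's a small sketch of what I mean, with treq standing in for your
fetching code and a made-up URL list:

    import treq
    from twisted.internet import defer, task

    @defer.inlineCallbacks
    def main(reactor, feed_urls):
        # react() stops the reactor when the Deferred from this function
        # fires, and logs the failure if it errbacks instead of silently
        # swallowing it.
        bodies = yield defer.gatherResults(
            [treq.get(url, timeout=30).addCallback(treq.content)
             for url in feed_urls],
            consumeErrors=True)
        defer.returnValue(bodies)

    if __name__ == '__main__':
        task.react(main, [["https://example.com/feed.xml"]])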

Hope some of that code review is helpful - let us know if the treq timeout solves the problem or if the issue is somewhere else!

-glyph
Manish Tomar
2016-08-11 21:19:45 UTC
Permalink
Wow! This is the friendliest way to welcome a new Twisted programmer. Great
job Glyph! :)

Regards,
Manish