[Twisted-Python] AMP message length limit

Discussion:

Oon-Ee Ng

2015-11-22 22:51:57 UTC

I've just (to my surprise) hit this. As I understand from searching
around, AMP messages are limited to ~64k due to the length prefix
being 16-bit. A change in my internal data being sent (using dicts
rather than lists) kicked one of my messages to way over that limit.
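
To make the limit concrete: a 16-bit unsigned length prefix can hold at most 65535, so no single value can be longer than that many bytes. The sketch below uses only the standard struct module rather than Twisted itself, and frame_value is a hypothetical helper written for illustration, not part of AMP's API.

```python
import struct

MAX_VALUE_LENGTH = 0xFFFF  # largest number a 16-bit unsigned prefix can hold


def frame_value(payload: bytes) -> bytes:
    """Prefix payload with a big-endian 16-bit length (illustrative helper)."""
    if len(payload) > MAX_VALUE_LENGTH:
        raise ValueError(
            "payload of %d bytes does not fit a 16-bit length prefix" % len(payload)
        )
    return struct.pack("!H", len(payload)) + payload


framed = frame_value(b"x" * 65535)  # fits exactly
try:
    frame_value(b"x" * 65536)       # one byte too many
except ValueError as exc:
    print(exc)
```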

There's a bit of discussion here -
http://twistedmatrix.com/pipermail/twisted-python/2014-November/028947.html

Is there an internal twisted solution planned, or should I go ahead
and roll my own paging code? If the latter (as I strongly suspect),
could I get some comments on this idea:-

Original amp.Command had a single argument (amp.ListOf(amp.String()))
and no response

Modified amp.Command, 4 arguments and 1 response
ID (sequentially generated by producer) - amp.Integer()
CurPage - amp.Integer()
TotalPage - amp.Integer()
ActualData - amp.ListOf(amp.String())
Response - RecievedPage - amp.Integer()

Questions:-
1. ID is so the client can be sure not to concatenate different lists
2. Do I need a response at all?
3. Should I attempt to plug as many list items as possible into each
page (requires length checking of json-encoded strings and repeated
encoding/checks) or just choose a suitable limit of list items (my
current max length is about 200 characters and average is 71) of maybe
300 list items per message? My current list is about 1k items in all,
and it's only going to get bigger.
4. I was intrigued by the mention of 'Tubes' in the link above. Found
it here - https://tubes.readthedocs.org/en/latest/tube.html - should I
be using that instead? I'm writing a homegrown app which will only
really need (at this point) to communicate with itself and copies of
itself, and settled with AMP as being a simple way of achieving that.

Thanks for the time.

Glyph Lefkowitz

2015-11-23 00:54:43 UTC

Post by Oon-Ee Ng
I've just (to my surprise) hit this. As I understand from searching
around, AMP messages are limited to ~64k due to the length prefix
being 16-bit. A change in my internal data being sent (using dicts
rather than lists) kicked one of my messages to way over that limit.

I'm sorry that this was an unpleasant surprise. I wish that we had a better way of getting this across up-front :-). However, it seems like the length limit is doing its job in terms of constraining your protocol design to not have individual messages "hog" the wire...

Post by Oon-Ee Ng
There's a bit of discussion here -
http://twistedmatrix.com/pipermail/twisted-python/2014-November/028947.html
Is there an internal twisted solution planned, or should I go ahead
and roll my own paging code? If the latter (as I strongly suspect),
could I get some comments on this idea:-

Definitely the latter if you have a short time frame. How big are your messages? If your limit is still fairly small (5M, let's say) but much bigger than 64k there are other options you can use.

Post by Oon-Ee Ng
Original amp.Command had a single argument (amp.ListOf(amp.String()))
and no response
Modified amp.Command, 4 arguments and 1 response
ID (sequentially generated by producer) - amp.Integer()
CurPage - amp.Integer()
TotalPage - amp.Integer()
ActualData - amp.ListOf(amp.String())
Response - RecievedPage - amp.Integer()

Implementing a paging API like this is exactly what the length limit is designed to encourage you to do - it is much more flexible, since you can request a subset of pages, and continue receiving things other than pages while the data is being streamed to you.
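
On the receiving side, the ID / CurPage / TotalPage fields proposed above are enough to reassemble a list even when pages from different transfers interleave. The following is a minimal sketch in plain Python, independent of Twisted; PageAssembler and its method names are hypothetical, and it assumes pages are numbered from 0.

```python
class PageAssembler:
    """Collect pages per transfer ID and hand back the full list once complete."""

    def __init__(self):
        self._pending = {}  # transfer_id -> {page_number: items}

    def add_page(self, transfer_id, cur_page, total_pages, items):
        pages = self._pending.setdefault(transfer_id, {})
        pages[cur_page] = list(items)
        if len(pages) < total_pages:
            return None  # still waiting for more pages
        del self._pending[transfer_id]
        # Join pages in page order, regardless of arrival order.
        return [item for number in sorted(pages) for item in pages[number]]


assembler = PageAssembler()
first = assembler.add_page(1, 0, 2, ["a", "b"])  # transfer 1 is incomplete
other = assembler.add_page(2, 0, 1, ["x"])       # transfer 2 has a single page
done = assembler.add_page(1, 1, 2, ["c"])        # transfer 1 is now complete
```

A responder for the paging command could call add_page with the received arguments and only pass the result on to the application when it is not None.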

Post by Oon-Ee Ng
Questions:-
1. ID is so the client can be sure not to concatenate different lists

This... is correct, but doesn't sound like a question. Is it meant to be?

Post by Oon-Ee Ng
2. Do I need a response at all?

No. You can tell AMP not to bother generating the protocol-level response by setting the requiresAnswer flag on your Command to False: <https://twistedmatrix.com/documents/15.4.0/api/twisted.protocols.amp.Command.html#requiresAnswer>

Post by Oon-Ee Ng
3. Should I attempt to plug as many list items as possible into each
page (requires length checking of json-encoded strings and repeated
encoding/checks) or just choose a suitable limit of list items (my
current max length is about 200 characters and average is 71) of maybe
300 list items per message? My current list is about 1k items in all,
and it's only going to get bigger.

Why are you encoding as _both_ JSON and AMP?

I'd say you should do the length-checking, because you still might end up with list items that are larger than expected if they're variable size.
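
A length check does not have to involve repeated encoding. One possible approach is a single greedy pass that sums the encoded size of each item and starts a new page when the budget would be exceeded. This is a sketch under stated assumptions: paginate is a hypothetical helper, the items are assumed to be text sent as UTF-8, the 2 bytes of per-item overhead are an assumption about how a list argument frames each element, and the 60000-byte budget is an arbitrary figure chosen to leave headroom below the 64k limit.

```python
def paginate(items, max_bytes=60000, per_item_overhead=2):
    """Greedily group text items into pages whose encoded size stays under max_bytes."""
    pages, current, used = [], [], 0
    for item in items:
        # Count bytes, not characters: non-ASCII text takes more than one byte each.
        cost = len(item.encode("utf-8")) + per_item_overhead
        if cost > max_bytes:
            raise ValueError("a single item exceeds the page budget")
        if current and used + cost > max_bytes:
            pages.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        pages.append(current)
    return pages


items = ["a" * 71] * 1000  # roughly the list described above
pages = paginate(items)
print(len(pages), [len(page) for page in pages])
```

Because each item is measured once, the cost is linear in the number of items, and an oversized single item is reported instead of silently producing a page that cannot be sent.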

Post by Oon-Ee Ng
4. I was intrigued by the mention of 'Tubes' in the link above. Found
it here - https://tubes.readthedocs.org/en/latest/tube.html - should I
be using that instead? I'm writing a homegrown app which will only
really need (at this point) to communicate with itself and copies of
itself, and settled with AMP as being a simple way of achieving that.

I would love it if you would help me test out and develop Tubes. If it is a small homegrown app it might be a good use-case. There are pros and cons: Tubes has higher test coverage and cleaner code since it was developed much more recently; but, it still has very limited functionality, badly broken areas, and no compatibility guarantees, because it's still somewhat experimental.

However, Tubes is a way of implementing protocols, whereas AMP is an implementation of a request/response protocol. If you use Tubes, you'll need to do an implementation of AMP (or something like it) in order to issue requests and give responses. If I were you, especially since you've already figured out paging, I would probably just stick with AMP and Twisted as-is.

-glyph

Oon-Ee Ng

2015-11-24 01:16:59 UTC

On Mon, Nov 23, 2015 at 8:54 AM, Glyph Lefkowitz

Post by Glyph Lefkowitz
I'm sorry that this was an unpleasant surprise. I wish that we had a better
way of getting this across up-front :-). However, it seems like the length
limit is doing its job in terms of constraining your protocol design to not
have individual messages "hog" the wire...

Yes, that it did.

Post by Glyph Lefkowitz
Definitely the latter if you have a short time frame. How big are your
messages? If your limit is still fairly small (5M, let's say) but much
bigger than 64k there are other options you can use.

I don't foresee it getting over an MB or so (as the data is being read
from disk, so unlikely that network I/O will be the biggest bottleneck
in this case).

Post by Glyph Lefkowitz

Post by Oon-Ee Ng
Questions:-
1. ID is so the client can be sure not to concatenate different lists

This... is correct, but doesn't sound like a question. Is it meant to be?

Sorry, the real question is whether an ID is at all required. I'm not
using threads, and the concurrent AMP messages will be sent from a
single server process in a loop. Each client is guaranteed to have
only one server. In this situation, do I even need an ID?

Post by Glyph Lefkowitz
No. You can tell AMP not to bother generating the protocol-level response
by setting the requiresAnswer flag on your Command to False:
<https://twistedmatrix.com/documents/15.4.0/api/twisted.protocols.amp.Command.html#requiresAnswer>

Thanks, right now I just have plenty of return {} everywhere. Does
requiresAnswer=False mean less bandwidth usage (no need to transmit an
empty dict)?

Post by Glyph Lefkowitz

Why are you encoding as _both_ JSON and AMP?
I'd say you should do the length-checking, because you still might end up
with list items that are larger than expected if they're variable size.

I'm sending classes over the wire by json-encoding their __dict__.
Although now that you mentioned it, I started doing that because I
believed AMP to be constrained to ASCII strings (before I found
amp.Unicode()) and my classes will almost always have unicode data.
Looks like I can skip a step then, will test that out.
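
One side effect of the JSON step is worth keeping in mind when estimating sizes: the limit applies to bytes, and with unicode data the byte count can be noticeably larger than the character count. By default json.dumps escapes every non-ASCII character, which is larger still than plain UTF-8. A small illustration using only the standard library, with made-up sample data:

```python
import json

record = {"name": "Zoë", "city": "東京"}  # made-up sample data

# Default json.dumps: non-ASCII characters become \uXXXX escapes.
as_json_ascii = json.dumps(record).encode("ascii")
# With ensure_ascii=False the characters are kept and encoded as UTF-8.
as_json_utf8 = json.dumps(record, ensure_ascii=False).encode("utf-8")

print(len(as_json_ascii), len(as_json_utf8))
```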

I'm trying not to do length-checking simply because I'm lazy (and
because I'm abstracting out all the twisted parts into an SPClient and
SPServer which handle this data conversion transparently to the
working code). In particular the 'best' ways I can think of to do
length-checking are to either:-
1. Binary search for an 'optimal' size just under a limit (50k for
sake of argument)
2. Single check which splits the length by half (300>150>75 etc.)
Both would clutter up the transmission code more than I would like at
this point, and could probably be added in future on transmission side
without any change in recipient side code. So it's on the backburner.
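
For reference, option 2 need not add much clutter. The sketch below is one way it could look: split_by_halving and encoded_size are hypothetical helpers, the items are assumed to be text sent as UTF-8, the 2 bytes of per-item overhead are an assumption about list framing, and 50000 is the limit used above for the sake of argument.

```python
def encoded_size(items, per_item_overhead=2):
    """Estimate the bytes a list of text items would occupy on the wire."""
    return sum(len(item.encode("utf-8")) + per_item_overhead for item in items)


def split_by_halving(items, max_bytes=50000):
    """Halve the batch until every piece fits the byte budget."""
    if encoded_size(items) <= max_bytes:
        return [items] if items else []
    if len(items) == 1:
        raise ValueError("a single item exceeds the page budget")
    middle = len(items) // 2
    return (split_by_halving(items[:middle], max_bytes)
            + split_by_halving(items[middle:], max_bytes))


batch = ["a" * 198] * 300  # 300 items of about 200 bytes each
pieces = split_by_halving(batch)
print([len(piece) for piece in pieces])
```

Since only the sender decides where the page boundaries fall, this can indeed be added later without touching the receiving code, as long as the receiver joins pages by number rather than assuming a fixed page size.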

Post by Glyph Lefkowitz
I would love it if you would help me test out and develop Tubes. If it is a
small homegrown app it might be a good use-case. There are pros and cons:
Tubes has higher test coverage and cleaner code since it was developed much
more recently; but, it still has very limited functionality, badly broken
areas, and no compatibility guarantees, because it's still somewhat
experimental.
However, Tubes is a way of implementing protocols, whereas AMP is an
implementation of a request/response protocol. If you use Tubes, you'll
need to do an implementation of AMP (or something like it) in order to issue
requests and give responses. If I were you, especially since you've already
figured out paging, I would probably just stick with AMP and Twisted as-is.

That's polite =). I'll keep it in mind. If there's a quick link
somewhere on the 'badly broken areas' I'd be interested, because without
knowing that it's hard to justify spending time there when I already
have something working with AMP. I especially like the idea of
streaming, but that'd require writing my code to accept data piecemeal
on the other end, and I can foresee that getting very messy very fast.

Oon-Ee Ng

2015-11-24 02:58:03 UTC

Post by Oon-Ee Ng
Thanks, right now I just have plenty of return {} everywhere. Does
requiresAnswer=False mean less bandwidth usage (no need to transmit an
empty dict)?

Having read the documentation a bit, it appears requiresAnswer=False
is a hint and I'd still have to return the correct response (in this
case an empty dict)

http://twistedmatrix.com/trac/ticket/1985 and in particular the
following comment by yourself:-

Responders for Commands defined not to require a response should
return a valid response nonetheless, because requiresAnswer is an
optimization hint that the client can specify, on any request whose
response it will not process, to optimize network traffic.

Looks like I'll update my clients to specify it then. Was thinking it
should be specified when defining message classes.

Oon-Ee Ng

2015-11-24 03:03:41 UTC

Post by Oon-Ee Ng

Post by Oon-Ee Ng
Thanks, right now I just have plenty of return {} everywhere. Does
requiresAnswer=False mean less bandwidth usage (no need to transmit an
empty dict)?

Having read the documentation a bit, it appears requiresAnswer=False
is a hint and I'd still have to return the correct response (in this
case an empty dict)
http://twistedmatrix.com/trac/ticket/1985 and in particular the
following comment by yourself:-
Responders for Commands defined not to require a response should
return a valid response nonetheless, because requiresAnswer is an
optimization hint that the client can specify, on any request whose
response it will not process, to optimize network traffic.
Looks like I'll update my clients to specify it then. Was thinking it
should be specified when defining message classes.

And it looks like I do have to specify it in defining message classes.

And furthermore, when I do that, callRemote no longer returns a
deferred (which makes sense, really) and instead returns None. One more
check before I add my default errBacks then. Optimised network traffic
sounds positive, at the least (I assume this means one less
transmission since it effectively makes the AMP one-way for the
messages which have this set to False).
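
The extra check mentioned above can live in one small wrapper so that calling code does not need to know which commands are one-way. A minimal sketch, assuming (as observed above) that callRemote returns None for commands with requiresAnswer set to False; attach_default_errback is a hypothetical helper name, not a Twisted API.

```python
def attach_default_errback(result, handler):
    """Attach handler only when the call actually returned a Deferred."""
    if result is None:
        return None  # one-way command: there is nothing to attach to
    return result.addErrback(handler)
```

It would be used by wrapping the return value of callRemote, for example attach_default_errback(proto.callRemote(SomeCommand), log_failure), where proto, SomeCommand and log_failure stand in for the application's own objects.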

Oon-Ee Ng

2015-11-25 09:54:51 UTC

Just realized, requiresAnswer=False means I can't add errBacks, which
means there's no way to handle a receiver-side error. Is this correct?

Glyph Lefkowitz

2015-11-28 14:41:43 UTC

Post by Oon-Ee Ng

Just realized, requiresAnswer=False means I can't add errBacks, which
means there's no way to handle a receiver-side error. Is this correct?

That is correct. But you shouldn't want to handle errors there :-). Don't think of a requiresAnswer=False command as an optimization; instead, think of it as a piece of information being relayed.

For example, consider an HTTP stream. The client sends a request. If the server has an error, it sends an error code. Then, the server sends the entity-body, one chunk at a time.

The client has to process each of those chunks one after another. If the server sends a chunk and the client encounters an error, there's nothing for the server to do; the client has no way to communicate it and it can just disconnect. You would use requiresAnswer=False (as you are already doing in your case) to send a chunk of data like those entity-body chunks, which ought not to be a message that... requires an answer. It's just a way of encoding some data on the connection.

Make sense?

-glyph

Oon-Ee Ng

2015-11-28 20:03:09 UTC

On Sat, Nov 28, 2015 at 10:41 PM, Glyph Lefkowitz

Post by Oon-Ee Ng
And furthermore that when I do that, callRemote no longer returns a
deferred (which makes sense, really) and instead gets a None. One more
check before I add my default errBacks then. Optimised network traffic
sounds positive, at the least (I assume this means one less
transmission since it effectively makes the AMP one-way for the
messages which have this set to False).
Just realized, requiresAnswer=False means I can't add errBacks, which
means there's no way to handle a receiver-side error. Is this correct?
That is correct. But you shouldn't want to handle errors there :-). Don't
think of a requiresAnswer=False command as an optimization; instead, think
of it as a piece of information being relayed.
For example; consider an HTTP stream. The client sends a request. If the
server has an error, it sends an error code. Then, the server sends the
entity-body, one chunk at a time.
The client has to process each of those chunks one after another. If the
server sends a chunk and the client encounters an error, there's nothing for
the server to do; the client has no way to communicate it and it can just
disconnect. You would use requiresAnswer=False (as you are already doing in
your case) to send a chunk of data like those entity-body chunks, which
ought not to be a message that... requires an answer. It's just a way of
encoding some data on the connection.
Make sense?

Yeah, especially in the context of an HTTP stream. In other words
requiresAnswer=False is just a way of labelling a message as
'fire-and-forget'.

I'll have to think a bit more about which of my messages actually need
answers then. For some, I do need an indication of success (or
failure), if only so I can try re-sending. Those won't get the flag
then.