[Twisted-Python] Converting str to bytes in Py3 port

Glyph

2015-09-08 07:44:17 UTC

Post by Adi Roiban
Hi,
While reviewing the latest patch related to porting Twisted to py3 I
saw that many values which were supposed to hold just text were ported
as bytes.

This is fairly ambiguous, since it depends quite closely on what exact values they were.

Post by Adi Roiban
The argument for this conversion was that Twisted is a low level
framework and that other high level framworks like Treq or Klein
should implement the required code so that end user can just use text.

When high-level

Post by Adi Roiban
For example the HTTP response messages are not bytes, even if the RFC
specified that they should only contain text with a single encoding.

By "response messages" do you mean

response headers (which are ASCII-punned bytes, or text with a restricted character set, depending on your philosophical disposition)
response bodies (which are unambiguously bytes, unless you interpret them according to their content-type, in which case they are arbitrary objects, one type of which may be text)
response status codes (which are sort of like headers, except not, except maybe they're integers)
response status text (which are arguably text, but do not feature an encoding, and are therefore in the same grey area as headers)

?

Post by Adi Roiban
I have little experience with twisted.web, so maybe there are many
users of twisted.web which use binary data for response messages or
maybe there is a good use case for putting random bytes in the HTTP
response message.

The main use-case is the Python zen, "refuse the temptation to guess". Treq and Klein can provide good default ways to interpret things, but some applications will need to get underneath those defaults and treat things differently.

Post by Adi Roiban
If for twisted.web there are Treq or Klein to implement the user
friendly interfaces, I don't know what can be used for twisted.conch

Clearly we need to write some new code.

Post by Adi Roiban
In the ticket for porting twisted.conch.ssh.key to py3 [1] the name of
the ssh key algorithms like 'ssh-rsa' or 'ssh-dsa', encryption
algorithm names like AES-128-CBC and ssh key components like p, q, y,
x, n, e are now all bytes.
Do you think that this is ok?

Yes, and here's why: those values are all enumerated constants. They come off the wire as bytes, in no particular encoding (these happen to be ASCII, but is there a guarantee that all future algorithm names will also be?), and then they have to be treated specially. A good, high-level API for this would use twisted.python.constants, and not bother application code with bytes or text. Given that what is implemented is all pretty low-level, bytes make sense.

That said, if there were a good case for ASCII being the declared encoding and having some authoritative sense that that's what we should use, then we should use it. Except that the 2.x types already use bytes, and so we'd have to either go with "native str" (which is a very problematic type, and should be avoided for everything except Python identifiers and docstrings, or things that need to be processed into them.)

Post by Adi Roiban
Why allow or encourage people to use random bytes for fields which
should contain human readable text?

If you encounter an entity that can read the string "ssh-rsa" and truly comprehend it, chances are good you are not dealing with a human.

Post by Adi Roiban
For HTTP response line and response headers I think that all values should be text and encoded in ISO-8859-1.
RFC 4819 [2] only talks about using US-ASCII for all names used in the
SSH public key subsystem.
Why use bytes to represent these names?

You might be correct according to the specification (although it remains to be seen if you're right as far as implementations are concerned); however, why would it be useful to decode these values into bytes? Should we be processing them as text? In what context?

Post by Adi Roiban
RFC 4716 specifies that header tags must be US-ASCII while header
value UTF-8 ... while all IANA names are US-ASCII.... and names in the
As a reviewer I don't know that is the degin/architecture choose by
Twisted and how to review such changes.

It is not necessarily possible to rationalize every decision that has been made thus far as being part of one grand plan. For one thing, many of them have been taken by different people. For another, we learn things as we go along, and so some of the decisions made thus far are now recognized as mistakes. So at this point I think it is best that you just state your preferred design and we discuss the pros and cons of that.

To the extent that there has been a conscious design strategy, it's something like this: every API needs one layer at which it needs to treat most of its data as bytes. (Sadly) few Twisted APIs have nice, discrete higher layers with objects that represent meaningful user actions rather than protocol trivia. So the existing strategy has been around making the lower levels consistently manipulate bytes everywhere, in the hope that we will promote these objects to more high-level types in a different layer later (hence the "well, in klein and treq..." answers).

Post by Adi Roiban
As a developer I prefer to have as much text as possible so that I can
do text manipulation operation on these values and directly include
them in logs or error messages.

OK, so the main utility of treating them as text is being able to concatenate them into diagnostic messages? Can this not be done with bytes equally well? This is the main thing to focus on, I think: concrete useful things you could do with text in these places where bytes are sub-optimal.

Post by Adi Roiban
I assume that all the people involved in writing the RFC had a good
reason to require those fields to be text rather than any bytes.
Thanks for your feedback!

Thank you for prompting this conversation, adi, we do need a better communicated strategy around how we handle text and encodings when protocols stipulate that the things they're dealing with are text.

-glyph