[Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

Post by Amber "Hawkie" Brown
- Python 3 support for Words' IRC support and twisted.protocols.sip among some smaller modules,

I have opened a PR to revert this:

https://github.com/twisted/twisted/pull/593

A full explanation is here:

https://twistedmatrix.com/trac/ticket/6320#comment:16

In summary: a valid IRC message will cause a UnicodeDecodeError within
the event loop that a user cannot handle or avoid, and all length
checks on line sizes are wrong because they occur prior to encoding to
utf-8.
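
Both failure modes can be shown in plain Python. This is a minimal sketch, not Twisted's actual code: the sample message is made up, and the 512-octet figure is RFC 1459's maximum message length (CRLF included).

```python
# A legal IRC line whose trailing parameter is latin-1, not UTF-8.
raw = b"PRIVMSG #chan :d\xe9j\xe0 vu"

try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # Raised deep inside the protocol code, this is what user code cannot catch.
    decode_failed = True

MAX_OCTETS = 512  # RFC 1459 limit on one message, in octets

text = "\u00e9" * 300  # 300 characters, each two octets in UTF-8
fits_as_text = len(text) <= MAX_OCTETS                  # check before encoding
fits_on_wire = len(text.encode("utf-8")) <= MAX_OCTETS  # check after encoding
```

The check on characters passes while the check on octets fails, which is why a length check made before encoding lets over-long lines through.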

Glyph Lefkowitz

2016-11-17 07:22:49 UTC

Post by Amber "Hawkie" Brown
- Python 3 support for Words' IRC support and twisted.protocols.sip among some smaller modules,

https://github.com/twisted/twisted/pull/593
https://twistedmatrix.com/trac/ticket/6320#comment:16
In summary: a valid IRC message will cause a UnicodeDecodeError within
the event loop that a user cannot handle or avoid, and all length
checks on line sizes are wrong because they occur prior to encoding to
utf-8.

Reverts should be commits that go straight to trunk and reopen tickets, per the current process.

However; is it really a regression to have py3 support for Words that just doesn't support other encodings yet? It strikes me that this is just a bug, and that we should just fall back from UTF-8 to latin-1 in this scenario. But adding that fallback is a small additional fix (perhaps one that should be slated for 16.6.0 if you want to make it).
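
A minimal sketch of that fallback, assuming a free-standing helper (the name `decode_with_fallback` is hypothetical and is not part of Twisted):

```python
def decode_with_fallback(raw, primary="utf-8", fallback="latin-1"):
    """Decode raw bytes as `primary`, falling back when that fails."""
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this branch cannot raise.
        return raw.decode(fallback)


utf8_text = decode_with_fallback("h\u00e9llo".encode("utf-8"))
latin1_text = decode_with_fallback("h\u00e9llo".encode("latin-1"))
# Text in some other 8-bit encoding comes out as mojibake, not as an error.
cyrillic = decode_with_fallback("\u043f\u0440\u0438\u0432\u0435\u0442".encode("cp1251"))
```

The last case shows the limit of the approach: the call never raises, but text that is neither UTF-8 nor latin-1 is decoded to the wrong characters.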

-glyph

Amber "Hawkie" Brown

2016-11-17 07:50:02 UTC

Post by Amber "Hawkie" Brown
- Python 3 support for Words' IRC support and twisted.protocols.sip among some smaller modules,

Reverts should be commits that go straight to trunk and reopen tickets, per the current process.
However; is it really a regression to have py3 support for Words that just doesn't support other encodings yet? It strikes me that this is just a bug, and that we should just fall back from UTF-8 to latin-1 in this scenario. But adding that fallback is a small additional fix (perhaps one that should be slated for 16.6.0 if you want to make it).
-glyph

Yeah, this is just a plain old bug. Bugs in new features (where a module being on Python 3 counts as one to me) aren't regressions; we sometimes fix them in pre if there's time/other stuff is getting fixed, but this one will just be a known bug until 16.7 in December.

- Amber

Amber "Hawkie" Brown

2016-11-17 07:53:59 UTC

Post by Amber "Hawkie" Brown
- Python 3 support for Words' IRC support and twisted.protocols.sip among some smaller modules,

https://github.com/twisted/twisted/pull/593
https://twistedmatrix.com/trac/ticket/6320#comment:16
In summary: a valid IRC message will cause a UnicodeDecodeError within
the event loop that a user cannot handle or avoid, and all length
checks on line sizes are wrong because they occur prior to encoding to
utf-8.

Reverts should be commits that go straight to trunk and reopen tickets, per the current process.
However; is it really a regression to have py3 support for Words that just doesn't support other encodings yet? It strikes me that this is just a bug, and that we should just fall back from UTF-8 to latin-1 in this scenario. But adding that fallback is a small additional fix (perhaps one that should be slated for 16.6.0 if you want to make it).
-glyph

(or a 16.6.1)

Mark Williams

2016-11-17 14:43:22 UTC

Post by Glyph Lefkowitz
However; is it really a regression to have py3 support for Words that just doesn't support other encodings yet? It strikes me that this is just a bug, and that we should just fall back from UTF-8 to latin-1 in this scenario. But adding that fallback is a small additional fix (perhaps one that should be slated for 16.6.0 if you want to make it).

Falling back to latin-1 will address the most obvious issue exposed by
the client in the re-opened ticket. It will not fix the general issue.

Note that my sample was heavily biased towards European servers.
Other IRC servers in other regions might prefer a different 8-bit
encoding, like windows-1251 or Big5. And often a single server will
see a long tail (or at least a tail) of different 8-bit encodings.
Listing all channels on a server, as the example script does, cannot
be done with an implementation that decodes input as text prior to
parsing it. It's even possible to use chardet to detect encodings.

IRC's encoding situation mirrors file systems' one on POSIX. A given
path's components can be in multiple encodings. I believe at least
part of the reason FilePath's paths are bytes, even when
surrogateescape exists, is that Unicode paths on POSIX systems would
make FilePath unusable for perfectly valid use cases. We can pretend
that IRC has a defined encoding, but doing so will make it unusable for
perfectly valid use cases.


Glyph Lefkowitz

2016-11-17 19:00:13 UTC

Falling back to latin-1 will address the most obvious issue exposed by
the client in the re-opened ticket. It will not fix the general issue.

This doesn't appear to be an answer to the "is it a regression" question though ;-). I'm still curious what you think there.

The _general_ issue is unfixable, except to use chardet upon encoding errors. As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.

More importantly, IRC doesn't specify an encoding and it is also responsible for transmitting textual data intended to be input and consumed by humans. If you can't decode it, faithfully replicating the on-the-wire encoding is of limited utility. You can't write any code to process the data.

Post by Mark Williams
Note that my sample was heavily biased towards European servers.
Other IRC servers in other regions might prefer a different 8-bit
encoding, like windows-1251 or Big5. And often a single server will
see a long tail (or at least a tail) of different 8-bit encodings.
Listing all channels on a server, as the example script does, cannot
be done with an implementation that decodes input as text prior to
parsing it. It's even possible to use chardet to detect encodings.

If chardet is installed, can it be specified as an encoding itself? Like, b"garbage garbage".decode("chardet")? This would make it possible to use without binding to the library; you just specify an encoding. (The library is LGPL2.1 which makes it a problematic dependency for Twisted, even optionally.)
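
Python's standard `codecs.register` hook makes this kind of "detector as an encoding name" possible. The sketch below is an assumption-laden stand-in: the codec name `irc_guess` and the candidate list are hypothetical, and a real package would call a detection library where this one simply tries encodings in order.

```python
import codecs

# Hypothetical candidate list standing in for real encoding detection.
CANDIDATES = ("utf-8", "cp1251")


def _guess_decode(data, errors="strict"):
    data = bytes(data)  # the codec machinery may hand us a memoryview
    for encoding in CANDIDATES:
        try:
            return data.decode(encoding), len(data)
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte, so decoding always succeeds.
    return data.decode("latin-1"), len(data)


def _guess_encode(text, errors="strict"):
    # Nothing to guess when encoding; this sketch always writes UTF-8.
    return text.encode("utf-8", errors), len(text)


def _search(name):
    if name == "irc_guess":
        return codecs.CodecInfo(_guess_encode, _guess_decode, name="irc_guess")
    return None


codecs.register(_search)

guessed = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("cp1251").decode("irc_guess")
```

Callers then only name an encoding, so code that uses it never imports the detection library directly.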

Post by Mark Williams
IRC's encoding situation mirrors file systems' one on POSIX. A given
path's components can be in multiple encodings. I believe at least
part of the reason FilePath's paths are bytes, even when
surrogateescape exists, is that Unicode paths on POSIX systems would
make FilePath unusable for perfectly valid use cases. We can pretend
that IRC has a defined encoding, but doing so will make it unusable for
perfectly valid use cases.

Here we go :-).

POSIX has an internally inconsistent model of how encodings work; they cannot possibly function correctly.

First off, let me put to rest the lie that paths are "really" bytes. Paths are text. They must be text because they have to transit through text-processing systems, such as windowing systems and terminal programs. Users must be able to visually identify and select them, as text.

This is significant because certain operations on paths-as-bytes will inevitably fail. You can't type an invalidly-encoded pathname in your shell. If two paths differ by an incorrectly-encoded character you won't be able to visually distinguish between them without inspecting their contents. This is why OS X forces all paths to be UTF-8, and why paths are "really" unicode (UCS-2, precisely) on Windows.

There's POSIX metadata which allows you to select an encoding: locale. But, locale is per-process state, and, due to the fact that you can have multiple filesystems mounted simultaneously, it's impossible for this metadata to fully describe the state of any arbitrary path. The standard metadata is insufficient. This is why UI toolkits like GTK+ have adopted the policy of "ignore the locale, paths are UTF-8, deal with it 🕶". As far back as GTK2, non-utf-8 path selection has been deprecated: <https://developer.gnome.org/gtk2/stable/GtkFileSelection.html#gtk-file-selection-set-filename>.

While a mis-encoded path is a failure, there are ways to treat paths as a data structure to allow for only partial failure. They're a data structure because they must be in an encoding with no NULLs, which encode SOLIDUS as the octet 0x2F, and so you can fail on each individual path component; if you're lucky you don't need to present all the components in the path to manipulate it.

We don't do this in Twisted right now (as I was somewhat disappointed to discover while writing this), but we should, and more importantly we could; FilePath(b"\xff").child("valid").asTextMode().basename() could return u"valid" rather than returning an encoding error.
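
The per-component idea can be sketched without FilePath at all. The helper below is hypothetical and only illustrates the principle; it is not how FilePath behaves today.

```python
def decode_components(path, encoding="utf-8"):
    """Decode each component of a POSIX path on its own.

    Returns a list of (raw_bytes, text_or_None) pairs, so one bad
    component does not poison the rest of the path.
    """
    result = []
    for component in path.split(b"/"):
        try:
            text = component.decode(encoding)
        except UnicodeDecodeError:
            text = None  # this component alone is undecodable
        result.append((component, text))
    return result


parts = decode_components(b"\xff/valid")
basename = parts[-1][1]
```

Here the first component fails to decode, but the basename is still available as text.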

To bring all this back to IRC though:

Mis-encoded IRC messages are not data structures; they're just strings. There's no opportunity for partial recovery beyond chardet and mojibake. In most cases, partial recovery requires configuration. Per-channel encodings, for example, or per-user, which have to be agreed upon out of band, in ways that IRC does not expose as metadata.

Given this situation, the only reasonable way forward as a community is to tell users that using anything other than UTF-8 is a misconfiguration and we need to be getting all those out-of-band agreements to switch to it.

-glyph

Mark Williams

2016-11-18 08:13:13 UTC

Post by Glyph Lefkowitz
This doesn't appear to be an answer to the "is it a regression" question though ;-). I'm still curious what you think there.

It's not a shipped feature so it can't be a regression. But if the
feature doesn't work it shouldn't be shipped.

I did consult the policy manual before opening the revert PR. Here's what
seemed most relevant:

https://twistedmatrix.com/trac/wiki/ReviewProcess#Revertingachange

This, and the other revert documents, focus on test regressions. But
I opened the PR because of the above link's mention of "undesirable."
Is there a better resource that explains when a revert is appropriate?

Post by Glyph Lefkowitz
The _general_ issue is unfixable, except to use chardet upon encoding errors. As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.

IRCv3 (http://ircv3.net/) is attempting to mandate utf-8 for certain
protocol elements (usernames and metadata). But it needs to be
backwards compatible, so it can't mandate it for all messages. And it
is not IRC as specified by RFC1459. So no, no defined encoding.

Post by Glyph Lefkowitz
More importantly, IRC doesn't specify an encoding and it is also responsible for transmitting textual data intended to be input and consumed by humans. If you can't decode it, faithfully replicating the on-the-wire encoding is of limited utility. You can't write any code to process the data.

I can write code that uses the encoding that makes sense for my use
case. I can't if we mandate utf-8, even when I receive perfectly
valid IRC messages.

Post by Glyph Lefkowitz
If chardet is installed, can it be specified as an encoding itself? Like, b"garbage garbage".decode("chardet")? This would make it possible to use without binding to the library; you just specify an encoding. (The library is LGPL2.1 which makes it a problematic dependency for Twisted, even optionally.)

It does not, but if that makes it more generally usable you've given a
great idea for my next PyPI package :)

Post by Glyph Lefkowitz
POSIX has an internally inconsistent model of how encodings work; they cannot possibly function correctly.
First off, let me put to rest the lie that paths are "really" bytes. Paths are text. They must be text because they have to transit through text-processing systems, such as windowing systems and terminal programs. Users must be able to visually identify and select them, as text.
This is significant because certain operations on paths-as-bytes will inevitably fail. You can't type an invalidly-encoded pathname in your shell. If two paths differ by an incorrectly-encoded character you won't be able to visually distinguish between them without inspecting their contents. This is why OS X forces all paths to be UTF-8, and why paths are "really" unicode (UCS-2, precisely) on Windows.
There's POSIX metadata which allows you to select an encoding: locale. But, locale is per-process state, and, due to the fact that you can have multiple filesystems mounted simultaneously, it's impossible for this metadata to fully describe the state of any arbitrary path. The standard metadata is insufficient. This is why UI toolkits like GTK+ have adopted the policy of "ignore the locale, paths are UTF-8, deal with it 🕶". As far back as GTK2, non-utf-8 path selection has been deprecated: <https://developer.gnome.org/gtk2/stable/GtkFileSelection.html#gtk-file-selection-set-filename>.

When I received Arabic PDFs on a FAT16 USB drive with filenames in
CP1256, I had to switch mlterm to that particular code page to read
the directory listings so I could use convmv to convert them to UTF-8.
I'll note that this was impossible to do with a GTK-based tool.

Opinionated software is fine when it operates at the point of user
interpretation.

mlterm had to decode the stuff as unicode so X could display the
graphemes. But if Linux's FAT16 implementation decided that we should
all quit whining and use UTF-8, even though no other FAT16
implementation requires this, it wouldn't have mattered what mlterm
could or couldn't do and I would have lost those files. And it would
have been incredibly confounding to me, because everything would have
agreed that I had a FAT16 partition, but only Linux would have
mysteriously failed to read it.

Similarly, Twisted provides an IRC *library*. It's a Python API, not
irssi or Textual. The ultimate consumer of what passes through it may
be a human, but the next consumer might not be. What if I want to
write a bot that bridges two IRC networks? What if I want to
dump the raw IRC data to a file so I can train a tensorflow version of
chardet? There's nothing in the IRC specification that prevents me
from doing this, but there will be something in Twisted's
implementation that does.

Post by Glyph Lefkowitz
While a mis-encoded path is a failure, there are ways to treat paths as a data structure to allow for only partial failure. They're a data structure because they must be in an encoding with no NULLs, which encode SOLIDUS as the octet 0x2F, and so you can fail on each individual path component; if you're lucky you don't need to present all the components in the path to manipulate it.
We don't do this in Twisted right now (as I was somewhat disappointed to discover while writing this), but we should, and more importantly we could; FilePath(b"\xff").child("valid").asTextMode().basename() could return u"valid" rather than returning an encoding error.

https://twistedmatrix.com/trac/ticket/8908

Post by Glyph Lefkowitz
Mis-encoded IRC messages are not data structures; they're just strings. There's no opportunity for partial recovery beyond chardet and mojibake. In most cases, partial recovery requires configuration. Per-channel encodings, for example, or per-user, which have to be agreed upon out of band, in ways that IRC does not expose as metadata.

It would also have to be per server, since any two channels might
disagree on the encoding of their topics. And the welcome message
might be in its own encoding. And, and, and...

But none of this is actually true. What seems to be true is that
non-utf-8 encodings are rarely if ever seen on Freenode, and sometimes
to regularly seen on many other IRC servers. These encodings are
certainly used.

Post by Glyph Lefkowitz
Given this situation, the only reasonable way forward as a community is to tell users that using anything other than UTF-8 is a misconfiguration and we need to be getting all those out-of-band agreements to switch to it.

Doing this ensures Twisted's IRC implementation will be unable to
communicate with a significant minority of users, and will be a less
useful programming tool.

It makes more sense to have an implementation that parses protocol
elements as bytes and provides a bytes API. It's fine to also provide
a decoded text API, but not to the exclusion of bytes.

-Mark

Glyph Lefkowitz

2016-11-19 01:36:16 UTC

Post by Glyph Lefkowitz
This doesn't appear to be an answer to the "is it a regression" question though ;-). I'm still curious what you think there.

It's not a shipped feature so it can't be a regression. But if the
feature doesn't work it shouldn't be shipped.

"doesn't work" is a pretty black-and-white assessment. Are you anticipating a problem with the way the interface is specified that it can't be easily changed?

I should say up front here that I think I was being too emphatic in my support for UTF-8. We absolutely must support the ability to decode other encodings. I don't think that means we need support for access to raw bytes.

Post by Mark Williams
I did consult the policy manual before opening the revert PR. Here's what seemed most relevant:
https://twistedmatrix.com/trac/wiki/ReviewProcess#Revertingachange
This, and the other revert documents, focus on test regressions. But
I opened the PR because of the above link's mention of "undesirable."
Is there a better resource that explains when a revert is appropriate?

Test regressions are listed because they're unambiguously cause for a revert; "undesirable" is intentionally vague because we might decide to revert a thing for no reason. I guess opening a PR for a discussion like this is reasonable.

This could be considered an incompatible interface change; I'm honestly not sure about the exact type signatures of various methods to say whether it is or not.

Post by Glyph Lefkowitz
The _general_ issue is unfixable, except to use chardet upon encoding errors. As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.

Not only "no defined encoding" but also no mechanism like HTTP headers to say what the encoding is.

I can write code that uses the encoding that makes sense for my use
case. I can't if we mandate utf-8, even when I receive perfectly
valid IRC messages.

Sorry, I haven't been separating out my lines of reasoning clearly enough here.

My points are, separately:

1. IRC is text. It's nonsensical to process it as bytes, because you can't process it as bytes. This is separate from the question of "what encoding is IRC".
2. UTF-8 is good. There should be gradual social pressure to use UTF-8 everywhere (I'm a fan of http://utf8everywhere.org). This is especially true in protocols like IRC and filenames where there's no mechanism to specify an encoding so that it can be correctly decoded. Therefore:
   2.1. an initial release which features UTF-8 only is fine; therefore there's no need to do a revert.
   2.2. defaulting to UTF-8 is reasonable for the foreseeable future; users should only change this if they know that they want something unusual.
3. IRC is an incompatible and broken wasteland; thanks to your quantitative research we know exactly how broken. Therefore:
   3.1. "support alternate encodings" is a valuable feature. Supporting point 2.1, this feature can be added on at any later point, making a revert of the present implementation unnecessary.
   3.2. We can, and should, just go ahead and add support for alternate (per-server, per-channel, per-user) default and fallback encodings.
   3.3. We should always have a fallback encoding, since blowing up on "invalid" data on a protocol where there's no standard to say what is or isn't valid doesn't seem very helpful.

It does not, but if that makes it more generally usable you've given a
great idea for my next PyPI package :)

Let me know :).

Post by Glyph Lefkowitz
POSIX has an internally inconsistent model of how encodings work; they cannot possibly function correctly.
First off, let me put to rest the lie that paths are "really" bytes. Paths are text. They must be text because they have to transit through text-processing systems, such as windowing systems and terminal programs. Users must be able to visually identify and select them, as text.
This is significant because certain operations on paths-as-bytes will inevitably fail. You can't type an invalidly-encoded pathname in your shell. If two paths differ by an incorrectly-encoded character you won't be able to visually distinguish between them without inspecting their contents. This is why OS X forces all paths to be UTF-8, and why paths are "really" unicode (UCS-2, precisely) on Windows.
There's POSIX metadata which allows you to select an encoding: locale. But, locale is per-process state, and, due to the fact that you can have multiple filesystems mounted simultaneously, it's impossible for this metadata to fully describe the state of any arbitrary path. The standard metadata is insufficient. This is why UI toolkits like GTK+ have adopted the policy of "ignore the locale, paths are UTF-8, deal with it 🕶". As far back as GTK2, non-utf-8 path selection has been deprecated: <https://developer.gnome.org/gtk2/stable/GtkFileSelection.html#gtk-file-selection-set-filename>.

There is no question that your life has been hard, and that a wide array of people have made bad decisions that contribute to your difficulties. :-)

Post by Mark Williams
I'll note that this was impossible to do with a GTK-based tool.
Opinionated software is fine when it operates at the point of user
interpretation.
mlterm had to decode the stuff as unicode so X could display the
graphemes. But if Linux's FAT16 implementation decided that we should
all quit whining and use UTF-8, even though no other FAT16
implementation requires this, it wouldn't have mattered what mlterm
could or couldn't do and I would have lost those files. And it would
have been incredibly confounding to me, because everything would have
agreed that I had a FAT16 partition, but only Linux would have
mysteriously failed to read it.

But, Linux's FAT16 driver has decided that.

The correct way to solve your problem with current Linux (I don't know if this was possible at the time) would be to address it with mount, not special user-space software. Specifically, I think it would be something like:

mount -t fat -o fat=16,iocharset=utf-8,codepage=1256 /dev/disk/by-label/arabic.msdos /media/arabic.msdos

Now all your GTK+ software works, too, because you're not trying to reconcile your legacy format support at the application level.

In other words, the thing those pathnames are encoding is text; the way they're being encoded is codepage 1256 on the platter. However, the interface between the OS and the application can still be "text" (i.e. UTF-8) without breaking the on-disk "bytes" (cp1256).
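
The same boundary can be sketched in Python. This only illustrates the transcoding step, assuming a made-up Arabic filename; it is not a model of what the kernel driver does internally.

```python
def transcode(name, source="cp1256", target="utf-8"):
    """Re-encode a filename from the on-disk code page to the application's encoding."""
    return name.decode(source).encode(target)


filename = "\u0645\u0644\u0641.pdf"    # hypothetical Arabic filename
on_disk = filename.encode("cp1256")    # bytes as stored on the FAT volume
seen_by_app = transcode(on_disk)       # bytes as handed to UTF-8-only software
```

The text is the same on both sides; only the byte representation changes, and it changes at one well-defined layer.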

Post by Mark Williams
Similarly, Twisted provides an IRC *library*. It's a Python API, not
irssi or Textual. The ultimate consumer of what passes through it may
be a human, but the next consumer might not be. What if I want to
write a bot that bridges two IRC networks? What if I want to
dump the raw IRC data to a file so I can train a tensorflow version of
chardet? There's nothing in the IRC specification that prevents me
from doing this, but there will be something in Twisted's
implementation that does.

In the current release, yes. But in a future release: no, you can't just bridge arbitrary bytes between two networks and expect them to work. Those networks (or channels, or users) might have different implicit encoding rules; which, by default and only by default, should be utf-8. In a multi-encoding world, you may need to transcode between them to properly bridge; this is a consequence of the fact that eventually you're presenting this data as text to human eyeballs.

https://twistedmatrix.com/trac/ticket/8908

Thanks for filing that!

It would also have to be per server, since any two channels might
disagree on the encoding of their topics. And the welcome message
might be in its own encoding. And, and, and...

Right. Per-server default, and then per-channel and per-(privmsg)-user is about as precise as you can get though. In principle, it's possible that different segments of the same topic could be in different encodings, different words in the same sentence! In practice though that just means somebody screwed up and the topic is now unreadable garbage in all clients.

Post by Mark Williams
But none of this is actually true. What seems to be true is that
non-utf-8 encodings are rarely if ever seen on Freenode, and sometimes
to regularly seen on many other IRC servers. These encodings are
certainly used.

I can't really parse you here - are you saying that each network more or less sticks to one encoding?

Doing this ensures Twisted's IRC implementation will be unable to
communicate with a significant minority of users, and will be a less
useful programming tool.

Sorry, my statement you were responding to here was way too strong. What I meant to say here is that long term there is no way to get a "right" answer in this ecosystem, so "UTF-8 is the only correct answer" is the only direction we can push in to actually make things work reasonably by default an increasing proportion of the time. For the foreseeable future, adding the ability to cope with other encodings (including a fallback to latin-1 so that you can at least do demojibakefication manually after copy/pasting) is something a general-purpose IRC library absolutely needs. This is why every client has an "encoding" selection menu, too.

Post by Mark Williams
It makes more sense to have an implementation that parses protocol
elements as bytes and provides a bytes API. It's fine to also provide
a decoded text API, but not to the exclusion of bytes.

This is the point where I think we diverge. I don't think adding a bytes API actually adds any value. Trying to process the contents of IRC as bytes in any way leads to inevitable failures ("line" truncation midway through a UTF-8 multi-byte sequence for example).

So, the thing IRC is transmitting is text. The way it's transmitting it is poorly specified and will need manual configurability hooks to specify encoding information, probably forever, and perhaps even to guess it (although "encoding=chardet" would be nice). I agree that just saying "UTF-8 or GTFO" is not a sustainable approach at all. "UTF-8 or have a bad time with this fiddly customization API and config file" is fine, because anyone wanting something else is probably already having a bad time.

If you are engaging in a real abuse of the IRC protocol and you're treating it as an 8-bit clean stream to send some escaped binary data through (like a video stream, something like that), well, that's what the 'charmap' alias of 'latin-1' is for :-).
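
The property being relied on is that latin-1 maps each of the 256 byte values to exactly one code point and back, as this short check illustrates:

```python
every_byte = bytes(range(256))

as_text = every_byte.decode("latin-1")     # never raises
round_tripped = as_text.encode("latin-1")  # recovers the original bytes
```

Any byte string therefore survives a decode and re-encode through latin-1 unchanged, even when the resulting text is meaningless to a human.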

So... have I sold you?

-glyph

Mark Williams

2016-11-21 00:35:29 UTC

Post by Glyph Lefkowitz
"doesn't work" is a pretty black-and-white assessment. Are you anticipating a problem with the way the interface is specified that it can't be easily changed?

Yes. Here's the lede: IRCClient should deal in bytes and we should
introduce a ProtocolWrapper-like thing that encodes and decodes
command prefixes and parameters. It should implement an interface,
and we can start with an implementation that only knows about UTF-8.
The obvious advantage of this is that you can more easily write
IRCClients that work on both Python 2 and 3. I'll attempt to explain
others in the rest of this email.
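
A rough sketch of that split, assuming well-formed input lines. The names `parse_line` and `Utf8Decoder` are hypothetical and do not exist in Twisted; the real interface would need more thought than this.

```python
def parse_line(line):
    """Split one raw IRC line (CRLF already removed) into bytes-only parts."""
    prefix = None
    if line.startswith(b":"):
        prefix, _, line = line[1:].partition(b" ")
    head, sep, trailing = line.partition(b" :")
    parts = head.split()
    if sep:
        parts.append(trailing)  # the trailing parameter may contain spaces
    return prefix, parts[0], parts[1:]


class Utf8Decoder:
    """Hypothetical decoding layer that sits on top of the bytes parser."""

    encoding = "utf-8"

    def decode(self, message):
        prefix, command, params = message
        return (
            None if prefix is None else prefix.decode(self.encoding),
            command.decode("ascii"),
            [param.decode(self.encoding) for param in params],
        )


raw = b":nick!user@host PRIVMSG #chan :caf\xe9"
message = parse_line(raw)
try:
    decoded = Utf8Decoder().decode(message)
except UnicodeDecodeError:
    # The caller chooses the recovery strategy; the bytes are still in `message`.
    decoded = None
```

Because parsing never decodes, a failure in the decoding layer surfaces in code the user controls, and a different decoder can be swapped in without touching the parser.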

Post by Glyph Lefkowitz
I should say up front here that I think I was being too emphatic in my support for UTF-8.

Phew!

Post by Glyph Lefkowitz
Test regressions are listed because they're unambiguously cause for a revert; "undesirable" is intentionally vague because we might decide to revert a thing for no reason. I guess opening a PR for a discussion like this is reasonable.

Good to know!

Post by Glyph Lefkowitz
This could be considered an incompatible interface change; I'm honestly not sure about the exact type signatures of various methods to say whether it is or not.

I'm also not entirely sure of the consequences of this interface
change. I think it deserves more thought before it becomes an API
that we have to support. This is the primary reason I opened the
revert PR.

I'm more precisely worried about the fact that the implementation
raises a decoding exception that cannot be handled in user code when
it receives non-UTF-8 messages, and the fact that the line length
checks occur prior to encoding, ensuring mid-codepoint truncation.
These issues also contributed to my revert.
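
To illustrate the truncation problem, here is a minimal sketch of cutting on a code point boundary after encoding. The helper `truncate_utf8` is hypothetical, and the tiny limit is chosen only to keep the example readable.

```python
def truncate_utf8(text, limit):
    """Encode text as UTF-8 and cut it to at most `limit` octets
    without splitting a code point."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return data
    # errors="ignore" drops the incomplete sequence left at the cut.
    return data[:limit].decode("utf-8", errors="ignore").encode("utf-8")


text = "\u00e9\u00e9\u00e9"            # 3 characters, 6 octets in UTF-8
naive = text.encode("utf-8")[:5]       # cut in the middle of the last character
safe = truncate_utf8(text, 5)          # cut after the second character

try:
    naive.decode("utf-8")
    naive_is_valid = True
except UnicodeDecodeError:
    naive_is_valid = False
```

A check on `len(text)` would report 3 and never truncate at all, while a blind cut on octets produces a line the receiver cannot decode.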

Post by Glyph Lefkowitz
IRC is text. It's nonsensical to process it as bytes, because you can't process it as bytes. This is separate from the question of "what encoding is IRC".

It's nonsensical that it be finally presented to a human as raw bytes.
I'm advocating for the decision to be made as late as possible. That
doesn't mean we can't provide an easy-to-use recoding client that we
encourage people to turn to first.

Post by Glyph Lefkowitz
an initial release which features UTF-8 only is fine; therefore there's no need to do a revert.
defaulting to UTF-8 is reasonable for the foreseeable future; users should only change this if they know that they want something unusual.
"support alternate encodings" is a valuable feature. Supporting point 2.1, this feature can be added on at any later point, making a revert of the present implementation unnecessary.
We can, and should, just go ahead and add support for alternate (per-server, per-channel, per-user) default and fallback encodings.
We should always have a fallback encoding, since blowing up on "invalid" data on a protocol where there's no standard to say what is or isn't valid doesn't seem very helpful.

I appreciate the consistency of this, and agree the documented
preference should be a client implementation that assumes UTF-8. But
we can't have *a* fallback encoding. My encoding detector program
indicates that latin-1 is the second most popular encoding for
European IRC servers, but Russian servers I sampled (not in
netsplit.de's top 10) used a variety of Cyrillic encodings.

I also want to enable arbitrary recovery strategies for bad encodings.
For instance, in the case that an IRC client or server truncates a
code point at a line boundary, it might be the right idea to binary
search until the invalid byte sequence is found, and then exclude it.
It might be the right idea to buffer the message for a time in the
hopes that the codepoint got split over two lines.
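
The buffering strategy can be sketched with the standard library's incremental decoder. Whether joining two IRC lines this way is appropriate is a judgment call for the application, which is the point of leaving it to user code; the sample bytes are made up.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# The two-octet sequence for U+00E9 is split across two chunks.
first = decoder.decode(b"caf\xc3")        # the lone lead byte is held back
second = decoder.decode(b"\xa9 au lait")  # completed once the next chunk arrives
```

This is only possible when the raw bytes are still available at the point where the recovery decision is made.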

And what if somebody wants to run another encoding survey?

I don't expect most users to do any of that, but *I* certainly want to
without having to copy and paste a bunch of code.

Post by Mark Williams
When I received Arabic PDFs on a FAT16 USB drive with filenames in
CP1256, I had to switch mlterm to that particular code page to read
the directory listings so I could use convmv to convert them to UTF-8.

There is no question that your life has been hard, and that a wide array of people have made bad decisions that contribute to your difficulties. :-)

My real point was that dealing with bad encodings is not theoretical.
Nobody knew the encoding, by the way; they just knew the USB drive
worked for some of them and not others, and were resorting to printing
things out or taking screen shots.

That's the situation opinionated software with monolithic abstractions
creates. People *will* find workarounds that are terrible for a bunch
of reasons. I can vouch for the utility of tools that decide on
encodings as late as possible.

Note that I'm not asking that we be everything to all people, but
rather that we allow people the option of dealing with the IRC
encoding disaster the way they see fit.

Post by Glyph Lefkowitz
But, Linux's FAT16 driver has decided that.
mount -t fat -o fat=16,iocharset=utf-8,codepage=1256 /dev/disk/by-label/arabic.msdos /media/arabic.msdos
Now all your GTK+ software works, too, because you're not trying to reconcile your legacy format support at the application level.

I don't remember either. But, now the driver *allows* me to do that
without requiring it, and also allows me to mount the file system so
that the paths are exposed as bytes. Since nobody knew the encoding,
that was essential to letting me use mlterm to determine it. Nowadays
I'd probably use chardet but would still need the raw bytes.

And as far as I know code point sequence truncation can also occur on
FAT16/32 partitions. In the event of such truncation the automatic
decoding would only prevent me from mounting the partition. I'm
thankful that the implementation allows me to choose a recovery
strategy in a very real edge case. If it didn't, I'd have to look up
the file system's on disk format and reimplement 99% of a FAT16 driver
to get at the data.

So it's the case that raw bytes weren't useful to me when I tried to
actually read the paths, but they were super useful to me when a
perfectly reasonable assumption was wrong. And when no encoding is
mandated, perfectly reasonable assumptions do fail and fail often.

Post by Mark Williams
What if I want to
write a bot that bridges two IRC networks?

It's true that if one channel is latin-1 and the other is MacCyrillic
that a text-only IRCClient implementation could handle this just by
allowing the user to choose an encoding. The recoding API I'm talking
about wouldn't give you anything. But it would help with truncation
issues and channels' topics using different encodings.

I can't really parse you here - are you saying that each network more or less sticks to one encoding?

Not quite - I meant that in my survey, I saw no latin-1 on Freenode,
but that may be because they decided I was abusing the network early
on in my attempt to list and join channels.

But on other networks I saw a lot of different encodings, used across
different channels, so that the channel list contained topics encoded
in many different 8-bit encodings.

Post by Glyph Lefkowitz
Sorry, my statement you were responding to here was way too strong. What I meant to say here is that long term there is no way to get a "right" answer in this ecosystem, so "UTF-8 is the only correct answer" is the only direction we can push in to actually make things work reasonably by default an increasing proportion of the time. For the foreseeable future, adding the ability to cope with other encodings (encoding a fallback to latin-1 so that you can at least do demojibakefication manually after copy/pasting) is something a general-purpose IRC library absolutely needs. This is why every client has an "encoding" selection menu, too.

For what it's worth, I want to make it easy to use UTF-8. I just
don't want to make it hard to use an encoding that's *not* UTF-8.

This is precisely where we disagree. As I described above, I can
think of a couple ways to handle mid-codepoint truncation. A
Twisted-based IRC client should have the option to implement its own.
The end result would still be text (or at least an informative log
message.)

I think the best way to handle this is to have a bytes-only IRC client
that can then be wrapped with something that decodes prefixes and
parameters. We can provide a UTF-8 recoder that people are encouraged
to use, and an interface that allows implementers to choose their own
encoding strategy.

I don't think it can be a ProtocolWrapper, because it'll need to know
about the particulars of IRCClient. That means I don't have a clear
idea of the interface yet. Until I do, I'd prefer we ship something
that implements the RFC and allows people to handle encoding the
way they see fit. I will say I'm happy to take a stab at a recoder.
But it can't be written with IRCClient as it stands now and would
certainly be done in a separate PR.

Shipping what we have now will mean we're putting bugs out there (see
the line length issues called out in the ticket) and an interface I
think we haven't thought through, but that certainly limits what IRC
protocol messages you can receive.

(Also - I don't think any multibyte UTF-8 sequence can contain a byte
<= 127, so that it can't be truncated by ASCII-only code. This of
course isn't true for fixed-width encodings. '\n\n' is a totally
valid UTF-16 sequence.)
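
The parenthetical above can be checked directly. The following is a small sketch, with a hypothetical helper name, that confirms no multibyte UTF-8 sequence in a sample string contains a byte below 0x80, and that the two bytes `\n\n` are a single UTF-16 code unit.

```python
def multibyte_bytes_are_high(text):
    """True if every byte of each multibyte UTF-8 sequence is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        # bytearray() yields integers on both Python 2 and Python 3.
        if len(encoded) > 1 and any(b < 0x80 for b in bytearray(encoded)):
            return False
    return True


# Latin, Cyrillic and a three-byte character (the euro sign).
sample = u"t\u00ebst \u043f\u0440\u0438\u0432\u0435\u0442 \u20ac"
utf8_ok = multibyte_bytes_are_high(sample)

# U+0A0A encodes to the bytes 0x0A 0x0A in UTF-16, i.e. b"\n\n".
utf16_newlines = u"\u0a0a".encode("utf-16-le")
```

So splitting a UTF-8 stream on ASCII delimiters such as CR and LF never lands inside a character, while the same splitting applied to UTF-16 data can.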

Post by Glyph Lefkowitz
So, the thing IRC is transmitting is text. The way it's transmitting it is poorly specified and will need manual configurability hooks to specify encoding information, probably forever, and perhaps even to guess it (although "encoding=chardet" would be nice). I agree that just saying "UTF-8 or GTFO" is not a sustainable approach at all. "UTF-8 or have a bad time with this fiddly customization API and config file" is fine, because anyone wanting something else is probably already having a bad time.
If you are engaging in a real abuse of the IRC protocol and you're treating it as an 8-bit clean stream to send some escaped binary data through (like a video stream, something like that), well, that's what the 'charmap' alias of 'latin-1' is for :-).

I guess charmap could be used to implement the recovery scheme I keep
talking about, but then we'd be telling people to work out the
recoding interaction between IRCClient and their own implementation.
I'd like to provide a defined way of doing so eventually.

Post by Glyph Lefkowitz
So... have I sold you?

On default UTF-8? Absolutely! But I don't know exactly the way to do
it, so I'd rather provide a Python 3 port that actually implements the
protocol, and then work out a nice recoding API.

Thanks for taking the time to talk through this. I appreciate it!

-Mark

Glyph Lefkowitz

2016-11-22 21:37:01 UTC

Post by Glyph Lefkowitz
"doesn't work" is a pretty black-and-white assessment. Are you anticipating a problem with the way the interface is specified that it can't be easily changed?

IRCClient should deal in bytes and we should introduce a ProtocolWrapper-like thing that encodes and decodes
command prefixes and parameters.

I disagree. Any user-facing API should deal in unicode objects. (There is one caveat here; there really should be a separate layer for dealing with text; IRCClient being a subclassing-based API pollutes the whole issue. But that API shouldn't be public, so this is largely minutiae; the "right" answer here has nothing to do with bytes or text and everything to do with adopting .)

It should implement an interface, and we can start with an implementation that only knows about UTF-8.

We should have the implementation initially know about UTF-8, yes.

The obvious advantage of this is that you can more easily write IRCClients that work on both Python 2 and 3.

This is the part that I'm worried about. It kinda seems like we're moving toward "native string" being the type used in IRCClient, and that is capital-W Wrong. Native strings are for Python-native types only, i.e. docstrings and method names.

One of the things that's informing my decision is that IRCClient is already an incredibly ill-defined API that probably needs to be deprecated and overhauled at some point. However, in the intervening (what will almost certainly be a) decade, I'd like it to work on Python 3.

I'm more precisely worried about the fact that the implementation
raises a decoding exception that cannot be handled in user code when
it receives non-UTF-8 messages,

The right way to deal with this is twofold:

1. Add the ability to specify both the "encoding" and the "errors" of the relevant codec <https://docs.python.org/2.7/library/codecs.html#codecs.decode>, so that we can choose error handling strategies.
2. (potentially, if you have very nuanced requirements for dealing with weird encodings) write a codec that logs and handles its own errors. (We probably shouldn't be logging a traceback for encoding problems regardless, if it's UnicodeDecodeError. But that's something that can easily be fixed in subsequent releases as well.)
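
As a rough illustration of the two points above, the standard `errors` argument already selects among built-in strategies, and `codecs.register_error` lets an application install its own. The handler name `log-and-replace` and the `seen` list (a stand-in for a real logger) are assumptions made for this sketch; nothing here is Twisted's API.

```python
import codecs

raw = b"caf\xe9 au lait"  # latin-1 bytes, not valid UTF-8

# Built-in strategies chosen through the "errors" argument.
replaced = raw.decode("utf-8", "replace")
ignored = raw.decode("utf-8", "ignore")

seen = []  # hypothetical stand-in for a logger


def log_and_replace(error):
    """Record the offending bytes, then substitute U+FFFD and carry on."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    seen.append(error.object[error.start:error.end])
    # Return the replacement text and the position at which to resume.
    return (u"\ufffd", error.end)


codecs.register_error("log-and-replace", log_and_replace)
logged = raw.decode("utf-8", "log-and-replace")
```

A handler like this reports the problem without a traceback, which is roughly the behavior suggested for UnicodeDecodeError.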

and the fact that the line length checks occur prior to encoding, ensuring mid-codepoint truncation. These issues also contributed to my revert.

Line length checks are a super interesting example because I think they also illustrate my concerns as well.

To properly do message-splitting (which is why we're checking line length), you have to:

1. check the length in octets (because it's actually a message-length limit in octets, not a line-length limit in characters)
2. split the textual representation - ideally somewhere relevant like a word break, which you can only detect in text!
3. try encoding again and ensure that the encoded representation is the correct length, repeating if necessary.

This is an implementation-level bug though, not an interface-level one, so I'm also comfortable fixing this bug in the future.
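
The three message-splitting steps listed above could look roughly like this. It is a sketch under stated assumptions, not the code in twisted.words: `split_message` is a hypothetical name, the limit is passed in by the caller, and word breaks are approximated by the last space that fits.

```python
def split_message(text, max_octets, encoding="utf-8"):
    """Split text into chunks whose encoded form fits in max_octets."""
    chunks = []
    remaining = text
    while remaining:
        # Step 1: measure the encoded length in octets, not characters.
        if len(remaining.encode(encoding)) <= max_octets:
            chunks.append(remaining)
            break
        # Step 2: choose a cut point in the *text*. Every character needs
        # at least one octet, so max_octets characters is an upper bound.
        cut = min(len(remaining), max_octets)
        # Step 3: re-encode the candidate and shrink it until it fits.
        while cut > 0 and len(remaining[:cut].encode(encoding)) > max_octets:
            cut -= 1
        if cut == 0:
            raise ValueError("max_octets is too small for a single character")
        # Prefer a word break if one exists inside the candidate.
        space = remaining.rfind(u" ", 0, cut + 1)
        if space > 0:
            cut = space
        chunks.append(remaining[:cut])
        remaining = remaining[cut:].lstrip(u" ")
    return chunks


parts = split_message(u"h\u00e9llo w\u00f6rld \u00e9\u00e9\u00e9\u00e9", 8)
unbroken = split_message(u"\u00e9" * 5, 4)
```

Because the cut is always made between characters and verified after encoding, no chunk ends in the middle of a code point.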

Post by Glyph Lefkowitz
IRC is text. It's nonsensical to process it as bytes, because you can't process it as bytes. This is separate from the question of "what encoding is IRC".

You can't process it as bytes either, though. In some cases you think you can, but then you get mid-codepoint truncation :-).

But we can't have *a* fallback encoding. My encoding detector program
indicates that latin-1 is the second most popular encoding for
European IRC servers, but Russian servers I sampled (not in
netsplit.de's top 10) used a variety of Cyrillic encodings.

If you really want to do something this sophisticated (and, I should note: no other IRC clients or bots I'm aware of do, so I think you've got an unrealistically tight set of requirements) then you can just write your own single codec that composes a bunch of others, and install it. Python's encoding system is extensible for exactly this reason :).
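
A minimal sketch of such a composite codec follows, assuming a fixed list of candidate encodings tried in order. The codec name `ircfallback` and the candidate list are made up for illustration. Note that most 8-bit encodings accept nearly any byte string, so the order of candidates is effectively a policy decision rather than detection.

```python
import codecs

# Candidates tried in order; latin-1 accepts any byte string, so it is last.
CANDIDATES = ("utf-8", "latin-1")


def _decode(data, errors="strict"):
    # The "errors" argument is ignored here; each candidate is tried strictly.
    data = bytes(data)
    for name in CANDIDATES:
        try:
            return data.decode(name), len(data)
        except UnicodeDecodeError:
            continue
    raise UnicodeDecodeError(
        "ircfallback", data, 0, len(data), "no candidate encoding matched")


def _encode(text, errors="strict"):
    # Always send UTF-8, whatever was received.
    return text.encode("utf-8", errors), len(text)


def _search(name):
    # Called by the codec registry with the normalized encoding name.
    if name == "ircfallback":
        return codecs.CodecInfo(
            encode=_encode, decode=_decode, name="ircfallback")
    return None


codecs.register(_search)

from_utf8 = b"t\xc3\xabst".decode("ircfallback")
from_latin1 = b"t\xebst".decode("ircfallback")
```

Once registered, the codec can be selected by name anywhere an encoding string is accepted.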

I also want to enable arbitrary recovery strategies for bad encodings.

This is totally not an IRC-specific thing though :-).

For instance, in the case that an IRC client or server truncates a
code point at a line boundary, it might be the right idea to binary
search until the invalid byte sequence is found, and then exclude it.
It might be the right idea to buffer the message for a time in the
hopes that the codepoint got split over two lines.
And what if somebody wants to run another encoding survey?

Decode as charmap, which is what we call latin-1 when we want to do this :). That's a super edge-case, and should not be easy by default.
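
The reason latin-1 works as an escape hatch is that it maps each of the 256 byte values to the code point with the same number, so decoding and re-encoding loses nothing. A short sketch, using the `latin-1` codec name and a KOI8-R sample chosen purely for illustration:

```python
# Every possible byte value survives a latin-1 round trip unchanged.
raw = bytes(bytearray(range(256)))
as_text = raw.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# Text received as "latin-1" can be re-decoded once the real encoding is known.
koi8_bytes = u"\u043f\u0440\u0438\u0432\u0435\u0442".encode("koi8_r")
mojibake = koi8_bytes.decode("latin-1")
recovered = mojibake.encode("latin-1").decode("koi8_r")
```

This is the manual demojibakefication path: the text looks wrong on arrival, but the original bytes are still recoverable from it.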

I don't expect most users to do any of that, but *I* certainly want to
without having to copy and paste a bunch of code.

You can totally do all of these things once we can specify an encoding.

Post by Glyph Lefkowitz
<arabic USB drive>

Sure, sorry for my sarcastic retort. The example is totally germane; I didn't mean to say it wasn't.

That's the situation opinionated software with monolithic abstractions
creates. People *will* find workarounds that are terrible for a bunch
of reasons. I can vouch for the utility of tools that decide on
encodings as late as possible.

Wouldn't it have been great if you couldn't create this mess in the first place, though? The ability to recover is good (and being able to specify the encoding, and write your own custom codec, for IRC is certainly important).

Using latin-1 in this scenario would have worked as well, though.

And as far as I know code point sequence truncation can also occur on
FAT16/32 partitions. In the event of such truncation the automatic
decoding would only prevent me from mounting the partition. I'm
thankful that the implementation allows me to choose a recovery
strategy in a very real edge case. If it didn't, I'd have to look up
the file system's on disk format and reimplement 99% of a FAT16 driver
to get at the data.

OK, now we're getting into some real filesystem esoterica which I'm not sure applies any more :-).

For what it's worth, I want to make it easy to use UTF-8. I just
don't want to make it hard to use an encoding that's *not* UTF-8.

I want to make it a little hard. Having a version floating around for a few releases that only supports UTF-8 creates gentle social pressure for everyone to fix their encodings. Later releasing the version that supports arbitrary stuff including chardet addresses the long tail of brokenness that can't be fixed by a nudge.

OK, this is definitely the part where we diverge.

If you care so much about the hairsplitting specifics of IRC byte handling that you want to change the line-splitting algorithm to do something specific, you should be maintaining Twisted, not writing applications with it.

I suppose I should reveal my bias here: IRC is a garbage protocol, and its implementations' main utility should be upward compatibility with something more modern, maybe a line-delimited JSON thing, since XMPP doesn't seem to have taken off. That thing hasn't arrived yet, whatever it will be, but when we present an application-level interface to it, we should strip away as much IRC-specific junk as we can, while still maintaining enough specificity that consumers of the API can provoke specific desired user-facing behaviors in user interfaces (for example, preserving the distinction between "notice" and "message").

Twisted's IRC support's job, in my mind, is to support applications that want to interact with users and servers, and possibly process messages in between. You can't process messages as bytes (see mid-codepoint truncation above), so presenting a bytes-oriented interface is useless for this whole class of application, not just for the final step where the message is presented to a human. Presenting this low-level interface to enable users the ability to customize line-splitting is just bonkers.

I think the best way to handle this is to have a bytes-only IRC client
that can then be wrapped with something that decodes prefixes and
parameters. We can provide a UTF-8 recoder that people are encouraged
to use, and an interface that allows implementers to choose their own
encoding strategy.

At the risk of repeating myself, the way you select an encoding strategy in Python is selecting an encoding :).

I will say I'm happy to take a stab at a recoder.

You've used this word a few times - what is a "recoder"?

Shipping what we have now will mean we're putting bugs out there (see
the line length issues called out in the ticket) and an interface I
think we haven't thought through, but that certainly limits what IRC
protocol messages you can receive.

I'm OK with there being edge-case bugs like this: we should fix them one at a time. Smaller PRs are better, even if it means not everything works perfectly in every release.

As Kay put it, simple things should be easy, and complex things should be possible. I am happy with this tradeoff - writing this weird transcoding nexus IRC proxy application _should_ be kind of hard ;). Writing a bot that spits out emoji in response to jokes should be easy. (And you can't even encode emoji in KOI-8, so.)

Post by Glyph Lefkowitz
So... have I sold you?

On default UTF-8? Absolutely! But I don't know exactly the way to do
it, so I'd rather provide a Python 3 port that actually implements the
protocol, and then work out a nice recoding API.
Thanks for taking the time to talk through this. I appreciate it!

Sorry to say my final call (as backed up by Amber, apropos of our earlier IRC conversation (WHICH I SHOULD NOTE WAS CONDUCTED USING UTF-8 TEXT!!!)) is not to act on the revert. But you raise many valid issues and I hope that we can get those nailed by the next release as just regular old bugfixes :).

This has been a great conversation though, I hope we can have more like it on the mailing list :).

-glyph

Tristan Seligmann

2016-11-22 23:14:23 UTC

Unless I'm misunderstanding, we're not "moving towards" it, we have *already
arrived*: IRCClient deals in str (bytes) on Python 2, and str (unicode) on
Python 3. Even if we want a unicode API, having it only exist on Python 3
seems incredibly confusing from a user standpoint, and would appear to
require some absurd contortions to write client code that behaves
approximately the same on both Python 2 and 3.

Tristan Seligmann

2016-11-22 23:26:27 UTC

Post by Glyph Lefkowitz
This is the part that I'm worried about. It kinda seems like we're moving
toward "native string" being the type used in IRCClient, and *that* is
capital-W Wrong. Native strings are for Python-native types only, i.e.
docstrings and method names.
Unless I'm misunderstanding, we're not "moving towards" it, we have *already
arrived*: IRCClient deals in str (bytes) on Python 2, and str (unicode)
on Python 3. Even if we want a unicode API, having it only exist on Python
3 seems incredibly confusing from a user standpoint, and would appear to
require some absurd contortions to write client code that behaves
approximately the same on both Python 2 and 3.

For example, as far as I can tell, the only way to write code to join a
channel named #tëst (UTF-8 encoded) is:

channel = u'#tëst'
if PY3:
    channel = channel.encode('utf-8')
client.join(channel)

On Python 3, client.join(b'#t\xc3\xabst') will try to send JOIN b'#t\xc3\xabst',
which is garbage, whereas on Python 2, client.join(u'#t\xebst') will
produce a UnicodeEncodeError.

Tristan Seligmann

2016-11-22 23:27:12 UTC

Argh, the above should be if PY2 of course.

Glyph Lefkowitz

2016-11-22 23:31:45 UTC

Post by Tristan Seligmann
Argh, the above should be if PY2 of course.

OK, this whole time I thought we were talking about a sensible application of text_type to the API, perhaps with some leniency for bytes-ish-ness on python 2. I haven't reviewed the PR, I was just responding to the concerns as raised on the list.

If it's just randomly encoding on one version and not the other, and correct usage of the API depends on *users* doing 'if PY2:' in their own code, then perhaps Mark's concern is indeed well-founded and we should roll it back before 16.6.

-glyph

Mark Williams

2016-11-23 01:27:11 UTC

Post by Glyph Lefkowitz
OK, this whole time I thought we were talking about a sensible application of text_type to the API, perhaps with some leniency for bytes-ish-ness on python 2. I haven't reviewed the PR, I was just responding to the concerns as raised on the list.

Sorry - I didn't mean to steer this towards API bike shedding.

Post by Glyph Lefkowitz
If it's just randomly encoding on one version and not the other, and correct usage of the API depends on *users* doing 'if PY2:' in their own code, then perhaps Mark's concern is indeed well-founded and we should roll it back before 16.6.

Tristan's exactly right. Furthermore, if we decide to make IRCClient
call its various command methods with unicode strings on Python 2,
we'll be breaking backwards compatibility. This is what I meant when

Post by Glyph Lefkowitz
Yes. Here's the lede: IRCClient should deal in bytes and we should
introduce a ProtocolWrapper-like thing that encodes and decodes
command prefixes and parameters. It should implement an interface,
and we can start with an implementation that only knows about UTF-8.
The obvious advantage of this is that you can more easily write
IRCClients that work on both Python 2 and 3.

But it totally wasn't clear - sorry!

Of course, I also want an IRC client implementation that lets me get at
bytes, but that's a discussion I'll move to a new thread.

Given the inconsistency between Python 2 and Python 3, do we proceed
with the revert?

-Mark

Glyph Lefkowitz

2016-11-23 02:03:07 UTC

Sorry - I didn't mean to steer this towards API bike shedding.

But it totally wasn't clear - sorry!
Of course, I also want an IRC client implementation that lets me get at
bytes, but that's a discussion I'll move to a new thread.
Given the inconsistency between Python 2 and Python 3, do we proceed
with the revert?

Okay. So.

The rule for reverts like this is: if you do something today, which is correct usage of the API and produces an observably correct result, will that be broken in the future if we fix it? If so, then we need to revert because the interface as released is unsupportable.

As it stands, we have a matrix of 4 behaviors:

       bytes     text(ascii)   text(nonascii)
py2    works     works         UnicodeDecodeError
py3    garbage   works         works

This... is actually... fine, surprisingly.

The right thing to do is to write code that passes text all the time. If you do that right now, it'll work on py3 and raise an exception on py2, unless it happens to be ASCII, in which case it'll work.

If you write code that passes bytes on py3, it'll just be garbage. But, we want to deprecate that anyway, and you can't get correct, usable behavior out of it, no matter what workarounds you stuff in; so it's a bug, and can be fixed like any bug.

Similarly if you pass non-ascii text on py3, you'll get a UnicodeDecodeError.

This is not a good situation, but it's totally fixable without breaking the interface. We just fix the py2 version to accept text_type as well, and if Mark sneaks in a patch that makes py3 do the right thing with bytes, well, I don't know that I can stop him.
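
One way to picture the fix described above is a single normalization step applied to every argument before it reaches the wire, so that text and already-encoded bytes behave the same on both Python versions. The helper name `to_wire` and the UTF-8 default are assumptions for this sketch; it is not the code in twisted.words.

```python
def to_wire(value, encoding="utf-8"):
    """Return wire bytes from either text or already-encoded bytes."""
    # On Python 2, bytes is str; on Python 3, text is str. Checking for
    # bytes first gives the same behavior on both.
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


from_text = to_wire(u"#t\u00ebst")
from_bytes = to_wire(b"#t\xc3\xabst")
join_line = b"JOIN " + from_text + b"\r\n"
```

With a helper like this in each command method, callers would not need an `if PY2:` branch of their own, and the methods could be converted one at a time.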

More importantly, it would probably be a smaller change to fix the methods (we could even fix them one at a time; say, action, join, etc) than to un-port and re-port the whole thing.

So: yes, it's broken, and in a worse way than I thought. To get it to the point where we can actually implement logic consistently between two versions, we need to add a flag to IRCClient's constructor which is default-false on py2 and default-true on py3 which says "give me text", so that callbacks like privmsg and joined can start receiving text_type on py2 as well as py3; right now it has to receive str because they've previously received str. But that's a separate issue.

I am open to the idea that I have evaluated this incorrectly though, since this has been possibly the most confusing change since https://twistedmatrix.com/trac/ticket/411. But as of right now I still think we shouldn't revert.

-glyph

John Santos

2016-11-23 02:35:41 UTC

Been lurking here, no cows in the fire, no irons in the race, or
whatever, except wanting Twisted to be perfect and easy to use and being
perennially confused by text encoding, but I did notice this:

On 11/22/2016 9:03 PM, Glyph Lefkowitz wrote:

[...]

Post by Glyph Lefkowitz
Okay. So.
The rule for reverts like this is: if you do something today, which is
correct usage of the API and produces an observably correct result,
will that be broken in the future if we fix it? If so, then we need
to revert because the interface as released is unsupportable.
       bytes     text(ascii)   text(nonascii)
py2    works     works         UnicodeDecodeError
py3    garbage   works         works
This... is actually... fine, surprisingly.
The /right/ thing to do is to write code that passes text all the
time. If you do that right now, it'll work on py3 and raise an
exception on py2, unless it /happens/ to be ASCII, in which case it'll
work.
If you write code that passes bytes on py3, it'll just be garbage.
But, we want to deprecate that anyway, and you can't get correct,
usable behavior out of it, no matter what workarounds you stuff in; so
it's a bug, and can be fixed like any bug.
Similarly if you pass non-ascii text on py3, you'll get a
UnicodeDecodeError.

Shouldn't this be "if you pass non-ascii text on *py2*, you'll get ..." ?

[...]

Post by Glyph Lefkowitz
-glyph

Pedantically yours,

--
John Santos
Evans Griffiths & Hart, Inc.
781-861-0670 ext 539

Glyph Lefkowitz

2016-11-23 05:04:36 UTC

Shouldn't this be "if you pass non-ascii text on py2, you'll get ..." ?

Yes. Thanks for that catch :).

-g

Mark Williams

2016-11-23 02:36:47 UTC

Given that matrix, how would this work on Python 2 and 3:

https://github.com/buildbot/buildbot/blob/40d5dd3d101704aa8db582e306b3c6cf7921c23c/master/buildbot/reporters/irc.py#L67-L68

And how would that code not have to change if a future release accommodates
Unicode on Python 2 or bytes on Python 3?

Glyph Lefkowitz

2016-11-23 05:04:09 UTC