Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
192 changes: 192 additions & 0 deletions README-Unicode.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,192 @@
Python3 Unicode migration in the XCP package
============================================

Migrating `subprocess.Popen()`
------------------------------
With Python3, the `stdin`, `stdout` and `stderr` pipes for `Popen()` default to `bytes`(binary mode). Binary mode is much safer because it foregoes the encode/decode.

The for working with strings, existing users need to either enable text mode (when safe, it will attempt to decode and encode!) or be able to use bytes instead.

For cases where the data is guaranteed to be pure ASCII, such as when resting the `proc.stdout` of `lspci -nm`, it is sufficient to use:
```py
open(["lspci, "-nm"], stdout=subprocess.PIPE, universal_newlines=True)
```
This is possible because `universal_newlines=True` is accepted by Python2 and Python3.
On Python3, it also enables text mode for `subprocess.PIPE` streams (newline conversion
not needed, but text mode is needed)

Migrating `builtins.open()`
---------------------------
On Python3, `builtins.open()` can be used in a number of modes:
- Binary mode (when `"b"` is in `mode`): `read()` and `write()` use `bytes`.
- Text mode (Python3 default up to Python 3.6), when UTF-8 character encoding is not set by the locale
- UTF-8 mode (default since Python 3.7): https://peps.python.org/pep-0540/

When no Unicode locale in force, like in XAPI plugins, Python3 will be in text mode or UTF-8 (since Python 3.7, but existing XS is on 3.6):

* By default, `read()` on files `open()`ed without selecting binary mode attempts
to decode the data into the Python3 Unicode string type.
This fails for binary data.
Any `ord() >= 128`, when no UTF-8 locale is active With Python 3.6, triggers `UnicodeDecodeError`.

* Thus, all `open()` calls which might open binary files have to be converted to binary
or UTF-8 mode unless the caller is sure he is opening an ASCII file.
But even then, enabling an error handler to handle decoding errors is recommended:
```py
open(filename, errors="replace")
```
But neither `errors=` nor `encoding=` is accepted by Python2, so a wrapper is likely best.

### Binary mode

When decoding bytes to strings is not needed, binary mode can be great because it side-steps string codecs. But, string operations involving string Literals have to be changed to bytes.

However, when strings need to returned from the library, something like `bytes.decode(errors="ignore")` to get strings is needed.

### Text mode

Text mode using the `ascii` codec should be only used when it is ensured that the data can only consist of ASCII characters (0-127). Sadly, it is the default in Python 3.6 when the Python interpreter was not started using an UTF-8 locale for the LC_CTYPE locale category (set by LC_ALL, LC_CTYPE, LANG environment variables, overriding each other in that order)

### UTF-8 mode

Most if the time, the `UTF-8` codec should be used since even simple text files which are even documented to contain only ASCII characters like `"/usr/share/hwdata/pci.ids"` in fact __do__ contain UTF-8 characters.

Some files or some output data from commands even contains legacy `ISO-8859-1` chars, and even the `UTF-8` codec would raise `UnicodeDecodeError` for these.
When this is known to be the case, `encoding="iso-8859-1` could be tried (not tested yet).

### Problems

With the locale set to C (XAPI plugins have that), Python's default mode changes
between 3.6 and 3.7:
```py
for i in 3.{6,7,10,11};do echo -n "3.$i: ";
LC_ALL=C python3.$i -c 'import locale,sys;print(locale.getpreferredencoding())';done
3.6: ANSI_X3.4-1968
3.7: UTF-8
3.10: UTF-8
3.11: utf-8
```
This has the effect that in Python 3.6, the default codec for XAPI plugins is `ascii`:
```py
for i in 2.7 3.{6,7};do echo "$i:";
LC_ALL=C python$i -c 'open("/usr/share/hwdata/pci.ids").read()';done
2.7:
3.6:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib64/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 97850: ordinal not in range(128)
3.7:
```
This error means that the `'ascii' codec` cannot handle input ord() >= 128, and as some Video cards use `²` to reference their power, the `ascii` codec chokes on them.

It means `xcp.pci.PCIIds()` cannot use `open("/usr/share/hwdata/pci.ids").read()`.

While Python 3.7 and newer use UTF-8 mode by default, it does not set up an error handler for `UnicodeDecodeError`.

As it happens, some older tools output ISO-8859-1 characters hard-coded and these aren't valid UTF-8 sequences, and even newer Python versions need error handlers to not fail:
```py
echo -e "\0262" # ISO-8859-1 for: "²"
python3 -c 'open(".text").read()'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 0: invalid start byte
```

```py
pylint -d all -e unspecified-encoding --msg-template="{path} line {line} in {obj}()" xcp/ tests/
************* Module xcp.accessor
xcp/accessor.py line 165 in MountingAccessor.writeFile()
xcp/accessor.py line 240 in FileAccessor.writeFile()
************* Module xcp.bootloader
xcp/bootloader.py line 111 in Bootloader.readExtLinux()
xcp/bootloader.py line 219 in Bootloader.readGrub()
xcp/bootloader.py line 335 in Bootloader.readGrub2()
xcp/bootloader.py line 465 in Bootloader.writeExtLinux()
xcp/bootloader.py line 507 in Bootloader.writeGrub()
xcp/bootloader.py line 541 in Bootloader.writeGrub2()
************* Module xcp.cmd
xcp/cmd.py line 67 in OutputCache.fileContents()
************* Module xcp.dom0
xcp/dom0.py line 85 in default_memory()
************* Module xcp.environ
xcp/environ.py line 48 in readInventory()
************* Module xcp.logger
xcp/logger.py line 51 in openLog()
************* Module xcp.net.ifrename.dynamic
xcp/net/ifrename/dynamic.py line 95 in DynamicRules.load_and_parse()
xcp/net/ifrename/dynamic.py line 292 in DynamicRules.save()
************* Module xcp.net.ifrename.static
xcp/net/ifrename/static.py line 118 in StaticRules.load_and_parse()
xcp/net/ifrename/static.py line 330 in StaticRules.save()
************* Module tests.test_biosdevname
tests/test_biosdevname.py line 30 in TestDeviceNames.test()
tests/test_biosdevname.py line 32 in TestDeviceNames.test()
************* Module tests.test_bootloader
tests/test_bootloader.py line 32 in TestLinuxBootloader.setUp()
tests/test_bootloader.py line 34 in TestLinuxBootloader.setUp()
tests/test_bootloader.py line 36 in TestLinuxBootloader.setUp()
tests/test_bootloader.py line 38 in TestLinuxBootloader.setUp()
************* Module tests.test_pci
tests/test_pci.py line 96 in TestPCIIds.test_videoclass_by_mock_calls()
tests/test_pci.py line 110 in TestPCIIds.mock_lspci_using_open_testfile()
```
Of course, `xcp/net/ifrename` won't be affected but it would be good to fix the
warning for them as well in an intelligent way. See the proposal for that below.

There are a couple of possibilities and Python because 2.7 does not support the
arguments we need to pass to ensure that all users of open() will work, we need
to make passing the arguments conditional on Python >= 3.

1. Overriding `open()`, while technically working would not only affect xcp.python but the entire program:
```py
if sys.version_info >= (3, 0):
original_open = __builtins__["open"]
def uopen(*args, **kwargs):
if "b" not in (args[1] \
if len(args) >= 2 else kwargs.get("mode", "")):
kwargs.setdefault("encoding", "UTF-8")
kwargs.setdefault("errors", "replace")
return original_open(*args, **kwargs)
__builtins__["open"] = uopen
```
2. This is sufficient but is not very nice:
```py
# xcp/utf8mode.py
if sys.version_info >= (3, 0):
open_utf8args = {"encoding": "utf-8", "errors": "replace"}
else:
open_utf8args = {}
# xcp/{cmd,pci,environ?,logger?}.py tests/test_{pci,biodevname?,...?}.py
+ from .utf8mode import open_utf8args
...
- open(filename)
+ open(filename, **open_utf8args)
```
But, `pylint` will still warn about these lines, so I propose:

3. Instead, use a wrapper function, which will also silence the `pylint` warnings at the locations which have been changed to use it:
```py
# xcp/utf8mode.py
if sys.version_info >= (3, 0):
def utf8open(*args, **kwargs):
if len(args) > 1 and "b" in args[1]:
return open(*args, **kwargs)
return open(*args, encoding="utf-8", errors="replace", **kwargs)
else:
utf8open = open
# xcp/{cmd,pci,environ?,logger?}.py tests/test_{pci,biodevname?,...?}.py
+ from .utf8mode import utf8open
...
- open(filename)
+ utf8open(filename)
```

Using the 3rd option, the `pylint` warnings for the changed locations
`unspecified-encoding` and `consider-using-with` don't appear without
explicitly disabling them.

PS: Since utf8open() still returns a context-manager, `with open(...) as f:`
would still work.
12 changes: 1 addition & 11 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -172,17 +172,7 @@ Special open TODOs:
-------------------

Charset encoding/string handling:
* With Python3, `read()` on files `open()`ed without specifying binary mode will attempt
to decode the data into the Python3 Unicode string type, which will fail for all
binary data. Thus all `open()` calls which might open binary files have to be converted
to binary mode by default unless the caller is sure he is opening an ASCII file,
even then, enabling an error handle to handle decoding errors is recommended.
* With Python3, the `stdin`, `stdout` and `stderr` pipes for `Popen()` default to
`bytes`(binary mode.) Binary mode is much safer because it foregoes the encode/decode. The existing users need to be able to enable text mode (when safe, it will attempt
to decode and encode!) or preferably be able to use bytes (which is the type behind Python2 strings too) instead. See these PRs for details:
* https://github.com/xenserver/python-libs/pull/22
* https://github.com/xenserver/python-libs/pull/23
* https://github.com/xenserver/python-libs/pull/24
See [README-Unicode.md](README-Unicode.md) for details:
* What's more: When code is called from a xapi plugin (such as ACK), when such code
attempts to read text files like the `pciids` file, and there is a Unicode char
it int, and the locale is not set up to be UTF-8 (because xapi plugins are started
Expand Down
1 change: 1 addition & 0 deletions tests/data/lspci-mn
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
03:00.0 "0380" "1002" "7340" -rc1 "1462" "12ac"
03:00.1 "0403" "1002" "ab38" "1462" "12ac"
04:00.0 "0280" "8086" "2723" -r1a "8086" "0084"
04:01.0 "0380" "1002" "67b0" "1787" "2020"
05:00.0 "0200" "10ec" "8168" -r15 "1462" "12ac"
06:00.0 "0108" "144d" "a808" -p02 "144d" "a801"
07:00.0 "0300" "1002" "1636" -rc6 "1462" "12ac"
Expand Down
2 changes: 2 additions & 0 deletions tests/data/pci.ids
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,8 @@
1314 Wrestler HDMI Audio
174b 1001 PURE Fusion Mini
1636 Renoir
67b0 Hawaii XT / Grenada XT [Radeon R9 290X/390X]
1787 2020 R9 290X IceQ X² Ultra Turbo Overdrive³ USB 3.2 SuperSpeed Edition
7340 Navi 14 [Radeon RX 5500/5500M / Pro 5500M]

# List of known device classes, subclasses and programming interfaces
Expand Down
49 changes: 41 additions & 8 deletions tests/test_pci.py
Original file line number Diff line number Diff line change
@@ -1,10 +1,15 @@
# -*- coding: utf-8 -*-
# pylint: disable=unspecified-encoding # we pass encoding using kwargs where needed


import subprocess
import unittest
from os import environ

import pyfakefs.fake_filesystem_unittest # type: ignore[import]
from mock import patch, Mock

from xcp.compat import open_utf8
from xcp.pci import PCI, PCIIds, PCIDevices

class TestInvalid(unittest.TestCase):
Expand Down Expand Up @@ -87,8 +92,8 @@ def test_videoclass_without_mock(self):

def test_videoclass_by_mock_calls(self):
with patch("xcp.pci.os.path.exists") as exists_mock, \
patch("xcp.pci.open") as open_mock, \
open("tests/data/pci.ids") as fake_data:
patch("xcp.pci.utf8open") as open_mock, \
open("tests/data/pci.ids", **open_utf8) as fake_data: # type: ignore[call-overload]
exists_mock.return_value = True
open_mock.return_value.__iter__ = Mock(return_value=iter(fake_data))
ids = PCIIds.read()
Expand Down Expand Up @@ -116,21 +121,49 @@ def assert_videoclass_devices(self, ids, devs): # type: (PCIIds, PCIDevices) ->
self.assertEqual(video_class, ["03"])
sorted_devices = sorted(devs.findByClass(video_class),
key=lambda x: x['id'])
self.assertEqual(len(sorted_devices), 2)

# Assert devs.findByClass() finding 3 GPUs from tests/data/lspci-mn in our mocked PCIIds DB:
self.assertEqual(len(sorted_devices), 3)

# For each of the found devices, assert these expected values:
for (video_dev,
num_functions,
vendor,
device,
subdevice,
) in zip(sorted_devices,
(1, 5),
("Advanced Micro Devices, Inc. [AMD/ATI]",
"Advanced Micro Devices, Inc. [AMD/ATI]"),
("Navi 14 [Radeon RX 5500/5500M / Pro 5500M]",
"Renoir"),
# 1: Number of other PCI device functions shown by mocked lspci in this PCI slot:
(
1,
0,
5,
),
# 2: GPU Vendor
(
"Advanced Micro Devices, Inc. [AMD/ATI]",
"Advanced Micro Devices, Inc. [AMD/ATI]",
"Advanced Micro Devices, Inc. [AMD/ATI]",
),
# 3: GPU Device name
(
"Navi 14 [Radeon RX 5500/5500M / Pro 5500M]",
"Hawaii XT / Grenada XT [Radeon R9 290X/390X]",
"Renoir",
),
# 4: GPU Subdevice name
(
None,
"R9 290X IceQ X² Ultra Turbo Overdrive³ USB 3.2 SuperSpeed Edition",
None,
),
):
self.assertEqual(len(devs.findRelatedFunctions(video_dev['id'])), num_functions)
self.assertEqual(ids.findVendor(video_dev['vendor']), vendor)
self.assertEqual(ids.findDevice(video_dev['vendor'], video_dev['device']), device)
# Expect that we can lookup the subdevice and get the name of the subdevice, if found:
self.assertEqual(
ids.findSubdevice(video_dev["subvendor"], video_dev["subdevice"]),
subdevice,
)

self.assertEqual(len(devs.findRelatedFunctions('00:18.1')), 7)
1 change: 1 addition & 0 deletions tox.ini
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,7 @@ passenv =
fox: XAUTHORITY
fox: DBUS_SESSION_BUS_ADDRESS
setenv =
LC_ALL=C # Ensure that xcp is tested without an locale (like XAPI plugins)
PYTHONDEVMODE=yes # Enables development/resource checks: eg unclosed files and more
PYTHONPATH=stubs
# Inhibit dev-warnings on pytest plugins, but we enable dev-warnings in conftest.py
Expand Down
21 changes: 21 additions & 0 deletions xcp/compat.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
"""Helper module for providing common defaults on how to enable UTF-8 I/O in Py3.6 and newer"""
# See README-Unicode.md for Details
import sys

if sys.version_info >= (3, 0): # pragma: no cover
open_utf8 = {"encoding": "utf-8", "errors": "replace"}

def utf8open(filename, *args, **kwargs):
"""Helper for open(): Handle UTF-8: Default to encoding="utf-8", errors="replace" for Py3"""
if "b" in (args[0] if args else kwargs.get("mode", "")):
# Binary mode: just call open() unmodified:
return open(filename, *args, **kwargs) # pylint: disable=unspecified-encoding
# Text mode: default to UTF-8 with error handling to replace malformed UTF-8 sequences
kwargs.setdefault("encoding", "utf-8") # Needed for Python 3.6 when no UTF-8 locale is set
kwargs.setdefault("errors", "replace") # Simple codec error handler: Replace malformed char
# pylint: disable-next=unspecified-encoding
return open(filename, *args, **kwargs) # type: ignore[call-overload]
else:
# Python2.7: None of the above is either supported or relevant (strings are bytes):
open_utf8 = {}
utf8open = open
4 changes: 3 additions & 1 deletion xcp/pci.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,8 @@
import re
import six

from .compat import utf8open

_SBDF = (r"(?:(?P<segment> [\da-dA-F]{4}):)?" # Segment (optional)
r" (?P<bus> [\da-fA-F]{2}):" # Bus
r" (?P<device> [\da-fA-F]{2})\." # Device
Expand Down Expand Up @@ -185,7 +187,7 @@ def __init__(self, fn):
vendor = None
cls = None

fh = open(fn)
fh = utf8open(fn)
for l in fh:
line = l.rstrip()
if line == '' or line.startswith('#'):
Expand Down