Convert bytes to a string in Python 3

Question

I captured the standard output of an external program into a bytes object:

>>> from subprocess import *
>>> stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

I want to convert that to a normal Python string, so that I can print it like this:

>>> print(stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

How do I convert the bytes object to a str with Python 3?

See "Best way to convert string to bytes in Python 3?" for the other way around.

Why doesn't str(text_bytes) work? This seems bizarre to me. – Charlie Parker, Commented Mar 14, 2019 at 22:25
@CharlieParker Because str(text_bytes) can't specify the encoding. Depending on what's in text_bytes, text_bytes.decode('cp1250') might result in a very different string from text_bytes.decode('utf-8'). – Craig Anderson, Commented Mar 31, 2019 at 17:32
So the str function no longer converts to a real string. One HAS to specify an encoding explicitly, for some reason I am too lazy to read up on. Just decode it as utf-8 and see if your code works, e.g. var = var.decode('utf-8') – Charlie Parker, Commented Apr 22, 2019 at 23:32
@CraigAnderson: unicode_text = str(bytestring, character_encoding) works as expected on Python 3, though unicode_text = bytestring.decode(character_encoding) is preferable to avoid confusion with plain str(bytes_obj), which produces a text representation of bytes_obj instead of decoding it to text: str(b'\xb6', 'cp1252') == b'\xb6'.decode('cp1252') == '¶' and str(b'\xb6') == "b'\\xb6'" == repr(b'\xb6') != '¶' – jfs, Commented Apr 12, 2020 at 5:11
Also, you can pass text=True to subprocess.run() or .Popen() and then you'll get a string back, no need to convert bytes. Or specify encoding="utf-8" to either function. – David Gilbertson, Commented Sep 13, 2022 at 5:46

Per Lundberg · Accepted Answer · 2024-02-05 13:32:31Z

5736

Decode the bytes object to produce a string:

>>> b"abcde".decode("utf-8")
'abcde'

The above example assumes that the bytes object is in UTF-8, because it is a common encoding. However, you should use the encoding your data is actually in!
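
As a small illustration (a sketch, not part of the original answer; the sample text "café" is just an assumed example), the same bytes decode to different strings depending on the codec you pick:

```python
# The same byte sequence, decoded with two different codecs.
data = "café".encode("utf-8")        # b'caf\xc3\xa9'

as_utf8 = data.decode("utf-8")       # the encoding the data is actually in
as_cp1252 = data.decode("cp1252")    # a wrong guess: no error, but mojibake

print(as_utf8)    # café
print(as_cp1252)  # cafÃ©
```

Note that the wrong codec does not raise here; it silently yields the wrong text.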

edited Feb 5 at 13:32 by Per Lundberg

answered Mar 3, 2009 at 12:26 by Aaron Maenpaa

Yes, but given that this is the output from a Windows command, shouldn't it instead be using ".decode('windows-1252')"?
– mcherm, Commented Jul 18, 2011 at 19:48

Using "windows-1252" is not reliable either (e.g., for other language versions of Windows). Wouldn't it be best to use sys.stdout.encoding?
– nikow, Commented Jan 3, 2012 at 15:20

Maybe this will help somebody further: sometimes you use a byte array for, e.g., TCP communication. If you want to convert a byte array to a string and cut off trailing '\x00' characters, this answer is not enough. Use b'example\x00\x00'.decode('utf-8').strip('\x00') then.
– Wookie88, Commented Apr 16, 2013 at 13:27

Official documentation for this: for all bytes and bytearray operations (methods which can be called on these objects), see here: docs.python.org/3/library/stdtypes.html#bytes-methods. For bytes.decode() in particular, see here: docs.python.org/3/library/stdtypes.html#bytes.decode.
– Gabriel Staples, Commented Mar 25, 2021 at 4:12

Just use decode(), because utf-8 is the default.
– Nam G VU, Commented Dec 31, 2023 at 11:14

Mateen Ulhaq · Accepted Answer · 2022-06-06 03:29:45Z

422

Decode the byte string and turn it into a character (Unicode) string.

Python 3:

encoding = 'utf-8'
b'hello'.decode(encoding)

or

str(b'hello', encoding)

Python 2:

encoding = 'utf-8'
'hello'.decode(encoding)

or

unicode('hello', encoding)

edited Jun 6, 2022 at 3:29 by Mateen Ulhaq

answered Mar 3, 2009 at 12:28 by dF.

Just use decode(), because utf-8 is the default.
– Nam G VU, Commented Dec 31, 2023 at 11:14

Mateen Ulhaq · Accepted Answer · 2022-06-06 03:32:23Z

263

This joins together a list of bytes into a string:

>>> bytes_data = [112, 52, 52]
>>> "".join(map(chr, bytes_data))
'p44'
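
The join above treats each integer as a Unicode code point, which amounts to decoding the bytes as Latin-1. A short sketch of that equivalence (the names joined and decoded, and the extra value 233, are only for illustration):

```python
bytes_data = [112, 52, 52, 233]  # 233 added to show a non-ASCII value

joined = "".join(map(chr, bytes_data))
decoded = bytes(bytes_data).decode("latin-1")

print(joined == decoded)  # True
```

If the integers actually represent UTF-8 (or some other encoding), decode with that codec instead.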

edited Jun 6, 2022 at 3:32 by Mateen Ulhaq

answered Aug 22, 2012 at 12:57 by Sisso

@leetNightshade: yet it is terribly inefficient. If you have a byte array you only need to decode.
– Martijn Pieters ♦, Commented Sep 1, 2014 at 16:25

@Sasszem: this method is a perverted way to express: a.decode('latin-1') where a = bytearray([112, 52, 52]) ("There Ain't No Such Thing as Plain Text". If you've managed to convert bytes into a text string, then you used some encoding: latin-1 in this case.)
– jfs, Commented Nov 16, 2016 at 3:16

@leetNightshade: For completeness' sake: bytes(list_of_integers).decode('ascii') is about 1/3rd faster than ''.join(map(chr, list_of_integers)) on Python 3.6.
– Martijn Pieters ♦, Commented Jul 3, 2018 at 12:01

Peter Mortensen · Accepted Answer · 2019-09-28 10:59:43Z

133

In Python 3, the default encoding is "utf-8", so you can directly use:

b'hello'.decode()

which is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2, encoding defaults to the default string encoding. Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.

edited Sep 28, 2019 at 10:59 by Peter Mortensen

answered Jun 29, 2016 at 14:21 by lmiguelvargasf

Great to point out decode() w/ utf-8 is default
– Nam G VU, Commented Dec 31, 2023 at 11:14

Peter Mortensen · Accepted Answer · 2019-09-28 10:58:53Z

If you don't know the encoding, then to read binary input into string in Python 3 and Python 2 compatible way, use the ancient MS-DOS CP437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

Because encoding is unknown, expect non-English symbols to translate to characters of cp437 (English characters are not translated, because they match in most single byte encodings and UTF-8).

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in Codepage Layout - it is where Python chokes with infamous ordinal not in range.

UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding stuff into binary data without data loss and crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.
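
For what it's worth, here is a minimal sketch of that [binary] -> [str] -> [binary] round trip with the surrogateescape handler on Python 3 (the sample bytes are the ones from the traceback above; the variable names are illustrative):

```python
raw = b"\x00\x01\xffsd"

# Undecodable bytes become lone surrogates instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")

print(restored == raw)  # True
```

This says nothing about performance, and the resulting string may fail to print or to encode with the default strict handler.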

UPDATE 20170116: Thanks to comment by Nearoo - there is also a possibility to slash escape all unknown bytes with backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

See Python’s Unicode Support for details.

UPDATE 20170119: I decided to implement slash escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.

# --- preparation

import codecs

def slashescape(err):
    """Codecs error handler. err is a UnicodeDecodeError instance. Return
    a tuple with a replacement for the undecodable part of the input
    and the position where decoding should continue."""
    # err.start:err.end may span more than one byte, so escape each one.
    thebytes = bytearray(err.object[err.start:err.end])
    repl = u''.join(u'\\x%02x' % b for b in thebytes)
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))

This answer is incorrect. The latin-1, i.e. ISO-8859-1 encoding is perfectly capable of handling arbitrary binary data - bytes(range(256)).decode('latin-1') runs without error on modern Python versions, and I can't come up with a reason why it ever would have failed. The entire point of Latin-1 is that it maps each byte to the first 256 code points in Unicode - or rather, the ordering of Unicode was chosen, ever since the first version in 1991, so that the first 256 code points would match Latin-1. You could run into problems printing the string, but that's entirely orthogonal. — Karl Knechtel, Commented Jul 1, 2022 at 7:15
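
The claim in this comment is easy to check. A minimal sketch (variable names are illustrative) showing that Latin-1 decodes every possible byte value and round-trips without loss:

```python
all_bytes = bytes(range(256))

# Each byte maps to the code point with the same number.
text = all_bytes.decode("latin-1")

print([ord(ch) for ch in text] == list(range(256)))  # True
print(text.encode("latin-1") == all_bytes)           # True
```

Whether the resulting characters are the ones the data was meant to represent is a separate question.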

wim · Accepted Answer · 2020-09-04 18:35:32Z

Since this question is actually asking about subprocess output, you have more direct approaches available. The most modern would be using subprocess.check_output and passing text=True (Python 3.7+) to automatically decode stdout using the system default encoding:

text = subprocess.check_output(["ls", "-l"], text=True)

For Python 3.6, Popen accepts an encoding keyword:

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

The general answer to the question in the title, if you're not dealing with subprocess output, is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() will be used. If your data is not sys.getdefaultencoding(), then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'

Peter Mortensen · Accepted Answer · 2019-09-28 10:54:06Z

I think you actually want this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

Aaron's answer was correct, except that you need to know which encoding to use. And I believe that Windows uses 'windows-1252'. It will only matter if you have some unusual (non-ASCII) characters in your content, but then it will make a difference.

By the way, the fact that it does matter is the reason that Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way YOU would know is to read the Windows documentation (or read it here).

Borislav Sabev · Accepted Answer · 2014-01-21 15:47:48Z

38

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]

edited Jan 21, 2014 at 15:47 by Borislav Sabev

answered Jan 21, 2014 at 15:31 by ContextSwitch

On 3.7 you can (and should) do text=True instead of universal_newlines=True.
– user3064538, Commented Jan 13, 2019 at 17:02

jfs · Accepted Answer · 2019-10-04 20:19:44Z

To interpret a byte sequence as a text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup using utf-8 encoding raises UnicodeDecodeError.

It can be worse. The decoding may fail silently and produce mojibake if you use a wrong incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted but your program remains unaware that a failure has occurred.

In general, what character encoding to use is not embedded in the byte sequence itself. You have to communicate this info out-of-band. Some outcomes are more likely than others and therefore chardet module exists that can guess the character encoding. A single Python script may use multiple character encodings in different places.

ls output can be converted to a Python string using os.fsdecode() function that succeeds even for undecodable filenames (it uses sys.getfilesystemencoding() and surrogateescape error handler on Unix):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes, you could use os.fsencode().

If you pass universal_newlines=True parameter then subprocess uses locale.getpreferredencoding(False) to decode bytes e.g., it can be cp1252 on Windows.

To decode the byte stream on-the-fly, io.TextIOWrapper() could be used: example.
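
A minimal sketch of the io.TextIOWrapper approach, assuming UTF-8 input; io.BytesIO stands in here for a real binary stream such as a pipe, and the sample text is made up:

```python
import io

# io.BytesIO stands in for a real binary stream such as a pipe or socket file.
byte_stream = io.BytesIO("first line\nsecond line with µ\n".encode("utf-8"))

# The wrapper decodes incrementally as lines are read.
text_stream = io.TextIOWrapper(byte_stream, encoding="utf-8")
lines = [line.rstrip("\n") for line in text_stream]

print(lines)  # ['first line', 'second line with µ']
```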

Different commands may use different character encodings for their output e.g., dir internal command (cmd) may use cp437. To decode its output, you could pass the encoding explicitly (Python 3.6+):

output = subprocess.check_output('dir', shell=True, encoding='cp437')

The filenames may differ from os.listdir() (which uses Windows Unicode API) e.g., '\xb6' can be substituted with '\x14'—Python's cp437 codec maps b'\x14' to control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string

Felipe Augusto · Accepted Answer · 2019-06-10 16:04:48Z

28

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any more simply way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has default arguments:

codecs.decode(obj, encoding='utf-8', errors='strict')

edited Jun 10, 2019 at 16:04 by Felipe Augusto

answered Nov 13, 2015 at 10:24 by serv-inc

Peter Mortensen · Accepted Answer · 2022-09-19 16:28:19Z

21

If you have had this error:

utf-8 codec can't decode byte 0x8a,

then it is better to use the following code to convert bytes to a string:

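data = b"abcdefg"  # avoid naming a variable "bytes": it shadows the built-in type
text = data.decode("utf-8", "ignore")  # undecodable bytes are silently dropped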
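
To make the trade-off visible, here is a small sketch comparing the ignore and replace error handlers (the sample bytes are an assumption chosen to contain the 0x8a byte from the error message):

```python
data = b"caf\x8a latte"  # 0x8a is not a valid UTF-8 start byte

ignored = data.decode("utf-8", errors="ignore")    # bad byte dropped: 'caf latte'
replaced = data.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

print(ignored)
print("\ufffd" in replaced)  # True
```

With ignore, nothing in the result tells you that data was lost; replace at least leaves a marker.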

edited Sep 19, 2022 at 16:28 by Peter Mortensen

answered Oct 21, 2021 at 6:36 by Yasser M

Felipe Augusto · Accepted Answer · 2019-06-10 16:03:10Z

20

If you get the following when calling decode():

AttributeError: 'str' object has no attribute 'decode'

then the object is already a str and needs no decoding. For a bytes object, you can also specify the encoding directly in the str() call:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'

edited Jun 10, 2019 at 16:03 by Felipe Augusto

answered Nov 22, 2017 at 4:20 by Broper

Supergamer · Accepted Answer · 2022-09-18 09:16:03Z

20

Bytes

m=b'This is bytes'

Converting to string

Method 1

m.decode("utf-8")

or

m.decode()

Method 2

import codecs
codecs.decode(m,encoding="utf-8")

or

import codecs
codecs.decode(m)

Method 3

str(m,encoding="utf-8")

or

str(m)[2:-1]

Result

'This is bytes'
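
One caveat on the str(m)[2:-1] variant: it slices the repr of the bytes object rather than decoding it, so it only matches the other methods for plain printable ASCII. A quick sketch with an assumed sample value:

```python
m = "naïve\n".encode("utf-8")

sliced = str(m)[2:-1]        # slices the repr: escapes stay as literal text
decoded = m.decode("utf-8")  # actually decodes the bytes

print(sliced == decoded)  # False
print(repr(sliced))
print(repr(decoded))
```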

edited Sep 18, 2022 at 9:16

answered Jun 21, 2022 at 13:18 by Supergamer

Peter Mortensen · Accepted Answer · 2022-09-19 16:33:23Z

We can decode the bytes object to produce a string using bytes.decode(encoding='utf-8', errors='strict'). For documentation see bytes.decode.

Python 3 example:

byte_value = b"abcde"
print("Initial value = {}".format(byte_value))
print("Initial value type = {}".format(type(byte_value)))
string_value = byte_value.decode("utf-8")
# utf-8 is used here because it is a very common encoding, but you need to use the encoding your data is actually in.
print("------------")
print("Converted value = {}".format(string_value))
print("Converted value type = {}".format(type(string_value)))

Output:

Initial value = b'abcde'
Initial value type = <class 'bytes'>
------------
Converted value = abcde
Converted value type = <class 'str'>

Note: In Python 3, by default the encoding type is UTF-8. So, <byte_string>.decode("utf-8") can be also written as <byte_string>.decode()

Peter Mortensen · Accepted Answer · 2019-09-28 11:11:58Z

8

For Python 3, this is a much safer and Pythonic approach to convert from byte to string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): # Check if it's in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

edited Sep 28, 2019 at 11:11 by Peter Mortensen

answered Jan 18, 2017 at 7:21 by Taufiq Rahman

bers · Accepted Answer · 2018-03-16 13:28:25Z

When working with data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

On Windows, writing in text mode translates each \n to \r\n, so all your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will replicate your original file.
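
The same normalization can be tried without any files. A minimal sketch, using an in-memory sample instead of Input.txt:

```python
raw = b"line one\r\nline two\r\n"  # bytes as they might arrive from Windows

text = raw.decode("utf-8").replace("\r\n", "\n")
print(repr(text))  # 'line one\nline two\n'

# If only the individual lines are needed, splitlines() handles \r\n as well.
lines = raw.decode("utf-8").splitlines()
print(lines)  # ['line one', 'line two']
```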

score 5 · Accepted Answer · 2019-08-07 19:51:33Z

For your specific case of "run a shell command and get its output as text instead of bytes", on Python 3.7+, you should use subprocess.run and pass in text=True (as well as capture_output=True to capture the output):

command_result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
command_result.stdout  # is a `str` containing your program's stdout

text used to be called universal_newlines, and was changed (well, aliased) in Python 3.7. If you want to support Python versions before 3.7, pass in universal_newlines=True instead of text=True
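
A self-contained variant of the same idea, as a sketch: it launches a child Python interpreter instead of ls so that it runs on any platform (the printed message is made up for the example):

```python
import subprocess
import sys

# Run a child Python process instead of `ls` so the sketch is portable.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from child')"],
    capture_output=True,
    text=True,
)

print(type(result.stdout).__name__)  # str
print(result.stdout.strip())         # hello from child
```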

Peter Mortensen · Accepted Answer · 2019-09-28 10:54:56Z

3

From sys — System-specific parameters and functions:

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

edited Sep 28, 2019 at 10:54 by Peter Mortensen

answered Jan 11, 2014 at 7:15 by Zhichang Yu

The pipe to the subprocess is already a binary buffer. Your answer fails to address how to get a string value from the resulting bytes value.
– Martijn Pieters ♦, Commented Sep 1, 2014 at 17:34

Peter Mortensen · Accepted Answer · 2022-09-19 16:32:36Z

3

Try this:

bytes.fromhex('c3a9').decode('utf-8')

edited Sep 19, 2022 at 16:32 by Peter Mortensen

answered Jan 19, 2020 at 8:19 by Victor Choy

Suyog Shimpi · Accepted Answer · 2023-06-18 10:57:28Z

2

One of the best ways to convert to string without caring about any encoding type is as follows -

import json


b_string = b'test string'
string = b_string.decode(
    json.detect_encoding(b_string)  # detect_encoding - used to detect encoding
)
print(string)

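Here, we used json.detect_encoding to guess the encoding. Note that it only distinguishes among UTF-8, UTF-16, and UTF-32 variants, so it will not recognize legacy encodings such as cp1252.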
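
A short sketch of where this approach stops working, assuming input that is really cp1252 (the sample text is made up). json.detect_encoding only considers UTF-8, UTF-16, and UTF-32 variants, so it falls back to utf-8 and the decode fails:

```python
import json

legacy = "café".encode("cp1252")       # b'caf\xe9'
guess = json.detect_encoding(legacy)   # only UTF-8/16/32 variants are considered

try:
    legacy.decode(guess)
    failed = False
except UnicodeDecodeError:
    failed = True

print(guess, failed)  # utf-8 True
```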

answered Jun 18, 2023 at 10:57 by Suyog Shimpi

Leonardo Filipe · Accepted Answer · 2018-06-03 22:44:45Z

1

def toString(string):
    try:
        return string.decode("utf-8")
    except AttributeError:  # already a str, which has no .decode()
        return string

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))

answered Jun 3, 2018 at 22:44 by Leonardo Filipe

While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. Remember that you are answering the question for readers in the future, not just the person asking now! Please edit your answer to add an explanation, and give an indication of what limitations and assumptions apply. It also doesn't hurt to mention why this answer is more appropriate than others.
– Dev-iL, Commented Jun 4, 2018 at 5:37

Hi @Dev-iL, if you are a moderator, can you tell me if it's possible for moderators to delete pointless empty incoherent answers like this one: stackoverflow.com/a/68310461/134044
– NeilG, Commented Jan 11, 2023 at 0:00

@NeilG I'm not a moderator (note that I have no diamond next to my nickname). If you think a post is low quality, you should report it, and if the community agrees with you, it will be deleted.
– Dev-iL, Commented Jan 11, 2023 at 8:47

Peter Mortensen · Accepted Answer · 2019-09-28 11:14:40Z

1

If you want to convert any bytes, not just string converted to bytes:

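import base64
import json

with open("bytesfile", "rb") as infile:
    # b85encode returns bytes; decode to get an ASCII-only str
    text = base64.b85encode(infile.read()).decode("ascii")

with open("bytesfile", "rb") as infile:
    text2 = json.dumps(list(infile.read()))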

This is not very efficient, however. It will turn a 2 MB picture into 9 MB.
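
A round-trip sketch of the base85 variant, using an in-memory payload instead of a file (the payload and variable names are assumptions for the example). It also shows the size overhead: 5 characters for every 4 bytes.

```python
import base64

payload = bytes(range(256))  # arbitrary binary data, not valid text in general

encoded = base64.b85encode(payload).decode("ascii")  # b85encode returns bytes
restored = base64.b85decode(encoded)

print(len(payload), len(encoded))  # 256 320
print(restored == payload)         # True
```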

edited Sep 28, 2019 at 11:14 by Peter Mortensen

answered Jun 1, 2019 at 2:30 by HCLivess

Peter Mortensen · Accepted Answer · 2022-09-19 16:31:12Z

Try using this one; this function ignores every byte sequence that is not valid in the given encoding (UTF-8 by default) and returns a clean string. It is tested for Python 3.6 and above.

def bin2str(text, encoding = 'utf-8'):
    """Converts binary data to a Unicode string, dropping bytes that cannot be decoded
    text: binary string to work on
    encoding: encoding of the input bytes (default: utf-8)"""

    return text.decode(encoding, 'ignore')

Here, the function takes the binary data and decodes it (converting the bytes to characters using the given encoding); the ignore argument drops all bytes that are invalid in that encoding, and the function finally returns your desired string value.

If you are not sure about the encoding, use sys.getdefaultencoding() to get the default encoding of your device.
