Wednesday, 25 April 2012

Is there a scanf() or sscanf() equivalent? | Python

Not as such.
For simple input parsing, the easiest approach is usually to split the line into whitespace-delimited words using the split() method of string objects and then convert decimal strings to numeric values using int() or float(). split() supports an optional "sep" parameter which is useful if the line uses something other than whitespace as a separator.
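As a minimal sketch of the split-and-convert approach, with a regular-expression variant alongside for comparison: the function names, the three-field record layout and the pattern below are hypothetical, chosen only for illustration.

```python
import re

def parse_fields(line, sep=None):
    """Rough stand-in for sscanf(line, "%d %f %s", ...)."""
    # sep=None splits on runs of whitespace; pass e.g. "," for other separators.
    # Unpacking raises ValueError if the line does not have exactly three fields.
    count, price, name = line.split(sep)
    return int(count), float(price), name.strip()

# Hypothetical pattern for free-form text such as "12 items at $3.75".
RECORD = re.compile(r"(\d+)\s+items? at \$(\d+\.\d+)")

def parse_record(text):
    """Return (count, price) if the pattern is found, else None."""
    match = RECORD.search(text)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

print(parse_fields("3 4.5 apples"))
print(parse_fields("3, 4.5, apples", ","))
print(parse_record("order: 12 items at $3.75 each"))
```

Unlike sscanf(), both helpers signal bad input explicitly: int() and float() raise ValueError on malformed numbers, and parse_record() returns None when nothing matches.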
For more complicated input parsing, regular expressions are more powerful than C's sscanf() and better suited for the task.

1.3.9 What does 'UnicodeError: ASCII [decoding,encoding] error: ordinal not in range(128)' mean?

This error indicates that your Python installation uses the 7-bit ASCII codec for implicit string conversions and the data contains a character outside that range. There are a couple of ways to fix or work around the problem.
If your programs must handle data in arbitrary character set encodings, the environment the application runs in will generally identify the encoding of the data it is handing you. You need to convert the input to Unicode data using that encoding. For example, a program that handles email or web input will typically find character set encoding information in Content-Type headers. This can then be used to properly convert input data to Unicode. Assuming the string referred to by value is encoded as UTF-8:
value = unicode(value, "utf-8")
will return a Unicode object. If the data is not correctly encoded as UTF-8, the above call will raise a UnicodeError exception.
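The Content-Type case mentioned above might look roughly like the sketch below. It assumes the headers arrive as text and the body as a byte string. decode_payload and its fallback parameter are hypothetical names. The decode() method is used in place of the unicode() constructor so the sketch also runs on newer Python versions.

```python
import email

def decode_payload(raw_headers, payload, fallback="ascii"):
    """Decode payload bytes using the charset named in a Content-Type header."""
    message = email.message_from_string(raw_headers)
    # get_content_charset() returns None when no charset parameter is present.
    charset = message.get_content_charset() or fallback
    # Raises UnicodeError if payload is not valid in that charset,
    # and LookupError if the charset name is unknown to Python.
    return payload.decode(charset)

text = decode_payload("Content-Type: text/plain; charset=utf-8\n\n",
                      b"caf\xc3\xa9")
print(repr(text))
```

Falling back to ASCII is a deliberately strict choice: undeclared non-ASCII data fails loudly instead of being silently misdecoded.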
If you only want to convert strings that contain non-ASCII data to Unicode, you can first try converting them assuming an ASCII encoding, and then generate Unicode objects if that fails:
try:
    x = unicode(value, "ascii")
except UnicodeError:
    # value contains non-ASCII bytes; decode it as UTF-8 instead
    value = unicode(value, "utf-8")
else:
    # value was valid ASCII data
    pass
It's possible to set a default encoding in a file called sitecustomize.py that's part of the Python library. However, this isn't recommended because changing the Python-wide default encoding may cause third-party extension modules to fail.
Note that on Windows, there is an encoding known as "mbcs", which uses an encoding specific to your current locale. In many cases, and particularly when working with COM, this may be an appropriate default encoding to use.