12 New, Improved, and Deprecated Modules
As usual, Python's standard library received a number of enhancements and bug fixes. Here's a partial list of the most notable changes, sorted alphabetically by module name. Consult the Misc/NEWS file in the source tree for a more complete list of changes, or look through the CVS logs for all the details.
- The asyncore module's loop() function now
has a count parameter that lets you perform a limited number
of passes through the polling loop. The default is still to loop
forever.
- The base64 module now has more complete RFC 3548 support
for Base64, Base32, and Base16 encoding and decoding, including
optional case folding and optional alternative alphabets.
(Contributed by Barry Warsaw.)
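A brief sketch of the three encodings, with an alternative alphabet and case folding. It assumes a current Python 3 interpreter, where these functions take and return bytes; the sample payloads are made up for illustration.

```python
import base64

payload = b'\xfb\xff'   # arbitrary sample bytes, chosen for illustration

# Standard Base64, then the same data with '-' and '_' substituted
# for '+' and '/' through the altchars argument.
standard = base64.b64encode(payload)
url_like = base64.b64encode(payload, altchars=b'-_')

# Base32 and Base16 decoders expect upper-case input by default;
# casefold=True makes them accept lower-case input as well.
b32 = base64.b32encode(b'hello')
from_lower_b32 = base64.b32decode(b32.lower(), casefold=True)
from_lower_b16 = base64.b16decode(b'deadbeef', casefold=True)
```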
- The bisect module now has an underlying C implementation
for improved performance.
(Contributed by Dmitry Vasiliev.)
- The CJKCodecs collection of East Asian codecs, maintained
by Hye-Shik Chang, was integrated into 2.4.
The new encodings are:
- Chinese (PRC): gb2312, gbk, gb18030, big5hkscs, hz
- Chinese (ROC): big5, cp950
- Japanese: cp932, euc-jis-2004, euc-jp, euc-jisx0213, iso-2022-jp, iso-2022-jp-1, iso-2022-jp-2, iso-2022-jp-3, iso-2022-jp-ext, iso-2022-jp-2004, shift-jis, shift-jisx0213, shift-jis-2004
- Korean: cp949, euc-kr, johab, iso-2022-kr
- Some other new encodings were added: HP Roman8,
ISO_8859-11, ISO_8859-16, PTCP-154, and TIS-620.
- The UTF-8 and UTF-16 codecs now cope better with receiving partial input.
Previously the StreamReader class would try to read more data,
making it impossible to resume decoding from the stream. The
read() method will now return as much data as it can and future
calls will resume decoding where previous ones left off.
(Implemented by Walter Dörwald.)
- There is a new collections module for
various specialized collection datatypes.
Currently it contains just one type, deque,
a double-ended queue that supports efficiently adding and removing
elements from either end:
    >>> from collections import deque
    >>> d = deque('ghi')        # make a new deque with three items
    >>> d.append('j')           # add a new entry to the right side
    >>> d.appendleft('f')       # add a new entry to the left side
    >>> d                       # show the representation of the deque
    deque(['f', 'g', 'h', 'i', 'j'])
    >>> d.pop()                 # return and remove the rightmost item
    'j'
    >>> d.popleft()             # return and remove the leftmost item
    'f'
    >>> list(d)                 # list the contents of the deque
    ['g', 'h', 'i']
    >>> 'h' in d                # search the deque
    True
Several modules, such as the Queue and threading modules, now take advantage of collections.deque for improved performance. (Contributed by Raymond Hettinger.)
- The ConfigParser classes have been enhanced slightly.
The read() method now returns a list of the files that
were successfully parsed, and the set() method raises
TypeError if passed a value argument that isn't a
string. (Contributed by John Belmonte and David Goodger.)
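The two changes can be sketched as follows. This assumes a current Python 3 interpreter, where the module is spelled configparser; the file names and the [server] section are hypothetical.

```python
import configparser   # spelled ConfigParser in Python 2.4
import os
import tempfile

# Write one small configuration file; the second path is deliberately missing.
workdir = tempfile.mkdtemp()
present = os.path.join(workdir, 'app.ini')
missing = os.path.join(workdir, 'absent.ini')
with open(present, 'w') as fh:
    fh.write('[server]\nport = 8080\n')

parser = configparser.ConfigParser()
parsed = parser.read([present, missing])   # lists only the files that were read

# set() rejects a non-string value with TypeError.
try:
    parser.set('server', 'port', 8080)
    rejected = False
except TypeError:
    rejected = True
```

Comparing the returned list with the list that was passed in is a simple way to detect a configuration file that was expected but absent.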
- The curses module now supports the ncurses extension
use_default_colors(). On platforms where the terminal
supports transparency, this makes it possible to use a transparent
background. (Contributed by Jörg Lehmann.)
- The difflib module now includes an HtmlDiff class
that creates an HTML table showing a side by side comparison
of two versions of a text. (Contributed by Dan Gass.)
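A minimal sketch of HtmlDiff, assuming a current Python 3 interpreter; the two line lists are invented sample data.

```python
import difflib

old = ['alpha', 'beta', 'gamma']
new = ['alpha', 'BETA', 'gamma', 'delta']

differ = difflib.HtmlDiff()

# make_table() returns only the <table> markup, suitable for embedding;
# make_file() wraps the same comparison in a complete HTML page.
table_html = differ.make_table(old, new, fromdesc='old', todesc='new')
page_html = differ.make_file(old, new)
```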
- The email package was updated to version 3.0,
which dropped various deprecated APIs and removed support for Python
versions earlier than 2.3. The 3.0 version of the package uses a new
incremental parser for MIME messages, available in the
email.FeedParser module. The new parser doesn't require
reading the entire message into memory, and doesn't raise exceptions
if a message is malformed; instead it records any problems in the
defects attribute of the message. (Developed by Anthony
Baxter, Barry Warsaw, Thomas Wouters, and others.)
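Incremental parsing might look like the sketch below. It assumes a current Python 3 interpreter, where the module is spelled email.feedparser and problems are collected in the message's defects list; the message text and the way it is split into chunks are made up.

```python
from email.feedparser import FeedParser   # email.FeedParser in Python 2.4

# Pretend the message arrives in arbitrary pieces, e.g. from a socket.
chunks = ['Subject: Hel', 'lo\nFrom: a@example.com\n', '\nbody ', 'text\n']

parser = FeedParser()
for chunk in chunks:
    parser.feed(chunk)        # chunks need not end on line boundaries
msg = parser.close()          # returns the finished message object

subject = msg['Subject']
body = msg.get_payload()
problems = msg.defects        # list of anything the parser found wrong
```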
- The heapq module has been converted to C. The resulting
tenfold improvement in speed makes the module suitable for handling
high volumes of data. In addition, the module has two new functions
nlargest() and nsmallest() that use heaps to
find the N largest or smallest values in a dataset without the
expense of a full sort. (Contributed by Raymond Hettinger.)
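A short illustration of the two new functions, using made-up data; it should behave the same on a current Python 3 interpreter.

```python
import heapq

scores = [17, 3, 42, 8, 23, 15, 4, 16]   # invented sample values

# Both functions return a new list, ordered from best to worst match.
top_three = heapq.nlargest(3, scores)
bottom_two = heapq.nsmallest(2, scores)
```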
- The httplib module now contains constants for HTTP
status codes defined in various HTTP-related RFC documents. Constants
have names such as OK, CREATED,
CONTINUE, and MOVED_PERMANENTLY; use pydoc to
get a full list. (Contributed by Andrew Eland.)
- The imaplib module now supports IMAP's THREAD command
(contributed by Yves Dionne) and new deleteacl() and
myrights() methods (contributed by Arnaud Mazin).
- The itertools module gained a
groupby(iterable[, func]) function.
iterable is something that can be iterated over to return a
stream of elements, and the optional func parameter is a
function that takes an element and returns a key value; if omitted,
the key is simply the element itself. groupby() then
groups the elements into subsequences which have matching values of
the key, and returns a series of 2-tuples containing the key value
and an iterator over the subsequence.
Here's an example to make this clearer. The key function simply returns whether a number is even or odd, so the result of groupby() is to return consecutive runs of odd or even numbers.
    >>> import itertools
    >>> L = [2, 4, 6, 7, 8, 9, 11, 12, 14]
    >>> for key_val, it in itertools.groupby(L, lambda x: x % 2):
    ...     print key_val, list(it)
    ...
    0 [2, 4, 6]
    1 [7]
    0 [8]
    1 [9, 11]
    0 [12, 14]
    >>>
groupby() is typically used with sorted input. The logic for groupby() is similar to the Unix uniq filter, which makes it handy for eliminating, counting, or identifying duplicate elements:
    >>> word = 'abracadabra'
    >>> letters = sorted(word)   # Turn string into a sorted list of letters
    >>> letters
    ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'd', 'r', 'r']
    >>> for k, g in itertools.groupby(letters):
    ...     print k, list(g)
    ...
    a ['a', 'a', 'a', 'a', 'a']
    b ['b', 'b']
    c ['c']
    d ['d']
    r ['r', 'r']
    >>> # List unique letters
    >>> [k for k, g in itertools.groupby(letters)]
    ['a', 'b', 'c', 'd', 'r']
    >>> # Count letter occurrences
    >>> [(k, len(list(g))) for k, g in itertools.groupby(letters)]
    [('a', 5), ('b', 2), ('c', 1), ('d', 1), ('r', 2)]
(Contributed by Hye-Shik Chang.)
- itertools also gained a function named
tee(iterator, N) that returns N independent
iterators that replicate iterator. If N is omitted, the
default is 2.
    >>> L = [1,2,3]
    >>> i1, i2 = itertools.tee(L)
    >>> i1,i2
    (<itertools.tee object at 0x402c2080>, <itertools.tee object at 0x402c2090>)
    >>> list(i1)               # Run the first iterator to exhaustion
    [1, 2, 3]
    >>> list(i2)               # Run the second iterator to exhaustion
    [1, 2, 3]
Note that tee() has to keep copies of the values returned by the iterator; in the worst case, it may need to keep all of them. This should therefore be used carefully if the leading iterator can run far ahead of the trailing iterator in a long stream of inputs. If the separation is large, then you might as well use list() instead. When the iterators track closely with one another, tee() is ideal. Possible applications include bookmarking, windowing, or lookahead iterators. (Contributed by Raymond Hettinger.)
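As one possible windowing application, the sketch below pairs each element with its successor. The pairwise() helper is hypothetical, not part of the 2.4 module, and the code assumes a current Python 3 interpreter (it uses the next() built-in and a lazy zip()).

```python
import itertools

def pairwise(iterable):
    """Return overlapping pairs: (s0, s1), (s1, s2), ..."""
    trail, lead = itertools.tee(iterable)
    next(lead, None)            # advance the leading iterator by one item
    return zip(trail, lead)

pairs = list(pairwise([1, 2, 3, 4]))
```

The two iterators never drift more than one item apart, so tee() only has to buffer a single value at a time.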
- A number of functions were added to the locale
module, such as bind_textdomain_codeset() to specify a
particular encoding and a family of l*gettext() functions
that return messages in the chosen encoding.
(Contributed by Gustavo Niemeyer.)
- Some keyword arguments were added to the logging
package's basicConfig function to simplify log
configuration. The default behavior is to log messages to standard
error, but various keyword arguments can be specified to log to a
particular file, change the logging format, or set the logging level.
For example:
    import logging
    logging.basicConfig(filename='/var/log/application.log',
                        level=0,  # Log all messages
                        format='%(levelname)s:%(process)d:%(thread)d:%(message)s')
Other additions to the logging package include a log(level, msg) convenience method, as well as a TimedRotatingFileHandler class that rotates its log files at a timed interval. The module already had RotatingFileHandler, which rotated logs once the file exceeded a certain size. Both classes derive from a new BaseRotatingHandler class that can be used to implement other rotating handlers.
(Changes implemented by Vinay Sajip.)
- The marshal module now shares interned strings on unpacking a
data structure. This may shrink the size of certain pickle strings,
but the primary effect is to make .pyc files significantly smaller.
(Contributed by Martin von Loewis.)
- The nntplib module's NNTP class gained
description() and descriptions() methods to retrieve
newsgroup descriptions for a single group or for a range of groups.
(Contributed by Jürgen A. Erhard.)
- Two new functions were added to the operator module,
attrgetter(attr) and itemgetter(index).
Both functions return callables that take a single argument and return
the corresponding attribute or item; these callables make excellent
data extractors when used with map() or
sorted(). For example:
    >>> import operator
    >>> L = [('c', 2), ('d', 1), ('a', 4), ('b', 3)]
    >>> map(operator.itemgetter(0), L)
    ['c', 'd', 'a', 'b']
    >>> map(operator.itemgetter(1), L)
    [2, 1, 4, 3]
    >>> sorted(L, key=operator.itemgetter(1))   # Sort list by second tuple item
    [('d', 1), ('c', 2), ('b', 3), ('a', 4)]
(Contributed by Raymond Hettinger.)
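attrgetter() works the same way for attributes. The sketch below uses a hypothetical Point class and is written so that it also runs on a current Python 3 interpreter.

```python
import operator

class Point(object):
    """Hypothetical record type used only for this illustration."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

points = [Point(3, 1), Point(1, 2), Point(2, 0)]

get_x = operator.attrgetter('x')
xs = [get_x(p) for p in points]                     # extract one attribute
by_y = sorted(points, key=operator.attrgetter('y'))  # sort on another
ys_sorted = [p.y for p in by_y]
```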
- The optparse module was updated in various ways. The
module now passes its messages through gettext.gettext(),
making it possible to internationalize Optik's help and error
messages. Help messages for options can now include the string
'%default', which will be replaced by the option's default value.
(Contributed by Greg Ward.)
- The long-term plan is to deprecate the rfc822 module
in some future Python release in favor of the email package.
To this end, the email.Utils.formatdate() function has been
changed to make it usable as a replacement for
rfc822.formatdate(). You may want to write new e-mail
processing code with this in mind. (Change implemented by Anthony
Baxter.)
- A new urandom(n) function was added to the
os module, returning a string containing n bytes of
random data. This function provides access to platform-specific
sources of randomness such as /dev/urandom on Linux or the
Windows CryptoAPI. (Contributed by Trevor Perrin.)
- Another new function: os.path.lexists(path)
returns true if the file specified by path exists, whether or
not it's a symbolic link. This differs from the existing
os.path.exists(path) function, which returns false if
path is a symlink that points to a destination that doesn't exist.
(Contributed by Beni Cherniavsky.)
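The difference shows up with a dangling symbolic link. This sketch assumes a platform and account where os.symlink() is permitted; the file names are invented.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, 'no-such-file')    # never created
link = os.path.join(workdir, 'dangling-link')
os.symlink(target, link)

follows_link = os.path.exists(link)    # follows the link to the missing target
sees_link = os.path.lexists(link)      # looks at the link itself
```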
- A new getsid() function was added to the
posix module that underlies the os module.
(Contributed by J. Raynor.)
- The poplib module now supports POP over SSL. (Contributed by
Hector Urtubia.)
- The profile module can now profile C extension functions.
(Contributed by Nick Bastin.)
- The random module has a new method called
getrandbits(N) that returns a long integer N
bits in length. The existing randrange() method now uses
getrandbits() where appropriate, making generation of
arbitrarily large random numbers more efficient. (Contributed by
Raymond Hettinger.)
- The regular expression language accepted by the re module
was extended with simple conditional expressions, written as
(?(group)A|B). group is either a
numeric group ID or a group name defined with (?P<group>...)
earlier in the expression. If the specified group matched, the
regular expression pattern A will be tested against the string; if
the group didn't match, the pattern B will be used instead.
(Contributed by Gustavo Niemeyer.)
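As an illustration, the hypothetical pattern below accepts a number that is either bare or fully parenthesized: the closing parenthesis is required only if group 1 matched an opening one. The B alternative is omitted here, which means nothing extra is required when the group did not match.

```python
import re

# Group 1 optionally matches '('; (?(1)\)) then demands ')' only
# when group 1 took part in the match.
paren_number = re.compile(r'(\()?\d+(?(1)\))$')

def accepts(text):
    """Return True if text is '123' or '(123)', but not a mix."""
    return paren_number.match(text) is not None
```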
- The re module is also no longer recursive, thanks to a
massive amount of work by Gustavo Niemeyer. In a recursive regular
expression engine, certain patterns result in a large amount of C
stack space being consumed, and it was possible to overflow the stack.
For example, if you matched a 30000-byte string of "a" characters
against the expression (a|b)+, one stack frame was consumed
per character. Python 2.3 tried to check for stack overflow and raise
a RuntimeError exception, but certain patterns could
sidestep the checking and if you were unlucky Python could segfault.
Python 2.4's regular expression engine can match this pattern without
problems.
- A new socketpair() function, returning a pair of
connected sockets, was added to the socket module.
(Contributed by Dave Cole.)
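A small sketch of a round trip over the pair, assuming a current Python 3 interpreter (where socket data is bytes) on a platform that supports socketpair():

```python
import socket

left, right = socket.socketpair()
try:
    left.sendall(b'ping')
    received = right.recv(4)      # data written to one end appears on the other
    right.sendall(b'pong')
    reply = left.recv(4)
finally:
    left.close()
    right.close()
```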
- The sys.exitfunc() function has been deprecated. Code
should be using the existing atexit module, which correctly
handles calling multiple exit functions. Eventually
sys.exitfunc() will become a purely internal interface,
accessed only by atexit.
- The tarfile module now generates GNU-format tar files
by default. (Contributed by Lars Gustaebel.)
- The threading module now has an elegantly simple way to support
thread-local data. The module contains a local class whose
attribute values are local to different threads.
    import threading

    data = threading.local()
    data.number = 42
    data.url = ('www.python.org', 80)
Other threads can assign and retrieve their own values for the number and url attributes. You can subclass local to initialize attributes or to add methods. (Contributed by Jim Fulton.)
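The isolation between threads, and the subclassing technique, can be sketched like this. The worker() function and the Defaults class are hypothetical names introduced for the example.

```python
import threading

data = threading.local()
data.number = 42                  # set in the main thread only

observed = {}

def worker():
    # A new thread starts with an empty namespace on the same object.
    observed['saw_main_value'] = hasattr(data, 'number')
    data.number = 7
    observed['worker_value'] = data.number

t = threading.Thread(target=worker)
t.start()
t.join()

class Defaults(threading.local):
    """__init__ runs once in each thread that touches the object."""
    def __init__(self):
        self.number = 0

defaults = Defaults()
```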
- The timeit module now automatically disables periodic
garbage collection during the timing loop. This change makes
consecutive timings more comparable. (Contributed by Raymond Hettinger.)
- The weakref module now supports a wider variety of objects
including Python functions, class instances, sets, frozensets, deques,
arrays, files, sockets, and regular expression pattern objects.
(Contributed by Raymond Hettinger.)
- The xmlrpclib module now supports a multi-call extension for
transmitting multiple XML-RPC calls in a single HTTP operation.
(Contributed by Brian Quinlan.)
- The mpz, rotor, and xreadlines modules have
been removed.
12.1 cookielib
The cookielib library supports client-side handling for HTTP cookies, mirroring the Cookie module's server-side cookie support. Cookies are stored in cookie jars; the library transparently stores cookies offered by the web server in the cookie jar, and fetches the cookie from the jar when connecting to the server. As in web browsers, policy objects control whether cookies are accepted or not.
In order to store cookies across sessions, two implementations of cookie jars are provided: one that stores cookies in the Netscape format so applications can use the Mozilla or Lynx cookie files, and one that stores cookies in the same format as the Perl libwww library.
urllib2 has been changed to interact with cookielib: HTTPCookieProcessor manages a cookie jar that is used when accessing URLs.
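Wiring a cookie jar into an opener might look like the sketch below. It assumes a current Python 3 interpreter, where cookielib and urllib2 are spelled http.cookiejar and urllib.request; no network request is made.

```python
import http.cookiejar      # cookielib in Python 2.4
import urllib.request      # urllib2 in Python 2.4

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

# Calling opener.open(url) would store any cookies the server sets in
# the jar and send them back on later requests made through the opener.
stored = len(jar)          # nothing stored yet
```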
This module was contributed by John J. Lee.
12.2 doctest
The doctest module underwent considerable refactoring thanks to Edward Loper and Tim Peters. Testing can still be as simple as running doctest.testmod(), but the refactorings allow customizing the module's operation in various ways.
The new DocTestFinder class extracts the tests from a given object's docstrings:
    import doctest

    def f (x, y):
        """>>> f(2,2)
        4
        >>> f(3,2)
        6
        """
        return x*y

    finder = doctest.DocTestFinder()

    # Get list of DocTest instances
    tests = finder.find(f)
The new DocTestRunner class then runs individual tests and can produce a summary of the results:
    runner = doctest.DocTestRunner()
    for t in tests:
        # run() returns a (failures, tries) pair
        failed, tried = runner.run(t)

    runner.summarize(verbose=1)
The above example produces the following output:
    1 items passed all tests:
       2 tests in f
    2 tests in 1 items.
    2 passed and 0 failed.
    Test passed.
DocTestRunner uses an instance of the OutputChecker class to compare the expected output with the actual output. This class takes a number of different flags that customize its behaviour; ambitious users can also write a completely new subclass of OutputChecker.
The default output checker provides a number of handy features. For example, with the doctest.ELLIPSIS option flag, an ellipsis ("...") in the expected output matches any substring, making it easier to accommodate outputs that vary in minor ways:
    def o (n):
        """>>> o(1)
        <__main__.C instance at 0x...>
        >>>
        """
Another special string, "<BLANKLINE>", matches a blank line:
    def p (n):
        """>>> p(1)
        <BLANKLINE>
        >>>
        """
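Both features can also be tried by calling the output checker directly. This is a sketch only; the want and got strings are made-up sample outputs.

```python
import doctest

checker = doctest.OutputChecker()

want = '<__main__.C instance at 0x...>\n'        # expected output
got = '<__main__.C instance at 0x402c2080>\n'    # sample actual output

# Without ELLIPSIS the strings must match exactly; with it, "..."
# in the expected output matches any substring.
plain_match = checker.check_output(want, got, 0)
ellipsis_match = checker.check_output(want, got, doctest.ELLIPSIS)

# <BLANKLINE> in the expected output matches an empty line.
blank_match = checker.check_output('<BLANKLINE>\n', '\n', 0)
```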
Another new capability is producing a diff-style display of the output by specifying the doctest.REPORT_UDIFF (unified diffs), doctest.REPORT_CDIFF (context diffs), or doctest.REPORT_NDIFF (delta-style) option flags. For example:
    def g (n):
        """>>> g(4)
        here
        is
        a
        lengthy
        >>>"""
        L = 'here is a rather lengthy list of words'.split()
        for word in L[:n]:
            print word
Running the above function's tests with doctest.REPORT_UDIFF specified, you get the following output:
    **********************************************************************
    File "t.py", line 15, in g
    Failed example:
        g(4)
    Differences (unified diff with -expected +actual):
        @@ -2,3 +2,3 @@
         is
         a
        -lengthy
        +rather
    **********************************************************************
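One way to request the unified diff is to pass the flag to DocTestRunner, as in this sketch. It assumes a current Python 3 interpreter, so the example function uses print() as a function, and it collects the report in a list instead of writing it to standard output. Passing globs explicitly is a choice made for this example so that g is visible to the test.

```python
import doctest

def g(n):
    """
    >>> g(4)
    here
    is
    a
    lengthy
    """
    words = 'here is a rather lengthy list of words'.split()
    for word in words[:n]:
        print(word)

finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False,
                               optionflags=doctest.REPORT_UDIFF)

report = []                       # failure text is appended here
for test in finder.find(g, globs={'g': g}):
    results = runner.run(test, out=report.append)
```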