12 New, Improved, and Deprecated Modules

Python 2.2


12 New, Improved, and Deprecated Modules

As usual, Python's standard library received a number of enhancements and bug fixes. Here's a partial list of the most notable changes, sorted alphabetically by module name. Consult the Misc/NEWS file in the source tree for a more complete list of changes, or look through the CVS logs for all the details.

  • The asyncore module's loop() function now has a count parameter that lets you perform a limited number of passes through the polling loop. The default is still to loop forever.

  • The base64 module now has more complete RFC 3548 support for Base64, Base32, and Base16 encoding and decoding, including optional case folding and optional alternative alphabets. (Contributed by Barry Warsaw.)

  • The bisect module now has an underlying C implementation for improved performance. (Contributed by Dmitry Vasiliev.)

  • The CJKCodecs collections of East Asian codecs, maintained by Hye-Shik Chang, was integrated into 2.4. The new encodings are:

    • Chinese (PRC): gb2312, gbk, gb18030, big5hkscs, hz
    • Chinese (ROC): big5, cp950
    • Japanese: cp932, euc-jis-2004, euc-jp, euc-jisx0213, iso-2022-jp, iso-2022-jp-1, iso-2022-jp-2, iso-2022-jp-3, iso-2022-jp-ext, iso-2022-jp-2004, shift-jis, shift-jisx0213, shift-jis-2004
    • Korean: cp949, euc-kr, johab, iso-2022-kr

  • Some other new encodings were added: HP Roman8, ISO_8859-11, ISO_8859-16, PCTP-154, and TIS-620.

  • The UTF-8 and UTF-16 codecs now cope better with receiving partial input. Previously the StreamReader class would try to read more data, making it impossible to resume decoding from the stream. The read() method will now return as much data as it can and future calls will resume decoding where previous ones left off. (Implemented by Walter Dörwald.)

  • There is a new collections module for various specialized collection datatypes. Currently it contains just one type, deque, a double-ended queue that supports efficiently adding and removing elements from either end:

    >>> from collections import deque
    >>> d = deque('ghi')        # make a new deque with three items
    >>> d.append('j')           # add a new entry to the right side
    >>> d.appendleft('f')       # add a new entry to the left side
    >>> d                       # show the representation of the deque
    deque(['f', 'g', 'h', 'i', 'j'])
    >>> d.pop()                 # return and remove the rightmost item
    'j'
    >>> d.popleft()             # return and remove the leftmost item
    'f'
    >>> list(d)                 # list the contents of the deque
    ['g', 'h', 'i']
    >>> 'h' in d                # search the deque
    True
    

    Several modules, such as the Queue and threading modules, now take advantage of collections.deque for improved performance. (Contributed by Raymond Hettinger.)

  • The ConfigParser classes have been enhanced slightly. The read() method now returns a list of the files that were successfully parsed, and the set() method raises TypeError if passed a value argument that isn't a string. (Contributed by John Belmonte and David Goodger.)

  • The curses module now supports the ncurses extension use_default_colors(). On platforms where the terminal supports transparency, this makes it possible to use a transparent background. (Contributed by Jörg Lehmann.)

  • The difflib module now includes an HtmlDiff class that creates an HTML table showing a side by side comparison of two versions of a text. (Contributed by Dan Gass.)

  • The email package was updated to version 3.0, which dropped various deprecated APIs and removes support for Python versions earlier than 2.3. The 3.0 version of the package uses a new incremental parser for MIME messages, available in the email.FeedParser module. The new parser doesn't require reading the entire message into memory, and doesn't throw exceptions if a message is malformed; instead it records any problems in the defect attribute of the message. (Developed by Anthony Baxter, Barry Warsaw, Thomas Wouters, and others.)

  • The heapq module has been converted to C. The resulting tenfold improvement in speed makes the module suitable for handling high volumes of data. In addition, the module has two new functions nlargest() and nsmallest() that use heaps to find the N largest or smallest values in a dataset without the expense of a full sort. (Contributed by Raymond Hettinger.)

  • The httplib module now contains constants for HTTP status codes defined in various HTTP-related RFC documents. Constants have names such as OK, CREATED, CONTINUE, and MOVED_PERMANENTLY; use pydoc to get a full list. (Contributed by Andrew Eland.)

  • The imaplib module now supports IMAP's THREAD command (contributed by Yves Dionne) and new deleteacl() and myrights() methods (contributed by Arnaud Mazin).

  • The itertools module gained a groupby(iterable[, func]) function. iterable is something that can be iterated over to return a stream of elements, and the optional func parameter is a function that takes an element and returns a key value; if omitted, the key is simply the element itself. groupby() then groups the elements into subsequences which have matching values of the key, and returns a series of 2-tuples containing the key value and an iterator over the subsequence.

    Here's an example to make this clearer. The key function simply returns whether a number is even or odd, so the result of groupby() is to return consecutive runs of odd or even numbers.

    >>> import itertools
    >>> L = [2, 4, 6, 7, 8, 9, 11, 12, 14]
    >>> for key_val, it in itertools.groupby(L, lambda x: x % 2):
    ...    print key_val, list(it)
    ... 
    0 [2, 4, 6]
    1 [7]
    0 [8]
    1 [9, 11]
    0 [12, 14]
    >>>
    

    groupby() is typically used with sorted input. The logic for groupby() is similar to the Unix uniq filter which makes it handy for eliminating, counting, or identifying duplicate elements:

    >>> word = 'abracadabra'
    >>> letters = sorted(word)   # Turn string into a sorted list of letters
    >>> letters 
    ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'd', 'r', 'r']
    >>> for k, g in itertools.groupby(letters):
    ...    print k, list(g)
    ... 
    a ['a', 'a', 'a', 'a', 'a']
    b ['b', 'b']
    c ['c']
    d ['d']
    r ['r', 'r']
    >>> # List unique letters
    >>> [k for k, g in groupby(letters)]                     
    ['a', 'b', 'c', 'd', 'r']
    >>> # Count letter occurrences
    >>> [(k, len(list(g))) for k, g in groupby(letters)]     
    [('a', 5), ('b', 2), ('c', 1), ('d', 1), ('r', 2)]
    

    (Contributed by Hye-Shik Chang.)

  • itertools also gained a function named tee(iterator, N) that returns N independent iterators that replicate iterator. If N is omitted, the default is 2.

    >>> L = [1,2,3]
    >>> i1, i2 = itertools.tee(L)
    >>> i1,i2
    (<itertools.tee object at 0x402c2080>, <itertools.tee object at 0x402c2090>)
    >>> list(i1)               # Run the first iterator to exhaustion
    [1, 2, 3]
    >>> list(i2)               # Run the second iterator to exhaustion
    [1, 2, 3]
    >
    

    Note that tee() has to keep copies of the values returned by the iterator; in the worst case, it may need to keep all of them. This should therefore be used carefully if the leading iterator can run far ahead of the trailing iterator in a long stream of inputs. If the separation is large, then you might as well use list() instead. When the iterators track closely with one another, tee() is ideal. Possible applications include bookmarking, windowing, or lookahead iterators. (Contributed by Raymond Hettinger.)

  • A number of functions were added to the locale module, such as bind_textdomain_codeset() to specify a particular encoding and a family of l*gettext() functions that return messages in the chosen encoding. (Contributed by Gustavo Niemeyer.)

  • Some keyword arguments were added to the logging package's basicConfig function to simplify log configuration. The default behavior is to log messages to standard error, but various keyword arguments can be specified to log to a particular file, change the logging format, or set the logging level. For example:

    import logging
    logging.basicConfig(filename='/var/log/application.log',
        level=0,  # Log all messages
        format='%(levelname):%(process):%(thread):%(message)')	            
    

    Other additions to the logging package include a log(level, msg) convenience method, as well as a TimedRotatingFileHandler class that rotates its log files at a timed interval. The module already had RotatingFileHandler, which rotated logs once the file exceeded a certain size. Both classes derive from a new BaseRotatingHandler class that can be used to implement other rotating handlers.

    (Changes implemented by Vinay Sajip.)

  • The marshal module now shares interned strings on unpacking a data structure. This may shrink the size of certain pickle strings, but the primary effect is to make .pyc files significantly smaller. (Contributed by Martin von Loewis.)

  • The nntplib module's NNTP class gained description() and descriptions() methods to retrieve newsgroup descriptions for a single group or for a range of groups. (Contributed by Jürgen A. Erhard.)

  • Two new functions were added to the operator module, attrgetter(attr) and itemgetter(index). Both functions return callables that take a single argument and return the corresponding attribute or item; these callables make excellent data extractors when used with map() or sorted(). For example:

    >>> L = [('c', 2), ('d', 1), ('a', 4), ('b', 3)]
    >>> map(operator.itemgetter(0), L)
    ['c', 'd', 'a', 'b']
    >>> map(operator.itemgetter(1), L)
    [2, 1, 4, 3]
    >>> sorted(L, key=operator.itemgetter(1)) # Sort list by second tuple item
    [('d', 1), ('c', 2), ('b', 3), ('a', 4)]
    

    (Contributed by Raymond Hettinger.)

  • The optparse module was updated in various ways. The module now passes its messages through gettext.gettext(), making it possible to internationalize Optik's help and error messages. Help messages for options can now include the string '%default', which will be replaced by the option's default value. (Contributed by Greg Ward.)

  • The long-term plan is to deprecate the rfc822 module in some future Python release in favor of the email package. To this end, the email.Utils.formatdate() function has been changed to make it usable as a replacement for rfc822.formatdate(). You may want to write new e-mail processing code with this in mind. (Change implemented by Anthony Baxter.)

  • A new urandom(n) function was added to the os module, returning a string containing n bytes of random data. This function provides access to platform-specific sources of randomness such as /dev/urandom on Linux or the Windows CryptoAPI. (Contributed by Trevor Perrin.)

  • Another new function: os.path.lexists(path) returns true if the file specified by path exists, whether or not it's a symbolic link. This differs from the existing os.path.exists(path) function, which returns false if path is a symlink that points to a destination that doesn't exist. (Contributed by Beni Cherniavsky.)

  • A new getsid() function was added to the posix module that underlies the os module. (Contributed by J. Raynor.)

  • The poplib module now supports POP over SSL. (Contributed by Hector Urtubia.)

  • The profile module can now profile C extension functions. (Contributed by Nick Bastin.)

  • The random module has a new method called getrandbits(N) that returns a long integer N bits in length. The existing randrange() method now uses getrandbits() where appropriate, making generation of arbitrarily large random numbers more efficient. (Contributed by Raymond Hettinger.)

  • The regular expression language accepted by the re module was extended with simple conditional expressions, written as (?(group)A|B). group is either a numeric group ID or a group name defined with (?P<group>...) earlier in the expression. If the specified group matched, the regular expression pattern A will be tested against the string; if the group didn't match, the pattern B will be used instead. (Contributed by Gustavo Niemeyer.)

  • The re module is also no longer recursive, thanks to a massive amount of work by Gustavo Niemeyer. In a recursive regular expression engine, certain patterns result in a large amount of C stack space being consumed, and it was possible to overflow the stack. For example, if you matched a 30000-byte string of "a" characters against the expression (a|b)+, one stack frame was consumed per character. Python 2.3 tried to check for stack overflow and raise a RuntimeError exception, but certain patterns could sidestep the checking and if you were unlucky Python could segfault. Python 2.4's regular expression engine can match this pattern without problems.

  • A new socketpair() function, returning a pair of connected sockets, was added to the socket module. (Contributed by Dave Cole.)

  • The sys.exitfunc() function has been deprecated. Code should be using the existing atexit module, which correctly handles calling multiple exit functions. Eventually sys.exitfunc() will become a purely internal interface, accessed only by atexit.

  • The tarfile module now generates GNU-format tar files by default. (Contributed by Lars Gustaebel.)

  • The threading module now has an elegantly simple way to support thread-local data. The module contains a local class whose attribute values are local to different threads.

    import threading
    
    data = threading.local()
    data.number = 42
    data.url = ('www.python.org', 80)
    

    Other threads can assign and retrieve their own values for the number and url attributes. You can subclass local to initialize attributes or to add methods. (Contributed by Jim Fulton.)

  • The timeit module now automatically disables periodic garbarge collection during the timing loop. This change makes consecutive timings more comparable. (Contributed by Raymond Hettinger.)

  • The weakref module now supports a wider variety of objects including Python functions, class instances, sets, frozensets, deques, arrays, files, sockets, and regular expression pattern objects. (Contributed by Raymond Hettinger.)

  • The xmlrpclib module now supports a multi-call extension for transmitting multiple XML-RPC calls in a single HTTP operation. (Contributed by Brian Quinlan.)

  • The mpz, rotor, and xreadlines modules have been removed.

12.1 cookielib

The cookielib library supports client-side handling for HTTP cookies, mirroring the Cookie module's server-side cookie support. Cookies are stored in cookie jars; the library transparently stores cookies offered by the web server in the cookie jar, and fetches the cookie from the jar when connecting to the server. As in web browsers, policy objects control whether cookies are accepted or not.

In order to store cookies across sessions, two implementations of cookie jars are provided: one that stores cookies in the Netscape format so applications can use the Mozilla or Lynx cookie files, and one that stores cookies in the same format as the Perl libwww library.

urllib2 has been changed to interact with cookielib: HTTPCookieProcessor manages a cookie jar that is used when accessing URLs.

This module was contributed by John J. Lee.

12.2 doctest

The doctest module underwent considerable refactoring thanks to Edward Loper and Tim Peters. Testing can still be as simple as running doctest.testmod(), but the refactorings allow customizing the module's operation in various ways

The new DocTestFinder class extracts the tests from a given object's docstrings:

def f (x, y):
    """>>> f(2,2)
4
>>> f(3,2)
6
    """
    return x*y

finder = doctest.DocTestFinder()

# Get list of DocTest instances
tests = finder.find(f)

The new DocTestRunner class then runs individual tests and can produce a summary of the results:

runner = doctest.DocTestRunner()
for t in tests:
    tried, failed = runner.run(t)
    
runner.summarize(verbose=1)

The above example produces the following output:

1 items passed all tests:
   2 tests in f
2 tests in 1 items.
2 passed and 0 failed.
Test passed.

DocTestRunner uses an instance of the OutputChecker class to compare the expected output with the actual output. This class takes a number of different flags that customize its behaviour; ambitious users can also write a completely new subclass of OutputChecker.

The default output checker provides a number of handy features. For example, with the doctest.ELLIPSIS option flag, an ellipsis ("...") in the expected output matches any substring, making it easier to accommodate outputs that vary in minor ways:

def o (n):
    """>>> o(1)
<__main__.C instance at 0x...>
>>>
"""

Another special string, "<BLANKLINE>", matches a blank line:

def p (n):
    """>>> p(1)
<BLANKLINE>
>>>
"""

Another new capability is producing a diff-style display of the output by specifying the doctest.REPORT_UDIFF (unified diffs), doctest.REPORT_CDIFF (context diffs), or doctest.REPORT_NDIFF (delta-style) option flags. For example:

def g (n):
    """>>> g(4)
here
is
a
lengthy
>>>"""
    L = 'here is a rather lengthy list of words'.split()
    for word in L[:n]:
        print word

Running the above function's tests with doctest.REPORT_UDIFF specified, you get the following output:

**********************************************************************
File ``t.py'', line 15, in g
Failed example:
    g(4)
Differences (unified diff with -expected +actual):
    @@ -2,3 +2,3 @@
     is
     a
    -lengthy
    +rather
**********************************************************************

See About this document... for information on suggesting changes.