Mercurial > genshi > genshi-test
view doc/streams.txt @ 680:052fe2b1bfc8 experimental-optimizer
Fixed #178.
author | athomas |
---|---|
date | Thu, 24 Jan 2008 00:42:04 +0000 |
parents | ca7d707d51b0 |
children | 1447d40df660 67d324a62cc0 |
line wrap: on
line source
.. -*- mode: rst; encoding: utf-8 -*- ============== Markup Streams ============== A stream is the common representation of markup as a *stream of events*. .. contents:: Contents :depth: 1 .. sectnum:: Basics ====== A stream can be attained in a number of ways. It can be: * the result of parsing XML or HTML text, or * the result of selecting a subset of another stream using XPath, or * programmatically generated. For example, the functions ``XML()`` and ``HTML()`` can be used to convert literal XML or HTML text to a markup stream: .. code-block:: pycon >>> from genshi import XML >>> stream = XML('<p class="intro">Some text and ' ... '<a href="http://example.org/">a link</a>.' ... '<br/></p>') >>> stream <genshi.core.Stream object at ...> The stream is the result of parsing the text into events. Each event is a tuple of the form ``(kind, data, pos)``, where: * ``kind`` defines what kind of event it is (such as the start of an element, text, a comment, etc). * ``data`` is the actual data associated with the event. How this looks depends on the event kind (see `event kinds`_) * ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the event “comes from”. .. code-block:: pycon >>> for kind, data, pos in stream: ... print kind, `data`, pos ... START (QName(u'p'), Attrs([(QName(u'class'), u'intro')])) (None, 1, 0) TEXT u'Some text and ' (None, 1, 17) START (QName(u'a'), Attrs([(QName(u'href'), u'http://example.org/')])) (None, 1, 31) TEXT u'a link' (None, 1, 61) END QName(u'a') (None, 1, 67) TEXT u'.' (None, 1, 71) START (QName(u'br'), Attrs()) (None, 1, 72) END QName(u'br') (None, 1, 77) END QName(u'p') (None, 1, 77) Filtering ========= One important feature of markup streams is that you can apply *filters* to the stream, either filters that come with Genshi, or your own custom filters. A filter is simply a callable that accepts the stream as parameter, and returns the filtered stream: .. code-block:: python def noop(stream): """A filter that doesn't actually do anything with the stream.""" for kind, data, pos in stream: yield kind, data, pos Filters can be applied in a number of ways. The simplest is to just call the filter directly: .. code-block:: python stream = noop(stream) The ``Stream`` class also provides a ``filter()`` method, which takes an arbitrary number of filter callables and applies them all: .. code-block:: python stream = stream.filter(noop) Finally, filters can also be applied using the *bitwise or* operator (``|``), which allows a syntax similar to pipes on Unix shells: .. code-block:: python stream = stream | noop One example of a filter included with Genshi is the ``HTMLSanitizer`` in ``genshi.filters``. It processes a stream of HTML markup, and strips out any potentially dangerous constructs, such as Javascript event handlers. ``HTMLSanitizer`` is not a function, but rather a class that implements ``__call__``, which means instances of the class are callable: .. code-block:: python stream = stream | HTMLSanitizer() Both the ``filter()`` method and the pipe operator allow easy chaining of filters: .. code-block:: python from genshi.filters import HTMLSanitizer stream = stream.filter(noop, HTMLSanitizer()) That is equivalent to: .. code-block:: python stream = stream | noop | HTMLSanitizer() For more information about the built-in filters, see `Stream Filters`_. .. _`Stream Filters`: filters.html Serialization ============= Serialization means producing some kind of textual output from a stream of events, which you'll need when you want to transmit or store the results of generating or otherwise processing markup. The ``Stream`` class provides two methods for serialization: ``serialize()`` and ``render()``. The former is a generator that yields chunks of ``Markup`` objects (which are basically unicode strings that are considered safe for output on the web). The latter returns a single string, by default UTF-8 encoded. Here's the output from ``serialize()``: .. code-block:: pycon >>> for output in stream.serialize(): ... print `output` ... <Markup u'<p class="intro">'> <Markup u'Some text and '> <Markup u'<a href="http://example.org/">'> <Markup u'a link'> <Markup u'</a>'> <Markup u'.'> <Markup u'<br/>'> <Markup u'</p>'> And here's the output from ``render()``: .. code-block:: pycon >>> print stream.render() <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p> Both methods can be passed a ``method`` parameter that determines how exactly the events are serialzed to text. This parameter can be either “xml” (the default), “xhtml”, “html”, “text”, or a custom serializer class: .. code-block:: pycon >>> print stream.render('html') <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p> Note how the `<br>` element isn't closed, which is the right thing to do for HTML. In addition, the ``render()`` method takes an ``encoding`` parameter, which defaults to “UTF-8”. If set to ``None``, the result will be a unicode string. The different serializer classes in ``genshi.output`` can also be used directly: .. code-block:: pycon >>> from genshi.filters import HTMLSanitizer >>> from genshi.output import TextSerializer >>> print ''.join(TextSerializer()(HTMLSanitizer()(stream))) Some text and a link. The pipe operator allows a nicer syntax: .. code-block:: pycon >>> print stream | HTMLSanitizer() | TextSerializer() Some text and a link. Serialization Options --------------------- Both ``serialize()`` and ``render()`` support additional keyword arguments that are passed through to the initializer of the serializer class. The following options are supported by the built-in serializers: ``strip_whitespace`` Whether the serializer should remove trailing spaces and empty lines. Defaults to ``True``. (This option is not available for serialization to plain text.) ``doctype`` A ``(name, pubid, sysid)`` tuple defining the name, publid identifier, and system identifier of a ``DOCTYPE`` declaration to prepend to the generated output. If provided, this declaration will override any ``DOCTYPE`` declaration in the stream. (This option is not available for serialization to plain text.) ``namespace_prefixes`` The namespace prefixes to use for namespace that are not bound to a prefix in the stream itself. (This option is not available for serialization to HTML or plain text.) Using XPath =========== XPath can be used to extract a specific subset of the stream via the ``select()`` method: .. code-block:: pycon >>> substream = stream.select('a') >>> substream <genshi.core.Stream object at ...> >>> print substream <a href="http://example.org/">a link</a> Often, streams cannot be reused: in the above example, the sub-stream is based on a generator. Once it has been serialized, it will have been fully consumed, and cannot be rendered again. To work around this, you can wrap such a stream in a ``list``: .. code-block:: pycon >>> from genshi import Stream >>> substream = Stream(list(stream.select('a'))) >>> substream <genshi.core.Stream object at ...> >>> print substream <a href="http://example.org/">a link</a> >>> print substream.select('@href') http://example.org/ >>> print substream.select('text()') a link See `Using XPath in Genshi`_ for more information about the XPath support in Genshi. .. _`Using XPath in Genshi`: xpath.html .. _`event kinds`: Event Kinds =========== Every event in a stream is of one of several *kinds*, which also determines what the ``data`` item of the event tuple looks like. The different kinds of events are documented below. .. note:: The ``data`` item is generally immutable. If the data is to be modified when processing a stream, it must be replaced by a new tuple. Effectively, this means the entire event tuple is immutable. START ----- The opening tag of an element. For this kind of event, the ``data`` item is a tuple of the form ``(tagname, attrs)``, where ``tagname`` is a ``QName`` instance describing the qualified name of the tag, and ``attrs`` is an ``Attrs`` instance containing the attribute names and values associated with the tag (excluding namespace declarations): .. code-block:: python START, (QName(u'p'), Attrs([(u'class', u'intro')])), pos END --- The closing tag of an element. The ``data`` item of end events consists of just a ``QName`` instance describing the qualified name of the tag: .. code-block:: python END, QName(u'p'), pos TEXT ---- Character data outside of elements and comments. For text events, the ``data`` item should be a unicode object: .. code-block:: python TEXT, u'Hello, world!', pos START_NS -------- The start of a namespace mapping, binding a namespace prefix to a URI. The ``data`` item of this kind of event is a tuple of the form ``(prefix, uri)``, where ``prefix`` is the namespace prefix and ``uri`` is the full URI to which the prefix is bound. Both should be unicode objects. If the namespace is not bound to any prefix, the ``prefix`` item is an empty string: .. code-block:: python START_NS, (u'svg', u'http://www.w3.org/2000/svg'), pos END_NS ------ The end of a namespace mapping. The ``data`` item of such events consists of only the namespace prefix (a unicode object): .. code-block:: python END_NS, u'svg', pos DOCTYPE ------- A document type declaration. For this type of event, the ``data`` item is a tuple of the form ``(name, pubid, sysid)``, where ``name`` is the name of the root element, ``pubid`` is the public identifier of the DTD (or ``None``), and ``sysid`` is the system identifier of the DTD (or ``None``): .. code-block:: python DOCTYPE, (u'html', u'-//W3C//DTD XHTML 1.0 Transitional//EN', \ u'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'), pos COMMENT ------- A comment. For such events, the ``data`` item is a unicode object containing all character data between the comment delimiters: .. code-block:: python COMMENT, u'Commented out', pos PI -- A processing instruction. The ``data`` item is a tuple of the form ``(target, data)`` for processing instructions, where ``target`` is the target of the PI (used to identify the application by which the instruction should be processed), and ``data`` is text following the target (excluding the terminating question mark): .. code-block:: python PI, (u'php', u'echo "Yo" '), pos START_CDATA ----------- Marks the beginning of a ``CDATA`` section. The ``data`` item for such events is always ``None``: .. code-block:: python START_CDATA, None, pos END_CDATA --------- Marks the end of a ``CDATA`` section. The ``data`` item for such events is always ``None``: .. code-block:: python END_CDATA, None, pos