.. -*- mode: rst; encoding: utf-8 -*-

==============
Markup Streams
==============

A stream is the common representation of markup as a *stream of events*.


.. contents:: Contents
   :depth: 1
.. sectnum::


Basics
======

A stream can be obtained in a number of ways. It can be:

* the result of parsing XML or HTML text, or
* the result of selecting a subset of another stream using XPath, or
* programmatically generated.

For example, the functions ``XML()`` and ``HTML()`` can be used to convert
literal XML or HTML text to a markup stream:

.. code-block:: pycon

  >>> from genshi import XML
  >>> stream = XML('<p class="intro">Some text and '
  ...              '<a href="http://example.org/">a link</a>.'
  ...              '<br/></p>')
  >>> stream
  <genshi.core.Stream object at ...>

The stream is the result of parsing the text into events. Each event is a tuple
of the form ``(kind, data, pos)``, where:

* ``kind`` defines what kind of event it is (such as the start of an element,
  text, a comment, etc.).
* ``data`` is the actual data associated with the event. How this looks depends
  on the event kind (see `event kinds`_).
* ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the
  event “comes from”.

.. code-block:: pycon

  >>> for kind, data, pos in stream:
  ...     print kind, `data`, pos
  ...
  START (QName(u'p'), Attrs([(QName(u'class'), u'intro')])) (None, 1, 0)
  TEXT u'Some text and ' (None, 1, 17)
  START (QName(u'a'), Attrs([(QName(u'href'), u'http://example.org/')])) (None, 1, 31)
  TEXT u'a link' (None, 1, 61)
  END QName(u'a') (None, 1, 67)
  TEXT u'.' (None, 1, 71)
  START (QName(u'br'), Attrs()) (None, 1, 72)
  END QName(u'br') (None, 1, 77)
  END QName(u'p') (None, 1, 77)


Filtering
=========

One important feature of markup streams is that you can apply *filters* to the
stream, either filters that come with Genshi, or your own custom filters.

A filter is simply a callable that accepts the stream as parameter, and returns
the filtered stream:

.. code-block:: python

  def noop(stream):
      """A filter that doesn't actually do anything with the stream."""
      for kind, data, pos in stream:
          yield kind, data, pos

Filters can be applied in a number of ways.
The simplest is to call the filter directly:

.. code-block:: python

  stream = noop(stream)

The ``Stream`` class also provides a ``filter()`` method, which takes an
arbitrary number of filter callables and applies them all:

.. code-block:: python

  stream = stream.filter(noop)

Finally, filters can also be applied using the *bitwise or* operator (``|``),
which allows a syntax similar to pipes on Unix shells:

.. code-block:: python

  stream = stream | noop

One example of a filter included with Genshi is the ``HTMLSanitizer`` in
``genshi.filters``. It processes a stream of HTML markup, and strips out any
potentially dangerous constructs, such as JavaScript event handlers.
``HTMLSanitizer`` is not a function, but rather a class that implements
``__call__``, which means instances of the class are callable:

.. code-block:: python

  from genshi.filters import HTMLSanitizer
  stream = stream | HTMLSanitizer()

Both the ``filter()`` method and the pipe operator allow easy chaining of
filters:

.. code-block:: python

  from genshi.filters import HTMLSanitizer
  stream = stream.filter(noop, HTMLSanitizer())

That is equivalent to:

.. code-block:: python

  stream = stream | noop | HTMLSanitizer()

For more information about the built-in filters, see `Stream Filters`_.

.. _`Stream Filters`: filters.html


Serialization
=============

Serialization means producing some kind of textual output from a stream of
events, which you'll need when you want to transmit or store the results of
generating or otherwise processing markup.

The ``Stream`` class provides two methods for serialization: ``serialize()`` and
``render()``. The former is a generator that yields chunks of ``Markup`` objects
(which are basically unicode strings that are considered safe for output on the
web). The latter returns a single string, by default UTF-8 encoded.

Here's the output from ``serialize()``:

.. code-block:: pycon

  >>> for output in stream.serialize():
  ...     print `output`
  ...
  <Markup u'<p class="intro">'>
  <Markup u'Some text and '>
  <Markup u'<a href="http://example.org/">'>
  <Markup u'a link'>
  <Markup u'</a>'>
  <Markup u'.'>
  <Markup u'<br/>'>
  <Markup u'</p>'>

And here's the output from ``render()``:

.. code-block:: pycon

  >>> print stream.render()
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>
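
The relationship between a chunk-yielding generator and a method that returns
one encoded string can be illustrated with a toy sketch. It is written in plain
Python 3 and deliberately does not import Genshi: the string constants,
``toy_serialize()``, and ``toy_render()`` are hypothetical stand-ins rather
than Genshi's API, and the escaping is handled by the standard library's
``html.escape()``.

.. code-block:: python

  # Toy illustration only: plain Python 3, no Genshi import.
  # START, END and TEXT are hypothetical string stand-ins for event kinds.
  from html import escape

  START, END, TEXT = "START", "END", "TEXT"

  # Events follow the (kind, data, pos) shape; pos is left as None here.
  events = [
      (START, ("p", [("class", "intro")]), None),
      (TEXT, "Some text & more", None),
      (END, "p", None),
  ]

  def toy_serialize(events):
      """Yield one chunk of markup text per event."""
      for kind, data, pos in events:
          if kind == START:
              tag, attrs = data
              attr_text = "".join(
                  ' %s="%s"' % (name, escape(value, quote=True))
                  for name, value in attrs
              )
              yield "<%s%s>" % (tag, attr_text)
          elif kind == END:
              yield "</%s>" % data
          elif kind == TEXT:
              yield escape(data, quote=False)

  def toy_render(events, encoding="utf-8"):
      """Join all chunks; return bytes unless encoding is None."""
      text = "".join(toy_serialize(events))
      return text if encoding is None else text.encode(encoding)

  chunks = list(toy_serialize(events))
  rendered = toy_render(events)

The generator form is useful when output should be written piece by piece,
while the joined form is convenient when the whole result is needed at once.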

Both methods can be passed a ``method`` parameter that determines how exactly
the events are serialized to text. This parameter can be either “xml” (the
default), “xhtml”, “html”, “text”, or a custom serializer class:

.. code-block:: pycon

  >>> print stream.render('html')
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>

Note how the ``<br>`` element isn't closed, which is the right thing to do for
HTML.

In addition, the ``render()`` method takes an ``encoding`` parameter, which
defaults to “UTF-8”. If set to ``None``, the result will be a unicode string.

The different serializer classes in ``genshi.output`` can also be used
directly:

.. code-block:: pycon

  >>> from genshi.filters import HTMLSanitizer
  >>> from genshi.output import TextSerializer
  >>> print ''.join(TextSerializer()(HTMLSanitizer()(stream)))
  Some text and a link.

The pipe operator allows a nicer syntax:

.. code-block:: pycon

  >>> print stream | HTMLSanitizer() | TextSerializer()
  Some text and a link.


Serialization Options
---------------------

Both ``serialize()`` and ``render()`` support additional keyword arguments that
are passed through to the initializer of the serializer class. The following
options are supported by the built-in serializers:

``strip_whitespace``
  Whether the serializer should remove trailing spaces and empty lines. Defaults
  to ``True``.

  (This option is not available for serialization to plain text.)

``doctype``
  A ``(name, pubid, sysid)`` tuple defining the name, public identifier, and
  system identifier of a ``DOCTYPE`` declaration to prepend to the generated
  output. If provided, this declaration will override any ``DOCTYPE``
  declaration in the stream.

  (This option is not available for serialization to plain text.)

``namespace_prefixes``
  The namespace prefixes to use for namespaces that are not bound to a prefix
  in the stream itself.

  (This option is not available for serialization to HTML or plain text.)


Using XPath
===========

XPath can be used to extract a specific subset of the stream via the
``select()`` method:

.. code-block:: pycon

  >>> substream = stream.select('a')
  >>> substream
  <genshi.core.Stream object at ...>
  >>> print substream
  <a href="http://example.org/">a link</a>

Often, streams cannot be reused: in the above example, the sub-stream is based
on a generator. Once it has been serialized, it will have been fully consumed,
and cannot be rendered again. To work around this, you can wrap such a stream
in a ``list``:

.. code-block:: pycon

  >>> from genshi import Stream
  >>> substream = Stream(list(stream.select('a')))
  >>> substream
  <genshi.core.Stream object at ...>
  >>> print substream
  <a href="http://example.org/">a link</a>
  >>> print substream.select('@href')
  http://example.org/
  >>> print substream.select('text()')
  a link

See `Using XPath in Genshi`_ for more information about the XPath support in
Genshi.

.. _`Using XPath in Genshi`: xpath.html


.. _`event kinds`:

Event Kinds
===========

Every event in a stream is of one of several *kinds*, which also determines
what the ``data`` item of the event tuple looks like. The different kinds of
events are documented below.

.. note:: The ``data`` item is generally immutable. If the data is to be
   modified when processing a stream, it must be replaced by a new tuple.
   Effectively, this means the entire event tuple is immutable.

START
-----
The opening tag of an element.

For this kind of event, the ``data`` item is a tuple of the form
``(tagname, attrs)``, where ``tagname`` is a ``QName`` instance describing the
qualified name of the tag, and ``attrs`` is an ``Attrs`` instance containing
the attribute names and values associated with the tag (excluding namespace
declarations):

.. code-block:: python

  START, (QName(u'p'), Attrs([(QName(u'class'), u'intro')])), pos

END
---
The closing tag of an element.

The ``data`` item of end events consists of just a ``QName`` instance
describing the qualified name of the tag:

.. code-block:: python

  END, QName(u'p'), pos

TEXT
----
Character data outside of elements and comments.

For text events, the ``data`` item should be a unicode object:

.. code-block:: python

  TEXT, u'Hello, world!', pos

START_NS
--------
The start of a namespace mapping, binding a namespace prefix to a URI.

The ``data`` item of this kind of event is a tuple of the form
``(prefix, uri)``, where ``prefix`` is the namespace prefix and ``uri`` is the
full URI to which the prefix is bound. Both should be unicode objects. If the
namespace is not bound to any prefix, the ``prefix`` item is an empty string:

.. code-block:: python

  START_NS, (u'svg', u'http://www.w3.org/2000/svg'), pos

END_NS
------
The end of a namespace mapping.

The ``data`` item of such events consists of only the namespace prefix (a
unicode object):

.. code-block:: python

  END_NS, u'svg', pos

DOCTYPE
-------
A document type declaration.

For this type of event, the ``data`` item is a tuple of the form
``(name, pubid, sysid)``, where ``name`` is the name of the root element,
``pubid`` is the public identifier of the DTD (or ``None``), and ``sysid`` is
the system identifier of the DTD (or ``None``):

.. code-block:: python

  DOCTYPE, (u'html', u'-//W3C//DTD XHTML 1.0 Transitional//EN',
            u'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'), pos

COMMENT
-------
A comment.

For such events, the ``data`` item is a unicode object containing all character
data between the comment delimiters:

.. code-block:: python

  COMMENT, u'Commented out', pos

PI
--
A processing instruction.

The ``data`` item is a tuple of the form ``(target, data)`` for processing
instructions, where ``target`` is the target of the PI (used to identify the
application by which the instruction should be processed), and ``data`` is text
following the target (excluding the terminating question mark):

.. code-block:: python

  PI, (u'php', u'echo "Yo" '), pos

START_CDATA
-----------
Marks the beginning of a ``CDATA`` section.

The ``data`` item for such events is always ``None``:

.. code-block:: python

  START_CDATA, None, pos

END_CDATA
---------
Marks the end of a ``CDATA`` section.

The ``data`` item for such events is always ``None``:

.. code-block:: python

  END_CDATA, None, pos
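
As a closing illustration of how the event kinds and the immutability rule
interact, the following sketch shows a filter that changes ``TEXT`` events by
emitting new tuples and passes every other kind through untouched. It is plain
Python 3 and does not import Genshi: the kind constants are hypothetical string
stand-ins, the tag name is a plain string rather than a ``QName``, and
``upper_text()`` is an illustrative name, not a built-in filter.

.. code-block:: python

  # Sketch only: plain Python 3, no Genshi import.
  # The kind constants below are hypothetical string stand-ins.
  START, END, TEXT = "START", "END", "TEXT"

  def upper_text(stream):
      """Filter that upper-cases TEXT data and passes other events through."""
      for kind, data, pos in stream:
          if kind == TEXT:
              # Build a new event tuple instead of mutating the old one.
              yield kind, data.upper(), pos
          else:
              yield kind, data, pos

  events = [
      (START, ("p", ()), (None, 1, 0)),
      (TEXT, "Hello, world!", (None, 1, 3)),
      (END, "p", (None, 1, 16)),
  ]

  filtered = list(upper_text(events))

Because the filter only ever yields fresh tuples, the original ``events`` list
is left unchanged and can be filtered again with different callables.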