.. -*- mode: rst; encoding: utf-8 -*-

==============
Markup Streams
==============

A stream is the common representation of markup as a *stream of events*.


.. contents:: Contents
   :depth: 1
.. sectnum::


Basics
======

A stream can be obtained in a number of ways. It can be:

* the result of parsing XML or HTML text, or
* programmatically generated, or
* the result of selecting a subset of another stream filtered by an XPath
  expression.

For example, the functions ``XML()`` and ``HTML()`` can be used to convert
literal XML or HTML text to a markup stream::

  >>> from genshi import XML
  >>> stream = XML('<p class="intro">Some text and '
  ...              '<a href="http://example.org/">a link</a>.'
  ...              '<br/></p>')
  >>> stream
  <genshi.core.Stream object at 0x...>

The stream is the result of parsing the text into events. Each event is a tuple
of the form ``(kind, data, pos)``, where:

* ``kind`` defines what kind of event it is (such as the start of an element,
  text, a comment, etc.).
* ``data`` is the actual data associated with the event. How this looks depends
  on the event kind (see `event kinds`_).
* ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the
  event “comes from”.

::

  >>> for kind, data, pos in stream:
  ...     print kind, `data`, pos
  ...
  START (QName(u'p'), Attrs([(QName(u'class'), u'intro')])) (None, 1, 0)
  TEXT u'Some text and ' (None, 1, 17)
  START (QName(u'a'), Attrs([(QName(u'href'), u'http://example.org/')])) (None, 1, 31)
  TEXT u'a link' (None, 1, 61)
  END QName(u'a') (None, 1, 67)
  TEXT u'.' (None, 1, 71)
  START (QName(u'br'), Attrs()) (None, 1, 72)
  END QName(u'br') (None, 1, 77)
  END QName(u'p') (None, 1, 77)
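Because events are ordinary tuples, a sequence of them can also be built and
inspected by hand. The following is a minimal sketch of that idea. It assumes
plain strings as stand-ins for Genshi's event kind constants and ``QName``
objects, and plain lists of pairs instead of ``Attrs``, so that it runs without
Genshi installed; ``text_of`` and ``tag_names`` are hypothetical helpers, not
part of the library:

```python
# Plain strings stand in for Genshi's kind constants and QName objects
# (an assumption made for illustration only).
START, END, TEXT = 'START', 'END', 'TEXT'

# Hand-built events mirroring the parsed example above.
events = [
    (START, ('p', [('class', 'intro')]), (None, 1, 0)),
    (TEXT, 'Some text and ', (None, 1, 17)),
    (START, ('a', [('href', 'http://example.org/')]), (None, 1, 31)),
    (TEXT, 'a link', (None, 1, 61)),
    (END, 'a', (None, 1, 67)),
    (TEXT, '.', (None, 1, 71)),
    (START, ('br', []), (None, 1, 72)),
    (END, 'br', (None, 1, 77)),
    (END, 'p', (None, 1, 77)),
]


def text_of(stream):
    """Concatenate the data of all TEXT events, ignoring the markup."""
    return ''.join(data for kind, data, pos in stream if kind == TEXT)


def tag_names(stream):
    """List the tag name of every START event, in document order."""
    return [data[0] for kind, data, pos in stream if kind == START]


plain = text_of(events)
names = tag_names(events)
```

Unpacking each event into ``kind``, ``data`` and ``pos`` and dispatching on
``kind`` is the pattern that the rest of this document builds on.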


Filtering
=========

One important feature of markup streams is that you can apply *filters* to the
stream, either filters that come with Genshi, or your own custom filters.

A filter is simply a callable that accepts a stream as its parameter and
returns the filtered stream::

  def noop(stream):
      """A filter that doesn't actually do anything with the stream."""
      for kind, data, pos in stream:
          yield kind, data, pos
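A filter that actually changes something follows the same shape: inspect each
event, and yield either the original or a replacement. The sketch below is an
illustration only; ``upper_text`` is a hypothetical filter, and the string
``'TEXT'`` is assumed as a stand-in for Genshi's ``TEXT`` constant:

```python
TEXT = 'TEXT'  # stand-in for the real event kind constant (an assumption)


def upper_text(stream):
    """A filter that upper-cases character data and passes markup through."""
    for kind, data, pos in stream:
        if kind == TEXT:
            # Rebind the local name; the incoming event tuple is not modified.
            data = data.upper()
        yield kind, data, pos


events = [
    ('START', ('b', []), (None, 1, 0)),
    (TEXT, 'hi', (None, 1, 3)),
    ('END', 'b', (None, 1, 5)),
]
result = list(upper_text(events))
```

Since the filter is a generator, nothing is processed until the result is
iterated over, which is what makes chaining several filters cheap.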

Filters can be applied in a number of ways. The simplest is to just call the
filter directly::

  stream = noop(stream)

The ``Stream`` class also provides a ``filter()`` method, which takes an
arbitrary number of filter callables and applies them all::

  stream = stream.filter(noop)

Finally, filters can also be applied using the *bitwise or* operator (``|``),
which allows a syntax similar to pipes on Unix shells::

  stream = stream | noop

One example of a filter included with Genshi is the ``HTMLSanitizer`` in
``genshi.filters``. It processes a stream of HTML markup, and strips out any
potentially dangerous constructs, such as JavaScript event handlers.
``HTMLSanitizer`` is not a function, but rather a class that implements
``__call__``, which means instances of the class are callable.
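Writing a filter as a class is mainly useful when the filter needs
configuration. The following sketch shows only that pattern. ``ElementRemover``
is a hypothetical name that is not part of Genshi, the kind constants are
assumed to be plain strings, and the class is in no way a substitute for
``HTMLSanitizer``:

```python
START, END, TEXT = 'START', 'END', 'TEXT'  # stand-ins (an assumption)


class ElementRemover(object):
    """Callable filter that drops the named elements and their content."""

    def __init__(self, tagnames):
        self.tagnames = frozenset(tagnames)

    def __call__(self, stream):
        depth = 0  # greater than zero while inside a removed element
        for kind, data, pos in stream:
            if depth:
                # Track nesting so that the matching END is found.
                if kind == START:
                    depth += 1
                elif kind == END:
                    depth -= 1
                continue
            if kind == START and data[0] in self.tagnames:
                depth = 1
                continue
            yield kind, data, pos


pos = (None, 1, 0)
events = [
    (START, ('p', []), pos),
    (TEXT, 'Hello', pos),
    (START, ('script', []), pos),
    (TEXT, 'alert(1)', pos),
    (END, 'script', pos),
    (TEXT, '!', pos),
    (END, 'p', pos),
]
kept = list(ElementRemover(['script'])(events))
```

The configuration lives on the instance, so one configured instance can be
applied to any number of streams.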

Both the ``filter()`` method and the pipe operator allow easy chaining of
filters::

  from genshi.filters import HTMLSanitizer
  stream = stream.filter(noop, HTMLSanitizer())

That is equivalent to::

  stream = stream | noop | HTMLSanitizer()
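One way to see why the two spellings are equivalent is to sketch how such an
interface could be built. ``MiniStream`` below is a hypothetical stand-in
written for this illustration, not Genshi's ``Stream`` class, and the actual
implementation may differ in its details:

```python
from functools import reduce


class MiniStream(object):
    """Hypothetical wrapper around an iterable of events."""

    def __init__(self, events):
        self.events = events

    def __iter__(self):
        return iter(self.events)

    def __or__(self, function):
        # "stream | f" calls f on the stream and wraps the result again,
        # so that further filters can be chained onto it.
        return MiniStream(function(self))

    def filter(self, *filters):
        # Applying filters left to right is a fold over the pipe operator.
        return reduce(lambda stream, function: stream | function,
                      filters, self)


def noop(stream):
    for event in stream:
        yield event


def drop_text(stream):
    for kind, data, pos in stream:
        if kind != 'TEXT':
            yield kind, data, pos


events = [
    ('START', ('p', []), (None, 1, 0)),
    ('TEXT', 'Hello', (None, 1, 3)),
    ('END', 'p', (None, 1, 8)),
]
via_filter = list(MiniStream(events).filter(noop, drop_text))
via_pipe = list(MiniStream(events) | noop | drop_text)
```

In this sketch ``filter()`` is defined in terms of ``|``, so both forms apply
the filters in the same left-to-right order.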


Serialization
=============

The ``Stream`` class provides two methods for serializing the events of a
stream: ``serialize()`` and ``render()``. The former is a generator that yields
the output in chunks, as ``Markup`` objects (which are basically unicode
strings that are considered safe for output on the web). The latter returns a
single string, by default UTF-8 encoded.

Here's the output from ``serialize()``::

  >>> for output in stream.serialize():
  ...     print `output`
  ...
  <Markup u'<p class="intro">'>
  <Markup u'Some text and '>
  <Markup u'<a href="http://example.org/">'>
  <Markup u'a link'>
  <Markup u'</a>'>
  <Markup u'.'>
  <Markup u'<br/>'>
  <Markup u'</p>'>

And here's the output from ``render()``::

  >>> print stream.render()
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

Both methods can be passed a ``method`` parameter that determines how exactly
the events are serialized to text. This parameter can be either “xml” (the
default), “xhtml”, “html”, “text”, or a custom serializer class::

  >>> print stream.render('html')
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>

Note how the ``<br>`` element isn't closed, which is the right thing to do for
HTML.

In addition, the ``render()`` method takes an ``encoding`` parameter, which
defaults to “UTF-8”. If set to ``None``, the result will be a unicode string.
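The relationship between the two methods can be sketched in a few lines: one
function yields a chunk of text per event, the other joins the chunks and
optionally encodes them. The functions below are simplified stand-ins written
for this illustration (they handle only ``START``, ``END`` and ``TEXT``, never
collapse empty elements, and use plain strings rather than ``Markup``); they
are not Genshi's serializers:

```python
from xml.sax.saxutils import escape, quoteattr

START, END, TEXT = 'START', 'END', 'TEXT'  # stand-ins (an assumption)


def serialize(events):
    """Yield one chunk of XML text per event."""
    for kind, data, pos in events:
        if kind == START:
            tag, attrs = data
            attr_text = ''.join(' %s=%s' % (name, quoteattr(value))
                                for name, value in attrs)
            yield '<%s%s>' % (tag, attr_text)
        elif kind == END:
            yield '</%s>' % data
        elif kind == TEXT:
            yield escape(data)


def render(events, encoding='utf-8'):
    """Join all chunks; encode the result unless encoding is None."""
    output = ''.join(serialize(events))
    if encoding is None:
        return output
    return output.encode(encoding)


pos = (None, 1, 0)
events = [
    (START, ('a', [('href', 'http://example.org/')]), pos),
    (TEXT, 'a & b', pos),
    (END, 'a', pos),
]
chunks = list(serialize(events))
as_text = render(events, encoding=None)
as_bytes = render(events)
```

Yielding chunks lets a caller start sending output before the whole document
has been processed, while the joined form is more convenient when the complete
result is needed at once.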

The different serializer classes in ``genshi.output`` can also be used
directly::

  >>> from genshi.filters import HTMLSanitizer
  >>> from genshi.output import TextSerializer
  >>> print ''.join(TextSerializer()(HTMLSanitizer()(stream)))
  Some text and a link.

The pipe operator allows a nicer syntax::

  >>> print stream | HTMLSanitizer() | TextSerializer()
  Some text and a link.


Using XPath
===========

XPath can be used to extract a specific subset of the stream via the
``select()`` method::

  >>> substream = stream.select('a')
  >>> substream
  <genshi.core.Stream object at 0x...>
  >>> print substream
  <a href="http://example.org/">a link</a>

Often, streams cannot be reused: in the above example, the sub-stream is based
on a generator. Once it has been serialized, it will have been fully consumed,
and cannot be rendered again. To work around this, you can wrap such a stream
in a ``list``::

  >>> from genshi import Stream
  >>> substream = Stream(list(stream.select('a')))
  >>> substream
  <genshi.core.Stream object at 0x...>
  >>> print substream
  <a href="http://example.org/">a link</a>
  >>> print substream.select('@href')
  http://example.org/
  >>> print substream.select('text()')
  a link
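The single-use behaviour is a property of Python generators in general, and it
can be reproduced without Genshi. In the sketch below, ``select_text`` is a
hypothetical generator-based selection that stands in for ``select()``:

```python
def select_text(stream):
    """Lazily yield only the TEXT events of a stream."""
    for kind, data, pos in stream:
        if kind == 'TEXT':
            yield kind, data, pos


events = [
    ('START', ('a', []), (None, 1, 0)),
    ('TEXT', 'a link', (None, 1, 3)),
    ('END', 'a', (None, 1, 9)),
]

# A generator can be consumed only once.
lazy = select_text(events)
first_pass = list(lazy)
second_pass = list(lazy)  # the generator is already exhausted

# Buffering the events in a list makes them reusable.
buffered = list(select_text(events))
again_1 = list(buffered)
again_2 = list(buffered)
```

The price of buffering is memory: the list holds every selected event at once,
so it is best reserved for sub-streams that really are needed more than once.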

See `Using XPath in Genshi`_ for more information about the XPath support in
Genshi.

.. _`Using XPath in Genshi`: xpath.html


.. _`event kinds`:

Event Kinds
===========

Every event in a stream is of one of several *kinds*, which also determines
what the ``data`` item of the event tuple looks like. The different kinds of
events are documented below.

.. note:: The ``data`` item is generally immutable. If the data is to be
   modified when processing a stream, it must be replaced by a new tuple.
   Effectively, this means the entire event tuple is immutable.
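In practice, "replacing" means building a new ``data`` value and yielding a new
event in place of the old one. The sketch below illustrates this with a
hypothetical ``add_class`` filter. It assumes plain strings for kinds and tag
names and a tuple of ``(name, value)`` pairs in place of an ``Attrs`` instance:

```python
START = 'START'  # stand-in for the real event kind constant (an assumption)


def add_class(stream, tagname, value):
    """Yield events, adding a class attribute to the named elements."""
    for kind, data, pos in stream:
        if kind == START and data[0] == tagname:
            tag, attrs = data
            # Build a new data tuple rather than changing the old one.
            data = (tag, attrs + (('class', value),))
        yield kind, data, pos


original = [
    (START, ('p', ()), (None, 1, 0)),
    ('TEXT', 'Hi', (None, 1, 3)),
    ('END', 'p', (None, 1, 5)),
]
changed = list(add_class(original, 'p', 'intro'))
```

The events in ``original`` are left untouched, so another consumer holding the
same events is not affected by the filter.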

START
-----
The opening tag of an element.

For this kind of event, the ``data`` item is a tuple of the form
``(tagname, attrs)``, where ``tagname`` is a ``QName`` instance describing the
qualified name of the tag, and ``attrs`` is an ``Attrs`` instance containing
the attribute names and values associated with the tag (excluding namespace
declarations)::

  START, (QName(u'p'), Attrs([(u'class', u'intro')])), pos

END
---
The closing tag of an element.

The ``data`` item of end events consists of just a ``QName`` instance
describing the qualified name of the tag::

  END, QName(u'p'), pos

TEXT
----
Character data outside of elements and other nodes.

For text events, the ``data`` item should be a unicode object::

  TEXT, u'Hello, world!', pos

START_NS
--------
The start of a namespace mapping, binding a namespace prefix to a URI.

The ``data`` item of this kind of event is a tuple of the form
``(prefix, uri)``, where ``prefix`` is the namespace prefix and ``uri`` is the
full URI to which the prefix is bound. Both should be unicode objects. If the
namespace is not bound to any prefix, the ``prefix`` item is an empty string::

  START_NS, (u'svg', u'http://www.w3.org/2000/svg'), pos

END_NS
------
The end of a namespace mapping.

The ``data`` item of such events consists of only the namespace prefix (a
unicode object)::

  END_NS, u'svg', pos
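A consumer that needs to know which prefixes are bound at any point can track
``START_NS`` and ``END_NS`` events as it goes. The following is a rough sketch
under the usual assumptions (string stand-ins for the kind constants and tag
names); ``with_namespaces`` is a hypothetical helper, not a Genshi API:

```python
START_NS, END_NS = 'START_NS', 'END_NS'  # stand-ins (an assumption)


def with_namespaces(stream):
    """Yield (event, in_scope) pairs; in_scope maps prefix to URI."""
    bindings = {}  # prefix -> list of URIs, innermost binding last
    for kind, data, pos in stream:
        if kind == START_NS:
            prefix, uri = data
            bindings.setdefault(prefix, []).append(uri)
        elif kind == END_NS:
            # data is just the prefix whose innermost binding ends here.
            bindings[data].pop()
        in_scope = dict((prefix, uris[-1])
                        for prefix, uris in bindings.items() if uris)
        yield (kind, data, pos), in_scope


SVG = 'http://www.w3.org/2000/svg'
events = [
    (START_NS, ('svg', SVG), (None, 1, 0)),
    ('START', ('rect', []), (None, 1, 0)),
    ('END', 'rect', (None, 1, 50)),
    (END_NS, 'svg', (None, 1, 50)),
]
scopes = [in_scope for event, in_scope in with_namespaces(events)]
```

A stack per prefix is used so that a prefix rebound inside a nested element
reverts to its outer binding once the inner mapping ends.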

DOCTYPE
-------
A document type declaration.

For this type of event, the ``data`` item is a tuple of the form
``(name, pubid, sysid)``, where ``name`` is the name of the root element,
``pubid`` is the public identifier of the DTD (or ``None``), and ``sysid`` is
the system identifier of the DTD (or ``None``)::

  DOCTYPE, (u'html', u'-//W3C//DTD XHTML 1.0 Transitional//EN', \
            u'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'), pos
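To show how the three items map onto markup, here is a small sketch that turns
such a ``data`` tuple back into a declaration. ``doctype_declaration`` is a
hypothetical helper written for this example; Genshi's own serializers may
format the declaration differently:

```python
def doctype_declaration(data):
    """Format a (name, pubid, sysid) tuple as a DOCTYPE declaration."""
    name, pubid, sysid = data
    parts = ['<!DOCTYPE %s' % name]
    if pubid:
        parts.append(' PUBLIC "%s"' % pubid)
        if sysid:
            parts.append(' "%s"' % sysid)
    elif sysid:
        # Without a public identifier, the SYSTEM keyword is used instead.
        parts.append(' SYSTEM "%s"' % sysid)
    parts.append('>')
    return ''.join(parts)


full = doctype_declaration((
    'html',
    '-//W3C//DTD XHTML 1.0 Transitional//EN',
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd',
))
bare = doctype_declaration(('html', None, None))
```

The ``None`` values matter here: they are what distinguishes a short
declaration such as ``<!DOCTYPE html>`` from one that names a DTD.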

COMMENT
-------
A comment.

For such events, the ``data`` item is a unicode object containing all character
data between the comment delimiters::

  COMMENT, u'Commented out', pos

PI
--
A processing instruction.

The ``data`` item is a tuple of the form ``(target, data)`` for processing
instructions, where ``target`` is the target of the PI (used to identify the
application by which the instruction should be processed), and ``data`` is text
following the target (excluding the terminating question mark)::

  PI, (u'php', u'echo "Yo" '), pos

START_CDATA
-----------
Marks the beginning of a ``CDATA`` section.

The ``data`` item for such events is always ``None``::

  START_CDATA, None, pos

END_CDATA
---------
Marks the end of a ``CDATA`` section.

The ``data`` item for such events is always ``None``::

  END_CDATA, None, pos
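
Since the two ``CDATA`` events carry no data of their own, their job is to
change how the ``TEXT`` events between them are treated. The sketch below shows
one way a consumer might honour them when writing out text. It assumes string
stand-ins for the kind constants, and ``serialize_text`` is a hypothetical
function, not Genshi's serializer:

```python
from xml.sax.saxutils import escape

TEXT = 'TEXT'  # stand-ins for the real constants (an assumption)
START_CDATA, END_CDATA = 'START_CDATA', 'END_CDATA'


def serialize_text(stream):
    """Yield text chunks, escaping everything outside CDATA sections."""
    in_cdata = False
    for kind, data, pos in stream:
        if kind == START_CDATA:
            in_cdata = True
            yield '<![CDATA['
        elif kind == END_CDATA:
            in_cdata = False
            yield ']]>'
        elif kind == TEXT:
            # Inside a CDATA section the text is written out verbatim.
            yield data if in_cdata else escape(data)


pos = (None, 1, 0)
events = [
    (TEXT, 'a < b', pos),
    (START_CDATA, None, pos),
    (TEXT, 'x < y', pos),
    (END_CDATA, None, pos),
]
output = ''.join(serialize_text(events))
```

The sketch does not handle text that itself contains ``]]>``, which a complete
serializer would need to split across sections.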