.. -*- mode: rst; encoding: utf-8 -*-

==============
Markup Streams
==============

A stream is the common representation of markup as a *stream of events*.


.. contents:: Contents
   :depth: 2
.. sectnum::


Basics
======

A stream can be obtained in a number of ways. It can be:

* the result of parsing XML or HTML text, or
* programmatically generated, or
* the result of selecting a subset of another stream filtered by an XPath
  expression.

For example, the functions ``XML()`` and ``HTML()`` can be used to convert
literal XML or HTML text to a markup stream::

  >>> from markup import XML
  >>> stream = XML('<p class="intro">Some text and '
  ...              '<a href="http://example.org/">a link</a>.'
  ...              '<br/></p>')
  >>> stream
  <markup.core.Stream object at 0x...>

The stream is the result of parsing the text into events. Each event is a tuple
of the form ``(kind, data, pos)``, where:

* ``kind`` defines what kind of event it is (such as the start of an element,
  text, a comment, etc).
* ``data`` is the actual data associated with the event. How this looks depends
  on the event kind.
* ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the
  event “comes from”.

::

  >>> for kind, data, pos in stream:
  ...     print kind, `data`, pos
  ...
  START (u'p', [(u'class', u'intro')]) ('<string>', 1, 0)
  TEXT u'Some text and ' ('<string>', 1, 31)
  START (u'a', [(u'href', u'http://example.org/')]) ('<string>', 1, 31)
  TEXT u'a link' ('<string>', 1, 67)
  END u'a' ('<string>', 1, 67)
  TEXT u'.' ('<string>', 1, 72)
  START (u'br', []) ('<string>', 1, 72)
  END u'br' ('<string>', 1, 77)
  END u'p' ('<string>', 1, 77)


Filtering
=========

One important feature of markup streams is that you can apply *filters* to the
stream, either filters that come with Markup, or your own custom filters.

A filter is simply a callable that accepts the stream as a parameter and
returns the filtered stream::

  def noop(stream):
      """A filter that doesn't actually do anything with the stream."""
      for kind, data, pos in stream:
          yield kind, data, pos

Filters can be applied in a number of ways.
The simplest is to just call the
filter directly::

  stream = noop(stream)

The ``Stream`` class also provides a ``filter()`` method, which takes an
arbitrary number of filter callables and applies them all::

  stream = stream.filter(noop)

Finally, filters can also be applied using the *bitwise or* operator (``|``),
which allows a syntax similar to pipes on Unix shells::

  stream = stream | noop

One example of a filter included with Markup is the ``HTMLSanitizer`` in
``markup.filters``. It processes a stream of HTML markup, and strips out any
potentially dangerous constructs, such as JavaScript event handlers.
``HTMLSanitizer`` is not a function, but rather a class that implements
``__call__``, which means instances of the class are callable.

Both the ``filter()`` method and the pipe operator allow easy chaining of
filters::

  from markup.filters import HTMLSanitizer
  stream = stream.filter(noop, HTMLSanitizer())

That is equivalent to::

  stream = stream | noop | HTMLSanitizer()


Serialization
=============

The ``Stream`` class provides two methods for serializing this list of events:
``serialize()`` and ``render()``. The former is a generator that yields chunks
of ``Markup`` objects (which are basically unicode strings). The latter returns
a single string, by default UTF-8 encoded.

Here's the output from ``serialize()``::

  >>> for output in stream.serialize():
  ...     print `output`
  ...
  <Markup u'<p class="intro">'>
  <Markup u'Some text and '>
  <Markup u'<a href="http://example.org/">'>
  <Markup u'a link'>
  <Markup u'</a>'>
  <Markup u'.'>
  <Markup u'<br/>'>
  <Markup u'</p>'>

And here's the output from ``render()``::

  >>> print stream.render()
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

Both methods can be passed a ``method`` parameter that determines how exactly
the events are serialized to text. This parameter can be either “xml” (the
default), “xhtml”, “html”, “text”, or a custom serializer class::

  >>> print stream.render('html')
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>

Note how the ``<br>`` element isn't closed, which is the right thing to do for
HTML.

In addition, the ``render()`` method takes an ``encoding`` parameter, which
defaults to “UTF-8”. If set to ``None``, the result will be a unicode string.

The different serializer classes in ``markup.output`` can also be used
directly::

  >>> from markup.filters import HTMLSanitizer
  >>> from markup.output import TextSerializer
  >>> print TextSerializer()(HTMLSanitizer()(stream))
  Some text and a link.

The pipe operator allows a nicer syntax::

  >>> print stream | HTMLSanitizer() | TextSerializer()
  Some text and a link.

Using XPath
===========

XPath can be used to extract a specific subset of the stream via the
``select()`` method::

  >>> substream = stream.select('a')
  >>> substream
  <markup.core.Stream object at 0x...>
  >>> print substream
  <a href="http://example.org/">a link</a>

Often, streams cannot be reused: in the above example, the sub-stream is based
on a generator. Once it has been serialized, it will have been fully consumed,
and cannot be rendered again. To work around this, you can wrap such a stream
in a ``list``::

  >>> from markup import Stream
  >>> substream = Stream(list(stream.select('a')))
  >>> substream
  <markup.core.Stream object at 0x...>
  >>> print substream
  <a href="http://example.org/">a link</a>
  >>> print substream.select('@href')
  http://example.org/
  >>> print substream.select('text()')
  a link
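
The one-shot behaviour of generator-based streams can be reproduced without
the library at all. The following is only a minimal sketch under stated
assumptions: it uses plain ``(kind, data, pos)`` tuples with string kinds in
place of the real ``Stream`` class and its event constants, and the helpers
``text_only`` and ``join_text`` are hypothetical names introduced for
illustration. It shows a generator-based filter being exhausted after one pass,
and the same filter wrapped in a ``list`` so that it can be read repeatedly.

```python
# Stand-in for a parsed stream: plain (kind, data, pos) tuples.
# Assumption: the real library uses its own Stream class and kind
# constants; the strings below are illustrative only.
EVENTS = [
    ("START", ("a", [("href", "http://example.org/")]), ("<string>", 1, 31)),
    ("TEXT", "a link", ("<string>", 1, 67)),
    ("END", "a", ("<string>", 1, 67)),
]


def text_only(stream):
    """Hypothetical filter: pass through TEXT events and drop the rest."""
    for kind, data, pos in stream:
        if kind == "TEXT":
            yield kind, data, pos


def join_text(stream):
    """Concatenate the data of every event in a TEXT-only stream."""
    return "".join(data for kind, data, pos in stream)


# A generator-based stream can be consumed only once.
lazy = text_only(EVENTS)
first = join_text(lazy)    # reads all events
second = join_text(lazy)   # generator is already exhausted

# Buffering the events in a list makes the stream reusable.
buffered = list(text_only(EVENTS))
third = join_text(buffered)
fourth = join_text(buffered)
```

Buffering trades memory for reusability: the list holds every event at once,
whereas the generator produces them on demand, so wrapping is best reserved
for streams that really do need to be read more than once.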