.. -*- mode: rst; encoding: utf-8 -*-

==============
Markup Streams
==============

A stream is the common representation of markup as a *stream of events*.


.. contents:: Contents
   :depth: 2
.. sectnum::


Basics
======

A stream can be obtained in a number of ways. It can be:

* the result of parsing XML or HTML text, or
* the result of selecting a subset of another stream using XPath, or
* programmatically generated.

For example, the functions ``XML()`` and ``HTML()`` can be used to convert
literal XML or HTML text to a markup stream:

.. code-block:: pycon

  >>> from genshi import XML
  >>> stream = XML('<p class="intro">Some text and '
  ...              '<a href="http://example.org/">a link</a>.'
  ...              '<br/></p>')
  >>> stream
  <genshi.core.Stream object at ...>

The stream is the result of parsing the text into events. Each event is a tuple
of the form ``(kind, data, pos)``, where:

* ``kind`` defines what kind of event it is (such as the start of an element,
  text, a comment, etc.).
* ``data`` is the actual data associated with the event. How this looks depends
  on the event kind (see `event kinds`_).
* ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the
  event “comes from”.

.. code-block:: pycon

  >>> for kind, data, pos in stream:
  ...     print('%s %r %r' % (kind, data, pos))
  ...
  START (QName('p'), Attrs([(QName('class'), u'intro')])) (None, 1, 0)
  TEXT u'Some text and ' (None, 1, 17)
  START (QName('a'), Attrs([(QName('href'), u'http://example.org/')])) (None, 1, 31)
  TEXT u'a link' (None, 1, 61)
  END QName('a') (None, 1, 67)
  TEXT u'.' (None, 1, 71)
  START (QName('br'), Attrs()) (None, 1, 72)
  END QName('br') (None, 1, 77)
  END QName('p') (None, 1, 77)


Filtering
=========

One important feature of markup streams is that you can apply *filters* to the
stream, either filters that come with Genshi, or your own custom filters.

A filter is simply a callable that accepts the stream as parameter, and returns
the filtered stream:

.. code-block:: python

  def noop(stream):
      """A filter that doesn't actually do anything with the stream."""
      for kind, data, pos in stream:
          yield kind, data, pos

Filters can be applied in a number of ways.
The simplest is to just call the
filter directly:

.. code-block:: python

  stream = noop(stream)

The ``Stream`` class also provides a ``filter()`` method, which takes an
arbitrary number of filter callables and applies them all:

.. code-block:: python

  stream = stream.filter(noop)

Finally, filters can also be applied using the *bitwise or* operator (``|``),
which allows a syntax similar to pipes on Unix shells:

.. code-block:: python

  stream = stream | noop

One example of a filter included with Genshi is the ``HTMLSanitizer`` in
``genshi.filters``. It processes a stream of HTML markup, and strips out any
potentially dangerous constructs, such as JavaScript event handlers.
``HTMLSanitizer`` is not a function, but rather a class that implements
``__call__``, which means instances of the class are callable:

.. code-block:: python

  from genshi.filters import HTMLSanitizer
  stream = stream | HTMLSanitizer()

Both the ``filter()`` method and the pipe operator allow easy chaining of
filters:

.. code-block:: python

  from genshi.filters import HTMLSanitizer
  stream = stream.filter(noop, HTMLSanitizer())

That is equivalent to:

.. code-block:: python

  stream = stream | noop | HTMLSanitizer()

For more information about the built-in filters, see `Stream Filters`_.

.. _`Stream Filters`: filters.html


Serialization
=============

Serialization means producing some kind of textual output from a stream of
events, which you'll need when you want to transmit or store the results of
generating or otherwise processing markup.

The ``Stream`` class provides two methods for serialization: ``serialize()``
and ``render()``. The former is a generator that yields chunks of ``Markup``
objects (which are basically unicode strings that are considered safe for
output on the web). The latter returns a single string, by default UTF-8
encoded.

Here's the output from ``serialize()``:

.. code-block:: pycon

  >>> for output in stream.serialize():
  ...     print(repr(output))
  ...
  <Markup u'<p class="intro">'>
  <Markup u'Some text and '>
  <Markup u'<a href="http://example.org/">'>
  <Markup u'a link'>
  <Markup u'</a>'>
  <Markup u'.'>
  <Markup u'<br/>'>
  <Markup u'</p>'>

And here's the output from ``render()``:

.. code-block:: pycon

  >>> print(stream.render())
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

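To make the difference between the two methods more concrete, the following is
a toy sketch in plain Python. It is not Genshi's implementation: the names
``toy_serialize`` and ``toy_render`` are hypothetical, plain strings stand in
for the event kinds, and plain tuples and lists stand in for the ``QName`` and
``Attrs`` objects that a real stream carries. It only illustrates the idea of
yielding one text chunk per event versus joining the chunks into one string.

.. code-block:: python

  from html import escape

  # Plain strings stand in for Genshi's event kinds (assumption for this sketch).
  START, END, TEXT = 'START', 'END', 'TEXT'

  # Hand-written events in the (kind, data, pos) shape described earlier.
  # Tag names and attributes are plain strings here, not QName/Attrs objects.
  events = [
      (START, ('p', [('class', 'intro')]), (None, 1, 0)),
      (TEXT, 'Some text and ', (None, 1, 17)),
      (START, ('a', [('href', 'http://example.org/')]), (None, 1, 31)),
      (TEXT, 'a link', (None, 1, 61)),
      (END, 'a', (None, 1, 67)),
      (TEXT, '.', (None, 1, 71)),
      (END, 'p', (None, 1, 72)),
  ]


  def toy_serialize(stream):
      """Yield one text chunk per event, loosely mirroring ``serialize()``."""
      for kind, data, pos in stream:
          if kind == START:
              tag, attrs = data
              attr_text = ''.join(
                  ' %s="%s"' % (name, escape(value, quote=True))
                  for name, value in attrs
              )
              yield '<%s%s>' % (tag, attr_text)
          elif kind == END:
              yield '</%s>' % data
          elif kind == TEXT:
              yield escape(data, quote=False)


  def toy_render(stream, encoding='utf-8'):
      """Join the chunks into one string, loosely mirroring ``render()``."""
      text = ''.join(toy_serialize(stream))
      return text if encoding is None else text.encode(encoding)

Iterating over ``toy_serialize(events)`` produces the chunks one at a time,
which suits streaming output, while ``toy_render(events, encoding=None)``
returns the whole document as a single string.
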
Both methods can be passed a ``method`` parameter that determines how exactly
the events are serialized to text. This parameter can be either a string or a
custom serializer class:

.. code-block:: pycon

  >>> print(stream.render('html'))
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>

Note how the ``<br>`` element isn't closed, which is the right thing to do for
HTML. See `serialization methods`_ for more details.

In addition, the ``render()`` method takes an ``encoding`` parameter, which
defaults to “UTF-8”. If set to ``None``, the result will be a unicode string.

The different serializer classes in ``genshi.output`` can also be used
directly:

.. code-block:: pycon

  >>> from genshi.filters import HTMLSanitizer
  >>> from genshi.output import TextSerializer
  >>> print(''.join(TextSerializer()(HTMLSanitizer()(stream))))
  Some text and a link.

The pipe operator allows a nicer syntax:

.. code-block:: pycon

  >>> print(stream | HTMLSanitizer() | TextSerializer())
  Some text and a link.


.. _`serialization methods`:

Serialization Methods
---------------------

Genshi supports several serialization methods for creating a text
representation of a markup stream.

``xml``
  The ``XMLSerializer`` is the default serialization method and results in
  proper XML output including namespace support, the XML declaration, CDATA
  sections, and so on. It is generally not suitable for serving HTML or
  XHTML web pages (unless you want to use true XHTML 1.1), for which the
  ``xhtml`` and ``html`` serializers described below should be preferred.

``xhtml``
  The ``XHTMLSerializer`` is a specialization of the generic ``XMLSerializer``
  that understands the peculiarities of producing XML-compliant output that can
  also be parsed without problems by the HTML parsers found in modern web
  browsers. Thus, the output by this serializer should be usable whether sent
  as "text/html" or "application/xhtml+xml" (although there are a lot of
  subtle issues to pay attention to when switching between the two, in
  particular with respect to differences in the DOM and CSS).

  For example, instead of rendering a script tag as ``<script/>`` (which
  confuses the HTML parser in many browsers), it will produce
  ``<script></script>``. Also, it will normalize any boolean attribute values
  that are minimized in HTML, so that for example ``<hr noshade="1"/>``
  becomes ``<hr noshade="noshade" />``.

  This serializer supports the use of namespaces for compound documents, for
  example to use inline SVG inside an XHTML document.

``html``
  The ``HTMLSerializer`` produces proper HTML markup. The main differences
  compared to ``xhtml`` serialization are that boolean attributes are
  minimized, empty tags are not self-closing (so it's ``<br>`` instead of
  ``<br />``), and that the contents of ``