.. -*- mode: rst; encoding: utf-8 -*-

==============
Markup Streams
==============

A stream is the common representation of markup as a *stream of events*.


.. contents:: Contents
   :depth: 2
.. sectnum::


Basics
======

A stream can be obtained in a number of ways. It can be:

* the result of parsing XML or HTML text, or
* the result of selecting a subset of another stream using XPath, or
* programmatically generated.

For example, the functions ``XML()`` and ``HTML()`` can be used to convert
literal XML or HTML text to a markup stream:

.. code-block:: pycon

  >>> from genshi import XML
  >>> stream = XML('<p class="intro">Some text and '
  ...              '<a href="http://example.org/">a link</a>.'
  ...              '<br/></p>')
  >>> stream
  <genshi.core.Stream object at ...>

The stream is the result of parsing the text into events. Each event is a tuple
of the form ``(kind, data, pos)``, where:

* ``kind`` defines what kind of event it is (such as the start of an element,
  text, a comment, etc.).
* ``data`` is the actual data associated with the event. How this looks depends
  on the event kind (see `event kinds`_).
* ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the
  event “comes from”.

.. code-block:: pycon

  >>> for kind, data, pos in stream:
  ...     print('%s %r %r' % (kind, data, pos))
  ...
  START (QName('p'), Attrs([(QName('class'), u'intro')])) (None, 1, 0)
  TEXT u'Some text and ' (None, 1, 17)
  START (QName('a'), Attrs([(QName('href'), u'http://example.org/')])) (None, 1, 31)
  TEXT u'a link' (None, 1, 61)
  END QName('a') (None, 1, 67)
  TEXT u'.' (None, 1, 71)
  START (QName('br'), Attrs()) (None, 1, 72)
  END QName('br') (None, 1, 77)
  END QName('p') (None, 1, 77)

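As a rough illustration of working with ``(kind, data, pos)`` tuples, the
following sketch collects the text content of a stream. It is written so that
it runs without Genshi installed: the ``events`` list is a hand-written
stand-in that uses plain strings for event kinds and element names, whereas a
real stream carries ``QName`` and ``Attrs`` objects as shown above. The helper
name ``collect_text`` is an assumption made for this example and is not part
of Genshi.

.. code-block:: python

  # Hand-written stand-in for a parsed stream (plain tuples, not Genshi objects).
  events = [
      ('START', ('p', [('class', 'intro')]), (None, 1, 0)),
      ('TEXT', 'Some text and ', (None, 1, 17)),
      ('START', ('a', [('href', 'http://example.org/')]), (None, 1, 31)),
      ('TEXT', 'a link', (None, 1, 61)),
      ('END', 'a', (None, 1, 67)),
      ('TEXT', '.', (None, 1, 71)),
      ('START', ('br', []), (None, 1, 72)),
      ('END', 'br', (None, 1, 77)),
      ('END', 'p', (None, 1, 77)),
  ]


  def collect_text(stream):
      """Join the data of all TEXT events, ignoring markup events."""
      return ''.join(data for kind, data, pos in stream if kind == 'TEXT')


  text = collect_text(events)

Because the function only iterates over the events, it should work equally
well on any iterable that yields three-item tuples of this shape.
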

Filtering
=========

One important feature of markup streams is that you can apply *filters* to the
stream, either filters that come with Genshi, or your own custom filters.

A filter is simply a callable that accepts the stream as parameter, and returns
the filtered stream:

.. code-block:: python

  def noop(stream):
      """A filter that doesn't actually do anything with the stream."""
      for kind, data, pos in stream:
          yield kind, data, pos

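A filter that actually changes something follows the same pattern. The sketch
below upper-cases the data of text events and passes everything else through
unchanged. The name ``upper_text`` is hypothetical, and the comparison against
the plain string ``'TEXT'`` is an assumption that keeps the example runnable on
hand-written tuples; with Genshi installed one would normally compare against
the ``TEXT`` constant from ``genshi.core`` instead.

.. code-block:: python

  def upper_text(stream):
      """Upper-case TEXT events; pass all other events through unchanged."""
      for kind, data, pos in stream:
          if kind == 'TEXT':
              data = data.upper()
          yield kind, data, pos


  # Stand-in events (plain tuples) used only to exercise the filter.
  events = [
      ('START', 'p', (None, 1, 0)),
      ('TEXT', 'hello', (None, 1, 3)),
      ('END', 'p', (None, 1, 8)),
  ]
  result = list(upper_text(events))
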
Filters can be applied in a number of ways. The simplest is to just call the
filter directly:

.. code-block:: python

  stream = noop(stream)

The ``Stream`` class also provides a ``filter()`` method, which takes an
arbitrary number of filter callables and applies them all:

.. code-block:: python

  stream = stream.filter(noop)

Finally, filters can also be applied using the *bitwise or* operator (``|``),
which allows a syntax similar to pipes on Unix shells:

.. code-block:: python

  stream = stream | noop

One example of a filter included with Genshi is the ``HTMLSanitizer`` in
``genshi.filters``. It processes a stream of HTML markup, and strips out any
potentially dangerous constructs, such as JavaScript event handlers.
``HTMLSanitizer`` is not a function, but rather a class that implements
``__call__``, which means instances of the class are callable:

.. code-block:: python

  from genshi.filters import HTMLSanitizer
  stream = stream | HTMLSanitizer()

Both the ``filter()`` method and the pipe operator allow easy chaining of
filters:

.. code-block:: python

  from genshi.filters import HTMLSanitizer
  stream = stream.filter(noop, HTMLSanitizer())

That is equivalent to:

.. code-block:: python

  stream = stream | noop | HTMLSanitizer()

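To make the equivalence concrete, here is a minimal sketch of how a stream
class could support both spellings. ``MiniStream`` and ``drop_text`` are
hypothetical names invented for this example; this is not Genshi's actual
``Stream`` implementation, only an illustration of the idea that ``|`` can
delegate to ``filter()``.

.. code-block:: python

  class MiniStream(object):
      """Hypothetical stand-in for a stream; not Genshi's ``Stream`` class."""

      def __init__(self, events):
          self.events = events

      def __iter__(self):
          return iter(self.events)

      def filter(self, *filters):
          # Apply each filter in turn, feeding the output of one into the next.
          stream = self
          for func in filters:
              stream = MiniStream(func(stream))
          return stream

      def __or__(self, func):
          # ``stream | func`` is shorthand for ``stream.filter(func)``.
          return self.filter(func)


  def noop(stream):
      for event in stream:
          yield event


  def drop_text(stream):
      # Remove TEXT events, keep everything else.
      for kind, data, pos in stream:
          if kind != 'TEXT':
              yield kind, data, pos


  events = [('START', 'p', None), ('TEXT', 'hi', None), ('END', 'p', None)]
  piped = list(MiniStream(events) | noop | drop_text)
  chained = list(MiniStream(events).filter(noop, drop_text))

Both ``piped`` and ``chained`` end up holding the same events, because each
``|`` simply wraps the result of the previous step.
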
For more information about the built-in filters, see `Stream Filters`_.

.. _`Stream Filters`: filters.html


Serialization
=============

Serialization means producing some kind of textual output from a stream of
events, which you'll need when you want to transmit or store the results of
generating or otherwise processing markup.

The ``Stream`` class provides two methods for serialization: ``serialize()``
and ``render()``. The former is a generator that yields chunks of ``Markup``
objects (which are basically unicode strings that are considered safe for
output on the web). The latter returns a single string, by default UTF-8
encoded.

Here's the output from ``serialize()``:

.. code-block:: pycon

  >>> for output in stream.serialize():
  ...     print(repr(output))
  ...
  <Markup u'<p class="intro">'>
  <Markup u'Some text and '>
  <Markup u'<a href="http://example.org/">'>
  <Markup u'a link'>
  <Markup u'</a>'>
  <Markup u'.'>
  <Markup u'<br/>'>
  <Markup u'</p>'>

And here's the output from ``render()``:

.. code-block:: pycon

  >>> print(stream.render())
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

Both methods can be passed a ``method`` parameter that determines how exactly
the events are serialized to text. This parameter can be either a string or a
custom serializer class:

.. code-block:: pycon

  >>> print(stream.render('html'))
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>

Note how the ``<br>`` element isn't closed, which is the right thing to do for
HTML. See `serialization methods`_ for more details.

In addition, the ``render()`` method takes an ``encoding`` parameter, which
defaults to “UTF-8”. If set to ``None``, the result will be a unicode string.

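The relationship between the two methods, as described above, can be sketched
in a few lines: rendering amounts to joining the serialized chunks and then
optionally encoding the result. The functions below are toy stand-ins written
for this example (no escaping, no attributes, plain tuples as events); they
only mirror the behaviour described in this section and are not Genshi's
implementation.

.. code-block:: python

  def serialize(events):
      """Yield one text chunk per event (toy rules, no escaping)."""
      for kind, data, pos in events:
          if kind == 'START':
              yield '<%s>' % data
          elif kind == 'END':
              yield '</%s>' % data
          elif kind == 'TEXT':
              yield data


  def render(events, encoding='utf-8'):
      """Join the chunks; encode the result unless ``encoding`` is None."""
      text = ''.join(serialize(events))
      if encoding is None:
          return text
      return text.encode(encoding)


  events = [('START', 'p', None), ('TEXT', u'caf\xe9', None), ('END', 'p', None)]
  as_text = render(events, encoding=None)  # text string, not encoded
  as_bytes = render(events)                # UTF-8 encoded byte string
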
The different serializer classes in ``genshi.output`` can also be used
directly:

.. code-block:: pycon

  >>> from genshi.filters import HTMLSanitizer
  >>> from genshi.output import TextSerializer
  >>> print(''.join(TextSerializer()(HTMLSanitizer()(stream))))
  Some text and a link.

The pipe operator allows a nicer syntax:

.. code-block:: pycon

  >>> print(stream | HTMLSanitizer() | TextSerializer())
  Some text and a link.


.. _`serialization methods`:

Serialization Methods
---------------------

Genshi supports different serialization methods for creating a text
representation of a markup stream.

``xml``
  The ``XMLSerializer`` is the default serialization method and results in
  proper XML output including namespace support, the XML declaration, CDATA
  sections, and so on. It is generally not suitable for serving HTML or
  XHTML web pages (unless you want to use true XHTML 1.1), for which the
  ``xhtml`` and ``html`` serializers described below should be preferred.

``xhtml``
  The ``XHTMLSerializer`` is a specialization of the generic ``XMLSerializer``
  that understands the peculiarities of producing XML-compliant output that can
  also be parsed without problems by the HTML parsers found in modern web
  browsers. Thus, the output of this serializer should be usable whether sent
  as "text/html" or "application/xhtml+xml" (although there are a lot of
  subtle issues to pay attention to when switching between the two, in
  particular with respect to differences in the DOM and CSS).

  For example, instead of rendering a script tag as ``<script/>`` (which
  confuses the HTML parser in many browsers), it will produce
  ``<script></script>``. Also, it will normalize any boolean attribute values
  that are minimized in HTML, so that for example ``<hr noshade="1"/>``
  becomes ``<hr noshade="noshade" />``.

  This serializer supports the use of namespaces for compound documents, for
  example to use inline SVG inside an XHTML document.

``html``
  The ``HTMLSerializer`` produces proper HTML markup. The main differences
  compared to ``xhtml`` serialization are that boolean attributes are
  minimized, empty tags are not self-closing (so it's ``<br>`` instead of
  ``<br />``), and that the contents of ``