.. -*- mode: rst; encoding: utf-8 -*-

==============
Markup Streams
==============

A stream is the common representation of markup as a *stream of events*.


.. contents:: Contents
   :depth: 1
.. sectnum::


Basics
======

A stream can be obtained in a number of ways. It can be:

* the result of parsing XML or HTML text, or
* programmatically generated, or
* the result of selecting a subset of another stream using an XPath
  expression.

For example, the functions ``XML()`` and ``HTML()`` can be used to convert
literal XML or HTML text to a markup stream::

  >>> from genshi import XML
  >>> stream = XML('<p class="intro">Some text and '
  ...              '<a href="http://example.org/">a link</a>.'
  ...              '<br/></p>')
  >>> stream
  <genshi.core.Stream object at ...>

The stream is the result of parsing the text into events. Each event is a tuple
of the form ``(kind, data, pos)``, where:

* ``kind`` defines what kind of event it is (such as the start of an element,
  text, a comment, etc.).
* ``data`` is the actual data associated with the event. How this looks depends
  on the event kind (see `event kinds`_).
* ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the
  event “comes from”.

::

  >>> for kind, data, pos in stream:
  ...     print kind, `data`, pos
  ...
  START (QName(u'p'), Attrs([(QName(u'class'), u'intro')])) (None, 1, 0)
  TEXT u'Some text and ' (None, 1, 17)
  START (QName(u'a'), Attrs([(QName(u'href'), u'http://example.org/')])) (None, 1, 31)
  TEXT u'a link' (None, 1, 61)
  END QName(u'a') (None, 1, 67)
  TEXT u'.' (None, 1, 71)
  START (QName(u'br'), Attrs()) (None, 1, 72)
  END QName(u'br') (None, 1, 77)
  END QName(u'p') (None, 1, 77)

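Because every event is a plain ``(kind, data, pos)`` tuple, ordinary iteration
is enough to pull information out of a stream. The following is a minimal
sketch that does not import Genshi: the events are written out by hand, plain
strings stand in for the event kind constants and for ``QName`` instances, and
``collect_text`` is a hypothetical helper name used only for illustration.

```python
# Hand-written events mimicking the (kind, data, pos) shape described above.
# Assumption: plain strings stand in for Genshi's event kind constants,
# and tag/attribute names are plain strings instead of QName instances.
events = [
    ('START', ('p', [('class', 'intro')]), (None, 1, 0)),
    ('TEXT', 'Some text and ', (None, 1, 17)),
    ('START', ('a', [('href', 'http://example.org/')]), (None, 1, 31)),
    ('TEXT', 'a link', (None, 1, 61)),
    ('END', 'a', (None, 1, 67)),
    ('TEXT', '.', (None, 1, 71)),
    ('END', 'p', (None, 1, 77)),
]


def collect_text(stream):
    """Join the data of all TEXT events, ignoring every other kind."""
    return ''.join(data for kind, data, pos in stream if kind == 'TEXT')


text = collect_text(events)
print(text)
```

The same loop should work on a real stream, since only the tuple shape of the
events is relied upon.
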

Filtering
=========

One important feature of markup streams is that you can apply *filters* to the
stream, either filters that come with Genshi, or your own custom filters.

A filter is simply a callable that accepts the stream as a parameter and
returns the filtered stream::

  def noop(stream):
      """A filter that doesn't actually do anything with the stream."""
      for kind, data, pos in stream:
          yield kind, data, pos

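A filter that actually changes something follows the same generator shape. The
sketch below is illustrative only: ``upper_text`` is a hypothetical filter, the
events are hand-built, and plain strings are assumed in place of Genshi's event
kind constants.

```python
def upper_text(stream):
    """A filter that upper-cases TEXT events and passes the rest through."""
    for kind, data, pos in stream:
        if kind == 'TEXT':
            # Events are not modified in place; a new tuple is yielded.
            yield kind, data.upper(), pos
        else:
            yield kind, data, pos


# Hand-built events (assumption: string kinds, string tag names).
events = [
    ('START', ('p', []), (None, 1, 0)),
    ('TEXT', 'hello', (None, 1, 3)),
    ('END', 'p', (None, 1, 8)),
]

result = list(upper_text(events))
print(result[1])
```
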

Filters can be applied in a number of ways. The simplest is to just call the
filter directly::

  stream = noop(stream)

The ``Stream`` class also provides a ``filter()`` method, which takes an
arbitrary number of filter callables and applies them all::

  stream = stream.filter(noop)

Finally, filters can also be applied using the *bitwise or* operator (``|``),
which allows a syntax similar to pipes on Unix shells::

  stream = stream | noop

One example of a filter included with Genshi is the ``HTMLSanitizer`` in
``genshi.filters``. It processes a stream of HTML markup, and strips out any
potentially dangerous constructs, such as JavaScript event handlers.
``HTMLSanitizer`` is not a function, but rather a class that implements
``__call__``, which means instances of the class are callable.

Both the ``filter()`` method and the pipe operator allow easy chaining of
filters::

  from genshi.filters import HTMLSanitizer
  stream = stream.filter(noop, HTMLSanitizer())

That is equivalent to::

  stream = stream | noop | HTMLSanitizer()

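To make the equivalence of ``filter()`` and ``|`` concrete, here is a toy
class that supports both. It is a sketch of the general technique (operator
overloading via ``__or__``), not Genshi's implementation; ``MiniStream``,
``drop_comments`` and ``strip_text`` are hypothetical names, and string kinds
are assumed.

```python
class MiniStream(object):
    """Toy stand-in for a stream class; not Genshi's implementation."""

    def __init__(self, events):
        self.events = events

    def __iter__(self):
        return iter(self.events)

    def filter(self, *filters):
        """Apply each filter in order, feeding the result into the next."""
        stream = self
        for func in filters:
            stream = MiniStream(func(stream))
        return stream

    def __or__(self, func):
        """Make ``stream | func`` behave like ``stream.filter(func)``."""
        return self.filter(func)


def drop_comments(stream):
    """Filter that removes COMMENT events."""
    for kind, data, pos in stream:
        if kind != 'COMMENT':
            yield kind, data, pos


def strip_text(stream):
    """Filter that strips surrounding whitespace from TEXT events."""
    for kind, data, pos in stream:
        if kind == 'TEXT':
            data = data.strip()
        yield kind, data, pos


events = [
    ('TEXT', '  hi  ', (None, 1, 0)),
    ('COMMENT', 'internal', (None, 1, 6)),
]

via_method = list(MiniStream(events).filter(drop_comments, strip_text))
via_pipe = list(MiniStream(events) | drop_comments | strip_text)
print(via_method == via_pipe)
```
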

Serialization
=============

The ``Stream`` class provides two methods for serializing a stream of events:
``serialize()`` and ``render()``. The former is a generator that yields the
output in chunks, as ``Markup`` objects (which are basically unicode strings
that are considered safe for output on the web). The latter returns a single
string, by default UTF-8 encoded.

Here's the output from ``serialize()``::

  >>> for output in stream.serialize():
  ...     print `output`
  ...
  <Markup u'<p class="intro">'>
  <Markup u'Some text and '>
  <Markup u'<a href="http://example.org/">'>
  <Markup u'a link'>
  <Markup u'</a>'>
  <Markup u'.'>
  <Markup u'<br/>'>
  <Markup u'</p>'>

And here's the output from ``render()``::

  >>> print stream.render()
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

Both methods can be passed a ``method`` parameter that determines how exactly
the events are serialized to text. This parameter can be either “xml” (the
default), “xhtml”, “html”, “text”, or a custom serializer class::

  >>> print stream.render('html')
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>

Note how the ``<br>`` element isn't closed, which is the right thing to do for
HTML.

In addition, the ``render()`` method takes an ``encoding`` parameter, which
defaults to “UTF-8”. If set to ``None``, the result will be a unicode string.

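The relationship between the two methods can be sketched in a few lines: one
function yields a chunk of text per event, and the other joins the chunks and
optionally encodes the result. This is a deliberately simplified illustration
under stated assumptions, not Genshi's serializer: it handles only ``START``,
``END`` and ``TEXT`` events, performs no escaping, and uses string kinds and
module-level functions instead of ``Stream`` methods.

```python
def serialize(stream):
    """Yield one chunk of markup text per event (toy XML output).

    Assumption: no escaping is done, so this is not safe for real data.
    """
    for kind, data, pos in stream:
        if kind == 'START':
            tag, attrs = data
            attr_text = ''.join(' %s="%s"' % (name, value)
                                for name, value in attrs)
            yield '<%s%s>' % (tag, attr_text)
        elif kind == 'END':
            yield '</%s>' % data
        elif kind == 'TEXT':
            yield data


def render(stream, encoding='utf-8'):
    """Join all chunks into one string; encode unless encoding is None."""
    output = ''.join(serialize(stream))
    if encoding is not None:
        return output.encode(encoding)
    return output


events = [
    ('START', ('p', [('class', 'intro')]), (None, 1, 0)),
    ('TEXT', 'Some text', (None, 1, 17)),
    ('END', 'p', (None, 1, 26)),
]

chunks = list(serialize(events))
as_text = render(events, encoding=None)
as_bytes = render(events)
print(as_text)
```
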

The different serializer classes in ``genshi.output`` can also be used
directly::

  >>> from genshi.filters import HTMLSanitizer
  >>> from genshi.output import TextSerializer
  >>> print ''.join(TextSerializer()(HTMLSanitizer()(stream)))
  Some text and a link.

The pipe operator allows a nicer syntax::

  >>> print stream | HTMLSanitizer() | TextSerializer()
  Some text and a link.


Using XPath
===========

XPath can be used to extract a specific subset of the stream via the
``select()`` method::

  >>> substream = stream.select('a')
  >>> substream
  <genshi.core.Stream object at ...>
  >>> print substream
  <a href="http://example.org/">a link</a>

Often, streams cannot be reused: in the above example, the sub-stream is based
on a generator. Once it has been serialized, it will have been fully consumed,
and cannot be rendered again. To work around this, you can wrap such a stream
in a ``list``::

  >>> from genshi import Stream
  >>> substream = Stream(list(stream.select('a')))
  >>> substream
  <genshi.core.Stream object at ...>
  >>> print substream
  <a href="http://example.org/">a link</a>
  >>> print substream.select('@href')
  http://example.org/
  >>> print substream.select('text()')
  a link

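The reuse problem described above is a general property of Python generators
and can be reproduced without Genshi. In this sketch ``select_text`` is a
hypothetical generator standing in for a selection, and the events are
hand-built with string kinds.

```python
def select_text(events):
    """Generator yielding only the TEXT events of the input."""
    for kind, data, pos in events:
        if kind == 'TEXT':
            yield kind, data, pos


events = [
    ('START', ('a', []), (None, 1, 0)),
    ('TEXT', 'a link', (None, 1, 3)),
    ('END', 'a', (None, 1, 9)),
]

# A generator can be consumed only once.
lazy = select_text(events)
first_pass = list(lazy)
second_pass = list(lazy)  # the generator is already exhausted here

# Buffering the events in a list makes them reusable.
buffered = list(select_text(events))
first_read = list(buffered)
second_read = list(buffered)

print(len(first_pass), len(second_pass), len(second_read))
```

The trade-off is memory: buffering holds every selected event at once, whereas
the generator produces them lazily.
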

See `Using XPath in Genshi`_ for more information about the XPath support in
Genshi.

.. _`Using XPath in Genshi`: xpath.html


.. _`event kinds`:

Event Kinds
===========

Every event in a stream is of one of several *kinds*, which also determines
what the ``data`` item of the event tuple looks like. The different kinds of
events are documented below.

.. note:: The ``data`` item is generally immutable. If the data is to be
   modified when processing a stream, it must be replaced by a new tuple.
   Effectively, this means the entire event tuple is immutable.

START
-----
The opening tag of an element.

For this kind of event, the ``data`` item is a tuple of the form
``(tagname, attrs)``, where ``tagname`` is a ``QName`` instance describing the
qualified name of the tag, and ``attrs`` is an ``Attrs`` instance containing
the attribute names and values associated with the tag (excluding namespace
declarations)::

  START, (QName(u'p'), Attrs([(QName(u'class'), u'intro')])), pos

END
---
The closing tag of an element.

The ``data`` item of end events consists of just a ``QName`` instance
describing the qualified name of the tag::

  END, QName(u'p'), pos

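Since event data is replaced rather than mutated, a filter that renames an
element has to yield new tuples for both the ``START`` and the ``END`` event.
The following sketch shows the pattern; ``rename_tag`` is a hypothetical
filter, and plain strings and lists are assumed in place of ``QName`` and
``Attrs`` instances.

```python
def rename_tag(stream, old, new):
    """Yield events with every <old> element renamed to <new>."""
    for kind, data, pos in stream:
        if kind == 'START' and data[0] == old:
            tagname, attrs = data
            # Build a fresh data tuple instead of changing the old one.
            yield kind, (new, attrs), pos
        elif kind == 'END' and data == old:
            yield kind, new, pos
        else:
            yield kind, data, pos


# Hand-built events (assumption: string kinds and string tag names).
events = [
    ('START', ('p', [('class', 'intro')]), (None, 1, 0)),
    ('TEXT', 'Hello', (None, 1, 17)),
    ('END', 'p', (None, 1, 22)),
]

renamed = list(rename_tag(events, 'p', 'div'))
print(renamed[0][1][0], renamed[2][1])
```
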

TEXT
----
Character data outside of elements and comments.

For text events, the ``data`` item should be a unicode object::

  TEXT, u'Hello, world!', pos

START_NS
--------
The start of a namespace mapping, binding a namespace prefix to a URI.

The ``data`` item of this kind of event is a tuple of the form
``(prefix, uri)``, where ``prefix`` is the namespace prefix and ``uri`` is the
full URI to which the prefix is bound. Both should be unicode objects. If the
namespace is not bound to any prefix, the ``prefix`` item is an empty string::

  START_NS, (u'svg', u'http://www.w3.org/2000/svg'), pos

END_NS
------
The end of a namespace mapping.

The ``data`` item of such events consists of only the namespace prefix (a
unicode object)::

  END_NS, u'svg', pos

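``START_NS`` and ``END_NS`` events bracket the region in which a prefix is
bound, so a consumer can track the bindings in scope with a small stack per
prefix. The sketch below assumes that a ``START_NS`` event precedes the
``START`` event of the element that declares the namespace and that the
matching ``END_NS`` follows that element's ``END`` event; that ordering, the
string kinds, and the helper name ``namespaces_at_start`` are assumptions made
for illustration.

```python
def namespaces_at_start(stream):
    """Yield (tagname, {prefix: uri}) for every START event."""
    bound = {}  # prefix -> stack of URIs, innermost binding last
    for kind, data, pos in stream:
        if kind == 'START_NS':
            prefix, uri = data
            bound.setdefault(prefix, []).append(uri)
        elif kind == 'END_NS':
            # data is just the prefix whose binding ends here.
            bound[data].pop()
        elif kind == 'START':
            tagname, attrs = data
            in_scope = {p: uris[-1] for p, uris in bound.items() if uris}
            yield tagname, in_scope


SVG = 'http://www.w3.org/2000/svg'
events = [
    ('START_NS', ('svg', SVG), (None, 1, 0)),
    ('START', ('svg:rect', []), (None, 1, 0)),
    ('END', 'svg:rect', (None, 1, 40)),
    ('END_NS', 'svg', (None, 1, 40)),
    ('START', ('p', []), (None, 2, 0)),
    ('END', 'p', (None, 2, 7)),
]

scopes = list(namespaces_at_start(events))
print(scopes)
```
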

DOCTYPE
-------
A document type declaration.

For this type of event, the ``data`` item is a tuple of the form
``(name, pubid, sysid)``, where ``name`` is the name of the root element,
``pubid`` is the public identifier of the DTD (or ``None``), and ``sysid`` is
the system identifier of the DTD (or ``None``)::

  DOCTYPE, (u'html', u'-//W3C//DTD XHTML 1.0 Transitional//EN', \
            u'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'), pos

COMMENT
-------
A comment.

For such events, the ``data`` item is a unicode object containing all character
data between the comment delimiters::

  COMMENT, u'Commented out', pos

PI
--
A processing instruction.

The ``data`` item is a tuple of the form ``(target, data)`` for processing
instructions, where ``target`` is the target of the PI (used to identify the
application by which the instruction should be processed), and ``data`` is text
following the target (excluding the terminating question mark)::

  PI, (u'php', u'echo "Yo" '), pos

START_CDATA
-----------
Marks the beginning of a ``CDATA`` section.

The ``data`` item for such events is always ``None``::

  START_CDATA, None, pos

END_CDATA
---------
Marks the end of a ``CDATA`` section.

The ``data`` item for such events is always ``None``::

  END_CDATA, None, pos

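Because ``START_CDATA`` and ``END_CDATA`` carry no data of their own, they are
useful mainly as markers around the ``TEXT`` events between them. A minimal
sketch of a consumer that collects only the text inside ``CDATA`` sections
follows; ``cdata_text`` is a hypothetical helper, and string kinds are assumed.

```python
def cdata_text(stream):
    """Yield the data of TEXT events that occur inside a CDATA section."""
    inside = False
    for kind, data, pos in stream:
        if kind == 'START_CDATA':
            inside = True
        elif kind == 'END_CDATA':
            inside = False
        elif kind == 'TEXT' and inside:
            yield data


# Hand-built events (assumption: string kinds and string tag names).
events = [
    ('START', ('script', []), (None, 1, 0)),
    ('TEXT', 'before ', (None, 1, 8)),
    ('START_CDATA', None, (None, 1, 15)),
    ('TEXT', 'if (a < b) run();', (None, 1, 24)),
    ('END_CDATA', None, (None, 1, 41)),
    ('TEXT', ' after', (None, 1, 44)),
    ('END', 'script', (None, 1, 50)),
]

inside_cdata = list(cdata_text(events))
print(inside_cdata)
```
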