annotate markup/core.py @ 17:74cc70129d04 trunk
Refactoring to address #6: all match templates are now processed by a single filter, which means that match templates added by included templates are properly applied. A side effect of this refactoring is that `Context` objects may not be reused across multiple template processing runs.
Also, output filters are now applied in the `Stream.serialize()` method instead of by the `Template.generate()` method, which just makes more sense.
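The filter chaining that this changeset moves into `Stream.serialize()` can be sketched roughly as follows. This is a minimal illustration, not the project's code: `uppercase_text` and `apply_filters` are hypothetical names invented here, and events are plain `(kind, data, position)` tuples with string kinds rather than `StreamEventKind` instances.

```python
# Hypothetical stand-ins for illustration; these are not markup.filters classes.
def uppercase_text(stream):
    """Example filter: upper-case the data of TEXT events, pass others through."""
    for kind, data, pos in stream:
        if kind == 'TEXT':
            data = data.upper()
        yield kind, data, pos


def apply_filters(events, filters):
    """Wrap the event stream in each filter in turn, mirroring the
    `for filter_ in filters` loop in serialize()."""
    stream = events
    for filter_ in filters:
        stream = filter_(iter(stream))
    return stream


events = [
    ('START', 'p', (1, 0)),
    ('TEXT', 'hello', (1, 3)),
    ('END', 'p', (1, 8)),
]
filtered = list(apply_filters(events, [uppercase_text]))
```

Because each filter is a callable that takes an iterator and returns an iterator, the chain stays lazy until the serializer (here, `list()`) consumes it.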
author | cmlenz |
---|---|
date | Sun, 18 Jun 2006 22:33:33 +0000 |
parents | f77f7a91aa46 |
children | 5420cfe42d36 |
rev | line source |
---|---|
1 | 1 # -*- coding: utf-8 -*- |
2 # | |
3 # Copyright (C) 2006 Christopher Lenz | |
4 # All rights reserved. | |
5 # | |
6 # This software is licensed as described in the file COPYING, which | |
7 # you should have received as part of this distribution. The terms | |
8 # are also available at http://trac.edgewall.com/license.html. | |
9 # | |
10 # This software consists of voluntary contributions made by many | |
11 # individuals. For the exact contribution history, see the revision | |
12 # history and logs, available at http://projects.edgewall.com/trac/. | |
13 | |
14 """Core classes for markup processing.""" | |
15 | |
16 import htmlentitydefs | |
17 import re | |
18 from StringIO import StringIO | |
19 | |
20 __all__ = ['Stream', 'Markup', 'escape', 'unescape', 'Namespace', 'QName'] | |
21 | |
22 | |
17 | 23 class StreamEventKind(str): |
1 | 24 """A kind of event on an XML stream.""" |
25 | |
26 | |
27 class Stream(object): | |
28 """Represents a stream of markup events. | |
29 | |
30 This class is basically an iterator over the events. | |
31 | |
32 Also provided are ways to serialize the stream to text. The `serialize()` | |
33 method will return an iterator over generated strings, while `render()` | |
34 returns the complete generated text at once. Both accept various parameters | |
35 that impact the way the stream is serialized. | |
36 | |
37 Stream events are tuples of the form: | |
38 | |
39 (kind, data, position) | |
40 | |
41 where `kind` is the event kind (such as `START`, `END`, `TEXT`, etc), `data` | |
42 depends on the kind of event, and `position` is a `(line, offset)` tuple | |
43 that contains the location of the original element or text in the input. | |
44 """ | |
45 __slots__ = ['events'] | |
46 | |
17 | 47 START = StreamEventKind('START') # a start tag |
48 END = StreamEventKind('END') # an end tag | |
49 TEXT = StreamEventKind('TEXT') # literal text | |
50 PROLOG = StreamEventKind('PROLOG') # XML prolog | |
51 DOCTYPE = StreamEventKind('DOCTYPE') # doctype declaration | |
52 START_NS = StreamEventKind('START-NS') # start namespace mapping | |
53 END_NS = StreamEventKind('END-NS') # end namespace mapping | |
54 PI = StreamEventKind('PI') # processing instruction | |
55 COMMENT = StreamEventKind('COMMENT') # comment | |
1 | 56 |
57 def __init__(self, events): | |
58 """Initialize the stream with a sequence of markup events. | |
59 | |
60 @param events: a sequence or iterable providing the events | |
61 """ | |
62 self.events = events | |
63 | |
64 def __iter__(self): | |
65 return iter(self.events) | |
66 | |
17 | 67 def render(self, method='xml', encoding='utf-8', filters=None, **kwargs): |
1 | 68 """Return a string representation of the stream. |
69 | |
70 @param method: determines how the stream is serialized; can be either | |
71 'xml' or 'html', or a custom `Serializer` subclass | |
72 @param encoding: how the output string should be encoded; if set to | |
73 `None`, this method returns a `unicode` object | |
74 | |
75 Any additional keyword arguments are passed to the serializer, and thus | |
76 depend on the `method` parameter value. | |
77 """ | |
17 | 78 generator = self.serialize(method=method, filters=filters, **kwargs) |
79 output = u''.join(list(generator)) | |
1 | 80 if encoding is not None: |
9 | 81 return output.encode(encoding) |
8 | 82 return output |
1 | 83 |
84 def select(self, path): | |
85 """Return a new stream that contains the events matching the given | |
86 XPath expression. | |
87 | |
88 @param path: a string containing the XPath expression | |
89 """ | |
90 from markup.path import Path | |
17 | 91 return Path(path).select(self) |
1 | 92 |
17 | 93 def serialize(self, method='xml', filters=None, **kwargs): |
1 | 94 """Generate strings corresponding to a specific serialization of the |
95 stream. | |
96 | |
97 Unlike the `render()` method, this method is a generator that returns | |
98 the serialized output incrementally, as opposed to returning a single | |
99 string. | |
100 | |
101 @param method: determines how the stream is serialized; can be either | |
102 'xml' or 'html', or a custom `Serializer` subclass | |
103 """ | |
17 | 104 from markup.filters import WhitespaceFilter |
1 | 105 from markup import output |
106 cls = method | |
107 if isinstance(method, basestring): | |
108 cls = {'xml': output.XMLSerializer, | |
109 'html': output.HTMLSerializer}[method] | |
110 else: | |
111 assert issubclass(cls, output.Serializer) | |
112 serializer = cls(**kwargs) | |
17 | 113 |
114 stream = self | |
115 if filters is None: | |
116 filters = [WhitespaceFilter()] | |
117 for filter_ in filters: | |
118 stream = filter_(iter(stream)) | |
119 | |
120 return serializer.serialize(stream) | |
1 | 121 |
122 def __str__(self): | |
123 return self.render() | |
124 | |
125 def __unicode__(self): | |
126 return self.render(encoding=None) | |
127 | |
128 | |
129 class Attributes(list): | |
130 | |
131 def __init__(self, attrib=None): | |
132 list.__init__(self, map(lambda (k, v): (QName(k), v), attrib or [])) | |
133 | |
134 def __contains__(self, name): | |
135 return name in [attr for attr, value in self] | |
136 | |
137 def get(self, name, default=None): | |
138 for attr, value in self: | |
139 if attr == name: | |
140 return value | |
141 return default | |
142 | |
5 | 143 def remove(self, name): |
144 for idx, (attr, _) in enumerate(self): | |
145 if attr == name: | |
146 del self[idx] | |
147 break | |
148 | |
1 | 149 def set(self, name, value): |
150 for idx, (attr, _) in enumerate(self): | |
151 if attr == name: | |
152 self[idx] = (attr, value) | |
153 break | |
154 else: | |
155 self.append((QName(name), value)) | |
156 | |
157 | |
158 class Markup(unicode): | |
159 """Marks a string as being safe for inclusion in HTML/XML output without | |
160 needing to be escaped. | |
161 """ | |
162 def __new__(self, text='', *args): | |
163 if args: | |
164 text %= tuple([escape(arg) for arg in args]) | |
165 return unicode.__new__(self, text) | |
166 | |
167 def __add__(self, other): | |
168 return Markup(unicode(self) + Markup.escape(other)) | |
169 | |
170 def __mod__(self, args): | |
171 if not isinstance(args, (list, tuple)): | |
172 args = [args] | |
173 return Markup(unicode.__mod__(self, | |
174 tuple([escape(arg) for arg in args]))) | |
175 | |
176 def __mul__(self, num): | |
177 return Markup(unicode(self) * num) | |
178 | |
17 | 179 def __repr__(self): |
180 return '<%s "%s">' % (self.__class__.__name__, self) | |
181 | |
1 | 182 def join(self, seq): |
183 return Markup(unicode(self).join([Markup.escape(item) for item in seq])) | |
184 | |
185 def stripentities(self, keepxmlentities=False): | |
186 """Return a copy of the text with any character or numeric entities | |
187 replaced by the equivalent UTF-8 characters. | |
188 | |
189 If the `keepxmlentities` parameter is provided and evaluates to `True`, | |
17 | 190 the core XML entities (&amp;, &apos;, &gt;, &lt; and &quot;) are not |
191 stripped. | |
1 | 192 """ |
193 def _replace_entity(match): | |
194 if match.group(1): # numeric entity | |
195 ref = match.group(1) | |
196 if ref.startswith('x'): | |
197 ref = int(ref[1:], 16) | |
198 else: | |
199 ref = int(ref, 10) | |
200 return unichr(ref) | |
201 else: # character entity | |
202 ref = match.group(2) | |
203 if keepxmlentities and ref in ('amp', 'apos', 'gt', 'lt', 'quot'): | |
204 return '&%s;' % ref | |
205 try: | |
206 codepoint = htmlentitydefs.name2codepoint[ref] | |
207 return unichr(codepoint) | |
208 except KeyError: | |
209 if keepxmlentities: | |
210 return '&%s;' % ref | |
211 else: | |
212 return ref | |
213 return Markup(re.sub(r'&(?:#((?:\d+)|(?:[xX][0-9a-fA-F]+));?|(\w+);)', | |
214 _replace_entity, self)) | |
215 | |
216 def striptags(self): | |
217 """Return a copy of the text with all XML/HTML tags removed.""" | |
218 return Markup(re.sub(r'<[^>]*?>', '', self)) | |
219 | |
220 def escape(cls, text, quotes=True): | |
221 """Create a Markup instance from a string and escape special characters | |
222 it may contain (<, >, & and \"). | |
223 | |
224 If the `quotes` parameter is set to `False`, the \" character is left | |
225 as is. Escaping quotes is generally only required for strings that are | |
226 to be used in attribute values. | |
227 """ | |
228 if isinstance(text, cls): | |
229 return text | |
230 text = unicode(text) | |
231 if not text: | |
232 return cls() | |
233 text = text.replace('&', '&amp;') \ | |
234 .replace('<', '&lt;') \ | |
235 .replace('>', '&gt;') | |
236 if quotes: | |
237 text = text.replace('"', '&#34;') | |
238 return cls(text) | |
239 escape = classmethod(escape) | |
240 | |
241 def unescape(self): | |
242 """Reverse-escapes &, <, > and \" and returns a `unicode` object.""" | |
243 if not self: | |
244 return '' | |
245 return unicode(self).replace('&#34;', '"') \ | |
246 .replace('&gt;', '>') \ | |
247 .replace('&lt;', '<') \ | |
248 .replace('&amp;', '&') | |
249 | |
250 def plaintext(self, keeplinebreaks=True): | |
6 | 251 """Returns the text as a `unicode` string with all entities and tags |
252 removed. | |
253 """ | |
1 | 254 text = unicode(self.striptags().stripentities()) |
255 if not keeplinebreaks: | |
256 text = text.replace('\n', ' ') | |
257 return text | |
258 | |
259 def sanitize(self): | |
260 from markup.filters import HTMLSanitizer | |
261 from markup.input import HTMLParser | |
17 | 262 text = StringIO(self.stripentities(keepxmlentities=True)) |
263 return Stream(HTMLSanitizer()(HTMLParser(text))) | |
1 | 264 |
265 | |
266 escape = Markup.escape | |
267 | |
268 def unescape(text): | |
269 """Reverse-escapes &, <, > and \" and returns a `unicode` object.""" | |
270 if not isinstance(text, Markup): | |
271 return text | |
272 return text.unescape() | |
273 | |
274 | |
275 class Namespace(object): | |
276 | |
277 def __init__(self, uri): | |
278 self.uri = uri | |
279 | |
280 def __getitem__(self, name): | |
281 return QName(self.uri + '}' + name) | |
282 | |
283 __getattr__ = __getitem__ | |
284 | |
285 def __repr__(self): | |
286 return '<Namespace "%s">' % self.uri | |
287 | |
288 def __str__(self): | |
289 return self.uri | |
290 | |
291 def __unicode__(self): | |
292 return unicode(self.uri) | |
293 | |
294 | |
295 class QName(unicode): | |
296 """A qualified element or attribute name. | |
297 | |
298 The unicode value of instances of this class contains the qualified name of | |
299 the element or attribute, in the form `{namespace}localname`. The namespace | |
300 URI can be obtained through the additional `namespace` attribute, while the | |
301 local name can be accessed through the `localname` attribute. | |
302 """ | |
303 __slots__ = ['namespace', 'localname'] | |
304 | |
305 def __new__(cls, qname): | |
306 if isinstance(qname, QName): | |
307 return qname | |
308 | |
309 parts = qname.split('}', 1) | |
310 if qname.find('}') > 0: | |
311 self = unicode.__new__(cls, '{' + qname) | |
312 self.namespace = parts[0] | |
313 self.localname = parts[1] | |
314 else: | |
315 self = unicode.__new__(cls, qname) | |
316 self.namespace = None | |
317 self.localname = qname | |
318 return self |
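
The replacement order in `Markup.escape()` and `Markup.unescape()` matters: `&` is escaped first and unescaped last, so entities produced by one step are never re-processed by the next. The sketch below illustrates this with plain strings. It is a standalone approximation rather than the class above: it omits the `Markup` wrapper, and the use of `&#34;` for the double quote is an assumption.

```python
def escape(text, quotes=True):
    """Escape &, < and > (and optionally ") for XML/HTML output.

    '&' must be replaced first; otherwise the ampersands introduced
    by the later replacements would be escaped a second time.
    """
    text = (text.replace('&', '&amp;')
                .replace('<', '&lt;')
                .replace('>', '&gt;'))
    if quotes:
        # Assumption: the numeric form is used for the double quote.
        text = text.replace('"', '&#34;')
    return text


def unescape(text):
    """Reverse escape(); '&amp;' is handled last so that literal text
    such as '&amp;lt;' decodes to '&lt;' and not to '<'."""
    return (text.replace('&#34;', '"')
                .replace('&gt;', '>')
                .replace('&lt;', '<')
                .replace('&amp;', '&'))


sample = 'a < b && "c" &lt;'
escaped = escape(sample)
round_trip = unescape(escaped)
```

With this ordering, `round_trip` equals `sample` even though the input already contains the literal text `&lt;`.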