annotate genshi/filters/html.py @ 839:85a61d0bd67b stable-0.5.x

Ported [1046:1047] to 0.5.x branch.
author cmlenz
date Tue, 17 Mar 2009 17:20:04 +0000
parents 09a90feb9269
children b09f746b4881
rev   line source
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
1 # -*- coding: utf-8 -*-
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
2 #
719
09a90feb9269 Fix copyright years.
cmlenz
parents: 706
diff changeset
3 # Copyright (C) 2006-2008 Edgewall Software
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
4 # All rights reserved.
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
5 #
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
6 # This software is licensed as described in the file COPYING, which
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
7 # you should have received as part of this distribution. The terms
230
24757b771651 Renamed Markup to Genshi in repository.
cmlenz
parents: 182
diff changeset
8 # are also available at http://genshi.edgewall.org/wiki/License.
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
9 #
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
10 # This software consists of voluntary contributions made by many
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
11 # individuals. For the exact contribution history, see the revision
230
24757b771651 Renamed Markup to Genshi in repository.
cmlenz
parents: 182
diff changeset
12 # history and logs, available at http://genshi.edgewall.org/log/.
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
13
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
14 """Implementation of a number of stream filters."""
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
15
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
16 try:
706
4cece3a7dcae Fix Python 2.3 compatibility of HTMLSanitizer doctest.
cmlenz
parents: 584
diff changeset
17 set
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
18 except NameError:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
19 from sets import ImmutableSet as frozenset
706
4cece3a7dcae Fix Python 2.3 compatibility of HTMLSanitizer doctest.
cmlenz
parents: 584
diff changeset
20 from sets import Set as set
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
21 import re
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
22
403
32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`.
cmlenz
parents: 363
diff changeset
23 from genshi.core import Attrs, QName, stripentities
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
24 from genshi.core import END, START, TEXT, COMMENT
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
25
363
caf7b68ab5dc Parse template includes at parse time to avoid some runtime overhead.
cmlenz
parents: 345
diff changeset
26 __all__ = ['HTMLFormFiller', 'HTMLSanitizer']
425
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
27 __docformat__ = 'restructuredtext en'
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
28
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
29
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
30 class HTMLFormFiller(object):
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
31 """A stream filter that can populate HTML forms from a dictionary of values.
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
32
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
33 >>> from genshi.input import HTML
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
34 >>> html = HTML('''<form>
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
35 ... <p><input type="text" name="foo" /></p>
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
36 ... </form>''')
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
37 >>> filler = HTMLFormFiller(data={'foo': 'bar'})
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
38 >>> print html | filler
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
39 <form>
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
40 <p><input type="text" name="foo" value="bar"/></p>
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
41 </form>
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
42 """
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
43 # TODO: only select the first radio button, and the first select option
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
44 # (if not in a multiple-select)
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
45 # TODO: only apply to elements in the XHTML namespace (or no namespace)?
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
46
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
47 def __init__(self, name=None, id=None, data=None):
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
48 """Create the filter.
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
49
425
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
50 :param name: The name of the form that should be populated. If this
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
51 parameter is given, only forms where the ``name`` attribute
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
52 value matches the parameter are processed.
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
53 :param id: The ID of the form that should be populated. If this
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
54 parameter is given, only forms where the ``id`` attribute
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
55 value matches the parameter are processed.
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
56 :param data: The dictionary of form values, where the keys are the names
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
57 of the form fields, and the values are the values to fill
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
58 in.
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
59 """
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
60 self.name = name
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
61 self.id = id
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
62 if data is None:
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
63 data = {}
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
64 self.data = data
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
65
439
b84c11a49ad2 Add support for adding custom template filters by passing a custom callback function to the `TemplateLoader`. Closes #89 (see added unit test).
cmlenz
parents: 431
diff changeset
66 def __call__(self, stream):
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
67 """Apply the filter to the given stream.
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
68
425
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
69 :param stream: the markup event stream to filter
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
70 """
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
71 in_form = in_select = in_option = in_textarea = False
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
72 select_value = option_value = textarea_value = None
584
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
73 option_start = None
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
74 option_text = []
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
75 no_option_value = False
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
76
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
77 for kind, data, pos in stream:
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
78
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
79 if kind is START:
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
80 tag, attrs = data
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
81 tagname = tag.localname
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
82
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
83 if tagname == 'form' and (
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
84 self.name and attrs.get('name') == self.name or
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
85 self.id and attrs.get('id') == self.id or
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
86 not (self.id or self.name)):
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
87 in_form = True
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
88
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
89 elif in_form:
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
90 if tagname == 'input':
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
91 type = attrs.get('type')
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
92 if type in ('checkbox', 'radio'):
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
93 name = attrs.get('name')
471
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
94 if name and name in self.data:
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
95 value = self.data[name]
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
96 declval = attrs.get('value')
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
97 checked = False
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
98 if isinstance(value, (list, tuple)):
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
99 if declval:
584
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
100 checked = declval in [unicode(v) for v
415
c267061c961f `HTMLFormFiller` now correctly deals with non-string values in the data dictionary for select/checkbox/radio controls.
cmlenz
parents: 408
diff changeset
101 in value]
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
102 else:
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
103 checked = bool(filter(None, value))
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
104 else:
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
105 if declval:
584
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
106 checked = declval == unicode(value)
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
107 elif type == 'checkbox':
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
108 checked = bool(value)
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
109 if checked:
403
32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`.
cmlenz
parents: 363
diff changeset
110 attrs |= [(QName('checked'), 'checked')]
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
111 elif 'checked' in attrs:
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
112 attrs -= 'checked'
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
113 elif type in (None, 'hidden', 'text'):
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
114 name = attrs.get('name')
471
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
115 if name and name in self.data:
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
116 value = self.data[name]
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
117 if isinstance(value, (list, tuple)):
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
118 value = value[0]
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
119 if value is not None:
403
32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`.
cmlenz
parents: 363
diff changeset
120 attrs |= [(QName('value'), unicode(value))]
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
121 elif tagname == 'select':
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
122 name = attrs.get('name')
471
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
123 if name in self.data:
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
124 select_value = self.data[name]
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
125 in_select = True
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
126 elif tagname == 'textarea':
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
127 name = attrs.get('name')
471
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
128 if name in self.data:
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
129 textarea_value = self.data.get(name)
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
130 if isinstance(textarea_value, (list, tuple)):
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
131 textarea_value = textarea_value[0]
17ce8bf006d7 The `HTMLFormFiller` stream filter no longer alters form elements for which the data element contains no corresponding item.
cmlenz
parents: 446
diff changeset
132 in_textarea = True
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
133 elif in_select and tagname == 'option':
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
134 option_start = kind, data, pos
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
135 option_value = attrs.get('value')
584
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
136 if option_value is None:
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
137 no_option_value = True
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
138 option_value = ''
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
139 in_option = True
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
140 continue
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
141 yield kind, (tag, attrs), pos
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
142
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
143 elif in_form and kind is TEXT:
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
144 if in_select and in_option:
584
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
145 if no_option_value:
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
146 option_value += data
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
147 option_text.append((kind, data, pos))
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
148 continue
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
149 elif in_textarea:
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
150 continue
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
151 yield kind, data, pos
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
152
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
153 elif in_form and kind is END:
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
154 tagname = data.localname
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
155 if tagname == 'form':
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
156 in_form = False
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
157 elif tagname == 'select':
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
158 in_select = False
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
159 select_value = None
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
160 elif in_select and tagname == 'option':
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
161 if isinstance(select_value, (tuple, list)):
584
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
162 selected = option_value in [unicode(v) for v
415
c267061c961f `HTMLFormFiller` now correctly deals with non-string values in the data dictionary for select/checkbox/radio controls.
cmlenz
parents: 408
diff changeset
163 in select_value]
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
164 else:
584
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
165 selected = option_value == unicode(select_value)
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
166 okind, (tag, attrs), opos = option_start
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
167 if selected:
403
32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`.
cmlenz
parents: 363
diff changeset
168 attrs |= [(QName('selected'), 'selected')]
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
169 elif 'selected' in attrs:
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
170 attrs -= 'selected'
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
171 yield okind, (tag, attrs), opos
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
172 if option_text:
584
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
173 for event in option_text:
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
174 yield event
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
175 in_option = False
584
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
176 no_option_value = False
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
177 option_start = option_value = None
84137a71a4ca Fixed a few cases where HTMLFormFiller didn't work well with option elements:
jonas
parents: 576
diff changeset
178 option_text = []
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
179 elif tagname == 'textarea':
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
180 if textarea_value:
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
181 yield TEXT, unicode(textarea_value), pos
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
182 in_textarea = False
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
183 yield kind, data, pos
275
7f24dd6fb904 Integrated `HTMLFormFiller` filter initially presented as a [wiki:FormFilling#Usingatemplatefilter recipe].
cmlenz
parents: 230
diff changeset
184
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
185 else:
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
186 yield kind, data, pos
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
187
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
188
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
189 class HTMLSanitizer(object):
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
190 """A filter that removes potentially dangerous HTML tags and attributes
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
191 from the stream.
431
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
192
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
193 >>> from genshi import HTML
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
194 >>> html = HTML('<div><script>alert(document.cookie)</script></div>')
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
195 >>> print html | HTMLSanitizer()
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
196 <div/>
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
197
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
198 The default set of safe tags and attributes can be modified when the filter
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
199 is instantiated. For example, to allow inline ``style`` attributes, the
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
200 following instantiation would work:
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
201
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
202 >>> html = HTML('<div style="background: #000"></div>')
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
203 >>> sanitizer = HTMLSanitizer(safe_attrs=HTMLSanitizer.SAFE_ATTRS | set(['style']))
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
204 >>> print html | sanitizer
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
205 <div style="background: #000"/>
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
206
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
207 Note that even in this case, the filter *does* attempt to remove dangerous
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
208 constructs from style attributes:
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
209
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
210 >>> html = HTML('<div style="background: url(javascript:void); color: #000"></div>')
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
211 >>> print html | sanitizer
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
212 <div style="color: #000"/>
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
213
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
214 This handles HTML entities, Unicode escapes in CSS and JavaScript text, as
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
215 well as a lot of other things. However, the ``style`` attribute is still excluded by
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
216 default because it is very hard for such sanitizing to be completely safe,
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
217 especially considering how much error recovery current web browsers perform.
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
218
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
219 :warn: Note that this special processing of CSS is currently only applied to
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
220 style attributes, **not** style elements.
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
221 """
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
222
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
223 SAFE_TAGS = frozenset(['a', 'abbr', 'acronym', 'address', 'area', 'b',
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
224 'big', 'blockquote', 'br', 'button', 'caption', 'center', 'cite',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
225 'code', 'col', 'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
226 'em', 'fieldset', 'font', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
227 'hr', 'i', 'img', 'input', 'ins', 'kbd', 'label', 'legend', 'li', 'map',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
228 'menu', 'ol', 'optgroup', 'option', 'p', 'pre', 'q', 's', 'samp',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
229 'select', 'small', 'span', 'strike', 'strong', 'sub', 'sup', 'table',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
230 'tbody', 'td', 'textarea', 'tfoot', 'th', 'thead', 'tr', 'tt', 'u',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
231 'ul', 'var'])
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
232
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
233 SAFE_ATTRS = frozenset(['abbr', 'accept', 'accept-charset', 'accesskey',
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
234 'action', 'align', 'alt', 'axis', 'bgcolor', 'border', 'cellpadding',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
235 'cellspacing', 'char', 'charoff', 'charset', 'checked', 'cite', 'class',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
236 'clear', 'cols', 'colspan', 'color', 'compact', 'coords', 'datetime',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
237 'dir', 'disabled', 'enctype', 'for', 'frame', 'headers', 'height',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
238 'href', 'hreflang', 'hspace', 'id', 'ismap', 'label', 'lang',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
239 'longdesc', 'maxlength', 'media', 'method', 'multiple', 'name',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
240 'nohref', 'noshade', 'nowrap', 'prompt', 'readonly', 'rel', 'rev',
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
241 'rows', 'rowspan', 'rules', 'scope', 'selected', 'shape', 'size',
431
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
242 'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title',
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
243 'type', 'usemap', 'valign', 'value', 'vspace', 'width'])
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
244
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
245 SAFE_SCHEMES = frozenset(['file', 'ftp', 'http', 'https', 'mailto', None])
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
246
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
247 URI_ATTRS = frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc',
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
248 'src'])
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
249
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
250 def __init__(self, safe_tags=SAFE_TAGS, safe_attrs=SAFE_ATTRS,
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
251 safe_schemes=SAFE_SCHEMES, uri_attrs=URI_ATTRS):
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
252 """Create the sanitizer.
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
253
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
254 The exact set of allowed elements and attributes can be configured.
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
255
425
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
256 :param safe_tags: a set of tag names that are considered safe
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
257 :param safe_attrs: a set of attribute names that are considered safe
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
258 :param safe_schemes: a set of URI schemes that are considered safe
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
259 :param uri_attrs: a set of names of attributes that contain URIs
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
260 """
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
261 self.safe_tags = safe_tags
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
262 "The set of tag names that are considered safe."
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
263 self.safe_attrs = safe_attrs
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
264 "The set of attribute names that are considered safe."
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
265 self.uri_attrs = uri_attrs
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
266 "The set of names of attributes that may contain URIs."
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
267 self.safe_schemes = safe_schemes
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
268 "The set of URI schemes that are considered safe."
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
269
439
b84c11a49ad2 Add support for adding custom template filters by passing a custom callback function to the `TemplateLoader`. Closes #89 (see added unit test).
cmlenz
parents: 431
diff changeset
270 def __call__(self, stream):
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
271 """Apply the filter to the given stream.
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
272
425
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 415
diff changeset
273 :param stream: the markup event stream to filter
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
274 """
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
275 waiting_for = None
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
276
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
277 for kind, data, pos in stream:
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
278 if kind is START:
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
279 if waiting_for:
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
280 continue
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
281 tag, attrs = data
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
282 if tag not in self.safe_tags:
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
283 waiting_for = tag
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
284 continue
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
285
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
286 new_attrs = []
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
287 for attr, value in attrs:
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
288 value = stripentities(value)
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
289 if attr not in self.safe_attrs:
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
290 continue
277
c9fd81953169 The `HTMLSanitizer` now lets you override the default set of tag and attribute names that are considered safe.
cmlenz
parents: 275
diff changeset
291 elif attr in self.uri_attrs:
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
292 # Don't allow URI schemes such as "javascript:"
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
293 if not self.is_safe_uri(value):
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
294 continue
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
295 elif attr == 'style':
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
296 # Remove dangerous CSS declarations from inline styles
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
297 decls = self.sanitize_css(value)
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
298 if not decls:
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
299 continue
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
300 value = '; '.join(decls)
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
301 new_attrs.append((attr, value))
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
302
345
8e75b83d3e71 Make `Attrs` instances immutable.
cmlenz
parents: 305
diff changeset
303 yield kind, (tag, Attrs(new_attrs)), pos
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
304
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
305 elif kind is END:
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
306 tag = data
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
307 if waiting_for:
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
308 if waiting_for == tag:
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
309 waiting_for = None
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
310 else:
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
311 yield kind, data, pos
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
312
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
313 elif kind is not COMMENT:
123
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
314 if not waiting_for:
93bbdcf9428b Fix for #18: whitespace in space-sensitive elements such as `<pre>` and `<textarea>` is now preserved.
cmlenz
parents: 113
diff changeset
315 yield kind, data, pos
431
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
316
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
317 def is_safe_uri(self, uri):
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
318 """Determine whether the given URI is to be considered safe for
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
319 inclusion in the output.
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
320
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
321 The default implementation checks whether the scheme of the URI is in
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
322 the set of allowed URI schemes (`safe_schemes`).
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
323
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
324 >>> sanitizer = HTMLSanitizer()
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
325 >>> sanitizer.is_safe_uri('http://example.org/')
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
326 True
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
327 >>> sanitizer.is_safe_uri('javascript:alert(document.cookie)')
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
328 False
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
329
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
330 :param uri: the URI to check
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
331 :return: `True` if the URI can be considered safe, `False` otherwise
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
332 :rtype: `bool`
576
53f4088e1e3b Improve docs on `Stream.select()` for #135.
cmlenz
parents: 571
diff changeset
333 :since: version 0.4.3
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
334 """
839
85a61d0bd67b Ported [1046:1047] to 0.5.x branch.
cmlenz
parents: 719
diff changeset
335 if '#' in uri:
85a61d0bd67b Ported [1046:1047] to 0.5.x branch.
cmlenz
parents: 719
diff changeset
336 uri = uri.split('#', 1)[0] # Strip out the fragment identifier
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
337 if ':' not in uri:
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
338 return True # This is a relative URI
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
339 chars = [char for char in uri.split(':', 1)[0] if char.isalnum()]
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
340 return ''.join(chars).lower() in self.safe_schemes
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
341
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
342 def sanitize_css(self, text):
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
343 """Remove potentially dangerous property declarations from CSS code.
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
344
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
345 In particular, properties using the CSS ``url()`` function with a scheme
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
346 that is not considered safe are removed:
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
347
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
348 >>> sanitizer = HTMLSanitizer()
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
349 >>> sanitizer.sanitize_css(u'''
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
350 ... background: url(javascript:alert("foo"));
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
351 ... color: #000;
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
352 ... ''')
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
353 [u'color: #000']
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
354
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
355 Also, the proprietary Internet Explorer function ``expression()`` is
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
356 always stripped:
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
357
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
358 >>> sanitizer.sanitize_css(u'''
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
359 ... background: #fff;
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
360 ... color: #000;
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
361 ... width: e/**/xpression(alert("foo"));
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
362 ... ''')
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
363 [u'background: #fff', u'color: #000']
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
364
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
365 :param text: the CSS text; this is expected to be `unicode` and to not
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
366 contain any character or numeric references
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
367 :return: a list of declarations that are considered safe
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
368 :rtype: `list`
576
53f4088e1e3b Improve docs on `Stream.select()` for #135.
cmlenz
parents: 571
diff changeset
369 :since: version 0.4.3
571
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
370 """
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
371 decls = []
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
372 text = self._strip_css_comments(self._replace_unicode_escapes(text))
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
373 for decl in filter(None, text.split(';')):
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
374 decl = decl.strip()
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
375 if not decl:
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
376 continue
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
377 is_evil = False
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
378 if 'expression' in decl:
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
379 is_evil = True
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
380 for match in re.finditer(r'url\s*\(([^)]+)', decl):
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
381 if not self.is_safe_uri(match.group(1)):
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
382 is_evil = True
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
383 break
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
384 if not is_evil:
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
385 decls.append(decl.strip())
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
386 return decls
5815ad5f75a4 * Cleaned up the implementation of the `HTMLSanitizer`.
cmlenz
parents: 556
diff changeset
387
431
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
388 _NORMALIZE_NEWLINES = re.compile(r'\r\n').sub
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
389 _UNICODE_ESCAPE = re.compile(r'\\([0-9a-fA-F]{1,6})\s?').sub
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
390
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
391 def _replace_unicode_escapes(self, text):
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
392 def _repl(match):
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
393 return unichr(int(match.group(1), 16))
747baa1cd597 * Don't allow `style` attributes by default in the `HTMLSanitizer`. Closes #97.
cmlenz
parents: 425
diff changeset
394 return self._UNICODE_ESCAPE(_repl, self._NORMALIZE_NEWLINES('\n', text))
556
d5cb5c200045 The HTML sanitizer now strips any CSS comments in style attributes, which could previously be used to hide malicious property values.
cmlenz
parents: 471
diff changeset
395
d5cb5c200045 The HTML sanitizer now strips any CSS comments in style attributes, which could previously be used to hide malicious property values.
cmlenz
parents: 471
diff changeset
396 _CSS_COMMENTS = re.compile(r'/\*.*?\*/').sub
d5cb5c200045 The HTML sanitizer now strips any CSS comments in style attributes, which could previously be used to hide malicious property values.
cmlenz
parents: 471
diff changeset
397
d5cb5c200045 The HTML sanitizer now strips any CSS comments in style attributes, which could previously be used to hide malicious property values.
cmlenz
parents: 471
diff changeset
398 def _strip_css_comments(self, text):
d5cb5c200045 The HTML sanitizer now strips any CSS comments in style attributes, which could previously be used to hide malicious property values.
cmlenz
parents: 471
diff changeset
399 return self._CSS_COMMENTS('', text)
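The URI check and the CSS clean-up in the listing above can be tried without Genshi installed. What follows is a minimal standalone sketch in Python 3, not the library's code: the module-level functions `is_safe_uri` and `sanitize_css` mirror the methods of the same names, `SAFE_SCHEMES` is copied from the class attribute, and the names `_CSS_URL` and `_decode_escape` are introduced here only for illustration. It deviates from the listing in one place: an escape that decodes to a code point above U+10FFFF is replaced with U+FFFD rather than raising an error.

```python
import re

# Copied from HTMLSanitizer.SAFE_SCHEMES in the listing above.
SAFE_SCHEMES = frozenset(['file', 'ftp', 'http', 'https', 'mailto', None])

_UNICODE_ESCAPE = re.compile(r'\\([0-9a-fA-F]{1,6})\s?')
_CSS_COMMENTS = re.compile(r'/\*.*?\*/')
_CSS_URL = re.compile(r'url\s*\(([^)]+)')  # hypothetical helper name


def is_safe_uri(uri, safe_schemes=SAFE_SCHEMES):
    """Return True if the URI is relative or uses an allowed scheme."""
    uri = uri.split('#', 1)[0]  # drop the fragment identifier
    if ':' not in uri:
        return True  # no scheme, so this is a relative URI
    # Keep only alphanumerics so "java\tscript" collapses to "javascript".
    scheme = ''.join(c for c in uri.split(':', 1)[0] if c.isalnum())
    return scheme.lower() in safe_schemes


def _decode_escape(match):
    """Decode one CSS escape such as "\\65" (assumption: clamp bad values)."""
    codepoint = int(match.group(1), 16)
    return chr(codepoint) if codepoint <= 0x10FFFF else '\ufffd'


def sanitize_css(text, safe_schemes=SAFE_SCHEMES):
    """Return the declarations in `text` that are considered safe."""
    # Decode escapes first, then strip comments, so neither can hide a keyword.
    text = text.replace('\r\n', '\n')
    text = _UNICODE_ESCAPE.sub(_decode_escape, text)
    text = _CSS_COMMENTS.sub('', text)
    decls = []
    for decl in text.split(';'):
        decl = decl.strip()
        if not decl or 'expression' in decl:
            continue  # empty, or uses the IE-only expression() function
        urls = (m.group(1) for m in _CSS_URL.finditer(decl))
        if all(is_safe_uri(url, safe_schemes) for url in urls):
            decls.append(decl)
    return decls


if __name__ == '__main__':
    print(is_safe_uri('javascript:alert(document.cookie)'))   # False
    print(sanitize_css('background: url(javascript:void); color: #000'))
    # ['color: #000']
```

Decoding escapes before stripping comments follows the order used in `sanitize_css` in the listing. Both steps have to run before the `expression` and `url()` checks, otherwise input such as `e/**/xpression(...)` or `\65xpression(...)` would pass unnoticed.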