annotate genshi/input.py @ 932:e53161c2773c

Merge r1140 from py3k: add support for python 3 to core genshi components (genshi.core, genshi.input and genshi.output):
* default input and output encodings changed from UTF-8 to None (i.e. unicode strings)
* Namespace and QName objects do not call stringrepr in __repr__ in Python 3 since repr() returns a unicode string there.
* track changes to expat parser in Python 3 (mostly it accepts bytes instead of strings)
author hodgestar
date Fri, 18 Mar 2011 09:08:12 +0000
parents fbe34d12acde
children
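
The changeset description notes that the expat parser in Python 3 mostly accepts bytes instead of strings. The following is a minimal sketch of that behaviour, assuming Python 3 and only the standard library `xml.parsers.expat` module. The helper name `collect_events` and the plain string event labels are hypothetical and are not part of genshi.input, which queues genshi.core event kinds instead.

```python
from xml.parsers import expat


def collect_events(data):
    """Feed XML to expat as bytes and return the events it reports.

    Illustrative helper only; it is not part of genshi.input.
    """
    events = []
    # '}' is the namespace separator, as in the annotated XMLParser.
    parser = expat.ParserCreate(None, '}')
    parser.buffer_text = True
    # Python 3 dropped returns_unicode, so only set it where it exists.
    if hasattr(parser, 'returns_unicode'):
        parser.returns_unicode = True
    # Attributes arrive as a flat [name, value, name, value, ...] list.
    parser.ordered_attributes = True

    parser.StartElementHandler = (
        lambda tag, attrs: events.append(('START', tag, attrs)))
    parser.EndElementHandler = lambda tag: events.append(('END', tag))
    parser.CharacterDataHandler = lambda text: events.append(('TEXT', text))

    # Text input is encoded before parsing, so expat always sees bytes.
    if not isinstance(data, bytes):
        data = data.encode('utf-8')
    parser.Parse(data, True)
    return events


events = collect_events(b'<root id="2"><child>Foo</child></root>')
for event in events:
    print(event)
```

Guarding the `returns_unicode` assignment with `hasattr` lets the same setup code run on interpreters with and without that attribute, which appears to be the intent of the equivalent check in this revision.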
rev   line source
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
1 # -*- coding: utf-8 -*-
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
2 #
854
0d9e87c6cf6e More work on reducing the size of the diff produced by 2to3.
cmlenz
parents: 853
diff changeset
3 # Copyright (C) 2006-2009 Edgewall Software
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
4 # All rights reserved.
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
5 #
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
6 # This software is licensed as described in the file COPYING, which
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
7 # you should have received as part of this distribution. The terms
230
24757b771651 Renamed Markup to Genshi in repository.
cmlenz
parents: 213
diff changeset
8 # are also available at http://genshi.edgewall.org/wiki/License.
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
9 #
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
10 # This software consists of voluntary contributions made by many
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
11 # individuals. For the exact contribution history, see the revision
230
24757b771651 Renamed Markup to Genshi in repository.
cmlenz
parents: 213
diff changeset
12 # history and logs, available at http://genshi.edgewall.org/log/.
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
13
425
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
14 """Support for constructing markup streams from files, strings, or other
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
15 sources.
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
16 """
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
17
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
18 from itertools import chain
859
fbe34d12acde More bits of 2to3 related cleanup.
cmlenz
parents: 857
diff changeset
19 import htmlentitydefs as entities
fbe34d12acde More bits of 2to3 related cleanup.
cmlenz
parents: 857
diff changeset
20 import HTMLParser as html
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
21 from xml.parsers import expat
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
22
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
23 from genshi.core import Attrs, QName, Stream, stripentities
859
fbe34d12acde More bits of 2to3 related cleanup.
cmlenz
parents: 857
diff changeset
24 from genshi.core import START, END, XML_DECL, DOCTYPE, TEXT, START_NS, \
fbe34d12acde More bits of 2to3 related cleanup.
cmlenz
parents: 857
diff changeset
25 END_NS, START_CDATA, END_CDATA, PI, COMMENT
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
26 from genshi.compat import StringIO, BytesIO
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
27
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
28
290
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
29 __all__ = ['ET', 'ParseError', 'XMLParser', 'XML', 'HTMLParser', 'HTML']
425
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
30 __docformat__ = 'restructuredtext en'
290
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
31
859
fbe34d12acde More bits of 2to3 related cleanup.
cmlenz
parents: 857
diff changeset
32
290
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
33 def ET(element):
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
34 """Convert a given ElementTree element to a markup stream.
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
35
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
36 :param element: an ElementTree element
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
37 :return: a markup stream
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
38 """
290
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
39 tag_name = QName(element.tag.lstrip('{'))
458
160f787cc818 The `ET()` function now correctly handles attributes with a namespace.
cmlenz
parents: 434
diff changeset
40 attrs = Attrs([(QName(attr.lstrip('{')), value)
160f787cc818 The `ET()` function now correctly handles attributes with a namespace.
cmlenz
parents: 434
diff changeset
41 for attr, value in element.items()])
290
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
42
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
43 yield START, (tag_name, attrs), (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
44 if element.text:
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
45 yield TEXT, element.text, (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
46 for child in element.getchildren():
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
47 for item in ET(child):
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
48 yield item
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
49 yield END, tag_name, (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
50 if element.tail:
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
51 yield TEXT, element.tail, (None, -1, -1)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
52
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
53
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
54 class ParseError(Exception):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
55 """Exception raised when fatal syntax errors are found in the input being
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
56 parsed.
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
57 """
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
58
422
95089b6e37ca More work to include absolute file paths in exceptions.
cmlenz
parents: 419
diff changeset
59 def __init__(self, message, filename=None, lineno=-1, offset=-1):
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
60 """Exception initializer.
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
61
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
62 :param message: the error message from the parser
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
63 :param filename: the path to the file that was parsed
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
64 :param lineno: the number of the line on which the error was encountered
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
65 :param offset: the column number where the error was encountered
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
66 """
422
95089b6e37ca More work to include absolute file paths in exceptions.
cmlenz
parents: 419
diff changeset
67 self.msg = message
95089b6e37ca More work to include absolute file paths in exceptions.
cmlenz
parents: 419
diff changeset
68 if filename:
434
e065d7906b68 * Better method to propagate the full path to the template file on parse errors. Supersedes r513.
cmlenz
parents: 433
diff changeset
69 message += ', in ' + filename
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
70 Exception.__init__(self, message)
422
95089b6e37ca More work to include absolute file paths in exceptions.
cmlenz
parents: 419
diff changeset
71 self.filename = filename or '<string>'
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
72 self.lineno = lineno
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
73 self.offset = offset
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
74
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
75
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
76 class XMLParser(object):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
77 """Generator-based XML parser based on roughly equivalent code in
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
78 Kid/ElementTree.
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
79
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
80 The parsing is initiated by iterating over the parser object:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
81
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
82 >>> parser = XMLParser(StringIO('<root id="2"><child>Foo</child></root>'))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
83 >>> for kind, data, pos in parser:
853
4376010bb97e Convert a bunch of print statements to py3k compatible syntax.
cmlenz
parents: 852
diff changeset
84 ... print('%s %s' % (kind, data))
857
24733a5854d9 Avoid unicode literals in `repr`s of `QName` and `Namespace` when not necessary.
cmlenz
parents: 856
diff changeset
85 START (QName('root'), Attrs([(QName('id'), u'2')]))
24733a5854d9 Avoid unicode literals in `repr`s of `QName` and `Namespace` when not necessary.
cmlenz
parents: 856
diff changeset
86 START (QName('child'), Attrs())
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
87 TEXT Foo
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
88 END child
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
89 END root
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
90 """
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
91
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
92 _entitydefs = ['<!ENTITY %s "&#%d;">' % (name, value) for name, value in
856
1e2be9fb3348 Add a couple of fallback imports for Python 3.0.
cmlenz
parents: 854
diff changeset
93 entities.name2codepoint.items()]
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
94 _external_dtd = u'\n'.join(_entitydefs).encode('utf-8')
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
95
316
4ab9edf5e83b Configurable encoding of template files, closing #65.
cmlenz
parents: 312
diff changeset
96 def __init__(self, source, filename=None, encoding=None):
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
97 """Initialize the parser for the given XML input.
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
98
425
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
99 :param source: the XML text as a file-like object
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
100 :param filename: the name of the file, if appropriate
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
101 :param encoding: the encoding of the file; if not specified, the
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
102 encoding is assumed to be ASCII, UTF-8, or UTF-16, or
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
103 whatever the encoding specified in the XML declaration
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
104 (if any)
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
105 """
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
106 self.source = source
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
107 self.filename = filename
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
108
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
109 # Setup the Expat parser
316
4ab9edf5e83b Configurable encoding of template files, closing #65.
cmlenz
parents: 312
diff changeset
110 parser = expat.ParserCreate(encoding, '}')
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
111 parser.buffer_text = True
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
112 # Python 3 does not have returns_unicode
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
113 if hasattr(parser, 'returns_unicode'):
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
114 parser.returns_unicode = True
160
faea6db52ef1 Attribute order in parsed XML is now preserved.
cmlenz
parents: 146
diff changeset
115 parser.ordered_attributes = True
faea6db52ef1 Attribute order in parsed XML is now preserved.
cmlenz
parents: 146
diff changeset
116
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
117 parser.StartElementHandler = self._handle_start
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
118 parser.EndElementHandler = self._handle_end
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
119 parser.CharacterDataHandler = self._handle_data
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
120 parser.StartDoctypeDeclHandler = self._handle_doctype
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
121 parser.StartNamespaceDeclHandler = self._handle_start_ns
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
122 parser.EndNamespaceDeclHandler = self._handle_end_ns
143
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
123 parser.StartCdataSectionHandler = self._handle_start_cdata
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
124 parser.EndCdataSectionHandler = self._handle_end_cdata
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
125 parser.ProcessingInstructionHandler = self._handle_pi
460
6b5544bb5a99 Apply patch by Alec Thomas for processing XML declarations (#111). Thanks!
cmlenz
parents: 458
diff changeset
126 parser.XmlDeclHandler = self._handle_xml_decl
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
127 parser.CommentHandler = self._handle_comment
209
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
128
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
129 # Tell Expat that we'll handle non-XML entities ourselves
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
130 # (in _handle_other)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
131 parser.DefaultHandler = self._handle_other
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
132 parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
209
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
133 parser.UseForeignDTD()
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
134 parser.ExternalEntityRefHandler = self._build_foreign
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
135
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
136 self.expat = parser
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
137 self._queue = []
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
138
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
139 def parse(self):
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
140 """Generator that parses the XML source, yielding markup events.
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
141
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
142 :return: a markup event stream
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
143 :raises ParseError: if the XML text is not well formed
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
144 """
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
145 def _generate():
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
146 try:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
147 bufsize = 4 * 1024 # 4K
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
148 done = False
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
149 while 1:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
150 while not done and len(self._queue) == 0:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
151 data = self.source.read(bufsize)
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
152 if not data: # end of data
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
153 if hasattr(self, 'expat'):
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
154 self.expat.Parse('', True)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
155 del self.expat # get rid of circular references
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
156 done = True
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
157 else:
207
0619a27f5e67 The `XMLParser` now correctly handles unicode input. Closes #43.
cmlenz
parents: 182
diff changeset
158 if isinstance(data, unicode):
0619a27f5e67 The `XMLParser` now correctly handles unicode input. Closes #43.
cmlenz
parents: 182
diff changeset
159 data = data.encode('utf-8')
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
160 self.expat.Parse(data, False)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
161 for event in self._queue:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
162 yield event
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
163 self._queue = []
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
164 if done:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
165 break
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
166 except expat.ExpatError, e:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
167 msg = str(e)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
168 raise ParseError(msg, self.filename, e.lineno, e.offset)
146
db0dacc1239a Simplified `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
169 return Stream(_generate()).filter(_coalesce)
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
170
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
171 def __iter__(self):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
172 return iter(self.parse())
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
173
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
174 def _build_foreign(self, context, base, sysid, pubid):
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
175 parser = self.expat.ExternalEntityParserCreate(context)
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
176 parser.ParseFile(BytesIO(self._external_dtd))
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
177 return 1
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
178
143
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
179 def _enqueue(self, kind, data=None, pos=None):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
180 if pos is None:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
181 pos = self._getpos()
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
182 if kind is TEXT:
134
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
183 # Expat reports the *end* of the text event as current position. We
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
184 # try to fix that up here as much as possible. Unfortunately, the
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
185 # offset is only valid for single-line text. For multi-line text,
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
186 # it is apparently not possible to determine at what offset it
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
187 # started
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
188 if '\n' in data:
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
189 lines = data.splitlines()
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
190 lineno = pos[1] - len(lines) + 1
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
191 offset = -1
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
192 else:
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
193 lineno = pos[1]
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
194 offset = pos[2] - len(data)
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
195 pos = (pos[0], lineno, offset)
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
196 self._queue.append((kind, data, pos))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
197
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
198 def _getpos_unknown(self):
134
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
199 return (self.filename, -1, -1)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
200
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
201 def _getpos(self):
134
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
202 return (self.filename, self.expat.CurrentLineNumber,
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
203 self.expat.CurrentColumnNumber)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
204
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
205 def _handle_start(self, tag, attrib):
403
32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`.
cmlenz
parents: 378
diff changeset
206 attrs = Attrs([(QName(name), value) for name, value in
32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`.
cmlenz
parents: 378
diff changeset
207 zip(*[iter(attrib)] * 2)])
32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`.
cmlenz
parents: 378
diff changeset
208 self._enqueue(START, (QName(tag), attrs))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
209
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
210 def _handle_end(self, tag):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
211 self._enqueue(END, QName(tag))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
212
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
213 def _handle_data(self, text):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
214 self._enqueue(TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
215
460
6b5544bb5a99 Apply patch by Alec Thomas for processing XML declarations (#111). Thanks!
cmlenz
parents: 458
diff changeset
216 def _handle_xml_decl(self, version, encoding, standalone):
6b5544bb5a99 Apply patch by Alec Thomas for processing XML declarations (#111). Thanks!
cmlenz
parents: 458
diff changeset
217 self._enqueue(XML_DECL, (version, encoding, standalone))
6b5544bb5a99 Apply patch by Alec Thomas for processing XML declarations (#111). Thanks!
cmlenz
parents: 458
diff changeset
218
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
219 def _handle_doctype(self, name, sysid, pubid, has_internal_subset):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
220 self._enqueue(DOCTYPE, (name, pubid, sysid))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
221
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
222 def _handle_start_ns(self, prefix, uri):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
223 self._enqueue(START_NS, (prefix or '', uri))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
224
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
225 def _handle_end_ns(self, prefix):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
226 self._enqueue(END_NS, prefix or '')
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
227
143
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
228 def _handle_start_cdata(self):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
229 self._enqueue(START_CDATA)
143
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
230
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
231 def _handle_end_cdata(self):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
232 self._enqueue(END_CDATA)
143
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
233
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
234 def _handle_pi(self, target, data):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
235 self._enqueue(PI, (target, data))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
236
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
237 def _handle_comment(self, text):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
238 self._enqueue(COMMENT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
239
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
240 def _handle_other(self, text):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
241 if text.startswith('&'):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
242 # deal with undefined entities
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
243 try:
856
1e2be9fb3348 Add a couple of fallback imports for Python 3.0.
cmlenz
parents: 854
diff changeset
244 text = unichr(entities.name2codepoint[text[1:-1]])
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
245 self._enqueue(TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
246 except KeyError:
209
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
247 filename, lineno, offset = self._getpos()
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
248 error = expat.error('undefined entity "%s": line %d, column %d'
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
249 % (text, lineno, offset))
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
250 error.code = expat.errors.XML_ERROR_UNDEFINED_ENTITY
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
251 error.lineno = lineno
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
252 error.offset = offset
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
253 raise error
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
254
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
255
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
256 def XML(text):
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
257 """Parse the given XML source and return a markup stream.
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
258
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
259 Unlike with `XMLParser`, the returned stream is reusable, meaning it can be
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
260 iterated over multiple times:
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
261
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
262 >>> xml = XML('<doc><elem>Foo</elem><elem>Bar</elem></doc>')
853
4376010bb97e Convert a bunch of print statements to py3k compatible syntax.
cmlenz
parents: 852
diff changeset
263 >>> print(xml)
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
264 <doc><elem>Foo</elem><elem>Bar</elem></doc>
853
4376010bb97e Convert a bunch of print statements to py3k compatible syntax.
cmlenz
parents: 852
diff changeset
265 >>> print(xml.select('elem'))
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
266 <elem>Foo</elem><elem>Bar</elem>
853
4376010bb97e Convert a bunch of print statements to py3k compatible syntax.
cmlenz
parents: 852
diff changeset
267 >>> print(xml.select('elem/text()'))
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
268 FooBar
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
269
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
270 :param text: the XML source
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
271 :return: the parsed XML event stream
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
272 :raises ParseError: if the XML text is not well-formed
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
273 """
859
fbe34d12acde More bits of 2to3 related cleanup.
cmlenz
parents: 857
diff changeset
274 return Stream(list(XMLParser(StringIO(text))))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
275
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
276
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
277 class HTMLParser(html.HTMLParser, object):
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
278 """Parser for HTML input based on the Python `HTMLParser` module.
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
279
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
280 This class provides the same interface for generating stream events as
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
281 `XMLParser`, and attempts to automatically balance tags.
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
282
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
283 The parsing is initiated by iterating over the parser object:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
284
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
285 >>> parser = HTMLParser(BytesIO(u'<UL compact><LI>Foo</UL>'.encode('utf-8')), encoding='utf-8')
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
286 >>> for kind, data, pos in parser:
853
4376010bb97e Convert a bunch of print statements to py3k compatible syntax.
cmlenz
parents: 852
diff changeset
287 ... print('%s %s' % (kind, data))
857
24733a5854d9 Avoid unicode literals in `repr`s of `QName` and `Namespace` when not necessary.
cmlenz
parents: 856
diff changeset
288 START (QName('ul'), Attrs([(QName('compact'), u'compact')]))
24733a5854d9 Avoid unicode literals in `repr`s of `QName` and `Namespace` when not necessary.
cmlenz
parents: 856
diff changeset
289 START (QName('li'), Attrs())
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
290 TEXT Foo
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
291 END li
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
292 END ul
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
293 """
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
294
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
295 _EMPTY_ELEMS = frozenset(['area', 'base', 'basefont', 'br', 'col', 'frame',
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
296 'hr', 'img', 'input', 'isindex', 'link', 'meta',
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
297 'param'])
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
298
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
299 def __init__(self, source, filename=None, encoding=None):
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
300 """Initialize the parser for the given HTML input.
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
301
425
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
302 :param source: the HTML text as a file-like object
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
303 :param filename: the name of the file, if known
5b248708bbed Try to use proper reStructuredText for docstrings throughout.
cmlenz
parents: 423
diff changeset
304 :param encoding: encoding of the file; ignored if the input is unicode
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
305 """
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
306 html.HTMLParser.__init__(self)
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
307 self.source = source
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
308 self.filename = filename
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
309 self.encoding = encoding
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
310 self._queue = []
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
311 self._open_tags = []
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
312
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
313 def parse(self):
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
314 """Generator that parses the HTML source, yielding markup events.
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
315
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
316 :return: a markup event stream
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
317 :raises ParseError: if the HTML text is not well-formed
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
318 """
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
319 def _generate():
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
320 try:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
321 bufsize = 4 * 1024 # 4K
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
322 done = False
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
323 while 1:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
324 while not done and len(self._queue) == 0:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
325 data = self.source.read(bufsize)
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
326 if not data: # end of data
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
327 self.close()
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
328 done = True
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
329 else:
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
330 if not isinstance(data, unicode):
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
331 # bytes
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
332 if self.encoding:
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
333 data = data.decode(self.encoding)
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
334 else:
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
335 raise UnicodeError("source returned bytes, but no encoding specified")
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
336 self.feed(data)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
337 for kind, data, pos in self._queue:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
338 yield kind, data, pos
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
339 self._queue = []
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
340 if done:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
341 open_tags = self._open_tags
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
342 open_tags.reverse()
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
343 for tag in open_tags:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
344 yield END, QName(tag), pos
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
345 break
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
346 except html.HTMLParseError, e:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
347 msg = '%s: line %d, column %d' % (e.msg, e.lineno, e.offset)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
348 raise ParseError(msg, self.filename, e.lineno, e.offset)
146
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
349 return Stream(_generate()).filter(_coalesce)
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
350
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
351 def __iter__(self):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
352 return iter(self.parse())
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
353
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
354 def _enqueue(self, kind, data, pos=None):
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
355 if pos is None:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
356 pos = self._getpos()
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
357 self._queue.append((kind, data, pos))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
358
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
359 def _getpos(self):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
360 lineno, column = self.getpos()
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
361 return (self.filename, lineno, column)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
362
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
363 def handle_starttag(self, tag, attrib):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
364 fixed_attrib = []
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
365 for name, value in attrib: # Fixup minimized attributes
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
366 if value is None:
312
7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`.
cmlenz
parents: 311
diff changeset
367 value = unicode(name)
7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`.
cmlenz
parents: 311
diff changeset
368 elif not isinstance(value, unicode):
7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`.
cmlenz
parents: 311
diff changeset
369 value = value.decode(self.encoding, 'replace')
403
32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`.
cmlenz
parents: 378
diff changeset
370 fixed_attrib.append((QName(name), stripentities(value)))
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
371
182
41db0260ebb1 Renamed `Attributes` to `Attrs` to reduce the verbosity.
cmlenz
parents: 160
diff changeset
372 self._enqueue(START, (QName(tag), Attrs(fixed_attrib)))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
373 if tag in self._EMPTY_ELEMS:
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
374 self._enqueue(END, QName(tag))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
375 else:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
376 self._open_tags.append(tag)
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
377
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
378 def handle_endtag(self, tag):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
379 if tag not in self._EMPTY_ELEMS:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
380 while self._open_tags:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
381 open_tag = self._open_tags.pop()
378
fff4a81ffc56 Improve handling of incorrectly nested tags in the HTML parser.
cmlenz
parents: 376
diff changeset
382 self._enqueue(END, QName(open_tag))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
383 if open_tag.lower() == tag.lower():
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
384 break
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
385
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
386 def handle_data(self, text):
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
387 if not isinstance(text, unicode):
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
388 text = text.decode(self.encoding, 'replace')
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
389 self._enqueue(TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
390
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
391 def handle_charref(self, name):
423
7589a0e51001 Applied patch for #106 (handling of hex charrefs in HTML parser).
cmlenz
parents: 422
diff changeset
392 if name.lower().startswith('x'):
7589a0e51001 Applied patch for #106 (handling of hex charrefs in HTML parser).
cmlenz
parents: 422
diff changeset
393 text = unichr(int(name[1:], 16))
7589a0e51001 Applied patch for #106 (handling of hex charrefs in HTML parser).
cmlenz
parents: 422
diff changeset
394 else:
7589a0e51001 Applied patch for #106 (handling of hex charrefs in HTML parser).
cmlenz
parents: 422
diff changeset
395 text = unichr(int(name))
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
396 self._enqueue(TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
397
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
398 def handle_entityref(self, name):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
399 try:
856
1e2be9fb3348 Add a couple of fallback imports for Python 3.0.
cmlenz
parents: 854
diff changeset
400 text = unichr(entities.name2codepoint[name])
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
401 except KeyError:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
402 text = '&%s;' % name
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
403 self._enqueue(TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
404
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
405 def handle_pi(self, data):
376
74b6bf92f0cd Fix parsing of processing instructions in HTML input.
cmlenz
parents: 326
diff changeset
406 target, data = data.split(None, 1)
74b6bf92f0cd Fix parsing of processing instructions in HTML input.
cmlenz
parents: 326
diff changeset
407 if data.endswith('?'):
74b6bf92f0cd Fix parsing of processing instructions in HTML input.
cmlenz
parents: 326
diff changeset
408 data = data[:-1]
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
409 self._enqueue(PI, (target.strip(), data.strip()))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
410
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
411 def handle_comment(self, text):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
412         self._enqueue(COMMENT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
413
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
414
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
415 def HTML(text, encoding=None):
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
416     """Parse the given HTML source and return a markup stream.
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
417
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
418     Unlike with `HTMLParser`, the returned stream is reusable, meaning it can be
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
419     iterated over multiple times:
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
420
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
421     >>> html = HTML('<body><h1>Foo</h1></body>', encoding='utf-8')
853
4376010bb97e Convert a bunch of print statements to py3k compatible syntax.
cmlenz
parents: 852
diff changeset
422     >>> print(html)
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
423     <body><h1>Foo</h1></body>
853
4376010bb97e Convert a bunch of print statements to py3k compatible syntax.
cmlenz
parents: 852
diff changeset
424     >>> print(html.select('h1'))
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
425     <h1>Foo</h1>
853
4376010bb97e Convert a bunch of print statements to py3k compatible syntax.
cmlenz
parents: 852
diff changeset
426     >>> print(html.select('h1/text()'))
433
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
427     Foo
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
428
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
429     :param text: the HTML source
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
430     :return: the parsed XML event stream
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
431     :raises ParseError: if the HTML text is not well-formed, and error recovery
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
432                         fails
6d01e91f2a49 More API docs.
cmlenz
parents: 425
diff changeset
433     """
932
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
434     if isinstance(text, unicode):
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
435         return Stream(list(HTMLParser(StringIO(text), encoding=encoding)))
e53161c2773c Merge r1140 from py3k:
hodgestar
parents: 859
diff changeset
436     return Stream(list(HTMLParser(BytesIO(text), encoding=encoding)))
859
fbe34d12acde More bits of 2to3 related cleanup.
cmlenz
parents: 857
diff changeset
437
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
438
146
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
439 def _coalesce(stream):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
440     """Coalesces adjacent TEXT events into a single event."""
146
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
441     textbuf = []
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
442     textpos = None
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
443     for kind, data, pos in chain(stream, [(None, None, None)]):
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
444         if kind is TEXT:
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
445             textbuf.append(data)
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
446             if textpos is None:
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
447                 textpos = pos
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
448         else:
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
449             if textbuf:
852
04945cd67dad Remove usage of unicode literals in a couple of places where they were not strictly necessary.
cmlenz
parents: 750
diff changeset
450                 yield TEXT, ''.join(textbuf), textpos
146
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
451                 del textbuf[:]
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
452                 textpos = None
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
453             if kind:
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
454                 yield kind, data, pos
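
The coalescing step in `_coalesce` can be tried in isolation. The following is a minimal standalone sketch, not part of the annotated file: the `TEXT`, `START` and `END` constants are plain-string stand-ins for the Genshi event kinds, the event payloads and positions are simplified made-up values, and the function is named `coalesce` here only to keep it apart from the original. It shows how the trailing `(None, None, None)` sentinel forces a final flush of buffered text, and that the merged event keeps the position of the first text fragment.

```python
from itertools import chain

# Stand-ins for the Genshi event kinds (assumption: the real constants
# live in genshi.core and are not plain Python strings like these).
TEXT = 'TEXT'
START = 'START'
END = 'END'


def coalesce(stream):
    """Merge each run of adjacent TEXT events into a single TEXT event."""
    textbuf = []    # text fragments collected for the current run
    textpos = None  # position of the first fragment in the current run
    # The sentinel event has a falsy kind, so it flushes any text that is
    # still buffered at the end of the stream without being yielded itself.
    for kind, data, pos in chain(stream, [(None, None, None)]):
        if kind is TEXT:
            textbuf.append(data)
            if textpos is None:
                textpos = pos
        else:
            if textbuf:
                yield TEXT, ''.join(textbuf), textpos
                del textbuf[:]
                textpos = None
            if kind:
                yield kind, data, pos


# Hypothetical events, as a parser might emit them when an entity reference
# splits one run of character data into three TEXT events.
events = [
    (START, 'p', (None, 1, 0)),
    (TEXT, 'Foo ', (None, 1, 3)),
    (TEXT, '&', (None, 1, 7)),
    (TEXT, ' bar', (None, 1, 12)),
    (END, 'p', (None, 1, 16)),
]
result = list(coalesce(events))

# A stream that ends in text is only flushed because of the sentinel.
trailing = list(coalesce([(TEXT, 'a', 1), (TEXT, 'b', 2)]))
```

With these inputs, `result` contains three events (the three text fragments merged into `'Foo & bar'` at the position of `'Foo '`), and `trailing` contains a single `TEXT` event with data `'ab'`.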