annotate markup/input.py @ 134:df44110ca91d

* Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
* Evaluation errors in expressions now include the original expression code in the traceback.
author cmlenz
date Sun, 06 Aug 2006 18:07:21 +0000
parents e9a3930f8823
children a2edde90ad24
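
The text-node fix-up described in the changeset summary can be sketched in isolation. This is a minimal illustration rather than the module's code: the function name `adjust_text_pos` is hypothetical, and it assumes positions are `(filename, lineno, offset)` tuples that describe the end of the text, as the comments in `XMLParser._enqueue` state.

```python
def adjust_text_pos(data, pos):
    """Shift an end-of-text position back to where the text began.

    `adjust_text_pos` is a hypothetical helper name; `pos` is assumed to be
    a (filename, lineno, offset) tuple describing the END of `data`.
    """
    filename, end_lineno, end_offset = pos
    if '\n' in data:
        # Multi-line text: step back by the number of lines spanned. The
        # starting column cannot be recovered, so it is reported as -1.
        lineno = end_lineno - len(data.splitlines()) + 1
        offset = -1
    else:
        # Single-line text: same line, column moved back by the text length.
        lineno = end_lineno
        offset = end_offset - len(data)
    return (filename, lineno, offset)


print(adjust_text_pos('Foo', ('t.xml', 3, 10)))     # ('t.xml', 3, 7)
print(adjust_text_pos('a\nb\nc', ('t.xml', 5, 1)))  # ('t.xml', 3, -1)
```

Text that ends in a newline yields one fewer element from `splitlines()` than the number of lines it touches, so this arithmetic can be off by one in that case; that may be part of why the summary calls the result not quite perfect.
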
rev   line source
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
1 # -*- coding: utf-8 -*-
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
2 #
66
822089ae65ce Switch copyright to Edgewall and URLs to markup.edgewall.org.
cmlenz
parents: 27
diff changeset
3 # Copyright (C) 2006 Edgewall Software
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
4 # All rights reserved.
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
5 #
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
6 # This software is licensed as described in the file COPYING, which
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
7 # you should have received as part of this distribution. The terms
66
822089ae65ce Switch copyright to Edgewall and URLs to markup.edgewall.org.
cmlenz
parents: 27
diff changeset
8 # are also available at http://markup.edgewall.org/wiki/License.
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
9 #
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
10 # This software consists of voluntary contributions made by many
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
11 # individuals. For the exact contribution history, see the revision
66
822089ae65ce Switch copyright to Edgewall and URLs to markup.edgewall.org.
cmlenz
parents: 27
diff changeset
12 # history and logs, available at http://markup.edgewall.org/log/.
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
13
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
14 from xml.parsers import expat
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
15 try:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
16 frozenset
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
17 except NameError:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
18 from sets import ImmutableSet as frozenset
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
19 import HTMLParser as html
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
20 import htmlentitydefs
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
21 from StringIO import StringIO
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
22
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
23 from markup.core import Attributes, Markup, QName, Stream
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
24
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
25
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
26 class ParseError(Exception):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
27 """Exception raised when fatal syntax errors are found in the input being
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
28 parsed."""
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
29
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
30 def __init__(self, message, filename='<string>', lineno=-1, offset=-1):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
31 Exception.__init__(self, message)
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
32 self.filename = filename
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
33 self.lineno = lineno
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
34 self.offset = offset
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
35
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
36
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
37 class XMLParser(object):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
38 """Generator-based XML parser based on roughly equivalent code in
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
39 Kid/ElementTree.
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
40
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
41 The parsing is initiated by iterating over the parser object:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
42
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
43 >>> parser = XMLParser(StringIO('<root id="2"><child>Foo</child></root>'))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
44 >>> for kind, data, pos in parser:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
45 ... print kind, data
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
46 START (u'root', [(u'id', u'2')])
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
47 START (u'child', [])
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
48 TEXT Foo
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
49 END child
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
50 END root
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
51 """
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
52
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
53 def __init__(self, source, filename=None):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
54 """Initialize the parser for the given XML text.
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
55
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
56 @param source: the XML text as a file-like object
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
57 @param filename: the name of the file, if appropriate
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
58 """
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
59 self.source = source
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
60 self.filename = filename
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
61
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
62 # Set up the Expat parser
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
63 parser = expat.ParserCreate('utf-8', '}')
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
64 parser.buffer_text = True
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
65 parser.returns_unicode = True
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
66 parser.StartElementHandler = self._handle_start
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
67 parser.EndElementHandler = self._handle_end
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
68 parser.CharacterDataHandler = self._handle_data
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
69 parser.XmlDeclHandler = self._handle_prolog
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
70 parser.StartDoctypeDeclHandler = self._handle_doctype
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
71 parser.StartNamespaceDeclHandler = self._handle_start_ns
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
72 parser.EndNamespaceDeclHandler = self._handle_end_ns
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
73 parser.ProcessingInstructionHandler = self._handle_pi
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
74 parser.CommentHandler = self._handle_comment
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
75 parser.DefaultHandler = self._handle_other
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
76
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
77 # Location reporting is only supported in Python >= 2.4
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
78 if not hasattr(parser, 'CurrentLineNumber'):
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
79 self._getpos = self._getpos_unknown
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
80
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
81 self.expat = parser
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
82 self._queue = []
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
83
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
84 def __iter__(self):
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
85 try:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
86 bufsize = 4 * 1024 # 4K
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
87 done = False
69
e9a3930f8823 A couple of minor performance improvements.
cmlenz
parents: 66
diff changeset
88 while 1:
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
89 while not done and len(self._queue) == 0:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
90 data = self.source.read(bufsize)
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
91 if data == '': # end of data
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
92 if hasattr(self, 'expat'):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
93 self.expat.Parse('', True)
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
94 del self.expat # get rid of circular references
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
95 done = True
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
96 else:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
97 self.expat.Parse(data, False)
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
98 for event in self._queue:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
99 yield event
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
100 self._queue = []
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
101 if done:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
102 break
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
103 except expat.ExpatError, e:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
104 msg = str(e)
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
105 if self.filename:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
106 msg += ', in ' + self.filename
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
107 raise ParseError(msg, self.filename, e.lineno, e.offset)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
108
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
109 def _enqueue(self, kind, data, pos=None):
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
110 if pos is None:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
111 pos = self._getpos()
134
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
112 if kind is Stream.TEXT:
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
113 # Expat reports the *end* of the text event as current position. We
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
114 # try to fix that up here as much as possible. Unfortunately, the
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
115 # offset is only valid for single-line text. For multi-line text,
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
116 # it is apparently not possible to determine at what offset it
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
117 # started
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
118 if '\n' in data:
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
119 lines = data.splitlines()
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
120 lineno = pos[1] - len(lines) + 1
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
121 offset = -1
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
122 else:
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
123 lineno = pos[1]
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
124 offset = pos[2] - len(data)
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
125 pos = (pos[0], lineno, offset)
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
126 self._queue.append((kind, data, pos))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
127
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
128 def _getpos_unknown(self):
134
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
129 return (self.filename, -1, -1)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
130
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
131 def _getpos(self):
134
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
132 return (self.filename, self.expat.CurrentLineNumber,
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
133 self.expat.CurrentColumnNumber)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
134
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
135 def _handle_start(self, tag, attrib):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
136 self._enqueue(Stream.START, (QName(tag), Attributes(attrib.items())))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
137
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
138 def _handle_end(self, tag):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
139 self._enqueue(Stream.END, QName(tag))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
140
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
141 def _handle_data(self, text):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
142 self._enqueue(Stream.TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
143
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
144 def _handle_prolog(self, version, encoding, standalone):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
145 self._enqueue(Stream.PROLOG, (version, encoding, standalone))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
146
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
147 def _handle_doctype(self, name, sysid, pubid, has_internal_subset):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
148 self._enqueue(Stream.DOCTYPE, (name, pubid, sysid))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
149
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
150 def _handle_start_ns(self, prefix, uri):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
151 self._enqueue(Stream.START_NS, (prefix or '', uri))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
152
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
153 def _handle_end_ns(self, prefix):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
154 self._enqueue(Stream.END_NS, prefix or '')
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
155
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
156 def _handle_pi(self, target, data):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
157 self._enqueue(Stream.PI, (target, data))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
158
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
159 def _handle_comment(self, text):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
160 self._enqueue(Stream.COMMENT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
161
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
162 def _handle_other(self, text):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
163 if text.startswith('&'):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
164 # deal with undefined entities
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
165 try:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
166 text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
167 self._enqueue(Stream.TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
168 except KeyError:
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
169 filename, lineno, offset = self._getpos()
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
170 raise expat.error("undefined entity %s: line %d, column %d" %
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
171 (text, lineno, offset))
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
172
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
173
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
174 def XML(text):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
175 return Stream(list(XMLParser(StringIO(text))))
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
176
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
177
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
178 class HTMLParser(html.HTMLParser, object):
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
179 """Parser for HTML input based on the Python `HTMLParser` module.
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
180
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
181 This class provides the same interface for generating stream events as
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
182 `XMLParser`, and attempts to automatically balance tags.
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
183
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
184 The parsing is initiated by iterating over the parser object:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
185
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
186 >>> parser = HTMLParser(StringIO('<UL compact><LI>Foo</UL>'))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
187 >>> for kind, data, pos in parser:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
188 ... print kind, data
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
189 START (u'ul', [(u'compact', u'compact')])
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
190 START (u'li', [])
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
191 TEXT Foo
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
192 END li
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
193 END ul
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
194 """
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
195
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
196 _EMPTY_ELEMS = frozenset(['area', 'base', 'basefont', 'br', 'col', 'frame',
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
197 'hr', 'img', 'input', 'isindex', 'link', 'meta',
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
198 'param'])
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
199
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
200 def __init__(self, source, filename=None):
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
201 html.HTMLParser.__init__(self)
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
202 self.source = source
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
203 self.filename = filename
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
204 self._queue = []
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
205 self._open_tags = []
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
206
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
207 def __iter__(self):
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
208 try:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
209 bufsize = 4 * 1024 # 4K
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
210 done = False
69
e9a3930f8823 A couple of minor performance improvements.
cmlenz
parents: 66
diff changeset
211 while 1:
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
212 while not done and len(self._queue) == 0:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
213 data = self.source.read(bufsize)
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
214 if data == '': # end of data
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
215 self.close()
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
216 done = True
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
217 else:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
218 self.feed(data)
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
219 for kind, data, pos in self._queue:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
220 yield kind, data, pos
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
221 self._queue = []
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
222 if done:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
223 open_tags = self._open_tags
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
224 open_tags.reverse()
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
225 for tag in open_tags:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
226 yield Stream.END, QName(tag), pos
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
227 break
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
228 except html.HTMLParseError, e:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
229 msg = '%s: line %d, column %d' % (e.msg, e.lineno, e.offset)
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
230 if self.filename:
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
231 msg += ', in %s' % self.filename
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
232 raise ParseError(msg, self.filename, e.lineno, e.offset)
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
233
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
234 def _enqueue(self, kind, data, pos=None):
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
235 if pos is None:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
236 pos = self._getpos()
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
237 self._queue.append((kind, data, pos))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
238
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
239 def _getpos(self):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
240 lineno, column = self.getpos()
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
241 return (self.filename, lineno, column)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
242
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
243 def handle_starttag(self, tag, attrib):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
244 fixed_attrib = []
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
245 for name, value in attrib: # Fixup minimized attributes
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
246 if value is None:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
247 value = name
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
248 fixed_attrib.append((name, unicode(value)))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
249
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
250 self._enqueue(Stream.START, (QName(tag), Attributes(fixed_attrib)))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
251 if tag in self._EMPTY_ELEMS:
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
252 self._enqueue(Stream.END, QName(tag))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
253 else:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
254 self._open_tags.append(tag)
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
255
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
256 def handle_endtag(self, tag):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
257 if tag not in self._EMPTY_ELEMS:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
258 while self._open_tags:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
259 open_tag = self._open_tags.pop()
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
260 if open_tag.lower() == tag.lower():
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
261 break
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
262 self._enqueue(Stream.END, QName(open_tag))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
263 self._enqueue(Stream.END, QName(tag))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
264
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
265 def handle_data(self, text):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
266 self._enqueue(Stream.TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
267
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
268 def handle_charref(self, name):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
269 self._enqueue(Stream.TEXT, Markup('&#%s;' % name))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
270
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
271 def handle_entityref(self, name):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
272 self._enqueue(Stream.TEXT, Markup('&%s;' % name))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
273
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
274 def handle_pi(self, data):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
275 target, data = data.split(None, 1)
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
276 data = data.rstrip('?')
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
277 self._enqueue(Stream.PI, (target.strip(), data.strip()))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
278
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
279 def handle_comment(self, text):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
280 self._enqueue(Stream.COMMENT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
281
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
282
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
283 def HTML(text):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
284 return Stream(list(HTMLParser(StringIO(text))))
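
A minimal sketch of the split that handle_pi performs on processing-instruction text, pulled out of the parser class so it can be run on its own. The name split_pi is hypothetical and is not part of markup/input.py; the sample inputs are made up for illustration. Like the method it mirrors, it assumes the text contains a target followed by whitespace and a body, and raises ValueError otherwise.

```python
def split_pi(data):
    """Split raw processing-instruction text into (target, body).

    Hypothetical stand-alone helper mirroring the body of handle_pi,
    without the parser state or the event queue.
    """
    # Split on the first run of whitespace only, so the body may
    # itself contain spaces. Positional arguments are used because
    # Python 2's str.split() does not accept keyword arguments.
    target, body = data.split(None, 1)
    # The standard library HTML parser passes the trailing '?' of an
    # XML-style "?>" terminator through as part of the data.
    body = body.rstrip('?')
    return target.strip(), body.strip()


if __name__ == '__main__':
    print(split_pi('xml-stylesheet href="style.css" type="text/css"?'))
    print(split_pi('php echo 1 ?'))
```

Passing None as the separator splits on any whitespace, and the limit of 1 keeps everything after the target together as a single body string.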