genshi/genshi-test: genshi/input.py annotate

annotate genshi/input.py @ 820:1837f39efd6f experimental-inline

Sync (old) experimental inline branch with trunk@1027.

author	cmlenz
date	Wed, 11 Mar 2009 17:51:06 +0000
parents	0742f421caba
children	09cc3627654c

rev	line source
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	1 # -*- coding: utf-8 -*-
821114ec4f69 Initial import. cmlenz parents: diff changeset	2 #
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	3 # Copyright (C) 2006-2007 Edgewall Software
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	4 # All rights reserved.
821114ec4f69 Initial import. cmlenz parents: diff changeset	5 #
821114ec4f69 Initial import. cmlenz parents: diff changeset	6 # This software is licensed as described in the file COPYING, which
821114ec4f69 Initial import. cmlenz parents: diff changeset	7 # you should have received as part of this distribution. The terms
230 24757b771651 Renamed Markup to Genshi in repository. cmlenz parents: 213 diff changeset	8 # are also available at http://genshi.edgewall.org/wiki/License.
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	9 #
821114ec4f69 Initial import. cmlenz parents: diff changeset	10 # This software consists of voluntary contributions made by many
821114ec4f69 Initial import. cmlenz parents: diff changeset	11 # individuals. For the exact contribution history, see the revision
230 24757b771651 Renamed Markup to Genshi in repository. cmlenz parents: 213 diff changeset	12 # history and logs, available at http://genshi.edgewall.org/log/.
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	13
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	14 """Support for constructing markup streams from files, strings, or other
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	15 sources.
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	16 """
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	17
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	18 from itertools import chain
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	19 from xml.parsers import expat
821114ec4f69 Initial import. cmlenz parents: diff changeset	20 import HTMLParser as html
821114ec4f69 Initial import. cmlenz parents: diff changeset	21 import htmlentitydefs
821114ec4f69 Initial import. cmlenz parents: diff changeset	22 from StringIO import StringIO
821114ec4f69 Initial import. cmlenz parents: diff changeset	23
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	24 from genshi.core import Attrs, QName, Stream, stripentities
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	25 from genshi.core import START, END, XML_DECL, DOCTYPE, TEXT, START_NS, END_NS, \
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	26 START_CDATA, END_CDATA, PI, COMMENT
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	27
290 a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	28 __all__ = ['ET', 'ParseError', 'XMLParser', 'XML', 'HTMLParser', 'HTML']
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	29 __docformat__ = 'restructuredtext en'
290 a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	30
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	31 def ET(element):
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	32 """Convert a given ElementTree element to a markup stream.
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	33
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	34 :param element: an ElementTree element
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	35 :return: a markup stream
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	36 """
290 a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	37 tag_name = QName(element.tag.lstrip('{'))
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	38 attrs = Attrs([(QName(attr.lstrip('{')), value)
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	39 for attr, value in element.items()])
290 a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	40
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	41 yield START, (tag_name, attrs), (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	42 if element.text:
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	43 yield TEXT, element.text, (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	44 for child in element.getchildren():
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	45 for item in ET(child):
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	46 yield item
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	47 yield END, tag_name, (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	48 if element.tail:
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	49 yield TEXT, element.tail, (None, -1, -1)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	50
821114ec4f69 Initial import. cmlenz parents: diff changeset	51
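The depth-first traversal performed by ET() above can be tried without Genshi installed. The following is a minimal sketch, not the module's own code: it assumes plain strings in place of the genshi.core event kinds and plain tuples in place of QName and Attrs, and the name element_events is hypothetical. It is written for Python 3 with the standard library only, so it iterates over the element directly rather than calling getchildren().

```python
import xml.etree.ElementTree as etree

# Stand-ins for the genshi.core event kinds (assumption: plain strings here).
START, END, TEXT = 'START', 'END', 'TEXT'
NO_POS = (None, -1, -1)  # the same "unknown position" tuple that ET() yields


def element_events(element):
    """Yield (kind, data, pos) tuples for an element, depth-first."""
    attrs = list(element.items())
    yield START, (element.tag, attrs), NO_POS
    if element.text:
        yield TEXT, element.text, NO_POS
    for child in element:
        # Recurse into each child and pass its events through unchanged.
        for event in element_events(child):
            yield event
    yield END, element.tag, NO_POS
    if element.tail:
        # Tail text is emitted after the element's own END event.
        yield TEXT, element.tail, NO_POS


root = etree.fromstring('<root id="2"><child>Foo</child>bar</root>')
events = [(kind, data) for kind, data, pos in element_events(root)]
```

Note that the tail text of an element is reported by the call that handles that element, which is why the END event of the child precedes the trailing text of its parent.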
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	52 class ParseError(Exception):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	53 """Exception raised when fatal syntax errors are found in the input being
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	54 parsed.
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	55 """
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	56
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	57 def __init__(self, message, filename=None, lineno=-1, offset=-1):
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	58 """Exception initializer.
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	59
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	60 :param message: the error message from the parser
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	61 :param filename: the path to the file that was parsed
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	62 :param lineno: the number of the line on which the error was encountered
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	63 :param offset: the column number where the error was encountered
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	64 """
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	65 self.msg = message
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	66 if filename:
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	67 message += ', in ' + filename
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	68 Exception.__init__(self, message)
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	69 self.filename = filename or '<string>'
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	70 self.lineno = lineno
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	71 self.offset = offset
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	72
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	73
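How the ParseError initializer composes its message and location attributes can be illustrated in isolation. This is a sketch under stated assumptions: SketchParseError and check_well_formed are hypothetical names introduced here, the class only mirrors the initializer shown above, and the code targets Python 3 with the standard library Expat binding rather than the Python 2 syntax of the original module.

```python
from xml.parsers import expat


class SketchParseError(Exception):
    """Hypothetical stand-in that mirrors the ParseError initializer above."""

    def __init__(self, message, filename=None, lineno=-1, offset=-1):
        self.msg = message  # the bare parser message, without the filename
        if filename:
            message += ', in ' + filename
        Exception.__init__(self, message)
        self.filename = filename or '<string>'
        self.lineno = lineno
        self.offset = offset


def check_well_formed(text, filename=None):
    """Parse text with Expat, re-raising syntax errors with location info."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except expat.ExpatError as e:
        # ExpatError carries the line and column at which parsing failed.
        raise SketchParseError(str(e), filename, e.lineno, e.offset)


try:
    check_well_formed('<root>\n<child></root>', 'broken.xml')
except SketchParseError as e:
    caught = e
```

Keeping the undecorated message in msg while passing the filename-qualified text to Exception means callers can format their own reports from msg, filename, lineno and offset without parsing the exception string.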
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	74 class XMLParser(object):
821114ec4f69 Initial import. cmlenz parents: diff changeset	75 """Generator-based XML parser based on roughly equivalent code in
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	76 Kid/ElementTree.
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	77
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	78 The parsing is initiated by iterating over the parser object:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	79
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	80 >>> parser = XMLParser(StringIO('<root id="2"><child>Foo</child></root>'))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	81 >>> for kind, data, pos in parser:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	82 ... print kind, data
326 08ada6b4b767 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again. cmlenz parents: 316 diff changeset	83 START (QName(u'root'), Attrs([(QName(u'id'), u'2')]))
08ada6b4b767 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again. cmlenz parents: 316 diff changeset	84 START (QName(u'child'), Attrs())
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	85 TEXT Foo
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	86 END child
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	87 END root
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	88 """
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	89
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	90 _entitydefs = ['<!ENTITY %s "&#%d;">' % (name, value) for name, value in
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	91 htmlentitydefs.name2codepoint.items()]
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	92 _external_dtd = '\n'.join(_entitydefs)
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	93
316 4ab9edf5e83b Configurable encoding of template files, closing #65. cmlenz parents: 312 diff changeset	94 def __init__(self, source, filename=None, encoding=None):
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	95 """Initialize the parser for the given XML input.
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	96
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	97 :param source: the XML text as a file-like object
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	98 :param filename: the name of the file, if appropriate
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	99 :param encoding: the encoding of the file; if not specified, the
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	100 encoding is assumed to be ASCII, UTF-8, or UTF-16, or
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	101 whatever encoding is specified in the XML declaration
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	102 (if any)
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	103 """
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	104 self.source = source
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	105 self.filename = filename
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	106
821114ec4f69 Initial import. cmlenz parents: diff changeset	107 # Set up the Expat parser
316 4ab9edf5e83b Configurable encoding of template files, closing #65. cmlenz parents: 312 diff changeset	108 parser = expat.ParserCreate(encoding, '}')
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	109 parser.buffer_text = True
821114ec4f69 Initial import. cmlenz parents: diff changeset	110 parser.returns_unicode = True
160 faea6db52ef1 Attribute order in parsed XML is now preserved. cmlenz parents: 146 diff changeset	111 parser.ordered_attributes = True
faea6db52ef1 Attribute order in parsed XML is now preserved. cmlenz parents: 146 diff changeset	112
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	113 parser.StartElementHandler = self._handle_start
821114ec4f69 Initial import. cmlenz parents: diff changeset	114 parser.EndElementHandler = self._handle_end
821114ec4f69 Initial import. cmlenz parents: diff changeset	115 parser.CharacterDataHandler = self._handle_data
821114ec4f69 Initial import. cmlenz parents: diff changeset	116 parser.StartDoctypeDeclHandler = self._handle_doctype
821114ec4f69 Initial import. cmlenz parents: diff changeset	117 parser.StartNamespaceDeclHandler = self._handle_start_ns
821114ec4f69 Initial import. cmlenz parents: diff changeset	118 parser.EndNamespaceDeclHandler = self._handle_end_ns
143 ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	119 parser.StartCdataSectionHandler = self._handle_start_cdata
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	120 parser.EndCdataSectionHandler = self._handle_end_cdata
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	121 parser.ProcessingInstructionHandler = self._handle_pi
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	122 parser.XmlDeclHandler = self._handle_xml_decl
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	123 parser.CommentHandler = self._handle_comment
209 5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	124
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	125 # Tell Expat that we'll handle non-XML entities ourselves
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	126 # (in _handle_other)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	127 parser.DefaultHandler = self._handle_other
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	128 parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
209 5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	129 parser.UseForeignDTD()
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	130 parser.ExternalEntityRefHandler = self._build_foreign
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	131
821114ec4f69 Initial import. cmlenz parents: diff changeset	132 self.expat = parser
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	133 self._queue = []
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	134
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	135 def parse(self):
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	136 """Generator that parses the XML source, yielding markup events.
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	137
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	138 :return: a markup event stream
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	139 :raises ParseError: if the XML text is not well formed
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	140 """
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	141 def _generate():
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	142 try:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	143 bufsize = 4 * 1024 # 4K
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	144 done = False
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	145 while 1:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	146 while not done and len(self._queue) == 0:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	147 data = self.source.read(bufsize)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	148 if data == '': # end of data
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	149 if hasattr(self, 'expat'):
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	150 self.expat.Parse('', True)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	151 del self.expat # get rid of circular references
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	152 done = True
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	153 else:
207 0619a27f5e67 The `XMLParser` now correctly handles unicode input. Closes #43. cmlenz parents: 182 diff changeset	154 if isinstance(data, unicode):
0619a27f5e67 The `XMLParser` now correctly handles unicode input. Closes #43. cmlenz parents: 182 diff changeset	155 data = data.encode('utf-8')
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	156 self.expat.Parse(data, False)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	157 for event in self._queue:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	158 yield event
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	159 self._queue = []
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	160 if done:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	161 break
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	162 except expat.ExpatError, e:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	163 msg = str(e)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	164 raise ParseError(msg, self.filename, e.lineno, e.offset)
146 db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	165 return Stream(_generate()).filter(_coalesce)
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	166
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	167 def __iter__(self):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	168 return iter(self.parse())
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	169
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	170 def _build_foreign(self, context, base, sysid, pubid):
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	171 parser = self.expat.ExternalEntityParserCreate(context)
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	172 parser.ParseFile(StringIO(self._external_dtd))
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	173 return 1
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	174
143 ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	175 def _enqueue(self, kind, data=None, pos=None):
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	176 if pos is None:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	177 pos = self._getpos()
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	178 if kind is TEXT:
134 df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	179 # Expat reports the end of the text event as current position. We
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	180 # try to fix that up here as much as possible. Unfortunately, the
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	181 # offset is only valid for single-line text. For multi-line text,
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	182 # it is apparently not possible to determine at what offset it
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	183 # started
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	184 if '\n' in data:
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	185 lines = data.splitlines()
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	186 lineno = pos[1] - len(lines) + 1
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	187 offset = -1
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	188 else:
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	189 lineno = pos[1]
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	190 offset = pos[2] - len(data)
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	191 pos = (pos[0], lineno, offset)
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	192 self._queue.append((kind, data, pos))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	193
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	194 def _getpos_unknown(self):
134 df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	195 return (self.filename, -1, -1)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	196
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	197 def _getpos(self):
134 df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	198 return (self.filename, self.expat.CurrentLineNumber,
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	199 self.expat.CurrentColumnNumber)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	200
821114ec4f69 Initial import. cmlenz parents: diff changeset	201 def _handle_start(self, tag, attrib):
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	202 attrs = Attrs([(QName(name), value) for name, value in
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	203 zip(*[iter(attrib)] * 2)])
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	204 self._enqueue(START, (QName(tag), attrs))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	205
821114ec4f69 Initial import. cmlenz parents: diff changeset	206 def _handle_end(self, tag):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	207 self._enqueue(END, QName(tag))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	208
821114ec4f69 Initial import. cmlenz parents: diff changeset	209 def _handle_data(self, text):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	210 self._enqueue(TEXT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	211
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	212 def _handle_xml_decl(self, version, encoding, standalone):
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	213 self._enqueue(XML_DECL, (version, encoding, standalone))
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	214
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	215 def _handle_doctype(self, name, sysid, pubid, has_internal_subset):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	216 self._enqueue(DOCTYPE, (name, pubid, sysid))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	217
821114ec4f69 Initial import. cmlenz parents: diff changeset	218 def _handle_start_ns(self, prefix, uri):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	219 self._enqueue(START_NS, (prefix or '', uri))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	220
821114ec4f69 Initial import. cmlenz parents: diff changeset	221 def _handle_end_ns(self, prefix):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	222 self._enqueue(END_NS, prefix or '')
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	223
143 ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	224 def _handle_start_cdata(self):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	225 self._enqueue(START_CDATA)
143 ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	226
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	227 def _handle_end_cdata(self):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	228 self._enqueue(END_CDATA)
143 ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	229
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	230 def _handle_pi(self, target, data):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	231 self._enqueue(PI, (target, data))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	232
821114ec4f69 Initial import. cmlenz parents: diff changeset	233 def _handle_comment(self, text):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	234 self._enqueue(COMMENT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	235
821114ec4f69 Initial import. cmlenz parents: diff changeset	236 def _handle_other(self, text):
821114ec4f69 Initial import. cmlenz parents: diff changeset	237 if text.startswith('&'):
821114ec4f69 Initial import. cmlenz parents: diff changeset	238 # deal with undefined entities
821114ec4f69 Initial import. cmlenz parents: diff changeset	239 try:
821114ec4f69 Initial import. cmlenz parents: diff changeset	240 text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	241 self._enqueue(TEXT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	242 except KeyError:
209 5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	243 filename, lineno, offset = self._getpos()
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	244 error = expat.error('undefined entity "%s": line %d, column %d'
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	245 % (text, lineno, offset))
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	246 error.code = expat.errors.XML_ERROR_UNDEFINED_ENTITY
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	247 error.lineno = lineno
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	248 error.offset = offset
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	249 raise error
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	250
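The `_handle_start` handler above appears to rely on the `zip(*[iter(seq)] * 2)` idiom to turn a flat name/value list into pairs. The following is a minimal Python 3 sketch of that idiom, assuming the Expat parser is configured with `ordered_attributes` so that attributes arrive as a flat list (that configuration is not shown in this part of the listing). The helper names `pair_attributes` and `collect_start_tags` are hypothetical and not part of Genshi.

```python
from xml.parsers import expat


def pair_attributes(flat):
    """Turn [name1, value1, name2, value2, ...] into [(name1, value1), ...]."""
    # The same iterator object is handed to zip() twice, so each tuple
    # consumes two consecutive items from the flat list.
    return list(zip(*[iter(flat)] * 2))


def collect_start_tags(xml_text):
    """Return (tag, [(name, value), ...]) for every start tag in xml_text."""
    seen = []
    parser = expat.ParserCreate()
    # Assumption: attributes are requested as a flat list, not as a dict.
    parser.ordered_attributes = True
    parser.StartElementHandler = lambda tag, attrib: seen.append(
        (tag, pair_attributes(attrib)))
    parser.Parse(xml_text, True)
    return seen


start_tags = collect_start_tags('<doc a="1" b="2"><elem c="3"/></doc>')
```

Unlike a dict, the flat list keeps the attributes in document order, which is presumably why it is paired up by hand here.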
821114ec4f69 Initial import. cmlenz parents: diff changeset	251
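The position fix-up for text events in `_enqueue` can be isolated as a small function. This is a sketch in Python 3 that mirrors the arithmetic shown above; `fixup_text_pos` is a hypothetical name, and the `(filename, lineno, column)` tuple layout is taken from `_getpos`.

```python
def fixup_text_pos(data, pos):
    """Estimate where a text event started, given where it ended.

    `pos` is a (filename, lineno, column) tuple describing the END of the
    text, which is what Expat reports according to the comment above.
    """
    filename, end_line, end_column = pos
    if '\n' in data:
        # Multi-line text: step back by the number of lines spanned. The
        # starting column cannot be recovered, so it is reported as -1.
        lineno = end_line - len(data.splitlines()) + 1
        offset = -1
    else:
        # Single-line text: the start is simply len(data) columns earlier.
        lineno = end_line
        offset = end_column - len(data)
    return (filename, lineno, offset)


single = fixup_text_pos('hello', ('doc.xml', 3, 12))
multi = fixup_text_pos('foo\nbar', ('doc.xml', 5, 3))
```

The `-1` offset acts as a sentinel for "unknown column", matching the convention of `_getpos_unknown`.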
821114ec4f69 Initial import. cmlenz parents: diff changeset	252 def XML(text):
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	253 """Parse the given XML source and return a markup stream.
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	254
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	255 Unlike with `XMLParser`, the returned stream is reusable, meaning it can be
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	256 iterated over multiple times:
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	257
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	258 >>> xml = XML('<doc><elem>Foo</elem><elem>Bar</elem></doc>')
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	259 >>> print xml
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	260 <doc><elem>Foo</elem><elem>Bar</elem></doc>
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	261 >>> print xml.select('elem')
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	262 <elem>Foo</elem><elem>Bar</elem>
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	263 >>> print xml.select('elem/text()')
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	264 FooBar
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	265
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	266 :param text: the XML source
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	267 :return: the parsed XML event stream
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	268 :raises ParseError: if the XML text is not well-formed
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	269 """
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	270 return Stream(list(XMLParser(StringIO(text))))
821114ec4f69 Initial import. cmlenz parents: diff changeset	271
821114ec4f69 Initial import. cmlenz parents: diff changeset	272
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	273 class HTMLParser(html.HTMLParser, object):
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	274 """Parser for HTML input based on the Python `HTMLParser` module.
821114ec4f69 Initial import. cmlenz parents: diff changeset	275
821114ec4f69 Initial import. cmlenz parents: diff changeset	276 This class provides the same interface for generating stream events as
821114ec4f69 Initial import. cmlenz parents: diff changeset	277 `XMLParser`, and attempts to automatically balance tags.
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	278
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	279 The parsing is initiated by iterating over the parser object:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	280
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	281 >>> parser = HTMLParser(StringIO('<UL compact><LI>Foo</UL>'))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	282 >>> for kind, data, pos in parser:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	283 ... print kind, data
326 08ada6b4b767 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again. cmlenz parents: 316 diff changeset	284 START (QName(u'ul'), Attrs([(QName(u'compact'), u'compact')]))
08ada6b4b767 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again. cmlenz parents: 316 diff changeset	285 START (QName(u'li'), Attrs())
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	286 TEXT Foo
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	287 END li
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	288 END ul
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	289 """
821114ec4f69 Initial import. cmlenz parents: diff changeset	290
821114ec4f69 Initial import. cmlenz parents: diff changeset	291 _EMPTY_ELEMS = frozenset(['area', 'base', 'basefont', 'br', 'col', 'frame',
821114ec4f69 Initial import. cmlenz parents: diff changeset	292 'hr', 'img', 'input', 'isindex', 'link', 'meta',
821114ec4f69 Initial import. cmlenz parents: diff changeset	293 'param'])
821114ec4f69 Initial import. cmlenz parents: diff changeset	294
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	295 def __init__(self, source, filename=None, encoding='utf-8'):
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	296 """Initialize the parser for the given HTML input.
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	297
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	298 :param source: the HTML text as a file-like object
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	299 :param filename: the name of the file, if known
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	300 :param encoding: the encoding of the file; ignored if the input is unicode
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	301 """
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	302 html.HTMLParser.__init__(self)
821114ec4f69 Initial import. cmlenz parents: diff changeset	303 self.source = source
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	304 self.filename = filename
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	305 self.encoding = encoding
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	306 self._queue = []
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	307 self._open_tags = []
821114ec4f69 Initial import. cmlenz parents: diff changeset	308
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	309 def parse(self):
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	310 """Generator that parses the HTML source, yielding markup events.
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	311
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	312 :return: a markup event stream
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	313 :raises ParseError: if the HTML text is not well-formed
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	314 """
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	315 def _generate():
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	316 try:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	317 bufsize = 4 * 1024 # 4K
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	318 done = False
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	319 while 1:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	320 while not done and len(self._queue) == 0:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	321 data = self.source.read(bufsize)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	322 if data == '': # end of data
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	323 self.close()
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	324 done = True
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	325 else:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	326 self.feed(data)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	327 for kind, data, pos in self._queue:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	328 yield kind, data, pos
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	329 self._queue = []
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	330 if done:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	331 open_tags = self._open_tags
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	332 open_tags.reverse()
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	333 for tag in open_tags:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	334 yield END, QName(tag), pos
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	335 break
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	336 except html.HTMLParseError, e:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	337 msg = '%s: line %d, column %d' % (e.msg, e.lineno, e.offset)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	338 raise ParseError(msg, self.filename, e.lineno, e.offset)
146 db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	339 return Stream(_generate()).filter(_coalesce)
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	340
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	341 def __iter__(self):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	342 return iter(self.parse())
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	343
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	344 def _enqueue(self, kind, data, pos=None):
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	345 if pos is None:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	346 pos = self._getpos()
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	347 self._queue.append((kind, data, pos))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	348
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	349 def _getpos(self):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	350 lineno, column = self.getpos()
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	351 return (self.filename, lineno, column)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	352
821114ec4f69 Initial import. cmlenz parents: diff changeset	353 def handle_starttag(self, tag, attrib):
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	354 fixed_attrib = []
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	355 for name, value in attrib: # Fixup minimized attributes
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	356 if value is None:
312 7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`. cmlenz parents: 311 diff changeset	357 value = unicode(name)
7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`. cmlenz parents: 311 diff changeset	358 elif not isinstance(value, unicode):
7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`. cmlenz parents: 311 diff changeset	359 value = value.decode(self.encoding, 'replace')
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	360 fixed_attrib.append((QName(name), stripentities(value)))
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	361
182 41db0260ebb1 Renamed `Attributes` to `Attrs` to reduce the verbosity. cmlenz parents: 160 diff changeset	362 self._enqueue(START, (QName(tag), Attrs(fixed_attrib)))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	363 if tag in self._EMPTY_ELEMS:
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	364 self._enqueue(END, QName(tag))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	365 else:
821114ec4f69 Initial import. cmlenz parents: diff changeset	366 self._open_tags.append(tag)
821114ec4f69 Initial import. cmlenz parents: diff changeset	367
821114ec4f69 Initial import. cmlenz parents: diff changeset	368 def handle_endtag(self, tag):
821114ec4f69 Initial import. cmlenz parents: diff changeset	369 if tag not in self._EMPTY_ELEMS:
821114ec4f69 Initial import. cmlenz parents: diff changeset	370 while self._open_tags:
821114ec4f69 Initial import. cmlenz parents: diff changeset	371 open_tag = self._open_tags.pop()
395 55cf81951686 inline branch: Merged [439:479/trunk]. cmlenz parents: 326 diff changeset	372 self._enqueue(END, QName(open_tag))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	373 if open_tag.lower() == tag.lower():
821114ec4f69 Initial import. cmlenz parents: diff changeset	374 break
821114ec4f69 Initial import. cmlenz parents: diff changeset	375
821114ec4f69 Initial import. cmlenz parents: diff changeset	376 def handle_data(self, text):
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	377 if not isinstance(text, unicode):
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	378 text = text.decode(self.encoding, 'replace')
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	379 self._enqueue(TEXT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	380
821114ec4f69 Initial import. cmlenz parents: diff changeset	381 def handle_charref(self, name):
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	382 if name.lower().startswith('x'):
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	383 text = unichr(int(name[1:], 16))
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	384 else:
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	385 text = unichr(int(name))
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	386 self._enqueue(TEXT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	387
821114ec4f69 Initial import. cmlenz parents: diff changeset	388 def handle_entityref(self, name):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	389 try:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	390 text = unichr(htmlentitydefs.name2codepoint[name])
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	391 except KeyError:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	392 text = '&%s;' % name
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	393 self._enqueue(TEXT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	394
821114ec4f69 Initial import. cmlenz parents: diff changeset	395 def handle_pi(self, data):
395 55cf81951686 inline branch: Merged [439:479/trunk]. cmlenz parents: 326 diff changeset	396 target, data = data.split(None, 1)
55cf81951686 inline branch: Merged [439:479/trunk]. cmlenz parents: 326 diff changeset	397 if data.endswith('?'):
55cf81951686 inline branch: Merged [439:479/trunk]. cmlenz parents: 326 diff changeset	398 data = data[:-1]
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	399 self._enqueue(PI, (target.strip(), data.strip()))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	400
821114ec4f69 Initial import. cmlenz parents: diff changeset	401 def handle_comment(self, text):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	402 self._enqueue(COMMENT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	403
821114ec4f69 Initial import. cmlenz parents: diff changeset	404
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	405 def HTML(text, encoding='utf-8'):
500 0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	406 """Parse the given HTML source and return a markup stream.
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	407
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	408 Unlike with `HTMLParser`, the returned stream is reusable, meaning it can be
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	409 iterated over multiple times:
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	410
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	411 >>> html = HTML('<body><h1>Foo</h1></body>')
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	412 >>> print html
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	413 <body><h1>Foo</h1></body>
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	414 >>> print html.select('h1')
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	415 <h1>Foo</h1>
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	416 >>> print html.select('h1/text()')
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	417 Foo
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	418
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	419 :param text: the HTML source
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	420 :return: the parsed XML event stream
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	421 :raises ParseError: if the HTML text is not well-formed, and error recovery
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	422 fails
0742f421caba Merged revisions 487-603 via svnmerge from cmlenz parents: 395 diff changeset	423 """
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	424 return Stream(list(HTMLParser(StringIO(text), encoding=encoding)))
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	425
146 db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	426 def _coalesce(stream):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	427 """Coalesces adjacent TEXT events into a single event."""
146 db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	428 textbuf = []
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	429 textpos = None
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	430 for kind, data, pos in chain(stream, [(None, None, None)]):
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	431 if kind is TEXT:
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	432 textbuf.append(data)
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	433 if textpos is None:
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	434 textpos = pos
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	435 else:
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	436 if textbuf:
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	437 yield TEXT, u''.join(textbuf), textpos
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	438 del textbuf[:]
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	439 textpos = None
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	440 if kind:
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	441 yield kind, data, pos
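
The coalescing step above can be tried outside Genshi. What follows is a minimal sketch in Python 3, not the module's own code: `TEXT` and `START` are plain-string stand-ins for the real event kinds, the name `coalesce` and the sample events are hypothetical, and the positions are made-up `(filename, line, column)` tuples. It shows that a run of adjacent text events collapses into one event that keeps the position of the first fragment, and that the trailing sentinel is what flushes text left in the buffer at the end of the stream.

```python
from itertools import chain

# Stand-ins for Genshi's event kinds (assumption: plain strings are enough here).
TEXT = "TEXT"
START = "START"


def coalesce(stream):
    """Merge each run of adjacent TEXT events into a single TEXT event."""
    textbuf = []
    textpos = None
    # The trailing (None, None, None) sentinel forces a final flush of the buffer.
    for kind, data, pos in chain(stream, [(None, None, None)]):
        if kind == TEXT:
            textbuf.append(data)
            if textpos is None:
                textpos = pos  # remember where the run of text started
        else:
            if textbuf:
                yield TEXT, "".join(textbuf), textpos
                del textbuf[:]
                textpos = None
            if kind:  # skip the sentinel itself
                yield kind, data, pos


# Hypothetical events, as if text had been split at a buffer boundary.
events = [
    (START, "p", (None, 1, 0)),
    (TEXT, "Hello, ", (None, 1, 3)),
    (TEXT, "world", (None, 1, 10)),
]
merged = list(coalesce(events))
```

Without the sentinel, a stream that ends in text would never reach the `else` branch, so the buffered text would be dropped.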
