Şu anda, pandaların G / Ç araçları bir read_xml()
yöntem ve muadili tutmamaktadır to_xml()
. Ancak, read_json
ağaç benzeri yapıların veri çerçevesi içe aktarma ve read_html
biçimlendirme formatları için uygulanabileceğini kanıtlar .
Pandalar ekibi, read_xml
gelecekteki bir Pandalar sürümü için böyle bir yöntemi düşünürse , hangi uygulamayı takip ederdi: XPath 1.0 ve XSLT 1.0 yöntemleriyle yerleşik xml.etree.ElementTree
ile iterfind()
veya iterparse()
işlevleriyle veya üçüncü taraf modülüyle ayrıştırma lxml
?
Aşağıda, basit, düz, öğe merkezli XML girişi üzerinde dört yöntem türü için test çalıştırmalarım yer almaktadır. Tümü, kökün herhangi bir ikinci düzey alt düzeyi için genelleştirilmiş ayrıştırma için ayarlanmıştır ve her yöntem, tam olarak aynı pandalar veri çerçevesini vermelidir. pd.Dataframe()
Sözlük listesindeki son aramalar hariç hepsi . Dökme için XSLT yöntemi CSV, XML dönüşümleri StringIO()
içinde pd.read_csv()
.
Soru (çok parçalı)
PERFORMANS:
iterparse
Dosya yinelemeli olarak ayrıştırılırken daha büyük dosyalar için sıklıkla önerilen yavaş olanı nasıl açıklarsınız? Kısmenif
mantık kontrollerinden mi kaynaklanıyor ?BELLEK: CPU belleği, G / Ç aramalarındaki zamanlamalarla ilişkili mi? XSLT ve XPath 1.0, daha büyük XML belgelerinde iyi ölçeklenemez, çünkü tüm dosyanın ayrıştırılması için bellekte okunması gerekir.
STRATEJİ: Sözlük listesi
Dataframe()
arama için en uygun strateji midir? : Bu ilginç cevaplar bakın jeneratör sürümü ve bir iterwalk kullanıcı tanımlı sürümü. Her iki yukarı yayın listesi veri çerçevesine.
Giriş Verileri (Stack Overflow'un panda arkadaşlarımızın dahil olduğu yıllara göre mevcut en iyi kullanıcıları )
<?xml version="1.0" encoding="utf-8"?>
<stackoverflow>
<topusers>
<user>Gordon Linoff</user>
<link>http://www.stackoverflow.com//users/1144035/gordon-linoff</link>
<location>New York, United States</location>
<year_rep>5,985</year_rep>
<total_rep>499,408</total_rep>
<tag1>sql</tag1>
<tag2>sql-server</tag2>
<tag3>mysql</tag3>
</topusers>
<topusers>
<user>Günter Zöchbauer</user>
<link>http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer</link>
<location>Linz, Austria</location>
<year_rep>5,835</year_rep>
<total_rep>154,439</total_rep>
<tag1>angular2</tag1>
<tag2>typescript</tag2>
<tag3>javascript</tag3>
</topusers>
<topusers>
<user>jezrael</user>
<link>http://www.stackoverflow.com//users/2901002/jezrael</link>
<location>Bratislava, Slovakia</location>
<year_rep>5,740</year_rep>
<total_rep>83,237</total_rep>
<tag1>pandas</tag1>
<tag2>python</tag2>
<tag3>dataframe</tag3>
</topusers>
<topusers>
<user>VonC</user>
<link>http://www.stackoverflow.com//users/6309/vonc</link>
<location>France</location>
<year_rep>5,577</year_rep>
<total_rep>651,397</total_rep>
<tag1>git</tag1>
<tag2>github</tag2>
<tag3>docker</tag3>
</topusers>
<topusers>
<user>Martijn Pieters</user>
<link>http://www.stackoverflow.com//users/100297/martijn-pieters</link>
<location>Cambridge, United Kingdom</location>
<year_rep>5,337</year_rep>
<total_rep>525,176</total_rep>
<tag1>python</tag1>
<tag2>python-3.x</tag2>
<tag3>python-2.7</tag3>
</topusers>
<topusers>
<user>T.J. Crowder</user>
<link>http://www.stackoverflow.com//users/157247/t-j-crowder</link>
<location>United Kingdom</location>
<year_rep>5,258</year_rep>
<total_rep>508,310</total_rep>
<tag1>javascript</tag1>
<tag2>jquery</tag2>
<tag3>java</tag3>
</topusers>
<topusers>
<user>akrun</user>
<link>http://www.stackoverflow.com//users/3732271/akrun</link>
<location></location>
<year_rep>5,188</year_rep>
<total_rep>229,553</total_rep>
<tag1>r</tag1>
<tag2>dplyr</tag2>
<tag3>dataframe</tag3>
</topusers>
<topusers>
<user>Wiktor Stribi?ew</user>
<link>http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew</link>
<location>Warsaw, Poland</location>
<year_rep>4,948</year_rep>
<total_rep>158,134</total_rep>
<tag1>regex</tag1>
<tag2>javascript</tag2>
<tag3>c#</tag3>
</topusers>
<topusers>
<user>Darin Dimitrov</user>
<link>http://www.stackoverflow.com//users/29407/darin-dimitrov</link>
<location>Sofia, Bulgaria</location>
<year_rep>4,936</year_rep>
<total_rep>709,683</total_rep>
<tag1>c#</tag1>
<tag2>asp.net-mvc</tag2>
<tag3>asp.net-mvc-3</tag3>
</topusers>
<topusers>
<user>Eric Duminil</user>
<link>http://www.stackoverflow.com//users/6419007/eric-duminil</link>
<location></location>
<year_rep>4,854</year_rep>
<total_rep>12,557</total_rep>
<tag1>ruby</tag1>
<tag2>ruby-on-rails</tag2>
<tag3>arrays</tag3>
</topusers>
<topusers>
<user>alecxe</user>
<link>http://www.stackoverflow.com//users/771848/alecxe</link>
<location>New York, United States</location>
<year_rep>4,723</year_rep>
<total_rep>233,368</total_rep>
<tag1>python</tag1>
<tag2>selenium</tag2>
<tag3>protractor</tag3>
</topusers>
<topusers>
<user>Jean-François Fabre</user>
<link>http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre</link>
<location>Toulouse, France</location>
<year_rep>4,526</year_rep>
<total_rep>30,027</total_rep>
<tag1>python</tag1>
<tag2>python-3.x</tag2>
<tag3>python-2.7</tag3>
</topusers>
<topusers>
<user>piRSquared</user>
<link>http://www.stackoverflow.com//users/2336654/pirsquared</link>
<location>Bellevue, WA, United States</location>
<year_rep>4,482</year_rep>
<total_rep>41,183</total_rep>
<tag1>pandas</tag1>
<tag2>python</tag2>
<tag3>dataframe</tag3>
</topusers>
<topusers>
<user>CommonsWare</user>
<link>http://www.stackoverflow.com//users/115145/commonsware</link>
<location>Who Wants to Know?</location>
<year_rep>4,475</year_rep>
<total_rep>616,135</total_rep>
<tag1>android</tag1>
<tag2>java</tag2>
<tag3>android-intent</tag3>
</topusers>
<topusers>
<user>Quentin</user>
<link>http://www.stackoverflow.com//users/19068/quentin</link>
<location>United Kingdom</location>
<year_rep>4,464</year_rep>
<total_rep>509,365</total_rep>
<tag1>javascript</tag1>
<tag2>html</tag2>
<tag3>css</tag3>
</topusers>
<topusers>
<user>Jon Skeet</user>
<link>http://www.stackoverflow.com//users/22656/jon-skeet</link>
<location>Reading, United Kingdom</location>
<year_rep>4,348</year_rep>
<total_rep>921,690</total_rep>
<tag1>c#</tag1>
<tag2>java</tag2>
<tag3>.net</tag3>
</topusers>
<topusers>
<user>Felix Kling</user>
<link>http://www.stackoverflow.com//users/218196/felix-kling</link>
<location>Sunnyvale, CA</location>
<year_rep>4,324</year_rep>
<total_rep>411,535</total_rep>
<tag1>javascript</tag1>
<tag2>jquery</tag2>
<tag3>asynchronous</tag3>
</topusers>
<topusers>
<user>matt</user>
<link>http://www.stackoverflow.com//users/341994/matt</link>
<location></location>
<year_rep>4,313</year_rep>
<total_rep>220,515</total_rep>
<tag1>swift</tag1>
<tag2>ios</tag2>
<tag3>xcode</tag3>
</topusers>
<topusers>
<user>Psidom</user>
<link>http://www.stackoverflow.com//users/4983450/psidom</link>
<location>Atlanta, GA, United States</location>
<year_rep>4,236</year_rep>
<total_rep>36,950</total_rep>
<tag1>python</tag1>
<tag2>pandas</tag2>
<tag3>r</tag3>
</topusers>
<topusers>
<user>Martin R</user>
<link>http://www.stackoverflow.com//users/1187415/martin-r</link>
<location>Germany</location>
<year_rep>4,195</year_rep>
<total_rep>269,380</total_rep>
<tag1>swift</tag1>
<tag2>ios</tag2>
<tag3>swift3</tag3>
</topusers>
<topusers>
<user>Barmar</user>
<link>http://www.stackoverflow.com//users/1491895/barmar</link>
<location>Arlington, MA</location>
<year_rep>4,179</year_rep>
<total_rep>289,989</total_rep>
<tag1>javascript</tag1>
<tag2>php</tag2>
<tag3>jquery</tag3>
</topusers>
<topusers>
<user>Alexey Mezenin</user>
<link>http://www.stackoverflow.com//users/1227923/alexey-mezenin</link>
<location>??????</location>
<year_rep>4,142</year_rep>
<total_rep>31,602</total_rep>
<tag1>laravel</tag1>
<tag2>php</tag2>
<tag3>laravel-5.3</tag3>
</topusers>
<topusers>
<user>BalusC</user>
<link>http://www.stackoverflow.com//users/157882/balusc</link>
<location>Amsterdam, Netherlands</location>
<year_rep>4,046</year_rep>
<total_rep>703,046</total_rep>
<tag1>java</tag1>
<tag2>jsf</tag2>
<tag3>servlets</tag3>
</topusers>
<topusers>
<user>GurV</user>
<link>http://www.stackoverflow.com//users/6348498/gurv</link>
<location></location>
<year_rep>4,016</year_rep>
<total_rep>7,932</total_rep>
<tag1>sql</tag1>
<tag2>mysql</tag2>
<tag3>sql-server</tag3>
</topusers>
<topusers>
<user>Nina Scholz</user>
<link>http://www.stackoverflow.com//users/1447675/nina-scholz</link>
<location>Berlin, Deutschland</location>
<year_rep>3,950</year_rep>
<total_rep>61,135</total_rep>
<tag1>javascript</tag1>
<tag2>arrays</tag2>
<tag3>object</tag3>
</topusers>
<topusers>
<user>JB Nizet</user>
<link>http://www.stackoverflow.com//users/571407/jb-nizet</link>
<location>Saint-Etienne, France</location>
<year_rep>3,923</year_rep>
<total_rep>418,780</total_rep>
<tag1>java</tag1>
<tag2>hibernate</tag2>
<tag3>java-8</tag3>
</topusers>
<topusers>
<user>Frank van Puffelen</user>
<link>http://www.stackoverflow.com//users/209103/frank-van-puffelen</link>
<location>San Francisco, CA</location>
<year_rep>3,920</year_rep>
<total_rep>86,520</total_rep>
<tag1>firebase</tag1>
<tag2>firebase-database</tag2>
<tag3>android</tag3>
</topusers>
<topusers>
<user>dasblinkenlight</user>
<link>http://www.stackoverflow.com//users/335858/dasblinkenlight</link>
<location>United States</location>
<year_rep>3,886</year_rep>
<total_rep>475,813</total_rep>
<tag1>c#</tag1>
<tag2>java</tag2>
<tag3>c++</tag3>
</topusers>
<topusers>
<user>Tim Biegeleisen</user>
<link>http://www.stackoverflow.com//users/1863229/tim-biegeleisen</link>
<location>Singapore</location>
<year_rep>3,814</year_rep>
<total_rep>77,211</total_rep>
<tag1>sql</tag1>
<tag2>mysql</tag2>
<tag3>java</tag3>
</topusers>
<topusers>
<user>Greg Hewgill</user>
<link>http://www.stackoverflow.com//users/893/greg-hewgill</link>
<location>Christchurch, New Zealand</location>
<year_rep>3,796</year_rep>
<total_rep>529,137</total_rep>
<tag1>git</tag1>
<tag2>python</tag2>
<tag3>git-pull</tag3>
</topusers>
<topusers>
<user>unutbu</user>
<link>http://www.stackoverflow.com//users/190597/unutbu</link>
<location></location>
<year_rep>3,735</year_rep>
<total_rep>401,595</total_rep>
<tag1>python</tag1>
<tag2>pandas</tag2>
<tag3>numpy</tag3>
</topusers>
<topusers>
<user>Hans Passant</user>
<link>http://www.stackoverflow.com//users/17034/hans-passant</link>
<location>Madison, WI</location>
<year_rep>3,688</year_rep>
<total_rep>672,118</total_rep>
<tag1>c#</tag1>
<tag2>.net</tag2>
<tag3>winforms</tag3>
</topusers>
<topusers>
<user>Jonathan Leffler</user>
<link>http://www.stackoverflow.com//users/15168/jonathan-leffler</link>
<location>California, USA</location>
<year_rep>3,649</year_rep>
<total_rep>455,157</total_rep>
<tag1>c</tag1>
<tag2>bash</tag2>
<tag3>unix</tag3>
</topusers>
<topusers>
<user>paxdiablo</user>
<link>http://www.stackoverflow.com//users/14860/paxdiablo</link>
<location></location>
<year_rep>3,636</year_rep>
<total_rep>507,043</total_rep>
<tag1>c</tag1>
<tag2>c++</tag2>
<tag3>bash</tag3>
</topusers>
<topusers>
<user>Pranav C Balan</user>
<link>http://www.stackoverflow.com//users/3037257/pranav-c-balan</link>
<location>Ramanthali, Kannur, Kerala, India</location>
<year_rep>3,604</year_rep>
<total_rep>64,476</total_rep>
<tag1>javascript</tag1>
<tag2>jquery</tag2>
<tag3>html</tag3>
</topusers>
<topusers>
<user>Suragch</user>
<link>http://www.stackoverflow.com//users/3681880/suragch</link>
<location>Hohhot, China</location>
<year_rep>3,580</year_rep>
<total_rep>71,032</total_rep>
<tag1>swift</tag1>
<tag2>ios</tag2>
<tag3>android</tag3>
</topusers>
</stackoverflow>
Python Yöntemleri
import xml.etree.ElementTree as et
import pandas as pd
from io import StringIO
from lxml import etree as lxet
def read_xml_iterfind():
tree = et.parse('Input.xml')
data = []
inner = {}
for el in tree.iterfind('./*'):
for i in el.iterfind('*'):
inner[i.tag] = i.text
data.append(inner)
inner = {}
df = pd.DataFrame(data)
def read_xml_iterparse():
data = []
inner = {}
i = 1
for (ev, el) in et.iterparse(path):
if i <= 2:
first_tag = el.tag
if el.tag == first_tag and len(inner) != 0:
data.append(inner)
inner = {}
if el.text is not None and len(el.text.strip()) > 0:
inner[el.tag] = el.text
i += 1
df = pd.DataFrame(data)
def read_xml_lxml_xpath():
tree = lxet.parse('Input.xml')
data = []
inner = {}
for el in tree.xpath('/*/*'):
for i in el:
inner[i.tag] = i.text
data.append(inner)
inner = {}
df = pd.DataFrame(data)
def read_xml_lxml_xsl():
xml = lxet.parse('Input.xml')
xslstr = '''
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="text"/>
<xsl:strip-space elements="*"/>
<!-- HEADERS -->
<xsl:template match = "/*">
<xsl:for-each select="*[1]/*">
<xsl:value-of select="local-name()" />
<xsl:choose>
<xsl:when test="position() != last()">
<xsl:text>,</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>
</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
<xsl:apply-templates/>
</xsl:template>
<!-- DATA ROWS (COMMA-SEPARATED) -->
<xsl:template match="/*/*" priority="2">
<xsl:for-each select="*">
<xsl:if test="position() = 1">
<xsl:text>"</xsl:text>
</xsl:if>
<xsl:value-of select="." />
<xsl:choose>
<xsl:when test="position() != last()">
<xsl:text>","</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>"
</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</xsl:template>
</xsl:transform>
'''
xsl = lxet.fromstring(xslstr)
transform = lxet.XSLT(xsl)
newdom = transform(xml)
df = pd.read_csv(StringIO(str(newdom)))
Zamanlamalar (mevcut XML ve XML ile 25 kat alt öğe ile (yani 900 StackOverflow kullanıcı kaydı)
# SHORTER FILE
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 3.87 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 5.5 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 3.86 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 5.68 msec per loop
# LARGER FILE
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 36 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 78.9 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 32.7 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 51.4 msec per loop