

{"id":5409,"date":"2018-10-23T13:33:08","date_gmt":"2018-10-23T17:33:08","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=5409"},"modified":"2023-01-24T12:58:54","modified_gmt":"2023-01-24T16:58:54","slug":"create-a-network-from-xml-and-visualize-it-a-soccer-player-passing-network","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2018\/10\/23\/create-a-network-from-xml-and-visualize-it-a-soccer-player-passing-network\/","title":{"rendered":"A Network Analysis of Soccer Passes: Parsing XML with Python to Visualize Sports in Gephi"},"content":{"rendered":"<p>By Luling Huang<\/p>\n<p><!--more-->A large portion of a soccer match is just players passing without any goals. What passing patterns emerge from the apparent randomness of the game? Which players are the hubs of passing in a team? With the right data-set and a few easy-to-use <em>Python<\/em> scripts, we can build a player passing network to find out.<\/p>\n<p>In this post, I will demonstrate how to use the Python &#8216;<a href=\"https:\/\/lxml.de\/\">lxml<\/a>&#8217; package (with <a href=\"https:\/\/www.w3schools.com\/xml\/xml_xpath.asp\">XPath<\/a>) to parse <a href=\"https:\/\/www.w3schools.com\/xml\/xml_whatis.asp\">XML<\/a>. I will then show how to use the &#8216;<a href=\"https:\/\/networkx.github.io\/documentation\/stable\/\">networkx<\/a>&#8217; package to build and export network graphs in <a href=\"https:\/\/github.com\/gephi\/gexf\">GEXF<\/a> for <a href=\"https:\/\/gephi.org\/\">Gephi<\/a>. The demonstration will use a great data-set with detailed information about a single football\/soccer match, made publicly available by the Manchester City Analytics project with Opta Sports. 
From it, we will build a player passing network.<\/p>\n<p>You will find this post helpful if your network analysis project&#8217;s raw data input is in <em>XML<\/em> (a common document format for storing large amounts of data), and if your preferred network visualization tool is <a href=\"https:\/\/gephi.org\/\">Gephi<\/a> (free and powerful software), or if you don&#8217;t mind some casual reading on the beautiful game. (For more serious reading on the relationship between player passing networks and team performance, see <a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S0378873312000500\">Grund&#8217;s (2012) study<\/a>.)<\/p>\n<hr \/>\n<p><strong>Beautiful Soup vs lxml?<\/strong><\/p>\n<p>I have used <a href=\"https:\/\/beautiful-soup-4.readthedocs.io\/en\/latest\/\">Beautiful Soup<\/a> multiple times for web scraping. However, <em>lxml<\/em> with <em>XPath<\/em> is much faster than <em>Beautiful Soup<\/em> at parsing an <em>XML<\/em> document. <a href=\"http:\/\/edmundmartin.com\/beautiful-soup-vs-lxml-speed\/\">Martin (2017)<\/a> showed some results from testing the speeds of the two. The takeaway here is that if your <em>XML<\/em> file is small, it doesn&#8217;t matter which one you use. But if you have a large file (e.g., 50 MB), you will notice that <em>Beautiful Soup<\/em> is significantly slower than <em>lxml<\/em>.<\/p>\n<hr \/>\n<p><strong>What is this thing called GEXF?<\/strong><\/p>\n<p>When I used to think about network data structures, an <a href=\"http:\/\/faculty.ucr.edu\/~hanneman\/nettext\/C5_%20Matrices.html#adjacency\">adjacency matrix<\/a> or an edge list always came to mind first. But let&#8217;s say I have some new information on whether a node (say a person) in my network is a bowler, a surfer, or a nihilist. Suppose I then want to present a graph with bowlers only. 
How do I add some personal attribute information to my network data?<\/p>\n<p>Well, we can just prepare multiple data files (say two <em>CSV<\/em> files: one for the network and one for node attributes) and link them together. But wouldn&#8217;t a single file that stores all the relevant information be a better idea? <em>GEXF<\/em> (Graph Exchange XML Format) can do just that. <em>GEXF<\/em>&#8217;s hierarchical data structure makes it possible to add multiple node and edge attributes on top of a network structure.<\/p>\n<hr \/>\n<p><strong>In Python<\/strong>:<\/p>\n<p>In this section, I will share my <em>Python<\/em> code that parses a data-set in <em>XML<\/em>, builds a player passing network, and exports the network to <em>GEXF<\/em>. I will use the one-match data available from the Manchester City Analytics project with Opta Sports. A minimal level of programming knowledge is needed to follow this section, but there is nothing too fancy here.<\/p>\n<p>Some prerequisites:<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #f0f0f0;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\"><span style=\"color: #007020;font-weight: bold\">import<\/span> <span style=\"color: #0e84b5;font-weight: bold\">networkx<\/span> <span style=\"color: #007020;font-weight: bold\">as<\/span> <span style=\"color: #0e84b5;font-weight: bold\">nx<\/span>\n<span style=\"color: #007020;font-weight: bold\">from<\/span> <span style=\"color: #0e84b5;font-weight: bold\">lxml<\/span> <span style=\"color: #007020;font-weight: bold\">import<\/span> etree\n<span style=\"color: #007020;font-weight: bold\">from<\/span> <span style=\"color: #0e84b5;font-weight: bold\">itertools<\/span> <span style=\"color: #007020;font-weight: bold\">import<\/span> groupby\n<span style=\"color: #007020;font-weight: bold\">from<\/span> <span style=\"color: #0e84b5;font-weight: 
bold\">operator<\/span> <span style=\"color: #007020;font-weight: bold\">import<\/span> itemgetter\n<\/pre>\n<\/div>\n<p>The F24 data (see below) contains 1,673 events across both teams in a single match. An event can be a pass, a tackle, a shot, a substitution, and so on. The F7 data contains player names, which will be matched to player IDs in F24.<\/p>\n<p>Import the data and parse the <em>XML<\/em>:<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #f0f0f0;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\">data <span style=\"color: #666666\">=<\/span> etree<span style=\"color: #666666\">.<\/span>parse(<span style=\"color: #4070a0\">'Bolton_ManCityF24.xml'<\/span>)\nall_events <span style=\"color: #666666\">=<\/span> data<span style=\"color: #666666\">.<\/span>findall(<span style=\"color: #4070a0\">'.\/\/Event'<\/span>)\nmeta <span style=\"color: #666666\">=<\/span> etree<span style=\"color: #666666\">.<\/span>parse(<span style=\"color: #4070a0\">'Bolton_ManCityF7.xml'<\/span>)\n<\/pre>\n<\/div>\n<p>Next, we look for all the events that represent Man City&#8217;s successful passes (427 in total). One complication in the data: although the passer of each successful pass is identified, the receiver is not recorded. Here is how I dealt with this problem (steps <em>a<\/em>, <em>b<\/em>, and <em>c<\/em> below):<\/p>\n<p><em>a<\/em>. 
First, group City&#8217;s consecutive good passes:<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #f0f0f0;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\"><span style=\"color: #60a0b0;font-style: italic\"># Positions (in all_events) of Man City's successful passes<\/span>\nposl <span style=\"color: #666666\">=<\/span> [i <span style=\"color: #007020;font-weight: bold\">for<\/span> i, e <span style=\"color: #007020;font-weight: bold\">in<\/span> <span style=\"color: #007020\">enumerate<\/span>(all_events) <span style=\"color: #007020;font-weight: bold\">if<\/span> e<span style=\"color: #666666\">.<\/span>xpath(<span style=\"color: #4070a0\">'@team_id=\"43\" and @type_id=\"1\" and @outcome=\"1\"'<\/span>)]\n<span style=\"color: #60a0b0;font-style: italic\"># Consecutive positions share the same (rank - position) difference, so groupby splits posl into runs<\/span>\ngroups <span style=\"color: #666666\">=<\/span> []\n<span style=\"color: #007020;font-weight: bold\">for<\/span> k, g <span style=\"color: #007020;font-weight: bold\">in<\/span> groupby(<span style=\"color: #007020\">enumerate<\/span>(posl), <span style=\"color: #007020;font-weight: bold\">lambda<\/span> ix: ix[<span style=\"color: #40a070\">0<\/span>]<span style=\"color: #666666\">-<\/span>ix[<span style=\"color: #40a070\">1<\/span>]):\n    groups<span style=\"color: #666666\">.<\/span>append([i[<span style=\"color: #40a070\">1<\/span>] <span style=\"color: #007020;font-weight: bold\">for<\/span> i <span style=\"color: #007020;font-weight: bold\">in<\/span> g])\n<\/pre>\n<\/div>\n<p><em>b<\/em>. Within each sequence of good passes, the sender of each pass is treated as the receiver of the previous pass.<\/p>\n<p><em>c<\/em>. For the last good pass in a sequence (<em>S<\/em>_<em>n<\/em>), find City&#8217;s next event. 
The player who performs this event (a City player) is assumed to be the receiver of the last good pass in <em>S<\/em>_<em>n<\/em>.<\/p>\n<p>Now, we are ready to build a directed weighted network.<\/p>\n<p>First, define two helper functions to keep the script short:<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #f0f0f0;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\"><span style=\"color: #007020;font-weight: bold\">def<\/span> <span style=\"color: #06287e\">get_player_name<\/span>(player_id):\n    p_id <span style=\"color: #666666\">=<\/span> <span style=\"color: #4070a0\">'p'<\/span> <span style=\"color: #666666\">+<\/span> player_id\n    player_name <span style=\"color: #666666\">=<\/span> meta<span style=\"color: #666666\">.<\/span>xpath(<span style=\"color: #4070a0\">'\/\/Player[@uID={!r}]\/PersonName\/Last\/text()'<\/span><span style=\"color: #666666\">.<\/span>format(p_id))[<span style=\"color: #40a070\">0<\/span>]\n    <span style=\"color: #007020;font-weight: bold\">return<\/span> player_name<\/pre>\n<\/div>\n<div style=\"background: #f0f0f0;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\"><span style=\"color: #007020;font-weight: bold\">def<\/span> <span style=\"color: #06287e\">add_new_actorpair<\/span>(graph,sd_id,rcv_id,sd_name,rcv_name):\n    graph<span style=\"color: #666666\">.<\/span>add_edge(sd_id,rcv_id)\n    graph<span style=\"color: #666666\">.<\/span>nodes[sd_id][<span style=\"color: #4070a0\">'label'<\/span>] <span style=\"color: #666666\">=<\/span> sd_name\n    graph<span style=\"color: #666666\">.<\/span>nodes[rcv_id][<span style=\"color: #4070a0\">'label'<\/span>] <span style=\"color: #666666\">=<\/span> rcv_name\n    graph[sd_id][rcv_id][<span style=\"color: #4070a0\">'weight'<\/span>] <span style=\"color: #666666\">=<\/span> <span 
style=\"color: #40a070\">1<\/span>\n    <span style=\"color: #007020;font-weight: bold\">return<\/span> graph\n<\/pre>\n<\/div>\n<p>Grand finale (finishing <em>b<\/em>. and <em>c<\/em>. above):<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #f0f0f0;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 125%\">G <span style=\"color: #666666\">=<\/span> nx<span style=\"color: #666666\">.<\/span>DiGraph()\n<span style=\"color: #007020;font-weight: bold\">for<\/span> g <span style=\"color: #007020;font-weight: bold\">in<\/span> groups:\n    <span style=\"color: #007020;font-weight: bold\">for<\/span> fst, snd <span style=\"color: #007020;font-weight: bold\">in<\/span> <span style=\"color: #007020\">zip<\/span>(g,g[<span style=\"color: #40a070\">1<\/span>:]):\n        sd_id <span style=\"color: #666666\">=<\/span> all_events[fst]<span style=\"color: #666666\">.<\/span>xpath(<span style=\"color: #4070a0\">'@player_id'<\/span>)[<span style=\"color: #40a070\">0<\/span>]\n        rcv_id <span style=\"color: #666666\">=<\/span> all_events[snd]<span style=\"color: #666666\">.<\/span>xpath(<span style=\"color: #4070a0\">'@player_id'<\/span>)[<span style=\"color: #40a070\">0<\/span>]\n        sd_name <span style=\"color: #666666\">=<\/span> get_player_name(sd_id)\n        rcv_name <span style=\"color: #666666\">=<\/span> get_player_name(rcv_id)\n        <span style=\"color: #007020;font-weight: bold\">if<\/span> G<span style=\"color: #666666\">.<\/span>has_edge(sd_id,rcv_id) <span style=\"color: #007020;font-weight: bold\">is<\/span> <span style=\"color: #007020\">True<\/span>:\n            G[sd_id][rcv_id][<span style=\"color: #4070a0\">'weight'<\/span>] <span style=\"color: #666666\">+=<\/span> <span style=\"color: #40a070\">1<\/span>\n        <span style=\"color: #007020;font-weight: bold\">else<\/span>:\n            
add_new_actorpair(G,sd_id,rcv_id,sd_name,rcv_name)\n    last_pass_idx <span style=\"color: #666666\">=<\/span> g[<span style=\"color: #666666\">-<\/span><span style=\"color: #40a070\">1<\/span>]\n    end_pass_sdid <span style=\"color: #666666\">=<\/span> all_events[g[<span style=\"color: #666666\">-<\/span><span style=\"color: #40a070\">1<\/span>]]<span style=\"color: #666666\">.<\/span>xpath(<span style=\"color: #4070a0\">'@player_id'<\/span>)[<span style=\"color: #40a070\">0<\/span>]\n    end_pass_rcvid <span style=\"color: #666666\">=<\/span> all_events[g[<span style=\"color: #666666\">-<\/span><span style=\"color: #40a070\">1<\/span>]]<span style=\"color: #666666\">.<\/span>xpath(<span style=\"color: #4070a0\">'following-sibling::Event[@team_id=\"43\"][1]'<\/span>)[<span style=\"color: #40a070\">0<\/span>]<span style=\"color: #666666\">.<\/span>xpath(<span style=\"color: #4070a0\">'@player_id'<\/span>)[<span style=\"color: #40a070\">0<\/span>]\n    end_sd_name <span style=\"color: #666666\">=<\/span> get_player_name(end_pass_sdid)\n    end_rcv_name <span style=\"color: #666666\">=<\/span> get_player_name(end_pass_rcvid)\n    <span style=\"color: #007020;font-weight: bold\">if<\/span> G<span style=\"color: #666666\">.<\/span>has_edge(end_pass_sdid,end_pass_rcvid) <span style=\"color: #007020;font-weight: bold\">is<\/span> <span style=\"color: #007020\">True<\/span>:\n        G[end_pass_sdid][end_pass_rcvid][<span style=\"color: #4070a0\">'weight'<\/span>] <span style=\"color: #666666\">+=<\/span> <span style=\"color: #40a070\">1<\/span>\n    <span style=\"color: #007020;font-weight: bold\">else<\/span>:\n        add_new_actorpair(G,end_pass_sdid,end_pass_rcvid,end_sd_name,end_rcv_name)\n<\/pre>\n<\/div>\n<p>Write to <em>GEXF<\/em>:<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #f0f0f0;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 
125%\">nx<span style=\"color: #666666\">.<\/span>write_gexf(G,<span style=\"color: #4070a0\">'whatever_name.gexf'<\/span>,encoding<span style=\"color: #666666\">=<\/span><span style=\"color: #4070a0\">'utf-8'<\/span>,version<span style=\"color: #666666\">=<\/span><span style=\"color: #4070a0\">'1.2draft'<\/span>)\n<\/pre>\n<\/div>\n<p>This one-liner is one of the reasons to use the <em>networkx<\/em> package to create a <em>GEXF<\/em> file. Convenient! But we also lose some control over the exact <em>GEXF<\/em> output. Note that you can always write a <em>GEXF<\/em> file from scratch, either (1) by using <em>lxml<\/em> to construct the <em>XML<\/em> elements or (2) by writing a plain text file without <em>lxml<\/em>.<\/p>\n<hr \/>\n<p><strong>What is the pattern?<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5416\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2018\/10\/passing_network.png\" alt=\"\" width=\"1024\" height=\"768\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2018\/10\/passing_network.png 1024w, https:\/\/sites.temple.edu\/tudsc\/files\/2018\/10\/passing_network-300x225.png 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2018\/10\/passing_network-768x576.png 768w, https:\/\/sites.temple.edu\/tudsc\/files\/2018\/10\/passing_network-850x638.png 850w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>In <em>Gephi<\/em>, the <em>GEXF<\/em> file can be visualized as the graph above. Node size is ranked by <a href=\"http:\/\/faculty.ucr.edu\/~hanneman\/nettext\/C10_Centrality.html#Degree\">degree<\/a>; node color reflects <a href=\"http:\/\/faculty.ucr.edu\/~hanneman\/nettext\/C10_Centrality.html#Betweenness\">betweenness centrality<\/a>, with darker blue indicating higher betweenness; edge width is ranked by weight (i.e., the number of successful passes between two players). Tevez is the only substitute included. 
Players are positioned according to the formation and position information in the F24 data.<\/p>\n<p>The graph shows that Man City&#8217;s attacking focus was predominantly on its left flank. <em>Betweenness centrality<\/em> shows that the left back, Kolarov, was the most important link-up player, connecting the otherwise disconnected defensive and attacking lines.<\/p>\n<p>Interestingly, Tevez, who substituted for the starting center forward, Dzeko, and played only 25 minutes, had the third-highest betweenness in the squad. When Tevez came on, City was leading 3-2, but Bolton had scored just a few minutes earlier. One interpretation is that Tevez&#8217;s task as a substitute striker was to drop deeper than Dzeko had and to do more to connect the defenders\/midfielders with the other attacking players.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Luling Huang<\/p>\n","protected":false},"author":3163,"featured_media":5416,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[303,286,2],"tags":[64,270,360,357,182,359,71,6,361,358],"class_list":["post-5409","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-science","category-cultural-studies","category-grad-students","tag-coding","tag-gephi","tag-gexf","tag-lxml","tag-network-analysis","tag-networkx","tag-python","tag-top-news","tag-visualization","tag-xml"],"_links":{"self":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/5409","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc
\/wp-json\/wp\/v2\/users\/3163"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/comments?post=5409"}],"version-history":[{"count":0,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/5409\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media\/5416"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=5409"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=5409"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=5409"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}