Machine Learning, Data Analytics and Modeling

DATAM 2018

Satellite session - Conference on Complex Systems 2018
September 26, from 2:30-6:30 pm


Goal

The goal of this satellite is to present studies at the intersection of data analytics and modeling in order to understand the behavior of complex systems across multiple domains, with emphasis on social systems.

Program

Time       Speaker         Title
2:30-3:00  Ciro Cattuto    Patterns in high-resolution human mobility and interaction data
3:00-3:30  Hernan Makse    Influencers in Twitter, genetic networks and ecosystems
3:30-3:45  Marco Pangallo  Econometrics, Machine Learning and Causality: An Application to the Housing Market
3:45-4:00  Fabio Saracco   Voters’ polarisation in electoral campaign: the Italian case study
4:00-4:30  Break
4:30-4:45  Leto Peel       Graph-based semi-supervised learning for complex networks
4:45-5:00  Dave Braines    Classifying Types of Network Communities Using Motifs
5:00-5:15  Kiran Sharma    Data science approach to study hubs and motifs in complex global terrorist network
5:15-5:45  Ying Zhao       Deep Models, Machine Learning and Artificial Intelligence Applications in National and International Security
5:45-6:15  Marta Gonzalez  Cases of study in Computational Urban Science

Abstracts

Time Speaker Title
2:30-3:00 Ciro Cattuto Patterns in high-resolution human mobility and interaction data
Our environments and everyday actions are increasingly enabled and instrumented by digital platforms, providing new opportunities to access finely resolved human behavioural data. In this talk I will focus on the application of machine learning methods to high-resolution data on human mobility and close-range interactions collected by means of wearable sensors. I will discuss case studies motivated by different research questions, and point to directions for future research.
3:00-3:30 Hernan Makse Influencers in Twitter, genetic networks and ecosystems
Identifying essential nodes in complex networks is a central problem for biological and social systems. In this talk I will continue the exposition on the topic of influencers or essential nodes in three paradigmatic cases of complex networks: genetic networks, mutualistic ecosystems and social networks. We will discuss the relevance of these influencers to systems characterized by (i) second order transitions like information spreading in Twitter, (ii) mutualistic ecological systems dominated by abrupt first order tipping points and (iii) genetic networks dominated by positive and negative interactions, and argue that these systems need a new paradigm beyond network theory to be understood.
3:30-3:45 Marco Pangallo Econometrics, Machine Learning and Causality: An Application to the Housing Market
Policy decisions in complex social systems must be justified on the basis of causal evidence. Over the past 50 years, econometrics has developed clever tools to understand causality, but econometric theory is optimized for small-scale datasets with a few variables. With the availability of big data, an emerging literature is combining the econometric methodology with the power and flexibility of machine learning. In this talk I will give a short introduction to this literature, and then provide an example from my own research on housing markets.

The need to understand causality stems from the fact that policy changes alter the statistical properties of the data. For example, suppose that you want to understand the causal effect of institutions (e.g. democracy) on economic well-being (e.g. GDP). Building a deep learning algorithm that predicts well-being based on institutions and other factors is incorrect. Indeed, it is necessary to understand the counterfactual – if a country transitioned from dictatorship to democracy, how would its well-being change? Since institutions influence well-being, and well-being influences institutions, the direction of causality is not clear. To solve this issue, econometrics has put forward the idea of using instruments. In this example, these are variables correlated with institutions but not with well-being. A celebrated result in the applied econometrics literature uses long-term settlement of a population in the past as an instrument: it is likely to influence current institutions due to their persistence, but not current well-being, except through institutions. The problem with these methods is that current datasets that describe complex social systems have many variables and many potential instruments. Machine learning algorithms can select the covariates in a flexible and automated way, just focusing on predictive performance. There exist several other areas where econometrics and machine learning are combined, exploiting the strengths of the two approaches – causal understanding for econometrics, prediction for machine learning.
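The instrument idea described above can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the variable names, the simulated data, and the use of a hidden common factor (rather than two-way causation) to create the bias. It is not the speaker's analysis. With a single instrument z, the estimate is the ratio of the covariance of z with the outcome to the covariance of z with the explanatory variable.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)


def ols_slope(x, y):
    """Naive regression slope of y on x; biased when x is confounded."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """Instrumental-variable (Wald) estimate using a single instrument z."""
    return covariance(z, y) / covariance(z, x)


def simulate(n=20000, true_effect=0.5, seed=0):
    """Synthetic data (hypothetical): z shifts x, a hidden factor u drives x and y."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)              # instrument: affects y only through x
        ui = rng.gauss(0, 1)              # unobserved factor affecting x and y
        xi = zi + ui + rng.gauss(0, 1)    # e.g. a measure of institutions
        yi = true_effect * xi + ui + rng.gauss(0, 1)  # e.g. well-being
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate()
print("naive OLS slope:", round(ols_slope(x, y), 3))
print("IV slope:       ", round(iv_slope(z, x, y), 3))
```

In this simulated setting the naive slope overstates the effect, while the instrumental-variable estimate stays close to the value of 0.5 used to generate the data.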

Within this framework, I will show how to estimate the price elasticity of demand, i.e. the sensitivity of demand to price changes. The elasticity is a fundamental concept in economics, and it has a lot of policy relevance. For example, consider a program of housing subsidies that aims to foster social mixing: the policy maker wants to know how poor households would react to an effectively lower house price (net of the amount of the subsidy). It is not sufficient to predict the demand as a function of the price and of the house and neighborhood characteristics; it is instead necessary to understand which difference in demand is caused by a difference in price. We address this question using a dataset from a website of housing sales advertisements (ads). A unique feature of this dataset is that we know the number of clicks on each individual ad, which we show is a good proxy of actual housing demand. We devise a machine learning algorithm to identify duplicate ads, i.e. ads that refer to the same housing unit. The algorithm is based on a classification tree with boosting and on textual analysis of the description of the ads. The basic idea is that (under some caveats) differences in demand between the two ads can only be caused by differences in price. Quantitatively, we find that a 1% higher price causes a 0.66% lower number of clicks (i.e. the elasticity is -0.66).
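As a rough illustration of the duplicate-ad idea, the sketch below estimates an elasticity from pairs of ads for the same unit by regressing the difference in log clicks on the difference in log price. The function names, the regression-through-the-origin choice, and the synthetic data are all assumptions made for this example; the abstract does not specify the estimator actually used. The value -0.66 appears only as the parameter used to generate the synthetic data.

```python
import math
import random


def elasticity_from_pairs(pairs):
    """Estimate elasticity from duplicate-ad pairs.

    pairs: list of ((price_a, clicks_a), (price_b, clicks_b)), where both ads
    refer to the same housing unit. Unit-specific appeal cancels in the
    within-pair difference, leaving price as the only systematic difference.
    """
    num = den = 0.0
    for (price_a, clicks_a), (price_b, clicks_b) in pairs:
        d_log_price = math.log(price_a) - math.log(price_b)
        d_log_clicks = math.log(clicks_a) - math.log(clicks_b)
        num += d_log_price * d_log_clicks
        den += d_log_price ** 2
    return num / den  # least-squares slope through the origin


def synthetic_pairs(n=5000, elasticity=-0.66, seed=1):
    """Hypothetical data: two ads per unit with slightly different prices."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        appeal = math.exp(rng.gauss(12, 1))  # shared by both ads of a unit
        ads = []
        for _ in range(2):
            price = 200000 * math.exp(rng.gauss(0, 0.1))
            noise = math.exp(rng.gauss(0, 0.05))
            clicks = appeal * price ** elasticity * noise
            ads.append((price, clicks))
        pairs.append((ads[0], ads[1]))
    return pairs


estimate = elasticity_from_pairs(synthetic_pairs())
print("estimated elasticity:", round(estimate, 3))
```

Differencing within a pair is what removes the house and neighborhood characteristics, which is why the caveats mentioned in the abstract (that the two ads differ only in price) matter for the interpretation.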
3:45-4:00 Fabio Saracco Voters’ polarisation in electoral campaign: the Italian case study
In this work we analyse the Twitter dataset collected during the last Italian elections on March 4, 2018. Starting from an initial set of elections-related keywords (such as elezioni, elezioni2018, 4marzo, 4marzo2018), the Twitter API returned a sample of the overall set of tweets posted from January 28 until March 11 that contain at least one of the keywords of interest. We first select a subset of verified users (i.e. authentic accounts related to figures of public interest) that are in some way involved in the elections (i.e. politicians, political parties, sources of information such as newspapers, TV channels, etc.). Then we construct the bipartite network of verified and non-verified users in which an edge between two users indicates retweets or mentions from the non-verified to the verified user. The analysis proceeds as follows: we first project the network on the verified accounts layer and validate the projection with the procedure described in (Saracco et al., 2017). The resulting network reveals a clear division into communities, which have been detected with a standard Louvain community detection method.

We are then interested in studying the interactions of non-verified accounts with the three detected groups. Following (Schmidt et al. 2018), for each user we compute a localisation order parameter that simply counts the number of social groups each user interacts with. The left panel of Figure 1 shows the histogram of the final values. There is a clear peak around 1, showing that a strong majority of users retweet contents coming from a single social group of interest. Then, we provide a very simple definition of polarisation. For each user we simply consider the fraction of her total interactions that she dedicates to each of the three communities and then take the maximum among these values. A value of this index close to 1 indicates that the majority of the interactions have been observed with one group only; a uniform distribution of the interactions among the three groups denotes absence of polarisation, while values close to 0 indicate near absence of activity. The right panel of Figure 1 shows the results. Each dot represents a user. The position of a dot in the ternary plot indicates the community each account interacts with most. The colour of the dot denotes instead the associated level of polarisation according to our definition. Indeed, nodes located on the triangle’s vertices mainly interact with the corresponding group only (indicated at the vertex) and thus show on average higher values of this quantity of interest.
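The two per-user quantities described above can be sketched as follows. This is a minimal illustration that takes the text at face value: the group count is the number of communities with at least one interaction, and the polarisation index is the largest share of a user's interactions going to any one community. The function names and the convention of returning 0 for users with no interactions are assumptions.

```python
def groups_contacted(interactions):
    """Number of communities a user has interacted with at least once."""
    return sum(1 for count in interactions if count > 0)


def polarisation(interactions):
    """Largest fraction of a user's interactions directed at one community.

    interactions: counts of retweets/mentions towards each community.
    Returns 0.0 for a user with no interactions (an assumed convention).
    """
    total = sum(interactions)
    if total == 0:
        return 0.0
    return max(interactions) / total


# Hypothetical users, each with counts towards the three communities.
print(polarisation([10, 0, 0]))      # 1.0: all activity goes to one group
print(polarisation([2, 2, 2]))       # one third: evenly spread activity
print(groups_contacted([3, 0, 1]))   # 2
```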
4:00-4:30 Break
4:30-4:45 Leto Peel Graph-based semi-supervised learning for complex networks
In most complex networks, nodes have attributes, or metadata, that describe properties of the nodes. In some cases these attributes are only partially observed for a variety of reasons, e.g. the data is expensive, time-consuming or difficult to collect accurately. In machine learning, classification algorithms are used to predict discrete node attributes (which we refer to as class labels) by learning from a training set of labelled data, i.e. data for which the target attribute values are known. Semi-supervised learning is a classification problem that aims to make use of unlabelled data in addition to the labelled data typically used to train supervised models. A common approach is graph-based semi-supervised learning (GSSL), in which (often independent) data are represented as a similarity graph, such that a vertex is a data instance and an edge indicates similarity between two instances. By utilising the graph structure of labelled and unlabelled data, it is possible to accurately classify the unlabelled vertices using a relatively small set of labelled instances.
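A minimal sketch of the standard graph-based approach described above is iterative label propagation: labelled vertices keep their class, and every unlabelled vertex repeatedly takes the average class scores of its neighbours. The function name, the toy graph and the fixed iteration count are assumptions for illustration; this is the generic baseline that relies on similarity across edges, not the methods proposed in this talk.

```python
def propagate_labels(adjacency, seeds, classes, n_iter=100):
    """Label propagation on an undirected graph.

    adjacency: dict mapping each node to a list of its neighbours.
    seeds: dict mapping labelled nodes to their known class.
    classes: list of all class labels.
    """
    uniform = 1.0 / len(classes)
    scores = {node: {c: uniform for c in classes} for node in adjacency}
    for node, label in seeds.items():
        scores[node] = {c: 1.0 if c == label else 0.0 for c in classes}

    for _ in range(n_iter):
        updated = {}
        for node, neighbours in adjacency.items():
            if node in seeds or not neighbours:
                updated[node] = scores[node]  # clamp seeds, skip isolated nodes
            else:
                updated[node] = {
                    c: sum(scores[m][c] for m in neighbours) / len(neighbours)
                    for c in classes
                }
        scores = updated

    # Assign each node the class with the highest propagated score.
    return {node: max(classes, key=lambda c: scores[node][c]) for node in adjacency}


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
labels = propagate_labels(graph, seeds={0: "a", 5: "b"}, classes=["a", "b"])
print(labels)
```

With one labelled vertex in each triangle, the remaining vertices inherit the label of their own triangle, which is the behaviour one expects when linked vertices tend to share a class.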

Here we consider the semi-supervised learning problem in the context of complex networks. These networks consist of nodes representing entities (e.g. people, user accounts, documents) and links representing pairwise dependencies or relationships (e.g. friendships, contacts, references). Class labels are discrete-valued node attributes (e.g. gender, location, topic) and our task is to predict these labels based only on the network structure and a small subset of nodes already labelled. This problem of classifying nodes in networks is often treated as a GSSL problem because the objective, to predict missing node labels, and the input, a graph, are the same. Sometimes this approach works well due to assortative mixing, or homophily, a feature frequently observed in networks, particularly in social networks. Homophily is the effect that linked nodes share similar properties or attributes and occurs either through a process of selection or influence. However, not all node attributes in complex networks are assortative. For example, in a network of sexual interactions between people it is likely that some attributes will be common across links, e.g. similar demographic information or shared interests, but other attributes will be different, e.g. links between people of different genders. Furthermore, the pattern of similarity or dissimilarity of attributes across links may not be consistent across the whole network, e.g. in some parts of the network links will occur between people of the same gender.

We present two novel methods to deal with this problem by approximating equivalence relations from social network theory that define notions of similarity robust to different patterns of interaction. We use these to implicitly construct similarity graphs upon which we can propagate class label information. We demonstrate on a variety of real and synthetic networks that our methods are capable of classifying nodes under a range of different interaction patterns in which standard methods fail.
4:45-5:00 Dave Braines Classifying Types of Network Communities Using Motifs
One of the most significant tasks when studying network topology is the identification of the network type of a community. A network type is defined as the type of relation or interaction between nodes, such as social ties in social networks or interactions between individuals in communication networks (e.g. email). Fundamentally, communities, which are groups of densely connected nodes in the network, enable us to discover clusters of interacting nodes and the relations between them. Communities from the same network type tend to have similar structures. Identifying their network types further allows us to study interactions between nodes and to infer and predict unobserved network structures. Given a network topological structure, identifying the network type of a community can be viewed as a (sub)graph classification problem. Many existing methods use different graph embedding techniques to represent graphs in vector space and apply machine learning methods for classification. Yet, little work has applied network motifs, which capture the local structure of a network in terms of patterns that occur significantly more or less frequently, to network type identification. To the best of our knowledge, we are the first to use network motifs in classifying the network type of a community. In this paper, we find that there exists a strong relation between network type and both motif distribution and subgraph ratio profile (SRP).

We first propose a generic framework to construct vectors for feature representations of static directed graphs from their topological structure using motif distribution and SRP. We argue that this fixed-length feature representation can be used to classify and compare communities of varying sizes with high accuracy. We use three different null models, namely (i) NM-1: random graphs with the same number of nodes and edges; (ii) NM-2: random graphs with the same number of dyads; and (iii) NM-3: random graphs with the same in/out degree-pair sequence, to compute SRPs for all 16 triads in an empirical (sub)graph. For each null model, a 16-element vector containing the SRPs of the corresponding 16 triads is computed to represent graph features. In order to evaluate our proposed graph feature representations, we collect 5108 communities from 955 real-world networks in 15 network types, including Google Plus and Twitter in social networks, high energy physics theory citation networks, Gnutella P2P networks and so on. We apply various well-known machine learning models along with our graph feature representation to classify the network type of a community. Our results achieve 90.73 ± 2.88% accuracy compared to 76.89 ± 3.00% from struct2vec, a state-of-the-art method that constructs graph representation using node attributes. In addition, we find that motif SRPs computed from NM-1 achieve the best performance compared to the other two null models in classifying the network type of a community.
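To make the feature construction concrete, the sketch below turns observed subgraph counts and their mean counts under a null model into a normalised profile. It assumes a commonly used definition of the subgraph ratio profile (a bounded relative difference per subgraph type, with a small constant epsilon in the denominator, followed by normalisation to unit length); the abstract does not state the exact formula, so the formula, the default epsilon and the function name are assumptions.

```python
import math


def subgraph_ratio_profile(observed, null_mean, epsilon=4.0):
    """Normalised profile of over- and under-represented subgraph types.

    observed: counts of each subgraph type (e.g. the 16 triads) in the graph.
    null_mean: mean counts of the same types over random graphs from a null model.
    epsilon: keeps the ratio stable when counts are small (assumed default).
    """
    deltas = [
        (obs - rand) / (obs + rand + epsilon)
        for obs, rand in zip(observed, null_mean)
    ]
    norm = math.sqrt(sum(d * d for d in deltas))
    if norm == 0:
        return [0.0] * len(deltas)
    return [d / norm for d in deltas]  # fixed-length, unit-norm feature vector


# Hypothetical counts for three subgraph types (a real profile would have 16).
profile = subgraph_ratio_profile(observed=[10, 0, 5], null_mean=[5, 5, 5])
print([round(value, 3) for value in profile])
```

Because the result has the same length and scale regardless of graph size, vectors from communities of very different sizes can be fed to the same classifier.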
5:00-5:15 Kiran Sharma Data science approach to study hubs and motifs in complex global terrorist network
Terrorism instills fear in the minds of people and takes away the freedom of individuals to act as they will. Terrorism has turned out to be an international menace in the global community; every nation is affected, directly or indirectly. Here, we studied the terrorist attack incidents that occurred across the globe in the last half century, using the open-source Global Terrorism Database, and developed a view on their spatio-temporal dynamics. We constructed a complex network of global terrorism and studied its growth dynamics, along with the statistical properties of the network. We studied the resilience of the network against targeted attacks and random failures, which could guide counter-terrorism outfits in designing strategies to fight terrorism. We then used a disparity filter method to isolate the backbone of the giant component, and identified the terror hubs and vulnerable motifs of global terrorism. We also examined the evolution of the hubs and motifs in a few exemplary cases like Afghanistan, Colombia, Israel, Peru and the United Kingdom. Normally, each nation pursues its own vision of international security based upon its mandate and particular notions of politics, and its policies to counter the threat of terrorism could naturally include the use of tactical measures and strategic negotiations, or even physical power. The dynamics of the terror hubs and the vulnerable motifs that we discovered in the network backbone could provide deep insight into their formation and spreading, and thereby help in combating terrorism or making public policies that may check their spread.
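A disparity filter can be sketched in a few lines. The version below follows the commonly cited formulation for undirected weighted networks: for a node with k edges, an edge carrying a fraction p of that node's total weight gets the significance score (1 - p)^(k - 1), and the edge is kept if the score falls below a threshold for at least one of its endpoints. The function name, the threshold and the toy network are assumptions; the abstract does not give the parameters used in the study.

```python
from collections import defaultdict


def disparity_backbone(edges, alpha=0.05):
    """Keep edges whose weight is significant for at least one endpoint.

    edges: dict mapping (u, v) pairs to positive weights (undirected, no self-loops).
    alpha: significance threshold (assumed value).
    """
    strength = defaultdict(float)
    degree = defaultdict(int)
    for (u, v), weight in edges.items():
        for node in (u, v):
            strength[node] += weight
            degree[node] += 1

    def score(node, weight):
        k = degree[node]
        if k < 2:
            return 1.0  # a single edge cannot be judged against alternatives
        return (1.0 - weight / strength[node]) ** (k - 1)

    return {
        (u, v): weight
        for (u, v), weight in edges.items()
        if min(score(u, weight), score(v, weight)) < alpha
    }


# Toy network (hypothetical): a hub with one dominant edge and four weak ones.
toy = {("h", "a"): 100.0, ("h", "b"): 1.0, ("h", "c"): 1.0,
       ("h", "d"): 1.0, ("h", "e"): 1.0}
print(disparity_backbone(toy))
```

Unlike a global weight cutoff, the test is relative to each node's own weight distribution, so locally important edges of low-strength nodes can survive the filtering.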
5:15-5:45 Ying Zhao Deep Models, Machine Learning and Artificial Intelligence Applications in National and International Security
Recent advancements in artificial intelligence (AI) enable new technologies to assist modern warfighters by automatically analyzing big data at time scales much faster than a human can achieve. Deep learning (DL) is at the core of the new AI revolution, having demonstrated that machines can classify not only more quickly than humans but also more accurately. These technologies have revolutionized many commercial applications but are not designed to solve military problems.

Fundamentally, the field of machine learning seeks to learn the parameters of a function given a data set. DL refers to a subfield of machine learning whose models use a large number of parameters for accurate classifications or predictions (e.g. convolutional neural networks [CNN]). DL was initially demonstrated in the breakthrough results for supervised learning in machine vision applications. Academic and industrial DL, ML and AI research is active in machine vision, speech recognition, chat and autonomous driving applications.

Four of the main challenges in military applications are a lack of adequate samples for classification tasks, short time scales for learning, fewer computational resources, and adversarial behavior. In military applications, deep models are broadly defined as all analytic models that can handle big data or no data at all and perform ML and AI.

What are the potentials, theories, practices, tools and risks of deep models and artificial intelligence for military applications? Topics include, but are not limited to:
Deep data fusion models
Various types of machine learning models (e.g., supervised learning, reinforcement learning, and unsupervised learning)
Deep learning models such as deep machine vision and image processing models
Pattern recognition and anomaly detection algorithms
Advanced optimization algorithms
Network models
Graph models
Game theory models
Link analysis models
Parallel and distributed computing models
Smart data outputs from deep analytics
Visualizations and depictions of smart data outputs
Decision making models
Cognitive models
Using AI and human capabilities fused and optimized together, or is there optimized human-in-the-loop AI?
Advanced optimization algorithms and online learning
Cyber security
Ethical/open AI
Swarm intelligence
Trusted AI

Deep models, ML and AI will become the lifeblood of military applications. With opportunities come risks, but there are more potential opportunities than risks. Can AI be trusted? AI can be weaponized and data can be poisoned. However, opportunities are plentiful if we foster broader communities and collaboration.
5:45-6:15 Marta Gonzalez Cases of study in Computational Urban Science
Computational Urban Science refers to the use of information and communication technology and data in the context of cities and urban environments. First, I present methods to identify patterns of behavior in energy consumption and credit card transactions. Then I show how this is a more complex task when working with environmental data. I conclude with open questions and proposed research to study human-natural systems interactions.

Invited Speakers

Marta Gonzalez, UC Berkeley
Ciro Cattuto, ISI Foundation
Hernan Makse, Levich Institute
Ying Zhao, Naval Postgraduate School

Important Dates

Abstract submission deadline: June 20, 2018
Notification to authors: June 27, 2018

Call for Abstracts

Abstracts should be around 500 words.

https://easychair.org/conferences/?conf=datam2018

Summary

New technologies are enabling social communication and coordination on unprecedented scales. The world is becoming more economically, politically and socially integrated and previously local issues are now becoming global. Counterintuitively, instead of homogenizing society, the excess of connections seems to be increasingly differentiating and fragmenting it. Recent electoral processes have shown that the Internet can increase polarization instead of reducing it and that the way globalization has been implemented is fraught with conflict and economic distress. These seemingly unrelated phenomena have similar causes: the complex dynamics of social systems and the non-linear effects of increasing inter-dependencies among complex systems’ parts.

Policy and decision makers are hardly able to understand or cope with the current world. They have failed to foresee ongoing social changes and have not been able to effectively respond to them. This is partly because of the unpredictable nature of complex systems, but mainly because of the limitations of available models and analytical tools. These are not adequate for understanding the complexity of social systems, especially in a global context.

The opportunities available from big data could solve this problem, but we must analyze the data properly. The right framing of understanding and analytical toolsets could enhance our understanding of social systems and enable the design of better technologies and intervention strategies. Ultimately, we could benefit from the complexity of social systems, rather than being endangered by it.

Venue

The satellite meeting will be hosted by the Conference on Complex Systems 2018. The CCS’18 meeting will take place in Thessaloniki, Greece, during Sept 23-28, 2018. CCS’18 is a major annual international event gathering diverse communities engaged in Complex Systems research, ranging from Life Sciences to Physics, from Computer Science to Social Sciences, and from Networks to Policy Implications.

All participants of the satellite meeting (with or without abstract submission) have to register for CCS'18.

Organizers

Alfredo J. Morales
New England Complex Systems Institute
Massachusetts Institute of Technology

NECSI advances the development of complex systems science and its applications to real world problems, including social policy matters. We study how interactions within a system lead to its behavioral patterns and how the system interacts with its environment. Our researchers study data science, networks, agent-based modeling, multi-scale analysis and complexity.

Alfredo contributes to building a better understanding of social systems by developing computational and analytical methods based on complex systems science and data science. His work is at the intersection of computer science, statistics, applied physics and artificial intelligence. He analyzes large datasets that result from human activity on social media, internet, mobile phones or purchases in order to retrieve unstructured patterns of collective behaviors that explain large scale societal properties, such as social dynamics, urban dynamics, segregation, political engagement, political polarization and social influence.

Rosa M. Benito
Technical University of Madrid

Rosa heads the Complex Systems Group at the Technical University of Madrid (UPM) and is also Professor of Applied Physics at UPM. Prof. Benito's work focuses on understanding and characterizing the structure and dynamics of different complex systems by using Complex Network Theory and Data Science. In particular she has proposed a new formalism to model complex network topology, and she has been working with big data to determine the individual and collective behavior of users in specific online conversations on Twitter, and human mobility patterns through mobile phone data. She has led many research projects and has participated in several Challenges for Development using mobile phone data from African countries (Ivory Coast and Senegal). Her work has been published in many academic publications. She heads the PhD Program in Complex Systems and has supervised several PhD theses. She has received awards from UPM for Excellence in her Academic Career and for Innovation in Education.

 

 
