Author: František Dařena, frantisek.darena [at] gmail.com
When using the application, please, cite as: Dařena, F., Troussov, A., Žižka, J. Simulating Activation Propagation in Social Networks Using the Graph Theory. Acta Universitatis agriculturae et silviculturae Mendelianae Brunensis, 2010, LVIII, 3: 21-28. ISSN 1211-8516.
Kosmos is a software framework for spreading activation methods that can be used for mining of multidimensional socio-semantic networks by application of traditional and novel graph-based methods. These methods iteratively propagate the activation from the initially activated set of nodes to the other nodes in a network through outward links. The level of the nodes activation could be used for discovering user similarity or resource relevancy, measuring node centrality characteristics or network clustering.
A multidimensional network (network containing nodes and links of multiple types) can be modeled as a directed graph (digraph), G, which is an object described by a set V of vertices v_{i} (called also nodes), and a set of edges, E (called also as links). Here, the edges are directed and called arcs, creating a set A of arcs a_{j}. Arcs are ordered pairs (coordinates) that define beginnings and endings of each arc. The graph G is a simple graph (with no loops connecting a vertex directly to itself, and no multiple edges): G = (V, A).
The graphs have the following basic functions available:
Before applying graph-based methods for networks, the model of the real world data must be refined to make it suitable for particular type of graph-mining technique. Before proceeding to graph-mining by application of a generic algorithm that is not tailored to the domain, the real-world network is pre-processed taking into account what type of graph-mining will be used.
In order to enable using computer-based tools for domain specialists, that are not necessarily familiar with graph or other theory terminology, the terminology that is common in social networks, recommender systems and similar domains can be used (node vs. vertex, link vs. edge, arc, node and link type vs. color, node importance vs. vertex weight etc.).
All users and items of a network are represented by nodes. A node is characterized by its identifier that enables to identify a node uniquely. It is also used for definition of initial and terminal nodes for links. The identifier (or alternatively a label) might be also useful for network visualization and in explanatory modules. The type of a node enables to distinguish the nodes and classify them into groups (e.g. in tag-aware recommender systems we distinguish actors, resources, tags and instances (assignments) of tagging) which enables e.g. filtering. Node importance models the weight of nodes in a graph by assigning numbers from subrange <0, 1>. This number is a result of conversion of physical units to abstract semantic units.
A link represents a relation between two nodes. The link weight quantifies this relation and is used e.g. as a base for calculating the signal decay described by spreading activation algorithm. Value less than 1 means decay of the signal and value greater than 1 amplifies the signal. Following link property, the reciprocity, defines whether there exists a reciprocal link that doesn't have to be defined explicitly. The reciprocal link might be created automatically (e.g. when a resource is tagged by an actor, the actor also tags the resource). On the other hand, the possibility not to create the reciprocal link is also kept to have freedom in modeling situations when automatic reciprocity is not undisputable (e.g. a person A, a participant of a social network, indicates that another person - person B - was his teacher, but this person B doesn't indicate A as her pupil).
Spreading Activation is a processing technique on network data structure (Crestani, 1997). It has its roots in psychology (Crestani, Lee; 2000) and is widely used in fields, such as numerical simulation of physics phenomena, epidemic models, information retrieval, and recommender systems. Spreading activation algorithms iteratively propagate the activation from the initial set of nodes referred to as the seed, to the other nodes in a network through outward links. Usually, this propagation is performed until the behavior of the system stabilizes near the so called limit distribution of the algorithm or is stopped by constraints, e.g. the limitation on the total number of iterations (Troussov et al., 2009).
The generic algorithm as described by Troussov, Parra, Brusilovsky (2009) is defined as follows:
The Re-computation step in the Iteration phase consists of following actions:
At this moment, two kinds of normalization (calibration) are considered. In both of them, the sum of values F(v) for a set of nodes N defined by some conditions remains constant after each pulse (∑ F(v_{i}) = const., vi ∈ N). In the first case, Conservation of Initial Activation, the set N contains nodes that were initially activated. In the second case, Conservation of total activation, set N is equal to the set of all activated nodes. The sum of nodes from set N equals to total initial activation (sum of activations of nodes at the moment of initial activation).
Representing the activation by vectors (or, more generally, by ordered sequences of non-homogeneous values where components might belong to different universes -- numbers, binary and Boolean data types, etc.), allows to store at each node the information about various dimensionalities of the activation as well as the information about dynamics of the process of propagation. We introduce two mechanisms to exploit vector-valued activation to modify the behaviour of the propagation.
The idea behind the first mechanism is to provide the nodes with a kind of "inertia" in terms of changing their activation values according to the progress of previous iterations. In addition, this "inertia" is used to speed up the convergence of the algorithm (i.e. achieving the limit distribution). For instance, if on each iteration the activation at particular node decreases, this mechanism makes this decrease faster.
From the technical point of view, the primary goal of the second mechanism -- process dependent constraint on the number of iterations -- is to speed up the convergence of the algorithm without limiting the spread only in the vicinity of the initial seed. Traditional spreading activation algorithms frequently employ constraint on the total number of iterations, so that the process of redistribution of the activation stops independently on the topology of the network and the distribution of the activation achieved (Dařena, Troussov, Žižka, 2010). The new mechanism we propose limits the number of input/output operations for nodes; some nodes (especially those located near the initial seeds), might be removed from the process of redistribution of the activation, while some other nodes (located further from the seed) might continue to participate in the redistribution of the activation. From an application point of view, this mechanism aims to reduce the influence of globally important nodes (hubs) on the activation redistribution on micro level.
Implementation of mechanisms that modify traditional algorithm of propagation of real valued activation, that were mentioned above, requires to store at each node the vector F(h_{1}, h_{2}, ... h_{n}), n_{hist}. Values h_{i} store the history of activation at the node -- h_{1} is the oldest and h_{n} most recent value, n_{hist} represents the number of iterations when the node was active.
Recomputation step that computates the new level of activation based on the values of the Current, Input, and Output values of the activation at the node becomes different for such VSA modification and includes:
Kosmos.pm -- the module provides methods that implement spreading activation algorithms.
Config_file.pm -- the module provides a method read_cfg_file for reading the names of configuration files that is used in Kosmos.pm.
Math::Matrix module is required for calculating the value of quadratic regression.
A sample script that uses Kosmos framework for obtaining the results of SA is below. The script is called with a parameter that is interpreted as the name of the file with location of all configuration files (see below for explanation).
use Kosmos; use Config_file; # reading information about configuration files my $files = read_cfg_file(-filename => $ARGV[0]); Kosmos::set_config_files(-cfg_files => $files); Kosmos::read_config_files; if ($Kosmos::data_ok) { Kosmos::spreading_activation; } else { warn "Incorrect data, can't run spreading activation.\n"; } |
The standard output of the script (Kosmos::spreading_activation method) has following format:
An example of ouput produced by Kosmos based on parameters from the example below:
(['A1', 'I1'], ['A1', 'I2'], ['A2', 'I3'], ['A2', 'I4'], ['I1', 'A1'], ['I1', 'R1'], ['I1', 'T'], ['I2', 'A1'], ['I2', 'R2'], ['I2', 'T'], ['I3', 'A2'], ['I3', 'R1'], ['I3', 'T'], ['I4', 'A2'], ['I4', 'R2'], ['I4', 'T'], ['R1', 'I1'], ['R1', 'I3'], ['R2', 'I2'], ['R2', 'I4'], ['T', 'I1'], ['T', 'I2'], ['T', 'I3'], ['T', 'I4']) iter. A1 A2 I1 I2 I3 I4 R1 R2 T 0 1.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1 0.46918 0.00000 0.26541 0.26541 0.00000 0.00000 0.00000 0.00000 0.00000 2 0.31520 0.00000 0.23422 0.23422 0.00000 0.00000 0.05409 0.05409 0.10818 3 0.23100 0.00000 0.21137 0.21137 0.03210 0.03210 0.07052 0.07052 0.14103 4 0.18363 0.01277 0.18884 0.18884 0.05532 0.05532 0.07882 0.07882 0.15764 5 0.15372 0.02742 0.17187 0.17187 0.07306 0.07306 0.08225 0.08225 0.16450 6 0.13394 0.04068 0.15909 0.15909 0.08611 0.08611 0.08374 0.08374 0.16749 7 0.12032 0.05150 0.14959 0.14959 0.09573 0.09573 0.08438 0.08438 0.16877 8 0.11070 0.05992 0.14256 0.14256 0.10282 0.10282 0.08466 0.08466 0.16932 9 0.10378 0.06632 0.13736 0.13736 0.10804 0.10804 0.08478 0.08478 0.16955 10 0.09876 0.07113 0.13352 0.13352 0.11189 0.11189 0.08483 0.08483 0.16965 |
The results can be visualized by processing the outputs e.g. into an image below. The image was prepared by a standalone script that converts the data into a LaTeX file that can be converted into PDF and displayed (each iteration per one page, listing though the pages can be used as a simple "animation").
Currently, the algorithm implemented by Kosmos work with several parameters. They can be listed in the algorithm configuration file. When they are not, Kosmos works with default values. The list of supported parameters is listed in following table:
Parameter | Type | Example | Meaning |
---|---|---|---|
IterationsNo | integer | 1000 | how many iterations will be executet |
IterationStep | integer | 100 | after how many iterations the output is needed |
Iteration | integer | 5 | in which iterations is the output needed; can appear several times |
Calibration | enumerated | None | type of calibration, can be also 'ConservationOfInitialActivation' or 'ConservationOfTotalActivation' |
Beta | real | 0.5 | affecting the strength of outcoming signal |
Mu | real | 1 | multiplicator for node inputs |
Activation_decrease | real | 0 | how the node value is decreased when the signal is sent |
Max_output_iterations | integer | 10 | maximal number of iterations when nodes can send signal |
Max_input_iterations | integer | 10 | maximal number of iterations when nodes can receive signal |
Booster | integer | 1 | whether to use booster/modified SA algorithm working with history and value interpolation |
Regression | enumerated | linear | type of regression used by booster -- 'linear' or 'quadratic' |
History_length | integer | 3 | number of values in node history |
Lambda | real | 2 | affecting their mportance of interpolation for new node value |
Max_iterations_with_history | integer | 1000 | after how many iterations input/output operation stops with booster |
This section describes the structure and content of Kosmos configuration files. The examples are prepared according a simple synthetic network from following figure.
Elements A represent the actors, elements T tags, R are resources and I instances (assignments) of tagging (e.g. I1 means that actor A1 tags the resource R1 by tag T).
The file contains information about the location of all network and algorithm data and configuration files.
FILENAME.types | # types of nodes and links |
FILENAME.data | # network data - nodes and incidence matrix |
FILENAME.init | # initial activation if nodes |
FILENAME.cfg | # task dependent parameters affecting how graph-mining algorithms should work with nodes and links of various types |
FILENAME.alg | # parameters of spreading activation algorithm |
The file contains information about types of nodes and links in the network.
#NODE TYPES | ||
#keyword | type | |
nt | Person | |
nt | Instance | |
nt | Resource | |
nt | Tag | |
#LINK_TYPES | ||
#keyword | type name | type of reciprocal link |
ltra | A2I | I2A |
ltra | I2RT | RT2I |
The file contains definitions of nodes and links of the network together with their types.
#NODES | |||
#keyword | node ID | node type | importance - not implemented in current version of framework |
n | A1 | Person | 1.0 |
n | A2 | Person | 1.0 |
n | I1 | Instance | 1.0 |
n | I2 | Instance | 1.0 |
n | I3 | Instance | 1.0 |
n | I4 | Instance | 1.0 |
n | R1 | Resource | 1.0 |
n | R2 | Resource | 1.0 |
n | T | Tag | 1.0 |
#LINKS | |||
#keyword | initial node | terminal node | link type |
l | A1 | I1 | A2I |
l | A1 | I2 | A2I |
l | A2 | I3 | A2I |
l | A2 | I4 | A2I |
l | I1 | R1 | I2RT |
l | I1 | T | I2RT |
l | I2 | R2 | I2RT |
l | I2 | T | I2RT |
l | I3 | R1 | I2RT |
l | I3 | T | I2RT |
l | I4 | R2 | I2RT |
l | I4 | T | I2RT |
The file contains the list of initially activated nodes.
#INITIAL ACTIVATION | ||
#keyword | node ID | activation level |
ia | A1 | 1.0 |
The file contains task dependent parameters affecting how graph-mining algorithms should work with nodes and links of various types.
#LINK WEIGHTS | ||
#keyword | link_type | weight |
lw | A2I | 0.8 |
lw | I2A | 0.8 |
lw | I2RT | 0.8 |
lw | RT2I | 0.8 |
The file contains parameters of the algorithm.
#SAM ALGORITHM PARAMETERS | |
Beta | 0.5 |
IterationsNo | 10 |
IterationStep | 1 |
Calibration | ConservationOfTotalActivation |