Kosmos -- a software framework for spreading activation methods

Author: František Dařena, frantisek.darena [at] gmail.com

When using the application, please, cite as: Dařena, F., Troussov, A., Žižka, J. Simulating Activation Propagation in Social Networks Using the Graph Theory. Acta Universitatis agriculturae et silviculturae Mendelianae Brunensis, 2010, LVIII, 3: 21-28. ISSN 1211-8516.

Kosmos is a software framework for spreading activation methods that can be used for mining of multidimensional socio-semantic networks by application of traditional and novel graph-based methods. These methods iteratively propagate the activation from the initially activated set of nodes to the other nodes in a network through outward links. The level of the nodes activation could be used for discovering user similarity or resource relevancy, measuring node centrality characteristics or network clustering.

The algorithm

A multidimensional network (network containing nodes and links of multiple types) can be modeled as a directed graph (digraph), G, which is an object described by a set V of vertices vi (called also nodes), and a set of edges, E (called also as links). Here, the edges are directed and called arcs, creating a set A of arcs aj. Arcs are ordered pairs (coordinates) that define beginnings and endings of each arc. The graph G is a simple graph (with no loops connecting a vertex directly to itself, and no multiple edges): G = (V, A).

The graphs have the following basic functions available:

Before applying graph-based methods for networks, the model of the real world data must be refined to make it suitable for particular type of graph-mining technique. Before proceeding to graph-mining by application of a generic algorithm that is not tailored to the domain, the real-world network is pre-processed taking into account what type of graph-mining will be used.

In order to enable using computer-based tools for domain specialists, that are not necessarily familiar with graph or other theory terminology, the terminology that is common in social networks, recommender systems and similar domains can be used (node vs. vertex, link vs. edge, arc, node and link type vs. color, node importance vs. vertex weight etc.).

All users and items of a network are represented by nodes. A node is characterized by its identifier that enables to identify a node uniquely. It is also used for definition of initial and terminal nodes for links. The identifier (or alternatively a label) might be also useful for network visualization and in explanatory modules. The type of a node enables to distinguish the nodes and classify them into groups (e.g. in tag-aware recommender systems we distinguish actors, resources, tags and instances (assignments) of tagging) which enables e.g. filtering. Node importance models the weight of nodes in a graph by assigning numbers from subrange <0, 1>. This number is a result of conversion of physical units to abstract semantic units.

A link represents a relation between two nodes. The link weight quantifies this relation and is used e.g. as a base for calculating the signal decay described by spreading activation algorithm. Value less than 1 means decay of the signal and value greater than 1 amplifies the signal. Following link property, the reciprocity, defines whether there exists a reciprocal link that doesn't have to be defined explicitly. The reciprocal link might be created automatically (e.g. when a resource is tagged by an actor, the actor also tags the resource). On the other hand, the possibility not to create the reciprocal link is also kept to have freedom in modeling situations when automatic reciprocity is not undisputable (e.g. a person A, a participant of a social network, indicates that another person - person B - was his teacher, but this person B doesn't indicate A as her pupil).

Spreading Activation is a processing technique on network data structure (Crestani, 1997). It has its roots in psychology (Crestani, Lee; 2000) and is widely used in fields, such as numerical simulation of physics phenomena, epidemic models, information retrieval, and recommender systems. Spreading activation algorithms iteratively propagate the activation from the initial set of nodes referred to as the seed, to the other nodes in a network through outward links. Usually, this propagation is performed until the behavior of the system stabilizes near the so called limit distribution of the algorithm or is stopped by constraints, e.g. the limitation on the total number of iterations (Troussov et al., 2009).

The generic algorithm as described by Troussov, Parra, Brusilovsky (2009) is defined as follows:

  1. Initialization (input) -- setting the parameters of the algorithm, network, and a starting list of non-zero valued, "activated" set of nodes vi ∈ V with positive values assigned by F(V).
  2. Iteration
    1. Expanding the list of nodes activated by their neighbors.
    2. Re-computing activation values of all nodes in the list.
    3. Normalization -- scaling linearly up or down the numerical values of all activated nodes to satisfy given preconditions of activation conservation.
    4. Purging the list from nodes having the activity value F(vi) < t, where t is a threshold value and one of the algorithm parameters.
    5. Checking if the last iteration step (given by a suitable condition like a maximum number of iterations) was reached; if not, returning to the step a).
  3. Output -- the ranked list of activated nodes with their activity values gained iteratively through F(V).

The Re-computation step in the Iteration phase consists of following actions:

  1. Input/Throughput/Output of the links
  2. Input/Output of node activation
  3. Computation Of New Level Of Activation

At this moment, two kinds of normalization (calibration) are considered. In both of them, the sum of values F(v) for a set of nodes N defined by some conditions remains constant after each pulse (∑ F(vi) = const., vi ∈ N). In the first case, Conservation of Initial Activation, the set N contains nodes that were initially activated. In the second case, Conservation of total activation, set N is equal to the set of all activated nodes. The sum of nodes from set N equals to total initial activation (sum of activations of nodes at the moment of initial activation).

Representing the activation by vectors (or, more generally, by ordered sequences of non-homogeneous values where components might belong to different universes -- numbers, binary and Boolean data types, etc.), allows to store at each node the information about various dimensionalities of the activation as well as the information about dynamics of the process of propagation. We introduce two mechanisms to exploit vector-valued activation to modify the behaviour of the propagation.

The idea behind the first mechanism is to provide the nodes with a kind of "inertia" in terms of changing their activation values according to the progress of previous iterations. In addition, this "inertia" is used to speed up the convergence of the algorithm (i.e. achieving the limit distribution). For instance, if on each iteration the activation at particular node decreases, this mechanism makes this decrease faster.

From the technical point of view, the primary goal of the second mechanism -- process dependent constraint on the number of iterations -- is to speed up the convergence of the algorithm without limiting the spread only in the vicinity of the initial seed. Traditional spreading activation algorithms frequently employ constraint on the total number of iterations, so that the process of redistribution of the activation stops independently on the topology of the network and the distribution of the activation achieved (Dařena, Troussov, Žižka, 2010). The new mechanism we propose limits the number of input/output operations for nodes; some nodes (especially those located near the initial seeds), might be removed from the process of redistribution of the activation, while some other nodes (located further from the seed) might continue to participate in the redistribution of the activation. From an application point of view, this mechanism aims to reduce the influence of globally important nodes (hubs) on the activation redistribution on micro level.

Implementation of mechanisms that modify traditional algorithm of propagation of real valued activation, that were mentioned above, requires to store at each node the vector F(h1, h2, ... hn), nhist. Values hi store the history of activation at the node -- h1 is the oldest and hn most recent value, nhist represents the number of iterations when the node was active.

Recomputation step that computates the new level of activation based on the values of the Current, Input, and Output values of the activation at the node becomes different for such VSA modification and includes:

The application

Kosmos.pm -- the module provides methods that implement spreading activation algorithms.

Config_file.pm -- the module provides a method read_cfg_file for reading the names of configuration files that is used in Kosmos.pm.

Math::Matrix module is required for calculating the value of quadratic regression.

A sample script that uses Kosmos framework for obtaining the results of SA is below. The script is called with a parameter that is interpreted as the name of the file with location of all configuration files (see below for explanation).

use Kosmos;
use Config_file;

# reading information about configuration files
my $files = read_cfg_file(-filename => $ARGV[0]);

Kosmos::set_config_files(-cfg_files => $files);
Kosmos::read_config_files;

if ($Kosmos::data_ok) {
	Kosmos::spreading_activation;
} else {
	warn "Incorrect data, can't run spreading activation.\n";
}

The standard output of the script (Kosmos::spreading_activation method) has following format:

An example of ouput produced by Kosmos based on parameters from the example below:

(['A1', 'I1'], ['A1', 'I2'], ['A2', 'I3'], ['A2', 'I4'], ['I1', 'A1'], ['I1', 'R1'], ['I1', 'T'], 
['I2', 'A1'], ['I2', 'R2'], ['I2', 'T'], ['I3', 'A2'], ['I3', 'R1'], ['I3', 'T'], ['I4', 'A2'], 
['I4', 'R2'], ['I4', 'T'], ['R1', 'I1'], ['R1', 'I3'], ['R2', 'I2'], ['R2', 'I4'], ['T', 'I1'], 
['T', 'I2'], ['T', 'I3'], ['T', 'I4'])
iter.	A1	A2	I1	I2	I3	I4	R1	R2	T
0	1.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
1	0.46918	0.00000	0.26541	0.26541	0.00000	0.00000	0.00000	0.00000	0.00000
2	0.31520	0.00000	0.23422	0.23422	0.00000	0.00000	0.05409	0.05409	0.10818
3	0.23100	0.00000	0.21137	0.21137	0.03210	0.03210	0.07052	0.07052	0.14103
4	0.18363	0.01277	0.18884	0.18884	0.05532	0.05532	0.07882	0.07882	0.15764
5	0.15372	0.02742	0.17187	0.17187	0.07306	0.07306	0.08225	0.08225	0.16450
6	0.13394	0.04068	0.15909	0.15909	0.08611	0.08611	0.08374	0.08374	0.16749
7	0.12032	0.05150	0.14959	0.14959	0.09573	0.09573	0.08438	0.08438	0.16877
8	0.11070	0.05992	0.14256	0.14256	0.10282	0.10282	0.08466	0.08466	0.16932
9	0.10378	0.06632	0.13736	0.13736	0.10804	0.10804	0.08478	0.08478	0.16955
10	0.09876	0.07113	0.13352	0.13352	0.11189	0.11189	0.08483	0.08483	0.16965

The results can be visualized by processing the outputs e.g. into an image below. The image was prepared by a standalone script that converts the data into a LaTeX file that can be converted into PDF and displayed (each iteration per one page, listing though the pages can be used as a simple "animation").

simulation visualization

Currently, the algorithm implemented by Kosmos work with several parameters. They can be listed in the algorithm configuration file. When they are not, Kosmos works with default values. The list of supported parameters is listed in following table:

Parameter Type Example Meaning
IterationsNo integer 1000 how many iterations will be executet
IterationStep integer 100 after how many iterations the output is needed
Iteration integer 5 in which iterations is the output needed; can appear several times
Calibration enumerated None type of calibration, can be also 'ConservationOfInitialActivation' or 'ConservationOfTotalActivation'
Beta real 0.5 affecting the strength of outcoming signal
Mu real 1 multiplicator for node inputs
Activation_decrease real 0 how the node value is decreased when the signal is sent
Max_output_iterations integer 10 maximal number of iterations when nodes can send signal
Max_input_iterations integer 10 maximal number of iterations when nodes can receive signal
Booster integer 1 whether to use booster/modified SA algorithm working with history and value interpolation
Regression enumerated linear type of regression used by booster -- 'linear' or 'quadratic'
History_length integer 3 number of values in node history
Lambda real 2 affecting their mportance of interpolation for new node value
Max_iterations_with_history integer 1000 after how many iterations input/output operation stops with booster

Kosmos configuration files

This section describes the structure and content of Kosmos configuration files. The examples are prepared according a simple synthetic network from following figure.

network topology

Elements A represent the actors, elements T tags, R are resources and I instances (assignments) of tagging (e.g. I1 means that actor A1 tags the resource R1 by tag T).

CFGFILE.cfg

The file contains information about the location of all network and algorithm data and configuration files.

FILENAME.types # types of nodes and links
FILENAME.data # network data - nodes and incidence matrix
FILENAME.init # initial activation if nodes
FILENAME.cfg # task dependent parameters affecting how graph-mining algorithms should work with nodes and links of various types
FILENAME.alg # parameters of spreading activation algorithm

FILENAME.types

The file contains information about types of nodes and links in the network.

#NODE TYPES
#keyword type
nt Person
nt Instance
nt Resource
nt Tag
#LINK_TYPES
#keyword type name type of reciprocal link
ltra A2I I2A
ltra I2RT RT2I

FILENAME.data

The file contains definitions of nodes and links of the network together with their types.

#NODES
#keyword node IDnode type importance - not implemented in current version of framework
n A1 Person 1.0
n A2 Person 1.0
n I1 Instance 1.0
n I2 Instance 1.0
n I3 Instance 1.0
n I4 Instance 1.0
n R1 Resource 1.0
n R2 Resource 1.0
n T Tag 1.0
#LINKS
#keyword initial node terminal node link type
l A1 I1 A2I
l A1 I2 A2I
l A2 I3 A2I
l A2 I4 A2I
l I1 R1 I2RT
l I1 T I2RT
l I2 R2 I2RT
l I2 T I2RT
l I3 R1 I2RT
l I3 T I2RT
l I4 R2 I2RT
l I4 T I2RT

FILENAME.init

The file contains the list of initially activated nodes.

#INITIAL ACTIVATION
#keyword node ID activation level
ia A1 1.0

FILENAME.cfg

The file contains task dependent parameters affecting how graph-mining algorithms should work with nodes and links of various types.

#LINK WEIGHTS
#keyword link_type weight
lw A2I 0.8
lw I2A 0.8
lw I2RT 0.8
lw RT2I 0.8

FILENAME.cfg

The file contains parameters of the algorithm.

#SAM ALGORITHM PARAMETERS
Beta 0.5
IterationsNo 10
IterationStep 1
Calibration ConservationOfTotalActivation

Papers based on Kosmos framework