
Research (May-June)

Graph Neural Network (Concepts)

 

Node Embeddings

Intuition: map nodes to d-dimensional embeddings such that similar nodes in the graph are embedded close together.

Goal: map nodes so that similarity in the embedding space (e.g., dot product) approximates similarity (e.g., proximity) in the network.

 

Goal: define a node similarity function $similarity(u, v)$ in the original network and an encoder $ENC(v)$ that maps each node to an embedding, so that $similarity(u, v) \approx z_u^\top z_v$.

 

Shallow Encoders

  • One-layer of data transformation
  • A single hidden layer maps node $u$ to embedding $z_u$ via a function $f(\cdot)$, e.g., $z_u = f(z_v, v \in N_R(u))$
  • Limitations of shallow embedding methods
    • $O(|V|)$ parameters are needed (see the sketch after this list):
      • No sharing of parameters between nodes
      • Every node has its own unique embedding
    • Inherently “transductive”
      • Cannot generate embeddings for nodes that are not seen during training.
    • Do not incorporate node features
      • Many graphs have features that we can and should leverage
  • Today: Deep Graph Encoders
    • Today: We will now discuss deep methods based on graph neural networks
    • $ENC(v)$  = multiple layers of non-linear transformations of graph structure
    • Note: All of these deep encoders can be combined with the node similarity functions defined in the last lecture
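For intuition, here is a minimal PyTorch sketch (illustrative, not from the slides) of what a shallow encoder amounts to: an embedding lookup with one learned row per node. This is exactly why it needs $O(|V|)$ parameters and cannot produce an embedding for a node that first appears after training.

```python
import torch
import torch.nn as nn

num_nodes, dim = 10_000, 128          # illustrative sizes

# Shallow encoder: ENC(v) is a lookup into a learned table with one row per node,
# so the parameter count grows as O(|V|), and there is no row for a node that
# first appears after training (the method is transductive).
shallow_enc = nn.Embedding(num_nodes, dim)

v = torch.tensor([0, 7, 42])          # node indices seen during training
z_v = shallow_enc(v)                  # shape (3, 128)
num_params = sum(p.numel() for p in shallow_enc.parameters())  # 10_000 * 128
```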

 

 

Networks are far more complex

  • Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
  • No fixed node ordering or reference point
  • Often dynamic and have multimodal features

 

Idea: Convolutional Networks

Goal: 

  • To generalize convolutions beyond simple lattices 
  • Leverage node features/attributes (e.g., text, images)

From Images to Graphs

  • Transform information at the neighbors and combine it
    • Transform “messages”   from neighbors:  $W_ih_i$
    • Add them up:  $\sum_i W_ih_i$

Real-World Graphs

A Naive Approach:

  • Join adjacency matrix and features
  • Feed them into a deep neural net
  • Issues with this idea:
    • O(N) parameters
    • Not applicable to graphs of different sizes
    • Not invariant to node ordering

 

Basics of Deep Learning for Graphs

Local Network Neighborhoods

  • Describe aggregation strategies
  • Define computation graphs

Stacking Multiple Layers

  • Describe the model, parameters, training
  • How to fit the model?
  • Simple example for unsupervised and supervised training

 

Setup

Assume we have a graph G:

  • V is the vertex set
  • A is the adjacency matrix (assume binary)
  • $X \in \mathbb{R}^{m \times |V|}$: matrix of node features
  • Node features
    • Social Networks: User profile, User Image
    • Biological Networks: Gene expression profile, Gene functional information 
    • No Features:
      • Indicator vectors (one-hot encoding of a node)
      • Vector of constant 1: [1, 1, …, 1]

 

Graph Convolutional Networks

  • Idea: Node’s neighborhood defines a computation graph
  • Learn how to propagate information across the graph to compute node features

Idea: Aggregate Neighbors

  • Key idea: Generate node embeddings based on local network neighborhoods 

  • Intuition: Nodes aggregate information from their neighbors using neural networks
  • Intuition: Network neighborhood defines a computation graph
    • Every node defines a computation graph based on its neighborhood

 

Deep Model: Many Layers

  • Model can be of arbitrary depth:
    • Nodes have embeddings at each layer
    • Layer-0 embedding of node $u$ is its input feature $x_u$
    • Layer-K embedding gets information from nodes that are K hops away

 

 

Neighborhood Aggregation: Key distinctions are in how different approaches aggregate information across the layers

Basic Approach: average information from neighbors and apply a neural network 

 

 

 

The Math: Deep Encoder

Basic Approach: average neighbor messages and apply a neural network

 

$h_v^k = \sigma \left( W_k \sum_{u\in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right), \quad \forall k \in \{1, \dots, K\}$

 

Notation:

 

 $h_v^0 = x_v$ : Initial 0-th layer embeddings are equal to node features

 $z_v = h_v^K$ : embedding after K layers of neighborhood aggregation
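A minimal PyTorch sketch of this update rule, assuming a dense binary adjacency matrix $A$ and row-stacked embeddings $H$; the class name `MeanAggLayer` and the choice of ReLU as $\sigma$ are assumptions made here, not taken from the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanAggLayer(nn.Module):
    """One layer of h_v^k = sigma(W_k * mean_{u in N(v)} h_u^{k-1} + B_k h_v^{k-1})."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # neighbor transform W_k
        self.B = nn.Linear(in_dim, out_dim, bias=False)   # self transform B_k

    def forward(self, H, A):
        # H: (N, in_dim) previous-layer embeddings, A: (N, N) binary adjacency
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)     # |N(v)|, avoid divide-by-zero
        neigh_mean = (A @ H) / deg                        # sum_u h_u^{k-1} / |N(v)|
        return F.relu(self.W(neigh_mean) + self.B(H))     # sigma = ReLU here

# Toy usage: 4 nodes on a path, 3-dimensional input features (h^0 = x)
A = torch.tensor([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=torch.float)
X = torch.randn(4, 3)
layer = MeanAggLayer(3, 8)
H1 = layer(X, A)                                          # layer-1 embeddings, shape (4, 8)
```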

 

 

How do we train the model to generate embeddings?

  • Need to define a loss function on the embeddings
  • We can feed these embeddings into any loss function and run stochastic gradient descent to train the weight parameters ($W_k$ and $B_k$ are the trainable weight matrices, i.e., what we learn)
  • Equivalently, rewritten in vector form:
    • $H^{(l+1)}= \sigma (H^{(l)}W_0^{(l)}+ \tilde A H^{(l)}W_1^{(l)}) \text{ with }\tilde A = D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$
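A small sketch of the vector form, assuming dense tensors and a binary adjacency $A$ from which the degree matrix $D$ is built; the helper names are hypothetical.

```python
import torch

def normalized_adjacency(A):
    # A_tilde = D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix of A
    deg = A.sum(dim=1).clamp(min=1)
    d_inv_sqrt = deg.pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

def gcn_layer(H, A_tilde, W0, W1):
    # H^{(l+1)} = sigma(H^{(l)} W0^{(l)} + A_tilde H^{(l)} W1^{(l)})
    return torch.relu(H @ W0 + A_tilde @ H @ W1)
```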

 

Unsupervised Training

  • Train in an unsupervised manner
    • Use only the graph structure
    • “Similar” nodes have similar embeddings
  • Unsupervised loss function can be anything from the last section, e.g., a loss based on
    • Random walks (node2vec, DeepWalk, struc2vec)
    • Graph factorization
    • Node proximity in the graph
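As one concrete, illustrative instance, a node-proximity loss with negative sampling on the final embeddings $Z$; random-walk-based losses (DeepWalk/node2vec style) plug in the same way. The function name and the `(2, E)` edge format are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def proximity_loss(Z, pos_edges, num_nodes, num_neg=5):
    # Z: (N, d) embeddings z_v produced by the encoder
    # pos_edges: (2, E) tensor of connected node pairs ("similar" nodes)
    u, v = pos_edges
    pos_score = (Z[u] * Z[v]).sum(dim=1)                  # z_u . z_v should be high
    neg_u = u.repeat_interleave(num_neg)
    neg_v = torch.randint(0, num_nodes, (neg_u.shape[0],))
    neg_score = (Z[neg_u] * Z[neg_v]).sum(dim=1)          # random pairs should score low
    return (F.binary_cross_entropy_with_logits(pos_score, torch.ones_like(pos_score))
            + F.binary_cross_entropy_with_logits(neg_score, torch.zeros_like(neg_score)))
```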

 

Supervised Training

  • Directly train the model for a supervised task (e.g., node classification) 

Model Design: Overview

  1. Define a neighborhood aggregation function
  2. Define a loss function on the embeddings
  3. Train on a set of nodes, i.e., a batch of computation graphs
  4. Generate embeddings for nodes as needed  
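Putting steps 1-4 together, a hedged sketch of supervised node classification on top of the `MeanAggLayer` from the earlier sketch; `X`, `A`, `train_idx`, `labels`, and `num_classes` are assumed to be given.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Steps 1-2: aggregation function (MeanAggLayer from the earlier sketch) and a
# cross-entropy loss on the embeddings.
encoder = nn.ModuleList([MeanAggLayer(X.shape[1], 16), MeanAggLayer(16, 16)])
classifier = nn.Linear(16, num_classes)
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=0.01)

# Step 3: train on a set of labeled nodes
for epoch in range(100):
    H = X
    for layer in encoder:                 # K = 2 layers of neighborhood aggregation
        H = layer(H, A)
    loss = F.cross_entropy(classifier(H[train_idx]), labels[train_idx])
    opt.zero_grad(); loss.backward(); opt.step()

# Step 4: generate embeddings for nodes as needed
Z = H.detach()                            # z_v = h_v^K
```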

 

 

 

Inductive Capability

  • The same aggregation parameters are shared for all nodes
    • The number of model parameters is sublinear in $|V|$, so we can generalize to unseen nodes.
      • New Graphs
        • e.g. Train on protein interaction graph from model organism A and generate embeddings on newly collected data about organism B. 

      • New Nodes
        • Many application settings constantly encounter previously unseen nodes
          • E.g. Reddit, YouTube, Google Scholar
        • Need to generate new embeddings “on the fly”

 

Recap

  • Generate node embeddings by aggregating neighborhood information 
    • We saw a basic variant of this idea
    • Key distinctions are in how different approaches aggregate information across the layers

Next

  • Describe GraphSAGE graph neural network architecture

 

GraphSAGE Idea

  • So far we have aggregated the neighbor messages by taking their (weighted) average.
  • Can we do better?
  • Use any differentiable function that maps the set of neighbor vectors $\{h_u^{k-1}, \forall u \in N(v)\}$ to a single vector.

 

  • $h_v^k = \sigma ([A_k \cdot AGG(\{h_u^{k-1}, \forall u \in N(v)\}), B_k h_v^{k-1}])$
  • Apply L2 normalization to each node embedding at every layer.
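A sketch of one GraphSAGE-style layer under the same dense-adjacency assumption, using the mean aggregator and ReLU as $\sigma$; because of the concatenation, the output dimension is twice `out_dim`. The class and attribute names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAGELayer(nn.Module):
    """h_v^k = sigma([A_k * AGG({h_u^{k-1}}), B_k h_v^{k-1}]) with a mean AGG,
    followed by per-node L2 normalization."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.A_k = nn.Linear(in_dim, out_dim, bias=False)  # neighbor branch A_k
        self.B_k = nn.Linear(in_dim, out_dim, bias=False)  # self branch B_k

    def forward(self, H, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = (adj @ H) / deg                              # mean aggregator AGG
        h = F.relu(torch.cat([self.A_k(agg), self.B_k(H)], dim=1))
        return F.normalize(h, p=2, dim=1)                  # L2-normalize each embedding
```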

Neighborhood Aggregation

  • Simple neighborhood aggregation 

  • GraphSAGE 

 

 

Neighborhood Aggregation: Variants

  • Mean:
    • Take a weighted average of neighbors
    • $ AGG = \sum _{u \in N(v)}{\frac{h_u^{k-1}}{|N(v)|}} $
  • Pool:
    • Transform neighbor vectors and apply symmetric vector function 
    • $AGG = \gamma (\{Q h_u^{k-1}, \forall u \in N(v)\})$, where $\gamma$ is an element-wise mean/max
  • LSTM:
    • Apply an LSTM to a random permutation $\pi$ of the neighbors
    • $ AGG = LSTM([h_u^{k-1}, \forall u \in \pi (N(v))]) $
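A short sketch of the pool variant for a single node's neighborhood, taking $\gamma$ to be an element-wise max over the transformed neighbor messages; the class name is illustrative.

```python
import torch
import torch.nn as nn

class PoolAggregator(nn.Module):
    """AGG = gamma({Q h_u^{k-1}}): transform each neighbor with Q, then element-wise max."""
    def __init__(self, dim):
        super().__init__()
        self.Q = nn.Linear(dim, dim)

    def forward(self, neighbor_h):         # neighbor_h: (|N(v)|, dim) messages of one node v
        return self.Q(neighbor_h).max(dim=0).values
```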

 

Recap: GCN, GraphSAGE

Key idea: Generate node embeddings based on local neighborhoods

  • Nodes aggregate “messages” from their neighbors using neural networks
  • Graph Convolutional Networks
    • Basic Variant: Average neighborhood information and stack neural networks
  • GraphSAGE
    • Generalized neighborhood aggregation 

 

 

Efficient Implementation

  • Many aggregations can be performed efficiently by (sparse) matrix operations
  • Let $H^{k-1}$ be the matrix whose $v$-th row is $h_v^{k-1}$; then $\sum_{u\in N(v)} h_u^{k-1} = (A H^{k-1})_v$, so one layer of neighborhood aggregation is a single (sparse) matrix-matrix product $A H^{k-1}$
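A sketch of this aggregation as a sparse matrix product, assuming the graph is given as an `edge_index` tensor of shape `(2, E)` that lists each undirected edge in both directions; the function name is illustrative.

```python
import torch

def sparse_mean_aggregate(H, edge_index, num_nodes):
    # Build a sparse adjacency from (source, target) index pairs and compute the
    # per-node mean of neighbor embeddings as one sparse-dense matrix product.
    values = torch.ones(edge_index.shape[1])
    A = torch.sparse_coo_tensor(edge_index, values, (num_nodes, num_nodes))
    deg = torch.sparse.sum(A, dim=1).to_dense().clamp(min=1).unsqueeze(1)
    return torch.sparse.mm(A, H) / deg
```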

 

Graph Attention Networks

 

Simple Neighborhood Aggregation

Graph convolutional operator

  • Aggregates messages across neighborhoods,  $N(v)$
  • $\alpha_{vu} = \frac{1}{|N(v)|}$ is the weighting factor (importance) of node $u$'s message to node $v$.
  • $\alpha_{vu}$ is defined explicitly based on the structural properties of the graph.
  • All neighbors $u \in N(v)$ are equally important to node $v$.

 

Graph Attention Networks

  • Can we do better than simple neighborhood aggregation?
  • Can we let the weighting factors $\alpha_{vu}$ be learned implicitly?
  • Goal: Specify arbitrary importance values for different neighbors of each node in the graph
  • Idea: Compute embedding $h^k_v $ of each node in the graph following an attention strategy.
    • Nodes attend over their neighborhood’s message.
    • Implicitly specifying different weights to different nodes in a neighborhood.

 

Attention Mechanism

  • Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$:
    • Let $a$ compute attention coefficients $e_{vu}$ across pairs of nodes $u, v$ based on their messages:
          • $e_{vu} = a(W_k h_u^{k-1}, W_k h_v^{k-1})$
          • $e_{vu}$ indicates the importance of node $u$'s message to node $v$
    • Normalize the coefficients using the softmax function so that they are comparable across different neighborhoods:
          • $\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)} \exp(e_{vk})}$
          • $h_v^k = \sigma (\sum_{u \in N(v)} \alpha_{vu} W_k h_u^{k-1})$
  • Attention mechanism $a$:
    • The approach is agnostic to the choice of  $a$
      • E.g., use a simple single-layer neural network
      • $a$ can have parameters, which need to be estimated
    • Parameters of  $a$ are trained jointly
      • Learn the parameters together with the weight matrices (i.e., the other parameters of the neural net) in an end-to-end fashion
  • Multi-head attention:
    • Stabilize the learning process of attention mechanism 
      • Attention operations in a given layer are independently replicated R times (each replica with different parameters)
      • Outputs are aggregated (by concatenating or adding)
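A single-head sketch of this attention-based aggregation, assuming a dense adjacency that includes self-loops (so every node attends to at least one entry) and a simple single-layer linear mechanism $a$; class and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    """e_vu = a(W h_v, W h_u), alpha = softmax over N(v), h_v^k = sigma(sum alpha_vu W h_u)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)     # simple single-layer mechanism a

    def forward(self, H, adj):
        # adj: (N, N) adjacency, assumed to include self-loops so each softmax row
        # has at least one finite entry.
        Wh = self.W(H)                                     # (N, out_dim)
        n = Wh.shape[0]
        pairs = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                           Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = self.a(pairs).squeeze(-1)                      # e[v, u] = a(W h_v, W h_u)
        e = e.masked_fill(adj == 0, float('-inf'))         # attend only over N(v)
        alpha = F.softmax(e, dim=1)                        # normalize within each neighborhood
        return F.elu(alpha @ Wh)                           # sigma applied to the weighted sum
```

Multi-head attention would replicate this layer R times with independent parameters and concatenate or average the R outputs.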

 

 

Properties of Attentional Mechanism

  • Key benefit: Allows for (implicitly) specifying different importance values ($\alpha_{vu}$) to different neighbors
  • Computationally efficient
    • Computation of attentional coefficients can be parallelized across all edges of the graph
    • Aggregation may be parallelized across all nodes
  • Storage efficient
    • Sparse matrix operations do not require more than O(V+E) entries to be stored
    • Fixed number of parameters, irrespective of graph size
  • Trivially localized
    • Only attends over local network neighborhoods
  • Inductive capability
    • It is a shared edge-wise mechanism
    • It does not depend on the global graph structure

 

 

General Tips

  • Data preprocessing is important
    • Use renormalization tricks
    • Variance-scaled initialization
    • Network data whitening
  • ADAM optimizer
    • ADAM naturally takes care of decaying the learning rate
  • ReLU activation function often works really well
  • No activation function at your output layer
    • Easy mistake if you build layers with a shared function
  • Include bias term in every layer
  • GCN layer of size 64 or 128 is already plenty

 

 

 

 

 

 

 

Reference: http://web.stanford.edu/class/cs224w/slides/08-GNN.pdf