Node Embeddings
Intuition: map nodes to d-dimensional embeddings such that similar nodes in the graph are embedded close together.
Goal: map nodes so that similarity in the embedding space (e.g., dot product) approximates similarity (e.g., proximity) in the network.
Goal: two things need to be defined: the node similarity function $\text{similarity}(u, v)$ in the network, and the encoder $ENC(v) = z_v$ that maps each node to its embedding, such that $\text{similarity}(u, v) \approx z_v^\top z_u$
Shallow Encoders
- One-layer of data transformation
- A single hidden layer maps node $u$ to embedding $z_u$ via function $f(\cdot)$, e.g. $z_u = f(z_v, v \in N_R(u))$
- Limitations of shallow embedding methods
- $O(|V|)$ parameters are needed:
- No sharing of parameters between nodes
- Every node has its own unique embedding
- Inherently “transductive”
- Cannot generate embeddings for nodes that are not seen during training.
- Do not incorporate node features
- Many graphs have features that we can and should leverage
- Today: Deep Graph Encoders
- We will now discuss deep methods based on graph neural networks
- $ENC(v)$ = multiple layers of non-linear transformations of graph structure
- Note: All these deep encoders can be combined with the node similarity functions defined in the previous lecture
Networks are far more complex than images or text:
- Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
- No fixed node ordering or reference point
- Often dynamic and have multimodal features
Idea: Convolutional Networks
Goal:
- To generalize convolutions beyond simple lattices
- Leverage node features/attributes (e.g., text, images)
From Images to Graphs
- Transform information at the neighbors and combine it
- Transform “messages” from neighbors: $W_ih_i$
- Add them up: $\sum_i W_ih_i$
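A minimal NumPy sketch of this transform-and-sum idea (the shapes, random values, and per-neighbor matrices $W_i$ are purely illustrative):

```python
import numpy as np

# Sketch: transform each neighbor "message" h_i with its own weight matrix W_i
# and sum the transformed messages, as in the convolution analogy above.
rng = np.random.default_rng(0)
h_neighbors = [rng.normal(size=3) for _ in range(4)]           # neighbor messages h_i
W_list = [rng.normal(size=(3, 3)) for _ in range(4)]           # one transform W_i per message
aggregated = sum(W_i @ h_i for W_i, h_i in zip(W_list, h_neighbors))  # sum_i W_i h_i
print(aggregated.shape)                                        # -> (3,)
```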
Real-World Graphs
A Naive Approach:
- Join adjacency matrix and features
- Feed them into a deep neural net
- Issues with this idea:
- $O(|V|)$ parameters
- Not applicable to graphs of different sizes
- Not invariant to node ordering
Basics of Deep Learning for Graphs
Local Network Neighborhoods
- Describe aggregation strategies
- Define computation graphs
Stacking Multiple Layers
- Describe the model, parameters, training
- How to fit the model?
- Simple example for unsupervised and supervised training
Setup
Assume we have a graph G:
- V is the vertex set
- A is the adjacency matrix (assume binary)
- $X \in \mathbb{R}^{m \times |V|}$: matrix of node features
- Node features
- Social Networks: User profile, User Image
- Biological Networks: Gene expression profile, Gene functional information
- No Features:
- Indicator vectors (one-hot encoding of a node)
- Vector of constant 1: [1, 1, …, 1]
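To make the setup concrete, here is a minimal NumPy sketch of these objects on a hypothetical 4-node toy graph (the graph and its edges are made up for illustration):

```python
import numpy as np

# Toy graph with |V| = 4 nodes and undirected edges (0-1, 0-2, 1-2, 2-3).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # binary adjacency matrix

# Node features X in R^{m x |V|}: here m = 4, using one-hot indicator
# vectors because the toy graph has no real attributes.
X = np.eye(4)

# The neighborhood N(v) is simply the set of indices u with A[v, u] = 1.
def neighbors(A, v):
    return np.nonzero(A[v])[0]

print(neighbors(A, 2))   # -> [0 1 3]
```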
Graph Convolutional Networks
- Idea: Node’s neighborhood defines a computation graph
- Learn how to propagate information across the graph to compute node features
Idea: Aggregate Neighbors
- Key idea: Generate node embeddings based on local network neighborhoods

- Intuition: Nodes aggregate information from their neighbors using neural networks
- Intuition: Network neighborhood defines a computation graph
- Every node defines a computation graph based on its neighborhood
Deep Model: Many Layers
- Model can be of arbitrary depth:
- Nodes have embeddings at each layer
- Layer-0 embedding of node $v$ is its input feature, $x_v$
- Layer-K embedding gets information from nodes that are K hops away

Neighborhood Aggregation: Key distinctions are in how different approaches aggregate information across the layers
Basic Approach: average information from neighbors and apply a neural network

The Math: Deep Encoder
Basic Approach: average neighbor messages and apply a neural network
$h_v^k = \sigma \left( W_k \sum _{u\in N(v)}{\frac{h_u^{k-1}}{|N(v)|}} + B_k h_v^{k-1} \right), \quad \forall k \in \{1, ..., K\}$
where:
$h_v^0 = x_v$ : initial 0-th layer embeddings are equal to the node features
$z_v = h_v^K$ : embedding after K layers of neighborhood aggregation
How do we train the model to generate embeddings?
- Need to define a loss function on the embeddings
- We can feed these embeddings into any loss function and run stochastic gradient descent to train the weight parameters ($W_k$, $B_k$ are the trainable weight matrices, i.e., what we learn)
- Equivalently, rewritten in vector form:
- $H^{(l+1)}= \sigma (H^{(l)}W_0^{(l)}+ \tilde A H^{(l)}W_1^{(l)}) \text{ with }\tilde A = D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$
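A minimal NumPy sketch of the averaging update above, assuming row-vector embeddings (each row of `H` is one $h_v^{k-1}$) and a ReLU nonlinearity; the weight matrices here are randomly initialized only to illustrate the shapes, not trained:

```python
import numpy as np

def gcn_layer(A, H_prev, W_k, B_k):
    """One layer of the averaging update:
    h_v^k = sigma(W_k * mean_{u in N(v)} h_u^{k-1} + B_k * h_v^{k-1}).
    Assumes every node has at least one neighbor."""
    deg = A.sum(axis=1, keepdims=True)               # |N(v)| for each node
    neigh_mean = (A @ H_prev) / deg                  # average of neighbor embeddings
    return np.maximum(0.0, neigh_mean @ W_k + H_prev @ B_k)   # sigma = ReLU

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4)                                        # layer-0 embeddings: one-hot features
rng = np.random.default_rng(0)
K, d = 2, 8
for k in range(K):
    W_k = rng.normal(scale=0.1, size=(H.shape[1], d))
    B_k = rng.normal(scale=0.1, size=(H.shape[1], d))
    H = gcn_layer(A, H, W_k, B_k)
Z = H                                                # z_v = h_v^K (rows of Z)
```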
Unsupervised Training
- Train in an unsupervised manner
- Use only the graph structure
- “Similar” nodes have similar embeddings
- Unsupervised loss function can be anything from the last section, e.g., a loss based on
- Random walks (node2vec, DeepWalk, struc2vec)
- Graph factorization
- Node proximity in the graph
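As one concrete (hypothetical) instance of such a loss, the sketch below uses a negative-sampling style objective on node pairs, where `pos_pairs` could come from random-walk co-occurrences or edges and `neg_pairs` are random node pairs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(Z, pos_pairs, neg_pairs):
    """Negative-sampling style loss: pull embeddings of 'similar' node pairs
    together, push embeddings of random node pairs apart."""
    loss = 0.0
    for u, v in pos_pairs:                    # similar nodes: increase z_u . z_v
        loss -= np.log(sigmoid(Z[u] @ Z[v]) + 1e-12)
    for u, v in neg_pairs:                    # random pairs: decrease z_u . z_v
        loss -= np.log(1.0 - sigmoid(Z[u] @ Z[v]) + 1e-12)
    return loss

Z = np.random.default_rng(0).normal(size=(4, 8))   # embeddings z_v as rows
print(unsupervised_loss(Z, pos_pairs=[(0, 1), (1, 2)], neg_pairs=[(0, 3)]))
```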
Supervised Training
- Directly train the model for a supervised task (e.g., node classification)
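For example, a node-classification head can be a linear classifier with cross-entropy on top of the embeddings; the sketch below assumes integer class labels and a hypothetical classification matrix `W_out` trained jointly with the encoder:

```python
import numpy as np

def node_classification_loss(Z, labels, W_out):
    """Cross-entropy loss for node classification on top of the embeddings.
    Z: (n, d) node embeddings, labels: (n,) integer class ids,
    W_out: (d, num_classes) classification weights (also trained)."""
    logits = Z @ W_out
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))
labels = np.array([0, 1, 1, 0])
W_out = rng.normal(size=(8, 2))
print(node_classification_loss(Z, labels, W_out))
```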

Model Design: Overview
- Define a neighborhood aggregation function
- Define a loss function on the embeddings
- Train on a set of nodes, i.e., a batch of computation graphs
- Generate embeddings for nodes as needed



Inductive Capability
- The same aggregation parameters are shared for all nodes
- The number of model parameters is sublinear in $|V|$ and we can generalize to unseen nodes.
- New Graphs
- e.g. Train on protein interaction graph from model organism A and generate embeddings on newly collected data about organism B.

- New Nodes
- Many application settings constantly encounter previously unseen nodes
- E.g. Reddit, YouTube, Google Scholar
- Need to generate new embeddings “on the fly”
Recap
- Generate node embeddings by aggregating neighborhood information
- We saw a basic variant of this idea
- Key distinctions are in how different approaches aggregate information across the layers
Next
- Describe GraphSAGE graph neural network architecture
GraphSAGE Idea
- So far we have aggregated the neighbor messages by taking their (weighted) average.
- Can we do better?
- Use any differentiable function that maps the set of neighbor vectors $\{h_u^{k-1}, \forall u \in N(v)\}$ to a single vector.

- $h_v^k = \sigma ([A_k \cdot \text{AGG}(\{h_u^{k-1}, \forall u \in N(v)\}), B_k h_v^{k-1}])$
- Apply L2 normalization for each node embedding at every layer.
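A minimal NumPy sketch of this GraphSAGE-style update with the mean aggregator (row-vector embeddings, ReLU as $\sigma$; the parameter names `A_k`, `B_k` follow the equation above):

```python
import numpy as np

def graphsage_layer(A, H_prev, A_k, B_k):
    """GraphSAGE-style update (sketch): concatenate the transformed aggregated
    neighbor message with the node's own transformed embedding, apply a
    nonlinearity, then L2-normalize each node embedding."""
    deg = A.sum(axis=1, keepdims=True)
    neigh = (A @ H_prev) / np.maximum(deg, 1.0)          # mean aggregator AGG
    concat = np.concatenate([neigh @ A_k, H_prev @ B_k], axis=1)
    H = np.maximum(0.0, concat)                          # sigma = ReLU
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    return H / np.maximum(norms, 1e-12)                  # per-node L2 normalization

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
H = np.eye(4)
H1 = graphsage_layer(A, H, A_k=rng.normal(size=(4, 8)), B_k=rng.normal(size=(4, 8)))
print(H1.shape)   # -> (4, 16), due to concatenation
```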
Neighborhood Aggregation
- Simple neighborhood aggregation
  - $h_v^k = \sigma (W_k \sum _{u\in N(v)}{\frac{h_u^{k-1}}{|N(v)|}} + B_k h_v^{k-1})$
- GraphSAGE
  - $h_v^k = \sigma ([A_k \cdot \text{AGG}(\{h_u^{k-1}, \forall u \in N(v)\}), B_k h_v^{k-1}])$

Neighborhood Aggregation: Variants
- Mean:
- Take a weighted average of neighbors
- $ AGG = \sum _{u \in N(v)}{\frac{h_u^{k-1}}{|N(v)|}} $
- Pool:
- Transform neighbor vectors and apply symmetric vector function
- $AGG = \gamma (\{Q h_u^{k-1}, \forall u \in N(v)\})$, where $\gamma$ is an element-wise mean/max
- LSTM:
- Apply an LSTM to a random permutation (reshuffling) of the neighbors
- $ AGG = LSTM([h_u^{k-1}, \forall u \in \pi (N(v))]) $
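The mean and pool variants can be sketched directly in NumPy (the LSTM variant is omitted, as it needs a recurrent network implementation); `H_neigh` stacks the neighbor embeddings $h_u^{k-1}$ as rows and `Q` is the pool transform:

```python
import numpy as np

def mean_agg(H_neigh):
    """Mean aggregator: average of the neighbor embeddings."""
    return H_neigh.mean(axis=0)

def pool_agg(H_neigh, Q, gamma=np.max):
    """Pool aggregator: transform each neighbor embedding with Q, then apply a
    symmetric element-wise function gamma (mean or max) across neighbors."""
    return gamma(H_neigh @ Q, axis=0)

rng = np.random.default_rng(0)
H_neigh = rng.normal(size=(3, 4))     # 3 neighbors, embedding dimension 4
Q = rng.normal(size=(4, 4))
print(mean_agg(H_neigh).shape, pool_agg(H_neigh, Q).shape)   # -> (4,) (4,)
```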
Recap: GCN, GraphSAGE
Key idea: Generate node embeddings based on local neighborhoods
- Nodes aggregate “messages” from their neighbors using neural networks
- Graph Convolutional Networks
- Basic Variant: Average neighborhood information and stack neural networks
- GraphSAGE
- Generalized neighborhood aggregation

Efficient Implementation
- Many aggregations can be performed efficiently by (sparse) matrix operations
- Let $H^{(k-1)} = [h_1^{k-1}, ..., h_{|V|}^{k-1}]^\top$ stack the node embeddings as rows; then $\sum_{u \in N(v)} h_u^{k-1} = A_{v,:} H^{(k-1)}$, i.e., the neighbor sums for all nodes at once are a single (sparse) product $A H^{(k-1)}$
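A sketch of this sparse formulation, assuming SciPy is available; one sparse product `A @ H_prev` computes the neighbor sums for every node simultaneously:

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[0, 1, 1, 0],
                         [1, 0, 1, 0],
                         [1, 1, 0, 1],
                         [0, 0, 1, 0]], dtype=float))     # sparse adjacency matrix
H_prev = np.random.default_rng(0).normal(size=(4, 8))     # rows: h_v^{k-1}

neigh_sum = A @ H_prev                                    # sum_{u in N(v)} h_u^{k-1}, all v at once
deg = np.asarray(A.sum(axis=1))                           # |N(v)| per node, shape (n, 1)
neigh_mean = neigh_sum / np.maximum(deg, 1.0)             # mean aggregation for every node
```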

Graph Attention Networks
Simple Neighborhood Aggregation
Graph convolutional operator
- Aggregates messages across neighborhoods, $N(v)$
- $\alpha_{vu} = \frac{1}{|N(v)|}$ is the weighting factor (importance) of node $u$'s message to node $v$.
- $\alpha_{vu}$ is defined explicitly based on the structural properties of the graph.
- All neighbors $u \in N(v)$ are equally important to node $v$.
Graph Attention Networks
- Can we do better than simple neighborhood aggregation?
- Can we let the weighting factors $\alpha_{vu}$ be implicitly defined (i.e., learned)?
- Goal: Specify arbitrary importances to different neighbors of each node in the graph
- Idea: Compute embedding $h^k_v $ of each node in the graph following an attention strategy.
- Nodes attend over their neighborhood’s message.
- Implicitly specifying different weights to different nodes in a neighborhood.
Attention Mechanism
- Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$:
- Let $a$ compute attention coefficients across pairs of nodes based on their messages:
- $e_{vu} = a(W_k h_u^{k-1}, W_k h_v^{k-1})$
- $e_{vu}$ indicates the importance of node $u$'s message to node $v$
- Normalize coefficients using the softmax function in order to be comparable across different neighborhoods
- $\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{u' \in N(v)} \exp(e_{vu'})}$
- $h_v^k = \sigma (\sum_{u \in N(v)} \alpha_{vu}W_k h_u^{k-1})$
- Attention mechanism $a$:
- The approach is agnostic to the choice of $a$
- E.g., use a simple single-layer neural network
- $a$ can have parameters, which need to be estimated
- Parameters of $a$ are trained jointly
- Learn the parameters together with the weight matrices (i.e., the other parameters of the neural net) in an end-to-end fashion
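A sketch of the attention-weighted update for a single node. Following the common GAT-style choice (one option, not mandated here), $a$ is a single-layer scorer applied to the concatenated transformed messages with a LeakyReLU; `a_vec` and the toy graph are illustrative:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def gat_update_node(v, A, H_prev, W_k, a_vec):
    """Attention-weighted aggregation for one node v (sketch).
    Attention mechanism a: single-layer scorer on the concatenated
    transformed messages, e_vu = LeakyReLU(a_vec . [W_k h_v ; W_k h_u])."""
    neigh = np.nonzero(A[v])[0]
    Wh = H_prev @ W_k                                       # transformed messages W_k h_u
    scores = np.array([np.concatenate([Wh[v], Wh[u]]) @ a_vec for u in neigh])
    scores = np.where(scores > 0, scores, 0.2 * scores)     # LeakyReLU
    alpha = softmax(scores)                                 # normalize over N(v)
    return np.maximum(0.0, (alpha[:, None] * Wh[neigh]).sum(axis=0))   # sigma = ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
H = rng.normal(size=(4, 4))
W_k = rng.normal(size=(4, 8))
a_vec = rng.normal(size=16)                                 # scorer for concatenated pair
h2_new = gat_update_node(2, A, H, W_k, a_vec)               # update for node 2
```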
- Multi-head attention:
- Stabilize the learning process of attention mechanism
- Attention operations in a given layer are independently replicated R times (each replica with different parameters)
- Outputs are aggregated (by concatenating or adding)
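A sketch of multi-head aggregation by concatenation, reusing a single-head update function such as `gat_update_node` from the previous sketch; `heads` is a hypothetical list of R parameter pairs, one per replica:

```python
import numpy as np

def multi_head(v, A, H_prev, heads, attention_head):
    """Multi-head attention (sketch): replicate the attention operation R times
    with independent parameters and aggregate the outputs by concatenation."""
    outputs = [attention_head(v, A, H_prev, W_k, a_vec) for (W_k, a_vec) in heads]
    return np.concatenate(outputs)    # or: np.mean(outputs, axis=0) to add/average instead
```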
Properties of Attentional Mechanism
- Key benefit: allows for (implicitly) specifying different importance values ($\alpha_{vu}$) to different neighbors
- Computationally efficient
- Computation of attentional coefficients can be parallelized across all edges of the graph
- Aggregation may be parallelized across all nodes
- Storage efficient
- Sparse matrix operations do not require more than $O(|V| + |E|)$ entries to be stored
- Fixed number of parameters, irrespective of graph size
- Trivially localized
- Only attends over local network neighborhoods
- Inductive capability
- It is a shared edge-wise mechanism
- It does not depend on the global graph structure
General Tips
- Data preprocessing is important
- Use renormalization tricks
- Variance-scaled initialization
- Network data whitening
- ADAM optimizer
- ADAM naturally takes care of decaying the learning rate
- ReLU activation function often works really well
- No activation function at your output layer
- Easy mistake if you build layers with a shared function
- Include bias term in every layer
- GCN layer of size 64 or 128 is already plenty
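These tips can be summarized as a hypothetical hyperparameter configuration (the names are illustrative, not tied to any specific library):

```python
# A hypothetical configuration reflecting the tips above.
gnn_config = {
    "hidden_dim": 64,            # a GCN layer of size 64-128 is already plenty
    "num_layers": 2,
    "activation": "relu",        # ReLU often works well for hidden layers
    "output_activation": None,   # no activation function at the output layer
    "use_bias": True,            # include a bias term in every layer
    "optimizer": "adam",         # ADAM takes care of decaying the learning rate
    "weight_init": "variance_scaled",
    "normalize_inputs": True,    # data whitening / renormalization tricks
}
```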
Reference: http://web.stanford.edu/class/cs224w/slides/08-GNN.pdf