Node Embeddings
Intuition: map nodes to d-dimensional embeddings such that similar nodes in the graph are embedded close together.
Goal: map nodes so that similarity in the embedding space (e.g., dot product) approximates similarity (e.g., proximity) in the network.
Goal: two things need to be defined: the node similarity function $\text{similarity}(u, v)$ in the network, and the encoder $ENC(v) = z_v$ that maps each node to its embedding, such that $\text{similarity}(u, v) \approx z_v^\top z_u$
Shallow Encoders
- One-layer of data transformation
- A single hidden layer maps node $u$ to embedding $z_u$ via function $f(\cdot)$, e.g. $z_u = f(z_v, v \in N_R(u))$
- Limitations of shallow embedding methods
- $O(|V|)$ parameters are needed:
- No sharing of parameters between nodes
- Every node has its own unique embedding
- Inherently “transductive”
- Cannot generate embeddings for nodes that are not seen during training.
- Do not incorporate node features
- Many graphs have features that we can and should leverage
- Today: Deep Graph Encoders
- We will now discuss deep methods based on graph neural networks
- $ENC(v)$ = multiple layers of non-linear transformations of graph structure
- Note: All these deep encoders can be combined with the node similarity functions defined in the previous lecture
Networks are far more complex than images or text:
- Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
- No fixed node ordering or reference point
- Often dynamic and have multimodal features
Idea: Convolutional Networks
Goal:
- To generalize convolutions beyond simple lattices
- Leverage node features/attributes (e.g., text, images)
From Images to Graphs
- Transform information at the neighbors and combine it
- Transform “messages” from neighbors: $W_ih_i$
- Add them up: $\sum_i W_ih_i$
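A minimal NumPy sketch of this transform-and-sum idea (the shapes, random values, and per-neighbor matrices $W_i$ are purely illustrative):

```python
import numpy as np

# Sketch: transform each neighbor "message" h_i with its own weight matrix W_i
# and sum the transformed messages, as in the convolution analogy above.
rng = np.random.default_rng(0)
h_neighbors = [rng.normal(size=3) for _ in range(4)]           # neighbor messages h_i
W_list = [rng.normal(size=(3, 3)) for _ in range(4)]           # one transform W_i per message
aggregated = sum(W_i @ h_i for W_i, h_i in zip(W_list, h_neighbors))  # sum_i W_i h_i
print(aggregated.shape)                                        # -> (3,)
```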
Real-World Graphs
A Naive Approach:
- Join adjacency matrix and features
- Feed them into a deep neural net
- Issues with this idea:
- $O(|V|)$ parameters
- Not applicable to graphs of different sizes
- Not invariant to node ordering
Basics of Deep Learning for Graphs
Local Network Neighborhoods
- Describe aggregation strategies
- Define computation graphs
Stacking Multiple Layers
- Describe the model, parameters, training
- How to fit the model?
- Simple example for unsupervised and supervised training
Setup
Assume we have a graph G:
- V is the vertex set
- A is the adjacency matrix (assume binary)
- $X \in \mathbb{R}^{m \times |V|}$: matrix of node features
- Node features
- Social Networks: User profile, User Image
- Biological Networks: Gene expression profile, Gene functional information
- No Features:
- Indicator vectors (one-hot encoding of a node)
- Vector of constant 1: [1, 1, …, 1]
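To make the setup concrete, here is a minimal NumPy sketch of these objects on a hypothetical 4-node toy graph (the graph and its edges are made up for illustration):

```python
import numpy as np

# Toy graph with |V| = 4 nodes and undirected edges (0-1, 0-2, 1-2, 2-3).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # binary adjacency matrix

# Node features X in R^{m x |V|}: here m = 4, using one-hot indicator
# vectors because the toy graph has no real attributes.
X = np.eye(4)

# The neighborhood N(v) is simply the set of indices u with A[v, u] = 1.
def neighbors(A, v):
    return np.nonzero(A[v])[0]

print(neighbors(A, 2))   # -> [0 1 3]
```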
Graph Convolutional Networks
- Idea: Node’s neighborhood defines a computation graph
- Learn how to propagate information across the graph to compute node features
Idea: Aggregate Neighbors
- Key idea: Generate node embeddings based on local network neighborhoods

- Intuition: Nodes aggregate information from their neighbors using neural networks
- Intuition: Network neighborhood defines a computation graph
- Every node defines a computation graph based on its neighborhood
Deep Model: Many Layers
- Model can be of arbitrary depth:
- Nodes have embeddings at each layer
- Layer-0 embedding of node $v$ is its input feature, $x_v$
- Layer-K embedding gets information from nodes that are K hops away

Neighborhood Aggregation: Key distinctions are in how different approaches aggregate information across the layers
Basic Approach: average information from neighbors and apply a neural network

The Math: Deep Encoder
Basic Approach: average neighbor messages and apply a neural network
$h_v^k = \sigma \left( W_k \sum _{u\in N(v)}{\frac{h_u^{k-1}}{|N(v)|}} + B_k h_v^{k-1} \right), \quad \forall k \in \{1, ..., K\}$
where:
$h_v^0 = x_v$ : initial 0-th layer embeddings are equal to the node features
$z_v = h_v^K$ : embedding after K layers of neighborhood aggregation
How do we train the model to generate embeddings?
- Need to define a loss function on the embeddings
- We can feed these embeddings into any loss function and run stochastic gradient descent to train the weight parameters ($W_k$, $B_k$ are the trainable weight matrices, i.e., what we learn)
- Equivalently, rewritten in vector form:
- $H^{(l+1)}= \sigma (H^{(l)}W_0^{(l)}+ \tilde A H^{(l)}W_1^{(l)}) \text{ with }\tilde A = D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$
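A minimal NumPy sketch of the averaging update above, assuming row-vector embeddings (each row of `H` is one $h_v^{k-1}$) and a ReLU nonlinearity; the weight matrices here are randomly initialized only to illustrate the shapes, not trained:

```python
import numpy as np

def gcn_layer(A, H_prev, W_k, B_k):
    """One layer of the averaging update:
    h_v^k = sigma(W_k * mean_{u in N(v)} h_u^{k-1} + B_k * h_v^{k-1}).
    Assumes every node has at least one neighbor."""
    deg = A.sum(axis=1, keepdims=True)               # |N(v)| for each node
    neigh_mean = (A @ H_prev) / deg                  # average of neighbor embeddings
    return np.maximum(0.0, neigh_mean @ W_k + H_prev @ B_k)   # sigma = ReLU

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4)                                        # layer-0 embeddings: one-hot features
rng = np.random.default_rng(0)
K, d = 2, 8
for k in range(K):
    W_k = rng.normal(scale=0.1, size=(H.shape[1], d))
    B_k = rng.normal(scale=0.1, size=(H.shape[1], d))
    H = gcn_layer(A, H, W_k, B_k)
Z = H                                                # z_v = h_v^K (rows of Z)
```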
Unsupervised Training
- Train in an unsupervised manner
- Use only the graph structure
- “Similar” nodes have similar embeddings
- Unsupervised loss function can be anything from the last section, e.g., a loss based on
- Random walks (node2vec, DeepWalk, struc2vec)
- Graph factorization
- Node proximity in the graph
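As one concrete (hypothetical) instance of such a loss, the sketch below uses a negative-sampling style objective on node pairs, where `pos_pairs` could come from random-walk co-occurrences or edges and `neg_pairs` are random node pairs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(Z, pos_pairs, neg_pairs):
    """Negative-sampling style loss: pull embeddings of 'similar' node pairs
    together, push embeddings of random node pairs apart."""
    loss = 0.0
    for u, v in pos_pairs:                    # similar nodes: increase z_u . z_v
        loss -= np.log(sigmoid(Z[u] @ Z[v]) + 1e-12)
    for u, v in neg_pairs:                    # random pairs: decrease z_u . z_v
        loss -= np.log(1.0 - sigmoid(Z[u] @ Z[v]) + 1e-12)
    return loss

Z = np.random.default_rng(0).normal(size=(4, 8))   # embeddings z_v as rows
print(unsupervised_loss(Z, pos_pairs=[(0, 1), (1, 2)], neg_pairs=[(0, 3)]))
```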
Supervised Training
- Directly train the model for a supervised task (e.g., node classification)
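For example, a node-classification head can be a linear classifier with cross-entropy on top of the embeddings; the sketch below assumes integer class labels and a hypothetical classification matrix `W_out` trained jointly with the encoder:

```python
import numpy as np

def node_classification_loss(Z, labels, W_out):
    """Cross-entropy loss for node classification on top of the embeddings.
    Z: (n, d) node embeddings, labels: (n,) integer class ids,
    W_out: (d, num_classes) classification weights (also trained)."""
    logits = Z @ W_out
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))
labels = np.array([0, 1, 1, 0])
W_out = rng.normal(size=(8, 2))
print(node_classification_loss(Z, labels, W_out))
```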

Model Design: Overview
- Define a neighborhood aggregation function
- Define a loss function on the embeddings
- Train on a set of nodes, i.e., a batch of computation graphs
- Generate embeddings for nodes as needed



Inductive Capability
- The same aggregation parameters are shared for all nodes
- The number of model parameters is sublinear in $|V|$ and we can generalize to unseen nodes.
- New Graphs
- e.g. Train on protein interaction graph from model organism A and generate embeddings on newly collected data about organism B.

- New Nodes
- Many application settings constantly encounter previously unseen nodes
- E.g. Reddit, YouTube, Google Scholar
- Need to generate new embeddings “on the fly”
Recap
- Generate node embeddings by aggregating neighborhood information
- We saw a basic variant of this idea
- Key distinctions are in how different approaches aggregate information across the layers
Next
- Describe GraphSAGE graph neural network architecture
GraphSAGE Idea
- So far we have aggregated the neighbor messages by taking their (weighted) average.
- Can we do better?
- Use any differentiable function that maps the set of neighbor vectors $\{h_u^{k-1}, \forall u \in N(v)\}$ to a single vector.

- $h_v^k = \sigma ([A_k \cdot \text{AGG}(\{h_u^{k-1}, \forall u \in N(v)\}), B_k h_v^{k-1}])$
- Apply L2 normalization for each node embedding at every layer.
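A minimal NumPy sketch of this GraphSAGE-style update with the mean aggregator (row-vector embeddings, ReLU as $\sigma$; the parameter names `A_k`, `B_k` follow the equation above):

```python
import numpy as np

def graphsage_layer(A, H_prev, A_k, B_k):
    """GraphSAGE-style update (sketch): concatenate the transformed aggregated
    neighbor message with the node's own transformed embedding, apply a
    nonlinearity, then L2-normalize each node embedding."""
    deg = A.sum(axis=1, keepdims=True)
    neigh = (A @ H_prev) / np.maximum(deg, 1.0)          # mean aggregator AGG
    concat = np.concatenate([neigh @ A_k, H_prev @ B_k], axis=1)
    H = np.maximum(0.0, concat)                          # sigma = ReLU
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    return H / np.maximum(norms, 1e-12)                  # per-node L2 normalization

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
H = np.eye(4)
H1 = graphsage_layer(A, H, A_k=rng.normal(size=(4, 8)), B_k=rng.normal(size=(4, 8)))
print(H1.shape)   # -> (4, 16), due to concatenation
```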
Neighborhood Aggregation
- Simple neighborhood aggregation
  - $h_v^k = \sigma (W_k \sum _{u\in N(v)}{\frac{h_u^{k-1}}{|N(v)|}} + B_k h_v^{k-1})$
- GraphSAGE
  - $h_v^k = \sigma ([A_k \cdot \text{AGG}(\{h_u^{k-1}, \forall u \in N(v)\}), B_k h_v^{k-1}])$

Neighborhood Aggregation: Variants
- Mean:
- Take a weighted average of neighbors
- $ AGG = \sum _{u \in N(v)}{\frac{h_u^{k-1}}{|N(v)|}} $
- Pool:
- Transform neighbor vectors and apply symmetric vector function
- $AGG = \gamma (\{Q h_u^{k-1}, \forall u \in N(v)\})$, where $\gamma$ is an element-wise mean/max
- LSTM:
- Apply an LSTM to a random permutation (reshuffling) of the neighbors
- $ AGG = LSTM([h_u^{k-1}, \forall u \in \pi (N(v))]) $
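The mean and pool variants can be sketched directly in NumPy (the LSTM variant is omitted, as it needs a recurrent network implementation); `H_neigh` stacks the neighbor embeddings $h_u^{k-1}$ as rows and `Q` is the pool transform:

```python
import numpy as np

def mean_agg(H_neigh):
    """Mean aggregator: average of the neighbor embeddings."""
    return H_neigh.mean(axis=0)

def pool_agg(H_neigh, Q, gamma=np.max):
    """Pool aggregator: transform each neighbor embedding with Q, then apply a
    symmetric element-wise function gamma (mean or max) across neighbors."""
    return gamma(H_neigh @ Q, axis=0)

rng = np.random.default_rng(0)
H_neigh = rng.normal(size=(3, 4))     # 3 neighbors, embedding dimension 4
Q = rng.normal(size=(4, 4))
print(mean_agg(H_neigh).shape, pool_agg(H_neigh, Q).shape)   # -> (4,) (4,)
```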
Recap: GCN, GraphSAGE
Key idea: Generate node embeddings based on local neighborhoods
- Nodes aggregate “messages” from their neighbors using neural networks
- Graph Convolutional Networks
- Basic Variant: Average neighborhood information and stack neural networks
- GraphSAGE
- Generalized neighborhood aggregation

Efficient Implementation
- Many aggregations can be performed efficiently by (sparse) matrix operations
- Let $H^{(k-1)} = [h_1^{k-1}, ..., h_{|V|}^{k-1}]^\top$ stack the node embeddings as rows; then $\sum_{u \in N(v)} h_u^{k-1} = A_{v,:} H^{(k-1)}$, i.e., the neighbor sums for all nodes at once are a single (sparse) product $A H^{(k-1)}$
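A sketch of this sparse formulation, assuming SciPy is available; one sparse product `A @ H_prev` computes the neighbor sums for every node simultaneously:

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[0, 1, 1, 0],
                         [1, 0, 1, 0],
                         [1, 1, 0, 1],
                         [0, 0, 1, 0]], dtype=float))     # sparse adjacency matrix
H_prev = np.random.default_rng(0).normal(size=(4, 8))     # rows: h_v^{k-1}

neigh_sum = A @ H_prev                                    # sum_{u in N(v)} h_u^{k-1}, all v at once
deg = np.asarray(A.sum(axis=1))                           # |N(v)| per node, shape (n, 1)
neigh_mean = neigh_sum / np.maximum(deg, 1.0)             # mean aggregation for every node
```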

Graph Attention Networks
Simple Neighborhood Aggregation
Graph convolutional operator
- Aggregates messages across neighborhoods, $N(v)$
- $\alpha_{vu} = \frac{1}{|N(v)|}$ is the weighting factor (importance) of node $u$'s message to node $v$.
- $\alpha_{vu}$ is defined explicitly based on the structural properties of the graph.
- All neighbors $u \in N(v)$ are equally important to node $v$.
Graph Attention Networks
- Can we do better than simple neighborhood aggregation?
- Can we let the weighting factors $\alpha_{vu}$ be implicitly defined (i.e., learned)?
- Goal: Specify arbitrary importances to different neighbors of each node in the graph
- Idea: Compute embedding $h^k_v $ of each node in the graph following an attention strategy.
- Nodes attend over their neighborhood’s message.
- Implicitly specifying different weights to different nodes in a neighborhood.
Attention Mechanism
- Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$:
- Let $a$ compute attention coefficients across pairs of nodes based on their messages:
- $e_{vu} = a(W_k h_u^{k-1}, W_k h_v^{k-1})$
- $e_{vu}$ indicates the importance of node $u$'s message to node $v$
- Normalize coefficients using the softmax function in order to be comparable across different neighborhoods
- $\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{u' \in N(v)} \exp(e_{vu'})}$
- $h_v^k = \sigma (\sum_{u \in N(v)} \alpha_{vu}W_k h_u^{k-1})$
- Attention mechanism $a$:
- The approach is agnostic to the choice of $a$
- E.g., use a simple single-layer neural network
- $a$ can have parameters, which need to be estimated
- Parameters of $a$ are trained jointly
- Learn the parameters together with the weight matrices (i.e., the other parameters of the neural net) in an end-to-end fashion
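A sketch of the attention-weighted update for a single node. Following the common GAT-style choice (one option, not mandated here), $a$ is a single-layer scorer applied to the concatenated transformed messages with a LeakyReLU; `a_vec` and the toy graph are illustrative:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def gat_update_node(v, A, H_prev, W_k, a_vec):
    """Attention-weighted aggregation for one node v (sketch).
    Attention mechanism a: single-layer scorer on the concatenated
    transformed messages, e_vu = LeakyReLU(a_vec . [W_k h_v ; W_k h_u])."""
    neigh = np.nonzero(A[v])[0]
    Wh = H_prev @ W_k                                       # transformed messages W_k h_u
    scores = np.array([np.concatenate([Wh[v], Wh[u]]) @ a_vec for u in neigh])
    scores = np.where(scores > 0, scores, 0.2 * scores)     # LeakyReLU
    alpha = softmax(scores)                                 # normalize over N(v)
    return np.maximum(0.0, (alpha[:, None] * Wh[neigh]).sum(axis=0))   # sigma = ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
H = rng.normal(size=(4, 4))
W_k = rng.normal(size=(4, 8))
a_vec = rng.normal(size=16)                                 # scorer for concatenated pair
h2_new = gat_update_node(2, A, H, W_k, a_vec)               # update for node 2
```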
- Multi-head attention:
- Stabilize the learning process of attention mechanism
- Attention operations in a given layer are independently replicated R times (each replica with different parameters)
- Outputs are aggregated (by concatenating or adding)
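A sketch of multi-head aggregation by concatenation, reusing a single-head update function such as `gat_update_node` from the previous sketch; `heads` is a hypothetical list of R parameter pairs, one per replica:

```python
import numpy as np

def multi_head(v, A, H_prev, heads, attention_head):
    """Multi-head attention (sketch): replicate the attention operation R times
    with independent parameters and aggregate the outputs by concatenation."""
    outputs = [attention_head(v, A, H_prev, W_k, a_vec) for (W_k, a_vec) in heads]
    return np.concatenate(outputs)    # or: np.mean(outputs, axis=0) to add/average instead
```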
Properties of Attentional Mechanism
- Key benefit: allows for (implicitly) specifying different importance values ($\alpha_{vu}$) to different neighbors
- Computationally efficient
- Computation of attentional coefficients can be parallelized across all edges of the graph
- Aggregation may be parallelized across all nodes
- Storage efficient
- Sparse matrix operations do not require more than $O(|V| + |E|)$ entries to be stored
- Fixed number of parameters, irrespective of graph size
- Trivially localized
- Only attends over local network neighborhoods
- Inductive capability
- It is a shared edge-wise mechanism
- It does not depend on the global graph structure
General Tips
- Data preprocessing is important
- Use renormalization tricks
- Variance-scaled initialization
- Network data whitening
- ADAM optimizer
- ADAM naturally takes care of decaying the learning rate
- ReLU activation function often works really well
- No activation function at your output layer
- Easy mistake if you build layers with a shared function
- Include bias term in every layer
- GCN layer of size 64 or 128 is already plenty
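These tips can be summarized as a hypothetical hyperparameter configuration (the names are illustrative, not tied to any specific library):

```python
# A hypothetical configuration reflecting the tips above.
gnn_config = {
    "hidden_dim": 64,            # a GCN layer of size 64-128 is already plenty
    "num_layers": 2,
    "activation": "relu",        # ReLU often works well for hidden layers
    "output_activation": None,   # no activation function at the output layer
    "use_bias": True,            # include a bias term in every layer
    "optimizer": "adam",         # ADAM takes care of decaying the learning rate
    "weight_init": "variance_scaled",
    "normalize_inputs": True,    # data whitening / renormalization tricks
}
```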
Reference: http://web.stanford.edu/class/cs224w/slides/08-GNN.pdf