Neural networks have proven to be powerful and effective in many tasks. However, training such networks usually requires a huge amount of data. In the supervised learning paradigm, labelling data requires professional domain knowledge and thus becomes expensive and sometimes impractical. To address this data shortage, active learning methods have been proposed that focus on efficiently labelling the most valuable data samples to reduce annotation cost. In this project, we investigate the application of policy gradient methods to active learning for Graph Neural Networks (GNNs). We mainly focus on the effect of better state representations and the effectiveness of different policy gradient backbones.

Github

link to the final presentation

1. Introduction

A graph is a natural way to represent the connections between interacting objects, such as networking, traffic, and so on. Thus, GNNs, as an effective way to learn graph representations, have been very successful in various tasks. In both node prediction and link prediction tasks, GNNs have shown the ability to learn the intertwining relationships between entities. However, GNNs usually require a large dataset to learn from in the first place. In supervised learning tasks, high-quality labelled data is expensive, a problem that has attracted much attention.

Active learning has been introduced to tackle this issue. It dynamically selects the most valuable or informative samples for an oracle to label and iteratively trains the target model. An advantage of active learning on graphs is that we can exploit the high correlation among nodes to select and label the most informative samples.

The process of selecting a sample to label, putting it back into the training set, and repeating resembles a Markov Decision Process (MDP): the state is the set of selected samples, the action is choosing the next node to label, and the reward is the accuracy after each GNN classification epoch. Thus, it is natural to apply reinforcement learning to this task to find an optimal strategy for selecting the most appropriate samples.

2. Related Work

Active learning on GNNs has been an active research field [1] [2] [3]. In these works, researchers mainly focus on using different metrics and better feature propagation methods to select the most informative nodes.

Some other works apply reinforcement learning to GNN active learning [4] [5] [6].

3. Problem Definition

Following the setup of [6], we investigate the influence of the state representation and the effectiveness of different reinforcement learning backbones. First, we lay out the original construction of the problem, and then we describe our adjustments and experiments.

Setup

Given a graph \(G=(V,E)\), where \(E\) is the set of edges and \(V\) is the set of vertices (nodes), \(V\) is divided into \(V_{train}, V_{validation}, V_{test}\). Each node \(v \in V\) is represented by a feature vector \(x_v \in X \subseteq \mathbb{R}^d\) and is labelled as \(y_v \in Y=\{1,2,\dots,C\}\). The task is to train a Graph Neural Network using a node set \(V_{label} \subset V_{train}\) to classify the nodes in \(V_{test}\).

In the active learning setting, we iteratively expand \(V_{label}\), which is initialized to be empty. Given a budget \(B \ll |V_{train}|\), at each step \(t\) we add one node \(v \in V_{train}\backslash V_{label}^{t-1}\) to \(V_{label}\) based on the learning policy \(\pi\). The model is then trained for one more epoch using the updated \(V_{label}\). When \(|V_{label}|=B\), we train the model to convergence using \(V_{label}\).
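
As a concrete illustration, here is a minimal Python sketch of this selection loop. All callables it takes (`policy_select`, `train_one_epoch`, `train_to_convergence`, `evaluate`) are hypothetical placeholders we introduce for readability, not functions from [6].

```python
def active_learning_episode(graph, policy_select, train_one_epoch,
                            train_to_convergence, evaluate, V_train, budget):
    """Sketch of the budgeted active learning loop described above."""
    V_label = set()                                    # V_label starts empty
    while len(V_label) < budget:                       # until |V_label| = B
        candidates = [v for v in V_train if v not in V_label]
        v = policy_select(graph, V_label, candidates)  # one node per step, chosen by pi
        V_label.add(v)
        train_one_epoch(graph, V_label)                # one more epoch on the updated V_label
    train_to_convergence(graph, V_label)               # final training once the budget is spent
    return evaluate(graph)                             # e.g. micro-F1 on the test split
```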

During training, the goal is to find the optimal policy \(\pi^*\) that maximizes the micro-F1 score of the classification task. In evaluation, we apply \(\pi^*\) to target graphs, where \(V_{label}\) is again initialized to be empty and expanded iteratively using \(\pi^*\) with budget \(B\). Once \(|V_{label}|=B\), the model is trained to convergence using \(V_{label}\), and the final micro-F1 score is recorded as the performance.

Formalization as an MDP

  • \(\textbf{State}\): The state of graph \(G\) at step \(t\) is formulated as a matrix \(S^t_G\) whose rows are the state representations of the nodes. The state representation of each node is originally defined using: 1. the degree of the node, 2. the entropy of the label distribution predicted by the classification GNN, 3. the class probabilities predicted by the classification GNN, and 4. a \(\text{0-1}\) indicator variable representing whether the node has been added to \(V_{label}\). The state representation of each node is obtained by concatenating the above features; see the sketch after this list. (Refer to [6] for more details.)
  • \(\textbf{Action}\): The action at step \(t\) is to select a node \(v_t \in V_{train}\backslash V_{label}^{t-1}\) according to the action probabilities given by the policy network at step \(t\).
  • \(\textbf{Reward}\): The reward is the micro-F1 score on \(V_{validation}\) after the classification GNN has converged.
  • \(\textbf{Transition Dynamics}\): In each iteration \(t\), a node \(v_t \in V_{train}\backslash V_{label}^{t-1}\) is added to \(V_{label}^{t-1}\), yielding the updated \(V_{label}^t\). The classification GNN is then updated as it is trained on \(V_{label}^{t}\), so the state representation of each node is also updated and \(S^{t-1}_G\) transitions to \(S^{t}_G\).
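
Below is a minimal NumPy sketch of how such a state matrix can be assembled from the four node features above; the exact feature scaling used in [6] may differ.

```python
import numpy as np

def node_state_matrix(adj, probs, labeled_mask):
    """Per-node state features (degree, prediction entropy, class
    probabilities, labeled indicator), concatenated row-wise.

    adj:          (N, N) adjacency matrix
    probs:        (N, C) class probabilities from the classification GNN
    labeled_mask: (N,)   1 if the node is already in V_label, else 0
    """
    degree = adj.sum(axis=1, keepdims=True)                            # feature 1
    entropy = -(probs * np.log(probs + 1e-12)).sum(1, keepdims=True)   # feature 2
    indicator = labeled_mask.reshape(-1, 1).astype(float)              # feature 4
    return np.concatenate([degree, entropy, probs, indicator], axis=1)
```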

Policy Network

To exploit the high correlation resulting from the graph topology, [6] uses another GNN as the policy network instead of a multi-layer perceptron; its input at each step \(t\) is the state matrix \(S_G^t\). The difference is that a linear layer with output dimension 1 is appended to the policy network, which extracts a real-valued score for each node. A softmax is then applied over all candidate nodes \(V_{train}\backslash V_{label}\) to produce the action probability distribution used to add the next node to \(V_{label}\).
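
A simplified PyTorch sketch of such a policy network is shown below, using plain GCN-style propagation with a normalized adjacency matrix and a masked softmax over the candidate nodes; the architecture details (hidden size, number of layers) are our assumptions, not the exact configuration of [6].

```python
import torch
import torch.nn as nn

class GCNPolicy(nn.Module):
    """Sketch: two propagation layers, a 1-dimensional linear scoring head,
    and a softmax restricted to the candidate nodes V_train \ V_label."""

    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.lin1 = nn.Linear(state_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)     # real-valued score per node

    def forward(self, S, adj_norm, candidate_mask):
        # GCN-style propagation: H = relu(Â H W), with Â the normalized adjacency
        h = torch.relu(self.lin1(adj_norm @ S))
        h = torch.relu(self.lin2(adj_norm @ h))
        scores = self.score(h).squeeze(-1)                        # (N,)
        scores = scores.masked_fill(~candidate_mask, float("-inf"))
        return torch.softmax(scores, dim=0)   # action distribution over candidates
```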

4. Our Adjustments

We follow the framework in [6] and investigate the following questions: 1. whether a better state representation can better guide the policy network, 2. whether abundant rewards are better than sparse rewards at correcting policy training along the trajectory, and 3. whether more advanced policy gradient algorithms improve the final performance. For all of the following results, we train the policy network on the Reddit1 and Reddit2 datasets and evaluate the obtained policy network on Reddit3, Reddit4 and Reddit5. The reported metrics are all micro-F1 scores. All five Reddit datasets are the same as the ones used in [6]; they are collected from the Reddit forum, and the nodes in these five graphs belong to one of 7 classes.

State Representation

One natural question is whether a more powerful and informative state representation can better guide the policy network. To investigate this, we propose two novel metrics, along with the well-known PageRank centrality score, to enrich the state representation. The PageRank centrality score (PCS) is calculated as \(\phi_{centrality}(v_i)=\rho \sum_{j} A_{ij}\frac{\phi_{centrality}(v_j)}{\sum_k A_{jk}}+\frac{1-\rho}{N}\), where \(A\) is the adjacency matrix and \(\rho\) is the damping parameter.
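
A small power-iteration sketch of this formula is given below; the damping value 0.85 and the iteration count are our assumptions, not values from the text.

```python
import numpy as np

def pagerank_centrality(A, rho=0.85, n_iter=100):
    """Power iteration for phi_i = rho * sum_j A_ij * phi_j / sum_k A_jk + (1 - rho) / N."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    d = A.sum(axis=1)            # d_j = sum_k A_jk
    d[d == 0] = 1.0              # guard isolated nodes against division by zero
    M = A / d                    # broadcasting divides column j by d_j, so M_ij = A_ij / d_j
    phi = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        phi = rho * (M @ phi) + (1.0 - rho) / N
    return phi
```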

Feature Dissimilarity Score (FDS): To evaluate how dissimilar a candidate node is, in terms of node features, from the nodes that have already been labelled, we propose FDS, which applies the commonly used cosine similarity as the feature distance measure. Let \(FEA(v_i)\) be the feature vector of node \(v_i\); then \(\phi_{FDS}(v_{i}) = \frac{1}{\underset{l \in L_t}{\max}\,{\cos}(FEA(v_{i}), FEA(l))}\), where \(L_t\) is the set of labelled nodes at step \(t\).
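
A direct NumPy transcription of this formula might look as follows; the small epsilon terms are our addition to guard against division by zero.

```python
import numpy as np

def feature_dissimilarity_score(X, i, labeled):
    """FDS of candidate node i: inverse of its maximum cosine similarity to
    any already-labeled node. X is the (N, d) feature matrix; `labeled`
    holds the indices of the nodes in L_t."""
    xi = X[i]
    sims = [
        float(X[l] @ xi) / (np.linalg.norm(X[l]) * np.linalg.norm(xi) + 1e-12)
        for l in labeled
    ]
    return 1.0 / (max(sims) + 1e-12)
```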

Structural Dissimilarity Score (SDS): Given the adjacency matrix \(A\), the element \((A^2)_{i, j}\) gives the number of paths of length 2 from node \(i\) to node \(j\), i.e. the number of shared neighbors between node \(i\) and node \(j\). We therefore define the score \(\phi_\text{SDS}\) as \(\phi_\text{SDS} \left(v_{i}\right) = \frac{1}{\underset{l \in L_t}{\max} \ (A^2)_{i, {index}(l)}}\), where \({index}(l)\) is the index of the labelled node \(l\).
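
A corresponding sketch for SDS; here the epsilon again guards the degenerate case where no neighbors are shared (our addition).

```python
import numpy as np

def structural_dissimilarity_score(A, i, labeled):
    """SDS of candidate node i: inverse of the maximum number of shared
    neighbors, (A^2)_{i, index(l)}, over the labeled nodes l in L_t."""
    A = np.asarray(A, dtype=float)
    a2_row = A[i] @ A                        # row i of A^2, without forming the full matrix
    shared = max(a2_row[l] for l in labeled)
    return 1.0 / (shared + 1e-12)
```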

With these three newly added heuristic criteria, we intuitively obtain a more powerful state representation. We investigate the influence of this representation below.

| Dataset | Original | Better State Representation |
| --- | --- | --- |
| Reddit3 | 83.46 | 85.65 |
| Reddit4 | 84.57 | 85.10 |
| Reddit5 | 87.31 | 88.14 |

As the table shows, a more complex and informative state representation can indeed facilitate policy learning, which confirms our intuition.

Sparse Rewards vs. Abundant Rewards

[6] uses the validation accuracy after GNN convergence as the reward. Such rewards are sparse, as they are obtained only once per episode. We instead use the validation accuracy after each sample selection as the reward to train the policy network. Because these rewards have high variance, we subtract a baseline, namely the moving average of the previous validation accuracies.
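
A minimal sketch of this reward shaping is given below; the moving-average window size is our choice for illustration.

```python
from collections import deque

class MovingAverageBaseline:
    """Per-step (abundant) reward: validation accuracy after each selection,
    minus a moving average of the previous validation accuracies."""

    def __init__(self, window=10):
        self.history = deque(maxlen=window)

    def shaped_reward(self, val_accuracy):
        baseline = sum(self.history) / len(self.history) if self.history else 0.0
        self.history.append(val_accuracy)
        return val_accuracy - baseline
```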

| Dataset | Better State Representation and PPO with Sparse Rewards | Better State Representation and PPO with Abundant Rewards | Better State Representation and PPO with Abundant Rewards, No Baseline |
| --- | --- | --- | --- |
| Reddit3 | 92.47 | 92.46 | 88.24 |
| Reddit4 | 91.00 | 89.06 | 80.49 |
| Reddit5 | 91.48 | 92.08 | 87.49 |

As the results show, PPO with abundant rewards but no baseline performs worse than both PPO with abundant rewards and a baseline and PPO with sparse rewards. This occurs because the accuracy at each step is affected by the variance of the GNN training; without a baseline, this high variance disturbs the training of the policy network.

Policy Gradient Backbones

[6] uses REINFORCE only. We change the backbone to PPO [7] and advantage actor-critic [8] to see whether this affects performance.

PPO: Policy gradient algorithms are sensitive to the size of policy updates in consecutive steps; one large update might destroy the whole training trajectory. PPO overcomes this problem by introducing a trust region: the ratio between the new and old policies is included in the objective function, and a clipping operation is applied to this probability ratio so that the new policy cannot move too far from the old policy. PPO has achieved state-of-the-art performance on various reinforcement learning tasks.
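
For reference, a minimal PyTorch sketch of PPO's clipped surrogate loss (the quantity minimized per update), assuming per-step log-probabilities and advantage estimates have already been collected:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped surrogate objective of PPO [7]:
    L = -E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)], with r = pi_new / pi_old."""
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```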

Advantage Actor-Critic: Different from the vanilla actor-critic algorithm, where the critic directly estimates the Q value, in advantage actor-critic the critic is used to estimate the advantage of each state-action pair. This turns out to be effective in reducing variance, which benefits policy training.
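
A minimal sketch of the actor and critic losses is shown below, under the common formulation where the critic predicts state values and the advantage is the return minus that prediction; the exact estimator we used may differ in details such as bootstrapping.

```python
import torch
import torch.nn.functional as F

def a2c_losses(log_probs, values, returns):
    """Advantage actor-critic losses: advantage-weighted log-probability for
    the actor, mean-squared error against the returns for the critic."""
    advantages = returns - values
    actor_loss = -(log_probs * advantages.detach()).mean()   # detach: no critic gradient here
    critic_loss = F.mse_loss(values, returns)
    return actor_loss, critic_loss
```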

| Dataset | Better State Representation and PPO | Better State Representation and Advantage Actor-Critic | Better State Representation and REINFORCE |
| --- | --- | --- | --- |
| Reddit3 | 92.47 | 92.95 | 92.95 |
| Reddit4 | 91.00 | 91.53 | 89.43 |
| Reddit5 | 91.48 | 91.06 | 90.95 |

[6] uses REINFORCE as its policy gradient backbone. However, REINFORCE updates the policy parameters with Monte-Carlo estimates, which have high variance. To reduce this variance, we use a critic to estimate the action-value function and use that estimate to update the actor's policy. As the results show, advantage actor-critic performs better than REINFORCE, which suggests that the high variance of the Monte-Carlo estimates hurts policy training.

In addition, we implemented PPO as another policy gradient backbone to check whether large policy updates were causing training failures. From the results, we can see that the improvement from PPO is minor compared to the performance of advantage actor-critic. Thus, limiting the step size of policy updates does not have a large influence on classification accuracy in this task, whereas reducing variance has a larger impact.

5. Conclusion

In this work, we follow the existing framework [6] for applying policy gradients to GNN active learning and investigate the benefits of a more sophisticated state representation, abundant vs. sparse rewards, and different reinforcement learning backbones. We confirmed our intuition that a better state representation provides better policy guidance, and that abundant rewards tend to fluctuate as a result of GNN variance. We gained hands-on experience in investigating and implementing several reinforcement learning algorithms and delivered acceptable results.

References

[1] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. Active learning for graph embedding. https://arxiv.org/pdf/1705.05085.pdf

[2] Lan-Zhe Guo, Tao Han, and Yu-Feng Li. Robust Semi-Supervised Representation Learning for Graph-Structured Data. http://www.lamda.nju.edu.cn/guolz/pakdd.pdf

[3] Yuexin Wu, Yichong Xu, Aarti Singh, Yiming Yang, and Artur Dubrawski. Active learning for graph neural networks via node feature propagation. https://grlearning.github.io/papers/46.pdf

[4] Xiaowei Jia, Beiyu Lin, Jacob Zwart, Jeffrey Sadler, Alison Appling, Samantha Oliver, Jordan Read. Graph-based Reinforcement Learning for Active Learning in Real Time: An Application in Modeling River Networks. https://arxiv.org/pdf/2010.14000.pdf

[5] Yuheng Zhang, Hanghang Tong, Yinglong Xia, Yan Zhu, Yuejie Chi, and Lei Ying. Batch Active Learning with Graph Neural Networks via Multi-Agent Deep Reinforcement Learning. https://ojs.aaai.org/index.php/AAAI/article/view/20897

[6] Shengding Hu, Zheng Xiong, Meng Qu, Xingdi Yuan, Marc-Alexandre Côté, Zhiyuan Liu, and Jian Tang. Graph Policy Network for Transferable Active Learning on Graphs. https://arxiv.org/pdf/2006.13463.pdf

[7] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Proximal Policy Optimization Algorithms. https://arxiv.org/pdf/1707.06347.pdf

[8] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. https://dl.acm.org/doi/10.5555/3009657.3009806