Learning to Drop: Robust Graph Neural Network via Topological Denoising
Dongsheng Luo^{1*}, Wei Cheng^{2*}, Wenchao Yu^{2}, Bo Zong^{2}, Jingchao Ni^{2},
Haifeng Chen^{2}, Xiang Zhang^{1}
^{1}Pennsylvania State University, ^{2}NEC Labs America
{dul262,xzz89}@psu.edu, {weicheng,wyu,bzong,jni,haifeng}@neclabs.com
Abstract. Graph Neural Networks (GNNs) have been shown to be powerful tools for graph analytics. The key idea is to recursively propagate and aggregate information along the edges of the given graph. Despite their success, however, existing GNNs are usually sensitive to the quality of the input graph. Real-world graphs are often noisy and contain task-irrelevant edges, which may lead to suboptimal generalization performance in the learned GNN models. In this paper, we propose PTDNet, a parameterized topological denoising network, to improve the robustness and generalization performance of GNNs by learning to drop task-irrelevant edges. PTDNet prunes task-irrelevant edges by penalizing the number of edges in the sparsified graph with parameterized networks. To take the topology of the entire graph into consideration, nuclear norm regularization is applied to impose a low-rank constraint on the resulting sparsified graph for better generalization. PTDNet can be used as a key component in GNN models to improve their performance on various tasks, such as node classification and link prediction. Experimental studies on both synthetic and benchmark datasets show that PTDNet can improve the performance of GNNs significantly, and the performance gain becomes larger for noisier datasets.
ACM Reference Format:
Dongsheng Luo, Wei Cheng, Wenchao Yu, Bo Zong, Jingchao Ni, Haifeng Chen, and Xiang Zhang. 2021. Learning to Drop: Robust Graph Neural Network via Topological Denoising. In WSDM ’21: International Conference on Web Search and Data Mining, March 08–12, 2021, Jerusalem, Israel.
1 Introduction
In recent years, we have witnessed a dramatic increase in our ability to extract and collect data from the physical world. In many applications, data with complex structures are connected through their interactions and are naturally represented as graphs [35, 42, 44]. Graphs are powerful data representations but are challenging to work with, because they require modeling both node feature information and rich relational information among nodes [5, 17, 40, 51]. To tackle this challenge, various Graph Neural Networks (GNNs) have been proposed to aggregate information from both the graph topology and node features [17, 25, 43, 49, 53]. GNNs model node features as messages and propagate them along the edges of the input graph. During this process, GNNs compute the representation vector of a node by recursively aggregating and transforming the representation vectors of its neighboring nodes. Such methods have achieved state-of-the-art performance in various tasks, including node classification and link prediction [54, 57].
Despite their success, GNNs are vulnerable to the quality of the given graph due to their recursive aggregation scheme. It is natural to ask: is it necessary to aggregate all neighboring nodes? If not, is there a principled way to decide which neighboring nodes should be excluded? In many real-world applications, graph data exhibit complex topological patterns. Recent works [38, 55] have shown that GNNs can be heavily over-smoothed and that edges can often be pruned without loss of accuracy. Besides, GNNs easily aggregate task-irrelevant information, leading to overfitting that weakens their generalization ability. Specifically, from the local perspective, a node might be linked to other nodes through task-specific "noisy" edges. Aggregating information from these nodes impairs the quality of the node embedding and leads to incorrect predictions in the downstream task. From the global view, nodes located at the boundary of clusters are connected to nodes from multiple communities. Overwhelming information collected from their neighbors dilutes the true underlying patterns.
As a motivating example, we consider the benchmark dataset Cora [40]. We denote the edges connecting nodes with the same label as positive edges and the others as negative edges. Table 1 shows the statistics of the different edge types. It is reasonable to expect that passing messages through positive edges leads to high-quality node representations, while information aggregated along negative edges impairs the performance of GNNs [19]. We adopt the Graph Convolutional Network (GCN) [25], a representative GNN, to verify this intuition. We randomly delete some positive and negative edges and run GCN on the resulting graphs. As shown in Fig. 1, the performance of GCN increases as more negative edges are removed.
Topological denoising is a promising way to address the above challenge by removing "noisy" edges [11, 45]. By denoising the input graph, we can prune away task-irrelevant edges to avoid aggregating unnecessary information in GNNs. It can also improve robustness and alleviate the over-smoothing problem inherent in GNNs [38]. The idea of topological denoising is not new. In fact, this line of thinking motivated GAT [43], which aggregates neighboring nodes with weights from an attention mechanism and thus alleviates the problem to some extent. Other existing methods extract smaller subgraphs from the given graphs to preserve predefined properties, or randomly remove/sample edges during the training process to prevent GNNs from over-smoothing [17, 38, 41, 46]. However, under unsupervised settings, the subgraphs sampled by these approaches may be suboptimal for downstream tasks and lack persuasive rationales to explain the outcomes of the model. Instead, the task-irrelevant "noisy" edges should be specific to the downstream objective. Besides, in real-life graphs, node contents and graph topology provide complementary information, and the denoising process should take both into consideration, which is overlooked by existing methods.
In this paper, we propose a Parameterized Topological Denoising network (PTDNet) to enhance the performance of GNNs. We use deep neural networks, taking both structural and content information as inputs, to learn to drop task-irrelevant edges in a data-driven way. PTDNet prunes graph edges by penalizing the number of edges in the sparsified graph with parameterized networks [29, 50]. The denoised graphs are then fed into GNNs for robust learning. The introduced sparsity in neighborhood aggregation has a variety of merits: 1) a sparse aggregation is less complicated and hence generalizes well [15]; 2) it facilitates interpretability and helps infer task-relevant neighbors. Considering the combinatorial nature of the denoising process, we relax the discrete constraint with continuous distributions that can be optimized efficiently with backpropagation, enabling PTDNet to be compatible with various GNNs in both transductive and inductive settings, including GCN [25], Graph Attention Network [43], GraphSage [17], etc. In PTDNet, the denoising networks and the GNN are jointly optimized in an end-to-end fashion. Different from conventional methods that remove edges randomly or based on predefined rules, the denoising process of PTDNet is guided by the supervision of the downstream objective in the training phase.
To further account for the global topology, nuclear norm regularization is applied to impose a low-rank constraint on the resulting sparsified graph for better generalization. Due to the discontinuous nature of the rank minimization problem, PTDNet smooths the constraint with the nuclear norm, which is the tightest convex envelope of the rank [37]. This regularization denoises the input graph from the global topology perspective by removing edges connecting multiple communities, improving the generalization ability and robustness of GNNs [11]. Experimental results on both synthetic and benchmark datasets demonstrate that PTDNet can effectively enhance the performance and robustness of GNNs.
2 Related Work
Graph Neural Networks. GNNs are powerful tools for investigating graph data with node contents. GNN models utilize the message passing mechanism to encode both graph structural information and node features into vector representations. These vectors are then used in node-level or graph-level downstream tasks. GNNs were initially proposed in [16] and extended in [39]. These methods learn node representations by iteratively aggregating neighbor information until reaching a stable state. Inspired by the success of convolutional neural networks (CNNs) in computer vision, graph convolutional networks in the graph spectral domain were proposed based upon the graph Fourier transform [4], and multiple extensions were further proposed [9, 25, 28, 31, 39, 43]. The expressive power of GNNs was analyzed in [49]. DropEdge and PairNorm investigated the over-smoothing problem caused by stacking multiple GNN layers [38, 55].
Graph Sparsification and Sampling. Conventional graph sparsification approximates a large input graph with a sparse subgraph to enable efficient computation while preserving certain properties. Different notions have been extensively studied, including pairwise distances [7], sizes of all cuts [2], node degree distributions [10], and spectral properties [1, 18]. These methods remove edges based only on structural information, which limits their power when combined with GNNs. Besides, without supervised feedback from the downstream task, these approaches may generate subgraphs with suboptimal structural properties. NeuralSparse [56] learns k-neighbor subgraphs for robust graph representation learning by selecting at most k edges for each node. The k-neighbor assumption, however, limits its learning power and may lead to suboptimal generalization performance.
Recently, graph sampling has been investigated in GNNs for fast computation and better generalization capacity, including neighbor-level [17], node-level [6, 20, 52], and edge-level sampling methods [38]. Unlike these methods that randomly sample edges in the training phase, PTDNet utilizes parameterized networks to actively remove task-specific noisy edges. With supervised guidance from the downstream objective, the generated subgraphs benefit GNNs in not only robustness but also accuracy and interpretability. Besides, PTDNet has better generalization capacity, as the parameterized networks can be used for inductive inference.
3 Preliminaries
Notations. In general, we use lowercase, bold uppercase, and bold lowercase letters for scalars, matrices, and vectors, respectively. For example, we use Z to denote a matrix, whose (i, j)-th entry is denoted by z_{ij}, a lowercase character with subscripts. Let G = (V, E) represent the input graph with n nodes, where V and E stand for its node and edge sets, respectively. The adjacency matrix of G is denoted by A ∈ R^{n×n}. Node features are denoted by the matrix X ∈ R^{n×m}, with m the dimensionality of node features. We use Y to denote the labels of the downstream task. For instance, in node classification, Y ∈ R^{n×c} represents the node labels, where c is the number of classes.
GNN layer. Applying a GNN layer consists of a propagation step and an output step [57]. At the propagation step, the aggregator first computes the message for each edge. For an edge (v_i, v_j), the aggregator takes the representations of v_i and v_j in the previous layer as inputs, denoted by h_i^{t−1} and h_j^{t−1}, respectively. Then, the aggregator collects messages from the local neighborhood of each node v_i. At the output step, the updater computes the new hidden representation of each node v, denoted by h_v^t.
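The two steps above can be sketched with a minimal mean-aggregation layer in NumPy. This is an illustrative simplification, not the exact aggregator/updater of any particular GNN; the ReLU updater and the identity weight matrix are assumptions made for the example:

```python
import numpy as np

def gnn_layer(A, H, W):
    """One message-passing step: mean-aggregate neighbor states (propagation),
    then apply a linear-plus-ReLU updater (output step)."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0               # isolated nodes keep a zero message
    agg = (A @ H) / deg               # mean of h_j^{t-1} over neighbors j
    return np.maximum(agg @ W, 0.0)   # updater producing h_v^t

# Toy 3-node path graph 0-1-2 with 2-dimensional states.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
H1 = gnn_layer(A, H, np.eye(2))       # node 1 averages nodes 0 and 2
```

After one step, each node's state is the mean of its neighbors' previous states, which is the aggregation behavior the propagation step describes.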
GNN models adopt message passing mechanisms to propagate and aggregate information along the input graph to learn node representations. Their performance can be heavily affected by the quality of the input graph: messages aggregated along "noisy" edges decrease the quality of node embeddings, and overwhelming information from multiple communities puts GNNs at risk of over-smoothing, especially when multiple GNN layers are stacked [30, 38, 55]. Existing methods either utilize graph sparsification strategies to extract subgraphs or randomly sample graphs to enhance the robustness of GNNs. However, these are conducted in an unsupervised way, limiting their ability to filter out task-specific noisy edges.
4 Methodology
The core idea of PTDNet is to actively filter out task-specific noisy edges in the input graph with a parameterized network. It consists of the denoising network and general GNNs, and it can be applied in both inductive and transductive settings. We first give an overview of PTDNet in Sec. 4.1, followed by details of the denoising network in Sec. 4.2. To enhance the generalization ability of PTDNet, we further introduce a low-rank constraint on the resulting graphs and provide a smooth relaxation to achieve an end-to-end model.
4.1 The overall architecture. The architecture of PTDNet is shown in Fig. 2(a). It consists of two major components: the denoising network and the GNNs. The denoising network is a multi-layer network that samples a subgraph from a learned distribution over edges. PTDNet is compatible with most existing GNNs, such as GCN [25], GraphSage [17], GAT [43], GIN [49], etc. With relaxations, the denoising network is differentiable and can be jointly optimized with the GNNs, guided by supervised downstream signals.
4.2 The denoising network.
4.2.1 Graph edge sparsification. The goal of the denoising network is to generate a subgraph that filters out task-irrelevant edges for the GNN layers. For the l-th GNN layer, we introduce a binary matrix Z^l ∈ {0,1}^{|V|×|V|}, with z^l_{u,v} denoting whether the edge between nodes u and v is present (0 indicates a noisy edge). Formally, the adjacency matrix of the resulting subgraph is A^l = A ⊙ Z^l, where ⊙ is the element-wise product. One way to reduce noisy edges, with the least assumptions about A^l, is to directly penalize the number of nonzero entries in Z^l across layers:
R_c = ∑_{l=1}^{L} ∥Z^l∥_0 = ∑_{l=1}^{L} ∑_{(u,v)∈E} I[z^l_{u,v} ≠ 0],  (1)
where I[·] is an indicator function with I[True] = 1 and I[False] = 0, and ∥·∥_0 is the ℓ_0 norm. There are 2^{|E|} possible states of Z^l. Because of its non-differentiability and combinatorial nature, optimizing this penalty directly is computationally intractable. Therefore, we consider each binary entry z^l_{u,v} to be drawn from a Bernoulli distribution parameterized by π^l_{u,v}, i.e., z^l_{u,v} ~ Bern(π^l_{u,v}). The matrix of π^l_{u,v}'s is denoted by Π^l. Then, penalizing the nonzero entries in Z^l, i.e., the number of edges being used, can be reformulated as regularizing ∑_{(u,v)∈E} π^l_{u,v} [29].
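This reformulation can be checked numerically: under z^l_{u,v} ~ Bern(π^l_{u,v}), the expected ℓ_0 penalty equals the sum of the Bernoulli parameters. A small Monte Carlo sketch (the π values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.9, 0.1, 0.5, 0.2])          # hypothetical edge-keep probabilities
Z = rng.random((200_000, pi.size)) < pi      # z_{u,v} ~ Bern(pi_{u,v}), many draws
expected_l0 = Z.sum(axis=1).mean()           # Monte Carlo estimate of E[||Z||_0]
# E[||Z||_0] = sum(pi): penalizing the expected edge count is the same as
# regularizing the sum of the Bernoulli parameters.
```

The estimate converges to `pi.sum()`, matching the reformulated regularizer.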
Since π^l_{u,v} is optimized jointly with the downstream task, it describes the task-specific quality of the edge (u, v). A small value of π^l_{u,v} indicates that the edge (u, v) is more likely to be noise and should be assigned a small weight, or even removed, in the following GNN. Although the reformulated regularization is continuous, the adjacency matrix of the resulting graph is still generated by a binary matrix Z^l. The expected cost of the downstream task can be modeled as L({Π^l}_{l=1}^{L}) = E_{Z^1~p(Π^1),…,Z^L~p(Π^L)}[f({Z^l}_{l=1}^{L}, X)]. To minimize the expected cost via gradient descent, we need to estimate the gradient ∇_{Π^l} E_{Z^1~p(Π^1),…,Z^L~p(Π^L)}[f({Z^l}_{l=1}^{L}, X)] for l ∈ {1, 2, …, L}. Existing methods adopt various estimators to approximate this gradient, including the score function [48], straight-through [3], etc. However, these methods suffer from either high variance or biased gradients [33]. In addition, to make PTDNet suitable for the inductive setting and to enhance generalization, a parameterized method for modeling Z^l should be adopted.
4.2.2 Continuous relaxation with parameterized networks. To efficiently optimize subgraphs with gradient methods, we adopt the reparameterization trick [22] and relax the binary entries z^l_{u,v} from being drawn from a Bernoulli distribution to being a deterministic function g of parameters α^l_{u,v} ∈ R and an independent random variable ε^l:
z^l_{u,v} = g(α^l_{u,v}, ε^l).  (2)
To enable the inductive setting, we should not only identify which edges to filter out but also why. To learn to drop, for each edge (u, v), we adopt parameterized networks to model the relationship between the task-specific quality π^l_{u,v} and the node information, including node contents and topological structure. In the training phase, we jointly optimize the denoising networks and the GNNs. In the testing phase, input graphs can also be denoised with the learned denoising networks. Since we only need to compute a subgraph of the input graph, the time complexity of the denoising network in the inference phase is linear in the number of edges, O(|E|).
Following [43], we adopt deep neural networks to learn the parameter α^l_{u,v} that controls whether to remove the edge (u, v). Without loss of generality, we focus on a node u in the training graph and let N_u be its neighbors. For the l-th GNN layer, we calculate α^l_{u,v} for node u and each v ∈ N_u as α^l_{u,v} = f^l_{θ^l}(h^l_u, h^l_v), where f^l_{θ^l} is an MLP parameterized by θ^l. To obtain z^l_{u,v}, we utilize the concrete distribution along with a hard sigmoid function [29, 32]. First, we draw s^l_{u,v} from a binary concrete distribution with α^l_{u,v} parameterizing the location [22, 32]. Formally,
s^l_{u,v} = σ((log ε − log(1 − ε) + α^l_{u,v}) / τ),  ε ~ Uniform(0, 1),  (3)
where τ ∈ R^+ is the temperature and σ(x) = 1/(1 + e^{−x}) is the sigmoid function. With τ > 0, the function is smooth with a well-defined gradient ∂s^l_{u,v}/∂α^l_{u,v}, enabling efficient optimization of the parameterized denoising network.
Since the binary concrete distribution has support (0, 1), to encourage the weights of task-specific noisy edges to be exactly zero, we first stretch the range to (γ, ζ), with γ < 0 and ζ > 1 [29]. Then, we compute z^l_{u,v} by clipping negative values to 0 and values larger than 1 to 1:
s̄^l_{u,v} = s^l_{u,v} (ζ − γ) + γ,  z^l_{u,v} = min(1, max(s̄^l_{u,v}, 0)).  (4)
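Eqs. (3)-(4) can be sketched end-to-end in NumPy. This is a minimal, hedged version of the stretched binary-concrete gate; the values of τ, γ, ζ, and α are illustrative, not the ones used by the paper:

```python
import numpy as np

def hard_concrete(alpha, tau=0.5, gamma=-0.1, zeta=1.1, rng=None):
    """Sample edge weights z in [0, 1]: binary concrete draw (Eq. (3))
    followed by stretching and hard clipping (Eq. (4))."""
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(alpha))
    s = 1.0 / (1.0 + np.exp(-(np.log(eps) - np.log1p(-eps) + alpha) / tau))
    s_bar = s * (zeta - gamma) + gamma   # stretch the support to (gamma, zeta)
    return np.clip(s_bar, 0.0, 1.0)      # clipping makes exact 0s (and 1s) possible

rng = np.random.default_rng(0)
z_keep = hard_concrete(np.full(10_000, 4.0), rng=rng)    # large alpha: keep edges
z_drop = hard_concrete(np.full(10_000, -4.0), rng=rng)   # small alpha: drop edges
```

Large α pushes the gate toward 1, small α toward 0, and the clip yields exact zeros, which is what allows edges to be fully removed rather than merely down-weighted.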
With the above formulation, the constraint on the number of nonzero entries in Z^l in Eq. (1) can be reformulated as
R_c = ∑_{l=1}^{L} ∑_{(u,v)∈E} (1 − P_{s̄^l_{u,v}}(0 | θ^l)),  (5)
where P_{s̄^l_{u,v}}(0 | θ^l) is the cumulative distribution function (CDF) of s̄^l_{u,v} evaluated at 0.
As shown in [32], the density of s^l_{u,v} is
p(s^l_{u,v} = s | θ^l) = τ e^{α^l_{u,v}} s^{−τ−1} (1 − s)^{−τ−1} / (e^{α^l_{u,v}} s^{−τ} + (1 − s)^{−τ})^2.  (6)
The CDF of the variable s^l_{u,v} is
P_{s^l_{u,v}}(s | θ^l) = σ(τ log(s / (1 − s)) − α^l_{u,v}).  (7)
Since the function s̄^l_{u,v} = t(s^l_{u,v}) = s^l_{u,v}(ζ − γ) + γ in Eq. (4) is monotonic, the probability density function of s̄^l_{u,v} is
p_{s̄^l_{u,v}}(s̄ | θ^l) = (1 / (ζ − γ)) · p_{s^l_{u,v}}((s̄ − γ) / (ζ − γ) | θ^l).  (8)
Similarly, we have the CDF of s̄^l_{u,v}:
P_{s̄^l_{u,v}}(s̄ | θ^l) = σ(τ log((s̄ − γ) / (ζ − s̄)) − α^l_{u,v}).  (9)
By setting s̄ = 0, we have the probability that an edge weight is clipped to exactly zero:
P_{s̄^l_{u,v}}(0 | θ^l) = σ(τ log(−γ / ζ) − α^l_{u,v}).  (10)
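The closed form above can be sanity-checked against Monte Carlo samples drawn with the sampler of Eqs. (3)-(4). The parameter values below are illustrative, not ones from the paper:

```python
import numpy as np

tau, gamma, zeta, alpha = 0.5, -0.1, 1.1, 0.3   # illustrative gate parameters

# Closed form from Eq. (10): probability that the stretched sample falls below 0
p_closed = 1.0 / (1.0 + np.exp(-(tau * np.log(-gamma / zeta) - alpha)))

# Monte Carlo check using the sampler of Eqs. (3)-(4)
rng = np.random.default_rng(1)
eps = rng.uniform(1e-9, 1.0 - 1e-9, size=500_000)
s = 1.0 / (1.0 + np.exp(-(np.log(eps) - np.log1p(-eps) + alpha) / tau))
s_bar = s * (zeta - gamma) + gamma
mc_closed = (s_bar <= 0.0).mean()
```

The empirical fraction of clipped-to-zero gates matches the analytic probability, which is what lets Eq. (5) penalize the expected edge count without sampling.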
Algorithm 1 summarizes the overall training of PTDNet.
4.3 The low-rank constraint. In the previous section, we introduced parameterized networks to remove task-specific noisy edges from the local neighborhood perspective. In real-life graph data, nodes from multiple classes can be divided into different clusters, and, intuitively, nodes from different topological communities are more likely to have different labels [13]. Hence, edges connecting multiple communities are highly likely to be noise for GNNs. Based upon this intuition, we further introduce a low-rank constraint on the adjacency matrix of the resulting subgraph to enhance generalization capacity and robustness, since the rank of the adjacency matrix reflects the number of clusters. This regularization denoises the input graph from the global topology perspective by encouraging the denoising networks to remove edges connecting multiple communities, such that the resulting subgraphs have dense connections within communities and sparse connections between them [24]. Recent work also shows that graphs with low rank are more robust to network attacks [11, 23]. Formally, the straightforward regularizer R_lr for the low-rank constraint of PTDNet is ∑_{l=1}^{L} Rank(A^l), where A^l is the adjacency matrix for the l-th GNN layer. It has been shown in previous studies that the matrix rank minimization problem (RMP) is NP-hard [8]. We relax this intractable problem with the nuclear norm, the convex surrogate for the RMP [14]. The nuclear norm of a matrix is defined as the sum of its singular values; it is a convex function that can be optimized efficiently. Besides, previous studies have shown that, in practice, nuclear norm constraints produce very low-rank solutions [11, 14, 37]. With nuclear norm minimization, the regularizer is
R_lr = ∑_{l=1}^{L} ∥A^l∥_* = ∑_{l=1}^{L} ∑_i λ^l_i,  (11)
where λ^l_i is the i-th largest singular value of the graph adjacency matrix A^l.
Singular value decomposition (SVD) is required to optimize the nuclear norm regularization. However, SVD may lead to unstable results during backpropagation. As shown in [21], the partial derivatives of the nuclear norm require computing a matrix M^l with elements
M^l_{ij} = 1 / ((λ^l_i)^2 − (λ^l_j)^2) if i ≠ j, and M^l_{ij} = 0 otherwise.  (12)
When (λ^l_i)^2 − (λ^l_j)^2 is small, the partial derivatives become very large, leading to arithmetic overflow. Besides, gradient-based optimization through the SVD is time-consuming. The power iteration (PI) method with a deflation procedure is one way to address this problem [34, 36, 47]. PI approximately computes the largest eigenvalue and the dominant eigenvector of the matrix (A^l)^* A^l with an iterative procedure starting from a randomly initialized vector, where (A^l)^* stands for the conjugate transpose of A^l. The largest singular value of A^l is then the square root of the largest eigenvalue. To calculate the other eigenvectors, the deflation procedure iteratively removes the projection of the input matrix onto the vectors already found. However, PI may output inaccurate approximations if two eigenvalues are close to each other, and the situation becomes worse for eigenvalues near zero. Besides, with randomly initialized vectors, PI may need many iterations to reach a precise approximation.
To address this problem, we combine SVD and PI [47] and further relax the nuclear norm to the Ky Fan K-norm [12], i.e., the sum of the K largest singular values, with 1 ≤ K ≪ |V|. Fig. 2(b) shows the forward pass and backpropagation of the nuclear norm. In the forward pass, as shown in Algorithm 2, SVD is used to calculate the singular values and the left and right singular vectors, and the nuclear norm is obtained as the regularization loss. To minimize the nuclear norm, we utilize power iteration to compute the top K singular values, denoted by λ̂^l_1, …, λ̂^l_K. Note that the PI process does not update the singular vectors or singular values; it only serves to compute the gradients during backpropagation, shown with red dotted lines in Fig. 2(b). We estimate the nuclear norm with R̂_lr = ∑_{l=1}^{L} ∑_{i=1}^{K} λ̂^l_i. R̂_lr is a lower bound of R_lr with gap
R_lr − R̂_lr = ∑_{l=1}^{L} ∑_{i=K+1}^{|V|} λ^l_i ≤ ((|V| − K) / K) · R̂_lr.  (13)
It follows that (|V|/K) · R̂_lr is an upper bound of R_lr. We dismiss the constant coefficient and minimize R̂_lr as the low-rank constraint.
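The SVD/PI split can be mimicked with a small NumPy sketch: SVD yields the exact top-K singular values, while power iteration on (A^l)^* A^l with deflation, the route used here for gradients, recovers close approximations. The matrix size, seed, and iteration count below are arbitrary choices for illustration:

```python
import numpy as np

def topk_singular_pi(A, K, iters=300, seed=0):
    """Approximate the top-K singular values of A via power iteration on
    A^T A, removing each recovered component with a deflation step."""
    rng = np.random.default_rng(seed)
    M = A.T @ A                        # eigenvalues of A^T A = squared singular values
    vals = []
    for _ in range(K):
        v = rng.standard_normal(M.shape[0])
        for _ in range(iters):         # power iteration toward the dominant eigenvector
            v = M @ v
            v /= np.linalg.norm(v)
        lam = v @ M @ v                # Rayleigh quotient: largest remaining eigenvalue
        vals.append(np.sqrt(max(lam, 0.0)))
        M = M - lam * np.outer(v, v)   # deflation: project out the found component
    return np.array(vals)

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 30))
all_sv = np.linalg.svd(A, compute_uv=False)
exact_top3 = all_sv[:3]                # Ky Fan 3-norm terms from a full SVD
approx_top3 = topk_singular_pi(A, 3)
# The Ky Fan K-norm lower-bounds the nuclear norm (the sum of all singular values).
```

The PI estimates agree closely with the SVD values, and their sum stays below the full nuclear norm, matching the lower-bound relation in Eq. (13).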
5 Experiments
In this section, we empirically evaluate the robustness and effectiveness of PTDNet on both synthetic and benchmark datasets. First, we apply PTDNet to popular GNN models for node classification on benchmark datasets. Second, we evaluate the robustness of PTDNet by injecting additional noise. Moreover, we provide insight into the denoising process by inspecting the edges removed by PTDNet. We also conduct comprehensive experiments to uncover further insights into PTDNet, including empirically demonstrating the effects of the regularizers, a parameter study, an analysis of the over-smoothing problem inherent in GNNs, and an application of PTDNet to another downstream task, i.e., link prediction.
Table 2: Statistics of the datasets.
Dataset  Cora  Citeseer  Pubmed  PPI
Nodes  2,708  3,327  19,717  56,944
Edges  5,429  4,732  44,338  818,716
Features  1,433  3,703  500  50
Classes  7  6  3  121
Training  140  120  100  44,906
Validation  500  500  500  6,514
Test  1,000  1,000  1,000  5,524
5.1 Experimental setup. Datasets. Four benchmark datasets are adopted in our experiments. Cora, Citeseer, and Pubmed are citation graphs where each node is a document and edges describe citation relationships. Each document is assigned a unique label based on its topic. Node features are bag-of-words representations of the documents. We follow the standard train/val/test splits in [25, 43] with very scarce labeled nodes, which differ from the fully-supervised setting in DropEdge [38]. PPI contains graphs describing protein-protein interactions in different human tissues. Positional gene sets, motif gene sets, and immunological signatures are used as node features, and gene ontology sets are used as labels. The statistics of these datasets are listed in Table 2. Implementations & metrics. We consider three representative GNNs as backbones: GCN [25], GraphSage [17], and GAT [43]. Note that our model is a general framework that is compatible with diverse GNN models; recent sophisticated models can also be combined with our framework to improve their performance and robustness. Achieving state-of-the-art performance with a complex architecture is not the main research focus of this paper. We compare with the most recent state-of-the-art sampling and sparsification methods, DropEdge [38] and NeuralSparse [56]. For GraphSage, we use mean aggregation. We follow the experimental setting in [38] to perform a random hyperparameter search for each model. For each setting, we run 10 trials and report the average results. Parameters are tuned via cross-validation. We also include a variant of PTDNet without the low-rank constraint as an ablation study. For the single-label classification datasets, including Cora, Citeseer, and Pubmed, we evaluate performance with accuracy [25]. For PPI, we evaluate with micro-F1 scores [17].
All experiments are conducted on a Linux machine with 8 NVIDIA Tesla V100 GPUs, each with 32GB memory. CUDA version is 9.0 and Driver Version is 384.183. All methods are implemented with Tensorflow 1.12.
Table 3: Node classification performance (accuracy for Cora, Citeseer, and Pubmed; micro-F1 for PPI).
Backbone  Method  Cora  Citeseer  Pubmed  PPI
GCN  Basic  0.811 ±0.015  0.703 ±0.012  0.790 ±0.020  0.660 ±0.024 
DropEdge  0.809 ±0.035  0.722 ±0.032  0.785 ±0.043  0.606 ±0.041  
NeuralSparse  0.821 ±0.014  0.715 ±0.014  0.788 ±0.018  0.651 ±0.014  
PTDNetwl  0.824 ±0.018  0.717 ±0.170  0.791 ±0.012  0.752 ±0.017  
PTDNet  0.828 ±0.026  0.727 ±0.018  0.798 ±0.024  0.803 ±0.008  
GraphSage  Basic  0.792 ±0.027  0.676 ±0.023  0.767 ±0.020  0.618 ±0.014 
DropEdge  0.787 ±0.023  0.670 ±0.031  0.748 ±0.026  0.610 ±0.035  
NeuralSparse  0.793 ±0.021  0.674 ±0.011  0.751 ±0.021  0.626 ±0.023  
PTDNetwl  0.794 ±0.026  0.678 ±0.022  0.770 ±0.024  0.645 ±0.020  
PTDNet  0.803 ±0.019  0.679 ±0.018  0.771 ±0.010  0.648 ±0.025  
GAT  Basic  0.830 ±0.007  0.721 ±0.009  0.790 ±0.008  0.973 ±0.012 
DropEdge  0.832 ±0.040  0.709 ±0.020  0.779 ±0.019  0.850 ±0.038  
NeuralSparse  0.834 ±0.015  0.724 ±0.026  0.780 ±0.017  0.921 ±0.018  
PTDNetwl  0.837 ±0.022  0.723 ±0.014  0.792 ±0.014  0.978 ±0.018  
PTDNet  0.844 ±0.023  0.737 ±0.031  0.793 ±0.015  0.980 ±0.022  
5.3 Robustness evaluation. In this part, we evaluate the robustness of PTDNet by manually injecting noisy edges. We use the Cora dataset and randomly connect N pairs of previously unlinked nodes, with N ranging from 1,000 to 20,000. We compare the proposed method to the baselines with all three backbones. Performances are shown in Fig. 3. We have the following observations. 1) PTDNet consistently outperforms DropEdge, NeuralSparse, and the basic backbones for various numbers of noisy edges, demonstrating the robustness of PTDNet. 2) DropEdge randomly samples a subgraph for the GNN layers. In most cases, DropEdge reports worse performance than the original backbones, which shows that the random sampling strategy used in DropEdge is vulnerable to noisy edges. 3) NeuralSparse selects edges guided by task signals and thus achieves better results than the basic backbones. However, it selects at most k edges for each node, which may lead to suboptimal performance. 4) The margins between PTDNet and the basic backbones become larger when more noise is injected. Specifically, with 20,000 noisy edges, PTDNet relatively improves the accuracy by 37.37% for GCN, 13.4% for GraphSage, and 16.1% for GAT.
5.4 On the denoising process. In this section, we analyze the denoising process of PTDNet on a controllable synthetic dataset with 5 labels and 30 features per node. We first randomly sample five 30-dimensional vectors as the centroids, one for each label. Then, for each label, we sample nodes from a Gaussian distribution with the centroid as the mean. The variance, which controls the quality of the content information, is set to 80. The number of nodes for each label is drawn from another Gaussian distribution N(200, 25). To validate GNNs on fusing node content and topology, we build a graph containing information complementary to the node features. Specifically, we use the distance between a node and the centroid of its label as the metric to evaluate the quality of the node's features. The probability that it connects to another node with a different label is positively proportional to the feature quality. The resulting graph contains 1,018 nodes and 4,945 edges. We randomly select 60/20/20% of the nodes for training/validation/testing, respectively.
We use GCN as the backbone as an example. The denoising process of the first PTDNet layer is shown in Fig. 4. Red lines represent the mean weight, i.e., z in Sec. 4.2, of positive edges (edges connecting nodes with the same label), and blue dotted lines are for negative edges. Figs. 4(a) and 4(b) count the edges linked to a training/testing node, respectively. We also show the results of DropEdge as the case of random selection. These figures demonstrate that DropEdge, an unparameterized method, cannot actively drop task-irrelevant edges, while PTDNet can detect negative edges and remove them or assign them lower weights. The denoising process of PTDNet leads to higher accuracy with more iterations, as shown in Fig. 4(c). In addition, the consistent performance of PTDNet on testing nodes shows that our parameterized method can still learn to drop negative edges in the testing phase, demonstrating the generalization capacity of PTDNet. Besides, we plot the degree (volume) distributions of the input graph and of the subgraphs sampled by DropEdge and PTDNet in Fig. 4(d). We observe that both PTDNet and DropEdge keep the distributional properties of the input graph.
5.5 Effects of regularizers. In this part, we adopt GCN as the backbone to analyze the effects of the regularizers. We first show the accuracy of PTDNet w.r.t. the coefficients of the regularizers in Fig. 5. For each choice, we fix that coefficient value and use a genetic algorithm to search the other hyperparameters; the best performances are reported. In general, the performance first increases and then drops as the coefficients increase. Since the benchmark datasets are relatively clean, we instead synthesize datasets with controllable properties to better show the effects of the two proposed regularizers, R_c and R_lr. We introduce two hyperparameters, β_1 and β_2, for the regularizers R_c and R_lr, respectively. These two hyperparameters affect the ratio of edges to be removed. We first dismiss the low-rank constraint by setting β_2 = 0. With different choices of β_1 (0, 0.05, 0.075, 0.9), we show the mean weights (i.e., z) of edges over training iterations in Fig. 6(a). The figure shows that with a larger coefficient for R_c, PTDNet produces a sparser subgraph. Similar observations can be made in Fig. 6(b), where β_1 = 0 and only the low-rank constraint is considered.
To demonstrate the effects of the regularizers on the generalization capacity of PTDNet, we synthesize four datasets with various topological properties. The percentages of positive edges in these four datasets range from 0.5 to 0.85. We tune β_1 and β_2 separately by setting the other to 0. The best choices of β_1 and β_2 for these four datasets are shown in Table 4. The table shows that for datasets with poor topological quality, the regularizers should be assigned higher weights, so that PTDNet can denoise more task-irrelevant edges. On the other hand, for datasets with good structure, i.e., where the ratio of positive edges is over 0.85, β_1 and β_2 should take relatively small values to keep more edges.
Table 5: Ratio of cross-community edges in the original graph and in PTDNet-sparsified subgraphs.
Setting  original  β_{1}=0.05, β_{2}=0  β_{1}=0, β_{2}=0.05  
Ratio  0.211  0.206  0.189  0.190
The low-rank constraint R_{𝑙𝑤} is introduced to enhance the generalization of PTDNet by constraining edges that connect nodes from different communities. We use a synthetic dataset to show the influence of R_{𝑙𝑤}. We adopt spectral clustering to group nodes into 5 communities. As shown in Table 5, the ratio of cross-community edges in the original graph is 0.211. We first apply PTDNet with β_{1} = 0.05, β_{2} = 0 to the dataset; the ratio in the subgraph generated by the first layer is 0.206. With the low-rank constraint, setting β_{1} = 0, β_{2} = 0.05, the ratio drops to 0.189, showing the effectiveness of the low-rank constraint in removing cross-community edges.
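The cross-community ratio in Table 5 is straightforward to compute once community assignments exist; a small sketch, assuming a hypothetical `communities` array holding the spectral-clustering labels:

```python
import numpy as np

def cross_community_ratio(edge_index, communities):
    """Fraction of edges whose two endpoints lie in different communities."""
    src, dst = edge_index
    return (communities[src] != communities[dst]).mean()

# toy graph: 5 edges, 2 of which cross the community boundary
edge_index = np.array([[0, 1, 2, 3, 0], [1, 2, 3, 4, 4]])
communities = np.array([0, 0, 0, 1, 1])  # e.g., from spectral clustering
ratio = cross_community_ratio(edge_index, communities)
```

A drop in this ratio after sparsification (0.211 → 0.189 in Table 5) is the signal that R_{𝑙𝑤} is pruning cross-community edges.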
5.6 Impacts of approximation factor 𝐾. In Sec. ??, 𝐾 is the approximation factor in the low-rank constraint. In this part, we use the synthetic dataset from Sec. 5.4. We do not use early stopping and fix the number of epochs to 300. We vary 𝐾 from 1 to 32 and adopt a 2-layer GCN with 256 hidden units as the backbone. The accuracy and running time are shown in Fig. 7. From the figure, we can see that accuracy increases with larger 𝐾, which is consistent with Eq. 13: a larger 𝐾 gives a tighter bound. At the same time, it leads to longer running time. Moreover, PTDNet achieves relatively high performance even when 𝐾 is small.
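The tighter-bound behavior can be seen from the Ky Fan K-norm view: the sum of the 𝐾 largest singular values lower-bounds the nuclear norm, monotonically in 𝐾. A simple numpy illustration (PTDNet's actual estimator is built for differentiable training and may be computed differently):

```python
import numpy as np

def topk_nuclear_norm(A, K):
    """Approximate the nuclear norm ||A||_* by the sum of the K largest
    singular values (the Ky Fan K-norm); larger K gives a tighter lower
    bound at a higher computational cost."""
    s = np.linalg.svd(A, compute_uv=False)  # singular values, descending
    return s[:K].sum()

A = np.diag([3.0, 2.0, 1.0])  # exact nuclear norm is 3 + 2 + 1 = 6
# the approximations tighten monotonically toward 6.0
approx = [topk_nuclear_norm(A, K) for K in (1, 2, 3)]  # [3.0, 5.0, 6.0]
```

This monotonicity is why accuracy improves with larger 𝐾 in Fig. 7, while each extra singular value adds running time.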
5.7 On over-smoothing. The over-smoothing problem arises when multiple GNN layers are stacked [27]. DropEdge alleviates this problem by randomly dropping edges during the training phase; removing edges makes the graph sparser and reduces the messages passed along edges [38]. However, as an unparameterized method, DropEdge is not applied during the testing phase, which limits its power in preventing over-smoothing, especially on very dense graphs. In contrast, our PTDNet is a parameterized method that learns to drop edges in both training and testing phases.
In this part, we experimentally demonstrate the effectiveness of PTDNet in alleviating the over-smoothing problem in GNN models on a very dense graph. We adjust the synthetic datasets used in the above section by adding more edges. With GCN as the backbone, we compare our PTDNet with DropEdge and the basic backbone. Since stacking multiple GNN layers also raises the overfitting problem, we include MLP as another baseline, which uses the identity matrix as the adjacency matrix. For DropEdge, we choose three drop-edge rates: 0.1, 0.8, and 0.99. We vary the number of GCN layers from 2 to 16 and show the results in Fig. 8.
From the figure, we have the following observations. First, the performance of MLP w.r.t. the number of layers shows that overfitting appears only when 16 layers are stacked; thus, for models with 8 or fewer layers, we can dismiss overfitting and focus solely on over-smoothing. Second, GCN models with 4 or more layers suffer from over-smoothing, which makes the basic GCN perform even worse than MLP. Third, DropEdge can only alleviate over-smoothing to some degree, due to its limitation in the testing phase. Last but not least, our PTDNet consistently outperforms all baselines. The reason is that our method is parameterized and learns to drop edges during the training phase; the learned strategy can also be applied in the testing phase to further reduce the effects of over-smoothing.
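The train/test asymmetry behind these observations can be summarized in a few lines; a toy sketch contrasting the two behaviors, with a hypothetical constant denoiser standing in for PTDNet's learned network:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropedge(adj, rate, training):
    """DropEdge: random, unparameterized edge removal, active only in training."""
    if not training:
        return adj  # no denoising at test time
    return adj * (rng.random(adj.shape) > rate)

def ptdnet_style(adj, denoiser, training):
    """PTDNet-style: a learned denoiser scores edges, and the learned
    mask is applied in both the training and the testing phase."""
    return adj * denoiser(adj)

# hypothetical toy denoiser that down-weights every edge to 0.5
denoiser = lambda a: np.full_like(a, 0.5)
adj = np.ones((3, 3))
test_drop = dropedge(adj, 0.5, training=False)          # adjacency unchanged
test_ptd = ptdnet_style(adj, denoiser, training=False)  # still denoised
```

At test time DropEdge returns the dense graph untouched, so over-smoothing on dense graphs persists, whereas a parameterized mask keeps sparsifying.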
Table 6: Link prediction performance (AUC / AP).
Method  Cora  Citeseer  Pubmed
GCN (GAE)  0.910 / 0.920  0.895 / 0.899  0.964 / 0.965
DropEdge  0.881 / 0.903  0.862 / 0.880  0.859 / 0.877
NeuralSparse  0.901 / 0.917  0.899 / 0.910  0.926 / 0.953
PTDNet  0.916 / 0.931  0.918 / 0.922  0.963 / 0.966
5.8 Link prediction. In this section, we apply PTDNet to another downstream task, link prediction. We adopt the Cora, Citeseer, and Pubmed datasets and follow the same experimental settings as GAE, which applies GCN to link prediction [26]. The PPI dataset is not used because it contains multiple graphs and is not suitable for link prediction. Specifically, we randomly remove 10% and 5% of the edges as positive samples for the testing and validation sets, respectively. The remaining edges and all node features are used for training. We include the same number of negative samples as positive edges by randomly sampling unconnected node pairs for the validation and testing sets. We adopt the area under the ROC curve (AUC) and average precision (AP) to evaluate the ability to correctly predict the removed edges. GCN is used as the backbone. We compare our PTDNet with the basic GCN, DropEdge, and NeuralSparse. Performances are shown in Table 6. The table shows that PTDNet also improves link prediction accuracy: the denoising networks in PTDNet are optimized by the downstream task loss and can remove task-irrelevant edges. On the other hand, the gap between DropEdge and the original backbone shows that DropEdge is not suitable for the link prediction task.
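In practice these metrics are available as `sklearn.metrics.roc_auc_score` and `average_precision_score`; a dependency-free sketch of both, on a toy set of held-out edges:

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC: probability that a random positive edge outscores a random
    negative edge (ties count as 0.5)."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-vs-negative pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def ap_score(y_true, scores):
    """Average precision: mean of the precision values at each positive hit
    when edges are ranked by predicted score."""
    hits = y_true[np.argsort(-scores)]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision[hits == 1].mean()

# toy evaluation: two held-out positive edges, two sampled negative pairs
y_true = np.array([1, 1, 0, 0])
scores = np.array([0.9, 0.4, 0.6, 0.2])  # predicted edge probabilities
auc, ap = auc_score(y_true, scores), ap_score(y_true, scores)
```

Both metrics are threshold-free, which matters here since the positive and negative test sets are balanced by construction.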
In this paper, we propose the Parameterized Topological Denoising Network (PTDNet) to filter out task-irrelevant noisy edges and thereby enhance the robustness and generalization power of GNNs. We directly limit the number of edges in the input graph with parameterized networks. To further improve generalization, we introduce nuclear norm regularization to impose a low-rank constraint on the resulting sparsified graphs. PTDNet is compatible with various GNN models, such as GCN, GraphSAGE, and GAT, improving their performance on various tasks. Our experiments demonstrate the effectiveness of PTDNet on both synthetic and benchmark datasets.
ACKNOWLEDGMENTS This project was partially supported by NSF projects IIS-1707548 and CBET-1638320.
[1] Raman Arora and Jalaj Upadhyay. 2019. On Differentially Private Graph Sparsification and Applications. In NeurIPS. 13378–13389.
[2] András A Benczúr and David R Karger. 1996. Approximating s-t Minimum Cuts in Õ(n²) Time. In STOC, Vol. 96. Citeseer, 47–55.
[3] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013).
[4] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
[5] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka, and Tom M Mitchell. 2010. Toward an architecture for neverending language learning. In AAAI.
[6] Jie Chen, Tengfei Ma, and Cao Xiao. 2018. Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247 (2018).
[7] L Paul Chew. 1989. There are planar graphs almost as good as the complete graph. J. Comput. System Sci. 39, 2 (1989), 205–219.
[8] JJ David. 1995. Algorithms for analysis and design of robust controllers. (1995).
[9] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS. 3844–3852.
[10] Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, and C Seshadhri. 2018. Provable and practical approximations for the degree distribution using sublinear graph samples. In WWW. 449–458.
[11] Negin Entezari, Saba AlSayouri, Amirali Darvishzadeh, and Evangelos Papalexakis. 2020. All You Need is Low (Rank): Defending Against Adversarial Attacks on Graphs. In WSDM. ACM.
[12] Ky Fan. 1951. Maximum properties and inequalities for the eigenvalues of completely continuous operators. PNAS 37, 11 (1951), 760.
[13] Santo Fortunato. 2009. Community detection in graphs. CoRR abs/0906.0612 (2009).
[14] Shmuel Friedland and LekHeng Lim. 2018. Nuclear norm of higherorder tensors. Math. Comp. 87, 311 (2018), 1255–1281.
[15] Federico Girosi. 1998. An Equivalence Between Sparse Approximation And Support Vector Machines. Neural Computation 10, 6 (1998), 1455–1480.
[16] Marco Gori, Gabriele Monfardini, and Franco Scarselli. 2005. A new model for learning in graph domains. In IJCNN, Vol. 2. IEEE, 729–734.
[17] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NIPS. 1024–1034.
[18] Gecia Bravo Hermsdorff and Lee M Gunderson. 2019. A Unifying Framework for SpectrumPreserving Graph Sparsification and Coarsening. arXiv preprint arXiv:1902.09702 (2019).
[19] Yifan Hou, Jian Zhang, James Cheng, Kaili Ma, Richard T. B. Ma, Hongzhi Chen, and MingChang Yang. 2020. Measuring and Improving the Use of Graph Information in Graph Neural Networks. In ICLR.
[20] Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. 2018. Adaptive sampling towards fast graph representation learning. In NeurIPS. 4558–4567.
[21] Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. 2015. Matrix backpropagation for deep networks with structured layers. In ICCV. 2965–2973.
[22] Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbelsoftmax. arXiv preprint arXiv:1611.01144 (2016).
[23] Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. 2020. Graph Structure Learning for Robust Graph Neural Networks. arXiv preprint arXiv:2005.10203 (2020).
[24] Taiju Kanada, Masaki Onuki, and Yuichi Tanaka. 2018. Lowrank Sparse Decomposition of Graph Adjacency Matrices for Extracting Clean Clusters. In APSIPA ASC. IEEE, 1153–1159.
[25] Thomas N Kipf and Max Welling. 2016. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[26] Thomas N Kipf and Max Welling. 2016. Variational graph autoencoders. arXiv preprint arXiv:1611.07308 (2016).
[27] Qimai Li, Zhichao Han, and XiaoMing Wu. 2018. Deeper insights into graph convolutional networks for semisupervised learning. In AAAI.
[28] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. 2018. Adaptive graph convolutional neural networks. In AAAI.
[29] Christos Louizos, Max Welling, and Diederik P Kingma. 2017. Learning Sparse Neural Networks through L0 Regularization. arXiv preprint arXiv:1712.01312 (2017).
[30] Andreas Loukas. 2020. What graph neural networks cannot learn: depth vs width. In ICLR.
[31] Jianxin Ma, Peng Cui, Kun Kuang, Xin Wang, and Wenwu Zhu. 2019. Disentangled Graph Convolutional Networks. In ICML. 4212–4221.
[32] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712 (2016).
[33] Andriy Mnih and Karol Gregor. 2014. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030 (2014).
[34] Yuji Nakatsukasa and Nicholas J Higham. 2013. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM Journal on Scientific Computing 35, 3 (2013), A1325–A1349.
[35] Jingchao Ni, Shiyu Chang, Xiao Liu, Wei Cheng, Haifeng Chen, Dongkuan Xu, and Xiang Zhang. 2018. Coregularized deep multinetwork embedding. In WWW.
[36] James M Ortega. 1990. Numerical analysis: a second course. SIAM.
[37] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. 2010. Guaranteed minimumrank solutions of linear matrix equations via nuclear norm minimization. SIAM review 52, 3 (2010), 471–501.
[38] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. 2020. DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. ICLR (2020).
[39] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE TNN 20, 1 (2008), 61–80.
[40] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina EliassiRad. 2008. Collective classification in network data. AI magazine 29, 3 (2008), 93–93.
[41] Kijung Shin, Amol Ghoting, Myunghwan Kim, and Hema Raghavan. 2019. Sweg: Lossless and lossy summarization of webscale graphs. In WWW. ACM, 1679–1690.
[42] David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, WuMinn HCP Consortium, et al. 2013. The WUMinn human connectome project: an overview. Neuroimage (2013).
[43] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
[44] ChiJen Wang, Seokjoo Chae, Leonid A Bunimovich, and Benjamin Z Webb. 2017. Uncovering Hierarchical Structure in Social Networks using Isospectral Reductions. arXiv preprint arXiv:1801.03385 (2017).
[45] Lu Wang, Wenchao Yu, Wei Wang, Wei Cheng, Wei Zhang, Hongyuan Zha, Xiaofeng He, and Haifeng Chen. 2019. Learning Robust Representations with Graph Denoising Policy Network. IEEE ICDM (2019).
[46] Lichen Wang, Bo Zong, Qianqian Ma, Wei Cheng, Jingchao Ni, Wenchao Yu, Yanchi Liu, Dongjin Dong, and Haifeng Chen. 2020. Inductive and Unsupervised Representation Learning on Graph Structured Objects. In ICLR.
[47] Wei Wang, Zheng Dang, Yinlin Hu, Pascal Fua, and Mathieu Salzmann. 2019. BackpropagationFriendly Eigendecomposition. In NeurIPS. 3156–3164.
[48] Ronald J Williams. 1992. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning 8, 34 (1992), 229–256.
[49] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
[50] F Xue and J Xin. 2020. Learning Sparse Neural Networks via l0 and l1 by a Relaxed Variable Splitting Method with Application to Multiscale Curve Classification. In 6th World Congress on Global Optimization.
[51] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2016. Revisiting semisupervised learning with graph embeddings. arXiv preprint arXiv:1603.08861 (2016).
[52] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. 2020. GraphSAINT: Graph Sampling Based Inductive Learning Method. In ICLR.
[53] Kai Zhang, Yaokang Zhu, Jun Wang, and Jie Zhang. 2020. Adaptive Structural Fingerprints For Graph Attention Networks. ICLR (2020).
[54] Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2018. Deep learning on graphs: A survey. arXiv preprint arXiv:1812.04202 (2018).
[55] Lingxiao Zhao and Leman Akoglu. 2019. PairNorm: Tackling Oversmoothing in GNNs. arXiv preprint arXiv:1909.12223 (2019).
[56] Cheng Zheng, Bo Zong, Wei Cheng, Dongjin Song, Jingchao Ni, Wenchao Yu, Haifeng Chen, and Wei Wang. 2020. Robust Graph Representation Learning via Neural Sparsification. In ICML.
[57] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2018. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434 (2018).