Socially-Aware Self-Supervised Tri-Training for Recommendation

Self-supervised learning (SSL), which can automatically generate ground-truth samples from raw data, holds vast potential to improve recommender systems. Most existing SSL-based methods perturb the raw data graph with uniform node/edge dropout to generate new data views and then conduct the self-discrimination based contrastive learning over different views to learn generalizable representations. Under this scheme, only a bijective mapping is built between nodes in two different views, which means that the self-supervision signals from other nodes are being neglected. Due to the widely observed homophily in recommender systems, we argue that the supervisory signals from other nodes are also highly likely to benefit the representation learning for recommendation. To capture these signals, a general socially-aware SSL framework that integrates tri-training is proposed in this paper. Technically, our framework first augments the user data views with the user social information. And then under the regime of tri-training for multi-view encoding, the framework builds three graph encoders (one for recommendation) upon the augmented views and iteratively improves each encoder with self-supervision signals from other users, generated by the other two encoders. Since the tri-training operates on the augmented views of the same data sources for self-supervision signals, we name it self-supervised tri-training. Extensive experiments on multiple real-world datasets consistently validate the effectiveness of the self-supervised tri-training framework for improving recommendation. The code is released at https://github.com/Coder-Yu/QRec.


INTRODUCTION
Self-supervised learning (SSL) [17], emerging as a novel learning paradigm that does not require human-annotated labels, recently has received considerable attention in a wide range of fields [5,8,16,21,23,27,45]. As the basic idea of SSL is to learn with the automatically generated supervisory signals from the raw data, which is an antidote to the problem of data sparsity in recommender systems, SSL holds vast potential to improve recommendation quality. The recent progress in self-supervised graph representation learning [14,27,40] has identified an effective training scheme for graph-based tasks. That is, performing stochastic augmentation by perturbing the raw graph with uniform node/edge dropout or random feature shuffling/masking to create supplementary views and then maximizing the agreement between the representations of the same node but learned from different views, which is known as graph contrastive learning [40]. Inspired by its effectiveness, a few studies [19,29,37,46] then follow this training scheme and are devoted to transplanting it to recommendation.
With these research efforts, the field of self-supervised recommendation has recently demonstrated promising results showing that mining supervisory signals from stochastic augmentations is desirable [29,46]. However, in contrast to other graph-based tasks, recommendation is distinct because there is widely observed homophily across users and items [20]. Most existing SSL-based methods conduct self-discrimination based contrastive learning over the augmented views to learn representations that generalize against the variance in the raw data. Under this scheme, a bijective mapping is built between nodes in two different views, and a given node can only exploit information from itself in the other view. Meanwhile, the other nodes are regarded as negatives that are pushed apart from the given node in the latent space. Because of the homophily, a number of these nodes are false negatives that are similar to the given node, and they can actually benefit representation learning for recommendation if they are recognized as positives. Conversely, indiscriminately treating them as negatives could lead to a performance drop.
To tackle this issue, a socially-aware SSL framework which combines tri-training [47] (multi-view co-training) with SSL is proposed in this paper. For supplementary views that can capture the homophily among users, we resort to social relations, which can be another data source that implicitly reflects users' preferences [4,38,41-43]. Owing to the prevalence of social platforms in the past decade, social relations are now readily accessible in many recommender systems. We exploit the triadic structures in the user-user and user-item interactions to augment two supplementary data views, and socially explain them as profiling users' interests in expanding social circles and in sharing desired items with friends, respectively. Together with the user-item view, which contains users' historical purchases, we have three views that characterize users' preferences from different perspectives and also provide us with a scenario to fuse tri-training and SSL.
Tri-training [47] is a popular semi-supervised learning algorithm which exploits unlabeled data using three classifiers. In this work, we employ it to mine self-supervision signals from other users in recommender systems with multi-view encoding. Technically, we first build three asymmetric graph encoders over the three views, of which two are only for learning user representations and giving pseudo-labels, while the third one, working on the user-item view, also undertakes the task of generating recommendations. Then we dynamically perturb the social network and the user-item interaction graph to create an unlabeled example set. Following the regime of tri-training, during each epoch, the encoders over the other two views predict the most probable semantically positive examples in the unlabeled example set for each user in the current view. Then the framework refines the user representations by maximizing the agreement between the representations of labeled users in the current view and the example set through the proposed neighbor-discrimination based contrastive learning. As all the encoders iteratively improve in this process, the generated pseudo-labels also become more informative, which in turn recursively benefits the encoders again. The recommendation encoder over the user-item view thus becomes stronger than those only enhanced by the self-discrimination SSL scheme. Since the tri-training operates on the complementary views of the same data sources to learn self-supervision signals, we name it self-supervised tri-training.
The major contributions of this paper are summarized as follows:
• We propose a general socially-aware self-supervised tri-training framework for recommendation. By unifying the recommendation task and the SSL task under this framework, the recommendation performance can achieve significant gains.
• We propose to exploit positive self-supervision signals from other users and develop a neighbor-discrimination based contrastive learning method.
• We conduct extensive experiments on multiple real-world datasets to demonstrate the advantages of the proposed SSL framework and investigate the effectiveness of each module in the framework through a comprehensive ablation study.
The rest of this paper is structured as follows. Section 2 summarizes the related work of recommendation and SSL. Section 3 introduces the proposed framework. The experimental results are reported in Section 4. Finally, Section 5 concludes this paper.

RELATED WORK

Graph Neural Recommendation Models
Recently, graph neural networks (GNNs) [7,34] have gained considerable attention in the field of recommender systems for their effectiveness in solving graph-related recommendation tasks. Particularly, GCN [15], as the prevalent formulation of GNNs which is a first-order approximation of spectral graph convolutions, has driven a multitude of graph neural recommendation models like GCMC [2], NGCF [28], and LightGCN [11]. The basic idea of these GCN-based models is to exploit the high-order neighbors in the user-item graph by aggregating the embeddings of neighbors to refine the target node's embeddings [33]. In addition to these general models, GNNs also empower other recommendation methods working on specific graphs such as SR-GNN [32] and DHCN [35] over the session-based graph, and DiffNet [31] and MHCN [44] over the social network. It is worth mentioning that GNNs are often used for social computing as the information spreading in social networks can be well captured by the message passing in GNNs [31]. That is the reason why we resort to social networks for self-supervisory signals generated by graph neural encoders.

Self-Supervised Learning in RS
Self-supervised learning [17] (SSL) is an emerging paradigm to learn with the automatically generated ground-truth samples from the raw data. It was first used in visual representation learning and language modeling [1,5,10,12,45] for model pretraining. The recent progress in SSL seeks to harness this flexible learning paradigm for graph representation learning [22,23,26,27]. SSL models over graphs mainly mine self-supervision signals by exploiting the graph structure. The dominant regime of this line of research is graph contrastive learning, which contrasts multiple views of the same graph where the incongruent views are built by conducting stochastic augmentations on the raw graph [9,23,27,40]. The common types of stochastic augmentations include but are not limited to uniform node/edge dropout, random feature/attribute shuffling, and subgraph sampling using random walk. Inspired by the success of graph contrastive learning, there have been some recent works [19,29,37,46] which transplant the same idea to the scenario of recommendation. Zhou et al. [46] devise auxiliary self-supervised objectives by randomly masking attributes of items and skipping items and subsequences of a given sequence for pretraining sequential recommendation models. Yao et al. [37] propose a two-tower DNN architecture with uniform feature masking and dropout for self-supervised item recommendation. Ma et al. [19] mine extra signals for supervision by looking at the longer-term future and reconstruct the future sequence for self-supervision, which adopts feature masking in essence. Wu et al. [29] summarize all the stochastic augmentations on graphs and unify them into a general self-supervised graph learning framework for recommendation. Besides, there are also some studies [25,36,44] refining user representations with mutual information maximization among a set of certain members (e.g. ad hoc groups) for self-supervised recommendation. However, these methods are used for specific situations and cannot be easily generalized to other scenarios.

PROPOSED FRAMEWORK
In this section, we present our SElf-suPervised Tri-training framework, called SEPT. The overview of SEPT is illustrated in Fig. 1.

Preliminaries
3.1.1 Notations. In this paper, we use two graphs as the data sources: the user-item interaction graph $\mathcal{G}_r$ and the user social network $\mathcal{G}_s$. $\mathcal{U} = \{u_1, u_2, ..., u_m\}$ ($|\mathcal{U}| = m$) denotes the user nodes across both $\mathcal{G}_r$ and $\mathcal{G}_s$, and $\mathcal{I} = \{i_1, i_2, ..., i_n\}$ ($|\mathcal{I}| = n$) denotes the item nodes in $\mathcal{G}_r$. As we focus on item recommendation, $\mathbf{R} \in \mathbb{R}^{m \times n}$ is the binary matrix whose entries represent user-item interactions in $\mathcal{G}_r$. For each entry $(u, i)$ in $\mathbf{R}$, $r_{ui} = 1$ if user $u$ has consumed/clicked item $i$, and $r_{ui} = 0$ otherwise. As for the social relations, we use $\mathbf{S} \in \mathbb{R}^{m \times m}$ to denote the social adjacency matrix, which is binary and symmetric because we work on undirected social networks with bidirectional relations. We use $\mathbf{P} \in \mathbb{R}^{m \times d}$ and $\mathbf{Q} \in \mathbb{R}^{n \times d}$ to denote the learned final user and item embeddings for recommendation, respectively. For readability, matrices appear in bold capital letters and vectors in bold lowercase letters.
3.1.2 Tri-Training. Tri-training [47] is a popular semi-supervised learning algorithm which develops from the co-training paradigm [3] and tackles the problem of determining how to label the unlabeled examples to improve the classifiers. In contrast to the standard co-training algorithm which ideally requires two sufficient, redundant and conditionally independent views of the data samples to build two different classifiers, tri-training is easily applied by lifting the restrictions on training sets. It does not assume sufficient redundancy among the data attributes, and initializes three diverse classifiers upon three different data views generated via bootstrap sampling [6]. Then, in the labeling process of tri-training, for any classifier, an unlabeled example can be labeled for it as long as the other two classifiers agree on the labeling of this example. The generated pseudo-label is then used as the ground-truth to train the corresponding classifier in the next round of labeling.
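The labeling rule described above can be illustrated with a minimal Python sketch. It is not the original tri-training implementation: the three toy classifiers (`clf_a`, `clf_b`, `clf_c`), the helper names, and the example pool are hypothetical and chosen only to show how an unlabeled example is labeled for one classifier when the other two agree.

```python
from typing import Callable, Hashable, List, Optional, Sequence, Tuple

Example = Tuple[float, ...]
Classifier = Callable[[Example], Hashable]


def agreed_label(example: Example, others: Sequence[Classifier]) -> Optional[Hashable]:
    """Return the label shared by the two other classifiers, or None if they disagree."""
    first, second = (clf(example) for clf in others)
    return first if first == second else None


def pseudo_labels_for(target: int, classifiers: Sequence[Classifier],
                      unlabeled: Sequence[Example]) -> List[Tuple[Example, Hashable]]:
    """Collect pseudo-labeled examples for classifiers[target] from the other two."""
    others = [clf for idx, clf in enumerate(classifiers) if idx != target]
    labeled = []
    for example in unlabeled:
        label = agreed_label(example, others)
        if label is not None:
            labeled.append((example, label))
    return labeled


# Hypothetical toy classifiers, each thresholding a different feature.
clf_a = lambda x: int(x[0] > 0.5)
clf_b = lambda x: int(x[1] > 0.5)
clf_c = lambda x: int(x[0] + x[1] > 1.0)

unlabeled_pool = [(0.9, 0.8), (0.9, 0.3), (0.2, 0.3)]
# Pairs on which clf_b and clf_c agree become training data for clf_a.
new_training_pairs = pseudo_labels_for(0, [clf_a, clf_b, clf_c], unlabeled_pool)
```

In the vanilla algorithm, the returned pairs would be added to the training set of the target classifier for the next round.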

Data Augmentation
3.2.1 View Augmentation. As discussed, homophily is widely observed in recommender systems; namely, users and items have many similar counterparts. To capture the homophily for self-supervision, we exploit the user social relations for data augmentation, as the social network is often regarded as a reflection of homophily [20,39] (i.e., users who have similar preferences are more likely to become connected in the social network and vice versa). Since many service providers such as Yelp encourage users to interact with others on their platforms, their recommender systems have opportunities to leverage abundant social relations. However, as social relations are inherently noisy [41,43], for accurate supplementary supervisory information, SEPT only utilizes the reliable social relations by exploiting the ubiquitous triadic closure [13] among users. In a socially-aware recommender system, by aligning the user-item interaction graph $\mathcal{G}_r$ and the social network $\mathcal{G}_s$, we can readily get two types of triangles: three users socially connected with each other (e.g. $u_1$, $u_2$ and $u_4$ in Fig. 1) and two socially connected users with the same purchased item (e.g. $u_1$, $u_2$ and $i_1$ in Fig. 1). The former is socially explained as profiling users' interests in expanding social circles, and the latter as characterizing users' interests in sharing desired items with their friends. It is straightforward to regard the triangles as strengthened ties, because if two persons in real life have mutual friends or common interests, they are more likely to have a close relationship.
Following our previous work [44], the two types of triangles can be efficiently extracted by matrix multiplication. Let $\mathbf{A}_f \in \mathbb{R}^{m \times m}$ and $\mathbf{A}_s \in \mathbb{R}^{m \times m}$ denote the adjacency matrices of the users involved in these two types of triangular relations. They can be calculated by:
$$\mathbf{A}_f = (\mathbf{S}\mathbf{S}) \odot \mathbf{S}, \quad \mathbf{A}_s = (\mathbf{R}\mathbf{R}^\top) \odot \mathbf{S}. \tag{1}$$
The multiplication $\mathbf{S}\mathbf{S}$ ($\mathbf{R}\mathbf{R}^\top$) accumulates the paths connecting two users via shared friends (items), and the Hadamard product $\odot\, \mathbf{S}$ turns these paths into triangles. Since both $\mathbf{S}$ and $\mathbf{R}$ are sparse matrices, the calculation is not time-consuming. The operation $\odot\, \mathbf{S}$ ensures that the relations in $\mathbf{A}_f$ and $\mathbf{A}_s$ are subsets of the relations in $\mathbf{S}$. As $\mathbf{A}_f$ and $\mathbf{A}_s$ are not binary matrices, Eq. (1) can be seen as a special case of bootstrap sampling on $\mathbf{S}$ with the complementary information from $\mathbf{R}$. Given $\mathbf{A}_f$ and $\mathbf{A}_s$ as the augmentation of $\mathbf{S}$ and $\mathbf{R}$, we have three views that characterize users' preferences from different perspectives and also provide us with a scenario to fuse tri-training and SSL. For ease of understanding, we name the view over the user-item interaction graph the preference view, the view over the triangular social relations the friend view, and the remaining one the sharing view, which are represented by $\mathbf{R}$, $\mathbf{A}_f$, and $\mathbf{A}_s$, respectively.
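As a worked illustration of the triangle extraction, the following Python sketch computes the friend view as (SS) ⊙ S and the sharing view as (RRᵀ) ⊙ S on a hypothetical toy graph with four users and two items. It uses dense nested lists for readability; a practical implementation would presumably use sparse matrices. The function names and the toy data are assumptions made for this example.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product of a (p x q) and b (q x r)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def hadamard(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise product of two matrices of the same shape."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def triangle_views(S: Matrix, R: Matrix):
    """Return (A_f, A_s): friend-view and sharing-view adjacency matrices."""
    A_f = hadamard(matmul(S, S), S)             # counts of mutual friends on social edges
    A_s = hadamard(matmul(R, transpose(R)), S)  # counts of shared items on social edges
    return A_f, A_s


# Toy data: users 0, 1, 2 are mutual friends; user 3 is a friend of user 0 only.
S = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
# Rows are users, columns are items.
R = [[1, 0],
     [1, 1],
     [0, 1],
     [1, 0]]

A_f, A_s = triangle_views(S, R)
```

In this toy case, user 3 has no entry in `A_f` because the edge to user 0 closes no user-user triangle, but the same edge survives in `A_s` because both users interacted with item 0.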

Unlabeled Example Set.
To conduct tri-training, an unlabeled example set is required. We follow existing works [29,40] to perturb the raw graph with edge dropout at a certain probability to create a corrupted graph, from which the learned user representations are used as the unlabeled examples. This process can be formulated as:
$$\tilde{\mathcal{G}} = (\mathcal{N}_r \cup \mathcal{N}_s,\ \mathbf{m} \odot (\mathcal{E}_r \cup \mathcal{E}_s)), \tag{2}$$
where $\mathcal{N}_r$ and $\mathcal{N}_s$ are nodes, $\mathcal{E}_r$ and $\mathcal{E}_s$ are edges in $\mathcal{G}_r$ and $\mathcal{G}_s$, and $\mathbf{m} \in \{0, 1\}^{|\mathcal{E}_r \cup \mathcal{E}_s|}$ is the mask vector to drop edges. Herein we perturb both $\mathcal{G}_r$ and $\mathcal{G}_s$ instead of $\mathcal{G}_r$ only, because the social information is included in the aforementioned two augmented views. For integrated self-supervision signals, perturbing the joint graph is necessary.
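The perturbation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: edges are represented as tuples of hypothetical node names, each edge is dropped independently with probability `drop_rate`, and the node set is left untouched. The function names are not taken from the released code.

```python
import random
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]


def edge_dropout(edges: Sequence[Edge], drop_rate: float, rng: random.Random) -> List[Edge]:
    """Keep each edge independently with probability 1 - drop_rate."""
    mask = [1 if rng.random() >= drop_rate else 0 for _ in edges]
    return [edge for edge, keep in zip(edges, mask) if keep]


def perturbed_joint_graph(interaction_edges: Sequence[Edge], social_edges: Sequence[Edge],
                          drop_rate: float, seed: int = 0) -> List[Edge]:
    """Apply one dropout mask to the union of user-item and social edges."""
    rng = random.Random(seed)
    joint = list(interaction_edges) + list(social_edges)
    return edge_dropout(joint, drop_rate, rng)


# Hypothetical toy edges: two purchases and one friendship.
edges_r = [("u0", "i0"), ("u1", "i0")]
edges_s = [("u0", "u1")]
corrupted = perturbed_joint_graph(edges_r, edges_s, drop_rate=0.3, seed=42)
```

Calling `perturbed_joint_graph` with a fresh seed in every iteration would give a different corrupted graph each time, which is the source of the varying unlabeled examples.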

SEPT: Self-Supervised Tri-Training
With the augmented views and the unlabeled example set, we follow the setting of tri-training to build three encoders. Architecturally, the proposed self-supervised training framework is model-agnostic, so it can boost a multitude of graph neural recommendation models. For a concrete framework that can be easily followed, we adopt LightGCN [11] as the basic structure of the encoders due to its simplicity. The general form of the encoders is defined as follows:
$$\mathbf{Z} = H(\mathbf{E}, \mathcal{V}), \tag{3}$$
where $H$ is the encoder, $\mathbf{Z} \in \mathbb{R}^{m \times d}$ or $\mathbb{R}^{(m+n) \times d}$ denotes the final representation of nodes, $\mathbf{E}$ of the same size denotes the initial node embeddings, which are the bottom shared by the three encoders, and $\mathcal{V} \in \{\mathbf{R}, \mathbf{A}_s, \mathbf{A}_f\}$ is any of the three views. It should be noted that, unlike the vanilla tri-training, SEPT is asymmetric. The two encoders $H_f$ and $H_s$ that work on the friend view and the sharing view are only in charge of learning user representations through graph convolution and giving pseudo-labels, while the encoder $H_r$ working on the preference view also undertakes the task of generating recommendations and thus learns both user and item representations (shown in Fig. 1). Let $H_r$ be the dominant encoder (recommendation model), and $H_f$ and $H_s$ be the auxiliary encoders. Theoretically, given a concrete $H_r$ like LightGCN [11], there should be optimal structures of $H_f$ and $H_s$. However, exploring the optimal structures of the auxiliary encoders is out of the scope of this paper. For simplicity, we assign the same structure to $H_f$ and $H_s$. Besides, to learn representations of the unlabeled examples from the perturbed graph $\tilde{\mathcal{G}}$, another encoder is required, but it is only for graph convolution. All the encoders share the bottom embeddings and are built over different views with the LightGCN structure.
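To make the encoder form concrete, here is a minimal Python sketch of a LightGCN-style encoder. The propagation rule (symmetric degree normalization, no feature transformation or nonlinearity, and averaging of the layer outputs) is the commonly described LightGCN formulation and is an assumption here, since this section does not spell it out. Dense lists are used for readability, and all names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalize(adj: Matrix) -> Matrix:
    """Symmetric normalization D^{-1/2} A D^{-1/2} for a non-negative square adjacency."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def propagate(norm_adj: Matrix, emb: Matrix) -> Matrix:
    """One graph convolution step: aggregate neighbor embeddings."""
    dim = len(emb[0])
    return [[sum(norm_adj[i][k] * emb[k][c] for k in range(len(emb))) for c in range(dim)]
            for i in range(len(norm_adj))]


def encode(adj: Matrix, emb: Matrix, num_layers: int = 2) -> Matrix:
    """Return Z = H(E, V): the average of the embeddings from layer 0 to num_layers."""
    norm_adj = normalize(adj)
    layers = [emb]
    for _ in range(num_layers):
        layers.append(propagate(norm_adj, layers[-1]))
    dim = len(emb[0])
    return [[sum(layer[i][c] for layer in layers) / len(layers) for c in range(dim)]
            for i in range(len(emb))]


# Toy view with two connected users and shared bottom embeddings E.
toy_view = [[0.0, 1.0], [1.0, 0.0]]
shared_embeddings = [[1.0, 0.0], [0.0, 1.0]]
Z = encode(toy_view, shared_embeddings, num_layers=1)
```

Under this sketch, the friend and sharing views would pass their m x m user adjacency matrices directly, whereas the preference view would pass the (m+n) x (m+n) bipartite adjacency built from the user-item matrix, so that item representations are learned as well.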
3.3.2 Constructing Self-Supervision Signals. By performing graph convolution over the three views, the encoders learn three groups of user representations. As each view reflects a different aspect of the user preference, it is natural to seek supervisory information from the other two views to improve the encoder of the current view. Given a user, we predict its semantically positive examples in the unlabeled example set using the user representations from the other two views. Taking user $u$ in the preference view as an instance, the labeling is formulated as:
$$\mathbf{y}^{u}_{s+} = \mathrm{Softmax}(\phi(\tilde{\mathbf{Z}}, \mathbf{z}^{s}_{u})), \quad \mathbf{y}^{u}_{f+} = \mathrm{Softmax}(\phi(\tilde{\mathbf{Z}}, \mathbf{z}^{f}_{u})), \tag{4}$$
where $\phi$ is the cosine operation, $\mathbf{z}^{s}_{u}$ and $\mathbf{z}^{f}_{u}$ are the representations of user $u$ learned by $H_s$ and $H_f$, respectively, $\tilde{\mathbf{Z}}$ is the representations of users in the unlabeled example set obtained through graph convolution, and $\mathbf{y}^{u}_{s+}$ and $\mathbf{y}^{u}_{f+}$ denote the predicted probability of each user being the semantically positive example of user $u$ in the corresponding views.
Under the scheme of tri-training, to avoid noisy examples, a user can be labeled for $H_r$ only if both $H_s$ and $H_f$ agree on labeling this user as a positive sample. We obey this rule and add up the predicted probabilities from the two views with equal weights to obtain:
$$\mathbf{y}^{u}_{r+} = \frac{1}{2}(\mathbf{y}^{u}_{s+} + \mathbf{y}^{u}_{f+}). \tag{5}$$
With the probabilities, we can select the positive samples with the highest confidence. This process can be formulated as:
$$\mathcal{P}^{u}_{r+} = \{\tilde{\mathbf{Z}}_k \mid k \in \text{Top-}K(\mathbf{y}^{u}_{r+}),\ \tilde{\mathbf{Z}} \sim \tilde{\mathcal{G}}\}. \tag{6}$$
In each iteration, $\tilde{\mathcal{G}}$ is reconstructed with random edge dropout to obtain varying user representations. SEPT dynamically generates positive pseudo-labels over this data augmentation for each user in every view. These labels are then used as the supervisory signals to refine the shared bottom representations.
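The labeling step can be sketched in Python as follows: cosine similarities between one user and every unlabeled example are turned into probabilities with a softmax for each auxiliary view, the two distributions are combined, and the Top-K indices are selected with a heap. The equal-weight average, the function names, and the toy vectors are assumptions made for this illustration, not the released implementation.

```python
import heapq
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def positive_examples(z_s_u: Vector, z_f_u: Vector,
                      unlabeled: Sequence[Vector], k: int) -> List[int]:
    """Indices of the k unlabeled examples rated most likely positive by both views."""
    y_s = softmax([cosine(z, z_s_u) for z in unlabeled])
    y_f = softmax([cosine(z, z_f_u) for z in unlabeled])
    y_r = [0.5 * (a + b) for a, b in zip(y_s, y_f)]  # combine the two views
    return heapq.nlargest(k, range(len(y_r)), key=y_r.__getitem__)


# Toy unlabeled set and the user's representations in the sharing and friend views.
unlabeled_set = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top_two = positive_examples([1.0, 0.1], [1.0, 0.2], unlabeled_set, k=2)
```

For the toy vectors above, the first and third examples point in roughly the same direction as the user in both views, so they are the ones selected.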

Contrastive Learning.
Having the generated pseudo-labels, we develop the neighbor-discrimination contrastive learning method to fulfill self-supervision in SEPT. Given a certain user, we encourage the consistency between the user's node representation and the labeled user representations from $\mathcal{P}^{u}_{v+}$, and minimize the agreement between the user's node representation and the unlabeled user representations. The idea of the neighbor-discrimination is that, given a certain user in the current view, the positive pseudo-labels semantically represent the user's neighbors or potential neighbors in the other two views, so we should also bring these positive pairs together in the current view due to the homophily across different views. This can be achieved through the neighbor-discrimination contrastive learning. Formally, we follow the previous studies [5,29] to adopt InfoNCE [12], which is effective in mutual information estimation, as our learning objective to maximize the agreement between positive pairs and minimize that of negative pairs:
$$\mathcal{L}_{ssl} = -\mathbb{E}_{u \in \mathcal{U}} \left[ \log \frac{\sum_{p \in \mathcal{P}^{u}_{v+}} \psi(\mathbf{z}^{v}_{u}, \tilde{\mathbf{z}}_{p})}{\sum_{p \in \mathcal{P}^{u}_{v+}} \psi(\mathbf{z}^{v}_{u}, \tilde{\mathbf{z}}_{p}) + \sum_{j \in \mathcal{U}/\mathcal{P}^{u}_{v+}} \psi(\mathbf{z}^{v}_{u}, \tilde{\mathbf{z}}_{j})} \right], \tag{7}$$
where $v$ indexes the current view, $\psi(\mathbf{z}^{v}_{u}, \tilde{\mathbf{z}}_{p}) = \exp(\phi(\mathbf{z}^{v}_{u} \cdot \tilde{\mathbf{z}}_{p})/\tau)$, $\phi(\cdot): \mathbb{R}^{d} \times \mathbb{R}^{d} \mapsto \mathbb{R}$ is the discriminator function that takes two vectors as the input and then scores the agreement between them, and $\tau$ is the temperature to amplify the effect of discrimination ($\tau = 0.1$ is the best in our implementation). We simply implement the discriminator by applying the cosine operation. Compared with the self-discrimination, the neighbor-discrimination leverages the supervisory signals from the other users. When only one positive example is used and the user itself in $\tilde{\mathbf{Z}}$ has the highest confidence in $\mathbf{y}^{u}_{v+}$, the neighbor-discrimination degenerates to the self-discrimination. So, the self-discrimination can be seen as a special case of the neighbor-discrimination. When a sufficient number of positive examples are used, these two methods can also be adopted simultaneously, because the user itself in $\tilde{\mathbf{Z}}$ is highly likely to be among the Top-K similar examples $\mathcal{P}^{u}_{v+}$. As training proceeds, the encoders iteratively improve and generate evolving pseudo-labels, which in turn recursively benefit the encoders again.
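The neighbor-discrimination objective for a single user in a single view can be sketched as below. The sketch assumes a cosine discriminator and a temperature of 0.1 as stated in the text, and it treats every unlabeled example that is not a pseudo-labeled positive as a negative. The function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def neighbor_discrimination_loss(z_u: Vector, unlabeled: Sequence[Vector],
                                 positives: Sequence[int], tau: float = 0.1) -> float:
    """InfoNCE-style loss: positives in the numerator, all examples in the denominator."""
    scores: List[float] = [math.exp(cosine(z_u, z) / tau) for z in unlabeled]
    positive_mass = sum(scores[p] for p in positives)
    total_mass = sum(scores)  # positives plus the remaining (negative) examples
    return -math.log(positive_mass / total_mass)


# Toy user representation and unlabeled example set.
user_vec = [1.0, 0.1]
examples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
loss_good = neighbor_discrimination_loss(user_vec, examples, positives=[0, 2])
loss_bad = neighbor_discrimination_loss(user_vec, examples, positives=[3])
```

Labeling the examples that are close to the user as positives yields a small loss, whereas labeling a distant example as the positive yields a large one, which is the gradient signal that pulls pseudo-labeled neighbors together.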
Compared with the vanilla tri-training, it is worth noting that in SEPT, we do not add the pseudo-labels into the adjacency matrices for subsequent graph convolution during training. Instead, we adopt a soft and flexible way to guide the user representations via mutual information maximization, which is distinct from the vanilla tri-training that adds the pseudo-labels to the training set for nextround training. The benefits of this modeling are two-fold. Firstly, adding pseudo-labels leads to reconstruction of the adjacency matrices after each iteration, which is time-consuming; secondly, the pseudo-labels generated at the early stage might not be informative; repeatedly using them would mislead the framework.

Optimization.
The learning of SEPT consists of two tasks: recommendation and the neighbor-discrimination based contrastive learning. Let $\mathcal{L}_r$ be the BPR pairwise loss function [24], which is defined as:
$$\mathcal{L}_r = \sum_{i \in \mathcal{I}(u),\, j \notin \mathcal{I}(u)} -\log \sigma(\hat{r}_{ui} - \hat{r}_{uj}) + \lambda \|\mathbf{E}\|_2^2, \tag{8}$$
where $\mathcal{I}(u)$ is the item set that user $u$ has interacted with, $\sigma$ is the sigmoid function, $\hat{r}_{ui} = \mathbf{P}_u^\top \mathbf{Q}_i$, $\mathbf{P}$ and $\mathbf{Q}$ are obtained by splitting $\mathbf{Z}^r$, and $\lambda$ is the coefficient controlling the $L_2$ regularization. The training of SEPT proceeds in two stages: initialization and joint learning. To start with, we warm up the framework with the recommendation task by optimizing $\mathcal{L}_r$. Once trained with $\mathcal{L}_r$, the shared bottom has gained far stronger representations than randomly initialized embeddings. The self-supervised tri-training then proceeds as described in Eqs. (4)-(7), acting as an auxiliary task which is unified into a joint learning objective to enhance the performance of the recommendation task. The overall objective of the joint learning is defined as:
$$\mathcal{L} = \mathcal{L}_r + \beta \mathcal{L}_{ssl}, \tag{9}$$
where $\beta$ is a hyper-parameter used to control the magnitude of the self-supervised tri-training. The overall process of SEPT is presented in Algorithm 1.
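The two loss terms can be illustrated with a small Python sketch. It computes the BPR loss for one (user, positive item, negative item) triple, an L2 penalty on the embeddings, and the weighted sum of the recommendation and self-supervised losses. The function names are hypothetical and the numbers are toy values, not results from the paper.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def bpr_pair_loss(user: Vector, pos_item: Vector, neg_item: Vector) -> float:
    """-log(sigmoid(score_pos - score_neg)) for one training triple."""
    margin = dot(user, pos_item) - dot(user, neg_item)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


def l2_penalty(embeddings: Sequence[Vector], lam: float) -> float:
    """lam times the squared L2 norm of all embedding entries."""
    return lam * sum(x * x for row in embeddings for x in row)


def joint_objective(rec_loss: float, ssl_loss: float, beta: float) -> float:
    """Recommendation loss plus the self-supervised loss scaled by beta."""
    return rec_loss + beta * ssl_loss


# Toy triple in which the interacted item scores higher than the sampled negative.
rec = bpr_pair_loss([1.0, 0.0], [2.0, 0.0], [0.0, 0.0]) + l2_penalty([[1.0, 0.0]], lam=0.001)
total = joint_objective(rec, ssl_loss=2.0, beta=0.005)
```

In the two-stage schedule described above, only the recommendation loss would be optimized during the warm-up, and the joint objective would be used afterwards.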

Connection with Social Regularization.
Social recommendation [38,43,44] integrates social relations into recommender systems to address the data sparsity issue. A common idea of social recommendation is to regularize user representations by minimizing the Euclidean distance between socially connected users, which is termed social regularization [18]. Although the proposed SEPT also leverages socially-aware supervisory signals to refine user representations, it is distinct from the social regularization. The differences are two-fold. Firstly, the social regularization is a static process which is always performed on the socially connected users, whereas the neighbor-discrimination is dynamic and iteratively improves the supervisory signals imposed on uncertain users; secondly, negative social relations (dislike) cannot be readily retrieved in social recommendation, and hence the social regularization can only keep socially connected users close. In contrast, SEPT can also push apart users who are not semantically positive in the three views.

Algorithm 1: The running process of SEPT
Input: Bidirectional social relations $\mathbf{S}$, user feedback $\mathbf{R}$, and randomly initialized node embeddings $\mathbf{E}$;
Output: Recommendation lists
Pretraining with $\mathcal{L}_r$ in Eq. (8);
View augmentation with Eq. (1);
for each iteration do
    Construct $\tilde{\mathcal{G}}$ and obtain the unlabeled example set through graph convolution;
    for each batch do
        Randomly select $c$ users from $\tilde{\mathbf{Z}}$ to be labeled;

Complexity.
Architecturally, SEPT can be model-agnostic, and its complexity mainly depends on the structure of the used encoders. In this paper, we present a LightGCN-based architecture.
Given $O(|\mathbf{R}|d)$ as the time complexity of the recommendation encoder for graph convolution, the total complexity for the graph convolution is less than $4O(|\mathbf{R}|d)$ because $\mathbf{A}_f$, $\mathbf{A}_s$, and $\tilde{\mathcal{G}}$ are usually sparser than $\mathbf{R}$. Another cost comes from the Top-K operation of the labeling process in Eq. (6), which usually requires $O(m \log(K))$ by using the max heap. To reduce the cost and speed up training, in each training batch only $c$ ($c \ll m$, e.g. 1000) users are randomly selected to form the unlabeled example set for pseudo-labeling, and this sampling can also prevent overfitting. The complexity of the neighbor-discrimination based contrastive learning is $O(cd)$.

[43,44] to leave out ratings less than 4 in the dataset of Douban-Book, which consists of explicit ratings on a 1-5 rating scale, and assign 1 to the rest. The statistics of the datasets are shown in Table 1. For precise assessment, 5-fold cross-validation is conducted in all the experiments and the average results are presented.
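The rating preprocessing and the cross-validation protocol mentioned here can be sketched as follows. This is a minimal illustration under assumptions: ratings are given as hypothetical (user, item, rating) triples, and the folds are formed by a seeded shuffle followed by a round-robin split, since the exact splitting procedure is not specified in this section.

```python
import random
from typing import List, Sequence, Tuple

Rating = Tuple[str, str, int]


def binarize(ratings: Sequence[Rating], threshold: int = 4) -> List[Rating]:
    """Drop ratings below the threshold and assign 1 to the remaining interactions."""
    return [(user, item, 1) for user, item, score in ratings if score >= threshold]


def k_fold(records: Sequence, k: int = 5, seed: int = 0) -> List[Tuple[list, list]]:
    """Return k (train, test) splits in which every record is tested exactly once."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[start::k] for start in range(k)]
    return [([rec for j, fold in enumerate(folds) if j != held_out for rec in fold],
             folds[held_out]) for held_out in range(k)]


# Hypothetical explicit ratings on a 1-5 scale.
raw = [("u1", "b1", 5), ("u1", "b2", 3), ("u2", "b1", 4)]
implicit = binarize(raw)
splits = k_fold(list(range(10)), k=5)
```

The reported metric would then be the average over the five test folds.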
Baselines. Three recent graph neural recommendation models are compared with SEPT to test the effectiveness of the self-supervised tri-training for recommendation:
• LightGCN [11] is a GCN-based general recommendation model that leverages the user-item proximity to learn node representations and generate recommendations, and it is reported as the state-of-the-art.
• DiffNet++ [30] is a recent GCN-based social recommendation method that models the recursive dynamic social diffusion in both the user and item spaces.
• MHCN [44] is a recent hypergraph convolutional network-based social recommendation method that models the complex correlations among users with hyperedges to improve recommendation performance.
LightGCN [11] is the basic encoder in SEPT. Investigating the performance of LightGCN and SEPT is essential. Since LightGCN is a widely acknowledged SOTA baseline reported in many recent papers [29,44], we do not compare SEPT with other weak baselines such as NGCF [28], GCMC [2], and BPR [24]. Two strong social recommendation models are also compared to SEPT to verify that the self-supervised tri-training, rather than the use of social relations, is the main driving force of the performance improvements.
Metrics. To evaluate all the methods, we first perform item ranking on all the candidate items. Then two relevancy-based metrics, Precision@10 and Recall@10, and one ranking-based metric, NDCG@10, are used.
Settings. For a fair comparison, we refer to the best parameter settings reported in the original papers of the baselines and then fine-tune all the hyperparameters of the baselines to ensure their best performance. As for the general settings of all the methods, we empirically set the dimension of latent factors (embeddings) to 50, the regularization parameter to 0.001, and the batch size to 2000. In Section 4.4, we investigate the parameter sensitivity of SEPT, and the best parameters are used in Sections 4.2 and 4.3. We use Adam to optimize all these models with an initial learning rate of 0.001.
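For reference, the three metrics can be computed as in the Python sketch below. The exact formulas are not given in this section, so the sketch assumes the usual binary-relevance definitions: Precision@k divides the number of hits by k, Recall@k divides it by the number of relevant items, and NDCG@k uses a log2 position discount normalized by the ideal ranking. The item names are hypothetical.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Discounted cumulative gain of the ranking divided by that of an ideal ranking."""
    dcg = sum(1.0 / math.log2(pos + 2) for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Toy ranking for one user whose held-out items are "a" and "c".
ranking = ["a", "b", "c", "d"]
held_out = {"a", "c"}
p2 = precision_at_k(ranking, held_out, k=2)
r2 = recall_at_k(ranking, held_out, k=2)
n4 = ndcg_at_k(ranking, held_out, k=4)
```

The per-user values would be averaged over all test users to obtain the numbers reported in a results table.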

Overall Performance Comparison
In this part, we validate whether SEPT can improve recommendation. The performance comparisons are shown in Tables 2 and 3. We conduct experiments with different layer numbers in Table 2. In Table 3, a two-layer setting is adopted for all the methods because they all reach their best performance on the used datasets under this setting. The performance improvement (drop) marked by ↑ (↓) is calculated by dividing the performance difference by the subtrahend. According to the results, we can draw the following observations and conclusions:
• Under all the different layer settings, SEPT significantly boosts LightGCN. Particularly, on the sparser datasets, Douban-Book and Yelp, the improvements get higher. The maximum improvement can even reach 11%. This is evidence of the effectiveness of self-supervised learning. Besides, although both LightGCN and SEPT suffer from the over-smoothing problem when the layer number is 3, SEPT can still outperform LightGCN. We think the possible reason is that contrastive learning can, to some degree, alleviate the over-smoothing problem because the dynamically generated unlabeled examples provide sufficient data variance.
In addition to the comparison with LightGCN, we also compare SEPT with social recommendation models to validate whether the self-supervised tri-training, rather than social relations, primarily promotes the recommendation performance. Since MHCN is also built upon LightGCN, comparing these two models can be more informative. Besides, $S^2$-MHCN, which is the self-supervised variant of MHCN, is also compared. From Table 3, we make the following observations and conclusions:
• Although integrating social relations into graph neural models is helpful (comparing MHCN with LightGCN), learning under the scheme of SEPT can achieve more performance gains (comparing SEPT with MHCN). DiffNet++ is uncompetitive compared with the other three methods. Its failure can be attributed to its redundant and useless parameters and operations [11]. On both LastFM and Douban-Book, SEPT outperforms $S^2$-MHCN. On Yelp, $S^2$-MHCN exhibits better performance than SEPT does. The superiority of SEPT and $S^2$-MHCN demonstrates that self-supervised learning holds vast capability for improving recommendation. In addition, SEPT does not need to learn any parameters other than the bottom embeddings, whereas $S^2$-MHCN needs to learn a number of other parameters. Meanwhile, SEPT runs much faster than $S^2$-MHCN does in our experiments, which makes it more competitive even though it is beaten by $S^2$-MHCN on Yelp by a small margin.

Neighbor-Discrimination
In SEPT, the generated positive examples can include both the user itself and other users in the unlabeled example set. It is not clear which part contributes more to the recommendation performance. In this part, we investigate the self-discrimination and the neighbor-discrimination without the user itself being the positive example. For convenience, we use SEPT-SD to denote the former, and SEPT-ND to denote the latter. It should also be mentioned that, for SEPT-ND only, a small $\beta = 0.001$ leads to the best performance on all the datasets. A two-layer setting is used in this case. According to Fig. 2, we can observe that both SEPT-SD and SEPT-ND exhibit better performance than LightGCN does, which shows that the supervisory signals from both the user itself and other users can benefit a self-supervised recommendation model. Our claim about the self-supervision signals from other users is thus validated. Besides, the importance of the self-discrimination and the neighbor-discrimination varies from dataset to dataset. On LastFM, they contribute almost equally. On Douban-Book, self-discrimination shows much more importance. On Yelp, neighbor-discrimination is more effective. This phenomenon can be explained by Fig. 5. With the increase of the used positive examples, we see that the performance of SEPT remains almost stable on LastFM and Yelp but gradually declines on Douban-Book. We guess that there is widely observed homophily in LastFM and Yelp, so a large number of users share similar preferences and can serve as high-quality positive examples in these two datasets. However, users in Douban-Book may have more diverse interests, which results in the quality drop when the number of used positive examples increases.

View Study
In SEPT, we build two augmented views to conduct tri-training for mining supervisory signals. In this part, we ablate the framework to investigate the contribution of each view. A two-layer setting is used in this case. In Fig. 3, 'Friend' or 'Sharing' means that the corresponding view is detached. When only two views are used, SEPT degenerates to self-supervised co-training. 'Preference-Only' means that only the preference view is used, in which case SEPT further degenerates to self-training. From Fig. 3, we can observe that on both LastFM and Yelp all the views contribute, whereas on Douban-Book the self-supervised co-training setting achieves the best performance. Moreover, when only the preference view is used, SEPT shows lower performance, but it is still better than LightGCN. As the number of used views decreases, the performance of SEPT slightly declines on LastFM, and an obvious performance drop is observed on Yelp. On Douban-Book, the performance first rises slightly and then declines markedly when only one view remains. The results demonstrate that, under the semi-supervised setting, even a single view can generate desirable self-supervision signals, which is encouraging since social relations or other side information are not always accessible. Besides, increasing the number of used views may bring more performance gains, but this is not guaranteed.

For the effect of ρ in Fig. 6, the setting of β is the same as in the last case, and K = 10. A two-layer setting is used in this case. As can be observed from Fig. 4, SEPT is sensitive to β. On different datasets, we need to choose different values of β for the best performance. Generally, a small value of β leads to a desirable performance, and a large value of β results in a huge performance drop. Figure 5 has been interpreted in Section 4.3. According to Fig. 6, SEPT is not sensitive to the edge dropout rate ρ. Even a large value of ρ (e.g., 0.8) can create informative self-supervision signals, which is a good property for the possible wide use of SEPT. However, when the perturbed graph becomes highly sparse, it cannot provide useful information for self-supervised learning.
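For readers unfamiliar with the perturbation behind the edge dropout rate, the following is a minimal sketch of uniform edge dropout on an edge list, assuming each edge is removed independently. The function name `edge_dropout` and the parameter name `rho` are illustrative choices and may differ from the released implementation.

```python
import numpy as np


def edge_dropout(edges, rho, seed=None):
    """Drop each edge independently with probability rho.

    edges: sequence of (user, item) pairs describing the interaction graph.
    rho: dropout rate in [0, 1); rho = 0.8 keeps about 20% of the edges.
    Returns the surviving edges as an array with two columns.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    rng = np.random.default_rng(seed)
    edges = np.asarray(edges)
    # Uniform draws in [0, 1); an edge survives when its draw is >= rho.
    keep = rng.random(len(edges)) >= rho
    return edges[keep]
```

At a rate of 0.8, a graph with 1,000 interactions retains roughly 200 of them in expectation, which gives a sense of how sparse the perturbed graph already is at the largest rate mentioned above.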

CONCLUSION AND FUTURE WORK
Self-supervised graph contrastive learning, which is widely used in graph representation learning, has recently been transplanted to recommendation to improve recommendation performance. However, most SSL-based methods only exploit self-supervision signals through self-discrimination, so SSL cannot fully leverage the homophily that is widely observed in recommendation scenarios. To address this issue, we propose a socially-aware self-supervised tri-training framework named SEPT, which improves recommendation by discovering self-supervision signals from two complementary views of the raw data. Under the self-supervised tri-training scheme, a neighbor-discrimination based contrastive learning method is developed to refine user representations with pseudo-labels from the neighbors. Extensive experiments demonstrate the effectiveness of SEPT, and a thorough ablation study verifies the rationale of the self-supervised tri-training.
In this paper, only the self-supervision signals from users are exploited. However, items can analogously provide informative pseudo-labels for self-supervision, for example by leveraging the multimodality of items. We leave this as future work. We also believe that the idea of self-supervised multi-view co-training can be generalized to more scenarios beyond recommendation.