Efficient pattern recognition within peer-to-peer networks
2017-02-22T05:07:06Z (GMT) by
Economical and scalability factors lead to a growing number of applications using P2P technology. However, the absence of a server to monitor the system makes these applications susceptible to uncontrolled distribution of undesired objects (e.g. pornography, polluted content, or spam). In centralised applications, this issue is solved effectively by using content filtering or spam filtering via pattern recognition. Hence, a similar solution is appealing for P2P-based applications. Within P2P networks, pattern recognition also has the potential to provide other classification services, such as music genre classification and attack detection. Pattern recognition in the P2P domain involves large amounts of resource in- tensive computation and it typically requires a high-performance server which is unavailable in P2P networks. In order to deal with this limitation, a scalable, fully distributed classification algorithm which performs fast processing is necessary. Thus, we consider an algorithmic perspective and explore design issues that arise in pattern recognition in P2P networks. Some of these issues include the absence of central coordination, low computational resources, lack of synchronisation, frequent data update, large dataset scalability, large network scalability, dynamic and imbalanced data distribution. None of the current P2P classification algorithms deal with all of these issues, therefore, the proposed algorithm aims to adequately deal with all of these issues. Considering that P2P networks are large, building an exact (not approximate) global classifier which represents the combined datasets across all peers is non- trivial. Current algorithms are guaranteed to converge to exact classifiers through high communication overheads and processing times. In contrast, this thesis aims to show that exact global classifications can be obtained at any time using limited communication by building a single in-network global classifier. We refer to the resulting algorithm as the P2P-GN. Existing methods perform a whole state-of-the-art classifier on a single peer. This computation is intractable due to the complexity of pattern recognition (particularly in dealing with high-dimensional problems) and the reluctance of peers to share their private resources inordinately. In contrast, our method only requires each peer to perform a simple and parallel computation. This enables the algorithm to perform online learning to handle frequent local data updates and frequent prediction requests leading to processing large datasets efficiently. We also incorporate a fault-tolerance scheme to preserve the system performance within dynamic networks. In order to assess the performance of the algorithm, we compare it to the existing centralised and distributed classification methods. The experimental results show that our method has comparable accuracy to the centralised classifiers. The efficiency of the algorithm for high-dimensional datasets is shown through a series of experiments. The results also show that our method is invariant to imbalanced data distributions (imbalanced class distribution and small in size dataset per peer). The results from our experiments and analyses show that our algorithm incurs low communication overheads and the communication per peer is independent of the network size. This shows that the algorithm scales well, performing effectively irrespective of network size. As a final example of the efficiency of the proposed algorithm we consider spam problems in P2P applications. We extend our algorithm to be an economical, scalable and fullydistributed spam detection system called the DASMET. This algorithm is specifically designed for datasets that consist of similar occurring patterns. Our experimental results show that the DASMET performs best with a relatively low amount of resources for the spam detection compared to other distributed methods. In conclusion, this thesis proposes the P2P-GN for efficient pattern recognition within P2P networks. The efficacy of the algorithm for datasets with large number of attributes and training samples are validated through a series of experiments. Furthermore, the scalability of the algorithm is demonstrated through experiments with varying network sizes. The experimental results show that the P2P-GN provides a practical alternative for scalable pattern recognition within P2P networks.