Classification of Internet Pages Using Text Mining and Machine Learning
Abstract
A domain name is the address of a website on the Internet. Through domain names, users can visit a desired address and access the information it holds. The number of websites is growing exponentially, so web pages need to be classified, both to prevent access to potentially harmful content and to make useful information easier to find. Methods and algorithms for website classification have been proposed in academic studies and by private companies. The goal is that, once sites are classified by content, Internet users are not exposed to fraudulent elements and access to predetermined websites can be blocked.
Filtering websites makes it possible to define rules that allow or block access to certain sites. Rules can be created for specific users of a computer, so that a user cannot access sites in the class named by the rule. This makes classification important in both household and work environments. For example, parents can prevent children from visiting inappropriate websites, and companies can prevent employees from visiting social media sites during work hours. This classification study employs approaches from data science, a field that draws on several disciplines such as statistics, software engineering, industrial engineering and mathematics, and in which new methods have been continuously developed and proposed in recent years. The classification process has been automated with machine learning and deep learning algorithms, which are sub-branches of data science.
The technical part of this thesis is the classification of web pages. The aim is that, at the end of the study, the developed model returns a class value for any web domain name given as input. For this classification process, the data was first extracted as web page-class pairs, and the training set and test data were created from it. The website classification problem is investigated using different machine learning methods and artificial neural networks. Two approaches have been employed to solve it, namely Binary Classification and Multi-Class Classification. Both approaches have been tested on the websites collected in the study and their performance has been compared. Among the binary classifiers, Logistic Regression was observed to be the best performing algorithm. Among the algorithms applied in the Multi-Class Classification approach, Support Vector Machines (SVM) was the most successful method. Furthermore, different word vectorization methods have been employed in the Multi-Class Classification problem and their performance has been compared. Using the algorithms of both the Binary and Multi-Class Classification approaches together with different vectorization methods gives a combined approach to the problems of web page classification and content filtering, and this distinguishes the current study from similar studies in the field. To investigate bias in the learning methods and test sets, measures such as the F1 score and the confusion (error) matrix were used. Considering all experimental results, Binary Classification was found to be the more effective approach only when the task is to filter a desired class of websites.
In the analysis, when the computational performance (time) of the methods used in Binary Classification was taken into account, the Logistic Regression and Bernoulli Naive Bayes classifiers were found to be 150 times faster than artificial neural networks.
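As an illustration of the evaluation measures named above, the following is a minimal sketch, not the implementation used in the thesis. It shows how a multi-class labeling can be turned into a binary (one class versus the rest) labeling for a filtering task, and how a confusion matrix and an F1 score can be computed from true and predicted labels. The class names ("adult", "news", "social"), the sample labels and all function names are assumptions made for the example.

```python
from collections import Counter


def to_binary(labels, target):
    """Relabel multi-class labels as target-vs-rest (1 = target class, 0 = any other)."""
    return [1 if label == target else 0 for label in labels]


def confusion_matrix(y_true, y_pred, classes):
    """Build a matrix whose rows are true classes and whose columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def f1_score(y_true, y_pred, positive=1):
    """Compute F1, the harmonic mean of precision and recall, for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical true and predicted classes for six websites (illustrative only).
true_classes = ["news", "social", "adult", "news", "social", "adult"]
pred_classes = ["news", "social", "news", "news", "adult", "adult"]
classes = ["adult", "news", "social"]

# Multi-class view: one row and one column per class.
multi_cm = confusion_matrix(true_classes, pred_classes, classes)

# Binary view: filter only the "adult" class, everything else counts as "rest".
y_true = to_binary(true_classes, "adult")
y_pred = to_binary(pred_classes, "adult")
binary_f1 = f1_score(y_true, y_pred)

print(multi_cm)
print(binary_f1)
```

The binary view matches the filtering use case, where only membership in one class matters, while the multi-class matrix shows which classes are confused with one another.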