20170210-Fernando-Thesis.pdf (3.9 MB)

# Data Types and Measurement Scales in Data Analysis

thesis

posted on 2017-02-15, 00:20 authored by Muthuthanthiri Basthiyange Thilak Laksiri FernandoIrrespective of the formal definitions of scales and
scale types given in measurement theory, data mining often assumes a
scale type for a given attribute based on superficial properties. Unless
there are only a small number of distinct values, a quantitative
attribute is usually assumed to be given in the interval scale. Based on
that assumption, many data mining algorithms use the magnitudes of the
values in calculations. This thesis shows that doing so can have serious
adverse consequences.

Measurement theory provides analyses to determine the scale types of attributes. However, in data mining, those analyses are often overlooked and analyses are performed that implicitly assume without justification that attributes are interval scale. This may lead to two problems. First, initial assumptions made on scale types may not be correct. Second, calculations in data mining produce derived scales in which scale type identification can be difficult. Thus, there is potential both for assumptions made about the raw data and interpretations of derived values to be incorrect. Incorrect assumptions may produce sub-optimal results.

In this thesis, we show that both supervised and unsupervised learning algorithms that assume data are given in the interval scale frequently produce sub-optimal results when that assumption is violated. Then, we argue that often original quantitative data may not belong to the interval scale.

Operations defined for the ordinal scale are also valid for the interval and ratio scales. Therefore, we argue that assuming quantitative data are given in the ordinal scale is often more effective than assuming quantitative data are given in the interval scale. The main objectives of our research are, assessing effects of violations of the interval scale assumption on data mining algorithms and finding effective and efficient measures that are invariant to such violations.

We discuss our experiments on measures that assume quantitative data are given in the ordinal scale. We tried unsupervised random forest and rank transformed data with unsupervised learning tasks: DBScan clustering and content based multimedia information retrieval (CBMIR). When compared with cityblock, Euclidean, cosine and Chebychev distances, unsupervised random forest often produced lower F-Measure values and lower precision@50 values in DBScan clustering and CBMIR respectively. We identified the use of synthetic data in unsupervised random forest as the reason for the lower F-Measure values and lower precision@50 values. In contrast to that, rank transformed data used with cityblock distance often produced higher or competitive F-Measure values in DBScan clustering and higher or competitive precision@50 values in CBMIR when compared with cityblock, Euclidean, cosine and Chebychev distance with original data. Cityblock, Euclidean, cosine and Chebychev distances with original data assume quantitative data are given in the interval scale whereas cityblock distance with rank transformed data assumes that quantitative data are given in ordinal scale.

In addition to being the cause for producing lower F-Measure and precision@50 values in DBScan clustering and CBMIR, synthetic data add additional disadvantages to unsupervised random forest. In unsupervised random forest, synthetic data is also the reason for its high execution time, high memory requirements and high number of zero similarity pairs in calculated similarity matrices. The main disadvantage of using rank transformed data is that to estimate the ranks of previously unseen instances requires high execution time and memory.

Then we introduced a new unsupervised learning methodology, unsupervised stochastic forest(usForest) that assume quantitative data are given in the ordinal scale. In DBScan clustering, usForest produced higher F-Measure values than F-Measure values produced by cityblock, Euclidean, cosine, Chebychev distances with original data, cityblock, Euclidean, cosine, Chebychev distances with rank transformed data and unsupervised random forest.

In contrast to unsupervised random forest, usForest does not use synthetic data. Hence, for a given task, usForest requires less execution time and less memory than unsupervised random forest. In similarity matrices, usForest produces lower number of zero similarity instance pares when compared with unsupervised random forest. In similarity calculations, usForest produced better estimate for expected values with lower number of trees when compared with unsupervised random forest. When compared with rank transformed data, usForest requires less execution time and less memory to calculate similarity values for previously unseen data.

We identified that usForest's capability to contrast nearest neighbours of a given instance from the rest of the instances is the reason for producing higher F-Measure values in DBScan clustering which rely on nearest neighbour discovery. Accordingly, we predicted that algorithms that rely on nearest neighbour identification can specially benefit from usForest. We extended usForest for datasets that have both quantitative and qualitative attributes. We showed that, in DBScan clustering with datasets that have both qualitative and quantitative attributes, extended version of usForest produced higher F-Measure values when compared with cityblock, Euclidean, cosine, Chebychev distances and unsupervised random forest. Then we tested usForest for K-nearest neighbour classification which is another algorithm that rely on nearest neighbour identification. In K-nearest neighbour classification, the number of cases where usForest produced lower classification error values than Euclidean distance with original data is higher than the number of cases where Euclidean distance with original data produced lower classification error values than usForest. Even when compared with Euclidean distance with rank transformed data, in more cases usForest produced lower classification error values.

In our supervised learning experiments we tested popular classifiers: artificial feed forward neural networks (ANN), naive Bayes classifier (NBC), linear descriminant analysis (LDA), support vector machines (SVM), K-nearest neighbour classification (KNN) and multinomial logistic regression (MLR). We used log loss to measure the performance of ANN, NBC, LDA and MLR that produce posterior class probabilities. Classification error was used to measure performances of SVM and KNN that usually predict class labels. We found that all the tested classification algorithms, i.e. ANN, NBC, LDA, SVM, KNN and MLR assume that quantitative data are given in the interval scale and they are affected by violations of the interval scale assumption. In ANN, NBC, LDA, SVM, KNN and MLR, tied rank transformed data that assume quantitative data are given in the ordinal scale more often produced lower error values than error values produced by ANN, NBC, LDA, SVM, KNN and MLR with original data that assume quantitative data are given in the interval scale.

As tied rank transform requires high execution time and high memory to estimate the ranks of previously unseen instances, we introduced Quantilization that requires lower execution time and lower memory than tied rank transform. In our ANN, NBC, LDA, SVM, KNN and MLR experiments, the number of cases where Quantilization produced lower error values than that produced by original data is greater than the number of cases where original data produced lower error than Quantilization.

As discussed above, this thesis contains our experiments, results and analysis and conclusions. The summary of the contributions of this thesis is as follows.

1. We showed that violations of interval scale assumption are present very often in real world datasets and both supervised learning and unsupervised learning algorithms are often affected by such violations.

2. In the tested algorithms, with respect to the previously mentioned performance measures, assuming quantitative data are given in the ordinal scale and using rank transform produces better results more often than assuming quantitative data are given in the interval scale.

3. We introduced a new unsupervised learning methodology, unsupervised stochastic forest (usForest) that assumes quantitative data are given in the ordinal scale. We showed that it can produce higher F-Measure values in DBScan clustering and good precision@50 values in CBMIR. It require less execution time and less memory than unsupervised random forest. To produce a reliable estimate of respective expected values in similarity calculations, usForest needs fewer trees than the number of trees required for unsupervised random forest. usForest needs less execution time and memory than rank transform when previously unseen instances are used. We extended usForest for qualitative data as well.

4. We introduced Quantilization that assumes only that quantitative data are given in the ordinal scale. Quantilization needs less execution time and less memory than rank transform. In tested classification algorithms, Quantilization often produced lower error than the original data.

Measurement theory provides analyses to determine the scale types of attributes. However, in data mining, those analyses are often overlooked and analyses are performed that implicitly assume without justification that attributes are interval scale. This may lead to two problems. First, initial assumptions made on scale types may not be correct. Second, calculations in data mining produce derived scales in which scale type identification can be difficult. Thus, there is potential both for assumptions made about the raw data and interpretations of derived values to be incorrect. Incorrect assumptions may produce sub-optimal results.

In this thesis, we show that both supervised and unsupervised learning algorithms that assume data are given in the interval scale frequently produce sub-optimal results when that assumption is violated. Then, we argue that often original quantitative data may not belong to the interval scale.

Operations defined for the ordinal scale are also valid for the interval and ratio scales. Therefore, we argue that assuming quantitative data are given in the ordinal scale is often more effective than assuming quantitative data are given in the interval scale. The main objectives of our research are, assessing effects of violations of the interval scale assumption on data mining algorithms and finding effective and efficient measures that are invariant to such violations.

We discuss our experiments on measures that assume quantitative data are given in the ordinal scale. We tried unsupervised random forest and rank transformed data with unsupervised learning tasks: DBScan clustering and content based multimedia information retrieval (CBMIR). When compared with cityblock, Euclidean, cosine and Chebychev distances, unsupervised random forest often produced lower F-Measure values and lower precision@50 values in DBScan clustering and CBMIR respectively. We identified the use of synthetic data in unsupervised random forest as the reason for the lower F-Measure values and lower precision@50 values. In contrast to that, rank transformed data used with cityblock distance often produced higher or competitive F-Measure values in DBScan clustering and higher or competitive precision@50 values in CBMIR when compared with cityblock, Euclidean, cosine and Chebychev distance with original data. Cityblock, Euclidean, cosine and Chebychev distances with original data assume quantitative data are given in the interval scale whereas cityblock distance with rank transformed data assumes that quantitative data are given in ordinal scale.

In addition to being the cause for producing lower F-Measure and precision@50 values in DBScan clustering and CBMIR, synthetic data add additional disadvantages to unsupervised random forest. In unsupervised random forest, synthetic data is also the reason for its high execution time, high memory requirements and high number of zero similarity pairs in calculated similarity matrices. The main disadvantage of using rank transformed data is that to estimate the ranks of previously unseen instances requires high execution time and memory.

Then we introduced a new unsupervised learning methodology, unsupervised stochastic forest(usForest) that assume quantitative data are given in the ordinal scale. In DBScan clustering, usForest produced higher F-Measure values than F-Measure values produced by cityblock, Euclidean, cosine, Chebychev distances with original data, cityblock, Euclidean, cosine, Chebychev distances with rank transformed data and unsupervised random forest.

In contrast to unsupervised random forest, usForest does not use synthetic data. Hence, for a given task, usForest requires less execution time and less memory than unsupervised random forest. In similarity matrices, usForest produces lower number of zero similarity instance pares when compared with unsupervised random forest. In similarity calculations, usForest produced better estimate for expected values with lower number of trees when compared with unsupervised random forest. When compared with rank transformed data, usForest requires less execution time and less memory to calculate similarity values for previously unseen data.

We identified that usForest's capability to contrast nearest neighbours of a given instance from the rest of the instances is the reason for producing higher F-Measure values in DBScan clustering which rely on nearest neighbour discovery. Accordingly, we predicted that algorithms that rely on nearest neighbour identification can specially benefit from usForest. We extended usForest for datasets that have both quantitative and qualitative attributes. We showed that, in DBScan clustering with datasets that have both qualitative and quantitative attributes, extended version of usForest produced higher F-Measure values when compared with cityblock, Euclidean, cosine, Chebychev distances and unsupervised random forest. Then we tested usForest for K-nearest neighbour classification which is another algorithm that rely on nearest neighbour identification. In K-nearest neighbour classification, the number of cases where usForest produced lower classification error values than Euclidean distance with original data is higher than the number of cases where Euclidean distance with original data produced lower classification error values than usForest. Even when compared with Euclidean distance with rank transformed data, in more cases usForest produced lower classification error values.

In our supervised learning experiments we tested popular classifiers: artificial feed forward neural networks (ANN), naive Bayes classifier (NBC), linear descriminant analysis (LDA), support vector machines (SVM), K-nearest neighbour classification (KNN) and multinomial logistic regression (MLR). We used log loss to measure the performance of ANN, NBC, LDA and MLR that produce posterior class probabilities. Classification error was used to measure performances of SVM and KNN that usually predict class labels. We found that all the tested classification algorithms, i.e. ANN, NBC, LDA, SVM, KNN and MLR assume that quantitative data are given in the interval scale and they are affected by violations of the interval scale assumption. In ANN, NBC, LDA, SVM, KNN and MLR, tied rank transformed data that assume quantitative data are given in the ordinal scale more often produced lower error values than error values produced by ANN, NBC, LDA, SVM, KNN and MLR with original data that assume quantitative data are given in the interval scale.

As tied rank transform requires high execution time and high memory to estimate the ranks of previously unseen instances, we introduced Quantilization that requires lower execution time and lower memory than tied rank transform. In our ANN, NBC, LDA, SVM, KNN and MLR experiments, the number of cases where Quantilization produced lower error values than that produced by original data is greater than the number of cases where original data produced lower error than Quantilization.

As discussed above, this thesis contains our experiments, results and analysis and conclusions. The summary of the contributions of this thesis is as follows.

1. We showed that violations of interval scale assumption are present very often in real world datasets and both supervised learning and unsupervised learning algorithms are often affected by such violations.

2. In the tested algorithms, with respect to the previously mentioned performance measures, assuming quantitative data are given in the ordinal scale and using rank transform produces better results more often than assuming quantitative data are given in the interval scale.

3. We introduced a new unsupervised learning methodology, unsupervised stochastic forest (usForest) that assumes quantitative data are given in the ordinal scale. We showed that it can produce higher F-Measure values in DBScan clustering and good precision@50 values in CBMIR. It require less execution time and less memory than unsupervised random forest. To produce a reliable estimate of respective expected values in similarity calculations, usForest needs fewer trees than the number of trees required for unsupervised random forest. usForest needs less execution time and memory than rank transform when previously unseen instances are used. We extended usForest for qualitative data as well.

4. We introduced Quantilization that assumes only that quantitative data are given in the ordinal scale. Quantilization needs less execution time and less memory than rank transform. In tested classification algorithms, Quantilization often produced lower error than the original data.