PSPP is an open source statistical analysis and data mining tool, designed as a free alternative to IBM’s SPSS. It is very similar to SPSS and includes most of its features. PSPP can process up to 1 billion cases and variables, offers both a graphical and a terminal user interface, and can import data from spreadsheets, text files and databases. Most notably, PSPP has no license fees or expiration period. It runs on Windows, Linux and Mac OS X.

This article reviews some of PSPP’s statistical tests, using the server logs recorded for this blog in October 2010. The sample data covers 1503 unique visitors to the blog. The variables recorded include total hits to the blog, unique page hits, kilobytes of data downloaded, country of origin, time spent on the blog and the search term used to find the blog.

Single Variable Analysis

The table below shows the output from PSPP’s ‘Frequencies’ procedure, which is used for analysing a single categorical variable. In this case we are comparing the countries from which users visited the blog. The results show that the largest group of users (36.79%) came from the United States, with China and Great Britain in equal second place (13.31% each).

Country Of Origin
Value Label Value Frequency Percent Valid Percent Cum Percent
United States 0 553 36.79 36.79 36.79
China 1 200 13.31 13.31 50.10
Great Britain 2 200 13.31 13.31 63.41
Australia 3 100 6.65 6.65 70.06
Poland 4 100 6.65 6.65 76.71
Czech Republic 5 50 3.33 3.33 80.04
Germany 6 50 3.33 3.33 83.37
Brazil 7 50 3.33 3.33 86.69
Canada 8 50 3.33 3.33 90.02
India 9 50 3.33 3.33 93.35
Russian Federation 10 50 3.33 3.33 96.67
Netherlands 11 50 3.33 3.33 100.00
Total 1503 100.0 100.0

Table produced by PSPP.
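
The Percent and Cum Percent columns are simple to reproduce by hand, which is a handy way to sanity-check the output. The sketch below is only an illustration in Python (it is not how PSPP computes the table internally); the counts are copied from the table above and the function name `frequency_table` is made up for this example.

```python
# Visitor counts per country, copied from the Frequencies table above.
counts = {
    "United States": 553,
    "China": 200,
    "Great Britain": 200,
    "Australia": 100,
    "Poland": 100,
    "Czech Republic": 50,
    "Germany": 50,
    "Brazil": 50,
    "Canada": 50,
    "India": 50,
    "Russian Federation": 50,
    "Netherlands": 50,
}


def frequency_table(counts):
    """Return (label, frequency, percent, cumulative percent) rows."""
    total = sum(counts.values())
    rows = []
    running = 0
    for label, freq in counts.items():
        running += freq
        rows.append((label, freq, 100 * freq / total, 100 * running / total))
    return rows


rows = frequency_table(counts)
for label, freq, percent, cumulative in rows:
    print(f"{label:<20}{freq:>6}{percent:>8.2f}{cumulative:>8.2f}")
```

Rounded to two decimal places, the percentages printed here should agree with the Percent and Cum Percent columns reported by PSPP.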

Chart produced using the Google Visualisation API.

Next we look at the ‘Explore’ procedure, which is used for analysing metric (numerical) variables, for example the total number of hits made by each visitor. The Descriptives table shown below gives us some useful information. It indicates that the mean number of hits made to the blog was 5 (rounded from 5.05), the median was 5, the minimum was 1 (otherwise the visit could not have been recorded) and the maximum was 15.

Descriptives
Total number of hits made to the blog
                                                 Statistic   Std. Error
Mean                                                  5.05          .05
95% Confidence Interval for Mean   Lower Bound        4.95
                                   Upper Bound        5.15
5% Trimmed Mean                                       4.97
Median                                                5.00
Variance                                              4.25
Std. Deviation                                        2.06
Minimum                                               1.00
Maximum                                              15.00
Range                                                14.00
Interquartile Range                                   2.00
Skewness                                               .65          .06
Kurtosis                                               .92          .13

Table produced by PSPP.
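
The standard error and the 95% confidence interval in the table follow directly from the mean, the standard deviation and the sample size. Here is a minimal Python sketch of that arithmetic. It assumes a normal approximation for the critical value (with 1503 cases this is practically the same as the t distribution), and the function name `mean_confidence_interval` is invented for this example.

```python
from math import sqrt
from statistics import NormalDist

# Values copied from the Descriptives table above.
n = 1503
mean = 5.05
std_dev = 2.06


def mean_confidence_interval(mean, std_dev, n, level=0.95):
    """Return (standard error, lower bound, upper bound) for the mean.

    Assumption: uses a normal critical value rather than a t value,
    which is a close approximation for a sample this large.
    """
    std_error = std_dev / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return std_error, mean - z * std_error, mean + z * std_error


std_error, lower, upper = mean_confidence_interval(mean, std_dev, n)
print(f"Std. Error: {std_error:.2f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

Rounded to two decimals, this gives the same standard error (.05) and bounds (4.95 and 5.15) as the PSPP output.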

The ‘Explore’ procedure can also produce a percentile analysis. The table below shows that the 25th percentile is 4 pages, the median is 5 pages and the 75th percentile is 6 pages. In other words, about three quarters of visitors viewed at least 4 pages, half viewed up to 5 pages and about a quarter viewed 6 pages or more.

Percentiles
Percentiles
5 10 25 50 75 90 95 25 50 75
Total number of hits made to the blog HAverage 2.00 3.00 4.00 5.00 6.00 8.00 9.00 4.00 5.00 6.00
Tukey’s Hinges 4.00 5.00 6.00 4.00 5.00 6.00

Table produced by PSPP.
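
To get a feel for how quartiles are calculated, here is a small Python sketch. The raw log data is not reproduced in this article, so the `hits` list is a hypothetical sample and not the blog's real data. Python's "exclusive" method interpolates at position (n + 1) x p in the sorted data; treating that as comparable to the HAverage row is an assumption on my part.

```python
from statistics import quantiles

# Hypothetical hit counts for nine visitors (NOT the blog's real data).
hits = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into four equal groups and returns the three cut
# points: the 25th, 50th and 75th percentiles.
q1, median, q3 = quantiles(hits, n=4, method="exclusive")

print(f"25th percentile: {q1}")
print(f"50th percentile: {median}")
print(f"75th percentile: {q3}")
```

Passing `n=100` instead would return all 99 percentile cut points, from which the 5th, 10th, 90th and 95th could be picked out.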

Complementing the ‘Explore’ procedure, PSPP can produce a histogram. Histograms help us to visualise the distribution of a metric variable. Our histogram shows that the distribution is approximately symmetric, which means we should use the mean when reporting the average number of hits to the site. If the histogram were not symmetric, the median would be a better value to use for the average.

Chart produced by PSPP.

SPSS has one distinct advantage over PSPP when using the ‘Explore’ procedure: it can produce box plots. Box plots are another great way to visualise the distribution of a metric variable. The box plot below was produced using the Google Visualisation API with the data produced by PSPP. The bottom and top markers represent the minimum and maximum number of hits made by visitors. The box represents the hits between the 25th and 75th percentiles (the middle 50% of visitors) and the line through the middle of the box represents the median.

Chart produced using the Google Visualisation API.
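
All the numbers needed to draw a box plot are already in the Descriptives and Percentiles tables. The Python sketch below collects them and derives the range and interquartile range. It also computes the 1.5 x IQR "fences" that many box plot implementations use to flag outliers; that is a common convention and an addition of mine, not something the chart above uses (its whiskers sit at the minimum and maximum). The function name `box_plot_geometry` is made up for this example.

```python
# Five-number summary, copied from the Descriptives and Percentiles tables.
summary = {
    "minimum": 1.0,
    "q1": 4.0,
    "median": 5.0,
    "q3": 6.0,
    "maximum": 15.0,
}


def box_plot_geometry(s):
    """Derive the range, IQR and Tukey-style outlier fences."""
    iqr = s["q3"] - s["q1"]
    return {
        "range": s["maximum"] - s["minimum"],
        "iqr": iqr,
        # Assumption: the usual 1.5 x IQR rule for flagging outliers.
        "lower_fence": s["q1"] - 1.5 * iqr,
        "upper_fence": s["q3"] + 1.5 * iqr,
    }


geometry = box_plot_geometry(summary)
for name, value in geometry.items():
    print(f"{name}: {value:.2f}")
```

The range and interquartile range match the values PSPP reports (14.00 and 2.00). Under the 1.5 x IQR convention, visitors with more than 9 hits would be drawn as outliers.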

Hypothesis Testing

As well as single variable analysis, PSPP lets us test hypotheses. For example, we might hypothesise that visitors from the United States spent more time on the blog than visitors from China, because the blog is written in English. To test this we can perform an independent samples t-test in PSPP. The table below shows that US visitors spent a mean of 1.36 minutes on the site, higher than Chinese visitors, who spent 0.83 minutes. However, the output also shows that the difference is not statistically significant: the significance value (highlighted in red) is higher than 0.05. Levene’s test is itself significant (Sig. .00), which suggests the two groups have unequal variances, so the ‘Equal variances not assumed’ row is the safer one to read; its significance of .15 is still above 0.05. The 95% confidence interval for the difference runs from -0.19 to 1.25 minutes. This means the true difference in the whole population could plausibly lie anywhere between US visitors spending 0.19 minutes less and 1.25 minutes more than Chinese visitors. Because this interval includes zero, we cannot draw any conclusions in this case.

Group Statistics
DURATION
COUNTRY           N      Mean    Std. Deviation    S.E. Mean
United States     553    1.36    6.44              .27
China             200    .83     3.49              .25

Independent Samples Test
DURATION
Levene’s Test for Equality of Variances
                               F       Sig.
Equal variances assumed        9.40    .00

t-test for Equality of Means (Lower and Upper are the 95% Confidence Interval of the Difference)
                               t       df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower   Upper
Equal variances assumed        1.10    751.00    .27               .53               .37                     -.19    1.25
Equal variances not assumed    1.43    640.76    .15               .53               .37                     -.20    1.25

Table produced by PSPP.
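
The ‘Equal variances not assumed’ row can be roughly reproduced from the Group Statistics alone. The Python sketch below applies the standard Welch formulas to the rounded figures from the table, so small differences from PSPP's output are expected. The p-value uses a normal approximation to the t distribution, which is my simplification (reasonable with around 640 degrees of freedom), and the function name `welch_t_test` is invented for this example.

```python
from math import sqrt
from statistics import NormalDist


def welch_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t-test from summary statistics (unequal variances)."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1
    v2 = sd2 ** 2 / n2  # squared standard error of group 2
    diff = mean1 - mean2
    std_error = sqrt(v1 + v2)
    t = diff / std_error
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    # Assumption: normal approximation to the t distribution (large df).
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return diff, std_error, t, df, p_approx


# Rounded values copied from the Group Statistics table above.
diff, std_error, t, df, p_approx = welch_t_test(1.36, 6.44, 553, 0.83, 3.49, 200)

print(f"Mean difference: {diff:.2f}")
print(f"Std. error of difference: {std_error:.2f}")
print(f"t: {t:.2f}, df: {df:.2f}")
print(f"Approximate two-tailed significance: {p_approx:.2f}")
```

Because the inputs are rounded to two decimals, t and df come out slightly different from PSPP's 1.43 and 640.76, but the conclusion is the same: the significance is well above 0.05.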

Looking for Relationships

As well as t-tests, PSPP can perform regression analysis, which is useful for identifying relationships between two metric (numerical) variables. Suppose we suggest that as a visitor views more pages, more data is downloaded from the server. The Correlations table below shows that the correlation between these variables is 1.00 (highlighted in red), which indicates a very strong (here effectively perfect) linear relationship. The Coefficients table shows that the slope is 22.28 (highlighted in blue): on average, every additional page hit meant an extra 22.28 KB of data downloaded from the server. The significance of the test is reported as .00 (highlighted in green); as this is less than 0.05, we can conclude that the relationship is significant. As expected, more data is downloaded as page hits increase. This is perhaps a little obvious, but a test like this can help validate the data.

Correlations
Total number of hits made to the blog Bandwidth downloaded by user (kilobytes)
Total number of hits made to the blog Pearson Correlation 1.00 1.00
    Sig. (2-tailed) .00
    N 1503 1503
Bandwidth downloaded by user (kilobytes) Pearson Correlation 1.00 1.00
    Sig. (2-tailed) .00
    N 1503 1503

Model Summary
R R Square Adjusted R Square Std. Error of the Estimate
1.00 1.00 1.00 .00

ANOVA
Sum of Squares df Mean Square F Significance
Regression 3166653 1 3166653 9.7E+015 .00
Residual .00 1501 .00
Total 3166653 1502

Coefficients
B Std. Error Beta t Significance
(Constant) .00 .00 .00 .00 1.00
Total number of hits made to the blog 22.28 .00 1.00 98362442 .00

Table produced by PSPP.
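
For anyone curious how the slope and correlation are obtained, here is a minimal least-squares sketch in Python. The raw log data is not included in this article, so `hits` is a hypothetical sample and `kilobytes` is generated at exactly 22.28 KB per hit to mimic the perfectly linear relationship reported above. The function name `least_squares` is made up for this example.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (slope, intercept, Pearson r) for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical visitors (NOT the blog's real data): each hit is assumed
# to download exactly 22.28 KB, so the points lie on a straight line.
hits = [1, 3, 4, 5, 5, 6, 8, 15]
kilobytes = [22.28 * h for h in hits]

slope, intercept, r = least_squares(hits, kilobytes)
print(f"Slope: {slope:.2f} KB per hit")
print(f"Intercept: {intercept:.2f}")
print(f"Pearson correlation: {r:.2f}")
```

With perfectly linear data like this the residuals are zero, which is why the Model Summary and ANOVA tables above report an R Square of 1.00 and a residual sum of squares of .00.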

Finally, PSPP allows us to perform a Crosstabs analysis, which helps to identify relationships between two categorical variables. In this case we have produced a crosstabs analysis comparing the search term used to find the blog with the country of origin. One of the search terms was related to a blog post describing how to use Google’s language API. We might assume that visitors from non-English-speaking countries would be more likely to search for this than visitors from English-speaking countries, as most of the web’s content is in English. The crosstabs table below suggests this is the case. However, a chi-square test indicates that the relationship is not statistically significant: the significance value is greater than 0.05 (highlighted in red). Unfortunately we do not have enough information to draw any conclusions.

Summary.
Cases
Valid Missing Total
N Percent N Percent N Percent
Search Term * Country Of Origin 723 48.1% 780 51.9% 1503 100.0%
COUNTRY
SEARCH United States China Great Britain Australia Poland Czech Republic Germany Brazil Canada India Russian Federation Netherlands Total
as3 iterate through display objects 33.0 12.0 11.0 4.0 4.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 78.0
12.4% 12.5% 11.5% 8.3% 8.3% 8.3% 8.3% 8.3% 8.3% 8.3% 8.3% 8.3% 10.8%
curl web crawler in php 33.0 12.0 10.0 6.0 6.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 88.0
12.4% 12.5% 10.4% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.2%
google language api php example 33.0 12.0 15.0 8.0 8.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 104.0
12.4% 12.5% 15.6% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 14.4%
google web toolkit animation effects 49.0 16.0 16.0 8.0 8.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 125.0
18.4% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 17.3%
iterating dom childnodes 31.0 12.0 12.0 6.0 6.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 88.0
11.6% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.2%
kinematics in flash animation 19.0 8.0 8.0 3.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 47.0
7.1% 8.3% 8.3% 6.3% 4.2% 4.2% 4.2% 4.2% 4.2% 4.2% 4.2% 4.2% 6.5%
papervision 3d rotate cube 33.0 12.0 12.0 7.0 8.0 4.0 4.0 4.0 4.0 4.0 4.0 3.0 99.0
12.4% 12.5% 12.5% 14.6% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 16.7% 12.5% 13.7%
zend amf example 36.0 12.0 12.0 6.0 6.0 3.0 3.0 3.0 3.0 3.0 3.0 4.0 94.0
13.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5% 16.7% 13.0%
Total 267.0 96.0 96.0 48.0 48.0 24.0 24.0 24.0 24.0 24.0 24.0 24.0 723.0
100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
Chi-square tests.
Statistic Value df Asymp. Sig. (2-tailed)
Pearson Chi-Square 10.29 77 1.00
Likelihood Ratio 10.49 77 1.00
Linear-by-Linear Association .35 1 .55
N of Valid Cases 723

Table produced by PSPP.
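
The Pearson chi-square statistic can be recomputed from the counts in the crosstabs table. The Python sketch below compares each observed count with the count we would expect if search term and country were unrelated. It uses only the standard library, so it stops short of the significance value (that needs the chi-square distribution, which the standard library does not provide). The function name `pearson_chi_square` is invented for this example.

```python
# Observed counts copied from the crosstabs table above.
# Rows are search terms; columns are countries in the table's order.
observed = [
    [33, 12, 11, 4, 4, 2, 2, 2, 2, 2, 2, 2],  # as3 iterate through display objects
    [33, 12, 10, 6, 6, 3, 3, 3, 3, 3, 3, 3],  # curl web crawler in php
    [33, 12, 15, 8, 8, 4, 4, 4, 4, 4, 4, 4],  # google language api php example
    [49, 16, 16, 8, 8, 4, 4, 4, 4, 4, 4, 4],  # google web toolkit animation effects
    [31, 12, 12, 6, 6, 3, 3, 3, 3, 3, 3, 3],  # iterating dom childnodes
    [19, 8, 8, 3, 2, 1, 1, 1, 1, 1, 1, 1],    # kinematics in flash animation
    [33, 12, 12, 7, 8, 4, 4, 4, 4, 4, 4, 3],  # papervision 3d rotate cube
    [36, 12, 12, 6, 6, 3, 3, 3, 3, 3, 3, 4],  # zend amf example
]


def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom, number of cases)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (count - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, n


chi2, df, n = pearson_chi_square(observed)
print(f"Pearson Chi-Square: {chi2:.2f}, df: {df}, valid cases: {n}")
```

The statistic should land close to the 10.29 reported by PSPP, with 77 degrees of freedom. Under independence the statistic is expected to be around its degrees of freedom, so a value this far below 77 fits with the very high significance value in the table.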
