Analysing Transportation Data with Open Source Big Data Analytic Tools

Big data analytics allows vast amounts of structured and unstructured data to be processed effectively so that correlations, hidden patterns, and other useful information can be mined from the data. Several open source big data analytic tools that can perform tasks such as dimensionality reduction, feature extraction, transformation, and optimization are now available. One interesting area where such tools can provide effective solutions is transportation. Big data analytics can be used to efficiently manage transport infrastructure assets such as roads, airports, bus stations, or ports. In this paper, an overview of two open source big data analytic tools is first provided, followed by a simple demonstration of the application of these tools to a transport dataset.


Introduction
Over the years, the traditional way of processing and analyzing data was to take data from operational systems such as Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), or Supply Chain Management (SCM) systems and centralize it in a Data Warehouse. This data was structured in nature, and business intelligence tools enabled businesses to define key metrics and get answers to already known issues [1]. Businesses are now being overloaded with modern sources of information such as social networks, online media, SMS, email, blogs, and mobile activities.
With the urge to remain competitive, businesses can no longer rely on traditional methods of decision making and must therefore be able to process and analyse all possible sources of information to ensure business continuity [2]. This large set of data is referred to as Big Data, which, unlike traditional data frameworks, combines both structured and unstructured data. Big Data is characterized by 4 V's:
a. Volume: Very large data sets with both structured and unstructured data, which can be in the order of Terabytes, Petabytes, etc.
b. Velocity: Speed at which data arrives or is generated. For example: high speed data flow from IoT sensors, Twitter feeds, Facebook likes, among others.
c. Variety: Data comes from different sources. For example: smartphones, wearable devices, IoT devices and sensors, and other mobile devices.
d. Veracity: Not all data being captured is of good quality. Part of this data may carry some level of uncertainty.
Some emerging big data applications are:
a. Healthcare: W. Raghupathi and V. Raghupathi [3] outlined an architectural framework and methodology that describes the potential and promise of big data analytics in healthcare. This framework would enable healthcare providers to obtain insight from their clinical and other data repositories, and make informed decisions. One such example would be diagnosing and treating patients in cost-effective ways by analyzing patient records and disease patterns, as well as speeding up the development of vaccines. Additionally, D. Madhavi and B. V. Ramana [4] proposed a de-identified personal health care system which executes MapReduce Pig queries on health care datasets.
b. Manufacturing: J. Lee et al. [5] highlight the trends of the industrial big data environment and smart manufacturing. The key impact areas they identified are: machine health prediction, transparency and organization across production lines, reduced labour costs, and optimized machine maintenance.
c. Traffic management: Y. Lv et al. [6] proposed a deep architecture model using autoencoders as building blocks to represent traffic flow features for prediction. Their experiments demonstrate that the proposed method for traffic flow prediction has better performance. The main objectives of traffic management are: better travel decisions, reduced traffic congestion and carbon emissions, and improved traffic flow.
Many organizations have the expertise and equipment for handling large quantities of structured data. However, the faster flows and increasing volumes of data leave them unable to "mine" the data and derive actionable intelligence in a timely way. Not only is the volume of this data growing too fast for traditional analytics, but the speed with which it arrives and the variety of data types necessitate new types of data processing and analytic solutions [7]. Big data doesn't always fit into neat tables of rows and columns. There are also many new data types, both structured and unstructured, that can be processed to yield insight into a business or condition [8]. Popular machine learning toolkits such as R [9] or Weka [10] were not built for these kinds of workloads. Although Weka has distributed implementations of some algorithms available, it is not on the same level as tools that were designed and built from the start for terabyte-scale data. Some of the open-source Big Data analytics tools are: Mahout [11,12], MLlib [13,14], H2O [15], SAMOA [16], and SparkR [17].
The remainder of this paper is organised as follows. Section II gives an overview of the open source analytical tools. Section III gives a detailed overview of related works. Section IV presents the application and testing of the two proposed tools. Finally, Section V concludes the paper.

Big Data in Transportation
Researchers have developed a complete transportation decision-making system called TransDec for the city of Los Angeles. The system acquires data from different sources in real time, and the amount of data that arrives per minute is around 46 megabytes [18]. The system gathers traffic occupancy and volume data from more than 8,900 traffic loop detectors. Data from buses and trains is also collected; it is detailed and contains GPS locations updated every two minutes and delay information calculated by taking into account pre-defined schedules. Information from ramp meters located at the entrances of highways is also used. The system also accepts text format information about traffic incidents. All the data is then cleaned, reducing the input rate from 46 megabytes per minute to 25 megabytes per minute. Analytical techniques are applied to produce precise traffic patterns. TransDec can also predict clearance times and resulting traffic jams after accidents [19].
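To put the TransDec figures in perspective, the short Python sketch below converts the quoted per-minute rates into daily volumes and computes the reduction achieved by cleaning. It assumes the rates are sustained around the clock and uses decimal units (1 GB = 1000 MB); neither assumption is stated in the source, so the results are rough orders of magnitude only.

```python
# Rates quoted in the text, in megabytes per minute.
RAW_MB_PER_MIN = 46
CLEANED_MB_PER_MIN = 25
MINUTES_PER_DAY = 24 * 60


def daily_volume_gb(mb_per_min: float) -> float:
    """Daily volume in GB, assuming a constant rate and 1 GB = 1000 MB."""
    return mb_per_min * MINUTES_PER_DAY / 1000


def reduction_fraction(raw: float, cleaned: float) -> float:
    """Fraction of the raw input rate removed by cleaning."""
    return (raw - cleaned) / raw


raw_gb = daily_volume_gb(RAW_MB_PER_MIN)
cleaned_gb = daily_volume_gb(CLEANED_MB_PER_MIN)
reduction = reduction_fraction(RAW_MB_PER_MIN, CLEANED_MB_PER_MIN)

print(f"raw: {raw_gb:.2f} GB/day")
print(f"cleaned: {cleaned_gb:.2f} GB/day")
print(f"reduction: {reduction:.1%}")
```

Under these assumptions, the raw stream amounts to roughly 66 GB per day, and cleaning removes a little under half of it.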
London is a growing city with a population of around 8.6 million, which is expected to reach 10 million by 2030. Transport for London is using big data to keep London moving and to plan for the challenges arising from a growing city. The system gathers data from "Oyster" cards, the GPS locations of around 9,200 buses, 6,000 traffic lights, and 1,400 cameras [19]. Transport for London uses big data tools to develop accurate travel patterns of customers across rail and bus networks. The information is then used by authorities to plan closures and diversions. The system also enables authorities to send targeted emails to customers, providing them with alternative routes and thus minimizing the impact of scheduled and unscheduled changes [20].
In Sweden, the Stockholm train operators are using big data analytics to predict train delays up to two hours before they arise. The traffic control centre can then decide to make additional trains available to remedy the situation [21].
The Land Transport Authority (LTA) of Singapore is using big data to better serve its customers. LTA uses data generated from logins to the public Wi-Fi system available in MRT stations to produce a real-time crowd heat map for each platform. Additional trains are added to the network if needed [22].

Open Source Analytic Tools
The open-source Big Data analytics tools mentioned above (Mahout, MLlib, H2O, SAMOA, and SparkR) are all scalable frameworks that contain abstractions for machine learning algorithms on streaming data. The setup of these distributed frameworks is very time-consuming. Of the tools listed, the easiest to use are H2O and SparkR with MLlib. H2O can be run on a single local machine and has a Graphical User Interface (GUI) with all the necessary documentation on the steps to follow for data processing and analysis. The Databricks community edition provides an online interface where users can create their own notebooks in R, among other languages, using the Spark framework for real-time analytics. As such, the processes for transport data analysis with H2O and SparkR are presented in this work.

H2O
H2O is a fast, scalable, open source software platform for distributed in-memory predictive analytics, machine learning, and deep learning. It is written in pure Java and released under the Apache v2 open source licence. It provides simple deployment with a single jar and automatic cloud discovery. H2O allows data to be used without sampling and provides reliable predictions more quickly. For this reason, it is suitable for several organisations such as PayPal, Nielsen, Cisco, etc. [23].
H2O can support billions of data rows in memory even if the cluster size is relatively small. This is made possible by the use of sophisticated in-memory compression techniques. The H2O platform has its own built-in Flow web interface, which makes analytic workflows user-friendly for users who do not have an engineering background. It also includes interfaces for R, Python, Scala, Java, JSON, and CoffeeScript/JavaScript. The H2O platform was built alongside (and on top of) both Hadoop and Spark clusters and is typically deployed within minutes [23][24][25].
Several common machine learning algorithms are supported by H2O. Examples include Generalized Linear Modelling (GLM), such as linear regression and logistic regression, as well as Naive Bayes, principal components analysis, time series analysis, and k-means clustering. Best-in-class algorithms such as Random Forest, Gradient Boosting, and Deep Learning at scale are also implemented by H2O [23][24][25]. A typical architecture for H2O is shown in Figure 1. As observed in Figure 1, it supports several data sources, distributed memory and tasks, a range of algorithms, and multiple front-ends [24].

SparkR
R is a common tool for building machine learning models. However, its effectiveness is constrained by the processing power of a single machine. It is now possible to handle complex machine learning problems with the power of clustered computers using a dedicated library called MLlib, which is provided by Apache Spark. Spark MLlib is an open source API that is part of the Apache Software Foundation. Spark DataFrames and MLlib provide tooling that makes it easier to integrate existing workflows developed with tools such as R and Python into Spark. For example, SparkR allows users to call MLlib algorithms using familiar R syntax [26]. As observed in Figure 2, Spark supports programming platforms such as R, SQL, Python, Scala, and Java. Additionally, it has several libraries which provide functionality such as graph computation, stream data processing, and real-time interactive query processing in addition to machine learning [26].
MLlib provides fast, distributed implementations of common learning algorithms. Various linear models are available to address regression problems. To cater for classification problems, powerful algorithms such as Naïve Bayes, Random Forest, and Decision Tree are provided. For collaborative filtering, alternating least squares with explicit and implicit feedback can be used. Unsupervised learning algorithms such as K-Means, and Principal Component Analysis (PCA) for dimensionality reduction, are also part of MLlib. A number of low-level primitives and basic utilities for convex optimization, statistical analysis, feature extraction, and distributed linear algebra are also provided in the library [27].

Application and Testing
In this section, the open source tools H2O and SparkR on Databricks are used to perform analytics on the transport-related data obtained from [28]. A detailed explanation of the configuration and code is given in the following sub-sections. The machine learning algorithm used in both H2O and SparkR on Databricks is the Generalised Linear Model (GLM).
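For reference, the general form of a GLM can be written as below. The notation is the standard textbook one and is not taken from the source: $y$ is the response (here, presumably the "Speed Value"), $x_1, \dots, x_p$ are the predictors, $\beta_0, \dots, \beta_p$ are the coefficients to be estimated, and $g$ is the link function.

```latex
g\bigl(\mathbb{E}[\,y \mid x_1, \dots, x_p\,]\bigr) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j
```

With a Gaussian response and the identity link $g(\mu) = \mu$, this reduces to ordinary linear regression. The source does not state which family and link were used, so this choice is only an assumption suited to a continuous target such as speed.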

H2O
The web-based user interface of H2O can be accessed by executing the jar file and opening the specified URL in a web browser, as explained in [25]. The dataset to be used can be imported using the importFiles option in the list on the homepage of H2O, as shown in Figure 3. The data frame is then split into a training set and a test set: 75% of the data is used to train the model and 25% is reserved for testing purposes. Figure 4 shows the section where the frame is selected and the percentage splits are specified. With the trained model obtained, prediction can be performed on the test set data. The predicted and already known "Speed Value" can then be compared. Figure 7 shows the step where the model is used to predict the "Speed Value" for the test set. The predicted "Speed Value" can be merged with the test data set and compared with the known value. Figure 8 shows part of the combined frame.
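The workflow above (75/25 split, model training, prediction on the test set, and merging predictions with known values) can be illustrated with a minimal sketch in plain Python. This is not the H2O API and does not reproduce the Flow steps; it fits a single-predictor least-squares line, which corresponds to a GLM only under the assumption of a Gaussian family with identity link. The data are synthetic, and the names `travel_time` and `speed_value` are hypothetical stand-ins for the dataset's columns.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_linear(rows):
    """Least-squares fit of y = intercept + slope * x on (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(model, x):
    """Apply the fitted (intercept, slope) pair to one predictor value."""
    intercept, slope = model
    return intercept + slope * x


# Synthetic (travel_time, speed_value) pairs: speed_value = 60 - 2 * travel_time.
data = [(float(t), 60.0 - 2.0 * t) for t in range(20)]

train, test = train_test_split(data)
model = fit_linear(train)

# Combined frame: predictor, known speed value, predicted speed value.
combined = [(x, y, predict(model, x)) for x, y in test]

# Root mean squared error between known and predicted values.
rmse = (sum((y - p) ** 2 for _, y, p in combined) / len(combined)) ** 0.5
print(len(train), len(test), rmse)
```

Fixing the shuffle seed makes the split reproducible, so the comparison between known and predicted values can be repeated on the same test rows.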

SparkR with Databricks
The dataset to be used can be imported using the Create Table option in the list on the homepage of Databricks as shown in Figure 9.
Figure 9. Import data set on the cluster to be processed in Databricks

The code to read the data into a data frame for further processing is shown in Figure 10. The NA values are replaced by zeros in this case, and the data frame is then split into training (75%) and test (25%) sets as shown in Figure 11. The Generalized Linear Model (GLM) is built as shown in Figure 12. With the trained model obtained, prediction can be performed on the test set data. The predicted and already known "Speed Value" can then be compared. Figure 13 shows the step where the model is used to predict the "Speed Value" for the test set. The predicted "Speed Value" can be merged with the test data set and compared with the known value. Figure 14 shows part of the combined frame.

different datasets, e.g., health data, manufacturing among others. Furthermore, these tools can be used with infrastructures such as Hadoop to handle massive datasets from several sources.

Figure 3. Import data set on the cluster to be processed in H2O

Figure 6. Output metrics for training model

Figure 7. Prediction using training model on test data set

Figure 8. Part of combined data frame of test data set and predicted "Speed Value"

Figure 10. Read data into data-frame with SparkR