This is my fourth blog post about Oracle Big Data Cloud Service – Compute Edition. In my previous blog posts, I showed how we can create a Big Data Cloud Service – Compute Edition on Oracle Cloud, which services are installed by default, and the Ambari management service. Now it’s time to write about how we can work with data using Apache Zeppelin. Apache Zeppelin is a web-based notebook that enables interactive data analytics. Zeppelin is not the only way to work with data, but it’s surely very friendly for end users and (as I said before) it’s already installed on our Big Data Cloud Service – Compute Edition.
We can create a rule to allow access to TCP port 9995 to reach Zeppelin directly, or we can use the “big data console” provided by Oracle. I prefer the second one, because the Nginx proxy lets only authenticated users access Zeppelin.
After you reach the console, go to the notebooks page. Click “new note”, enter a name and then click “OK” – this creates a new empty notebook and opens it for editing. My new notebook’s name is “MyFirstNote”. As you can see, there are some sample notebooks; you can examine them to learn how to use Java and Spark with Zeppelin.
Zeppelin can run a specific interpreter for each note block. We can see which interpreters are installed on the system (and active for the current notebook) by clicking the rightmost button on the top menu. When we hover the mouse over the buttons, we see an explanation text for each button. Click “cancel” to close the interpreter settings and go back to editing our notebook.
The notebook blocks are like separate editors, and each of them can have a different interpreter and its own control buttons on the upper right of the block. We can run a block by clicking the play button, hide/show the commands (the arrows icon), hide/show the output (the book icon), and clear/delete or move the block via the settings button (the gears icon). When you run a block, you get the result right after the code. If it’s an SQL block, the output will not be plain text but a data grid with graph support. We already checked the available interpreters and saw there is “%sh”, which is for bash shell scripts. We can enter any bash command after we define our block as a shell block. To define a block as a shell block, we write “%sh” on the first line of the block.
Let’s start writing our first block. Copy/paste the following into your first block:
%sh
echo "Hello world :)"
After you run the block, you should see the “Hello world :)” message as output. Don’t worry, I will not end my blog post with a simple hello world example! To see the relation between blocks and some capabilities of the Zeppelin notebook, I will write another sample. This time I’ll download some data files (in CSV format) from the internet, put them into HDFS, parse them and register them as temporary tables with Spark, and run some queries.
My first block (overwrite our first block, or use any empty block):
%sh
mkdir /tmp/flight
cd /tmp/flight
wget http://stat-computing.org/dataexpo/2009/2008.csv.bz2
bunzip2 2008.csv.bz2
wget http://stat-computing.org/dataexpo/2009/carriers.csv
hdfs dfs -put -f carriers.csv
hdfs dfs -put -f 2008.csv
cd
rm -rf /tmp/flight
hdfs dfs -ls *.csv
I found this data after googling “free data sets”. The 2008.csv file contains the flight information for the year 2008, and carriers.csv contains the airline companies. Let me explain the code: first I create a temporary directory, then download the files using wget, unzip 2008.csv.bz2 because it’s compressed, upload the files to the HDFS file system (which is not strictly necessary for our sample, but I wanted to show it), remove the temporary directory, and list the uploaded files.
After we run it, we should see the files. In the next 2 blocks, we will parse these two files and register temporary tables using PySpark (the Spark Python API). Why Python? Why not Scala or pure Java? Because I’m a Python guy. I intentionally parse each CSV file in a separate code block because I thought it would be more understandable.
Our second block (parsing carriers.csv):
%pyspark
from pyspark.sql import Row

carriers_file = sc.textFile("/user/zeppelin/carriers.csv")

def carriers_mapper(line):
    # Fields are quoted in the CSV, so remove the quotes before splitting
    fields = line.replace('"', '').split(',')
    return Row(Code=str(fields[0]), Description=str(fields[1]))

carriers = carriers_file.map(carriers_mapper).toDF()
carriers.cache()
carriers.registerTempTable("carriers")
Run it, and you will see that it finishes lightning fast, because Spark transformations are lazy: nothing happens until the data is actually required. Let me explain the code: first we say we’ll use the carriers.csv file (stored in HDFS). We use the “sc” object, the automatically created SparkContext for PySpark code blocks. We define a very simple parsing function (carriers_mapper). It removes the quotes in the text (because in the CSV file, texts are quoted) and then splits the line into fields. We create a Row from these fields, and run this function for all lines in the carriers file (with the map function). We convert the RDD to a DataFrame, tell Spark to cache it, and register a temporary table for the result.
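The parsing itself is plain Python, so here is a standalone sketch (no Spark needed) of what the mapper does to a single CSV line; the sample line and the dict-instead-of-Row result are just for illustration:

```python
def carriers_mapper(line):
    # Strip the quotes that wrap every field in carriers.csv, then split on commas
    fields = line.replace('"', '').split(',')
    return {"Code": str(fields[0]), "Description": str(fields[1])}

row = carriers_mapper('"AA","American Airlines Inc."')
print(row)  # {'Code': 'AA', 'Description': 'American Airlines Inc.'}
```

In the notebook, the same function returns a pyspark.sql.Row instead of a dict, so Spark can infer column names when building the DataFrame.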
After we create it, we can use a “%sql” block to query this temporary table. You don’t need to add this one, because we will parse another CSV, create another temporary table, and query those tables together. Did you notice the first line of the result set? Our CSV includes a header line, so it’s parsed like a regular row. We could remove it, but I will not, because I want to keep my code simple, and it will be automatically filtered out when I join this table with the flight data.
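If you did want to drop that header row, one common option (my suggestion, not part of the original notebook) is to filter out the header line before mapping. In the notebook this would be something like carriers_file.filter(...) before .map(carriers_mapper); here is the idea as a standalone sketch with invented sample lines:

```python
# In PySpark: carriers_file.filter(lambda line: not line.startswith('Code'))
# Standalone illustration of the same filter on a plain Python list:
lines = ['Code,Description',
         '"AA","American Airlines Inc."',
         '"DL","Delta Air Lines Inc."']
data_lines = [line for line in lines if not line.startswith('Code')]
print(len(data_lines))  # 2
```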
Our third block (parsing 2008.csv):
%pyspark
from pyspark.sql import Row

flights_file = sc.textFile("/user/zeppelin/2008.csv")

# Data description: http://stat-computing.org/dataexpo/2009/the-data.html
#  9 UniqueCarrier      unique carrier code
# 23 CancellationCode   reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
# 25 CarrierDelay       in minutes
# 26 WeatherDelay       in minutes
# 27 NASDelay           in minutes
# 28 SecurityDelay      in minutes
# 29 LateAircraftDelay  in minutes
def flights_mapper(line):
    fields = line.replace('NA', '0').split(',')
    return Row(Code=str(fields[8]),
               CancellationCode=str(fields[22]),
               CarrierDelay=str(fields[24]),
               WeatherDelay=str(fields[25]),
               NASDelay=str(fields[26]),
               SecurityDelay=str(fields[27]),
               LateAircraftDelay=str(fields[28]))

flights = flights_file.map(flights_mapper).toDF()
flights.cache()
flights.registerTempTable("flights")
Running this code block also finishes immediately. It’s very similar to the previous one; I added the explanation of the fields as comments, and you can find more detail on the home page of this data set. This time, I replace the NA string with 0 before parsing the fields, tell Spark to cache the result in memory, and register the temporary table.
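One detail worth seeing in isolation: replace('NA', '0') is a plain substring replacement, so it rewrites “NA” wherever it appears on the line (fine for this quick sample, where the delay columns are either numbers or NA, but worth knowing). A standalone check with an invented line:

```python
# Invented sample: the 5th and 6th columns are NA, as in unreported delays
line = '2008,1,3,4,NA,NA,WN,335'
fields = line.replace('NA', '0').split(',')
print(fields[4], fields[5])  # 0 0
```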
So we downloaded 2 data files (in CSV format) in the first block, then parsed carriers.csv in the second block and 2008.csv in the third block. Now we can run an SQL query against these tables:
%sql
select carriers.Code, carriers.Description,
       avg(CarrierDelay) average,
       sum(case when CarrierDelay > 0 then 1 else 0 end) delayed,
       count(*) total
from carriers, flights
where carriers.Code = flights.Code
group by carriers.Code, carriers.Description
order by delayed/total desc
I tried to see the ratio of delayed flights for each company. The first run of this query takes about 70 seconds (on our trial Big Data Cloud Service – Compute Edition). Don’t forget, this query reads the CSV files (they are not read until we really use them), parses the fields, builds the temporary tables, and then joins them. Even if we change the query, the next runs take much less time (5-7 seconds), because the table data is cached by Spark.
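To make the SQL concrete, here is the same aggregation in plain Python on a few invented (carrier, delay) pairs: per carrier we count delayed flights and totals, then rank by the delayed/total ratio, just like the order by clause does:

```python
from collections import defaultdict

# Invented sample rows: (carrier code, carrier delay in minutes)
flights = [('AA', 0), ('AA', 15), ('AA', 30), ('DL', 0), ('DL', 0), ('DL', 45)]

stats = defaultdict(lambda: {'delayed': 0, 'total': 0})
for code, delay in flights:
    stats[code]['delayed'] += 1 if delay > 0 else 0  # sum(case when ... end)
    stats[code]['total'] += 1                        # count(*)

# order by delayed/total desc
ranked = sorted(stats.items(),
                key=lambda kv: kv[1]['delayed'] / kv[1]['total'],
                reverse=True)
print([(code, s['delayed'], s['total']) for code, s in ranked])
# [('AA', 2, 3), ('DL', 1, 3)]
```

Spark does the same grouping and sorting, only distributed across the cached RDD partitions instead of a Python dict.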
When you run a block with the %sql interpreter, the output is shown as a data grid, but using the graph buttons we can draw graphs from the result set. Just select a graph style and then open the settings to pick the columns for the X and Y axes.
That’s all for now. I hope you enjoyed my latest blog posts about Oracle Big Data Cloud Service – Compute Edition. I’ll keep writing about Oracle Big Data Cloud Service – Compute Edition, Spark, and other related subjects in the coming weeks.