About BOSS

The workshop BOSS'18 will be held in conjunction with the

44th International Conference on
Very Large Data Bases
Rio de Janeiro, Brazil • 27 August - 31 August 2018

Workshop Date
  August 27th, 2018

Following the great success of the first, second, and third Workshop on Big Data Open Source Systems (BOSS'15, BOSS'16, BOSS'17), co-located with VLDB 2015, VLDB 2016, and VLDB 2017, the fourth Workshop on Big Data Open Source Systems (BOSS'18) will again give deep-dive introductions into several active, publicly available, open-source systems. The systems will be presented in tutorials by experts, with details on installation and non-trivial usage examples.

The workshop will consist of tutorials. We will publish the tutorial proposals on the website and encourage the presenters to publish their tutorial resources; in previous editions, we published slides and links to the project websites. We encourage proposers to engage participants in a hands-on exercise that gives a quick start with the system.

Program

[Figure: tutorial flow diagram of the BOSS program]
The workshop follows a bulk synchronous parallel format. After a joint introduction, three parallel tutorial sessions are held. Each tutorial is 2 hours in length and most will be repeated in the afternoon so that participants can attend two of the parallel tutorials. There is a plenary tutorial on PaddlePaddle that all participants can attend.

Program Outline:

8:30 - 9:00 Introduction and Flash Session
9:00 - 10:30 Parallel Tutorials I (Part I)
10:30 - 11:00 Break
11:00 - 11:30 Parallel Tutorials I (Part II)
11:30 - 12:30 Plenary Tutorial (Part I)
12:30 - 14:00 Lunch Break
14:00 - 15:00 Plenary Tutorial (Part II)
15:00 - 15:30 Parallel Tutorials II (Part I)
15:30 - 16:00 Break
16:00 - 17:30 Parallel Tutorials II (Part II)

Details of the Presented Systems

Angel is a flexible and powerful parameter server for large-scale machine learning.

Tutorial Outline
What will be covered.

There are four key features in the design of Angel, and we will cover all of them in the tutorial:

  • The architecture of Angel. Angel is built around a parameter server that manages large models. By partitioning the model across servers, Angel can handle billions of parameters or more.
  • Parameter server as a service. Angel can provide its parameter server as a service to other systems. We have released Spark-on-Angel, which combines the merits of both Spark and the parameter server.
  • The design for sparse and high-dimensional data. Sparse, high-dimensional data are pervasive in real applications, especially in recommendation systems, and Angel handles such data easily.
  • Specific optimization for algorithms. Angel provides a set of machine learning algorithms, such as Latent Dirichlet Allocation, Gradient Boosted Decision Trees, and Graph Embedding, all optimized to scale to terabytes of data and billions of parameters.

Besides these features, we will also introduce how to program with Angel.
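
To make the parameter-server pattern concrete, here is a toy sketch in Python. It is not Angel's actual API (Angel is programmed in Java and Scala); the partitioning scheme and function names are purely illustrative.

    # Toy sketch of the parameter-server pattern (NOT Angel's real API).
    # A large model vector is range-partitioned across several servers;
    # workers pull only the slice they need, then push updates back.
    import numpy as np

    NUM_SERVERS = 4
    DIM = 1_000_000  # model dimension

    # Each "server" owns a contiguous range of the model.
    bounds = np.linspace(0, DIM, NUM_SERVERS + 1, dtype=int)
    servers = [np.zeros(hi - lo) for lo, hi in zip(bounds[:-1], bounds[1:])]

    def pull(lo, hi):
        """Gather model[lo:hi] from the servers that own pieces of it."""
        parts = []
        for s, (slo, shi) in enumerate(zip(bounds[:-1], bounds[1:])):
            a, b = max(lo, slo), min(hi, shi)
            if a < b:
                parts.append(servers[s][a - slo:b - slo])
        return np.concatenate(parts)

    def push(lo, hi, grad, lr=0.1):
        """Scatter a gradient update back to the owning servers."""
        off = 0
        for s, (slo, shi) in enumerate(zip(bounds[:-1], bounds[1:])):
            a, b = max(lo, slo), min(hi, shi)
            if a < b:
                servers[s][a - slo:b - slo] -= lr * grad[off:off + (b - a)]
                off += b - a

    # A worker touching only a sparse slice of the huge model:
    w = pull(10, 20)
    push(10, 20, np.ones_like(w))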

How to demonstrate and set up the environment for the hands-on tutorial.

Running Angel requires a Hadoop and Spark environment. Fortunately, Angel supports a local mode on a single machine, so we can demonstrate environment setup, compilation, programming, and execution on one machine.


How to use the actual system.

Angel is written in Java and Scala and built with Maven. Running Angel requires Hadoop, HDFS, and Spark. To run Angel in a distributed environment, we need a cluster managed by YARN with Hadoop, HDFS, and Spark set up. For testing and demonstration, Angel can run on a single machine with Hadoop and Spark.

Who will benefit from this tutorial.

People who want to learn about machine learning techniques over big data will benefit from this tutorial.


Presenters:
  • Lele Yu (Tencent)
  • Caihua Wang (Tencent)

Resources:


Applications and APIs speak JSON. Databases speak SQL. Couchbase combines the flexibility of JSON, the power of SQL and deployments at scale.

The Couchbase data platform is database infrastructure that enables modern, scalable applications. Today, Couchbase is used by leading communications, consumer electronics, airline, and finance companies to develop and deploy mission-critical applications.

Couchbase is a shared-nothing, auto-partitioned, distributed NoSQL database system that supports the JSON model and offers key-value access and N1QL (SQL for JSON), as well as high-performance indexing, text search, and eventing. Its multi-dimensional architecture uniquely helps you scale deployments up and out to match application requirements. The platform also seamlessly supports mobile applications via Couchbase Mobile.
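
As a flavor of the hands-on portion, a minimal sketch using the Couchbase Python SDK (the 2.x-era API is assumed here, as are the connection details; travel-sample is the sample bucket that ships with Couchbase Server):

    # Minimal sketch with the Couchbase Python SDK (2.x-era API assumed;
    # connection details are placeholders).
    from couchbase.cluster import Cluster, PasswordAuthenticator

    cluster = Cluster('couchbase://localhost')
    cluster.authenticate(PasswordAuthenticator('Administrator', 'password'))
    bucket = cluster.open_bucket('travel-sample')

    # Key-value access: fetch and update one JSON document by key.
    doc = bucket.get('airline_10').value
    doc['country'] = 'United States'
    bucket.upsert('airline_10', doc)

    # N1QL (SQL for JSON): the same data, queried declaratively.
    query = """
        SELECT name, country
        FROM `travel-sample`
        WHERE type = 'airline' AND country = 'United States'
        LIMIT 5
    """
    for row in bucket.n1ql_query(query):
        print(row)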

Tutorial Outline
The following topics are covered:
  • The Couchbase open source model and the Community Edition
  • Couchbase editions (Community, Enterprise) and the relationship to CouchDB
  • Couchbase architecture and components, including Apache AsterixDB
  • A brief discussion of use cases
  • Single-node and cluster deployments
  • A hands-on tutorial on Couchbase Server and Couchbase N1QL: SQL for JSON

Intended audience:

This tutorial is designed for database designers, architects and application developers interested in JSON, SQL, Couchbase and NoSQL systems.


Presenters:
  • Keshav Murthy (Senior Director, Couchbase R&D)

Resources:


PaddlePaddle (PArallel Distributed Deep LEarning) is an easy-to-use, efficient, flexible, and scalable deep learning platform, originally developed by Baidu scientists and engineers to apply deep learning to many products at Baidu.

Fluid is the latest version of PaddlePaddle; it describes models for training or inference using a "Program" representation.
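
To illustrate the "Program" representation, a minimal sketch against the 2018-era paddle.fluid Python API: each layer call appends operators to a default Program, which an Executor then runs.

    # Minimal PaddlePaddle Fluid sketch (2018-era `paddle.fluid` API assumed).
    # Layer calls append operators to a default "Program"; an Executor runs it.
    import numpy as np
    import paddle.fluid as fluid

    # Declare inputs and a linear model; this builds the Program, nothing runs yet.
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    pred = fluid.layers.fc(input=x, size=1)
    loss = fluid.layers.mean(fluid.layers.square_error_cost(input=pred, label=y))
    fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

    # Run the startup Program (parameter initialization), then one training step.
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fluid.default_startup_program())
    out, = exe.run(fluid.default_main_program(),
                   feed={'x': np.random.rand(8, 13).astype('float32'),
                         'y': np.random.rand(8, 1).astype('float32')},
                   fetch_list=[loss])
    print(out)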

PaddlePaddle Elastic Deep Learning (EDL) is a cluster project that makes PaddlePaddle training jobs scalable and fault-tolerant. EDL greatly boosts parallel distributed training jobs and makes good use of cluster computing power.

EDL is built on PaddlePaddle's fault-tolerance support; it uses a Kubernetes controller to manage cluster training jobs and an auto-scaler to scale each job's computing resources.

Tutorial Outline
Introduction

In the introduction session, we will introduce:

  • PaddlePaddle Fluid design overview.
  • Fluid Distributed Training.
  • Why we developed PaddlePaddle EDL and how we implemented it.

Hands-on Tutorial

We offer hands-on tutorials after each introduction session so that the audience can use PaddlePaddle and ask questions while doing so:

  • Train models using PaddlePaddle Fluid in a Jupyter Notebook (the PaddlePaddle Book).
  • Launch a distributed training job on your laptop (see the sketch below).
  • Launch an EDL training job on a Kubernetes cluster.
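
As a taste of the laptop-scale distributed exercise, a hedged sketch using Fluid's distribute transpiler (2018-era API assumed; the endpoint and role handling are simplified placeholders). The transpiler rewrites the single-machine Program into a parameter-server Program and a trainer Program:

    # Hedged sketch of Fluid distributed training via the distribute transpiler
    # (2018-era API assumed; endpoint and role are placeholders).
    import paddle.fluid as fluid

    # ... build x, y, loss and call optimizer.minimize(loss) as before ...

    PSERVER = '127.0.0.1:6174'
    t = fluid.DistributeTranspiler()
    t.transpile(trainer_id=0, pservers=PSERVER, trainers=1)

    role = 'PSERVER'  # or 'TRAINER'; normally read from an environment variable
    exe = fluid.Executor(fluid.CPUPlace())
    if role == 'PSERVER':
        prog = t.get_pserver_program(PSERVER)
        exe.run(t.get_startup_program(PSERVER, prog))
        exe.run(prog)  # blocks, serving parameters to trainers
    else:
        exe.run(fluid.default_startup_program())
        trainer_prog = t.get_trainer_program()
        # exe.run(trainer_prog, feed=..., fetch_list=...) in the training loop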

Intended audience

People who are interested in deep learning system architecture.

Prerequisites


Presenters:
  • Xu Yan (PaddlePaddle)

Resources:


TiDB is an open-source, distributed, scalable Hybrid Transactional and Analytical Processing (HTAP) database. It is designed to provide extreme horizontal scalability, strong consistency, and high availability. TiDB is MySQL-compatible and serves as a one-stop database for both OLTP (Online Transactional Processing) and OLAP (Online Analytical Processing) workloads, minimizing the extract, transform, and load (ETL) processes that are difficult and tedious to maintain.
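
Because TiDB speaks the MySQL wire protocol, any stock MySQL driver can talk to it. A minimal sketch using the pymysql driver (host, credentials, and table are placeholders; 4000 is TiDB's default port):

    # Minimal sketch: connect to TiDB with an ordinary MySQL driver.
    # Host, credentials, and schema are placeholders.
    import pymysql

    conn = pymysql.connect(host='127.0.0.1', port=4000,
                           user='root', password='', database='test')
    try:
        with conn.cursor() as cur:
            cur.execute("CREATE TABLE IF NOT EXISTS orders "
                        "(id BIGINT PRIMARY KEY, amount DECIMAL(10,2))")
            cur.execute("INSERT INTO orders VALUES (%s, %s)", (1, 99.50))
            conn.commit()
            cur.execute("SELECT COUNT(*), SUM(amount) FROM orders")
            print(cur.fetchone())
    finally:
        conn.close()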

This tutorial will explain the motivation, architecture, and inner workings of the TiDB platform, which contains three main components:

  • TiDB: a stateless, MySQL-compatible SQL layer
  • TiKV: a distributed, transactional key-value storage layer
  • TiSpark: a Spark plug-in that reads data directly from TiKV

Since its 1.0 release in October 2017 and 2.0 release in April 2018, TiDB has been in production at over 200 companies. It was recently recognized in a report by 451 Research as an open-source, modular NewSQL database that can be deployed to handle both operational and analytical workloads, fulfilling the promise and benefits of an HTAP architecture.

Tutorial Outline
This tutorial will cover the following topics:
  • Introduction to TiDB’s architecture, technology, and design choices
  • In-production use case analysis
  • How to deploy TiDB on a laptop using Docker Compose
    • Use a local MySQL client and a TiSpark instance to read and write data in TiDB (see the sketch after this list)
    • Use MySQL/TiSpark to modify and analyze “fresh” data
    • Display TiDB cluster state and monitor metrics with Grafana
  • How to deploy a TiDB cluster on AWS
    • Demonstrate increasing and decreasing TiKV nodes to elastically scale system capacity
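
For the TiSpark part of the laptop exercise, a hedged PySpark sketch (the TiSpark jar must be on the Spark classpath; the configuration keys follow the 2.x-era plug-in, and the placement-driver address is a placeholder):

    # Hedged sketch: query TiDB data through TiSpark from PySpark.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName('tispark-demo')
             .config('spark.sql.extensions', 'org.apache.spark.sql.TiExtensions')
             .config('spark.tispark.pd.addresses', '127.0.0.1:2379')
             .getOrCreate())

    # TiSpark reads region data directly from TiKV, bypassing the TiDB SQL layer.
    spark.sql('USE test')
    spark.sql('SELECT COUNT(*) FROM orders').show()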

Intended audience:

This tutorial is designed for database engineers and academic researchers who are interested in how a next-generation NewSQL database like TiDB is built and deployed, or who want to know how TiDB enables near real-time analytics on live transactional data.


Presenters:
  • Ed Huang (co-founder and CTO of PingCAP and Chief Architect of TiDB)

Resources:


Workshop Organization

Workshop Chairs:

  • Beng Chin Ooi, National University of Singapore, ooibc@comp.nus.edu.sg
  • Meihui Zhang, Beijing Institute of Technology, meihui_zhang@bit.edu.cn

Advisory Committee:

  • Michael Carey, UC Irvine
  • Volker Markl, TU Berlin

Selection Committee:

  • Guoliang Li, Tsinghua University
  • Tilmann Rabl, TU Berlin
  • Dimitris Tsirogiannis, Cloudera Inc.
  • Till Westmann, Couchbase
  • Reynold Xin, Databricks



Call for tutorials
  • Important Date: Submissions are open until May 15.

    To propose a tutorial, please submit
    • a short abstract with a brief description of the system,
    • an outline of the planned tutorial,
    • the technology used for the hands-on tutorial,
    • a list of the presenters involved,
    • and a link to your system's website

    to the submission site: https://cmt3.research.microsoft.com/BOSS2018

Selection Process for Tutorials
There will be an open call for tutorials through dbworld and other channels. After a sanity check that the submissions are complete and relevant, i.e., the systems are open source, publicly available, and related to big data, the proposals will be evaluated by the committee for system readiness, relevance, timeliness, and perceived interest from the conference participants.

Previous Editions

BOSS'15 in conjunction with VLDB 2015 on September 4, 2015.
For the first edition of the BOSS workshop, 8 diverse systems were chosen. The program consisted of 8 parallel tutorial sessions, which were repeated, with a panel between the repetitions. The 8 presented systems were: Apache AsterixDB, Apache Flink, Apache REEF, Apache SINGA, Apache Spark, Padres, rasdaman, and SciDB.

BOSS'16 in conjunction with VLDB 2016 on September 9, 2016.
The workshop presented parallel tutorials by system developers. In this instance the systems were: Apache Flink, Apache SystemML, HopsFS and ePipe, LinkedIn's Open Source Analytics Platform, rasdaman, and Rheem.

BOSS'17 in conjunction with VLDB 2017 on September 1, 2017.
The workshop presented parallel tutorials by system developers. In this instance the systems were: Apache AsterixDB, Apache Flink, Apache Impala, Apache Spark, and TensorFlow.