The workshop BOSS'18 will be held in conjunction with the
44th International Conference on
Very Large Data Bases
Rio de Janeiro, Brazil • 27 August - 31 August 2018
Following the great success of the first, second, and third Workshop on Big Data Open Source Systems (BOSS'15, BOSS'16, BOSS'17), co-located with VLDB 2015, VLDB 2016, and VLDB 2017, the fourth Workshop on Big Data Open Source Systems (BOSS'18) will again give a deep-dive introduction into several active, publicly available, open-source systems. Each system will be presented in a tutorial by experts on that system. The tutorials will cover installation and non-trivial usage examples of the presented system.
The workshop will consist of tutorials. We will publish the tutorial proposals on the website and encourage the presenters to publish their tutorial resources. In the previous editions, we published slides and project websites. We encourage proposers to engage participants in a hands-on exercise that quickly builds familiarity with the system.
The workshop follows a bulk synchronous parallel format. After a joint introduction, three parallel tutorial sessions are held. Each tutorial is 2 hours in length and most will be repeated in the afternoon so that participants can attend two of the parallel tutorials. There is a plenary tutorial on PaddlePaddle that all participants can attend.
8:30 - 9:00 | Introduction and Flash Session |
9:00 - 10:30 | Parallel Tutorials I (Part I) |
10:30 - 11:00 | Break |
11:00 - 11:30 | Parallel Tutorials I (Part II) |
11:30 - 12:30 | Plenary Tutorial (Part I) |
12:30 - 14:00 | Lunch Break |
14:00 - 15:00 | Plenary Tutorial (Part II) |
15:00 - 15:30 | Parallel Tutorials II (Part I) |
15:30 - 16:00 | Break |
16:00 - 17:30 | Parallel Tutorials II (Part II) |
Angel is a flexible and powerful parameter server for large-scale machine learning.
There are four features in the design of Angel and we will cover them in the tutorial. We list them here.
Besides these features, we will also introduce how to program with Angel.
Running Angel requires a Hadoop and Spark environment. Fortunately, because Angel supports a local mode on a single machine, we can demonstrate environment setup, compilation, programming, and execution on one machine.
Angel is written in Java and Scala and built with Maven. Running Angel requires Hadoop, HDFS, and Spark. To run Angel in a distributed environment, we need a cluster managed by YARN with Hadoop, HDFS, and Spark set up. For testing and demonstration, we can run Angel on a single machine with Hadoop and Spark.
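For readers new to the term, a parameter server keeps the model parameters in a shared store; workers pull the current parameters, compute gradients on their own data shard, and push updates back. The following is a minimal single-process sketch of that pattern in Python. It is not Angel's API: the class `ToyParameterServer`, the functions `worker_gradient` and `train`, and the tiny dataset are all hypothetical names and values chosen for illustration.

```python
from typing import List, Sequence, Tuple

Example = Tuple[List[float], float]  # (features, target)


class ToyParameterServer:
    """Hypothetical in-process stand-in for a parameter server (not Angel's API)."""

    def __init__(self, dim: int, learning_rate: float) -> None:
        self.weights: List[float] = [0.0] * dim
        self.learning_rate = learning_rate

    def pull(self) -> List[float]:
        # Workers fetch a copy of the current parameters.
        return list(self.weights)

    def push(self, gradient: Sequence[float]) -> None:
        # Workers send gradients; the server applies one SGD step.
        for i, g in enumerate(gradient):
            self.weights[i] -= self.learning_rate * g


def worker_gradient(weights: Sequence[float], shard: Sequence[Example]) -> List[float]:
    """Gradient of the mean squared error of a linear model on one data shard."""
    grad = [0.0] * len(weights)
    for features, target in shard:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for i, x in enumerate(features):
            grad[i] += 2.0 * error * x / len(shard)
    return grad


def train(shards: Sequence[Sequence[Example]], dim: int,
          rounds: int = 200, learning_rate: float = 0.05) -> List[float]:
    server = ToyParameterServer(dim, learning_rate)
    for _ in range(rounds):
        # Workers are simulated sequentially here; in a real system they
        # would run in parallel on different machines.
        for shard in shards:
            current = server.pull()
            server.push(worker_gradient(current, shard))
    return server.pull()


# Two illustrative shards generated from y = 2*x0 + 1*x1.
shards = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0)],
    [([1.0, 1.0], 3.0), ([2.0, 1.0], 5.0)],
]
weights = train(shards, dim=2)
```

With these illustrative shards the learned weights approach 2.0 and 1.0. A real parameter server additionally partitions the parameters across server nodes and handles communication and failures, which this sketch leaves out.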
Anyone who wants to learn about machine learning techniques over big data can benefit from this tutorial.
Applications and APIs speak JSON. Databases speak SQL. Couchbase combines the flexibility of JSON, the power of SQL and deployments at scale.
The Couchbase data platform is a database infrastructure for modern scalable applications. Today, Couchbase is used by leading communications, consumer electronics, airline, and finance companies to develop and deploy mission-critical applications. See more at this website.
Couchbase is a distributed, shared-nothing, auto-partitioned NoSQL database system that supports the JSON data model and offers key-value access, N1QL (SQL for JSON), high-performance indexing, text search, and eventing. Its multi-dimensional architecture helps you scale deployments up and out to match the application's scaling requirements. This infrastructure seamlessly supports mobile applications via Couchbase Mobile.
This tutorial is designed for database designers, architects and application developers interested in JSON, SQL, Couchbase and NoSQL systems.
PaddlePaddle (PArallel Distributed Deep LEarning) is an easy-to-use, efficient, flexible, and scalable deep learning platform, originally developed by Baidu scientists and engineers to apply deep learning to many products at Baidu.
Fluid is the latest version of PaddlePaddle. It describes the model for training or inference using a representation called a "Program".
PaddlePaddle Elastic Deep Learning (EDL) is a clustering project that makes PaddlePaddle training jobs scalable and fault-tolerant. EDL aims to speed up parallel distributed training jobs and make good use of the cluster's computing power.
EDL builds on PaddlePaddle's fault-tolerance features. It uses a Kubernetes controller to manage the cluster training jobs and an auto-scaler to scale each job's computing resources.
At the introduction session, we will introduce:
Each introduction session is followed by hands-on tutorials so that everyone in the audience can use PaddlePaddle and ask questions while doing so:
This tutorial is designed for people who are interested in deep learning system architecture.
TiDB is an open-source distributed scalable Hybrid Transactional and Analytical Processing (HTAP) database. It is designed to provide extremely large horizontal scalability, strong consistency, and high availability. TiDB is MySQL compatible and serves as a one-stop database for both OLTP (Online Transactional Processing) and OLAP (Online Analytical Processing) workloads, while minimizing extract, transform, and load (ETL) processes which are difficult and tedious to maintain.
This tutorial will explain the motivation, architecture, and inner workings of the TiDB platform, which contains three main components:
Since its 1.0 release in October 2017 and 2.0 release in April 2018, TiDB has been in production in over 200 companies. It was recently recognized in a report by 451 Research as an open source, modular NewSQL database that can be deployed to handle both operational and analytical workloads, fulfilling the promise and benefits of an HTAP architecture.
This tutorial is designed for database engineers and academic researchers who are interested in how a next-generation NewSQL database like TiDB is built and deployed, or who want to know how TiDB enables near real-time analytics on live transactional data.
Workshop Chair:
Advisory Committee:
Selection Committee:
⇒ BOSS'15 in conjunction with VLDB 2015 on September 4, 2015.
For the first instance of the BOSS workshop, 8 diverse systems were chosen. The program consisted of 8 parallel tutorial sessions, which were repeated, and a panel between the repetitions. The 8 presented systems were: Apache AsterixDB, Apache Flink, Apache REEF, Apache SINGA, Apache Spark, PADRES, rasdaman, and SciDB.
⇒ BOSS'16 in conjunction with VLDB 2016 on September 9, 2016.
The workshop presented parallel tutorials by system developers. In this instance the systems were: Apache Flink, Apache SystemML, HopsFS and ePipe, LinkedIn's Open Source Analytics Platform, rasdaman, and Rheem.
⇒ BOSS'17 in conjunction with VLDB 2017 on September 1, 2017.
The workshop presented parallel tutorials by system developers. In this instance the systems were: Apache AsterixDB, Apache Flink, Apache Impala, Apache Spark, and TensorFlow.