Data @Scale 2017 Recap

Parixit Pol


Overview

The Data @Scale 2017 conference brought together 350 engineers to discuss the challenges and innovations in large-scale storage systems and analytics. Key presentations from industry leaders highlighted the intersection of Big Data and machine learning, showcasing advancements in infrastructure, databases, and data processing techniques.

What You'll Learn

1. How to leverage large-scale storage systems for machine learning applications

2. Why globally distributed databases are essential for handling trillions of data objects

3. How to implement interactive analytics using ClickHouse

4. How to build a resilient micro-service architecture with Cadence

5. When to apply advanced SQL features in distributed systems like Spanner

Key Questions Answered

What are the key insights from the Data @Scale 2017 conference?

The Data @Scale 2017 conference featured insights from industry leaders on topics like large-scale storage systems, machine learning, and the evolution of databases. Presentations highlighted advancements in infrastructure design, interactive analytics, and micro-service architectures, showcasing how these innovations address the challenges of Big Data.

How does ClickHouse handle real-time data ingestion?

ClickHouse is designed to ingest clickstream data in real time and generate interactive reports on non-aggregated data. It reportedly processes 100 billion rows per second on HDDs, scales linearly, and supports SQL, which makes it well suited to high-performance analytics.
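
To make "interactive reports on non-aggregated data" concrete, here is a minimal Python sketch of the idea: raw events are kept as they arrive, and every report is computed at query time with whatever filter and grouping the user picks. This is not ClickHouse code or its API. The `Click` schema, the `report` helper, and the sample events are all hypothetical and exist only for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, NamedTuple


class Click(NamedTuple):
    """One raw, non-aggregated clickstream event (hypothetical schema)."""
    user_id: int
    page: str
    country: str


def report(events: Iterable[Click],
           group_by: Callable[[Click], str],
           where: Callable[[Click], bool] = lambda e: True) -> Counter:
    """Aggregate at query time, so filters and groupings can be chosen ad hoc."""
    return Counter(group_by(e) for e in events if where(e))


# Illustrative sample data, not taken from the talk.
events = [
    Click(1, "/home", "US"),
    Click(2, "/home", "DE"),
    Click(1, "/pricing", "US"),
    Click(3, "/home", "US"),
]

# Two different reports over the same raw rows, with no pre-aggregation step.
views_per_page = report(events, group_by=lambda e: e.page)
us_views_per_user = report(events,
                           group_by=lambda e: str(e.user_id),
                           where=lambda e: e.country == "US")
```

Because nothing is pre-aggregated, a new report only needs a new `group_by` or `where`. The trade-off is that every query scans raw rows, which is why scan throughput matters so much for this style of analytics.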

What is the significance of Cadence in micro-service architecture?

Cadence is an open-source solution that allows for building and running micro-services that expose asynchronous, long-running operations. It enhances scalability and resilience, making it a valuable tool for complex service architectures.
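
The pattern of exposing an asynchronous, long-running operation can be sketched in a few lines of Python with `asyncio`. This is not Cadence's API. The `OperationService` class and its `start`, `status`, and `result` methods are hypothetical names, and the sketch keeps state in memory only, so it illustrates the calling pattern rather than any resilience guarantees.

```python
import asyncio
import uuid
from typing import Dict, Tuple


class OperationService:
    """Hypothetical service exposing a long-running operation asynchronously."""

    def __init__(self) -> None:
        self._tasks: Dict[str, asyncio.Task] = {}

    def start(self, seconds: float) -> str:
        """Kick off the work and return an operation id immediately."""
        op_id = str(uuid.uuid4())
        self._tasks[op_id] = asyncio.create_task(self._work(seconds))
        return op_id

    def status(self, op_id: str) -> str:
        """Report progress without waiting for the operation to finish."""
        return "DONE" if self._tasks[op_id].done() else "RUNNING"

    async def result(self, op_id: str) -> str:
        """Wait for the operation to finish and return its outcome."""
        return await self._tasks[op_id]

    @staticmethod
    async def _work(seconds: float) -> str:
        await asyncio.sleep(seconds)  # stands in for a slow business process
        return "completed"


async def main() -> Tuple[str, str, str]:
    svc = OperationService()
    op_id = svc.start(0.05)
    before = svc.status(op_id)         # the caller is not blocked while work runs
    outcome = await svc.result(op_id)  # wait only when the result is needed
    after = svc.status(op_id)
    return before, outcome, after


before, outcome, after = asyncio.run(main())
```

The caller gets an operation id back at once and decides separately when to check status or wait for the result, which is the property that keeps long-running tasks from blocking the services that start them.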

What challenges does Spanner address in distributed SQL databases?

Spanner tackles challenges such as scalability, automatic sharding, and fault tolerance. It aims to provide external consistency and wide-area distribution, evolving towards a SQL DBMS to enhance compatibility with other systems at Google.
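
As a rough illustration of what automatic sharding means, the Python sketch below splits a key range in two whenever a shard grows past a size limit. It is a toy model under stated assumptions, not Spanner's algorithm. The `RangeShardedStore` class, its `max_keys_per_shard` parameter, and the split-at-the-median rule are all hypothetical.

```python
import bisect
from typing import List


class RangeShardedStore:
    """Toy key-range sharding with automatic splits (not Spanner's algorithm)."""

    def __init__(self, max_keys_per_shard: int = 4) -> None:
        self.max_keys = max_keys_per_shard
        self.lower_bounds: List[str] = [""]   # shard i holds keys >= lower_bounds[i]
        self.shards: List[List[str]] = [[]]   # each shard keeps its keys sorted

    def _shard_index(self, key: str) -> int:
        # The owning shard is the last one whose lower bound is <= key.
        return bisect.bisect_right(self.lower_bounds, key) - 1

    def insert(self, key: str) -> None:
        i = self._shard_index(key)
        shard = self.shards[i]
        pos = bisect.bisect_left(shard, key)
        if pos < len(shard) and shard[pos] == key:
            return                            # key already stored
        shard.insert(pos, key)
        if len(shard) > self.max_keys:
            self._split(i)

    def _split(self, i: int) -> None:
        # Split an oversized shard at its median key into two adjacent shards.
        keys = self.shards[i]
        mid = len(keys) // 2
        self.shards[i] = keys[:mid]
        self.shards.insert(i + 1, keys[mid:])
        self.lower_bounds.insert(i + 1, keys[mid])


store = RangeShardedStore(max_keys_per_shard=4)
for key in "abcdefghij":
    store.insert(key)
```

Clients never choose a shard themselves: lookups go through the boundary list, so shards can be split as data grows without changing how keys are addressed.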

Key Statistics & Figures

Data processing speed of ClickHouse

100 billion rows per second

This performance metric highlights ClickHouse's capability to handle large-scale data analytics efficiently.

Number of engineers at Data @Scale 2017

350

This figure reflects the growing interest and participation in discussions around large-scale data challenges.

Technologies & Tools


Database

ClickHouse

Used for real-time data ingestion and interactive analytics.

Backend

Cadence

An open-source solution for building and running micro-services.

Database

Spanner

A globally distributed data management system.

Key Actionable Insights

1. Engineers should explore the integration of machine learning with large-scale storage solutions to enhance data processing capabilities. As machine learning continues to evolve, understanding how to effectively store and analyze large datasets will be crucial for developing innovative applications.

2. Adopting globally distributed databases can significantly improve the performance and reliability of applications handling massive data volumes. With the ability to manage trillions of data objects, these databases ensure high availability and low latency, which are essential for modern applications.

3. Utilizing ClickHouse for interactive analytics can streamline reporting processes and improve data accessibility. Its capability to process vast amounts of data in real time allows organizations to make informed decisions quickly based on up-to-date information.

4. Implementing Cadence can simplify the management of micro-services by providing a framework for asynchronous operations. This can lead to more resilient applications that can handle long-running tasks without blocking resources.

Common Pitfalls

1. Failing to consider the scalability of databases when handling large data volumes can lead to performance bottlenecks. As data grows, it's essential to choose a database solution that can scale efficiently to avoid slowdowns and outages.

Related Concepts

Big Data Analytics

Machine Learning Infrastructure

Distributed Databases

Micro-service Architecture