Introduction:
Sources of data:
✓ Machines: real-time sensors, industrial machinery, tools which keep track of our behaviours, environmental sensors or personal health checkers (40 Tb/sec of data)
✓ People: social media, photos, videos
✓ Organizations: transaction information
Information pyramid: Data → Information → Knowledge → Wisdom
Small data:
1. Available in small quantities and directly accessible to humans with little or no digital processing
2. Accumulated slowly and may not be updated frequently
3. Easy to store, as small data is relatively consistent and structured, usually stored in known formats such as JSON and XML
4. Mostly located in storage systems within enterprises or data centers
Big data:
1. Generated in enormous volumes and can be structured, semi-structured, or unstructured
2. Complex: it requires specialized programs to process it and make it available for generating insights for human consumption
3. Generated continuously and at enormous speed from multiple sources, growing exponentially
4. Comprises any form of data, including video, photos, …
5. The quantity is so huge that it is stored on the cloud and in server farms built for this precise scope
Big data life cycle: case → data collection → data modeling → data processing → visualization
Units of data: 1 Byte = 8 Bits; 1024 Bytes = 1 Kilobyte (KB); 1024 KB = 1 Megabyte (MB); 1024 MB = 1 Gigabyte (GB); 1024 GB = 1 Terabyte (TB); 1024 TB = 1 Petabyte (PB); 1024 PB = 1 Exabyte (EB); 1024 EB = 1 Zettabyte (ZB); 1024 ZB = 1 Yottabyte (YB)
The Vs of big data: volume, variety, velocity, veracity, value, variability
IoT:
1. Instruments or physical objects connected through the internet
✓ It can include smart devices, e.g. sensors, processors, embedded devices and/or communication hardware
✓ All these devices collect, analyse and transmit data in real time
✓ This also means that without big data, IoT devices would not function as they do now
Big data ecosystem:
✓ Data Technologies ✓ Analytics & Visualization ✓ Business Intelligence ✓ Cloud Providers ✓ NoSQL Databases ✓ Programming Tools
Structured or unstructured:
What happens in a normal computer? In a typical analytic cycle, the functionality of the computer is to store data and move it from a storage capacity into a compute capacity.
Linear processing: input → instruction 1 → instruction 2 → instruction 3 → output. Parallel processing: input → instructions executed simultaneously (e.g. 3 at a time) → output (a short sketch follows after this block).
Data scaling is a technique to manage, store and process data
✓ Let's assume we are dealing only with a single node… when we increase the amount of data, we are practically increasing both the memory and the storage capacities
Horizontal scaling:
✓ Scaling horizontally means adding new nodes until the problem is tractable
✓ Cluster capacity solves the «embarrassingly parallel» problems: if one node fails, there is no impact on the others!
✓ Is it always that easy? Absolutely not! The main problems arise when there is interdependence across calculations
✓ Basically each process needs to know what the other processes are doing in order to complete the calculations
✓ A file system, accessible to all processes, is in charge of storing everything!
✓ In Hadoop, the cluster centralizes everything into a main storage
Fault tolerance is the ability of a system to continue operating without interruption when one or more of its components fail (we copy the partitions to other nodes within the cluster).
Cloud computing:
✓ Cloud computing allows users to access highly scalable computing and storage resources through the internet
✓ Why is it useful? By means of cloud computing, companies can use server capacity and expand it rapidly to the level required!
✓ Cloud computing is now indispensable to run complex mathematical models
✓ From an economic point of view, cloud computing reduces costs. Why? Because resources are shared across users, who pay only for the capacity they actually use!
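To make linear versus parallel processing concrete, here is a minimal Python sketch using the standard multiprocessing module; the function name slow_square and the timings are purely illustrative:

```python
from multiprocessing import Pool
import time

def slow_square(x):
    time.sleep(0.5)          # simulate one expensive instruction
    return x * x

if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]

    # Linear processing: one instruction after another on a single worker
    t0 = time.time()
    linear = [slow_square(x) for x in data]
    print(f"linear:   {time.time() - t0:.2f}s")

    # Parallel processing: the same work split across 4 workers (like 4 nodes)
    t0 = time.time()
    with Pool(processes=4) as pool:
        parallel = pool.map(slow_square, data)
    print(f"parallel: {time.time() - t0:.2f}s")
```

With 4 workers the parallel run takes roughly a quarter of the linear time, because this toy task is embarrassingly parallel: no instruction depends on another.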
Structured data: structured data is data which is organized and labeled, with a pre-defined data model conforming to a tabular format
✓ Examples: name, family name, dates, addresses, phone number
✓ Sources of structured data: relational databases (SQL), spreadsheets
Unstructured data:
✓ Most of the data people produce is text-heavy and unstructured
✓ What does unstructured mean? It does not conform to any predefined data model
✓ Often attached to the data we can also find a description, which further increases the volume of data
✓ How much data are we talking about?
Semi-structured data:
✓ Semi-structured data has components of structured and (mostly) unstructured data
✓ It also possesses some metadata with structured characteristics
✓ Example: emails are typically semi-structured: the content is unstructured, while the To/CC/BCC fields, subject line, images, etc. are structured
✓ Sources of semi-structured data: XML/JSON
XML, DTD, JSON
eXtensible Markup Language (XML):
✓ It is a markup metalanguage, which allows us to define other markup languages
✓ It consists of standard syntactic rules, i.e. it is a formal language
How does XML work?
✓ XML is a metalanguage whose aim is to represent textual contents organized in a hierarchical way
✓ Properties of XML files:
✓ They can be transferred among companies and different users without any constraint (elaboration systems and databases often contain data which are incompatible… a very big problem!)
✓ They have the suffix .XML
✓ As in HTML, XML is based on tags <NameTag></NameTag>
✓ Tags allow us to read the information contained in them on the basis of their name, and not necessarily on the basis of their position
✓ Each XML document can be associated with the definition of the structure used to make it readable
✓ The XML file memorizes data and is kept separate from the HTML document
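As an illustration, here is a minimal XML document; the tag names (library, book, title, …) are our own invention, which is exactly the point of a metalanguage:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book id="b1">
    <title>Big Data Basics</title>
    <author>Jane Doe</author>
    <price>29.90</price>
  </book>
</library>
```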
Document Type Definition (DTD):
✓ A DTD file provides an instrument to validate XML documents, but it is not written following the XML syntax
✓ Why is it useful? A DTD imposes some constraints on the structure of an XML document
✓ Using XML files which are validated with schema languages increases productivity and the chances to develop open and interoperable systems
✓ What does a DTD file contain? It contains the elements, attributes and entities which are used by the document and the context
✓ DTD files do not say anything about the content of an element or the value of an attribute… e.g., we cannot say whether a price is a real number or a string
Occurrence indicators for elements:
? → zero or one: not mandatory (0) or at most 1
+ → one or more: mandatory (1) or more
* → zero or more
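Here is an illustrative DTD for the <library> document sketched above, using the three occurrence indicators; note that every leaf is just #PCDATA, since a DTD cannot state that price must be a real number:

```dtd
<!ELEMENT library (book+)>               <!-- + : one or more books, mandatory -->
<!ELEMENT book (title, author*, price?)> <!-- * : zero or more authors; ? : price optional -->
<!ATTLIST book id ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT price (#PCDATA)>
```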
JavaScript Object Notation (JSON):
✓ Standard and open format, suitable for storing different types of information and for exchanging it across applications, both stand-alone and on the web
Why is it so widely used?
✓ Because it is easy to write and to analyse
✓ Because the size of a JSON document is much smaller than that of XML documents
✓ Even PostgreSQL and MySQL, as well as NoSQL databases, support this format and use JSON documents to archive records
✓ Although it originates from JavaScript, it is independent of the platform and is now used in several programming languages
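A minimal sketch of the same book record as JSON, parsed with Python's standard json module (field names are illustrative):

```python
import json

doc = """
{
  "book": {
    "id": "b1",
    "title": "Big Data Basics",
    "authors": ["Jane Doe"],
    "price": 29.90
  }
}
"""

record = json.loads(doc)        # JSON text -> Python dict
print(record["book"]["title"])  # prints: Big Data Basics
```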
The Hadoop Ecosystem:
What is Hadoop?
✓ Open-source framework used to process enormous data sets, i.e. a set of open-source programs and procedures
✓ It is used for processing large amounts of data in distributed file systems, which are interconnected with each other
✓ It supports the running of applications on clusters, i.e. on several computers which work together simultaneously
✓ Why is Hadoop different from a normal computer? Hadoop is not a relational database but an ecosystem, i.e. it runs parallel (or concurrent) jobs or processes simultaneously!
Core components:
Hadoop Common
✓ Collection of utilities and libraries which support the Hadoop modules
HDFS (Hadoop Distributed File System)
✓ Handles and stores large data running on commodity hardware
✓ Scales a single Hadoop cluster up to thousands of nodes
MapReduce
✓ Simultaneous processing of big data
✓ Splits large amounts of data into smaller units
✓ Previously it was the only way to access HDFS
YARN (Yet Another Resource Negotiator)
✓ Resource manager across clusters
✓ Prepares the RAM and the CPU to run data in batch, stream, interactive and graph processing
✓ Everything will then be stored in HDFS
Hadoop ecosystem: Ingest Data → Store Data → Process & Analyse → Access Data
Hadoop Distributed File System (HDFS):
✓ A distributed file system is a file system which is distributed over multiple file servers
✓ This allows programmers to access or store data from any network or computer
✓ What is the role of HDFS in Hadoop? It represents the storage layer
✓ How does it work? It:
✓ Splits the files into blocks
✓ Creates replicas of the blocks
✓ Stores them on different machines
✓ HDFS provides access to streaming data
✓ HDFS uses a CLI (command line interface) to interact with Hadoop… this means that you need to write what to do and there is no typical graphical interface (a few example commands follow below)!
HDFS features:
✓ Cost efficient: the storage hardware is not expensive, allowing to save money
✓ Large amounts of data: HDFS can store up to petabytes of data, in any format (either tabular or non-tabular). It breaks data into small chunks, called blocks
✓ Replication: HDFS replicates data on multiple machines, while minimizing the costs associated with data losses
✓ Fault tolerant: in case of data loss, data can be found on another computer, allowing work to continue smoothly
✓ Scalable: a single cluster can scale up to hundreds of nodes
✓ Portable: HDFS is designed to easily move from one platform to another
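A few common commands of the HDFS command line interface; the paths and file names are illustrative:

```
hdfs dfs -mkdir /user/data            # create a directory in HDFS
hdfs dfs -put sales.csv /user/data    # copy a local file into HDFS
hdfs dfs -ls /user/data               # list the contents of an HDFS directory
hdfs dfs -cat /user/data/sales.csv    # print a file stored in HDFS
```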
HDFS terminology:
✓ Blocks: we said that data is split into blocks. A block is the minimum amount of data that can be written or read; blocks provide fault tolerance; the default size of blocks is 64 MB or 128 MB
✓ Nodes: a node is a single system responsible for storing and processing data; it can be considered as a machine or computer where data is stored. HDFS follows a primary/secondary concept. We have two types of nodes:
Primary node (or Name node): regulates file access by clients; maintains, manages, and assigns tasks to the secondary nodes
Secondary node (or Data node): manages the storage system; there could be hundreds of them; they serve read and write requests at the instructions of the Name node
✓ Rack awareness: during read/write operations, it is fundamental that the name node maximises performance by choosing the data nodes as close as possible. How to ensure it? By choosing nodes in the same rack or in the closest rack. Why is rack awareness useful? Because it reduces network traffic and improves cluster performance! How do we keep track of it? The name node (responsible for the assignment) keeps track of the rack ID information
✓ Replication: by means of rack awareness, HDFS makes replicas. When crashes happen, replication provides a backup of the data blocks. Replication factor: the number of times a data block was copied
✓ Read/write operations: HDFS allows write-once-read-many operations; we cannot edit files which are already stored in HDFS, but we can append new data
MapReduce:
✓ It is a programming-based model
✓ It is the processing layer of Hadoop: it is used in Hadoop to process big data, i.e. it serves scalability (thousands of servers in a Hadoop cluster)
✓ It is the heart of Apache Hadoop
✓ Why is it so powerful? It consists of processing techniques for distributed computing. Distributed computing is a system composed of multiple components located on different machines, which communicate their actions as one view to the end user
✓ It is based on Java, but it can also be coded in C++, Python, Ruby or R
✓ It consists of 2 important tasks: Map & Reduce
Map:
✓ It takes an input file, located in the Hadoop Distributed File System (HDFS)
✓ Map performs a mapping task by processing and extracting important data into key-value pairs
✓ It also sorts and organizes data for the preliminary output to be sent
Reduce:
✓ The reducer works with multiple map functions
✓ It aggregates the pairs using their keys to produce a final output
✓ How does Hadoop know that all of them are working on the same thing? MapReduce keeps track of its tasks by creating unique keys, to ensure that all processes are solving the same problem
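A minimal single-machine sketch of the MapReduce idea, the classic word count; real Hadoop distributes the map and reduce phases over many nodes, while here we only mimic them in plain Python:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit (key, value) pairs, here (word, 1)
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle/Reduce: aggregate all values that share the same key
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["big data is big", "data is everywhere"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(pairs))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```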
Why is MapReduce so useful?
✓ With MapReduce there is an intensive level of parallel computing, i.e. a high number of parallel jobs across multiple nodes
✓ In Hadoop there are two types of nodes: Data Node and Name Node
✓ MapReduce splits work and runs independent tasks in parallel… it saves time!
✓ MapReduce is very flexible
✓ It can produce output in tabular or non-tabular form… value for organizations, regardless of the type of data treated!
Hive:
✓ Data warehouse software within Hadoop
✓ It is designed for reading, writing and managing large tabular-type datasets and for data analysis
✓ It is scalable, fast, and easy to use
✓ Used to maintain a data warehouse using the Hive query language; suited for static data analysis; designed on the write-once-read-many methodology; the maximum data size it can handle is petabytes; it does not enforce a schema to verify loaded data; it supports partitioning
HBase:
✓ Column-oriented non-relational database management system
✓ It runs on top of HDFS and is used for write-heavy applications
✓ It provides a fault-tolerant way of storing sparse datasets
✓ It works well with real-time data and random read/write access to big data
✓ It supports scalability in linear and modular form
✓ It is a backup support for MapReduce jobs
✓ It has no fixed column schema
✓ It offers an easy-to-use Java API for client access
✓ It supports data replication across clusters
✓ An HBase column represents an attribute of an object
YARN:
✓ Resource manager created by splitting the processing engine and the management function of MapReduce
✓ Goals: it monitors and manages workloads, i.e. allocates system resources to the various applications; it maintains a multi-tenant environment; it manages the high-availability features of Hadoop; it implements security controls
✓ YARN can support multiple scheduling methods: the FIFO (First In First Out) scheduler, and the Fair Scheduler to assign resources to the jobs running at the same time
✓ YARN includes a Reservation System that allows users to reserve cluster resources in advance
YARN architecture:
✓ YARN sits in the middle between HDFS and the processing engines
✓ It puts resource managers in contact with containers, application coordinators and node-level agents
✓ Components:
Resource Manager: accepts job submissions from users, schedules jobs and allocates resources
Node Manager: functions as a reporting and monitoring agent
ApplicationMaster: negotiates resources and works with the NodeManager to execute tasks
Apache Spark
What is Spark?
✓ Open-source in-memory application framework for distributed data processing and iterative analysis on massive data volumes
✓ Spark is primarily written in Scala and runs on Java virtual machines (JVMs); it supports object-oriented and functional programming
Distributed computing:
✓ Distributed computing refers to a group (or cluster) of computers working together so as to appear as one system to the end user
✓ In parallel computing the processors share memory; in distributed computing the processors have their own private, distributed memory
Parallel computing: increased speed / efficient use of resources / scalability / improved performance for complex tasks; but: complex programs / synchronization issues / hardware costs
Distributed computing: fault tolerance / cost effectiveness / scalability / geographic distribution; but: complexity in management / communication overhead / security
What Spark offers:
✓ Core Spark engine ✓ Cluster and executors ✓ Cluster management ✓ SparkSQL ✓ Catalyst and Tungsten ✓ DataFrames ✓ SparkML ✓ Streaming
Functional programming:
✓ A style of programming that follows the mathematical function format
✓ A declarative programming model: based on what to compute rather than how to compute it
✓ Declarative syntax emphasizes the output rather than the implementation details; it is based on expressions instead of statements
✓ Functional programming dates back to the LISt Processing language (LISP) in the 1950s
Apache Spark works quickly with big data by means of lambda functions, distributing work to nodes and parallelizing computations.
Resilient Distributed Dataset (RDD):
✓ Spark's primary data abstraction is the RDD
✓ It is a fault-tolerant collection of elements
✓ It is partitioned across the nodes of the cluster
✓ An RDD is capable of accepting parallel operations
✓ RDDs are immutable
How to create an RDD:
1. Create the RDD from an external or local file on a Hadoop-supported file system such as HDFS
2. Partitions are initially stored in memory
3. The dataset is broken into partitions
Parallel programming:
✓ It consists of the simultaneous use of multiple compute resources to solve a computational problem
✓ Problems are broken into discrete parts which can be solved at the same time
✓ The processors access a shared pool of memory, which contains information on the mechanisms for control and coordination
✓ How does this relate to RDDs? We can create an RDD by parallelizing an array or by splitting a dataset into partitions; Spark runs one task for each partition of the cluster (see the sketch below)
Why resilient (the R in RDD)?
✓ RDDs provide resilience by means of immutability and caching
✓ They are always recoverable, as they are immutable
✓ Datasets can be persisted or cached in memory across operations
✓ The cache is fault tolerant and always recoverable
✓ Each node stores in memory the partitions that it computed, and reuses the same partitions in other actions on that dataset or on the subsequent datasets derived from the first RDD
✓ This persistence makes future actions much faster
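A minimal PySpark sketch (assuming pyspark is installed and running locally) that creates an RDD by parallelizing a list, caches it, and runs a parallel reduction:

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-demo")          # 4 local worker threads

rdd = sc.parallelize(range(1, 1001), numSlices=8)  # an RDD with 8 partitions
squares = rdd.map(lambda x: x * x)                 # lambda function shipped to the workers
squares.cache()                                    # persist the partitions in memory

total = squares.reduce(lambda a, b: a + b)         # one task per partition, results merged
print(total)                                       # 333833500
sc.stop()
```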
Spark architecture, main components:
✓ Data: datasets are loaded from data storage into memory; any Hadoop-compatible data source is acceptable
✓ Compute input: high-level programming APIs comprising Scala, Java and Python
✓ Cluster management framework: handles the distributed computing; Spark can run as a standalone server or on Mesos, YARN and Kubernetes; useful to scale big data
Spark applications consist of 2 elements:
Driver program:
✓ Contains the Spark jobs that the application needs to run
✓ It splits jobs into tasks to submit to the executors
✓ It receives the task results once they complete in the executors
Executor program:
✓ Runs on worker nodes
✓ Spark starts additional executor processes on a worker node if there are enough memory and cores available
✓ It can take multiple cores for multithreaded calculations
✓ Spark distributes RDDs among the executors
✓ Drivers and executors communicate with each other
SparkSQL:
✓ Spark SQL is the Spark module for structured data processing
✓ It is possible to interact with Spark SQL using SQL queries or the DataFrame API
Benefits:
✓ Spark SQL aims to make queries fast by including a cost-based optimizer, columnar storage and a code generator
✓ Spark SQL scales quite easily to thousands of nodes and multi-hour queries
✓ Spark SQL provides a programming abstraction called DataFrames
DataFrames:
✓ A DataFrame is a collection of data organized into named columns
✓ Conceptually equivalent to a table in a relational database or a data frame in R/Python
✓ DataFrames are built on top of the RDD API and use RDDs to perform relational queries
Benefits:
✓ DataFrames are highly scalable
✓ DataFrames support several data formats and storage systems
✓ DataFrames provide optimization and code generation
✓ User friendly thanks to integration with big data tooling and infrastructure via Spark
✓ APIs for Python and Java
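A minimal Spark SQL sketch (assuming pyspark): the same DataFrame queried through the DataFrame API and through a SQL statement; the table and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

df.filter(df.age > 30).show()           # DataFrame API

df.createOrReplaceTempView("people")    # register the DataFrame as a SQL view
spark.sql("SELECT name FROM people WHERE age > 30").show()
spark.stop()
```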
RDDs in parallel programming:
✓ RDD transformations: create a new RDD from an existing one. Transformations are considered lazy because they compute their results only when an action is evaluated. The map() transformation passes each element of a dataset through a function
✓ RDD actions: to evaluate a transformation in Spark we need an action. Actions return a value to the driver program after running a computation. The reduce() action aggregates all the elements of an RDD and returns the result to the driver program
✓ How? Spark uses a Directed Acyclic Graph (DAG) and a DAG scheduler to perform RDD operations
✓ The vertices represent RDDs
✓ The edges represent transformations or actions
✓ DAGs help enable fault tolerance
Datasets: strongly typed; unified Java and Scala APIs; built on top of DataFrames
DataFrame operations: ✓ Read ✓ Analysis ✓ Transformation ✓ Loading ✓ Writing (ETL: Extract, Transform, Load)
Types of transformation: narrow (data is transferred without executing data shuffling operations); wide (data is shuffled across partitions); see the sketch below
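A minimal sketch of narrow versus wide transformations (assuming pyspark); map() is narrow because it works within each partition, while reduceByKey() is wide because records with the same key must be shuffled to the same partition:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "shuffle-demo")
rdd = sc.parallelize(["a", "b", "a", "c", "b", "a"], 4)   # 4 partitions

pairs = rdd.map(lambda w: (w, 1))                 # narrow: no data movement
counts = pairs.reduceByKey(lambda a, b: a + b)    # wide: shuffles across partitions

print(counts.collect())   # action, e.g. [('a', 3), ('b', 2), ('c', 1)]
sc.stop()
```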
Apache Spark SQL optimization:
✓ The goal is to reduce the query time and the memory consumption
✓ Spark SQL supports both rule-based and cost-based query optimization
✓ Rule-based: rules determine how to run the query
✓ Cost-based: cost measures the time and memory a query consumes
The two optimizers are Catalyst and Tungsten:
1. Catalyst: Spark SQL's rule-based query optimizer
✓ Based on the functional programming constructs of Scala
✓ Adds new optimization techniques and features to Spark SQL
✓ Provides developers with new data-source-specific rules and support for new data types
✓ Uses a tree data structure to optimize a query in four phases: Analysis → Logical Optimization → Physical Planning → Code Generation
2. Tungsten: provides Spark SQL with cost-based optimization that maximizes CPU and memory performance
Features:
✓ Manages memory explicitly and does not rely on the JVM
✓ Enables cache-friendly computation of algorithms and data structures
✓ Supports on-demand JVM bytecode generation
✓ Does not use virtual function dispatches
✓ Places intermediate data in CPU registers and enables loop unrolling
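To see Catalyst's phases on a concrete query, DataFrame.explain(extended=True) prints the parsed, analysed and optimized logical plans plus the physical plan (a minimal sketch assuming pyspark):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"])

# Prints the Parsed/Analyzed/Optimized Logical Plans and the Physical Plan
df.filter(df.id > 1).select("tag").explain(extended=True)
spark.stop()
```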