Technology changes the world; reading shapes a life! - shaogx.com


RDD

RDD initial parameters: the context and a set of dependencies: abstract class RDD[T: ClassTag]( @transient private var sc: SparkContext, @transient private var deps: Seq[Dependency[_]] ) extends Serializable. The following need to be worked through carefully: a list of partitions; a function to compute each split (implemented by RDD subclasses); a list of dependencies; a partitioner for key-value RDDs (optional)... (read more)
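The four pieces listed in that excerpt are exactly what a subclass has to supply. As a rough sketch only (RangeRDD and RangePartition are made-up names, not Spark classes, and this targets the 1.x RDD API quoted above), a minimal custom RDD might look like this:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative partition type: one contiguous integer range per partition.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// A toy RDD producing the integers [0, n), split into numSlices partitions.
// Nil means no parent RDDs, i.e. an empty list of dependencies.
class RangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

  // 1. A list of partitions.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices)
    }

  // 2. A function to compute each split.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // 3. Dependencies come from the deps constructor argument (Nil here).
  // 4. partitioner stays None, since this is not a key-value RDD.
}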

rdd

RDD Dependency Explained

One of the most important properties of an RDD is its lineage, which describes how an RDD is computed from its parent RDDs. Think of it like human evolution: apes evolved step by step into modern humans, and each stage of that evolution can be seen as one RDD. If an RDD is lost, it can be recomputed from its parent RDDs by following the lineage. In short: an RDD can be described as a vector of partitions together with its dependencies... (read more)
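As a small illustration of lineage (the input path and app name below are assumptions, not from the post), each transformation creates a child RDD that remembers its parent, and toDebugString prints that chain:

import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineageDemo").setMaster("local[2]"))

    // Each transformation yields a new RDD that records its parent;
    // this chain of parents is the lineage described above.
    val words  = sc.textFile("word.txt").flatMap(_.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Print the lineage chain (roughly ShuffledRDD <- MapPartitionsRDD <- ... <- HadoopRDD).
    println(counts.toDebugString)

    // If a partition of counts is lost, Spark recomputes only that partition
    // by replaying these transformations from the surviving parents.
    sc.stop()
  }
}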

spark rdd scala

Spark Core Source Code Analysis: RDD Basics

RDD initial parameters: the context and a set of dependencies: abstract class RDD[T: ClassTag]( @transient private var sc: SparkContext, @transient private var deps: Seq[Dependency[_]] ) extends Serializable. The points to work through carefully: a list of partitions; a function to compute each split (implemented by RDD subclasses); a list of dependencies; a partitioner for key-value RDDs (optional)... (read more)

Spark RDD Dependency Partitioner

Spark Study Notes

First, unpack Scala; this installation uses scala-2.11.1:

[hadoop@centos software]$ tar -xzvf scala-2.11.1.tgz
[hadoop@centos software]$ su -
[root@centos ~]# vi /etc/profile

Add the following lines (exporting SCALA_HOME and appending its bin directory to PATH):

export SCALA_HOME=/home/hadoop/software/scala-2.11.1
export PATH=$PATH:$SCALA_HOME/bin

[root@centos ~]# source /etc/profile
[root@centos ~]# scala -version
Scala code runner version 2.11.1 -- Copyright 2002-2013, LAMP/EPFL

Next, unpack Spark; this installation uses spark-1.0.0-bin-hadoop1.tgz against Hadoop 1.0.4:

[hadoop@centos software]$ tar -xzvf spark-1.0.0-bin-hadoop1.tgz

Go into Spark's conf directory:

[hadoop@centos conf]$ cp spark-env.sh.template spark-env.sh
[hadoop@centos conf]$ vi spark-env.sh

Add the following:

export SCALA_HOME=/home/hadoop/software/scala-2.11.1
export SPARK_MASTER_IP=centos.host1
export SPARK_WORKER_MEMORY=2G
export JAVA_HOME=/usr/software/jdk

For a cluster deployment, also edit conf/slaves and add the hosts that will run workers, then copy the spark-1.0.0-bin-hadoop1 directory to each of those hosts, keeping the path identical.

Start the master:

[hadoop@centos spark-1.0.0-bin-hadoop1]$ sbin/start-master.sh

The master web UI is then available at http://centos.host1:8080/.

Start the workers:

[hadoop@centos spark-1.0.0-bin-hadoop1]$ sbin/start-slaves.sh spark://centos.host1:7077

The worker web UI is available at http://centos.host1:8081/.

Now run a first example on Spark: a WordCount that reads from and writes to Hadoop. First upload word.txt to HDFS; here the path is hdfs://centos.host1:9000/user/hadoop/data/wordcount/001/word.txt. Then enter the interactive shell:

[hadoop@centos spark-1.0.0-bin-hadoop1]$ MASTER=spark://centos.host1:7077 ./bin/spark-shell

scala> val file = sc.textFile("hdfs://centos.host1:9000/user/hadoop/data/wordcount/001/word.txt")
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> count.collect()

The console shows the following result:

res0: Array[(String, Int)] = Array((hive,2), (zookeeper,1), (pig,1), (spark,1), (hadoop,4), (hbase,2))

The result can also be saved back to HDFS:

scala> count.saveAsTextFile("hdfs://centos.host1:9000/user/hadoop/data/wordcount/001/result.txt")

Next, the Java version of WordCount. It needs one jar on the classpath: spark-assembly-1.0.0-hadoop1.0.4.jar. The WordCount code is as follows:

// Package name matching the --class argument used with spark-submit below.
package org.project.modules.spark.java;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCount {

    private static final Pattern SPACE = Pattern.compile(" ");

    @SuppressWarnings("serial")
    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: JavaWordCount <file>");
            System.exit(1);
        }

        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = ctx.textFile(args[0], 1);

        // Split each line into words.
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String s) {
                return Arrays.asList(SPACE.split(s));
            }
        });

        // Pair each word with a count of 1.
        JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        // Sum the counts per word.
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + " : " + tuple._2());
        }
        ctx.stop();
    }
}

Export the class files into a jar; here it is named mining.jar. Then run the command below, where --class specifies the main class and --master the Spark master URL, followed by the jar to execute and its arguments:

[hadoop@centos spark-1.0.0-bin-hadoop1]$ bin/spark-submit --class org.project.modules.spark.java.WordCount --master spark://centos.host1:7077 /home/hadoop/project/mining.jar hdfs://centos.host1:9000/user/hadoop/data/wordcount/001/word.txt

The console shows the following result:

spark : 1
hive : 2
hadoop : 4
zookeeper : 1
pig : 1
hbase : 2

Finally, the Python version of WordCount. The code is as follows:

import sys
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: wordcount <file>"
        exit(-1)
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print "%s: %i" % (word, count)

The input file path can be either local or on HDFS:

[hadoop@centos spark-1.0.0-bin-hadoop1]$ bin/spark-submit --master spark://centos.host1:7077 /home/hadoop/project/WordCount.py /home/hadoop/temp/word.txt
[hadoop@centos spark-1.0.0-bin-hadoop1]$ bin/spark-submit --master spark://centos.host1:7077 /home/hadoop/project/WordCount.py hdfs://centos.host1:9000/user/hadoop/data/wordcount/001/word.txt

The console shows the following result:

spark: 1
hbase: 2
hive: 2
zookeeper: 1
hadoop: 4
pig: 1
... (read more)
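For symmetry with the Java and Python versions above, a standalone Scala WordCount is sketched below. It is not part of the original notes; the object name and submit command are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object ScalaWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("Usage: ScalaWordCount <file>")
      sys.exit(1)
    }
    // The master URL is supplied via spark-submit, as in the Java example.
    val sc = new SparkContext(new SparkConf().setAppName("ScalaWordCount"))
    val counts = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach { case (word, n) => println(word + " : " + n) }
    sc.stop()
  }
}

Packaged into a jar, it would be submitted the same way as the Java version, for example: bin/spark-submit --class ScalaWordCount --master spark://centos.host1:7077 <jar> <input path>.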

Spark Hadoop Scala RDD WordCount

Spark Study (Part 2): RDDs and Shared Variables

Note: this post is based on the Spark programming guide, reorganized together with my own understanding of the material ... (read more)
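The "shared variables" in the title are the broadcast variables and accumulators of the programming guide. A minimal Scala sketch (names and sample data are assumptions, using the 1.x accumulator API):

import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SharedVariablesDemo").setMaster("local[2]"))

    // Broadcast variable: read-only data shipped once to each executor.
    val stopWords = sc.broadcast(Set("a", "the", "of"))

    // Accumulator: tasks only add to it; the driver reads the total.
    val skipped = sc.accumulator(0)

    val words = sc.parallelize(Seq("a", "spark", "the", "rdd"))
    val kept = words.filter { w =>
      val keep = !stopWords.value.contains(w)
      if (!keep) skipped += 1
      keep
    }

    println(kept.collect().mkString(", "))  // spark, rdd
    println("skipped: " + skipped.value)    // 2
    sc.stop()
  }
}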

parallel computing, cluster, spark
