

Setting up Hadoop on CentOS and Running the WordCount Example

I have been learning Hadoop recently and ran into quite a few problems, so I am recording them here for future reference. My environment is CentOS 7 installed under VMware Workstation 10.0.4. Installing and configuring the JDK: run java -version to see whether Java is installed, and use env | grep JAVA_HOME or echo $JAVA_HOME $PATH to check whether the environment variables are configured correctly; if Java is missing, it can be downloaded from the official site. Because the bundled JDK had some issues, I first uninstalled it: start by checking whether a JDK is already installed on the Linux system with # rpm -qa|grep jdk... Read more

centos hadoop setup Wordcount example

WordCount Design and Optimization

Original document: http://gitlab.alibaba-inc.com/middleware/coding4fun-3rd/blob/master/observer.hany/design.md. This is from the third Taobao middleware programming contest. Problem summary: read a file and report the 10 most frequently occurring words in it. ... Read more
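The core of the problem is a word-frequency count followed by a top-10 selection. As a rough single-threaded illustration (a sketch only, not the contest entry; the class name and argument handling are assumptions), it might look like this in Java:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class TopWords {
    public static void main(String[] args) throws IOException {
        // Count occurrences of each whitespace-separated token.
        Map<String, Long> counts = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            lines.flatMap(line -> Stream.of(line.split("\\s+")))
                 .filter(w -> !w.isEmpty())
                 .forEach(w -> counts.merge(w, 1L, Long::sum));
        }
        // Sort by descending count and print the top 10.
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
              .limit(10)
              .forEach(e -> System.out.println(e.getKey() + " : " + e.getValue()));
    }
}

The design document itself is concerned with the harder part: doing this concurrently while avoiding locks, as the tags below suggest.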

wordcount design and optimization, concurrency, OR concurrency, double-safeguard pattern to avoid locking

Spark: Implementing WordCount in Scala and Java

http://www.cnblogs.com/byrhuangqiang/p/4017725.html In order to write Scala in IDEA, today I installed, configured, and learned the IDEA integrated development environment. IDEA really is excellent; once you get used to it, it is very comfortable to work with. For how to set up a Scala and IDEA development environment, see the references at the end of the article. ... Read more

Linux Commands (II)

Here we look at another command, wc (word count), which counts the number of bytes, words, and lines in the specified files and prints the results. Reference: http://www.cnblogs.com/peida/archive/2012/12/18/2822758.html Usage: ... Read more
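As a rough illustration of what the three counts mean (a hypothetical sketch, not how wc itself is implemented), a small Java program can reproduce the line, word, and byte counts for one file:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WcLike {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String text = new String(bytes, StandardCharsets.UTF_8);

        long byteCount = bytes.length;                                  // like wc -c: total bytes
        long lineCount = text.chars().filter(c -> c == '\n').count();   // like wc -l: newline characters
        long wordCount = 0;                                             // like wc -w: whitespace-separated words
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) wordCount++;
        }
        System.out.println(lineCount + " " + wordCount + " " + byteCount + " " + args[0]);
    }
}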

Linux command wc wordcount

Spark Study Notes

First, extract Scala; this time we use version scala-2.11.1.

[hadoop@centos software]$ tar -xzvf scala-2.11.1.tgz
[hadoop@centos software]$ su -
[root@centos ~]# vi /etc/profile

Add the following:

SCALA_HOME=/home/hadoop/software/scala-2.11.1
PATH=$SCALA_HOME/bin
export SCALA_HOME

[root@centos ~]# source /etc/profile
[root@centos ~]# scala -version
Scala code runner version 2.11.1 -- Copyright 2002-2013, LAMP/EPFL

Then extract Spark; this time we use spark-1.0.0-bin-hadoop1.tgz, and the Hadoop in use is 1.0.4.

[hadoop@centos software]$ tar -xzvf spark-1.0.0-bin-hadoop1.tgz

Go into Spark's conf directory:

[hadoop@centos conf]$ cp spark-env.sh.template spark-env.sh
[hadoop@centos conf]$ vi spark-env.sh

Add the following:

export SCALA_HOME=/home/hadoop/software/scala-2.11.1
export SPARK_MASTER_IP=centos.host1
export SPARK_WORKER_MEMORY=2G
export JAVA_HOME=/usr/software/jdk

For a cluster deployment, also edit conf/slaves to add the hosts that will act as workers, then copy the spark-1.0.0-bin-hadoop1 directory to those hosts, making sure the path is identical on each. Start it up:

[hadoop@centos spark-1.0.0-bin-hadoop1]$ sbin/start-master.sh

The master web UI is then available at http://centos.host1:8080/

[hadoop@centos spark-1.0.0-bin-hadoop1]$ sbin/start-slaves.sh spark://centos.host1:7077

The worker web UI is available at http://centos.host1:8081/

Now run the first example on Spark: a WordCount that interacts with Hadoop. First upload the word.txt file to HDFS; here its path is hdfs://centos.host1:9000/user/hadoop/data/wordcount/001/word.txt. Enter the interactive shell:

[hadoop@centos spark-1.0.0-bin-hadoop1]$ master=spark://centos.host1:7077 ./bin/spark-shell

scala> val file = sc.textFile("hdfs://centos.host1:9000/user/hadoop/data/wordcount/001/word.txt")
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> count.collect()

The console shows the following result:

res0: Array[(String, Int)] = Array((hive,2), (zookeeper,1), (pig,1), (spark,1), (hadoop,4), (hbase,2))

The result can also be saved back to HDFS:

scala> count.saveAsTextFile("hdfs://centos.host1:9000/user/hadoop/data/wordcount/001/result.txt")

Next, let's look at how to run the Java version of WordCount. A jar file is needed here: spark-assembly-1.0.0-hadoop1.0.4.jar. The WordCount code is as follows:

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    @SuppressWarnings("serial")
    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: JavaWordCount <file>");
            System.exit(1);
        }
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = ctx.textFile(args[0], 1);
        // Split each line into words.
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String s) {
                return Arrays.asList(SPACE.split(s));
            }
        });
        // Pair each word with an initial count of 1.
        JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        // Sum the counts for each word.
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });
        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + " : " + tuple._2());
        }
        ctx.stop();
    }
}

Export the class files into a jar, here built as mining.jar. Then run the command below, where --class specifies the main class, --master specifies the Spark master address, followed by the jar to execute and its arguments.

[hadoop@centos spark-1.0.0-bin-hadoop1]$ bin/spark-submit --class org.project.modules.spark.java.WordCount --master spark://centos.host1:7077 /home/hadoop/project/mining.jar hdfs://centos.host1:9000/user/hadoop/data/wordcount/001/word.txt

The console shows the following result:

spark : 1
hive : 2
hadoop : 4
zookeeper : 1
pig : 1
hbase : 2

Finally, let's look at how to run the Python version of WordCount. The WordCount code is as follows:

import sys
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: wordcount <file>"
        exit(-1)
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print "%s: %i" % (word, count)

The input file path can be either a local file or a file on HDFS; the commands are as follows:

[hadoop@centos spark-1.0.0-bin-hadoop1]$ bin/spark-submit --master spark://centos.host1:7077 /home/hadoop/project/WordCount.py /home/hadoop/temp/word.txt
[hadoop@centos spark-1.0.0-bin-hadoop1]$ bin/spark-submit --master spark://centos.host1:7077 /home/hadoop/project/WordCount.py hdfs://centos.host1:9000/user/hadoop/data/wordcount/001/word.txt

The console shows the following result:

spark: 1
hbase: 2
hive: 2
zookeeper: 1
hadoop: 4
pig: 1
... Read more

Spark Hadoop Scala RDD WordCount

Hadoop Learning: MapReduce (I)

After studying the HDFS architecture and Hadoop configuration management, we now turn to writing and managing MapReduce applications. First, a brief introduction to the MapReduce framework. MapReduce is a software framework that makes it easy to write applications that run in parallel, in a reliable and fault-tolerant manner, on very large clusters of commodity hardware (thousands of nodes), processing huge amounts of data (multi-terabyte datasets). A MapReduce job typically splits the input dataset into independent blocks, which are processed by the map tasks in a fully parallel way; the MapReduce framework sorts the map outputs, and those outputs then become the input to the reduce tasks. Typically, both the input and the output of a job are stored in a file system. The MapReduce framework handles scheduling tasks, monitoring them, and re-executing failed tasks. ... Read more
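To make the map and reduce phases concrete, here is a minimal Java sketch in the style of the classic Hadoop WordCount example (not code from the article itself; the class names and the Hadoop 2.x mapreduce API are assumptions): the mapper emits (word, 1) pairs, and the reducer sums the counts that the framework has grouped per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMR {

    // Map phase: for each input line, emit (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts that arrive grouped by word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountMR.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner is optional but reduces shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}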

Hadoop mapreduce HDFS WordCount
