Building a Data Warehouse with Hadoop + Hive, and Some Tests

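This write-up is part of a project, so the project name has been removed. It is shared here for discussion; MSN: sdtvATmsn.com. If you repost it, please credit the source:

www.bagbaby.cn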
http://hi.baidu.com/dd_shop

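Background: Requirements and Current State
The current logging setup can hardly be called a system. All logs simply sit on a few servers, and the data is shared and processed over NFS. This causes many problems:
a) Data is stored haphazardly, with no systematic directory management;
b) Storage space is limited and very hard to expand;
c) CV/PV and other logs are stored in separate places, so merging them is inconvenient;
d) Media service logs are stored centrally, and the volume is too large for lightweight backups;
e) Data is lost from time to time and cannot be recovered;
f) Data fetching is slow and often becomes the bottleneck for computation;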

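Requirements Analysis
a) Log collection & storage
Log name    Daily log size    Backup    Administrator
.............

b) Log preprocessing
See attachment: it covers only the PV & CV logs; I have not seen the formats of the other logs.
Excel: ****log.xlsx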

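Data Preprocessing
d) Data standardization (cleaning)
Handling missing data
In the PV logs, many records have empty attribute values, for example areaID, IP, referUrl, open, and so on. Empty attribute values can be handled with the following methods:
Ignore the record. If a record is missing attribute values, exclude it from the data mining process, especially when the missing value is a class attribute that serves as the main classification data. This method is not very effective, though, particularly when the proportion of records with missing values differs widely from attribute to attribute.
Fill in missing values manually. This is generally time-consuming, and for large data sets with many missing values it is hardly feasible.
Fill in missing values with a default. All missing values of an attribute are filled with a single predetermined value.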
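The first and third approaches above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the record layout (a map keyed by field name), the choice of IP as the required field, and the default value "unknown" are hypothetical and are not taken from the project's actual log format. Manual filling is left out because it cannot be automated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: drop records that lack a required field,
// and fill the remaining gaps with predetermined defaults.
final class MissingValueCleaner {

    // True when the value is absent or blank.
    static boolean isMissing(String value) {
        return value == null || value.trim().isEmpty();
    }

    // requiredField: a record without it is ignored.
    // defaults: field name -> value used to fill a missing entry.
    static List<Map<String, String>> clean(List<Map<String, String>> records,
                                           String requiredField,
                                           Map<String, String> defaults) {
        List<Map<String, String>> cleaned = new ArrayList<>();
        for (Map<String, String> record : records) {
            if (isMissing(record.get(requiredField))) {
                continue; // "ignore the record"
            }
            Map<String, String> copy = new HashMap<>(record);
            for (Map.Entry<String, String> d : defaults.entrySet()) {
                if (isMissing(copy.get(d.getKey()))) {
                    copy.put(d.getKey(), d.getValue()); // "fill with a default"
                }
            }
            cleaned.add(copy);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        Map<String, String> complete = new HashMap<>();
        complete.put("IP", "10.0.0.1");
        complete.put("areaID", "");   // missing, will be filled

        Map<String, String> noIp = new HashMap<>();
        noIp.put("areaID", "bj");     // IP missing, record is dropped

        Map<String, String> defaults = new HashMap<>();
        defaults.put("areaID", "unknown");

        List<Map<String, String>> records = new ArrayList<>();
        records.add(complete);
        records.add(noIp);
        System.out.println(clean(records, "IP", defaults));
    }
}
```

The input records are copied rather than modified, so the raw log data stays available if a different default is chosen later.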
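Handling inconsistent data
Real-world data often contains inconsistent records. Some of these inconsistencies can be resolved manually by checking against external references. For example, when different servers use different encodings, preprocessing can help correct the inconsistencies introduced by the encoding.
e) Data transformation
Data transformation mainly means normalizing the data. For example, in a customer database with an age attribute and a salary attribute, salary values are far larger than age values. Without normalization, distances computed on salary will far exceed distances computed on age, which means the contribution of the salary attribute to the overall distance between data objects is wrongly magnified.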
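A minimal sketch of one common normalization, min-max scaling to the range [0, 1], is shown below. The source does not say which normalization the project used, so the technique and the sample ages and salaries are assumptions chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch of min-max normalization.
final class MinMaxNormalizer {

    // Rescales values linearly to [0, 1]; a constant column maps to 0.0.
    static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] ages = {25, 40, 55};
        double[] salaries = {3000, 8000, 13000};
        // Both lines print [0.0, 0.5, 1.0]: after scaling, neither
        // attribute dominates a distance calculation.
        System.out.println(Arrays.toString(normalize(ages)));
        System.out.println(Arrays.toString(normalize(salaries)));
    }
}
```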
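f) Data merging
Complex analysis over a large database usually takes a long time, which often makes such analysis impractical or infeasible, especially when interactive data mining is needed. Data reduction techniques help derive a compact data set from the original large one while preserving the integrity of the original data. Mining the reduced data set is clearly more efficient, and the results are essentially the same as those obtained from the original data set.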
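As a rough illustration of data reduction, the sketch below uses aggregation: one record per page view is collapsed into one count per area. Aggregation is only one reduction technique, and the source does not say which one the project applied; the area IDs are made-up sample values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: reduce raw page-view records to per-area counts.
final class PvAggregator {

    // Collapses one entry per page view into one count per area ID.
    static Map<String, Long> countByArea(List<String> areaIds) {
        Map<String, Long> counts = new TreeMap<>();
        for (String id : areaIds) {
            counts.merge(id, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("bj", "sh", "bj", "gz", "bj", "sh");
        // Six raw records become three rows: {bj=3, gz=1, sh=2}
        System.out.println(countByArea(raw));
    }
}
```

Any later query that only needs page views per area gives the same answer on the reduced data as on the raw records, while reading far fewer rows.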

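Introduction to Hadoop
........
The Hadoop Family
Hadoop as a whole consists of the following subprojects:
Member name: purpose
Hadoop Common: The lowest-level module of the Hadoop stack. It provides utilities for the other Hadoop subprojects, such as configuration file and log handling.
Avro: An RPC project led by Doug Cutting, somewhat similar to Google's protobuf and Facebook's Thrift. Avro is intended to become Hadoop's future RPC layer, making Hadoop's RPC communication faster and its data structures more compact.
Chukwa: A monitoring system for large clusters built on Hadoop, contributed by Yahoo.
HBase: An open-source, column-oriented distributed database built on the Hadoop Distributed File System.
HDFS: Distributed file system.
Hive: Similar to CloudBase, Hive is software that provides data warehouse SQL functionality on top of the Hadoop distributed computing platform. It simplifies summarization and ad hoc querying of the massive data stored in Hadoop. Hive provides a query language, QL, which is based on SQL and is very convenient to use.
MapReduce: Implements the MapReduce programming framework.
Pig: A high-level, loosely SQL-like query language built on MapReduce. It compiles operations into the Map and Reduce phases of the MapReduce model, and users can define their own functions. It was developed by Yahoo's grid computing team and is often compared to Google's Sawzall.
ZooKeeper: An open-source implementation of Google's Chubby. It is a reliable coordination system for large distributed systems, offering configuration maintenance, naming, distributed synchronization, group services, and more. ZooKeeper's goal is to encapsulate complex, error-prone key services and give users a simple interface on top of an efficient, stable system.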

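Installing Hadoop
h) Operating system
Linux 2.6.31-20-generic, Ubuntu 9.10
i) Required software
ssh
apt-get install openssh-server

rsync
apt-get install rsync

Java 1.6
apt-get install sun-java6-jre sun-java6-jdk

ant
apt-get install ant
j) Environment setup: passwordless SSH login:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Note: a single machine does not need the copy-to-remote step below (shown in blue in the original formatting, which has been lost); there, append ~/.ssh/id_dsa.pub to ~/.ssh/authorized_keys directly.
scp ~/.ssh/id_dsa.pub hadoop@*.*.*.*:/home/hadoop/id_dsa.pub
cat ~/id_dsa.pub >> ~/.ssh/authorized_keys
Test the login:
ssh localhost   or   ssh *.*.*.*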

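k) Build
i. Download it from the official site; I won't go into that here.
ii. We install everything Hadoop-related under /usr/local/
tar zxvf hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
cd hadoop

iii. Configure Hadoop (I copied the official default configuration rather than writing my own. This one is for a single machine; for a cluster, see http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html)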
conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

iv. Format Hadoop
hadoop namenode -format
If you see the error "Error: JAVA_HOME is not set.", Java home has not been configured.
We set JAVA_HOME globally:
vi /etc/environment
Add JAVA_HOME, and add /usr/local/hadoop/bin to PATH:
JAVA_HOME="/usr/lib/jvm/java-6-sun"
v. Start Hadoop
start-all.sh
vi. Check that Hadoop is running properly
netstat -nl | more
tcp6       0      0 127.0.0.1:9000          :::*                    LISTEN
tcp6       0      0 127.0.0.1:9001          :::*                    LISTEN
tcp6       0      0 :::50090                :::*                    LISTEN
tcp6       0      0 :::50070                :::*                    LISTEN
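The same check can be scripted instead of reading the netstat output by eye. The sketch below simply tries to open a TCP connection to each of the four ports listed above; the class name, the one-second timeout, and the use of 127.0.0.1 are assumptions made for this example, and a successful connection only shows that something is listening, not that the daemon is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch: probe the ports that the Hadoop daemons should listen on.
final class PortCheck {

    // True if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }

    public static void main(String[] args) {
        // 9000 and 9001 come from core-site.xml and mapred-site.xml above;
        // 50070 and 50090 are the other two ports in the netstat listing.
        int[] ports = {9000, 9001, 50070, 50090};
        for (int port : ports) {
            String state = isListening("127.0.0.1", port, 1000) ? "LISTEN" : "closed";
            System.out.println(port + " -> " + state);
        }
    }
}
```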

vii. Test
hadoop fs -put CHANGES.txt input/
hadoop fs -ls input
This example counts how many times each string matching the regular expression occurs in the input:
hadoop jar hadoop-*-examples.jar grep input output '[a-z.]+'
/usr/local/hadoop# hadoop fs -cat output/* | more
cat: Source must be a file.
3828    .
1969    via
1375    to
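To make clear what the grep example computes, here is a local, single-process sketch of the same idea: find every match of the regular expression, count identical matches, and list them by descending count. It does not use Hadoop at all; the class name and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical local equivalent of the grep example's counting logic.
final class LocalGrepCount {

    // Counts every match of regex in text, sorted by descending count
    // (ties broken alphabetically so the order is deterministic).
    static List<Map.Entry<String, Integer>> count(String text, String regex) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return sorted;
    }

    public static void main(String[] args) {
        String text = "fix via patch. via hudson. to trunk.";
        for (Map.Entry<String, Integer> e : count(text, "[a-z.]+")) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

Note that the pattern '[a-z.]+' also matches a lone period, which is why "." heads the Hadoop output above.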

l) API overview
See attachment: Word: hadoop的API.docx
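Installing Hive
n) Download the latest version from the official site; I won't go into that here.
o) Extract:
tar zxvf hive-0.5.0-bin.tar.gz
ln -s hive-0.5.0-bin hive
p) Configure the Hive environment
vi /etc/environment
HIVE_HOME="/usr/local/hive/"
q) Create the Hive storage directory
hadoop fs -mkdir       /user/hive/warehouse
hadoop fs -chmod g+w   /user/hive/warehouse
r) Start Hive
hive
You should now see the hive> prompt.
Create the pokes table.
hive> CREATE TABLE pokes (foo INT, bar STRING);
Load the test data; the file being loaded has two columns.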
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> select count(1) from pokes;

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0018, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0018
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0018
2010-04-13 16:32:12,188 Stage-1 map = 0%, reduce = 0%
2010-04-13 16:32:29,536 Stage-1 map = 100%, reduce = 0%
2010-04-13 16:32:38,768 Stage-1 map = 100%, reduce = 33%
2010-04-13 16:32:44,916 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0018
OK
500
Time taken: 38.379 seconds

hive> select count(bar),bar from pokes group by bar;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0017, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0017
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0017
2010-04-13 16:26:55,791 Stage-1 map = 0%, reduce = 0%
2010-04-13 16:27:11,165 Stage-1 map = 100%, reduce = 0%
2010-04-13 16:27:20,268 Stage-1 map = 100%, reduce = 33%
2010-04-13 16:27:25,348 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0017
OK
3       val_0
1 val_10
……………
Time taken: 37.979 seconds

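s) Hive API:
Reference: http://hadoop.apache.org/hive/docs/current/api/org/apache/hadoop/hive/conf/
t) Using MySQL as the Hive metastore
Reference: http://www.mazsoft.com/blog/post/2010/02/01/Setting-up-HadoopHive-to-use-MySQL-as-metastore.aspx
This stores the metadata in MySQL so that the table list is still available if HDFS goes down. I don't think that is necessary: if HDFS is down, having the metadata is of little use. (That said, Hive's default metastore is an embedded Derby database that allows only one session at a time, so a MySQL metastore is also what lets several users share the same metadata.)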

Some Hive & Hadoop tests:
v) Comparing the space and time taken to load source data in gz or bz2 format:

hive> load data local inpath 'ok.txt.gz' overwrite into table page_test2 partition(dt='2010-04-16');
Copying data from file:/usr/local/ok.txt.gz
Loading data to table page_test2 partition {dt=2010-04-16}
OK
Time taken: 3.649 seconds
Size of the Hive table file as stored by Hadoop:
/tmp/hadoop-root/dfs/data/current# du -ch blk_-945326243445352181
22M     blk_-945326243445352181
22M     total

w) Loading a plain text file:
hive> load data local inpath 'ok.txt' overwrite into table page_test partition(dt='2010-04-17');
Copying data from file:/usr/local/ok.txt
Loading data to table page_test partition {dt=2010-04-17}
OK
Time taken: 41.593 seconds
Size of the Hive table file as stored by Hadoop:
/tmp/hadoop-root/dfs/data/current# du -ch blk_7538941016314062501
64M     blk_7538941016314062501
64M     total

x) Source file sizes:
/usr/local# du -ch ok.txt
196M    ok.txt
196M    total

/usr/local# du -ch ok.txt.gz
22M     ok.txt.gz
22M     total
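The sizes above work out to a compression ratio of roughly 9 to 1 (196M of text against 22M of gzip). For readers who want to estimate the ratio for their own logs before loading them, the sketch below measures it in memory with the JDK's GZIPOutputStream. The log line format in main is invented for the example, and highly repetitive synthetic lines compress much better than real logs usually do.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: estimate how well a sample of log text compresses with gzip.
final class GzipRatio {

    // Returns the gzip-compressed size of data, in bytes.
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        } // closing the stream flushes the gzip trailer into the buffer
        return buffer.size();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            // Made-up log line: date, URL, client address.
            sample.append("2010-04-16\t/index.html\t10.0.0.").append(i % 256).append('\n');
        }
        byte[] raw = sample.toString().getBytes(StandardCharsets.UTF_8);
        int packed = gzipSize(raw);
        System.out.printf("raw=%d bytes, gzip=%d bytes, ratio=%.1f%n",
                raw.length, packed, (double) raw.length / packed);
    }
}
```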

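y) Hive query comparison:
The results show that querying the compressed data is actually a bit faster than querying the uncompressed data, which is odd. One plausible explanation is that the job is I/O-bound on this single node, and the gz file is roughly one ninth the size of the text file.
Query with Hive after loading the gz file into a partition:
hive> select count(1) from page_test2 a where a.dt='2010-04-16';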
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0026, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0026
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0026
2010-04-16 13:43:39,435 Stage-1 map = 0%, reduce = 0%
2010-04-16 13:47:30,921 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0026
OK
17166483
Time taken: 239.447 seconds
Query with Hive QL after loading the txt file into a partition:
hive> select count(1) from page_test a where a.dt='2010-04-16';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0025, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0025
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0025
2010-04-16 13:37:11,927 Stage-1 map = 0%, reduce = 0%
2010-04-16 13:42:01,382 Stage-1 map = 100%, reduce = 22%
2010-04-16 13:42:13,683 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0025
OK
17166483
Time taken: 314.291 seconds
I did not record the result of querying the txt data without a partition; it took a little over 400 seconds.
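To put the two recorded timings side by side, the small calculation below works out how much time the gz-backed query saved. The only inputs are the two "Time taken" values from the logs above; the helper name is made up. The gz query comes out at about 24% less time, or roughly 1.3 times as fast, on this single run on a single node.

```java
// Hypothetical helper: compare the two "Time taken" values recorded above.
final class QueryTimeComparison {

    // Percentage of the slower run's time that the faster run saved.
    static double percentSaved(double slowerSeconds, double fasterSeconds) {
        return (slowerSeconds - fasterSeconds) / slowerSeconds * 100.0;
    }

    public static void main(String[] args) {
        double gzSeconds = 239.447;  // page_test2, loaded from ok.txt.gz
        double txtSeconds = 314.291; // page_test, loaded from ok.txt
        System.out.printf("gz query took %.1f%% less time (speedup %.2fx)%n",
                percentSaved(txtSeconds, gzSeconds), txtSeconds / gzSeconds);
    }
}
```

One run per configuration is not enough to draw firm conclusions, so the figures are best read as a rough indication.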

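Hive Development
a) Start the Hive service:
Start the Hive service on port 10000
HIVE_PORT=10000 ./bin/hive --service hiveserver

b) Check that the service has started:
netstat -nl | grep 10000

c) Write a test program:
This is the official example. It compiled for me, but it fails at run time, and I have not tracked down the problem.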

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

    /**
     * @param args
     * @throws SQLException
     */
    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        stmt.executeQuery("drop table " + tableName);
        ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");

        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }

        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
        String filepath = "/tmp/a.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);

        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }

        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}