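1. Introduction
There is only one Secondary NameNode. Its job is to assist the NameNode with checkpointing the metadata, that is, merging the edit log into the fsimage file.
The Secondary NameNode is a daemon process that periodically triggers a checkpoint and communicates with the NameNode through NamenodeProtocol.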
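Parameters:
| No. | Parameter | Default | Description |
|---|---|---|---|
| 1 | dfs.namenode.checkpoint.check.period | 60s | The SecondaryNameNode and CheckpointNode poll the NameNode every 'dfs.namenode.checkpoint.check.period' seconds to query the number of uncheckpointed transactions. |
| 2 | dfs.namenode.checkpoint.period | 3600s (1 hour) | Maximum delay between two consecutive checkpoints. |
| 3 | dfs.namenode.checkpoint.txns | 1,000,000 | Number of uncheckpointed transactions that forces a checkpoint, even if the checkpoint period has not elapsed. |
| 4 | dfs.namenode.checkpoint.max-retries | 3 | Number of times a failed checkpoint merge is retried. |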
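To make the defaults in the table concrete, here is a minimal sketch that resolves the four keys with their fallback values. It uses plain `java.util.Properties` as a stand-in for Hadoop's configuration object; the class `CheckpointSettings` and its field names are hypothetical, and the sketch assumes values are given as plain numbers (no `s` suffix).

```java
import java.util.Properties;

// Hypothetical holder for the four checkpoint settings; this is NOT a Hadoop class.
final class CheckpointSettings {
    final long checkPeriodSeconds; // how often the NameNode is polled
    final long periodSeconds;      // max delay between two checkpoints
    final long txnThreshold;       // uncheckpointed transactions that force a checkpoint
    final int maxRetries;          // retries after a failed merge

    CheckpointSettings(Properties props) {
        // Assumption: values are plain integers; defaults follow the table above.
        checkPeriodSeconds = Long.parseLong(
            props.getProperty("dfs.namenode.checkpoint.check.period", "60"));
        periodSeconds = Long.parseLong(
            props.getProperty("dfs.namenode.checkpoint.period", "3600"));
        txnThreshold = Long.parseLong(
            props.getProperty("dfs.namenode.checkpoint.txns", "1000000"));
        maxRetries = Integer.parseInt(
            props.getProperty("dfs.namenode.checkpoint.max-retries", "3"));
    }

    public static void main(String[] args) {
        Properties overrides = new Properties();
        overrides.setProperty("dfs.namenode.checkpoint.period", "1800");
        CheckpointSettings s = new CheckpointSettings(overrides);
        System.out.println("check period (s): " + s.checkPeriodSeconds);
        System.out.println("period (s):       " + s.periodSeconds);
        System.out.println("txn threshold:    " + s.txnThreshold);
        System.out.println("max retries:      " + s.maxRetries);
    }
}
```

Any key that is not overridden falls back to the default listed in the table.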
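2. Checkpoint flow
In a non-HA deployment, the Secondary Namenode is responsible for merging the FSImage.
The Secondary Namenode triggers an FSImage merge when either of the following conditions is met:
① the configured checkpoint interval has elapsed (dfs.namenode.checkpoint.period, default: 1 hour);
② the number of transactions since the last checkpoint exceeds the configured limit (dfs.namenode.checkpoint.txns, default: 1,000,000).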
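The two trigger conditions can be sketched as a single predicate. This is an illustration only: the class `CheckpointTrigger`, the method name, and the parameter names are hypothetical, and the use of `>=` for both comparisons is an assumption of this sketch.

```java
// Hypothetical helper illustrating the two checkpoint trigger conditions.
final class CheckpointTrigger {

    /**
     * @param uncheckpointedTxns   transactions logged since the last checkpoint
     * @param txnThreshold         value of dfs.namenode.checkpoint.txns
     * @param nowMillis            current (monotonic) time in milliseconds
     * @param lastCheckpointMillis time of the last checkpoint in milliseconds
     * @param periodSeconds        value of dfs.namenode.checkpoint.period
     */
    static boolean shouldCheckpoint(long uncheckpointedTxns, long txnThreshold,
                                    long nowMillis, long lastCheckpointMillis,
                                    long periodSeconds) {
        // Condition 2: too many transactions since the last checkpoint.
        boolean tooManyTxns = uncheckpointedTxns >= txnThreshold;
        // Condition 1: the checkpoint period has elapsed.
        boolean periodElapsed = nowMillis >= lastCheckpointMillis + 1000L * periodSeconds;
        return tooManyTxns || periodElapsed;
    }

    public static void main(String[] args) {
        // 10 minutes after the last checkpoint, only 500 transactions: no checkpoint.
        System.out.println(shouldCheckpoint(500, 1_000_000, 600_000, 0, 3600));
        // Same elapsed time, but the transaction threshold has been reached.
        System.out.println(shouldCheckpoint(1_000_000, 1_000_000, 600_000, 0, 3600));
    }
}
```

Either condition alone is enough, so a busy cluster checkpoints on transaction count long before the hour is up, while an idle one still checkpoints once per period.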
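The flow proceeds as follows:
■ The Secondary Namenode checks whether either of the two checkpoint trigger conditions is met. Because the Secondary Namenode and the Namenode do not share an editlog directory in non-HA mode, the Secondary Namenode obtains the latest transaction id (transactionId) by calling the RPC method NamenodeProtocol.getTransactionID().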
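■ The Secondary Namenode calls the RPC method NamenodeProtocol.rollEditLog() to roll the editlog: the editlog segment currently being written is closed and a new edit.new file is created. The call also returns the transaction ids of the current fsimage and of the editlog that was just rolled (seen_id). While the Secondary Namenode reads editlog files from the Namenode, new operations are written to edit.new, so edit logging is not interrupted. In HA mode there is no need to trigger the roll explicitly, because the Standby Namenode rolls the editlog periodically.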
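■ With the latest txid and seen_id, the Secondary Namenode sends HTTP GET requests to the Namenode's GetImageServlet to fetch the new fsimage and editlog files. Note that the Secondary Namenode may already have fetched some of the fsimage and edits files during the previous checkpoint.
■ The Secondary Namenode loads the newly downloaded fsimage file to rebuild its own namespace.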
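■ The Secondary Namenode reads the records in the edits files and applies them to its current namespace, which brings the Secondary Namenode's namespace in sync with the Namenode's.
■ The Secondary Namenode writes the synchronized namespace to a new fsimage file.
■ The Secondary Namenode sends the HTTP GET request /getimage?putimage=1 to the Namenode's ImageServlet. The request URL also carries the transaction id of the new fsimage file, plus the port and IP address from which the Secondary Namenode serves the download.
■ Using the information supplied by the Secondary Namenode, the Namenode sends an HTTP GET request to the Secondary Namenode's GetImageServlet to download the fsimage file. The Namenode first names the downloaded file fsimage.ckpt_, then creates the MD5 checksum, and finally renames fsimage.ckpt_ to fsimage_xxxxx.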
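The load-then-replay part of the steps above can be illustrated with a toy model. This is a sketch under strong simplifications, not Hadoop's implementation: the namespace is reduced to a set of paths, only two operation types exist, the edits are assumed to arrive sorted by transaction id, and every class and method name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model of "load fsimage, replay edits, write new fsimage"; not Hadoop code.
final class CheckpointMergeSketch {
    enum OpType { MKDIR, DELETE }

    static final class EditOp {
        final long txId;
        final OpType type;
        final String path;
        EditOp(long txId, OpType type, String path) {
            this.txId = txId;
            this.type = type;
            this.path = path;
        }
    }

    // An "image" is the namespace as of a given transaction id.
    static final class Image {
        final long lastAppliedTxId;
        final TreeSet<String> paths;
        Image(long lastAppliedTxId, TreeSet<String> paths) {
            this.lastAppliedTxId = lastAppliedTxId;
            this.paths = paths;
        }
    }

    /** Applies every edit newer than the image and returns the merged image. */
    static Image merge(Image base, List<EditOp> editsSortedByTxId) {
        TreeSet<String> paths = new TreeSet<>(base.paths);
        long lastTxId = base.lastAppliedTxId;
        for (EditOp op : editsSortedByTxId) {
            if (op.txId <= lastTxId) {
                continue; // already reflected in the image
            }
            if (op.type == OpType.MKDIR) {
                paths.add(op.path);
            } else {
                paths.remove(op.path);
            }
            lastTxId = op.txId;
        }
        return new Image(lastTxId, paths);
    }

    public static void main(String[] args) {
        TreeSet<String> initial = new TreeSet<>();
        initial.add("/a");
        initial.add("/b");
        Image base = new Image(10, initial);

        List<EditOp> edits = new ArrayList<>();
        edits.add(new EditOp(9, OpType.MKDIR, "/old"));  // older than the image, skipped
        edits.add(new EditOp(11, OpType.MKDIR, "/c"));
        edits.add(new EditOp(12, OpType.DELETE, "/a"));

        Image merged = merge(base, edits);
        System.out.println("txid=" + merged.lastAppliedTxId + " paths=" + merged.paths);
    }
}
```

The transaction id stored with the merged image is what lets the next checkpoint skip edits that are already reflected in it.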
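3. Startup
Start with the main function. There are two startup modes.
Mode 1: run a single command, then exit.
CHECKPOINT: run a checkpoint manually; if the trigger conditions are not met, the checkpoint is still skipped.
GETEDITSIZE: print the number of transactions that have not been checkpointed yet.
Mode 2: run as a daemon [starts the InfoServer and the CheckpointThread, which runs checkpoints periodically].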
    /**
     * main() has some simple utility methods.
     * @param argv Command line parameters.
     * @exception Exception if the filesystem does not exist.
     */
    public static void main(String[] argv) throws Exception {
      CommandLineOpts opts = SecondaryNameNode.parseArgs(argv);
      if (opts == null) {
        LOG.error("Failed to parse options");
        terminate(1);
      } else if (opts.shouldPrintHelp()) {
        opts.usage();
        System.exit(0);
      }

      try {
        StringUtils.startupShutdownMessage(SecondaryNameNode.class, argv, LOG);
        Configuration tconf = new HdfsConfiguration();
        SecondaryNameNode secondary = null;
        secondary = new SecondaryNameNode(tconf, opts);

        // SecondaryNameNode can be started in 2 modes:
        // 1. run a command (i.e. checkpoint or geteditsize) then terminate
        // 2. run as a daemon when {@link #parseArgs} yields no commands
        if (opts != null && opts.getCommand() != null) {
          // mode 1
          int ret = secondary.processStartupCommand(opts);
          terminate(ret);
        } else {
          // mode 2
          secondary.startInfoServer();
          secondary.startCheckpointThread();
          secondary.join();
        }
      } catch (Throwable e) {
        LOG.error("Failed to start secondary namenode", e);
        terminate(1);
      }
    }
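The mode selection in main() boils down to "was a command given or not". A stripped-down sketch of that decision, with no Hadoop dependencies, might look like this. `StartupModeSketch`, `parseCommand`, and `describeMode` are hypothetical names; only the `-checkpoint` and `-geteditsize` flags come from the real command line, and extra arguments after the flag are ignored here.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical, dependency-free illustration of the two startup modes.
final class StartupModeSketch {
    enum Command { CHECKPOINT, GETEDITSIZE }

    /** Returns the one-shot command, or empty when the process should run as a daemon. */
    static Optional<Command> parseCommand(String[] argv) {
        if (argv.length == 0) {
            return Optional.empty();
        }
        switch (argv[0].toLowerCase(Locale.ROOT)) {
            case "-checkpoint":
                return Optional.of(Command.CHECKPOINT);
            case "-geteditsize":
                return Optional.of(Command.GETEDITSIZE);
            default:
                throw new IllegalArgumentException("Unknown option: " + argv[0]);
        }
    }

    static String describeMode(String[] argv) {
        return parseCommand(argv)
            .map(c -> "run " + c + " then exit")   // mode 1
            .orElse("run as daemon");              // mode 2
    }

    public static void main(String[] args) {
        System.out.println(describeMode(new String[] {"-geteditsize"}));
        System.out.println(describeMode(new String[0]));
    }
}
```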
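We go straight to the second mode.
4. startInfoServer
First, an HTTP server is started for communicating with the NameNode [Hadoop 3.x defaults: dfs.namenode.secondary.http-address : 0.0.0.0:9868, dfs.namenode.secondary.https-address : 0.0.0.0:9869].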
    /**
     * Start the web server.
     */
    @VisibleForTesting
    public void startInfoServer() throws IOException {
      final InetSocketAddress httpAddr = getHttpAddress(conf);

      // Default: dfs.namenode.secondary.https-address : 0.0.0.0:9869
      final String httpsAddrString = conf.getTrimmed(
          DFSConfigKeys.DFS_NAMENODE_SECONDARY_HTTPS_ADDRESS_KEY,
          DFSConfigKeys.DFS_NAMENODE_SECONDARY_HTTPS_ADDRESS_DEFAULT);
      InetSocketAddress httpsAddr = NetUtils.createSocketAddr(httpsAddrString);

      // Build the HTTP server
      HttpServer2.Builder builder = DFSUtil.httpServerTemplateForNNAndJN(conf,
          httpAddr, httpsAddr, "secondary",
          DFSConfigKeys.DFS_SECONDARY_NAMENODE_KERBEROS_INTERNAL_SPNEGO_PRINCIPAL_KEY,
          DFSConfigKeys.DFS_SECONDARY_NAMENODE_KEYTAB_FILE_KEY);

      // dfs.xframe.enabled : default true
      // If true, clickjacking protection is enabled by returning an
      // X-Frame-Options header set to SAMEORIGIN.
      // Clickjacking protection prevents an attacker from using transparent or
      // opaque layers to trick a user into clicking a button or link on another page.
      final boolean xFrameEnabled = conf.getBoolean(
          DFSConfigKeys.DFS_XFRAME_OPTION_ENABLED,
          DFSConfigKeys.DFS_XFRAME_OPTION_ENABLED_DEFAULT);

      // dfs.xframe.value : default SAMEORIGIN; options: DENY, SAMEORIGIN, ALLOW-FROM
      final String xFrameOptionValue = conf.getTrimmed(
          DFSConfigKeys.DFS_XFRAME_OPTION_VALUE,
          DFSConfigKeys.DFS_XFRAME_OPTION_VALUE_DEFAULT);

      builder.configureXFrame(xFrameEnabled).setXFrameOption(xFrameOptionValue);

      infoServer = builder.build();
      infoServer.setAttribute("secondary.name.node", this);
      infoServer.setAttribute("name.system.image", checkpointImage);
      infoServer.setAttribute(JspHelper.CURRENT_CONF, conf);
      infoServer.addInternalServlet("imagetransfer", ImageServlet.PATH_SPEC,
          ImageServlet.class, true);
      infoServer.start();

      LOG.info("Web server init done");

      HttpConfig.Policy policy = DFSUtil.getHttpPolicy(conf);
      int connIdx = 0;
      if (policy.isHttpEnabled()) {
        InetSocketAddress httpAddress =
            infoServer.getConnectorAddress(connIdx++);
        // dfs.namenode.secondary.http-address
        conf.set(DFSConfigKeys.DFS_NAMENODE_SECONDARY_HTTP_ADDRESS_KEY,
            NetUtils.getHostPortString(httpAddress));
      }

      if (policy.isHttpsEnabled()) {
        InetSocketAddress httpsAddress =
            infoServer.getConnectorAddress(connIdx);
        conf.set(DFSConfigKeys.DFS_NAMENODE_SECONDARY_HTTPS_ADDRESS_KEY,
            NetUtils.getHostPortString(httpsAddress));
      }
    }
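5. startCheckpointThread
This starts the checkpoint thread. There is not much to say here: it simply starts a daemon thread.
SecondaryNameNode implements the Runnable interface, so the thread invokes its run() method, which in turn calls doWork().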
    public void startCheckpointThread() {
      Preconditions.checkState(checkpointThread == null,
          "Should not already have a thread");
      Preconditions.checkState(shouldRun, "shouldRun should be true");

      checkpointThread = new Daemon(this);
      checkpointThread.start();
    }
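For readers unfamiliar with the pattern, here is a rough plain-JDK equivalent of what startCheckpointThread() does. It assumes that Hadoop's Daemon wrapper behaves like a Thread marked as a daemon; the class `CheckpointerThreadSketch` and its methods are hypothetical, and the loop body is a placeholder for the real work.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Plain-JDK illustration of "a Runnable started once on a daemon thread"; not Hadoop code.
final class CheckpointerThreadSketch implements Runnable {
    private volatile boolean shouldRun = true;
    private Thread checkpointThread;
    private final AtomicInteger iterations = new AtomicInteger();

    @Override
    public void run() {
        while (shouldRun) {
            iterations.incrementAndGet(); // placeholder for the real checkpoint loop body
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    synchronized void start() {
        if (checkpointThread != null) {
            throw new IllegalStateException("Should not already have a thread");
        }
        if (!shouldRun) {
            throw new IllegalStateException("shouldRun should be true");
        }
        checkpointThread = new Thread(this, "checkpointer-sketch");
        checkpointThread.setDaemon(true); // a daemon thread does not keep the JVM alive
        checkpointThread.start();
    }

    synchronized boolean isDaemon() {
        return checkpointThread != null && checkpointThread.isDaemon();
    }

    void shutdown() {
        shouldRun = false;
        Thread t;
        synchronized (this) {
            t = checkpointThread;
        }
        if (t != null) {
            t.interrupt();
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        CheckpointerThreadSketch sketch = new CheckpointerThreadSketch();
        sketch.start();
        System.out.println("daemon thread: " + sketch.isDaemon());
        sketch.shutdown();
    }
}
```

Because the worker is a daemon thread, something else has to keep the process alive, which is why main() calls secondary.join() after starting it.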
6. doWork()
    //
    // The main work loop
    //
    public void doWork() {
      //
      // Poll the Namenode (once every checkpointCheckPeriod seconds) to find the
      // number of transactions in the edit log that haven't yet been checkpointed.
      //
      long period = checkpointConf.getCheckPeriod();
      int maxRetries = checkpointConf.getMaxRetriesOnMergeError();

      while (shouldRun) {
        try {
          Thread.sleep(1000 * period);
        } catch (InterruptedException ie) {
          // do nothing
        }
        if (!shouldRun) {
          break;
        }
        try {
          // We may have lost our ticket since last checkpoint, log in again, just in case
          if (UserGroupInformation.isSecurityEnabled())
            UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab();

          final long monotonicNow = Time.monotonicNow();
          final long now = Time.now();

          // Checkpoint if the transaction limit has been reached (default 1,000,000)
          // or if the checkpoint period has elapsed since the last checkpoint (default 1 hour)
          if (shouldCheckpointBasedOnCount() ||
              monotonicNow >= lastCheckpointTime + 1000 * checkpointConf.getPeriod()) {
            // Run the checkpoint
            doCheckpoint();
            lastCheckpointTime = monotonicNow;
            lastCheckpointWallclockTime = now;
          }
        } catch (IOException e) {
          LOG.error("Exception in doCheckpoint", e);
          e.printStackTrace();
          // Prevent a huge number of edits from being created due to
          // unrecoverable conditions and endless retries.
          if (checkpointImage.getMergeErrorCount() > maxRetries) {
            LOG.error("Merging failed " +
                checkpointImage.getMergeErrorCount() + " times.");
            terminate(1);
          }
        } catch (Throwable e) {
          LOG.error("Throwable Exception in doCheckpoint", e);
          e.printStackTrace();
          terminate(1, e);
        }
      }
    }
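The give-up rule in the catch block is easy to misread, so here is a small simulation of it. It is a sketch with simplifying assumptions: every IOException counts as a merge error, a successful run resets the counter, and "terminate" is modelled as a return value. `RetryLoopSketch`, `CheckpointAction`, and `runUntilGiveUp` are hypothetical names.

```java
import java.io.IOException;

// Simulation of "stop once the error count exceeds maxRetries"; not Hadoop code.
final class RetryLoopSketch {
    interface CheckpointAction {
        void run() throws IOException;
    }

    /**
     * Runs the action up to maxIterations times.
     * @return the attempt number at which the loop gave up, or -1 if it never gave up
     */
    static int runUntilGiveUp(CheckpointAction action, int maxRetries, int maxIterations) {
        int mergeErrorCount = 0;
        for (int attempt = 1; attempt <= maxIterations; attempt++) {
            try {
                action.run();
                mergeErrorCount = 0; // assumption: success clears the error count
            } catch (IOException e) {
                mergeErrorCount++;
                // Same comparison as in doWork(): strictly greater than maxRetries.
                if (mergeErrorCount > maxRetries) {
                    return attempt;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int gaveUpAt = runUntilGiveUp(() -> {
            throw new IOException("simulated merge failure");
        }, 3, 10);
        System.out.println("gave up at attempt " + gaveUpAt);
    }
}
```

With maxRetries = 3, the strict `>` comparison means the loop tolerates three consecutive failures and gives up on the fourth.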
7. doCheckpoint [core checkpoint operation]
    /**
     * Create a new checkpoint
     * @return if the image is fetched from primary or not
     */
    @VisibleForTesting
    @SuppressWarnings("deprecated")
    public boolean doCheckpoint() throws IOException {
      checkpointImage.ensureCurrentDirExists();
      NNStorage dstStorage = checkpointImage.getStorage();

      // Tell the namenode to start logging transactions in a new edit file
      // Returns a token that would be used to upload the merged image.
      // The call fails if the namenode is in safe mode.
      CheckpointSignature sig = namenode.rollEditLog();

      boolean loadImage = false;
      boolean isFreshCheckpointer = (checkpointImage.getNamespaceID() == 0);
      boolean isSameCluster =
          (dstStorage.versionSupportsFederation(NameNodeLayoutVersion.FEATURES)
              && sig.isSameCluster(checkpointImage)) ||
          (!dstStorage.versionSupportsFederation(NameNodeLayoutVersion.FEATURES)
              && sig.namespaceIdMatches(checkpointImage));
      if (isFreshCheckpointer ||
          (isSameCluster &&
           !sig.storageVersionMatches(checkpointImage.getStorage()))) {
        // if we're a fresh 2NN, or if we're on the same cluster and our storage
        // needs an upgrade, just take the storage info from the server.
        dstStorage.setStorageInfo(sig);
        dstStorage.setClusterID(sig.getClusterID());
        dstStorage.setBlockPoolID(sig.getBlockpoolID());
        loadImage = true;
      }
      sig.validateStorageInfo(checkpointImage);

      // error simulation code for junit test
      CheckpointFaultInjector.getInstance().afterSecondaryCallsRollEditLog();

      RemoteEditLogManifest manifest =
          namenode.getEditLogManifest(sig.mostRecentCheckpointTxId + 1);

      // Fetch fsimage and edits. Reload the image if previous merge failed.
      loadImage |= downloadCheckpointFiles(
          fsName, checkpointImage, sig, manifest) |
          checkpointImage.hasMergeError();
      try {
        // Run the merge
        doMerge(sig, manifest, loadImage, checkpointImage, namesystem);
      } catch (IOException ioe) {
        // A merge error occurred. The in-memory file system state may be
        // inconsistent, so the image and edits need to be reloaded.
        checkpointImage.setMergeError();
        throw ioe;
      }
      // Clear any error since merge was successful.
      checkpointImage.clearMergeError();

      //
      // Upload the new image into the NameNode. Then tell the Namenode
      // to make this new uploaded image as the most current image.
      //
      long txid = checkpointImage.getLastAppliedTxId();
      // Upload the image
      TransferFsImage.uploadImageFromStorage(fsName, conf, dstStorage,
          NameNodeFile.IMAGE, txid);

      // error simulation code for junit test
      CheckpointFaultInjector.getInstance().afterSecondaryUploadsNewImage();

      LOG.warn("Checkpoint done. New Image Size: "
          + dstStorage.getFsImageName(txid).length());

      if (legacyOivImageDir != null && !legacyOivImageDir.isEmpty()) {
        try {
          checkpointImage.saveLegacyOIVImage(namesystem, legacyOivImageDir,
              new Canceler());
        } catch (IOException e) {
          LOG.warn("Failed to write legacy OIV image: ", e);
        }
      }
      return loadImage;
    }
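The loadImage flag is assembled in two places in doCheckpoint(), which makes it easy to lose track of. The sketch below folds the same boolean logic into one pure function. It assumes that downloadCheckpointFiles() returns true when a new image was fetched; `ReloadDecisionSketch` and its parameter names are hypothetical.

```java
// Pure-function restatement of how doCheckpoint() decides whether to reload the image.
final class ReloadDecisionSketch {

    static boolean shouldLoadImage(boolean isFreshCheckpointer,
                                   boolean isSameCluster,
                                   boolean storageVersionMatches,
                                   boolean downloadedNewImage,
                                   boolean hadMergeError) {
        // Fresh 2NN, or same cluster with storage that needs an upgrade.
        boolean loadImage =
            isFreshCheckpointer || (isSameCluster && !storageVersionMatches);
        // A newly downloaded image or a previous merge failure also forces a reload.
        loadImage |= downloadedNewImage | hadMergeError;
        return loadImage;
    }

    public static void main(String[] args) {
        // Steady state: nothing new, no previous error, so only edits are replayed.
        System.out.println(shouldLoadImage(false, true, true, false, false));
        // Previous merge failed: the in-memory state cannot be trusted.
        System.out.println(shouldLoadImage(false, true, true, false, true));
    }
}
```

In the steady state the Secondary NameNode keeps its in-memory namespace between checkpoints and only replays the new edits, which is cheaper than reloading the full fsimage each time.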