
Release notes for the latest Hadoop version (0.18.1):

http://hadoop.apache.org/core/docs/r0.18.1/releasenotes.html

Here are the changes I found most notable among the differences from the 0.17.2 stable release.


dfs
Improved management of replicas of the name space image. If all replicas on the Name Node are lost, the latest checkpoint can be loaded from the secondary Name Node. Use the parameter "-importCheckpoint" and specify the location with "fs.checkpoint.dir". The directory structure on the secondary Name Node has changed to match the primary Name Node.

dfs
Changed fsck to ignore files opened for writing. Introduced new option "-openforwrite" to explicitly show open files.

dfs
Added a sync() method to FSDataOutputStream to really, really persist data in HDFS. A new InterDatanodeProtocol was added to implement this feature.
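
As a quick illustration, here is a minimal Java sketch of how the new call might be used; the file path and the default cluster configuration are my own assumptions, not part of the release note.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            FSDataOutputStream out = fs.create(new Path("/tmp/sync-demo.log"));
            out.writeBytes("record 1\n");
            // Flush what has been written so far to the DataNodes,
            // instead of waiting for close().
            out.sync();
            out.writeBytes("record 2\n");
            out.close();
        }
    }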

dfs
Introduced directory quotas as hard limits on the number of names in the tree rooted at that directory. An administrator may set quotas on individual directories explicitly. Newly created directories have no associated quota. File/directory creation fails if it would exceed the quota. An attempt to set a quota fails if the directory would already be in violation of the new quota.

dfs
Changed the default port for "hdfs:" URIs to be 8020, so that one may simply use URIs of the form "hdfs://example.com/dir/file".

mapred
Added support for .tar, .tgz and .tar.gz files in DistributedCache. File sizes are limited to 2GB.
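For example, an archive can be registered through DistributedCache roughly like this; the archive path and the rest of the job setup are hypothetical. The archive is fetched and unpacked on each task node.

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheArchiveExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CacheArchiveExample.class);
            // A .tar.gz already on HDFS (hypothetical path, must stay under 2GB);
            // it is unpacked into the task's local working area before the tasks run.
            DistributedCache.addCacheArchive(new URI("/user/hadoop/dict.tar.gz"), conf);
            // ... set mapper/reducer and input/output paths, then submit the job.
        }
    }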

mapred
Changed "job -kill" to only allow a job that is in the RUNNING or PREP state to be killed.

mapred
Added logging for input splits in job tracker log and job history log. Added web UI for viewing input splits in the job UI and history UI.

mapred
Added org.apache.hadoop.mapred.lib.NLineInputFormat, which splits N lines of input as one split. N can be specified by configuration property "mapred.line.input.format.linespermap", which defaults to 1.
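A small sketch of how this might be wired up; the value 10 and the rest of the job setup are placeholders.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class NLineExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(NLineExample.class);
            // Each map task receives 10 lines of input instead of the default 1.
            conf.setInputFormat(NLineInputFormat.class);
            conf.setInt("mapred.line.input.format.linespermap", 10);
            // ... set mapper/reducer and input/output paths, then run with JobClient.
        }
    }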

mapred
Changed policy for running combiner. The combiner may be run multiple times as the map's output is sorted and merged. Additionally, it may be run on the reduce side as data is merged. The old semantics are available in Hadoop 0.18 if the user calls: job.setCombineOnlyOnce(true);

mapred
Created SequenceFileAsBinaryOutputFormat to write raw bytes as keys and values to a SequenceFile.
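A rough sketch of the job configuration side, assuming the reducer emits BytesWritable pairs; the class name and setup are placeholders.

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileAsBinaryOutputFormat;

    public class BinaryOutputExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(BinaryOutputExample.class);
            // The reducer emits BytesWritable key/value pairs; this output format
            // writes the raw bytes into a SequenceFile without extra serialization.
            conf.setOutputFormat(SequenceFileAsBinaryOutputFormat.class);
            conf.setOutputKeyClass(BytesWritable.class);
            conf.setOutputValueClass(BytesWritable.class);
        }
    }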

scripts
Added command line tool "job -counter <job-id> <group-name> <counter-name>" to access counters.

streaming
Introduced a way for a streaming process to update global counters and status by emitting information on its stderr stream. Use "reporter:counter:<group>,<counter>,<amount>" to update a counter. Use "reporter:status:<message>" to update the status.
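
Any streaming executable can do this simply by printing those lines to stderr. Here is a tiny Java mapper as an illustration (streaming programs are more commonly shell or Python scripts; the group, counter, and status text are made up):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class StreamingCounterMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            long n = 0;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // pass the record through unchanged
                n++;
                // Bump a global counter; the framework parses this line from stderr.
                System.err.println("reporter:counter:MyGroup,LinesSeen,1");
            }
            System.err.println("reporter:status:processed " + n + " lines");
        }
    }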

streaming
Increased the size of the buffer used in the communication between the Java task and the Streaming process to 128KB.

streaming, documentation
Set default value for configuration property "stream.non.zero.exit.status.is.failure" to be "true".

util
Introduced an FTPFileSystem backed by Apache Commons FTPClient to directly store data into HDFS.
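Something like the following should be possible, assuming the ftp:// scheme is mapped to the new FTPFileSystem in the configuration; the host, credentials, and paths are placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class FtpToHdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Open the remote FTP server as a Hadoop FileSystem.
            FileSystem ftp = FileSystem.get(URI.create("ftp://user:password@ftp.example.com/"), conf);
            FileSystem hdfs = FileSystem.get(conf);
            // Copy a remote file straight into HDFS (false = keep the source file).
            FileUtil.copy(ftp, new Path("/pub/data.csv"),
                          hdfs, new Path("/user/hadoop/data.csv"),
                          false, conf);
        }
    }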

To summarize the ones that look most useful:

1. Recovery from the Secondary NameNode's checkpoint when the NameNode's image replicas are all lost

2. A sync() method added to the FSDataOutputStream class

3. Per-directory quotas

4. Raw bytes as keys and values

5. Global counters and status messages from streaming processes

That's about it. sync and quotas look like they will come in handy, and 0.19.0 is said to add an append feature as well, so that is something to look forward to.



This article was written in springnote.