TFS: Tianwang File System
- Performance Gain with Variable Chunk Size in GFS-like File Systems

Authors: Zhifeng Yang, Qichen Tu, Kai Fan, Lei Zhu, Rishan Chen, Bo Peng
Speaker: Hongfei Yan
School of EECS, Peking University
4/13/2008

Outline
• Introduction (what's it all about)
• Tianwang File System
• Experiments
• Conclusion

Distributed File Systems
• Support access to files on remote servers
• Must support concurrency
  – Make varying guarantees about locking, who "wins" with concurrent writes, etc.
  – Must gracefully handle dropped connections
• Can offer support for replication and local caching
• Different implementations sit in different places on the complexity/feature scale

Motivations (1/3)
[Timeline figure, 1996-2007: milestones Tianwang 1.0, Bingle 1.0, Tianwang 2.0, Web InfoMall 1.0, CDAL 1.0, Web InfoMall 2.0, Web Digest, HisTrace, etc.; key ideas: web pages and FTP files grow exponentially, preserve vanishing pages, preserve web resources, knowledge discovery]

Motivations (2/3)
• Data
  – Web pages
    • 3 billion pages, 30 TB compressed
    • URL list, IP list, link graph, anchor text, etc.
  – Search engine log
    • about 40 GB
  – Test collections
    • CWT100G, CWT200G, CCT2006, CWT70th, CDAL16th

Motivations (3/3)
• Software
  – Large-scale web crawler
  – Web page deduplication
  – Web page classifier
  – Index and search
  – TB-level data management
  – Retrieval performance evaluation
  – LinkAnalysis, ShallowNLP, Information Extraction…
• Hardware
  – 80 machines (PCs, Dell 2850, etc.)


Issues
• Data accessibility
  – Distributed among machines
  – Data is not open and shared easily
• Difficult to construct, deploy and run web data analysis programs
  – Communication failure, error detection
• Machine usability
  – Disk failure is a disaster, but common
  – Inefficiency

Some real-world data points
Sources:
• "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?", Bianca Schroeder and Garth A. Gibson (FAST '07)
• "Failure Trends in a Large Disk Drive Population", Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso (FAST '07)

Google File System
• Solution: divide files into large 64 MB chunks, and distribute/replicate the chunks across many servers
• A couple of important details:
  – The master maintains only a (file name, chunk server) table in main memory => minimal I/O (see the sketch below)
  – Files are replicated using a primary-backup scheme; the master is kept out of the loop

Google File System
• Atomic record append
• Concurrent write/append
• Secondary master
• Chunk replication
• Re-replication
• Re-balancing
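To make the metadata bullet above concrete, here is a minimal sketch of the kind of in-memory table a GFS-style master keeps; all class and variable names are illustrative assumptions, not the actual GFS implementation.

```python
# Hypothetical sketch of a GFS-style master metadata table (illustration only,
# not the real GFS code). The master holds only small metadata in RAM:
# file name -> ordered list of chunk handles, and chunk handle -> replica locations.

class MasterMetadata:
    def __init__(self):
        self.file_chunks = {}      # file name -> [chunk handle, ...]
        self.chunk_locations = {}  # chunk handle -> [chunkserver address, ...]

    def register_chunk(self, filename, handle, servers):
        """Record a new chunk of `filename` replicated on `servers`."""
        self.file_chunks.setdefault(filename, []).append(handle)
        self.chunk_locations[handle] = list(servers)

    def lookup(self, filename, chunk_index):
        """Client asks: which servers hold chunk number `chunk_index` of `filename`?"""
        handle = self.file_chunks[filename][chunk_index]
        return handle, self.chunk_locations[handle]

# Example: the client computes chunk_index = offset // 64 MiB (fixed chunk size),
# then contacts one of the returned chunkservers directly for the data.
meta = MasterMetadata()
meta.register_chunk("/crawl/pages-0001", handle="c1f3", servers=["cs-a", "cs-b", "cs-c"])
print(meta.lookup("/crawl/pages-0001", 0))   # ('c1f3', ['cs-a', 'cs-b', 'cs-c'])
```

Because the table is tiny relative to the data it describes, metadata lookups stay in memory and the bulk data path bypasses the master entirely.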

Hadoop Project
• A module in the Lucene/Nutch project
• DFS + MapReduce
• Create-once/read-many
• Does not support concurrent atomic record appends
• Google & IBM cloud computing initiative for universities (Oct 8, 2007)

Kosmos File System
• Does not support atomic record appends
• Supports concurrent writes
• Re-replication
• Re-balancing
• Integrated with Hadoop
• POSIX file interface
• FUSE (Filesystem in Userspace) binding
• Master is a single point of failure


Outline
• Introduction (what's it all about)
• Tianwang File System
• Experiments
• Conclusion

Web Infrastructure/Cloud Computing
• Storage
– Fault-tolerant
• It can recover from component failures

– Scalable
• It can operate correctly even as some aspect of the system is scaled to a larger size.

– Transparent
• the ability of a distributed system to act like a non-distributed system.

• Computing
  – Easy/efficient parallel computing
  – Data processing model
    • Mostly sequential access
    • MapReduce (see the word-count sketch below)
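Since the slide names MapReduce as the data processing model, here is a tiny, framework-free word-count sketch of the map/shuffle/reduce idea; it only illustrates the programming model and is not Hadoop's or Google's API.

```python
# Minimal map/shuffle/reduce word count in plain Python (illustration of the
# programming model only; real MapReduce runs map and reduce tasks on a cluster).
from collections import defaultdict

def map_phase(document):
    # map: emit (word, 1) for every word in the input record
    for word in document.split():
        yield word, 1

def reduce_phase(word, counts):
    # reduce: sum all counts emitted for the same word
    return word, sum(counts)

def mapreduce(documents):
    shuffled = defaultdict(list)           # shuffle: group emitted values by key
    for doc in documents:
        for key, value in map_phase(doc):
            shuffled[key].append(value)
    return dict(reduce_phase(k, v) for k, v in shuffled.items())

print(mapreduce(["tianwang file system", "google file system"]))
# {'tianwang': 1, 'file': 2, 'system': 2, 'google': 1}
```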

Assumptions
• Component failures are the norm
– Inexpensive commodity components

TFS Design Decisions
• Files consist of chunks
• Chunk replicas
  – 3 replicas
  – Chunks are regular files on the local file system
• Files are huge
  – Multi-GB
• Appending rather than overwriting
  – Once written, only read, often sequentially
  – Concurrent multi-append
• One "master" to manage metadata
  – Heartbeat
  – Operation log (see the write-ahead sketch below)
• Co-design applications and the system
• High sustained bandwidth is more important than low latency

Note: there are big differences between TFS and GFS due to the different chunk size.
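The "Operation log" bullet refers to the standard write-ahead pattern for master metadata. The following is a minimal sketch of that idea under assumed names (it is not TFS code): every metadata mutation is flushed to the log before it is applied in memory, so the master can rebuild its state by replaying the log after a crash.

```python
# Write-ahead operation log for master metadata (illustrative sketch with
# assumed names; not the actual TFS implementation).
import json, os

class MasterWithOpLog:
    def __init__(self, log_path):
        self.log_path = log_path
        self.metadata = {}                       # file name -> list of chunk IDs

    def _apply(self, op):
        if op["type"] == "add_chunk":
            self.metadata.setdefault(op["file"], []).append(op["chunk"])

    def mutate(self, op):
        with open(self.log_path, "a") as log:
            log.write(json.dumps(op) + "\n")     # 1. append the mutation to the log
            log.flush()
            os.fsync(log.fileno())               # 2. force it to stable storage
        self._apply(op)                          # 3. only then change in-memory state

    def recover(self):
        self.metadata = {}
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:                     # replay the log after a restart
                self._apply(json.loads(line))
```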

Chunk Size
• Fixed size in GFS
• Variable size in TFS
  – Flexibility
  – A property of the chunk
  – Offset -> Chunk ID (see the lookup sketch below)
  – Small chunk
[Figure: relationship between the application and chunk size, involving read, overwrite, and append operations, padding, and duplicates]
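To make the "Offset -> Chunk ID" point concrete: with GFS's fixed 64 MB chunks the chunk index is pure arithmetic, while with TFS's variable-sized chunks the client needs a chunk boundary table to map an offset to a chunk. A minimal sketch under assumed names (not the actual TFS client protocol):

```python
# Mapping a file offset to a chunk: fixed chunk size (GFS-style) vs. variable
# chunk size (TFS-style). Names and structures are illustrative assumptions.
import bisect

FIXED_CHUNK_SIZE = 64 * 1024 * 1024            # GFS uses 64 MB chunks

def chunk_index_fixed(offset):
    # With a fixed size, the chunk index is a pure computation: no table needed.
    return offset // FIXED_CHUNK_SIZE

def chunk_index_variable(offset, chunk_sizes):
    # With variable-sized chunks, the client needs the per-chunk sizes (from the
    # master) and finds the chunk whose byte range contains the offset.
    boundaries, total = [], 0                   # cumulative end offset of each chunk
    for size in chunk_sizes:
        total += size
        boundaries.append(total)
    i = bisect.bisect_right(boundaries, offset)
    if i >= len(boundaries):
        raise ValueError("offset beyond end of file")
    start = boundaries[i - 1] if i > 0 else 0
    return i, offset - start                    # (chunk index, offset within chunk)

print(chunk_index_fixed(130 * 1024 * 1024))     # 2
print(chunk_index_variable(130, [100, 20, 50])) # (2, 10)
```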

Read Operation
• GFS
  – Cache chunk info
  – Communicate with the master when the cached info is missing or stale
• TFS
  – Get chunk information once, at open (see the snapshot sketch below)
  – New data appended after open is invisible

Mutation Operation in GFS

Append Operation in TFS
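A toy illustration of the open-time snapshot behaviour described in the TFS bullets above; the FakeMaster class and method names are invented for the example and are not the real TFS client API.

```python
# Illustration of TFS-style read semantics (assumed names, not the real client):
# chunk information is fetched once at open(), so data appended after the open
# is invisible to this reader.

class FakeMaster:
    """Stand-in for the TFS master: hands out the current chunk list of a file."""
    def __init__(self):
        self.files = {"/logs/crawl": [("c0", 100), ("c1", 40)]}   # (chunk id, size)

    def get_chunk_list(self, name):
        return list(self.files[name])

class TFSReadHandle:
    def __init__(self, master, name):
        self.chunks = master.get_chunk_list(name)         # one master round trip at open
        self.eof = sum(size for _, size in self.chunks)   # file length as of open()

    def visible_length(self):
        return self.eof

master = FakeMaster()
reader = TFSReadHandle(master, "/logs/crawl")
master.files["/logs/crawl"].append(("c2", 60))            # someone appends a new chunk
print(reader.visible_length())                            # still 140: the append is invisible
```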

Record Append Operation
• GFS
  – At least once
  – Padding & fragments
    • App checksums
• TFS
  – At most once
  – Small chunk
    • Delay write, flush
  – Duplicates
    • App record ID (see the framing sketch below)
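The "App checksums" and "App record ID" bullets refer to application-level framing that readers of an at-least-once append stream typically use. The record layout below is a made-up example of that idea, not a format defined by GFS or TFS.

```python
# Application-level record framing for an at-least-once append stream (made-up
# format for illustration; not a GFS or TFS on-disk format). Each record carries
# a record ID and a checksum so a reader can drop padding, torn fragments, and
# duplicated records.
import struct, zlib

MAGIC = 0x7FEED5                          # marks a real record; zero padding never matches
HEADER = struct.Struct("<IQII")           # magic, record id, payload length, CRC32

def encode_record(record_id, payload: bytes) -> bytes:
    return HEADER.pack(MAGIC, record_id, len(payload), zlib.crc32(payload)) + payload

def decode_records(blob: bytes):
    seen, pos = set(), 0
    while pos + HEADER.size <= len(blob):
        magic, rid, length, crc = HEADER.unpack_from(blob, pos)
        if magic != MAGIC:
            pos += 1                      # padding or garbage: resynchronise byte by byte
            continue
        payload = blob[pos + HEADER.size : pos + HEADER.size + length]
        pos += HEADER.size + length
        if len(payload) < length or zlib.crc32(payload) != crc:
            continue                      # a torn fragment: skip it
        if rid in seen:
            continue                      # duplicate from an append retry: skip it
        seen.add(rid)
        yield rid, payload

# A stream with zero padding and a record appended twice by a retried client:
stream = (encode_record(1, b"page-a") + b"\x00" * 8
          + encode_record(2, b"page-b") + encode_record(2, b"page-b"))
print([rid for rid, _ in decode_records(stream)])   # [1, 2]
```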

Implications for Applications
• GFS
  – Appending rather than overwriting
  – Read sequentially
  – Checksums
  – Record IDs
• TFS
  – Appending rather than overwriting
  – Read sequentially
  – Sequence of records

Outline
• Introduction (what's it all about)
• Tianwang File System
• Experiments
• Conclusion


Experimental Deployment in Tianwang
• 10 nodes in a cluster
  – One master, nine chunkservers
• Each node with
  – two 2.8 GHz processors
  – 2 GB RAM
  – 100 GB+ SCSI disk space

Master Operation in TFS

Read Buffer Size in TFS

Aggregate Read Rate in TFS

Aggregate Append Rate in TFS
Performance: GFS vs. TFS
• GFS
  – 200 to 500 operations per second
  – Aggregate read rate: 75% of the network limit
  – Aggregate append rate: 50% of the network limit, limited by the network bandwidth of the chunkserver that stores the last chunk of the file
• TFS
  – 3400 operations per second
  – Aggregate read rate: about 72% of the network limit
  – Aggregate append rate: 75% of the network limit; can easily exceed 380 MB/s with multiple client machines, limited by the aggregate bandwidth between clients and chunkservers


TFS Shell

Sample Application

Source Lines of Code

Conclusion
• TFS demonstrates how to support large-scale processing workloads on commodity hardware
  – designed to tolerate frequent component failures
  – optimized for huge files that are mostly appended to and then read
• The key design choice is that the chunk size is variable and the record append operation works at the chunk level, which differs from GFS
  – Significantly improves record append performance, by 25%

References
• TFS Project, http://tianwang.grids.cn/projects/tplatform, 2008.
• [Ghemawat, et al., 2003] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," SIGOPS Oper. Syst. Rev., vol. 37, pp. 29-43, 2003.
• [Dean and Ghemawat, 2004] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," presented at OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004.
• [Chang, et al., 2006] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," presented at OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 2006.
• Hadoop Project, http://lucene.apache.org/hadoop/, 2007.


CS402 Mass Data Processing/Cloud Computing (Summer 2008, in preparation)
• http://net.pku.edu.cn/~course/cs402/
• Course description
  – Full-text indexing of web pages, duplicate (mirrored) page elimination, spam filtering, weather simulation, galaxy simulation, sorting hundreds of millions of strings... Would you like to learn how to do these things on a large distributed network while writing only a small amount of problem-specific code?
  – These applications can be built with MapReduce distributed computing, which is already in wide use at Google. In this five-week course you will learn: 1) background on distributed systems; 2) MapReduce theory and practice, including how MapReduce fits distributed computing, which applications it suits and which it does not, and practical tips and tricks; 3) hands-on distributed programming experience through several programming exercises and a course project.
  – Exercises and the project will use Hadoop (an open-source implementation of MapReduce). The cluster is provided by the network lab; students should bring their own Wi-Fi-capable laptops to connect to the cluster. We will try to arrange classrooms with wireless access and to secure hands-on lab time for everyone.

Dynamo: Amazon's Highly Available Key-Value Store
• Dynamo originates in the operating systems and distributed systems research of the past years: DHTs, consistent hashing (sketched below), versioning, vector clocks, quorum, anti-entropy based recovery, etc. As far as I know, Dynamo is the first production system to use the synthesis of all these techniques, and there are quite a few lessons learned from doing so.
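Of the techniques listed above, consistent hashing is the one that determines data placement. Here is a compact, generic sketch of a consistent-hash ring; it illustrates the general technique, not Dynamo's actual implementation, which also adds virtual nodes and preference lists.

```python
# Generic consistent-hash ring (illustration of the technique only).
import bisect, hashlib

def _h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((_h(n), n) for n in nodes)   # nodes placed on the ring
        self.keys = [h for h, _ in self.ring]

    def node_for(self, key: str) -> str:
        # walk clockwise from the key's position to the first node on the ring
        i = bisect.bisect(self.keys, _h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
# Adding or removing one node only remaps the keys in that node's arc of the
# ring, which is why membership changes stay cheap.
```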

Invocation semantics

Fault tolerance measures                                                               | Invocation semantics
Retransmit request message | Duplicate filtering | Re-execute procedure or retransmit reply |
No                         | Not applicable      | Not applicable                            | Maybe
Yes                        | No                  | Re-execute procedure                      | At-least-once
Yes                        | Yes                 | Retransmit reply                          | At-most-once

Sun RPC provides at-least-once call semantics.
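To connect the last row of the table to code: at-most-once semantics come from filtering duplicate requests and retransmitting the saved reply instead of re-executing. A minimal, generic server-side sketch with assumed names (not Sun RPC, which, as noted, only gives at-least-once):

```python
# At-most-once request handling: duplicate requests are filtered and answered
# from a reply cache instead of being re-executed (generic sketch, assumed names).

class AtMostOnceServer:
    def __init__(self, handler):
        self.handler = handler
        self.reply_cache = {}                  # (client id, request id) -> saved reply

    def handle(self, client_id, request_id, payload):
        key = (client_id, request_id)
        if key in self.reply_cache:
            return self.reply_cache[key]       # retransmit the reply, do NOT re-execute
        reply = self.handler(payload)          # execute the procedure exactly once
        self.reply_cache[key] = reply
        return reply

server = AtMostOnceServer(handler=lambda x: x * 2)
print(server.handle("client-1", 7, 21))        # 42 (executed)
print(server.handle("client-1", 7, 21))        # 42 (replayed from the cache, not re-run)
```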



				