November 22, 2008, Saturday, 326

Starfish Frequently Asked Questions

From DBWiki


General

Why was Starfish created?

Starfish was created because we needed a robust, highly-available, fault-tolerant distributed file system that would store anywhere from several gigabytes to thousands of terabytes worth of data. We needed technology that would allow us to build file storage clusters quickly from commodity components. More to the point, the solutions that we tried were very expensive and didn't solve our problem.

Our company, Digital Bazaar, stores multiple terabytes worth of digital music, movies, television and other forms of digital content for our online peer-to-peer content store. Starfish came out of a need that we had - it is something that we think is of use to others that have data storage issues of their own.

When are you going to add feature X?

As far as features go, we usually make the decision based on the following criteria (in order):

  1. Reliability and Fault-tolerance
  2. Scalability
  3. Ease of Use
  4. Cost

The more of those that are positively affected by a single feature, the more likely the feature is to go into the system.

Are there any patents regarding Starfish?

There are patents pending for the Starfish file system. They revolve around the peer-to-peer based nature of the file system, discovery, messaging, locking, order of filesystem operations, and automated cluster-based voting on file system operations and resources.

A patent license is granted in the standard Starfish Filesystem License.

Speed

How fast is Starfish at doing read and write operations?

Here are a couple of benchmarks that show how Starfish does compared to NFS, Samba, and Lustre when running on commodity hardware.

The tests were performed using the following configurations for the clients and the storage targets: AMD Athlon XP 2600+ with 1GB RAM and a 100Mbit network card running Linux 2.6.19 and Debian Etch. Default mount options were used for all file systems.

There were two commands that were used to write and read a single 105MB file from each file system. 1MB writes were performed by doing the following:

dd if=/dev/zero of=<path_to_fs>/file.dat bs=1048576 count=100

1MB reads were performed by doing the following:

dd if=<path_to_fs>/file.dat of=file.dat bs=1048576 count=100

The results are shown below:

[Figures: single-client read throughput, single-client write throughput, multi-client read throughput, multi-client write throughput.]

The benchmarks above are very preliminary and are mentioned as examples of how the systems performed on our test network. There are far too many factors at play in a real environment to determine the performance of a particular network file system over all types of workloads and network topologies.

In general, Starfish does pretty well for such a new file system with hardly any optimizations. However, you should test it against your application workload to see if it is the correct solution for you. In any case, we would love to hear about how your installation is doing against these benchmarks.

If your benchmarks are better, we'll post them on this wiki. If they're worse, we'll do our best to help you get them at least as fast as the benchmarks listed above.

How fast is Starfish at doing metadata operations?

It depends.

If you have a single storage client serially creating 1,000 small files per second, Starfish is going to be slow. However, just about any other network-based file system would be slow in this case. A database is a much better solution for this type of problem.

If you have 10 storage clients, each creating 100 small files per second, Starfish would be up to the task. It currently takes 515 milliseconds to create a single file and open it for reading. Creating 100 small files, each on its own thread, would take a single storage client 515 milliseconds. If 10 storage clients work in parallel performing the same task, the time it takes would still be 515 milliseconds. In this scenario, 1,000 files are created in 0.515 seconds, so nearly 2,000 files could be created per second using Starfish.
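
As a rough check of that arithmetic, here is a minimal shell sketch. It assumes, as the paragraph above does, that every create runs fully in parallel and costs the measured 515 milliseconds. The helper name files_per_second is hypothetical and not part of Starfish.

```shell
# Back-of-the-envelope estimate only: assumes perfectly parallel creates,
# each taking the same latency. files_per_second is a hypothetical helper.
files_per_second() {
    clients=$1; files_per_client=$2; latency_ms=$3
    awk -v c="$clients" -v f="$files_per_client" -v ms="$latency_ms" \
        'BEGIN { printf "%.0f\n", (c * f) / (ms / 1000) }'
}

files_per_second 10 100 515   # prints 1942
```

Real clusters will come in below this figure, since it ignores network contention and any serialization between peers.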

Starfish's metadata operations scale linearly, so you don't have to worry about metadata operations becoming a bottleneck as your storage network grows.

What are the latencies associated with common data operations in Starfish?

This is a very complicated question, as many things affect the operation of a Starfish cluster. The biggest factors are network bandwidth and latency, processor speed and disk speed.

The scenario below is from a set of 10 Starfish storage peers running on the same computer with a single Starfish client mounting the file system locally. The computer had an Intel Pentium 4 CPU clocked at 3.20GHz, 1GB of RAM, and one SATA hard drive. This setup was used to remove all network latency and provide a baseline for other tests. CPU usage was around 50% for the entire test. The tests were done using Starfish v0.8 - a very early release, without very much optimization.

The test process consisted of the following operations (in order):

  1. opening a file with the O_CREAT bit set
  2. writing 1MB of data to the file
  3. closing the file
  4. opening the file in read-only mode
  5. reading 1MB of data from the same file (not cached)
  6. closing the file

The above process was repeated 100 times. The application-level latencies are listed below. For each operation, the table gives the average time per call, the total number of calls, and the total time taken by all calls of that type:

Operation                                  Average Time   Total Calls   Total Time
Opening a file with the O_CREAT bit set    515ms          100           51535ms
Writing 1MB of data to the file            36ms           100           3693ms
Closing the file                           0.21ms         100           21ms
Opening the file in read-only mode         134ms          100           13489ms
Reading 1MB of data from the file          101ms          100           10180ms
Closing the file                           1ms            100           169ms
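
The six-step cycle described above can be approximated from the shell with dd. This is only a sketch, not the harness that produced the numbers in the table: the function name run_cycles and its arguments are hypothetical, it reports no per-call latencies, and it makes no attempt to defeat caching on the read.

```shell
# Rough approximation of the six-step test cycle; not the original harness.
# run_cycles <directory on the file system under test> <number of cycles>
run_cycles() {
    dir=$1; count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # Steps 1-3: dd opens the output with O_CREAT, writes 1MB, closes it.
        dd if=/dev/zero of="$dir/cycle.dat" bs=1048576 count=1 2>/dev/null || return 1
        # Steps 4-6: dd opens the file read-only, reads 1MB, closes it.
        dd if="$dir/cycle.dat" of=/dev/null bs=1048576 count=1 2>/dev/null || return 1
        i=$((i + 1))
    done
}

# Example (hypothetical mount point):
#   time run_cycles /mnt/starfish 100
```

Dividing the elapsed time by the cycle count gives an average per cycle that can be compared against the sum of the averages in the table.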

The latencies at the Starfish filesystem layer were:

Operation                   Average Time   Total Calls   Total Time
getattr                     113ms          811           92238ms
open                        208ms          201           41952ms
read                        20ms           800           16326ms
mknod                       140ms          101           14166ms
release                     69ms           200           13918ms
write_flush (from cache)    65ms           100           6545ms
write (cached)              0.04ms         25600         1062ms
readdir                     144ms          2             288ms
statfs                      161ms          1             161ms
getattr (cached)            0.16ms         4217          67ms
utime                       44ms           1             44ms
statfs (cached)             0.0085ms       1405          12ms
discover_peers              1ms            1             1ms

The data shows that metadata operations are slower than read/write operations. This is to be expected in a completely decentralized file system.

What are Starfish's maximum local read and write speeds?

Read and write speeds vary depending on processing power, network bandwidth, network latency and disk speeds. For the most part, Starfish can provide more data to and from a single SATA hard disk than the disk can support. In short, Starfish read/write speeds tend to be limited by network and disk interface speeds.

Typical sustained write speeds from a Pentium 4 at 3.2GHz using 32KB write chunks to a storage server running on the same machine are around 63.8 megabits per second, assuming unlimited network and disk throughput. 63.8 megabits per second is achievable if you run a 100 megabit network with SATA disks (January 2007).

Typical read speeds from a Pentium 4 at 3.2GHz using 132KB read chunks from a storage server running on the same machine are around 91.4 megabits per second, assuming unlimited network and disk throughput. 91.4 megabits per second is achievable if you run a gigabit network with SATA disks in a RAID-0 configuration.

Read and write speeds are largely dependent on the speed of your backbone network and the switches that you use. Also note that Starfish peer-to-peer read and write speeds scale linearly if your files are distributed perfectly, so your total available read and write throughput is the per-server figure multiplied by the number of servers. If you have 10 servers, your total theoretical write throughput is around 638 megabits per second and your total theoretical read throughput is around 914 megabits per second.
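
The scaling estimate above is a straight multiplication, which the following sketch reproduces. It assumes perfectly distributed files and no shared network bottleneck; aggregate_mbps is a hypothetical helper name.

```shell
# Theoretical aggregate throughput: servers x per-server rate (megabits/s).
# Assumes perfect file distribution; aggregate_mbps is a hypothetical helper.
aggregate_mbps() {
    peers=$1; per_peer_mbps=$2
    awk -v n="$peers" -v r="$per_peer_mbps" 'BEGIN { printf "%.0f\n", n * r }'
}

aggregate_mbps 10 63.8   # write: prints 638
aggregate_mbps 10 91.4   # read:  prints 914
```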

Scalability

Is Starfish truly scalable?

Yes, as long as your network and disk hardware can handle the traffic. Be warned, however, that commodity hardware has problems keeping up once you start involving more than a thousand peers and hundreds of gigabits per second of data.

What are the choke-points for the file system?

The networking bandwidth available to the storage peers in the file system is the first thing to limit the file system. Disk speeds are a very distant second, followed by processing power.

How do I back up the data in a Starfish file system?

The beauty of Starfish is that you don't have to back up your data. The file system backs up data automatically without needing any interaction from system or network administrators. The number of backups can be set on a per-file basis, which allows you to tune your file system backups to meet your needs.

If you need to archive the file system, any POSIX-compliant backup utility will work. This includes tar, cpio, zip, amanda, afbackup, bacula and a long list of other well-known backup utilities.
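
For instance, a mounted Starfish file system could be archived with tar along these lines. This is a minimal sketch: the function name archive_fs and the paths in the usage comment are hypothetical, and the archive should be written somewhere outside the directory being archived.

```shell
# Archive everything under a mount point into a gzip-compressed tarball.
# archive_fs <mount point> <archive path outside the mount point>
archive_fs() {
    mount_point=$1; archive=$2
    # -C changes into the mount point so paths in the archive are relative.
    tar -czf "$archive" -C "$mount_point" .
}

# Example (hypothetical paths):
#   archive_fs /mnt/starfish /backups/starfish.tar.gz
```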

Cost

How much does the hardware for a fully redundant 1 Terabyte Starfish cluster cost?

For a 2-way redundant, RAID-1 protected, 1.0 Terabyte cluster: $2,000 (Jan 2007 prices).

Per server, that breaks down into around $400 for an AMD 2.6GHz CPU, 1GB of memory, and a motherboard with integrated 100 megabit LAN connection, SATA support, 350 watt power supply and a commodity server enclosure. Four SATA 500GB hard drives will run you around $600.

The cluster would ensure proper file system operation even after the catastrophic failure of a single machine. Hard drive failure rates could even approach 50% without affecting the Starfish file system.

For a 5-way redundant, RAID-1 protected, 1.0 Terabyte cluster: $5,000 (Jan 2007 prices).

For a 2-way redundant, RAID-1 protected, 5.0 Terabyte cluster: $10,000 (Jan 2007 prices).
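
The three price points above are consistent with a simple model, sketched below. The model is our reading of the breakdown rather than something the figures state outright: each server costs about $1,000 ($400 base system plus $600 for four 500GB drives) and contributes 1 Terabyte of usable space after RAID-1, and an N-way redundant cluster needs N servers per usable Terabyte. The helper name cluster_cost is hypothetical.

```shell
# Estimated hardware cost in dollars (Jan 2007 prices from the text above).
# Assumption: one server = $400 base + $600 drives = 1 usable TB after RAID-1.
# cluster_cost <usable terabytes (whole number)> <redundancy factor>
cluster_cost() {
    usable_tb=$1; redundancy=$2
    awk -v t="$usable_tb" -v r="$redundancy" 'BEGIN {
        per_server = 400 + 600   # base system + four 500GB SATA drives
        servers = t * r          # N-way redundancy needs N servers per TB
        printf "%d\n", servers * per_server
    }'
}

cluster_cost 1 2   # prints 2000
cluster_cost 1 5   # prints 5000
cluster_cost 5 2   # prints 10000
```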

Usage

I am running an older Linux kernel, will Starfish still run?

The Starfish storage peer will run on almost any kernel from 2.2.x onward.

The Starfish storage client will run on any kernel after 2.6.14.

Is there a Windows Client or Server?

No, but we can make one if there are enough people out there that want it. The storage peer software should run on Windows with very little porting effort; the storage client would be a much more difficult task. For the time being, you can export the Starfish file system from Linux via Samba and mount it just like any other Windows share.

Is there a Mac OS X Client or Server?

At this point, we don't have enough time to generate the packages for OS X, but it should work.

The storage peer software definitely runs on OS X.

The storage client software should run on OS X with a little tweaking, but we haven't tried it yet.

What synchronizing, resynchronizing, and migration operations are there?

You can retrieve the current storage peer for a particular file by running getfattr on that file. For example:

getfattr -n user.primary_peer <filename>

The command above will output the name of the storage peer on which the file resides.

Currently, there are no synchronizing or migration utilities. This is on the development plan for 2007, but as of this moment (January 2007) there are none.

In the future, synchronizing and resynchronizing will be performed automatically by the storage network. There will be utilities to increase and decrease the number of mirrors for a particular file as well.

There will also be a utility to ensure that a storage peer could be safely removed from the storage network without the risk of data loss.

Operation

How do the Starfish peers find each other on the storage network?

The Starfish Management Protocol is an IPv4 and IPv6 aware multicast discovery and communication solution that all Starfish peers use to find other participants in the network. All group communication is performed via multicast messaging, while peer-to-peer communication is done via unicast messaging.

Why does Starfish decentralize its metadata?

There are several options when it comes to storing metadata in a networked file system: a single centralized server (NFS, Samba), a master server with a hot spare (Lustre), and full decentralization (Starfish). The first two introduce a single point of failure or, with a hot spare, a double point of failure.

Starfish uses fully decentralized metadata storage because it is the best solution for an N-way redundant system. It is the only way to ensure that the metadata is just as resilient to failure as the data itself.

How does Starfish ensure file system integrity while having distributed metadata?

Starfish decentralizes its metadata by storing the metadata on the same nodes on which the files reside. Metadata is mirrored when files are mirrored, thus the metadata and the files follow each other around the storage network. While this ensures that there is no central point of failure when it comes to metadata storage, it also introduces several technical challenges.

Some of the technical challenges of decentralized metadata storage are: ensuring unique file creation, metadata modification and finding out which storage peer contains the metadata.

To ensure that files are created uniquely, all storage peers must acknowledge file creations. To ensure that there are no duplicates on the network, all storage peers must also acknowledge certain metadata retrieval operations.

After a file has been created, the metadata for the file resides on the same node as the file itself. Most Starfish operations have two parts to them - resource discovery and resource modification.

For example, a file read has the following phases:

  1. the first phase of the operation finds out which storage peer has the file in question
  2. the second phase of the operation accesses the resource on that peer (reading it in this case, modifying it in the case of a write)

Starfish uses an aggressive caching system to ensure that operations can be performed at reasonable speeds.

What happens when two people write to the same file at the same time?

The same thing that happens when you write to the same file at the same time in any POSIX-based operating system - the file is modified, but not necessarily in the order that you want. To ensure this doesn't happen, it is advisable to create a lock file, or perform a network-wide advisory lock on the file.
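
One common way to implement the lock-file approach from the shell is sketched below. It relies on exclusive file creation being atomic across the cluster, which is an assumption on our part, although it matches the statement above that all storage peers must acknowledge file creations. The function names and paths are hypothetical.

```shell
# Lock-file sketch: acquire succeeds only if the lock file did not exist.
acquire_lock() {
    # noclobber (set -C) makes the redirect fail if the file already exists.
    ( set -C; echo "$$" > "$1" ) 2>/dev/null
}

release_lock() {
    rm -f "$1"
}

# Example (hypothetical paths):
#   if acquire_lock /mnt/starfish/report.lock; then
#       echo "new record" >> /mnt/starfish/report
#       release_lock /mnt/starfish/report.lock
#   fi
```

A lock file left behind by a crashed writer has to be cleaned up by hand, which is the usual trade-off of this approach compared with an advisory lock.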

What happens if a long-distance link between two parts of a single cluster goes down?

Your single storage cluster will split into two independently operating storage clusters. When the long-distance link comes back up, the two storage cluster cells will merge back into one storage cluster.

This, of course, causes all sorts of undesirable side-effects. What happens when a file is created on both clusters with the same file path and name while the long-distance link is down? There will be two storage peers on the network that think they are in charge of that file.

There will be protection measures in the future that will not allow the storage cluster to create new files if it detects that a great number of the storage peers have disappeared. However, reads and writes to pre-existing files that are available could continue.

For the time being, make sure your entire cluster is on a single switch or portion of the network that does not suffer many network outages.

What causes a storage cluster failure?

In general, a Starfish storage cluster will not completely fail until the last storage peer fails. There is always the possibility of single file or metadata loss, but that is very different from losing all of the data in the cluster.

For example, if a file has three mirrors and all three mirrors fail at the same time and are never returned to the cluster, the file and all of its associated metadata will be lost forever. However, the likelihood of this happening is almost always very low.

How does storage peer failure affect available storage space?

The available storage on a Starfish storage network is purely the aggregation of all available storage on each storage peer. As a storage peer fails, the free storage that the peer provided disappears with it.

Starfish does its best to load-balance file storage across all storage peers on the network. This means that the storage peer with the most free storage is picked to store a file. If all storage peers on the network have the same amount of free storage, the file is stored on the storage peer with the most available file descriptors.
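
The selection rule can be illustrated with a short sketch: order peers by free storage, breaking ties by available file descriptors, and take the first. This only illustrates the rule as described, not Starfish's actual code. The input format and the name pick_peer are hypothetical.

```shell
# Pick the peer with the most free storage; ties go to the peer with the
# most available file descriptors. Reads "name free_storage free_fds" lines.
pick_peer() {
    sort -k2,2nr -k3,3nr | head -n 1 | awk '{ print $1 }'
}

printf '%s\n' 'peer-a 500 40' 'peer-b 800 10' 'peer-c 800 25' | pick_peer   # prints peer-c
```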

Consider the following scenario:

You have 10 storage peers, and the total storage network capacity is at 95%. There is only one storage peer that has the remaining 5% of free storage. It is possible to run out of space if that storage peer fails.

What if something terrible happens and only the physical hard drives are rescued?

Starfish uses the underlying file system for storage. Filenames and paths are preserved. In the worst case, all drives can be mounted via another machine and the data read off of them just as you would read any other file. The resulting file structure will match the file structure on the clustered storage network.

In short, all you must do is perform a cp -a <old_storage_directory> <new_storage_directory> operation. Even the metadata can be rescued - it is stored in the SQLite3 database file format, which can be read via a number of open source utilities. There are no proprietary systems employed that prevent you from getting to your data.
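
A slightly more defensive version of that copy is sketched below: it performs the same cp -a and then compares the two trees. The function name rescue_copy is hypothetical, and the destination directory is assumed not to exist yet.

```shell
# Copy a rescued storage directory and verify the result.
# rescue_copy <old_storage_directory> <new_storage_directory (must not exist)>
rescue_copy() {
    old_dir=$1; new_dir=$2
    # -a preserves permissions, ownership, timestamps and symlinks.
    cp -a "$old_dir" "$new_dir" || return 1
    # Confirm that the copied file contents match the source.
    diff -r "$old_dir" "$new_dir"
}

# Example (hypothetical paths):
#   rescue_copy /mnt/rescued_disk/storage /srv/recovered_storage
```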

Strengths and Weaknesses

How does Starfish compare to NFS?

NFS read and write speeds do not scale well when accessed by multiple clients.

In general, Starfish metadata operations and read and write speeds from a single client are not as fast as NFS. NFS has almost 23 years of development behind it and has a highly-tuned in-kernel Linux driver - hats off to the NFS developers on a job well done. The NFS file system was far ahead of its time, and the excellent engineering behind it shows in how well it has stood the test of time. In time, Starfish will catch up to most of the basic NFS metadata operation speeds, but for right now it is slower.

However, as the number of read and write clients increases, NFS becomes far less desirable. The real advantage of Starfish comes in when you have multiple clients accessing the same file system at the same time. At this point, Starfish scales very well, while NFS does not. The reason NFS does not scale is that there is only so much available network throughput to and from an NFS server, usually constrained by the type of network adapter in use on the NFS server.

Starfish is capable of sustaining between 638Mbps and 914Mbps of throughput given enough network bandwidth using 10 storage peers running on commodity-grade hardware. A typical NFS server would not fare well under those conditions.

NFS servers are typically not node-redundant, which means that if your NFS server goes down - all of your NFS clients freeze until the NFS server comes back online.

You should pick NFS if:

  • redundancy is not an issue
  • single points of failure are not an issue
  • 100-200 millisecond latencies are an issue
  • all of your data fits on a single drive or RAID array
  • many simultaneous clients reading and writing are not an issue

You should pick Starfish if:

  • redundancy is an issue
  • single points of failure are an issue
  • 100-200 millisecond latencies are not an issue
  • your storage system needs to be scalable
  • you must support many simultaneous clients reading and writing to the storage network

How does Starfish compare to Samba?

Samba read and write speeds do not scale well when accessed by multiple clients.

Samba is much like NFS - we have a great deal of respect for the Samba guys, they do great work. In general, metadata operations and read and write speeds from a single Starfish client are not as fast as Samba. Samba is a rock-solid software system. However, Samba/NFS vs. Starfish is not a very fair comparison as each network-based file system solves different problems.

Samba has similar strengths and weaknesses to NFS. As the number of read and write clients increases, Samba becomes far less desirable. The real advantage of using Starfish comes in when you have multiple clients accessing the same file system at the same time. At this point, Starfish scales very well, while Samba does not.

Starfish is capable of sustaining between 638Mbps and 914Mbps of throughput given enough network bandwidth using 10 storage peers running on commodity-grade hardware. A typical Samba server would not fare well under those conditions.

Samba servers are typically not node-redundant, which means that if your Samba server goes down - all of your Samba clients will not be able to access files until the Samba server comes back online.

You should pick Samba if:

  • node redundancy is not an issue
  • single points of failure are not an issue
  • 100-200 millisecond latencies are an issue
  • all of your data fits on a single drive or RAID array
  • many simultaneous clients reading and writing are not an issue

You should pick Starfish if:

  • node redundancy is an issue
  • single points of failure are an issue
  • 100-200 millisecond latencies are not an issue
  • your storage system needs to be scalable
  • you must support many simultaneous clients reading and writing to the storage network

How does Starfish compare to Lustre?

Digital Bazaar, the company that created Starfish, used the Lustre Clustered Filesystem for three years before creating Starfish. We could not have been happier with Lustre - ClusterFS, Inc. is a great company run by great people. Their software worked well for us, and it is still superior in many ways. Ultimately, we had to move away from Lustre because of their metadata storage architecture, their tight Linux kernel integration and difficulty in rescuing a storage network that had lost one or two of its metadata servers.

As far as metadata read and write speeds go, Lustre wins by a long shot because it centralizes its metadata storage. Starfish distributes its metadata storage to all nodes for redundancy purposes. Lustre's data read and write speeds are a bit faster than Starfish's, although both systems use the same sort of peer-to-peer communication architecture. Lustre can create huge multi-terabyte files and span them across all storage nodes. Starfish can't do that just yet, though it will in the future.

You should pick Lustre if:

  • Cost is not an issue
  • Metadata node redundancy is not an issue (potential for losing all of your data if both metadata nodes go down or are destroyed)
  • Fast metadata operations are a requirement
  • Scaling to 10,000 storage nodes/clients is a requirement (theoretically possible for Starfish, but untested)
  • Large files (greater than 2TB) are a requirement

You should pick Starfish if:

  • Cost is an issue
  • Node redundancy is a requirement
  • Fast metadata operations are not an issue
  • Your storage system is not going to realistically get larger than 1000 storage nodes

What are Starfish's practical limits?

Starfish's theoretical throughput is approximately the number of storage peers on the network multiplied by the bandwidth available to each storage peer. In reality, the actual number is about 60-80% of the theoretical maximum.
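
As a worked example of that rule of thumb, the sketch below prints the theoretical maximum followed by the 60% and 80% bounds. The 100 megabit per-peer bandwidth in the example call is an assumption chosen to match the commodity hardware described elsewhere in this FAQ, and practical_range_mbps is a hypothetical helper name.

```shell
# Prints "theoretical low high" in megabits per second, where low and high
# are 60% and 80% of peers x per-peer bandwidth.
practical_range_mbps() {
    peers=$1; per_peer_mbps=$2
    awk -v n="$peers" -v b="$per_peer_mbps" 'BEGIN {
        theoretical = n * b
        printf "%.0f %.0f %.0f\n", theoretical, theoretical * 0.6, theoretical * 0.8
    }'
}

practical_range_mbps 10 100   # prints "1000 600 800"
```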

Starfish is meant to be truly scalable - however, commodity hardware starts to become a limiting factor after a while. We expect Starfish to scale to 100 storage nodes with ease and 500 storage nodes with relatively little headache. We have not been able to test the system beyond that - in theory it should work, but we haven't been able to throw that much hardware at it.

If you have a system this large and would like to test Starfish on that system, we'd love to hear about how it does. We expect that beyond 1,000 storage peers the current messaging system could start to incur too much overhead.

Can Starfish split huge files up between storage nodes on the cluster?

It is on the development roadmap, but Starfish cannot do so at the moment. If you have a 200GB file, and the greatest amount of free space on any single storage peer is around 100GB, you will not be able to store the file on the storage network.
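
Put another way, a file fits only if some single peer has at least that much free space. A minimal sketch of that check follows. The name can_store is hypothetical, and sizes are whole numbers in GB.

```shell
# Succeeds if at least one peer has enough free space for the whole file.
# can_store <file size in GB> <free GB on peer 1> [<free GB on peer 2> ...]
can_store() {
    file_gb=$1; shift
    max=0
    for free in "$@"; do
        if [ "$free" -gt "$max" ]; then max=$free; fi
    done
    [ "$file_gb" -le "$max" ]
}

can_store 200 100 80 60 || echo "200GB file does not fit on any single peer"
```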