EUNIS97, Grenoble (France) 9-11 September 1997
Ref: 032401
Center - A distributed computing center of the future?
Introduction
The role of computer centers at universities has undergone a dramatic
reshaping in the past decade. The center is no longer the single
``computer aware'' place at the university; it is increasingly a
coordinating body responsible for computer-related infrastructure.
However, new roles are also emerging, and in this paper we discuss the
potential that may be gained by merging the services of individual
computer centers.
The extremely fast proliferation of personal computers led to a belief
that computers are becoming a tool not too different from the other
ordinary tools used in everyday life.
The information society of tomorrow began to look like a kind
of paradise where everybody uses his or her computer to connect to
sources of information and to ease any work to be done. Computer centers
started to look like an obsolete notion, and many universities considered
reducing or even closing them.
In the Central and Eastern Europe of the nineties, the situation was even
more dramatic due to the very fast changes there.
But, as with any other complex and sophisticated tool, a computer is not
easy to use without a lot of training and experience. The situation
started to change with the emergence of local area networks and their
interconnection with the Internet. While at the beginning it was easy
to join a few computers into a LAN, the interconnection of LANs
called for new expertise and, as such, for some kind of centralized
control over its deployment. More importantly, new services were
sought, and the vital role of computer centers reemerged.
Contemporary Role of Computer Centers
As contemporary computer centers are no longer the sole owners of
computing related technology at the universities, they have to focus
their attention on services which are most efficiently provided from a
center. While individual users usually have their own personal computers
on their desks (computers whose raw computing power, memory and disk
capacity are larger than those of the large computers of the past), these
computers must somehow be connected to the network. Building and
maintaining the infrastructure is thus one of the indispensable new roles.
Another important role is related to reliability and robustness. While
individual users can back up their data, only a tiny fraction actually
do so on a regular basis. It is much easier, more convenient and
cheaper to provide such a service from a central place. It is also
much more reliable, as there is usually more than one device
allocated (or allocatable) for this task. Another point is disk
capacity. A failure of the single disk in a personal computer usually
means that the computer will be out of service for some noticeable time.
Computer centers, on the other hand, usually build their (large) disk
capacities using some kind of RAID, where a single disk failure may
not even be noticed by the end users. In general, all the services
provided by computer centers are (or may be) backed up in some way,
and the necessary redundancy is achieved substantially more cheaply at
this level.
Last but not least, there are the information services used and/or
provided by the university. University management is becoming more
distributed, with the responsibility for decisions delegated to lower
levels of the managerial hierarchy. However, the responsibility for data
correctness calls for some centralized supervision. Information
technology, when properly used, makes it possible to take the best from
both worlds: the data are kept centrally, at the computer center, while
access is provided in a distributed way. A similar situation holds for
information published by the university (e.g., through the web). While the
information may be collected, and even prepared (edited, formatted and the
like), in a distributed fashion across many parts of the university, it may
then be stored on a single server managed by the computer center.
As we have seen, there are still at least three roles where the computer
centers have their irreplaceable responsibilities:
- The infrastructure.
- The reliability.
- Information services.
Computer centers are not, however, independent entities in the networked
world of today. The increased mobility of researchers and students, coupled
with the growing number of people using the services of more than one
computer center, needs to be supported by a kind of
convergence of individual computing centers. It may not be surprising that
it is again the ``power'' users, i.e., users of high performance
computers, always looking for ways to increase the computing power
at their disposal, who are the first to ask for similar (if
not identical) computing environments. However, these users will very
soon be followed by others, and it is vital for the computer centers to be
well prepared before the main wave hits them.
The Center
The three-year Center project was launched last year as a part of the
TEN-34 CZ activities of the Czech Republic. Its main goal is to
connect the largest computing centers of the Czech universities, namely
the West Bohemia University in Pilsen, Czech Technical University and Charles
University in Prague, and Technical University and Masaryk University in
Brno, into one virtual computing center. The primary target of this
pilot project, led by Masaryk University and supervised by the first
author of this article, is a group of academic users of high performance
computers at the respective sites, but it is in no way limited to them.
The primary goal is to create a large virtual computer with a uniform user
interface. This virtual computer spans a large geographical area
(the distance between Pilsen and Brno is more than 250 km). The interface is
understood in the broadest sense, i.e., encompassing all the provided
services. The Center is also built as an open center, where more
computers may be connected and where new partners may become
involved. This places very strong constraints on what may be done and how.
A truly heterogeneous virtual computer is being built, whose nodes are the
computers of the individual centers. There are three POWER Challenges from
SGI, a large AlphaServer from Digital and a 19-processor IBM SP2 to be
connected into one whole. From the user's point of view the result of the
project will be seen as just one large Computer. Users will be able to log
in to any node while having immediate access to all the Center
resources. This means that the user of a program (service) need not even
know, or care, which particular node runs her program, in much
the same way as users of a parallel computer do not care which
particular node they are using.
Administration
As might be expected, the political and administrative problems
are the harder ones. We have already identified some areas where common
agreement is necessary:
- Account creation. In order to have truly transparent access
to the whole Computer, it is necessary to have an account on all its
individual nodes. Individual centers have to coordinate their rules for
account applications, with the final goal of trusting each other in such a
way that an application granted at any particular node will be valid for
the whole Computer.
- The security measures must be unified (at least
to some extent), because the security level of the whole Computer is
simply the security level of its weakest part. All centers must adopt
similar policies on what is allowed in this area.
- Unification of application program installation and user interface.
Individual computing centers differ substantially in the way they
install application programs and especially in the way these
programs are made available to end users; this difference must be removed
and all the programs must be accessible in a unified way.
- The interfaces to utility programs must be unified as well. A
common interface to the queuing system is essential in the area of high
performance computing, but this also applies to the mail program interface,
to the on-line help provided and to many similar utilities.
Technical side
The whole Center project is not possible without a reliable,
high performance network between its individual nodes. The sites are
currently connected to the TEN-34 CZ backbone,
an ATM academic network running at 34 Mbps. All the involved computers have
direct access to this ATM backbone, which means that virtual channels
may be created among them. Both the IP over ATM and the LAN Emulation modes
of the underlying ATM network will be used to create a kind of dedicated
route through which the Computer nodes communicate. An ATM metropolitan
area network running at 155 Mbps is currently available in Prague and in
Brno, which opens the possibility of connecting a subset of nodes at a
higher speed than the backbone alone allows.
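To give a feeling for the difference between the two link speeds, the following minimal Python sketch computes idealized transfer times. The 1 GB data set is a hypothetical figure chosen only for illustration, and the calculation ignores ATM cell overhead, IP headers and competing traffic, so real transfers will be slower.

```python
def transfer_seconds(size_bytes: int, link_mbps: float) -> float:
    """Ideal transfer time in seconds; protocol overhead is ignored."""
    bits = size_bytes * 8
    return bits / (link_mbps * 1_000_000)


if __name__ == "__main__":
    dataset = 10**9  # hypothetical 1 GB data set (assumption, not a project figure)
    for name, mbps in (("backbone", 34), ("metropolitan", 155)):
        seconds = transfer_seconds(dataset, mbps)
        print(f"{name:12s} {mbps:4d} Mbps: {seconds:7.1f} s")
```

Under these assumptions the metropolitan network moves the same data roughly 4.5 times faster than the backbone (155/34), which is why placing closely cooperating nodes within one city may pay off.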
A distributed file system is provided on top of the network connection.
After considering all possibilities, the AFS distributed file system
from Transarc [2] was chosen as the primary filesystem of the Center.
The main reasons were:
- AFS truly supports the heterogeneous environment as it is
available on most important platforms and operating systems, including
Linux and Windows NT.
- AFS is a state of the art distributed system, already in use at
many sites around the world.
- AFS allows great freedom in data migration, as only the address of
the volume location server must be known to all clients. Chunks of data
(the volumes) may be freely moved to different servers without any need to
reconfigure the clients accessing them.
- AFS has a local cache filesystem, which increases access speed and
decreases the network load.
- AFS is far more secure than NFS; it also allows finer
control over the accessibility of individual files than the ordinary UNIX
file access mechanism.
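The location independence described in the list above can be illustrated by a toy model. This is a minimal sketch and not the AFS API: the class names, the volume name and the server names are all hypothetical. It only shows that clients consult a single location service, so a volume can move without any client being reconfigured.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VolumeLocationServer:
    """Toy stand-in for a volume location database: volume name -> file server."""
    locations: Dict[str, str] = field(default_factory=dict)

    def register(self, volume: str, server: str) -> None:
        self.locations[volume] = server

    def move(self, volume: str, new_server: str) -> None:
        # An administrator relocates a volume; no client is touched.
        if volume not in self.locations:
            raise KeyError(volume)
        self.locations[volume] = new_server

    def lookup(self, volume: str) -> str:
        return self.locations[volume]


@dataclass
class Client:
    """A client knows only the location server, never individual file servers."""
    vls: VolumeLocationServer

    def server_for(self, volume: str) -> str:
        return self.vls.lookup(volume)


if __name__ == "__main__":
    vls = VolumeLocationServer()
    vls.register("apps", "fs1.example")   # hypothetical names
    client = Client(vls)
    print(client.server_for("apps"))
    vls.move("apps", "fs2.example")       # the volume migrates
    print(client.server_for("apps"))      # same client, new location
```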
An AFS multilicense covering all universities involved in the project
was purchased. Each university (computer center) established its own AFS
cell. There are, however, some peculiarities and problems
connected with the use of AFS which have consequences for the
Center implementation.
- AFS builds its own filesystem structure on top of the native filesystem.
AFS usually lacks support for the newest native filesystems available
(e.g., there is not yet support for the XFS filesystem from Silicon
Graphics, which means that there is no support for 64-bit filesystems).
Moreover, the additional layer slows down read/write operations (we
found that AFS reaches as little as 25% of the performance of the native
filesystem when the client and the file server are the same machine). The
local cache can compensate for this slowdown only for read operations.
- AFS is not fully available for Linux outside the USA. While
it is possible to obtain the Linux binaries, the source code is not
available even to those holding a source code license. As a result, the
Linux binaries are usually outdated and do not always fit well with
the newest AFS patches or with the newest versions of the Linux operating
system (i.e., they are not compatible with the Linux version necessary
to use the ATM cards). As for NetBSD, even the binaries are not available
outside the USA.
Overall, we found AFS to be a valuable tool for read-only filesystems
(parts of the operating systems and the application software) but of only
limited use for read/write filesystems (like the user directories). AFS
is definitely not the right choice when high local I/O throughput is
required (e.g., ab-initio calculations). AFS is therefore used in the
Center to store the read-only directories with application programs and the
shared parts of user home directories. Users have the option either to have
their whole home directories stored in AFS or to have (small) local
filesystems at each node and use AFS as a shared data repository.
AFS is also complemented by the local native filesystems, which
are made available through (limited) use of NFS.
The use of AFS naturally led to the adoption of Kerberos for user
authentication [3]. We are currently using a Kerberos 4
implementation (from KTH, Sweden), as Kerberos 5 is again available in the
USA only. To allow for an easy and smooth path for future expansion, each
computing center runs its own Kerberos realm and we use interrealm
authentication to move the tickets around. We had to modify a lot of
standard programs (like login, telnet, telnetd, ...) to make the
interrealm crossing as smooth as possible and especially to eliminate any
need for users to know precisely where the realm borders lie. While this
was quite successful, we discovered that the interaction of Kerberos 4
with AFS's own authentication mechanism is not ideal and that users
sometimes have to re-enter their passwords to get access to all their
resources.
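The bookkeeping behind one realm per center can be sketched with a toy model. The sketch below performs no cryptography and does not use any Kerberos library; the realm names are hypothetical, and it assumes that trust is established directly between pairs of realms (one shared interrealm key per pair), so the number of keys grows quadratically with the number of centers.

```python
import itertools
from typing import FrozenSet, Iterable, Set


def trust_pair(a: str, b: str) -> FrozenSet[str]:
    """An unordered pair of realms that share an interrealm key."""
    return frozenset((a, b))


def full_mesh(realms: Iterable[str]) -> Set[FrozenSet[str]]:
    """Every realm trusts every other realm directly."""
    return {trust_pair(a, b) for a, b in itertools.combinations(realms, 2)}


def can_authenticate(home: str, target: str, trusts: Set[FrozenSet[str]]) -> bool:
    """A user from `home` can reach services in `target` only via direct trust."""
    return home == target or trust_pair(home, target) in trusts


if __name__ == "__main__":
    # Hypothetical realm names, one per computing center.
    realms = ["PILSEN", "PRAGUE-CTU", "PRAGUE-CU", "BRNO-TU", "BRNO-MU"]
    trusts = full_mesh(realms)
    print(len(trusts), "interrealm keys for", len(realms), "realms")
    print(can_authenticate("BRNO-MU", "PILSEN", trusts))
    # A newly added realm is unreachable until keys are exchanged with it.
    print(can_authenticate("BRNO-MU", "NEW-PARTNER", trusts))
```

With five realms a full mesh needs ten pairwise keys, and each new partner adds one key per existing realm; this is a cost to weigh against the administrative independence that separate realms give each center.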
Load Sharing Facility (LSF [1]) from Platform Computing, Inc. was chosen
as the job queuing and load balancing tool for the whole Center.
LoadLeveler is used on the IBM SP2 and a gateway is to be developed to
connect the two systems. Again, each computing center runs its own LSF
cluster, with intercluster communication established to allow for
proper load balancing between individual computing nodes. The use of AFS
and Kerberos led to a problem for which we are still seeking the best
solution: how to ensure that the proper authority is given to users'
jobs when they finally leave the job queue and/or when they run for
a very long time (days or even weeks).
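The difficulty with long-running jobs can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes that a job carries the credentials obtained at submission time and that these expire after a fixed lifetime. The 10-hour lifetime and the job timings are hypothetical values, not measurements from the project.

```python
from datetime import timedelta


def uncovered_runtime(ticket_lifetime: timedelta,
                      queue_wait: timedelta,
                      run_time: timedelta) -> timedelta:
    """Part of the job's run during which the submission-time ticket has expired."""
    start = queue_wait
    end = queue_wait + run_time
    # The job is unprotected from whichever comes later: its start or the expiry.
    expired_from = max(ticket_lifetime, start)
    return max(end - expired_from, timedelta(0))


if __name__ == "__main__":
    lifetime = timedelta(hours=10)   # hypothetical ticket lifetime
    gap = uncovered_runtime(lifetime,
                            queue_wait=timedelta(hours=6),
                            run_time=timedelta(days=3))
    print("hours without valid credentials:", gap.total_seconds() / 3600)
```

Any job whose waiting time plus run time exceeds the ticket lifetime ends up with a non-zero gap, so some mechanism for obtaining or renewing credentials on behalf of the job is needed.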
Not every node of the Center offers the same set of application
programs. The transparent access allows users to run them without knowing
where they will actually run. The queuing system is aware of the
location of all major programs and reroutes individual requests to nodes
where they may be (best) served. There is, however, no such support for
interactive programs.
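The routing idea can be sketched as follows. This is not the LSF interface; the node names, the program names and the load figures are hypothetical, and "best served" is simplified here to "least loaded node that has the program installed".

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Node:
    name: str
    programs: FrozenSet[str]   # application programs installed on this node
    load: float                # simplified load indicator, lower is better


def route(program: str, nodes: Iterable[Node]) -> Optional[Node]:
    """Pick the least loaded node offering `program`, or None if no node has it."""
    candidates = [n for n in nodes if program in n.programs]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.load)


if __name__ == "__main__":
    nodes = [
        Node("node-a", frozenset({"chem_app", "fem_app"}), load=0.9),
        Node("node-b", frozenset({"chem_app"}), load=0.2),
        Node("node-c", frozenset({"fem_app"}), load=0.1),
    ]
    chosen = route("chem_app", nodes)
    print(chosen.name if chosen else "no node offers this program")
```

A real scheduler would also weigh licenses, memory and queue limits; the sketch keeps only the two properties mentioned in the text, program location and load.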
Conclusion
While the Center project is just in its first phase (the project
started in September 1996), we have already identified
several major advantages of the Center over the individual
centers:
- It simplifies access to the centralized services of different
nodes. It also allows ``personalized'' environments to be shared between
sites, including access to personal files.
- It increases the utilization of the individual computers and software
licenses available; it is no longer necessary to buy everything for every
site.
- It provides much higher reliability at much lower cost: users at an
individual university may continue to work even if ``their'' node
fails.
The Computer, which is scheduled to be put into full
experimental operation at the end of 1997, will be used both as a large
distributed computer and as a testbed for the unified user
interface of the computer centers of major Czech universities.
References
[1] URL: http://www.platform.com
[2] URL: http://www.transarc.com
[3] URL: http://web.mit.edu/kerberos/www/
1Faculty of Informatics,
2Institute of Computer Science
Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic
E-mail: ludek@ics.muni.cz,
eva@fi.muni.cz
Copyright EUNIS 1997 Y.E.