Keynote Plenary Lecture at ASIS (American Society for Information Science) Annual Meeting

October 11, 1995, Chicago

Bruce R. Schatz
Information Analysis in the Net:
The Interspace of the Twenty-First Century

schatz@uiuc.edu
http://csl.ncsa.uiuc.edu

Introduction by Cliff Lynch, incoming president of ASIS

Good morning everybody. It's about time to get started, I think. Welcome to our third plenary session of the meeting. I'm Cliff Lynch, and I'm here to introduce our speaker this morning. Just a couple of quick things. This session is being taped and tapes will be available after the session, if you want one. Also, please do your evaluation forms. We need your feedback on these sessions.

And finally, by way of context, let me say that I am not here to introduce Len Kawell [developer of Lotus Notes]. Print can't keep up with our changing world, and if you believe what was printed in I think the final program, that's obsolete. Len could not be with us, and we of course regarded that as a challenge, as in "top this", and I'm very pleased to say we have. I have someone special to introduce to you this morning: Bruce Schatz from the University of Illinois. Now let me, by way of introduction, start out with how I first heard about Bruce and his work.

Bruce and I go back quite a long way. Bruce is one of the people who, quite early on in the development of the Internet, really was thinking about its use for information access, information delivery, and collaboration, back at a time when most people were just moving electronic mail and telnet-ing to supercomputers across the Internet, and there were just a few of these sort of slightly marginalized people who had this vision that the Internet could be a lot more than electronic mail and supercomputing.

Bruce's work first came to my attention back, I guess, around 1989 or 90 when a gentleman who may be known to some of you by the name of Vint Cerf, who was running a meeting that the two of us were going to be at, came up to me someplace and was going on about this wonderful paper he'd read by this gentleman named Bruce Schatz, and how pleased he was that he was going to be at this committee meeting. Of course with a recommendation like that I had to go take a look at the paper, and I can only say, as is often the case, Vint nailed it right on the head.

Just a little background on Bruce. Bruce is currently the principal investigator on the NSF/ARPA/NASA funded Digital Library project that's going on at the University of Illinois at Urbana-Champaign. This is one of the six major projects in the Digital Library Initiative, and this one is working with SGML materials to support engineering and science (he may have a little more to say about that). You heard about this project from him at the Digital Library session earlier in the program. Bruce is an associate professor at the Graduate School of Library and Information Science. He also holds the position concurrently of Research Scientist at the National Center for Supercomputing Applications (NCSA). Those are the folks that brought you Mosaic, for example, and Bruce is heavily involved in advising NCSA about where they're heading in those kinds of technologies.

Before coming to Illinois, Bruce was at the University of Arizona where he was the principal investigator on a national collaboratory project that built a system to support what was called the "Worm Community". Some of you may have heard about this from him in a program yesterday on National Collaboratories. Prior to that, Bruce spent time at both Bellcore and Bell Labs, where among other things he worked on a system called Telesophy, and that is something that I can only describe as being enormously ahead of its time in many ways. I commend that work to you just to get a sense of how long Bruce has been pushing the envelope.

Anyway, without further ado let me turn this session over to Bruce Schatz who will speak on Information Analysis in the Net - The Interspace of the 21st Century.

Plenary Lecture by Bruce Schatz

That surely was a lovely introduction Cliff, I thank you. I remember when you and I were about the only people going to Internet meetings when the Internet was a small closed society, and talking about information retrieval and thinking, gee wouldn't it be great if someday ordinary people could search things in the Net and maybe things like digital libraries and information retrieval would be a more popular topic. I must confess that even going back a long way and knowing the history and being about to give a talk on the future, I am still quite amazed by the fact that the Internet is on the cover of Time magazine and that there are fifty million users in the Net.

What this really means is that there's not only a great opportunity, but a great responsibility to both the old timers and the new timers in the audience, the ASIS members, in being the keepers of the flame as to how you should really handle information and what will happen. And the problem with being keepers of the flame when things become very exciting and widely spread is that either you become part of the inner circle and take over the world and run the power structure in the appropriate way so that the good things happen, or you get bypassed by the evil barbarians from either the east or the west, depending on where you are historically. So, I hope that the people in the audience think about what you should actually do in the future so that the good things happen that could happen with information, and most of my talk is actually going to be about what the good things have been and what the good things could be. So, please think of this as an opportunity, and don't think of yourselves as old fossils that should just be ignored as the Net rolls on.

[Slide 1. Information Analysis in the Net (title) ]

What is this talk going to be? This is going to be kind of an odd talk, and it is not just because I am kind of an odd person, although that is true, and not just because it was done kind of at the last minute, but because what I'm really going to do is talk about the future. The easiest way to talk about the future is to try to look for ways you can do prediction, and the best way that I know is to look at what's going on in the research area -- to look at big research systems that are trying to really show what things will be like a long time in the future. A long time used to be 15 or 20 years; these days a long time is 5 or 10 years because the world is just moving so much faster, but using whatever the research systems are doing now to predict what the world will be like in the future is still a good method. So the structure of this talk is that I am going to go through a lot of technology very very quickly, and you're not supposed to look very carefully at any of the details, but just see what the flow is and see what the main idea is.

What I'm going to do first is talk about how prediction has gone in the past. So let me talk about the evolution of the Net and where things are going, just very briefly, then give a historical example. In fact, I'm going to talk a bit about the Telesophy system that Cliff mentioned and say how the predictions worked in that case, then talk about what research systems are now, to give you an impression of what they are going to be like in the future.

[Slide 2. Evolution of the Net]

So, what's the state of the Net? Well right now, even with this great excitement about there being 50 million users, the focus is really primarily on access. Presently we're really doing access, and if you look at something like Mosaic and you're feeling very excited about it, remember really all it is is souped-up FTP. All it's doing is fetching documents. It might be very streamlined but it's not actually finding anything. You point to something and you're able to get that very transparently. What you'll see in the next wave is more like organization, which is what you're used to: being able to do a real search, like what online retrieval systems such as Dialog have done in the commercial market for a long time. To describe organization, you will usually hear words like searching repositories; that's what the Digital Library projects are doing, for example.

Then consider the further future. The reason I'm saying millennium is because if you think about it, the next millennium is only correctly six years away, but popularly five years away, in the year 2000. Things will be a lot different then, and what will actually happen is ordinary people will be able to solve real information problems themselves, and you will see more about correlating information than doing searching. So the second part of my talk is about what the future is going to be like -- how you really are going to be able to do analysis. This prediction uses the coming research technology as a future projection. As a grand statement you can say that we all will be moving from something like the Internet to something like the Interspace. I will say a lot more about what the Interspace is, and why you would want to call it that rather than call it the Internet.

[Slide 3. The Present: Access]

So, here we are in the present. It's basically file transfers. TCP/IP and FTP, Gopher and WAIS, Mosaic and WWW. Now these [Gopher and Mosaic] are things that did not even exist five years ago, and today, there's, honestly, 50 million copies of Web browsers of different kinds. People are using them every day just to fetch transparently things that people used to not be able to fetch, that only the cognoscenti used to be able to. So, that's really fetching.

[Slide 4. The Future: Organization]

Then, on the next slide, is the kind of thing you will see in say the next 2 or 3 years, and that you're already starting to see with the Internet search services. What you'll see is you'll be able to do search; there will be repositories where you can actually put information. It will be indexed -- there are search engines that search it in different ways and there are higher level directories that you'll be able to use to find things around the Net. In the research domain these are things like the Digital Library projects that we talked about before, but there is also very large commercial activity in this area. Some of the issues again are: how you actually structure the documents (SGML is a tagging scheme where you indicate the fine-grain parts), how you record what's called metadata in the technical sense (you probably think of these fields as the bibliographic citations on the outside of the document), and how you really go about doing indexing and such.

[Slide 5. The Millennium: Analysis]

So, what most of the talk is going to be about is not about what's in the near future but what's in the far future. To give you some indication of what you should be thinking about that's going to happen in your working lifetime -- it will be possible to move beyond merely searching documents, so that you're actually handling concepts, manipulating them. You will have repositories for groups and collections too small and informal to be handled by professional indexers, not like something for ASIS members or something for Electrical Engineers, but down to the fine-grain community level. Where a community might be ten people locally that have a karate club, or a hundred people around the country that have a karate club.

In fact, this harkens back very well to the plenary talk on Monday that Bob Lucky gave, where he put up a beautiful quote from a graduate student that said, "The Net is not about information, it's about community, it's about sharing". That's going to be very very much true. The history of what the electronic medium of the Net is used for, all the way back to the videotex stage, shows that what people really want to do is to swap information and store particular things they care about, not access big centralized collections. So there will be a lot of capability for doing that swapping, and in order to do that you need an underlying infrastructure which will do a lot more than is possible to do now, well beyond full text search. So you will see things like support for domain experts who don't know anything about classification to enable them to do their own effective indexing. You will see support for being able to switch vocabulary across subject domains, and I'll talk a lot about what the technology would be like for that, because there are already instances in the research area of being able to show that functionality.

[Slide 6. Information System Timeline]

Okay, so now let me go into the historical part, to say how you might predict the future by using a particular set of examples that I was personally involved in. Because I know the subjects there very well and it's illustrative of what the prediction process is like. There is a 10 year period, that used to be a 15 to 20 year period from when you had a working prototype of something in the lab to when it was a billion dollar business and millions of people are using it. These days the time line is a lot shorter. Here was 10 years, many people predicted it would be 20 or 30, and in the future it might even be 5 or less. So, this is sort of a time line. I'll say a lot about what the telesophy prototype was, so you can contrast it 10 years later to what the Net has become, because you're probably very familiar with what Web browsers do for example.

In 1989 I became the Scientific Advisor at NCSA for information systems. Nobody knew what that was. They were doing very well with NCSA Telnet, but they had this idea, because of a very progressive director named Larry Smarr, that someday the Net would be a great source of valuable information for the scientific community. That is, there was this crazy fellow, me, who had done a big project on something revolutionary, so they thought I could show up occasionally and try to inspire the troops. Well, I had in the meantime moved on from Bellcore to the University of Arizona, and in 1991 I produced this Worm Community System that I'll say a few things about, because it's a useful historical analogy. Then, in 1993 after several attempts to reproduce telesophy on sort of a more mass scale, good enough underlying technology, which in this case was the World Wide Web, finally became available and Mosaic was developed. Mosaic was a relatively small effort at first, then ended up being about 10 to 15 full-time programmers.

But then the world exploded. Look at the time line, this is what surprised everybody: a million users the next year, 10 million users the next year, now there's a company which formed out of it worth 2 billion dollars in essentially 18 months [Netscape]. Most of the projections for 1996, which you see is only 10 years away from Telesophy, are that there will be 50 million users, online searchers, on the Net! Thus this 10 year time period is very very striking because this is not an esoteric subject anymore. So, let me now say a little bit about what things were like 10 years ago and you'll see that they actually were fairly good predictors of what you found 10 years later, so then when I tell you the grand vision things for the next wave, you might perhaps believe a little more that there is some predictive value in what the research systems are like.

[Slide 7. Telesophy System]

Well, in 1989 I actually gave a talk on Telesophy at the ASIS meeting in Washington. In fact, I met Cliff there. The talk I was going around and giving at places like the 20th Anniversary Symposium for the ARPAnet (or the Wake of the ARPAnet since it was dying and the Internet was coming in) was on how you make worldwide information spaces. And the outline of the talk would go something like the following.

What do you really want? You want to have something that has all the world's information in it, you can browse around in it, it's all interconnected -- but that sounds like science fiction. So, what you actually have now, concrete reality, is things like this Telesophy system, that I'll describe in a minute. How do you get from here to there? Well you have to lay fibers, you have to harden the software, you have to have more powerful machines. What happened is the technology curves were much faster than anyone predicted, for example, personal computers became cheaper much more quickly than people predicted and network speeds became faster much more quickly than people predicted. The software didn't really get any better, but that's always true with software.

Telesophy was supposed to be the universal system in-between all the world's knowledge and all the world's people. People are all putting things in and getting things out, so that you can sit on the far end with your portal into information space, go out over the switched network, and you can get all the world's knowledge, different types and different locations. So basically, you can get anything from anywhere and the system underneath hides everything. Telesophy is sort of like telephony -- "tele" is at a distance and "sophy" is like wisdom or knowledge. Just as the telephone hides all the sound from places, the telesophy portal hides that you're getting all this knowledge from other places and doesn't tell you what happens underneath.

Well there was this grand vision and a long report I wrote about the technological feasibility. But then I also actually built a prototype, and what impressed people most was that the Telesophy prototype actually demonstrated the vision with real technology and a real architecture. There the prototype was. It did multimedia information retrieval across real networks. It had a wide range of different sources. You could actually put repositories in different places (what would now be called repositories) and search across all of them. It had ways of saving what you found, the results of searches, and storing those away for using them later. It ran pretty quickly and it allegedly scaled up.

[Slide 8. Telesophy Prototype]

The prototype running in 1986 had about 20 sources that ranged from messages like wire services to citations like Inspec and Medline to full text like magazine articles and movie reviews to library catalogs to a sampling of multimedia things like line graphics and color pictures and motion videos. What you could do is sit at a workstation at your desk and search across all these sources for a broad word like "fiber". The system would then go out and search all of the sources (they were all carefully indexed), bring back in real-time the matching items from each of the sources, and you could manipulate them. First it would show you a one-line description, then if you wanted more details, it would pop up a picture or pop up the full text of an article. If there was a link in that article to something else, you could just push a button and it would jump to that link automatically.
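[Editorial illustration, not part of the lecture.] The search just described, one word sent to many separately indexed sources with one-line descriptions coming back, can be sketched in a few lines of Python. This is not the Telesophy implementation; the source names, sample items, and function names (`build_index`, `search_all`) are all made up for this example, and it assumes a simple inverted word index per source.

```python
import re
from dataclasses import dataclass


@dataclass
class Item:
    source: str   # which source the item came from
    title: str
    body: str

    def one_line(self) -> str:
        """Short description shown first, before the full item is opened."""
        return f"[{self.source}] {self.title}"


# Hypothetical sample sources; the real prototype had about 20.
SOURCES = {
    "wire": [
        Item("wire", "Fiber network planned for downtown",
             "A carrier plans to lay optical fiber under Main Street."),
        Item("wire", "Election results delayed", "Counting continues."),
    ],
    "citations": [
        Item("citations", "Loss in single-mode fiber",
             "Abstract: attenuation measurements."),
    ],
    "reviews": [
        Item("reviews", "Movie review: The Long Night", "A slow thriller."),
    ],
}


def build_index(items):
    """Map each lowercase word to the positions of the items containing it."""
    index = {}
    for pos, item in enumerate(items):
        words = set(re.findall(r"[a-z0-9]+", (item.title + " " + item.body).lower()))
        for word in words:
            index.setdefault(word, []).append(pos)
    return index


# Every source is indexed ahead of time, so a search is only a lookup.
INDEXES = {name: build_index(items) for name, items in SOURCES.items()}


def search_all(word):
    """Look the word up in every source and merge the matching items."""
    hits = []
    for name, items in SOURCES.items():
        for pos in INDEXES[name].get(word.lower(), []):
            hits.append(items[pos])
    return hits


results = search_all("fiber")
for item in results:
    print(item.one_line())
```

The user types one word and never names a source; which sources answered is visible only in the bracketed prefix of each one-line description.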

You could also make new information out of old if you wanted. While you were searching, you could pull something from here and something from here and something from here, then put them into a new piece of information, with some classification on that for later retrieval. So, for example, you could save a set of documents that you visited or save a set of pictures. It worked the same with pictures or videos. What I was going to show with the 35mm slides, which unfortunately were too dim because they are kind of old, was actually sitting at my desk at Bellcore, searching across all these sources, then pulling the camera back to show "yes, this is really my desk and it did have color pictures and it did have video and it did have this session searching and it was actually working".

I used the prototype every day for several years and there was a limited number of other people that used it, but the problem was it was sort of a hero experiment. It had fairly expensive workstation equipment, costing about $30,000. It relied strongly on having a fast local network, which was very uncommon in those times. It was very hard to collect enough data in the right formats to actually be able to search it, and even now when you try to run experiments, like in the Digital Library project, that still tends to be true. The reason that the prototype was impressive was that you could run the technology curves up and say "yes, if there really was a megabit fiber network everywhere and you had a personal computer that was like my Sun workstation, then you could do the same thing from home". And the reason that this kind of technology took ten years to hit the mass market instead of twenty or thirty is that no-one predicted how fast the hardware curves would go down.

The Telesophy system was thus a good prediction of what the future would be like 10 years later. In functionality it was a superset of what Web Browsers are now, and it was actually fairly close to what Web Browsers will be in one or two years, because it had ways of adding your own materials. The collaboration facilities are just starting to come into the Internet now, but we'll be there fairly soon. Telesophy also had good search across multiple sources, at least straight full text search, and this is just beginning to become standard in the Net.

I'm also sorry to say that what Bob Lucky talked about was exactly true. Bellcore felt that the future of electronic information was video-on-demand; so they thought Telesophy was an interesting high-runner project and they put money into it for a couple of years, but when it became a question of fishing or cutting bait, they decided to cut bait. Thus they chose not to put a substantial amount of money into this, and in fact they also passed up, despite some serious discussion, a chance to patent the concept of information spaces because they felt that a software patent wasn't defensible and it wasn't going to be an important enough area. I've since had discussions that indicated they could have owned the Web. That is, the Web would have been an infringement of their patent. Sorry to say, that's just one of the corporate decisions. It wouldn't have made me personally rich, and may have been just as well because Bellcore probably would have clamped down on its propagation and it wouldn't have spread as quickly. Such stories often happen in the history of technology.

In the model of a Telesophy system, there was this thing called an information space, with real data down at the bottom, and these little packages, called information units, which were uniform across all the data in all the sources. Information units were object packages that had uniform formats, which enabled the system to search across everything or group across anything. After search, the filtered results could be bundled together into a single information unit that could be displayed and searched for (sort of a knowledge region), even though it was actually a bunch of items of different types in physically different places.
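[Editorial illustration, not part of the lecture.] One way to picture an information unit is as a small uniform record that can also contain other units. The Python sketch below is a guess at the idea rather than the original design; the class name `InfoUnit`, its fields, the `bundle` helper, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InfoUnit:
    kind: str         # "text", "picture", "video", or "bundle"
    location: str     # where the underlying data physically lives
    description: str  # uniform, searchable description
    members: List["InfoUnit"] = field(default_factory=list)

    def matches(self, word: str) -> bool:
        """True if the word appears in this unit's description or any member's."""
        word = word.lower()
        if word in self.description.lower():
            return True
        return any(member.matches(word) for member in self.members)


def bundle(description: str, units: List[InfoUnit]) -> InfoUnit:
    """Group units of any type and location into one new information unit."""
    return InfoUnit("bundle", "local", description, list(units))


# Two items of different types held on different (made-up) hosts.
article = InfoUnit("text", "host-a", "Article on optical fiber")
photo = InfoUnit("picture", "host-b", "Photo of a cable splice")

# The saved results become a unit in their own right.
region = bundle("My notes on cabling", [article, photo])
print(region.matches("fiber"))
```

Because the bundle has the same shape as its members, it can be stored, displayed, and found by later searches exactly like any other unit, which is the "knowledge region" behavior described above.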

So, in summary what the Telesophy system showed was you could really do transparency of type and location. It didn't show scale in billions very well, but it certainly showed scale in millions. There were about a million, well about three to four hundred thousand items in the whole space, and it was tuned and fast, so you would really get one second response for a search and you would really get half-second response when you clicked and tried to follow a link. So the prototype also showed you could really do things fast like you were wandering through a library. A lot of the technology, a lot of the implementation effort was trying to show that you could make browsing a worldwide electronic library at least as functional as a physical one, and then it tried to show some grander things which were more technical, so why don't we move on.

[Slide 9. Towards the Interspace]

Okay, so that's the end of the first part. That's what happened in the past, and about ten years ago it was clear that you could do Net browsing and that eventually would be big and grand and lots of people would use it. And big and grand turned out to be ten years. So, now I'm going to talk about what's going to happen ten years from now. It's now 1995. If you start building a research system like I'm about to show, it won't start working until next year, 1996. The question is, if you're in the 21st Century what are there going to be fifty million or a hundred million copies of?

My hypothesis, as you can probably tell, is it's going to be whatever big research system is possible. So, what I'm going to discuss in the rest of the talk is what you can actually do now if you do a grand hero experiment and then you can extrapolate for yourself with whatever kind of historical analogies you like as to whether that will happen and if so when. My belief is that it will be a billion dollar business in the early 21st Century. And what is it? It's not Web fetching, which is just straight access, it's not library search, which is just what you're going to see in the next years when you can put up a big collection and actually search it. It's going to be correlation, analysis, coming in with a real problem and being able to look through many many different sources and say, this thing here and this thing here combined in this certain way solves my problem. Let me give some analogies of that from several other projects to try to give you some concrete feeling and then I'll talk about the technology of what will really do that.

So we're going to talk about cross-correlation, generic community systems, and spaces not networks. Those probably don't mean very much right now, but I will try to give enough examples so you can get some feeling for what those concepts might actually mean.

[Slide 10. Community Systems]

First of all let me talk a little bit about WCS [the Worm Community System]. That's what I personally was doing in the five years while Mosaic was starting up. What that tried to do is, essentially, make a real telesophy system in a small area of molecular biology and see what was really involved. It was trying to build an electronic scientific community which had data and literature both informal and formal. So it had real data bases in biology and real literature. It also had bulletin boards, informal information like community newsletters and meetings. You'd be sitting there in this single space and you could search across everything to select desired information. Then you could follow very fine grain links, so if there was mention of a gene in an article, you could jump right to the corresponding item in the gene database. You could also take a display of a database item and pass it into another program.
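[Editorial illustration, not part of the lecture.] The fine-grain link from a gene mention in an article to the matching database item can be sketched as a lookup of known gene names inside the text. This is only a sketch under assumptions: the tiny `GENE_DB`, its placeholder records, the sample sentence, and the function names are invented here, and the actual WCS linking mechanism is not described in the talk.

```python
import re

# Hypothetical gene database: name -> record (placeholder records only).
GENE_DB = {
    "unc-22": "placeholder record for unc-22",
    "lin-12": "placeholder record for lin-12",
}


def find_links(text, db):
    """Return (start, end, gene_name) for every known gene name in the text."""
    names = sorted(db, key=len, reverse=True)  # prefer longer names on overlap
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]


def follow(link, db):
    """Jump from a link in the article to the corresponding database item."""
    return db[link[2]]


article = "This made-up sentence mentions unc-22 and then lin-12."
links = find_links(article, GENE_DB)
for link in links:
    print(article[link[0]:link[1]], "->", follow(link, GENE_DB))
```

Each link records the exact character span of the mention, so a display could highlight just those characters and jump to the database record when one is selected.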

[Slide 11. Worm Community System]

WCS tried to handle all the knowledge in this small community, for a pretty wide range, and really be able to manipulate it, both taking it out and putting it in and passing it into programs. It had basically all these functions, like I said. You could browse. You could search. You could navigate. You could follow links. You could select part of a map or part of a gene description and pass it into another program. Since it was a symmetric system, you could yourself add anything that was supported within the system. You could add your own gene descriptions. You could make a link between your gene and this other gene. You could do a submission on-the-fly to one of the main databases.

So, essentially all of the information needs were handled within this single environment, and, what happened during about five years, was that there was a working system built and evolved. There were several hundred different users, about fifty different labs that were actually using this system, at least on a test basis. Mostly for information retrieval, but they did some sharing. It did go across all the connections. It did have very fine-grained editorial control. You could actually publish things. You could keep them private for a while then move them out to the next level database. So, it did also try to capture the complete publishing cycle.

Then basically what happened is what usually happens to research systems, which is the good ideas got absorbed in a more popularized fashion into other (low end) systems that were trying to appeal more to the masses, and the research system itself disappeared. So, what happened in this particular case is the genome projects took over a lot of the nice graphical displays with the link following, and Mosaic and the Web Browsers took over the fetching part across the Internet, and the Worm System itself disappeared. But it showed what it was possible to do for trying to handle all the knowledge of a small-size community.

[Slide 12. Distributed Library Model]

So, here is the second of two different types of metaphors to try to explain to you what Interspace should be. The first was taking a whole community and handling all the information in it a la WCS, and this second one is sort of what librarians really do, real libraries.

If you look at the digital library project or a physical library, usually you think of it as "here's a big repository and here's the user and they want to do a search in there". Well, that really isn't what librarians do. What librarians do is, they know many many sources and they have a huge library with lots of books, but from lots of sources, and there's even lots of sources that they know that are not physically in the library. Mostly what they're doing is serving as a reference, as this slide shows in the middle as a gateway or reference. They're trying to solve a particular information problem for a user by routing them here and letting the user look at that, and routing them there and letting the user look at that, so that they're going through a lot of different sources in a reference session trying to solve a problem by correlating the parts of that. Well, current digital libraries do not do that. They just do straight search. But, suppose that reference was now the most important thing. Suppose you could search and you can do access and you can do organization, then you would want to do correlation.

[Slide 13. Publishing Cycle]

That's only part of the story, because the other strong technological trend is that the publishing cycle is breaking down. See, it used to be "here is the big library and here is an author, and the big library sits on a big machine, it's a big server, and the author sits on a small machine, it's a little client, and occasionally the author is going to shoot something over to the library." Then there's lots of people that are accessing this big server, so it's big things and little things, with the little things being users and the big things being libraries.

Well, that is not how the future is going to be. If you want a really good illustration, let me just say that there will be a hundred million copies of Windows 95 in 1996, and there will be a publish command in Windows 95 which will basically, as part of the operating system, take whatever object you're working on, like a spreadsheet or a word processing document, and shoot it over to a Web site and index it. Thus every person will be able to easily publish things from their usual programs on the fly. Now it won't be refereed, it won't be journal publication, but it will be somewhere. There won't be the current difficulties where you have to read a book that tells how to set up your publishing site.

So, this whole cycle that goes from users to librarians who do reference to indexers who carefully classify things to publishers who do the quality control to authors who generate the actual materials, is going to break down completely. There'll be single computers and single people who do all those stages in different combinations, and it will vary what the combinations are. But, every person is going to do publishing, every machine will too, and every person will do some combination of the stages of the publishing cycle.

So, the Net of the future will have many levels of publications. You'll have some personal documents. You'll be the editor of a few small newsletters or clubs. You'll be part of some professional societies and each will have a professional letter or journal, because that's a big enough community that there will be enough people to be worth indexing in a more professional way. And so on to ever larger communities.

What this will end up with is a world where there are a billion repositories. A billion might be small. The ten year projection is that there will be a billion personal computers, and each personal computer is going to have a couple of collections, so maybe I should've said ten billion, but a billion sounds like a big number. A billion is a lot more than the number of databases on Dialog. And a billion is a lot more than the number of sources you find in the index of all the databases, which is more like ten thousand. Ten thousand is a number that you can handle by just searching the descriptions in the index of databases. A billion is not a number like that. You need a *completely* different architecture to handle the world of a billion repositories.

This is not, I should emphasize, this is not science fiction. This is straight technology extrapolation. There will be a billion repositories, whether you like it or not, and if the systems are ready and if the people who know about information retrieval do something, then maybe people will be able to find things in the world of a billion repositories. If they don't then it will be, not like the Web now where you can actually find something if you're sufficiently energetic, it will be like you're in the Library of Congress, in all the archives that are under the ground that you know are unsorted, and there's nothing. There's no card catalogue. Nothing, and you would like to find some information. What you are going to do is wander around at random and pass on the way some other skeletons that are sitting there dead, and occasionally you hear somebody just before they die say, "Oh, look over there in that catacomb and you might find something".

[Slide 14. New Architectures]

That's what's going to happen. And the question is "can technology solve the problem?" And I'm going to say, since I'm a revolutionary technologist type, that the answer is yes. And I'm going to tell you about some technology that might solve the problem. So, let me just emphasize that full text search will not solve the problem, and known semantic retrieval that works on two hundred documents will not solve the problem.

We are now into the speculative revolutionary area of the talk, if you haven't picked that up. What we need are new architectures for systems that actually do something about analyzing and cross-correlating from multiple sources, because the library model where you have a few big things is totally blown away. What you have is a community model where there are a billion repositories and they are all different sizes. There's one about cats. There's one about white cats. There's one about white cats with blue eyes that live in your neighborhood. And each one of these is maintained by someone who is passionately interested in it. If you don't believe that there are such people, you haven't used an electronic bulletin board or browsed the Web recently. Or gone into a clubroom and looked at all the newsletters. That is what people do, so you have to deal with that world.

What I'm going to describe is actually the backroom laboratory of the Digital Library project at Illinois and also the CAN which is a NASA information infrastructure project and that's why it's being funded with high technology. But you have to promise not to tell the funding agencies that I'm really doing this, because they'll believe that it isn't going to work.

[Slide 15. Navigation and Grouping]

Well, the easy thing is doing navigation and grouping, and that's what you can see the Web starting to think about doing. That is, within the Web, you can go through a path to many different sources and there are beginning to be facilities to record the path itself so you can play the path back later. This is part of the facility required for what Vannevar Bush called trailblazing in the Memex Paper, if you're familiar with that, and what librarians call pathfinding. For the full facility (well beyond what is currently available), you can edit the path so it says I've been here and here and here, and that's a valuable set. You can also do other kinds of groupings so you can do the kind of things the Telesophy system supported. Say, if you do a query, you can edit parts of that and then save that as something you might like to get back later. You can make lists of interesting things that you found that weren't even a path, but just gathered over time. And the reason you want to do that is, first of all you'd like to have some way of recording things you were able to find in your searches, because you know the regular indexing isn't going to work. So, this is like recording reference sessions, so you can re-use parts of previous work.

The second and most important reason is that paths are how search *should* be done. When I was working with molecular biologists on WCS, after I had warmed them up over an appropriate number of years, they would tell me what they would really like. What they would say is -- I'd like to say I'm working on my own little organism, and here's three genes that are really important and here's the section of the map that I care about and here's some sequences in that section and here's three papers that are very important for this gene function -- find me a similar collection of genes and literature that are in some totally different organism that's much easier to experiment on, so I can do the experiment there, figure out what's important to do and then go back to my more difficult but more interesting case.

You see that this is a general facility -- path matching as the basic retrieval. What it is is that you did a search through the Net and hit some things that you think are interesting and you want to find other groupings, other paths, that are just like that in some very vague sense of just like that. That is not full text search. That is not even graph matching, although I sort of described paths as graphs, it's some other kind of powerful semantic retrieval that nobody knows how to do.

[Slide 16. Community Repositories]

Okay, well now I'm going to say maybe there is a way of doing it and that's what I'm going to talk about next.

The task is to handle repositories at a very fine grain level. So, when I say "repository" here and talk about organized collections, I don't mean what the Digital Library project is doing, which is making a collection for the IEEE journals. There's trained professionals who do that. And I don't mean really what WCS, what the Worm system did, which is things like specialty journals and then things like the community newsletter for several hundred people. I mean you and your neighbor with the cat, who have a newsletter about the cats in the neighborhood.

You and a small group form a community. For example, I take my daughter to a music class on Saturday morning that has five other kids who are 2 years old and five other parents. That set of people has a common interest -- they'd like to have a collection of their information on kid-related topics that they could search and I bet that there are other similar sets of five parents elsewhere who would also like to be able to access this collection. And I as one of those five parents am quite willing to spend some modest amount of time making a collection and doing some indexing, but I'm sure not going to become a professional indexer like Inspec would hire in order to do electrical-engineering.

So, what you need is some way of really being able to do classification for small publishers and some way of being able then to use that classification to search across the collections at a deeper level. The collections will span across many many different publishers, from really little to really big and from really low quality to really high quality. It's hard to tell how the quality level might vary. The small ones might actually be more carefully done than the big ones, but the professionalism is different. So, how are you going to be able to search across all those collections? Let me give hints for what you might do.

[Slide 17. Indexing and Classification]

Let's first look at what professional classifiers do now. Those are human indexers. What they do is make a subject classification of the important terms in an area and which terms are bigger and littler -- this subject hierarchy is correct in some profound sense. The hierarchy represents the meaning of the subject area. However, the terms tend to be very general. For example, in the Worm Community System we got a copy of MESH, which is a very well done thesaurus covering all of biomedical research generated by the National Library of Medicine. We were all excited about it -- until it became clear to us that every single article in the Worm literature, all 5,000 articles, had exactly the same MESH terms!

That was actually what started us down this path towards automatic indexing. It made us think that even for this collection, for a couple of hundred people so it's a reasonable size community not a couple of neighbors, that you needed better technology. So, we started looking at co-occurrence matrices, which record the frequency of terms occurring together. This statistical technique goes down to the real words in the documents and can be done automatically, but isn't meaning at all. It's context of some kind. So, it's pretty good at recalling things, but not so good at being precise. That is, the automatic technique is quite specific, but not necessarily correct.

Following up on this work from WCS, as part of the Illinois Digital Library project, we built an interactive interface to the indexes from both manual and automatic classification in electrical engineering. The manual classification was the real Inspec Thesaurus - 10,000 terms carefully done by professional indexers. You can use a graphical interface to this classification scheme to move up and down the subject hierarchy, then find desired words and use them for search. That's very helpful to see what the main categories are, but it's not very helpful for finding out the actual words which appear in recent papers because they just aren't in the thesaurus.

So, as before, we also generated an automatic classification scheme, by gathering statistics of which terms occur together how frequently. The interactive interface to this "concept space" suggests alternative terms to be searched for. That is, given a word, it gives a list of other words that occur with that word in context very commonly. The context words are all mixed together: bigger, littler, useless, useful.
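
As a rough illustration of the kind of term suggestion described here, the following minimal Python sketch counts how often pairs of terms appear in the same document and then lists the most frequent companions of a given term. It is not the project's actual algorithm: the function names, the toy documents, and the use of raw document counts with no weighting are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_concept_space(documents):
    """Count, for every pair of terms, how many documents contain both."""
    cooccur = defaultdict(Counter)
    for terms in documents:
        # Use each term once per document, and visit each unordered pair once.
        for a, b in combinations(sorted(set(terms)), 2):
            cooccur[a][b] += 1
            cooccur[b][a] += 1
    return cooccur


def suggest_terms(cooccur, term, k=5):
    """Return up to k terms that most often share a document with `term`."""
    return [t for t, _ in cooccur[term].most_common(k)]


# Hypothetical toy collection: each document is already reduced to its terms.
docs = [
    ["deductive databases", "prolog", "horn clauses"],
    ["deductive databases", "prolog", "inference mechanisms"],
    ["deductive databases", "horn clauses"],
    ["relational databases", "query optimization"],
]

space = build_concept_space(docs)
print(suggest_terms(space, "deductive databases", k=3))
```

The suggestion list mixes useful and useless companions, as the talk notes; the sketch leaves the sorting to the user rather than trying to filter automatically.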

This co-occurrence list is not meaning -- a professional indexer would reject this completely, and they have when we talked to them -- but the context lists are practically useful as search suggestors. This is partially because the granularity is much finer -- there are 100,000 terms from the same Inspec corpus (10 times more). So that you get not just "deductive databases" but also "Prolog" and "inference-mechanisms". And partially because the system is interactive so the users are perfectly happy to sort through the lists themselves deciding what is useful in exchange for getting the full range of potentially related words from the documents.

What we found in molecular biology, in small experiments in molecular biology, is that the concept spaces are pretty good as memory joggers. The fact you can also generate them automatically is really nice because it means you can use them in cases which are inappropriate for professional indexers. I'm thinking about the cat example. You can get professional indexers to work on a repository for journal articles in electrical engineering, but not on a repository for notes on the cats in your neighborhood.

[Slide 18. Semantic Retrieval]

So, I'm going to talk more about the automatic classification scheme, since if the manual one is there you should definitely use it. The automatic classification techniques all are statistical correlations of the context within documents. The particular one we are using is co-occurrence matrices, which is only one of the hundred ideas about how to do deeper semantic retrieval that have been in the information science literature since the sixties. (When I say "we" here, I mean my colleague Hsinchun Chen from the University of Arizona and myself.)

But co-occurrence is one that is now computationally feasible if you have a supercomputer. For example, if you take the SGI Power Challenge, a high-end supercomputer at NCSA, and take a day of computer time, actually 24 hours, then you can compute a co-occurrence matrix of a real collection of 400,000 abstracts. That was not true in the 1960's. And it has nothing to do with the algorithm being better, although it is tuned a little bit, it has to do with the fact that computers are enormously faster and so, some of these old deeper semantic techniques can actually do something real. Since techniques like co-occurrence lists are useful as term suggestors, this might be a real break into semantic retrieval.

See, this is fast and dumb. There's no magic, no natural language parsing, no fragile domain rules. We have a lot of computational power and we can look at the word frequencies ad nauseam. This is just a first chink in being able to develop deeper semantic retrieval. So, you get terms like "Horn Clauses" which is a really fine-grain technical term in deductive databases, and you get terms like names of people that write articles about deductive databases. In molecular biology, you get names of genes which occur commonly in articles about that particular concept. Which is very helpful to users, especially since an indexer would never put the name of a gene in the MESH thesaurus. So, a lot of the power of this particular technique is it doesn't have any semantics in it at all -- it just takes whatever words are there. You just hope that there is some sort of guilt by association, and that if two terms occur in the same context frequently then one is a good alternative for the other when doing search.

Now another nice thing we did -- we tried computing the co-occurrence matrix of several different collections, because the Worm system was actually several different collections, in different orders -- just because we were curious. We had done it in one order and then we said, "does it make a difference if you do it in the other order?", thinking it doesn't make any difference. And the answer is, the lists were completely different. Then the first thing that occurred to me is that maybe we could solve the vocabulary problem.

[Slide 19. Vocabulary Switching]

Now, I'm not going to stand up in front of an ASIS audience and say that the vocabulary problem is solved, but let me just explain that this is a chink against that. The vocabulary problem is that you have the same concept in two different subject areas but the terms are different. So, in engineering for example, "fluid dynamics" is a term that occurs in many different subject areas, but the words are completely different even though the concepts are pretty similar. Could there be a system where you say: "I'm a civil engineer who designs bridges. I'm interested in fluid dynamics to compute the structural effects of wind currents on long structures. I think ocean engineers who design undersea cables do similar computations for the structural effects of water currents on long structures. I want you [the system] to change my terms for talking about fluid dynamics into the ocean engineering terms and search the undersea cable literature, as automatically as possible."

Well, that is sort of what our vocabulary switching technique does. It lets you make this fine-grain concept space, built on co-occurrence matrices, for a very small collection, so you can do it for really small communities. You can then, on the user end, say "I'm in these three communities, that's what I know about, and I want to search these other three." Then the system will automatically intersect the corresponding matrices, which are just concept graphs, and let the user interactively switch the vocabulary from one space to another to facilitate the searching of the desired community repositories.

It's possible to do these computations now with supercomputers. You should know, if you are not accustomed to them, that the significance of supercomputers is that they are good as time machines. It's well known from the technology curves of the past, which are probably too slow, that whatever speed a supercomputer runs now is what a $3,000 desktop machine will run in 10 years. So the following experiments are a conservative estimate again of what you will be able to do in 10 years.

[Slide 20. Switching Experiments]

So, here's two vocabulary switching experiments. This first one is actually going to appear in JASIS very shortly, and it did two areas of molecular biology: worms and flies. It had about 5,000 documents in each area and each took about 10 hours of computation on a workstation. So, you could say, for example, "here's sperm about worms and sperm about flies", and it would list 25 terms. 10 of them would be the same, so you'd ignore those and you would look at the different ones.

Vocabulary switching is needed for this sperm example. It's known that worms have odd sperm; I happened on the Worm Community System to work with the world's expert on worm sperm. Worm sperm have little pseudopods and crawl like amoebae, while all other sperm, in flies and everything else, swim since they have these little flagella that wiggle. So, what you want to do is change all the crawling to swimming and change all the pseudopods to flagella.

Well, if you look at these co-occurrence lists [for worms and for flies] and you look at the different terms, then sure enough crawling and pseudopod are on this end and swimming and flagella are on that end. Now, it's not automatic. You have to realize that because in with those two good terms are ten ridiculous ones that are way too general, and ten of them were common, so you ignore those. That is, of the top 25 terms for sperm in worms and flies: about 10 are common to both lists, about 10 are useless, leaving only 5 as potentially useful.
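
The comparison just described can be sketched in a few lines of Python. This is only an illustration of the set arithmetic, assuming each community's concept space has already produced a list of co-occurring terms for the same query. The function name and the filler terms are hypothetical; only crawling, pseudopod, swimming, and flagella come from the talk.

```python
def switch_candidates(source_terms, target_terms):
    """Split two co-occurrence lists for one query into shared and community-specific terms."""
    source, target = set(source_terms), set(target_terms)
    common = source & target
    return {
        "common": sorted(common),                # same in both lists, so ignored
        "source_only": sorted(source - common),  # vocabulary the user already knows
        "target_only": sorted(target - common),  # candidate replacement vocabulary
    }


# Hypothetical co-occurrence lists for the query "sperm" in two communities.
worm_sperm = ["crawling", "pseudopod", "fertilization", "gene", "mutant"]
fly_sperm = ["swimming", "flagella", "fertilization", "gene", "mutant"]

result = switch_candidates(worm_sperm, fly_sperm)
print(result["source_only"])
print(result["target_only"])
```

The sketch stops where the experiment did: it presents the two community-specific lists side by side and leaves the matching of individual terms to the user.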

Or returning to the original example in engineering. What this technique would do if it was fully automatic is it would say, "Oh, no problem, here's the three terms you really care about. They're these three relevant terms to you in undersea cables." That isn't what it's able to do right now. What it is able to do right now is say, "Okay, here's the terms you want to search, and here's what you know about, and here's what you'd like to know about. Here's ten terms that you know, and here's ten terms that you don't know. Match them up yourself."

And, that's amazingly a lot of help compared with nothing. With the biologists, they are really fast at scanning through lists of potential terms, and they're very grateful to have a list of possibilities, because otherwise they're going to sit there and try words at random. The concept space intersection is far from perfect or even correct, but it's a lot better than trying words at random. If you've ever been in a reference library, you know how bad people are at actually searching.

So, encouraged by that, we're trying a real experiment. In this one in progress, we're taking 5 million abstracts from Compendex, which is an index covering all domains of engineering, and generating 1000 spaces of roughly 5000 abstracts each. Thus each space approximates a community repository of the same scale as the ones from actual communities in molecular biology and the intersection of the spaces simulates an Interspace across all of engineering. What we're actually going to do is divide Compendex by class codes, so the size of a space is a fairly fine-grain subject domain like bridges or highways.

A simulation this large can only be run on a supercomputer. Even so, at about 1/4 hour per space on a machine like the Convex Exemplar, plus intersecting the spaces, this is still going to take about 2 weeks of computer time. Fortunately, the newest and largest supercomputer that NCSA just got is still in its testing phase, and I was able to persuade them that this was an interesting application, so we're able to reserve the time to try this as a hero experiment.
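
The two-week figure can be checked with simple arithmetic, shown here as a short Python calculation. It assumes the spaces are built one after another on a single machine running around the clock, which the talk does not state; the variable names are illustrative.

```python
spaces = 1000            # concept spaces to generate, from the talk
hours_per_space = 0.25   # about 1/4 hour per space, from the talk

build_hours = spaces * hours_per_space  # total hours to build all spaces
build_days = build_hours / 24           # the same figure in days
remaining_days = 14 - build_days        # share of the two weeks left for intersecting spaces

print(build_hours, round(build_days, 1), round(remaining_days, 1))
```

Under those assumptions, building the spaces accounts for roughly ten days, which leaves a few days of the two weeks for intersecting them.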

Then the question is, can you issue a query like "fluid dynamics" and really do useful interactive vocabulary switching? That's totally unproven, but it works a lot better than you would expect in molecular biology, where it really does do something. It is computationally feasible and it does something, and that's a lot more than not being computationally feasible and not doing anything.

[Slide 21. Computer-Assisted Indexing]

The flip side of all this is: can you also use concept spaces to help with indexing? Well, what's the problem with indexing? If you can get a professional indexer, they will do a good job for their broad subject area. That does solve the problem of electrical engineering, but it doesn't solve the problem of bridge swaying or neighborhood cats, which are too small to afford a professional indexer. And for these specialized repositories, the terminology in the professional subject indexes is too old and too general.

On the other hand, if you try to solve the indexing problem for specialized communities by letting individual people from that community do indexing, you find out what the value of using trained professionals is. As many experiments have shown, ordinary people have really wide variation in how they classify things. An ordinary person won't even assign the same terms to the same document twice, much less will two different people assign the same term to similar documents, which is what you want. Remember all we can do is string matching underneath, so the indexing has to be precise and consistent. There's not any magic here.

So, suppose you could have an automatic program which would suggest topics for classifying a document then let a person correct the list. For example: "Here's 25 terms that this document should be about. Choose 5 from that list." The domain expert, who knows about bridges or worms or cats, can do that. They know the subject area and the meanings of the terms, so if the system could suggest consistent terms to limit the variation, you would get an interactive indexing system which enables amateurs to approximate the quality of professionals. This ought to sound similar to the sort of solution that concept spaces provide for semantic retrieval.

We're now trying a set of experiments that basically provide a domain-independent version of the old 1980s technology that used to look through newspaper articles for the CIA and try to identify which ones are about revolutions. What these old systems did is use tag words. So they said, "Revolution has these ten words that commonly mean revolution and tank has these ten words and spaceflight has these ten words. This document mentions 3 words for revolution and 1 for spaceflight so it's about revolution." As you might imagine, they would get fooled a lot, but they would often be able to assign what topics documents were on and some of them were right while some of them were wrong. So they would say, "This article is about tanks and spaceflight", when it was about the Russian invasion of Hungary, because it mentioned the word satellite a lot and satellite was a tag word for spaceflight.

This concept identification technique relies on having a concept dictionary giving the tag-words. Well, the concept space really has that. It says, "Deductive databases - here are 10 words that might be useful, that commonly occur with deductive databases, so if you see one of these, the document is probably about deductive databases." If you use the concept space as a concept dictionary and look at the words that commonly occur together as tag words, then you're able to make a suggestion list of which words could be used to classify the document. Just like a professional indexer will choose some terms from a controlled vocabulary like Inspec or MESH.
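
A minimal Python sketch of this tag-word approach follows. It ranks concepts by how many of their tag words appear in a document and returns the top few as suggestions for a human to accept or reject. The concept dictionary, the tag words, and the function name are hypothetical; in the scheme described in the talk the dictionary would come from the concept space rather than being written by hand.

```python
from collections import Counter


def suggest_index_terms(document_words, concept_dictionary, k=3):
    """Rank concepts by how many distinct tag words of each appear in the document."""
    words = set(document_words)
    hits = Counter()
    for concept, tag_words in concept_dictionary.items():
        matched = len(words & set(tag_words))
        if matched:
            hits[concept] = matched
    return [concept for concept, _ in hits.most_common(k)]


# Hypothetical concept dictionary: concept -> tag words that commonly occur with it.
concept_dictionary = {
    "deductive databases": {"prolog", "horn", "inference", "datalog"},
    "revolution": {"uprising", "tanks", "invasion"},
    "spaceflight": {"satellite", "orbit", "launch"},
}

doc = "the soviet invasion of its satellite state met an uprising".split()
print(suggest_index_terms(doc, concept_dictionary))
```

On this toy document the sketch suggests "spaceflight" alongside "revolution" because of the word "satellite", which is the kind of mistake the talk describes; the subject expert choosing from the list is what removes it.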

It's unproven what will happen with this. The experiments are just starting, but the basic idea is sound, in the sense that it's an automatically generated controlled vocabulary which is specific to that specific topic and the actual selection is done by a subject matter expert.

So again, like all the things in this talk, in the future part, this is something that's sensible, that might actually work, and even if you don't believe this one, it may be that some variation on this will allow being able to do indexing. Remember, if you don't do fine-grain indexing you're not going to be able to find anything in the world of a billion repositories.

[Slide 22. Applications Environment]

So, now it is time to talk very briefly about the Interspace software. It is an applications environment built on top of the Internet, assuming that the Internet has evolved into a world-wide object-oriented operating system. Basically, it assumes that every community has an information space, every information space has a corresponding concept space, and then the Interspace is the intersection of all these spaces.

The environment for the Interspace supports searching and analysis. The searching is what I said before. You select a group of objects and the environment locates similar groups. Vocabulary switching is done automatically. So, this is just a whole network information system which uses this vocabulary switching and concept spaces to try to do semantics at a fine-grain level to try to handle community repositories.

[Slide 23. Interspace Prototype]

This is actually what my lab at the University of Illinois is doing -- building a prototype of the Interspace. Kevin Powell and I have written an architecture document laying out all the parts of the environment, and my team is in the process of implementing the first full prototype.

If you want a little technical detail: it's got objects, it does retrieval, it tries to do correlations. The prototype assumes a distributed network of objects by using high-end software technology like Smalltalk and CORBA, and is constructing an applications environment to handle the concept spaces and semantic retrieval. We're then going to try some sort of hard applications where there's lots of data and easy questions have hard answers which require looking through lots of things to cross-correlate. Like digital libraries or geographical information systems. Over the next few years, we will be evolving the software and simulating the world of the Interspace, with spaces like the thousand community repositories in engineering discussed earlier.

[Slide 24. The 21st Century: Analysis]

Okay, let me wrap up, because my time is about over. The claim is that the 21st Century is going to go past search into analysis. And what analysis really is is cross-correlating information from many sources. And then what you will be able to do is solve problems, not just find things at random. And in order to do this, what you need underneath is very fine-grain classification. That's the only known way of handling the world of a billion repositories.

What that means is that every community, large or small, has its own little digital library. The software does some computer-assisted indexing and it has some "semantic" retrieval that uses that indexing to try to do vocabulary switching, to try to do better kinds of search. So that there's more responsibility for some individuals to develop collections, but this also means that the average person might end up being sort of a librarian. They might maintain a collection. They might do searches on an everyday basis.

So, what you need is to embed some of this higher-end technology into the standard network software that ordinary people use, in order to really be able to do this new kind of functionality. This new functionality will happen. Commercial pressures will force this to happen -- the real question in the world at large and to people like you in the audience is: "Is the Interspace going to empower the individual person, so that they will be able to actually find things and solve their own problems and maintain their own collections, or is it going to be yet another new medium for providing more advertising to enrich the greedy evil corporations?"

[Slide 25. Building the Interspace]

So, suppose this all works? Suppose it's ten years from now, and everyone has on their desks and in their homes, something that supports the Interspace technology. A box that comes with the software environment built in and a plug into the Interspace. Just like a set-top box comes now with Netscape and a cable modem. Well, what that means is everything in the world goes into this space: everybody can share to put things in, everybody can browse to get things out.

Then what really begins is building the Interspace. Creating all the individual community repositories. Connecting all these individual spaces together.

The most likely start will be in science and engineering, because those people are comparatively rich and they're the ones who have the high-end technology. Just like the ARPAnet begat the Internet and government-funded labs begat the Web, the same stages will happen with the Interspace. That's why I gave the examples from the high-end digital library research.

Next, what will happen is to start merging together individual community spaces such as those I've discussed in biology. For example, start with molecular biology (worms and flies to mice and men) and onto neurobiology (rats and cats to monkey and man). And onto other sciences and other subject domains.

[Slide 26. Building the WorldNet]

Well, I'm sorry that I'm not doing the Worm project anymore, so I can't say "today the Worm, tomorrow the World", which is how I used to end talks. But, what I have to say is that there *will* be a WorldNet, whether you like it or not. Every community, from really big ones to really small ones, will have a nice collection. It will be indexed. There will be ways of accessing it and correlating it. So what you'll begin to see is that there really will be the Interspace.

This will be where people live. You see things now that say that people live on the Net. Well, that's true of a few specialized people who are questionably human beings. Which I'm one of, sorry to say. But, it will be true for the average person. Just like television became ubiquitous -- the Net is the world of ten years from now. So, you have to get ready for it and you have to figure out what you can do to contribute to it -- to make it help people by letting them get the information they need to solve their problems and being able to organize their own collections, rather than hurting people in the ways that you can easily imagine.

The NII is often referred to as the best technique for selling advertising for 500 channels of mud wrestling. Maybe now with the Web, it's become the medium for selling advertising to access a million home pages of dogs barking. That is not, from a purely personal standpoint, the appropriate use for such a far-reaching new medium. The vision of the pioneers was always education not entertainment -- the Net should become the way that ordinary people can solve their problems. This new research technology might be the way towards that vision, towards the Interspace.

So, I hope very much that it works. Thank you.


Question and Answer Session afterwards

Q1.

You had a sort of line where you had the user, the librarian, the indexer, the publisher, the author all strung together, as part of the publishing cycle. And you've spoken a lot about getting everyone involved in doing reference and classification, and I'm all for this. I think that I'm ready to go out and build tools. But, I'm real concerned about the function of the publisher, which as you rightly pointed out, is quality control. And quality control is something that is lacking, not only on the Net but in, as far as I'm concerned, the publishing industry anymore. Boy, I know some very good authors that would be excellent authors if they had good editors. So, I'd like you to talk about what we might do about that problem?

A1.

I do have a few comments, but I should say you should move a little further away from the editor of JASIS, who's sitting right next to you. I deal with real publishers every day as head of the digital library project, and they're terribly worried about their function in life. It's clear to them, as you said rightly and as I tried to say, that what they do is quality control. Now, it may well be that the quality of the quality control increases as the number of other functions that the publisher has to do decreases. So, if all you have to do is quality control, and this actually happened in the book industry, because book companies used to do their own printing, they used to own trucks that would do the delivery and now they're basically just shells. And what that means is the only people they hire and their only differential advantage is that they have good quality control or perform good filtering. So, economic pressures and lack of other things to do may help to raise the level of quality.

On the other hand, what you say is exactly right. It will be that the overall quality will go down enormously, because the amount of stuff will increase, and because so many people that do it, basically don't care and are amateurs or even malicious amateurs. So, if you look, for example, at the average quality of information in bulletin boards versus the average quality in refereed journals, there's an enormous gap. Now, that doesn't mean that there's not useful information in the bulletin board, and it doesn't even mean there aren't times when you wouldn't prefer to search it, but what it does mean is there's nothing you can do about people's behavior. And that's going to stay the same. But, what you can do is build into the search system ways of taking advantage of that.

So, an important search criterion is "do you want rumors or facts". This came through in the Worm system all the time. Sometimes the users wanted to see what graduate students typed in in the middle of the night, if they're really really desperate, or really really interested. Sometimes they only wanted to see something that appeared in Nature. So the real answer is that people are just going to have to learn to deal with it [the situation of widely varying quality]. There's not a magic bullet for ensuring quality. This was deliberately a talk that said there were no magic bullets, you might've noticed, so I won't say there's one here. That is an important problem.

Q2.

I'm really encouraged by the notion that we have computational power, that we can handle these large co-occurrence matrices. I worked with Bill Maren??? on a project back in the sixties where they did a context analysis and the real problem that he experienced then is that so many of the cells were null and empty. That he had incomplete data. But, even though we have the supercomputers, our problems are also becoming super, in that we don't have simply 400,000 square matrices, but we have 4,000,000 and 40,000,000 and 400,000,000, and I even wonder whether the supercomputers can handle those. Have you had experience in computing these correlation matrices and if so how do they come out? What do the cells look like? Are we able to have the matrix inverses? And, computationally, can we get a handle on the semantic analysis for these huge huge datasets?

A2.

That again is an excellent question. Let me just repeat it in case you couldn't all hear it. Co-occurrence analysis is a very old topic, and one of the reasons that I gave this talk at ASIS was to hope that a lot of the old pioneers would come out of the woodwork and be encouraged to try all the old experiments again, when you can actually try them on real collections. To my knowledge there's only a few experiments that have actually tried them on large collections, and I probably know all of these because they had to be done at supercomputer centers. And, thus far there's no evidence, well, the matrices are still mostly empty. That still is true. But there's enough material in the matrices now, so that if you normalize them, you at least get some things out.

Some experiments have shown that you need maybe five to ten thousand documents, at least of abstract size or one to two pages are better, in order to get enough material so that the co-occurrence matrix flattens out and starts to become stable. So, what that probably shows is that all those experiments that were done with two to three hundred documents aren't valid, because the matrices were still changing and still mostly empty. We don't have a precise categorization yet.

We've been mostly trying to just show it works at all on the high end, but you should think carefully about the molecular biology experiment I showed. That was done on a ten thousand dollar machine, that was actually a Sun SPARC-2 workstation. You could today do that on a five thousand dollar P.C., and you'd just have to run it overnight. So, if five thousand documents is really the right size, that means that you can actually go home right now and try to do those experiments.

Nothing is really known about whether co-occurrence is better than the other fifty techniques that are around. Nothing is really known about exactly what the size of the repository has to be. Nothing is really known about how you should handle the frequencies so that the ordering is done well. Nothing is really known about how you should do the graph intersections. The illustrations I gave are doing a straight dumb computation all the way through. But the most simple thing, if you get enough computer power, *will* do something that's potentially interesting.

The details as to how it's going to work in practice are simply not known, and so I would strongly encourage all the people that really know something about co-occurrence analysis, statistical analysis, to buy workstations or do it on your P.C. and start running these experiments. Because, remember, there's this huge train of 100 million users or a billion users that is coming in ten years, and what's going to happen is the commercial people will just make up something if you don't find the good answers.

If you're going to do research on this topic, this is the perfect time. You can get money for it. The machines are fast enough. And there's a crying need for it. It's not the sixties anymore. But, the real answer to your question is, nobody knows any of the technical questions that you've asked. There's just been a few high-end experiments to try something.
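
[Editor's illustration, not part of the talk: a minimal sketch of the kind of small-scale co-occurrence experiment described in this answer. It counts how many documents contain each pair of terms, normalizes the counts, and reports how much of the matrix is filled. The function names, the whitespace tokenizer, the three sample documents, and the cosine-style normalization are all assumptions chosen for illustration; the talk does not say which normalization or term extraction was used.]

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence(docs):
    """Return (term_df, pair_df): document counts per term and per unordered term pair."""
    term_df = Counter()
    pair_df = Counter()
    for doc in docs:
        # Naive tokenizer (assumption): lowercase, split on whitespace, drop duplicates.
        terms = sorted(set(doc.lower().split()))
        term_df.update(terms)
        # Sorted input means every pair is stored as (a, b) with a < b.
        pair_df.update(combinations(terms, 2))
    return term_df, pair_df


def normalize(term_df, pair_df):
    """Cosine-style weight (assumption): count(a, b) / sqrt(df(a) * df(b))."""
    return {
        (a, b): n / math.sqrt(term_df[a] * term_df[b])
        for (a, b), n in pair_df.items()
    }


def fill_ratio(term_df, pair_df):
    """Fraction of possible term pairs that co-occur at least once."""
    v = len(term_df)
    possible = v * (v - 1) // 2
    return len(pair_df) / possible if possible else 0.0


# Hypothetical sample documents standing in for abstracts.
docs = [
    "gene expression in worm neurons",
    "worm gene mapping",
    "neurons and muscle in worm",
]

term_df, pair_df = cooccurrence(docs)
weights = normalize(term_df, pair_df)
print("vocabulary size:", len(term_df))
print("filled fraction:", round(fill_ratio(term_df, pair_df), 3))
print("gene/worm weight:", round(weights[("gene", "worm")], 3))
```

[With a realistic vocabulary the filled fraction is far smaller than in this toy sample, which is the "mostly empty" behavior discussed above. Rerunning the same script on growing subsets of a collection is one simple way to watch whether the weights settle down.]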

Q3.

Yes, Bruce, Jean Fisher from Lexis/Nexis. I'm in the trenches every day with corporate librarians who are very adept at handling many different information systems, internal systems, and Lexis/Nexis and Dialog and the Internet. And it seems to me that the weak link in the chain, right now, is the local area networks. Can you comment on what we might expect to see in the next five to ten years to replace that architecture and how these people will be operating?

A3.

You mean the speeds of the local area networks, is what you're referring to? [Yes.] Or lack of speeds? [Yes.] Well, actually Bob Lucky [previous plenary speaker] answered that question, fortunately, for me. There's two answers. One is the professional answer. So, like what's going to happen in corporations? And the other is, what's going to happen in homes? And they're different.

The professional answer is, the Ethernet technology is pretty aged and there's several new competing technologies that run 10 times faster, like a hundred megabits raw network rate, instead of ten, that are all competing. FDDI, Fast Ethernet, ATM, there's several of them. It's very likely that new buildings, and this certainly is true at allegedly progressive institutions like the University of Illinois, will be wired with the 100 megabit ones and so the local area networks will get faster.

Now, that still doesn't solve the problem of what the trunk line is -- what actually comes into the campus or the corporate building. And there you're stuck with whatever the commercial provider, like your Internet provider provides, and right now the most common thing is T-1 lines, which is 1 megabit per second [strictly, 1.544 megabits per second]. One megabit is a lot slower than even an Ethernet. Technology for that, though, is changing very quickly. You could get T-3 lines, 45 megabits, for a long time, they're just very expensive. And the standard technology is likely to jump to 600 megabits shortly.

The thing that's not clear in the corporate range is how soon, because it is very strongly dependent on what the economics are. It's not a technology question. And the economics is strongly dependent on how many people want to buy it, because it is a production question, and that's strongly dependent on how many people want to perform a certain function.

So, for example, if every large company desperately wanted to use the Interspace and you could only use the Interspace if you had a gigabit network, then there would be a gigabit network in a year, because there's huge commercial pressures. The reason it's hard to predict the curves, is that the demand keeps changing. The Web, for example, something like the Web commercially was completely unanticipated and nobody knows how to make money from it yet.

And this problem is even worse in the home case, where it's been true for a long time you can run fiber to the home and get ten megabits. You could run fiber to the home and get a gigabit, but the box right now is like Bob Lucky said, the box right now would cost $100,000, so no one would buy one. If one hundred million people wanted to buy one, the box would be like a set-top box [which can do Ethernet across cable TV lines]. It would cost $300 and there would be lots of them. So, the real answer is it has to do with the demand for services. Given that, I could make up a curve number, but I'll probably get it wrong so I'm hesitant to do that. The networks will get faster, but I don't know how much faster.
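
[Editor's illustration, not part of the talk: a small worked calculation comparing the raw link rates quoted in this answer. The 10 megabyte file size is an arbitrary assumption, the rates are the round figures as spoken, and the result is an idealized lower bound that ignores protocol overhead and congestion.]

```python
# Raw link rates in megabits per second, as quoted in the answer above.
LINK_RATES_MBPS = {
    "T-1 trunk (as quoted)": 1,
    "Ethernet": 10,
    "T-3 trunk": 45,
    "FDDI / Fast Ethernet": 100,
    "projected trunk": 600,
}


def transfer_seconds(size_megabytes, rate_mbps):
    """Ideal transfer time: bits to send divided by raw bits per second."""
    bits = size_megabytes * 8 * 1_000_000
    return bits / (rate_mbps * 1_000_000)


# Hypothetical 10 MB file (assumption), e.g. a batch of scanned journal pages.
for name, rate in LINK_RATES_MBPS.items():
    print(f"{name:>22}: {transfer_seconds(10, rate):7.2f} s")
```

[The point of the comparison is the one made in the answer: a fast local network does not help much if the trunk line into the building is the slowest link in the path.]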

Q4.

Kate McCain from Drexel University. Bruce, your vision of the world in the future is a little bit different from the one we heard from Cliff yesterday, in the presentation of the CNI White Paper. I'm curious about, perhaps either of you see dueling visions of the role of information in the media area as information professionals in the future, who Cliff said were going to be designing better, richer surrogates for what it is in those widely dispersed, perhaps very personal databases. Creating the metadata for the data that was stored. You seem to move all that down to the individual groups of people who are interested in blue-eyed hostile Siamese cats. What are people going to be doing in the future, information professionals who are skilled at surrogation design, knowledge representation?

A4.

I'll make a short answer to that and then I'll let Cliff make a short answer to that. My short answer is, they're going to be very busy because there will be a lot more demand for their services. In other words, the reason my talk was pitched the way it was is because I was concentrating on trying to handle a billion repositories for the average person as part of the Net coming to people's homes. So, in that case, most of the repositories or collections are actually handled by amateurs. If you look at the set of all collections in that view, the vast majority of them are the amateur ones. However, the number of professional ones is still enormously larger than at present, and those will continue to have rich needs for all the things that you said.

So, those views don't seem like they conflict to me. It's just a question of how you get fine-grained indexing for things that don't have sufficient commercial interest or sufficient audience, and that was the problem that I was addressing. So, this was sort of the Mosaic/Netscape version of how you would do search and analysis. Not the answer to what's the future of Dialog or Nexis, which would be a different answer. Cliff, did you want to say something?

[Cliff Lynch] Just in three seconds, I think that we're both on exactly the same wavelength. The point is there are never enough trained information professionals, and you could never afford enough in the world you portray here, it's just hopelessly out of hand. And, basically, on the high end sort of large community systems you could afford them. For everything else, you need to throw computing at it.

[back to Bruce Schatz]

Let me just say one more thing. Since I grew up in the telephone industry, I worked at Bell Labs and Bellcore for a long time. There was this old story about people predicting the demise of the telephone network. And the reason they predicted it, was that in the early nineteen hundreds, the way you made a telephone call is it went through the operator and the operator had a plug board and she transferred the call from one line to another. And there was a mechanical switching machine that tracked the plug board. So they did the computation of how many people there would be and how many telephone operators you would need to handle the volume of traffic, even at the present rate of usage. And the number of telephone operators was larger than the population.

So, it wasn't clear what would happen, but the problem washed out completely. What happened is that automatic switching machines were developed, which displaced most of the need for the human operators with plug boards. The machines weren't as good as the humans, of course, for example they needed exact numbers to connect rather than vague names, but they were much cheaper and could handle more traffic. Thus today, the rich places still have their telephone operators (often called receptionists or secretaries). But the great mass of people use purely automatic switching that does something that's lower grade, but is still acceptable. And then there's levels in-between that have differing amounts of computer and people support. So, in the telephone network, which was the last big mass medium for interactive communication, you saw exactly the same spreading as I am predicting for the Net.

It's just right now, we're at the state where there are only human switching machines, there only are big professional searchers like Dialog and Nexis and indexers like Inspec and MeSH. What I tried to indicate in my talk is that will no longer be true even in the near future, and so there'll be lots more low grade searchers and indexers that have to do something, which means that you need technology that will support these. The environments for the Interspace are just like automatic switching machines, only for correlating information across networks instead of transmitting voice. And since big vertically organized institutions like AT&T and Bell Labs don't exist anymore, that means that people like us have to develop this new technology or it just won't get done. Otherwise the rich people will have a good solution to the problem of information analysis, while the poor people, which is all of us, won't.

[Cliff Lynch]

I'm afraid we're going to need to bring this to an end because there's another session that's going to be in here in a moment. Bruce will be around for a while today I think. He'll be here all day. I'd just like to thank you, that was a wonderful and very provocative talk. Thank you.