File system indexing, Graph Algorithms

File system indexing

After some time using a computer-based system to store information a productive user will amass a large collection of files and might then find that, even though they were careful to structure their files into a well-defined hierarchy, it becomes difficult to find an infrequently accessed file. The solution to this problem is to build an index and search that to find the location of the text within one of their files.


Given only a small number of files containing information such as computer programs, electronic mail messages or text, the owner of the files can find the information which they are looking for by simply looking through each of their files. When the number increases a little then they might use a search program to look through the files. On the UNIX operating system a suitable program is grep, which uses regular expressions to make the searching capability more powerful. A related program, agrep, performs approximate matching. When the number of files increases by a lot then this method of looking for information is no longer powerful enough and it is better to first create an index for the files and then to consult this index instead of searching the files themselves.

The Glimpse package for UNIX systems which allows an individual [or organisation] to index a collection of files and query the index afterwards. The time taken to create the index of the files in the first instance is recouped because the index is searched again and again. If the files which are stored change dramatically or increase then the index can be rebuilt. The following extract from the user manual of the Glimpse package explains how the system works.

To index all the files in a directory tree rooted at DIR, you simply say

  glimpseindex DIR 

(E.g., glimpseindex ~ indexes all your files.) Afterwards, glimpse can search through all these files much the same way as agrep (or any other grep), except that you don't have to specify file names and the search is fast. For example,

  glimpse -2 unbelievable 

will find all occurrences (in all your files!) of "unbelievable" allowing two spelling errors;

  glimpse -F mail arizona 

will find all occurrences of "arizona" in all files with "mail" somewhere in their name;

  glimpse 'Arizona desert;windsurfing'

will find all lines that contain both "Arizona desert" and "windsurfing".

Graph Algorithms

Graph Theory is a very fascinating branch of Mathematics in which we study objects called graphs. It turns out that hundreds of real-life problems can be modelled as problems about graphs and it is very important to find efficient solutions to these problems. A lot of research effort in Computer Science has been devoted to this cause and as a result, ``Graph Algorithms'' has become a very important and vast branch of Computer Science.


In very simple terms, a graph can be thought of as a collection of sites [called vertices], pairs of which may be connected by links [called edges]. There may be information associated with the vertices as well as with the edges. For example, the sites could be all the big cities in the UK and the edges might represent the existence of direct flights between pairs of cities. The information associated with an edge connecting cities A and B might be the amount of time taken to fly from A to B. Another example is one where the sites represent telephone exchanges and we put edges between exchanges that are connected by direct lines. The information associated with the edges in this case could be the volume of traffic [calls] that can be handled by the lines.

There are many interesting questions that we might ask at this point. For the graph of cities, we might want to know the ``minimum flight time'' required if we want to go from Edinburgh to some other big city. Note that in the absence of a direct flight we need to consider all possible ways of reaching the destination city which is what makes this problem nontrivial to solve.

For the graph of telephone exchanges, we might want to know whether two exchanges are connected [directly or indirectly via other exchanges], or we might want to know the answer to this very important question, namely, ``If the direct lines between exchanges A and B fail, what would be the best way to route the calls between them?.''

Here, we take a look at what is known as the `Single-source Shortest Paths Algorithm', to solve the following problem. Given the graph of cities as described above, find the minimum flight time required to get from Edinburgh to every other city. To keep things simple we assume that between any pair of cities, if a direct flight exists, then it exists in both directions and is of the same duration.

The algorithm was discovered by Edsger W. Dijkstra, a very famous computer scientist, who is currently a professor at the University of Texas, Austin. The picture of Dijkstra shown here is from 1955, around the time of his invention of the shortest path algorithm. Dijkstra's algorithm is a very good example of what is known as the `Greedy Method' in computer science, where at each step we try to be greedy and always choose the best available option. This may seem like an elementary and perhaps the obvious strategy, but it is not always the best strategy to use.

The first step to solving any problem on the computer is to figure out how the data should be represented. In this case, we represent the cities using the numbers 1 to N [where N is the number of cities], and the information about flight times is represented as an N×N array [matrix], say FT. The entry FT[i,j] tells us the flight time from City i to City j [say in terms of the number of minutes]. We set FT[i,i] = 0 for every city to indicate that the flight time from a city to itself is 0. If a direct flight does not exist between cities i and j then we set FT[i,j] to a very big number like 1200 [= 20 hours] to indicate the nonexistence. We let City 1 be the city from which we want to find the minimum flight times.

Dijkstra's algorithm proceeds in stages. By the end of the ith stage, it identifies all cities k such that the minimum flight time from City 1 to City k requires taking at most i flights altogether and also computes the minimum flight times to all cities from City 1 if at most i flights are taken. At the end of stage N-1, all the minimum flight times are known since we need to take at most N-1 flights to get to any city from City 1.

Here is an animation that shows Dijkstra's algorithm in operation. We want to find the minimum flight time from S to T. The animation shows the progress of the algorithm in stages as described above and displays the minimum flight times computed as it proceeds.


0 comments:

Bidvertiser

Designed by Posicionamiento Web | Bloggerized by GosuBlogger | Blue Business Blogger