Teruhisa Miura
The huge amount of data is being produced from various areas.
In order to extract useful knowledge from large data, there exist
theoretical problems to be solved.
We focus on solving these problems under limited resources such as time
and memory.
Our approach is to apply search methods to extract useful knowledge from
huge database. We aims to propose a new search method, which is
appropriate to search in huge state-space.
The target domains are genome sequence data.
Research results obtained so far are as follows:
- Supporting information gathering from WWW:
We have developed MetaCommander and MetaViewer which
are tools to support users in the genome research community to gather
information from WWW [1]. It facilitates user's tasks of knowledge
discovery from WWW by lightening their burdens of information gathering.
- Search algorithms for the multiple sequence alignment problem:
The multiple sequence alignment problem is of extracting the common
features among several sequences.
The search space of this problem is huge because of its large
branching factor.
There is little or no hope of it solving more than seven sequences by
A* because of its large memory requirement.
We proposed a new mechanism called Stochastic Node Caching
(SNC) that can effectively reduce the number of revisits, by using a
limited amount of memory to store information needed to avoid
revisits [2].
Figure 13:
Genome Information
|
- Genome sequence walking:
The sequence walking is a technique of inferring the complete sequence
of gene from given fragments of the sequence and set of fragments in
database. The techniques include searching through the huge database
of gene fragments and assembling fragments which belong to the same gene.
This work is very laborious for researchers when performed manually
and repeatedly.
We implemented the sequence walking system by applying our text matching
algorithm.
This system is available through the WWW.
Figure 14:
Genome sequence walking system
|
This project aims to propose a new search method for huge database.
- Application of search methods to huge database:
Conventional search methods do not take costs for generation of next
states into consideration, because state-space can be stored on main
memory in conventional search problems and costs of node generation is
low.
On the other hand, for search problems which it's state-space is located
on database, we have to do file access and compute for each entry to
generate next state. This causes the cost of generate of next states is
very high.
We think there exist a lot of additional theoretical problems about
search for huge database.
We aims to know what kind of properties of search methods we need when
we apply search methods to database.
We apply a lot of search methods to the sequence walking problem,
compare them and aims to propose a new algorithm for huge database.
- Study of search methods under limited resources:
Study of search methods under limited resources such as memory and
computation time is very important for search in huge database.
We have to face additional limited resources of access time for data and
computation time of generation of next state.
Basing results of our research for search algorithms under limited
resources, we aims to propose a new search algorithms which can
effectively search in huge database under these limited resources.
So far, there has been no study about search methods, which can be
applicable to huge state-space.
Conventional search is for search problems that its state-space can be
stored on main memory and its next state can be easily generated.
The Genome sequence walking can be regarded as a search problem whose
state-space is composed of huge data on database.
For generate of next state, we have to compare the query fragment with
every fragment in database to find which fragment overlaps with the
query fragment.
This makes the cost of generation of next states very high. In this
situation, we need new measure to evaluate effectiveness of search
methods. Study on search methods for database become important, because
needs to extract useful information from huge data is increasing more
and more.
- 1.
- Yasuhiko Kitamura, Hideyuki Nakanishi, Tetsuya Nozaki, Teruhisa Miura, and
Toru Ishida,
``MetaViewer and MetaCommander: Applying WWW Tools to Genome Informatics,''
Genome Informatics 1996,
Universal Academy Press, pp. 137 - 146, 1996.
- 2.
- Teruhisa Miura and Toru Ishida,
``Stochastic Node Caching for Memory-bounded Search,''
National Conference on Artificial Intelligence (AAAI-98),
pp. 450-456, 1998.