SIMAP web-service

Accessing SIMAP using the web-service

If you want to create a web-service based client with a tool of your choice, you will find the WSDL [here]. Please note that the service is document/wrapped, and therefore some client systems such as SOAP::Lite will fail to use it. If you depend on programming in such an environment, you can either evaluate the SOAP messages 'by hand' or create a wrapper around the SimpAT package (see below). If you are developing in Java, you can simply integrate the SimpAT libraries. This web-service has been implemented within the HOBIT project (http://hobit.sf.net/).

Accessing SIMAP using the SimpAT Package

SimpAT (Simap Access Tools) allows easy access to the SIMAP database using SOAP-based web-services. The package is written in Java and allows easy integration of SIMAP queries into your own applications.

Requirements

The SimpAT package is written in the Java 1.5 programming language. To access SIMAP, you need access to the Internet. Since the amount of data transferred might be quite high, a broadband connection should be available.

Download

Download the package as a tarball from http://fileshare.csb.univie.ac.at/simpat/simpat1.3.2.tar.gz

Querying SIMAP

SIMAP is queried using the unique MD5 key of an amino-acid sequence. This sequence must be in upper case with no spaces or newlines in it. The SimpAT package can compute this MD5 for you:
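try {
    SimapAccessWebService simap=new SimapAccessWebService();
    String sequence="MSELKKNVTQDNLWQETSPKK";
    // MD5 key used to look up this sequence in SIMAP
    String md5=simap.computeMD5(sequence);
} catch (Exception e) {
    // the exception types thrown by SimpAT are not listed here, so catch broadly
    e.printStackTrace();
}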
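The requirement that the sequence be upper case and free of whitespace can be illustrated without SimpAT. The following is a minimal sketch using only the Java standard library; the class SequenceMd5 and its methods are hypothetical helpers, not part of the SimpAT package. It assumes that the key is the plain MD5 digest of the normalized sequence written as lowercase hexadecimal. Whether computeMD5 normalizes its input or uses exactly this encoding is not stated here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of SimpAT.
class SequenceMd5 {

    // Removes every whitespace character and converts the sequence to upper case.
    static String normalize(String rawSequence) {
        return rawSequence.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
    }

    // Returns the MD5 digest of the text as 32 lowercase hexadecimal characters.
    // Assumption: SIMAP keys use this plain hex encoding.
    static String md5Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.US_ASCII));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a mandatory algorithm on standard Java platforms
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // A sequence pasted with mixed case and line breaks
        String raw = "msel kknv\ntqdnl WQETSPKK";
        String cleaned = normalize(raw);
        System.out.println(cleaned);
        System.out.println(md5Hex(cleaned));
    }
}
```

Normalizing before hashing matters because two spellings of the same sequence that differ only in case or line breaks would otherwise produce different keys.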
Before retrieving a result list, you should set some cut-offs for the search. The more restrictive a cut-off is, the faster the query will be. There are three built-in cut-offs. They can be combined and are then additive, which means that all three are evaluated in each case. The cut-offs are a maximum E-value, a maximum number of hits, and a minimum Smith-Waterman score. They are set on the SimapAccessWebService object:

simap.setMax_evalue(10e-25);
simap.setMax_number_hits(5);
simap.setMin_swscore(120);

If you also want SIMAP to report sequences and alignments for a hit, you must activate this:

simap.alignments(true);
simap.sequences(true);

In its basic form, SIMAP reports XML output:

System.out.println(simap.getHitsXML());

Alternatively, SIMAP can report BLASTML style output:

System.out.println(simap.getHitsByMD5BLASTML());

Alternatively, SimpAT can parse the XML into convenient Java objects for further analysis. The hits are returned as an ArrayList of HitSet objects, called result in the code that follows. Each HitSet object contains all information on a hit. This information can be accessed by getter methods:
HitSet second=result.get(1);
// a hit consists of alignment data and hit data
System.out.println(second.getHitAlignment().getAlignment_hit());
System.out.println(second.getHitAlignment().getAlignment_markup());
System.out.println(second.getHitAlignment().getAlignment_query());
System.out.println("Bitscore\t"+second.getHitAlignment().getBits());
System.out.println("E-Value\t"+second.getHitAlignment().getEvalue());
System.out.println("Coverage in Hit\t"+second.getHitAlignment().getHit_coverage());
System.out.println("Coverage in Query\t"+second.getHitAlignment().getQuery_coverage());
System.out.println("Percentage matched residues\t"+second.getHitAlignment().getPositives()+" %");
System.out.println("Score ratios: Hit,Query\t"+second.getHitAlignment().getHit_ScoreRatio()+","+second.getHitAlignment().getQuery_ScoreRatio());
System.out.println("Identity\t"+second.getHitAlignment().getIdentity()+" %");
System.out.println("in "+second.getHitAlignment().getOverlap()+" aa overlap");
System.out.println("from: "+second.getHitAlignment().getQuery_start()+" to "+second.getHitAlignment().getQuery_stop()+ " in query sequence");
System.out.println("from: "+second.getHitAlignment().getHit_start()+" to "+second.getHitAlignment().getHit_stop()+ " in hit sequence");
// print out data concerning the hit-sequence
System.out.println("\nLength of the sequence:\t"+second.getHitData().getLength());
System.out.println("Selfscore\t"+second.getHitData().getSelfscore());
System.out.println("Number of Hits in SIMAP\t"+second.getHitData().getNumber_hits());
System.out.println("Sequence:\n"+second.getHitData().getSequence());
System.out.println("with checksum:\t"+second.getHitData().getMd5());
// print out data of the protein instances
System.out.println("Instances:");
for (HitProtein o : second.getHitData().getProteins()) {
    System.out.println(o.getTitle()+" Description "+o.getDatabase_description()+" Tax-Node"+o.getTax_node());
    System.out.println(o.getLinkoutUrl());
    System.out.println(o.getTaxonomy());
    System.out.println("----");
}
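The statement in this section that the three cut-offs are combined and all evaluated can be pictured with a small stand-alone sketch. The classes below are hypothetical stand-ins, not SimpAT classes. The sketch assumes that the hits arrive best first, that both thresholds are inclusive, and that the hit limit is applied after the two thresholds; none of this is specified for the real service.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of combined cut-offs; not part of SimpAT.
class CutoffSketch {

    // Minimal stand-in for a hit: an identifier, an E-value and a Smith-Waterman score.
    static final class Hit {
        final String id;
        final double evalue;
        final int swScore;

        Hit(String id, double evalue, int swScore) {
            this.id = id;
            this.evalue = evalue;
            this.swScore = swScore;
        }
    }

    // Keeps, in input order, the hits that satisfy both thresholds,
    // and stops as soon as maxHits hits have been collected.
    static List<Hit> applyCutoffs(List<Hit> hits, double maxEvalue, int minSwScore, int maxHits) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : hits) {
            if (kept.size() >= maxHits) {
                break;
            }
            if (hit.evalue <= maxEvalue && hit.swScore >= minSwScore) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("A", 1e-40, 300),
                new Hit("B", 1e-10, 200),   // fails the E-value cut-off
                new Hit("C", 1e-30, 100),   // fails the score cut-off
                new Hit("D", 1e-26, 150));
        // Same threshold values as in the setter calls of this section
        for (Hit hit : applyCutoffs(hits, 10e-25, 120, 5)) {
            System.out.println(hit.id);
        }
    }
}
```

A hit is reported only if it passes every active cut-off, which is why tightening any single one can only shrink the result list.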
Search-Spaces

One of the most important features of SIMAP is the virtual definition of search-spaces. A search-space comprises a sub-selection of different datasets, for example SWISSPROT+PDB or all complete PEDANT databases. The data retrieved by a client is then automatically restricted to the selected search-space. Most importantly, the E-values are recomputed for the size of the selected search-space, since they depend on the size of the database used. By default, the search-space is set to the whole of SIMAP. As soon as the user starts to define their own selections, the search-space is restricted to these selections. There are three different ways to manipulate the search-space:

1. Using database-ids

Databases can be exclusively added. For example, adding PDB will retrieve hits in PDB only.

int dbid=0;
// we get all available databases to look up the dataset-id we want
ArrayList

2. Using taxonomy-ids

Taxonomy-ids can be used to include only certain taxa, e.g. to search only in eukaryotes. Excluding taxa is also possible. Combining both ways allows queries such as "look for all human proteins in PDB homologous to my query sequence x from organism/dataset y". The query sequence need not be contained in the positive selection of subsets. If it is missing, no self-hit is reported (since it does not exist in the workspace), but the hits in the activated subset are still reported. The ids used here are the taxonomy-ids from the NCBI homepage (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy).

Code examples:

simap.addDatabase(dbid);
// 2759 is the NCBI taxonomy-id of Eukaryota
simap.includeTaxon(2759);
// 9606 is the NCBI taxonomy-id of Homo sapiens
simap.excludeTaxon(9606);
simap.getHitsXML();

3. Using resource types

Resource types describe different kinds of data origin. They are referred to by an integer value.
Currently the following types are provided:

GENRE_WZW=1;
PEDANT2_COMPLETE_GENOMES=2;
SPUTNIK_EST=3;
PEDANT3_COMPLETE_GENOMES=4;
UNIPROT=5;
MULTIFASTA=6;
EMBL_TAX=7;
PLANTSDB=8;
GENEBANK=9;
GENEBANK_TAX=10;
GENRE=11;
PEDANT2_INCOMPLETE_GENOMES=12;
PEDANT3_INCOMPLETE_GENOMES=13

Most often, you will not need to restrict resource types. However, an interesting use-case is the restriction to complete genomes, which can be done by adding the PEDANT complete genomes and thereby excluding the incomplete ones.

Code example:

simap.addSource(ResourceTypes.PEDANT2_COMPLETE_GENOMES);
simap.addSource(ResourceTypes.PEDANT3_COMPLETE_GENOMES);

In case of further questions, please contact sysadmin.csb@univie.ac.at
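
As an illustration of why E-values have to be recomputed when the search-space is restricted (see Search-Spaces above): in the usual statistics for local alignments, the E-value of a given score is proportional to the size of the database searched. The sketch below rescales an E-value linearly by the ratio of the two sizes. The class and method names are hypothetical, the sizes in the example are made up, and it is an assumption that SIMAP applies exactly this linear rule or measures size in the same unit.

```java
// Hypothetical illustration of E-value rescaling; not part of SimpAT.
class EvalueRescaling {

    // Rescales an E-value computed against a database of fullSize
    // to a restricted search-space of subsetSize (same unit for both sizes).
    static double rescale(double evalue, long fullSize, long subsetSize) {
        if (fullSize <= 0 || subsetSize < 0 || subsetSize > fullSize) {
            throw new IllegalArgumentException("sizes must satisfy 0 <= subsetSize <= fullSize, fullSize > 0");
        }
        return evalue * ((double) subsetSize / (double) fullSize);
    }

    public static void main(String[] args) {
        // Made-up sizes: the restricted search-space is one tenth of the whole database
        double whole = 1e-10;
        double restricted = rescale(whole, 1_000_000L, 100_000L);
        System.out.println("E-value in whole database:      " + whole);
        System.out.println("E-value in restricted selection: " + restricted);
    }
}
```

Under this assumption, the same alignment gets a smaller E-value in a smaller search-space, so a hit may pass an E-value cut-off in a restricted search that it fails against the whole database.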