Tuesday, June 30, 2009

Some CML Fun

These are notes on some initial work to create a chemical properties pipeline. The end result will be calculated properties in an RDF triple store that can be exported using ORE/Atom. If I don't write this stuff down, I'll forget it.

Step 0: get the Jumbo CML converter tools from SourceForge. Peter Murray-Rust was kind enough to write a little README, now in the SVN. I did an SVN checkout rather than download the tagged Jumbo version 5.4 or 5.5.

svn co https://cml.svn.sourceforge.net/svnroot/cml/jumbo-converter/trunk jumbo-converter

Step 0.5: Compile as an executable jar. I used "mvn assembly:assembly". See http://www.springone2gx.com/blog/scott_leberknight/2008/06/creating_executable_jars_using_the_maven_assembly_plugin.html


Step 0.75: Run with sample data using the command (must set classpath):

java -jar target/jumbo-converters-jar-with-dependencies.jar -converter org.xmlcml.cml.converters.molecule.pubchem.PubchemXML2CMLConverter \
  -sd examples/input -odir ../output -is pubchem.xml -os pubchem.cml


Step 1: get XML descriptions of molecules from PubChem. I'm not a chemist, so I did a structure similarity search on caffeine, applied the Rule of Five, and got 1298 matches. That's pretty good.


Step 2: Unfortunately the PubChem download gives everything in one big XML file, but Jumbo's converter expects each molecule to be in a separate file. I googled around and made a little Perl script to do this:

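#!/usr/bin/perl
# Split a multi-record PubChem XML file into one file per <PC-Compound>.
# Usage: perl split.pl molecules.pubchem.xml

use strict;
use warnings;
use File::Basename;

my $file = $ARGV[0] or die "Usage: $0 file.pubchem.xml\n";

open(my $in, '<', $file) or die "Unable to open file: $file\n";

my $count         = 0;    # 0 means no output file is currently open
my $max_records   = 1;    # compound records per output file
my $files_counter = 0;    # index used in the output file names
my $out;

# Note: lines outside a <PC-Compound> record (such as the input file's
# own XML declaration or wrapper element) are copied through unchanged.
while (my $line = <$in>) {
    if ($count == 0) {
        # Start a new output file named <base>_part_<n>.pubchem.xml
        my @suffix = qw(.pubchem.xml);
        my ($basefilename, $path, $suffix) = fileparse($file, @suffix);
        my $filename = $basefilename . "_part_" . $files_counter . $suffix;
        open($out, '>', $filename) or die "Unable to open file: $filename\n";
        $count++;
        print $out "<?xml version=\"1.0\"?>\n";
    }

    if ($line =~ /<\/PC-Compound>/) {
        # Closing tag: the record is complete
        print $out $line;
        $count++;
    }
    elsif ($line =~ /<PC-Compound>/) {
        # Replace the bare opening tag with one that declares the namespaces
        print $out '<PC-Compound xmlns="http://www.ncbi.nlm.nih.gov"'
            . ' xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"'
            . ' xs:schemaLocation="http://www.ncbi.nlm.nih.gov'
            . ' ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd">'
            . "\n";
    }
    else {
        print $out $line;
    }

    if ($count == $max_records + 1) {
        # Enough records written: close this file and start a new one
        $count = 0;
        $files_counter++;
        close($out);
    }
}

close($in);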

I stole a lot of this from http://www.mysysad.com/2008/01/parse-xml-records-with-perl-script.html


Step 2.5: Rerun the converter command on the new files. Jumbo was able to handle most but barfed on a few, so I'll have to follow up on this.
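
One quick way to narrow down which split files might be causing trouble is to check that each one holds exactly one complete compound record. The sketch below is only a sanity check, not an XML validator, and it will not catch every reason the converter could fail. The function name check_parts is made up for illustration, and it assumes the files came out of the Perl script above, so each record starts with a PC-Compound tag that carries namespace attributes.

```shell
#!/bin/sh
# check_parts FILE...
# Print the name of every file that does not contain exactly one line with
# an opening PC-Compound tag and one line with a closing PC-Compound tag.
# (Hypothetical helper, not part of Jumbo.)
check_parts() {
    for f in "$@"; do
        # grep -c counts matching lines; it exits non-zero when the count
        # is 0, hence the "|| true".
        opens=$(grep -c '<PC-Compound[ >]' "$f" || true)
        closes=$(grep -c '</PC-Compound>' "$f" || true)
        if [ "$opens" -ne 1 ] || [ "$closes" -ne 1 ]; then
            echo "$f"
        fi
    done
}
```

Running check_parts *_part_*.pubchem.xml in the directory with the split files prints only the suspicious ones, which is a shorter list to compare against whatever the converter rejected.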

Step 3: Go to lunch. Next, I need PMR and Co. to decide on the best way to convert the CML into Gaussian input files.

Step 4: Realize you downloaded a bunch of 2D structures, so repeat steps 0-3 with 3D structures.
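
This mistake could be caught before converting anything by looking for z coordinates in the downloaded XML. The sketch below rests on an assumption that is not confirmed anywhere in these notes: that 3D PubChem records contain a PC-Conformer_z element and 2D records do not. Check that against one of your own downloads first. The function name list_2d_files is hypothetical.

```shell
#!/bin/sh
# list_2d_files FILE...
# Print every file that has no z coordinates.
# ASSUMPTION: 3D PubChem XML records contain a <PC-Conformer_z> element.
list_2d_files() {
    for f in "$@"; do
        # grep -q is silent and succeeds only if the pattern occurs.
        grep -q '<PC-Conformer_z>' "$f" || echo "$f"
    done
}
```

If list_2d_files *.pubchem.xml prints every file, the download was probably 2D.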

Step 5: Make Gaussian input files from CML with

java -jar target/jumbo-converters-jar-with-dependencies.jar -converter org.xmlcml.cml.converters.compchem.gaussian.input.CML2GaussianInputConverter -sd examples/output -odir ../gaussian -is pubchem.cml -os pubchem.gauss

The current SVN version of jumbo-converters won't compile unless you first check out and compile cmlxom and jumbo5 from SVN as well (they are peers in the directory structure).
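
The checkout and build order can be sketched as a small script. Treat it as a guess rather than a recipe. It assumes that cmlxom and jumbo5 live under the same SourceForge SVN root as jumbo-converter, each with a trunk directory, that cmlxom has to be built before jumbo5, and that "mvn install" is enough to make each project visible to the next one. Only the jumbo-converter URL and "mvn assembly:assembly" come from the steps above. The function just prints the commands so they can be reviewed before anything runs.

```shell
#!/bin/sh
# SVN root taken from the jumbo-converter checkout in Step 0.
SVN_ROOT=https://cml.svn.sourceforge.net/svnroot/cml

# print_build_plan
# Print (do not run) the checkout and build commands, dependencies first.
# ASSUMPTIONS: the cmlxom and jumbo5 paths and the build order are guesses.
print_build_plan() {
    for proj in cmlxom jumbo5 jumbo-converter; do
        echo "svn co $SVN_ROOT/$proj/trunk $proj"
        if [ "$proj" = jumbo-converter ]; then
            # Final project: build the executable jar as in Step 0.5.
            echo "(cd $proj && mvn assembly:assembly)"
        else
            # Dependencies: install into the local Maven repository.
            echo "(cd $proj && mvn install)"
        fi
    done
}
```

Once the printed commands look right for your setup, print_build_plan | sh runs them in order.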

Step 6: Run Gaussian on TeraGrid machines. See http://communitygrids.blogspot.com/2009/07/running-gaussian-on-big-red.html. Welcome to 2009!

Step 7: Get the data back and convert the Gaussian standard output to CML. I used the command

java -jar target/jumbo-converters-jar-with-dependencies.jar -converter org.xmlcml.cml.converters.compchem.gaussian.log.GaussianLog2CMLConverter -sd examples/gaussOut -odir ../gaussOut -is out -os cml

I put the Gaussian output files in the examples/gaussOut directory.

Step 8: Convert the output CML to RDF:

java -jar target/jumbo-converters-jar-with-dependencies.jar -converter org.xmlcml.cml.converters.compchem.gaussian.GaussianCML2OWLRDFConverter -sd examples/gaussOut -odir ../gaussOut -is cml -os rdf
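
Steps 5, 7 and 8 all call the same jar with the same five options, so a small shell wrapper saves some typing. This is only a sketch: the function name jumbo and the JAVA and JAR variables are invented here, and the options are just the ones that appear in the commands above.

```shell
#!/bin/sh
# Both variables can be overridden from the environment (hypothetical names).
JAR=${JAR:-target/jumbo-converters-jar-with-dependencies.jar}
JAVA=${JAVA:-java}

# jumbo CONVERTER_CLASS SRC_DIR OUT_DIR IN_SUFFIX OUT_SUFFIX
# Run one converter from the jumbo-converters jar.
jumbo() {
    if [ "$#" -ne 5 ]; then
        echo "usage: jumbo CONVERTER SRC_DIR OUT_DIR IN_SUFFIX OUT_SUFFIX" >&2
        return 1
    fi
    "$JAVA" -jar "$JAR" -converter "$1" -sd "$2" -odir "$3" -is "$4" -os "$5"
}

# Example, repeating steps 7 and 8:
#   PKG=org.xmlcml.cml.converters.compchem.gaussian
#   jumbo $PKG.log.GaussianLog2CMLConverter examples/gaussOut ../gaussOut out cml
#   jumbo $PKG.GaussianCML2OWLRDFConverter examples/gaussOut ../gaussOut cml rdf
```

Setting JAVA=echo before calling jumbo prints the arguments instead of running the converter, which is handy for checking them.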

Wednesday, June 24, 2009

Jobs at the Pervasive Technology Institute

We are looking to fill two positions. Please see the full ad below.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
The following positions, based at IUB and reporting to Marlon Pierce, Assistant Director, Community Grid Labs, are being posted externally. If interested, you must apply online at http://www.jobs.indiana.edu/. Search for job listing 401 and refer to position numbers #00034420 and #00034425. See also https://ola.indiana.edu/joblisting/index.cfm?jlnum=401&search=2

Principal Software Research Engineers - PAE4IT (Two positions available)

Description and Responsibilities:

Indiana University is looking for experienced individuals to fill multiple positions in Cloud computing, Grid computing, and message-oriented middleware development. The positions call for expertise in Web service software development, experience developing Java Messaging Service (JMS) applications and related Enterprise event bus technologies, experience with distributed systems, sensor webs, and/or audio-video systems research and advanced development.
Working independently, the incumbent provides expertise in the evaluation, development, implementation and deployment of specific Cloud Computing, Message-Oriented Middleware, and Grid-related technologies in conjunction with Pervasive Technology Institute and Research Technology staff and collaborators. Investigate research problems in distributed computing systems, develop software using best software engineering practices, publish peer-reviewed research articles in workshops and journals, evaluate existing technologies, supervise staff and graduate students, and oversee all development aspects of Cloud and Grid scientific data and application management. Work with scientific users to develop Cloud and Grid systems. Collaborate with key academic researchers and open source user community including faculty, scientific researchers, and grant stakeholders.
Indiana University's Pervasive Technology Institute (PTI) was recently awarded a 5-year extension to its core funding from the Lilly Endowment, following its highly successful initial 7 years (as the Pervasive Technologies Laboratories). PTI will be located in Indiana University's new state-of-the-art Research Incubator facility. See http://www.pervasive.iu.edu/ for more information. To date, the PTI laboratories have published more than 880 peer-reviewed papers, have won over $180,000,000 in grants over and above their core funding, and have released and maintain numerous open source software packages. PTI researchers work closely with Indiana University's Research Technology group, who maintain IU's world-class cyberinfrastructure, http://racinfo.indiana.edu/cyberinfrastructure/.
Qualifications:

Master's degree in Computer Science or related technical discipline (such as physics, informatics, engineering) is required. PhD preferred. PhD incumbents must have demonstrated (through peer-reviewed publications and software products) a primary research focus on one or more of the following: a) distributed, parallel, Grid and/or Cloud computing; b) high performance, distributed messaging systems; c) audio/video collaboration systems; and d) security in distributed systems. Incumbents with a Master's degree must have five to seven years of experience in software development and deployment in the appropriate field or fields.

Requirements include a proven ability to undertake and succeed in innovative research projects, and excellent communication skills (covering scientific article writing, presentation skills, and the ability to work with and lead teams). Experience developing software systems using standard software engineering practices (such as source code management software, build and test systems, etc). For Cloud and Grid incumbents, experience developing Web Services, REST services, scientific workflows, and distributed Web applications on Unix/Linux/Mac OS X and/or Windows. Experience developing applications on Grids using software such as Globus or Condor. Experience developing scientific and research Cloud applications using Amazon Web Services, Microsoft Azure, Google AppEngine, and related systems. Experience developing data-parallel scientific, machine learning, and/or information retrieval algorithms using systems such as Apache Hadoop and Microsoft Dryad. For audio/video stream incumbents, demonstrated knowledge and implementation of standards as well as innovative systems. For message-oriented middleware incumbents, demonstrated sophisticated experience with standard MOM software. MOM incumbents should have experience both with core MOM development (efficient routing, security, reliable delivery) as well as applications.
Ability to foster and maintain collaborations with internal and external academic researchers through a balanced, service-based approach required. Demonstrated ability to learn new technologies and tools. Demonstrated initiative in evaluating and adopting new technologies. Organizational and leadership skills required. Able to operate effectively in a complex and dynamic environment with the capability of functioning as a project leader. Excellent oral and written communication skills. Demonstrated conceptual, analytical, and logical abilities. Ability to interact successfully with staff and faculty in extended consultations; demonstrate enthusiasm for new technologies; and deliver excellent outcomes under high pressure and short deadlines. Excellent communication skills required to interact with people of varying levels of knowledge.

Indiana University is an Affirmative Action/Equal Employment institution.