Tuesday, June 30, 2009

Some CML Fun

These are some notes on some initial work to create a chemical properties pipeline. End result will be calculated properties in an RDF triple store that can be exported using ORE/Atom. If I don't write this stuff down, I'll forget it.

Step 0: get Jumbo CML converter tools from SourceForge. Peter Murray Rust was kind enough to write a little README, now in the SVN. I did an SVN checkout rather than download the tagged Jumbo versio 5.4 or 5.5.

svn co https://cml.svn.sourceforge.net/svnroot/cml/jumbo-converter/trunk jumbo-converter

Step 0.5: Compile as an executable jar. I used "mvn assembly:assembly". See http://www.springone2gx.com/blog/scott_leberknight/2008/06/creating_executable_jars_using_the_maven_assembly_plugin.html

Step 0.75: Run with sample data using the command (must set classpath):

java -jar target/jumbo-converters-jar-with-dependencies.jar -converter org.xmlcml.cml.converters.molecule.pubchem.PubchemXML2CMLConverter
-sd examples/input -odir ../output -is pubchem.xml -os pubchem.cml

Step 1: get XML descriptions of molecules from Pubchem. I'm not a chemist, so I did a structure similarity search on caffeine, applied the Rule of Five, and got 1298 matches. That's pretty good.

Step 2: Unfortunately the PubChem download gives everything in one big XML file, but Jumbo's converter expects each molecule to be in a separate file. I googled around and made a little Perl script to do this:


use File::Basename;

$file = @ARGV[0];

open(_FH, "< $file") or die "Unable to open file: $file\n";


if($count == 0)

my @suffix=qw(.pubchem.xml);
my ($basefilename, $path, $suffix)=fileparse($file,@suffix);
$filename=$basefilename . "_part_" . $files_counter . $suffix;
open(FH2, "> $filename") or die "Unable to open file: $filenam
print FH2 "\<\?xml version=\"1.0\"\?\>\n";

if (grep /<\/PC-Compound>/, $_ )
print FH2 $_;
elsif (grep /<PC-Compound>/, $_ )
print FH2 "<PC-Compound xmlns=\"http://www.ncbi.nlm.nih.gov\"
xmlns:xs=\"http://www.w3.org/2001/XMLSchema-instance\" xs:schemaLocation=\"http:
//www.ncbi.nlm.nih.gov ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem

else {
print FH2 $_;

if ($count == $max_records + 1)
$count = 0;

I stole a lot of this from http://www.mysysad.com/2008/01/parse-xml-records-with-perl-script.html

Step 2.5: Rerun the converter command on the new files. Jumbo was able to handle most but barfed on a few, so I'll have to follow up on this.

Step 3: Go to lunch. Next need PMR and Co. to decide on the best way to change the CML into Gaussian input files.

Step 4: Realize you downloaded a bunch of 2D structures, so repeat steps 0-3 with 3D structures.

Step 5: Make Gaussian input files from CML with

java -jar target/jumbo-converters-jar-with-dependencies.jar -converter org.xmlcml.cml.converters.compchem.gaussian.input.CML2GaussianInputConverter -sd examples/output -odir ../gaussian -is pubchem.cml -os pubchem.gauss

Current SVN of jumbo-converters won't compile without first checking out and compiling also cmlxom and jumbo5 from SVN (they are peers in the directory structure).

Step 6. Run Gaussian on TeraGrid machines. See http://communitygrids.blogspot.com/2009/07/running-gaussian-on-big-red.html. Welcome to 2009!

Step 7. Get the data back and convert the Gaussian standard output to CML. I used the command

java -jar target/jumbo-converters-jar-with-dependencies.jar -converter org.xmlcml.cml.converters.compchem.gaussian.log.GaussianLog2CMLConverter -sd examples/gaussOut -odir ../gaussOut -is out -os cml

I put the Gaussian output files in the examples/gaussOut directory.

Step 8. Convert the output CML to RDF:

java -jar target/jumbo-converters-jar-with-dependencies.jar -converter org.xmlcml.cml.converters.compchem.gaussian.GaussianCML2OWLRDFConverter -sd examples/gaussOut -odir ../gaussOut -is cml -os rdf


Egon Willighagen said...

Hi Marlon,

the CDK has an iterating PubChem XML reader, allowing you to create CML files directly from the PubChem Compounds XML file you have.

Additionally, it has a Gaussian Input File writer, so that you can skip the CML step.


Rajarshi said...

Push it through OpenBabel?

Geoff said...

I agree with both Egon and Rajarshi. Either use CDK or Open Babel. For babel:

babel -ipc file.xml -ocom -m -k "#n B3LYP/6-31G* Opt"
# Read in a PubChem input (-ipc), output Gaussian com input (-ocom) for each record (-m) with keywords...

But if you want to keep everything in Java, CDK is great.

Egon Willighagen said...

Forgot to mention that you can do the filtering on Rule-of-Five in the CDK too (or XlogP, or any other QSAR descriptor, graph property, etc :).

Marlon Pierce said...

Thanks for the tips. Part of my goal was to put CML/Jumbo through some paces.

petermr said...

Great to see this.

The popint is we are building a workflow based on Lensfield/JUMBO.

The JUMBO framework has some declarative and ontological support which is new. It allows, for example, automatic traversal of any file structure.

The Gaussian Input is ontologically and declaratively controlled allowing generic control through XML input. More on this later.

I will fix the pubchem converter so it splits output - this needs adjusting in the workflow.

petermr said...

The Gaussian input is actually three concatenated jobs including more than one basis set, use of checkpoint files, use of ExtraBasis functions, involvement of solvent and claculation of NMR shifts. It is assembled through an XML abstraction of the problem and transformed into Gaussian input automatically.