In [1]:
import os
from glob import glob
import h5py
import numpy as np

Demo of database application

The database and cloud service component of Polybot provides interfaces for the transfer of data files to/from cloud applications (e.g., Globus, SQL database, Hadoop) across data facilities (e.g., the Materials Data Facility).
The Polybot database will contain materials data (structures, properties, processing conditions, metadata, and so on) collected from the robotics system. To demonstrate the database-related functionality, however, we will use examples from other projects.

Table of Contents

1. Local databases

  • Example: topological materials
  • TopoSwarm
  • Structure query in TopoSwarm
  • Compute Eigenstates

2. File operations

  • Converting a folder of POSCAR files into HDF5 files
  • Interacting with HDF5 files

3. Interaction with cloud data storage

  • Adding new metadata to existing HDF5 files in Globus
  • Adding new POSCAR files to existing HDF5 files stored in Globus
  • Transferring a file from Globus

4. Interaction with data facilities

  • Updating the Materials Data Facility with data on Globus
  • Adding new data to the Materials Data Facility
  • Interacting with data on the Materials Data Facility

1. Local databases

  • Example: topological materials
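The helper functions below are useful for interacting with the topological materials dataset. Each operates on a single structure: kwant_from_ase() takes an ASE Atoms object, and the remaining functions take the kwant system built from it.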
In [2]:
import kwant 
import ase 
from ase.io import read as make_ase_from_POSCAR 

def kwant_from_ase(aseobject,e,t,tnn=None):
    """ 
    Returns a kwant Builder object created from the atomic positions in the ASE Atoms object 
    """
    cell = aseobject.get_cell() 
    Lx = cell[0][0]
    Ly = cell[1][1]
    pos = aseobject.get_positions()
    pos[:,0] -= np.min(pos[:,0])
    pos[:,1] -= np.min(pos[:,1])
    syst = kwant.Builder(kwant.TranslationalSymmetry([Lx,0]))
    lat = kwant.lattice.general([[Lx,0],[0,Ly]],basis = np.array(pos)[:,:2])
    syst[lat.shape((lambda pos: -1<= pos[1] < Ly-1),(0,0))]=e 
    syst[lat.neighbors()] = t 
    if tnn: 
        syst[lat.neighbors(2)] = tnn 
    return syst 

def get_avg_num_neighbors(kwantSyst): 
    """ 
    Returns the average number of neighbors for the atoms in the kwant System 
    """
    count_neighbors = 0 
    nsites = len(list(kwantSyst.sites()))  # use the argument, not the global syst
    for site in kwantSyst.sites(): 
        count_neighbors += kwantSyst.degree(site)
    return count_neighbors/nsites

def get_undercoordinated_atoms(kwantSyst,bulkCoord=3): 
    """ 
    Returns a list of under co-ordinated atoms. 
    Under co-ordinated is defined as having neighbors less than bulkCoord argument
    """
    edge_atoms = []
    for site in kwantSyst.sites(): 
        if kwantSyst.degree(site) < bulkCoord: 
            edge_atoms.append(site)
    return edge_atoms

def num_undercoodinated_atoms(kwantSyst,bulkCoord=3): 
    """ 
    Returns the number of under co-ordinated atoms. 
    Under co-ordinated is defined as having neighbors less than bulkCoord argument
    """
    edge_atoms = get_undercoordinated_atoms(kwantSyst,bulkCoord=bulkCoord)
    return len(edge_atoms)
We now create an ASE Atoms object from one of the dataset's POSCAR files and then build a kwant system from it (kwant is the tight-binding code used to generate the dataset).
In [3]:
structure = make_ase_from_POSCAR('./POSCARS/POSCAR.986')
syst = kwant_from_ase(structure,0,-1)
kwant.plot(syst);
<Figure size 640x480 with 1 Axes>
In [4]:
avg_num_neighs = get_avg_num_neighbors(syst)
undercoordAtoms = get_undercoordinated_atoms(syst,bulkCoord=3)
num_edge = num_undercoodinated_atoms(syst,bulkCoord=3)
print('Average number of neighbors:',avg_num_neighs)
print('Number of under co-ordinated atoms:',num_edge)
Average number of neighbors: 2.8059701492537314
Number of under co-ordinated atoms: 52
In [5]:
print('List of 1st 5 under co-ordinated atoms positions:')
for atom in undercoordAtoms[:5]: 
    print(atom.pos)
List of 1st 5 under co-ordinated atoms positions:
[4.5 0.5773502691896258]
[0.16666666666666666 0.5773502691896258]
[0.5 0.5773502691896258]
[4.166666666666666 0.5773502691896258]
[2.5 0.5773502691896258]
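The coordination counts above come from kwant's hopping graph. As a minimal sketch of the same idea without kwant, the block below counts neighbors by distance on a toy structure. The function name coordination_numbers, the cutoff value, and the toy hexagon are assumptions made for illustration and are not part of the dataset tooling.

```python
from itertools import combinations
from math import cos, dist, pi, sin


def coordination_numbers(positions, cutoff):
    """Count, for each site, how many other sites lie within `cutoff`."""
    counts = [0] * len(positions)
    for (i, p), (j, q) in combinations(enumerate(positions), 2):
        if dist(p, q) <= cutoff:
            counts[i] += 1
            counts[j] += 1
    return counts


# Hypothetical toy structure: one hexagon with unit bond length,
# plus a single extra atom bonded to the first vertex.
hexagon = [(cos(k * pi / 3), sin(k * pi / 3)) for k in range(6)]
sites = hexagon + [(2.0, 0.0)]

counts = coordination_numbers(sites, cutoff=1.1)
average = sum(counts) / len(counts)
# Same criterion as get_undercoordinated_atoms() with bulkCoord=3.
undercoordinated = [i for i, c in enumerate(counts) if c < 3]

print("Average number of neighbors:", average)
print("Under-coordinated site indices:", undercoordinated)
```

Only the vertex carrying the extra atom reaches the bulk coordination of 3 here, so every other site is reported as under-coordinated.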
  • TopoSwarm:
Helper functions for TopoSwarm
In [6]:
from TopoQuest.StructGen import StructGen
import numpy.linalg as la
from ipywidgets import interact


def plot_wf(syst,i_start,i_end,ham):
    """Plot the wave function mapping on system with Hamiltonian 
    "ham" in a PyWidget starting from band index i_start and 
    ending at i_end"""
    eig_val,eig_vec = la.eigh(ham)
    def plot_band(i=0): 
        print("Plotting wave function with index",i)
        print("Energy of the corresponding mode",eig_val[i], "x t")
        #fig = kwant.plotter.map(syst,abs(eig_vec[:,i])**2,oversampling=50)
        #density = kwant.solvers.default.ldos(syst,energy=50)
        fig = kwant.plotter.density(syst,abs(eig_vec[:,i])**2,relwidth=0.09,
                                    cmap='seismic',background='black')#,oversampling=50)
        #fig.savefig('figures/%s.jpg'%i,dpi=400,quality=100,transparent=True)
    plot_band(25)
    #interact(plot_band,i=(i_start,i_end))

def _1D_to_finite(self): 
    pos_lattice = np.array(self._get_site_pos())
    syst = kwant.Builder()
    def check_sites(pos):
        x,y = pos  
        for test_site in pos_lattice: 
            diff_x = abs(test_site[0]-x)
            diff_y = abs(test_site[1]-y)
            if diff_x < 1.0e-3 and diff_y < 1.0e-3 :
                return True 
        return False
     
    syst[self.lat.shape(check_sites,(0,0))]=self.onsite
    syst[self.lat.neighbors()]=self.hop 
    return syst

def plot_band(syst,i_start,i_end,ham):
    """Plot the wave function mapping on system with Hamiltonian 
    "ham" in a PyWidget starting from band index i_start and 
    ending at i_end"""
    eig_val,eig_vec = la.eigh(ham)
    def plot_band(i=0): 
        print("Plotting wave function with index",i)
        print("Energy of the corresponding mode",eig_val[i], "x t")
        #fig = kwant.plotter.map(syst,abs(eig_vec[:,i])**2,oversampling=50)
        #density = kwant.solvers.default.ldos(syst,energy=50)
        fig = kwant.plotter.density(syst,abs(eig_vec[:,i])**2,relwidth=0.09,
                                    cmap='seismic',background='black')#,oversampling=50)
        #fig.savefig('figures/%s.jpg'%i,dpi=400,quality=100,transparent=True)
    
    interact(plot_band,i=(i_start,i_end))

gen = StructGen('Armchair',nlx=10,nly=10)
  • Structure query in TopoSwarm:
In [7]:
Structures = glob('./POSCARS/*')
gen.poscar2syst('./POSCARS/POSCAR.10983')
gen.syst = _1D_to_finite(gen)
gen.plot_syst()
  • Compute Eigenstates:
In [8]:
genFinalized = gen.syst.finalized() 
ham = genFinalized.hamiltonian_submatrix()
plot_wf(genFinalized,20,30,ham)
Plotting wave function with index 25
Energy of the corresponding mode -2.2432484156977495 x t
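To make the eigenstate step concrete, here is a minimal sketch that diagonalizes a small tight-binding Hamiltonian with NumPy and forms the probability density of one mode, which is the quantity plot_wf() maps onto the structure. The open six-site chain and the helper name chain_hamiltonian are illustrative assumptions; the notebook itself obtains the matrix from hamiltonian_submatrix().

```python
import numpy as np


def chain_hamiltonian(n_sites, onsite=0.0, hop=-1.0):
    """Dense Hamiltonian of an open nearest-neighbor chain."""
    off_diagonal = np.full(n_sites - 1, hop)
    return (np.diag(np.full(n_sites, onsite))
            + np.diag(off_diagonal, k=1)
            + np.diag(off_diagonal, k=-1))


ham = chain_hamiltonian(6)
# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eig_val, eig_vec = np.linalg.eigh(ham)

# Probability density of the lowest-energy mode on each site.
density = np.abs(eig_vec[:, 0]) ** 2

print("Lowest energy (in units of t):", eig_val[0])
print("Density sums to:", density.sum())
```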
In [10]:
gen.poscar2syst('./POSCARS/POSCAR.986')
gen.plot_syst()
In [11]:
import matplotlib.pyplot as plt

bands = kwant.physics.Bands(gen.syst.finalized())
momenta = np.linspace(-np.pi, np.pi, 101)
energies = [bands(k) for k in momenta]
plt.plot(momenta, np.array(energies)[:,40:51])
plt.show()
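Conceptually, a band structure for a system with one translational symmetry comes from diagonalizing a Bloch Hamiltonian at each momentum. The sketch below does this by hand for a single-orbital chain, where the result is known analytically. The function bloch_bands and the sign convention for k are assumptions for illustration and are not taken from kwant's source.

```python
import numpy as np


def bloch_bands(h_cell, h_hop, momenta):
    """Eigenvalues of H(k) = h_cell + h_hop*exp(ik) + h_hop^dagger*exp(-ik)."""
    energies = []
    for k in momenta:
        h_k = (h_cell
               + h_hop * np.exp(1j * k)
               + h_hop.conj().T * np.exp(-1j * k))
        energies.append(np.linalg.eigvalsh(h_k))
    return np.array(energies)


# One orbital per unit cell: onsite energy 0, hopping -1 (as in the notebook).
h_cell = np.array([[0.0]])
h_hop = np.array([[-1.0]])
momenta = np.linspace(-np.pi, np.pi, 101)
energies = bloch_bands(h_cell, h_hop, momenta)

# For this toy model the single band is E(k) = -2 cos(k).
print(energies[50, 0])  # energy at k = 0
```

For a ribbon with many sites per unit cell, h_cell and h_hop become matrices and each momentum yields one energy per site, which is why the notebook slices a window of bands before plotting.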

2. File operations

In [ ]:
import toposwarm_data_management as tdm
  • Converting a folder of POSCAR files into HDF5 files:
Prerequisites for running this code:
```
Install Globus Connect Personal, https://www.globus.org/globus-connect-personal
Install globus_sdk (pip install globus_sdk)
Install pandas (pip install pandas)
Install mdf_connect_client (pip install mdf_connect_client)
Install ase (pip install ase)
Install h5py (pip install h5py)
```
This function assumes you have a folder of POSCAR files. Currently the POSCAR files are labeled "POSCAR.(index)", where the index is an arbitrary integer. The function poscar_to_hdf5() converts these POSCAR files into HDF5 files.

Each HDF5 group corresponds to one POSCAR file and has the same name as that file (for example, the POSCAR file named "POSCAR.1" is saved in an HDF5 group named "POSCAR.1"). POSCAR files with the same number of atoms per unit cell are placed in the same HDF5 file; for example, the HDF5 file for structures with 50 atoms per unit cell is named "toposwarm_num_atoms_50.hdf5".

Associated metadata can be attached to these HDF5 groups. The metadata must be placed in a CSV file whose index matches the index of the corresponding POSCAR file. Currently the metadata for toposwarm is located in a CSV file named "databaseV2-final.csv", which is the default for poscar_to_hdf5().

If you want the function to delete the POSCAR files after conversion, set del_files=True when calling poscar_to_hdf5().
In [ ]:
data_directory="D:/test"  #directory where POSCAR files are stored
dest_directory="test_hdf5"
met="C:/Users/danpa/Documents/argonne/MDF_interface/src/databaseV2-final.csv"
tdm.poscar_to_hdf5(data_directory,dest_directory,met)   

#HDF5 files are by default stored in a folder named topo_hdf5 in the same directory as the folder for the POSCAR files
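The naming convention described above can be sketched in a few lines. The block below reads the atom count from the header of a POSCAR file and derives the HDF5 file and group it would land in. The helpers count_atoms and hdf5_target and the toy POSCAR text are hypothetical; this is not the implementation of poscar_to_hdf5(), which is not shown in this notebook.

```python
def count_atoms(poscar_text):
    """Return the number of atoms declared in the header of a POSCAR file."""
    lines = poscar_text.splitlines()
    tokens = lines[5].split()
    # VASP 5 files list element symbols on line 6 and the counts on line 7;
    # VASP 4 files put the counts directly on line 6.
    if not all(token.isdigit() for token in tokens):
        tokens = lines[6].split()
    return sum(int(token) for token in tokens)


def hdf5_target(poscar_name, poscar_text):
    """Return (HDF5 file name, group name) for one POSCAR file."""
    n_atoms = count_atoms(poscar_text)
    return f"toposwarm_num_atoms_{n_atoms}.hdf5", poscar_name


# Hypothetical four-atom cell in VASP 5 format.
example_poscar = """toy cell
1.0
2.46 0.00 0.00
0.00 4.26 0.00
0.00 0.00 10.0
C
4
Cartesian
0.00 0.00 0.0
1.23 0.71 0.0
1.23 2.13 0.0
0.00 2.84 0.0
"""

target = hdf5_target("POSCAR.1", example_poscar)
print(target)
```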
  • Interacting with HDF5 files:
In [ ]:
os.chdir("D:/test_hdf5")
hdf5_filename="toposwarm_num_atoms_50.hdf5"
f=h5py.File(hdf5_filename, "r+") #loading an HDF5 file in read/write mode
print(hdf5_filename, " groups")
print(f.keys()) #prints a list of the groups (POSCAR files in the case of the toposwarm data) stored in the HDF5 files
poscar_name='POSCAR.1'
print(poscar_name," metadata")
print(f[poscar_name].attrs.keys()) #prints a list of the attribute keys (metadata) attributed to the given group
print(poscar_name," data")
print(f[poscar_name].keys()) #prints the keys of the datasets (broken down POSCAR objects)
poscar_object=tdm.load_poscar_hdf5(f,poscar_name) #loads datasets from a given group and creates an ase.atoms.Atoms object
print(poscar_object.__dict__)
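If the toposwarm HDF5 files are not at hand, the same group, attribute, and dataset layout can be explored on a throwaway file. The sketch below builds one in memory with h5py. The dataset names ("positions", "numbers") and the attribute key "num_atoms" are assumptions for illustration; the actual layout written by poscar_to_hdf5() may differ.

```python
import h5py
import numpy as np

# driver="core" with backing_store=False keeps the file in memory only.
with h5py.File("demo.hdf5", "w", driver="core", backing_store=False) as f:
    group = f.create_group("POSCAR.1")       # one group per POSCAR file
    group.attrs["num_atoms"] = 2             # hypothetical metadata key
    group.create_dataset(
        "positions", data=np.array([[0.0, 0.0, 0.0], [1.42, 0.0, 0.0]])
    )
    group.create_dataset("numbers", data=np.array([6, 6]))

    group_names = list(f.keys())
    attr_keys = list(f["POSCAR.1"].attrs.keys())
    dataset_keys = sorted(f["POSCAR.1"].keys())
    positions = f["POSCAR.1/positions"][()]  # read the dataset into a NumPy array

print(group_names, attr_keys, dataset_keys)
print(positions)
```

Reading the arrays before the file is closed matters: once the with block exits, dataset handles are no longer valid, while the NumPy copies remain usable.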

3. Interaction with cloud data storage

  • Adding new metadata to existing HDF5 files in Globus:
The function add_metadata_globus_folder() adds new metadata from a CSV file to all the HDF5 files stored in a given folder on a Globus endpoint. The function add_metadata_globus_file() works similarly but adds metadata to only one file stored in Globus, and it is less robust. To authorize the transfer, a link will appear; click it, then copy the authorization code Globus gives you into the box below.
In [ ]:
globus_endpoint_id="94b4ce5e-4a12-462c-a6b6-3fab8e3eecd7" #data platform sandbox endpoint ID
globus_path="/toposwarm_hdf5/" #path where HDF5 files are stored
metadata_csv="databaseV2-final.csv"
tdm.add_metadata_globus_folder(globus_endpoint_id,globus_path,metadata_csv)

#alternative: add metadata to a single HDF5 file
import pandas as pd

tc=tdm.get_transfer_client()
metadata_dataframe=pd.read_csv("databaseV2-final.csv")
globus_endpoint_id="94b4ce5e-4a12-462c-a6b6-3fab8e3eecd7" #data platform sandbox endpoint ID
user_endpoint_id=tdm.get_local_id() #gets local endpoint for your computer
globus_path="/toposwarm_hdf5/" #path where HDF5 files are stored
user_path="/~/toposwarm_num_atoms_50.hdf5"  #must have /~/
tdm.add_metadata_globus_file(tc, metadata_dataframe, globus_endpoint_id, user_endpoint_id, globus_path, user_path)
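Attaching metadata depends on matching each CSV row to an HDF5 group through the POSCAR index. A minimal sketch of that matching step, using only the standard library, is shown below. The column names ("index", "num_atoms", "band_gap") and the helper metadata_by_group are hypothetical and do not reflect the actual columns of databaseV2-final.csv.

```python
import csv
import io


def metadata_by_group(csv_text, index_column="index"):
    """Map group names such as 'POSCAR.1' to their row of metadata."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        index = row.pop(index_column)
        lookup[f"POSCAR.{index}"] = row
    return lookup


# Hypothetical metadata table with two entries.
csv_text = "index,num_atoms,band_gap\n1,50,0.12\n2,14,0.00\n"
lookup = metadata_by_group(csv_text)

for group_name, attributes in lookup.items():
    # With an open h5py file f, these pairs could be written with
    # f[group_name].attrs.update(attributes).
    print(group_name, attributes)
```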
  • Adding new POSCAR files to existing HDF5 files stored in Globus:
The function merge_hdf5() will merge local HDF5 files containing POSCAR files with unique indices to the current HDF5 files stored in Globus.
In [ ]:
globus_endpoint_id="94b4ce5e-4a12-462c-a6b6-3fab8e3eecd7"
globus_folder_path="/toposwarm_hdf5/"
local_HDF5folder_path="D:/toposwarm_hdf5/"
poscar_directory="D:/toposwarm-data"
metadata_csv="databaseV2-final.csv"

tdm.add_new_poscars(globus_endpoint_id,globus_folder_path,local_HDF5folder_path,poscar_directory,metadata_csv)
  • Transferring a file from Globus:
The function transfer_data_globus() can be used to transfer data between two endpoints, for example from the data platform sandbox endpoint to your local computer. To authorize the transfer, a link will appear; click it, then copy the authorization code Globus gives you into the box below.
In [ ]:
#transferring an HDF5 file from Globus to personal endpoint
tc=tdm.get_transfer_client()
source_endpoint_id="94b4ce5e-4a12-462c-a6b6-3fab8e3eecd7" #data platform sandbox endpoint ID
dest_endpoint_id=tdm.get_local_id() #gets local endpoint for your computer
source_path="/toposwarm_hdf5/toposwarm_num_atoms_50.hdf5" #path where HDF5 files are stored
dest_path="/~/toposwarm_num_atoms_50.hdf5"
tdm.transfer_data_globus(tc, source_endpoint_id, dest_endpoint_id, source_path, dest_path)

4. Interaction with data facilities

In [ ]:
from mdf_forge import Forge
import MDF_interface as mdf
  • Updating the Materials Data Facility with data on Globus:
In [ ]:
globus_url="https://app.globus.org/file-manager?origin_id=94b4ce5e-4a12-462c-a6b6-3fab8e3eecd7&origin_path=%2Ftoposwarm_hdf5%2F"
#url of folder where HDF5 files are stored
tdm.update_mdf(globus_url)
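The Globus file-manager URL used above encodes both the endpoint ID and the folder path as query parameters. The sketch below recovers them with the standard library, which can help check that a URL points at the intended folder. The helper name split_globus_url is an assumption for illustration; how update_mdf() parses the URL internally is not shown here.

```python
from urllib.parse import parse_qs, urlparse


def split_globus_url(url):
    """Return (endpoint ID, folder path) from a Globus file-manager URL."""
    query = parse_qs(urlparse(url).query)
    # parse_qs percent-decodes values, so %2F becomes "/".
    return query["origin_id"][0], query["origin_path"][0]


globus_url = ("https://app.globus.org/file-manager"
              "?origin_id=94b4ce5e-4a12-462c-a6b6-3fab8e3eecd7"
              "&origin_path=%2Ftoposwarm_hdf5%2F")
endpoint_id, folder_path = split_globus_url(globus_url)

print(endpoint_id)
print(folder_path)
```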
  • Adding new data to the Materials Data Facility:
In [ ]:
mdf.push(globus_url,mdf_name,globus_endpoint_id,globus_folder_path,local_folder_path)
  • Interacting with data on the Materials Data Facility:
Ideally we would be able to list the groups and their metadata that are in each HDF5 file on the MDF without having to download the whole HDF5 file. Unfortunately, this capability is not yet available on the MDF, but will hopefully be coming in the future.
In [ ]:
#looking at dataset contents from MDF
ls=mdf.mdf_ls("manna_nanoclusters_dft_fitting")
print("list of first 5 hdf5 files in dataset")
print(ls[0:5]) 

#loading OUTCAR or CONTCAR directly from MDF into Python
#nanocluster_dft_hdf5 is the MDF publication name
atomobj=mdf.pull_vasp_file("nanocluster_dft_hdf5/nanocluster_Ag.hdf5/ClusterSize_2/VaspRun5_Hash_-1796268364/OUTCAR")
print("atom object attributes of selected OUTCAR")
print(atomobj)

 

#load whole HDF5 file into python, list the groups in the HDF5 file, listing metadata attached to certain groups
hdf5_file=mdf.pull_HDF5("nanocluster_dft_hdf5/nanocluster_Ag.hdf5")
print("list of cluster groups in nanocluster_Ag.hdf5")
print(hdf5_file.keys())

print("list of vasp run groups in cluster group 2")
print(hdf5_file["ClusterSize_2"].keys())

print("metadata of vasp run 5 in cluster with size 2")
vasprun_group=hdf5_file["ClusterSize_2"]["VaspRun5_Hash_-1796268364"]
print(vasprun_group.attrs.keys())

#searching metadata of an HDF5 file to load specific vasp files
hdf5_file=mdf.pull_HDF5("toposwarm_hdf5/toposwarm_num_atoms_14.hdf5")
q={" # atoms": 14}
atom_obj_dict=mdf.pull_query(hdf5_file,q)
print(atom_obj_dict)
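A metadata query of this kind amounts to keeping the groups whose attributes match every key in the query dictionary. The sketch below shows that filtering on plain dictionaries. The function query_groups, the attribute keys, and the sample values are hypothetical; this is one plausible reading of what pull_query() does and not its actual implementation.

```python
def query_groups(groups, query):
    """Keep the groups whose attributes equal every key/value in `query`."""
    return {
        name: attrs
        for name, attrs in groups.items()
        if all(attrs.get(key) == value for key, value in query.items())
    }


# Hypothetical attributes, standing in for the .attrs of each HDF5 group.
groups = {
    "POSCAR.3": {"num_atoms": 14, "is_topological": True},
    "POSCAR.7": {"num_atoms": 14, "is_topological": False},
    "POSCAR.9": {"num_atoms": 14, "is_topological": True},
}

matches = query_groups(groups, {"num_atoms": 14, "is_topological": True})
print(sorted(matches))
```

Keys must match exactly, including any leading spaces or symbols inherited from the CSV header.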