Regarding SciDB xldb 2013: http://www.youtube.com/watch?v=SsF_Mke0Mlw&feature=youtube_gdata
Tutorial on SciDB and SciDB-R by Alex Poliakov and Paul Brown.
Some SciDB and SciDB-R commands:
-------------------------------------------------------------------------------------------------------------------
Example config.ini section for a cluster named x16:

[x16]
server-0=p4xen7.local.paradigm4.com,3                   #3 workers + coordinator = 4 instances on the coordinator host
server-1=p4xen8.local.paradigm4.com,4
server-2=p4xen9.local.paradigm4.com,4
server-3=p4xen10.local.paradigm4.com,4
db_user=x16_user                                        #Postgres system-catalog credentials
db_passwd=x16_password
install_root=/opt/scidb/13.6                            #binaries
pluginsdir=/opt/scidb/13.6/lib/scidb/plugins
logconf=/opt/scidb/13.6/share/scidb/log4cxx.properties
base-port=1239                                          #coordinator port
base-path=/home/scidb/scidbdata                         #see also data-dir-prefix
tmp-path=/datadisk1                                     #room for temp storage

## Thread settings: up to 4 concurrent queries, 4 threads per query
execution-threads=6                                     #MAX_NO_QUERIES + 2
result-prefetch-queue-size=4                            #threads per query
operator-threads=4                                      #threads per query
result-prefetch-threads=16                              #MAX_NO_QUERIES * threads per query

## Memory settings
mem-array-threshold=512                                 #temp query cache
smgr-cache-size=512                                     #persistent array cache
network-buffer=512                                      #scatter/gather buffer size
merge-sort-buffer=64                                    #used for sort, one per thread
-------------------------------------------------------------------------------------------------------------------
$ scidb.py stopall x16
$ scidb.py init_syscat x16                              #run as the postgres user
$ scidb.py initall x16                                  #x16 is the config name
$ scidb.py startall x16
$ cat ~/.config/scidb/iquery.conf                       #iquery config file
{ "format":"lcsv+" }
$ iquery -aq "list('instances')"
==================
Install shim on the coordinator machine:
– https://github.com/Paradigm4/shim/wiki/Installing-shim
• Access shim from a browser
• From R / RStudio:
> install.packages("scidb")
> library("scidb")
> scidbconnect()
> scidblist()
===================
$ scidb.py stopall demo_db
$ scidb.py startall demo_db
$ sudo rstudio-server restart
===================
Loading Data
• Overall process:
1. Visualize the desired array (or arrays)
2. Prepare the files
3. Load the data into SciDB (typically as a 1-dimensional array)
4. Handle load / data-quality errors (a loadcsv.py sketch with a shadow array follows below)
5. Rearrange the 1-D array into the desired array(s) (see the trades redimension_store() example further down)
• Several loading techniques:
– CSV load
– Binary load - faster
– Parallel load
– Opaque load - across instances
• The overall process holds for all of them.
=======================
$ head -n 5 laml_methyl_composite.csv
TCGA-AB-2802-03A-01D-0741-05,cg00000029,0.521668865344633,RBL2,16,53468112
TCGA-AB-2802-03A-01D-0741-05,cg00000108,NA,C3orf35,3,37459206
TCGA-AB-2802-03A-01D-0741-05,cg00000109,NA,FNDC3B,3,171916037
TCGA-AB-2802-03A-01D-0741-05,cg00000165,0.100722321673368,,1,91194674
TCGA-AB-2802-03A-01D-0741-05,cg00000236,0.837944995677383,VDAC3,8,42263294

$ iquery -aq "create array laml_methylation_flat
    <sample_id:string, probe_id:string, beta_value:double,
     gene_id:string, chromosome:string, genomic_coordinate:uint64>
    [row_num=0:*,1000000,0]"
$ loadcsv.py -i laml_methyl_composite.csv -a laml_methylation_flat     #parallel load
$ iquery -aq "count(laml_methylation_flat)"
i,count
0,94201938
======================
loadcsv.py: csv file -> splitcsv -> splits the csv file, passes the pieces to the instances, and they start loading while the data is still being transferred.
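Sketch of step 4 above (handling load / data-quality errors) with loadcsv.py, using the -w and -e flags from the --help listing below. The shadow-array name laml_methyl_shadow and the per-instance error allowance of 100 are made up for illustration:

$ loadcsv.py -i laml_methyl_composite.csv -a laml_methylation_flat -w laml_methyl_shadow -e 100
$ iquery -aq "scan(laml_methyl_shadow)"     #inspect the load errors recorded in the shadow array
======================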
$ loadcsv.py --help
-d DB_ADDRESS            SciDB Coordinator Hostname or IP Address (Default = "localhost")
-p DB_PORT               SciDB Coordinator Port (Default = 1239)
-r DB_ROOT               SciDB Installation Root Folder (Default = "/opt/scidb/13.6")
-i INPUT_FILE            CSV Input File (Default = stdin)
-n SKIP                  # Lines to Skip (Default = 0)
-t TYPE_PATTERN          N number, S string, s nullable-string, C char (e.g., "NNsCS")
-D DELIMITER             Delimiter (Default = ",")
-f STARTING_COORDINATE   Starting Coordinate (Default = 0)
-c CHUNK_SIZE            Chunk Size (Default = 500000)
-o OUTPUT_BASE           Output File Base Name (Default = INPUT_FILE or "stdin.csv")
-P SSH_PORT              SSH Port (Default = System Default)
-u SSH_USERNAME          SSH Username
-k SSH_KEYFILE           SSH Key/Identity File
-a LOAD_NAME             Load Array Name
-s LOAD_SCHEMA           Load Array Schema
-w SHADOW_NAME           Shadow Array Name
-e ERRORS_ALLOWED        # Load Errors Allowed per Instance (Default = 0)
-A TARGET_NAME           Target Array Name
-S TARGET_SCHEMA         Target Array Schema
======================
#Save to a single file on the coordinator (instance id -2)
$ iquery -aq "save(laml_methylation_flat, '/home/scidb/laml_methyl_flat.scidb', -2, 'lcsv+')"

#Save a piece on each instance in parallel (instance id -1); each piece goes into that instance's data dir
$ iquery -aq "save(laml_methylation_flat, 'laml_methyl_flat.scidb', -1, 'opaque')"

#Reload in parallel: useful as a poor man's elasticity, e.g. increase the number of instances
#from 8 to 16 and SciDB takes care of redistributing the data for you
#(http://youtu.be/SsF_Mke0Mlw?t=49m16s)
$ iquery -aq "load(laml_methylation_flat, 'symlink_to_laml_methyl_flat.scidb', -1, 'opaque')"
======================
Arrays Versus Tables
• Everything is an array:
– 1 or more dimensions
– 1 or more attributes in each cell
– chunked and distributed
– sparse or dense
• The operators redimension() and redimension_store() can turn attributes into dimensions and vice versa
• Chunk sizing is important
======================
SciDB can do a 3x3-window moving average.
It distinguishes null cells from empty cells.
======================
Materialized views are future work.
======================
Pick a chunk size that holds about 1 million non-empty cells per chunk (this matters when not all cells are full or the matrix is sparse).

$ iquery -aq "load_library('example_udos')"     #adds the operator 'uniq'

#index_lookup() converts a stock symbol (string) into an integer id
$ iquery -ocsv+ -aq "
  between(
    index_lookup(
      trades_flat,
      stock_symbol_index,
      trades_flat.stock_symbol,
      symbol_id
    ),
    0, 5
  )"

$ iquery -ocsv+ -aq "
  aggregate(
    redimension(
      substitute(
        index_lookup(
          trades_flat,
          stock_symbol_index,
          trades_flat.stock_symbol,
          symbol_id
        ),
        build(<val:int64>[x=0:0,1,0], -1)
      ),
      <count:uint64 null> [symbol_id=0:*,200,0, time=0:*,86400000,0],
      count(*) as count
    ),
    max(count), avg(count)
  )"

$ iquery -aq "
  create array trades
  <price:double, volume:uint64>
  [symbol_id=0:*,200,0, time=0:*,86400000,0, trade_no=0:499,500,0]"

$ iquery -anq "
  redimension_store(
    index_lookup(
      trades_flat,
      stock_symbol_index,
      trades_flat.stock_symbol,
      symbol_id
    ),
    trades
  )"

$ iquery -aq "list('operators')"
======================
In the between() operator, use null to mean an unbounded coordinate.
======================
regrid() coarsens an array by aggregating over fixed-size blocks of cells; it changes the grid (resolution), not just the chunking. See the sketch below.
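Sketch expanding on the two notes above; the between() call uses the trades array created earlier, and the coordinates and block sizes are made up for illustration:

#between(): select symbol_id 0 through 5 over all times and all trade numbers (null = unbounded)
$ iquery -aq "between(trades, 0, null, null, 5, null, null)"

#regrid(): aggregate over fixed-size blocks of cells; here a dense 4x4 array built on the fly
#is collapsed into 2x2 blocks, giving a 2x2 array of block averages
$ iquery -aq "regrid(build(<v:double>[i=0:3,4,0, j=0:3,4,0], i*4+j), 2, 2, avg(v))"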
======================
Window aggregate: per-symbol moving average price over the trailing 60000 ms (60 s)

window(
  aggregate(
    trades,
    avg(price) as price,
    symbol_id, time
  ),
  0, 0, 60000, 0,
  avg(price)
)
---------
Fixed-window aggregates
• repart() is an operator that changes chunk sizes and adds overlap
• window() inserts a repart() operation if needed
• Consider storing the array with overlap to speed things up: store(repart(...))
• Sometimes adjusting the repart() configuration speeds repart() up, e.g. setopt('repart-algorithm', 'sparse')
-----------
Variable-window aggregates
• The window expands or contracts so that each aggregate value is calculated over the same number of non-empty cells
• Applies only to one-dimensional windows
• Example: moving average price over the last 100 trades
• To speed this up, put the entire row in a single chunk
======================
AFL% list('arrays');
AFL% list('operators');
AFL% list('types');
AFL% list('functions');
AFL% list('aggregates');
AFL% list('instances');
AFL% list('queries');
======================
Combining arrays: join, merge, cross, cross_join (collapses non-matching dimensions).
http://www.paradigm4.com/2013/05/analyze-more-program-less-a-webinar-about-using-scidb-for-computational-finance/
SciDB uses ScaLAPACK for distributed parallel GEMM and GESVD:
www.netlib.org/scalapack/
www.netlib.org/scalapack/faq.html
======================
insert(a, b) is roughly store(merge(a, b), b).
If you have modified an array many times, SciDB keeps every previous version and queries can get laggy:
store it into a new array, remove the previous one, and rename the new array to the former name (see the sketch at the end of these notes).
======================
User-defined extensions
• Datatypes: point, rational (src/examples/point, rational)
• Functions: support for point, rational
• Aggregates: penmax
  – http://www.scidb.org/forum/viewtopic.php?f=18&t=1122
• Operators: example_udos
======================
SciDB-R under the hood

rewrite_r_expressions = function()
{
  options(scidb.debug=TRUE)
  x = as.scidb(iris)        #move the iris data frame from R into SciDB
  head(x)                   #R expression evaluated on a SciDB object

  y = scidb("laml_matrix")  #R object pointing to an existing SciDB array
  y[,3][]                   #R-style subset in SciDB; the trailing [] downloads the result into R

  z = cbind(rnorm(485578))  #ordinary R vector
  A = y %*% z               #SciDB matrix * vector
}
======================
SciDB-R (work in R or push computation into SciDB): http://illposed.net/ and https://github.com/Paradigm4
SciDB-Py: http://www.astro.washington.edu/users/vanderplas/
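======================
Minimal sketch of the compaction workflow noted above (store, then remove, then rename); the array name A is a placeholder:

$ iquery -anq "store(A, A_compacted)"       #materialize only the latest version into a fresh array
$ iquery -aq "remove(A)"                    #drop the old array together with all of its versions
$ iquery -aq "rename(A_compacted, A)"       #give the compacted copy the original name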