
Group Meetings

stream data mining
semi-supervised: labels on features (labeled features), labels on clusters (co-clustering).
comparative document summarization: take all documents from UF and FIU and tell where they match and where they differ
unstructured events to structured storyline telling, as a directed graph of events.
sentiment expression is domain specific
cost, correlation, stability

UDA (user-defined aggregator): it is a computation platform under MySQL, then under it is the GIST for graph processing
in-memory graph in database: how did it create the graph? in Yan's work a graph is stored as rows in a table.
"loopy" belief propagation
out of DB: GraphLab, Mahout, MLbase
in DB: SciDB, MADlib, GIST

collaborative filtering:
cluster users that have similar ratings, to predict their next ratings
OR cluster movies that have received similar ratings from the same users and predict what rating they will get next
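
A minimal sketch of the user-based idea above, on a hypothetical toy rating table. It swaps explicit clustering for similarity-weighted averaging (a nearest-neighbour style shortcut), so treat it as an illustration of "similar users predict the next rating", not as the method discussed in the meeting. All names and ratings are made up.

```python
from math import sqrt

# Hypothetical toy data: user -> {movie: rating on a 1-5 scale}
ratings = {
    "ann": {"alien": 5, "brazil": 4, "casablanca": 1},
    "bob": {"alien": 4, "brazil": 5, "casablanca": 2, "dune": 5},
    "cat": {"alien": 1, "brazil": 2, "casablanca": 5, "dune": 1},
}

def cosine(u, v):
    """Cosine similarity computed over the movies both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[movie]
        den += sim
    return num / den if den else None

# ann has not rated "dune"; bob (similar taste) liked it, cat (different taste) did not
prediction = predict("ann", "dune")
```

Because ann's ratings resemble bob's more than cat's, the prediction is pulled toward bob's rating. The item-based variant swaps the roles of users and movies.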

Market basket analysis (integrate with bluetooth navigation):
 shopping cart, what they buy together. what movies people love/hate at the same time: positive correlation
- co-occurrence
- association rule    x -> y    confidence c:    P(Y|X) = c
- sequential pattern

pen -> milk
support = count(milk ^ pen) / count(all baskets)    75%
confidence = count(milk ^ pen) / count(pen)    75%
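
A small sketch of the support and confidence computation for pen -> milk. The four baskets are hypothetical, chosen only so that both numbers come out at 75% as in the note (pen is in every basket, milk in three of four).

```python
# Hypothetical baskets, picked to reproduce the 75% / 75% figures
baskets = [
    {"pen", "milk", "ink"},
    {"pen", "milk"},
    {"pen", "milk", "juice"},
    {"pen", "ink"},
]

def support(itemset):
    """Fraction of all baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(lhs, rhs):
    """Estimate of P(rhs | lhs): support of both sides over support of lhs."""
    return support(lhs | rhs) / support(lhs)

s = support({"pen", "milk"})       # 3 of 4 baskets
c = confidence({"pen"}, {"milk"})  # 3 of the 4 pen baskets also hold milk
```

Support and confidence only coincide here because pen appears in every basket; in general confidence is at least as large as support.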
Yang Chen presentation
OpenIE and NELL extract KBs automatically
Freebase is built manually
Sherlock KB: ReVerb has 400k facts extracted from the web; Sherlock uses ReVerb and extracts rules
GraphLab for inference

extract constraints from dataset (e.g. a person is born in one city)   ask Yang for the dataset

6 rounds of reapplying rules to keep generating more facts
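
A rough sketch of what "reapplying rules in rounds" can look like, assuming a single hypothetical Horn rule (transitivity of `located_in`) and deterministic facts. The real system works with probabilistic facts and many extracted rules; the relation name, the seed facts, and the function are illustrative only.

```python
def forward_chain(facts, max_rounds=6):
    """Reapply one Horn rule until no new facts appear or the round cap is hit.

    Rule (hypothetical): located_in(a, b) AND located_in(b, c) -> located_in(a, c)
    """
    known = set(facts)
    rounds_used = 0
    for _ in range(max_rounds):
        derived = {
            ("located_in", a, c)
            for (r1, a, b) in known
            for (r2, b2, c) in known
            if r1 == r2 == "located_in" and b == b2
        }
        new_facts = derived - known
        if not new_facts:
            break  # fixpoint reached before the cap
        known |= new_facts
        rounds_used += 1
    return known, rounds_used

seed = {
    ("located_in", "gainesville", "florida"),
    ("located_in", "florida", "usa"),
    ("located_in", "usa", "north_america"),
}
kb, rounds = forward_chain(seed)
```

Each round joins the fact set with itself, which is why the same step maps naturally onto a self-join over a facts table in a database.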

Improving the Accuracy and Scalability of Discriminative Learning Methods for Markov Logic Networks

association rule mining to determine probabilities of rules and facts      ---- or inductive reasoning

automatically extract rules from data using 'relational path finding'

statistical rule cleaning

extract constraints from dataset


Knowledge base future work:
KB integration
KB evaluation
Minimal KB
Queries over KBs : what we should compute offline or online to answer queries.
triple exists -> subgraph exists    (how about neo4j)? graph search
can we boil down subgraph queries to smaller, more manageable ones instead of one gigantic query
prob. of Google acquiring YouTube instead of a Y/N answer on whether the link exists.
closest subgraph is a match: EXEMPLAR queries


Google Knowledge Graph: Google bought Freebase; Freebase is manually curated. NELL is automatic

DBpedia is extracted from Wikipedia infoboxes. YAGO, SUMO.
Markov knowledge network for knowledge expansion
CRF for knowledge extraction.

knowledge integration to crowd.
knowledge evolution (update)

projects: bayes store vldb08. probkb.
no query time inference

Dynamic programming in database: recursive join
instead of tables use arrays. instead of treating each factor as a set of rows, we look at them as arrays. better data structure
Postgres runs each query in a single process (at the time of these notes). Greenplum parallelizes across segments. ProbKB applied there

markov logic network

robustness of graphical models?

crowd to look at the rules.

grounding: Felix, Tuffy from UW on Greenplum but we use GraphLab.
Tuffy failed but ProbKB finished in 10 mins. we use Horn clauses.


precision vs recall

precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved.
Suppose a program for recognizing dogs in scenes identifies 7 dogs in a scene containing 9 dogs and some cats. If 4 of the identifications are correct, but 3 are actually cats, the program's precision is 4/7 while its recall is 4/9.
When a search engine returns 30 pages only 20 of which were relevant while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3.

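Confusion matrix (rows: predicted class, columns: actual class):

                      actual positive                    actual negative
predicted positive    true positive (correct result)     false positive (unexpected result)
predicted negative    false negative (missing result)    true negative (correct absence of result)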
p = tp / (tp + fp)                   r = tp / (tp + fn)        accuracy = (tp + tn) / (tp + tn + fp + fn)
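
A quick check of the two worked examples above (dogs, search engine), using exact fractions. The function names are just for illustration.

```python
from fractions import Fraction

def precision(tp, fp):
    """Fraction of retrieved instances that are relevant."""
    return Fraction(tp, tp + fp)

def recall(tp, fn):
    """Fraction of relevant instances that are retrieved."""
    return Fraction(tp, tp + fn)

# Dogs: 7 identifications, 4 correct (tp), 3 are cats (fp), 9 - 4 = 5 dogs missed (fn)
dog_p = precision(tp=4, fp=3)
dog_r = recall(tp=4, fn=5)

# Search engine: 30 pages returned, 20 relevant (tp), 10 not (fp), 40 relevant pages missed (fn)
search_p = precision(tp=20, fp=10)
search_r = recall(tp=20, fn=40)
```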

F-measure

Harmonic mean of precision and recall: F = 2*p*r/(p+r). It's the mean that is closer to the minimum of the two values.
In certain situations, especially many situations involving rates and ratios, the harmonic mean provides the truest average. For instance, if a vehicle travels a certain distance at a speed x (e.g. 60 kilometres per hour) and then the same distance again at a speed y (e.g. 40 kilometres per hour), then its average speed is the harmonic mean of x and y (48 kilometres per hour).
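
A short sketch that verifies the speed example and applies the same harmonic mean to the search-engine precision and recall from earlier (2/3 and 1/3). The helper name is an assumption for illustration.

```python
from fractions import Fraction

def harmonic_mean(x, y):
    """Harmonic mean of two positive values: 2xy / (x + y)."""
    return Fraction(2) * x * y / (x + y)

# Same distance at 60 km/h and then at 40 km/h
avg_speed = harmonic_mean(60, 40)

# F-measure for precision 2/3 and recall 1/3
p, r = Fraction(2, 3), Fraction(1, 3)
f_measure = harmonic_mean(p, r)
arithmetic = (p + r) / 2
```

Here the F-measure is 4/9, below the arithmetic mean of 1/2 and closer to the smaller value 1/3, which is the property the note points out.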

22nd of March

implement the project on Spark, Shark. Identify relevant documents with a low-pass filter based on string search.
Some TREC 2012 KBA Track papers presented. CWI and ...
To get the context of a word (e.g. John Brown, J. Brown), take a prefix-suffix combination around it (e.g. 10 words back and forth).
One paper I found,
some tools: DBpedia, Freebase, HLTCOE (Human Language Technology Center of Excellence), some Weka classifiers (e.g. Random Committee classifier, Random Forest), Stanford NER, Log-likelihood Ratio, Markov Random Field, Google dictionary, Latent Concept Expansion, Google word-link dictionary, Galago.     TextRunner - ReVerb
Here is a quick note on [Scala for Java Gurus](
active learning, information extraction, crowdsourcing.
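
A minimal sketch of the prefix/suffix context idea from the notes above (k words before and after an entity mention). It assumes plain whitespace tokenization and exact matching after stripping punctuation; the function name and sample sentences are hypothetical.

```python
PUNCT = ".,;:!?\"'()"

def context_windows(text, mention, k=10):
    """Return (left, right) token lists around each occurrence of `mention`."""
    tokens = text.split()
    target = [t.strip(PUNCT) for t in mention.split()]
    n = len(target)
    windows = []
    for i in range(len(tokens) - n + 1):
        candidate = [t.strip(PUNCT) for t in tokens[i:i + n]]
        if candidate == target:
            left = tokens[max(0, i - k):i]    # up to k words before the mention
            right = tokens[i + n:i + n + k]   # up to k words after the mention
            windows.append((left, right))
    return windows

text = "The abolitionist John Brown led a raid on Harpers Ferry in 1859."
full_name = context_windows(text, "John Brown", k=2)
short_name = context_windows("Later J. Brown was tried.", "J. Brown", k=1)
```

Matching name variants such as "John Brown" and "J. Brown" to the same entity would still need a separate alias list; this only collects the surrounding words for one surface form at a time.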

March 29th
Catalyst DTSearch

Concept Search Active learning Topic model Probabilistic latent semantic analysis
Kullback–Leibler divergence Termite: Visualization Techniques for Assessing Textual Topic Models

Learning to rank:
pointwise methods
pairwise methods
listwise approach

4/5/13   demo


data class presentations:

dependency parsing:

17 May
Linguistic professor presentation
categorical coreference: "boxer" ... "the athlete"
getting context from comments on a post (for comments on a movie review, what feature of the movie do they address?) which actor, which feature?
how people actually categorize things: boat/car to vehicle
determine the focus of the sentences
topic vs focus
is what you say what the listener perceives? the big data
knowledge-base, entity relationships from text
build metadata on free text to query the document based on a query language
topic modeling : topic of documents and cluster them
    survey template to have template 
Google Knowledge Graph: edges vs nodes. the graph is very sparse; interpolate more edges.

a pronoun can refer to something that you never mention: "we went to London but it broke down" ("it" = the car)

information flow/evolution in people's conversations.
     context: a graph of knowledge that is evolving as people speak.
     entities, features, relationships, ...
    hierarchical vs network model debate on text
advertisement of a webpage relevant to the document but also relevant to the ....