IBM Research Workshop
Date: October 24, 2017
Time: 1:00pm - 5:00pm
Location: Packard Building, room 202
[1:15] Introduction/Welcome/Research Overview - Sandeep Gopisetty, IBM Research | Almaden (30 min)
[1:45] Talk #1: "Machine Learning Poison Detection Based on Provenance" by Heiko Ludwig (30 min)
ABSTRACT: The use of machine learning models has become ubiquitous. Their predictions are used to make decisions about healthcare, security, investments and many other critical applications. Given this pervasiveness, it is not surprising that adversaries have an incentive to manipulate machine learning models to their advantage. Different forms of attacks have been observed, including attacks that extract models or data by interacting with the model and attacks that evade an intended classification. One way of manipulating a model is through a poisoning attack, in which the adversary feeds carefully crafted poisonous data points into the training set. Taking advantage of recently developed tamper-free provenance frameworks, we apply a methodology that uses contextual information about the origin and transformation of data points in the training set to identify poisonous data, incorporating provenance information as part of a filtering algorithm. The presentation will go over different options depending on the availability of trusted test data. Using this family of approaches, we can detect and filter poisoning attacks in environments where provenance information is reliably available.
SPEAKER BIO: http://researcher.watson.ibm.com/researcher/view.php?person=us-hludwig
[2:15] Talk #2: "Scalable Multilingual Natural Language Processing in SystemT" by Yunyao Li (30 min)
ABSTRACT: Automatic semantic understanding of natural languages is at the core of many AI applications. While there are thousands of natural languages in the world, automatic semantic understanding of natural languages is largely limited to English with a few exceptions. There are two key challenges for providing multilingual support: (1) the lack of linguistic resources for most languages in the world; and (2) research on languages in isolation. As a result, most languages are poorly researched and supported, and research and resources for one language have hardly any impact on other languages. In this talk, we will present POLYGLOT, a multilingual semantic role labeling (SRL) system capable of semantically parsing sentences in 9 different languages from 4 different language groups. POLYGLOT treats the semantic labels of the English Proposition Bank as “universal semantic labels”: given a sentence in any of the supported languages, it predicts the appropriate English PropBank frame and role annotations. We illustrate how these universal semantic labels can be used within SystemT, a declarative information extraction system developed in IBM Research and shipping with 10+ IBM products, to create cross-lingual information extractors that immediately work across different languages. In addition, we will discuss how we automatically generate Proposition Banks for new languages to enable multilingual SRL by exploiting monolingual technologies, multilingual parallel data, and crowd-sourced workers.
SPEAKER BIO: Yunyao Li is a Master Inventor, Research Staff Member and Research Manager with IBM Almaden Research Center, where she manages the Scalable Natural Language Processing group. She is also a member of the IBM Academy of Technology. Her expertise is in the interdisciplinary areas of natural language processing, databases, human-computer interaction, and information retrieval. She has regularly served on program committees and editorial boards for internationally renowned conferences and journals in these areas. Yunyao is particularly interested in designing, developing, and analyzing large-scale systems that are usable by a wide spectrum of users. To that end, her current focus is on the building and querying of domain-specific knowledge bases. She is a founding member of SystemT, a state-of-the-art information extraction engine currently powering multiple IBM products including Watson Natural Language Understanding, and of Gumshoe, a novel enterprise search engine that has been powering IBM intranet and ibm.com search since 2010. Her contributions to these projects have been recognized by multiple prestigious IBM internal awards. Yunyao obtained her Ph.D. degree in Computer Science & Engineering from the University of Michigan.
[2:45] Talk #3: "Wildfire: Fast HTAP over Loosely-Coupled Nodes" by Vijayshankar Raman (30 min)
ABSTRACT: Analytics over live, high-volume data streams is needed in domains ranging from finance (portfolio management) to healthcare to IoT. Traditional OLTP systems struggle with such workloads, both due to high ingest rates and due to the need for complex, across-many-rows analytics on real-time data (called HTAP – hybrid transaction and analytic processing). Another complication is loose coupling (“AP” in the sense of the CAP theorem). Globalization has made ACID consistency pedantic: many modern applications, including financial ones, want multi-master updates, with transactions committing even when the multiple masters are (transiently) disconnected. At the same time, these applications are not happy with eventual consistency. Wildfire is a prototype HTAP system being built to tackle these challenges. Wildfire simultaneously targets HTAP with an open Parquet data format for fully versioned data and very high-volume ingest (over 1 million rows per second per node) across multiple masters, while retaining the flexibility to get full serializability.
SPEAKER BIO: Vijayshankar Raman is a researcher in the database group at IBM Almaden, with interests in query processing and optimization, data compression, high performance software, data cleaning, and transaction processing.
[3:15] Talk #4: "Ontology-Driven Natural Language Querying of Knowledge Bases" by Fatma Ozcan (30 min)
ABSTRACT: In this talk, we present an ontology-driven system for natural language querying of complex relational databases and search indexes. Natural language interfaces to databases give users easy access to data, without the need to learn a complex query language such as SQL. The system, ATHENA, uses domain-specific ontologies, which describe the semantic entities and their relationships in a domain. We propose a unique two-phase approach, where the input natural language query (NLQ) is first translated into an intermediate query language over the ontology, called OQL, and subsequently translated into SQL or a search query. Our two-phase approach allows us to decouple the physical layout of the data in the backends from the semantics of the query, providing physical independence. Moreover, ontologies provide richer semantic information, such as inheritance and membership relations, which is lost in a relational schema. By reasoning over the ontologies, our NLQ engine is able to accurately capture the user intent.
SPEAKER BIO: Fatma Ozcan is a Principal Research Staff Member and a manager at IBM Research Almaden. Her current research focuses on natural language interfaces to knowledge bases and databases, platforms and infrastructure for large-scale data analysis, SQL-on-Hadoop, and query processing and optimization of semi-structured data. Dr. Ozcan received her PhD in computer science from the University of Maryland, College Park. She has over 15 years of experience in semi-structured and structured data management, query processing and optimization, and has delivered core technologies into IBM DB2 and BigInsights products. She is the co-author of the book "Heterogeneous Agent Systems", and co-author of several conference papers and patents. She is a member of the ACM and ACM SIGMOD. She serves as the treasurer of SIGMOD and on the board of trustees of the VLDB Endowment.
[3:45] Talk #5: "Scalable Machine/Deep Learning with Apache SystemML" by Berthold Reinwald (30 min)
ABSTRACT: Apache SystemML is an open-source project for declarative, large-scale machine/deep learning. Data scientists are able to implement ML/DL algorithms in a high-level language without knowledge of distributed program