Mika/Temp/WikiFCD/Grants

This page is for writing down ideas for grants.

List of potential grants and deadlines

Organization: NSF
Category: Information Integration and Informatics (III) under CISE (https://www.nsf.gov/pubs/2020/nsf20591/nsf20591.htm)
Deadline: NEW: no deadlines for SMALL projects (submit anytime after Oct 1, 2020); September 7-14, 2020 for MEDIUM projects
Funding Aims: "The III program supports innovative research on computational methods for the full data lifecycle, from collection through archiving and knowledge discovery, to maximize the utility of information resources to science and engineering and broadly to society. III projects range from formal theoretical research to those that advance data-intensive applications of scientific, engineering or societal importance. Research areas within III include:
  • General methods for data acquisition, exploration, analysis and explanation: Innovative methods for collecting and analyzing data as part of a scalable computational system.
  • Domain-specific methods for data acquisition, exploration, analysis and explanation: Work that advances III research while leveraging properties of specific application domains, such as health, education, science or work. Note that projects that simply apply existing III techniques to particular domains of science and engineering are more appropriate for funding opportunities issued by the NSF directorates cognizant for those domains.
  • Advanced analytics: Novel machine learning, data mining, and prediction methods applicable to large, high-velocity, complex, and/or heterogeneous datasets. This area includes data visualization, search, information filtering, knowledge extraction and recommender systems.
  • Data management: Research on databases, data processing algorithms and novel information architectures. This topic includes representations for scalable handling of various types of data, such as images, matrices or graphs; methods for integrating heterogeneous and distributed data; probabilistic databases and other approaches to handling uncertainty in data; ways to ensure data privacy, security and provenance; and novel methods for data archiving.
  • Knowledge bases: Includes ontology construction, knowledge sharing, methods for handling inconsistent knowledge bases and methods for constructing open knowledge networks through expert knowledge acquisition, crowdsourcing, machine learning or a combination of techniques."
Amount: up to $500,000 total budget with durations up to three years

Project Aims

Food, nutrition, and health are among the most highly engaged topics in the Wikimedia ecosystem and around the world. Food Composition Data (FCD) is a key piece connecting those three topics, providing nutrient data for each food item. There is a need for an open and structured database for global FCD, and Wikimedia - especially Wikidata - is a natural place to accommodate some of these data. Given the diversity and complexity of the existing FCDs, it would be helpful to have a placeholder that can accommodate all the details from the existing FCDs, from which Wikidata project editors can pull information deemed appropriate for Wikimedia. Accommodating all the details is important because the needs within Wikimedia projects can change over time.

We believe that Wikimedia is an appropriate venue for this project. Many FCDs - which currently come in various formats (e.g. PDF, CSV) - include varying degrees of detail. The nutrient content of unprocessed food items (e.g. apples) can also vary for the same item across areas and over time because of changing characteristics such as climate and terroir. However, the current FCDs are not well suited to reflecting these changes. In fact, research institutes and intergovernmental agencies have attempted to create a global FCD in the past, and none has succeeded to date. Developing and maintaining such a database is difficult if the contributors are limited to small, closed groups of researchers and employees in this field. Importantly, even though there are wide regional variations in the foods that are commonly consumed, some places lack access to regionally appropriate FCD, up-to-date FCD, or FCD in their own languages, leading to disparities in data availability and accessibility and, ultimately, in the scientific evidence available to health research. We need a more open and collaborative system.

First, this Wikibase instance will significantly improve the usability of FCD from different sources for diverse users - from WikiProject participants and Wikipedia editors and readers to academic researchers and public health workers. WikiProject Food and Drink on English Wikipedia and its equivalents in other languages are consistently popular WikiProjects among editors, and likewise, many articles on food and drink rank within the top 10% of any Wikipedia's articles by pageviews. This new project can contribute to a topic that is of high interest to many people.

Building a structured dataset is also a key step in identifying the most appropriate data to borrow in resource-poor settings where up-to-date, detailed, and regionally appropriate FCD are not readily available. This new database will also open up new research questions around more nuanced nutrition data (e.g. changes in the nutrient content of the same product depending on the climate conditions of the year), which can potentially drive substantial advances in nutrition and health research.

Second, by creating an instance of Wikibase for this project, we will be able to design our own data models, with input from Wikidata, to incorporate data from heterogeneous data sources. If subsets of the data are appropriate for Wikidata, we will be able to provide machine-actionable ShEx schemas that help us prepare data for other systems. In this way the data will be readily available for incorporation into Wikidata if desired.

Outputs

  • Wikibase instance with FCD data from multiple sources
  • SPARQL query code to combine this data with subsets of Wikidata data
  • Data models for food items, food composition tables, recipes, and other resources encoded as ShEx schemas
  • Visualizations of this data

Wikibase is a novel infrastructural platform for data management suitable for data from many domains. This is the first application built on Wikibase tailored to the needs of the epidemiological community. The output of this research will be a knowledge graph of structured data in the form of a Wikibase instance populated with data from heterogeneous food composition tables.

Multiple data visualization options are available via the Query Service of our Wikibase instance. Graphs, charts, network diagrams, and maps are some of the visualizations we will be able to offer end-users of this knowledge base.

  • Case Study One: Fermented foods

The nutrient composition of fermented foods commonly changes as the fermentation process progresses. We will select 15 fermented foods to use in a case study of modeling nutrient composition that changes over time. We will develop an algorithm for use in our Wikibase that converts a set of food items into a fermented food recipe with accurate nutrient information for the dish.

  • Case Study Two: Time Series Data

Agricultural practices, local conditions, and global weather patterns all influence nutrient density in food crops. Designing a data model to represent time series data will allow us to track changes in nutrient density over time. For example, we have designed our knowledge base to accommodate nutrient composition data for a single varietal of a species grown on the same farm that is re-analyzed yearly for nutrition information.
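
As a concrete illustration of this time-series model, the sketch below uses WikidataIntegrator to attach a point-in-time qualifier to a nutrient measurement. The property IDs (P100 for a nutrient quantity, P101 for the point-in-time qualifier) and the endpoint URLs are placeholders of our own, not confirmed identifiers from our Wikibase.

    # Sketch: one yearly nutrient measurement as a Wikibase statement with a
    # point-in-time qualifier. P100/P101 and the URLs are assumed placeholders.
    from wikidataintegrator import wdi_core

    # Qualifier marking the year of the harvest that was analyzed
    harvest_year = wdi_core.WDTime("+2020-01-01T00:00:00Z", prop_nr="P101",
                                   precision=9, is_qualifier=True)  # 9 = year

    # Vitamin C content (mg per 100 g) for that year's analysis
    vitamin_c = wdi_core.WDQuantity(value=4.6, prop_nr="P100",
                                    qualifiers=[harvest_year])

    item = wdi_core.WDItemEngine(
        data=[vitamin_c],
        mediawiki_api_url="https://wikifcd.wiki.opencura.com/w/api.php",
        sparql_endpoint_url="https://wikifcd.wiki.opencura.com/query/sparql")

Because each year's analysis becomes a separate statement distinguished by its qualifier, queries can return the full measurement history rather than a single overwritten value.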

  • Case Study Three: Georeferenced Data

Wild food is food that is gathered from the environment rather than cultivated agriculturally. The nutrient composition of wild foods is determined by the ecology of their location. Building a data model for georeferenced data will allow us to track the coordinate locations of wild food item sources. In this way we will be able to document the location of harvest and combine it with the nutrient composition at the level of each individual statement. Each harvesting episode for which we have nutrient composition data will be modeled individually; as we acquire additional data for wild harvests, we will be able to compare nutritional information across space as well as time.

Methods

  • Data Acquisition

We worked from the FAO's list of food composition tables [1] to identify existing FCDs that we could add to our Wikibase, found copies of these FCTs where possible, and extracted the data from them. The FCDs were originally published as CSV files or as tabular data encoded in PDFs.
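
The sources above do not name our extraction tooling, but a minimal sketch of this step, assuming pandas for the CSV sources and camelot-py for tables embedded in PDFs, looks like this (file names are hypothetical):

    # Sketch: reading one CSV-published FCT and one PDF-published FCT.
    # Library choice and file names are our assumptions, not the original workflow.
    import pandas as pd
    import camelot  # camelot-py: table extraction from PDFs

    # CSV-published tables can be loaded directly
    fct_csv = pd.read_csv("example_fct.csv")

    # PDF-published tables need table detection and extraction first
    tables = camelot.read_pdf("example_fct.pdf", pages="all")
    fct_pdf = tables[0].df  # first detected table as a DataFrame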

  • Database Design and Population

We will create a database model that can represent heterogeneous food composition tables. We will use this model to map multiple food composition tables so that we can then import them into a Wikibase instance.

Our alignment of food composition table data with Wikidata will allow us to leverage the sum of knowledge in the projects of the Wikimedia Foundation. Because Wikimedia Commons, the media repository of the Wikimedia projects, has also been aligned with Wikidata, we will be able to easily reuse images of food items, molecular structure models, and food dishes alongside our projects. A query against our SPARQL endpoint [2] lists all of the food items in our project Wikibase that have an associated image in Wikimedia Commons.
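
A hedged reconstruction of the kind of query cited above, run here through Python's SPARQLWrapper: the query-service URL and the use of P22 as the image property (the ShEx schema below types P22 as an IRI) are assumptions on our part.

    # Sketch: list food items in our Wikibase with an associated Commons image.
    # Endpoint URL and property ID P22 are assumed, not confirmed.
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://wikifcd.wiki.opencura.com/query/sparql")
    endpoint.setQuery("""
        PREFIX wbt: <http://wikifcd.wiki.opencura.com/prop/direct/>
        SELECT ?item ?image WHERE {
          ?item wbt:P22 ?image .  # P22: image (assumed)
          FILTER(STRSTARTS(STR(?image), "http://commons.wikimedia.org/"))
        }""")
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["item"]["value"], row["image"]["value"])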

We used the wbstack platform (https://www.wbstack.com/) to create an instance of Wikibase for testing. The wbstack service provides a hosted version of Wikibase that users can load with their own data. Wikibase is the software used to support Wikidata itself.

WikidataIntegrator (WDI) is a Python library for interacting with data from Wikidata (Waagmeester et al., 2020). WDI was created by the Su Lab at the Scripps Research Institute and is shared under an open-source software license via GitHub (https://github.com/SuLab/WikidataIntegrator). Using WDI as a framework, we wrote bots to transfer data from FCTs to our Wikibase.
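
A minimal sketch of such a bot follows. The P1/Q12 pairing is taken from the ShEx example below, where it serves as the type assertion for food composition tables; the API URLs, the string property P58, and the credentials are placeholders.

    # Sketch: writing one food composition table item to our Wikibase with WDI.
    # URLs, credentials, and the meaning of P58 are assumptions.
    from wikidataintegrator import wdi_core, wdi_login

    api_url = "https://wikifcd.wiki.opencura.com/w/api.php"        # assumed
    sparql_url = "https://wikifcd.wiki.opencura.com/query/sparql"  # assumed
    login = wdi_login.WDLogin("BotUser", "bot_password",
                              mediawiki_api_url=api_url)

    statements = [
        wdi_core.WDItemID(value="Q12", prop_nr="P1"),    # instance of: FCT
        wdi_core.WDString(value="2017", prop_nr="P58"),  # a string field (assumed)
    ]
    item = wdi_core.WDItemEngine(data=statements,
                                 mediawiki_api_url=api_url,
                                 sparql_endpoint_url=sparql_url)
    item.set_label("Example food composition table", lang="en")
    item.write(login)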

  • Ontology Engineering

We will write schemas for the data models related to food composition data and food items. These schemas will serve as the ontology for our knowledge graph. Our Wikibase has a schema namespace that supports the Shape Expressions (ShEx) language [3]. ShEx is a data modeling and data validation language for RDF graphs. We provide an example below of a ShEx schema describing how food composition tables are modeled in our Wikibase. Defining ShEx schemas for our data models allows us to communicate the expected structure of data for a food composition table to others who may wish to contribute data to our public Wikibase. We have published the schema in the Schema namespace [4].

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wbt: <http://wikifcd.wiki.opencura.com/prop/direct/>
PREFIX wb: <http://wikifcd.wiki.opencura.com/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

start = @<#food_composition_table>

<#food_composition_table> EXTRA wbt:P1 {
    wbt:P1 [wb:Q12] ;
    wbt:P22 IRI ? ;
    wbt:P58 xsd:string ? ;
    wbt:P68 xsd:string * ;
    wbt:P65 @<#P65_country> * ;
    wbt:P56 xsd:string * ;
    wbt:P69 xsd:string * ;
    wbt:P70 xsd:string *
}

<#P65_country> {
    wbt:P31 [wb:Q127865]
}

These ShEx schemas will also reduce work for anyone looking to combine data from our knowledge graph with other data sets. For example, if researchers would like to explore our data, rather than writing exploratory SPARQL queries to find out what data can be found and the details of our data models, they can simply review our ShEx schemas to quickly understand our data models.

  • Validating RDF Graphs

ShEx can be used to validate RDF graphs for conformance to a schema. This allows us to create forms for data contributors that will ensure data consistency. Data contributors will not need to familiarize themselves with our data models; the form-based contribution interaction will guide curation.
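
As a sketch of what this validation step can look like, assuming the PyShEx library (our choice, not one named above) and a hypothetical item Q42 exported as Turtle:

    # Sketch: validate one item's RDF against the FCT schema shown above.
    # PyShEx and the file/item names are our assumptions.
    from pyshex import ShExEvaluator

    schema = open("food_composition_table.shex").read()  # the ShEx schema above
    rdf_data = open("item_q42.ttl").read()               # Turtle export of one item

    results = ShExEvaluator(
        rdf=rdf_data,
        schema=schema,
        focus="http://wikifcd.wiki.opencura.com/entity/Q42").evaluate()
    for r in results:
        print(r.focus, "conforms" if r.result else r.reason)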

Our ShEx schemas will also be useful when integrating additional RDF data sets as the project matures. When we encounter new RDF data sources we can explore them with the use of our ShEx schemas to determine where they overlap with our existing data models. We will also be able to extend our schemas as the need for greater expressivity or complexity arises.

  • Data Provenance

Our emphasis on reusing data from multiple published sources requires precision in data provenance. The structure of references in the Wikibase data model allows us to assert provenance at the level of the statement. Simply put, we can connect our sources to individual statements of fact in our knowledge graph. In this way we can always be sure of where data was originally found should we need to communicate that to others or follow up with the reference material.

Using the SPARQL query language, we can also write tailored queries to extract subgraphs supported by a single source. In this way we support views of the data across multiple sources as well as views drawn from individual sources. Researchers will not need to separate data manually; the provenance metadata is machine actionable and stored at the level of individual statements in the graph.
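
A sketch of such a provenance-filtered query, modeled on the Wikibase RDF layout used by Wikidata (statement nodes linked to reference nodes via prov:wasDerivedFrom); the prefix paths and the P/Q identifiers are assumptions, not confirmed IDs from our Wikibase:

    # Sketch: extract only the nutrient statements supported by one source FCT.
    # Prefixes follow the standard Wikibase RDF layout; P100 (a nutrient
    # property), P200 ("stated in"), and Q500 (the source) are placeholders.
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://wikifcd.wiki.opencura.com/query/sparql")
    endpoint.setQuery("""
        PREFIX p:    <http://wikifcd.wiki.opencura.com/prop/>
        PREFIX ps:   <http://wikifcd.wiki.opencura.com/prop/statement/>
        PREFIX pr:   <http://wikifcd.wiki.opencura.com/prop/reference/>
        PREFIX wb:   <http://wikifcd.wiki.opencura.com/entity/>
        PREFIX prov: <http://www.w3.org/ns/prov#>
        SELECT ?item ?value WHERE {
          ?item p:P100 ?statement .
          ?statement ps:P100 ?value ;
                     prov:wasDerivedFrom ?ref .
          ?ref pr:P200 wb:Q500 .  # reference: stated in the chosen source FCT
        }""")
    endpoint.setReturnFormat(JSON)
    print(endpoint.query().convert())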

Impact

This project will provide a new FCD knowledge graph that supports queries across multiple FCDs with a single search. This will reduce the time that epidemiologists, nutritionists, and other researchers spend searching for food composition data. The knowledge graph will support federated queries with Wikidata and other public SPARQL endpoints, allowing researchers to ask questions of this data in combination with other linked open datasets. Because many of the source tables were published as PDFs, getting the data into a structured, readily accessible format greatly increases its ease of reuse.

This project will support multilingual data, reducing barriers to data reuse for speakers of many languages beyond English. Users will be able to query using any of the supported human languages and see results in the language of their choice. Through the reuse of data from Wikidata, a multilingual knowledge base, we will add common names as well as scientific names for food items and plant and animal species in as many human languages as possible.

In many FCTs food items are identified with a single label. Our approach supports searching across multiple aliases for a single resource. This broadens search options so that lookups are not constrained to a single search term.

Our choice to use Wikibase gives us access to the data serialized as RDF. The SPARQL endpoint we have created allows us to ask questions of this data that were previously impossible to ask. For example, we can now ask questions such as "show me all recipes that call for one or more ingredients containing proanthocyanidins".
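
Rendered as SPARQL, that question could look like the sketch below; every identifier here (the recipe class, "has ingredient", "contains", and the proanthocyanidins item) is a hypothetical placeholder:

    # Sketch: "recipes that call for ingredients containing proanthocyanidins".
    # All P/Q identifiers are placeholders, not confirmed IDs from our Wikibase.
    query = """
    PREFIX wbt: <http://wikifcd.wiki.opencura.com/prop/direct/>
    PREFIX wb:  <http://wikifcd.wiki.opencura.com/entity/>
    SELECT DISTINCT ?recipe WHERE {
      ?recipe wbt:P1 wb:Q300 .       # instance of: recipe (assumed)
      ?recipe wbt:P40 ?ingredient .  # has ingredient (assumed)
      ?ingredient wbt:P41 wb:Q310 .  # contains: proanthocyanidins (assumed)
    }
    """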

We will connect scientific publications about the nutritional components of foods with the food items. This is possible because of the existence of roughly 50,000,000 scientific publications in Wikidata. Many of the publications in PubMed are already represented in Wikidata, thus our domain is adequately represented. We will create new Wikidata items for publications we would like to reference if they do not yet exist. Connecting publications with food items in our knowledge graph will allow us to provide additional evidence for researchers to reuse, investigate, and extend.

The knowledge graph approach allows us to combine food composition data and recipes in the same database, which will enable us to create novel user interfaces for people interested in the nutritional components of home-cooked dishes.

The knowledge graph approach also facilitates expansion of this project into related domains. We could look at food chemistry and metabolic processes by combining this with subsets of Wikidata. We could combine this data with research literature about health benefits of plant-derived medicines and extend our data models to include plant components that have been tested for medicinal efficacy.

The ability to federate SPARQL queries between our Wikibase and Wikidata allows us to combine our data with resources from the media repository of the Wikimedia Foundation, Wikimedia Commons. The ability to quickly locate images, videos and sound files related to the resources in our Wikibase allows us to provide interactive multi-media interactions in applications we build on top of our Wikibase. Wikimedia Commons has images of many of the taxa of which our food items are products.
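
A sketch of such a federated query follows: assuming a hypothetical property P33 linking our food items to their Wikidata taxon entities, the SERVICE clause sends the inner pattern to Wikidata's endpoint, where wdt:P18 is Wikidata's actual image property.

    # Sketch: federate with Wikidata to fetch Commons images for the taxa
    # behind our food items. P33 (link to the Wikidata taxon IRI) is assumed.
    query = """
    PREFIX wbt: <http://wikifcd.wiki.opencura.com/prop/direct/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?food ?image WHERE {
      ?food wbt:P33 ?taxon .  # link to the Wikidata taxon entity (assumed)
      SERVICE <https://query.wikidata.org/sparql> {
        ?taxon wdt:P18 ?image .  # P18: image
      }
    }
    """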

The Wikibase infrastructure supports both human and algorithmic curation. Thus we can programmatically ingest data from external sources and also support crowdsourced recipes from anyone with access to the internet. The World Wide Web Consortium (W3C) published the following definition of the Semantic Web in 2009: "Semantic Web is the idea of having data on the Web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration, and reuse of data across various applications." (W3C Semantic Web Activity, 2009)

The Wikidata knowledge base fulfills the requirements outlined by the W3C in that each resource has a unique identifier, is linked to other resources by properties, and all of the data is machine actionable as well as editable by both humans and machines.

Our decision to build this knowledge base using the infrastructure of the Wikimedia Foundation means that other researchers will be able to access this data for reuse in their own projects in a variety of formats. Results from our SPARQL endpoint are available for download as JSON, TSV, CSV, and HTML. Preformatted code snippets for making requests to our SPARQL endpoint are available in PHP, jQuery, JavaScript, Java, Perl, Python, Ruby, R, and MATLAB. These options allow researchers to more quickly integrate data from our knowledge base into their existing projects using the tools of their choice.

People