Impostor Syndrome

Hello everybody,

I had the opportunity to give an interview to the magazine WOMAN on the topic of Impostor Syndrome. The article is only in German and was not available online, so I uploaded it here for anyone interested in reading it. Credits to Angelika Strobl, who interviewed me and wrote the article. Enjoy!

What I’ve learned while triplifying a real dictionary

The Linked Data Lexicography for High-End Language Technology (LDL4HELTA) project was started in cooperation between Semantic Web Company (SWC) and K Dictionaries. LDL4HELTA combines lexicography and Language Technology with semantic technologies and Linked (Open) Data mechanisms and technologies. One of the implementation steps of the project is to create a language graph from the dictionary data.

The input data, described further below, is a Spanish dictionary core translated into multiple languages and available in XML format. This data should be triplified (that is, converted to RDF, the Resource Description Framework) for several purposes, including enriching it with external resources. The triplified data needs to comply with Semantic Web principles.

To get from a dictionary’s XML format to its triples, I learned that you must have a model. One piece of the sketched model, representing two Spanish words which have senses that relate to each other, is presented in Figure 1.

'arriesgado' model
Figure 1: Language model example

This sketched model first needs to be created by a linguist who understands both the language complexity and Semantic Web principles.

Language is very complex. On this we all agree! How complex it really is, however, is probably often underestimated, especially when you need to model all its details and triplify it.

So why is the task so complex?

To start with, the XML structure is complex in itself, as it contains nested structures. Each word constitutes an entry. One single entry can contain information about:

  • Pronunciation
  • Inflection
  • Range Of Application
  • Sense Indicator
  • Compositional Phrase
  • Translations
  • Translation Example
  • Alternative Scripting
  • Register
  • Geographical Usage
  • Sense Qualifier
  • Provenance
  • Version
  • Synonyms
  • Lexical sense
  • Usage Examples
  • Homograph information
  • Language information
  • Specific display information
  • Identifiers
  • and more…

Entries can have predefined values, which can recur, but their fields can also have so-called free values, which can vary. Such fields are:

  • Aspect
  • Tense
  • Subcategorization
  • Subject Field
  • Mood
  • Grammatical Gender
  • Geographical Usage
  • Case
  • and more…

As mentioned above, in order to triplify a dictionary one needs to have a clearly defined model. Usually, when modelling linked data or just RDF, it is important to make use of existing models and schemas to enable easier and more efficient use and integration. One well-known lexicon model is Lemon. Lemon covers many of our dictionary needs, but not all of them. We also started using the Ontolex model, which is much more complex and is considered to be the evolution of Lemon. However, some pieces of information were still missing, so we created an additional ontology to cover the gaps and capture the specific details that did not overlap with the Ontolex model (such as the free values).

An additional level of complexity was the need to identify exactly which pieces were missing in the Ontolex model and its modules, and to create the part for the missing information. This was part of creating the dictionary’s model, which we called ontolexKD.

As a developer you never sit down to think about all the senses, meanings or translations of a word (unless you specialize in linguistics), so just understanding the complexity was a revelation for me. And still, each dictionary contains information that is specific to it and which needs to be identified and understood.

The process used in order to do the mapping consists of several steps. Imagine this as a processing pipeline which manipulates the XML data. UnifiedViews is an ETL tool, specialized in the management of RDF data, in which you can configure your own processing pipeline. One of its use cases is to triplify different data formats. I used it to map XML to RDF and upload it into a triple store. Of course this particular task can also be achieved with other such tools or methods for that matter.

In UnifiedViews the processing pipeline resembles what appears in Figure 2.

UnifiedViews
Figure 2: UnifiedViews pipeline used to triplify XML

The pipeline is composed of data processing units (DPUs) which communicate iteratively. In left-to-right order, the process in Figure 2 consists of:

  • A DPU used to upload the XML files into UnifiedViews for further processing;
  • A DPU which transforms XML data to RDF using XSLT. The style sheet is part of the configuration of the unit;
  • The generated .rdf files are stored on the filesystem;
  • And, finally, the generated .rdf files are uploaded into a triple store, such as Virtuoso Universal Server.

Basically the XML is transformed using XSLT.
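To make the transformation step more concrete, here is a minimal sketch of the same idea in Python rather than XSLT. It is not the project's style sheet: the element names (`Entry`, `Headword`, `PartOfSpeech`) and the entry URI pattern are assumptions made up for illustration, and the real dictionary XML is far richer. Only the Ontolex terms (`LexicalEntry`, `canonicalForm`, `writtenRep`) come from the published Ontolex vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified input; the element names are assumptions.
SAMPLE_XML = """
<Dictionary lang="es">
  <Entry id="EN0001">
    <Headword>entendedor</Headword>
    <PartOfSpeech>n</PartOfSpeech>
  </Entry>
</Dictionary>
"""

BASE = "http://kdictionaries.com/id/lexiconES/"
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def entry_to_ntriples(entry, lang):
    """Turn one <Entry> element into a list of N-Triples lines."""
    headword = entry.findtext("Headword").strip()
    pos = entry.findtext("PartOfSpeech").strip()
    entry_uri = f"{BASE}{headword}-{pos}"  # assumed pattern for the entry itself
    form_uri = f"{entry_uri}-form"
    # Note: a real converter must escape quotes and backslashes in literals.
    return [
        f"<{entry_uri}> <{RDF_TYPE}> <{ONTOLEX}LexicalEntry> .",
        f"<{entry_uri}> <{ONTOLEX}canonicalForm> <{form_uri}> .",
        f'<{form_uri}> <{ONTOLEX}writtenRep> "{headword}"@{lang} .',
    ]


def triplify(xml_text):
    """Parse the XML and emit N-Triples lines for every entry found."""
    root = ET.fromstring(xml_text)
    lang = root.get("lang")
    lines = []
    for entry in root.iter("Entry"):
        lines.extend(entry_to_ntriples(entry, lang))
    return lines


if __name__ == "__main__":
    print("\n".join(triplify(SAMPLE_XML)))
```

The output is plain N-Triples, so it could be written to a file and loaded into a triple store in the same way as the generated .rdf files.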

Complexity also increases through the URIs (Uniform Resource Identifiers) that are needed for mapping the information in the dictionary, because with Linked Data any resource should have a clearly identified and persistent identifier! The start was to represent a single word (headword) under a desired namespace and build on it to associate it with its part of speech, grammatical number, grammatical gender, definition, translation – just to begin with.

The base URIs follow the best practices recommended in the ISA study on persistent URIs, following the pattern: http://{domain}/{type}/{concept}/{reference}.

An example of such URIs for the forms of a headword is:

  • http://kdictionaries.com/id/lexiconES/entendedor-n-m-sg-form
  • http://kdictionaries.com/id/lexiconES/entendedor-n-f-sg-form

These two URIs represent the singular masculine and singular feminine forms of the Spanish word entendedor.

  • http://kdictionaries.com/id/lexiconES/entendedor-adj-form-1
  • http://kdictionaries.com/id/lexiconES/entendedor-adj-form-2

If the dictionary contains two different adjectival endings, as with entendedor, which has different endings for the feminine and masculine forms (entendedora and entendedor), and they are not explicitly mentioned in the dictionary, then we use numbers in the URI to describe them. If the gender were explicitly mentioned, the URIs would be:

  • http://kdictionaries.com/id/lexiconES/entendedor-adj-form
  • http://kdictionaries.com/id/lexiconES/entendedora-adj-form
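The form URIs above follow a regular pattern, which can be sketched as a small helper. This is only an illustration under assumptions: the function name `form_uri` and its parameters are hypothetical, and the project generated its URIs inside the XSLT, not in Python.

```python
from urllib.parse import quote

BASE = "http://kdictionaries.com/id"


def form_uri(lexicon, headword, pos, gender=None, number=None, index=None):
    """Build a form URI such as .../lexiconES/entendedor-n-m-sg-form.

    gender and number are added only when the dictionary states them;
    index is the numeric suffix used when forms are not told apart explicitly.
    """
    parts = [headword, pos]
    if gender:
        parts.append(gender)
    if number:
        parts.append(number)
    parts.append("form")
    if index is not None:
        parts.append(str(index))
    # quote() percent-encodes characters that are not safe in a URI path;
    # whether to keep non-ASCII letters as IRIs instead is a design choice.
    return f"{BASE}/{lexicon}/{quote('-'.join(parts))}"


if __name__ == "__main__":
    print(form_uri("lexiconES", "entendedor", "n", gender="m", number="sg"))
    print(form_uri("lexiconES", "entendedor", "adj", index=1))
```

Keeping the pattern in one place makes it easier to guarantee that the same form always gets the same, persistent identifier.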

In addition, we should consider that the aim of triplifying the XML was for all these headwords, with their senses, forms and translations, to be connected, identified and linked following Semantic Web principles. The actual overlap and linking of the dictionary resources remains open. A second step for improving the triplification and mapping similar entries, if possible at all, still needs to be carried out. As an example, let’s take two dictionaries: a German dictionary which contains translations into English, and an English dictionary which contains translations into German. We get the following translations:

Bank – bank – German to English

bank – Bank – English to German

The URI of the translation from German to English was designed to look like:

  • http://kdictionaries.com/id/tranSetDE-EN/Bank-n-SE00006116-sense-bank-n-Bank-n-SE00006116-sense-TC00014378-trans

And the translation from English to German would be:

  • http://kdictionaries.com/id/tranSetEN-DE/bank-n-SE00006110-sense-Bank-n-bank-n-SE00006110-sense-TC00014370-trans

In this case both represent the same translation but have different URIs because they were generated from different dictionaries (mind the translation order). These should be mapped so as to represent the same concept, theoretically, or should they not?

The word Bank in German can mean either a bench or a bank in English. When I translate both English senses back into German I get again the word Bank, but I cannot be sure which sense I translate unless the sense id is in the URI, hence the SE00006110 and SE00006116. It is important to keep the order of translation (target-source) but later map the fact that both translations refer to the same sense, same concept. This is difficult to establish automatically. It is hard even for a human sometimes.
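A first, purely mechanical step towards such a mapping could be to collect candidate pairs: translations from two dictionaries whose source and target words mirror each other. The sketch below assumes the translations are available as simple records; the `Translation` tuple and the `mirror_candidates` function are hypothetical names introduced for illustration, and the result is only a list of candidates, since matching strings do not prove matching senses (Bank can still be a bench).

```python
from collections import namedtuple

# Hypothetical record layout; the real data lives in the triple store.
Translation = namedtuple(
    "Translation", "uri source_lang target_lang source_word target_word sense_id"
)


def mirror_candidates(translations):
    """Return (uri, uri) pairs whose languages and words are swapped."""
    index = {}
    for t in translations:
        key = (t.source_lang, t.target_lang, t.source_word, t.target_word)
        index.setdefault(key, []).append(t)

    pairs = []
    for t in translations:
        # Visit each unordered language pair once (e.g. only DE -> EN).
        if t.source_lang < t.target_lang:
            mirror_key = (t.target_lang, t.source_lang, t.target_word, t.source_word)
            for m in index.get(mirror_key, []):
                pairs.append((t.uri, m.uri))
    return pairs


DE_EN = Translation(
    "http://kdictionaries.com/id/tranSetDE-EN/Bank-n-SE00006116-sense-bank-n-Bank-n-SE00006116-sense-TC00014378-trans",
    "DE", "EN", "Bank", "bank", "SE00006116",
)
EN_DE = Translation(
    "http://kdictionaries.com/id/tranSetEN-DE/bank-n-SE00006110-sense-Bank-n-bank-n-SE00006110-sense-TC00014370-trans",
    "EN", "DE", "bank", "Bank", "SE00006110",
)

if __name__ == "__main__":
    for left, right in mirror_candidates([DE_EN, EN_DE]):
        print(left)
        print(right)
```

Deciding whether a candidate pair really refers to the same sense would still need the sense information and, most likely, a human reviewer.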

One of the last steps of complexity was to develop a generic XSLT which can triplify all the different languages of this dictionary series and store the complete data in a triple store. The question remains: is the design of such a universal XSLT possible while taking into account the differences in languages or the differences in dictionaries?

The task at hand is not yet complete from the point of view of enabling the dictionary to benefit from Semantic Web principles. The linguist is probably the first one who can conceptualize “the how to do this”.

As a next step we will improve the Linked Data created so far and bring it to the status of a good linked language graph by enriching the RDF data with additional information, such as the history of a term or additional grammatical information etc.

Connected Data London

In July this year I had the opportunity to represent the company I work for, Semantic Web Company, at Connected Data London. I had a 15-minute slot to present some client success stories with connected data. At the conference I also actively represented PoolParty, our software suite, at the official stand offered to partners and sponsors.

I found London to be somehow “smaller” than the expectations floating around it. However, the people I interacted with (not at the conference) immediately gave me the impression of a very international city, more so than Vienna.

Anyway, next, you can see my recorded talk:

In the opening of the event David Meza presented what it is like to use RDF and graph technologies at NASA. I was really happy to attend his session and to meet him in person. You can watch his video here as well:

Overcoming Impostor Syndrome

Quite some time has passed since the last Women Techmakers Vienna event. I still get positive feedback and even more: people ask me to upload the content of the Impostor Syndrome workshop online. It is always great to see how helpful such information is. The workshop helped me a great deal!

So, here it goes, find the resources of the workshop right here!


The materials for this workshop were inspired by the Ada Initiative. Find the original information on their website. This content was slightly changed to fit the Women Techmakers Vienna conference (the main content remained the same).

There are several resources available for the workshop. Find all the materials here.
Available materials are:

  • A handout of the workshop explains in general what impostor syndrome is and how to overcome it. It also contains other references.
  • The workshop was created in such a way that it can be easily taken over and presented in other workshops/conferences as well. An example facilitator guide is also available.
  • There were 3 exercises conducted in the workshop. Their description can also be found in the above link.

Thank you to the ones who attended and thank you to the ones who are interested in this topic.

Ready to connect to the Semantic Web – now what?

As an open data fan, or as someone who is just looking to learn how to publish data on the Web and distribute it through the Semantic Web, you will be facing the question “How do I describe the dataset that I want to publish?” The same question is also asked by people who apply for a publicly funded project at the European Commission and want to have a Data Management Plan. Next we are going to discuss possibilities which help describe the dataset to be published.

The goal of publishing the data should be to make it available for access or download and to make it interoperable. One of the big benefits is to make the data available for software applications, which in turn means the datasets have to be machine-readable. From the perspective of a software developer, some information beyond just name, author, owner, date… would be helpful:

  • the condition for re-use (rights, licenses)
  • the specific coverage of the dataset (type of data, thematic coverage, geographic coverage)
  • technical specifications to retrieve and parse an instance (a distribution) of the dataset (format, protocol)
  • the features/dimensions covered by the dataset (temperature, time, salinity, gene, coordinates)
  • the semantics of the features/dimensions (unit of measure, time granularity, syntax, reference taxonomies)

To describe a dataset, it is always best to look first at existing standards and existing vocabularies. The answer is not found by looking at only one vocabulary but at several.

Data Catalog Vocabulary (DCAT)

DCAT is an RDF Schema vocabulary for representing data catalogs. It can be used to describe any dataset, whether standalone or part of a catalog.

Vocabulary of Interlinked Datasets (VoID)

VoID is an RDF vocabulary, and a set of instructions, that enable the discovery and usage of linked data sets. It is used for expressing metadata about RDF datasets.

Data Cube vocabulary

Data Cube vocabulary is focused purely on the publication of multi-dimensional data on the web. It is an RDF vocabulary for describing statistical datasets.

Asset Description Metadata Schema (ADMS)

ADMS, published by the W3C in 2013, is a profile of DCAT used to describe semantic assets.

In existing vocabularies you will find only partial answers to how to describe your dataset, while some aspects are missing or complicated to express.

  1. Type of data – there is no specific property for the type of data covered in a dataset. This value should be machine-readable, which means it should be standardized, possibly as a URI which can be dereferenced to a thing. And this ‘thing’ should be part of an authority list/taxonomy which does not exist yet. However, one can use adms:representationTechnique, which gives more information about the format in which a dataset is released. This points only to dcterms:format and dcat:mediaType.
  2. Technical properties like format, protocol etc.
    There is no property for protocol and again these values should be machine-readable, standardized possibly as a URI.
    VoID can help with the protocol metadata but only for RDF datasets: dataDump, sparqlEndpoint.
  3. Dimensions of a dataset.
    • SDMX defines a dimension as “A statistical concept used, in combination with other statistical concepts, to identify a statistical series or single observations.” Dimensions in a dataset can therefore be called features, predictors, or variables (depending on the domain). One can use dc:conformsTo and use a dc:Standard if the dataset dimensions can be defined by a formalized standard. Otherwise statistical vocabularies can help with this aspect, which can become quite complex. One can use the Data Cube vocabulary, specifically qb:DimensionProperty, qb:AttributeProperty, qb:MeasureProperty, qb:CodedProperty, in combination with skos:Concept and sdmx:ConceptRole.
  4. Data provenance – there is the dc:source that can be used at dataset level but there is no solution if we want to specify the source at data record level.

In the end one needs to combine different vocabularies to best describe a dataset.
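As an illustration of combining vocabularies, the following sketch writes a small Turtle description that mixes DCAT, Dublin Core terms and VoID. The function name `describe_dataset`, its parameters and all example.org URLs are assumptions for illustration; only the vocabulary terms themselves (dcat:Dataset, dcat:distribution, dcat:downloadURL, dcat:mediaType, dcterms:title, dcterms:license, void:Dataset, void:sparqlEndpoint) are taken from the published vocabularies.

```python
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcterms": "http://purl.org/dc/terms/",
    "void": "http://rdfs.org/ns/void#",
}


def describe_dataset(uri, title, license_uri, download_url, media_type_uri,
                     sparql_endpoint=None):
    """Return a Turtle description of one dataset with one distribution."""
    lines = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    lines.append("")
    # VoID terms only make sense for RDF datasets, so add them only on request.
    types = "dcat:Dataset, void:Dataset" if sparql_endpoint else "dcat:Dataset"
    lines.append(f"<{uri}> a {types} ;")
    lines.append(f'    dcterms:title "{title}" ;')
    lines.append(f"    dcterms:license <{license_uri}> ;")
    if sparql_endpoint:
        lines.append(f"    void:sparqlEndpoint <{sparql_endpoint}> ;")
    lines.append(f"    dcat:distribution <{uri}/distribution> .")
    lines.append("")
    lines.append(f"<{uri}/distribution> a dcat:Distribution ;")
    lines.append(f"    dcat:downloadURL <{download_url}> ;")
    lines.append(f"    dcat:mediaType <{media_type_uri}> .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_dataset(
        uri="http://example.org/dataset/salinity",  # hypothetical dataset
        title="Sea salinity measurements",
        license_uri="https://creativecommons.org/licenses/by/4.0/",
        download_url="http://example.org/files/salinity.ttl",
        media_type_uri="https://www.iana.org/assignments/media-types/text/turtle",
        sparql_endpoint="http://example.org/sparql",
    ))
```

The sketch covers re-use conditions and technical access only; the type of data, the dimensions and record-level provenance remain the open points listed above.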

Add a dataset

The tools out there used for helping in publishing data seem to be missing one or more of the above mentioned parts.

  • CKAN maintained by the Open Knowledge Foundation uses most of DCAT and doesn’t describe dimensions.
  • Dataverse created by Harvard University uses a custom vocabulary and doesn’t describe dimensions.
  • CIARD RING uses full DCAT AP with some extended properties (protocol, data type) and local taxonomies with URIs mapped when possible to authorities.
  • OpenAIRE, DataCite (using re3data to search repositories) and Dryad use their own vocabularies.

In general, the solution to these existing issues seems to be introducing custom vocabularies.


Developing for the Semantic Web

This year’s DevFest was again a blast!

I had the opportunity to give a presentation about what I have been doing lately: a Web Application to show off the power of SPARQL. I turned my experience into an introduction to “Developing for the Semantic Web”.

Take a look:



My video from DevFest:



DevFest Vienna Website.

Spring Boot and Polymer

Last week was the Google I/O Developer conference and Polymer 1.0 was presented. So finally my curiosity was sparked and I made some time to check it out a little bit. I was looking for a fast way to create a Java Web Application where I can use Polymer, and I had heard about how easy and fast Spring Boot is.

So voilà, my first Java Web App with Spring Boot and Polymer 1.0. You can clone it from Git and use it as an archetype, or just for learning purposes; the Polymer files are already included in the project. I used Maven to build the project, which is also easy. But one can also use Gradle.

https://github.com/theRealImy/SpringBootPolymer

Using Spring Boot was super easy! One can simply follow the Getting Started guide.

Polymer is home here

The only issue I encountered was that the index.html was not displayed. After a bit of reading, you find in the Spring Boot documentation:

Do not use the src/main/webapp directory if your application will be packaged as a jar…

By default Spring Boot will serve static content from a directory called /static (or /public or /resources or /META-INF/resources) in the classpath or from the root of the ServletContext.

Fast enough, I changed the folder name and it worked.

 

Viel Spaß!

Women Techmakers Vienna 2015

The second edition of Women Techmakers Vienna took place on the 7th of March 2015. For this one-day event my team and I worked as volunteers for four months. That is a rather short organizing time, and still the event was a complete success! This was due to our driving motivation and commitment to the topic. Read more about the organizing team on our website.
The success of the event was due to a combination of different factors. First of all was probably the motivation of the team, then the wonderful venue at the Microsoft Vienna Office and the interest in the topic of the event which came from the community and the participants.
The agenda of the day was split into 3 tracks: talks, adult workshops and kids workshops.

WTM15 schedule
We tried to invite speakers from each STEM field. The only one we didn’t manage to find was someone with a mathematical background. Instead of this talk we decided to go for a talk on a social aspect, which was the first one, opening the conference. We also tried to keep the talks rather technical but combined with personal experiences. The final Panel Discussion was intended for more sharing and a Q&A session with our speakers. The Panel Discussion exceeded my personal expectations because we managed to create a comfortable place where our participants also started to share, ask questions and add advice. It left me personally with an inner satisfaction about the community in Vienna which is interested in gender issues. A lot of positive energy was transferred and exchanged on the day of the conference.

Some statements from our participants:

  • “my kids loved the children workshop!!!”
  • “Was a great event, much more than I expected, will definitely like to be part of the next event.”
  • “it was a very nice event to network”
  • “great job! I had a great time, met wonderful people, and there was a lot of food for thought. I am looking forward to WTM 2016!”

For this event we decided to try something new: offering workshops for children. I met Horst Jens a while ago; he offers programming courses for children and teaches kids how to program through games. Read more on his website spielend-programmieren.at. We had around 17 kids at the morning and afternoon workshops and a lot of positive feedback.

WTM15 participants ratio

At the main event we had 78 people attending. We were trying hard to attract a 50%-50% gender balance. Talking to the participants we understood that the name of the event had a lot to do with the higher number of females attending. There was a bias from the beginning such that people thought the event was female only. The goals we set out to reach, from the gender perspective were:

  • 100% female speakers
  • 50%-50% gender balanced participants
  • at least some girls in the children workshops

You can read more about our Vision, Mission and Values we created for our event on the dedicated Women Techmakers Page on my website.

WTMVIE in the Press:

Sponsors:

sponsors #WTM15

Introduction to Semantic Web

Hitchhiker’s guide to the Semantic Web

What? There is more to the web than what we know? But why? What is the Semantic Web? Why do we need it? What does it look like? How do we use it? Where is this applicable? What has linked data got to do with it? Is this the future of the web?

I was invited to the Women Techmakers Istanbul event in March 2015, where I got to give an introduction to the Semantic Web.


“The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”  Tim Berners-Lee [1]

————

[1] The Semantic Web by Tim Berners-Lee, James Hendler, and Ora Lassila, Scientific American, 2001

Data Statistics View

Today is my last day at my Project Assistant job at Vienna University of Technology. I did some summing up of my work and polished the TwitterAPI and also the Data Statistics View code. I want to share my implementation of the Data Statistics View code. This was done with HTML, PHP, JavaScript and SQL. The project can be used for any type of data stored in an SQL database.

One of my tasks at university was to download data from the Twitter public stream and analyse it. This work was easier with a tool that allows visualizing the number of downloads per hour/day/month.
The API I used to download tweets is the one based on Adam Green’s implementation called 140dev. He also has a visualizing tool for the downloaded tweets. However, that tool has less to do with numbers and much more with the tweet texts.

The code for my implementation can be found on my GitHub repository.
It contains simple bar charts of the number of tweets downloaded.
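The counting behind such bar charts can be sketched with a single GROUP BY query. The example below is not the code from the repository (which is written in PHP): it uses Python with an in-memory SQLite database as a stand-in, and the table name `tweets` and column `created_at` are assumptions for illustration.

```python
import sqlite3

# strftime patterns that collapse a timestamp into an hour, day or month bucket.
BUCKET_FORMATS = {"hour": "%Y-%m-%d %H:00", "day": "%Y-%m-%d", "month": "%Y-%m"}


def downloads_per_period(conn, period="hour"):
    """Return (bucket, count) rows for the given period, oldest first."""
    fmt = BUCKET_FORMATS[period]
    rows = conn.execute(
        "SELECT strftime(?, created_at) AS bucket, COUNT(*) "
        "FROM tweets GROUP BY bucket ORDER BY bucket",
        (fmt,),
    )
    return rows.fetchall()


def demo_connection():
    """Create a tiny in-memory database with a hypothetical tweets table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, created_at TEXT)")
    conn.executemany(
        "INSERT INTO tweets (created_at) VALUES (?)",
        [
            ("2015-01-01 10:15:00",),
            ("2015-01-01 10:45:00",),
            ("2015-01-01 11:05:00",),
            ("2015-01-02 09:00:00",),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    for bucket, count in downloads_per_period(conn, "day"):
        print(bucket, "#" * count)  # a very simple text bar chart
```

Other database systems offer their own date formatting functions, so the query would need adjusting when used outside SQLite.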

Bar chart example
 

Working with the Twitter public stream I ran into a lot of questions, some of which I found answers to and some of which I did not:

  • How can one download tweets only for a specific country?
  • When is the rate limit reached?
  • If the rate limit is reached, how long do I have to wait until I can download again?
  • Why do some Twitter user accounts work and some do not?

And so on…

These questions were only one part of my time at university; the rest I will probably tell in another post.

Creative Commons License
This work by Timea Turdean is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.