CLARIN Virtual Language Observatory: Help

Introduction

The Virtual Language Observatory (VLO) faceted browser was developed within CLARIN as a means to explore linguistic resources, services and tools available within CLARIN and related communities. Its aim is to provide an easy-to-use interface, allowing for a uniform search and discovery process for a large number of resources from a wide variety of domains and providers.

Faceted browsing

The VLO search interface presents a number of facets, for each of which one or more values can be selected in order to narrow down the selection of displayed records. For example, to only include records that relate to the French language, open the facet Language and select the value French. Notice that next to each available value, a number is displayed that indicates the number of records within the current selection that contain that value, in other words the number of remaining records should that value be selected.

Only the values that occur in the current selection (that is, the records that match the already selected values and the optional textual query, see below) are shown. The VLO shows up to ten of the most frequently occurring values for each facet when you click on the name of any facet. If there are more than ten available values, there is a link labeled more..., which leads to a pop-up showing all available values (given the current selections) that can be filtered textually and sorted either alphabetically or by number of matching records. It is also possible to search for facet values by typing (part of) a value in the filter box below the facet name and above the facet values ('Type to search for more') in the panel next to the search results.
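
The counting behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the VLO's actual implementation; the record structure, the sample data and the helper name facet_counts are all hypothetical.

```python
from collections import Counter

# Hypothetical records: each maps a facet name to the list of values it carries.
RECORDS = [
    {"language": ["French"], "genre": ["news"]},
    {"language": ["French", "Occitan"], "genre": ["poetry"]},
    {"language": ["German"], "genre": ["news"]},
    {"language": ["French"], "genre": ["news"]},
]


def facet_counts(records, facet, limit=10):
    """Count how many records carry each value of `facet`, most frequent first."""
    counts = Counter()
    for record in records:
        # set() ensures a record is counted at most once per value
        for value in set(record.get(facet, [])):
            counts[value] += 1
    return counts.most_common(limit)


# The current selection: records with the value "news" in the genre facet.
selection = [r for r in RECORDS if "news" in r["genre"]]

print(facet_counts(selection, "language"))
# [('French', 2), ('German', 1)]
```

Note that Occitan is absent from the output: it occurs in the full data set but not in the current selection, so it would not be offered as a facet value.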

Multiple values can be selected within the same facet, thus broadening the selection. For example, to broaden the results to also include Occitan after having selected French, type that word or its first few letters in the filter box of the language facet. As soon as the option Occitan; Provençal appears, you can select it and consequently the search results will be updated. Notice that in this example Occitan; Provençal appears as an option only because there are records within the existing selection that relate to this value. Values that do not occur in the present selection are excluded from the list of available facet values.

It is also possible to narrow down on values within a facet, for example to search for multi-lingual corpora that cover a number of specific languages. To do this, go to the Search options box (below the facets) and set the Multiple value selection behaviour to AND. Do this before making your selection as it will not affect existing (partial) selections. In the Search options box you will also find a number of other advanced controls to tweak your query.
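
The difference between the default (broadening) behaviour and the AND setting can be illustrated with a small Python sketch. The function name matches, the mode labels and the sample corpora are hypothetical and only serve to show the logic.

```python
def matches(record_values, selected_values, mode="OR"):
    """Return True if a record's facet values satisfy the selection.

    mode="OR":  at least one selected value must be present (broadens results).
    mode="AND": every selected value must be present (narrows results).
    """
    present = set(record_values)
    if mode == "AND":
        return all(value in present for value in selected_values)
    return any(value in present for value in selected_values)


# Hypothetical records, reduced to the values of their language facet.
corpora = {
    "monolingual French corpus": ["French"],
    "parallel corpus": ["French", "Occitan"],
    "German corpus": ["German"],
}
selected = ["French", "Occitan"]

for mode in ("OR", "AND"):
    hits = [name for name, langs in corpora.items() if matches(langs, selected, mode)]
    print(mode, hits)
# OR ['monolingual French corpus', 'parallel corpus']
# AND ['parallel corpus']
```

With AND, only the record that covers both selected languages remains, which is the behaviour you want when looking for multi-lingual corpora.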

Facets that do not have any matching records given the current selection will not be displayed in the facets panel in the VLO search interface.

Search syntax

In addition to navigating the resources by means of the selection of facet values, the VLO faceted browser also allows for searching by means of textual queries.

Such queries are to be entered in the large text box at the top of the main page or faceted browsing page with the button labeled 'Search' next to it.

In its simplest form, a search query consists of one or more terms, separated by whitespace. Such queries result in the retrieval of all documents that have one or more occurrences of all of the included search terms. In other words, an AND operator is implied by default.

Advanced querying

It is possible to construct a more specific query by utilising the advanced syntax features that the VLO supports. The supported syntax is that of the Lucene Query Parser 1.

The Lucene Query Parser syntax allows for the following boolean operators: 'AND', 'OR', 'NOT', '+' and '-'. It also allows for grouping of terms by means of parentheses. Terms can be combined into phrases by means of double quote characters. Ranges can be specified using square brackets and the word 'TO'.
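
As a rough illustration of these operators, the Python snippet below composes a handful of query strings that could be typed into the search box. The helper names (phrase, group, range_query) and the search terms are hypothetical, and whether a given query returns results depends on the data currently in the VLO. The range example uses the field:term form of Lucene syntax and is shown for its notation only.

```python
def phrase(text):
    """Wrap text in double quotes so that it is matched as a phrase."""
    return '"' + text.replace('"', '\\"') + '"'


def group(*terms, op="AND"):
    """Join terms with a boolean operator and wrap them in parentheses."""
    return "(" + f" {op} ".join(terms) + ")"


def range_query(field, low, high):
    """Inclusive range on a field, written with square brackets and TO."""
    return f"{field}:[{low} TO {high}]"


examples = [
    "french corpus",                                      # implicit AND
    "french OR occitan",                                  # either term
    "corpus NOT spoken",                                  # exclude a term
    "+corpus -spoken",                                    # required / prohibited
    group("french", "occitan", op="OR") + " AND corpus",  # grouping
    phrase("sign language"),                              # phrase
    range_query("language", "dutch", "french"),           # range
]

for query in examples:
    print(query)
```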

Targeting fields

In addition to the logical operators, the syntax also allows searching for occurrences of a term within a specific field, such as language or modality. To do so, enter the name of the field and the term to search for, separated by a colon. The asterisk ('*') can be used to achieve partial matches. Quotes are required to match a term that contains spaces.

The following field names are available: language, country, continent, modality, genre, subject, format, organisation, resourcetype, keyword, resources.
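
The sketch below shows how such field queries are put together. Lucene's field syntax joins the field name and the term with a colon. The helper field_query is hypothetical, and the example values (French, news*, United Kingdom) are assumptions chosen for illustration; they may or may not match records in the VLO.

```python
# Field names as listed in this help page.
FIELDS = {
    "language", "country", "continent", "modality", "genre", "subject",
    "format", "organisation", "resourcetype", "keyword", "resources",
}


def field_query(field, term):
    """Build a field:term query, quoting the term when it contains whitespace."""
    if field not in FIELDS:
        raise ValueError(f"unknown field: {field}")
    if any(ch.isspace() for ch in term):
        term = '"' + term + '"'
    return f"{field}:{term}"


print(field_query("language", "French"))         # language:French
print(field_query("genre", "news*"))             # genre:news*
print(field_query("country", "United Kingdom"))  # country:"United Kingdom"
```

Field queries can be combined with the boolean operators described earlier, for example language:French AND genre:news*.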

A full overview of syntax features, including options for fuzzy search, ranges, and term boosting, can be found at the Lucene syntax description page.

Understanding search results

After you have submitted a query, selected one or more facet values, or a combination of these, the search results get updated. By default, for each item in the search result, the record's title and a snippet of its description are shown (if available). In addition, a number of icons are shown that indicate the number and type of resource(s) available within the record, and which availability level (public, academic or restricted), licence and/or usage conditions apply. Hover over the icons to discover the meaning of each of the icons. Some of this information may not be present for all records.

You can expand a search result item (by pressing the icon) to see more details. In addition to the full description in case it was truncated in the original view, a number of additional record properties such as collection, language and organisation are shown (if available) as well as a list of up to ten resources linked to by the record. You can click through to the record page or immediately access any of the listed resources. Be sure to take notice of the licence and availability information provided in the record or at the provider's pages (see Accessing Resources for more information).

The search results are shown in an order reflecting relevance with respect to the query. Matches in the title or description lead to the highest rankings. In addition, certain types of records get a higher default ranking: collection records, or records that are at a relatively high level in a hierarchy, get a 'boost'; the same goes for records with "public" or "academic" availability. The availability of a title, description or one or more linked resources within a record also contributes to the ranking.

Your search terms will be highlighted wherever they appear in the search results to help you quickly understand the relevance of individual items with respect to your query.

Accessing resources

Clicking the title of a search result (or record) brings you to a new page that we will refer to as the record page. The page is organised into a number of tabs, showing different types of information relating to the record. If you are interested in obtaining or processing the resources provided by a record in any way, first have a look at the information under the Availability tab. This information will tell you what conditions apply to the usage of the resource(s). Some records may not provide clear and complete information. When in doubt, it is strongly recommended to contact and ask the provider of the resource(s). Use the contact information provided in the All metadata section of the record page, or navigate to the provider's pages via the Show the original provider's page for this record link if present. If you need to know more and no contact information whatsoever is available in relation to a record, please contact CLARIN.

A list of the actual resource(s) can be found by selecting the Resources tab. This list is presented as a table, which shows the file names of all resources along with their type. Additional details for an individual resource can be found by expanding its row. Click the file name of a resource to access it directly. You may be presented with a request to authenticate depending on the accessibility of the file. Be aware that not all files within a record may have the same level of accessibility, and different licences may apply.

Some records have no resources listed under the Resources tab, for example because the record does not represent individually accessible files but rather an online tool, service or website. In these cases generally there should be one or more of the following links below the title of the record (above the tabs):

  • Show the original provider's page for this record
  • Search page for this record
  • Plain text search via Federated Content Search
Using these, you will be able to access or use the described resource.

In some cases, you will find neither resources nor links to related pages. This is often the case for collection records that do not point to resources themselves, but instead are part of a hierarchy. If this is the case, a Hierarchy tab is available, which contains a tree that allows you to browse through this hierarchy and discover underlying records that may contain links to concrete resources or pages.

If you encounter a record that contains no listed resources, no links to pages and is not part of a hierarchy either, you can have a look at the content under the All metadata tab to see if the metadata contains any pointers to resources. If this is not the case either, you may want to report the page (using the report button) or send a report by e-mail to vlo@clarin.eu.

Processing resources with CLARIN tools

Many of the resources that can be discovered through the VLO are fit for processing using one or more of many specialised tools. The Language Resource Inventory provides an overview of tools and services available in or collected by CLARIN. You can obtain (handles to) the resources that you are interested in from the record's page in the VLO or from the original provider (see Accessing resources) and then manually use them as input for the tool or service, provided that you are granted access to both the resources and the tool or service, and that the latter supports the exact type of resource at hand.

CLARIN has streamlined this process for a number of tools and a specific set of resource types, allowing you to easily discover tools that can be applied to a specific resource, and in case of a match immediately proceed to apply one or more tools to the selected resource. The following section describes how to use this feature right from a record's page in the VLO.

The Language Resource Switchboard

The Resources tab of a record page (the page to which the title of a search result links) shows a table of the individual resources collectively described by the record's metadata. Each resource in this table has an 'options' menu. From this menu, select the option Process with Language Resource Switchboard. This will take you to the Language Resource Switchboard (LRS). Here you can either adjust the file type and content language values or go with the values detected by the LRS. Then click Show Tools, and if any tools supporting the chosen file type and language are known to the LRS, they will appear on the page in a categorised list. For each tool a description and some general information is shown (including whether usage of the tool involves authentication or authorisation), along with a button that opens the tool with the selected resource. From there, follow the instructions provided by the tool or service.

Not all tools available within CLARIN have been connected to the LRS, but additional tools get added frequently. If you think a tool is missing that could be added, please contact the tool's maintainers and/or the LRS contact person (see the About page of the LRS).

Detailed information about the LRS can be found in its CLARIN-PLUS deliverable document.

Providing data to the VLO

You may have tools or data relevant to CLARIN's community: digitally accessible language-related resources, or tools/services that either process or produce such resources. We have a standardised procedure to 'harvest' (retrieve and aggregate) metadata describing resources, for which we use the OAI-PMH protocol. If you are already offering metadata in the Dublin Core, OLAC or CMDI format, please send the details to vlo@clarin.eu. CLARIN centres that provide metadata over OAI-PMH are harvested automatically. If you have relevant resources but either do not have metadata yet or are not making metadata available over OAI-PMH, CLARIN may be able to help you as well.
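
To give an impression of what a harvester asks of a provider, the Python sketch below builds OAI-PMH ListRecords request URLs. The endpoint address and the helper name list_records_url are hypothetical. The prefix oai_dc is the Dublin Core format that every OAI-PMH repository must support; the prefix under which a provider exposes CMDI varies, so it is best looked up with the ListMetadataFormats verb. This is not a description of CLARIN's own harvester.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the base URL of your own OAI-PMH provider.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    When a resumption token is given, OAI-PMH requires it to be the only
    argument besides the verb, so the metadata prefix is left out.
    """
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return f"{base_url}?{urlencode(params)}"


print(list_records_url(BASE_URL))
# https://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

A harvester would fetch the first URL, read the resumptionToken from the response (if any) and keep requesting follow-up pages until no token is returned.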

The best place to start learning about the way metadata is integrated into the VLO is the harvesting and VLO section of the CLARIN metadata FAQs. These provide details and links regarding the publication of metadata, CLARIN's processing pipeline and how to get integrated. If you cannot find the information you are looking for there, don't hesitate to contact us at vlo@clarin.eu.

More information

A good starting point for general information about the VLO and related applications is the VLO page on the CLARIN website.

A number of references as well as technical information about the application can be found on the "About" page.

CLARIN has collected a number of answers to common questions about CLARIN's metadata infrastructure, which includes the VLO, in the CLARIN metadata FAQ.

If you would like to get a guided tour of the VLO, have a look at this screencast that demonstrates the VLO's main features.

Notes

1 Support for the Lucene syntax was implemented by using the Solr Extended DisMax Query Parser.