In the officially released test data (.auto_conll), the last column containing coreference information is set to "-". How can I get the real entities? How did you solve it? #6

Open
smallsmallwood opened this issue Feb 24, 2019 · 3 comments

Comments

@smallsmallwood

No description provided.

@smallsmallwood smallsmallwood changed the title from "In test data,the last column containing coreference information is set to "-".How can I" to "In the officially released test data (.auto_conll), the last column containing coreference information is set to "-". How can I get the real entities? How did you solve it?" on Feb 24, 2019
@ylmeng
Collaborator

ylmeng commented Feb 25, 2019

The official data also includes files with labeled entities (.auto_conll). They are under 'v4/', not 'v9/'.
The official challenge used test files without entity labels, although results with entity labels were also published. Our project used files with given entity labels.

@smallsmallwood
Author

smallsmallwood commented Feb 26, 2019

You mean you use the test files (.auto_conll) in v4? But I can only find test files with the .gold_conll suffix in v4. Is that actually the conll_2012_test_key file from the page http://conll.cemantix.org/2012/data.html? Sorry, I just want to test in the same environment, and I'm a bit confused.

@ylmeng
Collaborator

ylmeng commented Feb 26, 2019

I do not quite remember how the files were generated, but you may want to read the instructions on the webpage:

First, you have to generate *_conll files from each corresponding *_skel file. The *_skel file is very similar to the *_conll file: it contains information on all the layers of annotation except the underlying words. Owing to copyright restrictions on the underlying text, we have to do this workaround. The skeleton2conll.sh shell script is a wrapper for the skeleton2conll.py script that takes a *_skel file as input and generates the corresponding *_conll file. The script to get the words back from the trees is non-trivial for some genres, as we have eliminated disfluencies marked by the phrase type EDITED in the Treebank. The usage for this script is described with an example below:

Usage:

skeleton2conll.sh -D [path/to/conll-2012-train-v0/data/files/data] [path/to/conll-2012]

Description:

[path/to/conll-2012-train-v0/data/files/data] : Location of the "data" directory under the conll training
package downloaded from LDC.
[path/to/conll-2012] : The top-level directory of the package downloaded from this webpage inside which the *_skel
files exist that need to be converted to *_conll files.

Example:

The following will create *_conll files for all the *_skel files in the conll-2011/train directory

skeleton2conll.sh -D /nfs/.../conll-2012-train-v0/data/files/data /nfs/.../conll-2011/

In the end, my v4 folder contains test files with the auto_conll suffix. For example, I have "v4/data/test/data/english/annotations/bc/cctv/00/cctv_0005.auto_conll".
Do you have the same file somewhere?
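
One way to tell whether a given *_conll file carries entity labels is to count the token rows whose last column is something other than "-". The following is only a minimal sketch, assuming the usual CoNLL-2012 layout (whitespace-separated columns, blank lines between sentences, "#begin document" and "#end document" marker lines, coreference in the last column). `coref_label_stats` is a hypothetical helper name and is not part of this project or of the official scripts.

```python
import sys


def coref_label_stats(lines):
    """Return (token_rows, rows_with_coref) for an iterable of *_conll lines."""
    token_rows = 0
    labeled_rows = 0
    for raw in lines:
        line = raw.strip()
        # Skip sentence separators and "#begin document" / "#end document" markers.
        if not line or line.startswith("#"):
            continue
        token_rows += 1
        # Assumption: the coreference annotation is the last whitespace-separated
        # column, and "-" means no mention starts or ends on this token.
        if line.split()[-1] != "-":
            labeled_rows += 1
    return token_rows, labeled_rows


if __name__ == "__main__":
    # Pass one or more *_conll paths on the command line.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            total, labeled = coref_label_stats(handle)
        print(f"{path}: {labeled} of {total} token rows carry a coreference label")
```

If every file under a test directory reports 0 labeled rows, that copy was probably released without entity labels, which matches what the issue title describes.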
