
Solomon work #18

Open
4 of 5 tasks
lucasgautheron opened this issue Oct 8, 2020 · 7 comments

@lucasgautheron
Contributor

  • convert all Solomon Islands recordings to the same format
  • analyze Solomon Islands recordings using the voice type classifier
  • VTC stats
  • using the annotations in laac-solis, estimate inter-rater reliability and voice type classifier performance
  • get the rttms and wavs etc. well organized
@lucasgautheron lucasgautheron self-assigned this Oct 8, 2020
@lucasgautheron
Contributor Author

VTC is running for Solomon on oberon; it should take about a week.

@lucasgautheron
Contributor Author

Release held jobs once tsimane2018 started:

scontrol release 120615,120614,120612,120613,120611,120610,120609,120608,120607,120606,120605,120604,120603,120602,120601,120600,120599,120598,120597,120596,120595
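
If the list of held job IDs grows, a small helper along these lines could build the scontrol call instead of pasting IDs by hand (a sketch only; it assumes scontrol is on the PATH and that the job IDs are tracked elsewhere):

import subprocess

def release_jobs(job_ids):
    """Release a batch of held Slurm jobs with a single scontrol call."""
    # scontrol release accepts a comma-separated list of job IDs
    subprocess.run(['scontrol', 'release', ','.join(str(j) for j in job_ids)], check=True)

# hypothetical usage:
# release_jobs(range(120595, 120616))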

@lucasgautheron
Contributor Author

lucasgautheron commented Oct 28, 2020

Current result:

annotations.zip

Generated with the following script:

from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager
import os
import pandas as pd
import glob
import pympi

project = ChildProject('.')
am = AnnotationManager(project)

# list all raw EAF annotation files of the solis corpus
files = pd.DataFrame([
    {'raw_filename': os.path.join('solis', os.path.basename(f))} for f in glob.glob('raw_annotations/solis/*.eaf')
])

# parse the naming convention: recording id, optional part number, author initials
extract = files['raw_filename'].str.extract(r"([0-9]+_(?:CW[0-9]+|NA)_CH[0-9]+_(?:AJ|FB|LM)[0-9]+_(?:AJ|FB|LM)[0-9]+_[0-9]{6})_([0-9])?(?:.*)(?:\-|_)([A-Z]{2})")
files['recording_filename'] = extract[0] + '.WAV'
files['part'] = extract[1]
files['author'] = extract[2]
files = files.merge(project.recordings, how = 'inner', left_on = 'recording_filename', right_on = 'filename')

# parts are indexed from 1; each part spans 15 hours, hence the offset in seconds
files['part'] = files['part'].fillna(1)
files['time_seek'] = (files['part'].astype(int)-1)*15*3600
files['set'] = 'solis_eaf_' + files['author']

# collect the annotated portions (the 'code_periodic' tier) of each EAF file
ranges = []
for f in files.to_dict(orient = 'records'):
    eaf = pympi.Elan.Eaf(os.path.join('raw_annotations', f['raw_filename']))
    portions = eaf.tiers['code_periodic'][0]

    for pid in portions:
        (start_ts, end_ts, value, svg_ref) = portions[pid]
        (start_t, end_t) = (eaf.timeslots[start_ts], eaf.timeslots[end_ts])

        # timeslots are in milliseconds; ranges are expected in seconds
        ranges.append({'raw_filename': f['raw_filename'], 'range_onset': int(start_t/1000), 'range_offset': int(end_t/1000)})

ranges = pd.DataFrame(ranges)

input = files[['recording_filename', 'raw_filename', 'time_seek', 'set']].copy()
input['format'] = 'eaf'

ranges = ranges.merge(input, how = 'left', left_on = 'raw_filename', right_on = 'raw_filename')
am.import_annotations(ranges)

Based on LAAC-LSCP/ChildProject@54cee9b
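
For reference, this is what the filename pattern used above captures; the file name below is made up purely to illustrate the convention:

import re

pattern = r"([0-9]+_(?:CW[0-9]+|NA)_CH[0-9]+_(?:AJ|FB|LM)[0-9]+_(?:AJ|FB|LM)[0-9]+_[0-9]{6})_([0-9])?(?:.*)(?:\-|_)([A-Z]{2})"

# hypothetical annotation file name
m = re.search(pattern, "solis/1234_CW01_CH01_AJ01_FB01_201005_2-AJ.eaf")
print(m.groups())
# ('1234_CW01_CH01_AJ01_FB01_201005', '2', 'AJ')
# -> recording 1234_CW01_CH01_AJ01_FB01_201005.WAV, part 2 (time_seek = 15 h), author AJ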

@lucasgautheron
Contributor Author

Latest result:
annotations.zip

import_eaf.py

from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager
import os
import pandas as pd
import glob
import pympi

project = ChildProject('.')
am = AnnotationManager(project)

files = pd.DataFrame([
    {'raw_filename': os.path.join('solis', os.path.basename(f))} for f in glob.glob('raw_annotations/solis/*.eaf')
])
extract = files['raw_filename'].str.extract(r"([0-9]+_(?:CW[0-9]+|NA)_CH[0-9]+_(?:AJ|FB|LM)[0-9]+_(?:AJ|FB|LM)[0-9]+_[0-9]{6})_([0-9])?(?:.*)(?:\-|_)([A-Z]{2})")
files['recording_filename'] = extract[0] + '.WAV'
files['part'] = extract[1]
files['author'] = extract[2]
files = files.merge(project.recordings, how = 'inner', left_on = 'recording_filename', right_on = 'filename')

files['part'] = files['part'].fillna(1)
files['time_seek'] = (files['part'].astype(int)-1)*15*3600
files['set'] = 'solis_eaf_' + files['author']

ranges = []
for f in files.to_dict(orient = 'records'):
    eaf = pympi.Elan.Eaf(os.path.join('raw_annotations', f['raw_filename']))
    portions = eaf.tiers['code_periodic'][0]

    for pid in portions:
        (start_ts, end_ts, value, svg_ref) = portions[pid]
        (start_t, end_t) = (eaf.timeslots[start_ts], eaf.timeslots[end_ts])
        
        ranges.append({'raw_filename': f['raw_filename'], 'range_onset': int(start_t/1000), 'range_offset': int(end_t/1000)})

ranges = pd.DataFrame(ranges)

input = files[['recording_filename', 'raw_filename', 'time_seek', 'set']]
input['format'] = 'eaf'

ranges = ranges.merge(input, how = 'left', left_on = 'raw_filename', right_on = 'raw_filename')
am.import_annotations(ranges)

@lucasgautheron
Contributor Author

lucasgautheron commented Oct 31, 2020

_analyses_only.pdf

So the first step for Solomon would be "Get all stats at the level of the recording".

The derived metrics that this analysis outputs are explained in the section "Derived metrics":

How vocalization events are "integrated":
• vocalization counts (i.e. how many vocalization events)
• vocalization duration (i.e. total time vocalized)

Sources of vocalizations:
• key child (CHI)
• female adults (FEM)
• male adults (MAL)
• all adults (summing FEM and MAL, ADU)
• other children (OCH)

Context of vocalizations:
• unmarked (i.e. all vocalizations regardless of the context)
• for non-CHI vocalizations, child-directed speech or "cds" (only vocalizations that occur in temporal proximity of the key child's vocalizations)
• for the child's vocalizations, those that occur in temporal proximity of someone else's vocalizations, i.e. in "conversations"

Finally, some authors have popularized the notion of "turns", counting the number of times an adult and a child vocalize in close temporal proximity. These measures tend to have terrible machine-human reliability, but for comparability with previous work (and since our automatic algorithm is much better than that used in previous work), we have also extracted turns.

This is the full current list of derived metrics:
• counts (unmarked): vc_chi, vc_fem, vc_mal, vc_adu, vc_och
• duration (unmarked): voc_dur_chi, voc_dur_fem, voc_dur_mal, voc_dur_adu, voc_dur_och
• cds: voc_dur_och_cds, voc_dur_fem_cds, voc_dur_mal_cds, voc_dur_adu_cds
• conv(ersations): voc_dur_chi_conv
• turns: turns_chi, turns_och, turns_fem, turns_mal, turns_adu
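
To make the "turns" and "cds" definitions above concrete, here is a minimal sketch of the marking logic (pandas, with a made-up segment table; the speaker labels and the 1-second threshold are assumptions for illustration, not the production code):

import pandas as pd

# hypothetical segments table (times in seconds); labels follow the rttm convention
# used elsewhere in this thread (KCHI = key child, FEM/MAL = adults)
segs = pd.DataFrame({
    'tier':  ['FEM', 'KCHI', 'MAL', 'KCHI'],
    'start': [0.0,    1.2,    5.0,   9.0],
    'end':   [1.0,    2.0,    6.5,   9.5],
}).sort_values('start').reset_index(drop=True)

thresh = 1.0                                # turn-taking threshold (seconds)
adults, child = ['FEM', 'MAL'], 'KCHI'

iti = segs['start'] - segs['end'].shift()   # gap since the previous segment
prev = segs['tier'].shift()
# a "turn": the child follows an adult (or an adult follows the child) within the threshold
segs['turn'] = ((((segs['tier'] == child) & prev.isin(adults)) |
                 (segs['tier'].isin(adults) & (prev == child))) & (iti < thresh))
print(segs)                                 # only the KCHI segment starting at 1.2 s is a turn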

Step two: basic checks

Once we have all those metrics for each recording, check the correlation and error rate across the two recordings associated with each child.

We expect vc_chi from recording 1 and vc_chi from recording 2 to be identical.
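
A minimal sketch of that comparison, with made-up numbers (the real computation on the VTC output appears in a later comment; column names here are illustrative):

import pandas as pd

# hypothetical per-recording metrics: two recordings per child
metrics = pd.DataFrame({
    'child_id':  [1, 1, 2, 2, 3, 3],
    'recording': [1, 2, 1, 2, 1, 2],
    'vc_chi':    [310, 295, 120, 180, 640, 655],
})

wide = metrics.pivot(index='child_id', columns='recording', values='vc_chi')
print(wide.corr().loc[1, 2])                          # correlation between recordings 1 and 2
print((wide[1] - wide[2]).abs() / wide.mean(axis=1))  # per-child relative error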

Step three: descriptive statistics

Boxplots + individual points of all the derived metrics, to check for outliers or other problems.

@lucasgautheron
Contributor Author

lucasgautheron commented Nov 3, 2020

For the record:

# excerpt: project, am, raw_rttm and turntakingthresh are defined by the surrounding test code
proc = subprocess.Popen(
    [
        'Rscript', 'tests/truth/extract_quantities.R',
        os.path.join(project.path, 'raw_annotations', raw_rttm),
        str(turntakingthresh),
        'tests/truth/vc_truth_{:.1f}.csv'.format(turntakingthresh)
    ],
    stdout = subprocess.PIPE,
    stderr = subprocess.PIPE
)
stdout, stderr = proc.communicate()

# reshape the R output into one row per speaker type and one column per metric
truth_vc = pd.read_csv('tests/truth/vc_truth_{:.1f}.csv'.format(turntakingthresh), names = ['col','value'])
truth_vc.dropna(inplace = True)
truth_vc['metric'] = truth_vc['col'].str.split('.', expand = True)[0]
truth_vc['speaker_type'] = truth_vc['col'].str.split('.', expand = True)[1]
truth_vc['speaker_type'] = truth_vc['speaker_type'].map(am.VTC_SPEAKER_TYPE_TRANSLATION)
truth_vc.drop(columns = ['col'], inplace = True)
truth_vc = truth_vc.pivot(index = 'speaker_type', columns = 'metric').droplevel(0, axis = 1)
truth_vc.columns.name = None
# overwrite the csv with the pivoted table
truth_vc.to_csv('tests/truth/vc_truth_{:.1f}.csv'.format(turntakingthresh))

tests/truth/extract_quantities.R:
#!/usr/bin/env Rscript

#' read rttm files
#'
#' read and prettify rttm files
#'
#' @param x character, path to rttm file
#' @param from,to numeric, specify subset along the time axis (by default both
#' are \code{NULL}, i.e. the entire file is considered)
#' @details The function adds an end time column and assigns meaningful column
#' names for the most relevant columns. The actual content of the eighth column
#' (\code{tier}) depends on the rttm file. For example, 'yunitator' rttm files
#' have levels \code{CHI}, \code{FEM} and \code{MAL}, while SAD outputs contain
#' \code{speech} (or in the case of the full noisemes module other labels like
#' \code{background}, \code{noise} etc).
#'
#' If the rttm file is empty (i.e. contains no annotations), the result is a
#' data frame with no rows.
#'
#' If either \code{from} or \code{to} are specified, segments that overlap are
#' cut accordingly. For example, if \code{from = 8} and there is a segment that
#' starts at 7.5 and ends 8.7, this segment will be included as starting at 8
#' (i.e. the \code{from} value) and ending at 8.7
#'
#' @return a data.frame
#' @export
#'
#' @examples
#' rttm <- system.file("noisemesSad_synthetic_speech_overlap.rttm",
#'                     package = "avutils")
#' read_rttm(rttm)
#' rttm <- system.file("noisemesFull_synthetic_speech_overlap.rttm",
#'                     package = "avutils")
#' read_rttm(rttm)
#' rttm <- system.file("yunitator_english_synthetic_speech_overlap.rttm",
#'                     package = "avutils")
#' read_rttm(rttm)
#' rttm <- system.file("tocomboSad_synthetic_speech_overlap.rttm",
#'                      package = "avutils")
#' read_rttm(rttm)
#' # time subsetting
#' read_rttm(rttm, from = 8, to = 11)
#' # empty result because no annotations before 5 seconds
#' read_rttm(rttm, to = 5)

read_rttm <- function(x,
                      from = NULL,
                      to = NULL) {
  if (length(readLines(x)) > 0) {
    res <- read.table(x, header = FALSE)
    colnames(res)[c(4, 5, 8)] <- c("start", "duration", "tier")
    res$end <- res$start + res$duration
  } else {
    res <- matrix(0, nrow = 0, ncol = 10)
    # res[1, ] <- NA
    colnames(res) <- paste0("V", 1:ncol(res))
    colnames(res)[c(4, 5, 8, 10)] <- c("start", "duration", "tier", "end")
    res <- data.frame(res)
  }

  if (!is.null(from)) {
    res <- res[res$end > from, ]
    if (nrow(res) > 0) {
      if (res$start[1] < from) {
        res$start[1] <- from
        res$duration[1] <- res$end[1] - res$start[1]
      }
    }
  }

  if (!is.null(to)) {
    res <- res[res$start < to, ]
    if (nrow(res) > 0) {
      if (res$end[nrow(res)] > to) {
        res$end[nrow(res)] <- to
        res$duration[nrow(res)] <- res$end[nrow(res)] - res$start[nrow(res)]
      }
    }
  }


  attributes(res)$filename <- basename(x)
  rownames(res) <- NULL
  res
}
  

#' extract child speech counts, duration and conversational turns
#'
#' @param path character, path to file with results from yunitator or Elan
#' @param turntakingthresh numeric, the threshold for the time between onsets of
#'        two subsequent segments to be considered a turn. Default is \code{1}
#'        second.
#' @param tiernames a list with two elements. By default set to \code{NULL},
#'        where tier names are guessed from the data type. If supplied, element
#'        1 (\code{targetchild}) should be a character of length one, and the
#'        second element (\code{adults}) a character vector of any length. See
#'        details.
#' @param ... additional arguments (currently \code{from} and \code{to} for
#'        \code{\link{read_rttm}})
#'
#' @details if the supplied file is an .rttm file, the following tier names are
#' assumed: \code{list(targetchild = "KCHI", adults = c("FEM", "MAL","CHI"))}. If
#' supplying an .eaf file, tier names are assumed as follows:
#' \code{list(targetchild = "CHI", adults = c("FA1", "MA1"))}.
#'
#' @return a named vector with four groups of elements:
#' \itemize{
#' \item \code{cum_dur} the cumulative duration for each speaker
#' \item \code{voc_count} the number of speech segments for each speaker
#' \item \code{turns} the number of turns for each speaker
#' \item \code{cds_dur} the duration of vocalizations occurring in close temporal
#'       proximity of the other party (child-directed / conversational speech),
#'       for each speaker
#' }
#' @export

extract_quantities <- function(path,
                               turntakingthresh = 1,
                               tiernames = NULL,
                               ...) {
  # determine data type
  type <- rev(unlist(strsplit(basename(path), ".", fixed = TRUE)))[1]

  # set tier names depending on file type (defaults per the documentation above)
  if (is.null(tiernames)) {
    if (type == "rttm") {
      targetchild <- "KCHI"
      adults <- c("FEM", "MAL", "CHI")
    }
    if (type == "eaf") {
      targetchild <- "CHI"
      adults <- c("FA1", "MA1")
    }
  } else {
    targetchild <- tiernames$targetchild
    adults <- tiernames$adults
  }

  if (type == "rttm") {
    xdata <- read_rttm(x = path, ...)
  }

  if (type == "eaf") {
    adults <- c("FA1", "MA1")
    xdata <- read_elan(x = path)
    xdata <- xdata[xdata$tier %in% c(targetchild, adults), ]
    xdata <- droplevels(xdata[xdata$content %in% c("s", "s?"), ])
  }

  # remove SPEECH tier if it's there
  xdata <- xdata[xdata$tier != "SPEECH", ]
  
  # handle empty rttm files correctly
  if (nrow(xdata) > 0) {
    # count vocalizations and total duration of vocalizations
    res1 <- tapply(X = xdata$duration, INDEX = xdata$tier, sum)
    res2 <- tapply(X = xdata$duration, INDEX = xdata$tier, length)
    # turn taking
    xdata$tier <- as.character(xdata$tier)
    # calc inter turn interval
    xdata$iti <- xdata[, "start"] - c(NA, xdata[-nrow(xdata), "end"])
    # get previous seg type
    xdata$prev_label <- c(NA, xdata[-nrow(xdata),"tier"])
    # mark turns
    xdata$turn <- ifelse((xdata$tier %in% targetchild &
                            xdata$prev_label %in% adults &
                            xdata$iti < turntakingthresh) |
                           (xdata$tier %in% adults &
                              xdata$prev_label %in% targetchild &
                              xdata$iti < turntakingthresh), 1, 0)
    # count turns
    res3 <- tapply(X = xdata$turn, INDEX = xdata$tier, sum)
    
    # get post ITI
    xdata$post_iti <-  c(xdata[-1, "start"],NA) - xdata[, "end"]
    # get next seg type
    xdata$next_label <- c(xdata[-1,"tier"], NA)
    # mark cds
    xdata$cds <- ifelse((xdata$tier %in% targetchild &
                            xdata$prev_label %in% adults &
                            xdata$iti < turntakingthresh) |
                           (xdata$tier %in% adults &
                              xdata$prev_label %in% targetchild &
                              xdata$iti < turntakingthresh)|
                          (xdata$tier %in% targetchild &
                             xdata$next_label %in% adults &
                             xdata$post_iti < turntakingthresh) |
                          (xdata$tier %in% adults &
                             xdata$next_label %in% targetchild &
                             xdata$post_iti < turntakingthresh)
                          , 1, 0)
    # sum durations for CDS
    res4 <- tapply(X = xdata$duration[xdata$cds==1], INDEX = xdata$tier[xdata$cds==1], sum)
    
    return (c(cum_dur = res1,
                voc_count = res2,
                turns = res3,
                cds_dur = res4))
  } else {
    return(c(cum_dur = NA,
                voc_count = NA,
                turns = NA,
                cds_dur = NA))
  }
}

args <- commandArgs(trailingOnly = TRUE)
# args: <input file> <turn-taking threshold in seconds> <output csv>
L <- extract_quantities(args[1], as.numeric(args[2]))
write.csv(as.data.frame(L), args[3])

@lucasgautheron
Contributor Author

lucasgautheron commented Nov 4, 2020

Evaluate the stability of the procedure on pairs of recordings:

import os
import pandas as pd
import sys
import multiprocessing as mp
import numpy as np
from functools import partial
import sox

from matplotlib import pyplot as plt

from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager

project = ChildProject(sys.argv[1])
project.read()

def get_audio_duration(filename):
    if not os.path.exists(filename):
        return 0

    # return 0 if sox cannot read the file
    duration = 0
    try:
        duration = sox.file_info.duration(filename)
    except Exception:
        pass

    return duration

project.recordings['duration'] = project.recordings['filename'].map(lambda f:
    get_audio_duration(os.path.join(project.path, 'recordings', f))
)

# recordings of the same child on the same day form a pair; keep the duration of the shortest
min_durations = project.recordings.groupby(['child_id', 'date_iso'])['duration'].min().reset_index().rename(columns = {'duration': 'min_duration'})
recordings = project.recordings.merge(min_durations, left_on = ['child_id', 'date_iso'], right_on = ['child_id', 'date_iso'])

am = AnnotationManager(project)
am.annotations = am.annotations.merge(recordings[['filename', 'min_duration']], left_on = 'recording_filename', right_on = 'filename')

vtc = am.annotations[am.annotations['set'] == 'vtc']
vtc = vtc[vtc['error'].isnull()]

def get_stats(am, af):
    # compute vocalization statistics for one annotation, clipped to the
    # duration of the shortest recording of the pair
    try:
        annotation = am.annotations[am.annotations['annotation_filename'] == af]
        segments = am.get_segments(annotation)
        segments = am.clip_segments(segments, 0, segments['min_duration'].min())
        print(af, segments.shape[0])
        df = am.get_vc_stats(segments).reset_index().assign(annotation_filename = af)
    except Exception:
        df = pd.DataFrame()

    return df

if not os.path.exists('all_stats.csv'):
    pool = mp.Pool()
    all_stats = pool.map(
        partial(get_stats, am),
        vtc['annotation_filename'].tolist()
    )

    all_stats = pd.concat(all_stats)
    all_stats = all_stats.merge(vtc[['annotation_filename', 'recording_filename']], how = 'left', left_on = 'annotation_filename', right_on = 'annotation_filename')
    all_stats = all_stats.merge(recordings[['filename', 'child_id', 'date_iso', 'min_duration']], how = 'left', left_on = 'recording_filename', right_on = 'filename')
    all_stats.to_csv('all_stats.csv', index = False)

all_stats = pd.read_csv('all_stats.csv')

undergone_merge = pd.read_csv('metadata/merged.csv')
undergone_merge['recording'] = undergone_merge['filename'].str.extract(r"([0-9]+_(?:CW[0-9]+|NA)_CH[0-9]+_(?:AJ|FB|LM)[0-9]+_(?:AJ|FB|LM)[0-9]+_[0-9]{6})_") + '.WAV'
undergone_merge.dropna(inplace = True)
all_stats = all_stats[~all_stats['filename'].isin(undergone_merge['recording'].unique())]

grouped = all_stats.groupby(['child_id', 'date_iso', 'speaker_type'])
df = grouped.agg({
    'cum_dur': ['std', 'mean', 'count'],
    'voc_count': ['std', 'mean'],
    'turns': ['std', 'mean'],
    'cds_dur': ['std', 'mean'],
    'min_duration': 'min'
})
df = df[df['cum_dur']['count'] >= 2]
df = df[df['min_duration']['min'] >= 3600]

# relative disagreement between the paired recordings (coefficient of variation)
for metric in ['cum_dur', 'voc_count', 'turns', 'cds_dur']:
    df[metric + '_err'] = df[metric]['std']/df[metric]['mean']

metric = 'cds_dur'

df = df.reset_index()

print(df.groupby('speaker_type')[metric + '_err'].describe(percentiles=[0.5, 0.95]))
print(df.groupby('speaker_type')[metric + '_err'].agg([np.mean, np.median]))

# df[['speaker_type', metric + '_err']].groupby('speaker_type').boxplot(column = [metric + '_err'])
df[['speaker_type', metric + '_err']].hist(by = 'speaker_type', bins = 100, histtype = 'step')


scatter = all_stats[['speaker_type', 'child_id', 'date_iso', metric]].groupby(['speaker_type', 'child_id', 'date_iso'])[metric].apply(lambda x: x.tolist())
scatter = pd.DataFrame(scatter.tolist(), index=scatter.index)\
       .rename(columns=lambda x: x + 1)\
       .add_prefix('err')\
       .reset_index()

print(scatter)

fig, ax = plt.subplots(figsize=(8,6))
u, scatter['label_num'] = np.unique(scatter['speaker_type'], return_inverse=True)
sc = ax.scatter(x = 'err1', y = 'err2', c = 'label_num', data = scatter, marker = '+')

x = np.linspace(0, 100000, 2)
y = x
ax.plot(x, y, color = 'black', linestyle = 'dashed')

ax.legend(sc.legend_elements()[0], u, title = 'speaker_type')
ax.set_xlim(0, max(scatter['err1'].max(), scatter['err2'].max()))
ax.set_ylim(0, max(scatter['err1'].max(), scatter['err2'].max()))
ax.set_xlabel(metric + ' (recording 1)')
ax.set_ylabel(metric + ' (recording 2)')

plt.show()
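
For interpretation, the _err columns computed above are coefficients of variation across the (typically two) recordings of the same child and day; with made-up numbers:

import numpy as np

# hypothetical cum_dur values (seconds) for the two recordings of one child/day
pair = np.array([3000.0, 3600.0])
err = pair.std(ddof=1) / pair.mean()   # matches pandas' default (ddof=1) std
print(round(err, 3))                   # ~0.129, i.e. roughly 13% disagreement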

@lucasgautheron lucasgautheron transferred this issue from LAAC-LSCP/ChildProject May 11, 2021