Fig. 1. Schematic diagram of an example recording venue and the devices used.

As shown in Fig. 1, the meeting participants sit around the microphone array and the panoramic camera, both of which are placed on the table in a standard meeting room, and engage in natural conversation on a variety of topics, including medical treatment, education, business, and industrial production. Various indoor noises, such as clicking, keyboard typing, door opening and closing, and fan noise, occur naturally throughout the sessions.

The microphone array is integrated into an iFLYTEK Smart Office Book X3 and arranged in a 197 mm by 134 mm rectangular topology. Each omnidirectional microphone captures audio at a sampling rate of 16 kHz with a bit depth of 32 bits. An Insta360 Panoramic Sports Camera X3 is positioned adjacent to the microphone array. This rectangular camera measures 114 mm by 46 mm. Its output MP4 file contains 360-degree panoramic video at 3840×1920 resolution and 30 fps, along with 2-channel audio recorded at 48 kHz and 16 bits. In addition, each participant wears a headset microphone that records near-field speech at 44.1 kHz and 16 bits. This near-field setup minimizes interference from off-target sources and keeps the signal-to-noise ratio (SNR) above 15 dB, which supports high-quality manual transcription.
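To make the 15 dB figure concrete, the following minimal sketch computes an SNR in decibels from the mean power of two sample sequences. The text does not state how the SNR was measured, so the use of a separate speech-dominant segment and a noise-only segment is an assumption, and the function names `mean_power` and `snr_db` are hypothetical.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Return the mean squared amplitude of a sample sequence."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(x * x for x in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Estimate SNR in dB as 10 * log10(P_speech / P_noise).

    Assumption: `speech` is a speech-dominant segment and `noise` is a
    noise-only segment taken from the same channel.
    """
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        return math.inf
    if p_speech == 0.0:
        return -math.inf
    return 10.0 * math.log10(p_speech / p_noise)


def meets_threshold(speech: Sequence[float], noise: Sequence[float],
                    threshold_db: float = 15.0) -> bool:
    """Check whether the estimated SNR exceeds the given threshold."""
    return snr_db(speech, noise) > threshold_db
```

A 15 dB SNR corresponds to a speech-to-noise power ratio of roughly 31.6, since 10^(15/10) ≈ 31.6.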

All headset microphones are connected to a Zoom F8N Recorder and therefore share a common clock. However, three distinct clocks remain: those of the microphone array, the camera, and the recorder. These clocks are synchronized manually using a distinctive event, such as knocking a cup, performed at the start and end of each recording session. The video frame that captures the moment of contact between the cup wall and the cup cover is manually aligned with the corresponding impact in the audio waveform using the provided timestamps.
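The two knock events give two anchor points per device, which is enough to fit a linear map between any device clock and a chosen reference clock. The sketch below illustrates that idea. The alignment described above is performed manually, so this is not the authors' procedure: the linear drift model, the class `ClockMap`, and the functions `fit_clock_map` and `frame_to_seconds` are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockMap:
    """Linear map from a device clock to a reference clock (seconds)."""
    scale: float   # relative clock rate; 1.0 means no drift
    offset: float  # constant offset in seconds

    def to_reference(self, t_device: float) -> float:
        """Convert a device timestamp to the reference timeline."""
        return self.scale * t_device + self.offset


def fit_clock_map(dev_start: float, dev_end: float,
                  ref_start: float, ref_end: float) -> ClockMap:
    """Fit the map from the start and end knock times on both clocks.

    Assumption: the drift between the two clocks is linear over the
    session, so two anchor points determine the map exactly.
    """
    span = dev_end - dev_start
    if span <= 0:
        raise ValueError("end knock must come after start knock")
    scale = (ref_end - ref_start) / span
    offset = ref_start - scale * dev_start
    return ClockMap(scale=scale, offset=offset)


def frame_to_seconds(frame_index: int, fps: float = 30.0) -> float:
    """Convert a video frame index to seconds (30 fps as in the text)."""
    return frame_index / fps


# Hypothetical example: knocks seen at video frames 150 and 54150,
# heard in the recorder audio at 2.0 s and 1802.0 s.
camera_to_recorder = fit_clock_map(
    frame_to_seconds(150), frame_to_seconds(54150), 2.0, 1802.0
)
```

With only a start knock, the same map reduces to a constant offset. The end knock is what allows any clock drift accumulated over the session to be estimated.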